Crawl Anywhere 1.1.2 available

Version 1.1.2 of Crawl Anywhere is now available. This release includes the following fixes and new features:

Crawler

Bug fix

This release fixes an issue that occurred when the crawler was launched in "once" mode: under certain conditions, the crawler never stopped.

Source configuration

The "Crawl links inclusion/exclusion rules" parameter now allows more directives to be specified in a rule. See "Configure a web site to be crawled".

Pipeline

DocTextExtractor stage

This stage can now use scripting to correctly extract the following three items from an HTML page:

  • the title
  • a date
  • the cleaned text

Clean text is text without headers, footers, advertisements, menus, and so on. See the documentation for detailed information.
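To make the idea of extracting a title, a date and cleaned text more concrete, here is a minimal sketch in Python. It is not the DocTextExtractor scripting API, which is described in the Crawl Anywhere documentation. The names `PageExtractor`, `extract` and `SKIPPED_TAGS` are hypothetical, and the rules used here (skip `header`, `footer`, `nav` and similar tags, read the date from a `time` element) are assumptions chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: content inside these tags is page chrome, not article text.
SKIPPED_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class PageExtractor(HTMLParser):
    """Collects the title, a date and the main text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.date = None
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "time" and self.date is None:
            # Assumption: the first <time datetime="..."> holds the page date.
            self.date = dict(attrs).get("datetime")
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return a dict with the title, date and cleaned text of the page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "date": parser.date,
        "text": " ".join(parser.text_parts),
    }
```

Real pages rarely mark up their menus and advertisements this cleanly, which is presumably why the stage lets you script the extraction per site rather than relying on one fixed rule set.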
