Crawl Anywhere 1.1.0 available

The new version 1.1.0 of Crawl Anywhere is now available.

Crawler

Users management

It is now possible for administrators to manage users (create, update or delete). A standard user can only change their own password.

There are three types of users:

  • System administrator -> can manage sources and users
  • Functional administrator -> can manage sources
  • Visitor -> can see source definitions (read only)

Sources

New "Collection" concept

A Collection is a way to group sources together, while a Tag is a way to qualify a source.

Automatic HTML page cleaning

The pipeline now includes a stage that can automatically clean the HTML pages of a source. Cleaning a page means removing all irrelevant information such as headers, footers, menus, …

This option enables or disables automatic cleaning for the source.

Host aliases

Normally, for a web site, the crawler only crawls pages hosted on the same domain as the start URL. It is now possible to specify a list of hosts that the crawler will consider as accepted domains for a web site.
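The acceptance rule can be illustrated with a minimal Python sketch. It is not the crawler's actual code: the function name is_accepted and the host_aliases parameter are hypothetical, and the sketch assumes that "same domain" means an exact match on the host name.

```python
from urllib.parse import urlsplit


def is_accepted(url, start_url, host_aliases=()):
    """Return True if url is on the start URL's host or on a listed alias.

    Assumption: hosts are compared by exact (case-insensitive) name.
    """
    start_host = (urlsplit(start_url).hostname or "").lower()
    accepted = {start_host} | {alias.lower() for alias in host_aliases}
    host = (urlsplit(url).hostname or "").lower()
    return host in accepted
```

With no aliases, a page on cdn.example.net would be rejected for a source starting at www.example.com; adding cdn.example.net to the alias list makes it accepted.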

URL parameters to be ignored

This option allows you to specify parameters that will be ignored in the URL of a web page. This means that two URLs that differ only by these parameters are considered identical.
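As an illustration only, here is one way such a comparison could work in Python: remove the ignored parameters from the query string, then compare the resulting URLs. The function name strip_ignored_params and the sessionid parameter are hypothetical and not taken from Crawl Anywhere.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ignored_params(url, ignored):
    """Return url with every query parameter named in `ignored` removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Two URLs that differ only by the ignored parameter become equal.
a = strip_ignored_params("http://www.example.com/page?id=7&sessionid=abc", {"sessionid"})
b = strip_ignored_params("http://www.example.com/page?id=7&sessionid=xyz", {"sessionid"})
```

Here both a and b are "http://www.example.com/page?id=7", so the page would be treated as a single document.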

Test mode

A source can have three modes: enabled, disabled and test.

The new test mode allows you to test a source. In test mode, the web site is crawled and the crawled URLs are written to the MySQL database, but the crawler doesn't write XML files to the output directory.
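The difference between the modes can be summarised in a small Python sketch. The names SourceMode and crawl_actions are hypothetical. Only the test-mode behaviour is described above; the sketch assumes that an enabled source also writes XML files and that a disabled source is not crawled at all.

```python
from enum import Enum


class SourceMode(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"
    TEST = "test"


def crawl_actions(mode):
    """Return which actions are performed for a source in the given mode."""
    # Assumption: enabled and test sources are both crawled and recorded.
    crawled = mode in (SourceMode.ENABLED, SourceMode.TEST)
    return {
        "crawl_site": crawled,
        "record_urls_in_database": crawled,
        # Test mode skips the XML output; assumed to be written when enabled.
        "write_xml_output": mode is SourceMode.ENABLED,
    }
```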

JavaScript links detection

It is now possible to help the crawler discover JavaScript links in HTML pages. This is achieved by writing a script that describes, for a web domain, how to extract links.

Here is a sample script for the www.cohor.org web site.

See detailed explanations.

Pipeline

Two new stages are available:

  • language detection stage
  • mime type filtering stage

Two stages were enhanced:

  • The text extraction stage can now use the boilerpipe library with HTML pages
  • The SolrQueueWriter includes new features:
      • mapping value -> target
      • attribute normalization

Sample search interface

A sample search interface is available for searching the indexed documents.
