Crawl Anywhere 1.1.4 available

Version 1.1.4 of Crawl Anywhere is now available. This release includes new scripting capabilities, bug fixes, and a few small settings updates.

Crawler

Default user-agent changed to "CaBot"

Fixes in URL normalization:

  • Default HTTP port: the URLs http://www.domain.com/ and http://www.domain.com:80/ are now considered equal
  • Default HTTPS port: the URLs https://www.domain.com/ and https://www.domain.com:443 are now considered equal
  • URL parameter order: the URLs http://www.domain.com/page.php?id=1&action=list and http://www.domain.com/page.php?action=list&id=1 are now considered equal
  • J2EE jsessionid: the J2EE jsessionid is now correctly handled
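The rules above can be illustrated with a minimal Python sketch. It is not the crawler's actual implementation: the function name `normalize_url` is hypothetical, and the sketch assumes that "handling" the jsessionid means stripping the `;jsessionid=...` path parameter.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Default port for each scheme; an explicit default port is dropped.
DEFAULT_PORTS = {"http": 80, "https": 443}

# Assumption: the session id appears as a ";jsessionid=..." path parameter.
JSESSIONID_RE = re.compile(r";jsessionid=[^?#/;]*", re.IGNORECASE)


def normalize_url(url: str) -> str:
    """Return a canonical form so that equivalent URLs compare equal.

    Simplifications of this sketch: userinfo (user:password@) is dropped
    and IPv6 host literals are not supported.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port

    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Strip the session id; an empty path is treated as "/".
    path = JSESSIONID_RE.sub("", parts.path) or "/"

    # Sort the query parameters so their order no longer matters.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    return urlunsplit((scheme, netloc, path, query, parts.fragment))
```

With this sketch, `normalize_url("http://www.domain.com:80/")` and `normalize_url("http://www.domain.com/")` return the same string, as do the two `page.php` URLs from the list above.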

Fix in the internal crawler web service dependencies (missing Java libraries)

Fix in robots.txt handling

HTML meta extraction:

  • <meta name="xxx" content="yyyy" /> now produces <item_meta_xxx>yyyy</item_meta_xxx>
  • <meta http-equiv="xxx" content="yyyy" /> now produces <item_meta_equiv_xxx>yyyy</item_meta_equiv_xxx>
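The mapping above can be sketched in a few lines of Python. This only illustrates the naming rule, not the crawler's own code: the names `MetaExtractor` and `meta_to_xml` are hypothetical, and the sketch does not sanitize meta names that would be invalid as XML element names.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class MetaExtractor(HTMLParser):
    """Collect (element_name, content) pairs from <meta> tags."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attr = dict(attrs)
        content = attr.get("content")
        if content is None:
            return
        if attr.get("name"):
            self.items.append((f"item_meta_{attr['name']}", content))
        elif attr.get("http-equiv"):
            self.items.append((f"item_meta_equiv_{attr['http-equiv']}", content))


def meta_to_xml(html: str) -> str:
    """Render each extracted meta tag as one XML element per line."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(
        f"<{name}>{escape(value)}</{name}>" for name, value in parser.items
    )
```

Feeding it a page whose head contains the two tags from the list yields one `item_meta_xxx` element and one `item_meta_equiv_xxx` element.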

In previous versions, scripting could only handle JavaScript links. Standard HTTP links can now be manipulated as well (updated, removed, or added).

Pipeline

New StandartQueueWriter stage without mapping (the in-memory XML file is written without any mapping), useful for debugging or for use with DIH

In the pipeline definition file, the "position" attribute of the stage element is now deprecated.

Stages can be added, removed, or reorganized without updating the "position" attribute.
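One plausible reading of this change is that stages now run in the order in which they appear in the file. The Python sketch below shows that idea only. The XML layout, the `classname` attribute, and the stage names are hypothetical and not taken from the real pipeline definition format.

```python
import xml.etree.ElementTree as ET

# Hypothetical pipeline definition: element and attribute names are
# illustrative only. Leftover "position" attributes are simply ignored.
PIPELINE_XML = """
<pipeline>
  <stage classname="StageA" position="3"/>
  <stage classname="StageB"/>
  <stage classname="StageC" position="1"/>
</pipeline>
"""


def stage_order(xml_text: str) -> list:
    """Return the stage class names in document order."""
    root = ET.fromstring(xml_text)
    return [stage.get("classname") for stage in root.iter("stage")]
```

Under this assumption, moving a stage element up or down in the file is all that is needed to reorder the pipeline.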

In previous versions, scripting could update the HTML page content before conversion to text (removing headers, footers, menus, …). It is now also possible to manipulate the text after conversion from any format (HTML, PDF, …).

Indexer

Tested with Solr 3.1

Administration Web interface

The status page now monitors pipeline and indexer states

The status page now monitors queue sizes

Better source error feedback
