Crawl Anywhere 2.0.0 available

The new version 2.0.0 of Crawl Anywhere is now available. This major release includes many powerful enhancements.

Crawler

Recrawl period parameter

This parameter lets you override, for each web site, the default recrawl period defined in crawler.properties.
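For illustration only (the actual property name may differ from what ships in crawler.properties), the idea is a global default that each web site's own settings can override:

```properties
# crawler.properties — global default (property name is illustrative)
crawler.recrawl.period=7d
```

A web site configured with its own recrawl period is then recrawled on that schedule instead of the global one.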

Scheduling

This parameter defines, for each web site, the days and hours during which the crawl may run. If a crawl is not completed by the end of an allowed time frame, it resumes from the same point at the next allowed time frame.
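The scheduling check can be sketched as follows (a minimal illustration, not the actual Crawl Anywhere code; the window format is assumed):

```python
from datetime import datetime

# Hypothetical per-site crawl windows: (weekday, start_hour, end_hour),
# with weekday 0 = Monday ... 6 = Sunday, matching datetime.weekday().
ALLOWED_WINDOWS = [
    (5, 1, 6),  # Saturday, 01:00-06:00
    (6, 1, 6),  # Sunday, 01:00-06:00
]

def crawl_allowed(now, windows):
    """Return True if crawling this site may proceed at the given time."""
    return any(day == now.weekday() and start <= now.hour < end
               for day, start, end in windows)

# When the window closes mid-crawl, the crawler saves its position and
# resumes from the same point at the next allowed time frame.
print(crawl_allowed(datetime(2012, 1, 7, 3, 0), ALLOWED_WINDOWS))  # Saturday 03:00 -> True
print(crawl_allowed(datetime(2012, 1, 9, 3, 0), ALLOWED_WINDOWS))  # Monday 03:00 -> False
```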

Crawl caching

This feature is available if you have a MongoDB database installed (64-bit version). All crawled pages are stored in MongoDB, so pages can be reindexed (pipeline and indexer) without any access to the web site. This is useful when web site settings are updated (name, language, country, tags, collection).

This feature requires available disk space for the MongoDB database. Each web site cache is a collection in MongoDB.
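The reindexing idea can be sketched like this. In Crawl Anywhere the cache lives in MongoDB (one collection per web site); here a plain dict of lists stands in for it so the sketch runs without a database, and all names are illustrative:

```python
# Stand-in for the MongoDB crawl cache: one "collection" per web site.
cache = {
    "site_example_com": [
        {"url": "http://example.com/", "content": "<html>Home</html>"},
        {"url": "http://example.com/about", "content": "<html>About</html>"},
    ],
}

def reindex(site, pipeline, indexer):
    """Re-run pipeline + indexer on cached pages, without hitting the web site."""
    for page in cache[site]:
        indexer(pipeline(page))

indexed = []
reindex("site_example_com",
        # e.g. updated site settings applied during reindexing
        pipeline=lambda page: {**page, "site": "Example", "language": "en"},
        indexer=indexed.append)
print(len(indexed))  # 2 pages reindexed from the cache, no network access
```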

Crawl pause/resume 

This feature is available if you have a MongoDB database installed (64-bit version). MongoDB is used as persistent storage for the crawler queues (pages still to process, pages already processed).

Crawl crash resume

This feature is available if you have a MongoDB database installed (64-bit version). MongoDB is used as persistent storage for the crawler queues (pages still to process, pages already processed).

If the crawler crashes or is killed, the crawler queues are not lost, so the crawler will resume its crawls at the next startup.
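Both pause/resume and crash resume come from the same mechanism: the queues outlive the crawler process. A minimal sketch, with a store object standing in for MongoDB (names are illustrative, not the Crawl Anywhere API):

```python
# Persistent crawler queues: the store survives the Crawler object, so a new
# crawler instance (as after a pause, crash, or restart) resumes where the
# previous one stopped.
class Crawler:
    def __init__(self, store):
        self.store = store  # persistent state: to_process + processed

    def crawl(self, max_pages):
        """Process up to max_pages URLs, then stop (pause or crash)."""
        for _ in range(max_pages):
            if not self.store["to_process"]:
                break
            url = self.store["to_process"].pop(0)
            self.store["processed"].append(url)

store = {"to_process": ["/a", "/b", "/c"], "processed": []}
Crawler(store).crawl(max_pages=2)   # crawl interrupted after 2 pages
Crawler(store).crawl(max_pages=10)  # a fresh crawler resumes the same queues
print(store["processed"])  # ['/a', '/b', '/c']
```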

Large web site crawl

This feature is available if you have a MongoDB database installed (64-bit version). MongoDB is used as persistent storage for the crawler queues (pages still to process, pages already processed).

Crawling a large web site creates large queues (pages still to process, pages already processed). Without MongoDB, these queues are kept in memory, so depending on the available memory and on how many web sites are crawled at the same time, the crawler can stop working.

Using MongoDB for queue storage does not require much memory.

Pipeline

Tag filtering

A new pipeline stage is available. It filters the tags defined for a web site: only the tags actually mentioned in the page content are sent to the indexer.
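The filtering logic amounts to a membership test over the page text, which can be sketched like this (names are illustrative, not the actual pipeline-stage API):

```python
def filter_tags(site_tags, page_content):
    """Keep only the site's tags that appear in the page content."""
    content = page_content.lower()
    return [tag for tag in site_tags if tag.lower() in content]

site_tags = ["search", "crawler", "mongodb", "solr"]
page = "Crawl Anywhere 2.0.0 ships a new crawler and Solr 3.5.0 configuration."
print(filter_tags(site_tags, page))  # ['crawler', 'solr']
```

Only the matching tags reach the indexer, so a page is never tagged with site-wide labels it does not actually mention.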

Solr Indexer

Solr 3.5.0

Configuration files for Solr 3.5.0 are provided (schema.xml and solrconfig.xml).

Better commit strategy
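The release notes give no details on the new strategy, but a common way to tune commits in Solr 3.5 is the autoCommit block of solrconfig.xml, which batches commits instead of committing after every document:

```xml
<!-- solrconfig.xml: commit automatically in batches -->
<autoCommit>
  <maxDocs>1000</maxDocs>   <!-- commit after 1000 added documents -->
  <maxTime>60000</maxTime>  <!-- or after 60 s, whichever comes first -->
</autoCommit>
```

Whether the indexer relies on autoCommit or issues its own batched commits is not stated here; the values above are examples, not the shipped defaults.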
