Crawl Anywhere 3.0.1 available

The new version 3.0.1 of Crawl Anywhere is now available. This is a bug fix release.

The issue was in the pipeline: the stage in charge of extracting and harmonizing the content-type was misconfigured.

Even though this new release is mainly a bug fix release, it introduces a new feature: Solr index-time boost configuration (see http://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts)

You can define your index-time boosts in the "config/pipeline/solrboost.xml" file.
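For reference (this is the syntax described on the Solr wiki page linked above), index-time boosts can be attached to a whole document or to individual fields when documents are pushed to Solr; solrboost.xml is where you declare the boosts the pipeline should apply. The field names below are purely illustrative:

  <add>
    <doc boost="2.5">                                               <!-- boosts the whole document -->
      <field name="title" boost="3.0">Example title</field>        <!-- boosts matches on this field only -->
      <field name="content">Example content</field>
    </doc>
  </add>

Refer to the config/pipeline/solrboost.xml file shipped with this release for the exact format it expects.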

Upgrade from v3.0.0 to v3.0.1

You just need to update your bin directory and your config/pipeline directory.

In the config/pipeline directory there are 2 new files:

  • solrboost.xml
  • contenttypemapping.txt

In config/pipeline, the file "simplepipeline-default.xml" has changed. Please carry the changes from this file over to your simplepipeline.xml file.

Crawl Anywhere 3.0.0 available

The new version 3.0.0 of Crawl Anywhere is now available. This new major release includes a lot of powerful enhancements; here are the main ones.

Crawler

  • Source import / export
  • Clear action
  • Crawl log history
  • Sitemaps support
  • Snacktory HTML page cleaning option
  • Custom metadata (global per web site or specific to web site URLs based on regex rules)
  • HTTP / HTTPS protocol strategy

Pipeline

  • Snacktory HTML page cleaning

Solr Indexer

  • Regex field mapping rules

Solr 4.0.0 integration

A preconfigured and patched (for the multilingual analyzer) Solr 4.0.0 instance is now provided.

Multilingual analyzer

A new version of the multilingual analyzer is available. This version allows detailed configuration for each language. The configuration syntax is the same as in the Solr schema.xml file (see the default configuration file: multilingual.xml).
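For illustration only (this snippet just shows what "same syntax as schema.xml" means; the actual per-language defaults and the way each chain is wrapped are in the multilingual.xml file mentioned above), an English analysis chain in this syntax looks like:

  <analyzer>
    <!-- tokenize, lowercase, remove stop words, then stem -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>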

You need to use our patched Solr 4.0.0 version in order for this analyzer to work.

Tag cloud generator

A tag cloud analyzer is available. This analyzer extracts n-term expressions (from document titles by default). You can display a dynamic tag cloud (by language, for a given period) in your search interface.
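On the search side, such a cloud is typically built from a Solr facet query on the field that holds the extracted expressions. A hypothetical example (the field names tag_cloud and item_lang are illustrative; use the ones defined in your schema):

  /select?q=*:*&fq=item_lang:fr&facet=true&facet.field=tag_cloud&facet.limit=50&facet.mincount=2

This returns the 50 most frequent expressions found in French documents, each with its count; add a date range filter query to restrict the cloud to a given period.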

Twitter account created!

 

I am happy to announce our new Twitter account!

Some people worried about not hearing news and not seeing new versions coming. We are working continuously on Crawl-Anywhere, but we publish new versions only when we have time to freeze development, test and document.

With this Twitter account we will provide regular news (new features, bug fixes, feature requests, …) and you will be able to communicate with us. Everything will be shared with the community of users.

Enjoy crawling and indexing the Web with Crawl-Anywhere and Solr!

Roadmap for next release

Here is some news about the next release.

The roadmap for the main development items is:

  • Implement the Snacktory HTML page cleaning algorithm (http://karussell.wordpress.com/2011/07/12/snacktory-yet-another-readability-clone-this-time-in-java/): done
  • Source definitions import / export: done / testing
  • Use InnoDB as the engine in MySQL: done / testing
  • Solr 4.0 support (multilingual analyzer ported from the 2.9.x/3.x version + filter chaining configured in an external definition file): done / testing
  • Tag cloud generation (will require Solr 4.0): done / testing
  • "Index it!" widget: not started
  • Removed pages handling and policies: not started (postponed to following releases?)

We can't commit to a date for the next release, but we expect it by June.

Crawl Anywhere 2.0.0 available

The new version 2.0.0 of Crawl Anywhere is now available. This new major release includes a lot of powerful enhancements.

Crawler

Recrawl period parameter

This parameter allows you to override, for each web site, the default recrawl period defined in crawler.properties.

Scheduling

This parameter allows you to define, for each web site, the days and hours during which the crawl will be processed. If the crawl of a web site is not completed at the end of an allowed time frame, it will resume at the same point in the next allowed time frame.

Crawl caching

This feature is available if you have a MongoDB database installed (64-bit version). All crawled pages are stored in MongoDB, so it is possible to reindex pages (pipeline and indexer) without any access to the web site. This is useful when web site settings are updated (name, language, country, tags, collection).

This feature requires available disk space for the MongoDB database. Each web site cache is a collection in MongoDB.

Crawl pause/resume 

This feature is available if you have a MongoDB database installed (64-bit version). MongoDB is used as persistent storage for the crawler queues (pages still to process, processed pages).

Crawl crash resume

This feature is available if you have a MongoDB database installed (64-bit version). MongoDB is used as persistent storage for the crawler queues (pages still to process, processed pages).

If the crawler crashes or is killed, the crawler queues are not lost, so the crawler will resume crawls at the next startup.

Large web site crawl

This feature is available if you have a MongoDB database installed (64-bit version). MongoDB is used as persistent storage for the crawler queues (pages still to process, processed pages).

Crawling a large web site creates large queues (pages still to process, processed pages). Without MongoDB, these queues are kept in memory, so depending on the available memory and on how many web sites are crawled at the same time, the crawler can stop working.

Using MongoDB for queue storage does not require a lot of memory.

Pipeline

Tag filtering

A new pipeline stage is available. This stage filters the tags defined for a web site: only the tags mentioned in the page content are sent to the indexer.

Solr Indexer

Solr 3.5.0

Configuration files for Solr 3.5.0 are provided (schema.xml and solrconfig.xml).

Better commit strategy

Crawl Anywhere 1.2.1 available

The new version 1.2.1 of Crawl Anywhere is now available. This release includes pipeline enhancements and bug fixes.

Pipeline

The pipeline is now multi-threaded. With a crawler configured to crawl more than 8 web sites simultaneously, the pipeline was the bottleneck.

With an 8-core processor, the benchmarks give:

Threads    Documents processed per hour
1          32,000
2          55,000
4          90,000

 

For Hurisearch, we now crawl 32 web sites simultaneously and the pipeline uses 4 threads.

Crawl Anywhere 1.2.0 available

The new version 1.2.0 of Crawl Anywhere is now available.

This version introduces the following new concepts:

  • Account

Web sites to be crawled are created under an account

You can create several accounts

Each account manages a set of web sites

It is possible to create users in order to manage a specific account

  • Target

A target is a Solr core.

You can associate several targets with an account

You can specify a default target for an account

A web site from an account can be indexed in any target (Solr core) associated with its account

  • Engine

An engine is an instance of crawler/pipeline/indexer

You can associate one engine with an account

Several accounts can share the same engine

You can deploy several engines on several servers

An engine will crawl and index the web sites from the accounts it is associated with

  • Web site settings

New way to define starting URLs (web site, RSS, page of links)

New way to specify crawling rules

 

Crawl Anywhere 1.1.4 available

The new version 1.1.4 of Crawl Anywhere is now available. This release includes new scripting capabilities, bug fixes and small settings updates.

Crawler

Default user-agent changed to "CaBot"

Fixes in the URL normalization:

  • Default HTTP port: the URLs http://www.domain.com/ and http://www.domain.com:80/ are now considered equal
  • Default HTTPS port: the URLs https://www.domain.com/ and https://www.domain.com:443 are now considered equal
  • URL parameter order: the URLs http://www.domain.com/page.php?id=1&action=list and http://www.domain.com/page.php?action=list&id=1 are now considered equal
  • J2EE jsessionid: the J2EE jsessionid is now correctly handled

Fix in the internal crawler web service dependencies (missing Java libraries)

Fix on robots.txt handling

HTML meta extraction:

  • <meta name="xxx" content="yyyy" /> now produces  <item_meta_xxx>yyyy</item_meta_xxx>
  • <meta http-equiv="xxx" content="yyyy" /> now produces <item_meta_equiv_xxx>yyyy</item_meta_equiv_xxx>

In the previous version, scripting made it possible to manage JavaScript links. In addition, standard HTTP links can now be manipulated (updated, removed, added).

Pipeline

New StandartQueueWriter stage without mapping (the in-memory XML file is written without any mapping) – useful for debugging or for use with DIH

In the pipeline definition file, the "position" attribute of the stage element is now deprecated. Stages can be added, removed or reorganized without updating the "position" attribute.

In the previous version, scripting made it possible to update the HTML page content before conversion to text (to remove headers, footers, menus, …). In addition, it is now possible to manipulate the text after conversion from any format (HTML, PDF, …).

Indexer

Tested with Solr 3.1

Administration Web interface

The status page now monitors pipeline and indexer states

The status page now monitors queue sizes

Better source error feedback

Crawl Anywhere 1.1.3 available

We just discovered that the download links for version 1.1.2 were broken, so we published release 1.1.3 with corrected links. This new release also includes one new pipeline stage and a crawler bug fix.

Solr schema

The default Solr schemas provided now define "AND" as the defaultOperator.
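For reference, this is the standard schema.xml element that controls the default operator:

  <solrQueryParser defaultOperator="AND"/>

With "AND" as the default operator, a multi-word query only matches documents containing all the terms, instead of any of them.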

Crawler

Fix in the URL normalization: the URLs http://www.domain.com/ and http://www.domain.com:80/ were considered different.

Pipeline

The sample XML mapping file for the SolrIndexQueueWriter stage now documents the mandatory mapping.

A new FieldMapping stage was added.

Crawl Anywhere 1.1.2 available

The new version 1.1.2 of Crawl Anywhere is now available. This new release includes the following new features:

Crawler

Bug fixing

This release fixes an issue that occurred when the crawler was launched in once mode. Under certain conditions, the crawler never stopped. 

Source configuration

The "Crawl links inclusion/exclusion rules" parameter now allows you to specify multiple directives in a rule. See Configure a web site to be crawled.

Pipeline

DocTextExtractor stage

This stage can now use scripting in order to correctly extract the following 3 items from an HTML page:

  • the title
  • a date
  • the cleaned text

Cleaned text is text without headers, footers, advertisements, menus, … See the detailed information in the documentation.
