Crawl Anywhere 3.0.1 available

The new version 3.0.1 of Crawl Anywhere is now available. This is a bug fix release.

The issue was in the pipeline. The stage in charge of extracting and harmonizing the content type was misconfigured.

Although this new release is mainly a bug fix release, it introduces a new feature: Solr index time […]

Crawl Anywhere 3.0.0 available

The new version 3.0.0 of Crawl Anywhere is now available. This new major release includes a lot of powerful enhancements. Here is the list of the main enhancements.

Crawler

Source import / export

Clear action

Crawl log history

Sitemaps support

Snacktory HTML page cleaning option

Custom metadata (global per web site or specific for […]

Twitter account created!

 

I am happy to announce our new Twitter account!

Some people were worried about not getting any news and not seeing new versions coming. We are working continuously on Crawl Anywhere, but we only publish a new version when we have time to freeze development, test and document.

With this Twitter account we will provide regular news […]

Roadmap for next release

Here is some news about the next release.

The roadmap for the main development items is:

Implement the Snacktory HTML page cleaning algorithm (http://karussell.wordpress.com/2011/07/12/snacktory-yet-another-readability-clone-this-time-in-java/): done

Sources definition import / export: done / testing

Use InnoDB as the engine in MySQL: done / testing

Solr 4.0 support (multilingual analyzer port from 2.9.x/3.x version + filters chaining configuration by […]

Crawl Anywhere 2.0.0 available

The new version 2.0.0 of Crawl Anywhere is now available. This new major release includes a lot of powerful enhancements.

Crawler

Recrawl period parameter

This parameter overrides, for each web site, the default recrawl period defined in crawler.properties.
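
As an illustration only, the per-site override can be pictured as a simple fallback: use the site's own period when one is set, otherwise the global default. The sketch below is not Crawl Anywhere code. The function names, the use of hours as the unit and the 24-hour default are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical global default, standing in for the value read from
# crawler.properties (the real key name and unit are not shown here).
DEFAULT_RECRAWL_PERIOD_HOURS = 24.0


def effective_recrawl_period(site_period_hours: Optional[float],
                             default_hours: float = DEFAULT_RECRAWL_PERIOD_HOURS) -> float:
    """Return the site's own period if one is set, else the global default."""
    return default_hours if site_period_hours is None else site_period_hours


def is_due(last_crawl: datetime, now: datetime,
           site_period_hours: Optional[float] = None) -> bool:
    """True when at least one recrawl period has elapsed since the last crawl."""
    period = timedelta(hours=effective_recrawl_period(site_period_hours))
    return now - last_crawl >= period
```

With these assumptions, a site crawled 12 hours ago is not due under the 24-hour default, but is due if its own period is set to 6 hours.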

Scheduling

This parameter defines, for each web site, the days and hours the crawl will […]

Crawl Anywhere 1.2.1 available

The new version 1.2.1 of Crawl Anywhere is now available. This release includes pipeline enhancements and bug fixes.

Pipeline

The pipeline is now multi-threaded. With a crawler configured to crawl more than 8 web sites simultaneously, the pipeline was the bottleneck.

With an 8-core processor, the benchmarks give:
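
The general shape of a multi-threaded pipeline can be sketched with a worker pool that runs the stages for several documents at once. This is a minimal sketch, not the Crawl Anywhere implementation: `process_document` is a hypothetical stand-in for the real chain of stages, and the thread count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def process_document(doc: str) -> str:
    """Hypothetical stand-in for the pipeline stages applied to one document."""
    return doc.strip().lower()


def run_pipeline(docs: Iterable[str], threads: int = 4) -> List[str]:
    """Process documents with a pool of worker threads.

    pool.map returns results in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(process_document, docs))
```

Note that in Python, threads mostly help when the stages wait on I/O. The sketch only shows the worker-pool structure and says nothing about the throughput figures reported for the actual pipeline.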

Threads Documents processed per […]

Crawl Anywhere 1.2.0 available

The new version 1.2.0 of Crawl Anywhere is now available.

This version introduces the following new concepts:

Account

Web sites to be crawled are created under an account

You can create several accounts

Each account manages a set of web sites

It is possible to […]

Crawl Anywhere 1.1.4 available

The new version 1.1.4 of Crawl Anywhere is now available. This release includes new scripting capabilities, bug fixes and a small settings update.

Crawler

Default user-agent changed to "CaBot"

Fix in the URL normalization:

Default http port: the URLs http://www.domain.com/ and http://www.domain.com:80/ are now considered equal

Default https port: the URLs https://www.domain.com/ […]
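
The default-port rule can be illustrated with a short sketch that strips an explicit port when it matches the scheme's default. This is not the crawler's actual code. The function name is hypothetical, and the sketch deliberately ignores user credentials and any other normalization steps.

```python
from urllib.parse import urlsplit, urlunsplit

# Standard default ports: 80 for http, 443 for https.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Drop an explicit port that equals the scheme's default port.

    Simplification for this sketch: user:password@ credentials are discarded.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literal: urlsplit strips the brackets, so restore them.
        host = f"[{host}]"
    port = parts.port
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        # Keep ports that are not the default for this scheme.
        host = f"{host}:{port}"
    return urlunsplit((scheme, host, parts.path, parts.query, parts.fragment))
```

Two URLs are then treated as the same page when their normalized forms are equal, while a URL with a non-default port such as 8080 keeps its port.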

Crawl Anywhere 1.1.3 available

We just discovered that the download links for version 1.1.2 were broken, so we published release 1.1.3 with corrected links. This new release includes one new pipeline stage and a crawler bug fix.

Solr schema

The default Solr schemas provided now define "AND" as the defaultOperator.

Crawler

Fix in the URL normalization. The […]

Crawl Anywhere 1.1.2 available

The new version 1.1.2 of Crawl Anywhere is now available. This new release includes the following new features:

Crawler bug fix

This release fixes an issue that occurred when the crawler was launched in once mode. Under certain conditions, the crawler never stopped. 

Source configuration

The parameter "Crawl links inclusion/exclusion rules" now allows you to […]
