Getting started

Here are the basic steps to use Crawl-Anywhere.

These steps will allow you to:

  • Download and install Crawl-Anywhere
  • Define your first sources (web sites to be crawled)
  • Start the crawler
  • Start the pipeline
  • Start the indexer

Download and install Crawl-Anywhere

You can download the latest version of Crawl-Anywhere from the download page. Download both files.

For versions 3.0.x, you also need the MySQL Java connector, available at this page.

For the installation, see http://www.crawl-anywhere.com/installation/
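
For example, on a Linux server the installation might start like this (archive names and paths are illustrative; follow the installation page above for the exact steps):

    # unpack the two downloaded archives (file names are examples)
    tar xzf <first-archive>.tar.gz
    tar xzf <second-archive>.tar.gz

    # for 3.0.x: put the MySQL Java connector jar where the crawler can load it
    # (the exact target directory depends on the installation instructions)
    cp mysql-connector-java-*.jar /opt/crawl-anywhere/lib/   # path is an example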

Define your first sources (web sites to be crawled)

Access the crawler administration at the following address: http://<your_server>/crawler/

The default login is: admin / admin

Go to the “sources” tab and click on the “+” icon.

The mandatory information is the name of the source (generally the web site name) and the URL of the web site.

We encourage you to provide at least the language of the web site content.

To optimize the crawl, we also encourage you to use link inclusion/exclusion rules, as illustrated below.
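
The exact rule syntax is defined in the administration interface; purely as an illustration of the idea, inclusion/exclusion rules let you follow only the sections you care about and skip noisy URLs:

    # illustrative patterns only, not the exact admin-interface syntax
    include  http://www.example.com/docs/*    # only follow links under /docs/
    exclude  *action=print*                   # skip printable page variants
    exclude  */calendar/*                     # skip near-infinite calendar links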

Start the crawler

In the scripts directory, you will find various ready-to-use scripts to start the crawler. These scripts are described here: Start and Stop crawler

You can modify these scripts or create new ones to match your needs. The line you will need to modify and the usage are sketched below.
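
As a hypothetical sketch (the variable and script names below are assumptions; check the actual scripts in your installation):

    # line to adapt in the start script: point it at your installation directory
    CRAWLER_HOME=/opt/crawl-anywhere    # variable name and path are assumptions

    # usage: run the start script from the scripts directory
    cd /opt/crawl-anywhere/scripts
    ./crawler_start.sh                  # script name is an assumption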

When the crawler starts, crawler.pid, crawler.log and crawler.output files are created in the log directory. Deleting the pid file will stop the crawler.
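
For example (the log directory location depends on your installation):

    # the crawler shuts down when its pid file disappears
    rm /opt/crawl-anywhere/log/crawler.pid    # path is an example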

Crawled pages are stored as XML files in the queue directory.
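
A quick way to check that the crawl is progressing is to watch the queue directory grow, for example:

    # count the files waiting in the queue (directory path is an example)
    watch -n 10 'find /opt/crawl-anywhere/queue -type f | wc -l'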

Start the pipeline

The XML files produced by the crawler are ready to be processed by the pipeline in order to:

  • extract text from binary files (PDF, DOC, …) or from HTML pages
  • detect the language
  • detect the relevant title
  • map XML elements to Solr fields

In the scripts directory, you will find various ready-to-use scripts to start the pipeline. These scripts are described here: Start and Stop pipeline

You can modify these scripts or create new ones to match your needs. As for the crawler, the line you will need to modify and the usage are sketched below.
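
Again a hypothetical sketch (names are assumptions, as for the crawler):

    # line to adapt in the pipeline start script
    PIPELINE_HOME=/opt/crawl-anywhere   # variable name and path are assumptions

    # usage: run the start script from the scripts directory
    cd /opt/crawl-anywhere/scripts
    ./pipeline_start.sh                 # script name is an assumption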

When the pipeline starts, pipeline.pid, pipeline.log and pipeline.output files are created in the log directory. Deleting the pid file will stop the pipeline.

Processed pages are stored as XML files in the queue_solr directory.
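
The exact fields depend on your pipeline configuration and your Solr schema; as an illustration, a processed document in Solr's XML update format could look like this (the path and field names are assumptions):

    $ cat /opt/crawl-anywhere/queue_solr/<some-file>.xml
    <add>
      <doc>
        <field name="id">http://www.example.com/docs/page.html</field>
        <field name="language">en</field>
        <field name="title">Example page</field>
        <field name="text">extracted page text …</field>
      </doc>
    </add>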

Start the indexer

The XML files produced by the pipeline are ready to be indexed into Solr by the indexer.

In the scripts directory, you will find various ready-to-use scripts to start the indexer. These scripts are described here: Start and Stop indexer

You can modify these scripts or create new ones to match your needs. The line you will need to modify and the usage are sketched below.
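
A hypothetical sketch; for the indexer, the line to adapt typically points at your Solr instance (all names below are assumptions):

    # line to adapt in the indexer start script: the target Solr URL
    SOLR_URL=http://localhost:8983/solr   # assumption: Solr's default address

    # usage: run the start script from the scripts directory
    cd /opt/crawl-anywhere/scripts
    ./indexer_start.sh                    # script name is an assumption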

What are the next steps?

The next steps are:

  • learn how to configure the crawler process
  • learn how to configure the pipeline process (add or remove stages)
  • learn how to configure Solr indexing (modify the field mapping to match a specific Solr schema)
  • learn how to administer the crawler (define sources)