Configure the crawler

The crawler is launched by scripts in the scripts directory (see Start and stop crawler). The main purpose of these scripts is to build the classpath and launch the crawler with a command line such as the one sketched below.
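The exact command line is installation specific; in this sketch, the main class name and the configuration file path are assumptions for illustration, not values taken from this documentation:

    # Hypothetical launch command: the scripts build the classpath, then start
    # the crawler with -p pointing at its configuration file.
    java -cp "$CLASSPATH" com.example.crawler.Crawler -p conf/crawler.properties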

The -p parameter specifies the configuration file for the crawler. The following list describes the main parameters of the configuration file.

database.*
    This group of parameters defines how the crawler database is accessed.
    The database.adapter and database.jdbcUrl parameters should not be changed.

crawler.log_verbose
    1 -> the crawler produces a verbose log
    0 -> the crawler produces a normal log

crawler.force_once
    1 -> once mode is forced even if the -o command line parameter is not specified
    0 -> once mode is not forced; the -o command line parameter controls this option
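For example, assuming the configuration file uses plain key=value lines (an assumption about its syntax, not confirmed by this documentation), verbose logging and forced once mode would be enabled with:

    # Assumed key=value syntax; enables verbose logging and forces once mode.
    crawler.log_verbose=1
    crawler.force_once=1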

crawler.witness_files_path
    Path where the pid file is located. A pid file is created when the crawler
    starts; it allows the launch scripts to check whether an instance of the
    crawler is already running. Deleting this file stops the crawler.

crawler.max_simultaneous_source
    Number of sources crawled simultaneously. This value can be increased
    according to your hardware and internet bandwidth.

crawler.max_simultaneous_item_per_source
    Default number of items crawled simultaneously on a source. We recommend
    not changing this value, since it can impact the remote crawled server.
    It can be overridden for each source in the source administration.

crawler.max_depth
    Default maximum crawl depth for a source. This value can be overridden
    for each source in the source administration.

crawler.max_page
    Default maximum number of crawled items on a source. This value can be
    overridden for each source in the source administration.

crawler.max_page_length
    Maximum crawled item size in bytes.
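As an illustration, here is a sketch of these limits with arbitrary example values (they are not recommended defaults):

    # Example values only; key=value syntax assumed.
    crawler.max_simultaneous_source=10          # raise according to hardware/bandwidth
    crawler.max_simultaneous_item_per_source=2  # impacts the remote server; better left as shipped
    crawler.max_depth=5
    crawler.max_page=1000
    crawler.max_page_length=1048576             # 1 MB, in bytes
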
crawler.child_only
    Default option for the child-only crawl strategy. This value can be
    overridden for each source in the source administration.

    1 -> crawl only child items of the root item
    0 -> crawl any item in the source

    For a web site source, the root item is the start URL of the source.
    If the root item is http://www.something.com/a/b/index.php, child items
    are all pages like http://www.something.com/a/b/* (see the sketch below).
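A sketch of the child-only setting applied to the example above; the two URLs in the comments are illustrative:

    # With the start URL http://www.something.com/a/b/index.php :
    crawler.child_only=1
    # crawled: http://www.something.com/a/b/page.html  (matches /a/b/*)
    # skipped: http://www.something.com/c/other.html   (outside /a/b/)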

crawler.period
    Minimum recrawl period (in hours) for a source.

crawler.period_binary_file
    Minimum recrawl period (in hours) for a binary file (PDF, DOC, …).

crawler.contenttype_include
    List of accepted item content-types.

crawler.contenttype_exclude
    List of rejected content-types.

indexer.country_include
    List of included countries.

indexer.country_exclude
    List of excluded countries.
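A sketch of these list-valued parameters; the comma separator and the values are assumptions, so check the configuration file shipped with your installation for the actual syntax:

    # Separator and values are assumptions, shown for illustration only.
    crawler.contenttype_include=text/html,application/pdf
    crawler.contenttype_exclude=image/*
    indexer.country_include=fr,us
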
crawler.swfToHtmlPath
    Path to the SWF (Flash) to HTML converter.

crawler.scripts_path
    Path to the scripts directory. The scripts directory is the root directory
    containing the scripts used by connectors to handle specific issues. For
    example, the web connector can use scripts to handle JavaScript link
    extraction.

crawler.ignore_url_fields
    Parameters ignored in URLs, typically session parameters such as PHPSESSID
    and jsessionid. If two URLs differ only by these parameters, they are
    considered equal. Additional URL parameters to be ignored can be defined
    for each source (in the source management in the Crawler administration).
    See the example below.
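For example, with the setting sketched here (comma separator assumed), the two URLs in the comment are treated as the same item:

    # The following two URLs are then considered equal:
    #   http://www.something.com/index.php?id=1&PHPSESSID=abc123
    #   http://www.something.com/index.php?id=1&PHPSESSID=xyz789
    crawler.ignore_url_fields=PHPSESSID,jsessionid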

logger.logfilename
    Crawler log file name.

logger.logfilename_test
    Crawler log file name in test mode (-t command line parameter).

documenthandler.default.classname
    Class name of the document handler.

documenthandler.default.jobspath
    Output directory path for documents produced by the document handler.

documenthandler.default.jobslimit
    Maximum number of documents in the output directory. If this limit is
    reached, the crawler enters pause mode.
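A sketch of the logging and document handler settings; the file names, path and limit below are placeholders, not values from this documentation:

    # Placeholder values; adapt to your installation.
    logger.logfilename=crawler.log
    logger.logfilename_test=crawler_test.log
    documenthandler.default.jobspath=/path/to/jobs
    documenthandler.default.jobslimit=1000   # the crawler pauses when this is reached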

WARNING: Do not change any undocumented parameters!