Crawler installation

Why are the swf, pdf, doc, ppt and xls converters not installed in the externals directory?

The converters are no longer provided as a direct download. However, the installation page says:

Converters are used to convert SWF (Flash), DOC, XLS and PPT files into text. To get and install these tools, follow the instructions in the readme.txt file located in the install directory.

The readme file says: You can download the external tools at: http://www.crawl-anywhere.com/downloads/crawl-anywhere-external.tar.gz

This is due to licensing considerations.

Crawler configuration

Why is the crawler slow?

You can configure the crawler to crawl several sources, and several pages per source, at the same time. Check that the two relevant parameters are set properly.

In /opt/crawler/config/crawler/crawler.properties, they are these two lines:

crawler.max_simultaneous_item_per_source = 1
crawler.max_simultaneous_source = 1

These parameters have to be set according to your hardware configuration and your internet bandwidth.
Please don’t use a value higher than 2 for crawler.max_simultaneous_item_per_source.
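
As a quick sanity check, the two parameters can be read back and compared against the advice above. The following is only a minimal sketch: the helper names parse_properties and check_concurrency are hypothetical (they are not part of the crawler), and treating a missing parameter as 1 is an assumption, not documented behaviour.

```python
from typing import Dict, List

ITEM_KEY = "crawler.max_simultaneous_item_per_source"
SOURCE_KEY = "crawler.max_simultaneous_source"


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple 'key = value' lines, skipping blank lines and comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_concurrency(props: Dict[str, str]) -> List[str]:
    """Return warnings about the two concurrency parameters."""
    warnings: List[str] = []
    # Assumption: a missing parameter is treated as 1 for this check only.
    per_source = int(props.get(ITEM_KEY, "1"))
    sources = int(props.get(SOURCE_KEY, "1"))
    if per_source > 2:
        warnings.append(ITEM_KEY + " should not exceed 2")
    if per_source == 1 and sources == 1:
        warnings.append("both parameters are 1: the crawler fetches one page at a time")
    return warnings


sample = """
crawler.max_simultaneous_item_per_source = 1
crawler.max_simultaneous_source = 1
"""

for warning in check_concurrency(parse_properties(sample)):
    print(warning)
```

To check a real installation, pass the contents of /opt/crawler/config/crawler/crawler.properties to parse_properties instead of the sample string.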

Crawler administration

What is the test mode for a source?

In test mode, the pages crawled on the web site are not sent to the crawler's output repository. The crawler just browses the web site and produces a log file so you can verify that pages are correctly discovered.

“Test” buttons don’t work while setting up a source

The test button should work. It looks like you didn’t install and configure the crawler’s internal Web Services in Tomcat. Install and configure them, and update “web/crawler/config/config.ini”.

Can you explain the “reset”, “rescan” and “deeper” icons in the sources list?

reset: deletes the previously crawled URLs from the database and restarts the crawl of the site.
rescan: does not delete the previously crawled URLs from the database. It loads these URLs into the list of URLs to be crawled and restarts the crawl of the site.
deeper: to be used when you increase the max crawl depth for a site. It starts the crawl from the deepest level previously reached.