Installation version 3.0.x

Pre-requisites

Linux

Under Linux, we suggest increasing the limit on simultaneously open files. In the “/etc/security/limits.conf” file, add this line and restart the server.

root soft nofile 16184

Windows

Under Windows, Cygwin is mandatory.

In the provided configuration files (Tomcat, …), we assume that Cygwin is installed in c:\cygwin.

Java 6

The Sun JDK 6 has to be installed on the server (do not use OpenJDK 1.6 or any other non-Sun version).

Warning: if your server uses a NUMA architecture, consider using the -XX:+UseNUMA flag. Please read Java HotSpot™ Virtual Machine Performance Enhancements.

Mysql 5.x.x

MySQL 5.x.x has to be installed on the server.

Warning: MySQL and the NUMA architecture. The blog post “The MySQL ‘swap insanity’ problem and the effects of the NUMA architecture” describes the effects of NUMA on databases. It focuses on the problems NUMA creates for MySQL, but the issues are the same here. The post describes the NUMA architecture and its goals, and how these are incompatible with the way databases work.

Tomcat 5.5.x or Tomcat 6.0.x

Tomcat 5.5.x or Tomcat 6.0.x has to be installed on the server.

Note: the default Crawl-Anywhere settings expect Tomcat to be accessible on port 8180. If Tomcat is listening on another port on your server (for instance 8080), check the following configuration files and update them accordingly.

Tomcat 5.5.x settings

In “/var/lib/tomcat5.5/conf/server.xml”, update the non-SSL connector in order to add the URIEncoding attribute:
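
For example, only the URIEncoding attribute needs to be added to your existing connector; the other attributes below are illustrative, keep your own values:

    <!-- example connector; only URIEncoding="UTF-8" is the required addition -->
    <Connector port="8180" maxThreads="150" minSpareThreads="25"
               connectionTimeout="20000" redirectPort="8443"
               URIEncoding="UTF-8" />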

Optional: if the Tomcat server is dedicated to Crawl-Anywhere, it is a good idea to allow access to it only from localhost. In “/var/lib/tomcat5.5/conf/server.xml”, add a valve definition at engine level:
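
A minimal sketch, assuming only localhost should be allowed (the valve goes just inside the <Engine> element):

    <!-- allow access from localhost only -->
    <Valve className="org.apache.catalina.valves.RemoteAddrValve"
           allow="127.0.0.1" />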

Tomcat 6.0.x settings

In “/var/lib/tomcat6/conf/server.xml”, update the non-SSL connector in order to add the URIEncoding="UTF-8" attribute:
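
For example (only URIEncoding="UTF-8" is the required addition; keep the other attributes of your existing connector):

    <!-- example connector for Tomcat 6 -->
    <Connector port="8180" protocol="HTTP/1.1"
               connectionTimeout="20000" redirectPort="8443"
               URIEncoding="UTF-8" />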

Optional: if the Tomcat server is dedicated to Crawl-Anywhere, it is a good idea to allow access to it only from localhost. In “/var/lib/tomcat6/conf/server.xml”, add a valve definition at engine level:
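
As for Tomcat 5.5, a minimal sketch of the valve, placed just inside the <Engine> element:

    <!-- allow access from localhost only -->
    <Valve className="org.apache.catalina.valves.RemoteAddrValve"
           allow="127.0.0.1" />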

 

In order to avoid “Out of heap space” or “Out of PermGen space” errors, you can update Tomcat’s default memory settings. For instance, on Debian Squeeze, you can edit the “/etc/default/tomcat6” file and change JAVA_OPTS (the -Xmx and -XX:MaxPermSize switches).
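
For instance (the memory sizes below are only an example; adapt them to your server):

    # example sizes; adjust -Xmx and -XX:MaxPermSize for your server
    JAVA_OPTS="-Djava.awt.headless=true -Xmx1024m -XX:MaxPermSize=256m"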

In order to avoid “Too many open files” errors, you can edit Tomcat’s launch script “/etc/init.d/tomcat6”. Add this line at the top of the file:
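
For example, reusing the open file limit suggested earlier:

    # matches the nofile limit suggested in /etc/security/limits.conf above
    ulimit -n 16184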

Apache 2.2.x and PHP 5.3.x

Apache 2.2.x and PHP 5.3.x have to be installed on the server.

The Apache modules proxy and proxy_http can also be installed (optional). This allows Tomcat to be reached remotely through Apache on port 80.

For PHP 5, the following extensions are required :

  • php_gettext
  • php_curl
  • php_mysql

Note: Crawl-Anywhere needs gettext support in PHP. In addition, if you wish to have the messages translated, you will need working gettext support in the operating system and the desired locales installed for the languages you want to support. For instance, on Debian Linux, you can configure locale support with the following command:
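
For example, on Debian:

    dpkg-reconfigure locales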

Optional: if you want to access Tomcat on port 80, you can configure the Apache proxy mode.

Enable the proxy and proxy_http modules in Apache:
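
On Debian, for instance:

    a2enmod proxy proxy_http
    /etc/init.d/apache2 restart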

In your virtual host definition, add the following lines:
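
A sketch, assuming Tomcat listens on port 8180 and is exposed under the /tomcat/ path (the path used later for the Solr administration URL); adjust to your own setup:

    # Tomcat assumed on port 8180, exposed under /tomcat/ (adjust as needed)
    ProxyRequests Off
    ProxyPass        /tomcat/ http://localhost:8180/
    ProxyPassReverse /tomcat/ http://localhost:8180/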

Use appropriate allow and deny instructions in order to protect your Tomcat server.

PDFTOTEXT

pdftotext is a tool that converts PDF files into text files. This tool is used by the pipeline during the text extraction stage.

This tool is provided by the xpdf-utils or poppler-utils packages.
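
For instance, on Debian:

    apt-get install poppler-utils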

mongoDB

Using mongoDB enables some great new features:

  • Crawler start/stop/pause/resume
  • Crawler crash resume
  • Crawler page caching

Any questions about mongoDB? Please read the mongoDB home page and mongoDB Fundamentals.

Due to its limitations, DO NOT USE the 32-bit version of mongoDB!

Warning: Linux, NUMA and MongoDB tend not to work well together. If you are running MongoDB on NUMA hardware, we recommend turning NUMA off (running with an interleaved memory policy). Problems manifest in strange ways, such as massive slowdowns for periods of time or high system CPU time. Read this page.
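
For example, assuming the Debian package layout, mongod can be started with an interleaved memory policy:

    # example using the Debian package configuration file; adjust to your installation
    numactl --interleave=all mongod -f /etc/mongodb.conf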

Installation

  • Create a directory in order to install a Crawler instance
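    For instance:
        mkdir -p /opt/crawler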

  • Copy the 2 tar files (crawl-anywhere-x.x-dependencies-jar.tar.gz and crawl-anywhere-x.x.tar.gz) into /opt/crawler and extract their content
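    For instance:
        cd /opt/crawler
        tar xzf crawl-anywhere-x.x-dependencies-jar.tar.gz
        tar xzf crawl-anywhere-x.x.tar.gz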

  • Create the log, tmp and queue directories
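    For instance (the queue directory names below match the directory structure described at the end of this page):
        cd /opt/crawler
        mkdir -p log tmp queue queue_solr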

  • Create the MySQL crawler database
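    A sketch, using the default names mentioned later on this page (database “crawler”, user “crawler”, password “crawler”; change them for production):
        mysql -u root -p
        mysql> CREATE DATABASE crawler;
        mysql> GRANT ALL PRIVILEGES ON crawler.* TO 'crawler'@'localhost' IDENTIFIED BY 'crawler';
        mysql> FLUSH PRIVILEGES;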

  • Initialize the database

  • Install the crawler’s internal Web Services in Tomcat:

Check the install/crawler/tomcat/crawlerws-default.xml and install/crawler/tomcat/crawlerws-jndi.xml files and update them according to your configuration.
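
For example, once updated, the chosen context file can be copied under Tomcat’s configuration directory; the target path below is only an example for Debian’s tomcat6 package, adapt it to your Tomcat installation:

    # target path is an example for Debian's tomcat6 package
    cp install/crawler/tomcat/crawlerws-jndi.xml /var/lib/tomcat6/conf/Catalina/localhost/crawlerws.xml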

Note: for a Windows / Cygwin installation, we provide the install/crawler/tomcat/crawlerws-jndi-cygwin.xml file.

Solr installation

This is the best time to install and configure Solr. We provide 3 different preconfigured Solr releases: Solr 3.1.0, Solr 3.5.0 and Solr 4.0.0.

Preconfigured means that we provide, in the conf directory, our configured versions of the solrconfig.xml and schema.xml files.

Note: the Solr 4.0.0 version we provide was built on top of the recent official Solr 4.0 release. You have to use this build because we applied a patch so that our multilingual analyzer works.

Here are the steps to install and configure Solr version xxx:

  • Create the Solr directory
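    For instance, matching the default path expected by the provided Tomcat context file:
        mkdir -p /opt/solr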

  • Declare the new Solr instance in Tomcat
    If Solr is not installed in “/opt/solr”, edit the “/opt/crawler/install/solr/solr-tomcat-jndi-sample.xml” file according to your configuration, then copy it under Tomcat’s conf directory.
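    For example (the target path below is only an example for Debian’s tomcat6 package; adapt it to your Tomcat installation):
        # target path is an example for Debian's tomcat6 package
        cp /opt/crawler/install/solr/solr-tomcat-jndi-sample.xml /var/lib/tomcat6/conf/Catalina/localhost/solr.xml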

Note: for a Windows / Cygwin installation, we provide the install/solr/solr-tomcat-jndi-sample-cygwin.xml file.

With the Apache proxy mode enabled, you can access your Solr administration home page at http:///tomcat/solr/

We strongly encourage you to choose Solr 4.0.0 because :

  • we provide a new multilingual analyzer that is much more powerful than the one for Solr 3.x. You can configure the processing for each language in the same way Solr does in its schema.xml file
  • we provide a new analyzer for tag cloud generation

Configuration

The configuration files are located in the /opt/crawler/config, /opt/crawler/web and /opt/crawler/install directories.

Configure your apache web server:

  • Update “install/crawler/apache/httpd_crawler.conf” according to your configuration (path, virtual host, ServerName, ServerAdmin, …), install it in apache and reload apache

Configure the crawler web administration:

  • In web/crawler/config, rename config-default.ini to config.ini or create a symbolic link
    ln -s config-default.ini config.ini
  • Update “web/crawler/config/config.ini” according to your configuration (path, mysql database, mysql user, host name, …)
  • We suggest removing the “web/crawler/pub/phpinfo.php” file

Configure the sample web search:

  • In web/search/config, rename config-default.ini to config.ini or create a symbolic link
    ln -s config-default.ini config.ini
  • Update “web/search/config/config.ini” according to your configuration (path, virtual host, solr url, …)
  • We suggest removing the “web/searcher/phpinfo.php” file

Configure the crawler engine:

  • In config/crawler, rename crawler-default.properties to crawler.properties or create a symbolic link
    ln -s crawler-default.properties crawler.properties
  • Check the settings in config/crawler/crawler.properties. Please change the crawler’s user-agent from “CaBot” to your own!

Configure the pipeline engine:

  • In config/pipeline, rename simplepipeline-default.xml to simplepipeline.xml or create a symbolic link
    ln -s simplepipeline-default.xml simplepipeline.xml
  • Check settings in config/pipeline/simplepipeline.xml

Configure the indexer engine:

  • In config/indexer, rename indexer-default.xml to indexer.xml or create a symbolic link
    ln -s indexer-default.xml indexer.xml
  • Check settings in config/indexer/indexer.xml

The main parameters to check are the paths and connection settings (database name, log directory, converter tools, Solr URL, …).

If you install everything in /opt/crawler, with the crawler database named “crawler” and the database user “crawler” with password “crawler”, the only parameter you will have to change is in the “/opt/crawler/web/crawler/config/config.ini” file.

You can now access the crawler administration at http:///crawler/

The default admin login is : admin / admin

MongoDB settings

By default, the crawler settings do not use a MongoDB database. In order to enable MongoDB, it is necessary to update 2 configuration files:

  • config/crawler/crawler.properties

Uncomment the following lines and change the host and port for MongoDB access.

  • web/crawler/config/config.ini

Uncomment the following line.

Scripts

Scripts are used in order to start and stop the crawler, the pipeline and the indexer.

Scripts are available for Linux or Windows (with Cygwin) and are located in the “scripts” directory. In order to get and install these scripts, follow the instructions in the readme.txt file located in this directory.

For example, if you are installing Crawl-Anywhere on a Linux server, the steps to configure the scripts are:

On Linux, check the ulimit command parameter in the init.inc.sh scripts.

Document converters

Converters are used in order to convert SWF (Flash), DOC, XLS and PPT files into text.

You can download the external tools from the download page.

Copy this file into the “/opt/crawler” directory, then uncompress and unarchive its content.

Document converters are available for Linux, Win32 and Mac OS and are now located in the “external” directory. In order to configure these tools for your operating system, follow the instructions in the readme.txt file located in this directory.

For example, if you are installing Crawl-Anywhere on a 64-bit Linux server, the steps to configure the converters are:

SWF2HTML

In order to crawl a web site built with Flash, the crawler needs a Flash-to-HTML converter.

We suggest installing this converter in the “/opt/crawler/external” directory. This matches the default settings in the configuration files.

PDFTOTEXT

pdftotext is a tool that converts PDF files into text files. This tool is used by the pipeline during the text extraction stage.

This tool is provided by the xpdf-utils or poppler-utils packages.

CATDOC

catdoc, catppt and xls2csv convert MS Office files into text files. These tools are used by the pipeline during the text extraction stage.

We suggest installing these 3 converters in the “/opt/crawler/external” directory. This matches the default settings in the configuration files.

Directories structure

At this step, the /opt/crawler directory content looks like this:

Directory    Description
bin          the Crawl Anywhere Java libraries (crawler, pipeline, indexer, utils)
config       the configuration files
external     the third-party tools (converters)
install      the installation scripts
lib          the Java libraries Crawl Anywhere needs
log          the log files (log, output and pid files)
queue        the crawler output queue and the pipeline input queue
queue_solr   the pipeline output queue and the Solr indexer input queue
scripts      the various scripts used to start the crawler, the pipeline and the Solr indexer
src          various source files we have to provide due to licensing (can be deleted)
tmp          directory for temporary files
web          the crawler web administration and the search web application
webapps      the crawler administration internal web service

And now … start crawling web sites

Step 1: configure the web sites to be crawled in the web admin. See: http://www.crawl-anywhere.com/getting-started/#define-your-first-sources-web-sites-to-be-crawled

Step 2: start the crawler, pipeline and indexer processes. See: http://www.crawl-anywhere.com/getting-started/#start-the-crawler