Installation version 4.0.0

Pre-requisites

Linux

Under Linux, we suggest increasing the limit on simultaneously open files. In the “/etc/security/limits.conf” file, add the following line and restart the server.

root soft nofile 16184
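
Once the server has restarted, you can check the limit that applies to the user running the crawler with the standard ulimit command:

# display the maximum number of open file descriptors for the current shell
ulimit -n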

Windows

Under Windows, Cygwin is mandatory.

In the provided configuration files (Tomcat, …) we assume that Cygwin is installed in c:\cygwin.

Java 7

Oracle JDK 7 has to be installed on the server (do not use OpenJDK or any other non-Oracle version).

Warning : if your server uses a NUMA architecture, consider using the -XX:+UseNUMA flag. Please read Java HotSpot™ Virtual Machine Performance Enhancements.
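
As an illustration only (the variable name depends on how your start scripts launch the JVM, so treat it as an assumption), the flag can be appended to the Java options used by the daemons:

# hypothetical example: add the NUMA flag to the JVM options used by the start scripts
JAVA_OPTS="$JAVA_OPTS -XX:+UseNUMA"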

MongoDB 2.6.x

MongoDB 2.6.x is mandatory and is used to:

  • Store web site crawl settings
  • Store the crawled items history
  • Manage crawl queues
  • Manage the crawl cache

Any question about MongoDB? Please read the MongoDB home page and MongoDB Fundamentals.

Due to its storage limitations, DO NOT USE the 32-bit version of MongoDB!

Warning : Linux, NUMA and MongoDB tend not to work well together. If you are running MongoDB on NUMA hardware, we recommend turning NUMA off (running with an interleaved memory policy). Problems manifest in strange ways, such as massive slowdowns for periods of time or high system CPU time. Read this page.
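
In practice this means starting mongod through numactl with an interleaved memory policy; a minimal sketch, assuming the binary and data paths of your installation:

# start mongod with memory interleaved across all NUMA nodes
numactl --interleave=all /opt/mongodb/bin/mongod --dbpath /opt/mongodb/data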

Apache 2.2.x and PHP 5.3.x / 5.4.x

Apache 2.2.x and PHP 5.3.x / 5.4.x have to be installed on the server.

The Apache modules proxy and proxy_http can also be installed (optional). They make it possible to access Tomcat remotely through port 80.
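
A minimal sketch of such a reverse-proxy rule, assuming Tomcat listens on its default port 8080 and the web service is deployed under a hypothetical /crawlerws context path:

# forward /crawlerws requests received on port 80 to the local Tomcat
ProxyPass /crawlerws http://localhost:8080/crawlerws
ProxyPassReverse /crawlerws http://localhost:8080/crawlerws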

For PHP 5, the following extensions are required :

  • php-gettext
  • php-curl
  • php-xml
  • MongoDB driver 1.5.6

GetText : Crawl-Anywhere needs gettext support in PHP. In addition, if you wish to have the messages translated, you need working gettext support in the operating system and the desired locales installed for the languages you want to support. For instance, with Debian Linux, you can configure the locale support with the following command:
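
On Debian this is typically done as root with the standard locales reconfiguration command:

# select and generate the locales you need (e.g. fr_FR.UTF-8, en_US.UTF-8)
dpkg-reconfigure locales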

The MongoDB driver for PHP is also required (version 1.5.6 or later). See the official documentation on the MongoDB web site: http://docs.mongodb.org/ecosystem/drivers/php/ and the PHP documentation on how to install the driver from source code: http://php.net/manual/fr/mongo.installation.php

In order to build the driver, you need the PHP development tools.

Centos
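
For example, on CentOS (package names are assumptions and may differ between releases):

# install the PHP development headers, PEAR/PECL and a compiler toolchain
yum install php-devel php-pear gcc make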

Debian
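
For example, on Debian (package names are assumptions and may differ between releases):

# install the PHP development headers and PEAR/PECL
apt-get install php5-dev php-pear make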

PDFTOTEXT

pdftotext is a tool that converts PDF files into text files. This tool is used by the pipeline during the text extraction stage.

This tool is installed by the xpdf-utils or the poppler-utils packages.
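
For example, it can be installed from your distribution packages (package names may vary):

# Debian / Ubuntu
apt-get install poppler-utils
# CentOS / RHEL
yum install poppler-utils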

Installation

Crawl-Anywhere installation

  • Create a directory in which to install a Crawler instance (a combined command sketch of these steps is given below, after the script notes)

  • Copy the 2 tar files (crawl-anywhere-x.x-dependencies-jar.tar.gz and crawl-anywhere-x.x.tar.gz) into /opt/crawler and extract their content

  • Create log, tmp and queues directories

  • Scripts

Scripts are used in order to start and stop the crawler, the pipeline and the indexer.

Scripts are available for Linux or Windows (with Cygwin) and are located in the “scripts” directory.  In order to get and install these scripts, follow instructions in the readme.txt file located in this directory.

For example, if you are installing Crawl-Anywhere on a Linux server, the steps to configure the scripts are illustrated in the sketch below.

On Linux, check the ulimit command parameter in the init.inc.sh scripts.
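
A combined sketch of the installation and script-setup steps above, assuming the instance is installed in /opt/crawler and the two distribution tar files have already been copied there (the exact file names and the script layout described in scripts/readme.txt may differ):

# create the instance directory and extract the two distribution archives copied into it
mkdir -p /opt/crawler
cd /opt/crawler
tar xzf crawl-anywhere-x.x-dependencies-jar.tar.gz
tar xzf crawl-anywhere-x.x.tar.gz
# create the working directories
mkdir -p log tmp queue queue_solr
# install the Linux scripts as described in scripts/readme.txt, then check that
# the open file limit set in init.inc.sh matches your limits.conf (e.g. ulimit -n 16184)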

MongoDB installation

MongoDB is a pre-requisite. These installation instructions are very basic: a 5-minute procedure that should work on any Linux distribution for the latest MongoDB production release. A command sketch is given after the steps below.

  • Download MongoDB at : http://www.mongodb.org/downloads and copy the distribution tar file into the “/opt” directory. Choose a 64-bit version!
  • Extract the content of the MongoDB distribution tar file

  • Install the MongoDB PHP driver (version 1.5.6 or later). See the PHP documentation on how to install the driver from source code: http://php.net/manual/fr/mongo.installation.php
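
A minimal sketch of these steps, assuming a 64-bit 2.6.x Linux archive and the legacy “mongo” PECL extension (the archive name and extension name are assumptions, check the MongoDB documentation for your exact release):

# extract the MongoDB distribution under /opt
cd /opt
tar xzf mongodb-linux-x86_64-2.6.x.tgz
# build and install the PHP driver through PECL
pecl install mongo-1.5.6
# then add "extension=mongo.so" to your php.ini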

Solr installation

Solr is a pre-requisite. These installation instructions are very basic: a 5-minute procedure that should work on any Linux distribution for the latest Solr production release. Even though it is possible to use earlier Solr releases, we provide configuration files for Solr 4.10.0.

We strongly encourage you to choose Solr 4.3.0 or 4.10.0 because we provide a new analyzer for tag cloud generation (it may not work with other 4.x versions due to Solr analyzer API changes).

Here are the steps in order to install and configure Solr version 4.x (a command sketch is given after the list):

  • Download the binary distribution at : http://lucene.apache.org/solr/mirrors-solr-latest-redir.html. Choose the solr-4.x.x.tgz file.
  • Copy the solr-4.x.x.tgz file into your /opt directory and unarchive it

  • Create the /opt/solr directory with a symbolic link

  • Finalize Solr installation by creating a server configuration and start script

  • Create a crawler index in the Solr home directory
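
A hedged sketch of these steps for Solr 4.10.0 (the server configuration, start script and crawler index depend on the files shipped with Crawl-Anywhere, so only the generic part is shown):

# extract the Solr distribution and point /opt/solr at it through a symbolic link
cd /opt
tar xzf solr-4.10.0.tgz
ln -s /opt/solr-4.10.0 /opt/solr
# then finalize the installation (server configuration, start script) and create
# the crawler index with the configuration files provided by Crawl-Anywhere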

Apache server configuration

  • Update “install/crawler/apache/httpd_crawler.conf” according to your configuration (path, virtual host, ServerName, ServerAdmin, …)
  • Install it in Apache and reload Apache

Debian
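
For example, on Debian (paths and commands are assumptions, adapt them to your Apache layout):

# install the virtual host and reload Apache
cp /opt/crawler/install/crawler/apache/httpd_crawler.conf /etc/apache2/sites-available/crawler
a2ensite crawler
service apache2 reload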

Centos
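
And on CentOS (assumed layout):

# install the configuration snippet and reload Apache
cp /opt/crawler/install/crawler/apache/httpd_crawler.conf /etc/httpd/conf.d/crawler.conf
service httpd reload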

Crawl-Anywhere configuration

The configuration files are located in the /opt/crawler/config and /opt/crawler/web directories.

Crawl-Anywhere should work fine without any change to the configuration files. However, if you did not install Crawl-Anywhere in the “/opt/crawler” directory or if your Solr server does not use the default 8983 port, you may have to change some settings in the configuration files.

The main parameters to check are locations and paths (database name, log directory, converter tools, Solr URL, …).

You can now access the crawler administration at http://<your_server>/crawler/

The default admin login is : admin / admin

Document converters

Converters are used in order to convert SWF (Flash), DOC, XLS and PPT files into text.

You can download external tools at : http://www.crawl-anywhere.com/downloads/crawl-anywhere-external.tar.gz

Copy this file into the “/opt/crawler” directory, then uncompress and unarchive its content.
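
For example (a sketch, assuming wget is available on the server):

# download and extract the external converters into /opt/crawler
cd /opt/crawler
wget http://www.crawl-anywhere.com/downloads/crawl-anywhere-external.tar.gz
tar xzf crawl-anywhere-external.tar.gz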

Document converters are available for Linux, Win32 and Mac OS and are located in the “external” directory. In order to configure these tools for your operating system, follow the instructions in the readme.txt file located in this directory.

For example, if you are installing Crawl-Anywhere on a 64-bit Linux server, configure the converters described in the following sections:

SWF2HTML

In order to crawl a web site built with Flash, the crawler needs a Flash-to-HTML converter.

We suggest installing this converter in the “/opt/crawler/external” directory. This matches the default settings in the configuration files.

PDFTOTEXT

pdftotext is a tool that converts PDF files into text files. This tool is used by the pipeline during the text extraction stage.

This tool is installed by the xpdf-utils or the poppler-utils packages.

Directory structure

At this step, the /opt/crawler directory content looks like this:

Directory : Description
bin : the Crawl-Anywhere Java libraries (crawler, pipeline, indexer, utils)
config : the configuration files
external : the 3rd-party tools (converters)
install : the installation scripts
lib : the Java libraries Crawl-Anywhere needs
log : the log files (log, output and pid files)
queue : the crawler output queue and the pipeline input queue
queue_solr : the pipeline output queue and the Solr indexer input queue
scripts : the various scripts used to start the crawler, the pipeline and the Solr indexer
tmp : directory for temporary files
web : the crawler web administration and the search web application
webapps : the crawler administration internal web service

Start daemons

Crawl-Anywhere uses various daemons (Solr, MongoDB, Crawler, Pipeline, Indexer, …). These daemons have to be started for Crawl-Anywhere to work.

  • MongoDB

A sample init.d script is provided in /opt/crawler/install/init.d.

Start / stop commands are :
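
Assuming the sample script is installed as /etc/init.d/mongodb (the script name is an assumption):

/etc/init.d/mongodb start
/etc/init.d/mongodb stop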

  • Solr

A sample init.d script is provided in /opt/crawler/install/init.d.

Start / stop commands are :
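
For example, Solr 4.x can be started manually from its example directory (a generic sketch; your own start script may differ, and the process is stopped by killing the Java process):

# start Solr with the embedded Jetty shipped in the distribution
cd /opt/solr/example
java -jar start.jar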

or

A sample init.d script (debian) is provided in /opt/crawler/install/init.d

  • Crawlerws

Start / stop commands are :
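
The exact script names depend on the scripts installed from the “scripts” directory; purely as a hypothetical illustration:

# hypothetical script name, check the scripts directory for the real one
cd /opt/crawler/scripts
./crawlerws.sh start
./crawlerws.sh stop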

  • Crawler

Start / stop commands are :
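
Again as a hypothetical illustration (use the actual script names from the “scripts” directory):

cd /opt/crawler/scripts
./crawler.sh start
./crawler.sh stop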

  • Pipeline

Start / stop commands are :
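
Hypothetical illustration (actual script names may differ):

cd /opt/crawler/scripts
./pipeline.sh start
./pipeline.sh stop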

  • Indexer

Start / stop commands are :
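
Hypothetical illustration (actual script names may differ):

cd /opt/crawler/scripts
./indexer.sh start
./indexer.sh stop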

And now … start crawling web sites

Step 1 : configure the web sites to be crawled in the web admin. See : http://www.crawl-anywhere.com/getting-started/#define-your-first-sources-web-sites-to-be-crawled

Step 2 : learn more about starting and stopping the crawler, pipeline and indexer processes. See : http://www.crawl-anywhere.com/getting-started/#start-the-crawler