The main features of Crawl Anywhere are:
- be able to crawl any type of source (web, databases, file systems, CMS, …). Each source type has its own "source connector". Today, a web source connector is available
- be able to do anything with crawled items (web pages, CMS documents, database records, …). Crawled items are handled by a "document handler". Today, the XmlQueueWriter document handler is available: it writes every HTML page or binary file (PDF, DOC, …) as an XML file into a file-system queue, to be processed by the document processing pipeline and then indexed in Solr
- be multi-threaded (crawl several sources, and several documents per source, at the same time)
- be highly configurable, allowing definition of:
  - multiple start URLs per source (web site)
  - start URLs that can be any web pages, RSS feeds, or sitemaps
  - the number of sources crawled simultaneously
  - the number of items crawled simultaneously per source
  - stop / resume of a crawl
  - caching of crawled items
  - recrawl periodicity rules based on item type (HTML, PDF, …)
  - item type inclusion / exclusion rules
  - item URL inclusion / exclusion / strategy rules
  - depth rules
  - the HTML cleaning algorithm
  - scripting for advanced HTML cleaning
- be compatible with both Windows and Linux
- respect robots.txt files
- provide an administration and monitoring web interface (see screenshots)
- be easily extensible (with new source connectors, document handlers, and scripting)
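To make the extension points above more concrete, here is a minimal sketch in Java of how a source connector and a document handler could fit together. All names here (CrawledItem, SourceConnector, DocumentHandler, CountingHandler) are hypothetical illustrations, not the actual Crawl Anywhere API:

```java
import java.util.List;

// Hypothetical item produced by a connector (not the real Crawl Anywhere classes).
class CrawledItem {
    final String url;
    final String mimeType;
    final byte[] content;
    CrawledItem(String url, String mimeType, byte[] content) {
        this.url = url; this.mimeType = mimeType; this.content = content;
    }
}

// A source connector knows how to fetch items from one source type
// (web, database, file system, CMS, ...).
interface SourceConnector {
    List<CrawledItem> crawl(String startUrl);
}

// A document handler decides what to do with each crawled item,
// e.g. write it as an XML file into a file-system queue.
interface DocumentHandler {
    void handle(CrawledItem item);
}

// Minimal demo handler that just counts HTML items.
class CountingHandler implements DocumentHandler {
    int htmlCount = 0;
    public void handle(CrawledItem item) {
        if ("text/html".equals(item.mimeType)) htmlCount++;
    }
}

public class ConnectorDemo {
    public static void main(String[] args) {
        CountingHandler handler = new CountingHandler();
        handler.handle(new CrawledItem("http://example.com/", "text/html", new byte[0]));
        handler.handle(new CrawledItem("http://example.com/a.pdf", "application/pdf", new byte[0]));
        System.out.println(handler.htmlCount); // prints 1
    }
}
```

In this design, adding a new source type or output target only means implementing one interface, which is the spirit of the connector/handler separation described above.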
The crawler is developed in Java. A MySQL database stores the source parameters and a reference for each crawled item (crawl status, last crawl time, next crawl time, MIME type, …).
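As an illustration of how the stored metadata can drive recrawl scheduling, the sketch below computes a next crawl time from a last crawl time and per-type periodicity rules. The class name, rule values, and default period are assumptions for the example, not values taken from Crawl Anywhere:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: derive "next crawl time" from "last crawl time" plus a
// per-MIME-type recrawl period. All periods here are hypothetical.
public class RecrawlScheduler {
    // Recrawl period in hours per MIME type (illustrative defaults).
    private final Map<String, Integer> periodHours = new HashMap<>();

    public RecrawlScheduler() {
        periodHours.put("text/html", 24);        // recrawl HTML pages daily
        periodHours.put("application/pdf", 168); // recrawl PDFs weekly
    }

    // nextCrawlTime = lastCrawlTime + period for the item's type.
    public long nextCrawlTime(long lastCrawlMillis, String mimeType) {
        int hours = periodHours.getOrDefault(mimeType, 72); // fallback: 3 days
        return lastCrawlMillis + hours * 3600_000L;
    }

    public static void main(String[] args) {
        RecrawlScheduler s = new RecrawlScheduler();
        // An HTML page crawled at t=0 is due again 24 h later.
        System.out.println(s.nextCrawlTime(0L, "text/html"));       // 86400000
        System.out.println(s.nextCrawlTime(0L, "application/pdf")); // 604800000
    }
}
```

A real scheduler would read these values back from the MySQL item references and only re-fetch items whose next crawl time has passed.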
The administration and monitoring interface allows administering sources and monitoring the current crawl processes. Here are the three main screens.
Crawl monitoring:
Sources list:
Source details:
The crawler components.
A concrete example of crawler usage, with the pipeline and the Solr indexer, to crawl and index web sites.