What is Crawl Anywhere?
Crawl Anywhere allows you to build vertical search engines. Crawl Anywhere includes:
- a Web Crawler with a powerful Web user interface
- a document processing pipeline
- a Solr indexer
- a full-featured and customizable search application
You can see a typical use of all components in this diagram.
Why was Crawl Anywhere created?
Crawl Anywhere was originally developed to index 5,400 web sites (more than 10,000,000 pages) in Apache Solr for the Hurisearch search engine: http://www.hurisearch.org/. During this project, various crawlers were evaluated (Heritrix, Nutch, …), but one key feature was missing: a user-friendly web interface to manage the web sites to be crawled together with their specific crawl rules. Mainly for this reason, we decided to develop our own Web crawler.
Why did we choose the name "Crawl Anywhere"?
This name may appear a little overstated, but crawling any source type (Web, database, CMS, …) is a real objective, and Crawl Anywhere was designed to make it easy to implement new source connectors.
What is a Web Crawler?
A web crawler is a program that tries to discover and read all the HTML pages and documents (PDF, Office, …) on web sites in order, for instance, to index their content and build a search engine. Wikipedia provides a good description of a Web crawler: http://en.wikipedia.org/wiki/Web_crawler.
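To make the "discover" step concrete, here is a minimal sketch of how a crawler can find the links on one HTML page. It is only an illustration written with the Python standard library, not the Crawl Anywhere implementation (which is developed in Java); the names `LinkExtractor` and `extract_links` and the example.org URLs are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs found in the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop #fragments.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated list of links discovered in one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used only for the demonstration.
page = '<a href="/about.html">About</a> <a href="docs/guide.pdf#p2">Guide</a>'
print(extract_links("http://www.example.org/index.html", page))
```

A real crawler would repeat this step for every discovered URL, keeping track of the pages already visited.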
What is Apache Solr?
Apache Solr is the popular, blazing fast open source enterprise search platform from the Apache Foundation Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.
See the complete Solr feature list for more details.
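As an illustration of the REST-like HTTP API, the sketch below builds the URL of a search request with a JSON response and an optional facet. The parameters `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters; the host, port, core name (`crawl`) and field name (`language`) are assumptions for this example, and the exact request path depends on the Solr version and on how it is deployed.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10, facet_field=None):
    """Build the URL of a Solr search request returning JSON."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to also return value counts for one field (faceted search).
        params += [("facet", "true"), ("facet.field", facet_field)]
    # Assumed path layout: <base>/solr/<core>/select
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical server address, core name and field name.
url = build_select_url("http://localhost:8983", "crawl", "title:solr",
                       facet_field="language")
print(url)
```

Any HTTP client, in any language, can then send a GET request to this URL and parse the JSON response.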
What are the components of Crawl Anywhere?
The Web crawler
The main features of the crawler are:
- be able to crawl any source type (web, database, file system, CMS, …). Each source type should have its own "source connector". Today, we have a Web source connector.
- be able to do anything with a crawled item (web page, CMS document, database record, …). Crawled items are handled by a "document handler". Today, we have the XmlQueueWriter document handler. This document handler writes each HTML page or binary file (PDF, DOC, …) as an XML file in a file system queue, so that it can be processed by the document processing pipeline and then indexed in Solr.
- be multi-threaded (crawl several sources, and several documents per source, at the same time)
- be highly configurable, allowing definition of:
  - multiple start URLs per source (web site)
  - start URLs of any kind: web site pages, RSS feeds, sitemaps
  - number of sources crawled simultaneously
  - number of items crawled simultaneously per source
  - stop / resume crawl
  - cache crawled items
  - recrawl periodicity rules based on item type (HTML, PDF, …)
  - item type inclusion / exclusion rules
  - item URL inclusion / exclusion / strategy rules
  - depth rule
  - HTML cleaning algorithm
  - scripting for advanced HTML cleaning
- be compatible with both Windows and Linux
- respect robots.txt files
- provide an administration and monitoring web interface (see screen shots)
- be easily extensible (with source connectors, document handlers and scripting)
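Several of the rules listed above (robots.txt, depth rule, URL inclusion / exclusion rules) come down to one decision per URL: crawl it or skip it. The following sketch shows one possible way to combine them. The actual rule syntax used by Crawl Anywhere is not described here, so the regular-expression rules, the `should_crawl` function, the user agent name and the example.org URLs are all assumptions for illustration. Only `urllib.robotparser` is a real Python standard library module.

```python
import re
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_crawl(url, depth, robots, *, user_agent, max_depth, include, exclude):
    """Apply the depth rule, robots.txt, then exclusion and inclusion rules."""
    if depth > max_depth:
        return False
    if not robots.can_fetch(user_agent, url):
        return False
    path = urlsplit(url).path
    if any(re.search(pattern, path) for pattern in exclude):
        return False
    # An empty inclusion list means "everything not excluded is accepted".
    return not include or any(re.search(pattern, path) for pattern in include)


# Hypothetical robots.txt content and crawl rules for one source.
robots = load_robots("User-agent: *\nDisallow: /private/\n")
rules = dict(user_agent="ExampleBot", max_depth=2,
             include=[r"\.(html|pdf)$"], exclude=[r"^/archive/"])

print(should_crawl("http://www.example.org/news/a.html", 1, robots, **rules))
print(should_crawl("http://www.example.org/private/a.html", 1, robots, **rules))
```

Checking the cheapest rule first (depth) and the regular expressions last keeps the per-URL cost low when many links are discovered.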
The crawler is developed in Java. A MySQL database is used in order to store source parameters and each crawled item reference (crawl status, last crawl time, next crawl time, mime type, …).
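To illustrate what such an item reference could look like and how it supports recrawl periodicity rules, here is a small sketch. The field names mirror the ones listed above, but the actual MySQL schema is not documented in this text, so the `CrawledItem` class, the status values and the recrawl periods are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical recrawl periods per item type; real values are configured per source.
RECRAWL_PERIODS = {
    "text/html": timedelta(days=1),
    "application/pdf": timedelta(days=30),
}
DEFAULT_PERIOD = timedelta(days=7)


@dataclass
class CrawledItem:
    """In-memory stand-in for one row of the crawled item table (assumed layout)."""
    url: str
    mime_type: str
    crawl_status: str = "pending"
    last_crawl_time: Optional[datetime] = None
    next_crawl_time: Optional[datetime] = None


def mark_crawled(item, now):
    """Record a successful crawl and schedule the next one from the item type."""
    period = RECRAWL_PERIODS.get(item.mime_type, DEFAULT_PERIOD)
    item.crawl_status = "done"
    item.last_crawl_time = now
    item.next_crawl_time = now + period
    return item


def is_due(item, now):
    """An item is due if it was never crawled or its next crawl time has passed."""
    return item.next_crawl_time is None or item.next_crawl_time <= now


now = datetime(2024, 1, 1, 12, 0)
item = mark_crawled(CrawledItem("http://www.example.org/a.pdf", "application/pdf"), now)
print(item.next_crawl_time, is_due(item, now + timedelta(days=31)))
```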
The document processing pipeline
The pipeline processes HTML pages and binary documents (PDF, DOC, …) crawled by the Web crawler. Various configurable stages transform and enrich these documents until they are pushed to the Solr indexer. Available stages are:
- Text extractor (from HTML, PDF, Office format)
- Summary extractor
- Field mapping
- Text language detection
- Title extractor
- Content Type filter
- Writer of ready-to-index Solr XML files
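The idea of a configurable chain of stages can be sketched as a list of functions, each taking a document and returning it enriched, or returning nothing to drop it. This is a simplified illustration, not the Crawl Anywhere stage API: the function names, the document fields (`raw`, `content_type`, `title`, `text`, `summary`) and the regular-expression based extraction are assumptions, and a real text extractor would need a proper parser for HTML, PDF and Office formats.

```python
import re

ACCEPTED_TYPES = {"text/html"}  # hypothetical content type filter setting


def content_type_filter(doc):
    """Drop documents whose content type is not accepted."""
    return doc if doc["content_type"] in ACCEPTED_TYPES else None


def title_extractor(doc):
    """Take the title from the <title> element, if any."""
    match = re.search(r"<title>(.*?)</title>", doc["raw"], re.IGNORECASE | re.DOTALL)
    doc["title"] = match.group(1).strip() if match else ""
    return doc


def text_extractor(doc):
    """Crude HTML-to-text conversion: remove <head>, strip tags, squeeze spaces."""
    body = re.sub(r"<head.*?</head>", " ", doc["raw"], flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r"<[^>]+>", " ", body)
    doc["text"] = " ".join(text.split())
    return doc


def summary_extractor(doc, max_chars=40):
    """Use the beginning of the extracted text as the summary."""
    doc["summary"] = doc["text"][:max_chars]
    return doc


def run_pipeline(doc, stages):
    """Run the stages in order; stop early if a stage drops the document."""
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


STAGES = [content_type_filter, title_extractor, text_extractor, summary_extractor]
raw = ("<html><head><title> Hello </title></head>"
       "<body><p>Crawl <b>Anywhere</b> demo</p></body></html>")
result = run_pipeline({"content_type": "text/html", "raw": raw}, STAGES)
print(result["title"], "|", result["text"])
```

Because each stage only depends on the document it receives, stages can be reordered, removed or added in the configuration without touching the others.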
The Solr indexer
The Solr indexer reads a queue of XML documents and indexes them. Each XML document contains both the data to be indexed and directives on how to index the data.
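The format of the queued XML documents and of their indexing directives is not detailed here, so the sketch below does not try to reproduce it. Instead, it shows the standard Solr XML update message (`<add><doc><field name="…">`) that an indexer ultimately has to send to Solr. The function name `to_solr_add_xml` and the field names are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(fields):
    """Serialize one document as a Solr XML <add> update message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        # A list value becomes several <field> elements (multi-valued field).
        values = value if isinstance(value, list) else [value]
        for single in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(single)
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names; they must match the fields of the Solr schema in use.
message = to_solr_add_xml({
    "id": "http://www.example.org/a.html",
    "title": "A & B",
    "tag": ["news", "demo"],
})
print(message)
```

Building the message with an XML library rather than string concatenation ensures that characters such as `&` and `<` in the crawled text are escaped correctly.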
The Search application
This ready-to-use, full-featured application lets you search crawled and indexed documents immediately. It can be a starting point for implementing your own specific search interface.
Crawl Anywhere is a feature-rich, powerful crawler. It not only crawls, but also has the tools to shape content to your needs. Furthermore, it comes with its own Solr search engine, but can easily be used for your own Solr implementation. It is hands down the most flexible and easy-to-use crawler for Solr.
Octoweb - The Netherlands