The pipeline

What is Simple Pipeline?

Simple Pipeline is open source software for document processing. It manipulates incoming documents in various ways and produces documents directly usable by a downstream application.
The idea for such a piece of software came from the need to manipulate documents produced by Crawl Anywhere before indexing them in Solr. The crawler produces XML documents containing basic information about crawled pages:

  • Web site name
  • Web site country
  • Page URL
  • Page encoding
  • Page depth (in the web site)
  • Page content (html, text, binary PDF, …)
  • Page referrer
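For concreteness, a crawler output document carrying these fields might look like the sketch below. The element and attribute names are purely illustrative assumptions; the actual Crawl Anywhere schema may differ.

```xml
<!-- Hypothetical example; the real Crawl Anywhere output format may differ -->
<document>
  <site name="Example Site" country="us"/>
  <page url="http://www.example.com/about.html"
        encoding="UTF-8"
        depth="2"
        referrer="http://www.example.com/">
    <content type="text/html"><![CDATA[<html>…</html>]]></content>
  </page>
</document>
```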

Before indexing these web pages, we need to manipulate the XML documents produced by the crawler in various ways:

  • Clean up the page (mainly for HTML pages, in order to remove menus, headers, footers, …)
  • Extract text from the page (HTML parsing, PDF conversion, …)
  • Extract a relevant title
  • Detect the language of the page
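To illustrate the kind of transformation involved, here is a minimal sketch of text and title extraction from an HTML page, using only the Python standard library. This is an illustration of the task, not Simple Pipeline's actual code.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text and the <title> of an HTML page."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.title = ""
        self._in_title = False
        self._skip = 0  # depth inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text_parts.append(data.strip())

page = ("<html><head><title>About us</title></head>"
        "<body><h1>About</h1><p>Hello world.</p></body></html>")
extractor = TextExtractor()
extractor.feed(page)
print(extractor.title)                  # About us
print(" ".join(extractor.text_parts))  # About Hello world.
```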

In order to access input documents, Simple Pipeline needs connectors. A connector is responsible for connecting to the system containing the source documents (an input file system queue, a database, …) and sending those documents into the pipeline. The pipeline consists of chained stages (the stage list). Each stage is responsible for one specific task (detect language, extract text, …). The last stage pushes the document to a target application (an output file system queue, a database, …). New connectors and stages, for other input systems or other document transformations, can easily be developed.
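The connector-and-chained-stages architecture described above can be sketched in a few lines of Python. The class names and interfaces here are illustrative assumptions, not Simple Pipeline's actual API.

```python
class Stage:
    """One processing step: receives a document dict and returns it modified."""
    def process(self, doc):
        raise NotImplementedError

class LanguageDetector(Stage):
    """Toy stopword-based language detection, for illustration only."""
    def process(self, doc):
        words = set(doc.get("text", "").lower().split())
        doc["language"] = "fr" if words & {"le", "la", "les"} else "en"
        return doc

class Writer(Stage):
    """Last stage: pushes the document to a target (here, a plain list)."""
    def __init__(self, target):
        self.target = target
    def process(self, doc):
        self.target.append(doc)
        return doc

class ListConnector:
    """Connector: reads source documents and sends each through the stage list."""
    def __init__(self, source, stages):
        self.source = source
        self.stages = stages
    def run(self):
        for doc in self.source:
            for stage in self.stages:
                doc = stage.process(doc)

output = []
connector = ListConnector(
    source=[{"url": "http://example.com", "text": "Hello world"}],
    stages=[LanguageDetector(), Writer(output)],
)
connector.run()
print(output[0]["language"])  # en
```

The design choice sketched here mirrors the description above: the connector only knows how to fetch documents, each stage only knows its one task, and the final stage is the only one aware of the target system.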

Available stages are:

  • Text extractor (from HTML, PDF, Office format)
  • Summary extractor
  • Field mapping
  • Text language detection
  • Title extractor
  • Content Type filter
  • Writer of Solr-ready XML files for indexing

The pipeline (connectors, stages, parameters) is described in an XML description file.
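Such a description file could look like the following sketch. The element names, attribute names, and class names are assumptions made for illustration; the actual Simple Pipeline schema may differ.

```xml
<!-- Hypothetical pipeline description; the real Simple Pipeline schema may differ -->
<pipeline>
  <connector class="FileSystemQueueConnector">
    <param name="inputDir" value="/data/crawler/queue"/>
  </connector>
  <stages>
    <stage class="TextExtractor"/>
    <stage class="LanguageDetector"/>
    <stage class="TitleExtractor"/>
    <stage class="SolrXmlWriter">
      <param name="outputDir" value="/data/solr/inbox"/>
    </stage>
  </stages>
</pipeline>
```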

Architecture