Configure the pipeline

The configuration file

Here is the general structure of the pipeline configuration file

A pipeline is built with one connector and several stages (the stages list).
It is easy to extend the pipeline with new connectors or new stages. This will be described in a separate page of the wiki.

Connectors

Connectors are responsible for reading the data to be processed by the pipeline. A connector reads data from an input repository, builds the original XML file to be processed, and sends it into the pipeline (the stages).
Currently, only one connector is available: the FileQueueConnector.

FileQueueConnector

This connector reads XML files from a directory (the queue) and pushes them into the pipeline.
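To make the queue idea concrete, here is a minimal Python sketch of a directory-based reader. It is not the FileQueueConnector's code: the function name `read_queue`, the `*.xml` file pattern, the name-based ordering, and the choice to skip malformed files are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, Tuple
import xml.etree.ElementTree as ET


def read_queue(queue_dir: str) -> Iterator[Tuple[Path, ET.Element]]:
    """Yield (path, parsed root element) for each XML file in the queue directory."""
    # Sorting by file name gives a stable order (assumption: the real
    # connector may use a different ordering).
    for path in sorted(Path(queue_dir).glob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            # Assumed behaviour: files that are not well-formed XML are skipped.
            continue
        yield path, root
```

Each yielded root element would then be handed to the first stage of the pipeline.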

Stages

Several stages are available; they are listed below.
Note that for each stage, the “position” attribute specifies where the stage runs in the pipeline.
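The effect of the “position” attribute can be sketched in a few lines of Python. The `StageConfig` class and `order_stages` function are hypothetical names used only for this illustration; the sketch assumes that stages run in ascending order of position, whatever order they are declared in.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageConfig:
    name: str       # stage name, e.g. "ContentTypeFilter"
    position: int   # value of the "position" attribute


def order_stages(stages: List[StageConfig]) -> List[str]:
    """Return stage names in execution order (ascending position, assumed)."""
    return [s.name for s in sorted(stages, key=lambda s: s.position)]
```

For example, a stage declared last with position 0 would still run first under this assumption.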

ContentTypeFilter

ContentTypeFilter removes documents from the pipeline based on their content type.
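A rough Python sketch of this kind of filter follows. It assumes the content type is stored in an “item_contenttype” element and that the filter works from a list of rejected types; the actual stage may use different element names or an allow list instead. `filter_by_content_type` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Set


def filter_by_content_type(
    doc: ET.Element,
    rejected: Set[str],
    field: str = "item_contenttype",  # assumed element name
) -> Optional[ET.Element]:
    """Return the document unchanged, or None if its content type is rejected."""
    value = (doc.findtext(field) or "").strip().lower()
    # Returning None signals that the document leaves the pipeline.
    return None if value in rejected else doc
```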

DocTextExtractor

DocTextExtractor converts a document (HTML, PDF, …) stored in an element of the XML file into text and stores this text in a new element of the XML file. See “DocTextExtractor stage” for further details.

MetaExtractor

MetaExtractor analyses HTML page content and creates custom metadata as new elements in the XML file processed by the pipeline. See “MetaExtractor stage” for further details.

DocSummaryExtractor

DocSummaryExtractor builds a short summary from the beginning of the text contained in the source element.
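One plausible way to build such a summary is to keep the first characters of the text and cut at a word boundary. The Python sketch below illustrates that idea only; the `summarize` name, the 200-character default, and the trailing "..." are assumptions, not the stage's documented behaviour.

```python
def summarize(text: str, max_chars: int = 200) -> str:
    """Return the beginning of the text, cut at a word boundary."""
    # Collapse runs of whitespace so the length limit applies to visible text.
    text = " ".join(text.split())
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the limit falls inside a word, back up to the previous space.
    if text[max_chars] != " " and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + "..."
```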

FieldMapping

Based on rules defined in a mapping file, this stage maps an element of the XML file to another element.

In this sample, the FieldMapping stage maps the content type from the “item_contenttype” element to the “item_contenttype_root” element. With the following mapping definition file, both “application/rss+xml” and “application/atom+xml” content types will be mapped to “application/xml”.
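The snippet below is not the mapping definition file itself; it is a minimal Python sketch of the rule described above, with the mapping held in a plain dictionary. The `map_field` name and the decision to pass unmapped values through unchanged are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# The two rules described in the text, expressed as a dictionary.
CONTENT_TYPE_RULES: Dict[str, str] = {
    "application/rss+xml": "application/xml",
    "application/atom+xml": "application/xml",
}


def map_field(doc: ET.Element, source: str, target: str, rules: Dict[str, str]) -> None:
    """Write the mapped value of the source element into the target element."""
    value = (doc.findtext(source) or "").strip()
    # Assumption: values without a rule are copied unchanged.
    mapped = rules.get(value, value)
    node = doc.find(target)
    if node is None:
        node = ET.SubElement(doc, target)
    node.text = mapped
```

Calling `map_field(doc, "item_contenttype", "item_contenttype_root", CONTENT_TYPE_RULES)` on a document whose content type is “application/rss+xml” leaves “application/xml” in the “item_contenttype_root” element.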

FieldCopy

Copies the content of one field into another.

In this sample, the FieldCopy stage captures the domain of the URL in the “item_url” element and copies it to the “item_domain” element.
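As an illustration of the capture-and-copy idea, the Python sketch below extracts the host part of a URL with a regular expression and writes it to another element. The pattern, the `copy_field` name, and the handling of non-matching values are assumptions; the stage's real configuration syntax is not shown here. The pattern is deliberately simple and does not handle URLs that embed user credentials.

```python
import re
import xml.etree.ElementTree as ET

# Captures the host between "scheme://" and the next "/", ":", "?" or "#".
DOMAIN_PATTERN = re.compile(r"^[a-z][a-z0-9+.-]*://([^/:?#]+)", re.IGNORECASE)


def copy_field(doc: ET.Element, source: str, target: str, pattern: "re.Pattern[str]") -> bool:
    """Copy the first captured group of the source element into the target element."""
    value = (doc.findtext(source) or "").strip()
    match = pattern.search(value)
    if match is None:
        # Assumption: nothing is written when the pattern does not match.
        return False
    node = doc.find(target)
    if node is None:
        node = ET.SubElement(doc, target)
    node.text = match.group(1)
    return True
```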

TagFilter

Removes irrelevant tags.

Without this stage, all tags defined for a web site in the administration are associated with every page of the site. With this stage, a tag is associated with a page only if the tag appears in the text of the page.
Imagine a web site dedicated to Apple products, for which you define the tags iPad, iPhone, and Apple TV.
Without this stage, all pages of this site are indexed with these 3 tags. With this stage, a page is indexed with iPad only if iPad appears in the text of the page.
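The Apple example can be sketched in Python as follows. The `filter_tags` name is hypothetical, and the matching rules (case-insensitive, whole words only) are assumptions; the actual stage may match tags differently.

```python
import re
from typing import List


def filter_tags(tags: List[str], text: str) -> List[str]:
    """Keep only the tags that appear in the page text."""
    kept = []
    for tag in tags:
        # Match the tag as a whole word or phrase, ignoring case (assumed rule).
        pattern = r"(?<!\w)" + re.escape(tag) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            kept.append(tag)
    return kept
```

With the tags iPad, iPhone, and Apple TV, a page whose text mentions only the iPad keeps the single tag iPad.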

LanguageDetector

The LanguageDetector stage detects the language of the text in an element of the XML file and writes the language code to another element.

IndexerQueueWriter

IndexerQueueWriter writes a ready-to-index XML document (for Solr) to an output directory. This stage uses a mapping definition file in order to write an XML document matching the Solr schema.
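To give an idea of what "matching the Solr schema" involves, the Python sketch below renames pipeline elements to Solr field names and emits them in Solr's XML update format (`<add><doc><field name="...">`). The `to_solr_add` name, the dictionary-based mapping, and the Solr field names in the usage note are assumptions; the stage's own mapping file format is different and is not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def to_solr_add(doc: ET.Element, field_map: Dict[str, str]) -> str:
    """Build a Solr <add> message from a pipeline document.

    field_map maps pipeline element names to Solr field names (hypothetical).
    """
    add = ET.Element("add")
    solr_doc = ET.SubElement(add, "doc")
    for source, solr_name in field_map.items():
        value = doc.findtext(source)
        if value is None:
            # Elements missing from the pipeline document are simply omitted.
            continue
        field = ET.SubElement(solr_doc, "field", {"name": solr_name})
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

For instance, `to_solr_add(doc, {"item_url": "id", "item_domain": "domain"})` would produce one `field` element named `id` and one named `domain`, provided both source elements exist; the resulting string could then be written to a file in the output directory.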

For a detailed description of this stage, see http://www.crawl-anywhere.com/indexerqueuewriter-stage/