IndexerQueueWriter stage

The IndexerQueueWriter stage produces XML documents and pushes them to a file system queue.

This stage is configured in the pipeline stages list definition file (see “Configure the pipeline” at http://www.crawl-anywhere.com/configure-the-pipeline/) with this declaration:

Mapping file

This stage uses a mapping definition file (the “solrmapping” parameter) to write an XML document. The default mapping file matches the Solr schema.xml that is also provided. The mapping file can be changed to match another schema.xml file.

Here is the default solrmapping.xml mapping definition file for Crawl-Anywhere 3.x

Here is the default solrmapping.xml mapping definition file for Crawl-Anywhere 4.x

Types of mapping

value to target mapping

A value to target mapping indicates that the string “a value” is always written to the field “a_field”.

source to target mapping

A source to target mapping indicates that the XML element “source_element” is written to the field “target_field”.

This mapping type supports several options.

Source to target mapping options

the split option

This option splits a source element into a multi-valued target element.

It is an attribute of a mapping element. split="," specifies the character to use as the separator.

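If the source XML document contains several values in the source element, separated by this character, the output contains one value of the target element per item.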
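As a rough illustration of the split option described above, here is a minimal Python sketch. It is not the stage's actual code: the function name `split_value` is hypothetical, and trimming whitespace and dropping empty items are assumptions that the text does not confirm.

```python
from typing import List


def split_value(value: str, separator: str = ",") -> List[str]:
    """Split one source value into the values of a multi-valued target field."""
    # Assumption: surrounding whitespace is trimmed and empty items are dropped.
    return [item.strip() for item in value.split(separator) if item.strip()]


print(split_value("news, sport,politics"))
```

Each item of the returned list would then be written as a separate value of the target field.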

the normalization option

This option specifies that the text has to be normalized. The three available normalizations are “lowercase”, “uppercase” and “date”.

“date” normalization is not really a normalization but a check that rejects the value if it does not look like a date.
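The three normalizations could be sketched as follows in Python. This is only an illustration under assumptions: the function name `normalize` and the accepted date layouts in `DATE_FORMATS` are hypothetical, since the text does not say which date formats the stage recognizes.

```python
from datetime import datetime
from typing import Optional

# Assumption: the date layouts below are examples only; the formats accepted
# by the real stage are not stated in the text.
DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ")


def normalize(value: str, mode: str) -> Optional[str]:
    """Apply a normalization; return None when a "date" value is rejected."""
    if mode == "lowercase":
        return value.lower()
    if mode == "uppercase":
        return value.upper()
    if mode == "date":
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(value, fmt)
                return value  # unchanged: "date" is a check, not a rewrite
            except ValueError:
                continue
        return None  # does not look like a date, so the value is rejected
    raise ValueError(f"unknown normalization: {mode}")
```

Returning `None` for a rejected date is a design choice of this sketch; it lets the caller skip the field rather than write an invalid value.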

the format option

If the source is composed of several elements, the output can be formatted with this option.

It is an attribute of a mapping element. format="[%1$s]%2$s %3$s" describes how to format the output element. “%n$s” corresponds to input element number n.

If the source XML document contains the listed source elements, the output element is built by substituting their values into the format string.
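The “%n$s” placeholders resemble the argument-index syntax of Java's `String.format`. The Python sketch below imitates that substitution for illustration only. The helper `format_positional` and the sample values are hypothetical, and it handles only the “%n$s” form described above.

```python
import re
from typing import Sequence


def format_positional(pattern: str, values: Sequence[str]) -> str:
    """Replace each %n$s placeholder with the n-th input value (1-based)."""

    def substitute(match):
        # group(1) is the placeholder number n; values are indexed from 0.
        return values[int(match.group(1)) - 1]

    return re.sub(r"%(\d+)\$s", substitute, pattern)


# Three hypothetical source element values, in order.
print(format_positional("[%1$s]%2$s %3$s", ["en", "Hello", "world"]))
# prints "[en]Hello world"
```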

the condition option

This option describes a condition that must hold for the mapping to be used. It is a child element of the mapping element. The two possible condition types are “in” and “not_in”.
A “not_in” condition listing the values a, b, c and d means that the source element “xxx” must not have any of these values.

On the other hand, an “in” condition listing the same values means that the source element “xxx” must have one of the values a, b, c or d.
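The two condition types amount to a membership test. A minimal Python sketch, assuming the source value is compared by exact string match (the function name `mapping_applies` is hypothetical):

```python
from typing import Collection


def mapping_applies(value: str, condition_type: str, values: Collection[str]) -> bool:
    """Tell whether a mapping should be used for the given source value."""
    if condition_type == "in":
        return value in values
    if condition_type == "not_in":
        return value not in values
    raise ValueError(f"unknown condition type: {condition_type}")


listed = {"a", "b", "c", "d"}
print(mapping_applies("a", "in", listed))      # True: "a" is one of the listed values
print(mapping_applies("a", "not_in", listed))  # False: "a" is excluded by not_in
```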

The target element (new in Crawl-Anywhere 4.x)

This element can occur several times in the mapping file. It allows alternate mappings to be defined according to the target of the XML file. The target parameter is defined in the crawler administration.

Solrboost file

The solrboost file defines Solr boosts according to rules based on field content and URLs.
If a document matches a rule, an index-time Solr document boost factor is written to the output file.
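To make the idea of rule-based boosting concrete, here is a small Python sketch. Everything in it is an assumption made for illustration: the rule layout (field name, regular expression, boost factor), the sample rules, the function name `document_boost`, and the choice that the first matching rule wins. The actual rule syntax is the one documented in solrboost.xml.

```python
import re
from typing import Dict, Optional, Sequence, Tuple

Rule = Tuple[str, str, float]  # (field name, regular expression, boost factor)

# Hypothetical rules, not taken from the default solrboost.xml file.
RULES: Sequence[Rule] = (
    ("url", r"^https?://www\.example\.com/docs/", 2.0),
    ("title", r"(?i)\brelease notes\b", 1.5),
)


def document_boost(doc: Dict[str, str], rules: Sequence[Rule] = RULES) -> Optional[float]:
    """Return the boost of the first matching rule, or None if no rule matches."""
    for field, pattern, boost in rules:
        # Assumption: a missing field is treated as an empty string.
        if re.search(pattern, doc.get(field, "")):
            return boost
    return None
```

A document that matches no rule gets no boost factor, which corresponds to writing nothing extra in the output file.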

Here is the self-documented default solrboost.xml file.