DocTextExtractor stage

This stage extracts the text from the original document (HTML, PDF, DOC, …). It relies on several libraries (Tika, Jericho, …) and command-line tools (pdftotext, catdoc, …) to read these formats.

In addition to being declared in the pipeline stage list (see “Configure the pipeline”), this stage can be customized to extract only the relevant text from an HTML page. It can execute external scripts that analyse the page content and extract the title, the relevant text (without headers, footers, menus, …) and optionally a date. Each script applies to one specific web site.

Here is a sample script for the www.emploisdelafamille-formation.fr web site.

This is in fact an XML file that embeds one or more scripts. It contains the following elements.

The server element

The server element lists the web domains the XML file applies to. There is only one server element per XML file.

The url element

An XML file can contain several url elements, and each one holds an actual script. The “match” attribute of a url element defines the part of the web site its script applies to.
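The way a “match” value could scope a script to part of a site can be sketched as follows. This is only an illustration under several assumptions that the text above does not confirm: that “match” is a regular expression, that it is tested against the whole URL, and that the first matching url element wins. The `urlRules` array, its `name` field, the URL paths and the `selectRule` function are hypothetical names introduced for the example.

```javascript
// Hypothetical in-memory view of the url elements of one XML file.
// Assumption: each "match" value is a regular expression.
var urlRules = [
    { match: ".*/formations/.*", name: "training pages" },
    { match: ".*", name: "fallback" }
];

// Returns the first rule whose pattern matches the whole URL, or null.
// Assumption: rules are tried in document order and the first hit wins.
function selectRule(rules, url) {
    for (var i = 0; i < rules.length; i++) {
        // Anchor the pattern so that it must cover the entire URL.
        var re = new RegExp("^(?:" + rules[i].match + ")$");
        if (re.test(url)) {
            return rules[i];
        }
    }
    return null;
}
```

With these assumptions, a narrow pattern placed before a catch-all pattern lets one file carry a dedicated script for one section of the site and a generic script for everything else.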

The script element

The script element contains the code that parses the web page content and extracts the title, the text and, optionally, a date.

The “engine” attribute names the scripting engine used to run the script. The “action” attribute names the action the script was written for. In this case, it is “parse”.

The “page” variable contains the HTML content of the web page.

The script has to create and populate a three-element array named “parsedData”. The first element contains the title, the second element contains a date (or an empty value) and the third element contains the clean text.
You can test and debug your script with the “tools_test_scripts.sh” script available in the “scripts” directory.
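The contract described above (a “page” input and a three-element “parsedData” output) can be sketched in JavaScript as follows. This is a minimal illustration, not the sample script shipped for any site. It assumes that the relevant text sits in a `<div id="content">` block without nested div elements and that the date appears as dd/mm/yyyy; both are assumptions to adapt to the target site. The helpers `firstMatch` and `stripTags` and the wrapper `parsePage` are hypothetical names. In a real script the engine supplies “page” directly, so the wrapper exists only to let the sketch run on its own.

```javascript
// Hypothetical helper: first capture group of `regex` in `html`, or "".
function firstMatch(html, regex) {
    var m = regex.exec(html);
    return m ? m[1] : "";
}

// Hypothetical helper: drops scripts, styles and tags, then
// collapses whitespace and trims the result.
function stripTags(html) {
    return html
        .replace(/<script[\s\S]*?<\/script>/gi, " ")
        .replace(/<style[\s\S]*?<\/style>/gi, " ")
        .replace(/<[^>]+>/g, " ")
        .replace(/\s+/g, " ")
        .replace(/^ | $/g, "");
}

// `page` is a parameter here only so that the sketch is standalone.
function parsePage(page) {
    var title = stripTags(firstMatch(page, /<title[^>]*>([\s\S]*?)<\/title>/i));
    // Assumption: the relevant text is inside <div id="content">,
    // with no nested <div> (the non-greedy match stops at the first </div>).
    var body = firstMatch(page, /<div id="content">([\s\S]*?)<\/div>/i);
    // Assumption: the first dd/mm/yyyy string in the page is the date.
    var date = firstMatch(page, /(\d{2}\/\d{2}\/\d{4})/);

    var parsedData = [];
    parsedData[0] = title;           // title
    parsedData[1] = date;            // date, or "" when none is found
    parsedData[2] = stripTags(body); // clean text
    return parsedData;
}
```

Regular expressions keep the sketch free of dependencies, but they are fragile on real HTML. A script for a production site would probably need more robust extraction rules.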

Various scripting languages can be used (rhino, groovy, …). The corresponding scripting engine has to be installed and configured, that is, it must be accessible through the classpath of the crawler process.

Here is the description of “Scripting for the Java Platform”: http://docs.oracle.com/javase/7/docs/technotes/guides/scripting/

In the “scripts” directory, the “tools_list_script_engines.sh” script lists all the installed script engines.