MetaExtractor stage

This stage is responsible for extracting metadata from crawled HTML pages.

In addition to its declaration in the pipeline stage list (see "Configure the pipeline"), this stage can be customized in order to extract the relevant metadata from an HTML page. It is able to execute external scripts that analyse the page content, extract text from the HTML and add custom elements to the XML document processed by the pipeline. A script works only for a specific web site.

A sample script exists for the www.saflii.org web site.

Such a script is in fact an XML file that embeds one or more scripts, and several elements can be seen in this file.
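
As an illustration of the layout only (a minimal sketch, not the actual www.saflii.org script; the domain, the "match" value, the attribute carrying the domain on the server element and the nesting of the elements are assumptions), such a file could look like the following:

  <server host="www.example.org">
    <url match="/cases/.*">
      <script engine="rhino" action="parse">
        <![CDATA[
          ... script code, see "The script element" below ...
        ]]>
      </script>
    </url>
  </server>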

The server element

The server element lists the web domains the XML file applies to. There is only one server element in the XML file.

The url element

Multiple url elements are possible, and they contain the actual scripts. The "match" attribute of the url element defines the scope within the web site the script applies to.

The script element

The script element contains the code that processes the web page content in order to parse it and extract metadata.

The "engine" attribute provides the scripting engine to be used for this script. The "action" attribute provides for which action this script was written. In this case, it is "parse".

The "page" variable contains the html contains of the web page.

The script has to create a string array. Each item in the array holds a metadata name and its value, in the form "meta_name: meta_value".
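
For instance, a parse script written for the rhino (JavaScript) engine could look like the sketch below. The regular expression and the "dc.title" metadata name are only examples, and the array is assumed to be handed back to the crawler as the script's last evaluated expression; the exact convention depends on the stage.

  <script engine="rhino" action="parse">
    <![CDATA[
      // "page" contains the HTML content of the web page.
      var html = String(page);
      var metadata = [];

      // Hypothetical extraction: use the <title> tag as the document title.
      var title = html.match(/<title>([^<]*)<\/title>/i);
      if (title) {
        metadata.push("dc.title: " + title[1]);
      }

      // Each item follows the "meta_name: meta_value" convention.
      metadata;
    ]]>
  </script>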

You can test and debug your script with the "tools_test_scripts.sh" script available in the "scripts" directory.

Various scripting languages can be used (rhino, groovy, …). The appropriate scripting engine has to be installed and configured (the engine must be accessible through the classpath of the crawler process).

"Scripting for the Java Platform" is described here: http://docs.oracle.com/javase/7/docs/technotes/guides/scripting/

The "tools_list_script_engines.sh" script in the "scripts" directory lists all the installed script engines.
