Web connector

The web connector browses an entire web site according to the source definition (starting URL, maximum depth, …). For each page it reads, the web connector extracts all links to other pages of the web site, reads those pages in turn, and so on.
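The browsing loop described above can be pictured as a breadth-first traversal bounded by the maximum depth. The following is only a minimal sketch, not the connector's actual code: the class name, the crawl method and the in-memory linksByPage map (which stands in for "read the page and extract its links") are hypothetical, and it assumes that the starting URL is at depth 0.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DepthLimitedCrawl {

    /**
     * Visits pages breadth-first from startUrl and follows links up to maxDepth.
     * linksByPage is a hypothetical stand-in for fetching a page and extracting its links.
     */
    static List<String> crawl(String startUrl, int maxDepth, Map<String, List<String>> linksByPage) {
        List<String> visited = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>(); // also records which URLs were already seen
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(startUrl, 0); // assumption: the starting URL is at depth 0
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url);
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // read the page, but do not follow its links any further
            }
            for (String link : linksByPage.getOrDefault(url, List.of())) {
                if (!depthOf.containsKey(link)) { // skip pages that are already queued or read
                    depthOf.put(link, depth + 1);
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Tiny made-up site used only for illustration.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/c", "/"),
            "/c", List.of("/d"));
        System.out.println(crawl("/", 1, site));
    }
}
```

Keeping track of the URLs already seen is what prevents the loop from reading the same page twice when pages link back to each other.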

Extracting links from a web page is not always easy because a page can contain various types of links: standard http links, redirections, frames, iframes and javascript links. Javascript links are the most difficult to handle because they are different for each web site. The web connector is able to execute external scripts that analyse the page content in order to extract these javascript links. A script works only for a specific web site.
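To illustrate why javascript links need special handling, here is a minimal sketch of how the other link types can be found with generic patterns. It is not the connector's implementation: the class name, the method names and the regular expressions are assumptions made for this example, and a real crawler would more likely rely on an HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StaticLinkExtractor {

    // href/src attributes of <a>, <frame> and <iframe> tags (quoted values only).
    private static final Pattern HREF_OR_SRC = Pattern.compile(
        "<(?:a|frame|iframe)\\b[^>]*?\\b(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // Target of a <meta http-equiv="refresh" content="0; url=..."> redirection.
    private static final Pattern META_REFRESH = Pattern.compile(
        "<meta\\b[^>]*http-equiv\\s*=\\s*[\"']refresh[\"'][^>]*content\\s*=\\s*[\"'][^\"']*?url\\s*=\\s*([^\"'>\\s]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the distinct link targets found in the page, in order of discovery. */
    static List<String> extractLinks(String page) {
        Set<String> links = new LinkedHashSet<>();
        collect(HREF_OR_SRC, page, links);
        collect(META_REFRESH, page, links);
        return new ArrayList<>(links);
    }

    private static void collect(Pattern pattern, String page, Set<String> links) {
        Matcher m = pattern.matcher(page);
        while (m.find()) {
            String target = m.group(1).trim();
            // javascript: pseudo-links cannot be resolved generically, so they are skipped here.
            if (!target.toLowerCase().startsWith("javascript:")) {
                links.add(target);
            }
        }
    }

    public static void main(String[] args) {
        // Made-up page content used only for illustration.
        String page = "<a href=\"/news.html\">News</a>"
            + "<iframe src=\"/frame.html\"></iframe>"
            + "<meta http-equiv=\"refresh\" content=\"0; url=/moved.html\">"
            + "<a href=\"javascript:openPage('x.html')\">X</a>";
        System.out.println(extractLinks(page));
    }
}
```

The last link of the sample page is ignored by this generic code: only a script that knows what the (hypothetical) openPage function does on that particular site can turn it into a real URL.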

Here is a sample script for the www.cohor.org web site.

This is in fact an XML file that embeds one or more scripts. It contains the following elements.

The server element

The server element lists the web domains the XML file applies to. The server element appears only once in the XML file.

The url element

Multiple url elements are possible, and each one contains an actual script. The “match” attribute of the url element defines the part of the web site the script applies to.

The script element

The script element will process the web page content in order to provide an array of http links.

The “engine” attribute specifies the scripting engine to use for this script. The “action” attribute specifies the action the script was written for. In this case it is always “links”.

The “page” variable contains the HTML content of the web page.

The script has to create and populate an array named “links”.
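The contract between the crawler and a script (“page” as input, “links” as output) can be tried out with the standard javax.script API. The code below is a minimal sketch under stated assumptions, not the connector's actual code: the class name, the runLinksScript method, the sample page and the openPage function it refers to are all hypothetical, and the type of the returned “links” object depends on the scripting engine.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

class LinksScriptRunner {

    /**
     * Evaluates a script with the page content bound to "page" and returns
     * whatever the script stored in "links", or null if the engine is not available.
     */
    static Object runLinksScript(String engineName, String script, String page) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return null; // engine not installed or not on the classpath
        }
        engine.put("page", page); // input variable read by the script
        try {
            engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed: " + e.getMessage(), e);
        }
        return engine.get("links"); // output variable created by the script
    }

    public static void main(String[] args) {
        // Hypothetical page and script: collect the argument of each openPage('...') call.
        String page = "<a href=\"javascript:openPage('a.html')\">A</a>";
        String script =
              "var links = [];\n"
            + "var re = /openPage\\('([^']+)'\\)/g;\n"
            + "var m;\n"
            + "while ((m = re.exec(String(page))) !== null) { links.push(m[1]); }\n";
        Object links = runLinksScript("javascript", script, page);
        System.out.println(links == null ? "No 'javascript' engine available" : "links = " + links);
    }
}
```

Recent Java versions no longer bundle a JavaScript engine, which is why the sketch returns null instead of failing when the requested engine cannot be found.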

You can test and debug your script with the “tools_test_scripts.sh” script available in the “scripts” directory.

Various scripting languages can be used (rhino, groovy, …). The appropriate scripting engine has to be installed and configured (the engine must be accessible through the classpath of the crawler process).

Here is the description of “Scripting for the Java Platform”: http://docs.oracle.com/javase/7/docs/technotes/guides/scripting/

In the scripts directory, the “tools_list_script_engines.sh” script lists all the installed script engines.
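For reference, the script engines visible to a Java process can also be listed directly with the javax.script API. This is a minimal sketch and not the content of “tools_list_script_engines.sh”; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

class ListScriptEngines {

    /** Returns one descriptive line per script engine found on the classpath. */
    static List<String> describeEngines() {
        List<String> lines = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            lines.add(factory.getEngineName() + " " + factory.getEngineVersion()
                + " (" + factory.getLanguageName() + "), names=" + factory.getNames());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> engines = describeEngines();
        if (engines.isEmpty()) {
            System.out.println("No script engine found on the classpath");
        }
        for (String line : engines) {
            System.out.println(line);
        }
    }
}
```

The values reported by getNames() are the short names accepted when an engine is looked up by name, so they are the natural candidates for the “engine” attribute, although this mapping is an assumption here.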