Configure a Web site to be crawled

The displayed page shows all Web sites configured to be crawled.

You can add a new Web site to be crawled or edit the settings of an existing Web site.

The following image shows all the parameters for a configured Web site.

Web site crawl settings

We will explain these parameters individually.

The source name

This parameter is mandatory.

The URLs

One or more URLs used by the crawler to crawl the Web site. This parameter is mandatory.

In this sample, 2 URLs are provided to the crawler. Each URL is associated with various parameters. Click on the “Add Url” button to add a URL, or on the edit icon to the right of an existing URL to edit it. The following dialog box is displayed.

The possible mode parameters are:

Web site: the crawler crawls the Web site starting at this URL, down to a predefined depth.
Links page: the crawler crawls the pages linked from this page. The crawler won’t follow links in subpages.
RSS feed: the crawler crawls the pages linked from this feed. The crawler won’t follow links in subpages.

URL is the starting URL used for crawling.

Allow other domains indicates whether the crawler may crawl pages that are not hosted on the domain of the starting URL.

Only during first crawl indicates whether this starting URL is used only during the initial crawl of the Web site or at each re-crawl. Typically, when crawling a blog, the homepage is set to Web site mode and crawled only during the initial crawl, while its RSS feed is crawled at each re-crawl. This allows very efficient crawling of the blog.
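As a rough illustration, here is a minimal sketch of how such a starting-URL entry could be represented; the names (StartUrl, Mode, only_first_crawl and so on) are hypothetical and not the product’s actual data model.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    WEB_SITE = "web site"      # follow links down to a predefined depth
    LINKS_PAGE = "links page"  # crawl linked pages, don't follow links in subpages
    RSS_FEED = "rss feed"      # crawl pages linked from the feed items

@dataclass
class StartUrl:
    url: str
    mode: Mode
    allow_other_domains: bool = False
    only_first_crawl: bool = False

# Typical blog setup: homepage crawled once, RSS feed at each re-crawl.
seeds = [
    StartUrl("http://blog.example.com/", Mode.WEB_SITE, only_first_crawl=True),
    StartUrl("http://blog.example.com/feed", Mode.RSS_FEED),
]
```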

Aliases

A Web site can use several domains. By default, the crawler refuses to follow links that are not hosted on the domain of the starting URLs. In order to treat all these domains as the same Web site, list all possible domains here.
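To illustrate the effect of aliases, here is a minimal sketch of such a domain check; the helper name is hypothetical and the real crawler’s logic may differ.

```python
from urllib.parse import urlparse

# Domains configured as aliases of the same Web site (illustrative values).
ALIASES = {"www.example.com", "example.com", "example.org"}

def is_same_site(link: str) -> bool:
    """Accept a link only if its host is the site's domain or a listed alias."""
    host = urlparse(link).hostname or ""
    return host in ALIASES

print(is_same_site("http://example.org/about"))  # True: alias domain, followed
print(is_same_site("http://other.com/page"))     # False: link is refused
```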

Collections

Sources can be grouped under “collections”. Indicate here one or more collections the source belongs to.

Tags

Sources can be associated with tags. Tags help describe the source. Indicate here one or more tags the source is associated with.

Country and Languages

Select the country where the Web site is located and, more importantly, select the language(s) of the documents on this Web site.

A correct language indication helps optimize the indexing process.

Crawl child pages only

Set to yes in order to crawl only child pages of the starting page.

If the starting page is “http://www.server.com/news/index.html”, only pages whose URLs start with “http://www.server.com/news/” will be crawled.
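A minimal sketch of this test, using the example URL above (the helper name is hypothetical):

```python
# Keep a URL only if it lives under the directory of the starting page.
start = "http://www.server.com/news/index.html"
prefix = start.rsplit("/", 1)[0] + "/"   # "http://www.server.com/news/"

def is_child_page(url: str) -> bool:
    return url.startswith(prefix)

print(is_child_page("http://www.server.com/news/2013/item.html"))  # True
print(is_child_page("http://www.server.com/sports/item.html"))     # False
```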

Ignored fields in URL

Some parameters in a page’s URL can change frequently (e.g. a session id). At each new crawl, the crawler would then consider the new URL a new document, and the same page would be crawled and indexed several times. To avoid this, indicate here the parameter names to ignore when differentiating URLs.
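Here is a minimal sketch of this kind of URL normalization, assuming a hypothetical sessionid parameter to ignore:

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Parameter names to ignore when comparing URLs (illustrative values).
IGNORED = {"sessionid", "jsessionid"}

def normalize(url: str) -> str:
    """Drop ignored query parameters so volatile values don't create duplicates."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in IGNORED]
    return urlunparse(parts._replace(query=urlencode(query)))

a = normalize("http://www.server.com/page?id=42&sessionid=abc")
b = normalize("http://www.server.com/page?id=42&sessionid=xyz")
print(a == b)  # True: both normalize to http://www.server.com/page?id=42
```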

Automatic HTML page cleaning

HTML pages include headers, footers, menus and other items that would otherwise be treated as relevant information during the indexing process. In order to optimize indexing and search, algorithms can be applied to remove these non-relevant parts of the page. You can test these algorithms on several pages of the Web site and so choose the best one.
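As a very naive illustration of what such cleaning does (the product’s algorithms are more elaborate than this), here is a sketch that strips common boilerplate tags with BeautifulSoup before extracting text:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<html><body>
  <header>Site name</header>
  <nav><a href="/">Home</a> <a href="/news">News</a></nav>
  <div id="content"><p>The actual article text.</p></div>
  <footer>Copyright notice</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Remove structural boilerplate so only the main content is indexed.
for tag in soup.find_all(["header", "footer", "nav", "aside"]):
    tag.decompose()

print(soup.get_text(" ", strip=True))  # "The actual article text."
```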

Crawl links inclusion/exclusion rules

During the crawl process, each page is read in order to be sent to the Solr index. On each page, the crawler also extracts the links to other pages of the Web site. This is not always the best strategy for an optimized crawl, because some pages do not contain relevant information, or because some parts of the Web site can be ignored. This parameter allows the definition of crawling rules.

For instance, you can force the crawler to ignore all pages containing a specific path or pattern in their URL. Or you can force the crawler not to send some pages to the Solr index, but only to extract the links to other pages of the Web site.

Path or regex patterns are associated with specific modes. These modes are:

Get page and extract links: send the page to the Solr index and extract links to other pages of the Web site.
Extract links only: do not send the page to the Solr index, but extract links to other pages of the Web site.
Ignore: ignore this page (do not send it to the Solr index and do not extract links to other pages of the Web site).

The first rule matching the URL being processed by the crawler is applied, so rule ordering is very important.
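A minimal sketch of this first-match evaluation, with illustrative patterns and mode names that are not the product’s actual identifiers:

```python
import re

# Rules are evaluated top to bottom; the first match wins.
RULES = [
    (r"/print/",        "ignore"),              # printer-friendly duplicates
    (r"/archive/index", "extract_links_only"),  # hub pages: follow links only
    (r".*",             "get_and_extract"),     # default: index and follow
]

def mode_for(url: str) -> str:
    for pattern, mode in RULES:
        if re.search(pattern, url):
            return mode  # first matching rule wins
    return "get_and_extract"

print(mode_for("http://www.server.com/print/a1.html"))  # ignore
print(mode_for("http://www.server.com/news/a1.html"))   # get_and_extract
```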

Click on the “Add rule” icon to add a new rule or click on the edit button to the right of an existing rule to edit this rule. The following dialog box is displayed.