Domains

Creating Domains

The Create domain page is used for creating new domains in the system. It is possible to create a single domain or a list of domains. It is also possible to import domains from a file.

To create single domains enter domain names in the text box and press Create.

To create domains in bulk from a file select the file from your local computer with Browse and press Ingest. The file must be a simple list of domain names – one per line. The file must be UTF-8 encoded if it contains special characters.
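
The file format described above can be produced with a few lines of script. The following is a minimal sketch, not part of NetarchiveSuite; the function names and the file name domains.txt are hypothetical, and lower-casing and de-duplicating the names is an assumption made here for tidiness rather than a requirement of the ingest.

```python
from pathlib import Path


def format_domain_list(domains):
    """Return the domains as text, one name per line."""
    seen = set()
    lines = []
    for domain in domains:
        # Assumption: trim whitespace and lower-case, since DNS names
        # are case-insensitive. Blank entries and repeats are dropped.
        name = domain.strip().lower()
        if name and name not in seen:
            seen.add(name)
            lines.append(name)
    return "".join(line + "\n" for line in lines)


def write_domain_list(domains, path):
    """Write the list as a UTF-8 encoded file, as the ingest requires."""
    Path(path).write_text(format_domain_list(domains), encoding="utf-8")
```

Calling write_domain_list(["kb.dk", "example.org"], "domains.txt") would give a file that can then be selected with Browse.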

New domains get a default configuration when created (with the defaultorderxml template and a default maximum number of bytes). New domains also get a defaultseedlist when created.

Existing domains in the system will be skipped by the ingest.

Finding Domains

Find Domain(s) is used to find domains existing in the system.

Write a domain name in the box (e.g. kb.dk). Searching is done on the complete text string. Press Search.

Left and/or right wildcards can be added with * (e.g. *.dk). Domains can also be searched by crawler traps and comments.
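
To illustrate the difference between a complete-string search and a wildcard search, here is a small sketch of the matching rule as described above. It is an assumption about the behaviour, not NetarchiveSuite's actual search code, and the function name is hypothetical.

```python
import re


def domain_matches(pattern, domain):
    """Match a domain against a search string where * is a wildcard."""
    # Escape everything, then turn each escaped * back into "any text".
    regex = re.escape(pattern).replace(r"\*", ".*")
    # fullmatch: without a wildcard, the complete string must be equal.
    return re.fullmatch(regex, domain) is not None
```

With this rule, the search string kb.dk finds only kb.dk, while *kb.dk also finds www.kb.dk and kb* finds every name beginning with kb.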

If there are several hits, a list of the found domains is displayed.

A link to the domain harvest history for each domain is available in the second column. Clicking on the domain name takes you to a page where you can edit the domain information.

Editing Domains

Edit domain is an overview of a single domain where it is possible to edit the domain’s definition in the harvest system.

Free-text comment box.
‘Alias of’: Here it can be stated that the domain is an alias of another domain, meaning that they are identical in content and only one of them should be harvested. Domains marked as an alias will not be harvested in snapshot harvests. An alias is defined for one year at a time and then has to be renewed. A domain can have many aliases, but aliases may not be reflexive or transitive (a domain cannot be an alias of itself, nor of a domain that is itself an alias).
‘Configurations’: New configuration and Edit open a new page, Enter/edit configuration (see below). Unused configurations can be hidden, which can be useful if there are many configurations. An unused configuration is one that is neither the default configuration nor used in any active harvest.
‘Seed lists’: New seed list and Edit open a new page, Enter/edit seed list (see below). Unused seed lists can be hidden, which can be useful if there are many seed lists. A seed list is considered used if it is part of a used configuration.
‘Crawler traps’: Show crawler traps opens a new text box, Crawler traps (see below).
Show historical harvest information for … opens a new page, Harvest history for domain… (see Harvest Status).

Editing configurations

The Enter/edit configuration page is used to define a new configuration or edit an existing one. A configuration contains information about which Harvest template and Seed lists are used (more than one Seed list can be used - hold down CTRL).

Furthermore, it is possible to choose between different harvest templates and to set the maximum number of bytes to be harvested in each harvest of the configuration. At creation, the default maximum number of bytes and a default maximum number of objects are set for each domain. These can be changed later.

It is also possible to set additional parameters for the configuration: the harvest depth (MAX_HOPS), whether the harvest honours robots.txt, and whether or not to extract possible hyperlinks from JavaScript elements.

Editing seed lists

Enter/edit seed list is used to define a new Seed list or to edit an existing one.

When a new Seed list is created it is given a name, which cannot be changed afterwards.

In the ‘Seeds’ text box a list of seeds to be harvested is given. A seed can be omitted by prefixing it with #, e.g. #http://www.kb.dk. The same prefix can be used for comments inside the seed list, e.g. #this seed is important.
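
As an illustration of how the # prefix behaves, the sketch below filters a seed list the way the text describes. It is not the harvester's own parser; the function name and the example URLs are hypothetical, and ignoring blank lines is an assumption.

```python
def active_seeds(seed_text):
    """Return the seeds that would be harvested from a seed list."""
    seeds = []
    for line in seed_text.splitlines():
        entry = line.strip()
        # Lines starting with # are omitted seeds or comments.
        # Assumption: empty lines are ignored as well.
        if not entry or entry.startswith("#"):
            continue
        seeds.append(entry)
    return seeds


example = """\
#this seed is important
http://www.example.org/
#http://www.kb.dk
http://www.example.org/news/
"""
```

Here active_seeds(example) keeps the two example.org seeds and skips both the comment and the omitted kb.dk seed.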

Thus it is possible to define multiple distinct seedlists for a domain and to use them in different and arbitrary combinations in different harvest configurations.

Editing crawlertraps

A crawlertrap is any URL or pattern of URLs which should not be harvested, even though they are otherwise in scope for the harvest configuration. The name "crawlertrap" refers to the common situation where a web crawler follows an endless series of links, none of which lead to "interesting" content. A calendar would be the canonical example.

Each crawlertrap is a regular expression. Matching URLs are omitted in all harvests of the domain and in other domains harvested in the same job. So be very careful not to specify overly wide crawlertrap regexes that could potentially omit things on other domains (perhaps always include the domain-name itself in the statement).
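
The risk of an overly wide expression can be shown with a small sketch. It assumes that a crawlertrap must match the whole URL, and it uses Python's regular expression syntax, which may differ in details from the syntax the harvester accepts. The function name, the URLs and both patterns are hypothetical.

```python
import re


def is_trapped(url, traps):
    """Return True if any crawlertrap expression matches the whole URL."""
    # Assumption: the expression has to match the complete URL.
    return any(re.fullmatch(trap, url) for trap in traps)


# Too wide: matches "calendar" anywhere, on any domain in the job.
broad = [r".*calendar.*"]
# Narrower: includes the domain name, so other domains are unaffected.
scoped = [r"https?://(www\.)?example\.org/calendar/.*"]

own_url = "http://www.example.org/calendar/2024/01"
other_url = "http://other.example.com/calendar-of-events.html"
```

Both lists omit own_url, but only the broad one also omits other_url, a page on another domain that happens to be harvested in the same job.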

Harvest history of a domain

To see all the jobs of the finished harvests for a domain, click on Show historical harvest information for domainxxx at the bottom of the domain page. The harvest history page includes information on why each harvest stopped. The 'Stopped due to' column shows whether a harvest was stopped unexpectedly, hit the max-bytes limit for the chosen domain, or was stopped because of an error on the harvester machine.

Domain statistics

The domain statistics page gives you information about the number of subdomains for each unique Top level domain known in the system. IP addresses are counted separately.

The number in the “Number of subdomains” column is clickable and will do a search for all domains matching that Top level domain. This is only applicable to Top level domains with a limited number of subdomains since the matching domains will be listed on one page – and that page will get very long if the system contains hundreds or thousands of domains.

Alias summary

The alias summary page gives an overview of the domains marked as aliases of other domains in the system. Both domain names are clickable and will open the domain page for the clicked domain.

The “Expires” column shows when each alias expires (12 months after it is created). The mark does not disappear from the database after 12 months, but the “Overview of Aliases” page will show the “expired” ones at the top.

To renew an alias for another 12 months, one currently has to open the domain page of the marked domain (the “Domain” column), select “renew alias” and press Save.

(Aliases are allowed to expire after 12 months because domains tend to change ownership and usage, so there is no guarantee that an alias domain will remain an alias forever. By automatically expiring aliases, NetarchiveSuite encourages curators to confirm regularly whether a domain really is an alias.)
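
The 12-month rule can be sketched as a simple date calculation. This is only an illustration of the rule as stated here; how NetarchiveSuite computes the expiry date internally is not described in this document, and the function names and the handling of 29 February are assumptions.

```python
from datetime import date


def alias_expiry(created):
    """Return the date 12 months after the alias was created or renewed."""
    try:
        return created.replace(year=created.year + 1)
    except ValueError:
        # Assumption: an alias created on 29 February expires on
        # 28 February of the following year.
        return created.replace(year=created.year + 1, day=28)


def is_expired(created, today):
    """Return True once the expiry date has been reached."""
    return today >= alias_expiry(created)
```

An alias created on 15 March 2023 would thus be listed as expired from 15 March 2024 until it is renewed.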

NetarchiveSuite 5.1 Documentation