Note that this documentation is for the old 5.55 release.
For the newest documentation, please see the current release documentation.

Running a snapshot harvest


A snapshot harvest harvests all domains known to the NetarchiveSuite installation up to a given byte limit, i.e. a maximum number of bytes harvested from each domain.

This is used for nationwide harvests of all domains. You can also limit the harvest with "Max number of objects per domain" ("-1" means no limit). Best practice is to use either a byte limit or an object limit, not a combination of the two.


Each domain has one "default configuration", generated automatically when the domain is created. The default configuration determines how the domain is harvested in a snapshot harvest. It is good enough for most purposes, but if you want to exclude a domain from the snapshot harvest (e.g. because the domain is outside the group you're interested in), you can set the harvest limit on that domain's default configuration to 0. The default configuration is also the one used in a selective harvest unless another configuration is chosen in the drop-down menu on the selective harvest page. The other way to control how a snapshot harvest is executed is to choose a different harvest template. Harvest templates are described in the user manual.

NetarchiveSuite supports mass creation of domains, for instance by ingesting (loading) a list of domains supplied by a national TLD (top-level domain) administrator.

To ingest, go to the "Create Domain" page under "Definitions" and specify the file containing the list of domains. You can also type domains in the text window, but this is only practical for a small number of domains. The list should be a newline-separated list of domain names including the top-level domain, but not including subdomains, protocol specifications or URL paths. Thus netarkivet.dk or archive.org are usable, while http://foo.com, bar.dk/hest or news.bbc.co.uk are not. By default, NetarchiveSuite is configured to recognise all ICANN-defined top-level domains.
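Before uploading a large list, it can help to check it against the format rules above. The following is a minimal sketch in Python, not part of NetarchiveSuite: the function names and the small MULTI_LABEL_SUFFIXES set are assumptions made for illustration. Detecting subdomains reliably requires a complete public suffix list, so the set here only covers the example in the text, and the checks NetarchiveSuite itself applies may differ.

```python
# Hypothetical, deliberately tiny sample of multi-label public suffixes.
# A real check would need a complete public suffix list.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}


def check_domain_line(line):
    """Return None if the line looks like a bare domain name, else a reason."""
    name = line.strip().lower()
    if not name:
        return "empty line"
    if "://" in name:
        return "contains a protocol specification"
    if "/" in name:
        return "contains a URL path"
    labels = name.split(".")
    if len(labels) < 2:
        return "missing top-level domain"
    if not all(labels):
        return "contains an empty label"
    # The suffix spans two labels for entries such as "co.uk", else one.
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    if len(labels) > suffix_len + 1:
        return "contains a subdomain"
    if len(labels) <= suffix_len:
        return "is only a domain suffix"
    return None


def split_domain_list(text):
    """Split newline-separated text into (accepted, rejected) lists."""
    accepted, rejected = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        reason = check_domain_line(line)
        if reason is None:
            accepted.append(line.strip().lower())
        else:
            rejected.append((line.strip(), reason))
    return accepted, rejected


if __name__ == "__main__":
    sample = "netarkivet.dk\narchive.org\nhttp://foo.com\nbar.dk/hest\nnews.bbc.co.uk\n"
    ok, bad = split_domain_list(sample)
    print("accepted:", ok)
    for name, reason in bad:
        print("rejected:", name, "-", reason)
```

Run on the examples from the text, the sketch accepts netarkivet.dk and archive.org and rejects the other three, each with the reason it fails the format.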

When the file is specified, press "Ingest" and wait while the domains are ingested. For a first test, you probably want to keep it to a fairly small number of sites, to make sure the test harvest doesn't take too long.

After ingest, you can click on "Domain statistics" under "Definitions" to see an overview of how many domains are registered under each top-level domain (TLD). To create a snapshot definition, go to "Snapshot harvests" and press "Create new snapshot harvest". The harvest definition presented requires you to enter a harvest name, and also allows you to add comments or change the limit on how many bytes or objects to collect per domain. Keep this to a fairly low number for a first test, to make sure the harvest doesn't run too long.

When you have entered the information, press "Save" and then press "Activate".

You can monitor the harvest and browse the harvested material exactly as you did in the previous harvests. While the job is running, and only then, you can access the Heritrix user interface on the harvester (see further details above or in the User Manual). The Running Jobs section of the interface also gives access to a great deal of information about the state of the running job or jobs, as well as functionality to control the running Heritrix process directly from NetarchiveSuite.
