...

To ingest domains, go to the "Create Domain" page under "Definitions" and specify the file containing the list of domains. You can also type domains directly into the text window, but this is practical only for a small number of domains. The list must be newline-separated, and each entry must be a domain name including the top-level domain, but without subdomains, protocol specifications, or URL paths. Thus netarkivet.dk and archive.org are usable, while http://foo.com, bar.dk/hest, and news.bbc.co.uk are not. What counts as a top-level domain is configurable. For most countries it is the country-code top-level domain (such as .dk or .fr), but in some special cases it makes more sense to define the top level a little further down (for instance .co.uk). See how to configure this in the [Installation Manual 3.16]. Once the file is specified, press "Ingest" and wait while the domains are ingested. For a first test, you probably want to keep the list fairly small, so that the test harvest doesn't take too long.
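Before uploading a larger list, it can help to pre-check it against the rules above. The following is only a minimal sketch and not part of NetarchiveSuite: the function names and the CONFIGURED_TLDS tuple are hypothetical, and the tuple must be adjusted to match the top-level domains configured in your own installation.

```python
from typing import Iterable, List, Tuple

# Hypothetical TLD configuration, for illustration only. The real list is
# whatever your installation has configured (see the Installation Manual).
CONFIGURED_TLDS = ("dk", "fr", "org", "com", "co.uk")


def is_ingestable(line: str, tlds: Iterable[str] = CONFIGURED_TLDS) -> bool:
    """Return True if the line is exactly one label followed by a configured TLD."""
    name = line.strip().lower()
    # Protocol specifications and URL paths are not allowed.
    if not name or "://" in name or "/" in name:
        return False
    # Try the longest TLD first, so that "co.uk" is matched before any
    # shorter suffix that might also be configured.
    for tld in sorted(tlds, key=len, reverse=True):
        suffix = "." + tld
        if name.endswith(suffix):
            label = name[: -len(suffix)]
            # A dot in the remaining label would mean a subdomain.
            return bool(label) and "." not in label
    return False


def split_domain_list(text: str) -> Tuple[List[str], List[str]]:
    """Split a newline-separated list into (accepted, rejected) entries."""
    accepted: List[str] = []
    rejected: List[str] = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # ignore blank lines
        (accepted if is_ingestable(entry) else rejected).append(entry)
    return accepted, rejected
```

Calling split_domain_list on the contents of your domain file returns the entries that follow the format and those that do not, so the rejected ones can be corrected before pressing "Ingest". The check only mirrors the format rules described here; the ingest page itself remains the authority on what is accepted.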

...

After the ingest, you can click on 'Domain statistics' under 'Definitions' to see an overview of how many domains are registered under each of the Top Level Domains (TLDs). To create a snapshot definition, go to 'Snapshot harvests' and press 'Create new snapshot harvest'. The harvest definition form requires you to enter a harvest name, and also allows you to add comments or change the limit on how many bytes or objects to collect per domain. Keep these limits fairly low for a first test, so that the harvest doesn't run too long.

...

You can monitor the harvest and browse the harvested material exactly as you did in the previous harvests. While the job is running, and only then, you can also access the Heritrix user interface on the harvester (see further details above or in the User Manual).

