
Walkthrough of the definition and execution of a simple harvest.

Running a simple harvest

A snapshot harvest harvests all known domains up to a given byte limit, i.e. a maximum number of bytes harvested from each domain. This is used for nation-wide harvests of all domains. You can also set "Max number of objects per domain" ("-1" means no limit). The best practice is to use either a byte limit or an object limit, not a combination of the two.
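The limit rules above can be sketched in a few lines of Python. This is only an illustration, not NetarchiveSuite code: the class and field names are hypothetical, it assumes that "-1" means "no limit" for bytes as well as for objects, and it turns the "not a combination" advice into a hard error, which is stricter than the tool itself.

```python
from dataclasses import dataclass

NO_LIMIT = -1  # "-1" means without limit (assumed here to hold for bytes too)


@dataclass(frozen=True)
class SnapshotLimits:
    """Hypothetical per-domain limits for a snapshot harvest."""
    max_bytes: int = NO_LIMIT
    max_objects: int = NO_LIMIT

    def __post_init__(self):
        # Follow the advice to use a byte limit or an object limit, not both.
        if self.max_bytes != NO_LIMIT and self.max_objects != NO_LIMIT:
            raise ValueError("set either a byte limit or an object limit, not both")

    def reached(self, harvested_bytes, harvested_objects):
        """Return True once the active limit for a domain has been hit."""
        if self.max_bytes != NO_LIMIT and harvested_bytes >= self.max_bytes:
            return True
        if self.max_objects != NO_LIMIT and harvested_objects >= self.max_objects:
            return True
        return False
```

With this sketch, a byte limit of 0 is reached immediately, so nothing would be harvested from that domain.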

Each domain has one "default configuration" automatically generated when the domain is created. The default configuration is used to determine how to harvest the domain in a snapshot harvest. Typically, the default configuration is good enough for most purposes, but if you want to have a domain excluded from the snapshot harvest (e.g. if the domain is outside the group you're interested in) you may want to set the harvest limit on the default configuration for that domain to 0. The default configuration is also the one used in a selective harvest unless another configuration is chosen in the drop-down menu on the selective harvest page. The other way to control how a snapshot harvest is executed is by choosing a different harvest template. Descriptions of how harvest templates work are in the user manual.

NetarchiveSuite has support for mass creation of domains, for instance by ingesting (loading) a list of domains given by a national TLD (top-level-domain) administrator.

To ingest, go to the "Create Domain" page under "Definitions" and specify the file containing the list of domains. You can also type domains in the text window, but this is only practical for a small number of domains. The list should be a newline-separated list of domain names including the top-level domain, but not including subdomains, protocol specifications or URL paths. Thus netarkivet.dk or archive.org are usable, while http://foo.com, bar.dk/hest or news.bbc.co.uk are not. What is considered a top-level domain is configurable. Typically it would be a country top-level domain for most countries (like .dk, .fr etc.), but for some special cases it makes more sense to define the top level a little further down (for instance .co.uk). See how to configure this in the [Installation Manual 3.16]. When the file is specified, press "Ingest" and wait while the domains are ingested. For a first test, you probably want to keep it to a fairly small number of sites, to make sure the test harvest doesn't take too long.
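Before ingesting a large list, it can help to check it against the format rules above. The following Python sketch is not part of NetarchiveSuite. The function names and the CONFIGURED_TLDS set are hypothetical stand-ins for whatever top-level domains your installation is configured with.

```python
# Hypothetical set of configured top-level suffixes; the real list comes from
# the NetarchiveSuite configuration of your installation.
CONFIGURED_TLDS = {"dk", "fr", "org", "com", "co.uk"}


def is_ingestible_domain(line, tlds=CONFIGURED_TLDS):
    """Return True if the line is a bare domain name directly under a configured TLD."""
    name = line.strip().lower()
    # Reject protocol specifications and URL paths.
    if not name or "://" in name or "/" in name:
        return False
    labels = name.split(".")
    if any(not label for label in labels):
        return False
    # Try the longest suffix first, so "co.uk" is matched before any shorter suffix.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in tlds:
            # Exactly one label may precede the suffix; more means a subdomain.
            return i == 1
    return False


def split_domain_list(text, tlds=CONFIGURED_TLDS):
    """Split a newline-separated list into (accepted, rejected) entries."""
    accepted, rejected = [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue
        (accepted if is_ingestible_domain(entry, tlds) else rejected).append(entry)
    return accepted, rejected
```

Running split_domain_list on the examples from the text accepts netarkivet.dk and archive.org and rejects the other three entries.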

SnapshotHarvest-CreateDomains_3_16_0.png

After ingest, you can click on 'Domain statistics' under 'Definitions' to see an overview of how many domains are registered under the TLDs. To create a snapshot definition, go to 'Snapshot harvests' and press 'Create new snapshot harvest'. The harvest definition presented requires you to enter a harvest name, and also allows you to add comments or change the limit on how many bytes or objects to collect per domain. Keep this to a fairly low number for a first test, to make sure the harvest doesn't run too long.

SnapshotHarvest-CreateHarvest_3_16_0.png

When you have entered the information, press 'Save' and then press 'Activate'.

You can monitor the harvest and browse the harvested material exactly as you did in the previous harvests.

It is possible - only while the job is running - to access the Heritrix user interface on the harvester (See further details above or in the User Manual).

The system is now up and running, and you can try out the harvesting and archiving capabilities.

This section will guide you through the steps needed to

  • harvest and store a domain
  • browse the harvested material in a browser

Setting up the harvest

Start the program as described in section "Starting simple_harvest version".

Open http://localhost:8074/HarvestDefinition in a browser on the local machine. (Replace the host name if the Quickstart is running on another machine. The port is 8078 if using the Docker version of Quickstart.)
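The address rules above can be captured in a small helper. This Python sketch is only a convenience illustration: the function name is hypothetical, and the ports and path are the ones given in the text.

```python
def harvest_definition_url(host="localhost", docker=False):
    """Build the address of the HarvestDefinition page.

    Port 8074 is used for the plain Quickstart and 8078 for the Docker
    version, as stated in the text above.
    """
    port = 8078 if docker else 8074
    return f"http://{host}:{port}/HarvestDefinition"
```

For example, harvest_definition_url(docker=True) gives http://localhost:8078/HarvestDefinition.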

You can now define a new harvest.

Click 'Selective Harvests' under menu 'Definitions'

Click 'Create new selective harvest definition' under the (empty) table of existing harvests.

Enter a name for the harvest at the top. Enter a second-level domain name (e.g., netarkivet.dk) in the box and press 'Add domains'. Assuming the domain doesn't already exist in the database, the system suggests that you add it. Click 'Create and add to harvest definition'.

Preferably the domain should be one that you know you have permission to harvest. By default, NetarchiveSuite will harvest up to 1GB of data from a domain so you may wish to choose a small domain for your first tests. You can add more domains if you want by repeating the procedure, but in this example we will only add one domain.

You can now click 'Save' on the 'Selective Harvest' page

Click 'Activate' for the newly defined harvest. NetarchiveSuite will now generate harvest jobs for the harvest definition.

Go to the Job Status page by clicking 'Harvest status'. Set wanted jobs status to 'All' and click 'Show'. Refresh the page periodically until a job appears and changes to state "Started". This should take no more than two minutes. At this point, a harvester has started harvesting, using the Heritrix web harvester.

Now you can monitor the system state for what is going on in the various components. That way you can see how the harvester is proceeding with the job:

Go to the System Status page by clicking 'Systemstate'. Click on the application HarvestControllerApplication. The most recent log record will give status information from Heritrix. You can find more application information by clicking on 'Show all' in the Index column.

You can find more details about the running job by going to the Running Jobs page.

After a minute or two the job will appear as running, and its progress can be followed here. Clicking on the JobID or H3 Remote Access will take you to a page where you can view information on the state of the job and control most of the functionality of the Heritrix harvesting process directly from NetarchiveSuite.

By clicking on the various buttons, you can see what the job has already harvested (the crawl-log), see and manipulate what the job is going to harvest (the frontier), monitor the output of the job, pause, unpause, and if necessary, terminate the job.

Clicking on "Open Scripting Console" will take you directly to the Heritrix 3 console, which can be accessed using the standard Heritrix login "admin" and password "adminPassword". (Note: you will need to add the name of your PC as an exception to your browser's proxy configuration.) See the Heritrix documentation for more information on how to use the Heritrix console.

Now go to the Job status page by clicking 'Harvest status'. Set wanted jobs status to 'All' and click 'Show'. Refresh the page until your job changes state to "Done". 
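The manual refresh described above is a polling loop. The Python sketch below shows the pattern in generic form. It is an illustration only: get_state is a hypothetical callable that you would have to supply yourself (for example by reading the status page), since this walkthrough does not describe a programmatic status interface.

```python
import time


def wait_for_state(get_state, wanted, interval_s=10.0, timeout_s=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `wanted` or the timeout expires.

    get_state is a hypothetical, caller-supplied function returning the
    current job state as a string such as "Started" or "Done".
    Returns True if the wanted state was seen, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The sleep and clock parameters exist only so that the loop can be tried out without real waiting.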

Viewing the results

Harvested jobs can be viewed in an ordinary browser. Part of the NetarchiveSuite is a "viewerproxy", that integrates with your browser to show you harvested material for Quality Assurance.

In order to use viewerproxy it is essential that you have followed the instructions for proxy setup. Once a harvest has completed, you can use the viewerproxy to view the harvested material. Before it is ready, it needs to know which material you wish to browse.

  1. Go to the 'Harvest Status' page, select to show 'All' jobs and click 'Show'. Click on the link with the Job Id.
  2. Click on 'Select this job for QA with viewerproxy'. This will make the viewerproxy browse in this job. It will take a while to generate an index, after which you are taken to the viewerproxy status page.

Now simply enter (in the browser URL field) the URL that you started harvesting from (with www), e.g. www.netarkivet.dk. The browser will show you the harvested material. If you go to a URL in another domain, you will get an error. Depending on the layout of the domain you harvested, there may also be missing pages or images from that domain.

The NetarchiveSuite allows automatic collection of unharvested URLs during browsing, i.e. the NetarchiveSuite allows you to browse in the collected material while it automatically collects URLs for missing pages or images that you request. This makes it easy to identify missing harvested material during Quality Assurance of the harvested material.

To try this, go back to the Viewerproxy status page and click 'Start collecting URLs'. Now browse in the collected material until you find a page or image that did not get harvested. Go back to the Viewerproxy status page and click 'Show collected URLs'.

The list will contain several URLs, including the ones you just requested and found missing while URL collection was active.

Note that the current version of Viewerproxy does not support URLs harvested via https.