...

Click 'Selective Harvests' under the 'Definitions' menu.

!SelectiveHarvest_1!

Click 'Create new harvest definition' under the (empty) table of existing harvests.

!SelectiveHarvest_2!

Enter an arbitrary name for the harvest at the top. Enter a second-level domain name (e.g., netarchive.dk) in the box and press 'Add domains'. Preferably, choose a domain that you know you have permission to harvest. You can add more domains by repeating the procedure, but this example adds only one.

!SelectiveHarvest_adddomain!

Since the domain does not yet exist in the database, the system suggests that you add it. Click 'Create and add to harvest definition'. You can now click 'Save' on the 'Selective Harvest' page.

!SelectiveHarvest_activate!

You have now created a harvest definition for this domain. However, no harvest will start until the definition is changed to the active state.

...

Go to the Job Status page by clicking 'Harvest status'. Set wanted jobs status to 'All' and click 'Show'. Refresh the page periodically until a job appears and changes to state "Started". This should take no more than two minutes. At this point, a harvester has started harvesting, using the Heritrix web harvester.

!SelectiveHarvest_new!

You can now monitor the state of the various components to see how the harvester is progressing with the job:

Go to the System Status page by clicking 'Systemstate'. Click on the application HarvestControllerServer. The most recent log record will give status information from Heritrix. You can find more application information by clicking on 'Show all' in the Index column.

!SelectiveHarvest_status!

Use the System Status and Job Status pages to monitor your job. While the job is running, you can also jump to the Heritrix GUI by clicking the URL in the log line (e.g. Harvest ID: 1 KB-TEST-WAY-001.kb.dk:8192) and logging in with the standard Heritrix login "admin" and password "adminPassword".

Go to the Job Status page by clicking 'Harvest status'. Set wanted jobs status to 'All' and click 'Show'. It will take a little while (about 5 minutes) for the job to finish and upload the harvested files to the NetarchiveSuite archive. Refresh the page until the job changes state to "Done".

!SelectiveHarvest_done!
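
The "refresh until the state changes" step can be thought of as a simple polling loop. The sketch below is only an illustration of that pattern: this tutorial describes the web GUI, so `get_state` is a hypothetical callable that you would have to supply yourself (it is not a NetarchiveSuite function), and the timeout and interval values are assumptions.

```python
import time
from typing import Callable, Optional


def wait_for_state(get_state: Callable[[], Optional[str]],
                   target: str,
                   timeout_s: float = 600.0,
                   interval_s: float = 10.0,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_state() until it returns target or the timeout is used up.

    get_state is a hypothetical, caller-supplied function that returns the
    job's current state (for example "Started" or "Done"), or None if no
    job is listed yet. The timeout is counted in accumulated sleep
    intervals, so it is approximate.
    """
    waited = 0.0
    while True:
        if get_state() == target:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

With a suitable `get_state`, calling `wait_for_state(get_state, "Done")` would mirror refreshing the Job Status page every ten seconds for up to ten minutes.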

Viewing the results

...

This makes the viewerproxy browse the material in this job. Generating an index takes a while; once it is done, you are taken to the viewerproxy status page.

!SelectiveHarvest_proxystatus!

Now simply enter the URL that you started harvesting from (with www), e.g. www.netarchive.dk. It shows you the harvested material. If you go to a URL in another domain, you will get an error. Depending on the layout of the domain you harvested, there may also be missing pages or images from that domain.
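
As a rough illustration of which URLs count as being "in the domain" you harvested, the sketch below checks whether a URL's host is the harvested domain or one of its subdomains. The function name and the rule itself are assumptions made for this example; what the viewerproxy can actually show depends on what was harvested in the job, not on this check.

```python
from urllib.parse import urlsplit


def in_harvested_domain(url: str, domain: str) -> bool:
    """Return True if url's host equals domain or is a subdomain of it.

    Hypothetical helper for illustration only; it is not part of
    NetarchiveSuite.
    """
    if "://" not in url:
        # Assume http when the URL is typed without a scheme.
        url = "http://" + url
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    domain = domain.lower().rstrip(".")
    return host == domain or host.endswith("." + domain)
```

For instance, `in_harvested_domain("www.netarchive.dk", "netarchive.dk")` is true, while a URL on any other domain is rejected.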

...

To try this, go back to the viewerproxy status page and click 'Start collecting URLs'. Now browse in the collected material until you find a page or image that did not get harvested. Go back to the viewerproxy status page and click 'Show collected URLs'.

!SelectiveHarvest_missingurls!

The list will contain several URLs, including the ones you requested and found missing while URL collection was active.

...