
Excerpt
Walkthrough of the definition and execution of a simple harvest.

Running a simple harvest

The system is now up and running, and you can try out the harvesting and archiving capabilities.

...

Click 'Create new harvest definition' under the (empty) table of existing harvests.

Enter an arbitrary name for the harvest at the top. Enter a second-level domain name (e.g., netarchive.dk) in the box and press 'Add domains'. Since the domain didn't exist in the database, the system suggests you add it. Click 'Create and add to harvest definition'.

Preferably the domain should be one that you know you have permission to harvest. By default, NetarchiveSuite will harvest up to 1GB of data from a domain so you may wish to choose a small domain for your first tests. You can add more domains if you want by repeating the procedure, but in this example we will only add one domain.

You can now click 'Save' on the 'Selective Harvest' page.

You have now defined a harvest definition for this domain. However, no harvest will start until the definition is changed to the active state.

Click 'Activate' for the newly defined harvest. NetarchiveSuite will now generate harvest jobs for the harvest definition.

Go to the Job Status page by clicking 'Harvest status'. Set wanted jobs status to 'All' and click 'Show'. Refresh the page periodically until a job appears and changes to state "Started". This should take no more than two minutes. At this point, a harvester has started harvesting, using the Heritrix web harvester.
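The refresh-until-"Started" step above amounts to a polling loop with a timeout. The following Python sketch only illustrates that idea. `get_state` is a hypothetical callable that you would have to supply yourself, since this walkthrough only uses the web interface and does not describe a programmatic way to read a job's state. The two-minute default timeout mirrors the estimate given above, and the five-second interval is an arbitrary choice.

```python
import time
from typing import Callable


def wait_for_state(
    get_state: Callable[[], str],
    target: str,
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns target or the timeout expires.

    get_state is a hypothetical, user-supplied function; it is not part
    of NetarchiveSuite. Returns True if the target state was seen.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == target:
            return True
        if clock() >= deadline:
            return False
        # Equivalent of waiting a moment before refreshing the page.
        sleep(interval_s)
```

The same loop could be used later in the walkthrough with the target "Done" and a longer timeout.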

You can now monitor the system state to see what is going on in the various components, including how the harvester is proceeding with the job:

Go to the System Status page by clicking 'Systemstate'. Click on the application HarvestControllerServer. The most recent log record will give status information from Heritrix. You can find more application information by clicking on 'Show all' in the Index column.

Use the System Status and Job Status pages to monitor your job. As long as the job is running, you can also jump to the Heritrix GUI by clicking the URL in the log line, e.g. Harvest ID: 1 pc300:8192, and logging in with the standard Heritrix login "admin" and password "adminPassword". (Note: you will need to add the name of your PC as an exception to your browser's proxy configuration. Alternatively, just replace the PC name with "localhost" in the URL, e.g. http://localhost:8192.)
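The "replace the PC name with localhost" workaround can be done by hand, but as a small illustration it could also be scripted. The Python sketch below uses only the standard library. The function name `to_localhost` is made up for this example, and it assumes the Heritrix GUI is reached over plain http when the log line gives only a host and port.

```python
from urllib.parse import urlsplit, urlunsplit


def to_localhost(url: str) -> str:
    """Return url with its host name replaced by 'localhost'.

    The port, path, query and fragment are kept; any user:password
    part of the URL is dropped.
    """
    # Assumption: a bare "host:port" from the log line means http.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port is not None else ""
    return urlunsplit(
        (parts.scheme, "localhost" + port, parts.path, parts.query, parts.fragment)
    )


print(to_localhost("pc300:8192"))  # http://localhost:8192
```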

Go to the Job Status page by clicking 'Harvest status'. Set wanted jobs status to 'All' and click 'Show'. It will take a little while (about 5 minutes) for the job to finish and upload the harvested files to the NetarchiveSuite archive. Refresh the page until the job changes state to "Done".

...

In order to use viewerproxy it is essential that you have followed the instructions for proxy setup. Once some web pages have been harvested, you can use the viewerproxy to view them. Before it is ready, it needs to know which material you wish to browse.

Go to the 'Harvest Status' page, select to show 'All' jobs and click 'Show'. Click on the link with the Job Id.
Click on 'Select this job for QA with viewerproxy'.

...

This will make the viewerproxy browse in this job. It will take a while to generate an index, after which you are taken to the viewerproxy status page.

Now simply enter the URL that you started harvesting from (with www), e.g. www.netarchive.dk. The viewerproxy shows you the harvested material. If you go to a URL in another domain, you will get an error. Depending on the layout of the domain you harvested, there may also be missing pages or images from that domain.

...

Copy the URLs for your harvested domain that were found missing into the clipboard. Go to the domain definition page by clicking 'Find Domain(s)' under 'Definitions' and search for your domain.

You will now get a page with information used when harvesting that domain. In this case, we wish to add the collected URLs to the list of seeds we start our web harvests from.

On the domain definition page, click 'Edit' next to the seed list.

Add the URLs from the clipboard to the seed list and press 'Update'.

These URLs will be used as seeds the next time the domain is harvested, so the next harvest will include them. To see this in effect, create another harvest of this domain following all the steps above. Wait for the harvest to finish, then go to the 'Job status' page for the new job. Limit the viewerproxy to the new job only and browse the material again. The URLs that were missing last time should now be found.
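If many URLs were collected, it can help to tidy them before pasting them into the seed list. The Python sketch below merges the existing seeds with the missing URLs and drops blank lines and duplicates while preserving order. It assumes both lists are plain text with one URL per line, which is an assumption about the format and not something stated in this walkthrough. The function name `merge_seeds` is made up for the example.

```python
def merge_seeds(existing: str, missing: str) -> str:
    """Merge two newline-separated URL lists, keeping the first
    occurrence of each URL and skipping blank lines.

    Assumes one URL per line in both inputs.
    """
    seen = set()
    merged = []
    for line in (existing + "\n" + missing).splitlines():
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            merged.append(url)
    return "\n".join(merged)
```

The result can then be pasted into the seed list field before pressing 'Update'.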

Version	Old Version 17	New Version 18
Changes made by	Nicholas Clarke	M
Saved on	Mar 27, 2012	Aug 08, 2013
