...
Create a new snapshot harvest with a 'Max number of bytes per domain' of 5MB (i.e., 5000000 bytes), harvesting only the domains not completed in the previous harvest.
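The byte value above uses the decimal convention (1 MB = 1,000,000 bytes). The following minimal sketch only illustrates the arithmetic; the variable names are hypothetical and are not part of NAS.

```python
# Hypothetical names, used only to illustrate the arithmetic.
limit_mb = 5

# Decimal convention, as used in the step above: 1 MB = 1000 * 1000 bytes.
decimal_bytes = limit_mb * 1000 ** 2

# Binary convention (MiB), shown only for contrast: 1 MiB = 1024 * 1024 bytes.
binary_bytes = limit_mb * 1024 ** 2

print(decimal_bytes)  # value to enter in the 'Max number of bytes per domain' field
print(binary_bytes)
```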
...
Test the NAS Heritrix
...
Groovy Scripts
In "Harvest status -> Running jobs" the Host field is a link to the relevant Heritrix GUI. Click on this link, log in (admin/adminPassword) and pause the job. (If there are firewall problems, it might be necessary to log in to the harvester machine and start a web browser on that machine instead.)
Go to the NAS GUI. In the System Overview for the relevant HarvestController, confirm that there is a "Paused" message in the log. (The log message can be delayed by up to 15 minutes.) Also, in the running jobs list, the job should be marked as paused with a red bullet.
Edit the Job and Resume job
...
You should see the Heritrix 3 GUI:
Click on the job name near the bottom next to status <<Active: RUNNING>>:
Click on "Pause" to pause the job and then "Scripting Console".
The H3 scripts should be downloaded from their own GitHub repository at https://github.com/netarchivesuite/heritrix3-scripts/blob/master/src/main/java/nas.groovy . Copy and paste the script into the script box and choose "groovy" in the drop-down menu.
As the above example shows, enter your own initials in the box and uncomment the call to listFrontier(). Click on "execute":
The current state of the frontier is shown. Now execute deleteFromFrontier() with a regex that matches some, but not all, of the URLs.
The output shows how many URLs were removed. Now go back to the job page in the Heritrix GUI and unpause the job.
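Before running deleteFromFrontier(), it can help to check offline that a candidate regex matches some, but not all, of the URLs listed by listFrontier(). The following is a minimal sketch under stated assumptions: the URLs and the pattern are hypothetical, and the sketch assumes the regex must match the whole URL. Heritrix uses Java regular expressions, but a simple pattern like this one behaves the same way in Python.

```python
import re

# Hypothetical frontier contents; in practice, paste the URLs shown by listFrontier().
frontier = [
    "http://example.org/",
    "http://example.org/images/logo.png",
    "http://example.org/images/banner.jpg",
    "http://example.org/about.html",
]

# Hypothetical candidate regex to pass to deleteFromFrontier().
candidate = r".*/images/.*"


def split_by_regex(pattern, urls):
    """Return (matched, kept), assuming the regex must match the whole URL."""
    compiled = re.compile(pattern)
    matched = [u for u in urls if compiled.fullmatch(u)]
    kept = [u for u in urls if not compiled.fullmatch(u)]
    return matched, kept


matched, kept = split_by_regex(candidate, frontier)
print(f"would remove {len(matched)} of {len(frontier)} URLs")

# The test step needs a regex that removes some, but not all, URLs.
suitable = 0 < len(matched) < len(frontier)
print("suitable for the test" if suitable else "choose another regex")
```

The number of matched URLs can then be compared with the count reported in the scripting console output.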
Restart The System
- Stop NAS with the "stop_test.sh" command and restart it with the "start_test.sh" command. After some time, a job should appear in the state "Failed".
- Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
- Wait for the job to finish.
...