...
For each job generated, check that netarkivet.dk is not included in the domains harvested.
Check that Jobs Complete as Expected
Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:
...
If any of these have a different reason, investigate to see if the new stop reason makes sense.
Add a New Alias
Set sulnudu.dk to be an alias of kb.dk
Change a Byte Limit
For kb.dk set the Maximum Number of Bytes on defaultconfig to -1 (no limit).
Start a Second Phase Snapshot Harvest
Create a new snapshot harvest with a 'Max number of bytes per domain' of 5MB, harvesting domains not completed in the previous harvest.
Pause a Harvest via Heritrix
Under "currently running jobs" there is a link to the relevant Heritrix GUI. Follow this link, log in (admin/adminPassword), and pause the job. (If firewall problems prevent access, it might be necessary to log in to the harvester machine and start a web browser on that machine instead.)
Go to the NAS GUI. In the System Overview for the relevant HarvestController confirm that there is a "Paused" message in the log.
Edit the Job and Restart
- In the Heritrix GUI, change some parameters for the domain netarkivet.dk, e.g. max-hops 15 and delay-factor 1.5.
- Click on "Resume" on the Heritrix Console.
- Confirm that the job is running again in the NAS System overview.
Check the Overrides are Applied
In the QA interface check the order template for the job. The override values should be visible.
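The visual check of the order template can be backed up by a small script. The following is a minimal sketch, assuming the order template has been saved as XML text and that each setting is an element carrying a `name` attribute. The setting names `max-hops` and `delay-factor` are taken as written in this procedure; the exact names in the template may differ, and the function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Values set in the Heritrix GUI in the previous step (names as written in this procedure;
# the attribute names used in a real order template are an assumption).
EXPECTED_OVERRIDES = {"max-hops": "15", "delay-factor": "1.5"}


def find_settings(xml_text, names):
    """Collect the text of every element whose 'name' attribute is in names."""
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        key = element.get("name")
        if key in names:
            found.setdefault(key, []).append((element.text or "").strip())
    return found


def missing_overrides(xml_text, expected=EXPECTED_OVERRIDES):
    """Return the names of expected settings whose value does not appear in the template."""
    found = find_settings(xml_text, expected)
    return [name for name, value in expected.items() if value not in found.get(name, [])]
```

An empty list from `missing_overrides` means every expected value was found at least once; it does not confirm that the value is scoped to netarkivet.dk, so the manual check remains the authority.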
Restart the System
- Stop and restart NAS. After some time, a job should appear in the state "Failed".
- Resubmit the job. A new job should be created.
- Wait for the job to finish.
Check that Alias Domains are not Harvested
For each job generated in the last harvest:
- Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
- Check that neither domain appears in the order templates for any of the jobs with the exception of the following lines:
    <map name="http-headers">
      <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127+http://netarkivet.dk/website/info.html)</string>
      <string name="from">netarkivet-svar@netarkivet.dk</string>
    </map>
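The order-template check above can be partly automated. The following is a minimal sketch, assuming the order template of a job has been saved as plain text. The function name `unexpected_mentions` and the marker strings used to recognise the permitted header lines are illustrative and not part of NetarchiveSuite. It reports every line that mentions one of the two domains outside the permitted `user-agent` and `from` entries.

```python
ALIAS_DOMAINS = ("netarkivet.dk", "sulnudu.dk")
# Lines containing these markers are the permitted exceptions (the http-headers entries).
ALLOWED_MARKERS = ('name="user-agent"', 'name="from"')


def unexpected_mentions(order_xml_text, domains=ALIAS_DOMAINS):
    """Return (line_number, line) pairs that mention a domain outside the allowed lines."""
    hits = []
    for number, line in enumerate(order_xml_text.splitlines(), start=1):
        if any(marker in line for marker in ALLOWED_MARKERS):
            continue  # permitted header line, skip it
        if any(domain in line for domain in domains):
            hits.append((number, line.strip()))
    return hits
```

An empty result for every job's template is consistent with the expectation stated above. Because the match is a plain substring test, any hit should still be inspected by hand.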
Check Byte Limits for the Second Harvest
- Confirm that the stop reason "Max Bytes limit reached" is given for kb.dk, dbc.dk and kum.dk.
- Confirm that "Domain Completed" is given for kaareogtrine.dk, trineogkaare.dk and kaarefc.dk.
- Confirm that oernehoej.dk and statsbiblioteket.dk are not found in the "Domain" column in any job.
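The three confirmations above can be recorded as a simple comparison. The following is a minimal sketch, assuming the tester transcribes the "Domain" and "Stopped due to" columns from the job pages into a dictionary by hand; the function name `check_second_harvest` is hypothetical and nothing here reads from the NAS GUI itself.

```python
EXPECTED_STOP_REASONS = {
    "kb.dk": "Max Bytes limit reached",
    "dbc.dk": "Max Bytes limit reached",
    "kum.dk": "Max Bytes limit reached",
    "kaareogtrine.dk": "Domain Completed",
    "trineogkaare.dk": "Domain Completed",
    "kaarefc.dk": "Domain Completed",
}
# Domains that must not appear in the "Domain" column of any job.
ABSENT_DOMAINS = ("oernehoej.dk", "statsbiblioteket.dk")


def check_second_harvest(observed):
    """observed maps domain -> stop reason as transcribed from the job pages."""
    problems = []
    for domain, expected in EXPECTED_STOP_REASONS.items():
        actual = observed.get(domain)
        if actual != expected:
            problems.append(f"{domain}: expected {expected!r}, got {actual!r}")
    for domain in ABSENT_DOMAINS:
        if domain in observed:
            problems.append(f"{domain}: should not appear in any job")
    return problems
```

An empty list means the transcribed results match the expectations listed above.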
Check that there was no Deduplication
Using a browser set up for ViewerProxy access, check the processors-report for one of the snapshot-harvest jobs. Confirm that there is no DeDuplicator report with a "Duplicate found" line.
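If the processors-report is saved to a file, the search for the "Duplicate found" line can be scripted. This is a minimal sketch under that assumption; the function name `duplicate_lines` is hypothetical, and the match is deliberately case-insensitive in case the report capitalises the phrase differently.

```python
def duplicate_lines(report_text):
    """Return every line of the report that contains 'Duplicate found' (any letter case)."""
    return [
        line.strip()
        for line in report_text.splitlines()
        if "duplicate found" in line.lower()
    ]
```

An empty list is the expected outcome for this test; any returned line indicates that deduplication took place and should be investigated.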
Stop the Test and Clean-Up
stop_test.sh
cleanup_all_test.sh