Test snapshot harvesting in detail and subsequent follow-up harvesting
None special.
Prepare Installation
On test@kb-prod-udv-001.kb.dk:
export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
prepare_test.sh deploy_config_dedup_disabled.xml
Check Domain Statistics
Go to the GUI and
- Check that the installation initially has 17 domains loaded
- Check that you can search for domains by name or wildcard
- Check that you can create a new domain
- Add a new configuration to an existing domain. Confirm that it is listed in the domain definition.
- Add to a domain the list of crawlertraps at http://kb-prod-udv-001.kb.dk/cvsweb/cvsweb.cgi/~checkout~/projects/webarkivering/documents/internal/crawlertrapsCollection.txt?rev=1.1;content-type=text/plain . There should be no errors.
Global Crawler Traps
- Download the file http://kb-prod-udv-001.kb.dk/cvsweb/cvsweb.cgi/~checkout~/projects/webarkivering/documents/internal/crawlertrapsCollection.txt?rev=1.1;content-type=text/plain and upload it as a global-crawler-trap list.
- Download the list again from the GUI and compare it with the original list. If necessary, use "sort -u" to remove any duplicates and ensure that the two versions are ordered identically.
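The comparison in the last step can be sketched as a small shell function. This is only an illustration: the function name and the file names in the usage line are hypothetical, and it assumes both copies of the list have been saved locally as plain text.

```shell
# Compare two crawler-trap lists, ignoring ordering and duplicate entries.
# Succeeds (exit status 0) when both files contain the same set of lines;
# otherwise prints the diff and returns a non-zero status.
same_trap_list() {
    sorted_original=$(mktemp) || return 2
    sorted_downloaded=$(mktemp) || return 2
    sort -u "$1" > "$sorted_original"
    sort -u "$2" > "$sorted_downloaded"
    diff "$sorted_original" "$sorted_downloaded"
    rc=$?
    rm -f "$sorted_original" "$sorted_downloaded"
    return $rc
}
```

For example (hypothetical file names): same_trap_list crawlertrapsCollection.txt downloaded_traps.txt && echo "lists match"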
Update Byte Limits
Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:
kb.dk | 100000 |
statsbiblioteket.dk | 100001 |
netarkivet.dk | 100002 |
dbc.dk | 100003 |
bs.dk | 100004 |
sulnudu.dk | 100005 |
Add Alias Domain
- Using the GUI, set netarkivet.dk to be an alias of kb.dk. Confirm that it is listed on the "Alias Summary" page.
- Now try to make dbc.dk an alias of netarkivet.dk. This should fail because chains of aliases are not allowed.
Start a Snapshot Harvest
- Create and activate a snapshot harvest with a "Max number of bytes per domain" of 1000000.
- Wait until HarvestJobManager on the Status page shows that jobs have been created for the harvest.
Check that Alias Domain is not Harvested
For each job generated, check that netarkivet.dk is not included in the domains harvested.
Check that Jobs Complete as Expected
Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:
bs.dk | Domain Completed |
kum.dk | Max Bytes limit reached |
oernhoej.dk | Domain Completed |
drive-badmintonklub.dk | Max Bytes limit reached |
sulnudu.dk | Domain-config limit reached |
statsbiblioteket.dk | Domain-config limit reached |
kb.dk | Domain-config limit reached |
raeder.dk | Max Bytes limit reached |
trinekc.dk | Max Bytes limit reached |
kaarefc.dk | Domain Completed |
kaareogtrine.dk | Max Bytes limit reached |
dbc.dk | Max Bytes limit reached |
olsen2.dk | Domain Completed |
slothchristensen.dk | Max Bytes limit reached |
pligtaflevering.dk | Max Bytes limit reached |
trineogkaare.dk | Max Bytes limit reached |
sy-jonna.dk | Max Bytes limit reached |
If any of these have a different reason, investigate to see if the new stop reason makes sense.
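If the expected and the observed stop reasons are both written to text files, the check can be sketched as below. This is an assumption-laden illustration: the function name is hypothetical, and both files are assumed to hold one "domain | stop reason" pair per line, typed in by hand from the GUI.

```shell
# Report domains whose observed stop reason differs from the expected one.
# $1: expected file, $2: observed file, each with lines "domain | stop reason".
# Only domains present in both files are compared; prints nothing if all match.
report_stop_reason_changes() {
    awk -F ' *[|] *' '
        NR == FNR { expected[$1] = $2; next }
        ($1 in expected) && expected[$1] != $2 {
            printf "%s: expected \"%s\", got \"%s\"\n", $1, expected[$1], $2
        }
    ' "$1" "$2"
}
```

Any line printed by the function names a domain whose new stop reason should be investigated.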
Add a New Alias
Set sulnudu.dk to be an alias of kb.dk
Change a Byte Limit
For kb.dk set the Maximum Number of Bytes on defaultconfig to -1 (no limit).
Start a Second Phase Snapshot Harvest
Create a new snapshot harvest with a 'Max number of bytes per domain' of 5MB, harvesting domains not completed in the previous harvest.
Pause a Harvest via Heritrix
Under "currently running jobs" there is a link to the relevant Heritrix GUI. Click on this, log in (admin/adminPassword) and pause the job. (If there are firewall problems, it might be necessary to log in to the harvester machine and start a web browser on that machine instead.)
Go to the NAS GUI. In the System Overview for the relevant HarvestController confirm that there is a "Paused" message in the log.
Edit the Job and Restart
- In the Heritrix GUI, change some parameters for the domain netarkivet.dk e.g. max-hops 15 and delay-factor 1.5
- Click on "Resume" on the Heritrix Console
- Confirm that the job is running again in the NAS System overview.
Check the Overrides are Applied
When the job is finished, go to the QA interface and check the order template for the job as listed in the reports (or log in to the bitarchive and look directly in the metadata arcfile). Check that the overrides are visible. The easiest way to do this is from test@kb-prod-udv-001:
[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001 grep max-hops /netarkiv/0001/TEST2/filedir/<jobno>-metadata-1.arc
[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001 grep delay-factor /netarkiv/0001/TEST2/filedir/<jobno>-metadata-1.arc
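Both settings can also be checked in one pass. The following is a minimal sketch, not part of the test tooling: the function name is hypothetical, and it assumes the metadata file content is piped in on standard input.

```shell
# Read metadata content on stdin and check that every setting name given as
# an argument occurs somewhere in it. Prints one line per name and returns
# a non-zero status if any name is missing.
check_overrides() {
    input=$(mktemp) || return 2
    cat > "$input"
    missing=0
    for name in "$@"; do
        if grep -q -- "$name" "$input"; then
            echo "found: $name"
        else
            echo "MISSING: $name"
            missing=1
        fi
    done
    rm -f "$input"
    return $missing
}
```

It could be fed over ssh, for example by running "cat" on the same metadata file as in the commands above and piping the result into: check_overrides max-hops delay-factor. Note that this only confirms that the names occur; the values (15 and 1.5) still have to be checked by eye.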
Restart The System
- Stop and Restart NAS. After some time, a job should appear in the state "Failed".
- Resubmit the job. A new job should be created.
- Wait for the job to finish.
Check that Alias Domains are not Harvested
For each job generated in the last harvest:
- Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
- Check that neither domain appears in the order templates for any of the jobs with the exception of the following lines:
<map name="http-headers">
.
<string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127+http://netarkivet.dk/website/info.html)</string>
<string name="from"> netarkivet-svar@netarkivet.dk </string>
</map>
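The order-template check can be sketched as a filter over the template text. This is a hypothetical helper under two assumptions: the order template is piped in on standard input, and the user-agent and from values each sit on the same line as their opening string tag.

```shell
# Read an order template on stdin and print every line that mentions one of
# the alias domains, except the user-agent and from header lines.
# Succeeds when no such line remains.
alias_mentions_ok() {
    hits=$(grep -E 'netarkivet\.dk|sulnudu\.dk' \
        | grep -v -E 'name="(user-agent|from)"')
    if [ -n "$hits" ]; then
        printf '%s\n' "$hits"
        return 1
    fi
    return 0
}
```

Any line it prints is an unexpected reference to an alias domain and should be examined.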
Check Byte Limits for the Second Harvest
- Confirm that the stop reason "Max Bytes limit reached" is given for kb.dk, dbc.dk and kum.dk.
- Confirm that "Domain Completed" is given for kaareogtrine.dk, trineogkaare.dk and kaarefc.dk.
- Confirm that oernhoej.dk and statsbiblioteket.dk are not found in the "Domain" column in any job.
Check that there was no Deduplication
Using a browser set up for ViewerProxy access, check the processors-report for one of the snapshot-harvest jobs. Confirm that there is no DeDuplicator report with a "Duplicate found" line.
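If the processors-report is saved to a local file, the same check can be sketched with grep. The function name and the saved-file workflow are assumptions for illustration; the test itself only requires the check in the browser.

```shell
# Check a saved processors-report for signs of deduplication.
# $1: path to the saved report. Succeeds when no "Duplicate found" line exists.
no_dedup_in_report() {
    if grep -q 'Duplicate found' "$1"; then
        echo "deduplication detected in $1"
        return 1
    fi
    echo "no deduplication lines in $1"
}
```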
Stop the Test and Clean-Up