Test snapshot harvesting in detail and subsequent follow-up harvesting
None special.
On test@kb-prod-udv-001.kb.dk:
export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
stop_test.sh
cleanup_all_test.sh
prepare_test.sh deploy_config_dedup_disabled.xml
install_test.sh
start_test.sh
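The setup commands above could be collected in one function so that a failing step stops the sequence. This is only a sketch: the name setup_test_env and the RUN dry-run variable are illustrative and not part of the test scripts, and PORT is left to the operator because its value is not fully given here.

```shell
#!/bin/sh
# Hypothetical wrapper around the setup steps listed above.
# Set RUN=echo to print the commands instead of executing them (dry run).
RUN=${RUN:-}

setup_test_env() {
    export TESTX=TEST2
    export MAILRECEIVERS=foo@bar.dk
    # PORT must be exported by the operator before calling this function.

    # Each step runs only if the previous one succeeded.
    $RUN stop_test.sh &&
    $RUN cleanup_all_test.sh &&
    $RUN prepare_test.sh deploy_config_dedup_disabled.xml &&
    $RUN install_test.sh &&
    $RUN start_test.sh
}
```

Running it with RUN=echo first shows the commands that would be executed without touching the test installation.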
Go to the GUI and update the "Maximum number of bytes" for the defaultconfig of six domains as follows:
Domain | Maximum number of bytes |
---|---|
kb.dk | 100000 |
statsbiblioteket.dk | 100001 |
netarkivet.dk | 100002 |
dbc.dk | 100003 |
bs.dk | 100004 |
sulnudu.dk | 100005 |
For each job generated, check that netarkivet.dk is not included in the domains harvested.
Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:
Domain | Stopped due to |
---|---|
bs.dk | Domain Completed |
kum.dk | Max Bytes limit reached |
oernhoej.dk | Domain Completed |
drive-badmintonklub.dk | Max Bytes limit reached |
sulnudu.dk | Domain-config byte limit reached |
statsbiblioteket.dk | Domain-config byte limit reached |
kb.dk | Domain-config byte limit reached |
raeder.dk | Max Bytes limit reached |
trinekc.dk | Max Bytes limit reached |
kaarefc.dk | Domain Completed |
kaareogtrine.dk | Max Bytes limit reached |
dbc.dk | Domain-config byte limit reached |
olsen2.dk | Domain Completed |
slothchristensen.dk | Max Bytes limit reached |
pligtaflevering.dk | Max Bytes limit reached |
trineogkaare.dk | Max Bytes limit reached |
sy-jonna.dk | Max Bytes limit reached |
If any of these have a different reason, investigate whether the new stop reason makes sense. Generally speaking, domains with "Domain Completed" will have significantly under 1000000 bytes harvested, while the others will be somewhat, but not vastly, over this limit.
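If the observed stop reasons are typed or pasted into a text file, the comparison against the expected list can be automated. The sketch below assumes two tab-separated files with one "domain, stop reason" pair per line. That file format and the function name compare_stop_reasons are assumptions made for illustration, since the GUI is not known to export this list directly.

```shell
#!/bin/sh
# Hypothetical helper: report domains whose observed stop reason differs
# from the expected one. Both files are assumed to contain lines of the
# form "<domain><TAB><stop reason>".
compare_stop_reasons() {
    expected_file=$1
    observed_file=$2
    awk -F '\t' '
        # First file: remember the expected reason for each domain.
        NR == FNR { expected[$1] = $2; next }
        # Second file: flag domains that are not in the expected list.
        !($1 in expected) { printf "UNEXPECTED %s: %s\n", $1, $2; next }
        # Flag domains whose reason differs from the expected one.
        expected[$1] != $2 {
            printf "MISMATCH %s: expected \"%s\", got \"%s\"\n", $1, expected[$1], $2
        }
    ' "$expected_file" "$observed_file"
}
```

No output means every observed domain matched; each reported line is a candidate for the manual investigation described above.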
Set sulnudu.dk to be an alias of kb.dk
For kb.dk set the Maximum Number of Bytes on defaultconfig to -1 (no limit).
Create a new snapshot harvest with a 'Max number of bytes per domain' of 5MB, harvesting domains not completed in the previous harvest.
Under "currently running jobs" there is a link to the relevant Heritrix GUI. Click on this (admin/adminPassword) and pause the job. (If there are firewall problems, it might be necessary to log in to the harvester machine and start a web browser on that machine instead.)
Go to the NAS GUI. In the System Overview for the relevant HarvestController, confirm that there is a "Paused" message in the log. (The log message can be delayed by up to 15 minutes.)
For the failed job, check that the overrides are visible. The easiest way to do this is from test@kb-prod-udv-001:
[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001 grep max-hops /netarkiv/0001/TEST2/filedir/<jobno>-metadata-1.arc
[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001 grep delay-factor /netarkiv/0001/TEST2/filedir/<jobno>-metadata-1.arc
(Note that there should be two setup/order reports. The one containing a timestamp in its name is the original order.xml; the one called simply metadata://netarkivet.dk/crawl/setup/order.xml is the final modified version.)
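Both override settings can also be checked in one pass on a local copy of the metadata file. This is a minimal sketch, assuming the file has already been copied to the current machine; the function name check_overrides is illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: report whether each override setting occurs in a
# local copy of a <jobno>-metadata-1.arc file.
check_overrides() {
    metadata_file=$1
    status=0
    for setting in max-hops delay-factor; do
        # grep -q only sets the exit status, so it also copes with
        # the non-text parts of an ARC file.
        if grep -q -- "$setting" "$metadata_file"; then
            echo "found: $setting"
        else
            echo "MISSING: $setting"
            status=1
        fi
    done
    return $status
}
```

The function returns a non-zero status if either setting is absent. It only confirms that the setting names occur somewhere in the file, so the values still need to be inspected by hand.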
For each job generated in the last harvest, check that the metadata file contains the following http-headers settings:
<map name="http-headers">
.
<string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127+http://netarkivet.dk/website/info.html)</string>
<string name="from"> netarkivet-svar@netarkivet.dk </string>
</map>
This can be done by grepping with a command like
[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001 grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.arc | grep -v 'metadata:'
or by scp'ing the metadata file to kb-prod-udv-001 and inspecting it with "less".
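When the metadata files have been copied to a local directory, a short loop can list the files that lack the expected "from" address, so that only those need to be opened with "less". This is a sketch under that assumption; the function name list_files_missing_from_header is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every *-metadata-1.arc file in a directory
# that does not contain the expected "from" address.
list_files_missing_from_header() {
    dir=$1
    for f in "$dir"/*-metadata-1.arc; do
        # Skip the unexpanded pattern when no file matches.
        [ -e "$f" ] || continue
        # -F treats the address as a fixed string, not a regular expression.
        grep -qF 'netarkivet-svar@netarkivet.dk' "$f" || echo "$f"
    done
}
```

An empty result means every metadata file in the directory mentions the address. The user-agent string still has to be checked separately.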
Using a browser set up for ViewerProxy access, check the processors-report for one of the snapshot-harvest jobs. Confirm that there was no DeDuplicator report with a "Duplicate found" line.
stop_test.sh
cleanup_all_test.sh