Test snapshot harvesting in detail and subsequent follow-up harvesting
On devel@kb-prod-udv-001.kb.dk:
export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
##stop_test.sh
##cleanup_all_test.sh
## Disable deduplication after the installation (in the settings for harvestjobmanager or by using a dedup_disabled H3 template)
##prepare_test.sh deploy_config_dedup_disabled.xml
all_test.sh
Override default_order_xml with default_orderxml_nodedup.xml
Here PORT=807? corresponds to the test port allocated on the Setup DK test environment page, and MAILRECEIVERS=foo@bar.dk
should be replaced with your own mail address.
Go to the GUI on http://kb-test-adm-001.kb.dk:8074/HarvestDefinition/ (replace port number as needed) and
diff <(sort -u crawlertrapsCollection.txt ) <(sort -u crawlertrap ) (note that any empty lines are also removed during upload)
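Because empty lines are dropped during upload, filtering them out on both sides before comparing avoids spurious differences. The following is a minimal sketch of that comparison, not part of the test scripts: `normalize_traps` and `compare_traps` are hypothetical helper names, and temporary files are used instead of process substitution so the sketch also runs in a plain POSIX shell.

```shell
# Hypothetical helper: drop empty lines, then sort and de-duplicate.
normalize_traps() {
    grep -v '^$' "$1" | sort -u
}

# Hypothetical helper: compare two crawler trap lists after normalisation.
# Exit status 0 means the two lists are equivalent.
compare_traps() {
    left=$(mktemp) || return 2
    right=$(mktemp) || { rm -f "$left"; return 2; }
    normalize_traps "$1" > "$left"
    normalize_traps "$2" > "$right"
    diff "$left" "$right"
    status=$?
    rm -f "$left" "$right"
    return "$status"
}
```

It would be called with the local list and the list copied back from the GUI, for example `compare_traps crawlertrapsCollection.txt exported.txt`, where `exported.txt` is an assumed file name.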
Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:
kb.dk | 100000 |
statsbiblioteket.dk | 100001 |
netarkivet.dk | 100002 |
dbc.dk | 100003 |
bs.dk | 100004 |
sulnudu.dk | 100005 |
For each job generated, check that netarkivet.dk is not included in the domains harvested.
Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:
bs.dk | Domain Completed |
kum.dk | Max Bytes limit reached |
oernhoej.dk | Domain Completed |
drive-badmintonklub.dk | Max Bytes limit reached |
sulnudu.dk | Domain-config byte limit reached |
statsbiblioteket.dk | Domain-config byte limit reached |
kb.dk | Domain-config byte limit reached |
raeder.dk | Domain Completed |
trinekc.dk | Max Bytes limit reached |
kaarefc.dk | Domain Completed |
kaareogtrine.dk | Max Bytes limit reached |
dbc.dk | Domain-config byte limit reached |
unknown001.dk | Domain Completed |
slothchristensen.dk | Max Bytes limit reached |
pligtaflevering.dk | Max Bytes limit reached |
trineogkaare.dk | Max Bytes limit reached |
sy-jonna.dk | Max Bytes limit reached |
If any of these has a different stop reason, investigate whether the new reason makes sense. Generally speaking, domains with "Domain Completed" will have significantly under 1000000 bytes harvested, while the others will be somewhat, but not vastly, over this limit.
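Checking seventeen domains by eye is error-prone, so the comparison can be scripted once the observed stop reasons have been copied out of the GUI. The sketch below assumes two hand-made text files with one `domain<TAB>reason` pair per line; the file layout and the helper name `check_stop_reasons` are assumptions for illustration, not something the test environment provides.

```shell
# Hypothetical helper: report domains whose observed stop reason differs
# from the expected one. $1 = expected file, $2 = observed file; both hold
# "domain<TAB>reason" lines. Exit status 1 means a mismatch was found.
check_stop_reasons() {
    awk -F '\t' '
        BEGIN { bad = 0 }
        NR == FNR { expected[$1] = $2; next }
        !($1 in expected) { print "unexpected domain: " $1; bad = 1; next }
        expected[$1] != $2 {
            print $1 ": expected \"" expected[$1] "\", got \"" $2 "\""
            bad = 1
        }
        END { exit bad }
    ' "$1" "$2"
}
```

A mismatch reported this way is only a prompt to investigate; as noted above, a different stop reason can still be legitimate.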
Set sulnudu.dk to be an alias of kb.dk
For kb.dk set the Maximum Number of Bytes on defaultconfig to -1 (no limit).
Create a new snapshot harvest with a 'Max number of bytes per domain' of 5MB (i.e., 5000000 bytes), harvesting domains not completed in the previous harvest.
In "Harvest status -> Running jobs" the Host field is a link to the relevant Heritrix GUI. Click on this (admin/adminPassword) and you should see the Heritrix 3 GUI:
Click on the job name near the bottom next to status <<Active: RUNNING>>:
Click on "Pause" to pause the job and then "Scripting Console".
The H3 scripts should be downloaded from their own GitHub repository at https://github.com/netarchivesuite/heritrix3-scripts/blob/master/src/main/java/nas.groovy . Copy and paste the script into the script box and choose "groovy" in the drop-down menu.
As the above example shows, enter your own initials in the box and uncomment the call to listFrontier(). Click on execute:
The current state of the frontier is shown. Now execute deleteFromFrontier() with a regex to match some, but not all, of the urls.
The output shows how many urls were removed. Now go back to the job page in the Heritrix GUI and unpause it.
Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should see a log line describing your action, similar to
2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'
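Note that the dots in a pattern such as '.*kb.dk.*' are unescaped, so each one matches any character and the pattern can hit more URLs than intended. A candidate regex can be tried out locally before it is passed to deleteFromFrontier(). This is only a rough sketch: `preview_matches` is a hypothetical helper, the sample URLs are made up, and `grep -E -x` (whole-line matching) is assumed to be a close enough stand-in for how the Groovy script applies the regex.

```shell
# Hypothetical helper: print the URLs (one per line on stdin) that the
# given extended regex matches in full, as a dry run before deleting.
preview_matches() {
    grep -E -x -- "$1"
}

# Made-up sample URLs for illustration only.
sample_urls='http://www.kb.dk/a
http://kbxdk.example/b
http://dbc.dk/c'

# Unescaped dots: also matches the kbxdk.example URL.
printf '%s\n' "$sample_urls" | preview_matches '.*kb.dk.*'

# Escaped dot: matches only the kb.dk URL.
printf '%s\n' "$sample_urls" | preview_matches '.*kb\.dk.*'
```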
For the failed job, check that the overrides are visible by looking at the order.xml in the metadata reports, reached via the "Browse reports for jobs" link.
For each job generated in the last harvest:
<map name="http-headers">
.
<string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127+http://netarkivet.dk/website/info.html)</string>
<string name="from">netarkivet-svar@netarkivet.dk</string>
</map>
This can be done by grepping with a command like
[devel@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001.statsbiblioteket.dk grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.warc | grep -v 'metadata:'
or by scp'ing the metadata file to kb-prod-udv-001 and inspecting it with "less".
"Browse reports for jobs" link.
stop_test.sh
cleanup_all_test.sh