
...

Excerpt
Test snapshot harvesting in detail and subsequent follow-up harvesting

...

On devel@kb-prod-udv-001.kb.dk:

Code Block

export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
##stop_test.sh
##cleanup_all_test.sh
## Disable deduplication after the installation (in the settings for harvestjobmanager or by using a dedup_disabled H3 template)
##prepare_test.sh deploy_config_dedup_disabled.xml
all_test.sh

Override default_order_xml with default_orderxml_nodedup.xml

Here PORT=807? corresponds to the test port allocated on the Setup DK test environment page, and MAILRECEIVERS=foo@bar.dk should be replaced with your own mail address. Start a standard test installation with TESTX=TEST6 as described on the Release Test homepage.
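Because the placeholder port 807? is easy to leave unedited, a small pre-flight check can be run before all_test.sh. The following is only a sketch: the function name check_test_env is hypothetical and is not part of the NetarchiveSuite test scripts.

```shell
# Hypothetical helper, not part of the NetarchiveSuite test scripts.
# Returns non-zero if a required variable is unset or PORT is not a number.
check_test_env() {
    rc=0
    for name in TESTX PORT MAILRECEIVERS; do
        # Look up the value of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            rc=1
        fi
    done
    # The placeholder 807? must have been replaced by a real port number
    case "${PORT:-}" in
        ''|*[!0-9]*)
            echo "PORT must be a number, got '${PORT:-}'" >&2
            rc=1 ;;
    esac
    return "$rc"
}
```

It could be used as "check_test_env && all_test.sh" after the export lines.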

Check Domain Statistics

Go to the GUI on http://kb-test-adm-001.kb.dk:8074/HarvestDefinition/ (replace port number as needed) and

...

Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:

kb.dk	100000
statsbiblioteket.dk	100001
netarkivet.dk	100002
dbc.dk	100003
bs.dk	100004
sulnudu.dk	100005

Add Alias Domain

Using the GUI, set netarkivet.dk to be an alias of kb.dk.
1. Go to the edit page of the domain 'netarkivet.dk' using HarvestDefinition/Definitions-find-domains.jsp
2. In the "Alias of" field, type 'kb.dk'
3. Click 'save'
Confirm that it is listed on the "Alias Summary" page.
Now try to make dbc.dk an alias of netarkivet.dk. This should fail because chains of aliases are not allowed.
1. Go to the edit page of the domain 'dbc.dk' using HarvestDefinition/Definitions-find-domains.jsp
2. In the "Alias of" field, type 'netarkivet.dk'
3. Click 'save'. An error message should be shown on screen:
  "Cannot make domain 'dbc.dk' an alias of 'netarkivet.dk', as that domain is already an alias of 'kb.dk'"
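
The rule being tested here (no chains of aliases) can be illustrated with a toy model. This is a sketch only: the ALIASES variable and the functions alias_target and add_alias are hypothetical and do not reflect how NetarchiveSuite stores or validates aliases.

```shell
# Toy model: ALIASES holds one "alias target" pair per line (hypothetical).
ALIASES=""

alias_target() {
    # Print the target of domain $1 if it is an alias; print nothing otherwise
    printf '%s\n' "$ALIASES" | awk -v d="$1" '$1 == d { print $2 }'
}

add_alias() {
    domain=$1
    target=$2
    # Refuse if the intended target is itself an alias of another domain
    existing=$(alias_target "$target")
    if [ -n "$existing" ]; then
        echo "Cannot make domain '$domain' an alias of '$target', as that domain is already an alias of '$existing'"
        return 1
    fi
    ALIASES="$ALIASES
$domain $target"
}
```

In this model, "add_alias netarkivet.dk kb.dk" succeeds, after which "add_alias dbc.dk netarkivet.dk" is rejected with the message quoted above.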

...

Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:

...

 bs.dk                   Domain Completed
 kum.dk                  Max Bytes limit reached
 oernhoej.dk             Domain Completed
 drive-badmintonklub.dk  Max Bytes limit reached
 sulnudu.dk              Domain-config byte limit reached
 statsbiblioteket.dk     Domain-config byte limit reached
 kb.dk                   Domain-config byte limit reached (currently blocked so Domain Completed)
 raeder.dk               Domain Completed
 trinekc.dk              Max Bytes limit reached
 kaarefc.dk              Domain Completed
 kaareogtrine.dk         Max Bytes limit reached
 dbc.dk                  Domain-config byte limit reached
 unknown001.dk           Domain Completed
 slothchristensen.dk     Max Bytes limit reached
 pligtaflevering.dk      Max Bytes limit reached
 trineogkaare.dk         Max Bytes limit reached
 sy-jonna.dk             Max Bytes limit reached

If any of these have a different reason, investigate to see if the new stop reason makes sense. Generally speaking, domains with "Domain Completed" will have significantly under 1000000 bytes harvested, while the others will be somewhat, but not vastly, over this limit.
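
The rule of thumb above can be written down as a quick sanity check. This is a rough sketch: the function check_stop_reason is hypothetical, the byte count has to be read off the GUI by hand, and the limit is passed in explicitly because the applicable limit differs between domains. It does not attempt to judge what counts as "vastly" over the limit.

```shell
# Hypothetical helper: flag stop reasons that do not fit the bytes harvested.
# $1 = stop reason as shown in the GUI
# $2 = bytes harvested for the domain
# $3 = byte limit that applied to the domain
check_stop_reason() {
    if [ "$1" = "Domain Completed" ]; then
        # A completed domain is expected to stay under the limit
        if [ "$2" -lt "$3" ]; then echo ok; else echo investigate; fi
    else
        # Any byte-limit stop reason implies the limit was reached
        if [ "$2" -ge "$3" ]; then echo ok; else echo investigate; fi
    fi
}
```

For example, check_stop_reason "Domain Completed" 52000 1000000 prints ok, while the same reason with 1200000 bytes prints investigate.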

Add a New Alias

...

Set sulnudu.dk to be an alias of kb.dk
...

Test NAS Heritrix Integration

In "Harvest status -> Running jobs" click on the little "Edit" icon next to the hostname.

Click on the CrawlLog link and check that you can page through the crawl log.

Click on the Frontier queue link. This is not yet implemented.

Test the NAS Heritrix Groovy Scripts

In "Harvest status -> Running jobs" the Host field is a link to the relevant Heritrix GUI. Click on this, add a security exception if necessary, log in (admin/adminPassword), and you should see the Heritrix 3 GUI:

(Note: if you have trouble connecting to the GUI because of firewall or routing issues, you can always log in to the harvester machine and use X remote display to start Firefox on the harvester machine.)

Click on the job name near the bottom next to status <<Active: RUNNING>>:

Click on "Pause" to pause the job and then "Scripting Console".

The H3 scripts should be downloaded from their own GitHub repository at https://github.com/netarchivesuite/heritrix3-scripts/blob/master/src/main/java/nas.groovy. Copy and paste the script into the script box and choose "groovy" in the drop-down menu.

As the above example shows, enter your own initials in the box and uncomment the call to listFrontier(). Click on execute:

(The current state of the frontier is shown. Now execute deleteFromFrontier() with a regex to match some, but not all, of the urls.)

Actually, this is a bad idea, as deleteFromFrontier is likely to leave the job in a state where it has to be terminated. So just skip this step.

The output shows how many urls were removed. Now go back to the job page in the Heritrix GUI and unpause it.

Update (2021-03-12): Deduplication is now enabled by default. To disable it, set the deduplication.enabled setting to false in the IndexServer settings on kb-test-acs-001.

In "Harvest status > All Running Jobs" wait until one of the jobs has started. Click on the Job ID to enter H3 Remote Access.

Pause the job.

Check the following functionality:

Progression/Queues
Crawllog: check cache update, filtering, paging
Reports: click on at least two of them. Check that the Processors report shows that deduplication has been disabled.
Show/delete frontier: delete some items from the frontier
Add RejectRules: add a new rule
Modify budget: add an object limit to some domain or subdomain

Now restart (unpause) the job. Then immediately ...

Restart The System

Stop NAS with the "stop_test.sh" command.
On kb-test-adm-001/TESTX/conf change the settings for HarvestJobManager so that deduplication.enabled is "true".
Restart the NAS system with the "start_test.sh" command. After some time, the job you paused and unpaused should appear in the state "Failed".
Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
Wait for the job to finish.
Meanwhile click on the JobID for the failed job, then click on "Browse reports for jobs".
You should see a list of available reports including one called "scripting_events.log". This is the log of the alterations you made to the frontier via H3 Remote Access. Click on it.
Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should have a log line describing your action, similar to:
Code Block
2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'
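
If the log has been saved to a local file, the number of deleted URIs can be pulled out of lines of this shape with sed. This is a sketch that assumes the line format shown above; the function name deleted_count is hypothetical.

```shell
# Hypothetical helper: read scripting_events.log lines on stdin and print
# the number of URIs reported as deleted, one number per matching line.
deleted_count() {
    sed -n 's/.*Action from user [^:]*: Deleted \([0-9][0-9]*\) uris matching regex.*/\1/p'
}
```

For the example line above, piping it through deleted_count prints 563.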

[ This section is commented out because none of the current NAS-H3 scripts override the crawler-bean settings.

Check the Overrides are Applied

For the failed job check that the overrides are visible by looking at the order.xml in the metadata reports:

Set up the viewerproxy as described in Setup DK test environment.
Go to the Job details page for the newly finished job by clicking the link in the JobID column.
Click the Browse reports for jobs link.
Go to the metadata://netarkivet.dk/crawl/setup/order.xml report and confirm the modified settings have been updated in the final version of the order.xml.

...

Check the Processors report. It should report a non-zero number of objects that have been deduplicated.

Check that Alias Domains are not Harvested

...

Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
Check that neither domain appears in the harvest template's crawler beans for any of the jobs, with the (possible) exception of the following lines:

...

Code Block
[devel@kb-prod-udv-001 ~]$ ssh netarkdv@sb-test-bar-001.statsbiblioteket.dk grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.warc \| grep -v 'metadata:'
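
In the command above the pipe is escaped (\|), so it is passed through ssh and the second grep also runs on the remote host. The same two-stage filter can be tried on a local copy of a metadata file. This is a sketch: the function name non_metadata_hits is hypothetical, and it uses grep -F so that the dots in the domain name are matched literally, which differs slightly from the plain grep used above.

```shell
# Hypothetical helper: list lines mentioning a domain, excluding lines that
# contain 'metadata:' (such as metadata:// record URIs).
# $1 = domain to look for, remaining arguments = files to search
non_metadata_hits() {
    domain=$1
    shift
    grep -F -- "$domain" "$@" | grep -v 'metadata:'
}
```

No output means the domain only occurs on metadata: lines, or not at all.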

...

Confirm that the stop reason "Max Bytes limit reached" or "Domain Completed" is given for all the domains included.
Confirm that domains for which the domain-config byte limit was reached in the previous harvest (e.g. dbc.dk) are not present in any job in this harvest. The exception to this is the one domain for which you changed the domain limit to unlimited (kb.dk), which should be included and which should now reach "Max Bytes limit reached" and show about 5MB harvested.

[ No longer valid. We now include DeDuplication in TEST2. Check that there was no Deduplication

Go to the Job details page for the newly finished job by clicking the link in the JobID column.
Click the Browse reports for jobs link.
Confirm that there was no DeDuplicator report, e.g. verify that the string duplicatereductionjob doesn't appear in the listed reports. ]

Stop the Test and Clean-Up

...

Version	Old Version 59	New Version Current
Changes made by	Colin Samuel Rosenthal	Rasmus Bohl Kristensen
Saved on	Oct 27, 2016	Mar 18, 2021
