Goals

...

Excerpt

Test

...

snapshot harvesting in detail and subsequent follow-up harvesting   



Prerequisites

None special.

...

Prepare Installation

On devel@kb-prod-udv-001.kb.dk, start a standard test installation as described on the Release Test homepage:

export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
stop_test.sh
cleanup_all_test.sh
prepare_test.sh deploy_config_dedup_disabled.xml
install_test.sh
start_test.sh

Check Domain Statistics

Go to the GUI on http://kb-test-adm-001.kb.dk:8074/HarvestDefinition/ (replace the port number as needed) and:

  1. Check that the installation initially has 16 domains loaded
  2. Check that you can search for domains by name or wildcard
  3. Check that you can create a new domain (add unknown001.dk, which does not exist yet)
  4. Add a new configuration to an existing domain. Use non-default values for Max Hops, Honour robots and Extract Javascript. Confirm that it is listed in the domain definition with the correct attributes (you might need to click on "show unused configurations").

Global Crawler Traps

  1. Download the file crawlertrapsCollection.txt and upload it as a global-crawler-trap list.
  2. Download the list again from the GUI and compare it with the original list. If necessary, use "sort -u" to remove any duplicates and ensure that the two versions are ordered identically (note that any empty lines are also removed during upload). A sample Unix command line: diff <(sort -u crawlertrapsCollection.txt) <(sort -u crawlertrap)
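
The comparison in step 2 can also be sketched as a small shell helper that strips empty lines as well as duplicates before diffing, since empty lines are dropped during upload. The function names normalise_traps and compare_traps are hypothetical and are not part of the NAS scripts; this is only a sketch, assuming both lists are available as local files.

```shell
# Hypothetical helper (not part of the NAS scripts): normalise a crawler-trap
# list by dropping empty or whitespace-only lines, then sorting and removing
# duplicates.
normalise_traps() {
    grep -v '^[[:space:]]*$' "$1" | sort -u
}

# Compare the original list ($1) with the copy downloaded from the GUI ($2).
# Prints the differences, if any. Returns 0 when the lists match and a
# non-zero status otherwise.
compare_traps() {
    a=$(mktemp) && b=$(mktemp) || return 2
    normalise_traps "$1" > "$a"
    normalise_traps "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return $status
}

# Example call (file names as used in the step above):
#   compare_traps crawlertrapsCollection.txt crawlertrap && echo "lists match"
```

Temporary files are used instead of process substitution so that the sketch also runs in a plain POSIX shell.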

Update Byte Limits

Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:

 kb.dk                100000
 statsbiblioteket.dk  100001
 netarkivet.dk        100002
 dbc.dk               100003
 bs.dk                100004
 sulnudu.dk           100005

Add Alias Domain

  1. Using the GUI, set netarkivet.dk to be an alias of kb.dk.
    1. Go to the edit page of the domain 'netarkivet.dk' using HarvestDefinition/Definitions-find-domains.jsp
    2. In the "Alias of" field, type 'kb.dk'
    3. Click 'save'
  2. Confirm that it is listed on the "Alias Summary" page.
  3. Now try to make dbc.dk an alias of netarkivet.dk. This should fail because chains of aliases are not allowed.
    1. Go to the edit page of the domain 'dbc.dk' using HarvestDefinition/Definitions-find-domains.jsp
    2. In the "Alias of" field, type 'netarkivet.dk'
    3. Click 'save'. An error message should be shown on screen:
      "Cannot make domain 'dbc.dk' an alias of 'netarkivet.dk', as that domain is already an alias of 'kb.dk'"

Start a Snapshot Harvest

  1. Create and activate a snapshot harvest with a "Max number of bytes per domain" of 1000000.
  2. Wait until HarvestJobManager on the Status page shows that jobs have been created for the harvest.

Check that Alias Domain is not Harvested

For each job generated, check that netarkivet.dk is not included in the domains harvested.

Check that Jobs Complete as Expected

Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:


 bs.dk                   Domain Completed
 kum.dk                  Max Bytes limit reached
 oernhoej.dk             Domain Completed
 drive-badmintonklub.dk  Max Bytes limit reached
 sulnudu.dk              Domain-config byte limit reached
 statsbiblioteket.dk     Domain-config byte limit reached
 kb.dk                   Domain-config byte limit reached (currently blocked so Domain Completed)
 raeder.dk               Domain Completed
 trinekc.dk              Max Bytes limit reached
 kaarefc.dk              Domain Completed
 kaareogtrine.dk         Max Bytes limit reached
 dbc.dk                  Domain-config byte limit reached
 unknown001.dk           Domain Completed
 slothchristensen.dk     Max Bytes limit reached
 pligtaflevering.dk      Max Bytes limit reached
 trineogkaare.dk         Max Bytes limit reached
 sy-jonna.dk             Max Bytes limit reached

If any of these have a different reason, investigate whether the new stop reason makes sense. Generally speaking, domains with "Domain Completed" will have significantly under 1000000 bytes harvested, while the others will be somewhat, but not vastly, over their limit.
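
If the observed stop reasons are copied by hand from the GUI into a text file, the comparison against the table above can be sketched in shell as follows. The function name report_mismatches and the two-file layout (domain name, whitespace, stop reason) are assumptions made for illustration. NAS does not produce such a file itself.

```shell
# Hypothetical helper: compare observed stop reasons against the expected ones.
# Both files contain one domain per line: the domain name, whitespace, then
# the stop reason (the rest of the line). Only domains listed in the observed
# file are checked.
# Usage: report_mismatches expected.txt observed.txt
report_mismatches() {
    awk '
        NF == 0 { next }                      # skip empty lines
        { domain = $1; $1 = ""; sub(/^[ \t]+/, "") }   # $0 is now the reason
        NR == FNR { expected[domain] = $0; next }      # first file: expected
        !(domain in expected) {
            print domain ": not in expected list"; bad = 1; next
        }
        expected[domain] != $0 {
            print domain ": expected \"" expected[domain] "\", got \"" $0 "\""
            bad = 1
        }
        END { exit bad + 0 }                  # non-zero if anything differed
    ' "$1" "$2"
}
```

Any domain it reports still needs the manual investigation described above. The sketch only narrows down which domains to look at.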

Add a New Alias

Set sulnudu.dk to be an alias of kb.dk

Change a Byte Limit

For kb.dk set the Maximum Number of Bytes on defaultconfig to -1 (no limit).

Start a Second Phase Snapshot Harvest

Create a new snapshot harvest with a 'Max number of bytes per domain' of 5 MB (i.e., 5000000 bytes), harvesting only the domains not completed in the previous harvest.

Test NAS Heritrix Integration

Update (2021-03-12): Deduplication is now enabled by default. To disable it set the deduplication.enabled setting to false in the IndexServer settings on kb-test-acs-001.

In "Harvest status > All Running Jobs", wait until one of the jobs has started. Click on the Job ID to enter H3 Remote Access.

Pause the job. 

Check the following functionality:

  • Progression/Queues
  • Crawllog: check cache update, filtering, paging
  • Reports: click on at least two of them. Check that the Processors report shows that deduplication has been disabled.
  • Show/delete frontier: delete some items from the frontier
  • Add RejectRules: add a new rule
  • Modify budget: add an object limit to some domain or subdomain

Now restart (unpause) the job. Then immediately ...

Restart the System

  1. Stop NAS with the "stop_test.sh" command.
  2. In kb-test-adm-001/TESTX/conf, change the settings for HarvestJobManager so that deduplication.enabled is "true".
  3. Restart the NAS system with the "start_test.sh" command. After some time, the job you paused and unpaused should appear in the state "Failed".
  4. Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
  5. Wait for the job to finish.
  6. Meanwhile click on the JobID for the failed job, then click on "Browse reports for jobs".
  7. You should see a list of available reports including one called "scripting_events.log". This is the log of alterations you made to the frontier via H3 Remote Access. Click on it.
  8. Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should see a log line describing your action, similar to:

    Code Block
    2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'


  9. Check the Processors report. It should show a non-zero number of objects that have been deduplicated.

Check that Alias Domains are not Harvested

For each job generated in the last harvest:

  1. Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
  2. Check that neither domain appears in the harvest template's crawler beans for any of the jobs with the (possible) exception of the following lines:
Code Block
metadata.operatorContactUrl=http://netarkivet.dk/webcrawler/
metadata.operatorFrom=info@netarkivet.dk

This can be done by grepping with a command like

Code Block
[devel@kb-prod-udv-001 ~]$ ssh netarkdv@sb-test-bar-001.statsbiblioteket.dk grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.warc | grep -v 'metadata:'

or by scp'ing the metadata file to kb-prod-udv-001 and inspecting it with "less". (Or just by displaying the order template in the NAS GUI and searching.)
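
Once a metadata file has been copied locally, the check for both alias domains can be sketched as a small shell function. The function name check_alias_absent and the local file name in the example call are hypothetical. Filtering out the metadata.operator lines, in addition to the 'metadata:' filter from the command above, is an assumption based on the exceptions listed earlier.

```shell
# Hypothetical helper: verify that none of the given alias domains appears in
# a local copy of a metadata file, ignoring the lines that are allowed to
# mention them.
# Usage: check_alias_absent FILE DOMAIN...
check_alias_absent() {
    file=$1
    shift
    status=0
    for domain in "$@"; do
        # -F treats the domain as a fixed string, so "." is not a wildcard.
        hits=$(grep -F -- "$domain" "$file" \
            | grep -v -e 'metadata:' -e 'metadata\.operator' || true)
        if [ -n "$hits" ]; then
            echo "Unexpected references to $domain:"
            echo "$hits"
            status=1
        fi
    done
    return $status
}

# Example call (the local file name is an assumption):
#   check_alias_absent 1-metadata-1.warc netarkivet.dk sulnudu.dk
```

A zero exit status means that no unexpected references were found. Any output lists the lines to inspect by hand.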

Check Byte Limits for the Second Harvest

  1. Confirm that the stop reason "Max Bytes limit reached" or "Domain Completed" is given for all the domains included.
  2. Confirm that domains for which the domain-config byte limit was reached in the previous harvest (e.g. dbc.dk) are not present in any job in this harvest. The exception is the one domain whose limit you changed to unlimited (kb.dk), which should be included, should now reach "Max Bytes limit reached", and should show about 5 MB harvested.

Stop the Test and Clean-Up

stop_test.sh

cleanup_all_test.sh