...

Test snapshot harvesting in detail and subsequent follow-up harvesting   

 


...

Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:

...


 bs.dk                   Domain Completed
 kum.dk                  Max Bytes limit reached
 oernhoej.dk             Domain Completed
 drive-badmintonklub.dk  Max Bytes limit reached
 sulnudu.dk              Domain-config byte limit reached
 statsbiblioteket.dk     Domain-config byte limit reached
 kb.dk                   Domain-config byte limit reached (currently blocked so Domain Completed)
 raeder.dk               Domain Completed
 trinekc.dk              Max Bytes limit reached
 kaarefc.dk              Domain Completed
 kaareogtrine.dk         Max Bytes limit reached
 dbc.dk                  Domain-config byte limit reached
 unknown001.dk           Domain Completed
 slothchristensen.dk     Max Bytes limit reached
 pligtaflevering.dk      Max Bytes limit reached
 trineogkaare.dk         Max Bytes limit reached
 sy-jonna.dk             Max Bytes limit reached
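
Once the observed values have been copied out of the GUI, the comparison against this list can be scripted. A minimal sketch in Python, assuming the observed stop reasons have been collected by hand into a dictionary; the function name `find_mismatches` and the input format are illustrative and not part of NetarchiveSuite:

```python
# Expected "Stopped due to" values, copied from the list above.
# Each domain maps to the set of values that count as a pass.
EXPECTED = {
    "bs.dk": {"Domain Completed"},
    "kum.dk": {"Max Bytes limit reached"},
    "oernhoej.dk": {"Domain Completed"},
    "drive-badmintonklub.dk": {"Max Bytes limit reached"},
    "sulnudu.dk": {"Domain-config byte limit reached"},
    "statsbiblioteket.dk": {"Domain-config byte limit reached"},
    # The list notes that kb.dk is currently blocked and so reports
    # "Domain Completed"; both values are accepted here.
    "kb.dk": {"Domain-config byte limit reached", "Domain Completed"},
    "raeder.dk": {"Domain Completed"},
    "trinekc.dk": {"Max Bytes limit reached"},
    "kaarefc.dk": {"Domain Completed"},
    "kaareogtrine.dk": {"Max Bytes limit reached"},
    "dbc.dk": {"Domain-config byte limit reached"},
    "unknown001.dk": {"Domain Completed"},
    "slothchristensen.dk": {"Max Bytes limit reached"},
    "pligtaflevering.dk": {"Max Bytes limit reached"},
    "trineogkaare.dk": {"Max Bytes limit reached"},
    "sy-jonna.dk": {"Max Bytes limit reached"},
}


def find_mismatches(observed, expected=EXPECTED):
    """Return {domain: (acceptable_values, observed_value)} for each failure.

    `observed` maps a domain name to the stop reason read from the GUI.
    A domain missing from `observed` is reported with the value None.
    """
    mismatches = {}
    for domain, acceptable in expected.items():
        got = observed.get(domain)
        if got not in acceptable:
            mismatches[domain] = (acceptable, got)
    return mismatches
```

An empty result means every domain stopped for an accepted reason; any entry in the result names a domain to inspect in the Harvest Status section.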

...

Test NAS Heritrix Integration

Update (2021-03-12): Deduplication is now enabled by default. To disable it, set the deduplication.enabled setting to false in the IndexServer settings on kb-test-acs-001.

In "Harvest status > All Running Jobs", wait until one of the jobs has started and shows nonzero progress. Click on the Job ID to enter H3 Remote Access.

...

  • Progression/Queues
  • Crawllog: check cache update, filtering, paging
  • Reports: click on at least two of them. Check that the Processors report shows that deduplication has been disabled.
  • Show/delete frontier: delete some items from the frontier
  • Add RejectRules: add a new rule
  • Modify budget: add an object limit to some domain or subdomain

Now restart (unpause) the job. Then immediately ...

...

  1. Stop NAS with the "stop_test.sh" command 
  2. On kb-test-adm-001/TEST2TESTX/conf change the settings for HarvestJobManager so deduplication.enabled is "true"
  3. Restart the NAS system with the "start_test.sh" command. After some time, the job you paused and unpaused should appear in the state "Failed".

     
  4. Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
  5. Wait for the job to finish.
  6. Meanwhile click on the JobID for the failed job, then click on "Browse reports for jobs".
  7. You should see a list of available reports including one called "scripting_events.log". This is the log of alterations you made to the frontier via H3 Remote Access. Click on it.
  8. Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should have a log line describing your action, something similar to:

    Code Block
    2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'


  9. Check the Processors report. It should show a non-zero number of objects that have been deduplicated.
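
The log line in step 8 has a regular shape, so its presence can also be checked mechanically. A minimal sketch in Python, assuming every entry follows the format of the single example shown in step 8 (timestamp, "Action from user", user name, message). The pattern and the function name `parse_event` are illustrative and have not been checked against other entries in scripting_events.log:

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the one example line in step 8; other entries may differ.
EVENT_RE = re.compile(
    r"^(?P<timestamp>\S+) Action from user (?P<user>[^:]+): (?P<message>.*)$"
)


def parse_event(line):
    """Split one log line into timestamp, user and message.

    Returns None if the line does not match the assumed format.
    """
    match = EVENT_RE.match(line.strip())
    if match is None:
        return None
    try:
        # The example timestamp is UTC with millisecond precision.
        stamp = datetime.strptime(match["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None
    return {
        "timestamp": stamp.replace(tzinfo=timezone.utc),
        "user": match["user"],
        "message": match["message"],
    }


example = (
    "2016-02-02T14:05:59.170Z Action from user CSR: "
    "Deleted 563 uris matching regex '.*kb.dk.*'"
)
event = parse_event(example)
```

Here `event["user"]` and `event["message"]` can be compared with the user name and the frontier action performed earlier in H3 Remote Access.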

...

  1. Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
  2. Check that neither domain appears in the harvest template's crawler beans for any of the jobs, with the possible exception of the following lines:

...

Code Block
[devel@kb-prod-udv-001 ~]$ ssh netarkiv@sbnetarkdv@sb-test-bar-001.statsbiblioteket.dk grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.warc | grep -v 'metadata:'
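
The same two-stage filter can be applied to metadata content that has already been copied to a local machine. A minimal sketch in Python, assuming the lines are available as plain text; the function name `suspicious_lines` and the sample lines are invented for illustration and are not real harvest output:

```python
def suspicious_lines(lines, domain, ignore_marker="metadata:"):
    """Keep lines that mention `domain` but do not contain `ignore_marker`.

    This mirrors `grep DOMAIN | grep -v 'metadata:'`, except that the
    domain is matched as a literal substring rather than as a regular
    expression (in grep, the dot in the domain matches any character).
    """
    return [
        line.rstrip("\n")
        for line in lines
        if domain in line and ignore_marker not in line
    ]


# Invented sample input, only to show the filtering behaviour.
sample = [
    "metadata://netarkivet.dk/crawl/setup/example\n",
    "http://netarkivet.dk/some/page\n",
    "http://example.invalid/other\n",
]
hits = suspicious_lines(sample, "netarkivet.dk")
```

Any line left in `hits` mentions the domain outside the excluded metadata entries and should be inspected by hand.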

...