Test snapshot harvesting in detail and subsequent follow-up harvesting

 

Prepare Installation

On test@kb-prod-udv-001.kb.dk:

export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
stop_test.sh
cleanup_all_test.sh
prepare_test.sh deploy_config_dedup_disabled.xml
install_test.sh
start_test.sh

Here PORT=807? corresponds to the test port allocated on the "Setup DK test environment" page, and MAILRECEIVERS=foo@bar.dk should be replaced with your own mail address.
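Before running the scripts, it can help to confirm that the three variables are actually set. The following is a minimal sketch and not part of the test scripts: the function name check_test_env is hypothetical, and the check that PORT has the form 807N is an assumption based on the PORT=807? pattern above.

```shell
# Hypothetical helper: verify the environment expected by the test scripts.
# Assumes PORT is a four-digit number of the form 807N, per the text above.
check_test_env() {
    rc=0
    for name in TESTX PORT MAILRECEIVERS; do
        # Indirect lookup of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            rc=1
        fi
    done
    case "${PORT:-}" in
        807[0-9]) ;;                                   # looks like a test port
        "") ;;                                         # already reported above
        *) echo "unexpected PORT: $PORT"; rc=1 ;;
    esac
    return $rc
}
```

Calling check_test_env after the three export lines prints nothing and returns 0 when everything looks plausible. Otherwise it prints one line per problem and returns 1.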

Check Domain Statistics

Go to the GUI on http://kb-test-adm-001.kb.dk:8074/HarvestDefinition/ (replace port number as needed) and

  1. Check that the installation initially has 17 domains loaded.
  2. Check that you can search for domains by name or wildcard.
  3. Check that you can create a new domain (add unknown001.dk, which does not exist).
  4. Add a new configuration to an existing domain. Confirm that it is listed in the domain definition. (You might need to click "show unused configurations".)

Global Crawler Traps

  1. Download the file crawlertrapsCollection.txt and upload it as a global-crawler-trap list.
  2. Download the list again from the GUI and compare it with the original list. If necessary, use "sort -u" to remove any duplicates and ensure that the two versions are ordered identically. A sample Unix command line:
     diff <(sort -u crawlertrapsCollection.txt) <(sort -u crawlertrap)

Update Byte Limits

Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:

 kb.dk                   100000
 statsbiblioteket.dk     100001
 netarkivet.dk           100002
 dbc.dk                  100003
 bs.dk                   100004
 sulnudu.dk              100005

Add Alias Domain

  1. Using the GUI, set netarkivet.dk to be an alias of kb.dk.
    1. Go to the edit page of the domain 'netarkivet.dk' via HarvestDefinition/Definitions-find-domains.jsp.
    2. In the "Alias of" field, type 'kb.dk'.
    3. Click 'save'.
  2. Confirm that it is listed on the "Alias Summary" page.
  3. Now try to make dbc.dk an alias of netarkivet.dk. This should fail because chains of aliases are not allowed.
    1. Go to the edit page of the domain 'dbc.dk' via HarvestDefinition/Definitions-find-domains.jsp.
    2. In the "Alias of" field, type 'netarkivet.dk'.
    3. Click 'save'. An error message should be shown on screen:
      "Cannot make domain 'dbc.dk' an alias of 'netarkivet.dk', as that domain is already an alias of 'kb.dk'"
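The "no chains" rule can be illustrated with a small sketch. This is not how NAS implements the check; it only mirrors the behaviour described above. The function name can_alias and the alias file format (one "alias target" pair per line) are assumptions made for the illustration.

```shell
# Hypothetical illustration of the "no alias chains" rule.
# alias_file holds one "alias target" pair per line (an assumed format).
can_alias() {
    domain=$1 target=$2 alias_file=$3
    # If the target is itself an alias, find what it points to.
    existing=$(awk -v t="$target" '$1 == t { print $2; exit }' "$alias_file")
    if [ -n "$existing" ]; then
        echo "Cannot make domain '$domain' an alias of '$target', as that domain is already an alias of '$existing'"
        return 1
    fi
    return 0
}
```

With a file containing the single line "netarkivet.dk kb.dk", the call can_alias dbc.dk netarkivet.dk aliases.txt prints the error message quoted above and returns 1.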

Start a Snapshot Harvest

  1. Create and activate a snapshot harvest with a "Max number of bytes per domain" of 1000000.
  2. Wait until HarvestJobManager on the Status page shows that jobs have been created for the harvest.

Check that Alias Domain is not Harvested

For each job generated, check that netarkivet.dk is not included in the domains harvested.

Check that Jobs Complete as Expected

Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:

 
 bs.dk                   Domain Completed
 kum.dk                  Max Bytes limit reached
 oernhoej.dk             Domain Completed
 drive-badmintonklub.dk  Max Bytes limit reached
 sulnudu.dk              Domain-config byte limit reached
 statsbiblioteket.dk     Domain-config byte limit reached
 kb.dk                   Domain-config byte limit reached
 raeder.dk               Max Bytes limit reached
 trinekc.dk              Max Bytes limit reached
 kaarefc.dk              Domain Completed
 kaareogtrine.dk         Max Bytes limit reached
 dbc.dk                  Domain-config byte limit reached
 unknown001.dk           Domain Completed
 slothchristensen.dk     Max Bytes limit reached
 pligtaflevering.dk      Max Bytes limit reached
 trineogkaare.dk         Max Bytes limit reached
 sy-jonna.dk             Max Bytes limit reached

If any of these have a different reason, investigate whether the new stop reason makes sense. Generally speaking, domains with "Domain Completed" will have significantly fewer than 1000000 bytes harvested, while the others will be somewhat, but not vastly, over this limit.
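If the actual stop reasons are copied into a text file in the same "domain  reason" layout as the table above, the comparison can be sketched as follows. The function name diff_stop_reasons and both file names are hypothetical; how the actual listing is obtained from the GUI is left open.

```shell
# Hypothetical helper: report domains whose stop reason differs from the
# expected one. Both files use the "domain  reason" layout of the table above.
diff_stop_reasons() {
    expected_file=$1 actual_file=$2
    awk '
        NF == 0 { next }                      # skip blank lines
        {
            # First field is the domain; the rest of the line is the reason.
            domain = $1
            reason = $0
            sub(/^[ \t]*[^ \t]+[ \t]+/, "", reason)
            sub(/[ \t]+$/, "", reason)
        }
        NR == FNR { expected[domain] = reason; next }   # first file only
        {
            if (!(domain in expected))
                print domain ": not in expected list (" reason ")"
            else if (expected[domain] != reason)
                print domain ": expected \"" expected[domain] "\", got \"" reason "\""
        }
    ' "$expected_file" "$actual_file"
}
```

Usage: diff_stop_reasons expected.txt actual.txt. Empty output means every listed domain stopped for the expected reason; each printed line is a candidate for investigation.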

Add a New Alias

Set sulnudu.dk to be an alias of kb.dk

Change a Byte Limit

For kb.dk, set the "Maximum number of bytes" on defaultconfig to -1 (no limit).

Start a Second Phase Snapshot Harvest

Create a new snapshot harvest with a "Max number of bytes per domain" of 5 MB (i.e., 5000000 bytes), harvesting only the domains not completed in the previous harvest.

Pause a Harvestjob in the Heritrix GUI

Under "currently running jobs" there is a link to the relevant Heritrix GUI. Click on it (credentials: admin/adminPassword) and pause the job. (If there are firewall problems, it might be necessary to log in to the harvester machine and start a web browser on that machine instead.)

Go to the NAS GUI. In the System Overview for the relevant HarvestController, confirm that there is a "Paused" message in the log. (The log message can be delayed by up to 15 minutes.) Also, in the running jobs list, the job should be marked as paused with a red bullet.

Edit the Job and Resume job

  1. In the Heritrix GUI, add an override for the domain netarkivet.dk, setting max-hops to 15 and delay-factor to 1.5.
  2. Click on "Resume" on the Heritrix Console
  3. Confirm that the job is running again in the NAS System overview.

Restart The System

  1. Stop and restart NAS. After some time, a job should appear in the state "Failed".
  2. Restart the job. A new job should be created.
  3. Wait for the job to finish.

Check the Overrides are Applied

For the failed job, check that the overrides are visible. The easiest way to do this is from test@kb-prod-udv-001:

[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001.statsbiblioteket.dk grep max-hops /netarkiv/0001/TEST2/filedir/<jobno>-metadata-1.warc
[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001.statsbiblioteket.dk grep delay-factor /netarkiv/0001/TEST2/filedir/<jobno>-metadata-1.warc
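If the metadata file has been copied locally, both checks can be combined. This is a minimal sketch: the function name has_overrides is hypothetical, and it assumes a local, uncompressed copy of the metadata file.

```shell
# Hypothetical helper: confirm both override settings occur in a metadata file.
# Works on a local copy; for the remote file, use the ssh commands above.
has_overrides() {
    metadata_file=$1
    rc=0
    for setting in max-hops delay-factor; do
        # -q: no output, -F: fixed string, -e: the pattern follows
        if grep -q -F -e "$setting" "$metadata_file"; then
            echo "$setting: found"
        else
            echo "$setting: NOT found"
            rc=1
        fi
    done
    return $rc
}
```

The function returns 0 only when both settings occur somewhere in the file. It does not check that the values are 15 and 1.5, so the matching lines should still be inspected by eye.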

CLARIFY - WHAT DOES THIS MEAN?: (Note that there should be two setup/order reports. The one containing a timestamp in its name is the original order.xml; the one called simply metadata://netarkivet.dk/crawl/setup/order.xml is the final modified version.)

Check that Alias Domains are not Harvested

For each job generated in the last harvest:

  1. Check that netarkivet.dk and sulnudu.dk are not listed as being harvested.
  2. Check that neither domain appears in the order templates for any of the jobs with the (possible) exception of the following lines:
<map name="http-headers">
  <string name="user-agent">Mozilla/5.0 (compatible; heritrix/1.5.0-200506132127+http://netarkivet.dk/website/info.html)</string>
  <string name="from"> netarkivet-svar@netarkivet.dk </string>
</map>

This can be done by grepping with a command like

[test@kb-prod-udv-001 ~]$ ssh netarkiv@sb-test-bar-001.statsbiblioteket.dk grep netarkivet.dk /netarkiv/0001/TEST2/filedir/*-metadata-1.warc | grep -v 'metadata:'

or by scp'ing the metadata file to kb-prod-udv-001 and inspecting it with "less".
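For a local copy of the metadata file, the filtering can be sketched as below. The function name alias_mentions is hypothetical. Excluding lines by the strings name="user-agent" and name="from" is an assumption based on the http-headers lines quoted above.

```shell
# Hypothetical helper: list lines in a metadata file that mention a domain,
# ignoring "metadata:" record URIs and the http-headers lines quoted above.
alias_mentions() {
    domain=$1 metadata_file=$2
    grep -F -e "$domain" "$metadata_file" |
        grep -v -e 'metadata:' -e 'name="user-agent"' -e 'name="from"'
}
```

Run it once per alias domain, for example alias_mentions sulnudu.dk followed by the name of the local metadata file. Empty output means no unexpected mention of the domain was found; any line printed needs a closer look.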

Check Byte Limits for the Second Harvest

  1. Confirm that the stop reason "Max Bytes limit reached" or "Domain Completed" is given for all the domains included.
  2. Confirm that oernehoej.dk and statsbiblioteket.dk are not found in the "Domain" column of any job.

Check that there was no Deduplication

CLARIFY: Using a browser set up for ViewerProxy access, check the processors-report for one of the snapshot-harvest jobs. Confirm that there is no DeDuplicator report containing a "Duplicate found" line.

Stop the Test and Clean-Up

stop_test.sh

cleanup_all_test.sh
