Goals
Test snapshot harvesting in detail, along with the subsequent follow-up harvesting.
Prerequisites
None special.
Procedure
Prepare Installation
On test@kb-prod-udv-001.kb.dk:
export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
stop_test.sh
cleanup_all_test.sh
prepare_test.sh deploy_config_dedup_disabled.xml
install_test.sh
start_test.sh
Check Domain Statistics
Go to the GUI and
- Check that the installation initially has 17 domains loaded
- Check that you can search for domains by name or wildcard
- Check that you can create a new domain
- Add a new configuration to an existing domain. Confirm that it is listed in the domain definition.
- Add the list of crawler traps at http://kb-prod-udv-001.kb.dk/cvsweb/cvsweb.cgi/~checkout~/projects/webarkivering/documents/internal/crawlertrapsCollection.txt?rev=1.1;content-type=text/plain to a domain. There should be no errors.
Global Crawler Traps
- Download the file http://kb-prod-udv-001.kb.dk/cvsweb/cvsweb.cgi/~checkout~/projects/webarkivering/documents/internal/crawlertrapsCollection.txt?rev=1.1;content-type=text/plain and upload it as a global-crawler-trap list.
- Download the list again from the GUI and compare it with the original list. If necessary, use "sort -u" to remove any duplicates and to ensure that the two versions are ordered identically.
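The comparison in the last step can be scripted roughly as follows. This is a minimal sketch, not part of the test suite: the function name `same_trap_list` and the two file arguments are assumptions standing in for wherever the original and the re-downloaded list were saved.

```shell
# Compare two crawler-trap lists, ignoring line order and duplicate lines.
# $1 and $2 are hypothetical file names chosen for this sketch.
same_trap_list() {
    original_sorted=$(mktemp)
    downloaded_sorted=$(mktemp)
    # sort -u orders the lines and drops duplicates in one pass;
    # LC_ALL=C forces plain byte-wise ordering regardless of locale.
    LC_ALL=C sort -u "$1" > "$original_sorted"
    LC_ALL=C sort -u "$2" > "$downloaded_sorted"
    # cmp -s prints nothing and exits 0 only if the files are identical.
    cmp -s "$original_sorted" "$downloaded_sorted"
    status=$?
    rm -f "$original_sorted" "$downloaded_sorted"
    return $status
}
```

For example, `same_trap_list original.txt downloaded.txt && echo "lists match"` prints the message only when both lists contain the same set of lines.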
Update Byte Limits
Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:
Domain | Maximum number of bytes |
---|---|
kb.dk | 100000 |
statsbiblioteket.dk | 100001 |
netarkivet.dk | 100002 |
dbc.dk | 100003 |
bs.dk | 100004 |
sulnudu.dk | 100005 |
Add Alias Domain
- Using the GUI, set netarkivet.dk to be an alias of kb.dk. Confirm that it is listed on the "Alias Summary" page.
- Now try to make dbc.dk an alias of netarkivet.dk. This should fail because chains of aliases are not allowed.
Start a Snapshot Harvest
- Create and activate a snapshot harvest with a "Max number of bytes per domain" of 1000000.
- Wait until HarvestJobManager on the Status page shows that jobs have been created for the harvest.
Check that Alias Domain is not Harvested
For each job generated, check that netarkivet.dk is not included in the domains harvested.
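If the domain list of a job is copied into a text file, the check can be sketched as below. The function name `alias_not_harvested` and the one-domain-per-line file format are assumptions made for illustration; the GUI itself remains the reference.

```shell
# Exit 0 when the alias domain does NOT appear in the job's domain list.
# $1 is the alias domain; $2 is a hypothetical file with one domain per line.
alias_not_harvested() {
    alias_domain=$1
    job_domains_file=$2
    # Refuse to report success if the list cannot be read at all.
    [ -r "$job_domains_file" ] || return 2
    # -x matches whole lines only, -F treats the name as a literal string,
    # so "netarkivet.dk" does not match e.g. "www-netarkivet.dk".
    if grep -qxF -- "$alias_domain" "$job_domains_file"; then
        return 1
    fi
    return 0
}
```

Running `alias_not_harvested netarkivet.dk job_domains.txt || echo "alias was harvested"` for each job flags any job that wrongly includes the alias.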
Check that Jobs Complete as Expected
Use the Harvest Status section of the GUI to monitor the jobs. When all jobs have finished, check each job in turn to see that the domains report their "Stopped due to" as follows:
Domain | Stopped due to |
---|---|
bs.dk | Domain Completed |
kum.dk | Max Bytes limit reached |
oernhoej.dk | Domain Completed |
drive-badmintonklub.dk | Max Bytes limit reached |
sulnudu.dk | Domain-config limit reached |
statsbiblioteket.dk | Domain-config limit reached |
kb.dk | Domain-config limit reached |
raeder.dk | Max Bytes limit reached |
trinekc.dk | Max Bytes limit reached |
kaarefc.dk | Domain Completed |
kaareogtrine.dk | Max Bytes limit reached |
dbc.dk | Max Bytes limit reached |
olsen2.dk | Domain Completed |
slothchristensen.dk | Max Bytes limit reached |
pligtaflevering.dk | Max Bytes limit reached |
trineogkaare.dk | Max Bytes limit reached |
sy-jonna.dk | Max Bytes limit reached |
If any domain reports a different reason, investigate whether the new stop reason makes sense.