...
On devel@kb-prod-udv-001.kb.dk:
Code Block |
---|
export TESTX=TEST2
export PORT=807?
export MAILRECEIVERS=foo@bar.dk
##stop_test.sh
##cleanup_all_test.sh
## Disable deduplication after the installation (in the settings for harvestjobmanager or by using a dedup_disabled H3 template)
##prepare_test.sh deploy_config_dedup_disabled.xml
all_test.sh
Override default_order_xml with default_orderxml_nodedup.xml |
Here PORT=807? corresponds to the test port allocated on the Setup DK test environment page, and MAILRECEIVERS=foo@bar.dk should be replaced with your own mail address.
Start a standard test installation with TESTX=TEST6 as described on the Release Test homepage.
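Before running the scripts it can be useful to confirm that the three variables are actually set in the current shell. The following is only a minimal sketch and is not part of the release test scripts: the function name check_test_env is hypothetical, and the port check assumes that test ports consist of 807 followed by a single digit, as the placeholder above suggests.

```shell
# Hypothetical helper: check that the variables used by the test scripts are set.
# Written for a POSIX shell; rc, name and value are ordinary (global) variables.
check_test_env() {
    rc=0
    for name in TESTX PORT MAILRECEIVERS; do
        # Look up the value of the variable whose name is stored in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            rc=1
        fi
    done
    # Assumption: test ports are 807 followed by a single digit
    case "${PORT:-}" in
        "" | 807[0-9]) ;;
        *) echo "unexpected port: $PORT"; rc=1 ;;
    esac
    return "$rc"
}
```

Calling check_test_env after the export lines prints nothing and returns 0 when everything is set; otherwise it lists what is missing and returns 1.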
Check Domain Statistics
Go to the GUI on http://kb-test-adm-001.kb.dk:8074/HarvestDefinition/ (replace port number as needed) and
...
Test NAS Heritrix Integration
In "Harvest status > All > All Running Jobs" wait until one of the jobs has nonzero progress. Click on the Job ID to enter H3 Remote Access.
...
- Progression/Queues
- Crawllog: check cache update, filtering, paging
- Reports: click on at least two of them. Check that the Processors report shows that deduplication has been disabled.
- Show/delete frontier: delete some items from the frontier
- Add RejectRules: add a new rule
- Modify budget: add an object limit to some domain or subdomain
...
- Stop NAS with the "stop_test.sh" command.
- On kb-test-adm-001/TEST2/conf change the settings for HarvestJobManager so that deduplication.enabled is "true".
- Restart the NAS system with the "start_test.sh" command. After some time, the job you paused and unpaused should appear in the state "Failed".
- Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
- Wait for the job to finish.
- Meanwhile click on the JobID for the failed job, then click on "Browse reports for jobs".
- You should see a list of available reports including one called "scripting_events.log". This is the log of alterations you made to the frontier via H3 Remote Access. Click on it.
Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should see a log line describing your action, similar to:
Code Block 2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'
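If the log is long, the entries for one user can be filtered out of a saved copy of the report. This is a minimal sketch, assuming the report has been saved to a local file and that every entry has the "Action from user NAME: ..." shape of the example line above; the function name user_actions and the file name are hypothetical.

```shell
# Hypothetical helper: print the actions a given user performed,
# reading a locally saved copy of scripting_events.log.
# Assumes entries look like: <timestamp> Action from user <USER>: <description>
user_actions() {
    user=$1
    logfile=$2
    # -F matches the prefix as a fixed string, so the user name is not treated as a regex
    grep -F "Action from user ${user}: " "$logfile"
}
```

For example, `user_actions CSR scripting_events.log` prints only the lines recorded for the user CSR.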
[ This section is commented out because none of the current NAS-H3 scripts override the crawler-bean settings.
Check the Overrides are Applied
For the failed job check that the overrides are visible by looking at the order.xml in the metadata reports:
- Set up the viewerproxy as described in Setup DK test environment.
- Go to the Job details page for the newly finished job by clicking the link in the JobID column.
- Click the "Browse reports for jobs" link.
- Go to the metadata://netarkivet.dk/crawl/setup/order.xml report and confirm that the modified settings appear in the final version of the order.xml.
]
Check the Processors report. It should show a non-zero number of objects that have been deduplicated.
Check that Alias Domains are not Harvested
...
- Confirm that the stop reason "Max Bytes limit reached" or "Domain Completed" is given for all the domains included.
- Confirm that domains for which the domain-config byte limit was reached in the previous harvest (e.g. dbc.dk) are not present in any job in this harvest. The exception is the one domain for which you changed the domain limit to unlimited.
[ No longer valid. We now include DeDuplication in TEST2. Check that there was no Deduplication
- Go to the Job details page for the newly finished job by clicking the link in the JobID column.
- Click the "Browse reports for jobs" link.
- Confirm that there was no DeDuplicator report, e.g. verify that the string duplicatereductionjob doesn't appear in the listed reports. ]
Stop the Test and Clean-Up
...