...
Update the "Maximum number of bytes" for the defaultconfig for six domains as follows:
kb.dk | 100000 |
statsbiblioteket.dk | 100001 |
netarkivet.dk | 100002 |
dbc.dk | 100003 |
bs.dk | 100004 |
sulnudu.dk | 100005 |
Add Alias Domain
- Using the GUI, set netarkivet.dk to be an alias of kb.dk.
- Go to the edit page for the domain 'netarkivet.dk' using HarvestDefinition/Definitions-find-domains.jsp
- In the "Alias of" field, type 'kb.dk'
- Click 'save'
- Confirm that it is listed on the "Alias Summary" page.
- Now try to make dbc.dk an alias of netarkivet.dk. This should fail because chains of aliases are not allowed.
- Go to the edit page for the domain 'dbc.dk' using HarvestDefinition/Definitions-find-domains.jsp
- In the "Alias of" field, type 'netarkivet.dk'
- Click 'save'. An error message should be shown on screen:
"Cannot make domain 'dbc.dk' an alias of 'netarkivet.dk', as that domain is already an alias of 'kb.dk'"
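The rule being tested here can be sketched in a few lines of Python. This is only an illustration of the "no chains of aliases" check, not the NetarchiveSuite implementation; the names AliasError and set_alias and the dict used to hold aliases are hypothetical. It covers only the case exercised above (the target is already an alias).

```python
class AliasError(ValueError):
    """Raised when an alias assignment would create a chain of aliases."""


def set_alias(aliases, domain, target):
    """Record `domain` as an alias of `target`.

    `aliases` is a dict mapping an alias domain to the domain it points at
    (a hypothetical stand-in for the real domain database).
    """
    if target in aliases:
        # The target is itself an alias, so accepting this would build a chain.
        raise AliasError(
            f"Cannot make domain '{domain}' an alias of '{target}', "
            f"as that domain is already an alias of '{aliases[target]}'"
        )
    aliases[domain] = target
```

With an empty dict, set_alias(aliases, "netarkivet.dk", "kb.dk") succeeds, and a following set_alias(aliases, "dbc.dk", "netarkivet.dk") raises AliasError with the message quoted above.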
...
Test NAS Heritrix Integration
In "Harvest status -> Running jobs" click on the little "Edit" icon next to the hostname.
Click on the CrawlLog link and check that you can page through the crawl log.
Click on the Frontier queue link. This is not yet implemented.
Test the NAS Heritrix Groovy Scripts
In "Harvest status -> Running jobs" the Host field is a link to the relevant Heritrix GUI. Click on this, add a security exception if necessary, and log in (admin/adminPassword). You should see the Heritrix 3 GUI:
(Note: if you have trouble connecting to the GUI because of firewall or routing issues, you can always log in to the harvester machine and use X remote display to start Firefox on the harvester machine.)
Click on the job name near the bottom next to status <<Active: RUNNING>>:
Click on "Pause" to pause the job and then "Scripting Console".
The H3 scripts should be downloaded from their own github repository at https://github.com/netarchivesuite/heritrix3-scripts/blob/master/src/main/java/nas.groovy . Copy and paste the script into the script box and choose "groovy" in the drop-down menu.
As the above example shows, enter your own initials in the box and uncomment the call to listFrontier(). Click on execute:
The current state of the frontier is shown.
(The next step used to be: execute deleteFromFrontier() with a regex to match some, but not all, of the URLs, after which the output shows how many URLs were removed. Skip this step: deleteFromFrontier is likely to leave the job in a state where it has to be terminated.)
Now go back to the job page in the Heritrix GUI and unpause it.
In "Harvest status > All Jobs", wait until one of the jobs has nonzero progress. Click on the Job ID to enter H3 Remote Access.
Pause the job.
Check the following functionality:
- Progression/Queues
- Crawllog: check cache update, filtering, paging
- Reports: click on at least two of them
- Show/delete frontier: delete some items from the frontier
- Add RejectRules: add a new rule
- Modify budget: add an object limit to some domain or subdomain
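Before deleting items from the frontier it can help to check offline which URLs a regex would hit, so that the deletion removes some but not all of them. The following is a minimal sketch, not part of NAS or Heritrix: the function names and sample URLs are hypothetical, and it assumes the regex must match the whole URL (as in the '.*kb.dk.*' style used in this test), which may differ from how the tool applies it.

```python
import re


def preview_frontier_match(urls, pattern):
    """Split `urls` into (matched, kept) for a candidate deletion regex.

    Assumption: the regex has to match the entire URL.
    """
    regex = re.compile(pattern)
    matched = [u for u in urls if regex.fullmatch(u)]
    kept = [u for u in urls if not regex.fullmatch(u)]
    return matched, kept


def is_partial_match(urls, pattern):
    """True when the regex hits some, but not all, of the URLs."""
    matched, kept = preview_frontier_match(urls, pattern)
    return bool(matched) and bool(kept)
```

For example, with the URLs "http://www.kb.dk/" and "http://netarkivet.dk/", the pattern r".*kb\.dk.*" matches only the first, so is_partial_match returns True, while ".*" matches everything and returns False.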
Now restart the job. Then immediately ...
Restart The System
- Stop NAS with the "stop_test.sh" command and restart it with the "start_test.sh" command. After some time, the job you paused and unpaused should appear in the state "Failed".
- Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
- Wait for the job to finish.
- Meanwhile click on the JobID for the failed job, then click on "Browse reports for jobs".
- You should see a list of available reports including one called "scripting_events.log". This is the log of the alterations you made to the frontier via H3 Remote Access. Click on it.
Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should see a log line describing your action, similar to:
2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'
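If you want to check the log line mechanically rather than by eye, a sketch like the following could be used. It assumes every deletion entry in scripting_events.log has exactly the shape of the sample line above; the real log may contain other kinds of entries, for which this function simply returns None. The name parse_scripting_event is hypothetical.

```python
import re
from datetime import datetime, timezone

# Pattern derived from the single sample line in this test description.
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+) Action from user (?P<user>\w+): "
    r"Deleted (?P<count>\d+) uris matching regex '(?P<regex>.*)'$"
)


def parse_scripting_event(line):
    """Return the fields of a deletion log line as a dict, or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The sample timestamp is UTC ("Z" suffix) with millisecond precision.
    ts = datetime.strptime(m["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
    return {
        "timestamp": ts.replace(tzinfo=timezone.utc),
        "user": m["user"],
        "count": int(m["count"]),
        "regex": m["regex"],
    }
```

Applied to the sample line, this yields user "CSR", count 563 and regex ".*kb.dk.*", which you can compare against the initials and regex you entered.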
...