...
bs.dk                     Domain Completed
kum.dk                    Max Bytes limit reached
oernhoej.dk               Domain Completed
drive-badmintonklub.dk    Max Bytes limit reached
sulnudu.dk                Domain-config byte limit reached
statsbiblioteket.dk       Domain-config byte limit reached
kb.dk                     Domain-config byte limit reached (currently blocked so Domain Completed)
raeder.dk                 Domain Completed
trinekc.dk                Max Bytes limit reached
kaarefc.dk                Domain Completed
kaareogtrine.dk           Max Bytes limit reached
dbc.dk                    Domain-config byte limit reached
unknown001.dk             Domain Completed
slothchristensen.dk       Max Bytes limit reached
pligtaflevering.dk        Max Bytes limit reached
trineogkaare.dk           Max Bytes limit reached
sy-jonna.dk               Max Bytes limit reached
If any of these have a different stop reason, investigate whether the new stop reason makes sense. Generally speaking, domains with "Domain Completed" will have significantly under 1000000 bytes harvested, while the others will be somewhat, but not vastly, over this limit.
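The rule of thumb above can be written as a small sanity check. This is only a sketch: the function name `stop_reason_is_plausible` and the `slack` factor (an assumed reading of "somewhat, but not vastly, over") are illustrative and not part of NetarchiveSuite, and the byte counts in the usage lines are made up.

```python
BYTE_LIMIT = 1_000_000  # per-domain byte limit referred to in the text above


def stop_reason_is_plausible(stop_reason, harvested_bytes,
                             limit=BYTE_LIMIT, slack=2.0):
    """Return True if the harvested byte count fits the reported stop reason.

    `slack` is an assumed factor for "somewhat, but not vastly, over" the limit.
    """
    if stop_reason == "Domain Completed":
        # Completed domains are expected to stay under the limit.
        return harvested_bytes < limit
    # Byte-limit stop reasons: over the limit, but not by a large factor.
    return limit < harvested_bytes <= limit * slack


# Hypothetical byte counts, for illustration only.
print(stop_reason_is_plausible("Domain Completed", 120_000))
print(stop_reason_is_plausible("Max Bytes limit reached", 1_050_000))
```

A domain for which the check returns False is not necessarily wrong; it is simply a candidate for the manual investigation described above.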
Add a New Alias
kb.dk is currently blocked, so use statsbiblioteket.dk instead.
Set sulnudu.dk to be an alias of kb.dk
...
Create a new snapshot harvest with a 'Max number of bytes per domain' of 5MB (i.e. 5000000 bytes), harvesting domains not completed in the previous harvest.
Test NAS Heritrix Integration
In "Harvest status -> Running jobs" click on the little "Edit" icon next to the hostname.
Click on the CrawlLog link and check that you can page through the crawl log.
Click on the Frontier queue link. This is not yet implemented.
Test the NAS Heritrix Groovy Scripts
In "Harvest status -> Running jobs" the Host field is a link to the relevant Heritrix GUI. Click on this, add a security exception if necessary, log in (admin/adminPassword), and you should see the Heritrix 3 GUI:
(Note: if you have trouble connecting to the GUI because of firewall or routing issues, you can always log in to the harvester machine and use X remote display to start Firefox on the harvester machine.)
Click on the job name near the bottom next to status <<Active: RUNNING>>:
...
As the above example shows, enter your own initials in the box and uncomment the call to listFrontier(). Click on execute:
(The current state of the frontier is shown. Now execute deleteFromFrontier() with a regex to match some, but not all, of the urls.)
Actually, this is a bad idea, as deleteFromFrontier() is likely to put the job in a state that requires it to be terminated. So just skip this.
The output shows how many urls were removed. Now go back to the job page in the Heritrix GUI and unpause it.
...
- Stop NAS with the "stop_test.sh" command and restart it with the "start_test.sh" command. After some time, the job you paused and unpaused should appear in the state "Failed".
- Restart the job by clicking the "Restart?" button. A new job should be created and the old one should have the status "Resubmitted (Job X)".
- Wait for the job to finish.
- Meanwhile click on the JobID for the failed job, then click on "Browse reports for jobs".
- You should see a list of available reports including one called "scripting_events.log". This is the log of the deletions you made to the frontier in the H3 GUI. Click on it.
Assuming you have the correct viewerproxy setup (see Setup DK test environment), you should see a log line describing your action, something similar to:
2016-02-02T14:05:59.170Z Action from user CSR: Deleted 563 uris matching regex '.*kb.dk.*'
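A log line of this shape can also be checked mechanically. The Python sketch below assumes the format is exactly as in the single example line above (timestamp, user initials, number of deleted URIs, regex); the real "scripting_events.log" may contain other kinds of entries, and the function name `parse_scripting_event` is purely illustrative.

```python
import re

# Pattern inferred from the one example line; treat it as an assumption.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) Action from user (?P<user>\S+): "
    r"Deleted (?P<count>\d+) uris matching regex '(?P<regex>.*)'$"
)


def parse_scripting_event(line):
    """Return the fields of a deletion log line as a dict, or None if no match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match["timestamp"],
        "user": match["user"],
        "count": int(match["count"]),
        "regex": match["regex"],
    }


example = ("2016-02-02T14:05:59.170Z Action from user CSR: "
           "Deleted 563 uris matching regex '.*kb.dk.*'")
event = parse_scripting_event(example)
print(event)
```

Comparing `event["user"]` and `event["regex"]` with the initials and regex you entered in the H3 GUI confirms that the logged action is your own.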
...
- Confirm that the stop reason "Max Bytes limit reached" or "Domain Completed" is given for all the domains included.
- Confirm that oernehoej.dk and statsbiblioteket.dk are not found in the "Domain" column in any of the jobs for the second run. More generally, domains for which the domain-config byte limit was reached in the previous harvest (e.g. dbc.dk) should not be present in any job in this harvest. The exception to this is the one domain for which you changed the domain limit to unlimited.
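The second check above amounts to a set comparison, sketched here in Python. The function name `unexpected_domains` is hypothetical, and so are the sample data: which domains appear in the second harvest, and which domain was set to unlimited, are chosen only for illustration.

```python
LIMIT_REASON = "Domain-config byte limit reached"


def unexpected_domains(previous_stop_reasons, second_harvest_domains, exempt=()):
    """List domains that hit the domain-config limit but were harvested again.

    `exempt` holds the domain(s) whose limit was changed to unlimited.
    """
    limited = {domain for domain, reason in previous_stop_reasons.items()
               if reason == LIMIT_REASON}
    return sorted((limited - set(exempt)) & set(second_harvest_domains))


# Illustrative data only; read the real values from the job pages.
previous = {
    "dbc.dk": LIMIT_REASON,
    "statsbiblioteket.dk": LIMIT_REASON,
    "kum.dk": "Max Bytes limit reached",
}
second_run = ["kum.dk", "statsbiblioteket.dk"]
print(unexpected_domains(previous, second_run, exempt=["statsbiblioteket.dk"]))
```

An empty result means the check passes; any domain listed was harvested again even though its domain-config byte limit had already been reached.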
[ No longer valid. We now include DeDuplication in TEST2. Check that there was no Deduplication
...