Heritrix Control and GUI-console Access
It is possible, while a job is running, to access the Heritrix user interface on the harvester machine.
Start a browser on the harvester machine and use the port specified, e.g. https://my.harvester.machine:8133. The port is defined by the setting settings.harvester.harvesting.heritrix.port.
Links to the Heritrix GUIs of all currently running jobs can also be found on the Running Jobs page.
Enter the administrator name (e.g. "admin") and password (e.g. "adminPassword") as set in the settings.harvester.harvesting.heritrix.adminName and settings.harvester.harvesting.heritrix.adminPassword settings.
See the Installation Manual for how to change settings.
From NetarchiveSuite 5.2 it is also possible to access many of the Heritrix GUI-console functions directly from within NetarchiveSuite. Start by clicking on the Job ID of the job you wish to manage.
The page shows various information on the state of the selected job. The buttons provide access to a range of information about the state of the harvest job as well as a number of possibilities for controlling the job. The lowest row of buttons provides direct access to the pause/unpause, checkpoint, terminate, and teardown commands in the Heritrix engine itself. Similarly, the "Open Scripting console" button allows one to enter scripts (for example in Groovy) to manipulate the state of the currently running job, exactly as one can from the Heritrix console itself. In addition, this section of the NetarchiveSuite GUI has some functionality which is not directly provided in the Heritrix GUI:
Show/filter Crawl Log
Here one can show and filter the crawl log, to see exactly what the job has harvested up to now. The filtering uses standard Java regexes. Clicking on "Update cache" fetches the latest harvested URLs.
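As a rough illustration of this kind of filtering, the sketch below applies a regex to a few made-up crawl-log lines. It is written in Python rather than Java; for simple patterns like these the two regex dialects behave the same. The sample lines, the helper name filter_log, and the assumption that a pattern must match the whole line are all illustrative and not taken from NetarchiveSuite itself.

```python
import re

# Made-up crawl-log lines, abbreviated to a few whitespace-separated fields.
SAMPLE_LOG = [
    "2017-03-01T10:00:00.000Z 200 5120 http://example.org/ - - text/html",
    "2017-03-01T10:00:01.000Z 404 312 http://example.org/missing.png L http://example.org/ text/html",
    "2017-03-01T10:00:02.000Z 200 20480 http://example.com/logo.png E http://example.com/ image/png",
]


def filter_log(lines, pattern):
    """Return the lines that the pattern matches in full (assumed semantics)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.fullmatch(line)]


# Lines with fetch status 404.
not_found = filter_log(SAMPLE_LOG, r".* 404 .*")

# Lines whose harvested URL ends in ".png" (note the escaped dot).
png_hits = filter_log(SAMPLE_LOG, r".*\.png .*")
```

If the filter turns out to search for the pattern anywhere in the line instead, the leading and trailing ".*" are simply redundant and the same patterns still work.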
Frontier Queue
By contrast, the "Frontier Queue" functionality allows one to view and manipulate the URLs queued but not yet harvested within the running Heritrix. The normal workflow here is to use regexes to narrow down the list of URLs to some small subset which is either problematic or simply unwanted, then use the "Delete" button to remove them.
This action places the job in a Paused state, and it must be unpaused from the "Running job" page before harvesting can continue. Heritrix occasionally shows some instability when the frontier is manipulated in this way, so you should keep an eye on any jobs where you have used this functionality in case they enter a dormant state and have to be killed manually. Deletion of URLs from the frontier is logged in the file scripting_events.log, which is included in the metadata file generated at the end of the harvest.
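Because deletion pauses the job and can destabilise the frontier, it may be worth trying a regex offline before pressing "Delete". The sketch below is one way to do that in Python; the queued URLs and the helper name split_queue are hypothetical, and whole-URL matching is an assumption about how the pattern is applied.

```python
import re

# Hypothetical queued URLs; a real frontier would be far larger.
QUEUED = [
    "http://example.org/calendar?month=2031-01",
    "http://example.org/calendar?month=2031-02",
    "http://example.org/articles/1",
    "http://example.org/about",
]


def split_queue(urls, pattern):
    """Partition urls into (would be deleted, would be kept)."""
    regex = re.compile(pattern)
    doomed = [u for u in urls if regex.fullmatch(u)]
    kept = [u for u in urls if not regex.fullmatch(u)]
    return doomed, kept


# Target only the endless calendar pages; "?" is escaped because it is a
# regex metacharacter.
doomed, kept = split_queue(QUEUED, r".*/calendar\?month=.*")
```

Checking that the "kept" list still contains everything you want harvested is a cheap guard against an over-broad pattern.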
Adding Rejection Rules to a Running Harvest
Another way to manipulate a running harvest is to add additional rules specifying URLs to be rejected by the harvester. Click on the "Add Reject Rules" button on the "Details And Actions" page.
On the page that opens, you can add new rejection rules in the form of Java regular expressions, for example ".*baddomain.com/.*".
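One detail worth noting about the example rule is that an unescaped "." in a regex matches any character, so the rule is slightly broader than it looks. The Python sketch below compares the rule as written with a variant where the dot is escaped; the helper name rejected and the test URLs are illustrative, and matching against the whole URL is an assumption.

```python
import re

LOOSE = r".*baddomain.com/.*"    # the example rule from the text
STRICT = r".*baddomain\.com/.*"  # same rule with the dot escaped


def rejected(url, rule):
    """True if the rule matches the whole URL (assumed semantics)."""
    return re.fullmatch(rule, url) is not None


# Both rules reject the intended target.
hit_loose = rejected("http://www.baddomain.com/page", LOOSE)
hit_strict = rejected("http://www.baddomain.com/page", STRICT)

# Only the loose rule rejects a host where "." is replaced by another character.
odd_loose = rejected("http://baddomain-com/page", LOOSE)
odd_strict = rejected("http://baddomain-com/page", STRICT)
```

Both variants also reject hosts that merely end in "baddomain.com", such as "notbaddomain.com", because of the leading ".*". Whether that is desirable depends on the harvest.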
Quota Manipulation
Another way to control the behaviour of a running job is by manipulating the budget or quota allotted to specific domains and hosts. Start by clicking on the "Modify budget" button.
Here we can specify the total number of objects to download from a given domain such as "mydomain.com", or host such as "subhost.mydomain.com". If the number specified is less than the number already harvested then no further objects will be harvested from that domain/host.
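The effect of a new budget can be pictured with a small model. The Python sketch below is not NetarchiveSuite or Heritrix code: the function names are hypothetical, and treating a budget exactly equal to the number already harvested as exhausted is an assumption, since the text only describes the "less than" case.

```python
def remaining_budget(budget, already_harvested):
    """Objects that may still be fetched; never negative."""
    return max(0, budget - already_harvested)


def may_fetch(budget, already_harvested):
    """True while the domain or host still has budget left."""
    return remaining_budget(budget, already_harvested) > 0


# Budget lowered to 500 after 800 objects were already harvested:
# nothing further is fetched from that domain.
lowered = may_fetch(500, 800)

# Budget raised to 2000 with 800 harvested: 1200 objects remain.
raised_left = remaining_budget(2000, 800)
```

In this model, lowering the budget below the current count is effectively a way to stop harvesting a single domain or host without touching the rest of the job.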
Configuring Heritrix Monitoring Dynamically
Click on the "H3 Remote Access" link in the right-hand menu, then click on the "Configure" button on the page that appears.
In the text-area, you can enter a list of host-names of harvesters for which crawl-log caching is enabled. Host-names can be specified as regular expressions. At the bottom of the page you can see a list of which harvesters are included in, and which are excluded from, the crawl-log caching functionality. This feature can be useful if you have many hosts running large jobs, as crawl-log caching can then be slow, and expensive in terms of disk usage and network traffic.
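The included/excluded split can be sketched as follows. This Python example assumes one regular expression per line in the text-area and that a host is included when any pattern matches its full name; both are assumptions about the configuration format, and the host-names and the helper name partition_hosts are made up.

```python
import re


def partition_hosts(hosts, textarea):
    """Split hosts into (included, excluded) for crawl-log caching."""
    # One pattern per non-empty line of the text-area (assumed format).
    patterns = [re.compile(p.strip()) for p in textarea.splitlines() if p.strip()]
    included, excluded = [], []
    for host in hosts:
        if any(p.fullmatch(host) for p in patterns):
            included.append(host)
        else:
            excluded.append(host)
    return included, excluded


HOSTS = [
    "harvester1.example.org",
    "harvester2.example.org",
    "bigjobs.example.org",
]

# Enable caching on the numbered harvesters only.
included, excluded = partition_hosts(HOSTS, "harvester[0-9]+\\.example\\.org\n")
```

Leaving the host that runs the largest jobs out of the list, as here, is the kind of use the text describes for reducing disk usage and network traffic.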
Finally, it should be noted that these powerful Heritrix monitoring and management functions are still somewhat experimental in NetarchiveSuite 5.3. The developers welcome bug reports and suggestions for improvement.