Note that this documentation is for the old 5.2 release.
For the newest documentation, please see the current release documentation.

Heritrix Control and GUI-console Access

It is possible, while a job is running, to access the Heritrix user interface on the harvester machine.
Start a browser on the harvester machine and open the configured Heritrix port, e.g. https://my.harvester.machine:8133. The port is defined by the setting settings.harvester.harvesting.heritrix.port.

Links to the Heritrix GUIs of all currently running jobs can also be found on the Running Jobs page.

Enter the administrator name (e.g. "admin") and password (e.g. "adminPassword") as defined by the settings settings.harvester.harvesting.heritrix.adminName and settings.harvester.harvesting.heritrix.adminPassword.

See the Installation Manual for how to change settings.

From NetarchiveSuite 5.2 onwards, it is also possible to access many of the Heritrix GUI-console functions directly from within NetarchiveSuite. Start by clicking on the gearwheel symbol under the hostname of the job you want to manage.

The page shows information on the state of the selected job. The buttons give access to further details about the harvest job and to a number of ways of controlling it. The four lower buttons provide direct access to the pause/unpause, checkpoint, terminate, and teardown commands in the Heritrix engine itself. Similarly, the "Open Scripting console" button allows one to enter scripts (for example in Groovy) to manipulate the state of the currently running job, exactly as one can from the Heritrix console itself. In addition, this section of the NetarchiveSuite GUI has some functionality which is not directly provided in the Heritrix GUI:

Show/filter Crawl Log

Here one can show and filter the crawl log, to see exactly what the job has harvested so far. The filtering uses standard Java regular expressions. Clicking on "Update cache" fetches the latest harvested URLs.
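To try out a filter expression before using it in the GUI, a small stand-alone Java program can be enough. The following is only a sketch: the class name, the method name and the sample log lines are invented for illustration, and it assumes that a line is shown when the regex matches anywhere in it. Whether the GUI matches the whole line or only part of it is not stated here, so check the result in the GUI as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of NetarchiveSuite.
class CrawlLogFilterSketch {

    /** Returns the lines in which the regex matches somewhere (substring match). */
    static List<String> filter(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample lines, loosely shaped like crawl log entries:
        // timestamp, status code, size, URL, ...
        List<String> log = Arrays.asList(
            "2017-03-01T10:00:00.000Z 200 5120 http://example.org/index.html",
            "2017-03-01T10:00:01.000Z 404 312 http://example.org/missing.png",
            "2017-03-01T10:00:02.000Z 200 20480 http://example.com/logo.png");

        // Select lines whose second whitespace-separated field is 404.
        for (String line : filter(log, "^\\S+\\s+404\\s")) {
            System.out.println(line);
        }
    }
}
```

Anchoring the expression with ^ keeps it from matching a "404" that happens to appear inside a URL or a size field.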

Frontier Queue

By contrast, the "Frontier Queue" functionality allows one to view and manipulate the URLs that are queued but not yet harvested within the running Heritrix. The normal workflow here is to use regexes to narrow down the list of URLs to some small subset which is either problematic or simply unwanted, then use the "Delete" button to remove them.

Deleting URLs in this way places the job in a Paused state, and it must be unpaused from the "Running job" page before harvesting can continue. Heritrix occasionally shows some instability when the frontier is manipulated in this way, so you should keep an eye on any jobs where you have used this functionality, in case they enter a dormant state and have to be killed manually.
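Because deleting from the frontier cannot easily be undone, it can be worth checking offline which URLs a regex would select before pressing "Delete". The sketch below is an illustration only: the class name, the method name and the URLs are invented, and it assumes that the regex has to match the entire URL. If the GUI turns out to match substrings instead, replace matches() with find().

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of NetarchiveSuite or Heritrix.
class FrontierRegexPreview {

    /** Returns the queued URLs that the regex matches in full. */
    static List<String> select(List<String> queuedUrls, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return queuedUrls.stream()
                .filter(url -> pattern.matcher(url).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented queue contents, including a typical "calendar trap".
        List<String> queued = Arrays.asList(
            "http://example.org/calendar/2031/01",
            "http://example.org/calendar/2031/02",
            "http://example.org/about.html",
            "http://other.example/calendar/2031/01");

        // Only the calendar pages on example.org should be selected.
        String regex = "https?://example\\.org/calendar/.*";
        select(queued, regex).forEach(System.out::println);
    }
}
```

Escaping the dot in the host name and spelling out the path prefix keeps the selection narrow, which matters when the matching URLs are about to be removed from a running job.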

Finally, it should be noted that this powerful Heritrix monitoring and management functionality is still somewhat experimental in NetarchiveSuite 5.2. The developers welcome bug reports and suggestions for improvement.