Heritrix 3 Integration in NAS

H3 is a heavily rewritten crawler compared with H1, but the integration with NAS has been designed to make the changeover as transparent as possible. 

The most obvious changes from H1 to H3 are:

  1. Crawl-order files have been completely replaced with crawler-bean (cxml) files.
  2. Communication with, and control of, H3 goes through a REST (https) interface, not JMX as in H1.
  3. The Web GUI is entirely rewritten: some functionality is lost, some has been added.
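To give a feel for the REST interface mentioned in item 2, here is a minimal Python sketch that builds (but does not send) a job-action request. It assumes the job endpoint layout and action names described in the Heritrix 3 REST API documentation; the function name, the engine URL and the job name are made up for illustration, and this is not the code used by heritrix3-wrapper.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Job actions as listed in the Heritrix 3 REST API documentation (assumed here).
JOB_ACTIONS = {"build", "launch", "pause", "unpause",
               "checkpoint", "terminate", "teardown"}


def build_job_action_request(engine_url, job_name, action):
    """Build a POST request asking an H3 engine to perform an action on a job.

    engine_url is a hypothetical base URL such as https://localhost:8443/engine.
    The request is only constructed, never sent.
    """
    if action not in JOB_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    # Percent-encode the job name so it is safe to use as a path segment.
    url = f"{engine_url.rstrip('/')}/job/{quote(job_name, safe='')}"
    body = urlencode({"action": action}).encode("ascii")
    # Ask for an XML response rather than the HTML page.
    return Request(url, data=body,
                   headers={"Accept": "application/xml"}, method="POST")


if __name__ == "__main__":
    req = build_job_action_request("https://localhost:8443/engine",
                                   "example-job", "pause")
    print(req.get_method(), req.full_url, req.data)
```

Actually sending such a request would also need HTTP digest authentication and a decision on how to trust the engine's TLS certificate, both of which are left out of this sketch.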

Code Integration

  1. Communication between NAS and H3 is via REST calls. These calls are implemented in a separate codebase from the main NAS software at https://github.com/netarchivesuite/heritrix3-wrapper .
  2. The H3 harvesting code exists in a maven module org.netarchivesuite.heritrix3 and its three sub-modules
    • heritrix3-controller
    • heritrix3-extensions
    • heritrix3-bundler
  3. The heritrix3-controller module contains the HeritrixLauncher implementation which starts H3, and the HeritrixController module which communicates with the running H3 instance
  4. The heritrix3-extensions module contains custom processors, including our own fork of Kristinn's deduplicator module
  5. The heritrix3-bundler is a maven assembly module which builds H3 + extensions + controller into a .zip file, separately from the main NAS zip-file. This potentially allows more flexibility in handling different H3 versions; for example, we can now more easily upgrade H3 without redeploying the whole of NAS.

New Templates

The biggest challenge in moving to NAS5/H3 is the new template format. We have tried to minimise the disruption by including more configurable parameters in NAS itself, so that fewer templates will be needed. At Netarkivet we have gone down from 52 templates to 7.

H3 Integration in the NAS GUI

The H3 GUI is missing some functionality that many curators find essential:

  1. The ability to list and search in the already-crawled data
  2. The ability to list the current frontier and remove problematic urls
  3. The ability to dynamically alter Quotas and DecideRules on running jobs

All of these can be done using the H3 GUI Scripting Console, but that

  1. Means logging into the relevant heritrix engine and navigating to the console every time, and
  2. Requires curators to modify and paste code into the console window

So let's not do that!

Instead, let's control Heritrix from the NAS GUI ...

We can:

  1. Start/Stop/Pause/Terminate a harvest
  2. View all the reports dynamically
  3. View and search in the crawl-log
  4. View, search, and filter the frontier
  5. Add new regexp exclusions
  6. Alter budgets by host or domain
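
The frontier filtering and regexp exclusions in items 4 and 5 can be pictured with a small Python sketch. It works on a plain list of URLs standing in for the frontier; the function names and sample URLs are hypothetical, and the choice of full-string matching for exclusions is an assumption made for the example, not a description of how NAS or H3 evaluates the patterns.

```python
import re


def filter_frontier(urls, pattern):
    """Return the URLs that contain a match for the regular expression."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


def apply_exclusions(urls, exclusion_patterns):
    """Drop every URL that is fully matched by at least one exclusion pattern."""
    compiled = [re.compile(p) for p in exclusion_patterns]
    return [url for url in urls
            if not any(regex.fullmatch(url) for regex in compiled)]


if __name__ == "__main__":
    # Hypothetical frontier contents.
    frontier = [
        "http://example.org/",
        "http://example.org/calendar/2020/01",
        "http://example.com/page",
    ]
    print(filter_frontier(frontier, r"example\.org"))
    print(apply_exclusions(frontier, [r".*/calendar/.*"]))
```

With full-string matching, an exclusion meant to catch a path fragment has to be wrapped in `.*` on both sides, as in the calendar pattern above.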