Heritrix 3 - possible improvements

This page lists the input from Netarkivet.dk to the 2012 Heritrix workshop at the British Library.


Authors: Søren Vejrup Carlsen (Netarkivet.dk), Nicholas Clarke (Netarkivet.dk)

General list of improvements

  • A better TemplateMigrationTool for migrating H1 templates to H3 templates. The existing one (H3.1.1) is not usable as-is for migrating NetarchiveSuite templates. For example, the QuotaEnforcer part of our templates is not currently handled by the tool (cf. https://kb-dk.atlassian.net/browse/NAS-1955).
  • Better documentation, including javadoc. More examples of how to use scripting in H3 would also be nice.
  • Improve the GUI (if possible, transfer the features of the H1 GUI to the H3 GUI). Feedback from ONB (see Heavily used Pages from Heritrix 1.14 Webinterface) and from the curators at netarkivet.dk (Netarkivet-curator-requirements) indicates that some pages in the H1 GUI are frequently used for monitoring, and that access to the same information in H3 as in H1 is a requirement for migrating NetarchiveSuite to H3. This includes examining the frontier, viewing progress statistics, examining the crawl log (using regular expressions), and adding notes to the journal when changes are made while a job is paused.
  • Improve the REST API (upgrade to REST 2). Make it possible to extend the API.
  • QA issues: There are no reports on code coverage, checkstyle violations, or potential bugs (FindBugs/PMD). These should be added to the Jenkins site with the use of Sonar. A framework for involving the community in code reviews (Crucible) would also be nice.

Nice to have:

  • Make it possible to run the Heritrix engine purely as an HTTP server.
  • The time zone is hardwired (to GMT). Is that a good thing?

Questions and observations:

The description of the overall architecture of H3 found at http://builds.archive.org:8080/javadoc/heritrix-3.1.1/ is not really reflected in the structure of the project.

Provide a graphical overview of the Heritrix workflow.

More unit tests are needed. 407 unit tests currently exist for 382 source files.

  • Coverage of commons: 44.9%
  • Coverage of engine: 33.0%
  • Coverage of modules: 21.3%

Splitting the project into more modules and/or packages would probably make it easier to understand, maintain and unit-test the different parts.

Introduce NIO instead of old IO. Using old IO for downloading data is not optimal. NIO could also be combined with a thread pool of writers so that the I/O subsystem is not hammered.
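As an illustration of the writer-pool idea, here is a minimal sketch, assuming records arrive as byte arrays and are appended to a single file. The class name PooledNioWriter and its methods are hypothetical and not part of Heritrix. The number of writer threads caps how many write requests are handled at once.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: a fixed pool of writer threads appending records through an NIO channel. */
final class PooledNioWriter implements AutoCloseable {
    private final FileChannel channel;
    private final ExecutorService writers;

    PooledNioWriter(Path file, int writerThreads) throws IOException {
        // APPEND makes every write go to the current end of the file.
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        // The pool size bounds how many write tasks run concurrently.
        this.writers = Executors.newFixedThreadPool(writerThreads);
    }

    /** Queues one record for writing; the future yields the number of bytes written. */
    Future<Integer> submit(byte[] record) {
        return writers.submit(() -> {
            ByteBuffer buffer = ByteBuffer.wrap(record);
            int written = 0;
            // Hold the lock for the whole record so partial writes never interleave.
            synchronized (channel) {
                while (buffer.hasRemaining()) {
                    written += channel.write(buffer);
                }
            }
            return written;
        });
    }

    @Override
    public void close() throws IOException {
        writers.shutdown();
        try {
            // Give queued records time to drain before the channel is closed.
            writers.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        channel.close();
    }
}
```

Download threads would call submit() and continue fetching, while the small writer pool serializes access to the disk.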

Has running with two different Bloom filters at the same time been tried in order to minimize false positives? Or is that infeasible?
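To make the question concrete, here is a minimal sketch of the idea, assuming two filters with different hash seeds and treating a URL as already seen only when both filters report it. The classes SeededBloomFilter and DualBloomFilter and the seeded FNV-style hash are hypothetical illustrations, not the Bloom filter implementation shipped with Heritrix.

```java
import java.util.BitSet;

/** Hypothetical sketch of a plain Bloom filter whose hash functions depend on a seed. */
final class SeededBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;
    private final long seed;

    SeededBloomFilter(int size, int hashCount, long seed) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
        this.seed = seed;
    }

    /** Seeded 64-bit FNV-1a style hash over the characters of the key. */
    private long hash64(String key) {
        long h = 0xcbf29ce484222325L ^ seed;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Derives the i-th bit index by double hashing from the two halves of hash64. */
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32) | 1; // odd step so successive probes differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        long hash = hash64(key);
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(hash, i));
        }
    }

    boolean mightContain(String key) {
        long hash = hash64(key);
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(hash, i))) {
                return false;
            }
        }
        return true;
    }
}

/** Reports a key as seen only if both differently seeded filters agree. */
final class DualBloomFilter {
    private final SeededBloomFilter first;
    private final SeededBloomFilter second;

    DualBloomFilter(int sizePerFilter, int hashCount) {
        this.first = new SeededBloomFilter(sizePerFilter, hashCount, 0x1234L);
        this.second = new SeededBloomFilter(sizePerFilter, hashCount, 0xABCDEFL);
    }

    void add(String url) {
        first.add(url);
        second.add(url);
    }

    boolean mightContain(String url) {
        return first.mightContain(url) && second.mightContain(url);
    }
}
```

Under the usual independence assumptions, this behaves roughly like a single filter with twice the bits and twice the hash functions, so the gain over simply enlarging one filter may be small.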

The engine code could perhaps be split into more logical units/packages:

    • HTTP harvester
    • workflow around harvester
    • Spring configuration
    • Abstract classes/interfaces used by workflow
    • Heritrix startup

Consider an alternative architecture for H<x>

Instead of running separate Heritrix instances in parallel on different machines, each with its own workflow, why not create a distributed master/slave application, much like a distributed database lock manager?

This could be a producer/consumer architecture.

Producer:

  • Handle DNS lookups in separate threads, pipelined.
  • Preprocessing and management of the harvest queue.
  • Persistence handling in a single place.
  • Logging.
  • No I/O interference from the harvesters.
  • Possibly building an index while harvesting.

Consumer(s):

  • Basic HTTP download engines.
  • Postprocessing of data to find further URLs, which are passed back to the master.
  • I/O can be used solely for harvesting, without pounding the hardware excessively.
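The division of labour between producer and consumers can be sketched in a single process as follows. This is only an illustration under simplifying assumptions: the class ProducerConsumerSketch is hypothetical, the fetcher function stands in for a real HTTP download plus link extraction, and a distributed version would replace the thread pool with remote workers.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical sketch: one producer owns the queue state, consumers only fetch and extract. */
final class ProducerConsumerSketch {

    /**
     * Crawls everything reachable from the seed.
     * The fetcher (an assumption of this sketch) maps a URL to the URLs found in its content.
     */
    static Set<String> crawl(String seed, int consumers,
                             Function<String, List<String>> fetcher) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        CompletionService<List<String>> results = new ExecutorCompletionService<>(pool);
        Set<String> seen = new HashSet<>(); // touched by the producer thread only
        int inFlight = 0;

        seen.add(seed);
        results.submit(() -> fetcher.apply(seed));
        inFlight++;
        try {
            while (inFlight > 0) {
                // Consumers pass discovered URLs back to the producer.
                List<String> links = results.take().get();
                inFlight--;
                for (String link : links) {
                    // Deduplication and scheduling happen in one place.
                    if (seen.add(link)) {
                        results.submit(() -> fetcher.apply(link));
                        inFlight++;
                    }
                }
            }
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return seen;
    }
}
```

Because only the producer thread touches the set of seen URLs, no locking is needed around the queue state, and persistence or logging could be added at that single point.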

Consumer postprocessors could potentially also be injected remotely using reflection/classloading.
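A minimal sketch of loading a postprocessor by class name is shown below. The PostProcessor interface, the HrefPostProcessor example, and the PostProcessorLoader helper are all hypothetical. For remote injection, the ClassLoader passed in could be, for instance, a java.net.URLClassLoader pointing at a jar supplied by the master.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical contract for a consumer-side postprocessor. */
interface PostProcessor {
    List<String> extractLinks(String content);
}

/** Example implementation: collects the values of href="..." attributes. */
final class HrefPostProcessor implements PostProcessor {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    public HrefPostProcessor() {
    }

    @Override
    public List<String> extractLinks(String content) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(content);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }
}

/** Instantiates a postprocessor from a class name using reflection. */
final class PostProcessorLoader {
    static PostProcessor load(String className, ClassLoader loader)
            throws ReflectiveOperationException {
        Class<?> raw = Class.forName(className, true, loader);
        // asSubclass fails with ClassCastException if the class is not a PostProcessor.
        Class<? extends PostProcessor> type = raw.asSubclass(PostProcessor.class);
        return type.getDeclaredConstructor().newInstance();
    }
}
```

Loading code supplied over the network has obvious security implications, so a real design would need to restrict where such classes may come from.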

This could also solve the problem of overloading large hosting providers, since the harvesting of different hosts with identical IPs or IP ranges could be scheduled sequentially.
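Sequential scheduling per IP address could look roughly like the following sketch, in which fetches for hosts that resolve to the same IP are chained one after another while fetches for different IPs may run in parallel. The class PerIpScheduler is hypothetical, and the resolver function is an assumption standing in for a real DNS lookup.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

/** Hypothetical sketch: serializes fetches per IP address, parallel across different IPs. */
final class PerIpScheduler {
    // Last scheduled task for each IP; new tasks for that IP are chained behind it.
    // Note: entries are never removed in this sketch.
    private final Map<String, CompletableFuture<Void>> tails = new HashMap<>();
    private final Executor executor;
    private final Function<String, String> resolver; // host name -> IP (assumed lookup)

    PerIpScheduler(Executor executor, Function<String, String> resolver) {
        this.executor = executor;
        this.resolver = resolver;
    }

    /** Schedules a fetch so that it starts only after earlier fetches for the same IP finish. */
    synchronized CompletableFuture<Void> schedule(String host, Runnable fetch) {
        String ip = resolver.apply(host);
        CompletableFuture<Void> tail =
                tails.getOrDefault(ip, CompletableFuture.<Void>completedFuture(null));
        CompletableFuture<Void> next = tail
                .handle((result, error) -> null) // keep the chain going after a failed fetch
                .thenRunAsync(fetch, executor);
        tails.put(ip, next);
        return next;
    }
}
```

Grouping by IP range instead of exact IP would only require the resolver to return a range key, for example a network prefix.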