Harvester Architecture for Heritrix 3

We need to be clear on how the harvesting component (HarvestController, HC) is to be structured when we move to Heritrix 3. As a basic principle, I think we can agree that we should aim, at least initially, to stay as close as possible to the current architecture. That is to say:

  • One job per heritrix instance
  • HC starts a new clean heritrix on the same machine per job

The components we are likely to need are:

  • A NAS-facing service layer that listens for new jobs. This corresponds to the current HarvestControllerServer.
  • A Heritrix3Configuration which encapsulates the information needed to start heritrix. (Currently much of this logic is in the constructor of AbstractJMXHeritrixController.)
  • A Heritrix3Starter which takes a Heritrix3Configuration and starts the external heritrix instance. This is possibly the same as
  • a HeritrixRestClientFactory which uses a Heritrix3Configuration to generate a
  • Heritrix3RestClient which contains all the methods we need to communicate with H3
  • A Heritrix3Controller. If we are lucky, it might be enough simply to reimplement the existing HeritrixController interface, but using RestClient calls instead of JMX calls.
  • Generalised HeritrixLauncher and HeritrixFiles classes. Currently these are specialised to the H1 case. One way to split the execution path between H1 and H3 would be to define two different HeritrixFiles classes and give HeritrixLauncher doH1Crawl() and doH3Crawl() methods in place of the current doCrawl() method.
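
To make the relationships between these components concrete, here is a rough Java sketch of how they might fit together. It is only an illustration under several assumptions: the fields of Heritrix3Configuration, the method names on Heritrix3RestClient, and the cut-down HeritrixController interface shown here are all hypothetical stand-ins (the real HeritrixController interface has more methods), and FakeHeritrix is an in-memory substitute so the example runs without a real H3 process or any HTTP calls.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class H3ArchitectureSketch {

    /** Everything needed to start one H3 instance for one job. Fields are assumptions. */
    static final class Heritrix3Configuration {
        final long jobId;
        final Path crawlDir;
        final int restPort;

        Heritrix3Configuration(long jobId, Path crawlDir, int restPort) {
            this.jobId = jobId;
            this.crawlDir = crawlDir;
            this.restPort = restPort;
        }
    }

    /** Hypothetical subset of the calls we would need against H3. */
    interface Heritrix3RestClient {
        void launchJob();
        boolean isCrawlFinished();
        void terminateJob();
    }

    /** Builds a REST client from the same configuration used to start H3. */
    interface HeritrixRestClientFactory {
        Heritrix3RestClient create(Heritrix3Configuration config);
    }

    /** Starts the external H3 process (one fresh instance per job). */
    interface Heritrix3Starter {
        void start(Heritrix3Configuration config);
    }

    /** Simplified stand-in for the existing HeritrixController interface. */
    interface HeritrixController {
        void initialize();
        void requestCrawlStart();
        boolean crawlIsEnded();
        void cleanup();
    }

    /** Same controller contract as today, but delegating to REST calls instead of JMX. */
    static final class Heritrix3Controller implements HeritrixController {
        private final Heritrix3Configuration config;
        private final Heritrix3Starter starter;
        private final HeritrixRestClientFactory clientFactory;
        private Heritrix3RestClient client;

        Heritrix3Controller(Heritrix3Configuration config, Heritrix3Starter starter,
                            HeritrixRestClientFactory clientFactory) {
            this.config = config;
            this.starter = starter;
            this.clientFactory = clientFactory;
        }

        public void initialize() {
            starter.start(config);                 // new clean heritrix for this job
            client = clientFactory.create(config); // client talks to that instance only
        }

        public void requestCrawlStart() { client().launchJob(); }
        public boolean crawlIsEnded() { return client().isCrawlFinished(); }
        public void cleanup() { client().terminateJob(); }

        private Heritrix3RestClient client() {
            if (client == null) throw new IllegalStateException("initialize() not called");
            return client;
        }
    }

    /** In-memory fake that records calls; reports "finished" on the third poll. */
    static final class FakeHeritrix implements Heritrix3Starter, Heritrix3RestClient {
        final List<String> calls = new ArrayList<>();
        private int pollsLeft = 2;

        public void start(Heritrix3Configuration c) { calls.add("start:" + c.jobId); }
        public void launchJob() { calls.add("launch"); }
        public boolean isCrawlFinished() { return --pollsLeft < 0; }
        public void terminateJob() { calls.add("terminate"); }
    }

    /** Runs one job through the controller lifecycle against the fake. */
    public static List<String> runJob(long jobId) {
        FakeHeritrix fake = new FakeHeritrix();
        Heritrix3Configuration config =
                new Heritrix3Configuration(jobId, Paths.get("crawldir-" + jobId), 8443);
        HeritrixController controller = new Heritrix3Controller(config, fake, cfg -> fake);
        controller.initialize();
        controller.requestCrawlStart();
        while (!controller.crawlIsEnded()) {
            // real code would sleep between polls
        }
        controller.cleanup();
        return fake.calls;
    }

    public static void main(String[] args) {
        System.out.println(runJob(42L)); // prints [start:42, launch, terminate]
    }
}
```

The point of the sketch is that the service layer would only ever see the HeritrixController interface, so if the existing interface turns out to be sufficient, the switch from JMX to REST stays contained inside the controller and the client factory.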