Detailed Architecture
We use this page to document the actual architectural choices made in the implementation.
Modification of HarvestDefinition and/or HarvestConfiguration
We need a way to define that a given harvest should use Umbra. The natural place to do this is at the HarvestDefinition level: if we did this at the HarvestConfiguration level, we would end up having to split a HarvestDefinition into multiple jobs whenever only some of its configurations are Umbra-enabled.
There are various ways to do this, such as:
- Adding a boolean flag "useUmbra" to the existing PartialHarvest class
- Subclassing PartialHarvest with a new class UmbraHarvest
The boolean flag option seems the most straightforward as far as implementing persistence is concerned. (Note: we will need to add schema-migration code to PostgreSQLSpecifics.java to migrate the harvestdefinitions table to what will be version 5, I think.) See Managing Database Schema Changes for a checklist of how to manage schema changes.
Handling of NAS JMS Queues & Channels
In the NAS GUI there is functionality for
- Seeing which harvesting channels are known and whether they are bound to Snapshot or Selective harvests
- Creating new channels
- Seeing which HarvestDefinitions are mapped to which channels
We will leverage this functionality to bind Umbra-enabled HarvestDefinitions to a new channel - for example "UMBRA_SELECTIVE". Ideally this will happen via the settings.xml file for the HarvestJobManager application.
We will need to modify the Selective Harvest definition part of the GUI to enable selection/deselection of Umbra - for example a drop-down among known channel names, a tickbox, or checkboxes. How do we know which selective channel is the default and which is the "other one", i.e. Umbra? By convention (e.g. if the channel name contains "umbra") or by configuration (as a startup setting to HarvestJobManager)?
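To make the convention-versus-configuration question concrete, here is a minimal sketch that supports both: if a channel name is configured it wins, otherwise we fall back to the naming convention. The class UmbraChannelResolver and its constructor argument are hypothetical and are not existing NetarchiveSuite code or settings.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of NetarchiveSuite): decides whether a
 * harvest channel is the Umbra channel.
 */
final class UmbraChannelResolver {

    /** Channel name from a (hypothetical) startup setting; null or empty if unset. */
    private final String configuredUmbraChannel;

    UmbraChannelResolver(String configuredUmbraChannel) {
        this.configuredUmbraChannel = configuredUmbraChannel;
    }

    /** Configuration wins if present; otherwise fall back to the naming convention. */
    boolean isUmbraChannel(String channelName) {
        if (channelName == null) {
            return false;
        }
        if (configuredUmbraChannel != null && !configuredUmbraChannel.isEmpty()) {
            return configuredUmbraChannel.equals(channelName);
        }
        // Convention: any channel whose name contains "umbra", ignoring case.
        return channelName.toLowerCase(Locale.ROOT).contains("umbra");
    }
}
```

With no configured name, "UMBRA_SELECTIVE" is treated as the Umbra channel purely by convention; once a name is configured, only that exact channel qualifies.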
HarvestController Settings
The HarvestController already has a setting for which channel it listens to, so only a configuration change is needed to make sure it listens to "UMBRA_SELECTIVE". There will undoubtedly be other parameters we need to pass in - e.g. the URL of the RabbitMQ broker.
Heritrix/cxml Configuration & Communication with Umbra
Here we may choose to use our current templating approach, whereby we push values into the cxml file. Alternatively, we could use XML DOM processing, e.g. with XPath. The test cxml https://github.com/netarchivesuite/netarchivesuite-umbra-docker/blob/master/heritrix/umbra.cxml provides a good template for which beans and parameters we need to add and where. IIRC (please check!) the current logic is that HarvestJobManager pushes all the template values and then deletes any unused placeholders. That won't work if we want the HarvestController to fill in placeholders later. So either we delegate deletion of unused placeholders to the HarvestController, or we use the DOM-processing approach.
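As an illustration of the DOM-processing option, the sketch below sets the value attribute of a named property on a named bean using only the JDK's XML and XPath APIs. The class name CxmlPropertySetter and the bean/property ids used in the usage note are hypothetical; the real ids should be taken from umbra.cxml. It also assumes that ids contain no quote characters, since they are concatenated into the XPath expression.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical helper (not NetarchiveSuite code): edits a cxml via DOM + XPath. */
final class CxmlPropertySetter {

    /** Returns a copy of the cxml with the given bean property's value attribute replaced. */
    static String setBeanProperty(String cxml, String beanId, String propertyName, String value) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(cxml)));

            // local-name() sidesteps the Spring beans default namespace.
            String expr = "//*[local-name()='bean' and @id='" + beanId + "']"
                    + "/*[local-name()='property' and @name='" + propertyName + "']";
            Element property = (Element) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, doc, XPathConstants.NODE);
            if (property == null) {
                throw new IllegalArgumentException(
                        "No property '" + propertyName + "' on bean '" + beanId + "'");
            }
            property.setAttribute("value", value);

            // Serialize the modified DOM back to a string.
            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("Could not process cxml", e);
        }
    }
}
```

A call such as CxmlPropertySetter.setBeanProperty(cxml, "umbraBean", "amqpUri", "amqp://localhost:5672/%2f") could then be made by the HarvestController at job start, with no dependency on placeholders surviving the HarvestJobManager step.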
Job Isolation and Cleaning-up of Umbra/RabbitMQ
Job isolation means making sure that URLs discovered by a specific Umbra harvest are only returned to the job which initiated them. The test profile https://github.com/netarchivesuite/netarchivesuite-umbra-docker/blob/master/heritrix/umbra.cxml shows how to define a specific RabbitMQ channel. If we make this channel unique per job (e.g. just using the Harvest Job Number) then we should be safe. But how do we prevent the broker accumulating old queues, possibly with data on them? This is maybe not something to worry about for now.
Umbra Deployment Architecture
Now that we know how to do job isolation, one-Umbra-instance-per-harvest-machine has some advantages. It means that every HarvestController uses the same Umbra endpoint configuration - "localhost:5672".
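A minimal sketch of per-job naming follows. The class UmbraQueueNames, the "umbra-job-" naming scheme and the environment-name prefix are assumptions made for illustration only. The second method builds the RabbitMQ "x-expires" queue argument, which asks the broker to delete a queue after it has been unused for the given number of milliseconds; whether Heritrix or Umbra would let us pass such an argument when the queue is declared has not been checked.

```java
import java.util.Collections;
import java.util.Map;

/** Hypothetical helper (not NetarchiveSuite code): per-job RabbitMQ naming. */
final class UmbraQueueNames {

    private UmbraQueueNames() {
    }

    /**
     * Builds a name that is unique per harvest job. The environment prefix is an
     * assumption, intended to avoid clashes if two installations share a broker.
     */
    static String queueNameForJob(String environmentName, long jobId) {
        if (jobId <= 0) {
            throw new IllegalArgumentException("jobId must be positive: " + jobId);
        }
        return environmentName + "-umbra-job-" + jobId;
    }

    /**
     * Queue arguments asking RabbitMQ to drop the queue once it has been unused
     * for ttlMillis milliseconds ("x-expires" is a standard RabbitMQ queue argument).
     */
    static Map<String, Object> expiryArguments(int ttlMillis) {
        if (ttlMillis <= 0) {
            throw new IllegalArgumentException("ttlMillis must be positive: " + ttlMillis);
        }
        return Collections.<String, Object>singletonMap("x-expires", ttlMillis);
    }
}
```

If queue expiry turns out not to be configurable from the harvester side, an alternative would be for the HarvestController to delete the job's queue explicitly when the job finishes.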
Umbra Configuration
The question of native vs. Docker is less critical if we are going to have just one long-running Umbra instance per machine. But Docker has the advantage of letting us easily manage and deploy identical configurations to multiple machines.