Overall Architecture for Browser-based Harvesting

(Spawned from https://kb-dk.atlassian.net/wiki/pages/viewpage.action?pageId=16923944)

As a first step we are integrating Umbra harvesting with NetarchiveSuite using the existing Heritrix plugins. There are many possible architectural choices for how this might work. Note that for every solution described here, the URL/routing_key for harvesting has to be pushed into a placeholder in the crawler-beans, i.e. the HarvestController has to tell Heritrix which Umbra instance to use.
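The placeholder idea can be sketched as follows. This is only an illustration under stated assumptions: the placeholder names (UMBRA_AMQP_URI_PLACEHOLDER, UMBRA_ROUTING_KEY_PLACEHOLDER) and the helper function are hypothetical, and the bean class and property names reflect our understanding of the Heritrix AMQP plugin and should be checked against the plugin version actually deployed. The real substitution would be done in Java by the HarvestController; Python is used here only to keep the sketch short.

```python
from xml.sax.saxutils import escape

# Fragment of a crawler-beans template. The %{...} placeholder names are
# hypothetical; the bean class and property names are assumptions to verify.
CRAWLER_BEANS_FRAGMENT = """\
<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
  <property name="amqpUri" value="%{UMBRA_AMQP_URI_PLACEHOLDER}"/>
  <property name="routingKey" value="%{UMBRA_ROUTING_KEY_PLACEHOLDER}"/>
</bean>
"""


def fill_umbra_placeholders(template: str, amqp_uri: str, routing_key: str) -> str:
    """Return the template with both Umbra placeholders replaced.

    Raises ValueError if a placeholder is missing, so that a job is never
    started with a template that silently ignores the Umbra settings.
    """
    values = {
        "%{UMBRA_AMQP_URI_PLACEHOLDER}": amqp_uri,
        "%{UMBRA_ROUTING_KEY_PLACEHOLDER}": routing_key,
    }
    result = template
    for placeholder, value in values.items():
        if placeholder not in result:
            raise ValueError(f"template has no {placeholder}")
        # Escape characters that are special inside an XML attribute value.
        result = result.replace(placeholder, escape(value, {'"': "&quot;"}))
    return result


if __name__ == "__main__":
    print(fill_umbra_placeholders(
        CRAWLER_BEANS_FRAGMENT,
        "amqp://guest:guest@broker.example.org:5672/%2f",  # example broker URL
        "umbra-harvester-1",                               # example routing key
    ))
```

Failing loudly on a missing placeholder is a design choice for the sketch: a template without the placeholder would otherwise run as an ordinary harvest without anyone noticing.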

Here are some of the solutions ...

Each solution is described by its concept, its challenges and its opportunities, in that order.

Concept: Dedicated harvesters with Umbra support. Specific harvesting channel for Umbra. Modified selective harvests (i.e. add a "useUmbra" flag or similar). HarvestController pushes the Umbra configuration into the crawler-beans (e.g. using placeholders). The choice of deployment architecture for Umbra itself (Docker/native, one broker/one broker per machine/one broker per Umbra etc.) is deferred until later.

This is our current preferred architecture.

Challenges:

Code-heavy (both back-end and front-end).

Doesn't scale automagically - the number of Umbra-enabled harvesters must be defined in advance.

Opportunities:

Good monitoring because different types of harvest are easily separable in the GUI.

Initially can be done on a small number (1?) of specially configured harvesters.

Concept: Dedicated harvesters with Umbra support. Specific harvesting channel for Umbra. New class of harvests parallel to snapshot/selective.

Challenges:

Code-heavy (both back-end and front-end).

Doesn't scale automagically - the number of Umbra-enabled harvesters must be defined in advance.

Opportunities:

Good monitoring because different types of harvest are easily separable in the GUI.

Initially can be done on a small number (1?) of specially configured harvesters.

Concept: Fully containerised (Umbra + broker). Spin up as needed if the crawler-beans include Umbra extensions.

Challenges:

Requires Docker + docker-compose to be available on every harvest machine.

Requires Docker skillz from developers.

Status of Docker/docker-compose integration with Java is not fully known.

Opportunities:

Flexible and scalable.

Development only in the NetarchiveSuite back-end.

Containerised solution is useful for testing in a consistent, portable environment.
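A rough sketch of the "spin up as needed" decision for the containerised option: inspect the crawler-beans and only build a docker-compose command if an Umbra extension is present. The marker class name, the compose file path and the project name are assumptions made for illustration; the function only builds the command and leaves execution (and the question of Java integration) to the caller.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Assumed marker for "this template uses Umbra"; adjust to the bean class
# actually used in the crawler-beans.
UMBRA_BEAN_CLASS = "org.archive.modules.AMQPPublishProcessor"


def uses_umbra(crawler_beans_xml: str) -> bool:
    """Return True if any <bean> in the document has the Umbra marker class."""
    root = ET.fromstring(crawler_beans_xml)
    for element in root.iter():
        # ElementTree reports namespaced tags as "{namespace}bean".
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "bean" and element.get("class") == UMBRA_BEAN_CLASS:
            return True
    return False


def compose_up_command(crawler_beans_xml: str,
                       compose_file: str,
                       project: str) -> Optional[List[str]]:
    """Build the docker-compose command for this job, or None if not needed."""
    if not uses_umbra(crawler_beans_xml):
        return None
    # -f selects the compose file, -p names the project, "up -d" starts the
    # services in the background.
    return ["docker-compose", "-f", compose_file, "-p", project, "up", "-d"]


if __name__ == "__main__":
    beans = (
        '<beans xmlns="http://www.springframework.org/schema/beans">'
        '<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor"/>'
        "</beans>"
    )
    # The file path and project name are hypothetical examples.
    print(compose_up_command(beans, "umbra/docker-compose.yml", "umbra-job-42"))
```

Using one compose project name per job would keep the containers of parallel jobs apart, at the cost of starting a fresh Umbra and broker every time.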

Concept: Fully containerised (Umbra + broker) but persistent - reuse the same Umbra installation for any given harvester.

Challenges:

Requires Docker + docker-compose to be available on every harvest machine.

Requires some Docker skillz from developers.

Must make sure that each Umbra is stateless between jobs.

Opportunities:

Doesn't really require Java-Docker integration. (Can launch Umbra from the NetarchiveSuite start-scripts.)

Development only in the NetarchiveSuite back-end.

Containerised solution is useful for testing in a consistent, portable environment.


Concept: Native (non-container) Umbras per harvester. Single broker.

Challenges:

Must make sure that each Umbra is stateless between jobs (empty queue).

Need to find out how to run multiple Umbras per machine (one per HarvestController).

Leverage the broker's "default exchange" ability to enable automatic routing.

Heavy on harvester configuration. Requires a native Umbra per harvester running everywhere (so at least Python, headless Chrome and a dummy X display on every harvest machine).

Opportunities:

No need to learn Docker.

Development only in the NetarchiveSuite back-end.
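To illustrate the "default exchange" idea with a single broker: in AMQP 0-9-1 every queue is automatically bound to the default (nameless) exchange with a routing key equal to the queue name, so giving each Umbra its own queue is enough to route URLs to the right instance. The sketch below is a toy in-memory model rather than a real broker client, and the queue naming scheme is a hypothetical example.

```python
from collections import deque
from typing import Deque, Dict


class DefaultExchangeModel:
    """Toy model of default-exchange routing: routing key == queue name."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {}

    def declare_queue(self, name: str) -> None:
        # Declaring a queue implicitly binds it under its own name.
        self.queues.setdefault(name, deque())

    def publish(self, routing_key: str, body: str) -> bool:
        """Deliver body to the queue named routing_key; False if unroutable."""
        queue = self.queues.get(routing_key)
        if queue is None:
            return False
        queue.append(body)
        return True

    def is_empty(self, name: str) -> bool:
        """The 'stateless between jobs' check: no URLs left in the queue."""
        return not self.queues[name]


def umbra_queue_name(host: str, harvester_id: str) -> str:
    """Hypothetical naming scheme: one queue per HarvestController."""
    return f"umbra.{host}.{harvester_id}"


if __name__ == "__main__":
    broker = DefaultExchangeModel()
    first = umbra_queue_name("harvest01", "hc1")
    second = umbra_queue_name("harvest01", "hc2")
    broker.declare_queue(first)
    broker.declare_queue(second)
    # Heritrix for hc1 would use `first` as its routing key.
    broker.publish(first, "http://example.org/")
    print(broker.is_empty(first), broker.is_empty(second))
```

With such a scheme the routing key pushed into the crawler-beans would simply be the queue name of the Umbra belonging to that HarvestController, and no explicit bindings need to be configured on the broker.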