Configuring Integration with Umbra

Harvesting data via a web browser can sometimes identify content that a non-rendering web crawler such as Heritrix cannot, for example links generated dynamically in JavaScript, often including content associated with responsive design. Internet Archive's Umbra is one of several browser-based technologies for web harvesting. Seen from the point of view of Heritrix and NetarchiveSuite, Umbra is simply a discovery API: one sends a URL to Umbra, Umbra renders the associated web content while capturing any other URLs fetched during the rendering, and it returns a list of these URLs. It is then up to Heritrix to re-harvest these URLs and store the resulting web data. Communication with Umbra takes place via a RabbitMQ broker, so fundamentally the only configuration that has to be added to NetarchiveSuite is the URL of the broker endpoint, the authentication parameters (username and password), and a way of configuring which URLs are to be sent to Umbra.

All of the configuration is in the NetarchiveSuite settings for the HarvestControllerApplication. A typical example might look like this:

<settings>
  <common>
   .
   .
  </common>
  <harvester>
    <harvesting>
      <channel>UMBRA</channel>
      <heritrix>
       .
       .
      </heritrix>
      <metadata>
        .
        .
      </metadata>
      <serverDir>harvester_umbra_1</serverDir>
      <umbra>
        <isEnabled>true</isEnabled>
        <rabbitmqUrl>amqp://guest:guest@localhost:8998/%2f</rabbitmqUrl>
        <hopsShouldProcess>^$|.*L</hopsShouldProcess>
      </umbra>
    </harvesting>
  </harvester>
</settings>
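
The `rabbitmqUrl` value packs several connection parameters into a single string. As a rough illustration (this is not part of NetarchiveSuite, and the helper name `parse_broker_url` is hypothetical), the following Python sketch splits the example URL above into its components using only the standard library. Note that `%2f` is the percent-encoded form of `/`, which is the name of RabbitMQ's default virtual host.

```python
from urllib.parse import urlsplit, unquote


def parse_broker_url(url):
    """Split an AMQP URL into its connection parameters (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # The path after the leading slash is the percent-encoded virtual host.
        "vhost": unquote(parts.path[1:]),
    }


if __name__ == "__main__":
    # The URL from the settings example above.
    print(parse_broker_url("amqp://guest:guest@localhost:8998/%2f"))
```

A quick check like this can help catch a mistyped port or a missing virtual host before the setting is deployed to a harvester.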

The complete list of relevant settings is:

Setting	Description
`settings.harvester.harvesting.channel`	A specific harvest channel for Umbra harvests must be defined in the NetarchiveSuite GUI. This channel name should then be used by any HarvestControllerApplication that is intended to receive jobs for Umbra.
`settings.harvester.harvesting.umbra.isEnabled`	A flag indicating whether this HarvestControllerApplication instance is intended for Umbra harvesting. The default value is `false`.
`settings.harvester.harvesting.umbra.rabbitmqUrl`	The connection URL for the tcp-socket connection to the RabbitMQ broker to which Umbra listens. The URL should include the username and password for the broker. Only basic authentication is supported.
`settings.harvester.harvesting.umbra.hopsShouldProcess`	This parameter is a regex applied to the Heritrix discovery path to limit which URLs are sent from Heritrix to Umbra. The default value "^$|.*L" limits this to the empty string (harvest seeds) or any string ending in "L", that is, any URL reached by following a link. With this choice, Umbra is only used for actual web pages such as an end user might load by clicking on a hyperlink.
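
To see what the default `hopsShouldProcess` value selects, the sketch below applies it to a few sample discovery paths in Python. It assumes that the regex has to match the entire discovery path, which is what the description "any string ending in L" implies; the function name `should_send_to_umbra` is made up for this illustration and is not a NetarchiveSuite or Heritrix API.

```python
import re

# Default value of settings.harvester.harvesting.umbra.hopsShouldProcess
HOPS_SHOULD_PROCESS = re.compile(r"^$|.*L")


def should_send_to_umbra(discovery_path):
    """Return True if the whole discovery path matches the regex.

    Assumption: the regex is applied as a full match against the path.
    """
    return HOPS_SHOULD_PROCESS.fullmatch(discovery_path) is not None


if __name__ == "__main__":
    # Heritrix hop letters: "" = seed, L = link, E = embed,
    # R = redirect, X = speculative embed.
    for path in ["", "L", "LL", "RL", "E", "LE", "LLX"]:
        print(repr(path), should_send_to_umbra(path))
```

Under this assumption, seeds and URLs whose last hop is a link are sent to Umbra, while embedded resources such as images or stylesheets (paths ending in "E" or "X") are not.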

Putting It All Together

To make it all work, one needs to carry out the following steps.

Choose an Installation Architecture for Umbra Itself

The Umbra documentation describes how to use Python pip to install Umbra on a single server. (There is some discussion of early experiments with this at the Danish Royal Library here.) An alternative approach is a deployment based on Docker Compose: https://github.com/netarchivesuite/netarchivesuite-umbra-docker/blob/master/umbra/Dockerfile. Although not formally "supported", it seems to work well. (There have also been some experiments with deploying Umbra in the cloud using Elastic Beanstalk; see https://github.com/netarchivesuite/netarchivesuite-umbra-docker/tree/elastic_beanstalk and A Novice Learns About Amazon Web Services.) Here is how we installed the basic software in DK.

The choice of Umbra architecture will depend on your system requirements. Some possibilities are

A single instance of Umbra used by all umbra-enabled harvesters
One Umbra per HarvestControllerServer instance
One Umbra per harvesting machine (possibly running several instances of HarvestControllerServer)

At the Danish Royal Library we have tested with a One-Umbra-Per-Machine setup.

Add an Umbra queue to the NetarchiveSuite GUI

This uses longstanding functionality in the NAS GUI. Just add a new channel for Umbra harvesting, for example one named UMBRA as in the settings example above.

Add one or more umbra-enabled HarvestControllerServer instances to your NAS distribution

As described above, configure one or more of your existing or new HarvestControllerServers to listen to the channel you created, and give each one the necessary connection information for one of your Umbra instances.

Add the four umbra-related placeholders to any harvest templates to be used in Umbra harvests

The placeholders are documented in Appendix B2: Managing Heritrix 3 Crawler-Beans. Note that for non-umbra-enabled HarvestControllerServers these new placeholders will be silently removed before the harvest starts. You can therefore safely add these placeholders to all your harvest templates, whether or not you currently plan to use them all for Umbra harvests.

Map some or all of your harvests to the Umbra channel

Use the existing Harvest Channel Mapping section of the NAS GUI to send specific harvests to the Umbra channel.

To check that Umbra is functioning, look in the crawl log of an umbra-enabled job for the strings "sentToAMQP" and "receivedFromAMQP" which indicate which URLs were sent to Umbra and which were received by Heritrix after being found by Umbra.
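
The check described above can be automated with a few lines of Python. This is only a sketch: it counts the lines that contain each of the two marker strings and makes no assumption about the rest of the crawl-log format. The function names are hypothetical, and the path to the crawl log must be supplied by you.

```python
MARKERS = ("sentToAMQP", "receivedFromAMQP")


def count_umbra_markers(lines):
    """Count the lines that contain each Umbra-related marker string."""
    counts = {marker: 0 for marker in MARKERS}
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


def count_umbra_markers_in_file(path):
    """Apply count_umbra_markers to a crawl log stored at the given path."""
    with open(path, encoding="utf-8", errors="replace") as log:
        return count_umbra_markers(log)
```

If both counts are zero for a job that ran on an umbra-enabled harvester, the harvest template, the channel mapping and the broker settings are the first things to check.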

Limitations

Currently it is only possible to map an entire HarvestDefinition to Umbra. Ideally one would have a more fine-grained approach whereby specific HarvestConfigurations in a given HarvestDefinition would be sent to Umbra, i.e. there should be a mapping from the pair (HarvestConfiguration, HarvestDefinition) to Harvest Channel, and this mapping would be configurable in the page for editing the HarvestDefinition. This could be implemented as an override to the current behaviour, i.e. the more specific mapping would "win" over the per-HarvestDefinition mapping.
The hopsShouldProcess string is currently defined by the HarvestControllerServer settings, and so is the same for all harvests on a given HarvestControllerServer. This too should ideally be definable for each (HarvestConfiguration, HarvestDefinition) pair, perhaps implemented as an override to a default value.
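
The override behaviour proposed above could look roughly like the following Python sketch. This is not existing NetarchiveSuite functionality; the function name, the two lookup tables and all the names used in the example data are hypothetical and serve only to show how a more specific mapping would "win" over the per-HarvestDefinition one.

```python
def resolve_channel(definition, configuration, per_definition, per_pair):
    """Pick a harvest channel; a (configuration, definition) mapping wins.

    per_definition maps a HarvestDefinition name to a channel name.
    per_pair maps a (HarvestConfiguration, HarvestDefinition) pair to a channel.
    Returns None if neither table has an entry.
    """
    specific = per_pair.get((configuration, definition))
    if specific is not None:
        return specific
    return per_definition.get(definition)


if __name__ == "__main__":
    # Hypothetical data: one definition mapped to a hypothetical channel
    # "OTHER", with a single configuration overridden to use Umbra.
    per_definition = {"weekly-news": "OTHER"}
    per_pair = {("frontpage-config", "weekly-news"): "UMBRA"}
    print(resolve_channel("weekly-news", "frontpage-config", per_definition, per_pair))
    print(resolve_channel("weekly-news", "archive-config", per_definition, per_pair))
```

The same lookup pattern would apply to a per-pair `hopsShouldProcess` value falling back to the HarvestControllerServer default.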

As usual, the NetarchiveSuite development team welcomes external contributions to the codebase!