Bitrepository/Hadoop Backend

From NetarchiveSuite 7.0, the software supports an alternative backend based largely on off-the-shelf components, which reduces the maintenance burden of the NetarchiveSuite installation itself. This architecture has been successfully implemented at the Danish Netarkivet, leveraging existing experience with bitrepository.org and hadoop software. It must, however, be emphasised that establishing such an architecture is a complex operation and that the necessary services come with their own maintenance burden. The following gives a brief description of the components involved. Anyone considering implementing such an architecture themselves is advised to contact Netarkivet at the Royal Danish Library for further advice.

Bitrepository Configuration

The new architecture is enabled by specifying dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient as the value of the configuration parameter settings.common.arcrepositoryClient.class.

With this set, the rest of the arcrepositoryClient settings look like:

            <arcrepositoryClient>
                <class>dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient</class>
                <bitrepository>
                    <storeMaxPillarFailures>0</storeMaxPillarFailures>
                    <store_retries>3</store_retries>
                    <retryWaitSeconds>1800</retryWaitSeconds>
                    <tempdir>arcrepositoryTemp</tempdir>
                    <collectionID>netarkiv</collectionID>
                    <usepillar>netarkivonline1</usepillar>
                    <getTimeout>300000</getTimeout>
                    <getFileIDsMaxResults>10000</getFileIDsMaxResults>
                    <keyfilename>client-certkey.pem</keyfilename>
                    <!-- <settingsDir> element is set per location or machine below  -->
                </bitrepository>
            </arcrepositoryClient>

Note in particular that <settingsDir> points to a directory containing a complete set of bitrepository.org RepositorySettings and ReferenceSettings. The specified keyfile lies inside this directory. The RepositorySettings must grant relevant permissions to the NetarchiveSuite application based on the provided key. For HarvestController applications, these are PutFile permissions to enable upload. For the NetarchiveSuite GUIApplication these are GetFileIDs and GetFile permissions. Other applications do not require any bitrepository permissions.
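As a rough illustration of the layout described above, the following sketch checks that a settings directory contains the expected files before an application is started. It is not part of NetarchiveSuite. The file names RepositorySettings.xml and ReferenceSettings.xml are an assumption based on the usual bitrepository.org naming, and the helper name missing_settings_files is hypothetical.

```python
import os
import tempfile

# Assumption: the settings directory holds files with these conventional
# bitrepository.org names. Adjust if your installation differs.
REQUIRED_SETTINGS = ("RepositorySettings.xml", "ReferenceSettings.xml")


def missing_settings_files(settings_dir, keyfilename):
    """Return the names of expected files that are absent from settings_dir."""
    expected = REQUIRED_SETTINGS + (keyfilename,)
    return [name for name in expected
            if not os.path.isfile(os.path.join(settings_dir, name))]


# Demonstration against a throwaway directory that lacks ReferenceSettings.xml.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("RepositorySettings.xml", "client-certkey.pem"):
        open(os.path.join(demo_dir, name), "w").close()
    missing = missing_settings_files(demo_dir, "client-certkey.pem")
    print(missing)
```

An empty list means the directory at least has the right shape. Whether the RepositorySettings actually grant the required permissions to the key can only be verified against the running bitrepository.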

It would be entirely possible to configure a functioning installation of NetarchiveSuite with Bitrepository integration and without the access and processing components described below. Such an installation would still enable scheduling and management of harvesting, together with the enhanced bitpreservation capabilities of bitrepository.org software, but without the QA, indexing, and batch functionality of the complete NetarchiveSuite. Deduplication of harvest jobs would also not be possible and would need to be disabled by setting the appropriate configuration parameter in the settings for the IndexServer application.

Access Configuration

BitmagArcRepositoryClient uses a special client interface called WarcRecordClient to retrieve individual warc records from the archive via https with client key authentication. The basic idea is that a web service (WarcRecordService) returns the (w)arc record found at a given offset in a file. The filename is specified in the request path and the offset in a one-sided http Range header:

connection.addRequestProperty("Range", "bytes=" + offset + "-");

Note that the semantics define that the service returns a single warc record starting at this offset. There is a python-cgi implementation (suitable for any .gz file) available at https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/wrs.
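To make the "single record starting at this offset" semantics concrete, here is a minimal sketch of how a service might extract one record from a .gz file in which each record is stored as its own gzip member. It is an illustration only and is not taken from the linked implementation. The function name read_gzip_member and the sample record contents are hypothetical.

```python
import gzip
import io
import zlib


def read_gzip_member(stream, offset, chunk_size=65536):
    """Decompress exactly one gzip member that starts at `offset`.

    Decompression stops at the end of that member, so data belonging
    to later members (later records) is never returned.
    """
    stream.seek(offset)
    # Adding 16 to the window size tells zlib to expect a gzip header.
    decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    parts = []
    while not decompressor.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # truncated file: return what was decoded so far
        parts.append(decompressor.decompress(chunk))
    return b"".join(parts)


# Two independently compressed "records" concatenated into one file image.
first = gzip.compress(b"record one")
second = gzip.compress(b"record two")
archive = io.BytesIO(first + second)

record_at_zero = read_gzip_member(archive, 0)
record_at_second = read_gzip_member(archive, len(first))
```

Because the decompressor stops at the member boundary, a one-sided range request needs no end offset: the record length is implied by the compressed data itself.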

For this to work, one extra component is required: a FileResolverService, which translates a given warc filename to an absolute path where the file can be found. NetarchiveSuite defines a FileResolver interface and a corresponding REST implementation. WarcRecordClient and FileResolver are configured in the following settings block:

            <warcRecordService>
                <baseUrl>https://myserver/cgi-bin/warcrecordservice.cgi/</baseUrl>
            </warcRecordService>
            <fileResolver>
                <class>dk.netarkivet.common.utils.service.FileResolverRESTClient</class>
                <!--to https -->
                <baseUrl>https://myserver/cgi-bin/fileresolver.cgi/</baseUrl>
                <keyfile>...somepathto... /https_key.pem</keyfile>
            </fileResolver>

We use the same https keyfile for both services.

A simple FileResolverService can be built using a cgi-script sitting on top of the standard Linux tools updatedb and locate. See https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolver and https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolverdb.
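The resolution step itself can be sketched in a few lines. The example below is not the linked cgi-script. It replaces updatedb with a directory walk that builds an in-memory index and replaces locate with a dictionary lookup, purely to show the filename-to-absolute-path mapping. The function names and the sample file name are hypothetical.

```python
import os
import tempfile


def build_index(root):
    """Map each file name under `root` to the absolute paths where it occurs.

    This stands in for `updatedb`, which builds a comparable index on disk.
    """
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.abspath(os.path.join(dirpath, name))
            index.setdefault(name, []).append(full_path)
    return index


def resolve(index, filename):
    """Return all absolute paths for `filename` (empty list if unknown).

    This stands in for a `locate` query restricted to exact base names.
    """
    return sorted(index.get(filename, []))


# Demonstration with a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    open(os.path.join(root, "sub", "1-1-example.warc.gz"), "w").close()
    index = build_index(root)
    resolved = resolve(index, "1-1-example.warc.gz")
    unknown = resolve(index, "no-such-file.warc.gz")
```

Returning a list rather than a single path leaves room for a file that is present in more than one location. How the real service handles that case is not specified here.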

Hadoop Configuration

The use of hadoop for mass-processing is somewhat complicated by the fact that we do not consider hdfs to be a suitable filesystem for long term preservation. Existing bitrepository.org filepillar software assumes that the storage is a standard POSIX filesystem. To enable mass-processing under hadoop we therefore expose the filesystem by nfs on all hadoop nodes, with the same mountpoint name on each. Thus the path to any file is the same everywhere in hadoop (and also on the bitrepository filepillar used for processing and on the one on which the WarcRecordService runs). All our hadoop jobs (replacing the old batch jobs) therefore take as input a single hdfs file containing a list of paths to warcfiles (on the non-hdfs filesystem) to be processed.
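The job input described above is simply a text file of paths. The following sketch shows one way such a file could be produced and read back, assuming one absolute path per line. The helper names and the example paths are hypothetical, and the actual NetarchiveSuite job code may differ.

```python
import os
import posixpath
import tempfile


def write_job_input(warc_paths, input_file):
    """Write one absolute path per line, rejecting relative paths.

    Relative paths are refused because every hadoop node must be able
    to open the same path on the shared nfs mount.
    """
    with open(input_file, "w", encoding="utf-8") as out:
        for path in warc_paths:
            if not posixpath.isabs(path):
                raise ValueError(f"not an absolute path: {path}")
            out.write(path + "\n")


def read_job_input(input_file):
    """Return the listed paths, skipping blank lines."""
    with open(input_file, encoding="utf-8") as lines:
        return [line.strip() for line in lines if line.strip()]


# Hypothetical paths on the shared mount.
paths = [
    "/netarkiv/0001/1-1-example.warc.gz",
    "/netarkiv/0002/2-1-example.warc.gz",
]

with tempfile.TemporaryDirectory() as tmp:
    input_file = os.path.join(tmp, "job-input.txt")
    write_job_input(paths, input_file)
    round_trip = read_job_input(input_file)
```

A file written this way would then be copied into hdfs (for example with hdfs dfs -put) and given to the job as its input, so that each mapper receives paths rather than file contents and reads the warcfiles directly from the nfs mount.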