Note that this documentation is for the upcoming release NetarchiveSuite 7.4
and is still a work in progress.

For documentation on the released versions, please view the previous versions of the NetarchiveSuite documentation and select the relevant version.


From NetarchiveSuite 7.0, the software supports an alternative backend based largely on off-the-shelf components, which reduces the maintenance burden of the NetarchiveSuite installation itself. This architecture has been successfully implemented at the Danish Netarkivet, leveraging existing experience with the bitrepository.org and hadoop software. It must, however, be emphasised that establishing such an architecture is a complex operation and that the necessary services come with their own maintenance burden. In the following we give a brief description of the components involved. Anyone considering implementing such an architecture themselves is advised to contact Netarkivet at the Royal Danish Library for further advice.

Bitrepository Configuration

The new architecture is enabled by specifying dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient as the value of the configuration parameter settings.common.arcrepositoryClient.class.

With this set, the rest of the arcrepositoryClient settings look like:

            <arcrepositoryClient>
                <class>dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient</class>
                <bitrepository>
                    <storeMaxPillarFailures>0</storeMaxPillarFailures>
                    <store_retries>3</store_retries>
                    <retryWaitSeconds>1800</retryWaitSeconds>
                    <tempdir>arcrepositoryTemp</tempdir>
                    <collectionID>netarkiv</collectionID>
                    <usepillar>netarkivonline1</usepillar>
                    <getTimeout>300000</getTimeout>
                    <getFileIDsMaxResults>10000</getFileIDsMaxResults>
                    <keyfilename>client-certkey.pem</keyfilename>
                    <!-- <settingsDir> element is set per location or machine below  -->
                </bitrepository>
            </arcrepositoryClient>

Note in particular that <settingsDir> points to a directory containing a complete set of bitrepository.org RepositorySettings and ReferenceSettings. The specified keyfile lies inside this directory. The RepositorySettings must grant the relevant permissions to the NetarchiveSuite application based on the provided key. For HarvestController applications this is the PutFile permission, to enable upload. For the NetarchiveSuite GUIApplication these are the GetFileIDs and GetFile permissions. Other applications do not require any bitrepository permissions.

It would be entirely possible to configure a functioning installation of NetarchiveSuite with Bitrepository integration but without the access and processing components described below. Such an installation would still enable scheduling and management of harvesting, together with the enhanced bitpreservation capabilities of the bitrepository.org software, but without the QA, indexing, and batch functionality of the complete NetarchiveSuite. Deduplication of harvest jobs would also not be possible and would need to be disabled by setting the appropriate configuration parameter in the settings for the IndexServer application.

Access Configuration

BitmagArcRepositoryClient uses a special client interface called WarcRecordClient to retrieve individual warc records from the archive via https with client-key authentication. The basic idea is that a web service (WarcRecordService) returns a (w)arc record from a file at a given offset; the filename is specified in the request path and the offset via the one-sided http Range header:

connection.addRequestProperty("Range", "bytes=" + offset + "-");

Note that the semantics define that the service returns a single warc record starting at this offset. There is a Python CGI implementation (suitable for any .gz file) available at https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/wrs.
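For illustration, the following is a minimal Java sketch of such a request. The base URL, filename and offset are placeholders, and the client-certificate authentication (using the keyfile from the configuration above) is omitted; the actual retrieval logic in NetarchiveSuite is implemented by WarcRecordClient.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class WarcRecordFetchSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder service URL, filename and record offset.
            String baseUrl = "https://myserver/cgi-bin/warcrecordservice.cgi/";
            String filename = "example-job-1234.warc.gz";
            long offset = 123456L;

            URL url = new URL(baseUrl + filename);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            // One-sided range header: the service returns the single record starting at this offset.
            connection.addRequestProperty("Range", "bytes=" + offset + "-");

            // Client-certificate authentication (the keyfilename from the settings above) would
            // normally be configured on the connection's SSL context; omitted in this sketch.
            try (InputStream in = connection.getInputStream()) {
                Files.copy(in, Paths.get("record.warc.gz"), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }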

To make this work, one extra component is required: a FileResolverService, which translates a given warc filename into an absolute path where the file can be found. NetarchiveSuite defines a FileResolver interface and a corresponding REST implementation. WarcRecordClient and FileResolver are configured in the following settings block:

            <warcRecordService>
                <baseUrl>https://myserver/cgi-bin/warcrecordservice.cgi/</baseUrl>
            </warcRecordService>
            <fileResolver>
                <class>dk.netarkivet.common.utils.service.FileResolverRESTClient</class>
                <!--to https -->
                <baseUrl>https://myserver/cgi-bin/fileresolver.cgi/</baseUrl>
                <keyfile>...somepathto... /https_key.pem</keyfile>
            </fileResolver>
            <trustStore>
                <path>/etc/pki/java/cacerts</path>
                <password>changeit</password>
            </trustStore>

We use the same https keyfile for both services. 
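For illustration, the sketch below shows the kind of lookup FileResolverRESTClient performs: an https GET against the fileResolver baseUrl with the warc filename appended. The URL is the placeholder from the configuration above, and the assumption that the service answers with one absolute path per line is illustrative only; consult the FileResolverRESTClient source for the actual response format.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;

    public class FileResolverLookupSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder resolver URL; the filename to be resolved is appended to the base URL.
            String baseUrl = "https://myserver/cgi-bin/fileresolver.cgi/";
            String filename = "example-job-1234.warc.gz";

            HttpURLConnection connection =
                    (HttpURLConnection) new URL(baseUrl + filename).openConnection();
            // Assumption: the service replies with the absolute path(s) of the file, one per line.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                String path = reader.readLine();
                System.out.println("Resolved " + filename + " -> " + path);
            }
        }
    }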

A simple FileResolverService can be built using a CGI script sitting on top of the standard Linux tools updatedb and locate. See https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolver and https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolverdb.

Hadoop Configuration

The use of hadoop for mass-processing is somewhat complicated by the fact that we do not consider hdfs to be a suitable filesystem for long-term preservation. Existing bitrepository.org filepillar software assumes that the storage is a standard POSIX filesystem. To enable mass-processing under hadoop we therefore expose the filesystem by nfs on all hadoop nodes, with the same mountpoint name everywhere. Thus the path to any file is the same everywhere in hadoop (and also on the bitrepository filepillar used for processing and on the machine where the WarcRecordService runs). All our hadoop jobs (replacing the old batch jobs) therefore take as input a single hdfs file containing a list of paths to the warcfiles (on the non-hdfs filesystem) to be processed. The FileResolver architecture described above is used to build this input file from the names of the warcfiles to be processed.
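As an illustration of this input format, the sketch below writes one warcfile path per line to a file in hdfs using the standard Hadoop FileSystem API. The class and the hdfs directory are hypothetical; NetarchiveSuite's own hadoop job classes build these input files internally.

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HadoopInputFileSketch {
        /**
         * Writes one warcfile path per line to a file in hdfs, for use as the input of a
         * hadoop job. The paths are absolute paths on the nfs-mounted filesystem, typically
         * obtained via the FileResolver described above.
         */
        public static Path writeInputFile(Configuration conf, List<String> warcPaths,
                                          String hdfsInputDir) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            Path inputFile = new Path(hdfsInputDir, "input-" + System.currentTimeMillis() + ".txt");
            try (FSDataOutputStream out = fs.create(inputFile)) {
                for (String warcPath : warcPaths) {
                    out.write((warcPath + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
            return inputFile;
        }
    }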

Hadoop is enabled by including hadoop and kerberos configuration files on each NetarchiveSuite machine from which hadoop jobs are to be started - that is to say, the machines running the GUIApplication, IndexServer and WaybackIndexer. The hadoop configuration is loaded from the classpath as set in the start script for the given application. If using the NetarchiveSuite deployment application, this can be done by adding a <deployClassPath> element pointing to the hadoop configuration directory (as a path relative to the NetarchiveSuite installation directory).
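For example, assuming the hadoop and kerberos configuration files have been placed in a directory named hadoop_conf under the installation directory (the directory name here is just an example), the deploy configuration for the relevant application could include an element such as:

            <deployClassPath>hadoop_conf</deployClassPath>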

The behaviour of NetarchiveSuite in relation to hadoop jobs is determined by the following configuration block:

            <hadoop>
                <kerberos>
                    <keytab>/path/to/kerberos_config/my.keytab</keytab>
                    <krb5-conf>/path/to/kerberos_config/krb5.conf</krb5-conf>
                    <trustStore>
                        <path>/path/to/hadoop_config/truststore.jks</path>
                    </trustStore>
                </kerberos>
                <mapred>
                    <framework>yarn</framework>
                    <mapMemoryMb>4096</mapMemoryMb>
                    <mapMemoryCores>2</mapMemoryCores>
                    <!-- Enable caching of files from nfs to hdfs -->
                    <hdfsCacheEnabled>true</hdfsCacheEnabled>
                    <hdfsCacheDir>/user/hadoop_user/netarkivet_cache</hdfsCacheDir>
                    <hdfsCacheDays>7</hdfsCacheDays>
                    <enableUbertask>true</enableUbertask>
                    <!-- Path to the location of the hadoop uberlib on the NetarchiveSuite node -->
                    <hadoopUberJar>uberlib/hadoop-uber-jar.jar</hadoopUberJar>
                    <queue> <!--These values will always be cluster- and user-dependent -->
                        <interactive>foo_interactive</interactive>
                        <batch>bar_batch</batch>
                    </queue>
                    <inputFilesParentDir>/netarkiv</inputFilesParentDir>
                    <cdxJob>
                        <inputDir>/user/hadoop_user/nas_cdx_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_cdx_output</outputDir>
                    </cdxJob>
                    <metadataExtractionJob>
                        <inputDir>/user/hadoop_user/nas_cache_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_cache_output</outputDir>
                    </metadataExtractionJob>
                    <metadataCDXExtractionJob>
                        <inputDir>/user/hadoop_user/nas_metadata_cdx_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_metadata_cdx_output</outputDir>
                    </metadataCDXExtractionJob>
                    <crawlLogExtractionJob>
                        <inputDir>/user/hadoop_user/nas_crawllog_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_crawllog_output</outputDir>
                    </crawlLogExtractionJob>
                </mapred>
            </hadoop>
            <useBitmagHadoopBackend>true</useBitmagHadoopBackend>

Note that paths starting with "/user/hadoop_user" are paths in hdfs to areas used for input and output of hadoop jobs.

The hadoop uberlib is a new artefact (introduced in NetarchiveSuite 7.0) in the NetarchiveSuite build process and can be found inside the main distribution zip. By default it will be unpacked to the path given above.

With this configuration it should be possible to start hadoop jobs from NetarchiveSuite - for example, clicking on a "Browse reports for jobs" link in the GUI should start a hadoop job to index a metadata warcfile.




