Bitrepository/Hadoop Backend

From NetarchiveSuite 7.0, the software supports an alternative backend based largely on off-the-shelf components, which reduces the maintenance burden of the NetarchiveSuite installation itself. This architecture has been successfully implemented at the Danish Netarkivet, leveraging existing experience with the bitrepository.org and hadoop software. It must, however, be emphasised that establishing such an architecture is a complex operation and that the necessary services come with their own maintenance burden. In the following we give a brief description of the components involved. Anyone considering implementing such an architecture themselves is welcome to contact Netarkivet at the Royal Danish Library for further advice.

Bitrepository Configuration

The new architecture is enabled by specifying dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient as the value of the configuration parameter settings.common.arcrepositoryClient.class.

With this set, the rest of the arcrepositoryClient settings look like:

            <arcrepositoryClient>
                <class>dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient</class>
                <bitrepository>
                    <storeMaxPillarFailures>0</storeMaxPillarFailures>
                    <store_retries>3</store_retries>
                    <retryWaitSeconds>1800</retryWaitSeconds>
                    <tempdir>arcrepositoryTemp</tempdir>
                    <!-- The bitrepository collection to use for storage -->
                    <collectionID>netarkiv</collectionID>
                    <usepillar>netarkivonline1</usepillar> <!-- unused -->
                    <getTimeout>300000</getTimeout>
                    <getFileIDsMaxResults>10000</getFileIDsMaxResults> <!-- paging size -->
                    <keyfilename>client-certkey.pem</keyfilename>
                    <!-- <settingsDir> element is set per location or machine below -->
                </bitrepository>
            </arcrepositoryClient>

Note in particular that <settingsDir> points to a directory containing a complete set of bitrepository.org RepositorySettings and ReferenceSettings. The specified keyfile lies inside this directory. The RepositorySettings must grant the relevant permissions to the NetarchiveSuite application based on the provided key: for HarvestController applications this means PutFile permissions to enable upload, and for the NetarchiveSuite GUIApplication it means GetFileIDs and GetFile permissions. Other applications do not require any bitrepository permissions.

It would be entirely possible to configure a functioning installation of NetarchiveSuite with Bitrepository integration but without the access and processing components described below. Such an installation would still enable scheduling and management of harvesting, together with the enhanced bitpreservation capabilities of the bitrepository.org software, but without the QA, indexing, and batch functionality of the complete NetarchiveSuite. Deduplication of harvest jobs would also not be possible and would need to be disabled by setting the appropriate configuration parameter in the settings for the IndexServer application:

        <harvester>
            <harvesting>
                <deduplication>
                    <enabled>false</enabled>
                </deduplication>
            </harvesting>
        </harvester>

Access Configuration

BitmagArcRepositoryClient uses a special client interface called WarcRecordClient to retrieve individual warc records from the archive via https with client key authentication. The basic idea is that a web service (WarcRecordService) returns a (w)arc record from a given file at a given offset; the filename is specified in the request path and the offset via an open-ended http Range header:

connection.addRequestProperty("Range", "bytes=" + offset + "-");

Note that the semantics define that the service returns a single warc record starting at this offset. A python-cgi implementation (suitable for any .gz file) is available at https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/wrs. (A similar idea is used in various access software for webarchives, such as Openwayback and Pywb.)
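For illustration, the following is a minimal sketch of how a client might call such a service using the standard Java HttpURLConnection API. The host name, filename, offset and output path are placeholders, and the real WarcRecordClient additionally performs client-certificate authentication over https, which is omitted here.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class WarcRecordFetchSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical service URL: the warc filename goes in the request path.
            URL url = new URL("https://myserver/cgi-bin/warcrecordservice.cgi/"
                    + "netarkivet-job-1234.warc.gz");
            long offset = 123456L; // offset of the wanted record within the file

            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            // Open-ended range: the service is expected to return the single
            // (w)arc record starting at this offset.
            connection.addRequestProperty("Range", "bytes=" + offset + "-");

            try (InputStream in = connection.getInputStream()) {
                Files.copy(in, Paths.get("record.warc.gz"), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }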

In order to make this work, one extra component is required: a FileResolverService, which translates a given warc filename to an absolute path where the file can be found. NetarchiveSuite defines a FileResolver interface and a corresponding REST implementation. WarcRecordClient and FileResolver are configured in the following settings block:

            <warcRecordService>
                <baseUrl>https://myserver/cgi-bin/warcrecordservice.cgi/</baseUrl>
            </warcRecordService>
            <fileResolver>
                <class>dk.netarkivet.common.utils.service.FileResolverRESTClient</class>
                <!--to https -->
                <baseUrl>https://myserver/cgi-bin/fileresolver.cgi/</baseUrl>
                <keyfile>...somepathto... /https_key.pem</keyfile>
                <retries>3</retries>
                <retryWaitSeconds>5</retryWaitSeconds>
            </fileResolver>
            <trustStore>
                <path>/etc/pki/java/cacerts</path>
                <password>changeit</password>
            </trustStore>

We use the same https keyfile for both services. 

A simple FileResolverService can be built using a cgi-script sitting on top of the standard linux tools updatedb and locate. See https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolver and https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolverdb 

There is also a trivial implementation of the FileResolver client interface, SimpleFileResolver, which can be used when all files in the archive are to be found in a single directory.
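For illustration, the resolution performed by such a trivial resolver amounts to little more than prefixing the filename with a configured parent directory. The sketch below shows only the idea, not the actual NetarchiveSuite FileResolver interface; the class name, method name and directory are placeholders.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Optional;

    /** Illustrative only: maps a warc filename to an absolute path under a single directory. */
    public class SingleDirectoryResolverSketch {
        private final Path parentDir;

        public SingleDirectoryResolverSketch(Path parentDir) {
            this.parentDir = parentDir;
        }

        /** Returns the absolute path of the file, if it exists under the parent directory. */
        public Optional<Path> resolve(String filename) {
            Path candidate = parentDir.resolve(filename).toAbsolutePath();
            return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
        }

        public static void main(String[] args) {
            SingleDirectoryResolverSketch resolver =
                    new SingleDirectoryResolverSketch(Paths.get("/netarkiv")); // placeholder directory
            System.out.println(resolver.resolve("netarkivet-job-1234.warc.gz"));
        }
    }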

Note that the tools in the ArchiveModule for uploading files and for getting files and records out of the archive should all work with this architecture, provided an appropriate settings file is supplied. These tools provide a good test that the architecture is functioning.

Hadoop Configuration

The use of hadoop for mass-processing is somewhat complicated by the fact that we do not consider hdfs to be a suitable filesystem for long-term preservation. The existing bitrepository.org filepillar software assumes that the storage is a standard POSIX filesystem. To enable mass-processing under hadoop we therefore expose the filesystem by nfs on all hadoop nodes, using the same mountpoint name everywhere. Thus the path to any file is the same throughout hadoop (and also on the bitrepository filepillar used for processing and on the server on which the FileResolverService runs). All our hadoop jobs (replacements for the old batch jobs) therefore take as input a single hdfs file containing a list of paths to the warcfiles (on the non-hdfs filesystem) to be processed. The FileResolver architecture described above is used to build this input file from the names of the warcfiles to be processed.

That is, the logic is (a code sketch follows the list):

  1. Obtain a list of warc files to be processed
  2. Resolve each warcfile to an absolute path using FileResolverService
  3. Create an input file listing these absolute paths
  4. Insert the input file in hdfs
  5. Give this hdfs file as input to the hadoop job
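The following is a minimal sketch of steps 3 and 4 using the standard Hadoop FileSystem API, assuming the absolute paths from steps 1 and 2 have already been collected; the class name and hdfs path are placeholders, not values fixed by NetarchiveSuite.

    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HadoopInputFileSketch {
        /**
         * Writes the already-resolved absolute warcfile paths (steps 1-2) to a
         * single hdfs file (steps 3-4), which can then be handed to the hadoop job (step 5).
         */
        public static Path writeInputFile(Configuration conf, List<String> absolutePaths)
                throws Exception {
            FileSystem fs = FileSystem.get(conf); // hadoop configuration is read from the classpath
            Path inputFile = new Path("/user/hadoop_user/nas_cdx_input/job-1234-input.txt"); // placeholder
            try (FSDataOutputStream out = fs.create(inputFile, true)) {
                for (String warcPath : absolutePaths) {
                    out.write((warcPath + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
            return inputFile;
        }
    }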

Hadoop is enabled by including hadoop and kerberos configuration files on each NetarchiveSuite machine from which hadoop jobs are to be started, that is to say, those machines running the GUIApplication, IndexServer and WaybackIndexer. The hadoop configuration is loaded from the classpath as set in the start script for the given application. If using the NetarchiveSuite deployment application, this can be done by adding a <deployClassPath> element pointing to the hadoop configuration directory (as a path relative to the NetarchiveSuite installation directory).
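As an illustration of how the classpath-based hadoop configuration and the kerberos keytab (see the settings block below) come together, the following sketch uses the standard Hadoop UserGroupInformation API; the principal name and file paths are placeholders, not values used by NetarchiveSuite.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class HadoopLoginSketch {
        public static void main(String[] args) throws Exception {
            // The krb5.conf location is typically passed to the JVM, e.g. via
            // -Djava.security.krb5.conf=/path/to/kerberos_config/krb5.conf
            Configuration conf = new Configuration(); // picks up the *-site.xml files found on the classpath
            UserGroupInformation.setConfiguration(conf);
            // Authenticate against the cluster with the keytab named in the settings;
            // the principal name here is a placeholder.
            UserGroupInformation.loginUserFromKeytab(
                    "hadoop_user@EXAMPLE.ORG", "/path/to/kerberos_config/my.keytab");
            System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
        }
    }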

The behaviour of NetarchiveSuite in relation to hadoop jobs is determined by the following configuration block:

            <hadoop>
                <kerberos>
                    <keytab>/path/to/kerberos_config/my.keytab</keytab>
                    <krb5-conf>/path/to/kerberos_config/krb5.conf</krb5-conf>
                    <trustStore>
                        <path>/path/to/hadoop_config/truststore.jks</path>
                    </trustStore>
                </kerberos>
                <mapred>
                    <framework>yarn</framework>
                    <mapMemoryMb>4096</mapMemoryMb>
                    <mapMemoryCores>2</mapMemoryCores>
                    <!-- Enable caching of files from nfs to hdfs -->
                    <hdfsCacheEnabled>true</hdfsCacheEnabled>
                    <hdfsCacheDir>/user/hadoop_user/netarkivet_cache</hdfsCacheDir>
                    <hdfsCacheDays>7</hdfsCacheDays>
                    <enableUbertask>true</enableUbertask>
                    <!-- Path to the location of the hadoop uberlib on the NetarchiveSuite node -->
                    <hadoopUberJar>uberlib/hadoop-uber-jar.jar</hadoopUberJar>
                    <queue> <!--These values will always be cluster- and user-dependent -->
                        <interactive>foo_interactive</interactive>
                        <batch>bar_batch</batch>
                    </queue>
                    <inputFilesParentDir>/netarkiv</inputFilesParentDir> <!-- Only used by SimpleFileResolver -->
                    <cdxJob>
                        <inputDir>/user/hadoop_user/nas_cdx_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_cdx_output</outputDir>
                    </cdxJob>
                    <metadataExtractionJob>
                        <inputDir>/user/hadoop_user/nas_cache_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_cache_output</outputDir>
                    </metadataExtractionJob>
                    <metadataCDXExtractionJob>
                        <inputDir>/user/hadoop_user/nas_metadata_cdx_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_metadata_cdx_output</outputDir>
                    </metadataCDXExtractionJob>
                    <crawlLogExtractionJob>
                        <inputDir>/user/hadoop_user/nas_crawllog_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_crawllog_output</outputDir>
                    </crawlLogExtractionJob>
                </mapred>
            </hadoop>
            <useBitmagHadoopBackend>true</useBitmagHadoopBackend>

Note that paths starting with "/user/hadoop_user" are paths in hdfs to areas used for i/o for hadoop jobs.

The hadoop uberlib is a new (in NetarchiveSuite 7.0) artefact in the NetarchiveSuite build process and can be found inside the main distribution zip. By default it will be unpacked to the path given above.

With this configuration it should be possible to start hadoop jobs from NetarchiveSuite - for example clicking on a "Browse reports for jobs" link in the GUI should start a hadoop job to index a metadata warcfile.

Alternatively the standalone job SendDedupIndexRequestToIndexserver (see Additional Tools Manual) can be used to start a single indexing job using the hadoop architecture.