Guide To Configuring the NetarchiveSuite 7.0 Backend

Architecture

NetarchiveSuite 7.0 replaces the single "JMSArcRepository" backend with three separate services. Each service requires its own set of configuration files, including client-certificate validation to ensure that only recognised clients (i.e. NetarchiveSuite) can access it. The three services are:

  1. Bitrepository (from http://bitrepository.org) for storage and bit preservation
  2. Hadoop for mass processing (specifically all indexing jobs for deduplication or access)
  3. WarcRecordService, a web API for extracting WARC and ARC records from the repository via a simple web request

External Configurations

The recommended way to distribute the necessary configurations to all NetarchiveSuite machines is to prepare a single directory containing all the configuration files in subdirectories. This directory can then be deployed to every machine in the NetarchiveSuite instance. In test cases we have used identical configurations for every machine, but in a production setup it would be sensible to adjust the settings so that each NetarchiveSuite application receives only the permissions it needs to function.

The external configurations directory looks like this: 

extra_configs/
├── bitmag_config
│   ├── client-certkey.pem
│   ├── logback.xml
│   ├── ReferenceSettings.xml
│   └── RepositorySettings.xml
├── hadoop_config
│   ├── core-site.xml
│   ├── hadoop-env.sh
│   ├── mapred-site.xml
│   ├── ssl-client.xml
│   ├── taskcontroller.cfg
│   ├── task-log4j.properties
│   ├── truststore.jks
│   └── yarn-site.xml
.
.
├── https_config
│   └── https_key.pem
├── kerberos_config
│   ├── krb5.conf
│   └── nat-nasbruger.keytab
└── Readme.md

(Some of the Hadoop files have been omitted for clarity.)
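
Before deploying the directory to every machine, it can be useful to check that the expected files are actually present. The following is a minimal Python sketch of such a check, not part of NetarchiveSuite. The function name missing_config_files and the list EXPECTED_FILES are hypothetical; the list contains only files shown in the listing above and leaves out the keytab, whose name is site-specific.

```python
from pathlib import Path

# Relative paths taken from the directory listing in this guide.
# The keytab is left out on purpose because its name differs per site.
EXPECTED_FILES = [
    "bitmag_config/client-certkey.pem",
    "bitmag_config/ReferenceSettings.xml",
    "bitmag_config/RepositorySettings.xml",
    "hadoop_config/core-site.xml",
    "hadoop_config/ssl-client.xml",
    "hadoop_config/truststore.jks",
    "https_config/https_key.pem",
    "kerberos_config/krb5.conf",
]


def missing_config_files(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling missing_config_files("/home/netarkdv/extra_configs") returns an empty list when every listed file exists, and otherwise the relative paths that still need to be supplied.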

Note, however, that the NetarchiveSuite application manager is responsible for generating only a small part of this package, as described for each subdirectory below:

bitmag_config

The NetarchiveSuite application manager must generate the certkey file and send the public part of the key to the collection manager. The collection manager generates ReferenceSettings.xml and RepositorySettings.xml and can also provide a suitable logback.xml file. Separate certkeys should be generated for the harvesters (which need write permission) and for ViewerProxy (which needs only read permission to download files). All harvesters can, however, share the same certkey.

hadoop_config

This directory is exported directly from the Hadoop cluster by the application manager for the cluster, and can be used as-is with only one modification: in the file "ssl-client.xml" the property "ssl.client.truststore.location" must be set to the absolute path of the file "truststore.jks" in the same directory. If a script is used to deploy the package to the relevant servers, this otherwise manual step can be coded into the script.
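
The ssl-client.xml modification described above could be scripted roughly as follows. This is a minimal sketch, assuming the file uses the usual Hadoop layout of property elements with name and value children inside a configuration root. The function name set_truststore_location is hypothetical. Note that Python's ElementTree does not preserve comments (such as a licence header) when the file is rewritten.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

PROPERTY_NAME = "ssl.client.truststore.location"


def set_truststore_location(hadoop_config_dir):
    """Point ssl.client.truststore.location at truststore.jks in the same directory.

    Returns the absolute path that was written into ssl-client.xml.
    """
    config_dir = Path(hadoop_config_dir).resolve()
    ssl_client = config_dir / "ssl-client.xml"
    truststore = config_dir / "truststore.jks"

    tree = ET.parse(str(ssl_client))
    root = tree.getroot()  # assumed to be the <configuration> element

    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY_NAME:
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = str(truststore)
            break
    else:
        # The property was not present at all, so add it.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = PROPERTY_NAME
        ET.SubElement(prop, "value").text = str(truststore)

    tree.write(str(ssl_client), encoding="utf-8", xml_declaration=True)
    return str(truststore)
```

A deployment script would call set_truststore_location on the deployed hadoop_config directory of each target machine, after the files have been copied, so that the absolute path matches that machine's layout.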

kerberos_config

As with the hadoop_config files, these files are provided by the application manager for the cluster.

https_config

The NetarchiveSuite application manager is also responsible for generating this certkey file, which provides access to the WarcRecordService. In this case, the public part of the certificate must be sent to the application manager for WarcRecordService.

NetarchiveSuite Settings

This section describes the new NetarchiveSuite settings needed to enable and use these services.

ArcRepositoryClient

The new BitmagArcRepositoryClient configuration looks like this:

<arcrepositoryClient>
  <class>dk.netarkivet.archive.arcrepository.distribute.BitmagArcRepositoryClient</class>
  <bitrepository>
    <storeMaxPillarFailures>1</storeMaxPillarFailures>
    <tempdir>arcrepositoryTemp</tempdir>
    <collectionID></collectionID>
    <usepillar>netarkiv-disk-devel-01</usepillar>
    <getTimeout>300000</getTimeout>
    <getFileIDsMaxResults>10000</getFileIDsMaxResults>
    <settingsDir>/home/netarkdv/extra_configs/bitmag_config</settingsDir>
    <keyfilename>client-certkey.pem</keyfilename>
  </bitrepository>
</arcrepositoryClient>

Of these, the following are likely to be important:

  • collectionID: This must be set to the same value as the collection ID in RepositorySettings.xml
  • settingsDir: The location of the directory containing the bitrepository configuration files
  • keyfilename: The name of the certkey file in the bitrepository configuration directory
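
The three settings above can be sanity-checked before starting an application. The following is a minimal Python sketch, not part of NetarchiveSuite; the function name check_bitrepository_settings is hypothetical. Because this guide does not describe the structure of RepositorySettings.xml, the sketch only checks that the collectionID value appears as the text of some element in that file. That is a heuristic, not a full validation.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def _local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def check_bitrepository_settings(settings_file):
    """Return a list of problems found in the <bitrepository> settings."""
    root = ET.parse(str(settings_file)).getroot()
    bitrep = next(
        (el for el in root.iter() if _local(el.tag) == "bitrepository"), None
    )
    if bitrep is None:
        return ["no <bitrepository> element found"]

    values = {_local(child.tag): (child.text or "").strip() for child in bitrep}
    problems = []

    collection_id = values.get("collectionID", "")
    if not collection_id:
        problems.append("collectionID is empty")

    if not values.get("settingsDir"):
        problems.append("settingsDir is empty")
        return problems
    settings_dir = Path(values["settingsDir"])

    keyfile = values.get("keyfilename", "")
    if not keyfile:
        problems.append("keyfilename is empty")

    # Files this guide expects to find in the bitrepository settings directory.
    required = ["ReferenceSettings.xml", "RepositorySettings.xml"]
    if keyfile:
        required.append(keyfile)
    for name in required:
        if not (settings_dir / name).is_file():
            problems.append(f"{name} not found in {settings_dir}")

    # Heuristic: the collection ID should occur as the text of some element.
    repo_settings = settings_dir / "RepositorySettings.xml"
    if collection_id and repo_settings.is_file():
        repo_root = ET.parse(str(repo_settings)).getroot()
        texts = {(el.text or "").strip() for el in repo_root.iter()}
        if collection_id not in texts:
            problems.append(
                f"collectionID '{collection_id}' does not appear in RepositorySettings.xml"
            )
    return problems
```

An empty result means the checks passed. Note that the example configuration shown above, with its empty collectionID element, would be reported as a problem until a real value is filled in.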

Hadoop

N.B. the path to the main hadoop configuration must be added to the classpath of every application that uses hadoop. If using the NetarchiveSuite DeploymentApplication, this can be achieved by adding 

    <deployClassPath>/home/netarkdv/extra_configs/hadoop_config</deployClassPath>

or similar to every application that needs to use hadoop - currently IndexRequestServer, WaybackIndexer, and the GUIApplication.

The other Hadoop settings are as follows:

<settings>
    <common>
      <hadoop>
        <kerberos>
          <principal>nat-csr@KBHPC.KB.DK</principal>
          <keytab>/home/devel/extra_configs/kerberos_config/nat-csr.keytab</keytab>
          <krb5-conf>/home/devel/extra_configs/kerberos_config/krb5.conf</krb5-conf>
        </kerberos>
        <mapred>
          <hadoopUberJar>uberlib/hadoop-uber-jar.jar</hadoopUberJar>
          <cdxJob>
            <inputDir>/user/nat-csr/nas_cdx_input</inputDir>
            <outputDir>/user/nat-csr/nas_cdx_output</outputDir>
          </cdxJob>
          <metadataExtractionJob>
            <inputDir>/user/nat-csr/nas_cache_input</inputDir>
            <outputDir>/user/nat-csr/nas_cache_output</outputDir>
          </metadataExtractionJob>
          <metadataCDXExtractionJob>
            <inputDir>/user/nat-csr/nas_metadata_cdx_input</inputDir>
            <outputDir>/user/nat-csr/nas_metadata_cdx_output</outputDir>
          </metadataCDXExtractionJob>
          <crawlLogExtractionJob>
            <inputDir>/user/nat-csr/nas_crawllog_input</inputDir>
            <outputDir>/user/nat-csr/nas_crawllog_output</outputDir>
          </crawlLogExtractionJob>
        </mapred>
      </hadoop>
      <useBitmagHadoopBackend>true</useBitmagHadoopBackend>
    </common>
</settings>

Note that the Kerberos principal and keytab (i.e. the equivalent of a username and password) should be supplied by the application manager for the Hadoop platform, as should the krb5.conf file. The hadoopUberJar is now included in the standard NetarchiveSuite deployment and should therefore have the value shown. The various input and output directories refer to paths in Hadoop's HDFS filesystem. These may need to be adjusted depending on the Hadoop username used, i.e. they should refer to paths where that user has write permission.
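
One quick way to catch a mismatch between the Hadoop user and the job directories is to compare each inputDir and outputDir with the user's HDFS home directory. The following is a minimal Python sketch under two assumptions that do not come from NetarchiveSuite itself: that the Hadoop username is the part of the principal before the "@" (and before any "/"), and that the user's writable area is /user/<username>/, as in the example settings above. The function name hdfs_dirs_outside_home is hypothetical. A directory reported here is not necessarily wrong, since the user may have write permission elsewhere.

```python
import xml.etree.ElementTree as ET


def hdfs_dirs_outside_home(settings_xml):
    """Return (job, setting, path) tuples for job dirs outside /user/<username>/.

    settings_xml is a well-formed XML string containing a <hadoop> element.
    """
    root = ET.fromstring(settings_xml)
    # Drop any namespace prefixes so plain tag names can be used below.
    for el in root.iter():
        el.tag = el.tag.rsplit("}", 1)[-1]

    hadoop = root if root.tag == "hadoop" else root.find(".//hadoop")
    if hadoop is None:
        raise ValueError("no <hadoop> element found")

    principal = (hadoop.findtext("kerberos/principal") or "").strip()
    # Assumption: username is the principal without realm or instance part.
    user = principal.split("@", 1)[0].split("/", 1)[0]
    if not user:
        raise ValueError("no Kerberos principal configured")
    home = f"/user/{user}/"

    suspicious = []
    for job in hadoop.findall("mapred/*"):
        for key in ("inputDir", "outputDir"):
            path = (job.findtext(key) or "").strip()
            if path and not path.startswith(home):
                suspicious.append((job.tag, key, path))
    return suspicious
```

For the example settings shown above, with principal nat-csr@KBHPC.KB.DK and all directories under /user/nat-csr/, the function returns an empty list.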

WarcRecordService

The following settings are needed:

<settings>
	<common>
		<warcRecordService> 
			<baseUrl>https://kb-test-netarkivet-bitmag-acs-01.kb.dk/cgi-bin/warcrecordservice.cgi/</baseUrl>
		</warcRecordService>
		<fileResolver>
			<class>dk.netarkivet.common.utils.service.FileResolverRESTClient</class>
			<baseUrl>https://kb-test-netarkivet-bitmag-acs-01.kb.dk/cgi-bin/fileresolver.cgi/</baseUrl>
			<keyfile>/home/devel/extra_configs/https_config/https_key.pem</keyfile>
		</fileResolver>
	</common>
</settings>

The two URLs should be provided by the application manager for the WarcRecordService system. The keyfile should be generated by the NetarchiveSuite application manager, and the public part of its certificate supplied to the application manager for the WarcRecordService system.