...

Code Block
            <warcRecordService>
                <baseUrl>https://myserver/cgi-bin/warcrecordservice.cgi/</baseUrl>
            </warcRecordService>
            <fileResolver>
                <class>dk.netarkivet.common.utils.service.FileResolverRESTClient</class>
                <!-- note that the baseUrl uses https -->
                <baseUrl>https://myserver/cgi-bin/fileresolver.cgi/</baseUrl>
                <keyfile>...somepathto... /https_key.pem</keyfile>
            </fileResolver>
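            <!-- truststore (here the default java cacerts store) used when validating the https certificates of the services above -->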
            <trustStore>
                <path>/etc/pki/java/cacerts</path>
                <password>changeit</password>
            </trustStore>

We use the same https keyfile for both the WarcRecordService and the FileResolver service.

...

The use of hadoop for mass-processing is somewhat complicated by the fact that we do not consider hdfs to be a suitable filesystem for long-term preservation. Existing bitrepository.org filepillar software assumes that the storage is a standard POSIX filesystem. To enable mass-processing under hadoop we therefore expose the filesystem by nfs on all hadoop nodes, using the same mountpoint name everywhere. Thus the path to any given file is identical on every hadoop node, on the bitrepository filepillar used for processing, and on the machine on which the WarcRecordService runs. All our hadoop jobs (replacing the old batch jobs) therefore take as input a single hdfs file containing a list of paths to the warcfiles (on the non-hdfs filesystem) to be processed. The FileResolver architecture described above is used to build this input file from the names of the warcfiles to be processed.
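For illustration, such an input file is simply a list of absolute warcfile paths on the shared (non-hdfs) filesystem. A sketch of its contents might look as follows (the filenames and subdirectories here are purely hypothetical; only the /netarkiv parent directory corresponds to the inputFilesParentDir setting shown further below):

Code Block
            /netarkiv/0101/filedir/1234-metadata-1.warc.gz
            /netarkiv/0102/filedir/5678-42-20210101000000-00000-harvester01.warc.gz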

Hadoop is enabled by including hadoop and kerberos configuration files on each NetarchiveSuite machine from which hadoop jobs must be started - that is to say, those machines running the GUIApplication, IndexServer and WaybackIndexer. The hadoop configuration is loaded from the classpath as set in the start script for the given application. If using the NetarchiveSuite deployment application, this can be done by adding a <deployClassPath> element pointing to the hadoop configuration directory (as a path relative to the NetarchiveSuite installation directory).
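For example, if the hadoop and kerberos configuration files are placed in a directory named hadoopConfig (a hypothetical name) under the NetarchiveSuite installation directory, the relevant application's section of the deploy configuration could contain an element along the lines of:

Code Block
            <deployClassPath>hadoopConfig/</deployClassPath>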

The behaviour of NetarchiveSuite in relation to hadoop jobs is determined by the following configuration block:

Code Block
            <hadoop>
                <kerberos>
                    <keytab>/path/to/kerberos_config/my.keytab</keytab>
                    <krb5-conf>/path/to/kerberos_config/krb5.conf</krb5-conf>
                    <trustStore>
                        <path>/path/to/hadoop_config/truststore.jks</path>
                    </trustStore>
                </kerberos>
                <mapred>
                    <framework>yarn</framework>
                    <mapMemoryMb>4096</mapMemoryMb>
                    <mapMemoryCores>2</mapMemoryCores>
                    <!-- Enable caching of files from nfs to hdfs -->
                    <hdfsCacheEnabled>true</hdfsCacheEnabled>
                    <hdfsCacheDir>/user/hadoop_user/netarkivet_cache</hdfsCacheDir>
                    <hdfsCacheDays>7</hdfsCacheDays>
                    <enableUbertask>true</enableUbertask>
                    <!-- Path to the location of the hadoop uberlib on the NetarchiveSuite node -->
                    <hadoopUberJar>uberlib/hadoop-uber-jar.jar</hadoopUberJar>
                    <queue> <!--These values will always be cluster- and user-dependent -->
                        <interactive>foo_interactive</interactive>
                        <batch>bar_batch</batch>
                    </queue>
                    <inputFilesParentDir>/netarkiv</inputFilesParentDir>
                    <cdxJob>
                        <inputDir>/user/hadoop_user/nas_cdx_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_cdx_output</outputDir>
                    </cdxJob>
                    <metadataExtractionJob>
                        <inputDir>/user/hadoop_user/nas_cache_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_cache_output</outputDir>
                    </metadataExtractionJob>
                    <metadataCDXExtractionJob>
                        <inputDir>/user/hadoop_user/nas_metadata_cdx_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_metadata_cdx_output</outputDir>
                    </metadataCDXExtractionJob>
                    <crawlLogExtractionJob>
                        <inputDir>/user/hadoop_user/nas_crawllog_input</inputDir>
                        <outputDir>/user/hadoop_user/nas_crawllog_output</outputDir>
                    </crawlLogExtractionJob>
                </mapred>
            </hadoop>
            <useBitmagHadoopBackend>true</useBitmagHadoopBackend>

Note that paths starting with "/user/hadoop_user" are hdfs paths pointing to the areas used for input and output of hadoop jobs.

The hadoop uberlib is a new artefact (as of NetarchiveSuite 7.0) in the NetarchiveSuite build process and can be found inside the main distribution zip. By default it is unpacked to the path given above.

With this configuration it should be possible to start hadoop jobs from NetarchiveSuite - for example, clicking a "Browse reports for jobs" link in the GUI should start a hadoop job to index a metadata warcfile.