...
Note in particular that <settingsDir> points to a directory containing a complete set of bitrepository.org RepositorySettings and ReferenceSettings. The specified keyfile lies inside this directory. The RepositorySettings must grant the relevant permissions to the NetarchiveSuite applications based on the provided key. For HarvestController applications, these are PutFile permissions to enable upload. For the NetarchiveSuite webGUI (GUIApplication), ViewerProxy, and IndexServer applications, these are GetFileIDs and GetFile permissions. Other applications do not require any bitrepository permissions.
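As an illustration, a grant of PutFile permission in RepositorySettings might look roughly like the fragment below. This is a sketch only: the element names are recalled from the bitrepository.org permission schema and should be verified against the RepositorySettings.xsd shipped with your version, and the certificate data is a placeholder.

    <PermissionSet>
      <Permission>
        <Description>NetarchiveSuite HarvestController upload key</Description>
        <Certificate>
          <CertificateData>...base64-encoded certificate...</CertificateData>
        </Certificate>
        <OperationPermission>
          <Operation>PutFile</Operation>
        </OperationPermission>
      </Permission>
    </PermissionSet>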
It would be entirely possible to configure a functioning installation of NetarchiveSuite with Bitrepository integration but without the access and processing components described below. Such an installation would still enable scheduling and management of harvesting, together with the enhanced bitpreservation capabilities of the bitrepository.org software, but without the QA, indexing, and batch functionality of the complete NetarchiveSuite. Deduplication of harvest jobs would also not be possible and would need to be disabled by setting the appropriate configuration parameter in the settings for the IndexServer application.
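For reference, the deduplication switch is a single boolean setting. In recent NetarchiveSuite versions it lives under settings.harvester.harvesting.deduplication; confirm the exact path, and which application's settings file it must go in, against your version's default settings.

    <settings>
      <harvester>
        <harvesting>
          <deduplication>
            <enabled>false</enabled>
          </deduplication>
        </harvesting>
      </harvester>
    </settings>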
Access Configuration
BitmagArcRepositoryClient uses a special client interface called WarcRecordClient to retrieve individual warc records from the archive via https with client key authentication. The basic idea is that a web service (WarcRecordService) returns a (w)arc record from a file at a given offset; the filename is specified in the request path and the offset via the one-sided http Range header:
    connection.addRequestProperty("Range", "bytes=" + offset + "-");
Note that the semantics require the service to return a single warc record starting at this offset. A Python CGI implementation (suitable for any .gz file) is available at https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/wrs.
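A minimal client-side sketch of this request pattern is shown below. It is hypothetical standalone code, not the actual WarcRecordClient implementation; the URL layout follows the description above, and the https client key authentication is omitted for brevity.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class RangeFetch {
        public static void main(String[] args) throws Exception {
            String baseUrl = "https://myserver/cgi-bin/warcrecordservice.cgi/";
            String filename = "example.warc.gz"; // hypothetical warc file name
            long offset = 12345L;                // byte offset of the wanted record

            // Filename goes in the request path, offset in a one-sided Range header.
            HttpURLConnection connection =
                    (HttpURLConnection) new URL(baseUrl + filename).openConnection();
            connection.addRequestProperty("Range", "bytes=" + offset + "-");

            // The service is defined to return a single (w)arc record starting
            // at the given offset, so the body can be read to exhaustion.
            try (InputStream in = connection.getInputStream()) {
                byte[] record = in.readAllBytes();
                System.out.println("Fetched " + record.length + " bytes");
            }
        }
    }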
To make this work, one extra component is required: a FileResolverService, which translates a given warc filename to an absolute path where the file can be found. NetarchiveSuite defines a FileResolver interface and a corresponding REST implementation. WarcRecordClient and FileResolver are configured in the following settings block:
    <warcRecordService>
      <baseUrl>https://myserver/cgi-bin/warcrecordservice.cgi/</baseUrl>
    </warcRecordService>
    <fileResolver>
      <class>dk.netarkivet.common.utils.service.FileResolverRESTClient</class>
      <!-- to https -->
      <baseUrl>https://myserver/cgi-bin/fileresolver.cgi/</baseUrl>
      <keyfile>...somepathto... /https_key.pem</keyfile>
    </fileResolver>
We use the same https keyfile for both services.
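As a rough sketch of what the REST implementation does, the resolver can be thought of as a lookup that GETs the filename from the configured baseUrl and reads back an absolute path. The class below is purely illustrative; the request and response format of the real FileResolverRESTClient may differ, and the https client key handling is again omitted.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;

    // Hypothetical sketch of a REST file resolver call: GET <baseUrl><filename>,
    // expecting the absolute path of the file as the response body.
    public class FileResolverSketch {
        private final String baseUrl;

        public FileResolverSketch(String baseUrl) {
            this.baseUrl = baseUrl;
        }

        public Path resolve(String warcFilename) throws Exception {
            HttpURLConnection connection =
                    (HttpURLConnection) new URL(baseUrl + warcFilename).openConnection();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
                String line = reader.readLine();
                return line == null ? null : Path.of(line.trim());
            }
        }
    }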
A simple FileResolverService can be built using a cgi-script sitting on top of the standard linux tools updatedb and locate; the core lookup is sketched below. See https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolver and https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolverdb.
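The essence of such a resolver, shown here in Java rather than as a cgi-script, is a single basename lookup in the locate database. This assumes updatedb has indexed the filesystem holding the archive.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Sketch of the lookup performed by a locate-based FileResolverService.
    public class LocateResolver {
        public static String resolve(String filename) throws Exception {
            // "locate -b '\NAME'" matches the basename exactly: the leading
            // backslash disables locate's implicit *pattern* globbing.
            Process process = new ProcessBuilder("locate", "-b", "\\" + filename).start();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
                return reader.readLine(); // first match, or null when not found
            }
        }
    }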
Hadoop Configuration
The use of hadoop for mass-processing is somewhat complicated by the fact that we do not consider hdfs a suitable filesystem for long-term preservation, and the existing bitrepository.org filepillar software assumes that the storage is a standard POSIX filesystem. To enable mass-processing under hadoop we therefore expose the filesystem by nfs on all hadoop nodes, with the same mountpoint name everywhere. Thus the path to any file is the same everywhere in hadoop (and also on the bitrepository filepillar used for processing and on the machine where the WarcRecordService runs). All our hadoop jobs (replacing the old batch jobs) therefore take as input a single hdfs file containing a list of paths to the warcfiles (on the non-hdfs filesystem) to be processed.
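A skeleton of such a job might look as follows. This is a hypothetical sketch, not the actual NetarchiveSuite job code: the input is the hdfs path-list file, NLineInputFormat distributes the listed paths across mappers, and each mapper opens its warcfile directly on the shared NFS mount (here it merely reports the file size).

    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WarcListJob {

        // Each input record is one line of the hdfs path-list file, i.e. the
        // absolute path of a warcfile on the shared NFS mount. Because the
        // mountpoint is identical on all nodes, the path is valid wherever
        // the task happens to run.
        public static class WarcPathMapper
                extends Mapper<LongWritable, Text, Text, LongWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                File warcFile = new File(value.toString().trim());
                // A real job would parse warc records here; the sketch just
                // emits the file size as a stand-in for actual processing.
                context.write(new Text(warcFile.getPath()),
                        new LongWritable(warcFile.length()));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "warc-list-sketch");
            job.setJarByClass(WarcListJob.class);
            job.setMapperClass(WarcPathMapper.class);
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            // NLineInputFormat hands each mapper a fixed number of input
            // lines (paths), spreading the warcfiles across the cluster.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));
            NLineInputFormat.setNumLinesPerSplit(job, 10);
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }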