...

A simple FileResolverService can be built using a CGI script sitting on top of the standard Linux tools updatedb and locate. See https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolver and https://github.com/netarchivesuite/netarchivesuite-docker-compose/tree/hadoop/fileresolverdb
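
As a rough illustration of that approach, the sketch below implements a resolver as a Python CGI script that queries the locate database. The "pattern" query parameter and the plain-text, one-path-per-line response are assumptions made for the example; the actual request and response format is defined by the setup in the repositories linked above.

    #!/usr/bin/env python3
    # Minimal CGI sketch of a file resolver backed by updatedb/locate.
    # The "pattern" parameter and the plain-text response are illustrative
    # assumptions, not the actual FileResolverService protocol.
    import os
    import subprocess
    from urllib.parse import parse_qs

    query = parse_qs(os.environ.get("QUERY_STRING", ""))
    pattern = query.get("pattern", [""])[0]

    print("Content-Type: text/plain")
    print()  # blank line terminates the CGI headers

    if pattern:
        # locate consults the database maintained by periodic updatedb runs,
        # so results are only as fresh as the most recent updatedb.
        result = subprocess.run(["locate", "--basename", pattern],
                                capture_output=True, text=True)
        print(result.stdout, end="")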

There is also a trivial implementation of the FileResolver client interface, SimpleFileResolver, which can be used when all files in the archive are to be found in a single directory.
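
The sketch below illustrates the idea in Python (it is not the NetarchiveSuite class itself): when every file is in one known directory, resolving a filename reduces to joining it onto that directory.

    from pathlib import Path

    # Illustration only (not the NetarchiveSuite SimpleFileResolver class):
    # when all files live in one directory, resolution is a simple path join.
    class SingleDirectoryResolver:
        def __init__(self, directory: str):
            self.directory = Path(directory)

        def get_path(self, filename: str) -> Path:
            return self.directory / filename

    # Hypothetical directory and filename, for illustration only.
    resolver = SingleDirectoryResolver("/archive/warcs")
    print(resolver.get_path("example-job-1234.warc.gz"))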

Note that the tools in the ArchiveModule for uploading files and for retrieving files and records from the archive should all work with this architecture, provided an appropriate settings file is supplied. These tools serve as useful tests that the architecture is functioning.

Hadoop Configuration

The use of Hadoop for mass processing is somewhat complicated by the fact that we do not consider HDFS a suitable filesystem for long-term preservation. The existing bitrepository.org filepillar software assumes that the storage is a standard POSIX filesystem. To enable mass processing under Hadoop, we therefore expose the filesystem over NFS on all Hadoop nodes with the same mountpoint name, so the path to any given file is identical everywhere in Hadoop (and also on the bitrepository filepillar used for processing and on the server where the FileResolverService runs).

All our Hadoop jobs (replacements for the old batch jobs) therefore take as input a single HDFS file containing a list of paths to the warcfiles (on the non-HDFS filesystem) to be processed. The FileResolver architecture described above is used to build this input file from the names of the warcfiles to be processed.
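
The sketch below shows how such an input file might be assembled: each warcfile name is resolved to its NFS path via the FileResolverService, the resulting paths are written to a local file, and that file is copied into HDFS with the standard "hdfs dfs -put" command. The resolver URL, its "pattern" parameter and its plain-text response format are assumptions carried over from the CGI sketch above, not the actual NetarchiveSuite code.

    # Sketch: build the Hadoop input file from a list of warcfile names.
    # RESOLVER_URL, the "pattern" parameter and the plain-text response are
    # assumptions; "hdfs dfs -put" is the standard Hadoop CLI for copying a
    # local file into HDFS.
    import subprocess
    import tempfile
    from urllib.parse import urlencode
    from urllib.request import urlopen

    RESOLVER_URL = "http://fileresolver.example.org/cgi-bin/resolve"  # assumed

    def resolve(filename):
        """Return the full (non-HDFS) paths the resolver reports for a filename."""
        with urlopen(f"{RESOLVER_URL}?{urlencode({'pattern': filename})}") as response:
            return [line for line in response.read().decode().splitlines() if line]

    def build_input_file(warc_names, hdfs_path):
        """Write one resolved path per line and copy the list into HDFS."""
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
            for name in warc_names:
                for path in resolve(name):
                    tmp.write(path + "\n")
            local_path = tmp.name
        # -f overwrites any existing file at the HDFS destination.
        subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, hdfs_path],
                       check=True)

    # Hypothetical warcfile name and HDFS destination, for illustration only.
    build_input_file(["example-job-1234.warc.gz"], "/user/netarchive/input/paths.txt")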

...

With this configuration it should be possible to start Hadoop jobs from NetarchiveSuite. For example, clicking a "Browse reports for jobs" link in the GUI should start a Hadoop job to index a metadata warcfile.

Alternatively, the standalone job SendDedupIndexRequestToIndexserver (see the Additional Tools Manual) can be used to start a single indexing job using the Hadoop architecture.