Note that this documentation is for the old 6.0 release.
For the newest documentation, please see the current release documentation.

Archive Design

Contents

The Archive Design Description gives an overview of how the archive works, and describes indexing, caching and the CrawlLogIndexCache.

Archive Overview

The NetarchiveSuite archive component (known as the ArcRepository, although it can also store WARC files) allows files to be stored in a replicated and distributed environment.

The storage nodes are replicated over a number of bitarchive replicas. During the storing process, each copy is verified to have the same checksum before the store is accepted. In addition, there is functionality for automatically checking that each bitarchive replica holds the same files with the same checksums, which makes losing stored data highly improbable.
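The acceptance rule can be illustrated with a minimal sketch. It is not the NetarchiveSuite implementation: the function names, the in-memory dictionary of replica copies, and the choice of MD5 as checksum algorithm are all assumptions made for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of some file content (algorithm is an assumption)."""
    return hashlib.md5(data).hexdigest()


def store_accepted(original: bytes, replica_copies: dict) -> bool:
    """Accept a store only if every replica copy has the checksum of the original.

    replica_copies maps a hypothetical replica name to the bytes stored there.
    An empty mapping is rejected, since nothing has been verified.
    """
    expected = md5_hex(original)
    return bool(replica_copies) and all(
        md5_hex(copy) == expected for copy in replica_copies.values()
    )


if __name__ == "__main__":
    copies = {"ReplicaA": b"arc record data", "ReplicaB": b"arc record data"}
    print(store_accepted(b"arc record data", copies))
```

A single differing copy is enough to reject the store in this sketch; how a real deployment retries or reports such a failure is not covered here.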

For each bitarchive replica, the files may be stored in a distributed manner on one or more bitarchive instances. The philosophy behind this is that you can use off-the-shelf hardware with ordinary hard disks for one bitarchive replica. This allows you to use cheaper hardware and get more CPU power per byte, which may be important if you regularly want to perform large tasks on the stored bits, such as indexing or content analysis. Beware, however, that both the power usage and the maintenance costs may be higher in such a setup.

Architecture

The archive architecture is shown in the following figure:

where

  • Repository is handled by the ArcRepository application
  • Replica X for bitarchives is handled by the BitarchiveMonitor application
  • Bitarchive Instance X.N is handled by the Bitarchive application

So there are

  • An ArcRepository: Exactly One
  • BitarchiveMonitors: One per bitarchive replica
  • Bitarchives: As many as you like per bitarchive replica

The components communicate using JMS, and a file transfer method (currently FTP, HTTP and HTTPS methods are implemented).

The public interface is through the ArcRepository. This interface allows you to send JMS messages to the ArcRepository and get responses, using the JMSArcRepositoryClient. It has methods to store a file, get a file, get one record from a (W)ARC file, and run batch jobs on the archive. Batch jobs allow you to submit a job to be run on all files, or selected individual files, on a given location, and return the results. Additionally, there are some methods for recovering from an error scenario which we will cover shortly under bit preservation.
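The batch idea described above can be sketched as follows. This is a toy in-memory model, not the JMSArcRepositoryClient API: `run_batch`, the dictionary standing in for the files at one location, and the job signature are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def run_batch(
    archive: dict,
    job: Callable[[str, bytes], str],
    only: Optional[Iterable[str]] = None,
) -> list:
    """Apply `job` to every file at one location, or only to the named files.

    archive maps filename -> content and stands in for one location.
    Names in `only` that do not exist at the location are skipped.
    """
    if only is None:
        names = sorted(archive)
    else:
        names = [name for name in only if name in archive]
    return [job(name, archive[name]) for name in names]


if __name__ == "__main__":
    location = {"1-a.arc": b"abc", "2-b.arc": b"de"}
    # A trivial job that reports the size of each file.
    print(run_batch(location, lambda name, data: f"{name}:{len(data)}"))
```

The point of the model is only that the job travels to the files and the results travel back, rather than the files being fetched by the caller.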

Repository State Machine

When a file is uploaded to the repository, the ArcRepository sends upload requests to the replicas and checks the uploads, as sketched in the figure below.

Internally, this is handled by a state machine driven by the messages the ArcRepository receives. A message can be one of the following:

  • a Store message for a file to be stored
  • an Upload Reply message from a bitarchive replica that was requested to upload a file (during the storing process)
  • a Checksum Reply message from a bitarchive replica that was requested to find the checksum of a file as part of checking its status (during the storing process)

The state diagram for each message is given in the figures below:

Store message

Upload Reply message

Checksum Reply message
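As a rough illustration of a message-driven store, the sketch below reacts to the three message types with one method each. The state names, the transitions and the `StoreTracker` class are assumptions for illustration only; the state diagrams remain the authoritative description.

```python
from enum import Enum


class State(Enum):
    # Hypothetical per-replica states; not taken from the diagrams.
    UPLOAD_STARTED = "UPLOAD_STARTED"
    DATA_UPLOADED = "DATA_UPLOADED"
    UPLOAD_COMPLETED = "UPLOAD_COMPLETED"
    UPLOAD_FAILED = "UPLOAD_FAILED"


class StoreTracker:
    """Tracks the store state of one file on each replica."""

    def __init__(self, replicas, expected_checksum):
        self.replicas = list(replicas)
        self.expected_checksum = expected_checksum
        self.state = {}

    def on_store(self):
        # Store message: an upload is requested from every replica.
        for replica in self.replicas:
            self.state[replica] = State.UPLOAD_STARTED

    def on_upload_reply(self, replica, ok):
        # Upload Reply message: the replica reports success or failure.
        self.state[replica] = State.DATA_UPLOADED if ok else State.UPLOAD_FAILED

    def on_checksum_reply(self, replica, checksum):
        # Checksum Reply message: only meaningful after a successful upload.
        if self.state.get(replica) is not State.DATA_UPLOADED:
            return
        matches = checksum == self.expected_checksum
        self.state[replica] = (
            State.UPLOAD_COMPLETED if matches else State.UPLOAD_FAILED
        )

    def store_accepted(self):
        # The store counts as accepted once every replica has completed.
        return bool(self.state) and all(
            s is State.UPLOAD_COMPLETED for s in self.state.values()
        )
```

In this model a failed replica stays failed; any retry behaviour is deliberately left out.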

Admin for the Repository

The ArcRepository keeps a record of the upload status of all the files for all the replicas. Until release 3.10, this information was stored in the admin.data file; that solution is now deprecated. The information about the files in the replicas is now stored in a database by default. This database can also be used for bit preservation.

It has the following table diagram:

A few database indices are also needed:

create index fileandreplica on replicafileinfo (file_id, replica_id);
create index replicaandfileliststatus on replicafileinfo (replica_id, filelist_status);
create index replicaandchecksumstatus on replicafileinfo (replica_id, checksum_status);
create index fileindex on file (filename);
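
To show the kind of lookup these indices support, here is a small sketch using an in-memory SQLite database. The schema is a stand-in that models only the columns named in the index statements; the column types, the status values ("OK", "CORRUPT") and the helper function are assumptions, and the real database has more tables and columns.

```python
import sqlite3


def build_db():
    """Create a stand-in schema plus the four indices listed above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Hypothetical minimal tables: only indexed columns are modelled.
        create table file (file_id integer primary key, filename text);
        create table replicafileinfo (
            file_id integer,
            replica_id text,
            filelist_status text,
            checksum_status text
        );
        create index fileandreplica on replicafileinfo (file_id, replica_id);
        create index replicaandfileliststatus on replicafileinfo (replica_id, filelist_status);
        create index replicaandchecksumstatus on replicafileinfo (replica_id, checksum_status);
        create index fileindex on file (filename);
        """
    )
    return conn


def files_with_checksum_status(conn, replica_id, status):
    """List the filenames on one replica that have the given checksum status."""
    rows = conn.execute(
        "select f.filename from replicafileinfo r "
        "join file f on f.file_id = r.file_id "
        "where r.replica_id = ? and r.checksum_status = ? "
        "order by f.filename",
        (replica_id, status),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_db()
    conn.execute("insert into file values (1, '1-a.arc')")
    conn.execute("insert into replicafileinfo values (1, 'ReplicaA', 'OK', 'CORRUPT')")
    print(files_with_checksum_status(conn, "ReplicaA", "CORRUPT"))
```

The query filters on (replica_id, checksum_status), which is the column pair covered by the replicaandchecksumstatus index.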

Communication between ArcRepository and the replicas

The following drawing shows the message interaction between the ArcRepository and the replicas:

The 'replica inf cache database' is the database described above.

Extra functionality

Besides the basic architecture, the archive component contains the following extras:

  • An index server, providing three kinds of indexing capabilities
  • A bitpreservation web interface, providing an interface for monitoring bit-integrity
    and handling error scenarios (missing/corrupt files in the archive)
  • Command line tools for uploading a file, getting a file, getting an arc
    record or running a batch job.

The index server

The index server allows you to build an index over your data.

It does this by using the batch method defined in the arc repository.

It assumes you have an archive containing only (W)ARC files, and that the (W)ARC files are named

<job-number>-*.[w]arc(.gz)

and that for each job number, there are (W)ARC files with the names

<job-number>-metadata-*.[w]arc

containing the crawl.log for the harvest that generated the (w)arc files, and a cdx file for all the (w)arc files. These files will be generated by the NetarchiveSuite harvester component.
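
The naming convention can be checked with a regular expression, as in the sketch below. It assumes the job number is a run of decimal digits followed by a hyphen, and it also tolerates a .gz suffix on metadata files; both points, like the function name `classify`, are assumptions rather than documented behaviour.

```python
import re

# <job-number>-[metadata-]<anything>.arc or .warc, optionally gzipped.
ARCHIVE_NAME = re.compile(
    r"^(?P<job>\d+)-(?P<meta>metadata-)?.+\.w?arc(?:\.gz)?$"
)


def classify(filename):
    """Return (job_number, is_metadata) for a matching name, else None."""
    match = ARCHIVE_NAME.match(filename)
    if match is None:
        return None
    return int(match.group("job")), match.group("meta") is not None


if __name__ == "__main__":
    for name in ("42-117-20051212141240-00001-host.arc.gz",
                 "42-metadata-1.warc",
                 "notes.txt"):
        print(name, classify(name))
```

Grouping the results by job number gives, for each job, the data files to index and the metadata files that carry the crawl.log and cdx information.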

Using the IndexRequestClient, you may request indexes over a number of jobs, either as a CDX index or as a Lucene index. The Lucene index comes in two flavours:

  • One is used by the deduplication feature of the harvesting component, and contains only the objects whose MIME type is not text/*
  • The other is used by the access component (ViewerProxy), and contains all objects.
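
The difference between the two flavours comes down to a filter on the MIME type. The sketch below shows that filter on plain (url, mime_type) pairs; it does not build a Lucene index, and the function names and input shape are hypothetical.

```python
def in_dedup_index(mime_type: str) -> bool:
    """True for objects whose MIME type is not text/*."""
    return not mime_type.strip().lower().startswith("text/")


def split_for_indexes(objects):
    """Return (dedup_entries, viewer_entries) for (url, mime_type) pairs.

    The viewer flavour keeps every object; the deduplication flavour
    keeps only the non-text ones.
    """
    objects = list(objects)
    dedup = [obj for obj in objects if in_dedup_index(obj[1])]
    return dedup, objects


if __name__ == "__main__":
    crawled = [
        ("http://example.org/", "text/html"),
        ("http://example.org/logo.png", "image/png"),
    ]
    dedup, viewer = split_for_indexes(crawled)
    print(len(dedup), len(viewer))
```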

The bit preservation interface

For monitoring the bit integrity of your archive, and for performing actions in case of bit errors, a user interface is available. This is installed as a site section of the NetarchiveSuite webserver.

The interface offers the following features:

  • Perform a check that all files are present on a given bitarchive replica (can take hours to complete this check)
  • Perform a check that all files have correct checksum on a given bitarchive replica (can take hours to complete this check)
  • Reestablish files that are missing on one bitarchive replica, but available in another replica
  • Replace a file with a bad checksum (i.e. the file is corrupt) on one bitarchive replica, where a healthy copy is available in another replica
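
The two checks can be pictured as a comparison of per-replica file lists and checksums. The sketch below is an illustration under assumptions, not the bit preservation code: it takes the replica contents as plain dictionaries, and it treats a checksum as correct only when a strict majority of the replicas holding the file agree on it.

```python
from collections import Counter


def find_problems(replicas):
    """Compare replicas given as {replica_name: {filename: checksum}}.

    Returns (missing, corrupt): two dicts mapping each replica name to a
    sorted list of filenames. A file with no strict-majority checksum is
    not reported as corrupt anywhere, since no copy can be trusted.
    """
    all_files = set()
    for files in replicas.values():
        all_files.update(files)

    missing = {name: sorted(all_files - set(files))
               for name, files in replicas.items()}
    corrupt = {name: [] for name in replicas}

    for filename in sorted(all_files):
        seen = {name: files[filename]
                for name, files in replicas.items() if filename in files}
        checksum, votes = Counter(seen.values()).most_common(1)[0]
        if votes * 2 <= len(seen):
            continue  # no strict majority, so no verdict
        for name, value in seen.items():
            if value != checksum:
                corrupt[name].append(filename)
    return missing, corrupt


if __name__ == "__main__":
    state = {
        "ReplicaA": {"f1": "aa", "f2": "bb"},
        "ReplicaB": {"f1": "aa", "f2": "bb"},
        "ReplicaC": {"f1": "xx"},
    }
    print(find_problems(state))
```

Files listed as missing or corrupt on one replica are the candidates for being reestablished or replaced from a replica that holds a healthy copy.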