Archive Design
Contents
The Archive Design Description gives an overview of how the archive works, and describes indexes and caching, including the CrawlLogIndexCache.
Archive Overview
The NetarchiveSuite archive component allows files to be stored in a replicated and distributed environment.
Basically, the stored data is replicated over a number of bitarchive replicas. On storage, each copy is checked to have the same checksum before the store is accepted. In addition, functionality for automatically checking that each bitarchive replica holds the same files with the same checksums makes ever losing a bit highly improbable.
For each bitarchive replica, the files may be stored in a distributed manner on several bitarchive instances. The philosophy is that you can use off-the-shelf hardware with ordinary hard disks for one bitarchive replica. This allows you to use cheaper hardware and get more CPU power per byte, which may be important if you regularly want to perform large tasks on the stored bits, such as indexing or content analysis. Beware, however, that both the power usage and the maintenance needs may be higher in such a setup.
Architecture
The archive architecture is shown in the following figure:
where
- Repository is handled by the ArcRepository application
- Replica X for bitarchives is handled by the BitarchiveMonitor application
- Bitarchive Instance X.N is handled by the Bitarchive application
So there are:
- exactly one ArcRepository
- one BitarchiveMonitor per bitarchive replica
- as many Bitarchives as you like per bitarchive replica
The components communicate using JMS, and a file transfer method (currently FTP, HTTP and HTTPS methods are implemented).
The public interface is through the ArcRepository. This interface allows you to send JMS messages to the ArcRepository and get responses, using the JMSArcRepositoryClient. It has methods to store a file, get a file, get one record from an ARC file, and run batch jobs on the archive. Batch jobs allow you to submit a job to be run on all files, or on selected individual files, in a given replica, and return the results. Additionally, there are some methods for recovering from an error scenario, which we will cover shortly under bit preservation.
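A sketch of what this looks like from the client side. The class and method names below (JMSArcRepositoryClient.getInstance(), store, get, batch) follow the NetarchiveSuite codebase, but exact packages and signatures vary between releases, so treat this as illustrative rather than authoritative:

    import java.io.File;

    import dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient;
    import dk.netarkivet.common.distribute.arcrepository.BitarchiveRecord;

    public class ArchiveClientSketch {
        public static void main(String[] args) {
            // Obtain a client that talks to the ArcRepository over JMS.
            JMSArcRepositoryClient client = JMSArcRepositoryClient.getInstance();

            // Store a file: blocks until the repository has verified the
            // upload on the bitarchive replicas.
            client.store(new File("12345.arc.gz"));

            // Get one record from an ARC file, by file name and offset.
            BitarchiveRecord record = client.get("12345.arc.gz", 0L);

            // Run a batch job on all files (or a selected subset) in one
            // replica; the results are returned when the job completes.
            // client.batch(someFileBatchJob, "ReplicaOne");
        }
    }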
Repository State Machine
When a file is stored in the repository, the repository uploads it to the replicas by sending upload requests and then verifying the uploads, as sketched in the figure below.
Internally, this is handled by a state machine driven by the messages the repository receives. These can be either
- a Store message for a file to be stored,
- an Upload Reply message from a bitarchive replica that was requested to upload a file (during the store process), or
- a Checksum Reply message from a bitarchive replica that was requested to find a checksum as part of checking a file's status (during the store process).
The state diagram for each message type is given in the figures below; a simplified sketch of the corresponding dispatch logic follows the figures.
Store message
Upload Reply message
Checksum Reply message
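The following hypothetical, heavily simplified sketch illustrates the dispatch captured by these diagrams. The state names and method signatures are invented for illustration and do not mirror the actual ArcRepository code:

    import java.util.HashMap;
    import java.util.Map;

    public class StoreStateMachineSketch {
        // Illustrative states; the real ArcRepository's names may differ.
        enum UploadState { UPLOAD_STARTED, DATA_UPLOADED, UPLOAD_COMPLETED, UPLOAD_FAILED }

        // One state per (file, replica) pair, keyed "filename,replicaId".
        private final Map<String, UploadState> states = new HashMap<String, UploadState>();

        // Store message: start an upload to every replica.
        void onStore(String filename, Iterable<String> replicaIds) {
            for (String replica : replicaIds) {
                states.put(filename + "," + replica, UploadState.UPLOAD_STARTED);
                // ...send an upload request to this replica...
            }
        }

        // Upload Reply message: the replica has received the file; verify
        // the copy by requesting its checksum.
        void onUploadReply(String filename, String replica, boolean ok) {
            states.put(filename + "," + replica,
                    ok ? UploadState.DATA_UPLOADED : UploadState.UPLOAD_FAILED);
            // ...on success, send a checksum request to this replica...
        }

        // Checksum Reply message: accept the copy only if the reported
        // checksum matches that of the original file.
        void onChecksumReply(String filename, String replica,
                             String reported, String expected) {
            states.put(filename + "," + replica,
                    reported.equals(expected) ? UploadState.UPLOAD_COMPLETED
                                              : UploadState.UPLOAD_FAILED);
        }
    }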
Admin for the Repository
The ArcRepository keeps a record of the upload status of every file on every replica. By default this is stored in the admin.data file, though it is possible to store it in a database instead. This database can also be used for bit preservation.
It has the following table diagram:
A few database indexes are also needed:

    create index fileandreplica on replicafileinfo (file_id, replica_id);
    create index replicaandfileliststatus on replicafileinfo (replica_id, filelist_status);
    create index replicaandchecksumstatus on replicafileinfo (replica_id, checksum_status);
    create index fileindex on file (filename);
Communication between ArcRepository and the replicas
The following drawing shows the message interaction between the ArcRepository and the replicas:
The 'replica info cache database' is the database described above.
Extra functionality
Besides the basic architecture, the archive component contains the following extras:
- An index server, providing three kinds of indexing capabilities for a replica.
- A bitpreservation web interface, providing an interface for monitoring bit integrity and handling error scenarios (missing/corrupt files in the archive).
- Command line tools for uploading a file, getting a file, getting an arc record, or running a batch job.

The index server
The index server allows you to build an index over your data.
It does this by using the batch method defined in the ArcRepository.
It assumes you have an archive containing only ARC files, that the ARC files are named <<job-number>>.arc(.gz), and that for each job number there are ARC files named <<job-number>>-metadata.arc, containing the crawl.log for the harvest that generated the ARC files and a CDX file covering all of them. These files are generated by the NetarchiveSuite harvester component.
Using the IndexRequestClient, you may request indexes over a number of jobs, either of type CDX or as a Lucene index; a client-side sketch follows the list below. The Lucene index comes in two different flavours:
- One is used for the deduplication feature of the harvesting component, which only contains the objects that are not of mime-type text/*
- The other is used by the access component, and contains all objects.
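A sketch of an index request, assuming the IndexClientFactory and JobIndexCache client classes found in the NetarchiveSuite codebase (packages and signatures may differ in your release):

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;

    import dk.netarkivet.common.distribute.indexserver.Index;
    import dk.netarkivet.common.distribute.indexserver.IndexClientFactory;
    import dk.netarkivet.common.distribute.indexserver.JobIndexCache;

    public class IndexRequestSketch {
        public static void main(String[] args) {
            // Request a deduplication-flavoured Lucene index over two jobs.
            Set<Long> jobIDs = new HashSet<Long>();
            jobIDs.add(1L);
            jobIDs.add(2L);

            JobIndexCache cache = IndexClientFactory.getDedupCrawllogInstance();
            Index<Set<Long>> index = cache.getIndex(jobIDs);

            // The file/directory holding the generated Lucene index.
            File indexLocation = index.getIndexFile();
        }
    }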
The bit preservation interface
For monitoring the bit integrity of your archive, and for performing actions in case of bit errors, a user interface is available. This is installed as a site section of the NetarchiveSuite webserver.
This will basically give you a user interface with the following possibilities:
- Perform a check that all files are present on a given bitarchive replica (this check can take hours to complete)
- Perform a check that all files have correct checksums on a given bitarchive replica (this check can take hours to complete)
- Reestablish files that are missing from one bitarchive replica but available in another replica
- Replace a file with a bad checksum (i.e. a corrupt file) on one bitarchive replica when a healthy copy is available in another replica
Indexes and caching
The deduplication code and the viewer proxy both make use of an index generating system to extract Lucene indexes from the data in the archive. This system makes extensive use of caching to improve index generation performance. This section describes the default index generating system implemented as the IndexRequestClient plugin.
There are four parts involved in getting an index, each with its own cache. The first part resides on the client side, in the IndexRequestClient class, which caches unzipped Lucene indexes and makes them available for use. The IndexRequestClient receives its data from the CrawlLogIndexCache in the form of gzipped Lucene indexes. The CrawlLogIndexCache generates the Lucene indexes based on Heritrix crawl.log files and CDX files extracted from the ARC files, and caches the generated indexes in gzipped form. The crawl.log and CDX files are in turn received through two more caches, both of which extract their data directly from the archive using batch jobs and store it in sorted form in their caches.
All four caches are based on the generic FileBasedCache class, which handles the synchronization necessary to ensure that not only separate threads but also separate processes can access the cache simultaneously without corrupting it. When a specific cache item is requested, the cache is first checked to see whether the item already exists. If it doesn't, the process locks a file that indicates work is in progress. If the lock is acquired, the actual cache-filling operation can take place; otherwise another thread or process must already be working on the item, and we can simply wait until it finishes and use its data.
The FileBasedCache class is generic in the type of identifier that indicates which item to get. The higher-level caches (IndexRequestClient and CrawlLogIndexCache) use a Set<Long>, so an index can cover multiple jobs identified by their IDs. The two low-level caches use just a Long, so they operate on one job at a time.
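A minimal sketch of this locking-and-filling discipline, using invented names (FileCacheSketch, getCacheFile, fillCache) rather than the actual FileBasedCache source:

    import java.io.File;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.RandomAccessFile;
    import java.nio.channels.Channels;
    import java.nio.channels.FileLock;

    public abstract class FileCacheSketch<I> {
        /** Each subclass keeps its cache files in its own directory. */
        protected final File cacheDir;

        protected FileCacheSketch(File cacheDir) {
            this.cacheDir = cacheDir;
        }

        /** Maps an identifier to the cache file it lives in. */
        protected abstract File getCacheFile(I id);

        /** Produces the cached data for the given identifier. */
        protected abstract void fillCache(I id, OutputStream out) throws IOException;

        /** Returns the cache file for id, filling it first if necessary. */
        public File cache(I id) throws IOException {
            File f = getCacheFile(id);
            if (f.length() > 0) {
                return f; // already cached
            }
            // Threads in this process queue on the interned path; other
            // processes queue on the file lock. Whoever gets in first
            // does the work.
            synchronized (f.getAbsolutePath().intern()) {
                try (RandomAccessFile raf = new RandomAccessFile(f, "rw");
                     FileLock lock = raf.getChannel().lock()) {
                    if (raf.length() == 0) { // nobody filled it while we waited
                        fillCache(id, Channels.newOutputStream(raf.getChannel()));
                    }
                }
            }
            return f;
        }
    }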
The two caches that handle multiple job IDs as their cache item ID must handle a couple of special scenarios: their cache item ID may consist of hundreds or thousands of job IDs, and part of the job data may be unavailable. To deal with the first problem, any cache item with more than four job IDs in the ID set is stored in a file whose name contains the four lowest-numbered IDs, followed by an MD5 checksum of a concatenation of all the IDs in sorted order. This ensures uniqueness of the cache file name without overflowing operating-system limits on file name length.
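A sketch of this naming scheme using only the JDK; the exact separators and suffix of the real cache file names are not specified above, so those details are invented here:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    public class CacheFileNameSketch {
        /** Builds a unique, bounded-length file name for a set of job IDs. */
        static String cacheFileName(Set<Long> jobIDs) throws NoSuchAlgorithmException {
            List<Long> sorted = new ArrayList<Long>(new TreeSet<Long>(jobIDs));
            if (sorted.size() <= 4) {
                return join(sorted, sorted.size()) + "-cache";
            }
            // More than four IDs: name is the four lowest IDs plus an MD5
            // of all IDs concatenated in sorted order.
            StringBuilder all = new StringBuilder();
            for (long id : sorted) {
                all.append(id).append('-');
            }
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(all.toString().getBytes(StandardCharsets.UTF_8));
            String hash = String.format("%032x", new BigInteger(1, digest));
            return join(sorted, 4) + "-" + hash + "-cache";
        }

        private static String join(List<Long> ids, int count) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < count; i++) {
                if (i > 0) {
                    sb.append('-');
                }
                sb.append(ids.get(i));
            }
            return sb.toString();
        }
    }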
Every subclass of FileBasedCache uses its own directory, where its cache files are placed. The name of the final cache file is derived uniquely from the ID set to be indexed. Since synchronization is done on the complete path to the cache file, a blocking situation can only occur when two instances of the same class (e.g. DedupCrawlLogIndexCache) attempt to cache the same ID set at the same time. Even then, the cache file is only made once, since the waiting instance will use the cache file that the first instance created.
A subclass of CombiningMultiFileBasedCache uses a corresponding subclass of RawMetadataCache to make sure that a cache file exists for every ID (an ID-cache file). If such a file does not exist, it is created. Afterwards, all the ID-cache files are combined into a complete file for the wanted ID set.
The ID-cache files are locked against other processes during their creation, but they are only created once, since they can be reused directly when creating the Lucene cache for other ID sets containing the same ID.
CrawlLogIndexCache
The CrawlLogIndexCache guarantees that an index is always returned for a given request, even if part of the necessary data is unavailable. This is done by performing a preparatory step in which the data required to create the index is retrieved. If any of the data chunks are missing, a recursive attempt at generating an index for the reduced set is performed. Since the underlying data is always fetched from a cache, it is very likely that all the data for the reduced set is already available, so no further recursion is typically needed. The set of job IDs that was actually found is returned from the request to cache data, while the actual data is stored in a file whose name can be requested afterwards. Note that future requests for the full set of job IDs will cause a renewed attempt at downloading the underlying data, which may take a while, especially if the lack of data is caused by a time-out.
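In client code, this partial-result behaviour can be handled by comparing the returned job set with the requested one, as in this sketch (reusing the assumed client classes from the earlier index example):

    import java.util.HashSet;
    import java.util.Set;

    import dk.netarkivet.common.distribute.indexserver.Index;
    import dk.netarkivet.common.distribute.indexserver.JobIndexCache;

    public class PartialIndexSketch {
        /** Requests an index and reports any jobs it could not cover. */
        static Index<Set<Long>> indexFor(JobIndexCache cache, Set<Long> wanted) {
            Index<Set<Long>> index = cache.getIndex(wanted);
            if (!index.getIndexSet().equals(wanted)) {
                Set<Long> missing = new HashSet<Long>(wanted);
                missing.removeAll(index.getIndexSet());
                // A later call with the full set will retry fetching the
                // missing data, which may be slow if it times out again.
                System.out.println("Index does not cover jobs " + missing);
            }
            return index;
        }
    }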
The CrawlLogIndexCache is the most complex of the caches, but its various responsibilities are spread out over several superclasses.
- The top class is the generic FileBasedCache, which handles the locking necessary to ensure that only one thread in one process at a time creates the cached data. It also provides two helper methods: getIndex() is a forgiving cache lookup for complex cache items that handles the partial results described above, and the get(Set<I>) method allows for optimized caching of multiple simple cache requests.
- The MultiFileBasedCache handles the naming of files for caches that use sets as cache item identifiers.
- The CombiningMultiFileBasedCache extends the MultiFileBasedCache to use another, simpler cache as a data source, and provides an abstract method for combining the data from the underlying cache. It adds a step to the caching process that fetches the underlying data, and it only performs the combine action if all required data was found.
- The CrawlLogIndexCache is a CombiningMultiFileBasedCache whose underlying data is crawl.log files, but adds a simple CDX cache to provide data not found in the crawl.log. It also implements the combine method by creating a Lucene index from the crawl.log and CDX files, using code from Kristinn Sigurðsson. The other subclass of CombiningMultiFileBasedCache, which provides combined CDX indexes, is not currently used in the system, but is available at the IndexRequestClient level.
- The CrawlLogIndexCache is further subclassed into two flavours: FullCrawlLogIndexCache, which is used in the viewer proxy, and DedupCrawlLogIndexCache, which is used by the deduplicator in the harvester. The DedupCrawlLogIndexCache restricts the index to non-text files, while the FullCrawlLogIndexCache indexes all files.
The two caches used by CrawlLogIndexCache are CDXDataCache and CrawlLogDataCache, both of which are simply instantiations of the RawMetadataCache. They both work by extracting records from the archived metadata files based on regular expressions, using batch jobs submitted through the ArcRepositoryClient. This is not the most efficient way of getting the data, as a separate batch job is submitted to fetch the files for each job, but it is simple. It could be improved by overriding the get(Set<I>) method to collect all the data in one batch job, though some care would have to be taken with synchronization and with avoiding refetching data unnecessarily.
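As a sketch of this extraction pattern, the following example uses FileBatchJob and its processOnlyFilesMatching filter from the NetarchiveSuite codebase; the real RawMetadataCache uses more specialized batch jobs that also filter on record URL and MIME type, so this is a simplification:

    import java.io.File;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;

    import dk.netarkivet.common.utils.batch.FileBatchJob;

    // Copies every file whose name matches the metadata-file pattern for
    // one job; record-level filtering is omitted for brevity.
    public class MetadataExtractorJob extends FileBatchJob {
        @Override
        public void initialize(OutputStream os) {
        }

        @Override
        public boolean processFile(File file, OutputStream os) {
            try {
                Files.copy(file.toPath(), os);
                return true;
            } catch (IOException e) {
                return false; // reported as a failed file in the BatchStatus
            }
        }

        @Override
        public void finish(OutputStream os) {
        }
    }

A client would restrict such a job to one harvest job's metadata file, e.g. job.processOnlyFilesMatching("42-metadata-.*\\.arc.*"), and submit it with the batch method shown earlier, once per job ID; collecting several job IDs into a single batch job is the optimization suggested above.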