
The current Wayback module is based on the Wayback setup used in the Danish Webarchive, and can therefore be read as a description of the Danish Wayback solution.


The Danish Webarchive implementation of Wayback has the following primary components.

ArcRepository

All access to harvested data and metadata goes through the ArcRepository interface from NetarchiveSuite. This ensures that we follow our own guidelines for bit preservation and restricted access, and it also lets us use the distributed ArcRepository architecture for high-performance indexing.

Indexing Component

The Indexing Component consists of:

  1. The Wayback Indexer, which generates raw index data using Wayback-supplied and custom code packaged in NetarchiveSuite BatchJob "wrappers". The wrappers allow the code to be executed on the distributed ArcRepository.
  2. The Aggregator, which sorts and merges the raw index files. The sorting itself is delegated to the native Linux "sort" command, which performs very well even on very large (> 100 GB) files.
  3. A database, which simply records which files in the archive have already been indexed and which are awaiting indexing.
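
The bookkeeping and aggregation steps above can be sketched roughly as follows. This is only an illustration in Python, not the NetarchiveSuite code: the table name `archive_files`, the function names, and the use of SQLite are all assumptions made for the example. The only part taken from the description above is that sorting and merging is handed to the system "sort" command.

```python
import os
import sqlite3
import subprocess


def open_tracking_db(path=":memory:"):
    """Open a database recording which archive files have been indexed.

    The schema is hypothetical; the real module may store this differently.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive_files ("
        " filename TEXT PRIMARY KEY,"
        " indexed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def register_file(conn, filename):
    """Record a newly discovered archive file as awaiting indexing."""
    conn.execute(
        "INSERT OR IGNORE INTO archive_files (filename) VALUES (?)",
        (filename,),
    )
    conn.commit()


def pending_files(conn):
    """Return the names of files that have not been indexed yet."""
    rows = conn.execute(
        "SELECT filename FROM archive_files WHERE indexed = 0 ORDER BY filename"
    )
    return [row[0] for row in rows]


def mark_indexed(conn, filename):
    """Flag a file as indexed once its raw index data has been produced."""
    conn.execute(
        "UPDATE archive_files SET indexed = 1 WHERE filename = ?",
        (filename,),
    )
    conn.commit()


def aggregate(raw_index_paths, output_path):
    """Sort and merge raw index files into one file via the system sort."""
    if not raw_index_paths:
        # Without input files, sort would wait on standard input.
        raise ValueError("no raw index files given")
    # LC_ALL=C forces plain byte-wise ordering, independent of the locale.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(
        ["sort", "-o", output_path, *raw_index_paths],
        check=True,
        env=env,
    )
```

A run of this sketch would register the files found in the archive, produce raw index data for each name returned by `pending_files`, call `mark_indexed` for each, and finally pass the raw index files to `aggregate`. The byte-wise locale is a choice made for the example so that the output order does not depend on the machine's language settings.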

Currently, all of these components run on the same physical machine as the Wayback Tomcat server.

Access Component

The access

See Tools in the Wayback module and Wayback Configuration.
