OpenWayback 2.0 beta with local archive files

Goals

The goals of this mini-project were:

To test openwayback (beta) to see if it actually functions with our archive
To learn how to configure and deploy openwayback to operate with a locally-mounted archive
To test the performance of openwayback on a locally-mounted archive

Installation

Cloning and building wayback from its GitHub distribution (git@github.com:iipc/openwayback.git) was painless. As with older distributions of wayback, the accepted installation procedure is:

Deploy the wayback war file on a running tomcat and wait for it to be unpacked
Shut tomcat down
Replace wayback's config files with your own
Restart tomcat

This is no more acceptable as a deployment procedure now than it ever was. However, there is now an example using maven overlays (https://github.com/iipc/openwayback-sample-overlay), which should be investigated as a better way to create a directly deployable package.

Wayback was deployed on a tomcat 7.0.52 running at scape@iapetus:csr/tomcat on port 6051.

Initially I tried deploying to the web-context "/wayback", but I was unable to get it to work. I switched to deploying to the ROOT context by renaming wayback.war to ROOT.war in the webapps directory.

Configuration

A wayback AccessPoint requires various elements, of which the most important is a WaybackCollection. This in turn consists of a ResourceIndex, a ResourceStore, and one or more Shutdownables. The final configuration file is here: wayback_selfcontained.xml.

ResourceIndex

Fortunately, we already had about 6TB of cdx index files available on isilon, covering netarkivet up to early 2011. I set up an rsync process to copy the remaining files over. (See http://noerdroid.blogspot.dk/2013/06/getting-from-to-c-via-b-with-ssh.html for more info.) The full index is now around 8.3TB. The files are in scape@iapetus:/home/scape/scape-hdfs/csr/

(Copying the files over highlighted a problem with our naming procedure for wayback index files. We use rollover-naming, as one does for logfiles. This means that filenames are reused, so when one returns to the copy after a period of time it can be difficult to know which files have already been copied over.)
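One way around the reused-filename problem is to identify files by content rather than by name. The following is only a sketch of that idea, not something we actually ran: the class and method names (CopyCheck, sha256, notYetCopied) are hypothetical, it looks at a single flat directory on each side, and hashing terabytes of index files this way would be slow, so in practice one would probably keep a manifest of digests instead of recomputing them.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: decides what still needs copying by comparing
// file contents (SHA-256), so rolled-over filenames do not matter.
class CopyCheck {

    // Returns the SHA-256 digest of a file as a lowercase hex string.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Lists the regular files in sourceDir whose content is not present
    // in any regular file in destDir, whatever the files are called.
    static List<Path> notYetCopied(Path sourceDir, Path destDir)
            throws IOException, NoSuchAlgorithmException {
        Set<String> alreadyCopied = new HashSet<>();
        try (DirectoryStream<Path> dest = Files.newDirectoryStream(destDir)) {
            for (Path p : dest) {
                if (Files.isRegularFile(p)) {
                    alreadyCopied.add(sha256(p));
                }
            }
        }
        List<Path> missing = new ArrayList<>();
        try (DirectoryStream<Path> source = Files.newDirectoryStream(sourceDir)) {
            for (Path p : source) {
                if (Files.isRegularFile(p) && !alreadyCopied.contains(sha256(p))) {
                    missing.add(p);
                }
            }
        }
        Collections.sort(missing);
        return missing;
    }
}
```

With this approach a file that was copied as index.cdx.1 and has since rolled over to index.cdx.2 on the source side is still recognised as already copied.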

ResourceStore

This is the main difference from the current production installation of wayback. The entire netarkiv arcrepository is mounted under /home/scape/netarkiv. One defines a ListFactoryBean containing a DirectoryResourceFileSource pointing to this directory. The ResourceStore used is a LocationDBResourceStore, which has a reference to this ListFactoryBean.

Shutdownables

The Shutdownables to be configured are just those which monitor the ResourceStore for new archive files and add them to the relevant database.

Startup

When wayback is restarted, the ResourceIndex is immediately functional and it is possible to test it by searching in the wayback web-UI. This worked out of the box. Clicking on a link to a search result initially failed, simply because the ResourceStore had not had time to build up its initial database of all archive files. After a few hours this was complete. At that point, loading a page in wayback gave a server error, with a stacktrace in the tomcat logs.

Bugfixing

From the stacktrace it was clear that the arcreader in openwayback was reading the cdx records wrongly, jumping to somewhere in the next record instead of the current record. This presented us with a golden opportunity to begin actively contributing to openwayback. Once I could reproduce the bug and produce a tentative solution, we could begin following openwayback's pull-request policy. The pull-request is at https://github.com/iipc/openwayback/pull/104. After some back-and-forth on the correct way to fix the bug, it turned out that the "right" way to do it didn't work at all because of a second bug: webarchive-commons misuses InputStream.skip(). Eventually we got everything working.
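For anyone who has not met it, the InputStream.skip() pitfall is a general Java one: skip(n) is allowed to skip fewer than n bytes, and it returns the number it actually skipped. Code that ignores the return value can end up at the wrong offset. The following is a minimal sketch of a defensive helper, not the code from the pull request or from webarchive-commons; the names SkipUtil and skipFully are hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating safe use of InputStream.skip().
class SkipUtil {

    // Skips exactly n bytes, or throws EOFException if the stream ends first.
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            // skip() may legitimately skip fewer bytes than requested.
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else {
                // A return of 0 does not by itself mean end of stream,
                // so read a single byte to find out.
                if (in.read() == -1) {
                    throw new EOFException(
                            "stream ended with " + remaining + " bytes left to skip");
                }
                remaining--;
            }
        }
    }
}
```

Calling SkipUtil.skipFully(in, offset) in place of a bare in.skip(offset) guarantees that the stream is positioned where the caller expects, or that the failure is reported.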

Performance

We didn't carry out any systematic performance testing, but in general it seemed to work well. The one noticeable problem is in replaying sites that have many harvests and many objects per page. Wayback needs to find all matches for a given url and then step through them to find the one whose timestamp best matches the requested time. What's really needed is an indexing system that has a secondary sorting by date.
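The kind of lookup a date-sorted index would allow can be sketched as follows. This is an in-memory illustration only, not how wayback's ResourceIndex works and not a proposal for the real on-disk format; the class CaptureIndex and its methods are hypothetical. Captures for each url are kept in a sorted set, so the closest capture is found from its two neighbours instead of by stepping through every match.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical in-memory index: primary key url, secondary sort by capture time.
class CaptureIndex {
    // 14-digit timestamps of the form yyyyMMddHHmmss.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    private final Map<String, TreeSet<LocalDateTime>> capturesByUrl = new HashMap<>();

    void add(String url, String timestamp14) {
        capturesByUrl.computeIfAbsent(url, k -> new TreeSet<>())
                .add(LocalDateTime.parse(timestamp14, TS));
    }

    // Returns the timestamp of the capture closest in time to the wanted one,
    // or null if the url has no captures. Ties go to the earlier capture.
    String closest(String url, String wanted14) {
        TreeSet<LocalDateTime> captures = capturesByUrl.get(url);
        if (captures == null || captures.isEmpty()) {
            return null;
        }
        LocalDateTime wanted = LocalDateTime.parse(wanted14, TS);
        LocalDateTime before = captures.floor(wanted);   // latest capture <= wanted
        LocalDateTime after = captures.ceiling(wanted);  // earliest capture >= wanted
        LocalDateTime best;
        if (before == null) {
            best = after;
        } else if (after == null) {
            best = before;
        } else {
            Duration gapBefore = Duration.between(before, wanted);
            Duration gapAfter = Duration.between(wanted, after);
            best = gapBefore.compareTo(gapAfter) <= 0 ? before : after;
        }
        return best.format(TS);
    }
}
```

Because floor() and ceiling() on a TreeSet take time logarithmic in the number of captures, the cost of a lookup no longer grows linearly with the number of harvests of a url.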