...

A wayback AccessPoint requires various elements. The most important is a WaybackCollection, which in turn consists of a ResourceIndex, a ResourceStore, and one or more Shutdownables. The final configuration file is here: wayback_selfcontained.xml.

ResourceIndex

Fortunately, we already had about 6TB of cdx index files available on isilon, covering netarkivet up to early 2011. I set up an rsync process to copy the remaining files over (see http://noerdroid.blogspot.dk/2013/06/getting-from-to-c-via-b-with-ssh.html for more info). The full index is now around 8.3TB. The files are in scape@iapetus:/home/scape/scape-hdfs/csr/

(Copying the files over highlighted a problem with our naming procedure for wayback index files. We use rollover naming, as one does for logfiles. This means that filenames are reused, so when one returns to the copy after a period of time it can be difficult to know which files have already been copied over.)

ResourceStore

This is the main difference from the current production installation of wayback. The entire netarkiv arcrepository is mounted under /home/scape/netarkiv. A ListFactoryBean is defined containing a DirectoryResourceFileSource that points to this directory. The ResourceStore is a LocationDBResourceStore, which has a reference to this ListFactoryBean.

Shutdownables

The Shutdownables to be configured are just those which monitor the ResourceStore for new archive files and add them to the relevant database.

Startup

When wayback is restarted, the ResourceIndex is immediately functional, and it can be tested by searching in the wayback web-UI. This worked out of the box. Clicking on a link to a search result initially failed, simply because the ResourceStore had not had time to build up its initial database of all archive files. After a few hours this was complete. Loading a page in wayback then gave a server error, with a stacktrace in the tomcat logs.

Bugfixing

From the stacktrace it was clear that the arcreader in openwayback was reading the cdx records wrongly, jumping to somewhere in the next record instead of the current record. This presented us with a golden opportunity to begin actively contributing to openwayback. Once I could reproduce the bug and produce a tentative solution, we could begin following openwayback's pull-request policy. The pull-request is at https://github.com/iipc/openwayback/pull/104. After some back-and-forth on the correct way to fix the bug, it turned out that the "right" way to do it didn't work at all because of another bug: webarchive-commons misuses InputStream.skip(). Eventually we got everything working.
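For readers unfamiliar with the InputStream.skip() pitfall: the Java API allows skip(n) to skip fewer than n bytes, so code that calls it once and assumes the stream has advanced by n can end up at the wrong offset. The following is a minimal sketch of the usual defensive loop. The class and method names (SkipFully, skipFully) are hypothetical and are not taken from webarchive-commons or from the pull-request above.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only; not the fix applied in openwayback.
final class SkipFully {
    private SkipFully() {}

    /**
     * Skips exactly n bytes of the stream, calling skip() repeatedly because
     * a single call may skip fewer bytes than requested.
     * Throws EOFException if the stream ends before n bytes were skipped.
     */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() confirms end of stream.
                throw new EOFException(remaining + " bytes left to skip at end of stream");
            } else {
                // skip() made no progress, but read() consumed one byte.
                remaining--;
            }
        }
    }
}
```

A caller that needs to land on a specific record offset would use skipFully(in, offset) in place of a single in.skip(offset).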

Performance

We didn't carry out any systematic performance testing, but in general it seemed to work well. There does, however, seem to be an issue in replaying sites that have many harvests and many objects per page. The problem is that wayback needs to find all matches for a given url and then step through them to find the one whose timestamp best matches the requested time. What's really needed is an indexing system that has a secondary sorting by date.
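To illustrate why a secondary sort by date would help: if the captures of a url are already ordered by time, the closest capture can be found with a binary search instead of stepping through every match. The sketch below assumes capture times are held as a sorted array of epoch seconds. The class and method names (ClosestCapture, closestIndex) are hypothetical, and this is not how the wayback ResourceIndex is implemented.

```java
import java.util.Arrays;

// Hypothetical illustration; capture times are assumed to be epoch seconds,
// sorted in ascending order, all belonging to one url.
final class ClosestCapture {
    private ClosestCapture() {}

    /** Returns the index of the capture time closest to target (earlier one wins a tie). */
    static int closestIndex(long[] sortedTimes, long target) {
        if (sortedTimes.length == 0) {
            throw new IllegalArgumentException("no captures");
        }
        int pos = Arrays.binarySearch(sortedTimes, target);
        if (pos >= 0) {
            return pos; // exact match
        }
        int insertion = -pos - 1; // index of the first element greater than target
        if (insertion == 0) {
            return 0; // target is earlier than every capture
        }
        if (insertion == sortedTimes.length) {
            return sortedTimes.length - 1; // target is later than every capture
        }
        long before = target - sortedTimes[insertion - 1];
        long after = sortedTimes[insertion] - target;
        return before <= after ? insertion - 1 : insertion;
    }
}
```

With this ordering the lookup cost grows logarithmically with the number of harvests of a url, rather than linearly.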
