...

A wayback AccessPoint requires various elements, the most important of which is a WaybackCollection. This in turn consists of a ResourceIndex, a ResourceStore, and one or more Shutdownables. The final configuration file is here: wayback_selfcontained.xml.

ResourceIndex

Fortunately, we already had about 6TB of cdx index files available on isilon, covering netarkivet up to early 2011. I set up an rsync process to copy the remaining files over. (See http://noerdroid.blogspot.dk/2013/06/getting-from-to-c-via-b-with-ssh.html for more info.) The full index is now around 8.3TB. The files are in scape@iapetus:/home/scape/scape-hdfs/csr/

...

When wayback is restarted, the ResourceIndex is immediately functional, and it can be tested by searching in the wayback web UI. This worked out of the box. Clicking on a link in the search results initially failed, simply because the ResourceStore had not yet had time to build its initial database of all archive files. After a few hours this was complete. Loading a page in wayback then gave a server error, with a stack trace in the tomcat logs.

Bugfixing

From the stack trace it was clear that the arcreader in openwayback was reading the cdx records wrongly: it was jumping to somewhere in the next record instead of the current record. This presented us with a golden opportunity to begin actively contributing to openwayback. Once I could reproduce the bug and had a tentative solution, we could begin following openwayback's pull-request policy. The pull request is at https://github.com/iipc/openwayback/pull/104 . After some back-and-forth on the correct way to fix the bug, it turned out that the "right" way to do it didn't work at all because of a second bug: webarchive-commons misuses InputStream.skip(). Eventually we got everything working.
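The exact misuse in webarchive-commons is not described here, but the most common pitfall with InputStream.skip() is assuming that it always skips the number of bytes requested. The Java API allows it to skip fewer bytes, possibly zero, so a reader that ignores the return value can end up at the wrong offset. The following is a minimal sketch of a defensive skip loop. The class and method names (SkipFully, skipFully) are made up for illustration and are not taken from openwayback or webarchive-commons.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, for illustration only.
final class SkipFully {

    private SkipFully() {
    }

    /**
     * Skips exactly n bytes, or throws EOFException if the stream ends first.
     * InputStream.skip() may skip fewer bytes than requested, so we loop
     * on its return value instead of trusting a single call.
     */
    static void skipFully(InputStream in, long n) throws IOException {
        long remaining = n;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped > 0) {
                remaining -= skipped;
            } else if (in.read() == -1) {
                // skip() made no progress and read() confirms end of stream.
                throw new EOFException(remaining + " bytes left to skip at end of stream");
            } else {
                // skip() made no progress, but read() consumed one byte.
                remaining--;
            }
        }
    }
}
```

A caller seeking to a record offset would call SkipFully.skipFully(in, offset) instead of in.skip(offset), and would then either be positioned at exactly that offset or get an exception.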

Performance

We didn't carry out any systematic performance testing, but wayback generally seemed to perform well. The exception seems to be replaying sites that have many harvests and many objects per page. The problem is that wayback needs to find all matches for a given URL and then step through them to find the one whose timestamp best matches the requested time. What's really needed is an indexing system that has a secondary sorting by date.
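To illustrate what a secondary sorting by date would buy, here is a small in-memory sketch. It is not how wayback's ResourceIndex is implemented. The class CaptureIndex, its methods, and the use of epoch seconds as timestamps are all assumptions made for the example. Because the captures for each URL are kept sorted by time, the closest capture can be found by looking at just the two neighbours of the requested time instead of stepping through every match.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical index, for illustration only.
final class CaptureIndex {

    // URL key -> capture times (epoch seconds), kept sorted by the TreeSet.
    private final Map<String, NavigableSet<Long>> captures = new HashMap<>();

    void add(String urlKey, long epochSeconds) {
        captures.computeIfAbsent(urlKey, k -> new TreeSet<>()).add(epochSeconds);
    }

    /**
     * Returns the capture time closest to the wanted time for this URL,
     * or null if the URL has no captures. Ties go to the earlier capture.
     */
    Long closest(String urlKey, long wanted) {
        NavigableSet<Long> times = captures.get(urlKey);
        if (times == null) {
            return null;
        }
        Long before = times.floor(wanted);   // latest capture at or before wanted
        Long after = times.ceiling(wanted);  // earliest capture at or after wanted
        if (before == null) {
            return after;
        }
        if (after == null) {
            return before;
        }
        return (wanted - before <= after - wanted) ? before : after;
    }
}
```

Each lookup costs a logarithmic search within one URL's captures, which matters most for exactly the case described above: URLs with many harvests, requested many times per replayed page.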
