Netarkivet access is currently via a customised version of Internet Archive Wayback 1.8-SNAPSHOT using a flat-file CDX index. I have been looking at ways to modernise the architecture:
- Replacing the index with TinyCDXServer
- Replacing the backend implementation with the latest OpenWayback (2.3.1) overlaid with our own GUI customisations
- Replacing the NetarchiveResourceStore with direct file access via Isilon, using Wayback's own LocationDBResourceStore to resolve file locations
The Index
The main focus is on the index. The current Netarkivet index consists of CDX records like these:
bt.dk/apps/pbcs.dll/misc?dato=20060620&kategori=politik&lopenr=449504&params=itemnr=4/&ref=ph&title=&url=/misc/send.pbs&www=http://www.bt.dk/apps/pbcs.dll/gallery?avis=bt 20080616003123 http://www.bt.dk/apps/pbcs.dll/misc?url=/misc/send.pbs&title=&www=http%3A%2F%2Fwww.bt.dk%2Fapps%2Fpbcs.dll%2Fgallery%3Favis%3Dbt%26dato%3D20060620%26kategori%3Dpolitik%26lopenr%3D449504%26ref%3Dph%26params%3Ditemnr%3D4%2F text/html 200 GO4YNHE7JZXUHJANEZE65ZNFYTI6BSEE - - - 28566396 29578-33-20080616001741-00057-sb-prod-har-004.arc
ing.dk/user.php?location=51604?nocache=1&xoops_redirect=/modules/flexblocks/admin/editpage.php?moduleid=42 20080616003123 http://ing.dk/user.php?xoops_redirect=%2Fmodules%2Fflexblocks%2Fadmin%2Feditpage.php%3Fmoduleid%3D42%26location%3D51604?nocache=1 text/html 200 A3LGTO3M74U3ZENLUHWUQYLGGX2PHAEO - - - 28575249 29578-33-20080616001741-00057-sb-prod-har-004.arc
images.nordjyske.dk/11-2007/a52bd4c8-e0c5-404a-a9f9-f09a4207fdd9_thumb.jpg 20080616003123 http://images.nordjyske.dk/11-2007/a52bd4c8-e0c5-404a-a9f9-f09a4207fdd9_thumb.jpg image/jpeg 200 ULZHXODJN6ZPS6CGMU3PHBNQZSZKE5QH - - - 28585605 29578-33-20080616001741-00057-sb-prod-har-004.arc
berlingske.dk/article/20080613/fritidogforbrug/706130003/berlingske.dk/javascript1.1 20080616003119 http://www.berlingske.dk/article/20080613/fritidogforbrug/706130003/BERLINGSKE.DK/javascript1.1 text/html 200 OYSWG36KZ2LX3BSIVMBG3AHCLLHW5PML - - - 28587590 29578-33-20080616001741-00057-sb-prod-har-004.arc
stored in flat ASCII files - around 20 billion objects in 50 files totalling 8TB. The NetarchiveSuite software is responsible for the sort/merge/rollover functionality for these files, and Wayback has an implementation of fast(ish) binary search in them. But increasingly we're worried that this is becoming a bottleneck (it was already a PITA).
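To see why this gets painful at 8TB, consider what keeping a flat-file CDX index usable actually involves. Roughly the following - an illustration only, not NetarchiveSuite's actual code, and with made-up file names:
export LC_ALL=C                              # CDX files must be sorted bytewise
sort temp.cdx -o temp.cdx                    # sort the newly generated batch
sort -m master.cdx temp.cdx -o merged.cdx    # merge it into the master index...
mv merged.cdx master.cdx                     # ...which means rewriting the whole file
# look(1) binary-searches a sorted file - roughly what Wayback's flat-file
# CDX index implementation does in Java:
look "bt.dk/apps/pbcs.dll/" master.cdx | head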
So in this world of fancy indexing tools and massive, scalable, distributed object and key-value stores, why are we still using ASCII files?
Enter the CDXServer. CDXServer started as a concrete implementation - a REST webservice for querying CDX files - but quickly became an API that others could reimplement as they saw fit.
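As a concrete example of the API, the Internet Archive's public CDX server answers plain HTTP queries and returns matching CDX lines:
curl 'http://web.archive.org/cdx/search/cdx?url=archive.org&limit=3'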
TinyCDXServer is a Java implementation from the National Library of Australia. For a storage back end it uses RocksDB from those nice people at Facebook. The server itself is built directly on java.net.Socket.
Build and install of TinyCDXServer
Clone/download from https://github.com/nla/tinycdxserver .
Modify the pom.xml if necessary to use the NLA's own build of rocksdbjni (because the public maven releases don't support snappy compression). Then build with mvn -DskipTests clean package.
Install libsnappy on your machine, or just download libsnappy.so.1 (e.g. from https://www.rpmfind.net/linux/rpm2html/search.php?query=libsnappy.so.1()(64bit) ) and add it to your library path with e.g. export LD_LIBRARY_PATH=/netarkiv-devel/.
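For reference, the whole build-and-install sequence boils down to something like this (the copy source for libsnappy.so.1 and the /netarkiv-devel/ path are just what was used in this experiment):
git clone https://github.com/nla/tinycdxserver.git
cd tinycdxserver
# edit pom.xml here if you want the NLA build of rocksdbjni
mvn -DskipTests clean package
# make libsnappy.so.1 resolvable at runtime
cp ~/Downloads/libsnappy.so.1 /netarkiv-devel/
export LD_LIBRARY_PATH=/netarkiv-devel/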
Start TinyCDXServer
java -jar ./tinycdxserver-0.3.2.jar -d /netarkiv-devel/cdxdb/ -p 8888
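In practice it is convenient to keep it running in the background and capture its log output - plain nohup, nothing TinyCDXServer-specific:
nohup java -jar ./tinycdxserver-0.3.2.jar -d /netarkiv-devel/cdxdb/ -p 8888 > tinycdx.log 2>&1 &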
Playing with TinyCDXServer
Browse to http://belinda:8888 to see the landing page. You can feed TinyCDXServer by POSTing an unsorted CDX file to a collection, e.g.
curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
then you can see the collection statistics at http://belinda:8888/cdxidx.
A search like http://belinda:8888/cdxidx?q=type:urlquery+url:http://jp.dk gives a structured result set like
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <enddate>20160930064528</enddate>
    <type>urlquery</type>
    <firstreturned>0</firstreturned>
    <url>dk,jp)/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>23174567</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <redirecturl>-</redirecturl>
      <urlkey>dk,jp)/</urlkey>
      <digest>MT6CQWOGIMM2G22TKG2FAYXCEHGX2LA5</digest>
      <httpresponsecode>200</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.jp.dk/</url>
      <capturedate>20060329072234</capturedate>
    </result>
    <result>
      <compressedoffset>23087849</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>4544-10-20060330072105-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <redirecturl>-</redirecturl>
      <urlkey>dk,jp)/</urlkey>
      <digest>3IWDRY7DXMX55WISTJWZ5775KYWNURQN</digest>
      <httpresponsecode>200</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.jp.dk/</url>
      <capturedate>20060330072148</capturedate>
    </result>
    ...
To bulk-feed the index there are two scripts - one for data records and one for deduplication records:
#!/bin/bash
while read datafile; do
  echo indexing $datafile
  java -cp ../webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -cdxURL $datafile > temp.cdx
  cat temp.cdx >> full.cdx
  curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
done < nonmetadata.txt
#!/bin/bash
while read datafile; do
  echo indexing $datafile
  grep "duplicate:" $datafile > metatemp.crawl.log
  java -cp nas/lib/netarchivesuite-wayback-indexer.jar dk.netarkivet.wayback.DeduplicateToCDXApplication metatemp.crawl.log > metatemp.cdx
  curl -X POST --data-binary @./metatemp.cdx http://localhost:8888/cdxidx
  cat metatemp.cdx >> metafull.cdx
done < metadata.txt
where nonmetadata.txt and metadata.txt are just files containing lists of the data files and metadata files to be indexed. Note that indexing of the data (w)arc files uses standard code from the wayback project, while indexing of deduplication events requires locally-developed code from NetarchiveSuite.
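The two lists can be produced however you like. One possible way, assuming the (w)arc files sit under a single directory and the metadata files are recognisable from their names (the /netarkiv/filedir path and the name patterns are placeholders to adjust to the local layout):
find /netarkiv/filedir -type f -name '*.arc*' ! -name '*metadata*' > nonmetadata.txt
find /netarkiv/filedir -type f -name '*metadata*.arc*' > metadata.txt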
Indexing Results
After about 24 hours of indexing, the totals are as follows:
Records indexed | Flat CDX files | TinyCDXServer (RocksDB) |
---|---|---|
150 000 000 | 59GB | 5.3GB |
This is about 1/140 of the whole archive, corresponding pretty well with our expected value for the total CDX file size of 8TB. With TinyCDXServer, the whole index could fit in about 750GB.
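(The arithmetic: 59GB × 140 ≈ 8.3TB for the flat CDX files, and 5.3GB × 140 ≈ 740GB for the RocksDB store.)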
Estimating the indexing time from this exercise is a bit more complicated, because deduplication records are indexed much more quickly than data records, presumably because the metadata files are much smaller. The 150 million records here are only about 1/3 data records and 2/3 deduplication records, but for the archive as a whole the majority of records are data records. (Deduplication is only used for images and other large files, never for text.) So running single-threaded as here, it would take around 300 days to reindex the whole archive! Hopefully we can do better ...
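The obvious first step is to parallelise the feeding. A sketch of what that could look like for the data records - assuming GNU xargs, that TinyCDXServer copes with concurrent POSTs, and that the indexing machine has cores to spare (the worker count of 8 is arbitrary):
#!/bin/bash
# Run the per-file indexing step with 8 parallel workers instead of one.
index_one() {
  f="$1"
  out="$(basename "$f").cdx"
  java -cp ../webarchive-commons-jar-with-dependencies.jar \
      org.archive.extract.ResourceExtractor -cdxURL "$f" > "$out"
  curl -X POST --data-binary @"$out" http://localhost:8888/cdxidx
  rm "$out"
}
export -f index_one
< nonmetadata.txt xargs -d '\n' -n 1 -P 8 bash -c 'index_one "$0"'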