Netarkivet access is currently provided by a customised version of Internet Archive Wayback 1.8-SNAPSHOT using a flat-file CDX index. I have been looking at ways to modernise the architecture:
- Replacing the index with TinyCDXServer
- Replacing the backend implementation with the latest OpenWayback (2.3.1) overlaid with our own GUI customisations
- Replacing the NetarchiveResourceStore with direct file access via Isilon, using Wayback's own LocationDBResourceStore to resolve file locations
The Index
The main focus is on the index. The current Netarkivet index consists of CDX records like
bt.dk/apps/pbcs.dll/misc?dato=20060620&kategori=politik&lopenr=449504&params=itemnr=4/&ref=ph&title=&url=/misc/send.pbs&www=http://www.bt.dk/apps/pbcs.dll/gallery?avis=bt 20080616003123 http://www.bt.dk/apps/pbcs.dll/misc?url=/misc/send.pbs&title=&www=http%3A%2F%2Fwww.bt.dk%2Fapps%2Fpbcs.dll%2Fgallery%3Favis%3Dbt%26dato%3D20060620%26kategori%3Dpolitik%26lopenr%3D449504%26ref%3Dph%26params%3Ditemnr%3D4%2F text/html 200 GO4YNHE7JZXUHJANEZE65ZNFYTI6BSEE - - - 28566396 29578-33-20080616001741-00057-sb-prod-har-004.arc
ing.dk/user.php?location=51604?nocache=1&xoops_redirect=/modules/flexblocks/admin/editpage.php?moduleid=42 20080616003123 http://ing.dk/user.php?xoops_redirect=%2Fmodules%2Fflexblocks%2Fadmin%2Feditpage.php%3Fmoduleid%3D42%26location%3D51604?nocache=1 text/html 200 A3LGTO3M74U3ZENLUHWUQYLGGX2PHAEO - - - 28575249 29578-33-20080616001741-00057-sb-prod-har-004.arc
images.nordjyske.dk/11-2007/a52bd4c8-e0c5-404a-a9f9-f09a4207fdd9_thumb.jpg 20080616003123 http://images.nordjyske.dk/11-2007/a52bd4c8-e0c5-404a-a9f9-f09a4207fdd9_thumb.jpg image/jpeg 200 ULZHXODJN6ZPS6CGMU3PHBNQZSZKE5QH - - - 28585605 29578-33-20080616001741-00057-sb-prod-har-004.arc
berlingske.dk/article/20080613/fritidogforbrug/706130003/berlingske.dk/javascript1.1 20080616003119 http://www.berlingske.dk/article/20080613/fritidogforbrug/706130003/BERLINGSKE.DK/javascript1.1 text/html 200 OYSWG36KZ2LX3BSIVMBG3AHCLLHW5PML - - - 28587590 29578-33-20080616001741-00057-sb-prod-har-004.arc
stored in flat ASCII files: around 20 billion records in 50 files totalling 8TB. The NetarchiveSuite software is responsible for sorting, merging and rolling over these files. Wayback has an implementation of fast(ish) binary search over them, but we are increasingly worried that this is becoming a bottleneck (it was already a PITA).
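To make the record layout and the lookup concrete, here is a minimal Python sketch. It assumes the 11 space-separated fields follow the common CDX layout (URL key, timestamp, original URL, MIME type, status, digest, redirect, robot flags, length, offset, file name) and that the index file is sorted in plain byte order. The names CdxRecord, parse_cdx_line and find_first are made up for illustration; this is not Wayback's actual implementation.

```python
import os
from typing import BinaryIO, NamedTuple, Optional


class CdxRecord(NamedTuple):
    # Field names are an assumption based on the common 11-field CDX layout.
    urlkey: str
    timestamp: str
    original_url: str
    mimetype: str
    status: str
    digest: str
    redirect: str
    robotflags: str
    length: str
    offset: str
    filename: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-separated CDX line into its 11 assumed fields."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    return CdxRecord(*fields)


def _first_line_from(f: BinaryIO, pos: int) -> bytes:
    """Return the first complete line that starts at or after byte offset pos."""
    if pos == 0:
        f.seek(0)
    else:
        # Step back one byte and discard up to the next newline, so that the
        # following readline() begins exactly on a line boundary.
        f.seek(pos - 1)
        f.readline()
    return f.readline()


def find_first(path: str, prefix: bytes) -> Optional[bytes]:
    """Binary-search a byte-sorted file for the first line starting with prefix."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = _first_line_from(f, mid)
            if not line or line >= prefix:
                hi = mid          # the answer starts at or before mid
            else:
                lo = mid + 1      # the answer starts after mid
        line = _first_line_from(f, lo)
    return line if line and line.startswith(prefix) else None
```

Each probe costs one seek and at most two line reads, so a lookup needs on the order of log2(file size) disk accesses regardless of how many records the file holds.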
So in this world of fancy indexing tools and massive, distributable, scalable object/keystores, why are we still using ASCII files?
Enter the CDXServer. CDXServer started as an implementation (a REST web service for querying CDX files) but quickly became an API that others could reimplement as they saw fit.
TinyCDXServer is a Java implementation from the National Library of Australia. For a storage back end it uses RocksDB from those nice people at Facebook. The server itself is built directly on java.net.Socket.
Build and install of TinyCDXServer
Clone/download from https://github.com/nla/tinycdxserver .
Modify the pom.xml if necessary to use the NLA's own build of rocksdbjni (because the public Maven releases don't support snappy compression). Then build with
mvn -DskipTests clean package
Install libsnappy on your machine, or just download libsnappy.so.1 (e.g. from https://www.rpmfind.net/linux/rpm2html/search.php?query=libsnappy.so.1()(64bit) ) and add it to your library path with e.g.
export LD_LIBRARY_PATH=/netarkiv-devel/
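A quick way to check that the dynamic loader can actually find the library is to try loading it. The following Python sketch is not part of TinyCDXServer and the helper name can_load is made up. It should be run from a shell where LD_LIBRARY_PATH has already been exported, and a successful load here is only a hint that the JVM will find the library too.

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can find and load the shared library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    print("libsnappy.so.1 loadable:", can_load("libsnappy.so.1"))
```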
Start TinyCDXServer
java -jar ./tinycdxserver-0.3.2.jar -d /netarkiv-devel/cdxdb/ -p 8888
Playing with TinyCDXServer
Browse to http://belinda:8888 to see the landing page. You can feed TinyCDXServer by posting an unsorted cdx file to a collection, e.g.
curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
then you can see the collection statistics at http://belinda:8888/cdxidx.
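The same upload can be scripted. The sketch below uses only the Python standard library to build the POST request that the curl command sends. The helper names are invented, and the default base URL and collection name are simply taken from the examples on this page.

```python
import urllib.request


def build_cdx_post(cdx_bytes: bytes, collection: str,
                   base_url: str = "http://localhost:8888") -> urllib.request.Request:
    """Build (but do not send) a POST of raw CDX lines to a collection."""
    url = f"{base_url.rstrip('/')}/{collection}"
    return urllib.request.Request(url, data=cdx_bytes, method="POST")


def post_cdx_file(path: str, collection: str,
                  base_url: str = "http://localhost:8888") -> str:
    """Send a CDX file to the server and return the response body as text."""
    # Reads the whole file into memory, which is fine for small test files only.
    with open(path, "rb") as f:
        request = build_cdx_post(f.read(), collection, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running, post_cdx_file("./temp.cdx", "cdxidx") would be the equivalent of the curl call.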
A search like http://belinda:8888/cdxidx?q=type:urlquery+url:http://jp.dk gives a structured result set like
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <enddate>20160930064528</enddate>
    <type>urlquery</type>
    <firstreturned>0</firstreturned>
    <url>dk,jp)/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>23174567</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <redirecturl>-</redirecturl>
      <urlkey>dk,jp)/</urlkey>
      <digest>MT6CQWOGIMM2G22TKG2FAYXCEHGX2LA5</digest>
      <httpresponsecode>200</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.jp.dk/</url>
      <capturedate>20060329072234</capturedate>
    </result>
    <result>
      <compressedoffset>23087849</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>4544-10-20060330072105-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <redirecturl>-</redirecturl>
      <urlkey>dk,jp)/</urlkey>
      <digest>3IWDRY7DXMX55WISTJWZ5775KYWNURQN</digest>
      <httpresponsecode>200</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.jp.dk/</url>
      <capturedate>20060330072148</capturedate>
    </result>
    ....
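For scripted access, the query and its XML response can be handled along these lines. This is a sketch only: the function names are invented, the query syntax is copied from the example search URL, and SAMPLE is the first result from the excerpt with the closing tags added by hand so that it parses.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_query_url(url: str, collection: str = "cdxidx",
                    base_url: str = "http://belinda:8888") -> str:
    """Build a urlquery search URL in the style of the example on this page."""
    query = urllib.parse.urlencode({"q": f"type:urlquery url:{url}"}, safe=":/")
    return f"{base_url}/{collection}?{query}"


def parse_results(xml_bytes: bytes) -> list:
    """Turn each <result> element into a dict of child tag name to text."""
    root = ET.fromstring(xml_bytes)
    return [{child.tag: child.text for child in result}
            for result in root.iter("result")]


def search(url: str, collection: str = "cdxidx",
           base_url: str = "http://belinda:8888") -> list:
    """Query a running server and return the parsed captures."""
    with urllib.request.urlopen(build_query_url(url, collection, base_url)) as response:
        return parse_results(response.read())


# Hand-completed sample (assumption: the real response closes its tags this way).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <results>
    <result>
      <compressedoffset>23174567</compressedoffset>
      <file>4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <urlkey>dk,jp)/</urlkey>
      <url>http://www.jp.dk/</url>
      <capturedate>20060329072234</capturedate>
    </result>
  </results>
</wayback>
"""

if __name__ == "__main__":
    for capture in parse_results(SAMPLE):
        print(capture["capturedate"], capture["file"], capture["compressedoffset"])
```

The file name and compressed offset in each result are what a resource store needs in order to fetch the archived record itself.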