Netarkivet access is currently via a customised version of Internet Archive Wayback 1.8-SNAPSHOT using a flat-file CDX index. I have been looking at ways to modernise the architecture:
- Replacing the index with TinyCDXServer
- Replacing the backend implementation with the latest OpenWayback (2.3.1) overlaid with our own GUI customisations
- Replacing the NetarchiveResourceStore with direct file access via Isilon, using Wayback's own LocationDBResourceStore to resolve file locations
The Index
The main focus is on the index. The current Netarkivet index consists of CDX records like these:
bt.dk/apps/pbcs.dll/misc?dato=20060620&kategori=politik&lopenr=449504&params=itemnr=4/&ref=ph&title=&url=/misc/send.pbs&www=http://www.bt.dk/apps/pbcs.dll/gallery?avis=bt 20080616003123 http://www.bt.dk/apps/pbcs.dll/misc?url=/misc/send.pbs&title=&www=http%3A%2F%2Fwww.bt.dk%2Fapps%2Fpbcs.dll%2Fgallery%3Favis%3Dbt%26dato%3D20060620%26kategori%3Dpolitik%26lopenr%3D449504%26ref%3Dph%26params%3Ditemnr%3D4%2F text/html 200 GO4YNHE7JZXUHJANEZE65ZNFYTI6BSEE - - - 28566396 29578-33-20080616001741-00057-sb-prod-har-004.arc
ing.dk/user.php?location=51604?nocache=1&xoops_redirect=/modules/flexblocks/admin/editpage.php?moduleid=42 20080616003123 http://ing.dk/user.php?xoops_redirect=%2Fmodules%2Fflexblocks%2Fadmin%2Feditpage.php%3Fmoduleid%3D42%26location%3D51604?nocache=1 text/html 200 A3LGTO3M74U3ZENLUHWUQYLGGX2PHAEO - - - 28575249 29578-33-20080616001741-00057-sb-prod-har-004.arc
images.nordjyske.dk/11-2007/a52bd4c8-e0c5-404a-a9f9-f09a4207fdd9_thumb.jpg 20080616003123 http://images.nordjyske.dk/11-2007/a52bd4c8-e0c5-404a-a9f9-f09a4207fdd9_thumb.jpg image/jpeg 200 ULZHXODJN6ZPS6CGMU3PHBNQZSZKE5QH - - - 28585605 29578-33-20080616001741-00057-sb-prod-har-004.arc
berlingske.dk/article/20080613/fritidogforbrug/706130003/berlingske.dk/javascript1.1 20080616003119 http://www.berlingske.dk/article/20080613/fritidogforbrug/706130003/BERLINGSKE.DK/javascript1.1 text/html 200 OYSWG36KZ2LX3BSIVMBG3AHCLLHW5PML - - - 28587590 29578-33-20080616001741-00057-sb-prod-har-004.arc
stored in flat ASCII files - around 20 billion objects in 50 files totalling 8TB. The NetarchiveSuite software is responsible for the sort/merge/rollover functionality for these files, and Wayback has an implementation of fast(ish) binary search in them. But increasingly we're worried that this is becoming a bottleneck (it was already a PITA).
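To see why this gets painful at 8TB, consider what keeping a flat-file CDX index usable actually involves. Roughly the following - an illustration only, not NetarchiveSuite's actual code, and with made-up file names:
export LC_ALL=C                              # CDX files must be sorted bytewise
sort temp.cdx -o temp.cdx                    # sort the newly generated batch
sort -m master.cdx temp.cdx -o merged.cdx    # merge it into the master index...
mv merged.cdx master.cdx                     # ...which means rewriting the whole file
# look(1) binary-searches a sorted file - roughly what Wayback's flat-file
# CDX index implementation does in Java:
look "bt.dk/apps/pbcs.dll/" master.cdx | head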
So in this world of fancy indexing tools and massive, scalable, distributed object and key-value stores, why are we still using ASCII files?
Enter the CDXServer. CDXServer started as a concrete implementation - a REST webservice for querying CDX files - but quickly became an API that others could reimplement as they saw fit.
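As a concrete example of the API, the Internet Archive's public CDX server answers plain HTTP queries and returns matching CDX lines:
curl 'http://web.archive.org/cdx/search/cdx?url=archive.org&limit=3'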
TinyCDXServer is a Java implementation from the National Library of Australia. For a storage back end it uses RocksDB from those nice people at Facebook. The server itself is built directly on java.net.Socket.
Build and install of TinyCDXServer
Clone/download from https://github.com/nla/tinycdxserver .
Modify the pom.xml if necessary to use the NLA's own build of rocksdbjni (because the public maven releases don't support snappy compression). Then build with mvn -DskipTests clean package.
Install libsnappy on your machine, or just download libsnappy.so.1 (e.g. from https://www.rpmfind.net/linux/rpm2html/search.php?query=libsnappy.so.1()(64bit) ) and add it to your library path with e.g. export LD_LIBRARY_PATH=/netarkiv-devel/.
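For reference, the whole build-and-install sequence boils down to something like this (the copy source for libsnappy.so.1 and the /netarkiv-devel/ path are just what was used in this experiment):
git clone https://github.com/nla/tinycdxserver.git
cd tinycdxserver
# edit pom.xml here if you want the NLA build of rocksdbjni
mvn -DskipTests clean package
# make libsnappy.so.1 resolvable at runtime
cp ~/Downloads/libsnappy.so.1 /netarkiv-devel/
export LD_LIBRARY_PATH=/netarkiv-devel/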
Start TinyCDXServer
java -jar ./tinycdxserver-0.3.2.jar -d /netarkiv-devel/cdxdb/ -p 8888
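In practice it is convenient to keep it running in the background and capture its log output - plain nohup, nothing TinyCDXServer-specific:
nohup java -jar ./tinycdxserver-0.3.2.jar -d /netarkiv-devel/cdxdb/ -p 8888 > tinycdx.log 2>&1 &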
Playing with TinyCDXServer
Browse to http://belinda:8888 to see the landing page. You can feed TinyCDXServer by POSTing an unsorted CDX file to a collection, e.g.
curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
then you can see the collection statistics at http://belinda:8888/cdxidx.
A search like http://belinda:8888/cdxidx?q=type:urlquery+url:http://jp.dk gives a structured result set like
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
  <request>
    <startdate>19960101000000</startdate>
    <enddate>20160930064528</enddate>
    <type>urlquery</type>
    <firstreturned>0</firstreturned>
    <url>dk,jp)/</url>
    <resultsrequested>10000</resultsrequested>
    <resultstype>resultstypecapture</resultstype>
  </request>
  <results>
    <result>
      <compressedoffset>23174567</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <redirecturl>-</redirecturl>
      <urlkey>dk,jp)/</urlkey>
      <digest>MT6CQWOGIMM2G22TKG2FAYXCEHGX2LA5</digest>
      <httpresponsecode>200</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.jp.dk/</url>
      <capturedate>20060329072234</capturedate>
    </result>
    <result>
      <compressedoffset>23087849</compressedoffset>
      <mimetype>text/html</mimetype>
      <file>4544-10-20060330072105-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
      <redirecturl>-</redirecturl>
      <urlkey>dk,jp)/</urlkey>
      <digest>3IWDRY7DXMX55WISTJWZ5775KYWNURQN</digest>
      <httpresponsecode>200</httpresponsecode>
      <robotflags>-</robotflags>
      <url>http://www.jp.dk/</url>
      <capturedate>20060330072148</capturedate>
    </result>
    ...
To bulk-feed the index there are two scripts - one for data records and one for deduplication records:
#!/bin/bash
while read datafile; do
  echo indexing $datafile
  java -cp ../webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -cdxURL $datafile > temp.cdx
  cat temp.cdx >> full.cdx
  curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
done < nonmetadata.txt
#!/bin/bash
while read datafile; do
  echo indexing $datafile
  grep "duplicate:" $datafile > metatemp.crawl.log
  java -cp nas/lib/netarchivesuite-wayback-indexer.jar dk.netarkivet.wayback.DeduplicateToCDXApplication metatemp.crawl.log > metatemp.cdx
  curl -X POST --data-binary @./metatemp.cdx http://localhost:8888/cdxidx
  cat metatemp.cdx >> metafull.cdx
done < metadata.txt
where nonmetadata.txt and metadata.txt are just files containing lists of the data files and metadata files to be indexed. Note that indexing of the data (w)arc files uses standard code from the wayback project, while indexing of deduplication events requires locally-developed code from NetarchiveSuite.
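The two lists can be produced however you like. One possible way, assuming the (w)arc files sit under a single directory and the metadata files are recognisable from their names (the /netarkiv/filedir path and the name patterns are placeholders to adjust to the local layout):
find /netarkiv/filedir -type f -name '*.arc*' ! -name '*metadata*' > nonmetadata.txt
find /netarkiv/filedir -type f -name '*metadata*.arc*' > metadata.txt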
Indexing Results
After about 24 hours of indexing, the totals are as follows:
Records indexed | Flat CDX files | TinyCDXServer (RocksDB) |
---|---|---|
150 000 000 | 59GB | 5.3GB |
This is about 1/140 of the whole archive, corresponding pretty well with our expected value for the total CDX file size of 8TB. With TinyCDXServer, the whole index could fit in about 750GB.
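(The arithmetic: 59GB × 140 ≈ 8.3TB for the flat CDX files, and 5.3GB × 140 ≈ 740GB for the RocksDB store.)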
Estimating the indexing time from this exercise is a bit more complicated, because deduplication records are indexed much more quickly than data records, presumably because the metadata files are much smaller. The 150 million records here are only about 1/3 data records and 2/3 deduplication records, but for the archive as a whole the majority of records are data records. (Deduplication is only used for images and other large files, never for text.) So running single-threaded as here, it would take around 300 days to reindex the whole archive! Hopefully we can do better ...
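The obvious first step is to parallelise the feeding. A sketch of what that could look like for the data records - assuming GNU xargs, that TinyCDXServer copes with concurrent POSTs, and that the indexing machine has cores to spare (the worker count of 8 is arbitrary):
#!/bin/bash
# Run the per-file indexing step with 8 parallel workers instead of one.
index_one() {
  f="$1"
  out="$(basename "$f").cdx"
  java -cp ../webarchive-commons-jar-with-dependencies.jar \
      org.archive.extract.ResourceExtractor -cdxURL "$f" > "$out"
  curl -X POST --data-binary @"$out" http://localhost:8888/cdxidx
  rm "$out"
}
export -f index_one
< nonmetadata.txt xargs -d '\n' -n 1 -P 8 bash -c 'index_one "$0"'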