...
The Index
The main focus is on the index. The current netarkivet index consists of CDX records; queried through wayback's urlquery interface, it returns results like
...
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
<request>
<startdate>19960101000000</startdate>
<enddate>20160930064528</enddate>
<type>urlquery</type>
<firstreturned>0</firstreturned>
<url>dk,jp)/</url>
<resultsrequested>10000</resultsrequested>
<resultstype>resultstypecapture</resultstype>
</request>
<results>
<result>
<compressedoffset>23174567</compressedoffset>
<mimetype>text/html</mimetype>
<file>4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
<redirecturl>-</redirecturl>
<urlkey>dk,jp)/</urlkey>
<digest>MT6CQWOGIMM2G22TKG2FAYXCEHGX2LA5</digest>
<httpresponsecode>200</httpresponsecode>
<robotflags>-</robotflags>
<url>http://www.jp.dk/</url>
<capturedate>20060329072234</capturedate>
</result>
<result>
<compressedoffset>23087849</compressedoffset>
<mimetype>text/html</mimetype>
<file>4544-10-20060330072105-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
<redirecturl>-</redirecturl>
<urlkey>dk,jp)/</urlkey>
<digest>3IWDRY7DXMX55WISTJWZ5775KYWNURQN</digest>
<httpresponsecode>200</httpresponsecode>
<robotflags>-</robotflags>
<url>http://www.jp.dk/</url>
<capturedate>20060330072148</capturedate>
</result>
....
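
Each <result> element above corresponds to one CDX line in the underlying index. As a rough illustration only (the field order is an assumption here, following the common urlkey, timestamp, original URL, mimetype, status code, digest, redirect, robot flags, offset, filename layout), the first result would come from a line like

dk,jp)/ 20060329072234 http://www.jp.dk/ text/html 200 MT6CQWOGIMM2G22TKG2FAYXCEHGX2LA5 - - 23174567 4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc
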
To bulk-feed the index there are two scripts: one for data records and one for deduplication records.
#!/bin/bash
# Index data (w)arc files: extract CDX with webarchive-commons, keep a copy
# in full.cdx, and POST each batch to the tinycdxserver collection "cdxidx".
while read -r datafile; do
    echo "indexing $datafile"
    java -cp ../webarchive-commons-jar-with-dependencies.jar \
        org.archive.extract.ResourceExtractor -cdxURL "$datafile" > temp.cdx
    cat temp.cdx >> full.cdx
    curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
done < nonmetadata.txt
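
Once a batch has been posted, it can be sanity-checked by querying the collection back. This is only a sketch: the lookup parameter below is an assumption about the tinycdxserver query API and has not been verified against this installation.

# Hypothetical check that records for www.jp.dk made it into the index;
# the ?url= parameter is an assumption about the server's query API.
curl "http://localhost:8888/cdxidx?url=http://www.jp.dk/"

The second script handles the deduplication events recorded in the crawl logs:
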
#!/bin/bash
# Index deduplication events: pull the "duplicate:" lines out of each
# metadata crawl log, convert them to CDX records with NetarchiveSuite's
# DeduplicateToCDXApplication, POST them to tinycdxserver and keep a copy.
while read -r datafile; do
    echo "indexing $datafile"
    grep "duplicate:" "$datafile" > metatemp.crawl.log
    java -cp nas/lib/netarchivesuite-wayback-indexer.jar \
        dk.netarkivet.wayback.DeduplicateToCDXApplication metatemp.crawl.log > metatemp.cdx
    curl -X POST --data-binary @./metatemp.cdx http://localhost:8888/cdxidx
    cat metatemp.cdx >> metafull.cdx
done < metadata.txt
where nonmetadata.txt and metadata.txt are just files with lists of data and metadata files to be indexed. Note that indexing of data warc-files uses standardised code from the wayback project, while indexing of deduplication events requires locally developed code from NetarchiveSuite.
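
The lists themselves can be produced in whatever way is convenient. A minimal sketch, assuming a purely hypothetical directory layout and filename patterns (adjust both to the local archive):

# Hypothetical paths and patterns: adjust to wherever the (w)arc and
# metadata files actually live.
find /archive/data -type f -name "*.arc*" > nonmetadata.txt
find /archive/metadata -type f -name "*metadata*" > metadata.txt
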
Indexing Results
After about 24 hours, the total indexed is as follows:

Records indexed | Plain CDX file size | TinyCDXServer index size
---|---|---
150 000 000 | 59 GB | 5.3 GB
This is about 1/140 of the whole archive, which corresponds well with our expected value of 8TB for the total CDX file size. With TinyCDXServer, the whole index could fit in about 750GB.
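
As a quick sanity check of that scale-up (assuming the 1/140 sampling factor above):

# Scale the 24-hour sample up by the 1/140 factor.
echo "59 * 140" | bc     # 8260 GB, i.e. roughly the expected 8TB of plain CDX
echo "5.3 * 140" | bc    # 742.0 GB for the TinyCDXServer index
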
Estimating the indexing time from this exercise is a bit more complicated, because deduplication records are indexed much faster than data records, presumably because metadata files are much smaller. The 150 million records here are only 1/3 data records and 2/3 deduplication records, but for the archive as a whole the majority of records are data records. (Deduplication is only used for images and other large files, never for text.) So, running single-threaded as here, it would take around 300 days to reindex the whole archive! Hopefully we can do better ...
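
One obvious place to look for a speed-up is to run several indexing processes at once. The sketch below is untested and goes beyond what was actually run here: it reuses the jar, file list and tinycdxserver endpoint from the data-record script above, drops the full.cdx copy (parallel appends would interleave), and assumes the server copes with concurrent POSTs.

#!/bin/bash
# Untested sketch: index several data files in parallel with xargs -P.
index_one() {
    datafile="$1"
    tmp="$(mktemp --suffix=.cdx)"
    echo "indexing $datafile"
    java -cp ../webarchive-commons-jar-with-dependencies.jar \
        org.archive.extract.ResourceExtractor -cdxURL "$datafile" > "$tmp"
    curl -X POST --data-binary @"$tmp" http://localhost:8888/cdxidx
    rm -f "$tmp"
}
export -f index_one
# Run 8 indexers at a time over the same list of data files.
xargs -n 1 -P 8 bash -c 'index_one "$0"' < nonmetadata.txt
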