...

The Index

The main focus is on the index. The current netarkivet index consists of CDX records like

...

Code Block
<?xml version="1.0" encoding="UTF-8"?>
<wayback>
   <request>
      <startdate>19960101000000</startdate>
      <enddate>20160930064528</enddate>
      <type>urlquery</type>
      <firstreturned>0</firstreturned>
      <url>dk,jp)/</url>
      <resultsrequested>10000</resultsrequested>
      <resultstype>resultstypecapture</resultstype>
   </request>
   <results>
      <result>
         <compressedoffset>23174567</compressedoffset>
         <mimetype>text/html</mimetype>
         <file>4530-10-20060329072121-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
         <redirecturl>-</redirecturl>
         <urlkey>dk,jp)/</urlkey>
         <digest>MT6CQWOGIMM2G22TKG2FAYXCEHGX2LA5</digest>
         <httpresponsecode>200</httpresponsecode>
         <robotflags>-</robotflags>
         <url>http://www.jp.dk/</url>
         <capturedate>20060329072234</capturedate>
      </result>
      <result>
         <compressedoffset>23087849</compressedoffset>
         <mimetype>text/html</mimetype>
         <file>4544-10-20060330072105-00000-sb-prod-har-001.statsbiblioteket.dk.arc</file>
         <redirecturl>-</redirecturl>
         <urlkey>dk,jp)/</urlkey>
         <digest>3IWDRY7DXMX55WISTJWZ5775KYWNURQN</digest>
         <httpresponsecode>200</httpresponsecode>
         <robotflags>-</robotflags>
         <url>http://www.jp.dk/</url>
         <capturedate>20060330072148</capturedate>
      </result>
      ....

To bulk-feed the index there are two scripts: one for data records and one for deduplication records.

Code Block
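#!/bin/bash
# Index every data file listed in nonmetadata.txt and post its CDX records to the CDX server.
while IFS= read -r datafile; do
  echo "indexing $datafile"
  # Extract the CDX records for this file.
  java -cp ../webarchive-commons-jar-with-dependencies.jar org.archive.extract.ResourceExtractor -cdxURL "$datafile" > temp.cdx
  # Keep a cumulative copy of everything indexed so far.
  cat temp.cdx >> full.cdx
  curl -X POST --data-binary @./temp.cdx http://localhost:8888/cdxidx
done < nonmetadata.txt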
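
Code Block
#!/bin/bash
# Index the deduplication records of every metadata file listed in metadata.txt.
while IFS= read -r datafile; do
  echo "indexing $datafile"
  # Keep only the lines that mention a duplicate.
  grep "duplicate:" "$datafile" > metatemp.crawl.log
  java -cp nas/lib/netarchivesuite-wayback-indexer.jar dk.netarkivet.wayback.DeduplicateToCDXApplication metatemp.crawl.log > metatemp.cdx
  curl -X POST --data-binary @./metatemp.cdx http://localhost:8888/cdxidx
  # Keep a cumulative copy of everything indexed so far.
  cat metatemp.cdx >> metafull.cdx
done < metadata.txt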

where nonmetadata.txt and metadata.txt are just files with lists of data and metadata files to be indexed. Note that indexing of data warc-files uses standardised code from the wayback project, while indexing of deduplication events requires locally-developed code from NetarchiveSuite.
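
The two list files could be produced in several ways. The following is only a minimal sketch, assuming that all files live under one directory and that metadata files can be recognised by "-metadata-" in their names. Both the helper name make_file_lists and the naming pattern are assumptions for illustration, not something taken from the setup described here.

```shell
#!/bin/bash
# Hypothetical helper: write metadata.txt and nonmetadata.txt (in the current
# directory) for all ARC/WARC files found under the directory given as $1.
# Assumption: metadata files carry "-metadata-" in their names; adjust the
# pattern if the local naming convention differs.
make_file_lists() {
  local dir=$1
  # Collect all ARC/WARC files (optionally gzipped) in a reproducible order.
  find "$dir" -type f \( -name '*.arc' -o -name '*.warc' \
    -o -name '*.arc.gz' -o -name '*.warc.gz' \) | sort > all-files.txt
  # Split the full list on the (assumed) metadata naming pattern.
  grep -e '-metadata-' all-files.txt > metadata.txt
  grep -v -e '-metadata-' all-files.txt > nonmetadata.txt
  rm -f all-files.txt
}
```

Calling make_file_lists with the top-level data directory as its argument would then leave both lists ready for the two scripts above. Note that the pattern is matched against the full path, so a directory name containing "-metadata-" would be misclassified.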

Indexing Results

After about 24 hours, the total indexed is as follows:

Records        CDX Files    TinyCDXServer
150 000 000    59GB         5.3GB

This is about 1/140 of the whole archive, corresponding pretty well with our expected value for the total CDX file size of 8TB. With TinyCDXServer, the whole index could fit in about 750GB.
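
The extrapolation is simple multiplication of the measured sizes by the factor of 140. As a small sketch of the arithmetic (the function name extrapolate_gb is made up for this illustration):

```shell
#!/bin/bash
# Scale a measured sample size up to the whole archive.
# $1 = size of the sample in GB, $2 = how many such samples make up the archive.
extrapolate_gb() {
  awk -v size="$1" -v factor="$2" 'BEGIN { printf "%.0f\n", size * factor }'
}

echo "CDX files:     $(extrapolate_gb 59 140) GB"    # 8260 GB, i.e. roughly 8TB
echo "TinyCDXServer: $(extrapolate_gb 5.3 140) GB"   # 742 GB, i.e. roughly 750GB
```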

Estimating the indexing time from this exercise is a bit more complicated because deduplication records are indexed much more quickly than data records, presumably because metadata files are much smaller. The 150 million records here are only 1/3 data records and 2/3 deduplication records, but for the archive as a whole the majority of records are data records. (Deduplication is only used for images and other large files, never for text.) So running single-threaded, as here, it would take around 300 days to reindex the archive! Hopefully we can do better ...
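
One obvious thing to try is running several indexing jobs side by side. The following is an untested sketch of how the data-record loop might be parallelised with xargs. It assumes that the CDX server copes with concurrent POSTs, which has not been verified here, and the function names index_one and run_parallel are invented for the example. Unlike the original script it does not accumulate a full.cdx file.

```shell
#!/bin/bash
# Sketch only: index several data files at once instead of one at a time.
# Assumption: the CDX server accepts concurrent POSTs (not verified).

# Index one data file, using a private temp file so workers do not collide.
index_one() {
  local datafile=$1 cdx
  cdx=$(mktemp) || return 1
  echo "indexing $datafile"
  java -cp ../webarchive-commons-jar-with-dependencies.jar \
    org.archive.extract.ResourceExtractor -cdxURL "$datafile" > "$cdx" &&
    curl -X POST --data-binary @"$cdx" http://localhost:8888/cdxidx
  local status=$?
  rm -f "$cdx"
  return "$status"
}

# Run index_one over every file named in the list $2, with $1 parallel workers.
run_parallel() {
  local jobs=$1 list=$2
  export -f index_one   # make the function visible to the bash child processes
  xargs -P "$jobs" -I{} bash -c 'index_one "$1"' _ {} < "$list"
}

# Example call (not executed here): run_parallel 4 nonmetadata.txt
```

Whether this helps in practice depends on where the bottleneck is (reading from storage, the Java extraction, or the CDX server itself), so the number of workers would need to be found by experiment.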

OpenWayback

Installing OpenWayback is now relatively straightforward if it's the 20th time you've configured a wayback instance. We use our own customised wayback overlay from https://github.com/netarchivesuite/netarkivet-openwayback-overlay . Because all the data files are locally mounted on Isilon, we can use wayback's own DirectoryResourceFileSource to resolve file paths, while the ResourceIndex is pointed at the TinyCDXServer:

Code Block
   <property name="resourceIndex">
      <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
        <property name="searchUrlBase" value="http://localhost:8888/cdxidx" />
      </bean>
   </property>

and we're done. See http://belinda.statsbiblioteket.dk:8090/wayback/ . But proxy mode access behaves very weirdly: inconsistent results on reload, missing elements, missing stylesheets etc.

Problems with OpenWayback in Proxy Mode are discussed in the email thread at https://groups.google.com/forum/#!topic/openwayback-dev/6LSJd0872qo but I don't believe that they reach a satisfactory conclusion. It is correct that ProxyHttpsResultURIConverter gives correct behaviour during playback, but it breaks the navigation both in the CalendarView page and in the Toolbar. A possible solution is to hack both Toolbar.jsp and CalendarResults.jsp to use a different UriConverter, for example by modifying the code in each case as follows:

Code Block
// Rebuild the UIResults so that links are generated by a RedirectResultURIConverter.
UIResults results = UIResults.extractReplay(request);
WaybackRequest wbRequest = results.getWbRequest();
RedirectResultURIConverter conv = new RedirectResultURIConverter();
conv.setRedirectURI(results.getReplayPrefix() + "jsp/QueryUI/Redirect.jsp");
// Same request and results as before; only the converter is swapped.
results = new UIResults(wbRequest, conv, results.getCaptureResults(), results.getResult(), results.getResource());

This has been only partially tested.