Estimating the indexing time from this exercise is a bit more complicated because deduplication records are indexed much more quickly than data records, presumably because metadata files are much smaller. Of the 150 million records indexed here, only 1/3 are data records and 2/3 are deduplication records, whereas for the archive as a whole the majority of records are data records. (Deduplication is only used for images and other large files, never for text.) So running single-threaded, as here, it would take around 300 days to reindex the archive! Hopefully we can do better ...
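
The reasoning behind an estimate like this can be sketched as a small calculation that treats the two record types separately. This is only an illustration: the function name, its parameters, and the assumption that throughput scales linearly with the number of threads are ours, and the rates have to be filled in from actual measurements.

```python
SECONDS_PER_DAY = 86_400


def estimate_reindex_days(data_records, dedup_records,
                          data_rate, dedup_rate, threads=1):
    """Estimate the wall-clock days needed to index a mix of record types.

    data_rate and dedup_rate are single-threaded rates in records per second.
    Assumes (optimistically) that throughput scales linearly with threads.
    """
    if data_rate <= 0 or dedup_rate <= 0:
        raise ValueError("rates must be positive")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Each record type contributes its own share of the total time.
    seconds = data_records / data_rate + dedup_records / dedup_rate
    return seconds / threads / SECONDS_PER_DAY
```

Because the archive as a whole has a higher share of data records than this sample, the slower data-record rate dominates the total, which is why the average rate from this run cannot simply be scaled up.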

OpenWayback

Installing OpenWayback is now relatively straightforward if it's the 20th time you've configured a wayback instance. We use our own customised wayback overlay from https://github.com/netarchivesuite/netarkivet-openwayback-overlay . Because all the data files are locally mounted on Isilon, we can use wayback's own DirectoryResourceFileSource to resolve file paths, while the ResourceIndex is pointed at the TinyCDXServer:

   <property name="resourceIndex">
      <bean class="org.archive.wayback.resourceindex.RemoteResourceIndex">
        <property name="searchUrlBase" value="http://localhost:8888/cdxidx" />
      </bean>
   </property>

and we're done. See http://belinda.statsbiblioteket.dk:8090/wayback/ . But proxy mode access behaves very weirdly: inconsistent results on reload, missing elements, missing stylesheets, etc.
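
When chasing this kind of behaviour it can help to query the index directly, bypassing wayback, to see whether the captures for a missing stylesheet or image are in the index at all. The following is a minimal sketch, assuming the TinyCDXServer answers plain `?url=` queries on the same base URL that is configured as `searchUrlBase` above; the helper names are ours.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Same base URL as searchUrlBase in the wayback configuration.
CDX_BASE = "http://localhost:8888/cdxidx"


def cdx_query_url(target_url, base=CDX_BASE):
    """Build the lookup URL for one archived URL (assumes a ?url= interface)."""
    return f"{base}?{urlencode({'url': target_url})}"


def fetch_cdx_lines(target_url, base=CDX_BASE, timeout=10):
    """Return the raw index lines for target_url, one capture per line."""
    with urlopen(cdx_query_url(target_url, base), timeout=timeout) as resp:
        return resp.read().decode("utf-8").splitlines()
```

If `fetch_cdx_lines` returns captures for a resource that wayback fails to show in proxy mode, the problem is more likely in the replay configuration than in the index.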