Tools in the Wayback module

In addition to the tools described here, the NetarchiveSuite Java applications for continuous indexing of an arcrepository are described in the Configuration Manual.

dk.netarkivet.wayback.NetarchiveResourceStore

Wayback is a tool for browsing web archives. It exists in two forks: the Internet Archive version, which can be downloaded from http://archive-access.sourceforge.net/projects/wayback/, and the community-supported OpenWayback version at https://github.com/iipc/openwayback. The NetarchiveSuite plugin for wayback is the class NetarchiveResourceStore, which implements org.archive.wayback.ResourceStore. It opens a connection to a NetarchiveSuite ArcRepository and retrieves archive data through it. It is known to work with both forks of wayback.
To use the plugin, copy the required jar files into the lib directory of your wayback installation, and ensure that wayback has access to a NetarchiveSuite settings file with the necessary connection information.

Configuring wayback to use `NetarchiveResourceStore`:

The lib directory for wayback will be under

wayback/WEB-INF/lib

in your tomcat webapps directory. Copy into this lib directory all the jar files in netarchivesuite/lib except

netarchivesuite-deploy-core.jar
netarchivesuite-harvester-core.jar
netarchivesuite-harvest-scheduler.jar
netarchivesuite-heritrix1-controller.jar
netarchivesuite-heritrix1-extensions.jar
netarchivesuite-heritrix1-frontier.jar
netarchivesuite-monitor-core.jar

and the jar files for the packages wayback, je, jericho, jetty, junit, poi and libidn. These are either not required or are already included in the wayback distribution.

As an alternative to this manual deployment procedure, there are two experimental projects which build a deployable combined wayback/netarchivesuite directly:

https://github.com/netarchivesuite/wayback-netarchivesuite is a fork of Internet Archive's wayback which includes NetarchiveSuite as a maven dependency when the war is built.

https://github.com/netarchivesuite/netarkivet-openwayback-overlay is a maven overlay project which builds a version of openwayback with various Netarkivet customisations, including adding NetarchiveSuite dependencies.

The NetarchiveSuite settings file location can be specified in the catalina.sh file of your tomcat with a line like CATALINA_OPTS="-Ddk.netarkivet.settings.file=/home/user/settings_for_my_repository.xml".
NetarchiveResourceStore has been tested with a wayback localcdxcollection using settings like:

<bean id="localcdxcollection" class="org.archive.wayback.webapp.WaybackCollection">
  <property name="resourceStore">
    <bean class="dk.netarkivet.wayback.NetarchiveResourceStore">
    </bean>
  </property>

  <property name="resourceIndex">
    <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
      <property name="source">
        <bean class="org.archive.wayback.resourceindex.CompositeSearchResultSource">
          <property name="CDXSources">
            <list>
              <value>/home/test/${settings.common.environmentName}/indexDir/wayback_intermediate.index</value>
              <!--<value>/home/test/wayback_cdx_pligt/wayback.index</value>-->
            </list>
          </property>
        </bean>
      </property>
      <property name="maxRecords" value="20000" />
    </bean>
  </property>
</bean>

but should work with other types of wayback collection.

dk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob

This batch job is a wrapper for the parts of the wayback API which generate CDX index files for use in wayback. The job can be called with a script like

java \
  -Ddk.netarkivet.settings.file=../../settings_wayback.xml \
  -Dsettings.common.applicationInstanceId=CDX_BATCH \
  -cp ../lib/netarchivesuite-archive-core.jar \
  dk.netarkivet.archive.tools.RunBatch \
  -Ndk.netarkivet.wayback.batch.ExtractWaybackCDXBatchJob \
  -R'1042-.*(?<!metadata-[0-9]).arc' \
  -BMYREPLICA \
  -Oout.cdx

Note the syntax of the regular expression, which selects all arcfiles generated by job 1042 except for metadata arcfiles. The cdx files generated are unsorted. For use in wayback they must be sorted and merged, e.g. using unix sort:

export LC_ALL=C; sort --temporary-directory=/tmp 1.cdx 2.cdx 3.cdx 4.cdx > sorted.cdx

Our experience is that sorting and merging of files with total size up to 100GB can be accomplished in a few hours on a moderately powerful server machine.
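
Before handing a merged file to wayback it can be worth confirming that it really is in the byte order that LC_ALL=C sort produces. The following is a minimal sketch of such a check, not part of NetarchiveSuite or wayback: the class name CdxOrderCheck and its method are hypothetical. It reads the file as ISO-8859-1 so that each byte maps to exactly one character, which makes String.compareTo equivalent to an unsigned byte comparison.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

/** Hypothetical helper (not part of NetarchiveSuite): checks byte-order sorting of a cdx file. */
class CdxOrderCheck {

    /**
     * Returns the 1-based number of the first line that sorts before its
     * predecessor, or -1 if all lines are in non-decreasing order.
     */
    static long firstUnsortedLine(Iterator<String> lines) {
        String previous = null;
        long lineNumber = 0;
        while (lines.hasNext()) {
            String current = lines.next();
            lineNumber++;
            // With ISO-8859-1 decoding, char order equals unsigned byte order.
            if (previous != null && previous.compareTo(current) > 0) {
                return lineNumber;
            }
            previous = current;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java CdxOrderCheck <file.cdx>");
            return;
        }
        // Stream the file line by line so that large cdx files do not need to fit in memory.
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.ISO_8859_1)) {
            long bad = firstUnsortedLine(reader.lines().iterator());
            System.out.println(bad < 0 ? "sorted" : "out of order at line " + bad);
        }
    }
}
```

It could be run as java CdxOrderCheck sorted.cdx. The unix command sort -c with LC_ALL=C set performs a comparable check.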

dk.netarkivet.wayback.batch.ExtractDeduplicateCDXBatchJob

In NetarchiveSuite, duplicate objects which are not harvested are recorded as extra metadata in the heritrix crawl log. To make these items browsable, the deduplication records need to be indexed. Each deduplication record generates a cdx record whose harvest time is the time at which the duplicate was discovered, but which points to the archive location where the original record is stored. The batch job for this indexing is invoked in exactly the same way as described above for indexing the archived data, except that the regular expression must match only metadata files rather than everything except metadata files; for example

 -R1042-'.*'metadata'.*'arc

As in the above case, the returned cdx files are unsorted.
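
The difference between the two file-selection patterns can be checked with a few lines of Java. This is a sketch under several assumptions: the data pattern is taken to be the negative-lookbehind form 1042-.*(?<!metadata-[0-9]).arc, the metadata pattern is shown as it appears after the shell has removed the quotes, each pattern is assumed to be matched against the whole file name, and the sample file names are purely illustrative. The class ArcFileSelection is hypothetical and not part of NetarchiveSuite.

```java
import java.util.regex.Pattern;

/** Hypothetical demonstration class (not part of NetarchiveSuite). */
class ArcFileSelection {

    // Job 1042 files that do NOT have "metadata-<digit>" directly before ".arc".
    static final Pattern DATA_FILES =
        Pattern.compile("1042-.*(?<!metadata-[0-9]).arc");

    // Job 1042 files whose name contains "metadata".
    static final Pattern METADATA_FILES =
        Pattern.compile("1042-.*metadata.*arc");

    /** True if the pattern matches the complete file name (an assumption, see above). */
    static boolean selects(Pattern pattern, String filename) {
        return pattern.matcher(filename).matches();
    }

    public static void main(String[] args) {
        // Illustrative file names only.
        String[] samples = {
            "1042-117-20100203123456-00001-kb-test.arc",
            "1042-metadata-1.arc",
            "1043-metadata-1.arc"
        };
        for (String name : samples) {
            System.out.println(name
                + " data=" + selects(DATA_FILES, name)
                + " metadata=" + selects(METADATA_FILES, name));
        }
    }
}
```

Under these assumptions each file of job 1042 is selected by exactly one of the two patterns, and files of other jobs by neither.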

dk.netarkivet.wayback.DeduplicateToCDXApplication

This is a command line interface to the same code for generating CDX indexes from deduplication records in crawl log files (not metadata arcfiles). It can be invoked by

java -cp dk.netarkivet.wayback.jar dk.netarkivet.wayback.DeduplicateToCDXApplication crawl1.log crawl2.log crawl3.log > out.cdx

dk.netarkivet.wayback.accesscontrol.RegExpExclusionFilterFactory

This is an alternative to the class StaticMapExclusionFilterFactory supplied with wayback. The class is a spring bean which can be added to any wayback access point in wayback.xml with a specification like

<property name="exclusionFactory">
       <bean class="dk.netarkivet.wayback.accesscontrol.RegExpExclusionFilterFactory" init-method="init">
               <property name="file" value="/home/test/wayback_regexps.txt" />
       </bean>
</property>

The file wayback_regexps.txt is a plain-text file containing a list of Java regular expressions for URLs which are to be blocked in the access point. The regular expressions are applied directly to the original harvested URLs. The file is only read when wayback is initialised. If the file is changed, wayback needs to be reloaded, for example by restarting tomcat.
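
To try out a set of expressions before deploying them, something like the following sketch can be used. It is not the implementation of RegExpExclusionFilterFactory: the class RegExpBlockList is hypothetical, and it assumes one expression per line, that blank lines are ignored, and that an expression must match the whole URL. Those assumptions should be checked against the behaviour of the real filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical test harness for an exclusion list (not the NetarchiveSuite class). */
class RegExpBlockList {

    private final List<Pattern> patterns = new ArrayList<>();

    /** Compiles one Java regular expression per non-blank line (assumed file layout). */
    RegExpBlockList(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    /** True if any expression matches the complete URL (assumed matching rule). */
    boolean isBlocked(String url) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Example expressions; in practice these would be the lines of wayback_regexps.txt.
        RegExpBlockList blockList = new RegExpBlockList(Arrays.asList(
            "http://www\\.example\\.com/private/.*",
            ".*\\.exe"));
        for (String url : new String[] {
                "http://www.example.com/private/a.html",
                "http://www.example.com/public/a.html"}) {
            System.out.println(url + " blocked=" + blockList.isBlocked(url));
        }
    }
}
```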
