Tools in the Harvester Module

dk.netarkivet.harvester.tools.CreateCDXMetadataFile

Given a specific jobID (e.g. 42) and a harvestnamePrefix, this tool creates a metadata-1.warc file containing the CDX entries for all (W)ARC files belonging to that job.

Prerequisites and arguments

You need to specify the repository client used for accessing your archived data. If you use the default client, JMSArcRepositoryClient, you also need to specify the archive replica you will use (defined by the setting "settings.common.useReplicaId"), the environmentName, the applicationName, and the applicationInstanceId. These can all be given on the command line as overrides to the default values, or defined in a local settings.xml file.

Jar files needed in the classpath: netarchivesuite-harvester-core.jar, and netarchivesuite-archive-core.jar (if using the default repository client)

The tool takes two required arguments, for example: --jobID 42 --harvestnamePrefix 42-3

The optional -a or -w argument chooses the metadata format. By default the program writes a metadata WARC file; the -a option makes it write a metadata ARC file instead.

Sample usage of this tool

export INSTALLDIR=/home/test/netarchive
export CLASSPATH=$INSTALLDIR/lib/netarchivesuite-harvester-core.jar
export CLASSPATH=$CLASSPATH:$INSTALLDIR/lib/netarchivesuite-archive-core.jar
export OPTS="-Ddk.netarkivet.settings.file=localsettings.xml -Dlogback.configurationFile=/path/to/logback.xml"

java $OPTS dk.netarkivet.harvester.tools.CreateCDXMetadataFile [-a|-w] --jobID 42 --harvestnamePrefix 42-3
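
As noted under the prerequisites, the required settings can also be passed on the command line instead of through a local settings.xml file. The following is a minimal sketch of that variant, assuming the default repository client. The keys settings.common.useReplicaId and settings.common.environmentName appear elsewhere on this page. The keys for applicationName and applicationInstanceId, and all values, are assumptions for illustration and should be checked against your own settings. The script only prints the command it builds.

```shell
# Sketch only: all values below are hypothetical placeholders.
INSTALLDIR=/home/test/netarchive
CLASSPATH="$INSTALLDIR/lib/netarchivesuite-harvester-core.jar:$INSTALLDIR/lib/netarchivesuite-archive-core.jar"

# Settings overrides; the last two key names are assumed, not confirmed above.
OPTS="-Dsettings.common.useReplicaId=ONE"
OPTS="$OPTS -Dsettings.common.environmentName=TEST"
OPTS="$OPTS -Dsettings.common.applicationName=CreateCDXMetadataFile"
OPTS="$OPTS -Dsettings.common.applicationInstanceId=cdx42"
OPTS="$OPTS -Dlogback.configurationFile=/path/to/logback.xml"

CMD="java -cp $CLASSPATH $OPTS dk.netarkivet.harvester.tools.CreateCDXMetadataFile --jobID 42 --harvestnamePrefix 42-3"

# Print the command for review; run it with: eval "$CMD"
echo "$CMD"
```

Printing the command first makes it easy to review the overrides before running the tool against a real archive.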

dk.netarkivet.harvester.tools.HarvestTemplateApplication

This tool enables you to create (create command), download (download command), update (update command) and show (showall command) the existing templates.

Prerequisites and arguments

You need to point to a settings file with connection information for your harvest database. In a standard NAS deployment, use $INSTALLDIR/conf/settings_GUIApplication.xml.

Sample usage of this tool

export INSTALLDIR=/home/test/netarchive
export CLASSPATH=$INSTALLDIR/lib/netarchivesuite-harvester-core.jar
export OPTS="-Ddk.netarkivet.settings.file=$INSTALLDIR/conf/settings_GUIApplication.xml -Dlogback.configurationFile=/path/to/logback.xml"
 
java $OPTS dk.netarkivet.harvester.tools.HarvestTemplateApplication <command> <args>

The different <command> <args> possibilities:

create <template-name> <xml-file for this template>
download [<template-name>]*
update <template-name> <xml-file to replace this template>
showall

Note that with the download command, you can either download all templates in one go (with no args), or list the names of the templates to download (separated by spaces).
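
To make the command forms above concrete, here is a small sketch with one invocation per command. The template names and XML file paths are hypothetical. The helper function only prints each command line rather than running it, so it can be tried without a harvest database.

```shell
# Sketch only: template names and file paths are hypothetical.
INSTALLDIR=/home/test/netarchive
TOOL=dk.netarkivet.harvester.tools.HarvestTemplateApplication

# Prints the command line for one subcommand; remove "echo" to run it for real.
template_tool() {
    echo java -cp "$INSTALLDIR/lib/netarchivesuite-harvester-core.jar" \
        "-Ddk.netarkivet.settings.file=$INSTALLDIR/conf/settings_GUIApplication.xml" \
        "$TOOL" "$@"
}

template_tool showall                                      # list existing templates
template_tool download                                     # download all templates
template_tool download my_template other_template          # download selected templates
template_tool create my_template /tmp/my_template.xml      # add a new template
template_tool update my_template /tmp/my_template_v2.xml   # replace an existing one
```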

dk.netarkivet.harvester.tools.HarvestdatabaseUpdateApplication

This tool enables you to update the tables in the harvest database to the versions required by this release of NetarchiveSuite. It should be run after installing the software, but before starting the NetarchiveSuite applications.

Prerequisites and arguments

You need to point to a settings file with connection information for your harvest database. In a standard NAS deployment, use $INSTALLDIR/conf/settings_GUIApplication.xml.

The harvest database must also be running.

Sample usage of this tool

First, the harvestdatabase is started, if it isn't up and running already.

Then the update tool is executed (in the example below, Derby is used as the database; if another database is used, replace derbyclient.jar with the JDBC driver jar for that database):

export INSTALLDIR=/home/test/netarchive
export CLASSPATH=$INSTALLDIR/lib/netarchivesuite-harvester-core.jar:$INSTALLDIR/lib/derbyclient.jar
export OPTS="-Dlogback.configurationFile=/path/to/logback.xml -Ddk.netarkivet.settings.file=$INSTALLDIR/conf/settings_GUIApplication.xml"
java $OPTS dk.netarkivet.harvester.tools.HarvestdatabaseUpdateApplication

Finally, the harvest database is shut down, if you're using Derby as the database.

dk.netarkivet.harvester.tools.test.SendDedupIndexRequestToIndexServer

This tool creates a deduplication index from a list of job IDs. This can be useful for various testing purposes, but also for pre-caching an index before, say, starting a planned broad crawl. The command takes a single input parameter: the path to a file containing a list of job numbers to be used in the indexing, for example all the jobs from a previous broad crawl.

Usage

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/netarchivesuite-harvester-core.jar
export OPTS="-cp $CLASSPATH -Ddk.netarkivet.settings.file=$INSTALLDIR/conf/settings_....xml -Dlogback.configurationFile=/path/to/logback.xml"

java $OPTS dk.netarkivet.harvester.tools.test.SendDedupIndexRequestToIndexServer <input_file.txt>
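
The input file can be prepared with a few lines of shell. The sketch below assumes the file holds one job ID per line. The text above does not state the exact format, so verify it before relying on this. The job IDs are hypothetical.

```shell
# Sketch only: assumes one job ID per line, which is not confirmed above.
JOBFILE=$(mktemp)
for id in 101 102 103; do   # hypothetical job IDs
    echo "$id"
done > "$JOBFILE"

# Sanity check: report any line that is not a plain non-negative integer.
if grep -qvE '^[0-9]+$' "$JOBFILE"; then
    echo "invalid job list: $JOBFILE" >&2
else
    echo "job list ready: $JOBFILE ($(wc -l < "$JOBFILE" | tr -d ' ') jobs)"
fi
```

The resulting path in JOBFILE would then be passed as the <input_file.txt> argument.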

dk.netarkivet.harvester.tools.CreateIndex

This tool forces the IndexServer to create indices preemptively. It can be used to retrieve logs and CDX files for previously completed harvest jobs before they are actually needed, which can reduce the time it takes to generate deduplication indices.

Prerequisites

You need to have an IndexServerApplication online. If you use HTTP as the file transport method, you probably also need to override settings.common.remoteFile.port to avoid conflicts (in the example below, the port number is set to 5000).

Furthermore, all harvest jobs referred to in the CreateIndex command must have metadata-1.arc files stored in the archive.

Usage

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/netarchivesuite-harvester-core.jar
export OPTS="-Dsettings.common.cacheDir=/tmp/cache \
-Dsettings.common.environmentName=QUICKSTART -Dsettings.common.remoteFile.port=5000 -Dlogback.configurationFile=/path/to/logback.xml"
java $OPTS dk.netarkivet.harvester.tools.CreateIndex -t dedup -l 1,2
# press Ctrl-C to terminate the tool

This requests a deduplication index based on the harvest jobs with IDs 1 and 2, and stores the index in /tmp/cache/DEDUP_CRAWL_LOG/1-2-cache.
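
When scripting around this tool, it can help to compute where the index is expected to appear. The sketch below simply reproduces the naming seen in the example above (job list 1,2 gives 1-2-cache under DEDUP_CRAWL_LOG). The function name is hypothetical, and whether the same pattern holds for longer job lists is not covered here and should be checked.

```shell
# Sketch only: mirrors the single example above; other cases are unverified.
expected_cache_dir() {
    cache_dir=$1   # value of settings.common.cacheDir
    job_list=$2    # comma-separated job IDs, as passed to -l
    # Join the job IDs with "-" and append the "-cache" suffix.
    echo "$cache_dir/DEDUP_CRAWL_LOG/$(echo "$job_list" | tr ',' '-')-cache"
}

expected_cache_dir /tmp/cache 1,2   # prints /tmp/cache/DEDUP_CRAWL_LOG/1-2-cache
```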

dk.netarkivet.harvester.tools.HarvestDatabaseValidator

This tool tests the database settings for the harvestdatabase.

Usage

export INSTALLDIR=/fullpath/to/installdir
export CLASSPATH=$INSTALLDIR/lib/netarchivesuite-harvester-core.jar
export OPTS=-Ddk.netarkivet.settings.file=$INSTALLDIR/conf/settings_GUIApplication.xml
java $OPTS dk.netarkivet.harvester.tools.HarvestDatabaseValidator
# or
java dk.netarkivet.harvester.tools.HarvestDatabaseValidator /fullpath/to/settings.xml

The program prints either "Database accessTest was successful" or "Database accessTest was Not successful". In the latter case it also prints diagnostics describing the problem.
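
In a deployment script, the result can be checked by matching on the success message. The sketch below assumes only the two messages quoted above. It does not rely on the tool's exit code, which is not documented here. The function name and the sample output string are illustrative.

```shell
# Sketch only: classifies validator output by the success message quoted above.
validator_succeeded() {
    # $1: captured output of HarvestDatabaseValidator
    printf '%s\n' "$1" | grep -qF "Database accessTest was successful"
}

# Real use (not run here):
#   output=$(java $OPTS dk.netarkivet.harvester.tools.HarvestDatabaseValidator 2>&1)
output="Database accessTest was successful"   # sample text for illustration

if validator_succeeded "$output"; then
    echo "database settings OK"
else
    echo "database settings FAILED" >&2
fi
```

The failure message does not contain the exact success phrase, so a fixed-string match is enough to tell the two apart.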