Runs a command-line tool several times and checks that the tool's output is correct and that the appropriate bit archives have been queried. Note that the number of files mentioned in the output is not necessarily exact.

Setup

  1. Log in to a bitarchive server, e.g. the index server (kb-test-acs-001.kb.dk)

    Code Block
    $ ssh kbdevel@kb-test-acs-001
    $ export TESTX=TEST11A
    $ cd ${TESTX}


  2. Create a directory for the batch programs:

    Code Block
    $ mkdir batchprogs
    $ scp test@kb-prod-udv-001.kb.dk:/home/test/test-batch/* batchprogs/

    and copy contents of /home/devel/backup_TEST11A/batchprogs to /home/devel/TEST11A/

    Code Block
    cp -av /home/devel/backup_TEST11A/batchprogs .


Run batch programs on archived files

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ChecksumJob.class -R'.*\.arc' -Ooutput.checksum

This should produce a file output.checksum with content something like:

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R'.*\.arc' -Ooutput.evil

This should produce a file output.evil with:

...

This is run on the KBN replica, which has four bitapps (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written four times.

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R'.*\.arc' -Ooutput.evil.sb -BSBN

...

This is run on the SBN replica, which has only one bitapp (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written once.
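The expected counts for the two replicas can be checked mechanically. The following is a minimal sketch, assuming that each bitarchive application writes 'Legal' on a line of its own; the helper name count_lines_with is hypothetical and not part of NetarchiveSuite.

```shell
# Count the lines in a file that contain a given fixed string.
# Assumption: each bitarchive application writes 'Legal' on its own line.
count_lines_with() {
    # grep -c prints the number of matching lines; -F treats the pattern
    # as a fixed string. grep exits non-zero when nothing matches, so the
    # exit status is masked with '|| true'.
    grep -c -F -- "$1" "$2" || true
}

# Example usage after the two EvilBatch runs:
#   [ "$(count_lines_with Legal output.evil)" -eq 4 ]    && echo "KBN ok"
#   [ "$(count_lines_with Legal output.evil.sb)" -eq 1 ] && echo "SBN ok"
```

If the replicas are deployed with a different number of bitapps, the expected counts change accordingly.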

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R'.*\.arc' -Eerror.evil2

Since the batch program is supposed to fail, nothing should be written to stdout, only to stderr. It is therefore stderr that should be captured, which can be done with the -E argument.
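One way to confirm this is to capture stdout as well and compare the two files. This is only a sketch: the file name stdout.evil2 and the helper only_stderr_written are hypothetical, and it assumes the tool prints no status messages of its own to stdout.

```shell
# Succeeds only if the stdout capture is empty (or absent) and the
# stderr capture exists and is non-empty.
only_stderr_written() {
    stdout_file=$1
    stderr_file=$2
    # -s is true when the file exists and has a size greater than zero.
    [ ! -s "$stdout_file" ] && [ -s "$stderr_file" ]
}

# Example usage: append '> stdout.evil2' to the RunBatch command above
# (stdout.evil2 is a hypothetical file name), then run:
#   only_stderr_written stdout.evil2 error.evil2 && echo "as expected"
```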

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R'.*\.arc' -Eerror.exception -Ooutput.exception

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R'.*\.arc' -Ooutput.exception.finish

...

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output

The output in the cdx.output file should be well-formed CDX records, with each line starting with a canonicalized URL, like these:

...

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.warc' -BSBN -Ocdx.warc.output

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R'.*\.arc'

The following text message should be written to the console:

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R'.*\.arc' -Ooutput.simple

The output file should be quite large compared to the previous output files.
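The size comparison can be done by listing the output files by size. The sketch below assumes that all earlier output files are still in the current directory; the helper name largest_file is hypothetical.

```shell
# Print the name of the largest of the given files.
largest_file() {
    # ls -S sorts by size, largest first; head keeps only the first entry.
    ls -S "$@" | head -n 1
}

# Example usage after the SimpleArcBatchJob run:
#   [ "$(largest_file output.*)" = "output.simple" ] && echo "output.simple is the largest"
```

Running ls -lS output.* by hand gives the same information together with the actual sizes.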

...

Code Block
org.archive.io.arc.ARCRecord@f16070 available: q:\bitarkiv\JOLF\filedir\1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc: {ip-address=0.0.0.0, content-type=text/plain, absolute-offset=0, subject-uri=filedesc://1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc.open, length=1343, creation-date=20090316092641, version=1.1}

...

Furthermore, a list of failed files should be printed to stdout (all of them WARC files).

Running methods from jar files, including Jhove based methods

Before running any Jhove methods, run the following command:

Code Block
cp -av /home/devel/backup_TEST11A/lib_jhove/* /home/devel/TEST11A/lib

eu.planets.batch.jar -> eu.planets.JhoveArcJob 

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R'.*\.arc' -Ooutput.jhove.arc

The output.jhove.arc file should be in the following format:

Code Block
0,HTML,null,text/html
null,null,null,image/png
null,null,null,text/css
0,GIF,null,image/gif
0,JPEG,null,image/jpeg
.....
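The format can be verified without reading the whole file. The following sketch assumes that every record has exactly four comma-separated fields, as in the sample above, and that no field contains a comma itself; the final '.....' line of the sample is an elision, not data. The helper names are hypothetical.

```shell
# Exit successfully only if every line has exactly four comma-separated fields.
# Assumption: none of the four values contains a comma itself.
has_four_fields() {
    awk -F, 'NF != 4 { bad = 1 } END { exit bad }' "$1"
}

# Summarise how often each MIME type (the fourth field) occurs,
# most frequent first.
mime_type_counts() {
    cut -d, -f4 "$1" | sort | uniq -c | sort -rn
}

# Example usage after the JhoveArcJob run:
#   has_four_fields output.jhove.arc && mime_type_counts output.jhove.arc
```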

Furthermore, a list of failed files should be printed to stdout (all of them WARC files).

Check the content of metadata files and 'content' 

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R'.*\.arc' -Ooutput.copy.class

...

Check the size of the output file (run the command du -h output.copy.class): 

59M       output.copy.class

...

output.copy.class

This value should be approximately the same as the combined size of all the harvests.
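To compare the two sizes in bytes rather than by eye, something like the following can be used. This is a sketch only: the helper name total_bytes is hypothetical, and the location of the harvested ARC files depends on the deployment, so the path in the usage comment is a placeholder.

```shell
# Print the combined size in bytes of the given files.
total_bytes() {
    # wc -c counts bytes; tr strips the padding some wc implementations add.
    cat -- "$@" | wc -c | tr -d ' '
}

# Example usage (replace <filedir> with the bitarchive file directory):
#   total_bytes output.copy.class
#   total_bytes <filedir>/*.arc
```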

Furthermore, a list of failed files should be printed to stdout (all of them WARC files).

Test a WARC Batch Job

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output

Check that the output file (cdx.warc.all.output) has a significant amount of content.
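What counts as a significant amount is not defined here, so the check below takes the minimum number of lines as a parameter. It is a minimal sketch; the helper name has_at_least_lines and the threshold in the usage comment are assumptions.

```shell
# Succeed if the file exists and has at least the given number of lines.
has_at_least_lines() {
    file=$1
    min=$2
    [ -f "$file" ] || return 1
    # tr strips the padding some wc implementations add.
    lines=$(wc -l < "$file" | tr -d ' ')
    [ "$lines" -ge "$min" ]
}

# Example usage (1000 is an arbitrary threshold, adjust to the test data):
#   has_at_least_lines cdx.warc.all.output 1000 && echo "ok"
```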

eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'

...

Run

 

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN

 

This should give the following output in the console:

...

Run

 

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN

 

The regular expression should match every file except the metadata files, since the metadata files do not contain the sequence '.dk' in their names. This means that the job handles all the other files, the 'content' files.
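The effect of the two regular expressions can be tried out on sample file names before running the batch jobs. This is a sketch under two assumptions: that the -R pattern is applied to the complete file name (mimicked here with ^ and $), and that metadata files are named like the hypothetical '1-metadata-1.arc'. The content file name is taken from the sample output earlier on this page.

```shell
# Succeed if the whole file name matches the given extended regular expression.
# Assumption: the -R pattern must match the complete file name.
name_matches() {
    printf '%s\n' "$2" | grep -E -q -- "^($1)\$"
}

content_regex='.*.dk.*\.arc'
metadata_regex='.*-metadata-.*\.arc'
content_file='1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc'
metadata_file='1-metadata-1.arc'   # hypothetical metadata file name

# Example usage:
#   name_matches "$content_regex" "$content_file"  && echo "content file selected"
#   name_matches "$content_regex" "$metadata_file" || echo "metadata file skipped"
```

Note that the '.' before 'dk' is not escaped, so it matches any single character and not only a literal dot.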

...

Run

 

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=PROPS -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SystemReaderJob.class -Ooutput.system -Eerror.system -R'.*.dk.*' -BSBN

...