...
Excerpt |
---|
Runs a command-line tool several times and checks that its output is correct and that the appropriate bit archives have been queried. Note that the number of files mentioned in the output is not necessarily exact. |
Setup
Log in to a bitarchive server, e.g. the index server (kb-test-acs-001.kb.dk):
Code Block |
---|
$ ssh kbdevel@kb-test-acs-001
$ export TESTX=TEST11A
$ cd ${TESTX} |
Create a directory for the batch programs by copying the contents of /home/devel/backup_TEST11A/batchprogs to /home/devel/TEST11A/:
Code Block |
---|
cp -av /home/devel/backup_TEST11A/batchprogs . |
Run batch programs on archived files
...
This is run on the KBN replica, which has four bitapps (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written four times.
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil.sb -BSBN |
...
This is run on the SBN replica, which has only one bitapp (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written once.
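The number of times 'Legal' was written can be counted rather than read off by eye. The following is a minimal sketch, assuming the sentence ends up in the file given with -O (for the SBN run above, output.evil.sb); if it is written elsewhere, point the helper at that file instead. The function name count_legal is hypothetical.

```shell
# Hypothetical helper: print the number of lines in a file that contain 'Legal'.
# Assumption: the batch job writes the sentence to the file passed as $1.
count_legal() {
    grep -c 'Legal' "$1"
}
```

For example, `count_legal output.evil.sb` would be expected to print 1 for the SBN replica, and the corresponding call on the KBN output would be expected to print 4.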
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R.*\.arc -Eerror.evil2 |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*\.arc -Eerror.exception -Ooutput.exception |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*\.arc -Ooutput.exception.finish |
...
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output |
The cdx.output file should contain well-formed CDX records, with each line starting with a canonicalized URL, like these:
...
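A quick structural check of cdx.output can complement the visual inspection. The sketch below only flags lines that are empty or have fewer whitespace-separated fields than a chosen minimum; it does not validate the URL canonicalization. The function name short_cdx_lines and the default minimum of 2 are assumptions, so set the minimum to the field count seen in the sample records.

```shell
# Hypothetical helper: print "lineno: line" for every line of a CDX file
# that has fewer than a given number of whitespace-separated fields.
# $1 = file to check, $2 = minimum field count (assumed default: 2).
short_cdx_lines() {
    file=$1
    min_fields=${2:-2}
    awk -v min="$min_fields" 'NF < min { print FNR ": " $0 }' "$file"
}
```

Running `short_cdx_lines cdx.output` should print nothing if every line has at least the minimum number of fields.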
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.warc' -BSBN -Ocdx.warc.output |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*\.arc |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*\.arc -Ooutput.simple |
...
Code Block |
---|
org.archive.io.arc.ARCRecord@f16070 available: q:\bitarkiv\JOLF\filedir\1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc: {ip-address=0.0.0.0, content-type=text/plain, absolute-offset=0, subject-uri=filedesc://1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc.open, length=1343, creation-date=20090316092641, version=1.1} |
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
Running methods from jar files, including Jhove based methods
Before running any Jhove methods, run the following command:
Code Block |
---|
cp -av /home/devel/backup_TEST11A/lib_jhove/* /home/devel/TEST11A/lib |
eu.planets.batch.jar -> eu.planets.JhoveArcJob
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R.*\.arc -Ooutput.jhove.arc |
...
Code Block |
---|
0,HTML,null,text/html
null,null,null,image/png
null,null,null,text/css
0,GIF,null,image/gif
0,JPEG,null,image/jpeg
..... |
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
Check the content of metadata files and 'content'
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*\.arc -Ooutput.copy.class |
...
This value should be approximately the same as the combined size of all the harvests.
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
Test a WARC Batch Job
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output |
Check that the output file (cdx.warc.all.output) has a significant amount of content.
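What counts as "significant" is left to the tester, but the check itself can be scripted. The sketch below succeeds only if the file is non-empty and has at least a given number of lines; the function name has_content and the default threshold of 1 line are assumptions, not values taken from this test description.

```shell
# Hypothetical helper: succeed if the file exists, is non-empty and has
# at least a minimum number of lines.
# $1 = file to check, $2 = minimum line count (assumed default: 1).
has_content() {
    file=$1
    min_lines=${2:-1}
    [ -s "$file" ] || return 1
    # grep -c '' counts lines and prints a bare integer.
    lines=$(grep -c '' "$file")
    [ "$lines" -ge "$min_lines" ]
}
```

For example, `has_content cdx.warc.all.output 100 && echo OK` prints OK only if the output file holds at least 100 lines; the threshold should be chosen to fit the size of the test harvests.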
eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=PROPS -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SystemReaderJob.class -Ooutput.system -Eerror.system -R'.*.dk.*' -BSBN |
...