...
Excerpt |
---|
Uses a command-line tool several times and checks that the output from the tool is correct and that the appropriate bit archives have been queried. Note that the number of files mentioned in the output is not necessarily exact. |
Setup
Log in to a bitarchive server, e.g. the index server (kb-test-acs-001.kb.dk):
Code Block |
---|
$ ssh kbdevel@kb-test-acs-001
$ export TESTX=TEST11A
$ cd ${TESTX} |
Create a directory for the batch programs:
Code Block |
---|
$ mkdir batchprogs
$ scp test@kb-prod-udv-001.kb.dk:/home/test/test-batch/* batchprogs/ |
and copy the contents of /home/devel/backup_TEST11A/batchprogs to /home/devel/TEST11A/:
Code Block |
---|
cp -av /home/devel/backup_TEST11A/batchprogs . |
Run batch programs on archived files
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ChecksumJob.class -R.*\.arc -Ooutput.checksum |
This should produce a file output.checksum whose content should be something like:
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil |
This should produce a file output.evil with:
...
This is run on the KBN replica, which has four bitapps (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written four times.
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil.sb -BSBN |
...
This is run on the SBN replica, which has only one bitapp (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written once.
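The 'Legal' counts for the two replicas (four for KBN, one for SBN) can be checked mechanically. The following is a minimal sketch: count_lines_with is a hypothetical helper, not part of NetarchiveSuite, and it assumes that each occurrence of the sentence sits on its own line of the output file.

```shell
# Hypothetical helper: print the number of lines in FILE that contain WORD.
count_lines_with() {
  word=$1
  file=$2
  # grep -c prints 0 but exits non-zero when nothing matches;
  # "|| true" keeps the helper usable in scripts running under "set -e".
  grep -c -- "$word" "$file" || true
}

# Usage, assuming the two runs above produced these files:
#   [ "$(count_lines_with Legal output.evil)" -eq 4 ]    && echo "KBN ok"
#   [ "$(count_lines_with Legal output.evil.sb)" -eq 1 ] && echo "SBN ok"
```

The match is case-sensitive, so a line containing only 'Illegal' is not counted.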
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R.*\.arc -Eerror.evil2 |
Since the batch program is supposed to fail, nothing should be written to stdout, only to stderr. It is therefore stderr that should be captured, which can be done with the -E argument.
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*\.arc -Eerror.exception -Ooutput.exception |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*\.arc -Ooutput.exception.finish |
...
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output |
The output in the cdx.output file should consist of well-formed CDX records, with each line starting with a canonicalized URL, like these:
...
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.warc' -BSBN -Ocdx.warc.output |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*\.arc |
The following text message should be written in the console:
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*\.arc -Ooutput.simple |
The output file should be quite large compared to the previous output files.
...
Code Block |
---|
org.archive.io.arc.ARCRecord@f16070 available: q:\bitarkiv\JOLF\filedir\1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc: {ip-address=0.0.0.0, content-type=text/plain, absolute-offset=0, subject-uri=filedesc://1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc.open, length=1343, creation-date=20090316092641, version=1.1} |
...
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
Running methods from jar files, including Jhove based methods
Before running any Jhove methods, run the following command:
Code Block |
---|
scp -r test@kb-prod-udv-001.kb.dk:lib/jhove lib/.
cp -av /home/devel/backup_TEST11A/lib_jhove/* /home/devel/TEST11A/lib |
eu.planets.batch.jar -> eu.planets.JhoveArcJob
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R.*\.arc -Ooutput.jhove.arc |
The output.jhove.arc file should be in the following format:
Code Block |
---|
0,HTML,null,text/html
null,null,null,image/png
null,null,null,text/css
0,GIF,null,image/gif
0,JPEG,null,image/jpeg
..... |
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
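The four-field format of output.jhove.arc shown above can be checked with a short script. This is a sketch only: malformed_jhove_lines is a hypothetical helper, and it assumes that no field value itself contains a comma.

```shell
# Hypothetical helper: print every line of FILE that does not consist of
# exactly four comma-separated fields, prefixed with its line number.
malformed_jhove_lines() {
  awk -F',' 'NF != 4 { print NR ": " $0 }' "$1"
}

# Usage, assuming output.jhove.arc exists in the current directory:
#   bad=$(malformed_jhove_lines output.jhove.arc)
#   if [ -z "$bad" ]; then echo "format ok"; else printf '%s\n' "$bad"; fi
```

Any line the helper prints deserves a manual look before the step is marked as failed.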
Check the content of metadata files and 'content'
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*\.arc -Ooutput.copy.class |
...
Check the size of the output file (run the command du -h output.copy.class):
59M output.copy.class
...
output.copy.class
This value should be approximately the same as the combined size of all the harvests.
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
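The size comparison described above can be sketched in shell. Both helpers below are hypothetical, and so are the HARVEST_DIR placeholder and the 10 percent tolerance; the sketch also assumes the harvested ARC files are readable from the machine where the check runs, which may not hold when they are spread over several bitarchive machines.

```shell
# Hypothetical helper: print the combined size in bytes of all files given.
combined_bytes() {
  total=0
  for f in "$@"; do
    size=$(wc -c < "$f")
    total=$((total + size))
  done
  echo "$total"
}

# Hypothetical helper: succeed if A and B differ by at most PCT percent of B.
roughly_equal() {
  a=$1
  b=$2
  pct=$3
  diff=$((a - b))
  if [ "$diff" -lt 0 ]; then
    diff=$((0 - diff))
  fi
  [ $((diff * 100)) -le $((b * pct)) ]
}

# Usage (HARVEST_DIR and the tolerance of 10 are placeholders):
#   out=$(combined_bytes output.copy.class)
#   src=$(combined_bytes "$HARVEST_DIR"/*.arc)
#   roughly_equal "$out" "$src" 10 && echo "size ok"
```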
Test a WARC Batch Job
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output |
Check that the output file (cdx.warc.all.output) has a significant amount of content.
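What counts as "significant" depends on the size of the test harvests. One possible way to make the check repeatable is a line-count threshold; has_at_least_lines is a hypothetical helper and the threshold in the usage line is an arbitrary placeholder.

```shell
# Hypothetical helper: succeed if FILE exists and has at least MIN lines.
has_at_least_lines() {
  file=$1
  min=$2
  [ -f "$file" ] || return 1
  # The arithmetic expansion strips the padding some wc implementations add.
  lines=$(($(wc -l < "$file")))
  [ "$lines" -ge "$min" ]
}

# Usage (1000 is a placeholder threshold, not a value from the test plan):
#   has_at_least_lines cdx.warc.all.output 1000 && echo "content ok"
```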
eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN |
This should give the following output in the console:
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN |
The regular expression should match all files except the metadata files, since the metadata files don't contain the sequence '.dk' in their names. This means that it handles all the other files, the 'content' files.
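The behaviour of the 'metadata' and 'content' patterns can be tried out locally before running the batch jobs. The sketch below assumes that RunBatch applies the pattern to the whole file name, which grep -x imitates. full_match is a hypothetical helper, and '1-metadata-1.arc' is a made-up metadata file name. The content file name is taken from the sample output earlier on this page.

```shell
# The two patterns from the commands above.
META_RE='.*-metadata-.*\.arc'
CONTENT_RE='.*.dk.*\.arc'

# Hypothetical helper: print "yes" if NAME matches RE in full, else "no".
# Arguments: RE NAME
full_match() {
  if printf '%s\n' "$2" | grep -E -x -q -- "$1"; then
    echo yes
  else
    echo no
  fi
}

full_match "$CONTENT_RE" '1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc'  # yes
full_match "$CONTENT_RE" '1-metadata-1.arc'                                    # no
full_match "$META_RE"    '1-metadata-1.arc'                                    # yes
full_match "$META_RE"    '1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc'  # no
```

Note that the dot before 'dk' in the content pattern is unescaped, so it matches any character and not only a literal dot.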
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=PROPS -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SystemReaderJob.class -Ooutput.system -Eerror.system -R'.*.dk.*' -BSBN |
...