...
Excerpt |
---|
Uses a command-line tool several times and checks that the output from the tool is correct and that the appropriate bit archives have been queried. Note that the number of files mentioned in the output is not necessarily exact. |
Setup
Log in to a bitarchive server, e.g. the index server (kb-test-acs-001.kb.dk):
Code Block |
---|
$ ssh kbdevel@kb-test-acs-001
$ export TESTX=TEST11A
$ cd ${TESTX} |
Create a directory for the batch programs and copy the contents of /home/devel/backup_TEST11A/batchprogs to /home/devel/TEST11A/:
Code Block |
---|
cp -av /home/devel/backup_TEST11A/batchprogs . |
Run batch programs on archived files
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ChecksumJob.class -R.*\.arc -Ooutput.checksum |
This should produce a file output.checksum with content something like:
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil |
This should produce a file output.evil with:
...
This is run on the KBN replica, which has four bitapps (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written four times.
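One way to script this check is to count the lines in the output file that contain 'Legal'. The sketch below is only an illustration and not part of the test suite: the helper name `count_legal_lines` is hypothetical, and it assumes that each bitapp writes the sentence on a line of its own.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines in a file that contain 'Legal'.
# Assumes each bitapp writes the sentence on a separate line.
count_legal_lines() {
    grep -c 'Legal' "$1"
}

# Example usage after the KBN run above
# (the text above expects 4 with the default configuration):
#   count_legal_lines output.evil
```

Note that `grep -c` counts matching lines, not individual occurrences, so the result is lower than the true count if several sentences share one line.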
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil.sb -BSBN |
...
This is run on the SBN replica, which has only one bitapp (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written once.
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R.*\.arc -Eerror.evil2 |
Since the batch program is supposed to fail, nothing should be written to stdout, only to stderr. It is therefore stderr that should be captured, which can be done with the -E argument.
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*\.arc -Eerror.exception -Ooutput.exception |
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*\.arc -Ooutput.exception.finish |
...
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output |
The output in the cdx.output file should be well-formed CDX records, with each line starting with a canonicalized URL like these:
...
(In the above cases, <redirect-url> is empty, which it will always be unless the HTTP status code is 3xx.)
...
This method should fail before starting to process the files, and thus no files should be processed.
...
Run CDX Job On WARC Files
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.warc' -BSBN -Ocdx.warc.output |
Check that the output is non-empty. This verifies that deduplication CDX extraction also works on WARC files.
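The non-empty check can be scripted with the standard `-s` file test. This is a minimal sketch, assuming the output file is named cdx.warc.output as in the command above; the helper name `require_nonempty` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given file exists and is non-empty,
# and report its size in bytes. Otherwise print a message to stderr and fail.
require_nonempty() {
    if [ -s "$1" ]; then
        echo "$1: $(wc -c < "$1" | tr -d ' ') bytes"
    else
        echo "$1: missing or empty" >&2
        return 1
    fi
}

# Example usage after the run above:
#   require_nonempty cdx.warc.output
```

The same helper could be reused for the other output files on this page where only the presence of content is checked.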
ExceptionBatchInit
This method should fail before starting to process the files, and thus no files should be processed.
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*\.arc |
The following text message should be written in the console:
Code Block |
---|
Running batch job 'batchprogs/ExceptionBatchInit.class' on files matching '.*.arc' on replica 'KBN', output written to stdout errors written to stderr
Processed 0 files with 0 failures |
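If the console output is captured to a file, the final line can be checked automatically. The sketch below assumes the output of the run was redirected by appending `> console.log 2>&1` to the command; both the file name console.log and the helper name `assert_no_files_processed` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a captured console log reports that
# no files were processed.
assert_no_files_processed() {
    grep -q 'Processed 0 files with 0 failures' "$1"
}

# Example usage, assuming the console output was captured in console.log:
#   assert_no_files_processed console.log && echo "OK: no files processed"
```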
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*\.arc -Ooutput.simple |
The output file should be quite large compared to the previous output files.
...
Code Block |
---|
org.archive.io.arc.ARCRecord@f16070 available: q:\bitarkiv\JOLF\filedir\1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc: {ip-address=0.0.0.0, content-type=text/plain, absolute-offset=0, subject-uri=filedesc://1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc.open, length=1343, creation-date=20090316092641, version=1.1} |
...
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
Running methods from jar files, including Jhove based methods
Before running any Jhove methods, run the following command:
Code Block |
---|
cp -av /home/devel/backup_TEST11A/lib_jhove/* /home/devel/TEST11A/lib |
eu.planets.batch.jar -> eu.planets.JhoveArcJob
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R.*\.arc -Ooutput.jhove.arc |
The output.jhove.arc file should be in the following format:
Code Block |
---|
0,HTML,null,text/html
null,null,null,image/png
null,null,null,text/css
0,GIF,null,image/gif
0,JPEG,null,image/jpeg
..... |
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
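The format of output.jhove.arc shown above can be checked mechanically by verifying that every line has exactly four comma-separated fields. This is a rough sketch under the assumption that no field contains a comma itself; the helper name `count_malformed_jhove_lines` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the number of lines that do not consist of
# exactly four comma-separated fields. The exit status is non-zero if any
# such line exists. Empty lines count as malformed.
count_malformed_jhove_lines() {
    awk -F',' 'NF != 4 { bad++ } END { print bad + 0; exit (bad > 0) }' "$1"
}

# Example usage after the run above:
#   count_malformed_jhove_lines output.jhove.arc
```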
Check the content of metadata files and 'content'
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*\.arc -Ooutput.copy.class |
...
Check the size of the output file (run the command du -h output.copy.class):
59M output.copy.class
...
output.copy.class
This value should be approximately the same as the combined size of all the harvests.
Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
Test a WARC Batch Job
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output |
Check that the output file (cdx.warc.all.output) has a significant amount of content.
eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN |
This should give the following output in the console:
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN |
The regular expression should match all files except the metadata files, since the metadata files don't contain the sequence '.dk' in their names. This means that the job handles all the other files, the 'content' files.
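To see how the two regular expressions split the files, they can be tried against file names with `grep`. This is only a sketch: `grep -E -x` approximates a whole-name match, which is an assumption about how RunBatch applies the -R argument, and the helper name `matching_names` and the metadata file name below are made up for illustration. The content file name is taken from the SimpleArcBatchJob sample output earlier on this page.

```shell
#!/bin/sh
# Hypothetical helper: read file names from stdin and print those that the
# given regular expression matches in full.
matching_names() {
    grep -E -x -e "$1"
}

# Example usage with one content file name and one made-up metadata file name:
#   printf '%s\n' \
#       '1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc' \
#       '1-metadata-1.arc' | matching_names '.*.dk.*\.arc'
# Running the same input through '.*-metadata-.*\.arc' selects the other name.
```

Because the dot before 'dk' is unescaped, it matches any character, so the 'content' expression really selects names containing 'dk' preceded by at least one character.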
...
Run
Code Block |
---|
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=PROPS -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SystemReaderJob.class -Ooutput.system -Eerror.system -R'.*.dk.*' -BSBN |
...