...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ChecksumJob.class -R.*\.arc -Ooutput.checksum |
This should produce a file output.checksum whose content looks something like:
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil |
This should produce a file output.evil with:
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil.sb -BSBN |
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R.*\.arc -Eerror.evil2 |
Since the batch program is supposed to fail, nothing should be written to std-out, only to std-err. It is therefore std-err that should be captured, which can be done with the -E argument.
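As a quick way to confirm what was captured, the helper below classifies a file as missing, empty, or non-empty. It is only a sketch: the function name `describe_capture` is hypothetical and not part of NetarchiveSuite, and it assumes the file named after -E (here error.evil2) is written to the current directory.

```shell
# Hypothetical helper (not part of NetarchiveSuite): classify a captured
# RunBatch output or error file.
# Prints "missing", "empty", or "non-empty (<N> bytes)".
describe_capture() {
    file="$1"
    if [ ! -e "$file" ]; then
        echo "missing"
    elif [ ! -s "$file" ]; then
        echo "empty"
    else
        # "wc -c < file" prints only the byte count; tr strips any padding.
        size=$(wc -c < "$file" | tr -d ' ')
        echo "non-empty ($size bytes)"
    fi
}

# Usage after running the command above (assumed file location):
#   describe_capture error.evil2
```

For this test, a non-empty result for error.evil2 would be the expected outcome.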
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*\.arc -Eerror.exception -Ooutput.exception |
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*\.arc -Ooutput.exception.finish |
...
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/dk.netarkivet.archive.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/dk.netarkivet.wayback.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output |
The output should be well-formed CDX records, with each line starting with a canonicalized URL, like these:
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*\.arc |
The following text message should be written in the console:
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*\.arc -Ooutput.simple |
The output file should be quite large compared to the previous output files.
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R.*\.arc -Ooutput.jhove.arc |
The output.jhove.arc file should be in the following format:
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*\.arc -Ooutput.copy.class |
...
This value should be approximately the same as the combined size of all the harvests.
...
Test a WARC Batch Job
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/dk.netarkivet.archive.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/dk.netarkivet.wayback.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output |
Check that the output file has a significant amount of content.
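One way to make "a significant amount of content" concrete is to require a minimum number of lines in the output file. The sketch below does that. The function name `has_min_lines` and the threshold of 100 lines are assumptions chosen for illustration, not values from the test specification.

```shell
# Hypothetical helper: succeeds when FILE exists and has at least MIN_LINES lines.
has_min_lines() {
    file="$1"
    min_lines="$2"
    # A missing file counts as a failure.
    [ -f "$file" ] || return 1
    # "wc -l < file" prints only the line count; tr strips any padding.
    lines=$(wc -l < "$file" | tr -d ' ')
    [ "$lines" -ge "$min_lines" ]
}

# Usage after running the command above (100 is an assumed threshold):
#   if has_min_lines cdx.warc.all.output 100; then
#       echo "output looks substantial"
#   else
#       echo "output is missing or too small"
#   fi
```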
eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'
Copy the content of the metadata files only.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN |
This should give the following output in the console:
...
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN |
The regular expression should match every file except the metadata files, since those do not contain the sequence '.dk' in their names. This means that it handles all the other files, the 'content' files.
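The split between metadata and content files can be checked locally by running the two regular expressions over a list of file names. This is a sketch under assumptions: both file names are made up for illustration, and `grep -E -x` (whole-line match) is used on the assumption that RunBatch matches the pattern against the full file name.

```shell
# Hypothetical file names, for illustration only.
files='1-metadata-1.arc
1-1-20100101120000-00000-kb-test-har-001.kb.dk.arc'

# -E selects extended regular expressions; -x requires the whole line to match.
metadata_matches=$(printf '%s\n' "$files" | grep -E -x '.*-metadata-.*\.arc')
content_matches=$(printf '%s\n' "$files" | grep -E -x '.*.dk.*\.arc')

echo "metadata: $metadata_matches"
echo "content:  $content_matches"
```

Note that the dot before `dk` is unescaped, so it matches any single character, not only a literal dot. A stricter pattern would use `\.dk`.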
...