Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Added some warc batch job tests.

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ChecksumJob.class -R.*\.arc -Ooutput.checksum

This should produce a file output.checksum with be something like:

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil

This should produce a file output.evil with:

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil.sb -BSBN

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R.*\.arc -Eerror.evil2

Since the batch program is supposed to fail, nothing should be written to std-out, only to std-error. It is therefore the std-err which should be catched, which can be done with the -E argument.

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*\.arc -Eerror.exception -Ooutput.exception

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*\.arc -Ooutput.exception.finish

...

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/dk.netarkivet.archive.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/dk.netarkivet.wayback.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output

The output should be well formed cdx records with each line starting with a canonicalized url like these:

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*\.arc

The following text message should be written in the console:

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*\.arc -Ooutput.simple

The output file should be quite large compared to the previous output files.

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R.*\.arc -Ooutput.jhove.arc

The output.jhove.arc file should be in the following format:

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*\.arc -Ooutput.copy.class  

...

This value should be approximately the same as the combined size of all the harvests.

eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'

Copy the content of the metadata files only.

 

Run

...

Test a WARC Batch Job

Code Block
java -cp lib/dk.netarkivet.archive.Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/dk.netarkivet.archive.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/dk.netarkivet.wayback.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output

Check that the output file has a significant amount of content.

eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'

Copy the content of the metadata files only.

 

Run

 

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN

 

This should give the following output in the console:

...

Code Block
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN

 

The regular expression should handle any files besides the metadata files, since they don't contain the sequence '.dk' in their name. This means that they handles all the other files, the 'content' files.

...