
Runs a command-line tool several times and checks that the output from the tool is correct and that the appropriate bit archives have been queried. Note that the number of files mentioned in the output is not necessarily exact.

Setup

  1. Log in to a bitarchive server, e.g. the index server (kb-test-acs-001.kb.dk):

    Code Block
    $ ssh kbdevel@kb-test-acs-001
    $ export TESTX=TEST11A
    $ cd ${TESTX}


  2. Create a directory for the batch programs:

    Code Block
    $ mkdir batchprogs
    $ scp test@kb-prod-udv-001.kb.dk:/home/test/test-batch/* batchprogs/

    and copy the contents of /home/devel/backup_TEST11A/batchprogs to /home/devel/TEST11A/:

    Code Block
    $ cp -av /home/devel/backup_TEST11A/batchprogs .


Run batch programs on archived files

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ChecksumJob.class -R.*\.arc -Ooutput.checksum

This should produce a file output.checksum whose content is something like:

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CHECKSUM -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil

This should produce a file output.evil with:

...

This is run on the KBN replica, which has four bitapps (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written four times.
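The expected count can be checked mechanically rather than by eye. The following is a minimal sketch, assuming that each bitapp contributes exactly one line containing the word 'Legal' to the output file; `count_legal` is a hypothetical helper name, not part of NetarchiveSuite.

```shell
# Count the lines in a batch output file that contain the word 'Legal'.
# Assumption: each bitapp writes 'Legal' on a line of its own.
count_legal() {
    grep -c 'Legal' "$1"
}

# Example for the KBN replica with four bitapps:
#   [ "$(count_legal output.evil)" -eq 4 ] && echo "OK"
```

The same helper can be applied to the output of a run against a replica with a single bitapp, where the expected count is 1.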

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVILSB -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch.class -R.*\.arc -Ooutput.evil.sb -BSBN

...

This is run on the SBN replica, which has only one bitapp (according to the default configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written once.

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EVIL2 -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/EvilBatch2.class -R.*\.arc -Eerror.evil2

Since the batch program is supposed to fail, nothing should be written to stdout, only to stderr. The stderr output is therefore what should be captured, which is done with the -E argument.
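Whether the failure was actually captured can be checked by testing that the error file has content. A minimal sketch; `check_nonempty` is a hypothetical helper name, and the file name in the example is the one passed to -E above.

```shell
# Report whether a file exists and has content.
# Prints "non-empty" and returns 0, or prints "empty-or-missing" and returns 1.
check_nonempty() {
    if [ -s "$1" ]; then
        echo "non-empty"
    else
        echo "empty-or-missing"
        return 1
    fi
}

# Example:
#   check_nonempty error.evil2
```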

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*\.arc -Eerror.exception -Ooutput.exception

...

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*\.arc -Ooutput.exception.finish

...

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -Dsettings.common.applicationInstanceId=TEST11A -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.arc' -BSBN -Ocdx.output

The output in the cdx.output file should be well-formed CDX records, each line starting with a canonicalized URL, like these:

...

(In the above cases, <redirect-url> is empty, which it always will be unless the HTTP status code is 3xx.)

...

Run CDX Job On WARC Files

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*metadata.*\.warc' -BSBN -Ocdx.warc.output

Check that there is non-empty output. This verifies that CDX deduplication extraction also works on WARC files.

ExceptionBatchInit

This method should fail before starting to process the files, and thus no files should be processed.

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*\.arc

The following text message should be written in the console:

Code Block
Running batch job 'batchprogs/ExceptionBatchInit.class' on files matching '.*.arc' on replica 'KBN', output written to stdout errors written to stderr 
Processed 0 files with 0 failures

Run batch programs on records within archive files

These methods do not check the '.arc' files themselves, but the content of these '.arc' files.

SimpleArcBatchJob

Checks that it is possible to access the records within an arc file.

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*\.arc -Ooutput.simple

The output file should be quite large compared to the previous output files.

The size of the output files and the error files can be found by running the command: du -h output.* error.*. This gives the following results:

Code Block
4,0K    output.evil
4,0K    output.exception
4,0K    output.exception.finish
212K    output.simple
4,0K    error.evil2
4,0K    error.exception

The output file contains one line of data for each record. Each line has the following structure:

Code Block
org.archive.io.arc.ARCRecord@f16070 available: q:\bitarkiv\JOLF\filedir\1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc: {ip-address=0.0.0.0, content-type=text/plain, absolute-offset=0, subject-uri=filedesc://1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc.open, length=1343, creation-date=20090316092641, version=1.1}

Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
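Because du -h rounds to whole blocks, small files all show up as 4,0K. For a finer comparison, the sizes can be listed in bytes. A minimal sketch; `sizes_largest_first` is a hypothetical helper name.

```shell
# Print "<bytes><TAB><file>" for each given file, largest first.
# Files that do not exist are skipped silently.
sizes_largest_first() {
    for f in "$@"; do
        [ -f "$f" ] && printf '%s\t%s\n' "$(( $(wc -c < "$f") ))" "$f"
    done | sort -rn
}

# Example: the first line printed should name output.simple.
#   sizes_largest_first output.* error.* | head -n 1
```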

Running methods from jar files, including Jhove based methods

Before running any Jhove methods, run the following command:

Code Block
cp -av /home/devel/backup_TEST11A/lib_jhove/* /home/devel/TEST11A/lib

eu.planets.batch.jar -> eu.planets.JhoveArcJob 

Get some metadata from all records within the arc-files through Jhove.

This method writes the following metadata for every record in a data format that Jhove can handle: parsing errors, format, version, mimetype.

For data formats that are not handled, these values are replaced by 'null', except mimetype, which is replaced by the mimetype of the record.

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove.jar,lib/jhove-module.jar -Neu.planets.JhoveArcJob -R.*\.arc -Ooutput.jhove.arc

The output.jhove.arc file should be in the following format:

Code Block
0,HTML,null,text/html
null,null,null,image/png
null,null,null,text/css
0,GIF,null,image/gif
0,JPEG,null,image/jpeg
.....

Furthermore, a list of failed files should be printed to stdout (all of them WARC files).
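The line format shown above (four comma-separated values) can be verified for the whole file. The sketch below assumes that no field, including the mimetype, contains a comma itself; `check_jhove_lines` is a hypothetical helper name.

```shell
# Print every line that does not consist of exactly four comma-separated
# fields, prefixed with file name and line number.
# Returns 0 when all lines are well formed, 1 otherwise.
# Assumption: no field contains a comma.
check_jhove_lines() {
    awk -F',' 'NF != 4 { print FILENAME ":" NR ": " $0; bad = 1 } END { exit bad }' "$1"
}

# Example:
#   check_jhove_lines output.jhove.arc && echo "all lines well formed"
```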

Check the content of metadata files and 'content' 

These three methods take a copy of the content from different arc files, selected by a regular expression.

All these tests are basically stress tests, which dump the content of all records within an arc file to the output stream. First the content of all the files is collected, then only the content of the metadata files, and finally only the data from the harvesting. These are written to different output files, and the first file should have the same size as the other two combined.

CopyArcContent

This tests the ability to copy the content of all the files in the bitarchives. It can be seen as a stress test of the batch system.

Run

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*\.arc -Ooutput.copy.class  

This method copies all the content from the records to the output file, so the command takes some time to run: approx. 25 seconds for 60 MB, and about 40 minutes for 80 files (6 GB).

The content of the output file is not easy to verify. It should start with the metadata of a record, followed by the actual content of that record.

Check the size of the output file (run the command du -h output.copy.class): 

59M     output.copy.class

This value should be approximately the same as the combined size of all the harvests.

Furthermore, a list of failed files should be printed to stdout (all of them WARC files).

Test a WARC Batch Job

Code Block
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -cp lib/netarchivesuite-archive-core.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.common.utils.cdx.WARCExtractCDXJob -Jlib/netarchivesuite-wayback-indexer.jar -R'.*dk.*\.warc' -BSBN -Ocdx.warc.all.output

Check that the output file (cdx.warc.all.output) has a significant amount of content.
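"A significant amount of content" can be made concrete with a size threshold. A minimal sketch; `has_min_size` is a hypothetical helper name, and the threshold in the example is an arbitrary assumption that should be adapted to the size of the test harvest.

```shell
# Succeed when the file exists and is at least min_bytes bytes long.
has_min_size() {
    file=$1
    min_bytes=$2
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding that wc may add.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -ge "$min_bytes" ]
}

# Example with an assumed threshold of 10000 bytes:
#   has_min_size cdx.warc.all.output 10000 && echo "OK"
```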

eu.planets.batch.jar -> eu.planets.CopyArcContent: 'metadata'

Copy the content of the metadata files only.

 

Run

 

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=META -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.meta -R'.*-metadata-.*\.arc' -BKBN

 

This should give the following output in the console:

Code Block
Running batch job 'batchprogs/CopyArcContent.class' on files matching '.*-metadata-.*' on replica 'KBN', output written to file 'output.copy.meta', errors written to stderr
Processed 2 files with 0 failures
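If the console text has been saved to a file, the counts in the summary line can be extracted for comparison. A minimal sketch, assuming the summary line has exactly the form shown above; `batch_summary` and the file name console.log are hypothetical.

```shell
# Read console output on stdin and print "<processed> <failures>"
# taken from the "Processed N files with M failures" line.
batch_summary() {
    sed -n 's/^Processed \([0-9][0-9]*\) files with \([0-9][0-9]*\) failures[[:space:]]*$/\1 \2/p'
}

# Example, with the console text saved in console.log:
#   [ "$(batch_summary < console.log)" = "2 0" ] && echo "OK"
```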

eu.planets.batch.jar -> eu.planets.CopyArcContent: 'content'

Copy only the content of the files collected in the harvest. This test assumes that the harvesters in the test system are in the '.dk' domain.

 

Run

 

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=CONTENT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -Ooutput.copy.content -R'.*.dk.*\.arc' -BKBN

 

The regular expression should match every file except the metadata files, since the names of the metadata files do not contain the sequence '.dk'. It therefore selects all the other files, the 'content' files.

This should give the following output in the console:

 

Code Block
Running batch job 'batchprogs/CopyArcContent.class' on files matching '.*.dk.*' on replica 'KBN', output written to file 'output.copy.content', errors written to stderr
Processed 9 files with 0 failures
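Once all three CopyArcContent runs have finished, the expectation that the full copy is as large as the metadata copy and the content copy combined can be checked in bytes. A minimal sketch; `compare_copy_sizes` is a hypothetical helper name, and how large a difference is acceptable is left to the tester.

```shell
# Print the sizes of the full copy, the metadata copy and the content copy,
# plus the difference between the full copy and the sum of the other two.
compare_copy_sizes() {
    all=$(( $(wc -c < "$1") ))
    meta=$(( $(wc -c < "$2") ))
    content=$(( $(wc -c < "$3") ))
    echo "all=$all meta=$meta content=$content diff=$(( all - meta - content ))"
}

# Example:
#   compare_copy_sizes output.copy.class output.copy.meta output.copy.content
```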

Check that the system properties can be read

 

Run

 

Code Block
java -cp lib/netarchivesuite-archive-core.jar -Dsettings.common.applicationInstanceId=PROPS -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SystemReaderJob.class -Ooutput.system -Eerror.system -R'.*.dk.*' -BSBN

 

Be aware that all metadata files are excluded.

Check that the error.system file is empty.

The output.system should contain system properties and a list of files. First the java version, then the operating system name, the operating system architecture and the operating system version. Then the list of files should be written, followed by a count of the files and the system property user name. It could look something like this:

 

Code Block
System properties!
java version: 1.6.0_24
os name: Linux
os architecture: i386
os version: 2.6.32-220.4.1.el6.x86_64
File: 1-1-20130114144130-00001-kb-test-har-001.kb.dk.arc
File: 1-1-20130114144130-00002-kb-test-har-001.kb.dk.arc
File: 1-1-20130114144130-00003-kb-test-har-001.kb.dk.arc
File: 1-1-20130114144130-00000-kb-test-har-001.kb.dk.arc
File count: 4
User: netarkiv
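The internal consistency of output.system can be checked by comparing the number of listed files with the reported count. A minimal sketch, assuming the 'File: ' and 'File count: ' line formats shown in the example above; `check_file_count` is a hypothetical helper name.

```shell
# Succeed when the number of "File: " lines equals the value on the
# "File count: " line of the given file.
check_file_count() {
    listed=$(grep -c '^File: ' "$1" || true)
    reported=$(sed -n 's/^File count: \([0-9][0-9]*\)$/\1/p' "$1")
    [ -n "$reported" ] && [ "$listed" -eq "$reported" ]
}

# Example:
#   check_file_count output.system && [ ! -s error.system ] && echo "OK"
```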