...
Log in to a bitarchive server, e.g.:
Code Block |
---|
$ ssh kb-test-acs-001
$ export TESTX=TEST11A
$ cd ${TESTX} |
Create a directory for the batch programs:
Code Block |
---|
$ mkdir batchprogs
$ scp test@kb-prod-udv-001.kb.dk:/home/test/test-batch/* batchprogs/. |
Run batch programs on archived files
ChecksumJob
Calculating the MD5 checksum on the archive files (only .arc files).
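As a rough local cross-check of what this job computes, the following shell sketch prints the MD5 checksum of each .arc file in a directory and skips everything else. It is not the ChecksumJob itself: the function name checksum_arcs is made up for illustration, and the sketch assumes that the GNU md5sum tool is installed.

```shell
# Hypothetical helper, not part of NetarchiveSuite: print "<md5>  <path>" for
# every .arc file directly inside the given directory. Assumes GNU md5sum.
checksum_arcs() {
    dir="$1"
    for f in "$dir"/*.arc; do
        # If no .arc file exists the glob stays unexpanded, so skip non-files.
        [ -f "$f" ] || continue
        md5sum "$f"
    done
}
```

Calling checksum_arcs with the directory that holds the archived files prints one checksum line per .arc file, which can then be compared by hand with the output of the batch job.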
...
This is run on the SBN replica, which has only one bitapp (according to the configuration file 'deploy_config_multi_bitapps.xml'). Therefore the sentence 'Legal' is written once.
EvilBatch2
Tries to delete archived files, and therefore fails for every file.
...
Since the batch program is supposed to fail, nothing should be written to std-out, only to std-err. It is therefore std-err that should be caught, which can be done with the -E argument.
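This should write something like the following text to the console:
Code Block |
---|
INFO: JMSArcRepository listens for replies on channel '[Queue 'JOLF_COMMON_THIS_REPOS_CLIENT_130_226_228_5_ISA']'
Running batch job 'batchprogs/EvilBatch2.class' on files matching 'arc.*' on replica 'KBN', output written to stdout errors written to file 'error.evil2'
Mar 16, 2009 11:44:14 AM dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient batch
WARNING: The batch job 'ID:15-130.226.228.5(9e:6:7d:1b:b:ec)-39359-1237200254886: To JOLF_COMMON_THE_REPOS ReplyTo JOLF_COMMON_THIS_REPOS_CLIENT_130_226_228_5_ISA OK Job: dk.netarkivet.common.utils.batch.LoadableFileBatchJob processing EvilBatch2.class' resulted in the following error:
Batch job failed on 3 files.
Batch job failed on 2 files.
Batch job failed on 3 files.
Batch job failed on 3 files.
Processed 11 files with 11 failures
Writing errors to file: /home/test/JOLF/error.evil2
... |
It should also produce a file 'error.evil2' containing something like:
Code Block |
---|
Failed files:
1-1-20130114144130-00000-kb-test-har-001.kb.dk.arc
1-1-20130114144130-00003-kb-test-har-001.kb.dk.arc
1-1-20130114144130-00001-kb-test-har-001.kb.dk.arc
1-1-20130114144130-00002-kb-test-har-001.kb.dk.arc
1-metadata-1.arc |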
ExceptionBatch
Checks if a file is a metadata file, in which case it fails with an exception. Otherwise it writes out that the given file is not a metadata file.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=EXCEPTION -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatch.class -R.*.arc -Eerror.exception -Ooutput.exception |
This should produce a file 'output.exception' with something like:
Code Block |
---|
non-metadata file: 1-1-20130114144130-00000-kb-test-har-001.kb.dk.arc
non-metadata file: 1-1-20130114144130-00003-kb-test-har-001.kb.dk.arc
non-metadata file: 1-1-20130114144130-00001-kb-test-har-001.kb.dk.arc
non-metadata file: 1-1-20130114144130-00002-kb-test-har-001.kb.dk.arc |
And a file 'error.exception' with something like:
Code Block |
---|
Failed files:
1-metadata-1.arc |
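The two expected files can be checked mechanically. The sketch below is only an illustration: the function name check_exception_batch is hypothetical, and it assumes that metadata files can be recognised by '-metadata-' in their names, as in the sample above. How the batch program itself detects metadata files is not stated here.

```shell
# Hypothetical helper: verify the output and error files of ExceptionBatch.
# Assumption: a metadata file has "-metadata-" in its name.
check_exception_batch() {
    out="$1"; err="$2"
    # Any output line without the expected prefix is a failure.
    if grep -q -v '^non-metadata file: ' "$out"; then
        echo "unexpected line in $out"; return 1
    fi
    # The error file must start with the header line ...
    if [ "$(head -n 1 "$err")" != "Failed files:" ]; then
        echo "missing header in $err"; return 1
    fi
    # ... followed only by metadata file names.
    if tail -n +2 "$err" | grep -q -v -e '-metadata-'; then
        echo "non-metadata file listed in $err"; return 1
    fi
    echo "ok"
}
```

Running check_exception_batch output.exception error.exception after the batch job should print ok if the files look like the samples.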
ExceptionBatchFinish
Does not work properly: it does not throw an exception, it only writes out the file names.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=FINISH -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchFinish.class -R.*.arc -Ooutput.exception.finish |
This should produce a file 'output.exception.finish' with something like:
Code Block |
---|
1-1-20130114144130-00000-kb-test-har-001.kb.dk.arc
1-metadata-1.arc
1-1-20130114144130-00001-kb-test-har-001.kb.dk.arc
1-1-20130114144130-00003-kb-test-har-001.kb.dk.arc
1-1-20130114144130-00002-kb-test-har-001.kb.dk.arc |
ExtractDeduplicateCDXBatchJob
Run
Code Block |
---|
java -Dsettings.common.applicationInstanceId=DEDUP -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml -Dsettings.common.applicationInstanceId=TEST11A -cp lib/dk.netarkivet.archive.jar dk.netarkivet.archive.tools.RunBatch -Ndk.netarkivet.wayback.batch.DeduplicationCDXExtractionBatchJob -Jlib/dk.netarkivet.wayback.jar -R'.*metadata.*' -BSBN -Ocdx.output |
The output should be well-formed CDX records, with each line starting with a canonicalized URL, like these:
Code Block |
---|
jigsaw.w3.org/css-validator/images/vcss-blue 20091001090817 http://jigsaw.w3.org/css-validator/images/vcss-blue image/gif 200 YEX3QI4MLIW3EPJCYCA2OZWT4ZEQGA36 - 8221 3-3-20091001084354-00004-kb-test-har-001.kb.dk.arc
123hjemmeside.dk/picture.aspx?id=13195185 20090114041208 http://www.123hjemmeside.dk/picture.aspx?id=13195185 image/jpeg 200 SY7VL2B5HODA3TKBR6DQLCAN74BSUL3G - 33623059 27020-70-20080505183112-00095-kb-prod-har-002.kb.dk.arc
123hjemmeside.dk/picture.aspx?id=13243517 20090114105138 http://www.123hjemmeside.dk/picture.aspx?id=13243517 image/jpeg 200 XOEJDTALKDZNB2KKRCUS3K6X2LGZKEAC - 24666098 27063-70-20080513145144-00069-kb-prod-har-001.kb.dk.arc
123hjemmeside.dk/picture.aspx?id=13244248 20090114105514 http://www.123hjemmeside.dk/picture.aspx?id=13244248 image/jpeg 200 R7AEJVSL6Z3U2CAQGNDKKJ2NL5MSOCJU - 38159302 27063-70-20080513144604-00068-kb-prod-har-001.kb.dk.arc
|
The fields are:
<canonical url> <timestamp> <original url> <mime-type> <http code> <checksum> <redirect url> <length> <filename>
(In the above cases, <redirect-url> is empty, which it will always be unless there is a 3xx http code.)
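Each sample record above consists of nine whitespace-separated tokens, the second of which is a 14-digit timestamp. A quick plausibility check of a single line could look like the following sketch. The function name cdx_line_ok and the specific checks are illustrative assumptions, not part of the batch job.

```shell
# Hypothetical helper: succeed only if one CDX line looks like the samples above.
cdx_line_ok() {
    # Split on whitespace; a tenth token, if any, ends up in $extra.
    read -r canon ts orig mime code sum redirect len file extra <<EOF
$1
EOF
    # Exactly nine tokens are expected.
    [ -n "$file" ] && [ -z "$extra" ] || return 1
    # The timestamp must consist of 14 digits.
    case "$ts" in *[!0-9]*) return 1 ;; esac
    [ "${#ts}" -eq 14 ] || return 1
    # The HTTP code must be three digits.
    case "$code" in [0-9][0-9][0-9]) ;; *) return 1 ;; esac
    # The last token names the arc file that holds the record.
    case "$file" in *.arc) ;; *) return 1 ;; esac
    return 0
}
```

To check a whole file, the function can be called once per line of cdx.output, for instance from a while read loop, reporting every line for which it fails.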
ExceptionBatchInit
This method should fail before starting to process the files, and thus no files should be processed.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=INIT -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/ExceptionBatchInit.class -R.*.arc |
The following message should be written to the console:
Code Block |
---|
Running batch job 'batchprogs/ExceptionBatchInit.class' on files matching '.*.arc' on replica 'KBN', output written to stdout errors written to stderr
Processed 0 files with 0 failures |
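The summary line can also be picked out of the console output automatically. The following sketch is an illustration only; the function name batch_summary is hypothetical, and it assumes the summary line has exactly the wording shown above.

```shell
# Hypothetical helper: read console output on stdin and print
# "<processed> <failures>" taken from the last summary line found.
batch_summary() {
    sed -n 's/^ *Processed \([0-9][0-9]*\) files with \([0-9][0-9]*\) failures.*$/\1 \2/p' | tail -n 1
}
```

Piping the console output of the RunBatch command into batch_summary (adding 2>&1 so that both streams are captured) should print 0 0 for ExceptionBatchInit.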
Run batch programs on records within archive files
These batch jobs do not operate on the '.arc' files as a whole, but on the records within these '.arc' files.
SimpleArcBatchJob
Checks that it is possible to access the records within an arc file.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=SIMPLE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/SimpleArcBatchJob.class -R.*.arc -Ooutput.simple |
The output file should be quite large compared to the previous output files.
The size of the output files and the error files can be found by running the command 'du -h output.* error.*'. This gives the following results:
Code Block |
---|
4,0K output.evil
4,0K output.exception
4,0K output.exception.finish
212K output.simple
4,0K error.evil2
4,0K error.exception |
The output file contains one line of data for each record. Each line has the following structure:
Code Block |
---|
org.archive.io.arc.ARCRecord@f16070 available: q:\bitarkiv\JOLF\filedir\1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc: {ip-address=0.0.0.0, content-type=text/plain, absolute-offset=0, subject-uri=filedesc://1-1-20090316092641-00001-kb-test-har-002.kb.dk.arc.open, length=1343, creation-date=20090316092641, version=1.1} |
Running methods from jar files, including Jhove based methods
Before running any Jhove methods, run the following command:
Code Block |
---|
scp -r test@kb-prod-udv-001.kb.dk:lib/jhove lib/. |
eu.planets.batch.jar -> eu.planets.JhoveArcJob
Get some metadata from all records within the arc-files through Jhove.
For every record in a data format that Jhove can handle, this method writes the following metadata: parsing errors, format, version, mimetype.
For data formats that are not handled, these values are replaced by 'null', except mimetype, which is replaced by the mimetype of the record.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=JHOVE -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Jbatchprogs/eu.planets.batch.jar,lib/jhove/jhove.jar,lib/jhove/jhove-module.jar -Neu.planets.JhoveArcJob -R.*.arc -Ooutput.jhove.arc |
The output.jhove.arc file should be in the following format:
Code Block |
---|
0,HTML,null,text/html
null,null,null,image/png
null,null,null,text/css
0,GIF,null,image/gif
0,JPEG,null,image/jpeg
..... |
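To get an overview of a large output file, the lines can be counted per format. The sketch below assumes that every record line has exactly four comma-separated fields in the order listed above (parsing errors, format, version, mimetype). The function name jhove_format_counts is hypothetical.

```shell
# Hypothetical helper: count the records per Jhove format (second field).
# Lines that do not have exactly four comma-separated fields are ignored.
jhove_format_counts() {
    awk -F',' 'NF == 4 { count[$2]++ }
               END { for (f in count) print f, count[f] }' "$1" | LC_ALL=C sort
}
```

Running jhove_format_counts output.jhove.arc prints one line per format with its count. The 'null' line gives the number of records in data formats that were not handled.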
Check the content of metadata files and 'content'
These three methods take a copy of the content from different arc files based on a regular expression.
All these tests are basically stress tests, which dump the content of all records within an arc file to the output stream. First the content of all the files is collected, then only the content of the metadata files, and finally only the data from the harvesting. Each run writes to its own output file, and the first file should have the same size as the other two combined.
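The size comparison can be scripted. The following is a minimal sketch: the function name combined_size_matches is hypothetical, and the three output file names are passed as arguments because they depend on the -O values used in the individual runs.

```shell
# Hypothetical helper: succeed if the first file is exactly as large (in bytes)
# as the second and third files combined.
combined_size_matches() {
    # $(( ... )) strips the padding that some wc implementations add.
    all=$(( $(wc -c < "$1") ))
    meta=$(( $(wc -c < "$2") ))
    content=$(( $(wc -c < "$3") ))
    echo "all=$all metadata+content=$((meta + content))"
    [ "$all" -eq $((meta + content)) ]
}
```

The function takes the output file for all files first, then the output file for the metadata files, then the one for the harvested data. It prints both byte counts and returns a non-zero status if they differ.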
CopyArcContent
This tests the ability to copy the content of all the files in the bitarchives. It can be seen as a stress test of the batch system.
Run
Code Block |
---|
java -cp lib/dk.netarkivet.archive.jar -Dsettings.common.applicationInstanceId=COPYARC -Ddk.netarkivet.settings.file=conf/settings_IndexServerApplication.xml dk.netarkivet.archive.tools.RunBatch -Cbatchprogs/CopyArcContent.class -R.*.arc -Ooutput.copy.class |
This method copies all the content from the records to the output file, so the command takes some time to run: approx. 25 seconds for 60 MB, and about 40 minutes for 80 files (6 GB).
The content of the output file is not easy to confirm. It should start by having the metadata of a record and then the actual content of this record.
Check the size of the output file by running the command 'du -h output.copy.class':
59M output.copy.class
This value should be approximately the same as the combined size of all the harvests.