Implement CDX-generating code that also works for WARC files

Description

The CDX-generating code must work for both ARC and WARC files. Currently, the method dk.netarkivet.common.utils.cdx.ExtractCDX.generateCDX() ignores all files that do not end with .arc. This method is used in the harvest documentation phase to generate CDX files for the ARC files produced by Heritrix.
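
As a rough illustration of the filename check that would need to change, the sketch below classifies a file as ARC or WARC by its extension. The class ArchiveFileKind and its method are hypothetical and not part of NetarchiveSuite, and the handling of a trailing ".gz" is an assumption about how compressed archives are named.

```java
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of NetarchiveSuite. */
final class ArchiveFileKind {
    enum Kind { ARC, WARC, UNSUPPORTED }

    private ArchiveFileKind() {}

    /** Classifies a file by its name so CDX generation can pick a matching reader. */
    static Kind of(String filename) {
        String name = filename.toLowerCase(Locale.ROOT);
        // Assumption: gzip-compressed archives carry an additional ".gz" suffix.
        if (name.endsWith(".gz")) {
            name = name.substring(0, name.length() - ".gz".length());
        }
        if (name.endsWith(".warc")) {
            return Kind.WARC;
        }
        if (name.endsWith(".arc")) {
            return Kind.ARC;
        }
        // Anything else is skipped, as the current code does for non-.arc files.
        return Kind.UNSUPPORTED;
    }
}
```

A caller could switch on the returned kind and skip only the UNSUPPORTED files, instead of skipping everything that does not end with .arc.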

When generating a single CDX entry for a URL request, information from several WARC records is combined.
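
A minimal sketch of what such combining could look like is given below. Everything in it is an assumption made for illustration: the classes CdxCombiner, RecordInfo and CdxEntry are hypothetical, records are grouped by their WARC-Target-URI, and the response record is taken as the source of date, status and offset. It also assumes at most one capture per URL in the file, which a real implementation could not rely on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; types and field choices are illustrative only. */
final class CdxCombiner {

    /** Minimal stand-in for the parsed headers of one WARC record. */
    static final class RecordInfo {
        final String type;       // WARC-Type, e.g. "request" or "response"
        final String targetUri;  // WARC-Target-URI
        final String date;       // WARC-Date
        final String httpStatus; // null when the record carries no HTTP response
        final long offset;       // byte offset of the record in the file

        RecordInfo(String type, String targetUri, String date,
                   String httpStatus, long offset) {
            this.type = type;
            this.targetUri = targetUri;
            this.date = date;
            this.httpStatus = httpStatus;
            this.offset = offset;
        }
    }

    /** Simplified CDX entry holding only a few of the usual fields. */
    static final class CdxEntry {
        String url;
        String date;
        String status;
        long offset = -1;
    }

    private CdxCombiner() {}

    /** Merges all records that share a target URI into one entry per URI. */
    static List<CdxEntry> combine(List<RecordInfo> records) {
        Map<String, CdxEntry> byUrl = new LinkedHashMap<>();
        for (RecordInfo r : records) {
            CdxEntry entry = byUrl.get(r.targetUri);
            if (entry == null) {
                entry = new CdxEntry();
                entry.url = r.targetUri;
                byUrl.put(r.targetUri, entry);
            }
            if ("response".equals(r.type)) {
                // Assumption: the response record is authoritative for these fields.
                entry.date = r.date;
                entry.status = r.httpStatus;
                entry.offset = r.offset;
            } else if (entry.date == null) {
                // Fall back to another record's date until a response is seen.
                entry.date = r.date;
            }
        }
        return new ArrayList<>(byUrl.values());
    }
}
```

Given a request record and a response record for the same URL, combine returns a single entry whose status and offset come from the response record.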

Note that Wayback already has code to generate a CDX from WARC:

https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/

Activity

Nicholas Clarke, September 5, 2012 at 2:06 PM

Tested with one NAS setup using ARC and another using WARC.
CDX files are generated in both installations, so CDX generation works for both ARC and WARC.

Nicholas Clarke, August 8, 2012 at 11:51 AM

CDX functionality seems to work in trunk.

Two new CDX tools have been added:

java -cp lib/dk.netarkivet.common.jar dk.netarkivet.common.tools.WARCExtractCDX bitarkiv/filedir/71-1-20120807124403-00000-ubuntu.warc

java -cp lib/dk.netarkivet.common.jar dk.netarkivet.common.tools.ArchiveExtractCDX bitarkiv/filedir/71-1-20120807124403-00000-ubuntu.arc
java -cp lib/dk.netarkivet.common.jar dk.netarkivet.common.tools.ArchiveExtractCDX bitarkiv/filedir/71-1-20120807124403-00000-ubuntu.warc

Sr, September 29, 2011 at 3:11 PM

Depends on .

Fixed

Details

Accuracy of estimate

Rough

Time tracking

No time logged; 1w 2d remaining

Created September 29, 2011 at 1:35 PM
Updated February 16, 2016 at 5:28 PM
Resolved September 5, 2012 at 2:06 PM