CDX Generation and Revisit Generation for Duplicate Annotations

The correct modern way to manage a revisit (i.e. a harvest of content identical to previously harvested content) is to generate a WARC revisit record. The revisit record references the exact harvest time of the original harvest (via the WARC-Refers-To-Date field), so the replay system can use this information to serve the correct content for the revisit. Standard indexing tools for warcfiles handle revisit records correctly. This has been the behaviour of NetarchiveSuite since 2016.

The old way to handle revisits was to write nothing to the arc or warcfile, and instead to add an annotation to the corresponding crawl-log line in the corresponding metadata file, e.g.

2015-06-02T21:45:39.009Z   200         48 http://eas8.emediate.eu/eas?cu=17096;cat=;cre=mu;js=y;target=_blank RE http://newsbreak.dk/ application/x-javascript #039 20150602214538947+61 sha1:PSJGKM4YV37XVNCRLG47VQG6GIP4SXBS - duplicate:"230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc,72896",content-size:593

This simply means that the content of the url (field 4 of the crawl log), harvested at the given timestamp (field 1 of the crawl log), can be found in file 230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc at offset 72896 (with the given sha1).
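As an illustration, the annotation can be pulled out of a crawl-log line with a few lines of Python. This is only a sketch and is not the NetarchiveSuite implementation: the names DUPLICATE_RE, DuplicateRef and parse_duplicate are invented for this example, and it assumes that the fields are whitespace-separated and that the annotation always has the form duplicate:"<filename>,<offset>" as in the line above.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical pattern for the annotation shown above: duplicate:"<filename>,<offset>"
DUPLICATE_RE = re.compile(r'duplicate:"([^"]+),(\d+)"')


class DuplicateRef(NamedTuple):
    timestamp: str  # field 1 of the crawl log line
    url: str        # field 4 of the crawl log line
    filename: str   # file that holds the original record
    offset: int     # offset of the original record in that file


def parse_duplicate(line: str) -> Optional[DuplicateRef]:
    """Return the duplicate reference in a crawl log line, or None if there is none."""
    match = DUPLICATE_RE.search(line)
    fields = line.split()
    if match is None or len(fields) < 4:
        return None
    return DuplicateRef(fields[0], fields[3], match.group(1), int(match.group(2)))


# The crawl log line quoted above, split over several string literals for readability.
EXAMPLE_LINE = (
    '2015-06-02T21:45:39.009Z   200         48 '
    'http://eas8.emediate.eu/eas?cu=17096;cat=;cre=mu;js=y;target=_blank '
    'RE http://newsbreak.dk/ application/x-javascript #039 '
    '20150602214538947+61 sha1:PSJGKM4YV37XVNCRLG47VQG6GIP4SXBS - '
    'duplicate:"230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc,72896",'
    'content-size:593'
)

print(parse_duplicate(EXAMPLE_LINE))
```

Lines without a duplicate annotation yield None, so the same function can be used to filter a whole crawl log.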

CDX records for these duplicate annotations are generated in the method DeduplicateToCDXAdapter.adaptLine() in the NetarchiveSuite codebase. If we need to reimplement CDX indexing for a new CDX format (e.g. cdx-j) then this (not very complicated) method will also need to be reimplemented.

To summarise, data-warcfiles are indexed by standard cdx-indexing tools, but metadata-warcfiles must be indexed by custom tools to manage the duplicate annotations.

Migrated Duplicate Annotations - a sidebar

When we migrated from uncompressed to compressed warc, these duplicate annotations were a problem because they all referred to offsets in uncompressed files. We dealt with this by adding an extra metadata record in the metadata file - a lookup table to translate these records. The lookup table has WARC-Target-URI metadata://crawl/index/deduplicationmigration?majorversion=0&minorversion=0 . An example of a line in the table would be

230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc 72896 17825 1432719936000

This simply means that the warc-record at offset 72896 in file 230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc can be found at offset 17825 in the compressed file 230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc.gz.
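The translation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used in the migration: the function names are invented here, the table is assumed to be whitespace-separated with the filename, uncompressed offset and compressed offset in the first three columns, and the fourth column is read past without being interpreted.

```python
from typing import Dict, Iterable, Optional, Tuple

# (uncompressed filename, uncompressed offset) -> offset in the compressed file
MigrationTable = Dict[Tuple[str, int], int]


def load_migration_table(lines: Iterable[str]) -> MigrationTable:
    """Build a lookup table from the lines of a deduplicationmigration record."""
    table: MigrationTable = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        filename, old_offset, new_offset = parts[0], int(parts[1]), int(parts[2])
        table[(filename, old_offset)] = new_offset
    return table


def translate(table: MigrationTable, filename: str,
              old_offset: int) -> Optional[Tuple[str, int]]:
    """Map an uncompressed (file, offset) pair to its compressed equivalent."""
    new_offset = table.get((filename, old_offset))
    if new_offset is None:
        return None  # no entry for this record in the table
    return filename + ".gz", new_offset


table = load_migration_table([
    "230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc 72896 17825 1432719936000",
])
print(translate(table, "230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc", 72896))
```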

Note that "deduplicationmigration" records can be found in any metadata file up until the time we completed the migration process.

Migrating Duplicate Annotations to Revisit Records

Ideally we would like to replace these duplicate annotations with revisit records. An example of such a record generated by Heritrix can be found at https://sbprojects.statsbiblioteket.dk/jira/browse/NARK-1103?focusedCommentId=65743&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-65743 .

Note that (according to https://github.com/iipc/openwayback/wiki/How-OpenWayback-handles-revisit-records-in-WARC-files) the crucial information in a revisit record is the WARC-Refers-To-Target-URI (usually the same URI as the one crawled at the revisit) and the WARC-Refers-To-Date, the precise original capture time. Once all the warcs have been indexed by URI and date, this allows playback software to immediately identify the record containing the revisited content.

The problem is that our duplicate annotations do not include the information needed to find WARC-Refers-To-Date, so one cannot generate revisit records from duplicate annotations alone. However, assuming we have already cdx-indexed all our warc data files, we can work around this by looking up the given URI in the CDX index, finding the capture with the given filename and offset, and reading the capture timestamp from that cdx record.
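The workaround can be sketched in Python as below. This is a hedged illustration rather than a worked-out design: CdxEntry, find_original_capture and revisit_headers are names invented for this example, the index is modelled as an in-memory list rather than a real CDX file or server, and the capture timestamp in the usage example is made up. It also assumes that the filename and offset from the annotation have already been put into the same form the index uses (for compressed files this would mean translating them via the deduplicationmigration table first).

```python
from datetime import datetime
from typing import Dict, Iterable, NamedTuple, Optional


class CdxEntry(NamedTuple):
    url: str
    timestamp: str  # 14-digit capture time, e.g. 20150527094536
    filename: str
    offset: int


def find_original_capture(entries: Iterable[CdxEntry], url: str,
                          filename: str, offset: int) -> Optional[CdxEntry]:
    """Find the capture of url stored at the given filename and offset."""
    for entry in entries:
        if (entry.url, entry.filename, entry.offset) == (url, filename, offset):
            return entry
    return None


def to_warc_date(cdx_timestamp: str) -> str:
    """Convert a 14-digit CDX timestamp to the ISO 8601 form used in WARC headers."""
    parsed = datetime.strptime(cdx_timestamp, "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


def revisit_headers(url: str, revisit_date: str, digest: str,
                    original: CdxEntry) -> Dict[str, str]:
    """Assemble the WARC headers that distinguish a revisit record."""
    return {
        "WARC-Type": "revisit",
        "WARC-Target-URI": url,
        "WARC-Date": revisit_date,
        "WARC-Payload-Digest": digest,
        # Profile URI defined by WARC 1.0 for identical-payload revisits.
        "WARC-Profile": "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest",
        "WARC-Refers-To-Target-URI": original.url,
        "WARC-Refers-To-Date": to_warc_date(original.timestamp),
    }


# Usage with a made-up index entry; the timestamp is illustrative only.
url = "http://eas8.emediate.eu/eas?cu=17096;cat=;cre=mu;js=y;target=_blank"
warc_file = "230823-32-20150527094534-00000-kb-prod-har-023.kb.dk.warc"
index = [CdxEntry(url, "20150527094536", warc_file, 72896)]

original = find_original_capture(index, url, warc_file, 72896)
if original is not None:
    headers = revisit_headers(url, "2015-06-02T21:45:39Z",
                              "sha1:PSJGKM4YV37XVNCRLG47VQG6GIP4SXBS", original)
    print(headers)
```

If no matching capture is found in the index, no revisit record can be generated for that annotation, and such cases would need to be logged and handled separately.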

Arc-to-revisit?

What about generating warc revisit records from duplicate annotations in arc metadata files? That should be possible in exactly the same way, since the revisit record only says when the data was originally harvested at the given URI with the same payload. In practice, however, we may need more experimentation with mixed arc/warc archives to see whether replay software can handle this case.