WARC-Refers-To-Date in WARC revisit records does not match the original record's date

Description

While trying to display revisit records in OpenWayback, we noticed that the WARC-Refers-To-Date in WARC revisit records does not match the WARC-Date of the original record.

Here is a sample original record and the associated line in the crawl.log:
{{
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-16T16:14:21Z
WARC-IP-Address: 216.58.198.206
WARC-Payload-Digest: sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Record-ID: <urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663>
Content-Type: application/http; msgtype=response
Content-Length: 22951

HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Mon, 16 Jan 2017 16:13:10 GMT
Expires: Mon, 16 Jan 2017 18:13:10 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 71

2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951
}}

The WARC-Date is 2017-01-16T16:14:21Z, which comes from the 9th field of the crawl.log line (fetch start time plus duration): 20170116161421526+52
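The relationship between the 9th crawl.log field and the WARC-Date can be sketched as follows. This is only an illustration in Python, not Heritrix code: the function name `fetch_start_to_warc_date` is hypothetical, and it assumes crawl.log fields are separated by whitespace and that the 9th field has the form shown in the sample line (17-digit start time, `+`, duration in milliseconds).

```python
from datetime import datetime, timezone

def fetch_start_to_warc_date(crawl_log_line: str) -> str:
    """Turn the 9th crawl.log field (fetch start + duration) into a WARC-Date string."""
    fields = crawl_log_line.split()
    # 9th field, e.g. "20170116161421526+52": start time, '+', duration in ms
    start, _, _duration = fields[8].partition("+")
    # The first 14 digits are YYYYMMDDHHmmss; the milliseconds are dropped
    moment = datetime.strptime(start[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")

# crawl.log line of the original record, copied from this issue
ORIGINAL_LINE = (
    "2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)

print(fetch_start_to_warc_date(ORIGINAL_LINE))  # 2017-01-16T16:14:21Z
```

The printed value is the WARC-Date of the response record above, whereas the first column of the same line (2017-01-16T16:14:26.982Z) is about five seconds later.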

Here is an associated revisit record and its line in the crawl.log:
{{

WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-17T14:02:30Z
WARC-IP-Address: 216.58.198.206
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Refers-To-Date: 2017-01-16T16:14:26Z
WARC-Refers-To-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Record-ID: <urn:uuid:483dbdfa-123e-45f6-9f8f-5be02c3789f7>
Content-Type: application/http; msgtype=response
Content-Length: 292

HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Tue, 17 Jan 2017 13:55:26 GMT
Expires: Tue, 17 Jan 2017 15:55:26 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 424

2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t
}}

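WARC-Refers-To-Date is 2017-01-16T16:14:26Z, corresponding to 20170116161426982 in the duplicate annotation of the crawl.log line. This date is wrong: it matches the 1st column of the original record's crawl.log line (2017-01-16T16:14:26.982Z), which is the time the line was written to the crawl.log, not the WARC-Date of the original record.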

The two dates differ by only five seconds:
2017-01-16T16:14:21Z (WARC-Date of the original record)
2017-01-16T16:14:26Z (WARC-Refers-To-Date of the revisit record)
but this is enough to prevent OpenWayback from finding the original payload.
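To illustrate why a five-second gap matters, here is a deliberately simplified model in Python. It assumes the original capture is looked up by an exact (URI, 14-digit timestamp) key; the `index` dictionary and the `resolve_revisit` helper are hypothetical and are not OpenWayback's actual implementation.

```python
from datetime import datetime

WARC_FMT = "%Y-%m-%dT%H:%M:%SZ"

def to_key(warc_date: str) -> str:
    """Convert a WARC date such as 2017-01-16T16:14:21Z to a 14-digit timestamp."""
    return datetime.strptime(warc_date, WARC_FMT).strftime("%Y%m%d%H%M%S")

URI = "http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg"
ORIGINAL_WARC_DATE = "2017-01-16T16:14:21Z"  # WARC-Date of the response record
REFERS_TO_DATE = "2017-01-16T16:14:26Z"      # WARC-Refers-To-Date of the revisit

# Hypothetical index: (URI, timestamp) -> record ID of the capture holding the payload
index = {
    (URI, to_key(ORIGINAL_WARC_DATE)): "urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663",
}

def resolve_revisit(uri: str, refers_to_date: str):
    """Return the original record ID, or None when no capture has that exact date."""
    return index.get((uri, to_key(refers_to_date)))

gap = datetime.strptime(REFERS_TO_DATE, WARC_FMT) - datetime.strptime(ORIGINAL_WARC_DATE, WARC_FMT)
print(gap.total_seconds())                       # 5.0
print(resolve_revisit(URI, REFERS_TO_DATE))      # None: the date points at no capture
print(resolve_revisit(URI, ORIGINAL_WARC_DATE))  # urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663
```

Under that assumption, any difference at second precision is enough for the lookup to miss, however small the gap is.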

Activity

Sara Aubry, February 23, 2017 at 2:24 PM

To test this fix:
1) Run a job to completion that captures at least one large image or PDF (it should be recorded as a WARC response record).
2) Run a second job on the same Harvest (the image or PDF should now be recorded as a WARC revisit record).
3) Check that the WARC-Refers-To-Date of the revisit record matches the WARC-Date of the original record.
4) Compare the crawl.log files of the two jobs: the date in the duplicate annotation of the second job should be the same as the date in the 9th field of the first job's line for that URI. The first 14 digits should be identical, as these dates have the following format: YYYYMMDDHHmmss
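The comparison in step 4 could be scripted along these lines. This is a minimal sketch in Python, not part of the fix: the helper names are hypothetical, and the layout of the duplicate annotation (file name, offset, date, separated by commas inside double quotes) is inferred from the sample line in this issue.

```python
import re

DUPLICATE_RE = re.compile(r'duplicate:"([^"]*)"')

def fetch_start(crawl_log_line: str) -> str:
    """Digits before '+' in the 9th whitespace-separated crawl.log field."""
    return crawl_log_line.split()[8].partition("+")[0]

def duplicate_date(crawl_log_line: str):
    """Last comma-separated value of the duplicate annotation, or None if absent."""
    match = DUPLICATE_RE.search(crawl_log_line)
    if match is None:
        return None
    return match.group(1).rsplit(",", 2)[-1]

def dates_match(original_line: str, revisit_line: str) -> bool:
    """True when both dates share the same first 14 digits (YYYYMMDDHHmmss)."""
    dup = duplicate_date(revisit_line)
    return dup is not None and dup[:14] == fetch_start(original_line)[:14]

# The two crawl.log lines quoted in this issue (written before the fix)
ORIGINAL = (
    "2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)
REVISIT = (
    "2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E "
    "http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 "
    "sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com "
    'duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,'
    '254013214,20170116161426982",content-size:22952,3t'
)

print(duplicate_date(REVISIT))         # 20170116161426982
print(dates_match(ORIGINAL, REVISIT))  # False for these pre-fix lines
```

On crawl.log lines produced after the fix, `dates_match` would be expected to return True for the same pair of captures.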

Fixed

Details

Organization

BNF

Created February 2, 2017 at 10:52 AM
Updated November 7, 2023 at 10:20 AM
Resolved March 3, 2017 at 7:38 AM