WARC-Refers-To-Date in WARC revisit records does not match the original record date
Description
Activity
Sara Aubry, February 23, 2017 at 2:24 PM
To test this fix:
1) Run and complete a job that contains at least one large image or PDF (this image/PDF should be recorded as a WARC response record).
2) Run a second job on the same Harvest (the image/PDF should be recorded as a WARC revisit record).
3) Check that the WARC-Refers-To-Date of the revisit record matches the WARC-Date of the original record.
4) Compare the crawl.log of the two jobs; the same date should appear in both.
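Step 3 can be checked by hand, but a small script may help when many records are involved. The following is only a sketch: the function names `parse_warc_headers` and `refers_to_date_matches` are hypothetical, and the record headers are abbreviated copies of the ones reported in this issue rather than output from a real WARC reader.

```python
def parse_warc_headers(record_text: str) -> dict:
    """Collect the WARC-* named fields of one record into a dict."""
    headers = {}
    for raw in record_text.splitlines():
        name, sep, value = raw.strip().partition(":")
        # Skip the "WARC/1.0" version line and any non-WARC header.
        if sep and name.startswith("WARC-"):
            headers[name] = value.strip()
    return headers


def refers_to_date_matches(original_text: str, revisit_text: str) -> bool:
    """True when the revisit's WARC-Refers-To-Date equals the original's WARC-Date."""
    original_date = parse_warc_headers(original_text).get("WARC-Date")
    refers_to = parse_warc_headers(revisit_text).get("WARC-Refers-To-Date")
    return original_date is not None and original_date == refers_to


# Abbreviated headers taken from the records quoted in this issue.
original = """WARC/1.0
WARC-Type: response
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-16T16:14:21Z
"""
revisit_before_fix = """WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-17T14:02:30Z
WARC-Refers-To-Date: 2017-01-16T16:14:26Z
"""

print(refers_to_date_matches(original, revisit_before_fix))
```

With the headers from this issue the check fails, which is the reported bug; after the fix the same check should succeed on freshly written revisit records.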
While trying to display revisit records in OpenWayback, we noticed that WARC-Refers-To-Date in WARC revisit records does not carry the date of the original record.
Here is a sample original record and the associated line in the crawl.log:
{{
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-16T16:14:21Z
WARC-IP-Address: 216.58.198.206
WARC-Payload-Digest: sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Record-ID: <urn:uuid:f6ed4965-92a2-483e-977c-3794a96af663>
Content-Type: application/http; msgtype=response
Content-Length: 22951
HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Mon, 16 Jan 2017 16:13:10 GMT
Expires: Mon, 16 Jan 2017 18:13:10 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 71
2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com content-size:22951
}}
WARC-Date is 2017-01-16T16:14:21Z, which comes from the 9th field of the crawl.log line: 20170116161421526+52
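The relationship between the 9th crawl.log field and WARC-Date can be illustrated with a short sketch. It assumes the field is a 17-digit fetch start time followed by `+` and the duration, and that WARC-Date keeps second precision by dropping the milliseconds, as the sample above suggests. The function name `fetch_time_to_warc_date` is hypothetical.

```python
from datetime import datetime, timezone


def fetch_time_to_warc_date(crawl_log_line: str) -> str:
    """Convert the 9th crawl.log field (fetch start + duration) to WARC-Date form."""
    fields = crawl_log_line.split()
    fetch_field = fields[8]                   # e.g. "20170116161421526+52"
    start = fetch_field.split("+")[0]         # drop the "+duration" suffix
    # Keep yyyyMMddHHmmss only; the trailing milliseconds are truncated.
    parsed = datetime.strptime(start[:14], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


# crawl.log line of the original record, as quoted in this issue.
line = (
    "2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)

print(fetch_time_to_warc_date(line))  # prints 2017-01-16T16:14:21Z
```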
Here is an associated revisit record and its line in the crawl.log:
{{
WARC/1.0
WARC-Type: revisit
WARC-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Date: 2017-01-17T14:02:30Z
WARC-IP-Address: 216.58.198.206
WARC-Profile: http://netpreserve.org/warc/1.0/revisit/identical-payload-digest
WARC-Truncated: length
WARC-Payload-Digest: UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I
WARC-Refers-To-Date: 2017-01-16T16:14:26Z
WARC-Refers-To-Target-URI: http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg
WARC-Record-ID: <urn:uuid:483dbdfa-123e-45f6-9f8f-5be02c3789f7>
Content-Type: application/http; msgtype=response
Content-Length: 292
HTTP/1.0 200 OK
Content-Type: image/jpeg
Date: Tue, 17 Jan 2017 13:55:26 GMT
Expires: Tue, 17 Jan 2017 15:55:26 GMT
ETag: "1484105439"
X-Content-Type-Options: nosniff
Server: sffe
Content-Length: 22660
X-XSS-Protection: 1; mode=block
Cache-Control: public, max-age=7200
Age: 424
2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,254013214,20170116161426982",content-size:22952,3t
}}
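WARC-Refers-To-Date is 2017-01-16T16:14:26Z, corresponding to 20170116161426982 in the duplicate annotation in the crawl.log. This date is wrong: it matches the 1st column of the original record's crawl.log line (2017-01-16T16:14:26.982Z), which is the time the line was written to the crawl.log, not the fetch time in the 9th field.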
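The mismatch can be confirmed directly from the two crawl.log lines. The sketch below assumes the timestamp is the last comma-separated value of the `duplicate:"..."` annotation, as in the sample; the helper names `duplicate_timestamp` and `digits_only` are hypothetical, and the lines are shortened copies of those quoted above.

```python
import re


def duplicate_timestamp(revisit_line: str) -> str:
    """Return the timestamp stored last in the duplicate:"..." annotation."""
    match = re.search(r'duplicate:"([^"]*)"', revisit_line)
    if match is None:
        raise ValueError("no duplicate annotation found")
    return match.group(1).rsplit(",", 1)[1]


def digits_only(timestamp: str) -> str:
    """Strip separators so 2017-01-16T16:14:26.982Z becomes 20170116161426982."""
    return re.sub(r"\D", "", timestamp)


original_line = (
    "2017-01-16T16:14:26.982Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg LXE "
    "http://rue89.nouvelobs.com/2017/01/15/rue89.com image/jpeg #161 "
    "20170116161421526+52 sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I "
    "http://rue89.nouvelobs.com content-size:22951"
)
revisit_line = (
    "2017-01-17T14:02:38.705Z 200 22660 http://i3.ytimg.com/vi/siyBp8Csugk/0.jpg E "
    "http://rue89.nouvelobs.com/ image/jpeg #037 20170117140230363+47 "
    "sha1:UQ2OU27CBE76DZ6YAZLNQVWHTGOQVO2I http://rue89.nouvelobs.com "
    'duplicate:"BnF-22218-28-20170116161005144-00002-ciblee_2016_gulliver228.bnf.fr.warc.gz,'
    '254013214,20170116161426982",content-size:22952,3t'
)

fields = original_line.split()
log_written = digits_only(fields[0])       # 1st column: when the line was written
fetch_start = fields[8].split("+")[0]      # 9th field: when the fetch started
recorded = duplicate_timestamp(revisit_line)

print("matches 1st column:", recorded == log_written)
print("matches 9th field: ", recorded == fetch_start)
```

For the lines in this issue the annotation matches the 1st column and not the 9th field, which is consistent with the diagnosis above.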
The two dates differ by only five seconds:
2017-01-16T16:14:21Z
2017-01-16T16:14:26Z
but the mismatch prevents OpenWayback from finding the original payload.