archivefiles-report.txt missing GMT dates and closing date

Description

In 4.X versions, NAS compiled archivefiles-report.txt in two steps:
1) extracting the opening and closing GMT dates from heritrix.out
2) listing all W/ARC files in the job directory and adding any files missing from the report, together with the file sizes

In 5.1, archivefiles-report.txt has an opening date in local time (not GMT) and no closing date, which makes it inconsistent with the header structure: [ARCHIVEFILE] [Opened] [Closed] [Size]

WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:278c51d9-c8ef-4d77-bbe2-ef094318201d>
WARC-Date: 2016-07-25T11:11:01Z
Content-Length: 127
Content-Type: text/plain
WARC-Block-Digest: sha1:2KUD7DDELLB7MC2GAWPIYQZ5VOR2SNNB
WARC-IP-Address: 172.20.20.41
WARC-Target-URI: metadata://netarchivesuite.bnf.fr/crawl/reports/archivefiles-report.txt?heritrixVersion=3.3.0-LBS-2014-03&harvestid=2&jobid=20
WARC-Warcinfo-ID: <urn:uuid:0af59ceb-7321-4240-b1b8-b1ee869a60f8>

[ARCHIVEFILE] [Opened] [Closed] [Size]
BnF-20-2-20160725110956304-00000-menelas.bnf.fr.warc.gz 2016-07-25T13:10:34.000Z 239068
The closing date is missing, the content is shifted relative to the column headers, and the date is local time rather than GMT.
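Both problems can be checked against the sample row above. The following is a minimal Python sketch, not NAS code: it assumes the report fields are whitespace-separated and that the 17-digit token in the file name is a GMT timestamp in yyyyMMddHHmmssSSS form. The helper names are made up for illustration.

```python
import re
from datetime import datetime, timezone

HEADER = "[ARCHIVEFILE] [Opened] [Closed] [Size]"
ROW = ("BnF-20-2-20160725110956304-00000-menelas.bnf.fr.warc.gz "
       "2016-07-25T13:10:34.000Z 239068")

def missing_columns(header, row):
    """Number of header columns that have no value in the row."""
    return len(header.split()) - len(row.split())

def filename_timestamp(name):
    """Parse the 17-digit token of the file name as GMT (assumed layout)."""
    token = re.search(r"\d{17}", name).group(0)
    parsed = datetime.strptime(token, "%Y%m%d%H%M%S%f")
    return parsed.replace(tzinfo=timezone.utc)

def reported_timestamp(value):
    """Parse the date column, taking its trailing Z at face value."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)

name, opened, size = ROW.split()
offset = reported_timestamp(opened) - filename_timestamp(name)

print(missing_columns(HEADER, ROW))  # 1
print(offset)                        # 2:00:37.696000
```

The row is one column short of the header, and the reported date is about two hours later than the timestamp in the file name. That gap is what a UTC+2 local clock would produce, which is consistent with the date being local time despite its trailing Z.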

Activity

Sara Aubry, February 23, 2017 at 1:53 PM

To test this fix:
1) Run and complete a job.
2) Open the associated metadata file (e.g. BnF-22313-53-metadata-1.warc.gz).
3) Check that archivefiles-report.txt has 3 columns:

  • ARCHIVEFILE: W/ARC filename
  • LastModified: W/ARC file last modification date in GMT
  • Size: W/ARC size in bytes
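The check in step 3 can be scripted. The sketch below is an illustration under stated assumptions rather than part of NAS: it assumes the report text has already been extracted from the metadata file, that columns are whitespace-separated, and that the GMT date is written in ISO 8601 with a trailing Z (the exact date format is not specified in this issue). The function name `check_report` is hypothetical.

```python
from datetime import datetime
from typing import List

EXPECTED_HEADER = ["[ARCHIVEFILE]", "[LastModified]", "[Size]"]
ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")

def check_report(text: str) -> List[str]:
    """Return a list of problems found in an archivefiles-report.txt body."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["report is empty"]
    problems = []
    if lines[0].split() != EXPECTED_HEADER:
        problems.append(f"unexpected header: {lines[0]!r}")
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != 3:
            problems.append(f"expected 3 columns, got {len(fields)}: {ln!r}")
            continue
        name, modified, size = fields
        if not name.endswith(ARCHIVE_SUFFIXES):
            problems.append(f"not a W/ARC filename: {name!r}")
        if not modified.endswith("Z"):
            # Assumption: GMT dates carry a trailing Z.
            problems.append(f"date not marked as GMT: {modified!r}")
        else:
            try:
                datetime.fromisoformat(modified[:-1])
            except ValueError:
                problems.append(f"unparseable date: {modified!r}")
        if not size.isdigit():
            problems.append(f"size is not a byte count: {size!r}")
    return problems
```

An empty list means the report matches the expected layout. Note that this only checks the shape of the date; whether the value is really GMT still has to be compared against the file's actual modification time.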

Sara Aubry, December 7, 2016 at 11:26 AM

This issue has been fixed in the 5.2 release and can be closed.
archivefiles-report.txt now has three columns: [ARCHIVEFILE] [LastModified] [Size]
The date is in GMT.
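For readers who want to see what such a report looks like in practice, here is a minimal Python sketch that builds the three columns from a directory listing. It is not the NAS implementation: the function name `report_lines`, the suffix list, and the ISO 8601 date layout with millisecond precision are assumptions made for illustration.

```python
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")

def report_lines(job_dir):
    """Build report lines for every W/ARC file found directly in job_dir."""
    lines = ["[ARCHIVEFILE] [LastModified] [Size]"]
    for path in sorted(Path(job_dir).iterdir()):
        if not path.is_file() or not path.name.endswith(ARCHIVE_SUFFIXES):
            continue
        info = path.stat()
        # Convert the modification time to GMT explicitly, so the result
        # does not depend on the time zone of the machine running the job.
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        stamp = modified.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
        lines.append(f"{path.name} {stamp} {info.st_size}")
    return lines
```

Passing `tz=timezone.utc` when converting the timestamp is the key point: the trailing Z is then accurate regardless of the server's local time zone.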

Sara Aubry, October 11, 2016 at 2:39 PM

We'll fix this in the next PR.

Sr, September 27, 2016 at 4:16 PM

Then they should probably be GMT as well.

Sara Aubry, September 15, 2016 at 1:58 PM

We looked into this issue and did not find any way to get the opening dates from a Heritrix log file.
At BnF, we will use this new format with these headers: [ARCHIVEFILE] [LastModified] [Size].
We still need to decide whether we want local or GMT dates, given that the date within the filename is GMT.
Any opinions?

Fixed

Created July 27, 2016 at 1:52 PM
Updated March 9, 2017 at 7:43 AM
Resolved February 23, 2017 at 1:40 PM