archivefiles-report.txt missing GMT dates and closing date
Description
In 4.X versions, archivefiles-report.txt was compiled by NAS in two steps:
1) extract the opening and closing GMT dates from heritrix.out
2) list all W/ARC files in the job directory, then add any missing files and the file sizes to the report
In 5.1, archivefiles-report.txt has an opening date in local time (not GMT) and no closing date, which makes it inconsistent with the header structure: [ARCHIVEFILE] [Opened] [Closed] [Size]
[ARCHIVEFILE] [Opened] [Closed] [Size]
BnF-20-2-20160725110956304-00000-menelas.bnf.fr.warc.gz 2016-07-25T13:10:34.000Z 239068
The closing date is missing, the content is shifted relative to the column headers, and the date is local time, not GMT.
Activity
Sara Aubry - February 23, 2017 at 1:53 PM
To test this fix:
1) Run and complete a job.
2) Open the associated metadata file (e.g. BnF-22313-53-metadata-1.warc.gz).
3) Check that archivefiles-report.txt has 3 columns:
ARCHIVEFILE: W/ARC filename
LastModified: W/ARC file last modification date in GMT
Size: W/ARC size in bytes
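The check in step 3 could be scripted once the report text has been extracted from the metadata file. What follows is only a sketch under assumptions: `check_report` is a hypothetical helper, the header is assumed to be written with brackets as in the other comments, and the date is assumed to keep the yyyy-MM-ddTHH:mm:ss.SSSZ layout of the 5.1 sample (the exact 5.2 layout is not shown in this issue). It cannot tell whether a date is really GMT; it only checks the shape of each line.

```python
from datetime import datetime

# Assumed header of the fixed report (three bracketed column names).
EXPECTED_HEADER = ["[ARCHIVEFILE]", "[LastModified]", "[Size]"]
# W/ARC filename endings accepted by this sketch.
ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def check_report(text: str) -> list[str]:
    """Return a list of problems; an empty list means the report looks well-formed."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or lines[0].split() != EXPECTED_HEADER:
        problems.append("unexpected header")
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split()
        if len(fields) != 3:
            problems.append(f"line {number}: expected 3 fields, got {len(fields)}")
            continue
        name, modified, size = fields
        if not name.endswith(ARCHIVE_SUFFIXES):
            problems.append(f"line {number}: not a W/ARC filename: {name}")
        try:
            # Assumed date layout, e.g. 2016-07-25T11:10:34.000Z
            datetime.strptime(modified, "%Y-%m-%dT%H:%M:%S.%fZ")
        except ValueError:
            problems.append(f"line {number}: unparseable date: {modified}")
        if not size.isdigit():
            problems.append(f"line {number}: size is not a byte count: {size}")
    return problems
```

An empty result from `check_report(report_text)` means every row has exactly three fields in the expected shape.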
Sara Aubry - December 7, 2016 at 11:26 AM
This issue has been fixed in the 5.2 release and can be closed. archivefiles-report.txt now has three columns: [ARCHIVEFILE] [LastModified] [Size]. The date is in GMT.
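For illustration only, a three-column report of this kind can be derived from the files' last-modification times. This is a Python sketch, not the NAS code (which is Java): `build_report` is a hypothetical name, the job directory layout is assumed to be flat, and the date layout is assumed to match the 5.1 sample.

```python
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE_SUFFIXES = (".warc", ".warc.gz", ".arc", ".arc.gz")


def build_report(job_dir: str) -> str:
    """List the W/ARC files in job_dir with their last-modified date (GMT) and size."""
    rows = ["[ARCHIVEFILE] [LastModified] [Size]"]
    for path in sorted(Path(job_dir).iterdir()):
        if not path.name.endswith(ARCHIVE_SUFFIXES):
            continue
        info = path.stat()
        # Convert the modification time explicitly to UTC so the trailing "Z" is true.
        modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
        millis = modified.microsecond // 1000
        stamp = modified.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"
        rows.append(f"{path.name} {stamp} {info.st_size}")
    return "\n".join(rows) + "\n"
```

Passing `tz=timezone.utc` is the relevant detail: formatting a local-time value and appending "Z" would reproduce the 5.1 problem.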
Sara Aubry - October 11, 2016 at 2:39 PM
We'll fix this in the next PR.
Sr - September 27, 2016 at 4:16 PM
Then they should probably be GMT as well.
Sara Aubry - September 15, 2016 at 1:58 PM
We looked into this issue and did not find any way to get the opening dates from a Heritrix log file. At BnF, we will use this new format with these headers: [ARCHIVEFILE] [LastModified] [Size]. We still need to decide whether we want local or GMT dates, given that the date within the filename is GMT. Any opinion?
Full metadata record for the 5.1 example in the description:
WARC/1.0
WARC-Type: resource
WARC-Record-ID: <urn:uuid:278c51d9-c8ef-4d77-bbe2-ef094318201d>
WARC-Date: 2016-07-25T11:11:01Z
Content-Length: 127
Content-Type: text/plain
WARC-Block-Digest: sha1:2KUD7DDELLB7MC2GAWPIYQZ5VOR2SNNB
WARC-IP-Address: 172.20.20.41
WARC-Target-URI: metadata://netarchivesuite.bnf.fr/crawl/reports/archivefiles-report.txt?heritrixVersion=3.3.0-LBS-2014-03&harvestid=2&jobid=20
WARC-Warcinfo-ID: <urn:uuid:0af59ceb-7321-4240-b1b8-b1ee869a60f8>
[ARCHIVEFILE] [Opened] [Closed] [Size]
BnF-20-2-20160725110956304-00000-menelas.bnf.fr.warc.gz 2016-07-25T13:10:34.000Z 239068
The closing date is missing, the content is shifted relative to the column headers, and the date is local time, not GMT.
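Both problems in the row above can be checked mechanically. The sketch below uses hypothetical helper names and assumes the file is opened shortly after the GMT timestamp in its name is generated, which seems plausible but is not stated in this issue. It counts the missing fields and compares the reported date with the filename timestamp.

```python
import re
from datetime import datetime, timezone

HEADER = "[ARCHIVEFILE] [Opened] [Closed] [Size]"
ROW = ("BnF-20-2-20160725110956304-00000-menelas.bnf.fr.warc.gz "
       "2016-07-25T13:10:34.000Z 239068")


def missing_fields(header: str, row: str) -> int:
    """Number of header columns that have no matching field in the row."""
    return len(header.split()) - len(row.split())


def filename_time(filename: str) -> datetime:
    """Parse the 17-digit yyyyMMddHHmmssSSS stamp embedded in the filename (GMT)."""
    match = re.search(r"-(\d{14})(\d{3})-", filename)
    if match is None:
        raise ValueError(f"no timestamp found in {filename!r}")
    base = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return base.replace(microsecond=int(match.group(2)) * 1000, tzinfo=timezone.utc)


def apparent_offset_hours(row: str) -> int:
    """Whole hours between the reported date (read as GMT) and the filename stamp."""
    filename, reported = row.split()[:2]
    reported_time = datetime.strptime(
        reported, "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    delta = reported_time - filename_time(filename)
    return round(delta.total_seconds() / 3600)


print(missing_fields(HEADER, ROW))   # 1: there is no value for [Closed]
print(apparent_offset_hours(ROW))    # 2: the "Z" date runs two hours ahead of the filename stamp
```

A two-hour gap is what Paris summer time (UTC+2) would produce on 25 July 2016, which is consistent with a local date being written with a "Z" suffix.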