NAS WARC workshop

Information on the BnF-Netarkivet.dk workshop at KB, held to define the WARC implementation work in NAS.
  • Place: KB
  • Time:  April 2 09:15 to 13:00??.
  • Participants:
    • BnF: Clément, Sara and Sophie
    • KB: Nicholas & Søren
    • SB: Mikis 

Agenda

  • (1 hour) Recap on JHove2 module status.
    • Status for merge to HEAD of Nicholas's code.

      • Martha is aware of the problems with merging third-party code to HEAD. As Jhove2 is a high priority for IIPC, we hope this will be addressed before or during the GA in Washington.
    • Status for JHove2 milestone, including demo.

      • A proposed criterion for validating the prototype release is that only the output of the Jhove2 modules should be used (the code itself will be tested as part of the road to the final release).
      • Clément will mail Aaron regarding payment for the initial technical specification.
      • BnF will test the JHove2 release in May, so we can get the first milestone validated.
      • As Nicholas has removed Jhove2's ability to run in parallel, the performance aspects of Nicholas's code need to be tested and perhaps discussed at the GA.
      • Nicholas and Clément should have a technical discussion regarding Nicholas's code during the GA. Topics would include the code merge to HEAD and parallelization. Perhaps Monday at 17:00.
      • KB will be using the Jhove2 WARC module for digital document characterization as part of preservation.
      • As WARC is currently used more for non-web archiving, Clément is very interested in input on extensions to the WARC ISO standard.
      • We talked a bit about the possibility of a PDF module which will be needed by BnF. Perhaps a job for Nicholas?
      • Nicholas will look at what it would mean to propose it as a default extraction.
      • BnF will look at specific WARC extensions.
      • We should plan a NAS workshop in late August/early September. The Jhove2 testing should be finished by then and the NAS WARC functionality should be nearing completion. Nicholas's current contract ends in mid-September, so the workshop should not be any later (unless the contract is extended).
      • WARC module validation:
        1. BnF will send sample WARC files, from which Nicholas can generate Jhove2 output for inspection by BnF. The WARCs should be analyzed with both 'File' and 'Droid'.
        2. Nicholas will then mail the basic code release to Aaron for testing (by Tomas (BnF) and also by Steve?), so we can be prepared for input to the final release.
        3. BnF will merge Nicholas's code and test it, including a performance test (parallelism).
        4. Any PWG feedback.
  • (10 minutes) Discuss the JhoNAS presentation at the GA: project update (10-15 minute presentation on Tuesday) + half-day presentation at the PWG

    1. Short JhoNAS presentation
      1. General presentation of the project (WARC, Jhove2, NAS), why, who.
      2. Summary of current status.
      3. Refer people to the detailed sessions.
    2. PWG workshop
      • Nicholas will prepare an agenda in cooperation with Clément. The agenda should be sent to IIPC as soon as possible so it can be posted to the GA web.
      1. More detailed breakdown of what is extracted by the Jhove2 WARC module.
        • More detailed walkthrough of the metadata model which will be used in NetarchiveSuite (including ARC-WARC mapping) and of how metadata is handled in general in NAS.
        • Demo of the module.
        • A priority here is to expose the value of this project for 3rd parties and listen to ideas for additional features.
      2. Nicholas will prepare a presentation in cooperation with Clément.
  • (30 minutes) Discussion about NetarchiveSuite workshop at IIPC GA.

    • We should consider breaking the last part into a discussion track and a demo/hands-on track. Annick and Nicholas might handle the demo/hands-on part.
    • We should consider sending a mail to participants with an update on the agenda, asking about their expectations for the workshop and asking them to confirm their participation. Sara will request a list of participants from Abbie.
  • (1 hour) Review of the currently defined tasks: NAS-1720@jira.

    • Comments added to issues.
    • BnF would like to be able to define custom identifiers for WARC.
  • (Afternoon) Leveraging the WARC format's possibilities for adding metadata.
    • Define initial metadata model.
  • Mapping of NetarchiveSuite metadata to WARC warcinfo, metadata and named fields.

    • Initial mapping defined. Can be found at the bottom of the page (attachment).
    • Clément (and Sophie and Sara) will write up the proposed WARC format specification and send it to the participants.
    • Nicholas will create a specification wiki page based on this. Participants are recommended to subscribe to changes to this page.
    • The additional NAS functionality needed to support the extended format (harvest info metadata, configurable file name format, etc.) will be defined by Nicholas (assisted by Søren and Mikis).
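
As an illustration of what mapping harvest metadata to warcinfo named fields could look like, here is a minimal Python sketch. It is not the mapping agreed at the workshop (that is in the attachment): the function name build_warcinfo_record, the example file name and the harvestInfo.* field names are hypothetical. Only the general WARC 1.0 record layout and the standard warcinfo fields software and format are taken from the WARC standard.

```python
import uuid
from datetime import datetime, timezone


def build_warcinfo_record(fields, filename, record_date=None):
    """Serialize one WARC/1.0 warcinfo record with an application/warc-fields block.

    record_date should be a timezone-aware datetime; it is written in UTC.
    """
    when = (record_date or datetime.now(timezone.utc)).astimezone(timezone.utc)

    # The block is a list of "name: value" lines, each terminated by CRLF.
    block = "".join(f"{name}: {value}\r\n" for name, value in fields.items())
    block_bytes = block.encode("utf-8")

    header_lines = [
        "WARC/1.0",
        "WARC-Type: warcinfo",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Filename: {filename}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/warc-fields",
        f"Content-Length: {len(block_bytes)}",
    ]
    # An empty line separates the header from the block.
    header_bytes = ("\r\n".join(header_lines) + "\r\n\r\n").encode("utf-8")

    # Every WARC record ends with two CRLFs after the block.
    return header_bytes + block_bytes + b"\r\n\r\n"


example_fields = {
    "software": "NetarchiveSuite",       # standard warcinfo field
    "format": "WARC File Format 1.0",    # standard warcinfo field
    "harvestInfo.jobId": "42",           # hypothetical custom field
    "harvestInfo.harvestName": "demo",   # hypothetical custom field
}

record = build_warcinfo_record(
    example_fields,
    "example-job-42.warc",               # hypothetical file name
    datetime(2000, 1, 1, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

Custom identifiers of the kind BnF asked for could be passed in the same dictionary; whether they belong in the warcinfo record or in a separate metadata record is left to the format specification.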

WARC in NAS format draft