2013-03-26 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference March the 26th 2013, 13:00-14:00.

Practical information

  • Skype-conference
  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.


  • BNF: Sara
  • ONB: Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin, Mikis and Sabine
  • Any other issues to be discussed on today's tele-conference?

Iteration 54 (4.1 development release)

See plan here. The release is planned for the end of April.

Curator roadmap

The roadmap has now been trimmed, e.g. obsolete issues have been removed. The shared list hasn't been prioritized yet.


JHoNAS status (Nicholas)

  • JHOVE2 release
  • Final report

Mikis will ask Bjarne to give a brief summary of the JHoNAS results.




  • DK: Bjarne, Birgit, Tue?, Sabine (thu-fri)
  • BnF: Clément, Peter, Nicolas, Sara, Gildas
  • ONB: Michaela, Sven Schlarb (R&D, preservation working group)

Status of the production sites


We have installed Release 4.0 in our production environment, but that caused us some trouble, so we had to postpone our first broad crawl for 2013. We are currently on the first step of the broad crawl: domains up to a size of 10 MB.

We have captured at least 15,000 YouTube videos. Jons son had created a special tool to capture the URLs of the videos, but this tool no longer works. A new tool is on its way, and as soon as our documentation is complete, I'll translate it into English and put it on the NAS curator wiki.

We are doing some special “on demand harvests” for some of our researchers.

We participated in the IIPC papal election URL nomination.

Otherwise business as usual :)


  • Status of the BCWeb open sourcing?

The big news here for March is that we have started transferring our web archives into the BnF digital repository, SPAR, which will ensure the long-term preservation of our collections. We have started with the current crawls, but we will be progressively loading the retrospective collections simultaneously with the ongoing crawls, starting with the most recent collections (those harvested with NAS) and working our way back to the historical collections from 1996. It will take at least several months and possibly up to a few years to complete the transfer of all our collections.

The ingest into SPAR is closely linked to the functioning of NAS: in addition to the crawled data produced by Heritrix, SPAR will also preserve the metadata ARC files produced by NAS, which contain the configurations, reports and logs that describe the crawls. This allows SPAR to create coherent collections of data using three levels: the ARC, the crawl job (containing ARCs of both data and metadata) and the harvest definition (containing the jobs). The data model of SPAR is thus based on that of NAS, but will also be applied to earlier kinds of crawls (such as standalone Heritrix crawls performed by the BnF, broad crawls by Internet Archive and historical collections extracted by IA).

As well as ingesting all our existing collections, work will continue on SPAR to allow it to handle WARC files, as this is a necessary step before we can transfer our harvesting workflow to the production of WARCs.

  • We started our domain crawl with 1.4 million seeds in mid-February. Stage 1, with a budget of 10 MB per domain, has been completed. Before we start stage 2 we will get new hardware (crawler machines).
  • In 2013 we have a strong focus on politics. We created a new collection including websites from government, administration, political parties, blogs etc. Parliamentary elections will take place in September, and we also have a couple of regional elections.
  • We have a first case where we have to delete data from our web archive. Last year we imported legacy data from 1997/98 into the web archive. Legal deposit for online media has been in place since 2009. A user requested deletion of his data (from 1997). We are now developing a workflow for the deletion process.


Any other business?

  • BnF is planning to have Nicolas start on the open sourcing of the BCWeb platform in a couple of months.