2014-04-15 Statusmeeting

Agenda for the joint BnF, ONB, SB and KB NetarchiveSuite tele-conference, April 15th 2014, 13:00-14:00.

Practical information

  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.


  • BnF: Sara and Lam
  • ONB: Michaela
  • KB: Tue
  • SB: Mikis and Sabine
  • Any other issues to be discussed on today's tele-conference?

Welcome to Former user (Deleted) & Thorbjørn Ravn Andersen


  • Planning of the next iterations 

Planning for GA in Paris (19th to the 23rd May)

Who will be attending the GA?

  • KB: Birgit, Tue, Mads (22-23), Eld (19-21).
  • SB: Sabine, Ditte (19-21), Colin, Mikis
  • ONB: Andreas, Michaela
  • BnF: Sara, Lam, Bert, Sébastien, Clément, Annick, Géraldine, Peter, Sophie

Internal NetarchiveSuite discussion, see 2014-05-23 IIPC GA in Paris.

NetarchiveSuite workshop 2014

When and where

Status of the production sites


  • Event harvests: In May we finished one of our largest event harvests, the ESC 2014 event harvest. For the first time, researchers participated almost from the beginning. We executed two more event harvests: one on the European elections in May and one on the use of illegal methods for journalistic research by the Danish tabloid magazine “Se og Hør” in May/June.
  • Documentation: We are still planning the migration of our documentation from the old-fashioned MoinMoin Wiki to a system that meets the requirements of both curators and users/researchers. We have nearly finished our requirements specification, and we are testing the extended fields in NAS for usability on part of the documentation.
  • Access: We are working on Citrix-login-based access for our users. So far it is open to employees only.
  • Technical issues: We successfully upgraded our test environment to NAS 4.4, and we plan to upgrade our production environment in August.


This month we thought we'd give you an overview of all the project crawls we are running this year, as several of them have taken place during the past month.


We have several crawls relating to events and anniversaries in 2014:

- The centenary of the First World War - this is a project that began last November and will continue until 2018 with three or four crawls per year.

- The 250th anniversary of the death of Jean-Philippe Rameau (covered in our last monthly update).

- Local and European elections - the French local elections took place last month and we are preparing the crawls in the lead-up to the European elections in May.

- Winter Olympic and Paralympic Games - as part of the IIPC project.


There are also project crawls on specific themes or types of document (these are all continued from previous years):

- News and subscription news sites - crawled every day.

- Online personal and literary journals - the first crawl took place in March, the second will be in August.

- Solidarity and social movements - planned for May and June.

- Travel journals - planned for June.

- Auction catalogues - planned for July.

- French and American official publications - two separate crawls both planned for July.

- Dailymotion videos - planned for August.


In addition, we maintain our "ongoing crawls", i.e. all the sites selected by BnF departments according to their collection policies. These are collected at different frequencies: once a year, twice a year, monthly or weekly.


Since our storage budget is the same in 2014 as in 2013, the number of project crawls and the increase in the number of domains in our broad crawl mean we are trying to optimise our ongoing crawls. We are working with the librarians who select sites to limit the number of sites that are included in multiple crawls, and to make sure that the sites collected more frequently than once a year change often enough to justify this. We've also removed the largest budget from the twice-yearly crawl, and we've changed the way Heritrix handles queues for sites with a "domain" depth. Previously we had queues per host, so the budget allocated was multiplied by the number of hosts. We now have a single queue, and therefore a single budget, for each domain. This doesn't seem to have had an impact on the speed of crawls.
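
The effect of the queue change on the budget can be illustrated with a small sketch. This is not Heritrix configuration or API: the function names, the sample URLs and the budget figure are all hypothetical, and the domain key is a naive "last two labels" rule chosen only for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_key(url):
    # One queue per host (the previous behaviour described above).
    return urlsplit(url).hostname


def domain_key(url):
    # One queue per domain. Naive assumption: the domain is the last two
    # labels of the hostname; a real crawler would need a public-suffix list.
    return ".".join(urlsplit(url).hostname.split(".")[-2:])


def total_budget(urls, queue_key, budget_per_queue):
    # Group URLs into queues; each queue receives its own budget.
    queues = defaultdict(list)
    for url in urls:
        queues[queue_key(url)].append(url)
    return len(queues) * budget_per_queue


# Hypothetical seed URLs: three hosts under one domain.
urls = [
    "http://www.example.fr/a",
    "http://blog.example.fr/b",
    "http://images.example.fr/c.png",
]
BUDGET = 1000  # hypothetical per-queue budget

per_host = total_budget(urls, host_key, BUDGET)
per_domain = total_budget(urls, domain_key, BUDGET)
print(per_host, per_domain)
```

With three hosts under one domain, the per-host grouping yields three queues and so three times the budget, while the per-domain grouping yields a single queue and a single budget.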


  • Olympics and Paralympics crawl finished
  • Preparing for EU elections (starting in May) and WWI crawl (starting in June)

Next meeting


Any other business?