2013-06-18 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference May the 14th 2013, 13:00-14:00.

Practical information

Skype-conference

  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • BridgeIT: The BridgeIT conference will be available about 5 minutes before the start of the meeting. The BridgeIT URL is konf01.statsbiblioteket.dk. The BridgeIT password is sbview.


  • BNF: Sara
  • ONB: Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin, Mikis and Sabine
  • Any other issues to be discussed on today's tele-conference?


Curator roadmap



Summary (Sara and Tue).

Status of the production sites



  • We are more than halfway through our second broad crawl for 2013 and have so far harvested about 15,000 GB. Unfortunately there is an unsolved bug (https://kb-dk.atlassian.net/browse/NAS-2198): we can't create WARC files larger than 100 MB.
  • We have started harvesting the mobile version of Facebook, which lets us harvest all comments. All the Facebook profiles to be harvested are encoded into the harvest definition.
  • We are preparing a corpus from the archive for teaching purposes: under a new interpretation of our personal data protection law, we will give access to part of our archived websites (the event harvest of the 2011 parliamentary elections) via Wayback and full-text search (Solr).
  • We are running parallel tests on Wayback 1.7 and 1.8 while we wait for BNF's solution for Wayback support of https in proxy mode.
  • We have harvested more YouTube videos on the following topics: the Grand Prix Eurovision de la Chanson in a historical perspective, television and commercials, Bruce Springsteen in Denmark, and Danish jazz.
  • We are still working on a general solution for harvesting content behind paywalls on news sites.



A quick summary of the different selective crawls we are doing this year.

We distinguish between "ongoing crawls", in which librarians in the different departments of the BnF select seeds based on the collection policy of their department, and "project crawls", which are collaborations between two or more departments, sometimes with external partners, based around a particular theme or event.

For ongoing crawls there is a choice of four depths, four frequencies and three budgets. The use of budgets (small, medium or large) allows us to plan and monitor the crawls more efficiently. In terms of harvest definitions in NAS, we create one harvest definition per budget for the twice-yearly and annual crawls, while weekly and monthly crawls are only given a "small" budget. Project crawls can have a different range of technical settings for specific reasons.

The harvest definitions for "ongoing crawls" are as follows:

- weekly - launched every Monday at noon

- monthly - launched the first of each month

- twice-yearly (small, medium and large budgets) - the first crawl took place in February/March, and the second will be launched in August

- annual (small, medium and large budgets) - the yearly crawl has been launched this week.
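
As an illustration only, the frequency and budget combinations described above can be enumerated as follows. This is a minimal sketch: the naming scheme (`frequency_budget`) and the function name are hypothetical and are not taken from the BnF's actual NAS configuration.

```python
# Budgets available for ongoing crawls, as listed in the notes above.
BUDGETS = ("small", "medium", "large")

# Which budgets each frequency receives: weekly and monthly crawls only
# get a "small" budget; twice-yearly and annual crawls get all three.
FREQUENCY_BUDGETS = {
    "weekly": ("small",),
    "monthly": ("small",),
    "twice-yearly": BUDGETS,
    "annual": BUDGETS,
}


def ongoing_harvest_definitions():
    """Return one (hypothetical) harvest definition name per frequency/budget pair."""
    return [
        f"{frequency}_{budget}"
        for frequency, budgets in FREQUENCY_BUDGETS.items()
        for budget in budgets
    ]


definitions = ongoing_harvest_definitions()
for name in definitions:
    print(name)
```

Under these assumptions the scheme yields eight harvest definitions for ongoing crawls: one each for weekly and monthly, and three each for twice-yearly and annual.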


The list of "project crawls" for 2013 is as follows:

- news sites - around 100 sites crawled every day, at a depth of homepage plus 1 click.

- subscription news sites - we are progressively adding titles to our crawl to collect subscription editions of news sites (5 at the moment).

- online journals - twice a year, personal and literary blogs. The first crawl was completed in March and the second will be held in August.

- videos - once a year, currently limited to Dailymotion. We have just finished this crawl and will give more details in next month's update.

- solidarity and social movements - two project crawls on social issues in France, to be launched in May and June.

- blogs - once a year, to improve the collection of blog platforms that are poorly covered in the broad crawl. The crawl will be launched in June.

- auction houses - annual crawl of auction catalogues, to be launched in June.

- travel journals - a crawl of online travel journals, also in June.

- official publications - annual crawl of government websites and publications; takes place in July.

- US official publications - crawl of US government publications under the IDEA agreement to replace exchanges of paper documents with electronic versions; also takes place in July.

- Jean-Philippe Rameau - a crawl for next year's 250th anniversary of the death of this French composer; the crawl is planned for September.


  • Started 2nd stage of domain crawl 2013 using NAS 4.01

Next meeting

August 20th 13-14??

Any other business?