2015-03-03 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference, 3rd March 2015, 14:00-15:00.

Practical information

  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.


  • BNF: Sara and Lam
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin, Sabine, Mikis. Ditte is out of town and unable to attend this meeting.


  • H3 template functionality isn't quite ready yet. Perhaps ready in a week's time (Søren?)
  • 5.0 alpha release isn't ready yet. Perhaps in 3 weeks.


NetarchiveSuite at IIPC GA

Status of the production sites


  • New event crawl: On 15th February we started an event crawl on the terror attack in Copenhagen on 14th and 15th February. As the Danish news coverage of the attack is captured by our selective crawls, we focused on foreign news media (we got help from IIPC members) and social media (for example new Facebook groups and new hashtags connected to the event). We asked the Center for intercultural and regional Studies at Copenhagen University to help with URLs to pages, blogs, etc. from Jewish and Arab circles. We participated in a radio programme: there had been a debate on Facebook, an outcome of the terror attack, between the authors Yahya Hassan and Kristian Ditlev Jensen. The question was: Is such a debate relevant for Netarchive?
  • Event crawl parliamentary elections: We have boosted the ongoing crawls with nearly 10,000 politicians' profiles on social media. We will probably use Archive-It to collect content Heritrix can't capture.
  • Broad crawl: We are running step 2 of our first broad crawl for 2015. We recently had some technical problems at the end of step 1.
  • Curator meeting KB/SB: We met for two days in order to map all our operational tasks and to discuss the curator part of our action plan for 2015.
  • Full-text search: We are establishing an interface so we can start testing. Furthermore, we have started work on blocking material containing personal data, covering both data already in the archive and a procedure for future archival content. We have tools to do this with Wayback URL search, but not with full-text search.
  • New collection: During March the Danish public service broadcasting corporation Danmarks Radio will transfer all its web-only video productions to SB.
  • Collection policy and strategy: A new version of the draft is almost ready to be discussed at the next steering committee meeting.


After the dreadful attacks which occurred on the 7th and 9th of January in Paris and the events that followed, we decided to launch an emergency crawl in order to harvest web resources (news articles, blog posts, social media reactions, institutional websites…) related or reacting to them. We made an appeal to IIPC members and to our BnF network of librarians, asking them if they could help us in quickly gathering references to make the most complete and relevant seedlist possible. Due to the exceptional nature of the event, the scope and criteria of the selection were extended to an international scale and aimed to cover the different forms and diversity of the reactions. We received 2,480 URLs from eighteen different IIPC members and 1,740 URLs nominated by more than 70 BnF librarians. In addition to these selections, the already identified seed lists of French governmental, news, political, and activist websites have been specially harvested. And finally, our regular daily and weekly harvests of the principal French news sites, particularly relevant during those days, worked as usual.

Technically, the crawls were performed from 8th to 16th January 2015, and each website was crawled at least once with a depth of page + 1 click. During the same period, selected Twitter accounts and popular hashtags (such as the now famous #JesuisCharlie) were crawled four times a day. A total of 15.9 million URLs were collected, amounting to 0.5 TB of data.


Broad Crawl:

  • We are currently preparing for our bi-annual broad crawl with 1.25 million .at domains and the new TLD .wien.
  • Before the broad crawl we will change from NAS 4.01 to 4.4.
  • We finished the database migration from MySQL to PostgreSQL.
  • We made a JDK change from 1.6_22 to 1.7_65 (this did not work; we need to switch back to a lower 1.7 version).
  • The JDK change caused problems with deduplication.
  • Our IT department is developing a new storage concept. Currently our storage is outsourced to the Federal Computing Centre; it would be less expensive to have the storage in-house. We are trying to negotiate a higher storage budget for the web archive.
  • We still experience technical difficulties with our cluster.

Selective Crawls:

  • In 2015 four regional elections take place. We will add the seeds to our politics collection.
  • The Eurovision Song Contest in May will be a rather small project for us. We could not get additional resources and unfortunately it will not be comparable to the great effort last year in Denmark.

Next meeting

7th April

Any other business?