2012-12-04 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference on December 4th, 2012, 13:00-14:00.

Iteration 53 (4.0 development release) (Mikis)

See status here.

  • WARC format verified?
  • Configurable ARC/WARC naming postponed
  • Update to Wayback 1.7.1 postponed.
  • Outstanding 4.0 issues.
  • Sara hasn't received Søren's email with the updated implementation for formatting WARC metadata files. Søren will forward the initial mail to Sara for acceptance of the current WARC format.
  • It is probably OK for BnF to postpone the configurable naming of ARC/WARC files to 4.1, but Sara will check with Clement.

Curator roadmap

Sara is working on creating the BnF input. Sabine has also been discussing the DK input to the roadmap.


Status of the production sites

  • Netarkivet

    We will finish our 4th broad crawl before the end of 2012.


    We are still working on our Facebook template in our test environment.


    We are discussing our priorities:

    -      Harvesting social media: finding solutions for the problems

    -      Improvement/new features for NAS

    • Hide unused harvest definitions, seed lists on both harvest history lists and domain overview pages
    • Add the ability to sort harvest history lists (e.g. newest harvest jobs first, regardless of the definition; choice between one long list or a next button)


    We are working on getting agreements with further e-book publishers.

    We now have 14 active users, 6 of them students. We have prepared standalone access to Wayback for the students at SB.

  • BNF

    On November 28th, we finished our 2012 broad crawl running NetarchiveSuite 3.20 with a BnF patch. It started on October 10 and we had a break of about 10 days between the 2 stages, so it went quite fast! We harvested 33TB of data with 920 jobs in stage 1 (3500 configurations per job, 1000 URLs per domain) and 324 jobs in stage 2 (700 configurations per job, 2500 URLs per domain). Our deduplication index between the 2 stages went up to 165GB. We had to reduce the number of URLs in the second stage from 10000 to 2500 because we filled up our annual quota for the first time. We had no incidents during the crawl, although the pilot and the indexer were overloaded for 2 days during the initialization of the second stage. We have to review the way data, cache, logs and temp files are written and shared over NFS before the 2013 broad crawl.

    Ongoing crawls (weekly, monthly, daily) are running smoothly. We have not planned any big crawls until next year.

    We are focusing on the harvesting of paid content (subscription areas of news sites) and the migration of Wayback from 1.4.1 to 1.7.1.

    Last week the BnF hosted the IIPC-sponsored workshop "How to fit in? Integrating a web archiving program in your organization", with 14 participants from 11 institutions. The workshop allowed us to meet and discuss with IIPC members that either use or are considering using NetarchiveSuite, but are not currently active in the NetarchiveSuite community (such as the national libraries of Estonia, Québec and Spain). A report and other documents from the workshop will be available on the IIPC website in the near future.

  • ONB:


Any other business?

Access to BCWeb source code (snapshot in email)?

  • Nicolas sent the source code a month ago, but Mikis didn't receive the mail. Another attempt will be made.
  • Sara has been in touch with both the Estonian and Spanish web harvesting people regarding usage of NetarchiveSuite, and they are now active on the NAS mailing lists.
  • The sound quality in the teleconference was very bad. We need to look into alternatives to Skype, which doesn't work for BnF. Nicolas proposed Ekiga and will mail Mikis and Andreas regarding a test run.