2016-08-23 Statusmeeting

Agenda for the joint BNF, ONB, SB, KB and BNE NetarchiveSuite tele-conference 2016-08-23, 13:00-14:00.

Practical information


  • BNF: Lam, Annick
  • ONB: Michaela, Andreas
  • KB/DK: Søren, Stephen, Nicholas
  • SB: Sabine, Colin
  • BNE: Mar, Juan Carlos, Fernando, Elena

NAS workshop in Vienna

January 30th 2017 - February 1st 2017 - Vienna

How many participants? Please complete Michaela's poll: http://doodle.com/poll/nk6dfc3kav4a4hs8

IIPC crawler hackathon in London

September 22-23. Is anyone attending? Tue and Colin (?) will go to the workshop.

NAS 5.2 Development Update

Feedback from KB/SB. (https://kb-dk.atlassian.net/secure/RapidBoard.jspa?rapidView=8)

BnF migration to NAS 5 + H3 Update

Feedback from BnF.

  • installation of NAS 5.2-snapshot (development + stage/pre-production environment)
    • correcting BnF's deploy scripts (nas-deploy) for NAS 5
    • database migration (we have started preparing SQL scripts to migrate to the new database schema, but for the moment we are testing NAS 5 with an empty database)

  • migrating BnF developments from https://github.com/netarchivesuite/netarchivesuite-svngit-migration to https://github.com/netarchivesuite/netarchivesuite
  • minor correction of date format
  • migrating host + domain profiles from the old order.xml format to the crawler-beans.cxml format
    • done by crawl engineer Sébastien Pivain-Leroy

  • correcting BnF's statistical tool for NAS (nas-qual) so that it handles both H1 and H3 report formats

  • pending: generate WARC revisit records in WARC 1.1 format
  • pending: archivefiles-report.txt is missing GMT dates and the closing date
    • NAS-2546
    • can only correct date format, can't get opened date
    • in dk.netarkivet.harvester.heritrix3.HarvestDocumentation, there is this comment :
      // Generate an arcfiles-report.txt if configured to do so.
      // This is not possible to extract from the crawl.log, but we will make one from just listing the files harvested by Heritrix3

      boolean genArcFilesReport = Settings.getBoolean(Heritrix3Settings.METADATA_GENERATE_ARCHIVE_FILES_REPORT);


  • pending: attempt to launch the Heritrix instance with a different version of Java
    • for instance Java 9 for Heritrix, to get the new implementation of javax.net.ssl for HTTPS, while keeping Java 7 for NAS
    • this does not look easy to do (see classes HeritrixLauncher & Heritrix3Wrapper)
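
As a rough illustration of the last pending item, the sketch below shows how a Java program can prepare a child process that runs under a different JVM than its own. It does not reflect how HeritrixLauncher or Heritrix3Wrapper actually start Heritrix; the class name, the method name and all paths and arguments are hypothetical, and only the standard java.lang.ProcessBuilder API is used.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class ChildJvmSketch {

    /**
     * Prepares (but does not start) a child process that runs mainClass
     * with the "java" executable found under javaHome/bin, independently
     * of the JVM running the caller.
     * All parameter values are supplied by the caller; none are NAS defaults.
     */
    public static ProcessBuilder childJvm(Path javaHome, String classpath,
                                          String mainClass, List<String> args) {
        List<String> command = new ArrayList<>();
        // Use the java binary of the requested installation, not the parent's.
        command.add(javaHome.resolve("bin").resolve("java").toString());
        command.add("-cp");
        command.add(classpath);
        command.add(mainClass);
        command.addAll(args);

        ProcessBuilder builder = new ProcessBuilder(command);
        // Expose the chosen installation to any script the child may call.
        builder.environment().put("JAVA_HOME", javaHome.toString());
        // Forward the child's stdout/stderr to the parent's streams.
        builder.inheritIO();
        return builder;
    }
}
```

Calling `start()` on the returned builder would launch the child. The difficulty noted above is presumably less about this mechanism than about where the existing launcher classes decide which executable and environment to use.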


Status of the production sites



We are still working on the reorganization of the selective crawls: the strategy is

  • Extension of the selective crawls and smaller broad crawls:
    • We now collect all national Danish news media selectively, both newspaper websites and online-only news media.
    • We are investigating all local news media in order to decide the frequency and depth of future crawls.
    • We made a first crawl of university repositories (with OAI-extraction)

Heritrix 3 is not able to archive Facebook profiles, but Archive-It is able to collect them through an API. We will collect about 100 representative open Facebook profiles with Archive-It; at the moment we are selecting the profiles.

We are working on the compression of our archive.

We are still collecting URLs for the Olympics event crawl (including the Paralympics). We nominate all collected URLs for the IIPC collection.


We are continuing to work on this year's broad crawl. We are preparing nas-preload, the tool used to combine the different sources into a single list to be loaded into NAS. This step also includes a DNS check to avoid slowing down the crawl with domains that do not have a DNS response. This year, in addition to excluding domains with no DNS, we are also excluding those that give an "unknown" response, since we know from previous years that there is generally no content on these domains. Overall the seed list will contain around 4.4 million active domains, and will have improved coverage of the different regional TLDs: .alsace, .paris, .bzh (for Brittany) and the French West Indies.
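
The merge-then-filter step described above can be sketched as follows. This is not the nas-preload code, which is not shown here: the class, the `DnsStatus` values and the injected `dnsCheck` function are all hypothetical names chosen for illustration, and how an "unknown" response is actually detected is left to whatever function the caller supplies.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

public class SeedListSketch {

    /** Hypothetical outcomes of a DNS check; the names are illustrative. */
    public enum DnsStatus { RESOLVED, NO_DNS, UNKNOWN }

    /**
     * Combines several domain sources into one de-duplicated list and keeps
     * only the domains whose DNS check reports RESOLVED, so that both
     * NO_DNS and UNKNOWN domains are excluded.
     */
    public static List<String> buildSeedList(
            Collection<? extends Collection<String>> sources,
            Function<String, DnsStatus> dnsCheck) {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> merged = new LinkedHashSet<>();
        for (Collection<String> source : sources) {
            for (String domain : source) {
                String normalized = domain.trim().toLowerCase(Locale.ROOT);
                if (!normalized.isEmpty()) {
                    merged.add(normalized);
                }
            }
        }
        List<String> kept = new ArrayList<>();
        for (String domain : merged) {
            if (dnsCheck.apply(domain) == DnsStatus.RESOLVED) {
                kept.add(domain);
            }
        }
        return kept;
    }
}
```

Passing the DNS check in as a function keeps the filtering logic independent of the resolver, so the list-building step can be exercised with a stub before running it against several million domains.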

Turning to project crawls, the 2016 Olympiad is now over but our Olympics crawls are still running. The project, in line with the previous collaborative collections documenting the 2014 Sochi Winter Games and the 2012 London Summer Games, involves seven curators from the Literature and Art department who work on the selection based on eight themes. Two crawls were planned, before and after the games, covering a list of 558 seeds. Concerning social media, we focused only on Twitter, with 447 French accounts or hashtags collected twice a day from the 4th to the 24th of August. These crawls will be complemented by one for the Paralympic Games, to be launched on the 18th of September. We have also communicated our list of seeds for the worldwide collaborative collection led by the British Library for IIPC.


  • We have finally launched our online search interface https://webarchiv.onb.ac.at/ and would be interested in your feedback. The archived websites are still not accessible online, but it is possible to search for versions either by URL or in our (partial) full text. We built a bookmarking feature which allows users to save versions online and recall them at the library's web archive terminals.
  • At the moment we have ongoing selective crawls, and an event crawl on the presidential elections is still running.






Next meetings

  • September 20
  • October 25
  • November 29
  • January 3, 2017

Any other business?