Due to low availability of development resources, I propose that we skip the 4.3 release and aim for a minor 4.4 release at the end of the year. We would like to minimize the amount of testing required by including only localized bug fixes, thus avoiding the need for major regression testing.

Decision: skip the 4.3 release and stick to a minor 4.4 release at the end of the year. The only new feature will be BnF's NAS-2212.


Wayback meetings at BnF

Recap from Sara, Nicolas and Colin.

Curator roadmap


Next NetarchiveSuite workshop

We normally gather for a meeting in the autumn to share our views on where NetarchiveSuite should go. Perhaps we should consider doing this in the spring next year?

Status of the production sites

  • We started our third broad crawl for 2013 at the beginning of September.
  • We upgraded our test environment to NAS 4.2. It works fine. When the broad crawl is finished, we plan to upgrade the production system from NAS 4.0 to 4.2.
  • We are working on improving our documentation, not only to facilitate the curators' work, but also at the request of researchers. We are testing how much of our documentation could be incorporated in NAS, among other things by creating extended fields on both the domain level and the harvest definition level.
  • Our greatest barrier to giving access to our archive is the Danish personal data protection law. In a pilot project we extracted a corpus from our archive and screened it for personal data (especially civil registration numbers). We used both automatic and manual screening.
  • We intensified our work on capturing content behind paywalls on news sites.



Last summer, BnF tried a new type of harvest for blog platforms. We were satisfied with the result, except that we had only a small sample of blogs: the volume of images for free.fr was really big and we had to stop the harvest after 15 days. So in 2013, we decided not to collect free.fr and to reduce the budget to 800 URLs per host. We had a list of 225,000 seeds, which we harvested over a period of 50 days. The problem this year is that, with a depth of "host", Heritrix generated an exponentially growing list of inactive queues: it seemed we would never finish the crawl! And so we have to think of yet another choice of parameters…

We are also working on a specific QA for large domains. From the host reports generated by Heritrix, we can regularly analyze the "Top domains" of each run. This summer, we made a general observation of the "Top domains" for the whole of 2013, with the objective of finding new filters and thus eliminating "noise" in the crawls. It showed that many of these domains were not chosen as seeds: there is a very large number of image databases that we need to keep, but also social networks which could be filtered (for example, facebook.com in all languages of the world!). For big domains from our seed lists, we found that we can sometimes exclude some hosts (e.g. betadev.cnrs.fr) or some URLs (for example, URLs with an HTTP 404 response code, caused by Heritrix generating false URLs from JavaScript).
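
As a rough illustration of the "Top domains" analysis described above, here is a minimal Python sketch. It is not BnF's actual tooling. It assumes each data line of a host report starts with a URL count, a byte count and a host name, separated by whitespace; that column layout, the function name top_domains and the sample figures are all assumptions made for the example, so the column indexes should be checked against the real report.

```python
from collections import Counter


def top_domains(report_lines, n=10):
    """Sum URL counts per domain and return the n largest.

    Assumed line layout (hypothetical): <url_count> <bytes> <host> ...
    Adjust the column indexes to match the real host report.
    """
    totals = Counter()
    for line in report_lines:
        fields = line.split()
        # Skip blank lines, header lines and anything without a numeric count.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        host = fields[2].lower()
        # Naive domain: last two labels of the host name.
        # This is a simplification and is wrong for suffixes such as co.uk.
        domain = ".".join(host.split(".")[-2:])
        totals[domain] += int(fields[0])
    return totals.most_common(n)


# Invented sample data, for illustration only.
sample = [
    "[#urls] [#bytes] [host]",
    "120 50000 www.cnrs.fr",
    "300 90000 betadev.cnrs.fr",
    "800 700000 fr-fr.facebook.com",
    "50 1000 en-gb.facebook.com",
]
print(top_domains(sample, n=2))
```

Grouping hosts by domain in this way makes language variants such as the facebook.com hosts show up as a single large entry, which is the kind of "noise" a filter could then target.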


  • The 2nd stage of the 2013 domain crawl is almost finished (just a few jobs left to finish).
  • We have parliamentary elections on Sept 29th. At the beginning of 2013 we started an ongoing politics collection, which also covers this event.

