2015-04-07 Statusmeeting

Agenda for the joint BnF, ONB, SB and KB NetarchiveSuite teleconference, April 7th 2015, 13:00-14:00.

BnF proposal for H3 librarian features

Feedback on profile migration

Hi Søren, Mikis,

Here is our feedback after carefully reading the H3 template you sent to the mailing list and put on the wiki (Migrating H1 templates to H3 to use with NetarchiveSuite 5.1+).

We compared it with the analyses and tests we have run on our side with H3 as a standalone application, and with our "H1 parameters Excel sheet", which lists all the parameters we use and change according to the different templates. The comparison is also based on the default template. We agree with your suggestion: since there are only a few changes between the different templates, it is much easier to migrate one template first and use that new H3 template to change the other ones.

There are some beans or properties we use and would like to add to this template, some of them in the simpleOverrides section:
* in the metadata section:
add metadata.organization in the simpleOverrides section
add XXX.date in the simpleOverrides section (have you found a way to keep track of the date of manual modifications in the template? We currently use this field in our SPAR system)
put up robotsPolicyName=true in the simpleOverrides section

* in the crawlController section
add runWhileEmpty=false in the simpleOverrides section (this is the new pause-at-finish)

* in preparer section
put up queueAssignmentPolicy=dk.netarkivet.harvester.harvesting.DomainnameQueueAssignmentPolicy in the simpleOverrides section (the hostname counterpart also exists, right?)
put up costAssignmentPolicy in the simpleOverrides section
detail the canonicalizationPolicy rules in the template for better documentation

* quotaenforcer
we do not use the quotaenforcer but if you do, placeholders should probably be put in the simpleOverrides section.

* in the prefetch section
add the RuntimeLimitEnforcer bean to allow a pause, block URIs or terminate action to enforce runtime limits on crawls

* in the fetchHTTP section
add fetchHttp.acceptHeaders in the simpleOverrides section

add the FetchFTP bean, which fetches documents and directory listings using the FTP protocol

These are probably minor changes. We could add the beans and properties to the template and send it to you.
It really depends on how you want to proceed. Please tell us which way is easiest for you.

The last one is probably the most important and probably not the easiest one. To include either the WARC or ARC writer in the template,
you have the following placeholder:
<!-- Here the (W)arc writer is inserted -->
We use and change the different properties:
suffix can be set to large_2015_${HOSTNAME} or ciblee_2015_${HOSTNAME}
pool-max-active can be set to 1, 3 or 5 depending on the crawls
write-requests / write-metadata will be set to true in certain templates, false in others
so we really need flexibility in changing individual properties. Would that be possible?

We still want to compare the H3 default template with the NAS 5.0 template.
We noticed you deleted some properties which were commented out; we just want to see which ones.

Talk to you soon.
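
For reference, two of the simpleOverrides additions requested in the feedback above could look roughly like the fragment below. This is only a sketch, assuming the template keeps the standard Heritrix 3 simpleOverrides bean (a Spring PropertyOverrideConfigurer) and that the metadata and crawlController beans keep their default ids. The organization value is a placeholder, not an agreed setting.

```xml
<!-- Sketch only: values are placeholders, bean ids assumed to match the H3 default profile -->
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
# metadata section: organization name (placeholder value)
metadata.organization=EXAMPLE_ORGANIZATION
# crawlController section: described above as the new pause-at-finish
crawlController.runWhileEmpty=false
    </value>
  </property>
</bean>
```

Each line in the value block has the form beanId.property=value, so per-template differences stay in one place instead of being spread over the bean definitions.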


NetarchiveSuite at IIPC GA

Status of the production sites


  • We stopped our event crawls on the Copenhagen shooting. Potential ongoing activities will be captured by our selective crawls.
  • We finished our first broad crawl for 2015: we archived about 38.5 TB in total, which is about 10 TB more than our budget. We are discussing new strategies for our broad crawls.
  • We are collecting the results of our wayback fulltext search tests (both curators and users have tested).
  • Rumour has it that the parliamentary elections will be announced shortly after the Queen's 75th birthday (16th April), so we are ready to give our event crawl another boost.

Otherwise, business as usual


In March, we launched a selective crawl with the frequency "twice a year". In total, 1,600 domains will be captured in less than one month. The majority of the sites cover literature and cinema. For this crawl, we have asked the content librarians to carefully review all BCWeb records that have not been updated since 2011. We regularly have to encourage the librarians to review their records: in fact, it is more common for them to add new records than to remove old ones that may no longer be relevant.

There are several changes in the BnF team in April. Clément has left for his new post at the ISSN International Centre, and Benoît Tuleu, deputy director of the legal deposit department, will be in charge of the service during the interim period. Géraldine will soon be going on maternity leave for several months, and the other members of the team will be taking on her projects during this period. Finally, Imad, the developer who has worked on several projects for us in recent years and in particular BCWeb, has also left to take on a new opportunity elsewhere.


  • At the end of March we started our 4th broad crawl (we still have broad crawls only every two years, with a budget of 8 TB).
  • We have 4 regional elections this year, the first takes place in May.
  • Also in May: Eurovision Song Contest. We started selecting seeds, but it will be a rather small collection.
  • Total budget for all selective crawls is 2 TB.

Next meeting

5th May

Any other business?