...

  • BNF: Lam, Annick, Sara
  • ONB: Michaela, Andreas
  • KB/DK: Søren, Stephen, Nicholas
  • SB: Sabine, Colin, Niels
  • BNE: Juan Carlos, Fernando, Elena
  • KB/Sweden: Bengt

IIPC crawler hackathon in London

...

NAS 5.2 Development Update

https://sbforge.org/jira/kb-dk.atlassian.net/secure/RapidBoard.jspa?rapidView=8

On the BnF side, some bugfixes:

...

Should we "reanimate" our curator roadmap/backlog, revise it and discuss it in Vienna?

...

Broad crawl

  • Last week we launched the third broad crawl of 2016. The crawl limit per domain will be at most 100 MB. There will be special crawls for ministries and government bodies, and for very large sites (e.g. dr.dk).
  • We will try to get in touch with the website owners/web hotels who are blocking our crawler (about 11% are blocking us).

Event crawl

  • The event collection for the Olympics in Rio 2016 will go on until the end of the Paralympics 2016

Selective crawls

  • We are working on the configuration of the regional/local news media crawls.
  • Facebook
    • We have test-crawled about 60 Danish Facebook profiles with Archive-IT. We are analyzing how much we get from the profiles. We have to renew our account with Archive-IT after the end of November and we are trying to negotiate a good price.
    • We made a special crawl of Prime Minister Lars Løkke's Facebook profile on 2016.08.30, the day he published his 2025 plan.

Compression of the archive

  • We are preparing for the compression, but this awaits NAS release 5.2

Last but not least

Last week we learned that the Ministry of Culture wants KB and SB to merge: from January 2017 we will be “Nationalbiblioteket”, with two locations, in Copenhagen and Aarhus.

BnF

We are continuing to work on this year's broad crawl. We are preparing nas-preload, the tool used to combine the different sources into a single list to be loaded into NAS. This step also includes a DNS check, so that the crawl is not slowed down by domains that give no DNS response. This year, in addition to excluding domains with no DNS, we are also excluding those that give an "unknown" response, because we know from previous years that there is generally no content on these domains. Overall the seed list will contain around 4.4 million active domains, and will have improved coverage of the different regional TLDs (.alsace, .paris, and .bzh for Brittany) and of the French West Indies.
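The merge-and-filter step described above could be sketched roughly as follows. This is only an illustration and not the actual nas-preload code: the function names, the three status labels ("ok", "no_dns", "unknown") and the way a lookup failure is mapped to a label are all assumptions made for the example.

```python
import socket
from typing import Callable, Iterable, List


def normalise(domain: str) -> str:
    """Trim whitespace, lower-case, and drop a trailing dot."""
    return domain.strip().lower().rstrip(".")


def merge_sources(sources: Iterable[Iterable[str]]) -> List[str]:
    """Combine several domain lists into one sorted, de-duplicated list."""
    merged = set()
    for source in sources:
        for raw in source:
            domain = normalise(raw)
            if domain:  # skip empty lines
                merged.add(domain)
    return sorted(merged)


def dns_status(domain: str) -> str:
    """Hypothetical classifier; the labels are assumptions, not nas-preload's.

    'ok'      : the name resolves
    'no_dns'  : name resolution fails
    'unknown' : any other OS-level error during the lookup
    """
    try:
        socket.getaddrinfo(domain, None)
        return "ok"
    except socket.gaierror:
        return "no_dns"
    except OSError:
        return "unknown"


def filter_seeds(
    domains: Iterable[str],
    status_of: Callable[[str], str] = dns_status,
) -> List[str]:
    """Keep only domains whose DNS status is 'ok'.

    Both 'no_dns' and 'unknown' are excluded, mirroring this year's rule.
    """
    return [d for d in domains if status_of(d) == "ok"]
```

Passing the status function as a parameter keeps the filtering rule separate from the lookup itself, so the rule can be checked against a fixed table of results without touching the network.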

Turning to project crawls, the 2016 Olympiad is now over but our Olympics crawls are still running. The project, in line with the previous collaborative collections documenting the 2014 Sochi Winter Games and the 2012 London Summer Games, involves seven curators from the Literature and Art department, who select sites based on eight themes. Two crawls were planned, before and after the Games, covering a list of 558 seeds. For social media, we focused only on Twitter, with 447 French accounts or hashtags collected twice a day from the 4th to the 24th of August. These crawls will be complemented by one for the Paralympic Games, to be launched on the 18th of September. We have also shared our list of seeds for the worldwide collaborative collection led by the British Library for IIPC.

ONB

  • We have already switched to NAS 5.2 because we had severe problems with HTTPS websites in the previous version. These problems are now fixed by using H3, which runs under Java 1.8.0_77, with the following disabled algorithms set in /opt/jdk1.8.0_77/jre/lib/security/java.security:

    jdk.tls.disabledAlgorithms=SSLv3, DHE, ECDHE, RC4, MD5withRSA, DH keySize < 768

    It has gone smoothly so far. We are still using the ARC format, because we have to refactor all our tools before we switch to WARC.
  • The crawl of our presidential elections is still running. We have a new election date at the beginning of December and hope to be able to finish the crawl soon.
  • Apart from one small, additional thematic crawl we will only have ongoing crawls until the end of the year. The next domain crawl is scheduled for 2017.
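
For anyone wanting to confirm that a java.security file carries the jdk.tls.disabledAlgorithms setting quoted above, a small check could look like the sketch below. It is not part of NAS or H3, the function name is made up for the example, and it handles only the simple `key=value` form with optional backslash line continuations, not the full Java properties syntax.

```python
from typing import List

PROPERTY = "jdk.tls.disabledAlgorithms"


def disabled_tls_algorithms(security_text: str) -> List[str]:
    """Return the entries of jdk.tls.disabledAlgorithms found in the text.

    Simplified reading of a java.security file (an assumption of this
    sketch): comment lines start with '#', a trailing backslash continues
    the value on the next line, and the last definition wins.
    """
    logical_lines = []
    pending = ""
    for raw in security_text.splitlines():
        line = raw.strip()
        if not pending and line.startswith("#"):
            continue  # comment line
        if line.endswith("\\"):
            # join continued lines into one logical line
            pending += line[:-1].strip() + " "
            continue
        logical_lines.append(pending + line)
        pending = ""
    if pending:
        logical_lines.append(pending)

    result: List[str] = []
    for line in logical_lines:
        key, sep, value = line.partition("=")
        if sep and key.strip() == PROPERTY:
            result = [item.strip() for item in value.split(",") if item.strip()]
    return result
```

Reading the file is left to the caller, for example by passing in the contents of /opt/jdk1.8.0_77/jre/lib/security/java.security and checking that entries such as SSLv3 and RC4 appear in the returned list.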

...