2021-10-05 Status meeting

Agenda for the joint NetarchiveSuite tele-conference 2021-10-05, 13:00-14:00.

Participants

  • BNF: Clara, Sara, Auriane
  • ONB: Andreas
  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
  • KB/DK - Aarhus: Colin
  • BNE: José, Alicia, Miguel
  • KB/Sweden: Peter, Jonas

Update on NAS latest tests and developments


Status of the production sites

Netarkivet

  • Broad Crawl. We started step 2 today with a 5-day limit and max bytes set lower than normal. It is paused now due to an issue with disks.
  • Migration to NAS 7.2 and Bitmagasinet is still ongoing. The Hadoop part is going a bit slower than planned.
  • Working with 6 groups of approx. 7 people each from IT-University CPH on finding thematic content, e.g. on Communal Election 2021, and on content nomination/top lists.
    • Looking at https://brandmentions.com/ to find more Danish or Danish-relevant content on the web and social media. Looks promising.
    • Experimenting with other methods to find relevant content.
  • The PyWb project idea is closer to getting an actual priority and a go/no-go decision.
  • Finalizing Collection strategy for 2021-2023 and 2023-2025. 
  • Blacklight - we are closing the service. SolrWayback has taken over.
  • Access to Netarkivet for employees in the Ministry of Culture who need it for work/collection purposes is being looked into (e.g. Rigsarkivet, SMK, DFI).
  • IIPC project proposal 
    User-Friendly High Fidelity Browser-Based Crawling for All - Proposal for IIPC Discretionary Funding Program 2021-2022 (1).pdf

    LEAD IIPC INSTITUTION: Royal Danish Library
    2ND IIPC INSTITUTION: UK Web Archive
    3RD IIPC INSTITUTION: University of North Texas Libraries
    4TH IIPC INSTITUTION: National Library of New Zealand

  • Working together with Rigsarkivet (they mainly archive non-published content) to see if we can help each other. If we harvest all the information, institutions don't need to pay Rigsarkivet to archive their data.
  • Talking to Kees (KB - NL) about their testing and use of SolrWayback and NAS.



BnF

The preparations for our 2021 broad crawl are coming to an end and it will be launched on October 11th. Our seed list is made up of 5.5 million domains divided into 1108 jobs, for a total budget of 115 TB.
This year, we ran a test broad crawl between the 9th and the 17th of September, covering 20% of the complete seed list.
The objective was to obtain more precise indicators in order to define an appropriate budget.

Last month, a new version (wayback 8.6) of our "Archives de l'internet" went into production. The Videos virtual guided tour has been updated with 383 new YouTube channels.
As mentioned in September, we also published a virtual guided tour about the web from Lorraine. This came with a new homepage around this theme. We also produced a slide show with some captures for which we have obtained reproduction rights.
You can consult it at the following address: https://www.bnf.fr/sites/default/files/2021-09/Diaporama_Parcours_guid%C3%A9_Le_web_lorrain.pdf

Lastly, we are pleased to announce that the recordings made on the launch day of the ResPaDon project can be viewed at this address: https://respadon.hypotheses.org/1

ONB

  • Beginning in 2022, we are allowed to use 11 TB of storage instead of 6 TB.
  • Migrating old web pages crawled by HTTrack into ARC files and into NAS.

BNE

KB-Sweden

There is now a consultant at KB who will work on upgrading and improving our web archiving environment. His name is Jonas Linde, and he will participate today.

Next meetings

  • November 2nd
  • December 14th
  • January 11th, 2022

Any other business?
