2025-05-06 Statusmeeting

Agenda for the joint NetarchiveSuite teleconference 2025-05-06, 13:00-14:00.

Participants

  • BnF: Sara, Haja

  • ONB: Andreas, Antares

  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue

  • KB/DK - Aarhus: Colin

  • BNE: José, Miguel, Eva

  • KB/Sweden: Peter, Pär

Update on NAS latest tests and developments

Proposal from Colin to get NAS development back on course.

 

Status of the production sites

Netarkivet

  • 2nd Broadcrawl 2025: step 2 just started

  • Data delivery of all text from the archive plus some metadata for a research project is finished: 32 TB compressed.

    • WARC files with context are now wanted.

      • Is context important?

      • How to filter/ingest from WARC (or the current CSV)?

  • “Mere vand i systemet/More water in the system” climate change debate project

    • Closing up

    • Webrecorder is actively working on Facebook behaviour (expand comments, view reels/content etc.). Logged in.

    • ANKM crawled 50K+ pages and 300 GB in 4 hours of active work (1-2 days of crawling at max capacity)

  • Browsertrix

    • Repurposed hardware

    • Testing

    • Scaling

  • pywb

    • Will be looked at in the coming weeks/months

  • Outreach and more

 

BnF

At the beginning of April, we launched our annual selective harvest. Around 10 000 selections, divided into 21 jobs, are crawled over 6 weeks with 3 budget parameters (from 50 000 to 150 000 URLs per domain).
The budget should reach between 25 and 30 TB.

We're just about to finish the tests for our podcast harvest. We should launch it in May.

Following the virtual guided tour "The Environment on the Web" published in December 2024, we created an illustrated PDF booklet of this tour. Permissions were requested from the website creators, and around thirty illustrations were included in the document. You can discover it here: https://multimedia-ext.bnf.fr/pdf/PARCOURS_GUIDE_environnement.pdf

ONB

 

BNE

Last week, a major blackout affected all of Spain, Portugal, Andorra and southern France. We experienced 12 hours without electricity or internet access. The following day, we created a new event collection dedicated to the blackout, compiling national and international news, its consequences, and more. Currently, the collection includes over 200 seeds, with contributions from different regional web curators.

One of our goals for this year is to begin podcast crawling. We're starting to test the harvesting of metadata and MP3 files via RSS feeds using NAS. Our initial focus will be a selection of 100 podcast titles produced by different departments of the BNE.
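
As an illustration of the RSS step only, here is a minimal sketch of how episode metadata and MP3 URLs could be pulled out of a podcast feed to build a seed list. It is not how NAS does this; the function name `enclosure_urls`, the sample feed and its example.org URLs are all hypothetical, and it assumes a plain RSS 2.0 feed where each episode carries an `<enclosure>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed, used only to show the expected input shape.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <guid>ep-1</guid>
      <pubDate>Mon, 05 May 2025 10:00:00 GMT</pubDate>
      <enclosure url="https://example.org/audio/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Text-only announcement</title>
      <guid>note-1</guid>
    </item>
  </channel>
</rss>
""".strip()


def enclosure_urls(rss_xml):
    """Return one dict per feed item that has an enclosure URL."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None or not enclosure.get("url"):
            continue  # items without audio are skipped
        episodes.append({
            "title": (item.findtext("title") or "").strip(),
            "guid": (item.findtext("guid") or "").strip(),
            "pub_date": (item.findtext("pubDate") or "").strip(),
            "url": enclosure.get("url"),
            "mime_type": enclosure.get("type", ""),
        })
    return episodes


if __name__ == "__main__":
    for episode in enclosure_urls(SAMPLE_FEED):
        print(episode["url"], episode["title"])
```

The resulting URLs could then be used as seeds for a harvest, with the feed URL itself kept as a seed so the metadata is archived alongside the audio.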

We have a significant group of websites that produce a strange error when we try to crawl them. The crawler first returns a -404 error after requesting robots.txt and then immediately gives a -2 error. We have been able to fix some of them with a trap to avoid robots.txt. We also have other -2 errors for which we have no solution.
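
To see how widespread the problem is, one option is to tally negative fetch statuses per host from the crawl log. The sketch below assumes the usual whitespace-separated Heritrix crawl.log layout, with the status code in the second field and the URI in the fourth. The labels for -2 and -404 are our reading of the Heritrix status code list and should be checked against its documentation; the sample lines and host names are invented.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed meanings; verify against the Heritrix status code documentation.
STATUS_LABELS = {
    -2: "HTTP connect failed",
    -404: "empty HTTP response interpreted as a 404",
}

# Invented log lines that only mimic the assumed field order.
SAMPLE_LOG = [
    "2025-05-05T10:00:00.000Z  -404   - http://example.es/robots.txt P http://example.es/ unknown #001 - - - -",
    "2025-05-05T10:00:01.000Z    -2   - http://example.es/ - - unknown #001 - - - -",
    "2025-05-05T10:00:02.000Z   200 512 http://example.org/ - - text/html #002 - - - -",
]


def failed_fetches_by_host(lines):
    """Count (host, status) pairs for every line with a negative status code."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete crawl log line
        try:
            status = int(fields[1])
        except ValueError:
            continue  # status field is not numeric
        if status >= 0:
            continue  # only crawler-side failures are of interest here
        host = urlsplit(fields[3]).hostname or fields[3]
        counts[(host, status)] += 1
    return counts


if __name__ == "__main__":
    for (host, status), n in sorted(failed_fetches_by_host(SAMPLE_LOG).items()):
        print(host, status, STATUS_LABELS.get(status, "unlabelled"), n)
```

Hosts that show a -404 on robots.txt followed by -2 on every other URI can then be compared with the hosts that only show -2, which may help separate the two cases described above.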

KB-Sweden

 

Next meetings

  • June 3rd

  • July 8th

  • September 2nd

  • October 7th

  • November 4th

  • December 2nd

  • January 6th 2026

Any other business?