2025-05-06 Status meeting
Agenda for the joint NetarchiveSuite teleconference 2025-05-06, 13:00-14:00.
Participants
BNF: Sara, Haja
ONB: Andreas, Antares
KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
KB/DK - Aarhus: Colin
BNE: José, Miguel, Eva
KB/Sweden: Peter, Pär
Update on NAS latest tests and developments
Proposal from Colin to get NAS development back on course.
Status of the production sites
Netarkivet
2nd broad crawl 2025: step 2 has just started
Data delivery of all text from the archive, plus some metadata, for a research project is finished. 32 TB compressed.
WARC files with context are now wanted.
Is context important?
How to filter/ingest from WARC (or the current CSV)?
“Mere vand i systemet/More water in the system” climate change debate project
Closing up
Webrecorder is actively working on Facebook behaviour (expand comments, view reels/content etc.). Logged in.
ANKM crawled 50K+ pages and 300 GB in 4 hours of active work (1-2 days of crawling at max capacity)
Browsertrix
Repurposed hardware
Testing
Scaling
pywb
Will be looked at in the coming weeks/months
Outreach and more
BnF
At the beginning of April, we launched our annual selective harvest. Around 10 000 selections divided into 21 jobs are being crawled over 6 weeks with 3 budget parameters (from 50 000 to 150 000 URLs per domain).
The budget should reach between 25 and 30 TB.
We're just about to finish the tests for our podcast harvest. We should launch it in May.
Following the virtual guided tour "The Environment on the Web" published in December 2024, we created an illustrated PDF booklet of this tour. Permissions were requested from the website creators, and around thirty illustrations were included in the document. You can discover it here: https://multimedia-ext.bnf.fr/pdf/PARCOURS_GUIDE_environnement.pdf
ONB
BNE
Last week, a major blackout affected all of Spain, Portugal, Andorra and southern France. We experienced 12 hours without electricity or internet access. The following day, we created a new event collection dedicated to the blackout, compiling national and international news, its consequences, and more. Currently, the collection includes over 200 seeds, with contributions from different regional web curators.
One of our goals for this year is to begin podcast crawling. We're starting to test the harvesting of metadata and MP3 files via RSS feeds using NAS. Our initial focus will be a selection of 100 podcast titles produced by different departments of the BNE.
We have a significant group of websites that return a strange error when we try to crawl them. The crawler first reports a -404 error after requesting robots.txt and then immediately gives a -2 error. We have been able to fix some of them with a trap that avoids robots.txt. We have other -2 errors for which we have no solution.
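As a rough illustration of the podcast test mentioned above, the following minimal sketch shows one way the MP3 URLs and episode titles might be pulled out of an RSS feed to build a seed list. It is not the BNE workflow or a NAS feature: the sample feed, the example.org URLs and the function name `extract_episodes` are all assumptions made up for the example, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; real podcast feeds usually carry more metadata.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example podcast</title>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.org/audio/ep1.mp3" type="audio/mpeg" length="123"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.org/audio/ep2.mp3" type="audio/mpeg" length="456"/>
    </item>
    <item>
      <title>Text-only post</title>
    </item>
  </channel>
</rss>
"""


def extract_episodes(feed_xml):
    """Return one dict per feed item that has an <enclosure> with a URL."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        enclosure = item.find("enclosure")
        if enclosure is None:
            continue  # item without an attached media file
        url = enclosure.get("url")
        if not url:
            continue  # malformed enclosure, nothing to harvest
        episodes.append({
            "title": item.findtext("title", default=""),
            "url": url,
            "type": enclosure.get("type", ""),
        })
    return episodes


if __name__ == "__main__":
    # Print the media URLs, one per line, as a candidate seed list.
    for episode in extract_episodes(SAMPLE_FEED):
        print(episode["url"])
```

The resulting URLs could then be added as seeds alongside the feed URL itself, so that both the metadata and the audio files are captured.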
KB-Sweden
Next meetings
June 3rd
July 8th
September 2nd
October 7th
November 4th
December 2nd
January 6th 2026