2025-07-08 Statusmeeting
Agenda for the joint NetarchiveSuite teleconference 2025-07-08, 13:00-14:00.
Participants
BNF: Sara, Leslie, Auriane
ONB: Andreas, Antares
KB/DK - Copenhagen: Thomas, Stephen, Tue
KB/DK - Aarhus: Colin
BNE: José, Miguel, Eva
KB/Sweden: Peter, Pär
Update on NAS latest tests and developments
Integration of Heritrix 3.10 in NAS (-> 7.8 or 8.0 ?).
Status of the production sites
Netarkivet
2nd broadcrawl 2025: step 2 finished on Jun 10, 2025, with 885,735,274 documents harvested (81 TB). The 3rd broadcrawl will launch around week 32.
Looking at/adjusting RSS-crawls with NAS
Browsertrix
We have asked Webrecorder for help with scaling issues.
Data delivery
Issues/challenges when we need to export WARC files and exclude content (due to legal/copyright law).
We will look at small-scale solutions to exclude certain domains when ARC/WARC files are needed for researchers.
The 404 project is under way: a data set of 404 information. My tip is to have the researchers look at all the possibilities in SolrWayback before deciding what they need.
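For the small-scale domain exclusion mentioned above, the matching step could look roughly like the sketch below. This is an illustration only, not NAS or SolrWayback code: the function names `is_excluded` and `filter_urls` and the sample domains are hypothetical, and it assumes that a record's target URL is already available as a string.

```python
from urllib.parse import urlsplit


def is_excluded(url, excluded_domains):
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    # hostname is None for URLs without a network location; treat that as "no host".
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    for domain in excluded_domains:
        domain = domain.lower().strip(".")
        # Match the domain itself or any subdomain, but not look-alike names
        # such as "notexample.dk" for "example.dk".
        if host == domain or host.endswith("." + domain):
            return True
    return False


def filter_urls(urls, excluded_domains):
    """Keep only the URLs whose host is not excluded (hypothetical helper)."""
    return [u for u in urls if not is_excluded(u, excluded_domains)]
```

A real export would apply such a check to each record's target URI while copying records from the source WARC file to the delivery file.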
Outreach and more
Anders participated in a workshop as a member of the practice advisory board for the research project Postneutrality in Libraries, Archives and Museums. Pretty interesting: what we do is basically all post-neutral (technically and curatorially). Has it ever been, or been seen as, neutral? https://comm.ku.dk/research/archives-libraries-and-museums/post-neutrality-in-libraries-archives-and-museums/
Budget cuts at the Royal Danish Library (Det Kgl. Bibliotek): 62.5 million kroner (8,376,600 euros) until 2030:
https://kum.dk/aktuelt/nyheder/mindre-bureaukrati-og-kontrol-pengene-skal-bruges-smartere-i-staten
We don't know yet how Digital Cultural Heritage will be affected.
BnF
The Podcasts harvest launched in mid-May is now finished. A total of 12,750 podcasts have been collected, representing more than 650,000 episodes. The harvest reaches 15 TB and 3.4 million files have been archived. The crawl lasted 23 days.
We are continuing our preparatory work on the OpenStreetMap archiving project and are currently conducting crawling tests. We will launch a test of 20% of the total harvest this summer, and we expect to launch the harvest in production in autumn 2025.
Our biannual crawl is currently being launched. It includes nearly 4,500 selections and will last three weeks.
Finally, the preparatory work for the broad crawl is also continuing. The NAS version 7.7 has been installed, and Heritrix version 3.10 is being integrated into NAS. Then we will conduct a series of tests on these new versions.
ONB
BNE
Last month, I participated in the IIPC–IFLA News Media Section Workshop: Web Archiving in Spanish, where I presented Preserving Digital Memory in Spanish: The Role of the Spanish Web Archive. Other countries, such as Mexico and Colombia, also took part in the event, showcasing their own web archiving projects:
https://netpreserve.org/event/iipc-ifla-news-media-section-workshop-web-archiving-in-spanish/
We’ve also been working on archiving content related to LGBT Pride Week and the pride parade, as part of our ongoing collection on this topic.
Additionally, we’ve identified an issue with NAS: it is case-sensitive and treats the same domain in uppercase and lowercase as different entries—resulting in duplicates. Here's an example:
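As an illustration only (this is not NAS code, and not the example referred to above): DNS names are case-insensitive, so spellings that differ only in case refer to the same domain. A minimal sketch of how such duplicates could be detected in a domain list, with the hypothetical function names `normalize_domain` and `group_domains` and made-up sample domains:

```python
def normalize_domain(name):
    """Lower-case a domain name and drop surrounding whitespace and a trailing dot."""
    return name.strip().rstrip(".").lower()


def group_domains(names):
    """Group the original spellings under their normalized form.

    Any group with more than one spelling is a case-only duplicate.
    """
    groups = {}
    for name in names:
        groups.setdefault(normalize_domain(name), []).append(name)
    return groups


if __name__ == "__main__":
    sample = ["Example.ES", "example.es", "bne.es"]  # hypothetical entries
    for domain, spellings in group_domains(sample).items():
        if len(spellings) > 1:
            print(domain, "has duplicate entries:", spellings)
```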
KB/Sweden
Next meetings
September 2nd
October 7th
November 4th
December 2nd
January 6th 2026
Any other business?
Olga wants to organise a tech-series webinar on extraction.