2025-11-04 Statusmeeting
Agenda for the joint NetarchiveSuite teleconference 2025-11-04, 13:00-14:00.
Participants
BNF: Sara, Leslie, Auriane
ONB: Andreas, Antares
KB/DK - Copenhagen: Thomas, Stephen, Tue
KB/DK - Aarhus: Colin
BNE: José, Miguel, Eva
KB/Sweden: Peter, Pär
Update on NAS latest tests and developments
Release of Heritrix 3.12 - October 29th 2025: https://github.com/internetarchive/heritrix3/releases/tag/3.12.0
Status of the production sites
Netarkivet
1st broad crawl 2025: step 1 finished, step 2 will start as soon as possible
Event crawl on the election: Kommunalvalg 2025 (Danish municipal elections)
Browsertrix
Working on API transfer - also from cloud
Focus on custom behaviours
Data extraction.
WARC-files excluding some domains via https://github.com/iipc/jwarc
Researchers will build their own SolrWayback using the WARC-files
Early ARC-files are used: with JWARC and some fixes, more content is available. Indexing stopped for some ARC-files due to errors in WARC-indexer.
Outreach and more
IIPC WAC 2026, Brussels: 2 proposals submitted
https://cst.ku.dk/kalender/sprogteknologisk-konference-2025/
Panel debate: Trust, design and data in language technology solutions for Danish use
BnF
We are completing the final steps before the launch of our 2025 broad crawl, which is expected to start this week. We will crawl 2500 URLs per domain and the projected budget is 189 TiB.
Following the shutdown of the http://typepad.fr servers, the harvest ended on October 14th. At that point, 118 queues remained to be crawled for the http://typepad.com domain and 6 queues had not finished for the http://typepad.fr domain. In total the harvest lasted one month. We archived over 1 million URLs, for 154 GB of data.
ONB
BNE
On November 13 and 14, during the 1st Virtual Conference on Legal Deposit Experiences in the National Libraries of Ibero-America, organized by ABINIA, we managed to include a roundtable titled "Challenges and Opportunities of Legal Deposit in the Digital Environment", as well as a workshop on "Implementing a Web Archive from Scratch". In the first sessions, we will discuss the lessons learned from the Spanish Web Archive. Afterwards, we will have two sessions: one focused on Heritrix and the other on Browsertrix.
We have completed the broad crawl of open-access journals that we carry out every year, and we are now going to begin analysing the results. With around 5 TB of data collected, we are currently preparing for quality assurance (QA).
KB-Sweden
Next meetings
December 9th
January 6th 2026