2025-07-08 Statusmeeting

Agenda for the joint NetarchiveSuite teleconference 2025-07-08, 13:00-14:00.

Participants

  • BNF:  Sara, Leslie, Auriane

  • ONB: Andreas, Antares

  • KB/DK - Copenhagen: Thomas, Stephen, Tue

  • KB/DK - Aarhus: Colin

  • BNE: José, Miguel, Eva

  • KB/Sweden: Peter, Pär

Update on NAS latest tests and developments

  • Integration of Heritrix 3.10 in NAS (-> 7.8 or 8.0?).

Status of the production sites

Netarkivet

  • 2nd broad crawl 2025, step 2, finished on Jun 10, 2025: 885,735,274 documents harvested (81 TB). The 3rd broad crawl will launch around week 32.

BnF

The Podcasts harvest launched in mid-May is now finished. A total of 12,750 podcasts have been collected, representing more than 650,000 episodes. The harvest totals 15 TB, with 3.4 million files archived. The crawl lasted 23 days.

We are continuing our preparatory work on the OpenStreetMap archiving project and are currently conducting crawling tests. We will launch a test covering 20% of the total harvest this summer, and we expect to launch the production harvest in autumn 2025.

Our biannual crawl is currently being launched. It includes nearly 4,500 selections and will last three weeks.

Finally, the preparatory work for the broad crawl is also continuing. NAS version 7.7 has been installed, and Heritrix version 3.10 is being integrated into NAS. We will then conduct a series of tests on these new versions.

ONB

 

BNE

Last month, I participated in the IIPC–IFLA News Media Section Workshop: Web Archiving in Spanish, where I presented "Preserving Digital Memory in Spanish: The Role of the Spanish Web Archive". Other countries, such as Mexico and Colombia, also took part in the event, showcasing their own web archiving projects:
https://netpreserve.org/event/iipc-ifla-news-media-section-workshop-web-archiving-in-spanish/

We’ve also been working on archiving content related to LGBT Pride Week and the pride parade, as part of our ongoing collection on this topic.

Additionally, we’ve identified an issue with NAS: it is case-sensitive and treats the same domain written in uppercase and in lowercase as different entries, which results in duplicates. Here's an example:

image-20250708-102447.png
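
As a possible workaround until this is handled in NAS itself, a domain list could be normalised before import. The sketch below is only an illustration and not NAS code: the function names and the sample domains are hypothetical. It relies on the fact that DNS names are compared case-insensitively, so lowercasing a domain does not change which site it refers to.

```python
def normalize_domain(name: str) -> str:
    """Return a canonical spelling: trimmed, without trailing dot, lowercased."""
    return name.strip().rstrip(".").lower()


def deduplicate_domains(names):
    """Return each domain once, in first-seen order, ignoring letter case."""
    seen = {}  # dict keeps insertion order; keys are the normalised spellings
    for name in names:
        key = normalize_domain(name)
        if key and key not in seen:
            seen[key] = True
    return list(seen)


# Hypothetical input containing the same domain in several spellings.
domains = ["Example.org", "example.org", "EXAMPLE.ORG ", "example.net"]
print(deduplicate_domains(domains))
```

Applying a step like this to a seed or domain list before it is loaded would avoid creating the duplicate entries, but it does not clean up duplicates that already exist in the database.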

 

KB-Sweden

 

Next meetings

  • September 2nd

  • October 7th

  • November 4th

  • December 2nd

  • January 6th 2026

Any other business?

  • Olga wants to organise a tech-series webinar on extractionEUr