2025-03-04 Status meeting
Agenda for the joint NetarchiveSuite teleconference 2025-03-04, 13:00-14:00.
Participants
BNF: Sara
ONB: Andreas, Antares
KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
KB/DK - Aarhus: Colin
BNE: José, Miguel, Eva
KB/Sweden: Peter, Pär, Johan
Update on NAS latest tests and developments
Preparation of NAS Workshop in Oslo
Agenda proposal: https://docs.google.com/document/d/1eO2cgcfwQ-7BOxrjzSApjvTEgXc-VfCPEgmqyVOriic/edit?usp=sharing
Decision on times: 13:00 - 17:00
Location: at the Conference venue: Henrik Ibsens gate 110, 0255 OSLO.
Status of the production sites
Netarkivet
1st broad crawl 2025: step 2 almost finished; smoother crawl than ever.
Data delivery of all text from the archive, plus some metadata, for a research project is finished: 32 TB compressed.
“Mere vand i systemet” (“More water in the system”) climate change debate project
Proceeding as planned:
Using Browsertrix Cloud to crawl hard-to-get content like video (YouTube + LinkedIn logged in) and more.
Waiting for results from Webrecorder's development of a Facebook behaviour (expand comments, view reels/content, etc.) for logged-in crawls.
Lots of experience and findings using Browsertrix including live-exclusions (text-regex etc.)
Browsertrix
Lots of updates from Webrecorder, which causes issues on local installs. Webrecorder reacts swiftly.
We have 3 instances:
Local:
Devel
Prod (with a mapped IP for reaching content behind paywalls)
Cloud:
3 TB Pro Plan. Monthly crawl time is a bit challenging.
Example: the tv2 Browsertrix harvest has now reached the maximum limit of 500K pages after 14 days and continues to harvest the remaining approx. 360K. We are up to 471 GB now, so I'm guessing 1-2 TB; we have almost 6 TB free. I estimate that if we had set the maximum limit to 1.5 million, we would cover tv2 in full depth (8 hops down) with 1 hop out, and it would probably take between 1 and 1½ months with the current equipment. So it is actually possible, I think! The last thing I saw was that it was working with the 10s and 00s, with appropriate drops to the 20s regularly. When 500K has been harvested and uploaded, I will check the crawl log for duplicates or similar and for how many URLs have been run through. For comparison, a tv2 search with * shows 1.4 million for the entire site.
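A rough back-of-the-envelope check of the tv2 figures above, as a minimal sketch. It assumes (our reading, not stated in the note) that the remaining 360K pages count against the 500K limit, so about 140K pages account for the 471 GB so far, and that crawl rate and size per page stay constant. The function name and parameters are hypothetical. Under these assumptions the projected size at 500K pages lands at roughly 1.7 TB, in line with the 1-2 TB guess.

```python
def project_crawl(page_limit, pages_remaining, days_elapsed, size_gb):
    """Linear projection of crawl size and remaining time.

    Assumes a constant crawl rate and a constant average size per page.
    """
    pages_done = page_limit - pages_remaining
    pages_per_day = pages_done / days_elapsed
    gb_per_page = size_gb / pages_done
    return {
        "pages_done": pages_done,
        "pages_per_day": pages_per_day,
        "days_remaining": pages_remaining / pages_per_day,
        "projected_size_tb": gb_per_page * page_limit / 1000,
    }


# Figures from the note: 500K limit, approx. 360K remaining, 14 days, 471 GB.
estimate = project_crawl(
    page_limit=500_000,
    pages_remaining=360_000,
    days_elapsed=14,
    size_gb=471,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Changing page_limit and pages_remaining gives the same kind of estimate for a higher limit such as 1.5 million, but only as far as the constant-rate assumption holds.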
Solr index: update on the new SSD drives.
Outreach and more
BnF
On the occasion of the exhibition "Apocalypse, yesterday and tomorrow", which runs until June 8th at the BnF, we published a new homepage for our Archives de l'internet presenting selections on the Apocalypse and the end of the world on the web. Here is a preview: Dépôt légal web BnF on Twitter / X
We are currently running tests for our next podcast harvest. According to the first estimates, the budget should reach around 20 TB for over 13,500 podcasts.
On the occasion of the launch of the biannual selective harvest, we launched a special crawl of some blog platforms (canalblog, over-blog, etc.). These blogs must be archived according to a particular configuration in order to avoid being dynamically blocked. 263 blogs are archived in this way.
Finally, we also launched a harvest of sites of learned societies of local history. 520 sites from all the French regions are currently being archived.
ONB
BNE
This week, we are working on a special harvest for International Women's Day. During these two weeks, we are harvesting more than 50 websites daily.
The number of -2 errors has been increasing continuously. We are studying the possible causes, to see whether some of them can be resolved and to identify which ones are due to IT security measures.
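One way to start looking into the -2 errors is to tally them per host, as in the minimal sketch below. It assumes a Heritrix-style crawl.log with whitespace-separated fields, where the second field is the fetch status code and the fourth is the URI. The function name and the sample lines are made up for illustration and are not taken from BNE's logs.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_status_by_host(lines, status="-2"):
    """Count crawl-log lines with the given status code, grouped by host."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip malformed lines and lines with a different status code.
        if len(fields) < 4 or fields[1] != status:
            continue
        # Fall back to the raw URI when no hostname can be parsed (e.g. dns: URIs).
        host = urlsplit(fields[3]).hostname or fields[3]
        counts[host] += 1
    return counts


# Made-up sample lines in the assumed format.
sample_log = [
    "2025-03-03T10:00:00.000Z 200 5120 https://example.org/ - - text/html #001 - - - -",
    "2025-03-03T10:00:01.000Z -2 - https://blocked.example.com/a L https://example.org/ unknown #002 - - - -",
    "2025-03-03T10:00:02.000Z -2 - https://blocked.example.com/b L https://example.org/ unknown #003 - - - -",
    "2025-03-03T10:00:03.000Z -2 - https://slow.example.net/ L https://example.org/ unknown #004 - - - -",
]

counts = count_status_by_host(sample_log)
for host, n in counts.most_common():
    print(host, n)
```

Hosts that account for most of the failures are natural candidates to check first, for example against firewall or other security rules.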
KB-Sweden
Next meetings
March 4th
April 7th (IRL!)
May 6th
June 3rd
July 8th
September 2nd
October 7th
November 4th
December 2nd
January 6th 2026