2025-03-04 Statusmeeting

Agenda for the joint NetarchiveSuite teleconference 2025-03-04, 13:00-14:00.

Participants

  • BnF: Sara

  • ONB: Andreas, Antares

  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue

  • KB/DK - Aarhus: Colin

  • BNE: José, Miguel, Eva

  • KB/Sweden: Peter, Pär, Johan

Update on NAS latest tests and developments

Preparation of NAS Workshop in Oslo

Status of the production sites

Netarkivet

  • 1st broad crawl 2025: step 2 is almost finished, and the crawl has run more smoothly than ever.

  • Data delivery of all text from the archive, plus some metadata, for a research project is finished: 32 TB compressed.

  • “Mere vand i systemet/More water in the system” climate change debate project

    • Proceeding as planned:

    • Using Browsertrix Cloud to crawl hard-to-get content like video (YouTube + LinkedIn logged in) and more.

    • Waiting on results of Webrecorder's development of a Facebook behaviour for logged-in crawls (expand comments, view reels/content, etc.).

    • Lots of experience and findings from using Browsertrix, including live exclusions (text, regex, etc.).

  • Browsertrix

    • Lots of updates from Webrecorder, which causes issues on local installs. Webrecorder reacts swiftly to them.

    • We have 3 instances:

      • Local:

        • Devel

        • Prod (with IP mapped for getting behind paywall-content)

      • Cloud:

        • 3 TB Pro Plan. The monthly crawl time is a bit challenging.

      • Example: The Tv2 Browsertrix harvest has now reached the maximum limit of 500K pages after 14 days and continues to harvest the remaining approx. 360K. We are up to 471 GB now, so I'm guessing 1-2 TB in total; we have almost 6 TB free. I estimate that if we had set the maximum limit to 1.5 million pages, we would get Tv2 in full depth (8 hops down) with 1 hop out, and that it would probably take about 1-1/2 months with the current equipment. So it is actually possible, I think! The last thing I saw was that it was working with the 10s and 00s, with appropriate drops to the 20s regularly. Once the 500K pages have been harvested and uploaded, I will check the crawl log for duplicates or similar and for how many URLs have been run through. For comparison, a Tv2 search with * shows 1.4 million for the entire site.

  • Solr index: update with new SSD drives.

  • Outreach and more
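The Tv2 estimate in the Browsertrix example above can be sanity-checked with a simple linear extrapolation. This is only a rough sketch: it assumes that crawl speed and average page size stay constant, and that the 471 GB corresponds to the first 500K pages. The function and variable names are illustrative and not part of any Browsertrix tooling.

```python
# Figures quoted in the Tv2 example (assumed to describe the same crawl state).
PAGES_CRAWLED = 500_000     # page limit reached
DAYS_ELAPSED = 14           # days needed to reach that limit
SIZE_GB = 471               # data harvested so far
TARGET_PAGES = 1_500_000    # proposed higher page limit


def extrapolate(pages_crawled, days_elapsed, size_gb, target_pages):
    """Scale crawl duration and size linearly to a larger page limit.

    Assumes constant pages/day and constant average page size,
    which a real crawl will only approximate.
    """
    scale = target_pages / pages_crawled
    return days_elapsed * scale, size_gb * scale


est_days, est_gb = extrapolate(PAGES_CRAWLED, DAYS_ELAPSED, SIZE_GB, TARGET_PAGES)
print(f"Estimated duration: {est_days:.0f} days")
print(f"Estimated size: {est_gb / 1000:.2f} TB")
```

Under these assumptions the estimate comes out at about 42 days and about 1.4 TB, which is in line with the 1-2 TB guess and fits within the almost 6 TB of free space.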

 

BnF

On the occasion of the exhibition "Apocalypse, yesterday and tomorrow", which runs until June 8th at the BnF, we published a new homepage for our Archives de l'internet that presents selections on the Apocalypse and the end of the world on the web. Here is a preview: Dépôt légal web BnF on Twitter / X

We are currently running tests for our next podcast harvest. According to the first estimates, the budget should reach around 20 TB for over 13,500 podcasts.

On the occasion of the launch of the biannual selective harvest, we launched a special crawl of some blog platforms (canalblog, over-blog, etc.). These blogs must be archived according to a particular configuration in order to avoid being dynamically blocked. 263 blogs are archived in this way.

Finally, we also launched a harvest of sites of learned societies of local history. 520 sites from all the French regions are currently being archived.

ONB

 

BNE

This week, we are working on a special harvest for International Women's Day. During these two weeks, we are harvesting more than 50 websites daily.

The number of -2 errors has been increasing continuously. We are studying the different causes of these errors to see whether some of them can be solved and to identify which ones occur for IT security reasons.

KB-Sweden

 

Next meetings

  • March 4th

  • April 7th (IRL!)

  • May 6th

  • June 3rd

  • July 8th

  • September 2nd

  • October 7th

  • November 4th

  • December 2nd

  • January 6th 2026

Any other business?
