Agenda for the joint NetarchiveSuite teleconference 2025-03-04, 13:00-14:00.

Participants

  • BnF: Sara

  • ONB: Andreas, Antares

  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue

  • KB/DK - Aarhus: Colin

  • BNE: José, Miguel, Eva

  • KB/Sweden: Peter, Pär, Johan

Update on NAS latest tests and developments

Preparation of NAS Workshop in Oslo

Status of the production sites

Netarkivet

  • 1st Broadcrawl 2025: step 2 almost finished. Smoother crawl than ever.

  • Data delivery of all text from the archive, plus some metadata, for a research project is finished: 32 TB compressed.

  • “Mere vand i systemet” (“More water in the system”) climate change debate project

    • Proceeding as planned.

    • Using Browsertrix Cloud to crawl hard-to-get content like video (YouTube + LinkedIn logged in) and more.

    • Waiting for Webrecorder to finish developing a Facebook behaviour for logged-in crawls (expand comments, view reels/content, etc.).

    • Lots of experience and findings from using Browsertrix, including live exclusions (text, regex, etc.).

  • Browsertrix

    • Lots of updates from Webrecorder, which causes issues on local installs. Webrecorder reacts swiftly.

    • We have 3 instances:

      • Local:

        • Devel

        • Prod (with IP mapped for getting behind paywall-content)

      • Cloud:

        • 3 TB Pro Plan. The monthly crawl time is a bit challenging.

      • Example: The Tv2 Browsertrix harvest has now reached the maximum limit of 500K pages after 14 days and continues to harvest the remaining approx. 360K. We are up to 471 GB now, so I'm guessing 1-2 TB in total; we have almost 6 TB free. I estimate that if we had set the maximum limit to 1.5 million pages, we would get around tv2 in full depth (8 hops down) with 1 hop out, and it would probably take between 1 and 1 1/2 months with the current equipment. So it is actually possible, I think! The last time I looked, the crawl was working through the 10s and 00s, with regular drops to the 20s. When the 500K pages have been harvested and uploaded, I will check the crawl log for duplicates or similar and for how many URLs have been run through. For comparison, a tv2 search with * shows 1.4 million for the entire site.

  • Solr index: update on the new SSD drives.

  • Outreach and more
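
As a rough cross-check of the Tv2 storage guess in the Browsertrix example above, the sketch below extrapolates archive size linearly from the pages harvested so far. It rests on two assumptions that are not stated in the notes: that the 471 GB corresponds to roughly 140K pages already harvested (the 500K limit minus the approx. 360K remaining), and that the average size per page stays constant. The function name is hypothetical.

```python
def projected_size_gb(target_pages, pages_done, gb_done):
    """Linearly extrapolate archive size from the pages harvested so far."""
    if pages_done <= 0:
        raise ValueError("pages_done must be positive")
    return gb_done / pages_done * target_pages


# Figures taken from the notes; pairing 471 GB with 140K pages is an assumption.
PAGE_LIMIT = 500_000
PAGES_REMAINING = 360_000
PAGES_DONE = PAGE_LIMIT - PAGES_REMAINING
GB_DONE = 471
FREE_GB = 6_000  # "almost 6 TB free", so treated here as an upper bound

for target in (PAGE_LIMIT, 1_500_000):
    size = projected_size_gb(target, PAGES_DONE, GB_DONE)
    print(f"{target:>9,} pages: ~{size / 1000:.1f} TB, fits: {size < FREE_GB}")
```

Under these assumptions the 500K crawl lands at about 1.7 TB, inside the 1-2 TB guess, and 1.5 million pages would come to about 5 TB, which stays below the free space. If the 471 GB actually covers a different page count, the projection changes proportionally.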

BnF

On the occasion of the exhibition "Apocalypse, yesterday and tomorrow", which runs until June 8th at the BnF, we published a new homepage for our Archives de l'internet presenting selections on the Apocalypse and the end of the world on the web. Here is a preview: https://x.com/dlwebbnf/status/1886717418245967885?s=46&t=jbG3gmDk9NL-WihrmL3kRA

We are currently running tests for our next podcast harvest. According to the first estimates, the budget should reach around 20 TB for over 13,500 podcasts.

To coincide with the launch of the biannual selective harvest, we started a special crawl of some blog platforms (canalblog, over-blog, etc.). These blogs must be archived with a particular configuration to avoid being dynamically blocked. 263 blogs are archived in this way.

Finally, we also launched a harvest of sites of learned societies of local history. 520 sites from all the French regions are currently being archived.

ONB

BNE

KB-Sweden

Next meetings

  • March 4th

  • April 7th (IRL!)

  • May 6th

  • June 3rd

  • July 8th

  • September 2nd

  • October 7th

  • November 4th

  • December 2nd

  • January 6th 2026

Any other business?
