...

  • 1st broad crawl of 2025: step 2 is almost finished; a smoother crawl than ever.

  • Data delivery of all text from the archive, plus some metadata, for a research project is finished: 32 TB compressed.

  • “Mere vand i systemet” (“More water in the system”) climate-change debate project:

    • Proceeding as planned:

    • Using Browsertrix Cloud to crawl hard-to-get content such as video (YouTube and LinkedIn, logged in) and more.

    • Waiting on results from Webrecorder's development work on Facebook behaviours (expanding comments, viewing reels/content, etc.), also logged in.

    • Lots of experience and findings from using Browsertrix, including live exclusions (text regex etc.); see the sketch below.
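
      To illustrate, a minimal sketch (Python) of the kind of regex-based URL exclusion this covers; the patterns and URLs are invented for illustration and are not our actual rules:

        import re

        # Hypothetical exclusion patterns, in the style of the text/regex
        # live exclusions added to a running crawl (examples only).
        EXCLUSIONS = [
            re.compile(r"/tag/"),            # skip tag-listing pages
            re.compile(r"[?&]page=\d{3,}"),  # skip very deep pagination
            re.compile(r"/print/"),          # skip print views
        ]

        def is_excluded(url: str) -> bool:
            # A URL is excluded if any pattern matches anywhere in it.
            return any(p.search(url) for p in EXCLUSIONS)

        for url in [
            "https://example.dk/articles/climate-debate",
            "https://example.dk/tag/vand",
            "https://example.dk/articles?page=1042",
        ]:
            print(("SKIP " if is_excluded(url) else "CRAWL") + " " + url)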

  • Browsertrix

    • Frequent updates from Webrecorder sometimes cause issues on local installs; swift reactions from Webrecorder when they do.

    • We have 3 instances:

      • Local:

        • Devel

        • Prod (with IP mapping for access to paywalled content)

      • Cloud:

        • 3 TB Pro Plan. The monthly crawl time is a bit challenging.

      • Example: the tv2 Browsertrix harvest has now, after 14 days, reached its maximum limit of 500K pages and continues harvesting the remaining approx. 360K of them. We are up to 471 GB now, so I am guessing 1-2 TB in total; we have almost 6 TB free. I estimate that with the maximum limit set to 1.5 million pages we would get tv2 in full depth (8 hops down, 1 hop out), and that it would probably take 1-1.5 months with the current equipment. So it is actually possible, I think (see the extrapolation sketch below). The last thing I saw was that it was working through content from the '10s and '00s, with regular dips back to the '20s. Once the 500K pages have been harvested and uploaded, I will check the crawl log for duplicates and the like, and for how many URLs have been run through. For comparison, a tv2 wildcard (*) search shows 1.4 million URLs for the entire site.
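
        As a sanity check on the estimate above, a back-of-the-envelope sketch (Python) of the extrapolation; linear scaling from the observed 14-day / 471 GB point is an assumption, since actual growth depends on page mix and deduplication:

          # Linear extrapolation from the observed tv2 crawl (assumption:
          # size and duration scale roughly with the page limit).
          observed_pages = 500_000   # current maximum page limit
          observed_gb = 471          # size so far; the finished 500K crawl will be larger
          observed_days = 14         # time to reach the limit
          target_pages = 1_500_000   # hypothetical raised limit

          scale = target_pages / observed_pages    # = 3.0
          est_tb = observed_gb * scale / 1000      # ~1.4 TB, within the 1-2 TB guess
          est_months = observed_days * scale / 30  # ~1.4 months, i.e. roughly 1-1.5

          print(f"~{est_tb:.1f} TB and ~{est_months:.1f} months at {target_pages:,} pages")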

  • Solr index: update to new SSD drives.

  • Outreach and more

...