2024-10-01 Statusmeeting

Agenda for the joint NetarchiveSuite teleconference 2024-10-01, 13:00-14:00.

Participants

  • BnF: Auriane, Sara, Haja, Nola
  • ONB: Andreas, Antares
  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
  • KB/DK - Aarhus: Colin
  • BNE: José, Miguel
  • KB/Sweden: Peter, Pär

Update on NAS latest tests and developments

Status of the production sites

Netarkivet

  • 4th Broadcrawl 2024 - step 1 - will start very soon
    • Harvest of big domains started
  • Browsertrix
    • Using it to explore its limits and possibilities
    • Streaming video can be captured, but not all media behaviours/players can be crawled. We need to identify the problems on the most important sites and report them to Webrecorder so they can be fixed
  • We still participate in a broad national collection/methodological effort on climate change, the climate debate and more. It's called "More water in the system". https://www.rigsarkivet.dk/nyheder/dokumentation-af-reaktioner-paa-klimaforandringer/
    • Using Browsertrix and the archiveweb.page extension to get dynamic content (AWP was just updated)
  • IIPC WAC 2025 Oslo proposals sent:
      1. 15-minute presentation with 5-minute Q&A
        Web Archives For Music Research

        Ægidius, Andreas Lenander

        Topics: ADVOCACY & USER ENGAGEMENT
        Keywords: advocacy, digital music, musicology, user engagement, interviews

      2. 15-minute presentation with 5-minute Q&A
        Topics: TOOLS & INFRASTRUCTURE
        Keywords: validate, warc, wacz, jwat, warchaeology


        Why, When and How To Validate Wacz/Warc Files In a Mixed Heritrix/Browser based Crawl Platform

        Tue Hejlskov Larsen

        Royal Danish Library, Denmark

        The presentation looks into: Why is it necessary to validate WACZ/WARC files?
        Do we have the tools for WARC and WACZ files - e.g. wacz, warchaeology and jwat?
        When is it necessary to validate?
        And how do the example tools perform from a usage point of view?


      3. 15-minute presentation with 5-minute Q&A
        Topics: CURATION & COLLECTIONS
        Keywords: automation, organising, curating domains, clustering


        Automatic Clustering of Domains by Industry for Effective Curation

        Thomas Smedebøl

        Royal Danish Library, Denmark

        When archiving 1.4 million .dk domains, we need practical tools to curate them by clusters.

        We have observed that, for example, hairdressers, massage therapists, and physiotherapists often have crawler traps around their booking systems. The same applies to hotels.

        Takeaway restaurants often have crawl traps around their ordering systems. Car dealerships have general issues with their used car databases.

        And online shops tend to pose great difficulties around the sorting of the offered products.

        Each industry seems to have its own set of specifics that we should take into account when curating the archiving of its domains.

        It would be useful to analyse and manage these domains by industry.

        In Denmark, all companies have a CVR number that identifies them.

        This number must be displayed on their website.

        In the central business register, the company's industry is listed. By scraping all domains for the company’s CVR number, a connection can be established between domains and industries, and we can quickly generate a list of domains within museums, churches, dentists, water utilities, and all other industries in the register.

        All it takes is good planning, a lot of scraping, access to the central business register's database, and a database.

        By working with industries as a starting point, we can improve our insight, quickly manage large volumes, and spend focused time on special cases. We can also offer researchers a unique register of segmented domains.

      4. Doing What 'Is Not Humanly Possible'

        Thomas Smedebøl

        Royal Danish Library, Denmark

        With 4 annual crawls of 1.4 million .dk domains, hands-on curation is necessary.

        We manually set byte limits for domains that exceed their limit.

        My predecessor called the task impossible! 'An endless job that you will never finish.'

        I took on the challenge, and after some time, I managed to get through all the domains between harvests.

        Without getting a repetitive strain injury and without spending more time than absolutely necessary.

        The result is better quality and reduced waste of storage.

        Learn how I did it, and be inspired in your own work.


  • DigiCAM25: Born-Digital Collections, Archives and Memory

    • SolrWayback proposal accepted

  • Anders - guest lecturing at KU (University of Copenhagen)
  • Input/talks with Nettarkivet - Norway (free-text indexing, researchers' needs, search/discovery, e.g. Jupyter notebooks on top of SolrWayback CSV exports)
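
The CVR-based clustering idea from proposal 3 above can be sketched roughly as follows. This is only an illustration, not the method actually used at Netarkivet: the regular expression, the function names, the sample pages and the `register` dictionary are all hypothetical. It assumes that a CVR number is written as eight digits after the label "CVR", and that the business register can be reduced to a simple mapping from CVR number to industry.

```python
import re
from collections import defaultdict

# Assumption: a CVR number is eight digits following the label "CVR",
# optionally with "nr", punctuation, and spaces between the digits.
CVR_PATTERN = re.compile(
    r"CVR(?:[\s\-.]*nr)?[\s.:\-]*((?:\d\s?){8})(?!\d)", re.IGNORECASE
)


def extract_cvr(page_text):
    """Return the first CVR-like number in page_text as 8 digits, or None."""
    match = CVR_PATTERN.search(page_text)
    if match is None:
        return None
    # Remove any spaces that were written between the digits.
    return re.sub(r"\s", "", match.group(1))


def cluster_by_industry(pages, register):
    """Group domains by the industry registered for the CVR found on them.

    pages:    hypothetical mapping of domain -> scraped page text
    register: hypothetical mapping of CVR number -> industry name
    """
    clusters = defaultdict(list)
    for domain, text in pages.items():
        cvr = extract_cvr(text)
        industry = register.get(cvr, "unknown")
        clusters[industry].append(domain)
    return dict(clusters)


# Made-up sample data, for illustration only.
pages = {
    "salon.example": "Kontakt os. CVR-nr. 12 34 56 78",
    "hotel.example": "Hotel Example ApS, CVR: 87654321",
    "blog.example": "A personal blog without company information",
}
register = {"12345678": "Hairdressing", "87654321": "Hotels"}

clusters = cluster_by_industry(pages, register)
```

Domains without a recognisable CVR number end up in an "unknown" cluster, which would correspond to the special cases that still need manual attention.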

BnF

We are continuing preparations for our upcoming 2024 broad crawl. The technical tests were successful, and last week we launched our test broad crawl with 2300 URLs per domain for 5.9 million starting domains.

The Olympic and Paralympic Games harvests are now complete. In total, 15 weekly and twice-monthly crawls were carried out between the beginning of June and mid-September. 1095 seeds were selected for the harvest, along with 59 YouTube channels and 340 Instagram accounts, and several press sections dealing with the subject were harvested daily.

A virtual guided tour about Environmental Issues is currently being prepared. It includes 14 themes, currently being drafted, which concern the various issues surrounding climate change, the preservation of biodiversity, health, agriculture, public policies and the energy transition. It should be published at the end of 2024 or the beginning of 2025.

ONB


BNE

Last week, we completed the broad crawl of the .es domain. I do not have the results yet, but I will present them at the next meeting. This week, we are starting the broad crawl of the .gal domain, which is the regional domain of Galicia. The .gal domain includes more than 7,000 websites, and we have already completed the preparation work to begin the harvesting.

We are also working on a project to recover missing e-journals. So far, we have identified more than 500 e-journals published between 2011 and 2023. Our goal is to include links to the Spanish Web Archive in the catalogue, using the new field 857 (MARC 21: https://www.loc.gov/marc/bibliographic/bd857.html), to restore access to these e-journals.

KB-Sweden


Next meetings

  • November 5th
  • December 3rd
  • January 7th 2025

Any other business?

  • Question from Olga: do we want to have a NAS meeting in Oslo?
  • Question regarding the WAC CfP: is there any interest in giving a presentation (20 min) or holding a panel (60 min) on the NAS community? It fits with "Towards long-term sustainability and preservation of open source tools" and would be an opportunity to (re)present NAS and our activities to IIPC members. We'll be 18 years old in 2025 (wink)