2023-03-07 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2023-03-07, 13:00-14:00.

Participants

  • BnF: Auriane, Sara, Clara, Nola
  • ONB: Andreas
  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
  • KB/DK - Aarhus: Colin
  • BNE: José, Miguel
  • KB/Sweden: Peter, Pär, Jonas

Update on NAS latest tests and developments

NAS 7.4.4 was released over the New Year. It fixes a bug relating to the download of very large crawl logs via HDFS.

If anybody wants to read a day-in-the-life story, I spent literally one day trying to a) learn Kubernetes and b) use it to create a NetarchiveSuite deployment. Not surprisingly, it wasn't a 100% success, but I learned a lot from the attempt - /wiki/spaces/~csr/pages/5931214

Status of the production sites

Netarkivet

  • Broad Crawl going great
  • March+April we will focus on Browsertrix
  • A data dump of all text from Netarkivet is in the works, for a research project on building a new Danish language model.
  • Small organisational change in Copenhagen: the section manager at Digital Cultural Heritage becomes Head of Department at the Digital Transformation department.
  • Anders visited Nettarkivet in Oslo to see their world premiere of researcher access to web archive data. https://dhnb.eu/conferences/dhnb2023/workshops/the-norwegian-web-archive-searching-and-examining-the-web-of-the-past/
    • They used Pywb 2.6 but will move to the much better 2.7.x soon.
    • They had a prototype free-text search based on natural language extracted from HTML: https://github.com/nlnwa/fulltekstsok
    • I showed the organisers SolrWayback. It will fulfill many of the researchers' wishes that came up during the workshop and save them development time. They need to index 1.8 PB of data, though.
    • Nettarkivet uses the browser-based crawler Veidemann for all their crawls, but I'm not sure of the scale (will check). They have a legal deposit law but don't get a complete TLD list like KB does from DK Hostmaster.
    • They want to work together more.
  • Twitter API!

BnF

First of all, this week we are launching our first internal harvesting workshop of the year 2022. Until March 31st, our team will experiment with Browsertrix on different types of websites. As part of this, we will also test the harvesting of social networks.

Following the TikTok crawl launched in 2022 on the theme of the elections, we are going to launch our first current TikTok harvest this month.
198 TikTok accounts or tags have been selected so far.

On March 13, there will be an exchange day on the results and future prospects of the ResPaDon project, the aim of which is "to set up a network about web archives". The day will be held at the BnF and will be broadcast live on YouTube.

ONB


BNE

Continuing with the tests to harvest Twitter. We have avoided the 429 error by launching the crawl overnight. We have a new error, but Miguel thinks it is a problem with the operating system, which is not sufficiently up to date, and not a Twitter problem. It's not possible for us to harvest hashtags or trending topics: we get a 404 error and we don’t know how to avoid it.

Special crawl for International Women’s Day: we harvest more than 170 websites and more than 100 Twitter profiles and accounts daily.

Preparing the broad crawl of open-access magazines, more than 10,000 titles. We plan to launch in March or early April.

KB-Sweden

We're preparing our first broad crawl for 2023. For this purpose we're writing a Python program to automate the creation of new harvest passes, based on a short YAML config file containing values for maxBytes, maxObjects, maxSeconds and ordertemplate per harvest pass. E.g.:

auto:
  P1:
    comment: this is an automatically created harvest pass
    objects: 3
    bytes: 1000
    seconds: 3600
    autostart: true
    previous: false
    template:
      name: broad_harvest_type_1
      placeholder_namespace: KB.
      placeholders:
        MAX_OBJECT_SIZE_BYTES: 400000000
        EXTRACT_JAVASCRIPT: false
  P2:
    previous: true
    objects: ...
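
To illustrate how a config like the one above might be consumed, here is a minimal Python sketch. It is not KB-Sweden's actual program. It assumes the YAML has already been parsed into a plain dict (for instance with PyYAML's yaml.safe_load) and that every pass sits under the "auto" key. The HarvestPass class, the build_passes function and the way placeholder_namespace is prepended to each placeholder key are hypothetical names and behaviour chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Mapping


@dataclass
class HarvestPass:
    """One harvest pass, built from a single entry under 'auto' (hypothetical model)."""
    name: str
    max_objects: int
    max_bytes: int
    max_seconds: int
    template_name: str
    autostart: bool = False
    previous: bool = False
    comment: str = ""
    placeholders: Dict[str, Any] = field(default_factory=dict)


def build_passes(config: Mapping[str, Any]) -> List[HarvestPass]:
    """Turn the parsed config mapping into a list of HarvestPass objects."""
    passes = []
    for name, spec in config["auto"].items():
        template = spec["template"]
        # Assumption: the namespace is simply prepended to each placeholder key.
        namespace = template.get("placeholder_namespace", "")
        placeholders = {
            f"{namespace}{key}": value
            for key, value in template.get("placeholders", {}).items()
        }
        passes.append(
            HarvestPass(
                name=name,
                max_objects=spec["objects"],
                max_bytes=spec["bytes"],
                max_seconds=spec["seconds"],
                template_name=template["name"],
                autostart=spec.get("autostart", False),
                previous=spec.get("previous", False),
                comment=spec.get("comment", ""),
                placeholders=placeholders,
            )
        )
    return passes


# The P1 entry from the sample, as it would look after parsing.
example_config = {
    "auto": {
        "P1": {
            "comment": "this is an automatically created harvest pass",
            "objects": 3,
            "bytes": 1000,
            "seconds": 3600,
            "autostart": True,
            "previous": False,
            "template": {
                "name": "broad_harvest_type_1",
                "placeholder_namespace": "KB.",
                "placeholders": {
                    "MAX_OBJECT_SIZE_BYTES": 400000000,
                    "EXTRACT_JAVASCRIPT": False,
                },
            },
        }
    }
}

passes = build_passes(example_config)
for harvest_pass in passes:
    print(harvest_pass.name, harvest_pass.template_name, harvest_pass.placeholders)
```

Keeping the parsing step separate from the pass-building step means the mapping can be validated or unit-tested without touching a NetarchiveSuite installation.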

We have ended a number of older selective harvests that were started because of earlier general elections in Sweden, among them a couple of unsuccessful attempts to harvest Twitter, Facebook and Instagram.

We have added selective harvests for local authorities and regions and will soon add government agencies. These harvests are introduced as a part of our work with the e-legal collections where our other methods of collecting material (RSS-based or OAI-PMH partial harvesting, FTP, web uploading) have been less successful.

Next meetings

  • April 11th
  • May 9th
  • June 6th
  • July 4th
  • September 5th
  • October 3rd
  • November 7th
  • December 5th
  • January 9th 2024

Any other business?