2023-03-07 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2023-03-07, 13:00-14:00.

Participants

BNF: Auriane, Sara, Clara, Nola
ONB: Andreas
KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
KB/DK - Aarhus: Colin
BNE: José, Miguel
KB/Sweden: Peter, Pär, Jonas

Update on NAS latest tests and developments

NAS 7.4.4 was released over the New Year. It fixes a bug relating to the download of very large crawl logs via HDFS.

If anybody wants to read a day-in-the-life story, I spent literally one day trying to a) learn Kubernetes and b) use it to create a NetarchiveSuite deployment. Not surprisingly, it wasn't a 100% success, but I learned a lot from the attempt - /wiki/spaces/~csr/pages/5931214

Status of the production sites

Netarkivet

Broad Crawl going great
In March and April we will focus on Browsertrix
- Testing use cases and getting as much feedback as possible to Webrecorder before IIPC. https://github.com/webrecorder/browsertrix-cloud/issues
- Local installation of test servers
A data dump of all text from Netarkivet, for a research project on building a new Danish language model, is in the works.
Small organisational change in Copenhagen. The section manager at Digital Cultural Heritage becomes Head of Department at the Digital Transformation department.
Anders visited Nettarkivet in Oslo to see their world premiere of researcher access to web archive data. https://dhnb.eu/conferences/dhnb2023/workshops/the-norwegian-web-archive-searching-and-examining-the-web-of-the-past/
- They used Pywb 2.6 but will soon move to the much better 2.7.x.
- They had a prototype free-text search based on natural language extracted from HTML. https://github.com/nlnwa/fulltekstsok
- I showed the organisers SolrWayback. It will fulfil many of the researchers' wishes that came up during the workshop and save them development time. They need to index 1.8 PB of data though.
- Nettarkivet uses the browser-based crawler Veidemann for all their crawls, but I'm not sure of the scale (I will check). They have a legal deposit law but don't get a complete TLD list as KB does from DK Hostmaster.
  - So the rest of the community goes for Browsertrix or ArchiveIt (Brozzler), and then there's a third choice - interesting. https://github.com/nlnwa/veidemann-harvester
- They want to work together more.
Twitter API!
- https://www.cenl.org/german-national-library-plans-archiving-german-language-twitter/
- https://www.dnb.de/EN/Professionell/Sammeln/Sammlung_Websites/twitterArchiv.html
- Option A “Donate a token”:
  - Participants register with the German National Library and receive a link to the web application and a password (enquiries by e-mail to twarchiv@dnb.de).
  - Participants submit their bearer token within the web application. Bearer tokens are used only once for the respective download; we do not store them.
  - Batches are selected.
  - Tweets downloaded to our server automatically.
  - The bearer token must be made available again if you wish to participate during the following month or if your download quota has not been fully exhausted by the end of the month.

BnF

First of all, this week we are launching our first internal harvesting workshop of the year 2022. Until March 31st, our team will experiment with Browsertrix on different types of websites. As part of this, we will also test the harvesting of social networks.

Following the TikTok crawl launched in 2022 on the theme of the elections, we are going to launch our first current TikTok harvest this month.
198 TikTok accounts or tags have been selected so far.

On March 13, there will be an exchange day on the results and future prospects of the ResPaDon project, the aim of which is "to set up a network about web archives". This day will be held at the BnF and will be broadcast live on YouTube.

ONB

BNE

Continuing with the tests to harvest Twitter. We have avoided the 429 error by launching the crawl overnight. We now get a new error, but Miguel thinks it is a problem with the operating system, which is not sufficiently up to date, rather than a Twitter problem. We are not able to harvest hashtags or trending topics: we get a 404 error and don't know how to avoid it.

Special crawl for International Women's Day: we harvest more than 170 websites and more than 100 Twitter profiles and Twitter accounts daily.

Preparing the broad crawl of open-access magazines: more than 10,000 magazines. We plan to launch in March or early April.

KB-Sweden

We're preparing our first broad crawl for 2023. For this purpose we're writing a Python program that automates the creation of new harvest passes from a short YAML config file containing values for maxBytes, maxObjects, maxSeconds and ordertemplate per harvest pass. For example:

auto:
  P1:
    comment: this is an automatically created harvest pass
    objects: 3
    bytes: 1000
    seconds: 3600
    autostart: true
    previous: false
    template:
      name: broad_harvest_type_1
      placeholder_namespace: KB.
      placeholders:
        MAX_OBJECT_SIZE_BYTES: 400000000
        EXTRACT_JAVASCRIPT: false
  P2:
    previous: true
    objects: ...

We have ended a number of older selective harvests that were started because of earlier general elections in Sweden, among them a couple of unsuccessful attempts to harvest Twitter, Facebook and Instagram.

We have added selective harvests for local authorities and regions and will soon add government agencies. These harvests are introduced as a part of our work with the e-legal collections where our other methods of collecting material (RSS-based or OAI-PMH partial harvesting, FTP, web uploading) have been less successful.

Next meetings

April 11th
May 9th
June 6th
July 4th
September 5th
October 3rd
November 7th
December 5th
January 9th 2024

Any other business?