2023-09-05 Status meeting

Agenda for the joint NetarchiveSuite tele-conference 2023-09-05, 13:00-14:00.

Participants

  • BNF: Sara, Clara, Auriane, Nola
  • ONB: Antares
  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
  • KB/DK - Aarhus: Colin
  • BNE: José, Miguel, Eva
  • KB/Sweden: Peter, Pär

Latest NAS developments

  • On NAS

- Update budget for multiple hosts https://kb-dk.atlassian.net/browse/NAS-2886
- Add filters on the harvest scale (with database changes)  https://kb-dk.atlassian.net/browse/NAS-2887
- Hide specific crawlertraps on a domain https://kb-dk.atlassian.net/browse/NAS-2889
- Fix the search by job id https://kb-dk.atlassian.net/browse/NAS-2890

  • On H3

- Improvement to the HTML parser https://kb-dk.atlassian.net/browse/NAS-2891

+ discussion of NAS problems with image resolutions that are not fetched (srcset data, responsive pages)

Discussion about topics for a future NAS workshop

Brainstorming: https://docs.google.com/document/d/1eO2cgcfwQ-7BOxrjzSApjvTEgXc-VfCPEgmqyVOriic/edit?usp=sharing

2019 Workshop: 2019 NAS workshop

Status of the production sites

Netarkivet

  • 3rd broad crawl - step 2 - 192 of 592 jobs done
  • Presenting at https://netpreserve.org/event/iipc-webinar-nordic-web-archives-for-researchers-access-tools-and-services/
  • Workshop with The Danish Agency for Digitalisation regarding all of KB's language resources (data, delivery methods etc.)
  • Great knowledge sharing with Nina Heljeback from KB-Sweden
  • Testing an on-site installation of Browsertrix Cloud
    • Seems fine so far (but does not work in Google Chrome at the moment)
  • Some focus on social media (SoMe) and freedom of speech / Quran burnings etc.
  • Data delivery projects still going on - researchers getting data via SFTP - LLM (40/15TB)
  • Still focusing on paywall content
  • Working on proposals for IIPC WAC 2024
    • Paywall sites and more
    • Maybe Browsertrix status
    • ?
    • ?
  • Update of default seeds
  • Scraping site maps to get more quality content
  • Ingesting Archive-It files from 2020-2023
  • Still working on ingesting files from the IA 1996-1999 .dk crawls

BnF

The digital legal deposit service welcomes a new colleague, Florence Simonet, as digital collections manager. She will work mainly on the harvests.

The end of the summer is marked by two important projects to save sites before they close.
Skyblog, which was the largest French blogging platform in the 2000s, closed to the public on August 21st. The BnF harvest began last week and covers more than 12.6 million blogs for a total of 1261 jobs.
The harvest is expected to last about 2 months and the estimated size is about 40 TB.

Furthermore, the Orange personal pages hosting service will close on September 5th. This is a website creation service linked to the telephone operator Orange. Harvesting tests will begin soon and should cover around 450 000 sites for about 12 TB of data.

Like every year, we are currently preparing our upcoming broad crawl which will be launched in October 2023.

ONB


BNE

We have created a new collection about the comics industry. We want to collect all creators and authors who publish their work on the Internet, most of them only on the Internet.

In July, we launched a new search engine that allows users to find websites collected in the different selective collections by title or keywords: https://www.bne.es/es/colecciones/archivo-web-espanola/buscador

We would like to ask a question about NAS: is it possible to get the number of jobs for a group of harvests? Sometimes we need to extract the number of jobs for a whole collection spread across different harvests, and we don't know how to do it.

We are working on upgrading NAS to version 7.4 and need help to fix a problem. When storing WARC files we receive a checksum error. We have consulted the documentation on replicas in archiving, but have not found where the problem is. In our installation we do not use replicas.


The error looks like:


Host: HNAS011.bne.local

Date: Tue Jul 11 13:49:30 CEST 2023

dk.netarkivet.archive.arcrepository.distribute.JMSArcRepositoryClient.store(JMSArcRepositoryClient.java:277)

Could not store 'harvester_high/99340_1689058904228/metadata/99340-metadata-1.warc' after 3 attempts. Giving up.

The returned message 'ID:23688883-192.168.81.2(d8:2a:24:86:bd:65)-37144-1689076166675: To BNE_COMMON_THE_REPOS ReplyTo BNE_COMMON_THIS_REPOS_CLIENT_192_168_81_2_HCS_HIGH_H3_011 Error: Failure while trying to store ARC file: 99340-metadata-1.warc Arcfile: 99340-metadata-1.warc, precomputed checksum: 99cb38e2bdade211ff7ab596bbe051df' was not ok while waiting for reply on store of file 'harvester_high/99340_1689058904228/metadata/99340-metadata-1.warc' on attempt number 1 of 3. Error message was 'Failure while trying to store ARC file: 99340-metadata-1.warc'

The returned message 'ID:23688998-192.168.81.2(d8:2a:24:86:bd:65)-37144-1689076169036: To BNE_COMMON_THE_REPOS ReplyTo BNE_COMMON_THIS_REPOS_CLIENT_192_168_81_2_HCS_HIGH_H3_011 Error: Failure while trying to store ARC file: 99340-metadata-1.warc Arcfile: 99340-metadata-1.warc, precomputed checksum: 99cb38e2bdade211ff7ab596bbe051df' was not ok while waiting for reply on store of file 'harvester_high/99340_1689058904228/metadata/99340-metadata-1.warc' on attempt number 2 of 3. Error message was 'Failure while trying to store ARC file: 99340-metadata-1.warc'

The returned message 'ID:23689055-192.168.81.2(d8:2a:24:86:bd:65)-37144-1689076170195: To BNE_COMMON_THE_REPOS ReplyTo BNE_COMMON_THIS_REPOS_CLIENT_192_168_81_2_HCS_HIGH_H3_011 Error: Failure while trying to store ARC file: 99340-metadata-1.warc Arcfile: 99340-metadata-1.warc, precomputed checksum: 99cb38e2bdade211ff7ab596bbe051df' was not ok while waiting for reply on store of file 'harvester_high/99340_1689058904228/metadata/99340-metadata-1.warc' on attempt number 3 of 3. Error message was 'Failure while trying to store ARC file: 99340-metadata-1.warc'
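One possible first check for the error above is to recompute the checksum of the file on the harvester machine and compare it with the precomputed value in the message. The sketch below is only an illustration under stated assumptions: the log does not name the checksum algorithm, and MD5 is assumed here because the reported value is 32 hexadecimal characters long. The function names md5_hex and checksum_matches are made up for this example and are not part of NAS.

```python
import hashlib


def md5_hex(path, chunk_size=1024 * 1024):
    """Return the MD5 digest of a file as a lowercase hex string.

    The file is read in chunks so that large WARC files do not have
    to fit in memory. MD5 is an assumption, not confirmed by the log.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare the file's digest with an expected hex string.

    Surrounding whitespace and letter case in expected_hex are ignored.
    """
    return md5_hex(path) == expected_hex.strip().lower()
```

For example, checksum_matches("99340-metadata-1.warc", "99cb38e2bdade211ff7ab596bbe051df") could be run in the directory that holds the file. A match would suggest the file is intact on the harvester side and that the failure happens later in the store step. A mismatch would suggest the file changed after its checksum was computed, or that the assumed algorithm is wrong.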

KB-Sweden


Next meetings

  • October 3rd
  • November 7th
  • December 5th
  • January 9th 2024

Any other business?