2024-03-05 Statusmeeting

Agenda for the joint NetarchiveSuite teleconference 2024-03-05, 13:00-14:00.

Participants

  • BnF: Auriane, Nola, Sara
  • ONB: Andreas, Antares
  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
  • KB/DK - Aarhus: Colin
  • BNE: José, Miguel, Eva
  • KB/Sweden: Peter, Pär

Update on NAS latest tests and developments

Everybody should check that they have access to the new wiki here and that their login is functioning.

Status of the production sites

Netarkivet

  • 1st Broadcrawl 2024, step 2: no new jobs, so almost finished
  • Still testing on-site installation of Browsertrix Cloud
    • Playback should be the next focus for IIPC/Webrecorder: replayweb.page (works best) vs. pywb
  • Social media: focus on technology, defining representativeness, etc. Meeting with external companies for inspiration.
  • Language models at a Nordic level discussed.
  • Preparing for IIPC WAC 2024 in Paris

BnF

For the European elections we have planned a focused harvest on the subject. It starts this week and will last until mid-July. We will launch jobs every week and every month, and we will supplement the crawls with YouTube, TikTok and Instagram harvests.

We have also just completed a harvest of the websites of photographers who participated in the Great Photographic Commission.
This commission is entitled “Radioscopy of France: views on a country affected by the health crisis”, and its implementation was entrusted to the BnF by the Ministry of Culture. 200 photographers were selected for this commission. The harvest includes 159 photographers' sites.

Our latest video harvest, launched at the end of January, is still in progress. There are approximately 15 000 videos left to crawl out of the 58 669 planned.

ONB


BNE

The Basque Country elections will be held in April; we are working with the regional web curators to cover this event.

For some time now, we have been receiving complaints from website administrators about requests they receive from Heritrix. They have sent us log excerpts showing Heritrix requesting non-existent URLs. Here is an example from the website “https://www.vistaalmar.es/”:

[05/Feb/2024:15:12:53 +0100] "GET /galeria/picture/pez-leon/category/pez-leon%20-%20pez-leon.jpg HTTP/2.0" 404 2246 "https://www.vistaalmar.es/galeria/picture/pez-leon/category/10-mundo_marino" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:11 +0100] "GET /-webkit-linear-gradient(top,%20 HTTP/2.0" 404 1554 "https://www.vistaalmar.es/cookies-footer.html" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:37 +0100] "GET /medio-ambiente/apply.activities HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:38 +0100] "GET /medio-ambiente/apply.activities.actors.noPartner HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:38 +0100] "GET /medio-ambiente/apply.activities.actors.partners HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:39 +0100] "GET /medio-ambiente/apply.choice HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:40 +0100] "GET /medio-ambiente/apply.choice.actors.partnersWithCount HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:13:41 +0100] "GET /medio-ambiente/apply.choice.actors.partnersWithCount_plural HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

[05/Feb/2024:15:14:24 +0100] "GET /medio-ambiente/main.activities.actors.partnersWithCount HTTP/2.0" 404 1554 "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp" "Mozilla/5.0 (compatible; bne.es_bot; https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57"

Upon investigation, we've observed that in many cases, the URLs causing the complaints (the one in red, for example) are constructed using code fragments found in various JavaScript files.
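To help triage complaints like this one, the following is a minimal sketch of how log excerpts in the layout shown above could be summarised: it counts the crawler's 404 requests per referring URL and flags referrers hosted outside the complaining site, which may point to third-party scripts as the origin of the malformed URLs. The function name `summarize_404s`, the regular expression and the assumed line layout are illustrative and are not part of Heritrix or NetarchiveSuite; logs from other servers may need a different pattern.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumed layout, matching the excerpts above:
# [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>\S+)\s+(?P<proto>[^"]+)"\s+'
    r'(?P<status>\d{3})\s+(?P<size>\d+)\s+'
    r'"(?P<referer>[^"]*)"\s+"(?P<agent>[^"]*)"'
)


def summarize_404s(lines, bot_token, site_host):
    """Count the bot's 404 requests per referer and collect off-site referers.

    Hypothetical helper: bot_token is a substring of the crawler's
    user-agent, site_host is the host name of the complaining website.
    """
    per_referer = Counter()
    off_site = set()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # line does not follow the assumed layout
        if match["status"] != "404" or bot_token not in match["agent"]:
            continue
        referer = match["referer"]
        per_referer[referer] += 1
        # A referer on another host suggests the URL came from a third-party file.
        if urlsplit(referer).hostname != site_host:
            off_site.add(referer)
    return per_referer, off_site


# Sample lines copied from the excerpts above.
UA = ("Mozilla/5.0 (compatible; bne.es_bot; "
      "https://www.bne.es/es/colecciones/archivo-web-espanola/aviso-webmasters) Firefox/57")
CMP = "https://choices.consentframework.com/js/pa/37261/c/cAHxP/cmp"
SAMPLE = [
    f'[05/Feb/2024:15:13:11 +0100] "GET /-webkit-linear-gradient(top,%20 HTTP/2.0" '
    f'404 1554 "https://www.vistaalmar.es/cookies-footer.html" "{UA}"',
    f'[05/Feb/2024:15:13:37 +0100] "GET /medio-ambiente/apply.activities HTTP/2.0" '
    f'404 1554 "{CMP}" "{UA}"',
    f'[05/Feb/2024:15:13:39 +0100] "GET /medio-ambiente/apply.choice HTTP/2.0" '
    f'404 1554 "{CMP}" "{UA}"',
]

if __name__ == "__main__":
    counts, off_site = summarize_404s(SAMPLE, "bne.es_bot", "www.vistaalmar.es")
    for referer, n in counts.most_common():
        print(n, referer, "(off-site)" if referer in off_site else "")
```

Grouping by referrer rather than by requested path keeps the summary short, since a single JavaScript file can give rise to many different non-existent paths.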

KB-Sweden


Next meetings

  • April 2nd
  • May 7th
  • June 4th
  • July 2nd
  • September 3rd
  • October 1st
  • November 5th
  • December 3rd
  • January 7th 2025

Any other business?