2022-02-08 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2022-02-08, 13:00-14:00.

Participants

  • BNF: Clara, Sara, Auriane
  • ONB: Andreas
  • KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
  • KB/DK - Aarhus: Colin
  • BNE: Alicia, Miguel
  • KB/Sweden: Peter, Jonas

Update on NAS latest tests and developments

Release of NetarchiveSuite version 7.3. This release fixes a number of issues in the execution of external Hadoop jobs and in the handling of top-level domains.

The full release notes and download links can be found on the release page:
https://kb-dk.atlassian.net/wiki/display/NAS/NetarchiveSuite+7.x+Release+Notes

Status of the production sites

Netarkivet

  • The first full broad crawl in a long time is in preparation and will be launched as soon as possible with NAS 7.3 in production.
  • Step 1 index generation is already done, which also acted as a real-world stress test of our system with NAS 7.3. It worked without any issues and was 1.5 times faster than before.
  • RHEL 8 (Red Hat Enterprise Linux release 8.5) is tested and running on our test system. It works, but some issues need to be solved when downgrading to PostgreSQL 12. Our NAS/archive infrastructure running on Java 8 and RHEL 8.5 is supported and expected to run until May 2031. This is important and relevant information for our analysis of outsourcing harvesting vs. community-driven development.
  • The Bitmagasine-project is almost finished. It has been a complex project with around 10 FTE used in total.
  • User-Friendly High Fidelity Browser-Based Crawling System for All is pending final approval, but most likely there will be an official announcement/kick-off on the IIPC Call with Members on the 16th of February.
  • Data dump delivery/e-resources is about to become a regular service with 2 hours of free counselling per project. If more time is needed, an estimate/offer must be made and accepted before proceeding.
  • YouTube anticipates that it may take a few more weeks, but it looks like they will be able to accommodate our wishes to harvest better/get an exemption. They say they are in the middle of changing from a more ad hoc process to something more organized. They also said that they do not receive many special requests of this kind.
  • Contacted a lot of sites with paywalled content to get IP-validated access.
  • Contacting Reddit - it is so slow to harvest that we believe they have throttled us somehow.
  • Looking into getting/analyzing foreign domains with Danish content:

    https://brandmentions.com/
    https://www.brandwatch.com/
    https://publicwww.com/

    Internet Archive might also want to sell us a list of domains with Danish content based on everything they harvest.
  • Ingesting/validating Webrecorder harvests/files in our test system. When ready, we'll show it to the department manager, and hopefully this content will become part of our main archive.
  • Small Tour de France event harvest ongoing
  • DK-curators please feel free to add any other info that is relevant.

BnF

At the end of January, we put version 8.7.0 of Wayback into production. This release includes the update of our Press and News virtual guided tour (until January 2022) and the publication of a new homepage about artificial intelligence for our "Archives de l’internet". Thirty-two of the sixty-three selected captures from the Artificial Intelligence harvest are displayed randomly each day.

At the beginning of February, version 7.4.0 of BCWeb was also released. This version includes several improvements and changes (faster display of the application's pages, the ability to modify the keywords section for inactive records...).

The 2022 Elections and Winter Olympic Games harvests were launched on January 17 and 31 respectively. The first will last five and a half months and the second one and a half months. Both are launched twice a day and twice a month.
Lastly, our biannual crawl is in preparation and will be launched on February 14.

ONB


BNE


KB-Sweden


Next meetings

  • March 8th
  • April 12th
  • May 10th
  • June 7th
  • July 5th
  • September 6th
  • October 4th
  • November 8th
  • December 6th
  • January 10th, 2023

Any other business?
