2021-09-07 Status meeting

Agenda for the joint NetarchiveSuite teleconference 2021-09-07, 13:00-14:00.

Participants

  • BNF: Clara, Sara
  • ONB: Andreas
  • KB/DK - Copenhagen: Anders (absent), Thomas
  • KB/DK - Aarhus: Colin
  • BNE: José, Alicia, Miguel
  • KB/Sweden: Pär, Peter

Update on NAS latest tests and developments

Version 7.2 has been released:

  1. A hopefully permanent fix to the recurring issue with inconsistent software versions.
  2. Many new Heritrix updates pulled in from upstream, including ExtractorChrome.
  3. Some cleanup of our modifications to the sitemap-extraction modules in Heritrix.

Status of the production sites

Netarkivet

  • Outsourcing vs. Open Source. Still working on this.
    • A condensed summary of the analysis is being prepared for management.
    • We are looking into how many more years we can use the current platform (NAS). If it works on RHEL 8, it can be used for many years, but it is quite hard to maintain for people without very specific knowledge.
    • I have told Ilya that a browser-based initiative should be an IIPC project like the PyWb migration, rather than the usual kind of project in which two or more institutions join forces.
  • Broad harvest
    • Step 1 finished.
    • We are starting step 2 today with a 5-day limit and max bytes set lower than normal.
    • We have only 3 weeks to finish because we need to implement Bitmagasinet.
  • OnlyFans. We harvested some content because explicit content was on its way out (but it ended up being a big marketing stunt).
  • Testing the new browser-based harvesting module in Heritrix, with mixed results.
  • JWAT and validation of WARC 1.1 files.
  • YouTube-Dl extractor. We still have some issues. Could we have a short Zoom meeting with the institutions successfully running this, to troubleshoot?
  • Looking into the collection strategy for 2021-2023 and 2023-2025. It will probably be along the lines of what we already do, but will also include PyWb for better QA and the Twitter API for better data from Twitter.
  • Participated in a collaboration on a survey and a task force for events with other Danish institutions, to work on a rapid-response procedure for the next time a big event happens.
  • Working with 5 groups of approx. 7 people each from the IT University of Copenhagen on finding thematic content, e.g. on the 2021 municipal elections.

BnF


ONB


BNE

  • We have now updated our PostgreSQL database to the latest version. We are testing both NAS versions, 6.2 and 7.1, to avoid problems, and we expect to update to 7.1 in a few weeks.


  • This year, in the sixth broad crawl of the .es domain, 1,970,000 domains were crawled (around 68 TB of data). 87% of the saved websites were crawled completely.


  • We have a new collection to save all websites about the LGBT community.


  • Since April we have had problems harvesting Facebook. When we try to harvest Facebook, only the password screen is saved. We tried to contact Facebook Spain, but it was not possible. We are going to deactivate more than 500 seeds. Does anyone know a solution to this problem?

KB-Sweden


Next meetings

  • October 5th
  • November 2nd
  • December 14th
  • January 11th, 2022

Any other business?
