2021-10-05 Statusmeeting
Agenda for the joint NetarchiveSuite tele-conference 2021-10-05, 13:00-14:00.
Participants
- BNF: Clara, Sara, Auriane
- ONB: Andreas
- KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
- KB/DK - Aarhus: Colin
- BNE: José, Alicia, Miguel
- KB/Sweden: Peter, Jonas
Update on NAS latest tests and developments
Status of the production sites
Netarkivet
- Broad Crawl. We started step 2 today with a 5-day limit and max bytes set lower than normal. It is paused now due to an issue with disks.
- Migration to NAS 7.2 and Bitmagasinet is still ongoing. The Hadoop part is going a bit slower than planned.
- Working with 6 groups of approx. 7 people each from the IT University of Copenhagen on finding thematic content, e.g. on the Municipal Election 2021, and on content nomination/top lists.
- Looking at https://brandmentions.com/ to find more Danish or Danish-relevant content on the web and social media. It looks promising.
- Experimenting with other methods to find relevant content.
- The PyWb project idea is closer to getting an actual priority and a go/no-go decision.
- Finalizing Collection strategy for 2021-2023 and 2023-2025.
- Blacklight - we are closing the service. SolrWayback has taken over.
- Access to Netarkivet for employees of Ministry of Culture institutions who need it for work/collection purposes is being looked into (e.g. Rigsarkivet, SMK, DFI).
- IIPC project proposal: "User-Friendly High Fidelity Browser-Based Crawling for All - Proposal for IIPC Discretionary Funding Program 2021-2022" (PDF)
LEAD IIPC INSTITUTION: Royal Danish Library
2ND IIPC INSTITUTION: UK Web Archive
3RD IIPC INSTITUTION: University of North Texas Libraries
4TH IIPC INSTITUTION: National Library of New Zealand
- Working together with Rigsarkivet (they mainly archive non-published content) to see if we can help each other. If we harvest all the information, institutions do not need to pay Rigsarkivet to archive their data.
- Talking to Kees (KB - NL) about their testing and use of SolrWayback and NAS.
BnF
The preparations for our 2021 broad crawl are coming to an end, and it will be launched on October 11th. Our seed list is made up of 5.5 million domains divided into 1108 jobs, for a total budget of 115 TB.
This year, we carried out a test broad crawl between the 9th and the 17th of September, covering 20% of the complete seed list.
The objective was to obtain more precise indicators in order to define an appropriate budget.
Last month, a new version (wayback 8.6) of our "Archives de l'internet" went into production. The Videos virtual guided tour has been updated with 383 new YouTube channels.
As mentioned in September, we also published a virtual guided tour about the web from Lorraine. This came with a new homepage around this theme. We also produced a slide show with some captures for which we have obtained reproduction rights.
You can consult it at the following address: https://www.bnf.fr/sites/default/files/2021-09/Diaporama_Parcours_guid%C3%A9_Le_web_lorrain.pdf
Lastly, we are pleased to announce that the recordings made on the launch day of the ResPaDon project can be viewed at this address: https://respadon.hypotheses.org/1
ONB
- Starting in 2022, we are allowed to use 11 TB of storage instead of 6 TB.
- Migrating old web pages crawled with HTTrack into ARC files and into NAS.
BNE
- Last week we published a post on the IIPC blog about the collaboration between the Barcelona Supercomputing Center and the National Library of Spain, titled “The Spanish Web Archive as a training field for Natural Language Processing models”: https://netpreserveblog.wordpress.com/2021/09/29/the-spanish-web-archive-as-a-training-field-for-natural-language-processing-models/
- An emergency crawl about the volcanic eruption on La Palma and its consequences.
- A new meeting with our partners from different regional libraries, focused on keywords.
- The NAS infrastructure is stable after some problems last week. Our IT department is testing NAS in the pre-production (Pre) environment with Red Hat 8, PostgreSQL 13, Heritrix 7.2 and OpenWayback 4.2.0 before installing the new versions in our system.
- We have a technical question: is it possible to run WaybackIndexerApplication with a PostgreSQL database?
KB-Sweden
There is now a consultant at KB who will work on upgrading and improving our web archiving environment. His name is Jonas Linde, and he will participate today.
Next meetings
- November 2nd
- December 14th
- January 11th, 2022
Any other business?