2024-06-04 Status meeting
Agenda for the joint NetarchiveSuite teleconference 2024-06-04, 13:00-14:00.
Participants
- BnF: Auriane, Sara, Haja
- ONB: Andreas, Antares
- KB/DK - Copenhagen: Anders, Thomas, Stephen, Tue
- KB/DK - Aarhus: Colin
- BNE: José, Miguel, Eva
- KB/Sweden: Peter, Pär
Update on NAS latest tests and developments
Everybody should check that they have access to the new wiki here and that their login is functioning.
Status of the production sites
Netarkivet
- 2nd Broadcrawl 2024 - step 2
- Ended
- Preparing 3rd Broadcrawl 2024 - step 1
- Heritrix
- Kris - Iceland: new version coming soon
- New requests - source set etc.
- Finalizing testing of the on-site installation of Browsertrix. Upgraded to the latest version.
- https://github.com/webrecorder/browsertrix/releases
- https://github.com/webrecorder/browsertrix-crawler/releases
- Input from ONB about NAS/Btrix-integration
- Went really well:
- SolrWayback as a search & discovery tool for researchers to work with web archive collections - Workshop at DHNBC 2024, Iceland, with Jon from Nettarkivet, Norway https://www.conftool.org/dhnb2024/index.php?page=browseSessions&form_session=94&presentations=sho
- https://github.com/netarchivesuite/solrwayback/releases/tag/5.1.0
- Has BnF tested it? Substantial speed-up when exporting (CSV, WARC, etc.) from large multi-sharded collections. See #329. This feature still needs a little more testing. Feedback is welcome.
- Data delivery - AI4welfare
- BelgicaWeb research project (Anders member of follow-up committee)
- Web Archives: An Untapped Source Of Smart Data - https://londondataweek.org/events/#web-archives-an-untapped-source-of-smart-data
BnF
ONB
BNE
Crawl of electronic serials finished. Last year we collected too much irrelevant content, especially on very large websites, where we could not reach the links to the most important resources, such as PDFs. This time we set path = 10 steps to avoid collecting external content. The result looks better with this method, and we have significantly reduced the archived content, but we still have to do some quality assurance in the coming weeks. Path = 10 steps, 9,143 serials, 4,825 domains, <1.5 GB per domain, archived GB: 1,767 (4,500 in 2023).
For a few months now, we have been part of the Keepers Registry (preservation of electronic serials). We have a large number of journals, especially born-digital open-access scientific journals, collected by the web archive. Despite the quality controls, it is difficult to ensure that all their issues have been saved, because we collect them in a bulk crawl of journals with an ISSN. We are considering whether to check the issues and upload them to the repository and a preservation system (Libsafe), or to keep them in the web archive and consider them preserved.
June 2024: cutover preparation for the Alma migration. This is now the top priority at the BNE.
We are planning a workshop with several Latin American countries interested in web archiving. It will probably take place in November and will cover online legal deposit and the web archive.
KB-Sweden
Next meetings
- July 2nd
- September 3rd
- October 1st
- November 5th
- December 3rd
- January 7th 2025