...

Status of the production sites

Netarkivet

  • The second broad crawl will start soon.
  • A data dump of all text from Netarkivet, for a research project building a new Danish language model, is in the works. See more here: https://github.com/kb-dk/kb-scripts/tree/master/all-text
  • Awaiting an invitation from Norway's Nettarkivet to learn more about their archive.
  • Twitter API: still awaiting a new solution. Considering contacting them.
  • Focus on IIPC WAC 2023. Presentations are uploaded and available. SESSION 8: BROWSER-BASED CRAWLING (password)
  • Asked for the PyWb analysis to be prioritized for the maintenance sprint (May).

BnF

Our internal harvesting workshop on Browsertrix finished at the end of March. A total of 10 testers participated, and more than 80 crawls were launched across the 40 use cases analysed.
Each tester completed a use case analysis grid to structure the test feedback. Our feedback will be summarised and presented to the community soon.

As part of our internal project to improve our harvests, we are currently running tests on Twitter accounts. The selected accounts are not covered evenly by the harvest; in particular, many images are missing. According to our tests, this may be due to the volume of data that we try to harvest.

The Environmental issues and Artificial Intelligence harvests were launched at the end of March and cover more than 700 and 650 selections respectively. The AI harvest has been enriched with selections about prompt art and generative AI.

Finally, the international ResPaDon symposium entitled “The web: source and archive” was held in Lille from 3 to 5 April. It prompted many exchanges between researchers and library professionals about web archives.

...