2019-05-07 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2019-05-07, 13:00-14:00.

Participants

  • BNF: Sara, Géraldine
  • ONB: Andreas, Michaela
  • KB/DK - Copenhagen: Anders, Tue, Stephen
  • KB/DK - Aarhus: Colin, Sabine, Kristian, Knud Åge
  • BNE: Alicia, Mar
  • KB/Sweden: Par, Thomas, Peter

Update on NAS latest tests and developments

We have released a new version of Umbra, which we believe is more stable than the latest IA version.

Status of the production sites

Netarkivet

We started our second broad crawl for 2019. The first step has a byte limit of 50 MB per domain; the second step will have a byte limit of 16 GB per domain. We adjusted the byte limits after analyzing last year’s broad crawls.

“Ultra big sites”, “OAI-extraction (research databases)”, “Ministries and administrative bodies” and YouTube crawls are running simultaneously with the broad crawl.

We started our event crawl for the approaching parliamentary elections. We had hoped that they would take place together with the elections for the European Parliament, but they will not. At the monthly curator meeting tomorrow, we will have to agree on how to deal with the European Parliament elections.

We prepared our list of politicians’ Facebook profiles and fixed the URLs, as BNE does, for the crawl with our Archive-It account.

We upgraded UMBRA in the production system.

BnF

In April, we organized an internal workshop on responsive websites. As a start, we selected a sample of websites. We first tried to view the archives of these sites with more recent versions of Firefox and Chromium: half of the problems disappeared, which led us to conclude that many problems are in fact access issues and not crawling issues. As we used Firefox as the user agent, the visual quality was better with Firefox than with Chromium.

In a second step, we analysed the source code of the websites which had crawling issues. The conclusion of this analysis was that each site has its own peculiarities. To solve the crawling problems, we tried:

- to use various user agents (e.g. a specific version of the Firefox user agent, or Chrome). This did not significantly change the quality of the crawl, and the choice of user agent must be consistent with the browser used for access.

- to crawl the websites with the IIPC Heritrix 3 version and the default JavaScript extractor. It solved some problems, but not all: its JavaScript extractor, which seemed to be more powerful, dramatically reduced the number of 404 errors related to JavaScript.

- to crawl the websites with the latest release of Umbra included in NAS. During the tests, Umbra crashed, as it did during our first tests in December. It is very efficient for social networks such as Instagram or Pinterest, especially for crawling images. But due to the instability of the application, it is impossible to put it into production. We will probably test it again during the preparation of our broad crawl tests.

ONB

- We started our Event-Crawl for the European Parliament Elections.
- We are still working on our hardware exchange. As soon as this is finished, we will start our Domain Crawl for this year.
- Our colleagues from ONB Labs took part in https://www.pydays.at in Vienna last week, where they presented the web archive API and showed what people can do with it. See all information at https://cfp.linuxwochen.at/de/LWW19/public/events/898

BNE

We usually start our annual broad crawl in April, but due to the changes in our IT team we had to postpone it until summer.

For the same reason, we have not been able to proceed with the BCWeb update, nor even to study the implementation of UMBRA.

We are working intensively on the collections for the different elections: European Parliament elections, local elections and Spanish Government elections. We launch a daily harvest in each collection to collect as much as possible from Twitter and Facebook profiles.

KB-Sweden

Testing

Next meetings

  • June 4
  • July 2
  • September 10
  • October 8
  • November 5
  • December 3
  • January 7, 2020

Any other business?
