2018-03-06 Status meeting

Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2018-03-06, 13:00-14:00.

Participants

  • BNF: Sara, Géraldine
  • ONB: Michaela, Andreas
  • KB/DK - Copenhagen: Tue, Nicholas, Soren
  • KB/DK - Aarhus: Colin, Sabine
  • BNE: Mar
  • KB/Sweden: Bengt

Upcoming NAS developments

NAS 5.4 development has now finally reached the Code Freeze stage! Automatic tests have passed, and we are now running the Sanity Test, which is a prerequisite for starting the full Release Test.

Status of the production sites

Netarkivet

At the moment a new agreement between the unions for public employees and the national, regional and local public employers has to be negotiated – and the negotiations have “crashed”. In all probability there will be strikes/lockouts.

Therefore we have prepared an event crawl. What is new is that we will involve experts from outside our organization, namely researchers from the Workers Museum. Our intention was to use BCWeb, but we have not yet set up external access to our BCWeb. So this time we will use a Google spreadsheet (the IIPC collaborative collection type).

We have started our first broad crawl for 2018, and this time it will be a full broad crawl. Step 1, with a limit of 10 GB per domain, ran from January 9 to January 21. Step 2, with a limit of 12 GB per domain, started on January 25 and is still running.

As to the organization of Netarchive, we now have a temporary steering committee with Tonny Skovgaard Jensen as chair and Ditte and Bjarne as members. Colin and Tue are still members of the coordinating team, and a new main coordinator will be hired. The coordinator should have both curatorial and crawl engineering skills.

BnF

In the middle of February, we launched our bi-annual crawl, which should collect around 2.25 TB. At the launch, we encountered two problems. The first concerned the saturation of the server storage used for creating the deduplication index: we need to rethink all our server workspaces with the new infrastructure. The second was that a few crawlers lost the connection to the NFS server when we restarted the crawl, and some jobs failed. We didn't restart the failed jobs individually because, in that case, some information is missing from the warcinfo record in the WARCs.

When we relaunched the whole crawl, we again encountered the problem of two identical jobs being created with the same ID: the harvest definition was paused automatically before all the jobs were created. So we decided to stop the crawl and relaunch it once again.

In conclusion, there's almost no deduplication for the bi-annual crawl and the amount of data crawled will therefore be larger than expected.

Since then, Lam has fixed the problem with resubmitted jobs: the harvestInfo.xml fields are now correctly added to the warcinfo records for these jobs. We must therefore upgrade to a NAS version that includes this fix before launching our annual crawl.

ONB

  • Our recent event crawls included some local elections; there will be one more in April.
  • We received a request from a website owner to recover web content that was lost from his server. Andreas is working on a way to extract the data from the archive in order to hand it over to the website owner.
  • An Austrian blogging platform will shut down in May. We would like to archive the content because it is mostly Austrian, rather old content on a .net domain. Legally, it was not clear whether we have to notify every single blogger or just the platform operator. ONB’s legal advisor recommends treating it as a small domain crawl. We will contact the platform operator for some details and hope we can archive everything before the shutdown.

BNE

This is the update for March from the National Library of Spain.

As we are focused on procedures and workflows related to the deposit of non-print publications (non-print legal deposit apart from web archiving), our web archiving tasks are mostly business as usual.

We are trying to solve the last problems related to the NAS 5 installation. Most harvests are running properly. Regional collections are in the process of growing and stabilizing. Web curators from the regional deposit libraries work directly and in a stable way on BCWeb.

We are now in the process of recruiting specialized web curators on topics like Literature, Social Science, Biology and Medicine, Science and Technology… Since we don’t have special collections on these topics at the Library, we don’t have web archiving collections on them yet. There is a network of University Libraries in Spain, and we are going to sign an agreement with them and create a working group of web curators from specialized libraries to launch and maintain web collections on these topics, which are so far an important gap in our collection.

Those who collaborate want, in return, access to the web archiving collection from their libraries for their users. This opens a discussion in our Library because, under the strict interpretation of our legal deposit law, access from other libraries is restricted to those with legal deposit competencies, and the University libraries don’t have this competency. So we have to study with legal advisors which alternatives we have. We’ll keep you informed about the progress of this matter.

Regarding the access interface to our web archive collections, as you know, we opened access through OpenWayback last summer, but via a pilot interface that is quite unfriendly. We are therefore studying how to redesign it, including access by collections, subjects and titles. Taking the Austrian access interface as a model, we have started working on it.

KB-Sweden


Next meetings

  • April 10th
  • May 15th
  • June 12th
  • July 17th
  • September 4th

Any other business?