2020-03-03 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2020-03-03, 13:00-14:00.

Participants

  • BNF: Clara, Sara, Géraldine
  • ONB: Andreas
  • KB/DK - Copenhagen: Tue, Stephen, Anders 
  • KB/DK - Aarhus: Colin, Sabine, Kristian
  • BNE: Alicia
  • KB/Sweden: Pär, Peter

Join from PC, Mac, Linux, iOS or Android:

    https://kbdk.zoom.us/j/104443571

Or an H.323/SIP room system:

    H.323: 109.105.112.236
    Meeting ID: 104 443 571

    SIP: 104443571@109.105.112.236

Or Skype for Business (Lync):

    https://kbdk.zoom.us/skype/104443571

Or Telephone:

Denmark: +45 89 88 37 88 or +45 32 71 31 57
United Kingdom: +44 203 051 2874 or +44 203 481 5237 or +44 203 966 3809 or +44 131 460 1196
Finland: +358 9 4245 1488 or +358 3 4109 2129
Sweden: +46 850 539 728 or +46 8 4468 2488
Norway: +47 7349 4877 or +47 2396 0588
US: +1 669 900 6833 or +1 646 558 8656
    Meeting ID: 104 443 571

    International numbers available: https://zoom.us/u/acRu0MV3xJ

You can join a meeting using the apps on a PC, a tablet or a smartphone, but you can also use the browser-based version (it works with newer versions of Chrome or Firefox).


Update on NAS latest tests and developments

Review of home-brewed Heritrix extensions (?) and feedback on tests.

Status of the production sites

Netarkivet

  • We are preparing to start our next broad crawl in mid-March. We are awaiting the IIPC Heritrix release that includes the last of our pull requests.
  • The library’s executive board wants to keep only one online copy of Netarkivet because of the cost of online copies (at the moment we have two, one in Aarhus and one in Copenhagen).
    A steering committee, a project group and IT developers are working on whether and how we can achieve a single online platform for Netarkivet with the same services and performance, while keeping the right level of security in terms of preservation.
  • A shitstorm provoked by a commercial for SAS (Scandinavian Airlines - https://youtu.be/ShfsBPrNcTI), which claimed that everything typically Scandinavian was stolen from other nations, showed that we have to react very quickly. We are running a kind of mini event harvest focusing on the reactions on Twitter and Facebook.
  • We reopened our discussion on how to use BCWeb by looking back at our experiences with some pilot projects. We might enter into agreements with researchers or other dedicated persons who could use BCWeb for small thematic collections, especially content outside .dk.
  • We decided to welcome a French student for about three months in the autumn. We will probably invite him to a video conference soon to talk about the tasks he could work on for Netarkivet.

BnF


ONB


BNE

We celebrated the 10th anniversary of the Spanish Web Archive. We organized a conference where we had the opportunity to share experiences with other colleagues.

New collections:

  • We are now working on two event crawls about the elections in two regions: the Basque Country and Galicia
  • We want to launch a new collection on feminism this month

Serials broad crawl: We have been preparing a list of freely accessible serials URLs. There are almost 8,000 of them, and we are launching a kind of broad crawl to harvest them.

KB-Sweden

(At last we write something here. We apologize for not doing it before.)

Recap: We have run selective harvests successfully since the middle of 2018, but during 2019 we had lots of problems running NAS snapshots above a certain size (number of domains).
There was a bottleneck in the system, but it was hard to figure out what it was. At last we found it: the PostgreSQL database server was overwhelmed and database requests were queuing up, which made the system slow and hard to monitor (as the GUI updates were out of phase).

Once we had identified the bottleneck, we added some indexes to the database and suddenly everything went like clockwork! At least technically.

So in December we were able to complete the first part of our broad crawl (with just a 500 kB limit). In January we started part 2, with around 500,000 domains remaining and limits of 2 GB and 50,000 objects. Over 90 % of the jobs have now run, so it will probably be done within this week. Very good!

One thing we discovered when monitoring is the large number of sites that are the same kind of shop, displaying many thousands of products with a couple of images of each. Another is sites related to sports activities, with tons of match results and player statistics. This, combined with errors in links that create looping URLs, can lead to millions of URLs in the queue. It takes some time before such jobs reach the 50,000-object limit, so we have been monitoring and deleting URLs from the queue now and then.
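
As an illustration only, the kind of looping URL described above could be spotted with a simple heuristic: flag a URL when the same path segment occurs more than a few times. The sketch below is not the procedure used at KB-Sweden and is not part of NAS or Heritrix. The function names and the threshold of 3 repetitions are assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_loop(url, max_repeats=3):
    """Return True if any single path segment occurs more than max_repeats times.

    Hypothetical heuristic: broken relative links often produce paths such as
    /a/b/a/b/a/b/..., where the same segments keep reappearing.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments:
        return False
    return max(Counter(segments).values()) > max_repeats


def filter_queue(urls, max_repeats=3):
    """Split queued URLs into (keep, drop) lists using the heuristic above."""
    keep, drop = [], []
    for url in urls:
        if looks_like_loop(url, max_repeats):
            drop.append(url)
        else:
            keep.append(url)
    return keep, drop
```

A heuristic like this will not catch the shop and sports-statistics sites, whose URLs are legitimate but very numerous. Those still depend on the per-job limits and on manual monitoring.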

Next meetings

  • April 7, 2020
  • May 5, 2020
  • June 9, 2020
  • July 7, 2020
  • September 8, 2020
  • October 6, 2020
  • November 3, 2020
  • December 8, 2020
  • January 5, 2021

Any other business?
