2019-10-08 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2019-10-08, 13:00-14:00.

Participants

  • BNF: Clara, Sara
  • ONB: Andreas
  • KB/DK - Copenhagen: Tue, Stephen, Anders, Kristian
  • KB/DK - Aarhus: Colin, Sabine, Knud Åge
  • BNE: Alicia, María, Manuel, José
  • KB/Sweden: Par, Thomas, Peter

Join from PC, Mac, Linux, iOS or Android:

    https://kbdk.zoom.us/j/104443571

Or an H.323/SIP room system:

    H.323: 109.105.112.236
    Meeting ID: 104 443 571

    SIP: 104443571@109.105.112.236

Or Skype for Business (Lync):

    https://kbdk.zoom.us/skype/104443571

Or Telephone:

Denmark: +45 89 88 37 88 or +45 32 71 31 57
United Kingdom: +44 203 051 2874 or +44 203 481 5237 or +44 203 966 3809 or +44 131 460 1196
Finland: +358 9 4245 1488 or +358 3 4109 2129
Sweden: +46 850 539 728 or +46 8 4468 2488
Norway: +47 7349 4877 or +47 2396 0588
US: +1 669 900 6833 or +1 646 558 8656
    Meeting ID: 104 443 571

    International numbers available: https://zoom.us/u/acRu0MV3xJ

You can join a meeting using the apps on a PC, a tablet or a smartphone, but you can also use the browser-based version (it works with newer versions of Chrome or Firefox).


Update on NAS latest tests and developments

Feedback on usage / tests on NetarchiveSuite 5.6 release: see NetarchiveSuite 5.6 Release Notes

Feedback on tests of the BnF test NAS 6.0 + IIPC H3 release: see presentation

Status of the production sites

Netarkivet

Broad crawl

We finished step 1 of our third broad crawl for 2019 (with a limit of 50 MB/domain) and are now running step 2 (16 GB/domain). Step 1 results: in 602 jobs we harvested a total of about 93 TB, or 187 million objects. Many sites are blocking us; we will address that by giving our new broad crawl harvesters new IP addresses and updating our throttling firewall rules. Simultaneously, we ran the selective crawls connected to the broad crawls: Research databases, Municipalities and regions, Ministries and Government agencies, YouTube.

We are now doing the clean-up and making improvements to prepare for the next broad crawl.

Selective crawls

Getting IP-validated access to content behind paywalls is still a big issue (the difficulty is getting in touch with the right person at the website owner). We are also trying to solve another issue: we are not able to capture comments on news articles.

Ongoing projects

  • Implementation of SolR Wayback (the important step right now is the risk assessment)
  • New user access procedure and form
  • Data mining/extractions from the archive: confirm with our legal department that we follow all relevant laws if we create a workspace for users in SolR Wayback
  • Visual instant QA of https seeds: configuration of PCs for reading WARC files (with SolR Wayback)

BnF


ONB

  • We are almost finished with the first stage of our yearly domain crawl. We are still experiencing the Duplicate Job Generation Error. The next step is to move the database to a more powerful server, as we think the slow-responding database is probably a cause of this error. We will do that after finishing the domain crawl. Then we will also be able to upgrade to version 5.6 and test it in production. We will also downgrade Open MQ to version 4.5.2, as mentioned in the meeting today.
  • We are now using a dedicated IP range for the crawlers, which is no longer connected to the library.
  • In the yearly budget discussion, management decided that we will not get more storage per year. So our total budget for next year will be the same: 6 TB for domain and selective crawls.

BNE

Broad Crawl

We started our annual broad crawl on September 23.

There are almost 2 million websites, which we divide into sets of 500 domains per job with a limit of 150 MB/domain.

We use two dedicated networks (FTTH, Fiber to the Home) for the broad crawl, leaving the regular network for our selective collections.

We have already collected 38% of the websites (26 TB of information) without any significant problems.

KB-Sweden

Broad crawl 

We have had many problems with the broad crawl we started a couple of weeks ago.

Next meetings

  • November 5
  • December 3
  • January 7, 2020

Any other business?
