2017-04-04 Statusmeeting

Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2017-04-04, 13:00-14:00.

Practical information

Participants

  • BNF: Sara
  • ONB: Michaela, Andreas
  • KB/DK - Copenhagen: Stephen, Tue, Nicholas
  • KB/DK - Aarhus: Sabine, Colin
  • BNE: Mar
  • KB/Sweden: Bengt

NAS 5.3 Release

5.3 is out: NetarchiveSuite 5.3.x Release Notes

5.3.1 is open: https://kb-dk.atlassian.net/secure/RapidBoard.jspa?rapidView=8&view=detail

NAS workshop in Vienna

Any changes to the draft agenda? 2017 NAS workshop

Status of the production sites

Netarkivet

  • On March 8 we started our first broad crawl for 2017; the first step has a budget limit of 10 MB per domain. We had many problems with this first broad crawl using Heritrix 3 and NAS 5.2.2. Most likely one of the problems was the job scheduling: jobs changed their state and there was a lot of manual “putting out fires” work. The crawl finished on March 26.
  • With our new strategy for the selective crawls we had stopped crawling the front pages of news sites 6 times a day, because we were afraid of overloading the site owners’ servers. A couple of weeks ago we resumed the 6 daily front page crawls for the national news sites, so far without complaints from the site owners.
  • We selected 22 representative Facebook profiles and started harvesting them with Archive-It. This is our first Facebook crawl since last autumn.
  • We have NSF performance problems with the wayback calendar display and we still can’t display pages that use the HTTPS protocol.
  • The free text search index can be 3-4 months behind due to the way it works. At the moment it is about 2 weeks behind.

BnF

  • After performing our last tests on NetarchiveSuite 5.3 and Heritrix 3, we went into production and started our first crawls on March 20th!
  • The beginning of the year is also the time for writing our annual report. In 2016, we crawled 125.47 TB of data, including the largest broad crawl in our collection (90.5 TB). This year we chose to study the top level domains (TLDs) in the broad crawl to measure the impact of including new regional TLDs in the seed list. The use of a TLD varies from one region to another (commercial purposes, public purposes, personal websites...) and the number of active websites is not proportional to the geographical area. We also analysed EPUB files, as we did last year, to see if there is any evolution: their number is quite similar, but the number of domains hosting them is growing. Overall, we exceeded our predictions because the average size of the harvested files increased.

ONB

  • We have been using NAS 5.3 in production for one week. No problems during selective crawls. Our domain crawl for 2017 will start soon.

BNE

At the National Library of Spain we are concentrating our efforts on two fronts:

  • giving the users of our library and of the regional libraries access to our web archive as soon as possible
  • running our annual domain crawl, which we launched today at 13:00

Additionally, we have been asked by the Regional Library of Galicia to crawl the regional domain .gal. They are going to send us the whole list of domains and, as soon as our .es domain crawl finishes, we'll launch the domain crawl for .gal.

Next meetings

  • May 9th
  • June 6th
  • July 4th
  • August 8th

Any other business?