2012-05-30 Status meeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference on 30 May 2012, 13:00-14:00.

Søren is on vacation.

Practical information

  • Skype-conference
    • Mikis will set up the Skype conference at 13:00 (please do not connect yourself).
    • TDC tele-conference (fallback if the Skype conference cannot be established):
      • Dial in number (+45) 70 26 50 45
      • Dial in code 9064479#
  • Bridgit: the Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Nicholas
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin and Mikis
  • Any other issues to be discussed on today's tele-conference?

Heritrix 3 in NetarchiveSuite

IIPC has started an initiative to discuss the future of Heritrix 3 as a shared platform. What is the NAS approach to this?

Update on Hanzo Warc tools

Søren and Nicholas?

JhoNAS status 

Nicholas?

Iteration 51 (3.20 Production release) (Mikis)

Status of the production sites

  • Netarkivet:
    • Step 1 of broad crawl number 2/2012 finished on 14 May; step 2 started with a delay of 8-9 days because we ran short of disk space. About 38,000 domains are blocking our harvesters, because our aggressive harvesting causes annoyance or even breakdowns on their servers.
    • Prior to the Heritrix 3 workshop organized by archive.org, the curators at KB/SB will prepare a survey of our current use of Heritrix 1 and of our wishes for future functionality.
    • We still focus on possibilities of harvesting video and hope that we can learn from BNF’s experiences with DailyMotion and re-use their script(s).
  • BNF:
    • 1) We launched the "annual" part of our focussed crawl on 14th May - this is the biggest section of the sites selected by our librarians, with over 4000 configurations. We once again experienced problems because we cannot control in advance the number of jobs created by NAS: these HarvestDefinitions produced 42 jobs, in addition to the jobs for other HarvestDefinitions that are active at the same time (elections, daily, weekly and monthly crawls).
      - To overcome the problem of crawler capacity, Bert was able to add 20 extra crawlers using our virtual server system. This allowed us to continue with all the different crawls, but it can only be a short-term solution as it puts a lot of strain on our pool of resources.
      - To find a long-term solution to this problem, particularly in view of our next broad crawl which will be performed in the autumn, we aim to work on the way NAS creates jobs to give us more control over the process, so we can plan our crawls to make the most efficient use of our resources. Nicolas should start working on this in June.

    • 2) We have also started testing how to collect password-protected resources using the "Credential Store" settings of Heritrix. Two different options are possible: one for sites that use a login form ("HTMLLoginForm") and one for sites where the server controls the connection directly ("HTTPBasicAccessAuthentication"). The first results are positive: the logs show that the robot is able to pass the login page and collect files. As these test crawls are not indexed, we haven't yet been able to test access in Wayback.

  • ONB:
    • We finished our second domain crawl and will now begin to crawl governmental and academic websites. We used NAS version 3.16.1 and will now upgrade to 3.18.3.
    • We will analyse the domain crawl and compare it with the previous crawl.
    • Resources for the web archiving project have been reduced. For a period of 1.5 years, Andreas will commit 80% and Michaela 50% of their time to web archiving.

Date for NAS workshop at SB

Beginning of September.

Date for next joint tele-conference

  • 26 June, 13:00-14:00.

Any other business