2012-04-25 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference on April 25th 2012, 13:00-14:00.

Søren is on vacation.

Practical information

  • Skype-conference
    • Mikis will establish the Skype conference at 13:00 (please do not connect yourself).
    • TDC tele-conference (fallback if the Skype conference cannot be established):
      • Dial in number (+45) 70 26 50 45
      • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Sara and Nicholas
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Jonas
  • SB: Colin and Mikis
  • Any other issues to be discussed on today's tele-conference?

Followup to workshop

  • Actions from last meeting.
    • Sara will create a post to the curator mailing list regarding the new content.
    • Mikis and Sara will look into defining general Jira usage. Karen will be contacted afterwards for analysis of NASC-17@jira.
    • (Sara) The wiki content regarding templates and crawler traps isn't finished yet, even though some content has been added.

IIPC GA in Washington

Iteration 51 (3.20 Development release) (Mikis)

Status of the production sites

  • Netarkivet:

    It has the old Heritrix memory usage (1936M) and can manage all other indexing jobs during the creation of the broad crawl deduplication index.

    The creation of the 100 GB deduplication index took 4 days, as expected.

    There are currently between 50 and 60 harvest instances running, and we are uploading between 600 and 900 GB to the archive every day,
    without using our full harvesting capacity. The 905 broad jobs run for a maximum of 10 hours each.

    It still has the issues regarding the startup of the broad harvest, but they can be managed manually by copying and relinking to the unzipped index.

    • Curators (Sabine):
      We finished step 2 of our newest broad crawl with the 3.18.3 patch release on the 18th of April at about 1 pm; the upload is about 13 TB (versus 14.6 TB in our previous broad crawl).
      Every job ran for a maximum of 10 hours, so we ran twice as many jobs as in our previous broad crawl.
      The next broad crawl is scheduled for the first week of May.

    We finished an event harvest on a right-wing extremist demonstration in Aarhus on the 31st of March. The event harvest includes the capture of some YouTube videos.
    We worked on the differences in how archived sites are displayed by different browsers and browser versions.
    Our Wayback machine now displays the provenance of pages from the archive (which job they come from).

  • BNF:
    •  (Peter): focus on the election crawl.

    External access to BCWeb

    Our curator tool BCWeb is now accessible to our partners in regional libraries who are selecting sites relating to the presidential and parliamentary elections. This year we have 20 libraries participating in the election crawl (out of 26 in total), which is a record! We held a meeting with them at the end of March to launch the project and present BCWeb. The librarians have been selecting sites for the past few weeks, and the feedback on the curator tool has been positive. We aim to provide access to other partners for other projects, for example researchers who will be selecting sites for the crawl of the Olympic Games.

    These sites are added to the crawls, which started at the end of January with sites selected by librarians at the BnF. The first round of the presidential election is this Sunday, with the second round on the 6th of May, followed by the parliamentary elections on the 10th and 17th of June.

  • ONB:
    • Our Domain crawl is almost finished. Just a few rescheduled jobs are expected to be finished.
    • We are now using a small Hadoop cluster, which is located on our crawler machines. It has 8 worker nodes, which can use 8 TB of HDFS storage. We are using Pig (http://pig.apache.org) for sorting our CDX index and generating statistical reports.

Date for next joint tele-conference.

  • May 23rd, 13:00-14:00.

Any other business