2012-02-28 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference February the 28th 2012, 13:00-14:00.

Søren is on vacation.

Practical information

  • Skype-conference
    • Mikis will establish the skype-conference at 13:00 (Please do not connect yourself):
      • BNF skype name: sara.aubry.bnf
      • ONB skype name:
      • SB skype name: mss.statsbiblioteket.dk
      • KB skype name: christen.hedegaard.kb
    • TDC tele-conference (if the Skype conference cannot be established):
      • Dial in number (+45) 70 26 50 45
      • Dial in code 9064479#
  • Bridgit:
    • The Bridgit conference will be available about 5 min. before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Sara
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Jonas
  • SB: Colin and Mikis
  • Any other issues to be discussed on today's tele-conference?

Followup to workshop

  • Actions from last meeting.
    • (tick) Peter, Michaela and Sabine will send individual posts to the curator mailing list with news on the status at the respective sites. Tue will talk to Sabine about compiling the update.
    • Mikis will try to get an overview of the added curator wiki content and send a mail to Sara, who will create a post to the curator mailing list regarding the new content.
    • Mikis and Sara will look into defining general Jira usage. Karen will be contacted afterwards for analysis of NASC-17@jira.
    • The wiki content regarding templates and crawler traps isn't finished yet, even though some content has been added.

      Sara will look into the last three bullets.

IIPC GA in Washington

What will the NetarchiveSuite representation be here:

  • Participants:
    • BnF: Clement, Annick and Sara.
    • ONB: Possibly either Andreas or Michaela.
    • KB: Søren, Birgit, Nicholas
    • SB: Bjarne and Mikis, leaving Friday at about 14:00
  • Presentations?:
    • NetarchiveSuite overview?

      No need for this.

    • NetarchiveSuite workshop Friday.

      Clement has currently booked a slot Friday afternoon for the NAS workshop. As Mikis and Bjarne will leave at around 14:00, Clement should try to move the workshop to the morning. The workshop should concentrate on discussions regarding usage of NAS in production environments (not a tutorial as in Vienna). This includes a demonstration of BnF's curator tool. Sara will draft an agenda for the workshop for commenting by SB, KB and OnB.

    • Jhonas presentation and workshop

      Clement, Nicholas and Mikis will plan for a Jhonas presentation in the PWG.

Jhonas workshop at KB in April

See NAS Warc workshop.

Clement, Sara and Sophie from BnF will participate.

Iteration 50 (3.19 Development release) (Mikis)

  • Should we switch to git as SCM?

    BnF and OnB should have a look at the issue and give their opinion.

  • Codefreeze ??

    Postponed to at least 11 March to allow for the switch to Postgres as testbed DB. We have furthermore uncovered a number of 3.18 bugs, which should be fixed prior to the codefreeze.

  • 3.19.0 release test

Status of the production sites

  • Netarchive (Tue):

    We are on track again. The indexing for the broad crawl is now parallelized, and the total start-up time for the broad crawl, including creation of an 80 Gb deduplication index, was only about 24 hours without any manual intervention (in 3.16 it took 4-5 days).

    Be aware that the new index creation method places a heavy load on the folder tmpdircommon during sorting. We had 24 broad crawl harvesters and 33 selective harvesters active during startup (no single low prio harvester).

    What were the main problems during the startup:

    1. Every low prio harvester died with a Java "out of heap space" error after it got the index. It seems that the new parallelized broad crawl index demands more memory for the Heritrix processes.
      Fix: increased memory to about 3 GB per Heritrix instance in the local settings.xml file
      <harvester>
        <harvesting>
          <heritrix>
            <heapSize>2936M</heapSize>
          </heritrix>
        </harvesting>
      </harvester>
      
      on each 64 bit server and closed all 32 bit harvesters (4).
    2. Harvesters continuously started, ran and failed, with log spam about trying to generate or find a new index, even though the index was in place and ready, until there were no more jobs in the queue.
      Fix: In each harvester's cache/DEDUP_CRAWL_LOG/ folder, the newly requested index name was created as a link to the already created index, e.g.
      ln -s 127261-127262-127263-127264-571d26967e066b5a3ccbf384c937d74d-cache 127268-127269-127270-127271-27f47726643b267f48a0368d21f7a0fe-cache
      
      (the first folder actually contains the generated Lucene index and the last is the newly requested name), and all jobs were resubmitted.
    3. Selective harvest waits for index until broad crawl index is finished.
      Fix: no fix currently.
    4. Running jobs GUI out of sync with actually running jobs.
      Fix: used SVC's ad hoc Java tool to delete zombie "Running jobs"
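
    The link workaround from problem 2 can be sketched as a small shell helper. This is only an illustration under assumptions: the function name link_index and its three arguments are hypothetical and not part of NetarchiveSuite, and the real cache path and index folder names must be taken from the harvester in question.

```shell
#!/bin/sh
# Hypothetical helper (not part of NetarchiveSuite): make a newly requested
# index name point at an index folder that already exists in a harvester's
# DEDUP_CRAWL_LOG cache.
# Usage: link_index CACHE_DIR EXISTING_INDEX REQUESTED_INDEX
link_index() {
    cache_dir=$1
    existing=$2
    requested=$3

    # Refuse to continue if the generated index folder is not there.
    if [ ! -d "$cache_dir/$existing" ]; then
        echo "missing index: $cache_dir/$existing" >&2
        return 1
    fi

    # Never overwrite something that already carries the requested name.
    if [ -e "$cache_dir/$requested" ] || [ -L "$cache_dir/$requested" ]; then
        echo "already present: $cache_dir/$requested" >&2
        return 1
    fi

    # ln -s takes TARGET first and LINK_NAME second; the link is relative,
    # so it stays valid if the whole cache folder is moved.
    ( cd "$cache_dir" && ln -s "$existing" "$requested" )
}
```

    It would be called once per harvester cache, e.g. link_index /path/to/cache/DEDUP_CRAWL_LOG existing-index-cache requested-index-cache, where all three values are placeholders. The two checks are there so that a mistyped folder name fails loudly instead of leaving a dangling link.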

    What were the main problems during the upgrade to 3.18 in production:

    1. Corrupt indexes in the Derby admin database.
      Fix: recreated the indexes.
    2. Very slow new lookup table in the admin database.
      Fix: reconfigured the lookup table to one with only 1 record
  • Netarchive - curator update (Sabine):

    Lego had a game zone on the web called Lego Universe, which was closed recently (2012-01-30). In order to document Lego Universe we did some extra crawls of lego.com and of videos from YouTube.com which were embedded in Lego Universe.
    Quite a few webpages are displayed differently by different browsers and browser versions. We studied some examples closely and documented the result in our wiki.
    Two of the most important sites with display problems are Facebook and Twitter. The crawl log tells us that most of the given URLs are harvested, but the viewerproxy does not show them at all.
    As you know, Heritrix cannot harvest streamed video or sound. As to sound, we succeeded in harvesting mp3 files from RSS feeds from a page with radio streams. We used a template with an XML extractor.
    Upgrading to NAS 3.18 caused us some trouble - NAS was down for about a week.

  • BNF - curator update (Peter):

    The main news is that BnF Collecte du Web (BCWeb), our new selection tool, is up and running! We opened up the tool to the users in two steps:

    • We first used the admin functions to update NAS. We had already imported the URLs into BCWeb, including the most recent changes by the curators and some reorganisation of our crawls for 2012. We then used BCWeb to transfer the first harvest definitions into NAS. The transfer went smoothly, so we proved that BCWeb is able to update the seedlists, harvest templates and harvest definitions in NAS. Using the updated harvest definitions we then launched the twice-yearly crawl in January and the monthly crawl at the beginning of February, and we will continue to update before launching each crawl (or once a month for the daily and weekly crawls). We have also started crawling for the 2012 Presidential and Parliamentary Elections, but we will give more details about this in a future update.
    • Once we were sure that the admin functions were OK, we opened up access to our network of 80 curators on the 7th February. So far we have had a lot of positive feedback, with a few bugs being pointed out that we hope to fix in the next release, which is planned for early in March. We will be holding training sessions in March, but we decided to open up BCWeb in advance as it is designed to be easy and intuitive to use, and we hope to use the training sessions to gather reactions and suggestions for future improvements.

    The third step, on which we are currently working, will be to provide external access to BCWeb for partners who select sites for certain projects. To begin with this will mainly be the regional libraries who are participating in our election crawl. We are aiming to have this external access in place by mid-March.

  • ONB (Andreas):
    • The collection "Austrian Literature", which was started in December, was stopped due to complaints. Currently there is an examination (should be formal) of the media law. After that the crawl should be continued.
    • The domain crawl with NAS 3.16.1 is still running (stage 2) but was also stopped for a short time, because we need to provide more space to the index application for doing the deduplication. It should be continued in the next few days.
    • Our vague suspicion that our Internet connection is too slow is now confirmed. This is the reason for the long duration of our domain crawls. Our IT department promised us better bandwidth in the coming months.
    • Update 23/02/2012: After getting more disk space for the index application, the domain crawl was resumed last week.

Date for next joint tele-conference.

  • March 20th 13-14.

    OK

Any other business