...

  • Mikis will establish the Skype conference at 13:00 (please do not connect yourself).
  • TDC teleconference (fallback if the Skype conference cannot be established):
    • Dial-in number: (+45) 70 26 50 45
    • Dial-in code: 9064479#
  • Bridgit: the Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Nicolas, Sara (wasn't able to participate).
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin and Mikis, Sabine (Sabine and Colin weren't present).

...

JHoNas NAS status
  • NAS-1965: Done, needs unit testing.
  • NAS-1960: Done, needs unit testing. In addition to WARCBatchJob, an ArchiveBatchJob has been implemented for batch jobs running on both ARC and WARC.
  • NAS-1958: Tested in local installation.
  • NAS-1959: Done, needs unit testing.
  • NAS-1962: Done, needs unit testing. Problems with WARC and content-length=0.
  • NAS-1964: Done, needs unit testing. Problems with WARC and content-length=0.
  • NAS-2091: N/A
  • NAS-2090: N/A
  • NAS-2061: Currently it is a mirror of the ARC file.
  • NAS-2055: N/A
  • NAS-2070: N/A
  • NAS-1961: N/A
  • NAS-1720

 

  • Shared testing of WARC functionality?

...

As to our selective crawls: “business as usual”, that is to say: analysis of “candidates” (new sites proposed for selective crawls), QA of selective crawls, monitoring of harvest jobs, and revision of harvest profiles.

  • BNF:
 

On the 8th of August, the last harvest of the electoral project ended. Over a period of seven months, monthly, weekly, daily and single captures were made of websites selected by librarians for their relation to the French presidential and parliamentary elections. The result is more than 350 million URLs and 20.38 TB of data (10.67 TB compressed).

We have focused our efforts on harvesting the social Web, especially Twitter and Facebook, but also Pinterest and Flickr. The well-known problem of the # in the URL has been an insurmountable obstacle to harvesting some sites (Google+, Pearltree), but solutions were found for others. Twitter, for instance, was collected 4 times a day with a special harvest template in which the crawler declared itself not as a browser but as a robot. This gave us access to URLs without the problematic <#!> sequence, and therefore allowed us to collect tweets. Twitter's URLs now seem to work without this sequence even in a normal browser, which makes them easier to collect.
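For readers less familiar with the # problem mentioned above, here is a minimal sketch of why such URLs are hard to harvest: everything after the # is a fragment, which is handled by the browser and never sent to the server. The helper name `request_target` and the example account name are hypothetical, chosen for illustration only; this is not part of the harvest template described here.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (plus query) that an HTTP client sends to the server.

    The fragment (everything after '#') is deliberately left out,
    because clients do not transmit it in the request.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# Hypothetical example URLs, for illustration only.
hashbang_url = "https://twitter.com/#!/example_user"
plain_url = "https://twitter.com/example_user"

print(request_target(hashbang_url))  # the server only sees "/"
print(request_target(plain_url))     # the server sees "/example_user"
```

With the <#!> form, the server receives the same request for every page, and the actual content is loaded afterwards by scripts in the browser, which a crawler that does not run those scripts cannot follow.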

This project was also the occasion to see our new nomination tool (BCWeb) working with NAS on a large scale. It proved to be very useful, even though we sometimes had to adjust the frequency of certain captures (for example, to harvest more densely over the election weekends).

  • ONB:
  • Since the beginning of August, we have been crawling with double the bandwidth (now 10 Mbit), and we are currently working on the virtualisation of our servers.
  • We are currently crawling academic (ac.at) and governmental (gv.at) websites with NAS version 3.18.3. The new version runs without any problems.
  • Andreas worked a lot on creating new reports (with our Hadoop cluster).
  • The most recent crawl of the .at domain finished with 6.3 TB compressed / 10.22 TB raw (with a limit of max. 100 MB per domain).
  • We will start the next domain crawl in January 2013.
  • Our next thematic crawl will be about Austrian libraries and cultural institutes abroad.

...