
Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference on August 21st, 2012, 13:00-14:00.

Practical information

Skype-conference

  • Mikis will set up the Skype conference at 13:00 (please do not connect yourself).
  • TDC tele-conference (fallback if the Skype conference cannot be established):
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Nicholas, Sara
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin and Mikis, Sabine
  • Any other issues to be discussed on today's tele-conference?

Heritrix 3 in NetarchiveSuite

 

JhoNAS status (Nicholas)

A status update from the beginning of August was sent to the PWG and is accessible from this link: jhonas-project-status-aug.pdf

JHoNas JWAT/JHove2 status

All JHove2 Modules seem to work. Thomas Ledoux is working on containerMD.xsl.

Thomas Ledoux has been testing the different modules, and a number of issues have been fixed in JWAT/JHove2.

Current issues: WARC-Target-URI validation is too strict, the modules still need unit tests, and jhove2 does not remove temp files with the -t option.

And of course the usual: finish the JWAT library...

JHoNas NAS status
  • (Jira issue link not rendered): Done, needs unit testing.
  • (Jira issue link not rendered): Done, needs unit testing. In addition to WARCBatchJob, an ArchiveBatchJob has been implemented for batch jobs that run on both ARC and WARC.
  • (Jira issue link not rendered): Tested in local installation.
  • (Jira issue link not rendered): Done, needs unit testing.
  • (Jira issue link not rendered): Done, needs unit testing. Problems with WARC and content-length=0.
  • (Jira issue link not rendered): Done, needs unit testing. Problems with WARC and content-length=0.
  • (Jira issue link not rendered): N/A
  • (Jira issue link not rendered): N/A
  • (Jira issue link not rendered): Currently it is a mirror of the ARC file.
  • (Jira issue link not rendered): N/A
  • (Jira issue link not rendered): N/A
  • (Jira issue link not rendered): N/A
  • (Jira issue link not rendered)

 

  • Shared testing of WARC functionality?
  • WARC support should be ready for the code freeze on Friday. The Wayback access functionality will probably not be very efficient in this release, as the current implementation requires the full WARC files to be downloaded for access.
  • Initial WARC testing will be done as part of the release test, with further testing by the involved organizations after the release.

 

Move source code to GitHub?

I think we should consider moving the code to GitHub because:

Iteration 52 (3.21 development release) (Mikis)

We plan to start the code freeze on Friday and begin the release test on Monday.

The release will include beta-level WARC functionality. The release regression test will be done on a system configured to use ARC, with some separate tests for WARC. I hope that the BnF, ONB and Netarkivet.dk will do a more thorough test of the 3.21 WARC functionality after the release, so we can include as many WARC fixes and improvements as possible in the 4.0 November PROD release. At that point we can hopefully declare WARC support in NAS production-ready.

Status of the production sites

  • Netarkivet: TLR

    The second broad crawl of 2012 (no. 15) was finished in early July.
    The third broad crawl of 2012 (no. 16) was started on August 14th using 3.18.3. Step 1 is almost finished.
    Version 3.20.* is currently being acceptance tested, and we are preparing to put it into production in mid-October. We have found 2 issues, which Søren is looking into.
    Our Wayback is now indexed up to July 2012 and I'm preparing/testing automatic indexing in production.
    Thanks to Jon and his son, we have downloaded thousands of YouTube videos over the last month.

    During the summer we had 2 production issues without major impact on the system:

    1) The SB SAN pillar was down for one day without affecting any harvesting, because the KB site was running and all harvesters at SB were independent servers with their own disk storage.
    2) We lost 1 day of harvesting because our admin server ran out of process resources. We are still investigating the logs for further explanation.

    3 questions for BNF:

    1) Can you display the "Show comments" content for harvested facebook.com sites?
    2) If you harvest YouTube and download videos, how do you link the YouTube "metadata" page with the actual video URL?
    3) Which PostgreSQL version are you using in production - 8.4?
     

  • Netarkivet: SAS (status as of a month ago)

As our broad crawls have been sped up to last less than 2 months, we took advantage of the break between two broad crawls:

  • To crawl “very big web sites” (such as the Danish national broadcaster dr.dk and our other main TV station tv2.dk) in depth.
  • To crawl websites of ministries, departments etc. in depth.
  • To capture URLs of YouTube videos about and by political parties.

We started our own event crawl on the Olympics in London: entering URLs into the system, QA and monitoring.

As to our selective crawls: “business as usual”, that is to say: analysis of “candidates” (new sites proposed for selective crawls), QA of selective crawls, monitoring of harvest jobs, and revision of harvest profiles.

  • BNF:

 

  • ONB:
  • Since the beginning of August, we have been crawling with double the bandwidth (now 10 Mbit), and we are currently working on virtualisation of our servers.
  • We are currently crawling academic (ac.at) and governmental (gv.at) websites with NAS version 3.18.3. The new version runs without any problems.
  • Andreas worked a lot on creating new reports (with our Hadoop cluster).
  • The most recent crawl of the .at domain finished with 6.3 TB compressed / 10.22 TB raw (with a limit of max. 100 MB per domain).
  • We will start the next domain crawl in January 2013.
  • Our next thematic crawl will be about Austrian libraries and cultural institutes abroad.

Date for NAS workshop at SB

Mid-October?

Date for next joint tele-conference.

September 11th?

Any other business?
