Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference August the 14th 2012, 13:00-14:00.

Practical information

  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Nicholas, Sara
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin and Mikis, Sabine
  • Any other issues to be discussed on today's tele-conference?

Heritrix 3 in NetarchiveSuite

 

JhoNAS status (Nicholas)

A status update from the beginning of August was sent to the PWG and is accessible from this link: jhonas-project-status-aug.pdf

JHoNas JWAT/JHove2 status

All JHove2 Modules seem to work. Thomas Ledoux is working on containerMD.xsl.

Thomas Ledoux has been testing the different modules, and a number of issues have been fixed in JWAT/JHove2.

Current issues: WARC-Target-URI validation is too strict, unit test modules, JHove2 does not remove temp files with the -t option.

And, of course, the usual: finish the JWAT library...

JHoNas NAS status

NAS-1965
--------
Make it possible to use either ARC or WARC as the harvesting format.

Done, needs unit testing.

NAS-1960
--------
Extend our BatchJob framework to handle WARC-files on record level

Done, needs unit testing.

Besides a WARCBatchJob, an ArchiveBatchJob has also been implemented for batch jobs that run on both ARC and WARC files.

NAS-1958
--------
Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates.

Tested in local installation.

NAS-1959
--------
Implement CDX-generating code that also works for WARC files

Done, needs unit testing.

NAS-1962
--------
Store the contents of the metadata-1.arc files as WARC-records

Done, needs unit testing. Problems with WARC and content-length=0.

NAS-1964
--------
Upgrade of Indexserver system

Done, needs unit testing. Problems with WARC and content-length=0.

NAS-2091
--------
Add documentation for WARC usage in NetarchiveSuite

N/A

NAS-2090
--------
Add documentation for ARC usage in NetarchiveSuite

N/A

NAS-2061
--------
Define the layout of the metadata WARC file

Currently it is a mirror of the ARC file.

NAS-2055
--------
Extend the built-in WARCWriterProcessor to allow for functionality required by NetarchiveSuite

N/A

NAS-2070
--------
WARC enable the dk.netarkivet.wayback.NetarchiveResourceStore

N/A

NAS-1961
--------
NAS-1720 Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses inline CDXCacheBatchJob)

N/A

Move source code to GitHub?

I think we should consider moving the code to GitHub because:

Iteration 52 (3.21 development release) (Mikis)

 

Status of the production sites

  • Netarkivet:

As our broad crawls have been sped up to last less than 2 months, we took advantage of the break between two broad crawls:

  • To crawl “very big web sites” (such as the Danish national broadcaster dr.dk and our other main TV station tv2.dk) in depth
  • To crawl websites of ministries, departments etc. in depth
  • To capture URLs of YouTube videos on and by political parties

We started our own event crawl on the Olympics in London: entering URLs into the system, QA and monitoring.

As to our selective crawls: “business as usual”, that is to say: analysis of “candidates” (new sites proposed for selective crawls), QA of selective crawls, monitoring of harvest jobs, and revision of harvest profiles.

  • BNF:

 

  • ONB:

 

Date for NAS workshop at SB

Mid-October?

Date for next joint tele-conference.

September 11th?

Any other business?
