Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference August the 14th 2012, 13:00-14:00.
Practical information
- TDC tele-conference:
- Dial in number (+45) 70 26 50 45
- Dial in code 9064479#
- BridgeIT: The BridgeIT conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.
Participants
- BNF: Nicholas, Sara
- ONB: Michaela and Andreas
- KB: Tue, Søren and Nicholas
- SB: Colin and Mikis, Sabine
- Any other issues to be discussed on today's tele-conference?
Heritrix 3 in NetarchiveSuite
- The week of 17 September.
- Issue for planning: NAS-2066 Heritrix roadmap Workshop.
JhoNAS status (Nicholas)
A status update from the beginning of August was sent to the PWG and is accessible from this link: jhonas-project-status-aug.pdf
All JHOVE2 modules seem to work. Thomas Ledoux is working on containerMD.xsl.
Thomas Ledoux has been testing the different modules, and a number of issues have been fixed in JWAT/JHOVE2.
Current issues: WARC-Target-URI validation is too strict, unit test modules, jhove2 does not remove temp files with -t option.
And of course the usual, finish JWAT library...
NAS-1965
--------
Make it possible to use either ARC or WARC as the harvesting format.
Done, needs unit testing.
NAS-1960
--------
Extend our BatchJob framework to handle WARC-files on record level
Done, needs unit testing.
Besides a WARCBatchJob, an ArchiveBatchJob has also been implemented for batch jobs running on both ARC and WARC.
NAS-1958
--------
Replace the "ARCWriterProcessor" with "WARCWriterProcessor" in our Heritrix templates.
Tested in local installation.
NAS-1959
--------
Implement CDX-generating code that also works for WARC-files
Done, needs unit testing.
NAS-1962
--------
Store the contents of the metadata-1.arc files as WARC-records
Done, needs unit testing. Problems with WARC and content-length=0.
NAS-1964
--------
Upgrade of Indexserver system
Done, needs unit testing. Problems with WARC and content-length=0.
NAS-2091
--------
Add documentation for WARC usage in NetarchiveSuite
N/A
NAS-2090
--------
Add documentation for ARC usage in NetarchiveSuite
N/A
NAS-2061
--------
Define the layout of the metadata warc file
Currently it is a mirror of the ARC file.
NAS-2055
--------
Extend the built-in WARCWriterProcessor to allow for functionality required by NetarchiveSuite
N/A
NAS-2070
--------
WARC-enable the dk.netarkivet.wayback.NetarchiveResourceStore
N/A
NAS-1961
--------
NAS-1720 Upgrade or remove dk.netarkivet.viewerproxy.LocalCDXCache (deprecated, and uses inline CDXCacheBatchJob)
N/A
Move source code to GitHub?
I think we should consider moving the code to GitHub because:
- Git is much more flexible than Subversion; see "3 Reasons to Switch to Git from Subversion", "GitSvnComparison", "svn - git vs Subversion - pros and cons", and "Why You Should Switch from Subversion to Git".
- It would move the code to a standard open source hosting site, which would increase accessibility.
- GitHub is great!
Iteration 52 (3.21 development release) (Mikis)
Status of the production sites
- Netarkivet:
As our broad crawls have been sped up to last less than 2 months, we took advantage of the break between two broad crawls:
- To crawl “very big web sites” (such as the Danish National Broadcast dr.dk and our other main tv-station tv2.dk) in depth.
- To crawl websites of ministries, departments etc. in depth
- To capture URLs of YouTube videos on and by political parties
We started our own event crawl on the Olympics in London: entering URLs into the system, QA and monitoring.
As to our selective crawls: “business as usual”, that is to say: analysis of “candidates” (new sites proposed for selective crawls), QA of selective crawls, monitoring of harvest jobs, and revision of harvest profiles.
- BNF:
- ONB:
Date for NAS workshop at SB
Mid-October?
Date for next joint tele-conference.
September 11th?