2012-09-18 Status meeting
Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference September the 18th 2012, 13:00-14:00.
Practical information
- Skype-conference:
- Mikis will establish the Skype conference at 13:00 (please do not connect yourself).
- TDC tele-conference (if the Skype conference cannot be established):
- Dial in number (+45) 70 26 50 45
- Dial in code 9064479#
- Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.
Participants
- BNF: Nicolas, Sara.
- ONB: Michaela and Andreas
- KB: Tue, Søren and Nicholas
- SB: Colin, Mikis and Sabine.
- Any other issues to be discussed on today's tele-conference?
Heritrix 3 in NetarchiveSuite
- Søren and Nicolas will go to the BL on 20-21 September.
- Issue for planning: NAS-2066 Heritrix roadmap Workshop.
Nicolas: Has started to look into H3. First impression is that the REST interfaces are not very mature. The monitoring functionality we use in H1's JMX interface hasn't been implemented. Spring might be able to help us generate JMX interfaces; Nicolas has experience with this from the Curator tool project.
Søren: Has also started to investigate the H3 functionality. Søren will compile a list of needed GUI functionality, which will include Andreas's input. It remains to be decided whether the H3 GUI will be extended with some of this functionality or whether we have to develop it ourselves.
Søren and Nicolas will try to exchange notes before travelling to the workshop on Thursday.
JhoNAS status (Nicholas)
A status update for the September IIPC-PWG teleconference is accessible from this link: jhonas-project-status-sep.pdf
A code freeze date has been selected for JHove2 v2.1.0.
Almost all of Thomas Ledoux's JHove2 issues should have been fixed. Deletion of temporary files is almost fixed.
Preparing for JWAT/JHove2 code freeze.
- Shared testing of WARC functionality?
- NAS with WARC support was released on 5.9.2012.
- The Wayback access functionality should work for WARC files, but the CDX batch jobs in the wayback package have not been updated to support WARC.
- Initial WARC testing was done as part of the release test.
- Requires some testing by BnF/ONB to complete the "Developer release of NAS with WARC support" milestone.
Iteration 53 (4.0 development release) (Mikis)
See status here.
The main focus here is maturing the WARC functionality so that it becomes production ready. A critical part of this is that BnF and ONB test the functionality as quickly as possible, so that their feedback can be used to fix bugs and make improvements.
BnF's improved job generation mechanism will also be part of the release.
Status of the production sites
Netarkivet: SAS
- Our third broad crawl of 2012 is well under way: the first step (with a limit of 10 MB per domain) started on Aug. 15th and the second step (with a limit of 8 GB per domain) started on Sept. 3rd. So far we have harvested about 18 TB.
- Some jobs from our most frequently run harvest definitions (6 times a day) have been a bit tricky: they did not stop, but were just “hanging” when they were 99 % done. We traced the problem to domains belonging to one of the big Danish media groups. Fortunately, they cooperated to solve the problem.
- We just started a new event crawl on a tax case concerning the Danish Prime Minister’s husband. This event is of potential interest because a commission has been set up to investigate an alleged political leak, and the role of the press is also part of the case.
- We are working on an article for the library journal “Microform and Digitisation Review” on the curatorial aspects of the work with Netarchive.
- BNF:
This summer, BnF launched a new type of harvest. We observed that blog platforms did not have a good representation in our broad crawl because of the small budget dedicated to each domain. So we prepared a selective crawl with 16 well-known French platforms (such as free.fr, skyrock.com, typepad.com). We extracted the names of sites located on these domains from all the host reports found in NAS (that means reports from 2010 to 2012). We only kept those which are still active. This gave us a list of 430 000 seeds, which we harvested during a period of two weeks. We still need to do quality assurance.
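The seed extraction described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not BnF's actual procedure: it assumes the host reports are whitespace-separated text with the host name in the third column and a header line starting with "[", and the names `extract_seeds` and `is_active` are hypothetical. The liveness check is passed in as a function because the notes do not say how BnF decided which sites are still active.

```python
from typing import Callable, Iterable, List

# Three of the platform domains named in the notes; the real list had 16 entries.
PLATFORMS = ("free.fr", "skyrock.com", "typepad.com")


def extract_seeds(
    report_lines: Iterable[str],
    platforms: Iterable[str] = PLATFORMS,
    is_active: Callable[[str], bool] = lambda host: True,  # placeholder check
    host_column: int = 2,  # assumption: host name is the third column
) -> List[str]:
    """Return sorted, de-duplicated seed URLs for sites hosted on the platforms."""
    suffixes = tuple("." + p.lower() for p in platforms)
    hosts = set()
    for line in report_lines:
        fields = line.split()
        # Skip header lines (assumed to start with "[") and rows that are too short.
        if line.lstrip().startswith("[") or len(fields) <= host_column:
            continue
        host = fields[host_column].lower()
        # Keep sub-sites of a platform (e.g. someblog.free.fr), not the platform itself.
        if host.endswith(suffixes):
            hosts.add(host)
    return sorted("http://" + h + "/" for h in hosts if is_active(h))


# Made-up sample rows in the assumed report layout.
sample = [
    "[#urls] [#bytes] [host] [#robots] [#remaining]",
    "120 34567 monblog.free.fr 1 0",
    "15 2048 www.example.org 1 0",
    "7 900 Monblog.free.fr 0 0",
    "3 100 free.fr 0 0",
]
print(extract_seeds(sample))
```

Collecting hosts in a set means that a site appearing in several reports from 2010 to 2012 yields only one seed.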
- ONB:
- Currently crawling academic (ac.at) and governmental (gv.at) websites in the 2nd stage (limit 7 GB).
- The list of heavily used pages in the Heritrix 1.14.4 user interface can be found by following this link: Heavily used Pages from Heritrix 1.14 Webinterface
Date for NAS workshop at SB
Could we postpone the meeting to 5-6 November?
BnF would prefer to stick to the 29-30. As Mikis will be attending another meeting on the 30th from 13 to 15 and Bjarne will be away on the 30th, we should try to have the discussions relevant for Mikis and Bjarne when they are available.
Date for next joint tele-conference.
October 16, 13:00-14:00?