Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference, August 21st 2012, 13:00-14:00.
|
A status update from the beginning of August was sent to the PWG and is accessible from this link: jhonas-project-status-aug.pdf
All JHOVE2 modules seem to work. Thomas Ledoux is working on containerMD.xsl. He has been testing the different modules, and a number of issues have been fixed in JWAT/JHOVE2. Current issues: WARC-Target-URI validation is too strict, unit test modules, JHOVE2 does not remove temp files with the -t option. And of course the usual: finishing the JWAT library... |
|
|
I think we should consider moving the code to GitHub because:
Have a look, see what you think. |
We plan to start the code freeze on Friday and begin the release test on Monday. The release will include beta-level WARC functionality. The release regression test will be done on a system configured to use ARC, with some separate tests for WARC. I hope that the BnF, ONB and Netarkivet.dk will do a more thorough test of the 3.21 WARC functionality after the release, so that we can include as many WARC fixes and improvements as possible in the 4.0 November PROD release. At that point we can hopefully declare WARC support in NAS production ready. |
Second broad crawl 2012 (NR 15) was finished in early July.
Third broad crawl 2012 (NR 16) was started on August the 14th using 3.18.3. Step 1 is almost finished.
Version 3.20.* is currently in acceptance testing and we are preparing for production in mid-October. We have found 2 issues, which Søren is looking into.
Our Wayback is now indexed up to July 2012 and I'm preparing/testing automatic indexing in production.
Thanks to Jon and his son we have downloaded thousands of YouTube videos over the last month.
During the summer we had 2 production issues without big impact on the system:
1) The SB SAN pillar was down for one day without affecting any harvesting, because the KB site was running and all harvesters at SB were independent servers with their own disk storage.
2) We lost 1 day of harvesting because our admin server ran out of process resources. We are still investigating the logs for further explanations.
3 questions for BNF:
1) Can you show "Show comments" for harvested facebook.com sites?
2) If you harvest YouTube and download videos, how do you link the YouTube "metadata" page with the actual video URL?
3) Which PostgreSQL version are you using in production - 8.4?
As our broad crawls have been sped up to last less than 2 months, we took advantage of the break between two broad crawls
We started our own event crawl on the Olympics in London: entering URLs into the system, QA and monitoring.
As to our selective crawls: “business as usual” – that is to say: analysis of “candidates” (new sites proposed for selective crawls), QA of selective crawls, monitoring of harvest jobs, and revision of harvest profiles.
On the 8th of August, the last harvest of the electoral project ended. Over a period of seven months, monthly, weekly, daily and single captures were made of websites selected by librarians for their relation to the French presidential and parliamentary elections. The result is more than 350 million URLs and 20.38 TB of data (compressed: 10.67 TB). We focused our efforts on harvesting the social Web, especially Twitter and Facebook, but Pinterest and Flickr too. The well-known problem of the # in the URL was an insurmountable obstacle to the harvest of some sites (Google+, Pearltrees), but solutions were found for others. Twitter was collected 4 times a day with a special harvest template: the crawler declared itself not as a browser, but as a robot. This gave us access to the URLs without the problematic <#!> sequence, and therefore allowed us to collect tweets. Twitter's URLs now seem to work without this sequence, even in a normal browser, making them easier to collect. This project was also the occasion to see our new nomination tool (BCWeb) working with NAS on a large scale. It proved to be very useful, even though we sometimes had to adjust the frequency of certain captures (for example, to densify harvests for the electoral weekends). |
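As background to the # problem described above: the fragment of a URL (everything after the #) is not sent to the server in an HTTP request, so a crawler fetching a <#!> URL only ever asks for the page in front of the #. The following is a minimal Python sketch of that effect, not the BnF harvest template; the account name and the helper functions are hypothetical and only serve as illustration.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path (and query) a client would send to the server.

    The fragment is dropped, because it is never part of the HTTP request.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def has_hashbang(url: str) -> bool:
    """True if the URL uses the '#!' convention (fragment starting with '!')."""
    return urlsplit(url).fragment.startswith("!")


# Hypothetical example URLs, for illustration only.
hashbang_url = "https://twitter.com/#!/example_account"
plain_url = "https://twitter.com/example_account"

for url in (hashbang_url, plain_url):
    print(url, "->", request_target(url), "| hashbang:", has_hashbang(url))
```

With the <#!> form the server only sees a request for "/", whereas the plain form requests the account page directly, which is why URLs without the sequence are easier to collect.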
|
Mid-October?
September 11th? |