...

Mikis will set up the Skype conference at 13:00 (please do not connect yourself).
TDC teleconference (fallback if the Skype conference cannot be established):
- Dial-in number: (+45) 70 26 50 45
- Dial-in code: 9064479#
Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

BNF: Nicolas, Sara (wasn't able to participate).
ONB: Michaela and Andreas
KB: Tue, Søren and Nicholas
SB: Colin and Mikis, Sabine (Sabine and Colin weren't present).

Any other issues to be discussed on today's tele-conference?

...

The week of 17 September. Søren will go to BnFBJ on 20-21 September.
Issue for planning: NAS-2066 Heritrix roadmap Workshop.

...

NAS status

NAS-1965: Done, needs unit testing.
NAS-1960: Done, needs unit testing. Besides a WARCBatchJob also ArchiveBatchJob has been implemented for batch jobs running on both ARC and WARC.
NAS-1958: Tested in local installation.
NAS-1959: Done, needs unit testing.
NAS-1962: Done, needs unit testing. Problems with WARC and content-length=0.
NAS-1964: Done, needs unit testing. Problems with WARC and content-length=0.
NAS-2091: N/A
NAS-2090: N/A
NAS-2061: Currently it is a mirror of the ARC file.
NAS-2055: N/A
NAS-2070: N/A
NAS-1961: N/A
NAS-1720

Shared testing of WARC functionality?

WARC support should be ready for the code freeze on Friday. The Wayback access functionality will probably not be very efficient in this release, as the current implementation requires the full WARC files to be downloaded for access. Initial WARC testing will be done as part of the release test, with further testing by the involved organizations after the release.

Moved sourcecode to GitHub?

...

Git is much more flexible than Subversion; see "3 Reasons to Switch to Git from Subversion", "GitSvnComparison", "svn - git vs Subversion - pros and cons" and "Why You Should Switch from Subversion to Git".
We would be moving the code to a standard open source hosting site, which will increase accessibility.
GitHub is great!

Have a look, see what you think.

Iteration 52 (3.21 development release) (Mikis)

NAS-2018

We plan to start the code freeze on Friday and begin the release test on Monday.

The release will include beta-level WARC functionality. The release regression test will be done on a system configured to use ARC, with some separate tests for WARC. I hope that BnF, ONB and Netarkivet.dk will do a more thorough test of the 3.21 WARC functionality after the release, so we can include as many WARC fixes and improvements as possible in the 4.0 November PROD release. At that point we can hopefully declare WARC support in NAS production ready.


Status of the production sites

Netarkivet: TLR
The second broad crawl of 2012 (NR 15) was finished in early July.
The third broad crawl of 2012 (NR 16) was started this morning, August the 14th, using 3.18.3. Step 1 is almost finished.
Version 3.20.* is currently being acceptance tested and we are preparing for production in mid-October. We have found 2 issues, which Søren is looking into.
Our Wayback is now indexed up to July 2012 and I'm preparing/testing automatic indexing in production.
Thanks to Jon and his son we have downloaded thousands of YouTube videos over the last month.
During the summer we had 2 production issues without major impact on the system:
1) The SB SAN pillar was down for one day without affecting any harvesting, because the KB site was running and all harvesters at SB were independent servers with their own disk storage.
2) We lost 1 day of harvesting because our admin server ran out of process resources. We are still investigating the logs for further explanations.
3 questions for BNF:
1) Can you show "Show comments" for harvested facebook.com sites?
2) If you harvest YouTube and download videos, how do you link the YouTube "metadata" page with the actual video URL?
3) Which PostgreSQL version are you using in production - 8.4?
Netarkivet: SAS (from a month ago)

...

As to our selective crawls: “business as usual”, that is to say analysis of “candidates” (new sites proposed for selective crawls), QA of selective crawls, monitoring of harvest jobs, and revision of harvest profiles.

BNF:

On the 8th of August, the last harvest of the electoral project ended. Over a period of seven months, monthly, weekly, daily and single captures have been made of websites selected by librarians for their relation to the French presidential and parliamentary elections. The result is more than 350 million URLs, and 20.38 TB of data (compressed: 10.67 TB).

We have focused our efforts on harvesting the social Web, especially Twitter and Facebook, but Pinterest and Flickr too. The well-known problem of the # in the URL has been an insurmountable obstacle to the harvest of some sites (Google+, Pearltree). But solutions were found for others. Thus Twitter was collected 4 times a day with a special harvest template: the crawler declared itself not as a browser, but as a robot. This allowed us to have access to the URL without the problematic <#!> sequence, and therefore to collect tweets. But now Twitter's URLs seem to work without this sequence, even in a normal browser, making them easier to collect.
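
As a rough illustration of the seed clean-up this implies, the sketch below rewrites seeds that still contain the `#!` sequence into the plain form. It is a hypothetical helper, not part of NAS or BCWeb: the function name `strip_hashbang` and the example account name are assumptions made for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_hashbang(url: str) -> str:
    """Rewrite 'https://host/#!/path' to 'https://host/path'.

    Hypothetical helper for cleaning a seed list. URLs whose fragment
    does not start with '!' are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url
    # In a '#!' URL the real path is carried in the fragment, after the '!'.
    target = parts.fragment[1:]
    if not target.startswith("/"):
        target = "/" + target
    base = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, base + target, parts.query, ""))


if __name__ == "__main__":
    # 'example_user' is a made-up account name.
    print(strip_hashbang("https://twitter.com/#!/example_user"))
```

Whether the rewritten seed actually resolves still has to be checked per site; the text above only reports this behaviour for Twitter.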

This project was also the occasion to see our new nomination tool (BCWeb) working with NAS on a large scale. It proved to be very useful, even where we sometimes had to adjust the frequency of certain captures (to densify harvests for the electoral weekends, for example).

ONB:

Since the beginning of August we have been crawling with double the bandwidth (now 10 Mbit), and we are currently working on virtualisation of our servers.
We are currently crawling academic (ac.at) and governmental (gv.at) websites with NAS version 3.18.3. The new version runs without any problems.
Andreas worked a lot on creating new reports (with our Hadoop cluster).
The most recent crawl of the .at domain was finished with 6.3 TB compressed / 10.22 TB raw (with a limit of max. 100 MB per domain).
We will start the next domain crawl in January 2013.
Our next thematic crawl will be about Austrian libraries and cultural institutes abroad.

Date for NAS workshop at SB

...
