2013-11-12 Statusmeeting

Agenda for the joint BNF, ONB, SB and KB NetarchiveSuite tele-conference November 12th 2013, 13:00-14:00.

Practical information

  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: the Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BNF: Sara and Nicolas
  • ONB: Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin, Mikis and Sabine
  • Any other issues to be discussed on today's tele-conference?

Development

4.2 issues:

  • Is it ok that the WARCWriter properties write-requests, write-metadata, write-revisit-for-identical-digests and write-revisit-for-not-modified are now defined through the global settings?
  • BnF: The number of configurations per job sometimes crosses over between broad crawl and selective crawl?
  • BnF: Selective crawl deduplication interferes with the broad crawl.

Spring workshop in Vienna

Doodle poll here.

Curator roadmap

Status of the production sites

Netarkivet

We upgraded our production environment to version 4.2. I am sure the new features will save us some time. :-)

We are working on improving our documentation, and I am quite sure that we can migrate most of it to NAS using the extended fields. Andreas from ONB has promised to fix the bugs in the extended fields, so we are looking forward to the next upgrade of NAS and hope that Andreas will find time for the bug fixing before the next release in January.

We started our 4th broad crawl for 2013 about one week ago.

At the beginning of October we started an event crawl on the local and regional elections, which will take place on the 19th of November.

BnF

We started our 2013 broad crawl on the 21st of October, with a list of just over 4 million domains. We have had a few problems:

- New management hardware in the storage rack had some unpredictable side effects. Under heavy load, writes to the ARC files were delayed for too long, which caused Heritrix to run into timeout errors. We therefore had to reduce the number of threads per job. This has worked, although we still have to be careful when launching our selective crawls.

- The broad crawl also caused problems for the creation of the deduplication index for our daily crawl of subscription press. We now put the jobs from the broad crawl on pause each day while the index is created, which means we lose about 40 minutes per day.

- We discovered a bug in the new version of NAS, which affected the number of configurations per job in the broad crawl - when a selective crawl was launched before all the jobs of the broad crawl were created, this caused the number of configurations per job to be set to the number used in the selective crawls (500 instead of 3,500). This bug seems to be random so we hadn't seen it during our tests. We suspended the launch of selective crawls until all the jobs for the broad crawl had been created, and will need to do the same for the second stage of the broad crawl.

We have managed to work round these problems and the crawl is continuing, but the first stage will take slightly longer than planned.

Also during November we will be starting a crawl for the centenary of World War I. The first crawl will be based on the official commemoration sites, and we will be expanding the scope for subsequent crawls during the period 2014-2018.

 

ONB

  • The academic & government crawl is almost finished; we are waiting for two domains to finish (~4000 domains crawled to a limit of 7 GB)
  • Ongoing crawls for our media and political collections
  • By the end of the year our online search should be available
  • After that, bug fixing for the extended fields will start

Next meeting

 

Any other business?