2018-12-04 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2018-12-04, 13:00-14:00.

Participants

  • BNF: Sara, Géraldine
  • ONB: Andreas, Michaela
  • KB/DK - Copenhagen: Tue, Stephen, Anders
  • KB/DK - Aarhus: Colin, Sabine
  • BNE: Mar
  • KB/Sweden: Bengt

Update on NAS latest tests and developments

NetarchiveSuite 5.5 has been released. See NetarchiveSuite 5.5 Release Notes for more details about the content of the release.

The team in Denmark is currently migrating the entire Danish Netarchive to the bitarchiving software from bitrepository.org. Among other things, this involves moving all the current batch functionality to a new mass-processing framework, almost certainly Hadoop. Once the migration is complete, much of the current backend NAS code will fall out of use in the Danish archive. We therefore need to find out which parts are in use at other institutions and which parts can be deprecated without disturbing anyone.

Preparation of 2019 NAS workshop

2019 NAS workshop

Please fill in the list of topics you wish to have in the agenda.

Quick feedback from IIPC GA and WAC

Status of the production sites

Netarkivet

Broad crawl

Step 2 of our third broad crawl (with a data limit of 14 GB per domain) is still ongoing. It is progressing rather slowly. The reason might be the growing centralization of web hosting sites. We also have problems with job scheduling, running of jobs, and monitoring of the broad crawls in “GUI open”.

Selective crawls

We often run into problems that we cannot solve without developers' assistance, e.g.:

- IP-validated access to content behind paywalls (the website owner claims to have established the access, but it does not work)

- Quite a few websites are blocking our crawlers even though they are obliged to give access under the legal deposit law

Event crawl

We ran our mini-event harvest “Week 46”: websites of local broadcast stations (both radio and television).

Special crawl

We had a follow-up to the special crawl for the manhunt by Danish police on 28 September, when the Danish Secret Service (PET) revealed that the Iranian Secret Service had been prevented from carrying out an assassination on Danish soil. We crawled foreign news media articles on the revelation.

BnF

Our annual broad crawl is still underway. It began on 8th October and must finish before the end of December. We have already crawled more than 68.52 TB. The crawl is taking more time than last year because of several technical problems.

One problem came from an unresponsive Heritrix process: the Heritrix process was no longer reachable, even by the HarvestController. The communication port was blocked, and new instances of Heritrix were created but stopped instantly. Bert, from the IT team, has activated a new monitoring script that will kill the HarvestController in case of a hung Heritrix process.

The second source of problems was the infrastructure: the disks were saturated, causing excessive latency for both read and write operations. We therefore had to reduce the number of threads, which slows down the crawl overall.
Other infrastructure problems:
- A CPU on a physical machine failed. The physical machine was removed from the pool and all its virtual machines (VMs) were moved to another physical machine, but the network connection was lost for four VMs during the move. Consequently, several jobs failed. Moreover, we had to replace a virtual hard disk on a VM, and due to a failure to copy the deduplication index, several jobs were launched and failed.
- Due to an oversight, the maximum number of files that a process can have open at the same time was not increased for the broad crawl, and several jobs were launched and failed.

But the most important problem comes from the broker. Several times during the crawl, the broker crashed, causing all active jobs of the broad crawl, and even those of the selective crawls, to fail. So far it has been impossible to find the reason for the crashes. The saturated hard disks and their latency may be responsible. Bert will investigate further.

ONB

  • We are almost finished with our Domain Crawl. We are waiting for the resubmitted jobs to finish.
  • We finished our half-yearly Women/Gender Crawl.



BNE


KB-Sweden


Next meetings

  • January 8th 2019

Any other business?