Agenda for the joint KB, BNF, ONB and BNE NetarchiveSuite tele-conference 2017-12-05, 13:00-14:00.
Participants
- BNF: Sara
- ONB: Michaela, Andreas
- KB/DK - Copenhagen: Stephen, Tue, Nicholas
- KB/DK - Aarhus: Colin, Sabine
- BNE: Mar
- KB/Sweden: -
Upcoming NAS developments
- Release of NAS has been delayed because of issues with our compression project
- ongoing work on 5.3.2: https://sbforge.org/jira/secure/RapidBoard.jspa?projectKey=NAS&rapidView=8
- Priorities for Netarkivet development team in 2018 (in addition to NAS bug-fixing, performance issues etc.)
- Browse-access with OpenWayback and https support
- Improved harvesting with umbra
- API harvesting of social media
Status of the production sites
Netarkivet
- Our fourth broad crawl for 2017 with a budget of 10 MB per domain started on November 14 and finished on November 23. We captured a little less than four TB.
- Our event harvest on the local and regional elections of November 21 is almost finished. We will give the different definitions one or two more crawls.
- Our election Facebook crawl will be run with Archive-It; we calculated that we could crawl about 1000 Facebook profiles within our account budget. Setting up the crawl takes quite some time. We will intentionally run the Facebook crawl after the elections, as we will be able to capture content retrospectively.
- As mentioned before, we also used BCWeb for the election harvest. As BCWeb was only accessible internally at KB, this serves as a pilot project for using BCWeb with a colleague outside Netarchive. In the next couple of weeks, we will evaluate the different elements of the event harvest.
BnF
Our first broad crawl with NAS5 and H3 is finished! We crawled 101.55 TB in 6 weeks. We encountered 4 problems during this crawl:
- a storage saturation problem with our new infrastructure (we lost 16 jobs of the broad crawl and a few jobs from selective crawls)
- an out of memory problem on the GUI and the broker (with no data loss)
- the public_suffixes.dat file introduced in NAS5 caused H3 to create many per-host queues for the domain blogspot.com instead of a single per-domain queue
- some second-level TLDs were also created as domains, which broadened the crawl scope
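The public-suffix effect can be illustrated with a small sketch (illustrative Python only, not actual NAS or Heritrix code; the function name and matching logic are assumptions for demonstration): when a registered domain such as blogspot.com is listed as a public suffix, every host beneath it becomes its own queue key instead of collapsing into one per-domain queue.

```python
# Illustrative sketch, NOT Heritrix/NAS code: how listing a registered
# domain as a public suffix changes crawler queue assignment.

def queue_key(host, public_suffixes):
    """Return the 'matched suffix plus one label' a crawler would queue by."""
    labels = host.split(".")
    # Try candidates from longest to shortest to find the matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in public_suffixes:
            # Queue by the label directly above the matched suffix.
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels[-2:])  # fallback: second-level domain

# Without blogspot.com as a suffix: one queue for the whole domain.
print(queue_key("alice.blogspot.com", {"com"}))                   # blogspot.com
# With blogspot.com listed as a suffix: a separate queue per host.
print(queue_key("alice.blogspot.com", {"com", "blogspot.com"}))   # alice.blogspot.com
```

With a large hosting domain like blogspot.com, this per-host splitting multiplies the number of frontier queues, which matches the behaviour observed in the broad crawl.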
We received only 5 complaints from web publishers compared to around 15 in 2016. During the coming weeks, we are going to analyse the crawl reports and the quality of the archives to produce a report on the crawl.
In parallel, we had scheduling issues: our daily news crawls stopped three times. Two jobs were submitted with the same ID and this changed the status of the selective harvest from active to inactive.
ONB
BNE
Next meetings
- January 9th, 2018