Update on latest NAS tests and developments

Feedback on recent work regarding NetarchiveSuite 5.6 (cookie issue)

BnF discovered that NAS 5.6 no longer sends cookies with any requests, which was affecting the quality of their harvests. The bug was easily reproducible and was found to have been introduced between versions 5.4.2 and 5.5, so the Netarkivet production system is also affected.

After a considerable amount of detective work, we discovered that the bug came not from a change in NAS or Heritrix code but from a change in one of the third-party libraries: we had somehow come to downgrade the version of guava shipped with NAS. Simply substituting a more recent guava version in the bundler zip makes the issue go away. What remains open:

  1. We don't know exactly why the build started packaging only the older guava version
  2. We don't know what in the older guava version causes this behaviour.
  3. Most importantly, we haven't released a patch. We should release patches to both the 5.5 and 5.6 branches.
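
As an aid for the first open point, a minimal sketch of how one might check which guava version(s) a given bundle actually contains. It assumes the jars sit somewhere inside the zip under names of the form guava-<version>.jar; the function name and the version numbers in the usage note are hypothetical and not taken from the NAS build.

```python
import re
import zipfile

# Assumed naming scheme: ".../guava-<version>[-suffix].jar" (hypothetical layout).
GUAVA_JAR = re.compile(r"(?:^|/)guava-(\d+(?:\.\d+)*)[^/]*\.jar$")


def guava_versions(bundle):
    """Return the guava version strings found in a zip bundle, oldest first.

    `bundle` may be a path or a binary file-like object.
    """
    versions = set()
    with zipfile.ZipFile(bundle) as zf:
        for name in zf.namelist():
            match = GUAVA_JAR.search(name)
            if match:
                versions.add(match.group(1))
    # Sort numerically so that, for example, 9.0 comes before 11.0.
    return sorted(versions, key=lambda v: [int(part) for part in v.split(".")])
```

Running this against the bundler zips of two releases (for instance a 5.4.2 and a 5.5 bundle) would show whether the packaged guava version differs between them.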

Heritrix Status

Colin and Clara have spent some time analysing the NAS modifications to Heritrix to see what would be needed to get a community version of Heritrix that we could use in NAS. There seem to be three changes we would want included:

  1. Adding a timeout to the crawler-trap regex test
  2. Adding a filter to prevent inline images being interpreted as links
  3. Modifying the frontier to add additional methods to browse the queued URLs (including upgrading Berkeley DB - for which Andy Jackson already has a pull request).
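
To illustrate the idea behind the second change, here is a minimal sketch. It assumes that "inline images" means data: URIs embedded in a page (for example data:image/png;base64,...) and that the filter works on a list of extracted link candidates. Both function names are hypothetical, and this is not the NAS or Heritrix implementation.

```python
def is_inline_image(uri):
    """True if the candidate is a data: URI carrying an image (assumed meaning of 'inline image')."""
    return uri.strip().lower().startswith("data:image/")


def filter_links(candidates):
    """Drop inline-image data URIs so they are not queued as links."""
    return [uri for uri in candidates if not is_inline_image(uri)]
```

An ordinary image URL such as http://example.org/logo.png passes through unchanged; only candidates whose scheme and media type mark them as embedded image data are dropped.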

Colin has reimplemented each of these in separate branches on top of the current IIPC/IA master branch (including Andy's pull request for no. 3), and created a fourth branch (https://github.com/netarchivesuite/heritrix3/tree/h3.4-merge) in which all three modifications are merged. There is also a NetarchiveSuite branch (https://github.com/netarchivesuite/netarchivesuite/tree/h3.4) which can be built against this Heritrix once it has been installed locally with Maven. What we need to do now is:

  1. Basic functional testing of nas/h3.4 (i.e. the release candidate for NAS 5.7)
  2. More extensive (acceptance) testing of nas/h3.4
  3. Make a series of pull requests to try to get our code into the main Heritrix repository.

Even if we aren't able to get our pull requests accepted quickly, we should still base future releases on this work, as we would then have a Heritrix version very close to the community version, making it easy to pull in future upstream changes.

Status of the production sites

Netarkivet

Our 4th broad crawl for 2019 is running (including the selective crawls: very big sites, ministries and administrative bodies, scientific publications, YouTube).

We are focusing on extremist sites and are working on the identification of left-wing, right-wing and Islamic extremist sites.

We are discussing how to give researchers access to our documentation. When we chose Confluence for our documentation we thought of giving researchers access to relevant pages, but this does not seem to be a good idea. netarkivet.dk will be the platform for giving access to relevant documentation.

Other projects keeping us busy (almost the same as last month):

  • Work on risk assessment
  • Implementation of SolrWayback
  • Consolidation and upgrade of BCWeb (build up a community)
  • Revision of collection strategies
  • Capture of content behind paywalls – the never-ending story
  • New procedure for access (both internal and external)

BnF

ONB

BNE

Currently, these are our main tasks:

  • We are working to install BCWeb 6.1. We think it will be ready next week.
  • The IT team is working to resolve the index problems that we have in OpenWayback.
  • We are considering doing a “massive” crawl of freely accessible periodicals.
  • We are creating a new collection on climate change because of the COP that is taking place in Madrid.

KB-Sweden

Next meetings

  • December 3
  • January 7, 2020

Any other business?