2019-02-05 Statusmeeting

Agenda for the joint NetarchiveSuite tele-conference 2019-02-05, 13:00-14:00.

Participants

  • BNF: Sara, Géraldine, Clara
  • ONB: Andreas, Michaela
  • KB/DK - Copenhagen: Tue, Stephen, Anders
  • KB/DK - Aarhus: Colin, Sabine, Kristian
  • BNE: Mar
  • KB/Sweden: Par, Thomas

Update on NAS latest tests and developments

Agenda of 2019 NAS workshop

2019 NAS workshop

Please contribute to the agenda with any presentation/input on all topics and make sure the list of participants for each institution is complete.

Status of the production sites

Netarkivet

Broad crawls

We started our first broad crawl for 2019 on January 26 (step 1, with a limit of 10 MB per domain). We have withdrawn a number of sites from the normal broad crawl. They are crawled simultaneously in three definitions: “ultra big sites”, “OAI extraction” (research databases) and “ministries and government agencies”.

A big issue for our broad crawls is web hosting companies. To avoid being blocked by them, we make agreements with them and set up throttling so that we do not overload their servers.

Selective crawls

We focus on getting content behind paywalls by negotiating for IP validation. Paywalls are an issue for almost all national news media, and we will miss essential content if we cannot get behind them.

Event crawls

Parliamentary elections will be held this year, before the end of June or the beginning of July. We are preparing our strategy for both the parliamentary elections and the elections to the European Parliament, which take place in Denmark on 26 May.

Access forms and procedures

We are trying to set up a more user-friendly procedure for getting access to our archived content.

Netarchive and GDPR

We are checking all our procedures to be sure that we comply with the new European data protection regulation (GDPR). We have made changes to Google Analytics on netarkivet.dk, so we now only collect user data allowed by GDPR.

BnF

Our broad crawl finished on December 23. It comprises 2.1 billion URLs and 106.46 TB. Due to technical difficulties it took a long time: 11 weeks (compared to 6 weeks in 2017). The difficulties came from the new computer architecture and hardware, the broker, and the version of NAS; they resulted in multiple jobs being created that failed, and thus in an overall slowdown of the crawl. We will discuss this during the NAS workshop. The percentage of domains that are fully crawled has also decreased. We haven't finished analysing this collection, but we have chosen to focus on websites published for young people.

We have finished analysing all the 2018 crawl reports. Over the year we crawled 2.6 billion URLs and 136.15 TB. This is 9 TB less than in 2017 because of deduplication: we crawled more in 2018, but with deduplication, especially for the broad crawl. The proportion of the broad crawl relative to the selective crawls is still growing: the broad crawl represents 78% of the 2018 collections, compared with 70% in 2017. Our collections now amount to more than 1 petabyte (1,074.73 TB).

From mid-December to mid-January, we organised an internal workshop to improve the harvesting of social media (Facebook, Instagram, Twitter). We are able to crawl Facebook with the same Heritrix template we used for Twitter, but the quality of the crawl isn't guaranteed: it drops significantly when there are more than 500 accounts in the job, and it varies widely from one crawl to another (sometimes we crawl nothing). We basically crawl the homepage, the posts and a lot of images. It is difficult to know exactly which images we crawl because many of them are not visible in the Wayback.

During the workshop, we also tried to crawl social media with Umbra. Umbra is very complex to install, and there is no information exchange between Umbra and NAS: sometimes Umbra failed and Heritrix continued to collect. However, Umbra allows us to crawl the images on Instagram that we couldn't crawl with Heritrix.

We also compared the rendering of the web archives in Python Wayback and OpenWayback. The rendering is better with Python Wayback, especially for Instagram: the images are displayed, whereas in OpenWayback we get just a white page. For Twitter, scrolling down seems to work in the access tool (but we must do more tests). For Facebook, we hardly noticed any change.

ONB


BNE

  • Thematic collections managed by university web curators:
    • In the framework of the agreement with the University Libraries Network in Spain for cooperation on web archiving, we have been building the basis of the collections on different topics for the university curators.
    • As the universities in Spain don’t share the same national network as the national and regional libraries, we have been analyzing with our IT team the easiest and safest way for them to connect and participate in the project. Remote access by VPN will be implemented for this purpose.
  • Collaborative collections with regional libraries:
    • A new collection has been launched for the local and regional elections that will take place in Spain next May. Via BCWeb, all the regional web curators that manage their own collections are invited to participate in this collaborative event collection.
    • The event collection on the Regional Elections in Andalusia, which was managed directly by the web curators of that region, is about to be closed, as a new government has been formed in the regional parliament. This collection included 442 seeds and collected almost 5 TB of information.
  • Other collections:
    • At the end of December we launched an event collection on the European elections, which will also take place next May.
    • National Politics and Catalonian Politics are almost stable collections:
      • National Politics: 1,162 seeds / 9 TB
      • Catalonian Politics: 2,048 seeds / 9 TB

  • Web curators' collaborative work and training:
    • BNE web curators: we are trying to build a stable working group and schedule regular meetings to monitor and standardize procedures with all the web curators at the Library.
    • Regional libraries web curators: we are trying to schedule regular online meetings for the same purpose.

  • Our main activity is now focused on the organization of the NAS Workshop in Madrid.

KB-Sweden


Next meetings

  • March 12
  • April 9
  • May 7
  • June 4
  • July 2
  • September 10
  • October 8
  • November 5
  • December 3
  • January 7, 2020

Any other business?

  • Thematic collections for university curators:
    • We are putting the finishing touches on the thematic collections that will be the starting point for the web curators of the Spanish University Libraries Network (Red de Bibliotecas Universitarias Españolas). These collections were built mainly from seeds extracted from the Public Bodies (Organismos Públicos) collection that covered more specific topics (pure sciences, applied sciences, etc.). They were crawled once in depth and are now being reconfigured with a shallower depth but a regular frequency.
    • We have studied the different access options for the university web curators. Those with access to the State Administration Network (SARA) are configuring their communications and registering our domain so that they can reach the tools. Many of them have no connection to the Administration Network, so VPN access has been set up for them, that is, remote access, which we will test in the coming days.
  • Collaborative collections with the autonomous communities:
    • A new collection, “Elecciones municipales y autonómicas 2019” (2019 municipal and regional elections), has been created. We hope all the autonomous communities will collaborate in it by providing seeds for their own territory. As with other collaborative collections, we have asked all the web curators to participate and have also promoted it in the Consejo de Cooperación Bibliotecaria group. The collection will start harvesting its first seeds in February.
    • We are also going to ask the autonomous communities to collaborate on a new collection that we proposed a few months ago but have not yet launched: “Patrimonio popular” (popular heritage). It focuses on small websites of municipalities or regions that are at high risk of disappearing and hold very important, ephemeral heritage information. Our idea is to draw on citizen collaboration through social media, since these sites are little known and not widely publicized.
    • During February we will set the Andalusian Elections collection to inactive. We worked closely with the Autonomous Community of Andalusia on this collection. We harvested 442 sites and almost 5 TB of information.
  • Other collections:
    • Since the end of December we have been harvesting content on the European elections. This month we increased the crawl frequency, moving many of the seeds from monthly to weekly.
    • We continue to feed the National Politics (1,162 seeds, 9 TB) and Catalonian Politics (2,048 seeds, 9 TB) collections, since they generate a lot of content.
  • Communication among web curators: we are trying to set up a working group of the Library's web curators so that we can meet at least once a month. We are also looking at which communication tools we could adopt to hold virtual meetings with the external web curators.