Our first broad crawl with NAS5 and H3 is finished! We crawled 101.55 TB in 6 weeks. We encountered 4 problems during this crawl:
  • a storage saturation problem with our new infrastructure (we lost 16 jobs of the broad crawl and a few jobs from selective crawls)
  • an out of memory problem on the GUI and the broker (with no data loss)
  • the use of public_suffixes.dat, introduced in NAS5, caused H3 to create many per-host queues for the domain blogspot.com instead of a single per-domain queue
  • some second-level TLDs were also created as domains, which broadened the crawl scopes
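
The queue problem above can be illustrated with a minimal sketch. It is not the H3 or NAS5 implementation. The function `queue_key` and the two small suffix sets are hypothetical, and the sketch assumes that blogspot.com appears as a suffix in public_suffixes.dat, which the behaviour described above suggests. The rule shown is the usual one for suffix lists: the queue key is the longest matching suffix plus one more label.

```python
# Hypothetical, hand-picked suffix sets for illustration only;
# a real public suffix file contains thousands of entries.
BASIC_SUFFIXES = {"com", "fr", "co.uk"}
EXTENDED_SUFFIXES = BASIC_SUFFIXES | {"blogspot.com"}


def queue_key(host: str, suffixes: set) -> str:
    """Return the longest matching suffix plus one extra label (assumed rule)."""
    labels = host.lower().rstrip(".").split(".")
    # Candidates are tried from longest to shortest, so the first hit
    # is the longest suffix present in the set.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The host is itself a listed suffix.
                return candidate
            return ".".join(labels[i - 1:])
    # No known suffix: fall back to the last two labels.
    return ".".join(labels[-2:])


hosts = ["alice.blogspot.com", "bob.blogspot.com", "www.alice.blogspot.com"]

# Without blogspot.com as a suffix, every host shares one queue.
keys_basic = {queue_key(h, BASIC_SUFFIXES) for h in hosts}

# With blogspot.com as a suffix, each blog becomes its own "domain".
keys_extended = {queue_key(h, EXTENDED_SUFFIXES) for h in hosts}
```

Under these assumptions, `keys_basic` holds the single key `blogspot.com`, while `keys_extended` holds one key per blog. The same mechanism would explain the broader scopes: any second-level entry listed as a suffix turns each name registered under it into a separate domain.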


We received only 5 complaints from web publishers compared to around 15 in 2016. During the coming weeks, we are going to analyse the crawl reports and the quality of the archives to produce a report on the crawl.

In parallel, we had scheduling issues: our daily news crawls stopped three times. Two jobs were submitted with the same ID, which changed the status of the selective harvest from active to inactive.

ONB


BNE


Dear colleagues,

Last month, we successfully migrated all our web collections to the production environment of NAS 5. We are reasonably happy with the new environment.

However, despite the tests we ran on the preproduction environment, we experienced some problems, mainly related to the configuration of templates in NAS 5.

Frontpage+1 and frontpage+2 didn’t work as expected. We also noticed that some of the crawls ran very fast but stopped as soon as they encountered any slight problem and didn’t manage to finish.

Juan Carlos compared the NAS 5 templates with the ones in NAS 4 and adjusted some parameters. Everything now appears to be working properly: crawls finish faster than before and harvest more objects. However, the default template is not working yet, and my IT colleagues are studying its configuration.

We are waiting for the system to become more stable before running the .gal domain crawl. We hope to launch it before the end of the year.

The Library is mirroring its storage in another location of the Ministry of Education and Culture, so we will have a copy of our web archive there within the next few months.

The access we enabled for users in the middle of last summer is only available from the BNE and the regional libraries that asked for it. Although we publicised this new service, we have not had many consultations so far, as access is only available on-site and the interface (the default OpenWayback) is not very user-friendly. We give open access (on the internet) to the captures we have of a given website (the calendar), but once you try to access the content, a message pops up noting that access is limited to on-site facilities for copyright reasons.
