2014-10-28 Statusmeeting

Agenda for the joint BnF, ONB, SB and KB NetarchiveSuite tele-conference on October 28th 2014, 13:00-14:00.

Practical information

  • TDC tele-conference:
    • Dial in number (+45) 70 26 50 45
    • Dial in code 9064479#
  • Bridgit: The Bridgit conference will be available about 5 minutes before the start of the meeting. The Bridgit URL is konf01.statsbiblioteket.dk. The Bridgit password is sbview.

Participants

  • BnF: Clément, Lam and Annick
  • ONB: Michaela and Andreas
  • KB: Tue, Søren and Nicholas
  • SB: Colin, Sabine and Mikis

Development

  • H3 development is in progress (Søren and Nicholas); the first alpha release of NAS with Heritrix3 support shouldn't be too far away.
  • Nicholas is investigating the possibility of HTTPS support in Wayback.
  • ONB and BnF are now very welcome to test the 5.0-SNAPSHOT code base. 
  • Andreas and Lam will probably have time to test the redesigned codebase in the next couple of weeks.

NetarchiveSuite workshop 2014-2015

  • BnF would like to have time to discuss harvesting of eBooks (PDFs) and improving the harvesting of newspapers.
    • BnF: Crawling newspaper/eBook.

Contributions to the international seminar on the 28th?

Status of the production sites

Netarkivet

  • OpenWayback 2.0 has been customized for the Netarkivet.dk deployment and is ready for production.
  • We have investigated the possibilities for a Wayback setup that supports HTTPS. Our attempt to involve IIPC members in a discussion on HTTPS in Wayback proxy mode did not attract much interest. We will now contact the OpenWayback developers in Norway.
  • We are migrating our wiki documentation (both the administrative data and the collection documentation) to JIRA, a much more modern and flexible wiki than the old one.
  • We are analyzing how we can reduce the amount of captured data for our broad crawls. As we are close to exceeding our budget (we have only 10-20 TB left for 2014), our last broad crawl for 2014 will not go as deep as usual.
  • As this reduced broad crawl will require fewer curator resources, we will have some time to test tools that could help us capture content we can't get with Heritrix.
  • We are still working on our collection policy and on an action plan for 2015-17.
  • All curators will meet one day in November to revise our strategy for event crawls.
  • We have made an agreement with a web hosting provider that had blocked our crawlers: they now allow our crawlers, and we have set up the firewall rules we agreed on.

BnF

We have just started to prepare our annual broad crawl. During the coming month, we will check the different parts of the process. First, we will prepare the ingest of sources in NetarchiveSuite: we have 4.4 million domains from registrars, complemented by 29,000 URLs from BCWeb and other BnF databases. Then, we will configure a number of crawlers from selective harvests to be used for the broad crawl. This year, the engineers for web legal deposit will pay special attention to the environment of the crawlers, as they had many problems last year: for example, they will make sure that the communication with the storage racks works properly. Finally, as the total volume has to be limited to around 55 or 60 TB, we have decided to do only one step instead of two, and we have to estimate the maximum budget we can give to each domain.

ONB

  • Due to budget restrictions, ONB will not be able to send anyone to the Heritrix/Open Wayback training session in London.
  • We are still waiting to hear whether we can attend the NAS meeting / international seminar on web archiving in Estonia in January. We expect a decision within the next 1-2 weeks.
  • Our IT department made a couple of changes recently, and we are working on our infrastructure to restore all functions.
  • We continue working on the new search interface, including a prototype for full-text search.
  • We are working on the WWI crawl; the next domain crawl will take place in 2015.

Next meeting

9th December?

Any other business?