Experience With IA Umbra

Motivation

The most basic frustration of web archiving is that virtually any modern computer running a recent browser can render virtually any web page, yet these same pages often defeat our purpose-built crawlers. So why not leverage the tremendous developer effort that goes into building browsers, and use a browser to render the page we want to harvest, executing any scripts, Flash, etc. that might be needed to generate its links? At the same time, we would like a single crawler to remain in overall charge of the crawl: of crawl budgeting, scope management, and WARC generation. The idea, then, is to use a plugin for the Heritrix crawler that lets it use a conventional web browser as a link extractor. This is how both the Internet Archive's Umbra and the British Library's PhantomJS system work, albeit with slight but important differences. This investigation focuses on Umbra.

Getting Started With Umbra
Integrating Umbra And Heritrix 3