PhantomJS and BL's WebRenderer
PhantomJS is a headless browser stack based on WebKit. It can be controlled programmatically via a JavaScript API. PhantomJS can deliver output in the form of screenshots and HAR-format dumps, and can monitor web activity during page loading, for example listing all URLs loaded. All of this functionality is relevant for web archiving.
The British Library has set up a workflow that uses PhantomJS to harvest screenshots and DOM images during harvesting, and to identify loaded URLs for queueing in Heritrix.
BL has developed two components which need to be installed in this workflow. Webrender-phantomjs is a REST web-service facade to PhantomJS, written in Python/Django and most easily deployed under gunicorn. Webrender-har-daemon is a Python daemon which fetches URLs from a RabbitMQ queue and forwards them to webrender-phantomjs. Webrender-har-daemon also accepts the HAR data returned from PhantomJS, saves it in WARC files, and passes any discovered URLs back to Heritrix, either via RabbitMQ or by writing them directly into Heritrix's action directory.
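The daemon's render-archive-requeue cycle described above can be sketched schematically. This is a minimal illustration of the flow, not the project's actual code; the function names and the render/write_warc/enqueue callables are hypothetical stand-ins for the real REST, WARC-writing and RabbitMQ plumbing:

```python
def process(url, render, write_warc, enqueue):
    """One pass of the daemon loop (schematic).

    render:     stand-in for the call to the webrender-phantomjs REST API;
                assumed here to return (har_bytes, found_urls).
    write_warc: stand-in for storing the HAR data as a WARC record.
    enqueue:    stand-in for passing discovered URLs back to Heritrix.
    """
    har, found_urls = render(url)
    write_warc(url, har)
    for u in found_urls:
        enqueue(u)

# Exercise the flow with dummy callables:
log = []
process("http://www.dr.dk",
        render=lambda u: (b"{...har...}", ["http://www.dr.dk/nyheder"]),
        write_warc=lambda u, h: log.append(("warc", u)),
        enqueue=lambda u: log.append(("queue", u)))
```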
Deploying these components is not entirely trivial, but it doesn't involve any insurmountable challenges either. On Ubuntu, PhantomJS itself is available from the standard software channel. Webrender-phantomjs is available from https://github.com/ukwa/webrender-phantomjs . There is a Dockerfile which can be used for installation; even if you don't want to use Docker, it is still the best documentation of the required dependencies - openssl-devel, libjpeg-devel, pip, django, etc. It would be useful if someone (i.e. me) created a recipe for installing the system from scratch, ideally using a Python virtualenv to create a localised installation.
Webrender-phantomjs is configured mostly in the file webrender-phantomjs/webrender/phantomjs/settings.py . The file gunicorn.ini can also be edited, e.g. to bind the web services to your host interface instead of the loopback interface. The README has instructions for starting the service using either manage.py or gunicorn; I only got gunicorn to work. Once the service is running, try loading a URL like http://pc609.sb.statsbiblioteket.dk:8000/webtools/urls/http://www.netarkivet.dk or http://pc609.sb.statsbiblioteket.dk:8000/webtools/image/http://www.netarkivet.dk to get either a list of URLs or a screenshot of netarkivet.dk.
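The two endpoints can of course also be exercised from code rather than a browser. A minimal stdlib-only client sketch, assuming the service is bound to localhost:8000 (adjust BASE for your host); the /webtools/urls/ and /webtools/image/ paths are the ones shown above:

```python
import urllib.request

BASE = "http://localhost:8000"  # assumption: gunicorn bound to localhost:8000

def endpoint(kind, target):
    """Build the webtools URL; kind is 'urls' (link list) or 'image' (screenshot).
    Note the rendered target URL is appended verbatim, as in the examples above."""
    return "%s/webtools/%s/%s" % (BASE, kind, target)

def fetch(kind, target):
    """Fetch the rendered result as raw bytes (JSON for 'urls', PNG for 'image')."""
    with urllib.request.urlopen(endpoint(kind, target)) as resp:
        return resp.read()

# Example (requires a running webrender-phantomjs service):
# urls_json = fetch("urls", "http://www.netarkivet.dk")
# screenshot = fetch("image", "http://www.netarkivet.dk")
```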
Webrender-har-daemon is easier to get started with. It also has a settings file, which holds pointers to the RabbitMQ queue and the webrender-phantomjs endpoint, as well as the directory paths where you want the WARC files to end up.
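Conceptually the settings boil down to three pieces of information. The variable names below are hypothetical illustrations only, not the actual names used by the project - check the shipped settings file for the real ones:

```python
# Hedged sketch of a webrender-har-daemon configuration (names hypothetical):
AMQP_URL = "amqp://guest:guest@localhost:5672"                # RabbitMQ to consume URLs from
WEBRENDER_ENDPOINT = "http://localhost:8000/webtools/urls/"   # webrender-phantomjs service
WARC_OUTPUT_DIR = "/data/warcs"                               # where finished WARC files land
```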
If you can't be bothered to run an actual harvest, you can test webrender-har-daemon from the command line using the queue-url tool from umbra:
queue-url -u amqp://guest:guest@localhost:5672 --exchange heritrix --routing-key to-webrender -i heritrix http://www.dr.dk
This dumps the relevant HAR data into a WARC file:
WARC/1.0
WARC-Type: warcinfo
WARC-Date: 2015-12-04T11:55:18Z
WARC-Record-ID: <urn:uuid:8201e914-9a75-11e5-886a-e839354beac6>
Content-Type: application/warc-fields
Content-Length: 72

software=warcwriterpool.warcwriterpool/0.1.2
hostname=pc609
ip=127.0.1.1

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://www.dr.dk
WARC-Date: 2015-12-04T12:02:43Z
WARC-Record-ID: <urn:uuid:8b069658-9a76-11e5-886a-e839354beac6>
Content-Type: application/json
Content-Length: 12800725

{ "log": { "version": "0.0.2",
    "creator": { "name": "PhantomJS", "version": "1.9.0" },
    "pages": [ {
        "startedDateTime": "2015-12-04T11:02:36.953Z",
        "id": "http://www.dr.dk",
        "title": "DR Forsiden - TV, Radio, Nyheder og meget mere fra dr.dk",
        "pageTimings": { "onLoad": 2911 },
        "renderedContent": {
            "text": "PCFET0NUWVBFIGh0bWw+PCEtLVtpZiBsdCBJRSA3IF0+IDxodG1sIGxhbmc9ImRhIiBjbGFzcz0iaWUgaWU2Ij4gPCFbZW5kaWZdLS0+PCEtLVtpZiBJRSA3IF0+ICAgIDxodG1sIGxhbmc9ImRhIiBjbGFzcz0iaWUgaWU3Ij4gPCFbZW5kaWZdLS0+PCEtLVtpZiBJRSA4IF0+ICA ...
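Note that the renderedContent.text field in the HAR record is base64-encoded HTML (the "PCFET0NUWVBF..." prefix decodes to "<!DOCTYPE html>..."). A short stdlib sketch for pulling the rendered HTML back out, assuming the JSON structure shown above:

```python
import base64
import json

def rendered_html(har_json):
    """Decode the base64 renderedContent.text of every page in a HAR dump."""
    pages = json.loads(har_json)["log"]["pages"]
    return [base64.b64decode(p["renderedContent"]["text"]).decode("utf-8", "replace")
            for p in pages if "renderedContent" in p]

# Tiny self-contained demo with the same structure as the record above:
sample = json.dumps({"log": {"pages": [
    {"id": "http://www.dr.dk",
     "renderedContent": {"text": base64.b64encode(b"<!DOCTYPE html>").decode()}}]}})
print(rendered_html(sample))  # ['<!DOCTYPE html>']
```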