What is it?
Crawljax (http://crawljax.com) is an open-source tool that crawls a website by controlling a web browser (via the Selenium framework). It can therefore crawl complex, JavaScript-heavy sites by emulating a human user. Can we leverage this to harvest sites that are otherwise difficult to archive with Heritrix?
How can I try it out?
Crawljax is a mature product and is easy to get started with. Just download the web jar and run it:
csr@pc609:~/innoweb/crawljax-web$ java -jar crawljax-web-3.6.jar -p 8081
and you have a browser-based interface for controlling your crawls.
I'm too important to use a browser
This is OK for a start, but to really get to grips with Crawljax you're better off diving into its API. Once you have the right Maven dependencies and repositories configured, getting started with the API is very straightforward. A handful of lines of code will do it. My sample code is at https://github.com/csrster/crawljax-archiver .
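To give a flavour, a minimal crawl through the API looks roughly like this. It's only a sketch, assuming the crawljax-core 3.6 artifact is on the classpath; the URL and the limits are placeholders, and the real code is in the repository above.

import java.util.concurrent.TimeUnit;

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class MinimalCrawl {
    public static void main(String[] args) throws Exception {
        // Build a configuration for the site to crawl (placeholder URL).
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://example.com");
        // Keep the crawl small while experimenting.
        builder.setMaximumDepth(2);
        builder.setMaximumRunTime(5, TimeUnit.MINUTES);
        // Run the crawl and block until it finishes.
        new CrawljaxRunner(builder.build()).call();
    }
}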
Can I see some output?
Not unless you include an output plugin. Fortunately Crawljax comes with a CrawlOverview plugin that does what it says on the tin.
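Hooking it in is a couple of extra lines on the configuration builder. Again a sketch, assuming the separate crawloverview-plugin artifact is on the classpath; the output directory is a placeholder.

import java.io.File;

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.plugins.crawloverview.CrawlOverview;

public class CrawlWithOverview {
    public static void main(String[] args) throws Exception {
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://example.com");
        // Screenshots and the state graph end up under this directory.
        builder.setOutputDirectory(new File("out/crawl-overview"));
        builder.addPlugin(new CrawlOverview());
        new CrawljaxRunner(builder.build()).call();
    }
}

The result is a small static site: one screenshot per discovered state plus a graph of the transitions between them.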
Where are the warcs?
Crawljax delegates browsing to the browser, so it doesn't have the information to build a warc. To build a warc you need to tunnel the browsing through a warc-writing proxy.
That sounds hard
Actually it's surprisingly easy. First you need to download warcprox from https://github.com/internetarchive/warcprox . You can start it with
/usr/bin/python /usr/local/bin/warcprox -p 4338 -d /home/csr/innoweb/warcs -s 10000000
This starts a proxy on port 4338, which listens to the web traffic passing through it and writes warcs to the specified directory, rolling over every 10 MB. Setting Crawljax to use a proxy for http is a standard API call.
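Concretely, something like this (a sketch assuming warcprox is running locally on port 4338 as started above; the URL is a placeholder):

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.configuration.ProxyConfiguration;

public class ProxiedCrawl {
    public static void main(String[] args) throws Exception {
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://example.com");
        // Route the browser's http traffic through the warc-writing proxy.
        builder.setProxyConfig(ProxyConfiguration.manualProxyOn("localhost", 4338));
        new CrawljaxRunner(builder.build()).call();
    }
}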
But what about https?
That's only slightly trickier. You have to tell Crawljax not only to proxy https but also to accept untrusted certificates, like the ones provided by warcprox. The code is in https://github.com/csrster/crawljax-archiver/blob/master/src/main/java/CrawlJAXProxyHarvester.java .
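The gist is a Firefox profile that proxies both protocols and relaxes certificate checking, handed to Crawljax through a custom browser provider. The snippet below sketches only the Selenium side of that (the class, method name and parameters are mine, and it assumes the Selenium version Crawljax 3.6 pulls in); the actual wiring into Crawljax is in the class linked above.

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.openqa.selenium.firefox.FirefoxProfile;

public class ProxiedFirefox {
    // Builds a Firefox instance that sends both http and https through the
    // proxy and accepts the self-signed certificates warcprox serves.
    public static WebDriver proxiedBrowser(String proxyHost, int proxyPort) {
        FirefoxProfile profile = new FirefoxProfile();
        profile.setPreference("network.proxy.type", 1); // 1 = manual proxy configuration
        profile.setPreference("network.proxy.http", proxyHost);
        profile.setPreference("network.proxy.http_port", proxyPort);
        profile.setPreference("network.proxy.ssl", proxyHost);
        profile.setPreference("network.proxy.ssl_port", proxyPort);
        // warcprox man-in-the-middles https with its own CA, so the
        // certificates it serves are untrusted by default.
        profile.setAcceptUntrustedCertificates(true);
        profile.setAssumeUntrustedCertificateIssuer(true);
        return new FirefoxDriver(profile);
    }
}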
Show me some more pictures
Ok.
So that's solved all our webcrawling problems
Well, not exactly. Some problems with using Crawljax:
- Crawljax isn't (yet) integrated with NetarchiveSuite, so the crawl definitions and metadata wouldn't be integrated with the rest of the archive.
- The warcs come from a different source (warcprox), so we end up with an inhomogeneous archive.
- The crawl-definition API for Crawljax is wildly different from Heritrix's, and we would need considerable resources to build up expertise in it.
- The crawl-definition API also seems a bit too restricted. This can be worked around by writing plugins that call JavaScript (see the sketch after this list), but that's a lot of work if you need to do it for every domain.
- Crawljax doesn't have crawl budgeting by size.
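As an illustration of the plugin-plus-JavaScript workaround mentioned in the list, something along these lines would scroll every newly discovered state to the bottom of the page. It's a sketch against the 3.6 plugin API as I read it; the class name and the script are just examples.

import com.crawljax.core.CrawlerContext;
import com.crawljax.core.plugin.OnNewStatePlugin;
import com.crawljax.core.state.StateVertex;

// A domain-specific tweak expressed as a plugin: scroll each new state to
// the bottom of the page so that lazily loaded content gets pulled in.
public class ScrollDownPlugin implements OnNewStatePlugin {
    @Override
    public void onNewState(CrawlerContext context, StateVertex newState) {
        context.getBrowser().executeJavaScript(
                "window.scrollTo(0, document.body.scrollHeight);");
    }
}

It gets registered with builder.addPlugin(new ScrollDownPlugin()), but you would still need a variant for every site with its own quirks, which is exactly the scaling problem noted above.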
What's the alternative?
One possible alternative is Umbra.
Umbra uses a browser but acts as a link extractor, sending the links it finds back to Heritrix for harvesting. Umbra is clearly less mature and less well documented than Crawljax. It doesn't seem to have an API as such, just some "hooks" where you can add JavaScript to be executed on each harvested DOM (click on this, scroll by that, etc.). But as we have seen, in many cases even with Crawljax you need to specify some JavaScript to get what you want. And Umbra has the huge advantage that it uses Heritrix for harvesting, so all your warc generation, metadata generation and harvest budgeting is still available. But you do lose the pretty screenshots!