What is it?

Crawljax (http://crawljax.com) is an open-source tool that crawls a website by controlling a web browser (via the Selenium framework). It can therefore crawl complex, JavaScript-heavy sites by emulating a human user. Can we leverage this to harvest sites that are otherwise difficult to archive with Heritrix?

How can I try it out?

Crawljax is a mature product and is easy to get started with. Just download the web jar and run it:

csr@pc609:~/innoweb/crawljax-web$ java -jar crawljax-web-3.6.jar -p 8081

and you have a browser-based interface (at http://localhost:8081 in this example) for controlling your crawls.

I'm too important to use a browser

This is OK for a start, but to really get to grips with Crawljax you're better off diving into its API. Once you have the right Maven dependencies and repositories configured, getting started with the API is very straightforward: a handful of lines of code will do it. My sample code is at https://github.com/csrster/crawljax-archiver .
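For illustration, here is a minimal sketch of what such an API-driven crawl can look like, loosely based on the Crawljax 3.x examples. The target URL is a placeholder, and the exact builder methods and the com.crawljax:crawljax-core Maven coordinates are assumptions that may differ between versions.

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class MinimalCrawl {
    public static void main(String[] args) {
        // Build a configuration for the site to crawl (example URL).
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://example.com");

        // Keep the crawl small: limit depth and number of discovered states
        // (assumed setters from the Crawljax 3.x configuration builder).
        builder.setMaximumDepth(2);
        builder.setMaximumStates(50);

        // Click the default set of clickable elements (links, buttons, etc.).
        builder.crawlRules().clickDefaultElements();

        // Run the crawl; call() blocks until the crawl finishes.
        CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
        crawljax.call();
    }
}

The interesting part for archiving would then be hooking into the crawl (e.g. via plugins or a recording proxy) to capture what the browser fetches, which is what the sample code in the repository above explores.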

 
