What is it?
Crawljax (http://crawljax.com) is an open-source tool that can crawl a website by controlling a web browser (via the Selenium framework). It can therefore crawl complex, JavaScript-heavy sites by emulating a human user. Can we leverage this to harvest sites that are otherwise difficult to archive with Heritrix?
How can I try it out?
Crawljax is a mature product and is easy to get started with. Just download the web-jar and run it:
csr@pc609:~/innoweb/crawljax-web$ java -jar crawljax-web-3.6.jar -p 8081
and you have a browser-based interface (at http://localhost:8081) for controlling your crawls.
I'm too important to use a browser
This is OK for a start, but to really get to grips with Crawljax you're better off diving into its API. Once you have the right Maven dependencies and repositories configured, getting started with the API is very straightforward: a handful of lines of code will do it. My sample code is at https://github.com/csrster/crawljax-archiver .
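For illustration, here is a minimal sketch of such a crawl using the Crawljax 3.x API; the start URL, state limit, and depth below are placeholder values, and it assumes the com.crawljax:crawljax-core artifact (version 3.6 at the time of writing) is on your classpath:

import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class SimpleCrawl {
    public static void main(String[] args) throws Exception {
        // Build a configuration for the site to crawl (placeholder URL).
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://example.com");

        // Keep the crawl small while experimenting: cap the number of
        // distinct DOM states and the click depth Crawljax will explore.
        builder.setMaximumStates(10);
        builder.setMaximumDepth(3);

        // Run the crawl; CrawljaxRunner drives the browser via Selenium.
        CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
        crawljax.call();
    }
}

Compile and run that, and Crawljax will open a browser and start clicking its way through the site, which is essentially what the web interface above does for you behind the scenes.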