...

This is ok for a start, but to really get to grips with Crawljax you're better off diving into its API. Once you have the right Maven dependencies and repositories configured, getting started with the API is very straightforward. A handful of lines of code will do it. My sample code is at https://github.com/csrster/crawljax-archiver .
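
For orientation, a minimal crawl with the Crawljax builder API looks roughly like this (a sketch, not the exact code from my repository; the URL and the limits are just placeholders):

Code Block
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;

public class MinimalCrawl {
    public static void main(String[] args) {
        // Point the builder at the site to be crawled (placeholder URL)
        CrawljaxConfigurationBuilder builder =
                CrawljaxConfiguration.builderFor("http://www.example.com");
        // Keep the crawl small while experimenting
        builder.setMaximumStates(10);
        builder.setMaximumDepth(2);
        // Build the configuration and run the crawl
        CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
        crawljax.call();
    }
}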

Can I see some output?

Not unless you include an output plugin. Fortunately Crawljax comes with a CrawlOverview plugin that does what it says on the tin.
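
Registering it is a one-liner on the same configuration builder (a sketch; I believe the plugin lives in the separate crawloverview-plugin artifact):

Code Block
import com.crawljax.plugins.crawloverview.CrawlOverview;

// Add to the builder from the earlier example: CrawlOverview writes an HTML
// report with screenshots and a state graph into the crawl's output directory.
builder.addPlugin(new CrawlOverview());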

Where are the warcs?

Crawljax delegates the actual browsing to the browser, so it doesn't have access to the raw requests and responses needed to build a warc. To build a warc you need to tunnel the browsing through a warc-writing proxy.

That sounds hard

Actually it's surprisingly easy. First you need to download warcprox from https://github.com/internetarchive/warcprox . You can start it with

Code Block
/usr/bin/python /usr/local/bin/warcprox -p 4338 -d /home/csr/innoweb/warcs -s 10000000

This starts a proxy on port 4338 which will listen to the web traffic passing through it and write warcs to the specified directory, rolling over every 10MB. Setting Crawljax to use a proxy for http is a standard API call.
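
With the builder from before, that call looks roughly like this (a sketch; the host and port must of course match the warcprox instance started above):

Code Block
import com.crawljax.core.configuration.ProxyConfiguration;

// Send the browser's http traffic through warcprox so every fetched
// resource ends up in a warc.
builder.setProxyConfig(ProxyConfiguration.manualProxyOn("localhost", 4338));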

But what about https?

That's only slightly trickier. You have to tell Crawljax not only to proxy https but also to accept untrusted certificates, like the ones provided by warcprox. The code is in https://github.com/csrster/crawljax-archiver/blob/master/src/main/java/CrawlJAXProxyHarvester.java .
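
Under the hood this boils down to the Firefox profile the crawl browser runs with: proxy both http and ssl through warcprox, and accept its self-signed certificates. The settings look roughly like this (a Selenium-level sketch, not the exact code from the linked class; wiring the profile into Crawljax's browser configuration is what the linked class shows):

Code Block
import org.openqa.selenium.firefox.FirefoxProfile;

// Sketch of the Firefox profile settings: manual proxying of http and ssl
// through warcprox, and acceptance of warcprox's untrusted certificates.
FirefoxProfile profile = new FirefoxProfile();
profile.setPreference("network.proxy.type", 1);            // manual proxy configuration
profile.setPreference("network.proxy.http", "localhost");
profile.setPreference("network.proxy.http_port", 4338);
profile.setPreference("network.proxy.ssl", "localhost");
profile.setPreference("network.proxy.ssl_port", 4338);
profile.setAcceptUntrustedCertificates(true);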

Show me some pictures

Ok.

So that's solved all our web-crawling problems

Well, not exactly. Some problems with using Crawljax:

  • Crawljax isn't (yet) integrated with NetarchiveSuite, so the crawl-definitions and metadata wouldn't be integrated with the rest of the archive.
  • The warcs come from a different source (warcprox), so we end up with an inhomogeneous archive.
  • The crawl-definition API for Crawljax is wildly different from Heritrix's, and we would need considerable resources to build up expertise in it.
  • The crawl-definition API also seems a bit too restricted. This can be worked around by writing plugins that call JavaScript - but that's a lot of work if you need to do it for every domain.
  • Crawljax doesn't have crawl-budgeting by size.

What's the alternative?

One possible alternative is Umbra.

Umbra uses a browser but acts as a link extractor, sending the found links back to Heritrix for harvesting. Umbra is clearly less mature and less well documented than Crawljax. It doesn't seem to have an API as such, just some "hooks" where you can add some JavaScript to be executed on each harvested DOM (click-on-this, scroll-by-that etc.). But as we have seen, in many cases even with Crawljax you need to specify some JavaScript to get what you want. And Umbra has the huge advantage that it uses Heritrix for harvesting, so all your warc-generation, metadata-generation and harvest-budgeting is still available. But you do lose the pretty screenshots!