NAS Harvest Control (NAS-HC)

This is a slightly fancy name for an important, but not enormously complex, feature, otherwise known as NAS-2464. The idea is that, just as we can currently specify byte limits, object limits and some other parameters (such as global crawler traps) in the NAS GUI, we should also be able to specify the most important parameters which vary among the regularly scheduled harvests - for example harvest "depth". The actual parameters requested are specified in the NAS issue, but to summarise they are:

  • max-hops (integer)
  • robots policy (boolean)
  • enable/disable extractor-js (boolean)
  • extract-javascript in ExtractorHTML (boolean? - exact meaning still to be clarified)
  • global crawler traps (list of regex strings)
  • local crawler traps (list of regex strings)
  • facebook "accept" regex list (list of regex strings)
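
To give an impression of how these might hang together in the code, here is a minimal sketch of a container object for the new parameters. The class name, field names and default values are invented here for illustration and are not taken from the issue:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container for the NAS-HC parameters; names and defaults are illustrative only. */
public class HarvestControlParameters implements Serializable {
    private int maxHops = 20;                        // assumed default, not specified in NAS-2464
    private boolean obeyRobots = true;               // robots policy
    private boolean extractorJsEnabled = true;       // enable/disable extractor-js
    private boolean extractJavascriptInHtml = true;  // extract-javascript in ExtractorHTML
    private List<String> globalCrawlerTraps = new ArrayList<>();
    private List<String> localCrawlerTraps = new ArrayList<>();
    private List<String> facebookAcceptRegexes = new ArrayList<>();
    // Getters and setters omitted for brevity.
}
```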

In theory, these parameters could be different for each domain, in which case they would be defined at the HarvestConfiguration level. In practice, however, the curators work by defining a group of domains to be harvested with a particular strategy - e.g. "Front Page + 1 link + ignore robots" - and then adding all the relevant domains to that HarvestDefinition. I therefore propose that we add most of the new NAS-HC parameters at the HarvestDefinition level, so that the extra fields are added to the harvestdefinitions table (or possibly the partial_harvests table). The values should be editable via the relevant jsp page for selective harvests. (What do we do about Snapshot Harvests? At a minimum we need to remember to remove the placeholders or replace them with hard-coded default values. Perhaps we also want to be able to specify some or all of the parameters for snapshots as well.)
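
When a job is generated, the stored values then have to be pushed into the crawler-beans template. Below is a rough sketch of that substitution step, assuming a %{NAME}-style placeholder convention and falling back to a default whenever the stored value is null or empty; both the convention and the names are illustrative:

```java
import java.util.Map;

/** Illustrative substitution of NAS-HC values into the crawler-beans template text. */
public final class TemplateFiller {

    /** Replaces each %{NAME} placeholder with its stored value, or a default if the value is missing. */
    public static String fill(String templateText, Map<String, String> storedValues,
                              Map<String, String> defaults) {
        String result = templateText;
        for (Map.Entry<String, String> entry : defaults.entrySet()) {
            String value = storedValues.get(entry.getKey());
            if (value == null || value.isEmpty()) {
                value = entry.getValue();  // fall back to a sensible default
            }
            result = result.replace("%{" + entry.getKey() + "}", value);
        }
        return result;
    }
}
```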

Defensive Robustness: 

We should code the changes in a way which always assumes sensible defaults. Database values should have defaults (e.g. the empty string). Null values submitted from the webpage should be treated as empty strings or as "don't change the current value". The final processed .cxml file should not contain any dangling placeholders.
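
As a last line of defence, job generation could refuse to ship a crawler-beans file that still contains a placeholder. A small sketch, assuming the same %{NAME} convention as above:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative check that the processed .cxml contains no dangling placeholders. */
public final class PlaceholderCheck {

    private static final Pattern PLACEHOLDER = Pattern.compile("%\\{[A-Z0-9_]+\\}");

    /** Throws if any unreplaced placeholder is left in the processed template. */
    public static void assertNoDanglingPlaceholders(String processedCxml) {
        Matcher m = PLACEHOLDER.matcher(processedCxml);
        if (m.find()) {
            throw new IllegalStateException(
                    "Unreplaced placeholder in crawler-beans: " + m.group());
        }
    }
}
```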

Crawler Traps:

Crawler traps are exceptional in that they are defined per-domain, and it is probably sensible to keep it that way.

There are two ways of dealing with the crawler-trap lists:

  1. As external files
  2. Inline in the xml

If we want them as external files, then we need to send the contents of the files as additional fields in the Job we send in the DoOneCrawlMessage. The HarvestController can then write these fields out to the files specified in the crawler-beans file.
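
On the HarvestController side this could be as simple as writing the received lists to the paths that the crawler-beans file points at. A sketch under that assumption; the class and method names are invented here:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Illustrative HarvestController-side handling of crawler-trap lists received with the Job. */
public final class CrawlerTrapFileWriter {

    /** Writes the trap regexes, one per line, to the file referenced by the crawler-beans configuration. */
    public static void writeTraps(List<String> trapRegexes, Path targetFile) throws IOException {
        if (targetFile.getParent() != null) {
            Files.createDirectories(targetFile.getParent());
        }
        Files.write(targetFile, trapRegexes, StandardCharsets.UTF_8);
    }
}
```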

The inline method is superficially more straightforward. It should be enough to check that each line is a valid Java regexp, then XML-escape it before inserting it into the relevant crawler bean. We then rely on Heritrix unescaping the line correctly when it parses the configuration.
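
A sketch of the per-line validation and escaping for the inline method; the escaping is hand-rolled here to keep the example free of extra dependencies:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Illustrative validation and XML-escaping of a crawler-trap line before inlining it. */
public final class TrapLineEscaper {

    /** Checks that the line compiles as a Java regex, then escapes the XML special characters. */
    public static String validateAndEscape(String trapLine) {
        try {
            Pattern.compile(trapLine);
        } catch (PatternSyntaxException e) {
            throw new IllegalArgumentException("Invalid crawler-trap regex: " + trapLine, e);
        }
        return trapLine.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```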

Schema Updates:

There is a checklist for schema updates.
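
For orientation, the database change itself could be as small as adding the new columns with non-null defaults. The column names, types and defaults below are illustrative only, and the real change must of course go through the checklist (including any table-version bump):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/** Illustrative migration adding NAS-HC columns to the harvestdefinitions table. */
public final class HarvestDefinitionsMigration {

    public static void addNasHcColumns(Connection connection) throws SQLException {
        try (Statement s = connection.createStatement()) {
            // Non-null defaults mean existing rows keep their current behaviour.
            s.executeUpdate("ALTER TABLE harvestdefinitions ADD COLUMN max_hops INT DEFAULT 20");
            s.executeUpdate("ALTER TABLE harvestdefinitions ADD COLUMN obey_robots INT DEFAULT 1");
            s.executeUpdate("ALTER TABLE harvestdefinitions ADD COLUMN extractor_js INT DEFAULT 1");
            // ...and similarly for the remaining NAS-HC parameters.
        }
    }
}
```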