A Heritrix3 harvest is defined by a Crawler-Bean file, which is a Spring framework bean-definition file. You can use Heritrix3's own documentation to create Crawler-Bean files and then upload them to NetarchiveSuite via the GUI. Before scheduling a harvest, NetarchiveSuite replaces certain placeholder values in the Crawler-Bean definition with actual values. The placeholders are listed below; some are required in every Crawler-Bean file, others are optional. If an optional placeholder is missing from the Crawler-Bean definition, any attempt to set its value via the GUI is ignored. This version of NetarchiveSuite does not validate Crawler-Bean files, so a missing required placeholder first shows up as a harvest job that fails to start. Some form of validation will be introduced in a later version of NetarchiveSuite.
Required Placeholders
Placeholder | Placing | Usage |
---|---|---|
| ||
Optional Placeholders
Placeholder | Placing | Comments |
---|---|---|
crawlLimiter.maxTimeSeconds=%{MAX_TIME_SECONDS_PLACEHOLDER} | In PropertyOverrideConfigurer | |
frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER} | In PropertyOverrideConfigurer | |
quotaenforcer.groupMaxFetchSuccesses=%{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER} | In PropertyOverrideConfigurer | |
quotaenforcer.groupMaxAllKb=%{QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER} | In PropertyOverrideConfigurer | |
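
As an illustration of where the optional placeholders go, the fragment below sketches a PropertyOverrideConfigurer bean holding all four of them. It is a sketch and not a complete Crawler-Bean file: the bean id `simpleOverrides` is an assumption borrowed from the default Heritrix3 profile, the required placeholders are not shown, and the fragment assumes that beans with the ids `crawlLimiter`, `frontier` and `quotaenforcer` are defined elsewhere in the same file.

```xml
<!-- Sketch only: the bean id "simpleOverrides" is assumed, following the default Heritrix3 profile. -->
<!-- Each line has the form beanId.property=value and overrides that property on the named bean, -->
<!-- so beans with the ids crawlLimiter, frontier and quotaenforcer must exist elsewhere in the file. -->
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
crawlLimiter.maxTimeSeconds=%{MAX_TIME_SECONDS_PLACEHOLDER}
frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
quotaenforcer.groupMaxFetchSuccesses=%{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER}
quotaenforcer.groupMaxAllKb=%{QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER}
    </value>
  </property>
</bean>
```

Leaving one of these lines out is permitted, since the placeholders are optional, but the corresponding value set in the GUI will then have no effect on the harvest.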