Appendix B2: Managing Heritrix 3 Crawler-Beans

Note that this documentation is for the coming release NetarchiveSuite 7.4 and is still work-in-progress.

For documentation on the released versions, please view the previous versions of the NetarchiveSuite documentation and select the relevant version.


A Heritrix3 harvest is defined by a Crawler-Bean (.cxml) file, which is a bean-definition file for the Spring framework. You can use Heritrix3's own documentation to create Crawler-Bean files, which can then be uploaded to NetarchiveSuite via the GUI. Before scheduling a harvest, NetarchiveSuite replaces certain placeholders in the Crawler-Bean definition with actual values. The placeholders are listed below; some are required in every Crawler-Bean file, others are optional.

When an optional placeholder is missing from the Crawler-Bean definition, any attempt to redefine its value via the GUI is ignored. There is no validation of Crawler-Bean files in this version of NetarchiveSuite, so a missing required placeholder will first manifest itself as a harvest job that fails to start. Some form of validation will be introduced in a later version of NetarchiveSuite.

Required Placeholders
| Placeholder | Placement | Comments |
| --- | --- | --- |
| frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER} | In PropertyOverrideConfigurer | See discussion below |
| quotaenforcer.groupMaxFetchSuccesses=%{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER} | In PropertyOverrideConfigurer | See discussion below |
| quotaenforcer.groupMaxAllKb=%{QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER} | In PropertyOverrideConfigurer | See discussion below |
| %{CRAWLERTRAPS_PLACEHOLDER} | In the regexList in MatchesListRegexDecideRule | Substituted with global crawler traps defined in NAS |
| %{ARCHIVER_PROCESSOR_BEAN_PLACEHOLDER} | At the first XML nesting level, inside the <beans> element | |
| %{ARCHIVER_BEAN_REFERENCE_PLACEHOLDER} | Inside the DispositionChain bean | |
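As an illustration, the placements listed above might look as follows in a Crawler-Bean file. This is a minimal sketch, not a complete template: the bean ids (simpleOverrides, scope, dispositionProcessors) follow the stock Heritrix3 profile and are assumptions here, the frontier and quotaenforcer beans that the overrides refer to are omitted, and the exact position of the archiver reference within the DispositionChain list is also an assumption.

```xml
<beans xmlns="http://www.springframework.org/schema/beans">

  <!-- Property overrides: the three quota/budget placeholders go here.
       The overrides assume beans with ids "frontier" and "quotaenforcer"
       are defined elsewhere in the file (omitted in this sketch). -->
  <bean id="simpleOverrides"
        class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
    <property name="properties">
      <value>
frontier.queueTotalBudget=%{FRONTIER_QUEUE_TOTAL_BUDGET_PLACEHOLDER}
quotaenforcer.groupMaxFetchSuccesses=%{QUOTA_ENFORCER_GROUP_MAX_FETCH_SUCCES_PLACEHOLDER}
quotaenforcer.groupMaxAllKb=%{QUOTA_ENFORCER_MAX_BYTES_PLACEHOLDER}
      </value>
    </property>
  </bean>

  <!-- Scope: only the crawler-trap rule is shown, other rules are omitted. -->
  <bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
    <property name="rules">
      <list>
        <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
          <property name="decision" value="REJECT"/>
          <property name="regexList">
            <list>
              %{CRAWLERTRAPS_PLACEHOLDER}
            </list>
          </property>
        </bean>
      </list>
    </property>
  </bean>

  <!-- Archiver bean definition: first nesting level, directly inside <beans>. -->
  %{ARCHIVER_PROCESSOR_BEAN_PLACEHOLDER}

  <!-- Reference to the archiver bean inside the DispositionChain bean. -->
  <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
    <property name="processors">
      <list>
        %{ARCHIVER_BEAN_REFERENCE_PLACEHOLDER}
        <ref bean="candidates"/>
        <ref bean="disposition"/>
      </list>
    </property>
  </bean>

</beans>
```

The file only becomes a valid Spring bean definition once NetarchiveSuite has substituted the placeholders, so a template like this cannot be loaded by Heritrix3 directly.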

 

Optional Placeholders
| Placeholder | Placement | Comments |
| --- | --- | --- |
| crawlLimiter.maxTimeSeconds=%{MAX_TIME_SECONDS_PLACEHOLDER} | In PropertyOverrideConfigurer | If absent, e.g. if maxTimeSeconds is hardcoded in the crawler-beans file, then NAS will never override this value. |
| <property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}"/> | Inside the bean with class is.hi.bok.deduplicator.DeDuplicator | If absent, there will be no deduplication. |
| metadata.robotsPolicyName=%{HONOR_ROBOTS_DOT_TXT} or <property name="robotsPolicyName" value="%{HONOR_ROBOTS_DOT_TXT}"/> | In PropertyOverrideConfigurer or in the metadata bean, respectively | If absent, the robotsPolicy will be "ignore" (the default in H3) or hardwired to either obey or ignore. |
| extractorHtml.extractJavascript=%{EXTRACT_JAVASCRIPT} | In PropertyOverrideConfigurer | If absent, the H3 template will use default value(?) or be hardwired to either true or false. |
| scope.rules[2].maxHops=%{MAX_HOPS} (assuming TooManyHopsDecideRule is the 3rd bean defined in the "scope" bean) or <property name="maxHops" value="%{MAX_HOPS}" /> | In PropertyOverrideConfigurer or in the bean for class org.archive.modules.deciderules.TooManyHopsDecideRule, respectively | If absent, the H3 template will use default value (20) or be hardwired to something else. |
| <property name="enabled" value="%{DEDUPLICATION_ENABLED_PLACEHOLDER}" /> | In the bean of class is.hi.bok.deduplicator.DeDuplicator | It is replaced when jobs are generated by the value of the setting harvester.harvesting.deduplication.enabled for the HarvestJobManager application. Note that this property is only valid for the version of DeDuplicator included with NetarchiveSuite. |
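The optional placeholders could be combined in a template roughly as sketched below. This fragment uses the PropertyOverrideConfigurer form wherever the table offers a choice. The bean id "DeDuplicator" is hypothetical, and the crawlLimiter, metadata, extractorHtml and scope beans that the overrides address are assumed to exist elsewhere in the file under those ids.

```xml
<beans xmlns="http://www.springframework.org/schema/beans">

  <bean id="simpleOverrides"
        class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
    <property name="properties">
      <value>
# Each line may be left out; NAS then never overrides the corresponding value.
crawlLimiter.maxTimeSeconds=%{MAX_TIME_SECONDS_PLACEHOLDER}
metadata.robotsPolicyName=%{HONOR_ROBOTS_DOT_TXT}
extractorHtml.extractJavascript=%{EXTRACT_JAVASCRIPT}
# Index 2 is only correct if TooManyHopsDecideRule is the 3rd rule in "scope".
scope.rules[2].maxHops=%{MAX_HOPS}
      </value>
    </property>
  </bean>

  <!-- Deduplication: both placeholders sit in the DeDuplicator bean itself.
       The id "DeDuplicator" is a hypothetical name chosen for this sketch. -->
  <bean id="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator">
    <property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}"/>
    <property name="enabled" value="%{DEDUPLICATION_ENABLED_PLACEHOLDER}"/>
  </bean>

</beans>
```

Using the override form keeps all NAS-controlled values in one place. The inline <property> form avoids the dependency on the rule's position in the scope list, at the cost of spreading placeholders across several beans.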

Quota Enforcement

All three quota/budget-related placeholders are required, but their interpretation depends on the NAS setting harvester.scheduler.jobGen.objectLimitIsSetByQuotaEnforcer.

Behaviour is as follows:

| objectLimitIsSetByQuotaEnforcer | queueTotalBudget | groupMaxFetchSuccesses |
| --- | --- | --- |
| true | Set to infinity | Set to the maxObjectsPerDomain value from NAS |
| false | Set to the maxObjectsPerDomain value from NAS | Set to infinity |

In all cases, groupMaxAllKb is set to the value determined from the maxBytesPerDomain setting from the NAS GUI (default value is -1 which is equivalent to no limit).
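To make the behaviour above concrete, the fragment below sketches what the overrides might look like after substitution when objectLimitIsSetByQuotaEnforcer is true. The figure 2000 for maxObjectsPerDomain is purely illustrative, maxBytesPerDomain is left at its default of -1, and writing "infinity" as -1 for queueTotalBudget is an assumption of this sketch rather than something this page states.

```xml
<bean id="simpleOverrides"
      class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
  <property name="properties">
    <value>
# objectLimitIsSetByQuotaEnforcer = true, maxObjectsPerDomain = 2000 (illustrative)
# "infinity" is written as -1 here, which is an assumption.
frontier.queueTotalBudget=-1
quotaenforcer.groupMaxFetchSuccesses=2000
# maxBytesPerDomain at its default of -1, i.e. no limit.
quotaenforcer.groupMaxAllKb=-1
# With objectLimitIsSetByQuotaEnforcer = false, the first two values swap:
# queueTotalBudget would carry 2000 and groupMaxFetchSuccesses would be unlimited.
    </value>
  </property>
</bean>
```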

Umbra Integration

To enable browser-based harvesting with Internet Archive's Umbra system, the following placeholders need to be added. If a template containing these placeholders is sent to a harvester that is not Umbra-enabled, they are silently removed. In other words, the same template file can be used for both Umbra and non-Umbra harvesting.

| Placeholder | Placement | Comments |
| --- | --- | --- |
| %{UMBRA_SIMPLEOVERRIDES_PLACEHOLDER} | Inside the <value> element in the <properties> element in the "simpleOverrides" bean | |
| %{UMBRA_PUBLISH_BEAN_PLACEHOLDER} | At the top level in the crawler-beans | |
| %{UMBRA_RECEIVE_BEAN_PLACEHOLDER} | At the top level in the crawler-beans | |
| %{UMBRA_BEAN_REF_PLACEHOLDER} | At the end of the list of processors in the "fetchProcessors" bean | |
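A template carrying the four Umbra placeholders might be laid out as in the following sketch. The processor references shown in fetchProcessors are a shortened, assumed subset of the stock Heritrix3 fetch chain, and in stock profiles the element holding the overrides is written as <property name="properties">.

```xml
<beans xmlns="http://www.springframework.org/schema/beans">

  <bean id="simpleOverrides"
        class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
    <property name="properties">
      <value>
# ... other overrides ...
%{UMBRA_SIMPLEOVERRIDES_PLACEHOLDER}
      </value>
    </property>
  </bean>

  <!-- Umbra publish and receive bean definitions: top level, inside <beans>. -->
  %{UMBRA_PUBLISH_BEAN_PLACEHOLDER}
  %{UMBRA_RECEIVE_BEAN_PLACEHOLDER}

  <bean id="fetchProcessors" class="org.archive.modules.FetchChain">
    <property name="processors">
      <list>
        <ref bean="preselector"/>
        <ref bean="fetchHttp"/>
        <ref bean="extractorHtml"/>
        <!-- Umbra reference goes last in the processor list. -->
        %{UMBRA_BEAN_REF_PLACEHOLDER}
      </list>
    </property>
  </bean>

</beans>
```

On a harvester without Umbra support the four placeholders are removed, leaving an ordinary fetch chain and override list.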