...
Check that the default_orderxml template is a Heritrix3 template: go to the Edit Harvest Templates tab and retrieve default_orderxml. If the file header contains "HERITRIX 3 CRAWL JOB CONFIGURATION FILE", it is OK.
Update the default cxml file to support deduplication by adding the DeDuplicator bean below, and check that the DispositionChain includes a deduplicator:
Code Block |
---|
<bean id="DeDuplicator" class="is.hi.bok.deduplicator.DeDuplicator"> <!-- DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER is replaced by path on harvest-server --> <property name="indexLocation" value="%{DEDUPLICATION_INDEX_LOCATION_PLACEHOLDER}" /> <property name="matchingMethod" value="URL" /> <property name="tryEquivalent" value="TRUE" /> <property name="changeContentSize" value="false" /> <property name="mimeFilter" value="^text/.*" /> <property name="filterMode" value="BLACKLIST" /> <property name="origin" value="" /> <property name="originHandling" value="INDEX" /> <property name="statsPerHost" value="true" /> </bean> |
...
/> |
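The checks above can also be done with a script once the template has been downloaded from the Edit Harvest Templates tab. The following is a minimal sketch in Python, not part of NetarchiveSuite: the function name `check_template` is hypothetical, and it assumes that the DispositionChain is declared as a bean whose class name ends in `.DispositionChain` and that it refers to the deduplicator with `<ref bean="DeDuplicator"/>`.

```python
import xml.etree.ElementTree as ET

# Header text that identifies a Heritrix3 template (quoted from the step above).
H3_MARKER = "HERITRIX 3 CRAWL JOB CONFIGURATION FILE"


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}bean' down to 'bean'."""
    return tag.rsplit("}", 1)[-1]


def check_template(cxml_text):
    """Return (is_heritrix3, has_dedup_bean, dedup_in_chain) for a template string.

    Assumption: the deduplicator bean has id "DeDuplicator" and the chain bean's
    class attribute ends in ".DispositionChain".
    """
    is_heritrix3 = H3_MARKER in cxml_text
    has_dedup_bean = False
    dedup_in_chain = False

    root = ET.fromstring(cxml_text)
    for element in root.iter():
        if local_name(element.tag) != "bean":
            continue
        if element.get("id") == "DeDuplicator":
            has_dedup_bean = True
        if element.get("class", "").endswith(".DispositionChain"):
            # Collect every <ref bean="..."/> nested inside the chain definition.
            refs = [
                child.get("bean")
                for child in element.iter()
                if local_name(child.tag) == "ref"
            ]
            if "DeDuplicator" in refs:
                dedup_in_chain = True

    return is_heritrix3, has_dedup_bean, dedup_in_chain
```

Reading the downloaded file into a string and passing it to `check_template` returns three booleans; all three should be true before running the harvest. The header check works on the raw text because the marker normally sits inside an XML comment, which the parser discards.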
2. Running selective harvest
...