Harvest Templates

Template for daily harvest

This template helps with harvesting newspapers: it automatically stops the harvest after one day (more precisely, after 23 hours, so that the deduplication has time to finish). It relies on the RuntimeLimitEnforcer, a processor that halts further progress once a fixed amount of time has elapsed since the start of a crawl.

Here is the relevant extract of the order.xml:

<newObject name="RuntimeLimitEnforcer" class="org.archive.crawler.prefetch.RuntimeLimitEnforcer">
  <boolean name="enabled">true</boolean>
  <newObject name="RuntimeLimitEnforcer#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
    </map>
  </newObject>
  <long name="runtime-sec">82800</long>
  <string name="end-operation">Terminate job</string>
</newObject>
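
The runtime-sec value is simply 23 hours expressed in seconds. As a minimal sketch (the class and method names here are hypothetical and not part of Heritrix), the value for a different time limit could be derived like this:

```java
import java.time.Duration;

/** Hypothetical helper, not part of Heritrix: converts a limit in hours to a runtime-sec value. */
class RuntimeLimitSketch {

    /** Returns the number of seconds in the given number of hours. */
    static long runtimeSeconds(long hours) {
        return Duration.ofHours(hours).getSeconds();
    }

    public static void main(String[] args) {
        // 23 hours, as used in the template above
        System.out.println("runtime-sec = " + runtimeSeconds(23));
    }
}
```

For 23 hours this gives 82800, the value used in the extract above.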

BnF's template for crawling Facebook user and group pages

The main idea in our profile is to crawl only URIs from facebook.com which are directly related to a specific Facebook user or group. We identify those URIs by the numeric user or group ID or by the user or group name contained in the URI.

We use a Heritrix (1.14.4) profile based on a SURT-prefixed scope. In the decide rule sequence, we first REJECT anything from facebook.com, then ACCEPT only those facebook.com URIs that contain the ID or name of a user or group we want to crawl. This ensures that the robot stays on user- or group-related pages and does not break out to crawl the entire Facebook site.

<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
  [...]
  <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
  <map name="rules">
    <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
    <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
      <string name="decision">ACCEPT</string>
      <string name="surts-source-file"/>
      <boolean name="seeds-as-surt-prefixes">true</boolean>
      <string name="surts-dump-file">surts-dump.txt</string>
      <boolean name="also-check-via">false</boolean>
      <boolean name="rebuild-on-reconfig">true</boolean>
    </newObject>
    <newObject name="rejectPath" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
      <string name="decision">REJECT</string>
      <string name="list-logic">OR</string>
      <stringList name="regexp-list">
        <string>^http://.*\.facebook\.com/.*$</string>
      </stringList>
    </newObject>
    <newObject name="acceptPathWithParameter" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
      <string name="decision">ACCEPT</string>
      <string name="list-logic">OR</string>
      <stringList name="regexp-list">
        <string>^http://.*\.facebook\.com/.*123456789.*$</string> <!-- numeric Facebook user or group ID -->
        [...]
        <string>^http://.*\.facebook\.com/.*name.*$</string> <!-- user or group name on Facebook -->
        [...]
      </stringList>
    </newObject>
    [...]
  </map>
  </newObject>
</newObject>

The remaining decide rules and parameters (max-hops etc.) are not specific to Facebook.
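
The combined effect of the four rules shown in the extract can be illustrated with a small standalone sketch. It is not Heritrix code. It assumes that a decide rule sequence applies every rule in order and keeps the last decision that is not PASS, and it replaces the SURT prefix check with a plain string prefix on a hypothetical seed (SEED_PREFIX). The class name, the seed, and the test URIs are illustrative only; the regular expressions are the placeholders from the profile above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

/** Standalone sketch of "last non-PASS decision wins"; not Heritrix code. */
class FacebookScopeSketch {

    enum Decision { ACCEPT, REJECT, PASS }

    /** A rule yields its decision when the URI matches, and PASS otherwise. */
    static final class Rule {
        final Decision decision;
        final Predicate<String> matches;

        Rule(Decision decision, Predicate<String> matches) {
            this.decision = decision;
            this.matches = matches;
        }

        Decision apply(String uri) {
            return matches.test(uri) ? decision : Decision.PASS;
        }
    }

    // Hypothetical seed; stands in for the SURT prefixes derived from the seed list.
    static final String SEED_PREFIX = "http://www.facebook.com/name";

    // Placeholder patterns copied from the profile.
    static final Pattern ANY_FACEBOOK = Pattern.compile("^http://.*\\.facebook\\.com/.*$");
    static final Pattern BY_ID = Pattern.compile("^http://.*\\.facebook\\.com/.*123456789.*$");
    static final Pattern BY_NAME = Pattern.compile("^http://.*\\.facebook\\.com/.*name.*$");

    static final List<Rule> RULES = Arrays.asList(
        // rejectByDefault: always REJECT
        new Rule(Decision.REJECT, uri -> true),
        // acceptIfSurtPrefixed: simplified to a string prefix test (assumption)
        new Rule(Decision.ACCEPT, uri -> uri.startsWith(SEED_PREFIX)),
        // rejectPath: REJECT everything on facebook.com
        new Rule(Decision.REJECT, uri -> ANY_FACEBOOK.matcher(uri).matches()),
        // acceptPathWithParameter: list-logic OR over the ID and name patterns
        new Rule(Decision.ACCEPT, uri -> BY_ID.matcher(uri).matches()
                                      || BY_NAME.matcher(uri).matches())
    );

    /** Applies all rules in order; the last decision that is not PASS wins. */
    static Decision decide(String uri) {
        Decision result = Decision.PASS;
        for (Rule rule : RULES) {
            Decision d = rule.apply(uri);
            if (d != Decision.PASS) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] uris = {
            "http://www.facebook.com/profile.php?id=123456789",
            "http://www.facebook.com/name/photos",
            "http://www.facebook.com/someoneelse",
            "http://example.org/page"
        };
        for (String uri : uris) {
            System.out.println(decide(uri) + "  " + uri);
        }
    }
}
```

Under these assumptions, a facebook.com URI that contains neither the ID nor the name is rejected by rejectPath even if an earlier rule accepted it, and only the final ACCEPT rule can bring a facebook.com URI back into scope.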