
Harvest Templates

Template for daily harvest

A tip for harvesting newspapers: the harvest can be stopped automatically after one day (more precisely after 23 hours, so that deduplication has time to finish). The extract from order.xml below does this.

It uses the RuntimeLimitEnforcer, a processor that halts further progress once a fixed amount of time has elapsed since the start of a crawl.

Code Block

<newObject name="RuntimeLimitEnforcer" class="org.archive.crawler.prefetch.RuntimeLimitEnforcer">
  <boolean name="enabled">true</boolean>
  <newObject name="RuntimeLimitEnforcer#decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
    <map name="rules">
    </map>
  </newObject>
  <long name="runtime-sec">82800</long>
  <string name="end-operation">Terminate job</string>
</newObject>
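
The runtime-sec value is expressed in seconds, so the 23-hour limit corresponds to 82800. As a minimal sketch for deriving the value when a different duration is wanted (the helper name runtime_sec is hypothetical and not part of Heritrix):

```python
SECONDS_PER_HOUR = 3600


def runtime_sec(hours):
    """Convert a crawl duration in whole hours to a value for runtime-sec.

    Hypothetical helper for illustration only; Heritrix itself just reads
    the number of seconds from order.xml.
    """
    return hours * SECONDS_PER_HOUR


if __name__ == "__main__":
    # 23 hours, as used in the daily harvest template
    print(runtime_sec(23))
```

For the daily template, runtime_sec(23) gives the 82800 used in the extract.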

BnF's template for crawling Facebook user and group pages

The main idea in our profile is to crawl only URIs from facebook.com which are directly related to a specific Facebook user or group. We identify those URIs by the numeric user or group ID or by the user or group name contained in the URI.

We use a Heritrix (1.14.4) profile based on a SURT-prefixed scope. In the decide rule sequence, we first REJECT anything from facebook.com. Then we ACCEPT only those facebook.com URIs that contain a user ID or a group name we want to crawl. This ensures that the robot stays on user- or group-related pages and does not break out to crawl the entire Facebook site.

Code Block

<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
  [...]
  <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
  <map name="rules">
    <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
    <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
      <string name="decision">ACCEPT</string>
      <string name="surts-source-file"/>
      <boolean name="seeds-as-surt-prefixes">true</boolean>
      <string name="surts-dump-file">surts-dump.txt</string>
      <boolean name="also-check-via">false</boolean>
      <boolean name="rebuild-on-reconfig">true</boolean>
    </newObject>
    <newObject name="rejectPath" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
      <string name="decision">REJECT</string>
      <string name="list-logic">OR</string>
      <stringList name="regexp-list">
        <string>^http://.*\.facebook\.com/.*$</string>
      </stringList>
    </newObject>
    <newObject name="acceptPathWithParameter" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
      <string name="decision">ACCEPT</string>
      <string name="list-logic">OR</string>
      <stringList name="regexp-list">
        <string>^http://.*\.facebook\.com/.*123456789.*$</string> <!-- numeric Facebook user or group ID -->
        [...]
        <string>^http://.*\.facebook\.com/.*name.*$</string> <!-- user or group name on Facebook -->
        [...]
      </stringList>
    </newObject>

...
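
To illustrate how the four rules shown above interact, here is a rough Python sketch of the evaluation order. It assumes that the last rule that matches determines the outcome, and it replaces the SURT prefix check with a plain boolean argument (in_surt_scope, a simplification). The function names are hypothetical, the rules elided from the extract are not modelled, and 123456789 and name are the same placeholders as in the profile.

```python
import re

# Patterns copied from the profile; 123456789 and "name" are placeholders.
REJECT_FACEBOOK = [r"^http://.*\.facebook\.com/.*$"]
ACCEPT_TARGETS = [
    r"^http://.*\.facebook\.com/.*123456789.*$",
    r"^http://.*\.facebook\.com/.*name.*$",
]


def matches_any(patterns, uri):
    """OR list logic: true if at least one pattern matches the whole URI."""
    return any(re.fullmatch(p, uri) for p in patterns)


def decide(uri, in_surt_scope):
    """Apply the four rules in order; a later matching rule overrides."""
    decision = "REJECT"                    # rejectByDefault
    if in_surt_scope:                      # acceptIfSurtPrefixed (simplified)
        decision = "ACCEPT"
    if matches_any(REJECT_FACEBOOK, uri):  # rejectPath
        decision = "REJECT"
    if matches_any(ACCEPT_TARGETS, uri):   # acceptPathWithParameter
        decision = "ACCEPT"
    return decision


if __name__ == "__main__":
    for uri in (
        "http://www.facebook.com/profile.php?id=123456789",
        "http://www.facebook.com/someoneelse",
        "http://www.facebook.com/name/photos",
    ):
        print(uri, decide(uri, in_surt_scope=True))
```

Note that the accept patterns match the ID or name anywhere after the host, so a short or common name could also match unrelated URIs that happen to contain it.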
