BnF template

BnF template for crawling Facebook user and group pages

The main idea of our profile is to crawl only those URIs from facebook.com that are directly related to a specific Facebook user or group. We identify those URIs by the numeric user or group ID, or by the user or group name, contained in the URI.

We use a Heritrix 1.14.4 profile based on a SURT-prefixed scope. In the decide rule sequence, after the usual SURT-prefix ACCEPT rule, we first REJECT anything from facebook.com. We then ACCEPT only those facebook.com URIs that contain a user or group ID, or a user or group name, that we want to crawl. Because a later rule in the sequence overrides the decision of an earlier one, this keeps the robot on user- or group-related pages and prevents it from breaking out to crawl the entire Facebook site.

<newObject name="scope" class="org.archive.crawler.deciderules.DecidingScope">
  [...]
  <newObject name="decide-rules" class="org.archive.crawler.deciderules.DecideRuleSequence">
  <map name="rules">
    <newObject name="rejectByDefault" class="org.archive.crawler.deciderules.RejectDecideRule"/>
    <newObject name="acceptIfSurtPrefixed" class="org.archive.crawler.deciderules.SurtPrefixedDecideRule">
      <string name="decision">ACCEPT</string>
      <string name="surts-source-file"/>
      <boolean name="seeds-as-surt-prefixes">true</boolean>
      <string name="surts-dump-file">surts-dump.txt</string>
      <boolean name="also-check-via">false</boolean>
      <boolean name="rebuild-on-reconfig">true</boolean>
    </newObject>
    <newObject name="rejectPath" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
      <string name="decision">REJECT</string>
      <string name="list-logic">OR</string>
      <stringList name="regexp-list">
        <string>^http://.*\.facebook\.com/.*$</string>
      </stringList>
    </newObject>
    <newObject name="acceptPathWithParameter" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
      <string name="decision">ACCEPT</string>
      <string name="list-logic">OR</string>
      <stringList name="regexp-list">
        <string>^http://.*\.facebook\.com/.*123456789.*$</string> <!-- numeric Facebook user or group ID -->
        [...]
        <string>^http://.*\.facebook\.com/.*name.*$</string> <!-- user or group name on Facebook -->
        [...]
      </stringList>
    </newObject>
    [...]
  </map>
  </newObject>
</newObject>
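
When many users or groups are crawled, the entries of the acceptPathWithParameter regexp-list can be generated rather than typed by hand. The following is a minimal sketch in Python, not part of the BnF profile: the function name regexp_entries is hypothetical, and the identifiers 123456789 and name are the placeholders from the excerpt above. It assumes every entry follows the same pattern as in the excerpt and escapes regular-expression metacharacters (such as the dot in a user name) so that they are matched literally.

```python
import re
from xml.sax.saxutils import escape


def regexp_entries(identifiers):
    """Yield one <string> element per Facebook user/group ID or name.

    Hypothetical helper: it follows the pattern used in the profile
    excerpt and is not taken from Heritrix or from the BnF template.
    """
    for ident in identifiers:
        # re.escape makes characters such as "." match literally.
        pattern = r"^http://.*\.facebook\.com/.*" + re.escape(ident) + r".*$"
        # escape() protects &, < and > so the line stays valid XML.
        yield "<string>" + escape(pattern) + "</string>"


if __name__ == "__main__":
    # Placeholder identifiers, as in the excerpt above.
    for line in regexp_entries(["123456789", "name"]):
        print("        " + line)
```

The printed lines can be pasted into the stringList of the ACCEPT rule.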

All the other decide rules and parameters (max-hops etc.) are not specific to Facebook.
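
The interplay of the rejectPath and acceptPathWithParameter rules can be checked outside the crawler. The sketch below is a simplified Python model, not Heritrix code: it assumes that each regular expression must match the whole URI, that the list logic is OR, and that the last rule to match determines the decision. The function name decide and the sample URIs are illustrative; 123456789 and name are the placeholders from the profile.

```python
import re

# Regular expressions copied from the profile excerpt; 123456789 and
# "name" are the placeholders used there, not real accounts.
RULES = [
    ("REJECT", [r"^http://.*\.facebook\.com/.*$"]),
    ("ACCEPT", [r"^http://.*\.facebook\.com/.*123456789.*$",
                r"^http://.*\.facebook\.com/.*name.*$"]),
]


def decide(uri, initial="PASS"):
    """Return the decision of the last rule whose list matches the URI.

    `initial` stands for whatever the earlier rules (reject by default,
    SURT-prefix accept) decided; this sketch does not model them.
    """
    decision = initial
    for verdict, patterns in RULES:
        # OR logic: one matching expression is enough to apply the rule.
        if any(re.fullmatch(p, uri) for p in patterns):
            decision = verdict
    return decision


if __name__ == "__main__":
    for uri in [
        "http://www.facebook.com/profile.php?id=123456789",
        "http://www.facebook.com/name",
        "http://www.facebook.com/help/",
        "http://example.org/page",
    ]:
        print(decide(uri), uri)
```

Under these assumptions, URIs that use https, or the bare host facebook.com without a subdomain, match neither rule and keep the decision made by the earlier rules in the sequence.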