Appendix A: Harvesting Twitter with TwitterDecidingScope

NetarchiveSuite includes TwitterDecidingScope, a special implementation of a Heritrix DecidingScope which can be used with a modified order-template to manage a crawl of Twitter and of material linked from Twitter. A sample order-template (twitter-fronpages.xml) is included with the distribution. To use TwitterDecidingScope, first configure the order-template for the desired crawl characteristics, then start a selective harvest of Twitter using that template and the desired harvest limits.

The TwitterDecidingScope functions as follows. The scope searches Twitter for tweets matching any of the specified keywords (which may also be Twitter hashtags or usernames). It is possible (and advisable) to restrict the number of tweets returned by specifying the desired language and geo_location(s). The scope then queues every tweet found for download as HTML.

The relevant section of the order-template looks something like:

  <newObject name="scope" class="dk.netarkivet.harvester.tools.TwitterDecidingScope">
            <boolean name="enabled">true</boolean>
            <string name="seedsfile">seeds.txt</string>
            <boolean name="reread-seeds-on-config">true</boolean>
            <boolean name="queue_links">true</boolean>
            <boolean name="queue_user_status">true</boolean>
            <boolean name="queue_user_status_links">true</boolean>
            <boolean name="queue_keyword_links">true</boolean>
            <integer name="twitter_results_per_page">20</integer>
            <!--The number of pages of API search results to process -->
            <integer name="pages">2</integer>
            <!--
            The geo_locations to which search results are restricted, in the form latitude,longitude,radius,unit
            Experiments for Denmark indicate that this setting is not very useful in practice.
            -->
            <!--
            <stringList name="geo_locations">
            <string>55.976667,10.149722,200.0,km</string>
            </stringList>
            -->
            <!--Search keywords. These may be ordinary text, hashtags, or twitter usernames -->
            <stringList name="keywords">
                <string>Brønshøj</string>
                <string>rygning</string>
                <string>@politikenfeed</string>
            </stringList>
            <!--The language to which results should be restricted. Leave empty for all languages. -->
            <string name="language">da</string>
...
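
Both restrictions are optional. As a sketch only, the fragment below shows the same settings with the geo_locations list enabled and the language left empty, which according to the comment in the sample means all languages. Writing the empty value as an empty string element is an assumption; the coordinates are the ones from the commented-out sample.

```xml
<!-- Restrict results to a 200 km radius around the given point
     (latitude,longitude,radius,unit) -->
<stringList name="geo_locations">
    <string>55.976667,10.149722,200.0,km</string>
</stringList>
<!-- Assumption: an empty element is how "leave empty" is expressed -->
<string name="language"></string>
```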

In addition to queueing the individual tweets discovered by searching Twitter, TwitterDecidingScope can download further material. The boolean flags control this as follows:

queue_links: if true, queue any links/media found in the discovered tweets
queue_user_status: if true, queue an HTML listing of tweets from every user responsible for a discovered tweet
queue_user_status_links: if true, attempt to find and queue links in other tweets from the discovered users
queue_keyword_links: if true, queue an HTML listing of a search on the specified keywords
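
As an illustration, a harvest limited to the tweets themselves would switch all four flags off. This is a sketch that uses only the setting names from the sample above; it has not been checked against a running harvest.

```xml
<!-- Queue only the individual tweets found by the search -->
<boolean name="queue_links">false</boolean>
<boolean name="queue_user_status">false</boolean>
<boolean name="queue_user_status_links">false</boolean>
<boolean name="queue_keyword_links">false</boolean>
```

Turning the flags back on one at a time is one way to see how much extra material each of them contributes before committing to a full harvest.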

Our experience from harvesting Danish content suggests the following:

Language filtering works very poorly. It is a better strategy to use language-specific keywords.
geo_location filtering also works poorly.
A large proportion of the linked material is from major news outlets. If you already have a harvest strategy which collects these regularly then it might be wise to block them from Twitter harvests (treating them as crawlertraps) to avoid unnecessary duplication.
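
Crawler traps are written as regular expressions matched against URLs. How they are best added depends on the installation, and they may be maintained outside the order-template altogether. The fragment below is a hypothetical sketch which assumes that the template's decide rules accept a standard Heritrix 1 MatchesListRegExpDecideRule. The rule name and the politiken.dk pattern are illustrative and are not taken from the distributed template.

```xml
<!-- Hypothetical rule: reject URLs on news sites already covered by
     other harvests. The rule name and pattern are examples only. -->
<newObject name="news_outlets_harvested_elsewhere" class="org.archive.crawler.deciderules.MatchesListRegExpDecideRule">
    <string name="decision">REJECT</string>
    <!-- OR: reject when any one of the listed expressions matches -->
    <string name="list-logic">OR</string>
    <stringList name="regexp-list">
        <string>.*politiken\.dk.*</string>
    </stringList>
</newObject>
```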