
Installing Heritrix 3

Clone from git
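
For example, assuming the repository hosted on GitHub (adjust the URL if you work from a different mirror or fork; the Maven command below is run from the top of the checkout):

git clone https://github.com/internetarchive/heritrix3.git
cd heritrix3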

Build using Java 6 as JAVA_HOME:

 

mvn -DskipTests package
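
If Java 6 is not the default JDK on your system, point JAVA_HOME at a Java 6 install before running the Maven command above; the path below is only an example and will differ between systems:

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64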

 

Then unpack the Heritrix distribution from dist/target/heritrix-3.3.0-SNAPSHOT-dist.tar.gz.
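
A minimal sketch of the unpack step, assuming you keep the distribution under ~/heritrix-3/ to match the lib path used further down this page (depending on the version, the tarball may unpack into a versioned directory that you then rename or symlink to heritrix):

mkdir -p ~/heritrix-3
tar xzf dist/target/heritrix-3.3.0-SNAPSHOT-dist.tar.gz -C ~/heritrix-3/

Heritrix 3 is started with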

 

heritrix/bin/heritrix -a admin:admin

 

Building Contrib

Change directory to heritrix3/contrib and run

 

mvn -DskipTests package

 

Copy target/heritrix-contrib-3.3.0-SNAPSHOT.jar into the lib directory of the Heritrix distribution. Also copy the AMQP client library from your local Maven repository into the Heritrix lib directory with something like

 

cp ~/.m2/repository/com/rabbitmq/amqp-client/3.2.1/amqp-client-3.2.1.jar ~/heritrix-3/heritrix/lib/
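
For the contrib jar itself, assuming the commands are run from heritrix3/contrib and the distribution was unpacked under ~/heritrix-3/ as above, the copy might look like:

cp target/heritrix-contrib-3.3.0-SNAPSHOT.jar ~/heritrix-3/heritrix/lib/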

Using Umbra

To enable Umbra in a crawl job you need to do two things:

  1. Create the publisher bean and add it to the fetchProcessors bean: 

     <bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
      <property name="clientId" value="requests"/>
     </bean> 
     ...
     <ref bean="extractorSwf"/>
     <ref bean="umbraBean"/>
  2. Add the listener (receiver) bean at the top level of the crawler beans file: 

     <bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>
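
Both changes go into the job's crawler-beans.cxml. As a quick sanity check after editing, a grep along these lines should show both beans (the install path and job name below are only placeholders):

grep -nE "AMQPPublishProcessor|AMQPUrlReceiver" ~/heritrix-3/heritrix/jobs/myjob/crawler-beans.cxml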

There is very little in Umbra which is actually configurable; as far as I can see, only the names of the queues. This might be useful if you are running multiple Heritrix instances sharing the same broker.

Things to Think About

  • Don't forget to set the "clientId" property as shown above. It is probably a bug that this does not default to a value consistent with the receiver's default configuration.
  • Heritrix sends every queued HTTP and HTTPS URL to Umbra, except for robots.txt requests and URLs that were themselves received from Umbra.
  • URLs are received from Umbra asynchronously and put directly into the frontier. That means they have no discovery path and just get an "I". It also appears to mean that they are not subject to Heritrix's normal scoping rules (this needs confirming).
  • URLs received from Umbra are marked in the crawl log with the string "receivedFromAMQP" so you can identify them.
  • Because the communication is asynchronous, there can still be URLs left on the queue after the job has finished. Remember to drain the queue 

    drain-queue

    before running the next harvest (a quick way to check whether anything is left on the broker is sketched after this list).
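
A minimal check of the queue depths, assuming RabbitMQ is the broker and rabbitmqctl is available on the host where it runs (the relevant queue names depend on your configuration), is:

rabbitmqctl list_queues name messages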
     

 

 
