This guide was written for, and tested on, Ubuntu.

The steps are given below. NOTE: this guide is a work in progress.

In a shell, change to the directory where you want Heritrix+Umbra to be installed, then run the following.

mkdir heritrix-umbra
cd heritrix-umbra
wget https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix/3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
tar -xzf heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
rm heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/bin
./heritrix -a admin:admin
cd ../..
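
The download and unpack steps above repeat one long file name three times. As a rough sketch only, the same steps can be wrapped in a function that stores the snapshot URL once and derives the file name from it. The names HERITRIX_URL, tarball_name and install_heritrix are made up for this example. The function stops after unpacking, so Heritrix is still started as shown above.

```shell
#!/usr/bin/env bash
# Sketch: the download/unpack steps with the snapshot URL stored once.
# HERITRIX_URL, tarball_name and install_heritrix are illustrative names.
HERITRIX_URL="https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix/3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz"

# Print the file-name part of a URL (everything after the last slash).
tarball_name() {
  printf '%s\n' "${1##*/}"
}

# Create heritrix-umbra, then download, unpack and delete the tarball.
# Leaves the shell inside heritrix-umbra when it succeeds.
install_heritrix() {
  local tarball
  tarball="$(tarball_name "$HERITRIX_URL")"
  mkdir -p heritrix-umbra && cd heritrix-umbra || return 1
  wget "$HERITRIX_URL" && tar -xzf "$tarball" && rm "$tarball"
}

# Usage: install_heritrix
```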

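The ./heritrix -a admin:admin line starts Heritrix with the web interface login admin (password admin). To test that it works, open https://localhost:8443/ in a browser and log in with those credentials. Heritrix serves a self-signed certificate by default, so expect a browser security warning.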
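
If no browser is at hand, the same check can be done from the shell. This is a minimal sketch, assuming curl is installed and Heritrix was started with admin:admin as above. The function name heritrix_ui_up is made up for this example. The -k option skips certificate verification, which is only reasonable here because the server is local and uses a self-signed certificate.

```shell
#!/usr/bin/env bash
# Sketch: return 0 if the Heritrix web interface answers at the given URL.
# heritrix_ui_up is an illustrative name; the default URL and the login
# admin:admin are taken from the steps above.
heritrix_ui_up() {
  local url="${1:-https://localhost:8443/engine}"
  # -k: accept the self-signed certificate; -f: fail on HTTP errors;
  # --anyauth: let curl negotiate the authentication scheme.
  curl -k -s -f -o /dev/null --max-time 5 --anyauth --location \
       -u admin:admin "$url"
}

# Usage: heritrix_ui_up && echo "Heritrix is up"
```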

Then install Umbra and RabbitMQ as follows. You may have to use pip3 instead of pip in the first line, and you may first have to run sudo apt-get install python-pip (or python3-pip if you use pip3).

sudo -H pip install git+https://github.com/internetarchive/umbra.git
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo rabbitmq-plugins enable rabbitmq_shovel
sudo rabbitmq-plugins enable rabbitmq_shovel_management
sudo service rabbitmq-server restart

The RabbitMQ management interface should now be reachable at http://localhost:15672 (user: guest, pass: guest).
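
To confirm that all three plugins were enabled, their names can be checked against the list of enabled plugins. The sketch below assumes that your rabbitmq-plugins version supports the -m (names only) and -e (enabled only) options of its list command. The function name missing_plugins is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: read enabled plugin names on stdin and print any of the three
# plugins enabled above that do not appear in that list.
# missing_plugins is an illustrative name.
missing_plugins() {
  local enabled plugin
  enabled="$(cat)"
  for plugin in rabbitmq_management rabbitmq_shovel rabbitmq_shovel_management; do
    # -w matches whole words only, so rabbitmq_shovel does not match
    # inside rabbitmq_shovel_management.
    printf '%s\n' "$enabled" | grep -qw -- "$plugin" || printf '%s\n' "$plugin"
  done
}

# Usage: sudo rabbitmq-plugins list -m -e | missing_plugins
```

No output means all three plugins were found.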

Get a Heritrix 3 crawl job configuration file (usually named crawler-beans.cxml).

Near the top of the file, just inside the opening <beans> tag (before the first bean definition), insert:

<bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>

<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
  <property name="clientId" value="requests"/>
</bean>

Also, in the same file, find

  <bean id="fetchProcessors"

and within that bean, just under

  <ref bean="extractorSwf"/>

add the line:

  <ref bean="umbraBean"/>
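
After editing, a quick text search can confirm that all three snippets made it into the file. This is only a sketch: it checks that the strings are present, not that they sit in the right place or that the XML is valid. The function name check_umbra_beans is made up for this example, and the file name crawler-beans.cxml in the usage line is an assumption about what your job file is called.

```shell
#!/usr/bin/env bash
# Sketch: report which of the Umbra-related snippets are missing from a
# crawl job file. Returns 0 only if all of them are present.
# check_umbra_beans is an illustrative name.
check_umbra_beans() {
  local file="$1" pattern status=0
  for pattern in \
    'class="org.archive.crawler.frontier.AMQPUrlReceiver"' \
    'id="umbraBean"' \
    '<ref bean="umbraBean"/>'; do
    # -F treats the pattern as a fixed string, not a regular expression.
    if ! grep -qF -- "$pattern" "$file"; then
      echo "missing: $pattern"
      status=1
    fi
  done
  return "$status"
}

# Usage: check_umbra_beans crawler-beans.cxml
```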




NOTE: To stop Heritrix, find its process ID with

ps ax | grep eritrix

and then kill that process (kill followed by the process ID shown in the first column).
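
The lookup can also be reduced to the process IDs alone. The sketch below filters ps ax output with awk. The function name heritrix_pids is made up for this example. It prints the ID of every process whose command line mentions Heritrix, so check the list before killing anything.

```shell
#!/usr/bin/env bash
# Sketch: print the first column (the PID) of every line of `ps ax`-style
# input that mentions Heritrix. heritrix_pids is an illustrative name.
heritrix_pids() {
  # The bracket expression matches "Heritrix" or "heritrix". The awk
  # process itself is not matched, because its own command line contains
  # the literal text "[Hh]eritrix".
  awk '/[Hh]eritrix/ { print $1 }'
}

# Usage: ps ax | heritrix_pids
```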

