This guide was written for, and tested on, Ubuntu.
Note that this guide is still a WORK IN PROGRESS. These are the steps:
In a shell, first change directory to the place where you want Heritrix+Umbra to be installed. Then do the following.
    mkdir heritrix-umbra
    cd heritrix-umbra
    wget https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix/3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
    tar -xzf heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
    rm heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
    cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/bin
    ./heritrix -a admin:admin
    cd ../..
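The download commands repeat the same long file name three times. As a minimal sketch, the pieces can be kept in variables so that only one place changes when a newer snapshot is used. The variable names (VERSION, BUILD, REPO, TARBALL, URL, HERITRIX_DIR) are illustrative assumptions; the values are copied from the commands above. The block only prints the names it derives and downloads nothing.

```shell
# Version and build identifiers, copied from the commands in this guide.
VERSION="3.3.0-BDB-5.0.x-NAS-1.0"
BUILD="20180628.090816-4"
REPO="https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix"

# Names derived from the identifiers above (variable names are hypothetical).
TARBALL="heritrix-${VERSION}-${BUILD}-dist.tar.gz"
URL="${REPO}/${VERSION}-SNAPSHOT/${TARBALL}"
HERITRIX_DIR="heritrix-${VERSION}-SNAPSHOT"

echo "Download URL: ${URL}"
echo "Unpacks to:   ${HERITRIX_DIR}"

# The download step would then read:
#   wget "${URL}" && tar -xzf "${TARBALL}" && rm "${TARBALL}"
```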
Now, in a browser, go to https://localhost:8443/ to test that it works (log in with the admin:admin credentials given to the -a option above).
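If no browser is at hand, the same check can probably be done from the shell. The sketch below assumes that curl is installed, that Heritrix serves its UI under /engine with a self-signed certificate, and that the admin:admin credentials from the -a option are in use. The function name heritrix_status is hypothetical.

```shell
# Print the HTTP status code returned by the Heritrix web UI.
#   -k         accept the (assumed) self-signed certificate
#   --anyauth  let curl negotiate whichever auth scheme the server asks for
#   -u         the credentials passed to ./heritrix with -a
heritrix_status() {
    url="${1:-https://localhost:8443/engine}"   # /engine path is an assumption
    curl -k -s -o /dev/null -w '%{http_code}' --anyauth -u admin:admin "$url"
}

# Usage:
#   heritrix_status
# A 200 would suggest that the UI is up and accepted the login.
```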
Then, to install Umbra, do the following. You may have to use pip3 instead of pip in the first line, and you may first have to run sudo apt-get install python-pip.
    sudo -H pip install git+https://github.com/internetarchive/umbra.git
    sudo apt-get install rabbitmq-server
    sudo rabbitmq-plugins enable rabbitmq_management
    sudo rabbitmq-plugins enable rabbitmq_shovel
    sudo rabbitmq-plugins enable rabbitmq_shovel_management
    sudo service rabbitmq-server restart
RabbitMQ should now be reachable at http://localhost:15672 (user: guest, pass: guest).
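To confirm that the three plugins were actually enabled, the plugin listing can be checked from the shell. This is a sketch only: the function name missing_plugins is hypothetical, and it simply searches whatever text it receives on standard input for the three plugin names used in this guide.

```shell
# Read a plugin listing on stdin and print every required plugin that is absent.
missing_plugins() {
    listing="$(cat)"
    for plugin in rabbitmq_management rabbitmq_shovel rabbitmq_shovel_management; do
        # -w matches whole words only, so rabbitmq_shovel does not
        # match inside rabbitmq_shovel_management.
        printf '%s\n' "$listing" | grep -qw "$plugin" || echo "$plugin"
    done
}

# Usage:
#   sudo rabbitmq-plugins list -e | missing_plugins
# No output means all three plugins appear in the listing.
```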
Get a Heritrix 3 crawl job configuration file (like this one: [link missing]). In that file, within the beans tag, insert:
    <bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>
    <bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
        <property name="clientId" value="requests"/>
    </bean>
Also, in the same file, find
<bean id="fetchProcessors"
and within that bean, just under
<ref bean="extractorSwf"/>
add the line:
<ref bean="umbraBean"/>
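After editing, a quick search can confirm that all three additions made it into the file. The sketch below assumes nothing about the file beyond it being plain text. The function name check_umbra_beans and the file name crawler-beans.cxml in the usage line are assumptions, not something this guide prescribes.

```shell
# Report which of the Umbra-related additions are missing from a job configuration.
# Returns 0 if all are present, 1 otherwise.
check_umbra_beans() {
    file="$1"
    status=0
    for needle in \
        'org.archive.crawler.frontier.AMQPUrlReceiver' \
        'org.archive.modules.AMQPPublishProcessor' \
        '<ref bean="umbraBean"/>'
    do
        # -F treats the needle as a fixed string, not a regular expression.
        if ! grep -qF "$needle" "$file"; then
            echo "missing: $needle"
            status=1
        fi
    done
    return "$status"
}

# Usage (the file name is an assumption):
#   check_umbra_beans crawler-beans.cxml
```

The check only looks for the strings themselves. It does not verify that the ref line sits inside the fetchProcessors bean, so a manual look at the file is still worthwhile.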
NOTE: Heritrix can be killed by doing a
ps ax | grep eritrix
and killing the relevant process.
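The lookup above can be wrapped in a small filter that prints only the process IDs. This is a sketch: the function name heritrix_pids is hypothetical, and it assumes the PID is the first column of the ps ax output, as it is on Ubuntu.

```shell
# Read `ps ax` output on stdin and print the PIDs of lines mentioning Heritrix,
# skipping the grep process itself.
heritrix_pids() {
    grep '[Hh]eritrix' | grep -v 'grep' | awk '{print $1}'
}

# Usage:
#   ps ax | heritrix_pids
# and then, for each PID printed:
#   kill <PID>
```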