This guide was written for, and tested on, Ubuntu.
NOTE: This guide is a work in progress. These are the steps so far.
In a shell, first change directory to the place where you want Heritrix+Umbra to be installed. Then do the following.
mkdir heritrix-umbra
cd heritrix-umbra
wget https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix/3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
tar -xzf heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
rm heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/bin
./heritrix -a admin:admin
cd ../..
Now, in a browser, go to https://localhost:8443/ and log in with the user name and password given to the -a option above (admin and admin) to test that it works.
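The same check can be sketched from the shell with curl, which is handy on a machine without a browser. This is only a sketch under assumptions: describe_status and check_heritrix are hypothetical helper names, the /engine path is assumed to be where the web interface answers, and -k is used on the assumption that Heritrix serves a self-signed certificate.

```shell
# Hypothetical helper: turn the HTTP status code reported by curl into a verdict.
describe_status() {
    case "$1" in
        000) echo "unreachable" ;;                       # curl reports 000 when no response arrived
        2??|3??) echo "up" ;;
        401) echo "up, but credentials rejected" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Hypothetical helper: probe the Heritrix web interface (URL is an assumption).
check_heritrix() {
    # -k skips certificate verification; --anyauth lets curl pick the auth scheme the server asks for
    code="$(curl -k -s -o /dev/null --max-time 5 --anyauth -u admin:admin \
        -w '%{http_code}' "${1:-https://localhost:8443/engine}")" || true
    describe_status "${code:-000}"
}
```

Running check_heritrix with no arguments probes the local instance; a different URL can be passed as the first argument.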
Run
which python
and make sure that the python it points at is a Python 3. If it is not, change your setup so that it is.
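One possible way to check the major version, rather than only the path, is sketched here. The helper name is_python3 is hypothetical, and the check assumes the usual "Python X.Y.Z" output of python --version.

```shell
# Hypothetical helper: succeeds if a "python --version" string reports major version 3.
is_python3() {
    case "$1" in
        "Python 3."*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v python >/dev/null 2>&1; then
    # Python 2 prints its version on stderr, so capture both streams
    version="$(python --version 2>&1)"
    if is_python3 "$version"; then
        echo "OK: python is $version"
    else
        echo "python is $version, not a Python 3"
    fi
else
    echo "no python command found on the PATH"
fi
```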
Then, to install Umbra, do the following.
sudo apt-get update
sudo apt-get -y install python3-pip
sudo -H pip3 install git+https://github.com/internetarchive/umbra.git
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo rabbitmq-plugins enable rabbitmq_shovel
sudo rabbitmq-plugins enable rabbitmq_shovel_management
sudo service rabbitmq-server restart
RabbitMQ should now be reachable at http://localhost:15672 (user: guest, pass: guest).
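If the management page does not come up, it may help to confirm that all three plugins were actually enabled. A minimal sketch, assuming a hypothetical helper missing_plugins that reads a plugin listing (one name per line) on stdin and prints each required plugin that is absent:

```shell
# Hypothetical helper: print every required plugin that is absent from the listing on stdin.
missing_plugins() {
    listing="$(cat)"
    for plugin in rabbitmq_management rabbitmq_shovel rabbitmq_shovel_management; do
        # -w matches whole words only, so rabbitmq_shovel_management does not count as rabbitmq_shovel
        echo "$listing" | grep -w "$plugin" >/dev/null || echo "$plugin"
    done
}
```

It could be fed with something like sudo rabbitmq-plugins list -e -m | missing_plugins, where -e is assumed to restrict the listing to enabled plugins and -m to print bare names. No output means nothing is missing.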
Then, from the heritrix-umbra directory, create a directory for a test job:
cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/
mkdir myTestJob
cd myTestJob
Get a Heritrix 3 Crawl Job Configuration File (like this one: ), put it in heritrix-umbra/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/myTestJob/ and rename it to crawler-beans.cxml
At the top of the file, just above the first bean tag (not beans), insert:
<bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>
<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
  <property name="clientId" value="requests"/>
</bean>
Also, in the same file, find
<bean id="fetchProcessors"
and within that bean, just under
<ref bean="extractorSwf"/>
add the line:
<ref bean="umbraBean"/>
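Because these edits are easy to get slightly wrong, a quick text check of the file can help. The sketch below assumes a hypothetical helper check_umbra_beans. It only looks for the three snippets as literal text, so it does not validate the XML or confirm that each snippet sits in the right place.

```shell
# Hypothetical helper: report which of the three edits are missing from a crawler-beans.cxml.
check_umbra_beans() {
    file="$1"
    status=0
    for needle in \
        'org.archive.crawler.frontier.AMQPUrlReceiver' \
        'id="umbraBean" class="org.archive.modules.AMQPPublishProcessor"' \
        '<ref bean="umbraBean"/>'
    do
        # -F treats the snippet as a fixed string rather than a regular expression
        if ! grep -qF -- "$needle" "$file"; then
            echo "missing: $needle"
            status=1
        fi
    done
    return "$status"
}
```

From the myTestJob directory it would be called as check_umbra_beans crawler-beans.cxml. It prints nothing and returns 0 when all three snippets are present.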
Then, still in the myTestJob dir, create a text file called seeds.txt with a single line saying: http://netarkivet.dk
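The seeds file can also be created from the shell instead of a text editor. A minimal sketch, where write_seeds is a hypothetical helper taking the job directory and the seed URL:

```shell
# Hypothetical helper: write a one-line seeds.txt into the given job directory.
write_seeds() {
    printf '%s\n' "$2" > "$1/seeds.txt"
}
```

From inside the myTestJob dir this would be called as write_seeds . http://netarkivet.dk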
Now, in a browser, go to https://localhost:8443/ and paste the full path to the myTestJob dir into the field under "add existing job", then click the add button. "myTestJob" should appear at the bottom of the window; click it.
On the page that appears, click the build button.
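The add and build steps can probably also be scripted through the Heritrix 3 REST interface rather than clicked through. The sketch below assumes that interface is available at https://localhost:8443/engine with the admin:admin credentials used earlier, and that it accepts the actions add (with an addpath parameter) and build. The helper names are hypothetical, and none of this has been verified against the NAS snapshot build used in this guide.

```shell
# Hypothetical helper: URL of a named job under the (assumed) REST root.
job_url() {
    printf 'https://localhost:8443/engine/job/%s\n' "$1"
}

# Hypothetical helper: POST form data to a Heritrix URL.
heritrix_post() {
    url="$1"
    shift
    # -k skips certificate verification; --anyauth lets curl pick the auth scheme the server asks for
    curl -k -s --anyauth --location -u admin:admin "$@" "$url"
}

# Register an existing job directory with the engine (the path is URL-encoded by curl).
add_job() {
    heritrix_post "https://localhost:8443/engine" -d "action=add" --data-urlencode "addpath=$1"
}

# Ask the engine to build the named job.
build_job() {
    heritrix_post "$(job_url "$1")" -d "action=build"
}
```

With Heritrix running, add_job "$PWD" from inside the myTestJob dir followed by build_job myTestJob would correspond to the two clicks described above.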
NOTE: Heritrix can be killed by doing a
ps ax | grep eritrix
and killing the relevant process.
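To pick out the process id without reading the listing by eye, the output of ps ax can be filtered. A minimal sketch, assuming a hypothetical helper heritrix_pids that reads ps ax output on stdin and assumes the process id is the first column:

```shell
# Hypothetical helper: print the PID of every line that mentions Heritrix,
# skipping the grep or awk process that is doing the searching.
heritrix_pids() {
    awk '/[Hh]eritrix/ && !/grep/ { print $1 }'
}
```

It would be used as ps ax | heritrix_pids, and each process id it prints can then be passed to kill.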