This guide was written for, and tested on, Ubuntu.

These are the steps. (Work in progress.)

In a shell, first change to the directory where you want Heritrix+Umbra to be installed. Then do the following.

mkdir heritrix-umbra
cd heritrix-umbra
wget https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix/3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
tar -xzf heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
rm heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/bin
./heritrix -a admin:admin
cd ../..

Now, in a browser, go to https://localhost:8443/ to test that it works.
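The same check can be made from the shell. The sketch below assumes the admin:admin credentials used above and the /engine path of the Heritrix 3 REST API; the function name heritrix_status is made up for this guide. The -k flag is there because Heritrix normally serves a self-signed certificate.

```shell
# Hypothetical helper: print the HTTP status code returned by the Heritrix engine page.
# -k accepts a self-signed certificate, --anyauth lets curl pick the auth scheme.
heritrix_status() {
  url=${1:-https://localhost:8443/engine}
  curl -s -k -u admin:admin --anyauth --location \
    -o /dev/null -w '%{http_code}\n' "$url"
}
```

Running heritrix_status with no arguments queries the local instance; a 200 suggests that Heritrix is up and that the credentials were accepted.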

Run which python and check that the python it reports is Python 3. If it is not, change your setup so that it is.
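A minimal sketch of that check follows. The function names is_python3 and check_python are invented for this guide, and the check only looks at the text printed by python --version.

```shell
# Hypothetical helper: succeeds if the given version string comes from Python 3.
is_python3() {
  case "$1" in
    "Python 3."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report where `python` resolves on PATH and whether it is Python 3.
check_python() {
  path=$(command -v python) || { echo "no python on PATH"; return 1; }
  version=$(python --version 2>&1)   # Python 2 prints its version on stderr
  if is_python3 "$version"; then
    echo "OK: $path is $version"
  else
    echo "NOT Python 3: $path is $version"
    return 1
  fi
}
```

Call check_python in the shell you will use for the Umbra installation.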

Then, to install Umbra, do the following.

sudo apt-get update
sudo apt-get -y install python3-pip
sudo -H pip3 install git+https://github.com/internetarchive/umbra.git
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo rabbitmq-plugins enable rabbitmq_shovel
sudo rabbitmq-plugins enable rabbitmq_shovel_management
sudo service rabbitmq-server restart

RabbitMQ should now be reachable at http://localhost:15672 (user: guest, pass: guest).
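To check this without a browser, the management plugin's HTTP API can be queried. This is only a sketch: it assumes the default guest:guest account and the /api/overview endpoint of the management plugin, and the function name rabbitmq_overview is made up for this guide.

```shell
# Hypothetical helper: fetch the RabbitMQ management overview as JSON.
rabbitmq_overview() {
  base=${1:-http://localhost:15672}
  curl -s -u guest:guest "$base/api/overview"
}
```

If rabbitmq_overview prints a JSON document, the management plugin is enabled and reachable.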

Next, from the heritrix-umbra directory, create a directory for a test job:

cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/
mkdir myTestJob
cd myTestJob


Get a Heritrix 3 crawl job configuration file, put it in heritrix-umbra/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/myTestJob/ and rename it to crawler-beans.cxml

Near the top of the file, just above the first bean element (not the enclosing beans element), insert:

<bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>

<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
  <property name="clientId" value="requests"/>
</bean>

Also, in the same file, find

  <bean id="fetchProcessors"

and within that bean, just under

  <ref bean="extractorSwf"/>

add the line:

  <ref bean="umbraBean"/>
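A quick way to confirm that all three edits made it into the file is to search for them. The sketch below only looks for the literal strings shown above and does not validate the XML; the function name check_umbra_beans is made up for this guide.

```shell
# Hypothetical helper: check that a crawler-beans.cxml contains the Umbra-related
# entries described above. Prints each missing entry and returns non-zero if any is absent.
check_umbra_beans() {
  file=${1:-crawler-beans.cxml}
  missing=0
  for pattern in \
    'org.archive.crawler.frontier.AMQPUrlReceiver' \
    'org.archive.modules.AMQPPublishProcessor' \
    '<ref bean="umbraBean"/>'
  do
    # -F matches the pattern as a fixed string, not as a regular expression
    if ! grep -qF -- "$pattern" "$file"; then
      echo "missing: $pattern"
      missing=1
    fi
  done
  return "$missing"
}
```

Run check_umbra_beans from the myTestJob directory; no output means all three entries were found.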

Then, still in the myTestJob directory, create a text file called seeds.txt containing the single line: http://netarkivet.dk
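The seed file can also be written directly from the shell. In this sketch the function name write_seeds is made up for this guide, and the job directory defaults to the current directory.

```shell
# Hypothetical helper: write a seeds.txt with the single seed URL into a job directory.
write_seeds() {
  job_dir=${1:-.}
  printf '%s\n' 'http://netarkivet.dk' > "$job_dir/seeds.txt"
}
```

From inside myTestJob, calling write_seeds with no arguments creates the file in place.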

Now, in a browser, go to https://localhost:8443/ and paste the path to the myTestJob directory into the field under "add existing job", then click the add button. "myTestJob" should appear at the bottom of the window; click it.

On the page that appears, click the build button.
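Adding and building the job can probably also be scripted against the Heritrix 3 REST API instead of clicking through the web UI. The sketch below assumes the admin:admin credentials used earlier and the action=add and action=build requests of the standard Heritrix 3 API; it has not been checked against this particular snapshot build. The function names and the HERITRIX_URL variable are made up for this guide.

```shell
# Assumed base URL of the local Heritrix instance.
HERITRIX_URL=${HERITRIX_URL:-https://localhost:8443}

# Hypothetical helper: register an existing job directory with Heritrix.
# $1 = absolute path to the job directory (URL-encoded by curl).
add_job() {
  curl -s -k -u admin:admin --anyauth --location \
    -d "action=add" --data-urlencode "addpath=$1" \
    "$HERITRIX_URL/engine"
}

# Hypothetical helper: build a job that Heritrix already knows about.
# $1 = job name, for example myTestJob.
build_job() {
  curl -s -k -u admin:admin --anyauth --location \
    -d "action=build" \
    "$HERITRIX_URL/engine/job/$1"
}
```

For the job in this guide, that would be add_job with the full path to the myTestJob directory, followed by build_job myTestJob.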


NOTE: To stop Heritrix, find its process with

ps ax | grep eritrix

and kill the relevant process by its process ID.
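A slightly more careful version of that lookup is sketched below. The bracketed first letter keeps grep from matching its own entry in the process list, and awk prints only the process ID column. The function name heritrix_pids is made up for this guide.

```shell
# Hypothetical helper: print the process IDs of anything whose command line
# mentions Heritrix or heritrix. '[Hh]eritrix' does not match the grep process itself.
heritrix_pids() {
  ps ax | grep '[Hh]eritrix' | awk '{print $1}'
}
```

Check the output of heritrix_pids first, then pass the relevant ID to kill.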

