Heritrix+Umbra install guide
This guide was written for, and tested on, Ubuntu.
This guide is a WORK IN PROGRESS. The steps are as follows.
In a shell, first change directory to the place where you want Heritrix+Umbra to be installed. Then do the following.
    mkdir heritrix-umbra
    cd heritrix-umbra
    wget https://sbforge.org/nexus/content/repositories/snapshots/org/archive/heritrix/heritrix/3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
    tar -xzf heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
    rm heritrix-3.3.0-BDB-5.0.x-NAS-1.0-20180628.090816-4-dist.tar.gz
    cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/bin
    ./heritrix -a admin:admin
    cd ../..
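Now, in a browser, go to https://localhost:8443/ to test that it works. Log in with the credentials given to the -a option above (user: admin, pass: admin). The browser will probably warn about the certificate, because Heritrix serves the page with a self-signed one.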
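The same check can be made from the shell. The following is a minimal sketch, assuming that curl is installed and that the web interface answers under /engine using digest authentication, as is usual for Heritrix 3. The function name heritrix_is_up is made up for this guide.

```shell
# Hypothetical helper: succeed if the Heritrix web interface answers.
heritrix_is_up() {
    url="${1:-https://localhost:8443/engine}"
    # -k accepts the self-signed certificate, -f turns HTTP errors into a
    # non-zero exit status, and --anyauth lets curl pick the auth scheme.
    curl -k -s -f -o /dev/null -u admin:admin --anyauth --location "$url"
}

if heritrix_is_up; then
    echo "Heritrix is answering"
else
    echo "Heritrix is not reachable (yet)"
fi
```

Heritrix can take a few seconds to start, so a first failure right after launching it does not necessarily mean that something is wrong.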
Run which python and make sure that the python it reports is a Python 3 interpreter. If it is not, change your setup so that python points at Python 3.
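One way to check the version without reading the path by eye is sketched below. It assumes only that the interpreter supports the --version option; the function name python_major is hypothetical.

```shell
# Hypothetical helper: print the major version reported by an interpreter.
python_major() {
    interpreter="${1:-python}"
    # Python 2 prints its version on stderr and Python 3 on stdout,
    # so both streams are merged before parsing.
    "$interpreter" --version 2>&1 | sed -n 's/^Python \([0-9][0-9]*\)\..*/\1/p'
}

if [ "$(python_major python)" = "3" ]; then
    echo "python is Python 3"
else
    echo "python is missing or is not Python 3"
fi
```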
Then, to install Umbra, do the following.
    sudo apt-get update
    sudo apt-get -y install python3-pip
    sudo -H pip3 install git+https://github.com/internetarchive/umbra.git
    sudo apt-get install rabbitmq-server
    sudo rabbitmq-plugins enable rabbitmq_management
    sudo rabbitmq-plugins enable rabbitmq_shovel
    sudo rabbitmq-plugins enable rabbitmq_shovel_management
    sudo service rabbitmq-server restart
RabbitMQ should now be reachable at http://localhost:15672 (user: guest, pass: guest).
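The management plugin also exposes an HTTP API, so the RabbitMQ check can be scripted. This is a sketch that assumes curl is installed and that the default guest account is still enabled; rabbitmq_overview is a name invented here.

```shell
# Hypothetical helper: fetch the management API overview as JSON.
rabbitmq_overview() {
    base="${1:-http://localhost:15672}"
    # -f makes curl exit non-zero on HTTP errors such as a failed login.
    curl -s -f -u guest:guest "$base/api/overview"
}

if rabbitmq_overview >/dev/null; then
    echo "RabbitMQ management interface is answering"
else
    echo "RabbitMQ management interface is not reachable"
fi
```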
Make sure the Chromium browser is installed. (If it is not, do a sudo apt-get install chromium-browser)
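Whether Chromium is already installed can be checked from the shell. The sketch below assumes that the executable is called either chromium-browser or chromium, which varies between releases; find_chromium is a hypothetical name.

```shell
# Hypothetical helper: print the path of the first Chromium executable found.
find_chromium() {
    for candidate in chromium-browser chromium; do
        if command -v "$candidate" >/dev/null 2>&1; then
            command -v "$candidate"
            return 0
        fi
    done
    return 1
}

find_chromium || echo "Chromium was not found on the PATH"
```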
Then run Umbra in one of the following two ways.
If you want to see what Umbra does in the Chromium browser, just do
    umbra -v &
If you want Umbra to do its work without showing the browser, start a second X server with
    sudo X :1
(and then press <ctrl> <alt> <F7>)
and, in another shell, do
    export DISPLAY=:1; umbra -v &
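When Umbra runs in the background like this, its verbose output is easy to lose. A possible variant, sketched here with the hypothetical helper start_umbra, sends the output to a log file and keeps Umbra running after the shell is closed. It assumes only the umbra -v invocation shown above and the standard nohup utility.

```shell
# Hypothetical helper: start Umbra on a given X display, logging to a file.
# Usage: start_umbra [display] [logfile]
start_umbra() {
    display="${1:-:1}"
    logfile="${2:-umbra.log}"
    # nohup keeps Umbra alive after the shell exits; stdin is detached so
    # that nohup has nothing to report about the terminal.
    DISPLAY="$display" nohup umbra -v >"$logfile" 2>&1 </dev/null &
    echo "$!"    # PID of the background Umbra process
}
```

For example, start_umbra :1 umbra.log prints the process ID, and tail -f umbra.log then follows what Umbra is doing.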
Now, from the heritrix-umbra directory, do
    cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/
    mkdir myTestJob
    cd myTestJob
Get the following Heritrix 3 Crawl Job Configuration File:
, and put it in heritrix-umbra/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/myTestJob/
Then, still in the myTestJob directory, create a text file called seeds.txt containing a single line: https://da-dk.facebook.com/larsloekke/
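The directory and the seeds file can also be created in one go. The helper below is a sketch with a made-up name, make_job_dir; it does not fetch the crawl job configuration file, which still has to be copied into the job directory by hand.

```shell
# Hypothetical helper: create a job directory holding a one-line seeds.txt.
# Usage: make_job_dir <jobs-dir> <job-name> <seed-url>
make_job_dir() {
    jobs_dir="$1"
    job_name="$2"
    seed_url="$3"
    mkdir -p "$jobs_dir/$job_name"
    # printf writes exactly one line, terminated by a newline.
    printf '%s\n' "$seed_url" > "$jobs_dir/$job_name/seeds.txt"
}
```

Run from the heritrix-umbra directory, make_job_dir heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs myTestJob https://da-dk.facebook.com/larsloekke/ gives the same result as the manual steps.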
Now, in a browser, go to https://localhost:8443/, paste the full path to the myTestJob directory into the field under "add existing job", and click the add button. "myTestJob" should then appear at the bottom of the window; click it.
On the page that appears, click the build button. Once the job has been built, click the launch button, then reload the page until it says "Job is Finished: FINISHED".
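The build and launch buttons correspond to actions that Heritrix 3 also accepts over HTTP, so these clicks can probably be scripted. The sketch below assumes the usual Heritrix 3 REST interface (a POST of action=... to /engine/job/<job name>) and the admin:admin credentials used earlier. The function name heritrix_job_action is hypothetical, and this has not been tested against this particular snapshot build.

```shell
# Hypothetical helper: send one action (e.g. build or launch) to a job.
# Usage: heritrix_job_action <job-name> <action>
heritrix_job_action() {
    job="$1"
    action="$2"
    # -k accepts the self-signed certificate; -d sends the action as a POST.
    curl -k -s -f -o /dev/null -u admin:admin --anyauth --location \
        -d "action=$action" "https://localhost:8443/engine/job/$job"
}
```

With Heritrix running, heritrix_job_action myTestJob build followed by heritrix_job_action myTestJob launch would stand in for the two clicks. If the job configuration is set to pause at start, an additional unpause action would presumably be needed as well.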
The Umbra harvests will now likely be running. It is still an open question where the resulting files are actually written.
NOTE: Heritrix can be killed by finding its process with
    ps ax | grep eritrix
and killing the relevant process.
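The search for the process can be wrapped in a small function. This is a sketch that assumes the pgrep utility from procps is available, which is normally the case on Ubuntu; heritrix_pids is a made-up name.

```shell
# Hypothetical helper: print the PIDs of processes whose command line matches.
heritrix_pids() {
    pattern="${1:-[Hh]eritrix}"
    # -f matches against the full command line, not just the process name.
    pgrep -f "$pattern"
}

heritrix_pids || echo "no matching process found"
```

Because the pattern is matched against whole command lines, it can also hit unrelated processes such as an editor with this guide open, so check each PID with ps -p <pid> before passing it to kill.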