
...

Now, in a browser, go to https://localhost:8443/ to test that it works.

Run which python and make sure that it points at a Python 3 interpreter. If it does not, change your setup so that it does.
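The check above can be scripted. The following is a minimal sketch; the function names is_python3_version and check_python are made up for illustration, and it only assumes that python --version prints something like "Python 3.10.12".

```shell
# Succeed if the given version string (e.g. "Python 3.10.12") is a Python 3 version.
is_python3_version() {
    case "$1" in
        "Python 3."*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report what "python" resolves to on this machine and whether it is Python 3.
check_python() {
    if ! command -v python >/dev/null 2>&1; then
        echo "no python on PATH"
        return 1
    fi
    # Python 2 prints its version on stderr, so merge the two streams.
    version="$(python --version 2>&1)"
    if is_python3_version "$version"; then
        echo "ok: $(command -v python) is $version"
    else
        echo "not Python 3: $(command -v python) is $version"
        return 1
    fi
}
```

Running check_python in the shell where you intend to install Umbra prints the path and version that which python would give you.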

Then, to install Umbra and RabbitMQ, run the following. (The first lines install pip for Python 3; on some systems the command is called pip instead of pip3.)

...


Code Block

sudo apt-get update
sudo apt-get -y install python3-pip
sudo -H pip3 install git+https://github.com/internetarchive/umbra.git
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo rabbitmq-plugins enable rabbitmq_shovel
sudo rabbitmq-plugins enable rabbitmq_shovel_management
sudo service rabbitmq-server restart

RabbitMQ should now be reachable at http://localhost:15672 (user: guest, pass: guest).
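If you prefer to check this from the command line, something like the following sketch may help. It assumes curl is installed and probes the management plugin's /api/overview endpoint with the guest account; the function name check_rabbitmq is made up for illustration.

```shell
# Probe the RabbitMQ management HTTP API and report whether it answers.
check_rabbitmq() {
    url="${1:-http://localhost:15672/api/overview}"
    # -s: quiet, -o /dev/null: discard the body, -w: print only the HTTP status code.
    # "|| true" keeps a refused connection from aborting scripts that use "set -e".
    code="$(curl -s -o /dev/null -w '%{http_code}' -u guest:guest "$url" || true)"
    if [ "$code" = "200" ]; then
        echo "RabbitMQ management API is up"
    else
        echo "RabbitMQ management API not reachable (HTTP $code)"
        return 1
    fi
}
```

Call check_rabbitmq after restarting the rabbitmq-server service; a status other than 200 usually means the management plugin is not enabled or the server is not running.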

Make sure the Chromium browser is installed. (If not, run sudo apt-get install chromium-browser.)

Then run Umbra as follows:

If you want to see what Umbra does in the Chromium browser, just run umbra -v &

If you want Umbra to do its work without showing the browser, first start a second X server on display :1:

sudo X :1 (and then press <ctrl> <alt> <F7>)

and then, in another shell, run export DISPLAY=:1; umbra -v &

Now, do

Code Block
cd heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/
mkdir myTestJob
cd myTestJob

Get a Heritrix 3 crawl job configuration file (like the attached crawler-beans.cxml).

At the top of the file, just above the first bean tag (not beans), insert:

Code Block
<bean class="org.archive.crawler.frontier.AMQPUrlReceiver"/>
<bean id="umbraBean" class="org.archive.modules.AMQPPublishProcessor">
  <property name="clientId" value="requests"/>
</bean>

Also, in the same file, find

<bean id="fetchProcessors"

and within that bean, just under

<ref bean="extractorSwf"/>

add the line:

<ref bean="umbraBean"/>

Then put the edited file in heritrix-umbra/heritrix-3.3.0-BDB-5.0.x-NAS-1.0-SNAPSHOT/jobs/myTestJob/
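Before loading the job, it can be worth confirming that the configuration file really contains the Umbra-related edits. The sketch below simply greps for the three snippets described above; the function name check_umbra_config is made up for illustration, and it does not validate the XML or the position of the lines.

```shell
# Check that a crawl job configuration file contains the Umbra-related edits.
# Usage: check_umbra_config /path/to/jobs/myTestJob/crawler-beans.cxml
check_umbra_config() {
    cxml="$1"
    rc=0
    for needle in \
        'class="org.archive.crawler.frontier.AMQPUrlReceiver"' \
        'id="umbraBean"' \
        '<ref bean="umbraBean"/>'
    do
        # -F: match the snippet literally, -q: no output, only an exit status.
        if grep -qF -- "$needle" "$cxml"; then
            echo "found:   $needle"
        else
            echo "MISSING: $needle"
            rc=1
        fi
    done
    return "$rc"
}
```

The function returns a non-zero status if any of the three snippets is missing, so it can be used in a script before adding the job to Heritrix.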

Then, still in the myTestJob directory, create a text file called seeds.txt containing the single line: https://da-dk.facebook.com/larsloekke/
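The seeds file can also be created from the shell. This is a small sketch; the function name make_seeds is made up for illustration, and the job directory is passed in rather than assumed.

```shell
# Write a one-line seeds.txt into the given job directory.
# Usage: make_seeds <job-directory> <seed-url>
make_seeds() {
    job_dir="$1"
    seed_url="$2"
    mkdir -p "$job_dir"
    # printf writes exactly one line, terminated by a newline.
    printf '%s\n' "$seed_url" > "$job_dir/seeds.txt"
}
```

For example, from inside the myTestJob directory: make_seeds . https://da-dk.facebook.com/larsloekke/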

Now, in a browser, go to https://localhost:8443/ and paste the path to the myTestJob dir into the field under add existing job, and click the add button. "myTestJob" should appear at the bottom of the window, so click it.

On the page that appears, click the build button, and when it has built, click the launch button and reload the page until it says "Job is Finished: FINISHED".
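The build and launch steps can probably also be driven from the command line through the Heritrix 3 REST interface instead of the buttons. The following is only a sketch under assumptions: the function name heritrix_action is made up, the admin:admin default credentials are placeholders for whatever Heritrix was started with, and -k is used because the local installation is assumed to run with a self-signed certificate.

```shell
# Send an action (e.g. build, launch) to a Heritrix 3 job via its REST interface.
# Usage: heritrix_action <job-name> <action>
# HERITRIX_USER and HERITRIX_PASS are hypothetical variables; the admin:admin
# fallback is a placeholder, not necessarily what your installation uses.
heritrix_action() {
    job="$1"
    action="$2"
    # -k: accept the self-signed certificate, --anyauth: let curl negotiate the
    # authentication scheme, -d: send the action as POST form data.
    curl -s -k -u "${HERITRIX_USER:-admin}:${HERITRIX_PASS:-admin}" --anyauth --location \
        -d "action=$action" \
        "https://localhost:8443/engine/job/$job"
}
```

With this in place, heritrix_action myTestJob build followed by heritrix_action myTestJob launch would correspond to the two button clicks described above.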

Now the Umbra harvests will likely be running... but where do they actually dump the resulting files???

NOTE: Heritrix can be killed by doing a

...
