Installing Umbra

See https://github.com/internetarchive/umbra for the basic instructions. If both python2 and python3 are installed, you may have to specify explicitly which version of pip to use when installing umbra, e.g.

 

sudo pip3 install git+https://github.com/internetarchive/umbra.git

 

Installing RabbitMQ

 

sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo service rabbitmq-server restart

 

After this, rabbitmq can be managed at http://localhost:15672 (user/pass guest/guest).

Playing With Umbra

Once umbra and rabbitmq are installed, you can play with umbra without using the heritrix installation. However, for these commands to work it seems that you need to make a small change to the default rabbitmq setup. Specifically, you need to bind the queue "urls" to the routing key "urls" in the gui:

[Screenshot of the binding in the rabbitmq management gui]
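
The same binding can probably also be made from the command line. The following is only a sketch: it assumes that the rabbitmqadmin tool shipped with the management plugin (downloadable from http://localhost:15672/cli/rabbitmqadmin) is on the PATH, that the queue "urls" already exists, and that the exchange is called "umbra" (the name that appears in the umbra log output on this page, not something confirmed for the binding itself). The function name bind_urls_queue is made up for illustration.

```shell
# Bind a queue to an exchange with a routing key using rabbitmqadmin.
# Defaults (exchange "umbra", queue "urls", routing key "urls") are assumptions.
bind_urls_queue() {
    exchange="${1:-umbra}"      # assumed exchange name
    queue="${2:-urls}"          # queue named on this page
    routing_key="${3:-urls}"    # routing key named on this page
    rabbitmqadmin declare binding \
        source="$exchange" \
        destination="$queue" \
        destination_type=queue \
        routing_key="$routing_key"
}
```

Running bind_urls_queue with no arguments uses the defaults above; the result can be checked on the queue's page in the gui.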

After this you can use the verbose options to manually see what links umbra finds on different pages:

 

umbra -v &
queue-url -v  http://www.netarkivet.dk

 

gives output from umbra like

 

2015-01-19 13:57:55,348 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507'}, 'url': 'http://netarkivet.dk/'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/themes/netarkivet/style.css'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/contact-form-7/includes/css/styles.css?ver=3.9.3'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/gallery-to-slideshow//css/gallery-to-slideshow.css?ver=1.4'}

 

etc. Notice the extracted urls, which are what we hope to queue in heritrix.
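
To see just the extracted urls without the rest of each log line, the output can be filtered with standard tools. This is a rough sketch, not part of umbra: it assumes the umbra output has been captured in a file (umbra.log is a hypothetical name), and the helper extract_urls is made up for illustration. It pulls out the value of the 'url' key from each payload and leaves 'parentUrl' alone.

```shell
# Print the value of the 'url' key from each umbra log line read on stdin.
# 'parentUrl' and 'parentUrlMetadata' do not match because of the leading quote.
extract_urls() {
    grep -oE "'url'[: ]*'[^']+'" \
        | sed -E "s/^'url'[: ]*'//; s/'\$//"
}
```

For example, extract_urls < umbra.log | sort -u would list each distinct url once.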