Installing Umbra

See https://github.com/internetarchive/umbra for the basic instructions. If both Python 2 and Python 3 are installed, you may have to specify explicitly which version of pip to use when installing umbra, e.g.

 

sudo pip3 install git+https://github.com/internetarchive/umbra.git

 

Installing RabbitMQ

 

sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo service rabbitmq-server restart

 

After this, RabbitMQ can be managed at http://localhost:15672 (username/password: guest/guest).

Playing With Umbra

Once umbra and RabbitMQ are installed, you can play with umbra without using the Heritrix installation. For the commands below to work, however, it seems that you need to make a small change to the default RabbitMQ setup: bind the queue "urls" to the routing key "urls" in the management GUI.
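The same binding can probably also be created without the GUI, through the HTTP API of the RabbitMQ management plugin. The following is a minimal sketch, not a tested recipe: the function name build_bind_request is made up for illustration, the exchange name "umbra" is an assumption taken from the log output shown on this page, and the default virtual host "/" is assumed. The exchange and the queue must already exist on the broker.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_bind_request(queue="urls", routing_key="urls", exchange="umbra",
                       vhost="/", base_url="http://localhost:15672",
                       user="guest", password="guest"):
    """Build (but do not send) a management-API request that binds a queue.

    Assumptions: exchange "umbra" and vhost "/" are guesses for a default
    umbra/RabbitMQ setup; adjust them to match your broker.
    """
    # Every path segment is percent-encoded, so the vhost "/" becomes "%2F".
    segments = [urllib.parse.quote(s, safe="") for s in (vhost, exchange, queue)]
    url = "{}/api/bindings/{}/e/{}/q/{}".format(base_url, *segments)

    # The management API expects a JSON body with the routing key.
    body = json.dumps({"routing_key": routing_key, "arguments": {}}).encode("utf-8")

    # HTTP basic authentication with the management user.
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )
```

To actually create the binding, pass the result to urllib.request.urlopen() while the broker is running. Building the request separately from sending it makes it easy to inspect the URL and body first.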

After this you can use the verbose options to manually see what links umbra finds on different pages:

 

umbra -v &
queue-url -v http://www.netarkivet.dk

 

This gives output from umbra like:

 

2015-01-19 13:57:55,348 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507'}, 'url': 'http://netarkivet.dk/'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/themes/netarkivet/style.css'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/contact-form-7/includes/css/styles.css?ver=3.9.3'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/gallery-to-slideshow//css/gallery-to-slideshow.css?ver=1.4'}

 

and so on. Note the extracted URLs in the 'url' field of each payload; these are what we hope to queue in Heritrix.
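To get a quick overview of what umbra found, the URLs can be pulled out of the verbose log with a few lines of Python. This is only a rough sketch: the function name extract_urls is made up for illustration, and it assumes that every relevant log line contains "payload=" followed by a dict-like text with a 'url' entry, as in the sample output on this page.

```python
import re

# Matches the 'url' entry of a payload. 'parentUrl' is not matched because
# the pattern is case-sensitive and starts at the opening quote of the key.
# The colon is optional so that the pattern tolerates lost punctuation.
URL_ENTRY = re.compile(r"'url'\s*:?\s*'([^']+)'")


def extract_urls(log_lines):
    """Return the distinct URLs found in umbra debug lines, in first-seen order."""
    found = []
    for line in log_lines:
        # Only look at the text after "payload="; skip unrelated log lines.
        _, marker, payload = line.partition("payload=")
        if not marker:
            continue
        match = URL_ENTRY.search(payload)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found
```

The function accepts any iterable of strings, so it can be given an open log file or sys.stdin when umbra's output is piped into a script.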
