Getting Started With Umbra
Installing Umbra
See https://github.com/internetarchive/umbra for the basic instructions. Note that if you have both python2 and python3 installed, you may have to specify explicitly which version of pip to use to install umbra (for example by calling pip2 or pip3 rather than plain pip).
Installing RabbitMQ
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo service rabbitmq-server restart
After this, rabbitmq can be managed at http://localhost:15672 (user/pass guest/guest).
Playing With Umbra
Once umbra and rabbitmq are installed, you can play with umbra without using the heritrix installation. For the commands below to work, however, it seems that you need to make a small change to the default rabbitmq setup: bind the queue "urls" to the routing key "urls" in the management GUI.
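The same binding can probably be created without the GUI. The following is a minimal sketch using rabbitmqadmin, the command-line tool that comes with the management plugin. The exchange name umbra is an assumption taken from the debug output umbra prints (exchange=umbra), and the sketch also assumes that this exchange and the urls queue already exist and that the default guest/guest credentials are in use. By default it only prints the command so it can be inspected; set RUN=1 to execute it.

```shell
#!/bin/sh
# Sketch only: bind queue "urls" to routing key "urls" via rabbitmqadmin.
# Assumptions (not from the original instructions): the exchange is named
# "umbra", and rabbitmqadmin is on PATH and can reach the local broker.
EXCHANGE="${EXCHANGE:-umbra}"
QUEUE="${QUEUE:-urls}"
ROUTING_KEY="${ROUTING_KEY:-urls}"

# Build the rabbitmqadmin command line as a single string.
bind_cmd() {
    echo "rabbitmqadmin declare binding source=$EXCHANGE destination_type=queue destination=$QUEUE routing_key=$ROUTING_KEY"
}

if [ "${RUN:-0}" = "1" ]; then
    # Execute the command against the local broker.
    eval "$(bind_cmd)"
else
    # Dry run: only show what would be executed.
    bind_cmd
fi
```

If the exchange on your installation has a different name, override it on the command line, e.g. EXCHANGE=myexchange sh bind.sh (both names here are hypothetical).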
(This binding is created automatically when using heritrix.) After this you can use the verbose options to see which links umbra finds on different pages:
umbra -v &
queue-url -v http://www.netarkivet.dk
Umbra/chrome has an annoying habit of stealing keyboard focus. You can avoid this by starting umbra on a separate X server as follows:
sudo X :1
<ctrl> <alt> <f7> (to return to the original server)
export DISPLAY=:1; umbra -v
giving output from umbra like
2015-01-19 13:57:55,348 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507'}, 'url': 'http://netarkivet.dk/'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/themes/netarkivet/style.css'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/contact-form-7/includes/css/styles.css?ver=3.9.3'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/gallery-to-slideshow//css/gallery-to-slideshow.css?ver=1.4'}
etc. Notice the extracted URLs, which are what we hope to queue in heritrix.
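To pick the extracted URLs out of this rather noisy debug output, something like the following sketch may help. It is not part of umbra; extract_urls is a hypothetical helper that assumes the log format shown above, where each payload carries a 'url' key with a single-quoted value, and it tolerates optional spaces around the colon.

```shell
#!/bin/sh
# Hypothetical helper, not part of umbra: read umbra debug output on stdin
# and print the value of each 'url' key, one per line.
# The quoted key 'url' does not match 'parentUrl', so parent URLs are skipped.
extract_urls() {
    grep -o "'url' *: *'[^']*'" | sed "s/^'url' *: *'//; s/'\$//"
}
```

For example, with the debug output saved to a file (umbra.log is an assumed name), run extract_urls < umbra.log to list only the URLs umbra would send on.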