Installing Umbra
See https://github.com/internetarchive/umbra for the basic instructions. Note that if you have both Python 2 and Python 3 installed, you may have to specify explicitly which version of pip to use to install umbra, e.g.
Installing RabbitMQ
sudo apt-get install rabbitmq-server
sudo rabbitmq-plugins enable rabbitmq_management
sudo service rabbitmq-server restart
After which RabbitMQ can be managed at http://localhost:15672 (user/pass guest/guest).
Playing With Umbra
Once umbra and RabbitMQ are installed, you can play with umbra without using the Heritrix installation. For these commands to work, however, it seems that you need to make a small change to the default RabbitMQ setup: bind the queue "urls" to the routing key "urls" in the management GUI.
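The same binding can probably also be created without the GUI, through the HTTP API that the rabbitmq_management plugin exposes on port 15672. The following is only a sketch under assumptions: the exchange name "umbra" is taken from the log output further down, the vhost is assumed to be the default "/", and the function name build_binding_request is made up for illustration. Both the exchange and the queue must already exist for the request to succeed.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_binding_request(exchange="umbra", queue="urls", routing_key="urls",
                          vhost="/", base_url="http://localhost:15672",
                          user="guest", password="guest"):
    """Build (but do not send) a management-API request binding queue to exchange.

    The defaults are assumptions: exchange "umbra" comes from umbra's log
    output, and guest/guest is the stock RabbitMQ login.
    """
    # POST /api/bindings/{vhost}/e/{exchange}/q/{queue}; "/" must be sent as %2F.
    path = "/api/bindings/{}/e/{}/q/{}".format(
        urllib.parse.quote(vhost, safe=""),
        urllib.parse.quote(exchange, safe=""),
        urllib.parse.quote(queue, safe=""),
    )
    body = json.dumps({"routing_key": routing_key, "arguments": {}}).encode("utf-8")
    credentials = "{}:{}".format(user, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        base_url + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )
```

To actually create the binding, pass the result to urllib.request.urlopen(build_binding_request()) while RabbitMQ is running. Building the request separately from sending it makes it easy to inspect the URL and body first.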
After this you can use the verbose options to manually see what links umbra finds on different pages:
gives output from umbra like
2015-01-19 13:57:55,348 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507'}, 'url': 'http://netarkivet.dk/'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/themes/netarkivet/style.css'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/contact-form-7/includes/css/styles.css?ver=3.9.3'}
2015-01-19 13:57:55,843 16121 DEBUG WebsockThread9200-XC3zTY umbra.controller.AmqpBrowserController.on_request(controller.py:180) sending to amqp exchange=umbra routing_key=load_url.0 payload={'parentUrl': 'http://www.netarkivet.dk', 'method': 'GET', 'parentUrlMetadata': {}, 'headers': {'Accept': 'text/css,*/*;q=0.1', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36', 'X-DevTools-Emulate-Network-Conditions-Client-Id': '0B80F796-5795-0FDF-AB44-149ADD52C507', 'Referer': 'http://netarkivet.dk/'}, 'url': 'http://netarkivet.dk/wp-content/plugins/gallery-to-slideshow//css/gallery-to-slideshow.css?ver=1.4'}
etc. Notice the extracted URLs, which are what we hope to queue in Heritrix.
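To pull just the extracted URLs out of debug output like the above, a small script may help. This is a rough sketch rather than anything that ships with umbra: it assumes each payload is printed as a Python dict literal after "payload=", as in the log shown here, and the names extract_payloads and extracted_urls are made up for illustration.

```python
import ast


def extract_payloads(text):
    """Yield each dict that follows 'payload=' in umbra debug output.

    Braces are matched by simple counting, so this assumes no string
    value inside the payload contains '{' or '}'.
    """
    marker = "payload="
    pos = text.find(marker)
    while pos != -1:
        start = pos + len(marker)
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    # Balanced: parse the literal safely, without eval().
                    yield ast.literal_eval(text[start:i + 1])
                    break
        pos = text.find(marker, start)


def extracted_urls(text):
    """Return the 'url' field of every payload, in order of appearance."""
    return [p["url"] for p in extract_payloads(text) if "url" in p]
```

For example, extracted_urls(open("umbra.log").read()) would return the list of URLs from a saved log, where umbra.log is a hypothetical file holding the captured output. Because it scans for balanced braces instead of splitting on newlines, it does not matter whether each record sits on its own line.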