TEST 4

QA-1 (Secure viewerproxy, secure adm-machine, excludes, missing links, password-protected material, caching, index-generation)

Templates used by test: default_order_xml

Procedure


1. Prepare Installation

On devel@kb-prod-udv-001.kb.dk:

 

## Replace version as needed
export VERSION=5.4-RC1
export H3ZIP=/home/devel/nas_versions/bundler/NetarchiveSuite-heritrix3-bundler-$VERSION.zip
export TESTX=TEST4
export PORT=8077
export MAILRECEIVERS=svc@kb.dk
all_test.sh

Check that the GUI is available and that the System Status does not show any start-up problems.

2. Set up Apache Proxies for Adm and Acs

Login as root on kb-test-adm-001.kb.dk:

ssh root@kb-test-adm-001.kb.dk

Ask csr@statsbiblioteket.dk or tlr@kb.dk for the password.

Create a backup of proxy.conf and then edit it to reflect your assigned test PORT.

cp /etc/httpd/conf/proxy.conf ./proxy.conf.bak
nano /etc/httpd/conf/proxy.conf 

There are two VirtualHosts which need to be edited: one for adm and one for acs. The relevant lines look like

# This virtualhost
# Used in TEST4 as part of the releasetest when using PORT=8077
# normally assigned to developer svc
<VirtualHost *:8081>
        ServerAdmin helpdesk@kb.dk
        ErrorLog logs/proxy8081-error_log
        CustomLog logs/proxy8081-access_log combined
<IfModule mod_proxy.c>
        ProxyPass / http://kb-test-adm-001:8077/
        ProxyPassReverse / http://kb-test-adm-001:8077/

and

#############################################
##### Added proxy used in releasetest TEST4
#############################################
<VirtualHost *:8090>
        ServerAdmin helpdesk@kb.dk
        ErrorLog logs/proxy8090-error_log
        CustomLog logs/proxy8090-access_log combined
<IfModule mod_proxy.c>
        ProxyRequests On
        ProxyRemote * http://kb-test-acs-001.kb.dk:8077
        <Proxy *>
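For reference, a complete version of the second VirtualHost might look as follows. This is a sketch: the closing tags and the access directives inside the Proxy block are assumptions based on a standard Apache 2.2 mod_proxy forward-proxy setup, not copied from the live configuration.

```apache
#############################################
##### Added proxy used in releasetest TEST4
#############################################
<VirtualHost *:8090>
        ServerAdmin helpdesk@kb.dk
        ErrorLog logs/proxy8090-error_log
        CustomLog logs/proxy8090-access_log combined
        <IfModule mod_proxy.c>
                ProxyRequests On
                ProxyRemote * http://kb-test-acs-001.kb.dk:8077
                <Proxy *>
                        # Access control is assumed here; adjust to site policy
                        Order deny,allow
                        Allow from all
                </Proxy>
        </IfModule>
</VirtualHost>
```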

Now restart the apache server:

[root@kb-test-adm-001 ~]# /etc/rc.d/init.d/httpd restart

8081 is now the port number for the admin-gui and 8090 is the port number for the viewerproxy.

3. Set Browser Up To Use ADM Proxy

There are several ways to do this; the following is recommended. Start the Firefox profile manager with

firefox -P --no-remote

and create a new profile. Call the new profile TEST4 so you can remember what it's for in the future.

Under Edit -> Preferences -> Advanced -> Network -> Settings, set a manual HTTP proxy configuration to kb-test-adm-001.kb.dk port 8090, with no proxy for localhost, 127.0.0.1, kb-prod-udv-001.kb.dk, kb-test-adm-001.kb.dk.

Browse to http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/ (login test/test123). You should see the admin GUI. You can set it as the start page for the profile you just created.

4. Set Up Harvesting of Netarkivet.dk

  • Edit domain 'netarkivet.dk' to use max-hops=1 in the defaultconfig, still using default_orderxml as the template
  • Add ^http://netarkivet.dk/in-english/$ to the crawlertraps for netarkivet.dk.
  • Add http://www.netarkivet.dk/website/testsite to the seedlist for netarkivet.dk.
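Because the crawlertrap is an anchored regular expression, only the exact in-english front page is excluded, not pages beneath it. A quick local sanity check of the pattern (a sketch using grep -E; not part of the official procedure):

```shell
# The anchors ^ and $ mean only the exact URL is trapped, not sub-pages.
TRAP='^http://netarkivet.dk/in-english/$'

echo 'http://netarkivet.dk/in-english/'      | grep -Eq "$TRAP" && echo 'trapped'
echo 'http://netarkivet.dk/in-english/page1' | grep -Eq "$TRAP" || echo 'not trapped'
```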

5. Harvest netarkivet.dk

Create a selective harvest of netarkivet.dk using the definitions defined in the previous step. Wait for it to complete.

6. Browse in the Job and Start Collecting Urls

  • In the GUI, select the completed job
  • Click on "Select this job for QA with viewerproxy" and wait for indexing to complete
  • Click on "Start collecting URLs"

(If prompted for a password, enter test/test123.)

Now browse in the http://netarkivet.dk website, being sure to go sufficiently deep that you collect URLs for some missing pages. Also be sure to click on the link "English".

7. Stop Collecting URLs

Go back to the Viewerproxy Status webpage and click on "Stop collecting URLs" then "Show collected URLs". Your list should look something like

http://netarkivet.dk/?page_id=123
http://netarkivet.dk/in-english/
http://netarkivet.dk/wp-content/uploads/Retningslinjer-for-adgang-til-Netarkivet.pdf
http://netarkivet.dk/wp-content/uploads/ansoegererklaering.pdf
http://www.google-analytics.com/__utm.gif?utmwv=5.4.4&utms=1&utmn=1182267154&utmhn=netarkivet.dk&utmcs=UTF-8&utmsr=1920x1200&utmvp=1421x783&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=11.2%20r202&utmdt=Netarkivet&utmhid=1151260713&utmr=-&utmp=%2F&utmht=1375792372429&utmac=UA-16233002-5&utmcc=__utma%3D71594380.2107439604.1375792372.1375792372.1375792372.1%3B%2B__utmz%3D71594380.1375792372.1.1.utmcsr%3D(direct)%7Cutmccn%3D(direct)%7Cutmcmd%3D(none)%3B&utmu=q~

Note that it should include the "in-english" page and several others from netarkivet.dk. The google-analytics links can be ignored.

8. Add the Collected URLs as Seeds and Re-harvest

  • Edit the default seedlist for netarkivet.dk to include the gathered URLs.
  • Define and start a new harvest, or just edit the previous harvest definition to have a next-run time of now.
  • When it is finished, browse in the new harvest as before. The added URLs should be browsable, with the exception of the "in-english" URL which is still blocked by the crawlertrap.

9. Test Authentication

  • If you saved the password in Firefox, go to Preferences -> Security -> Saved Passwords and click on "Remove All".
  • Close the browser
  • Restart the browser and browse to the GUI: http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/
  • Enter an incorrect password and confirm that it is not accepted

10. Test Logging of Failed Login

On devel@kb-prod-udv-001:

[devel@kb-prod-udv-001 ~]$ ssh root@kb-test-adm-001.kb.dk grep Mismatch /etc/httpd/logs/proxy8081-error_log
root@kb-test-adm-001.kb.dk's password: 
[Tue Aug 06 15:23:23 2013] [error] [client 130.225.26.33] user tlr: authentication failure for "/HarvestDefinition/": Password Mismatch

Confirm that you can see the username for the failed login attempt.

11. Set Different Domains to Use Different Templates

In the Admin GUI, set the following domains to use different order templates by default (i.e. in their defaultconfig configuration) as follows:

kaarefc.dk: default_orderxml, max-hops=3
trinekc.dk: default_orderxml, max-hops=4
sulnudu.dk: default_orderxml, max-hops=1

12. Define a Multi-Domain Selective Harvest

Define a selective harvest for the domains trinekc.dk, kaarefc.dk, raeder.dk, sulnudu.dk, and netarkivet.dk. Activate it and wait for it to complete.

The harvest should generate 4 jobs - for example with job numbers 3, 4, 5, 6. The first three domains are harvested separately, while sulnudu.dk and netarkivet.dk are harvested together, as they have the same max-hops (1).
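The grouping behaviour can be illustrated locally: domains whose configurations are identical (same order template and same max-hops) end up in the same job. This is only a sketch of the idea in awk, not NetarchiveSuite's actual scheduler code; the "default" value for raeder.dk is a placeholder for its unmodified settings.

```shell
# Group domains by (template, max-hops); each resulting group corresponds
# to one harvest job. Settings taken from steps 4 and 11.
printf '%s\n' \
  'kaarefc.dk default_orderxml 3' \
  'trinekc.dk default_orderxml 4' \
  'raeder.dk default_orderxml default' \
  'sulnudu.dk default_orderxml 1' \
  'netarkivet.dk default_orderxml 1' |
awk '{jobs[$2 " " $3] = jobs[$2 " " $3] " " $1}
     END {for (j in jobs) print j ":" jobs[j]}' | sort
```

The output has four lines (one per job), with sulnudu.dk and netarkivet.dk sharing a line.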

13. Create an Index for these Jobs

Browse to the harvest history for the multi-domain selective harvest and click on "Select these jobs for QA with viewerproxy". Wait for the index to finish generating and redirect you to the "Viewerproxy Status" page.

14. Mess with a Crawl-log File to Create an Error

Log in to devel@kb-test-acs-001.kb.dk. 

[devel@kb-test-acs-001 ~]$ cd TEST4/cache
[devel@kb-test-acs-001 cache]$ rm -rf ./fullcrawllogindex/*  ./FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ find .
.
./dedupcrawllogindex
./dedupcrawllogindex/1-cache
./dedupcrawllogindex/1-cache/segments.gen.gz
./dedupcrawllogindex/1-cache/_0.cfs.gz
./dedupcrawllogindex/1-cache/_0.si.gz
./dedupcrawllogindex/1-cache/_0.cfe.gz
./dedupcrawllogindex/1-cache/segments_1.gz
./dedupcrawllogindex/empty-cache
./dedupcrawllogindex/empty-cache/segments.gen.gz
./dedupcrawllogindex/empty-cache/segments_1.gz
./dedupcrawllogindex/1-cache.working
./dedupcrawllogindex/empty-cache.working
./fullcrawllogindex
./cdxindex
./cdxindex/empty-cache
./cdxindex/empty-cache.working
./FULL_CRAWL_LOG
./crawllog
./crawllog/crawllog-6-cache
./crawllog/crawllog-4-cache
./crawllog/crawllog-1-cache.working
./crawllog/crawllog-3-cache (etc.)

Now choose one of the jobs from the multi-harvest run - e.g. job number 5. Edit ./crawllog/crawllog-5-cache by adding the text duplicate:"foo (note the missing closing quotation mark) to one of the crawl-log lines.
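The edit can be made with any editor; the following is a scripted sketch of the same change, demonstrated on a throwaway sample file since the real crawllog-5-cache only exists on kb-test-acs-001 (the sample log line is invented):

```shell
# Create a sample crawl-log line, then append the unterminated
# 'duplicate:"foo' annotation to it - the same edit you would make
# to ./crawllog/crawllog-5-cache on kb-test-acs-001.
printf '2013-08-06T13:00:00.000Z 200 4238 http://netarkivet.dk/ - - text/html\n' \
  > /tmp/crawllog-demo
sed -i '1s/$/ duplicate:"foo/' /tmp/crawllog-demo
grep -c 'duplicate:"foo' /tmp/crawllog-demo   # prints 1
```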

15. Regenerate the Index

Now check that logback_IndexServerApplication.xml sets the log level for netarkivet.dk to DEBUG. Restart the IndexServerApplication if the log level needed to be changed.

Now browse back to the Harvest Status for the multi-job harvest and again click on "Select these jobs for QA with viewerproxy". Wait for the index to be generated. On kb-test-acs-001 execute

[devel@kb-test-acs-001 ~]$ grep Skipping TEST4/log/IndexServerApplication.log
13:45:04.093 DEBUG d.n.h.i.CDXOriginCrawlLogIterator - Skipping over bad crawl-log line '2015-06-02T11:10:17.396Z   200       4238 http://twiki.org/p/pub/TWiki05x00/TopMenuSkin/menu-reverse-bg.png LEREXE http://twiki.org/ image/png #004 20150602111017059+336 sha1:LOTTTOPPPPZ5KHVXZ6ATPONHIUI5HVIV - duplicate:"foo, content-size:4489'
[devel@kb-test-acs-001 ~]$ 

and confirm that the line you edited is shown as having been skipped over.

16. Check Index Caching

On kb-test-acs-001, delete a crawl log for a single harvest job:

[devel@kb-test-acs-001 ~]$ rm TEST4/cache/crawllog/crawllog-5-cache

Now regenerate the index for the multi-domain harvest in the GUI. The index isn't really regenerated, as the correct index already exists. Confirm that the file you deleted is not recreated. (It is not needed because there is a cached index for the full crawl log of the entire harvest.)

17. Check Behaviour When Metadata File is Missing

From devel@kb-prod-udv-001.kb.dk go into ba-devel@KB-test-bar-01.bitarkiv.kb.dk (basedir in c:\bitarkiv\TEST4) or ba-devel@KB-TEST-BAR-016.bitarkiv.kb.dk (basedirs in e:\bitarchive_1\TEST4, f:\bitarchive_2\TEST4, g:\bitarchive_3\TEST4) and find one of the metadata files generated by the multi-job harvest. Move it away.

C:\Users\ba-devel.BITARKIV>move d:\bitarkiv_1\TEST4\filedir\4-metadata-1.warc .                                                                          

If in doubt, check the file /home/devel/prepared_software/TEST4/settings/deploy_config_test.xml for locations of bitarchive folders on each application machine.

18. Remove the Previously Generated Crawl Index

[devel@kb-test-acs-001 ~]$ cd TEST4/cache/
[devel@kb-test-acs-001 cache]$ rm -rf cdxdata/*
[devel@kb-test-acs-001 cache]$ rm -rf crawllog/*
[devel@kb-test-acs-001 cache]$ rm -rf FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ rm -rf fullcrawllogindex/*

Now regenerate the index. The name of the generated index should still include the job number "4". Specifically it is of the form

./fullcrawllogindex/3-4-5-6.cache

consisting of the job numbers of the four jobs in the index. If there are more than 4 jobs in the index, it will be named <job1>-<job2>-<job3>-<job4>-<checksum>.cache
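The naming scheme for up to four jobs can be sketched as joining the sorted job numbers with '-' (illustration only; the checksum suffix used for larger job sets is not reproduced here):

```shell
# Build the index file name from the sorted job numbers, as in
# ./fullcrawllogindex/3-4-5-6.cache
jobs='3 4 5 6'
name="$(echo "$jobs" | tr ' ' '-').cache"
echo "$name"   # prints 3-4-5-6.cache
```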

For the missing job number (i.e. 4 in this case) confirm that

  • There is no cdxdata-4-cache in the directory cdxdata
  • There is no crawllog-4-cache in the crawllog directory
  • There is a file ./crawllog/crawllog-4-cache.working but it is empty

19. Shutdown the Test and Clean Up

On devel@kb-prod-udv-001

cleanup_all_test.sh