TEST 4
Templates used by test: default_orderxml
Procedure
1. Prepare Installation
On devel@kb-prod-udv-001.kb.dk:
## Replace version as needed
export VERSION=5.4-RC1
export H3ZIP=/home/devel/nas_versions/bundler/NetarchiveSuite-heritrix3-bundler-$VERSION.zip
export TESTX=TEST4
export PORT=8077
export MAILRECEIVERS=svc@kb.dk
all_test.sh

Check that the GUI is available and that the System Status does not show any start-up problems.
2. Set up Apache Proxies for Adm and Acs
Log in as root on kb-test-adm-001.kb.dk:
ssh root@kb-test-adm-001.kb.dk
Ask csr@statsbiblioteket.dk or tlr@kb.dk for the password.
Create a backup of proxy.conf and then edit it to reflect your assigned test PORT.
cp /etc/httpd/conf/proxy.conf ./proxy.conf.bak
nano /etc/httpd/conf/proxy.conf
There are two VirtualHosts which need to be edited: one for adm and one for acs. The relevant lines look like
# This virtualhost
# Used in TEST4 as part of the releasetest when using PORT=8077
# normally assigned to developer svc
<VirtualHost *:8081>
    ServerAdmin helpdesk@kb.dk
    ErrorLog logs/proxy8081-error_log
    CustomLog logs/proxy8081-access_log combined
    <IfModule mod_proxy.c>
        ProxyPass / http://kb-test-adm-001:8077/
        ProxyPassReverse / http://kb-test-adm-001:8077/
and
#############################################
##### Added proxy used in releasetest TEST4
#############################################
<VirtualHost *:8090>
    ServerAdmin helpdesk@kb.dk
    ErrorLog logs/proxy8090-error_log
    CustomLog logs/proxy8090-access_log combined
    <IfModule mod_proxy.c>
        ProxyRequests On
        ProxyRemote * http://kb-test-acs-001.kb.dk:8077
        <Proxy *>
Now restart the Apache server:
[root@kb-test-adm-001 ~]# /etc/rc.d/init.d/httpd restart
8081 is now the port number for the admin GUI and 8090 is the port number for the viewerproxy.
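As an optional sanity check (assuming curl is available on the machine you are working from), confirm that the adm proxy answers:

curl -I http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/   # port 8081 = adm proxy set up above

Any HTTP status line in the response (typically 401 until you authenticate with test/test123) shows that Apache is forwarding requests.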
3. Set Browser Up To Use ADM Proxy
There are several ways to do this, but the following is the best. Start the Firefox profile manager with
firefox -P --no-remote
and create a new profile. Call the new profile TEST4 so you can remember what it's for in the future.
Under Edit -> Preferences -> Advanced -> Network -> Settings, set a manual HTTP proxy configuration to kb-test-adm-001.kb.dk port 8090, with no proxy for localhost, 127.0.0.1, kb-prod-udv-001.kb.dk, kb-test-adm-001.kb.dk.
Browse to http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/ (login test/test123). You should see the admin GUI. You can set it as your start page for the profile you just created.
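Thereafter the browser can be started directly with the new profile:

firefox -P TEST4 --no-remote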
4. Set Up Harvesting of Netarkivet.dk
- Edit domain 'netarkivet.dk' to use maxhops=1 in the defaultconfig, still using default_orderxml as template
- Add ^http://netarkivet.dk/in-english/$ to the crawlertraps for netarkivet.dk (a quick regex check is sketched after this list).
- Add http://www.netarkivet.dk/website/testsite to the seedlist for netarkivet.dk.
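As an optional sanity check, you can confirm that the crawlertrap regex matches the URL it is meant to block; grep -E uses a comparable extended-regex syntax, so this is only an approximation of the crawler's matching:

echo 'http://netarkivet.dk/in-english/' | grep -E '^http://netarkivet.dk/in-english/$'   # prints the URL if the trap matches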
5. Harvest netarkivet.dk
Create a selective harvest of netarkivet.dk using the configuration defined in the previous step. Wait for it to complete.
6. Browse in the Job and Start Collecting Urls
- In the GUI, select the completed job
- Click on "Select this job for QA with viewerproxy" and wait for indexing to complete
- Click on "Start collecting URLs"
(If prompted for a password, enter test/test123.)
Now browse the http://netarkivet.dk website, going sufficiently deep that you collect URLs for some missing pages. Also be sure to click on the "English" link.
7. Stop Collecting URLs
Go back to the Viewerproxy Status webpage and click on "Stop collecting URLs" then "Show collected URLs". Your list should look something like
http://netarkivet.dk/?page_id=123
http://netarkivet.dk/in-english/
http://netarkivet.dk/wp-content/uploads/Retningslinjer-for-adgang-til-Netarkivet.pdf
http://netarkivet.dk/wp-content/uploads/ansoegererklaering.pdf
http://www.google-analytics.com/__utm.gif?utmwv=5.4.4&utms=1&utmn=1182267154&utmhn=netarkivet.dk&utmcs=UTF-8&utmsr=1920x1200&utmvp=1421x783&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=11.2%20r202&utmdt=Netarkivet&utmhid=1151260713&utmr=-&utmp=%2F&utmht=1375792372429&utmac=UA-16233002-5&utmcc=__utma%3D71594380.2107439604.1375792372.1375792372.1375792372.1%3B%2B__utmz%3D71594380.1375792372.1.1.utmcsr%3D(direct)%7Cutmccn%3D(direct)%7Cutmcmd%3D(none)%3B&utmu=q~
Note that it should include the "in-english" page and several others from netarkivet.dk. The google-analytics links can be ignored.
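If you save the collected list to a file (urls.txt below is just an example name), the analytics noise can be filtered out before the next step:

grep -v 'google-analytics' urls.txt   # urls.txt is an example filename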
8. Add the Collected URLs as Seeds and Re-harvest
- Edit the default seedlist for netarkivet.dk to include the gathered URLs.
- Define and start a new harvest, or just edit the previous harvest definition to have a next-run time of now.
- When it is finished, browse in the new harvest as before. The added URLs should be browsable, with the exception of the "in-english" URL which is still blocked by the crawlertrap.
9. Test Authentication
- If you saved the password in Firefox, go to Preferences -> Security -> Saved Passwords and click on "Remove All".
- Close the browser
- Restart the browser and browse to the GUI: http://kb-test-adm-001.kb.dk:8081/HarvestDefinition/
- Enter an incorrect password and confirm that it is not accepted
10. Test Logging of Failed Login
On devel@kb-prod-udv-001:
[devel@kb-prod-udv-001 ~]$ ssh root@kb-test-adm-001.kb.dk grep Mismatch /etc/httpd/logs/proxy8081-error_log
root@kb-test-adm-001.kb.dk's password:
[Tue Aug 06 15:23:23 2013] [error] [client 130.225.26.33] user tlr: authentication failure for "/HarvestDefinition/": Password Mismatch
Confirm that you can see the username for the failed login attempt.
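If the error log is long, a one-liner like this (standard grep and sed only) lists just the usernames of the failed attempts:

grep 'Password Mismatch' /etc/httpd/logs/proxy8081-error_log | sed 's/.*user \([^:]*\):.*/\1/'   # prints e.g. "tlr"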
11. Set Different Domains to Use Different Templates
In the Admin GUI, configure the following domains so that their defaultconfig configuration uses these order templates and max-hops values:
kaarefc.dk | default_orderxml, max-hops=3
trinekc.dk | default_orderxml, max-hops=4
sulnudu.dk | default_orderxml, max-hops=1
12. Define a Multi-Domain Selective Harvest
Define a selective harvest for the domains trinekc.dk, kaarefc.dk, raeder.dk, sulnudu.dk, and netarkivet.dk. Activate it and wait for it to complete.
The harvest should generate 4 jobs - for example with job numbers 3, 4, 5, 6. The first three domains are harvested separately, while sulnudu.dk and netarkivet.dk are harvested together, as they have the same max-hops (1).
13. Create an Index for these Jobs
Browse to the harvest history for the multi-domain selective harvest and click on "Select these jobs for QA with viewerproxy". Wait for the index to finish generating and redirect you to the "Viewerproxy Status" page.
14. Mess with a Crawl-log File to Create an Error
Log in to devel@kb-test-acs-001.kb.dk.
[devel@kb-test-acs-001 ~]$ cd TEST4/cache
[devel@kb-test-acs-001 cache]$ rm -rf ./fullcrawllogindex/* ./FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ find .
.
./dedupcrawllogindex
./dedupcrawllogindex/1-cache
./dedupcrawllogindex/1-cache/segments.gen.gz
./dedupcrawllogindex/1-cache/_0.cfs.gz
./dedupcrawllogindex/1-cache/_0.si.gz
./dedupcrawllogindex/1-cache/_0.cfe.gz
./dedupcrawllogindex/1-cache/segments_1.gz
./dedupcrawllogindex/empty-cache
./dedupcrawllogindex/empty-cache/segments.gen.gz
./dedupcrawllogindex/empty-cache/segments_1.gz
./dedupcrawllogindex/1-cache.working
./dedupcrawllogindex/empty-cache.working
./fullcrawllogindex
./cdxindex
./cdxindex/empty-cache
./cdxindex/empty-cache.working
./FULL_CRAWL_LOG
./crawllog
./crawllog/crawllog-6-cache
./crawllog/crawllog-4-cache
./crawllog/crawllog-1-cache.working
./crawllog/crawllog-3-cache
(etc.)
Now choose one of the jobs from the multi-harvest run - e.g. job number 5. Edit ./crawllog/crawllog-5-cache by adding the text duplicate:"foo (with no closing quotation mark) to one of the crawl-log lines.
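One way to make the edit non-interactively is a sed one-liner; this sketch appends the unterminated annotation to the end of the first crawl-log line, which should be enough to provoke the parse error (the exact placement is an assumption):

sed -i '1s/$/ duplicate:"foo/' ./crawllog/crawllog-5-cache   # appends unterminated duplicate:"foo to line 1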
15. Regenerate the Index
Now check that logback_IndexServerApplication.xml sets netarkivet.dk to log at DEBUG level. Restart the IndexServerApplication if the log level needed to be changed.
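A quick way to inspect the configured loggers (the location of the logback file under the TEST4 installation is an assumption; adjust to your deployment layout):

grep -A1 '<logger' TEST4/conf/logback_IndexServerApplication.xml   # path is an assumption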
Now browse back to the Harvest Status for the multi-job harvest and again click on "Select these jobs for QA with viewerproxy". Wait for the index to be generated. On kb-test-acs-001 execute
[devel@kb-test-acs-001 ~]$ grep Skipping TEST4/log/IndexServerApplication.log
13:45:04.093 DEBUG d.n.h.i.CDXOriginCrawlLogIterator - Skipping over bad crawl-log line '2015-06-02T11:10:17.396Z 200 4238 http://twiki.org/p/pub/TWiki05x00/TopMenuSkin/menu-reverse-bg.png LEREXE http://twiki.org/ image/png #004 20150602111017059+336 sha1:LOTTTOPPPPZ5KHVXZ6ATPONHIUI5HVIV - duplicate:"foo, content-size:4489'
[devel@kb-test-acs-001 ~]$
and confirm that the line you edited is shown as having been skipped over.
16. Check Index Caching
On kb-test-acs-001, delete a crawl log for a single harvest job:
[devel@kb-test-acs-001 ~]$ rm TEST4/cache/crawllog/crawllog-5-cache
Now regenerate the index for the multi-domain harvest in the GUI. The index isn't really regenerated, as the correct index already exists. Confirm that the file you deleted is not recreated. (It is not needed because there is a cached index for the full crawl log of the entire harvest.)
17. Check Behaviour When Metadata File is Missing
From devel@kb-prod-udv-001.kb.dk, log in to ba-devel@KB-test-bar-01.bitarkiv.kb.dk (basedir in c:\bitarkiv\TEST4) or ba-devel@KB-TEST-BAR-016.bitarkiv.kb.dk (basedirs in e:\bitarchive_1\TEST4, f:\bitarchive_2\TEST4, g:\bitarchive_3\TEST4) and find one of the metadata files generated by the multi-job harvest. Move it away.
C:\Users\ba-devel.BITARKIV>move d:\bitarkiv_1\TEST4\filedir\4-metadata-1.warc .
If in doubt, check the file /home/devel/prepared_software/TEST4/settings/deploy_config_test.xml for locations of bitarchive folders on each application machine.
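A quick way to pull the relevant locations out of that file (the element name containing "fileDir" is an assumption about the deploy configuration schema):

grep -i 'filedir' /home/devel/prepared_software/TEST4/settings/deploy_config_test.xml   # element name is an assumption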
18. Remove the Previously Generated Crawl Index
[devel@kb-test-acs-001 ~]$ cd TEST4/cache/
[devel@kb-test-acs-001 cache]$ rm -rf cdxdata/*
[devel@kb-test-acs-001 cache]$ rm -rf crawllog/*
[devel@kb-test-acs-001 cache]$ rm -rf FULL_CRAWL_LOG/*
[devel@kb-test-acs-001 cache]$ rm -rf fullcrawllogindex/*
Now regenerate the index. The name of the generated index should still include the job number "4". Specifically, it is of the form

./fullcrawllogindex/3-4-5-6.cache

consisting of the job numbers of the four jobs in the index. If there are more than four jobs in the index, it will instead be named <job1>-<job2>-<job3>-<job4>-<checksum>.cache.
For the missing job number (i.e. 4 in this case), confirm the following (a scripted version of these checks is sketched after the list):
- There is no cdxdata-4-cache in the directory cdxdata
- There is no crawllog-4-cache in the crawllog directory
- There is a file ./crawllog/crawllog-4-cache.working but it is empty
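Using the cache paths from the steps above:

ls TEST4/cache/cdxdata/ | grep 4-cache     # expect no output
ls TEST4/cache/crawllog/ | grep 4-cache    # expect only crawllog-4-cache.working
[ -s TEST4/cache/crawllog/crawllog-4-cache.working ] || echo "empty, as expected"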
19. Shutdown the Test and Clean Up
On devel@kb-prod-udv-001
cleanup_all_test.sh