TEST12
Goals
- Test search and retrieval of harvested material through wayback.
Prerequisites
1) All netarchivesuite apps on devel@kb-test-way-001.kb.dk now use the same derby server, listening on port 50002. If this server is down, start it with the command
cd derbyDB; bash start_derby.sh
2) This test is to be run with OpenWayback. IA wayback is no longer supported.
Procedure
Clean TEST12 derby database
On devel@kb-test-way-001.kb.dk
cd derbyDB
./stop_derby.sh
rm -r wayback_indexertest12_db
./start_derby.sh
Note: some instances on devel@kb-test-way-001.kb.dk may need to be restarted after this operation (specifically, SystemTest and StressTest instances).
Prepare Installation
Run a standard devel setup (see "Setup DK test environment").
Upload a Small Bitarchive
For a JMS repository:
scp -r ${HOME}/bitarchive_testdata kb-test-adm-001:
ssh kb-test-adm-001 chmod 755 bitarchive_testdata/upload_jms.sh
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX arcfiles
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX warcfiles
ssh kb-test-adm-001 bitarchive_testdata/upload.sh $TESTX warcgzfiles
For a bitmag installation:
scp -r ${HOME}/bitarchive_testdata kb-test-adm-001:
ssh kb-test-adm-001 bitarchive_testdata/upload_bitmag.sh Test6 new_corpus
Build Netarkivets Fork of OpenWayback
Clone the repository, then build it and install it into your local maven repository
git clone https://github.com/netarchivesuite/openwayback-netarchivesuite
cd openwayback-netarchivesuite
mvn -DskipTests clean install
Alternatively use "mvn -DskipTests clean deploy" to deploy a new snapshot to nexus.
Build Netarkivets OpenWayback Overlay
Clone the repository
git clone https://github.com/netarchivesuite/netarkivet-openwayback-overlay.git
Edit pom.xml to refer to the latest NetarchiveSuite snapshot version and to the same openwayback version installed in the previous step (currently 2.4.0-NAS-SNAPSHOT), and then build the package
cd netarkivet-openwayback-overlay
mvn clean package
This builds the warfile target/netarkivet-openwayback.war which should be renamed to "wayback.war" for the next step.
Construct A Clean Wayback Environment
Check out the deploy template from ssh://git@sbprojects.statsbiblioteket.dk:7999/nark/openwayback-config.git (possibly by running git clone ssh://git@sbprojects.statsbiblioteket.dk:7999/nark/openwayback-config.git directly on kb-test-way-001.kb.dk). Otherwise, copy the entire tree to kb-test-way-001.
Follow the instructions in the Readme.md file in the wayback_deploytemplate directory. Note the following:
- The name of the directory should normally be wayback_<<testx>> where <<testx>> is the lower-cased version of the deployment environment
- The procedure for building a warfile is described above
- Download tomcat 8.5 and unpack it into the wayback directory. Rename to tomcat-8.
- The git repository contains two nas settings files - one for use with JMS installations and one for bitmagasin
- The default ports for the proxy endpoint in settings.conf should be changed to your assigned tester port
- If the conf/tomcat_conf/server.xml redirect port 8443 is not available, change it to 8444
- Now drop the netarkivet-openwayback.war, renamed to wayback.war, in the wars directory in the installation.
Now start wayback/tomcat with the start script in wayback_<<testx>>/bin.
Check the log for error messages
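A quick scan of a log file for problems might look like the sketch below. The helper name scan_log is hypothetical and not part of the wayback deployment. SEVERE is the level name used by Tomcat's default java.util.logging setup; the other two patterns are generic guesses and may need adjusting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print suspicious lines from a log file, with line numbers.
# Returns 0 when nothing suspicious is found, 1 when something is found,
# and 2 when the log file cannot be read.
scan_log() {
    local log="$1"
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # -n: prefix each hit with its line number, -E: extended regular expression
    if grep -nE 'SEVERE|ERROR|Exception' "$log"; then
        return 1
    fi
    return 0
}
```

It would be called with the path of the tomcat log inside the wayback installation, for example scan_log tomcat-8/logs/catalina.out (that path is an assumption; use whichever log file your installation writes).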
First do a sanity test that wayback is running and that the configuration is sane
- Use X-forwarding and start a firefox running directly on kb-test-way-001.kb.dk
- Check that the browser is not set to use a proxy
- Browse to localhost:<your port> and check that you can reach wayback
After this you can try accessing the proxy endpoint via ssh port forwarding (see details below).
Redeploying to an existing installation
To redeploy to an existing wayback installation:
- Drop the warfile wayback.war in the wars directory.
- Touch the context-descriptor file:
touch tomcat/conf/Catalina/localhost/ROOT.xml
- Wait a few seconds, then restart wayback with the provided script:
bin/start_wayback.sh
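The redeploy steps can be wrapped in a small function, sketched here under some assumptions: redeploy_wayback is a hypothetical name, the directory layout (wars, tomcat/conf/Catalina/localhost, bin) is taken from the steps above, and the default pause of 5 seconds is an arbitrary stand-in for "a few seconds".

```shell
#!/usr/bin/env bash
# Hypothetical helper wrapping the manual redeploy steps.
# $1: wayback installation directory
# $2: path to the newly built wayback.war
# $3: seconds to wait before restarting (default 5, an arbitrary choice)
redeploy_wayback() {
    local install_dir="$1" war="$2" pause="${3:-5}"
    # drop the warfile into the wars directory
    cp "$war" "$install_dir/wars/wayback.war" || return 1
    # touch the context descriptor so tomcat notices the change
    touch "$install_dir/tomcat/conf/Catalina/localhost/ROOT.xml" || return 1
    sleep "$pause"
    # run the start script from the installation directory, as in the manual steps
    (cd "$install_dir" && bin/start_wayback.sh)
}
```

For example: redeploy_wayback ~/wayback_test12 target/wayback.war, where both paths are illustrative.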
Check That Wayback Proxy Endpoint Is Working
On devel@kb-prod-udv-001
ssh -g -N -L$PORT:kb-test-way-001.kb.dk:$PORT kb-test-way-001.kb.dk &
Now, in a browser of your choice, set the internet connection settings to use kb-prod-udv-001.kb.dk port $PORT as proxy. In Firefox, it is a good idea to run firefox -P --no-remote and create a new profile which uses this proxy setting and has wayback as its start page.
Go to http://kb-test-way-001.kb.dk:8080/ (or whichever port you set up as the wayback endpoint in settings.conf) and check that you can see the wayback search box.
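The same check can be made without a browser by sending a request through the proxy with curl. This is only a sketch: check_proxy is a hypothetical helper, the 10-second timeout is arbitrary, and treating HTTP status 200 as "working" is an assumption about how the wayback front page responds.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fetch a URL through an HTTP proxy and print the status code.
# $1: proxy as host:port, $2: URL to fetch.
check_proxy() {
    local proxy="$1" url="$2" code
    # -s: silent, -o /dev/null: discard the body, -w: print only the HTTP status,
    # -x: send the request via the given proxy, --max-time: give up after 10 seconds
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 -x "$proxy" "$url") || {
        echo "request via $proxy failed" >&2
        return 1
    }
    echo "$code"
    # assumption: a working endpoint answers with 200
    [ "$code" = "200" ]
}
```

With the port forwarding above in place, a call would look like check_proxy kb-prod-udv-001.kb.dk:$PORT http://kb-test-way-001.kb.dk:8080/.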
Wait for Indexing to Complete
On kb-prod-udv-001 wait to see the indexer application run by executing:
[devel@kb-prod-udv-001 ~]$ watch -n 10 'ssh devel@kb-test-way-001 tail -n 30 $TESTX/log/WaybackIndexerApplication0.log.0'
The indexer runs every five minutes. If you are impatient, just log onto kb-test-way-001 and in the directory $TESTX/conf kill and restart the indexer. It will run right away.
You can follow the progress of indexing with the following two commands
[devel@kb-prod-udv-001 ~]$ ssh devel@kb-test-way-001 grep \'Creating object\' $TESTX/log/WaybackIndexerApplication0.log | wc -l
[devel@kb-prod-udv-001 ~]$ ssh devel@kb-test-way-001 grep \'Received\' $TESTX/log/WaybackIndexerApplication0.log | grep arc | wc -l
The first gives the number of files discovered by the indexer, and the second gives the number of files indexed. When these are equal, indexing is done.
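The comparison of the two counts can be automated, as in the sketch below. The function name index_progress is hypothetical; the grep patterns are the ones used in the commands above, and the function reads a log file that is available locally (run it on kb-test-way-001, or copy the log over first).

```shell
#!/usr/bin/env bash
# Hypothetical helper: report discovered vs. indexed file counts from an indexer log.
# Returns 0 only when at least one file was discovered and both counts are equal.
index_progress() {
    local log="$1" discovered indexed
    if [ ! -r "$log" ]; then
        echo "cannot read $log" >&2
        return 2
    fi
    # grep -c prints the number of matching lines; it exits non-zero when the
    # count is 0, so "|| true" keeps the 0 without treating it as a failure
    discovered=$(grep -c 'Creating object' "$log" || true)
    indexed=$(grep 'Received' "$log" | grep -c 'arc' || true)
    echo "discovered=$discovered indexed=$indexed"
    [ "$discovered" -gt 0 ] && [ "$discovered" -eq "$indexed" ]
}
```

For example, on kb-test-way-001: index_progress $TESTX/log/WaybackIndexerApplication0.log && echo "indexing done".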
Wait for Aggregator
After the indexer is run, wait for the aggregator to run by watching for the creation of the index file:
[devel@kb-prod-udv-001 ~]$ watch -n 10 'ssh devel@kb-test-way-001 ls /home/devel/$TESTX/indexDir/'
until the file wayback_intermediate.index appears. This will take at most ten minutes. If you are impatient, just log onto kb-test-way-001 and in the directory $TESTX/conf kill and restart the aggregator. It will run right away.
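As an alternative to watching the directory by hand, a polling loop with a timeout can do the waiting. This is a minimal sketch: wait_for_file is a hypothetical helper, it checks a local path (so it would be run on kb-test-way-001 itself), and the defaults of 600 seconds total and 10 seconds between checks simply mirror the "at most ten minutes" and "watch -n 10" above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait until a file exists, or give up after a timeout.
# $1: path to wait for
# $2: timeout in seconds (default 600)
# $3: seconds between checks (default 10)
wait_for_file() {
    local path="$1" timeout="${2:-600}" interval="${3:-10}"
    local waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for $path" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "found $path"
}
```

For example: wait_for_file /home/devel/$TESTX/indexDir/wayback_intermediate.index.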
Move The Index File
Move the index file to the place where wayback expects to read it. [I think this is now unnecessary - CSR]
[devel@kb-prod-udv-001 ~]$ ssh devel@kb-test-way-001 mv /home/devel/$TESTX/indexDir/wayback_intermediate.index /home/devel/wayback_cdx/index.cdx
Browse Repository
In the proxied browser you should now be able to search and browse in the repository. The following standard domains are present in the arcfiles:
www.netarkivet.dk, www.kaarefc.dk, www.oernhoej.dk, www.pligtaflevering.dk, www.drive-badmintonklub.dk, www.dbc.dk,
www.kb.dk, www.bs.dk, www.sulnudu.dk, www.kum.dk, www.trinekc.dk, www.slothchristensen.dk, www.trineogkaare.dk, www.sy-jonna.dk,
www.kaareogtrine.dk, www.raeder.dk, www.statsbiblioteket.dk
In addition, the following domains are present in the warcfiles:
The warc.gz files contain a single harvest each of honda.dk, toyota.dk, mazda.dk and sa.dk from 2016-10-31. sa.dk is an example of an https site that renders badly in the current version of wayback used by Netarkivet.
Test Exclusions
- Use the wayback advanced search page to list all the URLs harvested from a particular domain.
- Choose some of them you would like to block by regular expressions.
- On devel@kb-test-way-001 add these regular expressions (one per line) to the file conf/wayback_regexps.txt under the wayback installation folder.
- On devel@kb-test-way-001, restart tomcat by executing the script bin/start_wayback.sh under the wayback installation folder.
- Check that the blocked URLs are no longer visible in advanced search.
- Check that if you try to visit one of the blocked URLs, wayback shows you a page informing you that the content has been blocked.
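Before restarting tomcat it can be useful to preview which URLs a set of regular expressions would hit. The sketch below does this with grep and is only an approximation: preview_exclusions is a hypothetical helper, grep -E uses POSIX extended regular expressions, and wayback's own matching may differ both in regexp syntax and in what exactly it matches against. The URL list is assumed to be a plain text file with one URL per line, for example copied from the advanced search results.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the URLs that match any of the given regular expressions.
# $1: file with one regular expression per line (same layout as wayback_regexps.txt)
# $2: file with one URL per line
# Note: an empty line in the regexp file matches every URL.
preview_exclusions() {
    local regexps="$1" urls="$2"
    # -E: extended regular expressions, -f: read the patterns from a file
    grep -E -f "$regexps" "$urls"
}
```

For example: preview_exclusions conf/wayback_regexps.txt urls.txt, where urls.txt is an illustrative file name. The function returns non-zero when no URL matches.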
Test NetarchiveCacheResourceStore
On devel@kb-test-way-001:
- Stop the wayback tomcat server using the stop_wayback.sh script in the bin folder of the wayback installation.
- Edit conf/wayback/wayback.xml to use NetarchiveCacheResourceStore instead of NetarchiveResourceStore.
- Make sure that the NAS settings file in the conf directory includes a block with the following settings:
<resourcestore>
    <maxfiles>10</maxfiles>
    <cachedir>/tmp</cachedir>
</resourcestore>
- Start the tomcat again.
- Check that you can still browse in the material.
- Shutdown the wayback server
Shutdown the Test
On devel@kb-prod-udv-001, execute cleanup_all_test.sh.
- If you have a background ssh port-forwarding process running a proxy to wayback then you should also kill this at this stage.