Manual installation of the NetarchiveSuite
If the deploy software is not adequate for the installation needed, this section will give some hints on how to distribute and install the NetarchiveSuite software on a number of machines.
In the examples below, we assume that $deployInstallDir is set to the directory in which the NetarchiveSuite code is to be installed.
We assume that all machines in the chosen scenario are unix/linux servers. The procedure below may not work on other platforms. After having created the new settings to be used in the deployment of the software, zip together the NetarchiveSuite files including the new settings and copy the modified NetarchiveSuite.zip to all machines taking part in the deployment:
export USER=test
# Space-separated list of hosts (the shell splits on whitespace, not commas); extend as needed
export MACHINES="machine1.domain1 machine2.domain1 machine1.domain2 machine2.domain2"
for MACHINE in $MACHINES; do
    scp NetarchiveSuite.zip $USER@$MACHINE:$deployInstallDir
    ssh $USER@$MACHINE "cd $deployInstallDir && unzip NetarchiveSuite.zip"
done
NetarchiveSuite settings
The NetarchiveSuite settings can be set for applications in three different ways:
- by using the default settings
- in a settings file
- on the command line
Using NetarchiveSuite default settings
If no settings are set, the default settings are used. Please refer to the Default Settings section of the Configuration Manual 3.16 for more information on these.
Setting NetarchiveSuite settings on the command line
To set the value of a setting on the command line, add "-Dkey=value" to your java command line, for instance:
java -Dsettings.common.http.port=8076 dk.netarkivet.common.webinterface.GUIApplication
will override the setting for the http port to be 8076.
Setting NetarchiveSuite settings with settings files
To set the values using a configuration file, save the settings in an XML file as described above. By default, NetarchiveSuite will look for the settings file in conf/settings.xml, that is: the file settings.xml under the directory conf, relative to the current working directory. You can override this by specifying -Ddk.netarkivet.settings.file=path/to/settings.file.xml on the command line, for instance:
java -Ddk.netarkivet.settings.file=/home/netarchive/guisettings.xml dk.netarkivet.common.webinterface.GUIApplication
will read settings from the file /home/netarchive/guisettings.xml.
You can also specify multiple configuration files. You do this by separating the paths with ':' on Unix/Linux/macOS or ';' on Windows. For instance:
java -Ddk.netarkivet.settings.file=guisettings.xml:basicsettings.xml dk.netarkivet.common.webinterface.GUIApplication
will read settings from both guisettings.xml and basicsettings.xml in the current directory.
The order of resolving NetarchiveSuite settings
If a setting is set on both command line and in settings files, or if it is set in multiple settings files, the setting is resolved as follows:
- If the setting is set with system properties (i.e. set on the command line), use these
- Else if the setting is specified in configuration files, use the first specified value
- Else use default value
As an example, consider the resulting value for http-port (knowing that the default value is empty) when using the following two configuration files:
settings1.xml
<settings>
  <common>
    <http>
      <port>8076</port>
    </http>
  </common>
</settings>
settings2.xml
<settings>
  <common>
    <http>
      <port>8077</port>
    </http>
  </common>
</settings>
The following command will use the value empty string as http-port:
java dk.netarkivet.common.webinterface.GUIApplication
The following command will use the value 8078 as http-port:
java -Ddk.netarkivet.settings.file=settings1.xml:settings2.xml -Dsettings.common.http.port=8078 dk.netarkivet.common.webinterface.GUIApplication
The following command will use the value 8076 as http-port:
java -Ddk.netarkivet.settings.file=settings1.xml:settings2.xml dk.netarkivet.common.webinterface.GUIApplication
The following command will use the value 8077 as http-port:
java -Ddk.netarkivet.settings.file=settings2.xml:settings1.xml dk.netarkivet.common.webinterface.GUIApplication
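The resolution order can be illustrated with a small shell sketch. This is not NetarchiveSuite code: resolve_setting is a hypothetical helper that only mimics the three rules, and it simplifies by treating an empty value as "not set".

```shell
# Hypothetical helper, not part of NetarchiveSuite: mimics the lookup order.
# $1 = value given with -D on the command line (empty if not given)
# $2 = default value
# remaining arguments = values found in the settings files, in the order listed
resolve_setting() {
    sysprop=$1
    default=$2
    shift 2
    if [ -n "$sysprop" ]; then
        echo "$sysprop"          # 1. a system property wins
        return
    fi
    for value in "$@"; do
        if [ -n "$value" ]; then
            echo "$value"        # 2. first value found in the settings files
            return
        fi
    done
    echo "$default"              # 3. otherwise fall back to the default
}

resolve_setting 8078 "" 8076 8077   # prints 8078
resolve_setting ""   "" 8076 8077   # prints 8076
resolve_setting ""   "" 8077 8076   # prints 8077
```

The three calls correspond to the last three java commands above: the -D value, then settings1.xml listed first, then settings2.xml listed first.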
Standard commandline settings
The CLASSPATH
The CLASSPATH needed to start and run the java applications in NetarchiveSuite consists of several jarfiles. The dk.netarkivet.common.jar and all our 3rd party dependencies need not be added explicitly to the CLASSPATH, as they are referenced indirectly in the jar-files.
export deployInstallDir=/path/to/netarchiveSuite
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-harvester-core.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-harvest-scheduler.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-netarchivesuite-heritrix3-controller.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-heritrix3-extensions.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-heritrix3-wrapper.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-archive-core.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-wayback-indexer.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-monitor-core.jar
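The same CLASSPATH can be built with a loop, which is easier to maintain when jars are added or removed. The sketch below is only an illustration: it uses three of the jar names listed above and assumes they all live in $deployInstallDir/lib.

```shell
# Sketch only: falls back to the placeholder path from the example above
# if deployInstallDir is not already set.
deployInstallDir=${deployInstallDir:-/path/to/netarchiveSuite}

for jar in netarchivesuite-harvester-core \
           netarchivesuite-archive-core \
           netarchivesuite-monitor-core; do
    # ${CLASSPATH:+$CLASSPATH:} adds the ':' separator only when CLASSPATH
    # is already non-empty, so no leading ':' is produced.
    CLASSPATH="${CLASSPATH:+$CLASSPATH:}$deployInstallDir/lib/$jar.jar"
done
export CLASSPATH
echo "$CLASSPATH"
```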
Logging
We use logback and SLF4J as the logging framework, so each application needs to be pointed to a logback configuration file through the logback.configurationFile property. You may want to use different logging configurations for different applications, especially when more than one application logs to the same logging directory.
export LOG_SETTINGS="-Dlogback.configurationFile=$deployInstallDir/conf/conf/logback_SomeApplication.xml"
Note that if you use the StatusSiteSection, your logback configuration file must contain the appender dk.netarkivet.monitor.logging.CachingSLF4JAppender:
<appender name="MONITOR" class="dk.netarkivet.monitor.logging.CachingSLF4JAppender">
  <filter class="ch.qos.logback.classic.filter.LevelFilter">
    <level>INFO</level>
    <onMatch>ACCEPT</onMatch>
    <onMismatch>NEUTRAL</onMismatch>
  </filter>
  <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{64} - %msg%n</pattern>
</appender>
Each application instance has its own JMX and RMI port. For example, the JMX port could be 8100 and the associated RMI port 8200 for the first application instance on the machine (as in the example below), then 8101/8201 for the second application instance, and so on. JMX also uses a password file, which is the same throughout the installation ($deployInstallDir/conf/jmxremote.password).
export JMX_SETTINGS="-Dsettings.common.jmx.port=8100 -Dsettings.common.jmx.rmiPort=8200"
Note: For the StatusSiteSection to work, your logging must be configured with the dk.netarkivet.monitor.logging.CachingSLF4JAppender enabled; see the Logging section above. (This is done automatically if the NetarchiveSuite deploy software is used to configure and install your NetarchiveSuite installation.)
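The port numbering scheme described above (8100/8200 for the first instance, 8101/8201 for the second, and so on) can be computed rather than typed by hand. The helper jmx_settings below is hypothetical and only illustrates the arithmetic; the base ports are the example values from the text, not required values.

```shell
# Hypothetical helper: derive the JMX settings from an instance index,
# where 0 is the first application instance on the machine.
jmx_settings() {
    instance_index=$1
    echo "-Dsettings.common.jmx.port=$((8100 + instance_index)) -Dsettings.common.jmx.rmiPort=$((8200 + instance_index))"
}

# Second application instance on this machine: ports 8101 and 8201
export JMX_SETTINGS="$(jmx_settings 1)"
echo "$JMX_SETTINGS"
```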
Select the appropriate settings.file for the application
The conf/settings.xml (the new one configured to your environment) is probably OK for most applications. But you may need to use special-purpose settings files for some applications, e.g. BitarchiveApplications (since you can't specify more than one baseFileDir on the command line). The settings file used by an application can be specified by:
export SETTING=-Ddk.netarkivet.settings.file=$deployInstallDir/conf/settings.xml
JVM options
We set the maximum Java heap size to 1.5 Gbytes. You can change this value or add other JVM options through JAVA_OPTS.
export JAVA_OPTS=-Xmx1536m
Admin machine
On the admin machine, we have to start the following 5 applications:
- 1 GUIApplication
- 1 HarvestJobManagerApplication (handles the scheduling of jobs)
- 2 instances of BitarchiveMonitorApplication (each controlling the access to a single bitarchive replica), one for each bitarchive replica (e.g. EAST and WEST)
- 1 ArcRepositoryApplication (this application handles access to the bitarchive replicas)
Starting the GUIApplication
Before we can start the GUIApplication, the external database needs to be started (the deploy software does this for you if the external database is a Derby database).
We also need to prepare the JSP-pages. You can unzip the war-files in the webpages directory as below:
cd $deployInstallDir/webpages
rm -rf BitPreservation
unzip -o BitPreservation.war -d BitPreservation
rm -rf HarvestDefinition
unzip -o HarvestDefinition.war -d HarvestDefinition
rm -rf History
unzip -o History.war -d History
rm -rf QA
unzip -o QA.war -d QA
rm -rf Status
unzip -o Status.war -d Status
Or you can update your settings.xml file to refer to the war-files instead of the unpacked directories, for instance
<common>
  ...
  <webinterface>
    ...
    <siteSection>
      <!-- A subclass of SiteSection that defines this part of the web interface. -->
      <class>dk.netarkivet.harvester.webinterface.DefinitionsSiteSection</class>
      <!-- The directory or war-file containing the web application for this site section. -->
      <webapplication>webpages/HarvestDefinition.war</webapplication>
    </siteSection>
    ...
  </webinterface>
  ...
</common>
and similarly for the other site sections.
Now we are ready to start the application:
cd $deployInstallDir
export APP=dk.netarkivet.common.webinterface.GUIApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP
Starting the BitarchiveMonitorApplication instances
In the general set-up with two distributed bitarchive replicas, we have a BitarchiveMonitorApplication associated with each replica. Here the replicas are ReplicaOne (with replicaId ONE) and ReplicaTwo (with replicaId TWO).
To distinguish the two instances from each other, we use the settings.common.applicationInstanceId setting as an identifier (here BMONE and BMTWO).
Start the monitor for the bitarchive at ReplicaOne using BMONE as identifier thus:
cd $deployInstallDir
export APP_OPTIONS="-Dsettings.common.archive.bitarchive.useReplicaId=ONE \
    -Dsettings.common.applicationInstanceId=BMONE"
export APP=dk.netarkivet.archive.bitarchive.BitarchiveMonitorApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP
Start the monitor for the bitarchive at ReplicaTwo using BMTWO as identifier thus:
cd $deployInstallDir
export APP_OPTIONS="-Dsettings.common.archive.bitarchive.useReplicaId=TWO \
    -Dsettings.common.applicationInstanceId=BMTWO"
export APP=dk.netarkivet.archive.bitarchive.BitarchiveMonitorApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP
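Since the two monitor commands differ only in the replica id, the instance id and the ports, they can be generated from one template. The following is a dry-run sketch that prints the commands instead of running them. monitor_command is a hypothetical helper; it assumes JAVA_OPTS, SETTING and LOG_SETTINGS are set as described earlier and reuses the 8100/8200 example ports, so each instance gets its own JMX and RMI port.

```shell
# Hypothetical helper: print the start command for one BitarchiveMonitorApplication.
# $1 = replica id (e.g. ONE), $2 = instance index on this machine (0, 1, ...)
monitor_command() {
    replica_id=$1
    instance_index=$2
    app=dk.netarkivet.archive.bitarchive.BitarchiveMonitorApplication
    jmx="-Dsettings.common.jmx.port=$((8100 + instance_index)) -Dsettings.common.jmx.rmiPort=$((8200 + instance_index))"
    app_options="-Dsettings.common.archive.bitarchive.useReplicaId=$replica_id -Dsettings.common.applicationInstanceId=BM$replica_id"
    # Variables are deliberately unquoted so that each option becomes its own word.
    echo java $JAVA_OPTS $SETTING $LOG_SETTINGS $jmx $app_options $app
}

i=0
for id in ONE TWO; do
    monitor_command "$id" "$i"   # dry run: only prints the command
    i=$((i + 1))
done
```

To actually start the applications, the printed lines would be run from $deployInstallDir, for instance in the background with separate log files.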
Finally, start the single ArcRepositoryApplication (this application handles all access to the bitarchives):
cd $deployInstallDir
export APP=dk.netarkivet.archive.arcrepository.ArcRepositoryApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP
Harvester machines
On each harvester machine, we have one or more HarvestControllerApplications. Settings related to the HarvestControllerApplication are:
- settings.common.applicationInstanceId (to distinguish between HarvestControllerApplications running on the same machine)
- settings.harvester.harvesting.channel (to select which harvest channel to accept jobs from: either a named selective harvest channel or the snapshot channel; see Harvest Channels for further information)
- settings.harvester.harvesting.minSpaceLeft (how many bytes must be available in the serverdir to accept crawl jobs). The default is 400000000 (~400 Mbytes).
- settings.harvester.harvesting.bundle pointing to the zip file containing the bundled Heritrix3.
In the following, a snapshot harvest HarvestControllerApplication is started with application instance id=SEL
cd $deployInstallDir
export APP_OPTIONS="-Dsettings.harvester.harvesting.channel=SNAPSHOT -Dsettings.common.applicationInstanceId=SEL"
export APP=dk.netarkivet.harvester.harvesting.HarvestControllerApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP
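Before starting a harvester by hand, it can be useful to check whether the server directory has at least the free space that settings.harvester.harvesting.minSpaceLeft demands. The sketch below is not part of NetarchiveSuite: has_min_space is a hypothetical helper built on the POSIX df command, and the directory checked (here the current directory) is an assumption to be replaced with the actual serverdir.

```shell
# Hypothetical pre-flight check, not part of NetarchiveSuite.
# $1 = directory to check, $2 = required number of free bytes
has_min_space() {
    # df -P gives the portable output format, -k reports 1024-byte blocks;
    # the fourth column of the second line is the available space.
    free_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ $((free_kb * 1024)) -ge "$2" ]
}

# 400000000 is the default value of minSpaceLeft mentioned above
if has_min_space . 400000000; then
    echo "enough space to accept crawl jobs"
else
    echo "less than 400000000 bytes free"
fi
```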
Bitarchive machines
For each Replica, you can have BitarchiveServers installed on one or more machines. We suggest using just one BitarchiveServer for each machine, though it is possible to use more than one.
Each BitarchiveServer can have storage on several filesystems, so if archive storage is spread over more than one filesystem, you need to modify the settings file like this:
<settings>
  ..
  <archive>
    ...
    <bitarchive>
      ...
      <baseFileDir>/home/fileSys1/</baseFileDir>
      <baseFileDir>/home/fileSys2/</baseFileDir>
      ...
    </bitarchive>
  </archive>
  ..
</settings>
Starting a BitarchiveServer requires knowing which Replica it resides on, and the credentials required for correcting the data stored in the bitarchive. For ReplicaOne with id ONE this would be:
cd $deployInstallDir
export APP_OPTIONS="-Dsettings.archive.bitarchive.useReplicaId=ONE \
    -Dsettings.archive.bitarchive.thisCredentials=CREDENTIALS"
export APP=dk.netarkivet.archive.bitarchive.BitarchiveApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP
Access servers
On the access-servers, we deploy any number of ViewerProxyApplication instances, and maybe one IndexServerApplication (only one in all) used to generate indices needed by the harvesters and the ViewerProxyApplication instances.
cd $deployInstallDir
export APP=dk.netarkivet.harvester.indexserver.IndexServerApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP
Each ViewerProxyApplication instance uses an application instance id (settings.common.applicationInstanceId) and its own distinct base directory (settings.viewerproxy.baseDir). Each instance also belongs to a Replica (settings.archive.bitarchive.useReplicaId). In the sample below, the instance uses application instance id "first" and 'viewerproxy_first' as base directory, and belongs to ReplicaOne with id ONE:
cd $deployInstallDir
export APP_OPTIONS="-Dsettings.common.applicationInstanceId=first \
    -Dsettings.viewerproxy.baseDir=viewerproxy_first \
    -Dsettings.archive.bitarchive.useReplicaId=ONE"
export APP=dk.netarkivet.viewerproxy.ViewerProxyApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP
For the NetarchiveSuite support for wayback, see the Additional Tools Manual.