Manual installation of the NetarchiveSuite

Note that this documentation is for the upcoming release, NetarchiveSuite 7.4,
and is still a work in progress.

For documentation on the released versions, please view the previous versions of the NetarchiveSuite documentation and select the relevant version.

This page describes alternative installation scenarios for cases where the automatic deploy is not adequate.



In the event that the deploy software is not adequate for the installation needed, this section will give some hints on how to distribute and install the NetarchiveSuite software on a number of machines.

In the examples below, we assume that $deployInstallDir is set to the directory in which the NetarchiveSuite code is to be installed.

We assume that all machines in the chosen scenario are unix/linux servers. The procedure below may not work on other platforms. After having created the new settings to be used in the deployment of the software, zip together the NetarchiveSuite files including the new settings and copy the modified NetarchiveSuite.zip to all machines taking part in the deployment:

export USER=test
# Space-separated list of all machines taking part in the deployment
export MACHINES="machine1.domain1 machine2.domain1 machine1.domain2 machine2.domain2"
for MACHINE in $MACHINES; do
  scp NetarchiveSuite.zip "$USER@$MACHINE:$deployInstallDir"
  ssh "$USER@$MACHINE" "cd $deployInstallDir && unzip NetarchiveSuite.zip"
done

NetarchiveSuite settings

The NetarchiveSuite settings can be set for applications in three different ways:

  • use default setting
  • in a setting file
  • on command line

Using NetarchiveSuite default settings

If a setting is not set explicitly, its default value is used. Please refer to the Default Settings section of the Configuration Manual for more information on these.

Setting NetarchiveSuite settings on the command line

To set the value of a setting on the command line, add "-Dkey=value" to your java command line, for instance:

  java -Dsettings.common.http.port=8076 dk.netarkivet.common.webinterface.GUIApplication

will override the setting for the http port to be 8076.

Setting NetarchiveSuite settings with settings files

To set the values using a configuration file, save the settings in an XML file. By default, NetarchiveSuite will look for the settings file in conf/settings.xml, that is, the file settings.xml in the directory conf under the current working directory. You can override this by specifying -Ddk.netarkivet.settings.file=path/to/settings.file.xml on the command line, for instance:

  java -Ddk.netarkivet.settings.file=/home/netarchive/guisettings.xml dk.netarkivet.common.webinterface.GUIApplication

will read settings from the file /home/netarchive/guisettings.xml.

You can also specify multiple configuration files by separating the paths with ':' on Unix/Linux/macOS or ';' on Windows. For instance:

  java -Ddk.netarkivet.settings.file=guisettings.xml:basicsettings.xml dk.netarkivet.common.webinterface.GUIApplication

will read settings from both guisettings.xml and basicsettings.xml in the current directory.
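
As a rough illustration of the separator rule, the following shell sketch joins a list of settings files with the separator that matches the platform. The functions settings_separator and join_settings_files are hypothetical helpers, not part of NetarchiveSuite, and recognising a Windows-like shell by the `uname -s` prefixes CYGWIN, MINGW or MSYS is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of NetarchiveSuite).

# Print the path separator for a platform name as reported by `uname -s`.
# Assumption: Windows-like shells report CYGWIN*, MINGW* or MSYS*.
settings_separator() {
  case "${1:-$(uname -s)}" in
    CYGWIN*|MINGW*|MSYS*) printf ';' ;;
    *)                    printf ':' ;;
  esac
}

# Join settings files with a separator: join_settings_files SEP FILE...
join_settings_files() {
  local sep="$1" joined="" file
  shift
  for file in "$@"; do
    # Prepend the separator only when something has already been joined.
    joined="${joined:+$joined$sep}$file"
  done
  printf '%s\n' "$joined"
}

# Build the option used on the java command line.
SETTING="-Ddk.netarkivet.settings.file=$(join_settings_files "$(settings_separator)" guisettings.xml basicsettings.xml)"
```

The order of the file arguments is preserved, which matters because the first file that defines a setting wins.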

The order of resolving NetarchiveSuite settings

If a setting is set on both command line and in settings files, or if it is set in multiple settings files, the setting is resolved as follows:

  • If the setting is set with system properties (i.e. set on the command line), use these
  • Else if the setting is specified in configuration files, use the first specified value
  • Else use default value

As an example, consider the resulting value for the http port (given that its default value is empty) when using the following two configuration files:

settings1.xml

<settings>
  <common>
    <http>
      <port>8076</port>
    </http>
  </common>
</settings>

settings2.xml

<settings>
  <common>
    <http>
      <port>8077</port>
    </http>
  </common>
</settings>

The following command will use the empty string (the default value) as http-port:

  java dk.netarkivet.common.webinterface.GUIApplication

The following command will use the value 8078 as http-port:

  java -Ddk.netarkivet.settings.file=settings1.xml:settings2.xml -Dsettings.common.http.port=8078 dk.netarkivet.common.webinterface.GUIApplication

The following command will use the value 8076 as http-port:

  java -Ddk.netarkivet.settings.file=settings1.xml:settings2.xml dk.netarkivet.common.webinterface.GUIApplication

The following command will use the value 8077 as http-port:

  java -Ddk.netarkivet.settings.file=settings2.xml:settings1.xml dk.netarkivet.common.webinterface.GUIApplication
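
The lookup order can be mimicked with a small shell sketch. This is only an illustration of the rule, not how NetarchiveSuite reads its settings: the function resolve_http_port is hypothetical, it treats an empty first argument as "not set on the command line", and it naively takes the first <port> element it finds in a file instead of parsing the XML.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the resolution order (not NetarchiveSuite code).
# $1 = value given with -D on the command line (empty if not given)
# $2 = colon-separated list of settings files, in the order specified
# $3 = default value
resolve_http_port() {
  local sysprop="$1" files="$2" default="$3" file value
  # 1. A value set as a system property always wins.
  if [ -n "$sysprop" ]; then
    printf '%s\n' "$sysprop"
    return
  fi
  # 2. Otherwise use the first file, in the given order, that defines a port.
  local IFS=':'
  for file in $files; do
    [ -f "$file" ] || continue
    # Simplification: match any <port> element on a single line.
    value=$(sed -n 's:.*<port>\(.*\)</port>.*:\1:p' "$file" | head -n 1)
    if [ -n "$value" ]; then
      printf '%s\n' "$value"
      return
    fi
  done
  # 3. Fall back to the default value.
  printf '%s\n' "$default"
}
```

With settings1.xml and settings2.xml from above in the current directory, `resolve_http_port "" settings2.xml:settings1.xml ""` prints 8077, matching the last command above.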

Standard commandline settings

The CLASSPATH

The CLASSPATH needed to start and run the Java applications in NetarchiveSuite consists of several jar files. The dk.netarkivet.common.jar and all our third-party dependencies need not be added explicitly to the CLASSPATH, as they are referenced indirectly by the jar files listed below.

export deployInstallDir=/path/to/netarchiveSuite
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-harvester-core.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-harvest-scheduler.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-heritrix3-controller.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-heritrix3-extensions.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-heritrix3-wrapper.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-archive-core.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-wayback-indexer.jar
export CLASSPATH=$CLASSPATH:$deployInstallDir/lib/netarchivesuite-monitor-core.jar
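
The same CLASSPATH can be assembled in a loop, which also makes it easy to spot a missing jar file. The sketch below is one possible way to do this; build_classpath is a hypothetical helper, and the two jar names are taken from the list above only to keep the example short. Unlike the export lines above, it does not append to an existing CLASSPATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NetarchiveSuite): build a CLASSPATH value
# from an installation directory and a list of jar file names under lib/.
build_classpath() {
  local install_dir="$1" classpath="" jar
  shift
  for jar in "$@"; do
    # Warn on stderr, but keep going, if a jar file is missing.
    if [ ! -f "$install_dir/lib/$jar" ]; then
      echo "warning: $install_dir/lib/$jar not found" >&2
    fi
    classpath="${classpath:+$classpath:}$install_dir/lib/$jar"
  done
  printf '%s\n' "$classpath"
}

deployInstallDir=${deployInstallDir:-/path/to/netarchiveSuite}
CLASSPATH=$(build_classpath "$deployInstallDir" \
  netarchivesuite-harvester-core.jar \
  netarchivesuite-monitor-core.jar)
export CLASSPATH
```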

Logging

We use logback and SLF4J as our logging framework, so we need to refer to a logback.configurationFile. You may want to use different logging properties for different applications, especially when more than one application logs to the same logging directory.

export LOG_SETTINGS="-Dlogback.configurationFile=$deployInstallDir/conf/logback_SomeApplication.xml"

Note that if you use the StatusSiteSection, your logback configuration file must contain the appender dk.netarkivet.monitor.logging.CachingSLF4JAppender.

<appender name="MONITOR" class="dk.netarkivet.monitor.logging.CachingSLF4JAppender">
  <filter class="ch.qos.logback.classic.filter.LevelFilter">
    <level>INFO</level>
    <onMatch>ACCEPT</onMatch>
    <onMismatch>NEUTRAL</onMismatch>
  </filter>
  <pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level %logger{64} - %msg%n</pattern>
</appender>

Each application instance has its own JMX and RMI port. For example, the first application instance on a machine could use JMX port 8100 and the associated RMI port 8200, as in the example below, the second instance 8101/8201, and so on. JMX also uses a password file, which is the same throughout the installation ($deployInstallDir/conf/jmxremote.password).

export JMX_SETTINGS="-Dsettings.common.jmx.port=8100 -Dsettings.common.jmx.rmiPort=8200"
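
The numbering scheme can be sketched as a small function that derives both ports from the position of the instance on the machine. The function jmx_settings is a hypothetical helper; the base ports 8100 and 8200 are the example values used here, not required values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JMX settings for the N-th application
# instance on a machine (0 for the first instance).
# $1 = instance index, $2 = JMX base port (default 8100),
# $3 = RMI base port (default 8200)
jmx_settings() {
  local index="$1" jmx_base="${2:-8100}" rmi_base="${3:-8200}"
  printf '%s\n' "-Dsettings.common.jmx.port=$((jmx_base + index)) -Dsettings.common.jmx.rmiPort=$((rmi_base + index))"
}

# Settings for the second application instance on this machine.
JMX_SETTINGS="$(jmx_settings 1)"
export JMX_SETTINGS
```

Deriving both ports from one index keeps each JMX/RMI pair aligned and avoids two instances on the same machine claiming the same port.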

Note: For the StatusSiteSection to work, your logging configuration must enable the dk.netarkivet.monitor.logging.CachingSLF4JAppender; see the Logging section above. (This is done automatically if the NetarchiveSuite deploy software is used to configure and install your NetarchiveSuite installation.)

Select the appropriate settings.file for the application

The conf/settings.xml (the new one configured for your environment) is probably adequate for most applications. However, you may need special-purpose settings files for some applications, e.g. BitarchiveApplications (since you can't specify more than one baseFileDir on the command line). The settings file used by an application can be specified by:

export SETTING=-Ddk.netarkivet.settings.file=$deployInstallDir/conf/settings.xml

JVM options

We set the maximum Java heap size to 1.5 Gbytes. You can change this value or add other JVM options in the same variable.

export JAVA_OPTS=-Xmx1536m

Admin machine

On the admin machine, we have to start the following 5 applications:

  • 1 GUIApplication.
  • 1 HarvestJobManagerApplication (handles the scheduling of jobs)
  • 2 instances of BitarchiveMonitorApplication (each controlling the access to a single bitarchive replica), one for each bitarchive replica (e.g. EAST and WEST).
  • 1 ArcRepositoryApplication (this application handles access to the bitarchive replicas).

Starting the GUIApplication

Before we can start the GUIApplication, the external database needs to be started. (The deploy software does this for you if the external database is a Derby database.)

We also need to prepare the JSP-pages. You can unzip the war-files in the webpages directory as below:

cd $deployInstallDir/webpages
rm -rf BitPreservation
unzip -o BitPreservation.war -d BitPreservation
rm -rf HarvestDefinition
unzip -o HarvestDefinition.war -d HarvestDefinition
rm -rf History
unzip -o History.war -d History
rm -rf QA
unzip -o QA.war -d QA
rm -rf Status
unzip -o Status.war -d Status

Or you can update your settings.xml file to refer to the war-files instead of the unpacked directories, for instance

    <common>
        ...
        <webinterface>
            ...
            <siteSection>
                <!-- A subclass of SiteSection that defines this part of the
                     web interface. -->
                <class>dk.netarkivet.harvester.webinterface.DefinitionsSiteSection</class>
                <!-- The directory or war-file containing the web application
                     for this site section.-->
                <webapplication>webpages/HarvestDefinition.war</webapplication>
            </siteSection>
            ...
        </webinterface>
        ...
    </common>

and similarly for the other site sections.

Now we are ready to start the application:

cd $deployInstallDir
export APP=dk.netarkivet.common.webinterface.GUIApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP

Starting the BitarchiveMonitorApplication instances

In the standard set-up with two distributed bitarchive replicas, we have a BitarchiveMonitorApplication associated with each replica. Here the replicas are ReplicaOne (with replicaId ONE) and ReplicaTwo (with replicaId TWO).

To distinguish the two instances from each other, we use the settings.common.applicationInstanceId setting as an identifier (here we use BMONE and BMTWO as the two identifiers).

Start the monitor for bitarchive at ReplicaOne using BMONE as identifier thus:

cd $deployInstallDir
export APP_OPTIONS="-Dsettings.common.archive.bitarchive.useReplicaId=ONE  \
 -Dsettings.common.applicationInstanceId=BMONE"
export APP=dk.netarkivet.archive.bitarchive.BitarchiveMonitorApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP

Start the monitor for the bitarchive at ReplicaTwo using BMTWO as identifier thus:

cd $deployInstallDir
export APP_OPTIONS="-Dsettings.common.archive.bitarchive.useReplicaId=TWO \
 -Dsettings.common.applicationInstanceId=BMTWO"
export APP=dk.netarkivet.archive.bitarchive.BitarchiveMonitorApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP

Starting the ArcRepositoryApplication

Finally, start the single ArcRepositoryApplication, which handles all access to the bitarchives:

cd $deployInstallDir
export APP=dk.netarkivet.archive.arcrepository.ArcRepositoryApplication
java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP
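
All start commands in this section follow the same pattern, so they can be wrapped in a small launcher. The following sketch assumes the variables defined in the previous sections; start_app and the DRY_RUN switch are hypothetical names introduced here for illustration and are not part of NetarchiveSuite.

```shell
#!/usr/bin/env bash
# Hypothetical launcher (not part of NetarchiveSuite).
# $1 = fully qualified class name of the application to start.
# With DRY_RUN set to a non-empty value, the command is printed instead of run.
start_app() {
  local app="$1"
  # Word splitting is intentional: each variable may hold several options,
  # and empty or unset variables contribute no arguments.
  set -- java ${JAVA_OPTS:-} ${SETTING:-} ${LOG_SETTINGS:-} ${JMX_SETTINGS:-} ${APP_OPTIONS:-} "$app"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    # Run from the installation directory, in a subshell so the caller's
    # working directory is left unchanged.
    (cd "$deployInstallDir" && "$@")
  fi
}
```

Running `DRY_RUN=1 start_app dk.netarkivet.archive.arcrepository.ArcRepositoryApplication` first makes it possible to inspect the assembled command line before starting the application for real.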

Harvester machines

On each harvester machine, we have one or more HarvestControllerApplications. The settings related to the HarvestControllerApplication are:

  • settings.common.applicationInstanceId (to distinguish between HarvestControllerApplications running on the same machine)
  • settings.harvester.harvesting.channel (to select which of two harvest channels to accept jobs from: either a named selective harvest channel, or the snapshot channel. See Harvest Channels for further information.)
  • settings.harvester.harvesting.minSpaceLeft (how many bytes must be available in the serverdir to accept crawljobs). The default is 400000000 (~400 Mbytes).
  • settings.harvester.harvesting.bundle (points to the zip file containing the bundled Heritrix3).

In the following, a snapshot harvest HarvestControllerApplication is started with application instance id SEL:

 cd $deployInstallDir
 export APP_OPTIONS="-Dsettings.harvester.harvesting.channel=SNAPSHOT -Dsettings.common.applicationInstanceId=SEL"
 export APP=dk.netarkivet.harvester.harvesting.HarvestControllerApplication
 java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP
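
Since a harvester only accepts crawljobs when minSpaceLeft bytes are available, it can be useful to check the free space before starting the application. The sketch below is a hypothetical pre-flight check, not NetarchiveSuite's own implementation; SERVER_DIR is an assumed variable naming the harvester's serverdir, and the parsing assumes the POSIX output format of `df -P` with a filesystem name that contains no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not NetarchiveSuite's own implementation).

# Print the bytes available on the filesystem holding directory $1.
# `df -Pk` reports 1024-byte blocks; field 4 of line 2 is "Available".
available_bytes() {
  df -Pk "$1" | awk 'NR == 2 { printf "%.0f\n", $4 * 1024 }'
}

# Succeed if $1 (available bytes) is at least $2 (required bytes).
# The default of 400000000 is the documented default of minSpaceLeft.
has_min_space() {
  [ "$1" -ge "${2:-400000000}" ]
}

# Warn if the assumed server directory is too full to accept crawljobs.
SERVER_DIR=${SERVER_DIR:-.}
if ! has_min_space "$(available_bytes "$SERVER_DIR")"; then
  echo "warning: less than 400000000 bytes free in $SERVER_DIR" >&2
fi
```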

Bitarchive machines

For each Replica, you can have BitarchiveServers installed on one or more machines. We suggest using just one BitarchiveServer for each machine, though it is possible to use more than one.

Each BitarchiveServer can have storage on several filesystems, so if archive storage is spread over more than one filesystem, you need to modify the settings file like this:

<settings>
  ..
  <archive>
    ...
    <bitarchive>
      ...
      <baseFileDir>/home/fileSys1/</baseFileDir>
      <baseFileDir>/home/fileSys2/</baseFileDir>
      ...
    </bitarchive>
  </archive>
  ..
</settings>
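
When a machine has many storage filesystems, the baseFileDir elements can be generated instead of typed by hand. The following sketch prints one element per directory; base_file_dir_elements is a hypothetical helper, and it does not escape characters that are special in XML, so it assumes plain directory names.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one <baseFileDir> element per directory argument,
# indented to fit inside the <bitarchive> element shown above.
# Assumption: directory names contain no characters that need XML escaping.
base_file_dir_elements() {
  local dir
  for dir in "$@"; do
    printf '      <baseFileDir>%s</baseFileDir>\n' "$dir"
  done
}

# Prints the two baseFileDir lines of the example settings file above.
base_file_dir_elements /home/fileSys1/ /home/fileSys2/
```

The output can be pasted into the bitarchive section of the special-purpose settings file used for the BitarchiveApplication.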

Starting a BitarchiveServer requires knowing what Replica it resides on, and the credentials required for correcting the data stored in the bitarchive. For ReplicaOne with id ONE this would be:

 cd $deployInstallDir
 export APP_OPTIONS="-Dsettings.archive.bitarchive.useReplicaId=ONE \
                     -Dsettings.archive.bitarchive.thisCredentials=CREDENTIALS"
 export APP=dk.netarkivet.archive.bitarchive.BitarchiveApplication
 java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP

Access servers

On the access-servers, we deploy any number of ViewerProxyApplication instances, and maybe one IndexServerApplication (only one in all) used to generate indices needed by the harvesters and the ViewerProxyApplication instances.

 cd $deployInstallDir
 export APP=dk.netarkivet.harvester.indexserver.IndexServerApplication
 java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP

Each ViewerProxyApplication instance uses an application instance id (settings.common.applicationInstanceId) and its own distinct base directory (settings.viewerproxy.baseDir). Each instance also belongs to a Replica (settings.archive.bitarchive.useReplicaId). In the start sample below, the instance uses application instance id "first" and 'viewerproxy_first' as base directory, and belongs to ReplicaOne with id ONE:

 cd $deployInstallDir
 export APP_OPTIONS="-Dsettings.common.applicationInstanceId=first \
    -Dsettings.viewerproxy.baseDir=viewerproxy_first \
    -Dsettings.archive.bitarchive.useReplicaId=ONE"
 export APP=dk.netarkivet.viewerproxy.ViewerProxyApplication
 java $JAVA_OPTS $SETTING $LOG_SETTINGS $JMX_SETTINGS $APP_OPTIONS $APP

For information about the NetarchiveSuite support for Wayback, see the Additional Tools Manual.