NetarchiveSuite in Docker and Docker-Compose

This project comes from Innovation Week March 2018. Code can be found at https://github.com/csrster/docker-csr

Docker

What is Docker? It is usually described as "a container". You can think of it as a sandbox in which you can run any kind of application isolated from the rest of your system. Or, more precisely, a sandbox where you control the interaction with the rest of the system - e.g. which ports or directories the application can access. Unlike virtual machines, Docker containers are lightweight because they reuse the running kernel of the host system.

The Docker Way

The Docker Way is to use one container per service. So whereas in NetarchiveSuite we typically run several NAS applications, a database server and a JMS broker all on a single admin machine, the Docker Way would be to run each of these in its own container.

Docker Use Cases

There are multiple ways to use Docker:

  1. To create a throwaway service - e.g. a database server for a test instance
  2. To test your own application in a controlled sandbox
  3. To enable coordination between multiple applications and services, possibly mimicking a distributed application
  4. To create a controlled environment for deploying to a production system.

In this project I have looked at all but the last of these.

Services for NetarchiveSuite

NetarchiveSuite requires

  1. A postgres database server
  2. A JMS broker
  3. An ftp server

With Docker these can be fired up extremely quickly. The broker can be started directly from the command line:

sudo docker run -p 7676 seges/openmq

The ftp server isn't much more complicated, except that we need to define a username and password. The database is a bit more work because we need to include the scripts to initialise the database schema and ingest some test-data. To do this we extend the base Docker image for postgresql with some scripts. The entire Dockerfile looks like

FROM postgres:9.3

COPY harvestdb/0* docker-entrypoint-initdb.d/
COPY harvestdb/data/*  ./

RUN mkdir /tsindex
RUN chown postgres:postgres /tsindex

The first line just imports the basic postgres image. The next two lines copy in some scripts and sql-files to the docker image, while the last two lines create a directory for a tablespace used by the NetarchiveSuite harvestdatabase. The input directory structure just looks like 

.
├── Dockerfile
└── harvestdb
    ├── 00harvestdb_setup.sql
    ├── 01harvestdb_setup.sh
    └── data
        ├── 01netarchivesuite_init.sql
        ├── 02harvestdb.testdata.sql
        └── 03createArchiveDB.pgsql


and the base postgres image ensures the two .sql and .sh files that are copied to the directory docker-entrypoint-initdb.d get executed in alphanumerically sorted order.
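The ordering can be illustrated with a small sketch. This is not the entrypoint of the postgres image itself, only a simplified imitation of the idea: a shell glob expands in sorted order, and each file is then handled according to its extension. The function name plan_init_scripts is made up for this illustration.

```shell
#!/usr/bin/env bash
# Simplified imitation (an assumption, not the real postgres entrypoint):
# print how each file in an init directory would be handled, in the
# order in which the shell glob returns them.
plan_init_scripts() {
    local dir=$1 f
    # A glob expands in sorted order, so 00... is visited before 01...
    for f in "$dir"/*; do
        [ -e "$f" ] || continue          # directory was empty
        case "$f" in
            *.sh)  echo "run shell script: ${f##*/}" ;;
            *.sql) echo "run sql file: ${f##*/}" ;;
            *)     echo "ignore: ${f##*/}" ;;
        esac
    done
}
```

Calling plan_init_scripts on a directory containing the two files above lists 00harvestdb_setup.sql first and 01harvestdb_setup.sh second, which is why the numeric prefixes matter.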

NetarchiveSuite Applications

All NetarchiveSuite applications are basically similar. To start one, you need to know the name of the Application class and provide an xml settings file, a logging configuration file, and a start script that provides the right classpath (a jmxremote.password file is also needed and the harvesters need a certificate file for the Heritrix 3 https GUI). I created a generic NetarchiveSuite Dockerfile which uses jinja2 templating to convert generic configuration files to application-specific files. For example, the generic start script looks like

#!/usr/bin/env bash
echo Starting linux application: {{APP_LABEL}}
export CLASSPATH={{CLASSPATH}}:$CLASSPATH;
java -Xmx1024m  -Ddk.netarkivet.settings.file=/nas/settings.xml -Dlogback.configurationFile=/nas/logback.xml {{APP_CLASS}}

and the templating engine just substitutes in for the three named placeholders.
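To make the substitution step concrete, here is a rough stand-in for what the templating does with this particular file. It is not j2cli and supports none of jinja2's other features; it only replaces {{NAME}} with the value of the environment variable NAME. The function name render_template is invented for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for j2 (an assumption for illustration only):
# read a template on stdin and replace each {{NAME}} placeholder with
# the value of the environment variable NAME, for every NAME given
# as an argument.
render_template() {
    local content name pattern
    content=$(cat)                       # whole template from stdin
    for name in "$@"; do
        pattern="{{${name}}}"
        # Quoted pattern and replacement are both treated literally.
        content=${content//"$pattern"/"${!name}"}
    done
    printf '%s\n' "$content"
}
```

With APP_LABEL=GUIApplication exported, piping the echo line of the start script through `render_template APP_LABEL` yields `echo Starting linux application: GUIApplication`. Text such as $CLASSPATH that is not a placeholder passes through unchanged.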

The Dockerfile looks like 

FROM mlaccetti/docker-oracle-java8-ubuntu-16.04

ADD https://sbforge.org/nexus/service/local/repositories/releases/content/org/netarchivesuite/distribution/5.2.2/distribution-5.2.2.zip  nas.zip
ADD https://sbforge.org/nexus/service/local/repositories/releases/content/org/netarchivesuite/heritrix3-bundler/5.2.2/heritrix3-bundler-5.2.2.zip  h3bundler.zip
RUN apt-get update && apt-get install -y ca-certificates unzip postgresql-client python-setuptools && easy_install j2cli

RUN unzip nas.zip -d nas
RUN unzip h3bundler.zip
RUN mv heritrix-3* bundler
RUN mv bundler/lib/* /nas/lib
WORKDIR /nas

COPY *.j2 /nas/
COPY wait-for-postgres.sh /nas/wait-for-postgres.sh
COPY jmxremote.password /nas/jmxremote.password
COPY docker-entrypoint.sh /
COPY h3server.jks /
RUN chmod 755 /nas/*.j2
RUN chmod 755 /nas/wait-for-postgres.sh
RUN chmod 755 /docker-entrypoint.sh
EXPOSE 8078

CMD ["/docker-entrypoint.sh"]

As base it uses an ubuntu image with preinstalled Java 8. I copy NetarchiveSuite 5.2.2 into the image and install the jinja2 command line (j2cli). Then comes a little bit of unpacking and renaming of some NetarchiveSuite files. I expose port 8078. Actually this is really only necessary for the GUI and ViewerProxy applications. Finally I define the command to be run by the container when it is started - docker-entrypoint.sh. What does this script actually do? 

#!/bin/bash -e
# Adapted from https://github.com/tryolabs/nginx-docker/blob/master/docker-entrypoint.sh
for f in $(find /nas -type f -name "*.j2"); do
    echo -e "Evaluating template\n\tSource: $f\n\tDest: ${f%.j2}"
    j2 "$f" > "${f%.j2}"
    rm -f "$f"
done
chmod 755 /nas/start.sh
/nas/start.sh

It applies the templates and starts the NetarchiveSuite application.
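The expression ${f%.j2} is what turns a template name into its destination name: it strips the shortest trailing match of ".j2". The following dry-run sketch lists the source and destination pairs for a directory without rendering or deleting anything. The function name list_template_targets is invented for this illustration.

```shell
#!/usr/bin/env bash
# Dry run (illustrative only): show which file each *.j2 template
# under a directory would be rendered to, without touching anything.
list_template_targets() {
    local dir=$1 f
    find "$dir" -type f -name '*.j2' | sort | while IFS= read -r f; do
        # ${f%.j2} removes the trailing ".j2" from the path
        printf '%s -> %s\n' "$f" "${f%.j2}"
    done
}
```

For a directory holding start.sh.j2 and settings.xml.j2 this prints one line per template, e.g. a path ending in start.sh.j2 mapped to the same path ending in start.sh.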

Note that on its own this will always fail, because every NetarchiveSuite application requires as an absolute minimum that the JMS broker is also running. So how do we coordinate all that?

Putting it all together with Docker-Compose

Docker-compose is a magical application that takes a single file (in yaml format) that specifies all the different Docker containers your application needs, the dependencies amongst them, and which exposed ports they use to talk to each other. For example, for NetarchiveSuite we could start with

version: "3"
services:
  database:
    build: nasdb
    ports:
      - 5432 
  mq:
    image: seges/openmq
    ports:
      - 7676
  ftp:
    image: andrewvos/docker-proftpd
    ports:
      - 20
      - 21
      - "21100-21110:21100-21110"
    environment:
      - USERNAME=jms
  nasgui:
    build: nasapp
    ports:
      - "8078:8078"
    links:
      - database
    depends_on:
      - database
      - mq
    environment:
      - APP_LABEL=GUIApplication
      - APP_CLASS=dk.netarkivet.common.webinterface.GUIApplication
      - CLASSPATH=/nas/lib/netarchivesuite-monitor-core.jar:/nas/lib/netarchivesuite-harvest-scheduler.jar:/nas/lib/netarchivesuite-harvester-core.jar:/nas/lib/netarchivesuite-archive-core.jar
    command: ["/nas/wait-for-postgres.sh", "database", "--", "/docker-entrypoint.sh"]

This defines three services (the database, the JMS broker and the ftp server)  and then a single NetarchiveSuite application.

Note the "ports" entry on the database service, for example. The nasgui application is able to access the database on the url jdbc:postgresql://database:5432 . Docker and docker-compose ensure that the name "database" is mapped to the actual container and that port 5432, which is otherwise only visible inside the database container, is reachable from the nasgui container. (Strictly speaking, it is the shared network created by docker-compose that makes this possible; a bare entry such as 5432 under "ports" additionally publishes the port on a randomly chosen port of the host.) All this magic happens behind the scenes. As far as the humble programmer is concerned, nasgui is just connecting to a named machine on the usual port 5432.

Note however that the GUI specifies a port "8078:8078". This means that the internally exposed port 8078 is mapped to the real port 8078 on the host machine, which is where the GUI can actually be seen.

Finally there is the script "wait-for-postgres.sh". This is just a little script that polls the postgres database and waits until postgres is started before calling docker-entrypoint.sh.
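The polling idea can be sketched as below. This is not the wait-for-postgres.sh from the repository, just a minimal generic retry loop under stated assumptions: the function name wait_until and its arguments are made up here, and the use of pg_isready as the probe is only a suggestion that relies on the postgresql-client package being installed in the image.

```shell
#!/usr/bin/env bash
# Generic polling helper (hypothetical, not the repository's script).
# Usage: wait_until MAX_ATTEMPTS DELAY_SECONDS PROBE [ARGS...]
# Runs PROBE repeatedly until it succeeds or MAX_ATTEMPTS is reached.
wait_until() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Possible use inside the container (assumes pg_isready is available):
#   wait_until 30 1 pg_isready -h database -p 5432 && exec /docker-entrypoint.sh
```

Bounding the number of attempts means a misconfigured database host makes the container fail visibly instead of hanging forever.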

The full docker-compose.yml file contains all the necessary applications to create a fully-functional containerised NetarchiveSuite instance.
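For reference, a typical session with such a file might look like the sketch below. The docker-compose subcommands are standard ones, but the wrapper functions and the DRY_RUN switch are inventions for this example, so that the sequence can be inspected without Docker installed. It assumes it is run from the directory containing docker-compose.yml.

```shell
#!/usr/bin/env bash
# Illustrative wrappers (hypothetical names). Set DRY_RUN=1 to print
# the commands instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_stack() {
    run docker-compose build      # build the nasdb and nasapp images
    run docker-compose up -d      # start all services in the background
    run docker-compose ps         # list the running containers
}

follow_gui_logs() { run docker-compose logs -f nasgui; }

stop_stack() { run docker-compose down; }
```

After start_stack has finished, the GUI should be reachable on port 8078 of the host, as described above.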

Note that the machine names are the container IDs supplied by Docker, which is what the JVM sees as its hostname. Not especially useful!

What Else?


One advantage of using something standardised like Docker and Docker-Compose is that you can leverage standardised tools. For example, the Rancher tool provides a dashboard for all your Docker containers and has support for deploying docker-compose builds to multiple hosts (supposedly - I didn't test it).