This project comes from Innovation Week March 2018. Code can be found at https://github.com/csrster/docker-csr
Docker
What is Docker? It is usually described as "a container". You can think of it as a sandbox in which you can run any kind of application isolated from the rest of your system. Or more precisely, a sandbox where you control the interaction with the rest of the system - e.g. which ports or directories the application can access. Unlike virtual machines, Docker containers are very light because they reuse the running kernel of the host system.
The Docker Way
The Docker Way is to use one container per service. So whereas in NetarchiveSuite we typically run several NAS applications, a database server and a JMS broker all on a single admin machine, the Docker Way would be to run each of these in its own container.
Docker Use Cases
There are multiple ways to use Docker:
- To create a throwaway service - e.g. a database server for a test instance
- To test your own application in a controlled sandbox
- To enable coordination between multiple applications and services, possibly mimicking a distributed application
- To create a controlled environment for deploying to a production system.
In this project I have looked at all but the last of these.
Services for NetarchiveSuite
NetarchiveSuite requires
- A postgres database server
- A JMS broker
- An ftp server
With Docker these can be fired up extremely quickly. The broker could be started directly from the command line
sudo docker run -p 7676 seges/openmq
The ftp server isn't much more complicated, except that we need to define a username and password. The database is a bit more work because we need to include the scripts to initialise the database schema and ingest some test-data. To do this we extend the base Docker image for postgresql with some scripts. The entire Dockerfile looks like
FROM postgres:9.3
COPY harvestdb/0* docker-entrypoint-initdb.d/
COPY harvestdb/data/* ./
RUN mkdir /tsindex
RUN chown postgres:postgres /tsindex
The first line just imports the basic postgres image. The next two lines copy in some scripts and sql-files to the docker image, while the last two lines create a directory for a tablespace used by the NetarchiveSuite harvestdatabase. The input directory structure just looks like
.
├── Dockerfile
└── harvestdb
    ├── 00harvestdb_setup.sql
    ├── 01harvestdb_setup.sh
    └── data
        ├── 01netarchivesuite_init.sql
        ├── 02harvestdb.testdata.sql
        └── 03createArchiveDB.pgsql
and the base postgres image ensures the two .sql and .sh files that are copied to the directory docker-entrypoint-initdb.d get executed in alphanumerically sorted order.
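The effect of the numeric prefixes can be checked without Docker. The following is a minimal sketch that lists the two init files in name-sorted order; the scratch directory and the variable names are assumptions made for illustration, only the file names come from the harvestdb directory.

```shell
#!/usr/bin/env bash
# Sketch: show that the numeric prefixes fix the order of the init scripts.
# The scratch directory is hypothetical; it only mimics
# docker-entrypoint-initdb.d for the purpose of listing file names.
set -eu

initdir="$(mktemp -d)"

# Create the files in the "wrong" order on purpose.
touch "$initdir/01harvestdb_setup.sh" "$initdir/00harvestdb_setup.sql"

# List the names sorted, joined onto one line with single spaces.
init_order="$(cd "$initdir" && ls | LC_ALL=C sort | paste -sd ' ' -)"
echo "Init scripts sorted by name: $init_order"

rm -rf "$initdir"
```

Renaming a file with a different prefix is therefore enough to move it earlier or later in the sequence.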
NetarchiveSuite Applications
All NetarchiveSuite applications are basically similar. To start one, you need to know the name of the Application class and provide an xml settings file, a logging configuration file, and a start script that provides the right classpath (a jmxremote.password file is also needed and the harvesters need a certificate file for the Heritrix 3 https GUI). I created a generic NetarchiveSuite Dockerfile which uses jinja2 templating to convert generic configuration files to application-specific files. For example, the generic start script looks like
#!/usr/bin/env bash
echo Starting linux application: {{APP_LABEL}}
export CLASSPATH={{CLASSPATH}}:$CLASSPATH;
java -Xmx1024m -Ddk.netarkivet.settings.file=/nas/settings.xml -Dlogback.configurationFile=/nas/logback.xml {{APP_CLASS}}
and the templating engine just substitutes in for the three named placeholders.
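To make the substitution concrete, here is a rough stand-in that fills the three placeholders with sed. It is only an illustration of the result, not how j2 works internally; the example values, the shortened template and the variable names (CLASSPATH_VALUE, template, rendered) are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: mimic the placeholder substitution with sed instead of j2.
# This is a stand-in for illustration only.
set -eu

# Example values (assumed); in the real setup they are supplied per application.
APP_LABEL="GUIApplication"
APP_CLASS="dk.netarkivet.common.webinterface.GUIApplication"
CLASSPATH_VALUE="/nas/lib/netarchivesuite-monitor-core.jar"

# Shortened version of the generic start script (java options left out).
template='echo Starting linux application: {{APP_LABEL}}
export CLASSPATH={{CLASSPATH}}:$CLASSPATH;
java {{APP_CLASS}}'

# Replace each named placeholder with its value.
rendered="$(printf '%s\n' "$template" | sed \
    -e "s|{{APP_LABEL}}|$APP_LABEL|g" \
    -e "s|{{CLASSPATH}}|$CLASSPATH_VALUE|g" \
    -e "s|{{APP_CLASS}}|$APP_CLASS|g")"

printf '%s\n' "$rendered"
```

Note that the literal $CLASSPATH in the template is left untouched, so the generated script still appends whatever classpath is set when it runs.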
The Dockerfile looks like
FROM mlaccetti/docker-oracle-java8-ubuntu-16.04
ADD https://sbforge.org/nexus/service/local/repositories/releases/content/org/netarchivesuite/distribution/5.2.2/distribution-5.2.2.zip nas.zip
ADD https://sbforge.org/nexus/service/local/repositories/releases/content/org/netarchivesuite/heritrix3-bundler/5.2.2/heritrix3-bundler-5.2.2.zip h3bundler.zip
RUN apt-get update && apt-get install -y ca-certificates unzip postgresql-client python-setuptools && easy_install j2cli
RUN unzip nas.zip -d nas
RUN unzip h3bundler.zip
RUN mv heritrix-3* bundler
RUN mv bundler/lib/* /nas/lib
WORKDIR /nas
COPY *.j2 /nas/
COPY wait-for-postgres.sh /nas/wait-for-postgres.sh
COPY jmxremote.password /nas/jmxremote.password
COPY docker-entrypoint.sh /
COPY h3server.jks /
RUN chmod 755 /nas/*.j2
RUN chmod 755 /nas/wait-for-postgres.sh
RUN chmod 755 /docker-entrypoint.sh
EXPOSE 8078
CMD ["/docker-entrypoint.sh"]
As base it uses an ubuntu image with preinstalled Java 8. I copy NetarchiveSuite 5.2.2 into the image and install the jinja2 command line (j2cli). Then comes a little bit of unpacking and renaming of some NetarchiveSuite files. I expose port 8078. Actually this is really only necessary for the GUI and ViewerProxy applications. Finally I define the command to be run by the container when it is started - docker-entrypoint.sh. What does this script actually do?
#!/bin/bash -e
# Adapted from https://github.com/tryolabs/nginx-docker/blob/master/docker-entrypoint.sh
for f in $(find /nas -type f -name "*.j2"); do
    echo -e "Evaluating template\n\tSource: $f\n\tDest: ${f%.j2}"
    j2 "$f" > "${f%.j2}"
    rm -f "$f"
done
chmod 755 /nas/start.sh
/nas/start.sh
It applies the templates and starts the NetarchiveSuite application.
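The destination name of each template comes from the shell expansion ${f%.j2}, which strips a trailing ".j2". A small sketch of just that step follows; strip_j2 is a hypothetical helper name, and the template names start.sh.j2 and settings.xml.j2 are assumed from the files the start script refers to.

```shell
#!/usr/bin/env bash
# Sketch: how "${f%.j2}" derives the destination name from a template name.
set -eu

# strip_j2 is a hypothetical helper, named here only for illustration.
strip_j2() {
    local f="$1"
    # "%.j2" removes the shortest match of ".j2" from the end of the value.
    printf '%s\n' "${f%.j2}"
}

strip_j2 /nas/start.sh.j2      # prints /nas/start.sh
strip_j2 /nas/settings.xml.j2  # prints /nas/settings.xml
```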
Note that on its own this will always fail, because every NetarchiveSuite application requires as an absolute minimum that the JMS broker is also running. So how do we coordinate all that?
Putting it all together with Docker-Compose
Docker-compose is a magical application that takes a single file (in yaml format) that specifies all the different Docker containers your application needs, the dependencies amongst them, and which exposed ports they use to talk to each other. For example, for NetarchiveSuite we could start with
version: "3"
services:
  database:
    build: nasdb
    ports:
      - 5432
  mq:
    image: seges/openmq
    ports:
      - 7676
  ftp:
    image: andrewvos/docker-proftpd
    ports:
      - 20
      - 21
      - "21100-21110:21100-21110"
    environment:
      - USERNAME=jms
  nasgui:
    build: nasapp
    ports:
      - "8078:8078"
    links:
      - database
    depends_on:
      - database
      - mq
    environment:
      - APP_LABEL=GUIApplication
      - APP_CLASS=dk.netarkivet.common.webinterface.GUIApplication
      - CLASSPATH=/nas/lib/netarchivesuite-monitor-core.jar:/nas/lib/netarchivesuite-harvest-scheduler.jar:/nas/lib/netarchivesuite-harvester-core.jar:/nas/lib/netarchivesuite-archive-core.jar
    command: ["/nas/wait-for-postgres.sh", "database", "--", "/docker-entrypoint.sh"]
This defines three services (the database, the JMS broker and the ftp server) and then a single NetarchiveSuite application.
Note the "ports" variable on the database service, for example. This means that the nasgui application will be able to access the database on the url jdbc:postgresql://database:5432 . Docker and docker-compose ensure that the name "database" is mapped to the actual container and that the port 5432, which is otherwise only visible inside the database container, is made available to the nasgui container. All this magic happens behind the scenes. As far as the humble programmer is concerned, the nasgui is just connecting to a named machine on the usual port 5432.
Note however that the GUI specifies a port "8078:8078". This means that the internally exposed port 8078 is mapped to the real port 8078 on the host machine, which is where the GUI can actually be seen.
Finally there is the script "wait-for-postgres.sh". This is just a little script that polls the postgres database and waits until postgres is started before calling docker-entrypoint.sh.
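The polling idea can be sketched as a small retry loop. This is not the script from the repository: wait_for, its arguments and the retry limit are hypothetical, and the pg_isready probe shown in the closing comment is just one possible check, assuming the postgresql-client tools are installed in the image.

```shell
#!/usr/bin/env bash
# Sketch of a polling loop in the spirit of wait-for-postgres.sh.
# Not the repository's script; wait_for and its arguments are hypothetical.
set -u

# wait_for MAX_TRIES DELAY_SECONDS PROBE...
# Runs PROBE repeatedly until it succeeds or MAX_TRIES attempts have failed.
wait_for() {
    local max_tries="$1" delay="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "service not ready (attempt $attempt), sleeping" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Inside the nasgui container the probe could be, for example:
#   wait_for 30 1 pg_isready -h database -p 5432 && exec /docker-entrypoint.sh
```

Keeping the probe as an argument means the same loop could wait for the JMS broker or the ftp server with a different check, and the retry limit avoids hanging forever if the database never comes up.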
The full docker-compose.yml file contains all the necessary applications to create a fully-functional containerised NetarchiveSuite instance.
What Else?
One advantage of using something standardised like Docker and Docker-Compose is that you can leverage standardised tools. For example the Rancher tool provides a dashboard for all your Docker containers, and has support for deploying docker-compose builds to multiple hosts (supposedly - I didn't test it). Some screenshots