
We have elected to change the default fcrepo backend. This decision was not reached lightly. What forced us to take this path was the inability of our backup solutions to cope with the many small files generated by fcrepo: the random-access writes to these files forced the backup systems to scan every file for changes.

Integrating other storage implementations with Fcrepo

Fcrepo has a storage interface called Akubra: https://github.com/akubra/akubra

Using this storage interface, we can integrate arbitrary storage solutions with Fcrepo. The interface is split into three Java classes: BlobStore, BlobStoreConnection and Blob. The basic design is that the BlobStore is created as a singleton by the Fcrepo server. To work with blobs, the BlobStore is asked to open a connection, a BlobStoreConnection. Through this connection, Blobs can be read and written.

When requesting a Blob from the BlobStoreConnection, a Blob object is returned even if no such blob exists yet. Like a File object, it has an exists() method. You can then open input and output streams on the blob.
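
As a minimal sketch of this call chain (the blob id here is made up, and null is passed for the transaction and the hints maps, which the Akubra interfaces allow for non-transactional stores):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.akubraproject.Blob;
import org.akubraproject.BlobStore;
import org.akubraproject.BlobStoreConnection;

public class AkubraUsageSketch {

    // Write some content under a blob id and read it back. A Blob handle is
    // returned even when nothing is stored under that id yet.
    static void writeAndRead(BlobStore store, byte[] content) throws Exception {
        BlobStoreConnection connection = store.openConnection(null, null);
        try {
            Blob blob = connection.getBlob(URI.create("info:fedora/demo:1"), null);
            if (!blob.exists()) {
                // The size is only a hint; overwrite=true permits replacement.
                OutputStream out = blob.openOutputStream(content.length, true);
                out.write(content);
                out.close();
            }
            InputStream in = blob.openInputStream();
            // ... consume the stream ...
            in.close();
        } finally {
            connection.close();
        }
    }
}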

Tapes, the basic design

No data will ever be overwritten. Every write creates a new instance of an object. This is the fundamental invariant in the tape design. 

The tapes are tar files. They can be said to exist in a long chain. Each tape is named after the time it was created. Only the newest tape can be written to. When the newest tape reaches a certain size, it is closed and a new tape is started; this new tape is now the newest tape.
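
A sketch of this rotation rule, with made-up class and method names (the size limit corresponds to the 10MB value in the configuration at the end of this page):

import java.io.File;

public class TapeRotationSketch {

    private final File tapeFolder;
    private final long maxTapeSize; // e.g. 10485760 bytes, as in the config below
    private File newestTape;

    TapeRotationSketch(File tapeFolder, long maxTapeSize) {
        this.tapeFolder = tapeFolder;
        this.maxTapeSize = maxTapeSize;
    }

    // Called under the write lock before appending: once the newest tape has
    // grown past the limit, start a fresh one named after the current time.
    File tapeForNextWrite() {
        if (newestTape == null || newestTape.length() >= maxTapeSize) {
            newestTape = new File(tapeFolder,
                    "tape#" + System.currentTimeMillis() + ".tar");
        }
        return newestTape;
    }
}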

Only one thread can write to the tape system at a time.

A separate index is maintained. This index retains the mapping between an object identifier and the newest instance of that object (i.e. tape name and offset into the tape).

Tape tar files, locking and the like

The tapes are tar files. To understand the following, see http://en.wikipedia.org/wiki/Tar_(computing)

The tapes are read with a library called JTar, see https://github.com/blekinge/jtar

Each object instance is stored under the name "<objectId>#<currentTimeMillis>" in the tape. The exact naming is not really important, but the name should contain the objectId so that reindexing of the tapes is possible. To help people reading the tapes with normal tar tools, the names should also be unique (at least inside a tape, if not globally), as extracting the contents becomes annoying otherwise; the tape system itself, however, does not care.

The tapes themselves are named "tape#<currentTimeMillis>.tar"

When an output stream is opened to a blob, the global write lock is acquired by that thread. As Fedora does not tell the blob how much data it is going to write, the output stream buffers the written data until the stream is closed. When the stream is closed, the buffer is written to the newest tape as a new tar entry, the object instance is registered in the index, and lastly the write lock is released.
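
A sketch of this write path; TapeOutputStreamSketch and its helpers are hypothetical names, and a plain ByteArrayOutputStream stands in for the real buffering scheme described next:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.concurrent.locks.ReentrantLock;

public class TapeOutputStreamSketch extends ByteArrayOutputStream {

    // The single global write lock described above.
    private static final ReentrantLock WRITE_LOCK = new ReentrantLock();

    private final String objectId;

    TapeOutputStreamSketch(String objectId) {
        this.objectId = objectId;
        WRITE_LOCK.lock(); // acquired when the stream is opened
    }

    @Override
    public void close() throws IOException {
        try {
            // appendToNewestTape and index are hypothetical stand-ins for
            // "append the buffer as a new tar entry" and "record tape+offset".
            long offset = appendToNewestTape(objectId, toByteArray());
            index(objectId, offset);
        } finally {
            WRITE_LOCK.unlock(); // released only after the entry is on tape
        }
    }

    private long appendToNewestTape(String id, byte[] data) throws IOException {
        return 0; // sketch only
    }

    private void index(String id, long offset) {
        // sketch only
    }
}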

Each output stream has a 1MB buffer by default. If the system attempts to write more content than the remaining bytes in the buffer, the buffer is marked as finished and a new buffer of size max(1MB, sizeNeeded) is allocated. So, if you write 1 byte, 1 byte of the 1MB buffer is used. If you then write 1MB in one operation, the default buffer is finished and a new 1MB buffer is allocated and filled with the 1MB written; almost 1MB is wasted in the first buffer.
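
The growth rule itself, as a sketch:

public class BufferPolicySketch {

    static final int DEFAULT_BUFFER_SIZE = 1024 * 1024; // 1 MB

    // When a write does not fit in the remaining space of the current
    // buffer, that buffer is finished and a new one of this size is used.
    static byte[] allocateNextBuffer(int sizeNeeded) {
        return new byte[Math.max(DEFAULT_BUFFER_SIZE, sizeNeeded)];
    }
}

In the example above, the second write of 1MB does not fit in the remaining 1MB - 1 bytes, so allocateNextBuffer(1048576) returns a fresh 1MB buffer and only 1 byte of the first buffer is ever used.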

It is in principle not necessary to acquire the write lock until the stream is closed, but it is acquired when the stream is opened. If the write lock were only acquired on closing, the stream would need to determine which tape is the newest tape at that time; by acquiring the lock at open time, the stream can be handed this information up front. Since Fcrepo seems to burst-write to the disk, neither deadlocks nor even slowdowns have been observed.

Reading is done by querying the index for the tape name and offset. With this information, an input stream can be opened to the exact entry in the relevant tape; no locking is necessary for reading. After skipping to the correct offset in the tape, we read until we reach a tar record header, which ought to be after 0 bytes. We then read the tar record header to get the size of the record, and return an input stream starting from this position and stopping when the entire record has been read (to prevent the user from reading into the next record). At no point do we examine the name of the tar record being read.
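
A sketch of this read path using plain Java I/O (the real system goes through JTar, but the header layout is standard tar: a 512-byte header block with the record size stored as an octal string at bytes 124-135):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TapeReadSketch {

    // Open a stream over one record, given the (tape, offset) pair from the
    // index. Only the size field of the header is examined; the entry name
    // is ignored, exactly as described above.
    static InputStream openRecord(String tapeFile, long offset) throws IOException {
        FileInputStream fis = new FileInputStream(tapeFile);
        long toSkip = offset;
        while (toSkip > 0) {
            long skipped = fis.skip(toSkip);
            if (skipped <= 0) throw new IOException("cannot seek to offset " + offset);
            toSkip -= skipped;
        }
        byte[] header = new byte[512]; // a tar header is one 512-byte block
        new DataInputStream(fis).readFully(header);
        long size = Long.parseLong(
                new String(header, 124, 11, StandardCharsets.US_ASCII).trim(), 8);
        return boundedStream(fis, size); // stop at the end of the record
    }

    // Cap a stream at 'limit' bytes so the caller cannot read into the next record.
    static InputStream boundedStream(InputStream in, final long limit) {
        return new FilterInputStream(in) {
            long remaining = limit;
            @Override public int read() throws IOException {
                if (remaining <= 0) return -1;
                int b = super.read();
                if (b >= 0) remaining--;
                return b;
            }
            @Override public int read(byte[] b, int off, int len) throws IOException {
                if (remaining <= 0) return -1;
                int n = super.read(b, off, (int) Math.min(len, remaining));
                if (n > 0) remaining -= n;
                return n;
            }
        };
    }
}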

A tape is marked as indexed (in the index described below) when it is closed and a new tape is started. As will be explained, tapes that are marked as indexed will not be re-read upon server startup.

The Index

For the index, a separate system called Redis (http://redis.io/) is used.

A client, Jedis (https://github.com/xetorthio/jedis), is used as the interface to it.

The metadata repository now requires the existence of a Redis instance in order to function.

The index implementation has to provide the following methods:

(tape, offset) getLocation(objectID)

setLocation(objectID, tape, offset)

iterator<objectID> list(idPrefix)

remove(objectID)

boolean isTapeIndexed(tape)

setTapeIndexed(tape)
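
Rendered as a hypothetical Java interface (the names Index and Entry are illustrative, not the actual class names in the implementation):

import java.util.Iterator;

public interface Index {

    /** The (tape, offset) location of the newest instance of the object. */
    Entry getLocation(String objectId);

    void setLocation(String objectId, String tape, long offset);

    /** Iterate all known object ids starting with the given prefix. */
    Iterator<String> list(String idPrefix);

    void remove(String objectId);

    boolean isTapeIndexed(String tape);

    void setTapeIndexed(String tape);

    class Entry {
        public final String tape;
        public final long offset;
        public Entry(String tape, long offset) {
            this.tape = tape;
            this.offset = offset;
        }
    }
}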

 

The Redis instance holds a number of keys and sets.

Firstly, we have the string keys. These map an object id to a string of the form tape#offset. There is one such key for each objectID; lookup and writing of these keys is nearly O(1).

Secondly, we have the sorted set called "buckets". It holds references (names, really) to a number of other sorted sets, each of which holds a number of objectIds. Each objectId added to the index is hashed, and the first 4 characters of the hash value determine which of these buckets the id is added to. Each bucket is named solely by the 4 hash characters shared by the ids it holds. The purpose of this somewhat complex structure is to be able to iterate through the objectIDs in DOMS while allowing paging.

The last data structure is the set "tapes". It contains the names of all tapes that have been indexed so far. Membership lookup and adding to the set are fast operations.
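
A sketch of how these three structures could be maintained through Jedis; the hash function is not specified in this document, so MD5 is used here purely for illustration:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import redis.clients.jedis.Jedis;

public class RedisIndexSketch {

    private final Jedis jedis = new Jedis("localhost", 6379);

    RedisIndexSketch() {
        jedis.select(0); // database 0, as in the configuration below
    }

    // One string key per objectId plus the bucket bookkeeping described above.
    void setLocation(String objectId, String tape, long offset) {
        jedis.set(objectId, tape + "#" + offset);        // near-O(1) lookup key
        String bucket = hash(objectId).substring(0, 4);  // first 4 hash characters
        jedis.zadd(bucket, 0, objectId);                 // the bucket holding the id
        jedis.zadd("buckets", 0, bucket);                // register the bucket itself
    }

    // The plain set of tapes that have been fully indexed.
    void setTapeIndexed(String tape) {
        jedis.sadd("tapes", tape);
    }

    boolean isTapeIndexed(String tape) {
        return jedis.sismember("tapes", tape);
    }

    private static String hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}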

Error recovery

Upon startup, the server goes through the following process:

  1. It lists all the tapes in the given tape folder and sorts them by name. The tapes should therefore be named in a fashion where sorting by name amounts to sorting by creation time.
  2. For each tape, it checks whether the index marks that tape as indexed.
  3. If the tape is not indexed, it is read through and indexed.
  4. Any further tapes in the list are indexed the same way (see the sketch below).
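
A sketch of this startup loop, reusing the hypothetical Index interface from above:

import java.io.File;
import java.util.Arrays;

public class StartupRecoverySketch {

    private final Index index; // the index contract sketched earlier

    StartupRecoverySketch(Index index) {
        this.index = index;
    }

    // Steps 1-4 above: sort tapes by name (= by creation time, given the
    // naming scheme) and reindex everything from the first unindexed tape on.
    void recover(File tapeFolder) throws Exception {
        File[] tapes = tapeFolder.listFiles();
        Arrays.sort(tapes); // File sorts by path name
        boolean reindexing = false;
        for (File tape : tapes) {
            if (reindexing || !index.isTapeIndexed(tape.getName())) {
                reindexing = true;
                indexTape(tape); // see the next section
            }
        }
    }

    private void indexTape(File tape) { /* sketched in the next section */ }
}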

Indexing a tape

The tape is read through from the beginning. This process is fast, as the actual record contents can be skipped over; the relevant information is the objectIds, the tape name and the offset of each record. For each tar record, the index is updated in the same way as it would have been when the record was written. This update overwrites any entry that already existed in the index. As the tapes are read in order of creation time (encoded in the tape name) and the records in a tape are written in order, once all entries concerning a given objectId have been indexed, the index is guaranteed to hold the information about the newest instance of that object.
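
A sketch of indexing a single tape by walking the standard tar header layout (the real system uses JTar for this); 'index' is the contract sketched earlier:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class TapeIndexerSketch {

    private final Index index;

    TapeIndexerSketch(Index index) {
        this.index = index;
    }

    // Later records overwrite earlier ones in the index, so the index ends
    // up pointing at the newest instance of each object.
    void indexTape(File tape) throws IOException {
        DataInputStream in = new DataInputStream(new FileInputStream(tape));
        byte[] header = new byte[512];
        long offset = 0;
        while (true) {
            in.readFully(header);
            if (header[0] == 0) break; // zero blocks terminate a tar archive
            // Entry name: up to 100 NUL-padded bytes, form "<objectId>#<millis>".
            String name = new String(header, 0, 100, StandardCharsets.US_ASCII).trim();
            long size = Long.parseLong(
                    new String(header, 124, 11, StandardCharsets.US_ASCII).trim(), 8);
            String objectId = name.substring(0, name.indexOf('#'));
            index.setLocation(objectId, tape.getName(), offset);
            // Skip the body: it is padded up to whole 512-byte blocks.
            long padded = ((size + 511) / 512) * 512;
            long toSkip = padded;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) throw new EOFException("truncated tape");
                toSkip -= skipped;
            }
            offset += 512 + padded;
        }
        in.close();
        // The caller marks the tape as indexed afterwards, except for the
        // newest tape (see below).
    }
}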

When a tape has been indexed, it is marked as such in the index.

The newest tape will be indexed, but will not be marked as indexed.

Broken tapes

Sometimes a tape can become broken (not observed yet). This is most likely to happen when the server terminates abnormally and leaves a tape with half a record. Upon startup, if the property "fixErrors" is set, the newest tape, and only the newest tape, is read through to check for this. No tape but the newest should be able to contain broken records, barring a general failure of the underlying filesystem, so the older tapes are not checked.

If an IOException occurs while reading the tape, the system regards the tape as broken and attempts to fix it. The fix is rather brutal. Every record, from the beginning of the tape, is read and written to a new temp tape, until the broken tape either runs out of bytes or the IOException reoccurs. This is done in a way that ensures the temp tape only contains complete records. When no more can be read from the broken tape, it is deleted and replaced with the temp tape.
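
A sketch of the repair, assuming the TarInputStream/TarOutputStream classes from the JTar library linked above (package name as in the upstream JTar; the fork may differ):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.kamranzafar.jtar.TarEntry;
import org.kamranzafar.jtar.TarInputStream;
import org.kamranzafar.jtar.TarOutputStream;

public class TapeRepairSketch {

    // Copy complete records from the broken tape to a temp tape until the
    // read fails, then swap the files.
    void fixBrokenTape(File broken) throws IOException {
        File temp = File.createTempFile("tape", ".tar", broken.getParentFile());
        TarInputStream in = new TarInputStream(
                new BufferedInputStream(new FileInputStream(broken)));
        TarOutputStream out = new TarOutputStream(
                new BufferedOutputStream(new FileOutputStream(temp)));
        try {
            TarEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = in.getNextEntry()) != null) {
                // Buffer the whole record first, so a half-written record
                // can never end up on the temp tape.
                ByteArrayOutputStream record = new ByteArrayOutputStream();
                int n;
                while ((n = in.read(buf)) > 0) {
                    record.write(buf, 0, n);
                }
                out.putNextEntry(entry);
                out.write(record.toByteArray());
            }
        } catch (IOException e) {
            // Expected for a broken tape: stop at the first unreadable record.
        } finally {
            in.close();
            out.close();
        }
        if (!broken.delete() || !temp.renameTo(broken)) {
            throw new IOException("Could not replace " + broken + " with repaired tape");
        }
    }
}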

Rebuild

If the archive is set to rebuild, it will flush (as in flush the toilet) the content of the Redis database upon startup. This means that no tape will be listed as indexed, so the entire tape archive will be reindexed. The Redis client has rather strict timeouts on commands, and at times the flush will not return in time; in that case the server will fail to start. Simply restarting the server afterwards is the cure, as the Redis instance should be finished flushing by then.
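
The flush corresponds to Redis' FLUSHDB command; through Jedis, with the host, port and database from the configuration below, it amounts to something like:

import redis.clients.jedis.Jedis;

public class FlushSketch {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost", 6379);
        jedis.select(0); // the configured database number
        // FLUSHDB drops every key in the database: the location keys,
        // the buckets and the "tapes" set, forcing a full reindex.
        jedis.flushDB();
    }
}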

Purged objects

The tapes do support purging objects. When a blob is deleted, a new instance of 0 bytes is written to the tape, and this record is named "<objectId>#<currentTimeMillis>#DELETED". The object is also removed from the index at that point. When indexing a tape (see above), records of 0 bytes with names ending in "#DELETED" cause the objectId to be removed from the index.
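
In the indexing sketch above, this adds one branch (same hypothetical names):

// Inside the indexing loop: a zero-byte record whose name ends in
// "#DELETED" removes the object instead of updating its location.
if (name.endsWith("#DELETED") && size == 0) {
    index.remove(objectId);
} else {
    index.setLocation(objectId, tape.getName(), offset);
}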

Packaging the old objects for Tapes

A traditional fcrepo backend can easily be converted to use tapes. To do so, the Fedora server must be stopped (to ensure that the objects are not changed during the conversion).

The script "tarExisting.sh" is included with the distribution. It takes two arguments, the folder to tape and the name of the tape. As mentioned above, the tapes should be named "tape#<currentTimeMillis>.tar". For each folder containing objects for fcrepo, run the command "./tarExisiting.sh <folder> "tape#$(date '+%s%N').tar", and put the resulting tape in the configured tape folder for Fcrepo.

Spring Configuration

The system is configured from the file "akubra-llstore.xml", which is a Spring configuration file. It is reproduced below with comments.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd">
<beans>
    <!--Standard-->
    <bean name="org.fcrepo.server.storage.lowlevel.ILowlevelStorage"
          class="org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorageModule">
        <constructor-arg index="0">
            <map/>
        </constructor-arg>
        <constructor-arg index="1" ref="org.fcrepo.server.Server"/>
        <constructor-arg index="2" type="java.lang.String"
                         value="org.fcrepo.server.storage.lowlevel.ILowlevelStorage"/>
        <property name="impl"
                  ref="org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorage"/>
    </bean>
    <bean
            name="org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorage"
            class="org.fcrepo.server.storage.lowlevel.akubra.AkubraLowlevelStorage"
            singleton="true">
        <constructor-arg>
            <description>The store of serialized Fedora objects</description>
            <ref bean="tapeObjectStore"/>
            <!--Here we reference our tape system-->
        </constructor-arg>
        <constructor-arg>
            <description>The store of datastream content</description>
            <ref bean="datastreamStore"/>
        </constructor-arg>
        <constructor-arg value="false"><!--This is set to false, as we do not ever delete stuff-->
            <description>if true, replaceObject calls will be done in a way
                that
                ensures the old content is not deleted until the new content is safely
                written. If the objectStore already does this, this should be
                given as
                false
            </description>
        </constructor-arg>
        <constructor-arg value="true">
            <description>same as above, but for datastreamStore</description>
        </constructor-arg>
    </bean>
    <!--This is the tape store Akubra Implementation-->
    <bean name="tapeObjectStore" class="dk.statsbiblioteket.metadatarepository.xmltapes.XmlTapesBlobStore"
          singleton="true">
        <constructor-arg value="urn:example.org:tapeObjectStore"/>
        <!--This parameter is the name of the storage. -->
        <property name="archive" ref="tarTapeObjectStore"/>
        <!--And this is the reference to the actual implementation-->
    </bean>
    <!--The guts of the tape system-->
    <bean name="tarTapeObjectStore" class="dk.statsbiblioteket.metadatarepository.xmltapes.TapeArchive"
          init-method="rebuild"
          singleton="true">
        <!--The init method above is special: it can have the value "init" or "rebuild". Rebuild just flushes the index before running init.-->
        <!--This constructor argument specifies the tape store location. -->
        <constructor-arg value="file:/CHANGEME/tapeObjectStore" type="java.net.URI"/>
        <!--This specifies the maximum length a tape can be before a new tape is started-->
        <constructor-arg value="10485760" type="long"/>
        <!--10 MB-->
        <!--This is the reference to the index-->
        <property name="index" ref="redisIndex"/>
		<property name="fixErrors" value="false"/>
    </bean>

    <!--This is our Redis index-->
    <bean name="redisIndex" class="dk.statsbiblioteket.metadatarepository.xmltapes.redis.RedisIndex"
          singleton="true">
        <!--The redis server-->
        <constructor-arg value="localhost"/>
        <!--The port it is running on-->
        <constructor-arg value="6379"/>
        <!--The database name. Redis databases are always identified by integers-->
        <constructor-arg value="0"/>
    </bean>

    <!--Standard storage for managed datastreams. We do not use managed datastreams-->
    <bean name="datastreamStore" class="org.akubraproject.map.IdMappingBlobStore"
          singleton="true">
        <constructor-arg value="urn:fedora:datastreamStore"/>
        <constructor-arg>
            <ref bean="fsDatastreamStore"/>
        </constructor-arg>
        <constructor-arg>
            <ref bean="fsDatastreamStoreMapper"/>
        </constructor-arg>
    </bean>
    <!--Standard storage for managed datastreams. We do not use managed datastreams-->
    <bean name="fsDatastreamStore" class="org.akubraproject.fs.FSBlobStore"
          singleton="true">
        <constructor-arg value="urn:example.org:fsDatastreamStore"/>
        <constructor-arg value="/CHANGEME/datastreamStore"/>
    </bean>
    <!--Standard storage for managed datastreams. We do not use managed datastreams-->
    <bean name="fsDatastreamStoreMapper"
          class="org.fcrepo.server.storage.lowlevel.akubra.HashPathIdMapper"
          singleton="true">
        <constructor-arg value="##"/>
    </bean>

    <bean name="fedoraStorageHintProvider"
          class="org.fcrepo.server.storage.NullStorageHintsProvider"
          singleton="true">
    </bean>
</beans>
