We have elected to change the default fcrepo backend. This decision was not reached lightly, however. What forced us to take this path was the inability of the backup solutions we use to cope with the many small files generated by fcrepo. The random-access nature of the writes forced the backup systems to scan all the files for changes.
Integrating other storage implementations with Fcrepo
Fcrepo has a storage interface called Akubra (https://github.com/akubra/akubra).
Using this storage interface, we can integrate arbitrary storage solutions with Fcrepo. The interface is split into three Java interfaces: BlobStore, BlobStoreConnection and Blob. The basic design is that the BlobStore is created as a singleton by the Fcrepo server system. To work with blobs, the BlobStore is asked to open a connection, a BlobStoreConnection. From this connection, Blobs can be read and written.
When requesting a Blob from the BlobStoreConnection, a Blob object is returned even if the underlying data does not exist. Like a File object, it has an exists() method. You can then open input and output streams on the blob.
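To make the flow concrete, here is a minimal sketch of how these three interfaces are used, assuming the standard Akubra API (org.akubraproject); exact method signatures may vary between Akubra versions, and the object id below is only an example:

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.akubraproject.Blob;
    import org.akubraproject.BlobStore;
    import org.akubraproject.BlobStoreConnection;

    public class AkubraUsageSketch {
        // 'store' is the singleton BlobStore created by the Fcrepo server system.
        static void readAndWrite(BlobStore store) throws Exception {
            // Open a connection; no transaction and no hints are passed here.
            BlobStoreConnection connection = store.openConnection(null, null);
            try {
                // A Blob object is returned even if no data exists for this id yet.
                Blob blob = connection.getBlob(URI.create("info:fedora/demo:1/DC"), null);
                if (blob.exists()) {
                    try (InputStream in = blob.openInputStream()) {
                        // ... read the stored object ...
                    }
                }
                // Write a new instance of the object; -1 means the size is unknown.
                try (OutputStream out = blob.openOutputStream(-1, true)) {
                    out.write("new object content".getBytes("UTF-8"));
                }
            } finally {
                connection.close();
            }
        }
    }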
Tapes, the basic design
No data will ever be overwritten. Every write creates a new instance of an object. This is the fundamental invariant in the tape design.
The tapes are tar files. They can be said to form a long chain. Each tape is named according to the time it was created. Only the newest tape can be written to. When the newest tape reaches a certain size, it is closed and a new tape is started. This new tape is now the newest tape.
Only one thread can write to the tape system at a time.
A separate index is maintained. This index holds the mapping between an object identifier and the newest instance of the object (i.e. tape name and offset into the tape).
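A small sketch of the tape-chain idea follows; the naming pattern and the size threshold are illustrative, not the actual configuration:

    import java.io.File;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class TapeRollover {
        // Illustrative size threshold; the real limit is configurable.
        static final long MAX_TAPE_SIZE = 100 * 1024 * 1024;

        // Tapes are named from their creation time, so lexical order equals age.
        static File newTape(File tapeDir) {
            String stamp = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date());
            return new File(tapeDir, "tape" + stamp + ".tar");
        }

        // Only the newest tape is writable; when it grows past the threshold it
        // is closed (and later marked as indexed) and a fresh tape becomes the
        // newest one.
        static File tapeForNextWrite(File newestTape, File tapeDir) {
            if (newestTape == null || newestTape.length() >= MAX_TAPE_SIZE) {
                return newTape(tapeDir);
            }
            return newestTape;
        }
    }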
Tape tar files, locking and the like
The tapes are tar files. To understand the following, see http://en.wikipedia.org/wiki/Tar_(computing)
The tapes are read with a library called JTar, see https://github.com/blekinge/jtar
When an output stream is opened to a blob, the global write lock is acquired by this thread. As Fedora does not tell the blob how much data it is going to write, the output stream buffers the written data until the stream is closed. When the stream is closed, the buffer is written to the newest tape as a new tar entry, the object instance is registered in the index, and lastly the write lock is released.
Each output stream has a 1 MB buffer by default. If the buffer is filled, a new buffer of size min(1 MB, sizeNeeded) is allocated. This means a considerable amount of memory can be used. Swapping the buffers to disk might improve on this.
In principle it is not necessary to acquire the write lock until the stream is closed, but it is acquired when the stream is opened. If the write lock were only acquired on closing, the stream would have to determine which tape was the newest at that time; by acquiring the lock at open time, the stream can be handed this information up front. Since Fcrepo seems to write to the disk in bursts, neither deadlocks nor slowdowns have been observed.
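A rough sketch of this write path follows; the Tape and Index types are hypothetical stand-ins for the actual abstractions, and the buffering here is simplified to a single growing buffer:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical stand-ins for the actual tape and index abstractions.
    interface Tape {
        long append(String objectId, byte[] data) throws IOException; // returns entry offset
        String getName();
    }

    interface Index {
        void setLocation(String objectId, String tape, long offset);
    }

    // Buffers everything written to a blob and appends it to the newest tape as
    // one tar entry when the stream is closed.
    class TapeOutputStream extends ByteArrayOutputStream {
        private final ReentrantLock writeLock; // the global write lock
        private final String objectId;
        private final Tape newestTape;
        private final Index index;

        TapeOutputStream(ReentrantLock writeLock, String objectId, Tape newestTape, Index index) {
            this.writeLock = writeLock;
            this.objectId = objectId;
            this.newestTape = newestTape;
            this.index = index;
            // The lock is taken at open time, so the newest tape is known up front.
            writeLock.lock();
        }

        @Override
        public void close() throws IOException {
            try {
                // Append the buffered data to the newest tape as a new tar entry...
                long offset = newestTape.append(objectId, toByteArray());
                // ...and register this new instance of the object in the index.
                index.setLocation(objectId, newestTape.getName(), offset);
            } finally {
                // The write lock is only released once the entry is safely written.
                writeLock.unlock();
            }
        }
    }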
Reading is done by querying the index for the tape name and offset. With this information, an input stream can be opened to the exact entry in the relevant tape. No locking is necessary for reading.
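A sketch of the read path, assuming the TarInputStream/TarEntry classes from JTar (the package name used here, org.kamranzafar.jtar, may differ in the fork linked above):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.kamranzafar.jtar.TarEntry;
    import org.kamranzafar.jtar.TarInputStream;

    public class TapeReader {
        // Open an input stream positioned at the tar entry for one object instance.
        // 'tapeFile' and 'offset' come from the index lookup for the object id.
        static InputStream openEntry(String tapeFile, long offset) throws IOException {
            FileInputStream raw = new FileInputStream(tapeFile);
            long skipped = 0;
            while (skipped < offset) {                 // position at the entry header
                long n = raw.skip(offset - skipped);
                if (n <= 0) throw new IOException("Could not seek to offset " + offset);
                skipped += n;
            }
            TarInputStream tar = new TarInputStream(raw);
            TarEntry entry = tar.getNextEntry();       // the entry for this object
            if (entry == null) throw new IOException("No tar entry at offset " + offset);
            return tar;                                // read up to entry.getSize() bytes
        }
    }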
A tape is marked as indexed (in the index described below) when it is closed and a new tape is started. As will be explained, tapes that are marked as indexed will not be re-read upon server startup.
The Index
For the index, a separate system called Redis (http://redis.io/) is used.
A client library, Jedis (https://github.com/xetorthio/jedis), is used as the interface to Redis.
The metadata repository now requires a running Redis instance in order to function.
The index implementation has to provide the following methods (sketched as a Java interface after the list):
tape,offset getLocation(objectID)
setLocation(objectID, tape, offset)
iterator<objectID> list(idPrefix)
remove(objectID)
boolean isTapeIndexed(tape)
setTapeIndexed(tape)
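A minimal sketch of this contract as a Java interface; names and types are illustrative, not the actual implementation:

    import java.util.Iterator;

    // Sketch of the index contract; Entry pairs a tape name with an offset.
    public interface TapeIndex {

        final class Entry {
            public final String tape;
            public final long offset;
            public Entry(String tape, long offset) { this.tape = tape; this.offset = offset; }
        }

        // Where is the newest instance of this object stored?
        Entry getLocation(String objectId);

        // Record that a new instance of the object was written at tape/offset.
        void setLocation(String objectId, String tape, long offset);

        // Iterate over all object ids starting with the given prefix.
        Iterator<String> list(String idPrefix);

        // Forget the object (the data remains on tape, only the index entry goes).
        void remove(String objectId);

        // Has this tape already been registered as fully indexed?
        boolean isTapeIndexed(String tape);

        // Mark a closed tape as indexed so it is not re-read on server startup.
        void setTapeIndexed(String tape);
    }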
The Redis instance holds a number of keys and sets.
Firstly, we have the String keys. These are used to map an object ID to a string of the form tape#offset. There is one such key for each objectID. Looking up and writing these keys should be nearly O(1).
Secondly, we have the sorted set called "buckets". It holds references (names, really) to a number of other sorted sets. Each of these other sorted sets holds a number of objectIDs. Each objectID that is added to the index is hashed, and the first 4 characters of the hash value determine which of these buckets the ID is added to. Each bucket is named solely by those 4 characters, i.e. by the shared hash prefix of the IDs it holds. The purpose of this somewhat complex structure is to be able to iterate through the objectIDs in DOMS while allowing paging.
The last data structure is the set "tapes". It contains all the tapes that have been indexed so far. Lookups and additions to this set are fast operations.
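To make the layout concrete, here is a sketch of how these structures could be written and read with Jedis; the key names follow the description above, while the hash algorithm and the helper methods are illustrative assumptions:

    import java.security.MessageDigest;

    import redis.clients.jedis.Jedis;

    public class RedisIndexSketch {

        static void setLocation(Jedis jedis, String objectId, String tape, long offset)
                throws Exception {
            // String key: objectId -> "tape#offset"
            jedis.set(objectId, tape + "#" + offset);

            // Bucket: the first 4 hex characters of the id's hash name the bucket.
            String bucket = hash(objectId).substring(0, 4);
            jedis.zadd(bucket, 0, objectId);        // add the id to its bucket
            jedis.zadd("buckets", 0, bucket);       // register the bucket itself
        }

        static String getLocation(Jedis jedis, String objectId) {
            return jedis.get(objectId);             // e.g. "tape20120101120000.tar#1536"
        }

        static void markTapeIndexed(Jedis jedis, String tape) {
            jedis.sadd("tapes", tape);              // set of fully indexed tapes
        }

        static boolean isTapeIndexed(Jedis jedis, String tape) {
            return jedis.sismember("tapes", tape);
        }

        // Illustrative hash; the actual implementation may use another algorithm.
        static String hash(String objectId) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(objectId.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        }
    }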
Error recovery