...
When an outputstream is opened to a blob, the opening thread acquires the global write lock. Because Fedora does not tell the blob how much data it is going to write, the outputstream buffers the written data until the stream is closed. When the stream is closed, the buffer is written to the newest tape as a new tar entry and the object instance is registered in the index. Lastly, the write lock is released.
Each outputstream has a 1MB buffer by default. If the buffer fills up, a new buffer of size min(1MB,sizeNeeded) is allocated. This could mean that a considerable amount of memory is used. Swapping could improve on this?
In principle, it is not necessary to acquire the write lock until the stream is closed, but it is acquired when the stream is opened. If the write lock were acquired on closing, the stream would need to determine which tape is the newest at that time. By acquiring the lock at "open" time, the stream can be given this information up front. Since Fcrepo seems to burst-write to the disk, neither deadlocks nor slowdowns have been seen.
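The write path described above can be sketched as follows. This is only an illustration, not the actual implementation: the names BufferedTapeOutputStream and TapeSink are hypothetical, the sink stands in for "append a tar entry to the newest tape and register it in the index", and the buffer uses the growth policy of ByteArrayOutputStream rather than the allocation scheme described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for "write a tar entry to the newest tape and update the index". */
interface TapeSink {
    void appendEntry(String objectId, byte[] data) throws IOException;
}

/** Takes the write lock on open, buffers all writes, and flushes them as one entry on close. */
class BufferedTapeOutputStream extends OutputStream {
    private final ReentrantLock writeLock;
    private final TapeSink sink;
    private final String objectId;
    // Initial capacity of 1MB, mirroring the default buffer size mentioned in the text.
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(1024 * 1024);
    private boolean closed = false;

    BufferedTapeOutputStream(ReentrantLock writeLock, TapeSink sink, String objectId) {
        this.writeLock = writeLock;
        this.sink = sink;
        this.objectId = objectId;
        writeLock.lock(); // acquired at open time, by the opening thread
    }

    @Override
    public void write(int b) {
        buffer.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) {
        buffer.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // closing twice must not unlock twice
        }
        closed = true;
        try {
            sink.appendEntry(objectId, buffer.toByteArray());
        } finally {
            writeLock.unlock(); // released even if the tape write fails
        }
    }
}
```

Note that a ReentrantLock must be released by the thread that acquired it, so this sketch assumes the stream is opened and closed on the same thread.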
Reading is done by querying the index for the tape name and offset. With this information, an inputstream can be opened to the exact entry in the relevant tape. No locking is necessary for reading.
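As a rough illustration of the read path, the sketch below opens a tape file and positions the stream at a given offset. The class name TapeReader and the method openAt are hypothetical. Parsing the tar header found at that offset is left out and would be handled by whatever tar library is in use.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;

class TapeReader {
    /** Opens the tape and positions the stream at the offset taken from the index. */
    static InputStream openAt(Path tapeFile, long offset) throws IOException {
        FileInputStream in = new FileInputStream(tapeFile.toFile());
        try {
            // Moving the channel position also moves the stream's read position.
            in.getChannel().position(offset);
        } catch (IOException e) {
            in.close();
            throw e;
        }
        return in;
    }
}
```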
A tape is marked as indexed (in the index described below) when it is closed and a new tape is started. As will be explained, tapes that are marked as indexed will not be re-read upon server startup.
The Index
For the index, a separate system called Redis http://redis.io/ is used.
A client, Jedis https://github.com/xetorthio/jedis is used as the interface to the system.
The metadata repository now requires a running Redis instance to function.
The index implementation has to provide the following methods:
tape,offset getLocation(objectID)
setLocation(objectID, tape, offset)
iterator<objectID> list(idPrefix)
remove(objectID)
boolean isTapeIndexed(tape)
setTapeIndexed(tape)
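The method list above could be expressed as a Java interface along these lines. The type names Location, Index and InMemoryIndex are hypothetical, and the in-memory implementation is only a stand-in to show the expected behaviour. It is not the Redis-backed implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** The (tape, offset) pair returned by getLocation. */
final class Location {
    final String tape;
    final long offset;

    Location(String tape, long offset) {
        this.tape = tape;
        this.offset = offset;
    }
}

interface Index {
    Location getLocation(String objectId);
    void setLocation(String objectId, String tape, long offset);
    Iterator<String> list(String idPrefix);
    void remove(String objectId);
    boolean isTapeIndexed(String tape);
    void setTapeIndexed(String tape);
}

/** In-memory stand-in, useful for tests only. */
class InMemoryIndex implements Index {
    private final TreeMap<String, Location> locations = new TreeMap<>();
    private final Set<String> indexedTapes = new HashSet<>();

    public Location getLocation(String objectId) {
        return locations.get(objectId); // null if the object is unknown
    }

    public void setLocation(String objectId, String tape, long offset) {
        locations.put(objectId, new Location(tape, offset));
    }

    public Iterator<String> list(String idPrefix) {
        List<String> matches = new ArrayList<>();
        // IDs sharing a prefix are contiguous in sorted order, starting at the prefix itself.
        for (String id : locations.tailMap(idPrefix, true).keySet()) {
            if (!id.startsWith(idPrefix)) {
                break;
            }
            matches.add(id);
        }
        return matches.iterator();
    }

    public void remove(String objectId) {
        locations.remove(objectId);
    }

    public boolean isTapeIndexed(String tape) {
        return indexedTapes.contains(tape);
    }

    public void setTapeIndexed(String tape) {
        indexedTapes.add(tape);
    }
}
```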
The Redis instance holds a number of keys and sets.
Firstly, we have the String keys. These are used to map an object ID to a string of the form tape#offset. There will be one such key for each objectID. Looking up and writing these keys should be nearly O(1).
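A minimal sketch of how the tape#offset value could be built and parsed is shown below. The class name LocationCodec is hypothetical. The sketch splits on the last '#', on the assumption that the offset never contains that character while a tape name might.

```java
class LocationCodec {
    /** Builds the value stored under the object ID key. */
    static String encode(String tape, long offset) {
        return tape + "#" + offset;
    }

    /** Extracts the tape name: everything before the last '#'. */
    static String tapeOf(String value) {
        return value.substring(0, value.lastIndexOf('#'));
    }

    /** Extracts the offset: everything after the last '#'. */
    static long offsetOf(String value) {
        return Long.parseLong(value.substring(value.lastIndexOf('#') + 1));
    }
}
```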
Secondly, we have the sorted set called "buckets". It holds references (names, really) to a number of other sorted sets. Each of these other sorted sets holds a number of objectIDs. Each objectID that is added to the index is hashed. The first 4 characters of the hash value are then used to determine which of these buckets to add the ID to. Each bucket is named solely from the 4 characters that the hash values of the IDs it holds have in common. The purpose of this complex structure is to be able to iterate through the objectIDs in doms, while allowing paging.
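The bucket scheme can be illustrated with the in-memory sketch below. The hash function is not named in this text, so MD5 rendered as hexadecimal is used purely as an assumption. The class name Buckets and its methods are hypothetical, and sorted Java collections stand in for the Redis sorted sets.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.TreeSet;

class Buckets {
    // Bucket name -> sorted IDs in that bucket; stands in for "buckets" and the per-bucket sets.
    private final TreeMap<String, TreeSet<String>> buckets = new TreeMap<>();

    /** First 4 hex characters of the (assumed) MD5 hash of the object ID. */
    static String bucketName(String objectId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(objectId.getBytes(StandardCharsets.UTF_8));
            String hex = String.format("%032x", new BigInteger(1, digest));
            return hex.substring(0, 4);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    void add(String objectId) {
        buckets.computeIfAbsent(bucketName(objectId), name -> new TreeSet<>()).add(objectId);
    }

    /** Walks the buckets in name order and returns one page of IDs. */
    List<String> page(int skip, int limit) {
        List<String> result = new ArrayList<>();
        int seen = 0;
        for (TreeSet<String> bucket : buckets.values()) {
            for (String id : bucket) {
                if (seen++ < skip) {
                    continue;
                }
                if (result.size() == limit) {
                    return result;
                }
                result.add(id);
            }
        }
        return result;
    }
}
```

Because both the bucket names and the IDs within a bucket are kept in a stable order, successive pages do not overlap as long as no IDs are added or removed in between.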
The last data structure is the set "tapes". It contains all the tapes that have been indexed so far. Checking membership and adding to the set are fast operations.
Error recovery