Implementation Details

This section provides some background information on how S3QL works internally. Reading this section is not necessary to use S3QL.

Metadata Storage

Like most Unix file systems, S3QL has a concept of inodes.

The contents of directory inodes (i.e., the names and inodes of the files and subdirectories contained in a directory) are stored directly in an SQLite database. This database is stored in a special storage object that is downloaded when the file system is mounted and uploaded periodically in the background and when the file system is unmounted. This has two implications:

  1. The entire file system tree can be read from the database. Fetching/storing storage objects from/in the storage backend is only required to access the contents of files (or, more precisely, inodes). This makes most file system operations very fast because no data has to be sent over the network.
  2. An S3QL file system can only be mounted on one computer at a time, using a single mount.s3ql process. Otherwise, changes made in one mount point will invariably be overwritten when the second mount point is unmounted.

Sockets, FIFOs and character devices do not need any additional storage; all information about them is contained in the database.
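To illustrate why path lookups need no network access, here is a minimal sketch of directory entries kept in SQLite. The table and column names (inodes, contents, parent_inode) and the helper functions are hypothetical and much simpler than whatever schema S3QL actually uses; root inode number 1 is likewise an assumption.

```python
import sqlite3

# Hypothetical, simplified schema; not the real S3QL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inodes (id INTEGER PRIMARY KEY, mode INTEGER, size INTEGER);
    CREATE TABLE contents (
        parent_inode INTEGER NOT NULL REFERENCES inodes(id),
        name TEXT NOT NULL,
        inode INTEGER NOT NULL REFERENCES inodes(id),
        PRIMARY KEY (parent_inode, name)
    );
""")

def add_entry(parent, name, inode, mode, size=0):
    """Create an inode and link it into a directory under the given name."""
    conn.execute("INSERT INTO inodes VALUES (?, ?, ?)", (inode, mode, size))
    conn.execute("INSERT INTO contents VALUES (?, ?, ?)", (parent, name, inode))

def lookup(path, root=1):
    """Resolve an absolute path to an inode number using only the database."""
    inode = root
    for part in filter(None, path.split("/")):
        row = conn.execute(
            "SELECT inode FROM contents WHERE parent_inode = ? AND name = ?",
            (inode, part),
        ).fetchone()
        if row is None:
            raise FileNotFoundError(path)
        inode = row[0]
    return inode

def listdir(inode):
    """Return the sorted names stored in a directory inode."""
    rows = conn.execute(
        "SELECT name FROM contents WHERE parent_inode = ? ORDER BY name", (inode,)
    )
    return [r[0] for r in rows]

# Root directory (inode 1, an assumption), one subdirectory, one file.
conn.execute("INSERT INTO inodes VALUES (1, ?, 0)", (0o040755,))
add_entry(1, "docs", 2, 0o040755)
add_entry(2, "readme", 3, 0o100644, size=12)
```

Both lookup and listdir touch only the local database, which is the reason most metadata operations stay fast.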

Data Storage

The contents of file inodes are split into individual blocks. The maximum size of a block is specified when the file system is created and cannot be changed afterwards. Every block is stored as an individual object in the backend, and the mapping from inodes to blocks and from blocks to objects is stored in the database.
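The splitting into blocks can be sketched as simple integer arithmetic. The function name blocks_for_range is hypothetical; it only shows how a byte range of a file maps onto block numbers for a given maximum block size.

```python
def blocks_for_range(offset, length, max_block_size):
    """Yield (block_no, offset_in_block, nbytes) for a byte range of a file."""
    end = offset + length
    while offset < end:
        # Block number and position inside that block for the current offset.
        block_no, start = divmod(offset, max_block_size)
        # Stop at the block boundary or at the end of the requested range.
        nbytes = min(max_block_size - start, end - offset)
        yield block_no, start, nbytes
        offset += nbytes
```

A request that crosses a block boundary yields one tuple per affected block, so each block can be fetched or created independently.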

While the file system is mounted, blocks are cached locally.

Blocks can also be compressed and encrypted before they are stored in the storage backend. This happens during upload, i.e. the cached data is unencrypted and uncompressed.

If some files have blocks with identical contents, the blocks will be stored in the same backend object (i.e., the data is only stored once).

Data De-Duplication

Instead of uploading every block, S3QL first computes a checksum (a SHA256 hash) to check whether an identical block has already been stored in a backend object. If that is the case, the new block will be linked to the existing object instead of being uploaded.

This procedure is invisible to the user, and the contents of the block can still be changed. If several blocks share a backend object and one of the blocks is changed, the changed block is automatically stored in a new object (so that the contents of the other blocks remain unchanged).
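The following sketch shows the idea with an in-memory stand-in for the backend. The class DedupStore, its attribute names, and the reference counting are assumptions made for illustration; they are not taken from the S3QL source.

```python
import hashlib

class DedupStore:
    """Toy model: blocks with identical contents share one stored object."""

    def __init__(self):
        self.objects = {}       # object id -> bytes (stands in for the backend)
        self.by_hash = {}       # SHA256 digest -> object id
        self.refcount = {}      # object id -> number of blocks linked to it
        self.block_to_obj = {}  # (inode, block_no) -> object id
        self.next_id = 1
        self.uploads = 0        # counts simulated uploads

    def write_block(self, inode, block_no, data):
        digest = hashlib.sha256(data).digest()
        key = (inode, block_no)
        old = self.block_to_obj.get(key)
        obj_id = self.by_hash.get(digest)
        if obj_id is None:
            # No identical block stored yet: "upload" a new object.
            obj_id = self.next_id
            self.next_id += 1
            self.objects[obj_id] = data
            self.by_hash[digest] = obj_id
            self.refcount[obj_id] = 0
            self.uploads += 1
        if old == obj_id:
            return obj_id  # contents unchanged, nothing to relink
        self.refcount[obj_id] += 1
        self.block_to_obj[key] = obj_id
        if old is not None:
            self._unref(old)
        return obj_id

    def _unref(self, obj_id):
        # Drop an object once no block refers to it any more.
        self.refcount[obj_id] -= 1
        if self.refcount[obj_id] == 0:
            data = self.objects.pop(obj_id)
            del self.by_hash[hashlib.sha256(data).digest()]
            del self.refcount[obj_id]

    def read_block(self, inode, block_no):
        return self.objects[self.block_to_obj[(inode, block_no)]]
```

Because a changed block is always linked to a different object, the other blocks that shared the old object keep reading their original contents.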

Caching

When an application tries to read from or write to a file, S3QL determines the block that contains the required part of the file and retrieves it from the backend or creates it if it does not yet exist. The block is then held in the cache directory. It is uploaded to the backend when it has not been accessed for more than a few seconds. Blocks are removed from the cache only when the maximum cache size is reached.

When the file system is unmounted, all modified blocks are written to the backend and the cache is cleaned.
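The cache behaviour described above can be modelled roughly as follows. This is a sketch under several assumptions: the class BlockCache is hypothetical, the backend is a plain dict, the idle threshold of 10 seconds is an arbitrary stand-in for "a few seconds", least-recently-used eviction is an assumed policy, and the current time is passed in explicitly to keep the example deterministic.

```python
from collections import OrderedDict

class BlockCache:
    """Toy write-back cache with idle upload and size-based eviction."""

    def __init__(self, backend, max_bytes, idle_seconds=10.0):
        self.backend = backend        # dict-like stand-in for the storage backend
        self.max_bytes = max_bytes
        self.idle_seconds = idle_seconds
        self.entries = OrderedDict()  # key -> [data, dirty, last_access]

    def _touch(self, key, now):
        if key not in self.entries:
            # Fetch from the backend, or start an empty block if it is new.
            self.entries[key] = [self.backend.get(key, b""), False, now]
        entry = self.entries[key]
        entry[2] = now
        self.entries.move_to_end(key)  # most recently used goes last
        return entry

    def read(self, key, now):
        data = self._touch(key, now)[0]
        self._expire()
        return data

    def write(self, key, data, now):
        entry = self._touch(key, now)
        entry[0], entry[1] = data, True
        self._expire()

    def flush_idle(self, now):
        """Upload modified blocks that have not been accessed recently."""
        for key, entry in self.entries.items():
            if entry[1] and now - entry[2] > self.idle_seconds:
                self._upload(key, entry)

    def _upload(self, key, entry):
        self.backend[key] = entry[0]
        entry[1] = False

    def _expire(self):
        # Evict least recently used blocks only while the cache is too large.
        while (len(self.entries) > 1
               and sum(len(e[0]) for e in self.entries.values()) > self.max_bytes):
            key, entry = self.entries.popitem(last=False)
            if entry[1]:
                self._upload(key, entry)

    def close(self):
        """Write all modified blocks and empty the cache (as on unmount)."""
        for key, entry in self.entries.items():
            if entry[1]:
                self._upload(key, entry)
        self.entries.clear()
```

Note that uploading a block does not remove it from the cache; removal happens only in _expire, once the size limit is exceeded, or in close.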

Eventual Consistency Handling

S3QL has to take into account that with some storage providers, changes in objects do not propagate immediately. For example, when an Amazon S3 object is uploaded and immediately downloaded again, the downloaded data might not yet reflect the changes made by the upload (see also http://developer.amazonwebservices.com/connect/message.jspa?messageID=38538).

For the data blocks this is not a problem, because data blocks always get a new object ID when they are updated.

For the metadata, however, S3QL has to make sure that it always downloads the most recent copy of the database when mounting the file system.

To that end, metadata versions are numbered, and the most recent version number is stored as part of the object ID of a very small “marker” object. When S3QL has downloaded the metadata, it checks the version number against the marker object and, if the two do not agree, waits for the most recent metadata to become available. Once the current metadata is available, the version number is increased and the marker object updated.
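The check against the marker object can be sketched as a polling loop. Everything named here is hypothetical: the marker prefix "seq_no_", the two callables list_objects and download_metadata, and the retry parameters are placeholders for whatever the backend interface actually provides.

```python
import time

MARKER_PREFIX = "seq_no_"  # hypothetical naming scheme for marker objects

def latest_version(object_ids):
    """Read the newest version number out of the marker object ids."""
    return max(
        int(oid[len(MARKER_PREFIX):])
        for oid in object_ids
        if oid.startswith(MARKER_PREFIX)
    )

def wait_for_current_metadata(list_objects, download_metadata,
                              retries=10, delay=1.0):
    """Poll until the downloaded metadata matches the marker version.

    list_objects() returns an iterable of object ids; download_metadata()
    returns a (version, metadata) tuple. Both are assumed interfaces.
    """
    expected = latest_version(list_objects())
    for _ in range(retries):
        version, metadata = download_metadata()
        if version == expected:
            return version, metadata
        time.sleep(delay)  # stale copy: give the change time to propagate
    raise TimeoutError("metadata version %d did not become available" % expected)
```

After this returns, a mount would go on to increase the version number and write a new marker object, which the sketch leaves out.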

Encryption

When the file system is created, mkfs.s3ql generates a 256-bit master key by reading from /dev/random. The master key is encrypted with the passphrase that is entered by the user, and then stored with the rest of the file system data. Since the passphrase is only used to access the master key (which is used to encrypt the actual file system data), the passphrase can easily be changed.

Data is encrypted with a new session key for each object and each upload. The session key is generated by appending a nonce to the master key and then calculating the SHA256 hash. The nonce is generated by concatenating the object ID and the current UTC time as a 32-bit float. The precision of the time is given by the Python time() function and is usually at least 1 millisecond. The SHA256 implementation is included in the Python standard library.
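The derivation can be written down in a few lines using hashlib. The function names, the UTF-8 encoding of the object ID, and the little-endian byte order of the packed float are assumptions; the text above does not specify the exact byte layout.

```python
import hashlib
import struct
import time

def make_nonce(object_id, now=None):
    """Concatenate the object id and the current time as a 32-bit float.

    Encoding and byte order are assumptions made for this sketch.
    """
    if now is None:
        now = time.time()
    return object_id.encode("utf-8") + struct.pack("<f", now)

def session_key(master_key, nonce):
    """Session key = SHA256(master key followed by nonce)."""
    return hashlib.sha256(master_key + nonce).digest()
```

Since the nonce is stored in front of the uploaded data, anyone holding the master key can recompute the session key when the object is downloaded again.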

Once the session key has been calculated, a SHA256 HMAC is calculated over the data that is to be uploaded. Afterwards, the data is compressed (unless --compress none was passed to mount.s3ql) and the HMAC inserted at the beginning. Both HMAC and compressed data are then encrypted using 256-bit AES in CTR mode using PyCrypto. Finally, the nonce is inserted in front of the encrypted data and HMAC, and the packet is sent to the backend as a new S3 object.
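The order of these steps can be sketched as follows. Several details are assumptions: the function name build_packet is hypothetical, the HMAC is keyed with the session key, zlib stands in for whichever compression algorithm is configured, and the AES-CTR step is not implemented here but passed in as a callable named encrypt.

```python
import hashlib
import hmac
import zlib

def build_packet(data, session_key, nonce, encrypt):
    """Assemble nonce + encrypt(HMAC + compressed data).

    encrypt is a caller-supplied callable (key, plaintext) -> ciphertext
    that stands in for 256-bit AES in CTR mode; it is an assumption of
    this sketch, not an S3QL interface.
    """
    # HMAC over the uncompressed data (keying with the session key is assumed).
    tag = hmac.new(session_key, data, hashlib.sha256).digest()
    # The HMAC goes in front of the compressed data (zlib is an assumption).
    body = tag + zlib.compress(data)
    # Encrypt HMAC and compressed data together, then prepend the nonce.
    return nonce + encrypt(session_key, body)
```

The nonce stays unencrypted at the front of the packet because the reader needs it to derive the session key before anything else can be decrypted.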