This section provides some background information on how S3QL works internally. Reading this section is not necessary to use S3QL.
Like most unix filesystems, S3QL has a concept of inodes.
The contents of directory inodes (aka the names and inodes of the files and sub directories contained in a directory) are stored directly in a [http://www.sqlite.org SQLite] database. This database is stored in a special S3 object that is downloaded when the file system is mounted and uploaded periodically in the background and when the file system is unmounted. This has two implications:
Sockets, FIFOs and character devices do not need any additional storage, all information about them is contained in the database.
The contents of file inodes are split into individual blocks. The maximum size of a block is specified when the file system is created and cannot be changed afterwards. Every block is stored as an individual object in the backend, and the mapping from inodes to blocks and from blocks to objects is stored in the database.
While the file system is mounted, blocks are cached locally.
Blocks can also be compressed and encrypted before they are stored in S3.
If some files have blocks with identical contents, the blocks will be stored in the same backend object (i.e., the data is only stored once).
Instead of uploading every block, S3QL first computes a checksum (a SHA256 hash, for those who are interested) to check if an identical blocks has already been stored in an backend object. If that is the case, the new block will be linked to the existing object instead of being uploaded.
This procedure is invisible for the user and the contents of the block can still be changed. If several blocks share a backend object and one of the blocks is changed, the changed block is automatically stored in a new object (so that the contents of the other block remain unchanged).
When an application tries to read or write from a file, S3QL determines the block that contains the required part of the file and retrieves it from the backend or creates it if it does not yet exist. The block is then held in the cache directory. It is committed to S3 when it has not been accessed for more than 10 seconds. Blocks are removed from the cache only when the maximum cache size is reached.
When the file system is unmounted, all modified blocks are committed to the backend and the cache is cleaned.
S3QL has to take into account that changes in objects do not propagate immediately in all backends. For example, when an Amazon S3 object is uploaded and immediately downloaded again, the downloaded data might not yet reflect the changes done in the upload (see also http://developer.amazonwebservices.com/connect/message.jspa?messageID=38538)
For the data blocks this is not a problem because a data blocks always get a new object ID when they are updated.
For the metadata however, S3QL has to make sure that it always downloads the most recent copy of the database when mounting the file system.
To that end, metadata versions are numbered, and the most recent version number is stored as part of the object id of a very small “marker” object. When S3QL has downloaded the metadata it checks the version number against the marker object and, if the two do not agree, waits for the most recent metadata to become available. Once the current metadata is available, the version number is increased and the marker object updated.
When the file system is created, mkfs.s3ql generates a 256 bit master key by reading from /dev/random. The master key is encrypted with the passphrase that is entered by the user, and then stored with the rest of the file system data. Since the passphrase is only used to access the master key (which is used to encrypt the actual file system data), the passphrase can easily be changed.
Data is encrypted with a new session key for each object and each upload. The session key is generated by appending a nonce to the master key and then calculating the SHA256 hash. The nonce is generated by concatenating the object id and the current UTC time as a 32 bit float. The precision of the time is given by the [http://docs.python.org/library/time.html#time.time Python time() function] and usually at least 1 millisecond. The SHA256 implementation is included in the Python standard library.
Once the session key has been calculated, a SHA256 HMAC is calculated over the data that is to be uploaded. Afterwards, the data is compressed with the LZMA, [http://en.wikipedia.org/wiki/Bz2 Bz2 algorithm] or LZ and the HMAC inserted at the beginning. Both HMAC and compressed data are then encrypted using 256 bit AES in CTR mode. The AES-CTR implementation is provided by the [http://cryptopp.com/ Crypto++] library. Finally, the nonce is inserted in front of the encrypted data and HMAC, and the packet is send to the backend as a new S3 object.