
S3QL vs ZFS-on-NBD

When I created S3QL in 2008, the cloud storage and Linux filesystem landscape looked rather different than it does today: there was no filesystem that supported compression, encryption, or de-duplication of data. The only relevant cloud storage system was Amazon S3, and it only offered eventual consistency.

S3QL was designed to fill these gaps: I wanted to be able to store compressed, encrypted, and de-duplicated backups in the cloud, without being tied to a particular backup software. I think it has served this use case very well, and I am still using it today.

However, nowadays there are filesystems like ZFS which offer compression, encryption and de-duplication, and all the major cloud storage systems offer immediate consistency for all operations. In other words, large parts of S3QL now replicate functionality that also exists in in-kernel filesystems, while other parts provide features that are no longer required (consistency handling).

This article documents my experience with setting up a cloud-based backup system that offers features very similar to S3QL, but is based on ZFS and network block devices (NBDs).

Filesystems vs Block Devices

The thing that has always bugged me most about S3QL is the handling of filesystem metadata. On the plus side, metadata operations are very fast. On the minus side, a full metadata copy needs to be available locally and needs to be downloaded/uploaded completely on every mount/umount. I have spent a lot of time thinking about ways to change this (e.g. by uploading a transaction log rather than the complete metadata, and/or splitting the data across multiple storage objects) but never came up with something that seemed worth the additional complexity. Therefore, I wanted to try an entirely different approach: instead of implementing a file system, implement a network-backed block storage device and run ZFS on top of that.

The appeal of this solution is that a block storage device needs to implement just a small number of very simple operations (read, write, discard), while file system semantics (and in particular the handling of metadata) are provided by ZFS.

This idea is not new. Even though a driver for userspace-backed block devices was only just added to io_uring, the NBD driver has been in the kernel since at least 2004, and s3backer (which provides an S3-backed block device through FUSE) has been around for as long as S3QL. However, in the past none of these solutions appealed to me because there was no suitable (local) filesystem to combine them with. Handling encryption, compression and snapshots in the block layer does not feel to me like a simplification over what S3QL is doing.

However, the rise of ZFS and BTRFS, and the availability of storage services with immediate consistency has changed this. Therefore, a few months ago I decided to try setting up a backup system using ZFS on a cloud-backed block device.

s3backer

As mentioned above, S3Backer provides a FUSE filesystem just like S3QL. However, the filesystem contains only one large file that is intended to be loopback-mounted to provide a block device, which can in turn be used with any (local) filesystem one desires.

S3Backer has a complex codebase that offers many features: compression, encryption, de-duplication, caching, and support for eventually consistent storage systems. Therefore, these features are available even when using a local filesystem that does not support them on its own.

To me, the architecture of s3backer has always felt somewhat klunky - why go through the trouble of implementing a filesystem if all we want is a block device? This is not just a cosmetic concern, but comes with a number of practical disadvantages:

  • The kernel serializes FUSE write and read requests on a per-file basis - which means that the entire block device will have its write and read requests serialized with no concurrency at the kernel - userspace interface.
  • Read and write request size cannot exceed 128 kB
  • All data is present in the page cache twice (once for the "upper" filesystem, once for the FUSE filesystem)
  • All requests pass through the VFS twice.

In the beginning, I therefore looked for a different option (however, as we'll see I later came back to s3backer).

nbdkit

nbdkit is a framework for writing NBD servers, which can then be used with the kernel's NBD client to provide a userspace-backed block device. nbdkit supports "plugins" (that implement data store/retrieval) as well as "filters" (which can be stacked to transform requests), both of which can be written in a multitude of programming languages.
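
To illustrate just how small this interface is, here is a sketch of a complete RAM-backed plugin in Python. This is my own minimal example based on a reading of nbdkit-python-plugin(3) - the exact callback signatures depend on the plugin API version, so double-check them against that man page:

# ramdisk.py - minimal RAM-backed nbdkit plugin; run with: nbdkit python ./ramdisk.py
API_VERSION = 2                  # newer plugin API (adds the flags arguments)

SIZE = 50 * 1024 ** 2            # 50 MiB of backing storage
disk = bytearray(SIZE)

def open(readonly):
    return 1                     # per-connection handle (unused here)

def get_size(h):
    return SIZE

def pread(h, buf, offset, flags):
    buf[:] = disk[offset:offset + len(buf)]

def pwrite(h, buf, offset, flags):
    disk[offset:offset + len(buf)] = buf

def trim(h, count, offset, flags):
    disk[offset:offset + count] = bytes(count)   # "discard" = zero out the range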

nbdkit already came with an example plugin that provided a read-only view of a single storage object hosted on S3. It is written in Python and seemed straightforward to extend. Furthermore, nbdkit has an extensive test suite and merge requests are carefully reviewed, so I felt I would be able to extend this plugin quickly to do what I needed, while still having high confidence in the correctness of the code.

nbdkit's plugin interface also meant that I could arbitrarily switch out the S3 plugin for the "memory" and "file" plugins, which store all data in memory (or a local file) instead of going over the network, and use the rate and delay filters to simulate the effects of network latency and bandwidth.
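
For example, an in-memory test device with simulated latency and limited bandwidth can be created with something along these lines (parameter units should be double-checked against the nbdkit-delay-filter and nbdkit-rate-filter man pages; if I recall correctly, rate takes bits per second, so rate=8M is roughly 1 MB/s):

$ nbdkit --unix my_socket --filter=delay --filter=rate memory size=50G \
  rdelay=30ms wdelay=30ms rate=8M &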

For these reasons, I decided to use nbdkit for my experiments. I started by contributing a number of features to the S3 plugin: write support, trim support, and the ability to split the data among multiple storage objects. I also extended the "stats" filter to keep track of request size and alignment distributions.

Creating the block device is then (relatively) simple:

$ nbdkit --unix my_socket S3 size=50G bucket=nikratio-test \
  key=my_data object-size=4K &  # start NBD server

$ nbd-client -unix my_socket /dev/nbd0  # connect /dev/nbd0

Object Size Considerations

The first design decision that I had to make was what object size to use on the S3 side, i.e. into how many different objects my filesystem would be split.

Ideally, the size of storage objects matches the size of NBD read/write requests received from the kernel:

  • If an NBD read/write is larger than the object size then it has to be spread across multiple storage objects. This means there is higher transfer overhead (since we need to transfer HTTP response/request headers for every object), higher costs (since every HTTP request is charged), and lower effective bandwidth (due to accumulation of round-trip latency, though this could be worked around by sending HTTP requests concurrently).
  • If an NBD write is smaller than the object size, then we first have to read the entire storage object, apply the update locally, and then write the entire object back again. This means we incur transfer overhead, higher transfer costs, higher per-request cost, and lower effective bandwidth.
  • If an NBD read is smaller than the object size, there are no negative consequences (since we can read just the required part of the storage object).

Unfortunately, the size of NBD read/write requests is not fixed, so choosing the right object size is not trivial (which is why I did a lot of experiments using the stats filter to determine what request sizes are encountered in practice).

The smallest reasonable storage object size is the block size of the filesystem (since we know that every modification will result in at least one changed block). The largest reasonable storage object size is the maximum size of an NBD request, which can be up to 32 MB.

For ZFS, the block size is dynamic (details) and varies between 2^ashift and recordsize (both of which are ZFS configuration parameters). We want the gap between these values to be reasonably large (otherwise e.g. compression will not be effective, because it is done in recordsize chunks and the compressed data still takes at least 2^ashift bytes).

Reasonable ashift values (according to internet wisdom) are between 9 and 12, corresponding to minimum block sizes of 512 to 4096 bytes. Requests of this minimum size are also likely to be encountered frequently in practice as a result of e.g. metadata changes.

On the other hand, when storing multi-MB files in the filesystem, most of the data will be stored in full recordsize blocks. If such data blocks happen to be placed adjacently (which is common for e.g. ext4, but less likely for ZFS), even larger NBD requests can result.

So whichever object size we choose, we will most likely incur significant penalties - either for large or for small NBD write requests. To put this into numbers (assuming Amazon S3 pricing as of today, an upload bandwidth of 3 MB/s, 30 ms ping times to S3 servers, and 512 bytes of HTTP metadata; see the back-of-the-envelope calculation after the list):

  • Uploading 1 GB of data using an object size of 4 kB will result in $ 1.31 of extra per-request charges, roughly 10% request overhead, and roughly 20x bandwidth reduction due to latency. (In theory we could avoid the bandwidth reduction by artificially limiting NBD requests to 4 kB, so that the splitting happens in the kernel and uploads happen concurrently, but as we'll see below this is currently not an option).
  • Uploading 4096 bytes of data using an object size of 512 kB will result in 255-fold write amplification/bandwidth reduction (512 kB of extra read plus 508 kB of additional write).
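
For the curious, these numbers can be reproduced with a quick back-of-the-envelope calculation. The $0.005 per 1000 PUT requests figure is my assumption about the S3 pricing referenced above; the other inputs are as stated in the text:

GiB = 1024 ** 3

# Case 1: 1 GiB uploaded as 4 kB objects
n_put = GiB // 4096                         # 262,144 PUT requests
print(n_put / 1000 * 0.005)                 # -> ~$1.31 of per-request charges
print(512 / 4096)                           # -> ~12.5% HTTP header overhead
t_per_req = 0.030 + 4096 / 3e6              # 30 ms latency + transfer time at 3 MB/s
print(3e6 / (4096 / t_per_req))             # -> ~23x effective bandwidth reduction

# Case 2: a 4 kB write into a 512 kB object (read-modify-write)
extra_bytes = 512 * 1024 + (512 * 1024 - 4096)   # extra read + extra write
print(extra_bytes / 4096)                   # -> 255x amplification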

Luckily, when using ZFS we can ameliorate this problem: ZFS can split blocks between two kinds of vdevs (backing devices) depending on their size. So if we assemble a zpool from a regular vdev (backed by one storage bucket) and a special vdev (backed by a different storage bucket), then the regular vdev will only see writes that are larger than special_small_blocks (another ZFS configuration parameter). We can therefore safely set the object size for the bucket behind the regular vdev to match recordsize, while choosing something smaller for the bucket behind the special vdev.

For my experiments, I picked ashift=12, recordsize=512k and special_small_blocks=128k (I really wanted to use 256 kB, but ran into a bug). For the storage object sizes, I chose 4 kB and 128 kB.

Initial Success

First, the good news. The setup fundamentally worked: I was able to set up ZFS across multiple NBD devices, and all ZFS operations worked just as for a local block device.

Bringing this setup up and down correctly is a bit involved so I created orchestration scripts. The commands that were ultimately executed are:

$ nbdkit --unix /tmp/tmplxtw074f/nbd_socket_sb --foreground --filter=exitlast \
  --filter=stats --threads 16 --filter=retry S3 size=50G bucket=nikratio-test \
  key=s3_plain_sb endpoint-url=http://s3.eu-west-2.amazonaws.com \
  statsfile=s3_plain_sb_stats.txt statsappend=true statsthreshold=100 retries=100 \
  retry-readonly=false retry-delay=30 retry-exponential=no object-size=4K

$ nbdkit --unix /tmp/tmplxtw074f/nbd_socket_lb --foreground --filter=exitlast \
  --filter=stats --threads 16 --filter=retry S3 size=50G bucket=nikratio-test \
  key=s3_plain_lb endpoint-url=http://s3.eu-west-2.amazonaws.com \
  statsfile=s3_plain_lb_stats.txt statsappend=true statsthreshold=100 retries=100 \
  retry-readonly=false retry-delay=30 retry-exponential=no object-size=128K

$ nbd-client -unix /tmp/tmplxtw074f/nbd_socket_sb --timeout 604800 /dev/nbd1
$ echo 32768 > /sys/block/nbd1/queue/max_sectors_kb
$ nbd-client -unix /tmp/tmplxtw074f/nbd_socket_lb --timeout 604800 /dev/nbd2
$ echo 32768 > /sys/block/nbd2/queue/max_sectors_kb
$ zpool create -f -R /zpools -o ashift=12 -o autotrim=on -o failmode=continue \
  -O acltype=posixacl -O relatime=on -O xattr=sa -O compression=zstd-19 -O checksum=sha256 \
  -O sync=disabled -O special_small_blocks=131072 -O redundant_metadata=most \
  -O recordsize=524288 -O encryption=on -O keyformat=passphrase \
  -O keylocation=file:///<path> s3_plain /dev/nbd2 special /dev/nbd1
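
Tearing the stack down again is essentially the reverse. A simplified sketch (the orchestration scripts do more error checking than this):

$ zpool export s3_plain       # flush everything and close the pool
$ nbd-client -d /dev/nbd2     # disconnect the NBD devices; thanks to the exitlast
$ nbd-client -d /dev/nbd1     # filter, each nbdkit instance then exits on its own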

Creating the zpool, writing a 771 MB test file, and exporting the zpool again resulted in the following NBD requests:

4k bucket:
read: 162 ops, 0.000155 s, 4.41 MiB, 27.76 GiB/s op, 82.71 KiB/s total
write: 631 ops, 0.003885 s, 7.88 MiB, 1.98 GiB/s op, 147.90 KiB/s total

128k bucket:
read: 114 ops, 0.000117 s, 3.39 MiB, 28.27 GiB/s op, 64.75 KiB/s total
write: 1521 ops, 0.548140 s, 716.91 MiB, 1.28 GiB/s op, 13.39 MiB/s total

(the total written data is less than 771 MB due to the ZFS compression).

The distribution of NBD write request sizes was as follows:

4 kB bucket:
   4096 bytes: 67.8% of requests (428)
   8192 bytes: 11.4% of requests (72)
  12288 bytes:  4.9% of requests (31)
  16384 bytes:  4.6% of requests (29)
  20480 bytes:  1.9% of requests (12)
 114688 bytes:  1.7% of requests (11)
  32768 bytes:  0.8% of requests (5)
  53248 bytes:  0.6% of requests (4)
 253952 bytes:  0.5% of requests (3)
  45056 bytes:  0.3% of requests (2)
  77824 bytes:  0.2% of requests (1)

128 kB bucket:
 524288 bytes: 81.1% of requests (1233)
   4096 bytes:  3.0% of requests (45)
1048576 bytes:  1.0% of requests (15)
 204800 bytes:  0.6% of requests (9)
 454656 bytes:  0.5% of requests (8)
 155648 bytes:  0.5% of requests (7)
 208896 bytes:  0.4% of requests (6)
 450560 bytes:  0.3% of requests (5)
 417792 bytes:  0.3% of requests (4)
 278528 bytes:  0.2% of requests (3)
 339968 bytes:  0.1% of requests (2)
 352256 bytes:  0.1% of requests (1)

(Yes, there is a slight discrepancy between the histogram and the total count - this still needs investigating).

Alignment Issues

Initially, I limited the size of NBD requests to the larger object size (128 kB) to enable concurrent processing of larger requests (by virtue of the kernel splitting them into smaller requests and sending them concurrently to userspace, and userspace processing them with multiple threads).
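
This limit is applied through the same sysfs knob that appears in the setup commands above, i.e. something along the lines of:

$ echo 128 > /sys/block/nbd1/queue/max_sectors_kb   # cap NBD requests at 128 kB
$ echo 128 > /sys/block/nbd2/queue/max_sectors_kb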

However, I soon found out that the kernel's NBD client does not align its requests to the preferred block size of the NBD server. For example, the alignment of the 128 kB writes was:

131072 bytes: 95.7% of requests (5632)
      12 bit aligned: 100.0% (5632)
      13 bit aligned:  77.9% (4389)
      14 bit aligned:  59.8% (3369)
      15 bit aligned:  17.0% (959)
      16 bit aligned:  15.0% (843)
      17 bit aligned:  12.0% (677)
      18 bit aligned:   6.0% (336)
      19 bit aligned:   3.0% (168)
      20 bit aligned:   1.5% (84)
      21 bit aligned:   0.8% (43)
      22 bit aligned:   0.4% (22)
      23 bit aligned:   0.2% (11)
      24 bit aligned:   0.1% (5)
      25 bit aligned:   0.1% (3)
      26 bit aligned:   0.0% (2)
      27 bit aligned:   0.0% (1)

Ideally, every single request should have been aligned to a 128 kB boundary (i.e., be 17-bit aligned). The fact that 88% of requests had no such alignment meant that they were actually overlapping two storage objects, and thus required four HTTP requests (2 read-modify-write cycles). In other words, non-aligned writes are just as bad as writes smaller than the object size.

Changing this required non-trivial changes to the kernel that were beyond my current capabilities.

To mitigate the impact of this, I instead set the NBD request size to the maximum value (32 MB) rather than the object size. This means that large requests spanning multiple objects are split by the NBD server rather than the kernel. At least with the current nbdkit architecture, this means that the "sub-requests" are processed sequentially - but on the plus side, the requests are split at the right positions to minimize write amplification.

Suspend/Hibernation

The second disappointing finding was that any activity on the NBD-backed filesystem made it impossible to suspend (or hibernate) the system. Attempts to suspend generally ended with the kernel giving up as follows:

kernel: Freezing user space processes ...
kernel: Freezing of tasks failed after 20.003 seconds (1 tasks refusing to freeze, wq_busy=0):
kernel: task:rsync           state:D stack:    0 pid:348105 ppid:348104 flags:0x00004004
kernel: Call Trace:
kernel:  <TASK>
kernel:  __schedule+0x308/0x9e0
kernel:  schedule+0x4e/0xb0
kernel:  schedule_timeout+0x88/0x150
kernel:  ? __bpf_trace_tick_stop+0x10/0x10
kernel:  io_schedule_timeout+0x4c/0x80
kernel:  __cv_timedwait_common+0x129/0x160 [spl]
kernel:  ? dequeue_task_stop+0x70/0x70
kernel:  __cv_timedwait_io+0x15/0x20 [spl]
kernel:  zio_wait+0x129/0x2b0 [zfs]
kernel:  dmu_buf_hold+0x5b/0x90 [zfs]
kernel:  zap_lockdir+0x4e/0xb0 [zfs]
kernel:  zap_cursor_retrieve+0x1ae/0x320 [zfs]
kernel:  ? dbuf_prefetch+0xf/0x20 [zfs]
kernel:  ? dmu_prefetch+0xc8/0x200 [zfs]
kernel:  zfs_readdir+0x12a/0x440 [zfs]
kernel:  ? preempt_count_add+0x68/0xa0
kernel:  ? preempt_count_add+0x68/0xa0
kernel:  ? aa_file_perm+0x120/0x4c0
kernel:  ? rrw_exit+0x65/0x150 [zfs]
kernel:  ? _copy_to_user+0x21/0x30
kernel:  ? cp_new_stat+0x150/0x180
kernel:  zpl_iterate+0x4c/0x70 [zfs]
kernel:  iterate_dir+0x171/0x1c0
kernel:  __x64_sys_getdents64+0x78/0x110
kernel:  ? __ia32_sys_getdents64+0x110/0x110
kernel:  do_syscall_64+0x38/0xc0
kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f03c897a9c7
kernel: RSP: 002b:00007ffd41e3c518 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
kernel: RAX: ffffffffffffffda RBX: 0000561eff64dd40 RCX: 00007f03c897a9c7
kernel: RDX: 0000000000008000 RSI: 0000561eff64dd70 RDI: 0000000000000000
kernel: RBP: 0000561eff64dd70 R08: 0000000000000030 R09: 00007f03c8a72be0
kernel: R10: 0000000000020000 R11: 0000000000000293 R12: ffffffffffffff80
kernel: R13: 0000561eff64dd44 R14: 0000000000000000 R15: 0000000000000001
kernel:  </TASK>

As far as I can tell, the problem is that while an NBD request is pending, the process that waits for the result (in this case rsync) refuses to freeze. This happens no matter how long the timeout is set, so I suspect that the root cause is that the NBD server task (in this case nbdkit) has already been frozen, leaving the client process unable to make progress to a state where it can be frozen.

Interestingly enough, the same should apparently happen with FUSE (see kernel list discussion) but - at least for me - almost never happens in practice. So either I was exceptionally lucky, or something else is going on (I suspect that maybe a task waiting for FUSE I/O enters interruptible sleep, while a task waiting for ZFS I/O enters uninterruptible sleep).

As an experiment, I tried renaming the NBD server to zzz_nbdkit (hoping that freezing goes in alphabetical order), but it did not help.

I also discovered that attempting to suspend while zpool export is running is a very bad idea. In contrast to "regular" client processes, the kernel here hangs before attempting to freeze userspace:

kernel: PM: suspend entry (deep)
kernel: Filesystems sync: 661.109 seconds
[...]
kernel: Freezing user space processes ...

In between the first two messages, the system is in a weird, semi-suspended state. This means that e.g. WiFi is no longer available, which may well prevent the sync from ever completing if it requires data to go over the network. The same probably applies to some other devices too. Furthermore, the system will refuse to restart, suspend, or power off.

Metadata Performance

The third issue is something that, in hindsight, I probably should have expected: with the metadata being downloaded piece by piece, rather than all at once, metadata operations with ZFS-on-NBD are a lot slower than with S3QL.

I did not, however, expect them to be that slow. Running a simple find -type d -exec ls -l {} \; on a directory tree with 687 files required 28 seconds - i.e. roughly 25 files per second. (The total amount of data downloaded for this was 9.6 MB, but this was most likely dominated by the transfer from mounting and unmounting).

This means that even a backup where no files have been changed requires a long time to run. I could probably still accept this if I could suspend the system during that time, but taken together with the inability to suspend this becomes a dealbreaker for me. Turning on your laptop briefly to check something, and then being unable to turn it off again because a backup has kicked in is very frustrating.

Adding caching

At this point I accepted that I would probably have to add some sort of persistent cache to my setup. At least in theory, a cache should be able to solve all of the above problems. I identified four options for this:

  1. Use ZFS's L2ARC
  2. Use s3backer and its caching feature
  3. Use bcache
  4. Add persistent caching support to nbdkit (I did not look into this in more detail).

L2 ARC

ZFS's L2ARC seemed like the most natural solution for this since it is part of ZFS itself. My hope was that:

  • Since it's tightly integrated into ZFS, it would not cache the same data twice (in ARC/page-cache and L2ARC)
  • Since it's in-kernel, writes to the cache should not be affected by the freezing of userspace, enabling suspend during I/O.

Unfortunately, I discovered that the way the L2ARC is designed makes it pretty useless for my use-case.

First, the L2 ARC is not a writeback cache, so suspend continued to be impossible while there was active I/O.

Secondly, I observed absolutely no performance increase. I tried adding an L2 ARC backed by a local, on-disk file and, to maximize cache filling, set l2arc_noprefetch to zero, l2arc_headroom to 1000, and l2arc_write_max to 100 MB. I then mounted and accessed the filesystem a few times. However, performance for metadata operations remained exactly the same as without the cache.
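
Concretely, the experiment amounted to something along these lines (the file path is a placeholder, and going through a loop device is just one way of backing the cache vdev with a file):

$ truncate -s 10G /var/tmp/l2arc.img             # file backing the cache
$ losetup -f --show /var/tmp/l2arc.img           # -> /dev/loop0
$ zpool add s3_plain cache /dev/loop0
$ echo 0         > /sys/module/zfs/parameters/l2arc_noprefetch
$ echo 1000      > /sys/module/zfs/parameters/l2arc_headroom
$ echo 104857600 > /sys/module/zfs/parameters/l2arc_write_max   # 100 MB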

It seems that, for whatever reason, the data that's needed for simple directory walking does not make it into the L2 ARC. I spent some time studying the L2 ARC Feeding Mechanism but still could not figure out why. So I gave up on this approach.

S3backer over NBD

Since I couldn't get the L2 ARC to work, I decided to instead give s3backer a try after all (since it looked like it provided the kind of caching that I needed).

s3backer being in userspace, I did not expect it to make a difference for the ability to suspend during I/O. However, I was hoping that I'd see much improved metadata performance and better write throughput (since the cache completely decouples NBD requests from HTTP requests to cloud storage).

My first step was to contribute NBD support for s3backer, i.e. I made it possible to run s3backer as an NBD server (in the form of another nbdkit plugin) rather than a FUSE server. I also disabled as much functionality as possible (no compression, no encryption, no MD5 verification). In this setup, we still have duplication between ARC and page cache, but I contributed patches to at least advise the kernel to drop page cache data quickly.

Unfortunately, working on the s3backer codebase was quite frustrating for me, for several reasons:

  • The code is written in C and uses structs of function pointers to provide flexible layering of components. This means that given a function call in the code, there is no easy way to jump to the code implementing the function (at least I did not manage to do so with either Emacs or VS Code). This makes it hard to navigate and understand the codebase.
  • There are no unit or integration tests. This means that when making changes there was no easy way to check that I didn't break something.
  • When testing my own changes, I repeatedly encountered bugs in the master branch - which initially I kept attributing to my own changes (examples: #191, #184)
  • Almost every time I would try to run S3Backer (e.g. after pulling a new version, or wanting to try a different configuration) it would not work (examples: #181, #179, #175, #174)

The last point probably deserves some additional explanation. As far as I know, S3Backer is used in production by many users, so it cannot truly be as unreliable as it felt to me. I therefore suspect that the reason for my unfortunate experience is the combination of S3backer's extreme configurability (there is a huge number of configuration parameters) with the absence of tests.

My theory is that the active s3backer users have been using it with the same, unchanging configuration for a long time (so bugs affecting that configuration have long been found and fixed). In contrast, I probably used a configuration that - while theoretically supported - had never been used by anyone in practice and was therefore entirely untested. To s3backer's credit, the bugs that I encountered were generally fixed quickly (or a workaround provided), but it still made for a frustrating experience.

Nevertheless, with the help of s3backer's main developer, I was eventually able to get things running as I wanted. Once again, at first sight things looked good:

  • I was able to switch between s3backer and nbdkit's S3 plugin at will, accessing the same data.
  • Caching worked as expected, making metadata access much faster
  • Alignment effects were (presumably) reduced (I did not test this systematically).

However, I also ran into new issues:

  • By default, s3backer's trim/discard operation is very inefficient. Given a range to discard, s3backer unconditionally tries to delete every object in this range, even if none of them actually exist (the S3 nbdkit plugin, in contrast, issues a scoped LIST request first and only deletes the objects that actually exist - see the sketch after this list).
  • The alternative is to configure s3backer to download the full list of objects on startup and keep it in memory. This makes discard requests faster, but it means that the time and memory consumption to mount the filesystem becomes proportional to the size of the filesystem, i.e. the very thing that I disliked about S3QL and wanted to eliminate.
  • s3backer's cache is LRU rather than LFU. So when writing a lot of data, this data pushes the metadata out of the cache. What I really needed was seemingly an LFU or writearound cache.
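
For reference, the scoped-LIST-then-DELETE strategy of the nbdkit S3 plugin looks roughly like the following sketch. This is written against boto3 rather than the plugin's actual code, and the key naming scheme shown here is hypothetical:

import boto3

def discard_range(bucket, prefix, first_block, last_block):
    """Delete only those objects in [first_block, last_block] that actually exist."""
    s3 = boto3.client("s3")
    doomed = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # hypothetical key scheme: "<prefix><block number in hex>"
            blockno = int(obj["Key"][len(prefix):], 16)
            if first_block <= blockno <= last_block:
                doomed.append({"Key": obj["Key"]})
    # the DeleteObjects API accepts at most 1000 keys per call
    for i in range(0, len(doomed), 1000):
        s3.delete_objects(Bucket=bucket, Delete={"Objects": doomed[i:i + 1000]})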

bcache

My third attempt to add caching to my setup featured bcache, a kernel-side caching layer for block devices.

I had used bcache in the past as a local, SSD-based cache for HDDs. Based on this, I already had some reservations:

Firstly, the interface is clumsy. Removing/unregistering bcache devices is hard. It requires writing commands into multiple files in /sys/{fs,block} in the right order, and then making sure that nothing triggers re-registration through udev rules.

Secondly, bcache seems to have been effectively abandoned (I suspect at the point when it was sufficiently stable to serve as a proof of concept for bcachefs). This is reflected in there being some major unresolved bugs and the documentation being sparse, partially out of date, and split across too many locations (e.g. make-bcache(8) does not agree with make-bcache --help, and the kernel docs refer to non-existent control files).

Still, bcache seemed like a reasonable solution for the problem at hand so I gave it a shot. I turned the two NBD devices into bcache backing devices and added two cache devices (by loopback-mounting two local image files). I then created a zpool from the resulting two bcache devices.
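
The commands involved were roughly the following (image paths and UUIDs are placeholders; on most systems udev registers the devices automatically after make-bcache, otherwise they have to be echoed into /sys/fs/bcache/register):

$ losetup -f --show /var/cache/bcache_small.img    # -> /dev/loop0
$ losetup -f --show /var/cache/bcache_large.img    # -> /dev/loop1
$ make-bcache -B /dev/nbd1; make-bcache -C /dev/loop0    # backing device + cache
$ make-bcache -B /dev/nbd2; make-bcache -C /dev/loop1
$ bcache-super-show /dev/loop0 | grep cset.uuid          # note the cache set UUIDs
$ echo <cset-uuid-0> > /sys/block/bcache0/bcache/attach  # attach cache to backing device
$ echo <cset-uuid-1> > /sys/block/bcache1/bcache/attach
$ zpool create ... s3_bcache /dev/bcache1 special /dev/bcache0   # same options as before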

Once again, fundamentally this setup worked. Both ZFS and bcache performed as expected, oblivious to the fact that the backing devices were stored in the cloud. I also found that make-bcache's bucket and block size settings seemed to affect only requests to the caching device, not the backing device. This was a relief, because the bcache documentation did not give me much guidance on how to choose these values.

I also got great improvements in performance. When emulating 20 ms network latency and 1 MB/s bandwidth, I previously got:

# time find /zpools/test/ -type d -exec ls -l {} \; | wc -l
140466

________________________________________________________
Executed in  409.17 secs   fish           external
   usr time    3.93 secs  600.00 micros    3.93 secs
   sys time    3.62 secs  186.00 micros    3.62 secs

With bcache in the stack (and a single cache-priming run), this improved to:

# time find /zpools/test_bcache/ -type d -exec ls -l {} \; | wc -l
140466

________________________________________________________
Executed in    3.70 secs   fish           external
   usr time    2.51 secs  608.00 micros    2.51 secs
   sys time    1.30 secs  186.00 micros    1.30 secs

bcache supports writeback, writethrough, and writearound caching. In both writethrough and writearound mode, the system still cannot suspend while there is active I/O. However, writearound has the advantage that writing data to the filesystem does not push out cached metadata - a big advantage when running backups.

In writeback mode, write requests can be completed without having to write to the backing (NBD) device. This means that as long as there is space available on the caching device, the system can suspend even while under I/O - Eureka! However, this comes with the disadvantage that writing new files can now push out cached metadata.

Luckily, both these caveats can be worked around by pacing writes to the zpool. I found that by rate-limiting the rsync process that I use to create backups to my uplink bandwidth, I could ensure that the cache never fills up completely - thus ensuring that suspend is always possible and metadata is not pushed out.
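
In practice, this meant starting the backup with a bandwidth limit roughly matching my uplink, along these lines (the paths are placeholders; rsync's --bwlimit is given in KiB/s):

$ rsync -aHAX --bwlimit=3000 /home/ /zpools/s3_bcache/home/   # ~3 MB/s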

As the reader possibly expects at this point, there were still other drawbacks:

  • Some ZFS operations seem to be able to bypass the cache even in writeback mode. Setting /sys/module/zfs/parameters/{zfs,zil}_nocacheflush to 1 prevented this behavior for zpool sync and zpool export, but I did not manage to carry out e.g. zfs snapshot without activity on the backing device. This matters because while such commands run, suspend does not work correctly. I worked around this by wrapping such commands into systemd-inhibit.
  • There does not seem to be a good way to tell when the NBD device can be disconnected. When shutting down a bcache device (by writing into the stop files in /sys/block/<dev>/bcache/cache/stop and /sys/block/<dev>/bcache/stop), the corresponding control files disappear, but bcache continues to write to the backing (NBD) device, resulting in data loss when terminating nbdkit at this point. The only reliable method that I found was to follow the kernel's log messages and wait for a cache: bcache_device_free() bcache<N> stopped message - which feels very klunky.
  • The documented way to flush the bcache cache (writing none to the cache_mode control file) works, but also blocks suspend while the command is running. I worked around this by setting writeback_percent and writeback_delay to zero and polling /sys/block/<dev>/bcache/dirty_data (sketched below).
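
A minimal version of that last workaround looks roughly like this (the bcache device name is a placeholder, and the exact format reported by dirty_data may differ):

$ echo 0 > /sys/block/bcache0/bcache/writeback_percent
$ echo 0 > /sys/block/bcache0/bcache/writeback_delay
$ while [ "$(cat /sys/block/bcache0/bcache/dirty_data)" != "0.0k" ]; do
    sleep 1    # wait until all dirty data has been written back
  done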

Back to s3backer

Just when I thought that I finally had a reasonably well-working configuration and started running reasonably sized backups, I discovered another problem: the performance of the nbdkit S3 plugin is abysmal for small NBD requests. For example, I would see 112 kB writes taking 0.45 seconds (248 kB/s) and 4 kB writes taking 0.36 seconds (11 kB/s).

I have not yet root-caused this, but I suspect that either the code does not re-use established HTTP connections or that some of the hash calculations are done in pure Python (reducing performance and preventing effective multithreading). A casual inspection of the boto Python module (which is used for interfacing with S3) made me quickly drop the idea of ever touching this code. In short, boto is very "enterprisy" code that provides a generic interface to every possible AWS service (not just S3), and attempts to navigate (let alone understand) the S3-specific code made the s3backer codebase look like a piece of cake.

Therefore, I found myself coming back to s3backer a third time. This time, I also disabled its caching functionality (keeping bcache in the stack for that instead). With this setup, write performance was much better (though, as mentioned before, at the cost of reduced discard performance).

Summary

So, what have I learned from all of this?

First of all, I was able to set up a ZFS-on-NBD stack that I think would most likely satisfy my backup requirements.

But... is it better than my current, S3QL-based setup? I am not sure.

  • Performance- and feature-wise, both solutions perform equally well.
  • In terms of robustness and complexity, I prefer having encryption and compression implemented in ZFS rather than in S3QL. I believe the ZFS implementation is both more efficient and better tested.
  • However, the ZFS-on-NBD setup involves stacking together a large number of components (ZFS, nbdkit, bcache, s3backer) that probably aren't used together in this form by many people, and each of which introduces new points of failure. Just setting up the mountpoint properly requires a script with hundreds of lines of code.
  • Furthermore, some of these components are abandoned, many of them are at least as complex as S3QL on its own, and most of them I'm not comfortable debugging or modifying.
  • S3QL, on the other hand, is an all-in-one solution with a code-base that I'm very familiar with.
  • Lastly, I currently seem to have the choice between abysmal performance for small writes (when using the S3 plugin), abysmal performance for discard requests (s3backer without "preloading" the object list), or having to download (and keep in memory) a full object list at mount time (the very kind of operation that I dislike most about S3QL).

For now, I think I will run both solutions in parallel and wait for comments on this write-up!
