ZFS-on-NBD with s3backer


Nikolaus Rath

Sep 12, 2022, 4:28:49 PM
to s3backe...@googlegroups.com
Hi all,

I've been experimenting with running ZFS on NBD in various ways (some with, and some without s3backer).

In case someone is interested, here is the (rather long) write-up: https://www.rath.org/s3ql-vs-zfs-on-nbd.html

Best,
-Nikolaus

--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«


ciesiels...@gmail.com

Sep 12, 2022, 5:16:53 PM
to s3backer-devel
Thank you, Nikolaus,
that's a very interesting read.

Just one question. Is the article one year old?
[Attachment: Untitled.png]

Martin Raiber

Sep 12, 2022, 9:16:00 PM
to s3backe...@googlegroups.com
Interesting!

Most of the disadvantages you mention for FUSE can be worked around by:

 * Using losetup with direct-io=on <-- this allows concurrent reads/writes and is the most important one
 * Setting PR_SET_IO_FLUSHER on every thread involved with flushing I/O, to prevent circular memory allocation issues (I'd recommend doing that for the NBD solution as well; see the sketch below)
 * Making the size of FUSE reads/writes tunable, but you probably know better :)
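
For reference, a minimal sketch of what setting the flag could look like from userspace (illustration only: it assumes Linux >= 5.6 and uses Python/ctypes for brevity, with the prctl option value 57 taken from <linux/prctl.h>); the losetup part would be something like `losetup --direct-io=on --find --show <backing-file>`:

```python
# Minimal sketch: mark the calling thread as an "I/O flusher" so the kernel
# exempts it from memory-reclaim throttling that could otherwise deadlock
# against the very I/O it is trying to flush. Assumes Linux >= 5.6.
import ctypes
import os

PR_SET_IO_FLUSHER = 57  # from <linux/prctl.h>, added in Linux 5.6

libc = ctypes.CDLL(None, use_errno=True)

def set_io_flusher():
    # The flag is per-thread, so every thread that flushes I/O must call this.
    if libc.prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))

if __name__ == "__main__":
    set_io_flusher()
```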


Archie Cobbs

Sep 13, 2022, 1:55:31 PM
to s3backe...@googlegroups.com
Hi Nikolaus,

Thanks for writing all that up, it's an interesting saga...

Regarding the s3backer issues you mention... First, thanks in general for investigating and reporting those bugs and enhancements (especially NBD).

Regarding trim/discard performance: there's not much more we can do here. You get optimal speed by using the --listBlocks flag, and since 1.6.2 this query is done in the background, so it doesn't slow down startup.

Of course if you have a zillion non-empty blocks then it will ultimately pull down a bunch of data (about 100 bytes per non-zero block, compressed).

As far as memory consumption goes, it uses a bitmap, so it requires one bit for each block in the file (whether zero or not). The code assumes this is not going to be a huge amount of memory, but if you have a small block size and/or a huge file then it might be. Using an extent structure could reduce memory usage if memory is a problem.
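
To make the memory math concrete, here is a small illustration (the file and block sizes are made-up examples, not s3backer defaults) of what a one-bit-per-block bitmap costs:

```python
# Illustration only: rough memory cost of a one-bit-per-block bitmap.
# The file and block sizes below are made-up examples, not s3backer defaults.

def bitmap_bytes(file_size, block_size):
    blocks = (file_size + block_size - 1) // block_size  # number of blocks, rounded up
    return (blocks + 7) // 8                              # one bit per block

TiB = 1 << 40
KiB = 1 << 10

for file_size, block_size in [(1 * TiB, 128 * KiB),
                              (1 * TiB, 4 * KiB),
                              (100 * TiB, 16 * KiB)]:
    mib = bitmap_bytes(file_size, block_size) / (1 << 20)
    print(f"{file_size // TiB} TiB file, {block_size // KiB} KiB blocks "
          f"-> {mib:.0f} MiB bitmap")
```

So for typical block sizes the bitmap stays small, but a huge file combined with a small block size pushes it into hundreds of MiB.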

If you don't use --listBlocks, then when asked to discard a contiguous range of blocks there's not much we can do other than try them one by one.

I suppose there could be some hybrid variant, where we only list the affected blocks when asked to discard a contiguous range. This would of course slow down the trim operation.

Another thing that might help would be to keep track of non-zero blocks in the on-disk block cache.

Regarding LRU vs. LFU: An interesting idea, probably wouldn't be too hard to implement.

ciesielski writes:
> Just one question. Is the article one year old?

Can't be, because it references s3backer bugs from within the past year :)

-Archie




--
Archie L. Cobbs

Nikolaus Rath

Sep 16, 2022, 3:02:29 AM
to s3backer-devel
Hi,

No, of course not, that's just a typo. Thanks!

Best,
-Nikolaus

Nikolaus Rath

Sep 16, 2022, 3:17:18 AM
to s3backer-devel
Hi,

Using direct I/O with losetup is a great idea; I didn't think of that! I think the main advantage is that it avoids page cache duplication, though: FUSE_CAP_ASYNC_DIO was only introduced in libfuse 3.x, and s3backer is using libfuse 2.9.

Similarly, the size of FUSE reads/writes is limited by the kernel. So yes, you could patch the kernel and get larger requests, but I still think that's an unreasonable burden, especially since all of this just reduces the disadvantages of using FUSE without ever getting any advantages from it :-).

Thanks also for the pointer to PR_SET_IO_FLUSHER - I didn't know about that either! I think this is particularly interesting because I think the kernel ought to use this to freeze such processes last. I will follow up on that!

Best,
-Nikolaus

Nikolaus Rath

Sep 16, 2022, 4:02:54 AM
to s3backer-devel
Hi Archie,

What do you think about the approach used by nbdkit's S3 plugin for trimming? It starts by issuing a scoped LIST request that returns only objects within the range that is to be trimmed.

In the worst case, all the objects exist, in which case the overhead is one extra round-trip (same cost as trimming one additional block) plus slightly increased bandwidth (to transfer the block list). With 100 bytes per block, that's probably negligible compared to the size of the request + response for each individual DELETE request.

In the best case, where none of the objects exist, we save thousands of pointless DELETE requests.

If the overhead of the extra LIST request is a concern, you could also do this only for trim requests that span some minimum number of blocks.
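
For illustration, a rough sketch of the scoped-LIST idea (assumptions: boto3 instead of s3backer's own HTTP layer, and block objects named as a common prefix followed by a zero-padded block number so that a contiguous block range maps to a lexicographic key range; neither is necessarily how s3backer lays out its keys):

```python
# Rough sketch of the "scoped LIST before trim" idea. Assumptions for
# illustration only: boto3 instead of s3backer's HTTP layer, and block
# objects named <prefix><zero-padded block number> so that a contiguous
# block range maps to a lexicographic key range.
import boto3

def blocks_present(bucket, prefix, first_block, last_block, digits=16):
    """Return the block numbers in [first_block, last_block] that exist in S3."""
    s3 = boto3.client("s3")
    end_key = f"{prefix}{last_block:0{digits}d}"
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    if first_block > 0:
        # StartAfter is exclusive, so start just below the first block in range.
        kwargs["StartAfter"] = f"{prefix}{first_block - 1:0{digits}d}"
    present = []
    for page in s3.get_paginator("list_objects_v2").paginate(**kwargs):
        for obj in page.get("Contents", []):
            if obj["Key"] > end_key:
                return present               # past the end of the trim range
            present.append(int(obj["Key"][len(prefix):]))
    return present
```

Only the blocks that come back need DELETEs (or one bulk delete), and an empty result means the whole trim range can be skipped.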

Best,
-Nikolaus

Archie Cobbs

Sep 16, 2022, 11:36:40 AM
to s3backe...@googlegroups.com
On Fri, Sep 16, 2022 at 3:02 AM Nikolaus Rath <Niko...@rath.org> wrote:
> What do you think about the approach used by nbdkit's S3 plugin for trimming? It starts by issuing a scoped LIST request that returns only objects within the range that is to be trimmed.

I'm still not convinced that it's worth it, practically speaking.

A basic question is this: in what scenarios would you NOT want to use --listBlocks and why?

Another question is: when are you going to be in a situation where a filesystem is asking to TRIM a bunch of blocks that are already zero?

It seems like in normal usage a filesystem only TRIMs blocks that it is deallocating, which means that they are (very likely to be) not currently zero.

Seems like large TRIMs of already-deleted blocks would only happen in some one-time event, e.g., maybe filesystem initialization.

> In the best case, where none of the objects exist, we save thousands of pointless DELETE requests.

We don't delete one-by-one anymore; we use bulk delete, in which you POST to https://blah?delete with a <Delete> XML payload containing up to 1000 blocks at once. So this has gotten more efficient since 1.6.3.
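
For reference, the same multi-object delete API as seen through boto3 (a sketch only; s3backer builds and POSTs the <Delete> XML itself):

```python
# Sketch of S3 multi-object delete: one POST ?delete request removes up to
# 1000 keys. boto3's delete_objects builds the <Delete> XML payload for us;
# s3backer constructs and POSTs it directly.
import boto3

def bulk_delete(bucket, keys):
    s3 = boto3.client("s3")
    for i in range(0, len(keys), 1000):      # API limit: 1000 keys per request
        batch = [{"Key": k} for k in keys[i:i + 1000]]
        s3.delete_objects(Bucket=bucket,
                          Delete={"Objects": batch, "Quiet": True})
```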
 
> If the overhead of the extra LIST request is a concern, you could also do this only for trim requests that span some minimum number of blocks.

Yep this makes sense... there would be some minimum size where it'd be worth it to switch strategies.

-Archie

--
Archie L. Cobbs

Nikolaus Rath

Sep 16, 2022, 4:04:05 PM
to Archie Cobbs, s3backe...@googlegroups.com
On Sep 16 2022, Archie Cobbs <archie...@gmail.com> wrote:
> On Fri, Sep 16, 2022 at 3:02 AM Nikolaus Rath <Niko...@rath.org> wrote:
>
>> What do you think about the approach used by nbdkit's S3 plugin for
>> trimming? It starts by issuing a scoped LIST request that returns only
>> objects within the range that is to be trimmed.
>
>
> I'm still not convinced that it's worth it, practically speaking.
>
> A basic question is this: in what scenarios would you NOT want to use
> --listBlocks and why?

Me personally? All the time, for reasons of algorithmic elegance.

The current S3QL behavior (download/upload a full SQLite DB on
mount/umount) does not cause me problems in practice either, but I still
dislike it immensely :-).

> Another question is: when are you going to be in a situation where a
> filesystem is asking to TRIM a bunch of blocks that are already zero?

Definitely when working with the block device directly (e.g. through
mkfs.ext4 or LVM). I suspect that `zpool trim` is not keeping track of which
blocks it has already discarded and just trims everything that's
currently not used.

> It seems like in normal usage a filesystem only TRIM's blocks that it is
> deallocating, which means that they are (very likely to be) not currently
> zero.

I think in normal file system usage, trims are not used at all. At
least on Debian, the default setup is to have an `fstrim` call in
crontab (IIRC for performance reasons this is preferable to having
trimming interleaved with regular filesystem ops).

>> In the best case, where none of the objects exist, we save thousands of
>> pointless DELETE requests.
>
> We don't delete one-by-one anymore, we use bulk delete in which you POST to
> https://blah?delete with a <Delete> XML payload containing up to 1000
> blocks at once. So this has gotten more efficient since 1.6.3.

Point taken. In that case the gains are indeed much less. But I'll be
honest - for me it's less about practical gains and more about elegance
of the approach.

Archie Cobbs

Sep 16, 2022, 6:57:56 PM
to Nikolaus Rath, s3backe...@googlegroups.com
On Fri, Sep 16, 2022 at 3:04 PM Nikolaus Rath <Niko...@rath.org> wrote:
> > A basic question is this: in what scenarios would you NOT want to use
> > --listBlocks and why?

> Me personally? All the time, for reasons of algorithmic elegance.

I'm all for algorithmic elegance too. But also this seems like a run-of-the-mill caching question.

I think there are two separate discussions here...
  1. You're saying: bulk delete could be smarter in the scenario when --listBlocks is not used
  2. I'm saying: I agree with #1 but why would you ever not want to use --listBlocks?
Regarding #2...

The information we are talking about caching is the per-block information { ZERO, UNKNOWN }. In s3backer this information is kept in a bitmap.

The reason there's no NONZERO state is that, for practical purposes, NONZERO would be the same as UNKNOWN, i.e., we wouldn't behave any differently between the two.

Of course the default is UNKNOWN. The cache gets slowly loaded over time as we write (or delete) individual blocks.

The cache saves time whenever we (a) read a zero block or (b) delete a zero block (including trim operations). Note that it is useful whether or not you ever do a trim operation.

If you use --listBlocks then you are simply proactively loading the entire cache. This takes time proportional to the number of NONZERO blocks (running in the background). But then you never need to make another such query, and all reads/deletes of zero blocks are optimized.

Now if you have a sparsely populated bucket then --listBlocks is clearly worthwhile.

If you have a mostly full bucket then it's perhaps not. But then also your #1 becomes less persuasive.

The longer your disk stays mounted, the more worthwhile it is because you are amortizing over more time.

If you do a giant trim operation which covers a large portion of the disk (e.g., "most of"), then it's worthwhile.

You can choose whether or not to use --listBlocks based on your own scenario.

What scenario do you have in which using --listBlocks is a big problem?

> Definitely when working with the block device directly (e.g. through
> mkfs.ext4 or LVM). I suspect that `zpool trim` is not keeping track of which
> blocks it has already discarded and just trims everything that's
> currently not used.

> > It seems like in normal usage a filesystem only TRIMs blocks that it is
> > deallocating, which means that they are (very likely to be) not currently
> > zero.

> I think in normal file system usage, trims are not used at all.

Not so sure about that... mount has the discard option for exactly this purpose. If you use it then manual or crontab use of fstrim should not normally be necessary. You are always staying current so to speak. Then you're free to not use --listBlocks and you also avoid the problem of #1.

If this is true, then you should only need manual fstrim occasionally. In those cases you can remount using --listBlocks.

-Archie

--
Archie L. Cobbs

Nikolaus Rath

Sep 17, 2022, 6:34:59 AM
to Archie Cobbs, s3backe...@googlegroups.com
On Sep 16 2022, Archie Cobbs <archie...@gmail.com> wrote:
> On Fri, Sep 16, 2022 at 3:04 PM Nikolaus Rath <Niko...@rath.org> wrote:
>
>> > A basic question is this: in what scenarios would you NOT want to use
>> > --listBlocks and why?
>>
>> Me personally? All the time, for reasons of algorithmic elegance.
>>
>
> I'm all for algorithmic elegance too. But also this seems like a
> run-of-the-mill caching question.
>
> I think there are two separate discussions here...
>
> 1. You're saying: bulk delete could be smarter in the scenario when
> --listBlocks is not used
> 2. I'm saying: I agree with #1 but why would you ever not want to use
> --listBlocks?
>
> Regarding #2...
>
> The information we are talking about caching is the per-block information {
> ZERO, UNKNOWN }.

I do not consider it a cache, because the size is not fixed but
proportional to the size of the filesystem. If I could say "use x MB to
cache information about what blocks are in use" I would be perfectly
happy. But holding a dataset in memory (no matter if accumulated over
time or populated all at once in the beginning) that is proportional to
the size of the filesystem does not feel right for me.

Archie Cobbs

Sep 17, 2022, 11:20:30 AM
to Nikolaus Rath, s3backe...@googlegroups.com
On Sat, Sep 17, 2022 at 5:34 AM Nikolaus Rath <Niko...@rath.org> wrote:
> > The information we are talking about caching is the per-block information {
> > ZERO, UNKNOWN }.

> I do not consider it a cache, because the size is not fixed but
> proportional to the size of the filesystem. If I could say "use x MB to
> cache information about what blocks are in use" I would be perfectly
> happy. But holding a dataset in memory (no matter if accumulated over
> time or populated all at once in the beginning) that is proportional to
> the size of the filesystem does not feel right for me.

That's a fair point.

The original thought was: since it only costs one bit of memory per block, why add the extra complexity? A fixed bitmap should be fine. Just a pragmatic decision.

Obviously for a huge filesystem connected to a smaller machine that could possibly take up too much memory. But in the worst case, it could require at most 4GB... because there are at most 2^32 blocks (when s3backer is compiled normally)... which is big but not crazy if you're serious about performance.

A more space-optimized version of this cache could use extents (i.e., compress long stretches of the same bit) to reduce memory usage. That would definitely be nicer from an aesthetic perspective. Practically speaking though, I have yet to hear any complaints about the memory usage of the zero cache.
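
For what it's worth, a toy sketch (illustration only, not a proposal for s3backer's actual data structure) of an extent-based zero-block cache that stores sorted runs instead of a flat bitmap:

```python
# Toy sketch of an extent-based zero-block cache: instead of one bit per
# block, keep sorted, non-overlapping [start, end) runs of blocks known to
# be zero. Memory is proportional to the number of runs, not the disk size.
import bisect

class ZeroExtents:
    def __init__(self):
        self.starts = []   # run start block numbers, sorted
        self.ends = []     # corresponding exclusive run ends

    def is_zero(self, block):
        """True if the block is known to be zero, False if unknown."""
        i = bisect.bisect_right(self.starts, block) - 1
        return i >= 0 and block < self.ends[i]

    def mark_zero(self, start, end):
        """Record that blocks [start, end) are zero, merging adjacent runs."""
        i = bisect.bisect_left(self.starts, start)
        if i > 0 and self.ends[i - 1] >= start:   # merge with the preceding run
            i -= 1
            start = self.starts[i]
        j = i
        while j < len(self.starts) and self.starts[j] <= end:
            end = max(end, self.ends[j])          # swallow overlapping runs
            j += 1
        self.starts[i:j] = [start]
        self.ends[i:j] = [end]
```

Marking a block non-zero again would need a corresponding split operation, which is where most of the extra complexity over a flat bitmap would come from.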

-Archie

--
Archie L. Cobbs