libnfs bindings in NBD server


Richard W.M. Jones

Apr 8, 2025, 6:17:06 AM
to lib...@googlegroups.com, Ronnie Sahlberg
Hi,

Firstly I couldn't work out how to subscribe to the libnfs mailing
list, so hopefully this message finds its way to the right people.

I'm trying to add libnfs bindings to our pluggable Network Block
Device (NBD) server (https://gitlab.com/nbdkit/nbdkit) and I have a
few technical questions. This could eventually be an alternative to /
replacement for the qemu block layer libnfs driver
(https://gitlab.com/qemu-project/qemu/-/blob/master/block/nfs.c?ref_type=heads).
I want performance to be the best that is reasonably possible. I have
a few questions about the best way to structure this.

(1) nbdkit is multithreaded, with each NBD client read/write request
being handled from a pool of threads. An easy way to add libnfs
support would simply be to use the libnfs synchronous API from the
thread that handles the request.

Another possibility (which we have used in other plugins) is to start
one or more background worker threads, and use the libnfs asynchronous
API from those worker thread(s) only.

Do you have an opinion on which of these would have better performance?
And if the second, how many worker threads to use?

(2) For specifying the connection options, we could map each libnfs
feature (eg. server name, NFS version, etc) into a separate nbdkit
option, which would look to the user like:

nbdkit nfs server=nfs.example.com mount=/mnt file=disk.img version=4

or we could use the libnfs URI format:

nbdkit nfs 'nfs://nfs.example.com/mnt?disk.img?version=4'

The second one seems like the best option, but any opinions / catches
we should be aware of?
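For illustration, here is a rough sketch of how the plugin might lean on
libnfs's own URL parser rather than inventing a separate option syntax.
nfs_parse_url_full() and struct nfs_url are from libnfs.h; exactly how the
path, file and query parts get split is something to verify against the
header, so treat this as a sketch only:

#include <stdio.h>
#include <stdlib.h>
#include <nfsc/libnfs.h>

int
main (int argc, char *argv[])
{
  if (argc < 2) {
    fprintf (stderr, "usage: %s nfs://server/export/file\n", argv[0]);
    exit (EXIT_FAILURE);
  }

  struct nfs_context *nfs = nfs_init_context ();
  if (nfs == NULL) {
    fprintf (stderr, "nfs_init_context failed\n");
    exit (EXIT_FAILURE);
  }

  /* e.g. argv[1] = "nfs://nfs.example.com/mnt?disk.img?version=4" */
  struct nfs_url *url = nfs_parse_url_full (nfs, argv[1]);
  if (url == NULL) {
    fprintf (stderr, "%s\n", nfs_get_error (nfs));
    exit (EXIT_FAILURE);
  }

  printf ("server=%s path=%s file=%s\n",
          url->server, url->path, url->file ? url->file : "(none)");

  nfs_destroy_url (url);
  nfs_destroy_context (nfs);
  return 0;
}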

(3) NBD has a property called "multiconn" which is quite critical to
performance. When this property is advertised it allows a single
client to safely make multiple connections to the server. However we
can only advertise this property safely if 'fsync' on one connection
also persists writes that have been completed by other connections.
The exact wording from the spec is:

bit 8, NBD_FLAG_CAN_MULTI_CONN: Indicates that the server operates
entirely without cache, or that the cache it uses is shared among
all connections to the given device. In particular, if this flag is
present, then the effects of NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA MUST
be visible across all connections when the server sends its reply to
that command to the client. In the absence of this flag, clients
SHOULD NOT multiplex their commands over more than one connection to
the export.
[https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md]

Determining this property usually involves examining the server side
of whatever we are connecting to - an NFS server in this case - but I
wonder if you would know the answer here?

This question also depends on the answer to (1) since we may be able
to serialize fsync through a single worker thread.

(4) NBD supports: trim/discard (hole punching) NBD_CMD_TRIM; and
writing zeroes NBD_CMD_WRITE_ZEROES. I may be missing something, but
I don't see anything like that in the API. Is that not supported? By
NFS itself or just by libnfs?

Thanks for any help you can give!

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.
http://fedoraproject.org/wiki/MinGW

Richard W.M. Jones

Apr 8, 2025, 10:23:16 AM
to lib...@googlegroups.com, Ronnie Sahlberg
On Tue, Apr 08, 2025 at 11:17:00AM +0100, Richard W.M. Jones wrote:
> (1) nbdkit is multithreaded, with each NBD client read/write request
> being handled from a pool of threads. An easy way to add libnfs
> support would simply be to use the libnfs synchronous API from the
> thread that handles the request.
>
> Another possibility (which we have used in other plugins) is to start
> one or more background worker threads, and use the libnfs asynchronous
> API from those worker thread(s) only.
>
> Do you have an opinion on which of these would have better performance?
> And if the second, how many worker threads to use?

I see now that the multithreading feature is most likely what I should
be using:

https://github.com/sahlberg/libnfs/blob/master/README.multithreading
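As I understand that README, the pattern is roughly the following.  The
nfs_mt_service_thread_start()/_stop() names are taken from the README, so
this is only a sketch; check libnfs.h for the exact signatures:

#include <stdio.h>
#include <stdlib.h>
#include <nfsc/libnfs.h>

static struct nfs_context *nfs;

static void
nfs_setup (const char *server, const char *export)
{
  nfs = nfs_init_context ();
  if (nfs == NULL) {
    fprintf (stderr, "nfs_init_context failed\n");
    exit (EXIT_FAILURE);
  }
  if (nfs_mount (nfs, server, export) != 0 ||
      nfs_mt_service_thread_start (nfs) != 0) {
    fprintf (stderr, "libnfs setup failed: %s\n", nfs_get_error (nfs));
    exit (EXIT_FAILURE);
  }
  /* From here, each nbdkit worker thread can call the synchronous API
   * (nfs_pread, nfs_pwrite, nfs_fsync, ...) concurrently on this
   * context; the libnfs service thread drives the socket. */
}

static void
nfs_teardown (void)
{
  nfs_mt_service_thread_stop (nfs);
  nfs_destroy_context (nfs);
}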

After implementing something using the multithreading API, I have a
few more questions (as well as the ones asked previously):

(5) Is there a "mount readonly" option (similar to 'mount -o ro ...'
in the kernel client)? Obviously I can just not write anything, but
NBD has a readonly feature flag and it would be nice for safety to
reflect this through to the NFS layer if that's a thing.

(6) Is there a way to fetch natural I/O alignment, like the Linux
minimum / optimum I/O size feature? Again this is an NBD feature
which might be reflected into NFS requests.

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

Richard W.M. Jones

Apr 8, 2025, 11:49:05 AM
to lib...@googlegroups.com, Ronnie Sahlberg, ebl...@redhat.com
Here's a proposal for the nbdkit / libnfs plugin:

https://gitlab.com/nbdkit/nbdkit/-/merge_requests/84

It's very slow!  I think the main reason currently is the lack of
support for detecting file extents.  Does NFS support that?  In Linux
we would use SEEK_HOLE / SEEK_DATA, but those don't appear in the
libnfs source code.
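(For comparison, this is the local-file version of what I mean, using
plain lseek(2) on Linux; not libnfs code, just the operation we would
like an equivalent for:)

#define _GNU_SOURCE     /* for SEEK_DATA / SEEK_HOLE */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

static void
print_extents (int fd)
{
  off_t end = lseek (fd, 0, SEEK_END);
  off_t pos = 0;

  while (pos < end) {
    /* Find the next data extent, then the hole which ends it. */
    off_t data = lseek (fd, pos, SEEK_DATA);
    if (data < 0)
      break;                      /* no more data, or not supported */
    off_t hole = lseek (fd, data, SEEK_HOLE);
    printf ("data: %lld-%lld\n", (long long) data, (long long) hole);
    pos = hole;
  }
}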

ronnie sahlberg

Apr 10, 2025, 4:29:41 AM
to lib...@googlegroups.com
On Tue, 8 Apr 2025 at 20:17, 'Richard W.M. Jones' via libnfs
<lib...@googlegroups.com> wrote:
>
> Hi,
>
> Firstly I couldn't work out how to subscribe to the libnfs mailing
> list, so hopefully this message finds its way to the right people.
>
> I'm trying to add libnfs bindings to our pluggable Network Block
> Device (NBD) server (https://gitlab.com/nbdkit/nbdkit) and I have a
> few technical questions. This could eventually be an alternative to /
> replacement for the qemu block layer libnfs driver
> (https://gitlab.com/qemu-project/qemu/-/blob/master/block/nfs.c?ref_type=heads).
> I want performance to be the best that is reasonably possible. I have
> a few questions about the best way to structure this.
>
> (1) nbdkit is multithreaded, with each NBD client read/write request
> being handled from a pool of threads. An easy way to add libnfs
> support would simply be to use the libnfs synchronous API from the
> thread that handles the request.
>
> Another possibility (which we have used in other plugins) is to start
> one or more background worker threads, and use the libnfs asynchronous
> API from those worker thread(s) only.
>
> Do you have an opinion on which of these would have better performance?
> And if the second, how many worker threads to use?

Async operations will, I think, always give the best performance.  They
allow you to have really high concurrency without a ridiculous number of
threads.  There are some users that need to replicate enormous amounts of
data, and sometimes they reach tens of thousands of concurrent operations.
(I also personally think async/event-driven designs are nicer than
multithreaded ones.)
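The core of a single-context async loop is roughly this.  nfs_get_fd(),
nfs_which_events() and nfs_service() are the real libnfs calls;
issue_requests() and requests_in_flight() are just placeholders for
queuing nfs_*_async() operations and counting their completions:

#include <poll.h>
#include <nfsc/libnfs.h>

/* Placeholders: queue nfs_*_async() calls / count outstanding ones. */
extern void issue_requests (struct nfs_context *nfs);
extern int requests_in_flight (void);

void
service_loop (struct nfs_context *nfs)
{
  issue_requests (nfs);

  while (requests_in_flight () > 0) {
    struct pollfd pfd = {
      .fd = nfs_get_fd (nfs),
      .events = nfs_which_events (nfs),
    };
    if (poll (&pfd, 1, -1) < 0)
      break;
    /* Sends queued PDUs, reads replies and runs completion callbacks. */
    if (nfs_service (nfs, pfd.revents) < 0)
      break;
  }
}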

>
> (2) For specifying the connection options, we could map each libnfs
> feature (eg. server name, NFS version, etc) into a separate nbdkit
> option, which would look to the user like:
>
> nbdkit nfs server=nfs.example.com mount=/mnt file=disk.img version=4
>
> or we could use the libnfs URI format:
>
> nbdkit nfs 'nfs://nfs.example.com/mnt?disk.img?version=4'
>
> The second one seems like the best option, but any opinions / catches
> we should be aware of?

No issue as far as I can see.
Which approach you use is more a policy question for your app.

>
> (3) NBD has a property called "multiconn" which is quite critical to
> performance. When this property is advertised it allows a single
> client to safely make multiple connections to the server. However we
> can only advertise this property safely if 'fsync' on one connection
> also persists writes that have been completed by other connections.
> The exact wording from the spec is:

You cannot yet do multiple sessions for one context, but you can use
multiple contexts, each connected to the same server/share, and then just
round-robin across them.  See for example examples/nfs-pthreads-writefile.c.

To do the kind of fsync you mention you would need to create a wrapper
that sends a sync across all the sessions.  The filehandles are shared
across all clients and sessions on the server side, so you can take the
nfsfh from one session and use it on the other sessions, and it is still
guaranteed to map to the same open file in memory on the server.
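A minimal sketch of such a wrapper, assuming the plugin keeps one context
and one open handle per NBD connection (struct nfs_conn and flush_all()
are made-up names):

#include <stddef.h>
#include <nfsc/libnfs.h>

struct nfs_conn {
  struct nfs_context *nfs;
  struct nfsfh *fh;
};

/* Fan a single NBD flush out to every libnfs session.
 * Returns 0 on success, -1 if any sync failed. */
int
flush_all (struct nfs_conn *conns, size_t nconns)
{
  int ret = 0;

  for (size_t i = 0; i < nconns; i++) {
    if (nfs_fsync (conns[i].nfs, conns[i].fh) != 0)
      ret = -1;
  }
  return ret;
}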

I strongly doubt that you will need to do this in the case of NFS though,
as all servers already guarantee that if you sync a filehandle on one
connection, this flushes the data on ALL connections server-side.

For example, kernel NFS clients often open multiple sessions, and when you
write to a file, all writes are round-robin written across the different
sessions.  Then when the app does a sync, a single COMMIT is sent on
whatever the next session in the round-robin scheme happens to be, and
everything is updated and flushed correctly on the server.

The behaviour you mention sounds like an NBD-specific requirement.  Maybe
some NBD servers have a cache that is local to each connection; that is
not the case for NFS.


>
> bit 8, NBD_FLAG_CAN_MULTI_CONN: Indicates that the server operates
> entirely without cache, or that the cache it uses is shared among
> all connections to the given device. In particular, if this flag is
> present, then the effects of NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA MUST
> be visible across all connections when the server sends its reply to
> that command to the client. In the absence of this flag, clients
> SHOULD NOT multiplex their commands over more than one connection to
> the export.
> [https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md]
>
> Determining this property usually involves examining the server side
> of whatever we are connecting to - an NFS server in this case - but I
> wonder if you would know the answer here?

You do not have to worry about it.  This is normal for NFS.  All
connections share the same cache, so a flush/COMMIT on one session will
do the right thing for all sessions.

>
> This question also depends on the answer to (1) since we may be able
> to serialize fsync through a single worker thread.
>
> (4) NBD supports: trim/discard (hole punching) NBD_CMD_TRIM; and
> writing zeroes NBD_CMD_WRITE_ZEROES. I may be missing something, but
> I don't see anything like that in the API. Is that not supported? By
> NFS itself or just by libnfs?

NFSv3 does not have this.
NFSv4 might be able to support this, but I have not looked into it.
Open an issue to add it and I can see whether discard or write-zero is
possible on v4.



ronnie sahlberg

Apr 10, 2025, 4:33:13 AM
to Richard W.M. Jones, lib...@googlegroups.com
On Wed, 9 Apr 2025 at 00:23, Richard W.M. Jones <rjo...@redhat.com> wrote:
>
> On Tue, Apr 08, 2025 at 11:17:00AM +0100, Richard W.M. Jones wrote:
> > (1) nbdkit is multithreaded, with each NBD client read/write request
> > being handled from a pool of threads. An easy way to add libnfs
> > support would simply be to use the libnfs synchronous API from the
> > thread that handles the request.
> >
> > Another possibility (which we have used in other plugins) is to start
> > one or more background worker threads, and use the libnfs asynchronous
> > API from those worker thread(s) only.
> >
> > Do you have an opinion on which of these would have better performance?
> > And if the second, how many worker threads to use?
>
> I see now that the multithreading feature is most likely what I should
> be using:
>
> https://github.com/sahlberg/libnfs/blob/master/README.multithreading
>
> After implementing something using the multithreading API, I have a
> few more questions (as well as the ones asked previously):
>
> (5) Is there a "mount readonly" option (similar to 'mount -o ro ...'
> in the kernel client)? Obviously I can just not write anything, but
> NBD has a readonly feature flag and it would be nice for safety to
> reflect this through to the NFS layer if that's a thing.

Not available right now but should be easy enough to add.
I created https://github.com/sahlberg/libnfs/issues/543 to add it.

>
> (6) Is there a way to fetch natural I/O alignment, like the Linux
> minimum / optimum I/O size feature? Again this is an NBD feature
> which might be reflected into NFS requests.

Not really.  All reads/writes on the server will be going through the
page cache, so just assuming 4k alignment is probably the best you can do.

Richard W.M. Jones

Apr 10, 2025, 4:37:22 AM
to ronnie sahlberg, lib...@googlegroups.com
On Thu, Apr 10, 2025 at 06:32:59PM +1000, ronnie sahlberg wrote:
> On Wed, 9 Apr 2025 at 00:23, Richard W.M. Jones <rjo...@redhat.com> wrote:
> > (6) Is there a way to fetch natural I/O alignment, like the Linux
> > minimum / optimum I/O size feature? Again this is an NBD feature
> > which might be reflected into NFS requests.
>
> Not really.
> All reads/write on the server will be going through pagecache so just assuming
> 4kb alignment is probably the best you can do.

That's a good point, thanks.
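On the nbdkit side I guess we could surface that through the plugin's
.block_size callback, something like the sketch below.  The 4k figure is
just the assumption you describe, not anything reported by libnfs, and
the callback registration is omitted:

#define NBDKIT_API_VERSION 2
#include <stdint.h>
#include <nbdkit-plugin.h>

static int
nfsplugin_block_size (void *handle, uint32_t *minimum,
                      uint32_t *preferred, uint32_t *maximum)
{
  *minimum = 1;           /* NFS READ/WRITE can address any byte range. */
  *preferred = 4096;      /* server page cache granularity (assumed) */
  *maximum = 0xffffffff;  /* no particular maximum */
  return 0;
}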

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top

ronnie sahlberg

Apr 10, 2025, 4:39:19 AM
to lib...@googlegroups.com, ebl...@redhat.com
On Wed, 9 Apr 2025 at 01:49, 'Richard W.M. Jones' via libnfs
<lib...@googlegroups.com> wrote:
>
> Here's a proposal for the nbdkit / libnfs plugin:
>
> https://gitlab.com/nbdkit/nbdkit/-/merge_requests/84
>
> It's very slow! I think the main reason currently is due to lack of
> support for detecting file extents. Does NFS support that? In Linux
> we would use SEEK_HOLE / SEEK_DATA, but those don't appear in the
> libnfs source code.

SEEK_HOLE / SEEK_DATA is not possible in NFSv3 but is possible in NFSv4
(4.2).  Same with PUNCH_HOLE, which is like trim/discard except that the
server MUST release the blocks and can't just do nothing (trim/discard is
just a hint, making it of very limited use).

I will open an issue and I can add these to NFSv4.

ronnie sahlberg

Apr 10, 2025, 4:43:08 AM
to Richard W.M. Jones, lib...@googlegroups.com
For performance:

It is possible to get REALLY good performance with libnfs.
Unfortunately most people are very tight-lipped and don't want to
advertise what they get in their datacentres.

This guy did publish actual numbers though:
https://taras.glek.net/posts/nfs-for-fio/

I know other people also routinely get this kind of throughput in
their datacentre bulk transfers.

Richard W.M. Jones

Apr 10, 2025, 5:18:08 AM
to ronnie sahlberg, lib...@googlegroups.com
On Thu, Apr 10, 2025 at 06:42:54PM +1000, ronnie sahlberg wrote:
> For performance,
>
> It is possible to get REALLY good performance with libnfs.
> Unfortunately most people are very tightlipped and don't want to
> advertise what they get in their datacentres.
>
> This guy did publish actual numbers though :
> https://taras.glek.net/posts/nfs-for-fio/

Interesting, thanks.

Our use case is unusual: disk images are single files, very large
(tens of gigabytes), but very sparse.

To get efficient operations, extent detection (SEEK_HOLE etc) and hole
punching make the most difference since you can skip over most of the
image. Also preallocation (writing zeroes), to a lesser extent.

Rich.


--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

ronnie sahlberg

Apr 10, 2025, 6:45:55 AM
to lib...@googlegroups.com
On Thu, 10 Apr 2025 at 19:18, 'Richard W.M. Jones' via libnfs
<lib...@googlegroups.com> wrote:
>
> On Thu, Apr 10, 2025 at 06:42:54PM +1000, ronnie sahlberg wrote:
> > For performance,
> >
> > It is possible to get REALLY good performance with libnfs.
> > Unfortunately most people are very tightlipped and don't want to
> > advertise what they get in their datacentres.
> >
> > This guy did publish actual numbers though :
> > https://taras.glek.net/posts/nfs-for-fio/
>
> Interesting, thanks.
>
> Our use-case is unusual: disk images are single files, very large
> (tens of gigabytes), but very sparse.
>
> To get efficient operations, extent detection (SEEK_HOLE etc) and hole
> punching make the most difference since you can skip over most of the
> image. Also preallocation (writing zeroes), to a lesser extent.


OK, so I assume you want to copy a large, possibly very sparse image
from one NFS share to another as fast as possible?
Let me implement SEEK_HOLE/SEEK_DATA over the weekend.
For this use case, I would use the async API:

1. Mount the source filesystem and the destination filesystem.
2. Use iosize = MIN(nfs_get_readmax(source), nfs_get_writemax(destination));
   this will be the maximum I/O size to read from / write to the servers.
3. Open the source file and check its size; open the destination file and
   truncate it to the proper size.
4. Loop over the source file using SEEK_HOLE/SEEK_DATA to find where
   there is data to copy.
5. For each block of data, use nfs_pread_async() to read iosize from
   step 2; in the read callback, just immediately do an nfs_pwrite_async()
   to write to the destination.

Keep an atomic counter of how many reads and writes you have in flight
and just wait until all writes have completed (sketched below).
Close the files, finished.
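A rough sketch of step 5 plus the in-flight accounting.  copy_state,
chunk and start_read are made-up names, and the nfs_pread_async() /
nfs_pwrite_async() argument order shown is the newer caller-supplied-buffer
form, so check the libnfs.h you build against:

#include <stdint.h>
#include <stdlib.h>
#include <stdatomic.h>
#include <nfsc/libnfs.h>

struct copy_state {
  struct nfs_context *src, *dst;
  struct nfsfh *src_fh, *dst_fh;
  atomic_int in_flight;          /* reads + writes not yet completed */
};

struct chunk {
  struct copy_state *cs;
  uint64_t offset;
  size_t count;
  char *buf;
};

static void
write_cb (int err, struct nfs_context *nfs, void *data, void *private_data)
{
  struct chunk *c = private_data;

  if (err < 0) abort ();         /* real code: record the error */
  free (c->buf);
  atomic_fetch_sub (&c->cs->in_flight, 1);
  free (c);
}

static void
read_cb (int err, struct nfs_context *nfs, void *data, void *private_data)
{
  struct chunk *c = private_data;

  if (err < 0) abort ();         /* real code: also handle short reads */
  /* Chain the data we just read straight into a write on the
   * destination context. */
  if (nfs_pwrite_async (c->cs->dst, c->cs->dst_fh,
                        c->buf, c->count, c->offset, write_cb, c) < 0)
    abort ();
}

/* Queue one read of [offset, offset+count); the in-flight counter is
 * only decremented when the matching write has completed. */
static void
start_read (struct copy_state *cs, uint64_t offset, size_t count)
{
  struct chunk *c = malloc (sizeof *c);

  c->cs = cs;
  c->offset = offset;
  c->count = count;
  c->buf = malloc (count);
  atomic_fetch_add (&cs->in_flight, 1);
  if (nfs_pread_async (cs->src, cs->src_fh,
                       c->buf, c->count, c->offset, read_cb, c) < 0)
    abort ();
}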


Common max read/write size that servers advertize is 256k (up to 1M or so).
For a fully populated dense file of 10GB this would translate to
possibly worst case of up to 40.000 READs in flight to the server at a
time.
In the worst case, keeping track of the XID values to match operations
in flight to response PDUs we get back from the server there is a hash
table of linked lists.
I prefer to keep the list length to ~50 entries or less, or else any
time you receive a reply PDU from the sever, walking the list to find
the matching XID for a command in flight start taking up too much
time.
This can be controlled by setting the number of hash buckets for XIDs.

See rpc_set_hash_size() and the warnings in the header file (NEVER use
this when there are commands in flight or libnfs will fail to match
the reply to an outstanding request)
If you expect to need to send 40.000 concurrent READ requests, then
rpc_set_hash_size(1000) or so might be suitable.

The drawback of setting hash size too high is it will use more memory
and it will also make things like timeout scanning take longer as
there are more entries to scan.

That should make things very fast.  Keep an eye on the servers though:
NFS does not have flow control and some servers may crash if you send
too much concurrent I/O.  One I know of crashes when you push concurrency
above a few hundred thousand concurrent RPC requests.  So you might want
to implement some kind of flow control in your app if you push this hard,
if it turns out to be an issue and the server starts crashing.

Otherwise it is just much easier not to use flow control: you queue all
the async commands you need in one go and just wait for them all to
complete.  Even with a single thread you should be able to do several
GB/second of copying this way; you are basically limited by memory speed,
reading data from a socket into a buffer and then writing the buffer into
another socket.

Richard W.M. Jones

Apr 10, 2025, 12:17:48 PM
to ronnie sahlberg, lib...@googlegroups.com
> Ok, so I assume you want to copy a large, possibly very sparse image
> from one nfs share to another as fast as possible?

The ultimate aim here is to have an NBD server that can forward
requests over NFS. A couple of possible scenarios are:

qemu <-> nbdkit <-> NFS server

nbdcopy -> nbdkit -> NFS server
nbdcopy <- nbdkit <- NFS server

In the first case, the hypervisor (qemu) runs a virtual machine which
is backed by one or more raw-format, sparse disks which are located on
an NFS server. This is a surprisingly common real-world scenario for
on-prem clouds, but it's usually implemented using the Linux kernel
NFS client, with qemu using regular POSIX file APIs.

Using libnfs would be a bit more flexible and secure. In particular
not requiring kernel mounts would let us run everything as non-root
and sandboxed.

In the second case we would be using nbdcopy to bulk copy to/from
somewhere else to NFS (in one direction only). The use of nbdcopy
here is for convenience since the other end is also some exotic NBD
endpoint, probably a VMware server.

As an aside, I'm interested in which NFS servers you use / test /
recommend. I'm guessing it's all about the Linux kernel NFS server?
Or are there common non-Linux NFS servers in SANs and other dedicated
devices?

For testing only, I tried out unfs3 (because that would allow me to
write automated tests, not because I expect any performance); however,
libnfs fails with the error "Fragment support not yet working".
Looking at the code it seems to be a protocol feature which either
libnfs or unfs3 doesn't implement.

> Let me implement SEEK_HOLE/SEEK_DATA over the weekend.

Thanks, it'll be interesting to see how that works out.  But don't
worry, this is more of a weekend project for me as well (at the
moment).

> For this use case, I would use the async api.

I implemented something using the multithreading API (so async calls
under the hood, as I understand the code), but with a single worker
thread.  It performs at line speed for non-sparse files, but I have
only done limited testing so far:

https://gitlab.com/nbdkit/nbdkit/-/merge_requests/84

Rich.

--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com