Adding an epollfd where applications can register their file descriptors

ronnie sahlberg

Jul 13, 2025, 9:05:57 PM
to ub...@googlegroups.com
Please see the top commit here as an initial prototype:
https://github.com/sahlberg/ublksrv/tree/epoll

iSCSI and NFS differ from the other targets in that they do not need
to access the io_uring directly or submit their own SQEs.
They are less tightly integrated with the internals of ublksrv than
the other targets.

Currently they spawn a separate pthread to manage I/O to a socket and
invoke the nfs/iscsi completion callbacks, then trigger a context
switch back to the queue's event loop to perform the ublksrv i/o
completion.

The main reason for the separate pthread is just to poll the socket for
when it becomes readable or writable and drive the i/o.
I think it would be much more desirable if this could be driven
directly from the queue's event loop, where the application could just
register "here is a file descriptor, let me know when I need to do some
work" and run everything in a single thread, removing the need for
locking between threads and for context switching.

Please see the top commit in the branch above. It adds an epollfd to the queue
and a simple API where the application can register/modify a file
descriptor, a callback and which events should trigger the callback.
I converted ublk.nfs to use this to see what it would look like, and I
think ublk.nfs becomes simpler to read, as there is no longer any
bouncing between threads; it is likely a bit more performant as well.
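
To make the shape of it concrete, here is a rough sketch of the
registration API. ublksrv_epoll_add_fd() is the name used in the branch,
while ublksrv_epoll_mod_fd(), the callback typedef and the exact
parameters are illustrative and may differ from the actual commit:

#include <stdint.h>
#include <sys/epoll.h>

struct ublksrv_queue;	/* opaque queue handle provided by libublksrv */

/*
 * Callback invoked from the queue's event loop whenever one of the
 * requested events (EPOLLIN/EPOLLOUT/...) triggers on the registered
 * fd.  It must be non-blocking.
 */
typedef void (*ublksrv_epoll_cb)(struct ublksrv_queue *q, int fd,
				 uint32_t events, void *private_data);

/* Register a file descriptor, callback and event set with the queue's
 * epollfd.
 */
int ublksrv_epoll_add_fd(struct ublksrv_queue *q, int fd, uint32_t events,
			 ublksrv_epoll_cb cb, void *private_data);

/* Change the set of events that should trigger the callback. */
int ublksrv_epoll_mod_fd(struct ublksrv_queue *q, int fd, uint32_t events);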

ublk.iscsi can easily be converted like this as well, and the
handling of the eventfd and .handle_event could internally be changed
to plug into the epollfd too, eliminating the need for it to drive
its own SQEs to poll the eventfd.
This would imho simplify the eventfd code a little, as it could just
piggy-back on top of the more general-purpose API that the epollfd
provides.
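
As a rough illustration of that piggy-backing (the callback API sketched
above is assumed; the q->efd field and the queue_handle_event() call are
placeholders for whatever the internal .handle_event dispatch actually
looks like):

#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical internal callback: drain the per-queue eventfd and run
 * the target's .handle_event hook, instead of the queue submitting its
 * own POLL SQE against the eventfd.
 */
static void ublksrv_queue_eventfd_cb(struct ublksrv_queue *q, int fd,
				     uint32_t events, void *priv)
{
	uint64_t cnt;

	/* never blocks: we are only called when EPOLLIN has triggered */
	if (read(fd, &cnt, sizeof(cnt)) == sizeof(cnt))
		queue_handle_event(q);	/* placeholder for .handle_event dispatch */
}

/* during queue setup, something like:
 *	ublksrv_epoll_add_fd(q, q->efd, EPOLLIN, ublksrv_queue_eventfd_cb, NULL);
 */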

I think this is an extension that is useful for a lot of applications
that are async and just need to plug a file descriptor into an
event loop, avoiding the need, as in nfs/iscsi, for a separate
thread just to run a private event loop driving socket i/o.


What do you think about this approach?
For now it is a single commit at the top of this branch, but if this is
an acceptable new feature I will rework it into a series of smaller
commits and also convert ublk.iscsi and possibly the internal
implementation of eventfd as well.

regards
ronnie sahlberg

Ming Lei

Jul 13, 2025, 9:53:33 PM
to ronnie sahlberg, ub...@googlegroups.com
On Mon, Jul 14, 2025 at 11:05:44AM +1000, ronnie sahlberg wrote:
> Please see the top commit here as an initial prototype:
> https://github.com/sahlberg/ublksrv/tree/epoll
>
> iSCSI and NFS differ from the other targets in that they do not need
> to access the io_uring directly or submit their own SQEs.
> They are less tightly integrated with the internals of ublksrv than
> the other targets.
>
> Currently they spawn a separate pthread to manage I/O to a socket and
> invoke the nfs/iscsi completion callbacks, then trigger a context
> switch back to the queue's event loop to perform the ublksrv i/o
> completion.

Yes.

That is one ublk server context limit, and I am working on relaxing the
constraint in the following patchset:

https://github.com/ming1/linux/commits/ublk2-cmd-batch.v3/

which allows ublksrv i/o to be completed in any thread context. That
may simplify ublk server implementation, allow a single ublk
queue/device to be handled by more than one thread context, and
meantime allow a single pthread context to handle more than one ublk
queue/device.

I plan to post it after the following preparatory cleanup patchset is accepted:

https://lore.kernel.org/linux-block/20250713143415.2...@redhat.com/

>
> The main reason for the separate pthread is just to poll the socket for
> when it becomes readable or writable and drive the i/o.
> I think it would be much more desirable if this could be driven
> directly from the queue's event loop, where the application could just
> register "here is a file descriptor, let me know when I need to do some
> work" and run everything in a single thread, removing the need for
> locking between threads and for context switching.
>
> Please see the top commit in the branch above. It adds an epollfd to the queue
> and a simple API where the application can register/modify a file
> descriptor, a callback and which events should trigger the callback.
> I converted ublk.nfs to use this to see what it would look like, and I
> think ublk.nfs becomes simpler to read, as there is no longer any
> bouncing between threads; it is likely a bit more performant as well.

OK, I can understand the motivation: it lets the ublksrv queue pthread
context cover nfs/iscsi io submission/completion, so the extra pthread
switch can be avoided, together with the cross-pthread list & lock.

I just have several questions:

1) will nfs_pread_async()/nfs_pwrite_async() need to be changed to a
blocking way? If that can't be avoided, it blocks ublksrv event handling
too; during the blocking time, no new ublk io command can be dispatched.

2) same concern with ublkdrv_process_epollfd(), which waits for 8 nfs
socket events; during that period, no ublk io command can be handled,
so this extra wait has to be removed

3) one extra epoll_ctl() is added in the fast io path; will this slow
down the whole event loop?

4) will nfs/iscsi io submission/completion take much CPU? If yes, the
single pthread may be saturated. If not, this approach looks pretty good.

>
> ublk.iscsi can easily be converted like this as well, and the
> handling of the eventfd and .handle_event could internally be changed
> to plug into the epollfd too, eliminating the need for it to drive
> its own SQEs to poll the eventfd.
> This would imho simplify the eventfd code a little, as it could just
> piggy-back on top of the more general-purpose API that the epollfd
> provides.
>
> I think this is an extension that is useful for a lot of applications
> that are async and just need to plug a file descriptor into an
> event loop, avoiding the need, as in nfs/iscsi, for a separate
> thread just to run a private event loop driving socket i/o.

Indeed, nice!

>
>
> What do you think about this approach?
> For now it is a single commit at the top of this branch, but if this is
> an acceptable new feature I will rework it into a series of smaller
> commits and also convert ublk.iscsi and possibly the internal
> implementation of eventfd as well.

IMO, this approach looks good at a high level; just the wait introduced
by ublkdrv_process_epollfd() needs to be removed. Also we need to
ensure that nfs_pread_async()/nfs_pwrite_async()/iscsi... won't add an
extra wait point.


Thanks,
Ming

ronnie sahlberg

Jul 13, 2025, 10:45:46 PM
to Ming Lei, ub...@googlegroups.com
Thank you.

I answered below in-line.
I will make the remaining changes described below and also rework it
into a series of smaller commits that will be easier to follow,
showing what each step and change does.

Let me take a few days to do these updates and I will send it as an
initial pull request for review.

All the async/task functions in nfs/iscsi are guaranteed to be non-blocking.
The socket processing in {iscsi|nfs}_service() is also guaranteed to
be non-blocking.

I will add comments to ublksrv_epoll_add_fd() making it very clear that
the callbacks that are registered must be non-blocking.


>
> 2) same concern with ublkdrv_process_epollfd() which waits for 8 nfs
> socket events, during the period, no any ublk io command can be handled,
> so this extra wait has to be removed

I call epoll_wait() with a timeout of zero so it should return immediately.

In the epollfd callback I call {nfs|iscsi}_service(), which processes
all incoming/outgoing data for the socket. So if there are, for
example, multiple NFS responses to complete, the socket becomes
readable once, the registered callback is invoked once, nfs_service()
is called once, and from within this function all pending responses
are completed before {nfs|iscsi}_service() returns.
I.e. a single call to _service() will process all pending completions.
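
A minimal sketch of what that registered callback can look like,
assuming libnfs's public nfs_service() call (which takes poll-style
revents) and the callback signature sketched earlier; error handling
is trimmed:

#include <poll.h>
#include <sys/epoll.h>
#include <nfsc/libnfs.h>

/*
 * Registered for the libnfs socket.  One invocation lets nfs_service()
 * drain everything that is pending, so several NFS completions cost a
 * single wakeup of the queue's event loop.
 */
static void nfs_socket_cb(struct ublksrv_queue *q, int fd,
			  uint32_t events, void *priv)
{
	struct nfs_context *nfs = priv;
	int revents = 0;

	if (events & EPOLLIN)
		revents |= POLLIN;
	if (events & EPOLLOUT)
		revents |= POLLOUT;

	/* non-blocking: completes all pending responses before returning */
	if (nfs_service(nfs, revents) < 0) {
		/* socket error: tear-down / reconnect handling goes here */
	}
}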

I set the number of events to 8 in preparation for there being
multiple file descriptors to wait on, thus allowing a single call to
epoll_wait() to retrieve all file descriptors that have triggered.
8 is probably too high; perhaps setting it to 2 is better.

If eventfd is migrated to use this new epollfd API, that would allow
the socket for nfs and the eventfd to trigger simultaneously, and both
events to be retrieved with a single call to epoll_wait().
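
For reference, a sketch of that non-blocking drain; the
ublkdrv_process_epollfd() name comes from the prototype, while the
per-fd bookkeeping struct here is purely illustrative:

#include <stdint.h>
#include <sys/epoll.h>

#define UBLK_EPOLL_MAX_EVENTS	8	/* value under discussion; 2 may be enough */

struct epoll_fd_entry {			/* hypothetical per-fd bookkeeping */
	int fd;
	struct ublksrv_queue *q;
	void (*cb)(struct ublksrv_queue *q, int fd, uint32_t events, void *priv);
	void *private_data;
};

/*
 * Called from the queue's event loop.  A timeout of zero means we never
 * block ublk io command handling; we only pick up events that are
 * already pending.
 */
static void ublkdrv_process_epollfd(int epfd)
{
	struct epoll_event evs[UBLK_EPOLL_MAX_EVENTS];
	int i, n;

	n = epoll_wait(epfd, evs, UBLK_EPOLL_MAX_EVENTS, 0);
	for (i = 0; i < n; i++) {
		struct epoll_fd_entry *e = evs[i].data.ptr;

		e->cb(e->q, e->fd, evs[i].events, e->private_data);
	}
}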

>
> 3) one extra epoll_ctl() is added in fast io path, will this way slow down
> the whole event loop?

In this prototype it always calls epoll_ctl() each time the callback
has been invoked.
I will change this in the finished patch series to only call it when
the set of events we wait for changes.

Most of the time we write directly to the socket (non-blocking) and
wait for POLLIN to process responses, but occasionally I need to
change it to POLLIN|POLLOUT when we have data to send to the socket
but it is not writable.
This should be very rare unless there are stalls in the TCP connection.

I suspect many other applications will only ever need POLLIN, so they
will never need to call epoll_ctl() from the hot path.

I will change it to keep track of what we were waiting on and only
call epoll_ctl() when we need to change the set of events.
I will also add comments explaining why it is important to minimize
the number of calls to epoll_ctl().
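
As a sketch of that bookkeeping, assuming libnfs's
nfs_which_events()/nfs_get_fd() and a cached interest mask stored next
to the nfs context; the surrounding structure is illustrative:

#include <poll.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <nfsc/libnfs.h>

/*
 * Re-arm the epoll interest set only when libnfs actually wants a
 * different event mask; in the common case this is just a comparison
 * and no syscall at all.
 */
static void nfs_update_epoll_events(int epfd, struct nfs_context *nfs,
				    void *data_ptr, uint32_t *cached)
{
	int want = nfs_which_events(nfs);	/* POLLIN, plus POLLOUT if data is queued */
	uint32_t events = 0;

	if (want & POLLIN)
		events |= EPOLLIN;
	if (want & POLLOUT)
		events |= EPOLLOUT;

	if (events == *cached)
		return;				/* hot path: nothing changed */

	struct epoll_event ev = {
		.events = events,
		.data.ptr = data_ptr,
	};
	if (epoll_ctl(epfd, EPOLL_CTL_MOD, nfs_get_fd(nfs), &ev) == 0)
		*cached = events;
}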

>
> 4) will nfs/iscsi io submission/completion take much CPU? If yes, the
> single pthread may be saturated. If not, this approach looks pretty good.

No, it should take minimal CPU. It is mostly bound by memory speed and
the amount of data you can read/write between memory and the socket.

Ming Lei

Jul 13, 2025, 11:53:24 PM
to ronnie sahlberg, ub...@googlegroups.com
Great!

>
>
> >
> > 2) same concern with ublkdrv_process_epollfd(), which waits for 8 nfs
> > socket events; during that period, no ublk io command can be handled,
> > so this extra wait has to be removed
>
> I call epoll_wait() with a timeout of zero so it should return immediately.

Indeed, sorry for missing the `timeout` parameter.

>
> In the epollfd callback I call {nfs|iscsi}_service(), which processes
> all incoming/outgoing data for the socket. So if there are, for
> example, multiple NFS responses to complete, the socket becomes
> readable once, the registered callback is invoked once, nfs_service()
> is called once, and from within this function all pending responses
> are completed before {nfs|iscsi}_service() returns.
> I.e. a single call to _service() will process all pending completions.
>
> I set the number of events to 8 in preparation for there being
> multiple file descriptors to wait on, thus allowing a single call to
> epoll_wait() to retrieve all file descriptors that have triggered.
> 8 is probably too high; perhaps setting it to 2 is better.
>
> If eventfd is migrated to use this new epollfd API, that would allow
> the socket for nfs and the eventfd to trigger simultaneously, and both
> events to be retrieved with a single call to epoll_wait().

IMO 8 or 16 or some other value doesn't matter, as long as epoll_wait()
won't block and correctness can be guaranteed, e.g. no new socket event
will be lost.

>
> >
> > 3) one extra epoll_ctl() is added in the fast io path; will this slow
> > down the whole event loop?
>
> In this prototype it always calls epoll_ctl() each time the callback
> has been invoked.
> I will change this in the finished patch series to only call it when
> the set of events we wait for changes.
>
> Most of the time we write directly to the socket (non-blocking) and
> wait for POLLIN to process responses, but occasionally I need to
> change it to POLLIN|POLLOUT when we have data to send to the socket
> but it is not writable.
> This should be very rare unless there are stalls in the TCP connection.
>
> I suspect many other applications will only ever need POLLIN, so they
> will never need to call epoll_ctl() from the hot path.
>
> I will change it to keep track of what we were waiting on and only
> call epoll_ctl() when we need to change the set of events.
> I will also add comments explaining why it is important to minimize
> the number of calls to epoll_ctl().
>
> >
> > 4) will nfs/iscsi io submission/completion take much CPU? If yes, the
> > single pthread may be saturated. If not, this approach looks pretty good.
>
> No, it should take minimal CPU. It is mostly bound by memory speed and
> the amount of data you can read/write between memory and the socket.

Yes, I guess it might happen in the case of a high-bandwidth network.

BTW, in your new PR, I'd suggest adding an `EPOLL_OP` code using an op
code which isn't used so far; then you needn't take extra space/flags in
the user_data bit space, and is_epollfd_io() can be implemented by
checking if the op in user_data equals 'EPOLL_OP'.
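
Something along these lines, assuming the existing user_data helpers in
ublksrv's headers (build_user_data()/user_data_to_op()); the actual
opcode value is illustrative and would need to be picked so it doesn't
clash with anything in ublk_cmd.h:

#include <stdbool.h>
#include "ublksrv.h"	/* assumed to provide user_data_to_op()/build_user_data() */

/* an op code not used by any real ublk/block op; exact value illustrative */
#define EPOLL_OP	0x7f

/* recognise the epollfd CQE purely by its op field, with no extra flag
 * bits consumed in user_data
 */
static inline bool is_epollfd_io(__u64 user_data)
{
	return user_data_to_op(user_data) == EPOLL_OP;
}

/* when queueing the POLL SQE for the epollfd, e.g.:
 *	sqe->user_data = build_user_data(0, EPOLL_OP, 0, 1);
 */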

IMO your approach moves in io_uring's direction; maybe in the future
you can provide an io_uring-based nfs/iscsi API, then ublk zero copy
can finally be applied, which does save memory & CPU bandwidth.


Thanks,
Ming