[PATCH 1/1] eventfd: implementation of EFD

Martin Sustrik

unread,

Feb 7, 2013, 1:50:02 AM2/7/13

to

When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
polling (select, poll, epoll). For polling, real system-level file descriptor
is needed.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal
POLLOUT and POLLERR, however:

I. There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,
!POLLIN & !POLLOUT, which is a perfectly valid combination for a network
protocol (rx buffer is empty and tx buffer is full), cannot be signaled
using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
short events;
void *ptr;
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
'ptr' is an opaque pointer that is not interpreted by eventfd object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik <sus...@250bpm.com>
---
fs/eventfd.c | 105 ++++++++++++++++++++++++++++++++++++-----------
include/linux/eventfd.h | 3 +-
2 files changed, 83 insertions(+), 25 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..9fec49f 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
* fs/eventfd.c
*
* Copyright (C) 2007 Davide Libenzi <dav...@xmailserver.org>
+ * Copyright (C) 2013 Martin Sustrik <sus...@250bpm.com>
*
*/

@@ -22,18 +23,26 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

+struct eventfd_mask {
+ short events;
+ void *ptr;
+};
+
struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
- /*
- * Every time that a write(2) is performed on an eventfd, the
- * value of the __u64 being written is added to "count" and a
- * wakeup is performed on "wqh". A read(2) will return the "count"
- * value to userspace, and will reset "count" to zero. The kernel
- * side eventfd_signal() also, adds to the "count" counter and
- * issue a wakeup.
- */
- __u64 count;
+ union {
+ /*
+ * Every time that a write(2) is performed on an eventfd, the
+ * value of the __u64 being written is added to "count" and a
+ * wakeup is performed on "wqh". A read(2) will return the
+ * "count" value to userspace, and will reset "count" to zero.
+ * The kernel side eventfd_signal() also, adds to the "count"
+ * counter and issue a wakeup.
+ */
+ __u64 count;
+ struct eventfd_mask mask;
+ };
unsigned int flags;
};

@@ -55,6 +64,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
{
unsigned long flags;

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
if (ULLONG_MAX - ctx->count < n)
n = ULLONG_MAX - ctx->count;
@@ -123,12 +135,16 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)
poll_wait(file, &ctx->wqh, wait);

spin_lock_irqsave(&ctx->wqh.lock, flags);
- if (ctx->count > 0)
- events |= POLLIN;
- if (ctx->count == ULLONG_MAX)
- events |= POLLERR;
- if (ULLONG_MAX - 1 > ctx->count)
- events |= POLLOUT;
+ if (ctx->flags & EFD_MASK) {
+ events = ctx->mask.events;
+ } else {
+ if (ctx->count > 0)
+ events |= POLLIN;
+ if (ctx->count == ULLONG_MAX)
+ events |= POLLERR;
+ if (ULLONG_MAX - 1 > ctx->count)
+ events |= POLLOUT;
+ }
spin_unlock_irqrestore(&ctx->wqh.lock, flags);

return events;
@@ -158,6 +174,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,
{
unsigned long flags;

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
eventfd_ctx_do_read(ctx, cnt);
__remove_wait_queue(&ctx->wqh, wait);
@@ -188,6 +207,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
ssize_t res;
DECLARE_WAITQUEUE(wait, current);

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irq(&ctx->wqh.lock);
*cnt = 0;
res = -EAGAIN;
@@ -230,13 +252,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
ssize_t res;
__u64 cnt;

- if (count < sizeof(cnt))
- return -EINVAL;
- res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
- if (res < 0)
+ if (ctx->flags & EFD_MASK) {
+ spin_lock_irq(&ctx->wqh.lock);
+ if (count < sizeof(ctx->mask))
+ return -EINVAL;
+ res = copy_to_user(buf, &ctx->mask, sizeof(ctx->mask)) ?
+ -EFAULT : sizeof(ctx->mask);
+ spin_unlock_irq(&ctx->wqh.lock);
return res;
-
- return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);
+ } else {
+ if (count < sizeof(cnt))
+ return -EINVAL;
+ res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
+ if (res < 0)
+ return res;
+ return put_user(cnt, (__u64 __user *) buf) ?
+ -EFAULT : sizeof(cnt);
+ }
}

static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
@@ -247,6 +279,21 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
__u64 ucnt;
DECLARE_WAITQUEUE(wait, current);

+ if (ctx->flags & EFD_MASK) {
+ if (count < sizeof(ctx->mask))
+ return -EINVAL;
+ spin_lock_irq(&ctx->wqh.lock);
+ if (copy_from_user(&ctx->mask, buf, sizeof(ctx->mask))) {
+ spin_unlock_irq(&ctx->wqh.lock);
+ return -EFAULT;
+ }
+ if (waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh,
+ (unsigned long)ctx->mask.events);
+ spin_unlock_irq(&ctx->wqh.lock);
+ return sizeof(ctx->mask);
+ }
+
if (count < sizeof(ucnt))
return -EINVAL;
if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
@@ -293,8 +340,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
int ret;

spin_lock_irq(&ctx->wqh.lock);
- ret = seq_printf(m, "eventfd-count: %16llx\n",
- (unsigned long long)ctx->count);
+ if (ctx->flags & EFD_MASK) {
+ ret = seq_printf(m, "eventfd-mask: %x\n",
+ (unsigned)ctx->mask.events);
+ } else {
+ ret = seq_printf(m, "eventfd-count: %16llx\n",
+ (unsigned long long)ctx->count);
+ }
spin_unlock_irq(&ctx->wqh.lock);

return ret;
@@ -412,7 +464,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)

kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
- ctx->count = count;
+ if (flags & EFD_MASK) {
+ ctx->mask.events = 0;
+ ctx->mask.ptr = NULL;
+ } else {
+ ctx->count = count;
+ }
ctx->flags = flags;

file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3c3ef19..b806d2b 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -20,11 +20,12 @@
* shared O_* flags.
*/
#define EFD_SEMAPHORE (1 << 0)
+#define EFD_MASK (1 << 1)
#define EFD_CLOEXEC O_CLOEXEC
#define EFD_NONBLOCK O_NONBLOCK

#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)

#ifdef CONFIG_EVENTFD

--
1.7.4.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andy Lutomirski

unread,

Feb 7, 2013, 2:20:03 PM2/7/13

to

IMO that should be u64 ptr to avoid compat problems.

>
> The value of 'events' should be any combination of event flags as defined by
> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
> be signaled when polling (select, poll, epoll) on the eventfd is done later on.
> 'ptr' is an opaque pointer that is not interpreted by eventfd object.

How does this interact with EPOLLET?

--Andy

Martin Sustrik

unread,

Feb 7, 2013, 3:20:02 PM2/7/13

to

On 07/02/13 20:12, Andy Lutomirski wrote:
> On 02/06/2013 10:41 PM, Martin Sustrik wrote:
>> When implementing network protocols in user space, one has to implement
>> fake user-space file descriptors to represent the sockets for the protocol.
>>
>> While all the BSD socket API functionality for such descriptors may be faked as
>> well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
>> polling (select, poll, epoll). For polling, real system-level file descriptor
>> is needed.
>>
>> In theory, eventfd may be used for this purpose, except that it is well suited
>> only for signaling POLLIN. With some hacking it can be also used to signal
>> POLLOUT and POLLERR, however:
>>
>> I. There's no way to signal POLLPRI, POLLHUP etc.
>> II. There's no way to signal arbitraty combination of POLL* flags. Most notably,

>> !POLLIN& !POLLOUT, which is a perfectly valid combination for a network

>> protocol (rx buffer is empty and tx buffer is full), cannot be signaled
>> using current implementation of eventfd.
>>
>> This patch implements new EFD_MASK flag which attempts to solve this problem.
>>
>> Additionally, when implementing network protocols in user space, there's a
>> need to associate user-space state with the each "socket". If eventfd object is
>> used as a reference to the socket, it should be possible to associate an opaque
>> pointer to user-space data with it.
>>
>> The semantics of EFD_MASK are as follows:
>>
>> eventfd(2):
>>
>> If eventfd is created with EFD_MASK flag set, it is initialised in such a way
>> as to signal no events on the file descriptor when it is polled on. 'initval'
>> argument is ignored.
>>
>> write(2):
>>
>> User is allowed to write only buffers containing the following structure:
>>
>> struct efd_mask {
>> short events;
>> void *ptr;
>> };
>
> IMO that should be u64 ptr to avoid compat problems.

I was following the user space declaration of epoll_data:

typedef union epoll_data {
void *ptr; <-----
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;

However, now I'm looking at the kernel side definition of the whole
union which looks like this (obviously it assumes that pointer is never
longer than 64 bits):

__u64 data;

Hm, not very helpful. Anyway, I am not a kernel developer, so any
concrete suggestion about what type to use to map cleanly to user-space
void* is welcome.

>> The value of 'events' should be any combination of event flags as defined by
>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
>> be signaled when polling (select, poll, epoll) on the eventfd is done later on.
>> 'ptr' is an opaque pointer that is not interpreted by eventfd object.
>
> How does this interact with EPOLLET?

That's an interesting question. The original eventfd code doesn't do
anything specific to either edge or level mode. Neither does my patch.

Inspection of the code seems to suggest that edge vs. level distinction
is handled elsewhere (ep_send_events_proc) where there is a separate
list of ready events and the function, after returning the event,
decides whether to leave the event in the list (level) or delete it from
the list (edge).

In any case, review from someone with experience with epoll
implementation would help.

Martin

Andrew Morton

unread,

Feb 7, 2013, 5:50:02 PM2/7/13

to

On Thu, 7 Feb 2013 07:41:32 +0100
Martin Sustrik <sus...@250bpm.com> wrote:

> When implementing network protocols in user space, one has to implement
> fake user-space file descriptors to represent the sockets for the protocol.
>
> While all the BSD socket API functionality for such descriptors may be faked as
> well (myproto_send(), myproto_recv() etc.) this approach doesn't work for
> polling (select, poll, epoll). For polling, real system-level file descriptor
> is needed.

That's a nice changelog but it omitted a critical thing: why do you
think the kernel needs this feature? What's the value and use case for
being able to poll these descriptors?

So please update the changelog and then cc net...@vger.kernel.org on
the patch - the netdev people are probably best-situated to comment on
the proposal.

Martin Sustrik

unread,

Feb 7, 2013, 6:30:02 PM2/7/13

to

When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for

polling (select, poll, epoll). And unfortunately, sockets that can't be polled
on allow only for building the simplest possible applications. Basically, you
can build a simple client, but once you want to implement a server handling
many sockets in parallel, you are stuck.

However, to do polling, real system-level file descriptor is needed,
not a fake one.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal

POLLOUT and POLLERR, but:

I. There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,

!POLLIN & !POLLOUT, which is a perfectly valid combination for a network

protocol (rx buffer is empty and tx buffer is full), cannot be signaled
using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
short events;
void *ptr;
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.
'ptr' is an opaque pointer that is not interpreted by eventfd object.

Martin Sustrik

unread,

Feb 7, 2013, 6:40:02 PM2/7/13

to

On 07/02/13 23:44, Andrew Morton wrote:

> So please update the changelog and then cc net...@vger.kernel.org on
> the patch - the netdev people are probably best-situated to comment on
> the proposal.

OK. Done. Thanks for the advice!

Martin

Andy Lutomirski

unread,

Feb 7, 2013, 8:10:02 PM2/7/13

to

Reusing epoll_data_t seems reasonable. The main consideration is that
the size of the object should not vary between 32-bit and 64-bit
userspace.

>
>
>>> The value of 'events' should be any combination of event flags as defined
>>> by
>>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
>>> events will
>>> be signaled when polling (select, poll, epoll) on the eventfd is done
>>> later on.
>>> 'ptr' is an opaque pointer that is not interpreted by eventfd object.
>>
>>
>> How does this interact with EPOLLET?
>
>
> That's an interesting question. The original eventfd code doesn't do
> anything specific to either edge or level mode. Neither does my patch.
>
> Inspection of the code seems to suggest that edge vs. level distinction is
> handled elsewhere (ep_send_events_proc) where there is a separate list of
> ready events and the function, after returning the event, decides whether to
> leave the event in the list (level) or delete it from the list (edge).

Hmm. Having looked at the eventpoll.c source again, I remain
unconvinced that EPOLLET works the way that any userspace developer
would expect it to. But your code probably has very little to do with
this, so maybe you shouldn't worry about it. There may be some
advantage to adding (later on, if needed) an option to change the
flags set in:

+ if (waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh,
+ (unsigned long)ctx->mask.events);

(i.e. to allow the second parameter to omit some bits that were
already signaled.) Allowing write to write a bigger struct in the
future won't break anything.

It may be a good idea to return EINVAL if anyone tries to write() an
unknown poll bit.

--Andy

Martin Sustrik

unread,

Feb 8, 2013, 12:30:01 AM2/8/13

to

Hi Andy,

On 08/02/13 02:03, Andy Lutomirski wrote:
> There may be some
> advantage to adding (later on, if needed) an option to change the
> flags set in:
>
> + if (waitqueue_active(&ctx->wqh))
> + wake_up_locked_poll(&ctx->wqh,
> + (unsigned long)ctx->mask.events);
>
> (i.e. to allow the second parameter to omit some bits that were
> already signaled.) Allowing write to write a bigger struct in the
> future won't break anything.

I think I don't follow. Either the second parameter is supposed to be
*newly* signaled events, in which case the events that were already
signaled in the past should be ommitted, or it is meant to be *all*
signaled events, in which case the current implementation is OK.

Andy Lutomirski

unread,

Feb 8, 2013, 1:40:02 AM2/8/13

to

On Thu, Feb 7, 2013 at 9:26 PM, Martin Sustrik <sus...@250bpm.com> wrote:
> Hi Andy,
>
>
> On 08/02/13 02:03, Andy Lutomirski wrote:
>>
>> There may be some
>> advantage to adding (later on, if needed) an option to change the
>> flags set in:
>>
>> + if (waitqueue_active(&ctx->wqh))
>> + wake_up_locked_poll(&ctx->wqh,
>> + (unsigned long)ctx->mask.events);
>>
>> (i.e. to allow the second parameter to omit some bits that were
>> already signaled.) Allowing write to write a bigger struct in the
>> future won't break anything.
>
>
> I think I don't follow. Either the second parameter is supposed to be
> *newly* signaled events, in which case the events that were already signaled
> in the past should be ommitted, or it is meant to be *all* signaled events,
> in which case the current implementation is OK.

I defer to the experts here. But I suspect that if you want to
perfectly emulate sockets, you may need to vary what you specify.
(IIRC tcp sockets report an EPOLLIN edge every time data is received
even if the receive buffer wasn't empty.)

--Andy

Martin Sustrik

unread,

Feb 8, 2013, 2:00:02 AM2/8/13

to

On 08/02/13 07:36, Andy Lutomirski wrote:

>> On 08/02/13 02:03, Andy Lutomirski wrote:
>>>
>>> There may be some
>>> advantage to adding (later on, if needed) an option to change the
>>> flags set in:
>>>
>>> + if (waitqueue_active(&ctx->wqh))
>>> + wake_up_locked_poll(&ctx->wqh,
>>> + (unsigned long)ctx->mask.events);
>>>
>>> (i.e. to allow the second parameter to omit some bits that were
>>> already signaled.) Allowing write to write a bigger struct in the
>>> future won't break anything.
>>
>>
>> I think I don't follow. Either the second parameter is supposed to be
>> *newly* signaled events, in which case the events that were already signaled
>> in the past should be ommitted, or it is meant to be *all* signaled events,
>> in which case the current implementation is OK.
>
> I defer to the experts here. But I suspect that if you want to
> perfectly emulate sockets, you may need to vary what you specify.
> (IIRC tcp sockets report an EPOLLIN edge every time data is received
> even if the receive buffer wasn't empty.)

Hm. That sounds like leaking protocol implementation details to the
user. That's a bad design IMO and should not be encouraged.

Anyway, I have implemented your other suggestions.

Btw, one thing I am not sure about is how to submit improved patches to
the ML. Should I use the same patch name? Doesn't that cause confusion?

Martin

Martin Sustrik

unread,

Feb 8, 2013, 3:20:01 AM2/8/13

to

When implementing network protocols in user space, one has to implement
fake user-space file descriptors to represent the sockets for the protocol.

While all the BSD socket API functionality for such descriptors may be faked as
well (myproto_send(), myproto_recv() etc.) this approach doesn't work for

polling (select, poll, epoll). And unfortunately, sockets that can't be polled
on allow only for building the simplest possible applications. Basically, you
can build a simple client, but once you want to implement a server handling
many sockets in parallel, you are stuck.

However, to do polling, real system-level file descriptor is needed,
not a fake one.

In theory, eventfd may be used for this purpose, except that it is well suited
only for signaling POLLIN. With some hacking it can be also used to signal

POLLOUT and POLLERR, but:

I. There's no way to signal POLLPRI, POLLHUP etc.
II. There's no way to signal arbitraty combination of POLL* flags. Most notably,

!POLLIN & !POLLOUT, which is a perfectly valid combination for a network

protocol (rx buffer is empty and tx buffer is full), cannot be signaled
using current implementation of eventfd.

This patch implements new EFD_MASK flag which attempts to solve this problem.

Additionally, when implementing network protocols in user space, there's a
need to associate user-space state with the each "socket". If eventfd object is
used as a reference to the socket, it should be possible to associate an opaque
pointer to user-space data with it.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
short events;

union {
void *ptr;
uint32_t u32;
uint64_t u64;
};
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.

ptr, u32 and u63 are opaque data that are not interpreted by eventfd object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik <sus...@250bpm.com>
---

v2 - Concerns raised by Andy Lutomirski addressed:
- uses fixed size (64 bits) opaque data in eventfd_mask instead of void*
- write returns EINVAL if invalid POLL flags are specified
---
fs/eventfd.c | 116 +++++++++++++++++++++++++++++++++++++----------
include/linux/eventfd.h | 3 +-
2 files changed, 94 insertions(+), 25 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..bf83f6d87 100644

--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
* fs/eventfd.c
*
* Copyright (C) 2007 Davide Libenzi <dav...@xmailserver.org>
+ * Copyright (C) 2013 Martin Sustrik <sus...@250bpm.com>
*
*/

@@ -22,18 +23,35 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+
+/* On x86-64 keep the same binary layout as on i386. */
+#ifdef __x86_64__
+#define EVENTFD_MASK_PACKED __packed
+#else
+#define EVENTFD_MASK_PACKED
+#endif
+
+struct eventfd_mask {
+ __u32 events;
+ __u64 data;
+} EVENTFD_MASK_PACKED;

@@ -55,6 +73,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)

{
unsigned long flags;

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
if (ULLONG_MAX - ctx->count < n)
n = ULLONG_MAX - ctx->count;

@@ -123,12 +144,16 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)

poll_wait(file, &ctx->wqh, wait);

spin_lock_irqsave(&ctx->wqh.lock, flags);
- if (ctx->count > 0)
- events |= POLLIN;
- if (ctx->count == ULLONG_MAX)
- events |= POLLERR;
- if (ULLONG_MAX - 1 > ctx->count)
- events |= POLLOUT;
+ if (ctx->flags & EFD_MASK) {
+ events = ctx->mask.events;
+ } else {
+ if (ctx->count > 0)
+ events |= POLLIN;
+ if (ctx->count == ULLONG_MAX)
+ events |= POLLERR;
+ if (ULLONG_MAX - 1 > ctx->count)
+ events |= POLLOUT;
+ }
spin_unlock_irqrestore(&ctx->wqh.lock, flags);

return events;

@@ -158,6 +183,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,

{
unsigned long flags;

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
eventfd_ctx_do_read(ctx, cnt);
__remove_wait_queue(&ctx->wqh, wait);

@@ -188,6 +216,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)

ssize_t res;
DECLARE_WAITQUEUE(wait, current);

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irq(&ctx->wqh.lock);
*cnt = 0;
res = -EAGAIN;

@@ -230,13 +261,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,

@@ -246,6 +287,23 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
ssize_t res;
__u64 ucnt;
DECLARE_WAITQUEUE(wait, current);
+ struct eventfd_mask mask;
+

+ if (ctx->flags & EFD_MASK) {

+ if (count < sizeof(mask))
+ return -EINVAL;
+ if (copy_from_user(&mask, buf, sizeof(mask)))
+ return -EFAULT;
+ if (mask.events & ~EFD_MASK_VALID_EVENTS)

+ return -EINVAL;
+ spin_lock_irq(&ctx->wqh.lock);

+ memcpy(&ctx->mask, &mask, sizeof(ctx->mask));

+ if (waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh,
+ (unsigned long)ctx->mask.events);

+ spin_unlock_irq(&ctx->wqh.lock);
+ return sizeof(ctx->mask);
+ }

if (count < sizeof(ucnt))
return -EINVAL;

@@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)

int ret;

spin_lock_irq(&ctx->wqh.lock);
- ret = seq_printf(m, "eventfd-count: %16llx\n",
- (unsigned long long)ctx->count);
+ if (ctx->flags & EFD_MASK) {
+ ret = seq_printf(m, "eventfd-mask: %x\n",
+ (unsigned)ctx->mask.events);
+ } else {
+ ret = seq_printf(m, "eventfd-count: %16llx\n",
+ (unsigned long long)ctx->count);
+ }
spin_unlock_irq(&ctx->wqh.lock);

return ret;

@@ -412,7 +475,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)

kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
- ctx->count = count;
+ if (flags & EFD_MASK) {
+ ctx->mask.events = 0;

+ ctx->mask.data = 0;

+ } else {
+ ctx->count = count;
+ }
ctx->flags = flags;

file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3c3ef19..b806d2b 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -20,11 +20,12 @@
* shared O_* flags.
*/
#define EFD_SEMAPHORE (1 << 0)
+#define EFD_MASK (1 << 1)
#define EFD_CLOEXEC O_CLOEXEC
#define EFD_NONBLOCK O_NONBLOCK

#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)

#ifdef CONFIG_EVENTFD

--
1.7.4.1

Martin Sustrik

unread,

Feb 8, 2013, 7:50:03 AM2/8/13

to

On 07/02/13 23:44, Andrew Morton wrote:

> That's a nice changelog but it omitted a critical thing: why do you
> think the kernel needs this feature? What's the value and use case for
> being able to poll these descriptors?

To address the question, I've written down detailed description of the
challenges of the network protocol development in user space and how the
proposed feature addresses the problems.

It's too long to fit into ChangeLog, but it may be worth reading when
trying to judge the merit of the patch.

It can be found here: http://www.250bpm.com/blog:16

Martin

Eric Wong

unread,

Feb 8, 2013, 5:10:01 PM2/8/13

to

Andy Lutomirski <lu...@amacapital.net> wrote:
> On Thu, Feb 7, 2013 at 12:11 PM, Martin Sustrik <sus...@250bpm.com> wrote:
> > On 07/02/13 20:12, Andy Lutomirski wrote:
> >> On 02/06/2013 10:41 PM, Martin Sustrik wrote:
> >>> The value of 'events' should be any combination of event flags as defined
> >>> by
> >>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
> >>> events will
> >>> be signaled when polling (select, poll, epoll) on the eventfd is done
> >>> later on.
> >>> 'ptr' is an opaque pointer that is not interpreted by eventfd object.
> >>
> >> How does this interact with EPOLLET?
> >
> > That's an interesting question. The original eventfd code doesn't do
> > anything specific to either edge or level mode. Neither does my patch.
> >
> > Inspection of the code seems to suggest that edge vs. level distinction is
> > handled elsewhere (ep_send_events_proc) where there is a separate list of
> > ready events and the function, after returning the event, decides whether to
> > leave the event in the list (level) or delete it from the list (edge).

Right, the edge vs. level distinction is internal to epoll.

> Hmm. Having looked at the eventpoll.c source again, I remain
> unconvinced that EPOLLET works the way that any userspace developer
> would expect it to.

As as userspace developer, EPOLLET seems to work as expected/documented;
but I realized EPOLLONESHOT is what I want to be using instead.

> > In any case, review from someone with experience with epoll implementation
> > would help.

I'm no expert, but I don't think eventfd (or any file type) needs to
care about what I/O notification scheme/options it's used with.

Eric Wong

unread,

Feb 8, 2013, 5:30:01 PM2/8/13

to

Martin Sustrik <sus...@250bpm.com> wrote:
> On 07/02/13 23:44, Andrew Morton wrote:
> >That's a nice changelog but it omitted a critical thing: why do you
> >think the kernel needs this feature? What's the value and use case for
> >being able to poll these descriptors?
>
> To address the question, I've written down detailed description of
> the challenges of the network protocol development in user space and
> how the proposed feature addresses the problems.
>
> It's too long to fit into ChangeLog, but it may be worth reading
> when trying to judge the merit of the patch.
>
> It can be found here: http://www.250bpm.com/blog:16

Using one eventfd per userspace socket still seems a bit wasteful.

Couldn't you use a single pipe for all sockets and write the efd_mask to
the pipe for each socket?

A read from the pipe would behave like epoll_wait.

You might need to use one-shot semantics; but that's probably
the easiest thing in multithreaded apps anyways.

Martin Sustrik

unread,

Feb 8, 2013, 9:50:01 PM2/8/13

to

Hi Eric,

On 08/02/13 23:21, Eric Wong wrote:
> Martin Sustrik<sus...@250bpm.com> wrote:
>> On 07/02/13 23:44, Andrew Morton wrote:
>>> That's a nice changelog but it omitted a critical thing: why do you
>>> think the kernel needs this feature? What's the value and use case for
>>> being able to poll these descriptors?
>>
>> To address the question, I've written down detailed description of
>> the challenges of the network protocol development in user space and
>> how the proposed feature addresses the problems.
>>
>> It's too long to fit into ChangeLog, but it may be worth reading
>> when trying to judge the merit of the patch.
>>
>> It can be found here: http://www.250bpm.com/blog:16
>
> Using one eventfd per userspace socket still seems a bit wasteful.

Wasteful in what sense? Occupying a slot in file descriptor table?
That's the price for having the socket uniquely identified by the fd.

> Couldn't you use a single pipe for all sockets and write the efd_mask to
> the pipe for each socket?
>
> A read from the pipe would behave like epoll_wait.
>
> You might need to use one-shot semantics; but that's probably
> the easiest thing in multithreaded apps anyways.

Having multiple sockets represented by a single eventfd. how would you
distinguish where did individual events came from?

struct pollfd pfd;
...
poll (pfd, 1, -1);
if (pfd.revents & POLLIN) /* Incoming data on which socket? */
...

Martin

Martin Sustrik

unread,

Feb 8, 2013, 10:30:01 PM2/8/13

to

On 08/02/13 23:08, Eric Wong wrote:

>>>>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified
>>>>> events will
>>>>> be signaled when polling (select, poll, epoll) on the eventfd is done
>>>>> later on.
>>>>> 'ptr' is an opaque pointer that is not interpreted by eventfd object.
>>>>
>>>> How does this interact with EPOLLET?
>>>
>>> That's an interesting question. The original eventfd code doesn't do
>>> anything specific to either edge or level mode. Neither does my patch.
>>>
>>> Inspection of the code seems to suggest that edge vs. level distinction is
>>> handled elsewhere (ep_send_events_proc) where there is a separate list of
>>> ready events and the function, after returning the event, decides whether to
>>> leave the event in the list (level) or delete it from the list (edge).
>
> Right, the edge vs. level distinction is internal to epoll.

I wrote a test program for EFD_MASK+EPOLLET and it seems to behave in
intuitive kind of way:

int main ()
{
int fd;
struct efd_mask mask;
ssize_t nbytes;
int rc;
int ep;
struct epoll_event epe;

fd = eventfd (0, EFD_MASK);

ep = epoll_create (10);
assert (ep != -1);
epe.events = EPOLLIN | EPOLLET;
rc = epoll_ctl (ep, EPOLL_CTL_ADD, fd, &epe);
assert (rc != -1);

mask.events = 0;
nbytes = write (fd, &mask, sizeof (mask));
assert (nbytes == sizeof (mask));
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 0);

mask.events = POLLIN;
nbytes = write (fd, &mask, sizeof (mask));
assert (nbytes == sizeof (mask));
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 1 && epe.events == EPOLLIN);
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 0);

mask.events = POLLIN;
nbytes = write (fd, &mask, sizeof (mask));
mask.events = 0;
nbytes = write (fd, &mask, sizeof (mask));
rc = epoll_wait (ep, &epe, 1, 100);
assert (rc == 0);

rc = close (ep);
assert (rc == 0);
rc = close (fd);
assert (rc == 0);

return 0;
}

Martin

Eric Wong

unread,

Feb 8, 2013, 11:00:03 PM2/8/13

to

Martin Sustrik <sus...@250bpm.com> wrote:
> On 08/02/13 23:21, Eric Wong wrote:
> >Martin Sustrik<sus...@250bpm.com> wrote:
> >>To address the question, I've written down detailed description of
> >>the challenges of the network protocol development in user space and
> >>how the proposed feature addresses the problems.
> >>

> >>It can be found here: http://www.250bpm.com/blog:16
> >
> >Using one eventfd per userspace socket still seems a bit wasteful.
>
> Wasteful in what sense? Occupying a slot in file descriptor table?
> That's the price for having the socket uniquely identified by the
> fd.

Yes. I realize eventfd is small, but I don't think eventfd is needed
at all, here. Just one pipe.

> >Couldn't you use a single pipe for all sockets and write the efd_mask to
> >the pipe for each socket?
> >
> >A read from the pipe would behave like epoll_wait.
> >
> >You might need to use one-shot semantics; but that's probably
> >the easiest thing in multithreaded apps anyways.
>
> Having multiple sockets represented by a single eventfd. how would
> you distinguish where did individual events came from?
>
> struct pollfd pfd;
> ...
> poll (pfd, 1, -1);
> if (pfd.revents & POLLIN) /* Incoming data on which socket? */
> ...

No eventfd, you write just write struct to the pipe, and consume the
struct to a fixed size buffer:

/* trigger readiness notification for sock,
* this probably needs a lock around it
*/
void sock_trigger(struct my_sock *sock, int events)
{
struct efd_mask mask;

/* check if the triggeered event is something sock wants: */
events &= sock->watched_events;

if (!events)
return;

mask.events = events;
mask.ptr = sock;

/*
* preventing sock from being in the pipe multiple times
* is probably required (or just a good idea). Which is
* why I mentioned oneshot semantics are probably required.
*/
if (oneshot)
sock->watched_events = 0;

/*
* This is analogous to:
* list_add_tail(&epi->rdllink, &ep->rdllist);
* in fs/eventpoll.c
*
* This may block, but that's why consumer_loop runs in different
* threads. Or run some iteration of consumer_loop here if
* it blocks (beware of stack depth from recursion, though)
*/
write(pipe_wr, &mask, sizeof(mask));
}

/* in another thread (or several threads) */
void consumer_loop(int pipe_rd)
{
struct efd_mask mask;
struct my_sock *sock;

for (;;) {
/*
* analogous to:
* epoll_wait(.., maxevents=1, ...);
*
* You can read several masks at once if have one thread,
* but I usually use maxevents=1 (+several threads) to
* distribute traffic between threads
*/
read(pipe_rd, &mask, sizeof(mask));
sock = mask.ptr;
if (mask.events & POLLIN)
sock_read(sock);
else if (mask.events & POLLOUT)
sock_write(sock);
...

/* analogous to epoll_ctl() */
if (sock->write_buffered)
sock->watched_events |= POLLOUT;
if (sock->wants_more_data)
sock->watched_events |= POLLIN;

/* onto the next ready event */

Martin Sustrik

unread,

Feb 9, 2013, 2:40:03 AM2/9/13

to

On 09/02/13 04:54, Eric Wong wrote:

>>> Using one eventfd per userspace socket still seems a bit wasteful.
>>
>> Wasteful in what sense? Occupying a slot in file descriptor table?
>> That's the price for having the socket uniquely identified by the
>> fd.
>
> Yes. I realize eventfd is small, but I don't think eventfd is needed
> at all, here. Just one pipe.

Ah. Got you! You mean not to change the kernel, just use pipe for the
purpose.

However, the convoluted pipe-style design is the problem I am trying to
solve rather than the solution. It leads to convoluted APIs with
convoluted semantics as described in the article. I've been using that
kind of design for past 8 years and every time I have to deal with it I
swear that one day I will implement a proper in-kernel solution to get
rid of the hack.

And now I have finally done so.

Martin

Eric Wong

unread,

Feb 9, 2013, 7:00:02 AM2/9/13

to

Martin Sustrik <sus...@250bpm.com> wrote:
> On 09/02/13 04:54, Eric Wong wrote:
> >>>Using one eventfd per userspace socket still seems a bit wasteful.
> >>
> >>Wasteful in what sense? Occupying a slot in file descriptor table?
> >>That's the price for having the socket uniquely identified by the
> >>fd.
> >
> >Yes. I realize eventfd is small, but I don't think eventfd is needed
> >at all, here. Just one pipe.
>
> Ah. Got you! You mean not to change the kernel, just use pipe for
> the purpose.
>
> However, the convoluted pipe-style design is the problem I am trying
> to solve rather than the solution. It leads to convoluted APIs with
> convoluted semantics as described in the article. I've been using
> that kind of design for past 8 years and every time I have to deal
> with it I swear that one day I will implement a proper in-kernel
> solution to get rid of the hack.
>
> And now I have finally done so.

Yes, your eventfd change is probably the best way if you want/need
to only watch a subset of your sockets, especially if you want
poll/select to be an option.

Martin Sustrik

unread,

Feb 9, 2013, 7:10:02 AM2/9/13

to

On 2013-02-09 12:51, Eric Wong wrote:

> Yes, your eventfd change is probably the best way if you want/need
> to only watch a subset of your sockets, especially if you want
> poll/select to be an option.

Yes, the poll/select thing is the important point.

I wouldn't care if the only problem was that I, as the protocol
implementer, would have to implement some kind of workaround in my
protocol library. The problem is that these convoluted semantics leak --
through the use of poll, select et al. -- to the end user.

From my personal experience I can say that end users have pretty hard
time using such complex workarounds instead of simply using a native
file descriptor with standardised semantics.

Martin

Andrew Morton

unread,

Feb 14, 2013, 6:00:02 PM2/14/13

to

> ...

This patch adds userspace interfaces which will require manpage
updates. Please Cc Michael and work with him on getting those changes
completed.

>
> +/* On x86-64 keep the same binary layout as on i386. */
> +#ifdef __x86_64__
> +#define EVENTFD_MASK_PACKED __packed
> +#else
> +#define EVENTFD_MASK_PACKED
> +#endif
> +
> +struct eventfd_mask {
> + __u32 events;
> + __u64 data;
> +} EVENTFD_MASK_PACKED;

The x86-64 specific thing is ugly. I can find no explanation of why it
was done, but it should go away. You could make `events' a u64, or
swap the order of the two fields and make the struct __packed on all
architectures.

Given that the size of the types is fixed, I see no compat issues here.

As this struct is known by userspace, this definition should appear in
a header which is available to usersapce: include/uapi/...

> struct eventfd_ctx {
> struct kref kref;
> wait_queue_head_t wqh;
> - /*
> - * Every time that a write(2) is performed on an eventfd, the
> - * value of the __u64 being written is added to "count" and a
> - * wakeup is performed on "wqh". A read(2) will return the "count"
> - * value to userspace, and will reset "count" to zero. The kernel
> - * side eventfd_signal() also, adds to the "count" counter and
> - * issue a wakeup.
> - */
> - __u64 count;
> + union {
> + /*
> + * Every time that a write(2) is performed on an eventfd, the
> + * value of the __u64 being written is added to "count" and a
> + * wakeup is performed on "wqh". A read(2) will return the
> + * "count" value to userspace, and will reset "count" to zero.
> + * The kernel side eventfd_signal() also, adds to the "count"
> + * counter and issue a wakeup.
> + */
> + __u64 count;
> + struct eventfd_mask mask;

The nice explanation for `count' was retained, but is it appropriate
that `mask' have no explanation?

> + };
> unsigned int flags;
> };
>
> ...

>
> @@ -230,13 +261,23 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> ssize_t res;
> __u64 cnt;
>
> - if (count < sizeof(cnt))
> - return -EINVAL;
> - res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
> - if (res < 0)
> + if (ctx->flags & EFD_MASK) {
> + spin_lock_irq(&ctx->wqh.lock);
> + if (count < sizeof(ctx->mask))
> + return -EINVAL;
> + res = copy_to_user(buf, &ctx->mask, sizeof(ctx->mask)) ?
> + -EFAULT : sizeof(ctx->mask);
> + spin_unlock_irq(&ctx->wqh.lock);
> return res;

This code is crawling with bugs.

- can return with wqh.lock held -> dead kernel

- performs copy_to_user() under spinlock -> warning spew, kernel
deadlocks. It should go via a local temporary, as was done in
eventfd_write().

This should have filled your screen with warnings when testing. Either
it wasn't tested or its author forgot to read
Documentation/SubmitChecklist section 12. Please do so ;)

(otoh maybe might_sleep and lockdep fail to detect copy_*_user under
spinlock when the copy doesn't fault. If so, that's a big fail)

`mask' can be made local to this `if' block, which is nicer.

> if (count < sizeof(ucnt))
> return -EINVAL;
> @@ -293,8 +351,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
> int ret;
>
> spin_lock_irq(&ctx->wqh.lock);
> - ret = seq_printf(m, "eventfd-count: %16llx\n",
> - (unsigned long long)ctx->count);
> + if (ctx->flags & EFD_MASK) {
> + ret = seq_printf(m, "eventfd-mask: %x\n",
> + (unsigned)ctx->mask.events);
> + } else {
> + ret = seq_printf(m, "eventfd-count: %16llx\n",
> + (unsigned long long)ctx->count);
> + }
> spin_unlock_irq(&ctx->wqh.lock);

This is a non-back-compatible userspace interface change. A procfs
file which previously displayed

eventfd-count: nnnn

can now also display

eventfd-mask: nnnn

So existing userspace could misbehave.

Please fully describe the proposed interface change in the changelog.
That description should include the full pathname of the procfs file
and example before-and-after output and a discussion of whether and why
the risk to existing userspace is acceptable.

> @@ -412,7 +475,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>
> kref_init(&ctx->kref);
> init_waitqueue_head(&ctx->wqh);
> - ctx->count = count;
> + if (flags & EFD_MASK) {
> + ctx->mask.events = 0;
> + ctx->mask.data = 0;
> + } else {
> + ctx->count = count;
> + }
> ctx->flags = flags;
>
> file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index 3c3ef19..b806d2b 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -20,11 +20,12 @@
> * shared O_* flags.
> */
> #define EFD_SEMAPHORE (1 << 0)
> +#define EFD_MASK (1 << 1)

Does this addition comply with the "CAREFUL:" immediately above it?

It would be best to add code comemntary describing what this constant does.

> #define EFD_CLOEXEC O_CLOEXEC
> #define EFD_NONBLOCK O_NONBLOCK
>
> #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
> -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
> +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)
>
> #ifdef CONFIG_EVENTFD

--

Andy Lutomirski

unread,

Feb 14, 2013, 7:00:01 PM2/14/13

to

On Thu, Feb 14, 2013 at 2:54 PM, Andrew Morton
<ak...@linux-foundation.org> wrote:
> On Fri, 8 Feb 2013 09:11:17 +0100
> Martin Sustrik <sus...@250bpm.com> wrote:
>
>> When implementing network protocols in user space, one has to implement
>> fake user-space file descriptors to represent the sockets for the protocol.

>

I suspect that the fdinfo stuff is only used by criu, which will need
to be updated regardless. (If the kernel adopts the policy "don't
break criu", then that may be the end of new kernel features.)

--Andy

Michał Mirosław

unread,

Feb 14, 2013, 9:50:01 PM2/14/13

to

2013/2/8 Martin Sustrik <sus...@250bpm.com>:

> When implementing network protocols in user space, one has to implement
> fake user-space file descriptors to represent the sockets for the protocol.

[...]

> This patch implements new EFD_MASK flag which attempts to solve this problem.

[...]

> @@ -412,7 +464,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)
>
> kref_init(&ctx->kref);
> init_waitqueue_head(&ctx->wqh);
> - ctx->count = count;
> + if (flags & EFD_MASK) {
> + ctx->mask.events = 0;
> + ctx->mask.ptr = NULL;
> + } else {
> + ctx->count = count;
> + }
> ctx->flags = flags;
>
> file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,

Since EFD_MASK is a persistent flag for a fd's lifetime, maybe you
could instead of all those if/elses and BUG_ON()s use another
file_operations struct for this feature?

Best Regards,
Michał Mirosław

Martin Sustrik

unread,

Feb 14, 2013, 10:50:02 PM2/14/13

to

Hi Andrew,

Thanks for the detailed code review! I'll have a look at all the
problems you've pointed out, however, one quick question:

>> - ret = seq_printf(m, "eventfd-count: %16llx\n",
>> - (unsigned long long)ctx->count);

>> + if (ctx->flags& EFD_MASK) {

>> + ret = seq_printf(m, "eventfd-mask: %x\n",
>> + (unsigned)ctx->mask.events);
>> + } else {
>> + ret = seq_printf(m, "eventfd-count: %16llx\n",
>> + (unsigned long long)ctx->count);
>> + }
>> spin_unlock_irq(&ctx->wqh.lock);
>
> This is a non-back-compatible userspace interface change. A procfs
> file which previously displayed
>
> eventfd-count: nnnn
>
> can now also display
>
> eventfd-mask: nnnn
>
> So existing userspace could misbehave.
>
> Please fully describe the proposed interface change in the changelog.
> That description should include the full pathname of the procfs file
> and example before-and-after output and a discussion of whether and why
> the risk to existing userspace is acceptable.

I am not sure what the policy is here. Is not printing out the state of
the object acceptable way to maintain backward compatibility? If not so,
does new type of object require new procfs file, which, AFAIU, is the
only way to retain full backward compatibility?

Martin

Andrew Morton

unread,

Feb 15, 2013, 12:30:01 AM2/15/13

to

On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik <sus...@250bpm.com> wrote:

> > This is a non-back-compatible userspace interface change. A procfs
> > file which previously displayed
> >
> > eventfd-count: nnnn
> >
> > can now also display
> >
> > eventfd-mask: nnnn
> >
> > So existing userspace could misbehave.
> >
> > Please fully describe the proposed interface change in the changelog.
> > That description should include the full pathname of the procfs file
> > and example before-and-after output and a discussion of whether and why
> > the risk to existing userspace is acceptable.
>
> I am not sure what the policy is here. Is not printing out the state of
> the object acceptable way to maintain backward compatibility? If not so,
> does new type of object require new procfs file, which, AFAIU, is the
> only way to retain full backward compatibility?

Adding a new file is the only way I can think of to preserve the API.
But from Andy's comment is sounds like we don't have to worry a lot
about back-compatibility.

Andy Lutomirski

unread,

Feb 15, 2013, 12:40:02 PM2/15/13

to

On Thu, Feb 14, 2013 at 9:24 PM, Andrew Morton
<ak...@linux-foundation.org> wrote:
> On Fri, 15 Feb 2013 04:42:27 +0100 Martin Sustrik <sus...@250bpm.com> wrote:
>
>> > This is a non-back-compatible userspace interface change. A procfs
>> > file which previously displayed
>> >
>> > eventfd-count: nnnn
>> >
>> > can now also display
>> >
>> > eventfd-mask: nnnn
>> >
>> > So existing userspace could misbehave.
>> >
>> > Please fully describe the proposed interface change in the changelog.
>> > That description should include the full pathname of the procfs file
>> > and example before-and-after output and a discussion of whether and why
>> > the risk to existing userspace is acceptable.
>>
>> I am not sure what the policy is here. Is not printing out the state of
>> the object acceptable way to maintain backward compatibility? If not so,
>> does new type of object require new procfs file, which, AFAIU, is the
>> only way to retain full backward compatibility?
>
> Adding a new file is the only way I can think of to preserve the API.
> But from Andy's comment is sounds like we don't have to worry a lot
> about back-compatibility.
>

I'm not even convinced there's an issue in the first place (other than
the fact that use of this feature will break old criu, regardless of
/proc changes). The fdinfo files already vary by descriptor type.
Anything that screws up if unexpected fields are present is already
screwed.

--Andy

Martin Sustrik

unread,

Feb 15, 2013, 1:40:02 PM2/15/13

to

Ok then. I'll leave the relevant code as is.

Martin

Martin Sustrik

unread,

Feb 18, 2013, 4:00:01 AM2/18/13

to

On 14/02/13 23:54, Andrew Morton wrote:

>> +/* On x86-64 keep the same binary layout as on i386. */
>> +#ifdef __x86_64__
>> +#define EVENTFD_MASK_PACKED __packed
>> +#else
>> +#define EVENTFD_MASK_PACKED
>> +#endif
>> +
>> +struct eventfd_mask {
>> + __u32 events;
>> + __u64 data;
>> +} EVENTFD_MASK_PACKED;
>
> The x86-64 specific thing is ugly. I can find no explanation of why it
> was done, but it should go away. You could make `events' a u64, or
> swap the order of the two fields and make the struct __packed on all
> architectures.
>
> Given that the size of the types is fixed, I see no compat issues here.

I've just copied how the definition is done for epoll_event. The comment
there goes like this:

/*
* On x86-64 make the 64bit structure have the same alignment as the
* 32bit structure. This makes 32bit emulation easier.
*
* UML/x86_64 needs the same packing as x86_64
*/

If you still think I should remove the #ifdef, I am happy to do so.

Martin

Martin Sustrik

unread,

Feb 18, 2013, 6:40:02 AM2/18/13

to

When implementing network protocols in user space, one has to implement

fake file descriptors to represent the sockets for the protocol.

Polling on such fake file descriptors is a problem (poll/select/epoll accept
only true file descriptors) and forces protocol implementers to use various
workarounds resulting in complex, non-standard and convoluted APIs.

More generally, ability to create full-blown file descriptors for
userspace-to-userspace signalling is missing. While eventfd(2) goes half the way
towards this goal it has follwoing shorcomings:

I. There's no way to signal POLLPRI, POLLHUP etc.

II. There's no way to signal arbitrary combination of POLL* flags. Most notably,
simultaneous !POLLIN and !POLLOUT, which is a perfectly valid combination

for a network protocol (rx buffer is empty and tx buffer is full), cannot be

signaled using eventfd.

This patch implements new EFD_MASK flag which solves the above problems.

Additionally, to provide a way to associate user-space state with eventfd
object, it allows to attach user-space data to the file descriptor.

The semantics of EFD_MASK are as follows:

eventfd(2):

If eventfd is created with EFD_MASK flag set, it is initialised in such a way
as to signal no events on the file descriptor when it is polled on. 'initval'
argument is ignored.

write(2):

User is allowed to write only buffers containing the following structure:

struct efd_mask {
short events;
union {
void *ptr;
uint32_t u32;
uint64_t u64;
};
};

The value of 'events' should be any combination of event flags as defined by
poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will
be signaled when polling (select, poll, epoll) on the eventfd is done later on.

'ptr', 'u32' or 'u64' are opaque data that are not interpreted by eventfd

object.

read(2):

User is allowed to read an efd_mask structure from the eventfd marked by
EFD_MASK. Returned value shall be the last one written to the eventfd.

select(2), poll(2) and similar:

When polling on the eventfd marked by EFD_MASK flag, all the events specified
in last written 'events' field shall be signaled.

Signed-off-by: Martin Sustrik <sus...@250bpm.com>
---

Following changes were made to the patch since v2:

- eventfd_mask structure renamed to efd_mask to keep user-space prefixes
consistent
- efd_mask is __packed for all architectures
- eventfd.h header file added to uapi; given there wasn't one so far, in
addition to moving efd_mask there, I've moved also all the eventfd flags that
are meant to be visible from the user space
- comment for 'mask' member eventfd_ctx added
- synchronisation bugs in eventfd_read fixed
- several variable declarations moved from the beginning of the function to
the blocks where they are used
- changelog text made simpler and more up to the point

There was also a request to explain why this functionality is needed. I've
written an article explaining the problems user-space network protocol
implementers face here: http://www.250bpm.com/blog:16
---
fs/eventfd.c | 194 ++++++++++++++++++++++++++++--------------
include/linux/eventfd.h | 17 +---
include/uapi/linux/eventfd.h | 40 +++++++++
3 files changed, 172 insertions(+), 79 deletions(-)
create mode 100644 include/uapi/linux/eventfd.h

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 35470d9..8f821b1 100644

--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -2,6 +2,7 @@
* fs/eventfd.c
*
* Copyright (C) 2007 Davide Libenzi <dav...@xmailserver.org>
+ * Copyright (C) 2013 Martin Sustrik <sus...@250bpm.com>
*
*/

@@ -22,18 +23,31 @@
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

+#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)

+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK)

+#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP)
+

struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
- /*
- * Every time that a write(2) is performed on an eventfd, the
- * value of the __u64 being written is added to "count" and a
- * wakeup is performed on "wqh". A read(2) will return the "count"
- * value to userspace, and will reset "count" to zero. The kernel
- * side eventfd_signal() also, adds to the "count" counter and
- * issue a wakeup.
- */
- __u64 count;
+ union {
+ /*
+ * Every time that a write(2) is performed on an eventfd, the
+ * value of the __u64 being written is added to "count" and a
+ * wakeup is performed on "wqh". A read(2) will return the
+ * "count" value to userspace, and will reset "count" to zero.
+ * The kernel side eventfd_signal() also, adds to the "count"
+ * counter and issue a wakeup.
+ */
+ __u64 count;
+

+ /*
+ * When using eventfd in EFD_MASK mode this stracture stores the
+ * current events to be signaled on the eventfd (events member)
+ * along with opaque user-defined data (data member).
+ */
+ struct efd_mask mask;
+ };
unsigned int flags;
};

@@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)

{
unsigned long flags;

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);
if (ULLONG_MAX - ctx->count < n)
n = ULLONG_MAX - ctx->count;

@@ -123,12 +140,16 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait)

poll_wait(file, &ctx->wqh, wait);

spin_lock_irqsave(&ctx->wqh.lock, flags);
- if (ctx->count > 0)
- events |= POLLIN;
- if (ctx->count == ULLONG_MAX)
- events |= POLLERR;
- if (ULLONG_MAX - 1 > ctx->count)
- events |= POLLOUT;
+ if (ctx->flags & EFD_MASK) {
+ events = ctx->mask.events;
+ } else {
+ if (ctx->count > 0)
+ events |= POLLIN;
+ if (ctx->count == ULLONG_MAX)
+ events |= POLLERR;
+ if (ULLONG_MAX - 1 > ctx->count)
+ events |= POLLOUT;
+ }
spin_unlock_irqrestore(&ctx->wqh.lock, flags);

return events;

@@ -158,6 +179,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait,

{
unsigned long flags;

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+
spin_lock_irqsave(&ctx->wqh.lock, flags);

eventfd_ctx_do_read(ctx, cnt);
__remove_wait_queue(&ctx->wqh, wait);

@@ -188,6 +212,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt)
ssize_t res;
DECLARE_WAITQUEUE(wait, current);

+ /* This function should never be used with eventfd in the mask mode. */
+ BUG_ON(ctx->flags & EFD_MASK);
+

spin_lock_irq(&ctx->wqh.lock);
*cnt = 0;
res = -EAGAIN;

@@ -227,63 +254,92 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
loff_t *ppos)
{
struct eventfd_ctx *ctx = file->private_data;
- ssize_t res;
- __u64 cnt;

- if (count < sizeof(cnt))
- return -EINVAL;
- res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
- if (res < 0)

- return res;

-
- return put_user(cnt, (__u64 __user *) buf) ? -EFAULT : sizeof(cnt);

+ if (ctx->flags & EFD_MASK) {

+ struct efd_mask mask;

+ if (count < sizeof(mask))
+ return -EINVAL;

+ spin_lock_irq(&ctx->wqh.lock);
+ mask = ctx->mask;
+ spin_unlock_irq(&ctx->wqh.lock);
+ if (copy_to_user(buf, &mask, sizeof(mask)))
+ return -EFAULT;
+ return sizeof(mask);
+ } else {
+ ssize_t res;
+ __u64 cnt;

+ if (count < sizeof(cnt))
+ return -EINVAL;
+ res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt);
+ if (res < 0)
+ return res;
+ return put_user(cnt, (__u64 __user *) buf) ?
+ -EFAULT : sizeof(cnt);
+ }
}

static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,

loff_t *ppos)
{
struct eventfd_ctx *ctx = file->private_data;
- ssize_t res;
- __u64 ucnt;
- DECLARE_WAITQUEUE(wait, current);

- if (count < sizeof(ucnt))
- return -EINVAL;
- if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
- return -EFAULT;
- if (ucnt == ULLONG_MAX)
- return -EINVAL;
- spin_lock_irq(&ctx->wqh.lock);
- res = -EAGAIN;
- if (ULLONG_MAX - ctx->count > ucnt)
- res = sizeof(ucnt);
- else if (!(file->f_flags & O_NONBLOCK)) {
- __add_wait_queue(&ctx->wqh, &wait);
- for (res = 0;;) {
- set_current_state(TASK_INTERRUPTIBLE);
- if (ULLONG_MAX - ctx->count > ucnt) {
- res = sizeof(ucnt);
- break;
- }
- if (signal_pending(current)) {
- res = -ERESTARTSYS;
- break;

+ if (ctx->flags & EFD_MASK) {

+ struct efd_mask mask;

+ if (count < sizeof(mask))
+ return -EINVAL;
+ if (copy_from_user(&mask, buf, sizeof(mask)))
+ return -EFAULT;
+ if (mask.events & ~EFD_MASK_VALID_EVENTS)
+ return -EINVAL;
+ spin_lock_irq(&ctx->wqh.lock);
+ memcpy(&ctx->mask, &mask, sizeof(ctx->mask));
+ if (waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh,
+ (unsigned long)ctx->mask.events);
+ spin_unlock_irq(&ctx->wqh.lock);
+ return sizeof(ctx->mask);

+ } else {
+ ssize_t res;
+ __u64 ucnt;
+ DECLARE_WAITQUEUE(wait, current);
+ if (count < sizeof(ucnt))
+ return -EINVAL;
+ if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+ return -EFAULT;
+ if (ucnt == ULLONG_MAX)

+ return -EINVAL;
+ spin_lock_irq(&ctx->wqh.lock);

+ res = -EAGAIN;
+ if (ULLONG_MAX - ctx->count > ucnt)
+ res = sizeof(ucnt);
+ else if (!(file->f_flags & O_NONBLOCK)) {
+ __add_wait_queue(&ctx->wqh, &wait);
+ for (res = 0;;) {
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (ULLONG_MAX - ctx->count > ucnt) {
+ res = sizeof(ucnt);
+ break;
+ }
+ if (signal_pending(current)) {
+ res = -ERESTARTSYS;
+ break;
+ }
+ spin_unlock_irq(&ctx->wqh.lock);
+ schedule();
+ spin_lock_irq(&ctx->wqh.lock);
}
- spin_unlock_irq(&ctx->wqh.lock);
- schedule();
- spin_lock_irq(&ctx->wqh.lock);
+ __remove_wait_queue(&ctx->wqh, &wait);
+ __set_current_state(TASK_RUNNING);
}
- __remove_wait_queue(&ctx->wqh, &wait);
- __set_current_state(TASK_RUNNING);
- }
- if (likely(res > 0)) {
- ctx->count += ucnt;
- if (waitqueue_active(&ctx->wqh))
- wake_up_locked_poll(&ctx->wqh, POLLIN);
- }
- spin_unlock_irq(&ctx->wqh.lock);
+ if (likely(res > 0)) {
+ ctx->count += ucnt;
+ if (waitqueue_active(&ctx->wqh))
+ wake_up_locked_poll(&ctx->wqh, POLLIN);
+ }
+ spin_unlock_irq(&ctx->wqh.lock);

- return res;
+ return res;
+ }
}

#ifdef CONFIG_PROC_FS
@@ -293,8 +349,13 @@ static int eventfd_show_fdinfo(struct seq_file *m, struct file *f)
int ret;

spin_lock_irq(&ctx->wqh.lock);

- ret = seq_printf(m, "eventfd-count: %16llx\n",
- (unsigned long long)ctx->count);

+ if (ctx->flags & EFD_MASK) {

+ ret = seq_printf(m, "eventfd-mask: %x\n",
+ (unsigned)ctx->mask.events);
+ } else {
+ ret = seq_printf(m, "eventfd-count: %16llx\n",
+ (unsigned long long)ctx->count);
+ }
spin_unlock_irq(&ctx->wqh.lock);

return ret;
@@ -412,7 +473,12 @@ struct file *eventfd_file_create(unsigned int count, int flags)

kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
- ctx->count = count;
+ if (flags & EFD_MASK) {
+ ctx->mask.events = 0;

+ ctx->mask.data = 0;

+ } else {
+ ctx->count = count;
+ }
ctx->flags = flags;

file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx,

diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 3c3ef19..218aba6 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,24 +8,11 @@
#ifndef _LINUX_EVENTFD_H
#define _LINUX_EVENTFD_H

-#include <linux/fcntl.h>
+#include <uapi/linux/eventfd.h>
+
#include <linux/file.h>
#include <linux/wait.h>

-/*
- * CAREFUL: Check include/asm-generic/fcntl.h when defining
- * new flags, since they might collide with O_* ones. We want
- * to re-use O_* flags that couldn't possibly have a meaning
- * from eventfd, in order to leave a free define-space for
- * shared O_* flags.
- */
-#define EFD_SEMAPHORE (1 << 0)
-#define EFD_CLOEXEC O_CLOEXEC
-#define EFD_NONBLOCK O_NONBLOCK
-
-#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
-
#ifdef CONFIG_EVENTFD

struct file *eventfd_file_create(unsigned int count, int flags);
diff --git a/include/uapi/linux/eventfd.h b/include/uapi/linux/eventfd.h
new file mode 100644
index 0000000..03057a5
--- /dev/null
+++ b/include/uapi/linux/eventfd.h
@@ -0,0 +1,40 @@
+/*

+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_EVENTFD_H
+#define _UAPI_LINUX_EVENTFD_H
+
+/* For O_CLOEXEC */
+#include <linux/fcntl.h>
+#include <linux/types.h>
+
+/*
+ * CAREFUL: Check include/asm-generic/fcntl.h when defining
+ * new flags, since they might collide with O_* ones. We want
+ * to re-use O_* flags that couldn't possibly have a meaning
+ * from eventfd, in order to leave a free define-space for
+ * shared O_* flags.
+ */
+
+/* Provide semaphore-like semantics for reads from the eventfd. */
+#define EFD_SEMAPHORE (1 << 0)
+/* Provide event mask semantics for the eventfd. */
+#define EFD_MASK (1 << 1)
+/* Set the close-on-exec (FD_CLOEXEC) flag on the eventfd. */
+#define EFD_CLOEXEC O_CLOEXEC
+/* Create the eventfd in non-blocking mode. */
+#define EFD_NONBLOCK O_NONBLOCK
+
+/* Structure to read/write to eventfd in EFD_MASK mode. */
+struct efd_mask {

+ __u32 events;
+ __u64 data;

+} __packed;
+
+#endif /* _UAPI_LINUX_EVENTFD_H */
--
1.7.4.1

Martin Sustrik

unread,

Feb 18, 2013, 7:00:02 AM2/18/13

to

On 14/02/13 23:54, Andrew Morton wrote:

> This patch adds userspace interfaces which will require manpage
> updates. Please Cc Michael and work with him on getting those changes
> completed.

Right. It adds the efd_mask structure. As far as I understand how it
works is that the actual user-space definition of the structure should
be provided by glibc (sys/eventfd.h), right?

If so, should the man page be updated now? If so, it may result in the
situation where the man page describes a structure that's not available
in the user space yet.

Is there a formal process to do this kind of kernel+glibc+docs changes?

Martin

Damian Hobson-Garcia

unread,

Jul 9, 2015, 4:00:07 AM7/9/15

to

Hello Martin and all,

I recently came across this (quite old by now) patch submission for an
extension to the functionality of eventfd and I noticed that the
discussion seems to have fizzled out. Is this functionality still of
use for user space network protocols? It seems like it would be usable
for other notification use cases as well. In particular I'm thinking of
poll/select/epoll support for a user space video codec via libv4l.
Is there value in re-examining this patch?

Thank you,
Damian

Martin Sustrik

unread,

Jul 9, 2015, 5:00:06 AM7/9/15

to

Hi Damian,

Yes, this patch would be geneally useful for implementing stuff in user
space that otherwise would have to live in kernelspace.

Unfortunately, I have no cycles left to pursue getting it to the
mainline. If you feel like you can take care of it, that would be great.

I can help with the documentation, if needed.

Martin

Damian Hobson-Garcia

unread,

Jul 9, 2015, 5:10:07 AM7/9/15

to

Hi Martin,

On 2015-07-09 5:41 PM, Martin Sustrik wrote:
> Hi Damian,
>
> Yes, this patch would be geneally useful for implementing stuff in user
> space that otherwise would have to live in kernelspace.
>
> Unfortunately, I have no cycles left to pursue getting it to the
> mainline. If you feel like you can take care of it, that would be great.

I'll see what I can do. I'll rebase it and resubmit it.
Unless there are major changes (which I highly doubt), I'll keep you as
the author, if that's ok with you.

>
> I can help with the documentation, if needed.

Thank you, that would be very much appreciated.

Damian

Martin Sustrik

unread,

Jul 9, 2015, 5:30:06 AM7/9/15

to

On 2015-07-09 11:06, Damian Hobson-Garcia wrote:
> Hi Martin,
>
> On 2015-07-09 5:41 PM, Martin Sustrik wrote:
>> Hi Damian,
>>
>> Yes, this patch would be geneally useful for implementing stuff in
>> user
>> space that otherwise would have to live in kernelspace.
>>
>> Unfortunately, I have no cycles left to pursue getting it to the
>> mainline. If you feel like you can take care of it, that would be
>> great.
> I'll see what I can do. I'll rebase it and resubmit it.
> Unless there are major changes (which I highly doubt), I'll keep you as
> the author, if that's ok with you.

Sure. No problem.

>>
>> I can help with the documentation, if needed.
> Thank you, that would be very much appreciated.

Ok, ping me when it's needed.

Thanks!
Martin

[PATCH 1/1] eventfd: implementation of EFD_MASK flag

Martin Sustrik

Andy Lutomirski

Martin Sustrik

Andrew Morton

Martin Sustrik

Martin Sustrik

Andy Lutomirski

Martin Sustrik

Andy Lutomirski

Martin Sustrik

Martin Sustrik

Martin Sustrik

Eric Wong

Eric Wong

Martin Sustrik

Martin Sustrik

Eric Wong

Martin Sustrik

Eric Wong

Martin Sustrik

Andrew Morton

Andy Lutomirski

Michał Mirosław

Martin Sustrik

Andrew Morton

Andy Lutomirski

Martin Sustrik

Martin Sustrik

Martin Sustrik

Martin Sustrik

Damian Hobson-Garcia

Martin Sustrik

Damian Hobson-Garcia

Martin Sustrik