epoll design problems with common fork/exec patterns

Marc Lehmann

Oct 27, 2007, 3:01:07 AM

to linux-...@vger.kernel.org, Davide Libenzi

Hi!

I ran into what I see as unsolvable problems that make epoll useless as a
generic event mechanism.

I recently switched to libevent as my event loop, and found that my programs
work fine when it is using select or poll, but work erratically or halt
when using epoll.

The reason, as I found out, is the peculiar behaviour of epoll over fork.
It doesn't work as documented, and even if it did, it would usually make the
use of third-party libraries that fork impossible.

Here are two scenarios where it screws up:

- some library forks, explicitly closes all fds it doesn't need, and execs
another program (which is common behaviour).

In this case, the parent process works fine until the child closes fds,
after which the fds become unarmed in the parent too. This works as
documented, but since libraries expect this to work without affecting the
parent, this puts a new and incompatible strain on what libraries can do,
which in turn makes epoll unsuitable in cases where you don't control all
your code.

- I have a library that emulates asynchronous I/O with a thread pool, and
uses a pipe for event notification. That library registers a fork handler
that closes the pipe in the child and recreates it, so the child could
continue doing AIO (as could the parent).

This, too, screws up notifications for the parent.

Now, the epoll manpage says that closing a fd will remove it from all
fd sets. This would explain the behaviour above. Unfortunately (or
fortunately?) this is not what happens: when the fds are being closed by
exec or exit, the fds do not get removed from the epoll set.

This behaviour strikes me as extremely illogical. On the one hand, one
cannot share the epoll fd between processes normally, but on fork,
you can, even though it makes no sense (the child has a different fd
"namespace" than the parent) and actually works on (then) unrelated fds in
the other process.

It also strikes me as weird that the order of closing fds should make so much
of a difference: if the epoll fd is closed first in the child, the other
fds will survive in the parent; if it's closed last, they don't. Makes no
sense to me.

Now, the problem I see is not that it makes no sense to me - that's clearly
my problem. The problem I see is that there is no way to avoid the
associated problems except by patching all code that would ever use fork,
even if it never has heard anything about epoll yet. This is extremely
nonlocal action at a distance, as this affects a lot of code not even the
author might be aware of (fork is rather common).

To illustrate, here are some workarounds I thought about:

- rearming all fds after fork: doesn't work, as the fds get removed
asynchronously so I would have to wait for the child to do it.
- closing the epoll fd after fork: doesn't work unless I control
the fork. I can install a handler to be called using pthreads, but
that won't help as other handlers might be called first (as in the case of
the aio library above), screwing me.
- closing and recreating the epoll fd before the fork: isn't supported even
remotely by libevent or similar event loops, and would not help either,
as I cannot control the calls to fork.

Is epoll really designed to be so incompatible with the most common fork
patterns? Shouldn't epoll do refcounting, as is commonly done under
Unix? As the fd space is not shared between processes, why does epoll
try? Shouldn't the epoll information be copied just like the fd table
itself, memory, and other resources?

As it looks now, epoll looks useless except in the most controlled
environments, as it doesn't duplicate state on fork as is done with the
other fd-related resources (as opposed to the underlying files, which are
properly shared).

--
The choice of a
-----==- _GNU_ Deliantra, the free in data+content MORPG
----==-- _ generation
---==---(_)__ __ ____ __ http://www.deliantra.net/
--==---/ / _ \/ // /\ \/ /
-=====/_/_//_/\_,_/ /_/\_\

Eric Dumazet

Oct 27, 2007, 4:24:24 AM

to Marc Lehmann, linux-...@vger.kernel.org, Davide Libenzi

Marc Lehmann a écrit :

> Hi!
>
> I ran into what I see as unsolvable problems that make epoll useless as a
> generic event mechanism.
>
> I recently switched to libevent as event loop, and found that my programs
> work fine when it is using select or poll, but work eratically or halt
> when using epoll.
>
> The reason as I found out is the peculiar behaviour of epoll over fork.
> It doesn't work as documented, and even if, it would make the use of
> third-party libraries using fork usually impossible.
>
> Here are two scenarios where it screws up:
>
> - some library forks, explicitly closes all fd's it doesn't need, and execs
> another program (which is common behvaiour).
>
> In this case, the parent process works fine until the child closes fds,
> after which the fds become unarmed in the parent too. This works as

I have no idea what exact problem you have. But if the child closes some file
descriptors that were 'cloned' at fork() time, this only decrements a refcount,
and definitely should not close them for the 'parent'. In this regard epoll
uses a generic kernel service (file descriptor sharing between tasks).

I have some apps that are happily using epoll() and fork()/exec() and have no
problem at all. I usually use O_CLOEXEC so that all close() are done at exec()
time without having to do it in a loop. epoll continues to work as expected in
the parent process.
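
For illustration, here is a minimal sketch of marking a descriptor close-on-exec, so that the kernel closes it at exec() time and no explicit close() loop is needed in the child. It assumes the long-standing fcntl(F_SETFD, FD_CLOEXEC) interface rather than the O_CLOEXEC open flag mentioned above, and set_cloexec is a hypothetical helper name, not something from this thread.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: mark fd close-on-exec.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
int set_cloexec(int fd)
{
        int flags = fcntl(fd, F_GETFD);   /* read the current descriptor flags */
        if (flags == -1)
                return -1;
        /* keep the existing flags and add FD_CLOEXEC */
        return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

A program would call set_cloexec() on each descriptor right after creating it, before any code path that may fork and exec.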

> documented, but since libraries expect this to work without affecting the
> parent, this puts a new and incompatible strain on what libraries can do,
> which in turn makes epoll unsuitable in cases where you don't control all
> your code.
>
> - I have a library that emulates asynchronous I/O with a thread pool, and
> uses a pipe for event notification. That library registers a fork handler
> that closes the pipe in the child and recreates it, so the child could
> continue doing AIO (as could the parent).
>
> This, too, screws up notifications for the parent,
>
> Now, the epoll manpage says that closing a fd will remove it from all
> fd sets. This would explain the behaviour above. Unfortunately (or
> fortunately?) this is not what happens: when the fds are being closed by
> exec or exit, the fds do not get removed from the epoll set.

At exec() (provided CLOEXEC is set) or exit() time, only the refcount of
each file is decremented. Only if a file's refcount drops to zero is the file
removed from the epoll set.

Too many questions here, showing lack of understanding.

>
> As it looks now, epoll looks useless except in the most controlled
> environments, as it doesn't duplicate state on fork as is done with the
> other fd-related resources (as opposed to the underlying files, which are
> properly shared).
>

epoll definitely is not useless. It is used in major and critical apps.
You certainly missed something.
Please provide some code to illustrate one exact problem you have.

Marc Lehmann

Oct 27, 2007, 4:50:01 AM

to Eric Dumazet, linux-...@vger.kernel.org, Davide Libenzi

On Sat, Oct 27, 2007 at 10:23:17AM +0200, Eric Dumazet <da...@cosmosbay.com> wrote:
> > In this case, the parent process works fine until the child closes fds,
> > after which the fds become unarmed in the parent too. This works as
>
> I have no idea what exact problem you have.

Well, I explained it rather succinctly, I think. If you tell me what's unclear
I can explain...

> But if the child closes some
> file descriptor that were 'cloned' at fork() time, this only decrements a
> refcount, and definitely should not close it for the 'parent'.

It doesn't. It removes it from the epoll set, though, so the parent will not
receive events for that fd anymore.

> I have some apps that are happily using epoll() and fork()/exec() and have

The problem I described is fork/close/exec. close being the explicit
syscall.

> no problem at all. I usually use O_CLOEXEC so that all close() are done at
> exec() time without having to do it in a loop. epoll continues to work as
> expected in the parent process.

This is because epoll doesn't behave as documented: it removes the fd
from the parent's epoll set only on an explicit close() syscall, not on an
implicit close from exec.

> >fd sets. This would explain the behaviour above. Unfortunately (or
> >fortunately?) this is not what happens: when the fds are being closed by
> >exec or exit, the fds do not get removed from the epoll set.
>
> at exec() (granted CLOEXEC is asserted) or exit() time, only the refcount
> of each file is decremented. Only if their refcount becomes NULL, files are
> then removed from epoll set.

Yes. But that's obviously not the only way to close fds.

> >Is epoll really designed to be so incompatible with the most commno fork
> >patterns? Shouldn't epoll do refcounting, as is commonly done under
> >Unix? As the fd space is not shared between rpocesses, why does epoll
> >try? Shouldn't the epoll information be copied just like the fd table
> >itself, memory, and other resources?
>
> Too many questions here, showing lack of understanding.

You already said you don't understand the problem. No need to get insulting :(

> epoll definitly is not useless. It is used on major and critical apps.
> You certainly missed something.

Well, it behaves like documented, which is the problem. You admit you
don't understand the problem or the documentation, so again, no need to
insult me.

> Please provide some code to illustrate one exact problem you have.

// assume there is an open epoll set that listens for events on fd 5
if (fork () == 0)
  {
    close (5);
    // fd 5 is now removed from the epoll set of the parent.
    _exit (0);
  }

--
The choice of a
-----==- _GNU_

----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ p...@goof.com
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

Eric Dumazet

Oct 27, 2007, 5:23:36 AM

to Marc Lehmann, linux-...@vger.kernel.org, Davide Libenzi

Marc Lehmann a écrit :

Hum... I will update my English vocabulary and mark "missed" as an insult.

I have no problem with epoll nor its documentation.

>
>> Please provide some code to illustrate one exact problem you have.
>
> // assume there is an open epoll set that listens for events on fd 5
> if (fork () = 0)
> {
> close (5);
> // fd 5 is now removed from the epoll set of the parent.
> _exit (0);
> }
>

It doesn't on any of the kernels I have played with. And I have played with
a *lot* of kernels, you know.

If such a bug exists on your kernel, please file a complete bug report, giving
details.

Thank you

Marc Lehmann

Oct 27, 2007, 5:35:09 AM

to Eric Dumazet, Marc Lehmann, linux-...@vger.kernel.org, Davide Libenzi

On Sat, Oct 27, 2007 at 11:22:25AM +0200, Eric Dumazet <da...@cosmosbay.com> wrote:
> >Well, it behaves like documented, which is the problem. You admit you
> >don't understand the problem or the documentation, so again, no need to
> >insult me.
>
> Hum... I will update my english vocabulary and mark "missed" as an insult.

Well, ignoring my arguments by claiming I lack understanding is an insult,
as you didn't take my arguments at face value but dismissed them by
attacking my person.

> I have no problem with epoll nor its documentation.

That's fine for you. But I do have one, at least with epoll, as the documented
and observed behaviour makes epoll unusable as a general event loop
replacement.

> It doesnt on every kernels I had played with. And I played with *lot* of
> kernels you know.

No, I don't know that. And so far you only said you used fork+exec, not
close in between, so maybe the playing you did was not related to this
problem?

I also played with a lot of kernels, but for epoll specifically, I played
with 2.6.21-2-amd64 and 2.6.22-1-amd64, both from debian unstable with no
customisations.

> If such a bug exists on your kernel, please fill a complete bug report,
> giving details.

As this behaviour is clearly documented in the epoll manpage, why do you
think it is a bug? I think it's fairly bad, but at least it's documented as
the behaviour it should be:

Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?
A6 Yes.

As such, filing a bug report for behaviour which isn't in fact a bug would
be counterproductive. My goal in my mail was to find out if there are
workarounds for this peculiar behaviour (or to inspire discussion of this
behaviour).

Of course, one can create big programs using epoll to their advantage. I
never claimed otherwise. But as a general event loop replacement (i.e.
outside of controlled environments), epoll does not currently qualify,
as I would have to control an awful lot of code (think of a Perl module
interfacing to epoll: you would now have to control all third-party
modules that might interfere with fork+close+exec. This is very common in
scripting languages).

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / p...@goof.com
-=====/_/_//_/\_,_/ /_/\_\

Eric Dumazet

Oct 27, 2007, 6:26:46 AM

to Marc Lehmann, linux-...@vger.kernel.org, Davide Libenzi

Marc Lehmann a écrit :

> On Sat, Oct 27, 2007 at 11:22:25AM +0200, Eric Dumazet <da...@cosmosbay.com> wrote:
>
>> If such a bug exists on your kernel, please fill a complete bug report,
>> giving details.
>
> As this behaviour is clearly documented in the epoll manpage, why do you
> think it is a bug? I think its fairly bad, but at least tis documented as
> the behaviour it should be:
>
> Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?
> A6 Yes.

Answer: the epoll documentation cannot explain the full semantics of file
descriptors, or the difference between the user side (file descriptors) and
the kernel side (files and fds).
Or maybe it should, since you had problems. But then, if the epoll
documentation had to document the full Unix/Linux file semantics, nobody
would read it.

The 'close' of a file is not close(fd) :)
It is the last close(), the one that brings the underlying file's refcount to 0.

example 1)

fd = open("somefile", ...);
fd1 = dup(fd);
epoll_add_in_my_set(fd1);  /* set up epoll to work on fd1 */
{do_something;}
close(fd1);  /* this is not the last close and will NOT close 'somefile' */
             /* it won't be removed from epoll sets NOW */

close(fd);   /* oh yes, this one is the real 'file close';
                now we perform epoll cleanups */

epoll has to deal with files, but the documentation is user-side documentation,
so it has to use 'file descriptors'. So everything that plays with the file
descriptor table (fork()/dup()/close()/exit()/exec()...) can make things
complex to understand and document.
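
Example 1 can be turned into a small runnable check. The sketch below is an illustration only, under these assumptions: a pipe stands in for "somefile", the registration is done with plain epoll_ctl() instead of the pseudo-call epoll_add_in_my_set(), and ready_count and dup_close_demo are hypothetical names. It reports how many events epoll returns after closing the registered duplicate and again after the last close.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Number of ready events (0 or 1) reported by efd, without blocking. */
static int ready_count(int efd)
{
        struct epoll_event out;
        return epoll_wait(efd, &out, 1, 0);
}

/* Registers a dup()ed pipe read end, then closes the descriptors one by one.
 * Stores the ready-event count after each close; returns 0 on success.
 * Error paths leak descriptors to keep the sketch short. */
int dup_close_demo(int *after_dup_close, int *after_last_close)
{
        int pfd[2];
        if (pipe(pfd) == -1)
                return -1;
        int fd1 = dup(pfd[0]);
        int efd = epoll_create(1);
        if (fd1 == -1 || efd == -1)
                return -1;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd1 };
        if (epoll_ctl(efd, EPOLL_CTL_ADD, fd1, &ev) == -1)
                return -1;
        if (write(pfd[1], "x", 1) != 1)       /* make the read end readable */
                return -1;

        close(fd1);                           /* not the last reference */
        *after_dup_close = ready_count(efd);

        close(pfd[0]);                        /* last reference to the read end */
        *after_last_close = ready_count(efd);

        close(pfd[1]);
        close(efd);
        return 0;
}
```

With the semantics described above, the first count should be 1 (the registration outlives close(fd1)) and the second 0 (the last close triggers the epoll cleanup).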

example 2)

int pfd[2];
pipe(pfd);
epoll_add_in_my_set(pfd[0]);  /* set up epoll to work on pfd[0], for example */
pid = fork();
if (pid == 0) {
        close(pfd[0]);  /* this is not the last close and will NOT close the pipe */
                        /* epoll has NO WAY to perform any cleanup at this stage */

        close(pfd[1]);  /* this is not the last close and will NOT close the pipe */
        _exit(0);
}
close(pfd[1]);
wait(NULL);
{do_something_epoll_related;}
close(pfd[0]);  /* finally we close the pipe, and epoll can do its cleanup */

fork() acts as a sort of dup(), as it increases all file refcounts.

Your problems concern close()/dup()/fork()/... file descriptor semantics,
which are handled by a layer independent of epoll.

Marc Lehmann

Oct 27, 2007, 6:47:13 AM

to Eric Dumazet, Marc Lehmann, linux-...@vger.kernel.org, Davide Libenzi

On Sat, Oct 27, 2007 at 12:23:52PM +0200, Eric Dumazet <da...@cosmosbay.com> wrote:
> > Q6 Will the close of an fd cause it to be removed from all epoll
> > sets automatically?
> > A6 Yes.
>
> Answer : epoll documentation cannot explain the full semantic of file

The epoll documentation easily can. There is nothing keeping it from doing
so. Don't make silly arguments like that.

> Or should, since you had problems

You are again implying I lack understanding. That is, however, not true.
I don't see the point in being insulted by you, so I won't continue
talking to you :(

> The 'close' of a file is not close(fd) :)

Good that you understand that.

That is one of my problems, as the manpage talks about closing of the fd,
but there are multiple ways to do that, and some are not handled the same
way.

> epoll has to deal with files, but documentation is a User side
> documentation, so has to use 'file descriptors'.

There is obviously no need for documentation to do that, contrary to your
claim. The manpages for e.g. dup or the official SUS manpages manage to
document it (mostly) correctly, so your claim that documentation must use
file descriptors when the underlying file structure is meant is disproven.

> fork() is acting sort of dup() , as it increases all file refcounts.
>
> You have problems about close()/dup()/fork()/... file descriptors semantic,
> which is handled by a layer independent from epoll stuff.

No, I have no problem with dup at all.

I have a problem with the fact that explicitly closing file descriptors in
the child stops events for those files from being reported in the parent.

I am sorry, but I explained this very clearly a number of times, and for some
reason, apart from accusing me of not understanding files and file
descriptors or the (clear enough) documentation, you ignore that and instead
hammer on other problems.

To me, it seems you are not the one who understands.

--
The choice of a Deliantra, the free code+content MORPG
-----==- _GNU_ http://www.deliantra.net
----==-- _ generation
---==---(_)__ __ ____ __ Marc Lehmann
--==---/ / _ \/ // /\ \/ / p...@goof.com
-=====/_/_//_/\_,_/ /_/\_\

Davide Libenzi

Oct 27, 2007, 12:59:50 PM

to Marc Lehmann, Eric Dumazet, Linux Kernel Mailing List

On Sat, 27 Oct 2007, Marc Lehmann wrote:

> > Please provide some code to illustrate one exact problem you have.
>
> // assume there is an open epoll set that listens for events on fd 5
> if (fork () = 0)
> {
> close (5);
> // fd 5 is now removed from the epoll set of the parent.
> _exit (0);
> }

Hmmm ... what? I assume you know that:

1) A file descriptor is a userspace view/handle of a kernel object

2) The kernel object has a use-count for as many file descriptors that
have been handed out to userspace

3) A close() decreases the internal counter by one

4) The kernel object gets effectively closed when the internal counter
goes to zero

5) A fork() acts as a dup() on the file descriptors, hence bumping up
their internal counters

6) Epoll removes the file from the set, when the *kernel* object gets
closed (internal use-count goes to zero)

With that in mind, how can the code snippet above trigger a removal from
the epoll set?

- Davide
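
The six points above can be checked with a short reproducer of the kind asked for in this thread. The following is a sketch under stated assumptions, not code posted by any participant: a pipe takes the place of "fd 5", and child_close_keeps_parent_registration is a hypothetical name. The parent registers the read end, the child explicitly close()s its inherited copy, and the parent then checks whether it is still notified.

```c
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the parent still receives an event for the pipe after the
 * child has closed its inherited copy, 0 if not, -1 on a setup error.
 * Error paths leak descriptors to keep the sketch short. */
int child_close_keeps_parent_registration(void)
{
        int pfd[2];
        if (pipe(pfd) == -1)
                return -1;
        int efd = epoll_create(1);
        if (efd == -1)
                return -1;

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = pfd[0] };
        if (epoll_ctl(efd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
                return -1;

        pid_t pid = fork();
        if (pid == -1)
                return -1;
        if (pid == 0) {
                close(pfd[0]);          /* explicit close() in the child */
                _exit(0);
        }
        if (waitpid(pid, NULL, 0) == -1) /* child has closed and exited */
                return -1;

        if (write(pfd[1], "x", 1) != 1)  /* make the read end readable */
                return -1;

        struct epoll_event out;
        int n = epoll_wait(efd, &out, 1, 1000);
        int ok = (n == 1 && out.data.fd == pfd[0]);

        close(efd);
        close(pfd[0]);
        close(pfd[1]);
        return ok;
}
```

If the use-count model in points 1) to 6) holds, the function returns 1: the child's close() only drops one reference, so the parent's registration stays in place.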

Willy Tarreau

Oct 27, 2007, 1:48:26 PM

to Davide Libenzi, Marc Lehmann, Eric Dumazet, Linux Kernel Mailing List

On Sat, Oct 27, 2007 at 09:59:07AM -0700, Davide Libenzi wrote:
> On Sat, 27 Oct 2007, Marc Lehmann wrote:
>
> > > Please provide some code to illustrate one exact problem you have.
> >
> > // assume there is an open epoll set that listens for events on fd 5
> > if (fork () = 0)
> > {
> > close (5);
> > // fd 5 is now removed from the epoll set of the parent.
> > _exit (0);
> > }
>
> Hmmm ... what? I assume you know that:
>
> 1) A file descriptor is a userspace view/handle of a kernel object
>
> 2) The kernel object has a use-count for as many file descriptors that
> have been handed out to userspace
>
> 3) A close() decreases the internal counter by one
>
> 4) The kernel object gets effectively closed when the internal counter
> goes to zero
>
> 5) A fork() acts as a dup() on the file descriptors by hence bumping up
> its internal counter
>
> 6) Epoll removes the file from the set, when the *kernel* object gets
> closed (internal use-count goes to zero)
>
> With that in mind, how can the code snippet above trigger a removal from
> the epoll set?

Davide,

from what I understand, Marc is not asking for the code above to remove
the fd from the epoll set, but he's in fact complaining that he *observed*
that the fd was removed from the epoll set in the *parent* process when
the child closes it, which is of course not expected at all. Strange as
it looks, this might need investigation. It is possible that there
is some strange bug somewhere in some kernel versions.

Marc, I think that if you indicate the last kernel version on which you
observed this and provide a very short and easy reproducer, it would
help everyone investigating this. Basically something which reports "OK"
or "KO".

Regards,
Willy

Davide Libenzi

Oct 27, 2007, 2:01:40 PM

to Willy Tarreau, Marc Lehmann, Eric Dumazet, Linux Kernel Mailing List

That would be *really* strange, since epoll hooks into __fput() in order to
perform proper cleanup. If the removal happened in the case above, it would
mean that the file was really closed in the parent too. That, I think, would
trigger way more serious problems in userspace.

> Marc, I think that if you indicate the last kernel version on which you
> observed this and provide a very short and easy reproducer, it would
> help everyone investigating this. Basically something which reports "OK"
> or "KO".

Of course. That'd be great.

- Davide

David Schwartz

Oct 28, 2007, 12:47:49 AM

to Linux-Kernel@Vger. Kernel. Org

> 6) Epoll removes the file from the set, when the *kernel* object gets
> closed (internal use-count goes to zero)
>
> With that in mind, how can the code snippet above trigger a removal from
> the epoll set?

I don't see how that can be. Suppose I add fd 8 to an epoll set. Suppose fd
5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
since there no longer is an fd 8? Events on files registered for epoll
notification are reported by descriptor, so the set membership has to be
associated (as reflected into userspace) with the descriptor, not the file.

For example, consider:

1) Process creates an epoll set, the set gets fd 4.

2) Process creates a socket, it gets fd 5.

3) The process adds fd 5 to set 4.

4) The process forks.

5) The child inherits the epoll set but not the socket.

Here the kernel cannot quite do the right thing. Ideally, the parent would
still have fd 5 in its version of the epoll set. After all, it has not
closed fd 5. However, the child *cannot* see fd 5 in its version of the
epoll set since it has no fd 5. An event reported for fd 5 would be
nonsense.

So it seems the kernel either has to break one of these "would/cannot"
requirements, or it has to split the epoll set in two. However, splitting
the set into two sets is clearly wrong since the processes should share it.

Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?
A6 Yes.

Note that this talks of the close of an "fd", not a file. The 'close'
function in fact closes an fd, as that fd is then reusable. So it sounds
like the problem above is solved by removing the fd from the set, but in
practice this doesn't happen. I have programs that call 'close' between
'fork' and 'exec' and do not see the socket removed from the epoll set.

DS

Eric Dumazet

Oct 28, 2007, 5:34:41 AM

to dav...@webmaster.com, Linux-Kernel@Vger. Kernel. Org

David Schwartz a écrit :

>> 6) Epoll removes the file from the set, when the *kernel* object gets
>> closed (internal use-count goes to zero)
>>
>> With that in mind, how can the code snippet above trigger a removal from
>> the epoll set?
>
> I don't see how that can be. Suppose I add fd 8 to an epoll set. Suppose fd
> 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
> since there no longer is an fd 8? Events on files registered for epoll
> notification are reported by descriptor, so the set membership has to be
> associated (as reflected into userspace) with the descriptor, not the file.

Events are not necessarily reported "by descriptor". epoll uses an opaque
field provided by the user.

It's up to the user to choose a tag that still makes sense if the user
app is playing dup()/close() games, for example.

typedef union epoll_data
{
        void *ptr;
        int fd;
        uint32_t u32;
        uint64_t u64;
} epoll_data_t;

It's true some applications are using the 'fd' field from epoll_data_t, but in
this case they should not play dup()/close() games that could change the
meaning of their 'epoll tags'. They would do better to use 'ptr/u64', for
example, to map the event to an application object. In this object they might
find the correct handle (fd) to communicate with the kernel for a given 'file'.
This handle could then be remapped to another handle using
dup()/fcntl()/close()...
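
A minimal sketch of that approach follows. The struct conn object and the ptr_tag_demo function are hypothetical and exist only for this illustration. The event is tagged with a pointer to the application object, the descriptor is then remapped with dup()/close(), and the event handler reads through whatever descriptor the object currently holds.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical application object: holds the current handle for a 'file'. */
struct conn {
        int fd;
};

/* Returns the byte read through the remapped descriptor, or -1 on error.
 * Error paths leak descriptors to keep the sketch short. */
int ptr_tag_demo(void)
{
        int pfd[2];
        if (pipe(pfd) == -1)
                return -1;
        int efd = epoll_create(1);
        if (efd == -1)
                return -1;

        struct conn c = { .fd = pfd[0] };
        /* tag the registration with the object, not with the fd number */
        struct epoll_event ev = { .events = EPOLLIN, .data.ptr = &c };
        if (epoll_ctl(efd, EPOLL_CTL_ADD, c.fd, &ev) == -1)
                return -1;

        /* remap the handle: same file, different descriptor number */
        int newfd = dup(c.fd);
        if (newfd == -1)
                return -1;
        close(c.fd);
        c.fd = newfd;

        if (write(pfd[1], "x", 1) != 1)
                return -1;

        struct epoll_event out;
        if (epoll_wait(efd, &out, 1, 1000) != 1)
                return -1;

        struct conn *hit = out.data.ptr;   /* recover the application object */
        char byte;
        if (read(hit->fd, &byte, 1) != 1)  /* use its current handle */
                return -1;

        close(hit->fd);
        close(pfd[1]);
        close(efd);
        return byte;
}
```

Because the tag is a pointer, the event stays meaningful after the remap. Had the tag been the original fd number, the handler would have tried to read from a descriptor that is already closed.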

>
> For example, consider:
>
> 1) Process creates an epoll set, the set gets fd 4.
>
> 2) Process creates a socket, it gets fd 5.
>
> 3) The process adds fd 5 to set 4.
>
> 4) The process forks.
>
> 5) The child inherits the epoll set but not the socket.
>
> Here the kernel cannot quite do the right thing. Ideally, the parent would
> still have fd 5 in its version of the epoll set. After all, it has not
> closed fd 5. However, the child *cannot* see fd 5 in its version of the
> epoll set since it has no fd 5. An event reported for fd 5 would be
> nonsense.

Yes, it would be nonsense for the child to still try to get events from the
epoll set while it cannot possibly use the socket. If you use the 'ptr' field
to retrieve an object, this object probably would have no meaning in the child
anyway, especially after an exec() syscall.

That kind of user error can also happen with select()/poll(), if you do for
example:

FD_ZERO(&fdset);
FD_SET(fd, &fdset);
select(fd + 1, &fdset, NULL, NULL, NULL);
newfd = dup(fd);
close(fd);
for (i = 0; i < maxfd; i++)
        if (FD_ISSET(i, &fdset))
                read(i, ...)

Davide Libenzi

Oct 28, 2007, 2:49:40 PM

to David Schwartz, Linux-Kernel@Vger. Kernel. Org, Eric Dumazet

On Sat, 27 Oct 2007, David Schwartz wrote:

> I don't see how that can be. Suppose I add fd 8 to an epoll set. Suppose fd
> 5 is a dup of fd 8. Now, I close fd 8. How can fd 8 remain in my epoll set,
> since there no longer is an fd 8? Events on files registered for epoll
> notification are reported by descriptor, so the set membership has to be
> associated (as reflected into userspace) with the descriptor, not the file.

Eric already answered your question (epoll deals with internal kernel
objects - aka file*).
I just want to answer this one for another reason. WTF is wrong with all
of you Cc-list-trimmers?
Could you *please* stop trimming Cc-lists?

- Davide

David Schwartz

Oct 28, 2007, 5:05:20 PM

to da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org

Eric Dumazet wrote:

> Events are not necessarly reported "by descriptors". epoll uses an opaque
> field provided by the user.
>
> It's up to the user to properly chose a tag that will makes sense
> if the user
> app is playing dup()/close() games for example.

Great. So the only issue then is that the documentation is confusing. It
frequently uses the term "fd" where it means file. For example, it says:

Q1 What happens if you add the same fd to an epoll_set twice?

A1 You will probably get EEXIST. However, it is possible that two
threads may add the same fd twice. This is a harmless condition.

This gives no reason to think there's anything wrong with adding the same
file twice so long as you do so through different descriptors. (One can
imagine an application that does this to segregate read and write operations
to avoid a race where the descriptor is closed from under a writer due to
handling a fatal read error.) Obviously, that won't work.

And this part:

Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?

A6 Yes.

This is incorrect. Closing an fd will not cause it to be removed from all
epoll sets automatically. Only closing a file will. This is what caused the
OP's confusion, and it is at best imprecise and, at worst, flat out wrong.

DS

PS: It is customary to trim individuals off of CC lists when replying to a
list when the subject matter of the post is squarely inside the subject of
the list. If the person CC'd was interested in the list's subject, he or she
would presumably subscribe to the list. Not everyone wants two copies of
every post. Not everyone wants a personal copy of every sub-thread that
results from a post they make. In the past few years, I've received
approximately an equal number of complaints about trimming CC's on posts to
LKML and not trimming CC's on such posts.

Davide Libenzi

Oct 29, 2007, 2:56:44 PM

to David Schwartz, da...@cosmosbay.com, Michael Kerrisk, Linux-Kernel@Vger. Kernel. Org

On Sun, 28 Oct 2007, David Schwartz wrote:

>
> Eric Dumazet wrote:
>
> > Events are not necessarly reported "by descriptors". epoll uses an opaque
> > field provided by the user.
> >
> > It's up to the user to properly chose a tag that will makes sense
> > if the user
> > app is playing dup()/close() games for example.
>
> Great. So the only issue then is that the documentation is confusing. It
> frequently uses the term "fd" where it means file. For example, it says:
>
> Q1 What happens if you add the same fd to an
> epoll_set
> twice?
>
> A1 You will probably get EEXIST. However, it is
> possible
> that two threads may add the same fd twice. This is
> a
> harmless condition.
>
> This gives no reason to think there's anything wrong with adding the same
> file twice so long as you do so through different descriptors. (One can
> imagine an application that does this to segregate read and write operations
> to avoid a race where the descriptor is closed from under a writer due to
> handling a fatal read error.) Obviously, that won't work.

I agree, that is confusing. However, you can safely add two different file
descriptors pointing to the same file*, with different event masks, and
that will work as expected.
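
As an illustration of that last point, here is a hedged sketch that registers two descriptors for the same socket, one for EPOLLIN and one for EPOLLOUT. The socketpair setup and the name two_fds_one_file_demo are assumptions made for this example. It counts the ready events before and after the peer sends a byte.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Registers sv[0] for reading and a dup() of it for writing, then stores the
 * number of ready events before and after the peer writes one byte.
 * Returns 0 on success, -1 on error (error paths leak descriptors). */
int two_fds_one_file_demo(int *ready_before, int *ready_after)
{
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
                return -1;
        int wfd = dup(sv[0]);             /* second descriptor, same file */
        int efd = epoll_create(1);
        if (wfd == -1 || efd == -1)
                return -1;

        struct epoll_event in  = { .events = EPOLLIN,  .data.fd = sv[0] };
        struct epoll_event out = { .events = EPOLLOUT, .data.fd = wfd };
        if (epoll_ctl(efd, EPOLL_CTL_ADD, sv[0], &in) == -1 ||
            epoll_ctl(efd, EPOLL_CTL_ADD, wfd, &out) == -1)
                return -1;

        struct epoll_event got[2];
        /* nothing to read yet: only the EPOLLOUT registration is ready */
        *ready_before = epoll_wait(efd, got, 2, 0);

        if (write(sv[1], "x", 1) != 1)    /* peer sends one byte */
                return -1;
        /* now both registrations are ready */
        *ready_after = epoll_wait(efd, got, 2, 0);

        close(wfd);
        close(sv[0]);
        close(sv[1]);
        close(efd);
        return 0;
}
```

Adding the same descriptor twice would fail with EEXIST, as the man page says. The two registrations here coexist because each one uses a different descriptor number for the same file.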

> And this part:
>
> Q6 Will the close of an fd cause it to be removed from
> all
> epoll sets automatically?
>
> A6 Yes.
>
> This is incorrect. Closing an fd will not cause it to be removed from all
> epoll sets automatically. Only closing a file will. This is what caused the
> OP's confusion, and it is at best imprecise and, at worst, flat out wrong.

OTOH you cannot list *every* possible scenario in a man page, otherwise
you end up writing a book instead of a man page. I will try to find some
time with Michael to refine the man page.

> PS: It is customary to trim individuals off of CC lists when replying to a
> list when the subject matter of the post is squarely inside the subject of
> the list. If the person CC'd was interested in the list's subject, he or she
> would presumably subscribe to the list. Not everyone wants two copies of
> every post. Not everyone wants a personal copy of every sub-thread that
> results from a post they make. In the past few years, I've received
> approximately an equal number of complaints about trimming CC's on posts to
> LKML and not trimming CC's on such posts.

Does anyone who in 2007 still has not managed to find a way to avoid dups
hitting his mailbox deserve any consideration at all?
OTOH many people, like myself, use the To and Cc headers to direct email to
proper folders, where they are treated with a different level of
attention. And your strip-all-headers mania screws that up badly.

- Davide

Mark Lord

Oct 29, 2007, 6:36:59 PM

to Willy Tarreau, Davide Libenzi, Marc Lehmann, Eric Dumazet, Linux Kernel Mailing List

Willy Tarreau wrote:
>> On Sat, 27 Oct 2007, Marc Lehmann wrote:
>>
>>>> Please provide some code to illustrate one exact problem you have.
>>> // assume there is an open epoll set that listens for events on fd 5
>>> if (fork () == 0)
>>> {
>>> close (5);
>>> // fd 5 is now removed from the epoll set of the parent.
>>> _exit (0);
>>> }

.

> from what I understand, Marc is not asking for the code above to remove
> the fd from the epoll set, but he's in fact complaining that he *observed*
> that the fd was removed from the epoll set in the *parent* process when
> the child closes it, which is of course not expected at all. As strange
> as it looks like, this might need investigation. It is possible that there
> is some strange bug somewhere in some kernel versions.
>
> Marc, I think that if you indicate the last kernel version on which you
> observed this and provide a very short and easy reproducer, it would
> help everyone investigating this. Basically something which reports "OK"
> or "KO".

That's how I read it, too.
So basically, a program like this, perhaps.
Except that, here running 2.6.23.1, it works just fine (no removal bug).

#include <sys/epoll.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>

static int del_from_epoll_set (int efd, int fd, const char *msg)
{
    struct epoll_event e;

    memset(&e, 0, sizeof(e));
    e.data.fd = fd;
    if (epoll_ctl(efd, EPOLL_CTL_DEL, fd, &e) == -1) {
        int err = errno;
        fprintf(stderr, "epoll_ctl(DEL) failed (%s): %s\n", msg, strerror(err));
        return -1;
    }
    return 0;
}

static int add_to_epoll_set (int efd, int fd, uint32_t events, const char *msg)
{
    struct epoll_event e;

    memset(&e, 0, sizeof(e));
    e.events = events;
    e.data.fd = fd;
    if (epoll_ctl(efd, EPOLL_CTL_ADD, fd, &e) == -1) {
        int err = errno;
        fprintf(stderr, "epoll_ctl(ADD) failed (%s): %s\n", msg, strerror(err));
        return -1;
    }
    return 0;
}

int main (int argc, char **argv)
{
    int efd, sd, fds[2];
    pid_t cpid;

    if (pipe(fds) == -1) {
        perror("pipe()");
        exit(1);
    }
    sd = socket(PF_INET, SOCK_STREAM, 0);
    if (sd == -1) {
        perror("socket");
        exit(1);
    }

    efd = epoll_create(5);
    if (efd == -1) {
        perror("epoll_create");
        exit(1);
    }

    if (add_to_epoll_set(efd, fileno(stdin), EPOLLIN, "stdin"))
        exit(1);
    if (add_to_epoll_set(efd, fileno(stdout), EPOLLOUT, "stdout"))
        exit(1);
    if (add_to_epoll_set(efd, fds[0], EPOLLIN, "pipe_read"))
        exit(1);
    if (add_to_epoll_set(efd, fds[1], EPOLLOUT, "pipe_write"))
        exit(1);
    if (add_to_epoll_set(efd, sd, EPOLLIN|EPOLLOUT, "socket"))
        exit(1);

    // the epoll set now listens on stdin, stdout, both pipe ends and the socket

    cpid = fork();
    if (cpid == 0) {
        // child: close every registered fd, then exit
        close(fileno(stdin));
        close(fileno(stdout));
        close(fds[0]);
        close(fds[1]);
        close(sd);
        exit(0);
    }
    waitpid(cpid, NULL, 0);

    // now test whether the fd's are still in the epoll set:
    // each ADD is expected to fail with EEXIST if the fd is still registered
    add_to_epoll_set(efd, sd, EPOLLIN|EPOLLOUT, "sd");
    add_to_epoll_set(efd, fds[0], EPOLLIN, "fds[0]");
    add_to_epoll_set(efd, fds[1], EPOLLOUT, "fds[1]");
    add_to_epoll_set(efd, fileno(stdin), EPOLLIN, "fileno(stdin)");
    add_to_epoll_set(efd, fileno(stdout), EPOLLOUT, "fileno(stdout)");

    // each DEL is expected to succeed if the fd is still registered
    del_from_epoll_set(efd, sd, "sd");
    del_from_epoll_set(efd, fds[0], "fds[0]");
    del_from_epoll_set(efd, fds[1], "fds[1]");
    del_from_epoll_set(efd, fileno(stdin), "fileno(stdin)");
    del_from_epoll_set(efd, fileno(stdout), "fileno(stdout)");

    printf("Done.\n");
    exit(0);
}

Michael Kerrisk

Feb 26, 2008, 10:15:27 AM

to Davide Libenzi, David Schwartz, da...@cosmosbay.com, Chris "ク" Heath, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

So can I summarize what I understand:

a) Adding the same file descriptor twice to an epoll set will cause an
error (EEXIST).

b) In a separate message to linux-man, Chris Heath says that two threads
*can't* add the same fd twice to an epoll set, despite what the existing
man page text says. I haven't tested that, but it sounds to me as though
it is likely to be true. Can you comment please Davide?

c) It is possible to add duplicated file descriptors referring to the same
underlying open file description ("file *"). As you note, this can be a
useful filtering technique, if the two file descriptors specify different
masks.

Assuming that is all correct, for man-pages-2.79, I've reworked the text
for Q1/A1 as follows:

Q1 What happens if you add the same file descriptor
to an epoll set twice?

A1 You will probably get EEXIST. However, it is pos-
sible to add a duplicate (dup(2), dup2(2),
fcntl(2) F_DUPFD, fork(2)) descriptor to the same
epoll set. This can be a useful technique for
filtering events, if the duplicate file descrip-
tors are registered with different events masks.

Seem okay Davide?

Cheers,

Michael

PS I've trimmed the part of this thread about Q6/A6, since I dealt with
that in another thread ("epoll and shared fd's").

--
Michael Kerrisk
Maintainer of the Linux man-pages project
http://www.kernel.org/doc/man-pages/
Want to report a man-pages bug? Look here:
http://www.kernel.org/doc/man-pages/reporting_bugs.html


Davide Libenzi

Feb 26, 2008, 1:52:38 PM

to Michael Kerrisk, David Schwartz, da...@cosmosbay.com, Chris "ク" Heath, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

Yes.

> b) In a separate message to linux-man, Chris Heath says that two threads
> *can't* add the same fd twice to an epoll set, despite what the existing
> man page text says. I haven't tested that, but it sounds to me as though
> it is likely to be true. Can you comment please Davide?

Yes, you can't add the same fd twice. Think about a DB where "file*,fd" is
the key.

> c) It is possible to add duplicated file descriptors referring to the same
> underlying open file description ("file *"). As you note, this can be a
> useful filtering technique, if the two file descriptors specify different
> masks.
>
> Assuming that is all correct, for man-pages-2.79, I've reworked the text
> for Q1/A1 as follows:
>
> Q1 What happens if you add the same file descriptor
> to an epoll set twice?
>
> A1 You will probably get EEXIST. However, it is pos-
> sible to add a duplicate (dup(2), dup2(2),
> fcntl(2) F_DUPFD, fork(2)) descriptor to the same
> epoll set. This can be a useful technique for
> filtering events, if the duplicate file descrip-
> tors are registered with different events masks.
>
> Seem okay Davide?

Looks sane to me.

- Davide

Chris "ク"

Feb 26, 2008, 8:40:30 PM

to Davide Libenzi, Michael Kerrisk, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

To clarify, the key appears to be file* plus the user-space integer that
represents the fd.

> > c) It is possible to add duplicated file descriptors referring to the same
> > underlying open file description ("file *"). As you note, this can be a
> > useful filtering technique, if the two file descriptors specify different
> > masks.
> >
> > Assuming that is all correct, for man-pages-2.79, I've reworked the text
> > for Q1/A1 as follows:
> >
> > Q1 What happens if you add the same file descriptor
> > to an epoll set twice?
> >
> > A1 You will probably get EEXIST. However, it is pos-
> > sible to add a duplicate (dup(2), dup2(2),
> > fcntl(2) F_DUPFD, fork(2)) descriptor to the same
> > epoll set. This can be a useful technique for
> > filtering events, if the duplicate file descrip-
> > tors are registered with different events masks.
> >
> > Seem okay Davide?
>
> Looks sane to me.

I think fork(2) should not be in the above list. fork(2) duplicates the
kernel's fd, but the user-space integer that represents the fd remains
the same, so you will get EEXIST if you try to add the fd that was
duplicated by fork.

Chris

Davide Libenzi

Feb 27, 2008, 2:35:44 PM

to Chris "ク", Michael Kerrisk, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

On Tue, 26 Feb 2008, Chris "ク" Heath wrote:

> On Tue, 2008-02-26 at 10:51 -0800, Davide Libenzi wrote:
> >
> > Yes, you can't add the same fd twice. Think about a DB where "file*,fd" is
> > the key.
>
> To clarify, the key appears to be file* plus the user-space integer that
> represents the fd.

Yes, that's what I said.

> > > c) It is possible to add duplicated file descriptors referring to the same
> > > underlying open file description ("file *"). As you note, this can be a
> > > useful filtering technique, if the two file descriptors specify different
> > > masks.
> > >
> > > Assuming that is all correct, for man-pages-2.79, I've reworked the text
> > > for Q1/A1 as follows:
> > >
> > > Q1 What happens if you add the same file descriptor
> > > to an epoll set twice?
> > >
> > > A1 You will probably get EEXIST. However, it is pos-
> > > sible to add a duplicate (dup(2), dup2(2),
> > > fcntl(2) F_DUPFD, fork(2)) descriptor to the same
> > > epoll set. This can be a useful technique for
> > > filtering events, if the duplicate file descrip-
> > > tors are registered with different events masks.
> > >
> > > Seem okay Davide?
> >
> > Looks sane to me.
>
> I think fork(2) should not be in the above list. fork(2) duplicates the
> kernel's fd, but the user-space integer that represents the fd remains
> the same, so you will get EEXIST if you try to add the fd that was
> duplicated by fork.

Good catch, fork(2) should not be there.

- Davide

Michael Kerrisk

Feb 28, 2008, 8:13:44 AM

to Davide Libenzi, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

Okay -- removed.

But it is an ugly inconsistency. On the one hand, a child process
cannot add the duplicate file descriptor to the epoll set. (In every
other case that I can think of, descriptors duplicated by fork have
similar semantics to descriptors duplicated by dup() and friends.) On
the other hand, the very fact that the child has a duplicate of the
descriptor means that even if the parent closes its descriptor, then
epoll_wait() in the parent will continue to receive notifications for
that descriptor because of the duplicated descriptor in the child.

The choice of [file *, fd] as the key for epoll sets really does seem
unfortunate. Keying on [pid, fd] would have given saner semantics, it
seems to me. Obviously it can't be changed now though.

Cheers,

Michael


Michael Kerrisk

Feb 28, 2008, 8:24:14 AM

to Davide Libenzi, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

Davide,

with the earlier discussion in this thread in mind, I added a Q0/A0 to
epoll.7, just to make the point about keys clear:

Q0 What is the key used to distinguish the file descrip-
tors in an epoll set?

A0 The key is the combination of the file descriptor
number and the open file description (also known as
"open file handle", the kernel's internal representa-
tion of an open file).

Does that seem okay?

Davide Libenzi

Feb 28, 2008, 2:24:03 PM

to Michael Kerrisk, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

On Thu, 28 Feb 2008, Michael Kerrisk wrote:

> But it is an ugly inconsistency. On the one hand, a child process
> cannot add the duplicate file descriptor to the epoll set. (In every
> other case that I can think of, descriptors duplicated by fork have
> similar semantics to descriptors duplicated by dup() and friends.) On
> the other hand, the very fact that the child has a duplicate of the
> descriptor means that even if the parent closes its descriptor, then
> epoll_wait() in the parent will continue to receive notifications for
> that descriptor because of the duplicated descriptor in the child.

Have you ever tried to think what it means for different *processes*
sharing a single epoll fd and doing epoll_wait() over it?
Most common case is a single event fetch thread plus dispatch. Going to
epoll_wait() over a single epoll fd from many *threads* is very much
possible, but requires care (news at 11, system software development
requires care too).
Sharing a single epoll fd (by the means of any process sharing it doing
add/wait) from different *processes* makes almost no sense at all.

"a child process cannot add the duplicate file descriptor to the epoll
set" ... how do you expect the parent (that doesn't even have the new fd
mapped) to react to such events?
If the next question is "But then why we made the epoll fd inheritable?",
the answer is, because it makes sense in many cases for a parent to hand
over an fd set to a child.

> The choice of [file *, fd] as the key for epoll sets really does seem
> unfortunate. Keying on [pid, fd] would have given saner semantics, it
> seems to me. Obviously it can't be changed now though.

I think we already went over this, and I think I clearly explained to you
the reasons for not hooking into sys_close.

- Davide

Davide Libenzi

Feb 28, 2008, 2:34:55 PM

to Michael Kerrisk, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

On Thu, 28 Feb 2008, Michael Kerrisk wrote:

> Davide,
>
> with the earlier discussion in this thread in mind, I added a Q0/A0 to
> epoll.7, just to make the point about keys clear:
>
>
> Q0 What is the key used to distinguish the file descrip-
> tors in an epoll set?
>
> A0 The key is the combination of the file descriptor
> number and the open file description (also known as
> "open file handle", the kernel's internal representa-
> tion of an open file).
>
> Does that seem okay?

Looks fine to me! We need to clarify better Q6 WRT fork().

- Davide

Michael Kerrisk

Feb 29, 2008, 10:47:16 AM

to Davide Libenzi, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

Hi Davide,

On Thu, Feb 28, 2008 at 8:23 PM, Davide Libenzi <dav...@xmailserver.org> wrote:
> On Thu, 28 Feb 2008, Michael Kerrisk wrote:
>
> > But it is an ugly inconsistency. On the one hand, a child process
> > cannot add the duplicate file descriptor to the epoll set. (In every
> > other case that I can think of, descriptors duplicated by fork have
> > similar semantics to descriptors duplicated by dup() and friends.) On
> > the other hand, the very fact that the child has a duplicate of the
> > descriptor means that even if the parent closes its descriptor, then
> > epoll_wait() in the parent will continue to receive notifications for
> > that descriptor because of the duplicated descriptor in the child.
>
> Have you ever tried to think what it means for different *processes*
> sharing a single epoll fd and doing epoll_wait() over it?

As I think is clear, I've only given it very limited thought ;-).

The point is that the existing implementation actually supports
"different *processes* sharing a single epoll fd and doing
epoll_wait() over it", but the semantics are unintuitive. It may be
that the existing implementation was the best way of doing things.
But when I see the strange corner cases in the semantics, I can't help
but wonder (way too late), whether there might have been some other
way of implementing things that led to more intuitive semantics.

> Most common case is a single event fetch thread plus dispatch. Going to
> epoll_wait() over a single epoll fd from many *threads* is very much
> possible, but requires care (news at 11, system software development
> requires care too).
> Sharing a single epoll fd (by the means of any process sharing it doing
> add/wait) from different *processes* makes almost no sense at all.
>
> "a child process cannot add the duplicate file descriptor to the epoll
> set" ... how do you expect the parent (that doesn't even have the new fd
> mapped) to react to such events?

(Not sure if you missed my meaning here. Of course the parent already
has the fd mapped; it's the fd that the child inherited. Anyway, my
real point was that while descriptors duplicated by fork() are
normally semantically similar to other duplicated descriptors, when it
comes to epoll they are not -- and that has the potential to surprise
users.)

> If the next question is "But then why we made the epoll fd inheritable?",
> the answer is, because it makes sense in many cases for a parent to hand
> over an fd set to a child.

Fair enough.

So here's an idea about how things might alternatively have been done:

a) The key for epoll entries could have been [file *, fd, PID]

b) an epoll_wait() only returns events for fds where the PID matches that
of the caller.

c) a close of a file descriptor removes the corresponding [file *,
fd, PID] from the epoll set.

d) when a fork() is done, then the epoll set has a new set of keys
added. These are duplicates of the [file *, fd, PID] entries for the
parent, but with the PID of the child substituted into the new keys.
Say the parent had PID 1000, and the child has PID 2000. If the epoll
set initially contained:

[X, 3, 1000]
[Y, 4, 1000]

then after fork() we'd have:

[X, 3, 1000]
[Y, 4, 1000]
[X, 3, 2000]
[Y, 4, 2000]

There is of course room for debate about the efficiency of this
approach, I suppose. But it seems to me (and perhaps I've missed a
number of things) that that could have given sane semantics with
respect to fork(), duplicated descriptors, and close(). Furthermore,
it would have allowed us to sanely support "different *processes*
sharing a single epoll fd and doing epoll_wait() over it".

Of course, this is all academic now: we can't change the ABI.

> > The choice of [file *, fd] as the key for epoll sets really does seem
> > unfortunate. Keying on [pid, fd] would have given saner semantics, it
> > seems to me. Obviously it can't be changed now though.
>
> I think we already went over this, and I think I clearly explained to you
> the reasons for not hooking into sys_close.

You said elsewhere:

[[
That'd mean placing an eventpoll custom hook into sys_close(). Looks very
bad to me, and probably will look even worse to other kernel folks.
Is not much a performance issue (a check to see if a file* is an eventpoll
file is as easy as comparing the f_op pointer), but a design/style issue.
]]

But that wasn't very clear to me actually. I note that filp_close()
already has special case handling for dnotify (R.I.P.) and fcntl()
(aka POSIX) file locks, so there was already precedent for a custom
hook, AFAICS, and epoll is at least as worthy of special treatment as
either of those cases.

Cheers,

Michael


Davide Libenzi

Feb 29, 2008, 2:20:11 PM

to Michael Kerrisk, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

On Fri, 29 Feb 2008, Michael Kerrisk wrote:

> As I think is clear, I've only given it very limited thought ;-).
>
> The point is that the existing implementation actually supports
> "different *processes* sharing a single epoll fd and doing
> epoll_wait() over it", but the semantics are unintuitive. It may be
> that the existing implementation was the best way of doing things.
> But when I see the strange corner cases in the semantics, I can't help
> but wonder (way too late), whether there might have been some other
> way of implementing things that led to more intuitive semantics.

Oh boy. The fact that you can have an epoll fd cross the fork boundary,
does not mean that any indiscriminate use of it leads to sane results:

efd = epoll_create();
fork();
pipe(fds);
epoll_ctl(efd, ADD, fds[0]);
epoll_wait(); ????
...
pipe(fds);
epoll_ctl(efd, ADD, fds[0]);
epoll_wait(); ????

It is *NOT* a matter of semantics.

> > If the next question is "But then why we made the epoll fd inheritable?",
> > the answer is, because it makes sense in many cases for a parent to hand
> > over an fd set to a child.
>
> Fair enough.
>
> So here's an idea about how things might alternatively have been done:
>
> a) The key for epoll entries could have been [file *, fd, PID]
>
> b) an epoll_wait() only returns events for fds where the PID matches that
> of the caller.
>
> c) a close of a file descriptor removes the corresponding [file *,
> fd, PID] from the epoll set.
>
> d) when a fork() is done, then the epoll set has a new set of keys
> added. These are duplicates of the [file *, fd, PID] entries for the
> parent, but with the PID of the child substituted into the new keys.
> Say the parent had PID 1000, and the child has PID 2000. If the epoll
> set initially contained:
>
> [X, 3, 1000]
> [Y, 4, 1000]
>
> then after fork() we'd have:
>
> [X, 3, 1000]
> [Y, 4, 1000]
> [X, 3, 2000]
> [Y, 4, 2000]
>
> There is of course room for debate about the efficiency of this
> approach, I suppose.

There sure is :)

> You said elsewhere:
>
> [[
> That'd mean placing an eventpoll custom hook into sys_close(). Looks very
> bad to me, and probably will look even worse to other kernel folks.
> Is not much a performance issue (a check to see if a file* is an eventpoll
> file is as easy as comparing the f_op pointer), but a design/style issue.
> ]]
>
> But that wasn't very clear to me actually. I note that filp_close()
> already has special case handling for dnotify (R.I.P.) and fcntl()
> (aka POSIX) file locks, so there was already precedent for a custom
> hook, AFAICS, and epoll is at least as worthy of special treatment as
> either of those cases.

I guess that over the time, Al became software WRT junk going there :)

- Davide

Michael Kerrisk

Feb 29, 2008, 2:55:42 PM

to Davide Libenzi, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel. Org, linu...@vger.kernel.org

On Fri, Feb 29, 2008 at 8:19 PM, Davide Libenzi <dav...@xmailserver.org> wrote:
> On Fri, 29 Feb 2008, Michael Kerrisk wrote:
>
> > As I think is clear, I've only given it very limited thought ;-).
> >
> > The point is that the existing implementation actually supports
> > "different *processes* sharing a single epoll fd and doing
> > epoll_wait() over it", but the semantics are unintuitive. It may be
> > that the existing implementation was the best way of doing things.
> > But when I see the strange corner cases in the semantics, I can't help
> > but wonder (way too late), whether there might have been some other
> > way of implementing things that led to more intuitive semantics.
>
> Oh boy. The fact that you can have an epoll fd cross the fork boundary,
> does not mean that any indiscriminate use of it leads to sane results:

I didn't mean that it did. Certainly in the current implementation
there will be insane situations ;-).

> efd = epoll_create();
> fork();
> pipe(fds);
> epoll_ctl(efd, ADD, fds[0]);
> epoll_wait(); ????
> ...
> pipe(fds);
> epoll_ctl(efd, ADD, fds[0]);
> epoll_wait(); ????
>
>
> It is *NOT* a matter of semantics.

Of course -- but I don't think I suggested that I disagree on this.

Okay -- but I suspect it could have been made fairly efficient.

> > You said elsewhere:
> >
> > [[
> > That'd mean placing an eventpoll custom hook into sys_close(). Looks very
> > bad to me, and probably will look even worse to other kernel folks.
> > Is not much a performance issue (a check to see if a file* is an eventpoll
> > file is as easy as comparing the f_op pointer), but a design/style issue.
> > ]]
> >
> > But that wasn't very clear to me actually. I note that filp_close()
> > already has special case handling for dnotify (R.I.P.) and fcntl()
> > (aka POSIX) file locks, so there was already precedent for a custom
> > hook, AFAICS, and epoll is at least as worthy of special treatment as
> > either of those cases.
>
> I guess that over the time, Al became software WRT junk going there :)

Sorry -- I don't understand that last sentence?

Sam Varshavchik

Mar 2, 2008, 10:17:49 AM

to linu...@vger.kernel.org, Michael Kerrisk, Davide Libenzi, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel.

Hijacking this epoll thread, the following related question occurs to me:

#Q8
# Does an operation on a file descriptor affect the already collected but
#not yet reported events?
#
#A8
# You can do two operations on an existing file descriptor. Remove would
#be meaningless for this case. Modify will re-read available I/O.

Why is EPOLL_CTL_DEL considered meaningless? A process is wrapping up its
business and is preparing to remove the fd from the epoll set, and then
close the file descriptor itself. In the meantime, the fd became readable,
and a POLLIN event gets collected. So, what happens to the collected event,
when the EPOLL_CTL_DEL operation is made?

Davide Libenzi

Mar 2, 2008, 4:45:35 PM

to Sam Varshavchik, linu...@vger.kernel.org, Michael Kerrisk, Chris "ク" Heath, David Schwartz, da...@cosmosbay.com, Linux-Kernel@Vger. Kernel.

The event will show up in any epoll_wait() done after the POLLIN and before
the EPOLL_CTL_DEL. After the EPOLL_CTL_DEL, of course, no events will be
reported.

- Davide