epoll and timeouts

Arkadiy

Nov 16, 2007, 2:51:38 PM

Hi all,

I am trying to figure out how to use epoll, and the part that I am
missing is how to set timeouts for individual sockets. I can see that
the timeout can be set in the epoll_wait, but it doesn't seem good
enough since different sockets are added to the epoll at different
times. Also I can't see anything among events that would be associated
with the timeout.

So where do I look next?

Thanks in advance for any advice.

Regards,
Arkadiy

fjb...@yahoo.com

Nov 16, 2007, 8:10:53 PM

On Nov 16, 11:51 am, Arkadiy <vertl...@gmail.com> wrote:
> Hi all,
>
> I am trying to figure out how to use epoll, and the part that I am
> missing is how to set timeouts for individual sockets. I can see that
> the timeout can be set in the epoll_wait, but it doesn't seem good
> enough since different sockets are added to the epoll at different
> times. Also I can't see anything among events that would be associated
> with the timeout.

So, for example, you have a socket connecting you to each of several
clients, and if any client doesn't send data for 10 seconds, you
disconnect them.

You would have to keep track of it yourself. For example, you could
have a variable for each client that contains the last time at which
it sent data. Before calling epoll_wait, determine how long before
the *first* client times out, and use that as the timeout for
epoll_wait. Then if epoll_wait times out, disconnect that client,
recompute the timeout, and continue. select() or poll() would be the
same.
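A minimal sketch of the bookkeeping described above, assuming each client record carries a deadline (last activity plus the idle limit). The Client struct and the next_timeout_ms helper are hypothetical names used only for illustration; the result is meant to be passed as the timeout argument of epoll_wait(), where -1 means "block indefinitely".

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <climits>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical per-client record; only the deadline matters here.
struct Client {
    int fd;
    Clock::time_point deadline;  // last activity + idle limit
};

// Timeout in milliseconds for the next epoll_wait() call:
// -1 if there are no clients, 0 if a deadline has already passed,
// otherwise the time left until the earliest deadline.
int next_timeout_ms(const std::vector<Client>& clients, Clock::time_point now) {
    if (clients.empty()) return -1;
    auto earliest = std::min_element(
        clients.begin(), clients.end(),
        [](const Client& a, const Client& b) { return a.deadline < b.deadline; });
    auto remaining = earliest->deadline - now;
    if (remaining <= Clock::duration::zero()) return 0;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(remaining);
    // Round up so the wait never ends just before the deadline.
    if (ms < remaining) ms += std::chrono::milliseconds(1);
    assert(ms.count() > 0);
    return static_cast<int>(std::min<long long>(ms.count(), INT_MAX));
}
```

When epoll_wait() then returns 0 (timed out), the client with the earliest deadline is the one to disconnect before recomputing. A linear scan is fine for a handful of clients; a sorted container would be the usual choice for many.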

As is traditional, Unix provides the minimum amount of functionality
in system calls. Since a single timeout is enough to let you
implement more complicated schemes, that's what they provide.

Arkadiy

Nov 17, 2007, 11:11:55 AM

> So, for example, you have a socket connecting you to each of several
> clients, and if any client doesn't send data for 10 seconds, you
> disconnect them.

Actually the scenario is a little different.

To serve a client the server has to access some information over the
network, from yet another server. So I read the request, then get the
information, and then send the result to the client. Getting the
information from another server is the subject of my question. If it
can't be done within, let's say, 2 seconds, I need to take an alternative
route for serving the request.

> You would have to keep track of it yourself. For example, you could
> have a variable for each client that contains the last time at which
> it sent data. Before calling epoll_wait, determine how long before
> the *first* client times out, and use that as the timeout for
> epoll_wait. Then if epoll_wait times out, disconnect that client,
> recompute the timeout, and continue. select() or poll() would be the
> same.

OK, this sounds doable, although I can still see some problems with
this.

First, if I add a socket in the middle of epoll_wait, its timeout must
be at least as long as scheduled finish of the epoll_wait minus
current time. Otherwise, since the timeout is not an event, I will
find out about it only after the epoll_wait ends. This prevents me
from designing a more or less generic system where timeouts would be
provided at the time of operation.

Second, I see no means of iterating over epoll-added sockets (maybe I
am missing something). It's possible to store a void pointer with the
socket in the epoll, and I assume it should be the pointer to an
object that works with the socket (I use C++). But since I can't
iterate through these objects (only through ones where events fired)
it means that, in addition to adding sockets to epoll, I need to keep
a parallel list of sockets (with expiration time). This is doable,
but not very elegant :-(

> As is traditional, Unix provides the minimum amount of functionality
> in system calls. Since a single timeout is enough to let you
> implement more complicated schemes, that's what they provide.

Fair enough. But I would at least expect it to honor setsockopt with
SO_SNDTIMEO/SO_RCVTIMEO.

Regards,
Arkadiy

fjb...@yahoo.com

Nov 17, 2007, 4:47:40 PM

I don't understand what you mean by "adding a socket in the middle of
epoll_wait". While epoll_wait is running your program is blocked.
Unless you have multiple threads or something?

> Second, I see no means of iterating over epoll-added sockets (maybe I
> am missing something). It's possible to store a void pointer with the
> socket in the epoll, and I assume it should be the pointer to an
> object that works with the socket (I use C++). But since I can't
> iterate through these objects (only through ones where events fired)
> it means that, in addition to adding sockets to epoll, I need to keep
> a parallel list of sockets (with expiration time). This is doable,
> but not very elegant :-(

Nevertheless, that's the usual method, AFAIK.

> > As is traditional, Unix provides the minimum amount of functionality
> > in system calls. Since a single timeout is enough to let you
> > implement more complicated schemes, that's what they provide.
>
> Fair enough. But I would at least expect it to honor setsockopt with
> SO_SNDTIMEO/SO_RCVTIMEO.

I suppose you might, but my man page says those only apply to send and
receive operations. epoll isn't any of those.
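For illustration, a sketch of what SO_RCVTIMEO does cover: it bounds a blocking recv() on the socket itself. The helper names below are hypothetical. A timed-out recv() fails with EAGAIN/EWOULDBLOCK, and that is the only place the timeout becomes visible; epoll_wait() is not a receive operation, so its behaviour is unaffected.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

// Sets a receive timeout on a socket. Returns true on success.
bool set_recv_timeout(int fd, long millis) {
    assert(millis >= 0);
    timeval tv{};
    tv.tv_sec = millis / 1000;
    tv.tv_usec = (millis % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv) == 0;
}

// Blocking recv() that also reports whether it gave up because of the
// timeout. Returns the byte count, 0 on orderly shutdown, -1 on failure.
ssize_t recv_with_timeout(int fd, void* buf, std::size_t len, bool* timed_out) {
    ssize_t n = recv(fd, buf, len, 0);
    *timed_out = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return n;
}
```

This suits the "one thread per connection, no epoll" design; with epoll the deadline has to be tracked by the application.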

Arkadiy

Nov 17, 2007, 6:17:58 PM

> I don't understand what you mean by "adding a socket in the middle of
> epoll_wait". While epoll_wait is running your program is blocked.
> Unless you have multiple threads or something?

Right, I am thinking of having a main thread accepting connections and
distributing them among several threads, each looping around its own
epoll device. This way I think I can take advantage of multiple
processors.

An alternative would be to have a pool of threads each serving its own
connection (no epoll).

Still have to decide which way to go...

> > Second, I see no means of iterating over epoll-added sockets (maybe I
> > am missing something). It's possible to store a void pointer with the
> > socket in the epoll, and I assume it should be the pointer to an
> > object that works with the socket (I use C++). But since I can't
> > iterate through these objects (only through ones where events fired)
> > it means that, in addition to adding sockets to epoll, I need to keep
> > a parallel list of sockets (with expiration time). This is doable,
> > but not very elegant :-(
>
> Nevertheless, that's the usual method, AFAIK.

OK, this makes it easier to accept :-)

> > Fair enough. But I would at least expect it to honor setsockopt with
> > SO_SNDTIMEO/SO_RCVTIMEO.
>
> I suppose you might, but my man page says those only apply to send and
> receive operations. epoll isn't any of those.

I checked -- these settings have no effect on epoll.

Regards,
Arkadiy

David Schwartz

Nov 19, 2007, 2:41:41 PM

On Nov 17, 8:11 am, Arkadiy <vertl...@gmail.com> wrote:

> First, if I add a socket in the middle of epoll_wait, its timeout must
> be at least as long as scheduled finish of the epoll_wait minus
> current time. Otherwise, since the timeout is not an event, I will
> find out about it only after the epoll_wait ends. This prevents me
> from designing a more or less generic system where timeouts would be
> provided at the time of operation.

You are associating two things that have nothing to do with each other
and that is making things needlessly complex. You call 'epoll', you
get any socket events. You know 'epoll' doesn't do timeouts, so why
are you trying to force it to?

Keep your 'epoll' code and your timeout code completely separate. Who
cares when you added the socket to the 'epoll' set and when you called
'epoll'? When the socket times out, handle the time out. When you get
an event from 'epoll' handle that.

> Second, I see no means of iterating over epoll-added sockets (maybe I
> am missing something). It's possible to store a void pointer with the
> socket in the epoll, and I assume it should be the pointer to an
> object that works with the socket (I use C++). But since I can't
> iterate through these objects (only through ones where events fired)
> it means that, in addition to adding sockets to epoll, I need to keep
> a parallel list of sockets (with expiration time). This is doable,
> but not very elegant :-(

I can't imagine what the alternative is to having a list of sockets
and the expiration time of each. What alternative could you possibly
even imagine?

You can store a pointer with the socket in the 'epoll' structure. But
I'd recommend just using the descriptor. An efficient map of
descriptors to internal socket structures is probably best.
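A minimal sketch of such a map, assuming the descriptor is what gets stored in the epoll event (epoll_event.data.fd). Connection and ConnectionTable are hypothetical names; the point is only that the application owns the table and therefore decides when a connection stops existing.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical per-connection state.
struct Connection {
    int fd;
    std::string inbuf;  // bytes read so far
};

// Descriptor-to-connection table owned by the application.
class ConnectionTable {
public:
    // Creates (or resets) the entry for fd and returns it.
    Connection& add(int fd) {
        assert(fd >= 0);
        Connection& c = table_[fd];
        c = Connection{fd, {}};
        return c;
    }

    void remove(int fd) { table_.erase(fd); }

    // Returns nullptr when the event refers to a connection that is
    // already gone, so a stale event can simply be ignored.
    Connection* find(int fd) {
        auto it = table_.find(fd);
        return it == table_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<int, Connection> table_;
};
```

On each event the loop would call find(events[i].data.fd) and skip the event if the result is null. Because descriptors are small integers, a plain vector indexed by descriptor would also work.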

DS

Arkadiy

Nov 19, 2007, 11:03:46 PM

> You are associating two things that have nothing to do with each other
> and that is making things needlessly complex. You call 'epoll', you
> get any socket events. You know 'epoll' doesn't do timeouts, so why
> are you trying to force it to?
>
> Keep your 'epoll' code and your timeout code completely separate. Who
> cares when you added the socket to the 'epoll' set and when you called
> 'epoll'? When the socket times out, handle the time out. When you get
> an event from 'epoll' handle that.

OK, sounds good.

> > Second, I see no means of iterating over epoll-added sockets (maybe I
> > am missing something). It's possible to store a void pointer with the
> > socket in the epoll, and I assume it should be the pointer to an
> > object that works with the socket (I use C++). But since I can't
> > iterate through these objects (only through ones where events fired)
> > it means that, in addition to adding sockets to epoll, I need to keep
> > a parallel list of sockets (with expiration time). This is doable,
> > but not very elegant :-(
>
> I can't imagine what the alternative is to having a list of sockets
> and the expiration time of each. What alternative could you possibly
> even imagine?

Once I handle timeouts separately, this stops being a problem. When a
timeout fires I have to go to epoll and remove the socket (by the file
descriptor, so I need a mapping between the timeout event and the
descriptor). If the socket finishes its operation first, I remove the
timeout... or something like this.
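One possible shape for that separate timeout bookkeeping, sketched under the assumption that each descriptor has at most one pending timeout. TimerQueue and its method names are hypothetical; nothing here touches epoll, which is the point of keeping the two concerns apart.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical timer queue kept entirely separate from the epoll set.
class TimerQueue {
public:
    // Arms (or re-arms) the timeout for fd.
    void arm(int fd, Clock::time_point deadline) {
        assert(fd >= 0);
        cancel(fd);
        by_fd_[fd] = by_deadline_.emplace(deadline, fd);
    }

    // Cancels the timeout, e.g. when the operation finished in time.
    void cancel(int fd) {
        auto it = by_fd_.find(fd);
        if (it == by_fd_.end()) return;
        by_deadline_.erase(it->second);
        by_fd_.erase(it);
    }

    // Removes and returns every fd whose deadline is at or before now,
    // earliest first.
    std::vector<int> pop_expired(Clock::time_point now) {
        std::vector<int> expired;
        while (!by_deadline_.empty() && by_deadline_.begin()->first <= now) {
            int fd = by_deadline_.begin()->second;
            expired.push_back(fd);
            by_fd_.erase(fd);
            by_deadline_.erase(by_deadline_.begin());
        }
        return expired;
    }

    bool empty() const { return by_deadline_.empty(); }

private:
    using Deadlines = std::multimap<Clock::time_point, int>;
    Deadlines by_deadline_;                               // ordered by expiry
    std::unordered_map<int, Deadlines::iterator> by_fd_;  // for cancel()
};
```

After each return from epoll_wait(), the loop would handle the reported events, then call pop_expired(Clock::now()) and, for each expired descriptor, remove it from the epoll set with epoll_ctl(EPOLL_CTL_DEL) and take the alternative route.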

> You can store a pointer with the socket in the 'epoll' structure. But
> I'd recommend just using the descriptor. An efficient map of
> descriptors to internal socket structures is probably best.

I don't think I understand why the map is better. Let's say something
becomes available on the socket, so I need to read some number of
bytes into the buffer associated with the socket. If I have an object
that stores both the buffer address and the file descriptor, and the
address of this object is right in the event, then I don't need to do
any map lookups. This looks more efficient to me. Am I missing
something?

Regards,
Arkadiy

William Ahern

Nov 20, 2007, 10:39:31 AM

Arkadiy <vert...@gmail.com> wrote:
<snip>

> > You can store a pointer with the socket in the 'epoll' structure. But
> > I'd recommend just using the descriptor. An efficient map of
> > descriptors to internal socket structures is probably best.
>
> I don't think I understand why the map is better. Let's say something
> becomes available on the socket, so I need to read some number of
> bytes into the buffer associated with the socket. If I have an object
> that stores both the buffer address and the file descriptor, and the
> address of this object is right in the event, then I don't need to do
> any map lookups. This looks more efficient to me. Am I missing
> something?

Because it's not portable. epoll is Linux only, and similar high-performance
polling interfaces might not store that pointer for you. And, more so, Linux
hasn't exactly been known to keep the most thought-out interfaces. The
highly organic development leaves something to be desired in cases like
these, whatever the benefits are on the whole. Notice all the development
going into pollable mutexes and timers. Those interfaces have long been
analyzed and rolled into things like BSD's kqueue or Solaris's event ports.
Those are two interfaces w/ substantially less risk of changing or, of
particular concern with Linux, not being superseded.

Rainer Weikusat

Nov 20, 2007, 11:16:19 AM

William Ahern <wil...@wilbur.25thandClement.com> writes:
> Arkadiy <vert...@gmail.com> wrote:
> <snip>
>> > You can store a pointer with the socket in the 'epoll' structure. But
>> > I'd recommend just using the descriptor. An efficient map of
>> > descriptors to internal socket structures is probably best.
>>
>> I don't think I understand why the map is better. Let's say something
>> becomes available on the socket, so I need to read some number of
>> bytes into the buffer associated with the socket. If I have an object
>> that stores both the buffer address and the file descriptor, and the
>> address of this object is right in the event, then I don't need to do
>> any map lookups. This looks more efficient to me. Am I missing
>> something?
>
> Because it's not portable. epoll is Linux only, and similar
> high-performance polling interfaces might not store that pointer for
> you.

Using epoll means this part of the application is already 'not
portable' to something which does not support Linux-system calls
(which, in reality, means, that it is or will be portable to about
everything, because 'about everything' provides Linux-compatible
system-call interface). So the damage is already done and there is no
reason to not use all of the features epoll provides.

> And, more so, Linux hasn't exactly been known to keep the most
> thought-out interfaces.

Which is supposed to mean what?

> The highly organic development leaves something to be desired in
> cases like these, whatever the benefits are on the whole.

Which?

> Notice all the development going into pollable mutexes and
> timers. Those interfaces have long been analyzed and rolled into
> things like BSD's kqueue or Solaris's event ports.

So, basically, there are some features some of the BSDs and some
Solaris versions already have and these features are in the process of
being added to Linux. And the conclusion is?

> Those are two interfaces w/ substantially less risk of changing or,
> of particular concern with Linux, not being superseded.

I cannot see how this would follow from anything you wrote so far. To
me, this sentence basically means that 'if someone comes up with
something better than what currently exists in Linux, it will be
integrated[*], but the same would neither happen for *BSD nor for
Solaris'.

[*] This is too idealistic a perspective. E.g., I have a
measurably faster tun-driver than the one in Linux (measured
against 2.4, to be precise, but it hasn't changed much in
2.6). But the modifications generously trample all over the
driver and (to some degree) over the 'use it this way'
skb usage interface to achieve this effect. I therefore
assume that Linux-integration would be extremely unlikely
and publishing a patch would at best lead to random boneheads
sending me nasty e-mails. In any case, I am not going to try.

David Schwartz

Nov 20, 2007, 11:45:48 AM

On Nov 19, 8:03 pm, Arkadiy <vertl...@gmail.com> wrote:

> > You can store a pointer with the socket in the 'epoll' structure. But
> > I'd recommend just using the descriptor. An efficient map of
> > descriptors to internal socket structures is probably best.

> I don't think I understand why the map is better.

Because you have complete control over the map.

> Let's say something
> becomes available on the socket, so I need to read some number of
> bytes into the buffer associated with the socket. If I have an object
> that stores both the buffer address and the file descriptor, and the
> address of this object is right in the event, then I don't need to do
> any map lookups. This looks more efficient to me. Am I missing
> something?

What if the object no longer exists by the time you get the epoll
event? Here's the problem:

You handle a fatal error on the connection.
You remove the connection's descriptor from the epoll set, but you
don't realize that the epoll thread just returned from epoll with an
event.
When is it safe to remove the connection from memory?

So you wind up either having to make your connection destruction code
really ugly or you have to validate the pointer with some kind of
structure anyway.

That said, I suppose one could argue that the socket descriptor reuse
creates much the same problem. You can't just assume that an event for
descriptor 5 is for the connection descriptor 5 refers to *now*
without the same kind of checking.
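One way to do "the same kind of checking" for reused descriptors is to tag each registration with a generation counter. The sketch below assumes the 64-bit epoll_event.data.u64 field is used instead of data.fd or data.ptr; make_token, Generations and the other names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Packs a descriptor and a generation counter into the 64-bit value
// that epoll hands back with each event (epoll_event.data.u64).
inline std::uint64_t make_token(int fd, std::uint32_t generation) {
    assert(fd >= 0);
    return (static_cast<std::uint64_t>(generation) << 32) |
           static_cast<std::uint32_t>(fd);
}
inline int token_fd(std::uint64_t token) {
    return static_cast<int>(token & 0xffffffffu);
}
inline std::uint32_t token_generation(std::uint64_t token) {
    return static_cast<std::uint32_t>(token >> 32);
}

// Hypothetical registry: remembers which generation owns each fd now.
class Generations {
public:
    // Call when a new connection starts using fd; the result is the
    // value to store in epoll_event.data.u64.
    std::uint64_t open(int fd) { return make_token(fd, ++current_[fd]); }

    // Call when the connection on fd is closed.
    void retire(int fd) { ++current_[fd]; }

    // True only if the event belongs to the connection that owns the
    // descriptor right now; stale events fail this check.
    bool is_current(std::uint64_t token) const {
        auto it = current_.find(token_fd(token));
        return it != current_.end() && it->second == token_generation(token);
    }

private:
    std::unordered_map<int, std::uint32_t> current_;
};
```

An event that was queued for the old connection on descriptor 5 carries the old generation, so is_current() rejects it even after descriptor 5 has been handed to a new connection.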

DS

William Ahern

Nov 20, 2007, 11:47:44 AM

Rainer Weikusat <rwei...@mssgmbh.com> wrote:
> William Ahern <wil...@wilbur.25thandClement.com> writes:
> > Arkadiy <vert...@gmail.com> wrote:
> > <snip>
> >> > You can store a pointer with the socket in the 'epoll' structure. But
> >> > I'd recommend just using the descriptor. An efficient map of
> >> > descriptors to internal socket structures is probably best.
> >>
> >> I don't think I understand why the map is better. Let's say something
> >> becomes available on the socket, so I need to read some number of
> >> bytes into the buffer associated with the socket. If I have an object
> >> that stores both the buffer address and the file descriptor, and the
> >> address of this object is right in the event, then I don't need to do
> >> any map lookups. This looks more efficient to me. Am I missing
> >> something?
> >
> > Because it's not portable. epoll is Linux only, and similar
> > high-performance polling interfaces might not store that pointer for
> > you.

> Using epoll means this part of the application is already 'not
> portable' to something which does not support Linux-system calls
> (which, in reality, means, that it is or will be portable to about
> everything, because 'about everything' provides Linux-compatible
> system-call interface). So the damage is already done and there is no
> reason to not use all of the features epoll provides.

There is such a thing as mitigation of damage. No need to irreparably bind
yourself when it's not necessary to meet your aim. It seemed to have been
assumed that epoll was necessary to meet his aim, but, I wagered, the data
structure optimization reliant on epoll was being challenged.

<snip>

> > Those are two interfaces w/ substantially less risk of changing or,
> > of particular concern with Linux, not being superseded.

> I cannot see how this would follow from anything you wrote so far. To
> me, this sentence basically means that 'if someone comes up with
> something better than what currently exists in Linux, it will be
> integrated[*], but the same would neither happen for *BSD nor for
> Solaris'.

> [*] This is too idealistic a perspective. E.g., I have a
> measurably faster tun-driver than the one in Linux (measured
> against 2.4, to be precise, but it hasn't changed much in
> 2.6). But the modifications generously trample all over the
> driver and (to some degree) over the 'use it this way'
> skb usage interface to achieve this effect. I therefore
> assume that Linux-integration would be extremely unlikely
> and publishing a patch would at best lead to random boneheads
> sending me nasty e-mails. In any case, I am not going to try.

dnotify => inotify.

Granted, dnotify still exists in the kernel, but point is, these things
change, especially in Linux-land. kqueue hasn't changed... well... never.
The interface is the same today as it always has been, precisely because it
was carefully designed to accommodate new notification mechanisms. There have
been API changes to epoll floated; they haven't been adopted, but, again,
the risk is one of degree, and the risk is greater in Linux than elsewhere.

Arkadiy

Nov 20, 2007, 12:18:33 PM

On Nov 20, 11:45 am, David Schwartz <dav...@webmaster.com> wrote:

> What if the object no longer exists by the time you get the epoll
> event? Here's the problem:
>
> You handle a fatal error on the connection.
> You remove the connection's descriptor from the epoll set, but you
> don't realize that the epoll thread just returned from epoll with an
> event.
> When is it safe to remove the connection from memory?

The fatal error on the connection is an epoll event, correct? When I
process this event, I remove the connection's descriptor from the
epoll set, and then remove the connection from memory... What am I
missing?

Regards,
Arkadiy

David Schwartz

Nov 21, 2007, 6:58:37 PM

On Nov 20, 9:18 am, Arkadiy <vertl...@gmail.com> wrote:

> The fatal error on the connection is an epoll event, correct?

It could be, but it could just as well be a "shutdown" command
received from another connection or from an administrative interface.

> When I
> process this event, I remove the connection's descriptor from the
> epoll set, and then remove the connection from memory... What am I
> missing?

Here's the problem:

1) Your call to 'epoll' returns.
2) You dispatch threads to handle all the events you detected.
3) An event occurs on the socket.
4) You call 'epoll'.
5) It returns.
6) The event you dispatched in 2 is now handled, removing the
descriptor from the 'epoll' set and removing it from memory.
7) The first thread now tries to handle the event that occurred in step
3.

You can solve this by handling all events before calling 'epoll'
again, but that's kind of horrible. It tends to be very unfair to
connections that do large numbers of small things versus connections
that do small numbers of large things.

DS

Arkadiy

Nov 26, 2007, 10:09:00 AM

On Nov 21, 6:58 pm, David Schwartz <dav...@webmaster.com> wrote:

> You can solve this by handling all events before calling 'epoll'
> again, but that's kind of horrible. It tends to be very unfair to
> connections that do large numbers of small things versus connections
> that do small numbers of large things.

So do I understand correctly that, once epoll_wait() returns, the
occurred events need to be handled asynchronously (for example by a
pool of worker threads) and the epoll_wait() has to immediately be
called again?

In my understanding such an approach can easily create a situation where
two threads are reading from (or writing to) the same socket at the
same time:

1) some bytes become available on a socket;
2) epoll_wait() returns;
3) the event gets dispatched to a worker thread;
4) epoll_wait() is called;
5) some more bytes become available on the socket;
6) epoll_wait() returns;
7) the event gets dispatched to another worker thread;

Is it implied that this situation needs to be handled, or am I missing
something?

Regards,
Arkadiy

David Schwartz

Nov 26, 2007, 12:22:47 PM

On Nov 26, 7:09 am, Arkadiy <vertl...@gmail.com> wrote:

> On Nov 21, 6:58 pm, David Schwartz <dav...@webmaster.com> wrote:

> > You can solve this by handling all events before calling 'epoll'
> > again, but that's kind of horrible. It tends to be very unfair to
> > connections that do large numbers of small things versus connections
> > that do small numbers of large things.

> So do I understand correctly that, once epoll_wait() returns, the
> occurred events need to be handled asynchronously (for example by a
> pool of worker threads) and the epoll_wait() has to immediately be
> called again?

It depends upon your application and your requirements and whether
you're using edge-triggered or level-triggered events.

You can do some very cool things with edge-triggered events. For
example, you can have the thread that called 'epoll' process the
events it got back while another thread calls 'epoll' as soon as it
can. You won't get the same events twice because they won't re-arm
until they're serviced.

With level-triggered events, you can't call 'epoll' again until you've
at least partially serviced every event you discovered, otherwise
you'll get that same event again. This doesn't mean you have to fully
service them. For a read event, for example, as soon as the service
thread has read all the data from the socket, it can mark the event
serviced, even though it hasn't looked at the data contents yet.
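A hedged sketch of one way to get "no repeat until serviced" behaviour on Linux. It uses the EPOLLONESHOT flag, which is a swapped-in technique rather than something named in this thread: as far as the epoll man page describes, EPOLLET on its own can still report a descriptor again when new data arrives, whereas EPOLLONESHOT disables the descriptor after one event until it is re-armed with EPOLL_CTL_MOD. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

// Registers fd for a single read notification. After one event is
// reported, epoll reports nothing more for fd until rearm_read().
bool add_oneshot_read(int epfd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Re-enables notifications once the service thread is done with fd.
bool rearm_read(int epfd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN | EPOLLONESHOT;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == 0;
}

// Reads whatever is buffered right now without blocking and appends it
// to out. Returns the number of bytes read.
std::size_t drain(int fd, std::string& out) {
    char buf[4096];
    std::size_t total = 0;
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, MSG_DONTWAIT);
        if (n <= 0) break;  // 0: peer closed; <0: EAGAIN or a real error
        out.append(buf, static_cast<std::size_t>(n));
        total += static_cast<std::size_t>(n);
    }
    return total;
}
```

A service thread would call drain() and then rearm_read(); until the re-arm, another thread calling epoll_wait() on the same epoll descriptor does not see this socket, even if more bytes arrive in the meantime.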

> In my understanding such an approach can easily create a situation where
> two threads are reading from (or writing to) the same socket at the
> same time:
>
> 1) some bytes become available on a socket;
> 2) epoll_wait() returns;
> 3) the event gets dispatched to a worker thread;
> 4) epoll_wait() is called;
> 5) some more bytes become available on the socket;
> 6) epoll_wait() returns;
> 7) the event gets dispatched to another worker thread;
>
> Is it implied that this situation needs to be handled, or am I missing
> something?

You would have to work at it to create an architecture where this was
an actual problem. In a realistic architecture, one of many things
would stop this. The most common is an atomic dispatch flag for a
connection. When you detect that you need to read from the connection,
you check the flag. If it's set, do nothing. If it's clear, you
dispatch and set it. Then the service thread atomically clears the
flag at the same time it decides not to do any further work on the
connection.
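The dispatch flag described above can be sketched with a std::atomic<bool>. The Connection struct and its method names are hypothetical, and the sketch shows only the flag, not the thread pool around it.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical connection with a dispatch flag: at most one service
// thread owns the connection at any time.
struct Connection {
    std::atomic<bool> dispatched{false};

    // Called by the epoll thread when an event arrives. Returns true if
    // the caller should hand the connection to a service thread, false
    // if one is already working on it.
    bool try_dispatch() {
        return !dispatched.exchange(true, std::memory_order_acq_rel);
    }

    // Called by the service thread when it decides to stop working on
    // the connection.
    void finish() {
        assert(dispatched.load(std::memory_order_acquire));
        dispatched.store(false, std::memory_order_release);
    }
};
```

One caveat, stated as an assumption about the surrounding design: if data arrives after the service thread's last read but before finish(), a level-triggered epoll set reports the socket again on the next epoll_wait(), while an edge-triggered one may not, so in that case the service thread would need to check the socket once more after clearing the flag.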

Another way is simply to lock the connection so the second event has
to wait until the first is finished.

The best way depends on whether your events are edge or level
triggered and other requirements. But if it's not 100% obvious that
this could never, ever happen, then your architecture is doing
something horribly wrong. One horribly wrong thing you can do that can
lead to this is pass a file descriptor to 'read' or 'write' before you
make sure there exists a high-level "connection" data structure
associated with that socket.

DS

Arkadiy

Nov 29, 2007, 10:43:37 AM

OK, thanks.

After playing with this I think I have a better understanding of what
you mean. I think in my case I can just process EPOLLINs in the epoll
thread until _all_ the request is read, and only then dispatch it to
one of service threads for processing. Since I have read all the
request, no more events will be generated under normal conditions
until I change it to EPOLLOUT (I still have to be careful if error
events happen). When the processing is done, I will change the events
to EPOLLOUT to start writing.
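A small sketch of the switch from reading to writing, assuming the descriptor is already in the epoll set and that the descriptor itself is stored in the event data. The helper names watch and set_interest are hypothetical; epoll_ctl() with EPOLL_CTL_MOD is the call that replaces the registered event mask.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/epoll.h>
#include <sys/socket.h>

// Adds fd to the epoll set with the given event mask.
bool watch(int epfd, int fd, std::uint32_t events) {
    epoll_event ev{};
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Replaces the event mask for an fd that is already in the set, for
// example EPOLLIN -> EPOLLOUT once the reply is ready to be written.
// EPOLLERR and EPOLLHUP are reported regardless of the mask, so the
// event loop still has to handle them in either state.
bool set_interest(int epfd, int fd, std::uint32_t events) {
    epoll_event ev{};
    ev.events = events;
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev) == 0;
}
```

The service thread (or the epoll thread on its behalf) would call set_interest(epfd, fd, EPOLLOUT) when processing is done, and switch back to EPOLLIN after the reply has been written if the connection is kept open.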

Does this make sense?

Thanks,
Arkadiy

David Schwartz

Nov 30, 2007, 11:48:16 PM

On Nov 29, 7:43 am, Arkadiy <vertl...@gmail.com> wrote:

> After playing with this I think I have a better understanding of what
> you mean. I think in my case I can just process EPOLLINs in the epoll
> thread until _all_ the request is read, and only then dispatch it to
> one of service threads for processing. Since I have read all the
> request, no more events will be generated under normal conditions
> until I change it to EPOLLOUT (I still have to be careful if error
> events happen). When the processing is done, I will change the events
> to EPOLLOUT to start writing.
>
> Does this make sense?

Are you using edge-triggered or level-triggered? And are you
dispatching events or are you handling them in the epoll thread? And
can more than one thread call epoll_wait or just one?

What you are saying could be really good for the right combination of
those factors.

DS