threads & sockets

Lyonya

Apr 30, 2005, 5:09:16 PM
I have a multi-threaded application where the main thread reads messages
(using select()) and dispatches requests to various follower threads.
Now, when a follower thread receives a particular message to process, it
needs to send a 'question' to the sender. The thread cannot proceed
until it reads the 'answer', so it blocks on read(). The problem is that
the main thread is monitoring that socket too. What's the best way to
deal with such a problem? Is there any way for a thread to temporarily
'hide' a socket (file descriptor) from the others?

Pascal Bourguignon

Apr 30, 2005, 5:37:27 PM
Lyonya <leonid...@optonline.com> writes:

Manage several fdsets, one for each thread. When the master thread
gives a socket to a worker thread, remove the socket from the fdset of
the master and add it to the fdset of the worker, and vice versa when
the worker thread is done.

--
__Pascal Bourguignon__ http://www.informatimago.com/

In a World without Walls and Fences,
who needs Windows and Gates?

Lyonya

Apr 30, 2005, 10:01:58 PM
Pascal Bourguignon wrote:
> Lyonya <leonid...@optonline.com> writes:
>
>
>>I have a multi-threaded application where the main thread reads
>>messages (using select()) and dispatches requests to various follower
>>threads. Now, when a follower thread receives a particular message to
>>process, it needs to send a 'question' to the sender. The thread
>>cannot proceed until it reads the 'answer', so it blocks on read(). The
>>problem is that the main thread is monitoring that socket too. What's
>>the best way to deal with such a problem? Is there any way for a thread
>>to temporarily 'hide' a socket (file descriptor) from the others?
>
>
> Manage several fdsets, one for each thread. When the master thread
> gives a socket to a worker thread, remove the socket from the fdset of
> the master and add it to the fdset of the worker, and vice versa when
> the worker thread is done.
>
Is it possible to remove a socket from an fdset while a thread (the main
one in this case) is listening on that fdset?

Jonathan Adams

Apr 30, 2005, 10:28:44 PM
In article <ZtSce.2250$Hf6...@fe11.lga>,
Lyonya <leonid...@optonline.com> wrote:

This may (or may not) help you, but the Event Ports API added in
Solaris 10 is designed for exactly this sort of thing; see

http://blogs.sun.com/roller/page/barts/20040720#entry_2_event_ports

for an example of their use; the key bit of the API here is that once a
thread has received an event for a file descriptor, other threads will
not get events for it until that thread re-associates it. This lets you
get rid of the main thread as a choke point.

Cheers,
- jonathan

David Schwartz

Apr 30, 2005, 11:51:34 PM

"Lyonya" <leonid...@optonline.com> wrote in message
news:nMWce.4512$FE3....@fe12.lga...

> Is it possible to remove a socket from an fdset while a thread (the main
> one in this case) is listening on that fdset?

Just remove it from the master fdset that the thread copies from before
it calls 'select'. The worst thing that will happen is the thread will
return from 'select' one extra time.

By the way, the design I recommend is to have one thread for each, say,
100 connections. That thread calls 'select' or 'poll' on those connections.
When it detects any read hits, it reads the data, puts it in a receive
queue, and dispatches a thread to process the data. This way, the thread can
run to completion and call 'select' or 'poll' again only when all the data
has been read on all the sockets this thread is handling. If too much data
is pending in the receive queue, the socket can be removed from the
'select'/'poll' thread and added back only when the receive queue empties.

For sending, I recommend two socket states, 'sending' and 'idle'. A
socket is 'idle' if there is no data that you have buffered to send on that
socket. An 'idle' socket is not in the write fd set. I recommend the
following write algorithm:

For idle sockets, since there is no data buffered, just try the write
immediately. If all the data is written, the socket can stay idle and we
never need to 'select' or 'poll'. If no data (or only some of the data) is
sent, add the unsent data to the send queue, change the socket's state to
'sending' and add the socket to the master write poll/select set.

When you get a 'write' hit, write as much data as possible. If you write
all of it, the send queue is now empty. Set the state to idle and remove the
socket from the write poll/select set.

    If you set a reasonable select or poll timeout (say, 350mS), and only
select or poll on about 100 sockets per thread, the delay in transitioning
from 'idle' to 'sending' won't hurt you, especially since it happens only
rarely and only when the TCP stack's send buffer is full.

This works *very* well for typical protocols like SMTP and HTTP.

DS


James Antill

May 2, 2005, 2:01:50 PM
On Sat, 30 Apr 2005 20:51:34 -0700, David Schwartz wrote:

> By the way, the design I recommend is to have one thread for each, say,
> 100 connections. That thread calls 'select' or 'poll' on those connections.

Why? If you have 10_000 connections, 10% of which are active, you are
requiring 1000 threads ... what does this buy you?
Is this a way to work around poll() scalability when epoll() etc. is
lacking?

[snip generic read/write IO event loop]

> If you set a reasonable select or poll timeout (say, 350mS), and
> only

Why would you force a timeout every 350mS? If there is nothing to do,
there is nothing to do. Or is this in lieu of having correct timers, you
just try them every 350mS?

--
James Antill -- ja...@and.org
http://www.and.org/vstr/httpd

Pascal Bourguignon

May 2, 2005, 5:10:02 PM
James Antill <james-...@and.org> writes:
>> If you set a reasonable select or poll timeout (say, 350mS), and
>> only
>
> Why would you force a timeout every 350mS? If there is nothing to do,
> there is nothing to do. Or is this in lieu of having correct timers, you
> just try them every 350mS?

You time out every 350 milli siemens? Every 350 milli-ampere/volt ???


--
__Pascal Bourguignon__ http://www.informatimago.com/

This is a signature virus. Add me to your signature and help me to live

David Schwartz

May 2, 2005, 5:13:38 PM

"James Antill" <james-...@and.org> wrote in message
news:pan.2005.05.02....@and.org...

> On Sat, 30 Apr 2005 20:51:34 -0700, David Schwartz wrote:

>> By the way, the design I recommend is to have one thread for each,
>> say,
>> 100 connections. That thread calls 'select' or 'poll' on those
>> connections.

> Why? If you have 10_000 connections, 10% of which are active, you are
> requiring 1000 threads ... what does this buy you?

What it buys you is that if data is received on only a single
connection, you don't have to remove a thread from 10,000 wait queues only
to put it back on all 10,000 wait queues a split second later.

> Is this a way to work around poll() scalability when epoll() etc. is
> lacking?

Largely so, but 'epoll' has its own issues.

> [snip generic read/write IO event loop]
>
>> If you set a reasonable select or poll timeout (say, 350mS), and
>> only

> Why would you force a timeout every 350mS? If there is nothing to do,
> there is nothing to do. Or is this in lieu of having correct timers, you
> just try them every 350mS?

    There is no 'block in select until a condition variable is signalled' or
similar primitive to use. So you are left with either building a mechanism of
inter-thread communication that can be selected on (such as a pipe) or
timing out the select. I like timing out the select better myself.

One huge benefit to the time out method is that your performance gets
better under load rather than worse. Imagine you're handling 250 connections
and you have a high rate of new incoming connections. If you break out of
'select' or 'poll' each time a new connection is received in order to add it
to the fd set, you are removing yourself from 250 wait queues only to put
yourself back on 250 queues perhaps dozens of times a second. Whereas with a
350mS timeout, you do so only about three times a second.

    The pipe method is better under low load, the fixed timeout method is
better under high load. I find it odd to tune for low load. It's when you're
under high load that performance is important.

Again though, this is a workaround for defects in 'select'/'poll' and
you may be able to avoid these kinds of tradeoffs if you have 'kqueue',
'/dev/poll', or 'epoll' options.

DS


David Schwartz

May 2, 2005, 5:38:19 PM

"Pascal Bourguignon" <p...@informatimago.com> wrote in message
news:87wtqhb...@thalassa.informatimago.com...

> James Antill <james-...@and.org> writes:
>>> If you set a reasonable select or poll timeout (say, 350mS), and
>>> only
>>
>> Why would you force a timeout every 350mS? If there is nothing to do,
>> there is nothing to do. Or is this in lieu of having correct timers, you
>> just try them every 350mS?
>
> You time out every 350 milli siemens? Every 350 milli-ampere/volt ???

Yep, you got it. Every 350 thousandths of a mA per V.

DS


Andrei Voropaev

May 3, 2005, 4:07:23 AM
On 2005-05-02, David Schwartz <dav...@webmaster.com> wrote:
>
> "James Antill" <james-...@and.org> wrote in message
[...]

>> Why would you force a timeout every 350mS? If there is nothing to do,
>> there is nothing to do. Or is this in lieu of having correct timers, you
>> just try them every 350mS?
>
> There is no 'block in select until a condition variable is signalled' or
> similar primitives to use. So you are left with making a mechanism of
> inter-thread communication that can be selected on (like using a pipe) or
> timing out the select. I like timing out the select better myself.
>
> One huge benefit to the time out method is that your performance gets
> better under load rather than worse. Imagine you're handling 250 connections
> and you have a high rate of new incoming connections. If you break out of
> 'select' or 'poll' each time a new connection is received in order to add it
> to the fd set, you are removing yourself from 250 wait queues only to put
> yourself back on 250 queues perhaps dozens of times a second. Whereas with a
> 350mS timeout, you do so only about three times a second.
>
> The pipe method is better under low load, the fixed timeout method is
> better under high load. I find it odd to tune for low load. It's when you're
> under high load that performance is important.

Hm. There's something I don't understand in your explanation. What do
you mean by "removing from 250 queues"? If I have an FD set with
250 fds in it, do you mean that when I call select, the kernel puts each
of them into a separate "queue"? Then in your scenario you should run
select with an empty FD set, only with a timeout? But then after select
expires you have to check each of the 250 fds to see if there's anything
ready on them, which means that you have to switch to kernel context 250
times (or even more if some of those fds require both read and write).
And everyone knows that switching to kernel context is expensive, more
expensive than putting 250 fds into "queues"; otherwise people wouldn't
have invented functions like poll or select, they would simply sleep and
then loop checking each socket. Besides, your program would become
less responsive in this scenario.

Well, if you run select with 250 fds in the set and with a timeout, then
it may return before the timeout expires (and it does under heavy
load) if any of the sockets is ready. So you would check all of your
mutexes more often than 3-4 times per second, again wasting CPU time.
On the other hand, if the timeout expires and no data is ready (and no
event on a mutex or other thread-specific state is pending), then you just
take all your fds off the queues and put them back again in vain, many
times per second, taking CPU time away from the threads that may
need it for real work. Again, I feel that not using the timeout
for select is better: a pipe for communicating thread-synchronisation
events is better.

So poll/select have deficiencies, but somehow everyone believes that
the best way to solve this is an improved poll/select, not a
different algorithm for calling poll/select.

>
> Again though, this is a workaround for defects in 'select'/'poll' and
> you may be able to avoid these kinds of tradeoffs if you have 'kqueue',
> '/dev/poll', or 'epoll' options.

So, I don't see how your approach helps to avoid select/poll "defects".
So far I see only that it makes them worse.


--
Minds, like parachutes, function best when open

David Schwartz

May 3, 2005, 4:56:57 AM

"Andrei Voropaev" <avo...@mail.ru> wrote in message
news:3dopprF...@individual.net...

> Hm. There's something I don't understand in your explanations. What do
> you mean when you say "removing from 250 queues". If I have FD set with
> 250 fds in it, do you mean that when I call select, then kernel puts all
> of them into separate "queue"?

If you call 'select' on 250 file descriptors and 'select' does not
return immediately, then the kernel must arrange for any of 250 different
events to wake your thread. This generally requires creating 250 wait queue
objects and putting one on each of the 250 sockets' queues.

> Then in your scenario you should run
> select with empty FD set, only with timeout? But then after select
> expires you have to check each of 250 fds to see if there's anything
> ready on them. Which means that you have to switch to kernel context 250
> times (or even more if some of those fds require both read and write).

    Well, there are several possible solutions. One is to 'select' with a
zero timeout so that it returns immediately and sleep for, say, 100
milliseconds if 'select' returns no hits. This means you are never on a
socket's wait queue, but it adds latency.

The wait queue issue only is a problem if you 'select' with a timeout
and do not return immediately. This is the case you want to avoid. If you
can't avoid it, minimize the number of sockets you are 'select'ing on to
minimize the number of wait objects that need to be created, queued,
dequeued, and destroyed.

> And everyone knows that switching to kernel context is expensive, more
> expensive than putting 250 fds into "queues". Otherwise people wouldn't
> make up functions like poll or select. They would simply sleep and then
> go in the loop checking each socket. Besides your program would become
> less responsive in this scenario.

I never suggested checking all 250 fds. You should definitely use
'select' or 'poll' for socket discovery. However, you should work very hard
to minimize the case where you 'select' on a large number of sockets that
include at least one that is heavily used but do not return immediately.

> Well, if you run select with 250 fds in the set and with timeout, then
> this may return before the time-out expires (and it does under heavy
> load), if any of the sockets is ready. So you would check all of your
> mutexes more often than 3-4 times per second, again wasting CPU time.

I don't know what mutexes you are talking about. It will return when any
socket is ready, but you then examine the returned 'select' or 'poll' set
and only mess with the particular socket that you discovered.

> On the other hand, if time-out expires and no data is ready (and no
> event on mutex or other thread-specific stuff is ready), then you just
> get all your fds from the queue and put them back again in vain, so many
> times per second, thus taking away CPU time from those threads, that may
> need it for doing real work.

    The point is that while it does happen a few times per second, it has a
hard limit. So the relative cost of the 'three per second' cycle will go
down with load.

> Again, I feel like not-using the time-out
> for select is better. Pipe for communicating thread synchronisation
> events is better.

Except this method gets less and less efficient as load goes up.

> So, poll/select have defficiencies, but somehow all people believe that
> the best way this can be solved is by using improved poll/select and not
> different algorithm for calling poll/select.

No, that's not true. If you have access to a superior replacement for
'poll'/'select', you should use it. If you don't, what is the alternative to
calling 'poll' or 'select' in the most efficient manner?

>> Again though, this is a workaround for defects in 'select'/'poll' and
>> you may be able to avoid these kinds of tradeoffs if you have 'kqueue',
>> '/dev/poll', or 'epoll' options.

> So, I don't see how your approach helps to avoid select/poll "defects".
> So far I see only that it makes them worse.

Perhaps you don't understand what the defects are and what it is in the
kernel that causes them. The primary defects are:

1) You have to re-check every single socket just because one socket has
been discovered. When 'poll', say, returns with only one socket ready, you
do the work on that socket, call 'poll' again, and the kernel first has to
check every single socket descriptor in the set.

2) If you call 'poll' with a timeout, and it does not return
immediately, the kernel must block the thread on a very large number of
events. This requires a create/queue/dequeue/destroy cycle for each socket
that may (worst case) have to be repeated once for each I/O you do.

3) To add a new socket to the set of interest, you have to stop checking
all the other sockets, and then start checking them all again. If your rate
of new connections is high, this can eat up a lot of CPU.

Now, if you don't think these are the defects, feel free to explain what
you think they are. If you think my approach makes any of these defects
worse, feel free to explain how. If you don't see how it makes them better,
well, read over my post and think about it. This is hard-earned knowledge
from 7 years of messing with exactly this.

DS


James Antill

May 3, 2005, 12:28:21 PM
On Mon, 02 May 2005 14:13:38 -0700, David Schwartz wrote:

> "James Antill" <james-...@and.org> wrote:
>> Is this a way to work around poll() scalability when epoll() etc. is
>> lacking?
>
> Largely so, but 'epoll' has its own issues.

I presume you mean that you need a syscall for every change in
notification when using level-triggered events? Or are you saying that
you've found you need to keep the number of fds per thread low, even
when using epoll()?

>> [snip generic read/write IO event loop]
>>
>>> If you set a reasonable select or poll timeout (say, 350mS), and
>>> only
>
>> Why would you force a timeout every 350mS? If there is nothing to do,
>> there is nothing to do. Or is this in lieu of having correct timers, you
>> just try them every 350mS?
>
> There is no 'block in select until a condition variable is signalled' or
> similar primitives to use. So you are left with making a mechanism of
> inter-thread communication that can be selected on (like using a pipe) or
> timing out the select. I like timing out the select better myself.

Right, sorry, this is my lack of threading insight coming from the
"always use processes not threads" view of the world.

> One huge benefit to the time out method is that your performance gets
> better under load rather than worse. Imagine you're handling 250 connections
> and you have a high rate of new incoming connections. If you break out of
> 'select' or 'poll' each time a new connection is received in order to add it
> to the fd set, you are removing yourself from 250 wait queues only to put
> yourself back on 250 queues perhaps dozens of times a second. Whereas with a
> 350mS timeout, you do so only about three times a second.

Interesting. I presume you are doing the pretest with zero timeout, which
I believed significantly mitigated this problem ... if not solved it. Do
you have any benchmarks etc.?
Also, I would have assumed that having ~175mS delay for every connection
would have done bad things.

> The pipe method is better under low load, the fixed timeout method is
> better under high load. I find it odd to tune for low load. It's when you're
> under high load that performance is important.

But under high load, not only would I expect the zero timeout pretest to
trigger more often ... but I'd assume it'd be more likely one of the other
fds is going to have an event (then again I guess on a four proc with
10_000 connections, you'd have 100/100 and I'd have 4/2500 ... so
this might well be very different).

> Again though, this is a workaround for defects in 'select'/'poll' and
> you may be able to avoid these kinds of tradeoffs if you have 'kqueue',
> '/dev/poll', or 'epoll' options.

But somewhat like the high/low load decisions, it seems weird to design
around poll() (and its deficiencies) rather than just saying "yes, you'll
want epoll()/whatever if you need large numbers of connections".

David Schwartz

May 3, 2005, 3:05:43 PM

"James Antill" <james-...@and.org> wrote in message
news:pan.2005.05.03....@and.org...

> On Mon, 02 May 2005 14:13:38 -0700, David Schwartz wrote:

>> "James Antill" <james-...@and.org> wrote:
>>> Is this a way to work around poll() scalability when epoll() etc. is
>>> lacking?
>>
>> Largely so, but 'epoll' has its own issues.

> I presume you mean that you need a syscall for every change in
> notification when using level triggered events? Or are you saying that
> you've found you need to keep the number of fd's per. thread low, even
> when using epoll()?

Under high load, 'epoll' suffers from internal overflows and falls back
to 'poll'. (Unless the design has changed since I last checked.) If a
significant fraction of your sockets are very active, 'epoll' won't buy you
very much.

>> There is no 'block in select until a condition variable is signalled'
>> or
>> similar primitives to use. So you are left with making a mechanism of
>> inter-thread communication that can be selected on (like using a pipe) or
>> timing out the select. I like timing out the select better myself.

> Right, sorry, this is my lack of threading insight coming from the
> "always use processes not threads" view of the world.

The problem with always using processes is that a page fault causes your
entire server to stall. This can cause a serious problem if your design gets
less efficient with load -- you may never catch back up from the stall.

>> One huge benefit to the time out method is that your performance gets
>> better under load rather than worse. Imagine you're handling 250
>> connections
>> and you have a high rate of new incoming connections. If you break out of
>> 'select' or 'poll' each time a new connection is received in order to add
>> it
>> to the fd set, you are removing yourself from 250 wait queues only to put
>> yourself back on 250 queues perhaps dozens of times a second. Whereas
>> with a
>> 350mS timeout, you do so only about three times a second.

> Interesting. I presume you are doing the pretest with zero timeout, which
> I believed significantly mitigated this problem ... if not solved it. Do
> you have any benchmarks etc.?

I would only use a zero timeout on those versions of Linux that had
really slow wait queue code. In all other cases, I recommend using a
sensible timeout. Just use other coding techniques to minimize the wait
queue cost.

> Also, I would have assumed that having ~175mS delay for every connection
> would have done bad things.

It only applies in two cases. One is when a connection is first made,
there is this delay before you start accepting inbound data on it. The other
is when the kernel's write buffer fills for the connection, there can be
this delay before you detect a situation where the buffer is not full.

How bad this is depends upon your protocol. For example, if you're
trying to support a very rapid transaction where immediate reception of data
is critical, this would be unacceptable. If you expect lots of transitions
from a full send buffer to an empty one (say you're sending USENET news over
a LAN), this might be unacceptable.

You have to tune your behavior to your protocol. The I/O library I
developed for the company I work for allows you to specify the type of
performance you need on the socket (low latency, high throughput, mixed) and
sets values like the number of sockets per I/O thread, initial sleep before
calling 'poll', and the 'poll' delay based on these parameters.

>> The pipe method is better under low load, the fixed timeout method is
>> better under high load. I find it odd to tune for low load. It's when
>> you're
>> under high load that performance is important.

> But under high load, not only would I expect the zero timeout pretest to
> trigger more often ... but I'd assume it'd be more likely one of the other
> fds is going to have an event (then again I guess on a four proc with
> 10_000 connections, you'd have 100/100 and I'd have 4/2500 ... so
> this might well be very different).

One big immediate difference is that if I only have one active
connection, I'm at most messing with 100 sockets. You may be messing with
2,500. So I win that one immediately.

On the other hand, if I have two active connections on different
threads, I may have a lot more context switches than you do. So you win that
one.

It's hard to argue this from a purely generic viewpoint because the
'poll' internals vary between OSes. Some always first test all sockets for a
possible immediate return before doing any wait queue work. Some start
putting the thread on wait queues until they find a socket that's ready,
then they finish checking the sockets without putting the thread on any wait
queues and then remove the thread from any wait queue it's already on.

>> Again though, this is a workaround for defects in 'select'/'poll' and
>> you may be able to avoid these kinds of tradeoffs if you have 'kqueue',
>> '/dev/poll', or 'epoll' options.

> But somewhat like the high/low load decisions, it seems weird to design
> around poll() (and it's deficiencies) rather than just saying "yes, you'll
> want epoll()/whatever if you need large numbers of connections".

You want it, but you don't always have it. Sometimes you have to design
portable code that takes advantage of these things when they're available
but still has to perform well when they're not.

DS


James Antill

May 3, 2005, 9:34:17 PM
On Tue, 03 May 2005 12:05:43 -0700, David Schwartz wrote:

> Under high load, 'epoll' suffers from internal overflows and falls back
> to 'poll'. (Unless the design has changed since I last checked.) If a
> significant fraction of your sockets are very active, 'epoll' won't buy you
> very much.

As far as I know, this is not the case; if there is a memory failure it
happens in epoll_ctl(..., EPOLL_CTL_ADD, ...). And looking at
fs/eventpoll.c doesn't show an obvious code path that handles
overflow, or otherwise falls back to pure poll() (although, obviously, if
every fd has an event you are going to have to do a similar amount of work).

Maybe you are thinking of the old Linux 2.2.x SIGRT interface?

> The problem with always using processes is that a page fault causes your
> entire server to stall. This can cause a serious problem if your design gets
> less efficient with load -- you may never catch back up from the stall.

Processes, not process. But yes, a "problem" is that a page fault makes
that process block, and a "problem" with threads is that touching shared
data will thrash the CPU cache.
However, let's agree to disagree, as I think you have about as much
chance of convincing me that threads are a good idea as I have of
convincing you they are a bad one.

>> Interesting. I presume you are doing the pretest with zero timeout,
>> which I believed significantly mitigated this problem ... if not solved
>> it. Do you have any benchmarks etc.?
>
> I would only use a zero timeout on those versions of Linux that had
> really slow wait queue code. In all other cases, I recommend using a
> sensible timeout. Just use other coding techniques to minimize the wait
> queue cost.

I wouldn't advocate _only_ using a zero timeout, however something like:

ret = poll(fds, num, 0);
if (!ret && timeout)
    ret = poll(fds, num, timeout);

...is fairly well known to be better in all cases that matter (i.e. it's
only slower when you weren't going to do anything anyway).

> It only applies in two cases. One is when a connection is first made,
> there is this delay before you start accepting inbound data on it. The
> other is when the kernel's write buffer fills for the connection, there
> can be this delay before you detect a situation where the buffer is not
> full.

The initial explanation said that the write set would be updated, thus
you'd return immediately in that case ... but anyway, yes, I
meant the added latency for HTTP (this with SMTP was one of your example
protocols) connection startup would be a bad thing.

For instance a quick strace -tt on my web server, with firefox requesting
over localhost a html page, the .css file referenced in that page
and /favicon.ico which it is wont to do (but is 404 in this case) shows:

20:56:57.984249 poll() = 1
20:57:06.145660 gettimeofday({1115168226, 145726}, NULL) = 0
20:57:06.145798 accept(3) = 5
20:57:06.145950 gettimeofday({1115168226, 145995}, NULL)
20:57:06.146047 fcntl64(5, F_SETFD, FD_CLOEXEC) = 0
20:57:06.146143 fcntl64(5, F_GETFL)
20:57:06.146242 fcntl64(5, F_SETFL, O_RDWR|O_NONBLOCK)
20:57:06.146389 time(NULL) = 1115168226
20:57:06.146505 writev() = 49 -- logging
20:57:06.178875 accept() = -1 EAGAIN

# request index.html
20:57:06.179040 readv(5) = 437
20:57:06.179399 open() = 6
20:57:06.179531 fstat64(6) = 0
20:57:06.179661 open() = 7
20:57:06.179767 fstat64(7)
20:57:06.216684 close(6)
20:57:06.217097 time(NULL) = 1115168226
20:57:06.217221 writev = 212 -- logging
20:57:06.217452 setsockopt(5, SOL_TCP, TCP_CORK, [1], 4)
20:57:06.217559 writev = 249
20:57:06.217699 sendfile64 = 1401
20:57:06.253016 close
20:57:06.253254 setsockopt(5, SOL_TCP, TCP_CORK, [0], 4)
20:57:06.351843 poll = 1
20:57:06.351989 gettimeofday({1115168226, 352035}, NULL)

# request css
20:57:06.352100 readv
20:57:06.439431 setsockopt
20:57:06.517558 poll()

# request /favicon.ico -- giving 404 (doesn't hit disk)
20:57:06.708613 readv
20:57:06.762695 setsockopt
20:57:06.780873 poll

This discounts a real network, and doesn't take into account the
strace overhead ... and lots of things change if the request gives a
dynamic response etc. Even so, with a req for a file taking 0.18
secs, and a req with a 404 taking 0.08 secs, throwing away up to 0.35 secs
in latency goes against everything I think I know.

>>> I find it odd to tune for low load. It's when you're
>>> under high load that performance is important.
>

> One big immediate difference is that if I only have one active
> connection, I'm at most messing with 100 sockets. You may be messing
> with 2,500. So I win that one immediately.

One active connection != high load, IMO. And if you are assuming that the
active connections will be evenly distributed, that seems like a false
economy.

>> But somewhat like the high/low load decisions, it seems weird to design
>> around poll() (and it's deficiencies) rather than just saying "yes,
>> you'll want epoll()/whatever if you need large numbers of connections".
>
> You want it, but you don't always have it. Sometimes you have to
> design portable code that takes advantage of these things when they're
> available but still has to perform well when they're not.

Solaris, Linux and FreeBSD all have highly scalable level trigger
IO event mechanisms. If someone cares about that level of performance,
they almost certainly have epoll() like functionality available IMO.
The poll() fallback is mainly for the 5% that don't care about high
scalability but want the rest of the functionality.

David Schwartz

May 3, 2005, 10:53:18 PM

"James Antill" <james-...@and.org> wrote in message
news:pan.2005.05.04....@and.org...

> I wouldn't advocate _only_ using a zero timeout, however something like:
>
> ret = poll(fds, num, 0);
> if (!ret && timeout)
> ret = poll(fds, num, timeout);
>
> ...is fairly well known to be better in all cases that matter (i.e. it's
> only slower when you aren't going to do anything anyway).

Some operating systems do this internally already. In those cases, doing
this again in user-space is obviously sub-optimal.

> This discounts a real network, and doesn't take into account the
> strace overhead ... and lots of things change if the request gives a
> dynamic response etc. Even so, with a req for a file taking 0.18
> secs and a req with a 404 taking 0.08 secs, throwing away up to 0.35 secs
> in latency goes against everything I think I know.

In a realistic situation, keep-alives would be used and the overhead of
initially establishing the connection would be drowned out by the benefits.
If you add on, for example, a situation where checking is done before the
connection is allowed (DNS to compare to allow/deny lists, consulting
blacklists, and so on), the delay is swamped by other delays. Again, it
depends upon your application.

>>>> I find it odd to tune for low load. It's when you're
>>>> under high load that performance is important.
>>
>> One big immediate difference is that if I only have one active
>> connection, I'm at most messing with 100 sockets. You may be messing
>> with 2,500. So I win that one immediately.
>
> One active connection != high load, IMO. And if you are assuming that the
> active connections will be evenly distributed, that seems like a false
> economy.

One active connection could be high load, if that connection is over a
gigabit network interface. Some applications have very high load from very
small numbers of connections. Again, it depends upon your application.

>>> But somewhat like the high/low load decisions, it seems weird to design
>>> around poll() (and it's deficiencies) rather than just saying "yes,
>>> you'll want epoll()/whatever if you need large numbers of connections".
>>
>> You want it, but you don't always have it. Sometimes you have to
>> design portable code that takes advantage of these things when they're
>> available but still has to perform well when they're not.
>
> Solaris, Linux and FreeBSD all have highly scalable level trigger
> IO event mechanisms. If someone cares about that level of performance,
> they almost certainly have epoll() like functionality available IMO.
> The poll() fallback is mainly for the 5% that don't care about high
> scalability but want the rest of the functionality.

Under many realistic conditions, poll's scalability is literally
perfect. If the number of discovered sockets increases proportionately with
the number of active sockets, then poll is O(1) per discovered event. Doubling
the number of sockets doubles the cost of poll, but it also doubles the amount
of work poll does, discovering twice as many sockets.

And what would you suggest for OSX? ;)

DS


William Ahern

unread,
May 4, 2005, 3:38:40 AM5/4/05
to
David Schwartz <dav...@webmaster.com> wrote:
> "James Antill" <james-...@and.org> wrote in message
> news:pan.2005.05.04....@and.org...
> > Solaris, Linux and FreeBSD all have highly scalable level trigger
> > IO event mechanisms. If someone cares about that level of performance,
> > they almost certainly have epoll() like functionality available IMO.
> > The poll() fallback is mainly for the 5% that don't care about high
> > scalability but want the rest of the functionality.

> Under many realistic conditions, poll's scalability is literally
> perfect. If the number of discovered sockets increases proportionately with
> the number of active sockets, then poll is O(1) per discovered event. Doubling
> the number of sockets doubles the cost of poll, but it also doubles the amount
> of work poll does, discovering twice as many sockets.

> And what would you suggest for OSX? ;)

OS X supports kqueue--originally from FreeBSD.

In any event you can just use libevent, a small event-loop API that
can use /dev/poll, epoll, SIGIO, kqueue, poll and select underneath. And in
later versions you can even have one event-loop dispatcher per thread ;)

- Bill

David Schwartz

unread,
May 4, 2005, 7:05:04 AM5/4/05
to

"William Ahern" <wil...@wilbur.25thandClement.com> wrote in message
news:0pamk2-...@wilbur.25thandClement.com...

>> And what would you suggest for OSX? ;)
>
> OS X supports kqueue--originally from FreeBSD.

Really?! Thanks for that information.

DS


James Antill

unread,
May 4, 2005, 11:18:45 AM5/4/05
to
On Tue, 03 May 2005 19:53:18 -0700, David Schwartz wrote:

> "James Antill" <james-...@and.org> wrote in message
> news:pan.2005.05.04....@and.org...
>

>> ret = poll(fds, num, 0);
>> if (!ret && timeout)
>> ret = poll(fds, num, timeout);
>>
>> ...is fairly well known to be better in all cases that matter (i.e. it's
>> only slower when you aren't going to do anything anyway).
>
> Some operating systems do this internally already. In those cases, doing
> this again in user-space is obviously sub-optimal.

True, but not significantly so ... as again it's only ever in the case
where there is nothing to do.

>> This discounts a real network, and doesn't take into account the
>> strace overhead ... and lots of things change if the request gives a
>> dynamic response etc. Even so, with a req for a file taking 0.18
>> secs and a req with a 404 taking 0.08 secs, throwing away up to 0.35 secs
>> in latency goes against everything I think I know.
>
> In a realistic situation, keep-alives would be used and the overhead of
> initially establishing the connection would be drowned out by the benefits.

This example used HTTP/1.1: one connection, three
requests. So, yes, the extra accept() latency would only be for the
initial request. The initial connection overhead was 0.033215 secs (most of
which was the logging, which could be faster).

> If you add on, for example, a situation where checking is done before the
> connection is allowed (DNS to compare to allow/deny lists, consulting
> blacklists, and so on), the delay is swamped by other delays. Again, it
> depends upon your application.

I can see this for SMTP, where it's common to have RBL lookups or SMTP
callbacks etc. But for HTTP?

> Under many realistic conditions, poll's scalability is literally
> perfect. If the number of discovered sockets increases proportionately with
> the number of active sockets, then poll is O(1) per discovered event. Doubling
> the number of sockets doubles the cost of poll, but it also doubles the amount
> of work poll does, discovering twice as many sockets.

This doesn't change if you don't artificially limit to 100 fds though.

> And what would you suggest for OSX? ;)

An install CD? :)
But as the other poster said:

http://developer.apple.com/documentation/Darwin/Reference/ManPages/man2/kqueue.2.html
