http://beej.us/guide/bgnet/output/html/singlepage/bgnet.html#poll
I am leaning more towards poll(). I have finally completed my first server
side socket excepting for recv() and a multiplexing function; such as
poll(). Thanks for any opinions or ideas.
Bill
Oops that the right part of the page. Here's what I have here with error
checking out for brevity and no recv() or send(). It compiles and that was
my main concern. Now after recv() is added I need poll() or select().
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
struct sockaddr_storage theirs;
struct addrinfo ad, *p;
int sock, nsock;
socklen_t stheir;
int main()
{
memset(&theirs, 0, sizeof theirs);
memset(&ad, 0, sizeof ad);
ad.ai_family = PF_INET;
ad.ai_socktype = SOCK_STREAM;
ad.ai_flags = AI_PASSIVE;
getaddrinfo(NULL, "21", &ad, &p);
sock = socket(p->ai_family, p->ai_socktype, 0);
bind(sock, p->ai_addr, p->ai_addrlen);
listen(sock, 5);
stheir = sizeof theirs;
nsock = accept(sock, (struct sockaddr *) &theirs, &stheir);
return 0;
}
Bill
I recommend using 'poll' for two reasons.
First, it distinguishes between a closed/error connection and a ready
connection immediately, whereas 'select' makes you call a reading or
writing function to tell. This simplifies the fast paths for
connection shutdown. (But you still have to handle a close/error from
'read' or 'write' since the error or close may occur later.)
Second, it doesn't become inefficient when the descriptor index is
large. Due to the way the 'select' sets are composed, 'select'ing on
higher-numbered descriptors is less efficient and can get you into
trouble depending on how the C library was compiled.
DS
I always use poll because I consider the interface to be saner.
- Passing a bitmap of interesting file descriptors was a
'clever hack' by the time select was invented --- due to
limitations in the Berkeley UNIX(*) kernels of this time,
any theoretically possible descriptor set would fit into a
32bit word (the maximum number of open files for a process
was <32). Nowadays, it is an idiotic[*] idea (cf David's
remark regarding descriptor numbers)
- When using poll, the data structures defining the interest
set are under application control because the kernel uses a
special structure member to communicate event
information. For select, the kernel destroys the
interest set because the same memory area must be used to
provide 'result information' to userspace. This implies
that the interest set bitmasks must be recreated for each
select-call which is cumbersome and wasteful
- an current fd set is much larger that a few pollfd
structures, potentially[**] causing a lof of uninteresting
bytes to be copied into and out of the kernel
[*] Originally, 'idiot' was used to describe a person
incapable of learning anything because of the unshakeable to
conviction to already know everything.
[**] It is possible to 'optimize' this copy process for
suitable (non-sparse) descriptor sets. The computer then needs
to load a lot of code from memory instead of a lot of data ...
OTOH with poll you have to manually maintain the array of pollfd structures
you're passing in. With select you can let the FD_ macros do all the work for
you. Unless you hard code the array size then any efficiency gains you
get through using poll() you'll soon lose by having to use a vector<> or
realloc()
Also poll doesn't return the amount of time left if in timeout mode and
something happens unlike select.
B2003
> OTOH with poll you have to manually maintain the array of pollfd structures
> you're passing in. With select you can let the FD_ macros do all the work for
> you.
I honestly don't see how there's any difference. You have to call the
FD_ macros. The only reason you need the FD_ macros is because the
memory format of the structures is not specified. Being forced to
interact with the structures through the narrow interface of the
macros is, IMO, a limitation of 'select'.
> Unless you hard code the array size then any efficiency gains you
> get through using poll() you'll soon lose by having to use a vector<> or
> realloc()
That presents two false alternatives. There are many other possible
solutions. For example, you can reallocate the array only when the
highest-numbered file descriptor you've ever handled increases. You
can expand the array by, say, 16 entries when that happens and never
decrease it. This will have no measurable effect on efficiency and
doesn't require any hard coded size.
> Also poll doesn't return the amount of time left if in timeout mode and
> something happens unlike select.
You mean "unlike what my implementation of 'select' happens to do".
For example, SuSv2 says only "On successful completion, the object
pointed to by the timeout argument may be modified." That doesn't
sound all that useful to me.
DS
So you would say go with select() then? I'm certainly not knowledgable
enough to do anything manually here. I'm going to log into my server with
telnet. I know there's that timeval struct with poll().
Bill
> I recommend using 'poll' for two reasons.
> Second, it doesn't become inefficient when the descriptor index is
> large. Due to the way the 'select' sets are composed, 'select'ing on
> higher-numbered descriptors is less efficient and can get you into
> trouble depending on how the C library was compiled.
While it handles large indexes better than select, poll still becomes
inefficient when the total number of descriptors is large.
For really high-performance servers that need to deal with large numbers
(tens of thousands) of descriptors, POSIX doesn't really have a
solution. Various OS-specific options exist (kqueue on BSD, epoll on
Linux, /dev/epoll on Solaris, etc.)
Chris
> I always use poll because I consider the interface to be saner.
[...]
What I am seeing is that I'm going to need a recv() and send() somewhere
too and a loop. Maybe more than one. I have purchased the Stevens book "Unix
network programming" in hopes that that will help. Alas buying books just to
go out of date is not my number 1 idea. I was hoping for internet tutorials
or wikis.
Bill
> While it handles large indexes better than select, poll still becomes
> inefficient when the total number of descriptors is large.
Only if comparatively few of the descriptors are active. The cost of a
call to 'poll' is roughly O(n). If the number of descriptors
discovered ready is usually one, that sucks. But if the number of
descriptors discovered per call is also O(n), then its efficiency does
not go down as the number of descriptors goes up.
In fact, with more descriptors, the amount of time between calls to
'poll' goes up. If you assume the amount of network activity is
constant, that means more descriptors discovered per call to 'poll'.
This means 'poll' actually becomes more efficient as the number of
descriptors goes up. (If, and only if, your access pattern is like
this one.)
However, the worst-case behavior of 'select' and 'poll' are pretty
awful.
> For really high-performance servers that need to deal with large numbers
> (tens of thousands) of descriptors, POSIX doesn't really have a
> solution. Various OS-specific options exist (kqueue on BSD, epoll on
> Linux, /dev/epoll on Solaris, etc.)
And many of these techniques outperform 'select' or 'poll' even when
the number of descriptors is small, so the only downside is usually
added implementation complexity. Also important, the worst-case
behavior of these mechanisms is *way* better than 'select' or 'poll'.
If you typically have a small number of active descriptors and a large
number of inactive descriptors, you need to implement these high-
performance mechanisms.
DS
This statement doesn't really make sense. 'Big Oh' is supposed to
communicate some information regarding how the execution cost of an
algorithm changes in response to changes in the number of items it
needs to process. Invoking poll and returning from poll requires that
a struct pollfd is copied in and out of the kernel for each descriptor
in the interest set. This copy requires at least an O(n) algorithm,
meaning, the cost of it will be proportional to the number of
descriptors which are supposed to be polled. The cost of the userspace
algorithm examining the results is proportional to the average
distance between the first and last active descriptor in the struct
pollfd array which may or may not increase when the amount of
supposed-to-be-polled descriptors increases.
But the real issue is actually that a struct pollfd is a fairly large
thing. It needs 64 bits of data, meaning, for more than 16
descriptors, the copying cost associated with invoking poll and
returning from poll will exceed the copying cost of doing the same
with select.
> This statement doesn't really make sense. 'Big Oh' is supposed to
> communicate some information regarding how the execution cost of an
> algorithm changes in response to changes in the number of items it
> needs to process. Invoking poll and returning from poll requires that
> a struct pollfd is copied in and out of the kernel for each descriptor
> in the interest set. This copy requires at least an O(n) algorithm,
> meaning, the cost of it will be proportional to the number of
> descriptors which are supposed to be polled. The cost of the userspace
> algorithm examining the results is proportional to the average
> distance between the first and last active descriptor in the struct
> pollfd array which may or may not increase when the amount of
> supposed-to-be-polled descriptors increases.
Exactly. Now, imagine you double the number of descriptors. The cost
of the 'poll' operation itself doubles. But the number of descriptors
discovered ready will more than double. It will double simply because
twice as many descriptors means twice as many ready descriptors in the
same amount of time (assuming the average activity per descriptor is
constant). But double the amount of work will mean more time elapsed
between two calls to 'poll'.
So if you double the number of descriptors, each call to 'poll' takes
twice as long, but discovers more than twice the number of ready
descriptors. The cost of 'poll' per unit of work done goes *down* as
the number of descriptors goes up. (Assuming you don't have a
degenerate case like only one active descriptor. Again, 'poll's worst-
case behavior is *really* bad.)
DS
It doesn't just "happen to do". Its been the de facto behaviour for years
and is bloody useful if you're writing a realtime app.
B2003
poll() is probably more efficient, but select() is easier to use for someone
not used to writing server network code.
B2003
Anyone who has a program that is dealing with thousands of descriptors
should really think about going multiprocess or multithreaded. A single
threaded program dealing with that many descriptors sounds like a recipe
for inefficient processing.
B2003
How many people have already asked here what could be wrong with their
select loops who didn't understand that the information they passed
into the kernel is no longer contained in the memory area they wrote
it to after the call? And how about the other classic, that the
nfds-member isn't used to pass the number of descriptors but the
numerically largest descriptor passed to the kernel + 1?
"Look ma, I dug a tunnel through the floor! After all, anyone could
use the door!
I never had a problem understanding that it changes the bits - otherwise how
the hell is it going to work?
>nfds-member isn't used to pass the number of descriptors but the
>numerically largest descriptor passed to the kernel + 1?
You pass in a constant - FD_SETSIZE - and forget about it. Easy.
>"Look ma, I dug a tunnel through the floor! After all, anyone could
>use the door!
Whatever. Try explaining to someone new that to use poll they'll have to do
their own memory management or artificially limit the number of descriptors
they'll use.
B2003
You claimed that it would be 'easier to use'. I quoted two frequent
problems so-called 'newbies' typically have with select. Knowledge
about the intracies of bitmaps isn't something people are born with.
You claimed that it would be 'easier to use'. I quoted two frequent
problems so-called 'newbies' typically have with select. Knowledge
about the intricacies of bitmaps isn't something people are born with.
If they don't understand bitmaps then they haven't learnt C properly and
so they're unlikely to understand memory management properly either.
B2003
> Try explaining to someone new that to use poll they'll have to do their
> own memory management or artificially limit the number of descriptors
> they'll use.
My problem with poll() is this:
POLLIN - Data other than high-priority data may be read without blocking.
POLLRDNORM - Normal data may be read without blocking.
POLLRDBAND - Priority data may be read without blocking.
POLLPRI - High-priority data may be read without blocking.
POLLOUT - Normal data may be written without blocking.
POLLWRNORM - Equivalent to POLLOUT.
POLLWRBAND - Priority data may be written.
What's the difference between POLLIN and POLLRDNORM? (Especially wrt.
POLLOUT being equivalent to POLLWRNORM.) And how does POLLRDBAND differ
from POLLPRI? How do they all map to normal TCP payload as opposed to
urgent bytes? What is the effect of POLLWRBAND on TCP, since select()
doesn't offer an fd_set for writing "exceptional data"?
IMO, select() is a more intuitive and simpler API, while poll() is very
STREAMS specific. pselect() is an interface that allows a thread to wait
safely and portably on input, output, signals, and timeout. I'm inclined
to find it telling that pselect() was added while "ppoll()" was not.
Hm, hm, hm, POLLHUP seems to open a can of worms too.
http://www.opengroup.org/onlinepubs/9699919799/functions/poll.html
----v----
POLLHUP
A device has been disconnected, or a pipe or FIFO has been closed by the
last process that had it open for writing. Once set, the hangup state of a
FIFO shall persist until some process opens the FIFO for writing or until
all read-only file descriptors for the FIFO are closed. This event and
POLLOUT are mutually-exclusive; a stream can never be writable if a hangup
has occurred. However, this event and POLLIN, POLLRDNORM, POLLRDBAND, or
POLLPRI are not mutually-exclusive. This flag is only valid in the revents
bitmask; it shall be ignored in the events member.
----^----
Notice "[t]his event and POLLOUT are mutually-exclusive". If select()
reports readability on a TCP socket file descriptor and read() (or its
equivalents) returns (ssize_t)0, I'm still allowed to write() to the
socket. If the peer only issued shutdown(fd, SHUT_WR) and the underlying
implementation sent a FIN packet to me accordingly, I can still send more
data in state CLOSE_WAIT, and the peer that started with active close can
read the data in state FIN_WAIT_2.
Indeed, when I wrote my TCP port forwarder to teach myself select() and
co., I took care to support half-open sockets. IIRC, an HTTP RFC or some
such even mentions this method as delineating the end of the request from
the client. The server receives the FIN, but it can still send back the
response through the half-open (half-closed) socket.
This makes it possible for poll() to signal POLLIN without POLLHUP, then
for read() to return 0 ever after. Not very intuitive.
Also notice "this event and POLLIN, POLLRDNORM, POLLRDBAND, or POLLPRI are
not mutually-exclusive". Specifying POLLHUP in /events/ without POLLIN
seems to make no sense. I'm not even sure a TCP implementation *could*
detect a POLLHUP condition without reading and acknowledging all
previously written data by way of POLLIN. (If the data is not
acknowledged, the sender will block after a while and the protocol simply
won't reach the point where the sender could issue a shutdown.) So waiting
solely for POLLHUP probably deadlocks a pair of peers communicating over
TCP.
Furthermore, provided I specify POLLIN | POLLHUP in /events/, I can't just
close (or shut down) on receiving a POLLHUP in /revents/; I have to check
for POLLIN anyway. If POLLIN is set, I have to call read() or a similar
function. I think that even when read() will return (ssize_t)0, an
implementation is allowed to set POLLIN beside POLLHUP, because the read()
will not block (or was that POLLRDNORM again?). So the availability of
POLLHUP in the standard won't necessarily save me a read() call on EOF.
poll() is needlessly complex for non-STREAMS-based files. Where
portability matters more than performance, I'd stick to select() under
SUSv[12] and pselect() under SUSv[34]. Otherwise, I'd probably use
whatever implementation-dependent fast API is present.
Cheers,
lacos
[...]
> I'd rather stick to select()
I believe I have edited you text suitably to make the actual content
visible. But this is a statement of nostalgic, emotional affection
and somewhat misplaced in a technical discussion.
> "Ersek, Laszlo" <la...@caesar.elte.hu> writes:
>> On Tue, 4 May 2010, bolta...@boltar.world wrote:
>>> Try explaining to someone new that to use poll they'll have to do
>>> their own memory management or artificially limit the number of
>>> descriptors they'll use.
>>
>> My problem with poll() is this:
>
> [...]
>
>> I'd rather stick to select()
>
> I believe I have edited you text suitably [...]
Accidental omission: The 'rather' was added by me.
> "Ersek, Laszlo" <la...@caesar.elte.hu> writes:
>> On Tue, 4 May 2010, bolta...@boltar.world wrote:
>>> Try explaining to someone new that to use poll they'll have to do
>>> their own memory management or artificially limit the number of
>>> descriptors they'll use.
>>
>> My problem with poll() is this:
>
> [...]
>
>> I'd rather stick to select()
>
> I believe I have edited you text suitably to make the actual content
> visible.
Thank you for your effort.
> But this is a statement of nostalgic, emotional affection and somewhat
> misplaced in a technical discussion.
You could have tried to point out how to use the events / revents masks
for TCP, wrt. EOF, half-closed sockets, and urgent data. Or how to use
poll() correctly in combination with changing signal masks. Or what
POLLxxxx flags can be portably assumed equivalent in addition to what the
standard guarantees, and which POLLxxxx events can be ignored for TCP. If
these are obvious, you could have cited references.
BTW, your comment, attempting solely to deride me, is somewhat misplaced
in a technical discussion. You could have ignored my post, for example.
Have a nice day,
lacos
> BTW, your comment, attempting solely to deride me, is somewhat
> misplaced in a technical discussion.
I am honestly convinced that you (and the Boltar-person) are trying to
make up complicated problems based on hypothetical possibilities where
none actually exist in order to distract from the discussion of the
properties of different kinds of 'synchronous I/O multiplexing
interfaces' whose contents would otherwise not be completely favorable
regarding a solution both of you happen to prefer. And my 'comment'
attempts to communicate this in a succinct way instead falling into
one of your cleverly conceived 'bottomless side-issue trapdoors'.
> I am honestly convinced that you (and the Boltar-person) are trying to
> make up complicated problems
I can't prove I do *not* want to do something.
> based on hypothetical possibilities where none actually exist in order
> to distract from the discussion
Why would I want to distract?
> of the properties of different kinds of 'synchronous I/O multiplexing
> interfaces' whose contents would otherwise not be completely favorable
> regarding a solution both of you happen to prefer.
Oh. So you think I'm trying to derail the discussion with personal
anecdotes because I'm supposedly afraid you might be able to prove poll()
superior, and then my life would have been in vain, writing all those
correct but suboptimally performing select() loops!
I see your point. It's wrong. I like to understand function specifications
in depth before calling said functions. No matter how superior you prove
poll() if I'm not sure I can call it correctly. When I say "I'd stick to
select() [unless you prove poll() un-messy -- please do!]", isn't that
invitation for you to prove poll() un-messy? I'm setting up my argument
for refutation by stating it as a fact. That should elicit more responses
from subscribers than asking questions. (Which I also did ask.) Instead of
disproving my post by describing poll()'s portably correct usage, you
accuse me of malice.
> And my 'comment' attempts to communicate this in a succinct way instead
> falling into one of your cleverly conceived 'bottomless side-issue
> trapdoors'.
I'm flattered.
(Though a bit dismayed seeing my evil plan laid bare.)
Cheers,
lacos
> >And many of these techniques outperform 'select' or 'poll' even when
> >the number of descriptors is small, so the only downside is usually
> >added implementation complexity. Also important, the worst-case
> >behavior of these mechanisms is *way* better than 'select' or 'poll'.
> Anyone who has a program that is dealing with thousands of descriptors
> should really think about going multiprocess or multithreaded. A single
> threaded program dealing with that many descriptors sounds like a recipe
> for inefficient processing.
Right, but whether you use multiple processes, multiple threads, or a
single thread, you still have to discover which sockets you need to do
I/O on somehow.
DS
Try reading your man pages.
--
Ian Collins
> I see your point. It's wrong. I like to understand function specifications
> in depth before calling said functions. No matter how superior you prove
> poll() if I'm not sure I can call it correctly. When I say "I'd stick to
> select() [unless you prove poll() un-messy -- please do!]", isn't that
> invitation for you to prove poll() un-messy? I'm setting up my argument
> for refutation by stating it as a fact. That should elicit more responses
> from subscribers than asking questions. (Which I also did ask.) Instead of
> disproving my post by describing poll()'s portably correct usage, you
> accuse me of malice.
As I recall, 'select' actually takes three FD sets. Does anyone know
precisely what that third set is for?
DS
Actually I was just telling my point of view. I'm sorry if you have a problem
with people disagreeing with you. I'm glad I don't have to work with people
like you who think they know everything and are convinced they're always
right.
B2003
An observation I have made in the past is that people tend to 'find'
their own vices in others.
> If they don't understand bitmaps then they haven't learnt C properly and
> so they're unlikely to understand memory management properly either.
Well, part of the problem is that they don't necessarily even know
that they're bitmaps. You'll seldom see that mentioned in the
documentation for 'select' and nothing in any standard (as far as I
know) requires them to be bitmaps.
The fact is, there's a set of classic errors people make in 'select'
that we see all the time. Failing to reinitialize the fd sets is a
mistake almost everyone seems to make the first time they use
'select'. There really are no such errors with 'poll'. I don't think
this is because 'poll' is simpler, I think it's largely because people
cut their teeth on 'select' and don't even try to mess with 'poll',
typically, until they're more experienced.
In fact, the only common error I see with select is thinking that
'poll' will return writability or readability if the connection closes
or errors. And the only people expect this behavior, typically, is
because it's what 'select' does. (Had they started out with 'poll',
they likely wouldn't have expected this.)
Frankly, I don't think easier is simpler or more complicated than the
other. They both have their little quirks that bite people (like maxfd
+1 in 'select' and that you have to test for events you didn't ask for
in 'poll') and they both have their deep pits of confusion (like
POLLPRI versus POLLRDBAND for 'poll' and like what's an 'exception'
for 'select').
For a typical program handling a small number of connections with no
significant performance requirements, they behave identically.
DS
[...]
> I like to understand function specifications in depth before calling
> said functions. No matter how superior you prove poll() if I'm not
> sure I can call it correctly. When I say "I'd stick to select()
> [unless you prove poll() un-messy -- please do!]", isn't that
> invitation for you to prove poll() un-messy?
As I already wrote: You are (correctly) seeing all kinds of
theoretical problems which practictally don't exist. I can give you an
entirely practical viewpoint in the hope that it will communicate
something sensible. I cannot go and change Linux manpages (except on
my own system) or even the UNIX(*) standard text.
I care for exactly two things wrt 'I/O multiplexing', namely 'can I
read data' and 'can I write pending data'. I have so far not found any
uses for 'TCP urgent data'[*]. I don't care for STREAMS. They are not
supported on Linux and hence, irrelevant for new code[**]. This implies
that I use POLLIN and POLLOUT (when appropriate) for events and
usually map anything 'weird' which might be returned in revents to
POLLIN. If it is an EOF, read will detect that. The same is true for
any kind of error condition. I have so far not had any problems which
could have been solved by ppoll (which exists on Linux, anyway) which
didn't have another usable solution.
[*] People who want to use 'priorization' are usually confused
and their actual problem is lack of resources. An example I
have used to explain that in the past: Assuming twenty people
were invited to a barbecue and, after all of them arrived, it
was discovered that only ten steaks are available. There are
two sensible solutions to this problem, namely, buy more
steaks or divide the existing ones among the guests. And there
is the 'electrical engineering approach': Shoot ten of the
guests. For obvious reasons, nobody except electrical
engineers considers this to be a viable option and even
electrical engineers only do so if they are dealing with other
people's guests.
[**] My boss would presumably roast me on slowly on a small
fire if I would make major efforts to support 'weird operating
systems' because the code could theoretically also run there.
Minor portability problems are best addressed once the code is
actually in need of being ported.
Lastly, I don't write "real-time applications" for systems which have
nothing better to do than to uselessly copy 48K of data around for
every I/O operations (simplified), with me being nevertheless
convinced that this cannot possibly be something I am doing wrong.
> Try reading your man pages.
I don't like to. I rather read the SUS.
(I'm currently in a situation where this approach is feasible.)
Thanks,
lacos
> I care for exactly two things wrt 'I/O multiplexing', namely 'can I
> read data' and 'can I write pending data'. I have so far not found any
> uses for 'TCP urgent data'[*].
(Neither have I -- I just wanted my port forwarder to support it, partly
for telnet, partly for learning.)
> I don't care for STREAMS. They are not supported on Linux and hence,
> irrelevant for new code[**].
> [**] My boss would presumably roast me on slowly on a small
> fire if I would make major efforts to support 'weird operating
> systems' because the code could theoretically also run there.
> Minor portability problems are best addressed once the code is
> actually in need of being ported.
I used to have different priorities. I guess that makes me overzealous
towards portability. My apologies.
> This implies that I use POLLIN and POLLOUT (when appropriate) for events
> and usually map anything 'weird' which might be returned in revents to
> POLLIN. If it is an EOF, read will detect that. The same is true for any
> kind of error condition.
(I kind of feel a contradiction between this and:
----v----
From dav...@webmaster.com Wed May 5 12:49:19 2010
Date: Wed, 5 May 2010 03:49:19 -0700 (PDT)
From: David Schwartz <dav...@webmaster.com>
Newsgroups: comp.unix.programmer
Subject: Re: experienced opinions
[snip]
In fact, the only common error I see with select is thinking that 'poll'
will return writability or readability if the connection closes or errors.
And the only people expect this behavior, typically, is because it's what
'select' does. (Had they started out with 'poll', they likely wouldn't
have expected this.)
[snip]
----^----
but I rest my case.)
> I have so far not had any problems which could have been solved by ppoll
> (which exists on Linux, anyway) which didn't have another usable
> solution.
So there is a ppoll(); great!
--o--
Thank you very much. I really appreciate this!
lacos
An observation I'll make now is that you're full of shit.
B2003
This is an observation about one million violent football
fans and other people of a similar mental disposition have also
already made. I believe that your discussion tactic wrt select wasn't
entirely honest: The memory-management statement was a red herring and
you should know that yourself. You wrote about 'arbitrarily limiting
the amount of descriptors which can be handled' as if this was an
unquestionably bad thing. But a select fd set is just a bit array of
an arbitrary 'large enough' size which has exactly the same property.
Thank you for proving my point. Now have the last word and go away.
B2003
,----
| The memory-management statement was a red herring and
| you should know that yourself. You wrote about 'arbitrarily limiting
| the amount of descriptors which can be handled' as if this was an
| unquestionably bad thing. But a select fd set is just a bit array of
| an arbitrary 'large enough' size which has exactly the same property.
`----
How about adressing that instead of trying to silence people who
happen to disagree with you for rational reasons they have tried to
communicate by heaping abuse onto them?
[...]
>> and usually map anything 'weird' which might be returned in revents
>> to POLLIN. If it is an EOF, read will detect that. The same is true
>> for any kind of error condition.
>
> (I kind of feel a contradiction between this and:
>
> ----v----
> From dav...@webmaster.com Wed May 5 12:49:19 2010
> Date: Wed, 5 May 2010 03:49:19 -0700 (PDT)
> From: David Schwartz <dav...@webmaster.com>
> Newsgroups: comp.unix.programmer
> Subject: Re: experienced opinions
>
> [snip]
>
> In fact, the only common error I see with select is thinking that
> poll' will return writability or readability if the connection closes
> or errors.
[...]
There is none. POLLHUP and POLLERR are two revents-only values which
report possibly interesting connection state changes other than 'data
to read available' or 'bufferspace to write data into
available'. Since both errors and 'hangups' can occur during normal
I/O operations, the respective input- and output-handlers need to be
able to deal with these conditions, anyway, and there is no reason to
care specially for either of both. Because of this, I usually do
something like
if (revents & ~(POLLIN | POLLOUT)) revents |= POLLIN;
and let the ordinary input handler deal with it.
> There is none. POLLHUP and POLLERR are two revents-only values which
> report possibly interesting connection state changes other than 'data
> to read available' or 'bufferspace to write data into
> available'. Since both errors and 'hangups' can occur during normal
> I/O operations, the respective input- and output-handlers need to be
> able to deal with these conditions, anyway, and there is no reason to
> care specially for either of both. Because of this, I usually do
> something like
>
> if (revents & ~(POLLIN | POLLOUT)) revents |= POLLIN;
>
> and let the ordinary input handler deal with it.
I sometimes wonder why 'poll' didn't do that by default and let you
ignore IN if HUP or ERR is set if you want to. Either way would work,
but settings POLLIN would be more consistent with how 'select'
behaves. I wonder if this was felt to be a deficiency in 'select'.
In any event, it's a minor issue. You can certainly treat HUP/ERR the
same way as IN if you want. Sometimes it makes your code more
efficient not to, but it does complicate handling half-open
connections if you need to.
The point is that both 'select' and 'poll' have small wrinkles that
can bite a first-time user. None of this is a big deal though, every
interface is like that.
DS
Oh, now I see it. I misunderstood your topmost sentence:
>> On Wed, 5 May 2010, Rainer Weikusat wrote:
>>> and usually map anything 'weird' which might be returned in revents to
>>> POLLIN. [...]
I interpreted this as
"and usually map anything 'weird' which might be returned in revents to
POLLIN [in *events*]",
while what you actually meant was
"and usually map anything 'weird' which might be returned in revents to
POLLIN [in *revents*]".
The second form corresponds to the code you posted.
The first (misinterpreted) form does match what David describes as an
error, doesn't it? The first form says "I'll just set POLLIN in /events/
and I'll get POLLIN in /revents/ too if anything weird happens". Stuck in
the select() mindset, I am :)
Thanks again,
lacos
> On May 4, 10:01�am, "Ersek, Laszlo" <la...@caesar.elte.hu> wrote:
>
>> I like to understand function specifications in depth before calling
>> said functions.
>
> As I recall, 'select' actually takes three FD sets. Does anyone know
> precisely what that third set is for?
SUSv4 XSH 2.10.11 "Socket Receive Queue" [0] and onwards describes
"out-of-band data" and "out-of-band data mark". The word "segment" is used
in a logical sense, not TCP segment.
The select() spec [1] says:
----v----
The pselect() function shall examine the file descriptor sets whose
addresses are passed in the /readfds/, /writefds/, and /errorfds/
parameters to see whether some of their descriptors are ready for reading,
are ready for writing, or have an exceptional condition pending,
respectively.
[...]
If a socket has a pending error, it shall be considered to have an
exceptional condition pending. Otherwise, what constitutes an exceptional
condition is file type-specific. For a file descriptor for use with a
socket, it is protocol-specific except as noted below. [...]
If a descriptor refers to a socket, the implied input function is the
/recvmsg()/ function with parameters requesting normal and ancillary data,
such that the presence of either type shall cause the socket to be marked
as readable. The presence of out-of-band data shall be checked if the
socket option SO_OOBINLINE has been enabled, as out-of-band data is
enqueued with normal data. [...]
[...]
A socket shall be considered to have an exceptional condition pending if a
receive operation with O_NONBLOCK clear for the open file description and
with the MSG_OOB flag set would return out-of-band data without blocking.
(It is protocol-specific whether the MSG_OOB flag would be used to read
out-of-band data.) A socket shall also be considered to have an
exceptional condition pending if an out-of-band data mark is present in
the receive queue. Other circumstances under which a socket may be
considered to have an exceptional condition pending are protocol-specific
and implementation-defined.
----^----
<rant>
The text lists "state -> exceptional condition" implications. When one
checks a bit in the third fd_set, he needs the reverse direction:
"exceptional condition -> what state?". In my interpretation, at least the
following situations are possible:
(1) Pending error -- use getsockopt(sock, SOL_SOCKET, SO_ERROR, ...).
Continuing with TCP in mind, which should
- support out-of-band data,
- support the out-of-band data mark,
- enqueue out-of-band data at the end of the queue,
- not place ancillary-data-only segments in the queue (that is, segments
with neither normal nor out-of-band data),
(2) An out-of-band data mark is present in the Receive Queue. (Regardless
of whether SO_OOBINLINE was set.)
Out-of-band data may not be readable without blocking when the third
fd_set fires -- only the mark may be present. Supposing TCP over IP(v4), a
large TCP segment may be fragmented. When the first fragment (containing
the TCP header and thus the urgent pointer) is processed, the mark may be
placed immediately in the queue.
logical segment | hole to be filled | mark | expected logical segment
with normal data | with normal data | | with out-of-band data
For me this means that the third fd_set can't be used at all in a select()
that is meant to block, as such a select may return immediately, and a
subsequent blocking receive call may still block (or a nonblocking one may
still return with -1/EAGAIN (or EWOULDBLOCK), resulting in spinning).
(The figure above is intended to display the in-line enqueueing of
out-of-band data, but it really doesn't matter now -- it would only effect
how that data would be available (or how that data would become lost) once
we got to the mark.)
My approach is to set SO_OOBINLINE. This allows me to work with the first
two sets only. Errors are returned with read()/write(). The finalization
of a pending connect() is signalled as writability, and the result can be
queried via SO_ERROR.
The presence of an out-of-band data *mark* without the OOB data itself
won't wake the select(). When woken and FD_ISSET() reports readability,
out-of-band data is simply checked for with sockatmark() before each
read(). No read() will coalesce normal and OOB data. If sockatmark()
returns 1, the next byte to read() is the urgent byte. (No MSG_OOB is
needed.)
If the kernel processes multiple TCP headers with urgent pointers before I
get to call sockatmark(), each single urgent byte but the last one will be
inlined in the normal data stream. (I seem to remember that I experimented
with this on Linux, and this was the case even with SO_OOBINLINE turned
off. An urgent byte was only dropped if SO_OOBINLINE was turned off and I
actively read past the mark, before the mark was moved. So I didn't really
care about race conditions.)
(I'm not sure how one could do without SO_OOBINLINE. A TCP segment
carrying a single urgent byte wouldn't wake a select() that didn't pass a
third fd_set. On the other hand, a select() passing a third fd_set could
lead to indefinite spinning, or indefinite blocking.)
The problem with sockatmark() is that it was first introduced in SUSv3. I
wrote my port forwarder for SUSv1. I think I worked it around by calling
select() in a non-blocking manner right before and after the read(), to
see if there was an exceptional condition pending that ceased due to the
read(). (If the "before" condition was true due to an error, then the
"after" condition wasn't checked, because read() returned that error
first.) I chose this as the default workaround.
I (hopefully) implemented this before-after check "primitive" with recv()
too. Once the SO_OOBINLINE socket signalled readability, I temporarily
turned off SO_OOBINLINE, and tried to read a single byte with recv(...,
MSG_PEEK | MSG_OOB), then turned SO_OOBINLINE back on. A successful
receive meant "at the mark" (and left the urgent byte over to the normal,
subsequent read), -1/EINVAL meant "no mark", and -1/EAGAIN (or
EWOULDBLOCK) meant "mark nearby". (This was no problem, because a
before-after "mark nearby" -> "no mark" transition, due to the normal,
in-lined read(), was interpreted as "urgent data consumed" just the same.)
Looking back, perhaps I could have found a third way: recvmsg() can report
on output, in msg_flags, whether it consumed out-of-band data. Since OOB
(logical) segments don't coalesce with normal (logical) segments,
recvmsg() would have consumed a sole byte at these times. I'm not sure how
I would have had to fiddle with SO_OOBINLINE, though.
I never considered SIOCATMARK.
(If anyone cares, the code is "forward3.c" under [2]; most recently, it's
been running on my workstation since Mar 25 to log SOAP.)
I would never touch out-of-band data again; not with a ten foot pole.
--o--
If we're already talking about what socket functions precisely do, can
anyone judge whether both Solaris' and Linux' handling of accept() are
POSIX conformant, when accept() fails with -1/EMFILE (or due to another
resource scarcity)? On Linux, such an event throws away the pending
connection (I think the peer gets a FIN instead of an RST because the
kernel pre-completes the handshake). On Solaris, the incoming connection
remains pending, and one must not FD_SET the listening socket before the
next close(), or else the select()-accept() loop will spin.
Any constructive criticism is greatly appreciated.
</rant>
Cheers,
lacos
[0] http://www.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_10_11
[1] http://www.opengroup.org/onlinepubs/9699919799/functions/select.html
[2] http://lacos.hu
> I would never touch out-of-band data again; not with a ten foot pole.
Agreed. It was a bad idea, poorly executed, that has festered since.
> If we're already talking about what socket functions precisely do, can
> anyone judge whether both Solaris' and Linux' handling of accept() are
> POSIX conformant, when accept() fails with -1/EMFILE (or due to another
> resource scarcity)? On Linux, such an event throws away the pending
> connection (I think the peer gets a FIN instead of an RST because the
> kernel pre-completes the handshake). On Solaris, the incoming connection
> remains pending, and one must not FD_SET the listening socket before the
> next close(), or else the select()-accept() loop will spin.
The 'select'/'accept' loop should spin in that case. This is why you
must perform operations that reduce resource consumption before
operations that increase them. And if you make no forward progress,
you must implement a rate-limiter.
And if you detect resource exhaustion, you must do something about it!
If the implementation tells you that you have too many open files, it
is not sensible to react by trying to open more files.
I find both behaviors sensible, FWIW. I like Solaris' behavior
because, knowing the behavior, it's easier to code around it sensibly.
Linux's behavior works better if you don't consider the case
specifically.
Generally, in the case where you can't accept any more incoming
connections, you need/want to close any attempts as quickly as
possible. Sensible clients will interpret this as an overload
condition.
DS
> Generally, in the case where you can't accept any more incoming
> connections, you need/want to close any attempts as quickly as possible.
> Sensible clients will interpret this as an overload condition.
Thank you for the advice.
I considered closing the server socket and setting it up again, so as to
"flush" all connection requests pending in the listen queue. However, I
was afraid that I might not be able to re-bind the same local address
(even with SO_REUSEADDR, another process might "steal" the port), and then
all would be lost.
I implemented a primitive "rate limiter": no more connections accepted
until a living socket is closed. A client waiting for an acknowledgement
of its connect() may time out, but for the caliber at hand this approach
seemed workable.
How could the Solaris implementation be made refuse new connections if the
"rate limiter" is in effect? Simply by setting up a low backlog value with
the initial listen()? Or by manipulating the backlog dinamically with
repeated listen() calls during peaking loads?
(I don't know if that's possible at all. The listen() spec in the SUSv4
[0] doesn't seem to disallow it or to define an error condition for it. It
could be an interesting experiment to see whether, with say 16 connections
pending and waiting for an accept(), a listen(srv, 10) would immediately
reset the last six connection requests; in effect flushing the listen
queue and protecting further clients from waiting.)
Thank you,
lacos
[0] http://www.opengroup.org/onlinepubs/9699919799/functions/listen.html
> How could the Solaris implementation be made refuse new connections if the
> "rate limiter" is in effect? Simply by setting up a low backlog value with
> the initial listen()? Or by manipulating the backlog dinamically with
> repeated listen() calls during peaking loads?
Basically, you open two or three junk descriptors (copies of /dev/
null). If you get an EMFILE, you close the junk connections, call
'accept' followed by 'close' until you get 'WOULDBLOCK', then open the
junk files again.
Once you get an EMFILE, you never need get it again. Once you know how
many sockets you're allowed, if you 'accept' a connection such that
the next connection would cause an EMFILE, you can just drop
connections in a tight loop.
Generally, you can fairly easily determine the maximum number of
descriptors you can have. Say it's 16,384. You can then set a limit
just a bit lower than that, say 16,380. (That helps to save a few
descriptors for urgent stuff, such as if an existing connection forces
you to open a file.) If you accept a new connection whose descriptor
number is higher than that, you can send it an error and then close
it, just close it, take steps to load shed, or whatever.
You can set soft and hard limits, if that's suitable for your
application. At the soft limit, you may reduce timeouts, disconnect
"less important" connections, take yourself out of the server pool, or
whatever is appropriate for your application.
DS
> Basically, you open two or three junk descriptors (copies of /dev/
> null). If you get an EMFILE, you close the junk connections, call
> 'accept' followed by 'close' until you get 'WOULDBLOCK', then open the
> junk files again.
Great, thanks!
> load shed
English never stops amazing me.
Cheers,
lacos
> For really high-performance servers that need to deal with large
> numbers (tens of thousands) of descriptors, POSIX doesn't really have
> a solution. Various OS-specific options exist (kqueue on BSD, epoll
> on Linux, /dev/epoll on Solaris, etc.)
Dan Kegel's article is an interesting read.
excessive emphasis on threads compared to processes
What is really needed is a whole NEW threading concept where individual
threads can have private-to-that-thread resources, like file descriptors
(but done without giving up the ability to choose to share them). Then
you can spread the descriptors and other resources out in ways that allow
them to be managed better.
--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------
> excessive emphasis on threads compared to processes
Process-pool designs are not really realistic yet. Nobody's done the
work needed to make them useful.
I keep hoping somebody will, since I think that's a phenomenal design
approach. You would need to allocate lots of memory address space
before you fork off the child processes (64-bit OSes make this easy),
and have a special "shared allocator" to allocate shared memory. You'd
need a library that made it easy to register file descriptors as
shared and hand them from process to process. You'd also need a "work
pool" implementation that only accepted references to shared resources
to identify a work item.
Ideally, a process could register what it was messing with. So if it
crashed/failed, the system would know what was potentially corrupt.
> What is really needed is a whole NEW threading concept where individual
> threads can have private-to-that-thread resources, like file descriptors
> (but done without giving up the ability to choose to share them). Then
> you can spread the descriptors and other resources out in ways that allow
> them to be managed better.
I'm not sure how that would be any better. Currently, if you want a
file descriptor to only be accessed by one thread, just only access it
from that one thread.
DS
I've seen few application purposes ... so few I can't even think of one at
the moment ... although I know I though about one many years back ... which
would need to hand file descriptors between tasks (I'm using this as a
generic term for a unit of work that could be a thread or a process), other
than the simplistic case of a master listener for a daemon that hands off
to workers for each arriving connection (probably best done as part of the
creation of that task ... which is usually what we see happening).
The 64-bit VM does give us space to do some memory structuring to deal with
the issues like having memory that is shared, and having private memory that
still needs to be distinguished (e.g. two tasks that cannot see each other's
stack, but we still want the addresses to be different for some reason).
Well, it will until we waste too much of it and eventually exhaust it.
| Ideally, a process could register what it was messing with. So if it
| crashed/failed, the system would know what was potentially corrupt.
I'd prefer that registration of it be a part of getting it. That is, when
the task gets the resource, it already is registered. Think descriptors
and processes. Kill the process and the descriptors close (aside for a
few glitches in the design such as stuck devices ... but that's another
whole rant for another day) and go away. Sharing the resources would have
to be considered, too.
|> What is really needed is a whole NEW threading concept where individual
|> threads can have private-to-that-thread resources, like file descriptors
|> (but done without giving up the ability to choose to share them). �Then
|> you can spread the descriptors and other resources out in ways that allow
|> them to be managed better.
|
| I'm not sure how that would be any better. Currently, if you want a
| file descriptor to only be accessed by one thread, just only access it
| from that one thread.
But it's still silly to have to deal with the issues of a descriptor space
that exceeds some large value just because all the other descriptors are
visible together in at least some descriptor space. Think of 40,000 HTTP
tasks working at once. You need a couple descriptors each, at least.
Why do they all need to share the same descriptor space, even if there is
some need to share the same virtual memory space (which may or may not be
a good thing).
Threads often are a performance win ... NOT because they allow faster
sharing between tasks (not always needed) ... BUT just because context
switches between threads of the same virtual memory space are faster.
You won't need to switch the segment structure. You won't need to flush
the VM translation cache. You won't even need to discard memory cache
in many cases (depending on architecture in some).
But threads are also often a risk ... they are not padded cells, for example.
And then there is the issue of having enough separate stacks, file descriptor
spaces, etc.
A lot of this is an issue of architectural design (and not just architecture
of the CPU) ... architecture of how process and thread contexts are organized
and how information flows where it needs to go. If it's a web server that
just delivers static files (for example all those button images and such),
then it is mostly very simple. But if it needs to keep state for each user,
or share information between distinct users in real time, especially if in
faster time than storing it in a database can do, then the architecture of
the server/service needs to go into this. That design needs to consider the
effects of threads, processes, shared resources (which ones are needed and
which ones are not), and even distinct hardware. For example, users accessing
a web based mail system might be best redirected to the same machine each time
during a login session, allowing their state cache to be kept in one place,
even while thousands of machines and tens of millions of total processes or
threads are running to service them.
There's really no general purpose solution. There won't be until we get to
a level where the "loose fit" of "one size fits all" won't matter (this will
require machines that would look to us today much as today's machines would
impress people from the 1980's).
> Threads often are a performance win ... NOT because they allow faster
> sharing between tasks (not always needed) ... BUT just because context
> switches between threads of the same virtual memory space are faster.
This is a common misconception. Threads are rarely, if ever, a
performance win because they make context switches faster. Threads are
primarily a performance win because they minimize the need for context
switches. A threaded web server can do a little bit of work for each
of a thousand clients without even a single context switch.
If you are having so many context switches that the cost of a context
switch shows up on your performance radar, you are doing something
horribly wrong. Schedulers are specifically designed to ensure that
context switches are infrequent, and you would have to be putting some
disastrous pressures on them for that design to fail to do its job.
DS
That depends upon what you call a context switch. Somehow I think to
switch threads you have to somehow save and restore a few registers, the
Program Counter for sure, unless you have more cores than threads. The
more registers that have to be exchanged the longer the switching time.
> That depends upon what you call a context switch. Somehow I think to
> switch threads you have to somehow save and restore a few registers, the
> Program Counter for sure, unless you have more cores than threads. The
> more registers that have to be exchanged the longer the switching time.
Compared to blowing out the code and data caches, the time it takes to
save and restore a few registers is meaningless.
DS
Threads are a performance win because they don't need to flush the TLB's
on context switches between threads in the same process.
A thread context switch is enormously less
expensive than a process context switch. The larger the page size,
the better.
TLB misses are expensive. TLB misses are _really_ expensive in virtual
machines[*].
scott
[*] Up to 22 memory references when using nested page tables, depending on
processor page directory cache hit rate; this can be reduce to 11 if the
nested page table uses 1GB pages sizes (vice 4 or less without using SVM).
It's not the caches, so much, as it is the TLB's. The caches (at least
on physically indexed architectures like Intel/AMD's) are not flushed on a
context switch; either a thread context switch or process context switch
may or may not result in a subsequent cache miss - that depends on many
factors. A thread switch is less likely to see a subsequent cache miss,
however.
A cache miss is a single memory reference. A TLB miss is 2 (1gb pages) to
4 (4k pages) to 22 (4k pages with 4k nested page table in virtual machine)
memory references. And _if_ the scheduler needs to do a global TLB shootdown
on a process switch, that requires bringing all processors in the system to
a barrier during the switch (blech). Fortunately, global shootdowns are only
necessary when the page table itself is changed (like by mmap, brk, etc).
Cache miss rates are best controlled by the scheduler trying not to move threads
from one core to another.
scott
And I've done the same thing in just one process, too. But it makes for
more complicated code. It defeats using forms of abstraction that lets
programmers focus less on the mechanics and more on the objective.
| If you are having so many context switches that the cost of a context
| switch shows up on your performance radar, you are doing something
| horribly wrong. Schedulers are specifically designed to ensure that
| context switches are infrequent, and you would have to be putting some
| disastrous pressures on them for that design to fail to do its job.
How is it that a scheduler has anything to do with this? If you have
different tasks under different thread or different processes, then the
scheduler is forced to make a context switch when something different
has to be done, and that's in a different thread/process. The win for
threads is the context switch to another thread within the same process
is cheaper than a context switch between processes.
However, once the context switch to a new VM does take place, the cache that
pointed to the previous process is useless (except for shared parts since
this is a physical/real address caching architecture).
| A cache miss is a single memory reference. A TLB miss is 2 (1gb pages) to
| 4 (4k pages) to 22 (4k pages with 4k nested page table in virtual machine)
| memory references. And _if_ the scheduler needs to do a global TLB shootdown
| on a process switch, that requires bringing all processors in the system to
| a barrier during the switch (blech). Fortunately, global shootdowns are only
| necessary when the page table itself is changed (like by mmap, brk, etc).
TLB is certainly the biggie. But the cache still is a factor.
| Cache miss rates are best controlled by the scheduler trying not to move threads
| from one core to another.
Or the whole process between CPUs.
| [*] Up to 22 memory references when using nested page tables, depending on
| processor page directory cache hit rate; this can be reduce to 11 if the
| nested page table uses 1GB pages sizes (vice 4 or less without using SVM).
Is the page table also stored in cache, even if also in the TLB?
> Threads are a performance win because they don't need to flush the TLB's
> on context switches between threads in the same process.
Nope. That's like saying that cars are faster than bicycles because
they don't have pedals. While it's true that threads are a performance
win and it's true that context switches between threads of the same
process are faster than context switches between threads from
different processes, the latter does not cause the former.
> A thread context switch is enormously less
> expensive than a process context switch. The larger the page size,
> the better.
It doesn't matter. In any sensible threaded application, there will be
so few context switches that making them faster will be lost in the
noise.
DS
> How is it that a scheduler has anything to do with this?
The scheduler is specifically designed to allocate timeslices to ready
to run threads such that the number of context switches is low enough
that they don't impact performance.
This design will work quite well unless you do something stupid. For
example, if you use a process-per-connection design and need to do a
tiny bit of work for each of 150 connections, you will need about 150
context switches. But that's because you did something stupid.
So long as you don't do something stupid like that, the cost of
context switches is lost in the noise because the scheduler will not
make very many of them.
> If you have
> different tasks under different thread or different processes, then the
> scheduler is forced to make a context switch when something different
> has to be done, and that's in a different thread/process.
There is no such thing as "tasks under different thread". Threads
share all memory, all file descriptors, everything. There is no way
something can be "stuck in the wrong thread" unless you specifically
design things that way.
Assuming a sane designer, he would only allow tasks to get "stuck to a
thread" where that didn't harm performance. And, of course, you can
always shoot yourself in the foot. However, if the process that holds
a file descriptor is not running, no forward progress can be made
without a context switch.
> The win for
> threads is the context switch to another thread within the same process
> is cheaper than a context switch between processes.
No. That is not why threads are a win. That is, as I've tried to
explain, a common misconception. It's like saying jet planes are a win
over bicycles because you don't have to pedal them.
Threads are a win over processes because it makes no difference which
thread runs. The process makes forward progress so long as any ready-
to-run thread gets the CPU. That is, in a properly designed multi-
threaded application, the amount of work done before a context switch
will be needed will be much higher.
DS
Depends on:
1) how recently used and
2) The cache eviction behavior.
Generally, I wouldn't count on any of the PTE entries being present
in the processor cache, you might find one or more of the intermediate
entries (PML4, PDP, PD) in a shared L3 cache of sufficient size, but
I wouldn't count on it.
Both AMD and Intel have special PML4/PDP/PD caches in the processor to
help make TLB fills a bit more efficient. The PML4 entry will likely
be cached (since there are only two entries available in the PML4, one
for the lower 512GB, and one for the uppert 512GB) and the PML4 is the
first reference on a table walk.
Consider a page table mapping 1TB of memory with 4k pages. This requires
two gigabytes of memory just for the page tables. Consider then, that
each process must have it's own page table, and you'll see that the processor
cache has little benefit for TLB fills.
scott
I've never seen a thread that doesn't require a context switch, aside
from the user-level M-N threads in the old SVR4.2MP threads library, and
that was also a context switch, just done in the library rather than the
kernel.
If you degenerate your system to a single thread per core, and only have
one process (i.e. a real-time embedded) system, then there won't be many
context switches between threads.
However, in real-world threaded applications there _are_ context switches,
and there are _many_ context switches, and a thread context switch is
more efficient than a process context switch.
scott
Indeed. The shared parts are key. Context switches between VM's are a special
case, and AMD has some help for the TLB's in this case by associating a ASID
with the VM. However, this is orthogonal to the point I made above about
thread switches within a process being more efficient than thread switches
between processes.
scott
[...]
>> A thread context switch is enormously less
>> expensive than a process context switch. � The larger the page size,
>> the better.
>
> It doesn't matter. In any sensible threaded application, there will be
> so few context switches that making them faster will be lost in the
> noise.
Dedicating threads to particular subtasks of something which is
supposed to be done is also a sensible way to design 'a threaded
application', just one which is rather geared towards simplicity of
the implementation than maximum performance. Because a thread context
switch is cheaper than a process context switch, such simple designs
are useful for a wider range of tasks when using threads instead of
processes.
What's wrong with allocating memory space after forking using the normal
shared memory allocators or mmap()ing a file in a common filesystem?
I assume the library would be simply to hide the complexity of passing
file descriptors using unix sockets?
> Ideally, a process could register what it was messing with. So if it
> crashed/failed, the system would know what was potentially corrupt.
The system as a whole likely doesn't care...only the other related
processes that might want to touch the same resource.
Chris
> However, in real-world threaded applications there _are_ context switches,
> and there are _many_ context switches, and a thread context switch is
> more efficient than a process context switch.
Why would there be many context switches? All the reasons processes
need to switch contexts don't apply to threads. For example:
1) Process running doesn't have access to the memory object that needs
to be manipulated. Doesn't apply, threads all share VM.
2) Process running doesn't have access to the file descriptor that
needs work. Doesn't apply, threads share all file descriptors.
And so on.
Processes require lots of context switches because the wrong one is
often running. With a threaded design, there can only be a "wrong
thread" if one is intentionally created. And one would not create one
where it impacts performance (unless one is an idiot).
DS
> Dedicating threads to particular subtasks of something which is
> supposed to be done is also a sensible way to design 'a threaded
> application', just one which is rather geared towards simplicity of
> the implementation than maximum performance.
You can always trade-off performance for something else. The point is
that you *have* the performance in the first place and where that
performance comes from.
> Because a thread context
> switch is cheaper than a process context switch, such simple designs
> are useful for a wider range of tasks when using threads instead of
> processes.
Compared to the ability to avoid context switches entirely, the
relative cost difference of process versus thread context switches is
lost in the noise in realistic scenarios. Of course, things that only
make things better are good, and this is certainly a small benefit to
threads. But it isn't a game changer. On the other hand, the ability
to reduce the number of context switches by an order of magnitude
(because you never have a thread running that can't access the memory
or file descriptor needed to make forward progress) *is* a game
changer.
DS
> Threads are a win over processes because it makes no difference which
> thread runs. The process makes forward progress so long as any ready-
> to-run thread gets the CPU. That is, in a properly designed multi-
> threaded application, the amount of work done before a context switch
> will be needed will be much higher.
It seems like you are limiting a "properly designed multithreaded
application" to those that use a "pool of worker threads" model.
"properly designed" doesn't always mean "as fast as possible at
runtime". It also seems like by definition these "properly designed"
apps must be rarely doing anything that would block, because that causes
a context switch. Does that mean that they need to use asynch I/O, or
that such apps rarely have to wait for data from a disk? Lastly, it
seems like they must rarely execute system calls, because the cost of an
thread-to-thread context switch isn't much more than that of a syscall.
Finally, the worker pool design doesn't rule out processes instead of
threads. In general it is possible to set up a multi-process model to
mimic a multi-thread model, using shared memory and passing around file
descriptors. Once the setup is completed, at runtime the primary
performance difference between the two models is the fact that you need
to flush the TLB on a context switch between processes and not between
threads.
Chris
> What's wrong with allocating memory space after forking using the normal
> shared memory allocators or mmap()ing a file in a common filesystem?
You can't follow pointers inside the shared memory without special
guard code.
> I assume the library would be simply to hide the complexity of passing
> file descriptors using unix sockets?
Not just that. It would also hide the complexity of allocating and
managing shared memory, assigning tasks to processes, creating and
destroying processes that cooperate, and so on. Basically, it would
provide an interface that was very much like threads, except that non-
shared memory could be managed as easily as shared ones.
> > Ideally, a process could register what it was messing with. So if it
> > crashed/failed, the system would know what was potentially corrupt.
> The system as a whole likely doesn't care...only the other related
> processes that might want to touch the same resource.
By "the system", I mean the system of cooperating processes. The idea
would be that one of the big advantages would be that such a system
could run riskier code and a process crash wouldn't take down the
cooperating processes. But you wouldn't have all of the disadvantages
of a "process per connection" approach.
DS
> It seems like you are limiting a "properly designed multithreaded
> application" to those that use a "pool of worker threads" model.
No, no, not at all. However, a properly designed multithreaded
application would use a pool of worker threads model unless another
model provided, on balance, more benefits than detriments than that
approach. That is, no properly-designed multi-threaded application can
be worse than a pool of worker threads design. Because if it was
worse, then the choice of a pool of worker threads was improper.
> "properly designed" doesn't always mean "as fast as possible at
> runtime". It also seems like by definition these "properly designed"
> apps must be rarely doing anything that would block, because that causes
> a context switch. Does that mean that they need to use asynch I/O, or
> that such apps rarely have to wait for data from a disk?
Oh no, neither. You can certainly wait for data from a disk.
> Lastly, it
> seems like they must rarely execute system calls, because the cost of an
> thread-to-thread context switch isn't much more than that of a syscall.
I don't follow this argument. What does one have to do with the other?
You frequently have no choice but to do very expensive things. But
when you do make expensive system calls because you have to, that's
you making forward progress at full speed. That's good.
> Finally, the worker pool design doesn't rule out processes instead of
> threads. In general it is possible to set up a multi-process model to
> mimic a multi-thread model, using shared memory and passing around file
> descriptors. Once the setup is completed, at runtime the primary
> performance difference between the two models is the fact that you need
> to flush the TLB on a context switch between processes and not between
> threads.
I agree, that's just not very practical right now for the reasons
discussed elsewhere in this thread. But that might well be the best of
both worlds, allowing you to get most of the benefits of multi-
threading while avoiding contention when allocating and freeing
resources you don't need shared.
As I said, it would be a wonderful thing if someone would put the
effort needed into developing a library and toolkit for using this
model. It's much more doable now that 64-bit OSes are available, since
it helps to allocate a large chunk of address space before 'fork'ing
processes off so that you can dereference pointers in shared memory
without needing to manually alias them.
DS
I suspect David has been fortunate to only have had special case use of
threads and not run into the general case. If for some reason the problem
to be solved requires a large thread stack space, a recursion solution
perhaps, and used a large number of threads, sufficient that the physical
memory address for thread 1 and thread N is farther apart than the
available cache ...
Simple. There are generally more threads than there are processing units,
and all the threads want to accomplish something.
There is generally more than one 'process' running on any given unix system
at any given time. On my 4-core ubuntu box, I count 115 kernel threads. On
my 192 core RHEL box, I count over 1000 kernel threads (watchdog, events, kintegrityd,
kswapd, crypto, SCSI error handlers, etc.)
I assume your system is performing I/O? kswapd (which handles writing from
the file cache to the device) will run, resulting in a context switch. SoftIRQ
handling may require a context switch.
Then there are various system daemons (ntpd, xinetd, init, getty, nfs daemons)
(I preclude the 100 processes started by GNOME per logged in user, since a real
server wouldn't have GNOME running, nor would it even have a head).
The reason for all these threads _is_ that thread context switches are cheap.
Serving web pages (cf an earlier message in this thread) is a _solved_ problem[*]
and there is no need to craft a highly efficient web server anymore.
[*] scale out, not up.
Which _cannot be done_ on any reasonable modern general purpose
unix-like operating system.
>relative cost difference of process versus thread context switches is
>lost in the noise in realistic scenarios. Of course, things that only
Have you ever measured this? I have, several times, on various
architectures from mainframes to MPP boxes to hypervisors. The cost
difference is measurable and hardly in the noise.
>make things better are good, and this is certainly a small benefit to
>threads. But it isn't a game changer. On the other hand, the ability
>to reduce the number of context switches by an order of magnitude
>(because you never have a thread running that can't access the memory
>or file descriptor needed to make forward progress) *is* a game
So you never need to fault a page in? Context switch.
So you never need to read or write the file descriptor? Context switch.
scott
This exact model has been de rigour in the Mainframe world since the
1960's. On Unix, use Tuxedo or other MQ middleware, or system V message
queues, or POSIX message queues.
This is also the model used by 90% of the web servers in existence. An
Alteon (Nortel, Cisco, etc) load balancer queues requests to a pool of
servers.
>
>As I said, it would be a wonderful thing if someone would put the
>effort needed into developing a library and toolkit for using this
>model. It's much more doable now that 64-bit OSes are available, since
I see no advantage to this model for any application. Why do you
think a multiple process (hence multiple address space) model is
superior to a multi-threaded process? If you're worried about
the cost of poll on a large pool of file descriptors, then you've
posed your problem poorly and should rethink your solution.
>it helps to allocate a large chunk of address space before 'fork'ing
>processes off so that you can dereference pointers in shared memory
>without needing to manually alias them.
c'est what?
s
> Simple. There are generally more threads than there are processing units,
> and all the threads want to accomplish something.
That won't make context switches except as each thread finishes out
its timeslice. The scheduler is carefully designed to give each thread
a large enough timeslice so that the cost of these context switches
are lost in the noise.
> I assume your system is performing I/O? kswapd (which handles writing from
> the file cache to the device) will run, resulting in a context switch. SoftIRQ
> handling may require a context switch.
Sure, but these will generally run as timeslices run out. When a
system task pre-empts something, that's unavoidable, and application
architecture can't do anything about it.
What I'm talking about are context switches forced by application
architecture. They are very, very ungood.
> The reason for all these threads _is_ that thread context switches are cheap.
That doesn't make sense. With all the reasons you specified, the
context switches won't be between threads in the same process anyway.
The only exception is when a thread completes its timeslice, in which
case the timeslices are lost in the noise.
DS
> >Compared to the ability to avoid context switches entirely, the
> Which _cannot be done_ on any reasonable modern general purpose
> unix-like operating system.
You misunderstand. If you reduce the number of context switches needed
to do a particular task by one, you have eliminated one context switch
entirely. That is most definitely possible.
> >relative cost difference of process versus thread context switches is
> >lost in the noise in realistic scenarios. Of course, things that only
> Have you ever measured this? I have, several times, on various
> architectures from mainframes to MPP boxes to hypervisors. The cost
> difference is measurable and hardly in the noise.
Sure, because you deliberately set out to measure that one thing.
That's like saying that brake friction is a significant factor in the
fuel economy of a 747 because operate very badly while the brakes are
creating friction.
> >make things better are good, and this is certainly a small benefit to
> >threads. But it isn't a game changer. On the other hand, the ability
> >to reduce the number of context switches by an order of magnitude
> >(because you never have a thread running that can't access the memory
> >or file descriptor needed to make forward progress) *is* a game
> So you never need to fault a page in? Context switch.
> So you never need to read or write the file descriptor? Context switch.
Again, all of those things happen, but they are lost in the noise
compared to a process-based architecture that forces a context switch
every time the "wrong process" is running.
DS
> I see no advantage to this model for any application. Why do you
> think a multiple process (hence multiple address space) model is
> superior to a multi-threaded process?
Every time a threaded process allocates memory or allocate a file
descriptor, synchronization is required with the other threads in the
process. This can be totally eliminated by using a process pool
approach. It's also convenient for fault isolation. When a thread
fails, the process context is lost. By being able to isolate what
shared state a process can get to, you can recover from a fault with
fewer problems and less loss.
> If you're worried about
> the cost of poll on a large pool of file descriptors, then you've
> posed your problem poorly and should rethink your solution.
No, that's not the issue at all.
> >it helps to allocate a large chunk of address space before 'fork'ing
> >processes off so that you can dereference pointers in shared memory
> >without needing to manually alias them.
> c'est what?
If multiple processes map shared memory or shared files, the pointers
to that memory will typically be different. That means that this
memory cannot contain pointers that you can dereference the normal
way. However, if you use a 64-bit operating system, allocate a huge
chunk of address space before you 'fork' off the processes, and use a
special inter-process allocator to manage that space, you can pass
pointers between processes and dereference them with no special effort
at all.
Something strange is going on here. You're a smart guy and the things
I'm saying are very simple, yet nothing I'm saying seems to be getting
through. Perhaps we have wildly different assumptions or somehow you
are thinking I'm saying something totally different from what I'm
actually saying.
What I'm saying is that at least for some applications, a "process
pool" server that was very analogously coded to "thread pool" servers
might be able to provide benefits that thread pools alone cannot. Each
process in the pool can uses threads, of course, if that makes sense.
The benefit is primarily that sharing can be precisely controlled and
secondarily that false sharing is minimized.
DS
Every time memory is allocated or a file descriptor
is allocated by any thread in the system (whether a single or multiple
threaded process), synchronization is required. There is no difference
between a single threaded process and a multiple threaded process from
the point of view of the kernel allocators.
>process. This can be totally eliminated by using a process pool
>approach. It's also convenient for fault isolation. When a thread
>fails, the process context is lost. By being able to isolate what
>shared state a process can get to, you can recover from a fault with
>fewer problems and less loss.
A process failure in this context will generally be due to a
software bug, and might cascade to all the remaining processes.
If each process is running on a different system (such tier1 systems),
then a hardware failure will not affect the remaining systems (unless
it was due to power, cooling or network access).
>
>>=A0If you're worried about
>> the cost of poll on a large pool of file descriptors, then you've
>> posed your problem poorly and should rethink your solution.
>
>No, that's not the issue at all.
>
>> >it helps to allocate a large chunk of address space before 'fork'ing
>> >processes off so that you can dereference pointers in shared memory
>> >without needing to manually alias them.
>
>> c'est what?
>
>If multiple processes map shared memory or shared files, the pointers
>to that memory will typically be different. That means that this
>memory cannot contain pointers that you can dereference the normal
>way. However, if you use a 64-bit operating system, allocate a huge
>chunk of address space before you 'fork' off the processes, and use a
>special inter-process allocator to manage that space, you can pass
>pointers between processes and dereference them with no special effort
>at all.
Then you've lost the isolation that you trumpeted above as a benefit.
>
>Something strange is going on here. You're a smart guy and the things
>I'm saying are very simple, yet nothing I'm saying seems to be getting
>through. Perhaps we have wildly different assumptions or somehow you
>are thinking I'm saying something totally different from what I'm
>actually saying.
I guest that's the case. I'm a operating system guy, and that colors
my thinking. I write multithreaded applications just like I do
operating systems, using the same synchronization techniques etc.
The first hardware I used professionally had instructions for mutexes and
condition variables, a rather clever microkernel system.
>What I'm saying is that at least for some applications, a "process
>pool" server that was very analogously coded to "thread pool" servers
>might be able to provide benefits that thread pools alone cannot. Each
>process in the pool can uses threads, of course, if that makes sense.
>The benefit is primarily that sharing can be precisely controlled and
>secondarily that false sharing is minimized.
That's the way most web servers work today, particularly with apache.
But they don't share memory.
Oracle does, using system 5 SHM, but they do some pretty grody tricks
to allow the oracle code to access the shared memory directly, without
any base-relative accesses (at link time, they link a stub assembler file
based at the SHM load address with the rest of the database code such that
the database code can directly reference variables in the fixed portion of
the shared global area. The remainder is allocated dynamically, but
all the processes that share the SGA map it at the same process
scott
> >Every time a threaded process allocates memory or allocate a file
> >descriptor, synchronization is required with the other threads in the
> Every time memory is allocated or a file descriptor
> is allocated by any thread in the system (whether a single or multiple
> threaded process), synchronization is required. There is no difference
> between a single threaded process and a multiple threaded process from
> the point of view of the kernel allocators.
That is not true.
In the case of file descriptors, the file descriptor table is shared
by all the threads. Allocating a new file descriptor requires
manipulating this shared resource. If two threads in the same process
both call 'socket', the kernel must somehow assign one a lower
descriptor number than the other. This requires an ordering that is
not required if two threads from different processes allocate a file
descriptor.
The same is true of memory. When a thread allocates memory, some
address space must be taken from the process' pool. If a mapping needs
to be made or modified, the mapping will be to the shared address
space. This will require some synchronization with the possibility of
a conflict.
The same is true when a thread is created. Threads are process
resources, and if two threads in the same process each try to create a
thread, there is likely to be more synchronization overhead that if
threads in different processes each created an additional thread.
> A process failure in this context will generally be due to a
> software bug, and might cascade to all the remaining processes.
How would that happen? The only mechanism one process can use to
influence another process is with the consent of both processes. One
can precisely contain what a peer process can access in ways one
cannot similarly constrain a peer thread.
> >If multiple processes map shared memory or shared files, the pointers
> >to that memory will typically be different. That means that this
> >memory cannot contain pointers that you can dereference the normal
> >way. However, if you use a 64-bit operating system, allocate a huge
> >chunk of address space before you 'fork' off the processes, and use a
> >special inter-process allocator to manage that space, you can pass
> >pointers between processes and dereference them with no special effort
> >at all.
> Then you've lost the isolation that you trumpeted above as a benefit.
Not at all. You can precisely control what parts of that address space
are accessible by what processes at what time.
> >What I'm saying is that at least for some applications, a "process
> >pool" server that was very analogously coded to "thread pool" servers
> >might be able to provide benefits that thread pools alone cannot. Each
> >process in the pool can uses threads, of course, if that makes sense.
> >The benefit is primarily that sharing can be precisely controlled and
> >secondarily that false sharing is minimized.
> That's the way most web servers work today, particularly with apache.
>
> But they don't share memory.
Right, and they typically don't hand off file descriptors.
> Oracle does, using system 5 SHM, but they do some pretty grody tricks
> to allow the oracle code to access the shared memory directly, without
> any base-relative accesses (at link time, they link a stub assembler file
> based at the SHM load address with the rest of the database code such that
> the database code can directly reference variables in the fixed portion of
> the shared global area. The remainder is allocated dynamically, but
> all the processes that share the SGA map it at the same process
Right, and that's not needed any more with 64-bit operating systems.
DS
There are a dozen other data structures that are allocated as part
of a successful open(2) or socket(2) call. An almost free (when
not contended) spin lock is used to protect the descriptor table.
But there is an open file descriptor (filp) and for many files an
in-core inode as well. These are all system wide data structures.
>
>The same is true of memory. When a thread allocates memory, some
>address space must be taken from the process' pool. If a mapping needs
If using malloc, malloc will use locking to protect the data structures
in libc. However, there are many ways (lookaside lists, pools, etc) to
aleviate the synchronization, _if there is any contention at all_, which
may not be particularly likely (if there is, fix the app, if your problem
is the malloc pool in a threaded application, use one malloc pool
per thread).
However, under the covers malloc uses sbrk/mmap to get memory for the
process, which requires accessing many global kernel data structures
with associated synchronization.
If you're getting contention on memory allocation, it's your choice
of allocators that is the problem. With 90% of threaded apps, contention
on the malloc(3) pool or the file descriptor table is exceedingly
rare, and a spinlock is a single compare instruction in the uncontented
case (even a LOCK prefix doesn't hurt so long as the lock doesn't
straddle two cache lines).
I had opportunity to analyze a customer benchmark where they
were beating the crap out of single cache line (simulating a very
poorly written threaded application) on a large ccNUMA system with 192 cores.
The processors (AMD) ensure fairness by issuing global bus locks
if some number of attempts to acquire the cache line with intent
to modify fail; however this really kills system performance.
The contention can be avoided by good application practice with
the current pthread paradigm - no new process pool model is
required (and the very minor savings over not having to spinlock
the file descriptor table on socket(2) would be surpased by the
additional overhead of TLB fills and the additional kernel data
structures and required locking (for which contention will be
more likely with multiple processes, than with multiple threads
in many cases).
Note that the operating system is a threaded application and there has
been a lot of effort put into various lock free and efficient locking
algorithms (such as RCU).
With lookaside-lists or pools of pre-allocated elements, only
an atomic compare and exchange instruction is used to allocate
or return an element, and this can be simply implemented in
a application that requires that level of allocation/deallocation
using GCC builtins (or inline asm for other compilers).
>to be made or modified, the mapping will be to the shared address
>space. This will require some synchronization with the possibility of
>a conflict.
>
>The same is true when a thread is created. Threads are process
>resources, and if two threads in the same process each try to create a
>thread, there is likely to be more synchronization overhead that if
>threads in different processes each created an additional thread.
>
>> A process failure in this context will generally be due to a
>> software bug, and might cascade to all the remaining processes.
>
>How would that happen? The only mechanism one process can use to
>influence another process is with the consent of both processes. One
>can precisely contain what a peer process can access in ways one
>cannot similarly constrain a peer thread.
I was assuming your "pool of processes sharing some subset of memory"
model. In which case, none of the remaining processes can rely
on the content of the shared memory, since its consistency with
respect to any partially made updates cannot be assured.
>
>> >If multiple processes map shared memory or shared files, the pointers
>> >to that memory will typically be different. That means that this
>> >memory cannot contain pointers that you can dereference the normal
>> >way. However, if you use a 64-bit operating system, allocate a huge
>> >chunk of address space before you 'fork' off the processes, and use a
>> >special inter-process allocator to manage that space, you can pass
>> >pointers between processes and dereference them with no special effort
>> >at all.
>
>> Then you've lost the isolation that you trumpeted above as a benefit.
>
>Not at all. You can precisely control what parts of that address space
>are accessible by what processes at what time.
Which is a _very_ expensive kernel system call.
>
>> >What I'm saying is that at least for some applications, a "process
>> >pool" server that was very analogously coded to "thread pool" servers
>> >might be able to provide benefits that thread pools alone cannot. Each
>> >process in the pool can uses threads, of course, if that makes sense.
>> >The benefit is primarily that sharing can be precisely controlled and
>> >secondarily that false sharing is minimized.
>
>> That's the way most web servers work today, particularly with apache.
>>
>> But they don't share memory.
>
>Right, and they typically don't hand off file descriptors.
Thankfully. That's one of the sillier things that STREAMS in SYSV and
Unix Domain sockets in Tahoe introduced. Just because it is there, doesn't
make it a good solution for most problems.
>
>> Oracle does, using system 5 SHM, but they do some pretty grody tricks
>> to allow the oracle code to access the shared memory directly, without
>> any base-relative accesses (at link time, they link a stub assembler file
>> based at the SHM load address with the rest of the database code such tha=
>t
>> the database code can directly reference variables in the fixed portion o=
>f
>> the shared global area. =A0 The remainder is allocated dynamically, but
>> all the processes that share the SGA map it at the same process
>
>Right, and that's not needed any more with 64-bit operating systems.
Actually, that is incorrect. It's more important with 64-bit OS,
particularly on ccNUMA systems[1]. Oracle (as most modern applications
should/will be) is ccNUMA aware, so it makes intelligent allocation
decisions in the Shared Global Area (SGA) and uses the affinity mechanisms
to associate shadow (background) processes to the NUMA node closest to
the SGA memory used by that session. They issue one shmget for the fixed
SGA (which is still mapped at a common program virtual address across
all oracle processes which access the shared global area), and then split
the dynamic portion by using the numa aware SHMGET to get a segment on
each of the NUMA nodes.
Since all multi-socket AMD systems today are ccNUMA, and all Intel
multisocket systems from Nehalem on are ccNUMA, all serious high
performance application writers should consider NUMA awareness for
their applications.
>DS
[1] They map at a fixed address for the convenience of the programmer
and for access efficiency (no need for pointers or offset calculations).
You make a lot of good points. I think I have to update my thinking
based on the massive effort that's already gone into making threading
cheap. Some of my arguments may well be overcome by events, as they
say in the military.
DS