I have a thread accepting socket connections from remote users. But
the number of connections may become very large, so when a new socket
is accepted, I will detach it from the accepting thread and send it to
a pool of threads which handle I/O events on the open sockets,
distributing the socket load across the pool.
Sockets will be randomly closed by users, so the load of open sockets
will continually shift among the threads.
When a new connection arrives, I need to decide which thread to hand
it off to. I don't need to select the thread with the least load, but
OTOH, I want to be sure I never select the thread with the highest
load. The idea is to keep the load well balanced across the pool, but
it doesn't have to be perfectly level.
And I don't want to spend much time making the choice, I want to make
a quick decision (with a quick algorithm) and move on.
Assume I have a list of thread ids, and a corresponding list of the
number of open sockets in each thread. Or maybe an array containing
both, I don't know yet. But in any case, assume some other code will
maintain the list and counters of open sockets.
I'm not asking for code, I'm just soliciting ideas, in case anyone has
already solved this problem, or knows someone who has ...
--
Internet service
http://www.isp2dial.com/
One important issue you have not elaborated is: how do you
know which thread has the largest load?
I do not know anything of what the threads are doing, but
assuming the number of open sockets is a fair indication of
the loads, then simply go through the list of threads and
pick the one with the smallest number.
An alternative is to have the threads themselves indicate
their load in a way that is relevant to them, for instance
if they have a queue of tasks to perform, the number of
tasks might be a good enough measure.
Or, to prevent a thread with a single heavy-duty task
from taking on too many new tasks, pick the thread that
least recently received a new task.
Just a few thoughts,
Arjen
Queue the jobs and have the
threads pick off the top when idle?
How does apache do this?
uwe
><how to distribute work fairly>
>Queue the jobs and have the
>threads pick off the top when idle?
>How does apache do this?
I've wondered the same myself. But I have no idea.
>One important issue you have not elaborated is: how do you
>know which thread has the largest load?
For my purposes, "load" is simply the number of open sockets. Each
thread will increment an internal counter when initializing a new
socket, and decrement when closing. When the internal counter
changes, the thread will post the new value to a global tsv, so at any
moment, I can look at the list and choose the one with the lowest
number of open sockets.
I don't want to iterate the whole list on every decision, trying to
find the minimum value. If there are 100 entries in the list, I could
average 50 iterations for each decision. I don't necessarily need to
find the minimum value, but I want to avoid the maximum value.
>I do not know anything of what the threads are doing, but
>assuming the number of open sockets is a fair indication of
>the loads, then simply go through the list of threads and
>pick the one with the smallest number.
Right. I just don't want to iterate the whole list. I want a quicker
way to decide.
>Or, to prevent the thread with a single heavy duty task
>from taking on too much new tasks, pick the thread that
>has last received a new task.
That's an idea. But because users will close sockets randomly, and
many open sockets may be idling, the oldest thread could still have
many open sockets. I see it as a simple question of who has the most
open sockets, vs. who has the least.
Maybe I'm looking at the problem the wrong way. It's true that each
thread will have its own internal counter, and when that changes, the
thread will post its new value to a global variable.
Since what I care most about is avoiding the maximum value, I don't
really need a list of all the counters. All I need is one variable
which holds the minimum value, and the corresponding thread id. That
will ensure I don't pick the thread with the maximum count.
It will be simple enough to maintain that single global value. I
guess that solves the problem. Sometimes it helps to think out loud,
thanks for listening.
http://httpd.apache.org/docs/2.0/mod/worker.html
http://www.apacheweek.com/issues/97-11-28
??
uwe
>http://httpd.apache.org/docs/2.0/mod/worker.html
>http://www.apacheweek.com/issues/97-11-28
Their model is probably more complicated than what I need, since I
don't have multiple pids.
On Tue, 25 Sep 2007 10:22:03 +0000, John Kelly <j...@isp2dial.com>
wrote:
>It will be simple enough to maintain that single global value. I
>guess that solves the problem.
Or maybe not. Now I realize you have to consider an initial state
where they all have the same minimum value, namely 0. So it's not as
simple as keeping one variable.
Maybe it's a problem like having queues, where you want to choose the
shortest queue first.
Question is, how to maintain some data structure which gives the
quickest answer for which queue is the shortest. Maybe it should be a
two step handshake, where the main thread submitting the work will
increment the maximum, and the threads doing the work will decrement
the minimum, as sockets are closed.
Kind of like going to the grocery store when they first open in the
morning. Initially, every queue is empty, so on the first pass you
just round robin until each queue is started, and after that, find the
shortest queue first. Interesting problem ...
> Question is, how to maintain some data structure which gives the
> quickest answer for which queue is the shortest. Maybe it should be a
> two step handshake, where the main thread submitting the work will
> increment the maximum, and the threads doing the work will decrement
> the minimum, as sockets are closed.
Maintain a set of statistics for each thread as [list ...]
set cur_thrd_info [ list $cur_id $cur_socks $cur_rtime $whatnot ]
build a superlist of these.
use [lsearch -index $iidx ... ]
and [lsort -index $iidx ... ]
for sorting and finding entries.
uwe
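A minimal sketch of that superlist layout, assuming Tcl 8.5 or later
for lsearch -index; the field order and all variable names here
(workerIds, workerStats, and so on) are illustrative, not taken from
the post:

# per-thread stats record: {thread-id open-sockets last-assign-time}
set workerStats {}
foreach tid $workerIds {
    lappend workerStats [list $tid 0 [clock seconds]]
}

# least-loaded worker: sort the superlist on the socket-count field (index 1)
set byLoad   [lsort -integer -index 1 $workerStats]
set lightest [lindex $byLoad 0 0]

# locate the record for one particular thread id (index 0)
set pos [lsearch -exact -index 0 $workerStats $someTid]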
> Maintain a set of statistics for each thread as [list ...]
> set cur_thrd_info [ list $cur_id $cur_socks $cur_rtime $whatnot ]
>
> build a superlist of these.
>
> use [lsearch -index $iidx ... ]
> and [lsort -index $iidx ... ]
>
> for sorting and finding entries.
>
> uwe
I like the list structure Uwe proposes. Some random thoughts:
* If you keep the list in order, you can use [lset] to quickly adjust
values as the load changes.
* If you have to search the list very often, you might consider
creating a hash table to serve as an index, rather than calling
[lsearch] each time.
* One of the nice things about [lsort -index] is that it can create a
copy of your list, allowing you to keep the original in order.
* You might plan several connections in advance each time you sort
your list. For example, when you sort your list, make a queue of the
best (say) five threads to accept new connections. (Could even be the
same thread five times if one thread has a significantly lighter load
than others.) Then when you get new connections, you simply dequeue
until the queue is empty, at which point you sort again and repopulate
the queue.
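A rough sketch of that batching idea, reusing the hypothetical
workerStats list from the earlier sketch; the batch size of five and
the proc names are assumptions:

proc refillBatch {} {
    global workerStats nextTargets
    # sort a copy by socket count and queue the five least-loaded thread ids
    set nextTargets {}
    foreach rec [lrange [lsort -integer -index 1 $workerStats] 0 4] {
        lappend nextTargets [lindex $rec 0]
    }
}

proc pickWorker {} {
    global nextTargets
    if {![llength $nextTargets]} { refillBatch }
    set nextTargets [lassign $nextTargets tid]   ;# dequeue the front entry
    return $tid
}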
>I like the list structure Uwe proposes. Some random thoughts:
>* If you have to search the list very often, you might consider
>creating a hash table to serve as an index, rather than calling
>[lsearch] each time.
Thanks for the ideas, all are welcome.
In the meantime I wondered, could it be possible to devise a state
machine which avoids the need for sorting and searching?
I don't know how to formally describe it, I can only tell it like a
story. If anyone spots defects in my thinking, please speak up. Here
we go ...
Imagine one dispatcher sending "jobs" (or in this case, sockets) to a
pool of worker queues. We call each worker queue "you" (they are like
borg, they don't need names).
At initial state, all queues are equal. The dispatcher will simply
round robin jobs, until some queue completes a job. How can we know
all queues are equal?
There is a global flag, the shortest queue flag (sqf). At initial
state, its value is -1, which means nobody has the flag, and thus, all
queues are equal.
When the dispatcher accepts a new socket, he will detach the socket
from his thread, select a worker queue (you!), and send you an event
message using thread::send.
The message contains the socket id and a command calling your main
proc. Your main proc is merely a short stub which attaches the socket
to your thread, and sets a fileevent handler which will start the real
work, after stubby returns.
But before all that, stubby will first increment your queue length
counter, because now you have one more socket to work, and the counter
tells you how many.
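A minimal sketch of that hand-off and stub, assuming the Thread
extension's thread::detach/thread::attach channel transfer and a
shared tsv array named counts; takeSocket and handleIO are invented
names:

# dispatcher side: release the channel and tell the chosen worker about it
proc handOff {sock workerTid} {
    thread::detach $sock
    thread::send -async $workerTid [list takeSocket $sock]
}

# worker side: "stubby" bumps the counter, adopts the socket, arms the handler
proc takeSocket {sock} {
    global myCount                               ;# assumed to start at 0
    tsv::set counts [thread::id] [incr myCount]  ;# publish the new queue length
    thread::attach $sock
    fconfigure $sock -blocking 0
    fileevent $sock readable [list handleIO $sock]
}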
When you finish a job (user quit, whatever) you will close the socket.
Then, you will decrement your queue length counter and try to grab the
shortest queue flag. If nobody has the flag (sqf < 0), you get it by
default.
Once you get the flag, you immediately set it to the freshly updated
value of your queue length. The value (sqf >= 0) means somebody has
the flag, and the rules say, no one else can grab it from you, unless
their value is less than yours.
Maybe you have the shortest queue, maybe not, who knows? In any case,
you were the first to finish a job, and until now, the dispatcher used
simple round robin. If the dispatcher chooses you next, his choice
may not be optimal, but it would never be worse than round robin by
more than a value of 1. We can tolerate that, it's insignificant for
our purposes.
What's important is now, somebody has the flag!
You will also mark the flag with your id (an index value which locates
your thread id) so the dispatcher can look and see who has the flag.
Now when the next socket arrives, the dispatcher will look and see if
somebody has the flag. Of course, he was already doing this, even in
the initial state, that's how he knew to use simple round robin in the
first place.
But now that somebody has the flag, it gets more interesting.
The dispatcher is very lazy, and he does not want to be bothered with
hard work. He wants you to do the hard work, and when he gets another
socket to dispatch, he does not want to query every one of you, to see
who has the shortest queue. He only likes simple, easy decisions.
So, what does the dispatcher think?
Well let's see, somebody has the flag now, and its value tells me what
their queue length is. I also know who was next in the round robin
rotation, so let me look and compare their queue length to the worker
who has the flag. Yes, I have a global list of all queue counters,
the workers keep it updated for me when they adjust their own internal
counters. I'm much too lazy to look through the whole list, but I can
tolerate checking one, the one up next in the round robin.
So what to do? Naturally, I'll send it to the shortest of the two,
unless they are equal, in which case, next up in the round robin wins
the tie.
Now if the decision went to next up in the round robin, it's just like
actually doing round robin, so I'll advance the round robin pointer to
next in the rotation.
But if the decision went to the queue holding the flag, I don't need
to do anything more, I'll just leave the round robin pointer where it
was, and I'm done until the next socket arrives.
"Ah, I like this job!"
Earlier, we said "no one else can grab it [the flag] from you, unless
their value is less than yours."
The workers have to settle that among themselves, but the dispatcher
never worries, he knows the worker borg always follow the rule.
Are these really queues then? Sounds more like you have a bunch of
threads each performing a certain number of jobs simultaneously (via the
event loop). So you can store the state of worker threads by a set of
pairs (tid,jobs), where tid is the thread id and jobs is the number of
jobs currently being processed by that thread.
>
> But before all that, stubby will first increment your queue length
> counter, because now you have one more socket to work, and the counter
> tells you how many.
>
> When you finish a job (user quit, whatever) you will close the socket.
> Then, you will decrement your queue length counter and try to grab the
> shortest queue flag. If nobody has the flag (sqf < 0), you get it by
> default.
Why not just whack all the (tid,jobs) pairs into a prioqueue sorted on
the # of jobs? That way you just pop the first element of the prioqueue
and that is the thread with the least number of jobs currently. Of
course, you need to update the prioqueue order when you change the job
counter for a thread, but that's probably not too hard.
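One plain-Tcl way such a (tid,jobs) priority structure might be kept
incrementally ordered, without a dedicated prioqueue package; every
name here is made up for illustration:

# prioList holds {tid jobs} pairs, kept sorted by job count (ascending)
proc bumpJobs {tid delta} {
    global prioList
    set i    [lsearch -exact -index 0 $prioList $tid]
    set jobs [expr {[lindex $prioList $i 1] + $delta}]
    set prioList [lreplace $prioList $i $i]      ;# pull the old record out
    # re-insert before the first record with an equal or larger job count
    set at [llength $prioList]
    for {set j 0} {$j < [llength $prioList]} {incr j} {
        if {[lindex $prioList $j 1] >= $jobs} { set at $j; break }
    }
    set prioList [linsert $prioList $at [list $tid $jobs]]
}

# the least-loaded thread is then just the head of the list
set leastLoaded [lindex $prioList 0 0]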
My main question is, though, how many worker threads are you expecting
to have? You can just sort the queues by job count in O(n log n) with
small constant factors using [lsort] or search in O(n) to find the
minimum, so are you expecting n to be sufficiently large for this to be
a real bottleneck in your application (outweighing all the socket I/O)?
-- Neil
> Why not just whack all the (tid,jobs) pairs into a prioqueue sorted on
> the # of jobs? That way you just pop the first element of the prioqueue
> and that is the thread with the least number of jobs currently. Of
> course, you need to update the prioqueue order when you change the job
> counter for a thread, but that's probably not too hard.
That was my initial idea with sorted stat lists:
sort the list by load index and take the lowest.
But one may need other information as well, thus
keep track of more than the number of sockets.
Like, how do I determine stuck threads?
uwe
>Are these really queues then?
No, I said "queues" because when I was trying to work out the state
machine, the problem seemed like a shortest queue first problem in a
grocery store. I had to think in terms of something familiar in the
real world, mathematical abstractions by themselves don't motivate my
thinking.
>Sounds more like you have a bunch of threads each performing a
>certain number of jobs simultaneously (via the event loop).
Right. I would guess that under the covers, socket I/O is handled
with a select(). If that's true, I don't want to have 5,000 open
sockets all loaded onto the same select(). I want to spread the load
around to 10 threads, each doing their own select(), loaded with only
500 sockets each.
>So you can store the state of worker threads by a set of pairs
>(tid,jobs), where tid is the thread id and jobs is the number of
>jobs currently being processed by that thread.
Right again. And that's just what I do.
>Why not just whack all the (tid,jobs) pairs into a prioqueue sorted on
>the # of jobs? That way you just pop the first element of the prioqueue
>and that is the thread with the least number of jobs currently. Of
>course, you need to update the prioqueue order when you change the job
>counter for a thread, but that's probably not too hard.
That's an idea.
>My main question is, though, how many worker threads are you expecting
>to have?
100 for the client pool, 100 for the back end server pool, 40 inbound
translation workers, and 40 outbound translation workers. So roughly
300. Much more than that, and Tcl itself does not scale well.
>You can just sort the queues by job count in O(n log n) with small
>constant factors using [lsort] or search in O(n) to find the minimum,
>so are you expecting n to be sufficiently large for this to be a real
>bottleneck in your application (outweighing all the socket I/O)?
I don't really know the answer. You may be right, O(n log n) for that
decision may not amount to much, in the grand scheme of things.
But I do know my dispatcher is very lazy, and he hates working harder
than an O(1) state machine. :-)
Yes, AOLserver.
The problem is not something which can be solved in Tcl alone.
AOLserver has many features that you will never get in a pure Tcl
application. For instance, AOLserver uses poll, not select. Another
feature is that input reads go through a scatter/gather structure
which is more efficient and instantly adaptable to UDP. Maybe more
interesting is that driver threads are specialized and don't have too
much to do with Tcl.
Anyway, you can't really force this stuff on anyone. But over the last
week of discussions dealing with threads I have come to understand
that a roll-your-own approach, using just Tcl, is destined for
failure. This is really because of all the reasons discussed and
referenced here.
It is really somewhat disingenuous to talk about a threaded
application in pure Tcl, even with the threads extension. Yes, you can
create and use threads. So f---ing what! Unless you have full use of a
number of other features, threads are useless.
With AOLserver, you can just read _Unix Network Programming_ by
Stevens and get a clue about what is possible. Threads is just a piece
of the pie.
Bottom line is that multi-threaded programming requires more than
threads, network programming requires more than read/write/select,
client/server programming isn't an easily solved problem, so if you
find some code which is addressing these issues, give it a second
look.
>The problem is not something which can be solved in Tcl alone.
>AOLserver has many features that you will never get in a pure Tcl
>application. For instance, AOLserver uses poll, not select.
Not everyone agrees that poll is better than select:
> http://www.openldap.org/lists/openldap-devel/200411/msg00056.html
> Unless you have full use of a number of other features, threads
> are useless.
I'm using Tcl threads to distribute sockets among a pool of threads,
so I don't have a huge number of sockets all sitting on one select().
And I use an event loop in each thread to process the I/O from those
sockets.
Mixing Tcl threads and events seems to be just what I need.
> On Thu, 27 Sep 2007 05:56:59 -0000, "tom.rmadilo"
> <tom.r...@gmail.com> wrote:
>
> >The problem is not something which can be solved in Tcl alone.
> >AOLserver has many features that you will never get in a pure Tcl
> >application. For instance, AOLserver uses poll, not select.
>
>
> Not everyone agrees that poll is better than select:
>
> > http://www.openldap.org/lists/openldap-devel/200411/msg00056.html
And not everyone agrees that multi-threaded is a good model for a
server, either. I assume you know for certain that you'll benefit from
multiple threads (with the advent of multi-core machines, this becomes
more of a safe bet than it has ever been).
> > Unless you have full use of a number of other features, threads
> > are useless.
> I'm using Tcl threads to distribute sockets among a pool of threads,
> so I don't have a huge number of sockets all sitting on one select().
Why is that such a bad thing? If you're worried about overhead at that
level, is Tcl the right choice for implementation?
--
MKS
> >One important issue you have not elaborated is: how do you
> >know which thread has the largest load?
> For my purposes, "load" is simply the number of open sockets.
Interesting. So a thread with ten idle open sockets connected to
machines 50 hops away on 300 baud modems is more heavily "loaded" than a
thread with one open socket on the local gigabit LAN? A thread with ten
connections serving static content is more heavily "loaded" than a
thread serving processor-intensive CGI? Well, it's your application, so
you define the parameters.
> Each
> thread will increment an internal counter when initializing a new
> socket, and decrement when closing. When the internal counter
> changes, the thread will post the new value to a global tsv, so at any
> moment, I can look at the list and choose the one with the lowest
> number of open sockets.
> I don't want to iterate the whole list on every decision, trying to
> find the minimum value. If there are 100 entries in the list, I could
> average 50 iterations for each decision.
Finding the minimum value in a 100 element array requires looking at 100
elements. What makes you think you'd require 50 iterations?
# find the thread with the fewest open sockets
set minval 0xffffffff
set idx 0
set minidx 0
foreach sockcount $threadsockcountlist {
    if {$minval > $sockcount} {
        set minidx $idx
        set minval $sockcount
    }
    incr idx
}
> I don't necessarily need to
> find the minimum value, but I want to avoid the maximum value.
Well that's even simpler: get the socket count of the first thread. If
it is zero, use it. Get the socket count of the second thread. If it
is less than that of the first thread, use it. And so on. If you get
to the last thread in the list, and they all have the same number of
open sockets, use the last one.
Not that I'd suggest that method: "peak" usage with threads that are
open for a long time will bias the distribution method toward more
sockets per thread in the "low numbered" threads, and completely idle
threads at the end of the queue.
> Right. I just don't want to iterate the whole list. I want a quicker
> way to decide.
Quicker than checking the count on each of N threads? Well, you could
shift the "work" of determining the next thread to use from the "master"
thread to the workers: any time a worker thread gets a new socket, or
closes an old one, have IT search the socket count list, and write a
global "next thread ID" variable. The work is still being done, but not
at the moment of selection. Unfortunately, if you use this method,
there is no guarantee that the "latest worker thread" has had an
opportunity to run and determine the "next worker thread" before the
"master" thread has to select a new worker thread for the next new
socket. So the previous worker thread gets picked again, and again, and
again.
The way to avoid that is to have the master thread select the "next
worker thread" to use after it has assigned a job. The only difference
between this and selecting the next thread at the time a job is pending
is that the task has been done ahead of time.
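A sketch of that select-the-next-worker-after-assigning variant,
assuming the per-thread counts are published in a shared tsv array
named counts; preSelectNext and nextWorker are illustrative names:

# master thread: after handing a socket to a worker, pre-compute the next target
proc preSelectNext {} {
    global nextWorker
    set bestN -1
    foreach tid [tsv::array names counts] {
        set n [tsv::get counts $tid]
        if {$bestN < 0 || $n < $bestN} {
            set bestN $n
            set nextWorker $tid
        }
    }
}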
--
MKS
>And not everyone agrees that multi-threaded is a good model for a
>server, either. I assume you know for certain that you'll benefit from
>multiple threads (with the advent of multi-core machines, this becomes
>more of a safe bet than it has ever been).
There's more at stake than performance scaling. A threaded design is
a better design when the work model fits threads. Some argue against
threads because designing with threads requires care and thought. Tcl
makes it easy to use threads, but I suppose it still seems daunting to
lazy programmers.
>> I'm using Tcl threads to distribute sockets among a pool of threads,
>> so I don't have a huge number of sockets all sitting on one select().
>Why is that such a bad thing? If you're worried about overhead at that
>level, is Tcl the right choice for implementation?
Do you know how select() determines which fd to service?
>> Right. I just don't want to iterate the whole list. I want a quicker
>> way to decide.
>Quicker than checking the count on each of N threads? Well, you could
>shift the "work" of determining the next thread to use from the "master"
>thread to the workers: any time a worker thread gets a new socket, or
>closes an old one, have IT search the socket count list, and write a
>global "next thread ID" variable.
Good idea. I took that approach, but improved it so the worker does
not need to search a list.
>at the moment of selection. Unfortunately, if you use this method,
>there is no guarantee that the "latest worker thread" has had an
>opportunity to run and determine the "next worker thread" before the
>"master" thread has to select a new worker thread for the next new
>socket. So the previous worker thread gets picked again, and again, and
>again.
I eliminated all those concerns.
As I posted earlier, I devised an O(1) state machine to solve the
problem. I've written the code, tested it, and it works. Though
since that earlier post, I did make some minor refinements which I
have not published ...
Well, the bad idea here is the notion that multiple select() calls are
better/faster than one. If the task is only I/O, it is better to do it
in one place.
The example I have been using is AOLserver, but lighttpd does the
same. Interestingly, AOLserver outperforms lighttpd in serving static
4k files, so the pipeline for reading, handing off and servicing
connections is highly optimized.
But additionally, you need things like timed wait, and many other
features to have a successful, robust server. The point of threads is
not performance, unless you are using a multi-core system. Then you
can get a performance boost. But threads can simplify a great number
of programming tasks by treating them as independent from everything
else which might be going on in other threads. Essentially, you are
freed from thinking about cooperation or synchronization.
Since there is always a need here to come up with an OT example, I'll
try that route.
Imagine if there existed a restaurant with one table (process) and one
set of flatware and china. The plate represents the current working
memory. The food, the contents of that memory. The flatware represents
the running process. The table represents the processor. The waiter
represents the OS. (And the kitchen represents the I/O & storage
subsystems) With threads, you have one meal per plate. With an event
loop, you have many meals on a large plate. Important parts of this
analogy are: food takes up the same amount of space (plate space), one
set of flatware/table means only one customer can eat at once (in
either case). The waiter has to take care to schedule the use/reuse/
cleaning of the flatware, and scheduling a customer to use it.
There are problems with this analogy. For one, the waiter is
responsible for making certain decisions, but with a Tcl event loop,
you have a bus-boy trying to do the same job. The bus-boy could be
interrupted by the waiter, who can handle multiple tables and multiple
plates. The waiter can come over and take the flatware from the bus-
boy and transport it to another table/plate, where the customers are
dining on expensive food, or who provided a better tip. The bus-boy
has no power to do anything until the waiter gives back the flatware.
Once a customer leaves, the bus-boy is stuck with cleaning the plate,
or the portion used by the customer. But the bus-boy can't do anything
without the okay of the waiter, so s/he might have to wait around even
to do this mundane task.
Anyway, the bottom line is that every process takes a certain amount
of space and a certain amount of time. Having a bus-boy deciding which
customer gets the flatware next doesn't necessarily speed up anything
or take less space, however it is easier to share the peas or roast
beef. But the single plate/many customer service makes it hard for two
customers to order the same thing. No, these are _my_ peas. So the bus-
boy has to come up with some way to distinguish one set of peas from
another. When all customers have their own plate, it is pretty easy to
figure out which peas are your peas.
Using threads, real threads which are not subject to a bus-boy, can
greatly simplify the use of generic code in a multi-customer
environment.
After this I would redefine performance as the ability to satisfy many
different customers...fairly. This is very difficult to achieve using
the bus-boy model.
(Disclaimer: I've been a bus-boy; waiters know what they are doing)
If you can't narrow your solution down any further than order of
algorithmic magnitude, I would say that is a failure. This is not even
a theoretical win. Hell, USPS probably beats your solution; at some
point you have to deal with reality.
the problem with that idea is that due to the way the threaded tcl
notifier is implemented, there is only ever a single select() waiting
on all sockets (and other fd's) of all threads of the process.
That select() runs in a thread of its own (the notifier thread), and
tcl threads waiting for events block on a condition variable until the
notifier thread signals them, c.f. NotifierThreadProc() in
tclUnixNotfy.c.
If select() scaling is a real (measurable) problem for you rather than
a theoretical one, you may want to look at using kqueue() (if your
platform has it). The original kqueue paper has a good comparison of
select() vs poll() vs kqueue() performance with 1000s of fd's.
I have an almost finished tcl kqueue extension somewhere, kqueue() was
quite easy to integrate with the select()-based tcl notifier (a kqueue
is itself represented by a fd that select() can wait on, it becomes
readable when there are events on the kqueue).
Cheers,
Daniel
the shortest queue may well be the one with the slowest cashier.
Everybody before you noticed this already and moved to another,
faster queue ;-)
The list and lsort -index I proposed earlier are by themselves only
a framework, but they have the potential to test different algorithms
in an easy way.
uwe
> >level, is Tcl the right choice for implementation?
>
> Do you know how select() determines which fd to service?
>
A couple of years back, when webservers still shot at each
other, there was a lot of research and thought invested
into "thundering herd" effects due to all threads being
woken on a port 80 connect, I think.
It may well be worth diving into the history bin and
reading some stories from olden times ...
uwe
All of this analysis seems to assume that each connection is the same.
An analogy where the customers can decide which line to stand in by
observation is not useful. Why? First, all cashiers are the same, what
is different is the customers. Customers are responsible for removing
items from their basket and handing them to the cashier. Some
customers could look fast, but end up slowing way down. So trying to
calculate queue/cashier performance is doomed to fail. (Just wait 'til
they pull out the roll of pennies to pay)
What is needed is first a timeout. That is, if a customer is taking
too long, they get kicked out. Otherwise, you eventually get stuck
waiting on slow customers.
Second, you need a queue to feed the cashiers, not a queue for each
cashier. In other words, the cashiers work with one customer, then ask
for another one. If you want to work with 10 customers at once, then
all you do is loop over the customers. At the end of the loop you see
how many customers are left, and ask for more so that you end up with
10 at the next loop. So the feeder queue doesn't need to know much, or
communicate with cashiers except to transfer batches of customers.
AOLserver works like my local bank on Friday afternoon. There is a pre-
queue person who makes sure that customers have everything ready
before they get to the head of the line. Only those with real business
get through the queue. At the head of the line, cashiers pull off
customers one by one.
> Some argue against
> threads because designing with threads requires care and thought. Tcl
> makes it easy to use threads, but I suppose it still seems daunting to
> lazy programmers.
There's no need to be defensive, dogmatic, or demeaning around here.
Make the strongest possible case for your point of view and accept
that there will be differences of opinion. In any case, the program
you're writing is your own, so the mother-in-law rule applies (hope
mine isn't reading this): listen politely to the advice, nod your
head, then go do it your way.
>On Sep 27, 2:48 am, John Kelly <j...@isp2dial.com> wrote:
>> I would guess that under the covers, socket I/O is handled
>> with a select(). If that's true, I don't want to have 5,000 open
>> sockets all loaded onto the same select(). I want to spread the load
>> around to 10 threads, each doing their own select(), loaded with only
>> 500 sockets each.
>the problem with that idea is that due to the way the threaded tcl
>notifier is implemented, there is only ever a single select() waiting
>on all sockets (and other fd's) of all threads of the process.
>That select() runs in a thread of its own (the notifier thread), and
>tcl threads waiting for events block on a condition variable until the
>notifier thread signals them, c.f. NotifierThreadProc() in
>tclUnixNotfy.c.
That's what I wanted to know. Thanks for telling.
>If select() scaling is a real (measurable) problem for you rather than
>a theoretical one, you may want to look at using kqueue() (if your
>platform has it). The original kqueue paper has a good comparison of
>select() vs poll() vs kqueue() performance with 1000s of fd's.
>I have an almost finished tcl kqueue extension somewhere, kqueue() was
>quite easy to integrate with the select()-based tcl notifier (a kqueue
>is itself represented by a fd that select() can wait on, it becomes
>readable when there are events on the kqueue).
My design should make it easy to switch. When your kqueue extension
is complete, please advise.
>There's no need to be defensive, dogmatic, or demeaning around here.
>Make the strongest possible case for your point of view and accept
>that there will be differences of opinion. In any case, the program
>you're writing is your own, so the mother-in-law rule applies (hope
>mine isn't reading this): listen politely to the advice, nod your
>head, then go do it your way.
[nod]
I failed to find its description in this lengthy thread.
Care to share it again (or provide a link) ?
I'm not asking for trade secrets... Just say what kind of
incrementally-maintained ordered structure you are using.
-Alex
>On Sep 29, 12:02 am, John Kelly <j...@isp2dial.com> wrote:
>>
>> As I posted earlier, I devised an O(1) state machine to solve the
>> problem. I've written the code, tested it, and it works. Though
>> since that earlier post, I did make some minor refinements which I
>> have not published ...
>
>I failed to find its description in this lengthy thread.
Here's a google link to where I first described it:
http://groups.google.com/group/comp.lang.tcl/msg/76c7bb7d3d020ccb
>Care to share it again (or provide a link) ?
I will. The post linked above was a rough, sketchy plan. I refined
the idea when writing the code.
>I'm not asking for trade secrets... Just say what kind of
>incrementally-maintained ordered structure you are using.
It's no secret, but it will be easier for me to explain within the
context of the server framework. I'm working on a minimal version of
that, it's an interesting example of using threads and events.
I'll post it soon. Needs more work, maybe another week or so ...
On at least some platforms (OK, on Solaris for sure which is the only
place I actually bothered to look) one is implemented using the other.
Donal.
Do you mean there's just one syscall and the other is implemented in
libc over it, or rather two syscalls, one of which is a layer above
the other within the Solaris kernel ?
-Alex
> >Why is that such a bad thing? If you're worried about overhead at that
> >level, is Tcl the right choice for implementation?
> Do you know how select() determines which fd to service?
On every OS that Tcl runs on? No, I don't.
Which plays into my second question: if that level of detail is a
concern to you for performance, is Tcl the correct choice in the first
place?
--
MKS
> > > I'm using Tcl threads to distribute sockets among a pool of threads,
> > > so I don't have a huge number of sockets all sitting on one select().
> >
> > Why is that such a bad thing? If you're worried about overhead at that
> > level, is Tcl the right choice for implementation?
>
> Well, the bad idea here is that multiple select() is better/faster
> than one. If the task is only I/O, it is better to do it in one place.
Doesn't that depend on both volume of connections and architecture?
Could servicing multiple network interfaces change that assumption?
If we assume some number of worker threads larger than the number of
available processors, and some number of sockets larger than the number
of working threads, then when a socket becomes readable, the kernel
would wake the appropriate sleeping thread, no? There's overhead
involved in that context switch.
I assume the gripe about select() is that it is inefficient, and
increasingly so with large numbers of descriptors. I assume the
application in question will be operating in this domain. Okay, but
doesn't the kernel have to process the same number of waiting
descriptors, regardless of how many different threads they are
select()ed in? Or does the inefficiency come in processing the entire
list on every call to select()? If the latter, okay, I can see how
that's not a very good thing, and probably calls for something more like
kqueue.
> The point of threads is
> not performance, unless you are using a multi-core system. Then you
> can get a performance boost. But threads can simplify a great number
> of programming tasks by treating them as independent from everything
> else which might be going on in other threads.
I'm familiar with the concept. The last server I wrote didn't expect a
high number of connections, but I implemented it using a threaded model,
because it was just simpler to do. In fact, for that application I knew
that connection volume would be fairly low (not stressing max limits),
so I simply used a one-thread-per-connection model.
> Essentially, you are
> freed from thinking about cooperation or synchronization.
Well, you relegate those functions to the OS, and you hope the OS
implements the cooperation in a manner you'd like. Which is a
reasonable assumption when dealing with only one sort of scarce resource.
--
MKS
> >at the moment of selection. Unfortunately, if you use this method,
> >there is no guarantee that the "latest worker thread" has had an
> >opportunity to run and determine the "next worker thread" before the
> >"master" thread has to select a new worker thread for the next new
> >socket. So the previous worker thread gets picked again, and again, and
> >again.
> I eliminated all those concerns.
Curious, how? Unless the thread handing out the new connections to
workers is the one that updates the "next worker" pointer, how can you
guarantee that it has been updated by the time the next socket is ready
to be handed off?
--
MKS
Yes. You can/could see it in stack traces.
Donal.
Thanks. I can see it now :-)
> I'll post it soon. Needs more work, maybe another week or so ...
OK, take your time. In the meantime, I'm just surprised nobody
mentioned (or I missed it again) what happens when the job holding the
sqf gets busy (it happens !), and hence its queue length becomes
larger than the sqf. How do we know how many ex-aequo had already the
same queue length ? Or do you update sqf to -1 again in this case ?
In short, to be O(1) it seems to me that the algorithm must maintain a
somewhat ordered (if incompletely) structure, and not call an O(N)
procedure (like resetting sqf to a reasonable value by scanning
everybody) from time to time...
-Alex
You're right, I just put my hands on a Sun. Then it's select(3) calling
poll(2). But this loses any advantage select has of a much smaller
footprint in terms of cross-address-space transfer, right?
(Notice I shouldn't care too much considering the significance of
Solaris in my field ;-)
-Alex
>OK, take your time. In the meantime, I'm just surprised nobody
>mentioned (or I missed it again) what happens when the job holding the
>sqf gets busy (it happens !), and hence its queue length becomes
>larger than the sqf. How do we know how many ex-aequo had already the
>same queue length ? Or do you update sqf to -1 again in this case ?
First, I should clarify:
A group of sockets sitting on an event loop may not be a queue in the
usual sense, but I think of it as a "queue" in the sense of trying to
find the shortest one first. So by "queue" I mean a group of sockets
sitting on an event loop.
Now to the question ...
When writing the code, I realized I could eliminate the -1 flag value
special case.
I no longer view the initial state as a special case. I arbitrarily
give the flag to some worker, since they're all equal at the initial
state. I like number 1, so that's who gets it.
I also arbitrarily start the roundrobin with the worker holding the
flag.
If the worker holding the flag gets busy, that won't matter. When a
new socket arrives, the dispatcher always makes a simple choice. He
knows who is next in the roundrobin, he knows who holds the flag, and
he knows their respective socket counts.
Shortest "queue" always wins.
When there is a winner, the winner gets the socket. The dispatcher
does NOT advance the roundrobin, he leaves it where it was, for the
next cycle.
Now if both queues were equal, it's a tie. By rule, the one next up
in the roundrobin wins the tie.
This decision is repeated for every new socket that arrives.
Now imagine opening for business, and a big burst of sockets arrive.
The roundrobin will evenly distribute the sockets among the workers,
and it doesn't matter who holds the flag, UNTIL at least one socket
closes.
When that happens, the worker who closed his socket will try to grab
the flag (if someone else holds it). He follows a rule that says, to
grab the flag, his count must be less than the flag holder's count.
He compares his count to the flag holder's count. It doesn't matter
what the other counts are, he only needs to know the lowest, and that
will always be at the flag holder. So every time a socket closes, the
flag either moves to the shortest queue, or stays at the shortest
queue.
The dispatcher's decision is easy, because the workers keep their own
count of sockets. When the dispatcher sends a new socket to a worker,
the worker increments his count. And when the worker closes a socket,
he decrements his count.
The worker can't do both at the same time, he either does one or the
other. When he decrements his count, and tries to grab the flag, it
doesn't matter what anyone else is doing.
Access to the flag is serialized by a tsv, so the lowest count always
gets it, even if that means changing hands in rapid succession when a
burst of sockets close.
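A sketch of what that worker-side close path might look like, assuming
shared tsv arrays named counts and flag; tsv::lock serializes the
test-and-grab, and socketClosed is an invented name:

proc socketClosed {sock} {
    global myCount
    close $sock
    tsv::set counts [thread::id] [incr myCount -1]  ;# publish the new, lower count
    # try to grab the shortest-queue flag: only a strictly lower count may take it
    tsv::lock flag {
        set holder [tsv::get flag holder]
        if {$myCount < [tsv::get counts $holder]} {
            tsv::set flag holder [thread::id]       ;# flag moves to the shorter queue
        }
    }
}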
>In short, to be O(1) it seems to me that the algorithm must maintain a
>somewhat ordered (if incompletely) structure, and not call an O(N)
>procedure (like resetting sqf to a reasonable value by scanning
>everybody) from time to time...
Right, that's not necessary.
>John Kelly wrote:
>> Melissa Schrumpf wrote:
I tried to explain in my follow up to Alex. I hope it made sense.
>I assume the gripe about select() is that it is inefficient, and
>increasingly so with large numbers of descriptors. I assume the
>application in question will be operating in this domain. Okay, but
>doesn't the kernel have to process the same number of waiting
>descriptors, regardless of how many different threads they are
>select()ed in?
As Daniel pointed out.
>there is only ever a single select() waiting on all sockets
So in a Tcl-only implementation, my load distribution algorithm cannot
achieve its purpose.
However, that Tcl feature is only an implementation detail. I also
want a prototype that helps me understand how to use distributed
systems for managing work.
As for the question of select() scaling with large numbers of fd's, I
speculate the kernel has to iterate the entire list of fd's sitting on
a select(), to see which ones have data to process.
It's like hardware interrupts and ports. The hardware signals you
with an interrupt, but when multiple ports are associated with one
IRQ, you must read every port, to see which ones are ready.
Same principle with fd's and select(). To scale, you must divide the
workload.
>Or does the inefficiency come in processing the entire
>list on every call to select()? If the latter, okay, I can see how
>that's not a very good thing, and probably calls for something more like
>kqueue.
I think you can scale with select(), if you divide the workload.
Not necessarily: if you "reverse the arrow" (I mean, let the fds be
active elements "pushing" their state to converge in one mutex-
protected place, instead of being passive things to be pulled in
sequence), then there's no iteration[*]. But isn't this more or less
what epoll does ?
-Alex
[*] I mean, iteration just on the active ones, which are supposedly a
small fraction of the totality
> Shortest "queue" always wins. When there is a winner, the
> winner gets the socket.
Well, not exactly. I should have said: There are only two
contestants: the flag, and the roundrobin.
Sometimes, the flag and the roundrobin belong to the same worker. In
that case, there is no contest. Proceed with roundrobin.
Proceed with roundrobin means: choose the worker identified by the
roundrobin pointer (arrow, whatever), and then, advance the roundrobin
pointer, in preparation for the next cycle.
Now if the flag and the roundrobin do NOT belong to the same worker,
then they must belong to different workers. So see which of the two
has the shortest queue (lowest socket count).
If it's a tie (counts are equal), follow the same rule as no contest.
Proceed with roundrobin.
Now if the roundrobin has the lowest count, he "wins," but he *is* the
roundrobin, so follow the same rule as no contest (or a tie). Proceed
with roundrobin.
And now, there's only one possibility remaining:
The flag wins with lowest count. In that case, give socket to the
flagholder. And do NOT advance the roundrobin pointer.
This may seem excessively verbose, but it's important to the proof,
and the logic reduces to only a few lines of code. Should be ready to
post fairly soon ...
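Since the logic is said to reduce to a few lines, here is a sketch of
what the dispatcher's choice might look like under those rules; this
is not his posted code, and the round-robin state and tsv names are
assumptions carried over from the earlier sketches:

proc dispatchChoice {} {
    global rrIndex workerIds
    set rrTid   [lindex $workerIds $rrIndex]
    set flagTid [tsv::get flag holder]
    # the flag wins only when it belongs to someone else with a strictly lower count
    if {$flagTid ne $rrTid &&
        [tsv::get counts $flagTid] < [tsv::get counts $rrTid]} {
        return $flagTid                          ;# do not advance the round robin
    }
    # no contest, a tie, or round robin wins: use it and advance the pointer
    set rrIndex [expr {($rrIndex + 1) % [llength $workerIds]}]
    return $rrTid
}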
>> I devised an O(1) state machine
> Care to share it again (or provide a link) ?
ftp://isp2dial.com/imapwwwbui/imapwwwbui-0.001.tgz
It's all good.
On the one hand, the one problem with select() that isn't silly whining
is that the size of the lists to process depends on the total number of
FDs in the process, and not the number that are being waited upon by the
particular call. In turn, that makes things a bit harder to scale even
if you divide the workload up. On the other hand, any disk or network
I/O is going to dominate that utterly anyway; physical devices are
slower than a little bit of bit shuffling, even when that bit shuffling
is in the OS kernel. You have to watch out for kernel developers and
their claims of inefficiency of APIs, as they are often using very
specialized measures that are hard to consistently apply to practical
situations.
The gripping hand is that select() is an extremely widely supported API;
going to something else will require a fair bit of engineering effort
just to make sure that the code builds reliably everywhere. The best way
to change things is to contribute effort and patches, as always. (You
only need to change a single source file, but you probably also need to
provide configure script changes.)
Donal.