io_work_batch_queue g_queue;
mutex g_mutex;
cond g_cond;
void io_thread()
{
for (;;)
{
io_work_batch workb;
io_wait_for_data(workb);
workb.simple_parsing();
while (! g_mutex.try_lock())
{
if (! io_poll_for_data(workb))
{
g_mutex.lock();
break;
}
workb.simple_parsing();
}
g_queue.push(workb);
g_mutex.unlock();
g_cond.signal();
}
}
void computation_thread()
{
for (;;)
{
io_work_batch workb;
g_mutex.lock();
while (! g_queue.pop(workb))
{
g_cond.wait(g_mutex);
}
g_mutex.unlock();
workb.process();
}
}
Why does a I/O thread need to wait on the mutex when it could poll for data
and perhaps batch up more work in the mean time? If there is no data then
the mutex will actually be waited on. Does the scheme make any sense?
[...]
I think it might be wise to limit the number of times an I/O thread can
execute try_lock.
void io_thread()
{
for (;;)
{
unsigned int limit = 5;
io_work_batch workb;
io_wait_for_data(workb);
workb.simple_parsing();
while (! g_mutex.try_lock())
{
if (! --limit ||
! io_poll_for_data(workb))
{
g_mutex.lock();
break;
}
workb.simple_parsing();
}
g_queue.push(workb);
g_mutex.unlock();
g_cond.signal();
}
}
?
> Why does a I/O thread need to wait on the mutex when it could poll for data
> and perhaps batch up more work in the mean time?
This is very, very unlikely to happen. The mutex is only held for the
tiny amount of time it takes the worker threads to pull work out of
the queue.
> If there is no data then
> the mutex will actually be waited on. Does the scheme make any sense?
It's very unlikely to make any difference. In any reasonable
implementation of the rest of the design, this code will almost never
make any difference.
In any event, unless you have some special reason to need fixed I/O
threads and work threads, that tends to be an inefficient design. It's
better to have a single pool of threads that morph between I/O and
work threads as needed. That way, the number of context switches can
be minimized and there won't be a need to ping-pong every work request
from one core to another.
For example, instead of 4 I/O thread and 8 worker threads, have 12
threads in a single pool. A thread follows this type of logic:
1) If there's work on the queue, do the first job, and go back to step
1.
2) If all I/O jobs are already being done by other threads, block on
the c.v. and goto step 1.
3) Block on an I/O task until it either no longer needs to be done or
we get at least one job to do.
4) If we didn't get any job in step 3, goto step 1.
5) Queue the job(s) we did get.
6) Did we get more than one job? If so, wake other threads. (So they
can take the job(s) we don't.) If we got exactly one job, wake one
other thread (so one can do the job and one can do the I/O).
7) Go to step 1.
Which approach is better depends on how much data is involved in the
job. If the jobs are big in terms of memory, then moving the job to
another thread will be expensive. On the other hand, if the jobs are
tiny, moving the I/O code to another thread (which won't have it in
cache) may be expensive. If the I/O code is too large to fit in cache
anyway, then not having to shift the data tends to be a win.
There's actually one more layer of complexity for real-world
applications, because you typically have three layers, not just two.
First, there's socket discovery (figuring out which sockets need I/O
done), then there's the actual I/O (the 'read'/'write' calls and
moving the data between the application and the kernel), and then
there's the server level (generating the data to send or processing
the data received).
For the full blown three-layer case, the algorithms that provide
optimal performance get quite complex.
DS
If you really care about that, you may consider using lock-free queue.
Then you can replace:
while (! g_mutex.try_lock())
{
if (! io_poll_for_data(workb))
{
g_mutex.lock();
break;
}
workb.simple_parsing();
}
g_queue.push(workb);
g_mutex.unlock();
g_cond.signal();
With just:
g_queue.push(workb);
--
Dmitriy V'jukov
Fine-tuning might become relevant if the workload needs adjustments for task
priorities.
http://www.cs.wustl.edu/~schmidt/PDF/OM-01.pdf
Regards,
Markus
> > Why does a I/O thread need to wait on the mutex when it could poll for
> > data and perhaps batch up more work in the mean time?
> This is very, very unlikely to happen. The mutex is only held for the
> tiny amount of time it takes the worker threads to pull work out of
> the queue.
> > If there is no data then
> > the mutex will actually be waited on. Does the scheme make any sense?
> It's very unlikely to make any difference. In any reasonable
> implementation of the rest of the design, this code will almost never
> make any difference.
I wanted to see if doing actual work instead of perhaps blocking on a kernel
waitset could be beneficial. I guess you could think of it as an adaptive
mutex, except that each iteration of the spin-wait would try to poll for
other work to do. Perhaps I have a boneheaded line of thinking on the
matter, but I am interested in how one can exploit a failed try_lock.
> In any event, unless you have some special reason to need fixed I/O
> threads and work threads, that tends to be an inefficient design. It's
> better to have a single pool of threads that morph between I/O and
> work threads as needed. That way, the number of context switches can
> be minimized and there won't be a need to ping-pong every work request
> from one core to another.
> For example, instead of 4 I/O thread and 8 worker threads, have 12
> threads in a single pool.
The I/O threads do some simple parsing before they send formatted work items
over to the computational threads. The actual computations can involve
searching big complicated data-structures and performing some number
crunching based on search results. So, I can envision a scenario in which
all 12 threads are performing computations. This means that no parsing can
be done. Also, imagine worst-case in which each of the 12 threads started
the computation process at around the same time. The I/O and parsing aspect
would experience a stall based on the complexity of the processing.
Now, if I had 4 I/O and 8 workers then data can be received and parsed
regardless of how much load the 8 workers have. Also, if there is an error
in the parsing, an I/O thread can drop the connection, or perform some other
protocol level actions. I want that service to be responsive, so I think
that a separate thread pool was prudent. Do you think that is a legitimate
concern David?
> A thread follows this type of logic:
> 1) If there's work on the queue, do the first job, and go back to step
> 1.
> 2) If all I/O jobs are already being done by other threads, block on
> the c.v. and goto step 1.
> 3) Block on an I/O task until it either no longer needs to be done or
> we get at least one job to do.
> 4) If we didn't get any job in step 3, goto step 1.
> 5) Queue the job(s) we did get.
> 6) Did we get more than one job? If so, wake other threads. (So they
> can take the job(s) we don't.) If we got exactly one job, wake one
> other thread (so one can do the job and one can do the I/O).
> 7) Go to step 1.
That's interesting. I am not sure I understand step 3. If thread A is at
step 3 and thread B produced work on the queue, how does step 3 get
signaled? I am not sure how you intermingle condvar waits/signals with I/O
discovery.
> Which approach is better depends on how much data is involved in the
> job. If the jobs are big in terms of memory, then moving the job to
> another thread will be expensive. On the other hand, if the jobs are
> tiny, moving the I/O code to another thread (which won't have it in
> cache) may be expensive. If the I/O code is too large to fit in cache
> anyway, then not having to shift the data tends to be a win.
> There's actually one more layer of complexity for real-world
> applications, because you typically have three layers, not just two.
> First, there's socket discovery (figuring out which sockets need I/O
> done), then there's the actual I/O (the 'read'/'write' calls and
> moving the data between the application and the kernel), and then
> there's the server level (generating the data to send or processing
> the data received).
> For the full blown three-layer case, the algorithms that provide
> optimal performance get quite complex.
Thank you for all of this valuable information!
:^)
> If you really care about that, you may consider using lock-free queue.
Not sure I want to get into all of those hardcore algorithm's right now.
> Then you can replace:
> [...]
> With just:
> g_queue.push(workb);
:^)
Usually, try_lock is used to implement complex locking schemes. However, I
see no problem with experimenting. It could be beneficial. If a thread fails
to acquire a mutex and there is other important work to be done right then
and there, well, I could see the logic in letting the failed thread go ahead
and work on that.
>> In any event, unless you have some special reason to need fixed I/O
>> threads and work threads, that tends to be an inefficient design. It's
>> better to have a single pool of threads that morph between I/O and
>> work threads as needed. That way, the number of context switches can
>> be minimized and there won't be a need to ping-pong every work request
>> from one core to another.
>
>> For example, instead of 4 I/O thread and 8 worker threads, have 12
>> threads in a single pool.
>
> The I/O threads do some simple parsing before they send formatted work
> items over to the computational threads. The actual computations can
> involve searching big complicated data-structures and performing some
> number crunching based on search results. So, I can envision a scenario in
> which all 12 threads are performing computations. This means that no
> parsing can be done. Also, imagine worst-case in which each of the 12
> threads started the computation process at around the same time. The I/O
> and parsing aspect would experience a stall based on the complexity of the
> processing.
>
> Now, if I had 4 I/O and 8 workers then data can be received and parsed
> regardless of how much load the 8 workers have. Also, if there is an error
> in the parsing, an I/O thread can drop the connection, or perform some
> other protocol level actions. I want that service to be responsive, so I
> think that a separate thread pool was prudent. Do you think that is a
> legitimate concern David?
Yeah. Well, IMHO, on Windows at least the IOCP queue uses non-paged memory.
So, the faster you get items off that queue the better! Therefore, IMHO, it
can be good to decouple I/O from actual work.
>> A thread follows this type of logic:
>
>> 1) If there's work on the queue, do the first job, and go back to step
>> 1.
>> 2) If all I/O jobs are already being done by other threads, block on
>> the c.v. and goto step 1.
>> 3) Block on an I/O task until it either no longer needs to be done or
>> we get at least one job to do.
>> 4) If we didn't get any job in step 3, goto step 1.
>> 5) Queue the job(s) we did get.
>> 6) Did we get more than one job? If so, wake other threads. (So they
>> can take the job(s) we don't.) If we got exactly one job, wake one
>> other thread (so one can do the job and one can do the I/O).
>> 7) Go to step 1.
>
> That's interesting. I am not sure I understand step 3. If thread A is at
> step 3 and thread B produced work on the queue, how does step 3 get
> signaled? I am not sure how you intermingle condvar waits/signals with I/O
> discovery.
On Windows, you can post a custom message to the IOCP that can act as a
signal to wake threads so they can process the work queue.
> The I/O threads do some simple parsing before they send formatted work items
> over to the computational threads. The actual computations can involve
> searching big complicated data-structures and performing some number
> crunching based on search results. So, I can envision a scenario in which
> all 12 threads are performing computations. This means that no parsing can
> be done. Also, imagine worst-case in which each of the 12 threads started
> the computation process at around the same time. The I/O and parsing aspect
> would experience a stall based on the complexity of the processing.
Right, but what good would it do to discover new I/O if you had no
thread to do the processing?
> Now, if I had 4 I/O and 8 workers then data can be received and parsed
> regardless of how much load the 8 workers have. Also, if there is an error
> in the parsing, an I/O thread can drop the connection, or perform some other
> protocol level actions. I want that service to be responsive, so I think
> that a separate thread pool was prudent. Do you think that is a legitimate
> concern David?
I do, but I don't think is requires you to force handoffs. You can
simply specify that no more than X threads can do I/O and no more than
Y can do computation. (It's just a matter of which pool the thread
checks first.)
> > A thread follows this type of logic:
> > 1) If there's work on the queue, do the first job, and go back to step
> > 1.
> > 2) If all I/O jobs are already being done by other threads, block on
> > the c.v. and goto step 1.
> > 3) Block on an I/O task until it either no longer needs to be done or
> > we get at least one job to do.
> > 4) If we didn't get any job in step 3, goto step 1.
> > 5) Queue the job(s) we did get.
> > 6) Did we get more than one job? If so, wake other threads. (So they
> > can take the job(s) we don't.) If we got exactly one job, wake one
> > other thread (so one can do the job and one can do the I/O).
> > 7) Go to step 1.
> That's interesting. I am not sure I understand step 3. If thread A is at
> step 3 and thread B produced work on the queue, how does step 3 get
> signaled? I am not sure how you intermingle condvar waits/signals with I/O
> discovery.
It doesn't have to. If we get to step 3, we are an I/O thread for the
moment. So we don't care if work is added to the queue. Some thread
has to wait for I/O on that port/set/whatever, so it's us if we get to
step 3. Note that this is the "empty work queue" case, so there's
plenty of other threads to do that work. No reason to force another
thread to come around and do the I/O we're already ready to do.
> Thank you for all of this valuable information!
You are welcome.
DS
The choice is definitely up to you. However from my practice, mutexes
and condition variables are not much easier to implement than
queues...
Lock-free queue eliminates the problem that one thread must wait for
another. IMVHO it's nice, especially if that's the very problem you
are trying to solve.
--
Dmitriy V'jukov
> > The I/O threads do some simple parsing before they send formatted work
> > items
> > over to the computational threads. The actual computations can involve
> > searching big complicated data-structures and performing some number
> > crunching based on search results. So, I can envision a scenario in
> > which
> > all 12 threads are performing computations. This means that no parsing
> > can
> > be done. Also, imagine worst-case in which each of the 12 threads
> > started
> > the computation process at around the same time. The I/O and parsing
> > aspect
> > would experience a stall based on the complexity of the processing.
> Right, but what good would it do to discover new I/O if you had no
> thread to do the processing?
I mentioned that the I/O threads can do some simple parsing and react to
certain protocol requirements. Why should that work be blocked when there is
high load an computation threads are busy? I/O threads have important work
to do I do not want it to be "paused" because all worker/computation threads
are busy.
> > Now, if I had 4 I/O and 8 workers then data can be received and parsed
> > regardless of how much load the 8 workers have. Also, if there is an
> > error
> > in the parsing, an I/O thread can drop the connection, or perform some
> > other
> > protocol level actions. I want that service to be responsive, so I think
> > that a separate thread pool was prudent. Do you think that is a
> > legitimate
> > concern David?
> I do, but I don't think is requires you to force handoffs. You can
> simply specify that no more than X threads can do I/O and no more than
> Y can do computation. (It's just a matter of which pool the thread
> checks first.)
AFAICT, that is equal to what I was doing in the first place. Except, in my
model I have much more control. You're solution is much more random.
> > > A thread follows this type of logic:
> > > 1) If there's work on the queue, do the first job, and go back to step
> > > 1.
> > > 2) If all I/O jobs are already being done by other threads, block on
> > > the c.v. and goto step 1.
> > > 3) Block on an I/O task until it either no longer needs to be done or
> > > we get at least one job to do.
> > > 4) If we didn't get any job in step 3, goto step 1.
> > > 5) Queue the job(s) we did get.
> > > 6) Did we get more than one job? If so, wake other threads. (So they
> > > can take the job(s) we don't.) If we got exactly one job, wake one
> > > other thread (so one can do the job and one can do the I/O).
> > > 7) Go to step 1.
> > That's interesting. I am not sure I understand step 3. If thread A is at
> > step 3 and thread B produced work on the queue, how does step 3 get
> > signaled? I am not sure how you intermingle condvar waits/signals with
> > I/O
> > discovery.
> It doesn't have to. If we get to step 3, we are an I/O thread for the
> moment.
But that I/O thread is dependant on I/O reactivity wrt the amount of time
it's blocked on I/O for. It cannot react to a queued item, unless you follow
Chris Thomason's idea to queue a dummy item in the IOCP in order to allow a
blocked I/O thread to process the protocol.
You have not totally convinced me that dumping all threads into a single bag
is a good idea. I think that separating I/O from computation is beneficial
in my specific scenario. If all computation threads are busy, at least the
I/O threads can still respond and react to protocol level errors, whatever.
You seem to want me to let any thread randomly choose work. Are you
suggesting that I follow the Clik++ style random scheduling "mess"? What am
I missing David?
[...]
I have read that lock-free algorithms perform much worse than there
lock-based equivalents in this group. David Schwartz's posts vs. SenderX,
and Chris Thomasson. AFAICT, David says that lock-free == horrible
saturation of the bus which is not good!
It's never, NEVER, that simple.
If you do something wrong, it'll hurt you. Sometimes it's difficult to
determine what's wrong. (Which, um, [cough, cough] generally means you
don't know enough about your environment to have any business being
there; but that's an entirely different discussion.)
The important message is don't just "do the buzzword" and assume you're
better off because you followed the herd. Analyze. Test. Understand.
I can't count how many times someone's issued the challenge '"they" told
me threads would speed up my code, but I added threads and the
performance tanked.' Probably anyone reading this can come up with a
dozen likely mistakes that would cause this.
Lock-free is a tool. There are a lot of things you can't do without it.
There's no OS supporting multiprocessing that doesn't do lock free; just
as there are none that don't have locks.
Just because you know how to use a hammer, don't assume it's the right
tool to insert a screw. But it's as bad to use a screwdriver to insert a
nail just because someone told you not to use a hammer for a screw.
Yes, inappropriate or excessive use of lock-free patterns can swamp your
bus and hurt performance. Just as inappropriate and excessive use of
locks can destroy performance with context switches, live-locks, false
sharing, and all sorts of other evil stuff. Often, it's the perfect
recipe of both, each in the right place and at the right time, that
gives you the optimal performance.
SenderX and Chris Thomasson are a good example, since you brought it up.
SenderX was a loudmouthed lock-free zealot who said locks are evil and
only lock-free can save the day. We had some fascinating conversations
and arguments, and then one day he said "wow, I added locks to my
lock-free system and it got even better". He became Chris Thomasson, the
newsgroup became a more pleasant place, and all the little threads got
christmas presents. What's not to like?
> > Right, but what good would it do to discover new I/O if you had no
> > thread to do the processing?
> I mentioned that the I/O threads can do some simple parsing and react to
> certain protocol requirements. Why should that work be blocked when there is
> high load an computation threads are busy? I/O threads have important work
> to do I do not want it to be "paused" because all worker/computation threads
> are busy.
Okay, so then if you would have had 8 I/O threads in your design, just
always make a thread an I/O thread if there are fewer than 8.
> > I do, but I don't think is requires you to force handoffs. You can
> > simply specify that no more than X threads can do I/O and no more than
> > Y can do computation. (It's just a matter of which pool the thread
> > checks first.)
> AFAICT, that is equal to what I was doing in the first place. Except, in my
> model I have much more control. You're solution is much more random.
No, it's not equal. In your case, to discover I/O and then do
computation will always require the information to be handed off from
one thread to another, and in almost all cases it will require a
context switch.
You say it's "random", but that's the whole point. Every bit of
ordering your code forces has a cost. The key to efficient
multithreading is to only enforce ordering where it is necessary for
the correct operation of the program. If you let the scheduler do its
job, you will get better performance.
> But that I/O thread is dependant on I/O reactivity wrt the amount of time
> it's blocked on I/O for. It cannot react to a queued item, unless you follow
> Chris Thomason's idea to queue a dummy item in the IOCP in order to allow a
> blocked I/O thread to process the protocol.
I'm not sure I follow you. The idea is this:
1) An I/O thread discovers a job.
2) The I/O thread queues that job.
3) If necessary, the thread can go back to checking for I/O, but if
there's another thread to do that, then we continue.
4) We check if there's a compute job, and if so, we do it.
> You have not totally convinced me that dumping all threads into a single bag
> is a good idea. I think that separating I/O from computation is beneficial
> in my specific scenario. If all computation threads are busy, at least the
> I/O threads can still respond and react to protocol level errors, whatever.
The same thing occurs in my design. If you need 8 I/O threads, and
there are fewer than 8, then you make the thread an I/O thread.
> You seem to want me to let any thread randomly choose work. Are you
> suggesting that I follow the Clik++ style random scheduling "mess"? What am
> I missing David?
I'm just saying it's utterly pointless to have a thread hand job A off
to another thread so it can do job B when it was already in the middle
of doing job A. Just let that thread keep doing what it's doing and
let some other thread do the different job.
Forced context switches are a huge performance sapper in multi-
threaded code. If a thread can continue to make forward progress on
the job it's already doing, you should try to avoid handing that job
off to another thread.
DS