
performance of wait/notify concept of condition variable


kus

Apr 26, 2017, 7:56:57 AM
Hi,
I am implementing a client-server application, as mentioned in my
previous posts. The issue I am seeing: when I notify a waiting thread,
that waiting thread sometimes takes around 10 seconds to respond to the
notifying thread. I see this situation when a very large number of
connections are opened simultaneously. Is this normal behaviour, or am
I implementing something wrong here?
Thanks,
Kushal

kushal bhattacharya

Apr 26, 2017, 9:25:27 AM
To be more clear, I just want to point out that from the notifying thread I am using notify_all(), so according to this function's semantics it should notify all the waiting threads. Suppose I am using a large number of connections, say 500 or more; then I am creating 500 notifying threads and 500 waiting threads. So if I call notify_all() from any of the notifying threads, then all 500 waiting threads will be notified, but each of them still has a condition to fulfil according to condition.wait(). Thinking about this, am I compromising some performance here, and is it one of the culprits behind this delay?
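
To make the setup just described concrete, here is a minimal sketch (not the OP's actual code; the names message, g_mtx, g_cv and g_list are made up for illustration) of the wait/notify pattern in question:
_________________________
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

struct message { std::string payload; }; // stands in for the parsed packet

std::mutex g_mtx;
std::condition_variable g_cv;
std::deque<message> g_list;

void notifying_thread(message m)
{
    {
        std::lock_guard<std::mutex> lk(g_mtx);
        g_list.push_back(std::move(m));
    }
    // notify_all() wakes *every* waiter; each must re-acquire the mutex
    // and re-check the predicate, and most go straight back to sleep.
    // For a single queued item, notify_one() wakes one waiter instead.
    g_cv.notify_all();
}

void waiting_thread()
{
    std::unique_lock<std::mutex> lk(g_mtx);
    g_cv.wait(lk, []{ return !g_list.empty(); }); // the "condition to fulfil"
    message m = std::move(g_list.front());
    g_list.pop_front();
    lk.unlock();
    // ... transmit the ack outside the lock ...
}
_________________________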

Paavo Helde

Apr 26, 2017, 9:37:42 AM
Nobody remembers anything about your previous posts. But in any case 10
seconds seems extremely slow for just a simple responsiveness check
(heck, even 10 ms would be slow). This may for example indicate that you
have overloaded the machine and it is swapping memory to/from disk
all the time. Check your memory usage.

Alternatively there may be errors in your logic and 10 s is related to a
timeout somewhere.

BTW, what is a "very large" number of connections? Is it the same as the
number of running worker threads? If yes, then you are doing it wrong;
in general there is no point in having more worker threads than the
number of CPU cores.

hth
Paavo

kushal bhattacharya

Apr 26, 2017, 9:43:54 AM
On Wednesday, April 26, 2017 at 5:26:57 PM UTC+5:30, kushal bhattacharya wrote:
Yes, suppose I have 500 connections; then 500 threads are created here. I am having some confusion regarding notify_all() and notify_one(): in this scenario, will there be any performance gain if I use notify_one() and notify only one waiting thread?

kushal bhattacharya

Apr 26, 2017, 9:44:43 AM
On Wednesday, April 26, 2017 at 5:26:57 PM UTC+5:30, kushal bhattacharya wrote:
This architecture is done for scalability reasons, so I am just increasing the number of connections now and accordingly distributing the work among those threads.

Marcel Mueller

Apr 26, 2017, 12:48:38 PM
I think using 500 threads is the first problem. Since your computer
probably is far from having 500 CPU cores, this is not efficient.

Waiting with 500 threads for a single condvar is even more inefficient.
Each time you call notify_all, 500 threads have to wake up, acquire a
mutex, check a condition and, probably most of the time, release the
mutex and return to the waiting state. Or are you writing a chat server?
Although this wastes several thousand clock cycles, it does not yet
explain a 10 second delay.

Most probably you do some blocking I/O operation while you hold the
mutex of the condvar. This could explain a delay in the order of 20 ms
per thread. The answer is simple: don't do that. It reduces
multi-threading ad absurdum.

Secondly, you should not handle that many connections with individual
threads. It is unlikely that all connections have to do real work
/concurrently/; that would most likely exhaust your hardware resources.
Scalable server applications usually use a pool of worker threads that
services a queue of incoming requests. This limits the parallelism to a
reasonable level.
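
A minimal sketch of such a pool (assuming C++11; the names are illustrative, not from the OP's code):
_________________________
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class worker_pool
{
    std::vector<std::thread> m_threads;
    std::queue<std::function<void()>> m_requests;
    std::mutex m_mtx;
    std::condition_variable m_cv;
    bool m_stop = false;

public:
    explicit worker_pool(unsigned n = std::thread::hardware_concurrency())
    {
        for (unsigned i = 0; i < n; ++i)
            m_threads.emplace_back([this]
            {
                for (;;)
                {
                    std::function<void()> req;
                    {
                        std::unique_lock<std::mutex> lk(m_mtx);
                        m_cv.wait(lk, [this]{ return m_stop || !m_requests.empty(); });
                        if (m_stop && m_requests.empty()) return;
                        req = std::move(m_requests.front());
                        m_requests.pop();
                    }
                    req(); // service the request outside the lock
                }
            });
    }

    void submit(std::function<void()> req)
    {
        { std::lock_guard<std::mutex> lk(m_mtx); m_requests.push(std::move(req)); }
        m_cv.notify_one(); // one item queued, wake one worker
    }

    ~worker_pool()
    {
        { std::lock_guard<std::mutex> lk(m_mtx); m_stop = true; }
        m_cv.notify_all();
        for (auto& t : m_threads) t.join();
    }
};
_________________________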


Marcel

Paavo Helde

Apr 26, 2017, 2:19:40 PM
Creating 500 threads is killing all your scalability if your hardware
does not support 500 CPU cores.

See examples in boost.asio about how to create a reasonable thread pool
for servicing a large number of connections.
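
For reference, the canonical shape of that pool with the boost.asio of the day (the pre-1.66 io_service API) is a handful of threads all running the same io_service; the actual handler code is omitted here:
_________________________
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service io;
    // keep run() from returning while no async work is pending yet
    boost::asio::io_service::work work(io);

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        pool.emplace_back([&io]{ io.run(); });

    // ... post async accepts/reads/writes on sockets bound to io ...

    io.stop();
    for (auto& t : pool) t.join();
}
_________________________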

Bonita Montero

Apr 26, 2017, 2:51:55 PM
> Creating 500 threads is killing all your scalability if your hardware
> does not support 500 CPU cores.
>
> See examples in boost.asio about how to create a reasonable thread pool
> for servicing a large number of connections.

In cases where boost.asio is suitable you would otherwise do blocking
I/O in the threads. And having 500 threads idle-waiting for completion
of blocking I/O hasn't been a big deal for hardware for years.

Jerry Stuckle

Apr 26, 2017, 2:57:40 PM
Your initial premise is incorrect. According to you, machines with a
single core should only be able to run MS-DOS, because pretty much every
OS since then has had multiple threads. In fact, on my 2 core, 4 thread
machine, I currently have hundreds of threads running.

Oftentimes I create multiple threads because I am waiting on multiple
events. That doesn't mean all are running code concurrently; in
actuality, multiple threads in a wait state are very common in a
multi-threaded program.

In the OP's case, I bet he never has all 500 threads ready to run
concurrently.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstu...@attglobal.net
==================

Paavo Helde

Apr 26, 2017, 3:30:43 PM
500 idle threads might not be a big deal (at least not in 64-bit
programs), but they might become a problem when they start doing
something at the same time. If I understand correctly, that's what the
OP is doing: he's calling notify_all() and all threads wake up and start
doing something, probably much more than they should. The point is that
there is no reason to create many more threads than the number of cores;
they just tend to eat up all the memory and start to fight with each
other over resources. Been there, done that.

Also I somehow have the feeling that the OP's hardware predates the "for
years" you mention.

Cheers
Paavo

Chris M. Thomasson

Apr 26, 2017, 4:44:42 PM
On 4/26/2017 9:48 AM, Marcel Mueller wrote:
> On 26.04.17 15.24, kushal bhattacharya wrote:
>> On Wednesday, April 26, 2017 at 5:26:57 PM UTC+5:30, kushal
>> bhattacharya wrote:
[...]
> Most probably you do some blocking I/O operation while you hold the
> mutex of the condvar.

I basically have to concur here, for 10 s is a wild number to me wrt
using a condvar for signalling.


> This could explain a delay in the order of 20 ms
> per thread. The answer is simple: don't do that. It reduces
> multi-threading ad absurdum.

Agreed.

Chris M. Thomasson

Apr 26, 2017, 4:50:43 PM
Fwiw, my advice is to create a couple of threads per CPU and use async
IO wrt the target platform. For instance, on Windows use Input Output
Completion Ports (IOCP); on a POSIX system use the AIO API.

Creating a thread per connection is not going to scale at all.

Ian Collins

Apr 26, 2017, 10:26:40 PM
On 04/27/17 01:24 AM, kushal bhattacharya wrote:

> To be more clear, I just want to point out that from the notifying
> thread I am using notify_all(), so according to this function's
> semantics it should notify all the waiting threads. Suppose I am
> using a large number of connections, say 500 or more; then I am
> creating 500 notifying threads and 500 waiting threads. So if I call
> notify_all() from any of the notifying threads, then all 500 waiting
> threads will be notified, but each of them still has a condition to
> fulfil according to condition.wait(). Thinking about this, am I
> compromising some performance here, and is it one of the culprits
> behind this delay?


Google "Thundering herd problem" :)

--
Ian

woodb...@gmail.com

Apr 26, 2017, 11:30:10 PM
I suggest Duckduckgo as an alternative to Google:
https://duckduckgo.com


Brian
Ebenezer Enterprises - In G-d we trust.
http://webEbenezer.net

Chris M. Thomasson

Apr 27, 2017, 1:22:15 AM
Oh yeah, that's bad. I have seen the problem when some code was using
broadcast when it only needed a single signal, but I cannot seem to
remember seeing a 10 s wait time to signal a thread on a condition
variable before. The OP's critical section must be overly complex and/or
overloaded. Also, it's not good to send 500 threads through a single
funnel. ;^)

kushal bhattacharya

Apr 27, 2017, 1:26:44 AM
Regarding blocking operations: when I log something to a file from different threads, is that a blocking operation?

kushal bhattacharya

Apr 27, 2017, 1:29:38 AM
500 threads run in parallel, I have checked that happening. The thing I am concerned about is the architecture I am following right now.

kushal bhattacharya

Apr 27, 2017, 1:32:06 AM
In the critical section I am just fetching the value out of the list; that's the only thing I am doing right now. So I don't think that would take much time in processing.

kushal bhattacharya

Apr 27, 2017, 1:34:42 AM
On Thursday, April 27, 2017 at 10:52:15 AM UTC+5:30, Chris M. Thomasson wrote:
Could you please explain what you mean by a single funnel?

Paavo Helde

Apr 27, 2017, 1:38:17 AM
On 27.04.2017 8:31, kushal bhattacharya wrote:
> In the critical section I am just fetching the value out of the list; that's the only thing I am doing right now. So I don't think that would take much time in processing.

When dealing with bottlenecks, you don't think. You measure.

Ian Collins

Apr 27, 2017, 1:52:13 AM
I first struck it when I was writing a simulator for a power system
that could have up to 128 rectifier modules on a serial bus. Naturally
I just gave each rectifier its own thread and used a mutex/condvar for
the "bus". Worked well with a couple of modules, flat-lined my shiny
new Pentium era build machine with 128...

It would probably work OK on current 32 core/64 thread machines :)

--
Ian

Chris M. Thomasson

Apr 27, 2017, 2:18:12 AM
Using a single mutex is a huge bottleneck with 500 threads pounding away
at it. Imagine a tiny funnel that lets one drop out at a time, and each
drop is a thread. Now pour 500 drops into it and wait. It's like an
hourglass where each grain of sand is a thread, and the bottleneck
allows one grain through at a time.

Now imagine pouring 500 drops into a colander. The bottleneck is less
extreme... ;^)

Chris M. Thomasson

Apr 27, 2017, 2:22:29 AM
Even the model of creating a couple of worker threads per CPU has issues
if all of them are blasting a single mutex.

kushal bhattacharya

Apr 27, 2017, 5:15:12 AM
So how would I distribute the work between the notifying threads and the waiting threads? These threads should work independently, due to the large number of connections involved here.

Ian Collins

Apr 27, 2017, 5:21:10 AM
You don't use one thread per connection, that is a really bad model if
you want your application to scale.

--
Ian

Jerry Stuckle

Apr 27, 2017, 9:27:06 AM
> 500 threads run in parallel, I have checked that happening. The thing I am concerned about is the architecture I am following right now.
>

If they are run in parallel, I would agree with others. Your design
could be seriously improved.

Bonita Montero

Apr 27, 2017, 10:13:56 AM
> ... The point is that there is no reason to create many more threads
> than the number of cores; they just tend to eat up all the memory
> and start to fight with each other over resources. ...

If the threads are I/O-bound, that's not a problem.
And doing synchronous I/O and keeping the state in the thread's
registers and stack is more convenient than doing asynchronous
I/O.

Bonita Montero

Apr 27, 2017, 10:17:47 AM
> Even the model of creating a couple of worker threads per
> cpu has issues if all of them are blasting a single mutex.

If you have a producer-consumer pattern, the mutex is held only for very
short intervals, and the time spent preparing the item to be enqueued or
processing the item dequeued is magnitudes longer. So the likelihood of a
collision shouldn't be high even for 500 threads.

Bonita Montero

Apr 27, 2017, 10:19:16 AM
> You don't use one thread per connection, that is a really
> bad model if you want your application to scale.
Synchronous I/O scales not much worse, but it uses a lot of memory
for the threads' stacks.

Marcel Mueller

Apr 27, 2017, 2:38:13 PM
On 27.04.17 07.26, kushal bhattacharya wrote:
> Regarding blocking operations: when I log something to a file from different threads, is that a blocking operation?

Yes.

No file I/O, no network I/O while holding a mutex that is shared between
hundreds of threads.

For debugging purposes this is OK but then you get the performance impact.


Marcel

kushal bhattacharya

Apr 27, 2017, 3:02:25 PM
But the operation I am doing here must be an asynchronous one, since the client is firing at very small intervals and that is received by these threads for further processing.

Chris M. Thomasson

Apr 27, 2017, 9:21:13 PM
Until all 500 threads get sustained requests in a high-volume period of
time in the server. All those threads waking up and blasting the shi%
out of a single mutex due to unexpected user load will suffer many
problems. Imagine a server that is saturated with active requests,
periods where tens of thousands of requests are constantly flowing in at
rapid rates.

Chris M. Thomasson

Apr 27, 2017, 11:00:23 PM
What exactly are you using the single mutex for? You seem to have
mentioned logging to a file? Imvho, logging logic should be "separated"
from the actual IO threads. I need answers to these questions before I
can give proper advice...

Fwiw, IO threads should pass lengthy requests to other non-IO threads.
If an IO thread spends time processing a lengthy request, well that time
is sort of "wasted" wrt the IO thread taking pressure off of the socket,
or whatever, as fast as possible.

Back in the day, I would queue log requests to a separate thread or
process in certain circumstances, whose sole purpose was to log.

When I used locks for this, I remember setting things up where each
thread had its own personal lock. To queue log requests a thread would
lock its own lock and queue the request locally. Then a dedicated log
thread would periodically and/or episodically wake up and iterate
through all the threads, taking their locks, flushing all items in the
queue, and unlocking, for each one.

It then processed all of the requests, went back to a sleep mode, and
woke up again. Iirc, it would do a try-lock and skip if a thread's mutex
could not be acquired. It would track the number of times it failed, then
finally go for a full lock on the mutex. This scheme actually scaled
quite nicely, for mutexes... ;^)


kushal bhattacharya

Apr 28, 2017, 12:46:33 AM
Ok, let me tell you the whole scenario then. There are 2 types of threads which I am using, as I said earlier: the notifying thread first receives a packet from the client, then it parses the packet, builds a message object, pushes it into a list and then notifies the waiting thread. The waiting thread in turn, when awakened, checks whether the list is non-empty and accordingly transmits the corresponding ack packet to the client.
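
Putting the advice from upthread against that scenario, a sketch of the waiting-thread side might look like this (the message type and send_ack() are stand-ins for the OP's actual code): drain the list while holding the lock, then do the blocking ack transmission with the mutex released.
_________________________
#include <condition_variable>
#include <deque>
#include <mutex>

struct message { /* parsed packet fields */ };
void send_ack(const message&) { /* write the ack packet to the socket */ }

std::mutex g_mtx;
std::condition_variable g_cv;
std::deque<message> g_list;

void waiting_thread()
{
    for (;;)
    {
        std::deque<message> batch;
        {
            std::unique_lock<std::mutex> lk(g_mtx);
            g_cv.wait(lk, []{ return !g_list.empty(); });
            batch.swap(g_list); // take everything queued in one shot
        }                       // mutex released here
        for (message& m : batch)
            send_ack(m);        // blocking I/O outside the lock
    }
}
_________________________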

kushal bhattacharya

Apr 28, 2017, 12:49:07 AM
The method you mentioned - could you please give me some sample code? Actually, I can't really get the actual workflow here :)

Ian Collins

Apr 28, 2017, 3:23:14 AM
Please don't quote me without attributions.

--
Ian

kushal bhattacharya

Apr 28, 2017, 3:30:06 AM
Hi,
Sorry if I am being offensive, but why is thread-per-connection considered to be a bad model?

Ian Collins

Apr 28, 2017, 4:31:05 AM
On 04/28/17 07:29 PM, kushal bhattacharya wrote:
> On Friday, April 28, 2017 at 12:53:14 PM UTC+5:30, Ian Collins wrote:
>> On 04/28/17 02:19 AM, Bonita Montero wrote:
>>>> You don't use one thread per connection, that is a really
>>>> bad model if you want your application to scale.
>>> Synchronous I/O scales not much worse, but it uses a lot of memory
>>> for the threads' stacks.
>>>
>> Please don't quote me without attributions.
>
> Hi,
> Sorry if I am being offensive, but why is thread-per-connection considered to be a bad model?
>

You aren't being offensive!

It is a bad model partly due to the issues you are seeing - it simply
does not scale. Have a look at the design of something like the Apache
web server if you want to see how a large-scale application is designed.

The biggest problem with too many threads is the overhead involved in
context switching. Then there is the megabyte or two of stack for each
thread; these soon add up.

--
Ian

Bonita Montero

Apr 28, 2017, 4:54:48 AM
>> If you have a producer-consumer pattern, the mutex is held only for very
>> short intervals, and the time spent preparing the item to be enqueued or
>> processing the item dequeued is magnitudes longer. So the likelihood of a
>> collision shouldn't be high even for 500 threads.

> Until all 500 threads get sustained requests in a high-volume period of
> time in the server. ...

Such a large number of threads is usually I/O-bound, so this is not a
real problem then.

Paavo Helde

Apr 28, 2017, 5:28:58 AM
On 28.04.2017 11:30, Ian Collins wrote:
>
> The biggest problem with too many threads is the overhead involved in
> context switching. Then there is the megabyte or two of stack for each
> thread; these soon add up.

In 64-bit programs the stack is often 8 MB per thread. Actually this is
not much of an issue as this is just address space reservation, no
actual memory is involved, and 64-bit programs have lots of room in the
address space.

For 32-bit programs things are worse: if you have 500 threads and a
2 MB stack each, this makes 1 GB. On some systems this is already
half of the usable address space.

Cheers
Paavo

Chris M. Thomasson

Apr 28, 2017, 7:00:09 PM
Have you ever compared the difference between a thread per-connection
model and an async model in a highly loaded server? Take responses to
user-generated requests completed per second as a heuristic for the
triggering of load-management logic.

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365198(v=vs.85).aspx

Why use IOCP on Windows? Well, we can handle tens of thousands of
connections. These types of scenarios wrt the extreme number of threads
simply do not scale. 500 threads is just asking for trouble, and the OP
has a performance problem: either doing work with bad implements like
IO while holding a lock, and/or 500 active threads blasting a single mutex.

Chris M. Thomasson

Apr 28, 2017, 7:38:03 PM
On 4/28/2017 1:54 AM, Bonita Montero wrote:
Btw, I agree with Ian, please try to make proper attributions.

Chris M. Thomasson

Apr 28, 2017, 9:00:21 PM
On 4/28/2017 1:54 AM, Bonita Montero wrote:
Read all of:

http://tangentsoft.net/wskfaq

Old school, back in my winsock days.

Chris M. Thomasson

Apr 28, 2017, 9:32:11 PM
Fwiw, here is some very crude pseudo-code that shows how a queue
per-thread can distribute the load. Like using a colander rather than a
single funnel. This has a single log thread, but it assumes that
logging is not all that "frequent". The load is distributed by the fact
that each thread has its own mutex. Here is the crude pseudo-code:

Also, this does not have any lifetime management (ref counting etc...)
wrt keeping request nodes alive in it:
_________________________
struct per_io_worker
{
    // our local intrusive linked list node of workers
    per_io_worker_list_node m_node;

    // our local log queue
    log_queue m_logq;

    // our work loop... Infinity aside.
    void work()
    {
        for (;;)
        {
            // wait for our io
            raw_request* r = consume_io();

            // queue our log request locally
            m_logq.push(r);

            // can I do something else?
            //[... fill in the blank here ...];
        }
    }
};

// The logger!
struct log_io_worker
{
    // a reference to a list of workers
    per_io_worker_list& m_wlist;

    // our personal log list
    raw_request_list m_list;

    // our work loop... Infinity aside.
    void work()
    {
        // wait/sleep for signal

        // gain requests in read access mode
        m_wlist.read_lock();
        for each worker in m_wlist
        {
            m_list.push_items(worker.m_logq.dequeue());
        }
        m_wlist.read_unlock();

        // process log requests locally! :^)
        for each raw_request in m_list
        {
            process_log(raw_request);
        }

        // dump it!
        m_list.clear(); // empty it all out
    }
};
_________________________


This example is crude, and does not show how to make it adaptable wrt
using try-lock. It's high level and shows how per-thread queues can be
used by a single log thread. Wrt the wait/sleep part, well, we can
create a semaphore with a timeout, or build in some fancy condvar logic
for this.
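
For what it's worth, the try-lock adaptation mentioned above might look like this, in the same crude pseudo-code (retry_limit and m_failed are made-up names):
_________________________
for each worker in m_wlist
{
    if (worker.m_lock.try_lock())
    {
        m_list.push_items(worker.m_logq.dequeue());
        worker.m_lock.unlock();
        worker.m_failed = 0;
    }
    else if (++worker.m_failed > retry_limit)
    {
        // skipped too many times: take the full blocking lock
        worker.m_lock.lock();
        m_list.push_items(worker.m_logq.dequeue());
        worker.m_lock.unlock();
        worker.m_failed = 0;
    }
    // otherwise skip this worker and come back on the next pass
}
_________________________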

Can you grok this setup?

Sorry if I made a typo in the crude pseudo-code.

;^o


Chris M. Thomasson

Apr 28, 2017, 9:38:46 PM
Notice the read lock here on the shared reference to the list of worker
threads. Well, if your app is rapidly creating and destroying many
threads, this can become a bottleneck as well. Fwiw, a thread
per-connection model does this wrt periods of load in which many clients
are connecting and disconnecting with minimal work in between. If each
connect requires a new thread, well, shi% happens man!

This read lock will scale if the action of adding and removing threads
from the list is very rare. This should be the norm.

Bonita Montero

Apr 29, 2017, 7:24:00 AM
Asynchronous I/O saves memory for the other threads rather than
CPU load.
A voluntary context switch on a current Windows on a current CPU
is only 1000-2000 clock cycles. So when a thread blocks on I/O and
the core switches to another thread whose I/O has terminated, that's
not much CPU load. So AIO saves memory for the threads'
stacks rather than CPU load.
AIO is theoretically beneficial only when there's not much
processing between initiating and terminating AIO requests.
Otherwise the queues would halt. So AIO is normally used in rare
cases like database-writer threads.

Chris M. Thomasson

Apr 29, 2017, 4:36:35 PM
On 4/29/2017 4:23 AM, Bonita Montero wrote:
> Asynchronous I/O saves memory for the other threads rather than
> CPU load.

It also saves CPU load. It also reduces pressure on mutexes wrt hundreds
of threads pounding them.

> A voluntary context switch on a current Windows on a current CPU
> is only 1000-2000 clock cycles. So when a thread blocks on I/O and
> the core switches to another thread whose I/O has terminated, that's
> not much CPU load. So AIO saves memory for the threads'
> stacks rather than CPU load.
> AIO is theoretically beneficial only when there's not much
> processing between initiating and terminating AIO requests.
> Otherwise the queues would halt. So AIO is normally used in rare
> cases like database-writer threads.

This does not make sense to me. Context switching hundreds of active
threads, due to load, is absolutely terrible. AIO and IOCP are about
scaling. Avoiding the funnel is about scaling. Creating a server is
about scaling.

Also, again can you _please_ add proper _attributions_ Bonita.

Chris M. Thomasson

Apr 29, 2017, 5:06:30 PM
> m_list.push_items(worker.m_logq.dequeue());
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There is an ambiguity here: the dequeue function implies that a single
object is removed. Well, in this scheme it would totally work, but I
really wanted a flush function that removed all of the items from the
queue in a single atomic operation. Basically, just like this:

https://groups.google.com/d/topic/comp.lang.c++/yt27gw0cbyo/discussion
(search the source code for the text "flush")... ;^)

Therefore, the line above the carets should read as:

m_list.push_items(worker.m_logq.FLUSH());

capital letter for emphasis in the pseudo-code. ;^)
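
One way (a sketch, not the code from the linked thread) to get those single-shot FLUSH() semantics is to swap the whole internal list out while holding the lock; m_lock and m_items are hypothetical members of log_queue:
_________________________
raw_request_list log_queue::flush()
{
    raw_request_list grabbed;
    m_lock.lock();
    grabbed.swap(m_items); // grab every queued item in one atomic step
    m_lock.unlock();
    return grabbed;        // caller processes the items outside the lock
}
_________________________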

Chris Vine

Apr 29, 2017, 8:14:04 PM
On Sat, 29 Apr 2017 13:36:24 -0700
"Chris M. Thomasson" <inv...@invalid.invalid> wrote:
> On 4/29/2017 4:23 AM, Bonita Montero wrote:
> > Asynchronous I/O saves memory for the other threads rather than
> > CPU load.
> 
> It also saves CPU load. It also reduces pressure on mutexes wrt
> hundreds of threads pounding them.
> 
> > A voluntary context switch on a current Windows on a current CPU
> > is only 1000-2000 clock cycles. So when a thread blocks on I/O and
> > the core switches to another thread whose I/O has terminated, that's
> > not much CPU load. So AIO saves memory for the threads'
> > stacks rather than CPU load.
> > AIO is theoretically beneficial only when there's not much
> > processing between initiating and terminating AIO requests.
> > Otherwise the queues would halt. So AIO is normally used in rare
> > cases like database-writer threads.
> 
> This does not make sense to me. Context switching hundreds of active
> threads, due to load, is absolutely terrible. AIO and IOCP are about
> scaling. Avoiding the funnel is about scaling. Creating a server is
> about scaling.
>
> Also, again can you _please_ add proper _attributions_ Bonita.

This discussion is in danger of being flogged to death, repeatedly.

What Bonita said is obviously true. Asynchronous i/o works best when
asynchronous events are waiting for things to happen, such as a file
descriptor becoming ready. If the work load involves many high load
tasks which do not wait, it is useless.

What you said is also true, sort of, save that if the work load involves
500 high-load (non-waiting) tasks, and you only have 4
cores, no amount of native threads is going to help you. The more you
have, the less good the performance will be.

Requiescat in pacem.

Chris M. Thomasson

Apr 29, 2017, 10:48:16 PM
Agreed.

Fwiw, I was trying to focus on "periods of load" in which a
thread-per-connection model fails. Sure, your server might be working
fine for a while, then a batch of work comes in all at once and things
start to get slow. In the contrived scenario, these batches start to be
more and more frequent due to more requests and/or new connections, more
users. 500 threads can spike to 2000+ threads. Just brainstorming here.

kushal bhattacharya

Apr 30, 2017, 2:18:38 AM
I couldn't really get how 500 threads can suddenly spike to 2000 threads.

Bonita Montero

Apr 30, 2017, 5:43:25 AM
>> Asynchronous I/O saves memory for the other threads rather than
>> CPU load.

> It also saves CPU load. It also reduces pressure on mutexes wrt
> hundreds of threads pounding them.

The mutexes are usually part of a cv, and those mutexes are held
in a producer-consumer pattern for a very short interval. The rest
of the time the thread spends preparing or processing an item,
which is an interval magnitudes longer. So the likelihood of a
collision isn't high even with 500 threads.

> This does not make sense to me. Context switching hundreds of
> active threads, due to load, is absolutely terrible.

No, context-switching even some thousand times a second isn't
much load even on Windows, and much less on Linux, which is by
far more efficient.

> AIO and IOCP are about scaling.

They can be applied only in rare cases when there isn't much
processing besides enqueuing or dequeuing asynchronous requests.

Bonita Montero

Apr 30, 2017, 8:43:35 AM
I've written two tiny applications. Both create a 1 GB file which
is created unbuffered to simulate the behaviour in a DB server
where blocks are not in the cache. The file is read in parallel
in 4 kB blocks with either 500 threads or one thread using an
IOCP.

The multithreaded app can be found here:
https://pastebin.com/ByC33n65
I had to use asynchronous I/O within the threads because I can't
do SetFilePointer on the same file handle from multiple threads. So
with overlapped I/O and immediately blocking after that, I can
give the file position to be read in the OVERLAPPED structure.

The singlethreaded app with IOCP can be found here:
https://pastebin.com/WrmBjVMC

On my 3.6 GHz Ryzen 1800X the multithreaded code runs with
about 7% CPU load. And the singlethreaded code is about 6%.
So there's not a big difference.

Bonita Montero

Apr 30, 2017, 8:57:45 AM
And the memory consumption:
The multithreaded app has a private commit of 19 MB and a working set
of 13.7 MB. The single-threaded app has a private commit of 740 kB and
a working set of 2.7 MB.
So I was right to assume that code that does parallel I/O via IOCPs
is rather memory-saving than CPU-saving.

Scott Lurndal

Apr 30, 2017, 9:37:48 AM
Bonita Montero <Bonita....@gmail.com> writes:
>>> Asynchronous I/O saves memory for the other threads rather than
>>> CPU load.
>
>> It also saves CPU load. It also reduces pressure on mutexes wrt
>> hundreds of threads pounding them.
>
>The mutexes are usually part of a cv, and those mutexes are held
>in a producer-consumer pattern for a very short interval. The rest
>of the time the thread spends preparing or processing an item,
>which is an interval magnitudes longer. So the likelihood of a
>collision isn't high even with 500 threads.

Please learn to quote properly with attribution. Your assumption
that the mutex _only_ protects the condition variable is faulty.

It likely also protects other shared data, including e.g. listheads.

Bonita Montero

Apr 30, 2017, 9:41:39 AM
>> The mutexes are usually part of a cv, and those mutexes are held
>> in a producer-consumer pattern for a very short interval. The rest
>> of the time the thread spends preparing or processing an item,
>> which is an interval magnitudes longer. So the likelihood of a
>> collision isn't high even with 500 threads.

> Please learn to quote properly with attribution. Your assumption
> that the mutex _only_ protects the condition variable is faulty.

I didn't assume this. I only said that the mutex is part of a cv.
That doesn't exclude the rest.

Bonita Montero

Apr 30, 2017, 9:56:17 AM
> The multithreaded app can be found here:
> https://pastebin.com/ByC33n65

There was a little mistake; the file created was larger
than necessary, and there was a typo: one | instead of ||.
So here's the corrected code:
https://pastebin.com/wqiLFsmJ

> The singlethreaded app with IOCP can be found here:
> https://pastebin.com/WrmBjVMC

Here there was the same "bug" with the too-large file:
https://pastebin.com/yHvghReX

Chris M. Thomasson

Apr 30, 2017, 8:16:00 PM
Bugs aside for a moment, I am wondering why you are creating an event in
the IOCP code? This is not required at all.

Chris M. Thomasson

Apr 30, 2017, 9:01:51 PM
Have you ever pounded a producer/consumer mutex protected cv based queue
with 500 active threads?

Chris M. Thomasson

Apr 30, 2017, 9:07:56 PM
Well, in a thread per-connection model, thinking in the general sense wrt
designing a server, say we start with 50 users. Then all of a sudden
there are 250 because a plurality of your friends told others about
the good thing. Then you check the active server load and notice that
there are now 500 active users! Next month, there may be 1500+ active
users. Imho, this is a _very_ important aspect to think about when
designing a server.

Chris M. Thomasson

Apr 30, 2017, 9:25:31 PM
You need to associate the file with the IOCP, then issue overlapped
requests. You have a phantom overlapped I/O before the attachment to the
IOCP occurs, and to make things worse, you have a totally unneeded event
object hooked up to it.

Bonita Montero

May 1, 2017, 1:56:54 AM
> Bugs aside for a moment, I am wondering why you are creating
> an event in the IOCP code? This is not required at all.

This is only where I'm creating the file. An IOCP is not
necessary at this place, and waiting for an event is sufficient
because there is only one I/O event to wait for.

Bonita Montero

May 1, 2017, 2:03:46 AM
> You have a phantom overlapped I/O before the attachment to
> the IOCP occurs, and to make things worse, you have a totally
> unneeded event object hooked up to it.

Should I use IOCPs for a _single_ I/O request to enlarge the file?

Scott Lurndal

May 1, 2017, 8:23:46 AM
But it is not. It is a separate entity used to protect the
predicate _for_ the condition variable.

Bonita Montero

May 1, 2017, 8:32:23 AM
> But it is not. It is a separate entity used to protect the
> predicate _for_ the condition variable.

Your pettifogging is silly.

Chris M. Thomasson

May 1, 2017, 10:49:27 AM
Afaict, you are not waiting on that event to complete before you bind
the file to the IOCP. So, you do not know if the write was even finished
before you start issuing reads.

It would be better to bind with the IOCP; issue the write; wait for a
completion, then issue the reads.

Chris M. Thomasson

May 1, 2017, 10:55:10 AM
Yes. Just bind to IOCP; write; wait on GQCS, then issue the reads. By
the way, you don't even wait on that unnecessary event for the first
write to complete before you bind it. You also fail to do this in the
threaded version. Bad mojo. It basically creates a race condition.

Also, can you please add proper attribution to your posts?

Bonita Montero

May 1, 2017, 11:04:43 AM
> Yes. Just bind to IOCP; write; wait on GQCS, then issue the reads.

It's useless to use an IOCP when enlarging the file. Waiting for an
event is sufficient and easier when there is only one I/O request.

> By the way, you don't even wait on that unnecessary event for
> the first write to complete before you bind it.

Ok, that's a mistake.

Chris M. Thomasson

May 1, 2017, 11:13:12 AM
On 5/1/2017 8:04 AM, Bonita Montero wrote:
>> Yes. Just bind to IOCP; write; wait on GQCS, then issue the reads.
>
> It's useless to use an IOCP when enlarging the file. Waiting for an
> event is sufficient and easier when there is only one I/O request.

Imho, it's more complicated to use an event. You also forget to destroy
that event.

Your way:

1: create event
2: issue write
3: wait on event
4: destroy event
5: bind iocp
6: issue reads

the other way:

1: bind iocp
2: issue write
3: wait on gqcs
4: issue reads

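
A minimal Win32 sketch of that second sequence (error handling trimmed; the handle is assumed to be opened with FILE_FLAG_OVERLAPPED, and the parameters stand in for the values in the test app):
_________________________
#include <windows.h>
#include <cstdlib>

// Returns EXIT_SUCCESS once the extending write has completed.
int extend_then_read_ready(HANDLE hFile, void const* block,
                           DWORD blockSize, DWORD writeOffset)
{
    // 1: bind iocp
    HANDLE hIocp = CreateIoCompletionPort(hFile, NULL, 0, 0);
    if (!hIocp) return EXIT_FAILURE;

    // 2: issue write
    OVERLAPPED ol = {};
    ol.Offset = writeOffset; // write at the new end of the file
    if (!WriteFile(hFile, block, blockSize, NULL, &ol)
        && GetLastError() != ERROR_IO_PENDING)
        return EXIT_FAILURE;

    // 3: wait on gqcs
    DWORD cb; ULONG_PTR key; OVERLAPPED* pol;
    if (!GetQueuedCompletionStatus(hIocp, &cb, &key, &pol, INFINITE))
        return EXIT_FAILURE;

    return EXIT_SUCCESS; // 4: now safe to issue the overlapped reads
}
_________________________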

>> By the way, you don't even wait on that unnecessary event for
>> the first write to complete before you bind it.
>
> Ok, that's a mistake.

Yes. It creates a race-condition.

Bonita Montero

May 1, 2017, 11:29:43 AM
> Imho, it's more complicated to use an event.
> You also forget to destroy that event.

I didn't forget to destroy anything because this is only
a simple test app.

> Your way:
>
> 1: create event
> 2: issue write
> 3: wait on event
> 4: destroy event
> 5: bind iocp
> 6: issue reads
>
> the other way:
>
> 1: bind iocp
> 2: issue write
> 3: wait on gqcs
> 4: issue reads

5: get GQCS for reads

That's a matter of taste.

> Yes. It creates a race-condition.

Obviously the reads don't fail, so it seems that there is
implementation-defined behaviour: the write becomes visible to
any following reads without being physically complete on disk.

Ian Collins

May 1, 2017, 3:39:19 PM
Your lack of attributions is rude.

--
Ian

Chris M. Thomasson

May 1, 2017, 5:12:11 PM
Exactly correct. :^)

Chris M. Thomasson

May 1, 2017, 5:39:17 PM
It destroys the integrity of the chain.

Chris M. Thomasson

May 1, 2017, 6:37:24 PM
It's a race-condition. Don't worry too much about it, and thank you for
creating the code. Fwiw, I just had to correct a mistake I made wrt
giving the dimensions of a bitmap. I said 1920x1080; actually it is
960x540. Damn it to heck! Here is my correction:

https://groups.google.com/d/msg/sci.crypt/xytM7aFRfjQ/elUM9F94AAAJ

Just try to give some proper attributions in future posts. Heck, even a
CT would be nice if you quote me. :^)

Chris M. Thomasson

May 1, 2017, 9:35:18 PM
On 4/26/2017 10:51 PM, Ian Collins wrote:
> On 04/27/17 05:22 PM, Chris M. Thomasson wrote:
>> On 4/26/2017 7:26 PM, Ian Collins wrote:
>>> On 04/27/17 01:24 AM, kushal bhattacharya wrote:
>>>
>>>> To be more clear, I just want to point out that from the notifying
>>>> thread I am using notify_all(), so according to this function's
>>>> semantics it should notify all the waiting threads. Suppose I am
>>>> using a large number of connections, say 500 or more; then I am
>>>> creating 500 notifying threads and 500 waiting threads. So if I call
>>>> notify_all() from any of the notifying threads, then all 500 waiting
>>>> threads will be notified, but each of them still has a condition to
>>>> fulfil according to condition.wait(). Thinking about this, am I
>>>> compromising some performance here, and is it one of the culprits
>>>> behind this delay?
>>>
>>>
>>> Google "Thundering herd problem" :)
>>>
>>
>> Oh yeah, that's bad. I have seen the problem when some code was using
>> broadcast when it only needed a single signal, but I cannot seem to
>> remember seeing a 10 s wait time to signal a thread on a condition
>> variable before. The OP's critical section must be overly complex and/or
>> overloaded. Also, it's not good to send 500 threads through a single
>> funnel. ;^)
>
> I first struck it when I was writing a simulator for a power system that
> could have up to 128 rectifier modules on a serial bus. Naturally I
> just gave each rectifier its own thread and used a mutex/condvar for
> the "bus". Worked well with a couple of modules, flat-lined my shiny
> new Pentium era build machine with 128...
>
> It would probably work OK on current 32 core/64 thread machines :)

I would still feel sorry for that single mutex. ;^)

Bonita Montero

May 2, 2017, 8:09:24 AM
> It's a race-condition. ...

Theoretically. Practically, even if I repeatedly read the last block
in every iteration, the code never fails - although on my SSD it is
about three to four seconds until the file is successfully enlarged.
So it seems like I assumed: the enlargement becomes visible immediately
after WriteFile.

But there's theoretically another bug: the ReadFile requests are
allowed to terminate immediately without any asynchronous behaviour,
although I'm issuing asynchronous reads, i.e. ReadFile() returns
true. So I changed the single-threaded code:

BOOL fRfRet;

for( unsigned par = 0; par < PARALLEL_DEGREE; par += !fRfRet )
{
    aol[par].Offset     = uid( mt ) & (DWORD)-(LONG)BLOCK_SIZE;
    aol[par].OffsetHigh = 0;
    aol[par].hEvent     = NULL;
    if( !(fRfRet = ReadFile( hFile, abBlock, BLOCK_SIZE, NULL, &aol[par] ))
        && GetLastError() != ERROR_IO_PENDING )
        return EXIT_FAILURE;
}

and in this place:

pol->Offset     = uid( mt ) & (DWORD)-(LONG)BLOCK_SIZE;
pol->OffsetHigh = 0;
pol->hEvent     = NULL;
for( ; ; )
    if( ReadFile( hFile, abBlock, BLOCK_SIZE, NULL, pol ) )
        continue;
    else if( GetLastError() == ERROR_IO_PENDING )
        break;
    else
        return EXIT_FAILURE;

Scott Lurndal

May 2, 2017, 9:28:34 AM
Actually, it probably wouldn't work very well on current highly
threaded cores (e.g. Vulcan, with 128 threads). To obtain the
mutex, the requesting core must gain exclusive access to the
cache line containing the mutex (Linux, for example, will spin
in user mode for a few iterations before blocking in the kernel).

The contention for this one, single, cache line can drag a system
to its knees rather quickly on a highly contended lock. Some
x86 processors will attempt to resolve starvation situations by
asserting a global bus lock which is very bad for system performance.

Chris M. Thomasson

May 3, 2017, 8:30:13 PM
On 5/2/2017 5:09 AM, Bonita Montero wrote:
>> It's a race-condition. ...
>
> Theoretically. Practically, even if I repeatedly read the last block
> in every iteration, the code never fails - although on my SSD it is
> about three to four seconds until the file is successfully enlarged.
> So it seems like I assumed: the enlargement becomes visible immediately
> after WriteFile.
>
> But there's theoretically another bug: the ReadFile requests are
> allowed to terminate immediately without any asynchronous behaviour,
> although I'm issuing asynchronous reads, i.e. ReadFile() returns
> true. So I changed the single-threaded code:
[...]

The race-condition mucks up the waters, but once it's hooked up to IOCP,
even if ReadFile or WSARecv complete immediately, aka they returned TRUE
and 0 respectively, an IOCP event is guaranteed to be scheduled.

Read all of:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms741688(v=vs.85).aspx

The completion routine is scheduled. When hooked up to IOCP, this means
that a message is in the queue. GQCS will pick it up.

Now, in the context of creating Windows servers, we can use the fact
that a WSARecv completed immediately to issue a couple more
overlapped receives to allow for more efficient handling of fast
connections. A friend of mine called it "burst mode".

I suppose we can do the same for files. Fwiw, I have always used
TransmitPackets whenever I could wrt sending memory and/or files.

Also, again, can you please add proper attributions to your posts?

Thanks Bonita.

Bonita Montero

May 4, 2017, 3:00:37 PM
I was talking about the extension of the file with the single-threaded
version of my test app. MSDN states the following on asynchronous I/O
that extends a file:
"On Windows NT, any write operation to a file that extends its length
will be synchronous." (http://bit.ly/2qwJA7w)
So WriteFile returns FALSE and GetLastError() also returns
ERROR_IO_PENDING as expected, but the write operation is
nevertheless finished after WriteFile returns.

And I was right to assume that an asynchronous I/O operation may
proceed synchronously:
"Be careful when coding for asynchronous I/O because the system
reserves the right to make an operation synchronous if it needs to."
http://bit.ly/2qwJA7w
So deciding to handle two different cases upon the return value
of ReadFile is correct.

Bonita Montero

May 4, 2017, 3:04:12 PM
And see the source code of the first listing of http://bit.ly/2qwJA7w
This code does the same thing I wrote: it handles the I/O post-processing
depending on the return value of ReadFile!!!

Chris M. Thomasson

May 4, 2017, 3:09:18 PM
Will read it. Thank you.

Chris M. Thomasson

May 11, 2017, 7:03:18 PM
Perfect!

Chris M. Thomasson

May 13, 2017, 1:26:26 AM
Fwiw, even if ReadFile returns TRUE, you will get an IO completion. If
the HANDLE is hooked up to an IOCP, that means you will get a completion
on GQCS. This is perfectly normal. There is nothing wrong or odd here.
Wrt IOCP, your code does not need to do anything special when ReadFile
returns TRUE. It's a 1-to-1 relationship. Think of:

[0]:ReadFile goes ERROR_IO_PENDING
[1]:ReadFile goes ERROR_IO_PENDING
[2]:ReadFile returns TRUE
[3]:ReadFile goes ERROR_IO_PENDING

You will get 4 completions from GQCS. That's that.

Bonita Montero

May 18, 2017, 11:24:20 AM
> Fwiw, even if ReadFile returns TRUE, you will get an IO completion.

Where is this specified?

Chris M. Thomasson

May 18, 2017, 1:40:30 PM
On 5/18/2017 8:24 AM, Bonita Montero wrote:
>> Fwiw, even if ReadFile returns TRUE, you will get an IO completion.
>
> Where is this specified?

It's hard to find, but read here:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa365683(v=vs.85).aspx

Right here:
________________________
However, if I/O completion ports are being used with this asynchronous
handle, a completion packet will also be sent even though the I/O
operation completed immediately. In other words, if the application
frees resources after WriteFile returns TRUE with ERROR_SUCCESS in
addition to in the I/O completion port routine, it will have a
double-free error condition. In this example, the recommendation would
be to allow the completion port routine to be solely responsible for all
freeing operations for such resources.
________________________

See? If a HANDLE is hooked up to IOCP, a completion _will_ be queued
even if TRUE is returned with ERROR_SUCCESS. This goes for ReadFile as well.

Not sure why MSDN makes this so hard to find!
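
A small sketch of that 1-to-1 accounting (hFile is assumed to be already bound to hIocp, and abBlock/BLOCK_SIZE stand in for the buffers in the test apps): count TRUE and ERROR_IO_PENDING alike as issued, then dequeue exactly that many completions.
_________________________
OVERLAPPED aol[4] = {};
unsigned issued = 0;

for (unsigned i = 0; i < 4; ++i)
{
    aol[i].Offset = i * BLOCK_SIZE;
    if (ReadFile(hFile, abBlock, BLOCK_SIZE, NULL, &aol[i])
        || GetLastError() == ERROR_IO_PENDING)
        ++issued; // an immediate TRUE still queues a completion packet
    // else: hard failure, nothing was queued for this slot
}

// one GQCS dequeue per issued read, no special case for TRUE
for (unsigned i = 0; i < issued; ++i)
{
    DWORD cb; ULONG_PTR key; OVERLAPPED* pol;
    GetQueuedCompletionStatus(hIocp, &cb, &key, &pol, INFINITE);
}
_________________________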

Chris M. Thomasson

May 18, 2017, 3:51:58 PM
This thread is bringing back many old memories of my time creating
servers on WinNT 4.0. Ah, the good ol' days. Actually, I am working on a
new special HTTP server that uses my experimental fractal encryption as
a "proof of concept". :^)

Also, try to use proper attributions on usenet. Thanks.

Chris M. Thomasson

May 18, 2017, 4:00:43 PM
No, you are wrong. What you wrote is _not_ correct when the HANDLE is
hooked up to IOCP. ReadFile returning TRUE, wrt GetLastError
returning ERROR_SUCCESS, means that GQCS will dequeue a completion.

You need to study up on how to properly use IOCP.

I have a lot of experience in the area and can help you out.

jak

May 18, 2017, 6:58:43 PM
On 26/04/2017 13:56, kus wrote:
> Hi,
> I am implementing a client-server application, as mentioned in my
> previous posts. The issue I am seeing: when I notify a waiting thread,
> that waiting thread sometimes takes around 10 seconds to respond to the
> notifying thread. I see this situation when a very large number of
> connections are opened simultaneously. Is this normal behaviour, or am
> I implementing something wrong here?
> Thanks,
> Kushal
Hi,
try applying this algorithm before accessing the resource:

https://en.wikipedia.org/wiki/Dining_philosophers_problem

your problem seems to be a threads battle :)


Bonita Montero

May 19, 2017, 4:16:14 AM
> No, you are wrong. What you wrote is _not_ correct when the HANDLE is
> hooked up to IOCP. ...

Read the article I referred to; it isn't about IOCPs.

Bonita Montero

May 19, 2017, 4:27:12 AM
> Afaict, you are not waiting on that event to complete before you bind
> the file to the IOCP. So, you do not know if the write was even finished
> before you start issuing reads.

Extending the file is always done synchronously; look at the MSDN doc.

Chris M. Thomasson

May 19, 2017, 9:52:00 AM
But, you used IOCP in your code?

Chris M. Thomasson

May 19, 2017, 9:53:30 AM
You used async IO and failed to wait on the event in your code.

Chris M. Thomasson

May 19, 2017, 12:35:31 PM
On 5/19/2017 1:15 AM, Bonita Montero wrote:
I did read it. I never really liked using OVERLAPPED with the event.
Also, I never liked using completion routines.

Imvho, IOCP is much, much better.

Chris M. Thomasson

May 19, 2017, 12:39:04 PM
On 5/19/2017 6:53 AM, Chris M. Thomasson wrote:
> On 5/19/2017 1:27 AM, Bonita Montero wrote:
>>> Afaict, you are not waiting on that event to complete before you bind
>>> the file to the IOCP. So, you do not know if the write was even finished
>>> before you start issuing reads.
>>
>> Extending the file is always done synchronously; look at the MSDN doc.

That line of thinking is dangerous.

> You used async IO and failed to wait on the event in your code.

How can you be absolutely sure of that, since it's in async mode?

Also, at least add a CT or something for proper quoting. I feel as if I
am beating a dead horse or something horrible like that. I see that you
are using Thunderbird; what's going on here?

;^o

Bonita Montero

May 19, 2017, 1:33:38 PM
>> Read the article I referred to; it isn't about IOCPs.

> But, you used IOCP in your code?

Yes, in both, but I extend the file before an IOCP is bound to the
file handle.

Bonita Montero

May 19, 2017, 1:34:08 PM
> I did read it. I never really liked using OVERLAPPED with the event.
> Also, I never liked using completion routines.
>
> Imvho, IOCP is much, much better.

There is no advantage to an IOCP in this case.

Bonita Montero

May 19, 2017, 1:35:07 PM
> You used async IO and failed to wait on the event in your code.

Yes, I missed the wait, but fortunately it isn't necessary because
extending a file is always synchronous, even if the file is opened
for overlapped I/O.

Chris M. Thomasson

May 19, 2017, 2:21:01 PM
Where is this documented by Microsoft?

Chris M. Thomasson

May 19, 2017, 2:30:09 PM
Okay. But can you be 100% sure that the async IO file will never give
an async result? How?

Chris M. Thomasson

May 19, 2017, 5:02:39 PM
On 5/19/2017 10:33 AM, Bonita Montero wrote:
Why do you use that event for the expansion, and fail to wait on it,
then switch over to IOCP? Even in the IOCP mode, you make your code fail
when ReadFile returns TRUE: WHY?

This is just plain strange.