
do I need to use Critical Section here?


Scott

Jun 14, 2006, 2:31:19 PM
Hi,
I am designing an IOCP application that will have a maximum of 5000
connections. I have reserved enough virtual memory for these per-socket
connections; when I need one, I simply allocate the memory required. I also
have an array of 5000 chars which records which parts of the memory are in
use: when a connection is established, the array is scanned for a zero and
the appropriate memory block is allocated. When the connection is closed,
the memory is returned and the corresponding array entry is set back to 0.

I understand the reason for a critical section, which is to prevent the
memory from being written by separate threads at the same time. But if the
thread that waits for connections simply checks the array, which is chars,
the read should take a single cycle, so there is no way it can be written
at the same time by another thread. And in the IOCP routine, where the
entry is cleared when the socket is closed, the virtual memory is
reallocated before the char array byte is cleared. Would I still need to
use a critical section before searching for an empty array slot, and
before clearing it when the socket is closed?

Thanks
Scott


bluethought

Jun 14, 2006, 3:26:25 PM

Scott wrote:

> Hi,
> I am designing an IOCP application that will have a maximum of 5000
> connections. I have reserved enough virtual memory for these per-socket
> connections; when I need one, I simply allocate the memory required. I also
> have an array of 5000 chars which records which parts of the memory are in
> use: when a connection is established, the array is scanned for a zero and
> the appropriate memory block is allocated. When the connection is closed,
> the memory is returned and the corresponding array entry is set back to 0.
>

So let me see: you are using the char iFlag[5000] to store indicators of
whether each particular memory block is in use or not? I.e., if the memory
blocks are indexed 0 to 4999, then iFlag[X] is 0 if block X is not in use
and 1 if block X is in use.

That is my assumption... though it makes more sense to use an int for this
(or a BOOL, which in Win32 is equivalent to an int/long), as explained
below.

> I understand the reason for a critical section, which is to prevent the
> memory from being written by separate threads at the same time. But if the
> thread that waits for connections simply checks the array, which is chars,
> the read should take a single cycle, so there is no way it can be written
> at the same time by another thread. And in the IOCP routine, where the
> entry is cleared when the socket is closed, the virtual memory is
> reallocated before the char array byte is cleared. Would I still need to
> use a critical section before searching for an empty array slot, and
> before clearing it when the socket is closed?
>

That is one thread acting as an acquirer [the listener/accept thread] and
multiple threads acting as releasers [the IOCP worker threads]. This is
fine as long as only *one* thread is an acquirer; having more than one will
lead to race conditions that need atomic access to resolve. Having multiple
releasers is not an issue either, because the release is signalled only
after the memory is reallocated, and each releaser thread works on a
different array index.

If you still want to be on the safe side, in case you change your
algorithm somewhere down the line, use an array of longs (or ints/BOOLs)
and then use:

long g_lMemoryBlockInUse[5000]; // 0 = free, TRUE = in use

BOOL Acquire(int iIndex)
{
    // InterlockedExchange atomically stores TRUE and returns the old value.
    if (InterlockedExchange(&g_lMemoryBlockInUse[iIndex], TRUE))
        return FALSE; // Error, block already acquired!

    return TRUE;
}

BOOL Release(int iIndex)
{
    // Atomically clear the flag; a zero old value means a double release.
    if (!InterlockedExchange(&g_lMemoryBlockInUse[iIndex], FALSE))
        return FALSE; // Error, block already released!

    return TRUE;
}


Then just call Acquire instead of setting the char array element to 1, and
Release instead of clearing it. This even allows you to use multiple
acquirer threads without any hassle.
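
For instance, a hypothetical accept-path helper might look like this (the
linear scan and the 5000 constant are just illustrative):

int AcquireFreeBlock(void)
{
    // Scan for a free block and claim it atomically. Acquire() succeeds
    // only for the thread that flips the flag from FALSE to TRUE, so two
    // scanning threads can never claim the same slot.
    for (int i = 0; i < 5000; i++)
        if (Acquire(i))
            return i;

    return -1; // all blocks in use
}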

Peter Duniho

Jun 14, 2006, 5:19:18 PM
"bluethought" <bluet...@gmail.com> wrote in message
news:1150313185.8...@g10g2000cwb.googlegroups.com...
> [...]

> That is my assumption...it makes more sense to use an int though for
> this....explained below (or a BOOL, which is equivalent in Win32 to an
> int/long).

Can you define "more sense"? I was looking for a more obvious explanation
later in your post, but couldn't find it.

My first thought when I saw his post was that it makes more sense to use a
true bitfield, rather than wasting 8 bits for a single bit's worth of
information. Especially since large data structures can cause cache
problems. Gone are the days when all one needs to look at is an individual
instruction's timing to figure out what's faster or better.

In this case, even though theoretically more instructions will be used to
operate on the bitfield, in reality the use of larger data structures
(either 8 bits per flag or 32 bits per flag) causes the data structure to
have a larger footprint, using up more of the cache and potentially slowing
other areas of the code that are more performance-critical.

I suspect that no harm would come from using a bitfield, and it may actually
improve performance (especially compared to using 32-bit storage for each
flag).

All of the above is theoretical, of course. One would need to measure
actual performance in context to know for sure the correct solution. But I
do know that code that tries to minimize abuse of the cache tends to be some
of the fastest, all else being equal.
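
For illustration, a bitfield version might be sketched like this (my own
sketch, untested; it assumes MSVC's _interlockedbittestandset /
_interlockedbittestandreset intrinsics, which atomically test and then set
or clear a single bit):

#include <windows.h>
#include <intrin.h>

// 5000 flags packed 32 per LONG: about 628 bytes instead of 5000.
static LONG g_bits[(5000 + 31) / 32];

BOOL AcquireBit(int iIndex)
{
    // The intrinsic returns the bit's previous value; 0 means we claimed it.
    return !_interlockedbittestandset(&g_bits[iIndex / 32], iIndex % 32);
}

BOOL ReleaseBit(int iIndex)
{
    // Returns the previous value; TRUE confirms the bit was actually set.
    return _interlockedbittestandreset(&g_bits[iIndex / 32], iIndex % 32);
}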

Pete


bluethought

Jun 14, 2006, 7:23:24 PM
Peter Duniho wrote:

> "bluethought" <bluet...@gmail.com> wrote in message
> news:1150313185.8...@g10g2000cwb.googlegroups.com...
> > [...]
> > That is my assumption...it makes more sense to use an int though for
> > this....explained below (or a BOOL, which is equivalent in Win32 to an
> > int/long).
>
> Can you define "more sense"? I was looking for a more obvious explanation
> later in your post, but couldn't find it.

Simply because the Interlocked family of functions, which provide atomic
access, operate on 32-bit (or 64-bit) values. That is the primary reason I
say it makes more sense: it allows for more extensibility.

I've also read in the past that it is faster to operate on 32-bit
variables than on 8-bit / 1-bit [boolean] ones. I can't substantiate that
myself here, but some looking up might get you the information... I do
remember it being linked to the use of BOOL rather than bool, BOOL being
just typedef'd as an int. This is a secondary reason, though, and I might
be mistaken. The more compelling reason is the first one.

Peter Duniho

Jun 14, 2006, 9:44:26 PM
"bluethought" <bluet...@gmail.com> wrote in message
news:1150327404.0...@i40g2000cwc.googlegroups.com...

> Simply since the Interlocked family of functions which allow for atomic
> access operate on 32 bit values (or 64 bit values). This is the primary
> reason for me to say it makes more sense. It allows for more
> extensibility.

Interesting. Thanks. I certainly am curious whether the 32X increase in
data size is offset by the use of the atomic access operations. I wonder if
anyone's ever actually benchmarked that, and whether the results depend on
any particular parameters.

I guess with L2 caches these days that are 1-2MB, you can deal with a lot of
flags before screwing up your cache. :)

Pete


dav...@webmaster.com

Jun 14, 2006, 10:22:50 PM

Scott wrote:

> I understand the reason for a critical section, which is to prevent the
> memory from being written by separate threads at the same time. But if the
> thread that waits for connections simply checks the array, which is chars,
> the read should take a single cycle, so there is no way it can be written
> at the same time by another thread. And in the IOCP routine, where the
> entry is cleared when the socket is closed, the virtual memory is
> reallocated before the char array byte is cleared. Would I still need to
> use a critical section before searching for an empty array slot, and
> before clearing it when the socket is closed?

Just use a critical section. You have no way to know that what looks like
a pure read won't also write the value back. (Compilers frequently generate
code that copies a memory location into a register, operates on it, and
writes it back; they may generate better code by writing the register back
to memory even when it hasn't changed.)

For example:

    if (j != USED) { j = USED; ret = true; }
    return ret;

may be optimized to:

    if (atomic_swap(j, USED) != USED) ret = true;

where 'atomic_swap' is an operation that sets 'j' to USED and returns the
previous value.

This stuff about "takes 1 read cycle" is all assumptions about how you
think the compiler should compile your code. Modern compilers are not
naive and such assumptions are often false. Even if it's true today on
the compiler you are using, it may not be true tomorrow.

Use functions that have guarantees. Make platform/compiler
optimizations later when there's proof there's a need.
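
To make that concrete, here is a sketch (my example, not the original
poster's code) of claiming a slot through a call with a documented
guarantee rather than through assumptions about code generation:

#include <windows.h>

static LONG g_inUse[5000]; // 0 = free, 1 = used

BOOL TryClaim(int i)
{
    // InterlockedCompareExchange(dest, exchange, comparand) atomically
    // stores 1 only if the slot still holds 0, and returns the old value.
    // The atomicity is part of the API contract, not a codegen assumption.
    return InterlockedCompareExchange(&g_inUse[i], 1, 0) == 0;
}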

DS

bluethought

Jun 15, 2006, 2:33:14 AM
Peter Duniho wrote:

> Interesting. Thanks. I certainly am curious whether the 32X increase in
> data size is offset by the use of the atomic access operations. I wonder if
> anyone's ever actually benchmarked that, and whether the results depend on
> any particular parameters.

Time willing, I'll give that a shot and see what happens. I simply
don't have the technical expertise to do more than very superficial
benchmarking though - would need DS here or somebody with more
expertise on CPUs / assembly to do really thorough benchmarking!


> I guess with L2 caches these days that are 1-2MB, you can deal with a lot of
> flags before screwing up your cache. :)
>

Indeed!

Scott

Jun 15, 2006, 3:25:42 AM
Hi David,
Thanks for the below.

I never knew the compiler would make these changes. I write a lot of
software for processors in assembly language, where I know exactly what is
going on in the processor at any one time, so I have to get out of that way
of thinking.

Thanks again

Scott

<dav...@webmaster.com> wrote in message
news:1150338170.7...@y41g2000cwy.googlegroups.com...

Eric Jensen

Jun 15, 2006, 11:29:42 AM

"Scott" <sc...@blueyonder.co.uk> skrev i en meddelelse
news:XBYjg.399979$xt.6...@fe3.news.blueyonder.co.uk...

[snip]

When it's only an array of fundamental C++ primitive types (char, bool,
short, int, etc.), you'll only need to worry about multiple threads writing
the same block of memory simultaneously if there is more than one CPU in
your system.

If you only have one CPU, the OS will time-slice between threads, so only
one thread will be able to write to memory at any time. If you have more
than one CPU, or don't know the number of CPUs on the machine, you must use
the Interlocked* Win32 API functions. Even two CPUs can't write the same
block of memory simultaneously, but the CPUs have memory caches, and the
Interlocked* functions force the CPUs' memory caches to agree. However,
with only one CPU you may still find the need to synchronise access to a
variable. Generally you will use a critical section if the threads work
with the variable.

Let's say you have 10 threads that each download on a socket and add the
received byte count to a variable:

unsigned int uiTotalReceived = 0;

DWORD WINAPI MyThread(LPVOID arg) {
    while (true) {
        // do some socket stuff that reads bytesRead bytes
        uiTotalReceived += bytesRead;
    }
}

The above will not need a critical section or Interlocked* functions
(assuming one CPU only), but this will:

unsigned int uiTotalReceived = 0;

DWORD WINAPI MyThread(LPVOID arg) {
    unsigned int uiTotalReceivedCopy;
    while (true) {
        uiTotalReceivedCopy = uiTotalReceived;
        uiTotalReceived += bytesRead;
        if (uiTotalReceived != (uiTotalReceivedCopy + bytesRead)) {
            // whatever
        }
    }
}

With a critical section it will look like:

unsigned int uiTotalReceived = 0;
CRITICAL_SECTION cs; // call InitializeCriticalSection(&cs) once before the
                     // threads start, and DeleteCriticalSection(&cs) after
                     // they have all exited

DWORD WINAPI MyThread(LPVOID arg) {
    unsigned int uiTotalReceivedCopy;
    while (true) {
        EnterCriticalSection(&cs); // blocks until the lock is obtained
        uiTotalReceivedCopy = uiTotalReceived;
        uiTotalReceived += bytesRead;
        if (uiTotalReceived != (uiTotalReceivedCopy + bytesRead)) {
            // This can not happen!
        }
        LeaveCriticalSection(&cs);
    }
}

The (last) sample above uses a critical section to synchronise access to a
variable that is used more than once in the same code block. If we used
Interlocked* instead:

    if (uiTotalReceived != (uiTotalReceivedCopy + bytesRead)) {
        // With interlocked this can happen
    }

If your variables are class objects (e.g. a client collection, or
std::cout for that matter), there is the possibility that the OS will give
time to another thread before the std::cout call completes, and if that
thread also uses std::cout the printed text will be mixed together.

Thread1: std::cout << "Hello World!" << std::flush;
Thread2: std::cout << "Goodbye evil World!" << std::flush;
Could output: "HelGoodbyelo evWorld!il World!".

In such cases you will use a critical section even on a system with only
one CPU. A critical section is not safe on multiple CPUs! For multiple CPUs
you must create your own critical section, for example by comparing thread
ids with InterlockedExchange().
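
What that amounts to is essentially a spinlock. A minimal sketch
(illustrative only; this is my code, not anything from the thread):

#include <windows.h>

volatile LONG g_lLock = 0;

void SpinLock(void)
{
    // InterlockedExchange returns the previous value; 0 means we took
    // the lock, anything else means another thread holds it.
    while (InterlockedExchange(&g_lLock, 1) != 0)
        Sleep(0); // give up the rest of the timeslice and retry
}

void SpinUnlock(void)
{
    InterlockedExchange(&g_lLock, 0); // release the lock
}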

I hope that was enough info for you to decide what to use yourself :) And
please excuse me if I left something out (forgot).

//eric


Peter Duniho

Jun 15, 2006, 12:30:57 PM
"Eric Jensen" <er...@no.spam.com> wrote in message
news:44917ce5$0$141$edfa...@dread16.news.tele.dk...

> When it's only an array of fundamental C++ primitive types (char, bool,
> short, int, etc.), you'll only need to worry about multiple threads
> writing the same block of memory simultaneously if there is more than one
> CPU in your system.

This is only sort of correct.

Your code example illustrates why:

> If you only have one CPU, the OS will time-slice between threads, so
> only one thread will be able to write to memory at any time.

The problem is that you have no way to predict where one thread's timeslice
will end. So, in your example:

> [...]


> unsigned int uiTotalReceived = 0;
>
> DWORD WINAPI MyThread(LPVOID arg) {
>     while (true) {
>         // do some socket stuff that reads bytesRead bytes
>         uiTotalReceived += bytesRead;
>     }
> }

The thread does more than just write to the variable. It has to read it as
well. The following sequence will result in incorrect results, even with
just one CPU:

Thread A: read uiTotalReceived
Thread B: read uiTotalReceived
Thread A: add bytesRead to uiTotalReceived and write it back to
uiTotalReceived
Thread B: add bytesRead to uiTotalReceived and write it back to
uiTotalReceived

Because Thread B is adding its own bytesRead value to an out-of-date version
of uiTotalReceived, it will essentially ignore the addition done by Thread
A. The update to the byte count made by Thread A winds up lost.

There are ways to structure limited kinds of code that share writeable data
so that synchronization can be avoided on single CPU computers, but it's
tricky and risky to do and IMHO most programmers who try wind up doing it
wrong.

If you have a processor with an add instruction that can operate directly
on memory, you MAY be able to avoid some of the above issues. However, I'm
not convinced that even on Intel you can guarantee 100% that the basic code
you wrote will always translate to a single-instruction in-memory add, and
all bets are off for other hardware types.
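
For what it's worth, the guaranteed way to do this particular increment on
Win32 is the atomic add in the Interlocked family (a sketch of mine, not
Eric's code):

#include <windows.h>

volatile LONG g_lTotalReceived = 0; // LONG so it fits the Interlocked API

void AddBytesReceived(LONG bytesRead)
{
    // The read-modify-write is atomic by contract, so no update can be
    // lost, whether there is one CPU or several.
    InterlockedExchangeAdd(&g_lTotalReceived, bytesRead);
}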

Pete


dav...@webmaster.com

Jun 15, 2006, 4:14:36 PM

Eric Jensen wrote:

> If you only have one CPU, the OS will time-slice between threads, so
> only one thread will be able to write to memory at any time. If you have
> more than one CPU, or don't know the number of CPUs on the machine, you
> must use the Interlocked* Win32 API functions. Even two CPUs can't write
> the same block of memory simultaneously, but the CPUs have memory caches,
> and the Interlocked* functions force the CPUs' memory caches to agree.

Anyone who thinks that the Interlocked* operations are about CPU memory
caches really can't understand SMP issues. This is a common falsehood
that causes as many problems as the belief that a web server assigns
each incoming connection a new port number so that it can tell them all
apart.

DS

Peter Duniho

Jun 15, 2006, 7:45:39 PM
<dav...@webmaster.com> wrote in message
news:1150402476.3...@h76g2000cwa.googlegroups.com...

> Anyone who thinks that the Interlocked* operations are about CPU memory
> caches really can't understand SMP issues.

That said, I'll suggest that it is more useful to post a message that
explains what they ARE about than one that explains what they are NOT
about.

The latter serves only to tell someone they are wrong. It does nothing to
help advance the knowledge of others (which is, of course, the primary
usefulness of communication).

I admit that it's always important to be aware of what one does not know.
But it's even better to be able to fill those gaps when possible.

Pete


dav...@webmaster.com

Jun 15, 2006, 8:23:44 PM

Peter Duniho wrote:

> <dav...@webmaster.com> wrote in message
> news:1150402476.3...@h76g2000cwa.googlegroups.com...

> > Anyone who thinks that the Interlocked* operations are about CPU memory
> > caches really can't understand SMP issues.

> That said, I'll suggest that it is more useful to post a message that
> explains what they ARE about, than posting a message that explains what they
> are NOT about.

In this specific case, I don't agree, for two reasons:

1) Good multi-threaded programming is more about *not* trying to
understand how the particular hardware works or what it does; it's about
producing programs that work well regardless of hardware details. (Of
course, it doesn't hurt to know more, and you might be able to produce
better code that way, but you still first need to produce assumption-free
code. It's a skill that is, logically, vastly prior to producing
platform-optimized code.)

2) This is a really complex issue that can't be adequately explained in
the context of one USENET post. All that's really needed in this
context is the understanding that it's very complex and you shouldn't
make assumptions about it. (In fact, it seems pretty much that nobody
understands some of these issues, such as in precisely what cases
64-bit aligned writes are atomic on P4s.)

> The latter serves only to tell someone they are wrong. It does nothing to
> help advance the knowledge of others (which is, of course, the primary
> usefulness of communication).

In this case, pretty much all you need to know is that you should not
make any assumptions about what the hardware will do. ;)

DS

Chris Thomasson

Jun 15, 2006, 9:33:28 PM
"Eric Jensen" <er...@no.spam.com> wrote in message
news:44917ce5$0$141$edfa...@dread16.news.tele.dk...
>
> "Scott" <sc...@blueyonder.co.uk> skrev i en meddelelse
> news:XBYjg.399979$xt.6...@fe3.news.blueyonder.co.uk...
>
> [snip]
> block of memory simultaneously, but the CPUs have memory caches, and the
> Interlocked* functions force the CPUs' memory caches to agree.

This is a common misunderstanding. I suggest that you read this and the
links contained within:

http://groups.google.com/group/comp.programming.threads/msg/e6ba993eda0bc79a

then, read this entire thread:

http://groups.google.com/group/comp.programming.threads/msg/423df394a0370fa6
(this is my response to a question from Scott Meyers wrt compiler
interference with lock-free algorithms...)

There are a couple of posts by Neill Clift from Microsoft. He works on
performance-critical aspects (memory synchronization) of various Windows
kernels; I think he even worked on the kernel for the version of Windows
that ran on DEC Alphas. He briefly discusses some of the issues wrt the
Interlocked API. According to Neill, the Interlocked operations have both
acquire and release semantics; basically, they are full memory barriers.

Here is how it would look on SPARC:

Whatever InterlockedWhatever(...) {

    membar #LoadStore|#StoreStore

    /* Perform the atomic operation */

    membar #StoreLoad|#StoreStore

    return Whatever;
}

That is how Neill said they worked on the DEC Alpha, which has a memory
model so weak that it requires memory barriers even for dependent load
ordering!

:O

