
Is LeaveCriticalSection a memory barrier?


Vladimir Petter

Oct 27, 2003, 12:14:48 PM
Hi,

The question is related to a multiprocessor machine.

Let's say I have a couple of class member variables.

Is it enough to protect access to those variables with
Enter/LeaveCriticalSection so I can safely use C/C++
operators inside?

Or do I have to use Interlocked* functions instead of C/C++
operators anyway to guarantee this will work properly
on a multiprocessor machine?

In other words, can LeaveCriticalSection be considered
a memory barrier?

Vladimir.
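For concreteness, the pattern being asked about looks roughly like the sketch below. std::mutex is used here purely as a portable stand-in for CRITICAL_SECTION (lock() playing EnterCriticalSection, unlock() playing LeaveCriticalSection); the class and member names are hypothetical, not from the thread:

```cpp
#include <cassert>
#include <mutex>

// Hypothetical class illustrating the question: plain member variables,
// plain C++ operators, no Interlocked* calls -- only the lock guards them.
class Counter {
public:
    void increment() {
        std::lock_guard<std::mutex> guard(lock_);  // ~ EnterCriticalSection
        ++count_;       // ordinary C++ operator inside the section
        total_ += 2;
    }                   // guard destructor runs ~ LeaveCriticalSection

    int count() {
        std::lock_guard<std::mutex> guard(lock_);
        return count_;
    }

private:
    std::mutex lock_;   // stand-in for a CRITICAL_SECTION member
    int count_ = 0;     // plain data protected by the lock
    int total_ = 0;
};
```

The question is whether releasing the lock is enough to make those plain writes visible to the next thread that acquires it.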


Vladimir Petter

Oct 27, 2003, 1:09:14 PM
This is how the disassembly of RtlLeaveCriticalSection looks on Windows XP:
 
.text:77F5B372                 align 10h
.text:77F5B380 ; Exported entry 677. RtlLeaveCriticalSection
.text:77F5B380
.text:77F5B380 ; =============== S U B R O U T I N E =======================================
.text:77F5B380
.text:77F5B380
.text:77F5B380                 public RtlLeaveCriticalSection
.text:77F5B380 RtlLeaveCriticalSection proc near       ; CODE XREF: sub_77F55422+B p
.text:77F5B380                                         ; RtlAllocateHeap+3E5 p ...
.text:77F5B380
.text:77F5B380 arg_0           = dword ptr  4
.text:77F5B380
.text:77F5B380                 mov     edx, [esp+arg_0]
.text:77F5B384                 xor     eax, eax
.text:77F5B386                 dec     dword ptr [edx+8]
.text:77F5B389                 jnz     short loc_77F5B3B0
.text:77F5B38B                 mov     [edx+0Ch], eax
.text:77F5B38E
.text:77F5B38E loc_77F5B38E:                           ; DATA XREF: .data:77FC1004 o
.text:77F5B38E                 lock dec     dword ptr [edx+4]
.text:77F5B392                 jge     short loc_77F5B397
.text:77F5B394                 retn    4
 
The only instruction that cares about the processor's cache coherency is the one at .text:77F5B38E.
This instruction accesses a member variable of the CRITICAL_SECTION structure.
 
The Intel documentation (24547109.pdf and 24547209.pdf) both say that the LOCK signal guarantees the
bus will be locked during the prefixed operation, but it does not promise flushing the processor's cache.
Am I missing something?
 
Vladimir.
 
 

Vladimir Petter

Oct 27, 2003, 2:07:00 PM
Sorry, I missed part of the disassembly, but the question stands.
 
.text:77F5B36D ; ---------------------------------------------------------------------------

.text:77F5B372                 align 10h
.text:77F5B380 ; Exported entry 677. RtlLeaveCriticalSection
.text:77F5B380
.text:77F5B380 ; =============== S U B R O U T I N E =======================================
.text:77F5B380
.text:77F5B380
.text:77F5B380                 public RtlLeaveCriticalSection
.text:77F5B380 RtlLeaveCriticalSection proc near       ; CODE XREF: sub_77F55422+B p
.text:77F5B380                                         ; RtlAllocateHeap+3E5 p ...
.text:77F5B380
.text:77F5B380 arg_0           = dword ptr  4
.text:77F5B380
.text:77F5B380                 mov     edx, [esp+arg_0]
.text:77F5B384                 xor     eax, eax
.text:77F5B386                 dec     dword ptr [edx+8]
.text:77F5B389                 jnz     short loc_77F5B3B0
.text:77F5B38B                 mov     [edx+0Ch], eax
.text:77F5B38E
.text:77F5B38E loc_77F5B38E:                           ; DATA XREF: .data:77FC1004 o
.text:77F5B38E                 lock dec     dword ptr [edx+4]
.text:77F5B392                 jge     short loc_77F5B397
.text:77F5B394                 retn    4
.text:77F5B397 ; ---------------------------------------------------------------------------
.text:77F5B397
.text:77F5B397 loc_77F5B397:                           ; CODE XREF: RtlLeaveCriticalSection+12 j
.text:77F5B397                 push    edx
.text:77F5B398                 call    RtlpUnWaitCriticalSection
.text:77F5B39D                 xor     eax, eax
.text:77F5B39F                 retn    4
.text:77F5B39F ; ---------------------------------------------------------------------------
.text:77F5B3A2                 align 10h
.text:77F5B3B0
.text:77F5B3B0 loc_77F5B3B0:                           ; CODE XREF: RtlLeaveCriticalSection+9 j
.text:77F5B3B0                                         ; DATA XREF: .data:77FC1008 o
.text:77F5B3B0                 lock dec     dword ptr [edx+4]
.text:77F5B3B4                 retn    4
.text:77F5B3B4 RtlLeaveCriticalSection endp
Vladimir

James Antognini

Oct 27, 2003, 2:26:39 PM
I just read pertinent sections of "Intel Architecture Software
Developer's Manual," vols. 2 and 3, 1997, and I agree with your
conclusion that the locked DEC (and the corresponding locked INC upon
entry to the critical section) does not affect the caches of other
processors. I do not see a reason for the critical-section mechanism to
invalidate the caches of other processors.

That said, I don't think this is a problem, because the actual updating
of memory on one CPU (at least if the memory item is properly aligned,
e.g., a 4-byte ULONG on a 4-byte boundary updated by a 4-byte operation)
should cause other CPUs to invalidate their caches for that piece of
memory. See "Intel Architecture Software Developer's Manual," vol. 3,
1997, page 9-4, specifically in reference to "snooping."

Obviously, the compiler has to generate code to make proper use of the
x86-architecture cache, else all may be for naught. For example, it
would not do to fetch the value of variable X before entering the
critical section, since cache invalidation applies only at fetch and not
at some later arbitrary point. But in my test, the compiler behaved
correctly:

21: ULONG X = 0;
00401028 mov dword ptr [ebp-4],0
22: CRITICAL_SECTION critX;
23:
24: InitializeCriticalSection(&critX);
0040102F mov esi,esp
00401031 lea eax,[ebp-1Ch]
00401034 push eax
00401035 call dword ptr [__imp__InitializeCriticalSection@4 (0042a15c)]
0040103B cmp esi,esp
0040103D call __chkesp (004010a0)
25: EnterCriticalSection(&critX);
00401042 mov esi,esp
00401044 lea ecx,[ebp-1Ch]
00401047 push ecx
00401048 call dword ptr [__imp__EnterCriticalSection@4 (0042a158)]
0040104E cmp esi,esp
00401050 call __chkesp (004010a0)
26: X++;
00401055 mov edx,dword ptr [ebp-4]
00401058 add edx,1
0040105B mov dword ptr [ebp-4],edx
27: LeaveCriticalSection(&critX);
0040105E mov esi,esp
00401060 lea eax,[ebp-1Ch]
00401063 push eax
00401064 call dword ptr [__imp__LeaveCriticalSection@4 (0042a154)]
0040106A cmp esi,esp
0040106C call __chkesp (004010a0)

So only one instance of the program can be accessing X at any time (by
software protocol), and the CPU architecture ensures (by snooping or some
other mechanism) that the program sees the latest value of X.
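The compiled X++ above is a plain load/add/store sequence; what makes it safe on a multiprocessor is that both the fetch and the store happen while the lock is held. A minimal cross-platform illustration of that point, using std::mutex as a stand-in for the CRITICAL_SECTION (an assumption of this sketch, not what the original Win32 test program used):

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

std::mutex m;           // stand-in for critX
unsigned long X = 0;    // plain variable, no atomics, no Interlocked*

// Every thread performs a plain X++ (mov / add / mov) while holding the
// lock, mirroring the disassembled test program above. Because the fetch
// of X happens *inside* the section, no thread can increment a stale
// snapshot taken before entry, and no increments are lost.
unsigned long run_counter(int threads, int iters) {
    X = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([iters] {
            for (int i = 0; i < iters; ++i) {
                std::lock_guard<std::mutex> g(m);  // ~ EnterCriticalSection
                X++;                               // plain C++ operator
            }                                      // ~ LeaveCriticalSection
        });
    for (auto& th : pool) th.join();
    return X;
}
```

If the fetch of X were hoisted outside the lock, the final count would come up short under contention, which is exactly the compiler misbehavior the paragraph above rules out.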

Vladimir Petter wrote:

> The Intel documentation (24547109.pdf and 24547209.pdf) both say that the
> LOCK signal guarantees the bus will be locked during the prefixed
> operation, but it does not promise flushing the processor's cache.
> Am I missing something?

--
If replying by e-mail, please remove "nospam." from the address.

James Antognini
Windows DDK MVP


Vladimir Petter

Oct 27, 2003, 3:24:54 PM
Hi James,

Thanks for the response.

> That said, I don't think this is a problem, because the actual updating
> of memory on one CPU (at least if the memory item is properly aligned,
> e.g., a 4-byte ULONG on a 4-byte boundary updated by a 4-byte operation)
> should cause other CPUs to invalidate their caches for that piece of
> memory. See "Intel Architecture Software Developer's Manual," vol. 3,
> 1997, page 9-4, specifically in reference to "snooping."

Yes! That was the missing part; I felt I was missing something.
Would it be correct to say that inside a critical section it would NOT be a
problem to manipulate improperly aligned data (because the critical section
will make the operation atomic)?


Thanks,
Vladimir.


James Antognini

Oct 27, 2003, 4:42:43 PM
I don't think so. Vis-a-vis cache coherency, it's not the critical section that
counts but the cache mechanism. If an item is not properly aligned in memory,
especially if it crosses a cache line (the Intel manual is rather stern in this
respect), I think you're exposed to another CPU having a stale part of the item
in its cache.

Now I admit I could be wrong on points like this, but I hew to the policy of
following practices like proper alignment to minimize the chance that my code,
running on an oddball CPU or some future super-duper CPU, might fail. And,
remember, when and if it does fail in that way, it's going to be very difficult
to diagnose the problem and near-impossible to reproduce it.
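The misalignment hazard described above is easy to see at the struct level. A small sketch (the #pragma pack directive is a widely supported compiler extension on MSVC, GCC and Clang; the struct names are illustrative, not from the thread):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// With byte packing the 32-bit field starts at offset 1 and can straddle
// a cache-line boundary, the case the Intel manual is stern about.
#pragma pack(push, 1)
struct Packed {
    char tag;
    std::uint32_t value;   // misaligned: sits at offset 1
};
#pragma pack(pop)

// With natural alignment the compiler pads the field onto a 4-byte
// boundary, so a 4-byte store updates it in one aligned operation.
struct Natural {
    char tag;
    std::uint32_t value;   // padded to offset 4
};

static_assert(offsetof(Packed, value) == 1, "packed field is misaligned");
static_assert(offsetof(Natural, value) == 4, "natural layout pads to 4");
```

Keeping fields naturally aligned, as the post recommends, costs a few padding bytes and removes the cache-line-straddle case entirely.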

Vladimir Petter wrote:

> Would it be correct to say that inside a critical section it would NOT be a
> problem to manipulate improperly aligned data (because the critical section
> will make the operation atomic)?

--

Slava M. Usov

Oct 27, 2003, 5:05:04 PM
"Vladimir Petter" <vla...@hotmail.com> wrote in message
news:O#dar4KnD...@TK2MSFTNGP10.phx.gbl...

[...]

> Is it enough to protect access to those variables with
> Enter/LeaveCriticalSection so I can safely use C/C++
> operators inside?

Yes.

> Or I have to use Interlocked* functions instead of C/C++
> operators anyways to guaranty this will work property
> on multiprocessor machine?

No.

> In other words can LeaveCriticalSecrtion be considered as
> a memory barrier?

Yes.

The locked instructions do not flush the cache. But they force strong
ordering. For example:

[begin quote 'IA-32 Software Dev. Manual', Vol. 3, Section 7.1.2.2]
Locked operations are atomic with respect to all other memory operations and
all externally visible events. Only instruction fetch and page table
accesses can pass locked instructions. Locked instructions can be used to
synchronize data written by one processor and read by another processor.

For the P6 family processors, locked operations serialize all outstanding
load and store operations (that is, wait for them to complete). This rule is
also true for the Pentium 4 and Intel Xeon processors, with one exception:
load operations that reference weakly ordered memory types (such as the WC
memory type) may not be serialized.
[end quote 'IA-32 Software Dev. Manual', Vol. 3, Section 7.1.2.2]

On architectures other than IA-32, MS guarantees that critical sections will
be memory barriers.
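In later C++ (C++11) terms, the guarantee quoted above is release/acquire ordering: the locked write in LeaveCriticalSection publishes every earlier plain store, and the locked operation in EnterCriticalSection observes them. A sketch under that assumption, with std::atomic as a portable stand-in for the locked instructions (not what the Win32 code itself uses):

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload = 0;                    // plain, non-atomic guarded data
std::atomic<bool> released{false};  // plays the role of the CS lock word

// Writer updates the guarded data with a plain store, then performs a
// release operation ("LeaveCriticalSection"). Reader spins on an acquire
// load ("EnterCriticalSection"); once it sees the release, all writes
// that preceded it are guaranteed visible.
int observe() {
    std::thread writer([] {
        payload = 42;                                     // plain write
        released.store(true, std::memory_order_release);  // leave CS
    });
    while (!released.load(std::memory_order_acquire)) {}  // enter CS
    int seen = payload;  // must be 42: acquire synchronized with release
    writer.join();
    return seen;
}
```

This is the same pairing the thread is circling around: one locked instruction on each side, no per-variable Interlocked* calls needed for the data in between.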

S


James Antognini

Oct 27, 2003, 6:38:25 PM
I don't think this is a correct understanding of what is important. The critical
section ensures that only one instance is executing the section at a given
moment; the compiler's putting fetch and update instructions between entry and
exit of the section ensures no instance will fetch a value outside the section
and store its updated value inside the section. The machine architecture ensures
that fetch and store (on proper boundaries at least) will work atomically, as
observed by other CPUs at the instant of fetch or store. Strong ordering,
too, is part of the architecture, but I don't see it having a role: there is
no way strong ordering or the lack of it could make a difference, given the
critical section and the atomic-operation guarantee.

"Slava M. Usov" wrote:

> > Is it enough to protect access to those variables with
> > Enter/LeaveCriticalSection so I can safely use C/C++
> > operators inside?
>
> Yes.
>
> > Or I have to use Interlocked* functions instead of C/C++
> > operators anyways to guaranty this will work property
> > on multiprocessor machine?
>
> No.
>
> > In other words can LeaveCriticalSecrtion be considered as
> > a memory barrier?
>
> Yes.
>
> The locked instructions do not flush the cache. But they force strong
> ordering.

--

Kirk Ferdmann

Oct 28, 2003, 2:53:21 AM
"James Antognini" <anto...@mindspring.nospam.com> wrote in message
news:3F9DAC71...@mindspring.nospam.com...

> I don't think this is a correct understanding of what is important. The
> critical section ensures that only one instance is executing the section at
> a given moment; the compiler's putting fetch and update instructions between
> entry and exit of the section ensures no instance will fetch a value outside
> the section and store its updated value inside the section. The machine
> architecture ensures that fetch and store (on proper boundaries at least)
> will work atomically, as observed by other CPUs at the instant of fetch or
> store. Strong ordering, too, is part of the architecture, but I don't see it
> having a role: there is no way strong ordering or the lack of it could make
> a difference, given the critical section and the atomic-operation guarantee.

I might be way off here, but AFAIK the hardware does not guarantee
atomicity. On real hardware, operations like that are merely sequentially
consistent, and thus strong ordering is important.

-Kirk


Slava M. Usov

Oct 28, 2003, 11:48:51 AM
"James Antognini" <anto...@mindspring.nospam.com> wrote in message
news:3F9DAC71...@mindspring.nospam.com...

[...]

> The machine architecture ensures that fetch and store (on proper
> boundaries at least) will work atomically, as observed by other CPUs at
> the instant of fetch or store.

You're correct saying "atomically", but "at the instant of fetch or store"
is somewhat lacking. The fetches and stores do not propagate as soon as they
happen, and as reads can be out of order, it can potentially happen that a
CPU modifies a few words guarded, then another CPU reads the same words and
receives, say, stale data for some and new for the others. Even though each
word gets updated atomically. To prevent this from happening, the second CPU
will have to ensure that before it gets into the CS, no reads of the guarded
data have been attempted. And the lock prefix ensures just that.

S


Vladimir Petter

Oct 28, 2003, 12:50:37 PM
Slava,

> You're correct saying "atomically", but "at the instant of fetch or store"
> is somewhat lacking. The fetches and stores do not propagate as soon as they
> happen, and as reads can be out of order, it can potentially happen that a
> CPU modifies a few words guarded, then another CPU reads the same words and
> receives, say, stale data for some and new for the others. Even though each
> word gets updated atomically. To prevent this from happening, the second CPU
> will have to ensure that before it gets into the CS, no reads of the guarded
> data have been attempted. And the lock prefix ensures just that.

Does that mean that you would recommend using Interlocked* functions even
inside a critical section?

Vladimir.


Slava M. Usov

Oct 28, 2003, 3:25:27 PM
"Vladimir Petter" <vla...@hotmail.com> wrote in message
news:uSJkWxXn...@TK2MSFTNGP09.phx.gbl...

> Does that mean that you would recommend using Interlocked* functions even
> inside a critical section?

Of course not! I thought I would clarify the issue but I only caused more
confusion. Sigh.

One locked instruction will prevent the CPU from performing speculative
reads. That one locked instruction happens in the CS entry sequence. That guarantees that
after a CS has been acquired by CPU A, all the writes that had been done by
any other CPU before it released the CS will be visible at CPU A. So it is a
memory barrier.

S


Vladimir Petter

Oct 28, 2003, 3:30:15 PM
Thanks everyone who responded. Now I am clear.

Vladimir.


Kirk Ferdmann

Oct 29, 2003, 12:31:28 AM
"Slava M. Usov" <stripit...@gmx.net> wrote in message
news:erA8PNXn...@tk2msftngp13.phx.gbl...

> You're correct saying "atomically", but "at the instant of fetch or store"
> is somewhat lacking. The fetches and stores do not propagate as soon as they
> happen, and as reads can be out of order, it can potentially happen that a
> CPU modifies a few words guarded, then another CPU reads the same words and
> receives, say, stale data for some and new for the others. Even though each
> word gets updated atomically.

Isn't that the definition of sequential consistency and not atomicity?

-Kirk


Slava M. Usov

Oct 29, 2003, 7:42:56 AM
"Kirk Ferdmann" <kirk_f...@nospam.hotmail.com> wrote in message
news:7tednSb0_Mo...@comcast.com...

> Isn't that the definition of sequential consistency and not atomicity?

It is. I was stressing the point that atomicity had nothing to do with that,
except making things a bit easier. I guess that message of mine should be
just ignored, because everybody seems to misunderstand it. My bad.

S


James Antognini

Nov 3, 2003, 2:30:17 PM
After a little more reading of the Intel manuals, I think this statement is mistaken,
or at least unhelpful. Cache coherency, via snooping, ensures other CPUs don't
have stale data in their caches. But then the CPUs will have to go to memory to get
the data in its latest state. Clearly it is vital that memory have the latest
data, and strong ordering means that updates have been written through to memory.

James Antognini wrote:

> Strong ordering is,
> too, part of the architecture, but I don't see it having a role: There is no way
> strong ordering or the lack of it could make a difference, given the critical
> section and the atomic-operation guarantee.

--
