InterlockedXxx vs. simple access

Udo Eberhardt

Oct 15, 1999

The DDK documentation states the InterlockedXxx functions are
interlocked only with respect to other InterlockedXxx functions. Ok,
this means I should access an interlocked variable only by using
InterlockedXxx functions. But how can I implement a simple counter
variable that is incremented by an ISR and read by a dispatch routine?
The ISR would use InterlockedIncrement. That's easy.
The dispatch routine could use InterlockedExchange to read the current
value of the counter. But this operation gets the current value of
the variable *and* sets a new value (as its name suggests). I would
like to do an interlocked read only in order to get the current value
without changing it. Why is there no InterlockedGet()?

Second question:
Can I use a simple access to the variable? So there would be the
following scenario:

ISR:
InterlockedIncrement(&counter);

Dispatch Function:
x = counter;

Is the latter access atomic? Even on SMP systems?
Let's assume counter is a ULONG (DWORD) variable that is aligned on a
DWORD boundary.

What's the magic of the InterlockedXxx functions?

Thanks in advance for any comments

Udo

--
Udo Eberhardt

Walter Oney

Oct 18, 1999

Udo Eberhardt wrote:
> The DDK documentation states the InterlockedXxx functions are
> interlocked only with respect to other InterlockedXxx functions. Ok,
> this means I should access an interlocked variable only by using
> InterlockedXxx functions. But how can I implement a simple counter
> variable that is incremented by an ISR and read by a dispatch routine?

The "interlocked only with respect to other InterlockedXxx functions"
really means that no other access (including a fetch) will be allowed to
the same memory location for the duration of the updating operation. On
an x86, for example, it's done by means of a LOCK prefix and an INC, DEC,
CMPXCHG, etc., instruction. The value of the variable may change
immediately after you're done with the InterlockedXxx call, but a
near-simultaneous fetch will always get a self-consistent result. The
data variable doesn't need to be aligned unless other features (besides
locking) of the CPU so require, either. [Refer to section 13.1.1 in the
i486 manual, for example.]

I suspect the reason that the DDK says this so clumsily is that they
want to caution people that InterlockedXxx isn't the same as a spin lock
or a mutex. If you want to be the only person accessing a variable for
more than one instruction, you need to use one of those other
mechanisms.

--
Walter Oney
http://www.oneysoft.com

Rich Testardi

Oct 18, 1999

Hi,

> The DDK documentation states the InterlockedXxx functions are
> interlocked only with respect to other InterlockedXxx functions. Ok,
> this means I should access an interlocked variable only by using
> InterlockedXxx functions. But how can I implement a simple counter
> variable that is incremented by an ISR and read by a dispatch routine?

> The ISR would use InterlockedIncrement. That's easy.
> The dispatch routine could use InterlockedExchange to read the current
> value of the counter. But this operation gets the current value of
> the variable *and* sets a new value (as its name suggests). I would
> like to do an interlocked read only in order to get the current value
> without changing it.

I think the intent of the conservative wording on those routines was
to allow NT to support processor architectures which did not have such
a generous and flexible set of interlocked operations... We ported NT
to an architecture that was quite, uh, limited in this regard... We
had to actually use spinlocks INSIDE the InterlockedXxx() routines...
(And we actually had to use a POOL of spinlocks, based on a hash of the
operand address...) It was pretty ugly... In this case, you can begin
to see that there are cases when the InterlockedXxx() ops are not atomic
with regard to other ops (which did not acquire these private spinlocks).

I believe you can even see a bit of this on x86 when you start handing
unaligned operands to the InterlockedXxx() calls... If you're just using
a MOV instruction to read the value of the unaligned counter, then the
MOV is NOT atomic with regard to the InterlockedXxx() call (which could
execute atomically between the first and second accesses of the MOV), and
you can get inconsistent data.

The Pentium Pro OS manual in section 7.1.1 is quite explicit about which
ops are guaranteed atomic. I'd not seen that listed in earlier manuals...
Fortunately, a 32 bit access to a 32 bit aligned operand IS listed
there, so for all practical purposes, you'll be OK on Intel processors.
On others (not that such a thing exists!), you might be hit or miss...

> Why there is no InterlockedGet() ?

Good question -- for true portability, there should be...

If I were worried about other processors (or unaligned operands), I'd use:

x = InterlockedExchangeAdd(&y, 0);
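
To try the idea outside the kernel, here is a minimal sketch that uses C11 <stdatomic.h> as a portable stand-in for the DDK routines. It assumes that atomic_fetch_add behaves like InterlockedExchangeAdd for this purpose. The names counter_increment and counter_get are hypothetical and do not come from the DDK.

```c
#include <stdatomic.h>

/* Hypothetical shared counter; static storage, so it starts at zero. */
static atomic_ulong shared_counter;

/* Writer side: stand-in for InterlockedIncrement(&counter).
   Returns the value after the increment. */
unsigned long counter_increment(void)
{
    return atomic_fetch_add(&shared_counter, 1UL) + 1UL;
}

/* Reader side: adding zero is an atomic read-modify-write that
   leaves the value unchanged and returns what was there. */
unsigned long counter_get(void)
{
    return atomic_fetch_add(&shared_counter, 0UL);
}
```

On x86 a compiler would typically turn counter_get into a LOCK XADD, which is the same class of instruction the InterlockedXxx routines rely on.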

-- Rich

Rich Testardi, #include <disclaimer.h>, EMC Corporation
Phone: +1-970-203-0937, E-mail: rtes...@emc.com
Street: http://home.interserv.com/~rtestard/address.htm
My webpage: http://home.interserv.com/~rtestard/
My dog's webpage: http://www.dogchow.com/pages/tc/
What better way to honor the giver than to love, cherish,
and enjoy the gift? What greater gift than life?

Udo Eberhardt

Oct 18, 1999

On Mon, 18 Oct 1999 10:47:45 -0600, "Rich Testardi"

<rtes...@interserv.com> wrote:
>I believe you can even see a bit of this on x86 when you start handing
>unaligned operands to the InterlockedXxx() calls... If you're just using
>a MOV instruction to read the value of the unaligned counter, then the
>MOV is NOT atomic with regard to the InterlockedXxx() call (which could
>execute atomically between the first and second accesses of the MOV), and
>you can get inconsistent data.
>

This was my problem. Is a MOV atomic? If the answer is yes, is it
atomic on SMP systems, too?

>The Pentium Pro OS manual in section 7.1.1 is quite explicit about which
>ops are guaranteed atomic. I'd not seen that listed in earlier manuals...
>Fortunately, a 32 bit access to a 32 bit aligned operand IS listed
>there, so for all practical purposes, you'll be OK on Intel processors.
>On others (not that such a thing exists!), you might be hit or miss...
>

This is a clear answer to the first question above. A MOV is atomic
(proper alignment assumed). But is this true even on SMP systems? Does
the hardware ensure atomic memory transactions? What about caching
issues? Is the operand cache (internal and/or external) always
consistent?
As Walter Oney states in his posting, the InterlockedXxx functions
include the LOCK prefix. Does such a LOCK operation cause an
invalidate or flush of the corresponding cache line?
I know these issues are documented in the Pentium manuals. But I don't
own this documentation. You mention the Pentium Pro OS manual. Where
can I get this manual? Is it free?

>If I were worried about other processors (or unaligned operands), I'd use:
>
> x = InterlockedExchangeAdd(&y, 0);

This is a good idea. Thanks.

Udo

--
Udo Eberhardt

Doug Kehn

Oct 18, 1999

> >If I were worried about other processors (or unaligned operands), I'd use:
> >
> > x = InterlockedExchangeAdd(&y, 0);
> This is a good idea. Thanks.
>

I've created and used a macro called InterlockedGet which is nothing more
than:
#define InterlockedGet(val) InterlockedExchange(&(val),(val))

I'm not sure if using InterlockedExchangeAdd(...) would be faster?

...doug

Udo Eberhardt

Oct 19, 1999

On Mon, 18 Oct 1999 22:41:33 -0500, "Doug Kehn"
<doug...@compuserve.com> wrote:

>> >If I were worried about other processors (or unaligned operands), I'd use:
>> >
>> > x = InterlockedExchangeAdd(&y, 0);
>> This is a good idea. Thanks.
>>
>
>I've created and used a macro called InterlockedGet which is nothing more
>than:
>#define InterlockedGet(val) InterlockedExchange(&(val),(val))
>

Oops, this does not work. You don't get an interlocked access this
way. Because you use the value of the variable as an argument to
InterlockedExchange, there is a read access to the variable (MOV
instruction) followed by the interlocked instruction. There is a small
window between these two operations. Within this window another (write)
access to the variable can take place. When this happens, the following
InterlockedExchange call sets the variable back to its previous value, and
the other access is lost. This is the kind of synchronization
problem that is really hard to find, because the window is so small.
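
The lost update can be replayed deterministically in a single thread. This is only a sketch: C11 atomics stand in for the InterlockedXxx routines, the interleaving is forced by hand rather than produced by a real ISR, and the function names are hypothetical.

```c
#include <stdatomic.h>

/* Replays the bad interleaving of InterlockedExchange(&(val), (val)).
   Returns the final counter value. */
unsigned long replay_broken_get(unsigned long start)
{
    atomic_ulong c;
    atomic_init(&c, start);

    /* Step 1: the macro argument is evaluated with an ordinary read. */
    unsigned long stale = atomic_load(&c);

    /* Step 2: the ISR runs inside the window and increments. */
    atomic_fetch_add(&c, 1UL);

    /* Step 3: the exchange writes the stale value back. */
    atomic_exchange(&c, stale);

    return atomic_load(&c);
}

/* Same scenario with the add-zero read: the read is one atomic step,
   so the increment lands either before or after it and survives. */
unsigned long replay_safe_get(unsigned long start)
{
    atomic_ulong c;
    atomic_init(&c, start);

    (void)atomic_fetch_add(&c, 0UL);  /* interlocked read, writes nothing new */
    atomic_fetch_add(&c, 1UL);        /* ISR increment */

    return atomic_load(&c);
}
```

Starting from 5, the broken variant ends at 5 (the increment has vanished), while the safe variant ends at 6.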

The InterlockedExchangeAdd(&var,0) method is a safe way to implement
an InterlockedGet.
Unfortunately InterlockedExchangeAdd is not available in wdm.h and in
wdm.lib, as I have realized today :-(
The interesting thing is: It is available in wdm.h for alpha and IA64
but not for x86. Very strange...

Udo

--
Udo Eberhardt

argv1

Oct 20, 1999

The reason you don't need to do this is that the documentation should
probably read
"InterlockedXxx operations are atomic wrt other InterlockedXxx operations
AND ANY OTHER ATOMIC MEMORY OPERATION".
An aligned read/write to a DWORD is atomic on all the processors NT runs on
(or has run on), so you don't need an InterlockedRead or InterlockedWrite.
(On the Alpha, actually, an aligned 64-bit read/write is also atomic, but I
digress.)

"Udo Eberhardt" <Udo.Eb...@REMOVE-THISthesycon.de> wrote in message
news:380cb5db...@news.tu-ilmenau.de...

Jamie Hanrahan

Oct 20, 1999

I know what you are saying. The data might change just before or after a
non-interlocked read. This doesn't change for an "interlocked read", so
the InterlockedExchangeAdd(&var,0) trick looks unnecessary.

Nevertheless, you DO need to do it unless either (a) you have some other
"serializing" instruction in the stream, or (b) you don't care if your
fetch happens sooner (maybe a lot sooner) than you think it does.

Modern CPUs are permitted to perform "late writes" (doing writes later
than they appear in the instruction stream) and "early reads". In other
words, they do not guarantee "strong memory access ordering". (AFAIK, P6
cores are allowed to do this, though current implementations don't, yet.
Later Alphas definitely do it, as well as IA-64.)

The memory serializing mechanisms (the LOCK prefix on x86, MB on Alpha)
provide serializing points, before which reads can't be moved and after
which writes can't be moved.

(Every once in a while someone dusts off an old paper describing a
"lockless mutex" mechanism that lets you implement something like mutexes
or spinlocks on an MP system without an atomic test-and-set. On modern
CPUs it fails, for the above reason.)
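
In portable C today, the same concern is usually expressed with explicit memory orders. The following is a hedged sketch using C11 <stdatomic.h>, which postdates this thread. It is not what the DDK routines do internally, and publish and consume_or are hypothetical names.

```c
#include <stdatomic.h>

static int payload;        /* ordinary data, written before the flag */
static atomic_int ready;   /* flag; zero-initialized because it is static */

/* Writer: store the data, then raise the flag. */
void publish(int value)
{
    payload = value;
    /* release: the payload write may not be moved after this store */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: returns the payload if the flag is up, otherwise fallback. */
int consume_or(int fallback)
{
    /* acquire: the payload read may not be moved before this load */
    if (atomic_load_explicit(&ready, memory_order_acquire))
        return payload;
    return fallback;
}
```

Without the release/acquire pair, a weakly ordered CPU (or the compiler) would be free to perform the payload read early, which is the "fetch happens sooner than you think" case described above.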

--- Jamie Hanrahan, Kernel Mode Systems, San Diego CA
Windows NT/2000 driver consulting and training
http://www.kernel-mode.com/

Please post replies, followups, questions, etc., in news, not via e-mail.

Udo Eberhardt

Oct 20, 1999

On Wed, 20 Oct 1999 14:22:25 GMT, j...@cmkrnl.com (Jamie Hanrahan)
wrote:

>
>I know what you are saying. The data might change just before or after a
>non-interlocked read. This doesn't change for an "interlocked read", so
>the InterlockedExchangeAdd(&var,0) trick looks unnecessary.
>
>Nevertheless, you DO need to do it unless either (a) you have some other
>"serializing" instruction in the stream, or (b) you don't care if your
>fetch happens sooner (maybe a lot sooner) than you think it does.
>
>Modern CPUs are permitted to perform "late writes" (doing writes later
>than they appear in the instruction stream) and "early reads". In other
>words, they do not guarantee "strong memory access ordering". (AFAIK, P6
>cores are allowed to do this, though current implementations don't, yet.
>Later Alphas definitely do it, as well as IA-64.)
>
>The memory serializing mechanisms (the LOCK prefix on x86, MB on Alpha)
>provide serializing points, before which reads can't be moved and after
>which writes can't be moved.
>

-- snip

Yes, I think I need the LOCK prefix (or an equivalent on other CPUs).
And I get it only by using an InterlockedXxx function.
An exception to this rule would be possible if there is some clever
hardware that ensures coherence between caches, prefetch queues etc.
of multiple CPUs.

Udo

--
Udo Eberhardt

J. J. Farrell

Oct 21, 1999

In article <380b6b1f...@news.tu-ilmenau.de>,

Udo.Eb...@REMOVE-THISthesycon.de wrote:
>
> A MOV is atomic
> (proper alignment assumed). But is this true even on SMP systems? Does
> the hardware ensure atomic memory transactions? What about caching
> issues? Is the operand cache (internal and/or external) always
> consistent?
> As Walter Oney states in his posting, the InterlockedXxx functions
> include the LOCK prefix. Does such a LOCK operation cause an
> invalidate or flush of the corresponding cache line?

As long as the operand is aligned, everything is OK on systems
which abide by the MultiProcessor Specification - the vast
majority (if not all) of MP x86 systems which run NT. The LOCK
is often simulated by appropriate fiddling (a technical term)
with the cache state. If the operand is not aligned, even the
locked interaction is not guaranteed to work on an MPS system.
[ You can bet your bottom dollar that unaligned locks do work
in all shipping general-purpose NT systems; I worked on a
prototype system which was MPS compliant but didn't support
unaligned locks which crossed a cache-line boundary, and the
amount of software which failed on this system was remarkable! ]

> I know these issues are documented in the Pentium manuals. But I don't
> own this documentation. You mention the Pentium Pro OS manual. Where
> can I get this manual? Is it free?

Intel have reworked the manuals as IA-32 architecture manuals
with various supplements as necessary for the different processor
families. They are available as PDF files on Intel's web site
for free on-line viewing and download. You can also get printed
versions from Intel, but they probably cost money unless you're
likely to be buying a few million processors.

Sent via Deja.com http://www.deja.com/
Before you buy.

Rich Testardi

Oct 21, 1999

> >> >> >If I were worried about other processors (or unaligned operands), I'd use:
> >> >> >
> >> >> > x = InterlockedExchangeAdd(&y, 0);

Maybe I should have been more explicit about the operation I was protecting against.
For the *unaligned* memory case, let's assume we have a ULONG that crosses a cache
line boundary... The value in the ULONG is 0xffffffff

Processor 1                                     Processor 2
-----------                                     -----------
MOV ECX,[ULONG]
  (fetches cache line #1)
  fetches first two bytes of ULONG = 0xffff
                                                InterlockedIncrement(&ULONG);
                                                  (fetch #1 and #2 cache lines *private*)
                                                  read ULONG = 0xffffffff
                                                  increment = 0x00000000
                                                  write ULONG = 0x00000000
  (fetches cache line #2)
  fetches second two bytes of ULONG = 0x0000
  assemble ULONG = 0x0000ffff

In this case, processor #1 saw a number that was NEITHER the "before" NOR the "after"
value of the ULONG -- this is a bug... If Processor #1 had seen either 0xffffffff
or 0x00000000, I'd have considered the operation successful. But not 0x0000ffff.

If we replace the MOV with InterlockedExchangeAdd(&ULONG, 0), processor #1 gets
a good answer (either 0x00000000 or 0xffffffff), always.
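
The arithmetic of the torn read in the timeline above can be checked with a small sketch. It models a little-endian 32-bit load that is split into two 16-bit halves with a write landing in between. The helper name torn_read is hypothetical, and real hardware may split an unaligned access at a different byte boundary.

```c
#include <stdint.h>

/* Combine the low half as it was before the write with the
   high half as it is after the write. */
uint32_t torn_read(uint32_t before, uint32_t after)
{
    uint32_t low  = before & 0x0000ffffu;  /* first two bytes, fetched early */
    uint32_t high = after  & 0xffff0000u;  /* last two bytes, fetched late */
    return high | low;
}
```

With before = 0xffffffff and after = 0x00000000 this yields 0x0000ffff, a value the counter never actually held.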

> I know what you are saying. The data might change just before or after a
> non-interlocked read. This doesn't change for an "interlocked read", so
> the InterlockedExchangeAdd(&var,0) trick looks unnecessary.

It's not just a matter of the data changing "before" or "after" -- it's a
matter of "during"... Then you get 4 bytes which are not self consistent.
All algorithms can handle "the before or after case"; few (none?) can handle
the during case...

So, I think that you have to use InterlockedExchangeAdd() if you're talking
unaligned data (or for some weird processor types that don't guarantee atomicity
of the MOV instructions under certain circumstances, as above) and you want
4 self-consistent bytes (which, I believe was the intent of the original poster).
Of course, folks should not have unaligned data, but if you're dealing with a
legacy data structure, sometimes you can't help it...

Walter Oney

Oct 22, 1999

Rich Testardi wrote:
> It's not just a matter of the data changing "before" or "after" -- it's a
> matter of "during"... Then you get 4 bytes which are not self consistent.
> All algorithms can handle "the before or after case"; few (none?) can handle
> the during case...

Both the i486 and Pentium manuals say that "the integrity of the LOCK
prefix is not affected by the alignment of the memory field. Memory
locking is observed for arbitrarily misaligned fields." The manuals also
seem to imply that no other access to shared memory can occur during the
locked instruction.

Not being a CPU designer, I don't know how far to take these statements.
They seem to imply that you wouldn't need an InterlockedGet function
because two successive cache lines that contain a misaligned operand
would *both* be locked. Certainly none of the drivers I've looked at
worries about this.

So, is the doc wrong? Or just insular in that it applies only to Intel
chips? Or is it just that *all* of memory is locked by LOCK?

Rich Testardi

Oct 22, 1999

> > It's not just a matter of the data changing "before" or "after" -- it's a
> > matter of "during"... Then you get 4 bytes which are not self consistent.
> > All algorithms can handle "the before or after case"; few (none?) can handle
> > the during case...
>
> Both the i486 and Pentium manuals say that "the integrity of the LOCK
> prefix is not affected by the alignment of the memory field. Memory
> locking is observed for arbitrarily misaligned fields." The manuals also
> seem to imply that no other access to shared memory can occur during the
> locked instruction.

I think we were talking about the case of a MOV instruction. You're not
even allowed to prefix MOV with LOCK (invalid opcode exception). So, in
the case of a simple MOV, the operation is *not* atomic if the operand is
not aligned and crosses a cache line. That is the argument for needing
an InterlockedGet(), which you can implement with InterlockedExchangeAdd().

Of course, that is also the argument for not using unaligned operands!!!

> Not being a CPU designer, I don't know how far to take these statements.
> They seem to imply that you wouldn't need an InterlockedGet function
> because two successive cache lines that contain a misaligned operand
> would *both* be locked. Certainly none of the drivers I've looked at
> worries about this.

Two successive cache lines that contain an unaligned operand are ONLY
locked if you use a LOCK prefix. Since the compiler won't do this for
a normal C data reference (and, in fact, the CPU would raise an exception if it did),
the only way to get the desired behavior is with an InterlockedXxx()
call.

Yes, I agree no drivers I've ever seen worry about this -- but they all
use aligned data, which IS accessed atomically on x86 for a MOV operation.

-- Rich

Walter Oney

Oct 22, 1999

Rich Testardi wrote:
> I think we were talking about the case of a MOV instruction. You're not
> even allowed to prefix MOV with LOCK (invalid opcode exception). So, in
> the case of a simple MOV, the operation is *not* atomic if the operand is
> not aligned and crosses a cache line. That is the argument for needing
> an InterlockedGet(), which you can implement with InterlockedExchangeAdd().

The documentation implies to me that memory access is interlocked by
LOCK even by other people who are not using LOCK.

Gil Hamilton

Oct 22, 1999

On Fri, 22 Oct 1999 06:26:52 -0400, Walter Oney <walt...@oneysoft.com>
wrote:

> Rich Testardi wrote:
> > It's not just a matter of the data changing "before" or "after" -- it's a
> > matter of "during"... Then you get 4 bytes which are not self consistent.
> > All algorithms can handle "the before or after case"; few (none?) can handle
> > the during case...
>
> Both the i486 and Pentium manuals say that "the integrity of the LOCK
> prefix is not affected by the alignment of the memory field. Memory
> locking is observed for arbitrarily misaligned fields." The manuals also
> seem to imply that no other access to shared memory can occur during the
> locked instruction.

As you say, "during the locked instruction". But in this scenario, it's
the other way around: where the "other" (locked) access to the shared
memory occurs *during a non-locked instruction*.

Go back and study Rich's example again. The read is simply a
"mov ecx, [Some_ulong]"; there is no LOCK prefix on the instruction and
hence, it isn't protected. The lock is being asserted during the
InterlockedXxx instruction (usually an "xchg", "xadd" or "cmpxchg").

- GH

Udo Eberhardt

Oct 25, 1999

Thanks to all the contributors for the very interesting discussion.
I want to try to summarize it and, in doing so, arrive at an
answer to the original question I posted.

First of all, I want to assume that all DWORD variables are properly
aligned on DWORD boundaries. The unaligned case is somewhat more
complicated. The documentation states that the InterlockedXxx
functions work correctly only if the operands are aligned. Because
it is pretty easy to ensure proper alignment, we can ignore the
unaligned case. It is always dangerous to use misaligned DWORDs.
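
Checking the alignment assumption is plain address arithmetic. A minimal sketch follows, with hypothetical helper names. Addresses are passed as integers, and the cache line size is a parameter because it differs between processors.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of 4, i.e. DWORD aligned. */
int is_dword_aligned(uintptr_t addr)
{
    return (addr % sizeof(uint32_t)) == 0;
}

/* Nonzero if an operand of `size` bytes starting at `addr` spans two
   cache lines of `line` bytes. Assumes size >= 1 and line >= 1. */
int crosses_cache_line(uintptr_t addr, size_t size, size_t line)
{
    return (addr / line) != ((addr + size - 1) / line);
}
```

A DWORD-aligned DWORD never crosses a line boundary as long as the line size is a multiple of 4, which is why the aligned case avoids the split access described earlier in the thread.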

My original question was:
Does the following work properly, even on SMP systems?

static ULONG counter;

ISR:
InterlockedIncrement(&counter); // becomes: LOCK INC [counter]

Dispatch Function:
x = counter; // becomes for example: MOV ECX,[counter]

Does x always get a self-consistent DWORD value?

Based on the comments from this thread, I can state the following:

A MOV instruction on a DWORD that is DWORD aligned IS atomic on single
CPU systems and on multi CPU systems that conform to the MultiProcessor
Specification (MPS).

No further interlocking is required in the code shown above. I always
get a self-consistent value for x.

Due to the "late write" and "early read" features of modern CPUs, the fetch
of counter can happen earlier than its place in the instruction stream
suggests. So if I need synchronization points I have to use other
synchronization primitives (e.g. spin locks).

The DDK documentation of the InterlockedXxx functions should be read
as "InterlockedXxx operations are atomic with respect to other
InterlockedXxx operations AND ANY OTHER ATOMIC MEMORY OPERATION (like
a MOV)"

A portable way to implement the sample shown above would be
x = InterlockedExchangeAdd(&counter, 0);
This is a good trick. Maybe it generates slightly more overhead than
the simple MOV.
Unfortunately InterlockedExchangeAdd is not available in WDM.H (it is
in NTDDK.H). Who knows what the reason is...
Therefore in WDM drivers I would use the simple MOV.
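
For comparison, the "interlocked increment plus simple read" pattern can be sketched in C11 as follows. This is only an illustration: the names isr_tick and dispatch_read are hypothetical, and the relaxed load stands in for the plain MOV. On x86 compilers it normally becomes an ordinary aligned load, but that is a property of the compiler and target, not something the DDK guarantees.

```c
#include <stdatomic.h>
#include <stdint.h>

/* The compiler aligns this naturally; static storage starts at zero. */
static _Atomic uint32_t isr_counter;

/* ISR side: atomic read-modify-write, like InterlockedIncrement. */
void isr_tick(void)
{
    atomic_fetch_add_explicit(&isr_counter, 1u, memory_order_seq_cst);
}

/* Dispatch side: a single atomic load with no ordering guarantees.
   The value is self-consistent but may be slightly stale. */
uint32_t dispatch_read(void)
{
    return atomic_load_explicit(&isr_counter, memory_order_relaxed);
}
```

If the reader also needs the load to act as a synchronization point, the relaxed load would not be enough and a stronger order or a lock would be required.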

Does anyone disagree with one of the statements?

Thanks for your comments

Udo

--
Udo Eberhardt
