Let's say I've got a member variable (in a heavily threaded app):
private int _firstTime = 0;
To make changes to this is easy enough:
InterlockedExchange(ref _firstTime, 1);
... or I can use the InterlockedCompareExchange:
int oldValue = InterlockedCompareExchange(ref _firstTime, 1, 0);
No problems here.
... but, if I just want to read the value, I CANNOT do:
int myValue = _firstTime;
or
if (_fistTime==0)
Doing these requires a memory barrier of some type (volatile variable,
monitor, etc) to have any degree of reliability.
Now, reading the value out using InterlockedCompareExchange always seemed a
bit silly to me. I don't want to always write:
int myValue = InterlockedCompareExchange(ref _firstTime, Int32.MinValue,
Int32.MinValue);
I've never found (although not for lack of looking):
int myValue = InterlockedRead(ref _firstTime);
Of the code that I write, reading the value, and branching based on it's
value, is much more common than writing to the value.
I have, in the past, used volitile variables for this - but I've stopped due
to the various issues regarding volatiles (non atomic reads & writes).
Is there a concensus on the best way to do this? It seems like all the
solutions are.... incomplete.
Essentially I'm looking for a volitile varaiable that is atomically read and
written. Unfortuantly, these don't exist in any language I've yet used.
--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins
<snip>
> Essentially I'm looking for a volitile varaiable that is atomically
> read and written. Unfortuantly, these don't exist in any language
> I've yet used.
I'm not sure what you're after that a simple volatile *wouldn't* give
you here, given that *any* Int32 variable will be accessed atomically,
whether it's volatile or not. You can't do atomic increments without
something like Interlocked, but simple reads and writes are atomic.
--
Jon Skeet - <sk...@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
> [...]
> Of the code that I write, reading the value, and branching based on it's
> value, is much more common than writing to the value.
>
> I have, in the past, used volitile variables for this - but I've stopped
> due to the various issues regarding volatiles (non atomic reads &
> writes).
Jon's reply matches what I believed to be true. Can you explain why it is
you believe that a non-atomic read or write is possible for a 32-bit
value? If that's true, it's a pretty significant error in the way I've
approached multi-threaded programming. Which is not out of the question
by any means, but if that's the case I really want to know about it and
know why it's the case.
Pete
> I've got a quick question that's been bugging me for a long, long time:
>
> Let's say I've got a member variable (in a heavily threaded app):
> private int _firstTime = 0;
> Doing these requires a memory barrier of some type (volatile variable,
> monitor, etc) to have any degree of reliability.
Yes.
> I've never found (although not for lack of looking):
> int myValue = InterlockedRead(ref _firstTime);
Perhaps Thread.VolatileRead() is what you're looking for?
> I have, in the past, used volitile variables for this - but I've stopped due
> to the various issues regarding volatiles (non atomic reads & writes).
Can you expand on your objection to volatile variables?
> Is there a concensus on the best way to do this? It seems like all the
> solutions are.... incomplete.
>
> Essentially I'm looking for a volitile varaiable that is atomically read and
> written. Unfortuantly, these don't exist in any language I've yet used.
I'm not aware of any circumstances where normally aligned volatile
variables of the machine word size or smaller are read or written
non-atomically on the CLR. Can you explain more?
-- Barry
> Doing these requires a memory barrier of some type (volatile variable,
> monitor, etc) to have any degree of reliability.
>
I don't see the need for a memory barrier, unless you need memory
ordering guarantees. Volatile access would make sense since you really want
to read from memory. However, this only affects how the Jitter generates
code.
> Now, reading the value out using InterlockedCompareExchange always seemed
> a bit silly to me. I don't want to always write:
>
> int myValue = InterlockedCompareExchange(ref _firstTime, Int32.MinValue,
> Int32.MinValue);
This looks quite pointless. If you don't need memory ordering guarantees a
volatile read is good enough. I don't find the spec now, but I suspect
InterlockedCompareExchange has probably both acquire and release
semantics.
>
> I've never found (although not for lack of looking):
> int myValue = InterlockedRead(ref _firstTime);
>
This is just a memory fence plus a volatile read. And it's not clear that
you
actually need the former.
> Of the code that I write, reading the value, and branching based on it's
> value, is much more common than writing to the value.
>
Again volatile read is what you want.
> I have, in the past, used volitile variables for this - but I've stopped
> due to the various issues regarding volatiles (non atomic reads & writes).
>
What makes you believe that either reads or writes are non-atomic?
They certainly are at CLR instruction level.
> Is there a concensus on the best way to do this? It seems like all the
> solutions are.... incomplete.
For a one-time init thing you will probably need atomic compare
exchange for acquiring the construction lock and release semantics when
signaling successful construction.
If you need to wait for construction you'd probably have three states:
- uninitialized
- under_construction
- constructed
Checking whether the shared object is constructed is done by a volatile
read from the guard variable. If the result is "constructed" you're done.
Just a few cycles - maybe satisified from cache.
If not, use atomic compare exchange to acquire the "under-construction"
lock. If an other thread was faster wait. If not construct object and
use store with release semantics to signal object constructed.
I'm not sure if there is a really efficient Wait method (in this particular
context) in the CLR.
Anyway, the standard case (object already constructed) is very fast.
>
> Essentially I'm looking for a volitile varaiable that is atomically read
> and written. Unfortuantly, these don't exist in any language I've yet
> used.
>
I think the CLI standard requires that to a certain extent. That said, the
text is vague. Bottom line I think you can rely on reads or writes being
volatile, but not more complex operations that both read and write,
e.g.:
volatile int i;
do_something_with(i); // ok
i = some_value(); // ok
i++; // bad no guarantees here
But one might argue that this is really a language thing.
-hg
you can. The Interlocked write ensures this across all cpus and cache. Any
thread reading the var after the Interlocked write will see the correct var
so it is a memory barrier without a volatile modifier. All locks are
ultimatley built on interlocked at the lowest level to get this behavior.
So for reads, you can just go "if (var == 2){}". Naturally, this works for
a single var. If you have multiple invariants you need to protect, you
probabaly need a lock.
But surely that's only guaranteed if the read *actually* happens after
the Interlocked write occurs. If the variable isn't volatile, there's
nothing to stop the reading thread from having reordered the reads. For
instance:
int x=0, y=0;
...
Thread 1:
Interlocked.Increment (ref x, 1);
Interlocked.Increment (ref y, 1);
Thread 2:
int a = x;
int b = y;
The JIT *could* reorder the reads of x and y in thread 2, such that you
get a=0, b=1, despite the fact that x is incremented before y. That
wouldn't be the case with volatile variables.
At least, that's how I understand it.
Hi Jon. Not sure I totally follow you here. If Interlocked x happens, a
reader will never see x==0 (until it wraps). So a read on x or y is volatile
using an interlocked. Naturally, in either method, a reader could read x and
y after x was incremented, but before y was incremented because there is no
lock here. But this is a different problem (i.e race). If we need both vars
to be atomic in respect to each other and handled as an invariant "set",
then we need a lock of some type protecting the set. As I said, maybe I am
not following your intent and will slap my head.
--wjs
I still believe that it can read y before either of them are
incremented, and x after both of them are incremented. There's nothing
to stop the JIT from reordering the reads to read y before x, because
they're not volatile. I wouldn't be surprised to learn that the
interlocked operation makes sure that when it *does* try to read x and
y, those values are really the latest ones, but I don't think there's
any guarantee about the order of those reads unless at least one of
them is volatile. (It's nearly midnight, so I can't be bothered to work
out which it should be - y, I think.)
I definitely *wasn't* talking about the atomicity of the two operations
- just the ordering.
Like so:
int b = y;
int a = x;
? tia
--
William Stacey [C# MVP]
"Jon Skeet [C# MVP]" <sk...@pobox.com> wrote in message
news:MPG.20a86b0...@msnews.microsoft.com...
> So are you saying the JIT can reorder these two lines:
> int a = x;
> int b = y;
>
> Like so:
> int b = y;
> int a = x;
Yes, if x and y are not volatile locations. The language of the spec is
that reads and writes to volatile locations are the only observable side
effects of memory access; memory accesses may be arbitrarily reordered
if they aren't volatile. For example, if the value of y has been loaded
to a register because it was used somewhere before the line 'int a =
x;', then the value of y used in the line 'int b = y;' may predate the
actual in-memory value of x, because the value may be used from the
register, rather than freshly loading it from memory. So, volatile can
inhibit these kinds of optimizations (avoiding redundant loads).
There's more to it than that, volatile reads have acquire semantics,
volatile writes release semantics. That has stronger guarantees than
just inhibiting optimization, as it affects e.g. write buffer reordering
in some processors.
Exactly, so long as there aren't any memory barriers in the way. Memory
barriers (which include operations with volatile locations) stop the
JIT from moving memory operations beyond the barrier.
It's unfortunate that the CLI spec isn't nearly as clear as it might be
about all this, but Vance Morrison wrote a great MSDN article about the
.NET 2.0 model (which is stricter than the CLI spec):
http://msdn.microsoft.com/msdnmag/issues/05/10/MemoryModels/
Here he states the compiler will not re-order volatile variable access,
However these operations still *could be re-ordered by the processor. In
fact, he shows correcting this issue with an Interlocked operation. This
article also state that Interlocked functions *ensure appropriat barriers
for memory ordering. So (at least for this example) it would seem
Interlocked provides as good or better symatics as the volatile example (and
also prevents CPU re-order). However, I am not sure any this matters for
this example, as in both cases you still have the race issue. R1 can read
x, then W1 writes x and y, then R1 reads y - or various combinations of
same. So some form of critical section around both x and y would seem to be
the proper behavior, unless we don't care about x or y in respect to each
other (i.e. simple counters that are not treated as a set).
--
William Stacey [C# MVP]
PCR concurrency library: www.codeplex.com/pcr
PSH Scripts Project www.codeplex.com/psobject
"Jon Skeet [C# MVP]" <sk...@pobox.com> wrote in message
news:MPG.20a8e11...@msnews.microsoft.com...
> There's more to it than that, volatile reads have acquire semantics,
> volatile writes release semantics. That has stronger guarantees than
> just inhibiting optimization, as it affects e.g. write buffer reordering
> in some processors.
I see that in CLI spec, but have you actually verified it? It is not
what I would expect and not what I have seen.
Oddly, I see that the Jitter even removes dead loads altogher.
Anyway, given the rationale in the CLI spec (hardware
register access), I don't see how this could possibly work in
a reasonably cheap way for the P4 memory model.
-hg
Hmm. That sounds like it's a broken implementation then. Basically, if
the .NET CLR executes code in a way which violates the spec, the
implementation is flawed. If the processor could reorder things, the
CLR should make sure that that is invisible to the user, beyond what is
possible within the CLI spec.
I'd be interested to hear Joe Duffy's opinion on that part of the
article.
> In fact, he shows correcting this issue with an Interlocked operation. This
> article also state that Interlocked functions *ensure appropriat barriers
> for memory ordering. So (at least for this example) it would seem
> Interlocked provides as good or better symatics as the volatile example (and
> also prevents CPU re-order).
No - because only the thread which *calls* the Interlocked method knows
that interlocked is involved, whereas with volatile both the reading
thread *and* the writing thread know to use memory barriers. Of course,
if you use Interlocked in both threads, to both read *and* write the
value, then everything will be okay.
> However, I am not sure any this matters for
> this example, as in both cases you still have the race issue. R1 can read
> x, then W1 writes x and y, then R1 reads y - or various combinations of
> same. So some form of critical section around both x and y would seem to be
> the proper behavior, unless we don't care about x or y in respect to each
> other (i.e. simple counters that are not treated as a set).
If we care about atomicity, there needs to be some kind of locking.
If we only care that the change to y is seen after the change to x
(i.e. you can't see x=0, y=1) then volatile will do the job but I don't
believe that changing the variable with Interlocked and then reading it
directly in another thread is guaranteed to work.
> Here is some more food for the party:
> http://msdn2.microsoft.com/en-us/library/ms686355.aspx
>
> Here he states the compiler will not re-order volatile variable access,
> However these operations still *could be re-ordered by the processor.
Volatile read has acquire semantics. Loads won't move before an acquire,
similarly, writes won't move after a release. These should be
implemented by CPU instructions, as explained in the article you linked
to, and aren't restricted to just compiler optimizations. This means
that volatile read on .NET is sufficient, because the CPU won't reorder
such that the second load moves before the first load, because of the
acquire semantics (aka read fence).
> In
> fact, he shows correcting this issue with an Interlocked operation. This
> article also state that Interlocked functions *ensure appropriat barriers
> for memory ordering.
The have full fence semantics (except the Interlocked*Acquire and
Interlocked*Release).
> So (at least for this example) it would seem
> Interlocked provides as good or better symatics as the volatile example (and
> also prevents CPU re-order).
Volatile also prevents CPU re-order, but asymmetrically: read causes
acquire, write causes release. The write side release means that e.g.
"publishing" a new instance to a static volatile variable won't have a
problem such that some fields in the instance aren't fully retired
before other threads could read the instance reference - writes won't
move ahead of the release / write fence.
I would like to see Joe comment on that as well. If that article is
correct, it sounds like "volatile" may not work as people expect it to be
working.
| No - because only the thread which *calls* the Interlocked method knows
| that interlocked is involved, whereas with volatile both the reading
| thread *and* the writing thread know to use memory barriers. Of course,
| if you use Interlocked in both threads, to both read *and* write the
| value, then everything will be okay.
The reader does not need to know as it gets the read fence automatically (at
the hardware layer) because the Interlocked buts up a *full memory barrier
fence so reads can not be reordered at the instruction level. vista has
some other versions with seperate read aquire/write aquire methods, but the
ones in the framework use the full fence. I am not sure how volatile is
implemented under the covers, but it would not suprise me if it also uses an
interlocked operation.
| If we care about atomicity, there needs to be some kind of locking.
| If we only care that the change to y is seen after the change to x
| (i.e. you can't see x=0, y=1) then volatile will do the job but I don't
| believe that changing the variable with Interlocked and then reading it
| directly in another thread is guaranteed to work.
hmm. We have seem to cover the same ground. Everything I have seen (more
then that link) says it does. I would be interested if you find something
to contrary as that would be good info. Also, most (if not all) locks are
based on Interlocked in their guts, so if it does not work, then everything
in the history of the world is broke. Ok, that is bit drastic... tia
--
William Stacey [C# MVP]
"Barry Kelly" <barry....@gmail.com> wrote in message
news:t4nu339s34tpgovpb...@4ax.com...
--
William Stacey [C# MVP]
"William Stacey [C# MVP]" <william...@gmail.com> wrote in message
news:epyqE0Qk...@TK2MSFTNGP03.phx.gbl...
I posted correction in thread above. It seems the issue with volatile was
with vs2003, not vs2005.
--wjs
Well, not just how people expect it to work, but how it's *specified*
to work.
I've now seen your post about it basically being a potential bug in the
.NET 1.1 CLR.
> | No - because only the thread which *calls* the Interlocked method knows
> | that interlocked is involved, whereas with volatile both the reading
> | thread *and* the writing thread know to use memory barriers. Of course,
> | if you use Interlocked in both threads, to both read *and* write the
> | value, then everything will be okay.
>
> The reader does not need to know as it gets the read fence automatically (at
> the hardware layer) because the Interlocked buts up a *full memory barrier
> fence so reads can not be reordered at the instruction level.
But the read could *already* have been reordered by the JIT, ages
before this occurs. How could the JIT know whether a variable would
*ever* be used with Interlocked, and avoid reordering?
> vista has
> some other versions with seperate read aquire/write aquire methods, but the
> ones in the framework use the full fence. I am not sure how volatile is
> implemented under the covers, but it would not suprise me if it also uses an
> interlocked operation.
But there's more to it than that - there's the restriction on JIT
reordering, and *that* can't be known about ahead of time.
> | If we care about atomicity, there needs to be some kind of locking.
> | If we only care that the change to y is seen after the change to x
> | (i.e. you can't see x=0, y=1) then volatile will do the job but I don't
> | believe that changing the variable with Interlocked and then reading it
> | directly in another thread is guaranteed to work.
>
> hmm. We have seem to cover the same ground. Everything I have seen (more
> then that link) says it does. I would be interested if you find something
> to contrary as that would be good info. Also, most (if not all) locks are
> based on Interlocked in their guts, so if it does not work, then everything
> in the history of the world is broke. Ok, that is bit drastic... tia
I'm not saying that Interlocked doesn't work. I'm saying that it only
works if you use it consistently across all threads.
I'll try to get round to mailing Joe about it...
I assume it knows the same way it knows a var is volatile - by inspecting
the code as it compiles and keeping some state if it sees a var wrapped with
interlocked or volatile, it does not reordering. Not sure. I am not sure
when and where is actually does these optimizations. I think I read
something about tight local loops, but not sure. It would be interesting to
see a good article on the subject.
| I'll try to get round to mailing Joe about it...
Thanks. That would be great to see what he says.
--
William Stacey [C# MVP]
When the JIT has to create native code for one method, it might not
even have loaded all the assemblies which will manipulate the same
variables as that method does. I can't see that it could *possibly*
know whether or not some code, somewhere, is going to use it with the
Interlocked class.
Compare that with finding out whether or not the variable is volatile -
you look in a single place, and that's it.
> | I'll try to get round to mailing Joe about it...
>
> Thanks. That would be great to see what he says.
Yeah. This relies on me finding the time, of course :)
<snip>
> > | I'll try to get round to mailing Joe about it...
> >
> > Thanks. That would be great to see what he says.
>
> Yeah. This relies on me finding the time, of course :)
I've found the time:
http://msmvps.com/blogs/jon.skeet/archive/2007/05/09/non-volatile-
reads-and-interlocked-and-how-they-interact.aspx
All comments welcome. I'm sorry that it's clearly written from my
viewpoint rather than that of an impartial observer - I hope I haven't
misrepresented any views. I'm happy to edit where appropriate.
5. Explicit atomic operations... These operations (e.g. Increment,
Decrement, Exchange, and CompareExchange) perform implicit acquire/release
operations. ..."
So by my read, this contract would require the CLI to respect the same
optimization rules as volatile. System.Thread.MemoryBarrier would seem to
have to respect proper order too even if the vars are not decorated with
volatile. Do you read it different? Thanks.
--
William Stacey [C# MVP]
"Jon Skeet [C# MVP]" <sk...@pobox.com> wrote in message
news:MPG.20abb72...@msnews.microsoft.com...
The operations on Interlocked *do* perform implicit acquire/release
operations - but the *reading* operations don't. That's the problem.
To put it an alternative way, it is (in some ways - not all) a bit like
acquiring a lock while writing a value, but not acquiring the same lock
when reading. Yes, you've made sure that the data is available to be
read with the memory barrier on the writing thread, but you haven't
made sure that the reading thread actually performs the read when you
expect it to.
Basically, both threads need a memory barrier in order to be effective.
Using Interlocked puts a memory barrier on the calling thread, but
that's all - IMO :)
Hi Jon. Two issues here I think:
1) Interlocked and barrier. Interlocked does provide a full memory barrier.
At least from the docs, this is clear.
2) Does it prevent reads from being reordered by the CLI? This is not as
clear. It seems that if it implicitly expresses volatile read/write as it
claims in ECMA, then it would also prevent such optimizations. Same way you
would expect Thread.MemoryBarrier() to do. IIRC, these CLI optimization
rules must be very localized and are very strict. They can't just do them
all over the place. They can only do them them when they *know there can be
no side effects.
http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx
"We've spent a lot of time thinking about what the correct memory model for
the CLR should be. If I had to guess, we're going to switch from the ECMA
model to the following model. I think that we will try to persuade other
CLI implementations to adopt this same model, and that we will try to change
the ECMA specification to reflect this.
1.. Memory ordering only applies to locations which can be globally
visible or locations that are marked volatile. Any locals that are not
address exposed can be optimized without using memory ordering as a
constraint since these locations cannot be touched by multiple threads in
parallel.
2.. Non-volatile loads can be reordered freely.
3.. Every store (regardless of volatile marking) is considered a release.
4.. Volatile loads are considered acquire.
5.. Device oriented software may need special programmer care. Volatile
stores are still required for any access of device memory. This is
typically not a concern for the managed developer."
This was wrote back in May 2003, and don't know if this made it into 2.0 or
latter or at all. But he seems to touch on this very issue we are talking
about. Order changes can only be applied to locals that are not address
exposed and do not have some volatile symatics around them. As interlocked
and MemoryBarrier has implicit volatile contract, I would assume CLI
inspection would have to honor those same as with volatile keyword. Maybe
not. Would be nice to get a clearer writing on this.
I think we're starting to lose some context. Here's your original reply
to Chris:
William wrote:
> Chris wrote:
> | ... but, if I just want to read the value, I CANNOT do:
> |
> | int myValue = _firstTime;
> | or
> | if (_fistTime==0)
> |
> | Doing these requires a memory barrier of some type (volatile variable,
> | monitor, etc) to have any degree of reliability.
>
> you can. The Interlocked write ensures this across all cpus and cache.
This is the basic point that Jon (and I, and the standard, and Chris
Brumme's post that you quote) disagree with you on.
An interlocked operation is a full fence (read and write barrier, both
acquire and release) on the thread that it was *executed* on. It doesn't
affect reads from *other* threads.
> | Basically, both threads need a memory barrier in order to be effective.
> | Using Interlocked puts a memory barrier on the calling thread, but
> | that's all - IMO :)
>
> Hi Jon. Two issues here I think:
> 1) Interlocked and barrier. Interlocked does provide a full memory barrier.
> At least from the docs, this is clear.
> 2) Does it prevent reads from being reordered by the CLI? This is not as
> clear.
There is no way it could perform its advertised purpose if it didn't
prevent related reads moving before it. The CLI needs to pass the
location of the target to the interlocked operation, so it can't have
related values cached in (e.g.) a register or it would be plainly faulty
with only a *single* thread of operation. Since that discounts compiler
optimizations that would move related reads before the interlocked
operation, the only thing left is CPU semantics - and that's what the
Interlocked operation guarantees, it guarantees a full fence for the
calling thread.
> http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx :
> [extra quoted for clarity -- Barry]
> > "We've spent a lot of time thinking about what the correct memory model for
> > the CLR should be. If I had to guess, we're going to switch from the ECMA
> > model to the following model. I think that we will try to persuade other
> > CLI implementations to adopt this same model, and that we will try to change
> > the ECMA specification to reflect this.
> >
> > 1.. Memory ordering only applies to locations which can be globally
> > visible or locations that are marked volatile. Any locals that are not
> > address exposed can be optimized without using memory ordering as a
> > constraint since these locations cannot be touched by multiple threads in
> > parallel.
> > 2.. Non-volatile loads can be reordered freely.
> > 3.. Every store (regardless of volatile marking) is considered a release.
> > 4.. Volatile loads are considered acquire.
> > 5.. Device oriented software may need special programmer care. Volatile
> > stores are still required for any access of device memory. This is
> > typically not a concern for the managed developer."
> This was wrote back in May 2003, and don't know if this made it into 2.0 or
> latter or at all. But he seems to touch on this very issue we are talking
> about.
Interlocked operations pass the address (reference to the location), so
they are "address exposed" and therefore the read can't move before the
interlocked operation. Volatile semantics aren't required for this; in
fact, not even threads are required for this. An optimization that moved
the read before the interlocked call (or any other call by reference) is
wrong in every case.
> Order changes can only be applied to locals that are not address
> exposed and do not have some volatile symatics around them.
That's not what Chris Brumme wrote. Instead, he writes:
> Chris Brumme (in post above) wrote:
> > 1.. Memory ordering only applies to locations which can be globally
> > visible or locations that are marked volatile. Any locals that are not
> > address exposed can be optimized without using memory ordering as a
> > constraint since these locations cannot be touched by multiple threads in
> > parallel.
That is to say, order changes are irrelevant (and in fact, can't even be
detected, if the compiler has no bugs) for locals that are not address
exposed, since only a single thread of execution can ever see their
values. That is to say, he says almost exactly the *opposite* of what
you said - that order changes can *only* be detected, are *only*
relevant, for locations that are globally visible, address exposed or
marked volatile.
> As interlocked
> and MemoryBarrier has implicit volatile contract, I would assume CLI
> inspection would have to honor those same as with volatile keyword. Maybe
> not. Would be nice to get a clearer writing on this.
All this is irrelevant to your original point at the root of the thread
though, in that, contrary to what you claim, a memory barrier is still
required for reads from threads *other* than the one that performed the
interlocked operation, if you're going to guarantee that you read what's
really there and not some stale value.
Hi Barry. I address this part first as it is key to everything else. There
*is a full fence on interlocked methods (save the newer win32 ones like
InterlockedIncrementAcquire that provide acquire and release semantics). It
is *gaureenteed* by the OS and hw. It is documented in many places (see
some below). There can never be stale data, as hw ensures this across all
cpus and caches. After an Interlock completes, any and all threads will see
only the last data. This is one reason why it is a relatively expensive
call. Where is it documented there is not a full barrier?
Here is some linq:
http://msdn2.microsoft.com/en-us/library/ms684122.aspx
http://www.gamasutra.com/features/20060630/paquet_01.shtml
http://blogs.msdn.com/oldnewthing/archive/2004/05/28/143769.aspx
http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx
There *is* a full barrier, *within that thread*. Not within all others.
Any thread which actually *performs* a read after the Interlock
completes will indeed see the new value. The point is that with
reordering in the reading thread, there's no guarantee that it will
make any attempt to read the value after the Interlock completes - it
could have read it before the Interlocked operation executes, even if a
value is later read which looks like it should have been read
beforehand. Suppose the source code is:
Thread 1
int b = y;
int a = x;
Thread 2
Interlocked.Increment (ref x);
Interlocked.Increment (ref y);
The first thread's code can be reordered by the JIT, and then executed
as:
Thread 1 Thread 2
int a = x;
Interlocked.Increment (ref x);
Interlocked.Increment (ref y);
int b = y;
Note the reordering. The guarantee that threads reading after
Interlocked.Increment see the new values is still intact - it's just
that x has been read before y, which appears to be wrong according to
the original source code.
This is because there aren't any memory barriers in the reading thread.
Could you point to where in the ECMA spec it says anything about a
memory barrier on one thread affecting what is possible in terms of
reordering on another thread?
It's important to understand that this *isn't* just a case of hardware
caches and memory fences etc - the JIT itself can reorder the reads,
which is what I've been trying to show above.
> Here is some linq:
> http://msdn2.microsoft.com/en-us/library/ms684122.aspx
> http://www.gamasutra.com/features/20060630/paquet_01.shtml
> http://blogs.msdn.com/oldnewthing/archive/2004/05/28/143769.aspx
> http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx
I don't see anything in those pages to prevent the JIT from reordering
the reads on the reading thread. Half of them aren't even about .NET,
which means they won't be addressing the JIT-reordering issue at all.
I haven't heard back from Joe yet - I might mail Chris Brumme to see if
he'd care to comment.
What would be the point of a full barrier if it did not work across threads?
You would not need it for a single thread. The full barrier ensures the cpu
can not reorder reads and all reads will see the last write. But, I think
your talking about reorder by the CLI as below.
I understand your point. I am playing devils advocate with myself trying to
find proof either way. But, by my read (reading thru the sparse lines) the
JIT is *not allowed to make this optimization here. It can only reorder
when:
1) There is no volatile
2) The var is not address exposed (e.g. local).
3) There is no other explicit or implicit memory barriers protecting the var
(i.e. Thread.MemoryBarrier(), Interlocked, Monitor, etc)
So this would fail test #3 and not allow reorder by JIT. The ecma I posted
hints at this, but is far from detailed on this. Chris B talks more about
this, but don't know when/if he implemented it in the CLR. Naturally, I
could be nuts. It would not be the first time. Thank you.
<snip>
> 2.. Non-volatile loads can be reordered freely.
And that's the absolutely key one. As far as the reading thread is
concerned, it's performing non-volatile loads. It doesn't know that an
Interlocked operation will occur anywhere else, and it's not loading a
volatile variable - therefore it's a non-volatile load, and the
operations can be reordered as shown in the post I've just made.
<snip>
The same point as locking has. Both threads need to try to acquire a
lock before it becomes particularly useful, but I don't think you'd
argue that it *isn't* useful.
> You would not need it for a single thread. The full barrier ensures the cpu
> can not reorder reads and all reads will see the last write.
... if they perform a genuine read *after* that point, yes.
> But, I think your talking about reorder by the CLI as below.
Yes - although from the memory model point of view (which doesn't know
about CPU fences) it's irrelevant. The memory model makes certain
guarantees without saying whether the optimisations which can be
performed are due to the JIT or the CPU.
<snip>
> | Note the reordering. The guarantee that threads reading after
> | Interlocked.Increment see the new values is still intact - it's just
> | that x has been read before y, which appears to be wrong according to
> | the original source code.
> |
> | This is because there aren't any memory barriers in the reading thread.
>
> I understand your point. I am playing devils advocate with myself trying to
> find proof either way. But, by my read (reading thru the sparse lines) the
> JIT is *not allowed to make this optimization here. It can only reorder
> when:
> 1) There is no volatile
> 2) The var is not address exposed (e.g. local).
> 3) There is no other explicit or implicit memory barriers protecting the var
> (i.e. Thread.MemoryBarrier(), Interlocked, Monitor, etc)
You seem to be treating these as an "and" whereas it's not - otherwise
you could never have reordering of *any* non-local reads.
I agree with 1. I believe you're misreading the spec on point 2 - it's
not a condition, it's just that *any* reads/writes can be reordered.
> So this would fail test #3 and not allow reorder by JIT. The ecma I posted
> hints at this, but is far from detailed on this. Chris B talks more about
> this, but don't know when/if he implemented it in the CLR. Naturally, I
> could be nuts. It would not be the first time. Thank you.
I don't see where it hints at this at all. Out of the 5 rules you
mentioned in the spec, the important one is this: "Non-volatile loads
can be reordered freely."
The statement:
int a = x;
is *not* a volatile load when x isn't a volatile variable. Just because
some other thread may or may not perform an interlocked operation
doesn't make loading it volatile. I think that's the crux of our
disagreement.
So, having reduced it to this (if you agree with me on that much) can
you provide any indication that accessing a variable at any point on
any thread with an Interlocked operation makes every load of that
variable on every thread count as being volatile? I haven't seen any
evidence of that.
It is a full memory barrier. That means that the new value is actually in
RAM, coherent across all processors (some architectures might choose write
snooping instead of invalidating caches in other cores) before the
Interlocked call returns. This falls far short of forcing other threads to
"see the last data", because another thread could have cached the value in a
CPU register (hoisting a complex expression outside a loop, anyone?).
Access to the memory address is guaranteed to yield the new value, but only
volatile variables result in memory being addressed on every use of the
variable name. Otherwise the compiler can perform alias analysis, determine
that the reading thread can't change the variable in the interim, and thus
uses the cached value.
I figured I would write a little program and see if I can make the problem
show up.
The sample app I wrote below does, every once in a while, go "Boom".
This means the memory model is indeed reordering the reads of test and
test2.
My environment for running this test is:
Dual-Core AMD Athlon
Windows XP x64
Compiled in Release Mode, using "Any CPU"
Running without a debugger attached.
I've run the test using "Any CPU", "x86" and "x64" as specific compile
targets. Same result each time.
Sometimes it runs for a long time without failing, other times I get a ton
of failures. For example:
...
Boom : 287884447, 287884446
Boom : 287885420, 287885419
Boom : 287885780, 287885779
Boom : 287886016, 287886015
Boom : 287887440, 287887439
Boom : 287887798, 287887797
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading;
namespace ConsoleApplication4
{
class Program
{
static long test = 0;
static long test2 = 0;
static void Main(string[] args)
{
Thread t1 = new Thread(ReadThreadProc);
Thread t2 = new Thread(ReadThreadProc);
Thread t3 = new Thread(WriteThreadProc);
t1.IsBackground = true;
t2.IsBackground = true;
t3.IsBackground = true;
t1.Start();
t2.Start();
t3.Start();
Console.WriteLine("Press any key to exit");
Console.ReadLine();
}
private static void WriteThreadProc()
{
while (true)
{
Interlocked.Increment(ref test);
Interlocked.Increment(ref test2);
}
}
private static void ReadThreadProc()
{
while (true)
{
long v1 = test;
long v2 = test2;
if (v1 > v2)
Console.WriteLine("Boom : {0}, {1}", v1, v2);
}
}
}
}
--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins
Please ignore that last post...
--
Chris Mullins, MCSD.NET, MCPD:Enterprise, Microsoft C# MVP
http://www.coversant.com/blogs/cmullins
"Chris Mullins [MVP]" <cmul...@yahoo.com> wrote in message
news:eJAdpbAl...@TK2MSFTNGP03.phx.gbl...
If it's any consolation, I made the exact same mistake when I first
made the blog post :)
One of the posts in this very interesting threat referred to this article :
http://msdn2.microsoft.com:80/en-us/library/ms686355.aspx
This article includes the following piece of code (to fix the race
condition):
volatile int iValue;
volatile BOOL fValueHasBeenComputed = FALSE;
extern int ComputeValue();
void CacheComputedValue()
{
if (!fValueHasBeenComputed)
{
iValue = ComputeValue();
fValueHasBeenComputed = TRUE;
}
}
BOOL FetchComputedValue(int *piResult)
{
if (fValueHasBeenComputed)
{
*piResult = iValue;
return TRUE;
}
else return FALSE;
}
My (complete ignorent) question is this:
Must fValueHasBeenComputed really be declared volatile for the program to
produce correct results ? Isn't it true that , because iValue is declared
volatile, it is impossible that fValueHasBeenComputed = TRUE and the result
of ComputeValue() is not stored in iValue ?
Of course, because fValueHasBeenComputed is not declared volatile it may be
that FetchComputedValue() returns FALSE when in fact iValue does contain the
computed value but the other case can never occur. E.g. FetchComputedValue
returns TRUE and piResult is incorrect.
Are my assumptions correct ?
Jos
> | There *is* a full barrier, *within that thread*. Not within all others.
>
> What would be the point of a full barrier if it did not work across threads?
A full barrier is a combination of both a read barrier and a write
barrier.
A write barrier makes sure that all pending writes are retired, and is
useful just before publishing a value, so that when it is read on a
different thread / CPU, there are no pending writes that could catch any
readers by surprise if they were to end up being retired a little later
(out of order). The hardware-level concept of a write barrier affects
only the CPU executing the write barrier, ensuring that all pending
writes are retired before the publishing the final value. It's an
ordering primitive for that CPU. Yes it does affect other threads, but
only in so far as what they can observe from memory, in conjunction with
the *other* half of the equation - a read barrier.
A read barrier makes sure that no reads / prefetches / what have you
have snuck before retrieve a particular value from a location, a value
that you expect to have been published by another thread. The read
barrier only affects the CPU performing the read. It's directly
complementary to a write barrier on the publishing thread, it ensures
that you read what you think you read, and not what the CPU inferred you
we're going to indirectly read by (e.g.) sneaking ahead and peeking at
the instruction stream.
Read and write barriers are complementary. Write barriers stop writes
moving forward in time, while read barriers stop reads moving backwards
in time. If you don't have balanced write and read barriers on threads
that are communicating via shared memory without locks, then their reads
and writes will overlap, and consequently linear, intuitive, von Neumann
expectations of invariants will be broken [unless of course, a finite
state machine model for all possible combinations have been worked out
and tested, see e.g. Cliff Click's work on a lock-free hash table for
Java:
http://blogs.azulsystems.com/cliff/2007/05/hardware_visibi.html
.]
Again, a read barrier only affects the thread / CPU that it's executed
on. It doesn't affect any writers on other threads. And read barriers of
some kind (even the implicit ones with 'lock') aren't optional even when
the writing thread is using volatile / interlocked / write barriers /
other mechanisms.
> You would not need it for a single thread.
Barriers aren't needed for single threads because then there's only one
thread context, one CPU, and the OS would have responsibility for any
read / write barriers before context switching into / out of your
application's thread of control, since there's no concurrency possible
with only a single thread.
> The full barrier ensures the cpu
> can not reorder reads and all reads will see the last write. But, I think
> your talking about reorder by the CLI as below.
Memory barriers / fences only affect ordering of reads and writes of the
thread / CPU they were executed on, though.
> I understand your point. I am playing devils advocate with myself trying to
> find proof either way. But, by my read (reading thru the sparse lines) the
> JIT is *not allowed to make this optimization here. It can only reorder
> when:
> 1) There is no volatile
> 2) The var is not address exposed (e.g. local).
>> 3) There is no other explicit or implicit memory barriers protecting the var
> (i.e. Thread.MemoryBarrier(), Interlocked, Monitor, etc)
I believe you're misreading something big. Quoting from Chris Brumme's
post:
http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx:
> 1. Memory ordering only applies to locations which
> can be globally visible or locations that are marked
> volatile. Any locals that are not address exposed can be
> optimized without using memory ordering as a constraint
> since these locations cannot be touched by multiple threads
> in parallel.
This implies that loads and stores can be reordered freely when its
non-exposed locals being dealt with. ("Reordered freely" is always
subject to correctness on a single-threaded basis, of course, so there
are limits to this reordering.)
> 2. Non-volatile loads can be reordered freely.
Now here's something that you seem to have missed: any and all reads
(loads) that aren't volatile can be reordered *freely*, even though the
location being loaded *isn't* local.
> 3. Every store (regardless of volatile marking) is considered a release.
This was an upgrade in the CLR's memory model over the ECMA memory
model, IIRC, though Intel x86 architecture keeps it ok, but it did need
specifying for IA64. It implies that a write barrier isn't required when
publishing a value, because the CLR (but not ECMA) will handle that
situation automatically.
> 4. Volatile loads are considered acquire.
And by contrast, from point 2, non-volatile loads are not considered
acquire, so they can freely move backwards in time. This is what makes
your original post to the thread incorrect.
> 5. Device oriented software may need special programmer care.
I've little to say about devices in CLR :)
> So this would fail test #3 and not allow reorder by JIT.
I believe the rules you write are mistaken and not supported by the
standards or links you post, and in any case, interlocked operations
only affect a *single* thread, not all threads.
"Chris Mullins [MVP]" <cmul...@yahoo.com> wrote in message
news:unrnxiAl...@TK2MSFTNGP05.phx.gbl...
The reads are done in the same order as the increments, which means
that even without any reordering, it's quite possible to get any of the
four possibilities.
I think so, but it wouldn't be very useful. I assume that if
FetchComputedValue() returns FALSE, you would sleep and try again? Only,
the compiler is allowed to determine that FetchComputedValue doesn't depend
on any volatile variables, so it optimizes by hoisting it outside the loop,
and turns this:
int buf;
while (!FetchComputedValue(&buf))
Sleep();
into
int buf;
begin:
BOOL result = FetchComputedValue(&buf);
if (!result) goto end;
Sleep();
goto begin;
end:
into
int buf;
BOOL result = FetchComputedValue(&buf);
if (!result) goto end;
begin:
Sleep();
goto begin;
end:
or
int buf;
if (!FetchComputedValue(&buf)) {
while (true)
Sleep();
}
Whole program optimization (cross-module inlining) makes this a virtual
certainty.
>
> Jos
>
Thanks Barry. Is this documented somewhere? An Interlocked would hardly be
useful if that was the case. They effect other threads by virtue of their
atomic nature *and the barrier they impose. In the case of a shared
variable, only changed in CPU cache, the execution of the normal interlocked
instruction make this var (and all cache memory of current CPU) be
committed into physical memory *before executing the interlocked
instruction. So other threads will "see" changes from interlocked
operations after seeing changes in another shared variable (i.e. release
semantics).
For example:
a++;
InterlockedIncrement (ref b);
c++;
Other threads will only see changes in "a" prior to changes in "b", and will
see changes in "c" after changes in b. Critical section lock algorithms,
such as the "Peterson's algorithm" rely on this behavior or would not work
on SMP hardware. Maybe we are saying the same thing, and I am just not
understanding your intent.
No, interlocked is useful in the same way that locking is useful,
despite it being a co-operative thing.
> They effect other threads by virtue of their
> atomic nature *and the barrier they impose. In the case of a shared
> variable, only changed in CPU cache, the execution of the normal interlocked
> instruction make this var (and all cache memory of current CPU) be
> committed into physical memory *before executing the interlocked
> instruction. So other threads will "see" changes from interlocked
> operations after seeing changes in another shared variable (i.e. release
> semantics).
>
> For example:
>
> a++;
> InterlockedIncrement (ref b);
> c++;
>
> Other threads will only see changes in "a" prior to changes in "b", and will
> see changes in "c" after changes in b. Critical section lock algorithms,
> such as the "Peterson's algorithm" rely on this behavior or would not work
> on SMP hardware. Maybe we are saying the same thing, and I am just not
> understanding your intent.
They will see changes in a, b and c in that order if they *read* them
in that order - but if you do:
int x = c;
int y = b;
int z = a;
then there's no guarantee that you won't see the change to c but not
the change to a, because a could *actually* have been reordered to be
read before c in the first place.
Joe Duffy has replied to the blog post, by the way, confirming this.
Thanks. Good to see his reply. So he is saying Interlocked *would prevent
the clr re-order issue? So doing something like:
int a = x;
int b = Interlocked.Read(ref y);
to your example would fix this example in terms of clr reorder issue? As he
said, it is easier to use volatile, but just to finish the thread. Or would
you need Interlocked.Read on both lines?
--
wjs
I *think* that you only need the one, although you can't do it with an
int because there's no Interlocked.Read(ref int value). It's
unfortunate that the docs for Interlocked.Read imply that all
Interlocked is used for is atomic reads, rather than preventing
reordering.
Personally though, I'd usually use a lower-tech approach, such as
locking, even for a single value. I consider even volatile to be
somewhat high-tech!
> | Joe Duffy has replied to the blog post, by the way, confirming this.
>
> Thanks. Good to see his reply. So he is saying Interlocked *would prevent
> the clr re-order issue? So doing something like:
>
> int a = x;
> int b = Interlocked.Read(ref y);
Interlocked.Read is for reading 64-bit values atomically on 32-bit
platforms. Thread.VolatileRead is what you want here (or preferably mark
y as volatile).
> to your example would fix this example in terms of clr reorder issue?
It's not just a CLR issue, it's a CPU issue too - as Joe indicates in
the reply to Jon's blog post, IA64 will reorder reads etc. more than
e.g. x86.
> As he
> said, it is easier to use volatile, but just to finish the thread. Or would
> you need Interlocked.Read on both lines?
Thread.VolatileRead will be a read barrier, and the normal way of
understanding "read barrier" is that it prevents later reads moving
before it, but when you're mixing read-barriers with normal reads, the
barrier-based mental model gets a little shaky. What, in that model,
stops the read of 'x' into local 'a' moving after the call (i.e. in the
other direction)? IMO both x and y should be marked volatile, or both
should be read with Thread.VolatileRead, to nail them down. I'd stay
away from "mix and match" volatile / non-volatile variables.
Even better, use a lock unless performance timing indicates that this
trickery is absolutely necessary.