Weird unlocked behaviour?

Jan Wielemaker

unread,

Dec 22, 2009, 9:49:27 AM12/22/09

to

Hi.

I noticed some quite weird behaviour in a C-written multi-threaded app.
There is a global variable (called `generation'). This value is
(unlocked) incremented using ++generation from multiple threads. It is
also read from multiple threads and in these threads read from multiple
locations in the source.

Now, I know that this is unsafe in the sense that not all threads see a
consistent value for generation and I can loose some `steps' in the
generation. For my purposes though, this is not fatal and synchronising
would be rather costly.

I'm now trying to track an error that seems to indicate that at some
later moment in time, the *same* thread gets a *lower* value for
generation (++ is the *only* write-operation and this is surely no
integer overflow). That is fatal for the app :-(

How can this happen? Could it be because a thread changes core? I
would expect the system guarantees complete consistency if a thread
is moved to another core.

Could it be some cache-issue? The variable generation is a 64-bit value
(uintptr_t), part of a global struct. Anything closely around it is
never changed after program startup.

The app runs on a quadcore AMD; 64-bits; SuSE Linux 11.0 (GCC 4.3.2;
glibc 2.9; kernel 2.6.27).

Thanks for any clue!

--- Jan

Jan Wielemaker

unread,

Dec 22, 2009, 10:27:30 AM12/22/09

to

On 2009-12-22, Jan Wielemaker <ja...@hppc323.few.vu.nl> wrote:
> Hi.
>
> I noticed some quite weird behaviour in a C-written multi-threaded app.
> There is a global variable (called `generation'). This value is
> (unlocked) incremented using ++generation from multiple threads. It is
> also read from multiple threads and in these threads read from multiple
> locations in the source.
>
> Now, I know that this is unsafe in the sense that not all threads see a
> consistent value for generation and I can loose some `steps' in the
> generation. For my purposes though, this is not fatal and synchronising
> would be rather costly.

Ok, without atomic increment a thread can be preempted between reading
and writing while others keep incrementing. I guess it is also possible
that the thread doing ++generation reads an outdated value from its
cache. I've put locks around the increments (those are not time
critical) and that seems to fix the issue.

Sorry for the noise. Hope I'm not the only one with this misconception.

--- Jan

David Schwartz

unread,

Dec 22, 2009, 2:11:46 PM12/22/09

to

On Dec 22, 7:27 am, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:

> Ok, without atomic increment a thread can be preempted between reading
> and writing while others keep incrementing.

Exactly.

> I guess it is also possible
> that the thread doing ++generation reads an outdated value from its
> cache.

An outdated value in the caches would not be valid, and so the CPU
would not use it until it could make it valid. A CPU will not modify a
value in cacheable memory unless that value is exclusively held in
that CPU's cache. The cache coherency hardware guarantees, logically
enough, that the caches are coherent.

DS

frege

unread,

Dec 23, 2009, 1:03:18 AM12/23/09

to

On Dec 22, 9:49 am, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:

> I'm now trying to track an error that seems to indicate that at some
> later moment in time, the *same* thread gets a *lower* value for
> generation (++ is the *only* write-operation and this is surely no
> integer overflow). That is fatal for the app :-(
>
> How can this happen?

++g is 3 steps: read add write
let's look at an example:

g == 0
thread1 reads g (== 0)
thread2 reads g (0)
thread1 adds then writes g (1)
threadn reads adds writes g (2)
thread1 reads adds writes g (3)
thread2 adds 1 to it's old value of g (1)
thread2 writes g (1)
thread1 reads adds writes g (2)

thread1 previously saw g as 3 now sees it as 2.

Or other variations. Basically a lot can happen between the read then
the write!

Tony

Jan Wielemaker

unread,

Dec 23, 2009, 3:35:58 AM12/23/09

to

On 2009-12-22, David Schwartz <dav...@webmaster.com> wrote:
> On Dec 22, 7:27�am, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:
>
>> Ok, without atomic increment a thread can be preempted between reading
>> and writing while others keep incrementing.
>
> Exactly.

:-)

>> I guess it is also possible
>> that the thread doing ++generation reads an outdated value from its
>> cache.
>
> An outdated value in the caches would not be valid, and so the CPU
> would not use it until it could make it valid. A CPU will not modify a
> value in cacheable memory unless that value is exclusively held in
> that CPU's cache. The cache coherency hardware guarantees, logically
> enough, that the caches are coherent.

Hmmm. Do I understand this? I always thought that if CPU 1 writes to
an address, any other CPU may get the value prior to this writing action
as long as no synchronisation action has taken place. Is this not true?
And, if it is not true, is it not true for all hardware, or just not for
Intel/AMD IA32/X64?

Cheers --- Jan

Dmitriy Vyukov

unread,

Dec 23, 2009, 5:06:31 AM12/23/09

to

It can and it can't at the same time. It actually depends on how you
define the term "prior". Note that there is no single time in
distributed systems.
It's usually considered as CPU1 always sees prior writes from CPU2.
But the statement is basically senseless, because it's by definition,
it's prior because CPU1 has seen it. There is basically no other way
to define the term "prior" other than by means of visibility. Thus,
one can't define 'visibility' by means of 'prior', otherwise there
will be cyclic dependency between terms.

Synchronization actions has nothing to do with visibility, then relate
only to mutual ordering (non-coherent systems aside for a moment).

--
Dmitriy V'jukov

Jan Wielemaker

unread,

Dec 23, 2009, 7:43:54 AM12/23/09

to

On 2009-12-23, Dmitriy Vyukov <dvy...@gmail.com> wrote:

> On Dec 23, 11:35锟絘m, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:
>> On 2009-12-22, David Schwartz <dav...@webmaster.com> wrote:
>>

>> > On Dec 22, 7:27锟絘m, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:
>>
>> >> Ok, without atomic increment a thread can be preempted between reading
>> >> and writing while others keep incrementing.
>>
>> > Exactly.
>>
>> :-)
>>
>> >> I guess it is also possible
>> >> that the thread doing ++generation reads an outdated value from its
>> >> cache.
>>
>> > An outdated value in the caches would not be valid, and so the CPU
>> > would not use it until it could make it valid. A CPU will not modify a
>> > value in cacheable memory unless that value is exclusively held in
>> > that CPU's cache. The cache coherency hardware guarantees, logically
>> > enough, that the caches are coherent.
>>

>> Hmmm. 锟紻o I understand this? 锟絀 always thought that if CPU 1 writes to

>> an address, any other CPU may get the value prior to this writing action

>> as long as no synchronisation action has taken place. 锟絀s this not true?

>> And, if it is not true, is it not true for all hardware, or just not for
>> Intel/AMD IA32/X64?
>
> It can and it can't at the same time. It actually depends on how you
> define the term "prior". Note that there is no single time in
> distributed systems.
> It's usually considered as CPU1 always sees prior writes from CPU2.
> But the statement is basically senseless, because it's by definition,
> it's prior because CPU1 has seen it. There is basically no other way
> to define the term "prior" other than by means of visibility. Thus,
> one can't define 'visibility' by means of 'prior', otherwise there
> will be cyclic dependency between terms.

Hi Dmitriy,

I was talking about time as we all preceive it: wall-time and next I was
trying to grasp visibility. I know this is a bit naive way to think about it,
but afterall it is way of looking at the behaviour if you do not do anything
special multi-threaded/core related. I.e. you have two threads running that
poke in shared memory addresses without using explicit communication.

> Synchronization actions has nothing to do with visibility, then relate
> only to mutual ordering (non-coherent systems aside for a moment).

I guess it is time for some more reading :-)

Cheers --- Jan

Dmitriy Vyukov

unread,

Dec 23, 2009, 10:24:36 AM12/23/09

to

On Dec 23, 3:43 pm, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:

> >> >> I guess it is also possible
> >> >> that the thread doing ++generation reads an outdated value from its
> >> >> cache.
>
> >> > An outdated value in the caches would not be valid, and so the CPU
> >> > would not use it until it could make it valid. A CPU will not modify a
> >> > value in cacheable memory unless that value is exclusively held in
> >> > that CPU's cache. The cache coherency hardware guarantees, logically
> >> > enough, that the caches are coherent.
>

> >> Hmmm. Do I understand this? I always thought that if CPU 1 writes to

> >> an address, any other CPU may get the value prior to this writing action

> >> as long as no synchronisation action has taken place. Is this not true?

> >> And, if it is not true, is it not true for all hardware, or just not for
> >> Intel/AMD IA32/X64?
>
> > It can and it can't at the same time. It actually depends on how you
> > define the term "prior". Note that there is no single time in
> > distributed systems.
> > It's usually considered as CPU1 always sees prior writes from CPU2.
> > But the statement is basically senseless, because it's by definition,
> > it's prior because CPU1 has seen it. There is basically no other way
> > to define the term "prior" other than by means of visibility. Thus,
> > one can't define 'visibility' by means of 'prior', otherwise there
> > will be cyclic dependency between terms.
>
> Hi Dmitriy,
>
> I was talking about time as we all preceive it: wall-time and next I was
> trying to grasp visibility. I know this is a bit naive way to think about it,
> but afterall it is way of looking at the behaviour if you do not do anything
> special multi-threaded/core related. I.e. you have two threads running that
> poke in shared memory addresses without using explicit communication.

From this point of view: (1) if you think on second-scale level then
all writes are visible instantly (for cache-coherent systems), (2) if
you think on nanosecond-scale level then writes become visible some
arbitrary time later. Rough reference number: 100-1000 cycles.
Regarding "synchronization actions", there is no way to speedup
propagation of changes. If there would be a one, it would be turned on
by default for all writes.

That's for hardware. But there is also an issue wrt compilers. If you
program in C/C++, then compiler may leave any write in register and
not push it to memory subsystem. That way write may actually not
become visible to other threads for seconds/minutes/hours.

--
Dmitriy V'jukov

frege

unread,

Dec 23, 2009, 11:40:44 AM12/23/09

to

Another CPU may get the value "prior" to this writing action because
it actually *read* the value *prior* to it being written! ie the read
got reordered to happen before when you thought it happened. If you
try to *guarantee* the order of the read/write, then you can only do
so via a synchronization action. If you try to 'detect' when things
are written (via if() statements instead of sync primitives) then
speculative reads and reordering may still have the read happen before
the write.

In other words, you can't tear open the case of your computer and look
at the memory values directly. You can only 'see' them via CPUs, and
thus your are getting a CPU's view of things, not the 'wall clock'
view.

Tony

David Schwartz

unread,

Dec 23, 2009, 2:26:13 PM12/23/09

to

On Dec 23, 12:35 am, Jan Wielemaker <j...@hppc323.few.vu.nl> wrote:

> > An outdated value in the caches would not be valid, and so the CPU
> > would not use it until it could make it valid. A CPU will not modify a
> > value in cacheable memory unless that value is exclusively held in
> > that CPU's cache. The cache coherency hardware guarantees, logically
> > enough, that the caches are coherent.

> Hmmm. Do I understand this? I always thought that if CPU 1 writes to
> an address, any other CPU may get the value prior to this writing action
> as long as no synchronisation action has taken place. Is this not true?

It is true, but it's not due to the CPU reading from a cache. It's due
to two things:

1) The CPU that does the write may hold the write in a write posting
buffer. This may cause the write to not be visible on other CPUs until
much later than normal program sequence would suggest.

2) The CPU that does the read may pre-fetch the read and acquire the
data earlier than the normal program sequence would suggest.

DS

David Schwartz

unread,

Dec 23, 2009, 2:29:52 PM12/23/09

to

On Dec 23, 8:40 am, frege <gottlobfr...@gmail.com> wrote:

> Another CPU may get the value "prior" to this writing action because
> it actually *read* the value *prior* to it being written! ie the read
> got reordered to happen before when you thought it happened. If you
> try to *guarantee* the order of the read/write, then you can only do
> so via a synchronization action. If you try to 'detect' when things
> are written (via if() statements instead of sync primitives) then
> speculative reads and reordering may still have the read happen before
> the write.

Exactly! The incorrect assumption is that the CPU executes your code
in the same order you wrote it. The CPU executes your code so that the
effect is the same as if it executed your code in the same order you
write it if, and only if, you follow the rules.

If you have dependencies in your code that are not allowed by the
platform specification, then CPU superscalar optimizations may cause
your code to execute differently than it should. For example, if you
have:

a=1;
b=2;

And these two writes are to normal memory, there is no reason the CPU
can't reorder them if it's more efficient for it to execute them in
the other order. Similarly, if you have:

a=c;
b=d;

If these two reads are from normal memory and 'a' and 'b' are in
registers, there's no reason the CPU can't reorder the two reads if it
wants to.

So it's the CPU itself that causes the problem. Reads and writes are
only required to occur in the exact order the code specifies them if
there's a reason they must that's made known to the CPU. If there's a
reason they must occur in the specified order and that reason is not
made known to the CPU, the code will fail.

DS