
concurrency-induced memory-access anomalies


t...@cs.ucr.edu

Aug 13, 2002, 7:31:58 AM

Here, for comment, is an attempt at a two-page writeup on
concurrency-induced memory-access anomalies, i.e., word tearing, false
sharing, out-of-order access, volatility, etc.

Tom Payne


=====================================================================


CONCURRENCY-INDUCED MEMORY-ACCESS ANOMALIES MADE SIMPLE (I hope)

A contiguous segment of memory that is dedicated to holding values of
a particular type during a particular phase of a process is called an
"object" in C/C++ literature. In other literature, it might be called
a (possibly nameless) variable or a data item. The "state" of an
object is the tuple of bits found in its segment of memory. What
"value" that state represents is determined by the encoding algorithm
used to implement the object's type.

There are four memory-access anomalies that arise in environments
where multiple concurrent streams of computational activity share
access to a given object or collection of objects. Those streams may
be threads and/or the handling of signals, interrupts, exceptions, or
longjmps.

TEARING. For purposes of reading and writing, memory usually has an
access granularity that is larger than a byte. The unit of access
granularity is usually called a "word" --- it is called a "sector" in
discussions of disk systems and a "line" in discussions of main-memory
caches. The important point is that the reading and writing of words
is atomic, i.e., one access to a word can never get time-interleaved
with another access to that same word.

Suppose, however, that a given object spans multiple words and that a
write operation on that object is in progress and has modified the
state of some of those words but not others. The object is now in an
indeterminate state --- on some systems, accessing it would generate
a trap that gets handled by the operating system in undefined
ways. Likewise, if a write to that object were to occur while a read
was in progress, i.e., after some but not all of the words of the
object had been read, the value read could be similarly incoherent
with similarly undefined consequences.

This phenomenon is often called "word tearing", but in reality it is
the object that appears torn (along word boundaries). The problem is
that a write and another access to that object got interleaved. But
we always can and often must use a lock (a.k.a. a mutex) to enforce
atomicity of any given collection of operations on any given
collection of data items. It is important, however, that operations
on locks themselves be intrinsically atomic, i.e., locks should be
allocated on a single word.
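The lock-based cure for tearing can be sketched as follows. This is a
minimal illustration assuming POSIX threads; the names (`pair_write`,
`pair_read`) are mine, not from the discussion above:

```c
#include <pthread.h>

/* A multi-word object: on a 32-bit machine this pair of longs spans
   more than one word, so an unsynchronized reader could observe a
   half-written (torn) state. */
struct pair { long a; long b; };

static struct pair shared = { 0, 0 };
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;

/* Every access to `shared` goes through the lock, which enforces
   atomicity of the whole multi-word read or write. */
void pair_write(long a, long b) {
    pthread_mutex_lock(&shared_lock);
    shared.a = a;
    shared.b = b;
    pthread_mutex_unlock(&shared_lock);
}

struct pair pair_read(void) {
    pthread_mutex_lock(&shared_lock);
    struct pair copy = shared;
    pthread_mutex_unlock(&shared_lock);
    return copy;
}
```

The lock itself must be intrinsically atomic, as noted above; the mutex
implementation, not the programmer, is responsible for that.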

In C and C++ there are no intrinsically atomic data types. All that
the C and C++ standards guarantee is that objects of type "volatile
sig_atomic_t" will not get torn when they are accessed by a
program and written by a concurrent invocation of a signal handler.
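That one guarantee looks like this in practice. A minimal sketch; the
handler does nothing but store to the flag, which is the only portably
safe side effect, and SIGINT is chosen because it is one of the signals
ISO C requires:

```c
#include <signal.h>

/* The one portable guarantee: a "volatile sig_atomic_t" flag can be
   written by a signal handler and read by the interrupted program
   without tearing. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int sig) {
    (void)sig;
    got_signal = 1;   /* the only portably safe side effect here */
}

int install_and_raise(void) {
    signal(SIGINT, on_signal);
    raise(SIGINT);    /* deliver the signal to ourselves */
    return got_signal;
}
```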

FALSE SHARING. Interleaved (i.e., non-atomic) partial-word writes to
distinct objects that occupy a given word can leave one of those
objects in an incoherent state. Obviously, to write a new value to
one of those objects, that entire word must be read, the portion of
the word that involves the object being updated must be modified, and
then the entire word must be written back. Some systems perform this
read/modify/write operation atomically in hardware, but such
operations incur a penalty over simple writes.

A given object is said to be "isolated" if writes to other objects
cannot interfere with (i.e., invalidate) writes to the given object,
in the way described above. In practice, isolation can be achieved
either:

* by allocating the object so that it does not share any of the
words it occupies with any other objects, or

* by making sure that partial-word writes to words it occupies
(i.e., writes on behalf of neighboring objects) are atomic ---
such atomicity can be provided either by hardware or by putting
all accesses to all objects that share a given word into a common
lock-protected critical region.

There is a space penalty in the first case and time penalty in the
second.
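The first option (isolation by allocation, paying the space penalty) can
be sketched as padding each object out to the access granularity. The
64-byte figure is an assumption standing in for a typical cache-line
size; real granularities vary by machine:

```c
#include <stddef.h>

/* Isolate each counter by padding it to an assumed 64-byte access
   granularity, so a write on behalf of one counter can never involve
   the word(s) holding its neighbor. */
#define LINE 64

struct padded_counter {
    long value;
    char pad[LINE - sizeof(long)];  /* the space penalty of isolation */
};

struct padded_counter counters[2];  /* neighbors, but isolated */
```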

OUT-OF-ORDER ACCESS. Modern compilers and architectures often
rearrange the order of instructions in ways that improve performance
but do not affect behavior in single-stream environments. Usually the
order of reads and writes to a given object must be preserved, but the
relative order of accesses to distinct objects is usually unimportant.
In a multi-stream environment, however, the reads and writes
associated with the acquisition of a given lock must precede reads and
writes on the objects that lock protects, which must in turn precede
the writes associated with releasing that lock. Therefore notice has
to be given to the compiler and the underlying hardware to ensure that
accessing of critical data does not leak out the tops or bottoms of
critical sections. Many architectures provide "barrier instructions"
past which hardware will not hoist or sink instructions. Most
compilers do not provide directives that are barriers to code motion,
but a compiler would get into serious trouble even with
single-threaded code if it moved instructions across calls to
separately compiled functions. So, it usually suffices to compile
lock-handling functions as a separate translation unit.
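A compiler-side barrier can also be spelled directly on some toolchains.
This is a GCC-style extension, not standard C, and it restrains only the
compiler; the hardware barrier is a separate, per-architecture
instruction not shown here:

```c
/* GCC-style compiler barrier: an empty asm with a "memory" clobber
   tells the compiler not to move memory accesses across this point. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

static int lock_word = 0;
static int protected_data = 0;

void release_lock(void) {
    COMPILER_BARRIER();   /* keep writes to protected_data above...   */
    lock_word = 0;        /* ...the store that releases the lock      */
}
```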

VOLATILITY. The question of what is the current bit-tuple (and hence
the current value) held by a particular object is a bit trickier than
one might imagine. The problem is one of timing: regardless of when
an object is read, that value is instantly no longer "current",
i.e., a new value might immediately have been written to that object
by hardware, by a debugger, by an interrupt or signal handler, or by
another thread. But, if up-to-the-instant values are unattainable,
how recent is recent enough? Ultimately, there need to be points in
the execution of a program after which values read are assumed to be
valid until the next such point occurs. Commonly these points are
called "sequence points." In C/C++, there is a sequence point at
each comma, semicolon, function call, and operator having a defined
order of evaluation.

If a given object is not "volatile", it can be assumed that the
object's current value is the last value stored to that object by the
current stream of computational activity. By contrast, a volatile
object is assumed to be subject to modification by other concurrent
streams of hardware or software activity. Therefore, the value last
stored to a volatile object by the current stream might be stale,
i.e., older than the most recent sequence point, so a more up-to-date
value must be fetched from that object (at significant cost compared
to using a value that is already register-resident). Objects that
are allocated to input registers, that are to be modified via a
debugger, or that are read by other concurrent streams of hardware or
software activity should be declared to be volatile; otherwise, the
resulting behavior will be undefined.

In principle the new value of a modified object must be written to the
object's location before the next sequence point. If, however, object
is not "observable", the new value need not be written to the
object's storage location until and unless the object's value might
get fetched from there. Objects that are allocated to output
registers, observed via a debugger, or read by other concurrent
streams of hardware or software activity should, therefore, be
declared to be observable.

In C and C++, objects of volatile-qualified types are assumed to be both
volatile and observable, in the sense defined above. The
volatile-qualification of types is a mechanism that the C and C++
standards provide for the program and programmer to suspend the
default guarantees of non-volatility and non-observability, which
implementations commonly use (via the as-if rule) to omit much memory
traffic, thereby improving performance.

Updates to locks must promptly be realized in thread-shared memory.
Their values must not be left in unstored registers. (And, in the
multiprocessor case, their values must not be left in unreconciled
caches.) From the perspective of individual threads, a lock object
must be both volatile and observable. On the other hand, it is
unnecessary to declare lock-protected data to be volatile or
observable, provided that appropriate barriers to code motion are
inserted at lock acquisition and release operations.


Fergus Henderson

Aug 13, 2002, 9:43:33 AM
t...@cs.ucr.edu writes:

>Here, for comment, is an attempt at a two-page writeup on
>concurrency-induced memory-access anomalies, i.e., word tearing, false
>sharing, out-of-order access, volatility, etc.

Thanks. That's a useful summary.
I have a few comments, intended in a spirit of constructive criticism:

>There are four memory-access anomalies that arise in environments
>where multiple concurrent streams of computational activity share
>access to a given object or collection of objects. Those streams may
>be threads and/or the handling signals, interrupts, exceptions, or
>longjmps.

The last sentence there is a bit incoherent.

Note that longjmp() by itself does not introduce any asynchronous
behaviour and thus does not suffer from any of the four memory-access
anomalies that you listed.

>TEARING. For purposes of reading and writing, memory usually has an
>access granularity that is larger than a byte. The unit of access
>granularity is usually called a "word" --- it is called a "sector" in
>discussions of disk systems and a "line" in discussions of main-memory
>caches. The important point is that the reading and writing of words
>is atomic, i.e., one access to a word can never get time-interleaved
>with another access to that same word.

Here and below you are assuming that there is only one size which can
be written atomically. I don't think that is a reasonable assumption.

>called ``sequence points.'' In C/C++, there is a sequence point at
>each comma, semicolon, function call, and operator having a defined
>order of evaluation.

Saying that there is a sequence point at each comma is misleading.
There's no sequence point between the two updates to x in
`f(x++, x++)', for example. Only the comma operator guarantees
a sequence point (and that is already covered by your wording
"each operator having a defined order of evaluation").
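For instance (my illustration, not part of the original post), the
comma operator's sequence point makes the following well defined,
whereas the argument commas in f(x++, x++) would not be:

```c
/* The comma OPERATOR is a sequence point: the increments happen left
   to right.  The commas separating function arguments are not. */
int comma_operator_demo(void) {
    int x = 0;
    int y = (x++, x++);   /* defined: y gets the value of the 2nd x++ */
    return y * 10 + x;    /* y == 1, x == 2 */
}
```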

--
Fergus Henderson <f...@cs.mu.oz.au> | "I have always known that the pursuit
The University of Melbourne | of excellence is a lethal habit"
WWW: <http://www.cs.mu.oz.au/~fjh> | -- the last words of T. S. Garp.

t...@cs.ucr.edu

Aug 14, 2002, 1:03:53 AM
In comp.programming.threads Fergus Henderson <f...@cs.mu.oz.au> wrote:
: t...@cs.ucr.edu writes:

:>Here, for comment, is an attempt at a two-page writeup on
:>concurrency-induced memory-access anomalies, i.e., word tearing, false
:>sharing, out-of-order access, volatility, etc.

: Thanks. That's a useful summary.
: I have a few comments, intended in a spirit of constructive criticism:

Thanks.

:>There are four memory-access anomalies that arise in environments


:>where multiple concurrent streams of computational activity share
:>access to a given object or collection of objects. Those streams may
:>be threads and/or the handling signals, interrupts, exceptions, or
:>longjmps.

: The last sentence there is a bit incoherent.

I'll work on it.

: Note that longjmp() by itself does not introduce any asynchronous


: behaviour and thus does not suffer from any of the four memory-access
: anomalies that you listed.

The volatility-related issue I have in mind here is:

All accessible objects have values as of the time longjmp was
called, except that the values of objects of automatic storage
duration that are local to the function containing the invocation of
the corresponding setjmp macro that do not have volatile-qualified
type and have been changed between the setjmp invocation and longjmp
call are indeterminate. [C89 7.6.2.1]

The point is that there are two distinct continuations (streams of
compuational activity) from the invocation of setjmp. It's as though
we timesliced to the first continuation at the invocation of setjmp,
and then timesliced back to the second continuation at the invocation
of longjmp, thereby possibly losing the updates that the first
continuation made to non-volatile automatics local to that invocation
of setjmp. (A somewhat lame analog of pseudo-concurrency.)
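Concretely, the C89 rule quoted above plays out like this (a minimal
sketch; the variable names are mine):

```c
#include <setjmp.h>

/* C89 7.6.2.1: an automatic object modified between setjmp and
   longjmp has an indeterminate value after the jump unless it is
   volatile-qualified. */
static jmp_buf env;

int volatile_survives_longjmp(void) {
    volatile int kept = 0;   /* guaranteed to hold 1 after the jump */
    int risky = 0;           /* indeterminate after the jump        */

    if (setjmp(env)) {
        (void)risky;         /* must not be relied upon here */
        return kept;
    }
    kept = 1;
    risky = 1;
    longjmp(env, 1);
}
```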

:>TEARING. For purposes of reading and writing, memory usually has an


:>access granularity that is larger than a byte. The unit of access
:>granularity is usually called a "word" --- it is called a "sector" in
:>discussions of disk systems and a "line" in discussions of main-memory
:>caches. The important point is that the reading and writing of words
:>is atomic, i.e., one access to a word can never get time-interleaved
:>with another access to that same word.

: Here and below you are assuming that there is only one size which can
: be written atomically. I don't think that is a reasonable assumption.

AFAIK, for a given physical memory system, there is usually only one
intrinsically atomic size for memory writes, and atomic writes of
other sizes are accomplished at an increased cost via hardware-locked
read/modify/write sequences. (Perhaps someone more knowledgeable of
hardware than I can comment on this hypothesis.)

:>called ``sequence points.'' In C/C++, there is a sequence point at


:>each comma, semicolon, function call, and operator having a defined
:>order of evaluation.

: Saying that there is a sequence point at each comma is misleading.
: There's no sequence point between the two updates to x in
: `f(x++, x++)', for example. Only the comma operator guarantees
: a sequence point (and that is already covered by your wording
: "each operator having a defined order of evaluation).

Oops. Thanks for the correction.

Tom Payne

Christian Bau

Aug 14, 2002, 2:59:42 AM
In article <ajcobp$enk$1...@glue.ucr.edu>, t...@cs.ucr.edu wrote:

> The volatility-related issue I have in mind here is:
>
> All accessible objects have values as of the time longjmp was
> called, except that the values of objects of automatic storage
> duration that are local to the function containing the invocation of
> the corresponding setjmp macro that do not have volatile-qualified
> type and have been changed between the setjmp invocation and longjmp
> call are indeterminate. [C89 7.6.2.1]
>
> The point is that there are two distinct continuations (streams of
> compuational activity) from the invocation of setjmp. It's as though
> we timesliced to the first continuation at the invocation of setjmp,
> and then timesliced back to the second continuation at the invocation
> of longjmp, thereby possibly losing the updates that the first
> continuation made to non-volatile automatics local to that invocation
> of setjmp. (A somewhat lame analog of pseudo-concurrency.)

I don't think this has anything to do with threading. What happened
historically I believe is that implementations put "volatile" values
into memory and kept other values often in processor registers. A
combination of setjmp/longjmp has to restore the processor state to a
certain degree, and historically did that by setjmp storing things like
current registers, program counter and stack counter and longjmp
restoring them. Such an implementation would behave just as
described: volatile values are ok, but non-volatile values that are
changed between setjmp and longjmp can, for example, get restored to
their original values. As this behaviour was easy to implement and
useful, it was put into the C Standard.

But it has nothing to do with threading. longjmp is nothing more than a very
complicated form of goto. And in C++, where exceptions can do everything
that longjmp can and more, throwing exceptions must keep _all_ variables
in the correct state, not only volatile. So this has nothing really to
do with volatiles either. This part of the C standard is just historical
coincidence.

t...@cs.ucr.edu

Aug 14, 2002, 4:06:17 AM
In comp.std.c Christian Bau <christ...@freeserve.co.uk> wrote:

At the most basic (machine-code) level, threads are all about saving
and restoring registers, which is exactly what setjmp and longjmp are
about. It's no coincidence that many threads implementations (e.g.,
GNUthreads) pass execution from one thread to another via the
well-known idiom:

if ( !setjmp(t1->registers) ) longjmp(t2->registers, true);

where t1 and t2 are thread descriptors having a jmp_buf member named
"registers". The point is that:
- Threads can be viewed (and implemented) as fancy co-routines.
- A co-routine is semantically equivalent to a jmp_buf.
- Calling a coroutine is semantically equivalent to a setjmp
followed by a longjmp.

Tom Payne

Alexander Terekhov

Aug 14, 2002, 5:47:13 AM
t...@cs.ucr.edu wrote:
[...]

> - Threads can be viewed (and implemented) as fancy co-routines.

No. In the wrong hands, threads can be "degraded" [using things like
RIP-OS/2-critical-section-beast, etc.] to something resembling "fancy
co-routines". ;-)

regards,
alexander.

Christian Bau

Aug 14, 2002, 5:48:29 PM
In article <ajd31p$ibl$1...@glue.ucr.edu>, t...@cs.ucr.edu wrote:

> At the most basic (machine-code) level, threads are all about saving
> and restoring registers, which is exactly what setjmp and longjmp are
> about. It's no coincidence that many threads implementations (e.g.,
> GNUthreads) pass execution from one thread to another via the
> well-known idiom:
>
> if ( !setjmp(t1->registers) ) longjmp(t2->registers, true);
>
> where t1 and t2 are thread descriptors having a jmp_buf member named
> "registers". The point is that:
> - Threads can be viewed (and implemented) as fancy co-routines.
> - A co-routine is semantically equivalent to a jmp_buf.
> - Calling a coroutine is semantically equivalent to a setjmp
> followed by a longjmp.

In many cases, threads are all about two or more processors executing
instructions simultaneously, with two sets of registers, and the
behaviour is as if execution passed from one thread to the other and back
in every machine cycle.

It may very well be that on some implementation of C setjmp and longjmp
are defined in such a way that they can be abused for implementing
thread switching on a single processor system. But that has nothing to
do with the setjmp and longjmp that we know from the C Standard. I
personally would prefer thread switching to be implemented with some
honest assembly code than by abusing the C library.

Douglas A. Gwyn

Aug 14, 2002, 5:17:52 PM
t...@cs.ucr.edu wrote:
> At the most basic (machine-code) level, threads are all about saving
> and restoring registers, ...

And volatile is primarily about not caching values in registers.

t...@cs.ucr.edu

Aug 14, 2002, 9:09:03 PM
In comp.std.c Christian Bau <christ...@freeserve.co.uk> wrote:
[...]
: It may very well be that on some implementation of C setjmp and longjmp
: are defined in such a way that they can be abused for implementing
: thread switching on a single processor system. But that has nothing to
: do with the setjmp and longjmp that we know from the C Standard.

Either the Standard-specified semantics for setjmp and longjmp
are appropriate for dispatching a processor from one thread to another
or they aren't. (Note that even on a multiprocessor system we only
pass one processor at a time from one thread to another.)

So far as I can tell, the thread implementations that use a jmp_buf to
represent a thread's machine context miss only one detail from being
fully conforming. Namely, there is no way within the C standard to
give a jmp_buf a new stack. There are, however, tricks within
extensions of the standard for doing so, e.g., sigaltstack():

"Portable Multithreading, the Signal Stack Trick for User-Space
Thread Creation",
Ralf S. Engelschall,
Proceedings of the USENIX Annual Technical Conference,
June 18-23, 2000, San Diego, California.
(http://www.gnu.org/software/pth/rse-pmt.ps)

: I personally would prefer thread switching to be implemented with some

: honest assembly code than by abusing the C library.

I'm not claiming that one *should* use setjmp and longjmp to dispatch
threads nor that they are a perfect solution to the dispatching of
processors among threads. Rather, I'm claiming that:
(1) That a longjmp via a jmp_buf creates a "stream of computational
activity" whose access to values set by the original continuation
from jmp_buf's initializing setjmp can sometimes run into
memory-access anomalies involving issues of volatility.
(2) That the relationship between setjmp/longjmp and
dispatching CPUs among threads is not coincidental; rather, it is
based on nearly identical semantics.

Tom Payne

David Butenhof

Aug 15, 2002, 9:46:16 AM
t...@cs.ucr.edu wrote:

> In comp.std.c Christian Bau <christ...@freeserve.co.uk> wrote:
> [...]
> : It may very well be that on some implementation of C setjmp and longjmp
> : are defined in such a way that they can be abused for implementing
> : thread switching on a single processor system. But that has nothing to
> : do with the setjmp and longjmp that we know from the C Standard.
>
> Either either the Standard-specified semantics for setjmp and longjmp
> are appropriate for dispatching a processor from one thread to another
> or they aren't. (Note that even on a mutiprocessor system we only
> pass one processor at a time from one thread to another.)
>
> So far as I can tell, the thread implementations that use a jmp_buf to
> represent a thread's machine context miss only one detail from being
> fully conforming. Namely, there is no way to within the C standard to
> give a jmp_buf a new stack. There are, however, tricks within
> extensions of the standard for doing so, e.g., sigaltstack():

Well, it's funny you mention that word "conforming". Let's see... how about
this line from POSIX:

"The effect of a call to longjmp() where initialization of the
jmp_buf structure was not performed in the calling thread is
undefined."

If you're interested in context switching, you need to look at the SysV (and
SUS) context management functions; getcontext, setcontext, makecontext.
(But "Use of contexts to create alternate stacks is not defined by this
volume of IEEE Std 1003.1-2001.")
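For reference, the SysV/SUS context functions mentioned above look like
this when used to switch between two execution streams. This is a
hedged sketch using real (though later obsolescent) POSIX interfaces;
the coroutine shape and names are mine:

```c
#include <ucontext.h>

/* makecontext()/swapcontext(): the sanctioned way to get an alternate
   stack, as opposed to the setjmp trick. */
static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];
static int step;

static void coroutine(void) {
    step = 1;
    swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    step = 3;                          /* runs when main resumes us */
}

int run(void) {
    getcontext(&co_ctx);               /* must precede makecontext */
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof co_stack;
    co_ctx.uc_link = &main_ctx;        /* where to go when it returns */
    makecontext(&co_ctx, coroutine, 0);

    swapcontext(&main_ctx, &co_ctx);   /* run coroutine to its yield */
    step = 2;
    swapcontext(&main_ctx, &co_ctx);   /* resume it until it finishes */
    return step;
}
```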

The observation that setjmp/longjmp has nothing to do with thread context
switch is quite correct. For one thing, the standard does not require that
setjmp/longjmp handle sufficient context to switch execution streams: e.g.,
under getcontext/setcontext:

"... since the context restored by longjmp() does not necessarily
contain all the information that setcontext() requires."

> "Portable Multithreading, the Signal Stack Trick for User-Space
> Thread Creation",
> Ralf S. Engelschall,
> Proceedings of the USENIX Annual Technical Conference,
> June 18-23, 2000, San Diego, California, CA.
> (http://www.gnu.org/software/pth/rse-pmt.ps)

I reviewed this paper, and discussed it at length with Ralf. While his
exploits were intriguing, my primary objection was that his result was not
at all "portable". It violates the standards in a variety of ways, and the
only real excuse is that it seemed to work OK on the particular platforms
about which he cared. That's not MY definition of "portable", though I
can't argue the principle is entirely irrelevant.

> : I personally would prefer thread switching to be implemented with some
> : honest assembly code than by abusing the C library.
>
> I'm not claiming that one *should* use setjmp and longjmp to dispatch
> threads nor that they are a perfect solution to the dispatching of
> processors among threads. Rather, I'm claiming that:
> (1) That a longjmp via a jmp_buf creates a "stream of computational
> activity" whose access to values set by the original continuation
> from jmp_buf's initializing setjmp can sometimes run into
> memory-access anomalies involving issues of volatility.

A bit fuzzier than the way the C standard says it, but not incorrect.

> (2) That the relationship between between setjmp/longjmp and
> dispatching CPUs among threads is not coincidental; rather, it is
> based on nearly identical semantics.

The word "nearly" is significant, and it's just not "nearly enough". You're
trying to stretch a point too far here, and you'd be better off standing
clear. ;-)

--
/--------------------[ David.B...@hp.com ]--------------------\
| Hewlett-Packard Company Tru64 UNIX & VMS Thread Architect |
| My book: http://www.awl.com/cseng/titles/0-201-63392-2/ |
\-------------[ http://homepage.mac.com/~dbutenhof ]--------------/

t...@cs.ucr.edu

Aug 15, 2002, 2:45:02 PM
In comp.std.c David Butenhof <David.B...@compaq.com> wrote:
: t...@cs.ucr.edu wrote:
[...]
: The observation that setjmp/longjmp has nothing to do with thread context
: switch is quite correct.

See below.

[...]
: I can't argue the principle is entirely irrelevant.

See above. ;-)

[...]
:> I'm not claiming that one *should* use setjmp and longjmp to dispatch


:> threads nor that they are a perfect solution to the dispatching of
:> processors among threads. Rather, I'm claiming that:
:> (1) That a longjmp via a jmp_buf creates a "stream of computational
:> activity" whose access to values set by the original continuation
:> from jmp_buf's initializing setjmp can sometimes run into
:> memory-access anomalies involving issues of volatility.

: A bit fuzzier than the way the C standard says it, but not incorrect.

Thanks.

:> (2) That the relationship between between setjmp/longjmp and


:> dispatching CPUs among threads is not coincidental; rather, it is
:> based on nearly identical semantics.

: The word "nearly" is significant,

That's why I put it there. ;-)

: and it's just not "nearly enough". You're trying to stretch a point too
: far here,

It's near enough for arguing that the connection is "not
coincidental".

I'm mainly defending the inclusion of the handling of longjmps
alongside threads and the handling of interrupts/signals as "streams
of computational activity" that run into certain memory-access
anomalies. It has been argued that:
- this inclusion was inappropriate,
- that setjmp/longjmp has "nothing to do with threading",
- that whatever connection is there is "purely coincidental",
- that Engelschall's setjmp/longjmp-based approach doesn't meet your
standards of portability, but you "can't argue that the principle
is entirely irrelevant",
- that my main point is fuzzy but "not incorrect".
On that matter, I rest my case. ;-)

: and you'd be better off standing clear. ;-)

Warning rejected! (Inquiring minds want to know.) If C had a standard
function for initializing a jmp_buf with a specified stack, what
aspect of the C's semantics for setjmp and longjmp make

if ( ! setjmp( thread1->registers ) ) longjmp( thread2->registers, 1);

inadequate for passing control from one thread to another? (I realize
that C's wording doesn't automatically preserve common extensions like
pending interrupts; extensions would have to contain their own wording
for that.)

Tom Payne

David Butenhof

Aug 16, 2002, 2:43:50 PM
t...@cs.ucr.edu wrote:

> In comp.std.c David Butenhof <David.B...@compaq.com> wrote:
> : t...@cs.ucr.edu wrote:
> [...]
> : The observation that setjmp/longjmp has nothing to do with thread
> : context switch is quite correct.
>
> See below.
>
> [...]
> : I can't argue the principle is entirely irrelevant.
>
> See above. ;-)

Sheesh. That's what I get for being polite. ;-)

> [...]
> :> I'm not claiming that one *should* use setjmp and longjmp to dispatch
> :> threads nor that they are a perfect solution to the dispatching of
> :> processors among threads. Rather, I'm claiming that:
> :> (1) That a longjmp via a jmp_buf creates a "stream of computational
> :> activity" whose access to values set by the original continuation
> :> from jmp_buf's initializing setjmp can sometimes run into
> :> memory-access anomalies involving issues of volatility.
>
> : A bit fuzzier than the way the C standard says it, but not incorrect.
>
> Thanks.
>
> :> (2) That the relationship between between setjmp/longjmp and
> :> dispatching CPUs among threads is not coincidental; rather, it is
> :> based on nearly identical semantics.
>
> : The word "nearly" is significant,
>
> That's why I put it there. ;-)
>
> : and it's just not "nearly enough". You're trying to stretch a point too
> : far here,
>
> It's near enough for arguing that the connection is "not
> coincidental".

I disagree. Which is, of course, the point.

> I'm mainly defending the inclusion of the handling of longjmps
> alongside threads and the handling of interrupts/signals as "streams
> of computational activity" that run into certain memory-access
> anomalies. It has been argued that:
> - this inclusion was inappropriate,
> - that setjmp/longjmp has "nothing to do with threading",
> - that whatever connection is there is "purely coincidental",
> - that Engleschall's setjmp/longjmp-base approach doesn't meet your
> standards of portability, but you "can't argue that the principle
> is entirely irrelevant,
> - that my main point is fuzzy but "not incorrect".
> On that matter, I rest my case. ;-)

Threads are about concurrency, not context switches.

setjmp/longjmp is about PARTIAL context switches, concurrency is not
allowed. The standard requires that setjmp/longjmp is merely a synchronous
transfer of control within a thread.

If you're going to look at threads as synchronous coroutines, for example,
reordering of memory operations is irrelevant -- that cannot occur within
the scope of a single processor. The transition from coroutines to
concurrent threads isn't a minor step, it's a fundamental discontinuity.

> : and you'd be better off standing clear. ;-)
>
> Warning rejected! (Inquiring minds want to know.) If C had a standard
> function for initializing a jmp_buf with a specified stack, what
> aspect of the C's semantics for setjmp and longjmp make
>
> if ( ! setjmp( thread1->registers ) ) longjmp( thread2->registers, 1);
>
> inadequate for passing control from one thread to another? (I realize
> that C's wording doesn't automatically preserve common extensions like
> pending interrupts; extensions would have to contain their own wording
> for that.)

But that's only the tip of the iceberg. There could be innumerable implicit
aspects of "a thread" that need not and on most implementations would not
be saved or restored by setjmp and longjmp -- because it would be expensive
and pointless.

Setjmp/longjmp is nothing but a fancy goto. It's not even an adequate
"rewind", and it's not even close to being anything like a thread. It's a
synchronous transfer of control WITHIN ONE "stream of computational
activity". It does not and cannot (correctly, portably, or philosophically)
spawn or switch between two "streams". Some common implementations can be
perverted in a way that appears to generate similar effects, just as on
some implementations one can create a POSIX thread with the address of a
C++ method -- it's still wrong.

The use of "volatile" with setjmp/longjmp is static and synchronous: the
marked variables must have been written before the setjmp() call so they'll
still be in memory (if otherwise undisturbed) when setjmp() returns from a
matching longjmp().

This has absolutely no applicability to threads. Coroutines, perhaps; though
you need to be talking about setcontext and getcontext here to have any
firm footing... but neither set of operations creates or constitutes a
thread. The rules of volatile are not written to apply to, and are not
appropriate for, variables shared between threads. Not enough on one hand,
because it cannot deal with reordering of memory operations, and far too
much on the other hand, because it inhibits valuable and legal memory
optimizations within properly synchronized code regions.

Now, you've said that in your paper, (although "volatile" and "threads"
together makes a grating and ominous sound that presages supernatural
horrors better left untold), and I was really only trying to follow up on
this secondary "subthread".

Going back to your paper, the fundamental problem in your discussion of
volatile is in (by implication, or even perhaps by omission) overselling
the value of a "sequence point" across threads. At a sequence point, the
current thread must again fetch that processor's current view of the
volatile data. That's not "up to date", because without synchronization
it's already out of date. That's OK when the single volatile data IS the
communication, between TWO entities. Like an old serial port control
register. You write; it reads and responds; etc.

So you CAN use volatile with threads, if two threads communicate through a
single volatile cell that communicates nothing but its own value. Beyond
that, volatile is absolutely irrelevant. Your paper doesn't address this at
all, and therefore can be inferred to be making dangerous promises. (Though
I don't think it actually says anything incorrect.) This is rather like
pointing the tourists down the path and assuming that they'll notice where
the trail curves off before the sheer cliff. "Thud". Ooops, guess they
didn't. Oh well.

Or to put it another way, if you're going to take upon yourself the job of
educating people, I'd prefer if you'd take the responsibility a little
further and explore the consequences of what you've taught. The answer IS
42, and the question IS "what is 9 times 6"... but, really, what does it
all mean?

But I really wasn't going to get into any of that... ;-)

David Hopwood

Aug 16, 2002, 8:24:35 PM

David Butenhof wrote:
> Going back to your paper, the fundamental problem in your discussion of
> volatile is in (by implication, or even perhaps by omission) overselling
> the value of a "sequence point" across threads. At a sequence point, the
> current thread must again fetch that processor's current view of the
> volatile data. That's not "up to date", because without synchronization
> it's already out of date. That's OK when the single volatile data IS the
> communication, between TWO entities. Like an old serial port control
> register. You write; it reads and responds; etc.
>
> So you CAN use volatile with threads, if two threads communicate through a
> single volatile cell that communicates nothing but its own value.

... and if you understand that in the threading models defined by POSIX,
Java/JVM, or almost any other model intended to be implementable on
multiprocessors, there's no guarantee that changes to the value will be
communicated between threads in a timely fashion, if at all.

> Beyond that, volatile is absolutely irrelevant.

Yep (unless you're relying on knowledge of how a specific compiler implements
volatile on specific hardware).

--
David Hopwood <david....@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip



t...@cs.ucr.edu

Aug 17, 2002, 9:44:30 AM
David Butenhof <David.B...@compaq.com> wrote:
[...]
: Threads are about concurrency, not context switches.

There can be threads without concurrency --- pseudo-concurrency (based
on context switching) suffices.

: setjmp/longjmp is about PARTIAL context switches, concurrency is not
: allowed.

AFAIK, they save and restore all of the context that's discussed in
the C Standard. Extensions to that context can be handled by adding
wrappers.

: The standard requires that setjmp/longjmp is merely a synchronous
: transfer of control within a thread.

AFAIK, the C Standard doesn't mention threads.

: If you're going to look at threads as synchronous coroutines, for example,
: reordering of memory operations is irrelevant -- that cannot occur within
: the scope of a single processor.

Huh?

: The transition from coroutines to
: concurrent threads isn't a minor step, it's a fundamental discontinuity.

In fact, I've seen implementations of threads where the Thread class
was derived (in the sense of object-oriented inheritance) quite
naturally from the Coroutine class. (IIRC, that's essentially the
approach that Wirth takes in his book on Modula-2.)

[...]
: There could be innumerable implicit
: aspects of "a thread" that need not and on most implementations would not
: be saved or restored by setjmp and longjmp -- because it would be expensive
: and pointless.

So, use wrappers to save and restore whatever additional context.

: Setjmp/longjmp is nothing but a fancy goto.

Agreed --- and wait() is an even fancier goto. ;-)

: It's not even an adequate "rewind",

Agreed.

: and it's not even close to being anything like a thread.

Closeness is in the eye of the beholder.

: It's a synchronous transfer of control WITHIN ONE "stream of computational
: activity". It does not and cannot (correctly, portably, or philosophically)
: spawn or switch between two "streams". Some common implementations can be
: perverted in a way that appears to generate similar effects, just as on
: some implementations one can create a POSIX thread with the address of a
: C++ method -- it's still wrong.

You keep saying that, but the questions remain:

* Where in the C standard are words present or missing that would prevent
jmp_bufs from being semantically equivalent to coroutines, if there were
a way to initialize them with a new stack? (I've looked through C89 and,
thus far, not found any such thing.)

* What's wrong with deriving (in the object-oriented sense) threads from
coroutines and passing control among them via the setjmp/longjmp idiom
(hack) wrapped with code that saves signal-blockage status and handles
the necessary locks and queues? (Perhaps this question is simply a
matter of taste, of which there's no disputing.)

: The use of "volatile" with setjmp/longjmp is static and synchronous: the
: marked variables must have been written before the setjmp() call so they'll

^^??^^
: still be in memory (if otherwise undisturbed) when setjmp() returns from a
: matching longjmp().

AFAIK, the problem occurs when certain non-volatile locals are updated
by the initial continuation AFTER the setjmp() and subsequently read
by a post-longjmp continuation.

: This has absolutely no applicability to threads. Coroutines, perhaps; though

Thanks for getting into that --- it's exactly the sort of feedback I
want and need. These points are scattered throughout the sections of
that writeup, but deserve to be gathered into a summary/warning, e.g.:

SUMMARY/WARNING. Operations on the basic thread-coordination
mechanisms (i.e., on locks/mutexes and two-thread turn flags) require
barriers to prevent accesses to the objects that these mechanisms
protect from slipping past points of coordination, e.g., into or out of
critical regions. The thread-shared objects that represent the states
of these mechanisms must be atomic, isolated, and have
volatile-qualified types. Other thread-shared data require the
protection of these mechanisms but do not require the overhead burden
of isolation, atomicity, or volatile-qualified types.

Tom Payne

Alexander Terekhov

Aug 17, 2002, 3:42:28 PM

t...@cs.ucr.edu wrote:
>
> David Butenhof <David.B...@compaq.com> wrote:
> [...]
> : Threads are about concurrency, not context switches.
>
> There can be threads without concurrency --- pseudo-concurrency (based
> on context switching) suffices.

Those are "synchronous coroutines"; NOT threads. ;-)

[...]


> : If you're going to look at threads as synchronous coroutines, for example,
> : reordering of memory operations is irrelevant -- that cannot occur within
> : the scope of a single processor.
>
> Huh?

Memory access reordering IS taking place, but you just can't
observe/notice it on a uniprocessor.

[...]


> The thread-shared objects that represent the states
> of these mechanisms must be atomic, isolated, and have
> volatile-qualified types.

Nah, I'd rather have something along the lines of proposed(*)
"atomic" java classes with memory synchronization semantics.

regards,
alexander.

(*) http://www.cs.umd.edu/~pugh/java/memoryModel/archive/0855.html

P.S. With respect to reordering: reads/writes that are done inside
sync.regions simply can't escape lock/unlock "barriers" -- in both
directions; other "transformations" are "OLL KORRECT", I guess.

t...@cs.ucr.edu

Aug 17, 2002, 8:29:21 PM
Alexander Terekhov <tere...@web.de> wrote:

: t...@cs.ucr.edu wrote:
:>
:> David Butenhof <David.B...@compaq.com> wrote:
:> [...]
:> : Threads are about concurrency, not context switches.
:>
:> There can be threads without concurrency --- pseudo-concurrency (based
:> on context switching) suffices.

: Those are "synchronous coroutines"; NOT threads. ;-)

Huh? In response to Dave's claim that "threads are about concurrency,
not context switches", I'm simply noting that:
1) there are real threads on uniprocessor systems,
2) there is no true concurrency among those threads, only
pseudoconcurrency,
3) that pseudoconcurrency is based on context switching.
Do you disagree?

: [...]


:> : If you're going to look at threads as synchronous coroutines, for example,
:> : reordering of memory operations is irrelevant -- that cannot occur within
:> : the scope of a single processor.
:>
:> Huh?

: Memory access reordering IS taking place, but you just can't
: observe/notice it on a uniprocessor.

Barriers are not required where reordering cannot be noticed. It's my
impression, however, that barriers are required on some uniprocessor
systems. Even uniprocessors have to be told not to let access to
critical data slip out of critical regions.

: [...]


:> The thread-shared objects that represent the states
:> of these mechanisms must be atomic, isolated, and have
:> volatile-qualified types.

: Nah, I'd rather have something along the lines of proposed(*)
: "atomic" java classes with memory synchronization semantics.

Good point. Rather than getting into language specific terminology,
I'll replace

"... and have volatile-qualified types."

with

"... and have to be treated as volatile and observable."

In C/C++ one can accomplish this treatment via volatile-qualified
types. In most any language, however, it's more natural to instead
use the approach you cited, namely restricting access to special get()
and set() functions that respect volatility and observability,
respectively. Thanks.

: regards,
: alexander.

: (*) http://www.cs.umd.edu/~pugh/java/memoryModel/archive/0855.html

: P.S. With respect to reordering: reads/writes that are done inside
: sync.regions simply can't escape lock/unlock "barriers" -- in both
: directions; other "transformations" are "OLL KORRECT", I guess.

I don't understand your point here. :-(

Tom Payne

Alexander Terekhov

Aug 19, 2002, 7:02:17 AM

t...@cs.ucr.edu wrote:
[...]
> :> : Threads are about concurrency, not context switches.
> :>
> :> There can be threads without concurrency --- pseudo-concurrency (based
> :> on context switching) suffices.
>
> : Those are "synchronous coroutines"; NOT threads. ;-)
>
> Huh? In response to Dave's claim that "threads are about concurrency,
> not context switches", I'm simply noting that:
> 1) there are real threads on uniprocessor systems,
> 2) there is no true concurrency among those threads, only
> pseudoconcurrency,
> 3) that pseudoconcurrency is based on context switching.
> Do you disagree?

Well, "brand new" Intel's-hyper-threaded-stuff aside, how about the
"pseudoconcurrency" that is provided by the following "uniprocessor":

http://www.research.ibm.com/journal/rd/446/borkenhagen.html
("A multithreaded PowerPC processor for commercial servers")

> :> : If you're going to look at threads as synchronous coroutines, for example,
> :> : reordering of memory operations is irrelevant -- that cannot occur within
> :> : the scope of a single processor.
> :>
> :> Huh?
>
> : Memory access reordering IS taking place, but you just can't
> : observe/notice it on a uniprocessor.
>
> Barriers are not required where reordering cannot be noticed. It's my
> impression, however, that barriers are required on some uniprocessor
> systems. Even uniprocessors have to be told not to let access to
> critical data slip out of critical regions.

Memory-mapped I/O stuff aside, consider [from the document referenced
below]:

"....
Storage synchronisation is necessary in a uniprocessor system when
generating or modifying instructions. We don't talk about generating
or modifying instructions in this article, because POWER4 shows no
new behaviour in this regard. Programs running on a uniprocessor
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

system, or programs that don't share memory with other programs or
threads, will produce the intended results without the need for
storage synchronization; this is a consequence of the architecture's
requirement that all memory accesses performed by a given processor
appear to be performed in program order with respect to that processor
and applies on all processor implementations."

> : P.S. With respect to reordering: reads/writes that are done inside
> : sync.regions simply can't escape lock/unlock "barriers" -- in both
> : directions; other "transformations" are "OLL KORRECT", I guess.
>
> I don't understand your point here. :-(

Well, my poor English aside, it's rather "murky" stuff, indeed. ;-) Consider:

http://www.ibm.com/servers/esdd/articles/power4_mem.html
(POWER4 and shared memory synchronisation)

http://www.ibm.com/chips/products/powerpc/newsletter/apr2001/design-h-t.html

"Synchronization Instructions

Commonly misunderstood PowerPC instructions are those that perform
synchronization. These instructions include:

- Enforce In/Order Execution of I/O (eieio) - This instruction is for
data accesses to guarantee that loads and stores complete with respect
to one another. Since PowerPC defines a weakly ordered storage model
in which loads and stores can complete out of order, this instruction
exists to guarantee ordering where necessary.

- Synchronize (sync) - This instruction guarantees that the preceding
instructions complete before the sync completes. This instruction is
useful for guaranteeing load/store access completion. For example,
a sync may be used when writing memory mapped I/O registers to a
slow device before making further access to the device.

- Instruction Synchronize (isync) - This instruction provides ordering
for all effects of all instructions executed by the processor. It is
used to synchronize the instruction context, such as memory translation,
endianness, cache coherency, etc. Instruction pipelines are flushed when
an isync is performed, and the next instruction is fetched in the new
context. This instruction is useful for self-modifying code."

http://www.ibm.com/chips/techlib/techlib.nsf/techdocs/852569B20050FF778525699600682CC7/$file/booke_rm.pdf
(search for mbar (that's the new moniker for the "old" eieio) and msync ("old" name: sync))

t...@cs.ucr.edu

Aug 19, 2002, 11:53:22 AM
Alexander Terekhov <tere...@web.de> wrote:

: t...@cs.ucr.edu wrote:
: [...]
:> :> : Threads are about concurrency, not context switches.
:> :>
:> :> There can be threads without concurrency --- pseudo-concurrency (based
:> :> on context switching) suffices.
:>
:> : Those are "synchronous coroutines"; NOT threads. ;-)
:>
:> Huh? In response to Dave's claim that "threads are about concurrency,
:> not context switches", I'm simply noting that:
:> 1) there are real threads on uniprocessor systems,
:> 2) there is no true concurrency among those threads, only
:> pseudoconcurrency,
:> 3) that pseudoconcurrency is based on context switching.
:> Do you disagree?

: Well, "brand new" Intel's-hyper-threaded-stuff aside, how about the
: "pseudoconcurrency" that is provided by the following "uniprocessor":

: http://www.research.ibm.com/journal/rd/446/borkenhagen.html
: ("A multithreaded PowerPC processor for commercial servers")

That processor apparently involves fine-grain *pseudo-concurrency*:
"When one of the threads would normally be stalled, instructions from
the other threads can utilize the processor's resources..."

:> :> : If you're going to look at threads as synchronous coroutines, for example,
:> :> : reordering of memory operations is irrelevant -- that cannot occur within
:> :> : the scope of a single processor.
:> :>
:> :> Huh?
:>
:> : Memory access reordering IS taking place, but you just can't
:> : observe/notice it on a uniprocessor.
:>
:> Barriers are not required where reordering cannot be noticed. It's my
:> impression, however, that barriers are required on some uniprocessor
:> systems. Even uniprocessors have to be told not to let access to
:> critical data slip out of critical regions.

: Memory-mapped I/O stuff aside, consider [from the document referenced
: below]:

: "....
: Storage synchronisation is necessary in a uniprocessor system when
: generating or modifying instructions. We don't talk about generating
: or modifying instructions in this article, because POWER4 shows no
: new behaviour in this regard. Programs running on a uniprocessor
: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
: system, or programs that don't share memory with other programs or
: threads, will produce the intended results without the need for
: storage synchronization; this is a consequence of the architecture's
: requirement that all memory accesses performed by a given processor
: appear to be performed in program order with respect to that processor
: and applies on all processor implementations."

My impression was that *some* aggressive super-scalar architectures
require synchronization by the program. In principle (but not in
practice), one might also see out-of-order issues from the compilers,
e.g., if one in-lined coordination operations like lock() and
unlock().

:> : P.S. With respect to reordering: reads/writes that are done inside
:> : sync.regions simply can't escape lock/unlock "barriers" -- in both
:> : directions; other "transformations" are "OLL KORRECT", I guess.
:>
:> I don't understand your point here. :-(

: Well, my poor English aside, it's rather "murky" stuff, indeed. ;-) Consider:

: http://www.ibm.com/servers/esdd/articles/power4_mem.html
: (POWER4 and shared memory synchronisation)

: http://www.ibm.com/chips/products/powerpc/newsletter/apr2001/design-h-t.html

: "Synchronization Instructions

: Commonly misunderstood PowerPC instructions are those that perform
: synchronization. These instructions include:

: - Enforce In/Order Execution of I/O (eieio) - This instruction is for
: data accesses to guarantee that loads and stores complete with respect
: to one another. Since PowerPC defines a weakly ordered storage model
: in which loads and stores can complete out of order, this instruction
: exists to guarantee ordering where necessary.

So, with respect to traffic between:

* cache and memory: "[the architecture has] a weakly ordered storage model
  in which loads and stores can complete out of order ..."

* cache and processor: "the architecture [has a] requirement that all
  memory accesses performed by a given processor appear to be performed
  in program order with respect to that processor..."

Sounds reasonable and perhaps I was simply confusing the two. Thanks.

Tom Payne
