
using volatile bools instead of mutexes when synchronizing threads


skoco

Feb 15, 2004, 7:10:48 AM
Hello!
I was thinking about a interesting thing of replacing mutexes (which
obviously needs some kind of syscall) with volatile variables (bool
will do all needed functionality)
I haven't tried it yet (i will soon), but do you think there could
be some kind of problem? volatile specifier ensures that bool variable
will be always read into registers only at the time it's needed, so
holding the variable in register and changing its value in another
thread shouldn't happen, am I right?
Please be mercifull with critics if this idea is pure stupidity :-)

Thanks a lot for some comments.
skoco

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Ralf

Feb 16, 2004, 7:22:53 AM
"skoco" <skoc...@yahoo.com> wrote in message
news:d7be6dd6.04021...@posting.google.com...

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)
>
> Thanks a lot for some comments.
> skoco


Hi,

Using volatile bools as mutexes will cause problems.
For example: testing the run condition (if (bMut == false)) can give
you the green light. But one line later a context switch to another
thread can occur. The second thread can then test the same condition
and also get the green light!

But I think there is a solution: for two threads you need two
volatile bools, b1 and b2. The first thread has to set the variables
in the order b1, b2; the second thread has to set them in the reverse
order.
Between the two assignments there should be another check of the
variables (as in the double-checked locking pattern).

Ralf


www.oop-trainer.de/indexp1.html

Dhruv Matani

Feb 16, 2004, 7:26:32 AM
On Sun, 15 Feb 2004 07:10:48 -0500, skoco wrote:

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)

Yes, the idea is pure stupidity AFAICS! That's because there are no
single-bit registers, and even if there were, I doubt that the hardware
would be able to guarantee atomic instructions on those registers for
those particular operations. You already have a lock prefix on x86.
Make use of it!


Regards,
-Dhruv.

Peter Koch Larsen

Feb 16, 2004, 7:27:47 AM

"skoco" <skoc...@yahoo.com> wrote in message
news:d7be6dd6.04021...@posting.google.com...

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).

First, mutexes do not necessarily require a "syscall", if by that you
mean a call to the kernel.

> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?

There could be many kinds of problems, so don't. ;-) Also, your
reasoning above is not valid: even if a variable is declared volatile,
nothing prevents it from being read into a register.

> Please be merciful with criticism if this idea is pure stupidity :-)
>
> Thanks a lot for some comments.
> skoco
>

You could visit comp.programming.threads and have a look at one of the
many threads related to your question. The short answer is that you
should use some library (pthreads/boost/whatever) and rely on it to get
your mutexes.

Kind regards
Peter

Graeme Prentice

Feb 16, 2004, 7:34:34 AM
On 15 Feb 2004 07:10:48 -0500, skoco wrote:

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)

You can't just read a bool, check for false, then set it true and hope
that you're the only thread that then thinks it owns the flag. If a
thread switch occurs between reading the bool and setting it true, you
can get two threads both thinking they "own" the flag. You need a
non-interruptible "test and set" instruction. This is sort of what
InterlockedIncrement does on Windows. On a Windows or Unix platform
you are better off using the operating-system-supplied mutex. On an
embedded platform, if there is no "test and set" instruction available,
you can possibly disable interrupts for a short time, to ensure no
thread switch occurs between the read and the write.

Normally, a mutex has additional functionality: when a task has to wait
on the mutex, it goes into a "semaphore wait queue" rather than into
the run queue, and when the semaphore gets signalled, the task is woken
up. This ensures that all tasks that wait on the same semaphore
eventually get a turn.

Regarding volatile, check my recent question in this newsgroup.

Graeme

Michael Tiomkin

Feb 16, 2004, 7:36:21 AM
skoc...@yahoo.com (skoco) wrote in message news:<d7be6dd6.04021...@posting.google.com>...

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)

The problem is that you cannot safely hold the value in a register
before changing it and writing it back to memory in the same thread:
another thread can try to update the memory at the same time.
A safe and efficient implementation of a mutex without system calls
needs an atomic test-and-set operation that is issued directly on
memory. Usually, this operation doesn't exist in high-level languages.
BTW, you can easily do this with inline assembly. There is only
a performance problem: if your thread's request is not satisfied,
you'll need to busy-wait in a sleep-and-test-and-set loop. A standard
system semaphore object would instead suspend your thread and put it
in a waiting queue rather than busy-waiting.

A book on realtime or OS programming can be helpful in this case.

Ingo Berg

Feb 16, 2004, 6:12:04 PM
Don't do that!

It won't work, because checking the flag and setting it requires more
than one processor instruction. So in theory execution could switch to
another thread right after the first one has determined the flag is not
set, but before it could set it. A second thread checking that flag
would not recognize that a flag change is about to happen in the first
thread, and hence this thread could enter the function too. It does not
matter if the variable is volatile, since the first thread was
interrupted before it was able to change the value.
What makes this approach even worse is that it might look like it
works, because most of the time the thread switch will not take place
at this critical point in time. Errors that originate from such
mistakes are very hard to detect.

Kind regards,
Ingo

Thomas Richter

Feb 16, 2004, 6:15:27 PM
Hi,

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)

What you need for mutexes is some kind of atomic "test and set"
operation, or an atomic "increment and test" operation. "Atomic" in
the sense that it cannot be interrupted by other processes, threads
or (possibly) other CPUs in a multi-processor system.

volatile is not sufficient to guarantee that.

You *might* get away with a mutex count (an increment/decrement/test
operation such as "if (++i == 0) { }") that happens to be atomic on
some popular platforms, but that is outside the scope of C++, which
doesn't give you *any* guarantee about the atomicity of these
operations.

So long,
Thomas

Marco Oman

Feb 16, 2004, 6:17:02 PM
On 15 Feb 2004 07:10:48 -0500, skoco <skoc...@yahoo.com> wrote:

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)

No need to be merciful, since I have been bitten by a bug
born from the same idea.

That said, what I post is more another question than an answer.
On a dual-processor machine (with a shared-memory architecture)
one may end up with one thread writing (from processor A) to actual
memory and the other one (which happens to always use processor B)
reading from its cache (and so missing the shared-memory update).
If you want to avoid system calls on x86 architectures, there are
some instructions that implement a form of interlocked memory
access and so are very fast. Win32 has the Interlocked* functions
that wrap them; they also have Linux counterparts.

Now the question: can the proposed approach work on a
single-processor machine?

David Turner

Feb 16, 2004, 6:22:47 PM
Hi

> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).

I think you'll find that this is impossible. At the very least, you need an
atomic test-and-set primitive, which you don't have access to from C++
(because not all platforms have such a thing). I'm afraid you're stuck with
that syscall :-).

Try comp.programming.threads for more information about synchronization.

Regards
David Turner

v

Feb 17, 2004, 6:15:27 AM
hello,
well, not really an answer, and not really encouraging to do it the way
I do, but I rarely protect bool variables from being accessed by
concurrent threads. E.g.:

class thread {
    // ...
public:
    void run() {
        while (!stop) { /* ... */ }
    }
    void do_stop() {
        stop = true;
    }
};

Correct me if wrong (I am no assembly guru), but in the worst case
(while(!stop) and stop=true running concurrently), what could happen?

a) stop is in memory (not in a register) and while(!stop) reads the old
value, although it has been changed in a register and is written back
shortly after: the loop runs one more time.
b) stop is in a register, stop=true needs two cycles, while(!stop)
reads the incomplete value and loops one more time.

This assumes that bool(false) is represented as all bits 0, and
bool(true) as any bit set to 1, independent of the length.

bye
vlado


"skoco" <skoc...@yahoo.com> wrote in message
news:d7be6dd6.04021...@posting.google.com...

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?
> Please be merciful with criticism if this idea is pure stupidity :-)

Daniel Pfeffer

Feb 17, 2004, 6:16:12 AM
"skoco" <skoc...@yahoo.com> wrote in message
news:d7be6dd6.04021...@posting.google.com...

> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).
> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem? The volatile specifier ensures that the bool
> variable will always be read into a register only at the time it's
> needed, so holding the variable in a register while its value is
> changed in another thread shouldn't happen, am I right?

Your understanding of 'volatile' is correct, but there are a few problems
with this approach:

1. If the 'bool' is owned by a different thread your code must go into a
busy loop, sampling the 'bool' every so often. This wastes CPU resources.
2. Scheduling algorithms that provide a priority boost when a thread is
suffering from "CPU time starvation" will not work. (e.g. some O/Ses may
temporarily raise the priority of a thread - within limits - if it hasn't
received CPU resources in the last X seconds).
3. The approach does not extend to waiting on multiple locks (see the
WaitForMultipleObjects() API in Win32). Such waits are sometimes
unavoidable.
4. The O/S typically monitors the resources acquired by the various threads
and will release a mutex held by a thread that crashes. This will not happen
with your 'bool'.

To summarise, there are very good reasons why mutexes are part of the O/S.
Use the O/S constructs, and don't re-invent the wheel.


HTH,
Daniel Pfeffer

ka...@gabi-soft.fr

Feb 17, 2004, 6:20:50 AM

> I was thinking about an interesting idea: replacing mutexes
> (which obviously need some kind of syscall) with volatile variables
> (a bool will do all the needed functionality).

If you are concerned with synchronizing between threads, it doesn't
work.

> I haven't tried it yet (I will soon), but do you think there could
> be some kind of problem?

Sure. Like the fact that different threads will see different values at
different times.

> The volatile specifier ensures that the bool variable will always be
> read into a register only at the time it's needed, so holding the
> variable in a register while its value is changed in another thread
> shouldn't happen, am I right?

It depends on how the implementation defines access to a volatile. Most
implementations today don't define it in a way that can be useful for
communicating between threads.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Jerry Feldman

Feb 17, 2004, 6:28:32 AM
On 16 Feb 2004 07:34:34 -0500
Graeme Prentice <inv...@yahoo.co.nz> wrote:

> Normally, a mutex would have additional functionality that when the
> task has to wait on the mutex, it goes into a "semaphore wait queue"
> rather than into the run queue, and when the semaphore gets signalled,
> the task is woken up. This ensures that all tasks that wait on the
> same semaphore, eventually get a turn.

While this is a bit off topic, this is not entirely true; it depends on
the specific mutex implementation. The bottom line is that if the mutex
is unlocked, the overhead should be very low. There is the POSIX
pthread_mutex_trylock() function, which returns 0 if the mutex was
unlocked and EBUSY when it is locked. Depending on the implementation,
a POSIX fast mutex should have very low overhead, and to do better you
would have to implement your own in assembler. Neither C++ nor C has
atomic variables as part of the standard.

--
Jerry Feldman <gaf-nospam-at-blu.org>
Boston Linux and Unix user group
http://www.blu.org PGP key id:C5061EA9
PGP Key fingerprint:053C 73EC 3AC1 5C44 3E14 9245 FB00 3ED5 C506 1EA9

Bill Wade

Feb 17, 2004, 6:33:35 AM
skoc...@yahoo.com (skoco) wrote in message news:<d7be6dd6.04021...@posting.google.com>...
> Hello!
> I was thinking about an interesting idea: replacing mutexes (which
> obviously need some kind of syscall) with volatile variables (a bool
> will do all the needed functionality).

The standard doesn't make any promises (neither do I). However you
will want to use volatile sig_atomic_t instead of volatile bool. A
problem with bool is that it may be smaller than the machine's word
size, so a bool variable may share a word with another variable. When
the two variables are written (one from each thread) one of the writes
may be lost, because a write is implemented as (load word, modify
bool, write word) and this may not be atomic.

In most cases you'll want to have only one writer of the variable.

Of course you don't get the blocking that a mutex provides.

YMMV, HTH

Maciej Sobczak

Feb 17, 2004, 12:29:12 PM
Hi,

v wrote:

> well, not really an answer, and not really encouraging to do it the
> way I do, but I rarely protect bool variables from being accessed by
> concurrent threads. E.g.:
>
> class thread {
>     // ...
> public:
>     void run() {
>         while (!stop) { /* ... */ }
>     }
>     void do_stop() {
>         stop = true;
>     }
> };
>
> Correct me if wrong (I am no assembly guru), but in the worst case
> (while(!stop) and stop=true running concurrently), what could happen?

It may work most of the time. If the bool flag is volatile, it may even
work more often than without it.
You may find that on a single Intel CPU it just works, because when one
thread sets the flag to true, the other will sooner or later notice it,
and I believe this is the purpose of the flag. There are no trap values
for bool (on an Intel CPU), so nothing wrong can happen...

But let's consider what will happen if you run your program on a
machine with many CPUs. It is likely that one of the threads will be
executing on one CPU and the other thread on a second CPU. These CPUs
may have separate caches, so when one thread (do_stop) writes true to
the stop flag, it may not become visible to the second (run) thread,
because the two caches will have different views of the real value of
the flag.
You need to ensure visibility between the two threads. A shared
variable alone does not cut it, so special instructions are needed (to
ensure visibility between threads, for example by synchronizing the
caches). Synchronization primitives do this, so you should use them
instead of relying on luck.

--
Maciej Sobczak : http://www.msobczak.com/
Programming : http://www.msobczak.com/prog/

ka...@gabi-soft.fr

Feb 17, 2004, 4:42:58 PM
Graeme Prentice <inv...@yahoo.co.nz> wrote in message
news:<lqmv20p530nnc7i5l...@4ax.com>...

> On 15 Feb 2004 07:10:48 -0500, skoco wrote:

> > I was thinking about an interesting idea: replacing mutexes (which
> > obviously need some kind of syscall) with volatile variables (a
> > bool will do all the needed functionality).
> > I haven't tried it yet (I will soon), but do you think there could
> > be some kind of problem? The volatile specifier ensures that the
> > bool variable will always be read into a register only at the time
> > it's needed, so holding the variable in a register while its value
> > is changed in another thread shouldn't happen, am I right?
> > Please be merciful with criticism if this idea is pure stupidity :-)

> You can't just read a bool, check for false, then set it true and hope
> that you're the only thread that then thinks it owns the flag. If a
> thread switch occurs between reading the bool and setting it true, you
> can get two threads both thinking they "own" the flag. You need a
> non-interruptible "test and set" instruction.

You also need some sort of guarantee that the other processes (perhaps
running on other processors) will see the same change, and not access
some older value that they happen to have in cache.

> This is sort of what interlocked- increment does on windows.

Windows supplies some special atomic increment and atomic decrement
functions, which may be useful in some cases. (I use them for reference
counting, for example.) Other platforms do so as well, although with a
different name.

> On a windows or unix platform you are better to use the operating
> system supplied mutex. On an embedded platform, if there is no "test
> and set" instruction available, you can possibly disable interrupts
> for a short time, to ensure no thread switch occurs between the read
> and the write.

The key here, of course, is that it is always platform specific, and
what is necessary cannot normally be expressed in C++.

> Normally, a mutex would have additional functionality that when the
> task has to wait on the mutex, it goes into a "semaphore wait queue"
> rather than into the run queue, and when the semaphore gets signalled,
> the task is woken up. This ensures that all tasks that wait on the
> same semaphore, eventually get a turn.

> Regarding volatile, check my recent question in this newsgroup.

The important point to remember with regards to volatile and threading
is that for multithread accesses, volatile is neither necessary nor
sufficient.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Graeme Prentice

Feb 17, 2004, 4:43:48 PM
On 17 Feb 2004 06:28:32 -0500, Jerry Feldman wrote:

>On 16 Feb 2004 07:34:34 -0500
>Graeme Prentice <inv...@yahoo.co.nz> wrote:
>
> > Normally, a mutex would have additional functionality that when the
> > task has to wait on the mutex, it goes into a "semaphore wait queue"
> > rather than into the run queue, and when the semaphore gets signalled,
> > the task is woken up. This ensures that all tasks that wait on the
> > same semaphore, eventually get a turn.
> While this is a bit off topic, this is not entirely true, and depends
> on the specific mutex implementation.

Sure. I was just trying to point out that there's more to a mutex than
merely a boolean lock.

> The bottom line is that if the mutex is unlocked, the overhead should
> be very low. There is the POSIX pthread_mutex_trylock() function that
> returns 0 if the mutex is unlocked and EBUSY when locked. Depending
> on the implementation, the POSIX fast mutex should have very low
> overhead, and to do better, you could implement your own in
> assembler. Neither C++ nor C have atomic variables as part of the
> standard.

Yes, they do. There's a type called sig_atomic_t, defined in the C
standard (7.14 para 2) as atomic in the presence of asynchronous
interrupts, and incorporated into the C++ standard (17.4.3.1.4).

Graeme

Michael Furman

Feb 18, 2004, 7:04:52 AM

<ka...@gabi-soft.fr> wrote in message
news:d6652001.04021...@posting.google.com...
> [...]

> The important point to remember with regards to volatile and threading
> is that for multithread accesses, volatile is neither necessary nor
> sufficient.

I more or less agree with you that volatile is not sufficient for
anything in the multithreaded case, but isn't it necessary?
Could you please elaborate on what exactly you mean. My guesses:

1. The exact semantics of volatile are not defined in the language, so
any use would be unportable, and "neither necessary nor sufficient"
relates to portable code. I agree with that, so no further questions.

2. Volatile does not change the behavior of code in any useful way.
... anything else.

Let's consider the two threads:

Thread1:
    stopflag = false;
    while (!stopflag)
    { /* ...do something without any access to "stopflag"... */ }
    TurnPowerOff();

Thread2:
    /* ...do something without any access to "stopflag"... */
    // It is time to turn power off
    while (true)
    {
        stopflag = true;
        Sleep(1);
    }

In the absence of volatile in the definition of "stopflag", the
compiler can optimize "while(!stopflag)" into "while(true)" and
TurnPowerOff() would never be called. If volatile (with what I believe
is its absolutely typical semantics) is present, I need only a weak
constraint on the hardware multithreading model to make it work as
intended (i.e. to turn the power off at some point): a written value
will become visible from another thread sooner or later.

In short: do you think that volatile does not change the behavior of
the code (without interrupt handlers) in any observable way? If not,
do you believe that there is some observable difference, but it cannot
be used?

Regards,
Michael Furman

ka...@gabi-soft.fr

Feb 18, 2004, 5:53:09 PM
wa...@stoner.com (Bill Wade) wrote in message
news:<2bbfa355.04021...@posting.google.com>...

> skoc...@yahoo.com (skoco) wrote in message
> news:<d7be6dd6.04021...@posting.google.com>...

> > I was thinking about an interesting idea: replacing mutexes
> > (which obviously need some kind of syscall) with volatile
> > variables (a bool will do all the needed functionality).

> The standard doesn't make any promises (neither do I). However you
> will want to use volatile sig_atomic_t instead of volatile bool. A
> problem with bool is that it may be smaller than the machine's word
> size, so a bool variable may share a word with another variable. When
> the two variables are written (one from each thread) one of the writes
> may be lost, because a write is implemented as (load word, modify
> bool, write word) and this may not be atomic.

We've been through this before. Atomicity is only part of the problem
(although it is a real part). The other part is the
implementation-defined semantics of volatile. Arguably, the intent of
the standard is that if one processor writes a volatile variable to
memory, all other processors will see the results of that write.
Intent or no, however, it ISN'T what most current implementations
guarantee. On modern hardware, you need special instructions to
synchronize memory between processors, and the compilers I've seen
don't generate them, even if the variable is declared volatile.

> In most cases you'll want to have only one writer of the variable.

> Of course you don't get the blocking that a mutex provides.

There's a more general aspect that is worrying me. When I use a mutex,
it is to protect something. The mutex has three effects: it blocks all
but one thread, it suspends waiting threads so that they don't use CPU
resources, and it synchronizes memory. All memory, including memory
not declared volatile. This last piece of functionality will
definitely not be present if you spin-lock on a bool; unlike a call to
the system API for a lock, the compiler has no way of knowing that
other objects must also be written to or read from memory. Thus if I
write something like:

    extern int i;
    volatile atomic_bool lock = 0;
    i = 0;
    while (lock == 0) {
    }
    cout << i;

the compiler can "see" that i is not modified, and will probably just
generate code to output a literal 0. It would be a very rare compiler
that would reread i from memory in such a case.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Gerhard Menzl

Feb 18, 2004, 5:56:45 PM
Graeme Prentice wrote:

>>Neither C++ nor C have atomic variables as part of the standard.
>
>
> Yes they do. There's a type called sig_atomic_t defined in the C
> standard (7.14 para 2) as atomic in the presence of asynchronous
> interrupts and incorporated into the C++ standard (17.4.3.1.4)

Unfortunately, this is of no significance for thread synchronization,
which is what the original question was about.


Gerhard Menzl
--
Humans may reply by replacing the obviously faked part of my e-mail
address with "kapsch".

Simon Turner

Feb 19, 2004, 5:36:46 AM
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c0u750$1b7ifq$1...@ID-122417.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04021...@posting.google.com...
> > [...] The important point to remember with regards to volatile and
> > threading is that for multithread accesses, volatile is neither
> > necessary nor sufficient.
>
> I more or less agree with you that volatile is not sufficient for
> anything in the multithreaded case, but isn't it necessary? Could you
> please elaborate on what exactly you mean. My guesses:

To preempt your guesses slightly: this discussion is better suited to
comp.programming.threads. More accurately, this has been done to death
there, so the topic is probably best suited to searching groups.google.com
for "volatile group:comp.programming.threads" and reading back through the
articles found.

> 1. The exact semantics of volatile are not defined in the language,
> so any use would be unportable, and "neither necessary nor
> sufficient" relates to portable code. I agree with that, so no
> further questions.

IIRC, volatile just prevents the compiler from eliding loads from memory, by
informing it that a value may change between loads and can't be cached.

> 2. Volatile does not change the behavior of code in any useful way.
> ... anything else.

This is sort of true. Volatile does change behaviour in a way that is useful
for some purposes, just not for thread safety.

> Let's consider the two threads:
>
> Thread1:
>     stopflag = false;
>     while (!stopflag)
>     { /* ...do something without any access to "stopflag"... */ }
>     TurnPowerOff();
>
> Thread2:
>     /* ...do something without any access to "stopflag"... */
>     // It is time to turn power off
>     while (true)
>     {
>         stopflag = true;
>         Sleep(1);
>     }
>
> In the absence of volatile in the definition of "stopflag", the
> compiler can optimize "while(!stopflag)" into "while(true)" and
> TurnPowerOff() would never be called. If volatile (with what I
> believe is its absolutely typical semantics) is present, I need only
> a weak constraint on the hardware multithreading model to make it
> work as intended (i.e. to turn the power off at some point): a
> written value will become visible from another thread sooner or
> later.

That MAY be true, but what you're really doing is forcing a load from memory
in thread 1 each time round the loop. That load can still come from cache,
which can still contain the old value.

This will only work in practice on a single processor, or when some other
process is running so that the caches eventually get written to/read from
main memory.

> In short: do you think that volatile does not change the behavior of the code
> (w/o interrupt handlers) in any observable way? If not - do you believe
> that there is some observable difference, but it cannot be used??

OK, James told you it was neither necessary nor sufficient. For all the
painful details, search through the long and tedious discussions on
comp.programming.threads, but a precis is:

Not necessary:

the volatile qualifier only has any effect on aliased memory - ie, it only
makes sense if something other than the current function can change the
value at that location. Since calling any non-inlined function (such as
pthread_mutex_lock) forces the caller to reload the value from memory if it
is referenced later (the called function could have changed it), normal
synchronisation will work without the use of volatile.
E.g., in

int val; // not volatile

void foo() { val++; }

void bar()
{
    val = 0;
    while (!val) {
        foo();
    }
}

val doesn't need to be volatile-qualified, because the call to foo() forces
a reload anyway.


Not sufficient:

It isn't sufficient to get a changed value in a timely fashion (because the
value can spend an arbitrary amount of time loitering around in caches), to
enforce any kind of coherency, or to prevent race conditions. In short, it
doesn't do any of the things we use synchronisation for. The people who
invented mutexes et al weren't complete dummies, and wouldn't have bothered
if volatile would have sufficed. It doesn't.

Incoherency (see above):

bool tediousComputationFinished;
int tediousComputationResult;

int doTediousComputation();

// runs in thread 1
void computationThread()
{
    tediousComputationFinished = false;

    tediousComputationResult = doTediousComputation();
    tediousComputationFinished = true;
}

// runs in thread 2
void foo()
{
    while (!tediousComputationFinished) { /*sleep*/ }

    // use tediousComputationResult...
    // BUT just because the finished flag finally meandered its way through
    // a couple of caches DOESN'T mean the result is visible yet, so that
    // could be garbage.
    // You need locking (or platform-specific memory barriers) for this.
}
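For contrast, here is a hedged sketch of what the fixed version looks like. It uses C++11 std::mutex and std::condition_variable, which post-date this thread (pthread_mutex_lock and pthread_cond_wait are the period equivalents); the names are illustrative, not from the original post:

```cpp
#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool finished = false;   // both shared variables are protected by m
int resultValue = 0;

// Runs in thread 1: publish the result, then the flag, under the lock.
// Releasing the lock guarantees both writes are visible to whoever
// acquires it next.
void computationThread()
{
    int r = 6 * 7;       // stand-in for doTediousComputation()
    std::lock_guard<std::mutex> lock(m);
    resultValue = r;
    finished = true;
    cv.notify_one();
}

// Runs in thread 2: once the flag is observed under the lock,
// resultValue is guaranteed to be visible too -- no garbage reads.
int waitForResult()
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return finished; });
    return resultValue;
}
```

The pairing matters: the same mutex must guard both the flag and the result, which is exactly the "synchronizes all memory" property of mutexes discussed elsewhere in this thread.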

Michael Furman
Feb 19, 2004, 1:24:30 PM

<ka...@gabi-soft.fr> wrote in message
news:d6652001.04021...@posting.google.com...
[...]
> There's a more general aspect that is worrying me. When I use a mutex,
> it is to protect something. The mutex has three effects: it blocks all
> but one thread, it suspends waiting threads so that they don't use CPU
> resources, and it synchronizes memory. All memory, including that not
> declared volatile.

Could you please tell what is the "mutex" you are referring to? Have I missed
something, and mutex is now a part of the C++ standard? Or are you talking about
a POSIX mutex, with "C++" a part of the POSIX standard? If neither,
how do you know that a mutex (any one) protects "volatile memory"?

Regards,
Michael Furman

Balog Pal
Feb 20, 2004, 6:30:52 AM
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04021...@posting.google.com...

> extern int i ;
> volatile atomic_bool lock = 0 ;
> i = 0 ;
> while ( lock == 0 ) {
> }
> cout << i ;
>
> The compiler can "see" that i is not modified, and will probably just
> generate code to output a literal 0. It would be a very rare compiler
> that will reread i from memory in such a case.

For some reason I always had the impression that *this* is the exact thing
volatile was invented for: to tell the compiler "this variable may be
changed by means not seen by the compiler". So if I have a variable used in
the above situation I always make it volatile. And that forbids the
assumption that i is still 0, as would be reasonable for a plain int.

And I know no other way to express that to the compiler -- so James, could
you explain your previous statement that 'volatile is neither sufficient nor
necessary' with threading? I agree it is not sufficient (for the general
case), but how can you live safely without it?

Paul

ka...@gabi-soft.fr
Feb 20, 2004, 6:34:08 AM
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c0u750$1b7ifq$1...@ID-122417.news.uni-berlin.de>...
> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04021...@posting.google.com...
> > [...]
> > The important point to remember with regards to volatile and
> > threading is that for multithread accesses, volatile is neither
> > necessary nor sufficient.

> I more or less agree with you that volatile is not sufficient for
> anything in the multithreaded case, but isn't it necessary?

No, since the other things that are necessary provide all of the
necessary guarantees.

> Could you please alaborate what exactly do you mean. My guesses:

> 1. The exact semantics of volatile are not defined in the language, so any
> use would be unportable. And "neither necessary nor sufficient" relates
> to portable code. I agree with that - so no further questions.

> 2. Volatile does not change the behavior of code in any useful way.
> ... anything else.

Basically the second. Using volatile will result in slower code being
generated, but will typically not have any other effect with regards to
threading.

> Let's consider the two threads:

> Thread1:
> stopflag = false;
> while(!stopflag)
> { ...do something w/o any access to "stopflag" .... }
> TurnPowerOff();

> Thread2:
> ...do something w/o any access to "stopflag" ....
> // It is time to turn power off
> while(true)
> {
> stopflag = true;
> Sleep(1);
> }

> In the absence of volatile in the definition of "stopflag" compiler
> can optimize "while(!stopflag)" into "while(true)" and "TurnPowerOff"
> would never be called.

With or without volatile, there is no guarantee that Thread1 will ever
see the change to memory made in Thread2. In order to ensure that the
modifications in one thread are visible in another thread, it is
necessary (on most modern machines) to issue some special instructions
(membar on a Sparc, for example). Arguably, the intent of volatile is
that the compiler generate these instructions when a volatile variable
is accessed, but in practice, they don't (and Posix very definitely
doesn't require them to in C).

According to Posix, the above code has undefined behavior, so the
problem of whether the compiler optimizes accesses or not is moot. In
order for the behavior to be defined, *ALL* accesses to stopflag must be
protected, normally by a mutex. In which case, Posix does guarantee
that the code works. With or without volatile.

The real problem, I think, is that the standard (C or C++) doesn't
address threading issues at all. Other standards (such as Posix) do,
and they have chosen not to make volatile relevant in any way. A Posix
compliant C (and presumably C++, although Posix doesn't acknowledge the
existence of C++) is not allowed to optimize accesses across calls to
functions like pthread_mutex_lock and pthread_mutex_unlock, and these
functions are required to issue any necessary barrier instructions. So
without the locks, you have undefined behavior, with or without
volatile, and with the locks, everything is correct, with or without
volatile.

> If volatile (with its, I believe, absolutely typical semantics) is present I
> need only a weak constraint on the hardware multithreading model to
> make it work as supposed (i.e. to turn power off at some point): a
> written value will be visible from another thread sooner or later.

Where do you get this guarantee? It isn't true on some of the systems I
work on.

> In short: do you think that volatile does not change the behavior of the
> code (w/o interrupt handlers) in any observable way? If not - do you
> believe that there is some observable difference, but it cannot be used??

Interrupt handlers are a different question. You'll have to see what
your system requires for these: volatile may or may not be useful,
depending on what the system requires. I suppose that the same thing is
true for threads, but volatile is not useful with threads under Posix,
nor, as far as I can tell, under Windows.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Tom Tanner
Feb 20, 2004, 6:38:18 AM
ka...@gabi-soft.fr wrote in message
<huge snip>

> Of course you don't get the blocking that a mutex provides.
>
> There's a more general aspect that is worrying me. When I use a mutex,
> it is to protect something. The mutex has three effects: it blocks all
> but one thread, it suspends waiting threads so that they don't use CPU
> resources, and it synchronizes memory. All memory, including that not
> declared volatile. This last functionality will definitly not be
> present if you spin lock on a bool -- unlike a call to the system API
> for a lock, the compiler has no way of knowing that other objects must
> also be written to or read from memory. Thus if I write something like:
>
> extern int i ;
> volatile atomic_bool lock = 0 ;
> i = 0 ;
> while ( lock == 0 ) {
> }
> cout << i ;
>
> The compiler can "see" that i is not modified, and will probably just
> generate code to output a literal 0. It would be a very rare compiler
> that will reread i from memory in such a case.

But you haven't told the compiler that i can be changed by things
outside of the code, so why should any compiler reload i? i is not
volatile the way lock is. "extern" doesn't automatically imply "volatile".

Presumably the standard says somewhere that a call to a function is
entirely at liberty to marmelise any external variables, so

extern int i; i = 0; wibble(); cout << i;

has to reload i, whether or not wibble accesses an O/S mutex.

Bill Wade
Feb 20, 2004, 8:44:01 AM
ka...@gabi-soft.fr wrote

>
> We've been through this before. Atomicity is only part of the problem
> (although it is a real part). The other part is the implementation
> defined semantics of volatile. Arguably, the intent of the standard is
> that if one processor writes a volatile variable to memory, all other
> processors will see the results of that write. Intent or no, however,
> it ISN'T what most current implementations guarantee.

But some do, and many single-processor machines support more than one
thread. A few even support more than one process and shared memory
;-).

I guess your point is that if you want to avoid the mutex by looking
at memory, you'll want to ensure that your look is both atomic and
synchronized.

On some systems, volatile bool will be atomic and synchronized. On
some more systems volatile sig_atomic_t will be atomic and
synchronized. On some systems neither does what you want. RTFM.
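To pin down the sig_atomic_t case Bill mentions: a volatile sig_atomic_t flag written from an asynchronous signal handler and polled in the same thread is the one pattern the C and C++ standards themselves sanction, independent of any threading standard. A minimal sketch (the handler and flag names are made up for illustration):

```cpp
#include <csignal>

// sig_atomic_t is the only type the language guarantees can be read
// and written atomically with respect to an async signal handler;
// volatile stops the compiler from caching it across the polling loop.
volatile std::sig_atomic_t gotSignal = 0;

extern "C" void onSignal(int)
{
    gotSignal = 1;   // setting such a flag is about all a portable handler may do
}
```

Note this says nothing about a second thread or a second CPU: the guarantee covers interruption of one thread of control by its own signal handler only.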

Michael Furman
Feb 20, 2004, 8:44:45 AM

"Simon Turner" <s_j_t...@yahoo.co.uk> wrote in message
news:ea3f115.04021...@posting.google.com...
> [...]

> > I more or less agree with you that volatile is not sufficient for anything
> > in the multithreaded case, but isn't it necessary? Could you please
> > elaborate what exactly do you mean. My guesses:
>
> To preempt your guesses slightly: this discussion is better suited to
> comp.programming.threads. More accurately, this has been done to death
> there, so the topic is probably best suited to searching groups.google.com
> for "volatile group:comp.programming.threads" and reading back through the
> articles found.

Thanks for the good advice - though it is not for me: I have not started
this thread; I asked James to elaborate on his point.

>
> > 1. The exact semantics of volatile are not defined in the language, so any use
> > would be unportable. And "neither necessary nor sufficient" relates to
> > portable code. I agree with that - so no further questions.
>
> IIRC, volatile just prevents the compiler from eliding loads from memory, by
> informing it that a value may change between loads and can't be cached.
>
> > 2. Volatile does not change the behavior of code in any useful way. ...
> > anything else.
>
> This is sort of true. Volatile does change behaviour in a way that is useful
> for some purposes, just not for thread safety.

That is what I asked: proof of that! Though what is the meaning of proof for
"sort of true"? ;)

Have you missed the last sentence of the fragment you commented on:
"I need only a weak constraint on the hardware multithreading model to make
it work as supposed (i.e. to turn power off at some point): a written value
will be visible from another thread sooner or later."

> [...]


>
> OK, James told you it was neither necessary nor sufficient. For all the
> painful details, search through the long and tedious discussions on
> comp.programming.threads, but a precis is:

OK, so you are "sort of" forbidding me to ask James? That sounds interesting :-)

>
> Not necessary:
>
> the volatile qualifier only has any effect on aliased memory - ie, it only
> makes sense if something other than the current function can change the
> value at that location. Since calling any non-inlined function (such as
> pthread_mutex_lock) forces the caller to reload the value from memory if it
> is referenced later (the called function could have changed it), normal
> synchronisation will work without the use of volatile.

Though that is usually true (with compilers that process source files
independently, and with the "mutex" implemented as a non-inlined function),
it is:
1. unrelated to my example: you are trying to say that it is not necessary
only in the case where you are using "pthread_mutex_lock";
2. not necessarily true (for example, with a compiler that can look
inside other files, or if the function in question is inlined).

[...]

> Not sufficient:

Of course it is not: either some constraint (like the one I described in the
sentence you missed) or extra hardware functions (like memory barriers) are
needed.

Michael Furman
Feb 20, 2004, 8:45:07 AM

"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:c10u3k$1ceits$1...@ID-122417.news.uni-berlin.de...

>
> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04021...@posting.google.com...
> [...]
> > There's a more general aspect that is worrying me. When I use a mutex,
> > it is to protect something. The mutex has three effects: it blocks all
> > but one thread, it suspends waiting threads so that they don't use CPU
> > resources, and it synchronizes memory. All memory, including that not
> > declared volatile.
>
> Could you please tell what is the "mutex" you are referring to? Have I missed
> something, and mutex is now a part of the C++ standard? Or are you talking about
> a POSIX mutex, with "C++" a part of the POSIX standard? If neither,
> how do you know that a mutex (any one) protects "volatile memory"?
I meant:
> how do you know that a mutex (any one) protects "non-volatile memory"?
- sorry for the typo.

Graeme Prentice
Feb 20, 2004, 12:25:58 PM
On 19 Feb 2004 13:24:30 -0500, Michael Furman wrote:

>
><ka...@gabi-soft.fr> wrote in message
>news:d6652001.04021...@posting.google.com...
>[...]
>> There's a more general aspect that is worrying me. When I use a mutex,
>> it is to protect something. The mutex has three effects: it blocks all
>> but one thread, it suspends waiting threads so that they don't use CPU
>> resources, and it synchronizes memory. All memory, including that not
>> declared volatile.
>
>Could you please tell what is a "mutex" you are refering to? Have I missed
>the time and mutex is a part of the C++ standard? Or you are talking about
>posix mutex and "C++" is a part of the posix standard? In neither,
>how do you know that mutex (anyone) protect "volative memory"?


For Windows, the information on what volatile does in Visual C++ is
sketchy. Information on how to write multithreaded apps that can run on
a machine with multiple CPUs is sketchier still, which means that as long
as you follow the guidelines for a multithreaded app (which say to
protect shared resources with a synchronization primitive such as a
mutex), it's the operating system's problem. As far as I can see, many
Windows programmers who write multithreaded apps are unaware that their
program might run on a machine with multiple CPUs, or of what the
implications of that are.

This web page says there is an implied memory barrier in the OS's locking
mechanisms (e.g. mutexes):
http://www.microsoft.com/whdc/hwdev/driver/MPmem-barrier.mspx

but it also suggests that use of the volatile keyword provides coherency
of memory and is meaningful in a multithreaded environment - however it
refers the reader to Visual C++ documentation where information on the
effect of volatile is fairly useless - though it suggests that it's
useful in a multithreaded app.

Since all of the Visual C++ documentation describes that a mutex or
other synchronisation primitive should be used whenever accessing shared
resources, it's safe to say that using a mutex does "protect volatile
memory" - to answer your question. On Windows, you can also use the
InterlockedXX API functions to access specific shared variables - which
is more efficient than a mutex, but a mutex has other purposes as James
mentioned.

In C++, volatile has 3 effects - 1) it affects the order of execution
(accesses to volatile objects must be executed in the order implied by
the source code), 2) it forces the compiler to perform all accesses
implied by the source code, and 3) it prevents the compiler from performing
any accesses beyond those implied by the source code.

Item number 3 is important when accessing hardware registers where an
unintended read of a register can have a side effect such as a missed
interrupt. As in the "abstract machine", accesses of volatile variables
between two sequence points can be reordered (and according to GCC,
multiple accesses between two sequence points can be combined).

The guarantee on volatile in C++ is provided by a short sentence in 1.9
para 6 <quote> The observable behavior of the abstract machine is its
sequence of reads and writes to volatile data and calls to library I/O
functions.<end quote> and by 1.9 para 5 which requires the observable
behaviour of the abstract machine be preserved.

1.9 para 6 ensures that the following program writes 42 to stdout and
not some random value caused by reordering of the two statements.
int main() { int a = 42; std::cout << a; }

In C# and Java, the volatile keyword provides a guarantee of "memory
coherency" in a multithreaded / multiple CPU environment, but not in
C++, except where specified by a particular implementation.

I'm curious to know exactly what a "memory barrier" operation involves
as it sounds potentially very time consuming - how long does it take to
write 50K of data from the cache to external memory or to other caches?
Intel XEON micro has hyperthreading where multiple "CPUs" share all
resources including the cache and each "CPU" is just a bunch of
registers - they can execute threads of a process in parallel and
presumably don't have time consuming "memory barrier" operations to
perform.

Graeme

ka...@gabi-soft.fr
Feb 20, 2004, 3:52:03 PM
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c10u3k$1ceits$1...@ID-122417.news.uni-berlin.de>...
> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04021...@posting.google.com...
> [...]
> > There's a more general aspect that is worrying me. When I use a
> > mutex, it is to protect something. The mutex has three effects: it
> > blocks all but one thread, it suspends waiting threads so that they
> > don't use CPU resources, and it synchronizes memory. All memory,
> > including that not declared volatile.

> Could you please tell what is a "mutex" you are refering to? Have I
> missed the time and mutex is a part of the C++ standard? Or you are
> talking about posix mutex and "C++" is a part of the posix standard?
> In neither, how do you know that mutex (anyone) protect "volative
> memory"?

There is no mutex in the C++ standard, because there are no threads. If
you are using threading, you are dependent on at least one other
standard, in addition to the C++ standard. Posix is the one I know, so
it is the one I use for examples; I suspect that the guarantees given by
Windows, or other systems, are similar.

Unless your compiler supports one of these additional standards, you
cannot use it for multithreaded code. If it supports one of these
additional standards, then you get the guarantees of that standard. The
Posix standard is quite clear: memory is synchronized across a certain
number of functions (listed in the standard). Whether said memory is
declared volatile or not -- there is no requirement in the Posix
standard for volatile on memory accessed from different threads. There
is a requirement, however, that if any thread modifies an object, and
more than one thread accesses it, then all accesses must be protected by
some sort of locking mechanism: mutex, semaphore, etc. (and all of the
functions involving such mechanisms are in the list of those requiring
full memory synchronization). Thus (to repeat a frequent example):

Singleton*
Singleton::instance()
{
    if ( pInstance == NULL ) {
        pInstance = new Singleton ;
    }
    return pInstance ;
}

is undefined if there is more than one thread. In fact, if there is
more than one thread, it is undefined behavior regardless of what you do
to protect it in the if -- the initial test must also be protected in
some way.

On the other hand,

Singleton*
Singleton::instance()
{
    pthread_mutex_lock( someMutex ) ;
    if ( pInstance == NULL ) {
        pInstance = new Singleton ;
    }
    pthread_mutex_unlock( someMutex ) ;
    return pInstance ;
}

is guaranteed. All of the writes in the protected zone (including those
in the constructor) must have been made accessible in globally visible
memory before I return from the pthread_mutex_unlock function, and
previously read values that happen to be floating around (e.g. in a
cache line) must be purged in the pthread_mutex_lock function.

Note that the final read in the return statement could also be critical.
Except that in this particular case, once a thread has returned from the
pthread_mutex_unlock function, pInstance will never be modified. In
cases where the protected object may be modified at a later time, it is
important not to release the lock until the return value has been read
from global memory. In this case, RAII isn't just good programming
practice; it is necessary if the program is to be correct. (Note that
we are counting on a very subtle interaction between two standards here,
and I would much prefer that the C++ committee recognize the existence
of threads, and specify it clearly. Or that Posix recognize the
existence of C++, although that wouldn't necessarily help people on
other platforms.)
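To make the RAII remark concrete, here is a sketch of a minimal scoped lock over a Posix mutex, applied to a singleton like the one above. The Singleton class body and the guard itself are stubs added for illustration (C++11 later standardised this idiom as std::lock_guard):

```cpp
#include <pthread.h>

class Singleton {
public:
    static Singleton* instance();
};

pthread_mutex_t someMutex = PTHREAD_MUTEX_INITIALIZER;
Singleton* pInstance = 0;

// Minimal RAII guard: the unlock runs on every exit path (return,
// exception), so the read in the return statement always happens
// while the lock is still held.
class ScopedLock {
public:
    explicit ScopedLock(pthread_mutex_t& m) : m_(m) { pthread_mutex_lock(&m_); }
    ~ScopedLock() { pthread_mutex_unlock(&m_); }
private:
    ScopedLock(const ScopedLock&);              // non-copyable (pre-C++11 style)
    ScopedLock& operator=(const ScopedLock&);
    pthread_mutex_t& m_;
};

Singleton* Singleton::instance()
{
    ScopedLock lock(someMutex);
    if (pInstance == 0) {
        pInstance = new Singleton;
    }
    return pInstance;   // still under the lock; released by ~ScopedLock
}
```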

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Ben Hutchings
Feb 21, 2004, 5:38:35 AM
Michael Furman wrote:
>
> "Michael Furman" <Michae...@Yahoo.com> wrote in message
> news:c10u3k$1ceits$1...@ID-122417.news.uni-berlin.de...
>>
>> <ka...@gabi-soft.fr> wrote in message
>> news:d6652001.04021...@posting.google.com...
>> [...]
>> > There's a more general aspect that is worrying me. When I use a mutex,
>> > it is to protect something. The mutex has three effects: it blocks all
>> > but one thread, it suspends waiting threads so that they don't use CPU
>> > resources, and it synchronizes memory. All memory, including that not
>> > declared volatile.
>>
>> Could you please tell what is a "mutex" you are refering to? Have I missed
>> the time and mutex is a part of the C++ standard? Or you are talking about
>> posix mutex and "C++" is a part of the posix standard? In neither,
>> how do you know that mutex (anyone) protect "volative memory"?
> I meant:
>> how do you know that mutex (anyone) protect "non-volative memory"?
> - sorry for typo.

That is essential to their operation. The mutex functions do that
with some combination of methods from these two lists:

Inhibiting unsafe re-ordering by the processor
1. Do nothing because the processor does not re-order.
2. Use assembly language to generate the special instructions
required.

Inhibiting unsafe re-ordering by the compiler
3. Do nothing because the compiler does not re-order.
4. Put the functions in a separate source file so that the compiler
cannot tell what memory they access and must be conservative.
5. Use inline assembly and if necessary specify that it reads and
writes arbitrary memory.

Alternatively, both can be covered at the same time:

6. Use compiler intrinsics that generate the special instructions
required and tell the compiler what not to do.
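For illustration, here are methods 5 and 6 in GCC/Clang-flavoured form. This is compiler-specific; the exact syntax is an assumption about the toolchain, not something from the post:

```cpp
// Method 5: an empty asm statement with a "memory" clobber. It emits
// no machine instructions; it only tells the compiler that arbitrary
// memory may be read or written here, so it must not cache values in
// registers across this point or reorder memory accesses around it.
inline void compilerBarrier()
{
    asm volatile("" ::: "memory");
}

// Method 6: an intrinsic that is both a compiler barrier and a full
// hardware memory fence (e.g. mfence on x86).
inline void fullBarrier()
{
    __sync_synchronize();
}
```

On a processor that does not reorder (method 1 in the list above), fullBarrier still costs a real instruction, which is one reason implementations mix and match from the two lists.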

Michael Furman
Feb 21, 2004, 5:48:50 AM

<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...
> [...]

> > Could you please tell what is a "mutex" you are refering to? Have I
> > missed the time and mutex is a part of the C++ standard? Or you are
> > talking about posix mutex and "C++" is a part of the posix standard?
> > In neither, how do you know that mutex (anyone) protect "volative
> > memory"?
>
> There is no mutex in the C++ standard, because there are no threads. If
> you are using threading, you are dependant on at least one other
> standard, in addition to the C++ standard. Posix is the one I know, so
> it is the one I use for examples; I suspect that the guarantees given by
> Windows, or other systems, are similar.
>
> Unless your compiler supports one of these additional standards, you
> cannot use it for multithreaded code.

That is the first of your points I cannot agree with. I (and many others, I
believe) use C++ in multithreaded code without support from any extra
standard (mostly in embedded code, in environments with, often, very
non-standard synchronization primitives) - I just do not have an alternative.
What I rely on is:
1. Using "volatile", which forces translation of C++ code accessing some
memory into machine code that immediately performs the "equivalent" access.
2. Some knowledge of the CPU hardware related to caches and parallel memory
access (the simple and most usual case: just one CPU and multiprogramming;
in multi-CPU cases, different cache strategies and special memory
instructions).

> If it supports one of these
> additional standards, then you get the guarantees of that standard. The
> POSIX standard is quite clear: memory is synchronized across a certain
> [... perfectly correct text about POSIX snipped]
> practice; it is necessary if the program is to be correct. (Note that
> we are counting on a very subtle interaction between two standards here,
> and I would much prefer that the C++ committee recognize the existence
> of threads, and specify it clearly. Or that POSIX recognize the
> existence of C++, although that wouldn't necessarily help people on
> other platforms.)

Yes - it is subtle.
And there is another reason why I don't want to be limited by POSIX when I
use multithreading: it is very costly. I am not talking here about the need
to implement many standard primitives (which is also sometimes a problem),
but about the high cost of using POSIX mutexes - taking a mutex typically
must do a full memory synchronization when I actually need to synchronize
access to just one variable/object!

So, while I agree that adding some kind of threads to the C++ standard would
be helpful (especially for hosted environments), there are many other uses
(mostly in freestanding cases) that have slightly different threads than the
POSIX ones, and there "volatile" works (and is necessary) - though it would
be great to make its definition in the C++ standard stronger.
(Note that "thread" and "mutex" existed long before POSIX, so I am not
necessarily using these words in the POSIX sense.)

Regards,
Michael Furman

Michael Furman
Feb 21, 2004, 5:55:38 AM
<ka...@gabi-soft.fr> wrote in message
news:d6652001.0402...@posting.google.com...
> [...]

> > I more or less agree with you that volatile is not sufficient for
> > anything in the multithreaded case, but isn't it necessary?
>
> No, since the other things that are necessary provide all of the
> necessary guarantees.

I guess you are talking about the strict POSIX threading model (which is
not consistent with C++, by the way). It is not true for other models
(see my other post).

>
> > Could you please elaborate what exactly do you mean. My guesses:
>
> > 1. The exact semantics of volatile are not defined in the language, so any
> > use would be unportable. And "neither necessary nor sufficient" relates
> > to portable code. I agree with that - so no further questions.
>
> > 2. Volatile does not change the behavior of code in any useful way.
> > ... anything else.
>
> Basically the second. Using volatile will result in slower code being
> generated, but will typically not have any other effect with regards to
> threading.

Using volatile will slow the code only at the points where the variable is
accessed, and only by (typically) one extra memory-access instruction.
Taking a mutex (in the POSIX model) costs saving/reloading all variables
(as if they were all volatile), but only at the point of taking the mutex.
In some cases the first method is better; in others, the second.

(I am not saying that using "volatile" is enough in the first method -
it is only necessary. We must also use some other ways to
ensure atomicity and some kind of coherence.)
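As it happens, C++11 (well after this exchange) standardised exactly this middle ground: std::atomic gives per-variable atomicity and visibility without the full-memory cost of a mutex. A hedged sketch of the earlier stopflag example in those terms (names illustrative):

```cpp
#include <atomic>

// The stopflag from the earlier example, with defined semantics: the
// release store is guaranteed to become visible to the acquire load,
// and only this one variable is synchronized -- no full memory
// synchronization of unrelated data.
std::atomic<bool> stopflag(false);

void worker()
{
    while (!stopflag.load(std::memory_order_acquire)) {
        // ...do something w/o any access to stopflag...
    }
    // TurnPowerOff() would go here
}

void requestStop()
{
    stopflag.store(true, std::memory_order_release);
}
```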


>
> > Let's consider the two threads:
>
> > Thread1:
> > stopflag = false;
> > while(!stopflag)
> > { ...do something w/o any access to "stopflag" .... }
> > TurnPowerOff();
>
> > Thread2:
> > ...do something w/o any access to "stopflag" ....
> > // It is time to turn power off
> > while(true)
> > {
> > stopflag = true;
> > Sleep(1);
> > }
>
> > In the absence of volatile in the definition of "stopflag" compiler
> > can optimize "while(!stopflag)" into "while(true)" and "TurnPowerOff"
> > would never be called.
>
> With or without volatile, there is no guarantee that Thread1 will ever
> see the change to memory made in Thread2. In order to ensure that the
> modifications in one thread are visible in another thread, it is
> necessary (on most modern machines) to issue some special instructions
> (membar on a Sparc, for example).

Without volatile, it is guaranteed that it will not work correctly when
compiled with even simple optimization.

With volatile, it is not guaranteed by volatile alone. But the addition of
some extra things/mechanisms would guarantee it. For example, one of:
- just one CPU and a classic multiprogramming environment;
- a mechanism (either software or hardware) that periodically flushes all
caches.

That is what I call necessary (but not enough).

> Arguably, the intent of volatile is
> that the compiler generate these instructions when a volatile variable
> is accessed, but in practice, they don't (and Posix very definitly
> doesn't require them to in C).

I hope that is not so - IMO it would make volatile useless.

>
> According to Posix, the above code has undefined behavior, so the
> problem of whether the compiler optimizes accesses or not is moot. In
> order for the behavior to be defined, *ALL* accesses to stopflag must be
> protected, normally by a mutex. In which case, Posix does guarantee
> that the code works. With or without volatile.

Is this a POSIX newsgroup? :-)

>
> The real problem, I think, is that the standard (C or C++) doesn't
> address threading issues at all. Other standards (such as Posix) do,
> and they have chosen not to make volatile relevant in any way. A Posix
> compliant C (and presumably C++, although Posix doesn't acknowledge the
> existance of C++) is not allowed to optimize accesses accross calls to
> functions like pthread_mutex_lock and pthread_mutex_unlock, and these
> functions are required to issue any necessary barrier instructions. So
> without the locks, you have undefined behavior, with or without
> volatile, and with the locks, everything is correct, with or without
> volatile.

What I would like from the C++ standard, rather than (or in addition to)
addressing threading, is for it to address memory synchronization problems.
"volatile" and "sig_atomic_t" are something -- but they are not defined in a
wide enough context, and probably they are not enough.


>
> > If volatile (with its I believe absolutely typical semantic) present I
> > need the only weak constraint on hardware mylti threading model to
> > make it working as supposed (i.e. to turn power off at some point):
> > wtritten value will be visible from another thread earlier or later).
>
> Where do you get this guarantee? It isn't true on some of the systems I
> work on.

I never said it is guaranteed on any system. But it is for some, and it
could be made guaranteed for others (see above).

>
> > In short: do you think that volatile does not change behavior of the
> > code (w/o interrupt handlers) in any observable way? If not - do you
> > believe that there is some observable difference, but it cannot used??
>
> Interrupt handlers are a different question. You'll have to see what
> your system requires for these: volatile may or may not be useful,
> depending on what the system requires. I suppose that the same thing is
> true for threads, but volatile is not useful with threads under Posix,
> nor, as far as I can tell, under Windows.

No, it is not true for Windows. Here is a quote from the MSVC documentation:

W> The volatile keyword is a type qualifier used to declare that an object
W> can be modified in the program by something other than statements, such
W> as the operating system, the hardware, or a concurrently executing
W> thread.

So, what I am trying to say is that POSIX (and even Windows) is not
everything in the world. And POSIX uses a very heavy method (flushing out
all data at the synchronization point) that makes "volatile" useless, but it
is not good for every application.


Regards,
Michael Furman

Balog Pal

Feb 21, 2004, 6:00:54 AM2/21/04
to
<ka...@gabi-soft.fr> wrote in message
news:d6652001.0402...@posting.google.com...

> > In the absence of volatile in the definition of "stopflag" compiler
> > can optimize "while(!stopflag)" into "while(true)" and "TurnPowerOff"
> > would never be called.
>
> With or without volatile, there is no guarantee that Thread1 will ever
> see the change to memory made in Thread2.

On any real system?

> In order to ensure that the
> modifications in one thread are visible in another thread, it is
> necessary (on most modern machines) to issue some special instructions
> (membar on a Sparc, for example).

membar serves completely different purposes -- it is about _ordering_
(well, actually it has other forms too, but I'm sure you had the ordering
membar in mind here).

You issue it to ensure that ordered writes on one end are read correctly on
the other -- which is very important for lots of things. But omitting membar
on SPARC will in no way keep you from reading a changed value of that
stopflag when you spin on it endlessly waiting for the change.

> Arguably, the intent of volatile is
> that the compiler generate these instructions when a volatile variable
> is accessed, but in practice, they don't (and Posix very definitly
> doesn't require them to in C).

Simply generating plain reads is much better than assuming the value is
still the same, or precomputing the whole result.

> According to Posix, the above code has undefined behavior, so the
> problem of whether the compiler optimizes accesses or not is moot.

Sure, from the Posix standpoint it is undefined, but on a real system it can
still be defined and known. If you restrict yourself to stuff that is
defined within Posix itself, you'll waste a tremendous amount of resources.

> In
> order for the behavior to be defined, *ALL* accesses to stopflag must be
> protected, normally by a mutex. In which case, Posix does guarantee
> that the code works. With or without volatile.

Which is exactly what I just said -- a waste of resources for little
reason, as that stopflag loop works correctly on most machines (including
SPARC V9 in RMO) as long as simple write and read access instructions are
emitted by the compiler. Where it doesn't work would be an exotic system
with a really dangerous memory model, which any thread programmer should be
aware of anyway.

> > If volatile (with its I believe absolutely typical semantic) present I
> > need the only weak constraint on hardware multi threading model to
> > make it working as supposed (i.e. to turn power off at some point):
> > written value will be visible from another thread earlier or later).
>
> Where do you get this guarantee? It isn't true on some of the systems I
> work on.

What is that system? IMHO you're mistaken here, confusing memory ordering
issues with memory change visibility in general.

> Interrupt handlers are a different question.

Guess you're locked in on Posix as 'threading' ;-). What about hand-written
schedulers based on a timer interrupt or something alike? It is not too hard
to write a simple one. And the resulting system is way easier to handle --
no need for mutex objects; instead, the critical section is where you disable
interrupts. That is threading too, isn't it? And volatile is your friend
for any shared object.

> true for threads, but volatile is not useful with threads under Posix,

That claim sounds better. :) If you further narrow it down to not assume
anything but the basic Posix standard, there's no argument.

> nor, as far as I can tell, under Windows.

For that one you're way wrong IMO. Here you can do lots of well defined
things.

Paul

Balog Pal

Feb 21, 2004, 1:20:51 PM2/21/04
to
"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
news:tmkb3097tv6ljjbui...@4ax.com...

> As far as I can see, many
> Windows programmers who write multithreaded apps are unaware that their
> program might run on a machine with multiple CPUs or what the
> implications of that are.

Which is in general terms a Bad Thing(TM), but for this particular issue not
something scary, as Windows is restricted to Intel platforms (unless I
misunderstood the drop of Alpha and MIPS). And Intel goes to almost insane
lengths to make MP systems behave like an SP system in terms of memory use.

Thus the outcome is: *if* you create a scheme that works well in an MT
environment for any possible imagined point of interruption, it will work
on a multiprocessor system as well as on a single-processor one. (Khm,
thinking again, with an exception: there may be problems with object access
that is not atomic -- a single processor can't read inconsistent data while
that may be possible for another. But IMO in the design, non-atomic access
shall be considered an interruptible and disruptable operation in all
cases.)

> but it also suggests that use of the volatile keyword provides coherency
> of memory

Memory coherency is provided by the processor -- volatile is irrelevant
here. You need no special instructions to broadcast a change to a memory
location -- the write in itself invalidates the cache lines on all
processors. Though memory ordering issues can apply in some
situations -- but asserting LOCK# is a signal for all the processors to put
in a barrier. So any instruction with an implicit or explicit lock will do
the job.

> and is meaningful in a multithreaded environment - however it
> refers the reader to Visual C++ documentation where information on the
> effect of volatile is fairly useless - though it suggests that it's
> useful in a multithreaded app.

That can unfortunately be misleading -- volatile itself does not generate
lock prefixes at access. But you need it at least to tell the compiler not
to remove the simple access. In situations where ordering issues do not
apply, it is sufficient for communication.

> I'm curious to know exactly what a "memory barrier" operation involves
> as it sounds potentially very time consuming

There are different ones; the variant mentioned here is a lightweight one --
it merely constrains the ordering of accesses within a processor. IOW, it
inhibits reorderings across the barrier.

> - how long does it take to
> write 50K of data from the cache to external memory or to other caches?
> Intel XEON micro has hyperthreading where multiple "CPUs" share all
> resources including the cache and each "CPU" is just a bunch of
> registers - they can execute threads of a process in parallel and
> presumably don't have time consuming "memory barrier" operations to
> perform.

Actually the new Intel processors have *FENCE instructions that are similar
to the ordering membars on other processors. They were introduced
specifically to *gain* speed, giving the opportunity to use less brutal
things than the current LOCK.

Paul

Graeme Prentice

Feb 21, 2004, 8:49:32 PM2/21/04
to
On 21 Feb 2004 13:20:51 -0500, Balog Pal wrote:

>"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
>news:tmkb3097tv6ljjbui...@4ax.com...
>
>> As far as I can see, many
>> Windows programmers who write multithreaded apps are unaware that their
>> program might run on a machine with multiple CPUs or what the
>> implications of that are.
>
>What is in general terms a Bad Thing(TM) but for this particular issue not
>something scary. As Windows is restricted to Intel platforms (unless I
>misunderstood the drop of alpha and mips). And Intel does things on almost
>insane level to make MP systems behaving like a SP system in terms of
>memory use.

How about AMD? It seems x86 provides automatic memory coherency but
still has a re-ordering problem

That Microsoft web page (here's the URL again)
http://www.microsoft.com/whdc/hwdev/driver/MPmem-barrier.mspx
describes the purpose of the memory barrier for x86 and AMD64 is to
prevent hardware reordering on a *single CPU* and makes no mention
whatsoever of a "multiple cache issue" for multiple CPUs.
i.e. it says that on a single CPU, the hardware can change a write/read
sequence (generated by the compiler) into a read/write (which makes no
difference on a single CPU) but with multiple CPUs, it can make a
difference and a memory barrier is required when shared memory and
multiple CPUs are involved.

Perhaps the re-ordering is rare, but as that Microsoft web page says,
interlocked sequences, memory barriers and standard operating system
locking mechanisms should be used when shared memory is involved. It
looks highly likely that in cases where this reordering could make a
difference, multithreaded Windows programs would be using a mutex or
synchronisation of some sort anyway.

The classic while(flag){} in one thread, where flag is written by
another thread, should work just fine though (on Windows), as long as
flag is declared volatile and flag is the only thing being shared. If
you wanted to be conservative you might use InterlockedExchange to read
it (what value would you write??), but according to VS help, this does
not result in a memory barrier instruction on X86 (but does on Itanium),
so a plain read seems just as good on X86.


Here's a description of what volatile means in Java
http://www.javaperformancetuning.com/news/qotm030.shtml

and in C#
http://www.jaggersoft.com/csharp_standard/17.4.3.htm

Use of volatile in a Visual Studio C# program adds no special
instructions to X86 code as far as I can see.


>
>> but it also suggests that use of the volatile keyword provides coherency
>> of memory
>
>Memory coherency is provided by the processor -- volatile is irrelevant
>here. Yuo need no special instructions to broadcast a change to a memory
>location -- the write in itself does invalidate the cache lines on all
>processors. Though the memory ordering issues can apply in some
>situations -- but asserting LOCK# is a signal for all the processors to put
>in a barrier. So any instruction with implicit or explicit lock will do the
>job.

I didn't read the article carefully enough. It suggests using volatile
to prevent the ***compiler*** from re-ordering - which is one of the
purposes of the volatile keyword in C++ - even though most descriptions
you see (including VC help files) don't directly say this and just say
that volatile tells the compiler that an "unseen" process might change
the value asynchronously.

Presumably if you first acquire a mutex before accessing shared memory,
any reordering the compiler or hardware does won't matter.

>
>> - how long does it take to
>> write 50K of data from the cache to external memory or to other caches?
>> Intel XEON micro has hyperthreading where multiple "CPUs" share all
>> resources including the cache and each "CPU" is just a bunch of
>> registers - they can execute threads of a process in parallel and
>> presumably don't have time consuming "memory barrier" operations to
>> perform.

I should correct what I said here, since each CPU obviously has a
"brain" (consisting of x-zillion transistors) as well as "registers".
When I said each CPU is just a bunch of registers I was thinking of its
"state". Presumably sharing cache memory slows each logical CPU down a
little.

Graeme

Graeme Prentice

Feb 22, 2004, 6:00:09 AM2/22/04
to
On 16 Feb 2004 07:22:53 -0500, Ralf wrote:

>
>using volatile bool's as mutexes will cause some problems.
>For example: asking for the run condition (if( bMut == false) )
>can give you the green light. But one line later can be the context
>switch to another thread. In the second thread the condition can
>be improved also, it will get also green light!
>
>But I think, there is a solution: for two threads you need two
>volatile bool's: b1 and b2. The first thread has to switch the variables
>in the order b1, b2, the second thread has to switch them in the reverse
>order.
>Between the switching there should be another check of the variables
>(like in the double check pattern).


This doesn't work reliably. Even on a single CPU machine with cache
coherency (e.g. Pentium), the hardware can reorder the reads and writes
so that instead of each thread doing "set my flag" followed by "read the
other flag", the hardware reorders this to "read the other flag", "set
my flag" - with the consequence that each thread can get the green
light simultaneously.

This is discussed from a Java point of view here
http://www.javaworld.com/javaworld/jw-05-2001/jw-0525-double_p.html

and here's a good explanation of how reordering can occur
http://www.javaworld.com/javaworld/jw-02-2001/jw-0209-toolbox.html

but there's probably other ways.

Graeme Prentice

Feb 22, 2004, 6:01:57 AM2/22/04
to
On 21 Feb 2004 20:49:32 -0500, Graeme Prentice wrote:

>
>How about AMD? It seems x86 provides automatic memory coherency but
>still has a re-ordering problem

It also seems that most (but not all) hardware provides cache coherency
(i.e. the latest value written by one CPU is always seen by another CPU
when next read) but reordering of reads and writes is common.
http://www.it.lth.se/msprojects/ita/past/lockbehavior/mssc_thesis_aron_akesson.pdf

>
>That Microsoft web page (here's the URL again)
>http://www.microsoft.com/whdc/hwdev/driver/MPmem-barrier.mspx
>describes the purpose of the memory barrier for x86 and AMD64 is to
>prevent hardware reordering on a *single CPU* and makes no mention
>whatsoever of a "multiple cache issue" for multiple CPUs.
>i.e. it says that on a single CPU, the hardware can change a write/read
>sequence (generated by the compiler) into a read/write (which makes no
>difference on a single CPU) but with multiple CPUs, it can make a
>difference and a memory barrier is required when shared memory and
>multiple CPUs are involved.

[correcting myself again]

Reordering of write/read into read/write does make a difference in a
single-CPU multithreaded app.


>
>Here's a description of what volatile means in Java
>http://www.javaperformancetuning.com/news/qotm030.shtml


Apparently not all (not many, even) Java implementations obey the
semantics of volatile.

Would you say this has become Off-Topic? I don't think so. Writing
multithreaded programs that work reliably is relevant to C++. I think
it ought to be not too hard for a C++ implementation to provide
standardised mutex and synchronization primitives that map to the
appropriate things on the targeted hardware or OS when they're readily
available. Judging by the Java experience, giving C++ "volatile" the
strong semantics that Java tries to have would be hard.

Balog Pal

Feb 22, 2004, 12:24:54 PM2/22/04
to
"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
news:mbof30pmae3afi0c4...@4ax.com...

> >What is in general terms a Bad Thing(TM) but for this particular issue not
> >something scary. As Windows is restricted to Intel platforms (unless I
> >misunderstood the drop of alpha and mips). And Intel does things on almost
> >insane level to make MP systems behaving like a SP system in terms of
> >memory use.
>
> How about AMD? It seems x86 provides automatic memory coherency but
> still has a re-ordering problem

I didn't read the AMD manuals so I can't tell you what they have there, but
AMD lives on Intel compatibility, and for financial reasons they can't allow
deviance. Also, breaking the way the memory system works would break lots of
programs on Windows, and in a way that could be tied to the processor.
I specify the programs I write to run in a certain hardware and software
environment -- say IA-32 and listed versions of Windows -- and if the user
decides to run on something else, it is his problem (if there's a problem).
Most programs I use have similar statements; mentioning AMD is rare, or
happens just to state a "minimum" processor level.
I didn't hear of any real problems related to using AMD instead of Intel, so
they must be rare animals. :)

> That Microsoft web page (here's the URL again)
> http://www.microsoft.com/whdc/hwdev/driver/MPmem-barrier.mspx
> describes the purpose of the memory barrier for x86 and AMD64 is to
> prevent hardware reordering on a *single CPU* and makes no mention
> whatsoever of a "multiple cache issue" for multiple CPUs.
> i.e. it says that on a single CPU, the hardware can change a write/read
> sequence (generated by the compiler) into a read/write (which makes no
> difference on a single CPU) but with multiple CPUs, it can make a
> difference and a memory barrier is required when shared memory and
> multiple CPUs are involved.

Yep, that is (or should be) a known issue for a threads programmer. It may
be irrelevant for his platform, but he must know exactly why.


> Perhaps the re-ordering is rare, but as that Microsoft web page says,
> interlocked sequences, memory barriers and standard operating system
> locking mechanisms should be used when shared memory is involved.

Very true. :) The most basic tool here, as I mentioned earlier, is 'lock',
which is like a Swiss army knife: it ensures atomicity, broadcasts the
change, and forces a membar on the whole system. But since it does so much,
you want to consider it an expensive thing that you will not use without
real need. Just like you will not want semaphores in the middle of a
single-rail section.

> It
> looks highly likely that in cases where this reordering could make a
> difference, multithreaded Windows programs would be using a mutex or
> synchronisation of some sort anyway.

Yep. Ordering is an issue only when you access multiple shared objects, on
a shared section. If you have a single object, or already established an
exclusive region, it can't hurt you.

> The classic while(flag){} in one thread, where flag is written by
> another thread, should work just fine though (on Windows), as long as
> flag is declared volatile and flag is the only thing being shared.

Yep.

> If
> you wanted to be conservative you might use InterlockedExchange to read
> it (what value would you write??),

If it's just a 2-state stopflag, you exchange with 0 and examine what you
fetched; if that's 1, you got your signal. If the flag is shared with other
clients, you write back that value. (It can certainly work for more values;
just create a good protocol.)

You could write an InterlockedLoad to get rid of the unwanted part.

> but according to VS help, this does
> not result in a memory barrier instruction on X86 (but does on Itanium),

That help must be way wrong. (Intel architecture manuals are available on
Intel's site for download.)

Or you misunderstood the statement -- it certainly does not emit a *FENCE
_instruction_, but the instruction used has all its effects.

> Presumably if you first acquire a mutex before accessing shared memory,
> any reordering the compiler or hardware does won't matter.

A mutex operation (as well as the other sync primitives) must include a
memory barrier. Otherwise it would not protect much. :)

Paul

Balog Pal

Feb 22, 2004, 12:31:17 PM2/22/04
to
"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
news:5lng30pqnelamdhj2...@4ax.com...

> >Between the switching there should be another check of the variables
> >(like in the double check pattern).

There was a ton of threads on why that double check thing is broken.

> This doesn't work reliably. Even on a single CPU machine with cache
> coherency (e.g. Pentium), the hardware can reorder the reads and writes
> so that instead of each thread doing "set my flag" followed by "read the
> other flag", the hardware reorders this to "read the other flag", "set
> my flag" - with the consequence that each thread can get the green
> light simultaneously.

But this is not true, 'processor self-consistency' is preserved on any sane
CPU design. It certainly IS for Pentium. Also for SPARC.
So you can't observe the problem on a single CPU.

(You may try reading/writing parts of the same object or twiddling with
misaligned objects to shoot yourself in the foot, but reads and writes made
by the only CPU will always give you the result expected by the strict
instruction order, despite any speculative reads or reordering done.)

> This is discussed from a Java point of view here
> http://www.javaworld.com/javaworld/jw-05-2001/jw-0525-double_p.html
>
> and here's a good explanation of how reordering can occur
> http://www.javaworld.com/javaworld/jw-02-2001/jw-0209-toolbox.html
>
> but there's probably other ways.

Did these articles mention there'll be problem on single CPU?

Paul

Balog Pal

Feb 23, 2004, 6:17:03 AM2/23/04
to
"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
news:r7sg309t2s2i3ulhd...@4ax.com...

> Writing
> multithreaded programs that work reliably is relevant to C++. I think
> it ought to be not too hard for a C++ implementation to provide
> standardised mutex and synchronization primitives that map to the
> appropriate things on the targeted hardware or OS when they're readily
> available.

Last time I thought about the issue, I concluded that a single membar(type)
primitive would solve most (possibly all) theoretical problems wrt threading
sync.

The compiler should respect the barrier as one -- moving object access
across the barrier being forbidden. Also it would emit the appropriate
processor instruction. That is nothing on lots of systems.

Dealing with cache stuff may require something more -- I have not yet worked
on a system where the *data* cache was an issue (the memory system was just
specified as a single unit), so I can't say anything about the implications
and the required cure.

> Judging by the Java experience, giving C++ "volatile" the
> strong semantics that Java tries to have would be hard.

I don't think C++ volatile has a problem worth fixing in the standard --
possibly beyond some clarification on what 'access' really means, and
similar issues that have turned out to be a source of confusion or differing
understandings.

There was a time I thought volatile access should have an implicit membar --
but later concluded that would be a suboptimal solution -- real life code
asks for explicit membar points. If we had them the programmer could place
them where needed and that's it.

Paul

Graeme Prentice

Feb 23, 2004, 6:24:58 AM2/23/04
to
On 22 Feb 2004 12:31:17 -0500, Balog Pal wrote:

>"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
>news:5lng30pqnelamdhj2...@4ax.com...
>
>> >Between the switching there should be another check of the variables
>> >(like in the double check pattern).
>
>There was a ton of threads on why that double check thing is broken.
>
>> This doesn't work reliably. Even on a single CPU machine with cache
>> coherency (e.g. Pentium), the hardware can reorder the reads and writes
>> so that instead of each thread doing "set my flag" followed by "read the
>> other flag", the hardware reorders this to "read the other flag", "set
>> my flag" - with the consequence that each thread can get the green
>> light simultaneously.
>
>But this is not true, 'processor self-consistency' is preserved on any sane
>CPU design. It certainly IS for Pentium. Also for SPARC.
>So you can't observe the problem on a single CPU.

If what the MS web page says on reordering on X86 is true, then why
couldn't the following happen on a single CPU system

//thread 1
read var2
// thread 1 switched out

// thread 2
read var 1
// thread 2 switched out

// thread 1
write var 1
check var 2, it's zero, I can go

// thread 2
write var 2
check var 1, it's zero, I can go


- the programmer actually wrote
write var 1;
read var 2


>
>(You may try reading/writing part of the same object or twiddle with
>misaligned objects to shoot yourself in the leg, but the result of reads and
>writes made by the only CPU wil always give you the result expected by the
>strict instruction order despite any speculatice reads or reordering done. )
>
>> This is discussed from a Java point of view here
>> http://www.javaworld.com/javaworld/jw-05-2001/jw-0525-double_p.html
>>
>> and here's a good explanation of how reordering can occur
>> http://www.javaworld.com/javaworld/jw-02-2001/jw-0209-toolbox.html
>>
>> but there's probably other ways.
>
>Did these articles mention there'll be problem on single CPU?

It didn't say the problem was limited to multiple-CPU systems or to
systems without instant cache coherency.

Graeme

ka...@gabi-soft.fr

Feb 23, 2004, 2:26:50 PM2/23/04
to
"Balog Pal" <pa...@lib.hu> wrote in message
news:<4036...@andromeda.datanet.hu>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.0402...@posting.google.com...
> > > In the absence of volatile in the definition of "stopflag"
> > > compiler can optimize "while(!stopflag)" into "while(true)" and
> > > "TurnPowerOff" would never be called.

> > With or without volatile, there is no guarantee that Thread1 will
> > ever see the change to memory made in Thread2.

> On any real system?

Sparc v9 and above. Alpha. IA-64.

> > In order to ensure that the modifications in one thread are visible
> > in another thread, it is necessary (on most modern machines) to
> > issue some special instructions (membar on a Sparc, for example).

> membar serves completely different purposes -- it is about _ordering_
> (well, actually it have other forms too, but I'm sure you had the ordering
> membar in mind here).

> You shall issue it to ensure ordered writes on one end are read
> correctly on the other -- what is way important for lots of things.
> But omitting membar on SPARC will no way let you read a changed value
> of that stopflag when you spin on it endlessly waiting the change.

It ensures that any data incidentally read during the loads issued
before the membar are ignored by loads after the membar. That's
ordering, yes. But it is necessary in this case.

> > Arguably, the intent of volatile is that the compiler generate
> > these instructions when a volatile variable is accessed, but in
> > practice, they don't (and Posix very definitly doesn't require them
> > to in C).

> Simply generating the most simple reads is much better than just
> believing it still has some value, or precompile the whole result.

The problem is that if the data at the address read happens to already
be in the processor, because of an earlier read, then the processor may
not go to main memory in order to fetch the value. And if the processor
doesn't go to main memory, then it will not see any modifications made
by another processor.

In practice, there is an extremely high probability that it will work.
The amount of memory held locally in the processor (I'm not talking
about cache, but about physically on the processor chip, in memory
interface registers, etc.) is very small, and sooner or later, you will
get a hardware interrupt, or some cron job will run, your
process will be interrupted, and some other process will read enough
other data to purge the internal contents. It's not formally
guaranteed, however, so all this means is that it will seem to work
during your tests.

> > According to Posix, the above code has undefined behavior, so the
> > problem of whether the compiler optimizes accesses or not is moot.

> Sure, from the Posix standpoint it is undefined, but on a real system
> it can still be defined and known.

It could be. It isn't using g++ or Sun CC on a Sparc. From what other
people have told me, it isn't on an Alpha either.

> If you're to restrict yourself to stuff that is defined only within
> the posix itself you'll waste tremendous amount of resources.

Well, if you want your program to work, it's generally better if you
restrict yourself to things that are guaranteed somewhere.

> > In order for the behavior to be defined, *ALL* accesses to stopflag
> > must be protected, normally by a mutex. In which case, Posix does
> > guarantee that the code works. With or without volatile.

> What is exactly what I just said -- waste of resources for little
> reason. As that stopflag loop works correctly on most machines
> (including sparc V9 in RMO) as long as simple write and read access
> instructions are emitted by the compiler.

The Sparc architecture document says most explicitly that it doesn't
work on a Sparc v9 in RMO. From what people have told me, it doesn't
work on an Alpha either, nor on an Intel 64 bit architecture.

> Where it doesn't work should be az extic system with really dangerous
> memory model any thread programmer should be aware anyway.

Most general purpose systems today use really dangerous memory models.

> > > If volatile (with its I believe absolutely typical semantic)
> > > present I need the only weak constraint on hardware multi
> > > threading model to make it working as supposed (i.e. to turn
> > > power off at some point): written value will be visible from
> > > another thread earlier or later).

> > Where do you get this guarantee? It isn't true on some of the
> > systems I work on.

> What is that system? IMHO you're mistaken here confusing memory
> ordering issues with memory change visibility in general.

The two are related.

> > Interrupt handlers are a different question.

> Guess you locked in at posix as 'threading' ;-). What about
> hand-written shedulers based on a timer interrupt or something alike?
> It is not too hard to write a simple one. And the resulting system is
> way easier to handle -- no need for mutex objects, instead the
> critical section is where you disable interrupts. That is threading
> too, isn't it? And volatile is your friend for any shared object.

I've worked on those sort of systems. A long time ago, in simpler
times. A hand-written scheduler for a multi-processor machine is NOT
simple to write. Of course, when you are programming at that level, you
don't use pthread_mutex_lock. You need to know the hardware intimately,
and design your code around it. But then, you won't be writing the
scheduler in standard C++ either.

> > true for threads, but volatile is not useful with threads under
> > Posix,

> That claim sounds better. :) If you further narrow it down to not
> assume anything but the basic Posix standard, there's no argument.

That's really all I'm claiming. The problem, or at least, my impression
of the problem, is that there is some confusion between threads and
interrupt handling. Threads are a fairly high level concept, supported
by the OS (be it Posix or Windows); on a multiprocessor system, you have
no control over which processor an individual thread runs on, nor in
which memory its variables have been mapped. In an interrupt routine,
you are a lot closer to the hardware, and have a lot more control. And
if we are talking about an interrupt handler on an embedded processor,
you can probably define almost to a letter what external cycles are
being generated on the bus.

> > nor, as far as I can tell, under Windows.

> For that one you're way wrong IMO. Here you can do lots of well
> defined things.

You can do lots of well defined things, but you can also do undefined
things. The one look I took at code generated by VC++ under Windows
did not suggest much of a guarantee for volatile if you were running on
a machine with multiple CPUs, but I'll admit that my knowledge of Intel
hardware is not that complete with regards to the modern processors;
when I last worked on Intel hardware, there was no pipelining of memory
accesses, except for a six byte look-ahead queue for instructions. And
volatile was relevant. (It's also relevant on my 10 year old Sparc at
home.)

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

ka...@gabi-soft.fr

Feb 23, 2004, 2:28:18 PM
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c163k7$1da5rj$1...@ID-122417.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.0402...@posting.google.com...
> > [...]
> > > I more or less agree with you that volatile is not sufficient
> > > for anything in the multithreaded case, but isn't it necessary?

> > No, since the other things that are necessary provide all of the
> > necessary guarantees.

> I guess you are talking about the strict POSIX threading model (which
> is not consistent with C++, by the way). It is not true for other
> models (see my other post).

Posix is the only threading standard I have on line. (Or had -- I just
got the documentation for VC++ installed this morning. So once I've
learned to find my way around in it...)

> > > Could you please elaborate on what exactly you mean. My guesses:

> > > 1. The exact semantics of volatile are not defined in the
> > > language, so any use would be unportable. And "neither necessary
> > > nor sufficient" relates to portable code. I agree with that - so
> > > no further questions.

> > > 2. Volatile does not change behavior of code in any useful way.
> > > ... anything else.

> > Basically the second. Using volatile will result in slower code
> > being generated, but will typically not have any other effect with
> > regards to threading.

> Using volatile will slow the code only in places accessing the
> variable. And only by (typically) one extra instruction that accesses
> memory.

This depends. On an older Sparc, the addition of a membar instruction
was unmeasurable. On some newer architectures, it may involve flushing
a cache line, or a loss of some pipelining. It's one extra instruction,
but how long that one instruction holds us up is not clear. (And on a
Sparc, it would be two instructions, one before and one after the
access.)

> Taking a mutex (in the POSIX model) costs saving/reading all variables
> (as if they were all volatile), but only at the point of taking the
> mutex. In some cases the first method is better; in others, the second.

> (I am not saying that using "volatile" is enough in the first method -
> it is only necessary. We should also use some other ways to ensure
> atomicity and some kind of coherence.)

My point is that once we have used the other ways, volatile is no longer
necessary. And that the other ways are necessary; volatile alone is not
sufficient.

> > > Let's consider the two threads:

> > > Thread1:
> > > stopflag = false;
> > > while(!stopflag)
> > > { ...do something w/o any access to "stopflag" .... }
> > > TurnPowerOff();

> > > Thread2:
> > > ...do something w/o any access to "stopflag" ....
> > > // It is time to turn power off
> > > while(true)
> > > {
> > > stopflag = true;
> > > Sleep(1);
> > > }

> > > In the absence of volatile in the definition of "stopflag"
> > > compiler can optimize "while(!stopflag)" into "while(true)" and
> > > "TurnPowerOff" would never be called.

> > With or without volatile, there is no guarantee that Thread1 will
> > ever see the change to memory made in Thread2. In order to ensure
> > that the modifications in one thread are visible in another thread,
> > it is necessary (on most modern machines) to issue some special
> > instructions (membar on a Sparc, for example).

> Without volatile it is guaranteed that it will not work correctly when
> compiled with simple optimization.

There's never a guarantee that it won't work. Unless the optimizer is
especially dumb, however, there is a very high probability that it won't.
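To make the hazard concrete: the legal transformation being discussed looks roughly like this (a sketch of the optimizer's reasoning, not actual compiler output):

```
// What the programmer wrote, with stopflag a plain (non-volatile)
// bool that the loop body never touches:
//
//     while (!stopflag) { /* ... do work ... */ }
//
// What the optimizer may legally produce, since it sees no write to
// stopflag anywhere in the loop:
//
//     if (!stopflag) {
//         for (;;) { /* ... do work ... */ }  // read hoisted out:
//     }                                       // loops forever
```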

> With volatile, it is not guaranteed by itself. But the addition of
> some extra things/mechanisms would guarantee it. For example, one of:

> - just one CPU and classic multiprogramming environment;

Technically, that isn't guaranteed either. In practice, I can't imagine
a processor where it wouldn't work with just a single CPU. On the other
hand, at least in the environment where I work (servers), I can't
imagine a context where I can be sure that there will only be a single
CPU. If not today, tomorrow...

> - a mechanism (either software or hardware) that periodically flushes
> all caches.

I'm not aware of any such mechanism on the machines I work on.

> That is what I call necessary (but not enough).

I agree that there may be other mechanisms than a lock which would work.
I've got some asm in my RefCntPtr for the Sparc, for example, to
implement an atomic increment and decrement; for Windows, I use a
special library routine -- no asm, but not portable either.

In no case have I declared anything volatile.

> > Arguably, the intent of volatile is that the compiler generate
> > these instructions when a volatile variable is accessed, but in
> > practice, they don't (and Posix very definitely doesn't require them
> > to in C).

> I hope it is not - IMO it would make it useless.

Well, it isn't so for g++ (at least under Solaris), nor for Sun CC, nor
for VC++. I don't have access to any other compilers at the moment, so
I can't speak for the others.

> > According to Posix, the above code has undefined behavior, so the
> > problem of whether the compiler optimizes accesses or not is moot.
> > In order for the behavior to be defined, *ALL* accesses to stopflag
> > must be protected, normally by a mutex. In which case, Posix does
> > guarantee that the code works. With or without volatile.

> Is it POSIX newsgroup? :-)

No, but since the issue concerns implementation defined behavior, we've
got to talk about some concrete examples. Once I've gotten into the
Windows documentation a bit more, I'll try and use examples from both,
in order to be a bit more fair. Still, I know and work on Posix
(Solaris, HP/UX and AIX), so it will doubtlessly be Posix which I can
talk about best.

The only alternative to discussing Posix or Windows examples here is to
say that we cannot talk about the problem at all. Even though it
concerns most C++ programmers in one way or another.

> > The real problem, I think, is that the standard (C or C++) doesn't
> > address threading issues at all. Other standards (such as Posix)
> > do, and they have chosen not to make volatile relevant in any way.
> > A Posix compliant C (and presumably C++, although Posix doesn't
> > acknowledge the existence of C++) is not allowed to optimize
> > accesses across calls to functions like pthread_mutex_lock and
> > pthread_mutex_unlock, and these functions are required to issue any
> > necessary barrier instructions. So without the locks, you have
> > undefined behavior, with or without volatile, and with the locks,
> > everything is correct, with or without volatile.

> What I would like from the C++ standard, rather than (or in addition
> to) addressing threading, is to address memory synchronization
> problems. There are "volatile" and "sig_atomic_t" - but they are not
> defined in a wide enough context, and probably they are not enough.

You are not alone; I believe that there are a fair number of people who
would like to see threading addressed in a future version of the
standard. FWIW: I would like to see it made quite clear that accesses
to a volatile object are synchronized between threads. Even in a
multiprocessor environment, where different threads run on different
processors. I'm not the one who writes the standard, however, and I'm
not sure that this particular requirement will make it into the
standard, even if threading in general does.

> > > If volatile (with its, I believe, absolutely typical semantics) is
> > > present, I need only one weak constraint on the hardware
> > > multithreading model to make it work as supposed (i.e. to turn
> > > power off at some point): a written value will be visible from
> > > another thread earlier or later.

> > Where do you get this guarantee? It isn't true on some of the
> > systems I work on.

> I never said it is guaranteed on any system. But it is for some, and
> it could be made guaranteed for others (see above).

It could (probably) be guaranteed for all systems. In practice, I don't
know of a multiprocessor system where it is guaranteed, however.

> > > In short: do you think that volatile does not change behavior of
> > > the code (w/o interrupt handlers) in any observable way? If not
> > > - do you believe that there is some observable difference, but
> > > it cannot be used?

> > Interrupt handlers are a different question. You'll have to see
> > what your system requires for these: volatile may or may not be
> > useful, depending on what the system requires. I suppose that the
> > same thing is true for threads, but volatile is not useful with
> > threads under Posix, nor, as far as I can tell, under Windows.

> No, it is not true for Windows. Here is from MSVC documentation:

> W> The volatile keyword is a type qualifier used to declare that an

> W> object can be modified in the program by something other than
> W> statements, such as the operating system, the hardware, or a
> W> concurrently executing thread.

That's interesting. From what little I understand of 80x86 hardware, in
order for this guarantee to hold on a multiprocessor system, the
instructions which access the variable must be preceded by a lock prefix
(but I could be wrong here -- it wasn't necessary when I worked on 8086,
back when it was 8086 and not 80x86). From what I have seen by
examining the generated assembler from VC++ 6.0, it doesn't generate a
lock prefix.

> So, what I am trying to say is that POSIX (and even Windows) is not
> everything in the world.

Certainly not, but Posix and Windows together represent a pretty large
segment of the general purpose machines. (There are a lot more embedded
systems, of course, and from what little I know about them, volatile is
likely to be useful on them.)

> And POSIX uses a very heavy method (flushing out all data at the
> synchronization point) that makes "volatile" useless, but it is not
> good for every application.

Most of the modern Posix systems have alternatives, although at the
hardware level, you do end up flushing "everything". (That's what a
membar instruction does on a Sparc, at any rate.) The problem is that
the alternatives are different for every different platform -- rather
than just having to deal with two variants: Posix and Windows, you have
to deal with four or five for Posix, and probably at least two (IA-32
and IA-64) for Windows. And that generally speaking, once you've
implemented whatever is necessary, even at this level, you no longer
need volatile: the InterlockedIncrement instruction under Windows, for
example, doesn't require (or even allow) that its argument be volatile,
and of course, the bits of assembler code for Sparc have no notion of
volatile.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Balog Pal

Feb 23, 2004, 2:29:08 PM
"Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
news:r53i309d3msm1einh...@4ax.com...

> If what the MS web page says on reordering on X86 is true, then why
> couldn't the following happen on a single CPU system
>
> //thread 1
> read var2
> // thread 1 switched out
>
> // thread 2
> read var 1
> // thread 2 switched out
>
> // thread 1
> write var 1
> check var 2, it's zero, I can go
>
> // thread 2
> write var 2
> check var 1, it's zero, I can go

For me that reads like a scenario written specifically to be broken.
thr1:
int temp = var2;
var1 = ?;
if(temp == 0) go();

thr2:
int temp = var1;
var2 = ?;
if(temp == 0) go();

Does it make any sense? If you elect not to use mutex sections, you must
use atomic operations. If your protocol requires fetch&test for var1, it
shall be an atomic operation. If it is an atomic operation, there's no
chance for a {read, interrupt, change, test oldvalue, make obsolete
decision} scenario.
Note that even if the check happens with fetch, you're still broken, as
here you need the full sequence to be atomic, including the write. It's
pretty uncommon to use 2 variables that way -- the normal solution is to
use a single variable, and atomic swap or compare-and-swap.

> - the programmer actually wrote
> write var 1;
> read var 2

Without membars the processor may internally read var2 before writing var1,
and use that value. Provided those are distinct variables. If they are
actually identical, the write will invalidate the previous read (or replace
with the written value if possible at that stage). So you can't see a
difference on a single processor (that is called preserving 'processor
self-consistency'). And a thread switch can't happen without membars,
so it will wash away any chance of problems from working with an old
value of var2 after writing var1.

Only another processor observing the memory bus and also with ability to
change bytes in the memory can introduce a problem.

> >> and here's a good explanation of how reordering can occur
> >> http://www.javaworld.com/javaworld/jw-02-2001/jw-0209-toolbox.html
> >>
> >> but there's probably other ways.
> >
> >Did these articles mention there'll be problem on single CPU?
>
> It didn't say the problem was limited to mulltiple CPU systems or to
> systems without instant cache coherency.

I read the articles, and must say they're not really correct. They
introduce plenty of things and then mix them up, likely creating
confusion for those without the background. The final conclusion is
good -- that DCL can't be hacked to work without a membar on the
observing side, for the systems requiring it. There are better-written
articles on the theme. The introductory part of the thesis you linked
recently is good to read.

Also note the problems introduced by the Java VM. The whole thing is
pretty moot -- if I write programs for the JVM, that is my 'platform'.
I must follow the rules of the JVM and get the behavior described in
the Java dox. It can't be any different depending on what processor
runs below me, what memory model is used, JIT or anything.
And the language must use those pretty rigid rules to make it possible
to work on anything. The Java dox avoid calling a rules violation
'undefined behavior', but that's what you get, and with far less chance
that the behavior gets defined by the 'implementation' or the 'platform
actually used'.

In a C++ project I can use shortcuts when I can prove they shall work.
And when in doubt I can introduce some simple steps to verify that
portions of code got compiled to the assembly I desire. (In practice
those will more likely be inline assembly parts. :)

I wouldn't dare do any shortcut on Java, as I see nothing to guarantee I'll
get what I think.

Paul

ka...@gabi-soft.fr

Feb 23, 2004, 2:35:01 PM
wa...@stoner.com (Bill Wade) wrote in message
news:<2bbfa355.0402...@posting.google.com>...
> ka...@gabi-soft.fr wrote

> > We've been through this before. Atomicity is only part of the
> > problem (although it is a real part). The other part is the
> > implementation defined semantics of volatile. Arguably, the intent
> > of the standard is that if one processor writes a volatile variable
> > to memory, all other processors will see the results of that write.
> > Intent or no, however, it ISN'T what most current implementations
> > guarantee.

> But some do, and many single-processor machines support more than one
> thread. A few even support more than one process and shared memory
> ;-).

Yes. I think that there is some ambiguity as to what the subject of
discussion is; perhaps I got confused by the word "thread". When I
worked on embedded systems, we had processes. Since there was no memory
mapping or protection, of course, they were really a lot closer to what
most people would call threads today, but somehow, I still think of them
as processes, and when I hear the word thread, I place myself
immediately into a Posix/Windows context.

If you are writing code for a Posix/Windows context, then IMHO, if you
aren't considering multiple processors, you have your head in the sand.
Multi-processors are almost the rule for servers running under these
systems, and I wouldn't be a bit surprised to see their use generalizing
on the client side. (We already have some clients running
multi-processor machines, but they are, for the moment, the exceptions.)

If you are writing code for an embedded system, running on a specific
hardware, under a real time OS, and what you are calling threads is what
I used to call processes for those machines, then the rules are
completely different. I haven't worked in this environment recently,
but the last time I did, volatile was very relevant. The same as it is
probably relevant (but you'll have to check with the system and the
compiler) if you are writing device drivers or other stuff that runs in
privileged mode and near the hardware under Unix or Windows.

> I guess your point is that if you want to avoid the mutex by looking
> at memory, you'll want to ensure that your look is both atomic and
> synchronized.

Well, that too, but mainly, I was only considering one particular type
of environment, because that was the only environment where I had
encountered the word "thread".

> On some systems, volatile bool will be atomic and synchronized. On
> some more systems volatile sig_atomic_t will be atomic and
> synchronized. On some systems neither does what you want. RTFM.

Moral: if you are doing anything not explicitly covered by the C++
standard, you have to know what guarantees the platform you are working
on offers.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


ka...@gabi-soft.fr

Feb 23, 2004, 2:35:23 PM
"Balog Pal" <pa...@lib.hu> wrote in message
news:<4034...@andromeda.datanet.hu>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04021...@posting.google.com...

> > extern int i ;
> > volatile atomic_bool lock = 0 ;
> > i = 0 ;
> > while ( lock == 0 ) {
> > }
> > cout << i ;

> > The compiler can "see" that i is not modified, and will probably
> > just generate code to output a literal 0. It would be a very rare
> > compiler that will reread i from memory in such a case.

> For some reason I always had the impression that *this* is the exact
> thing volatile was invented for: to tell the compiler "this variable
> may be changed by means not seen by the compiler". So if I have a
> variable used in the above situation I always make it volatile. And
> that forbids the assumption that i is still 0, as would be reasonable
> for a plain int.

That's what volatile was invented for. But volatile only affects
accesses through volatile qualified expressions -- the compiler should
generate an access to lock each time through the loop (for some
implementation defined meaning of access), but there is nothing to
prevent it from optimizing all accesses to i. The code sets i to 0; the
code then loops, in a loop where the compiler can see everything that is
modified; the code then outputs i. Since i is not volatile, and the
compiler has seen all possible accesses to it, the compiler is fully
justified in outputting 0, without rereading i.

If you declare i volatile, then the compiler should reread. But then
you have two volatile objects that are modified, and have to be
concerned about possible reordering by the hardware. At least with the
compilers I currently have access to, the code generated by volatile
does nothing to prevent reordering at the hardware level.

> And I know no other way to express that to the compiler -- so James,
> could you explain your previous statement that 'volatile is neither
> sufficient nor necessary' with threading? I agree it is not
> sufficient (for the general case), but how can you live safely
> without it?

Because once you've added locks, or whatever, the compiler has the
information it needs to avoid the dangerous accesses. Not every access
to the variable, of course (as would be the case with volatile), but it
will know to ensure that everything is safely written back before any
changes in the possession of the lock.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


ka...@gabi-soft.fr

Feb 23, 2004, 2:35:52 PM
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c161hc$1ebr8k$1...@ID-122417.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...
> > [...]
> > > Could you please tell what is the "mutex" you are referring to?
> > > Have I missed the point when mutex became part of the C++
> > > standard? Or are you talking about a posix mutex, with "C++" a
> > > part of the posix standard? If neither, how do you know that a
> > > mutex (any one) protects "volatile memory"?

> > There is no mutex in the C++ standard, because there are no
> > threads. If you are using threading, you are dependent on at least
> > one other standard, in addition to the C++ standard. Posix is the
> > one I know, so it is the one I use for examples; I suspect that the
> > guarantees given by Windows, or other systems, are similar.

> > Unless your compiler supports one of these additional standards,
> > you cannot use it for multithreaded code.

> That is the first of your points I cannot agree with. I (and many
> others, I believe) use C++ in multithreaded code - without support
> from any extra standard (mostly in embedded code, in environments
> with, often, very non-standard synchronization primitives)

OK. You're using "threading" in the general sense; I was speaking about
a more concrete instance -- that of non-privileged code running under
one of the mainstream OS's (Unix or Windows). You're still counting on
a number of guarantees in addition to the C++ standard, but they haven't
been standardized by any external organization.

> - I just do not have an alternative.
> What I rely on is:
> 1. using "volatile" that forces translation of the C++ code accessing
> some memory into the machine code that immediately does "equivalent"
> access.

> 2. Some knowledge about CPU hardware related to caches and parallel
> memory access (simple and most usual case: just one CPU and
> multiprogramming; in multi-CPU cases - different cache strategies and
> special memory instructions).

Right. For embedded systems, that's what you've got, and that's what
you work with.

> > If it supports one of these additional standards, then you get the
> > guarantees of that standard. The POSIX standard is quite clear:
> > memory is synchronized across a certain [... perfectly correct text
> > about POSIX snipped]

> > practice; it is necessary if the program is to be correct. (Note that
> > we are counting on a very subtle interaction between two standards
> > here, and I would much prefer that the C++ committee recognize the
> > existence of threads, and specify it clearly. Or that POSIX
> > recognize the existence of C++, although that wouldn't necessarily
> > help people on other platforms.)

> Yes - it is subtle.
> And there is another reason why I don't want to be limited by POSIX
> when I use multithreading: it is very costly. I am not talking here
> about the need to implement many standard primitives (which is also
> sometimes a problem), but about the high cost of using POSIX mutexes -
> taking a mutex typically should do full memory synchronization when I
> actually need to synchronize access to just one variable/object!

High cost is relative:-). If you are implementing code that should run
in non-privileged mode on a variety of Posix platforms, then you are
limited by Posix, whether you like it or not. If you are implementing
non-portable code that is running close to the hardware, in privileged
mode, or on a processor which doesn't even make that sort of
distinction, then you obviously aren't concerned with Posix, but rather
with the guarantees on your platform. The closer to the hardware you
get, the more the guarantees begin to look like some sort of hardware
specification, and less like an OS API.

> So, while I agree that adding some kind of threads to the C++ standard
> would be helpful (especially for hosted environment), there are many
> other uses (mostly in freestanding cases) that have slightly different
> threads than POSIX ones, and "volatile" (it would be great to make its
> definition in the C++ standard stronger though) works (and is
> necessary) there. (Note that "thread" and "mutex" existed long before
> POSIX, so I am using these words not necessarily as POSIX ones.)

You've raised an interesting point. The fact that C++ is used on
embedded processors, in such contexts, must be taken into account when
defining the thread model in the standard, if the standard is ever to
define such a model. I'm not sure what path the standard should take
here. For example, in such cases, you must know what types can be
accessed atomically, and what types will result in two or more accesses
at the lowest level. But I can't see the standard requiring atomic
access for anything more than a byte -- I've worked on eight bit
machines, and I suppose that some such beasts still exist.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


ka...@gabi-soft.fr

Feb 23, 2004, 2:36:14 PM
"Balog Pal" <pa...@lib.hu> wrote in message
news:<4037...@andromeda.datanet.hu>...

> "Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
> news:tmkb3097tv6ljjbui...@4ax.com...

> > As far as I can see, many Windows programmers who write
> > multithreaded apps are unaware that their program might run on a
> > machine with multiple CPUs or what the implications of that are.

> Which is in general terms a Bad Thing(TM), but for this particular
> issue not something scary. As Windows is restricted to Intel platforms
> (unless I misunderstood the drop of Alpha and MIPS). And Intel goes to
> almost insane lengths to make MP systems behave like an SP system in
> terms of memory use.

Understandably. Note that much of what I am saying is a question of
ensuring that your programs will work in the future -- currently, for
example, none of the Sun Sparcs implements RMO (or so I've been told),
so you shouldn't have problems due to it. On the other hand, I've heard
no guarantee from Sun that this will be the policy forever in the
future, so I prefer to be prepared. I imagine that the situation is
similar with Microsoft on Intel.

> Thus the outcome is *if* you create a schema that works well in MT
> environment for any possible imagined point of interruptions, it will
> work on multiprocessor system as well as on a single-processor one.
> (Khm, thinking again, with an exception: there may be problems with
> object access that is not atomic -- a single processor can't read
> inconsistent data while that may be possible for another. But IMO in
> the design nonatomic access shall be considered as an interruptable
> and disruptable operation in all cases.)

The atomic access issue is doubtlessly the one which will cause the most
problems. As you say, for commercial reasons, most large vendors try to
ensure backwards compatibility, and to avoid making the customer look
like an idiot. On the other hand, the atomic access issue pretty much means
that programs not designed for multi-processor architectures are likely
to break anyway, so the risk of adding ordering issues is probably
small.

> > but it also suggests that use of the volatile keyword provides
> > coherency of memory

> Memory coherency is provided by the processor -- volatile is
> irrelevant here.

There are several aspects involved. If the compiler optimizes away the
instruction which causes the access, the processor won't put it back
in.

> You need no special instructions to broadcast a change to a memory
> location -- the write in itself does invalidate the cache lines on all
> processors. Though the memory ordering issues can apply in some
> situations -- but asserting LOCK# is a signal for all the processors
> to put in a barrier. So any instruction with implicit or explicit
> lock will do the job.

That's what I've been lead to believe. (I'm not an expert on current
Intel architectures.) What I have verified is that VC++ does NOT
generate a lock prefix when accessing a volatile variable. And I
believe (it was true in the past, anyway), that without a lock prefix,
the processor doesn't assert the LOCK# signal on the memory bus.

I'm not sure what instructions do an implicit lock. Back when I was
working on Intel, none did.

> > and is meaningful in a multithreaded environment - however it refers
> > the reader to Visual C++ documentation where information on the
> > effect of volatile is fairly useless - though it suggests that it's
> > useful in a multithreaded app.

> That can unfortunately be misleading -- volatile itself does not
> generate lock prefixes at access. But you need it to tell the compiler
> not to remove simple accesses, at least. In situations where ordering
> issues do not apply, it is sufficient for communication.

That's a contradiction of what you just wrote above. I believe that
the Intel caching is implemented with "write through"; that a write to a
local cache will always generate a write access to main memory (perhaps
with some delay), and that the caches of all of the processors will
detect this, and flush their cached data. From what I have been told,
however, this is not true on IA-64; it is a feature of the IA-32.

It remains to establish how read and write accesses to the memory bus
and the local cache map to individual instructions. Generally speaking,
at least on some modern processors, the fact that a load instruction has
been issued does NOT suffice to ensure an actual access, even to the
local cache. Actual read accesses are by cache line, considerably wider
than a byte, or even than a word. And if a following read access finds
the data in an internal memory access register, it may use that data,
rather than going to external memory.

Things like the membar instruction for Sparc, or the memory barrier
primitives from Microsoft, documented in the previously posted link,
aren't there for the fun of it. They fulfill a real need. And if
volatile doesn't cause the compiler to generate something similar, then
volatile doesn't fulfill that need.

> > I'm curious to know exactly what a "memory barrier" operation
> > involves as it sounds potentially very time consuming

> there are different ones, the variant mentioned here is a lightweight
> one -- it merely constrains ordering of accesses within a
> processor. IOW it inhibits reorderings across the barrier.

> > - how long does it take to write 50K of data from the cache to
> > external memory or to other caches? Intel XEON micro has
> > hyperthreading where multiple "CPUs" share all resources including
> > the cache and each "CPU" is just a bunch of registers - they can
> > execute threads of a process in parallel and presumably don't have
> > time consuming "memory barrier" operations to perform.

> Actually the new intel processors have *FENCE instructions that are
> similar to the ordering membars on other processors. They got
> introduced especially to *gain* speed, to give the opportunity to use
> less brutal things than the current LOCK.

The key is, of course, that without such instructions, you pretty much
have to do the equivalent with every instruction. They cost time, but
only when you actually use them, and not on every memory access. And they
don't mean flushing a 50K cache; I cannot imagine a processor allowing
50K to become "dirty" before having done anything about it.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


Ben Hutchings

Feb 23, 2004, 2:36:41 PM
Balog Pal wrote:
> "Graeme Prentice" <inv...@yahoo.co.nz> wrote in message
> news:tmkb3097tv6ljjbui...@4ax.com...
>
>> As far as I can see, many
>> Windows programmers who write multithreaded apps are unaware that their
>> program might run on a machine with multiple CPUs or what the
>> implications of that are.
>
> Which is in general terms a Bad Thing(TM), but for this particular issue not
> something scary. As Windows is restricted to Intel platforms (unless I
> misunderstood the drop of alpha and mips). And Intel does things on an almost
> insane level to make MP systems behave like a SP system in terms of
> memory use.
<snip>

> Actually the new intel processors have *FENCE instructions that are similar
> to the ordering membars on other processors. They got introduced especially
> to *gain* speed, to give the opportunity to use less brutal
> things than the current LOCK.

Windows runs on those too, and soon it will run on PowerPC again in
the X-Box 2 (though I imagine that has a single processor). It also
runs on AMD64 but that provides the same strong guarantees as x86.
<http://blogs.msdn.com/cbrumme/archive/2003/05/17/51445.aspx> has
details of the memory model for .NET, which is basically the same as
the memory model for Windows.

Ben Hutchings

Feb 23, 2004, 2:42:22 PM
Graeme Prentice wrote:
> On 21 Feb 2004 20:49:32 -0500, Graeme Prentice wrote:
<snip>

> >on a single CPU, the hardware can change a write/read
> >sequence (generated by the compiler) into a read/write (which makes no
> >difference on a single CPU) but with multiple CPUs, it can make a
> >difference and a memory barrier is required when shared memory and
> >multiple CPUs are involved.
>
> [correcting myself again]
>
> Reordering of write/read into read/write does make a difference on a
> single CPU multithreaded app.
<snip>

How?

Michael Furman

Feb 23, 2004, 10:26:38 PM
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...
> [...]

> > I guess you are talking about strict POSIX threading model (which is
> > not consitent with C++ by the way). It is not true for other models
> > (see my other post).
>
> Posix is the only threading standard I have on line. (Or had -- I just
> got the documentation for VC++ installed this morning. So once I've
> learned to find my way around in it...)
> [...]

>
> > Taking mutex (in POSIX model) costs saving/reading all variables (like
> > they are all volatile), but only in the point of taking mutex. In
> > some cases the fist method is better; in other - the second.
>
> > (I am not saying that using "volatile" is enough in the first method -
> > it is onlu nessesary. We shoud also use some other wais to ensure
> > atomicity and some kind of coherence)
>
> My point is that once we have used the other ways, volatile is no longer
> necessary. And that the other ways are necessary; volatile alone is not
> sufficient.

And by "other ways" you mean just one way: "Posix" - don't you?
If so - I agree.

> [...]


> > Without volatile it is guaranteed that it will not work correctly when
> > compiled with simple optimization.
>
> There's never a guarantee that it won't work. Unless the optimizer is
> especially dumb, however, there is a very high probability.

OK, you are right :-).

>
> > Without volatile, it is not guaranteed alone. But addition of some
> > extra things/mechanisms would guarantee it. For example, one of:
>
> > - just one CPU and classic multiprogramming environment;
>
> Technically, that isn't guaranteed either. In practice, I can't imagine
> a processor where it wouldn't work with just a single CPU. On the other
> hand, at least in the environment where I work (servers), I can't
> imagine a context where I can be sure that there will only be a single
> CPU. If not today, tomorrow...

You are looking for some guarantee applied to everything. That can only be
done by some standard - and there is none presently (besides POSIX).

>
> > - mechanism (either software or hardware) that periodically flush all
> > caches.
>
> I'm not aware of any such mechanism on the machines I work on.

Neither am I, but I can easily imagine (and implement) a separate thread
(or process, or timer queue entry, or whatever) that periodically does that.

>
> > That is what I call necessary (but not enough).
>
> I agree that there may be other mechanisms that a lock which would work.
> I've got some asm in my RefCntPtr for the Sparc, for example, to
> implement an atomic increment and decrement; for Windows, I use a
> special library routine -- no ASM, but not portable either.
>
> In no case have I declared anything volatile.

That is (I guess) because both the protected variable (the counter) and the
protection mechanism are encapsulated in one object. If you try to implement
something like a mutex (which can be used to protect any variable), you would
have to either force everything to be flushed (as in POSIX) or use volatile
(which I suppose "flushes" one variable) together with some finer-grained
hardware instructions (like flushing just one word - do such instructions
exist on modern computers?).

>
> > > Arguably, the intent of volatile is that the compiler generate
> > > these instructions when a volatile variable is accessed, but in
> > > practice, they don't (and Posix very definitly doesn't require them
> > > to in C).
>
> > I hope it is not - IMO it would make it useless.
>
> Well, it isn't so for g++ (at least under Solaris), nor for Sun CC, nor
> for VC++. I don't have access to any other compilers at the moment, so
> I can't speak for the others.
>
> > > According to Posix, the above code has undefined behavior, so the
> > > problem of whether the compiler optimizes accesses or not is moot.
> > > In order for the behavior to be defined, *ALL* accesses to stopflag
> > > must be protected, normally by a mutex. In which case, Posix does
> > > guarantee that the code works. With or without volatile.
>
> > Is it POSIX newsgroup? :-)
>
> No, but since the issue concerns implementation defined behavior, we've
> got to talk about some concrete examples. Once I've gotten into the
> Windows documentation a bit more, I'll try and use examples from both,
> in order to be a bit more fair. Still, I know and work on Posix
> (Solaris, HP/UX and AIX), so it will doubtlessly be Posix which I can
> talk about best.

In POSIX it is extremely simple - with the cost of flushing everything
every time. So, using mutexes is very costly, especially in the case of
large caches and/or many CPUs.

>
> The only alternative to discussing Posix or Windows examples here is to
> say that we cannot talk about the problem at all. Even though it
> concerns most C++ programmers in one way or another.

Why can't we? I believe we are doing it right now :-).
> [...]


>
> > What I would like from the C++ standard, rather than addressing
> > threading (or in addition to it), is addressing memory synchronization
> > problems. There are "volatile" and "sig_atomic_t" - but they are not
> > defined in a wide enough context and probably they are not enough.
>
> You are not alone; I believe that there are a fair number of people who
> would like to see threading addressed in a future version of the
> standard. FWIW: I would like to see it made quite clear that accesses
> to a volatile object are synchronized between threads. Even in a
> multiprocessor environment, where different threads run on different
> processors. I'm not the one who writes the standard, however, and I'm
> not sure that this particular requirement will make it into the
> standard, even if threading in general does.

Here I am not sure I agree - I like the present situation where "volatile"
takes only the part of this guarantee that is directly related to the compiler.
The other part needs special hardware features - they could be different
even for different implementations of the same architecture. (Of course,
again I am thinking more about freestanding implementations.) So I am
afraid it would use an overkill "flush everything" method.

> [...]


> > > Interrupt handlers are a different question. You'll have to see
> > > what your system requires for these: volatile may or may not be
> > > useful, depending on what the system requires. I suppose that the
> > > same thing is true for threads, but volatile is not useful with
> > > threads under Posix, nor, as far as I can tell, under Windows.
>
> > No, it is not true for Windows. Here is from MSVC documentation:
>
> > W> The volatile keyword is a type qualifier used to declare that an
> > W> object can be modified in the program by something other than
> > W> statements, such as the operating system, the hardware, or a
> > W> concurrently executing thread.
>
> That's interesting. From what little I understand of 80x86 hardware, in
> order for this guarantee to hold on a multiprocessor system, the
> instructions which access the variable must be preceded by a lock prefix
> (but I could be wrong here -- it wasn't necessary when I worked on 8086,
> back when it was 8086 and not 80x86). From what I have seen by
> examining the generated assembler from VC++ 6.0, it doesn't generate a
> lock prefix.

In my understanding, "W" says that using volatile is necessary - it does not
say that it is enough.

> > So, what I am trying to say is that POSIX (and even Windows) is not
> > everything in the world.
>
> Certainly not, but Posix and Windows together represent a pretty large
> segment of the general purpose machines. (There are a lot more embedded
> systems, of course, and from what little I know about them, volatile is
> likely to be useful on them.)

That is (the second part) what I am trying to emphasize. And I would
appreciate some more support for that from the Standard.

>
> > And POSIX uses a very heavy method (flushing out all data at the
> > synchronization point) that makes "volatile" useless, but it is not
> > good for every application.
>
> Most of the modern Posix systems have alternatives, although at the
> hardware level, you do end up flushing "everything". (That's what a
> membar instruction does on a Sparc, at any rate.) The problem is that
> the alternatives are different for every different platform -- rather
> than just having to deal with two variants: Posix and Windows, you have
> to deal with four or five for Posix, and probably at least two (IA-32
> and IA-64) for Windows. And that generally speaking, once you've
> implemented whatever is necessary, even at this level, you no longer
> need volatile: the InterlockedIncrement instruction under Windows, for
> example, doesn't require (or even allow) that its argument be volatile,
> and of course, the bits of assembler code for Sparc have no notion of
> volatile.

But if you use InterlockedIncrement for synchronizing access to some other
variable (let's consider a one-CPU environment), you need to declare it
volatile. Otherwise the compiler could use a register copy of this variable
and would not see that it changed.


Regards,
Michael Furman

Balog Pal

Feb 24, 2004, 8:32:40 AM
<ka...@gabi-soft.fr> wrote in message
news:d6652001.0402...@posting.google.com...

> > membar serves completely different purposes -- it is about _ordering_


> > (well, actually it has other forms too, but I'm sure you had the
> > ordering membar in mind here).

> It ensures that any data incidentally read during the loads issued
> before the membar are ignored by loads after the membar. That's
> ordering, yes. But it is necessary in this case.

To remind the case -- it was a spin on a single variable. Order, as we'd
expect, is interesting if we have at least two somethings. ;-) In more
general examples it is certainly vital. Just not in this particular one.

> The problem is that if the data at the address read happens to already
> be in the processor, because of an earlier read, then the processor may
> not go to main memory in order to fetch the value.

Well. The term 'main memory' is irrelevant for a great many systems: those that
define only the 'processor' and the 'memory'. There 'memory' may have whatever
guts, multiple levels of caches or anything; the processor need not know what
goes on there, it just deals with memory. If it asks for data at an address, it
gets it. Internal caches may make it a short or a long process, but the
result shall be the same. Sparc V9 is defined this way. The architecture
manual says nothing at all about caches (except this basic expectation).
IA32 is also such a system, with the difference that the manual describes plenty
of things about how the cache works on some models -- that is an interesting
read but also irrelevant to our discussion.

If the processor executes an instruction that asks for a read, it must do a
read. That read may be out of order, but it must be a read. If you spin on that
read instruction the processor is just not allowed to ignore the memory
system and think it has read it once, and that that result is good for the next
billion reads.

> And if the processor
> doesn't go to main memory, then it will not see any modifications made
> by another processor.

Wrong assumption. [see proof later]

> In practice, there is an extremely high probability that it will work.
> The amount of memory held locally in the processor (I'm not talking
> about cache, but about physically on the processor chip, in memory
> interface registers, etc.) is very small, and sooner or later, you will
> get a hardware interruption, or some cron job will interrupt, your
> process will be interrupted, and some other process will read enough
> other data to purge the internal contents.

As a matter of fact that also makes the presented scheme not work on a
real system. But it must work even if the process timeslice is infinitely
long, and no interrupts happen.

> > If you're to restrict yourself to stuff that is defined only within
> > the posix itself you'll waste a tremendous amount of resources.
>
> Well, if you want your program to work, it's generally better if you
> restrict yourself to things that are guaranteed somewhere.

In my terms it's not 'better' but anything else is strongly prohibited :). I
generally offer to sit on that locomotive on the test run if I said it will
work, and I allow it to carry stuff.

But that doesn't imply I must use only elements written in one document and
ignore others. If the target platform is "any posix", sure, those rules
must be strictly followed. If the situation is more restricted, and other
elements documented, I can use other schemes. And if they have a heavy impact
on performance, I likely should use them; it would be a lame excuse to say
'well, it crawls on this sparc to keep the sources compiling and running on some
alpha without change'. Guess you would not accept that either if you
ordered the software strictly for sparc. ;-))

Certainly where I don't know any better, or the tradeoff is negligible, it is
better to pick some easier route.

> > > In order for the behavior to be defined, *ALL* accesses to stopflag
> > > must be protected, normally by a mutex. In which case, Posix does
> > > guarantee that the code works. With or without volatile.
>
> > Which is exactly what I just said -- a waste of resources for little
> > reason. As that stopflag loop works correctly on most machines
> > (including sparc V9 in RMO) as long as simple write and read access
> > instructions are emitted by the compiler.
>
> The Sparc architecture document says most explicitly that it doesn't
> work on a Sparc v9 in RMO.

Where does it say that? Just knowing the scope of that document you could
think it can't say such a thing.

[proof]
And instead of the theoretical part, take a look at the sync primitive
examples in the document. For example the spin lock implementation at J.6.
Do you see anything special in the spin loop? Certainly not: just a plain
load-compare-retry spin. Using your theory it would never see any change in
the memory content when it changes on another processor.

It certainly does see the change.

The ordering membar instruction is needed later, at the exit -- to make any
subsequent accesses not use any previously read data. To access and read
the flag variable -- which maps to 'stopflag' in the original example -- nothing
at all special is needed.

> > Where it doesn't work should be an exotic system with a really dangerous
> > memory model any thread programmer should be aware of anyway.
>
> Most general purpose systems today use really dangerous memory models.

There was a pretty good thesis on a related topic recently linked in a
message in this thread. That thesis stated the opposite. [I do not
consider an RMO model dangerous, I meant some system where you must flush
caches manually or stuff like that.]

> > What is that system? IMHO you're mistaken here confusing memory
> > ordering issues with memory change visibility in general.
>
> The two are related.

Not the way you seem to think. Other posts suggest you are thinking about cache
flushing issues -- the operations that the _sequencing_ membar instruction
does on SPARC. Those are important for dealing with memory-mapped I/O or
similar devices, or special OS-level tasks when it reprograms the cache
controller, remaps memory, etc.
Nothing related to the different memory models requires one of those.
All the descriptions, and the sync primitive examples, use exclusively the
_ordering_ membar instructions. And as the description starts, they "introduce
an ordering in the instruction stream of a *single* processor". IOW they only
influence the reorder buffers of the one processor they get executed on.

> I've worked on those sort of systems. A long time ago, in simpler
> times. A hand-written scheduler for a multi-processor machine is NOT
> simple to write.

Certainly, but I was talking about a single processor one :)

> Of course, when you are programming at that level, you
> don't use pthread_mutex_lock. You need to know the hardware intimately,
> and design your code around it. But then, you won't be writing the
> scheduler in standard C++ either.

Yep, but the idea was to _use_ that system in C++. I just wanted to show a
system that IMHO has threads in all practical terms, though with a very
different method of avoiding conflicts.

> That's really all I'm claiming. The problem, or at least, my impression
> of the problem, is that there is some confusion between threads and
> interrupt handling.

Possibly it's easy to create confusion. (Not if it doesn't sneak in by
itself ;-)

The interrupt system is really not exactly like a thread system, but the
main issues are very similar. And rules to observe are similar too -- so in
practical terms you do the same things to create a working system. The
actual primitives may be different.

> Threads are a fairly high level concept, supported
> by the OS (be it Posix or Windows);

The 'simple scheduler' I was trying to describe would do exactly like the OS
you mention. And wishing it 'simple' was exactly for the purpose that we could
think only of the really basic issues related to threads. Real-life
schedulers get complicated for efficiency reasons, and their most complicated
task is to pick the next thread to wake. But the issues under discussion
have nothing to do with that part; we only expect threads to run in some
theoretical time-space. :)

> on a multiprocessor system, you have
> no control over which processor an individual thread runs on, nor in
> which memory its variables have been mapped. In an interrupt routine,
> you are a lot closer to the hardware, and have a lot more control.

Again, my intention was to completely ignore what the thread manager does --
the interesting part is what the thread itself does.

In the POSIX model you use mutex objects. When you have a critical section
of code, you try to lock the mutex object. That will succeed when the way
is free, and block any other attempts. If it's closed, the thread gets
suspended until sometime later. When you're finished you unlock the mutex --
that is supposed to wake threads blocked on it. Thus a shared data object
controls thread execution inside the section.

In interrupt-driven hand-made single-processor threading you may simply block
the interrupts. As the task switch happens on the interrupt, no switch can
happen while you have disabled them. So the section is well guarded if there is
a disable-ints at the start and a restore-ints at the end. Certainly if you have
them implemented, you can also use hand-made mutexes.

A single thread + interrupt system is even more interesting. Here you can't
have mutexes, only the disable-ints method. (As there's no scheduler, and
generally no way to suspend -- discounting the case of suspending the main
thread waiting on a flag set from the interrupt level.)

The common part is that you must plan the shared object access, and prevent
messing up the state due to parallel acting on the same things. The
difference is the strategy to reach that goal.

Paul

Graeme Prentice

Feb 24, 2004, 8:35:07 AM
On 23 Feb 2004 14:42:22 -0500, Ben Hutchings wrote:

>Graeme Prentice wrote:
>> On 21 Feb 2004 20:49:32 -0500, Graeme Prentice wrote:
><snip>
>> >on a single CPU, the hardware can change a write/read
>> >sequence (generated by the compiler) into a read/write (which makes no
>> >difference on a single CPU) but with multiple CPUs, it can make a
>> >difference and a memory barrier is required when shared memory and
>> >multiple CPUs are involved.
>>
>> [correcting myself again]
>>
>> Reordering of write/read into read/write does make a difference on a
>> single CPU multithreaded app.
><snip>
>
>How?

um ... well, as Paul (Balog Pal) explained in the sub-thread on double
checked locking, it can't happen. I remembered the suggestion someone
made a few days ago about using two volatile bools instead of one, and
when I thought about it I decided that if the hardware re-orders
write/read into read/write (as this MS web page says)
http://www.microsoft.com/whdc/hwdev/driver/MPmem-barrier.mspx

then the two bool flag scheme fails, so I hastened to correct what I
thought was a mistake, forgetting that the MS web page is explicitly
talking about multi-processor architectures.

It was only when Paul said it can't happen that I thought if the
hardware re-orders a write/read into a read/write, the entire operation
is completed long before another thread comes along to do the same thing
- however, without knowing the intimate details of how re-ordering
works, a programmer can only guess at what scenarios can occur and rely
on the guidelines for handling shared memory. Paul also says a membar
would occur between switching threads as well, which is logical for an
OS to do, but it's not immediately obvious to a naive programmer like me
who'd never heard of re-ordering or cache incoherency until a few days
ago. BTW (Paul) - I don't see any sign of a FENCE instruction being
executed during execution of InterlockedIncrement on X86 code - it just
does a LOCK(ed) ADD which provides atomicity only.

Regarding my comment about C++ standardization of multithreading - I
should have paid more attention to boost threads because looking in
there, I see it provides mutex that works on Windows, MAC and pthreads
(Posix Unix?) and presumably the synchronization mechanisms provided by
boost threads take care of the memory issues being discussed in this
thread. I tend to feel that allowing boost to provide standardized
thread mechanisms would be better than adding to the already large C++
standard library, if it weren't for the fact that it requires volunteers
to maintain it and create it, whereas standardizing it would encourage
professional compiler vendors to provide it. The compiler vendor could
do a better job of providing it than a library vendor because the
compiler vendor knows the target OS and hardware. Another issue is that
the boost thread mechanisms require some sort of scheduler underneath
which would probably come with its own synchronization stuff, as well as
being many and varied (I guess).

Graeme

Balog Pal

Feb 24, 2004, 8:37:18 AM
{Warning: topic drift. This thread is becoming very verbose with very
little if any C++ content. Please either increase the C++ content or
move elsewhere. -mod}

<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...

> Note that much of what I am saying is a question of
> ensuring that your programs will work in the future -- currently, for
> example, none of the Sun Sparcs implements RMO (or so I've been told),
> so you shouldn't have problems due to it. On the other hand, I've heard
> no guarantee from Sun that this will be the policy forever in the
> future, so I prefer to be prepared. I imagine that the situation is
> similar with Microsoft on Intel.

I put it another way (though it is close). The product shall run
correctly on the *specified* environment. What is that -- may be anything.
The important part is that the promise must be kept.

So if I say SPARC, that must mean SPARC. With any modes including RMO. If I
say SPARC TSO, that's okey too to specify. The client may refuse to buy the
product if he thinks RMO is important.

Similar, if I state IA32 (note: where I wrote 'intel' recently IA32 was
really meant, sorry if that lead to confusion), and [list] of windows
platforms, that is meant. If used outside, that shall be the client's
problem.

I can not, dare not and will not state my soft will run on any imaginable
future platform. If thet future platform is compatible with dox I used, it
will run. If it differs, it may run otr may break. How could I know?

And preparing for unforseen changes is IMHO not practical. :) All attempts
I've seen made were futile. forseen changes are another class, those may be
worth considering.

> The atomic access issue is doubtlessly the one which will cause the most
> problems. As you say, for commercial reasons, most large vendors try to
> ensure backwards compatibility, and avoid making the customer look like an
> idiot. On the other hand, the atomic access issue pretty much means
> that programs not designed for multi-processor architectures are likely
> to break anyway, so the risk of adding ordering issues is probably
> small.

There was an interesting debate in the past -- about a theoretical change in
how 'lock' would behave on intel. One school says we use the intel
dox, drop in some handmade assy where needed and everything works fine. The
other school says follow the MS dox and use the Interlocked** API but
nothing else; if a change happens the OS will have adjusted code.

On such issues I take a more practical than theoretic view. For some issues
I use 'documentation by practice', meaning I analyse sources or bytes of
Microsoft products to deduce what they really mean -- documentation is many
times unclear, fuzzy or even wrong. Or at least it was at multiple points in
time I needed it. So, if I see MS programmers relied on some behavior or
meaning, I consider it 'locked in'.
IOW I'll have code that will run where the [specified] version of Windows
runs. And break where the other breaks. I'm positive the PC design will
not change in a way that breaks tons of existing software (many others
follow a similar policy) -- and should that happen anyway, it will be known
soon enough, and correcting measures possible.

> > > but it also suggests that use of the volatile keyword provides
> > > coherency of memory
>
> > Memory coherency is provided by the processor -- volatile is
> > irrelevant here.
>
> There are several aspects involved. If the compiler optimizes away the
> instruction which causes the access, the processor won't put it back
> in.

Yes, volatile is there to ensure the instruction is not removed by the
compiler. And nothing else is expected -- the program shall be content with
whatever the memory system provides for the simple instruction. If any
extra measures shall be taken, they're taken separately.

> > You need no special instructions to broadcast a change to a memory
> > location -- the write in itself does invalidate the cache lines on all
> > processors. Though the memory ordering issues can apply in some
> > situations -- but asserting LOCK# is a signal for all the processors
> > to put in a barrier. So any instruction with implicit or explicit
> > lock will do the job.
>
> That's what I've been led to believe. (I'm not an expert on current
> Intel architectures.) What I have verified is that VC++ does NOT
> generate a lock prefix when accessing a volatile variable. And I
> believe (it was true in the past, anyway), that without a lock prefix,
> the processor doesn't assert the LOCK# signal on the memory bus.

Certainly it doesn't, that would kill performance. The point is any
earlier locked instruction will do the job of assuring a memory barrier, no
need for e.g. calling pthread_mutex_* just for that purpose.

> I'm not sure what instructions do an implicit lock. Back when I was
> working on Intel, none did.

I'm too lazy to fetch the dox, but IIRC it's XCHG, and it gained the implicit
lock sometime around the 8088 processor to avoid problems with its 8-bit bus.
But it most definitely was locked on the 286. And it's been carried ever since.
I personally consider that an artefact, and never rely on it; when lock is
needed I use the prefix.

> > That can be unfortunately misleading -- volatile itself does not
> > generate lock prefixes at access. But you need it to tell the compiler
> > not to remove simple accesses at least. In situations where ordering
> > issues do not apply it is sufficient for communication.
>
> That contradicts what you just wrote above.

I see no contradiction.

> I believe that
> the Intel caching is implemented with "write through";

IIRC WT was used for a short period around the Pentium, and the default
changed to WB around the P2.

> that a write to a
> local cache will always generate a write access to main memory (perhaps
> with some delay), and that the caches of all of the processors will
> detect this, and flush their cached data.

That was changed to WB backed up with caches talking the MESI protocol at
the PentiumII, and has possibly changed further multiple times since. But all
that is completely irrelevant; the point is that the caches do some magic,
and all processors always see the same memory.

> From what I have been told,
> however, this is not true on IA-64; it is a feature of the IA-32.

It is a feature of IA32, I can't say anything about IA64.

> It remains to establish how read and write accesses to the memory bus
> and the local cache map to individual instructions.

The docs have a clear description and detailed explanation of the impact,
with tips on programming practices.

> Generally speaking,
> at least on some modern processors, the fact that a load instruction has
> been issued does NOT suffice to ensure an actual access, even to the
> local cache.

If you're sure of that, I beg for a reference.

> Actual read accesses are by cache line, considerably wider
> than a byte, or even than a word. And if a following read access finds
> the data in an internal memory access register, it may use that data,
> rather than going to external memory.

Errr, I recall some register of that kind, but it's invalidated together
with the cache line, so as with everything else, it's completely transparent
from the processor's point of view; an access can be "omitted" only in
situations where the same value would be read if you went the whole path.

> Things like the membar instruction for Sparc, or the memory barrier
> primitives from Microsoft, documented in the previously posted link,
> aren't there for the fun of it. They fulfill a real need.

Sure they are.

> And if
> volatile doesn't cause the compiler to generate something similar, then
> volatile doesn't fulfill that need.

Here is the slip in the logic -- as was already stated, those
instructions serve one purpose, and when it is needed you must ensure they
are there. But they're not needed for every possible use of volatile, and as
long as the programmer *does not think* mistakenly that volatile replaces
them, there will be no problem.

The example you provided in another post -- the old famous singleton init --
is such a situation, for it has multiple memory events, and the code tries to
rely on their proper order. That order must be forced.

For the other example of this subthread, using a volatile bool stopflag that
is spun on in the thread main, no ordering requirement is there. stopflag is
a solo object unrelated to anything else around; it's no problem if its
state change is observed unordered wrt other stuff happening around --
especially as in the example it caused the thread to exit, a syncing
instruction even in the strict posix sense.
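The stop-flag pattern under discussion can be sketched as follows. This uses `std::atomic<bool>` (the post-2004, guaranteed-portable replacement for a `volatile bool` flag; function names are illustrative) so that the single-variable visibility argued for here holds by the letter of the standard rather than by implementation accident:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Worker loop that spins until another thread raises the stop flag.
// In 2004-era code the flag was often a 'volatile bool'; std::atomic<bool>
// gives the same single-variable visibility with defined semantics, and
// no ordering with respect to other data is claimed or needed.
inline long run_until_stopped(const std::atomic<bool>& stop) {
    long iterations = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        ++iterations;  // stand-in for real work
    }
    return iterations;
}

// Demonstration: launch the worker, let it spin briefly, then stop it.
// The loop terminates because the load cannot be hoisted out of the loop.
inline long demo_stop_flag() {
    std::atomic<bool> stop{false};
    long result = -1;
    std::thread worker([&] { result = run_until_stopped(stop); });
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    stop.store(true, std::memory_order_relaxed);
    worker.join();
    return result;
}
```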

If you still think there can be an ordering problem -- especially one
preventable with a well placed ordering membar -- please describe the
scenario. I still think you're misled by thinking the instructions are
needed to flush caches, commit bytes to the memory system, make changes
visible -- that those would not happen without them. It all pretty well
happens anyway on the platforms we talk about.

> > Actually the new intel processors have *FENCE instructions that are
> > similar to the ordering membars on other processors. they got
> > introduced especially for the purpose to *gain* speed, to give
> > opportunity to use less brutal things than the current LOCK.
>
> The key is, of course, that without such instructions, you pretty much
> have to do the equivalent with every instruction.

Why would you? Knowing that, despite any possible speculative reads, writes
and reorderings, processor self-consistency is preserved, there are very few
situations where ordering is really important and must be forced. The need
is limited to memory objects shared by processors. They appear only at sync
points.

Typically at the exit point of a mutex lock and entry of unlock operation.

(Though, the strongly ordered models do exactly that: any operation has its
implicit membar of certain types.)

Paul

ka...@gabi-soft.fr

unread,
Feb 24, 2004, 2:09:10 PM2/24/04
to
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c1do90$1himgh$1...@ID-122417.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...
> > [...]
> > > I guess you are talking about the strict POSIX threading model (which
> > > is not consistent with C++ by the way). It is not true for other
> > > models (see my other post).

> > Posix is the only threading standard I have on line. (Or had -- I
> > just got the documentation for VC++ installed this morning. So once
> > I've learned to find my way around in it...)
> > [...]

> > > Taking a mutex (in the POSIX model) costs saving/reading all variables
> > > (like they are all volatile), but only at the point of taking the
> > > mutex. In some cases the first method is better; in others - the
> > > second.

> > > (I am not saying that using "volatile" is enough in the first
> > > method - it is only necessary. We should also use some other ways
> > > to ensure atomicity and some kind of coherence)

> > My point is that once we have used the other ways, volatile is no
> > longer necessary. And that the other ways are necessary; volatile
> > alone is not sufficient.

> And by "other ways" you mean just one way: "Posix" - don't you? If so
> - I agree.

I was considering the context of threading within a process on a modern,
general purpose machine and OS: Windows or Unix. I cited the Posix
guarantees because those are the ones I am familiar with -- I presume
that Windows is similar.

I was not considering embedded processors, various real-time OS's, or
the like. My experience with such machines dates from a simpler time
when emitting a store instruction provoked an immediate write to
external memory, and multiprocessor systems using common shared memory
were rare. I suspect that this is still true for some of the smaller
embedded systems. In such cases, volatile is probably a sufficient
solution for some problems (supposing atomicity, etc.).

> > [...]

> > I agree that there may be other mechanisms than a lock which would
> > work. I've got some asm in my RefCntPtr for the Sparc, for example,
> > to implement an atomic increment and decrement; for Windows, I use a
> > special library routine -- no ASM, but not portable either.

> > In no case have I declared anything volatile.

> That is (I guess) because both the protected variable (counter) and the
> protection mechanism are encapsulated in one object. If you try to
> implement something like a mutex (that can be used to protect any
> variable) - you would have to either force everything to be flushed (like
> in POSIX) or use volatile (that I suppose to "flush" one variable) and
> some more fine-grained hardware instructions (like flush just one word -
> do such instructions exist on modern computers?).

The protection only concerns incrementing and decrementing a counter.
The generic implementation uses a mutex to protect this -- a very
expensive solution, but one that is guaranteed to work (without
volatile). The specific solution for Sparc v9 uses a bit of assembler,
with a membar instruction, to ensure memory synchronization. The
specific solution for Windows uses the InterlockedIncrement and
InterlockedDecrement primitives provided by this system -- from what I
know of the Intel architecture, I suspect that underneath them there is
an inc or a dec machine instruction, preceded by a lock prefix.

Nowhere have I found a case where volatile would be relevant to
this. The guarantees I need have all been provided by the other
mechanisms (Mutex, assembler code, special OS primitive). Conceivably,
an Intel compiler (or a compiler for another platform, for that matter)
could define that the ++ or -- operator on a volatile int generated the
appropriate instruction, preceded by a lock prefix (or whatever is
needed on the other platform). I know of no compiler which does this,
however, and generally speaking, I think it would go beyond the
traditional meaning of volatile (which only concerns individual
accesses, and not a read-modify-write cycle).
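The counter protection described above can be sketched portably with C++11 atomics (which postdate this discussion; on x86, `fetch_add`/`fetch_sub` typically compile to a LOCKed add/sub, the same thing suspected to be underneath InterlockedIncrement). Note there is no `volatile` anywhere -- the atomics supply both the atomic read-modify-write and the ordering:

```cpp
#include <atomic>

// Reference count protected by atomic read-modify-write operations.
// Starts at 1, representing the reference held by the creator.
class RefCount {
public:
    void add_ref() {
        // Relaxed is enough for increment: no decision depends on the result.
        count_.fetch_add(1, std::memory_order_relaxed);
    }
    // Returns true when the last reference was dropped.  Acquire-release
    // ordering makes prior writes by other owners visible to the thread
    // that performs the final release and will destroy the object.
    bool release() {
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
    long value() const { return count_.load(std::memory_order_relaxed); }
private:
    std::atomic<long> count_{1};
};
```

This matches the point being made: once the atomicity and ordering come from the primitive itself, qualifying the counter `volatile` adds nothing.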

However, my real point is that some additional guarantees beyond
volatile are necessary, AND that these guarantees are sufficient; that
once I've done whatever else is necessary, volatile is no longer needed.

[...]


> > No, but since the issue concerns implementation defined behavior,
> > we've got to talk about some concrete examples. Once I've gotten
> > into the Windows documentation a bit more, I'll try and use examples
> > from both, in order to be a bit more fair. Still, I know and work
> > on Posix (Solaris, HP/UX and AIX), so it will doubtlessly be Posix
> > which I can talk about best.

> In posix it is extremely simple - with the cost of flushing everything
> every time. So, using mutexes is very costly, especially in the case of
> large caches and/or many CPUs.

That's one possible implementation. The machines I have seen don't do
it that poorly; generally, writes to cache go through (with some delay)
to the main memory, and the cache itself tracks writes to the main
memory, and invalidates the cache line immediately. There typically
isn't that much that needs to be flushed and/or updated when a membar
instruction is issued (as it must be in the implementation of things
like pthread_mutex_lock on a Sparc).

I don't think that this is where the cost of the mutex is coming from.
To start with, locking a mutex involves switching from user mode to
kernel mode, and a lot of other kernel related processing. That is what
makes it expensive.

FWIW: I recently did a benchmark of the instance function of a
singleton, using no protection (valid if the first call to instance
precedes the first call to pthread_create, for example), using a membar,
and using a mutex. Practically speaking, there was no measurable
difference between no protection and the membar instruction -- in fact,
the "measured" value was a couple of nanoseconds: no higher than the
noise factor in my measurement rig. With a mutex, it took around 450
nanoseconds. So there is more than just memory synchronization going
on. (FWIW: the test machine isn't the most recent, nor the top of the
line of what Sun makes. So it is very unlikely that there is any really
sophisticated reordering going on in the hardware, and membar is
probably trivial. I suspect that there might be a measurable difference
between membar and no protection on the latest top of the line Sparc.
But I don't have access to one to verify it.)

> > [...]

> > > What I would like from the C++ standard, rather than addressing
> > > threading (or in addition to it), is to address memory synchronization
> > > problems. "volatile" and "sig_atomic_t" are something -
> > > but they are not defined in a wide enough context and probably
> > > they are not enough.

> > You are not alone; I believe that there are a fair number of people
> > who would like to see threading addressed in a future version of the
> > standard. FWIW: I would like to see it made quite clear that
> > accesses to a volatile object are synchronized between threads.
> > Even in a multiprocessor environment, where different threads run on
> > different processors. I'm not the one who writes the standard,
> > however, and I'm not sure that this particular requirement will make
> > it into the standard, even if threading in general does.

> Here I am not sure I agree - I like present situation when "volatile"
> takes only part of this garantee that is directly related to compiler.
> Another part needs using special hardware features - they could be
> different even for different implementations of the same architecture.

The features may be different, but how you access them, or their
implications in what the compiler is allowed to generate, might be part
of the standard.

> (of course, again I am thinking more about freestanding
> implementations) - so I am afraid it would use the overkilling "flush
> everything" method.

I wasn't thinking of necessarily defining it down to that level,
although I would not oppose a standardized mutex class (which
wouldn't make it illegal for the implementation to provide finer grained
protection when possible). But what I am really concerned with is
things like constructing local static objects (thread-safe or not?), or
even some sort of definition as to what sequence points mean when the
program is being executed by several threads.
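The local-static question raised here was eventually settled: since C++11 ("magic statics", standardized years after this discussion), initialization of a function-local static is guaranteed to run exactly once even when many threads race to the first call. A sketch that counts initializer executions under such a race (names are illustrative):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Shared counter for how many times the local-static initializer runs.
inline std::atomic<int>& init_counter() {
    static std::atomic<int> counter{0};
    return counter;
}

// N threads race to be the first to execute a function-local static's
// initializer.  Under C++11 rules the initializer runs exactly once;
// before C++11 (as at the time of this thread) the behavior was
// implementation-defined.
inline int racy_static_init(int n_threads) {
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) {
        threads.emplace_back([] {
            // One lambda expression => one closure type => one static,
            // shared by every thread executing this line.
            static int dummy = (init_counter().fetch_add(1), 0);
            (void)dummy;
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return init_counter().load();
}
```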

> > [...]
> > > > Interrupt handlers are a different question. You'll have to
> > > > see what your system requires for these: volatile may or may
> > > > not be useful, depending on what the system requires. I
> > > > suppose that the same thing is true for threads, but volatile
> > > > is not useful with threads under Posix, nor, as far as I can
> > > > tell, under Windows.

> > > No, it is not true for Windows. Here is from MSVC documentation:

> > > W> The volatile keyword is a type qualifier used to declare that an
> > > W> object can be modified in the program by something other than
> > > W> statements, such as the operating system, the hardware, or a
> > > W> concurrently executing thread.

> > That's interesting. From what little I understand of 80x86
> > hardware, in order for this guarantee to hold on a multiprocessor
> > system, the instructions which access the variable must be preceded
> > by a lock prefix (but I could be wrong here -- it wasn't necessary
> > when I worked on 8086, back when it was 8086 and not 80x86). From
> > what I have seen by examining the generated assembler from VC++ 6.0,
> > it doesn't generate a lock prefix.

> In my understanding, "W" says that using volatile is necessary - it
> does not say that it is enough.

True. But if that is really the case, it has very wide implications.
It means, for example, that it is practically impossible to write a
singleton, or for that matter, to do anything in constructors or
destructors. (An object is NEVER volatile in the constructor or the
destructor.)

Note too that at least with VC++, all volatile does is inhibit certain
optimizations. Given the level of optimization of this compiler,
introducing an external function call (say to WaitForSingleObject on a
Mutex) has exactly the same effect. How can volatile be necessary if it
results in exactly the same code being generated? What is the real
effect of volatile?

And the last question raises a more general complaint on my part.
According to ISO 9899:1999, §4/8: "An implementation shall be
accompanied by a document that defines all implementation-defined and
locale-specific characteristics and all extensions." For some reason,
a similar sentence is missing from §1.4 in ISO 14882:1998, although we
do have §1.3.5 "implementation-defined behavior: behavior, for a
well-formed program construct and correct data, that depends on the
implementation and that each implementation shall document." In both
standards, we have the fact that accessing a volatile object is
"observable behavior", but that what constitutes an access is
implementation defined. And this whole discussion turns more or less
around this point: we are guaranteed an "access", but what does this
access in turn guarantee us?

And although the C standard and the C++ standard require implementation
defined behavior to be documented, I've yet to find anything really
relevant, for any of the compilers I know of. Does writing to a
volatile variable mean:

- A machine store instruction is generated; what happens after that is
entirely up to the hardware, will vary from one machine to the next,
and for some architectures, is not even guaranteed to generate a
write cycle on the external processor bus (at least not
immediately)?

- A write through to main memory is guaranteed. In practice, on a
Sparc, this would imply a membar instruction somewhere in the
generated sequence. From what I understand about IA-32, it would
require a lock prefix on the mov instruction, but I'm not sure about
this.

- Any processor, anywhere which can see the memory can see the new
value. While this is probably what most people imagine,
intuitively, the implications, for example in the presence of memory
mapped files on a remote disk, are enormous.

My own feeling is that the second point probably corresponds closest to
the intent. From studying generated code, I can say that both Sun CC and
g++, on a Sparc, implement the first. Which is their right -- that's
what implementation-defined means. But they have to document this
choice, and I've not found any such documentation. For any compiler.

But I'm ranting. In practice, you can generally count on the first
guarantee above. And only the first. What it means, of course, depends
on the hardware.
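The second guarantee in the list above -- an actual write-through visible to other processors -- is what fence instructions (Sparc membar, x86 *FENCE) exist to provide. A hedged sketch using C++11 fences, which postdate this discussion, of the publish/consume ordering that volatile alone does not supply (names are illustrative):

```cpp
#include <atomic>
#include <thread>

// Modern analogue of a Sparc 'membar' or x86 '*FENCE' instruction: an
// explicit fence orders the data write before the flag write, so a reader
// that sees the flag (through its own acquire fence) also sees the data.
// Plain 'volatile' on both variables would emit neither fence.
inline void publish(int& data, std::atomic<bool>& ready, int value) {
    data = value;                                        // 1: write payload
    std::atomic_thread_fence(std::memory_order_release); // 2: order 1 before 3
    ready.store(true, std::memory_order_relaxed);        // 3: raise the flag
}

inline int consume(const int& data, const std::atomic<bool>& ready) {
    while (!ready.load(std::memory_order_relaxed)) {
        // spin until the flag is visible
    }
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with release
    return data;
}

// Demonstration: a writer thread publishes 7; once the reader observes the
// flag, the fences guarantee it also observes the payload.
inline int demo_publish_consume() {
    int data = 0;
    std::atomic<bool> ready{false};
    std::thread writer([&] { publish(data, ready, 7); });
    int seen = consume(data, ready);
    writer.join();
    return seen;
}
```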

Maybe.

You probably have a point; I'm not sure. The documentation for
InterlockedIncrement certainly doesn't guarantee much beyond the single
variable which is incremented. On the other hand, I'm not sure that
volatile will guarantee much more. See my comments on VC++ above: the
fact that InterlockedIncrement is an external function means that the
compiler will flush any globally accessible variable to memory before
calling it, volatile or not. And VC++ doesn't do anything more with
volatile, so we're back where we started from.

I rather agree that the volatile is nice -- I do want to say that this
must be visible elsewhere. But from what little I can see, volatile, at
least with VC++, doesn't guarantee this. Or at least, on a
multiprocessor machine, there is no guarantee that "elsewhere" includes
a thread in the same process, running on a different processor.

And IMHO, by far the most serious problem here is a lack of
documentation. Documentation required by the C and the C++ standards.
But what little documentation there is tends to be pleonastic: for its C
compiler, Microsoft defines "Any reference to a volatile-qualified type
is an access." Which is about as good as saying that they define an
access to be an access. It begs the question.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Ben Hutchings

unread,
Feb 25, 2004, 10:07:24 AM2/25/04
to
ka...@gabi-soft.fr wrote:
<snip>
> ...generally speaking, once you've
> implemented whatever is necessary, even at this level, you no longer
> need volatile: the InterlockedIncrement instruction under Windows, for
> example, doesn't require (or even allow) that its argument be volatile,
<snip>

InterlockedIncrement and *all* the other interlocked functions for
Win32, aside from SList functions, do take pointers to volatile-
qualified types. So they allow but don't require the use of volatile
objects. The use of volatile-qualification seems rather pointless
since only those functions should be used for manipulating the shared
objects and they don't need it, but Win32 isn't known for its sound
design.

Ben Hutchings

unread,
Feb 25, 2004, 10:07:57 AM2/25/04
to
Michael Furman wrote:
<snip>

> In posix it is extremely simple - with the cost of flushing everything
> every time. So, using mutexes is very costly, especially in the case of
> large caches and/or many CPUs.
<snip>

Cache flushes are ridiculously time-consuming, taking of the order of
a million cycles. This is why everyone (AFAIK) building multi-
processor shared-memory systems with caches uses a cache snooping
protocol to avoid the need to do that when synchronising. Memory
synchronisation only requires flushing some relatively short queues.
Mutex operations are still expensive relative to, say, out-of-line
function calls, but they are not as expensive as you think.

ka...@gabi-soft.fr

unread,
Feb 25, 2004, 6:50:35 PM2/25/04
to
"Balog Pal" <pa...@lib.hu> wrote in message
news:<403a...@andromeda.datanet.hu>...

> {Warning: topic drift. This thread is becoming very verbose with very
> little if any C++ content. Please either increase the C++ content or
> move elsewhere. -mod}

Most of my comments will concern the signification of lock, so I hope it
will pass.

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...

[...]


> > > > but it also suggests that use of the volatile keyword provides
> > > > coherency of memory

> > > Memory coherency is provided by the processor -- volatile is
> > > irrelevant here.

> > There are several aspects involved. If the compiler optimizes away
> > the instruction which causes the access, the processor won't put it
> > back in.

> Yes, volatile is there to ensure the instruction is not removed by the
> compiler. And nothing else is expected -- the program shall be
> content with whatever the memory system provides for the simple
> instruction. If any extra measures shall be taken, they're taken
> separately.

That's one point of view. My impression is that the intent of volatile
is that external changes be seen, and that my changes can be seen
externally. In such a case, volatile isn't necessarily enough on some
machines, and IMHO, it should be. But it's hard to talk about the
intent in this case, since we're talking about situations that didn't
exist when the keyword was introduced.

What is clearly the intent, and more than the intent: an explicit
requirement, is that the implementation document what it means by
"access" to a volatile object -- does the access ensure ordering within
a single thread, or multiple threads, for example. I could accept your
definition of volatile, IF that was what the implementation documented.
I've yet to find this information for any of the compilers I use,
however; from looking at the generated code, they don't, but I shouldn't
have to have to look at the generated code to know this.

[...]


> > It remains to establish how read and write accesses to the memory
> > bus and the local cache map to individual instructions.

> The docs have a clear description and detailed explanation of the
> impact, with tips on programming practices.

Which docs? Intel's or MS's? Are they available online?

> > Generally speaking, at least on some modern processors, the fact
> > that a load instruction has been issued does NOT suffice to ensure
> > an actual access, even to the local cache.

> If you're sure on that I beg for a reference.

I explained the details of how it works in another posting. I have been
told for a fact (by people working on the processor) that this is
actually the case in the Alpha architecture. My interpretation of the
specifications of RMO on a Sparc are that it is also allowed; none of
the current Sparc processors seems to implement RMO, however, so it is
difficult to determine whether this is really the case.

> > Actual read accesses are by cache line, considerably wider than a
> > byte, or even than a word. And if a following read access finds
> > the data in an internal memory access register, it may use that
> > data, rather than going to external memory.

> Errr, I recall some register of that kind, but it's invalidated
> together with the cache line, so as with everything else, it's completely
> transparent from the processor's point of view; an access can be
> "omitted" only in situations where the same value would be read if you
> went the whole path.

Hmmm. You may have something there. Reads can be suppressed, but only
if the main cache line hasn't been invalidated in the meantime.

The problem is that all of the documents I have talk in terms of very
abstract memory models, which are quite far from such hardware details,
so it is difficult to know exactly.

> > Things like the membar instruction for Sparc, or the memory barrier
> > primitives from Microsoft, documented in the previously posted
> > link, aren't there for the fun of it. They fulfill a real need.

> Sure they are.

> > And if volatile doesn't cause the compiler to generate something
> > similar, then volatile doesn't fulfill that need.

> Here is the slip in the logic -- as was already stated, those
> instructions serve one purpose, and when it is needed you must ensure they
> are there. But they're not needed for every possible use of volatile,
> and as long as the programmer *does not think* mistakenly that volatile
> replaces them, there will be no problem.

The real question is: what is the purpose of volatile? (I know:
implementation defined:-).) The classical use (an example in the C
standard) is memory mapped IO. One might thus suppose that it would be
sufficient for at least that use. On a Sparc, however, memory mapped IO
requires a "membar #MemIssue" instruction -- none of the compilers I
have generate such a thing for volatile accesses.

If the compilers actually documented this, then at least they would be
conform. Something along the lines of:

    What constitutes an access to an object that has volatile-qualified
type

An access means that the corresponding machine instruction to
load or store memory has been executed.

Note that this does not mean that anything has actually been
read or written. In particular, for memory mapped IO, or for
access from another thread or process, you will have to add
assembler code to generate some form of a membar instruction;
for memory that has been mapped by means of mmap, it is only
guaranteed that the values be visible on another process after
an fsync on the underlying file.

Whether such a weak definition is acceptable or not is a quality of
implementation issue; I certainly never expect to see an implementation
where the last sentence wouldn't hold.

> The example you provided in another post -- the old famous singleton
> init -- is such a situation, for it has multiple memory events, and the
> code tries to rely on their proper order. That order must be forced.

And there is no C++ standard way of forcing it. There is a Posix
standard way. There is also a Sparc standard way, considerably faster,
but obviously, a lot less portable. And of course, in some
environments, there isn't a problem at all.

> For the other example of this subthread, using a volatile bool
> stopflag that is spun on in the thread main, no ordering requirement is
> there. stopflag is a solo object unrelated to anything else around;
> it's no problem if its state change is observed unordered wrt other
> stuff happening around -- especially as in the example it caused the
> thread to exit, a syncing instruction even in the strict posix sense.

I'll have to reconsider that one. I'm not convinced, but it is
possible.

> If you still think there can be an ordering problem -- especially one
> preventable with a well placed ordering membar -- please describe the
> scenario. I still think you're misled by thinking the instructions
> are needed to flush caches, commit bytes to the memory system, make
> changes visible -- that those would not happen without them. It all
> pretty well happens anyway on the platforms we talk about.

> > > Actually the new intel processors have *FENCE instructions that
> > > are similar to the ordering membars on other processors. they
> > > got introduced especially for the purpose to *gain* speed, to
> > > give opportunity to use less brutal things than the current LOCK.

> > The key is, of course, that without such instructions, you pretty
> > much have to do the equivalent with every instruction.

> Why would you? Knowing that, despite any possible speculative reads,
> writes and reorderings, processor self-consistency is preserved, there
> are very few situations where ordering is really important and must
> be forced. The need is limited to memory objects shared by
> processors. They appear only at sync points.

I got sloppy with my pronouns. In this case, what I meant is you, the
designer of the chip... There are times that ordering will be
important; you cannot write software without it, and the chip must
provide it somehow. If there is no explicit instruction to provide it
(so that the programmer can decide when it is necessary), then the chip
must enforce the ordering systematically. After every instruction.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

ka...@gabi-soft.fr

unread,
Feb 25, 2004, 6:55:27 PM2/25/04
to
Graeme Prentice <inv...@yahoo.co.nz> wrote in message
news:<0l2l309625irfdqp7...@4ax.com>...

[...]


> BTW (Paul) - I don't see any sign of a FENCE instruction being
> executed during execution of InterlockedIncrement on X86 code - it
> just does a LOCK(ed) ADD which provides atomicity only.

Does it? It was my understanding that a lock prefix also synchronized;
that all accesses before the instruction had completed, and none after
the instruction had started when the locked instruction executed. I
could be mistaken, though. It's been a decade or so since I last worked
on Intel architecture, and I'm pretty sure that the change from 16 bits
to 32 isn't the only change in this period.

> Regarding my comment about C++ standardization of multithreading - I
> should have paid more attention to boost threads because looking in
> there, I see it provides mutex that works on Windows, MAC and pthreads
> (Posix Unix?) and presumably the synchronization mechanisms provided
> by boost threads take care of the memory issues being discussed in
> this thread.

Generally speaking, the OS system primitives on which the boost threads
are based take care of the memory issues.

> I tend to feel that allowing boost to provide standardized thread
> mechanisms would be better than adding to the already large C++
> standard library, if it weren't for the fact that it requires
> volunteers to maintain it and create it, whereas standardizing it
> would encourage professional compiler vendors to provide it.

I can think of a number of advantages of standardizing it. And a
potential problem: what happens in environments which don't support
threading? Are we willing to say that an environment without threads
cannot support C++? Or do we introduce (or rather expand) the concept
of "optional" packages in the language?

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

ka...@gabi-soft.fr

unread,
Feb 25, 2004, 6:56:26 PM2/25/04
to
"Balog Pal" <pa...@lib.hu> wrote in message
news:<403a...@andromeda.datanet.hu>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.0402...@posting.google.com...

> > > membar serves completely different purposes -- it is about
> > > _ordering_ (well, actually it have other forms too, but I'm sure
> > > you had the ordering membar in mind here).

> > It ensures that any data incidentally read during the loads issued
> > before the membar are ignored by loads after the membar. That's
> > ordering, yes. But it is necessary in this case.

> To remind the case -- it was a spin on a single variable. Order, as
> we'd expect, is interesting if we have at least two somethings. ;-) In
> more general examples it is certainly vital. Just not in this
> particular one.

There are still multiple accesses that have to be
synchronized/sequenced.

I suppose that there is an issue of time: all architectures I know do
guarantee that data written will eventually become visible elsewhere.
Not necessarily immediately, but eventually. I'm less sure about the
read -- processors DO coalesce several distinct reads into one. (This
is what creates the ordering problems with reads.) Do they do this to
reads at the same address? Or only with reads at different addresses,
but near enough that one physical access will suffice?

> > The problem is that if the data at the address read happens to
> > already be in the processor, because of an earlier read, then the
> > processor may not go to main memory in order to fetch the value.

> Well. Term 'main memory' is irrelevant for way many systems.

Use the term "memory visible to other processors", then.

> Those that define only the 'processor' and 'memory'. There 'memory'
> may have whatever guts, multiple level caches or anything, the
> processor needs not know what goes on there it just deals with memory.

I know. In one application here, we used mmap to a common file to share
memory between processes -- which works fine (at least under Solaris) if
the two processes are running under the same instance of the OS (on the
same machine, even if on different processors). I wouldn't want to try
it for processes running on different machines (with the shared file
mounted by NFS); I am pretty sure that even the rwlocks we use wouldn't
be sufficient here. Note, however, that as far as C++ (and just about
everything else) is concerned, I'm just accessing memory.

> If it asks data at an address, it gets it. Internal caches may make
> it a short or a long process, but the result shall be the same. Sparc
> V9 is defined this way. The architecture manual says nothing at all
> about caches (except this basic expectation). IA32 is also such a
> system, with a difference: the manual describes plenty of things about
> how the cache works on some models -- that is an interesting read but
> also irrelevant to our discussion.

Is it? Even a classical cache can introduce delays and modify the order
that reads and writes appear to occur in the memory shared between
several threads (running on different processors). And many modern
processors have internal memory access pipelines, which effectively act
like very small caches, and can introduce further reordering, and
suppress external memory cycles to the local cache or the shared memory.

> If the processor executes an instruction that asks a read it must do a
> read. That read may be out of order but it must be a read.

That's where I'm not so sure. A physical read from memory will not read
just a byte. It will read a much larger block -- perhaps an entire
cache line, but at least a word. Suppose I read three bytes, A, B and
C, in that order, and A and C happen to be in the same physical memory
block (word, cache line or whatever). What actually happens in a modern
processor is that the CPU issues a read command for A to the read
pipeline, marks the target register dirty, and goes on. It then does
the same thing for B, and for C. From what I understand, at least on
some processors, the hardware will recognize that the read command for
the word containing C is already in the pipeline, and will coalesce the
reads to A and C. This results in an obvious reordering of the reads;
it also means that while the instruction execution unit emitted three
read commands, only two appear on the processor bus.

Now replace the access to C with a second access to A. Are you sure
that the hardware will recognize this, and not coalesce the accesses?

> If you spin on that read instruction the processor is just not allowed
> to ignore the memory system and think it has read it once, and that
> result is good for the next billion reads.

Are you sure? How long is it allowed to defer the read?

Or more to the point, it reads once each time through the loop, but from
where? The instruction execution unit reads from a memory access
pipeline; is there some logic there to ensure that the access will go
further, even if the information is already present?

> > And if the processor doesn't go to main memory, then it will not
> > see any modifications made by another processor.

> Wrong assumption. [see proof later]

> > In practice, there is an extremely high probability that it will
> > work. The amount of memory held locally in the processor (I'm not
> > talking about cache, but about physically on the processor chip, in
> > memory interface registers, etc.) is very small, and sooner or
> > later, you will get a hardware interruption, or some cron job will
> > interrupt, your process will be interrupted, and some other process
> > will read enough other data to purge the internal contents.

> As a matter of fact that also makes the presented schema not work
> on a real system. But it must work even if the process timeslice is
> infinitely long, and no interrupts happen.

And I'm not convinced that it is guaranteed to do so on all systems.

> > > If you're to restrict yourself to stuff that is defined only
> > > within the posix itself you'll waste tremendous amount of
> > > resources.

> > Well, if you want your program to work, it's generally better if
> > you restrict yourself to things that are guaranteed somewhere.

> In my terms it's not 'better' but anything else is strongly prohibited
> :). I generally offer to sit on that locomotive on the test run if I
> said it will work, and I allow it to carry stuff.

> But that doesn't imply I must use only elements written in one
> document and ignore others.

I quite agree. It depends on the requirements specifications.

> If the target platform is "any posix" sure those rules must be
> strictly followed. If the situation is more restricted, and other
> elements documented, I can use other schemas. And if they have heavy
> impact on performance, I likely should use them, it would be a lame
> excuse to say 'well it crawls on this sparc to keep the sources
> compile and run on some alpha without change'. Guess you would not
> accept that either if you ordered the software strictly for
> sparc. ;-))

If the software which runs on the Sparc is to be made to run on an
Alpha, it will be recompiled. If I have to, I can even use different
versions of the code for the most critical parts. And if I'm developing
on a Sparc, unless there is a specific statement in the requirements
that it must also run on an Alpha, I can consider that it need only run
on a Sparc.

If the software runs on a single processor Sparc, and the user upgrades
to a multiple processor system, I won't have that possibility. Unless I
have an explicit statement in the requirements that it didn't need to
support running on a multiple processor system, I have to take multiple
processor systems into account.

This obviously depends on the context you are working in. The above
certainly holds if you are developing application software running on a
Posix system. It almost certainly doesn't hold if you are developing,
say, an ABS brake system running on a one chip microprocessor.

> Certainly where I don't know any better, or the tradeoff is negligible
> it is better to pick some easier route.

> > > > In order for the behavior to be defined, *ALL* accesses to
> > > > stopflag must be protected, normally by a mutex. In which
> > > > case, Posix does guarantee that the code works. With or
> > > > without volatile.

> > > What is exactly what I just said -- waste of resources for little
> > > reason. As that stopflag loop works correctly on most machines
> > > (including sparc V9 in RMO) as long as simple write and read
> > > access instructions are emitted by the compiler.

> > The Sparc architecture document says most explicitly that it
> > doesn't work on a Sparc v9 in RMO.

> Where does it say that? Just knowing the scope of that document you
> could think it can't say such a thing.

I think it is indirectly implicit in the fact that reads can be
reordered. The only time in fact that reads are reordered is to
coalesce external read cycles.

> [proof]
> And instead of the theoretical part take a look at the sync primitive
> examples in the document. For example the spin lock implementation at
> J.6. Do you see anything special in the spin loop? Certainly not:
> just a plain load-compare-retry spin. Using your theory it would
> never see any change in the memory content when it changes on another
> processor.

> It certainly does see the change.

> The ordering membar instruction is needed later, at the exit -- to
> make any subsequent accesses not use any previously read data. To
> access and read the flag variable -- that maps to 'stopflag' in the
> original stuff nothing at all special is needed.

I'd feel better if they had cited the rule that guaranteed that it
works; I can't find a real guarantee of it in the normative part of the
document. (I've printed out the parts concerning memory, so that I can
study them again in detail. It's quite possible that I've missed
something.)

Of course, the original code didn't have any barriers, anywhere. Whereas
the example here definitely makes it clear that at least one is
necessary.
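The shape of that J.6-style spin lock can be sketched in modern C++ (std::atomic arrived only with C++11, well after this thread; the class and names below are illustrative, not the SPARC assembly from the manual). The acquire ordering on a successful test-and-set and the release on unlock play the role of the ordering membar at exit:

```cpp
#include <atomic>
#include <thread>

// Illustrative spin lock: a plain test-and-set retry loop, with the
// ordering constraint (the membar in the SPARC J.6 example) expressed
// as acquire on lock and release on unlock.
class SpinLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        // Spin until the exchange observes "unlocked"; the successful
        // exchange carries acquire ordering.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // spin -- the load really is reissued each iteration
        }
    }
    void unlock() {
        // Release ordering: protected writes become visible before the
        // lock is seen as free (the membar-at-exit of the discussion).
        locked_.store(false, std::memory_order_release);
    }
};

SpinLock lk;
int counter = 0;   // deliberately non-atomic; protected by lk

void add_n(int n) {
    for (int i = 0; i < n; ++i) {
        lk.lock();
        ++counter;
        lk.unlock();
    }
}
```

Two threads each running add_n(100000) must leave counter at exactly 200000; with plain unsynchronized accesses the result would be unspecified.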

> > > Where it doesn't work should be an exotic system with really
> > > dangerous memory model any thread programmer should be aware
> > > anyway.

> > Most general purpose systems today use really dangerous memory
> > models.

> There was a pretty good thesis on a related topic recently linked in a
> message in this thread. That thesis stated the opposite. [I do not
> consider a RMO model dangerous, I meant some system where you must
> flush caches manually or stuff like that. ]

Then we disagree about the meaning of "dangerous". Reordering accesses
definitely doesn't correspond to the intuitive idea of how memory access
works. And given the number of people who expect volatile to help, and
the fact that volatile typically doesn't prevent reordering accesses at
the hardware level, makes me think that it is probably dangerous for
most people.

> > > What is that system? IMHO you're mistaken here confusing memory
> > > ordering issues with memory change visibility in general.

> > The two are related.

> Not the way you seem to think. Other posts suggest you think about
> cache flushing issues.

Yes and no. I'm aware that the memory generally called "cache" is
usually more or less synchronized by external, hardware means. The
order in which another processor sees changes made in my cache, and
when, may vary somewhat, but it will see them, sooner or later, if it
looks at its cache -- the modifications in my cache will eventually
invalidate the data in its cache.

Modern processors, however, access memory via an on-chip pipeline. When
the execution unit executes a store instruction, the write (even to
cache) doesn't take place immediately; an order is put into the
pipeline, and the execution unit continues. Similarly, a load
instruction will cause a command to be put into a read pipeline, and the
target register will be marked as dirty, remapped, or some such. Modern
processors treat these pipelines as a sort of first level cache. A
read which corresponds to information already in the pipeline will fetch
the data from the pipeline, and will NOT generate a read instruction on
the bus -- several successive load instructions may result in a single
read cycle on the bus.

It's more complicated than that, of course, and I don't know all of the
details myself. And most processors guarantee processor integrity -- if
a write command is in the write pipeline, a read to that address will
see the data that will ultimately be written. But the basics are there;
the fact that the execution unit has executed a load or a store doesn't
translate immediately to a read or a write cycle on the external
processor bus.

The second point, and the one I really want to stress, is the fact that
while one might expect the volatile keyword to cause the compiler to
generate whatever is necessary to guarantee the external cycles, in
practice, it doesn't. Compiler implementors have taken advantage of the
"implementation defined" aspect of the meaning of "access" to remove any
useful meaning volatile might have for multithreaded code.

> The operations that the _sequencing_ membar instruction does on SPARC.
> Those are important to deal with memory mapped io or alike devices.
> Or special OS level tasks when it reprograms the cache controller,
> remap memory, etc.

Or self-modifying code:-).

> Nothing related to the different memory models require one of those.
> All description, and the sync primitive examples use exclusively the
> _ordering_ membar instructions. And as the description starts
> "introduce an ordering in the instruction stream of a *single*
> processor". IOW they only influence the reorder buffers of the one
> processor they get executed on.

> > I've worked on those sort of systems. A long time ago, in simpler
> > times. A hand-written scheduler for a multi-processor machine is
> > NOT simple to write.

> Certainly, but I was talking about a single processor one :)

Go on, make life easy for yourself:-).

> > Of course, when you are programming at that level, you don't use
> > pthread_mutex_lock. You need to know the hardware intimately, and
> > design your code around it. But then, you won't be writing the
> > scheduler in standard C++ either.

> Yep, but the idea was to _use_ that system in C++. I just wanted to
> show a system that IMHO is in all practical terms having threads,
> though has a very different method to avoid conflicts.

OK.

I understand your point. I've worked on a lot of such systems, just not
recently. When I worked on them, I used volatile, and I wrote code like
the spin lock example which started this sub-thread. And it worked.
Guaranteed.

I may have mis-understood something, but my impression was that there
was a claim that the spin-lock example was guaranteed to work, by the
C++ standard, whereas in my experience, there are definite cases where
it won't work, or won't work reliably. It is true that those cases
involve large, complex machines, with multiple processors and advanced
OS's. The sort of machines where you probably wouldn't be tempted to use
it anyway. The fact that it doesn't work on such machines is IMHO
interesting, but mainly because the reasons why it doesn't work apply to
more realistic code as well. And I do not dispute the fact that there
are contexts where it does work, and even contexts where it is the
appropriate solution.

> > That's really all I'm claiming. The problem, or at least, my
> > impression of the problem, is that there is some confusion between
> > threads and interrupt handling.

> Possibly it's easy to create confusion. (Not if it doesn't sneak in by
> itself ;-)

> The interrupt system is really not exactly like a thread system, but
> the main issues are very similar. And rules to observe are similar too
> -- so in practical terms you do the same things to create a working
> system. The actual primitives may be different.

> > Threads are a fairly high level concept, supported
> > by the OS (be it Posix or Windows);

> The 'simple scheduler' I was trying to describe would do exactly like
> the OS you mention. And wishing it 'simple' exactly for the purpose
> we could think only of the really basic issues related to threads.
> The real-life schedulers get complicated for efficiency reasons. And
> its most complicated task is to pick the next thread to wake. While
> the issues under discussion have nothing to do with that part, we only
> expect threads to run in some theoretical time-space. :)

> > on a multiprocessor system, you have no control over which
> > processor an individual thread runs on, nor in which memory it
> > variables have been mapped. In an interrupt routine, you are a lot
> > closer to the hardware, and have a lot more control.

> Again, my intention was to completely ignore what the thread manager
> does -- the interesting part is what the thread itself does.

The problem is that it does it in a context determined by the underlying
system. When I wrote system level code, for example, I had primitives
to mask and unmask the interrupts (CLI and STI on the Intel);
presumably, these could still be used in an interrupt routine, and can
certainly be used in system software. They will cause a bus error in a
normal "thread", however.

> In the posix model you use mutex objects. When you have a critical
> section of code, you try to lock the mutex object. That will succeed
> when the way is free and block any other attempts. If it's closed, the
> thread gets suspended until sometime later. When you're finished you
> unlock the mutex -- that is supposed to wake threads blocked on
> it. Thus a shared data object controls thread execution inside the
> section.

Note that Posix also requires memory synchronization in the functions
concerning mutexes.
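That memory-synchronization requirement (POSIX lists pthread_mutex_lock/unlock among the functions that synchronize memory) is exactly what makes the earlier stopflag example correct once every access goes through the mutex, volatile or not. A sketch, using C++11's std::mutex as a stand-in for a pthread mutex (C++11 postdates this thread; the function names are illustrative):

```cpp
#include <mutex>
#include <thread>

// stopflag is a plain bool -- no volatile. Every access is made under
// the mutex, so the lock/unlock pair supplies all the visibility and
// ordering the loop needs, as POSIX guarantees for pthread mutexes.
std::mutex m;
bool stopflag = false;

void request_stop() {
    std::lock_guard<std::mutex> lock(m);
    stopflag = true;
}

bool should_stop() {
    std::lock_guard<std::mutex> lock(m);
    return stopflag;
}

void worker() {
    while (!should_stop()) {
        // stand-in for the thread's real work
    }
}
```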

> In the interrupt-driven hand made single proc threading you may simply
> block the interrupts. As the task switch happens on the interrupt, no
> switch can happen while you disabled them. So the section is well
> guarded if there is a disable-ints at start and restore-ints at the
> end. Certainly if you have them implemented, you can also use
> hand-made mutexes.

And the implementation of mutex (if you have it) certainly masks the
interrupts at one time or another:-). (I wrote a real time OS for the
8086 back in 1979, so I am aware of these issues.)

> A single thread + interrupt system is even more interesting. Here you
> can't have mutexes, only the disable-ints method. (As there's no
> scheduler, and generally no way to suspend -- discounting the case of
> suspending the main thread waiting on a flag set from the interrupt
> level).

> The common part is you must plan the shared object access. And prevent
> messing up the state due to parallel acting on the same things. The
> difference is the strategy to reach that goal.

No problem there. What are we arguing about? Presumably, too,
compilers designed to support this kind of programming will provide the
relevant primitives. The Intel 8086 compilers certainly had built-in
functions for masking and unmasking the interrupts. Had memory barriers
been necessary back then, they probably would have had support for them
as well. If the compiler had done any optimizing, I'm sure they would
have designed something to ensure that certain specified accesses didn't
get optimized away. (Of course, being young and macho at the time, I
wrote everything in assembler anyway.)

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

Balog Pal

Feb 25, 2004, 7:05:42 PM2/25/04
to
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...

> with a membar instruction, to ensure memory synchronization. The


> specific solution for Windows uses the InterlockedIncrement and
> InterlockedDecrement primitives provided by this system -- from what I
> know of the Intel architecture, I suspect that underneith them, there is
> an incr or a decr machine instruction, preceded by a lock prefix.

The "problem" with that is you rely on compiler magic knowing those
functions. And the last time I searched the dox, I didn't find any hard
evidence on that.

Now suppose the compiler treats Interlocked*** as any general function. Do
you still hold we are okay to go without volatile, and the compiler will
sure not mess up?

Here I'm in the 'why not say volatile explicitly' camp; I see nothing it
could hurt.
Also, it may be reasonable to use my own assembly routines instead of those
half-broken functions. (The result is documented as fuzzy on some win
versions.) If so, again why rely on the compiler's expected behavior? Volatile
has not much semantics, but it can be used in some situations for good.

There is a problem with false views flying around like volatile solves
certain things for threading -- those shall be banished, but no reason to
fall on the other side of the horse, stating it is generally useless for
anything.

I'd even leave it in for the situations you describe, where it is unarguably
obsolete. But that's just my personal taste. ;-) Probably I'm so used to
it that seeing something volatile means it's some kind of sync object (or I/O,
but that is another story).

> I don't think that this is where the cost of the mutex is coming from.
> To start with, locking a mutex involves switching from user mode to
> kernel mode, and a lot of other kernel related processing. That is what
> makes it expensive.

I don't think there's much processing, but the switch itself is really
expensive. Where possible the implementation tries to do the mutex mostly
in user mode -- trying to lock and only doing the switch if it's locked.

> FWIW: I recently did a benchmark of the instance function of a
> singleton, using no protection (valid if the first call to instance
> precedes the first call to pthread_create, for example), using a membar,
> and using a mutex. Practically speaking, there was no measurable
> difference between no protection and the membar instruction

If you're not in a relaxed model, membar is as good as a nop. Even if you're
in a relaxed model the penalty shall be almost nothing, the worst case being
as many memory accesses as entries in the reorder mechanism. The real case
limits that to reorders actually created by the processor.

> -- in fact,
> the "measured" value was a couple of nanoseconds: no higher than the
> noise factor in my measurement rig. With a mutex, it took around 450
> nanoseconds.

Which is pretty fat if all you want to do is a single atomic inc. What's
worse, it can be a real processor hog if you put it in the original example
watching the stopflag in the thread main loop. Bad thing to do without a
good reason.
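The cheap alternative the posters keep circling around now exists in standard form: an atomic flag whose load in the loop compiles to a plain load (plus, on weakly ordered hardware, at most a cheap barrier), rather than a lock/unlock round trip per iteration. A C++11 sketch (illustrative only; the thread under discussion predates std::atomic):

```cpp
#include <atomic>
#include <thread>

// Polling an atomic stop flag instead of taking a mutex each iteration.
// An acquire load is a single ordinary load on x86, and at most a load
// plus a cheap barrier on weakly ordered machines -- nowhere near the
// hundreds of nanoseconds of a mutex lock/unlock.
std::atomic<bool> stop_requested{false};

void worker_loop() {
    while (!stop_requested.load(std::memory_order_acquire)) {
        // stand-in for the thread's real work
    }
}
```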

> I wasn't thinking of necessarily defining it down to that level,
> although I would not oppose to a standardized mutex class (which
> wouldn't make it illegal for the implementation to provide finer grained
> protection when possible). But what I am really concerned with is
> things like constructing local static objects (thread-safe or not?), or
> even some sort of definition as to what sequence points mean when the
> program is being executed by several threads.

Yeah, those last things really ask to be defined. I'd also consider a
guarantee that the program has a single thread arriving at the start of main().

Also, explicit points in the code that prevent optimization around them.

Guaranteed state of some objects at task switch. I'd expect at least
volatile sig_atomic_t to be stable, but possibly more.

> And the last question raises a more general complaint on my part.
> According to ISO 9899:1999, §4/8: "An implementation shall be
> accompanied by a document that defines all implementation-defined and
> locale-specific characteristics and all extensions." For some reason,
> a similar sentence is missing from §1.4 in ISO 14882:1998, although we
do have §1.3.5 "implementation-defined behavior: behavior, for a
> well-formed program construct and correct data, that depends on the
> implementation and that each implementation shall document." In both
> standards, we have the fact that accessing a volatile object is
> "observable behavior", but that what constitutes an access is
> implementation defined. And this whole discussion turns more or less
> around this point: we are guaranteed an "access", but what does this
> access in turn guarantee us?

Yep, same sad story, but nobody will be too happy if we decide 'then VC is
not conforming'. There's a slight chance Herb reads this and pushes some
buttons at the appropriate department. ;-)

> And although the C standard and the C++ standard require implementation
> defined behavior to be documented, I've yet to find anything really
> relevant, for any of the compilers I know of. Does writing to a
> volatile variable mean:
>
> - A machine store instruction is generated; what happens after that is
> entirely up to the hardware, will vary from one machine to the next,
> and for some architectures, is not even guaranteed to generate a
> write cycle on the external processor bus (at least not
> immediately)?

IMO a guaranteed store instruction. Plus I would desire it as an order point
wrt not only volatiles but nonvolatiles around.

> - A write through to main memory is guaranteed. In practice, on a
> Sparc, this would imply a membar instruction somewhere in the
> generated sequence. From what I understand about IA-32, it would
> require a lock prefix on the mov instruction, but I'm not sure about
> this.

I already stated, bringing in 'main memory' as a term in the standard is a
really bad idea. (It probably wouldn't be voted for, but anyway.)

You'd need at least a #MemIssue level membar on sparc, and likely a complete
cache dump on a plenty of processors including intels.

It's like asking for a nuke strike without real reasons. Most systems define a
uniform memory system that can in fact run long without ever turning to
"main" memory.

> - Any processor, anywhere which can see the memory can see the new
> value. While this is probably what most people imagine,
> intuitively, the implications, for example in the presence of memory
> mapped files on a remote disk, are enormous.

That's worth thinking about -- IMHO it would be better to define standard
intrinsics to create that effect at explicit points -- as often you have a
bunch of volatiles together, and the need for those points is far less
frequent than the accesses to the variables.

I'm not sure of the impact on all possible systems; the automatic way might
still be a good tradeoff for the security gained -- that also shall be
evaluated.
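C++11 eventually standardized exactly such explicit points as std::atomic_thread_fence: one fence orders a whole batch of ordinary variables, instead of attaching the ordering to each individual access. A sketch (the variable names are illustrative):

```cpp
#include <atomic>
#include <thread>

// One explicit fence covers a batch of plain (non-atomic, non-volatile)
// writes: the release fence orders them before the flag store, and the
// acquire fence on the reader side pairs with it.
int payload_a = 0, payload_b = 0;          // ordinary shared data
std::atomic<bool> published{false};

void producer() {
    payload_a = 1;
    payload_b = 2;
    std::atomic_thread_fence(std::memory_order_release);  // explicit point
    published.store(true, std::memory_order_relaxed);
}

int consumer() {
    while (!published.load(std::memory_order_relaxed)) {
        // spin until the flag becomes visible
    }
    std::atomic_thread_fence(std::memory_order_acquire);  // pairs with release
    return payload_a + payload_b;
}
```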

> And IMHO, by far the most serious problem here is a lack of
> documentation. Documentation required by the C and the C++ standards.

yeah :(

> But what little documentation there is tends to be pleonastic: for its C
> compiler, Microsoft defines "Any reference to a volatile-qualified type
> is an access." Which is about as good as saying that they define an
> access to be an access. It begs the question.

Cool.

Paul

Ben Hutchings

Feb 25, 2004, 7:09:08 PM2/25/04
to
ka...@gabi-soft.fr wrote:
<snip>

> I don't think that this is where the cost of the mutex is coming from.
> To start with, locking a mutex involves switching from user mode to
> kernel mode, and a lot of other kernel related processing. That is what
> makes it expensive.
<snip>

That is not generally necessary in the non-contended case (i.e. where
there is no need to wait). It should be sufficient to perform some
kind of atomic test-and-set (usually decrement) followed by a kernel
call if and only if the result shows that the mutex was already
locked. Look up "futex" or read <http://www.opengroup.org/onlinepubs/
007904975/functions/pthread_mutexattr_init.html#tag_03_544_08_01>.
Futexes were implemented on Linux recently and Windows uses a similar
technique in its CRITICAL_SECTIONs.
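The two-phase idea can be sketched as follows: the uncontended path is a single atomic operation in user space, and only the contended path would involve the kernel. Here std::this_thread::yield() stands in for the futex wait/wake syscalls (a sketch, not the Linux or Windows implementation):

```cpp
#include <atomic>
#include <thread>

// Futex-style lock sketch: lock/unlock without contention is one atomic
// instruction, no kernel transition. Where a real futex would sleep in
// the kernel and be woken by the unlocker, this sketch merely yields.
class FastMutex {
    std::atomic<int> state_{0};            // 0 = free, 1 = locked
public:
    void lock() {
        int expected = 0;
        // Fast path: one compare-and-swap in user space.
        if (state_.compare_exchange_strong(expected, 1,
                                           std::memory_order_acquire))
            return;
        // Slow path: contended -- this is where the kernel call would go.
        while (state_.exchange(1, std::memory_order_acquire) != 0)
            std::this_thread::yield();
    }
    void unlock() {
        state_.store(0, std::memory_order_release);
        // A real futex unlock would also wake one waiter here.
    }
};

FastMutex fm;
int shared_count = 0;                      // protected by fm

void bump(int n) {
    for (int i = 0; i < n; ++i) {
        fm.lock();
        ++shared_count;
        fm.unlock();
    }
}
```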

Balog Pal

Feb 25, 2004, 7:20:57 PM2/25/04
to
"Ben Hutchings" <do-not-s...@bwsint.com> wrote in message
news:slrnc3nf7k.p3b....@shadbolt.i.decadentplace.org.uk...

> InterlockedIncrement and *all* the other interlocked functions for
> Win32, aside from SList functions, do take pointers to volatile-
> qualified types.

Do they? The ones in my Visual C are declared as:

LONG InterlockedIncrement(
LPLONG lpAddend // pointer to the variable to increment
);

and I can't pass them the address of my volatile LONG without a cast -- the
compiler rejects that. :-(

Paul

Michael Furman

Feb 26, 2004, 9:19:13 PM2/26/04
to

<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...
> "Michael Furman" <Michae...@Yahoo.com> wrote in message
> [...]

> > And by "other ways" you mean just one way: "Posix" - don't you? If so
> > - I agree.
>
> I was considering the context of threading within a process on a modern,
> general purpose machine and OS: Windows or Unix. I cited the Posix
> guarantees because those are the ones I am familiar with -- I presume
> that Windows is similar.

No, it is not similar - see my quote from MSVC documentation in the earlier
post.

>
> I was not considering embedded processors, various real-time OS's, or
> the like. My experience with such machines dates from a simpler time
> when emitting a store instruction provoked an immediate write to
> external memory, and multiprocessor systems using common shared memory
> were rare. I suspect that this is still true for some of the smaller
> embedded systems. In such cases, volatile is probably a sufficient
> solution for some problems (supposing atomicity, etc.).

No, it is not. Volatile is never (I believe) sufficient, but it is usually
necessary - unless there is something else added to the compiler that
guarantees disabling the relevant optimizations.
(Posix is an example - if the C compiler is Posix compliant and you use Posix
synchronization primitives, you do not need volatile.)

>
> > > [...]
> > > In no case have I declared anything volatile.
>
> > That is (I guess) because both protected variable (counter) and
> > protection mechanism are encapsulated on one object. If you try to
> > implement something like mutex (that can be used to protect any
> > variable) - you would have either force everything to be flushed (like
> > in POSIX) or use volatile (that I suppose to "flush" one variable) and
> > some more fine grain hardware instructions (like flush just one word -
> > do such instructions exist on modern computers?).
>
> The protection only concerns incrementing and decrementing a counter.

Yes - that is what I mean. If you make any kind of locking mechanism -
for example a mutex, but not the Posix one that disables register optimization
of all variables around the mutex lock call - you will need some other way
to tell the compiler to reread just one variable (in case its value is cached
in a register).
Of course it is not enough, because you have to say a similar thing to
the hardware. But it is necessary!


> [...]


>
> In no case have I found a case where volatile would be relevant to
> this. The guarantees I need have all be provided by the other
> mechanisms (Mutex, assembler code, special OS primitive). Conceivably,
> an Intel compiler (or a compiler for another platform, for that matter)
> could define that the ++ or -- operator on a volatile int generated the
> appropriate instruction, preceded by a lock prefix (or whatever is
> needed on the other platform). I know of no compiler which does this,
> however, and generally speaking, I think it would go beyond the
> traditional meaning of volatile (which only concerns individual
> accesses, and not a read-modify-write cycle).

Yes. But if you do not declare the variable as volatile, the "++" operator can
just increase a value in a local register, w/o any access to the memory!
A Posix mutex, as an addition to the C standard, can (must) provide this. But
how can assembler code or an OS primitive that is not related to the compiler
do this?
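The hazard being described here -- "++" compiled down to incrementing a register copy, with no memory access at all -- is also why a shared counter needs an atomic read-modify-write, not merely a forced memory access: volatile would make the load and the store touch memory, but would not make the load-increment-store sequence indivisible. In C++11 terms (a later standard than this thread; sketch only):

```cpp
#include <atomic>
#include <thread>

// fetch_add is a single indivisible read-modify-write. A volatile long
// incremented with ++ would still be three separable steps (load, add,
// store), and two threads could interleave them and lose increments.
std::atomic<long> hits{0};

void count_hits(int n) {
    for (int i = 0; i < n; ++i)
        hits.fetch_add(1, std::memory_order_relaxed);
}
```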

> However, my real point is that some additional guarantees beyond
> volatile are necessary, AND that these guarantees are sufficient; that
> once I've done whatever else is necessary, volatile is no longer needed.

The second part would be true if you add to the compiler (and Standard)
something else that will disable some specific optimizations.

> [...]

> > Here I am not sure I agree - I like present situation when "volatile"
> > takes only part of this garantee that is directly related to compiler.
> > Another part needs using special hardware features - they could be
> > different even for different implementations of the same architecture.
>
> The features may be different, but how you access them, or their
> implications in what the compiler is allowed to generate, might be part
> of the standard.
>
> > (of course, again I am thinking more about freestanding
> > implementations) - so I am afraid it would use the overkilling "flush
> > everything" method.
>
> I wasn't thinking of necessarily defining it down to that level,
> although I would not oppose to a standardized mutex class (which
> wouldn't make it illegal for the implementation to provide finer grained
> protection when possible). But what I am really concerned with is
> things like constructing local static objects (thread-safe or not?), or
> even some sort of definition as to what sequence points mean when the
> program is being executed by several threads.

I am concerned with this too, but I doubt that much could be done beyond
a more formal definition of what we de facto have today. It depends too
much on the hardware architecture - and that changes.

> [...]


> > In my understanding, "W" says that using volatile is necessary - it
> > does not say that it is enough.
>
> True. But if that is really the case, it has very wide implications.
> It means, for example, that it is pratically impossible to write a
> singleton, or for that matter, to do anything in constructors or
> destructors. (An object is NEVER volatile in the constructor or the
> destructor.)

I guess you are right and this is a problem. And I don't think it is
related only to the MT case - I think it is the same with interrupt
processing or accessing memory-mapped hardware registers.

>
> Note too that at least with VC++, all volatile does is inhibit certain
> optimizations. Given the level of optimization of this compiler,
> introducing an external function call (say to WaitForsingleObject on a
> Mutex) has exactly the same effect. How can volatile be necessary if it
> results in exactly the same code being generated? What is the real
> effect of volatile?

Where is it defined that an external function call inhibits all variable
optimization?
(Though all the compilers I am familiar with do not look at other source
files and so have to do something like that.)
And why could synchronization primitives not be inlined? With or without
"asm" inside?

>
> And the last question raises a more general complaint on my part.
> According to ISO 9899:1999, §4/8: "An implementation shall be
> accompanied by a document that defines all implementation-defined and
> locale-specific characteristics and all extensions." For some reason,
> a similar sentence is missing from §1.4 in ISO 14882:1998, although we
> do have §1.3.5 "implementation-defined behavior behavior, for a
> well-formed program construct and correct data, that depends on the
> implementation and that each implementation shall document." In both
> standards, we have the fact that accessing a volatile object is
> "observable behavior", but that what constitutes an access is
> implementation defined. And this whole discussion turns more or less
> around this point: we are guaranteed an "access", but what does this
> access in turn guarantee us?
>
> And although the C standard and the C++ standard require implementation
> defined behavior to be documented, I've yet to find anything really
> relevant, for any of the compilers I know of. Does writing to a
> volatile variable mean:
>
> - A machine store instruction is generated; what happens after that is
> entirely up to the hardware, will vary from one machine to the next,
> and for some architectures, is not even guaranteed to generate a
> write cycle on the external processor bus (at least not
> immediately)?

That is what, I believe, it de facto means on most (if not all)
implementations.
And that is what I need if I think of embedded applications.

>
> - A write through to main memory is guaranteed. In practice, on a
> Sparc, this would imply a membar instruction somewhere in the
> generated sequence. From what I understand about IA-32, it would
> require a lock prefix on the mov instruction, but I'm not sure about
> this.
>
> - Any processor, anywhere which can see the memory can see the new
> value. While this is probably what most people imagine,
> intuitively, the implications, for example in the presence of memory
> mapped files on a remote disk, are enormous.
>
> My own feeling is that the second point probably corresponds closest to
> the intent. From studing generated code, I can say that both Sun CC and
> g++, on a Sparc, implement the first. Which is their right -- that's
> what implementation-defined means. But they have to document this
> choice, and I've not found any such documentation. For any compiler.

As I have already said, I would like to have #1 alone, because it does not
add any extra overhead.

I do not understand what #2 is for. For memory-mapped registers it should
be provided by the hardware anyway. If you use Posix mutexes you need
nothing.

And #3, IMO, is not defined. "Any processor, anywhere" - but when?
I think you are assuming a common clock and infinite speed - I doubt that
is reasonable.

> [...]

1. InterlockedIncrement (or some other primitive) does not have to be an
external function.
2. The compiler does not have to flush every globally accessible variable
to memory before calling an external function.
3. VC++ volatile (like every other I know about) forces a flush/reread of
just one variable rather than of every globally accessible variable, which
can be much faster.

>
> I rather agree that the volatile is nice -- I do want to say that this
> must be visible elsewhere. But from what little I can see, volatile, at
> least with VC++, doesn't guarantee this. Or at least, on a
> multiprocessor machine, there is no guarantee that "elsewhere" includes
> a thread in the same process, running on a different processor.

To make it visible everywhere is to add an extra dependence on the hardware.
For example, different address ranges can have very different behavior:
some could be virtual and directed to disk, others non-cached or cached
with different caching policies - how would the compiler know that (having
just a pointer)?

>
> And IMHO, by far the most serious problem here is a lack of
> documentation. Documentation required by the C and the C++ standards.
> But what little documentation there is tends to be pleonastic: for its C
> compiler, Microsoft defines "Any reference to a volatile-qualified type
> is an access." Which is about as good as saying that they define an
> access to be an access. It begs the question.

IMHO, the lack of documentation in implementations (regarding "volatile")
has the same cause as the lack of a definition of volatile in the C/C++
standard.
It is just very hard (though I believe not impossible) to write a
definition that is formal and does not refer to details of the hardware
and compiler internals.


Regards,
Michael Furman

ka...@gabi-soft.fr

Feb 27, 2004, 9:58:18 AM
"Balog Pal" <pa...@lib.hu> wrote in message
news:<403c...@andromeda.datanet.hu>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...

> > with a membar instruction, to ensure memory synchronization. The
> > specific solution for Windows uses the InterlockedIncrement and
> > InterlockedDecrement primitives provided by this system -- from what
> > I know of the Intel architecture, I suspect that underneath them,
> > there is an inc or a dec machine instruction, preceded by a lock
> > prefix.

> The "problem" with that is you rely on compiler magic knowing those
> functions. And the last time I searched the dox, I didn't find any
> hard evidence on that.

On the other hand, without some compiler magic, they can't work at all.
I think I see what you are getting at, and formally, you may have a
point. Practically, I'm less sure. The code in the functions is
written in assembler, and it modifies a value in memory. The compiler
cannot move reads and stores of that value across the function.

More to the point, perhaps: in the only cases where I've used these
functions, the variable, once initialized, was only accessed by means of
these functions. I suspect that this is typical -- anything more, and
you start needing a mutex.

> Now suppose the compiler treats Interlocked*** as any general
> function. Do you still hold we are okay to go without volatile, and
> that the compiler will surely not mess up?

I think so, for the reasons stated above: the compiler either knows
about the function (compiler magic), and does the right thing, or it
doesn't, and in order to move loads and stores across the function, it
must be able to prove that such moves don't affect the semantics of the
program; that is, that the function doesn't access or change the values
stored in memory in any way. And above all, if you are accessing a
variable by means of these functions, you probably shouldn't be
accessing it otherwise anyway -- I can't think of a case offhand that
would be correct if you did.

> Here I'm in the 'why not way volatile explicitly' camp, I see nothing
> it could hurt.

It's true that adding volatile can never make a correct program
incorrect. It can confuse the reader, because he may think it is doing
something it isn't. I'm very sceptical about the "it can't hurt, so why
not" argument: either you know what you are doing, and you don't bother
with volatile unless it is necessary, or you don't, in which case,
adding more or less random volatiles isn't going to help much.

I can imagine that there are cases where you add volatile because
something in the rest of the documentation isn't clear -- you can prove
that the program must work with volatile, but you can't prove that it
works without, given the documentation. I can't think of any, but then,
I've not had to deal with Windows, or a lot of other platforms. For
Posix threading, the documentation is far from perfect, but it is
sufficient to be able to say that volatile is never relevant. All of
the other cases of multiple threads I've had to deal with have been low
level work (device drivers, etc.) on much older machines -- volatile has
been relevant (and I've used it) there, along with various compiler
"extensions" or bits of assembler code (e.g. to mask interrupts).

> Also, it may be reasonable to use my own assy routines instead those
> half-broken functions. (the result is documented as fuzzy on some win
> versions.) If so, again why rely on compiler's expected behavior.
> Volatile have not much semantics, but it can be used in some
> situations for good.

Certainly. But don't forget that whatever optimizations the compiler
might do, the result has to be a correct program, with the same
observable behavior as without optimization. In general, this means
that either the compiler knows and understands the code in your
function, or it must assume that the function uses and may modify all
accessible data. If your function is written in assembler, either the
compiler can see it and understands assembler (and thus knows the
semantics of the lock prefix), or it must not move accesses across the
call to your function.

> There is a problem with false views flying around like volatile solves
> certain things for threading -- those shall be banished, but no reason
> to fall on the other side of the horse stating it is generally useless
> for anything.

I'm not stating that it is generally useless for anything. I am stating
that it is pretty useless for code running in normal user mode under a
modern, general purpose operating system (except when applied to
sig_atomic_t, and used to synchronize signals). I've explicitly said
that if you are writing low level software -- device drivers and the
like -- or if you are working on a simpler processor, it might be useful;
it usually is, in fact, although you really should check the
documentation to be sure (supposing that you can find relevant
documentation -- with g++, VC++ and Sun CC, I've had to resort to
carefully disassembling the generated code with and without volatile to
figure out what it means).

> I'd even leave it in for the situations you describe, where it is
> unarguably obsolete.

Not at all. If you are writing a device driver, it is probable that you
will need it. And a lot of machines today, I think, use memory mapped
IO; it is certainly the case for anything from what used to be DEC, and
it is the case for Sparc. (It is not the case for the PC architecture,
nor for IBM mainframes, I think.) Memory mapped IO definitely needs
volatile. On larger, modern machines such as the Sparc, it may need a
lot more, too -- on the Sparc, it needs enough more that I'd probably
resort to external functions written in assembler. But I suspect that
this isn't true for most embedded processors: there, for memory mapped
IO if the processor uses it, and in any case for communicating between
interrupt handlers and the rest, volatile is both necessary and
sufficient.

> But that's just my personal taste. ;-) probably I'm so used to it,
> see something volatile means it's some kind of sync object (or I/O but
> that is another story).

> > I don't think that this is where the cost of the mutex is coming
> > from. To start with, locking a mutex involves switching from user
> > mode to kernal mode, and a lot of other kernal related processing.
> > That is what makes it expensive.

> I don't think there;s much processing, but the switch itself is really
> expensive. Where possible the implementation tries to do the mutex
> most in user mode -- trying to lock and only do the switch if it's
> locked.

That's what I expected. But 450 nanoseconds is a lot of time, compared
to other operations. Where is it going?

> > FWIW: I recently did a benchmark of the instance function of a
> > singleton, using no protection (valid if the first call to instance
> > precedes the first call to pthread_create, for example), using a
> > membar, and using a mutex. Practically speaking, there was no
> > measurable difference between no protection and the membar
> > instruction

> If you're not in a relaxed model, membar is as good as a nop. Even if
> you're in a relaxed model the penalty shall be almost nothing, the
> worst case being as many memory accesses as entities in the reorder
> mechanism. The real case limits that to reorders actually created by
> the processor.

I know. Given the age of the development machine I tried it on, I doubt
that I am in relaxed mode. So it doesn't surprise me too much that the
time is very, very little.

> > -- in fact, the "measured" value was a couple of nanoseconds: no
> > higher than the noise factor in my measurement rig. With a mutex,
> > it took around 450 nanoseconds.

> Which is pretty fat if all you want to do is a single atomic inc.

Quite. That's why I wrote my own atomic increment and decrement
routines, in assembler. For Sparc v9 -- as far as I can tell, there is
NO way to do so in earlier Sparcs.

> What's worse, it can be a real processor hog, if you put it in the
> original example watching the stoppflag in the thread main loop.

And a busy wait on a volatile flag isn't going to be a processor
hog:-). (Actually, the original code isn't guaranteed to ever finish
under Posix. Posix allows support of threads with real priorities: if
the wait is in a higher priority thread than the thread which should set
the variable, you have an infinite loop.)

> Bad thing to do without a good reason.

> > I wasn't thinking of necessarily defining it down to that level,
> > although I would not oppose to a standardized mutex class (which
> > wouldn't make it illegal for the implementation to provide finer
> > grained protection when possible). But what I am really concerned
> > with is things like constructing local static objects (thread-safe
> > or not?), or even some sort of definition as to what sequence points
> > mean when the program is being executed by several threads.

> Yeah those last things really ask for being defined. I'd also consider
> guarantee that the program has a single thread arriving to start of
> main().

You do NOT have that guarantee today. You can always call abort() or
exit(), or have an unhandled exception in a static initializer, and
static initializers are normally executed before reaching main.

> Also explicit points in the code that prevent optimization around.

A fence, in sum.

> Guaranteed state of some objects at task switch. I'd expect at least
> volatile sig_atomic_t to be stable, but possibly more.

But what? I've used C on an 8-bit machine, where not even accessing a
(16-bit) int was atomic.

The current standard requires that there be an atomic type:
sig_atomic_t, which must be a typedef to an integral type. It only
requires atomicity for signals, because that is the only type of
asynchronous access it supports.

Practically, there are systems in which only the three character types
are atomic. The C and the C++ standard should probably continue to
support them. Interestingly, I think that there are systems in which
character types are NOT atomic -- in which you can only read and write
words, and to write a character type, you have to read the word, modify
the relevant byte, and rewrite it. I'm not sure how to define threading
on such a system, however; a priori, you'd have to lock each time you
wrote a char.

> > And the last question raises a more general complaint on my part.
> > According to ISO 9899:1999, §4/8: "An implementation shall be
> > accompanied by a document that defines all implementation-defined
> > and locale-specific characteristics and all extensions." For some
> > reason, a similar sentence is missing from §1.4 in ISO 14882:1998,
> > although we do have §1.3.5 "implementation-defined behavior
> > behavior, for a well-formed program construct and correct data, that
> > depends on the implementation and that each implementation shall
> > document." In both standards, we have the fact that accessing a
> > volatile object is "observable behavior", but that what constitutes
> > an access is implementation defined. And this whole discussion
> > turns more or less around this point: we are guaranteed an "access",
> > but what does this access in turn guarantee us?

> Yep, same sad story, but nobody will be too happy if we decide 'then
> VC is not conforming'.

VC isn't conforming. Nor is Sun CC. Nor is g++. None of these three
are conforming with regards to this point. None are conforming with
regards to export either, and I don't think any are fully conforming
with regards to two phase look-up or some of the other more subtle
points of templates.

> There's a light chance Herb reads this and push some buttons at the
> appropriate appartment. ;-)

Ditto for someone from Sun and for the g++ development teams:-).

All three compilers have made a public commitment for conformance, while
admitting that they aren't there yet. We'll have to wait and see how
the public commitment resolves into actual deliverables.

On the other hand, this isn't exactly some new, exotic feature that the
C++ committee dreamed up just to make life hard for implementors. In
this case, the C++ standard refers explicitly to the 1990 C standard --
which was in fact ratified by ANSI in 1989 (and of course, frozen
somewhat earlier). So the implementors have had at least 15 years to
work on the problem. Can it really take more than fifteen years to
write two lines of documentation:-)?

> > And although the C standard and the C++ standard require
> > implementation defined behavior to be documented, I've yet to find
> > anything really relevant, for any of the compilers I know of. Does
> > writing to a volatile variable mean:

> > - A machine store instruction is generated; what happens after that is
> > entirely up to the hardware, will vary from one machine to the next,
> > and for some architectures, is not even guaranteed to generate a
> > write cycle on the external processor bus (at least not
> > immediately)?

> IMO a guaranteed store instruction. + I would desire it as an ordering
> point wrt not only volatiles but nonvolatiles around.

The language standards say more or less explicitly that it isn't. The
keyword volatile is attached to the type. It only affects accesses
through an expression in which the type of the lvalue being accessed is
volatile qualified.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Michael Furman

Feb 27, 2004, 1:18:37 PM

"Ben Hutchings" <do-not-s...@bwsint.com> wrote in message
news:slrnc3pj8u.okv....@shadbolt.i.decadentplace.org.uk...

> ka...@gabi-soft.fr wrote:
> <snip>
> > I don't think that this is where the cost of the mutex is coming from.
> > To start with, locking a mutex involves switching from user mode to
> > kernel mode, and a lot of other kernel related processing. That is what
> > makes it expensive.
> <snip>
>
> That is not generally necessary in the non-contended case (i.e. where
> there is no need to wait). It should be sufficient to perform some
> kind of atomic test-and-set (usually decrement) followed by a kernel
> call if and only if the result shows that the mutex was already
> locked. Look up "futex" or read <http://www.opengroup.org/onlinepubs/

No, it is not enough: if you use a mutex to protect reading some variable
that is being modified by another thread executing on another CPU under a
relaxed memory model, and if your mutex operations do nothing when there
is no contention - you will never see that the variable changed.
To see it you need:
1. The compiler: to reread the value from its memory location, in case it
is using a copy in a register. This is what "volatile" is for.
2. The CPU: to reread the value from shared memory rather than using its
cached value. Some cache-invalidating instruction must be executed for
that.

Regards,
Michael Furman

ka...@gabi-soft.fr

Feb 27, 2004, 10:37:13 PM
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c1jc8u$1j6rd2$1...@ID-122417.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...
> > "Michael Furman" <Michae...@Yahoo.com> wrote in message
> > [...]
> > > And by "other ways" you mean just one way: "Posix" - don't you? If so
> > > - I argee.

> > I was considering the context of threading within a process on a
> > modern, general purpose machine and OS: Windows or Unix. I cited
> > the Posix guarantees because those are the ones I am familiar with
> > -- I presume that Windows is similar.

> No, it is not similar - see my quote from the MSVC documentation in the
> earlier post.

You don't say where that quote comes from; it looks like just a vague,
naïve explanation of what the author thinks volatile does in C/C++, and
not a specification of the exact semantics of EnterCriticalSection or
one of the Wait... functions.

In practice, I think that Windows is very much like Posix here. To
date, I haven't been able to find the exact documentation concerning
memory synchronization, and thus I cannot determine the formal
guarantees. But in practice, the fact that mutexes and critical
sections are useless unless synchronization is part of the guarantee
(and the fact that they are practically unimplementable without
memory synchronization) is a very strong argument that this is what was
meant.

Stop and think of it for a moment. Unless thread synchronization calls
also synchronize memory, *every* access to *every* byte that can be
accessed by more than one thread must be volatile qualified. If you
construct an object that is to be seen by another thread, for example,
you cannot use the initializer construct, and you cannot use the
implicit this pointer in the constructor -- you have to write something
like:

MyClass volatile* vthis = this ;
vthis->x = initialX ;
vthis->y = initialY ;
// ...

Ditto for the destructor.

Now, are the GUI components in MFC written like this? Or do they have
the restriction that they can never, under any circumstances, be
accessed other than from the thread which created them?

> > I was not considering embedded processors, various real-time OS's,
> > or the like. My experience with such machines dates from a simpler
> > time when emitting a store instruction provoked an immediate write
> > to external memory, and multiprocessor systems using common shared
> > memory were rare. I suspect that this is still true for some of the
> > smaller embedded systems. In such cases, volatile is probably a
> > sufficient solution for some problems (supposing atomicity, etc.).

> No, it is not. Volatile is never (I believe) sufficient, but it is
> usually necessary - unless there is something else added to the
> compiler that guarantees disabling the relevant optimizations. (Posix
> is an example - if the C compiler is Posix compliant and you use Posix
> synchronization primitives, you do not need volatile.)

If the C (or C++) compiler is capable of compiling MFC components that
can be used in more than one thread, then you don't need volatile?

I agree that there is a lot of critical specification documentation
missing, or at least, that I don't know how to find it. I do know what
would be needed for the system to be usable that way: you would simply
enforce the rule that all accesses to everything that might be accessed
by another thread must be through a volatile qualified lvalue. It just
isn't practical, given e.g. that the this pointers in constructors and
in destructors are never volatile. Microsoft doesn't do it in their
examples, and from what I can see, they don't do it in their own code.

There does remain one point: Posix explicitly says that if no thread
modifies an object, all threads can access it without additional
protection or synchronization. I'm not sure that this guarantee can be
taken for granted: G++ under Posix explicitly says that they do NOT give
this guarantee for objects of library types, for example. So this means
that if you have something like:

extern char const name[] = "XYZ" ;

you must create a critical section or use a mutex in order to access it
under Windows, but not under Posix. (Note that if you use std::string
instead of char[] in the above example, you must protect access to it
with mutexes even under Posix if you are using g++. And not just
theoretically -- it really doesn't work, at least on a Sparc,
otherwise. Whereas I'd be really, really surprised if the problem with
char[] under Windows was anything but purely theoretic.)

It would be nice if someone with connections to Microsoft were reading
this thread, and could confirm or deny the actual intent.

> > > > [...]
> > > > In no case have I declared anything volatile.

> > > That is (I guess) because both protected variable (counter) and
> > > protection mechanism are encapsulated on one object. If you try to
> > > implement something like mutex (that can be used to protect any
> > > variable) - you would have either force everything to be flushed
> > > (like in POSIX) or use volatile (that I suppose to "flush" one
> > > variable) and some more fine grain hardware instructions (like
> > > flush just one word - is such instructions exist on modern
> > > computers?).

> > The protection only concerns incrementing and decrementing a counter.

> Yes - that is what I mean. If you make any kind of locking mechanism -
> for example a mutex, but not the Posix one that disables register
> optimization of all variables around the mutex lock call - you will
> need some other way to tell the compiler to reread just one variable
> (in case its value is cached in a register).

I think you have the logic slightly backwards. The compiler can
optimize if and only if it can prove that the observable behavior is not
modified. If you call an external function written in assembler, the
compiler cannot (presumably) modify that assembler; it must assume that
the assembler accesses and modifies all accessible variables, or it must
be able to analyse the assembler, and see which ones it accesses and
modifies. Even without multiple threads, if I write something like:

int i = 0 ;
f( &i ) ;
std::cout << i ;

the compiler must take into account anything that f does to i -- if it
cannot analyse and optimize f as well, it MUST write i to the memory
whose address it passes, and reread it from this address after the call.

If f is an inline function, of course, it would be a very poor compiler
which couldn't analyse it, and optimize it inline. If f is defined in
the same translation unit, a lot of compilers can still do the necessary
work. If f is defined in another translation unit, there are only a
few. If f is actually `extern "COBOL" f( int* )', I don't know of any,
but why not. Even if f is written in assembler, why not?

Of course, if f is written in assembler, and at one point, it does:

lock inc dword ptr [ebx]

(where ebx has previously been loaded with the argument) then the
compiler had better ensure that the semantics of its optimization are
the same as if this instruction had been executed -- personally, I
would have no complaints about a compiler which inlined f even if it
were written in assembler, in another module, provided it maintained the
semantics.

The only problem here involves other variables. If the compiler does
analyse down to this level, it knows enough not to put i in a register.
On the other hand, if this increment is being used as a protection
mechanism, it doesn't know enough not to put anything else in a
register. Volatile might be useful for other variables in such cases,
but I think you'd quickly run into other problems. For anything but the
simplest operations, you need to use the protection mechanisms provided
by the system.

> Of course it is not enough, because you have to say something similar
> to the hardware. But it is necessary!

In some cases. But not under Posix. And not under Windows, either, I'm
convinced.

> > [...]

> > In no case have I found a case where volatile would be relevant to
> > this. The guarantees I need have all be provided by the other
> > mechanisms (Mutex, assembler code, special OS primitive).
> > Conceivably, an Intel compiler (or a compiler for another platform,
> > for that matter) could define that the ++ or -- operator on a
> > volatile int generated the appropriate instruction, preceded by a
> > lock prefix (or whatever is needed on the other platform). I know
> > of no compiler which does this, however, and generally speaking, I
> > think it would go beyond the traditional meaning of volatile (which
> > only concerns individual accesses, and not a read-modify-write
> > cycle).

> Yes. But if you do not declare the variable as volatile, the "++"
> operator can just increase a value in a local register, without any
> access to memory!

True. Provided it can prove that no other function could access the
variable (in a single threaded context, of course).

> A Posix mutex, as an addition to the C standard, can (must) provide
> this. But how can assembler code or an OS primitive that is not related
> to the compiler do this?

Typically, because the compiler will not be able to prove the lack of
access. To prove that the function doesn't access any particular
globally accessible object, it must be able to analyse the function. If
it can analyse the function, it can see that either the function doesn't
access the variable (in which case, there is no problem), or that it
does -- the atomic increment function which does a `lock inc [ebx]'
does access the int its argument points to.

The problem concerns variables that aren't accessed by the function. If
I write `pthread_mutex_lock( &x )' or `EnterCriticalSection( &x )', it
is probable that the called function does access x, and thus, the
compiler cannot put x in a register, or whatever. In the case of the
first, I have an explicit guarantee from Posix that any Posix compliant
compiler will not put any other globally accessible object in a register
across the call, and an explicit guarantee that the function will do
whatever hardware dependent things are necessary to ensure that other
threads will see the value that I see in memory. In the case of
Windows, I've been unable to find anything concrete, one way or the
other, which is worrisome. But practically speaking, anything other than
what Posix does in this case would break most of Windows, as well as
most existing applications, so I don't see it happening. (Of course,
I'd much prefer an explicit guarantee, if I could get it.) The
important thing to note is that without these explicit calls, all bets
are off.

> > However, my real point is that some additional guarantees beyond
> > volatile are necessary, AND that these guarantees are sufficient;
> > that once I've done whatever else is necessary, volatile is no
> > longer needed.

> The second part would be true if you add to the compiler (and Standard)
> something else that will disable some specific optimization.

I don't think compiler optimizations are really the problem here, as
long as all control paths go through one of the explicit synchronizing
functions. The problem is ensuring that there are explicit instructions
generated to ensure memory synchronization. Posix guarantees this
explicitly. Windows on an IA-32 guarantees it implicitly, in that the
IA-32 architecture itself ensures sufficient synchronization always, at
a hardware level and without additional instructions. (I think -- if
you want your code to count on it, verify, because I've not actually
written code for Intel since MS-DOS 3.2.) Windows on other platforms
guarantees it very, very indirectly, in that their own library code
wouldn't work otherwise.

> > [...]

> > > Here I am not sure I agree - I like present situation when
> > > "volatile" takes only part of this garantee that is directly
> > > related to compiler. Another part needs using special hardware
> > > features - they could be different even for different
> > > implementations of the same architecture.

> > The features may be different, but how you access them, or their
> > implications in what the compiler is allowed to generate, might be
> > part of the standard.

> > > (of course, again I am thinking more about freestanding
> > > implementations) - so I am afraid it would use an overkill "flush
> > > everything" method.

> > I wasn't thinking of necessarily defining it down to that level,
> > although I would not oppose to a standardized mutex class (which
> > wouldn't make it illegal for the implementation to provide finer
> > grained protection when possible). But what I am really concerned
> > with is things like constructing local static objects (thread-safe
> > or not?), or even some sort of definition as to what sequence points
> > mean when the program is being executed by several threads.

> I am concerned with this too, but I doubt that much could be done
> except a more formal definition of what we de facto have today. It
> depends too much on the hardware architecture - and it changes.

I don't really need more than a formal definition of what we have de
facto. But I would like something that I can count on in the future.

> > [...]
> > > In my understanding, "W" says that using volatile is necessary -
> > > it does not say that it is enough.

> > True. But if that is really the case, it has very wide
> > implications. It means, for example, that it is practically
> > impossible to write a singleton, or for that matter, to do anything
> > in constructors or destructors. (An object is NEVER volatile in the
> > constructor or the destructor.)

> I guess you are right and this is a problem. And I don't think it is
> related only to the MT case - I think it is the same with interrupt
> processing
> or accessing memory mapped hardware registers.

Quite. Except that you don't construct C++ objects in IO address
space. And an interrupt is a very specific barrier -- I imagine that
the hardware could systematically provide synchronization across it
(although from what I can read, Sparc doesn't).

> > Note too that at least with VC++, all volatile does is inhibit
> > certain optimizations. Given the level of optimization of this
> > compiler, introducing an external function call (say to
> > WaitForsingleObject on a Mutex) has exactly the same effect. How
> > can volatile be necessary if it results in exactly the same code
> > being generated? What is the real effect of volatile?

> Where is it defined that an external function call inhibits all
> variable optimization?

The fact that an external function can access and modify all globally
accessible objects. If the compiler is to optimize across it, it must
be able to analyse the function, and either determine that it doesn't
actually use the object, or to optimize the called function as well in
order to use the correct values, and to put any modifications where the
caller can find them. Some compilers today are actually able to do
this, at least if the function is written in C++, but VC++ isn't one of
them. My statement above specifically targets VC++.

Note that a compiler could do more with volatile. I would be quite
pleased if, when I wrote ++i where i was volatile, the compiler
generated `lock incr i'. (It would have to document this, of course.
It is a special meaning of `access'.)

> (though all compilers I am familiar with do not look at other source
> files and have to do something like that). And why synchronization
> primitives could not be inlined? With or without "asm" inside?

If there is an asm -- the compiler must either assume that all globally
accessible objects are accessed and modified, or it must look into the
asm. Presumably, if it knows enough asm to look into it, it knows that
it cannot move accesses across a membar instruction.

More correctly, that is largely sufficient with simple processors, where
a load or a store instruction always generates an external access on the
bus, and where accesses are never reordered. It's worthless on a Sparc
with RMO or on an Alpha. And probably on a IA-64. And of limited use
on an IA-32.

> > - A write through to main memory is guaranteed. In practice, on a
> > Sparc, this would imply a membar instruction somewhere in the
> > generated sequence. From what I understand about IA-32, it
> > would require a lock prefix on the mov instruction, but I'm not
> > sure about this.

> > - Any processor, anywhere which can see the memory can see the new
> > value. While this is probably what most people imagine,
> > intuitively, the implications, for example in the presence of
> > memory mapped files on a remote disk, are enormous.

> > My own feeling is that the second point probably corresponds closest
> > to the intent. From studing generated code, I can say that both Sun
> > CC and g++, on a Sparc, implement the first. Which is their right
> > -- that's what implementation-defined means. But they have to
> > document this choice, and I've not found any such documentation.
> > For any compiler.

> As I already said, I would like to have #1 alone, because it does
> not add any extra overhead.

If you don't want extra overhead, don't use volatile:-). The advantage
of #2 is that it is useful in its own right; you don't need additional
instructions that can't be expressed in C or in C++.

> I do not understand what #2 is for. For memory mapped registers it
> should be provided by hardware anyway.

One would hope:-). The Sparc architecture manual says that it might not
be.

> If you use Posix mutexes you need nothing.

If you use Posix mutexes, you don't need volatile, so the question is
moot.

> And #3 IMO is not defined. "Any processor, anywhere", but when? I
> think you suppose the common time and infinite speed - I doubt that it
> is reasonable.

I don't think #3 is reasonable either:-). I threw it in to show just
how wide the range of possibilities could be.

> > Maybe.

Well, you can't write it in C, so it must be either an external function
(written in assembler) or some sort of compiler built-in. (The latter
case is even better, since the compiler will know that it has very
special semantics, and can adapt its optimization around them.)

> 2. compiler does not have to flush any globally accessible variable to
> memory even before calling an external function.

Only if it can trace everything that happens in the external function,
see above.

> 3. VC++ volatile (as every other I know about) forces flush/reread of
> just one variable rather than of every globally accessible variable,
> which can be much faster.

> > I rather agree that the volatile is nice -- I do want to say that
> > this must be visible elsewhere. But from what little I can see,
> > volatile, at least with VC++, doesn't guarantee this. Or at least,
> > on a multiprocessor machine, there is no guarantee that "elsewhere"
> > includes a thread in the same process, running on a different
> > processor.

> to make it visible anywhere is to add an extra dependence on the
> hardware. For example different address ranges can have very
> different behavior: some could be virtual and directed to disk, others
> non-cached or cached with different caching policies - how would the
> compiler know that (having just a pointer)?

The whole question is: what is the meaning of volatile? From what I
gather for VC++ under Windows, the answer is just about "nothing". I get
no useful guarantees that I don't already have.

> > And IMHO, by far the most serious problem here is a lack of
> > documentation. Documentation required by the C and the C++
> > standards. But what little documentation there is tends to be
> > pleonastic: for its C compiler, Microsoft defines "Any reference to
> > a volatile-qualified type is an access." Which is about as good as
> > saying that they define an access to be an access. It begs the
> > question.

> IMHO, the lack of documentation in implementations (regarding
> "volatile") has the same cause with the lack of definition of the
> volatile in the C/C++ standard. It is just very hard (though I
> believe not impossible) to write a definition that would be formal and
> not referring to details of the hardware and compiler internals.

True. But if you are documenting a single implementation, you can very
well refer to details of the hardware and the compiler internals.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

Ben Hutchings

unread,
Feb 27, 2004, 11:01:33 PM2/27/04
to
Michael Furman wrote:
<snip>
> No, it is not. Volatile is never (I believe) sufficient, but it is
> usually necessary - unless there is something else added to the
> compiler that guarantee disabling relevant optimizations.
<snip>

I really don't believe it is necessary. Any object which needs to be
> synchronised must be reachable from outside the current invocation of
the current function; otherwise, how could any other thread reach it?
So long as the compiler cannot see the body of a synchronisation
function, it cannot know whether it depends on any particular
externally reachable object, so it must be conservative and avoid
reordering access to such objects across calls to the synchronisation
function.

Ben Hutchings

unread,
Feb 27, 2004, 11:14:56 PM2/27/04
to
Michael Furman wrote:
> "Ben Hutchings" <do-not-s...@bwsint.com> wrote in message
> news:slrnc3pj8u.okv....@shadbolt.i.decadentplace.org.uk...
> > ka...@gabi-soft.fr wrote:
> > <snip>
> > > I don't think that this is where the cost of the mutex is coming from.
> > > To start with, locking a mutex involves switching from user mode to
> > > kernel mode, and a lot of other kernel related processing. That is what
> > > makes it expensive.
> > <snip>
> >
> > That is not generally necessary in the non-contended case (i.e. where
> > there is no need to wait). It should be sufficient to perform some
> > kind of atomic test-and-set (usually decrement) followed by a kernel
> > call if and only if the result shows that the mutex was already
> > locked. Look up "futex" or read <http://www.opengroup.org/onlinepubs/
>
> No, it is not enough: if you use mutex to protect reading some
> variable, that is being modified by another thread that executed on
> another CPU in relaxed memory model, and if your mutex operations do
> nothing in case there is no contention - you will never see that the
> variable changed.
<snip>

Yes, they may need a memory barrier as well of course. I thought that
was taken as read but I should have mentioned it.

Andrei Alexandrescu

unread,
Feb 28, 2004, 10:38:15 PM2/28/04
to
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...
> I can think of a number of advantages of standardizing it. And a
> potential problem: what happens in environments which don't support
> threading? Are we willing to say that an environment without threads
> cannot support C++? Or do we introduce (or rather expand) the concept
> of "optional" packages in the language?

Yes, "expand" is the word. I think it's time to get rid of the notion that
we don't have optional packages already. Given that there are systems
without a console (on which cout, cerr, and cin are all routed to
nothing) and/or without dynamic memory, yet which do have C++
compilers, I could definitely imagine a compiler that would happily
accept some future standard
synchronization primitives (and compile them down to no-ops), coming with a
standard library that always returns failure from the function that creates
threads.

Andrei

Balog Pal

unread,
Feb 29, 2004, 11:37:20 AM2/29/04
to
<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...

> > BTW (Paul) - I don't see any sign of a FENCE instruction being
> > executed during execution of InterlockedIncrement on X86 code - it
> > just does a LOCK(ed) ADD which provides atomicity only.

The lock prefix implicitly has the semantics of the strongest fence
instruction. As I said before, the fence instructions were introduced
to add lighter versions that do less, with less impact on the system
(just like the new casts in C++ :)

> > I tend to feel that allowing boost to provide standardized thread
> > mechanisms would be better than adding to the already large C++
> > standard library, if it weren't for the fact that it requires
> > volunteers to maintain it and create it, whereas standardizing it
> > would encourage professional compiler vendors to provide it.
>
> I can think of a number of advantages of standardizing it. And a
> potential problem: what happens in environments which don't support
> threading?

That's the easiest thing -- same as what happens in many current
compilers when you use the thread API but didn't use the -mt or
whatever compiler switches. You get errors on thread create, nops on
sync operations, etc. Not really hard to come up with reasonable
answers once the working version is ready.

> Are we willing to say that an environment without threads
> cannot support C++? Or do we introduce (or rather expand) the concept
> of "optional" packages in the language?

Even that is okay. Why not? But why not just follow what we use today?
What does fopen do where files do not exist?

Paul

Hendrik Schober

unread,
Feb 29, 2004, 11:37:58 AM2/29/04
to
ka...@gabi-soft.fr <ka...@gabi-soft.fr> wrote:
> [...]

> I can think of a number of advantages of standardizing it. And a
> potential problem: what happens in environments which don't support
> threading? [...]


What happens in environments that don't
support files?

Schobi

--
Spam...@gmx.de is never read
I'm Schobi at suespammers dot org

"Sometimes compilers are so much more reasonable than people."
Scott Meyers

t...@cs.ucr.edu

unread,
Feb 29, 2004, 12:37:11 PM2/29/04
to
ka...@gabi-soft.fr wrote:
[...]
+ Things like the membar instruction for Sparc, or the memory barrier
+ primitives from Microsoft, documented in the previously posted link,
+ aren't there for the fun of it. They fulfill a real need. And if
+ volatile doesn't cause the compiler to generate something similar, then
+ volatile doesn't fulfill that need.

Agreed.

Since the standard doesn't recognize multithreading, concepts like
"volatile" that are defined in the standard are not required to work
well in a multithreading environment. In particular, there is nothing
in the definition of "volatile" that prevents the compiler or CPU from
hoisting or sinking access to other variables past accesses to a
volatile variable. All that volatility does for a variable is prevent
the implementation from optimizing away accesses to that variable,
i.e., volatile variables are not guaranteed to retain their last
stored value and writes to them may be observable behavior of the
program.

Tom Payne

ka...@gabi-soft.fr

unread,
Mar 1, 2004, 7:36:31 AM3/1/04
to
Ben Hutchings <do-not-s...@bwsint.com> wrote in message
news:<slrnc3v0e2.okv....@shadbolt.i.decadentplace.org.uk>...

> Michael Furman wrote:
> <snip>
> > No, it is not. Volatile is never (I believe) sufficient, but it is
> > usually necessary - unless there is something else added to the
> > compiler that guarantee disabling relevant optimizations.
> <snip>

> I really don't believe it is necessary. Any object which needs to be
> synchronised must be reachable from outside the current invokation of
> the current function; otherwise, how could any other thread reach it?
> So long as the compiler cannot see the body of a synchronisation
> function, it cannot know whether it depends on any particular
> externally reachable object, so it must be conservative and avoid
> reordering access to such objects across calls to the synchronisation
> function.

That's a chancy "as long as" -- although far from being mainstream
technology, some modern compilers do do intermodule optimization. You
do need some sort of guarantee from the compiler.

Posix furnishes this guarantee explicitly -- if the compiler is Posix
conformant. (Of course, most Posix compilers meet the guarantee without
any extra effort, for the very reasons you mention.) I've been unable
to find much information concerning Windows, however 1) the state of
optimization in the Windows world is such that this hasn't yet become a
problem, even with the most recent compilers, and 2) not giving the
guarantee would break enough code, including examples from Microsoft,
that no vendor would dare do it.

Finally, there is the simple fact that if a compiler is intelligent
enough to analyse code in separate modules, some of which was written in
assembler (since you can't get a membar instruction or a lock prefix
otherwise), it is also intelligent enough to recognize when a function
grabs or frees a lock, and to do the right thing.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

ka...@gabi-soft.fr

unread,
Mar 1, 2004, 7:37:49 AM3/1/04
to
"Andrei Alexandrescu" <SeeWebsit...@moderncppdesign.com> wrote in
message news:<c1qi69$1ltu6p$1...@ID-14036.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...
> > I can think of a number of advantages of standardizing it. And a
> > potential problem: what happens in environments which don't support
> > threading? Are we willing to say that an environment without
> > threads cannot support C++? Or do we introduce (or rather expand)
> > the concept of "optional" packages in the language?

> Yes, "expand" is the word. I think it's time to get rid of the notion
> that we don't have optional packages already. Given that there are
> systems without a console (on which cout, cerr, and cin are all routed
> to nothing) and/or without dynamic memory, systems that do have C++
> compilers, I could definitely imagine a compiler that would happily
> accept some future standard synchronization primitives (and compile
> them down to no-ops), coming with a standard library that always
> returns failure from the function that creates threads.

Formally, there are only two different levels of conformance: hosted and
free-standing. I imagine that most systems which don't have consoles
are free-standing. But even for hosted implementations -- my work is
mainly on large scale servers, which are started as cron jobs. On Unix
based systems (but I don't see how Windows could be different), standard
in for a cron job is connected to /dev/null. Formally, the standard is
maintained -- /dev/null is just an empty file (and the standard never
says that standard in has to be interactive). Practically, of course,
it IS a difference which must be taken into account in the code.

Of course, the thing I was really thinking of, in addition to the
distinction hosted and free-standing, is things like time() (which is
defined to return (time_t)(-1) if the system doesn't support a real time
clock). For that matter, a hosted system must support ifstream (and
fopen), but nothing guarantees that there are filenames for which the
open succeeds.

The case of time() is IMHO a good example. I'd much prefer some sort of
required documentation, and probably a preprocessor definition to allow
conditional compilation, rather than a run-time result.

On the other hand, I don't want to see a fragmenting of the C++
standard, where no two implementations support exactly the same
combination of options. Any optional parts should correspond to large
subsystems, and not individual features.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16

[ See http://www.gotw.ca/resources/clcm.htm for info about ]

t...@cs.ucr.edu

unread,
Mar 2, 2004, 6:37:23 AM3/2/04
to
ka...@gabi-soft.fr wrote:
+ wa...@stoner.com (Bill Wade) wrote in message
[...]
+> On some systems, volatile bool will be atomic and synchronized. On
+> some more systems volatile sig_atomic_t will be atomic and
+> synchronized. On some systems neither does what you want. RTFM.
+
+ Moral: if you are doing anything not explicitly covered by the C++
+ standard, you have to know what guarantees the platform you are working
+ on offers.

What "will be" so is not a guarantee. AFAIK, the C and C++ standards
guarantee special properties for "sig_atomic_t" only in the case of
signal handlers and for "volatile" only in the case of monothreading.
POSIX implementations, for example, guarantee much more. I'd be
interested in knowing which standards, if any, guarantee that
"volatile bool will be atomic and synchronized" and which that
"volatile sig_atomic_t will be atomic and synchronized".

Tom Payne

Michael Furman

unread,
Mar 3, 2004, 3:52:23 PM3/3/04
to

<ka...@gabi-soft.fr> wrote in message
news:d6652001.04022...@posting.google.com...
> [...]

> > > I was considering the context of threading within a process on a
> > > modern, general purpose machine and OS: Windows or Unix. I cited
> > > the Posix guarantees because those are the ones I am familiar with
> > > -- I presume that Windows is similar.
>
> > No, it is not similar - se my quote from MSVC documentation in the
> > earlier post.
>
> You don't say where that quote comes from; it looks like just a vague,
> naïve explanation of what the author thinks volatile does in C/C++, and
> not a specification of the exact semantics of EnterCriticalSection or
> one of the Wait... functions.

It is from the manual that is actually located in the help. Yes, it is
vague - I doubt that a better definition exists.

>
> In practice, I think that Windows is very much like Posix here. To
> date, I haven't been able to find the exact documentation concerning
> memory synchronization, and thus I cannot determine the formal
> guarantees. But in practice, the fact that mutexes and critical
> sections are useless unless synchronization is part of the guarantee
> (and in fact, the fact that they are practically unimplementable without
> memory synchronization) is a very strong argument that this is what was
> meant.
>
> Stop and think of it for a moment. Unless thread synchronization calls
> also synchronize memory, *every* access to *every* byte that can be
> accessed by more than one thread must be volatile qualified. If you
> construct an object that is to be seen by another thread, for example,
> you cannot use the initializer construct, and you cannot use the
> implicit this pointer in the constructor -- you have to write something
> like:
>
> MyClass volatile* vthis = this ;
> vthis->x = initialX ;
> vthis->y = initialY ;
> // ...
>
> Ditto for the destructor.

Yes. I believe that you have to declare the whole object "volatile"
in this case.

>
> Now, are the GUI components in MFC written like this? Or do they have
> the restriction that they can never, under any circumstances, be
> accessed other than from the thread which created them?

MFC GUI does not use multithreading (it may be hard to believe, but it
is true!). That is why its components are not written in this style!

>
> > > I was not considering embedded processors, various real-time OS's,
> > > or the like. My experience with such machines dates from a simpler
> > > time when emitting a store instruction provoked an immediate write
> > > to external memory, and multiprocessor systems using common shared
> > > memory were rare. I suspect that this is still true for some of the
> > > smaller embedded systems. In such cases, volatile is probably a
> > > sufficient solution for some problems (supposing atomicity, etc.).
>
> > No, it is not. Volatile is never (I believe) sufficient, but it is
> > usually necessary - unless there is something else added to the
> > compiler that guarantee disabling relevant optimizations. (Posix is
> > an example - if C compiler is Posix compliant and you uses Posix
> > synchronization primitives, you do not need volatile).
>
> If the C (or C++) compiler is capable of compiling MFC components that
> can be used in more than one thread, then you don't need volatile?

Almost all MFC components do not use multithreading - that is why
most do not use volatile! (some do use it).

>
> I agree that there is a lot of critical specification documentation
> missing, or at least, that I don't know how to find it. I do know what
> is needed in order for the system to be useful. You simply enforce the
> rule that all accesses to everything that might be accessed by another
> thread must be through a volatile qualified lvalue; it just isn't
> practical, given e.g. that the this pointers in constructors and in
> destructors are never volatile. Microsoft doesn't do it in their
> examples, and from what I can see, they don't do it in their own code.

I actually did not see Microsoft's MT examples.
But I am not saying that Microsoft software is a perfect example.
It is just another one - closer to a fine memory coherence model
(which I am especially interested in) than the coarse model of POSIX
(flush/reread everything that potentially could change, in both
compiler and hardware, on every mutex lock/unlock call).
I am very interested in the model "flush/reread only what I need".

> [...]

Of course everything you just said is correct for standard C++. But
standard C++ does not have threads - so it is kind of not applicable.
Yes, it is possible to extend it for the MT case this way - and if I
have many CPUs that I intensively use with a relaxed memory model,
every call of an assembler function, inline assembler (and maybe more)
will cause flushing caches, saving registers, etc. Sometimes it could
generate significant overhead.

What I want is to leave a possibility to avoid this - in some extreme
cases. I will explicitly (using "asm") say what memory objects I need
to synchronize, and "volatile" seems to be what I need to prevent the
compiler from caching them. I am not worrying much about Windows!

> [...]


>
> The whole question is: what is the meaning of volatile? From what I
> gather for VC++ under Windows, the answer is just about "nothing". I get
> no useful guarantees that I don't already have.

There is no definition for C++ and threading. And though "asm" exists
in C++, I am not sure what guarantee C++ gives in the presence of
asm instructions. So we have to talk about possible extensions.
Your (POSIX-like) extension is reasonably good for traditional style
applications. But it could impose overhead in a multi-CPU relaxed
environment. So my point is it would be good to have a lower level
and more flexible variant where I can use some finer control over
it.
What we have in most compilers (I believe) is good enough:
1. inline "asm" is not treated as something that could change every
variable (am I correct here? I did not actually check it).
2. volatile disables compiler memory/register optimization.

Removing volatile and synchronizing all register-optimized variables
at the points of synchronization - maybe it is not too bad - but
the point of synchronization may be just writing to a special variable
(a memory-mapped register), rather than asm - how would the compiler
know that?


> [...]

Regards,
Michael Furman

ka...@gabi-soft.fr

unread,
Mar 4, 2004, 5:16:14 PM3/4/04
to
"Michael Furman" <Michae...@Yahoo.com> wrote in message
news:<c2597h$1q175u$1...@ID-122417.news.uni-berlin.de>...

> <ka...@gabi-soft.fr> wrote in message
> news:d6652001.04022...@posting.google.com...
> > [...]
> > > > I was considering the context of threading within a process on a
> > > > modern, general purpose machine and OS: Windows or Unix. I
> > > > cited the Posix guarantees because those are the ones I am
> > > > familiar with -- I presume that Windows is similar.

> > > No, it is not similar - se my quote from MSVC documentation in the
> > > earlier post.

> > You don't say where that quote comes from; it looks like just a
> > vague, naïve explanation of what the author thinks volatile does in
> > C/C++, and not a specification of the exact semantics of
> > EnterCriticalSection or one of the Wait... functions.

> It is from the manual that is actually located in the help. Yes, it is
> vague - I doubt that a better definition exists.

Which is a problem in itself.

In the absense of any real specifications, you more or less have to look
at what the compiler generates, and how it works on the hardware in
question. It's a poor substitute, because you really have no guarantees
for the future, but if there isn't anything else...

VC++ 6.0 under Windows NT on an IA-32 architecture actually gives you
more guarantees than Posix. Many of them, however, depend on IA-32
features which, I suspect, are only there for reasons of backward
compatibility. And won't be present in IA-64.

About the only thing you can count on is that any radical change here
will break a lot of user code (and probably a lot of Microsoft code as
well). Microsoft didn't change the for scope for fear of breaking
something; hopefully, they will show as much respect for the users here
as well.

> > In practice, I think that Windows is very much like Posix here. To
> > date, I haven't been able to find the exact documentation concerning
> > memory synchronization, and thus I cannot determine the formal
> > guarantees. But in practice, the fact that mutexes and critical
> > sections are useless unless synchronization is part of the guarantee
> > (and in fact, the fact that they are practically unimplementable
> > without memory synchronization) is a very strong argument that this
> > is what was meant.

> > Stop and think of it for a moment. Unless thread synchronization
> > calls also synchronize memory, *every* access to *every* byte that
> > can be accessed by more than one thread must be volatile qualified.
> > If you construct an object that is to be seen by another thread, for
> > example, you cannot use the initializer construct, and you cannot
> > use the implicit this pointer in the constructor -- you have to
> > write something like:

> > MyClass volatile* vthis = this ;
> > vthis->x = initialX ;
> > vthis->y = initialY ;
> > // ...

> > Ditto for the destructor.

> Yes. I believe that you have to declare the whole object "volatile" in
> this case.

It doesn't help. The volatile (like the const) only takes effect once
the constructor has finished, and ceases to take effect as soon as you
enter the destructor. There is NO way to have a this pointer to a
volatile or to a const in a constructor or in a destructor.

And standard C++ doesn't support `extern "COBOL"'. Nor using a lock
prefix in assembler code. We're way out of the area of the ISO
standard. And in the area of what is commonly considered acceptable --
we don't need the ISO standard to know that it is a compiler error for
the optimizer to change the semantics of a legal program. The assembler
parts may not be covered by the C++ standard, but they are covered by
the definition of what the machine instructions do. If I write a
function in assembler which modifies a global variable, and I call that
function from a C++ program, the C++ program had better not keep the
variable in a register across that call.

As a simple example, consider something like:

char buffer[ 5 ] ;
strcpy( buffer, "text" ) ;
::write( 1, buffer, 4 ) ;

Are you trying to say that it is acceptable for the compiler to suppress
the strcpy, since I don't access buffer anywhere in my C++ program?

> Yes, it is possible to extend it for the MT case this way - and if I
> have many CPUs that I use intensively with a relaxed memory model,
> every call of an assembler function, inline assembler (and maybe more)
> will cause flushing caches, saving registers etc. Sometimes it could
> generate significant overhead.

At least with every compiler I've ever seen, calling an assembler
function will cause flushing values to memory. It's not a question of
MT; it is a more fundamental question concerning what is and is not a
legal optimization. The C++ standard tries to formalize this for the
cases it addresses, but it is not a new issue, and the C++ standard says
nothing that is different from what has been generally acceptable
practice.

The function ::write, above, is written in assembler (at least on my
machine). I suppose that one could argue that it is a special case,
because it is defined by the Posix standard, and my C++ compiler is also
Posix compliant. But in fact, if I write a function in assembler, the
compiler *will* ensure that every variable that is potentially visible
to that function has the most up to date values when the function is
called, and that all of these values will be reread after the function
is called. I do not know of any compiler today which will analyse the
assembler in the function to determine which variables it really
accesses. Supposing a compiler did this, however, it would have to
fully understand the assembler for the target machine, and take into
account the semantics of such things as a lock prefix.

> What I want is to leave a possibility to avoid this - in some extreme
> cases. I will explicitly (using "asm") say what memory objects I need
> to synchronize, and "volatile" seems to be what I need to prevent the
> compiler from caching them. I am not worrying much about Windows!

The problem is that what volatile means is really up to the
implementation, and that current implementations do NOT synchronize
anything just because it is declared volatile. What this means in turn
depends on the application -- with IA-32, it is probably acceptable,
since this architecture ensures a strict memory ordering.

> > [...]

> > The whole question is: what is the meaning of volatile? From what I
> > gather for VC++ under Windows, the answer is just about "nothing". I
> > get no useful guarantees that I don't already have.

> There is no definition for C++ and threading. And though "asm" exists
> in C++ I am not sure what guarantee C++ gives in the presence of asm
> instructions. So we have to talk about possible extensions. Your
> (POSIX-like) extension is reasonably good for traditional-style
> applications. But it could impose overhead in a multi-CPU relaxed
> environment. So my point is it would be good to have a lower-level and
> more flexible variant where I can use some finer control over it.

I wouldn't disagree there. I think that there are a lot of things that
would be good. The important point -- my only point, in fact, is that
the definition of volatile depends on the implementation, and usually,
at least, it doesn't do anything useful in this context. My gut feeling
is that the original intent of volatile was that it should, but I'm not
sure. When volatile was added, it is certain that the authors didn't
have these kinds of problems in mind, since at that time, memory access
was a lot simpler.

> What we have in most compilers (I believe) is good enough:
> 1. inline "asm" is not treated as a thing that could change every
> variable (am I correct here? I did not actually check it).

I think in most compilers, it is treated as if it could change every
variable it could access. I've not used much inline asm, except for
"membar"; I have verified that even with maximum optimization, using Sun
CC 5.1, the compiler does not move reads and writes across that
particular bit of asm. (On the other hand, the context was such that
I'm not sure it would have moved them anyway.)

Most of the time, when I need assembler, I will write a separate
function with it. And I've never seen a compiler yet which will not
assume that the function will access and modify every variable which it
could possibly access. Local variables whose address is not taken may
stay in registers, but everything else is flushed to memory, and reread
after the function.

> 2. volatile disable compiler memory/register optimization.

> Removing volatile and synchronizing all register-optimized variables
> at the points of synchronization - maybe it is not too bad - but the
> point of synchronization may be just writing to a special variable
> (a memory-mapped register), rather than asm - how would the compiler
> know it?

Volatile typically disables compiler memory/register optimizations. But
the implementations I've seen do nothing to prevent optimizations
(reordering, suppression of some reads, etc.) made by the hardware,
automatically, on the fly. Partially, at least, because the basic
framework of most compilers goes back to a time when such hardware
optimizations didn't exist, and there was nothing to prevent: when the
compiler generated a store instruction, there was a write cycle on the
global memory bus, and when it generated a load, there was a read cycle.

Times have changed, and neither the standards, nor the implementation
defined parts of the compilers, have caught up.

--
James Kanze GABI Software mailto:ka...@gabi-soft.fr
Conseils en informatique orientée objet/ http://www.gabi-soft.fr
Beratung in objektorientierter Datenverarbeitung
11 rue de Rambouillet, 78460 Chevreuse, France, +33 (0)1 30 23 45 16


PeteK

unread,
Mar 5, 2004, 6:07:53 AM3/5/04
to
OK,

Here's a practical example of why you might need volatile.

The instruction set of a mainframe system I used to work on contains
two instructions

INCT (increment and test)
TDEC (test and decrement)

that issue a "clear slaves" instruction to all processors in order to
synchronise memory. Now consider something like this:

static int Flag;

void SetFlag()
{
    asm
    {
        INCT Flag;
    }
}

inline void DoSomething()
{
    ...
}

void Go()
{
    Flag = 0;

    while( Flag == 0 )
    {
        DoSomething();
    }
}


Now if we call Go on thread 1 we can, at some future time, stop the
loop by calling SetFlag on thread 2. Since the INCT instruction
ensures that thread 1 will see the change to Flag (because of the
clear slaves instruction) everything is OK.

Except that it's not!

Without volatile what's to stop the compiler optimising Go into the
rough equivalent of:

loop: call DoSomething // this function can be inlined
goto loop

OTOH if we declare Flag as a volatile int the compiler must re-read
the value each time round the loop.

Just because you can write posix-compliant code that dispenses with
the need for volatile doesn't mean that volatile is of no use in
thread-based programming.

PeteK
