
Why volatile may make sense for parallel code today.


Bonita Montero

Nov 22, 2023, 11:35:19 AM
#include <Windows.h>
#include <thread>

using namespace std;

int main()
{
    constexpr size_t ROUNDS = 1'000'000;
    size_t volatile r = 1'000'000;
    jthread thr( [&]()
    {
        while( r )
            SleepEx( INFINITE, TRUE );
    } );
    for( size_t r = ROUNDS; r--; )
        QueueUserAPC( (PAPCFUNC)[]( auto p ) { --*(size_t*)p; },
                      thr.native_handle(), (ULONG_PTR)&r );
}

Chris M. Thomasson

Nov 22, 2023, 4:22:39 PM
std::atomic<size_t> r


red floyd

Nov 22, 2023, 7:17:54 PM
I'm confused. Does std::atomic imply "do not optimize access to this
variable"? Because if it doesn't, then I can see how the "while (r)"
loop can just spin.


Chris M. Thomasson

Nov 23, 2023, 12:05:26 AM
std::atomic should honor a read when you read it, even with
std::memory_order_relaxed. If not, imvvvvhhooo, it's broken?

Chris M. Thomasson

Nov 23, 2023, 12:06:37 AM
Afaict, std::atomic should imply volatile? Right? If not, please correct me!

Bonita Montero

Nov 23, 2023, 2:08:19 AM
Am 22.11.2023 um 22:22 schrieb Chris M. Thomasson:

> std::atomic<size_t> r

The trick with my code is that the APC function is executed in the same
thread context as the function repeatedly probing r as an end-indicator,
so I don't need atomic here. You should have known better.

Kaz Kylheku

Nov 23, 2023, 3:26:51 AM
On 2023-11-23, Bonita Montero <Bonita....@gmail.com> wrote:
> Am 22.11.2023 um 22:22 schrieb Chris M. Thomasson:
>
>> std::atomic<size_t> r
>
> The trick with my code is that the APC function is executed in the same
> thread context as the function rpeatedly probing r as an end-indicator,

So what is "parallel" doing in your subject line?

--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazi...@mstdn.ca
NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.

Bonita Montero

Nov 23, 2023, 3:31:27 AM
Am 23.11.2023 um 09:26 schrieb Kaz Kylheku:
> On 2023-11-23, Bonita Montero <Bonita....@gmail.com> wrote:
>> Am 22.11.2023 um 22:22 schrieb Chris M. Thomasson:
>>
>>> std::atomic<size_t> r
>>
>> The trick with my code is that the APC function is executed in the same
>> thread context as the function rpeatedly probing r as an end-indicator,
>
> So what is "parallel" doing your subject line?

The parallel part is injecting the APC with QueueUserAPC().
Interestingly, there's some framework code calling the function object
I use for the thread that brings the thread into an alertable state,
so the loop never loops because r is already zero.

David Brown

Nov 23, 2023, 4:03:18 AM
You will probably find that compilers in practice will re-read "r" each
round of the loop, regardless of the memory order. I am not convinced
this would be required for "relaxed", but compilers generally do not
optimise atomics as much as they are allowed to. They are, as far as I
have seen in my far from comprehensive testing, treated as though
"atomic" implied "volatile".

But as far as I can tell from the C and C++ standards, "atomic" does not
imply "volatile". There are situations where atomics can be "optimised"
- re-ordered with respect to other code, or simplified - while volatile
atomics cannot. I can see no reason why adjacent non-volatile relaxed
atomic reads of the same object cannot be combined, even if separated by
other code (with no volatile or atomic accesses). The same goes for
writes. If you have:

std::atomic<int> ax = 100;

...

ax = 1;
ax += 2;
ax = ax * ax;

then you are guaranteed that any other thread reading "ax" will see
either the old value (100, if it was not changed), or the final value of
9. It /might/ also see values of 1 or 3 along the way, but there is no
requirement for the code to produce these intermediate values or for
them to be visible to other threads.

At least, that is how I interpret things. And I believe the fact that
the C and C++ standards make a distinction between atomics and volatile
atomics indicates that the standard authors do not see "atomic" as
implying the semantics of "volatile" - even if compiler writers choose
to act that way.


I personally think it was a terrible mistake to mix sequencing and
ordering with atomics when multi-threading was introduced to the C and
C++ standards. Atomics would have been simpler, more efficient, and
consistent with their naming if their semantics had not included any
kind of synchronisation. Synchronisation and ordering are a very
different concept from atomic access, and should be covered differently
(by fences of various sorts).

Bonita Montero

Nov 23, 2023, 4:35:42 AM
Am 23.11.2023 um 10:02 schrieb David Brown:

> I am not convinced ....
> ... but compilers generally do not optimise atomics as much as they are allowed to.

Is there some kind of contradiction?


Bonita Montero

Nov 23, 2023, 9:39:13 AM
Am 23.11.2023 um 09:31 schrieb Bonita Montero:

> Interestingly there's some framework-code calling the function object
> which I use for the thread code that brings the thread into an alertable
> state so that the loop never loops because r is already zero.

It's definitely not framework code: I've checked with a Win32 thread
created through CreateThread(), and my APCs are consumed before the
thread's main function runs. Really strange.

Richard Damon

Nov 23, 2023, 11:07:50 AM
My understanding is that std::atomic needs to honor a read in the sense
that it will get the most recent value that has happened "before" the
read (as determined by memory order).

So, if nothing in the loop can establish a time order with respect to
other threads, then the compiler should be allowed to optimize out the
read. SleepEx could (and should) establish time order, so the
compiler can't, in this case, optimize the read away.

Bonita Montero

Nov 23, 2023, 11:20:44 AM
Am 23.11.2023 um 17:07 schrieb Richard Damon:

> My understanding is that std:atomic needs to honor a read in the sense
> that it wlll get the most recent value that has happened "before" the
> read (as determined by memory order).

... and the read is atomic - even if the trivial object is 1kB in size.

> So, if nothing in the loop can establish a time order with respect to
> other threads, then it should be allowed for the compiler to optimize
> out the read. ...

An atomic doesn't cache repeated reads. The memory-consistency
parameter is just for the ordering of other reads and writes.



Richard Damon

Nov 23, 2023, 12:51:03 PM
On 11/23/23 11:20 AM, Bonita Montero wrote:
> Am 23.11.2023 um 17:07 schrieb Richard Damon:
>
>> My understanding is that std:atomic needs to honor a read in the sense
>> that it wlll get the most recent value that has happened "before" the
>> read (as determined by memory order).
>
> ... and the read is atomic - even if the trivial object is 1kB in size.

Yes, which has nothing to do with the question.

>
>> So, if nothing in the loop can establish a time order with respect to
>> other threads, then it should be allowed for the compiler to optimize
>> out the read. ...
>
> An atomic doesn't cache repeatable reads. The order memory-consistency
> parameter is just for the ordering of other reads and writes.
>

Yes, the atomic itself doesn't "cache" the data, but as far as I read,
there is no requirement to refetch the data if the code still has the
old value around, and it hasn't been invalidated by possible memory
ordering.

If there can't be a write "before" the second read that wasn't also
"after" the first read, then there is no requirement to refetch the
data. In relaxed memory orders, just being physically before isn't
enough to be "before"; you need some explicit "barrier" to establish it.

I will admit this isn't an area I consider myself an expert in, but I
find no words that prohibit the optimization. The implementation does
need to consider possible action by other "threads", but only as far as
constrained by memory order, so two reads in the same ordering "slot"
are not forced.

Scott Lurndal

Nov 23, 2023, 2:55:45 PM
Linux tends to apply the volatile qualifier on the access, rather
than the definition.

#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

while (ACCESS_ONCE(r)) {
}

Makes it rather obvious when reading the code what the intent is,
and it won't be affected if someone accidentally removes the
volatile qualifier from the declaration of r.

Works just fine in c++, too.

Scott Lurndal

Nov 23, 2023, 2:56:44 PM
Bonita Montero <Bonita....@gmail.com> writes:
>Am 22.11.2023 um 22:22 schrieb Chris M. Thomasson:
>
>> std::atomic<size_t> r
>
>The trick with my code

That's enough to fail a job interview....

Kaz Kylheku

Nov 23, 2023, 3:32:44 PM
On 2023-11-22, Bonita Montero <Bonita....@gmail.com> wrote:
> #include <Windows.h>
> #include <thread>
>
> using namespace std;
>
> int main()
> {
> constexpr size_t ROUNDS = 1'000'000;
> size_t volatile r = 1'000'000;
> jthread thr( [&]()
> {
> while( r )
> SleepEx( INFINITE, TRUE );
> } );
> for( size_t r = ROUNDS; r--; )

This shadows the r variable. Did you mean "for (size_t i = ROUNDS; i--;)"?

> QueueUserAPC( (PAPCFUNC)[]( auto p ) { --*(size_t*)p; },
> thr.native_handle(), (ULONG_PTR)&r );

Thus, this takes the address of the for loop's r variable, not the volatile one
that the thread is accessing. Is that what you wanted?

BTW, is the C++ lambda too broken to access r via lexical scoping?
Why can't the APC just do "--r"?

I believe local functions in Pascal from 1971 can do this.

Bonita Montero

Nov 23, 2023, 11:26:21 PM
... with a nerd like you.

Bonita Montero

Nov 23, 2023, 11:27:28 PM
Am 23.11.2023 um 21:32 schrieb Kaz Kylheku:
> On 2023-11-22, Bonita Montero <Bonita....@gmail.com> wrote:
>> #include <Windows.h>
>> #include <thread>
>>
>> using namespace std;
>>
>> int main()
>> {
>> constexpr size_t ROUNDS = 1'000'000;
>> size_t volatile r = 1'000'000;
>> jthread thr( [&]()
>> {
>> while( r )
>> SleepEx( INFINITE, TRUE );
>> } );
>> for( size_t r = ROUNDS; r--; )
>
> This shadows the r variable. Did you mean "for (size_t i = ROUNDS; i--)"?
>
>> QueueUserAPC( (PAPCFUNC)[]( auto p ) { --*(size_t*)p; },
>> thr.native_handle(), (ULONG_PTR)&r );
>
> Thus, this takes the address of the for loop's r variable, not the volatile one
> that the thread is accessing. Is that what you wanted?
>
> BTW, is the C++ lambda too broken to access the r via lexical scoping?
> Why can't the APC just do "--r".
>
> I believe local functions in Pascal from 1971 can do this.
>

I already corrected that in my code, and I guessed no one here would
notice; usually I'm right about that.

Chris M. Thomasson

Nov 24, 2023, 12:47:16 AM
Wow, no shit Scott. Yikes!

Chris M. Thomasson

Nov 24, 2023, 12:53:40 AM
Huh? What does that even mean? Really, humm... ;^o

Chris M. Thomasson

Nov 24, 2023, 12:55:24 AM
Usually, wrong, or always right? humm...

Chris M. Thomasson

Nov 24, 2023, 12:59:28 AM
On 11/23/2023 8:26 PM, Bonita Montero wrote:
Do you secretly like nerds? https://youtu.be/7dP1Vp1E-bo

lol!

Chris M. Thomasson

Nov 24, 2023, 1:05:52 AM
On 11/23/2023 8:20 AM, Bonita Montero wrote:
> Am 23.11.2023 um 17:07 schrieb Richard Damon:
>
>> My understanding is that std:atomic needs to honor a read in the sense
>> that it wlll get the most recent value that has happened "before" the
>> read (as determined by memory order).
>
> ... and the read is atomic - even if the trivial object is 1kB in size.

humm.. Say, the read is from a word in memory. Define your trivial
object: POD, L2 cache line sized, and aligned on an L2 cache line
boundary? Are you referring to how a certain arch works?

Chris M. Thomasson

Nov 24, 2023, 1:53:37 AM
On 11/23/2023 10:05 PM, Chris M. Thomasson wrote:
> On 11/23/2023 8:20 AM, Bonita Montero wrote:
>> Am 23.11.2023 um 17:07 schrieb Richard Damon:
>>
>>> My understanding is that std:atomic needs to honor a read in the
>>> sense that it wlll get the most recent value that has happened
>>> "before" the read (as determined by memory order).
>>
>> ... and the read is atomic - even if the trivial object is 1kB in size.
>
> humm.. Say, the read is from a word in memory. Define your trivial
> object, POD, l2 cache line sized, and aligned on a l2 cache line
> boundary? Are you refering to how certain arch works?

How many words in your cache lines, say l2?

Bonita Montero

Nov 24, 2023, 2:53:53 AM
Am 24.11.2023 um 07:05 schrieb Chris M. Thomasson:

> humm.. Say, the read is from a word in memory. Define your trivial
> object, POD, l2 cache line sized, and aligned on a l2 cache line
> boundary? Are you refering to how certain arch works?

Read that:
https://stackoverflow.com/questions/61329240/what-is-the-difference-between-trivial-and-non-trivial-objects

Bonita Montero

Nov 24, 2023, 2:57:41 AM
Am 23.11.2023 um 18:50 schrieb Richard Damon:

> Yes, the atomic itself doesn't "cache" the data, but as far as I read,
> there is no requirement to refetch the data if the code still has the
> old value around, and it hasn't been invalidated by possible memory
> ordering.

I don't believe that; think about an atomic flag that is periodically
polled. The compiler shouldn't cache that value.


Chris M. Thomasson

Nov 24, 2023, 3:16:21 AM
std::atomic is going to work for such a flag. Depending on your setup,
it should be using std::memory_order_relaxed for the polling.

Bonita Montero

Nov 24, 2023, 3:30:18 AM
Am 24.11.2023 um 09:16 schrieb Chris M. Thomasson:

> std::atomic is going to work for such a flag. Depending on your
> setup, it should be using std::memory_order_relaxed for the polling.

There's also atomic_flag, but it has enough limitations compared to
atomic_bool that I've never used it. You can set it only in conjunction
with an atomic read, and I never had a use for that. And that relies on
an atomic exchange, which costs a lot more than just a byte write.

Chris M. Thomasson

Nov 24, 2023, 4:05:09 AM
Fwiw, this flag should be aligned on an L2 cache line boundary and
padded up to an L2 cache line size.

Chris M. Thomasson

Nov 24, 2023, 4:05:57 AM
You can stuff a cache line with words, as long as you do not straddle a
cache line boundary... YIKES!

Chris M. Thomasson

Nov 24, 2023, 4:07:00 AM
I know. Btw, what the hell happened to std::is_pod? ;^)

David Brown

Nov 24, 2023, 4:08:48 AM
That is often my preference too, since it is the access that is
"volatile" - a "volatile object" is simply one for which all accesses
are "volatile".

For the pedants, it might be worth noting that the "cast to pointer to
volatile" technique of ACCESS_ONCE is not actually guaranteed to be
treated as a volatile access in C until C17/C18 when the wording was
changed to talk about accesses via "volatile lvalues" rather than
accesses to objects declared as volatile. (When the topic was discussed
by the committee, everyone agreed that all known compiler vendors
treated "cast to pointer to volatile" accesses as volatile, so the
change was a formality rather than any practical difference.) I don't
know if and when this change was added to C++.


Bonita Montero

Nov 24, 2023, 4:10:23 AM
PODs are also trivial but go beyond that, since you can copy them
with memcpy().

Chris M. Thomasson

Nov 24, 2023, 4:14:17 AM
On 11/23/2023 8:20 AM, Bonita Montero wrote:
> Am 23.11.2023 um 17:07 schrieb Richard Damon:
>
>> My understanding is that std:atomic needs to honor a read in the sense
>> that it wlll get the most recent value that has happened "before" the
>> read (as determined by memory order).
>
> ... and the read is atomic - even if the trivial object is 1kB in size.

How is that read atomic with 1kB of data? On what arch?

David Brown

Nov 24, 2023, 4:23:12 AM
That is exactly how I see it (I also do not consider myself an expert in
this area). I cannot see any requirement in the description of the
execution, covering sequencing, ordering, "happens before", and all the
rest, that suggests that the number of atomic accesses, or their order
amongst each other, or their order with respect to volatile accesses or
non-volatile accesses, is forced to follow the source code except where
the atomics have specific sequencing. Atomic accesses are not
"volatile" - they are not, in themselves, "observable behaviour".

Because the sequencing requirements for atomics depend partly on
things happening in other threads, compilers are much more limited in
how they can re-order or otherwise optimise atomic accesses than they
are for normal accesses (unless the compiler knows all about the other
threads too!). Compilers must be pessimistic about optimisation. But
for certain simple cases, such as multiple neighbouring atomic reads of
the same address or multiple neighbouring writes to the same address, I
can't see any reason why they cannot be combined.

(Again, I am not an expert here - and I will be happy to be corrected.
They say the best way to learn something on the internet is not by
asking questions, but by writing something that is wrong!)

David Brown

Nov 24, 2023, 4:36:05 AM
On 24/11/2023 05:27, Bonita Montero wrote:

> I already corrected that with my code and I guessed no one will notice
> that here; usually I'm right with that.
>

You really believe that?

I think one of the (many) reasons people don't take you seriously is
that you never check your work. You invariably post code that is badly
wrong, followed by multiple replies to yourself making corrections and
improvements. Every time you claim your code is bug-free, we know you
will follow up shortly with a bug fix. Every time you claim it is
"perfect", we know that you will follow it with an "improved" version
("perfect" and "improved" being in your opinion only).

Yes, people have noticed. Yes, people will continue to notice.


It's nice that you post code, however, as it can start some interesting
discussions - before descending into a pantomime farce. But it might
make things a little better if you bothered to re-read your code before
posting, or even try testing it.

Kaz Kylheku

Nov 24, 2023, 1:27:54 PM
Anyway, this APC mechanism is quite similar to signal handling.
Particularly asynchronous signal handling. Just like POSIX signals, it
makes the execution abruptly call an unrelated function and then resume
at the interrupted point.

The main difference is that the signal has a number, which selects a
registered handler, rather than specifying a function directly.

Why I bring this up is that ISO C (since 1990, I think), has specified a
use of a "volatile sig_atomic_t" type in regard to asynchronous signal
handlers. (Look it up.)

The use of volatile with interrupt-like mechanisms is nothing new.

Chris M. Thomasson

Nov 24, 2023, 6:33:12 PM
On 11/24/2023 1:14 AM, Chris M. Thomasson wrote:
> On 11/23/2023 8:20 AM, Bonita Montero wrote:
>> Am 23.11.2023 um 17:07 schrieb Richard Damon:
>>
>>> My understanding is that std:atomic needs to honor a read in the
>>> sense that it wlll get the most recent value that has happened
>>> "before" the read (as determined by memory order).
>>
>> ... and the read is atomic - even if the trivial object is 1kB in size.
>
> How is that read atomic with 1kb of data? On what arch?

Unless you atomically read a pointer that points to 1kB of memory.

Chris M. Thomasson

Nov 24, 2023, 6:37:40 PM
After reading this, for some reason I am now thinking about signal-safe
sync primitives in POSIX. Fwiw, certain pure lock-free/wait-free
algorithms are okay in signal handlers.

Bonita Montero

Nov 25, 2023, 7:11:26 AM
Am 24.11.2023 um 19:27 schrieb Kaz Kylheku:

> Anyway, this APC mechanism is quite similar to signal handling.
> Particularly asynchronous signal handling. Just like POSIX signals,
> it makes the execution abruptly call an unrelated function and then
> resume at the interrupted point.

I don't think so, because APCs can only interrupt threads in an
alertable state. Signals can interrupt nearly any code, and they have
implications for the compiler's ABI through defining the size of the
red zone. So compared to signals, APCs are rather clean. Nevertheless
you can do a lot of interesting things with signals, as reported lately
when I was informed that mutexes in glibc rely on signals; I guess it's
the same when a thread waits for a condition variable and a mutex at once.

> The main difference is that the signal has a number, which selects
> a registered handler, rather than specifying a function directly.

The ugly thing with synchronous signals is that the signal handler is
global for all threads. You can chain them, but the next signal
handler in the chain may be in a shared object that has already been
unloaded. I think this should be corrected by making synchronous
signals' handlers thread-specific.

> The use of volatile with interrupt-like mechanisms is nothing new.

I think this pattern doesn't happen very often since it's rare that
a signal shares state with the interrupted code.

Chris M. Thomasson

Nov 25, 2023, 5:46:37 PM
You might be interested in how pthreads-win32 handles async thread
cancellation. Iirc, it uses a kernel module. It's hackish, but interesting.

https://sourceware.org/pthreads-win32/

Marcel Mueller

Dec 1, 2023, 2:50:16 AM
Am 23.11.23 um 06:06 schrieb Chris M. Thomasson:
> Afaict, std::atomic should imply volatile? Right? If not, please correct
> me!

In practice yes. But is this required by the standard? I could not find
any hint. Strictly speaking it is still required.

In fact, memory ordering does not guarantee any particular time at
which a change becomes visible to another thread. So there is always
some delay. But could it be infinite? Could the compiler cache anything
if the code generates no other memory access subject to the memory
barrier? This applies to both reads and writes.
But I think it is almost impossible to write any reasonable code that
causes no other memory access forcing the atomic value to be read or
written.


Marcel