
gcc libatomic "sample library"


David Brown

Feb 11, 2021, 6:17:48 AM
(This came from the "The weirdest compiler bug" thread in c.l.c++, but I
think it makes more sense as a thread in itself, and is equally relevant
to C and C++. Although it is gcc-specific, the principles are general
for all C11 and C++11 tools.)


Like most compilers, gcc ships with some "language support" libraries -
not the actual C or C++ runtime libraries, but code to support parts of
the language. Here I have been looking at gcc's code for supporting
atomic accesses. I'd like to get thoughts and opinions about the
correctness of the implementation. I'm particularly interested in the
targets I use myself, but I believe the problems affect any target.


From the gcc Wiki page <https://gcc.gnu.org/wiki/Atomic/GCCMM> there is
a link to a "sample libatomic", which was perhaps just meant as a simple
starting point. But it is shipped with gcc now, AFAICS. (It is
certainly shipped with the "gnu arm embedded" toolchain, but that
includes many things besides the compiler. If it turns out that it is
that toolchain packaging that is the problem rather than gcc, I'll
happily move my complaining to them!)

The "sample" libatomic library is here:

<https://gcc.gnu.org/wiki/Atomic/GCCMM?action=AttachFile&do=view&target=libatomic.c>

Basically, operations are done "lock-free" (using __atomic_ builtins) if
the size has __atomic_always_lock_free. Otherwise a simple user-mode
busy waiting spin lock is used, picked from an array of 16 locks using a
hash on the address.
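
In outline, the fallback path looks something like this (a rough sketch
of the scheme only, not the actual libatomic source - the hash and the
lock type here are purely illustrative):

#include <atomic>
#include <cstdint>
#include <cstring>

static std::atomic_flag locks[16] = {};   // user-mode spin locks

// Pick one of the 16 locks by hashing the object's address.
static std::atomic_flag * lock_for(const void * ptr) {
    std::uintptr_t h = reinterpret_cast<std::uintptr_t>(ptr) >> 4;
    return &locks[h % 16];
}

// Generic (non-lock-free) atomic load: busy-wait, copy, release.
static void locked_load(std::size_t size, const void * src, void * dest) {
    std::atomic_flag * l = lock_for(src);
    while (l->test_and_set(std::memory_order_acquire)) { /* spin */ }
    std::memcpy(dest, src, size);
    l->clear(std::memory_order_release);
}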

In typical use on a modern multi-core system, you are going to be making
sure you use sizes that are lock-free in most cases (when you can get
64-bit, and perhaps 128-bit, that's normally enough). If you have a
type that is bigger, this library will use the locks. The likelihood of
there being contention on the locks is very low in practice,
especially with multiple locks selected by address hashing. And /if/ there
is contention, it means the losing thread will busy-wait while the other
thread finishes its work. Since the other thread is likely to be on a
different core, and the actual operation is fast, a short busy-wait is
not a problem. The only time when it would be a long delay is if one
thread took the lock, then got pre-empted before releasing it. But with
multiple cores, typical big OS "almost everything at the same level"
thread priorities, and a scheduler that boosts low priority threads that
have been stuck for a while, you can be confident that sooner or later -
usually sooner - the locking thread will run and release the lock.


Smaller embedded systems are a different world. As an example (because
it is the processor I am using), consider a Cortex-M processor and an
RTOS (real-time operating system). In this kind of system, you have a
single cpu core, and you have a strict hierarchy in thread priorities.
The highest priority runnable thread is always running (with interrupts
being somewhat akin to even higher priority threads). When you use
"real" locks - mutexes supplied by the OS - there are priority boosting
mechanisms to avoid deadlocks or blocks when a high priority thread
wants a lock that is held by a low priority thread.

But with spin locks, that doesn't work. So if a low priority thread
takes a lock, and then it is pre-empted (such as by an external
interrupt from a device) and a high priority thread or interrupt routine
wants the lock, that high priority thread will spin forever. It will
never give the low priority thread the chance to run and release the lock.

All this means that if you use atomics in your embedded system,
everything looks fine and all your testing will likely succeed. But
there is always the chance of a deadlock happening.


The efficient way to do general atomic operations (ones that can't be
handled by a single instruction or a restartable load/store exclusive
loop) on single-core devices is to disable interrupts temporarily. It
won't help for multi-core systems - there the locking must be done with
OS help.
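
On a single-core Cortex-M, for instance, that can be as little as
saving and restoring PRIMASK around the access. A minimal sketch using
gcc inline assembly (the equivalent intrinsics vary between toolchains):

#include <cstdint>
#include <cstring>

static inline std::uint32_t irq_save(void) {
    std::uint32_t primask;
    asm volatile ("mrs %0, primask" : "=r" (primask));
    asm volatile ("cpsid i" ::: "memory");     // disable interrupts
    return primask;
}

static inline void irq_restore(std::uint32_t primask) {
    asm volatile ("msr primask, %0" :: "r" (primask) : "memory");
}

// Atomic access to an object of any size, safe against pre-emption
// by interrupts on a single core.
static void protected_store(void * dest, const void * src, std::size_t size) {
    std::uint32_t saved = irq_save();
    std::memcpy(dest, src, size);
    irq_restore(saved);
}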

Chris M. Thomasson

Feb 11, 2021, 5:04:06 PM
Humm... Thanks for the information David... Well, afaict, using adaptive
mutexes might work "better" than pure spinlocks. The array of 16 locks
is interesting to me. I need to examine the hashing they use on the
pointers. Fwiw, it really reminds me of my MultiMutex experiment:

https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/Ti8LFyH4CgAJ

Marcel Mueller

Feb 12, 2021, 1:34:47 AM
On 11.02.21 12:17, David Brown wrote:
> Smaller embedded systems are a different world. As an example (because
> it is the processor I am using), consider a Cortex-M processor and an
> RTOS (real-time operating system). In this kind of system, you have a
> single cpu core, and you have a strict hierarchy in thread priorities.
> The highest priority runnable thread is always running (with interrupts
> being somewhat akin to even higher priority threads). When you use
> "real" locks - mutexes supplied by the OS - there are priority boosting
> mechanisms to avoid deadlocks or blocks when a high priority thread
> wants a lock that is held by a low priority thread.

I worked on similar architectures in the past. (inmos T8xx, hardware
scheduler, high priority not preempted by time slice, guaranteed IRQ
response in less than 1µs at 20 MHz!)

> But with spin locks, that doesn't work. So if a low priority thread
> takes a lock, and then it is pre-empted (such as by an external
> interrupt from a device) and a high priority thread or interrupt routine
> wants the lock, that high priority thread will spin forever.

- Raw spin locks make no sense on /any/ single core.
- Priority inversion applies to /any/ mutex.

Spin locks that call sleep(0), and sleep(1) when that does not help,
might be a compromise. But this is more of a workaround than a
sophisticated solution, since it does not address the origin of the
problem at all.

> All this means that if you use atomics in your embedded system,
> everything looks fine and all your testing will likely succeed.

Indeed, lock-free is one of the first choices IMHO.
But this might not be sufficient either. CAS loops might spin forever in
some cases if the algorithm does not guarantee forward progress.
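
For illustration, the kind of retry loop meant here - lock-free in the
sense that some thread always makes progress on a failed CAS, but with
no bound on how long one particular thread may spin:

#include <atomic>

std::atomic<int> value(0);

void add(int delta) {
    int expected = value.load();
    // A failed compare_exchange_weak reloads 'expected'; under heavy
    // contention a given thread may keep retrying indefinitely.
    while (!value.compare_exchange_weak(expected, expected + delta)) {
    }
}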

> But
> there is always the chance of a deadlock happening.

Use the C++ standard atomic<T> and check for is_always_lock_free with a
static assertion and you are fine.
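
For example (is_always_lock_free requires C++17):

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> counter;
static_assert(std::atomic<std::uint64_t>::is_always_lock_free,
              "64-bit atomics would fall back to locks on this target");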

> The efficient way to do general atomic operations (ones that can't be
> handled by a single instruction or a restartable load/store exclusive
> loop) on single-core devices is to disable interrupts temporarily. It
> won't help for multi-core systems - there the locking must be done with
> OS help.

This is, of course, not covered by the C++ standard. It also implies
other impacts, e.g. making IRQ response times unpredictable.
Personally I dislike this old hack from the 70s/80s.

On hardware with strict priorities this might not be required at all
because e.g. an interrupt handler never gets preempted. Of course, you
always need to know the interrupt context where your code is running,
which might be difficult in lower level functions like libraries.


Marcel

David Brown

Feb 12, 2021, 4:09:18 AM
On 12/02/2021 07:34, Marcel Mueller wrote:
> On 11.02.21 12:17, David Brown wrote:
>> Smaller embedded systems are a different world.  As an example (because
>> it is the processor I am using), consider a Cortex-M processor and an
>> RTOS (real-time operating system).  In this kind of system, you have a
>> single cpu core, and you have a strict hierarchy in thread priorities.
>> The highest priority runnable thread is always running (with interrupts
>> being somewhat akin to even higher priority threads).  When you use
>> "real" locks - mutexes supplied by the OS - there are priority boosting
>> mechanisms to avoid deadlocks or blocks when a high priority thread
>> wants a lock that is held by a low priority thread.
>
> I worked on similar architectures in the past. (inmos T8xx, hardware
> scheduler, high priority not preempted by time slice, guaranteed IRQ
> response in less than 1µs at 20 MHz!)
>
>> But with spin locks, that doesn't work.  So if a low priority thread
>> takes a lock, and then it is pre-empted (such as by an external
>> interrupt from a device) and a high priority thread or interrupt routine
>> wants the lock, that high priority thread will spin forever.
>
> - Raw spin locks make no sense on /any/ single core.
> - Priority inversion applies to /any/ mutex.
>
> Spin locks that call sleep(0), and sleep(1) when that does not help,
> might be a compromise. But this is more of a workaround than a
> sophisticated solution, since it does not address the origin of the
> problem at all.
>

The aim of this implementation appears to be to make it simple and OS
independent. I don't believe that can be done.

>> All this means that if you use atomics in your embedded system,
>> everything looks fine and all your testing will likely succeed.
>
> Indeed, lock-free is one of the first choices IMHO.
> But this might not be sufficient either. CAS loops might spin forever in
> some cases if the algorithm does not guarantee forward progress.
>
>> But
>> there is always the chance of a deadlock happening.
>
> Use the C++ standard atomic<T> and check for is_always_lock_free with a
> static assertion and you are fine.

There is no problem with the types that are always lock free - the
compiler generates code for these directly. But then, for types that
are always lock free, you generally don't need atomics at all - with a
single core system, "volatile" is fine (plus the occasional memory
barrier) for loads and stores. (And if you stick to the sanity rule of
only writing to a given object in /one/ thread, then RMW operations are
fine too.)
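
A minimal sketch of that volatile pattern (names made up; single core,
single writer):

#include <cstdint>

volatile std::uint32_t ready_flag;     // shared between ISR and main loop

void isr(void) {                       // the only writer of ready_flag
    ready_flag = 1;
}

void wait_for_ready(void) {
    while (!ready_flag) { }            // a plain volatile read suffices
}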

The fun comes with bigger types. And on a 32-bit processor, that
includes 64-bit types - which are not uncommon. That is when having
atomic types in the language becomes a big benefit - and it is when they
fail completely with this implementation.

(Just for fun, the Cortex-M can /read/ 64-bit data atomically, as the
double-read instruction is restartable. But it can't write it
atomically - an interrupt in the middle of the instruction will leave
the object halfway updated.)

>
>> The efficient way to do general atomic operations (ones that can't be
>> handled by a single instruction or a restartable load/store exclusive
>> loop) on single-core devices is to disable interrupts temporarily.  It
>> won't help for multi-core systems - there the locking must be done with
>> OS help.
>
> This is, of course, not covered by the C++ standard.

This is an implementation library in a compiler - it is not covered by
the C++ standard either. The standard only says what the code should
do, not how it should do it (and in this case, the code does not work).

> It also implies
> other impacts, e.g. making IRQ response times unpredictable.
> Personally I dislike this old hack from the 70s/80s.
>

The "old hack" is far and away the easiest, safest and most efficient
method. IRQ response times are /always/ unpredictable - if you think
otherwise, you have misunderstood the system. The best you can do is
figure out a maximum response time, and even then it may only apply to
the highest priority interrupt. Real time systems are not about saying
when things will happen - they are about giving guarantees for the
maximum delays.

Remember, during most interrupt handling functions on most systems,
interrupts are disabled - your maximum interrupt response is already
increased by the time it takes to run your timer interrupt, or UART
interrupt, or ADC interrupt, or whatever other interrupts you have.
These will all be much longer than the time for a couple of loads or
stores. Thus disabling interrupts around atomic accesses does not
affect IRQ response times.


> On hardware with strict priorities this might not be required at all
> because e.g. an interrupt handler never gets preempted. Of course, you
> always need to know the interrupt context where your code is running,
> which might be difficult in lower level functions like libraries.
>

Absolutely true. And that is often how these things get handled in
practice.

For example, if you have a 64-bit value that is updated only in an
interrupt routine and never read by anything of higher priority, it can
just write the data directly with two write instructions. If an even
higher priority interrupt breaks in in the middle, that's okay. The
lower priority task that reads that value can do the read in a loop - it
keeps reading the value until it gets two consistent reads.
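
A sketch of that read loop (single writer in the interrupt routine
assumed; the variable name is made up):

#include <cstdint>

volatile std::uint64_t tick_count;    // written only by the interrupt routine

std::uint64_t read_ticks(void) {
    std::uint64_t a = tick_count;
    std::uint64_t b = tick_count;
    while (a != b) {                  // torn read - the interrupt ran
        a = b;                        //   between the two halves; retry
        b = tick_count;
    }
    return a;
}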

There is no single maximally efficient solution that will work in all
cases - any generic method will be overkill for some use-cases. But a
toolchain-provided generic method that doesn't always work - that is,
IMHO, worse than no implementation.



Marcel Mueller

Feb 12, 2021, 1:35:31 PM
On 12.02.21 10:09, David Brown wrote:
>> Use the C++ standard atomic<T> and check for is_always_lock_free with a
>> static assertion and you are fine.
>
> There is no problem with the types that are always lock free - the
> compiler generates code for these directly. But then, for types that
> are always lock free, you generally don't need atomics at all - with a
> single core system, "volatile" is fine (plus the occasional memory
> barrier) for loads and stores.

Strictly speaking this is not true if you take DMA into account. But
this is not a common use case.

> (And if you stick to the sanity rule of
> only writing to a given object in /one/ thread, then RMW operations are
> fine too.)

Indeed.

> The fun comes with bigger types. And on a 32-bit processor, that
> includes 64-bit types - which are not uncommon. That is when having
> atomic types in the language becomes a big benefit - and it is when they
> fail completely with this implementation.

Sure, when the hardware does not allow lock free access, then there are
no generic, satisfactory solutions.

> (Just for fun, the Cortex-M can /read/ 64-bit data atomically, as the
> double-read instruction is restartable. But it can't write it
> atomically - an interrupt in the middle of the instruction will leave
> the object halfway updated.)

In this case you cannot use DWCAS on this platform. You need to seek
other solutions, e.g. store and replace a 32-bit pointer to the actual
value.


> This is an implementation library in a compiler - it is not covered by
> the C++ standard either. The standard only says what the code should
> do, not how it should do it (and in this case, the code does not work).

So the library needs to be adjusted in a platform-dependent way.

>> It also implies
>> other impacts, e.g. making IRQ response times unpredictable.
>> Personally I dislike this old hack from the 70s/80s.
>
> The "old hack" is far and away the easiest, safest and most efficient
> method. IRQ response times are /always/ unpredictable - if you think
> otherwise, you have misunderstood the system. The best you can do is
> figure out a maximum response time,

There are systems with guaranteed maximum values.

> and even then it may only apply to
> the highest priority interrupt.

Of course.

> Real time systems are not about saying
> when things will happen - they are about giving guarantees for the
> maximum delays.

Exactly. But that is enough.

> Remember, during most interrupt handling functions on most systems,
> interrupts are disabled - your maximum interrupt response is already
> increased by the time it takes to run your timer interrupt, or UART
> interrupt, or ADC interrupt, or whatever other interrupts you have.
> These will all be much longer than the time for a couple of loads or
> stores.

Context switches can be quite expensive if your hardware has many
registers (including MMU) and no distinct register sets for different
priority levels.

> Thus disabling interrupts around atomic accesses does not
> affect IRQ response times.

Feel free to do so if it is suitable on your platform. You already
mentioned that this is not sufficient on multi core systems, which
are becoming quite common for embedded systems nowadays too.


> There is no single maximally efficient solution that will work in all
> cases - any generic method will be overkill for some use-cases. But a
> toolchain-provided generic method that doesn't always work - that is,
> IMHO, worse than no implementation.

Obviously your particular case is sensitive to priority inversion.

But this is always the case when you use libraries. They cover /common
cases/, not all cases.
If a generic atomic library does not guarantee forward progress when used
with different priorities, it is not suitable for this case.


Marcel

Chris M. Thomasson

Feb 12, 2021, 4:16:20 PM
On 2/11/2021 10:34 PM, Marcel Mueller wrote:
> On 11.02.21 12:17, David Brown wrote:
[...]
> Indeed, lock-free is one of the first choices IMHO.
> But this might not be sufficient either. CAS loops might spin forever in
> some cases if the algorithm does not guarantee forward progress.

Side note... Iirc, Joe Seigh, who worked over at IBM, told me that there
was actually a mechanism to prevent infinite live lock in their CAS
operation. I need to send him an email to clarify. Also, iirc, a Windows
kernel guy mentioned something kind of similar wrt the Windows SList,
and even allowed for node deletion using SEH.


[...]

Chris M. Thomasson

Feb 12, 2021, 4:20:02 PM
[...]

Agreed. It's too sensitive.

Chris M. Thomasson

Feb 12, 2021, 7:05:34 PM
It can be like orchestrating a symphony.

https://youtu.be/5znrVdAtEDI


Bonita Montero

Feb 13, 2021, 7:51:38 AM
> - Raw spin locks make no sense on /any/ single core.

Raw spinlocks almost never make sense in userland.

David Brown

Feb 13, 2021, 9:56:42 AM
On 12/02/2021 19:35, Marcel Mueller wrote:
> On 12.02.21 10:09, David Brown wrote:
>>> Use the C++ standard atomic<T> and check for is_always_lock_free with a
>>> static assertion and you are fine.
>>
>> There is no problem with the types that are always lock free - the
>> compiler generates code for these directly.  But then, for types that
>> are always lock free, you generally don't need atomics at all - with a
>> single core system, "volatile" is fine (plus the occasional memory
>> barrier) for loads and stores.
>
> Strictly speaking this is not true if you take DMA into account. But
> this is not a common use case.

That depends very much on the way DMA and the memory system is
implemented. On many microcontrollers, volatile is all you need. On
others, the memory barrier instructions generated with atomics is not
sufficient - you need explicit cache flush instructions. (That's the
kind of thing that makes low-level code so much fun!)

>
>> (And if you stick to the sanity rule of
>> only writing to a given object in /one/ thread, then RMW operations are
>> fine too.)
>
> Indeed.
>
>> The fun comes with bigger types.  And on a 32-bit processor, that
>> includes 64-bit types - which are not uncommon.  That is when having
>> atomic types in the language becomes a big benefit - and it is when they
>> fail completely with this implementation.
>
> Sure, when the hardware does not allow lock free access, then there are
> no generic, satisfactory solutions.

I think (correct me if I'm wrong) that every system will have a limit to
the lock-free sizes they support.

>
>> (Just for fun, the Cortex-M can /read/ 64-bit data atomically, as the
>> double-read instruction is restartable.  But it can't write it
>> atomically - an interrupt in the middle of the instruction will leave
>> the object halfway updated.)
>
> In this case you cannot use DWCAS on this platform. You need to seek
> other solutions, e.g. store and replace a 32-bit pointer to the actual
> value.
>

Yes, that could perhaps be a way to handle things, but off the top of my
head I can't see how to do this safely and generically.

>
>> This is an implementation library in a compiler - it is not covered by
>> the C++ standard either.  The standard only says what the code should
>> do, not how it should do it (and in this case, the code does not work).
>
> So the library needs to be adjusted in a platform-dependent way.

Yes.

>
>>> It also implies
>>> other impacts, e.g. making IRQ response times unpredictable.
>>> Personally I dislike this old hack from the 70s/80s.
>>
>> The "old hack" is far and away the easiest, safest and most efficient
>> method.  IRQ response times are /always/ unpredictable - if you think
>> otherwise, you have misunderstood the system.  The best you can do is
>> figure out a maximum response time,
>
> There are systems with guaranteed maximum values.

Of course - but that is the maximum of the inherent cpu maximum response
time, maximum delays from memory systems, maximum run-times of other
interrupt functions (that have not re-enabled interrupts), and so on.
Interrupt-disable sections which are shorter than the maximum interrupt
function time will not affect the maximum response time for interrupts.

>
>> and even then it may only apply to
>> the highest priority interrupt.
>
> Of course.
>
>> Real time systems are not about saying
>> when things will happen - they are about giving guarantees for the
>> maximum delays.
>
> Exactly. But that is enough.

Yes.

>
>> Remember, during most interrupt handling functions on most systems,
>> interrupts are disabled - your maximum interrupt response is already
>> increased by the time it takes to run your timer interrupt, or UART
>> interrupt, or ADC interrupt, or whatever other interrupts you have.
>> These will all be much longer than the time for a couple of loads or
>> stores.
>
> Context switches can be quite expensive if your hardware has many
> registers (including MMU) and no distinct register sets for different
> priority levels.
>

Yes, context switches can be expensive - but I don't see how that is
relevant. Interrupt response times don't usually have to take full
context switch times into account, because you don't need a full context
switch to get to the point where you are able to react quickly to the
urgent event. Most cpus preserve the instruction pointer, flag
register, and perhaps one or two general purpose registers when an
interrupt occurs - they rarely do full context switches.

(There are more niche architectures with very fast and deterministic
response times.)

>> Thus disabling interrupts around atomic accesses does not
>> affect IRQ response times.
>
> Feel free to do so if it is suitable on your platform. You already
> mentioned that this is not sufficient on multi core systems, which
> are becoming quite common for embedded systems nowadays too.
>

Embedded systems can broadly be divided into three groups. There are
"big" systems, for which multi-core is becoming more common - these are
dominated by Linux, and you can use whatever multi-threading techniques
you like from the Linux world. There are "small" systems, with a single
core - these are far and away the largest in quantity. In between, you
have asymmetric multiprocessing - devices with a big fast core for hard
work (or possibly multiple fast cores running Linux), and a small core
for low-power states, handling simple low-level devices, or dedicated to
things like wireless communication. Although you have more than one
core in the same package, these are running different programs (and
perhaps different OS's).

I see very little in the way of symmetric multicore systems running
RTOS's. That may change, however.

>
>> There is no single maximally efficient solution that will work in all
>> cases - any generic method will be overkill for some use-cases.  But a
>> toolchain-provided generic method that doesn't always work - that is,
>> IMHO, worse than no implementation.
>
> Obviously your particular case is sensitive to priority inversion.
>
> But this is always the case when you use libraries. They cover /common
> cases/, not all cases.
> If a generic atomic library does not guarantee forward progress when used
> with different priorities, it is not suitable for this case.
>

All RTOS systems are sensitive to priority inversion, as are all small
single-core embedded systems with bare metal (interrupt functions are,
in effect, high priority pre-emptive threads). It is not a special case
- it applies to just about every use of devices such as the Cortex-M or
other microcontrollers. You cannot use the gcc-provided atomics library
for even the simple situation of having a 64-bit value shared between an
interrupt function and a main loop or other thread - it will kill your
system if there is an access conflict.

Marcel Mueller

Feb 13, 2021, 2:42:19 PM
On 13.02.21 15:56, David Brown wrote:
>> In this case you cannot use DWCAS on this platform. You need to seek
>> other solutions, e.g. store and replace a 32-bit pointer to the actual
>> value.
>
> Yes, that could perhaps be a way to handle things, but off the top of my
> head I can't see how to do this safely and generically.

This is quite easy. You only need two physical storage locations for the
data. One holds the current value, one the next value. If you synchronize
only the writers, they can safely swap the storage pointer. No need to
synchronize readers. Often that's enough.

Even more sophisticated solutions may use thread local storage for the
values. Each thread has a current and a next storage. This removes the
need for the writer mutex. In case of an IRQ handler (which usually is
not re-entrant) dedicated storage for the IRQ handler could do the job.

If this is still not sufficient, because a high update rate could cause
a reader to get two versions behind, the stolen-bits hack in the
referring pointer may identify these situations. Readers need to
increment the master pointer atomically before using the storage it
points to. Now writers know that they should not discard or modify this
storage.


>> Obviously your particular case is sensitive to priority inversion.
>>
>> But this is always the case when you use libraries. They cover /common
>> cases/, not all cases.
>> If a generic atomic library does not guarantee forward progress when used
>> with different priorities, it is not suitable for this case.
>
> All RTOS systems are sensitive to priority inversion,

Sure, but lock free algorithms are not. ;-)

> You cannot use the gcc-provided atomics library
> for even the simple situation of having a 64-bit value shared between an
> interrupt function and a main loop or other thread - it will kill your
> system if there is an access conflict.

Is it possible to raise the priority of all mutex users for the time of
the critical section? This will still abuse a time slice if the spin
lock does not explicitly call sleep. But at least it will not deadlock.


Marcel

Chris M. Thomasson

Feb 13, 2021, 4:24:33 PM
You are correct. I remember a long time ago when CMPXCHG16B was first
introduced, it was _not_ guaranteed that every future 64-bit x86 would
support it. On a 64-bit system this instruction is a DWCAS, a Double-Width
Compare-And-Swap. IBM z arch has CDS for a DWCAS. Iirc, CDS is
guaranteed to be on all z-arch.

For fun... Check this odd ball instruction out, David:

CMP8XCHG16 for the Itanium.

;^)

[...]

Chris M. Thomasson

Feb 13, 2021, 4:35:43 PM
Iirc, priority inheritance can sort of "reduce" priority inversion. Have
binary semaphores on the mind.

https://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_mutexattr_getprioceiling.html
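
Something along these lines on a POSIX system - a sketch using
PTHREAD_PRIO_INHERIT rather than the ceiling protocol from that page:

#include <pthread.h>

pthread_mutex_t m;

void init_pi_mutex(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}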

Chris M. Thomasson

Feb 13, 2021, 4:36:19 PM
On 2/13/2021 4:51 AM, Bonita Montero wrote:
>> - Raw spin locks make no sense on /any/ single core.
>
> Raw spinlocks almost never make sense in userland.

True.

David Brown

Feb 14, 2021, 5:46:45 AM
On 13/02/2021 20:42, Marcel Mueller wrote:
> On 13.02.21 15:56, David Brown wrote:
>>> In this case you cannot use DWCAS on this platform. You need to seek
>>> other solutions, e.g. store and replace a 32-bit pointer to the actual
>>> value.
>>
>> Yes, that could perhaps be a way to handle things, but off the top of my
>> head I can't see how to do this safely and generically.
>
> This is quite easy. You only need two physical storage locations for the
> data. One holds the current value, one the next value. If you synchronize
> only the writers, they can safely swap the storage pointer. No need to
> synchronize readers. Often that's enough.

No, that won't work. It was the first thing that came to my mind too,
but it is not sufficient.

Let's model the object we want to access atomically as:

typedef struct T { uint32_t lo; uint32_t hi; } T;

You store two copies:

volatile T d[2];

and a pointer:

volatile T* volatile p = &d[0];

Updating will be something like:

void update(T x) {
    get_writer_lock();
    volatile T * q = &d[1 - (p - d)];   // the copy not currently published
    *q = x;                             // write into the spare copy
    p = q;                              // publish it with an atomic pointer store
    release_lock();
}

You are using extra synchronisation for writing, which is not ideal, but
it is not uncommon to have only a single writer.

You are then suggesting that this is safe for reading:

T read(void) {
    return *p;
}

Let's break this down. The implementation will be something like:

T read(void) {
    T x;
    volatile T* q = p;
    // point 1
    x.lo = q->lo;
    // point 2
    x.hi = q->hi;
    return x;
}

If the reader thread is pre-empted (or interrupted) at point 1 by the
writer, that's okay - the reader doesn't see the new data, but it gets
the consistent old one, as the writer has modified the new copy. The
same happens if it is pre-empted at point 2. Since the pointer is read
atomically, the data is consistent.

Except... what if the writer does two updates? Or two writer threads
run while the reader thread is paused? Then a writer is stomping all
over the data that the reader thread has partially read.


So it is not nearly as simple as you imply. It is a step towards a
solution, but requires work. A "store/load exclusive" loop can make
reading safe, but you still need a synchronisation mechanism for the
writers that requires locking (and thus fails to be a generic lock-free
mechanism).


In the common situation of a single writer and a single reader, you can
use this kind of arrangement. But you use three copies, not two, and
you have tracking of which buffer is used by the reader and writer.
Even then it's a bit fiddly, and the most efficient solutions need
knowledge of how the threads can interact and interrupt each other.
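
For what it's worth, here is one shape that single-writer/single-reader
arrangement can take - a sketch only, with a "fresh" flag stolen from
bit 2 of an atomic index (and it assumes the exchange on the index is
itself lock-free):

#include <atomic>
#include <cstdint>

typedef struct T { std::uint32_t lo; std::uint32_t hi; } T;

static T buf[3];
static std::atomic<unsigned> mailbox(1);  // low 2 bits: index; bit 2: fresh
static unsigned write_idx = 0;            // buffer owned by the writer
static unsigned read_idx = 2;             // buffer owned by the reader

void publish(T x) {                       // called by the single writer
    buf[write_idx] = x;
    unsigned old = mailbox.exchange(write_idx | 4u);
    write_idx = old & 3u;                 // reclaim the old mailbox buffer
}

bool fetch(T * out) {                     // called by the single reader
    if (!(mailbox.load() & 4u))
        return false;                     // nothing new since the last fetch
    unsigned old = mailbox.exchange(read_idx);
    read_idx = old & 3u;
    *out = buf[read_idx];
    return true;
}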


> Even more sophisticated solutions may use thread local storage for the
> values. Each thread has a current and a next storage. This removes the
> need for the writer mutex. In case of an IRQ handler (which usually is
> not re-entrant) dedicated storage for the IRQ handler could do the job.
>
> If this is still not sufficient, because a high update rate could cause
> a reader to get two versions behind, the stolen-bits hack in the
> referring pointer may identify these situations. Readers need to
> increment the master pointer atomically before using the storage it
> points to. Now writers know that they should not discard or modify this
> storage.
>

Your mention of "high update rate" is perhaps why I am not happy with
your solutions. You are talking about things that will likely work well
in most cases - I am trying to look at things that are guaranteed to
work in /every/ case.

When you have specific cases, you can pick solutions that are efficient
and work given the assumptions that are valid in that case. Maybe you
have only one writer, maybe you know about the synchronisation and which
thread can pre-empt the other - and so on.

But a library that comes with a toolchain that implements atomic read
and write of any sized data needs to work in /all/ cases. "Usually
sufficient" is not good enough, nor is "assuming a low update rate" or
any other assumption. It needs to work /every/ time, with no additional
assumptions (except perhaps assumptions that can be checked at compile
or link time, such as specific OS support).

>
>>> Obviously your particular case is sensitive to priority inversion.
>>>
>>> But this is always the case when you use libraries. They cover /common
>>> cases/, not all cases.
>>> If a generic atomic library does not guarantee forward progress when used
>>> with different priorities, it is not suitable for this case.
>>
>> All RTOS systems are sensitive to priority inversion,
>
> Sure, but lock free algorithms are not. ;-)

Agreed.

But atomics are not exclusively about being lock-free. Lock-free
atomics let you build lock-free algorithms that can scale well across
multiple cores. However, atomics are fundamentally about making
particular operations appear unbreakable - and that applies even if the
atomic operation in question is not lock-free. Beyond a certain size or
complexity, operations invariably require some kind of lock (or special
hardware support) to be atomic - and then you have locks, and you have
sensitivity to priority inversion.

>
>> You cannot use the gcc-provided atomics library
>> for even the simple situation of having a 64-bit value shared between an
>> interrupt function and a main loop or other thread - it will kill your
>> system if there is an access conflict.
>
> Is it possible to raise the priority of all mutex users for the time of
> the critical section? This will still abuse a time slice if the spin
> lock does not explicitly call sleep. But at least it will not deadlock.
>

Doing something like that would negate all the benefits of trying to use
atomics rather than simply using mutexes in the first place.

It is far better (again, in the single core case) to simply disable
interrupts for the short code section needed to do the atomic access.
This has the effect of raising the priority of the current thread to the
maximum - but it does so for the shortest possible time.

Chris M. Thomasson

Feb 14, 2021, 3:56:04 PM
On 2/14/2021 2:46 AM, David Brown wrote:
> On 13/02/2021 20:42, Marcel Mueller wrote:
>> On 13.02.21 15:56, David Brown wrote:
>>>> In this case you cannot use DWCAS on this platform. You need to seek
>>>> other solutions, e.g. store and replace a 32-bit pointer to the actual
>>>> value.
>>>
>>> Yes, that could perhaps be a way to handle things, but off the top of my
>>> head I can't see how to do this safely and generically.
>>
>> This is quite easy. You only need two physical storage locations for the
>> data. One holds the current value, one the next value. If you synchronize
>> only the writers, they can safely swap the storage pointer. No need to
>> synchronize readers. Often that's enough.
>
> No, that won't work. It was the first thing that came to my mind too,
> but it is not sufficient.
>
> Let's model the object we want to access atomically as:
>
> typedef struct T { uint32_t lo; uint32_t hi; } T;

For some reason this reminds me of Joe Seigh's 63-bit atomic counter:

https://groups.google.com/g/comp.lang.asm.x86/c/FScbTaQEYLc/m/X0gAskwQW44J

;^)


[...]