You are more familiar with the details of these things than most people,
so I hope you (or someone else) will correct me if my logic below is wrong.
There's no problem when the target has a single unbreakable instruction
for the action. And LL/SC is fine for atomic loads or stores of
different sizes.
But LL/SC is not sufficient for read-modify-write sequences of a size
larger than can be handled by a single atomic instruction.
Imagine you have a processor that can atomically read or write an
unsigned integer type "uint". Your sequence for "uint_inc" will be:
retry:
load link x = *p
x++
if (store conditional *p = x fails) goto retry
If two processes try this, they can interleave and be started or stopped
without trouble - the result will be an atomic increment.
Now consider a double-sized type containing two "uint" fields:
retry:
load link x_lo = *p
x_hi = *(p + 1)
x_lo++
if (!x_lo) x_hi++
if (store conditional *p = x_lo fails) goto retry
*(p + 1) = x_hi
If the process executing this is stopped after the first write, and a
second process is run that calls a similar function, then the new
process will see a half-changed value for the object resulting in a
corrupted object. Resumption of the first process will half-change the
value again. Different combinations of using "store_conditional" on the
two stores will result in similar problems.
The only way to make a multi-unit RMW operation work is if other
processes are /blocked/ from breaking in during the actual write
sequence. Reads and the calculation can be retried, but not the
writes - they must be made an unbreakable sequence. And that, in
general, means a lock and OS support to ensure that the locking process
gets to finish.
The gcc implementation of atomic operations (larger than can be handled
with a single instruction) uses simple user-space spin locks (the lock
word itself can be accessed atomically - with an LL/SC sequence on ARM).
If one process tries to access the atomic while another process has the
lock, it will spin - running a busy wait loop. As long as these
processes are running on different cores, there's no problem with one
core running a few rounds of a tight loop while another core does a
quick load or store. Given that contention is rare and cores are often
plentiful, this results in a very efficient atomic operation. But it
can deadlock - a process could take the spin lock and then get
descheduled by the OS, and other threads wanting the lock could be
activated. If these fill up the cores (maybe you have multiple threads
all using the same supposedly lock-free atomic structure), you are screwed.
And if you have only one core (like almost all microcontrollers), and
the thread that has the lock is interrupted by an interrupt routine that
wants to access the same atomic variable, you are /really/ screwed.
This can happen with such simple code as a 64-bit atomic counter in an
interrupt routine that is also accessed atomically from a background task.
It's very unlikely that you'll hit a problem, but it is possible. To
me, that is useless - atomics need guaranteed forward progress. That
means the std::atomic<> stuff needs to use OS-level locks for advanced
cases that can't be handled directly by instructions or LL/SC sequences,
or for a microcontroller you'd want to disable interrupts around the
access. The alternative is to refuse to compile the operations and only
support atomics that are smaller or simpler.