
Never use strncpy!


Juha Nieminen

Sep 21, 2022, 6:04:29 AM
Well, *almost* never, at least.

I always thought that std::strncpy() works exactly like std::strcpy(),
except that it stops early if the specified count is reached. Turns out
that I was gravely mistaken:

"If, after copying the terminating null character from src, count is not
reached, additional null characters are written to dest until the total
of count characters have been written."

This means that if you have, let's say, a 1 MB buffer into which you copy
with std::strncpy() lots and lots of strings, the vast majority of them
very short, expecting it to be efficient... turns out you'll be writing
1 MB worth of data every single time. Which will make the thing quite
slow if you weren't aware of this.

It's just better to write your own custom version of strncpy() that does
what strncpy() should be doing, i.e. just stop once the source string
ends.

I can't find any standard library (C or C++) function that does that,
so you'll just have to write your own. (Luckily it's trivial to do.)

And while you are at it, you might also want to fix this little problem:

"If count is reached before the entire string src was copied, the
resulting character array is not null-terminated."

Perhaps return to the caller some value telling if the string was
truncated.
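
For instance, a minimal sketch of such a function (the name and the
bool return value are just one possible design, nothing standard):

#include <cstddef>

// Copies like strcpy(), but never writes more than destSize bytes
// and always null-terminates (when destSize > 0).
// Returns false if the source had to be truncated.
bool copyString(char *dest, std::size_t destSize, const char *src)
{
    if (destSize == 0) return false;
    std::size_t i = 0;
    for (; i + 1 < destSize && src[i] != '\0'; ++i)
        dest[i] = src[i];
    dest[i] = '\0';
    return src[i] == '\0';
}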

Alf P. Steinbach

Sep 21, 2022, 9:06:50 AM
Thank you. I was not aware.

- Alf


David Brown

Sep 21, 2022, 9:24:19 AM
Several of the str* functions in the C standard library are downright
silly. There are surprising inconsistencies (strncat guarantees a
null-terminated result, strncpy does not), mostly useless return values,
and missing functions (no strnlen). There are no volatile versions
which could be used to ensure that something like a "memset" to wipe
memory would actually be run. Then there are the myths and abuses that
are common in real-world code, such as assumptions that "memcpy" runs
forwards or works like "memmove". And there are no versions of the
moves or copies that work on bigger block sizes - you are dependent on
the quality of the compiler to figure out when larger sizes can be used.

Good compilers can optimise some of this - if you have several "strcat"
calls in a row, gcc can remember the length of each cat as a short-cut
for the next one. It can sometimes inline memcpy, and other functions,
giving better results. Poorer compilers implement these as external
functions in a DLL, with correspondingly bad performance on small
strings or memory blocks.

The solution, of course, is a selection of non-standard additional
string and memory functions that exist in some C libraries and not
others, and sometimes have the same name but different functionality in
different libraries. Oh, and there's the Annex K "bounds-checking"
functions that MS pushed into the C standards that neither they nor
anyone else implements, and no one would use even if they /were/
implemented.

Bonita Montero

Sep 21, 2022, 9:30:26 AM
On 21.09.2022 at 15:24, David Brown wrote:

> Several of the str* functions in the C standard library are downright
> silly.  There are surprising inconsistencies (strncat guarantees a
> null-terminated result, strncpy does not), mostly useless return values,
> and missing functions (no strnlen).  There are no volatile versions
> which could be used to ensure that something like an "memset" to wipe
> memory would actually be run. ...

volatile is mostly deprecated. Atomics are used most of the time
where you used volatiles before. And memcpy() and memset() don't need
any behaviour like volatile or atomics, since you can add memory
barriers and you get all you need.



Scott Lurndal

Sep 21, 2022, 10:01:16 AM
Juha Nieminen <nos...@thanks.invalid> writes:
>Well, *almost* never, at least.
>
>I always thought that std::strncpy() works exactly like std::strcpy(),
>except that it stops early if the specified count is reached. Turns out
>that I was gravely mistaken:

I've always used memcpy. I've very seldom used any str* function
other than strlen and once in a blue moon, strtok. I use snprintf
in place of strcat, for instance.
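
For example, a sketch of a bounded copy via snprintf (the function
name is my own; snprintf returns the length the full output would
have had, so it can also report truncation):

#include <cstdio>
#include <cstddef>

// Copy src into dest (capacity size); returns false on truncation.
bool copyWithSnprintf(char *dest, std::size_t size, const char *src)
{
    int n = std::snprintf(dest, size, "%s", src);
    return n >= 0 && static_cast<std::size_t>(n) < size;
}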

Ben Bacarisse

Sep 21, 2022, 10:56:58 AM
Juha Nieminen <nos...@thanks.invalid> writes:

> Well, *almost* never, at least.
>
> I always thought that std::strncpy() works exactly like std::strcpy(),
> except that it stops early if the specified count is reached. Turns out
> that I was gravely mistaken:
>
> "If, after copying the terminating null character from src, count is not
> reached, additional null characters are written to dest until the total
> of count characters have been written."
...
> "If count is reached before the entire string src was copied, the
> resulting character array is not null-terminated."

Yes, a famous oddity that stems from one very specific use in
fixed-width Unix data structures (the most commonly cited example being
the file name in a directory entry, which had to be zero-filled but
need not be zero-terminated).

> I can't find any standard library (C or C++) function that does that,
> so you'll just have to write your own. (Luckily it's trivial to do.)

A common trick was to write

*dest = 0;
strncat(dest, source, N);

or even (in an expression context)

strncat((*dest = 0, dest), source, N)

Note that strncat appends at most N characters *plus* a terminating
null, so dest must have room for N+1 bytes.

> Perhaps return to the caller some value telling if the string was
> truncated.

If your C library includes Annex K there is the rather gruesome
strncpy_s function.

And there is a very widely available, but non-standard, function called
strlcpy (with a companion strlcat).

--
Ben.

Mut...@dastardlyhq.com

Sep 21, 2022, 11:36:54 AM
On Wed, 21 Sep 2022 10:04:09 -0000 (UTC)
Juha Nieminen <nos...@thanks.invalid> wrote:
>Well, *almost* never, at least.
>
>I always thought that std::strncpy() works exactly like std::strcpy(),
>except that it stops early if the specified count is reached. Turns out
>that I was gravely mistaken:
>
>"If, after copying the terminating null character from src, count is not
>reached, additional null characters are written to dest until the total
>of count characters have been written."
>
>This means that if you have, let's say, a 1 MB buffer into which you copy
>with std::strncpy() lots and lots of strings, the vast majority of them
>very short, expecting it to be efficient... turns out you'll be writing
>1 MB worth of data every single time. Which will make the thing quite
>slow if you weren't aware of this.

Useful to know. Wonder why they did it that way? Doesn't seem very logical.


Mut...@dastardlyhq.com

Sep 21, 2022, 11:42:31 AM
On Wed, 21 Sep 2022 15:24:01 +0200
David Brown <david...@hesbynett.no> wrote:
>Several of the str* functions in the C standard library are downright
>silly. There are surprising inconsistencies (strncat guarantees a
>null-terminated result, strncpy does not), mostly useless return values,
>and missing functions (no strnlen). There are no volatile versions
>which could be used to ensure that something like an "memset" to wipe
>memory would actually be run. Then there are the myths and abuses that
>are common in real-world code, such as assumptions that "memcpy" runs
>forwards or works like "memmove". And there are no versions of the
>moves or copies that work on bigger block sizes - you are dependent on
>the quality of the compiler to figure out when larger sizes can be used.

Also with some compilers the following works:

snprintf(mystr,some_max_len,"%s etc etc",mystr, etc etc)

and with some it just produces garbage in mystr (the standard gives no
guarantee when an argument overlaps the destination buffer). Which is
annoying, as it would save a lot of mucking about with strcat when
forced to use plain C.
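
One portable way around it is to append at an explicit offset, so the
destination buffer is never also passed as a %s argument (a sketch;
the names are stand-ins):

#include <cstdio>
#include <cstring>

// Append suffix to str (total capacity cap) without overlapping
// source and destination; str must already be null-terminated.
void appendStr(char *str, std::size_t cap, const char *suffix)
{
    std::size_t len = std::strlen(str);
    if (len < cap)
        std::snprintf(str + len, cap - len, "%s", suffix);
}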

Frederick Virchanza Gotham

Sep 21, 2022, 12:00:08 PM
On Wednesday, September 21, 2022 at 11:04:29 AM UTC+1, Juha Nieminen wrote:

> "If count is reached before the entire string src was copied, the
> resulting character array is not null-terminated."


In my last job I programmed cameras with embedded Linux, and I outlawed 'strncpy'.

I didn't put it in my own code reviews and I didn't accept code reviews containing it.

Very poorly designed.

Bonita Montero

Sep 21, 2022, 12:10:25 PM
There's no real purpose for things like that, so it doesn't make
sense to think about such things.



Scott Lurndal

Sep 21, 2022, 12:12:25 PM
It seems fraught to store into the source string.

Better to use snprintf (and yes, this example will break
if the input buffer is too small; in this example, that
is guaranteed not to be the case). Easily fixed if necessary by using
std::min(bplen, bytecount) in the subtraction.

size_t
c_processor::format_insn(struct _op *opp,
                         c_environment *env,
                         mem_addr_t ip,
                         ulong afl, ulong bfl,
                         c_operand *opa, c_operand *opb,
                         c_operand *opc, bool symbolic,
                         char **bpp, size_t bplen)
{
    size_t bytecount = 0;
    char buf[10];
    char *bp = *bpp;

    bytecount = snprintf(bp, bplen, "[%1.1lu/%4.4lu]%s:%6.6llu: %4.4s ",
                         p_procnum, p_curtasknum,
                         env->print(buf, sizeof(buf)), ip, opp->op_name);
    bplen -= bytecount, bp += bytecount;

    if (!opp->op_noafbf) {
        bytecount = snprintf(bp, bplen, "%2.2lu", afl);
        bplen -= bytecount, bp += bytecount;

        if (opp->op_bfhex) {
            bytecount = snprintf(bp, bplen, "%2.2lx ", bfl);
            bplen -= bytecount, bp += bytecount;
        } else {
            bytecount = snprintf(bp, bplen, "%2.2lu ", bfl);
            bplen -= bytecount, bp += bytecount;
        }
    } else {
        if (opp->op_opcode == OP_ACM) {
            bytecount = snprintf(bp, bplen, "%2.2lx ", afl);
            bplen -= bytecount, bp += bytecount;
        }
    }

    if ((opp->op_operands > 0) && (opa != NULL)) {
        bplen = opa->dump(&bp, bplen, symbolic);
        *bp++ = ' ';
        --bplen;
    }

    if ((opp->op_operands > 1) && (opb != NULL)) {
        bplen = opb->dump(&bp, bplen, symbolic);
        *bp++ = ' ';
        --bplen;
    }

    if ((opp->op_operands > 2) && (opc != NULL)) {
        bplen = opc->dump(&bp, bplen, symbolic);
        *bp++ = ' ';
        --bplen;
    }

    *bpp = bp;
    return bplen;
}

Mut...@dastardlyhq.com

Sep 21, 2022, 12:20:13 PM
You don't need to keep demonstrating that you've never done any programming
outside of your ivory tower and certainly not in C, we already know.

Bonita Montero

Sep 21, 2022, 12:27:22 PM
If you do things like the above you're in the highest ivory tower ever.


David Brown

Sep 21, 2022, 1:21:00 PM
On 21/09/2022 15:30, Bonita Montero wrote:
> On 21.09.2022 at 15:24, David Brown wrote:
>
>> Several of the str* functions in the C standard library are downright
>> silly.  There are surprising inconsistencies (strncat guarantees a
>> null-terminated result, strncpy does not), mostly useless return
>> values, and missing functions (no strnlen).  There are no volatile
>> versions which could be used to ensure that something like an "memset"
>> to wipe memory would actually be run. ...
>
> volatile is mostly deprecated.

Volatile is not deprecated at all. Overly complex expressions involving
volatile were deprecated in C++20 as their semantics were unclear.

> atomics are used most of the time
> where you used volatiles before.

Complete nonsense. Volatile accesses and atomics are different things,
for different purposes.

> And memcpy() and memset() don't need
> any behaviour like volatile or atomics, since you can add memory
> barriers and you get all you need.
>

Following a memcpy() or memset() with a memory barrier (a relaxed order
full fence) is certainly a possibility. But such fences can be a lot
more expensive than using volatile writes.

Bonita Montero

Sep 21, 2022, 1:33:04 PM
On 21.09.2022 at 19:20, David Brown wrote:
> On 21/09/2022 15:30, Bonita Montero wrote:
>> On 21.09.2022 at 15:24, David Brown wrote:
>>
>>> Several of the str* functions in the C standard library are downright
>>> silly.  There are surprising inconsistencies (strncat guarantees a
>>> null-terminated result, strncpy does not), mostly useless return
>>> values, and missing functions (no strnlen).  There are no volatile
>>> versions which could be used to ensure that something like an
>>> "memset" to wipe memory would actually be run. ...
>>
>> volatile is mostly deprecated.
>
> Volatile is not deprecated at all. Overly complex expressions involving
> volatile were deprecated in C++20 as their semantics were unclear.

The semantics of volatile have been reduced with C++20:
https://en.cppreference.com/w/cpp/language/cv

>> atomics are used most of the time
>> where you used volatiles before.
>
> Complete nonsense.  Volatile accesses and atomics are different things,
> for different purposes.

Before C++11, volatile was partially used where today you use atomics.
But the whole semantics were platform-specific. E.g. there's a mode of
MSVC where volatile reads have acquire semantics and volatile writes
have release semantics.

> Following a memcpy() or memset() with a memory barrier (a relaxed order
> full fence) is certainly a possibility.  But such fences can be a lot
> more expensive than using volatile writes.

volatile writes can't be substituted with fences.


Andrey Tarasevich

Sep 21, 2022, 4:24:02 PM
On 9/21/2022 3:04 AM, Juha Nieminen wrote:
>
> I always thought that std::strncpy() works exactly like std::strcpy(),
> except that it stops early if the specified count is reached. Turns out
> that I was gravely mistaken:
>

The matter has been explained, explained and over-explained to death
already, including here in comp.lang.* newsgroups. It is well-known that
`strncpy` has never been intended as a "safe string copying" function.
It is a niche function introduced for so called "fixed-width" string
support.

https://stackoverflow.com/questions/2886931/difference-fixed-width-strings-and-zero-terminated-strings

It has never been intended for use with zero-terminated strings.

--
Best regards,
Andrey

Ben Bacarisse

Sep 21, 2022, 4:49:34 PM
Except for the quibble that a null in the source string is respected --
i.e. the destination is considered to be a fixed-width field but not the
source.

--
Ben.

David Brown

Sep 21, 2022, 5:15:01 PM
On 21/09/2022 19:33, Bonita Montero wrote:
> On 21.09.2022 at 19:20, David Brown wrote:
>> On 21/09/2022 15:30, Bonita Montero wrote:
>>> On 21.09.2022 at 15:24, David Brown wrote:
>>>
>>>> Several of the str* functions in the C standard library are
>>>> downright silly.  There are surprising inconsistencies (strncat
>>>> guarantees a null-terminated result, strncpy does not), mostly
>>>> useless return values, and missing functions (no strnlen).  There
>>>> are no volatile versions which could be used to ensure that
>>>> something like an "memset" to wipe memory would actually be run. ...
>>>
>>> volatile is mostly deprecated.
>>
>> Volatile is not deprecated at all. Overly complex expressions
>> involving volatile were deprecated in C++20 as their semantics were
>> unclear.
>
> The semantics of volatile has been reduced with C++20:
> https://en.cppreference.com/w/cpp/language/cv

No, the page there says - as I said - that some uses of volatile in
complex expressions were deprecated in C++20.

>
>>> atomics are used most of the time
>>> where you used volatiles before.
>>
>> Complete nonsense.  Volatile accesses and atomics are different
>> things, for different purposes.
>
> Before C++11, volatile was partially used where today you use atomics.

If you used "volatile" thinking you got the effects of atomic access,
you were wrong. Prior to C++11 (and C11), C and C++ did not have any
concept of multiple threads - they did not have any need of "atomic"
accesses in the language. People who write OS's and similar low-level
code needed to implement atomics for the system, and "volatile" was
/part/ of those implementations. People writing code for such OS's and
using atomics, used the OS's functions, classes, and macros.

If you use "volatile" when you mean "atomic", your code is wrong. If
you use "atomic" when you mean "volatile", your code will probably work
but will be much less efficient. "volatile atomic" is an extremely
common combination.
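
For instance (a sketch - the variable and scenario are invented): a
counter that is shared between threads, and whose every access must
also really be performed because something outside the program (an
interrupt handler, a debugger) looks at it:

#include <atomic>
#include <cstdint>

volatile std::atomic<std::uint32_t> eventCount{0};

void onEvent()
{
    // atomic: safe against concurrent increments from other threads;
    // volatile: each access must actually happen, every time.
    eventCount.fetch_add(1, std::memory_order_relaxed);
}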

> But the whole semantics were platform-specific. E.g. there's a mode of
> MSVC where volatile reads have acquire semantics and volatile writes
> have release semantics.
>

The details of volatile accesses are implementation-dependent, by
necessity. And an implementation is allowed to make them stronger -
though doing so is going to be inefficient and encourage misconceptions
and unwarranted assumptions.

>> Following a memcpy() or memset() with a memory barrier (a relaxed
>> order full fence) is certainly a possibility.  But such fences can be
>> a lot more expensive than using volatile writes.
>
> volatile writes can't be substituted with fences.
>

A fence implies a memory barrier - writes that are logically part of the
source code must be completed before the barrier completes. That is
part of why you have fences.

Keith Thompson

Sep 21, 2022, 7:01:30 PM
(Replying in part to things that have been said in other posts in this
thread.)

strncpy() is not poorly designed. It's quite reasonably designed for
the niche purpose for which it was intended, where the source is an
ordinary null-terminated string and the target, an N-byte character
array, holds a sequence of M significant non-null characters followed by
exactly N-M null characters.

It is poorly *named*. The name implies that, as strncat is a "safer"
strcat, strncpy is a "safer" strcpy. Both strncat and strncpy take a
count that limits how much they write, avoiding writing past the end of
the target array, but strncpy does not treat its target as a
null-terminated string.

I wrote about strncpy here (a lot of what I write has been covered in
this thread):
http://the-flat-trantor-society.blogspot.com/2012/03/no-strncpy-is-not-safer-strcpy.html

Of course this is comp.lang.c++, so you should usually be using
std::string, but sometimes you do need to deal with C-style strings.
It's unlikely (but still possible) that strncpy() might be the right
tool for the job. If it is, thoroughly comment the code so that the
next person who maintains it doesn't break your assumptions.

A digression: Quietly truncating the output, as strncpy and strncat do,
is not always "safer". Sometimes it's exactly what you want, for
example if you're printing data in fixed-width columns and it's going to
be obvious that something has been truncated. Sometimes silent
truncation can be worse than terminating the program, for example if a
command string "rm -rf $HOME/tmpdir" is quietly truncated to
"rm -rf $HOME/". If your code handles errors, always think about what
that error handling will actually do.

--
Keith Thompson (The_Other_Keith) Keith.S.T...@gmail.com
Working, but not speaking, for Philips
void Void(void) { Void(); } /* The recursive call of the void */

Richard Damon

Sep 21, 2022, 7:18:55 PM
Yes, it is to copy a "C String" (Null Terminated) into a fixed width field.

Ben Bacarisse

Sep 21, 2022, 7:40:08 PM
Again, a quibble: not quite. A null will be respected, but there is no
need for the source to be a C string.

--
Ben.

Richard Damon

Sep 21, 2022, 8:45:59 PM
It may be able to do more, but I suspect that is the purpose it was
designed for.

Keith Thompson

Sep 21, 2022, 8:56:07 PM
You're right (and I misstated it in a recent post). The standard (I'll
quote N1570 because I have it open) says:

The strncpy function copies not more than n characters (characters
that follow a null character are not copied) from the array pointed
to by s2 to the array pointed to by s1. If copying takes place
between objects that overlap, the behavior is undefined.

Which means that the source doesn't have to point to a C string.

The Linux Programmer's Manual man page incorrectly suggests that the
source has to be a pointer to a C string:

The strcpy() function copies the string pointed to by src,
including the terminating null byte ('\0'), to the buffer pointed
to by dest.
[...]

The strncpy() function is similar, except that at most n bytes of
src are copied.

The phrase "the string pointed to by src" in the description of strcpy
implies that the behavior is undefined if src doesn't point to a string.
The man page incorrectly implies that same wording applies to strncpy.

Keith Thompson

Sep 21, 2022, 9:00:30 PM
According to the standard's description, the source clearly does not
have to point to a C string -- and strncpy() would work perfectly well
to copy one fixed-sized buffer to another, which is a very plausible use
case if you're working with such a data structure.

char source[5] = "hello"; // no null terminator (valid C, but ill-formed in C++)
char target[5];
strncpy(target, source, sizeof target);

Of course it also works if the source is a pointer to a C string, and
it's explicitly required to deal with the null terminator.

Bonita Montero

Sep 21, 2022, 10:40:16 PM
On 21.09.2022 at 23:14, David Brown wrote:

>> Before C++11, volatile was partially used where today you use atomics.

> If you used "volatile" thinking you got the effects of atomic access,
> you were wrong. ...

No, you could use volatile in an implementation-defined way, and it did
partially the same as atomic.

>>> Following a memcpy() or memset() with a memory barrier (a relaxed
>>> order full fence) is certainly a possibility.  But such fences can
>>> be a lot more expensive than using volatile writes.

>> volatile writes can't be substituted with fences.

> A fence implies a memory barrier - ...

It's a different word for the same thing.


Juha Nieminen

Sep 22, 2022, 2:00:20 AM
David Brown <david...@hesbynett.no> wrote:
> If you use "volatile" when you mean "atomic", your code is wrong.

Yeah. Some programmers might have mistakenly thought that 'volatile'
means the same thing as "atomic", and thus "thread-safe". However,
programmers who know what they are doing also know that it isn't
anything like that.

I think that the exact semantics of 'volatile' might be largely
implementation-defined, but AFAIK in most if not all compilers
(especially gcc and clang) it effectively acts as an optimization
barrier. It tells the compiler "any access to this must never be
optimized away, nor moved to happen somewhere else in the code".

In other words, if you eg. write a loop where you read a
'volatile' ten times, then the compiler must generate code that
reads it ten times, inside that exact loop and nowhere else.
The compiler must not optimize it to one single read or
completely away (because it doesn't see anything changing it).

By far the most common (perhaps only) use of this is in embedded
programming, especially on very small processors, where things
like ports and CPU pins are mapped into the address space (which
means that reading the same memory location may give different values at
different times, and writing values there most definitely must
never be optimized away).
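
A sketch of the classic case (the register address and bit are made
up):

#include <cstdint>

// Hypothetical memory-mapped status register.
volatile std::uint32_t &statusReg =
    *reinterpret_cast<volatile std::uint32_t *>(0x40001000u);

void waitUntilReady()
{
    // Each iteration must re-read the actual hardware register;
    // without volatile the compiler could hoist the load out of
    // the loop and spin forever on a stale value.
    while ((statusReg & 0x1u) == 0) {
        // spin
    }
}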

There's another useful use of 'volatile': In benchmarking.
Run the code that's to be benchmarked, and then assign the
return value to a volatile. This stops the compiler from
optimizing the whole thing away because it sees that the
result isn't being used for anything. (It might not have
optimized it away in the first place, but assigning the
result to a volatile makes sure of that.)
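
A sketch of that benchmarking pattern (expensiveComputation stands in
for whatever is being measured):

long expensiveComputation(int n);   // hypothetical code under test

volatile int benchInput = 1000000;  // reading this defeats constant folding
volatile long benchSink;            // writing this can't be optimized away

void runBenchmark()
{
    int n = benchInput;                  // value unknown to the optimizer
    benchSink = expensiveComputation(n); // result must really be computed
}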

Juha Nieminen

Sep 22, 2022, 2:03:26 AM
memcpy requires you to know the length of the string in advance, which
you often don't otherwise need, and obtaining it with strlen() would
require traversing the string twice in order to copy it. (Why traverse
it twice? You can copy it while traversing it for the first time!)

Juha Nieminen

Sep 22, 2022, 2:09:29 AM
Richard Damon <Ric...@damon-family.org> wrote:
>> Except for the quibble that a null in the source string is respected --
>> i.e. the destination is considered to be a fixed-width field but not the
>> source.
>>
>
> Yes, it is to copy a "C String" (Null Terminated) into a fixed width field.

Then it should have been named something entirely different. As it is now
it gets extremely easily confused with strcpy(), as if it were a "safer"
variant of it, in the same way as strncat() is a "safer" variant of
strcat().

David Brown

Sep 22, 2022, 2:56:39 AM
On 22/09/2022 04:40, Bonita Montero wrote:
> On 21.09.2022 at 23:14, David Brown wrote:
>
>>> Before C++11, volatile was partially used where today you use atomics.
>
>> If you used "volatile" thinking you got the effects of atomic access,
>> you were wrong. ...
>
> No, you could use volatile in an implementation-defined way, and it did
> partially the same as atomic.
>

What do you think "volatile" does, as a qualifier for accesses?

What do you think "atomic" means - both in terms of what the C and C++
standards say, and what the term means more generally in programming?

There is a degree of overlap, but they are not the same thing.

>>>> Following a memcpy() or memset() with a memory barrier (a relaxed
>>>> order full fence) is certainly a possibility.  But such fences can
>>>> be  a lot more expensive than using volatile writes.
>
>>> volatile writes can't be substituted with fences.
>
>> A fence implies a memory barrier - ...
>
> It's a different word for the same thing.
>

No, they are different things. Atomic fences are about synchronisation
between different threads - they ensure that threads running on
different cores see a consistent picture of the relevant data in memory.
Memory barriers are about ordering of the /local/ view of memory reads
and writes.


David Brown

Sep 22, 2022, 3:33:07 AM
On 22/09/2022 07:59, Juha Nieminen wrote:
> David Brown <david...@hesbynett.no> wrote:
>> If you use "volatile" when you mean "atomic", your code is wrong.
>
> Yeah. Some programmers might have mistakenly thought that 'volatile'
> means the same thing as "atomic", and thus "thread-safe". However,
> programmers who know what they are doing also know that it isn't
> anything like that.
>
> I think that the exact semantics of 'volatile' might be largely
> implementation-defined, but AFAIK in most if not all compilers
> (especially gcc and clang) it effectively acts as an optimization
> barrier. It tells the compiler "any access to this must never be
> optimized away, nor moved to happen somewhere else in the code".

Yes. You can think of "volatile" as telling the compiler "there are
things you don't know about that mean you can't apply as-if
optimisations here".

The implementation-defined nature of volatile accesses is unavoidable.
If you have, say, a volatile write to a uint32_t variable then most
32-bit or 64-bit systems will implement it as a single write. A 16-bit
system will have to break it into two writes. An original Alpha would
do a 64-bit read, a modify, then a 64-bit write. Most systems will
attempt a single write even if the address is unaligned, others might
break it down or the hardware might cause a trap on the unaligned
access. Accesses to bitfields have multiple possible implementations.
All in all, there's a lot that can't be specified in the standards.

Then there are read-write-modify accesses such as "v++;" or "v |=
0x0100;". Some architectures can do these atomically, others would need
bus locks and interrupt disabling to be atomic.

(As an interesting aside, the C standards up to C17 only described
"access to volatile objects". Only in C17 did it change to
"volatile accesses", defining what is meant by using a
pointer-to-volatile cast to access data that had not been defined as
volatile. I don't know what the C++ standards say there - but I believe
that, as for C, every compiler handles volatile accesses as you might
expect.)


>
> In other words, if you eg. write a loop where you read a
> 'volatile' ten times, then the compiler must generate code that
> reads it ten times, inside that exact loop and nowhere else.
> The compiler must not optimize it to one single read or
> completely away (because it doesn't see anything changing it).

It can still unroll the loop. Similarly if you have:

volatile int v;

if (a) {
    v = 1;
} else {
    v = 2;
}

then the compiler can do:

int temp = a ? 1 : 2;
v = temp;

The run-time pattern of volatile accesses must match the "abstract
machine" exactly, but the pattern of the generated assembly need not.

>
> By far the most common (perhaps only) use of this is in embedded
> programming, especially on very small processors, where things
> like ports and CPU pins are mapped into RAM (which means that
> reading the same memory location may give different values at
> different times, and writing values there most definitely must
> never be optimized away).

Yes.

And for single-core processors (covering the great majority of
small-systems embedded programming), "volatile" is often sufficient in
many places where atomics would be needed in general. But you need to
be aware of its limitations here - it forms part of the solution, but
not necessarily all of it. (The gcc implementation of C11/C++11
atomics, at least in the gcc versions I have looked at, are dangerously
wrong for single core embedded systems.)

>
> There's another useful use of 'volatile': In benchmarking.
> Run the code that's to be benchmarked, and then assign the
> return value to a volatile. This stops the compiler from
> optimizing the whole thing away because it sees that the
> result isn't being used for anything. (It might not have
> optimized it away in the first place, but assigning the
> result to a volatile makes sure of that.)

Yes, that kind of use is convenient. You also have to ensure that the
calculation depends on a volatile variable, not just that the result is
written to one. It gives you a more self-contained test than reading a
starting input from the command line and printf'ing the result.

David Brown

Sep 22, 2022, 3:38:11 AM
On 22/09/2022 01:01, Keith Thompson wrote:

> If your code handles errors, always think about what
> that error handling will actually do.
>

That's good general advice for all programming - and something many
people don't consider deeply enough. Add to it that your error handling
code must be tested as well as the rest of your code. I've seen several
cases in practice where poor and untested error handling code turned a
glitch into a disaster.


(I agree with everything in the rest of your post - it's just that your
final sentence looked so good as a "tip of the day" !)

Bonita Montero

Sep 22, 2022, 6:02:14 AM
Am 22.09.2022 um 08:56 schrieb David Brown:

> What do you think "volatile" does, as a qualifier for accesses?

A volatile read or write is usually at least the same as a read or
write with memory_order_relaxed. So there are some guarantees
you can rely on.

> No, they are different things.  Atomic fences are about synchronisation
> between different threads - they ensure that threads running on
> different cores see a consistent picture of the relevant data in memory.
>  Memory barriers are about ordering of the /local/ view of memory reads
> and writes.

Fences and barriers mean the same thing. A fence or barrier makes
changes from a foreign thread become visible to this thread, or changes
from this thread become visible to other threads.



Keith Thompson

Sep 22, 2022, 2:45:08 PM
It's already been pointed out that the source argument to strncpy() does
not need to be a pointer to a (null-terminated) string.

And yes, the name is misleading (also already pointed out).

Andrey Tarasevich

Sep 22, 2022, 10:00:34 PM
Yes, by spec, it is a _conversion_ function. Its specific purpose is to
convert a zero-terminated source string to a fixed-width target string.

In modern usage (under assumption that nobody needs fixed-width strings
anymore), this function might still be usable for secure initialization
of sensitive data fields, when one wants to make sure that the old
string content of the data field is fully erased when the new string is
copied in.
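
For example (a sketch; the struct and field are invented):

#include <cstring>

struct Record {
    char name[32]; // fixed-width field, need not be null-terminated
};

void setName(Record &r, const char *newName)
{
    // Copies newName and zero-fills the rest of the field, so no
    // bytes of the previous contents survive in the tail.
    std::strncpy(r.name, newName, sizeof r.name);
}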

--
Best regards,
Andrey

Lynn McGuire

Sep 22, 2022, 10:04:06 PM
On 9/21/2022 5:04 AM, Juha Nieminen wrote:
> Well, *almost* never, at least.
>
> I always thought that std::strncpy() works exactly like std::strcpy(),
> except that it stops early if the specified count is reached. Turns out
> that I was gravely mistaken:
>
> "If, after copying the terminating null character from src, count is not
> reached, additional null characters are written to dest until the total
> of count characters have been written."
>
> This means that if you have, let's say, a 1 MB buffer into which you copy
> with std::strncpy() lots and lots of strings, the vast majority of them
> very short, expecting it to be efficient... turns out you'll be writing
> 1 MB worth of data every single time. Which will make the thing quite
> slow if you weren't aware of this.
>
> It's just better to do your own custom version of strncpy() that does
> what strncpy() should be doing, ie. just stop once the source string
> ends.
>
> I can't find any standard library (C or C++) function that does that,
> so you'll just have to write your own. (Luckily it's trivial to do.)
>
> And while you are at it, you might also want to fix this little problem:
>
> "If count is reached before the entire string src was copied, the
> resulting character array is not null-terminated."
>
> Perhaps return to the caller some value telling if the string was
> truncated.

Thanks for helping me to better understand strncpy. Of course, we use
it extensively. Although we use the "safe" version strncpy_s about
half the time (62 out of 125 uses).

Lynn

Andrey Tarasevich

Sep 22, 2022, 10:05:08 PM
On 9/22/2022 7:00 PM, Andrey Tarasevich wrote:
> On 9/21/2022 1:49 PM, Ben Bacarisse wrote:
>> Andrey Tarasevich <andreyta...@hotmail.com> writes:
>>
>>> On 9/21/2022 3:04 AM, Juha Nieminen wrote:
>>>> I always thought that std::strncpy() works exactly like std::strcpy(),
>>>> except that it stops early if the specified count is reached. Turns out
>>>> that I was gravely mistaken:
>>>>
>>>
>>> The matter has been explained, explained and over-explained to death
>>> already, including here in comp.lang.* newsgroups. It is well-known
>>> that `strncpy` has never been intended as a "safe string copying"
>>> function. It is a niche function introduced for so called
>>> "fixed-width" string support.
>>>
>>> https://stackoverflow.com/questions/2886931/difference-fixed-width-strings-and-zero-terminated-strings
>>>
>>>
>>> It has never been intended for use with zero-terminated strings.
>>
>> Except for the quibble that a null in the source string is respected --
>> i.e. the destination is considered to be a fixed-width field but not the
>> source.
>>
>
> Yes, by spec, it is a _conversion_ function. Its specific purpose is to
> convert a zero-terminated source string to a fixed-width target string.
>

... and yes, Keith Thompson makes a good point that the source does not
have to be a zero-terminated string. I.e. it can also be used for
fixed-width string copying.

One can probably argue that it might be more efficient than plain
`memcpy`, since `memcpy` would copy the original zeroes with a
memory-to-memory operation, while `strncpy` would fill the tail portion
of the string with "its own" zeros instead.

--
Best regards,
Andrey

Richard Damon

Sep 22, 2022, 11:34:08 PM
Maybe, but it was named LONG ago in the infancy of the language, so that
is water under the bridge.

One thing to remember: the n versions weren't so much designed as
"safer" versions, but versions for a special purpose.

They may work as safer versions, but I don't think that was a major goal.

Keith Thompson

Sep 23, 2022, 1:50:08 AM
Andrey Tarasevich <andreyta...@hotmail.com> writes:
[...]
> ... and yes, Keith Thompson makes a good point that the source does
> not have to be a zero-terminated string. I.e. it can also be used for
> fixed-width string copying.

To be fair, it was Ben Bacarisse who pointed this out -- *after* I had
incorrectly stated (based on a faulty man page) that the source has to be
a pointer to a null-terminated string.

David Brown

Sep 23, 2022, 7:24:00 AM
On 22/09/2022 12:02, Bonita Montero wrote:
> On 22.09.2022 at 08:56, David Brown wrote:
>
>> What do you think "volatile" does, as a qualifier for accesses?
>
> A volatile read or write is usually at least the same as a read or
> write with memory_order_relaxed. So there are some guarantees
> you can rely on.

The key point of an atomic access, above all else, is that it is an
all-or-nothing access. If one thread writes to an atomic object, and
another thread reads it, then the reading thread will see either the
complete old data or the complete new data.

"Volatile" does not give you that guarantee.

The key points regarding volatile accesses are that the compiler cannot
assume it knows about the way the target memory is used - it may be read
or written independently of the program code - and that the accesses are
"observable behaviour". Thus every volatile access must be done exactly
as it is in the "abstract machine" that defines the language - with the
same values, same number of accesses, same order of accesses in respect
to other volatile accesses.

"Atomic" does not have those semantics. The compiler can combine two
relaxed atomic writes to one. It can do some re-ordering regarding
atomics and normal accesses, and even across volatile accesses. (For
atomic memory access stricter than "relaxed", there are more
restrictions on ordering.)

This is why the C11 "atomic_store" and "atomic_load" functions take a
pointer to volatile atomic as their parameter, not a pointer to atomic.

>
>> No, they are different things.  Atomic fences are about
>> synchronisation between different threads - they ensure that threads
>> running on different cores see a consistent picture of the relevant
>> data in memory.   Memory barriers are about ordering of the /local/
>> view of memory reads and writes.
>
> Fences and barriers mean the same. A fence or barrier makes that changes
> from a foreign thread become visible to another thread or changes from a
> thread become visible for other threads.
>

The C standards (I refer to them as they are simpler and clearer than
the C++ standards, but the memory model is the same) say that an
"atomic_thread_fence(memory_order_relaxed)" has no effect. This is
rather different from a memory barrier, which requires compiler-specific
extensions and which can be viewed roughly as a kind of "cache flush" in
which the "cache" is the processor registers along with any information
the compiler knows about any objects.

Again, the non-relaxed fences will likely have a memory barrier effect
in practice - but they do so at a significantly higher cost than a
memory barrier.
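
For illustration, a pure compiler-level barrier in gcc/clang looks
like this (a compiler extension, not standard C or C++):

// Forces the compiler to assume all memory may have been read or
// modified here; it emits no instruction of its own.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

// By contrast, std::atomic_thread_fence(std::memory_order_seq_cst)
// typically emits a real hardware fence instruction as well
// (e.g. MFENCE on x86).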


Bonita Montero

Sep 23, 2022, 7:49:00 AM
On 23.09.2022 at 13:23, David Brown wrote:

> The key point of an atomic access, above all else, is that it is an
> all-or-nothing access.  If one thread writes to an atomic object, and
> another thread reads it, then the reading thread will see either the
> complete old data or the complete new data.

Although it is possible, no one actually uses atomics for non-native
types. And for native types what I said holds true against volatiles.

> "Atomic" does not have those semantics.  The compiler can combine two
> relaxed atomic writes to one.

Cite the standard.

> In the C standards (I refer to them as they are simpler and clearer than
> the C++ standards, but the memory model is the same) say that an
> "atomic_thread_fence(memory_order_relaxed)" has no effect.  This is
> rather different from a memory barrier, which requires compiler-specific
> extensions and which can be viewed roughly as a kind of "cache flush" in
> which the "cache" is the processor registers along with any information
> the compiler knows about any objects.

That's pettifogging since no one uses fences which actually won't work.


David Brown

Sep 23, 2022, 9:25:42 AM
On 23/09/2022 13:49, Bonita Montero wrote:
> On 23.09.2022 at 13:23, David Brown wrote:
>
>> The key point of an atomic access, above all else, is that it is an
>> all-or-nothing access.  If one thread writes to an atomic object, and
>> another thread reads it, then the reading thread will see either the
>> complete old data or the complete new data.
>
> Although it is possible no one actually uses atomics for non-native
> types. And for native types what I said holds true against volatiles.
>

No.

>> "Atomic" does not have those semantics.  The compiler can combine two
>> relaxed atomic writes to one.
>
> Cite the standard.
>

"As if" rule.

>> In the C standards (I refer to them as they are simpler and clearer
>> than the C++ standards, but the memory model is the same) say that an
>> "atomic_thread_fence(memory_order_relaxed)" has no effect.  This is
>> rather different from a memory barrier, which requires
>> compiler-specific extensions and which can be viewed roughly as a kind
>> of "cache flush" in which the "cache" is the processor registers along
>> with any information the compiler knows about any objects.
>
> That's pettifogging since no one uses fences which actually won't work.
>

They /do/ work - they just do what they are supposed to do, not what you
think they should do.

Bonita Montero

Sep 23, 2022, 10:03:15 AM
On 23.09.2022 at 15:25, David Brown wrote:
> On 23/09/2022 13:49, Bonita Montero wrote:
>> On 23.09.2022 at 13:23, David Brown wrote:
>>
>>> The key point of an atomic access, above all else, is that it is an
>>> all-or-nothing access.  If one thread writes to an atomic object, and
>>> another thread reads it, then the reading thread will see either the
>>> complete old data or the complete new data.
>>
>> Although it is possible no one actually uses atomics for non-native
>> types. And for native types what I said holds true against volatiles.
>>
>
> No.

If you use non-native types with atomics they use STM, and that's
really slow. That's why no one uses atomics for non-native types.

>
>>> "Atomic" does not have those semantics.  The compiler can combine two
>>> relaxed atomic writes to one.
>>
>> Cite the standard.
>>
>
> "As if" rule.
>
>>> In the C standards (I refer to them as they are simpler and clearer
>>> than the C++ standards, but the memory model is the same) say that an
>>> "atomic_thread_fence(memory_order_relaxed)" has no effect.  This is
>>> rather different from a memory barrier, which requires
>>> compiler-specific extensions and which can be viewed roughly as a
>>> kind of "cache flush" in which the "cache" is the processor registers
>>> along with any information the compiler knows about any objects.
>>
>> That's pettifogging since no one uses fences which actually won't work.
>>
>
> They /do/ work - they just do what they are supposed to do, not what you
> think they should do.

No, they have actually no effect.



Scott Lurndal

Sep 23, 2022, 10:34:36 AM
Bonita Montero <Bonita....@gmail.com> writes:
>On 23.09.2022 at 15:25, David Brown wrote:

>> They [memory barriers] /do/ work - they just do what they are supposed to do, not what you
>> think they should do.
>
>No, they have actually no effect.

If that were the case, then there would be no need for memory
barrier instructions, right?

So why are they present in all architectures (x86, arm, mips, ppc, ia64?)

Hint: because they actually _do_ have an effect, moreso in arm/mips/ppc
than in x86, but they are actually required in x86 as well; I recall a
linux kernel bug from a decade ago in the TCP stack related to skb
management where a missing memory barrier caused all kinds of havoc
even in the soi-disant 'strongly-ordered' x86 memory model.

Mut...@dastardlyhq.com

Sep 23, 2022, 10:47:43 AM
On Wed, 21 Sep 2022 18:27:30 +0200
Bonita Montero <Bonita....@gmail.com> wrote:
>On 21.09.2022 at 18:19, Mut...@dastardlyhq.com wrote:
>> On Wed, 21 Sep 2022 18:10:32 +0200
>> Bonita Montero <Bonita....@gmail.com> wrote:
>>> On 21.09.2022 at 17:42, Mut...@dastardlyhq.com wrote:
>>>> On Wed, 21 Sep 2022 15:24:01 +0200
>>>> David Brown <david...@hesbynett.no> wrote:
>>>>> Several of the str* functions in the C standard library are downright
>>>>> silly. There are surprising inconsistencies (strncat guarantees a
>>>>> null-terminated result, strncpy does not), mostly useless return values,
>>>>> and missing functions (no strnlen). There are no volatile versions
>>>>> which could be used to ensure that something like an "memset" to wipe
>>>>> memory would actually be run. Then there are the myths and abuses that
>>>>> are common in real-world code, such as assumptions that "memcpy" runs
>>>>> forwards or works like "memmove". And there are no versions of the
>>>>> moves or copies that work on bigger block sizes - you are dependent on
>>>>> the quality of the compiler to figure out when larger sizes can be used.
>>>>
>>>> Also with some compilers the follow works:
>>>>
>>>> snprintf(mystr,some_max_len,"%s etc etc",mystr, etc etc)
>>>>
>>>> and with some it just produces garbage in mystr. Which is annoying as
>>>> it saves a lot of mucking about with strcat when forced to use plain C.
>>>
>>> There's no real purpose for things like that so that it doesn't make
>>> sense to think about such things.
>>
>> You don't need to keep demonstrating that you've never done any programming
>> outside of your ivory tower and certainly not in C, we already know.
>>
>
>If you do things like the above you're in the highest ivory tower ever.

Uh huh, whatever you say genius.

Michael S

Sep 23, 2022, 11:19:54 AM
On Friday, September 23, 2022 at 5:34:36 PM UTC+3, Scott Lurndal wrote:
> Bonita Montero <Bonita....@gmail.com> writes:
> >On 23.09.2022 at 15:25, David Brown wrote:
> >> They [memory barriers] /do/ work - they just do what they are supposed to do, not what you
> >> think they should do.
> >
> >No, they have actually no effect.
> If that were the case, then there would be no need for memory
> barrier instructions, right?
>
> So why are they present in all architectures (x86, arm, mips, ppc, ia64?)
>
> Hint: because they actually _do_ have an effect, moreso in arm/mips/ppc
> than in x86,

Actually, less so on MIPS than on x86.
On MIPS, I think, memory barriers are needed only in code that mixes normal,
i.e. write-back cached, memory accesses (WB) with special I/O-type accesses
like uncached (UC) and especially write-combined (WC).
On x86 it's the main reason for barriers, too, but there are subtle cases
where an MB is needed on pure WB areas as well.
That's because the default memory ordering on x86 (and SPARC) is TSO rather
than SC.
However in practice on x86 WB areas people normally use the implied
barriers that are present in all RMW and CAS instructions with a LOCK prefix.

Bonita Montero

Sep 23, 2022, 12:36:24 PM
On 23.09.2022 at 16:34, Scott Lurndal wrote:
> Bonita Montero <Bonita....@gmail.com> writes:
>> On 23.09.2022 at 15:25, David Brown wrote:
>
>>> They [memory barriers] /do/ work - they just do what they are supposed to do, not what you
>>> think they should do.
>>
>> No, they have actually no effect.
>
> If that were the case, then there would be no need for memory
> barrier instructions, right?

Read what I said in context!

Philipp Klaus Krause

Sep 23, 2022, 1:50:30 PM
On 22.09.22 at 08:03, Juha Nieminen wrote:
How about using memccpy? It is in C23, don't know about C++.
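
Assuming a C23 (or POSIX) library, a sketch of how memccpy covers the
use case - it copies at most n bytes, stops after the first '\0', and
returns a null pointer if no '\0' was found:

#include <string.h>

// Copy src into dest (capacity n), always null-terminating.
// Returns true if everything fit, false on truncation.
bool copyString(char *dest, size_t n, const char *src)
{
    if (n == 0) return false;
    void *end = memccpy(dest, src, '\0', n);
    if (end == NULL) {     // no '\0' within n bytes: truncated
        dest[n - 1] = '\0';
        return false;
    }
    return true;
}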


Philipp Klaus Krause

Sep 23, 2022, 1:52:51 PM
On 21.09.22 at 15:24, David Brown wrote:

>
> There are no volatile versions
> which could be used to ensure that something like an "memset" to wipe
> memory would actually be run.
Are you looking for C23 memset_explicit?


Scott Lurndal

Sep 23, 2022, 1:57:38 PM
Michael S <already...@yahoo.com> writes:
>On Friday, September 23, 2022 at 5:34:36 PM UTC+3, Scott Lurndal wrote:
>> Bonita Montero <Bonita....@gmail.com> writes:
>> >On 23.09.2022 at 15:25, David Brown wrote:
>> >> They [memory barriers] /do/ work - they just do what they are supposed to do, not what you
>> >> think they should do.
>> >
>> >No, they have actually no effect.
>> If that were the case, then there would be no need for memory
>> barrier instructions, right?
>>
>> So why are they present in all architectures (x86, arm, mips, ppc, ia64?)
>>
>> Hint: because they actually _do_ have an effect, moreso in arm/mips/ppc
>> than in x86,
>
>Actually, less so on MIPS than on x86.
>On MIPS, I think, memory barriers are needed only in code that mixes normal,
>i.e. write-back cached, memory accesses (WB) with special I/O-type accesses
>like uncached (UC) and especially write-combined (WC).

I was thinking specifically of kseg 0/1 aliases, which are uncached.

>On x86 it's the main reason for barriers, too, but there are subtle cases
>where an MB is needed on pure WB areas as well.
>That's because the default memory ordering on x86 (and SPARC) is TSO rather than SC.
>However in practice on x86 WB areas people normally use the implied
>barriers that are present in all RMW and CAS instructions with a LOCK prefix.

I missed Alpha in the list above, many of the alpha architects and
engineers ended up at Cavium designing their version of MIPS processors.

Bonita Montero

Sep 24, 2022, 5:38:21 AM
On 21.09.2022 at 22:23, Andrey Tarasevich wrote:
> On 9/21/2022 3:04 AM, Juha Nieminen wrote:
>>
>> I always thought that std::strncpy() works exactly like std::strcpy(),
>> except that it stops early if the specified count is reached. Turns out
>> that I was gravely mistaken:
>>
>
> The matter has been explained, explained and over-explained to death
> already, including here in comp.lang.* newsgroups. It is well-known that
> `strncpy` has never been intended as a "safe string copying" function.
> It is a niche function introduced for so called "fixed-width" string
> support.
>
> https://stackoverflow.com/questions/2886931/difference-fixed-width-strings-and-zero-terminated-strings
>
> It has never been intended for use with zero-terminated strings.
>

The problem with that is that you can't reliably pass a string copied
like that to any C API that requires a null-terminated string. It would
have been better if strncpy() always returned a null-terminated
string, making the last character zero if necessary.
For me this API looks mostly useless.


Mut...@dastardlyhq.com

Sep 24, 2022, 5:44:18 AM
The C string API is also inconsistent, because snprintf() always adds a
terminating null whether the arguments fit into the string or not.

Bonita Montero

Sep 24, 2022, 5:59:15 AM
I don't consider this inconsistent, because one API is unusable and
thereby almost non-existent.


Richard Damon

Sep 24, 2022, 6:28:04 AM
And that is because strncpy wasn't intended to create a "C-API" string,
but to fill a char[N] field in a struct.

There were many spots where those existed without a required terminating NULL.

David Brown

Sep 24, 2022, 10:23:01 AM
On 23/09/2022 16:03, Bonita Montero wrote:
> On 23.09.2022 at 15:25, David Brown wrote:
>> On 23/09/2022 13:49, Bonita Montero wrote:
>>> On 23.09.2022 at 13:23, David Brown wrote:
>>>
>>>> The key point of an atomic access, above all else, is that it is an
>>>> all-or-nothing access.  If one thread writes to an atomic object,
>>>> and another thread reads it, then the reading thread will see either
>>>> the complete old data or the complete new data.
>>>
>>> Although it is possible no one actually uses atomics for non-native
>>> types. And for native types what I said holds true against volatiles.
>>>
>>
>> No.
>
> If you use non-native types with atomics they use STM, and that's
> really slow. That's why no one uses atomics for non-native types.

There are a number of ways to implement atomics that are larger than a
single bus operation can handle, or that involve multiple bus
operations. Different processor types have different solutions, and
some are optimised or limited to particular setups (such as single
writer, single processor, etc.). Some processors can handle atomic
accesses for sizes that are bigger than native C/C++ types (such as
128-bit accesses). Some cannot handle atomic writes for the bigger
native types.

And non-aligned accesses might be supported for volatile accesses, while
not being atomic.

In summary - you are making so many assumptions it is easiest just to
say you are wrong.


>
>>
>>>> "Atomic" does not have those semantics.  The compiler can combine
>>>> two relaxed atomic writes to one.
>>>
>>> Cite the standard.
>>>
>>
>> "As if" rule.
>>
>>>> In the C standards (I refer to them as they are simpler and clearer
>>>> than the C++ standards, but the memory model is the same) say that
>>>> an "atomic_thread_fence(memory_order_relaxed)" has no effect.  This
>>>> is rather different from a memory barrier, which requires
>>>> compiler-specific extensions and which can be viewed roughly as a
>>>> kind of "cache flush" in which the "cache" is the processor
>>>> registers along with any information the compiler knows about any
>>>> objects.
>>>
>>> That's pettifogging since no one uses fences which actually won't work.
>>>
>>
>> They /do/ work - they just do what they are supposed to do, not what
>> you think they should do.
>
> No, they have actually no effect.
>

Perhaps some of your misconceptions are valid within your limited little
world of x86 programming on Windows with MSVC. There is a wider world
out there, and even if you never enter it, please stop making posts in a
general C++ newsgroup full of invalid assumptions that only applies to
such a limited subset of C++.

David Brown

Sep 24, 2022, 10:25:30 AM
Yes - although I'd forgotten the name. It will certainly solve one of
the functions I see as missing.

Bonita Montero

Sep 24, 2022, 10:25:41 AM
On 24.09.2022 at 16:22, David Brown wrote:

> There are a number of ways to implement atomics that are larger than
> a single bus operation can handle, or that involve multiple bus
> operations. ...

It's always done by STM, and that's slow.

> Perhaps some of your misconceptions are valid within your limited little
> world of x86 programming on Windows with MSVC.  There is a wider world
> out there, and even if you never enter it, please stop making posts in a
> general C++ newsgroup full of invalid assumptions that only applies to
> such a limited subset of C++.

I was referring to what you said and you forgot what you said.



David Brown

Sep 24, 2022, 11:10:14 AM
On 24/09/2022 16:25, Bonita Montero wrote:
> On 24.09.2022 at 16:22, David Brown wrote:
>
>> There are a number of ways to implement atomics that are larger than
>> a  single bus operation can handle, or that involve multiple bus
>> operations. ...
>
> It's always done by STM, and that's slow.

No, it is not. You really have no idea about these things - repeating
yourself does not make it any less myopic. I think I'm done trying to
explain them to you.

Bonita Montero

Sep 24, 2022, 11:14:28 AM
On 24.09.2022 at 17:09, David Brown wrote:
> On 24/09/2022 16:25, Bonita Montero wrote:
>> On 24.09.2022 at 16:22, David Brown wrote:
>>
>>> There are a number of ways to implement atomics that are larger than
>>> a  single bus operation can handle, or that involve multiple bus
>>> operations. ...
>>
>> It's always done by STM, and that's slow.
>
> No, it is not. ...

I checked the disassembly of atomics beyond native types with
MSVC, clang (Windows / Linux) and g++ - you didn't.
If you don't use STM you use the kernel, and then the atomic operation
never fails. Try that out yourself: atomics beyond native types
can fail, so you have STM.


Keith Thompson

Sep 24, 2022, 2:53:27 PM
Richard Damon <Ric...@Damon-Family.org> writes:
[...]
> And that is because strncpy wasn't intended to create a "C-API"
> string, but to fill a char[N] field in a struct.
>
> There were many spots those existed without a required terminating NULL.

NUL, null, null character, or '\0', not NULL (which is a null pointer
constant).

Bonita Montero

Sep 24, 2022, 2:58:38 PM
On 24.09.2022 at 20:53, Keith Thompson wrote:
> Richard Damon <Ric...@Damon-Family.org> writes:
> [...]
>> And that is because strncpy wasn't intended to create a "C-API"
>> string, but to fill a char[N] field in a struct.
>>
>> There were many spots those existed without a required terminating NULL.
>
> NUL, null, null character, or '\0', not NULL (which is a null pointer
> constant).
>

I wouldn't have understood Richard without your help.


Juha Nieminen

Sep 26, 2022, 3:53:29 AM
David Brown <david...@hesbynett.no> wrote:
> There are a number of ways to implement atomics that are larger than a
> single bus operation can handle, or that involve multiple bus
> operations. Different processor types have different solutions, and
> some are optimised or limited to particular setups (such as single
> writer, single processor, etc.). Some processors can handle atomic
> accesses for sizes that are bigger than native C/C++ types (such as
> 128-bit accesses). Some cannot handle atomic writes for the bigger
> native types.

I think there's a bit of confusion here about what the term "atomic"
means. You seem to be talking about a concept of "atomic" with some kind
of meaning like "mutual exclusion supported by the CPU itself".

That's not what "atomic" means in general, when talking about
multithreaded programming. In general "atomic" merely means that the
resource in question can only be accessed by one thread at a time
(in other words, it implements some sort of mutual exclusion).

As a concrete example: POSIX requires that fwrite() be atomic (for a
particular FILE object). This means that no two threads can write
to the same FILE object with a singular fwrite() call at the same
time. In other words, fwrite() implements (at some level) some kind
of (per FILE object) mutex.

"Atomic" is actually a stronger guarantee than merely "thread-safe".

If fwrite() were merely guaranteed to be "thread-safe", it would just
mean that it won't break (eg. corrupt its internal state, or any other
data anywhere else) if two threads call it at the same time, but it
wouldn't guarantee that the data written by those two threads won't
be interleaved somehow.

However, since fwrite() is "atomic", not just "thread-safe" (if it
conforms to POSIX), then it implements a mutex for the entire function
call (for that particular FILE object).
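
A minimal illustration of that guarantee (note it is POSIX, not ISO C++,
that promises it; run this and inspect out.txt - the A and B records may
come out in any order, but never interleaved mid-record):
________________________
#include <cstdio>
#include <thread>

// Each thread emits complete records with single fwrite() calls.
void writer(std::FILE* f, char tag) {
    char line[] = "x: one complete record\n";
    line[0] = tag;
    for (int i = 0; i < 1000; ++i)
        std::fwrite(line, 1, sizeof line - 1, f);
}

int main() {
    std::FILE* f = std::fopen("out.txt", "w");
    if (!f) return 1;
    std::thread a(writer, f, 'A');
    std::thread b(writer, f, 'B');
    a.join();
    b.join();
    std::fclose(f);
}
________________________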

Juha Nieminen

Sep 26, 2022, 3:55:11 AM
That sounds like it would be the exact tool for the job.

Bonita Montero

Sep 26, 2022, 4:21:35 AM
Am 26.09.2022 um 09:53 schrieb Juha Nieminen:
> David Brown <david...@hesbynett.no> wrote:
>> There are a number of ways to implement atomics that are larger than a
>> single bus operation can handle, or that involve multiple bus
>> operations. Different processor types have different solutions, and
>> some are optimised or limited to particular setups (such as single
>> writer, single processor, etc.). Some processors can handle atomic
>> accesses for sizes that are bigger than native C/C++ types (such as
>> 128-bit accesses). Some cannot handle atomic writes for the bigger
>> native types.
>
> I think there's a bit of confusion here about what the term "atomic"
> means. You seem to be talking about a concept of "atomic" with some
> kind of meaning like "mutual exclusion supported by the CPU itself".

You're confused, not David.



Juha Nieminen

Sep 26, 2022, 4:34:22 AM
Bonita Montero <Bonita....@gmail.com> wrote:
> You're confused, David not.

Just go away, asshole.

David Brown

Sep 26, 2022, 7:39:43 AM
On 26/09/2022 09:53, Juha Nieminen wrote:
> David Brown <david...@hesbynett.no> wrote:
>> There are a number of ways to implement atomics that are larger than a
>> single bus operation can handle, or that involve multiple bus
>> operations. Different processor types have different solutions, and
>> some are optimised or limited to particular setups (such as single
>> writer, single processor, etc.). Some processors can handle atomic
>> accesses for sizes that are bigger than native C/C++ types (such as
>> 128-bit accesses). Some cannot handle atomic writes for the bigger
>> native types.
>
> I think there's a bit of confusion here about what the term "atomic"
> means. You seem to be talking about a concept of "atomic" with some kind
> of meaning like "mutual exclusion supported by the CPU itself".
>

No, that's not what I am saying.

> That's not what "atomic" means in general, when talking about
> multithreaded programming. In general "atomic" merely means that the
> resource in question can only be accessed by one thread at a time
> (in other words, it implements some sort of mutual exclusion).

And that's not quite right either.

"Atomic" means that accesses are indivisible. As many threads as you
want can read or write to the data at the same time - the defining
feature is that there is no possibility of a partial access succeeding.

We've mostly mentioned reads and writes - but more complex transactions
can be atomic too, such as increments. The term can also apply to
collections of accesses, well-known from the database world. Such
atomic transactions need to be built on top of low-level atomic accesses
with locks, lock-free algorithms, or more advanced protocols such as
software transactional memory.


Atomic accesses do not have to be purely hardware implementations,
though that is the most efficient - and anything software-based is going
to depend on smaller hardware-based atomic accesses. By far the most
convenient accesses are when you can read or write the memory with
normal memory access instructions, or at most by using things such as a
"bus lock prefix" available on some processors. On RISC processors,
anything beyond a single read or write of a size handled directly by
hardware typically involves load-store-exclusive sequences.

When you have to use code sequences for access, then it's common that
you end up with mutual exclusion - one thread at a time has access. But
it doesn't have to be that way, and different software sequences can be
used to optimise different usage patterns. All that matters is that if
a read sequence exits happily saying "I've read the data", then the data
it read matches exactly the data that some thread wrote at some point.
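
One well-known example of such a read sequence is a seqlock. A sketch
for the single-writer case (names made up here; the fence placement
follows the usual published recipes):
________________________
#include <atomic>
#include <cstdint>

// A 128-bit value kept in two halves; the sequence counter makes the
// pair look indivisible to readers. Exactly one writer is assumed.
std::atomic<std::uint32_t> seq{0};   // even = stable, odd = write in progress
std::atomic<std::uint64_t> lo{0}, hi{0};

void write128(std::uint64_t l, std::uint64_t h) {
    std::uint32_t s = seq.load(std::memory_order_relaxed);
    seq.store(s + 1, std::memory_order_relaxed);      // now odd
    std::atomic_thread_fence(std::memory_order_release);
    lo.store(l, std::memory_order_relaxed);
    hi.store(h, std::memory_order_relaxed);
    seq.store(s + 2, std::memory_order_release);      // even again
}

bool read128(std::uint64_t& l, std::uint64_t& h) {    // caller retries on false
    std::uint32_t s0 = seq.load(std::memory_order_acquire);
    if (s0 & 1)
        return false;                                 // write in progress
    l = lo.load(std::memory_order_relaxed);
    h = hi.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return seq.load(std::memory_order_relaxed) == s0; // unchanged -> data good
}
________________________

Readers never block each other or the writer, and a successful read is
always a value the writer actually stored - yet there is no mutual
exclusion anywhere.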

>
> As a concrete example: POSIX requires that fwrite() be atomic (for a
> particular FILE object). This means that no two threads can write
> to the same FILE object with a singular fwrite() call at the same
> time. In other words, fwrite() implements (at some level) some kind
> of (per FILE object) mutex.
>

That's at a much higher level than has been under discussion here - but
yes, that is applying the same term and guarantees for different
purposes. (The "atomic" requirement does not force a mutex, but
fwrite() has other guarantees beyond mere atomicity.)

> "Atomic" is actually a stronger guarantee than merely "thread-safe".
>
> If fwrite() were merely guaranteed to be "thread-safe", it would just
> mean that it won't break (eg. corrupt its internal state, or any other
> data anywhere else) if two threads call it at the same time, but it
> wouldn't guarantee that the data written by those two threads won't
> be interleaved somehow.
>

"Thread safe" is not as well-defined a term as "atomic", as far as I see it.

> However, since fwrite() is "atomic", not just "thread-safe" (if it
> conforms to POSIX), then it implements a mutex for the entire function
> call (for that particular FILE object).

"Atomic" is not really enough to describe the behaviour of a function
like "fwrite", since the function does not act on a single "state". If
you have two threads trying to write A and B to the same object
simultaneously, atomicity means that a third thread reading the object
will see A or B, and never a mixture. It's fine if this is implemented
by a write of A then a write of B, a write of B then a write of A, a
write of A alone, a write of B alone, a lock blocking the thread then a
mix of A, B, C and D that gets sorted into one of A or B before the lock
is released, or any other combination. Clearly that is not the
behaviour you want from fwrite() - here there should be either A then B,
or B then A.
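
The "A or B, never a mixture" part is easy to see with std::atomic - a
small demo (the assert never fires, because a 64-bit atomic store is
indivisible):
________________________
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

std::atomic<std::uint64_t> x{0xAAAAAAAAAAAAAAAAull};

int main() {
    std::thread w1([] { for (int i = 0; i < 100000; ++i)
                            x.store(0xAAAAAAAAAAAAAAAAull); });
    std::thread w2([] { for (int i = 0; i < 100000; ++i)
                            x.store(0xBBBBBBBBBBBBBBBBull); });
    std::thread r([] { for (int i = 0; i < 100000; ++i) {
                           std::uint64_t v = x.load();
                           assert(v == 0xAAAAAAAAAAAAAAAAull ||
                                  v == 0xBBBBBBBBBBBBBBBBull);
                       } });
    w1.join(); w2.join(); r.join();
}
________________________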

Chris M. Thomasson

Sep 26, 2022, 8:34:40 PM
Huh?

Chris M. Thomasson

Sep 26, 2022, 8:48:14 PM
On 9/26/2022 4:39 AM, David Brown wrote:
> On 26/09/2022 09:53, Juha Nieminen wrote:
>> David Brown <david...@hesbynett.no> wrote:
>>> There are a number of ways to implement atomics that are larger than a
>>> single bus operation can handle, or that involve multiple bus
>>> operations.  Different processor types have different solutions, and
>>> some are optimised or limited to particular setups (such as single
>>> writer, single processor, etc.).  Some processors can handle atomic
>>> accesses for sizes that are bigger than native C/C++ types (such as
>>> 128-bit accesses).  Some cannot handle atomic writes for the bigger
>>> native types.
>>
>> I think there's a bit of confusion here about what the term "atomic"
>> means. You seem to be talking about a concept of "atomic" with some kind
>> of meaning like "mutual exclusion supported by the CPU itself".
>>
>
> No, that's not what I am saying.
>
>> That's not what "atomic" means in general, when talking about
>> multithreaded programming. In general "atomic" merely means that the
>> resource in question can only be accessed by one thread at a time
>> (in other words, it implements some sort of mutual exclusion).
>
> And that's not quite right either.
>
> "Atomic" means that accesses are indivisible.

Exactly.


> As many threads as you
> want can read or write to the data at the same time - the defining
> feature is that there is no possibility of a partial access succeeding.
[...]

Atomic to me, say an RMW sequence:

<pseudo-code>
________________________
int g_value = 0;

// A RMW operation
int
fetch_add_busted(
    int& origin,
) {
    int result = origin;
    origin = result + 1;
    return result;
}
________________________


Well, there are problems with this. It's not atomic, and can give garbage
in multi-threaded environments... So, we can lock it up, using a
hash-based mutex algorithm (hashing on address into a table of mutexes).
I posted one in the past:

<pseudo-code>
________________________
int g_value = 0;

// A RMW operation
int
fetch_add(
    int& origin,
) {
    hash_lock(&origin);
    int result = origin;
    origin = result + 1;
    hash_unlock(&origin);
    return result;
}
________________________

Okay, we are atomic. There are other ways to get this done.

https://en.cppreference.com/w/cpp/atomic/atomic/fetch_add

;^)
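
For anyone who hasn't seen one, hash_lock / hash_unlock can be as simple
as this (a minimal sketch - the table size and hash are arbitrary
choices; unrelated addresses may share a mutex, which costs throughput
but not correctness):
________________________
#include <cstdint>
#include <mutex>

static std::mutex lock_table[64];

static std::mutex& lock_for(const void* p) {
    auto a = reinterpret_cast<std::uintptr_t>(p);
    return lock_table[(a >> 4) % 64];   // drop low bits, index the table
}

void hash_lock(const void* p)   { lock_for(p).lock(); }
void hash_unlock(const void* p) { lock_for(p).unlock(); }
________________________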

Scott Lurndal

Sep 27, 2022, 10:17:41 AM
GCC has had built-ins to generate atomic accesses (e.g. __sync_fetch_and_add)
for many years now.

On intel/amd these generate lock prefixes, on other architectures
with atomic support (e.g. ARMv8 LDADD, et alia) those instructions
will be generated.
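
For example (a GCC-style builtin, also accepted by clang; on x86 this
typically becomes a lock xadd):
________________________
long counter = 0;

long next_ticket() {
    // atomically returns the old value and adds 1
    return __sync_fetch_and_add(&counter, 1);
}
________________________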

David Brown

Sep 27, 2022, 12:07:57 PM
For more demanding cases - sizes larger than the hardware supports
directly, or read-modify-write on RISC - gcc uses a library that does
much the same as Chris has shown here.  The locks are simple busy-wait
user-space spin locks on an atomic flag, which are very efficient on
most systems (especially in the common case of no contention).
Unfortunately, this solution is worse than useless in some cases, such
as real-time systems and single-core systems.

Chris M. Thomasson

Sep 27, 2022, 3:42:40 PM
Yes. Using an address-based hashed locking scheme works in case the
arch does not support the direct CPU instruction(s) (think CAS vs LL/SC)
for an atomic RMW operation. However, the locking emulation is most
definitely not ideal. Not lock-free, indeed. When the arch supports it,
the compiler should be using lock-free operations wrt:

https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free

Using a hashed locking scheme for the atomic fetch-add impl would return
false wrt is_lock_free... Also, I forgot to add the rest of the
fetch-add, wrt the god damn dangling comma. Notice the original pseudo
code I posted upthread?
________________________
// A RMW operation
int
fetch_add_busted(
    int& origin,
) {
    int result = origin;
    origin = result + 1;
    return result;
}
________________________


Here as well:
________________________
// A RMW operation
int
fetch_add(
    int& origin,
) {
    hash_lock(&origin);
    int result = origin;
    origin = result + 1;
    hash_unlock(&origin);
    return result;
}
________________________


Humm... WTF? Let me correct them:
________________________
// A RMW operation
int
fetch_add_busted(
    int& origin,
    int addend
) {
    int result = origin;
    origin = result + addend;
    return result;
}
________________________


Corrected here as well:
________________________
// A RMW operation
int
fetch_add(
    int& origin,
    int addend
) {
    hash_lock(&origin);
    int result = origin;
    origin = result + addend;
    hash_unlock(&origin);
    return result;
}
________________________


Sorry about that nonsense, David: wrt the dangling comma. I forgot to
introduce the addend for the fetch-add RMW operation.

Shit happens. :^)

David Brown

Sep 28, 2022, 3:32:31 AM
LL/SC /is/ a locking scheme - using a hardware lock. And neither CAS
nor LL/SC work for RMW or even plain write operations that are bigger
than the processor can handle in a single write action.

> However, the locking emulation is most
> definitely, not ideal. Not lock-free, indeed.

Processors can generally handle lock-free atomic access of a single
object of limited size - usually the natural width for the processor.
Some processors have instructions for double-width atomic accesses (such
as a double compare-and-swap). And sometimes instruction sequences,
such as LL/SC with loops, are needed - especially for RMW.

Lock-free algorithms beyond that are for specific data structures. You
can't make lock-free atomic access to a 32-byte object.  You either have
to use locks (as will be done by a std::atomic<> for the type, or by the
C11 _Atomic qualifier), or, if you want lock-free access, you have to
wrap it all up in a more advanced structure, using something like a
lock-free atomic pointer to the "current" version of the data allocated
on a heap.
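
A sketch of that last idea - readers chase an atomic pointer to the
current immutable version (names made up; reclamation of old versions is
deliberately ignored, which real code must solve with RCU, hazard
pointers, or similar):
________________________
#include <atomic>

struct Config { char data[32]; };              // the 32-byte payload

std::atomic<const Config*> current{nullptr};   // the pointer is lock-free

void publish(const Config& c) {
    const Config* p = new Config(c);           // build the new version
    current.store(p, std::memory_order_release);
    // NOTE: the previous version leaks here - see the caveat above.
}

const Config* snapshot() {
    return current.load(std::memory_order_acquire);
}
________________________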

>
> Sorry about that non-sense David: Wrt the dangling comma. Forgot to
> introduce the addend for the fetch-add RMW operation.
>
> Shit happens. :^)
>

That's just minor detail, so not a problem at all.

Chris M. Thomasson

Sep 28, 2022, 4:35:16 PM
Correct. Imvho, the hardware itself is a _lot_ more efficient at these
types of things... Agreed in a sense? I actually prefer pessimistic CAS
over optimistic primitives like LL/SC. Iirc, an LL/SC can fail just by
reading from the reservation granule. Let alone writing to it... PPC had
a special section in its docs that explains the possible issue of a live
lock. Iirc, even CAS has some special logic in the processor that can
actually assert a bus lock.


>> However, the locking emulation is most definitely, not ideal. Not
>> lock-free, indeed.
>
> Processors can generally handle lock-free atomic access of a single
> object of limited size - usually the natural width for the processor.
> Some processors have instructions for double-width atomic accesses (such
> as a double compare-and-swap).  And sometimes instruction sequences,
> such as LL/SC with loops, are needed - especially for RMW.

Afaict, DWCAS is there to help get around the ABA problem ala IBM sysv
appendix, oh shit, I forgot the appendix number. I used to know it,
decades ago. I will try to find it.


> Lock-free algorithms beyond that are for specific data structures.  You
> can't make lock-free atomic access to a 32-byte object.  You either have
> to use locks (as will be done by a std::atomic<> for the type, or by the
> C11 _Atomic qualifier), or, if you want lock-free access, you have to
> wrap it all up in a more advanced structure, using something like a
> lock-free atomic pointer to the "current" version of the data allocated
> on a heap.

Agreed. Although I have created lock-free allocators that never used
dynamic memory, believe it or not. Everything exists on threads' stacks.
And memory from thread A could be "freed" by another thread. I remember
a project I had to do for a Quadros-based system. Completely based on
stacks. Wow, what a time.


>> Sorry about that non-sense David: Wrt the dangling comma. Forgot to
>> introduce the addend for the fetch-add RMW operation.
>>
>> Shit happens. :^)
>>
>
> That's just minor detail, so not a problem at all.

Thanks. :^)

Scott Lurndal

Sep 28, 2022, 5:11:22 PM
"Chris M. Thomasson" <chris.m.t...@gmail.com> writes:
>On 9/28/2022 12:32 AM, David Brown wrote:
>> On 27/09/2022 21:42, Chris M. Thomasson wrote:

>>> Yes. Using an address based hashed locking scheme works just in case
>>> the arch does not support the direct CPU instruction(s) (think CAS vs
>>> LL/SC) for an atomic RMW operation.
>>
>> LL/SC /is/ a locking scheme - using a hardware lock.  And neither CAS
>> nor LL/SC work for RMW or even plain write operations that are bigger
>> than the processor can handle in a single write action.
>
>Correct. Imvho, the hardware itself is a _lot_ more efficient at these
>types of things... Agreed in a sense? I actually prefer pessimistic CAS
>over optimistic primitives like LL/SC. Iirc, an LL/SC can fail just by
>reading from the reservation granule. Let alone writing to it... PPC had
>a special section in its docs that explains the possible issue of a live
>lock. Iirc, even CAS has some special logic in the processor that can
>actually assert a bus lock.

When ARM was designing their 64-bit architecture (ARMv8) circa 2011/2, they only
provided a LL/SC equivalent (load-exclusive/store-exclusive). Their
architecture partners at the time quickly requested support for
real RMW atomics, which were added as part of the LSE (Large System ISA
Extensions). LDADD, LDCLR (and with complement), LDSET (or), LDEOR (xor),
LDSMAX (signed maximum), LDUMAX (unsigned maximum), LDSMIN, LDUMIN.

The processor fabric forwards the operation to the point of coherency
(e.g. the L2/LLC) for cachable memory locations and to the endpoint for
uncachable memory locations (e.g. a PCIexpress or CXL endpoint).
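
In portable code these are reached through the usual atomics; e.g. with
GCC or clang targeting ARMv8.1-A or later (-march=armv8.1-a), something
like this typically compiles to a single LDADD instead of an
exclusive-pair retry loop (exact codegen varies by compiler and flags):
________________________
#include <atomic>

std::atomic<long> n{0};

long bump() {
    return n.fetch_add(1, std::memory_order_relaxed);
}
________________________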

Manfred

Sep 30, 2022, 7:24:30 PM
On 9/22/2022 1:01 AM, Keith Thompson wrote:
> Juha Nieminen <nos...@thanks.invalid> writes:
>> Well, *almost* never, at least.
>>
[...]
>
> strncpy() is not poorly designed.
[...]
>
> It is poorly *named*.

Agreed.

> The name implies that, as strncat is a "safer"
> strcat, strncpy is a "safer" strcpy. Both strncat and strncpy let you
> specify the size of the target array, avoiding writing past the end of
> it, but strncpy treats its target as null-terminated string.
>
(likely typo, sounds like this last bit has been gotten backwards)


Keith Thompson

Sep 30, 2022, 7:43:31 PM
Yes, thank you. strncat() treats its target (but not its source) as a
null-terminated string, both before and after copying. strncpy() does
not.

Bonita Montero

Oct 1, 2022, 12:27:16 AM
Am 28.09.2022 um 23:11 schrieb Scott Lurndal:

> When ARM was designing their 64-bit architecture (ARMv8) circa 2011/2, they
> only provided a LL/SC equivalent (load-exclusive/store-exclusive). Their
> architecture partners at the time quickly requested support for
> real RMW atomics, which were added as part of the LSE (Large System ISA
> Extensions). LDADD, LDCLR (and with complement), LDSET (or), LDEOR (xor),
> LDSMAX (signed maximum), LDUMAX (unsigned maximum), LDSMIN, LDUMIN.

Eh, RMW can be emulated with LL/SC but not vice versa.
A CAS emulated by LL/SC isn't slower than a native CAS.
But atomic increments, decrements, ands, ors or whatever
emulated with LL/SC is sometimes slower.
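
The emulation in question has the familiar retry shape. In portable C++
it looks like this (a sketch; compare_exchange_weak maps onto roughly an
LDXR/STXR pair on pre-LSE ARM):
________________________
#include <atomic>

int emulated_fetch_add(std::atomic<int>& a, int addend) {
    int old = a.load(std::memory_order_relaxed);
    // on failure, 'old' is refreshed with the current value; retry
    while (!a.compare_exchange_weak(old, old + addend,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed))
        ;
    return old;
}
________________________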


Scott Lurndal

Oct 1, 2022, 1:02:05 PM
Who said anything about CAS[*]?


[*] For your edification, CAS on modern architectures isn't
handled by the CPU, but rather by the point of coherency (LLC
or PCI-Express/CXL endpoint). Something you can't do with
LL/SC at all.

Chris M. Thomasson

Oct 2, 2022, 3:44:45 PM
Oh yeah! I remember reading about this over on comp.arch a while back. I
wonder if I can find the post. Thanks Scott. :^)

Chris M. Thomasson

Oct 2, 2022, 3:46:53 PM
On 9/30/2022 9:26 PM, Bonita Montero wrote:
> Am 28.09.2022 um 23:11 schrieb Scott Lurndal:
>
>> When ARM was designing their 64-bit architecture (ARMv8) circa 2011/2,
>> they
>> only provided a LL/SC equivalent (load-exclusive/store-exclusive).
>> Their
>> architecture partners at the time quickly requested support for
>> real RMW atomics, which were added as part of the LSE (Large System ISA
>> Extensions).   LDADD, LDCLR (and with complement), LDSET (or), LDEOR
>> (xor),
>> LDSMAX (signed maximum), LDUMAX (unsigned maximum), LDSMIN, LDUMIN.
>
> Eh, RMW can be emulated with LL/SC but not vice versa.
> A CAS emulated by LL/SC isn't slower than a native CAS.

Using LL/SC can be tricky. You really need to isolate the reservation
granule...


> But atomic increments, decrements, ands, ors or whatever
> emulated with LL/SC is sometimes slower.

How many times do you spin on an SC failure before you get pissed off?

Michael S

Oct 2, 2022, 6:32:58 PM
I don't think so.
IMHO, a typical implementation is that the CPU acquires ownership
of the location and refuses all attempts by other agents to take it back
until both parts of the CAS are complete and committed to L1$.
What you say is an idea that floats around widely but has never been
implemented on general-purpose CPUs - mostly because, for workloads
that run on general-purpose CPUs, it's a very bad idea.

Maybe on some network processor it works the way you suggest,
but I wouldn't call the architectures of those processors "modern".

Bonita Montero

Oct 2, 2022, 10:14:41 PM
Am 02.10.2022 um 21:46 schrieb Chris M. Thomasson:

> Using LL/SC can be tricky. You really need to isolate the reservation
> granule...

In essence, you're doing the same thing with CAS, but it's easier
to use since it eliminates the need for DWCAS for lock-free stacks.

>> But atomic increments, decrements, ands, ors or whatever
>> emulated with LL/SC is sometimes slower.

> How many times do you spin on an SC failure before you get pissed off?

With atomic non-CAS operations you never spin.



Chris M. Thomasson

Oct 2, 2022, 10:44:14 PM
On 10/2/2022 7:14 PM, Bonita Montero wrote:
> Am 02.10.2022 um 21:46 schrieb Chris M. Thomasson:
>
>> Using LL/SC can be tricky. You really need to isolate the reservation
>> granule...
>
> In essence, you're doing the same thing with CAS, but it's easier
> to use since it eliminates the need for DWCAS for lock-free stacks.

Iirc, there was a paper on LL/SC and the ABA problem. I am not sure if
every implementation of LL/SC can get around it. Iirc, an LL/SC that will
make an SC fail if anything reads and/or writes from/to the RG should
definitely get around ABA. However, from experience, I would choose
something like cmpxchg8b over LL/SC any day.


>>> But atomic increments, decrements, ands, ors or whatever
>>> emulated with LL/SC is sometimes slower.
>
>> How many times do you spin on an SC failure before you get pissed off?
>
> With atomic non-CAS operations you never spin.
>
>
>

I was referring to SC failing. Why did it fail? Spuriously? From a read
into the reservation granule? If a CAS fails, we know it is because the
actual values were different. The compare part failed. There is a way to
attack CAS. Have a racing herd of threads setting the shared value to
random values. Then a poor thread trying to do a CAS to update a
data-structure or something just might fail a lot of times.

So, when a CAS fails, well, it's different than when an SC fails...

Chris M. Thomasson

Oct 2, 2022, 10:48:30 PM
Iirc, the "weak" CAS in C++ allows for spurious failures.

Bonita Montero

Oct 2, 2022, 10:53:40 PM
Am 03.10.2022 um 04:43 schrieb Chris M. Thomasson:
> On 10/2/2022 7:14 PM, Bonita Montero wrote:
>> Am 02.10.2022 um 21:46 schrieb Chris M. Thomasson:
>>
>>> Using LL/SC can be tricky. You really need to isolate the reservation
>>> granule...
>>
>> In essence, you're doing the same thing with CAS, but it's easier
>> to use since it eliminates the need for DWCAS for lock-free stacks.
>
> Iirc, there was a paper on LL/SC and the ABA problem. ...

There's no ABA-problem with LL/SC'd stacks since you can detect if
the word has changed to the same value.

> I was referring to SC failing. Why did it fail? Spuriously? ..

Only when the cacheline holding the word has been touched
by another thread.


Chris M. Thomasson

Oct 2, 2022, 10:58:48 PM
On 10/2/2022 7:53 PM, Bonita Montero wrote:
> Am 03.10.2022 um 04:43 schrieb Chris M. Thomasson:
>> On 10/2/2022 7:14 PM, Bonita Montero wrote:
>>> Am 02.10.2022 um 21:46 schrieb Chris M. Thomasson:
>>>
>>>> Using LL/SC can be tricky. You really need to isolate the
>>>> reservation granule...
>>>
>>> In essence, you're doing the same thing with CAS, but it's easier
>>> to use since it eliminates the need for DWCAS for lock-free stacks.
>>
>> Iirc, there was a paper on LL/SC and the ABA problem. ...
>
> There's no ABA-problem with LL/SC'd stacks since you can detect if
> the word has changed to the same value.

Iirc, it depended on how the LL/SC was implemented, how sensitive it was
to alterations in the reservation granule.


>> I was referring to SC failing. Why did it fail? Spuriously? ..
>
> Only when the cacheline holding the word has been touched
> by another thread.

This would be the weak form of CAS wrt C++. A "true" RMW CAS will fail
only when the condition (the compare part) fails. A bus lock might need
to be used under heavy conditions. Scott knows about that.


Bonita Montero

Oct 3, 2022, 2:27:29 AM
Am 03.10.2022 um 04:58 schrieb Chris M. Thomasson:

>> There's no ABA-problem with LL/SC'd stacks since you can detect if
>> the word has changed to the same value.

> Iirc, it depended on how the LL/SC was implemented, how
> sensitive it was to alterations in the reservation granule.

It may fail only if the cacheline has been evicted for reasons other
than the SC'd word changing.


Scott Lurndal

Oct 3, 2022, 10:01:47 AM
Michael S <already...@yahoo.com> writes:
>On Saturday, October 1, 2022 at 8:02:05 PM UTC+3, Scott Lurndal wrote:
>> Bonita Montero <Bonita....@gmail.com> writes:
>> >Am 28.09.2022 um 23:11 schrieb Scott Lurndal:
>> >
>> >> When ARM was designing their 64-bit architecture (ARMv8) circa 2011/2, they
>> >> only provided a LL/SC equivalent (load-exclusive/store-exclusive). Their
>> >> architecture partners at the time quickly requested support for
>> >> real RMW atomics, which were added as part of the LSE (Large System ISA
>> >> Extensions). LDADD, LDCLR (and with complement), LDSET (or), LDEOR (xor),
>> >> LDSMAX (signed maximum), LDUMAX (unsigned maximum), LDSMIN, LDUMIN.
>> >
>> >Eh, RMW can be emulated with LL/SC but not vice versa.
>> >A CAS emulated by LL/SC isn't slower than a native CAS.
>> >But atomic increments, decrements, ands, ors or whatever
>> >emulated with LL/SC is sometimes slower.
>> >
>> Who said anything about CAS[*]?
>>
>>
>> [*] For your edification, CAS on modern archtitectures isn't
>> handled by the CPU, but rather by the point of coherency (LLC
>> or PCI-Express/CXL endpoint).
>
>I don't think so.

The processors that we build implement it exactly as I say.

>IMHO, a typical implementation is that the CPU acquires ownership
>of the location and refuses all attempts by other agents to take it back
>until both parts of the CAS are complete and committed to L1$.

That hasn't been my experience (in the past with Intel/AMD x86_64
where we extended the coherency protocol from HT/QPI over a high-speed
fabric (IB/10Ge), and currently with custom high-end ARM64
processors that my CPOE sells).

Note that I said at the "point of coherency". That may very
well be the L1 cache for some processors, the L2/LLC for others
(depending on cache inclusivity etc.).

Chris M. Thomasson

Oct 3, 2022, 4:49:20 PM
Reading/writing to the reservation granule (RG) can cause an issue.
Iirc, the RG can be bigger than an L2 cacheline...

Chris M. Thomasson

Oct 3, 2022, 4:50:10 PM
Correct.

Chris M. Thomasson

Oct 4, 2022, 1:54:04 PM
Iirc, an RG can be several contiguous L2 cachelines. So, working with
LL/SC can be a bit tricky. One generally wants to align the memory on a
natural RG boundary, and pad it up to the size of an RG. False sharing an
RG is HORRIBLE! Really bad. Think live lock...

Wrt CAS, I tend to focus on a single L2 cacheline. False sharing an L2
cacheline wrt pure RMW CAS, well, it's definitely not ideal, but at least
CAS's can complete...

Humm...
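
A minimal sketch of the align-and-pad idea (128 is an assumption that
covers a 64-byte line and some common RG sizes - check the real target;
C++17's std::hardware_destructive_interference_size is the portable hint
for the cacheline part):
________________________
#include <atomic>

// Give each hot counter its own line/granule: alignas() both aligns
// the object and pads sizeof() out to a multiple of 128, so adjacent
// array elements can never false-share.
struct alignas(128) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter per_thread_counters[8];   // e.g. one slot per thread
________________________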
