
An argument *against* (the liberal use of) references


Juha Nieminen

Nov 22, 2022, 4:52:49 AM
Recently I made a post about references, about how I think many C++
programmers think of them in the wrong way (ie. they think of them
as being effectively an alternative syntax for pointers, a "more
limited pointer syntax", or "a safer pointer syntax", when in fact
references shouldn't be semantically thought of as pointers at all,
but as aliases for the objects they are referring to). In that thread
some criticism was presented about the use of references, and arguments
for their use.

For the sake of fairness and balance, here's an argument *against*
(the liberal use of) references.

Most C++ programmers have been conditioned to always take larger objects
(and even not-so-large objects) by const reference in function parameters,
because that's more efficient. (The heavier an object is to deep-copy,
the more efficient a reference to it becomes, obviously.)

However, not many of them consider the *thread-safety* problem that taking
a parameter by reference introduces. In single-threaded programs this is
rather irrelevant, but multithreaded programming is becoming more and more
common every day. Also, if you are writing a library to be used in programs
out there, you have to consider its thread-safety even if the library itself
doesn't use threads.

What is this thread-safety problem introduced by references (especially
when we are writing a function that takes a parameter by reference)? The
fact that in theory another thread could modify the object being referred
to, at the same time that this function is trying to read it.

And there's nothing this function can do to defend against that. (Unless
this function cooperates with the calling code to make it thread-safe, eg.
by using a mutex offered by the calling code.) Even if the function is
internally thread-safe, it can't help the fact that it's using an external
resource without mutual exclusion (unless provided by the calling code).

How many times have you thought about the fact that taking a parameter
by reference makes the function automatically not-thread-safe? I know
I haven't. Like ever.

And, as has been pointed out, in the calling code itself it's not obvious
that a function is taking a parameter by reference, and thus might need
mutual exclusion.
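
To make the problem concrete, here's a minimal sketch (the Config type
and names are invented for illustration; the concurrent access is a
deliberate data race, ie. undefined behavior):

#include <cstddef>
#include <string>
#include <thread>

struct Config { std::string name; };

// Takes its parameter by const reference: cheap, but not thread-safe
// by itself. If another thread mutates the same Config concurrently,
// this read is a data race unless the *caller* synchronizes.
std::size_t name_length(const Config& cfg)
{
    return cfg.name.size();
}

int main()
{
    Config shared{"initial"};
    std::thread writer([&] { shared.name = "a much longer replacement"; });
    (void)name_length(shared); // races with the writer thread
    writer.join();
}

Nothing inside name_length() can prevent that race; only the calling
code can.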

Fred. Zwarts

Nov 22, 2022, 5:23:39 AM
On 22.11.2022 at 10:52, Juha Nieminen wrote:
I have seen such problems often. When possible, I try to make the class
itself thread-safe, such that it can only be accessed with thread-safe
member functions. But sometimes one has to use classes that are not
thread-safe themselves. Then this is something to think about. Using a copy
instead of a reference is not always the solution, because making a copy
is not always thread-safe either.

Paavo Helde

Nov 22, 2022, 6:28:06 AM
If the code passing a reference is not thread-safe, then just changing
it to pass by value will not magically make it thread-safe. Without
proper synchronization, the object may be changed by another thread at
any moment, for example in the middle of the copy operation, and thus
the copy might become internally inconsistent.

In multithreaded programs typically there are 3 types of objects:

1. Non-mutable shared objects which can be accessed by multiple threads
without locking. Can be passed by reference.

2. Shared objects which need locking when accessed. The locking is
best placed inside the object's methods, so that the callers do not need
to worry about it. Can be passed by reference. Lock-free data
structures would also belong here.

3. Single-threaded objects which are accessed and modified in a single
thread only. Do not need locking, but require deep copying when passed
to another thread. The copying must happen before the copy will become
accessible in the other thread. This copying can indeed be done by
passing the object to a function by value - that's what is done e.g. by
the std::thread constructor (see the sketch at the end of this post).

In their own thread such objects can be accessed without locking and can
be passed via references, no problems.

In short, casual pass-by-value is neither sufficient nor necessary for
multi-thread safety. Copying is needed when passing single-threaded
objects to other threads, which ought to happen at clearly
defined points in the program.
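
For illustration, here's a minimal sketch of point 3 (names invented):
the std::thread constructor copies its arguments in the constructing
thread, so the new thread starts with its own independent copy:

#include <thread>
#include <vector>

// Does single-threaded work on its own private copy.
void consume(std::vector<int> data)
{
    data.push_back(4); // safe: no other thread can see this copy
}

int main()
{
    std::vector<int> local{1, 2, 3};
    std::thread t(consume, local); // 'local' is copied here, up front
    local.clear();                 // fine: does not affect the copy
    t.join();
}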


Bonita Montero

Nov 22, 2022, 9:37:01 AM
You've got problems that practical developers don't have.

Chris Vine

Nov 22, 2022, 10:21:17 AM
On Tue, 22 Nov 2022 09:52:33 -0000 (UTC)
Juha Nieminen <nos...@thanks.invalid> wrote:
[snip]
> However, not many of them consider the *thread-safety* problem that taking
> a parameter by reference introduces. In single-threaded programs this is
> rather irrelevant, but multithreaded programming is becoming more and more
> common every day. Also, if you are writing a library to be used in programs
> out there, you have to consider its thread-safety even if the library itself
> doesn't use threads.
>
> What is this thread-safety problem introduced by references (especially
> when we are writing a function that takes a parameter by reference)? The
> fact that in theory another thread could modify the object being referred
> to, at the same time that this function is trying to read it.
>
> And there's nothing this function can do to defend against that. (Unless
> this function cooperates with the calling code to make it thread-safe, eg.
> by using a mutex offered by the calling code.) Even if the function is
> internally thread-safe, it can't help the fact that it's using an external
> resource without mutual exclusion (unless provided by the calling code).
>
> How many times have you thought about the fact that taking a parameter
> by reference makes the function automatically not-thread-safe? I know
> I haven't. Like ever.

Ah, the frailty of human memory! I drew it to your attention
some time ago. Anyway, I agree with your conclusions above.

********************************************************************
Date: Sat, 25 Aug 2018 01:38:29 +0100
From: Chris Vine <chris@cvine--nospam--.freeserve.co.uk>
Newsgroups: comp.lang.c++
Subject: Re: Should you use constexpr by default?

On Thu, 23 Aug 2018 05:48:38 -0000 (UTC)
Juha Nieminen <nos...@thanks.invalid> wrote:
> Even beyond that, in the era of C++11 and newer, the constness of
> a member function should indicate that it's thread-safe to call it
> without a locking mechanism.

It absolutely doesn't mean that. Locking of object (instance) data is
required in a const member function if it accesses those data at a time
when a non-const member function might concurrently mutate the data in
another thread.

You may have got this idea from a talk given by Herb Sutter in which he
asserted that "const means thread safe". It doesn't.

A const member function can also mutate static data.
********************************************************************
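
A small sketch of the point in that old exchange (hypothetical class):
constness does not remove the need for locking when a non-const member
function might run concurrently:

#include <mutex>

class Counter {
public:
    int get() const
    {
        // "const" does not mean thread-safe: this read still needs the
        // lock, because set() may mutate value_ in another thread.
        std::lock_guard<std::mutex> lock(m_);
        return value_;
    }
    void set(int v)
    {
        std::lock_guard<std::mutex> lock(m_);
        value_ = v;
    }
private:
    mutable std::mutex m_; // mutable, so const members can lock it
    int value_ = 0;
};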

Richard Damon

Nov 22, 2022, 10:48:38 AM
Taking a parameter by reference makes you no less thread safe than
taking a pointer as a parameter. In either case, the caller needs to
make sure it has proper "ownership" of the object it is passing a
reference or pointer to.

Yes, taking a parameter by (const) reference vs taking it by value
increases the exposure to thread safety issues.

I will agree that the presence of reference arguments perhaps makes it
easier to miss some "sharing" that is happening, but ultimately, the C
and C++ philosophy is that the programmer needs to know what he is doing.
They are NOT languages that coddle the programmer with extreme safety.

Chris M. Thomasson

Nov 22, 2022, 4:18:56 PM
> taking a pointer as a parameter. [...]

Agreed. Passing in a pointer vs a reference has no bearing on thread
safety. Think of:

#include <atomic>

int
fetch_add(
    std::atomic<int>* const src,
    int addend
){
    return std::atomic_fetch_add(src, addend);
}


int
fetch_add(
    std::atomic<int>& src,
    int addend
){
    return std::atomic_fetch_add(&src, addend);
}

where the low-level std::atomic_fetch_add function takes a pointer to an
atomic int in src and an int as addend for its parameters. Whether we
pass fetch_add's src parameter in by pointer or by reference is
irrelevant wrt thread safety: std::atomic_fetch_add takes care to make
sure the RMW operation is atomic. Nothing to do with references vs
pointers...

El Jo

Nov 22, 2022, 4:41:55 PM
On Tuesday, November 22, 2022 at 09:52:49 UTC, Juha Nieminen wrote:
> Recently I made a post about references, about how I think many C++
> programmers think of them in the wrong way (ie. they think of them
> as being effectively an alternative syntax for pointers, a "more
> limited pointer syntax", or "a safer pointer syntax", when in fact
> references shouldn't be semantically thought of as pointers at all,
> but as aliases for the objects they are referring to). In that thread
> some criticism was presented about the use of references, and arguments
> for their use.
>
> For the sake of fairness and balance, here's an argument *against*
> (the liberal use of) references.

Looks like a big post, but:
1. Thread safety can only be enforced by mutual exclusion, so I'd expect a mutex before the function call.
2. In 2021 we don't pass big objects by value or by reference any more; whenever we can, we should transfer ownership,
and if that is not possible we should pass a pointer to the object. We pass by reference only small objects that we don't own.
Just 1c.
BR,

Chris M. Thomasson

Nov 22, 2022, 4:51:32 PM
On 11/22/2022 1:41 PM, El Jo wrote:
> On Tuesday, November 22, 2022 at 09:52:49 UTC, Juha Nieminen wrote:
>> Recently I made a post about references, about how I think many C++
>> programmers think of them in the wrong way (ie. they think of them
>> as being effectively an alternative syntax for pointers, a "more
>> limited pointer syntax", or "a safer pointer syntax", when in fact
>> references shouldn't be semantically thought of as pointers at all,
>> but as aliases for the objects they are referring to). In that thread
>> some criticism was presented about the use of references, and arguments
>> for their use.
>>
>> For the sake of fairness and balance, here's an argument *against*
>> (the liberal use of) references.
>
> Looks like a big post, but:
> 1. Thread safety can only be enforced by mutual exclusion, so I'd expect a mutex before the function call.
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Why do you say that? C++ has fairly decent atomic capabilities that try
to avoid a mutex if the underlying architecture supports lock-free
atomic RMWs and loads/stores.
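
For instance (a minimal sketch):

#include <atomic>
#include <iostream>

int main()
{
    std::atomic<int> counter{0};
    // On mainstream hardware this is a lock-free RMW: no mutex involved.
    counter.fetch_add(1, std::memory_order_relaxed);
    std::cout << std::boolalpha << counter.is_lock_free() << '\n';
}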

Juha Nieminen

Nov 23, 2022, 1:52:41 AM
Paavo Helde <ees...@osa.pri.ee> wrote:
> If the code passing a reference is not thread-safe, then just changing
> it to pass by value will not magically make it thread-safe. Without
> proper synchronization, the object may be changed by another thread at
> any moment, for example in the middle of the copy operation, and thus
> the copy might become internally inconsistent.

I think you have a point there. I was thinking that since the function is
getting a local copy of the object then mutual exclusion problems go away
because that local copy is completely independent of the original.

I didn't consider that *copying* the object for the function isn't in
itself automatically thread-safe. Yet still, for a few seconds, I had the
strong instinct that there has to be something "more thread-safe" about
making a copy of the value for the function than having the function take
a reference to it... but however much I try to figure out how, I fail. Copying
merely moves the mutual exclusion problem to a slightly different place,
but it doesn't solve it. The calling code *still* needs to solve the
mutual exclusion problem regardless of which way the function takes the
parameter.

Alf P. Steinbach

Nov 23, 2022, 7:51:31 AM
Depends on when the copying is done.

Copying at the thread instantiation is safe.


- Alf

Juha Nieminen

Nov 23, 2022, 8:47:09 AM
If you are calling a function and giving as parameter a variable that may
be modified by another thread, you need to take care of the mutual
exclusion problem regardless of whether that function takes the parameter
by value or by reference.

I originally didn't think about the fact that the function taking it by
value doesn't solve the problem, because now copying the value needs the
mutual exclusion. The only thing that happens is that the point of
potential conflict has been moved to a slightly different place.

You could make a local copy of the value before passing it to the
function (which could be quite efficient if the variable is atomic),
but even then it doesn't really matter if the function takes it by
value or by reference. (OTOH this may be much more efficient if
the function takes a significant amount of time because you don't
need to keep the mutex locked for the duration of the function.)
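
Something like this (a sketch; the names are invented):

#include <mutex>
#include <string>

std::string shared_text;
std::mutex shared_mutex;

void process(const std::string& s) { (void)s; /* slow, read-only work */ }

void caller()
{
    std::string local;
    {
        std::lock_guard<std::mutex> lock(shared_mutex);
        local = shared_text; // copy under the mutex
    } // mutex released here
    process(local); // the slow call runs without holding the lock
}

int main() { caller(); }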

Bonita Montero

Nov 23, 2022, 2:08:52 PM
If you don't need iterated or indexed access through a pointer,
you'd better use a reference, since you can't accidentally modify
the reference itself. And references have the advantage
that you can pass temporaries to them; with operator overloading
you can't use pointers instead.
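
For example (a minimal sketch):

#include <cstddef>
#include <string>

// A const reference binds to a temporary; a pointer parameter could
// not accept one directly. Overloaded operators likewise take their
// operands by reference or by value, never by pointer.
std::size_t length(const std::string& s) { return s.size(); }

int main()
{
    length(std::string("temporary")); // OK with a const reference
}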

Paavo Helde

Nov 30, 2022, 7:00:03 AM
Curiously enough, I just spent 3 days tracking down a random race
condition bug in a large application. It turned out that fixing it
required adding a single ampersand character, i.e. instead of making a
copy of the object I had to just take a reference to it. Note this is the
exact opposite of the general suggestion you advocated earlier ;-)

Actually once located, the bug was simple. The object which was copied
was a single-threaded refcounted smartpointer, and by copying it the
refcounter got incremented (and later decremented). Alas, this was
accidentally done from parallel threads at the same time, without any
synchronization, so eventually the refcounter got messed up.

After fixing it by taking a reference to the smartpointer instead of
copying it, the refcounter now remains constant all the time throughout
the parallel regime (and all other access is read-only as well), so
everything now works fine.

In principle one could make copies of the pointed objects before the
parallel regime, or in this particular case it would have been enough to
use thread-safe smartpointers, but both these approaches would affect
the performance, and we are always struggling with the performance.
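
The shape of the bug, as a minimal sketch (the real smartpointer is
in-house; this toy version is invented for illustration):

#include <thread>
#include <vector>

// Single-threaded refcounted smartpointer: plain (non-atomic) refcount.
template <typename T>
class RefPtr {
public:
    explicit RefPtr(T* p) : p_(p), count_(new int(1)) {}
    RefPtr(const RefPtr& o) : p_(o.p_), count_(o.count_) { ++*count_; }
    RefPtr& operator=(const RefPtr&) = delete; // keep the sketch minimal
    ~RefPtr() { if (--*count_ == 0) { delete p_; delete count_; } }
    const T& operator*() const { return *p_; }
private:
    T* p_;
    int* count_; // fine single-threaded, races across threads
};

struct Data { int value = 42; };

// Buggy variant: pass-by-value copies the RefPtr, so every call bumps
// the non-atomic refcount - a data race when called from many threads.
int read_by_value(RefPtr<Data> p) { return (*p).value; }

// The one-ampersand fix: the refcount is never touched, and all access
// is read-only.
int read_by_reference(const RefPtr<Data>& p) { return (*p).value; }

int main()
{
    RefPtr<Data> shared(new Data);
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back([&shared] { read_by_reference(shared); });
    for (auto& t : pool) t.join();
}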

Juha Nieminen

Dec 1, 2022, 1:45:55 AM
Paavo Helde <ees...@osa.pri.ee> wrote:
> Actually once located, the bug was simple. The object which was copied
> was a single-threaded refcounted smartpointer, and by copying it the
> refcounter got incremented (and later decremented). Alas, this was
> accidentally done from parallel threads at the same time, without any
> synchronization, so eventually the refcounter got messed up.

I think that if the reference count is declared atomic, it can be safely
directly incremented. When decrementing you would need to use the
fetch_sub() function to see if the object needs to be destroyed.

While modifying an atomic might not be as fast as a non-atomic,
it shouldn't be all that much slower either, at least if the target
architecture supports atomic operations.
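
Something like this (a sketch of the usual pattern):

#include <atomic>

struct ControlBlock {
    std::atomic<int> refcount{1};
};

void add_ref(ControlBlock& cb)
{
    // The increment can be relaxed; it needs no ordering of its own.
    cb.refcount.fetch_add(1, std::memory_order_relaxed);
}

bool release(ControlBlock& cb)
{
    // fetch_sub returns the previous value: if it was 1, this was the
    // last reference and the caller must destroy the object.
    return cb.refcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}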

Paavo Helde

Dec 1, 2022, 2:19:39 AM
I have pondered this myself. Maybe I should measure the actual slowdown
after temporarily making the refcounters atomic. But this seems overkill
because these smartpointers would still point to single-threaded objects
which are meant to be primarily used in a single-thread regime, so in most
cases making the smartpointers atomic does not buy anything.

When tracking down this bug, I monitored all refcounter changes for a
particular single smartpointer during the program run (ca 10 min). There
were 591848 increments and decrements, from which 1526 came from the
problematic (parallelized) part. It looks like a pessimization to slow
down 99.75% of accesses when only 0.25% would actually benefit from this.



Stuart Redmann

Dec 1, 2022, 8:06:17 AM
600k changes in reference counts look suspicious to me. When you pass a
ref-counted object to a worker thread, there should be only a single
change in the refcount, made when some object inside the worker thread
takes (shared) ownership of the shared object. If the shared object needs
to be passed to sub-routines, you should pass it as a reference or a
plain pointer (a pointer if the subroutine must be able to cope with a
non-existing object). It should be a rare occurrence that another object
in the worker thread needs to take ownership of the shared object.

Another thought: if thread-safety is too costly, you could use two smart
pointer classes: thread-safe pointers and forwarding non-thread-safe smart
pointers. The forwarding smart pointers have their own thread-UNsafe
refcount and the thread-safe smart pointer as member.

Regards,
Stuart

Paavo Helde

Dec 1, 2022, 9:45:23 AM
You are right, that's how I fixed the bug (by using a reference). There
are now 1526 fewer changes in refcounts ;-)

As for the other ~600k changes, these seem legitimate. This is a scripting
language engine (think something like Python) with complex data
structures built up via refcounted smartpointers. This particular object
is apparently used as some default column in data tables. I think it was
used for 2000 columns in some 4000-column table. And there were many
tables like that. If you insert the same refcounted vector as a new
column in a table 2000 times, via a member function having a
smartpointer parameter, then you already get something like at least
6000 refcount changes.

>
> Another thought: if thread-safety is too costly, you could use two smart
> pointer classes: thread-safe pointers and forwarding non-thread-safe smart
> pointers. The forwarding smart pointers have their own thread-UNsafe
> refcount and the thread-safe smart pointer as member.

I tried to measure the impact of std::atomic<int> refcounters, and in
the first tests it seems the overhead on x86_64 is zero (with no
contention). So it seems I could use them without drawbacks, but this
would not save much, because the pointed-to objects would still not be
thread-safe. I guess it might work out if I could ensure that all
objects are physically immutable after some initialization. Hmm, time
for thoughts.

Scott Lurndal

Dec 1, 2022, 11:02:37 AM
Paavo Helde <ees...@osa.pri.ee> writes:
>01.12.2022 15:06 Stuart Redmann wrote:

<snip>

>> Another thought: if thread-safety is too costly, you could use two smart
>> pointer classes: thread-safe pointers and forwarding non-thread-safe smart
>> pointers. The forwarding smart pointers have their own thread-UNsafe
>> refcount and the thread-safe smart pointer as member.
>
>I tried to measure the impact of std::atomic<int> refcounters and in
>first tests it seems the overhead on x86_64 is zero (with no
>contention).

Which follows naturally from the fact that the core doing the atomic
access has exclusive access to the cache line containing the refcounter.
No overhead at all, unless the refcounter isn't aligned and crosses a
cache-line boundary (or the access is to an uncached memory range, or
caching is disabled), in which case the processor will take a
system-wide lock to perform the operation, which is catastrophic on
systems with large processor counts.

(Note that both Intel and AMD processors will fall back to
the system-wide lock if a cache line is highly contended
after some time period has elapsed in order to make forward
progress.)

If the atomic access is to memory on a CXL.memory device, the
operation will not benefit from local cache line latencies
and the atomicity will be guaranteed by the CXL.memory device
exporting the memory to the host in some implementation defined
manner.

Paavo Helde

Dec 1, 2022, 1:22:12 PM
01.12.2022 18:02 Scott Lurndal wrote:
> Paavo Helde <ees...@osa.pri.ee> writes:
>> 01.12.2022 15:06 Stuart Redmann wrote:
>
> <snip>
>
>>> Another thought: if thread-safety is too costly, you could use two smart
>>> pointer classes: thread-safe pointers and forwarding non-thread-safe smart
>>> pointers. The forwarding smart pointers have their own thread-UNsafe
>>> refcount and the thread-safe smart pointer as member.
>>
>> I tried to measure the impact of std::atomic<int> refcounters and in
>> first tests it seems the overhead on x86_64 is zero (with no
>> contention).
>
> Which follows naturally from the fact that the core doing
> the atomic access has exclusive access to the cache line
> containing the refcounter. No overhead at all, unless

Thanks for the clarifications!

> the ref counter isn't aligned and crosses a cache-line
> boundary (or the access is to an uncached memory
> range or caching is disabled), in which case the processor
> will take a system-wide
> lock to perform the operation, which is catastrophic
> on systems with large processor counts.

It is clear that having a misaligned atomic crossing a cache-line
boundary would be very bad. But what about normal uncached memory
ranges - wouldn't these just be loaded into the cache, without disturbing
other processors, and without any "catastrophic" consequences?


Scott Lurndal

Dec 1, 2022, 2:09:39 PM
Generally uncached means that the processor fetches
directly from memory bypassing the cache and never evicting any
lines. This is an important characteristic for MMIO space
where a read access has a side effect (e.g. reading a UART
Data Register).

It also depends on how the processor atomic instructions are implemented.

In legacy Intel/AMD systems, where the LOCK prefix is being used, the
systemwide lock is the only possibility [*].

For ARM64 with the Large System Extensions (LSE) atomic instructions,
the processor can send the atomic operation to the point of coherency
(either the cache subsystem, or if caching is disabled, to the DRAM
controller or PCI-Express device (PCIe supports atomics) and the
synchronization happens at the "endpoint".

Without support all the way to the memory controller or endpoint,
there is no other way to synchronize all agents accessing the controller
or endpoint without acquiring a global mutex of some sort.

[*] It's been a decade since I worked directly with those processors
and they may have added support for atomic operations to the
internal ring or now mesh structures used to communicate between
the processing elements and the memory controllers and PCI root port
bridges, in which case, like ARM64, they can push the atomic op
all the way out to the endpoint/controller.

Michael S

Dec 1, 2022, 2:59:50 PM
"Unchached memory range" is a misnormer.
A proper name is uncacheable range (region).
Unfortunately "uncached" in the meaning of "uncacheable" is used
quite often. Even Intel's official manuals suffer from such
inconsistent vocabulary.

Chris M. Thomasson

Dec 1, 2022, 3:22:02 PM
Check this out:

https://github.com/jseigh/atomic-ptr-plus/blob/master/atomic-ptr/atomic_ptr.h

It is a truly atomic reference counted pointer. A thread can take a
reference without owning a prior reference.

Here is a patent:

https://patents.justia.com/patent/5295262

Paavo Helde

Dec 1, 2022, 4:54:02 PM
01.12.2022 21:59 Michael S wrote:

> "Unchached memory range" is a misnormer.
> A proper name is uncacheable range (region).
> Unfortunately "uncached" in the meaning of "uncacheable" is used
> quite often. Even Intel's official manuals suffer from such
> inconsistent vocabulary.

Thanks, I had to look up what "uncacheable memory" is. I guess the 80186
processor where I learned my basics did not have such a thing.

Michael S

Dec 2, 2022, 7:11:56 AM
The 80186 was an "embedded" microprocessor similar at its core to the 8086.
It was typically used with no cache, so it didn't need a concept of
uncacheable regions.
The 80286 and especially the i386 were used with (external) caches quite
often, but according to my understanding their caches were what we today
call "memory-side caches", associated with main memory.
From the system's perspective (both the CPU's and other bus masters') such
caches are totally transparent (except for entering/leaving deep sleep
states, but back then they didn't do that), so there was still no need for
uncacheable regions.

System-side caches and the associated problems first appeared in the x86
world with the i486. Still, in the original i486 the system cache had a
strict write-through policy, so the problems were minor.
Then came the Pentium with 8 KB of write-back data cache, and a little
later came new models of the i486 with an even bigger write-back cache,
and the problems became quite real, especially because at approximately
the same time PCI took over the I/O bus role, and suddenly multiple bus
masters, which had been a high-end curiosity before then, became common in
consumer PC hardware.
But even then the x86 architecture lacked an adequate answer to the new
challenge.

The first reasonable answer (MTRR registers) came only in the PPro, but it
still had scalability problems - too few regions.
Later (P-III) they invented the PAT, which from a theoretical point of
view is inferior to MTRRs, because in the PAT scheme cacheability is an
attribute of the virtual address rather than of the physical address. But
the PAT is ultimately scalable and, as long as the OS does the proper
plumbing, is one solution that can rule over all aspects of cacheability.
So it won.

Michael S

Dec 2, 2022, 8:53:18 AM
Huh?
There is absolutely no relationship between which prefix is used (an
instruction encoding issue) and the implementation.

> For ARM64 with the Large System Extensions (LSE) atomic instructions,
> the processor can send the atomic operation to the point of coherency
> (either the cache subsystem, or if caching is disabled, to the DRAM
> controller or PCI-Express device (PCIe supports atomics) and the
> synchronization happens at the "endpoint".
>
> Without support all the way to the memory controller or endpoint,
> there is no other way to sychronize all agents accessing the controller
> or endpoint without acquiring a global mutex of some sort.
>
> [*] It's been a decade since I worked directly with those processors
> and they may have added support for atomic operations to the
> internal ring or now mesh structures used to communicate between
> the processing elements and the memory controllers and PCI root port
> bridges, in which case, like ARM64, they can push the atomic op
> all the way out to the endpoint/controller.

x86 atomic operations have global order, which is somewhat stronger
than the "total order" of the rest of normal* x86 stores. The difference
is that, unlike "total order", global order makes no exceptions for
store-to-load forwarding from the core's local store queue.

BTW, it means that your claim in the post above ("no overhead at all
unless ...") is incorrect in the absolute sense. There is an overhead even
without the "unless". But the overhead in the uncontended case is small -
on the order of a dozen or two CPU clocks. So, undetectable in Paavo's
case of only 1000 updates per second. For 1M updates per second the impact
would be detectable with precise time measurements, and for 100M per
second there would be a big slowdown.

Last September Intel published this manual:
https://cdrdv2-public.intel.com/671368/architecture-instruction-set-extensions-programming-reference.pdf
The manual contains new atomic instructions AADD/AAND/AOR/AXOR that
provide weaker (WC) ordering in WB memory regions. The manual does not say
when these instructions are going to be implemented, nor whether they will
be implemented at all. It also does not explain in which situations they
are expected to be useful.
However one thing is clear: they will *not* be useful in typical user-mode
code that deals with reference counting.
Maybe they are usable in userland in extreme fire-and-forget situations,
like counting events where the counter is updated not too often, but often
enough to matter, typically not from the same core as the last time, and
read approximately never.

My guess is that these instructions were invented to help Optane DIMMs.
So today, with Optane DIMMs officially dead, it would be logical for Intel
to never implement these strange instructions, which could easily lead to
programmer mistakes.

[*] normal in this case means WB or UC. For WC stores it's more relaxed.

Scott Lurndal

Dec 2, 2022, 9:37:16 AM
Michael S <already...@yahoo.com> writes:
>On Thursday, December 1, 2022 at 9:09:39 PM UTC+2, Scott Lurndal wrote:
>> Paavo Helde <ees...@osa.pri.ee> writes:
>> >01.12.2022 18:02 Scott Lurndal wrote:
>> >> Paavo Helde <ees...@osa.pri.ee> writes:
>> >>> 01.12.2022 15:06 Stuart Redmann wrote:
>> >>

>>
>> In legacy Intel/AMD systems, where the LOCK prefix is being used, the
>> systemwide lock is the only possibility [*].
>>
>
>Huh?
>There is absolute no relationship between what prefix is used (instruction
>encoding issue) and implementation.

The only way to specify an atomic access in those chips was
to use the LOCK prefix (e.g. LOCK ADD generates an atomic
add, et alia).

Paavo Helde

Dec 2, 2022, 11:34:02 AM
02.12.2022 15:53 Michael S wrote:
> On Thursday, December 1, 2022 at 9:09:39 PM UTC+2, Scott Lurndal wrote:

> BTW, it means that your claim in post above "no overhead at all unless ..."
> is incorrect in the absolute sense. There is an overhead even without "unless".
> But the overhead in uncontended case is small - order of dozen or two of CPU
> clocks. So, undetectable in Paavo's case of only 1000 updates per second.
> For 1M updates per second impact would me detectable with precise time
> measurements and for 100M per second there would be big slowdown.

Just a minor note: earlier I posted numbers for only a single
smartpointer, whereas in the full program there are probably tens or
hundreds of thousands of them. I just measured it, and the total rate of
refcount changes is something like 5M per second.

With this 5M/s rate I cannot see any slowdown (from using std::atomic<int>
instead of int) in my measurements (with no contention); variations
caused by other uncontrollable factors seem to be much larger. Maybe I
should rerun this on a Linux box where things are more stable.

Chris M. Thomasson

Dec 2, 2022, 3:29:59 PM
Iirc, XCHG has an implicit LOCK prefix?

Scott Lurndal

Dec 2, 2022, 3:56:00 PM
Not sure about XCHG, but CMPXCHG requires the prefix
when used in a multiprocessor system; on a uniprocessor
it will be atomic because interrupts are always taken
between instructions (unlike the VAX, for instance,
where certain instructions (MOVC3/5) can be interrupted
and restarted). I suspect that XCHG has similar
characteristics.


Chris M. Thomasson

Dec 2, 2022, 3:58:53 PM
Iirc, XCHG is the _only_ atomic RMW instruction that has an _implicit_
LOCK prefix. CMPXCHG _needs_ the programmer to put in a LOCK prefix.

Branimir Maksimovic

Dec 2, 2022, 4:09:51 PM
"lock" prefix causes the processor's bus-lock signal to be asserted during
execution of the accompanying instruction. In a multiprocessor environment,
the bus-lock signal insures that the processor has exclusive use of any shared
memory while the signal is asserted. The "lock" prefix can be prepended only
to the following instructions and only to those forms of the instructions
where the destination operand is a memory operand: "add", "adc", "and", "btc",
"btr", "bts", "cmpxchg", "cmpxchg8b", "dec", "inc", "neg", "not", "or", "sbb",
"sub", "xor", "xadd" and "xchg". If the "lock" prefix is used with one of
these instructions and the source operand is a memory operand, an undefined
opcode exception may be generated. An undefined opcode exception will also be
generated if the "lock" prefix is used with any instruction not in the above
list. The "xchg" instruction always asserts the bus-lock signal regardless of
the presence or absence of the "lock" prefix.


--

7-77-777
Evil Sinner!
with software, you repeat same experiment, expecting different results...

Scott Lurndal

Dec 2, 2022, 4:19:04 PM
Yes, that is the case

XCHG:
"If a memory operand is referenced, the processor's locking protocol is automatically
implemented for the duration of the exchange operation, regardless of the presence
or absence of the LOCK prefix or of the value of the IOPL. (See the LOCK prefix
description in this chapter for more information on the locking protocol.)"

Chris M. Thomasson

Dec 2, 2022, 4:22:32 PM
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

BINGO! I had a strong feeling I was right.

https://youtu.be/TnZrWWUFl8I

Branimir Maksimovic

Dec 2, 2022, 4:36:18 PM
Nice music :p

Chris M. Thomasson

Dec 2, 2022, 5:14:01 PM
Thanks again, Branimir, for the quote of the docs.

https://youtu.be/UZ2-FfXZlAU
(super mario music in a live big band format? Nice... :^)

My hat is off to you.

Chris M. Thomasson

Dec 2, 2022, 5:16:49 PM
Indeed. When I remembered this I kind of doubted myself for a moment.
Then I said, well, I know it's true, and if I am wrong then I must have
fried my brain a bit.

Juha Nieminen

Dec 6, 2022, 6:36:42 AM
Paavo Helde <ees...@osa.pri.ee> wrote:
> When tracking down this bug, I monitored all refcounter changes for a
> particular single smartpointer during the program run (ca 10 min). There
> were 591848 increments and decrements, from which 1526 came from the
> problematic (parallelized) part. It looks like a pessimization to slow
> down 99.75% of accesses when only 0.25% would actually benefit from this.

There's place for micro-optimization and there's place to do the
Right Thing (TM) instead.

In the vast, vast majority of situations micro-optimization will have
little to no effect on the program. It's only when you have number
crunching code that does something billions of times per second that
micro-optimization may start having some discernible effect. Those
situations tend to be very rare and far-in-between. And when you do
have such situations you can make faster versions of things for that
alone.

Micro-optimization is extra useless if it's surrounded by, and thus
swamped by, code that's a lot slower than it. Even when micro-optimizing
you should start with the worst offenders, not the smallest things.

(By "micro-optimization" I'm referring to things that do not change
the computational complexity of something and only make that something
some clock cycles faster.)

So unless your smart pointer is being copied and assigned around
millions of times per second in tight number-crunching loops, you can
safely ignore any lost clock cycles by making the reference counter
thread-safe.

(If you actually need to copy and assign smart pointers around
millions of times per second in a tight number-crunching inner
loop, perhaps create a specialized version of the pointer for
that particular purpose...)

Chris M. Thomasson

Dec 6, 2022, 3:26:49 PM
On 12/6/2022 3:36 AM, Juha Nieminen wrote:
> Paavo Helde <ees...@osa.pri.ee> wrote:
>> When tracking down this bug, I monitored all refcounter changes for a
>> particular single smartpointer during the program run (ca 10 min). There
>> were 591848 increments and decrements, from which 1526 came from the
>> problematic (parallelized) part. It looks like a pessimization to slow
>> down 99.75% of accesses when only 0.25% would actually benefit from this.
[...]
> So unless your smart pointer is being copied and assigned around
> millions of times per second in tight number-crunching loops, you can
> safely ignore any lost clock cycles by making the reference counter
> thread-safe.
[...]
A thread-safe reference counted pointer can heavily damage performance
in certain usage scenarios. Blasting the system with memory barriers and
atomic RMW ops all over the place. Now, there is a work around called
proxy reference counting. Are you familiar with it?

Scott Lurndal

Dec 6, 2022, 3:40:10 PM
"Chris M. Thomasson" <chris.m.t...@gmail.com> writes:
The fact that smart pointers do allocation/deallocation has made
them useless for high-performance threaded code, IMO.

Paavo Helde

Dec 6, 2022, 4:53:45 PM
06.12.2022 22:39 Scott Lurndal wrote:

>
> The fact that smart pointers do allocation/deallocation has made
> them useless for high-performance threaded code, IMO.

You have got it backwards. Smartpointers are taken into use for coping
with the fact that objects need to be dynamically allocated and
deallocated, by the program logic.

And this allocation/deallocation would happen relatively rarely. If the
object lifetimes were short, then typically they could be controlled
much better, and there would be no need for refcounted smartpointers, or
maybe even no need for dynamic allocation of objects in the first place.



Chris M. Thomasson

Dec 6, 2022, 4:59:34 PM
Imvvho, a smart pointer should not need to allocate anything under the
covers. Also, are you familiar with proxy reference counting? It has the
ability to amortize a single reference over n objects.

Chris M. Thomasson

Dec 6, 2022, 5:01:42 PM
On 12/6/2022 1:53 PM, Paavo Helde wrote:
Fwiw, RCU is a form of proxy collection. I wrote an experimental
one using pure C++.

https://pastebin.com/raw/f71480694
(goes to pure text page, no ads and shit like that...)

Scott Lurndal

Dec 6, 2022, 5:04:34 PM
"Chris M. Thomasson" <chris.m.t...@gmail.com> writes:
>On 12/6/2022 1:53 PM, Paavo Helde wrote:
>> 06.12.2022 22:39 Scott Lurndal wrote:
>>
>>>
>>> The fact that smart pointers do allocation/deallocation has made
>>> them useless for high-performance threaded code, IMO.
>>
>> You have got it backwards. Smartpointers are taken into use for coping
>> with the fact that objects need to by dynamically allocated and
>> deallocated, by the program logic.
>>
>> And this allocation/deallocation would happen relatively rarely.

Assumption not in evidence. I've personally had to rip smart pointers
out of code because the allocation/deallocation happened very
frequently. One of the applications was simulating a processor pipeline,
another was handling network packets; both were written by well-educated
people familiar with C++.

Granted, one can specify a more efficient allocator, but

1) most C++ programmers don't bother or don't know how
2) Even then there is unnecessary overhead unless the allocator is pool based.

KISS applies, always.

Öö Tiib

Dec 7, 2022, 12:38:48 AM
No one argues with that. Just that keeping it simple is far from simple.
For example it is tricky to keep dynamic allocations minimal. That is
not the fault of smart pointers.

The std::unique_ptr helps greatly in places where dynamic allocations
are needed, especially when there can be exceptions. It has next to
no overhead.

Yes, the std::shared_ptr is heavyweight. Usage of std::make_shared
helps a bit, but thinking about how to make things simpler and to get rid
of shared ownership, or even of dynamic allocations, is hard and not
always fruitful.
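
To illustrate the difference (a minimal sketch):

#include <memory>

struct Widget { int id = 0; };

int main()
{
    // Two allocations: one for the Widget, one for the control block.
    std::shared_ptr<Widget> a(new Widget);

    // One allocation: make_shared puts the object and the control
    // block into a single memory block.
    auto b = std::make_shared<Widget>();

    // unique_ptr: no control block and no refcount at all.
    auto c = std::make_unique<Widget>();
}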

Juha Nieminen

Dec 7, 2022, 4:06:07 AM
Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
If you really need to copy/assign smart pointers in tight number-crunching
inner loops, then perhaps *don't* copy/assign such smart pointers in such
loops (or use any smart pointers for that matter)?

In scenarios that don't require the last clock cycles squeezed out of
them, do a "memory barrier" or other optimization hindrances really
matter all that much? If code that takes 0.01% of the total runtime
gets 1% slower... how much will it slow down the overall program?
Do the math.

Scott Lurndal

Dec 7, 2022, 10:22:33 AM
Öö Tiib <oot...@hot.ee> writes:
>On Wednesday, 7 December 2022 at 00:04:34 UTC+2, Scott Lurndal wrote:
>> "Chris M. Thomasson" <chris.m.t...@gmail.com> writes:
>> >On 12/6/2022 1:53 PM, Paavo Helde wrote:
>> >> 06.12.2022 22:39 Scott Lurndal wrote:
>> >>
>> >>>
>> >>> The fact that smart pointers do allocation/deallocation has made
>> >>> them useless for high-performance threaded code, IMO.
>> >>
>> >> You have got it backwards. Smartpointers are taken into use for coping
>> >> with the fact that objects need to by dynamically allocated and
>> >> deallocated, by the program logic.
>> >>
>> >> And this allocation/deallocation would happen relatively rarely.
>> Assumption not in evidence. I've personnally had to rip smart pointers
>> out of code because the allocation/deallocation happened very
>> frequently. One if the applications was simulating a processor pipeline,
>> another was handling network packets both were written by well-educated
>> people familiar with C++.
>>
>> Granted, one can specify a more efficient allocator, but
>>
>> 1) most C++ programmers don't bother or don't know how
>> 2) Even then there is unnecessary overhead unless the allocator is pool based.
>>
>> KISS applies, always.
>
>No one argues with that. Just that keeping it simple is far from simple.
>For example it is tricky to keep dynamic allocations minimal. That is
>not fault of smart pointers.

In my experience, it has been generally sufficient to pre-allocate the
data structures and store them in a table or look-aside list,
as the maximum number is bounded.

For example, an application handling network packets on a processor
with 64 cores may only need 128 jumbo packet buffers if the packet-processing
thread count matches the core count. These can be preallocated
and then passed as regular pointers throughout the flow.

(Specialized DPUs have a custom hardware block (network pool allocator)
that allocates hardware buffers to packets on ingress and those buffers
are passed by hardware to the other blocks in the flow, such as
blocks to identify the flow, fragment/defragment a packet,
apply encryption/decryption algorithms, all controlled by a
hardware scheduler block, etc.).


Likewise for a simulation of an internal processor interconnect
such as a ring or mesh structure, there is a fixed maximum number
of flits that can be active at any point in time. Preallocating
them into a lookaside list eliminates allocation and deallocation
overhead on every flit.

When simulating a full SoC, the maximum number of in-flight objects is
likewise bounded and, for the most part, they can be preallocated.
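
A minimal single-threaded sketch of the lookaside idea (the sizes and
names are illustrative; a real multi-threaded pool would use per-thread
lists or a lock-free stack):

#include <array>
#include <cstddef>
#include <vector>

struct PacketBuffer { std::array<unsigned char, 9216> bytes; }; // jumbo

class LookasideList {
public:
    explicit LookasideList(std::size_t n) : storage_(n)
    {
        for (auto& buf : storage_) free_.push_back(&buf); // preallocate
    }
    PacketBuffer* acquire()
    {
        if (free_.empty()) return nullptr; // bounded: never calls malloc
        PacketBuffer* p = free_.back();
        free_.pop_back();
        return p;
    }
    void release(PacketBuffer* p) { free_.push_back(p); }
private:
    std::vector<PacketBuffer> storage_; // owns every buffer, fixed count
    std::vector<PacketBuffer*> free_;   // plain pointers, no refcounting
};

int main()
{
    LookasideList pool(128); // e.g. 128 jumbo packet buffers
    PacketBuffer* p = pool.acquire();
    // ... fill and process the packet via plain pointers ...
    pool.release(p);
}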

Paavo Helde

Dec 7, 2022, 11:37:20 AM
07.12.2022 17:22 Scott Lurndal wrote:
> Öö Tiib <oot...@hot.ee> writes:
>>
>> No one argues with that. Just that keeping it simple is far from simple.
>> For example it is tricky to keep dynamic allocations minimal. That is
>> not fault of smart pointers.
>
> In my experience, it has been generally sufficent to pre-allocate the
> data structures and store them in a table or look-aside list,
> as the maximum number is bounded.

My experience is more that the user wants to read in unknown number of
tiff files containing unknown number of image frames of unknown sizes,
then start to process them by script-driven flexible algorithms,
producing an unknown number of intermediate and final results of unknown
size. And this processing ought to be as fast as possible, as nobody
wants to wait for hours (although with large data sets and complex
processing it inevitably gets into hours). And this processing ought
to make use of all the CPU cores and should not run out of computer
memory while doing that.

So it seems preallocating a fixed number of data structures of fixed
size would not really work in my case.


Chris M. Thomasson

Dec 7, 2022, 3:48:29 PM
On 12/7/2022 1:05 AM, Juha Nieminen wrote:
> Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
>> On 12/6/2022 3:36 AM, Juha Nieminen wrote:
>>> Paavo Helde <ees...@osa.pri.ee> wrote:
>>>> When tracking down this bug, I monitored all refcounter changes for a
>>>> particular single smartpointer during the program run (ca 10 min). There
>>>> were 591848 increments and decrements, from which 1526 came from the
>>>> problematic (parallelized) part. It looks like a pessimization to slow
>>>> down 99.75% of accesses when only 0.25% would actually benefit from this.
>> [...]
>>> So unless your smart pointer is being copied and assigned around
>>> millions of times per second in tight number-crunching loops, you can
>>> safely ignore any lost clock cycles by making the reference counter
>>> thread-safe.
>> [...]
>> A thread-safe reference counted pointer can heavily damage performance
>> in certain usage scenarios. Blasting the system with memory barriers and
>> atomic RMW ops all over the place.
>
> If you really need to copy/assign smart pointers in tight number-crunching
> inner loops, then perhaps *don't* copy/assign such smart pointers in such
> loops (or use any smart pointers for that matter)?

It really rears its ugly head when iterating large linked lists of
nodes... Read mostly, write rather rarely.


> In scenarios that don't require the last clock cycles squeezed out of
> them, does a "memory barrier" or other optimization hindrances really
> matter all that much?

Big time. Have you ever studied up on RCU? That is one of the reasons it
was created in the first place: to get rid of memory barriers on the
read side of the algorithm.


> If code that takes 0.01% of the total runtime
> gets 1% slower... how much will it slow down the overall program?
> Do the math.

RCU beats them all; it is memory-barrier free, well, except for systems
that _need_ a membar for data-dependent loads, a la DEC Alpha. Iirc,
SPARC in RMO mode does not even need membars for such loads. Proxy
collection does pretty damn well, but not as good as RCU...

The membars take a big toll, especially the god damn #StoreLoad barrier
in SMR (aka, hazard pointers).

Chris M. Thomasson

Dec 7, 2022, 3:50:48 PM
Fwiw, here is a paper on SMR (Safe Memory Reclamation):

https://www.liblfds.org/downloads/white%20papers/%5BSMR%5D%20-%20%5BMichael%5D%20-%20Hazard%20Pointers;%20Safe%20Memory%20Reclaimation%20for%20Lock-Free%20Objects.pdf

Joe Seigh cleverly combined SMR with RCU to get rid of the NASTY
#StoreLoad membar in SMR.


Juha Nieminen

Dec 8, 2022, 2:53:13 AM
Chris M. Thomasson <chris.m.t...@gmail.com> wrote:
>> If you really need to copy/assign smart pointers in tight number-crunching
>> inner loops, then perhaps *don't* copy/assign such smart pointers in such
>> loops (or use any smart pointers for that matter)?
>
> It really rears its ugly head when iterating large linked lists of
> nodes... Read mostly, write rather rarely.

I don't see how iterating a linked list requires copying or assigning
smart pointers, unless you are using the smart pointers as the next/prev
pointers of the nodes themselves. In which case you run into a recursive
reference counting situation, which I don't see as very feasible.

>> In scenarios that don't require the last clock cycles squeezed out of
>> them, does a "memory barrier" or other optimization hindrances really
>> matter all that much?
>
> Big time. Have you ever studied up on RCU? That is one of the reasons it
> was created in the first place: to get rid of memory barriers on the
> read side of the algorithm.

In scenarios that don't require the last clock cycles squeezed out of
them it's extremely important to squeeze the last clock cycles out?

I don't often like to quote the way-too-often-wrongly-quoted and
way-too-often-completely-misunderstood "early optimization is the
root of all evil", but in this case that it applies.

(In its original context, when Donald Knuth wrote that, he was saying
that your optimization efforts should be concentrated on the 3% of the
code where it actually matters. Something that most people don't know
about the quote. But it does apply here perfectly.)

Paavo Helde

Dec 8, 2022, 6:38:42 AM
07.12.2022 22:48 Chris M. Thomasson wrote:
> On 12/7/2022 1:05 AM, Juha Nieminen wrote:
>
>
>> If code that takes 0.01% of the total runtime
>> gets 1% slower... how much will it slow down the overall program?
>> Do the math.

My aim is to allow the program to scale safely to many-core machines.
Even if some synchronization overhead is small today when running on 10
cores in parallel, it does not mean it will remain small when run on a
100-core or 1000-core machine, in some not so distant future.

> RCU beats them all, it is memory barrier free, well except for systems
> that _need_ a membar for data-dependent loads, ala dec alpha. Iirc,
> SPARC in RMO mode does not even need membars for such loads. Proxy
> collection does pretty damn good, but not as good as RCU...
>
> The membars take a big toll, especially the god damn #StoreLoad barrier
> in SMR (aka, hazard pointers).

My current approach is to use single-threaded data structures as much as
possible, so that the running threads would not disturb each other at
all. But this creates other challenges like a need for deep copies, and
a need to recalculate same things in different threads, on those copies.

It looks like if I want to use another approach with keeping more data
in shared use I will indeed need to learn more about atomics and RCU.


Tim Rentsch

Dec 8, 2022, 8:50:43 AM
Juha Nieminen <nos...@thanks.invalid> writes:

[...]

> I don't often like to quote the way-too-often-wrongly-quoted and
> way-too-often-completely-misunderstood "early optimization is the
> root of all evil", [...]

Amusing that this comment wrongly quotes the original.

Michael S

Dec 8, 2022, 4:30:23 PM
premature - ennenaikaista, ennenaikainen
early - aikaisin, aikainen
All words appear to have the same root 'aika'==time.
If I am going to believe Google Translate then 'ennenaikaista' means
literally 'before early time'. So, maybe, for a person who thinks
in Finnish it is natural to translate it back to English as 'early'.

It is not just Finnish.
In the two languages that I speak most, the common translation of
'premature' is 'before time' or 'too early'.
Neither of the languages derives it from 'mature'.
Now, 'immature' is a completely different story. This word is
translated rather close to the original. But Knuth said 'premature'
rather than 'immature', and that provokes difficulties in translation.

Chris M. Thomasson

Dec 8, 2022, 7:45:45 PM
Exactly. Keep things separated out as much as possible, indeed. If you
absolutely _must_ use shared data, think again... After thinking hard,
if you still need to use shared data, then so be it. Damn. Now, there
are many different ways to use shared data, and some are a heck of a lot
better than others. They all have their trade-offs. RCU is geared toward
"read-mostly" usage patterns. So, a database that experiences a shi%load
of reads, and not so many writes per, say, second, well, RCU just might
be of use. Also, a proxy collector might work out for you. Basically,
separate out reads and writes in your logic. If it's read-heavy, well,
RCU might be able to help you out.

Malcolm McLean

Dec 9, 2022, 7:35:35 AM
On Tuesday, 6 December 2022 at 11:36:42 UTC, Juha Nieminen wrote:
> Paavo Helde <ees...@osa.pri.ee> wrote:
> > When tracking down this bug, I monitored all refcounter changes for a
> > particular single smartpointer during the program run (ca 10 min). There
> > were 591848 increments and decrements, from which 1526 came from the
> > problematic (parallelized) part. It looks like a pessimization to slow
> > down 99.75% of accesses when only 0.25% would actually benefit from this.
> There's place for micro-optimization and there's place to do the
> Right Thing (TM) instead.
>
> In the vast, vast majority of situations micro-optimization will have
> little to no effect on the program. It's only when you have number
> crunching code that does something billions of times per second that
> micro-optimization may start having some discernible effect. Those
> situations tend to be very rare and far-in-between. And when you do
> have such situations you can make faster versions of things for that
> alone.
>
However, often you are writing low-level, general-purpose routines.

For instance, I have a routine which calculates the arc length of a Bezier.
There's no closed form for doing this, so it is simple but intensive. It takes
a large number of points on the curve and approximates it by straight lines.

Now, a lot of the time the curves are entered by the user. It will take him three
or four seconds to draw a curve. So any optimisation is pointless.
However, the routine could be called in the inner loop of some very intensive
higher-level function, on a large number of candidate curves. So the function does
in fact have to be micro-optimised.
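
The routine is roughly this shape (a sketch, not the actual code):

#include <cmath>
#include <cstdio>

struct Point { double x, y; };

// Evaluate a cubic Bezier at parameter t.
Point bezier(const Point p[4], double t)
{
    double u = 1.0 - t;
    double b0 = u*u*u, b1 = 3*u*u*t, b2 = 3*u*t*t, b3 = t*t*t;
    return { b0*p[0].x + b1*p[1].x + b2*p[2].x + b3*p[3].x,
             b0*p[0].y + b1*p[1].y + b2*p[2].y + b3*p[3].y };
}

// Approximate the arc length by summing straight-line segments.
double arc_length(const Point p[4], int segments)
{
    double len = 0.0;
    Point prev = p[0];
    for (int i = 1; i <= segments; ++i) {
        Point cur = bezier(p, double(i) / segments);
        len += std::hypot(cur.x - prev.x, cur.y - prev.y);
        prev = cur;
    }
    return len;
}

int main()
{
    Point ctrl[4] = { {0, 0}, {1, 2}, {3, 2}, {4, 0} };
    std::printf("%f\n", arc_length(ctrl, 1000));
}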

Tim Rentsch

Dec 9, 2022, 3:51:13 PM
In English there is a significant difference between "early" and
"premature". (It hadn't occurred to me that premature might be
derived from mature.)

The original source was written in English. It was written
by Don Knuth, whose native language is English. The posting
here was written in English. Furthermore the comment here
is written as a quotation, and mentions that the saying is
wrongly quoted. Once we are in the realm of quoting rather
than paraphrasing there is no room for latitude.

I am sympathetic to those whose native language is other than
English and post in English. More than sympathetic, I admire
them for their language skills, an area where I am woefully much
less than fully competent. Even so, it behooves any non-native
speaker to make an effort to use English correctly, and to want
to use English correctly, when writing in English. And that
especially applies when citing or quoting from a source written
in English.

David Brown

Dec 11, 2022, 6:18:47 AM
The etymology of "mature" is from a Latin word for "ripe" (such as "ripe
fruit"). So it has implications of stages of ageing, unlike "early".
"Premature" is therefore "before the fruit is ready to eat", and thus
very different from merely being "early". It's often a good idea to do
things "early" - it is rarely a good idea to do them "prematurely".

If a language does not have a word corresponding directly to "premature"
(or if the word carries too strong connotations of "premature baby"),
then the equivalent of "too early" is going to be a far better
alternative than merely "early" in Knuth's quotation.



