
std::atomic<std::shared_ptr<T>>...


Chris M. Thomasson, Jan 28, 2021, 2:40:48 AM
I am wondering if anybody can run the following C++20 program and report
its output. It is checking to see if an atomic shared_ptr is actually
lock-free. On my compiler, MSVC 2019, I am getting a 0, or false. So, this
means it's not lock-free! It's a damn shame. AFAICT, there is no reason
why std::atomic<std::shared_ptr<T>> should not be 100% lock-free.

https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2

____________________________
#include <iostream>
#include <memory>
#include <atomic>


int main()
{
    std::atomic<std::shared_ptr<long>> instance;

    std::cout << "instance.is_lock_free = " << instance.is_lock_free()
              << "\n";

    return 0;
}
____________________________

The reason I ask is because if your compiler returns true, well, this
means that we can implement lock-free proxy collectors directly in
standard C++20 using atomic shared_ptrs, without having to manually
create the reference counting logic.

Think of a proxy collector as a sort of "poor man's" read-copy-update.

https://en.wikipedia.org/wiki/Read-copy-update

Ian Collins, Jan 28, 2021, 3:41:21 AM
On 28/01/2021 20:40, Chris M. Thomasson wrote:
> I am wondering if anybody can run the following C++20 program and report
> its output. It is checking to see if an atomic shared_ptr is actually
> lockfree. On my compiler MSVC 2019, I am getting a 0, or false. So, this
> means its not lock-free! Its a damn shame. Afaict, there is no reason
> why std::atomic<std::shared_ptr<T>> should not be 100% lock-free.
>
> https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2
>
> ____________________________
> #include <iostream>
> #include <memory>
> #include <atomic>
>
>
> int main()
> {
> std::atomic<std::shared_ptr<long>> instance;
>
> std::cout << "instance.is_lock_free = " << instance.is_lock_free()
> << "\n";
>
> return 0;
> }
> ____________________________
>
No luck with g++/clang++ on Linux (or probably anywhere):

$ g++ -std=c++20 /tmp/lf.cc
In file included from /tmp/lf.cc:3:
/usr/include/c++/10/atomic: In instantiation of ‘struct
std::atomic<std::shared_ptr<long int> >’:
/tmp/lf.cc:8:41: required from here
/usr/include/c++/10/atomic:195:21: error: static assertion failed:
std::atomic requires a trivially copyable type
195 | static_assert(__is_trivially_copyable(_Tp),
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
--
Ian.

Ian Collins, Jan 28, 2021, 3:57:24 AM
On 28/01/2021 21:41, Ian Collins wrote:
> On 28/01/2021 20:40, Chris M. Thomasson wrote:
>> I am wondering if anybody can run the following C++20 program and report
>> its output. It is checking to see if an atomic shared_ptr is actually
>> lockfree. On my compiler MSVC 2019, I am getting a 0, or false. So, this
>> means its not lock-free! Its a damn shame. Afaict, there is no reason
>> why std::atomic<std::shared_ptr<T>> should not be 100% lock-free.
>>
>> https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2
>>
>> ____________________________
>> #include <iostream>
>> #include <memory>
>> #include <atomic>
>>
>>
>> int main()
>> {
>> std::atomic<std::shared_ptr<long>> instance;
>>
>> std::cout << "instance.is_lock_free = " << instance.is_lock_free()
>> << "\n";
>>
>> return 0;
>> }
>> ____________________________
>>
> No luck with g++/clang++ on Linux (or probably anywhere):

$ g++ --version
g++ (Ubuntu 10.1.0-2ubuntu1~18.04) 10.1.0

$ clang++ --version
Ubuntu clang version
12.0.0-++20201226063812+badf0f20f3b3-1~exp1~20201226174505.2125

--
Ian.

Chris M. Thomasson, Jan 28, 2021, 2:44:37 PM
On 1/28/2021 12:41 AM, Ian Collins wrote:
> On 28/01/2021 20:40, Chris M. Thomasson wrote:
>> I am wondering if anybody can run the following C++20 program and report
>> its output. It is checking to see if an atomic shared_ptr is actually
>> lockfree. On my compiler MSVC 2019, I am getting a 0, or false. So, this
>> means its not lock-free! Its a damn shame. Afaict, there is no reason
>> why std::atomic<std::shared_ptr<T>> should not be 100% lock-free.
>>
>> https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2
>>
>> ____________________________
>> #include <iostream>
>> #include <memory>
>> #include <atomic>
>>
>>
>> int main()
>> {
>>       std::atomic<std::shared_ptr<long>> instance;
>>
>>       std::cout << "instance.is_lock_free = " << instance.is_lock_free()
>> << "\n";
>>
>>       return 0;
>> }
>> ____________________________
>>
> No luck with g++/clang++ on Linux

Yeah. It's a new feature of C++20. So, it's going to take some time for
the compiler vendors to finally get around to properly implementing and
testing/verifying it. Imvvvvho, they really need to "strive" to make it
at least lock-free. Now, there might be a problem with that... Humm...
Patents! Several methods to get lock-free, or even wait-free,
implementations do indeed have patents. So, if they cannot come up with
their own algorithms, they might have to license existing techniques...


> (or probably anywhere):

Well, I can actually get it to compile and run on MSVC 2019 using the
/std:c++latest option. This makes the compiler work within the "Preview
- Features from the Latest C++ Working Draft" mode. However, it's
reporting that the impl is not lock-free. That is a rather big problem
wrt using it to implement a proxy collector, or poor man's RCU if you
will... It is nice that they are actually working on it. To be quite
honest, I was rather surprised that it works on MSVC at all.


> $ g++ -std=c++20 /tmp/lf.cc
> In file included from /tmp/lf.cc:3:
> /usr/include/c++/10/atomic: In instantiation of ‘struct
> std::atomic<std::shared_ptr<long int> >’:
> /tmp/lf.cc:8:41:   required from here
> /usr/include/c++/10/atomic:195:21: error: static assertion failed:
> std::atomic requires a trivially copyable type
>   195 |       static_assert(__is_trivially_copyable(_Tp),
>       |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~

They must not have got around to creating it yet. I bet they are working
on it. The feature is nice because it gives std::shared_ptr strong
thread safety guarantees. One way to do it was created by Joe Seigh via
his really neat atomic_ptr:

http://atomic-ptr-plus.sourceforge.net

which requires DWCAS. Having a lock-free, or ideally wait-free,
std::atomic<std::shared_ptr<T>> would make it really easy to create a
proxy collector in pure C++. It would allow for poor man's RCU using the
standard and eliminate the need for a user to manually implement the
special differential reference counting.

Fwiw, I am almost finished porting one of my proxy collectors to C++17.

Chris M. Thomasson, Jan 29, 2021, 1:18:37 AM
On 1/28/2021 12:41 AM, Ian Collins wrote:
I need to give the standalone functions a try as well.

https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic

Marcel Mueller, Feb 1, 2021, 1:20:19 PM
On 28.01.21 at 08:40, Chris M. Thomasson wrote:
> The reason I ask is because if your compiler returns a true, well, this
> means that we can implement lock-free proxy collectors directly in the
> C++20 standard using atomic shared_ptrs, without having to manually
> create the reference counting logic.

I think there can be no /generic/ solution. You basically need strong
thread safety here.

Usually this requirement needs an inner and an outer reference counter.
There are slight variations of this pattern, but they are all similar. At
some point the outer reference count is transferred to the inner counter,
which requires knowing /both/ data structures in one piece of code.
std::atomic does not have this knowledge.

However, one can write a partial specialization of
std::atomic<shared_ptr<T>> that implements the appropriate behavior.
This /should/ be part of C++20. But there is no guarantee that this is
implemented lock-free with an outer ref count. And obviously not all
compilers support this feature yet. :-/

Even worse, the signature of the deprecated atomic_... API functions
prevents any efficient lock-free implementation, because they require
shared instances of shared_ptr<T> to have exactly the same type as
normal instances. Reasonable implementations need a different type for
atomic instances. In fact, shared instances cannot implement any of the
operations provided by shared_ptr<T>.

I implemented a lock free atomic version of a smart pointer several
years ago. It was an intrusive pointer in my case, but this makes no
significant difference.

Unfortunately C++20 does not include std::atomic<std::intrusive_ptr<T>>.
Most likely this is due to the restriction of std::intrusive_ptr that it
only allows modifying the ref counter by one, which prevents any
efficient implementation.

So in the end, no custom atomic lock-free implementation of the std shared
pointer can exist, because it does not allow access to the reference counter.

With some hacks a solution for intrusive_ptr might exist; it needs to
transfer the outer ref count one by one. This could preferably be done
immediately after each modification of the outer count and after the
object is really owned. But I think another hack is needed, because
intrusive_ptr cannot deal with negative ref counts, which would require
the object to be deleted at the increment operation.


Marcel

Öö Tiib, Feb 1, 2021, 3:57:51 PM
On Monday, 1 February 2021 at 20:20:19 UTC+2, Marcel Mueller wrote:
>
> Unfortunately C++20 does not include std::atomic<std::intrusive_ptr<T>>.
> Most likely this is due to the restriction of std::intrusive_ptr that
> does only allows to modify the ref counter by one which prevents any
> efficient implementation.

What is std::intrusive_ptr? I have read only one proposal for an intrusive
pointer in the standard; it was named retain_ptr and AFAIK it has not
been added to std.

Chris M. Thomasson, Feb 1, 2021, 5:00:22 PM
On 2/1/2021 10:19 AM, Marcel Mueller wrote:
> On 28.01.21 at 08:40, Chris M. Thomasson wrote:
>> The reason I ask is because if your compiler returns a true, well,
>> this means that we can implement lock-free proxy collectors directly
>> in the C++20 standard using atomic shared_ptrs, without having to
>> manually create the reference counting logic.
>
> I think there can be no /generic/ solution. You basically need strong
> thread safety here.

Yeah. I think you're right. Well, it might be nice for C++ to perhaps
create a new smart pointer based on Joe's atomic_ptr. But, then they
would get into the realm of possibly violating some patents... Humm...


> Usually this requirement needs an inner and an outer reference counter.
> There are slight variations of this pattern, but they are similar. At
> some point the outer reference count is transferred to the inner counter
> which requires to know /both/ data structures in one piece of code.
> std::atomic does not have this knowledge.
>
> However, one can write a partial specialization of
> std:atomic<shared_ptr<T>>) that implements the appropriate behavior.
> This /should/ be part of C++20. But there is no guarantee that this is
> implemented lock-free with an outer ref count. And obviously not all
> compilers already support this feature at all. :-/

iirc, Peter Dimov tried really hard to get shared_ptr to support strong
thread safety, but failed to do so.


> Even worse, the signature of the deprecated atomic_... API functions
> prevents any efficient lock free implementation because the require
> shared instances of shared_ptr<T> to have exactly the same type than
> normal instances. Reasonable implementations need a different type for
> atomic instances. In fact shared instances cannot implement any
> operations provided by shared_ptr<T>.
>
> I implemented a lock free atomic version of a smart pointer several
> years ago. It was an intrusive pointer in my case, but this makes no
> significant difference.

I think I remember you way back on comp.programming.threads. Do you have
any code remaining for your intrusive algorithm? It would be fun to code
it up in Relacy to check for any possible issues, if any. It seems that
C++ does not have std::intrusive_ptr. Iirc, it was in boost. I might be
missing something.


> Unfortunately C++20 does not include std::atomic<std::intrusive_ptr<T>>.
> Most likely this is due to the restriction of std::intrusive_ptr that
> does only allows to modify the ref counter by one which prevents any
> efficient implementation.
>
> So at the end no custom atomic lock free implementation of std shared
> pointer can exist because it does not allow access to the reference
> counter.

Damn. Iirc, this is akin to what Peter was mentioning way back. Existing
shared_ptr needs to be modified at the implementation level without
breaking any existing use case. Perhaps, it might be better for C++20 to
create a new smart pointer that has strong thread safety... If you do
not need strong thread safety, use shared_ptr. If you do, use
atomic_ptr. Sound Kosher?


> With some hacks a solution for intrusive_ptr might exist, it needs to
> transfer the outer ref count one by one. This could preferably done
> immediately after each modification of the outer count and after the
> object is really owned. But I think another hack is needed, because
> intrusive_ptr cannot deal with negative ref counts which require the
> object to be deleted at the increment operation.

Have you taken a look at my experimental proxy collector code? Still, I
don't think it can be used to make shared_ptr strongly thread safe.
Well, perhaps it can, but it would have to be an internal mutation of an
existing implementation of shared_ptr. Oh well... At least my proxy gc
can allow for a poor man's RCU. Almost finished with an example use case
in pure C++. It involves a lock-free stack that does not need an ABA
counter and allows for node deletion. Also, threads can iterate the
whole stack where they do not need to worry about any nodes being
deleted under their feet. A thread can iterate the lock-free stack while
writer threads are mutating it, even deleting nodes from it.

Chris M. Thomasson, Feb 1, 2021, 5:01:34 PM
Imvho, intrusive reference counts are very useful. They can remove the
need to create an external object to hold the count and a pointer to the
object it protects.

Öö Tiib, Feb 1, 2021, 6:29:04 PM
Indeed those are very useful. Also most popular usage of shared_ptr involves
make_shared that technically binds the control block and managed
object into single object anyway. That does not provide all the benefits
of intrusive ref-counting, but some. Still, to my knowledge the standard
library does not have it, and boost::intrusive_ptr has a couple of
shortcomings (that Marcel Mueller perhaps was trying to fix?).

Chris M. Thomasson, Feb 1, 2021, 11:39:49 PM
On 2/1/2021 3:28 PM, Öö Tiib wrote:
> On Tuesday, 2 February 2021 at 00:01:34 UTC+2, Chris M. Thomasson wrote:
>> On 2/1/2021 12:57 PM, Öö Tiib wrote:
>>> On Monday, 1 February 2021 at 20:20:19 UTC+2, Marcel Mueller wrote:
>>>>
>>>> Unfortunately C++20 does not include std::atomic<std::intrusive_ptr<T>>.
>>>> Most likely this is due to the restriction of std::intrusive_ptr that
>>>> does only allows to modify the ref counter by one which prevents any
>>>> efficient implementation.
>>>
>>> What is std::intrusive_ptr? I have read only one proposal of intrusive
>>> pointer to standard, it was named retain_ptr and AFAIK it has not
>>> been added to std.
>>>
>> Imvho, intrusive reference counts are very useful. They can remove the
>> need to create an external object to hold the count and a pointer to the
>> object it protects.
>
> Indeed those are very useful. Also most popular usage of shared_ptr involves
> make_shared that technically binds the control block and managed
> object into single object anyway. That does not provide all the benefits
> of intrusive ref-counting but some.

I need to study up on shared_ptr. Back when I was looking into it, a
long time ago, I was talking with:

https://www.boost.org/users/people/peter_dimov.html

To be quite honest, I never really used them for real work. When I had
to reference count something with basic thread safety, I would just
stick a reference counter directly into the object itself. For a lot of
cases, it would work fine. Say:

struct object
{
    // contents
};

would become:

struct object
{
    std::atomic<unsigned long> m_ref; // aligned and padded
    // contents
};


> Still to my knowledge standard
> library does not have it and the boost::intrusive_ptr has couple
> shortcomings (that Marcel Mueller perhaps was trying to fix?).

Would love to see Marcel's intrusive counter that has strong thread
safety. Actually, I think I already have, but cannot remember it at all.
If he can find his old code, well, I would be happy to port it into a
Relacy test unit. If there are any issues, I am sure they can be rather
easily cleaned right up. :^)

Pavel, Feb 2, 2021, 12:52:02 AM
Ditto. Even better, I now prefer explicit lifetime management to shared
ptrs and single-threaded processes to threads. A feeling grew in me over
the decades that threads and shared ptrs are dirty hacks, indicate a lack
of design, and slow down my servers. Don't need 'em stinky threads; not
on Linux anyway. :-)

Does anybody believe I am serious and that I do not intend to start any flames?

-Pavel

Marcel Mueller, Feb 2, 2021, 2:02:46 PM
On 01.02.21 at 23:00, Chris M. Thomasson wrote:
> Yeah. I think your right. Well, it might be nice for C++ to perhaps
> create a new smart pointer based on Joe's atomic_ptr. But, then they
> would get into the realm of possibly violating some patents... Humm...

Hmm, indeed, patents might be a show stopper here.

>> However, one can write a partial specialization of
>> std:atomic<shared_ptr<T>>) that implements the appropriate behavior.
>> This /should/ be part of C++20. But there is no guarantee that this is
>> implemented lock-free with an outer ref count. And obviously not all
>> compilers already support this feature at all. :-/
>
> iirc, Peter Dimov tried really hard to get shared_ptr to support strong
> thread safety, but failed to do so.

With modifications to the shared_ptr class this should be straightforward.
- The atomic shared pointer implementation needs to be a friend to
access the internal structures.
- AFAIR some care needs to be taken when a reference is freed because
the ref counter could become negative for a short time due to threading
issues.


>> I implemented a lock free atomic version of a smart pointer several
>> years ago. It was an intrusive pointer in my case, but this makes no
>> significant difference.
>
> I think I remember you way back on comp.programming.threads. Do you have
> any code remaining for your intrusive algorihtm?

The code is still online, under a BSD-like license. But it is also
still designed for IBM VisualAge C++ 3 from the 90s, not even supporting
C++03.

https://github.com/maazl/pm123/blob/master/src/utils/cpp/smartptr.h

I have some newer implementations too, but they are not public.

> It would be fun to code
> it up in Relacy to check for any possible issues, if any.

The code firstly needs to be rewritten to fit modern C++ standards.
Secondly, it uses the stolen bits hack, so it does not depend on
double-word CAS. This was perfectly valid on the target platform and
passed every test hammering on the same shared instance with several
dozen threads.

With generic C++ the stolen bits hack is critical, although it performs
even better on 64-bit because you can safely use 3 bits. I am unsure if
it can be implemented without introducing UB, although it is likely to
work on any real hardware.

> It seems that
> C++ does not have std::intrusive_ptr. Iirc, it was in boost. I might be
> missing something.

You are absolutely right. In fact I always used my own smart pointers in
projects. I did not even use the boost implementation of intrusive_ptr.
So I accidentally assumed that intrusive pointers had made it into
the standard.

I mostly needed strong thread safety and, well, intrusive reference
counting is by far more efficient. And the helper functions of
boost::intrusive_ptr prevented strong thread safety. A reference counter
needs to be exposed as atomic<some_integer> somehow for strong thread safety.


>> So at the end no custom atomic lock free implementation of std shared
>> pointer can exist because it does not allow access to the reference
>> counter.
>
> Damn. Iirc, this is akin to what Peter was mentioning way back. Existing
> shared_ptr needs to be modified at the implementation level without
> breaking any existing use case.

Exactly.

> Perhaps, it might be better for C++20 to
> create a new smart pointer that has strong thread safety... If you do
> not need strong thread safety, use shared_ptr. If you do, use
> atomic_ptr. Sound Kosher?

No, I don't think so.

Only a new type for strongly thread safe instances is needed. This type
should smoothly integrate with shared_ptr. atomic<shared_ptr> might
become this type.

It is very important that the overhead for strong thread safety does
not apply to all local instances, and that object storage can be shared
between normal and shared instances.

And some operations cannot be implemented in a lock-free manner at all.
E.g. swapping two shared instances atomically is not possible; it would
require transactional memory.


> Have you taken a look at my experimental proxy collector code?

No, I do not even know what code you refer here.

> Still, I
> don't think it can be used to make shared_ptr strongly thread safe.
> Well, perhaps it can, but it would have to be an internal mutation of an
> existing implementation f shared_ptr. Oh well...

Indeed.


Marcel

Öö Tiib, Feb 2, 2021, 2:13:57 PM
Yes. The benefit of shared_ptr is not shared ownership but weak
horizontal references. With active objects it works kind of like real
life: X requested something from Y, giving its own weak_ptr, but died
soon after that. Y produced what was requested, saw that X was gone,
and threw the product away. No problems.

The concept is quite handy, since doing the same with intrusive pointers
requires manually implementing some kind of "shut down" state and then
checking and handling it explicitly.

Chris M. Thomasson, Feb 2, 2021, 3:58:43 PM
On 2/2/2021 11:02 AM, Marcel Mueller wrote:
> On 01.02.21 at 23:00, Chris M. Thomasson wrote:
>> Yeah. I think your right. Well, it might be nice for C++ to perhaps
>> create a new smart pointer based on Joe's atomic_ptr. But, then they
>> would get into the realm of possibly violating some patents... Humm...
>
> Hmm, indeed, patents might be a show stopper here.

Shit.


>>> However, one can write a partial specialization of
>>> std:atomic<shared_ptr<T>>) that implements the appropriate behavior.
>>> This /should/ be part of C++20. But there is no guarantee that this
>>> is implemented lock-free with an outer ref count. And obviously not
>>> all compilers already support this feature at all. :-/
>>
>> iirc, Peter Dimov tried really hard to get shared_ptr to support
>> strong thread safety, but failed to do so.
>
> With modifications to the shared_ptr class this should be straight forward.
> - The atomic shared pointer implementation needs to be a friend to
> access the internal structures.
> - AFAIR some care needs to be taken when a reference is freed because
> the ref counter could become negative for a short time due to threading
> issues.
>
>
>>> I implemented a lock free atomic version of a smart pointer several
>>> years ago. It was an intrusive pointer in my case, but this makes no
>>> significant difference.
>>
>> I think I remember you way back on comp.programming.threads. Do you
>> have any code remaining for your intrusive algorihtm?
>
> The code is still online and under a BSD like license. But it is also
> still designed for IBM VisualAge C++ 3 from the 90s, not even supporting
> C++03.
>
> https://github.com/maazl/pm123/blob/master/src/utils/cpp/smartptr.h

Oh, that's great. Will study up on it for sure. Thanks for the link,
Marcel. It's been a while since I used the stolen bits hack: they can do
some magical things, indeed... ;^)


> I have same newer implementations too, but not public.
>
>> It would be fun to code it up in Relacy to check for any possible
>> issues, if any.
>
> The code needs to be rewritten to fit modern C++ standards firstly.
> Secondly it uses the stolen bits hack, so it does not depend on double
> word CAS. This was perfectly valid on the target platform and passed any
> tests hacking on the same shared instance with several dozens of threads.

Okay, nice. Humm... Iirc, I think Relacy might have some issues
simulating the stolen bits. Need to dig back into it. For me personally,
running the code through a race detector is always fun. I probably just
have to talk to Dmitry Vyukov. It's been a while since we discussed
Relacy. I have found some issues with it over the years, and he promptly
corrected them. He is working on the Go language, and ThreadSanitizer.
Iirc, ThreadSanitizer uses a lot of Relacy concepts. Also, I think he is
working on TensorFlow as well.


> With generic C++ the stolen bits hack is critical although it performs
> even better on 64 bit because you can safely use 3 bits. I am unsure if
> it can be implemented without introducing UB - although it is likely to
> work on any real hardware.

Yeah... It probably does have a UB issue. However, like you said, it's
going to work on the vast majority of hardware.


>> It seems that C++ does not have std::intrusive_ptr. Iirc, it was in
>> boost. I might be missing something.
>
> You are absolutely right. In fact I always used my own smart pointers in
> projects. I even did not use the boost implementation of intrusive_ptr
> either. So I accidentally assumed that intrusive pointers made it into
> the standard.
>
> I mostly needed strong thread safety and, well, intrusive reference
> counting is by far more efficient. And the helper functions of
> boost::intrusive_ptr prevented strong thread safety. A reference counter
> need to be exposed as atomic<some_integer> somehow for strong thread
> safety.

Thanks for that info wrt boost::intrusive_ptr. I vaguely remember way
back on comp.programming.threads somebody trying to convince me that
strong thread safety is hardly ever needed. Iirc, my response was that
it's not needed, until it is needed. ;^)


>>> So at the end no custom atomic lock free implementation of std shared
>>> pointer can exist because it does not allow access to the reference
>>> counter.
>>
>> Damn. Iirc, this is akin to what Peter was mentioning way back.
>> Existing shared_ptr needs to be modified at the implementation level
>> without breaking any existing use case.
>
> Exactly.
>
>> Perhaps, it might be better for C++20 to create a new smart pointer
>> that has strong thread safety... If you do not need strong thread
>> safety, use shared_ptr. If you do, use atomic_ptr. Sound Kosher?
>
> No, I don't think so.
>
> Only a new type for strongly thread safe instances is needed. This type
> should smoothly integrate with shared_ptr. atomic<shared_ptr> might
> become this type.

Ahhhh. Agreed. It sure seems to be the prime candidate.
atomic<shared_ptr> would be, or sure seems to be, the perfect place for it.


> It is very important, that the overhead for strong thread safety does
> not apply to all local instances and that object storage can be shared
> between normal and shared instances.

Big time! The separation of the use cases is essential. Joe did that
with atomic_ptr and local_ptr.


> And some operation cannot be implemented in a lock free manner at all.
> E.g. swapping two shared instances atomically is not possible. It would
> require transactional memory.
>
>
>> Have you taken a look at my experimental proxy collector code?
>
> No, I do not even know what code you refer here.

Ahhh, okay. Well here is a link to a crude C++ port. When you get some
free time to kill, can you give it a go?

C++11 version:

https://pastebin.com/raw/p1E9WN5i

Read this thread for context:

https://groups.google.com/g/comp.lang.c++/c/KwreljoeJ0k/m/KdQVJocuCQAJ

(oh shit! a google group link... The google might ban comp.lang.c++
next... shit)

It has some issues I need to clean up with negating unsigned. Just need
to get on that and get rid of some nasty casts. Fwiw, here is the
original thread where I posted it into the wild in the form of a Relacy
test unit:

https://groups.google.com/g/lock-free/c/X3fuuXknQF0/m/0VF_u-7shrsJ?pli=1

The Relacy Test Unit code:

https://pastebin.com/raw/f71480694

Passes all tests.


>> Still, I don't think it can be used to make shared_ptr strongly thread
>> safe. Well, perhaps it can, but it would have to be an internal
>> mutation of an existing implementation f shared_ptr. Oh well...
>
> Indeed.

Yeah, agreed. In closing, I will deeply read your code! Thanks again.

Öö Tiib, Feb 3, 2021, 2:58:53 AM
Maybe you want to, but multi-processing instead of multi-threading is
rather viable when you manage to keep inter-process communication low.
The interfaces between modules tend to be better thought through,
undefined behavior in some rarely exercised module has fewer ways to
screw up the work of other modules, it is easier to scale to multiple
systems, different modules are easier to write in different programming
languages, and little modules are easier to reason about separately, to
black-box test, and/or to profile.

But usual pressure is speed to market and whatever hacks it takes.

Marcel Mueller, Feb 3, 2021, 3:01:08 AM
On 02.02.21 at 21:58, Chris M. Thomasson wrote:
>> With generic C++ the stolen bits hack is critical although it performs
>> even better on 64 bit because you can safely use 3 bits. I am unsure
>> if it can be implemented without introducing UB - although it is
>> likely to work on any real hardware.
>
> Yeah... It probably does have a UB issue. However, like you said, its
> going to work on a vast majority of hardware.

I am not 100% sure about the UB. std::atomic<uintptr_t> might solve the
UB problem with stolen bits. I did not test this, because with a defined
set of target platforms there was no UB.


>> I mostly needed strong thread safety and, well, intrusive reference
>> counting is by far more efficient. And the helper functions of
>> boost::intrusive_ptr prevented strong thread safety. A reference
>> counter need to be exposed as atomic<some_integer> somehow for strong
>> thread safety.
>
> Thanks for that info wrt boost::intrusive_ptr. I vaguely remember way
> back on comp.programming.threads where somebody was trying to convince
> me that strong thread safety is hardly ever needed. Iirc, my response
> was that its not needed, until it is needed. ;^)

There are really /many/ use cases for strong thread safety. It simplifies
programming significantly. Fewer race conditions, no deadlocks, no
priority inversion. Basically it is required for any pointer-like
object to fit into a std::atomic<T>.

In conjunction with immutable objects it enables large-scale
applications without locks, e.g. in-memory databases with deduplication.
They have amazing performance. The most interesting part is the scaling,
which is less than linear, because the more data you load into memory,
the more effectively deduplication compresses it.
Without strong safety any read access needs to be protected by a lock.
This is almost impossible without introducing a bottleneck.

BTDT. I have written such an application in the past. It has been running
for about ten years now. A 20GB database typically fits into less than
2GB of memory.
This results in amazingly high RAM cache efficiencies. Deduplication not
only saves memory. It effectively also increases the memory throughput.
The typical CPU load of the server with this application is 5% with a
dual core VM and several dozen concurrent users. (I/O is neglectable
with an in memory DB anyway.)
The efficiency was so high that we decided to keep the full change
history of ten years. It just takes only little additional resources.


>>> Have you taken a look at my experimental proxy collector code?
>>
>> No, I do not even know what code you refer here.
>
> Ahhh, okay. Well here is a link to a crude C++ port. When you get some
> free time to kill, can you give it a go?
>
> C++11 version:
>
> https://pastebin.com/raw/p1E9WN5i
>
> Read this thread for context:
>
> https://groups.google.com/g/comp.lang.c++/c/KwreljoeJ0k/m/KdQVJocuCQAJ
>
> (oh shit! a google group link... The google might ban comp.lang.c++
> next... shit)

:-)

More likely they will simply discard Google Groups in the future. There
is no profit in this service.


Marcel

Chris M. Thomasson
Feb 4, 2021, 11:22:34 PM
On 2/3/2021 12:00 AM, Marcel Mueller wrote:
> Am 02.02.21 um 21:58 schrieb Chris M. Thomasson:
>>> With generic C++ the stolen bits hack is critical although it
>>> performs even better on 64 bit because you can safely use 3 bits. I
>>> am unsure if it can be implemented without introducing UB - although
>>> it is likely to work on any real hardware.
>>
>> Yeah... It probably does have a UB issue. However, like you said, its
>> going to work on a vast majority of hardware.
>
> I am not 100% sure about the UB. std::atomic<uintptr_t> might solve the
> UB problem with stolen bits. I did not test this, because with a defined
> set of target platforms there was no UB.

std::atomic<uintptr_t> just might do it, and also allow me to give it a
go in Relacy. I should have some free time in a day or two. I need to
_really_ dig into your:

/// Strongly thread safe read
T* acquire() volatile const;

function instead of skimming. That's where a lot of the "magic" happens.
If I am reading this correctly, there is a place where you need strong
CAS. Wrt C++ and its weak and strong CAS variants:

InterlockedCxc in line 281:

if (InterlockedCxc((volatile unsigned*)&Data, old_outer, new_outer) !=
    old_outer && new_outer)
  // Someone else does the job already => undo.
  InterlockedAdd(&((T*)new_outer)->access_counter(), outer_count);
  // The global count cannot return to zero here, because we have an
  // active reference.


Afaict, this will not work with weak CAS. Need to dig into it because
it's interesting. Fwiw, in my experimental proxy collector, I was very
happy that I eliminated the need for any CAS.

https://pastebin.com/raw/p1E9WN5i


>>> I mostly needed strong thread safety and, well, intrusive reference
>>> counting is by far more efficient. And the helper functions of
>>> boost::intrusive_ptr prevented strong thread safety. A reference
>>> counter need to be exposed as atomic<some_integer> somehow for strong
>>> thread safety.

Basically, the inner count needs to be somehow transferred to the outer
count. Using a loopless wait-free method tends to perform better than a CAS.
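Whichever direction the transfer is framed in, the core of a loopless transfer is a single wait-free fetch_add on the receiving counter. A minimal sketch with made-up names:

```cpp
#include <atomic>
#include <cassert>

// The object's inner counter is the authoritative reference count; `outer`
// is a batch of acquires observed elsewhere (e.g. in stolen bits). The
// transfer itself is one wait-free fetch_add -- no CAS loop required.
struct Counted {
    std::atomic<long> inner{0};
};

long transfer_to_inner(Counted& obj, long outer) {
    return obj.inner.fetch_add(outer) + outer;  // returns the new total
}
```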


>> Thanks for that info wrt boost::intrusive_ptr. I vaguely remember way
>> back on comp.programming.threads where somebody was trying to convince
>> me that strong thread safety is hardly ever needed. Iirc, my response
>> was that its not needed, until it is needed. ;^)
>
> There are really /many/ use cases of strong thread safety. It simplifies
> programming significantly. Less race conditions, no deadlocks, no
> priority inversion. Basically it is required for anything pointer like
> object to fit into a std:atomic<T>.

Well, it's basically required for a thread/process to be able to take a
reference to something that it did not previously have a reference to.


> In conjunction with immutable objects it enables large scale
> applications without locks, e.g. in memory databases with deduplication.
> They have amazing performance. The most interesting part is the scaling
> which is less than linear, because the more data you load into memory
> the more effectively deduplication compresses it.
> Without strong safety any read access needs to be protected by a lock.
> This is almost impossible without introducing a bottleneck.

Have you ever messed around with RCU? Pretty damn nice. A proxy
collector can provide a sort of "poor man's" RCU. A proxy collector can
be created from any strongly thread-safe method. Ideally lockfree, even
better when it's loopless and waitfree.


> BTDT. I have written such an application in the past. It runs for about
> ten years now. A 20GB database typically fits into less than 2GB memory.
> This results in amazingly high RAM cache efficiencies. Deduplication not
> only saves memory. It effectively also increases the memory throughput.
> The typical CPU load of the server with this application is 5% with a
> dual core VM and several dozen concurrent users. (I/O is neglectable
> with an in memory DB anyway.)
> The efficiency was so high that we decided to keep the full change
> history of ten years. It just takes only little additional resources.

Excellent!


>>>> Have you taken a look at my experimental proxy collector code?
>>>
>>> No, I do not even know what code you refer here.
>>
>> Ahhh, okay. Well here is a link to a crude C++ port. When you get some
>> free time to kill, can you give it a go?
>>
>> C++11 version:
>>
>> https://pastebin.com/raw/p1E9WN5i
>>
>> Read this thread for context:
>>
>> https://groups.google.com/g/comp.lang.c++/c/KwreljoeJ0k/m/KdQVJocuCQAJ
>>
>> (oh shit! a google group link... The google might ban comp.lang.c++
>> next... shit)
>
> :-)
>
> More likely they simply discard google groups in the future. There is no
> profit in this service.

Iirc, that's the reason why they totally murdered Google+. Damn.

Chris M. Thomasson
Feb 5, 2021, 12:09:20 AM
On 2/3/2021 12:00 AM, Marcel Mueller wrote:
> Am 02.02.21 um 21:58 schrieb Chris M. Thomasson:
>>> With generic C++ the stolen bits hack is critical although it
>>> performs even better on 64 bit because you can safely use 3 bits. I
>>> am unsure if it can be implemented without introducing UB - although
>>> it is likely to work on any real hardware.
[...]

Humm... I did another proxy collector, old and nowhere on the net now
since Comcast killed their webhosting, that used stolen bits for the
collectors. I was worried about overflowing them. Wrt line 273 in your code:

const unsigned old_outer = InterlockedXad((volatile unsigned*)&Data, 1) + 1;
const unsigned outer_count = old_outer & INT_PTR_COUNTER_MASK;
ASSERT((old_outer & INT_PTR_COUNTER_MASK) != 0); // overflow condition

The assert is here for that, right? What am I missing?

The fun part about a proxy collector is that it is not a general purpose
smart pointer. Instead it's more of a poor man's RCU where we can steal a
lot of bits by aligning a collector on a large boundary.

If the number of threads that take concurrent access to a strong
reference exceeds what the stolen bits can count at any one time, then
overflow can and will occur. This condition will intrude on the actual
pointer part, corrupting things. Well, that's the way it was with my old
code using stolen bits as a reference count.

Please correct me if I am missing something from your acquire function.
Thanks.

Marcel Mueller
Feb 5, 2021, 1:43:39 AM
Am 05.02.21 um 05:22 schrieb Chris M. Thomasson:
> std::atomic<uintptr_t> just might do it, and also allow me to give it a
> go in Relacy. I should have some free time in a day or two. I need to
> _really_ dig into your:
>
> /// Strongly thread safe read
>   T*          acquire() volatile const;
>
> function instead of skimming. That's where a lot of the "magic" happens.
> If I am reading this correctly, there is a place where you need strong
> CAS. Wrt C++ and its weak and strong CAS variants:

I never heard of a weak CAS primitive before. But yes, the code relies
on strong CAS. This is essential for staying loop-free.

And you should know that the code was exclusively intended for the x86
platform. x86 allows very little (or no?) memory reordering. So
there are no further memory barriers required when doing simple atomic
reads and writes. std::atomic should be more sophisticated.

> InterlockedCxc in line 281:
>
>  if (InterlockedCxc((volatile unsigned*)&Data, old_outer, new_outer) !=
> old_outer && new_outer)
>     // Someone else does the job already => undo.
>     InterlockedAdd(&((T*)new_outer)->access_counter(), outer_count);
>     // The global count cannot return to zero here, because we have an
> active reference.
>
> Afaict, this will not work with weak CAS. Need to dig into it because
> its interesting. Fwiw, in my experimental proxy collector, I was very
> happy that I eliminated the need for any CAS.

CAS is not bad unless it is a CAS /loop/.
Of course, if read-modify-write memory cycles of the hardware can be
used this is even better.

> Basically, the inner count needs to be somehow transferred to the outer
> count. Using a loopless wait-free method tends to perform better than a
> CAS.

It is a long time since I wrote this code, but AFAIR it is loopless.
The number of atomic operations per access was limited to exactly 3.
Other approaches may need only 2, but this will not work with only 2
stolen bits.


[strong thread safety]
> Well, its basically required for a thread/process to be able to take a
> reference to something that it did not previously have a reference to.

And this is likely to happen almost every time when you deal with an
atomic instance of some referencing object.


> Have you ever messed around with RCU? Pretty damn nice.

I heard of RCU for the first time only recently and have read only a
few articles. AFAICS it just delays deletions a bit to keep memory
accesses valid. This sounds a bit like Thread.Sleep or setTimeout
solutions. From my experience this kind of solution mainly moves the bug
from the developer to the customer.

Grace periods are always too short to prevent corruption under some
special cases with extreme load peaks, e.g. with swapping and I/O delays
involved or maybe some DDOS attack (or just children trying to connect
to their school :-)
Well, at some point we need to note that computers also just happen to
work from the quantum mechanics point of view. So probability /is/ a
valid approach. But probabilities of concurrency issues mainly scale
with the number of coincidences required to trigger them not with the
time taken for the step. A single thread may always block for an unknown
reason at any point. But it is rather unlikely that this happens to many
of them at the same instruction at the same time.

If you reduce RCU to just versioning, then I have probably used it for
decades. Atomic pointer updates and immutable data structures were the
core of the in memory database application I mentioned. But I did not
use any grace period.
The PM123 application where the code link came from also uses this
pattern for all internal strings. They are immutable and deduplicated.

> A proxy
> collector can provide a sort of "poor mans" rcu. A proxy collector can
> be created from any strong-thread safe method. Ideally lockfree, even
> better when its loopless, and waitfree.

I think I did not really catch the intended use case of your proxy
collector for now.


Marcel

Marcel Mueller
Feb 5, 2021, 2:25:30 AM
Am 05.02.21 um 06:09 schrieb Chris M. Thomasson:
> On 2/3/2021 12:00 AM, Marcel Mueller wrote:
>> Am 02.02.21 um 21:58 schrieb Chris M. Thomasson:
>>>> With generic C++ the stolen bits hack is critical although it
>>>> performs even better on 64 bit because you can safely use 3 bits. I
>>>> am unsure if it can be implemented without introducing UB - although
>>>> it is likely to work on any real hardware.
> [...]
>
> Humm... I did a another proxy collector, old and nowhere on the net now
> since Comcast killed their webhosting, that used stolen bits for the
> collectors. Was worried about overflowing them. Wrt, Line 273 in your code:
>
> const unsigned old_outer = InterlockedXad((volatile unsigned*)&Data, 1)
> + 1;
>   const unsigned outer_count = old_outer & INT_PTR_COUNTER_MASK;
>   ASSERT((old_outer & INT_PTR_COUNTER_MASK) != 0); // overflow condition
>
> The assert is here for that right? What am I missing?

:-)

Since OS/2 is a 32 bit platform and I did not want to introduce further
alignment I have only two bits for the outer count. So only 3 threads
can safely acquire the pointer at the same time. The 4th one will cause
the counter to return to 0 - an overflow.

But it is not that bad. All I have to do is to clear the counter as fast
as possible. So the transfer of the counter value is done immediately
after the pointer is acquired. The main reference counter is incremented
by the snapshot value of the small counter first. And if resetting the
counter to zero fails, another thread already did the job and the
increment of the main ref count is undone.
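A rough sketch of that scheme as described, as I read it (the names and the 2-bit layout here are illustrative, not the actual OS/2 code): bump the small counter, transfer its snapshot into the main count, then try to clear the small counter with one strong CAS; on failure another thread already cleared it, so the transfer is undone.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr std::uintptr_t kMask = 0x3;  // two stolen bits for in-flight readers

struct Obj {
    std::atomic<long> main_count{1};   // the "real" reference count
};

Obj* acquire(std::atomic<std::uintptr_t>& word) {
    std::uintptr_t old = word.fetch_add(1) + 1;          // bump stolen bits
    long snap = static_cast<long>(old & kMask);          // snapshot of count
    Obj* obj = reinterpret_cast<Obj*>(old & ~kMask);
    obj->main_count.fetch_add(snap);                     // transfer first
    std::uintptr_t cleared = old & ~kMask;
    if (!word.compare_exchange_strong(old, cleared))     // try to clear
        obj->main_count.fetch_sub(snap);                 // someone else did it
    return obj;
}
```

The key ordering is transfer-then-clear: the main count can never drop to zero in between, because the acquiring thread itself holds a reference.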

The assertion never triggered. I tested really hard with ~100 threads
each hammering the same instance in an infinite loop. In fact the
counter never reached the value 3. 2 was sufficient in real life.

This is the price for the stolen bits hack. The platform did not support
64 bit CAS because some (supported) CPUs did not implement the hardware
instruction cmpxchg8b.
The performance impact of the immediate transfer is small. Most likely
any additional atomic accesses to the same counters are L1 cache hits.

By the way, I could bet 3 stolen bits on 64 bit architectures will also
be sufficient for modern hardware with many cores.
Well, I never tested on SPARC T3 with hundreds of hardware threads or
something like that. But the cache line hopping effectively prevents too
many threads from being suspended after the first increments. Clearing
the counter is, amortized, faster than incrementing it. That's enough to
get no cumulative effect. The number of threads has no significant
impact on the maximum counter value.


> The fun part about a proxy collector is that it is not a general purpose
> smart pointer. Instead its more of a poor mans rcu where we can steal a
> lot of bits by aligning a collector on a large boundary.

> If the number of threads that take concurrent access to a strong
> reference exceeds the stolen bits at any one time, then overflow can and
> will occur. This condition will intrude on the actual pointer part,
> corrupting things. Well, that's the way it was with my old code using
> stolen bits as a reference count.

The latter point I avoided by clearing the counter fast enough.

> Please correct me if I am missing something from your acquire function.
> Thanks.

Absolutely not. It just did not happen ;-)


Marcel

Chris M. Thomasson
Feb 5, 2021, 2:53:16 PM
On 2/4/2021 10:43 PM, Marcel Mueller wrote:
> Am 05.02.21 um 05:22 schrieb Chris M. Thomasson:
>> std::atomic<uintptr_t> just might do it, and also allow me to give it
>> a go in Relacy. I should have some free time in a day or two. I need
>> to _really_ dig into your:
>>
>> /// Strongly thread safe read
>>    T*          acquire() volatile const;
>>
>> function instead of skimming. That's where a lot of the "magic"
>> happens. If I am reading this correctly, there is a place where you
>> need strong CAS. Wrt C++ and its weak and strong CAS variants:
>
> I never heard about a weak CAS primitive before. But yes the code relies
> on strong CAS. This is essential to be loop free.

Indeed. The "problem" we both have with loopfree is an architecture that
uses LL/SC. x86 has pessimistic atomic RMWs where a CAS failure actually
means that it failed because the destination was not equal to the
comparand. So one can use a CAS for loopfree algorithms. However... on a
system with optimistic RMWs, a compare_exchange_weak can spuriously
fail. This also means that compare_exchange_strong is going to use a
loop internally. It's simply not our fault. :^)
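To make the weak/strong distinction concrete (the helper names are illustrative):

```cpp
#include <atomic>
#include <cassert>

std::atomic<long> g_counter{0};

// Weak CAS may fail spuriously (e.g. on LL/SC hardware), so it lives in a
// retry loop; on failure `old` is reloaded with the current value.
void increment_with_weak_cas(std::atomic<long>& c) {
    long old = c.load();
    while (!c.compare_exchange_weak(old, old + 1)) {
        // spurious or genuine failure: just retry with the reloaded value
    }
}

// Strong CAS only fails when the value really differs, so a loopless
// algorithm can branch on the result exactly once.
bool try_swap_once(std::atomic<long>& c, long expected, long desired) {
    return c.compare_exchange_strong(expected, desired);
}
```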


> And you should know that the code was exclusively intended for the x86
> platform. x86 has very little (or no?) ability of memory reordering. So
> there are no further memory barriers required when doing simple atomic
> reads and writes. std::atomic should be more sophisticated.

Completely understood. Well, actually... x86 does require an explicit
membar in the form of MFENCE or a LOCK'ed RMW to implement certain
algorithms. One example is the hazard pointer algorithm. It depends on a
store followed by a load to another location. This can be reordered on
an x86. Are you familiar with hazard pointers?
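The hazard pointer publish step in question looks roughly like this (a sketch, not a full implementation): the hazard store must become visible before the validating load, which is a #StoreLoad ordering; seq_cst provides it, and on x86 it compiles to an MFENCE or a LOCK'ed RMW.

```cpp
#include <atomic>
#include <cassert>

std::atomic<void*> g_shared{nullptr};   // the pointer being protected
std::atomic<void*> g_hazard{nullptr};   // this thread's hazard slot

// Publish-then-validate: store our hazard, then re-load the shared pointer
// and retry if it changed in between. Without the #StoreLoad barrier the
// validating load could be hoisted above the hazard store.
void* acquire_hazard() {
    void* p;
    do {
        p = g_shared.load(std::memory_order_acquire);
        g_hazard.store(p, std::memory_order_seq_cst);        // publish hazard
    } while (g_shared.load(std::memory_order_seq_cst) != p); // validate
    return p;
}
```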


>> InterlockedCxc in line 281:
>>
>>   if (InterlockedCxc((volatile unsigned*)&Data, old_outer, new_outer)
>> != old_outer && new_outer)
>>      // Someone else does the job already => undo.
>>      InterlockedAdd(&((T*)new_outer)->access_counter(), outer_count);
>>      // The global count cannot return to zero here, because we have
>> an active reference.
>>
>> Afaict, this will not work with weak CAS. Need to dig into it because
>> its interesting. Fwiw, in my experimental proxy collector, I was very
>> happy that I eliminated the need for any CAS.
>
> CAS is not bad unless it is a CAS /loop/.
> Of course, if read-modify-write memory cycles of the hardware can be
> used this is even better.

Yup. I have written in the past about how loopless algorithms
tend to lose their "luster" when implemented on an arch that uses
optimistic RMW's, like when CAS is implemented in terms of LL/SC.


>> Basically, the inner count needs to be somehow transferred to the
>> outer count. Using a loopless wait-free method tends to perform better
>> than a CAS.
>
> It is a long time ago since I wrote this code, but AFAIR it is loopless.
> The number of atomic operations per access was limited to exactly 3.
> Other approaches may need only 2, but this will not work with only 2
> stolen bits.

Your algorithm is loopless on x86, SPARC, and some others.
However, on a system with LL/SC, not so much. This is just the way it
is. Shit happens.


> [strong thread safety]
>> Well, its basically required for a thread/process to be able to take a
>> reference to something that it did not previously have a reference to.
>
> And this is likely to happen almost every time when you deal with an
> atomic instance of some referencing object.

It basically depends on the program, and how it accesses things. Some
programs simply do not require strong thread safety.


>> Have you ever messed around with RCU? Pretty damn nice.
>
> I shortly heard the first time of RCU and only read a few articles.
> AFAICS it just delays deletions a bit to keep memory access valid. This
> sound a bit like Thread.Sleep or setTimeout solutions. From my
> experience this kind of solutions mainly move the bug from the developer
> to the customer.

What bug? RCU uses nothing like a sleep or timeout.


> Grace periods are always too short to prevent corruption under some
> special cases with extreme load peaks, e.g. with swapping and I/O delays
> involved or maybe some DDOS attack (or just children trying to connect
> to their school :-)

RCU's not really "ideal" for data structures with high update rates.
However, it's not going to break because of them.


> Well, at some point we need to note that computers also just happen to
> work from the quantum mechanics point of view. So probability /is/ a
> valid approach. But probabilities of concurrency issues mainly scale
> with the number of coincidences required to trigger them not with the
> time taken for the step. A single thread may always block for an unknown
> reason at any point. But it is rather unlikely that this happens to many
> of them at the same instruction at the same time.

You just might like this:

https://youtu.be/DZJPqTTt7MA


> If you reduce RCU by just versioning then I have probably used it for
> decades. Atomic pointer updates and immutable data structures were the
> core of the in memory database application I mentioned. But I did not
> use any grace period.
> The PM123 application where the code link came from also uses this
> pattern for all internal strings. They are immutable and deduplicated.
>
>> A proxy collector can provide a sort of "poor mans" rcu. A proxy
>> collector can be created from any strong-thread safe method. Ideally
>> lockfree, even better when its loopless, and waitfree.


> I think I did not really catch the intended use case of your proxy
> collector for now.

My proxy collector use case is basically the same use case as RCU.
Readers can read, while writers are mutating the data structure, even
deleting things. Actually, it might be better than RCU in high writer
use cases. Humm...

Chris M. Thomasson
Feb 5, 2021, 3:07:17 PM
On 2/4/2021 11:25 PM, Marcel Mueller wrote:
> Am 05.02.21 um 06:09 schrieb Chris M. Thomasson:
>> On 2/3/2021 12:00 AM, Marcel Mueller wrote:
>>> Am 02.02.21 um 21:58 schrieb Chris M. Thomasson:
>>>>> With generic C++ the stolen bits hack is critical although it
>>>>> performs even better on 64 bit because you can safely use 3 bits. I
>>>>> am unsure if it can be implemented without introducing UB -
>>>>> although it is likely to work on any real hardware.
>> [...]
>>
>> Humm... I did a another proxy collector, old and nowhere on the net
>> now since Comcast killed their webhosting, that used stolen bits for
>> the collectors. Was worried about overflowing them. Wrt, Line 273 in
>> your code:
>>
>> const unsigned old_outer = InterlockedXad((volatile unsigned*)&Data,
>> 1) + 1;
>>    const unsigned outer_count = old_outer & INT_PTR_COUNTER_MASK;
>>    ASSERT((old_outer & INT_PTR_COUNTER_MASK) != 0); // overflow condition
>>
>> The assert is here for that right? What am I missing?
>
> :-)
>
> Since OS/2 is a 32 bit platform and I did not want to introduce further
> alignment I have only two bits for the outer count. So only 3 threads
> can safely acquire the pointer at the same time. The 4th one will cause
> the counter to return to 0 - an overflow.

Indeed.


> But it is not that bad. All I have to do is to clear the counter as fast
> as possible.

I will read it carefully. I noticed something like this but I am not
sure yet. Actually, porting your code over to C++17 would help me out here.


> So the transfer of the counter value is done immediately
> after the pointer is acquired. The main reference counter is incremented
> by the snapshot value of the small counter first. And if resetting the
> counter to zero fails another thread already did the job an the
> increment of the main ref count is undone.

I need to dig into this a lot more.


> The assertion never triggered. I tested really hard with ~100 threads
> each hammering the same instance in an infinite loop. In fact the
> counter never reached the value 3. 2 was sufficient in real life.

This sounds a bit odd to me, but then again, I need to understand your
code better. And, I need to understand your use cases. In my proxy
collector here:

https://pastebin.com/raw/p1E9WN5i

It does not use stolen bits, but another technique. Think of it wrt 32-bit:

0xRRRRRRRC

Where R is for the reference count, and C is for the collector index.

Millions of threads can increment the outer counter at the same time. No
problem.


> This is the price for the stolen bits hack.

Indeed.


> The platform did not support
> 64 bit CAS because some (supported) CPUs did not implement the hardware
> instruction cmpxchg8b.
> The performance impact of the immediate transfer is small. Most likely
> any additional atomic access to the same counters are L1 cache hits.
>
> By the way, I could bet 3 stolen bits on 64 bit architectures will also
> be sufficient for modern hardware with many cores.
> Well, I never tested on SPARC T3 with hundreds of hardware threads or
> something like that. But the cache line hopping effectively prevents too
> many threads to be likely suspended after the first increments. Clearing
> the counter is amortized faster that incrementing it.

Is that _totally_ safe? I need to examine it further.


> That's enough to
> get no cumulative effect. The number of threads has no significant
> impact on the maximum counter value.
>
>
>> The fun part about a proxy collector is that it is not a general
>> purpose smart pointer. Instead its more of a poor mans rcu where we
>> can steal a lot of bits by aligning a collector on a large boundary.
>
>> If the number of threads that take concurrent access to a strong
>> reference exceeds the stolen bits at any one time, then overflow can
>> and will occur. This condition will intrude on the actual pointer
>> part, corrupting things. Well, that's the way it was with my old code
>> using stolen bits as a reference count.
>
> The latter point I avoided by clearing the counter fast enough.

"Fast enough" makes me wonder.


>> Please correct me if I am missing something from your acquire
>> function. Thanks.
>
> Absolutely not. It just did not happen ;-)

Cool!

Chris M. Thomasson
Feb 5, 2021, 3:10:13 PM
To be quite honest, I still have never actually used weak pointers wrt
shared_ptr in real code. Need to study up. They seem a bit odd to me, oh
well.

Chris M. Thomasson
Feb 5, 2021, 4:00:25 PM
[...]

For some reason, I remember writing about how a spin on an adaptive
mutex, or spinlock can be used to do other useful work. Instead of
spinning on a contended lock, use this moment in time to try to do some
real work. Iirc, the overall high-level pseudo-code was something like this:
_______________________________
void ct_special_try_lock_high_level(...)
{
    while (! ct_try_lock(...))
    {
        if (! ct_try_to_do_some_other_work(...))
        {
            ct_wait_lock(...);
            break;
        }
    }

    return;
}
_______________________________

Without a "limit" on the number of times try_lock can be invoked, it
kind of turned into a little "message pump" type of scenario. Keep in
mind that each "spin" means other meaningful work was actually accomplished!

;^)
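The pseudo-code above maps onto std::mutex::try_lock like this (a sketch; ct_wait_lock becomes a plain blocking lock(), and the names are illustrative):

```cpp
#include <cassert>
#include <mutex>

// While the lock is contended, drain other pending work; once there is
// nothing else to do, fall back to a blocking lock(). The work callback
// returns false when no work was available.
template <class TryWork>
void lock_doing_other_work(std::mutex& m, TryWork&& try_do_some_work) {
    while (!m.try_lock()) {
        if (!try_do_some_work()) {
            m.lock();              // nothing else to do: block as usual
            return;
        }
    }
}
```

The caller owns the lock on return either way, and every failed try_lock turned into a unit of useful work instead of a wasted spin.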

Chris M. Thomasson
Feb 5, 2021, 5:33:50 PM
On 2/5/2021 11:52 AM, Chris M. Thomasson wrote:
[...]

Basically, calling a blocking operation while holding a mutex is bad!
Blocking while inside a RCU region is bad!

Öö Tiib
Feb 6, 2021, 8:39:25 AM
It feels like having a pointer that we can check whether it is dangling
or not. That solves a lot of issues ... about the bit odd syntax I don't
care, as human languages have always seemed way more odd to me.

Marcel Mueller
Feb 6, 2021, 11:21:14 AM
Am 05.02.21 um 22:00 schrieb Chris M. Thomasson:
> For some reason, I remember writing about how a spin on an adaptive
> mutex, or spinlock can be used to do other useful work. Instead of
> spinning on a contended lock, use this moment in time to try to do some
> real work. Iirc, the overall high-level pseudo-code was something like
> this:
> _______________________________
> void ct_special_try_lock_high_level(...)
> {
>     while (! ct_try_lock(...))
>     {
>         if (! ct_try_to_do_some_other_work(...))
>         {
>             ct_wait_lock(...);
>             break;
>         }
>     }
>
>     return;
> }
> _______________________________

People called this cooperative multi-tasking in the past. ;-)


Marcel

Marcel Mueller
Feb 6, 2021, 11:32:10 AM
Am 05.02.21 um 23:33 schrieb Chris M. Thomasson:
> Basically, calling a blocking operation while holding a mutex is bad!
> Blocking while inside a RCU region is bad!

Indeed. But you have to go to a very low level in a modern OS to ensure
that a piece of code will never block. With a few exceptions only the
kernel can enforce this. At user level it is almost impossible to avoid
blocking completely. Preemptive multitasking and page faults are the
most common reasons.


Marcel

Chris M. Thomasson
Feb 6, 2021, 3:23:16 PM
On 2/6/2021 8:31 AM, Marcel Mueller wrote:
> Am 05.02.21 um 23:33 schrieb Chris M. Thomasson:
>> Basically, calling a blocking operation while holding a mutex is bad!
>> Blocking while inside a RCU region is bad!
>
> Indeed. But you have to go to very low level in modern OS to ensure that
> a piece of code will never block.

Then how do you use a mutex? Think about it. Calling user code that can
block inside a mutex critical section is very, very bad. It can actually
deadlock and prevent the program from running. It sounds like you are
saying that using a mutex is basically impossible in user code. That is
not true.


> With a few exceptions only the kernel
> can enforce this. At user level it is almost impossible to avoid
> blocking completely. Preemptive multi tasking and page faults are the
> most common reasons.
Well, you can block in an RCU region using my proxy collector. However,
the effect will be delayed collections. It's not going to deadlock
anything. So, if we just follow the same rules we have for a mutex, no
problem at all. :^)



Chris M. Thomasson
Feb 6, 2021, 4:14:27 PM
On 2/6/2021 8:31 AM, Marcel Mueller wrote:
Fwiw, I am almost finished wrt showing a use case for my proxy gc. The
new thread will be called "The Poor Mans RCU...".

Basically, this one use case out of many will allow a lock-free list to
be mutated while readers are going full speed ahead wrt iterating the
whole list, while writers are pushing, popping, and deleting nodes.

It can be useful.

Chris M. Thomasson
Feb 6, 2021, 4:16:22 PM
Yeah. I just do not understand how they work inside shared_ptr.
Do weak_ptrs adjust the reference count at all? Please try to excuse my
ignorance here. ;^o

Öö Tiib
Feb 8, 2021, 2:36:13 AM
The weak references are simply counted too, in a separate weak count
that lives alongside the strong count in the control block.
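A quick demonstration of that behavior:

```cpp
#include <cassert>
#include <memory>

// A weak_ptr bumps a separate weak count in the control block, never the
// strong count. The object dies when the strong count hits zero; the
// control block outlives it until the weak count drops too, which is what
// makes expired()/lock() safe to call afterwards.
bool still_alive(const std::weak_ptr<int>& w) {
    return !w.expired();   // consults the strong count
}
```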

Marcel Mueller
Feb 14, 2021, 8:59:54 AM
Am 05.02.21 um 21:07 schrieb Chris M. Thomasson:
>>> const unsigned old_outer = InterlockedXad((volatile unsigned*)&Data,
>>> 1) + 1;
>>>    const unsigned outer_count = old_outer & INT_PTR_COUNTER_MASK;
>>>    ASSERT((old_outer & INT_PTR_COUNTER_MASK) != 0); // overflow
>>> condition
>>>
>>> The assert is here for that right? What am I missing?
>>
>> :-)
>>
>> Since OS/2 is a 32 bit platform and I did not want to introduce
>> further alignment I have only two bits for the outer count. So only 3
>> threads can safely acquire the pointer at the same time. The 4th one
>> will cause the counter to return to 0 - an overflow.
>
> Indeed.
>
>> But it is not that bad. All I have to do is to clear the counter as
>> fast as possible.
>
> I will read it carefully. Noticed something like this but I am not sure
> yet. Actually, porting your code over to C++17 would help me out here.

I did a rough port to C++17 using atomic<uintptr_t>. Unfortunately
times have changed. The code is no longer reliable in general. :-(

It works under OS/2 (the original, x86). It works in a Linux VM (x64).
It does /not/ work on the host of the VM (same hardware, AM4). The
stolen bits counter overflows soon. It works on my local PC (AM3+). It
does not work on a 16-core Ryzen. It does not work on a Xeon, nor on an
ARMv7 quad core.

The maximum stolen bits count scales nonlinearly with the number of CPU
cores. It is less on Xeon and ARM than on AMD.
It is very interesting that the maximum count is at least 30% less if
the code is executed in a VM on the same hardware. The scheduler seems to
have some influence. In fact the code runs twice as fast inside the VM!

I tested with 300 threads hammering on the same atomic instance in an
infinite loop. The duration of the test has almost no effect. The number
of threads also has no significant effect as long as there are enough to
reach the maximum.

>> The assertion never triggered. I tested really hard with ~100 threads
>> each hammering the same instance in an infinite loop. In fact the
>> counter never reached the value 3. 2 was sufficient in real life.
>
> This sounds a bit odd to me, but then again, I need to understand your
> code better.

It /is/ odd. My tests were quite a long time ago. And the VM I usually
use for development was one of the setups that worked (max count 4 of 7
allowed on x64).
There seems to be no alternative to DWCAS for the atomic version. :-/

An intrusive reference counted smart pointer is still useful. But it is
no longer wait free if the platform does not support DWCAS.


> https://pastebin.com/raw/p1E9WN5i
>
> It does not use stolen bits, but another technique. Think of wrt 32-bit:
>
> 0xRRRRRRRC
>
> Where R is for the reference count, and C is for the collector index.
>
> Millions of threads can increment the outer counter at the same time. No
> problem.

The collector index?


Marcel

Chris M. Thomasson

unread,
Feb 14, 2021, 3:46:28 PM2/14/21
to
Yeah. Looking at your code, I was just worrying about a shi% load of
threads all taking a reference to the strong pointer at the same time.
That would overflow it rather quickly. Now, I have some old proxy
collector code that steals enough bits to hold an 8-bit counter, so
that's 256 threads. However, if more than 256 threads take the strong
count at the same time, then it will overflow. I need to find it on
archive.org. Luckily, it just might still be there.

https://web.archive.org/web/2017*/http://webpages.charter.net/appcore


>> https://pastebin.com/raw/p1E9WN5i
>>
>> It does not use stolen bits, but another technique. Think of wrt 32-bit:
>>
>> 0xRRRRRRRC
>>
>> Where R is for the reference count, and C is for the collector index.
>>
>> Millions of threads can increment the outer counter at the same time.
>> No problem.
>
> The collector index?

The collector index is embedded within the counter so I can increment
the reference count and grab the collector index in a single atomic RMW,
fetch_add in this case. Then I decode the index from the return value and
use it as an index into the collector objects; there are two collectors in
this case. Take a careful look at the following code in my proxy
collector: https://pastebin.com/raw/CYZ78gVj
____________________________
collector& acquire()
{
    // increment the master count _and_ obtain the current collector.
    std::uint32_t current =
        m_current.fetch_add(ct_ref_inc, std::memory_order_acquire);

    // decode the collector index.
    return m_collectors[current & ct_proxy_mask];
}
____________________________

It returns a reference to the indexed collector.
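To make the decode concrete, here is a toy model of the 0xRRRRRRRC layout. The constants are illustrative (the real values are in the pastebin above): one low bit selects between the two collectors, and fetch_add by the increment step leaves that bit untouched, so millions of increments never corrupt the index.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative constants: low bit = collector index, rest = count.
constexpr std::uint32_t ct_proxy_mask = 0x1;
constexpr std::uint32_t ct_ref_inc = 0x2; // adds one reference, index untouched

std::atomic<std::uint32_t> m_current{1}; // start on collector 1, count 0

std::uint32_t acquire_index()
{
    // Single RMW: bump the reference count and fetch the collector
    // index that was current at that instant.
    std::uint32_t cur =
        m_current.fetch_add(ct_ref_inc, std::memory_order_acquire);
    return cur & ct_proxy_mask;
}
```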

Chris M. Thomasson

unread,
Feb 14, 2021, 5:05:41 PM2/14/21
to
This quote is interesting to me:

https://en.cppreference.com/w/cpp/memory/weak_ptr

___________________
std::weak_ptr models temporary ownership: when an object needs to be
accessed only if it exists, and it may be deleted at any time by someone
else, std::weak_ptr is used to track the object, and it is converted to
std::shared_ptr to assume temporary ownership. If the original
std::shared_ptr is destroyed at this time, the object's lifetime is
extended until the temporary std::shared_ptr is destroyed as well.
___________________
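The quoted behavior is easy to demonstrate: lock() yields a temporary shared_ptr, and once the last shared_ptr is gone the weak_ptr can no longer revive the object. A small self-contained example:

```cpp
#include <memory>

// Returns true when weak_ptr behaves as the quote describes.
bool weak_demo()
{
    std::weak_ptr<int> w;
    {
        auto s = std::make_shared<int>(42);
        w = s;
        auto t = w.lock(); // temporary ownership while t lives
        if (!t || *t != 42)
            return false;
    } // last shared_ptr (and the temporary) destroyed here
    // the object is gone; lock() now returns an empty shared_ptr
    return w.expired() && !w.lock();
}
```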

Does shared_ptr have a "separate" reference count from weak_ptr's?

Öö Tiib

unread,
Feb 14, 2021, 6:36:36 PM2/14/21
to
Yes.
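That is, the control block holds a strong count (shared_ptr owners) and a separate weak count; copying weak_ptrs does not change use_count(). A quick check:

```cpp
#include <memory>

// Two weak_ptr copies exist, yet the strong count stays at 1.
bool separate_counts_demo()
{
    auto s = std::make_shared<int>(7);
    std::weak_ptr<int> w1 = s;
    std::weak_ptr<int> w2 = s;
    (void)w2;
    // use_count() reports only shared_ptr owners, from either handle.
    return s.use_count() == 1 && w1.use_count() == 1;
}
```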
