Avoiding atomic decrement in final ref counting release

Rani Sharoni

unread,

Jan 3, 2010, 3:59:40 AM1/3/10

to

Hi,

Is this correct implementation of COM style Release:
void Release()
{
// no atomic if single owner release the object
if (m_ref == 1 || atomic_decrement(&m_ref) == 0) {
delete this;
}
}

Memory visibility is guaranteed by the synchronization used to acquire
reference to the object.
It's quite common to have single owner of ref counted object for which
the above might be effective optimization.

Rani

Dmitriy Vyukov

unread,

Jan 3, 2010, 1:35:54 PM1/3/10

to

Yes, it's correct in general. If a thread sees that he is the only
owner of an object then he can delete the object w/o any further
"examination". However you still have to preserve correct ordering,
something along the lines of:

if (m_ref.load(std::memory_order_relaxed) == 1 || m_ref.fetch_add
(-1, std::memory_order_release) == 0)
{
std::atomic_thread_fence(std::memory_order_acquire);
delete this;
}

--
Dmitry V'jukov

Rani Sharoni

unread,

Jan 3, 2010, 3:57:44 PM1/3/10

to

This is similar to the share_ptr correspondence.
I don't think that there is case in which memory visibility is an
issue since coherency is guaranteed when the thread got the reference
(access protected using some synchronization that guarantees
visibility).

BTW - same reasoning seem to work for the first AddRef (for objects
that constructed from ref-count==0 for the favor of "sp = new T").

Rani

Chris M. Thomasson

unread,

Jan 3, 2010, 4:14:28 PM1/3/10

to

"Rani Sharoni" <ranish...@gmail.com> wrote in message
news:6778ed8d-e137-4f0f...@21g2000yqj.googlegroups.com...

The thread which drops to zero needs to synchronize with the other threads
that previously decremented the counter:
________________________________________________________
struct object
{
word32 m_state1; // = 0
word32 m_state2; // = 0
uword32 m_count; // =2

void thread_a()
{
m_state1 = 666;

// release membar

if (m_count == 1 || atomic_faa(&m_count, -1) == 1)
{
// we need acquire membar here to sync with
// thread b's store to `m_state2'.

// acquire membar

assert(m_state2 == -666);
}
}

void thread_b()
{
m_state2 = -666;

// release membar

if (m_count == 1 || atomic_faa(&m_count, -1) == 1)
{
// we need acquire membar here to sync with
// thread a's store to `m_state1'.

// acquire membar

assert(m_state1 == 666);
}
}
};
________________________________________________________

This example can trip the assertions if you omit the acquire membar.

Rani Sharoni

unread,

Jan 3, 2010, 4:25:02 PM1/3/10

to

On Jan 3, 11:14 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Rani Sharoni" <ranisharon...@gmail.com> wrote in message

So there is write to m_state1 without any additional synchronization?
Seems like broken code for COM style ref-counting:
// use some synchronization to get reference
// use the object
pObject->Release();
// no m_state1 assumptions in destructor (final delete)

Rani

Chris M. Thomasson

unread,

Jan 3, 2010, 11:45:56 PM1/3/10

to

"Rani Sharoni" <ranish...@gmail.com> wrote in message

news:d03381ed-8633-4302...@a21g2000yqc.googlegroups.com...

On Jan 3, 11:14 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:

[...]

> > The thread which drops to zero needs to synchronize with the other
> > threads
> > that previously decremented the counter:
> > ________________________________________________________

> > [...]

> > ________________________________________________________
> >
> > This example can trip the assertions if you omit the acquire membar.

> So there is write to m_state1 without any additional synchronization?

Yeah. Why should it require synchronization right at the point of mutation
when only one thread writes to it? It's much better to defer the acquire
until after the drop-to-zero condition wrt a general purpose reference
counting implementation.

> Seems like broken code for COM style ref-counting:
> // use some synchronization to get reference
> // use the object
> pObject->Release();
> // no m_state1 assumptions in destructor (final delete)

It's not broken. The drop-to-zero condition should make mutations performed
by threads fully visible to the destructor in a general purpose reference
counting algorithm. This requires acquire semantics...

Rani Sharoni

unread,

Jan 4, 2010, 5:21:49 AM1/4/10

to

> It's not broken. The drop-to-zero condition should make mutations performed
> by threads fully visible to the destructor in a general purpose reference
> counting algorithm. This requires acquire semantics...

OK. I get your point. Many thanks for the clarification (and being
patient).

Here a use case that I commonly encounter:
class CMyData : public CRefObject // COM style AddRef/Release
{
public:
// implicit cleanup, call close if needed (last ref)
~CRefObject() { if (m_hTimer) Close(); }

// explicit cleanup (not dependent on ref-counting)
// only one caller allowed
void Close()
{
// wait and delete m_hTimer
m_hTimer = NULL;
}
};

Rani

Alexander Terekhov

unread,

Jan 4, 2010, 7:51:48 AM1/4/10

to

Rani Sharoni wrote:
>
> Hi,
>
> Is this correct implementation of COM style Release:
> void Release()
> {
> // no atomic if single owner release the object
> if (m_ref == 1 || atomic_decrement(&m_ref) == 0) {
> delete this;
> }
> }

Only if *this is immutable and atomic_decrement() provides 'sink load'
barrier (part of release barrier).

For mutable *this you'd need something along the lines of

void Release() {
size_t ref = m_ref;
if (ref > 1) {
atomic_thread_fence(memory_order_release);
do {
if (atomic_compare_exchange_weak_explicit(&m_ref, &ref, ref - 1,
memory_order_relaxed, memory_order_relaxed)) return;
} while (ref > 1);
}
assert(ref == 1);
atomic_thread_fence(memory_order_acquire);
delete this;
}

or (the above is best for Power/Sparc-RMO, not quite so for Itanium)

void Release() {
size_t ref = m_ref;
while (ref > 1)
if (atomic_compare_exchange_weak_explicit(&m_ref, &ref, ref - 1,
memory_order_relaxed, memory_order_release)) return;
assert(ref == 1);
atomic_thread_fence(memory_order_acquire);
delete this;
}

regards,
alexander.

Rani Sharoni

unread,

Jan 4, 2010, 7:56:59 AM1/4/10

to

> Only if *this is immutable and atomic_decrement() provides 'sink load'
> barrier (part of release barrier).
>
> For mutable *this you'd need something along the lines of
>
> void Release() {
> size_t ref = m_ref;
> if (ref > 1) {
> atomic_thread_fence(memory_order_release);
> do {
> if (atomic_compare_exchange_weak_explicit(&m_ref, &ref, ref - 1,
> memory_order_relaxed, memory_order_relaxed)) return;
> } while (ref > 1);
> }
> assert(ref == 1);
> atomic_thread_fence(memory_order_acquire);
> delete this;
>
> }
>
> or (the above is best for Power/Sparc-RMO, not quite so for Itanium)
>
> void Release() {
> size_t ref = m_ref;
> while (ref > 1)
> if (atomic_compare_exchange_weak_explicit(&m_ref, &ref, ref - 1,
> memory_order_relaxed, memory_order_release)) return;
> assert(ref == 1);
> atomic_thread_fence(memory_order_acquire);
> delete this;
>
> }
>

Thanks for the info.
I think that the conservative InterlockedDecrement handles all this
being a full barrier.

Can you also show implementation of AddRef?
I don't think that it requires any barrier and atomic increment is
sufficient.

Thanks,
Rani

Alexander Terekhov

unread,

Jan 4, 2010, 8:22:27 AM1/4/10

to

Rani Sharoni wrote:
[...]

> I think that the conservative InterlockedDecrement handles all this
> being a full barrier.

Yep but it is unnecessary heavy. InterlockedDecrementIfGreaterThanOne()
with release on success and acquire on failure would do it much better.

>
> Can you also show implementation of AddRef?
> I don't think that it requires any barrier and atomic increment is
> sufficient.

Yup.

void AddRef() {
atomic_fetch_add_explicit(&m_ref, 1, memory_order_relaxed);
}

regards,
alexander.

Rani Sharoni

unread,

Jan 4, 2010, 9:57:33 AM1/4/10

to

> > I think that the conservative InterlockedDecrement handles all this
> > being a full barrier.
>
> Yep but it is unnecessary heavy. InterlockedDecrementIfGreaterThanOne()
> with release on success and acquire on failure would do it much better.

Much better? what is the perf diff compared with full barrier?

>
> > Can you also show implementation of AddRef?
> > I don't think that it requires any barrier and atomic increment is
> > sufficient.
>
> Yup.

Thanks.

> void AddRef() {
> atomic_fetch_add_explicit(&m_ref, 1, memory_order_relaxed);

Is this the same as InterlockedIncreament on x86/x64 (running windows)
as the memory model is conservative and assure visibility regardless
of interlocked-ops? If not, is there a way cheaper than interlocked-op
to express your AddRef on x86/x64?

For some reason I suspect that my original works, by accident, on
windows (or otherwise tons of such code will break).

Thanks,
Rani

Alexander Terekhov

unread,

Jan 4, 2010, 10:21:40 AM1/4/10

to

Rani Sharoni wrote:
>
> > > I think that the conservative InterlockedDecrement handles all this
> > > being a full barrier.
> >
> > Yep but it is unnecessary heavy. InterlockedDecrementIfGreaterThanOne()
> > with release on success and acquire on failure would do it much better.
>
> Much better? what is the perf diff compared with full barrier?

It depends on performance test case of course.

>
> >
> > > Can you also show implementation of AddRef?
> > > I don't think that it requires any barrier and atomic increment is
> > > sufficient.
> >
> > Yup.
>
> Thanks.
>
> > void AddRef() {
> > atomic_fetch_add_explicit(&m_ref, 1, memory_order_relaxed);
>
> Is this the same as InterlockedIncreament on x86/x64 (running windows)
> as the memory model is conservative and assure visibility regardless
> of interlocked-ops? If not, is there a way cheaper than interlocked-op
> to express your AddRef on x86/x64?

No way unless x86/x64 already introduced CHEAPLOCK prefix instruction or
something like that. ;-)

regards,
alexander.