
std::msync


Alexander Terekhov

Sep 20, 2004, 5:09:20 AM

A) < Forward Inline >

-------- Original Message --------
Message-ID: <41276DBB...@web.de>
Newsgroups: comp.std.c++
Subject: Re: Multithreaded programming: is the C++ standardization committee listening?
References: ... <abefd130.04082...@posting.google.com>

Peter Dimov wrote:
[...]
> One problem is that the most efficient combination of memory barriers
> may differ depending on the platform. The unsupported standardized
> barriers will need to turn into full barriers (the most inefficient
> kind.)

The standard shall simply define a portable barrier model (of course
including "relaxed"/"tuned" ones). Something like

msync::none // nothing (e.g. for refcount<T, basic>::increment)
msync::fence // classic fence (acq+rel -- see below)
msync::acq // classic acquire (hlb+hsb -- see below)
msync::ddacq // acquire with data dependency
msync::ccacq // acquire with control dependency
msync::hlb // hoist-load barrier -- acquire not affecting stores
msync::ddhlb // ...
msync::cchlb // ...
msync::hsb // hoist-store barrier -- acquire not affecting loads
msync::ddhsb // ...
msync::cchsb // ...
msync::rel // classic release (slb+ssb -- see below)
msync::slb // sink-load barrier -- release not affecting stores
msync::ssb // sink-store barrier -- release not affecting loads
msync::slfence // store-load fence (ssb+hlb -- see above)
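
A rough sketch of how these labels line up with C++11 std::memory_order (an
editorial approximation, not part of the proposal above; the label set is
finer-grained than anything <atomic> offers, so several labels collapse onto
the same standard ordering):

#include <atomic>

std::atomic<int> g{0};

void orientation_sketch() {
    (void) g.load(std::memory_order_relaxed);             // ~ msync::none
    (void) g.load(std::memory_order_consume);             // ~ msync::ddacq (data-dependent acquire)
    (void) g.load(std::memory_order_acquire);             // ~ msync::acq (subsumes hlb/hsb)
    g.store(1, std::memory_order_release);                // ~ msync::rel (subsumes slb/ssb)
    std::atomic_thread_fence(std::memory_order_seq_cst);  // ~ msync::slfence (store-load fence)
}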

Here's a few illustrations.

DCSI-MBR:

class stuff : private lazy_mutex { // "create/open named mutex"
                                   // trick on windows
  atomic<lazy const *> m_ptr;

public:

  /* ... */

  lazy const & lazy_instance() {
    lazy const * ptr;
    if (!(ptr = m_ptr.load(msync::ddhlb))) {
      lazy_mutex::guard guard(this);
      if (!(ptr = m_ptr.load(msync::none)))
        m_ptr.store(ptr = new lazy(), msync::ssb);
    }
    return *ptr;
  }
};

http://groups.yahoo.com/group/boost/message/15442
(that's "lazy mutex")

DCCI: (double-checked concurrent init)

class stuff {

  atomic<lazy const *> m_ptr;

public:

  /* ... */

  lazy const & lazy_instance() {
    lazy const * ptr;
    if (!(ptr = m_ptr.load(msync::ddhlb)) &&
        !m_ptr.attempt_update(0, ptr = new lazy(), msync::ssb)) {
      delete ptr;
      ptr = m_ptr.load(msync::ddhlb);
    }
    return *ptr;
  }
};
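
A minimal sketch of the same DCCI shape with C++11 atomics (an editorial
approximation, not the author's code; "lazy" and "m_ptr" are stand-ins). A
failed compare_exchange means another thread won the race, so the
speculatively constructed object is discarded and the winner's pointer is
used instead:

#include <atomic>

struct lazy { /* ... */ };

std::atomic<lazy const*> m_ptr{nullptr};

lazy const& lazy_instance() {
    lazy const* ptr = m_ptr.load(std::memory_order_acquire);  // ~ msync::ddhlb (approximated)
    if (!ptr) {
        lazy const* expected = nullptr;
        lazy const* created  = new lazy();
        if (m_ptr.compare_exchange_strong(expected, created,
                                          std::memory_order_acq_rel,  // release would do for the publish (~ msync::ssb)
                                          std::memory_order_acquire)) {
            ptr = created;
        } else {
            delete created;   // lost the race; another thread published first
            ptr = expected;   // the winner's pointer
        }
    }
    return *ptr;
}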

Sorta better "critical-section" for windows.

// doesn't provide "POSIX-safety" with respect to destruction
class swap_based_mutex { // noncopyable

  atomic<int> m_lock_status;      // 0: free, 1/-1: locked/contention
  auto_reset_event m_retry_event; // bin.sema/gate

public:

  // ctor/dtor [w/o lazy event init]

  void lock() throw() {
    if (m_lock_status.swap(1, msync::ccacq))
      while (m_lock_status.swap(-1, msync::ccacq))
        m_retry_event.wait();
  }

  bool trylock() throw() {
    return !m_lock_status.swap(1, msync::ccacq) ?
      true : !m_lock_status.swap(-1, msync::ccacq);
  }

  bool timedlock(absolute_timeout const & timeout) throw() {
    if (m_lock_status.swap(1, msync::ccacq)) {
      while (m_lock_status.swap(-1, msync::ccacq))
        if (!m_retry_event.timedwait(timeout))
          return false;
    }
    return true;
  }

  void unlock() throw() {
    if (m_lock_status.swap(0, msync::rel) < 0)
      m_retry_event.set();
  }
};

regards,
alexander.

B) < Forward Inline >

-------- Original Message --------
Message-ID: <414E9206...@web.de>
Newsgroups: comp.lang.c++.moderated
Subject: Re: Possible solution to the DCL problem (Scott Meyers, Andrei Alexandrescu)
References: ... <MPG.1bb51ea3e...@news.hevanet.com>

Scott Meyers wrote:
>
> On 16 Sep 2004 22:41:25 -0400, Alexander Terekhov wrote:
>
> > Keyboard* temp = pInstance;
> > Perform acquire;
> > ...
> >
> > is not really the same (with
> > respect to reordering) as
> >
> > Keyboard* temp = pInstance;
> > Lock L1(args); // acquire
> > ...
> >
> > because the later can be transformed to
> >
> > Lock L1(args); // acquire
> > Keyboard* temp = pInstance;
> > ...
>
> Upon further reflection, I don't see your point here. As I understand it,
> the acquire operation, whether explicit or as part of the Lock constructor,
> prevents memory accesses after (in program order) the initialization of
> temp from migrating up above temp's initialization.

No.

Keyboard* temp = pInstance;
Lock L1(args); // acquire
temp2 = ...

Doesn't prevent

Lock L1(args); // acquire
temp2 = ...
Keyboard* temp = pInstance;

transformation.

With atomic<Keyboard*> for pInstance

Keyboard* temp = pInstance.load(msync::acq);
temp2 = ...

it does. IOW, "acquire" is a marker. It's one thing to mark
"Keyboard* temp = pInstance", and it's a completely different thing
to mark some other access like "Lock L1(args)" in your case.
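
With C++11 atomics, a minimal sketch of the same point (editorial, not part
of the original exchange; Keyboard and pInstance are stand-ins): the acquire
travels with the pInstance load itself, so later accesses cannot be hoisted
above that particular load.

#include <atomic>

struct Keyboard;
std::atomic<Keyboard*> pInstance{nullptr};

Keyboard* reader() {
    Keyboard* temp = pInstance.load(std::memory_order_acquire); // the marked access
    // accesses that depend on temp below cannot be reordered above this load
    return temp;
}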

> You seem to be
> suggesting that that is true in the explicit case but not in the case of
> the acquire being part of the Lock constructor.

See above.

> If that is what you are
> arguing, can you please explain why that is the case? If that is not what
> you are arguing, can you please clarify your argument?

Yes, that's what I'm arguing. See above.

[...]
> I've reread the Plan9 posting, and I still don't see your point. The
> transition from page 34 to page 35 of my notes involves replacement of an
> explicit acquire with a Lock constructor (i.e., an implicit acquire). You
> seem to think that there is a change in semantics,

Yes.

> one that offers fewer
> memory visibility guarantees. The only comment about locks in the Plan9
> discussion is this one:
>
> One bit of reassurance: any data structure protected by a spin
> lock is safe. Here's why:
>
> P1 P2
> [already holding lock] wait for lock->busy == 0
> store data->x grab lock
> store data->y use data->x and ->y
> lock->busy = 0
>
> Because of processor ordering, when P2 observes lock->busy == 0,
> it also has observed all prior stores by P1. Hence P2 never gets
> an inconsistent view of P1's updates.
>
> Furthermore, regarding locks and acquire/release, David Butenhof makes this
> remark (in the comp.programming.threads posting to which you responded
> with the link to the Plan9 story):

Locks aside for a moment, the Plan9 story illustrates the use of a
store-load fence on IA32.

: What we need is that if the following sequence is executed
:
: P1: P2:
: x = 0 y = 0
: x = 1 y = 1
: read y read x
:
: has the values read will be one of
:
: 1 0
: 0 1
: 1 1
:
: 0,0 blows us away.

P1: P2:
x = 0 y = 0
x = 1 y = 1
slfence slfence
read y read x

Is safe.
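
The same store/fence/load shape as a minimal C++11 sketch (editorial
approximation, not part of the original exchange): a seq_cst fence is the
standard stand-in for the store-load fence, and with it the outcome
r1 == 0 && r2 == 0 is ruled out.

#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void p1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // ~ msync::slfence
    r1 = y.load(std::memory_order_relaxed);
}

void p2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // ~ msync::slfence
    r2 = x.load(std::memory_order_relaxed);
}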

>
> Because there's no way for software or hardware to reliably associate any
> particular data with a particular mutex, in practice any thread that locks
> any mutex will have a view of (all) memory consistent with the view of the
> last thread to unlock any mutex (at the time of that unlock). This is
> essentially as if you replaced the mutex lock and unlock operations by
> general (full) memory barriers.

Butenhof is wrong.

>
> This suggests that replacement of an explicit acquire operation by mutex
> acquisition replaces an acquire by a full fence, hence *increasing* the
> memory visibility guarantees.

He doesn't seem to "get it", unfortunately.

http://groups.google.com/groups?threadm=40ae0044%40usenet01.boi.hp.com

>
> So, as I said, I don't see your point.

You're in good company. ;-)

[ ... atomic<> ... ]

> Also, again, what is the meaning of atomic<>? If I had to guess, I'd guess
> that all operations on the type atomic<T> are guaranteed to be seen as
> atomic, but I'd prefer not to guess.

Yes, atomic operations with a whole bunch of reordering constraints
to ensure proper memory visibility. Like in

template<typename numeric>
class refcount<numeric, basic> {
public:

  enum may_not_store_min_t { may_not_store_min };

private:

  atomic<numeric> m_value;

  template<typename min_msync, typename update_msync>
  bool decrement(min_msync mms, update_msync ums) throw() {
    numeric val;
    do {
      val = m_value.load(msync::none);
      assert(min() < val);
      if (min() + 1 == val) {
        m_value.store(min(), mms);
        return false;
      }
    } while (!m_value.attempt_update(val, val - 1, ums));
    return true;
  }

  /* ... */
};

class sp_counted_base {
  /* ... */
  refcount<std::size_t, basic> use_count_;
  refcount<std::size_t, basic> self_count_;
  /* ... */
public:
  /* ... */
  sp_counted_base() : use_count_(1), self_count_(1) { }

  std::size_t use_count() const throw() {
    return use_count_.get();
  }

  void add_ref() throw() {
    use_count_.increment();
  }

  bool lock() throw() {
    return use_count_.increment_if_not_min();
  }

  void weak_add_ref() throw() {
    self_count_.increment();
  }

  void weak_release() throw() {
    if (!self_count_.decrement(msync::acq))
      destruct();
  }

  void release() throw() {
    if (!use_count_.decrement()) {
      dispose();
      if (!self_count_.decrement(msync::rel))
        destruct();
    }
  } /* ... */
};

http://www.terekhov.de/pthread_refcount_t/experimental/refcount.cpp

regards,
alexander.

SenderX

Sep 20, 2004, 2:39:58 PM

>> One problem is that the most efficient combination of memory barriers
>> may differ depending on the platform. The unsupported standardized
>> barriers will need to turn into full barriers (the most inefficient
>> kind.)

Not necessarily. The CPU-specific assembly that actually implements
atomic<>'s methods could be extremely efficient.

<...>

> msync::none // nothing (e.g. for refcount<T, basic>::increment)
> msync::fence // classic fence (acq+rel -- see below)
> msync::acq // classic acquire (hlb+hsb -- see below)
> msync::ddacq // acquire with data dependency
> msync::ccacq // acquire with control dependency
> msync::hlb // hoist-load barrier -- acquire not affecting stores
> msync::ddhlb // ...
> msync::cchlb // ...
> msync::hsb // hoist-store barrier -- acquire not affecting loads
> msync::ddhsb // ...
> msync::cchsb // ...
> msync::rel // classic release (slb+ssb -- see below)
> msync::slb // sink-load barrier -- release not affecting stores
> msync::ssb // sink-store barrier -- release not affecting loads
> msync::slfence // store-load fence (ssb+hlb -- see above)

If a CPU doesn't have fine-grained memory barriers, then of course performance
will be reduced:

On Itanium a msync::acq/rel would be converted to msync::fence, i.e. the lock
prefix...

It would be up to the application programmer to choose the most efficient
barrier, and up to the template implementation to choose the most efficient
"compatible" variation available for the target platform...

SenderX

Sep 20, 2004, 2:42:53 PM

> On Itanium a msync::acq/rel would be converted to msync::fence, ie the
> lock prefix...

I mean:

On Itanium a msync::acq/rel would be converted to msync::fence; on x86, to
the lock prefix...


Alexander Terekhov

Nov 4, 2004, 5:38:31 PM

Alexander Terekhov wrote:
[...]

> Here's a few illustrations.

Here's one more.

// Introspection (for bool argument below) aside for a moment
template<typename T, bool copy_ctor_or_dtor_can_mutate_object>
class mutex_and_condvar_free_single_producer_single_consumer {

  typedef isolated< aligned_storage< T > > ELEM;

  size_t m_size; // > 1
  ELEM * m_elem;
  atomic< ELEM * > m_head;
  atomic< ELEM * > m_tail;

  ELEM * advance(ELEM * elem) const {
    return (++elem < m_elem + m_size) ? elem : m_elem;
  }

  ...

public:

  ...

  void producer(const T & value) {
    ELEM * tail = m_tail.load(msync::none);
    ELEM * next = advance(tail);
    while (next == m_head.load(msync::cchsb)) usleep(1000);
    new(tail) T(value);
    m_tail.store(next, msync::ssb);
  }

  T consumer() {
    ELEM * head = m_head.load(msync::none);
    while (head == m_tail.load(type_list< msync::cchlb_t, msync::ccacq_t >::
        element<copy_ctor_or_dtor_can_mutate_object>::type())) usleep(1000);
    T value(*head);
    head->~T();
    m_head.store(advance(head), type_list< msync::slb_t, msync::rel_t >::
        element<copy_ctor_or_dtor_can_mutate_object>::type());
    return value;
  }

};
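
For comparison, a self-contained C++11 sketch of the same single-producer/
single-consumer ring (editorial approximation, not the author's code; the
spsc_ring name and the try_push/try_pop interface are stand-ins). acquire
and release on m_head/m_tail play the roles of the cchsb/ssb and cchlb/slb
labels above.

#include <atomic>
#include <cstddef>

template<typename T, std::size_t N>       // N > 1; one slot is left unused
class spsc_ring {
    T m_elem[N];
    std::atomic<std::size_t> m_head{0};   // consumer index
    std::atomic<std::size_t> m_tail{0};   // producer index

    static std::size_t advance(std::size_t i) { return (i + 1) % N; }

public:
    bool try_push(const T& value) {       // producer thread only
        std::size_t tail = m_tail.load(std::memory_order_relaxed);
        std::size_t next = advance(tail);
        if (next == m_head.load(std::memory_order_acquire))      // full
            return false;
        m_elem[tail] = value;
        m_tail.store(next, std::memory_order_release);           // publish the element
        return true;
    }

    bool try_pop(T& value) {              // consumer thread only
        std::size_t head = m_head.load(std::memory_order_relaxed);
        if (head == m_tail.load(std::memory_order_acquire))      // empty
            return false;
        value = m_elem[head];
        m_head.store(advance(head), std::memory_order_release);  // free the slot
        return true;
    }
};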

regards,
alexander.

Alexander Terekhov

Feb 18, 2005, 7:36:32 AM

< Forward Inline >

-------- Original Message --------
Message-ID: <42150F0C...@web.de>
Date: Thu, 17 Feb 2005 22:39:24 +0100
From: Alexander Terekhov <tere...@web.de>
Newsgroups: comp.std.c++
Subject: Re: C++ multithreading model
References: ... <slrnd19818.2ik.b...@decadentplace.org.uk>

Ben Hutchings wrote:
[...]
> The PowerPC architecture at least doesn't have CAS; it has two
> separate instructions for load-and-watch and store-if-not-modified (I
> forget what they are actually called).

Load-reserved and store-conditional.

// doesn't provide "POSIX-safety" with respect to destruction

class mutex_for_XBOX_NEXT { // noncopyable

  atomic<int> m_lock_status;      // 0: free, 1/-1: locked/contention

  auto_reset_event m_retry_event; // prohibitively slow bin.sema/gate

  template<typename T>
  int attempt_update(int old, int new_value, T msync) {
    while (!m_lock_status.store_conditional(new_value, msync)) {
*****
On the assembly level (in MP-safe mode), acquire is implemented using a
trailing isync and release is implemented using a leading sync/lwsync
instruction.
*****
      int fresh = m_lock_status.load_reserved(msync::none);
      if (fresh != old)
        return fresh;
    }
    return old;
  }

public:

  // ctor/dtor [w/o lazy event init]

  bool trylock() throw() {
    return !(m_lock_status.load_reserved(msync::none) ||
             attempt_update(0, 1, msync::acq));
  }

  // bool timedlock() omitted for brevity

  void lock() throw() {
    int old = m_lock_status.load_reserved(msync::none);
    if (old || (old = attempt_update(0, 1, msync::acq))) {
      do {
        while (old < 0 ||
               (old = attempt_update(1, -1, msync::acq))) {
*****
Acq above is needed in order to ensure proper ordering with respect
to the "semaphore lock" operation on m_retry_event below. The lock status
transition 1 -> -1 must complete before the "semaphore lock" operation
takes place (otherwise a "deadlock" can arise).
*****
          m_retry_event.wait();
          old = m_lock_status.load_reserved(msync::none);
*****
Here proper ordering is ensured by the semaphore lock operation
(m_retry_event.wait()), which is meant to provide acquire semantics
(m_lock_status.load_reserved(msync::none) must complete after the
semaphore lock operation).
*****
          if (!old) break;
        }
      } while ((old = attempt_update(0, -1, msync::acq)));
    }
  }

  void unlock() throw() {
    if (m_lock_status.load_reserved(msync::none) < 0 ||
        attempt_update(1, 0, msync::rel) < 0) { // or just !SC
      m_lock_status.store(0, msync::rel);
      m_retry_event.set();
*****
Ordering here is also important. m_retry_event.set() (the semaphore
unlock operation) is meant to provide release semantics
(m_lock_status.store(0, msync::rel) must complete before the semaphore
unlock operation).
*****
    }
  }
};

regards,
alexander.

Alexander Terekhov

Mar 30, 2005, 12:24:51 PM

< Forward Inline http://gcc.gnu.org/ml/libstdc++/2005-03/msg00371.html>

-------- Original Message --------
Message-ID: <e52efbe105033...@mail.gmail.com>
Subject: Re: [patch] Make std::tr1::shared_ptr thread-safe.

"Peter Dimov" <pdi...@mmltd.net> on 03/30/2005 04:36:54 PM:
>
> Alexander Terekhov wrote:
> > [... __release/acquire_memory_barrier() ...]
> >
> >> The only reliable implementation of these barriers that I see
> >> is an empty pthread_mutex_lock/pthread_mutex_unlock pair.
> >
> > Nope. That won't work. [...]
>
> Update:
>
> I think that we've reached the conclusion that the following
> implementation:
>
> void release() // nothrow
> {
> if (__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1)
> {
> pthread_mutex_lock( &_M_mutex );
> dispose();
> pthread_mutex_unlock( &_M_mutex );
> weak_release();
> }
> }

void release() // nothrow
{
    if (__gnu_cxx::__exchange_and_add(&_M_use_count, -1) == 1)
    {
        dispose();
        pthread_mutex_lock( &_M_mutex );
        pthread_mutex_unlock( &_M_mutex );
        weak_release();
    }
}

would also work. The key here is...

>
> void weak_release() // nothrow
> {
> if (__gnu_cxx::__exchange_and_add(&_M_weak_count, -1) == 1)
> {
> pthread_mutex_lock( &_M_mutex );
> pthread_mutex_unlock( &_M_mutex );
> destroy();
> }
> }

to have the same lock on both sides. Using "thread-specific" locks to
"emulate" bidirectional fences with minimum contention won't work.

>
> will work *provided that __exchange_and_add imposes at least the ordering
> that is required for reference-counted immutable objects to work*.

That's msync::slb (relaxed msync::rel not affecting sinking of preceding
stores) when result > 1 and msync::cchsb (relaxed msync::acq [cc stands for
control-condition hint] not affecting hoisting of subsequent loads) when
result == 0.

No _M_mutex above is needed with a more constrained version of
__exchange_and_add() providing msync::rel when result > 1 and msync::ccacq
when result == 0. Such semantics will also support basic thread-safety of
shared_ptr<> for mutable managed objects without the unpleasant requirement
for client-provided synchronizing deleters.
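
A minimal sketch of the shape this takes with C++11 atomics (editorial
approximation, not part of the original exchange; "counted" and dispose()
are stand-ins): release ordering on every decrement, plus an acquire fence
before destruction when the count hits zero.

#include <atomic>

struct counted {
    std::atomic<long> use_count{1};

    void dispose() { /* destroy the managed object */ }

    void release() {  // nothrow
        if (use_count.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire);
            dispose();
        }
    }
};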

>
> The last point is important, because it affects the other uses of
> __exchange_and_add in libstdc++.
>
> Alexander has identified the following problematic case:
>
> // thread A
>
> read *p1
> p1.drop_reference()
>
> // thread B
>
> p2.drop_reference(); // destroys *p1
>
> If the __exchange_and_adds hidden in drop_reference do not establish
> ordering, it is possible for *p1 to be destroyed (and the storage
> invalidated, zeroed or reused) by thread B before the read in thread A.

Yep.

regards,
alexander.

Alexander Terekhov

Mar 31, 2005, 12:25:40 PM

< Forward Inline http://gcc.gnu.org/ml/libstdc++/2005-03/msg00405.html>

-------- Original Message --------
Message-ID: <e52efbe105033...@mail.gmail.com>

Subject: Re: Fw: [patch] Make std::tr1::shared_ptr thread-safe.

"Peter Dimov" <pdi...@mmltd.net> on 03/30/2005 11:33:56 PM:
[...]
> A separate issue is that there needs to exist proper memory ordering
> between some of the operations, for example, destroy() needs to happen
> after dispose() (but there are other visibility issues, some of them more
> subtle.)
>
> This can be fixed in one of the following ways:
>
> 1. Empty lock/unlock regions on the same mutex as outlined in my and
> Alexander's posts.
>
> 2. Full bidirectional memory barriers at the same two places.
>
> 3. Making __exchange_and_add offer the following memory synchronization
> guarantees:
>
> - acquire memory semantics when the result is not zero;
> - release memory semantics when the result is zero

release memory semantics when the result is not zero;
acquire memory semantics when the result is zero

And "result" means new value of count stored by __exchange_and_add(),
not return value (its old value). BTW, I had a typo in my previous
message when I wrote "result > 1".

>
> or introducing a separate __atomic_decrement primitive with these
> properties.

refcount<> is better. And it shall be based on full bloat atomic<> with
things like validate() for hazard pointers stuff,

http://www.google.de/groups?selm=4242A922.EF87094A%40web.de

and so on.

class sp_counted_base {
  /* ... */

  typedef refcount<std::size_t, basic> count;
  count use_count_, self_count_;

  /* ... */
public:
  /* ... */
  sp_counted_base() : use_count_(1), self_count_(1) { }

  void add_ref() throw() {
    use_count_.increment(); // naked
  }

  bool lock() throw() {
    return use_count_.increment_if_not_min(); // naked
  }

  void weak_add_ref() throw() {
    self_count_.increment(); // naked
  }

  void weak_release() throw() {
    if (!self_count_.decrement(msync::acq, count::may_not_store_min))
      destruct();
  }

  void release() throw() {
    if (!use_count_.decrement()) { // "full" release-acquire protocol
      dispose();
      if (!self_count_.decrement(msync::rel, count::may_not_store_min))
        destruct();
    }
  }

  /* ... */

};

http://www.terekhov.de/pthread_refcount_t/experimental/refcount.cpp

[...]
> The next issue on our list is that reference-counted mutable synchronized
> objects would need an empty lock/unlock region at the start of the
> destructor if (3) is not chosen. I consider this unfortunate but
> acceptable.

I disagree.

void g(shared_ptr<std::something> p); // can modify *p

void f() {
  shared_ptr<std::something> p(...);
  new_thread(g, p);
}

The problem is that a synchronizing deleter must lock *the same* lock
that was used to synchronize access to the mutable object. But in the
case above there isn't any client locking at all!

>
> The final issue is that the reference counted objects in libstdc++
> (std::string, std::locale::facet) may also be broken *unless*
> __exchange_and_add has some memory visibility guarantees that I don't
> understand,

It's really simple. See release/acquire requirement above.

For immutable stuff we need

msync::slb (instead of full conventional release) memory semantics
when the result is not zero;

msync::hsb (instead of full conventional acquire) memory semantics
when the result is zero.

slb stands for "sink load barrier".
hsb stands for "hoist store barrier".

Full conventional release is compound msync::slb + msync::ssb
("sink store barrier").

Full conventional acquire is compound msync::hlb ("hoist load
barrier") + msync::hsb.

> but Alexander does. (It needs to establish ordering of prior
> loads WRT following stores IIUC). This will be fixed by (3) and may or may
> not be a problem in practice; I'm not an expert in this area and can't
> review the current __exchange_and_add implementations. In any event,
> shared_ptr with the above fixes applied would be no more broken than the
> rest of the library.

The rest of the library is currently totally broken in this respect,
I'm afraid.

It works on IA32 (compiler reordering aside for a moment) because on
IA32 loads have acquire semantics, stores have release semantics, and
interlocked stuff is fully-fenced (compound acquire+release semantics).

regards,
alexander.

Alexander Terekhov

Mar 31, 2005, 5:55:52 PM

< Forward Inline >

-------- Original Message --------
Message-ID: <e52efbe105033...@mail.gmail.com>

Subject: Re: slb/hsb et al

On Thu, 31 Mar 2005 16:36:22 +0300, Peter Dimov <pdi...@mmltd.net> wrote:
> Alexander Terekhov wrote:
> >> The final issue is that the reference counted objects in libstdc++
> >> (std::string, std::locale::facet) may also be broken *unless*
> >> __exchange_and_add has some memory visibility guarantees that I don't
> >> understand,
> >
> > It's really simple. See release/acquire requirement above.
> >
> > For immutable stuff we need
> >
> > msync::slb (instead of full conventional release) memory semantics
> > when the result is not zero;
> >
> > msync::hsb (instead of full conventional acquire) memory semantics
> > when the result is zero.
> >
> > slb stands for "sink load barrier".
> > hsb stands for "hoist store barrier".
>

> Yeah, I lied. I think I do understand. ;-)
>
> But to clarify, sink X means that preceding X cannot cross the barrier.

Right. But unidirectional stuff works in conjunction with some atomic
memory operation (read/write/read-modify-write), not as a fence instruction.
So it actually means that preceding X accesses cannot cross the memory
operation labeled with the sink-X barrier. It imposes ordering between
preceding X accesses and the operation labeled with the sink-X barrier.

> Hoist Y means following Y cannot cross the barrier. Right?

Exactly.

>
> Am I correct in thinking that Sparc-style #XY means sink-Y + hoist-X? That
> is, #LoadStore means Sink-Load + Hoist-Store, i.e. what we want above.

Yep. Sparc doesn't use unidirectional labels. They have bidirectional
fences of various types.

> A PPC isync is slb in your notation?

isync works in conjunction with OP + conditional branching and has the
effect of "full" acquire for that OP.

> lwsync fence?

http://www-128.ibm.com/developerworks/eserver/articles/powerpc.html
(see "Storage barriers and ordering of storage accesses" table)

It doesn't provide store-load fencing (an expensive thing even on IA32).

So it can't be used as a full fence. That requires the more expensive "sync"
instruction.

>
> And just to beat the subject to death, am I correct that the empty
> lock/unlock regions are equivalent to .mf <location> with the same location
> used in both? (.mf is the label, the operation doesn't matter.)

Yes, on the same lock. Sort of. Fences don't use "locations". They are
bidirectional and impose ordering between preceding and subsequent
operations.

>
> And that they can be replaced with .rel <location> in release() and .acq
> <location> in weak_release(), again with the same location used in both
> places?

No. ".rel" and ".acq" are unidirectional labels. They must be used to
"mark" specific operations. They don't work as bidirectional fences.
Bidirectional fences can be used to achieve the effect of ".rel" and
".acq", but it doesn't really work the other way around.

For op.rel you do "fence, op" and for op.acq you do "op, fence"
(placing the fence before the op for .rel and after the op for .acq). But
this is less efficient than unidirectional labels because fences impose
redundant constraints.
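
A minimal C++11 sketch of that emulation (editorial, not part of the
original exchange): a standalone release fence before a relaxed store gives
the effect of op.rel, and a standalone acquire fence after a relaxed load
gives the effect of op.acq, at the cost of constraining all surrounding
accesses rather than just the one operation.

#include <atomic>

std::atomic<int> flag{0};
int payload;

void op_rel_via_fence() {                                   // "fence, op"
    payload = 42;
    std::atomic_thread_fence(std::memory_order_release);
    flag.store(1, std::memory_order_relaxed);
}

int op_acq_via_fence() {                                    // "op, fence"
    int f = flag.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    return f ? payload : -1;
}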

regards,
alexander.

Alexander Terekhov

Mar 31, 2005, 6:11:10 PM

Alexander Terekhov wrote:
[...]

> > Am I correct in thinking that Sparc-style #XY means sink-Y + hoist-X? That
> > is, #LoadStore means Sink-Load + Hoist-Store, i.e. what we want above.
>
> Yep.

Well, it's not that simple. For example, a compound LoadLoad + StoreStore
fence doesn't constitute a "full" fence. OTOH memop.acq+rel does have the
effect of a full fence.

> Sparc doesn't use unidirectional labels. They have bidirectional
> fences of various types.

Way too many types. They went overboard with fences, AFAICS.

regards,
alexander.

Alexander Terekhov

Apr 1, 2005, 6:35:46 AM

<Forward Inline, truncated>

Message-ID: <OF53F1BBC2.F4EC4596-ONC1256F...@de.ibm.com>
Subject: Re: refcounting mutable objects, hopefully the last time

> Hm. The problem is that I want to rely on libstdc++'s "portable"
> primitives and not define my own for every platform. So I can only
> use their __atomic_add, __exchange_and_add, _GLIBCXX_READ_MEM_BARRIER
> and _GLIBCXX_WRITE_MEM_BARRIER.

Only-read and only-write stuff is not sufficient in the general
case (apart from the fact that bidirectional fencing is a really bad
idea) for the acquire/release protocol. More recent GLIBC does
have "full" acquire/release stuff. Internally (IIRC the ppc port or
something), I mean.

> I've tried to google these things but memory synchronization
> seems totally unpenetrable. I wonder if someone besides you
> understands your ddhlb and naked_competing notation.

Yeah. Well, the msync::ddhlb label is known to the Linux folk as the
read_barrier_depends() macro. Their notation sucks, because it
IS a "label", not a bidirectional "disjoint" fence. As for
naked_competing... in the past, I had "msync::none" for both
"naked_competing" stuff and "naked_noncompeting" accesses. It
occurred to me that it's probably better to distinguish them.

Just an illustration, not real code:

// Introspection (for bool argument below) aside for a moment
template<typename T, bool copy_ctor_or_dtor_can_mutate_object>
class mutex_and_condvar_free_single_producer_single_consumer {

  typedef isolated< aligned_storage< T > > ELEM;

  size_t m_size; // > 1
  ELEM * m_elem;
  atomic< ELEM * > m_head;
  atomic< ELEM * > m_tail;

  ELEM * advance(ELEM * elem) const {
    return (++elem < m_elem + m_size) ? elem : m_elem;
  }

  ...

public:

  ...

  void producer(const T & value) {
    ELEM * tail = m_tail.load(msync::naked_noncompeting);
    ELEM * next = advance(tail);
    while (next == m_head.load(msync::cchsb)) usleep(1000);
    new(tail) T(value);
    m_tail.store(next, msync::ssb);
  }

  T consumer() {
    ELEM * head = m_head.load(msync::naked_noncompeting);
    while (head == m_tail.load(type_list< msync::cchlb_t, msync::ccacq_t >::
        element<copy_ctor_or_dtor_can_mutate_object>::type())) usleep(1000);
    T value(*head);
    head->~T();
    m_head.store(advance(head), type_list< msync::slb_t, msync::rel_t >::
        element<copy_ctor_or_dtor_can_mutate_object>::type());
    return value;
  }
};

The idea is to make it easier for the compiler to reuse a cached value
and omit the load instruction (in the example above) for
msync::naked_noncompeting accesses if the implementation does caching
(more precise annotations helping to understand the stuff aside
for a moment).

</Forward Inline, truncated>

regards,
alexander.

Peter Dimov

Apr 2, 2005, 6:23:01 AM

Alexander Terekhov wrote:
> class sp_counted_base {
> /* ... */
> typedef refcount<std::size_t, basic> count;
> count use_count_, self_count_;
> /* ... */
> public:
> /* ... */
> sp_counted_base() : use_count_(1), self_count_(1) { }
>
> void add_ref() throw() {
> use_count_.increment(); // naked
> }
>
> bool lock() throw() {
> return use_count_.increment_if_not_min(); // naked
> }
>
> void weak_add_ref() throw() {
> self_count_.increment(); // naked
> }
>
> void weak_release() throw() {
> if (!self_count_.decrement(msync::acq, count::may_not_store_min))
> destruct();
> }
>
> void release() throw() {
> if (!use_count_.decrement()) { // "full" release-acquire protocol
> dispose();
> if (!self_count_.decrement(msync::rel, count::may_not_store_min))
> destruct();
> }
> }

Question: your labeled approach only maps well to IA64 (and to a lesser
extent x86), right?

On platforms with bidirectional fences, could it be possible for the labeled
approach to produce suboptimal code? IOW can the algorithm above be improved
if it's known that the target platform has bidirectional fences and does not
support labels?


Peter Dimov

Apr 2, 2005, 9:43:36 AM

Alexander Terekhov wrote:
> bool lock() throw() {
> return use_count_.increment_if_not_min(); // naked
> }

msync::naked_competing, IIUC? Is cmpxchg on x86 a proper implementation of
naked_competing, or does it need a lock prefix? And if lockless cmpxchg is
fine,

> void add_ref() throw() {
> use_count_.increment(); // naked
> }

should this use cmpxchg too instead of lock inc/lock xadd?


Alexander Terekhov

Apr 2, 2005, 10:07:04 AM

Peter Dimov wrote:
>
> Alexander Terekhov wrote:
> > bool lock() throw() {
> > return use_count_.increment_if_not_min(); // naked
> > }
>
> msync::naked_competing, IIUC?

Yep.

> Is cmpxchg on x86 a proper implementation of
> naked_competing, or does it need a lock prefix?

Without a lock it is naked_MP_noncompeting so to speak. Without a
lock it isn't atomic with respect to other processors (and probably
HT threads, cores, etc.) at all AFAIK.


> And if lockless cmpxchg is
> fine,

It might work on a uniprocessor, or when all potentially contending
threads are bound to the same processor.

>
> > void add_ref() throw() {
> > use_count_.increment(); // naked
> > }
>
> should this use cmpxchg too instead of lock inc/lock xadd?

Given that cmpxchg without lock is MP-unsafe, lock inc/lock xadd
is probably better.
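
A minimal GCC inline-asm sketch of the lock xadd flavor (editorial
illustration for x86/x86-64, not part of the original exchange; the
function name is a stand-in). The lock prefix is what makes the
read-modify-write atomic across processors:

static inline long locked_fetch_add(long volatile* counter, long v) {
    __asm__ __volatile__("lock; xadd %0, %1"
                         : "+r"(v), "+m"(*counter)
                         :
                         : "cc", "memory");
    return v;  // previous value of *counter
}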

regards,
alexander.

Alexander Terekhov

Apr 2, 2005, 10:29:47 AM

Peter Dimov wrote:
>
> Alexander Terekhov wrote:
> > class sp_counted_base {
> > /* ... */
> > typedef refcount<std::size_t, basic> count;
> > count use_count_, self_count_;
> > /* ... */
> > public:
> > /* ... */
> > sp_counted_base() : use_count_(1), self_count_(1) { }
> >
> > void add_ref() throw() {
> > use_count_.increment(); // naked
> > }
> >
> > bool lock() throw() {
> > return use_count_.increment_if_not_min(); // naked

For the record, naked was wrong here. It shall be msync::hsb (in the min
case) to impose ordering with respect to the relaxed self_count_.decrement()
in weak_release() below, which doesn't do msync::rel. Or it can be naked,
but then self_count_.decrement() in weak_release() must have at least
msync::slb (when the result is not min).

> > }
> >
> > void weak_add_ref() throw() {
> > self_count_.increment(); // naked
> > }
> >
> > void weak_release() throw() {
> > if (!self_count_.decrement(msync::acq, count::may_not_store_min))
> > destruct();
> > }
> >
> > void release() throw() {
> > if (!use_count_.decrement()) { // "full" release-acquire protocol
> > dispose();
> > if (!self_count_.decrement(msync::rel, count::may_not_store_min))
> > destruct();
> > }
> > }
>
> Question: your labeled approach only maps well to IA64 (and to a lesser
> extent x86), right?

It's an "extension" of conventional release consistency model. Revised
Java, for example, also uses variation of release consistency model.
But they use way too constrained variation AFAICS.

>
> On platforms with bidirectional fences, could it be possible for the labeled
> approach to produce suboptimal code?

I don't think so. Unidirectional labels don't preclude the use of fences.
Itanic does have "mf", after all. Unidirectional labels are simply
better when you don't need bidirectional fences. And you rarely need
bidirectional fences.

> IOW can the algorithm above be improved
> if it's known that the target platform has bidirectional fences and does not
> support labels?

I have yet to see such an algorithm...

regards,
alexander.

Alexander Terekhov

Apr 6, 2005, 2:58:38 PM

< Forward Inline >

-------- Original Message --------
Message-ID: <42527847...@web.de>
Subject: Re: Lock-free shared_ptr on g++/x86, test/review needed
References: ... <021601c539cc$c8a5ae20$6501a8c0@pdimov>

Peter Dimov wrote:
[...]
> Yes, but in this case threads are not a problem (we're in release,
> destroying the last remaining shared_ptr; if we read weak_count_ == 1, no
> other thread can be creating/destroying weak pointers because of basic
> guarantee). Msync is not a problem because we've just acquired in the
> decrement.

Yeah. Hello Peter, remember original

void release() throw() {
  if (!use_count_.decrement()) { // "full" release-acquire protocol
    dispose();
    if (!self_count_.decrement(msync::rel, count::may_not_store_min))
      destruct();
  }
}

?

(msync::rel on self_count_.decrement() was meant for the "result is not
min" case only, with a naked op for the "old value == min + 1" case.)

Sure, it can be safely transformed to

void release() throw() {
  if (!use_count_.decrement()) { // "full" release-acquire protocol
    dispose();

    if (self_count_.load(msync::naked_competing) == 1 || // min == 0
        !self_count_.decrement(msync::rel))
      destruct();
  }
}

but ...

>
> Now that I think of it, compiler value propagation isn't a problem either
> because of the "memory" clobber on the decrement. So the cast to volatile
> can be removed.
>
> void release() // nothrow
> {
> if( atomic_exchange_and_add( &use_count_, -1 ) == 1 )
> {
> dispose();
>
> if( weak_count_ == 1 ) // no weak ptrs

(this doesn't convey the notion of (potentially) competing accesses; who
says that the implementation can't read it bit-after-bit-and-sum-it, bytes
aside for a moment... just for the sake of breaking your logic?)

> {
> destroy();
> }
> else
> {
> weak_release();
> }
> }
> }

... you also have and manage client-provided "deleter" objects, and
I see no reason why weak_ptr clients may NOT have access to them
(resulting in a similar problem with respect to access and
destruction as with Whitehead's mutex).

So to be safe, you need

void release() throw() {
  if (!use_count_.decrement()) { // "full" release-acquire protocol
    dispose();

    if (self_count_.load(msync::ccacq) == 1 || // min == 0
        !self_count_.decrement()) // "full" release-acquire protocol
      destruct();
  }
  // "full" means either "value dependent" or fully-fenced operation
}

So again,

http://gcc.gnu.org/ml/libstdc++/2005-04/msg00000.html

I suggest that you hide that access in asm (and consider using the
same operation in weak_release() too... for the sake of KISS-on-IA32
and "easy" porting to RISC archs, where "count::may_not_store_min"
is surely better in both release() and weak_release()).

regards,
alexander.

Alexander Terekhov

Apr 6, 2005, 2:59:48 PM

< Forward Inline >

-------- Original Message --------
Message-ID: <425423FE...@web.de>
Subject: Re: shared_ptr on ppc?
References: ... <42541BB5...@web.de>

Alexander Terekhov wrote:
[...]
> Got it now. Thanks.

asm long atomic_decrement_weak( register long * pw ) {

<load-UNreserved>
<add -1>
<branch if zero to acquire>

{lw}sync

loop:

<load-reserved>
<add -1>
<branch if zero to acquire>
<store-conditional>
<branch if failed to loop else to done>

acquire:

isync

done:

<...>
}

asm long atomic_decrement_strong( register long * pw ) {

// Peter's state machine

loop0:

<load-reserved>
<add -1>
<branch if zero to loop0_acquire>

{lw}sync

loop1:

<store-conditional>
<branch if !failed to done>

loop2:

<load-reserved>
<add -1>
<branch if !zero to loop1>
<store-conditional>
<branch if failed to loop2 else to acquire>

loop0_acquire:

<store-conditional>
<branch if failed to loop0>

acquire:

isync

done:

<...>
}

All right now?

regards,
alexander.

< Forward Quoted > [Message-ID: f4bc7d7d1041b71c...@twcny.rr.com]

Howard Hinnant wrote:
>
> On Apr 6, 2005, at 12:36 PM, Alexander Terekhov wrote:
>
> >
> > Alexander Terekhov wrote:
> >
> > [... asm long atomic_decrement_[weak|strong] ...]
> >
> > I suppose that CodeWarrior is smart enough to recognize the presence of
> > msync instructions and will behave accordingly with respect to compiler
> > caching/reordering across these routines. If not, things can go wild,
> > beware.
>
> Your assumption is correct. CodeWarrior won't reorder across these
> instructions.
>
> -Howard

Alexander Terekhov

Apr 11, 2005, 7:07:47 AM

< Forward Inline >

-------- Original Message --------
Newsgroups: gmane.comp.lib.boost.devel
Subject: shared_ptr on ia64?
Message-ID: <4255091E...@web.de>
References: ... <011901c5394e$3ec2e700$6501a8c0@pdimov2>

Peter Dimov wrote:
[...]
> > P.S. When are you going to kick start an incarnation for Itanic
> > with value dependent cmpxchg.rel-vs-cmpxchg.acq? ;-)
>
> IA64 assembly by hand? No thanks. I'll probably use _Interlocked* on Intel
> and __sync_* on g++.

That would mean expensive "mf" everywhere.

http://gcc.gnu.org/ml/libstdc++/2005-04/msg00059.html
http://gcc.gnu.org/ml/libstdc++/2005-04/msg00060.html

Not good for Itanic, which reportedly speculates and reorders like
crazy.

asm long atomic_decrement_weak( register long * pw ) {

loop:

<load-UNordered>
<add -1>
<branch if zero to acquire>
<cmpxchg.rel>
<branch if failed to loop else to done>

acquire:

<ld.acq>
<add -1>
<branch if !zero to acquire>

done:

<...>
}

asm long atomic_decrement_strong( register long * pw ) {

loop:

<load-UNordered>
<add -1>
<branch if zero to acquire>
<cmpxchg.rel>
<branch if failed to loop else to done>

acquire:

<cmpxchg.acq>
<branch if failed to loop else to done>

done:

<...>
}

Right?

regards,
alexander.

P.S. For naked stuff, it is probably better to use ".rel", not
more-speculation-hindering ".acq". Too bad Itanic doesn't have
naked "semaphores"...

Peter Dimov

Apr 11, 2005, 7:48:51 AM

Alexander Terekhov wrote:
> < Forward Inline >
>
> -------- Original Message --------
> Newsgroups: gmane.comp.lib.boost.devel
> Subject: shared_ptr on ia64?
> Message-ID: <4255091E...@web.de>
> References: ... <011901c5394e$3ec2e700$6501a8c0@pdimov2>
>
> Peter Dimov wrote:
> [...]
>>> P.S. When are you going to kick start an incarnation for Itanic
>>> with value dependent cmpxchg.rel-vs-cmpxchg.acq? ;-)
>>
>> IA64 assembly by hand? No thanks. I'll probably use _Interlocked* on
>> Intel and __sync_* on g++.
>
> That would mean expensive "mf" everywhere.

The _Interlocked* family does have _acq/_rel variants (but I'm not sure
whether they are available on all versions of the Intel compiler), but g++
only implements __sync_* (and only a subset of that IIUC.)

> http://gcc.gnu.org/ml/libstdc++/2005-04/msg00059.html
> http://gcc.gnu.org/ml/libstdc++/2005-04/msg00060.html
>
> Not good for Itanic which reportedly speculates and reorders like
> crazy.

And is reportedly dead. ;-)


Alexander Terekhov

Apr 11, 2005, 9:01:58 AM

Peter Dimov wrote:
[...]

> The _Interlocked* family does have _acq/_rel variants (but I'm not sure
> whether they are available on all versions of the Intel compiler),

I'm not sure how that helps given that they don't provide value-dependent
msync. Or do you mean a "manually" coded "strong" decrement using
InterlockedCompareExchange[Acquire|Release]? If so, sorta yes, but you
know that I just hate volatiles... and it isn't clear at all whether
(some new) MS compiler*** will inject unnecessary barriers (due to
volatile accesses) or not.

regards,
alexander.

***) http://google.de/groups?selm=42079cea%241%40news.microsoft.com
