
RFC: Parallel shift and threads and locks as a programming model


Johan Torp

Sep 6, 2008, 5:52:51 AM
Hi,

I've nearly finished my master's thesis, which covers the shift to
parallelism and, amongst other things, why threads and locks are not a
good parallel programming model. I would really like to get some comments
on what I've written, so if you have the time and find it interesting,
please help me out.

The document can be found at http://74.53.180.2/~thorpe/

ABSTRACT

The first part is an overview of the paradigmatic shift to parallelism
that is currently taking place. It explains why processors need to
become parallel, how they might function and what types of parallelism
there are. Given that information, it explains why threads and locks are
not a suitable programming model, how threading is being improved and
used to extract parallel performance, what problems await new
parallel programming models and how they might work. The final chapter
surveys the landscape of existing parallel software and hardware
projects and relates them to the overview. The overview is intended
for desktop and embedded programmers and architects.

The second part explains how to use C++'s upcoming memory model and
atomic API. It also relates the memory model to classical definitions
of distributed computing in an attempt to bridge the gap in
terminology between the research literature and C++. An implementation
of hazard pointers and a lock-free stack and queue are given as
example C++0x code. This part is aimed at expert C++ developers and
the research community.


Best Regards, Johan Torp

Szabolcs Ferenczi

Sep 6, 2008, 12:01:14 PM
On Sep 6, 11:52 am, Johan Torp <johan.t...@gmail.com> wrote:

> I've nearly finished my master's thesis which covers the shift to
> parallelism and amongst other things, why threads and locks are not a
> good parallel programming model. I would really like to get some comments
> on what I've written so if you have the time and find it interesting,
> please help me out.

You have spent quite some effort. Before commenting on the details, I
would like to put three questions:

Do you have a title for your thesis already?

Did you ever consider structured parallelism instead of the popular
forking-like threading model?

Did you check out some declarative languages before writing the
section "5.3.2 Explicit vs declarative"?

Best Regards,
Szabolcs

Szabolcs Ferenczi

Sep 6, 2008, 2:34:43 PM
On Sep 6, 11:52 am, Johan Torp <johan.t...@gmail.com> wrote:
> ... I would really like to get some comments

> on what I've written so if you have the time and find it interesting,
> please help me out.

You write:

"A lost wakeup occurs when a thread goes to sleep and, by mistake, is
never notified. The typical scenario is a thread, which decides it
needs to wait. Before it manages to start waiting, the condition
occurs and another thread notifies the condition variable. The waiting
thread then waits forever. The lost wakeup problem is due to condition
variables, not threads or locks."

Do you really think it is due to condition variables and not due to an
oversimplification in the pthreads design? Do you think "lost wakeup"
may occur with condition variables in Java as well?

Best Regards,
Szabolcs

Johan Torp

Sep 7, 2008, 7:03:46 AM
On Sep 6, 6:01 pm, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
wrote:

> Do you have a title for your thesis already?

I was thinking of the long title at the top. Suggestions are very
welcome.

> Did you ever consider structured parallelism instead of the popular
> forking-like threading model?

I am not familiar with the term "structured parallelism". I spent two
whole chapters on threading since that is the most widespread and
best understood parallel computing model. Therefore I thought it might be a
good starting point to discuss parallelism. Also, I'm quite familiar
with it and its problems. Comparing and evaluating some promising
parallel programming models would have been really interesting too, but
that is not the topic of the thesis. It would probably have been a big
enough project to constitute a thesis on its own.

> Did you check out some declarative languages before writing the
> section "5.3.2 Explicit vs declarative"?

I'm not sure what you mean by declarative languages. The way I see it,
all languages, libraries and APIs make a trade-off between
explicitness and declarativeness. I'm fairly used to Haskell, SQL and
Matlab and have been playing around with Erlang, but the language I
know best is C++. Did this answer your question?

Cheers, Johan Torp

Johan Torp

Sep 7, 2008, 7:26:38 AM
On Sep 6, 8:34 pm, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
wrote:

> On Sep 6, 11:52 am, Johan Torp <johan.t...@gmail.com> wrote:
>
> > ... I would really like to get some comments
> > on what I've written so if you have the time and find it interesting,
> > please help me out.
>
> You write:
>
> "A lost wakeup occurs when a thread goes to sleep and, by mistake, is
> never notified. The typical scenario is a thread, which decides it
> needs to wait. Before it manages to start waiting, the condition
> occurs and another thread notifies the condition variable. The waiting
> thread then waits forever. The lost wakeup problem is due to condition
> variables, not threads or locks."
>
> Do you really think it is due to condition variables and not due to an
> oversimplification in the pthreads design?

I'm not familiar with the pthreads design of condition variables, in
what way are those condition variables special?


> Do you think "lost wakeup"
> may occur with condition variables in Java as well?

Yes. For instance, a waiting thread might forget to acquire a lock
before checking a condition. Forgive me for using C++ pseudo code; I
believe it translates quite directly into Java's Conditions.

// waiting thread
if (some_bool) {
    scoped_lock lock(mutex);
    some_condition.wait(lock);
}

// notifying thread
scoped_lock lock(mutex);
some_bool = false;
some_condition.notify_all();

// Possible execution order
if (some_bool) { // returns true
// notifying thread acquires the lock, sets some_bool to false and notifies
scoped_lock lock(mutex);
some_condition.wait(lock); // will wait indefinitely

If you had a programming model where the actual condition or event was
lumped together with notification, waiting and, if applicable, locking,
mistakes like the one above could be avoided. In transactional memory, the
retry construct internally decides when threads need to awake and
re-check their conditions. This way, lost wakeups can never occur. I also
believe you can construct composable and arbitrarily deeply nested
future expressions, which have this safe form of waiting until a
condition is ready. I haven't seen any future implementations which
support this yet.


Best Regards, Johan Torp

Szabolcs Ferenczi

Sep 7, 2008, 9:27:35 AM
On Sep 7, 1:26 pm, Johan Torp <johan.t...@gmail.com> wrote:
> On Sep 6, 8:34 pm, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
> wrote:
>
> > On Sep 6, 11:52 am, Johan Torp <johan.t...@gmail.com> wrote:
>
> > > ... I would really like to get some comments
> > > on what I've written so if you have the time and find it interesting,
> > > please help me out.
>
> > You write:
>
> > "A lost wakeup occurs when a thread goes to sleep and, by mistake, is
> > never notified. The typical scenario is a thread, which decides it
> > needs to wait. Before it manages to start waiting, the condition
> > occurs and another thread notifies the condition variable. The waiting
> > thread then waits forever. The lost wakeup problem is due to condition
> > variables, not threads or locks."
>
> > Do you really think it is due to condition variables and not due to an
> > oversimplification in the pthreads design?

> I'm not familiar with the pthreads design of condition variables, in
> what way are those condition variables special?

In Pthreads, the signal operation on the condition variable can
legally be made from outside of the protection of the mutex. Originally,
there was a constraint that the condition variable must be used inside a
Critical Region only, and it is relaxed in Pthreads. That is the way in
which it is special.

It is not so in Java, for instance, nor was it the case when the
condition variable was originally introduced in concurrent
programming. That is one of the reasons why the condition
variable is stateless.

In Pthreads, they relaxed the requirement that signalling must be made
under the protection of the mutex but, on the other hand, the
stateless property of the condition variable remained. This makes way
for the lost wakeup to happen.

> > Do you think "lost wakeup"
> > may occur with condition variables in Java as well?
>
> Yes. For instance, a waiting thread might forget to acquire a lock
> before checking a condition.

That is a concurrency bug in the algorithm and it has nothing to do
with the condition variable being the cause of the lost signal.

> Forgive me for using C++ pseudo code, I
> believe it translates quite directly into Java's Conditions.

The C code would not be a problem but the usage of the condition
variable is.

> // waiting thread
> if (some_bool) {
>   scoped_lock lock(mutex);
>   some_condition.wait(lock);
> }

This way of using the wait() operation is not desirable with any
condition variable, although it is allowed, unfortunately. If you keep
to the canonical form of the wait() operation, you cannot make this
mistake.

You can also avoid this mistake just by thinking about the condition
variable test:
http://groups.google.com/group/comp.programming.threads/browse_frm/thread/0ff7ac10adadee22

Did you rather mean something like this:

// waiting thread
{
    scoped_lock lock(mutex);
    while (some_bool) {
        some_condition.wait(lock);
    }
}

> // notifying thread
>   scoped_lock lock(mutex);
>   some_bool = false;
>   some_condition.notify_all();
>
> // Possible execution order
> if (some_bool) { // returns true
> // notifying thread acquires the lock, sets some_bool to false and notifies
> scoped_lock lock(mutex);
> some_condition.wait(lock); // will wait indefinitely

You lost the signal in your example just because of a concurrency bug
in the application code. Namely, you do not always access the shared
variable `some_bool' in a Critical Region. The current popular languages
all make it possible for you to make errors like that, since the shared
variable is not marked at the language level. However, checker tools
are emerging which can tell you by static analysis that you are doing
something wrong there.

> If you had a programming model where the actual condition or event was
> lumped together with notification, waiting and if applicable, locking,
> mistakes as the above can be avoided.

That is the case in Java and that is the case with the Monitor
construct originally proposed.

> In transactional memory, the
> retry construct internally decides when threads need to awake and
> check its conditions. This way, lost wakeups can never occur.

That is quite true: if you do not send any wakeup, it cannot be lost,
indeed.

Did you know that in the original high level language construction
proposal of Conditional Critical Regions there was no condition
variable either, so that the lost wakeup could not apply? The condition
variable was introduced later to cope with efficiency issues. It
turned out that manipulating the condition variable became explicit
to give the illusion of more efficiency; however, it could have been
made implicit, and then it could be internally decided when processes
need to awake.

It is still possible today that the `parallel shift' is made towards
safer constructions in future concurrent languages and that higher level
language constructions will be used again.

Best Regards,
Szabolcs

Johan Torp

Sep 7, 2008, 12:14:36 PM
On 7 Sep, 15:27, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
wrote:

> That is a concurrency bug in the algorithm and it has nothing to do
> with the condition variable being the cause of the lost signal.

Ok, sorry. We've misunderstood each other. My code contained a
deliberate bug. My point is that the programming model of condition
variables allows users to write such bugs, just like ordinary locks
allow programmers to introduce data race bugs. I'm not suggesting that
condition variable implementations cause lost wakeups, just that it is
easy to accidentally write such a bug and that it can be hard to
detect since threading is nondeterministic.

> You can also avoid this mistake just by thinking about the condition

> variable test:http://groups.google.com/group/comp.programming.threads/browse_frm/th...

Yes, you can build libraries on top of condition variables which
provide a better programming model where users cannot make such
mistakes. A conditional critical region is an example of this.

> The current popular languages
> all make it possible for you to make errors like that since the shared
> variable is not marked at the language level.

That was precisely my point. I'll look at my wording and see if I can
rephrase so that I do not confuse more people.

> However, checker tools
> are emerging which can tell you by static analysis that you do
> something wrong there.

Interesting. I see the need for many static checking tools as a sign of
flaws in the programming models. Needless to say, it would be better if the
programming model caught the static bugs at compile or even keyboard
typing time.


> Did you know that in the original high level language construction
> proposal of Conditional Critical Regions there was no condition
> variable either, so that lost wakeup could not apply? Conditional
> variable has been later introduced to cope with efficiency issues. It
> turned out that manipulating the conditional variable became explicit
> to make the illusion of more efficiency, however, it could have been
> made implicate and then it can be internally decided when processes
> need to awake.

Very interesting, I did not know this.

> It is still possible today that the `parallel shift' is made towards
> safer construction in future concurrent languages and higher level
> language constructions will be used again.

I hope so too. There is another side to it as well: programming models of
today are optimized for shared memory. An implicit programming model
might be more efficient on NUMA architectures and on architectures which
do not share memory between cores.


Johan

Szabolcs Ferenczi

Sep 7, 2008, 4:59:29 PM
On Sep 7, 1:03 pm, Johan Torp <johan.t...@gmail.com> wrote:
> On Sep 6, 6:01 pm, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
> wrote:
>
> > Do you have a title for your thesis already?
>
> I was thinking of the long title at the top. Suggestions are very
> welcome.

So you are thinking of this:

"Parallel shift and threads and locks as a programming model"

Are you going to suggest in the title that `as a programming model'
you consider `parallel shift and threads and locks'? What would that
mean?

What would the title `Parallel shift and threads and locks as a
programming model' mean anyway? What does it suggest to the reader?
What can the reader expect looking at this title? Do you promise the
reader a programming model that consists of `parallel shift and
threads and locks'?

I was curious about the title since in your thesis you deal with such
a broad spectrum of subjects (from some hardware details via some
multi-processor architectures to some high level languages and some
lower level library blocks) in such a shallow way. I was wondering how
one can give a title to such a diversity. Furthermore, on a first
skim through the text I could not figure out where you are heading.
Where are you going to end up? I hoped that a title would give me
a clue.

Best Regards,
Szabolcs

Szabolcs Ferenczi

Sep 7, 2008, 5:50:20 PM
On Sep 7, 6:14 pm, Johan Torp <johan.t...@gmail.com> wrote:
> On 7 Sep, 15:27, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
> wrote:
>
> > That is a concurrency bug in the algorithm and it has nothing to do
> > with the condition variable being the cause of the lost signal.
>
> Ok, sorry. We've misunderstood eachother. My code contained a
> deliberate bug.

Well, the basic bug is that you access the shared variable without the
protection of the mutex (see `if (some_bool)'). The concurrency bug is
there even if you do not use any condition variable so it is not an
issue with the lost wakeup.

On the other hand, the lost wakeup applies even if you check the
shared variable and start waiting on the condition variable under the
protection of the mutex provided the condition variable signalling
does not happen under the protection of the mutex.

> My point is that the programming model of condition
> variables allows users to write such bugs.

Actually, the original programming model in which the condition variables
were introduced does not allow users to write such bugs. You can
never make this mistake with a Monitor construction.

If you can make the concurrency error in Pthreads, Java or the like,
it just means that these did not take over the programming model
properly.

> Just like ordinary locks
> allow programmers to introduce data race bugs. I'm not suggesting that
> condition variable implementations cause lost wakeups, just that it is
> easy to accidentally write such a bug and that it can be hard to
> detect since threading is indeterministic.

There are tools for detecting this kind of bug, since it can be found
by static analysis. It is not a non-deterministic bug.

However, the root of the problem is that only those programming
languages become popular which give you the freedom to commit
such errors.

Once upon a time, ca. three and a half decades ago, there was a
language construction proposal which ensured that you had to
mark the shared variable and you were forced to access the shared
variable inside a critical region only. If I translate it down to a C-
like notation and adapt it to your `some_bool' problem, it is
something like this:

shared bool some_bool;
...
with (some_bool) { ... }

The compiler can insert the mutexes for you in the implementation. In
addition, the compiler can check whether you access the shared variable
inside a critical region or not. It did not become popular, but perhaps
the parallel shift will bring it back.

> > You can also omit this mistake just by thinking about the condition
> > variable test:http://groups.google.com/group/comp.programming.threads/browse_frm/th...
>
> Yes, you can build libraries on top of condition variables which
> provide a better programming model where users can not make such
> mistakes. A conditional critical region is an example of this.

The conditional critical region is not any library but rather it is a
language construction proposal. Again, in a C-like notation it would
look something like this:

shared bool some_bool;
...
with (some_bool) when (some_bool) { ... }

The condition variable test is not a library either. It is just a
mental note that the condition variable should not play any role in
the business logic of the algorithm, since it is a programming tool
only for optimisation of the busy waiting loop on a shared variable.
However, it applies to the modified condition variables only (see the
Mesa-type ones), since you did not have to apply the condition
variable wait in a loop originally (see the Hoare-type ones). Now you
have to do that, but you get no compiler support for it either.

> > The current popular languages
> > all make it possible for you to make errors like that since the shared
> > variable is not marked at the language level.
>
> That was precisely my point. I'll look at my wording and see if I can
> rephrase so that I do not confuse more people.

I think you never mention, or I have missed it if you do, that the
parallel shift in the programming models should happen towards
declaring the shared variables as such. You also never mention
that programming languages should incorporate higher level
constructions instead of the low level library calls for mutexes.
Otherwise, how can you avoid the misuse? The considerations about the
so-called "memory models" encourage access of shared variables
without any protection.

> > However, checker tools
> > are emerging which can tell you by static analysis that you do
> > something wrong there.
>
> Interesting. I see the need of many static checking tools as flaws in
> the programming models. Needless to say, it would be better if the
> programming model caught the static bugs at compile or even keyboard
> typing time.

The only way to make it possible is to make the parallel shift towards
high level languages with language constructions for expressing shared
resources and dedicated operations on them.

Best Regards,
Szabolcs

Chris M. Thomasson

Sep 7, 2008, 11:58:25 PM
"Johan Torp" <johan...@gmail.com> wrote in message
news:03f0461d-815a-43ea...@b1g2000hsg.googlegroups.com...

Here is my take on Szabolcs:

http://groups.google.com/group/comp.programming.threads/msg/ea019f34ea32e84a

Take his comments wrt POSIX with a grain of salt. After all, he thinks that
condvars cannot be used to signal about events. Anyway, I don't have time to
read your hard work. I am most interested in your comments on hazard
pointers, stack and queue. I have extensive experience in that area. BTW,
here is some information:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/1e166dc04c3df118
(READ ALL!)

Please note that if you make the classic optimization of using DWCAS for
consumers and CAS for producers, then version counter read order is very
important!!! Also, check this out:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/3ded0e3c5496eaa0

http://groups.google.com/group/comp.programming.threads/search?group=comp.programming.threads&q=pdr

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=proxy+gc

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=proxy+garbage+collector

http://groups.google.com/group/comp.programming.threads/search?group=comp.programming.threads&q=rcu%2Bsmr


Sorry for throwing all of this at you... I am swamped with work and don't
have any free time lately... Anyway... Enjoy!

;^)


Ask any questions you may have as there is more than one person who
understands this stuff on this group...


Johan Torp

Sep 8, 2008, 4:04:31 AM
On Sep 7, 11:50 pm, Szabolcs Ferenczi <szabolcs.feren...@gmail.com>
wrote:

> I think you never mention, or I have missed it if you do, that the
> parallel shift in the programming models should happen towards
> declaring the shared variables as being such. You also never mention
> that programming languages should incorporate higher level
> constructions instead of the low level library calls for mutexes.

I rewrote the lost wakeup section so that more people would not
interpret it the way you did and put the new version on www.johantorp.com.

Best Regards, Johan

Johan Torp

Sep 8, 2008, 4:09:53 AM
On Sep 8, 5:58 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> http://groups.google.com/group/comp.programming.threads/browse_frm/th...
> (READ ALL!)

It seems to be related to DCAS (DWCAS) solutions. My understanding is
that these instructions are not widespread and should be avoided.
Also, I'm working with C++0x which has no support for DCAS.

> Ask any questions you may have as there is more than one person who
> understands this stuff on this group...

My tutor is a memory reclamation and lock-free researcher and I've
also read a book and lots of papers on the subject, so I think
(hope :)) I've got it covered. Also, the focus of the thesis is not
lock-free programming; I only show examples of lock-free programming as
that is one area where the memory model (which is the topic of part 2)
is put to extensive use.

Johan Torp

Anthony Williams

Sep 8, 2008, 5:04:11 AM
Johan Torp <johan...@gmail.com> writes:

> Also, I'm working with C++0x which has no support for DCAS.

That depends what you mean by DCAS. C++0x supports

struct my_pod
{
    void* p;
    unsigned count;
};

std::atomic<my_pod> atomic_pod;

void foo()
{
    my_pod new_val={some_pointer,some_count};
    my_pod old_val;
    while(!atomic_pod.compare_exchange(old_val,new_val));
}

Which looks quite like a conventional DCAS to me. However, it may or
may not be lock-free.

On Intel CPUs, with a 32-bit pointer, this should be lock-free, but
there's no guarantee. You can check with atomic_pod.is_lock_free(),
but there's no compile-time query.

Anthony
--
Anthony Williams | Just Software Solutions Ltd
Custom Software Development | http://www.justsoftwaresolutions.co.uk
Registered in England, Company Number 5478976.
Registered Office: 15 Carrallack Mews, St Just, Cornwall, TR19 7UL

Chris M. Thomasson

Sep 8, 2008, 8:55:32 AM
"Anthony Williams" <antho...@gmail.com> wrote in message
news:uy723f...@gmail.com...

> Johan Torp <johan...@gmail.com> writes:
>
>> Also, I'm working with C++0x which has no support for DCAS.
>
> That depends what you mean by DCAS. C++0x supports

Yeah. Well, I try to differentiate between DCAS and DWCAS. IMHO, they mean
two very different things... My line of thinking tells me that the former
can operate on two non-contiguous words which can be aligned on word
boundaries. On the other hand, the latter means that the words need to be
directly adjacent in memory and aligned on a double word boundary.


> struct my_pod
> {
> void* p;
> unsigned count;
> };
>
> std::atomic<my_pod> atomic_pod;
>
> void foo()
> {
> my_pod new_val={some_pointer,some_count};
> my_pod old_val;
> while(!atomic_pod.compare_exchange(old_val,new_val));
> }
>
> Which looks quite like a conventional DCAS to me. However, it may or
> may not be lock-free.
>
> On Intel CPUs, with a 32-bit pointer, this should be lock-free, but
> there's no guarantee.

There's the CMPXCHG16B instruction supported by some Intel 64-bit
processors. Also, this should always be lock-free on SPARC-32. I believe it
will always be lock-free on PPC; IIRC they have a LL/SC which operates on
double-words on 32 or 64 bit processors.


> You can check with atomic_pod.is_lock_free(),
> but there's not a compile-time query.

Cool!

Chris M. Thomasson

Sep 8, 2008, 8:56:13 AM
"Johan Torp" <johan...@gmail.com> wrote in message
news:10290247-14d5-4942...@i76g2000hsf.googlegroups.com...

> On Sep 8, 5:58 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > http://groups.google.com/group/comp.programming.threads/browse_frm/th...
> > (READ ALL!)

> It seems to be related to DCAS (DWCAS) solutions. My understanding is
> that these instructions are not widespread and should be avoided.

You can sometimes get around DWCAS by using the alignment trick, or the
offset trick.


> Also, I'm working with C++0x which has no support for DCAS.

I believe it does.


> > Ask any questions you may have as there is more than one person who
> > understands this stuff on this group...

> My tutor is a memory reclamation and lock-free researcher and I've
> also read a book and lots of papers on the subject so I think
> (hope :)) I got it covered.

:^D


> Also, the focus of the thesis is not lock-
> free programming, I only show examples of lock-free programming as
> that is one area where the memory model (which is the topic of part 2)
> is put to extensive use.

Indeed.

Chris M. Thomasson

Sep 8, 2008, 9:34:48 AM

"Johan Torp" <johan...@gmail.com> wrote in message
news:10290247-14d5-4942...@i76g2000hsf.googlegroups.com...

On Sep 8, 5:58 am, "Chris M. Thomasson" <n...@spam.invalid> wrote:
[...]

> > Ask any questions you may have as there is more than one person who
> > understands this stuff on this group...

> My tutor is a memory reclamation and lock-free researcher and I've
> also read a book and lots of papers on the subject so I think
> (hope :)) I got it covered.

Some very quick observations:

You make some false assertions in section 4.4.2. You certainly do not need
to stop-the-world to track quiescent periods. You also don't need to
interrupt or block any threads. In section 4.4.5, why do you say that RCU is a
typically blocking reclamation scheme? In section 4.5, you don't mention
that TM is very prone to live-locking when it's put under load. One more
thing, in section 4.3.1 you mention that obstruction-free can never
deadlock, but fail to mention that it can and will live-lock fairly easily.

Humm, well, you should also mention in section 4.5.5 that AMD is working on
something very interesting indeed:

http://groups.google.com/group/comp.programming.threads/browse_frm/thread/8d9708596c832a18

In section 4.4.4 you seem to imply that atomic instructions are needed to
implement reference counting. It might be nice to mention distributed
reference counting techniques in which all ref-count mutations are
thread-local.

Also, there is another form of memory reclamation that is based on a novel
highly distributed reference counting technique combined with the tracking
of synchronization epochs which has advantages over both RCU and SMR:

Patent publication# 20070067770


> Also, the focus of the thesis is not lock-
> free programming, I only show examples of lock-free programming as
> that is one area where the memory model (which is the topic of part 2)
> is put to extensive use.

You cover a lot of ground. I only skimmed through the parts I am interested
in, and well, I think you did a good job so far!

:^D

Johan Torp

Sep 8, 2008, 12:17:38 PM
On Sep 8, 2:55 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > That depends what you mean by DCAS. C++0x supports

I was referring to guaranteed lock-free instructions.

> Yeah. Well, I try to differentiate between DCAS and DWCAS. IMHO, they mean
> two very different things... My line of thinking tells me that the former
> can operate on two non-contiguous words which can be aligned on word
> boundaries. On the other hand, the latter means that the words need to be
> directly adjacent in memory and aligned on a double word boundary.

Do you believe your distinction between DCAS and DWCAS reflects
widespread use of the acronyms?

Johan

Johan Torp

Sep 8, 2008, 12:31:03 PM
On Sep 8, 3:34 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> Some very quick observations:
>
> You make some false assertions in section 4.4.2. You certainly do not need
> to stop-the-world to track quiescent periods. You also don't need to
> interrupt or block any threads.

I know. I don't claim that you do either (in 4.4.2), but maybe I need
to revise the wording so that more people do not misinterpret me. I
knowingly avoided the subject a bit since I do not know which
techniques are the most common ones for guaranteeing quiescent
periods. Maybe you can help me?

> In section 4.4.5, why do say that RCU is a
> typically blocking reclamation scheme?

I was probably thinking of linux which is the most well known example
of RCU. Actually, the only example I've seen. Is it not typically
blocking? Can you point me to some non-blocking implementations?

> In section 4.5, you don't mention
> that TM is very prone to live-locking when their put under load.

Doesn't this depend on backoff strategies in the implementation? Also,
I think the lock-free and TM sections are already a bit too long so
I'd rather cut out information than add more. Unless I've missed
something vital that is.

> One more
> thing, in section 4.3.1 you mention that obstruction-free can never
> deadlock, but fail to mention that it can and will live-lock fairly easily.

I thought that was obvious but reading it again, I realise it is not.
Thanks.

> Humm, well, you should also mention in section 4.5.5 that AMD is working on
> something very interesting indeed:
>

> http://groups.google.com/group/comp.programming.threads/browse_frm/th...

I will look in to that.

> In section 4.4.4 you seem to imply that atomic instructions are needed to
> implement reference counting. It might be nice to mention distributed
> reference counting techniques in which all ref-count mutations are
> thread-local.

Where? I can't see any mentioning of atomic instructions or locks in
4.4.4.


Thanks /Johan

Chris M. Thomasson

unread,
Sep 12, 2008, 2:15:31 AM9/12/08
to
"Johan Torp" <johan...@gmail.com> wrote in message
news:8b82f058-41a5-4a13...@w7g2000hsa.googlegroups.com...

On Sep 8, 3:34 pm, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > Some very quick observations:
> >
> > You make some false assertions in section 4.4.2. You certainly do not
> > need
> > to stop-the-world to track quiescent periods. You also don't need to
> > interrupt or block any threads.

> I know. I don't claim that you do either (in 4.4.2), but maybe I need
> to revise the wording so that more people do not misinterpret me. I
> knowlingly avoided the subject a bit since I do not know which
> techniques are the most common ones for guaranteeing quiescence
> periods. May you can help me?


> > In section 4.4.5, why do you say that RCU is a
> > typically blocking reclamation scheme?

> I was probably thinking of linux which is the most well known example
> of RCU. Actually, the only example I've seen. Is it not typically
> blocking? Can you point me to some non-blocking implementations?

Non-blocking in the sense that the execution of reader/writer-threads is not
manipulated by the polling logic. A writer simply defers to the polling
logic, which can be a non-blocking operation in and of itself. Readers can
do several things. The fast implementations are zero-overhead for readers.
Or some involve a simple load and/or store per reader-side critical-section
acquire/release operation. Some make the readers use atomic RMW operations,
etc...

You can do a very simple RCU using per-thread monotonic counters and a
polling thread. Reader threads mutate a per-thread monotonic counter before
and after they access a shared data-structure. The polling thread monitors
these counters in order to determine per-thread quiescent states and,
furthermore, detect grace periods. Writers simply defer objects to the
polling thread using a shared FIFO queue. The polling thread allows a grace
period, or two, to expire before invoking the deferment callback function
on a group of objects. The really fast implementations sometimes require
another grace period before invoking the callback in order to get proper
release semantics on the objects.

Polling logic can be heavily optimized... I will let you respond to this
before I get into any more of the details...


> > In section 4.5, you don't mention
> > that TM is very prone to live-locking when put under load.

> Doesn't this depend on backoff strategies in the implementation?

Sometimes. However, I have seen various contention managers which generate
very poor performance characteristics in different STM's when the
algorithms are put under load. Under moderate load, some contention
managers do help indeed. AMD's research on scalable multi-word atomic
operations returns a contention ticket from a failed operation that can be
used to seed a custom exponential backoff algorithm, or be used to access
another portion of the data-structure.


> Also,
> I think the lock-free and TM sections are already a bit too long so
> I'd rather cut out information than add more. Unless I've missed
> something vital that is.

Perhaps you can add that distributed reference counting, RCU and quiescent
period detection can be implemented with virtually zero overhead.


> > One more
> > thing, in section 4.3.1 you mention that obstruction-free can never
> > deadlock, but fail to mention that it can and will live-lock fairly
> > easily.

> I thought that was obvious but reading it again, I realise it is not.
> Thanks.

Sure.


> > Humm, well, you should also mention in section 4.5.5 that AMD is working
> > on
> > something very interesting indeed:
> >
> > http://groups.google.com/group/comp.programming.threads/browse_frm/th...

> I will look in to that.

Yeah, that might be something worth referencing/mentioning...

> > In section 4.4.4 you seem to imply that atomic instructions are needed
> > to implement reference counting. It might be nice to mention
> > distributed reference counting techniques in which all ref-count
> > mutations are thread-local.

> Where? I can't see any mentioning of atomic instructions or locks in
> 4.4.4.

You mention that several implementations require DWCAS. Then you mention
that DWCAS is not available on certain processors, basically implying that
it's not good to use the instruction. IMVHO, you should mention that
ref-counting can be implemented with single-word atomic operations. Or
perhaps mention that the counting can be distributed and accessed on a
per-thread basis such that a thread can increment an object's reference
count using plain non-atomic operations...

Johan Torp

unread,
Sep 17, 2008, 5:07:21 AM9/17/08
to
First of all, sorry for the late reply. I've been relying on google
alerts to notify me when I get replies but for some reason it didn't
work. By now, it is probably too late to make any changes to my
thesis.

On 12 Sep, 08:15, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > I was probably thinking of linux which is the most well known example
> > of RCU. Actually, the only example I've seen. Is it not typically
> > blocking? Can you point me to some non-blocking implementations?
>
> Non-blocking in the sense that the execution of reader/writer-threads is not
> manipulated by the polling logic. A writer simply defers to the polling
> logic which can be a non-blocking operation in and of itself.

Aha, so modifications do not take effect immediately and a writer
which performs a later read might not see its own write?

> Readers can do
> several things. The fast implementations are zero-overhead for readers. Or
> some involve a simple load and/or store per-reader-side critical-section
> acquire/release operations. Some make the readers use atomic RMW operations,
> ect...
>
> You can do a very simple RCU using per-thread monotonic counters and a
> polling thread. Reader threads mutate per-thread monotonic counter before
> and after they access a shared data-structure. Polling thread monitors these
> counters in order to determine per-thread quiescent states, and further
> more, detect grace periods. Writers simply defer objects to the polling
> thread using shared FIFO queue. Polling allows a grace period, or two, to
> expire before invoking the deferment callback function on a group of
> objects. The really fast implementations sometimes require another grace
> period before invoking the callback in order to get proper release semantics
> on the objects.

I do not understand this approach. If the polling thread has detected
quiescence and started performing modifications, do readers block
until the polling thread is finished?
I'm not familiar with "grace periods" in a lock-free context, could
you elaborate?
There can't be any guarantee that there will ever be a quiescent
period if somebody is constantly reading, right?


> > Also,
> > I think the lock-free and TM sections are already a bit too long so
> > I'd rather cut out information than add more. Unless I've missed
> > something vital that is.
>
> Perhaps you can add that distributed reference counting, RCU and quiescent
> period detection can be implemented with very virtually zero overhead.

I would need to read up more on those approaches before I dare make
such claims. Unfortunately, it is probably too late to make any
modifications now. But for my own sake, I'd be interested in seeing
some real-world examples of these approaches, considering drawbacks and
trade-offs, etc. Do you have some pointers?

> > > Humm, well, you should also mention in section 4.5.5 that AMD is working
> > > on
> > > something very interesting indeed:
>
> > >http://groups.google.com/group/comp.programming.threads/browse_frm/th...
> > I will look in to that.
>
> Yeah, that might be something worth referencing/mentioning...

I added some information on ASF.

> > > In section 4.4.4 you seem to imply that atomic instructions are
> > > needed to implement reference counting. It might be nice to mention
> > > distributed reference counting techniques in which all ref-count
> > > mutations are thread-local.
> > Where? I can't see any mentioning of atomic instructions or locks in
> > 4.4.4.
>
> You mention that several implementations require DWCAS. Then you mention
> that DWCAS is not available on certain processors, basically implying
> that it's not good to use the instruction. IMVHO, you should mention that
> ref-counting can be implemented with single-word atomic operations. Or
> perhaps mention that the counting can be distributed and accessed on a
> per-thread basis such that a thread can increment an object's reference
> count using plain non-atomic operations...

That is not the case in C++0x. Any modification to a shared variable
must be properly happens-before related otherwise we have a data race
and undefined behaviour. Or am I missing something?


Thanks a lot for your replies, they were really helpful. I apologize
that I did not detect the post until now.


Johan

Dmitriy V'jukov

unread,
Sep 17, 2008, 7:27:55 AM9/17/08
to
On 17 Sep, 13:07, Johan Torp <johan.t...@gmail.com> wrote:

> > You mention that several implementations require DWCAS. Then you
> > mention that DWCAS is not available on certain processors, basically
> > implying that it's not good to use the instruction. IMVHO, you should
> > mention that ref-counting can be implemented with single-word atomic
> > operations. Or perhaps mention that the counting can be distributed and
> > accessed on a per-thread basis such that a thread can increment an
> > object's reference count using plain non-atomic operations...
>
> That is not the case in C++0x. Any modification to a shared variable
> must be properly happens-before related otherwise we have a data race
> and undefined behaviour. Or am I missing something?


In C++0x you can implement it this way:

// modified by only one thread, read by many threads
std::atomic<int> rc;

// executed by only one thread
void acquire()
{
    int x = rc.load(std::memory_order_relaxed);
    rc.store(x + 1, std::memory_order_relaxed);
}

// executed by many threads concurrently with acquire()
int get_rc()
{
    return rc.load(std::memory_order_relaxed);
}

Formally, there are no races here. All operations are atomic. But the
modification of the counter is made with plain stores and loads; no heavy
atomic RMW operations or fences are involved.

You are mixing up 'low-level races' (which the C++0x memory model is
concerned with) and 'high-level races' (which application logic is
concerned with).

It's amusingly easy to eliminate ALL races from a C++0x program - just
declare all variables as std::atomic<>. Formally, there are no more races.
But the correctness of the high-level algorithm is not affected in any way.

Dmitriy V'jukov

Johan Torp

unread,
Sep 17, 2008, 11:43:15 AM9/17/08
to
On Sep 17, 1:27 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > That is not the case in C++0x. Any modification to a shared variable
> > must be properly happens-before related otherwise we have a data race
> > and undefined behaviour. Or am I missing something?
>
> In C++0x you can implement it this way:
...

> Formally here are no races. All operations are atomic. But
> modification if counter is made with plan stores and loads, no heavy
> atomic RMW/fences are involved.

Yes of course, but those are atomic operations, not ordinary accesses to
non-atomic variables.

Johan

Dmitriy V'jukov

unread,
Sep 17, 2008, 12:24:32 PM9/17/08
to


I think Chris meant "a thread can increment an object's reference count
using unfenced atomic stores and unfenced atomic loads (and not fenced
atomic RMW operations)". I.e. a thread can use atomic stores and atomic
loads, which are physically the same as plain non-atomic stores and plain
non-atomic loads, just because stores and loads to aligned word-sized
variables are always atomic.


Dmitriy V'jukov

Johan Torp

unread,
Sep 17, 2008, 1:42:40 PM9/17/08
to
On Sep 17, 6:24 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> I think Chris meant "a thread can increment an object's reference count
> using unfenced atomic stores and unfenced atomic loads (and not fenced
> atomic RMW operations)". I.e. a thread can use atomic stores and atomic
> loads, which are physically the same as plain non-atomic stores and plain
> non-atomic loads, just because stores and loads to aligned word-sized
> variables are always atomic.

Ok, I see. Do relaxed atomics on word-sized memory locations always
map to regular instructions (without fences or other extra
synchronizing instructions) on all common hardware architectures?

C++0x does not put any constraints on when relaxed stores should
become visible, only that it should happen in a "reasonable amount of
time". Intuitively, this tells me that any RCU implementation might
wind up with a lot of unreclaimed data. Is this a real problem?

Johan

Dmitriy V'jukov

unread,
Sep 17, 2008, 2:12:02 PM9/17/08
to
On Sep 17, 9:42 pm, Johan Torp <johan.t...@gmail.com> wrote:
> On Sep 17, 6:24 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > I think Chris meant "a thread can increment an object's reference count
> > using unfenced atomic stores and unfenced atomic loads (and not fenced
> > atomic RMW operations)". I.e. a thread can use atomic stores and atomic
> > loads, which are physically the same as plain non-atomic stores and
> > plain non-atomic loads, just because stores and loads to aligned
> > word-sized variables are always atomic.
>
> Ok, I see. Does relaxed atomic on word sized memory locations always
> map to regular instructions (without fences or other extra
> synchronizing instructions) on all common hardware architectures?


Yes. There can be some subtle moments, but mainly Yes.


> C++0x does not put any constraints on when relaxed stores should
> become visible, only that it should happen in a "reasonable amount of

> time"...

...as any other atomic operation, no difference at all.


Dmitriy V'jukov

Johan Torp

unread,
Sep 18, 2008, 5:55:54 AM9/18/08
to
On Sep 17, 8:12 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > Ok, I see. Does relaxed atomic on word sized memory locations always
> > map to regular instructions (without fences or other extra
> > synchronizing instructions) on all common hardware architectures?
>
> Yes. There can be some subtle moments, but mainly Yes.

Can you elaborate a little bit on the subtle moments?

> > C++0x does not put any constraints on when relaxed stores should
> > become visible, only that it should happen in a "reasonable amount of
> > time"...
>
> ...as any other atomic operation, no difference at all.

Non-relaxed atomic operations in C++0x have a lot of other visibility
requirements. They involve required visibility of atomic
modifications to other atomic objects as well as modifications of
non-atomic memory locations. Side effects must also follow the ordering
introduced by non-relaxed atomic operations.

Johan


Dmitriy V'jukov

unread,
Sep 18, 2008, 7:19:33 AM9/18/08
to
On Sep 18, 1:55 pm, Johan Torp <johan.t...@gmail.com> wrote:
> On Sep 17, 8:12 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
>
> > > Ok, I see. Does relaxed atomic on word sized memory locations always
> > > map to regular instructions (without fences or other extra
> > > synchronizing instructions) on all common hardware architectures?
>
> > Yes. There can be some subtle moments, but mainly Yes.
>
> Can you elaborate a little bit on the subtle moments?


I'm not aware of any :)
This moment is totally architecture-dependent, so it is senseless to
talk about it in general. For example, on x86, aligned 1-, 2-, 4-, 8- and
16-byte loads and stores are guaranteed to be atomic. To the best of my
knowledge, on all widespread architectures word-sized aligned loads
and stores are atomic too. Maybe on some widespread architectures
double-word or half-word operations are not atomic; I don't know about
all of them. Maybe on some exotic 'microcontrollers' plain loads and
stores are not atomic.


> > > C++0x does not put any constraints on when relaxed stores should
> > > become visible, only that it should happen in a "reasonable amount of
> > > time"...
>
> > ...as any other atomic operation, no difference at all.
>
> Non-relaxed atomic operations in C++0x has a lot of other visibility
> requirements. They involve required visibility of  atomic
> modifications to other atomic objects as well as modifications of non-
> atomic memory locations. Side effects must also follow the ordering
> introduced by non-relaxed atomic operations.


Non-relaxed atomics have additional guarantees wrt *order* of
visibility, but NOT wrt *speed* of visibility, so to say. Wrt *speed*
of visibility they are all the same. It's only guaranteed that *all*
atomic operations must become visible in a reasonable amount of time.


Dmitriy V'jukov

Chris M. Thomasson

unread,
Sep 18, 2008, 10:38:05 AM9/18/08
to
"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:2951294b-8316-4ae9...@34g2000hsh.googlegroups.com...

On Sep 18, 1:55 pm, Johan Torp <johan.t...@gmail.com> wrote:
> > On Sep 17, 8:12 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> >
> > > > Ok, I see. Does relaxed atomic on word sized memory locations always
> > > > map to regular instructions (without fences or other extra
> > > > synchronizing instructions) on all common hardware architectures?
> >
> > > Yes. There can be some subtle moments, but mainly Yes.
> >
> > Can you elaborate a little bit on the subtle moments?


> I'm not aware of any :)
> This moment is totally architecture-dependent, so it is senseless to
> talk about it in general. For example, on x86, aligned 1-, 2-, 4-, 8- and
> 16-byte loads and stores are guaranteed to be atomic. To the best of my
> knowledge, on all widespread architectures word-sized aligned loads
> and stores are atomic too. Maybe on some widespread architectures
> double-word or half-word operations are not atomic, I don't know about
> all.


> Maybe on some exotic 'microcontrollers' plain loads and stores
> are not atomic.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Imagine if the following could trip the assert; assume a 32-bit system:
_________________________________________________________
typedef unsigned long word_t; /* assumed: one machine word */

static word_t volatile g_val = 0x12345678;

void thread_1() {
    for (;;) {
        word_t const val = g_val;
        if (val == 0x12345678) {
            g_val = 0x87654321;
        } else {
            g_val = 0x12345678;
        }
    }
}

void thread_2() {
    for (;;) {
        word_t const val = g_val;
        assert(val == 0x12345678 || val == 0x87654321);
    }
}
_________________________________________________________


IMVHO, that would be one weird platform!

:^o


If it could trip the assertion, then surely the platform would provide
special instructions that could perform atomic loads and stores.


[...]

Chris M. Thomasson

unread,
Sep 18, 2008, 11:28:04 AM9/18/08
to
"Johan Torp" <johan...@gmail.com> wrote in message
news:c793acca-8bd1-4739...@k37g2000hsf.googlegroups.com...

> First of all, sorry for the late reply. I've been relying on google
> alerts to notify me when I get replies but for some reason it didn't
> work. By now, it is probably too late to make any changes to my
> thesis.
>
> On 12 Sep, 08:15, "Chris M. Thomasson" <n...@spam.invalid> wrote:
>> > I was probably thinking of linux which is the most well known example
>> > of RCU. Actually, the only example I've seen. Is it not typically
>> > blocking? Can you point me to some non-blocking implementations?
>>
>> Non-blocking in the sense that the execution of reader/writer-threads is
>> not
>> manipulated by the polling logic. A writer simply defers to the polling
>> logic which can be a non-blocking operation in and of itself.
>
> Aha, so modifications do not take effect immediately and a writer
> which performs a later read might not see its own write?

I did not explain things clearly. I am writing about how a writer thread
defers an object to the polling thread. In RCU, when you want to destroy an
object, you make it unreachable and defer it for destruction by associating
a callback function which gets invoked when the object is in a persistent
quiescent state. The implementation of the defer function can be wait-free
or lock-free. A wait-free impl might simply enqueue the object and callback
onto a per-thread single-producer/consumer queue which gets
periodically/episodically sampled by the polling thread. A lock-free impl
could enqueue the object and callback onto a multiple-producer/single-
consumer queue which gets flushed by the polling thread every now and
then...

As for the reader threads, well, they are always non-blocking by design in
RCU.


>> Readers can do
>> several things. The fast implementations are zero-overhead for readers.
>> Or
>> some involve a simple load and/or store per-reader-side critical-section
>> acquire/release operations. Some make the readers use atomic RMW
>> operations,
>> ect...
>>
>> You can do a very simple RCU using per-thread monotonic counters and a
>> polling thread. Reader threads mutate per-thread monotonic counter before
>> and after they access a shared data-structure. Polling thread monitors
>> these
>> counters in order to determine per-thread quiescent states, and further
>> more, detect grace periods. Writers simply defer objects to the polling
>> thread using shared FIFO queue. Polling allows a grace period, or two, to
>> expire before invoking the deferment callback function on a group of
>> objects. The really fast implementations sometimes require another grace
>> period before invoking the callback in order to get proper release
>> semantics
>> on the objects.
>
> I do not understand this approach. If the polling thread has detected
> quiescence and started performing modifications, does readers block
> until the polling thread is finished?

No. Reader threads never block. The polling thread only samples data
from the readers; it does not mutate anything. Also, if you have access to
per-cpu data-structures, the polling thread can just sample those and not
even bother looking at reader-thread state.


> I'm not familiar with "grace periods" in a lock-free context, could
> you elaborate?

A grace period in RCU terms is when all reader threads, or CPUs, have
recently gone through a quiescent state.


> There can't be any guarantee that there will ever be a quiescent
> period if somebody is constantly reading, right?

This could happen in RCU if the programmer did something like:

void reader_thread() {
    rcu_read_lock();
    for (;;) {
        // perform read(s)
    }
    rcu_read_unlock();
}


and simply never broke out of the for loop. The simple fix would be:


void reader_thread() {
    for (;;) {
        rcu_read_lock();
        // perform read(s)
        rcu_read_unlock();
    }
}


Keep in mind that rcu_read_lock/unlock are always non-blocking. There are
other tricks and amortizations that an impl can do but I am not going to
get into that unless you're interested. It's fairly advanced stuff indeed.


You can also read here:

http://atomic-ptr-plus.sourceforge.net/design.html

http://www.rdrop.com/users/paulmck/RCU/


>> > Also,
>> > I think the lock-free and TM sections are already a bit too long so
>> > I'd rather cut out information than add more. Unless I've missed
>> > something vital that is.
>>
>> Perhaps you can add that distributed reference counting, RCU and
>> quiescent
>> period detection can be implemented with very virtually zero overhead.
>
> I would need to read up more on those approaches before I dare make
> such claims. Unfortunately, it is probably too late to make any
> modifications now. But for my own sake, I'd be interested in seeing
> some real world examples of these approaches, consider drawbacks and
> trade-offs, etc. Do you have some pointers?

http://groups.google.com/group/comp.programming.threads/search?hl=en&group=comp.programming.threads&q=vzoom&qt_g=Search+this+group


My algorithm uses an array of per-thread counters and hashing to perform
reference count updates. The counter updates themselves are non-atomic.
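For flavour, the general shape of such a scheme might be sketched like this
(my own rough illustration, NOT the actual vZOOM algorithm; the slot count,
the hashing of a thread id into a slot, and all names are assumptions):

```cpp
#include <atomic>
#include <functional>
#include <thread>

// Rough sketch of a distributed reference count: each thread hashes its
// id into a private slot and mutates only that slot, so no atomic RMW
// instruction is needed. Relaxed atomics let a polling thread legally
// read the slots; on common hardware they compile to plain loads/stores.
const int kSlots = 64;  // assumed fixed slot count

struct distributed_rc {
    std::atomic<long> slot[kSlots];

    distributed_rc() {
        for (int i = 0; i < kSlots; ++i)
            slot[i].store(0, std::memory_order_relaxed);
    }

    static int my_slot() {
        // Illustrative only; a real scheme must ensure two live threads
        // never race on the same slot.
        return std::hash<std::thread::id>()(std::this_thread::get_id())
               % kSlots;
    }

    void acquire() {  // plain load + store, no fenced RMW
        int s = my_slot();
        long v = slot[s].load(std::memory_order_relaxed);
        slot[s].store(v + 1, std::memory_order_relaxed);
    }

    void release() {
        int s = my_slot();
        long v = slot[s].load(std::memory_order_relaxed);
        slot[s].store(v - 1, std::memory_order_relaxed);
    }

    // Only meaningful to a polling thread after a grace period: the true
    // count is the sum over all slots (an individual slot may even go
    // negative if acquire and release happen on different threads).
    long sum() const {
        long total = 0;
        for (int i = 0; i < kSlots; ++i)
            total += slot[i].load(std::memory_order_relaxed);
        return total;
    }
};
```

The point of the sketch is only that the hot-path mutations are unfenced;
everything about when the sum may safely be trusted is deferred to the
quiescence machinery discussed above.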


>> > > Humm, well, you should also mention in section 4.5.5 that AMD is
>> > > working
>> > > on
>> > > something very interesting indeed:
>>
>> > >http://groups.google.com/group/comp.programming.threads/browse_frm/th...
>> > I will look in to that.
>>
>> Yeah, that might be something worth referencing/mentioning...
>
> I added some information on ASF.

Good.


>> > > In section 4.4.4 you seem to imply that atomic instructions are
>> > > needed to implement reference counting. It might be nice to mention
>> > > distributed reference counting techniques in which all ref-count
>> > > mutations are thread-local.
>> > Where? I can't see any mentioning of atomic instructions or locks in
>> > 4.4.4.
>>
>> You mention that several implementations require DWCAS. Then you mention
>> that DWCAS is not available on certain processors, basically implying
>> that it's not good to use the instruction. IMVHO, you should mention
>> that ref-counting can be implemented with single-word atomic operations.
>> Or perhaps mention that the counting can be distributed and accessed on
>> a per-thread basis such that a thread can increment an object's
>> reference count using plain non-atomic operations...
>
> That is not the case in C++0x. Any modification to a shared variable
> must be properly happens-before related otherwise we have a data race
> and undefined behaviour. Or am I missing something?

Refer to Dmitriy's responses...


> Thanks a lot for your replies, they were really helpful.

Sure.


> I apologize that I did not detect the post until now.

No problem.

Johan Torp

unread,
Sep 18, 2008, 1:15:20 PM9/18/08
to
On Sep 18, 1:19 pm, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> Non-relaxed atomics have additional guarantees wrt *order* of
> visibility, but NOT wrt *speed* of visibility, so to say. Wrt *speed*
> of visibility they all the same. It's only guaranteed that *all*
> atomic operations must became visible in reasonable amount of time.

It is true that C++0x has no notion of time and thus no notion of
time-based speed. But non-relaxed operations introduce visibility
requirements that are tied to the execution speed of individual threads.

Johan

Dmitriy V'jukov

unread,
Sep 18, 2008, 1:51:54 PM9/18/08
to


They introduce visibility requirements only wrt *OTHER* operations
(i.e. ordering), but NOT wrt the operation itself! So if you are setting
a hazard pointer, no matter whether you use a relaxed operation or a
non-relaxed one, the polling thread will see the hazard pointer itself
only after a "reasonable amount of time".


Dmitriy V'jukov

Johan Torp

unread,
Sep 18, 2008, 4:29:32 PM9/18/08
to
On 18 Sep, 19:51, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > It is true that C++0x has no notion of time and thus no notion of time-
> > based speed. But non-relaxed operations introduce introduces
> > visibility requirements that is tied to the execution speed of
> > individual threads.
>
> They introduce visibility requirements only wrt *OTHER* operations
> (i.e. ordering), but NOT wrt operation itself!

I'm not sure if we are saying the same things in different words,
or not.

I'm well aware that the speed at which one modification becomes visible
is only affected by atomic modifications to *OTHER* objects.

> So if you are setting
> hazard pointer, no matter whether you use relaxed operation or non-
> relaxed, polling thread will see hazard pointer itself only after
> "reasonable amount of time".

Well, no. When the hazard pointer becomes visible is also affected by
whether or not the dereferencing thread observed the removal of the
shared node, which the polling thread performs prior to loading the
hazard pointer.

atomic<node*> sn;
atomic<void*> hp;

// dereferencing thread
node* sn_copy = sn.load();
hp.store(sn_copy);
if (sn_copy == sn.load()) {...} // second load

// polling/reclaiming thread
sn.store(0);
void* hp_copy = hp.load();

If the dereferencing thread does not observe the 0-store at the
second load, the hazard pointer store cannot just become visible in a
"reasonable amount of time". It must become visible at the speed at which
the polling thread executes. In hardware, this can mean stalling the
polling thread or expediting the hazard pointer store by flushing
cached writes.
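To make this concrete, here is a runnable elaboration of the snippet above
(my own sketch; `node`, the flags and the thread bodies are illustrative).
With the default seq_cst operations it is impossible for the dereferencing
thread to miss the 0-store AND the polling thread to miss the hazard
pointer at the same time; relaxed operations would give no such guarantee:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hazard-pointer handshake. All atomic operations use the default
// memory_order_seq_cst, so at least one side must observe the other's
// store: if the dereferencing thread still sees the node at its second
// load, the polling thread must have seen the hazard pointer.
struct node { int payload; };
node g_node = {42};

std::atomic<node*> sn(&g_node);   // shared node
std::atomic<void*> hp(nullptr);   // hazard pointer

bool deref_saw_live = false;  // second load still returned the node
bool poller_saw_hp = false;   // reclaimer observed the hazard pointer

void dereferencing_thread() {
    node* copy = sn.load();                                  // first load
    hp.store(copy);                                          // publish HP
    deref_saw_live = (copy != nullptr && sn.load() == copy); // second load
}

void polling_thread() {
    sn.store(nullptr);                       // remove the shared node
    poller_saw_hp = (hp.load() != nullptr);  // then inspect the HP
}

void run_once() {
    std::thread t1(dereferencing_thread);
    std::thread t2(polling_thread);
    t1.join();
    t2.join();
    // The seq_cst total order forbids this combination:
    assert(!(deref_saw_live && !poller_saw_hp));
}
```

This is the classic store-buffering pattern: each thread stores to one
location and then loads the other, and sequential consistency rules out
the outcome where both loads miss the other thread's store.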


Johan

Dmitriy V'jukov

unread,
Sep 18, 2008, 6:46:03 PM9/18/08
to
On 19 Sep, 00:29, Johan Torp <johan.t...@gmail.com> wrote:

> > So if you are setting
> > hazard pointer, no matter whether you use relaxed operation or non-
> > relaxed, polling thread will see hazard pointer itself only after
> > "reasonable amount of time".
>
> Well, no. When the hazard pointer becomes visible is also affected by
> whether or not the dereferencing thread observed the removal of the
> shared node that the polling thread does prior to loading the hazard
> pointer.
>
> atomic<node*> sn;
> atomic<void*> hp;
>
> // dereferencing thread
> node* sn_copy = sn.load();
> hp.store(sn_copy);
> if (sn_copy == sn.load()) {...} // second load
>
> // polling/reclaiming thread
> sn.store(0);
> void* hp_copy = hp.load();
>
> If the dereferencing thread does not observe the 0-store, at the
> second load, the hazard pointer store can not just become visible in a
> "reasonable amount of time". It must become visible at the speed the
> polling thread is executed. In hardware, this can mean stalling the
> polling thread or expediting the hazard pointer store by flushing
> cached writes.


Usage of seq_cst memory ordering won't speed up visibility. Really.
The only thing it can do is slow down thread execution, so that
there will be some guarantees wrt visibility: for example, that either
the dereferencing thread will see the removal, or the polling thread
will see the hazard pointer.
But if you use relaxed operations, the absolute speed of visibility
will be the same. There are just no magical means to accelerate
inter-core/processor communication. Well, if there were any, I think
they would be turned on by default for all operations anyway.


Dmitriy V'jukov

Johan Torp

unread,
Sep 19, 2008, 4:23:22 AM9/19/08
to
On Sep 19, 12:46 am, "Dmitriy V'jukov" <dvyu...@gmail.com> wrote:
> > If the dereferencing thread does not observe the 0-store, at the
> > second load, the hazard pointer store can not just become visible in a
> > "reasonable amount of time". It must become visible at the speed the
> > polling thread is executed. In hardware, this can mean stalling the
> > polling thread or expediting the hazard pointer store by flushing
> > cached writes.
>
> Usage of seq_cst memory ordering won't speed-up visibility. Really.
> The only thing it can do is to slow-down thread execution, so that
> there will be some guarantees wrt visibility, for example, that either
> dereferencing thread will see removal, or polling thread will see
> hazard pointer.
> But if you will use relaxed operations, the absolute speed of
> visibility will be the same. There are just no magical means to
> accelerate inter-core/processor communication. Well, if there would be
> any, I think it would be turned on by default for all operations
> anyway.

Doesn't any processor flush cached writes, that is, speed up visibility
wrt real time? You do not want to "turn this on by default" since it
would slow ordinary execution down. Anyway, the C++0x and Java memory
models do not have any notion of time, so talking about real-time
visibility speed doesn't make much sense.

Johan


Dmitriy V'jukov

unread,
Sep 19, 2008, 7:33:02 AM9/19/08
to
On Sep 19, 12:23 pm, Johan Torp <johan.t...@gmail.com> wrote:

> > Usage of seq_cst memory ordering won't speed-up visibility. Really.
> > The only thing it can do is to slow-down thread execution, so that
> > there will be some guarantees wrt visibility, for example, that either
> > dereferencing thread will see removal, or polling thread will see
> > hazard pointer.
> > But if you will use relaxed operations, the absolute speed of
> > visibility will be the same. There are just no magical means to
> > accelerate inter-core/processor communication. Well, if there would be
> > any, I think it would be turned on by default for all operations
> > anyway.
>
> Doesn't any processor flush cached writes, that is speed up visibility
> wrt real time?


Processor will flush cached writes anyway. Atomic RMW operations and
memory fences have nothing to do with that. You can just waste some
time waiting for that, or not.


> You do not want to "turn this on by default" since it
> would slow ordinary execution down.


Indeed!
This won't speed up anything; it can only slow things down. This is why
it's better to use relaxed atomic operations if you don't need special
*ordering* guarantees. If you use relaxed atomic operations, you are
just not wasting time waiting for the processor to flush the write
buffer. That is all; no difference wrt *speed* of visibility.
Algorithms like RCU, VZOOM and SMR+RCU don't need special ordering
guarantees when they are acquiring a reference (they were especially
designed that way), so they use relaxed operations, so they are very
fast.


> Anyway, C++0x and Java's memory
> model does not have any notion of time so talking about real-time
> visibility speed doesn't make much sense.


We can still talk about visibility in terms like 'faster', 'slower'
and 'equal'.


Dmitriy V'jukov
