Volatile semantic for failed/noop atomic operations


Simone Bordet

unread,
Sep 13, 2019, 3:08:49 AM9/13/19
to mechanica...@googlegroups.com
Hi,

I have an atomic counter that gets incremented and decremented over
time (non monotonically).
At a certain point, I would like to enter a termination protocol where
increments are not possible anymore and I set an action to run if/when
the counter reaches zero.
Trivial when using synchronized/lock, but I'd like to give it a try
without them.

class A {
    private final AtomicLong counter = new AtomicLong();
    // Non-volatile
    private Runnable action;

    void terminate(Runnable action) {
        this.action = action;
        // Volatile write needed here for visibility.
        if (counter.addAndGet(0) == 0) {
            action.run();
        }
    }

    void decrement() {
        // Volatile read required to see this.action.
        if (counter.decrementAndGet() == 0) {
            Runnable a = this.action;
            if (a != null) {
                a.run();
            }
        }
    }
}

Is addAndGet(0) a volatile write? Can the write be optimized away?
Similarly (although not relevant for this particular example), does a
_failed_ compareAndSet() have the semantics of a volatile write even
though the set part was not performed because the compare part failed?

Thanks!

--
Simone Bordet
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

Vitaly Davidovich

unread,
Sep 14, 2019, 2:29:00 PM9/14/19
to mechanica...@googlegroups.com
Unlike C++, where you can specify mem ordering for failure and success separately, Java doesn’t allow that.  But, the mem ordering is the same for failure/success there.  Unfortunately it doesn’t look like the javadocs mention that, but I recall Doug Lea saying that’s the case on the concurrency-interest mailing list (btw that’s probably the more appropriate channel for this Java-centric question).

For your case, it seems like an AtomicReference<Runnable> is more appropriate.  terminate() sets it, then checks the count via a volatile load (or maybe it can decrement() itself?); if zero, CAS null into the action field to take/claim the action.  decrement() likewise tries to claim the action via a CAS.  The snippet you have now would allow for concurrent action execution, which is likely unsafe/wrong.
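A minimal sketch of that claim protocol, assuming JDK 8+ atomics (class and method names are hypothetical, not from the original snippet):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the ownership protocol described above.
class Claimable {
    private final AtomicLong counter = new AtomicLong();
    private final AtomicReference<Runnable> action = new AtomicReference<>();

    void increment() {
        counter.incrementAndGet();
    }

    void terminate(Runnable a) {
        action.set(a);            // volatile store of the action
        if (counter.get() == 0) { // volatile load of the count
            tryClaimAndRun();
        }
    }

    void decrement() {
        if (counter.decrementAndGet() == 0) {
            tryClaimAndRun();
        }
    }

    private void tryClaimAndRun() {
        Runnable a = action.get();
        // CAS null in to claim ownership; only one thread can win the CAS,
        // so the action runs at most once even if both paths reach here.
        if (a != null && action.compareAndSet(a, null)) {
            a.run();
        }
    }
}
```

The CAS-to-null is what prevents the concurrent double execution: both terminate() and decrement() may observe count == 0, but only the CAS winner runs the action.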


Francesco Nigro

unread,
Sep 14, 2019, 3:20:01 PM9/14/19
to mechanical-sympathy
Although it isn't mentioned (not even in Doug's cookbook), I have always assumed it is most likely the C++ default for both weak/strong CAS, i.e. seq_cst memory ordering.
To make this question more mechanical-sympathy focused: what happens at the hardware level? I would be curious to know both the x86 and ARM versions of this, and what kinds of memory reordering are guaranteed.

Most people say that a failed CAS costs much less than one that succeeds, but is it just like a load, or more?
Presumably a success will invalidate all the caches that reference the changed cache line, but the failed case should still lock it (on x86), and AFAIK locks (HW or SW) are not well known for being scalable and performant :)

Vitaly Davidovich

unread,
Sep 14, 2019, 3:41:56 PM9/14/19
to mechanica...@googlegroups.com
On x86, I’ve never heard of failed CAS being cheaper.  In theory, cache snooping can inform the core whether its xchg would succeed without going through the RFO dance.  But, to perform the actual xchg it would need ownership regardless (if not already owned/exclusive).

Sharing ordinary mutable memory isn’t scalable, forget locks :) Just doing plain loads/stores of shared memory will kill scalability (and perf) due to cache coherence traffic.

ARM uses LL/SC for CAS, and is much more susceptible to failures (e.g. even if the value matches the expected one, the SC fails if the line has been touched after the LL part).  But, I’m not overly familiar with ARM internals/mechanics.


Simone Bordet

unread,
Sep 14, 2019, 6:01:06 PM9/14/19
to mechanica...@googlegroups.com
Hi,

On Sat, Sep 14, 2019 at 8:28 PM Vitaly Davidovich <vit...@gmail.com> wrote:
>
> Unlike C++, where you can specify mem ordering for failure and success separately, Java doesn’t allow that. But, the mem ordering is the same for failure/success there. Unfortunately it doesn’t look like the javadocs mention that,

Unfortunately so.

> but I recall Doug Lea saying that’s the case on the concurrency-interest mailing list (btw that’s probably the more appropriate channel for this Java-centric question).
>
> For your case, it seems like an AtomicReference<Runnable> is more appropriate.

But then I would have 2 volatile accesses instead of one.

> terminate() sets it, then checks the count via a volatile load (or maybe it can decrement() itself?); if zero, CAS null into the action field to take/claim the action. decrement() likewise tries to claim the action via a CAS. The snippet you have now would allow for concurrent action execution, which is likely unsafe/wrong.

I don't see the difference between 2 atomics and my snippet WRT
concurrent execution.
They can both be executed concurrently with arbitrary interleaving, no?

I think my snippet is safe as long as in terminate() there is a
volatile write after setting the action (it is admittedly weird to add
zero just to obtain a memory effect, I could have used
VarHandle.releaseFence() if I were in JDK 11), since the action would
be visible from decrement() because it does a volatile read on the
same variable, no?
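A sketch of that fence-based variant (class name and increment() are illustrative additions; note that VarHandle fences are actually available since JDK 9, and that a release fence only orders the plain store against *later stores*, so it does not by itself recreate the happens-before edge that the addAndGet(0) volatile write gave you):

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the releaseFence() variant being discussed.
class TerminatableCounter {
    private final AtomicLong counter = new AtomicLong();
    private Runnable action; // plain, non-volatile field

    void increment() {
        counter.incrementAndGet();
    }

    void terminate(Runnable a) {
        this.action = a;
        // Orders the plain store above before subsequent stores; caution:
        // unlike the original addAndGet(0), this performs no volatile write,
        // so decrement() is not guaranteed by the JMM to see the action.
        VarHandle.releaseFence();
        if (counter.get() == 0) { // volatile read
            a.run();
        }
    }

    void decrement() {
        if (counter.decrementAndGet() == 0) { // volatile RMW
            Runnable a = this.action;
            if (a != null) {
                a.run();
            }
        }
    }
}
```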

Vitaly Davidovich

unread,
Sep 14, 2019, 8:24:19 PM9/14/19
to mechanica...@googlegroups.com
On Sat, Sep 14, 2019 at 6:01 PM Simone Bordet <simone...@gmail.com> wrote:
Hi,

On Sat, Sep 14, 2019 at 8:28 PM Vitaly Davidovich <vit...@gmail.com> wrote:
>
> Unlike C++, where you can specify mem ordering for failure and success separately, Java doesn’t allow that.  But, the mem ordering is the same for failure/success there.  Unfortunately it doesn’t look like the javadocs mention that,

Unfortunately so.

> but I recall Doug Lea saying that’s the case on the concurrency-interest mailing list (btw that’s probably the more appropriate channel for this Java-centric question).
>
> For your case, it seems like an AtomicReference<Runnable> is more appropriate.

But then I would have 2 volatile accesses instead of one.
But only in the (presumably) slow path? I’m assuming terminate() is infrequent.  decrement() won’t look at the AR unless it decrements to 0.


> terminate() sets it, then checks the count via a volatile load (or maybe it can decrement() itself?); if zero, CAS null into the action field to take/claim the action.  decrement() likewise tries to claim the action via a CAS.  The snippet you have now would allow for concurrent action execution, which is likely unsafe/wrong.

I don't see the difference between 2 atomics and my snippet WRT
concurrent execution.
They can both be executed concurrently with arbitrary interleaving, no?
In your snippet, terminate() sets the field - it can be visible immediately (or terminate() thread is descheduled).  decrement runs, decrements to 0, sees action != null.  terminate() also now sees 0, and likewise proceeds to run the action.

The AR I was suggesting would allow ownership protocol over the action.  Read it, if not null, try to CAS a null in there.  If CAS succeeds, you’ve claimed ownership.  If it fails, someone else claimed it.  Whoever claims it runs the action.

There may be a better approach, depends on a bit more details of the use cases.


I think my snippet is safe as long as in terminate() there is a
volatile write after setting the action (it is admittedly weird to add
zero just to obtain a memory effect, I could have used
VarHandle.releaseFence() if I were in JDK 11), since the action would
be visible from decrement() because it does a volatile read on the
same variable, no?


Simone Bordet

unread,
Sep 15, 2019, 7:11:58 AM9/15/19
to mechanica...@googlegroups.com
Hi,

On Sun, Sep 15, 2019 at 2:24 AM Vitaly Davidovich <vit...@gmail.com> wrote:
> In your snippet, terminate() sets the field - it can be visible immediately (or terminate() thread is descheduled). decrement runs, decrements to 0, sees action != null. terminate() also now sees 0, and likewise proceeds to run the action.
>
> The AR I was suggesting would allow ownership protocol over the action. Read it, if not null, try to CAS a null in there. If CAS succeeds, you’ve claimed ownership. If it fails, someone else claimed it. Whoever claims it runs the action.

I see, thanks!

Francesco Nigro

unread,
Sep 25, 2019, 1:12:58 PM9/25/19
to mechanical-sympathy
> I've never heard of failed CAS being cheaper
I'm referring to this old (but good) article: https://blogs.oracle.com/dave/biased-locking-in-hotspot



Steven Stewart-Gallus

unread,
Oct 5, 2019, 11:41:26 AM10/5/19
to mechanical-sympathy
Couldn't you do a compare, then a compare-and-swap? With VarHandles, something like:

if (ACTION.getOpaque(this) != expected) return false;
return ACTION.compareAndExchange(this, expected, newValue) == expected;

Not sure I got this correct.

Gil Tene

unread,
Oct 5, 2019, 3:44:29 PM10/5/19
to mechanica...@googlegroups.com
That's actually the common idiom for performant concurrent code that occasionally needs to do something atomic, but usually (the hot case) doesn't. Without the first read check, there is a good likelihood (depending on how the CPU and prefetchers handle CAS, but very probably) that you will run into serious cache line contention on that line under high load on multiple cores (sort of like false sharing, but not false): the CAS will bring the line in exclusive and invalidate it in other caches, while the first read check will bring it in shared, so the invalidation only happens if you actually need to change the value.
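A self-contained sketch of that read-before-CAS idiom with VarHandles (class and field names are hypothetical; assumes JDK 9+):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical sketch: the cheap opaque read keeps the cache line in
// shared state; the CAS (which takes it exclusive) is only attempted
// when the value actually looks claimable.
class Claim {
    private volatile Runnable action;

    private static final VarHandle ACTION;
    static {
        try {
            ACTION = MethodHandles.lookup()
                    .findVarHandle(Claim.class, "action", Runnable.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    Claim(Runnable a) {
        this.action = a;
    }

    // Returns the claimed action, or null if another thread already won.
    Runnable tryClaim() {
        Runnable a = (Runnable) ACTION.getOpaque(this);
        if (a == null) {
            return null; // cheap read: skip the RFO if already claimed
        }
        Runnable witness = (Runnable) ACTION.compareAndExchange(this, a, (Runnable) null);
        return witness == a ? a : null;
    }
}
```

Only the thread whose compareAndExchange returns the expected witness value has claimed the action; everyone else backs off without ever taking the line exclusive in the common already-claimed case.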

Vitaly Davidovich

unread,
Oct 8, 2019, 9:55:51 PM10/8/19
to mechanica...@googlegroups.com
On Sat, Oct 5, 2019 at 11:41 AM Steven Stewart-Gallus <stevensele...@gmail.com> wrote:
Couldn't you do a compare, then a compare-and-swap? With VarHandles, something like:

if (ACTION.getOpaque(this) != expected) return false;
return ACTION.compareAndExchange(this, expected, newValue) == expected;

Not sure I got this correct.
In general you could.  This is sometimes referred to as TTAS (test and test-and-set).

In this thread’s example, the “checks the count via a volatile load” I mentioned has a similar effect; under some conditions, an opaque load might be slightly cheaper than a volatile one (on x86 at least).  The difference would only really be a result of the optimizer doing other scheduling around the opaque load, but not around a volatile one.

The right/best approach for Simone’s case depends on specifics (don’t they all?! :)).  I didn’t actually pick up on how often the termination protocol triggers - I assumed it’s an uncommon/slow path.
