Single writer counter: how expensive is a volatile read?


Peter Veentjer

Oct 29, 2016, 4:13:38 AM
to mechanical-sympathy
How expensive is a volatile read for implementing a single writer performance counter?

One can easily create some kind of performance counter using an AtomicLong, or a LongAdder if there is contention.

Some example code:

public class Counter {

    private final AtomicLong c = new AtomicLong();
 
    public void inc(){
        c.addAndGet(1);
    }

    public long get(){
        return c.get();
    }
}
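For the contended case mentioned above, a LongAdder variant might look like this (a sketch, not from the original post; JDK 8+):

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of the contended variant: LongAdder stripes the count across
// internal cells under contention, trading an exact point-in-time
// snapshot for cheaper increments.
public class AdderCounter {

    private final LongAdder c = new LongAdder();

    public void inc() {
        c.increment();
    }

    public long get() {
        // sum() is not an atomic snapshot while increments are in flight
        return c.sum();
    }
}
```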

But in some cases one can create a counter where there is only a single writer and multiple readers. In that case one can make use of AtomicLong.lazySet, which prevents the CPU from waiting for the store buffer to drain. Martin Thompson explained this in his excellent training.

So you get something like this:

public class Counter {

    private final AtomicLong c = new AtomicLong();

    public void inc(){
        long newValue = c.get()+1;
        c.lazySet(newValue);
    }

    public long get(){
        return c.get();
    }
}

But in this case you still need to do a volatile read to read the actual value. In theory one could optimize this to:

public class Counter {

    private final AtomicLong c = new AtomicLong();
    private long local;

    public void inc(){
        long newValue = local+1;
        this.local = newValue;
        c.lazySet(newValue);
    }

    public long get(){
        return c.get();
    }
}

In this case there is no volatile read involved; however, more data needs to be written to memory.

As soon as I have time, I'll create a JMH benchmark. But I'm interested in the thoughts behind volatile reads in this particular case.

Aleksey Shipilev

Oct 29, 2016, 7:15:35 AM
to mechanica...@googlegroups.com
On 10/29/2016 10:13 AM, Peter Veentjer wrote:
> So you get something like this:
>
> public class Counter {
>
>     private final AtomicLong c = new AtomicLong();
>
>     public void inc(){
>         long newValue = c.get()+1;
>         c.lazySet(newValue);
>     }
>
>     public long get(){
>         return c.get();
>     }
> }

Does not work. With non-atomic updates, you set yourself up for losing a
virtually unbounded number of updates. See e.g.:
https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#pitfall-no-sync

> But in this case you still need to do a volatile read to read the actual
> value. In theory one could optimize this to:
>
> public class Counter {
>
>     private final AtomicLong c = new AtomicLong();
>     private long local;
>
>     public void inc(){
>         long newValue = local+1;
>         this.local = newValue;
>         c.lazySet(newValue);
>     }
>
>     public long get(){
>         return c.get();
>     }
> }

I guess get() was meant to have "return local"? If so, this does not
work. With no ordering implied for the reads, you set yourself up for
this trouble:

Counter c = ...;
while (c.get() < 100); // wait for it... forever!

Because optimizers are free to hoist your read out of the loop, reading
the counter once and reducing the predicated loop to an infinite one.
I would *guess* the fastest (not sanest, mind you) way out of this
is doing VarHandles:

public class Counter {
    static final VarHandle VH = <get_VH_to_counter>;

    long counter;

    public void inc(){
        VH.getAndAdd(this, 1);
    }

    public long get(){
        return VH.getOpaque(this);
    }
}
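For reference, the `<get_VH_to_counter>` placeholder would typically be filled in with a `MethodHandles` lookup; a self-contained sketch (Java 9+; class and field names here are illustrative):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class OpaqueCounter {

    // Illustrative stand-in for the <get_VH_to_counter> placeholder.
    private static final VarHandle VH;
    static {
        try {
            VH = MethodHandles.lookup()
                    .findVarHandle(OpaqueCounter.class, "counter", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private long counter;

    public void inc() {
        VH.getAndAdd(this, 1L); // atomic read-modify-write
    }

    public long get() {
        // opaque read: the load happens, but carries no happens-before edge
        return (long) VH.getOpaque(this);
    }
}
```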

Still, removing volatiles from get breaks whatever happens-before
guarantees for the counter. Which means, for example, you cannot use it
to "version" the objects:

Counter cnt = new Counter();

Thread 1:
// assume cnt is zero
x = 1;
cnt.inc();

Thread 2:
if (cnt.get() == 1) {
    assert (x == 1); // oops, fails.
}
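Conversely, keeping the volatile read preserves the happens-before edge, so the versioning pattern does work; a minimal runnable sketch (names are illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// With a volatile-backed counter, a reader that observes cnt == 1 also
// observes every write the incrementing thread made before the store.
public class VersionedPublish {

    static final AtomicLong cnt = new AtomicLong();
    static int x; // plain field, published via the counter's volatile store

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (cnt.get() != 1) { } // volatile read on every iteration
            if (x != 1) throw new AssertionError("x not visible");
            System.out.println("ok");
        });
        reader.start();

        x = 1;      // plain store...
        cnt.set(1); // ...made visible by the subsequent volatile store

        reader.join();
    }
}
```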

Thanks,
-Aleksey


Olivier Bourgain

Oct 29, 2016, 11:08:59 AM
to mechanical-sympathy
I think this deserves a benchmark.


Le samedi 29 octobre 2016 13:15:35 UTC+2, Aleksey Shipilev a écrit :
On 10/29/2016 10:13 AM, Peter Veentjer wrote:
> So you get something like this:
>
> public class Counter {
>
>     private final AtomicLong c = new AtomicLong();
>
>     public void inc(){
>         long newValue = c.get()+1;
>         c.lazySet(newValue);
>     }
>
>     public long get(){
>         return c.get();
>     }
> }

> Does not work. With non-atomic updates, you set yourself up for losing a
> virtually unbounded number of updates. See e.g.:
> https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/#pitfall-no-sync


I agree that in the general case, this will lose updates, but Peter was asking for a "single writer performance counter", so I believe his code is correct.

> But in this case you still need to do a volatile read to read the actual
> value. In theory one could optimize this to:
>
> public class Counter {
>
>     private final AtomicLong c = new AtomicLong();
>     private long local;
>
>     public void inc(){
>         long newValue = local+1;
>         this.local = newValue;
>         c.lazySet(newValue);
>     }
>
>     public long get(){
>         return c.get();
>     }
> }

> I guess get() was meant to have "return local"? If so, this does not
> work. With no ordering implied for the reads, you set yourself up for
> this trouble:

Local is for the (single) writer, to avoid a volatile read to get the current counter value. Other threads (readers) need to perform a volatile read to get the correct value.
 

Vitaly Davidovich

Oct 29, 2016, 11:33:21 AM
to mechanica...@googlegroups.com
Peter's wording after that snippet implied he was trying to remove a volatile read on the reader, so I think Aleksey is right that get() was intended to return the local.

However, as Aleksey noted, VH.getOpaque would be the right way to do that (memory_order_relaxed pseudo equivalent in C++) to avoid compiler surprises.  Although on x86 I don't think there's a difference with volatile reads - both perform the load, and both serve as compiler fences (getOpaque is doc'd as being in "program order").  On archs where a volatile read involves a CPU fence, then getOpaque should be cheaper.

The main "issue" here is readers will take cache misses and coherence traffic as their cacheline will be invalidated when the writer stores the new value.
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Sent from my phone

Tom Lee

Oct 29, 2016, 3:38:22 PM
to mechanica...@googlegroups.com
> Peter's wording after that snippet implied he was trying to remove a volatile read on the reader, so I think Aleksey is right that get() was intended to return the local.

I'm not so sure -- you'll notice in the last example he provided he does avoid a volatile read in inc() ... and since inc() will be run on a single thread and all reads are on the atomic/volatile var, I think it all works out. He's basically using `local` as a "thread-local" cache of the atomic value on the writer so he doesn't have to read the atomic.

Peter, perhaps it's best if you can clarify but assuming that was your intent: whether there's a performance change on x86 is another question entirely. You're much better off benchmarking, but here's some wild speculation for you to consider [and for others to rubbish :)]:

I would expect no obvious benefit on x86 between the `local` code and the `lazySet` version assuming this code is running in isolation, no context switches and/or there is nothing else invalidating cache lines. Assuming a single writer thread, the cache line should never leave the L1 cache on the writer thread after the first write. As a result, on all but the first call to inc() you should read the current value from the L1 cache. IIRC reads from L1 are roughly as fast as a read from a register, so I think the writer thread's "volatile read" should actually be quite fast.

Vitaly, whatever his intent the other details in your email were interesting all the same -- I'm not especially familiar with the newer C++ memory ordering stuff. One question:

> The main "issue" here is readers will take cache misses and coherence traffic as their cacheline will be invalidated when the writer stores the new value.

If I'm reading what you're saying RE: getOpaque correctly, seems like the cache line invalidation performance hit would be unavoidable except on more relaxed archs  [than x86], right?

Cheers,
Tom





--
Tom Lee http://tomlee.co / @tglee

Vitaly Davidovich

Oct 29, 2016, 4:31:37 PM
to mechanica...@googlegroups.com


On Saturday, October 29, 2016, Tom Lee <m...@tomlee.co> wrote:
>> Peter's wording after that snippet implied he was trying to remove a volatile read on the reader, so I think Aleksey is right that get() was intended to return the local.
>
> I'm not so sure -- you'll notice in the last example he provided he does avoid a volatile read in inc() ... and since inc() will be run on a single thread and all reads are on the atomic/volatile var, I think it all works out. He's basically using `local` as a "thread-local" cache of the atomic value on the writer so he doesn't have to read the atomic.

Right, but I thought he was trying to avoid a volatile read on the reader(s) side.  But yes, he should clarify since clearly there are a few different interpretations in this thread.

On the inc() side, he could use getOpaque as well to avoid any costly CPU fences on archs that would have them and also avoid two separate writes (potentially not an issue if they hit same cacheline or adjacent cachelines, depending on microarchitectural details like write combining, adjacent cacheline prefetch, etc).  Two separate writes, particularly one to the atomic instance, can also exacerbate false sharing more easily.

> Peter, perhaps it's best if you can clarify but assuming that was your intent: whether there's a performance change on x86 is another question entirely. You're much better off benchmarking, but here's some wild speculation for you to consider [and for others to rubbish :)]:
>
> I would expect no obvious benefit on x86 between the `local` code and the `lazySet` version assuming this code is running in isolation, no context switches and/or there is nothing else invalidating cache lines. Assuming a single writer thread, the cache line should never leave the L1 cache on the writer thread after the first write. As a result, on all but the first call to inc() you should read the current value from the L1 cache. IIRC reads from L1 are roughly as fast as a read from a register, so I think the writer thread's "volatile read" should actually be quite fast.

Yes, I believe so as well - that was roughly my point for the volatile read in get() as well, except get()'ers will take cache misses.  The writer will maintain the cacheline in modified or owned state, and so the readers are the ones taking cache misses as the line is modified.  This is really no different than plain stores.


> Vitaly, whatever his intent the other details in your email were interesting all the same -- I'm not especially familiar with the newer C++ memory ordering stuff. One question:
>
>> The main "issue" here is readers will take cache misses and coherence traffic as their cacheline will be invalidated when the writer stores the new value.
>
> If I'm reading what you're saying RE: getOpaque correctly, seems like the cache line invalidation performance hit would be unavoidable except on more relaxed archs [than x86], right?

Yes, I believe you cannot avoid the readers missing on the cache because it's constantly invalidated by the writer - depending on cache coherence protocol/details, the reader may be invalidated or the writer may broadcast changes to the reader cores directly (skipping RAM).  The performance implications will also depend on whether the reader(s) and writer are on the same socket or different sockets.

Picture the interaction here with plain stores and loads (i.e. no atomics, no volatile) and assume compiler doesn't hoist any reads.  There will be normal coherence transactions no matter what.

As for getOpaque, my understanding is it's basically a compiler fence ("program order" semantics) - compiler must read the memory, and it cannot reorder loads and stores across it.  This is kind of like when a compiler sees a method call it cannot inline and doesn't know anything about - it cannot reorder memory operations across it because it can't prove that method doesn't modify memory.  Hence the term "opaque".  However, it doesn't emit any CPU fences, and that's the difference between volatile load and getOpaque on weaker memory model archs.

There's one thing I still can't get someone at Oracle to clarify, which is whether getOpaque ensures atomicity of the read.  I believe it would, but I don't think it's been explicitly stated thus far.

Peter Veentjer

Oct 30, 2016, 12:55:14 AM
to mechanical-sympathy
Let me clarify.

The discussion is around removing the volatile read in the inc method.

A counter could be incremented >100k times a second and we don't want a counter to cause too much of a performance degradation. A counter will always be incremented by the same thread.

The get method will only be called every X seconds. So a volatile read is fine.

Aleksey Shipilev

Oct 30, 2016, 4:22:25 AM
to mechanica...@googlegroups.com
On 10/30/2016 05:55 AM, Peter Veentjer wrote:
> Let me clarify.
> The discussion is around removing the volatile read in the inc method.

Ah, sorry I misinterpreted the question. It usually goes the other way:
the reads vastly outnumber the writes.

Well, since the writer is single-threaded, there is no need for
volatile/atomic reads/updates in inc(). But I think it's useless to
optimize the volatile reads there, because you have the impending
volatile/release store, which would dominate the cost. So, with that
in mind, the better strategy might be dispensing with the Atomic on every
inc()? The excess branch might be much less hassle than a
volatile/release store.

public class Counter {
    private long local;
    private volatile long publish;

    public void inc(){
        long n = ++local;
        if ((n & 0xFF) == 0) {
            publish = n;
        }
    }

    public long get() {
        return publish;
    }
}

...with VarHandles, you can even collude the access modes:

public class Counter {
    static final VarHandle VH = <get_VH_to_local>;
    private long local;

    public void inc(){
        long n = (long) VH.get(this);
        n++;
        if ((n & 0xFF) == 0) {
            VH.set{Volatile,Release,Opaque}(this, n);
        } else {
            VH.set(this, n);
        }
    }

    public long get(){
        return VH.get{Volatile,Acquire,Opaque}(this);
    }
}

...with some choice of the modes.
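A self-contained, runnable rendition of the batch-publish idea (plain volatile field, single-threaded driver; illustrative only) shows that get() can lag the true count by up to 255 increments:

```java
// Batched publication: the single writer keeps a plain count and performs
// a volatile store only once every 256 increments, so readers may observe
// a value up to 255 behind the true count.
public class BatchedCounter {

    private long local;            // writer-private exact count
    private volatile long publish; // value visible to readers

    public void inc() {
        long n = ++local;
        if ((n & 0xFF) == 0) {
            publish = n; // volatile store, once per 256 increments
        }
    }

    public long get() {
        return publish;
    }
}
```

After 300 increments, get() returns 256, not 300: the remaining 44 increments sit in the writer-private field until the next publish.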

> A counter could be incremented >100k times a second and we don't want a
> counter to cause too much of a performance degradation. A counter will
> always be incremented by the same thread.

There is a subtle difference between "a single writer to Counter", and
"each thread gets its own Counter". I would venture to guess that
(false) sharing in your case should be the concern, not the cost of
volatile ops.

Having a referenced AtomicLong instead of padded out "local" field would
set you up for false sharing very nicely :)

Thanks,
-Aleksey


Aleksey Shipilev

Oct 30, 2016, 4:23:16 AM
to mechanica...@googlegroups.com
On 10/29/2016 10:31 PM, Vitaly Davidovich wrote:
> There's one thing I still can't get someone at Oracle to clarify, which
> is whether getOpaque ensures atomicity of the read. I believe it would,
> but I don't think it's been explicitly stated thus far.

Like std::atomic(..., mem_ord_relaxed), VarHandles.{get,set}Opaque are
access (single-copy) atomic.

Thanks,
-Aleksey



Vitaly Davidovich

Oct 30, 2016, 9:57:09 AM
to mechanica...@googlegroups.com
I think that's what I would go for, except I'd just setOpaque each time in inc().  Besides the compiler barrier, this is pretty much like a plain store. 

>  A counter could be incremented >100k times a second and we don't want a
>  counter to cause too much of a performance degradation. A counter will
>  always be incremented by the same thread.

> There is a subtle difference between "a single writer to Counter", and
> "each thread gets its own Counter". I would venture to guess that
> (false) sharing in your case should be the concern, not the cost of
> volatile ops.
>
> Having a referenced AtomicLong instead of padded out "local" field would
> set you up for false sharing very nicely :)
>
> Thanks,
> -Aleksey


Vitaly Davidovich

Oct 30, 2016, 9:57:52 AM
to mechanica...@googlegroups.com


On Sunday, October 30, 2016, Aleksey Shipilev <aleksey....@gmail.com> wrote:
> Like std::atomic(..., mem_ord_relaxed), VarHandles.{get,set}Opaque are
> access (single-copy) atomic.

Ok good.  Any chance this will be added to the javadoc to avoid any ambiguity?



Nitsan Wakart

Nov 8, 2016, 3:37:32 AM
to mechanica...@googlegroups.com
To summarize from other comments:
- No volatile read -> no HB
- Plain writes/reads -> no atomicity (a concern for 32-bit)
- Java 9 offers us opaque reads/writes -> -HB, +atomicity; might be a good fit
- Aleksey points out false sharing concerns
- Vitaly highlights cache miss costs
The one addition I had was that the "cost" of a volatile read is context dependent, so simplistic benchmarks will fail to demonstrate the impact the fence has. If the benchmark is:
@Benchmark
long get(){ return counter.get(); }
the volatile and opaque load versions should give the same result (assuming a sane compiler and the same layout). Furthermore, benchmarking the use case of many incs vs. few gets will potentially have the large cache-miss cost (since the counter definitely changed since the last read) hide the ordering-induced costs (perhaps extra reads from L1) in a more "elaborate" case, e.g.:
int a,b,c;
@Benchmark
long get(){
  b+=a;
  a+=c;
  long l = counter.get();
  c+=a;
  a+=b;
  return l;
}
