Benchmarking various lock strategies on Java 8

Pierre Laporte

May 1, 2014, 5:56:01 PM
to mechanica...@googlegroups.com
Hello performance folks

I recently found this benchmark (http://www.takipiblog.com/2014/04/16/java-8-longadders-the-fastest-way-to-add-numbers-concurrently/) and decided to create a similar one using JMH. Now, I have results that I cannot explain and am seeking help to understand what is going on.

The code is here : https://github.com/pingtimeout/locks-benchmark

The idea is to increment a counter using the following strategies :
  • no protection on the field (invalid results, of course)
  • field protection with 'volatile' keyword (invalid result, also)
  • field protection with 'synchronized' keyword
  • field protection with ReentrantReadWriteLock
  • usage of AtomicLong
  • usage of LongAdder
  • field protection with StampedLock

I have run the benchmark with 1 up to 100 threads to see how the picture would evolve with an increasing number of readers/writers.
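
For reference, the LongAdder and StampedLock variants look roughly like this (simplified sketch only, not the exact benchmark code; the real code is in the repository linked above):

import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.StampedLock;

class CounterSketch {
    // LongAdder strategy: writers hit striped cells, readers sum them up.
    private final LongAdder adder = new LongAdder();

    long adderWrite() { adder.increment(); return 1; }
    long adderRead()  { return adder.sum(); }

    // StampedLock strategy: exclusive writes, optimistic reads.
    private final StampedLock stampedLock = new StampedLock();
    private long value;

    long stampedWrite() {
        long stamp = stampedLock.writeLock();
        try {
            return ++value;
        } finally {
            stampedLock.unlockWrite(stamp);
        }
    }

    long stampedRead() {
        long stamp = stampedLock.tryOptimisticRead();
        long current = value;
        if (!stampedLock.validate(stamp)) {    // a writer slipped in, fall back to a pessimistic read
            stamp = stampedLock.readLock();
            try {
                current = value;
            } finally {
                stampedLock.unlockRead(stamp);
            }
        }
        return current;
    }
}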

Platform specs : https://github.com/pingtimeout/locks-benchmark/blob/master/results/2014-05-01/README.md
Results as CSV file : https://github.com/pingtimeout/locks-benchmark/blob/master/results/2014-05-01/results/benchmark-results.csv
And finally, some graphs : https://github.com/pingtimeout/locks-benchmark/tree/master/results/2014-05-01/results/images

Now, I have a lot of unanswered questions :
  • Is the @Group annotation the correct way of testing concurrent code with JMH ?
  • Is there another annotation that would allow me to have 3x more writers than readers, for instance ?
  • Is my benchmark correct ?
  • According to the StampedLock results (https://github.com/pingtimeout/locks-benchmark/blob/master/results/2014-05-01/results/images/stamped-lock.png), it looks like there is serious starvation of the writers. Am I doing something wrong here ?
  • According to the overall « Reads » graph (https://github.com/pingtimeout/locks-benchmark/blob/master/results/2014-05-01/results/images/reads.png), the number of reads/ms of StampedLock goes way above the number of direct (unprotected) field accesses. This cannot be true, since a direct field access should be a CPU register access... Am I missing something ?

Otherwise, it looks like, as stated in the javadoc, LongAdder performs better (https://github.com/pingtimeout/locks-benchmark/blob/master/results/2014-05-01/results/images/adder.png) than AtomicLong when there are a lot of writers (at the expense of the readers, though).

Any thoughts/ideas ?

--

Pierre

Georges Gomes

May 2, 2014, 1:55:20 AM
to mechanical-sympathy

Hi

You can have more readers and writers by adding @Threads(x) (or @GroupThreads(x), I don't remember which) where x is the number of concurrent threads involved. You can use different values for readers and writers.

I found my results very difficult to use. ReentrantLock, for example, was much better than AtomicLong when multiple writers were involved.

It makes sense in the benchmark because the thing is super-highly contended. CAS (AtomicLong) will loop and contend very hard without backing off, whereas ReentrantLock has a backoff strategy and behaves better.
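
Conceptually, the contended hot path of the CAS approach is a retry loop roughly like this (simplified sketch; the real AtomicLong methods are JVM intrinsics, but the contention behaviour is similar):

import java.util.concurrent.atomic.AtomicLong;

class CasRetrySketch {
    // Under heavy contention every failed compareAndSet triggers an
    // immediate retry, with no backoff at all.
    static long casIncrement(AtomicLong counter) {
        long current;
        do {
            current = counter.get();
        } while (!counter.compareAndSet(current, current + 1));   // spin until our CAS wins
        return current + 1;
    }
}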

Be careful how you use these results. The fact that this benchmark is highly contended gives different results than a real-life program, where contention is much more manageable for the CAS operation.

My 2 cents
Georges


Chris Vest

May 2, 2014, 3:04:35 AM
to mechanica...@googlegroups.com

Why is there a ReentrantReadWriteLock per Thread, though?

https://github.com/pingtimeout/locks-benchmark/blob/master/src/main/java/fr/pingtimeout/locksbenchmark/LocksBenchmark.java#L65

Cheers,
Chris

Remi Forax

May 2, 2014, 4:07:31 AM
to mechanica...@googlegroups.com
On 05/02/2014 09:04 AM, Chris Vest wrote:
> Why is there a ReentrantReadWriteLock per Thread, though?
>
> https://github.com/pingtimeout/locks-benchmark/blob/master/src/main/java/fr/pingtimeout/locksbenchmark/LocksBenchmark.java#L65
>
> Cheers,
> Chris

to avoid contention :)

Rémi


Pierre Laporte

May 2, 2014, 4:34:27 AM
to mechanica...@googlegroups.com
On Friday, May 2, 2014 at 10:07:31 AM UTC+2, Rémi Forax wrote:
On 05/02/2014 09:04 AM, Chris Vest wrote:
> Why is there a ReentrantReadWriteLock per Thread, though?
>
> https://github.com/pingtimeout/locks-benchmark/blob/master/src/main/java/fr/pingtimeout/locksbenchmark/LocksBenchmark.java#L65
>
> Cheers,
> Chris

to avoid contention :)



@Georges > I am going to have a look at those annotations, especially @GroupThreads, which seems to be what I am looking for :-)

@Chris > Oh… nice catch ! I fixed that. I am re-running the test for RRWL only to see how it goes.

@Remi > In that case, should I conclude that to avoid contention, one should not use a RRWL ? ;-)

Aleksey Shipilev

May 2, 2014, 4:51:45 AM
to mechanica...@googlegroups.com
Hi Pierre,

On 05/02/2014 01:56 AM, Pierre Laporte wrote:
> The code is here : https://github.com/pingtimeout/locks-benchmark

You should do these improvements (a rough sketch follows the list):

* Move initializers to @Setup, including the initializer for RWLock

* Avoid "final" in field declarations (moving to @Setup will implicitly
solve that)

* Include error bounds in your graphs. Contrary to what some people on
this list are saying, even if your means are drastically different, the
errors might indicate the difference is not significant.

* Do a variable backoff in both readers and writers: this will help to
model the "real life" scenario where the user code under the protected region
runs without contention for some time, thus amortizing the cost of
synchronization/contention. BlackHole.consumeCPU with an @Param int tokens
will do, e.g.
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_21_ConsumeCPU.java

(This will explode running times, but the data would tell much more
interesting things).
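
Something along these lines, perhaps (a rough sketch only; the class, field and parameter names are illustrative, not taken from your repo, and the import is for recent JMH versions):

import java.util.concurrent.locks.ReentrantReadWriteLock;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Group)
public class RwlCounterSketch {

    ReentrantReadWriteLock lock;   // non-final, created in @Setup
    long value;

    @Param({"0", "128", "1024"})
    long tokens;                   // amount of "user work" per operation

    @Setup
    public void setup() {
        lock = new ReentrantReadWriteLock();
        value = 0;
    }

    @Benchmark
    @Group("rwl")
    public long writer() {
        Blackhole.consumeCPU(tokens);      // variable backoff outside the lock
        lock.writeLock().lock();
        try {
            return ++value;
        } finally {
            lock.writeLock().unlock();
        }
    }

    @Benchmark
    @Group("rwl")
    public long reader() {
        Blackhole.consumeCPU(tokens);      // same backoff on the reader side
        lock.readLock().lock();
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }
}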

> I have run the benchmark with 1 up to 100 threads to see how the picture
> would evolve with an increasing number of readers/writers.

Note that running past the number of CPUs may introduce weird results,
since you will effectively run only as many threads as your CPUs can
handle, and there will never be 100-thread contention on a 4-core CPU.

> * Is the @Group annotation the correct way of testing concurrent code
> with JMH ?

If your benchmarks are asymmetric, then yes. They are.

> * Is there another annotation that would allow me to have 3x more
> writers than readers, for instance ?

@GroupThreads defines the thread distribution within the group:
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_15_Asymmetric.java

There is also the command line switch "-tg" which can override that
globally.
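
For example, a hypothetical 3:1 writer/reader split (class and method names are illustrative):

import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.*;

@State(Scope.Group)
public class AsymmetricCounterSketch {

    AtomicLong counter;

    @Setup
    public void setup() {
        counter = new AtomicLong();
    }

    @Benchmark
    @Group("atomic")
    @GroupThreads(3)                 // three writer threads per group instance...
    public long writer() {
        return counter.incrementAndGet();
    }

    @Benchmark
    @Group("atomic")
    @GroupThreads(1)                 // ...and one reader thread
    public long reader() {
        return counter.get();
    }
}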

> * Is my benchmark correct ?

Well, it measures something :) It will be correct once you explain every
phenomenon you see.

> it looks like there is a serious starvation on the writers. Am I
> doing something wrong here ?

I would believe this graph if you actually had a 32+ thread machine, but
you only have 24. Otherwise you are starving your own threads before
StampedLock even has a chance to starve them :)

> * According to the overall « Reads » graph
> <https://github.com/pingtimeout/locks-benchmark/blob/master/results/2014-05-01/results/images/reads.png>,
> the number of reads/ms of StampedLock goes way above the number of
> direct (unprotected) field accesses. This cannot be true, since the
> direct field access should be a CPU register access... Am I missing
> something ?

I *speculate* this might be a side effect of writers having no chance to
run (e.g. staying parked on writeLock), so the readers are always reading
their own unmodified copies, never spoiled by writers. This is a luxury
unprotected field accesses do not have.

Thanks,
-Aleksey.

Pierre Laporte

May 2, 2014, 4:25:28 PM
to mechanica...@googlegroups.com


On Friday, May 2, 2014 at 10:51:45 AM UTC+2, Aleksey Shipilev wrote:
Hi Pierre,

You should do these improvements:

 * Move initializers to @Setup, including the initializer for RWLock

 * Avoid "final" in field declarations (moving to @Setup will implicitly
solve that)

Hi Aleksey

Done, the new version is here
 
 * Include error bounds in your graphs. Contrary to what some people on
this list are saying, even if your means are drastically different, the
errors might indicate the difference is not significant.

Do you mean this kind of graph ?
Note : I updated all of the graphs with the "errorbars" graph type, but the values are still from the "No consumeCPU()" run.

 * Do a variable backoff in both readers and writers: this will help to
model the "real life" scenario where user code under protected region is
running without contention for some time, thus amortizing the cost of
synchronization/contention. BlackHole.consumeCPU with @Param int tokens
will do, e.g.
http://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_21_ConsumeCPU.java

 (This will explode running times, but the data would tell much more
interesting things).

I included a Blackhole.consumeCPU(1024). It looks like it represents a big synchronized section, so I am expecting it to compensate for the overhead of ReentrantReadWriteLock. I will run another batch with 128 to see the differences.
 
Note that running past the number of CPUs may introduce weird results
since you will effectively run only so much threads your CPUs can
handle, and there would be no 100-thread contention ever on 4-core CPUs.

Got it, I modified the launch script so that it "only" runs the benchmarks with 1 to 24 threads. It is running right now; I should have the results tomorrow.
 
[...]


I would believe this graph if you actually have 32+ threads machine, and
you only have 24. Otherwise you are starving your own threads, before
StampedLock even has a chance to starve :)

I do agree about the values above 24 threads; however, the values from 1 to 24 threads on this graph are... weird. I think I need to try different combinations of readers/writers to see where this comes from...

I *speculate* this might be a side effect of writers have no chance to
run (e.g. stay parked on writeLock), and the readers are always reading
their own unmodified copies never spoiled by writers. This is not the
luxury unprotected field accesses have.

So it was faster because fewer updates were made to the value ? In that case, the CPU register access I was mentioning was actually used more in the StampedLock run, is that correct ?

Thanks for the explanations ! Results for the next run in ~12 hours !