Avoiding Selector.wakeup() in NIO event loops, a question about CAS

awei...@voltdb.com

Mar 28, 2014, 4:24:04 PM
Hi all,

On my current project, the last high-traffic lock I have to deal with is the one behind Selector.wakeup(), which is invoked to hand off writes to the network thread responsible for servicing the socket. The lock is split across several selector threads, but each socket is only ever serviced by one selector, which allows lock-free access to the application state associated with that connection. The result is contention when there is a hot connection.

Netty tries to reduce invocations of Selector.wakeup() by tracking whether the selector thread might already be awake using an AtomicBoolean. Netty uses CAS to turn the boolean on and off. See
and

I don't quite see how CAS is necessary, although maybe it is more accurate at preventing extra Selector.wakeup() calls and just as fast? If you do a CAS that is going to fail on a cache line that is already in the shared state, is it any slower than doing a volatile read? Will a failed CAS move the cache line to the exclusive state and incur extra overhead even though the value at the cache line is not going to change?
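
For readers following along, the two guards being compared can be sketched roughly as below. This is a minimal illustration under stated assumptions, not Netty's actual code: the class name WakeupGuard, its method names, and the reset() hook are all hypothetical. Only Selector.wakeup() and the AtomicBoolean calls are real JDK APIs.

```java
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper (not Netty code): two ways to guard Selector.wakeup(). */
final class WakeupGuard {
    private final Selector selector;
    // true means "a wakeup has already been requested for the current select() cycle"
    private final AtomicBoolean wakenUp = new AtomicBoolean(false);

    WakeupGuard(Selector selector) {
        this.selector = selector;
    }

    /** CAS guard: at most one caller per cycle flips the flag and calls wakeup(). */
    boolean wakeupWithCas() {
        if (wakenUp.compareAndSet(false, true)) {
            selector.wakeup();
            return true;
        }
        return false;
    }

    /**
     * GET guard: a volatile read followed by a volatile write. Two threads can
     * both read false and both call wakeup(), which is the "extra wakeup" case.
     */
    boolean wakeupWithGet() {
        if (!wakenUp.get()) {
            wakenUp.set(true);
            selector.wakeup();
            return true;
        }
        return false;
    }

    /** Assumed to be called by the selector thread before it blocks in select(). */
    void reset() {
        wakenUp.set(false);
    }
}
```

With a single caller both guards behave identically; they only differ when several threads race on the flag, which is where the cost of the CAS has to be weighed against the cost of an occasional redundant wakeup().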

Thanks,
Ariel

Ariel Weisberg

Mar 29, 2014, 11:36:40 AM
Hi,

Attempting to answer my own question: a failed CAS is indeed slower. JMH code is at http://pastebin.com/PtbTbi0G and results are at http://pastebin.com/CRnaY3hZ

Result : 2762031.415 ±(99.9%) 316068.515 ops/ms
  Statistics: (min, avg, max) = (2618424.889, 2762031.415, 2818755.467), stdev = 82081.990
  Confidence interval (99.9%): [2445962.900, 3078099.930]
 
 
Benchmark               Mode   Samples         Mean   Mean error    Units
o.s.MyBenchmark.CAS    thrpt         5    31822.578     2484.089   ops/ms
o.s.MyBenchmark.GET    thrpt         5  2762031.415   316068.515   ops/ms
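
The comparison in the table above (a CAS that always fails versus a volatile read) can be illustrated without JMH as follows. This is only a crude single-threaded sketch with hypothetical names (FailedCasVsGet, guardWithCas, guardWithGet, timeNanos); it is not the pastebin benchmark, and hand-rolled timing loops like this are no substitute for JMH when actual numbers matter.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical illustration of the two operations being benchmarked. */
final class FailedCasVsGet {
    // The flag is already true, so compareAndSet(false, true) always fails.
    static final AtomicBoolean FLAG = new AtomicBoolean(true);

    /** Failed CAS: returns false and leaves the flag unchanged. */
    static boolean guardWithCas() {
        return FLAG.compareAndSet(false, true);
    }

    /** Volatile read: returns false because the flag is already set. */
    static boolean guardWithGet() {
        return !FLAG.get();
    }

    /** Crude timing loop; returns elapsed nanoseconds for the chosen guard. */
    static long timeNanos(boolean useCas, int iterations) {
        int wins = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (useCas ? guardWithCas() : guardWithGet()) {
                wins++;
            }
        }
        long elapsed = System.nanoTime() - start;
        // Neither guard should ever "win" while the flag stays true.
        if (wins != 0) {
            throw new IllegalStateException("unexpected successful guard");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 50_000_000;
        timeNanos(true, n);   // warm-up pass
        timeNanos(false, n);  // warm-up pass
        System.out.println("failed CAS ns: " + timeNanos(true, n));
        System.out.println("volatile get ns: " + timeNanos(false, n));
    }
}
```

Both guards return the same answer here; the only difference is how the hardware has to treat the cache line to produce it.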

I am skeptical the CAS can deliver more value in this case than the occasional extra Selector.wakeup() invocation. Well, let's benchmark!

I tried to write a selector loop and a loop that wakes up the selector. The selector loop consumes some CPU after being woken up. I also tested using set and lazySet. Set seems to perform better. 1:3 threads since my CPU is a quad-core, but utilization was 250%.


Result : 642967.171 ±(99.9%) 102030.083 ops/ms
  Statistics: (min, avg, max) = (509935.187, 642967.171, 744384.949), stdev = 67486.583
  Confidence interval (99.9%): [540937.088, 744997.254]
Result "testGETProtection": 642632.576 ±(99.9%) 102039.668 ops/ms
  Statistics: (min, avg, max) = (509598.987, 642632.576, 744064.608), stdev = 67492.923
  Confidence interval (99.9%): [540592.908, 744672.245]
Result "testGETSelector": 334.595 ±(99.9%) 16.189 ops/ms
  Statistics: (min, avg, max) = (318.929, 334.595, 349.398), stdev = 10.708
  Confidence interval (99.9%): [318.405, 350.784]

Benchmark                               (useLazySet)   Mode   Samples         Mean   Mean error    Units
o.s.MyBenchmark.CAS                             true  thrpt        10    31981.160      392.509   ops/ms
o.s.MyBenchmark.CAS:testCASProtection           true  thrpt        10    31687.056      394.408   ops/ms
o.s.MyBenchmark.CAS:testCASSelector             true  thrpt        10      294.104        8.928   ops/ms
o.s.MyBenchmark.CAS                            false  thrpt        10    32666.963      666.222   ops/ms
o.s.MyBenchmark.CAS:testCASProtection          false  thrpt        10    32361.241      674.443   ops/ms
o.s.MyBenchmark.CAS:testCASSelector            false  thrpt        10      305.722       23.933   ops/ms
o.s.MyBenchmark.GET                             true  thrpt        10   591208.139    75301.095   ops/ms
o.s.MyBenchmark.GET:testGETProtection           true  thrpt        10   590867.063    75316.221   ops/ms
o.s.MyBenchmark.GET:testGETSelector             true  thrpt        10      341.076       18.519   ops/ms
o.s.MyBenchmark.GET                            false  thrpt        10   642967.171   102030.083   ops/ms
o.s.MyBenchmark.GET:testGETProtection          false  thrpt        10   642632.576   102039.668   ops/ms
o.s.MyBenchmark.GET:testGETSelector            false  thrpt        10      334.595       16.189   ops/ms

Is there a way I can get the copy paste results to format better when posting to the group?

JMH is really great.

Regards,
Ariel 

tm jee

Apr 1, 2014, 1:47:46 AM
to mechanica...@googlegroups.com
Hi Ariel, 

That's great. IMHO the CAS and GET versions do not necessarily mean the same thing unless the single-writer principle holds.

Just my 2 cents.

Norman Maurer

Apr 1, 2014, 2:04:51 AM
to awei...@voltdb.com, mechanica...@googlegroups.com
Hi there,

I can only speak for Netty here and why we do it, so take this with a grain of salt :)

I think if you really want to prevent multiple wakeups you need an atomic operation. Remember that in the case of Netty we have multiple threads that may trigger the CAS operation here.

-- 
Norman Maurer

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

이희승 (Trustin Lee)

Apr 1, 2014, 2:07:57 AM
to mechanica...@googlegroups.com

Why did I use CAS rather than GET? I don't remember, to be honest. I'm up for using GET instead if the extra wakeups are not too many. IIRC wakeup in Linux writes a dummy byte to a pipe to wake up an epoll_wait call, and thus it's pretty expensive: compare a CAS with a system call that writes to a kernel buffer from user space, followed by clearing that buffer again.

Usually, a fully asynchronous Netty application will not even see a CAS, because everything is run from an I/O thread. However, an application that performs a potentially long running task will be affected by this change.

Would you be interested in investigating further? I'd be happy to help you.


Norman Maurer

Apr 1, 2014, 2:14:57 AM
to 이희승 (Trustin Lee), mechanica...@googlegroups.com
Hey Trustin,

I think the CAS is quite a bit cheaper than the extra Selector.wakeup(), and as you already said, most Netty apps do not even need to call wakeup at all.


-- 
Norman Maurer

Ariel Weisberg

Apr 1, 2014, 8:53:04 AM
Hi,

I tested to see if you really benefit from CAS. According to my benchmarks, you can queue more tasks (and not by a little) without hitting the cache line of the boolean as hard if the selector thread is awake for some period of time. If CAS really is beneficial, there should be a way to change the benchmark so that CAS comes out faster.

I don't see why Selector.wakeup() is only called from the network thread? If another thread in the system needs to queue a write to a socket owned by the selector would it not put a task in the queue and then invoke wakeup? Does Netty allow writers to sockets to lock and do the writes themselves or are you saying event processing never escapes the Netty thread? 

My application partitions to the core level so there will always be a handoff to a different non-network event processing thread or possibly a forward to a different socket if the request arrived at the wrong node. Replication will also trigger messages to other network threads. Event processing depends on shared mutable state and rather than lock the shared state I am partitioning it so that events can be routed to the correct partition and then processed without locking.

Ariel

Norman Maurer

Apr 1, 2014, 8:52:47 AM
to mechanica...@googlegroups.com, Ariel Weisberg, awei...@voltdb.com

On April 1, 2014 at 14:48:59, Ariel Weisberg (arielw...@gmail.com) wrote:

> I don't see why Selector.wakeup() is only called from the network thread? If another thread in the system needs to queue a write to a socket owned by the selector would it not put a task in the queue and then invoke wakeup? Does Netty allow writers to sockets to lock and do the writes themselves or are you saying event processing never escapes the Netty thread?

I guess you misunderstood me. I said "we have multiple threads that may trigger the CAS operation here", which basically means that Selector.wakeup() will only be called from a non-I/O thread. So what we do is this: if someone triggers a write from outside of the I/O thread (EventLoop), we put a task in a queue and wake up the Selector so the task is picked up.
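
The hand-off described here can be sketched as a tiny event loop. This is a hedged illustration, not Netty's implementation: MiniEventLoop, inEventLoop(), execute() and shutdown() are hypothetical names chosen for this example, and no channels are registered with the selector.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical mini event loop; names are illustrative, not Netty's API. */
final class MiniEventLoop implements Runnable {
    private final Selector selector;
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean wakenUp = new AtomicBoolean(false);
    private volatile Thread loopThread;
    private volatile boolean running = true;

    MiniEventLoop() throws IOException {
        this.selector = Selector.open();
    }

    boolean inEventLoop() {
        return Thread.currentThread() == loopThread;
    }

    /** Queue a task; only callers outside the loop thread wake the selector. */
    void execute(Runnable task) {
        tasks.add(task);
        if (!inEventLoop() && wakenUp.compareAndSet(false, true)) {
            selector.wakeup();
        }
    }

    void shutdown() {
        execute(() -> running = false);
    }

    @Override
    public void run() {
        loopThread = Thread.currentThread();
        try {
            while (running) {
                wakenUp.set(false);
                // Re-check the queue after resetting the flag so that a task
                // queued while the flag was still true is not left stranded.
                if (tasks.isEmpty()) {
                    selector.select();     // blocks until I/O is ready or wakeup() is called
                } else {
                    selector.selectNow();  // work is already pending, do not block
                }
                selector.selectedKeys().clear(); // no channels in this sketch
                Runnable task;
                while ((task = tasks.poll()) != null) {
                    task.run();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            try {
                selector.close();
            } catch (IOException ignored) {
                // nothing useful to do on close failure in a sketch
            }
        }
    }
}
```

A task submitted from inside the loop thread skips both the CAS and the wakeup, which matches the point that a fully asynchronous application never pays for either.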

 



-- 
Norman Maurer




awei...@voltdb.com

Apr 1, 2014, 10:43:18 AM
to mechanica...@googlegroups.com, Ariel Weisberg, awei...@voltdb.com, norman...@googlemail.com
Hi,

That is what I expected. The statement that confused me was "as you already said most netty apps not even need to call the wakeup at all."

If my benchmark actually measures what it attempts to measure, then CAS is not better at protecting Selector.wakeup() from extra wakeups. This might be because the overhead of CAS is greater than the savings from the extra accuracy that CAS provides.

My intuition is that the race for the volatile field will only result in extra Selector.wakeup() calls a fraction of the time. I would need to run an end-to-end benchmark with each approach, and my guess is that the difference will barely be measurable.

Ariel

Norman Maurer

Apr 1, 2014, 11:30:32 AM
to mechanica...@googlegroups.com, Ariel Weisberg


On April 1, 2014 at 16:43:20, awei...@voltdb.com (awei...@voltdb.com) wrote:

> That is what I expected, the statement that confused me was "as you already said most netty apps not even need to call the wakeup at all."

This was more related to the fact that many Netty apps do all their writes from within the I/O thread (EventLoop) anyway and so do not need to wake up the selector at all. Sorry for the confusion :)

> If my benchmark actually measures what it attempts to measure then CAS is not better at protecting Selector.wakeup() from extra wakeups. This might be because the overhead of CAS is greater than the savings from the extra accuracy that CAS provides.
>
> My intuition is that the race for the volatile field will only result in extra Selector.wakeups() a fraction of the time. I would need to run an end to end benchmark with each approach and my guess is that it will barely be measurable.

Yeah, I'm just not sure that not using CAS will buy you anything either. So I think all you can do is benchmark and check. And be sure that Trustin and I would be really interested to hear the results ;)

Aleksey Shipilev

Apr 2, 2014, 7:31:15 AM
to mechanica...@googlegroups.com
Hi Ariel,

On 03/29/2014 07:35 PM, Ariel Weisberg wrote:
> Attempting to answer my own question, failed CAS is indeed slower. JMH
> code http://pastebin.com/PtbTbi0G and results http://pastebin.com/CRnaY3hZ

I am a bit sad you abuse @Group for single methods. Why are you doing
this? I guess next versions of JMH will forbid @Groups with a single
@GMB method :) The code looks OK otherwise, and results are predictable:
doing the CAS reads-for-write in the local cache, even though it will
fail pretty much all the time.
You know that the "lazySet" flag has no effect at all in your code, right?
It does lazySet on both branches:

public void set(AtomicBoolean val, boolean flag) {
    if (useLazySet) {
        val.lazySet(flag);
    } else {
        val.lazySet(flag);
    }
}

For that matter, the difference between lazySet=true/false can be
explained by run-to-run variance, and you should really do more than a
single fork pretty much always :)
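
For comparison, a version in which the flag actually selects between the two stores might look like the following. This is a guess at the intended code, not the contents of the updated pastebin; the wrapper class FlagWriter and its constructor are hypothetical and exist only to make the snippet self-contained.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical wrapper so the corrected set() method compiles on its own. */
final class FlagWriter {
    private final boolean useLazySet;

    FlagWriter(boolean useLazySet) {
        this.useLazySet = useLazySet;
    }

    void set(AtomicBoolean val, boolean flag) {
        if (useLazySet) {
            // Ordered store: cheaper, but other threads may observe it later.
            val.lazySet(flag);
        } else {
            // Full volatile store.
            val.set(flag);
        }
    }
}
```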

-Aleksey

Ariel Weisberg

Apr 2, 2014, 9:39:22 AM
to mechanica...@googlegroups.com
Hi,

Thanks Aleksey. I had it in my head that if I wanted to use @GroupThreads I also needed to use @Group. I see that there is an @Threads annotation I should have used. There weren't any examples using it, so I didn't know about it. I also had a really hard time benchmarking things like blocking queues because I couldn't check the control flag while blocked. I also found it hard to set up queue topologies connecting multiple threads, as a matter of coordinating benchmark state between threads. If there was a recipe for that in the examples it would be helpful.

I updated the code http://pastebin.com/Xg1PA0gy and ran again http://pastebin.com/d19KgFAN fixing lazySet usage and running with 5 forks. Still wishing I could format the results for Google Groups better.

Benchmark                               (useLazySet)   Mode   Samples         Mean   Mean error    Units
o.s.MyBenchmark.CAS                             true  thrpt        50    35409.902     2116.964   ops/ms
o.s.MyBenchmark.CAS:testCASProtection           true  thrpt        50    35103.819     2119.638   ops/ms
o.s.MyBenchmark.CAS:testCASSelector             true  thrpt        50      306.083        8.385   ops/ms
o.s.MyBenchmark.CAS                            false  thrpt        50    33969.948     1688.278   ops/ms
o.s.MyBenchmark.CAS:testCASProtection          false  thrpt        50    33644.907     1696.492   ops/ms
o.s.MyBenchmark.CAS:testCASSelector            false  thrpt        50      325.041       11.345   ops/ms
o.s.MyBenchmark.GET                             true  thrpt        50   437667.331    17050.943   ops/ms
o.s.MyBenchmark.GET:testGETProtection           true  thrpt        50   437341.885    17055.107   ops/ms
o.s.MyBenchmark.GET:testGETSelector             true  thrpt        50      325.446        7.565   ops/ms
o.s.MyBenchmark.GET                            false  thrpt        50   430735.046    19177.673   ops/ms
o.s.MyBenchmark.GET:testGETProtection          false  thrpt        50   430399.618    19183.313   ops/ms
o.s.MyBenchmark.GET:testGETSelector            false  thrpt        50      335.428        7.710   ops/ms

Ariel

tm jee

Apr 3, 2014, 7:43:18 PM
to mechanica...@googlegroups.com
Hi Aleksey, 

Can you provide, if possible, a quick example of how this should be done if we are not to abuse @Group? My naive understanding is that to use @GroupThreads we need @Group.

TIA.

Aleksey Shipilev

Apr 3, 2014, 9:26:57 PM
to mechanical-sympathy
In the case of a single benchmark method, it is wisest to use @Threads without @Group.

-Aleksey.

