Avoiding Selector.wakeup() in NIO event loops, a question about CAS

awei...@voltdb.com

Mar 28, 2014, 4:24:04 PM

Hi all,

On my current project, the last high-traffic lock I have to deal with is Selector.wakeup(), which is invoked to hand off writes to the network thread responsible for servicing the socket. The lock is split across several selector threads, but a socket is only ever serviced by one selector, to allow lock-free access to the associated application state for that connection, and this results in contention when there is a hot connection.

Netty tries to reduce invocations of Selector.wakeup() by tracking whether the selector thread might already be awake using an AtomicBoolean. Netty uses CAS to turn the boolean on and off. See

https://github.com/netty/netty/blob/6efac6179e1e13e6caba2cec6109ce27862efc9a/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L572

and

https://github.com/netty/netty/blob/6efac6179e1e13e6caba2cec6109ce27862efc9a/transport/src/main/java/io/netty/channel/nio/NioEventLoop.java#L299

I don't quite see why CAS is necessary, although maybe it is more accurate at preventing extra Selector.wakeup() calls and just as fast? If you do a CAS that is going to fail on a cache line that is already in the shared state, is it any slower than doing a volatile read? Will a failed CAS move the cache line to the exclusive state and incur extra overhead even though the value at the cache line is not going to change?
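To make the question concrete, here is a minimal sketch of the two guards being compared. It is not Netty's actual code: the class name WakeupGuard, its methods, and the Runnable standing in for Selector.wakeup() are assumptions made purely for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: two ways to guard an expensive wakeup call with a flag. */
final class WakeupGuard {
    private final AtomicBoolean wakenUp = new AtomicBoolean(false);
    private final Runnable wakeup; // stands in for selector::wakeup

    WakeupGuard(Runnable wakeup) {
        this.wakeup = wakeup;
    }

    /** CAS guard: only one caller can win the false -> true transition. */
    boolean wakeupWithCas() {
        if (wakenUp.compareAndSet(false, true)) {
            wakeup.run();
            return true;
        }
        return false;
    }

    /**
     * GET guard: a volatile read followed by a volatile write.
     * Two callers may both read false and both call wakeup.
     */
    boolean wakeupWithGet() {
        if (!wakenUp.get()) {
            wakenUp.set(true);
            wakeup.run();
            return true;
        }
        return false;
    }

    /** Called by the selector thread before it blocks in select() again. */
    void reset() {
        wakenUp.set(false);
    }
}
```

With the CAS guard, at most one caller triggers the wakeup between two resets. With the GET guard, the common already-awake case is only a read, at the price of an occasional duplicate wakeup when callers race.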

Thanks,

Ariel

Ariel Weisberg

Mar 29, 2014, 11:36:40 AM

Hi,

Attempting to answer my own question: a failed CAS is indeed slower. JMH code http://pastebin.com/PtbTbi0G and results http://pastebin.com/CRnaY3hZ

Result : 2762031.415 ±(99.9%) 316068.515 ops/ms
Statistics: (min, avg, max) = (2618424.889, 2762031.415, 2818755.467), stdev = 82081.990
Confidence interval (99.9%): [2445962.900, 3078099.930]

Benchmark            Mode   Samples         Mean  Mean error  Units
o.s.MyBenchmark.CAS  thrpt        5    31822.578    2484.089  ops/ms
o.s.MyBenchmark.GET  thrpt        5  2762031.415  316068.515  ops/ms

I am skeptical the CAS can deliver more value in this case than the occasional extra Selector.wakeup() invocation. Well, let's benchmark!

I tried to write a selector loop and a loop that wakes up the selector. The selector loop consumes some CPU after being woken up. I also tested using set and lazySet; set seems to perform better. I used 1:3 threads since my CPU is a quad-core, but utilization was 250%.

Code http://pastebin.com/X6ZsbwyT and results http://pastebin.com/Gy2FurVN

Result : 642967.171 ±(99.9%) 102030.083 ops/ms
Statistics: (min, avg, max) = (509935.187, 642967.171, 744384.949), stdev = 67486.583
Confidence interval (99.9%): [540937.088, 744997.254]
Result "testGETProtection": 642632.576 ±(99.9%) 102039.668 ops/ms
Statistics: (min, avg, max) = (509598.987, 642632.576, 744064.608), stdev = 67492.923
Confidence interval (99.9%): [540592.908, 744672.245]
Result "testGETSelector": 334.595 ±(99.9%) 16.189 ops/ms
Statistics: (min, avg, max) = (318.929, 334.595, 349.398), stdev = 10.708
Confidence interval (99.9%): [318.405, 350.784]

Benchmark                              (useLazySet)  Mode   Samples        Mean  Mean error  Units
o.s.MyBenchmark.CAS                            true  thrpt       10   31981.160     392.509  ops/ms
o.s.MyBenchmark.CAS:testCASProtection          true  thrpt       10   31687.056     394.408  ops/ms
o.s.MyBenchmark.CAS:testCASSelector            true  thrpt       10     294.104       8.928  ops/ms
o.s.MyBenchmark.CAS                           false  thrpt       10   32666.963     666.222  ops/ms
o.s.MyBenchmark.CAS:testCASProtection         false  thrpt       10   32361.241     674.443  ops/ms
o.s.MyBenchmark.CAS:testCASSelector           false  thrpt       10     305.722      23.933  ops/ms
o.s.MyBenchmark.GET                            true  thrpt       10  591208.139   75301.095  ops/ms
o.s.MyBenchmark.GET:testGETProtection          true  thrpt       10  590867.063   75316.221  ops/ms
o.s.MyBenchmark.GET:testGETSelector            true  thrpt       10     341.076      18.519  ops/ms
o.s.MyBenchmark.GET                           false  thrpt       10  642967.171  102030.083  ops/ms
o.s.MyBenchmark.GET:testGETProtection         false  thrpt       10  642632.576  102039.668  ops/ms
o.s.MyBenchmark.GET:testGETSelector           false  thrpt       10     334.595      16.189  ops/ms

Is there a way I can get the copy paste results to format better when posting to the group?

JMH is really great.

Regards,

Ariel

tm jee

Apr 1, 2014, 1:47:46 AM

to mechanica...@googlegroups.com

Hi Ariel,

That's great. IMHO the CAS and GET versions do not necessarily mean the same thing unless the single-writer principle holds.

Just my 2 cents.

Norman Maurer

Apr 1, 2014, 2:04:51 AM

to awei...@voltdb.com, mechanica...@googlegroups.com

Hi there,

I can only speak for Netty here and why we do it, so take this with a grain of salt :)

I think if you really want to prevent multiple wakeups you need an atomic operation. Remember that in the case of Netty we have multiple threads that may trigger the CAS operation here.

--
Norman Maurer


이희승 (Trustin Lee)

Apr 1, 2014, 2:07:57 AM

to mechanica...@googlegroups.com

Why did I use CAS rather than GET? I don't remember, to be honest. I'm up for using GET instead if there are not too many extra wakeups. IIRC wakeup on Linux writes a dummy byte to a pipe to wake up an epoll_wait call, so it's pretty expensive: think CAS vs. a system call that writes to a kernel buffer from user space, followed by clearing it up.
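Whatever the platform does underneath, the documented behaviour of Selector.wakeup() is that a call made while no select is in progress is remembered, and that several calls before the next select have the same effect as one. A minimal sketch (the class and method names are made up for the example):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;

/** Hypothetical demo class: wakeup() issued before select() is not lost. */
final class WakeupDemo {
    /** Returns the number of ready keys reported by a select() pre-empted by wakeup(). */
    static int selectAfterWakeup() {
        try (Selector selector = Selector.open()) {
            selector.wakeup();        // no select in progress: the wakeup is remembered
            selector.wakeup();        // a second call before select() adds nothing
            return selector.select(); // returns immediately, no channels are registered
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

So an extra wakeup does not break correctness. It costs the call itself plus one spurious pass through the select loop.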

Usually, a fully asynchronous Netty application will not even see a CAS, because everything is run from an I/O thread. However, an application that performs a potentially long running task will be affected by this change.

Would you be interested in investigating further? I'd be happy to help you.


Norman Maurer

Apr 1, 2014, 2:14:57 AM

to 이희승 (Trustin Lee), mechanica...@googlegroups.com

Hey Trustin,

I think the CAS is quite a bit cheaper than the extra Selector.wakeup(), and as you already said, most Netty apps do not even need to call wakeup at all.

--
Norman Maurer

Ariel Weisberg

Apr 1, 2014, 8:53:04 AM

Hi,

I tested to see whether you really benefit from CAS. According to my benchmarks, with GET you can queue more tasks (and not by a little) without hitting the cache line of the boolean as hard, if the selector thread is awake for some period of time. If CAS really is beneficial, there should be a way to change the benchmark so that CAS comes out faster.

I don't see why Selector.wakeup() would only be called from the network thread. If another thread in the system needs to queue a write to a socket owned by the selector, would it not put a task in the queue and then invoke wakeup? Does Netty allow writers to sockets to lock and do the writes themselves, or are you saying event processing never escapes the Netty thread?

My application partitions to the core level, so there will always be a handoff to a different non-network event processing thread, or possibly a forward to a different socket if the request arrived at the wrong node. Replication will also trigger messages to other network threads. Event processing depends on shared mutable state, and rather than lock the shared state I am partitioning it so that events can be routed to the correct partition and then processed without locking.

Ariel


Norman Maurer

Apr 1, 2014, 8:52:47 AM

to mechanica...@googlegroups.com, Ariel Weisberg, awei...@voltdb.com

On April 1, 2014 at 14:48:59, Ariel Weisberg (arielw...@gmail.com) wrote:

> I don't see why Selector.wakeup() is only called from the network thread? If another thread in the system needs to queue a write to a socket owned by the selector would it not put a task in the queue and then invoke wakeup? Does Netty allow writers to sockets to lock and do the writes themselves or are you saying event processing never escapes the Netty thread?

I guess you misunderstood me. I said "we have multiple threads that may trigger the CAS operation here", which basically means that Selector.wakeup() will only be called from a non-I/O thread. So what we do is this: if someone triggers a write from outside of the I/O thread (EventLoop), we put a task in a queue and wake up the Selector so the task is picked up.
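A minimal sketch of this handoff, under assumptions: MiniEventLoop and all of its members are hypothetical names, this is not Netty's implementation, and a real loop would also process the selected keys. The loop re-checks the queue after clearing the flag so that a task enqueued while the flag was still set is not stranded behind a blocking select().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical single-threaded event loop; only the task handoff is shown. */
final class MiniEventLoop implements Runnable {
    private final Selector selector;
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean wakenUp = new AtomicBoolean(false);
    private volatile Thread loopThread;
    private volatile boolean running = true;

    MiniEventLoop() {
        try {
            selector = Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Callable from any thread: enqueue first, then wake the loop if needed. */
    void execute(Runnable task) {
        tasks.add(task);
        // Only threads other than the loop thread need to wake the selector,
        // and the CAS lets just one of them pay for wakeup() per iteration.
        if (Thread.currentThread() != loopThread && wakenUp.compareAndSet(false, true)) {
            selector.wakeup();
        }
    }

    void shutdown() {
        execute(() -> running = false);
    }

    @Override
    public void run() {
        loopThread = Thread.currentThread();
        try {
            while (running) {
                wakenUp.set(false);
                if (tasks.isEmpty()) {
                    selector.select();    // blocks until I/O is ready or wakeup() is called
                } else {
                    selector.selectNow(); // tasks are already queued: do not block
                }
                selector.selectedKeys().clear(); // a real loop would handle the keys here
                Runnable task;
                while ((task = tasks.poll()) != null) {
                    task.run();
                }
            }
            selector.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo: hand one task to the loop from the calling thread, then shut down. */
    static boolean runOneTaskFromOutside() {
        MiniEventLoop loop = new MiniEventLoop();
        Thread thread = new Thread(loop, "mini-event-loop");
        thread.start();
        CountDownLatch done = new CountDownLatch(1);
        loop.execute(done::countDown);
        try {
            boolean ran = done.await(5, TimeUnit.SECONDS);
            loop.shutdown();
            thread.join(5000);
            return ran && !thread.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Tasks submitted from the loop thread itself skip both the CAS and the wakeup, which matches the point that fully asynchronous applications rarely hit this path.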

--
Norman Maurer

awei...@voltdb.com

Apr 1, 2014, 10:43:18 AM

to mechanica...@googlegroups.com, Ariel Weisberg, awei...@voltdb.com, norman...@googlemail.com

Hi,

That is what I expected. The statement that confused me was "as you already said, most Netty apps do not even need to call wakeup at all."

If my benchmark actually measures what it attempts to measure, then CAS is not better at protecting Selector.wakeup() from extra wakeups. This might be because the overhead of CAS is greater than the savings from the extra accuracy that CAS provides.

My intuition is that the race for the volatile field will only result in extra Selector.wakeup() calls a fraction of the time. I would need to run an end-to-end benchmark with each approach, and my guess is that the difference will barely be measurable.
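One rough way to see how often the race produces extra wakeups is to count them directly. The sketch below is a toy built on assumptions, not the benchmark from the pastebin links: ExtraWakeupCounter is a hypothetical name, a counter stands in for Selector.wakeup(), and a CyclicBarrier releases all threads at once, which makes the race far more likely than in a real application.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical harness: counts simulated wakeups for the CAS guard and the GET guard. */
final class ExtraWakeupCounter {
    static long countWakeups(boolean useCas, int threads, int rounds) {
        AtomicBoolean wakenUp = new AtomicBoolean(false);
        AtomicLong wakeups = new AtomicLong();
        // The barrier action runs once per round, after every thread has arrived.
        // It resets the flag, as the selector thread would before blocking again.
        CyclicBarrier barrier = new CyclicBarrier(threads, () -> wakenUp.set(false));

        Runnable waker = () -> {
            try {
                for (int i = 0; i < rounds; i++) {
                    barrier.await();
                    boolean shouldWake;
                    if (useCas) {
                        shouldWake = wakenUp.compareAndSet(false, true);
                    } else {
                        shouldWake = !wakenUp.get();
                        if (shouldWake) {
                            wakenUp.set(true);
                        }
                    }
                    if (shouldWake) {
                        wakeups.incrementAndGet(); // stands in for selector.wakeup()
                    }
                }
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        };

        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(waker);
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return wakeups.get();
    }
}
```

With useCas = true the count always equals the number of rounds. With useCas = false it can land anywhere between rounds and rounds * threads, and the actual figure depends on the machine and the timing.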

Ariel

Norman Maurer

Apr 1, 2014, 11:30:32 AM

to mechanica...@googlegroups.com, Ariel Weisberg

On April 1, 2014 at 16:43:20, awei...@voltdb.com (awei...@voltdb.com) wrote:

> That is what I expected, the statement that confused me was "as you already said most netty apps not even need to call the wakeup at all."

This was more related to the fact that many Netty apps do all their writes from within the I/O thread (EventLoop) anyway, and so do not need to wake up the selector at all. Sorry for the confusion :)

> If my benchmark actually measures what it attempts to measure then CAS is not better at protecting Selector.wakeup() from extra wakeups. This might be because the overhead of CAS is greater than the savings from the extra accuracy that CAS provides.
>
> My intuition is that the race for the volatile field will only result in extra Selector.wakeups() a fraction of the time. I would need to run an end to end benchmark with each approach and my guess is that it will barely be measurable.

Yeah, I'm just not sure that not using CAS will buy you anything either. So I think all you can do is benchmark and check. And rest assured that Trustin and I would be really interested to hear the results ;)

Aleksey Shipilev

Apr 2, 2014, 7:31:15 AM

to mechanica...@googlegroups.com

Hi Ariel,

On 03/29/2014 07:35 PM, Ariel Weisberg wrote:
> Attempting to answer my own question, failed CAS is indeed slower. JMH
> code http://pastebin.com/PtbTbi0G and results http://pastebin.com/CRnaY3hZ

I am a bit sad you abuse @Group for single methods. Why are you doing
this? I guess next versions of JMH will forbid @Groups with a single
@GMB method :) The code looks OK otherwise, and results are predictable:
doing the CAS reads-for-write in the local cache, even though it will
fail pretty much all the time.

> Code http://pastebin.com/X6ZsbwyT and results http://pastebin.com/Gy2FurVN

You know that the "lazySet" flag has no effect at all in your code, right?
It does lazySet on both branches:

    public void set(AtomicBoolean val, boolean flag) {
        if (useLazySet) {
            val.lazySet(flag);
        } else {
            val.lazySet(flag);
        }
    }

For that matter, the difference between lazySet=true/false can be
explained by run-to-run variance, and you should really do more than a
single fork pretty much always :)
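For reference, a version of that helper in which the flag actually selects between the two stores could look like the sketch below. FlagSetter is a hypothetical wrapper class added so the snippet compiles on its own; the updated benchmark code may differ.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical wrapper around the helper, so the example is self-contained. */
final class FlagSetter {
    private final boolean useLazySet;

    FlagSetter(boolean useLazySet) {
        this.useLazySet = useLazySet;
    }

    public void set(AtomicBoolean val, boolean flag) {
        if (useLazySet) {
            val.lazySet(flag); // ordered store, may become visible to other threads later
        } else {
            val.set(flag);     // full volatile store
        }
    }
}
```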

-Aleksey

Ariel Weisberg

Apr 2, 2014, 9:39:22 AM

to mechanica...@googlegroups.com

Hi,

Thanks Aleksey. I had it in my head that if I wanted to use @GroupThreads I also needed to use @Group. I see that there is an @Threads annotation I should have used. There weren't any examples using it, so I didn't know about it. I also had a really hard time benchmarking things like blocking queues because I couldn't check the control flag while blocked. I also found it hard to set up queue topologies connecting multiple threads, as a matter of coordinating benchmark state between threads. If there were a recipe for that in the examples, it would be helpful.

I updated the code http://pastebin.com/Xg1PA0gy and ran again http://pastebin.com/d19KgFAN fixing lazySet usage and running with 5 forks. Still wishing I could format the results for Google Groups better.

Benchmark                              (useLazySet)  Mode   Samples        Mean  Mean error  Units
o.s.MyBenchmark.CAS                            true  thrpt       50   35409.902    2116.964  ops/ms
o.s.MyBenchmark.CAS:testCASProtection          true  thrpt       50   35103.819    2119.638  ops/ms
o.s.MyBenchmark.CAS:testCASSelector            true  thrpt       50     306.083       8.385  ops/ms
o.s.MyBenchmark.CAS                           false  thrpt       50   33969.948    1688.278  ops/ms
o.s.MyBenchmark.CAS:testCASProtection         false  thrpt       50   33644.907    1696.492  ops/ms
o.s.MyBenchmark.CAS:testCASSelector           false  thrpt       50     325.041      11.345  ops/ms
o.s.MyBenchmark.GET                            true  thrpt       50  437667.331   17050.943  ops/ms
o.s.MyBenchmark.GET:testGETProtection          true  thrpt       50  437341.885   17055.107  ops/ms
o.s.MyBenchmark.GET:testGETSelector            true  thrpt       50     325.446       7.565  ops/ms
o.s.MyBenchmark.GET                           false  thrpt       50  430735.046   19177.673  ops/ms
o.s.MyBenchmark.GET:testGETProtection         false  thrpt       50  430399.618   19183.313  ops/ms
o.s.MyBenchmark.GET:testGETSelector           false  thrpt       50     335.428       7.710  ops/ms

Ariel

tm jee

Apr 3, 2014, 7:43:18 PM

to mechanica...@googlegroups.com

Hi Aleksey,

Can you provide, if possible, a quick example of how this should be done if we are not to abuse @Group? My naive understanding is that to use @GroupThreads we need @Group.

TIA.

Aleksey Shipilev

Apr 3, 2014, 9:26:57 PM

to mechanical-sympathy

In the case of a single benchmark method, it is wisest to use @Threads without @Group.

-Aleksey.
