thread context switch performance


Christian Kamm

<mail@ckamm.de>
Oct 19, 2015, 8:28:37 AM
to seastar-dev@googlegroups.com
I was recently looking at the context-switch time of seastar's
cooperative threads. There's a small performance gain to be had by
using custom assembly instead of setjmp/longjmp. I'm not convinced this
is a trade-off you'd want to make and am interested in feedback.

Initially I tried this because, when I last implemented coroutines,
manually switching stacks turned out to be significantly faster than
using ucontext, and I wondered whether that could significantly improve
the results of the thread_context_switch performance test.

But you already use ucontext only for the stack setup and setjmp/longjmp
for further switches, thus the gains are small:

current:          130 ns
custom switching: 110 ns (15% better)

For this test I reused some LGPL code I wrote while employed, so this is
not immediately accompanied by a patch. It's simple though: push all
callee-saved registers, change the stack pointer and pop the registers
again.

Now, since it's only a small improvement that comes at the cost of
ABI-dependent assembly that'll be less well tested than setjmp/longjmp:
Are you interested in me polishing this for inclusion?
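
For reference, the push/switch/pop idea might look roughly like this for
the x86-64 System V ABI; this is a hypothetical sketch with an invented
name, not the LGPL code mentioned above:

// Sketch only: save the callee-saved registers on the current stack,
// store %rsp into *from_sp, switch %rsp to to_sp (which must point at a
// frame previously saved the same way), then restore registers and
// return into the other context.
asm(R"(
    .text
    .globl swap_stack
swap_stack:              # void swap_stack(void** from_sp, void* to_sp)
    pushq %rbp
    pushq %rbx
    pushq %r12
    pushq %r13
    pushq %r14
    pushq %r15
    movq  %rsp, (%rdi)   # *from_sp = current stack pointer
    movq  %rsi, %rsp     # switch to the other stack
    popq  %r15
    popq  %r14
    popq  %r13
    popq  %r12
    popq  %rbx
    popq  %rbp
    ret
)");
extern "C" void swap_stack(void** from_sp, void* to_sp);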

Nadav Har'El

<nyh@scylladb.com>
Oct 19, 2015, 8:52:46 AM
to Christian Kamm, seastar-dev
On Mon, Oct 19, 2015 at 3:28 PM, Christian Kamm <ma...@ckamm.de> wrote:
> I was recently looking at the context-switch time of seastar's
> cooperative threads. There's a small performance gain to be had by
> using custom assembly instead of setjmp/longjmp. I'm not convinced this
> is a trade-off you'd want to make and am interested in feedback.
>
> Initially I tried this because, when I last implemented coroutines,
> manually switching stacks turned out to be significantly faster than
> using ucontext, and I wondered whether that could significantly improve
> the results of the thread_context_switch performance test.
>
> But you already use ucontext only for the stack setup and setjmp/longjmp
> for further switches, thus the gains are small:
>
> current:          130 ns
> custom switching: 110 ns (15% better)
>
> For this test I reused some LGPL code I wrote while employed, so this is
> not immediately accompanied by a patch. It's simple though: push all
> callee-saved registers, change the stack pointer and pop the registers
> again.

How is what your assembly code does different from what longjmp()'s implementation does?
20 ns is not a lot, but is more than can be attributed to just the extra function call. So I wonder where the savings are coming from.

Avi Kivity

<avi@scylladb.com>
Oct 19, 2015, 8:58:12 AM
to Christian Kamm, seastar-dev@googlegroups.com


On 10/19/2015 03:28 PM, Christian Kamm wrote:
> I was recently looking at the context-switch time of seastar's
> cooperative threads. There's a small performance gain to be had by
> using custom assembly instead of setjmp/longjmp. I'm not convinced this
> is a trade-off you'd want to make and am interested in feedback.
>
> Initially I tried this because, when I last implemented coroutines,
> manually switching stacks turned out to be significantly faster than
> using ucontext, and I wondered whether that could significantly improve
> the results of the thread_context_switch performance test.
>
> But you already use ucontext only for the stack setup and setjmp/longjmp
> for further switches, thus the gains are small:
>
> current:          130 ns
> custom switching: 110 ns (15% better)

It's surprising that it's this much; I traced through longjmp and it
does a surprising amount of work.

> For this test I reused some LGPL code I wrote while employed, so this is
> not immediately accompanied by a patch. It's simple though: push all
> callee-saved registers, change the stack pointer and pop the registers
> again.
>
> Now, since it's only a small improvement that comes at the cost of
> ABI-dependent assembly that'll be less well tested than setjmp/longjmp:
> Are you interested in me polishing this for inclusion?
>

I thought of using boost.context [1] for this; we don't shy away from
our own assembly, but we prefer to avoid it if possible. The only
problem with boost.context is that it was undercooked in 1.55 (or maybe
1.56) and not really usable.

There is also an effort to standardize this as part of C++17.

But if you produce something that can be compiled out with a #define for
other archs, we can include it.


[1]
http://www.boost.org/doc/libs/1_59_0/libs/context/doc/html/context/overview.html
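
A sketch of the shape such conditional compilation could take; the macro,
type and function names here are invented placeholders, not seastar's
actual internals:

#include <setjmp.h>

extern "C" void swap_stack(void** from_sp, void* to_sp); // e.g. the asm sketch above

struct context { jmp_buf jmpbuf; void* sp; };

void switch_to(context* current, context* next) {
#ifdef HAVE_ASM_CONTEXT_SWITCH          // invented macro name
    swap_stack(&current->sp, next->sp); // custom assembly path
#else
    if (setjmp(current->jmpbuf) == 0) { // portable fallback for other archs
        longjmp(next->jmpbuf, 1);
    }
#endif
}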

Nadav Har'El

<nyh@scylladb.com>
Oct 19, 2015, 9:18:08 AM
to Avi Kivity, Christian Kamm, seastar-dev
On Mon, Oct 19, 2015 at 3:58 PM, Avi Kivity <a...@scylladb.com> wrote:
> On 10/19/2015 03:28 PM, Christian Kamm wrote:
>> But you already use ucontext only for the stack setup and setjmp/longjmp
>> for further switches, thus the gains are small:
>>
>> current:          130 ns
>> custom switching: 110 ns (15% better)
>
> It's surprising that it's this much; I traced through longjmp and it
> does a surprising amount of work.

Looking at glibc, sysdeps/x86_64/__longjmp.S, I don't see any "surprising amount of work" - it looks like fairly minimal assembly code... What is surprising there? Christian, how did your version end up faster?

> But if you produce something that can be compiled out with a #define for
> other archs, we can include it.

I agree.

Avi Kivity

<avi@scylladb.com>
Oct 19, 2015, 9:20:09 AM
to Nadav Har'El, Christian Kamm, seastar-dev


On 10/19/2015 04:18 PM, Nadav Har'El wrote:
> On Mon, Oct 19, 2015 at 3:58 PM, Avi Kivity <a...@scylladb.com> wrote:
>> On 10/19/2015 03:28 PM, Christian Kamm wrote:
>>> But you already use ucontext only for the stack setup and setjmp/longjmp
>>> for further switches, thus the gains are small:
>>>
>>> current:          130 ns
>>> custom switching: 110 ns (15% better)
>>
>> It's surprising that it's this much; I traced through longjmp and it
>> does a surprising amount of work.
>
> Looking at glibc, sysdeps/x86_64/__longjmp.S,

libpthread.so intercepted it, I think. On my system it even had bndXX
instructions for switching the Intel automatic bounds checker thingie.
Try looking at a runtime version.

Nadav Har'El

<nyh@scylladb.com>
Oct 19, 2015, 9:43:37 AM
to Avi Kivity, Christian Kamm, seastar-dev
I have no idea where this comes from - I can't find it in the source code I'm looking at.

Anyway, I wonder whether calling __longjmp() directly would be any faster, and whether that would be the __longjmp.S code I was looking at...

Avi Kivity

<avi@scylladb.com>
Oct 19, 2015, 10:10:22 AM
to Nadav Har'El, Christian Kamm, seastar-dev
If you do that, you may as well do your own assembly code.

Nadav Har'El

<nyh@scylladb.com>
Oct 19, 2015, 10:14:08 AM
to Avi Kivity, Christian Kamm, seastar-dev
On Mon, Oct 19, 2015 at 5:10 PM, Avi Kivity <a...@scylladb.com> wrote:
If you do that, you may as well do your own assembly code.


Not really... If __longjmp() is some internal glibc thing, at least it will work on any machine supported by glibc - it's not the same as writing x86-only code that won't work on other machine architectures. But of course, it's not as good as calling the standard longjmp().

And, to repeat, I have no idea if __longjmp() is actually any faster than longjmp(). I still don't understand where the other stuff you saw is coming from. I just suggested __longjmp() as something to try.

Christian Kamm

<mail@ckamm.de>
Oct 19, 2015, 1:57:05 PM
to seastar-dev@googlegroups.com
> How is what your assembly code does different from what longjmp()'s
> implementation does?
> 20 ns is not a lot, but is more than can be attributed to just the extra
> function call. So I wonder where the savings are coming from.

I don't know. I thought it might be the conditional branch I saw in
longjmp here
http://ftp.netbsd.org/pub/NetBSD/NetBSD-current/src/lib/libc/arch/x86_64/gen/_setjmp.S

But the glibc implementation doesn't have that. Maybe the conditional
around setjmp()?

What I'm using is this, with the saving/restoring of the floating-point
control word removed:
https://github.com/ckamm/qt-coroutine/blob/master/src/backend/switchstack_gcc_64_linux_mac.s

I'll look at the exact difference early next week.

Christian Kamm

<mail@ckamm.de>
Oct 27, 2015, 5:56:31 AM
to seastar-dev@googlegroups.com
The standard setjmp/longjmp spend some extra cycles on:

* Conditionally saving and restoring signal masks:

setjmp calls __sigsetjmp(..., 0) and thus needs a runtime check to
figure out whether to save the signal mask or not. longjmp needs another
check to decide whether to restore it.

* A call to __pthread_cleanup_upto:

This executes any intervening cancellation handlers registered by
pthread_cleanup_push().

* A conditional branch around setjmp.

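Roughly, in pseudocode (the call chain is inferred from the exported
glibc symbol names; simplified, not actual glibc source):

// Pseudocode only; inferred from symbol names, not actual glibc source.
int setjmp(jmp_buf env) {
    // save callee-saved registers, %rsp and the return address, then:
    return __sigjmp_save(env, 0);  // runtime check: savemask is 0, so
}                                  // the signal mask is *not* saved

void longjmp(jmp_buf env, int val) {
    _longjmp_unwind(env, val);     // __pthread_cleanup_upto(): run any
                                   // intervening pthread cleanup handlers
    // another runtime check: restore the signal mask if one was saved
    __longjmp(env, val);           // the minimal register-restoring asm
}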

I'll not continue working along these lines. The newer versions of
boost::context do look good, and eventually using that seems better than
adding a custom x86/x86_64-only version now.

Avi Kivity

<avi@scylladb.com>
Oct 27, 2015, 6:01:17 AM
to Christian Kamm, seastar-dev@googlegroups.com
I don't want to force people to update boost (yet). But we can use
conditional compilation and select either setjmp/longjmp or
boost.context, depending on availability.
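
For comparison, the boost.context side of such a switch could look like
this; note that the sketch uses the boost::context::fiber API from much
later Boost releases (1.69+), not the undercooked 1.5x interface
mentioned above:

#include <boost/context/fiber.hpp>
#include <iostream>

namespace ctx = boost::context;

int main() {
    ctx::fiber f{[](ctx::fiber&& main) {
        std::cout << "in the fiber\n";
        main = std::move(main).resume(); // switch back to main
        std::cout << "in the fiber again\n";
        return std::move(main);          // final switch back
    }};
    f = std::move(f).resume();           // first switch into the fiber
    std::cout << "back in main\n";
    f = std::move(f).resume();           // second switch
}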

niekbouman@gmail.com

<niekbouman@gmail.com>
Jul 22, 2020, 5:19:20 AM
to seastar-dev
As of 2020, boost.context's performance webpage claims a context-switch time on the order of 9 ns on a 2.2 GHz Intel Xeon.

https://www.boost.org/doc/libs/1_73_0/libs/context/doc/html/context/performance.html

Would there still be a reason to migrate to boost.context for seastar::threads now that C++20 coroutines are coming?

(According to my understanding, I should still use seastar::threads for implementing (non-tail) recursive algorithms, as C++20 coroutines do not support nested yielding/awaiting and non-tail recursion. Is my understanding correct?)

And what would be, approximately, the impact of the cache pollution of stackful coroutines on the running-time overhead? Wouldn't that typically dominate the context-switch time itself? (In that case, a migration to boost.context would not pay off from an overall perspective.)

Kind regards,
Niek

niekbouman@gmail.com

<niekbouman@gmail.com>
Jul 22, 2020, 5:55:49 AM
to seastar-dev
Btw, a disadvantage of Boost.Context is that its asm-based variant (fcontext) is currently not compatible with (clang's) ASan.

Avi Kivity

<avi@scylladb.com>
Jul 22, 2020, 6:28:28 AM
to niekbouman@gmail.com, seastar-dev
On 22/07/2020 12.19, niekb...@gmail.com wrote:
> As of 2020, boost.context's performance webpage claims a context-switch time on the order of 9 ns on a 2.2 GHz Intel Xeon.
>
> https://www.boost.org/doc/libs/1_73_0/libs/context/doc/html/context/performance.html


The context switch is a small part of the overhead. In Seastar the
switch is mediated by the scheduler and there is extra overhead due to
future/promise integration. I doubt that switching to boost.context will
bring a significant benefit.


Here's a perf report from thread_context_switch_test:


   6.62%  thread_context_  thread_context_switch_test  [.] seastar::reactor::run_tasks
   6.48%  reactor-1        thread_context_switch_test  [.] seastar::reactor::run_tasks
   5.27%  reactor-1        thread_context_switch_test  [.] seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>::wait
   4.49%  thread_context_  thread_context_switch_test  [.] seastar::basic_semaphore<seastar::semaphore_default_exception_factory, std::chrono::_V2::steady_clock>::wait
   4.43%  thread_context_  thread_context_switch_test  [.] seastar::reactor::add_task
   4.41%  reactor-1        thread_context_switch_test  [.] seastar::reactor::add_task
   4.13%  reactor-1        thread_context_switch_test  [.] seastar::noncopyable_function<void ()>::direct_vtable_for<context_switch_tester::_t1::{lambda()#1}>::call
   4.07%  reactor-1        thread_context_switch_test  [.] seastar::noncopyable_function<void ()>::direct_vtable_for<context_switch_tester::_t2::{lambda()#1}>::call
   3.98%  thread_context_  thread_context_switch_test  [.] seastar::noncopyable_function<void ()>::direct_vtable_for<context_switch_tester::_t1::{lambda()#1}>::call
   3.90%  thread_context_  thread_context_switch_test  [.] seastar::internal::future_base::do_wait
   3.86%  reactor-1        thread_context_switch_test  [.] seastar::internal::future_base::do_wait
   3.84%  thread_context_  thread_context_switch_test  [.] seastar::noncopyable_function<void ()>::direct_vtable_for<context_switch_tester::_t2::{lambda()#1}>::call
   3.22%  thread_context_  thread_context_switch_test  [.] seastar::(anonymous namespace)::thread_wake_task::run_and_dispose
   2.61%  reactor-1        thread_context_switch_test  [.] seastar::(anonymous namespace)::thread_wake_task::run_and_dispose
   1.93%  thread_context_  thread_context_switch_test  [.] seastar::memory::cpu_pages::allocate_small
   1.85%  reactor-1        libc-2.31.so                [.] __sigsetjmp
   1.74%  reactor-1        thread_context_switch_test  [.] seastar::memory::cpu_pages::allocate_small
   1.70%  thread_context_  libc-2.31.so                [.] __sigsetjmp
   1.68%  thread_context_  libpthread-2.31.so          [.] __GI___pthread_cleanup_upto
   1.63%  reactor-1        libpthread-2.31.so          [.] __GI___pthread_cleanup_upto
   1.45%  thread_context_  libc-2.31.so                [.] __libc_siglongjmp
   1.42%  reactor-1        libc-2.31.so                [.] __libc_siglongjmp
   1.38%  reactor-1        libc-2.31.so                [.] __longjmp
   1.35%  thread_context_  libc-2.31.so                [.] __longjmp
   1.28%  thread_context_  thread_context_switch_test  [.] seastar::internal::promise_base::clear
   1.26%  thread_context_  thread_context_switch_test  [.] seastar::internal::promise_base::promise_base
   1.26%  reactor-1        thread_context_switch_test  [.] seastar::internal::promise_base::clear
   1.24%  reactor-1        thread_context_switch_test  [.] seastar::internal::promise_base::promise_base
   1.04%  reactor-1        thread_context_switch_test  [.] seastar::memory::cpu_pages::free
   1.04%  reactor-1        thread_context_switch_test  [.] seastar::jmp_buf_link::switch_in
   1.03%  thread_context_  thread_context_switch_test  [.] seastar::memory::cpu_pages::free
   0.95%  thread_context_  thread_context_switch_test  [.] seastar::jmp_buf_link::switch_out
   0.92%  thread_context_  thread_context_switch_test  [.] seastar::jmp_buf_link::switch_in
   0.87%  reactor-1        libc-2.31.so                [.] _longjmp_unwind
   0.80%  thread_context_  libc-2.31.so                [.] _longjmp_unwind
   0.72%  reactor-1        thread_context_switch_test  [.] seastar::jmp_buf_link::switch_out
   0.63%  thread_context_  thread_context_switch_test  [.] operator new
   0.59%  thread_context_  libc-2.31.so                [.] __sigjmp_save
   0.57%  thread_context_  thread_context_switch_test  [.] seastar::memory::free
   0.55%  reactor-1        thread_context_switch_test  [.] operator new
   0.55%  reactor-1        thread_context_switch_test  [.] seastar::memory::free
   0.54%  reactor-1        libc-2.31.so                [.] __sigjmp_save



Admittedly, the test is doing needless work: there is memory allocation
that shouldn't be there. Still, it's clear that setjmp/longjmp do not
dominate.



> Would there still be a reason to migrate to boost.context for seastar::threads now that C++20 coroutines are coming?


Threads might still be faster when there are frequent
potentially-blocking operations that rarely block. For example, writing
small buffers to an output_stream backed by a file. Due to write-behind,
the output_stream rarely needs to block.
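
In seastar::thread terms, the pattern is something like this (a sketch
with illustrative names; it assumes seastar::async and an already-opened
output_stream):

#include <seastar/core/thread.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>
#include <vector>

// Sketch: write() usually completes immediately thanks to write-behind,
// so get() rarely has to suspend the thread and switch away.
seastar::future<> write_all(seastar::output_stream<char>& out,
                            std::vector<seastar::temporary_buffer<char>> chunks) {
    return seastar::async([&out, chunks = std::move(chunks)] () mutable {
        for (auto& chunk : chunks) {
            out.write(std::move(chunk)).get(); // rarely blocks
        }
        out.flush().get();
    });
}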


Threads may be able to keep more state in registers in this case. Or the
compiler may be able to optimize coroutines in an equivalent way. We
still need to check.


>
> (According to my understanding, I should still use seastar::threads for implementing (non-tail) recursive algorithms, as C++20 coroutines do not support nested yielding/awaiting and non-tail recursion. Is my understanding correct?)


Well, you shouldn't be recursing on the default 128k stack, but yes.
Threads and coroutines are more-or-less equivalent, with coroutines
requiring less memory.


> And what would be, approximately, the impact of the cache pollution of stackful coroutines on the running-time overhead? Wouldn't that typically dominate the context-switch time itself? (In that case, a migration to boost.context would not pay off from an overall perspective.)


It's very hard to quantify these things. I can hand-wave all day about
it, but in the end it's hard to provide hard measurements.


For our application, we only use threads when concurrency is limited, so
most of the change will be from continuations to coroutines, and the
goal is to make the code simpler, not speed it up (though we expect some
speedup too).


> Kind regards,
> Niek
>