Hello all,
I have attached a patch proposing an optional (:win32 + :sb-thread)-only feature called :foreign-callback-fiber which leverages the Windows fibers API to reduce the overhead of foreign callbacks. Please note that this is not an attempt at adding green threads/fibers to SBCL.
This feature was inspired by this blog post: https://www.corsix.org/content/callbacks-luajit-ffi.
The patch itself consists of 100+ lines of changes to the internals, along with some minor changes to the foreign callback tests. Also included is the code required to run the benchmarks that I cite below. I have verified that the changes pass the test suite on my Windows 10 machine, EXCEPT for sb-mpfr.impure.lisp, which also fails for me when the feature is disabled and on 850d71b7d2b5.
Recall that to permit concurrent foreign callback invocations, the runtime currently creates a new Lisp thread for every single foreign callback invocation. My benchmarking indicates that this costs on the order of 10,000x the overhead of a regular C call (this number is calculated by dividing ([foreign callback elapsed] minus [inline elapsed]) by ([regular C call elapsed] minus [inline elapsed])):
control (inline, no call)
elapsed time: 0.119883 seconds
regular C call (C -> C)
elapsed time: 0.124107 seconds
alien callback (C on Lisp thread -> Lisp)
elapsed time: 0.317698 seconds
foreign callback (C on foreign thread -> Lisp)
elapsed time: 393.213626 seconds
When :foreign-callback-fiber is enabled, the overhead drops to on the order of 100x:
control (inline, no call)
elapsed time: 0.117288 seconds
regular C call (C -> C)
elapsed time: 0.125753 seconds
alien callback (C on Lisp thread -> Lisp)
elapsed time: 0.319400 seconds
foreign callback (C on foreign thread -> Lisp)
elapsed time: 1.043674 seconds
Note that this is a synthetic benchmark. The user-facing performance impact of enabling this feature in a real system will depend on how frequently foreign callbacks are actually invoked.
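For anyone who wants to redo the arithmetic, here is a minimal C sketch of the ratio described above, applied to the elapsed times from the two runs. The function names are hypothetical and are not part of the patch or of the benchmark code.

```c
#include <assert.h>
#include <stdio.h>

/* Overhead of one kind of call relative to a regular C call, using the
   formula from the message:
   (t_call - t_inline) / (t_c_call - t_inline).
   All arguments are elapsed times in seconds. */
double relative_overhead(double t_call, double t_c_call, double t_inline)
{
    assert(t_c_call > t_inline); /* the baseline overhead must be positive */
    return (t_call - t_inline) / (t_c_call - t_inline);
}

/* Prints the foreign callback ratio for both runs quoted above. */
void print_overheads(void)
{
    /* Run with :foreign-callback-fiber disabled. */
    printf("disabled: %.0fx\n",
           relative_overhead(393.213626, 0.124107, 0.119883));
    /* Run with :foreign-callback-fiber enabled. */
    printf("enabled:  %.0fx\n",
           relative_overhead(1.043674, 0.125753, 0.117288));
}
```

Calling print_overheads() from a main function prints the two ratios. Because the denominator is only a few milliseconds, small timing noise in the control and regular C call runs moves the result noticeably.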
The :foreign-callback-fiber feature works by automatically partitioning each foreign thread into two fibers upon the first call into Lisp. The first fiber is the foreign fiber, which executes the calling C code as usual. The second fiber is the callback fiber, which executes all foreign callbacks via a trampoline loop running in a Lisp thread that gets created once when the fiber is created. An immediate consequence of this configuration is that C and Lisp run on two separate stacks.
After the fibers are created and the trampoline loop is started on the Lisp thread, a call from C into Lisp proceeds as follows:
Amortizing the cost of creating and deleting the Lisp thread across many calls reduces the steady-state overhead of a foreign callback to that of a fiber switch, which costs fewer than 100 instructions and happens entirely in user space.
The speedup gained by enabling this feature comes with some tradeoffs. First, the calling C code has to run inside of a fiber. Second, Lisp and C have to run on separate stacks. This makes debugging more difficult, but it does have the benefit of allowing Lisp to return to C at any point without unwinding the Lisp stack.
While this patch is Windows-only, in principle this feature could be implemented on any platform with a stackful coroutines implementation like that provided by the Windows fibers API.
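As a rough illustration of the portability point, the trampoline idea can be sketched with POSIX ucontext standing in for Windows fibers. This is not the patch's code: it is single-threaded, it uses globals where a real runtime would keep per-thread state, the callback is an ordinary C function rather than Lisp, and every name in it (call_via_trampoline, add_one, and so on) is hypothetical. Note also that swapcontext on Linux typically saves the signal mask via a system call, so it is not as cheap as a fiber switch.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

typedef int (*callback_fn)(int);

static ucontext_t foreign_ctx;   /* stands in for the "foreign fiber" */
static ucontext_t callback_ctx;  /* stands in for the "callback fiber" */
static char callback_stack[64 * 1024]; /* separate stack for callbacks */
static int initialized;

/* Mailbox used to hand one call across the context switch. */
static callback_fn pending_fn;
static int pending_arg;
static int pending_result;

/* Trampoline loop: runs on callback_stack and never returns. Each
   iteration executes one pending callback, then switches back. */
static void trampoline(void)
{
    for (;;) {
        pending_result = pending_fn(pending_arg);
        swapcontext(&callback_ctx, &foreign_ctx);
    }
}

/* Called from "foreign" code. The callback context is created once,
   on the first call; later calls only pay for two context switches. */
int call_via_trampoline(callback_fn fn, int arg)
{
    if (!initialized) {
        int rc = getcontext(&callback_ctx);
        assert(rc == 0);
        callback_ctx.uc_stack.ss_sp = callback_stack;
        callback_ctx.uc_stack.ss_size = sizeof callback_stack;
        callback_ctx.uc_link = NULL; /* trampoline() never returns */
        makecontext(&callback_ctx, trampoline, 0);
        initialized = 1;
    }
    pending_fn = fn;
    pending_arg = arg;
    swapcontext(&foreign_ctx, &callback_ctx); /* resume the trampoline */
    return pending_result;
}

/* Hypothetical callback used only for demonstration. */
int add_one(int x)
{
    return x + 1;
}
```

The first call to call_via_trampoline(add_one, 41) sets up the second stack and enters the loop; subsequent calls resume the loop where it last switched away, which is the amortization described above.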
I would greatly appreciate any reviews, especially considering how niche this feature is. This is only my second real patch after 96f0f2612ee6 (“Remove non-pauseless-threadstart code”), so I would also sincerely appreciate any feedback or suggestions on how to improve the quality of this contribution.
Sincerely,
Kartik
_______________________________________________
Sbcl-devel mailing list
Sbcl-...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/sbcl-devel
Thank you for the detailed reply. I will be working on a patch to implement this.
Kartik
From: Douglas Katzman <do...@google.com>
Date: Friday, May 26, 2023 at 7:40 AM
To: "Singh, Kartik S" <kss...@hrl.com>
Cc: "sbcl-...@lists.sourceforge.net" <sbcl-...@lists.sourceforge.net>
Subject: Re: [Sbcl-devel] [PATCH] An optional feature to reduce foreign callback overhead on Windows
Any ideas/preferences on how to emulate pthread_key_create() and the destructor function on Windows? It looks like SBCL used to have an emulation layer for this but it was removed: https://github.com/sbcl/sbcl/blob/daa6f0ce672d8dc60176ff885da18e44ee0355c6/src/runtime/pthreads_win32.c#L474
Ironically, the fibers API on Windows does have a way of registering a callback on fiber/thread deletion via the FlsAlloc function.
Kartik
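For reference, the POSIX behavior that would need emulating can be sketched as follows: a key created with pthread_key_create carries a destructor, and the destructor runs at thread exit for any thread that stored a non-NULL value under that key. The names below are hypothetical and this is not SBCL code. On Windows, the callback passed to FlsAlloc plays a similar role when a fiber is deleted or a thread exits.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_key_t key;
static int destructor_runs; /* read only after pthread_join */

/* Destructor: called at thread exit with the thread's stored value. */
static void on_thread_exit(void *value)
{
    (void)value;
    destructor_runs++;
}

static void *worker(void *arg)
{
    /* The destructor only fires for threads that stored a non-NULL value. */
    int rc = pthread_setspecific(key, arg);
    assert(rc == 0);
    return NULL;
}

/* Starts one thread, waits for it to exit, and returns how many times
   the destructor ran. */
int run_destructor_demo(void)
{
    pthread_t t;
    int token = 1; /* any non-NULL pointer works as the stored value */
    int rc;

    destructor_runs = 0;
    rc = pthread_key_create(&key, on_thread_exit);
    assert(rc == 0);
    rc = pthread_create(&t, NULL, worker, &token);
    assert(rc == 0);
    rc = pthread_join(t, NULL); /* destructor has finished by now */
    assert(rc == 0);
    pthread_key_delete(key);
    return destructor_runs;
}
```

Depending on the toolchain, this may need to be built with -pthread.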