[Sbcl-devel] [PATCH] An optional feature to reduce foreign callback overhead on Windows


Singh, Kartik S

May 26, 2023, 12:26:44 AM
to sbcl-...@lists.sourceforge.net

Hello all,

 

I have attached a patch proposing an optional (:win32 + :sb-thread)-only feature called :foreign-callback-fiber which leverages the Windows fibers API to reduce the overhead of foreign callbacks. Please note that this is not an attempt at adding green threads/fibers to SBCL.

 

This feature was inspired by this blog post: https://www.corsix.org/content/callbacks-luajit-ffi.

 

The patch itself consists of 100+ lines of changes to internals, along with some minor changes to the foreign callback tests. Also included is the code required to run the benchmarks that I cite below. I have verified that the changes pass the test suite on my Windows 10 machine, except for sb-mpfr.impure.lisp, which also fails for me with the feature disabled and on commit 850d71b7d2b5.

 

Recall that to permit concurrent foreign callback invocations, the runtime currently creates a new Lisp thread for every single foreign callback invocation. My benchmarking indicates that this costs on the order of 10,000x the overhead of a regular C call. This number is calculated by dividing ([foreign callback elapsed] minus [inline elapsed]) by ([regular C call elapsed] minus [inline elapsed]):

 

control (inline, no call)
elapsed time: 0.119883 seconds

regular C call (C -> C)
elapsed time: 0.124107 seconds

alien callback (C on Lisp thread -> Lisp)
elapsed time: 0.317698 seconds

foreign callback (C on foreign thread -> Lisp)
elapsed time: 393.213626 seconds

 

When :foreign-callback-fiber is enabled, the overhead drops to on the order of 100x:

 

control (inline, no call)
elapsed time: 0.117288 seconds
 
regular C call (C -> C)
elapsed time: 0.125753 seconds
 
alien callback (C on Lisp thread -> Lisp)
elapsed time: 0.319400 seconds
 
foreign callback (C on foreign thread -> Lisp)
elapsed time: 1.043674 seconds

 

Note that this is a synthetic benchmark. The user-facing performance impact of enabling this feature in a real system will depend on how frequently foreign callbacks are actually invoked.
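
For anyone who wants to check the ratios, the calculation described earlier can be reproduced from the elapsed times quoted in this message. The following is only a small C sketch; the function names are made up for illustration and are not part of the patch. With the figures above it gives roughly 93,000x without the feature and roughly 109x with it.

```c
#include <assert.h>
#include <stdio.h>

/* Overhead of one kind of call relative to a regular C call, after
   subtracting the inline control time from both measurements.
   (Hypothetical helper, not taken from the patch.) */
static double relative_overhead(double elapsed, double c_call, double control)
{
    assert(c_call != control);  /* avoid dividing by zero */
    return (elapsed - control) / (c_call - control);
}

/* Prints both ratios using the elapsed times quoted in this message. */
static void print_overheads(void)
{
    printf("foreign callback, feature disabled: %.0fx\n",
           relative_overhead(393.213626, 0.124107, 0.119883));
    printf("foreign callback, feature enabled:  %.0fx\n",
           relative_overhead(1.043674, 0.125753, 0.117288));
}
```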

 

The :foreign-callback-fiber feature works by automatically partitioning each foreign thread into two fibers upon the first call into Lisp. The first fiber is the foreign fiber, which executes the calling C code as usual. The second fiber is the callback fiber, which executes all foreign callbacks via a trampoline loop running in a Lisp thread that gets created once when the fiber is created. An immediate consequence of this configuration is that C and Lisp run on two separate stacks.

 

After the fibers are created and the trampoline loop is started on the Lisp thread, a call from C into Lisp proceeds as follows:

 

  1. callback_wrapper_trampoline stores the foreign callback’s index, arguments pointer, and return pointer in slots in the Lisp thread struct associated with the running OS thread.
  2. The foreign fiber simultaneously suspends itself and switches to the callback fiber via SwitchToFiber.
  3. The trampoline loop resumes and calls enter-alien-callback with the index, arguments pointer, and return pointer that were passed through the Lisp thread struct.
  4. After completing the foreign callback, the trampoline loop simultaneously suspends the callback fiber and switches back to the foreign fiber.
  5. The foreign fiber returns from the original C call with the provided return value.
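
The five steps above can be sketched in portable C using POSIX ucontext in place of Windows fibers, with swapcontext playing the role of SwitchToFiber. This is an illustration of the control flow only, not the code in the patch: the slots struct, enter_callback, and call_into_lisp are hypothetical stand-ins for the Lisp thread struct slots, enter-alien-callback, and callback_wrapper_trampoline, and the "callback" here simply doubles an integer.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

/* Hypothetical stand-in for the slots in the Lisp thread struct. */
struct callback_slots {
    int   index;    /* which callback to run */
    void *args;     /* pointer to the arguments */
    void *result;   /* where to store the return value */
};

static struct callback_slots slots;
static ucontext_t foreign_ctx;    /* plays the foreign fiber */
static ucontext_t callback_ctx;   /* plays the callback fiber */
static char callback_stack[64 * 1024];  /* separate stack for the loop */

/* Stand-in for enter-alien-callback: doubles an int argument. */
static void enter_callback(int index, void *args, void *result)
{
    (void)index;
    *(int *)result = 2 * *(int *)args;
}

/* Steps 3 and 4: runs forever on the callback stack. */
static void trampoline_loop(void)
{
    for (;;) {
        enter_callback(slots.index, slots.args, slots.result);
        swapcontext(&callback_ctx, &foreign_ctx);  /* back to the caller */
    }
}

/* One-time setup, done on the first call from C. */
static void init_callback_context(void)
{
    int rc = getcontext(&callback_ctx);
    assert(rc == 0);
    (void)rc;
    callback_ctx.uc_stack.ss_sp = callback_stack;
    callback_ctx.uc_stack.ss_size = sizeof callback_stack;
    callback_ctx.uc_link = NULL;  /* the loop never returns */
    makecontext(&callback_ctx, trampoline_loop, 0);
}

/* Steps 1, 2 and 5: what the C side of a callback does. */
static int call_into_lisp(int index, int arg)
{
    static int initialized;
    int result = 0;
    if (!initialized) {
        init_callback_context();
        initialized = 1;
    }
    slots.index = index;      /* step 1: pass everything via the slots */
    slots.args = &arg;
    slots.result = &result;
    swapcontext(&foreign_ctx, &callback_ctx);  /* step 2: suspend and switch */
    return result;            /* step 5: resumed by the loop's swapcontext */
}
```

The setup cost is paid once; every later call_into_lisp is just two context switches around the callback body, which is the amortization the feature relies on. This sketch keeps its state in globals and therefore handles one foreign thread only.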

 

Amortizing the cost of creating and deleting the Lisp thread across many calls reduces the steady-state overhead of a foreign callback to that of a fiber switch, which costs fewer than 100 instructions and happens entirely in user space.

 

The speedup gained by enabling this feature comes with some tradeoffs. First, the calling C code has to run inside of a fiber. Second, Lisp and C have to run on separate stacks. This makes debugging more difficult, but it does have the benefit of allowing Lisp to return to C at any point without unwinding the Lisp stack.

 

While this patch is Windows-only, in principle this feature could be implemented on any platform with a stackful coroutines implementation like that provided by the Windows fibers API.

 

I would greatly appreciate any reviews, especially given how niche this feature is. This is only my second real patch after 96f0f2612ee6 (“Remove non-pauseless-threadstart code”), so I would also sincerely appreciate any feedback or suggestions on how to improve the quality of this contribution.

 

Sincerely,

Kartik

0001-Add-the-foreign-callback-fiber-feature-on-Windows.patch

Douglas Katzman via Sbcl-devel

May 26, 2023, 10:40:44 AM
to Singh, Kartik S, sbcl-...@lists.sourceforge.net
#+win32 used to have a different way of avoiding the overhead of alloc_thread_struct per foreign call-in. It permanently associated the native thread to a 'struct thread' (and the 8MiB memory range) upon the first occurrence of a call into lisp from the native thread. That mechanism created yet another thread to wait for disposal of the windows handle to the native thread, upon which it would deallocate SBCL's data. I removed it, I can't remember when.

But there is a completely conventional way to do exactly this sort of thing not relying on fibers or waiting on handles: it's just pthread_key_create() with a destructor function. 
Using pthread_getspecific, call-into-lisp can check whether the C thread ever had a 'struct thread'. When the native thread goes away, it will clean up that structure. We can still redundantly store the 'struct thread' in native TLS for even more efficiency, while relying on the pthread destructor. Since this should be portable across windows and posix, I would prefer it to something win32-specific. That said, we might want to allow choosing between the current way ("eager struct thread cleanup") and proposed way ("lazy struct thread cleanup").
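
A minimal sketch of the "lazy cleanup" pattern described above, assuming nothing about SBCL's actual runtime code: struct lisp_thread, get_lisp_thread, and the two counters are hypothetical names used only for illustration. The first call from a foreign thread allocates the structure and stores it under a pthread key; later calls reuse it; the key's destructor frees it when the thread exits.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for SBCL's 'struct thread'. */
struct lisp_thread { int callbacks_run; };

static pthread_key_t thread_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

/* Counters for illustration only; they are not thread-safe. */
static int structs_created, structs_destroyed;

/* Called by the pthread library when a thread holding a non-NULL
   value for thread_key exits. */
static void destroy_lisp_thread(void *p)
{
    free(p);
    structs_destroyed++;
}

static void make_key(void)
{
    int rc = pthread_key_create(&thread_key, destroy_lisp_thread);
    assert(rc == 0);
    (void)rc;
}

/* Would run at the start of every call into Lisp from a foreign thread. */
static struct lisp_thread *get_lisp_thread(void)
{
    pthread_once(&key_once, make_key);
    struct lisp_thread *th = pthread_getspecific(thread_key);
    if (th == NULL) {                    /* first call on this thread */
        th = calloc(1, sizeof *th);
        assert(th != NULL);
        pthread_setspecific(thread_key, th);
        structs_created++;
    }
    return th;
}

/* A foreign thread that calls into "Lisp" three times. */
static void *foreign_thread(void *unused)
{
    (void)unused;
    for (int i = 0; i < 3; i++)
        get_lisp_thread()->callbacks_run++;
    return NULL;
}
```

Running foreign_thread via pthread_create and joining it should leave exactly one structure created and one destroyed, however many calls the thread made.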



Singh, Kartik S

May 26, 2023, 12:50:04 PM
to Douglas Katzman, sbcl-...@lists.sourceforge.net

Thank you for the detailed reply. I will be working on a patch to implement this.

 

Kartik

 



 

Singh, Kartik S

Jun 1, 2023, 1:51:33 PM
to Douglas Katzman, sbcl-...@lists.sourceforge.net

Any ideas/preferences on how to emulate pthread_key_create() and the destructor function on Windows? It looks like SBCL used to have an emulation layer for this but it was removed: https://github.com/sbcl/sbcl/blob/daa6f0ce672d8dc60176ff885da18e44ee0355c6/src/runtime/pthreads_win32.c#L474

 

Ironically, the fibers API on Windows does have a way of registering a callback on fiber/thread deletion via the FlsAlloc function.

 

Kartik

Douglas Katzman via Sbcl-devel

Jun 1, 2023, 4:52:43 PM
to Singh, Kartik S, sbcl-...@lists.sourceforge.net
I just checked whether the pthread library in msys2 includes pthread_key_create, and it does. It also calls the destructor on thread termination. So just include <pthread.h> for both posix and windows and I think you should be good to go.
