racket port to hybrid runtime model

Kyle Hale

Dec 16, 2015, 3:15:50 PM
to Racket Developers
Hi,

I'm currently working on porting the racket runtime to what we call the Hybrid Runtime (HRT) model, and I'm running into some issues I thought you guys might be able to help out with. 

The key feature of a hybrid runtime is that it runs entirely in kernel mode on top of a specialized OS kernel layer. Briefly, this gives you a minimal OS environment that provides exactly the machine abstractions the runtime requires, along with full access to hardware features that are typically available only to the kernel. If you're interested you can find out more at http://nautilus.halek.co

We're currently automating the creation of these HRTs: you give us your runtime source, and we run it through our toolchain, which adds the runtime code needed to boot the HRT on a separate set of cores. Think of the HRT as a software accelerator.

If the runtime relies on a legacy OS (e.g. POSIX system calls), we can forward those calls to another core running that OS. These cores actually share a virtual address space, through mechanisms I won't go into here, which lets us leverage existing ELF loading, memory mapping, and other facilities in, e.g., Linux.

Our current setup builds racket on Linux as a statically compiled library that uses the SenoraGC and is configured to run within a pthread. Once this thread starts, we migrate racket to another core, where it runs in kernel mode in a separate OS environment.

I can give more context if necessary, but briefly, the issue I'm having is a strange one with memory corruption in the GC's memory management code, and I suspect the cause might be coming from an interplay with signals. When racket is running in the specialized OS kernel, we forward any page fault that occurs over to a user-space thread on a Linux core, which recreates the memory reference, forcing Linux to handle it for us. 
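To make the page-fault forwarding concrete, here is a minimal sketch of the Linux-side half, assuming both sides share one address space (modeled here as two pthreads in a single process rather than two cores). The mailbox, `forward_fault()`, and the helper thread are hypothetical names for illustration, not part of the HRT toolchain or of Racket. The helper simply re-issues a one-byte read of the forwarded address, so if that page is not mapped in yet, the fault is taken, and resolved by Linux, in the helper's context.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox between the "kernel-mode" side, which posts
   a faulting address, and a helper thread on the "Linux" side. */
struct fault_mailbox {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    volatile unsigned char *addr; /* pending faulting address, or NULL */
    unsigned char value;          /* byte the helper read back */
    bool done;                    /* helper has re-issued the access */
    bool shutdown;                /* ask the helper to exit */
};

static struct fault_mailbox mbox = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .cond = PTHREAD_COND_INITIALIZER,
};
static pthread_t helper;

/* Linux-side helper: recreate each forwarded memory reference. If the page
   is not mapped in yet, the fault happens here and Linux resolves it. */
static void *fault_helper(void *arg) {
    (void)arg;
    pthread_mutex_lock(&mbox.lock);
    for (;;) {
        while (mbox.addr == NULL && !mbox.shutdown)
            pthread_cond_wait(&mbox.cond, &mbox.lock);
        if (mbox.addr == NULL)
            break; /* shutdown requested and nothing pending */
        mbox.value = *mbox.addr; /* the re-issued access */
        mbox.addr = NULL;
        mbox.done = true;
        pthread_cond_broadcast(&mbox.cond);
    }
    pthread_mutex_unlock(&mbox.lock);
    return NULL;
}

static int start_fault_helper(void) {
    return pthread_create(&helper, NULL, fault_helper, NULL);
}

/* "Kernel-mode" side: forward a faulting address and block until the
   helper has touched it. Assumes one outstanding request at a time. */
static unsigned char forward_fault(void *addr) {
    pthread_mutex_lock(&mbox.lock);
    assert(mbox.addr == NULL);
    mbox.done = false;
    mbox.addr = addr;
    pthread_cond_broadcast(&mbox.cond);
    while (!mbox.done)
        pthread_cond_wait(&mbox.cond, &mbox.lock);
    unsigned char v = mbox.value;
    pthread_mutex_unlock(&mbox.lock);
    return v;
}

static void stop_fault_helper(void) {
    pthread_mutex_lock(&mbox.lock);
    mbox.shutdown = true;
    pthread_cond_broadcast(&mbox.cond);
    pthread_mutex_unlock(&mbox.lock);
    pthread_join(helper, NULL);
}
```

Call `start_fault_helper()` once, then `forward_fault(addr)` for each fault. Note that any SIGSEGV raised by the re-issued access is delivered to the helper thread, on the helper's stack, which is exactly the situation the question below is about.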

The issue with this is that if racket is *expecting* references to invalid pages, e.g. for some kind of lazy allocation, the signal handler (for SIGSEGV) is going to run in a different thread, on a different core, and with a different stack. It seems like that could cause some issues, especially if you're using something like setjmp() and longjmp() in the handlers. I couldn't quite tell from digging through the GC code if this is actually going on though.  But if the runtime is doing something clever with its thread stacks, I will certainly run into issues later on.
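For reference, the pattern in question is the classic protection-based fault handler, sketched below under the assumption of Linux and a hypothetical read-only heap region (`region`, `install_barrier()`, and `segv_handler()` are illustrative names, not Racket's GC code). The handler unprotects the faulting page and simply returns, so the kernel restarts the faulting write; no setjmp()/longjmp() is involved, and nothing depends on which stack the handler runs on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical heap region, mapped read-only so that the first write to each
   page traps (a write barrier). volatile keeps the compiler from reordering
   or merging writes to it around the fault counter. */
static volatile unsigned char *region;
static size_t region_size;
static size_t page_size;
static volatile sig_atomic_t faults_handled;

static void segv_handler(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t lo = (uintptr_t)region;
    if (addr < lo || addr >= lo + region_size) {
        /* Not an expected fault: restore the default action and return, so
           the faulting access traps again and terminates the process. */
        signal(SIGSEGV, SIG_DFL);
        return;
    }
    /* Unprotect just the faulting page and return; the kernel restarts the
       faulting instruction. mprotect() is not on POSIX's async-signal-safe
       list, but this is the conventional technique on Linux. */
    uintptr_t page = addr & ~(uintptr_t)(page_size - 1);
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
    faults_handled++;
}

static int install_barrier(size_t npages) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    assert((page_size & (page_size - 1)) == 0); /* masking needs a power of two */
    region_size = npages * page_size;
    void *p = mmap(NULL, region_size, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    region = p;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

After `install_barrier(4)`, the first write to each page costs one fault and later writes to the same page run at full speed, so `faults_handled` counts dirtied pages.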

So my question is this: is racket expecting to catch SIGSEGV to fix up memory regions or something similar, or is it going to be doing any clever magic with something like setjmp() and longjmp()?


Thanks!
Kyle

Matthew Flatt

Dec 16, 2015, 3:37:33 PM
to Kyle Hale, Racket Developers
At Wed, 16 Dec 2015 12:15:50 -0800 (PST), Kyle Hale wrote:
> I can give more context if necessary, but briefly, the issue I'm having is
> a strange one with memory corruption in the GC's memory management code,
> and I suspect the cause might be coming from an interplay with signals.
> When racket is running in the specialized OS kernel, we forward any page
> fault that occurs over to a user-space thread on a Linux core, which
> recreates the memory reference, forcing Linux to handle it for us.

My guess is that you're seeing an issue with thread-local variables, as
opposed to the stack. Does disabling places and futures (with
`configure --disable-places --disable-futures`) change anything?


> The issue with this is that if racket is *expecting* references to invalid
> pages, e.g. for some kind of lazy allocation, the signal handler (for
> SIGSEGV) is going to run in a different thread, on a different core, and
> with a different stack. [...]

That should be fine.

> So my question is this: is racket expecting to catch SIGSEGV to fix up
> memory regions or something similar, or is it going to be doing any clever
> magic with something like setjmp() and longjmp()?

No. In fact, we use sigaltstack() on Linux, and the write barrier works
at the level of Mach messages on Mac OS X --- so it's a common mode for
signals to be handled on a different stack than the one for the
faulting thread.
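A minimal sketch of what sigaltstack() means in practice, assuming Linux/glibc: with `SA_ONSTACK`, the SIGSEGV handler runs on a separately allocated stack rather than on the faulting thread's stack. The names are hypothetical, and `raise()` stands in for a real fault only to keep the example deterministic; a genuine fault is delivered on the alternate stack in the same way.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static char *alt_stack_base;
static size_t alt_stack_len;
static volatile sig_atomic_t handler_on_alt_stack = -1; /* -1: never ran */

static void segv_on_alt_stack(int sig) {
    (void)sig;
    char marker; /* allocated on whichever stack the handler is running on */
    uintptr_t here = (uintptr_t)&marker;
    uintptr_t lo = (uintptr_t)alt_stack_base;
    handler_on_alt_stack = (here >= lo && here < lo + alt_stack_len);
}

static int install_alt_stack_handler(void) {
    assert(alt_stack_base == NULL); /* called once in this sketch */
    alt_stack_len = 64 * 1024;      /* generous; SIGSTKSZ may be smaller */
    alt_stack_base = malloc(alt_stack_len);
    if (alt_stack_base == NULL)
        return -1;

    stack_t ss;
    memset(&ss, 0, sizeof ss);
    ss.ss_sp = alt_stack_base;
    ss.ss_size = alt_stack_len;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = segv_on_alt_stack;
    sa.sa_flags = SA_ONSTACK; /* deliver SIGSEGV on the alternate stack */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}
```

Since the handler never assumes it shares a stack with the faulting code, moving it to yet another thread's stack (as in the forwarding setup) does not break that assumption.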

In the case of Mac OS X, the garbage collector must specifically
arrange for the message-handling thread to see the thread-local
variables of the main thread. My guess is that you'll need to do
something similar, but thread-local variables are used only when places
or futures are enabled, so that's why I asked about disabling them.
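The thread-local problem can be illustrated with C11 `_Thread_local`: a handler running in another thread that names a thread-local variable gets its own copy, not the main thread's. One way around this, sketched below with hypothetical names (`gc_state`, `publish_main_gc_state()`), is for the main thread to publish the address of its thread-local state so the handler thread can reach it through that pointer. This illustrates the general idea only; it is not how Racket's GC actually arranges it on Mac OS X.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-thread GC bookkeeping, standing in for the thread-local
   variables used when places or futures are enabled. */
struct gc_state {
    long pages_unprotected;
};

static _Thread_local struct gc_state tls_gc;    /* one copy per thread */
static struct gc_state *volatile main_gc_state; /* main thread's copy */

/* Called by the main (mutator) thread before any handler thread runs. */
static void publish_main_gc_state(void) {
    main_gc_state = &tls_gc;
}

/* Runs in a different thread, like a Mach exception-message handler or a
   forwarded fault handler. Naming tls_gc here reaches this thread's own
   copy, so the main thread's state is reached through the published
   pointer instead. */
static void *handler_thread(void *arg) {
    (void)arg;
    assert(main_gc_state != NULL);
    tls_gc.pages_unprotected = -1;         /* touches only this thread's copy */
    main_gc_state->pages_unprotected += 1; /* updates the main thread's copy */
    return NULL;
}

/* Runs one handler thread to completion; returns the main thread's count. */
static long run_handler_once(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, handler_thread, NULL) != 0)
        return -1;
    pthread_join(t, NULL);
    return tls_gc.pages_unprotected;
}
```

From the main thread, call `publish_main_gc_state()` first; each `run_handler_once()` then bumps the main thread's counter, while the handler's write to its own `tls_gc` copy is invisible to the main thread.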
