[Sbcl-devel] Signal handling trickery


Nikodemus Siivola

Mar 22, 2007, 8:30:16 PM
to sbcl-devel
I have a proof of concept tree where GC happens outside the
SIG_MEMORY_FAULT handler (working on x86-64 Linux), which is at least
good enough to pass all tests...

Here's what happens:

1. The signal handler calls
   arrange_return_to_signal_tramp(sigsegv_return_handler, signo, info, context);

2. This is like arrange_return_to_lisp_function, but instead of going to
Lisp we end up returning from the signal handler to call the function
signal_tramp with (1) the real handler function, and (2 & 3) malloc'ed
copies of siginfo and context.

3. signal_tramp blocks the signal we are handling in the current thread,
saves copies of siginfo and context on the stack, and frees the malloc'ed
memory. It then calls the real handler with the signo, the stack-allocated
siginfo and context, and the original signal mask. If the real handler
returns, signal_tramp restores the original signal mask; if the real
handler unwinds, then it is its responsibility to reset the sigmask.

So, we have basically functional signal handlers outside the kernel, and
are no longer restricted by silly POSIX rules about which functions are
signal safe -- which we broke all the time. The only thing we cannot do
is directly frob the context and return to it.

Todo: The initial malloc'ed copies are just a quick hack: I am planning
to copy the siginfo and context directly to the stack, so that (1) we
don't need to worry about whether malloc is signal-safe, and (2) we don't
need to worry about leaking memory due to asynch unwinds.

Todo: This is also still missing a bit of interrupt protection: we still
have to worry about asynch unwinds that catch us after we have
established the new signal mask.

How does this sound? Am I forgetting something obvious? Does this
approach make anyone more uneasy than running whatnot inside signal
handlers?

Cheers,

-- Nikodemus



Cyrus Harmon

Mar 22, 2007, 8:57:52 PM
to Nikodemus Siivola, sbcl-devel

Nikodemus,

This sounds good. You may have looked at it already, but x86-darwin-os.c
contains a similar approach for Mach exception handling. One thing
we did there is re-trap when we return, which gives us the opportunity
to frob the signal context (equivalent) at that point if we want.

Cyrus

Gábor Melis

Mar 23, 2007, 4:50:15 AM
to sbcl-...@lists.sourceforge.net, Nikodemus Siivola
On Friday 23 March 2007 01:30, Nikodemus Siivola wrote:
> I have a proof of concept tree where GC happens outside the
> SIG_MEMORY_FAULT handler (working on x86-64 Linux), which is at least
> good enough to pass all tests...
>
> Here's what happens:
>
> 1. Signal handler calls
> arrange_return_to_signal_tramp(sigsegv_return_handler,
> signo, info, context);
>
> 2. This is like arrange_return_to_lisp_function, but instead of going
> to lisp we end up returning from the signal handler to call the
> function signal_tramp with (1) the real handler function, and (2 & 3)
> malloc'ed copies of siginfo and context.
>
> 3. signal_tramp blocks the signal we are handling in current thread,

Is there a window of time right before this when the signal is not
blocked?

> saves copies of siginfo and context on stack and frees the malloced
> memory. It then calls the real handler with the signo, stack
> allocated siginfo and context, and the original signal mask. IF the
> return handler returns, the original signal mask is restored by
> signal_tramp: IF the real handler unwinds, then it is its
> responsibility to reset the sigmask.
>
> So, we have basically functional signal handlers outside the kernel,
> and are no longer restricted by silly POSIX rules about which
> functions are signal safe -- which we broke all the time. The only
> thing we cannot do is directly frob the context and return to it.

'Async-signal-safe' may actually be a slightly misleading term. I
touched on this in "Some thread & signal safety issues". Functions that
are not async-signal-safe don't care whether they are reentered
directly from the signal handler or from a trampoline that the signal
handler arranged to be called at the very same point where the
interrupt happened.

I think this approach in general (i.e. applied to all signals, not
just GC) has the same issues as the original, which doesn't really fail
either with any frequency that approaches reproducibility.

Now, for GC the situation is somewhat different. We know that GC is
triggered synchronously by consing. Hence, we can be sure that no
async-signal-unsafe C code is running at that time in the thread where
the GC is triggered and where we are going to run the GC code. In my
world that means we are safe.

In short, our synchronous signals are nothing to worry about because the
restrictions do not apply to them, so as far as I know GC is fine as it
is, and true async signals (think SIGALRM, SIGINT) are not helped by
the trampoline.

I've found this page about MPS that seems to share this view:

http://www.ravenstream.com/project/mps/master/design/protli/

"
.threads.async: POSIX (and hence Linux) imposes some restrictions on
signal handler functions (see
design.mps.pthreadext.anal.signal.safety). Basically the rules say the
behaviour of almost all POSIX functions inside a signal handler
is undefined, except for a handful of functions which are known to be
"async-signal safe". However, if it's known that the signal didn't
happen inside a POSIX function, then it is safe to call arbitrary POSIX
functions inside a handler.

.threads.async.protection: If the signal handler is invoked because of
an MPS access, then we know the access must have been caused by client
code (because the client is not allowed to permit access to protectable
memory to arbitrary foreign code [need a reference for this]). In these
circumstances, it's OK to call arbitrary POSIX functions inside the
handler.

.threads.async.other: If the signal handler is invoked for some other
reason (i.e. one we are not prepared to handle) then there is less we
can say about what might have caused the SEGV. In general it is not
safe to call arbitrary POSIX functions inside the handler in this case.
"

Nikodemus Siivola

Mar 26, 2007, 5:52:55 AM
to Gábor Melis, sbcl-...@lists.sourceforge.net
There was some talk about this a few days back on #lisp. I'll recap
what I recall were the major points (which are probably mixed up with
my own conclusions):

* Gábor probably hits the nail on the head when he explains what the
POSIX signal safety requirement really means: our handlers for
semi-synchronous signals should be safe even if they don't follow
the letter of POSIX.

* Similarly, our handlers for asynch signals are not going to be safe
even with the kinds of tricks we play with
arrange_return_to_lisp_function.

* To make asynch signal handlers safe in multithreaded builds they need
to request handling from another thread, probably using realtime
semaphores (which are signal safe).

* To make asynch signal handlers safe in unithreaded builds we
apparently need to mask the asynch signals pretty much everywhere,
and listen for them only in safe points.

* Neither of these approaches will make async unwind issues go away.
Asynch unwinds will never be really safe.

* We can, however, gain a synchronous timeout ability by giving various
blocking functions not just a :TIMEOUT parameter, but also making
them respect a global *DEADLINE*. I hesitate to say anything
about the properties of such synchronous timeouts, though.

Corrections and comments hoped for,

Brian Mastenbrook

Mar 26, 2007, 8:05:24 AM
to Nikodemus Siivola, sbcl-...@lists.sourceforge.net, Gábor Melis
Nikodemus Siivola wrote:
> * We can, however, gain a synchronous timeout ability by making various
> blocking functions have not just a :TIMEOUT parameter, but by making
> them also respect a global *DEADLINE*. I hesitate to say anything
> about properties of such synchronous timeouts, though.

Hi Nikodemus,

What is your concern here about the properties of such timeouts?

Regardless of whether async unwinds can be made safe, I think what you
describe is good global policy. For applications which call a number of
blocking APIs but are unconcerned with entering an infinite loop in Lisp
code, this is all the timeout machinery which is necessary. It would
probably be a good idea to expose the API used here so that FFI users
can respect these timeouts as well. I've a few thoughts here if others
are interested.

Thanks,

--
Brian Mastenbrook
br...@mastenbrook.net
http://brian.mastenbrook.net/

Nikodemus Siivola

Mar 26, 2007, 9:13:17 AM
to Brian Mastenbrook, sbcl-...@lists.sourceforge.net, Gábor Melis
Brian Mastenbrook wrote:

> Nikodemus Siivola wrote:
>> * We can, however, gain a synchronous timeout ability by making various
>> blocking functions have not just a :TIMEOUT parameter, but by making
>> them also respect a global *DEADLINE*. I hesitate to say anything
>> about properties of such synchronous timeouts, though.

> What is your concern here about the properties of such timeouts?

Just my ability to get details of stuff like this wrong the first time.
I would like to say that they are well-behaved and can be unwound from
safely, but I have no proof either way right now.

> Regardless of whether async unwinds can be made safe, I think what you
> describe is good global policy. For applications which call a number of
> blocking APIs but are unconcerned with entering an infinite loop in Lisp
> code, this is all the timeout machinery which is necessary. It would
> probably be a good idea to expose the API used here so that FFI users
> can respect these timeouts as well. I've a few thoughts here if others
> are interested.

I am!

Cheers,

-- Nikodemus

Brian Mastenbrook

Mar 27, 2007, 8:42:10 AM
to Nikodemus Siivola, sbcl-...@lists.sourceforge.net, Gábor Melis
Nikodemus Siivola wrote:
> Brian Mastenbrook wrote:
>
>> Nikodemus Siivola wrote:
>>> * We can, however, gain a synchronous timeout ability by making various
>>> blocking functions have not just a :TIMEOUT parameter, but by making
>>> them also respect a global *DEADLINE*. I hesitate to say anything
>>> about properties of such synchronous timeouts, though.
>
>> What is your concern here about the properties of such timeouts?
>
> Just my ability to get details of stuff like this wrong the first time.
> I would like to say that they are well-behaved and can be unwound from
> safely, but I have no proof either way right now.

These synchronous timeouts should all be triggered after return into
Lisp, when the code calling the foreign function checks the return value
and determines that it returned due to an expired timeout. Thus unwinding
will be safe, as it will be triggered from Lisp code and not from a
signal handler running in the middle of an allocation (or any other
"bad" case).

However...

>> Regardless of whether async unwinds can be made safe, I think what you
>> describe is good global policy. For applications which call a number
>> of blocking APIs but are unconcerned with entering an infinite loop in
>> Lisp code, this is all the timeout machinery which is necessary. It
>> would probably be a good idea to expose the API used here so that FFI
>> users can respect these timeouts as well. I've a few thoughts here if
>> others are interested.
>
> I am!

... they don't necessarily have to unwind. TIMEOUT should be signaled as
a condition, and a restart made available to continue execution.
Consider an application working with SIP messaging over UDP: it must
explicitly retry certain transactions if no response is received, but in
the meantime it may be off trying to contact another host. In this case,
the response to the timeout should be to retry the message send and
return to whatever else it may be processing.
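
To make that concrete, here's a rough sketch of the shape I have in mind.
FOREIGN-RECV, +FOREIGN-TIMEOUT+, RESEND-REQUEST, and a NOW function
returning the current time in seconds are all hypothetical stand-ins, not
existing SBCL interfaces:

;;; TIMEOUT is signaled from ordinary Lisp code after the foreign call
;;; returns, with a RETRY restart so the caller can resend instead of
;;; unwinding.
(define-condition timeout (error)
  ((seconds :initarg :seconds :reader timeout-seconds)))

(defvar *deadline* nil)

(defun receive-with-deadline (socket)
  (loop
    (let* ((timeout (when *deadline*
                      (max 0 (- *deadline* (now)))))
           ;; The blocking FFI call takes a timeout and reports expiry
           ;; through its return value; no signal handler is involved.
           (result (foreign-recv socket timeout)))
      (if (eql result +foreign-timeout+)
          (with-simple-restart (retry "Resend and wait for a reply again.")
            (error 'timeout :seconds timeout))
          (return result)))))

;;; A SIP-style caller can handle TIMEOUT, resend, and carry on:
;;; (handler-bind ((timeout (lambda (c)
;;;                           (declare (ignore c))
;;;                           (resend-request socket)
;;;                           (invoke-restart 'retry))))
;;;   (receive-with-deadline socket))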

*DEADLINE* is probably too simple as well. Applications like the one I
mentioned above will need to have a set of timeouts active and trigger
different responses depending on which timeout is expiring. Also, the
timeout response will usually not need to be taken (as the remote host
will usually be there), so the application should be given some way of
canceling a timeout. Put this way, a deadline is less a single global
value than a value computed from the set of currently active timeouts.
This value can be recomputed when timeouts are added to and removed
from the set of active timeouts, so the code surrounding each blocking
foreign call will still be relatively cheap.

I think the interface which is needed here is:

* A function to return the current deadline based on the current
  timer queue,
* A function to trigger appropriate timeout actions when the deadline
  has passed,
* A function to register a timeout, which is given an instance of a
  condition class to be signaled when the timeout expires,
* A function to cancel a timeout, which is given the condition class
  instance to find in the queue and cancel.
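
A rough sketch of how those four functions might hang together, using
nothing fancier than a sorted list for the timer queue and a hypothetical
NOW returning the current time in seconds:

(defvar *timeouts* '())   ; sorted list of (expiry-time . condition) pairs

(defun current-deadline ()
  "Earliest expiry time in the queue, or NIL when no timeouts are active."
  (when *timeouts*
    (car (first *timeouts*))))

(defun register-timeout (seconds condition)
  "Queue CONDITION to be signaled SECONDS from now."
  (setf *timeouts* (merge 'list (list (cons (+ (now) seconds) condition))
                          *timeouts* #'< :key #'car))
  condition)

(defun cancel-timeout (condition)
  "Drop the queue entry whose condition is CONDITION."
  (setf *timeouts* (remove condition *timeouts* :key #'cdr)))

(defun signal-expired-timeouts ()
  "Signal every queued condition whose deadline has passed, oldest first."
  (loop while (and *timeouts* (<= (car (first *timeouts*)) (now)))
        do (let ((condition (cdr (pop *timeouts*))))
             (with-simple-restart (continue "Ignore this timeout and keep going.")
               (error condition)))))

Whether SIGNAL-EXPIRED-TIMEOUTS should keep draining the queue after one
handler unwinds is exactly the kind of rough edge I mention below.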

There are still a few rough edges here that need to be ironed out: for
instance, when a blocking call returns successfully but the deadline has
passed, do we still invoke timeout handlers? If a timeout handler
chooses to unwind, but other timeouts active in the queue are ready to
expire, when do they fire?

It would probably be interesting to have SB-EVAL check the timeout
periodically as well.


James Y Knight

Mar 27, 2007, 2:19:26 PM
to Brian Mastenbrook, sbcl-...@lists.sourceforge.net, Gábor Melis

On Mar 27, 2007, at 8:42 AM, Brian Mastenbrook wrote:
> I think the interface which is needed here is:
>
> * A function to return the current deadline based on the current
>   timer queue,
> * A function to trigger appropriate timeout actions when the deadline
>   has passed,
> * A function to register a timeout, which is given an instance of a
>   condition class to be signaled when the timeout expires,
> * A function to cancel a timeout, which is given the condition class
>   instance to find in the queue and cancel.
>

Uggg. This sounds really overdesigned to me. What I'd like is simple:
timer support in the serve-event loop. I think that should cleanly
cover your UDP SIP server case as well.

Other than that, a timeout argument to various blocking functions is
useful, but I don't really see that they need such a sophisticated
support system to go along with them. One can always keep a global
variable *deadline* in user code and pass the appropriate timeout to
each function as you call it, no? An implicit failure-inducing global
like that just seems dangerous.

James

Nikodemus Siivola

Mar 28, 2007, 8:36:38 AM
to James Y Knight, sbcl-...@lists.sourceforge.net, Gábor Melis, Brian Mastenbrook
James Y Knight wrote:

> On Mar 27, 2007, at 8:42 AM, Brian Mastenbrook wrote:
>> I think the interface which is needed here is:
>>
>> * A function to return the current deadline based on the current
>> timer queue,
>> * A function to trigger appropriate timeout actions when the deadline
>> has passed,
>> * A function to register a timeout, which is given an instance of a
>> condition class to be signaled when the timeout expires
>> * A function to cancel a timeout, which is given the condition class
>> instance to find in the queue and cancel

> Uggg. This sounds really overdesigned to me. What I'd like is simple:
> timer support in the serve-event loop. I think that should cleanly cover
> your UDP SIP server case as well.

I confess I haven't thought about how this ties in with SERVE-EVENT
yet. My first thought is that streams have timeouts of their own,
which would then cause the event loop to signal a timeout. ...but
considering that it is a recursive event loop, that makes me feel quite
ill at ease. I need to think and see what I can implement.

> Other than that, a timeout argument to various blocking functions is
> useful, but I don't really see that they need such a sophisticated
> support system to go along with them. One can always keep a global
> variable *deadline* in user code and pass the appropriate timeout to
> each function as you call it, no? An implicit failure-inducing global
> like that just seems dangerous.

I think I am pretty square in the middle between you two. The
interface work I've done so far seems to speak strongly in favor of
per-object default timeouts, per-call-site explicit timeouts, and a
single global *DEADLINE*.

I tried an interface similar to the one proposed by Brian, but it
turned out that writing a function which exposes a timeout parameter to
the caller and respects the global deadline got quite hairy very
quickly, and there would have been a whole lot of possible bignum
computations going on.

Here's a sketch of what I have in mind:

;;; Special variables used below; NOW is assumed to return the current
;;; time in seconds.
(defvar *timeout* nil)
(defvar *deadline* nil)

;;; Primary timeout user interface.
(defmacro with-timeout (seconds &body forms)
  `(let* ((*timeout* ,seconds)
          ;; Clamp to any enclosing deadline.
          (*deadline* (if *deadline*
                          (min (+ (now) *timeout*) *deadline*)
                          (+ (now) *timeout*))))
     ,@forms))

;;; Primary timeout function writer interface: returns the seconds left
;;; until *DEADLINE* (or DEFAULT when there is no deadline), signaling
;;; DEADLINE with a CONTINUE restart if the deadline has already passed.
(defun ensure-timeout (&optional default)
  (let ((default (if default (coerce default 'double-float) 0.0d0)))
    (if *deadline*
        (let ((timeout (- *deadline* (now))))
          (unless (plusp timeout)
            (with-simple-restart
                (continue "Extend the deadline by ~A seconds." *timeout*)
              (error 'deadline :seconds *timeout*))
            ;; The CONTINUE restart was invoked: push the deadline back.
            (setf timeout *timeout*
                  *deadline* (+ (now) timeout)))
          (if (= default 0.0d0)
              timeout
              (min timeout default)))
        default)))

;;; A random timeout-respecting function.
(defun foo (thing &optional (timeout (ensure-timeout (thing-timeout thing))))
  (loop
    (when (= +foreign-timeout+ (foreign-foo timeout))
      (with-simple-restart (continue "Continue for ~A seconds more." timeout)
        (error 'simple-timeout
               :format-control "Timeout while doing FOO."
               :seconds timeout)))))

This doesn't do nearly as much as Brian's proposal, but it is simple to
use for the common case. If you do have cases where you need to
distinguish between different levels of timeout (time to die, time to
abort the connection, etc.), then I suggest that you may be better off
with timers. (Which would then fire off in a thread of their own in
multithreaded builds, and during safe points in unithreaded builds.)
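
For illustration, use of the above could look something like this
(assuming the DEADLINE and SIMPLE-TIMEOUT conditions are defined
elsewhere; *THING* is just a placeholder):

(with-timeout 2.0
  ;; FOO picks up its default timeout via ENSURE-TIMEOUT, clamped to the
  ;; 2.0 second deadline established above; on timeout we just give up.
  (handler-case (foo *thing*)
    (simple-timeout ()
      (warn "Gave up on FOO; the deadline expired.")
      nil)))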

Cheers,

-- Nikodemus

James Y Knight

Mar 28, 2007, 11:10:32 AM
to Nikodemus Siivola, sbcl-devel@lists.sourceforge.net List
On Mar 28, 2007, at 8:36 AM, Nikodemus Siivola wrote:
> I confess I haven't thought about how this ties in with SERVE-EVENT
> yet. My first thought is that streams have timeouts of their own,
> which would then cause the even-loop to signal timeout. ...but
> considering how it is a recursive event-loop that makes me feel quite
> ill at ease. Need to think and see what I can implement.

I just meant timers, handled by the event loop. Something like
(add-timer-event function secs) => handle; (remove-timer-event handle).
If one triggers, this counts as an event occurring. Timer support is
pretty much standard for every other event loop implementation I know
of. (I think various patches to implement this for serve-event have
been proposed in the past but I've not examined them in detail).
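
To illustrate, something along these lines would do, layered on top of
SERVE-EVENT's optional timeout argument rather than built into the loop
itself (ADD-TIMER-EVENT, REMOVE-TIMER-EVENT and NOW are made-up names
here, not an existing API):

(defvar *timer-events* '())   ; sorted list of (expiry-time function) pairs

(defun add-timer-event (function secs)
  (let ((handle (list (+ (now) secs) function)))
    (setf *timer-events*
          (merge 'list (list handle) *timer-events* #'< :key #'first))
    handle))

(defun remove-timer-event (handle)
  (setf *timer-events* (delete handle *timer-events*)))

(defun serve-event-with-timers ()
  "Serve one event, or run the next timer if it expires first."
  (let ((next (first *timer-events*)))
    (cond ((null next)
           (sb-sys:serve-event))
          ((<= (first next) (now))
           (remove-timer-event next)
           (funcall (second next)))
          (t
           (sb-sys:serve-event (- (first next) (now)))))))

A real version would presumably live inside SERVE-EVENT itself so that
every caller gets timers for free.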

I don't know what you mean by streams having timeouts that would
cause serve-event to timeout; that doesn't make any sense to me.

James
