cleanly shutting down a gevent process (and signal.signal() vs gevent.signal_handler())

Andrew Athan

Jun 15, 2022, 2:57:44 PM
to gevent: coroutine-based Python network library
Some months ago I did a lot of empirical tests to understand the differences between signal.signal() and gevent.signal_handler(). It was painful and time-consuming, and frankly, I've forgotten some of the nuances, which I am now re-learning. My notes from back then are at the bottom of this post (see the starred section).

The goal is to come up with a shutdown procedure and signal handling for gevent apps that gives maximum opportunity for data integrity *including* cases where a greenlet is misbehaving (such as in a tight loop that's not returning to the hub).

The general scheme is to register an atexit handler that invokes groups of shutdown lambdas registered with it a priori. Each such lambda is spawned in a greenlet, in the priority order declared during registration. We then wait for bounded time periods on completion events.
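For concreteness, here is a minimal sketch of such a registry. The `Shutdown` class, its method names, the `install()` helper and the per-group timeout are all made up for illustration; only `gevent.spawn()`, `gevent.joinall()` and `atexit.register()` are real APIs.

```python
import atexit

import gevent


class Shutdown:
    """Hypothetical registry: callbacks grouped by priority, lowest number first."""

    def __init__(self):
        self._groups = {}  # priority -> list of callables

    def register(self, priority, callback):
        self._groups.setdefault(priority, []).append(callback)

    def shutdown(self, timeout_per_group=5.0):
        """Run each priority group in greenlets, waiting a bounded time per group.

        Returns the greenlets that had not finished when their group timed out.
        """
        unfinished = []
        for priority in sorted(self._groups):
            greenlets = [gevent.spawn(cb) for cb in self._groups[priority]]
            gevent.joinall(greenlets, timeout=timeout_per_group)
            unfinished.extend(g for g in greenlets if not g.ready())
        return unfinished


def install(registry):
    """Hook the registry into interpreter exit (assumed wiring, not gevent API)."""
    atexit.register(registry.shutdown)
```

A caller would do something like `registry.register(10, stop_workers)` and `registry.register(20, close_db)`, then `install(registry)` once at startup. Groups with a lower number run, and are waited on, before groups with a higher number.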

So far, so good.

Now. How do we invoke the shutdown? Cooperative invocation is easy. We have a shutdown() function described above, and any greenlet can call it, which then spawns the shutdown process. Presumably, this does things like cancel() some greenlets, set flags to stop loops in other greenlets (perhaps including the invoker), etc.

All standard stuff, no sweat.

The hard part is dealing with signals properly, including giving our app the maximum opportunity for a clean shutdown in the case of a misbehaving greenlet.

First, a couple of questions:

(1) The documentation at https://www.gevent.org/api/gevent.signal.html describes `gevent.signal()` but IMHO needs clarification:

(a) When monkeypatching the world, does `signal.signal()` become `gevent.signal()`? If not, what is `signal.signal()` post monkeypatching? If this isn't documented, where should it be? (I'll submit a doc PR.)

(b) Those docs state gevent.signal() is "exactly like signal.signal() ... except where SIGCHLD is concerned". Is that literally and exactly true, or is it poetic license? I.e., does whatever Python's `signal.signal()` does remain *exactly* the same for calls other than those that set the SIGCHLD handler?

(2) How do signal handlers installed via `signal.signal()` (depending on 1a & 1b aka `gevent.signal()` when monkeypatched?) interact with the hub, if at all? Probably not at all. I'm assuming your running greenlet is unceremoniously pre-empted with no hub involvement.

(3) Given 2, for signals delivered via gevent.signal() (other than SIGSEGV and other catastrophes), execution will occur in the execution context of whatever greenlet happened to be running, right? And likewise, when I fall out the bottom of the handler, execution will resume in that same greenlet, right?

(4) In #3, the contrast being that a gevent.signal_handler() registered handler/signal will instead have been caught by the hub, a new greenlet will have been spawned, and the handler will then be running in this new greenlet. Correct?

(5) Are there things you cannot do in the signal.signal() handler? My old notes state "If, within the signal.signal() handler, you attempt to do something like a wait(), exceptions happen from within hub because apparently you can't do certain things like wait() from within the handler"!  This is why I ask #4, because that would imply the execution context is in fact different than the context of the greenlet that was running when the handler got invoked.

(6) In #3, if execution returns to the greenlet that was executing when the handler was invoked, then you *might* be returning into a tight loop that never cooperates with the hub, and nothing else will then run.
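Questions (3) and (4) can be probed empirically with something like the sketch below. It is POSIX-only, assumes no monkeypatching, and uses SIGUSR1/SIGUSR2 purely as stand-ins. The function name and the result labels are invented. The expectation, going by the gevent docs' statement that a `gevent.signal_handler()` handler "will be run in a new greenlet", is that the two handlers report different contexts.

```python
import os
import signal

import gevent
from gevent.event import Event


def probe_handler_contexts():
    """Report which greenlet each kind of handler ran in (POSIX only)."""
    seen = {}
    delivered = Event()

    def plain_handler(signum, frame):
        # Installed with signal.signal(): record whichever greenlet is current.
        seen["signal.signal"] = gevent.getcurrent()

    def gevent_handler():
        # Installed with gevent.signal_handler(): called without signal args.
        seen["gevent.signal_handler"] = gevent.getcurrent()
        delivered.set()

    def worker():
        os.kill(os.getpid(), signal.SIGUSR1)
        for _ in range(1000):  # plain Python work, no yielding to the hub
            pass
        os.kill(os.getpid(), signal.SIGUSR2)
        delivered.wait(timeout=2)  # cooperative wait so the hub can dispatch

    old = signal.signal(signal.SIGUSR1, plain_handler)
    watcher = gevent.signal_handler(signal.SIGUSR2, gevent_handler)
    try:
        w = gevent.spawn(worker)
        w.join(timeout=5)
    finally:
        watcher.cancel()
        signal.signal(signal.SIGUSR1, old if old is not None else signal.SIG_DFL)
    return {
        name: ("worker" if g is w else "other greenlet")
        for name, g in seen.items()
    }
```

Printing the returned dict shows, for each registration style, whether the handler ran inside the greenlet that was executing when the signal was sent or somewhere else. It says nothing about what happens when the signal arrives while the process is idle in the event loop, which would need a separate test.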

So now, back to building a rock solid shutdown process:
(A) It seems you initially would want a gevent.signal_handler() "non-hoisting" handler in order to optimistically allow a nice cooperating app to shut down with grace.
(B) It seems you eventually would want a signal.signal() "hoisting" handler in order to be able to interrupt misbehaving greenlets.
(C) You "want" to call sys.exit() from A and/or B in order to ensure all atexit handlers are called, even ones registered by packages that don't use "our" shutdown system (described above). Our shutdown is an atexit registered function, so that's great.

Externally to the app, you could do something like what kube does:

Send SIGNAL_A, which is presumed to use soft (A) tactics, wait a reasonable amount of time, then send SIGNAL_B, which is presumed to use hard (B) tactics, wait a reasonable amount of time, then destroy the universe.
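From the supervisor's side, that escalation could be sketched as below, using only the standard library. The choice of SIGTERM for SIGNAL_A and SIGUSR1 for SIGNAL_B is an assumption made for the example, as are the function name and the grace period. SIGKILL plays the part of destroying the universe.

```python
import signal
import subprocess


def stop_process(proc, signal_a=signal.SIGTERM, signal_b=signal.SIGUSR1, grace=10.0):
    """Escalate A -> B -> SIGKILL; return the stage that ended the process."""
    for stage, signum in (("soft", signal_a), ("hard", signal_b)):
        proc.send_signal(signum)
        try:
            proc.wait(timeout=grace)  # bounded wait for a voluntary exit
            return stage
        except subprocess.TimeoutExpired:
            continue  # still alive: move on to the next, harsher stage
    proc.kill()  # SIGKILL cannot be caught or ignored
    proc.wait()
    return "killed"
```

Here `proc` is a `subprocess.Popen` object for the gevent app. The return value tells the supervisor how far it had to escalate, which is worth logging.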

If (A) is executing in a greenlet, we can just call sys.exit() there and all's well.

If (B) is executing in a random greenlet's context, calling sys.exit() never returns to the calling greenlet (which may be misbehaving, so we may not want that, but let's not get distracted). The implication is that some random greenlet will not get the opportunity for a clean shutdown. This might have been a `while self.keep_running:work()` worker greenlet doing a long computation.

On the other hand, if (B) is called, you're likely in a stuck situation, or at least one that exceeded your timeouts, but that doesn't mean you don't want to give *all the other greenlets* a chance to shut down. So you definitely want to do something that will let the gevent hub keep running the other greenlets. What is that thing? Calling sys.exit() from within the handler invokes all the atexit functions, but do we presume some monkeypatched thing in there gets the hub going again? Do we need to add a sleep(0) somewhere just to be sure? Am I confusing myself?

So maybe, in the (B) handler you start a system thread that waits a while then resends SIGNAL_B. You register a stage2 forcible SIGNAL_B handler to catch that. Meanwhile *assuming you are allowed to do this (see #5 above)* you spawn a greenlet that calls sys.exit() and you return.

If you're returning into a tight loop, nothing but the tight loop runs, but the system thread will eventually call your stage_2. In stage_2, you do a while True:sleep(0) for some wallclock time and then sys.exit() again. Presumably, stage 1 already teed up the fancy shutdown, and you just don't want to return to that same badly behaving greenlet in whose context you must once again be running *while giving the hub the opportunity to execute*. 
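The two-stage idea can be sketched with the standard library alone, which also makes it testable outside gevent. Everything here is hypothetical naming: `install_two_stage`, the `on_stage1`/`on_stage2` callbacks and `resend_after`. It assumes `threading` is NOT monkeypatched, so the resender is a real OS thread. In a patched app a genuinely native thread (for example from gevent's thread pool) would be needed instead.

```python
import os
import signal
import threading
import time


def install_two_stage(signum, on_stage1, on_stage2, resend_after=5.0):
    """Install a stage-1 handler that arms stage 2 and a delayed resend."""

    def stage2(_signum, _frame):
        on_stage2()

    def resend():
        time.sleep(resend_after)
        # Re-deliver the signal; the Python-level handler (now stage2)
        # still runs in the main thread.
        os.kill(os.getpid(), signum)

    def stage1(_signum, _frame):
        signal.signal(signum, stage2)  # the next delivery goes to stage 2
        # A real thread keeps running even if the main thread never yields.
        threading.Thread(target=resend, daemon=True).start()
        on_stage1()

    signal.signal(signum, stage1)
```

In the scheme described above, `on_stage1` would spawn the greenlet that calls sys.exit() and then return, while `on_stage2` would spin on sleep(0) for some wallclock time before calling sys.exit() itself. Whether spawning from inside the handler is safe is exactly question (5), so treat that part as untested.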

.... do I need a lobotomy?

A.



* If you register with gevent.signal_handler, then SIGINT will not stop a tight loop in the main greenlet unless that loop has a sleep().
* If you register with signal.signal(), then SIGINT will stop that loop, but it will also be hungry about handling the signal, so you end up in its handler even from within a non-main greenlet
* If you register with gevent.signal_handler after signal.signal() it takes over the signal
* If, within the signal.signal() handler, you attempt to do something like a wait(), exceptions happen from within hub because apparently you can't do certain things like wait() from within the handler
* If you handle a signal from gevent.signal_handler(), the main greenlet keeps running, so you have to gevent.kill the main greenlet
* If you gevent.kill the main greenlet, this may prematurely end the shutdown process
* To prevent prematurely ending the shutdown process you can override sys.excepthook
* But, if you wait on all exceptions, you end up preventing the atexit() handler from running, which means you see the ^C, but then you get stuck waiting
* SO .... it turns out the BEST THING to do is to NOT register any signal_handler()s, and let atexit() proceed. This will eventually call Shutdown.shutdown() from the main thread EVEN IF the main thread is already in wait_and_exit(). Importantly, it will stop the main thread's execution as intended by pressing ^C!
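The last note relies on plain Python behaviour: an unhandled KeyboardInterrupt in the main thread still runs atexit handlers on the way out. A small stdlib-only check of that is sketched below. The child script and the `run_child` helper are invented for the example, and `raise KeyboardInterrupt` stands in for an actual ^C.

```python
import subprocess
import sys
import textwrap

# Child program: registers an atexit handler, installs no signal handlers.
CHILD = textwrap.dedent("""
    import atexit

    def shutdown():
        print("shutdown ran", flush=True)

    atexit.register(shutdown)
    raise KeyboardInterrupt  # stands in for the user pressing ^C
""")


def run_child():
    """Run the child script and return what it printed on stdout."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )
    return result.stdout
```

This only covers the case where the interrupt reaches Python as an exception. A process killed by a signal that Python does not handle (SIGKILL, or SIGTERM with the default disposition) never gets to its atexit handlers.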