Graceful shutdown of the server


Yutong Zhao

May 21, 2014, 9:12:06 PM
to python-...@googlegroups.com
What is the proper way to shut down Tornado? I'm using the pre-fork feature. Which pid should I send the SIGINT signals to? Upon catching this signal, I'd like the server to stop accepting all new requests and gracefully wait for all the current request handlers in the event loop to finish.

Ben Darnell

May 21, 2014, 9:55:00 PM
to Tornado Mailing List
Graceful shutdown is one of the biggest weaknesses of tornado's built-in multiprocess mode.  You need to send the signals to all of the processes, but then you can't restart anything until all of them have exited.  If you care about graceful restarts I highly recommend using an external process manager so you can address processes individually.  You can rig something up that will work with Tornado's multi-process mode, but it's easier to just use supervisord.

-Ben


On Wed, May 21, 2014 at 9:12 PM, Yutong Zhao <prot...@gmail.com> wrote:
What is the proper way to shut down Tornado? I'm using the pre-fork feature. Which pid should I send the SIGINT signals to? Upon catching this signal, I'd like the server to stop accepting all new requests and gracefully wait for all the current request handlers in the event loop to finish.


Yutong Zhao

May 21, 2014, 11:09:14 PM
to python-...@googlegroups.com, b...@bendarnell.com
I don't particularly care about the speed of restart. I can wait something like 5 minutes for each process to shut down. If I simply send SIGINT to every process, will each of them exit gracefully (i.e. wait for all requests to finish) without me needing to do anything special on my end? Right now I just call tornado.ioloop.IOLoop.instance().stop() for each process.

PS, we haven't added nginx yet to deal with routing to the different ports that each process listens to. Once we do that we'll consider supervisord + nginx to manage the individual processes. 

Ben Darnell

May 21, 2014, 11:20:29 PM
to Yutong Zhao, Tornado Mailing List
On Wed, May 21, 2014 at 11:09 PM, Yutong Zhao <prot...@gmail.com> wrote:
I don't particularly care about the speed of restart. I can wait something like 5 minutes for each process to shut down. If I simply send SIGINT to every process, will each of them exit gracefully (i.e. wait for all requests to finish) without me needing to do anything special on my end? Right now I just call tornado.ioloop.IOLoop.instance().stop() for each process.

IOLoop.stop is not a graceful shutdown; it just stops the process immediately.  For a graceful shutdown, you need to decide what you want to wait for (all requests? certain requests? some amount of time?) and then call IOLoop.stop after everything you're waiting for has finished.  My recommendation is to call HTTPServer.stop immediately (which stops new requests from coming in), and then IOLoop.stop after a time period has elapsed, and don't worry about tracking the number of outstanding requests.

-Ben

Yutong Zhao

May 21, 2014, 11:44:37 PM
to python-...@googlegroups.com, Yutong Zhao, b...@bendarnell.com
Ideally I'd like all outstanding requests in the midst of completing to complete. A given request can be in one of three states:

1) Yet to start at all (I don't care about these ones)
2) In between different yields (eg. motor, or a series of asynchronous file writes).
3) Finished (which isn't even in the IOLoop)

#2 is the dangerous one, especially if it's a handler that's supposed to be "atomic" barring system failure (e.g. writing out a set of files asynchronously). Though there are also some handlers I don't care about (e.g. ones that serve large static files).

I have no idea if it's possible to distinguish between 1) and 2) easily. I don't know how you can tell which event in the IOLoop corresponds to which handler.

It'd be nice if there were a setting for each request handler denoting whether it's OK to terminate it prematurely upon shutdown.

Ben Darnell

May 22, 2014, 10:11:26 AM
to Yutong Zhao, Tornado Mailing List
Yeah, this is why graceful shutdown procedures always end up being application-specific (unless you use a strict time-based approach, which again is what I recommend).  Keep a count of how many operations are outstanding that are worth keeping the server alive for, and when you decrement that count to zero stop the server if the shutdown flag has been set.  (you'll probably need a timer too, so a slow client can't keep the server alive indefinitely)
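The counting approach might look like this framework-agnostic sketch (class and method names are ours, not a Tornado API): begin()/end() would bracket each operation worth waiting for, request_shutdown() would run from the shutdown callback, and a separate IOLoop timeout would still enforce the deadline for slow clients:

```python
class ShutdownCounter:
    """Count operations worth keeping the server alive for; fire stop_cb
    once a shutdown has been requested and the count drains to zero."""

    def __init__(self, stop_cb):
        self.pending = 0
        self.shutting_down = False
        self.stop_cb = stop_cb

    def begin(self):
        # Call when an operation you care about starts (e.g. in prepare()).
        self.pending += 1

    def end(self):
        # Call when the operation finishes (e.g. in on_finish()).
        self.pending -= 1
        self._maybe_stop()

    def request_shutdown(self):
        # Call from the shutdown callback; stops immediately if idle.
        self.shutting_down = True
        self._maybe_stop()

    def _maybe_stop(self):
        if self.shutting_down and self.pending == 0:
            self.stop_cb()
```

In Tornado, stop_cb would typically be IOLoop.instance().stop, and the handlers that are fine to interrupt (e.g. static-file serving) would simply skip the begin()/end() calls.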

-Ben

Yutong Zhao

May 25, 2014, 10:10:26 PM
to python-...@googlegroups.com, Yutong Zhao, b...@bendarnell.com
I see. I will try that out.

Also, should TCPServer.stop() be invoked if and only if tornado.process.task_id() == None? After some experimentation, it looks like having the parent process call stop() is all that's required to stop requests from being forwarded to the children.

Ben Darnell

May 25, 2014, 10:53:21 PM
to Yutong Zhao, Tornado Mailing List
On Sun, May 25, 2014 at 10:10 PM, Yutong Zhao <prot...@gmail.com> wrote:
I see. I will try that out.

Also, should TCPServer.stop() be invoked if and only if tornado.process.task_id() == None? After some experimentation, it looks like having the parent process call stop() is all that's required to stop requests from being forwarded to the children.

Each process (parent and child) must stop its own TCPServer.  There is no forwarding in this mode; all the child processes are independently listening on the same port and the kernel is distributing traffic among them.  The parent process's TCPServer is not actually handling any traffic because fork_processes never returns (but the parent process still has a copy of the listening socket, which is enough to stop any *other* process from binding to that port if you stop all the children but leave the parent alive).
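The shared-socket setup being described looks roughly like the sketch below (the handler and port choice are our assumptions; port 0 lets the kernel pick a free port for illustration, and the fork is guarded so the module can be imported safely):

```python
import tornado.httpserver
import tornado.ioloop
import tornado.netutil
import tornado.process
import tornado.web

# The parent binds the listening sockets once, before forking.
# 0 means "any free port" for this sketch; use your real port in practice.
sockets = tornado.netutil.bind_sockets(0)

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        self.write("hello")

if __name__ == "__main__":
    # fork_processes never returns in the parent; each child continues
    # below with its own copy of the listening sockets, and the kernel
    # distributes incoming connections among them.
    tornado.process.fork_processes(0)
    server = tornado.httpserver.HTTPServer(
        tornado.web.Application([(r"/", MainHandler)]))
    server.add_sockets(sockets)
    tornado.ioloop.IOLoop.instance().start()
```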

-Ben

Yutong Zhao

May 27, 2014, 3:57:45 AM
to python-...@googlegroups.com, Yutong Zhao, b...@bendarnell.com
I ran into another hiccup. My parent process manages another process (redis) via subprocess.Popen. My child processes need to poll redis to make sure it's safe to exit by detecting the existence of certain keys inside redis.

The exit logic upon a SIGINT or SIGTERM is as follows:

1. Child processes terminate gracefully (by polling redis).
2. The parent terminates redis gracefully by calling shutdown() on redis. Ideally this is done by calling os.waitpid() on all the pids for the children, but the pids are hidden.
3. Parent process terminates.

I ended up duplicating tornado.process.fork_processes and turning children into a global, so that I can retrieve the pids. Here's a gist demonstrating the use case:


It would be nice to expose a way in process.py to look up the pid, given a task id, or perhaps expose the children dict directly. I would be happy to submit a pull request if this makes sense.

Ben Darnell

May 28, 2014, 10:23:37 AM
to Yutong Zhao, Tornado Mailing List
On Tue, May 27, 2014 at 3:57 AM, Yutong Zhao <prot...@gmail.com> wrote:
It would be nice to expose a way in process.py to look up the pid, given a task id, or perhaps expose the children dict directly. I would be happy to submit a pull request if this makes sense.

This is inherently racy - since fork_processes never returns while child processes are running, you could only access this from another thread or signal handler, but then you'd have no way of synchronizing on child processes exiting and being restarted.  It's better to communicate with the fork_processes loop somehow to make it drain, although that may not be necessary in this case - just let the processes exit normally and then catch the SystemExit raised by fork_processes to shut down the redis process.
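The catch-SystemExit pattern could be sketched as follows; serve_children below is a stand-in (an assumption of this sketch) for the real tornado.process.fork_processes call, which exits via sys.exit once all children are done:

```python
import sys

def serve_children():
    # Stand-in for tornado.process.fork_processes(0), which never returns
    # normally: it raises SystemExit in the parent once all children exit.
    sys.exit(0)

def run_parent(cleanup):
    try:
        serve_children()
    except SystemExit:
        # All children have exited gracefully; the parent can now shut down
        # redis (or any other managed subprocess) before exiting itself.
        cleanup()
        raise
```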

Also, you shouldn't sleep() in the child process signal handler - it will block the IOLoop.  In general, do as little as possible in signal handlers - in most cases, only call io_loop.add_callback_from_signal.   Python is not as restrictive as C in this regard since the python signal handler doesn't run in the context of the real signal handler, but it's still a strange and limited environment with tricky concurrency issues.

-Ben

Yutong Zhao

May 28, 2014, 5:05:27 PM
to python-...@googlegroups.com, Yutong Zhao, b...@bendarnell.com
Thanks Ben, I've taken your suggestions into account and modified the gist:


Two changes I made:

1. Made fork_processes stop naturally, catching a SystemExit to shut down gracefully.
2. Got rid of the blocking sleep and replaced it with add_timeout (I assume it is safe to invoke from signal handlers, even though it will use the same stack_context as the handlers), since the docs mention using it explicitly. I could adapt it to use io_loop.add_callback_from_signal but I'm not sure which version is preferred. E.g.:

def stop_children(sig, frame):
    print('-> stopping children', process2.task_id())
    # stop accepting new requests
    server.stop()

    # wait for all the locks to expire
    deadline = time.time() + 10
    io_loop = tornado.ioloop.IOLoop.instance()

    def stop_loop():
        now = time.time()
        if now < deadline and app.db.zrange('locks', 0, -1):
            io_loop.add_timeout(now + 1, stop_loop)
        else:
            app.shutdown()

    stop_loop()
vs (using add_callback_from_signal):

def stop_children(sig, frame):
    print('-> stopping children', tornado.process.task_id())
    # stop accepting new requests
    server.stop()

    # wait for all the locks to expire
    deadline = time.time() + 10

    def stop_loop():
        if time.time() < deadline and app.db.zrange('locks', 0, -1):
            time.sleep(1)
            tornado.ioloop.IOLoop.instance().add_callback_from_signal(stop_loop)
        else:
            app.shutdown()
 
    stop_loop()

Ben Darnell

May 28, 2014, 8:09:42 PM
to Yutong Zhao, Tornado Mailing List
On Wed, May 28, 2014 at 2:05 PM, Yutong Zhao <prot...@gmail.com> wrote:
Thanks Ben, I've taken your suggestions into account and modified the gist:


Two changes I made:

1. Made fork_processes stop naturally, catching a SystemExit to shut down gracefully.
2. Got rid of the blocking sleep and replaced it with add_timeout (I assume it is safe to invoke from signal handlers, even though it will use the same stack_context as the handlers), since the docs mention using it explicitly. I could adapt it to use io_loop.add_callback_from_signal but I'm not sure which version is preferred. E.g.:

Where do the docs mention that?  add_timeout is definitely not safe for use from a signal handler (and in general you should assume that anything that interacts with any non-local object is not safe from a signal handler unless documented otherwise). 
 


def stop_children(sig, frame):
    print('-> stopping children', process2.task_id())
    # stop accepting new requests
    server.stop()

    # wait for all the locks to expire
    deadline = time.time() + 10
    io_loop = tornado.ioloop.IOLoop.instance()

    def stop_loop():
        now = time.time()
        if now < deadline and app.db.zrange('locks', 0, -1):
            io_loop.add_timeout(now + 1, stop_loop)
        else:
            app.shutdown()

    stop_loop()
vs (using add_callback_from_signal):

def stop_children(sig, frame):
    print('-> stopping children', tornado.process.task_id())
    # stop accepting new requests
    server.stop()

    # wait for all the locks to expire
    deadline = time.time() + 10

    def stop_loop():
        if time.time() < deadline and app.db.zrange('locks', 0, -1):
            time.sleep(1)

The sleep is still a problem in this version.  You must use add_callback_from_signal to schedule something like your first stop_loop function on the IOLoop, which will in turn use add_timeout instead of sleep.

-Ben

Yutong Zhao

May 28, 2014, 8:35:03 PM
to python-...@googlegroups.com, Yutong Zhao, b...@bendarnell.com
> Where do the docs mention that?  add_timeout is definitely not safe for use from a signal handler (and in general you should assume that anything that interacts with any non-local object is not safe from a signal handler unless documented otherwise).  

Sorry, I meant that add_timeout is generally used for sleeping.

> The sleep is still a problem in this version.  You must use add_callback_from_signal to schedule something like your first stop_loop function on the IOLoop, which will in turn use add_timeout instead of sleep.

Why is it safe to invoke add_timeout from a callback scheduled via add_callback_from_signal, as opposed to invoking it directly?

Also, do you mean something like this instead?

def stop_children(sig, frame):
    print('-> stopping children', tornado.process.task_id())
    # stop accepting new requests
    server.stop()

    # wait for all the locks to expire
    deadline = time.time() + 10

    def stop_loop():
        if time.time() < deadline and app.db.zrange('locks', 0, -1):
            tornado.ioloop.IOLoop.instance().add_timeout(time.time() + 1, stop_loop)
        else:
            app.shutdown()
 
    tornado.ioloop.IOLoop.instance().add_callback_from_signal(stop_loop)

Ben Darnell

May 29, 2014, 8:44:40 PM
to Yutong Zhao, Tornado Mailing List
On Wed, May 28, 2014 at 5:35 PM, Yutong Zhao <prot...@gmail.com> wrote:
> Where do the docs mention that?  add_timeout is definitely not safe for use from a signal handler (and in general you should assume that anything that interacts with any non-local object is not safe from a signal handler unless documented otherwise).  

Sorry I meant that add_timeout is generally used for sleeping.

> The sleep is still a problem in this version.  You must use add_callback_from_signal to schedule something like your first stop_loop function on the IOLoop, which will in turn use add_timeout instead of sleep.

Why is it safe to invoke add_timeout from a callback scheduled via add_callback_from_signal, as opposed to invoking it directly?

Signal handlers are run in a very limited environment because they run "on top of" whatever is already running on the main thread.  You don't know what locks may be held (except the GIL) or anything else about the state of the world.  You can't acquire locks to deal with race conditions because you may deadlock with the (blocked) main thread (this makes it very difficult for a function to be both thread-safe and signal-safe).  You must use add_callback_from_signal to get back to a normal IOLoop callback context.
 

Also, do you mean something like this instead?

Yes, that's the right approach, although the call to server.stop() also needs to be moved into the callback (so you'll probably need two callbacks, or make it a coroutine).
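With that last correction applied, the full flow might look like the sketch below. FakeServer and FakeApp are stand-ins (assumptions of this sketch) for the real HTTPServer and application from the gist, so the control flow can be shown self-contained:

```python
import datetime
import time

import tornado.ioloop

class FakeServer:
    """Stand-in for the real HTTPServer."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

class FakeApp:
    """Stand-in for the real app; `locks` mimics app.db.zrange('locks', 0, -1)."""
    def __init__(self):
        self.locks = []
        self.shut_down = False
    def shutdown(self):
        self.shut_down = True
        tornado.ioloop.IOLoop.instance().stop()

server = FakeServer()
app = FakeApp()

def stop_children(sig, frame):
    # The signal handler does exactly one thing: hop back onto the IOLoop.
    tornado.ioloop.IOLoop.instance().add_callback_from_signal(begin_shutdown)

def begin_shutdown():
    server.stop()  # now safely inside a normal IOLoop callback
    deadline = time.time() + 10

    def stop_loop():
        if time.time() < deadline and app.locks:
            # Re-check once a second instead of sleeping.
            tornado.ioloop.IOLoop.instance().add_timeout(
                datetime.timedelta(seconds=1), stop_loop)
        else:
            app.shutdown()

    stop_loop()
```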

-Ben