Errors when gracefully restarting more than 1 daemon process at a time

Jay

Aug 27, 2019, 1:51:17 AM
to modwsgi
Hi Graham,

First off, thanks for all your work on mod_wsgi and the docker images! It's been a tremendous help.

I'm running an API server using Django, and for memory reasons, I'd like to gracefully restart the daemon processes periodically. What I've found is that using `restart-interval` or sending `SIGUSR1` to multiple daemon processes at the same time causes my app (which is behind a load balancer) to return 502s (Bad Gateway) to the consumer. This doesn't seem to happen when I send `SIGUSR1` to the daemon processes one at a time, though.

This is what I'm using in my `server_args`:
--server-mpm event
--processes 4
--threads 16
--application-type module
--url-alias /static static
--compress-responses
--log-level info
--startup-log
--keep-alive-timeout 5
--server-status
--request-timeout 120
--shutdown-timeout 120
app.wsgi


Is this behavior expected? One guess is that there is a race condition where, if multiple daemon processes undergo the shutdown sequence at nearly the same time, some requests get routed to a daemon process that has already stopped accepting new requests.

Is there an out-of-the-box solution to handle this? Or is the workaround to run a job that sends the graceful restart signal to the daemon processes one at a time?
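Something like the following is what I have in mind for that job (just a rough sketch; the daemon PIDs are hard-coded here purely for illustration and would really need to be discovered from a process listing):

    import os
    import signal
    import time

    # Illustrative only: the mod_wsgi daemon process IDs would need to be
    # looked up (e.g. from a process listing), not hard-coded like this.
    daemon_pids = [1234, 1235, 1236, 1237]

    for pid in daemon_pids:
        os.kill(pid, signal.SIGUSR1)  # graceful restart of this one daemon
        time.sleep(60)                # let it drain and come back up before
                                      # signalling the next one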

Thanks!

Best,
Jay

Graham Dumpleton

Aug 27, 2019, 2:06:21 AM
to mod...@googlegroups.com

On 27 Aug 2019, at 2:30 pm, Jay <j...@alchemyapi.io> wrote:

Hi Graham,

First off, thanks for all your work on mod_wsgi and the docker images! It's been a tremendous help.

All the docker images I created are quite out of date and not actively maintained, as there wasn't enough interest in them to justify the effort. Which one are you using, and how old is the mod_wsgi version?

I'm running an API server using Django, and for memory reasons, I'd like to gracefully restart the daemon processes periodically. What I've found is that using `restart-interval` or sending `SIGUSR1` to multiple daemon processes at the same time causes my app (which is behind a load balancer) to return 502s (Bad Gateway) to the consumer. This doesn't seem to happen when I send `SIGUSR1` to the daemon processes one at a time, though.

This is what I'm using in my `server_args`:
--server-mpm event
--processes 4
--threads 16
--application-type module
--url-alias /static static
--compress-responses
--log-level info
--startup-log
--keep-alive-timeout 5
--server-status
--request-timeout 120
--shutdown-timeout 120
app.wsgi


Is this behavior expected? One guess is that there is a race condition where, if multiple daemon processes undergo the shutdown sequence at nearly the same time, some requests get routed to a daemon process that has already stopped accepting new requests.

That shouldn't occur. The code uses what is called a cross-process mutex. A daemon process will only acquire that mutex lock when it is in a running state and has capacity to handle requests. If multiple daemon processes were restarted at the same time, all that should happen is that requests get queued up in the socket listener queue between the Apache child processes and the daemon processes, until a daemon process is ready to start accepting requests again. The queue depth is usually 100, which is more than the whole Apache capacity anyway, so you shouldn't even be able to fill that and start seeing errors.
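As a rough illustration of that queueing behaviour (plain sockets only, not mod_wsgi code): connections made while nothing is accepting simply wait in the kernel listen backlog until something calls accept() again.

    import socket
    import time

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(100)  # backlog of 100, as with the daemon listener socket
    host, port = listener.getsockname()

    # Clients connect while no "worker" is accepting, e.g. during a restart.
    clients = [socket.create_connection((host, port)) for _ in range(5)]
    time.sleep(1)

    # Once accepting resumes, the queued connections are handed over intact.
    for _ in clients:
        conn, _addr = listener.accept()
        conn.close()

    for c in clients:
        c.close()
    listener.close()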

Further, there are some timeouts in play which mean that it tries to restart a daemon process only when there are no active requests being handled by that process.

    optparse.make_option('--graceful-timeout', type='int', default=15,
            metavar='SECONDS', help='Grace period for requests to complete '
            'normally, while still accepting new requests, when worker '
            'processes are being shutdown and restarted due to maximum '
            'requests being reached or restart interval having expired. '
            'Defaults to 15 seconds.'),
    optparse.make_option('--eviction-timeout', type='int', default=0,
            metavar='SECONDS', help='Grace period for requests to complete '
            'normally, while still accepting new requests, when the WSGI '
            'application is being evicted from the worker processes, and '
            'the process restarted, due to forced graceful restart signal. '
            'Defaults to timeout specified by \'--graceful-timeout\' '
            'option.'),

The eviction timeout should come into play here, and because it defaults to the graceful timeout of 15 seconds, I would be surprised if the processes stayed in lock step and really ended up restarting at the exact same time. They should naturally drift apart, unless you have insignificant traffic, in which case I also can't see how you would get an error, as requests should still just queue up.
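If you did want a longer or explicit grace period on the forced restarts, those options can be set in your server args alongside what you already have (the values here are only examples):

--restart-interval 3600
--graceful-timeout 15
--eviction-timeout 30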

The only things I can think of are that you are sending SIGUSR1 to the Apache parent process as well, and not just to the mod_wsgi daemon processes, or that you are actually managing to restart the whole container somehow.

What do the logs show around the time when you send the signals? You are using INFO level logging for Apache, so they should show lots of mod_wsgi log messages about what is happening with the restarting daemon mode processes.

Is there an out-of-the-box solution to handle this? Or is the workaround to run a job that sends the graceful restart signal to the daemon processes one at a time?

Thanks!

Best,
Jay
