Unexplained lag during startup

Jamie Biggar

unread,

Oct 22, 2020, 6:21:59 PM10/22/20

to modwsgi

Hi all,

I've been a mod_wsgi user for many years (Graham, thank you for your fantastic community support!), but this week ran into a mystery I haven't been able to solve on my own.

We've been running a fairly hefty Django app in production with mod_wsgi for years without much issue. In August, with no obviously correlated change in code or server architecture, we started having issues where a restart (usually triggered by `touch`ing the WSGI script via `WSGIScriptReloading On`, though sometimes also by `systemctl restart httpd.service`) would occasionally lead to an unending stream of 504 timeouts (and sometimes some 503s as well) lasting indefinitely. Another restart would sometimes fix it, but not always. The issue seems to be load related -- the busier the server is, the more likely it is to get stuck in the 504 loop. Most restarts would work fine and yield a normally-running site after a brief pause as the app was loaded into memory.

While troubleshooting today (not under production load), I noticed something that I think is likely exacerbating load-related restart timeout issues: it seems that after a flurry of activity on initial server (re)start which clearly includes loading our WSGI script (as I see entries in the Apache error log related to Python packages it imports), there's a period of roughly 45 seconds when the CPU is idle and no requests are served via mod_wsgi before it wakes up and finally emits `Started thread 0 in daemon process ...` log messages, then a few seconds later it's able to reply to HTTP requests.

Any idea what could cause that ~45 second idle period during startup? I've tried tuning the *-timeout options for WSGIDaemonProcess, with no apparent effect on the idle time. I also tried disabling our NewRelic APM code to rule out a network API bottleneck.

Software versions:

* Amazon Linux 2

* Python 3.6 (via IUS: https://ius.io/ )

* mod_wsgi/4.6.2 (also via IUS, compiled against Python 3.6)

* Apache/2.4.46

* Django 2.2

Apache config:

WSGIDaemonProcess eslive display-name='(wsgi:es-site)' \

processes=6 threads=1 \

user=apache group=apache \

python-home=/path/to/virtualenv \

python-path=/path/to/code/root \

python-eggs=/var/www/.python-eggs \

lang='en_US.UTF-8' locale='en_US.UTF-8' \

queue-timeout=45 \

socket-timeout=60 \

connect-timeout=15 \

request-timeout=120 \

startup-timeout=30 \

deadlock-timeout=60 \

eviction-timeout=0 \

shutdown-timeout=5 \

graceful-timeout=15 \

restart-interval=0 \

inactivity-timeout=0 \

maximum-requests=0

WSGIImportScript /path/to/django-wsgi.py \

process-group=eslive application-group=%{GLOBAL}

WSGISocketPrefix run/httpd-wsgi

WSGIScriptAlias / /path/to/django-wsgi.py \

process-group=eslive application-group=%{GLOBAL}

WSGIPassAuthorization On

</VirtualHost>

Thanks in advance for any recommendations!

-Jamie

Graham Dumpleton

unread,

Oct 22, 2020, 6:40:22 PM10/22/20

to mod...@googlegroups.com

Remove:

WSGIImportScript /path/to/django-wsgi.py \
process-group=eslive application-group=%{GLOBAL}

Setting both process-group and application-group on WSGIScriptAlias has the same effect of preloading the WSGI script file using WSGIScriptAlias. I am not sure what will happen if both ways of forcing preloading are set.

Also a memory corruption bug was also recently reported to me along with a fix. This has been an outstanding issue for many years but which so rarely occurred on full Linux and macOS platforms (Alpine Linux would crash all the time though), that never been able to track it down. This bug relates to the preloading of the WSGI script file, so there is an outside chance it is related.

Disabling the preloading may not be desirable though because lazy loading has greater risk of delaying first requests longer as can queue up on process which is still loading the application. That said, it may not be noticeable since only one thread per process. Thus worth trying:

WSGIProcessGroup eslive

WSGIScriptAlias / /path/to/django-wsgi.py application-group=%{GLOBAL}

which because no WSGIImportScript, but both process-group and application-group aren't said, means no pre-loading.

BTW, if you don't already have it said, ensure you are setting:

WSGIRestrictEmbedded On

if only using daemon mode. Not related, but good practice and cuts down on memory usage and startup load on Apache child worker processes.

So first up try that. The bug fix I mention hasn't actually been released yet as had some other unfinished stuff in code which wasn't sure if I wanted to change. If you wanted to be brave though, you could try the 'develop' branch of mod_wsgi on GitHub. If can replicate in testing system, could perhaps try it there.

The only other thing can think of is if there is a cross process conflict with initialisation done by your app in relation to a database or backend service, when multiple processes are starting up at the same time.

Finally, not sure whether might be adapted, but as very first thing in WSGI script file you could start a background thread which watches for an event set at end of WSGI script file import, and if takes more than certain time to see that event, indicating slow WSGI script file load, dump out Python stack traces. Code related to this is found at:

https://modwsgi.readthedocs.io/en/master/user-guides/debugging-techniques.html#extracting-python-stack-traces

It will need to be updated to Python 3 as probably still Python 2, and then adapt it as mentioned.

Graham

--
You received this message because you are subscribed to the Google Groups "modwsgi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to modwsgi+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/modwsgi/bcb386ac-7c83-459d-bced-792d535a09d0n%40googlegroups.com.

Reply all

Reply to author

Forward

Message has been deleted