[Mon Aug 29 21:31:26.979825 2016] [wsgi:crit] [pid 4317:tid 140495687612160] (35)Resource deadlock avoided: mod_wsgi (pid=4317): Couldn't acquire accept mutex '/path-to-apache/var/state.14843.0.1.sock'. Shutting down daemon process
[mpm_worker:debug] [pid 6870:tid 140091454019328] worker.c(1829): AH00294: Accept mutex: fcntl (default: sysvsem)
I assume this implies that mpm_worker uses fcntl while mod_wsgi uses sysvsem. With this default, /path-to-apache/var contained only the .sock file.
I added these two lines to the Apache conf:
Mutex file:/path-to-apache/var/state mpm-accept
WSGIAcceptMutex flock
Now apache debug logs show:
[mpm_worker:debug] [pid 15647:tid 140091454117350] worker.c(1829): AH00294: Accept mutex: flock (default: sysvsem)
And /path-to-apache/var now contains both the .sock and the .lock file.
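Since errno 35 (EDEADLK) comes straight from the lock call on the accept mutex, one quick sanity check is whether plain fcntl and flock locks behave on the filesystem holding /path-to-apache/var. A rough standalone Python sketch (the scratch file name is made up, and this only exercises single-process lock/unlock cycles, so it cannot reproduce the contention itself):

# lock_check.py -- rough standalone check, not part of the app
# Take and release an exclusive POSIX (fcntl) lock and a BSD flock
# on a scratch file in the same directory as the accept mutex.
import fcntl
import os

LOCK_DIR = '/path-to-apache/var'            # same dir as the mutex files
path = os.path.join(LOCK_DIR, 'lock-test')  # made-up scratch file name

with open(path, 'w') as f:
    fcntl.lockf(f, fcntl.LOCK_EX)           # fcntl-style lock
    fcntl.lockf(f, fcntl.LOCK_UN)
    fcntl.flock(f, fcntl.LOCK_EX)           # flock-style lock
    fcntl.flock(f, fcntl.LOCK_UN)
print('fcntl and flock both acquired and released fine')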
I still don't know if this will eliminate the deadlocks, which happen on one box while the other box chugs along happily without issues. On the problem box, the reported reason for the deadlock seems to vary; I saw the following in the logs at different times and don't know why the reason changes. (Note: these were logged with the defaults, i.e. fcntl for mpm-accept and sysvsem for mod_wsgi.)
[wsgi:error] [pid 11561:tid 139804096313088] [client 10.4.71.118:28639] Timeout when reading response headers from daemon process 'app': /path-to-apache/bin/app.wsgi
[wsgi:info] [pid 11559:tid 139804272707328] mod_wsgi (pid=11559): Daemon process deadlock timer expired, stopping process 'app'.
[wsgi:info] [pid 11559:tid 139804333795072] mod_wsgi (pid=11559): Shutdown requested 'app'.
[wsgi:crit] [pid 4317:tid 140495687612160] (35)Resource deadlock avoided: mod_wsgi (pid=4317): Couldn't acquire accept mutex '/path-to-apache/var/state.14843.0.1.sock'. Shutting down daemon process
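One thing that can help pin down where the daemon threads are actually stuck when the deadlock timer fires is dumping every thread's stack on demand from inside the WSGI script. A minimal sketch using only the standard library (the signal choice and the log path are my own, nothing mod_wsgi requires; on Python 2 the faulthandler backport from PyPI provides the same call):

# near the top of app.wsgi: dump all thread stacks to a file when the
# daemon process receives SIGUSR1, so a hung process can be inspected
# with `kill -USR1 <pid>` before mod_wsgi restarts it.
import faulthandler
import signal

_stack_log = open('/tmp/app-wsgi-stacks.log', 'a')  # arbitrary path
faulthandler.register(signal.SIGUSR1, file=_stack_log, all_threads=True)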
I am not sure what's at fault. If it were due to the fcntl/sysvsem combination, it should have happened on both boxes and not just one. The only two differences between the two boxes are:
- One is in the AWS US region and the other is in the AWS EU region, though uname and httpd -V show the same thing on both.
- The US-region box loads a different set of scikit-learn pickles than the EU-region one, but both sets are exports of the same scikit-learn class (SGDClassifier); a quick way to compare their load times is sketched below.
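Since the pickles are the one real difference, it may be worth timing how long each box takes just to unpickle its models outside Apache; a slow or hanging load when a daemon process (re)starts would interact badly with the timeout/deadlock timers. A rough sketch (the file names are placeholders for the actual pickle paths):

# time_pickles.py -- rough standalone timing of the model unpickling,
# run on each box against its own pickle files.
import pickle
import time

PICKLES = ['/path/to/model-a.pkl', '/path/to/model-b.pkl']  # placeholders

for name in PICKLES:
    start = time.time()
    with open(name, 'rb') as f:
        model = pickle.load(f)
    print('%s: %.2fs, %s' % (name, time.time() - start, type(model).__name__))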
Mysterious at best; any clues would help. I definitely don't want to switch to gunicorn/nginx just to experiment with whether something in Apache is messing things up.
Thanks
Santhosh
On 30 Aug 2016, at 4:17 PM, kusimari share <kusimar...@gmail.com> wrote:

Some additional data. Apache debug logs show:
[mpm_worker:debug] [pid 6870:tid 140091454019328] worker.c(1829): AH00294: Accept mutex: fcntl (default: sysvsem)
This I assume implies: mpm worker uses fcntl while mod_wsgi uses sysvsem. With this default /path-to-apache/var contained only .sock file.
I added two lines to apache conf
Mutex file:/path-to-apache/var/state mpm-accept
WSGIAcceptMutex flock
Now apache debug logs show:
[mpm_worker:debug] [pid 15647:tid 140091454117350] worker.c(1829): AH00294: Accept mutex: flock (default: sysvsem)
And /path-to-apache/var contains both .sock and .lock file.
I still don't know if this will eliminate deadlocks happening in one box while the other box chugs away happily without issues. On the box which has issues, deadlock reason seems to vary. Saw these in the logs at different times. Don't know why the reason for deadlock also changes. (Note this is using fcntl default sysvsem)
- Logs for the most common of the deadlock situations. Note that pid for the timeout and deadlock timer are different.
[wsgi:error] [pid 11561:tid 139804096313088] [client 10.4.71.118:28639] Timeout when reading response headers from daemon process 'app': /path-to-apache/bin/app.wsgi
[wsgi:info] [pid 11559:tid 139804272707328] mod_wsgi (pid=11559): Daemon process deadlock timer expired, stopping process 'app'.
On 30 Aug 2016, at 4:56 PM, kusimari share <kusimar...@gmail.com> wrote:

WSGI setup uses the GLOBAL application group and mpm_worker.

# apache workers config to start 2 servers with 75 max requests.
# changes to workload imply that the below needs to be tuned
<IfModule mpm_worker_module>
StartServers 2
MinSpareThreads 10
MaxSpareThreads 35
ThreadsPerChild 25
MaxClients 75
MaxRequestsPerChild 0
</IfModule>

Mutex flock:/path-to-apache/var/state mpm-accept
WSGIScriptReloading Off
WSGISocketPrefix /path-to-apache/var/state
WSGIPassAuthorization On
# see mod_wsgi documentation on embedded vs daemon
WSGIRestrictEmbedded On
# force wsgi to use flock instead of default. added to check fcntl vs sysvsem vs flock
WSGIAcceptMutex flock

# Port configuration
Listen 8080
<VirtualHost *:8080>
ServerName server-name:8080
UseCanonicalName Off
# app configuration for current load at 10k+ classification reqs / sec
WSGIDaemonProcess app processes=3 threads=30 display-name=atlas
WSGIScriptAlias / /path-to-apache/bin/app.wsgi
WSGIProcessGroup app
WSGIApplicationGroup %{GLOBAL}
<Directory /path-to-apache/bin/>
Order deny,allow
Allow from all
</Directory>
</VirtualHost>

My initial suspicion was that something in scikit/numpy/scipy/nltk was the reason for the deadlock, even though I was using GLOBAL following http://modwsgi.readthedocs.io/en/develop/user-guides/application-issues.html#python-simplified-gil-state-api. However, if that were the cause, the deadlock should happen on both boxes, since both use the same Python Flask app code.

From the Apache conf, my later suspicion was that prefork was the issue, so I loaded only mpm_worker. That delays the deadlock, but it happens nevertheless.

I don't know if having both Mutex and WSGISocketPrefix point to the same path causes issues. Again, if that were a problem, both boxes should deadlock.

Finally, the deadlocks happen all of a sudden, without a pattern. They are NOT triggered by increased load. And once a deadlock is triggered, the server goes into a spin: each process deadlocks repeatedly, about 3 or 4 times, after which they all settle down and work fine. This is what made me suspect locking was the problem. If not, why would deadlocks happen across all processes/threads at the same time, continue for 3 or 4 cycles, and then settle down?
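On the scikit/numpy suspicion: if the models are unpickled lazily on the first request rather than at import time, several of the 30 daemon threads can pile into the load at once. Serialising that load with a plain lock is cheap and makes it easier to rule the C extensions in or out. A minimal sketch of what the loading side could look like (names and paths are illustrative, not the actual app.wsgi code):

# illustrative loading pattern only, not the actual app code
import pickle
import threading

_models = None
_models_lock = threading.Lock()

def get_models():
    # load the SGDClassifier pickles exactly once, even with 30 threads
    global _models
    if _models is None:
        with _models_lock:
            if _models is None:                               # re-check under the lock
                with open('/path/to/models.pkl', 'rb') as f:  # placeholder path
                    _models = pickle.load(f)
    return _models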