Inconsistent crashes with "Truncated or oversized response headers"


David White

Aug 31, 2020, 4:16:58 PM
to modwsgi

Hello.  I am running Apache 2.4.43 with mod_wsgi 4.71 compiled against Python 3.8.3 (all manually compiled, not part of the RHEL 8.1 distro).  Apache MPM is "event".

The application running in one of my virtual hosts crashes occasionally, and seemingly at random.  Repeating the same request immediately will usually succeed.  The application may continue working for hours before randomly crashing again.

This started happening recently, after an upgrade of the Flask and SQLAlchemy modules (again, all manually installed via pip).

A crash is reported this way in the vhost's error_log:

[Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 140546546816768] [client 10.83.210.200:47266] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 140546571994880] [client 10.31.82.142:35810] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Mon Aug 31 11:11:08.512262 2020] [wsgi:info] [pid 40601:tid 140547110185856] mod_wsgi (pid=40601): Attach interpreter ''.
[Mon Aug 31 11:11:08.514415 2020] [wsgi:info] [pid 40601:tid 140547110185856] mod_wsgi (pid=40601): Adding '/apps/www/sgm/wsgi/SGM' to path.
[Mon Aug 31 11:11:08.514526 2020] [wsgi:info] [pid 40601:tid 140547110185856] mod_wsgi (pid=40601): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
[Mon Aug 31 11:11:08.515520 2020] [wsgi:debug] [pid 40601:tid 140546715096832] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 0 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515575 2020] [wsgi:debug] [pid 40601:tid 140546706704128] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 1 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515629 2020] [wsgi:debug] [pid 40601:tid 140546698311424] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 2 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515702 2020] [wsgi:debug] [pid 40601:tid 140546689918720] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 3 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515737 2020] [wsgi:debug] [pid 40601:tid 140546681526016] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 4 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515766 2020] [wsgi:debug] [pid 40601:tid 140546673133312] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 5 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515795 2020] [wsgi:debug] [pid 40601:tid 140546664740608] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 6 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515821 2020] [wsgi:debug] [pid 40601:tid 140546656347904] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 7 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515853 2020] [wsgi:debug] [pid 40601:tid 140546647955200] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 8 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515921 2020] [wsgi:debug] [pid 40601:tid 140546639562496] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 9 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515948 2020] [wsgi:debug] [pid 40601:tid 140546631169792] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 10 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.515972 2020] [wsgi:debug] [pid 40601:tid 140546622777088] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 11 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.516002 2020] [wsgi:debug] [pid 40601:tid 140546614384384] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 12 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.516036 2020] [wsgi:debug] [pid 40601:tid 140546605991680] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 13 in daemon process 'sgm-prod'.
[Mon Aug 31 11:11:08.516061 2020] [wsgi:debug] [pid 40601:tid 140546597598976] src/server/mod_wsgi.c(9118): mod_wsgi (pid=40601): Started thread 14 in daemon process 'sgm-prod'.

The WSGI portion of the configuration for the vhost in Apache looks like this:

WSGIPassAuthorization On
LogLevel info wsgi:trace6
SetEnv SGM_PRODUCTION  1
SetEnv SGM_USE_ORACLE  1
WSGIDaemonProcess sgm-prod user=sgmuser group=sgmgroup threads=15 python-home=/apps/sgmuser/.virtualenvs/sgm python-path=/apps/www/sgm/wsgi/SGM:/apps/sgmuser/.virtualenvs/sgm/lib/python3.8/site-packages socket-timeout=6000
WSGIScriptAlias / /apps/www/sgm/wsgi/SGM/run.py
<Directory /apps/www/sgm/wsgi/SGM>
    Require all granted
    AllowOverride AuthConfig
    WSGIProcessGroup sgm-prod
    WSGIApplicationGroup %{GLOBAL}
</Directory>

The main Apache error log shows nothing relevant, and the application-specific logs (with Python logging messages) show only that the request was made.  There is no exception traceback being logged (I'm guessing the app crashes before that can happen).

Given how transient this error is, I'm not sure how else to configure Apache or the SGM app itself to get more details about why the "sgm-prod" process seems to be crashing.

Any help would be appreciated!  Thanks.

Graham Dumpleton

unread,
Aug 31, 2020, 4:56:46 PM8/31/20
to mod...@googlegroups.com
Look in the main Apache error log, not the virtual host one, and you will probably find a message about a segmentation fault or other error.

When it is quite random like this, unless you can enable capture of a core dump file and run a debugger (gdb) on it to get a stack trace, it is going to be hard to track down.

The only other thing I can suggest is watching the process size, to see whether it is getting so large that you are running out of memory and a failed memory allocation is causing something to crash.
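As a rough sketch of what I mean by watching the process size (this assumes the third party psutil package, which is not in your environment list, and is only one way to do it): point it at the daemon process PID shown in the mod_wsgi log lines and it prints the resident set size once a second.

import sys
import time
import psutil

def watch(pid, interval=1.0):
    # Poll the daemon process RSS until it exits; a steady climb just before
    # the crash would point at memory exhaustion.
    proc = psutil.Process(pid)
    while proc.is_running():
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print("%s pid=%d rss=%.1f MB" % (time.strftime("%H:%M:%S"), pid, rss_mb), flush=True)
        time.sleep(interval)

if __name__ == "__main__":
    watch(int(sys.argv[1]))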

Graham


David White

Aug 31, 2020, 5:06:39 PM
to modwsgi
Unfortunately the main Apache log shows nothing except normal startup/shutdown messages.  If the "sgm-prod" threads are being terminated and restarted, they are leaving no indication of that in the main server logs. 

The only clue appears to be that the crashes occur under heavier load.  The application does not often strain the CPU/RAM of the underlying host, but the Apache vhost logs show that the crashes seem to occur when multiple requests are being handled nearly simultaneously.

[Mon Aug 31 11:11:07.504787 2020] [wsgi:error] [pid 35980:tid 140546546816768] [client 10.83.210.200:47266] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Mon Aug 31 11:11:07.504806 2020] [wsgi:error] [pid 35980:tid 140546571994880] [client 10.31.82.142:35810] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py


Am I likely to see any benefit to either:
  1. Reducing the process or thread count passed to WSGIDaemonProcess, or
  2. Putting Apache into "prefork" MPM? 

Thanks, Graham.

Graham Dumpleton

Aug 31, 2020, 5:16:48 PM
to mod...@googlegroups.com

On 1 Sep 2020, at 7:06 am, David White <dswh...@gmail.com> wrote:

> Am I likely to see any benefit to either:
>   1. Reducing the process or thread count passed to WSGIDaemonProcess, or
>   2. Putting Apache into "prefork" MPM?

Never use prefork MPM if you can avoid it.

Also ensure that outside of VirtualHost you have:

   WSGIRestrictEmbedded On

to ensure you are never accidentally running in the main Apache child worker processes.

If there is no evidence of a crash, then another cause can be that you are using an external system to trigger log file rotation, rather than the recommended approach of using Apache's own log file rotation.

An external log file rotation system usually only fires once per day, but maybe yours is set up differently based on the size of the logs.

The problem with an external log file rotation system is that it signals Apache to restart. Although the main Apache child worker processes are given a grace period to finish requests, the way the Apache libraries manage third party processes such as the mod_wsgi daemon processes means they will be killed after 5 seconds if they don't shut down quickly enough. This results in the requests being proxied by the Apache child worker processes being cut off, and you see that message.

If it is this though, you should see clear messages at info level in the mod_wsgi logs that the daemon processes were being shut down and restarted. These will appear some time before you see that message.

The only other thing I can think of is that you have requests that are getting stuck, and eventually you hit that socket timeout, although you have that set very large, so a request would have to be stuck for 6000 seconds before you saw it.

The only way to tell if it may be that is to use:


so you can signal mod_wsgi to dump Python stack traces periodically and see if a specific request is stuck somewhere.
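As a generic sketch of the idea only (not the exact recipe from the link above), the technique amounts to a background thread in the WSGI process that periodically writes every thread's Python stack to the Apache error log, so a stuck request shows up as the same frames appearing dump after dump:

import sys
import threading
import time
import traceback

def _dump_stacks(interval=300):
    # Every `interval` seconds, write the current Python stack of every thread
    # to stderr, which for a daemon process ends up in the Apache error log.
    while True:
        time.sleep(interval)
        for thread_id, frame in sys._current_frames().items():
            print("=== thread %d ===" % thread_id, file=sys.stderr)
            traceback.print_stack(frame, file=sys.stderr)
        sys.stderr.flush()

# Started once from the WSGI script file (run.py in your case):
threading.Thread(target=_dump_stacks, daemon=True).start()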

Graham

David White

Aug 31, 2020, 5:58:26 PM
to modwsgi
>  another cause can be that you are using an external system to trigger log file rotation  

That is an inspired guess!  I am in fact using "cronolog" to manage logfile rotation.  I don't think it's the culprit, since it rotates only monthly and the crashes can happen many times a day (sometimes minutes apart).  But removing it will at least narrow down the variables.

> requests that are getting stuck, and eventually you hit that socket timeout

It appears the crashes are pretty much immediate, within a second of a request (or multiple requests) coming in.  The server may run 10 hours before the next crash, or it may crash again within 5 minutes.  There does seem to be some correlation between request load and crashing likelihood - crashes almost always happen overnight when multiple cron jobs are hitting the program at once.

Meanwhile I'll stuff "WSGIRestrictEmbedded On" into the main httpd.conf and see what happens.

I appreciate the debugging advice.

Graham Dumpleton

Aug 31, 2020, 7:18:15 PM
to mod...@googlegroups.com
There must be a core dump or segmentation fault message in the main Apache error logs if the daemon process is crashing.

One other remote possibility I have seen with some third party libraries (C libraries, not so much their Python wrappers) is that when they encounter an error, they call the C library exit(), stopping the process. This results in no crash report.
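A tiny, contrived illustration of why that leaves nothing behind (nothing to do with your actual packages):

import ctypes

libc = ctypes.CDLL(None)   # handle to the already loaded C library (Linux)

try:
    libc.exit(1)           # the process terminates right here, mid-try block
except Exception:
    print("never reached - exit() does not raise a Python exception")
print("never reached either - no traceback, no crash report")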

What are the third party Python packages you are using?

Graham

Graham Dumpleton

Aug 31, 2020, 11:48:31 PM
to mod...@googlegroups.com
And another possibility is that your Linux OOM killer is killing the processes because they are using up too much memory.


Apache is quite susceptible to being killed by it because of how it uses resources.

Check the logs it describes to see if it is being triggered.

David White

Sep 1, 2020, 2:04:30 PM
to modwsgi
This is such a cryptic thing to debug.  I can't find any sign of a core dump or segmentation fault in any of the Apache main logs, the Apache vhost logs, or the system logs in /var/log.  The main Apache server log (now with "LogLevel debug") shows only the crash event and nothing else particularly helpful.  The sgm-prod process can survive for hours, or it may die every 3-4 minutes.

[Tue Sep 01 09:00:49.625195 2020] [wsgi:info] [pid 130799:tid 140547110185856] mod_wsgi (pid=130799): Initializing Python.
[Tue Sep 01 11:02:27.218264 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:02:27.218317 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:02:27.219893 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:02:27.230831 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:02:27.230893 2020] [wsgi:info] [pid 47466:tid 140547110185856] mod_wsgi (pid=47466): Initializing Python.

[Tue Sep 01 11:04:44.365899 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:04:44.366026 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:04:44.366035 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47466): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:04:44.367023 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:04:44.374176 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:04:44.374232 2020] [wsgi:info] [pid 47872:tid 140547110185856] mod_wsgi (pid=47872): Initializing Python.

[Tue Sep 01 11:07:29.569996 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:07:29.570049 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:07:29.570054 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=47872): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:07:29.571142 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:07:29.577718 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:07:29.577776 2020] [wsgi:info] [pid 48657:tid 140547110185856] mod_wsgi (pid=48657): Initializing Python.

[Tue Sep 01 11:10:18.754372 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 11:10:18.754433 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 11:10:18.754438 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=48657): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 11:10:18.755506 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 11:10:18.762162 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 11:10:18.762215 2020] [wsgi:info] [pid 49225:tid 140547110185856] mod_wsgi (pid=49225): Initializing Python.

Memory and OOM don't seem to be a factor - this is a beefy virtual machine with 58GB of RAM, rarely ever exceeding 5-6 GB of usage even under load.  No "process killed" messages in dmesg or /var/log/messages.

It may be that Apache wasn't properly configured for core dumps.  I've made some config changes and did a "kill -SIGSEGV" on both the main httpd process and (after restarting) a thread process.  Both generated crash dumps in /tmp/crash, so I'll wait to see if I get anything useful in there the next time the app crashes for real.
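A related Python-level option, for what it's worth, is the standard library faulthandler module: enabled near the top of run.py, it writes a Python traceback for every thread when the process takes a fatal signal such as SIGSEGV or SIGABRT (it can do nothing for a SIGKILL).  A rough sketch, with the log path just an example:

import faulthandler

# Keep a reference to the file object so it stays open for the life of the process.
_crash_log = open("/tmp/sgm-faulthandler.log", "a")
faulthandler.enable(file=_crash_log, all_threads=True)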

You asked what third-party packages are part of the virtual environment, and it's pretty extensive.  Flask+SQLAlchemy make up the core of the application, and there are a lot of libraries that either came along for the ride, or were manually installed.

aniso8601==8.0.0
bcrypt==3.2.0
blinker==1.4
Cerberus==1.3.2
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
click==7.1.2
colorama==0.4.3
coverage==5.2.1
cx-Oracle==5.3
dateparser==0.7.6
Eve==1.1.2
Eve-SQLAlchemy==0.7.1
Events==0.3
Flask==1.1.2
Flask-Admin==1.5.6
Flask-Bcrypt==0.7.1
Flask-DebugToolbar==0.11.0
flask-ldap3-login==0.9.16
Flask-LDAPConn==0.10.1
Flask-Login==0.4.1
Flask-PyMongo==2.3.0
Flask-RESTful==0.3.8
Flask-Restless==0.17.0
Flask-SQLAlchemy==2.4.4
Flask-WTF==0.14.3
httpie==2.2.0
idna==2.10
inflect==4.1.0
itsdangerous==1.1.0
jdatetime==3.6.2
Jinja2==2.11.2
jinja2-pluralize==0.3.0
jsmin==2.2.2
ldap3==2.8
lxml==4.5.2
MarkupSafe==1.1.1
marshmallow==2.10.3
mimerender==0.6.0
netaddr==0.8.0
nose2==0.9.2
ordered-set==4.0.2
pyasn1==0.4.8
pycparser==2.20
Pygments==2.6.1
pymongo==3.11.0
python-dateutil==2.8.1
python-mimeparse==1.6.0
pytz==2020.1
PyYAML==5.3.1
redis==3.5.3
regex==2020.7.14
requests==2.24.0
ruamel.yaml==0.16.10
ruamel.yaml.clib==0.2.0
simplejson==3.17.2
six==1.15.0
SQLAlchemy==1.3.19
sqlalchemy-datatables==2.0.1
sqlalchemy-views==0.2.4
strict-rfc3339==0.7
tzlocal==2.1
umalqurra==0.2
urllib3==1.25.10
webargs==1.4.0
Werkzeug==1.0.1
WTForms==2.3.3
xmljson==0.2.1
xmltodict==0.12.0

I'll keep fighting with it.  Thanks again for the suggestions, Graham.

David White

Sep 1, 2020, 2:48:02 PM
to modwsgi
Ugh - another crash, but no crash dump generated.

/apps/www/sgm/logs/error.log (Apache vhost error log):
[Tue Sep 01 13:36:51.204652 2020] [wsgi:error] [pid 98990:tid 139766439843584] [client 2001:420:1080:3004::ad25:5fd7:52564] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Tue Sep 01 13:36:51.204662 2020] [wsgi:error] [pid 97668:tid 139767068935936] [client 2001:420:1080:3004::ad25:5fd7:52588] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Tue Sep 01 13:36:51.672700 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Attach interpreter ''.
[Tue Sep 01 13:36:51.674843 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Adding '/apps/www/sgm/wsgi/SGM' to path.
[Tue Sep 01 13:36:51.674962 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
[Tue Sep 01 13:36:51.675940 2020] [wsgi:debug] [pid 111735:tid 139767101744896] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 0 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676191 2020] [wsgi:debug] [pid 111735:tid 139767084959488] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 2 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676297 2020] [wsgi:debug] [pid 111735:tid 139767076566784] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 3 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676304 2020] [wsgi:debug] [pid 111735:tid 139767093352192] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 1 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676326 2020] [wsgi:debug] [pid 111735:tid 139767068174080] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 4 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676517 2020] [wsgi:debug] [pid 111735:tid 139767059781376] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 5 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676593 2020] [wsgi:debug] [pid 111735:tid 139767051388672] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 6 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676642 2020] [wsgi:debug] [pid 111735:tid 139767034603264] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 8 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676694 2020] [wsgi:debug] [pid 111735:tid 139767017817856] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 10 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676752 2020] [wsgi:debug] [pid 111735:tid 139767009425152] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 11 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676806 2020] [wsgi:debug] [pid 111735:tid 139767001032448] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 12 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676844 2020] [wsgi:debug] [pid 111735:tid 139767026210560] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 9 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676856 2020] [wsgi:debug] [pid 111735:tid 139766992639744] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 13 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676930 2020] [wsgi:debug] [pid 111735:tid 139766984247040] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 14 in daemon process 'sgm-prod'.
[Tue Sep 01 13:36:51.676952 2020] [wsgi:debug] [pid 111735:tid 139767042995968] src/server/mod_wsgi.c(9118): mod_wsgi (pid=111735): Started thread 7 in daemon process 'sgm-prod'.

/apps/apache2/logs/error.log (Apache system log):
[Tue Sep 01 13:36:51.640137 2020] [wsgi:info] [pid 94912:tid 139767496833920] mod_wsgi (pid=97659): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 13:36:51.640203 2020] [wsgi:info] [pid 94912:tid 139767496833920] mod_wsgi (pid=97659): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 13:36:51.640210 2020] [wsgi:info] [pid 94912:tid 139767496833920] mod_wsgi (pid=97659): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 13:36:51.641412 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 13:36:51.654168 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 13:36:51.654251 2020] [wsgi:info] [pid 111735:tid 139767496833920] mod_wsgi (pid=111735): Initializing Python.

ls -la /var/crash:
total 4
drwxrwxrwx.  2 root root    6 Sep  1 13:47 .
drwxr-xr-x. 26 root root 4096 Jun  3 20:00 ..

Graham Dumpleton

Sep 1, 2020, 4:41:11 PM
to mod...@googlegroups.com
This is the first time you have included the log message:

[Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9

This is an important one.

It shows that the process was explicitly killed by something else using the equivalent of 'kill -9'. In other words, it isn't crashing by itself but being killed off.

This is very indicative of the OOM killer, so double-check the system logs for evidence of it killing things.

Also see if you can somehow set up monitoring of memory usage by processes and see if the size of the Apache processes spikes before these events.

David White

Sep 2, 2020, 2:05:17 AM
to modwsgi
> This is the first time you have included the log message:
> [Tue Sep 01 11:02:27.218311 2020] [wsgi:info] [pid 15161:tid 140547110185856] mod_wsgi (pid=130799): Process 'sgm-prod' terminated by signal 9
> This is an important one.

Yep, user error.  "LogLevel info" was set in the vhost configuration but not in the main Apache config, and this particular error shows up in the main server error_log rather than the vhost error_log.

> It shows that the process was explicitly killed by something else using equivalent of 'kill -9'. In other words, it isn't crashing by itself but killed off.
> This is very indicative of the OOM killer. So double check system logs for evidence of it killing stuff.
> Also see if you can somehow set up monitoring of memory usage by processes and see if the size of the Apache processes spikes before these events.

Ugh, I just hate for a good lead to not pan out.  A kill -9 sent to the process certainly sounds like OOM, but I just can't find any evidence that this is the problem.  The "good" news is that I now have a mostly repeatable way to generate the error - call a somewhat lengthy process (an operation that takes 30-45 seconds to complete) twice simultaneously.  Within 60 seconds (usually much less) the sgm-prod process will crash and generate error_log messages (and both clients will receive 500 Internal Server Errors).  This significantly cuts down on how long I have to monitor things from beginning to end.
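(For reference, the two-at-once reproduction is easy to script with the requests library; the URL below is a placeholder, not the real SGM endpoint.)

import concurrent.futures
import requests

URL = "https://sgm.example.com/api/slow-operation"   # placeholder

def hit(i):
    # Each call takes 30-45 seconds server side; when the daemon dies it comes back as a 500.
    resp = requests.get(URL, timeout=120)
    return i, resp.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    for i, status in pool.map(hit, range(2)):
        print("request %d: HTTP %d" % (i, status))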

So here's an example - from server starting up to crashing in 19 seconds:

[Tue Sep 01 23:38:15.177559 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 23:38:15.183802 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 23:38:15.183862 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Initializing Python.
[Tue Sep 01 23:38:15.202600 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Attach interpreter ''.
[Tue Sep 01 23:38:15.204691 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Adding '/apps/www/sgm/wsgi/SGM' to path.
[Tue Sep 01 23:38:15.204792 2020] [wsgi:info] [pid 54042:tid 140644297517952] mod_wsgi (pid=54042): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
[Tue Sep 01 23:38:15.206152 2020] [wsgi:debug] [pid 54042:tid 140643793323776] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 13 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206211 2020] [wsgi:debug] [pid 54042:tid 140643894036224] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 1 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206233 2020] [wsgi:debug] [pid 54042:tid 140643868858112] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 4 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206252 2020] [wsgi:debug] [pid 54042:tid 140643843680000] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 7 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206270 2020] [wsgi:debug] [pid 54042:tid 140643818501888] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 10 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206311 2020] [wsgi:debug] [pid 54042:tid 140643885643520] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 2 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206350 2020] [wsgi:debug] [pid 54042:tid 140643877250816] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 3 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206388 2020] [wsgi:debug] [pid 54042:tid 140643902428928] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 0 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206602 2020] [wsgi:debug] [pid 54042:tid 140643860465408] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 5 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206652 2020] [wsgi:debug] [pid 54042:tid 140643852072704] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 6 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206702 2020] [wsgi:debug] [pid 54042:tid 140643835287296] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 8 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206752 2020] [wsgi:debug] [pid 54042:tid 140643826894592] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 9 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206799 2020] [wsgi:debug] [pid 54042:tid 140643810109184] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 11 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.206965 2020] [wsgi:debug] [pid 54042:tid 140643801716480] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 12 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:15.207067 2020] [wsgi:debug] [pid 54042:tid 140643784931072] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54042): Started thread 14 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:31.260173 2020] [wsgi:info] [pid 54042:tid 140643793323776] [remote 2001:420:1080:3004::ad25:5fd7:37700] mod_wsgi (pid=54042, process='sgm-prod', application=''): Loading Python script file '/apps/www/sgm/wsgi/SGM/run.py'.
[Tue Sep 01 23:38:34.044173 2020] [wsgi:error] [pid 53241:tid 140644082374400] [client 2001:420:1080:3004::ad25:5fd7:37702] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Tue Sep 01 23:38:34.044202 2020] [wsgi:error] [pid 53241:tid 140644124337920] [client 2001:420:1080:3004::ad25:5fd7:37700] Truncated or oversized response headers received from daemon process 'sgm-prod': /apps/www/sgm/wsgi/SGM/run.py
[Tue Sep 01 23:38:34.198481 2020] [wsgi:info] [pid 52360:tid 140644297517952] mod_wsgi (pid=54042): Process 'sgm-prod' has died, deregister and restart it.
[Tue Sep 01 23:38:34.198534 2020] [wsgi:info] [pid 52360:tid 140644297517952] mod_wsgi (pid=54042): Process 'sgm-prod' terminated by signal 9
[Tue Sep 01 23:38:34.198566 2020] [wsgi:info] [pid 52360:tid 140644297517952] mod_wsgi (pid=54042): Process 'sgm-prod' has been deregistered and will no longer be monitored.
[Tue Sep 01 23:38:34.199931 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Starting process 'sgm-prod' with uid=50004207, gid=25 and threads=15.
[Tue Sep 01 23:38:34.204659 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Python home /apps/vmscan/.virtualenvs/sgm.
[Tue Sep 01 23:38:34.204707 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Initializing Python.
[Tue Sep 01 23:38:34.222980 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Attach interpreter ''.
[Tue Sep 01 23:38:34.225049 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Adding '/apps/www/sgm/wsgi/SGM' to path.
[Tue Sep 01 23:38:34.225148 2020] [wsgi:info] [pid 54088:tid 140644297517952] mod_wsgi (pid=54088): Adding '/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages' to path.
[Tue Sep 01 23:38:34.226427 2020] [wsgi:debug] [pid 54088:tid 140643877250816] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 3 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226486 2020] [wsgi:debug] [pid 54088:tid 140643852072704] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 6 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226511 2020] [wsgi:debug] [pid 54088:tid 140643826894592] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 9 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226528 2020] [wsgi:debug] [pid 54088:tid 140643801716480] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 12 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226546 2020] [wsgi:debug] [pid 54088:tid 140643902428928] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 0 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226669 2020] [wsgi:debug] [pid 54088:tid 140643885643520] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 2 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226715 2020] [wsgi:debug] [pid 54088:tid 140643894036224] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 1 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226743 2020] [wsgi:debug] [pid 54088:tid 140643860465408] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 5 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226767 2020] [wsgi:debug] [pid 54088:tid 140643868858112] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 4 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226791 2020] [wsgi:debug] [pid 54088:tid 140643835287296] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 8 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.226814 2020] [wsgi:debug] [pid 54088:tid 140643843680000] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 7 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.227063 2020] [wsgi:debug] [pid 54088:tid 140643810109184] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 11 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.227117 2020] [wsgi:debug] [pid 54088:tid 140643818501888] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 10 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.227936 2020] [wsgi:debug] [pid 54088:tid 140643784931072] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 14 in daemon process 'sgm-prod'.
[Tue Sep 01 23:38:34.228030 2020] [wsgi:debug] [pid 54088:tid 140643793323776] src/server/mod_wsgi.c(9118): mod_wsgi (pid=54088): Started thread 13 in daemon process 'sgm-prod'.

And here's "vmstat 1" during that time.   Free RAM is fine, and available cache RAM never drops below 53GB.

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 990976 1296072   8044 55966592    0    0    15    53    1    1 12  3 85  0  0
 0  0 990976 1295848   8044 55966592    0    0     0     4  311  398  0  0 100  0  0
 0  0 990976 1295168   8044 55966592    0    0     0     0 4199 2938  2  2 97  0  0
 1  0 990976 1268080   8044 55966592    0    0     0     0 4752 3559  9  4 87  0  0
 1  0 990976 1225368   8044 55966712    0    0     0     0 2311 1228 33  2 65  0  0
 1  0 990976 1139704   8044 55966712    0    0     0     0 2335  860 55  3 42  0  0
 1  0 990976 1082504   8044 55966712    0    0     0    20 2862  626 35  3 62  0  0
 2  0 990976 1013720   8044 55990828    0    0     0     0 2349  370 64  2 34  0  0
 2  0 990976  969116   8044 55990828    0    0     0   186 2288  571 66  2 32  0  0
 2  0 990976  925216   8044 55990828    0    0     0     0 2342  429 66  1 33  0  0
 2  0 990976  917288   8044 55966712    0    0     0     0 3180 1632 43  4 53  0  0
 2  0 990976  882840   8044 55966716    0    0     0    20 3742 2074 32  3 65  0  0
 0  0 990976  875392   8044 55966724    0    0     0    44 8244 5675 18  5 77  0  0
 1  0 990976 1062116   8044 55966744    0    0    32     0 3613 2515 31  3 66  0  0
 0  0 990976 1290828   8044 55966864    0    0     0   320 2072 1016 31  1 68  0  0
 0  0 990976 1290604   8044 55966864    0    0     0   740  731  822  0  1 99  0  0
 0  0 990976 1290352   8044 55966864    0    0     0     0  563  635  0  0 100  0  0
 0  0 990976 1289736   8044 55966864    0    0     0     0  796  894  0  0 100  0  0

"dmesg" produces no output.  "/var/log/messages" has no logs during that time period (other than the sudo log of me running the two client programs simultaneously).  On a RHEL 8 system there doesn't seem to be any other place that OOM logs would be found.

I'm going to start trying to use "perf" and "strace" to dig into the SIGKILLs themselves and see if I can coax Linux into telling me anything about what's killing off those threads.

David White

Sep 2, 2020, 2:33:52 AM
to modwsgi
Oh, thank goodness, I think I may have tracked it down.  Running "strace" on the sgm-prod WSGI process gave me this:

[pid 93684] openat(AT_FDCWD, "/apps/vmscan/.virtualenvs/sgm/lib/python3.8/site-packages/cx_Oracle.cpython-38-x86_64-linux-gnu.so", O_RDONLY) = 52
[pid 93684] mmap(NULL, 8192, PROT_READ, MAP_SHARED, 52, 0xeb000) = 0x7fea4d65f000
[pid 93684] rt_sigprocmask(SIG_SETMASK, ~[BUS SEGV RTMIN RT_1], [], 8) = 0
[pid 93684] rt_sigaction(SIGBUS, {sa_handler=0x7fea2127a3b0, sa_mask=[ALRM], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fea4c4f1dc0}, NULL, 8) = 0
[pid 93684] rt_sigaction(SIGSEGV, NULL, {sa_handler=0x7fea21290370, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fea4c4f1dc0}, 8) = 0
[pid 93684] rt_sigaction(SIGSEGV, {sa_handler=0x7fea2127a3b0, sa_mask=[ALRM], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7fea4c4f1dc0}, NULL, 8) = 0
[pid 93684] rt_sigprocmask(SIG_BLOCK, NULL, ~[BUS KILL SEGV STOP RTMIN RT_1], 8) = 0
[pid 93684] rt_sigaction(SIGBUS, {sa_handler=0x7fea21290370, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fea4c4f1dc0}, NULL, 8) = 0
[pid 93684] rt_sigaction(SIGSEGV, {sa_handler=0x7fea21290370, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7fea4c4f1dc0}, NULL, 8) = 0
[pid 93684] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid 93684] write(49, "<-Cursor_InternalExecute()+77", 29) = 29
[pid 93684] write(49, "\n", 1)          = 1
[pid 93684] getpid()                    = 93671
[pid 93684] kill(93671, SIGKILL)        = ?
[pid 93682] <... select resumed>)       = ? ERESTARTNOHAND (To be restarted if no handler)
[pid 93681] <... select resumed> <unfinished ...>) = ?
[pid 93671] <... restart_syscall resumed>) = ?
[pid 93686] +++ killed by SIGKILL +++
[pid 93682] +++ killed by SIGKILL +++
[pid 93681] +++ killed by SIGKILL +++
[pid 93684] +++ killed by SIGKILL +++
+++ killed by SIGKILL +++

The last thing before the untimely death of the process was an Oracle call (Cursor_InternalExecute) by the cx_Oracle module.  Looking into the virtualenv for this program, it appears that everything was upgraded recently except cx-Oracle, which was still v5.3, an ancient 2017 version.  Upgrading it to 8.0.1 seems to have fixed the problem - my reliable crash-generating scenario of two simultaneous client calls now runs without immediately crashing in the first 60 seconds.  The real proof will come overnight when the heavy processing takes place, but this is a very encouraging sign.
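(For anyone landing here later: with a current cx_Oracle, the usual way to share Oracle connections across the threads of a multithreaded daemon process is a session pool.  A rough sketch with placeholder credentials - this is an illustration, not the actual SGM code:)

import cx_Oracle

# One pool per daemon process, shared by all of its request threads.
pool = cx_Oracle.SessionPool(
    user="sgm_user", password="********",
    dsn="dbhost.example.com/ORCLPDB1",
    min=2, max=15, increment=1,
    threaded=True,   # build the underlying OCI layer for concurrent use
)

def run_query(sql, params=None):
    conn = pool.acquire()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params or {})
            return cur.fetchall()
    finally:
        pool.release(conn)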

Thanks again for all the ideas, Graham.  I'm exhausted, but hopeful this will do it.  (plus I gained a little more sysadmin knowledge, so there's that.)