I am using a script that dumps the output of this command into a log
file every 10 seconds, to get a snapshot of what apache was doing when
it got stuck.
links -dump http://localhost/server-status | grep ^[0-9]|awk 'BEGIN
{print "Seconds, PID, State, IP, Domain, TYPE, URL\n--"} $4 !~
/[GCRK_.]/ {print $6, $2, $4, $11, $12, $13 " " $14|"sort -n"}'
the attached file is a snapshot from the latest incident.
last time I had the issue, the problem was with sqlite connection
pooling. this is *no longer the issue as I am now using mysql database
backend*.
the crash.log file show that some processes have been running for nearly
1000 seconds, this conflicts with the apache configuration:
---------------------
WSGIDaemonProcess firestats.cc user=omry group=omry threads=50
maximum-requests=500 inactivity-timeout=60 display-name=%{GROUP}
deadlock-timeout=60
<VirtualHost *:80>
WSGIProcessGroup firestats.cc
WSGIApplicationGroup %{GLOBAL}
ServerName firestats.cc
ServerAlias www.firestats.cc
ServerAdmin omry
ErrorLog /home/omry/logs/firestats.cc/error.log
WSGIScriptAlias / /home/omry/www/trac/trac.wsgi
SetEnv trac.env_path /home/omry/www/trac/firestats.cc
RewriteEngine on
RewriteRule ^/robots.txt /home/omry/www/trac/robots.txt [L]
</VirtualHost>
---------------------
WSGIDaemonProcess is set to have inactivity and deadlock timeouts of
60 seconds, so how can a process possibly run for 1000 seconds?
using siege on the timeline URL gets the server to a bad spot (with 45
concurrent clients, I can bring my server down in seconds).
I tried to inspect the apache backtrace using gdb , I get this in the
first thread:
(gdb) bt
#0 0x00007f3fcd588b66 in poll () from /lib/libc.so.6
#1 0x00007f3fcda54e95 in apr_poll () from /usr/lib/libapr-1.so.0
#2 0x00007f3fc430c53c in wsgi_daemon_main () from
/usr/lib/apache2/modules/mod_wsgi.so
#3 0x00007f3fc430cd39 in wsgi_start_process () from
/usr/lib/apache2/modules/mod_wsgi.so
#4 0x00007f3fc430d339 in wsgi_start_daemons () from
/usr/lib/apache2/modules/mod_wsgi.so
#5 0x00007f3fc430f640 in wsgi_hook_init () from
/usr/lib/apache2/modules/mod_wsgi.so
#6 0x0000000000438cf4 in ap_run_post_config ()
#7 0x0000000000425bbc in main ()
but after switching to other threads I get this:
#0 0x00007f3fcd821bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x00007f3fc406bbc8 in PyThread_acquire_lock (lock=0xcbaf30,
waitflag=128) at ../Python/thread_pthread.h:349
#2 0x00007f3fc4070ec2 in lock_PyThread_acquire_lock (self=0xc3f2a0,
args=<value optimized out>) at ../Modules/threadmodule.c:46
#3 0x00007f3fc4042fcd in PyEval_EvalFrameEx (f=0x2a7cec0,
throwflag=<value optimized out>) at ../Python/ceval.c:3575
#4 0x00007f3fc4044bed in PyEval_EvalCodeEx (co=0x1176cd8,
globals=<value optimized out>, locals=<value optimized out>, args=0x2,
argcount=1, kws=0x2abc5d8, kwcount=0, defs=0x11854a8, defcount=1,
closure=0x0) at ../Python/ceval.c:2838
....
#141 0x00007f3fc4044375 in PyEval_EvalFrameEx (f=0x3e303c0,
throwflag=<value optimized out>) at ../Python/ceval.c:3661
#142 0x00007f3fc4044375 in PyEval_EvalFrameEx (f=0x3ece7d0,
throwflag=<value optimized out>) at ../Python/ceval.c:3661
#143 0x00007f3fc4044bed in PyEval_EvalCodeEx (co=0xe1e210,
globals=<value optimized out>, locals=<value optimized out>,
args=0x35290b0, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0,
closure=0x0) at ../Python/ceval.c:2838
#144 0x00007f3fc3fdb55e in function_call (func=0x1418938, arg=0x3529098,
kw=0x0) at ../Objects/funcobject.c:517
#145 0x00007f3fc3fb82a3 in PyObject_Call (func=0xcbaf30, arg=0x80,
kw=0x0) at ../Objects/abstract.c:1861
#146 0x00007f3fc403d001 in PyEval_CallObjectWithKeywords
(func=0x1418938, arg=0x3529098, kw=0x0) at ../Python/ceval.c:3444
#147 0x00007f3fc430183c in Adapter_run () from
/usr/lib/apache2/modules/mod_wsgi.so
#148 0x00007f3fc430632d in wsgi_execute_script () from
/usr/lib/apache2/modules/mod_wsgi.so
#149 0x00007f3fc430f3cd in wsgi_hook_daemon_handler () from
/usr/lib/apache2/modules/mod_wsgi.so
#150 0x00007f3fc430b6ac in wsgi_process_socket () from
/usr/lib/apache2/modules/mod_wsgi.so
#151 0x00007f3fc430bafa in wsgi_daemon_worker () from
/usr/lib/apache2/modules/mod_wsgi.so
#152 0x00007f3fc430bbf9 in wsgi_daemon_thread () from
/usr/lib/apache2/modules/mod_wsgi.so
which does not look too helpful to my untrained eye.
I am using the python 2.5 gdbinit extentions (cp
/usr/share/doc/python2.5/gdbinit ~.gdbinit), and if I am not mistaken -
I was supposed to get python stack trace and not python interpreter
stack trace.
any help will be appreciated.
Hello Omry,
I thing we missed round one...
In any case, it would help to tell us more about your system information
(i.e. what you see in /about when logged in as a user with TRAC_ADMIN
permissions). In particular the Trac version and the mysql bindings
details (built with multithread support or not).
> I was supposed to get python stack trace and not python interpreter
> stack trace.
>
Use pystack command for that. Also, dump the stack for all the threads
in the process.
> any help will be appreciated.
>
>
Looks like an interesting bug ;-) When apache gets "stuck", is the CPU
maxed out?
We somehow have something similar on t.e.o, but with fcgi and postgresql.
-- Christian
> In any case, it would help to tell us more about your system information
> (i.e. what you see in /about when logged in as a user with TRAC_ADMIN
> permissions). In particular the Trac version and the mysql bindings
> details (built with multithread support or not).
>
Trac : 0.11stable-r7763
MySQL: server: "5.0.51a-24+lenny2-log", client: "5.0.51a",
thread-safe: 1
MySQLdb: 1.2.2
>> I was supposed to get python stack trace and not python interpreter
>> stack trace.
>>
>>
>
> Use pystack command for that. Also, dump the stack for all the threads
> in the process.
>
how do I dump stack for all threads? (and btw: how do I even list the
threads?)
when I typed pystack, I get:
(gdb) pystack
Current language: auto; currently asm
and gdb appears to be stuck in a loop. at least, it takes a very long
time to normally return from the call. I ended up ctrl+breaking it.
>> any help will be appreciated.
>>
>>
>>
>
> Looks like an interesting bug ;-) When apache gets "stuck", is the CPU
> maxed out?
> We somehow have something similar on t.e.o, but with fcgi and postgresql.
>
t.e.o = trac.edgewell.org ?
when I reproduce the problem by sieging the timeline - yes, there is an
apache process that uses 100% of the cpu.
but I`m not sure I am reproducing the same problem that regularly gets
my apache stuck.
>> I am starting a second thread with the same name, but it's not exactly
>> the same case:
>> I have trac accessed via mod_wsgi, and pretty much every day - my apache
>> gets stuck and being automatically restarted by monit after a few minutes.
>>
>
> How are you determining that Apache is stuck?
>
It does not respond to a simple http request for an empty (static) file:
http://flux.firefang.net/empty
monit detects that, and restarts apache after a short while.
> What Apache MPM are you using and what configuration are you using for
> that MPM? Ie., what do you have for:
>
I am using apache2-mpm-prefork
relevant configuration:
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 200
MaxRequestsPerChild 0
</IfModule>
# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
StartServers 2
MaxClients 200
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>
Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
> Have you fiddled with any of these defaults?
>
I probably increased the MaxClients settings to 200.
I don't think I changed any of the other values.
>> using siege on the timeline URL gets the server to a bad spot (with 45
>> concurrent clients, I can bring my server down in seconds).
>> I tried to inspect the apache backtrace using gdb , I get this in the
>> first thread:
>>
>> (gdb) bt
>>
>
> As per:
>
> http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_Crashes_With_GDB
>
> Use:
>
> thread apply all bt
>
> This will dump all C thread stacks.
>
> The dump of all C thread stack traces is still useful to me even if of
> no use to others.
>
> Graham
>
I will configure monit to use this command before it restarts apache in
the next time it hangs:
pgrep apache2 | xargs -i bash -c "gdb -ex \"set height 0\" -ex \"thread
apply all bt\" --batch -p {}" > apache.backtrace.txt
it's supposed to dump the backtrace of all apache processes to a file.
hopefully it will contain something useful.