wsgi + trac + rogue bots = apache down, round two


Omry Yadan

Sep 28, 2009, 8:20:23 AM
to trac...@googlegroups.com
Hi all,
I am starting a second thread with the same name, but it's not exactly
the same case:
I have trac accessed via mod_wsgi, and pretty much every day my apache
gets stuck and is automatically restarted by monit after a few minutes.

I am using a script that dumps the output of this command into a log
file every 10 seconds, to get a snapshot of what apache was doing when
it got stuck.
links -dump http://localhost/server-status | grep ^[0-9]|awk 'BEGIN
{print "Seconds, PID, State, IP, Domain, TYPE, URL\n--"} $4 !~
/[GCRK_.]/ {print $6, $2, $4, $11, $12, $13 " " $14|"sort -n"}'
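To make the column positions concrete, the same filter can be re-implemented in Python against a made-up scoreboard dump (a sketch only; the sample lines below are invented, and the real script uses the awk one-liner above):

```python
# Re-implementation of the awk filter: drop workers whose state is one of
# G, C, R, K, _ or . (idle/read/keepalive/etc.), then sort by seconds
# spent on the current request. Sample lines follow Apache's
# extended-status column order:
# Srv PID Acc M CPU SS Req Conn Child Slot Client VHost Method URL
SAMPLE = """\
0-1 15408 0/1/1 W 0.10 0 12 0.0 0.00 0.00 127.0.0.1 flux.firefang.net GET /server-status
1-1 15407 0/2/2 _ 0.00 9 3 0.0 0.00 0.00 72.30.161.232 firestats.cc GET /ticket/112
2-1 15380 0/3/3 W 0.20 17 80 0.0 0.00 0.00 72.30.161.232 firestats.cc GET /ticket/481
"""

IDLE_STATES = set("GCRK_.")

def busy_workers(dump):
    rows = []
    for line in dump.splitlines():
        f = line.split()
        if not f or not f[0][:1].isdigit():
            continue                      # mirrors: grep ^[0-9]
        state = f[3]                      # awk's $4
        if state in IDLE_STATES:
            continue
        # seconds ($6), PID ($2), state, client ($11), vhost ($12), request
        rows.append((int(f[5]), f[1], state, f[10], f[11], f[12] + " " + f[13]))
    return sorted(rows)                   # mirrors: sort -n on seconds

for row in busy_workers(SAMPLE):
    print(*row)
# The idle worker (PID 15407, state "_") is filtered out.
```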

the attached file is a snapshot from the latest incident.

Last time I had this issue, the problem was with sqlite connection
pooling. That is *no longer the issue, as I am now using the MySQL
database backend*.

The crash.log file shows that some processes have been running for nearly
1000 seconds, which conflicts with the apache configuration:
---------------------
WSGIDaemonProcess firestats.cc user=omry group=omry threads=50
maximum-requests=500 inactivity-timeout=60 display-name=%{GROUP}
deadlock-timeout=60

<VirtualHost *:80>
WSGIProcessGroup firestats.cc
WSGIApplicationGroup %{GLOBAL}

ServerName firestats.cc
ServerAlias www.firestats.cc
ServerAdmin omry
ErrorLog /home/omry/logs/firestats.cc/error.log

WSGIScriptAlias / /home/omry/www/trac/trac.wsgi
SetEnv trac.env_path /home/omry/www/trac/firestats.cc

RewriteEngine on
RewriteRule ^/robots.txt /home/omry/www/trac/robots.txt [L]
</VirtualHost>
---------------------

WSGIDaemonProcess is set to have inactivity and deadlock timeouts of
60 seconds, so how can a process possibly run for 1000 seconds?

Using siege on the timeline URL gets the server to a bad spot (with 45
concurrent clients, I can bring my server down in seconds).
When I try to inspect the apache backtrace using gdb, I get this in the
first thread:

(gdb) bt
#0 0x00007f3fcd588b66 in poll () from /lib/libc.so.6
#1 0x00007f3fcda54e95 in apr_poll () from /usr/lib/libapr-1.so.0
#2 0x00007f3fc430c53c in wsgi_daemon_main () from
/usr/lib/apache2/modules/mod_wsgi.so
#3 0x00007f3fc430cd39 in wsgi_start_process () from
/usr/lib/apache2/modules/mod_wsgi.so
#4 0x00007f3fc430d339 in wsgi_start_daemons () from
/usr/lib/apache2/modules/mod_wsgi.so
#5 0x00007f3fc430f640 in wsgi_hook_init () from
/usr/lib/apache2/modules/mod_wsgi.so
#6 0x0000000000438cf4 in ap_run_post_config ()
#7 0x0000000000425bbc in main ()

but after switching to other threads I get this:
#0 0x00007f3fcd821bd1 in sem_wait () from /lib/libpthread.so.0
#1 0x00007f3fc406bbc8 in PyThread_acquire_lock (lock=0xcbaf30,
waitflag=128) at ../Python/thread_pthread.h:349
#2 0x00007f3fc4070ec2 in lock_PyThread_acquire_lock (self=0xc3f2a0,
args=<value optimized out>) at ../Modules/threadmodule.c:46
#3 0x00007f3fc4042fcd in PyEval_EvalFrameEx (f=0x2a7cec0,
throwflag=<value optimized out>) at ../Python/ceval.c:3575
#4 0x00007f3fc4044bed in PyEval_EvalCodeEx (co=0x1176cd8,
globals=<value optimized out>, locals=<value optimized out>, args=0x2,
argcount=1, kws=0x2abc5d8, kwcount=0, defs=0x11854a8, defcount=1,
closure=0x0) at ../Python/ceval.c:2838
....
#141 0x00007f3fc4044375 in PyEval_EvalFrameEx (f=0x3e303c0,
throwflag=<value optimized out>) at ../Python/ceval.c:3661
#142 0x00007f3fc4044375 in PyEval_EvalFrameEx (f=0x3ece7d0,
throwflag=<value optimized out>) at ../Python/ceval.c:3661
#143 0x00007f3fc4044bed in PyEval_EvalCodeEx (co=0xe1e210,
globals=<value optimized out>, locals=<value optimized out>,
args=0x35290b0, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0,
closure=0x0) at ../Python/ceval.c:2838
#144 0x00007f3fc3fdb55e in function_call (func=0x1418938, arg=0x3529098,
kw=0x0) at ../Objects/funcobject.c:517
#145 0x00007f3fc3fb82a3 in PyObject_Call (func=0xcbaf30, arg=0x80,
kw=0x0) at ../Objects/abstract.c:1861
#146 0x00007f3fc403d001 in PyEval_CallObjectWithKeywords
(func=0x1418938, arg=0x3529098, kw=0x0) at ../Python/ceval.c:3444
#147 0x00007f3fc430183c in Adapter_run () from
/usr/lib/apache2/modules/mod_wsgi.so
#148 0x00007f3fc430632d in wsgi_execute_script () from
/usr/lib/apache2/modules/mod_wsgi.so
#149 0x00007f3fc430f3cd in wsgi_hook_daemon_handler () from
/usr/lib/apache2/modules/mod_wsgi.so
#150 0x00007f3fc430b6ac in wsgi_process_socket () from
/usr/lib/apache2/modules/mod_wsgi.so
#151 0x00007f3fc430bafa in wsgi_daemon_worker () from
/usr/lib/apache2/modules/mod_wsgi.so
#152 0x00007f3fc430bbf9 in wsgi_daemon_thread () from
/usr/lib/apache2/modules/mod_wsgi.so

which does not look too helpful to my untrained eye.
I am using the python 2.5 gdbinit extensions (cp
/usr/share/doc/python2.5/gdbinit ~/.gdbinit), and if I am not mistaken,
I was supposed to get a Python stack trace, not a Python interpreter
stack trace.

any help will be appreciated.

crash.log

Christian Boos

Sep 28, 2009, 4:43:58 AM
to trac...@googlegroups.com
Omry Yadan wrote:
> Hi all,
> I am starting a second thread with the same name,

Hello Omry,

I think we missed round one...
In any case, it would help to tell us more about your system
(i.e. what you see in /about when logged in as a user with TRAC_ADMIN
permissions), in particular the Trac version and the MySQL bindings
details (built with multithread support or not).


> I was supposed to get python stack trace and not python interpreter
> stack trace.
>

Use the pystack command for that. Also, dump the stack for all the threads
in the process.

> any help will be appreciated.
>
>

Looks like an interesting bug ;-) When apache gets "stuck", is the CPU
maxed out?
We somehow have something similar on t.e.o, but with fcgi and postgresql.

-- Christian

Omry Yadan

Sep 28, 2009, 9:53:16 AM
to trac...@googlegroups.com
Omry Yadan wrote:
>> Hi all,
>> I am starting a second thread with the same name,
>>
>
> Hello Omry,
>
> I think we missed round one...
>
Round one was about a year ago, see this:
http://www.mail-archive.com/search?q=apache+down&l=trac-dev%40googlegroups.com


> In any case, it would help to tell us more about your system information
> (i.e. what you see in /about when logged in as a user with TRAC_ADMIN
> permissions). In particular the Trac version and the mysql bindings
> details (built with multithread support or not).
>

Trac : 0.11stable-r7763
MySQL: server: "5.0.51a-24+lenny2-log", client: "5.0.51a",
thread-safe: 1
MySQLdb: 1.2.2

>> I was supposed to get python stack trace and not python interpreter
>> stack trace.
>>
>>
>
> Use pystack command for that. Also, dump the stack for all the threads
> in the process.
>

how do I dump stack for all threads? (and btw: how do I even list the
threads?)

When I type pystack, I get:

(gdb) pystack
Current language: auto; currently asm

and gdb appears to be stuck in a loop; at least, it takes a very long
time to return from the call. I ended up interrupting it with Ctrl+C.


>> any help will be appreciated.
>>
>>
>>
>
> Looks like an interesting bug ;-) When apache gets "stuck", is the CPU
> maxed out?
> We somehow have something similar on t.e.o, but with fcgi and postgresql.
>

t.e.o = trac.edgewall.org ?

When I reproduce the problem by sieging the timeline: yes, there is an
apache process that uses 100% of the CPU.
But I'm not sure I am reproducing the same problem that regularly gets
my apache stuck.

Graham Dumpleton

Sep 28, 2009, 7:23:52 PM
to Trac Development


On Sep 28, 10:20 pm, Omry Yadan <o...@yadan.net> wrote:
> Hi all,
> I am starting a second thread with the same name, but it's not exactly
> the same case:
> I have trac accessed via mod_wsgi, and pretty much every day - my apache
> gets stuck and being automatically restarted by monit after a few minutes.

How are you determining that Apache is stuck?

What Apache MPM are you using and what configuration are you using for
that MPM? Ie., what do you have for:

# prefork MPM
# StartServers: number of server processes to start
# MinSpareServers: minimum number of server processes which are kept spare
# MaxSpareServers: maximum number of server processes which are kept spare
# MaxClients: maximum number of server processes allowed to start
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_prefork_module>
StartServers 1
MinSpareServers 1
MaxSpareServers 10
MaxClients 150
MaxRequestsPerChild 0
</IfModule>

# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
StartServers 2
MaxClients 150
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>

Also, what do you have for:

#
# Timeout: The number of seconds before receives and sends time out.
#
Timeout 300

#
# KeepAlive: Whether or not to allow persistent connections (more than
# one request per connection). Set to "Off" to deactivate.
#
KeepAlive On

#
# MaxKeepAliveRequests: The maximum number of requests to allow
# during a persistent connection. Set to 0 to allow an unlimited amount.
# We recommend you leave this number high, for maximum performance.
#
MaxKeepAliveRequests 100

#
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 5

Have you fiddled with any of these defaults?

> I am using a script that dumps the output of this command into a log
> file every 10 seconds, to get a snapshot of what apache was doing when
> it got stuck.
> links -dump http://localhost/server-status | grep ^[0-9]|awk 'BEGIN
> {print "Seconds, PID, State, IP, Domain, TYPE, URL\n--"} $4 !~
> /[GCRK_.]/ {print $6, $2, $4, $11, $12, $13 " " $14|"sort -n"}'
>
> the attached file is a snapshot from the latest incident.
>
> last time I had the issue, the problem was with sqlite connection
> pooling. this is *no longer the issue as I am now using mysql database
> backend*.
>
> the crash.log file shows that some processes have been running for nearly
> 1000 seconds, which conflicts with the apache configuration:

No, it doesn't conflict.

Inactivity timeout means the process will be restarted when there have been
no requests for a period of 60 seconds, or if all active requests have
stalled and not read or written any data. Thus, if you had at
least one request every 60 seconds, the process would just keep
running and so could rack up 1000 seconds or more.
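That behaviour can be modelled in a few lines (a toy illustration, not mod_wsgi's actual code): the inactivity clock measures the gap since the last activity, not total runtime, so steady traffic keeps the process alive indefinitely.

```python
# Toy model of an inactivity timeout (NOT mod_wsgi's real implementation):
# the timer restarts on every request, so total process uptime is
# unbounded as long as requests keep arriving within the timeout window.

INACTIVITY_TIMEOUT = 60

def times_out(request_times, now):
    """Return True if the gap since the last activity exceeds the timeout."""
    last_activity = max(request_times) if request_times else 0
    return (now - last_activity) > INACTIVITY_TIMEOUT

# One request every 50 seconds for ~1000 seconds of runtime:
requests = list(range(0, 1001, 50))
print(times_out(requests, now=1000))   # False: never idle for a full 60s
print(times_out([0], now=1000))        # True: idle since t=0
```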

> ---------------------
> WSGIDaemonProcess firestats.cc user=omry group=omry threads=50
> maximum-requests=500 inactivity-timeout=60 display-name=%{GROUP}
> deadlock-timeout=60
>
> <VirtualHost *:80>
>     WSGIProcessGroup firestats.cc
>     WSGIApplicationGroup %{GLOBAL}
>
>     ServerName firestats.cc
>     ServerAlias www.firestats.cc
>     ServerAdmin omry
>     ErrorLog /home/omry/logs/firestats.cc/error.log
>
>     WSGIScriptAlias / /home/omry/www/trac/trac.wsgi
>     SetEnv trac.env_path /home/omry/www/trac/firestats.cc
>
>     RewriteEngine on
>     RewriteRule ^/robots.txt /home/omry/www/trac/robots.txt [L]
> </VirtualHost>
> ---------------------
>
> WSGIDaemonProcess  is set to have inactivity and deadlock timeouts of  
> 60 seconds, so how can a process possibly run for 1000 seconds?

See comment above about inactivity timeout.

The deadlock timeout only applies where a GIL deadlock can be
detected, so if that never occurs the process will not be restarted
because of it. For example, if C code goes into a loop and never
returns, but has released the GIL, there is no deadlock issue.

> using siege on the timeline URL gets the server to a bad spot (with 45
> concurrent clients, I can bring my server down in seconds).
> I tried to inspect the apache backtrace using gdb , I get this in the
> first thread:
>
> (gdb) bt

As per:

http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_Crashes_With_GDB

Use:

thread apply all bt

This will dump all C thread stacks.

The dump of all C thread stack traces is still useful to me even if of
no use to others.

Graham
> [crash.log 11K]
> 0 15408 W 127.0.0.1 flux.firefang.net GET /server-status
> 9 15407 W 72.30.161.232 firestats.cc GET /ticket/112?format=rss
> 17 15380 W 72.30.161.232 firestats.cc GET /ticket/481
> 25 15381 W 213.236.203.237 firestats.cc GET /ticket/849?format=rss
> 26 15377 W 65.55.106.112 firestats.cc GET
> 30 15100 W 216.158.1.200 firestats.cc GET /ticket/432
> 30 15353 W 216.158.1.200 firestats.cc GET /ticket/324
> 38 15351 W 72.30.161.232 firestats.cc GET /ticket/481
> 55 15348 W 216.158.1.200 firestats.cc GET /ticket/314
> 56 15350 W 66.249.67.115 firestats.cc GET /ticket/997
> 58 15349 W 72.30.161.232 firestats.cc GET /ticket/481
> 59 15329 W 72.30.161.232 firestats.cc GET /wiki/WikiStart?version=152
> 78 15076 W 121.101.214.54 firestats.cc GET
> 78 15320 W 121.101.214.54 firestats.cc GET
> 78 15339 W 121.101.214.54 firestats.cc GET
> 79 15311 W 72.30.161.232 firestats.cc GET /ticket/112?format=rss
> 80 15328 W 72.30.161.232 firestats.cc GET /wiki/WikiStart?version=152
> 100 14987 W 72.30.161.232 firestats.cc GET /ticket/112?format=rss
> 101 15300 W 72.30.161.232 firestats.cc GET /wiki/WikiStart?version=152
> 119 15101 W 65.55.106.112 firestats.cc GET
> 121 15099 W 72.30.161.232 firestats.cc GET /ticket/112?format=rss
> 122 14970 W 72.30.161.232 firestats.cc GET /wiki/WikiStart?version=152
> 122 15096 W 216.158.1.200 firestats.cc GET /ticket/1112
> 142 15095 W 72.30.161.232 firestats.cc GET /ticket/112?format=rss
> 143 15094 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 148 15084 W 216.158.1.200 firestats.cc GET /ticket/886
> 149 15068 W 216.158.1.200 firestats.cc GET /ticket/1012
> 158 15021 W 94.142.63.22 firestats.cc GET /
> 163 14966 W 72.30.161.232 firestats.cc GET /ticket/481
> 164 14964 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 175 15060 W 216.158.1.200 firestats.cc GET /ticket/768
> 183 14965 W 216.158.1.200 firestats.cc GET /ticket/974
> 184 14929 W 216.158.1.200 firestats.cc GET /wiki/ip2c
> 184 14952 W 72.30.161.232 firestats.cc GET /ticket/481
> 185 15022 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 189 15020 W 67.195.114.48 firestats.cc GET /wiki/Standalone?version=13
> 196 14910 W 65.55.105.205 firestats.cc GET /ticket/106
> 198 14989 W 220.194.55.47 firestats.cc GET /wiki/Joomla?action=diff&version=12
> 198 14990 W 79.176.44.160 firestats.cc GET /
> 205 14985 W 72.30.161.232 firestats.cc GET /ticket/481
> 206 14972 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 226 14939 W 72.30.161.232 firestats.cc GET /ticket/481
> 226 14954 W 65.55.105.205 firestats.cc GET /ticket/106
> 227 14953 W 72.30.161.232 firestats.cc GET /ticket/539
> 236 14938 W 216.158.1.200 firestats.cc GET /ticket/738
> 238 14937 W 209.85.238.119 firestats.cc GET /report/7?format=rss&USER=Raptor
> 239 14863 W 220.194.55.47 firestats.cc GET /wiki/Joomla?action=diff&version=12
> 248 14892 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 248 14926 W 72.30.161.232 firestats.cc GET /ticket/539
> 252 14918 W 216.158.1.200 firestats.cc GET /ticket/944
> 257 14915 W 216.158.1.200 firestats.cc GET /ticket/836
> 258 14912 W 65.55.105.207 firestats.cc GET /ticket/397?action=diff&version=5
> 259 14911 W 67.195.114.48 firestats.cc GET /wiki/Standalone?version=13
> 268 14901 W 216.158.1.200 firestats.cc GET /about
> 269 14864 W 72.30.161.232 firestats.cc GET /ticket/539
> 269 14893 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 277 14878 W 91.121.106.59 firestats.cc HEAD /
> 280 14817 W 67.195.114.48 firestats.cc GET /wiki/Standalone?version=13
> 280 14891 W 220.194.55.47 firestats.cc GET /wiki/Joomla?action=diff&version=12
> 288 14820 W 65.55.105.207 firestats.cc GET /ticket/397?action=diff&version=5
> 289 14879 W 72.30.161.232 firestats.cc GET /ticket/539
> 290 14877 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 301 14819 W 67.195.114.48 firestats.cc GET /wiki/Standalone?version=13
> 308 14758 W 91.121.106.59 firestats.cc HEAD /
> 311 14799 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 311 14848 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 318 14808 W 216.158.1.200 firestats.cc GET /ticket/806
> 321 14739 W 67.195.114.48 firestats.cc GET /wiki/Standalone?version=13
> 324 14798 W 220.194.55.47 firestats.cc GET /wiki/Joomla?action=diff&version=12
> 332 14712 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 332 14787 W 72.30.161.232 firestats.cc GET /ticket/481
> 338 14760 W 216.158.1.200 firestats.cc GET /ticket/904
> 340 14759 W 91.121.106.59 firestats.cc HEAD /
> 343 14815 W 65.55.105.205 firestats.cc GET /ticket/40
> 353 14806 W 72.30.161.232 firestats.cc GET /ticket/481
> 353 14807 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 371 14738 W 91.121.106.59 firestats.cc HEAD /
> 374 14509 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 374 14740 W 72.30.161.232 firestats.cc GET /ticket/481
> 374 14750 W 65.55.105.205 firestats.cc GET /ticket/40
> 388 14510 W 67.195.114.48 firestats.cc GET /wiki/Standalone?version=13
> 394 14716 W 72.30.161.232 firestats.cc GET /ticket/481
> 395 14715 W 72.30.161.232 firestats.cc GET /ticket/539
> 398 14714 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 416 14492 W 72.30.161.232 firestats.cc GET /ticket/539
> 416 14512 W 65.55.207.133 firestats.cc GET /ticket/515
> 419 14511 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 437 14471 W 220.194.55.47 firestats.cc GET /wiki/Drupal?action=diff&version=9
> 437 14472 W 72.30.161.232 firestats.cc GET /ticket/539
> 439 14434 W 79.176.44.160 firestats.cc GET /
> 440 14463 W 72.30.161.232 firestats.cc GET /ticket/795?version=0
> 449 14491 W 65.55.207.93 firestats.cc GET /wiki/WikiStart?version=167
> 458 14424 W 72.30.161.232 firestats.cc GET /ticket/539
> 462 14448 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 475 14445 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=3
> 478 14408 W 220.194.55.47 firestats.cc GET /wiki/Drupal?action=diff&version=9
> 483 14436 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 496 14376 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=3
> 504 14435 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 507 14375 W 65.55.207.133 firestats.cc GET /ticket/515
> 508 14423 W 65.55.207.118 firestats.cc GET /ticket/509
> 517 14407 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=3
> 518 14406 W 220.194.55.47 firestats.cc GET /wiki/Drupal?action=diff&version=9
> 524 14366 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 534 14374 W 209.85.238.119 firestats.cc GET /ticket/835?format=rss
> 534 14381 W 174.129.236.193 firestats.cc GET /
> 538 14364 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=3
> 540 14361 W 65.55.207.93 firestats.cc GET /wiki/WikiStart?version=167
> 541 14382 W 220.194.55.47 firestats.cc GET /wiki/TracModPython
> 545 14380 W 65.55.105.197 firestats.cc GET /wiki/Features?action=diff&version=13
> 546 14379 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 559 14351 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=2
> 559 14365 W 220.194.55.47 firestats.cc GET /wiki/Drupal?action=diff&version=9
> 565 14330 W 216.158.1.200 firestats.cc GET /export/1984/trunk/firestats/php/page-database.php
> 567 14353 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 570 14352 W 88.131.106.13 firestats.cc GET /query?status=reopened&milestone=Later
> 571 14350 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 574 14342 W 216.158.1.200 firestats.cc GET /wiki/InterTrac
> 575 14306 W 65.55.105.197 firestats.cc GET /wiki/Features?action=diff&version=13
> 577 14315 W 216.158.1.200 firestats.cc GET /ticket/275
> 580 14331 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=2
> 582 14329 W 220.194.55.47 firestats.cc GET /wiki/TracModPython
> 588 14328 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 592 14292 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 599 14302 W 65.55.207.118 firestats.cc GET /ticket/509
> 600 14307 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=2
> 605 14266 W 88.131.106.13 firestats.cc GET /query?status=reopened&milestone=Later
> 609 14303 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 613 14291 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 621 14232 W 67.195.114.48 firestats.cc GET /wiki/WhoisProviders?version=2
> 622 14275 W 65.55.105.205 firestats.cc GET /ticket/191
> 623 14258 W 220.194.55.47 firestats.cc GET /wiki/TracModPython
> 626 14283 W 216.158.1.200 firestats.cc GET /ticket/255
> 629 13997 W 216.158.1.200 firestats.cc GET /ticket/147
> 638 14274 W 88.131.106.13 firestats.cc GET /query?status=reopened&milestone=Later
> 643 14250 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 653 14252 W 65.55.105.205 firestats.cc GET /ticket/191
> 655 14251 W 216.158.1.200 firestats.cc GET /ticket/1110
> 664 14242 W 220.194.55.47 firestats.cc GET /wiki/TracModPython
> 666 14241 W 65.55.105.197 firestats.cc GET /wiki/ChangeLog?version=67
> 667 14240 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 670 14207 W 88.131.106.13 firestats.cc GET /query?status=reopened&milestone=Later
> 679 14216 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 680 14208 W 79.176.44.160 firestats.cc GET /
> 682 14224 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 688 14223 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 696 13952 W 65.55.105.197 firestats.cc GET /wiki/ChangeLog?version=67
> 700 13975 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 703 13928 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 708 14193 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 721 14005 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 724 13958 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 741 13927 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 745 13917 W 72.30.161.232 firestats.cc GET /wiki/Mutex?action=history
> 771 13880 W 216.158.1.200 firestats.cc GET /wiki/trac
> 778 13949 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 785 13940 W 216.158.1.200 firestats.cc GET /ticket/797
> 799 9831 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 808 13916 W 67.195.114.48 firestats.cc GET /wiki/CommercialUse?version=10
> 811 13915 W 65.55.105.197 firestats.cc GET /wiki/Donate?version=4
> 818 13837 W 220.194.55.45 firestats.cc GET /report/-1?asc=1&sort=report&USER=anonymous
> 820 13848 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 822 13896 W 65.55.106.155 firestats.cc GET /wiki/MultipleSites
> 822 13897 W 216.158.1.200 firestats.cc GET /ticket/885
> 830 13893 W 216.158.1.200 firestats.cc GET /ticket/669
> 840 13845 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 842 13870 W 65.55.105.197 firestats.cc GET /wiki/Donate?version=4
> 845 13804 W 216.158.1.200 firestats.cc GET /ticket/875
> 859 13793 W 220.194.55.45 firestats.cc GET /report/-1?asc=1&sort=report&USER=anonymous
> 881 13827 W 216.158.1.200 firestats.cc GET /ticket/963
> 884 13826 W 216.158.1.200 firestats.cc GET /ticket/855
> 900 13802 W 220.194.55.45 firestats.cc GET /report/-1?asc=1&sort=report&USER=anonymous
> 900 9834 W 216.158.1.200 firestats.cc GET /ticket/639
> 907 13792 W 72.30.161.232 firestats.cc GET /ticket/881?format=tab
> 909 13803 W 216.158.1.200 firestats.cc GET /ticket/845
> 916 13801 W 216.158.1.200 firestats.cc GET /ticket/629
> 921 13788 W 79.176.44.160 firestats.cc GET /
> 921 13791 W 65.55.51.110 firestats.cc GET
> 924 13780 W 65.55.106.155 firestats.cc GET /wiki/MultipleSites
> 927 9452 W 216.158.1.200 firestats.cc GET /ticket/943
> 932 11826 W 216.158.1.200 firestats.cc GET /ticket/835
> 941 11827 W 220.194.55.45 firestats.cc GET /report/-1?asc=1&sort=report&USER=anonymous
> 943 9838 W 216.158.1.200 firestats.cc GET /ticket/933
> 966 9489 W 65.55.105.197 firestats.cc GET /wiki/WikiStart?version=165
> 997 11825 W 65.55.105.197 firestats.cc GET /wiki/WikiStart?version=165
> 999 11842 W 216.158.1.200 firestats.cc GET /ticket/805
> 1073 11844 W 65.55.51.110 firestats.cc GET
> Mon Sep 28 05:42:06 UTC 2009

Omry Yadan

Oct 2, 2009, 5:10:24 PM
to trac...@googlegroups.com
Hi Graham,


>> I am starting a second thread with the same name, but it's not exactly
>> the same case:
>> I have trac accessed via mod_wsgi, and pretty much every day - my apache
>> gets stuck and being automatically restarted by monit after a few minutes.
>>
>
> How are you determining that Apache is stuck?
>

It does not respond to a simple http request for an empty (static) file:
http://flux.firefang.net/empty
monit detects that, and restarts apache after a short while.

> What Apache MPM are you using and what configuration are you using for
> that MPM? Ie., what do you have for:
>

I am using apache2-mpm-prefork.
Relevant configuration:
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 200
MaxRequestsPerChild 0
</IfModule>

# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
StartServers 2
MaxClients 200
MinSpareThreads 25
MaxSpareThreads 75
ThreadsPerChild 25
MaxRequestsPerChild 0
</IfModule>


Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15

> Have you fiddled with any of these defaults?
>

I probably increased the MaxClients setting to 200.
I don't think I changed any of the other values.

>> using siege on the timeline URL gets the server to a bad spot (with 45
>> concurrent clients, I can bring my server down in seconds).
>> I tried to inspect the apache backtrace using gdb , I get this in the
>> first thread:
>>
>> (gdb) bt
>>
>
> As per:
>
> http://code.google.com/p/modwsgi/wiki/DebuggingTechniques#Debugging_Crashes_With_GDB
>
> Use:
>
> thread apply all bt
>
> This will dump all C thread stacks.
>
> The dump of all C thread stack traces is still useful to me even if of
> no use to others.
>
> Graham
>

I will configure monit to run this command before it restarts apache
the next time it hangs:

pgrep apache2 | xargs -i bash -c "gdb -ex \"set height 0\" -ex \"thread
apply all bt\" --batch -p {}" > apache.backtrace.txt

it's supposed to dump the backtrace of all apache processes to a file.
hopefully it will contain something useful.

Omry Yadan

Oct 4, 2009, 9:48:44 AM
to trac...@googlegroups.com
apache backtraces attached.
apache.backtrace.zip

Graham Dumpleton

Oct 5, 2009, 12:11:10 AM
to Trac Development
Most odd.

All the mod_wsgi daemon processes are waiting for a request to arrive.
They aren't actively handling any.

All the Apache server child processes are waiting for initial
handshaking response from mod_wsgi daemon processes after having just
sent a request to them.

To me this looks a bit like the epoll() implementation on your system
is broken which is causing a failure of mod_wsgi daemon processes to
detect a new request has been sent.

I would suggest that you rebuild Apache/APR such that use of
epoll() is disabled and it instead falls back to either poll() or
select(). Unfortunately, I am not exactly sure how you can override
'configure' for Apache/APR to stop it using epoll() if it exists.

That is about the only pointer I can give.
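Before rebuilding anything, one cheap way to sanity-check the kernel's epoll() path is a minimal probe like the following (a sketch, not from the thread; Linux-only, since Python's select.epoll wraps the native syscall):

```python
# Minimal epoll() smoke test: write to one end of a socketpair and check
# that epoll reports the other end readable. If this misbehaves, the
# kernel/libc epoll path is suspect; if it passes, look elsewhere.
import select
import socket

a, b = socket.socketpair()
ep = select.epoll()
ep.register(b.fileno(), select.EPOLLIN)

# Nothing written yet: expect no ready events.
assert ep.poll(timeout=0.1) == []

a.sendall(b"ping")
events = ep.poll(timeout=1.0)
assert events and events[0][0] == b.fileno()
assert events[0][1] & select.EPOLLIN

ep.close(); a.close(); b.close()
print("epoll looks sane")
```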

What versions of Apache/APR are you using?

Graham

Omry Yadan

Oct 6, 2009, 11:11:54 AM
to trac...@googlegroups.com
Either that, or gdb did not show the stack for the frozen processes.


my kernel version is: 2.6.26-bpo.1-amd64
my apache version is: 2.2.9-10
libapr version is: 1.2.12-5+lenny1

Graham Dumpleton

Oct 6, 2009, 10:54:41 PM
to Trac Development, mod...@googlegroups.com


On Oct 7, 2:11 am, Omry Yadan <o...@yadan.net> wrote:
> Either that, or the gdb did not show the stack for the frozen processes.

I doubt gdb would be having an issue. I am also pretty confident the
mod_wsgi daemon processes weren't frozen either. The Apache server child
processes were the ones hanging, waiting.

Now, you don't actually say what version of mod_wsgi you are using, and
you also don't say whether you are compiling from source code or using
a prebuilt binary.

If using source code, can you apply the following patch, recompile and
reinstall. The patch is against mod_wsgi 2.6 source code, but should
apply cleanly against 2.5 as well.

The patch moves the setting of an internal server timeout on the socket
to the daemon process to an earlier phase of dealing with that connection.
The original point of the timeout was to avoid mutual deadlock when
more than a socket buffer's worth of data was sent by the client and the
WSGI application didn't read it all in and consume it, but replied with
a response also larger than the socket buffer size. The timeout would
instead cause the request to fail.

Your problem isn't this, but, as I said, an apparent failure of
epoll(), or maybe UNIX socket connect. Moving the timeout will help detect
such a failure as you are seeing and at least cause Apache server
child processes to time out on the connection when the mod_wsgi daemon
process doesn't do anything. Overall this may not help, as it may be
the case that subsequent requests will just suffer the same fate, but
if it is a temporary glitch, it may well recover. The value of this
timeout is controlled by the Apache 'Timeout' directive. It defaults to
300 seconds. As this timeout controls many things in Apache, you do need
to be a bit careful in changing it, but dropping to 60 seconds
wouldn't be over the top.
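As a concrete illustration (a sketch, not a directive from the thread), lowering that value in the main Apache configuration would look like:

```apache
# Default is 300 seconds; 60 seconds is the suggested floor above.
# This timeout also governs general request I/O, so change it with care.
Timeout 60
```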

It may be the case that you will want to try building Apache yourself
from source code and making it use the APR which is bundled with
Apache, rather than allowing it to use the system one, in case the
issue is a mismatch in versions of the precompiled Apache and APR/APU
libraries being used.

Given the strangeness of this issue, I would suggest the discussion be
moved over to the mod_wsgi list on Google Groups. I have cc'd this
email there.

Index: mod_wsgi.c
===================================================================
--- mod_wsgi.c (revision 1446)
+++ mod_wsgi.c (working copy)
@@ -9568,6 +9568,9 @@
     apr_os_pipe_put_ex(&tmpsock, &daemon->fd, 1, r->pool);
     apr_pool_cleanup_kill(r->pool, daemon, wsgi_close_socket);
 
+    apr_file_pipe_timeout_get(tmpsock, &timeout);
+    apr_file_pipe_timeout_set(tmpsock, r->server->timeout);
+
     /* Setup bucket brigade for reading response from daemon. */
 
     bbin = apr_brigade_create(r->pool, r->connection->bucket_alloc);
@@ -9662,6 +9665,9 @@
     apr_os_pipe_put_ex(&tmpsock, &daemon->fd, 1, r->pool);
     apr_pool_cleanup_kill(r->pool, daemon, wsgi_close_socket);
 
+    apr_file_pipe_timeout_get(tmpsock, &timeout);
+    apr_file_pipe_timeout_set(tmpsock, r->server->timeout);
+
     apr_brigade_destroy(bbin);
 
     bbin = apr_brigade_create(r->pool, r->connection->bucket_alloc);
@@ -9687,9 +9693,6 @@
 
     bbout = apr_brigade_create(r->pool, r->connection->bucket_alloc);
 
-    apr_file_pipe_timeout_get(tmpsock, &timeout);
-    apr_file_pipe_timeout_set(tmpsock, r->server->timeout);
-
     do {
         apr_bucket *bucket;