HTTP server threads dying until TG app stops processing requests


Chris Miles

Nov 14, 2007, 6:32:11 AM
to TurboGears
I am seeing some odd behaviour with a TG app I have deployed on some
Linux boxes (CentOS 4.5). (TurboGears version info is pasted below)

The app starts up with the usual complement of 10 (or whatever I set
server.thread_pool to) HTTP server threads (CherryPy WorkerThreads),
along with a few other threads as usual.

After a period of time during which the app has had minimal activity (a
few hits per minute at most), the CherryPy WorkerThreads die off and are
not replaced. This continues until there are no CherryPy
WorkerThreads left, at which point any HTTP requests simply hang
(clients can connect, but receive no response until they timeout). I
have to restart the application to get it accepting requests again.

Anyone seen this sort of behaviour before with TG or CP? I'm going to
continue to debug it, but any hints would be helpful.
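One simple check while debugging (plain Python, nothing TG- or CherryPy-specific; this is just my own sketch) is to dump the live thread names periodically to confirm the WorkerThreads really are disappearing:

```python
import threading

# Plain-Python check (not a TG/CherryPy API): list the names of all
# live threads. Run it periodically (or expose it from a controller)
# to watch the pool of CherryPy WorkerThreads shrink over time.
def live_threads():
    return sorted(t.name for t in threading.enumerate())

names = live_threads()
print(names)
```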


TurboGears Complete Version Information

TurboGears requires:

* TurboGears 1.0.4b1
* cElementTree 1.0.5-20051216
* elementtree 1.2.6
* SQLAlchemy 0.3.11
* TurboKid 1.0.3
* TurboJson 1.0
* TurboCheetah 0.9.5
* simplejson 1.3
* setuptools 0.6c7
* RuleDispatch 0.5a0.dev-r2306
* PasteScript 0.9.7
* FormEncode 0.7.1
* DecoratorTools 1.5
* configobj 4.3.2
* CherryPy 2.2.1
* Cheetah 2.0rc7
* kid 0.9.6
* PyProtocols 1.0a0
* PasteDeploy 0.9.6
* Paste 0.9.7

Toolbox Gadgets

* info (TurboGears 1.0.4b1)
* catwalk (TurboGears 1.0.4b1)
* shell (TurboGears 1.0.4b1)
* designer (TurboGears 1.0.4b1)
* widgets (TurboGears 1.0.4b1)
* admi18n (TurboGears 1.0.4b1)

Identity Providers

* sqlobject (TurboGears 1.0.4b1)
* sqlalchemy (TurboGears 1.0.4b1)

tg-admin Commands

* info (TurboGears 1.0.4b1)
* shell (TurboGears 1.0.4b1)
* quickstart (TurboGears 1.0.4b1)
* update (TurboGears 1.0.4b1)
* sql (TurboGears 1.0.4b1)
* i18n (TurboGears 1.0.4b1)
* toolbox (TurboGears 1.0.4b1)

Visit Managers

* sqlobject (TurboGears 1.0.4b1)
* sqlalchemy (TurboGears 1.0.4b1)

Template Engines

* cheetah (TurboCheetah 0.9.5)
* json (TurboJson 1.0)
* kid (TurboKid 1.0.3)
* genshi-markup (Genshi 0.4.4)
* genshi-text (Genshi 0.4.4)
* genshi (Genshi 0.4.4)

Widget Packages

* file_fields (FileFields 0.1a6.dev-r612)
* tgcaptcha (TGCaptcha 0.11)

TurboGears Extensions

* visit (TurboGears 1.0.4b1)
* identity (TurboGears 1.0.4b1)
* file_server (FileFields 0.1a6.dev-r612)
* tg_media_farm (TGMediaFarm 0.6.2)

Cheers,
Chris Miles

venkatbo

Nov 15, 2007, 2:31:22 PM
to TurboGears
Hi Chris,

On Nov 14, 3:32 am, Chris Miles <miles.ch...@gmail.com> wrote:

>....
> After a period of time, where the app has had minimal activity (a few
> hits per minute at most) the CherryPy WorkerThreads die off and are
> not replaced...

Recently I did some minimal stress testing on the exact same
version, TG 1.0.4b1:
I ran the TG app with 2 simultaneous users over a 24-hour period,
each making page requests every 5 seconds. It ran OK and I did not
see any debug error messages in my logs.

When you mention "a period of time", approximately how long is that?
If it is more than 24 hours, maybe I should run my tests for the
period over which you observe the error case. Please let me know.

Thanks,
/venkat

Chris Miles

Nov 16, 2007, 3:42:39 AM
to Chris Miles, TurboGears
I worked out the solution to this issue. In summary, it involves
making sure that the application redirects stderr _somewhere_
(a terminal or /dev/null). In our case, the Red Hat daemon function
(part of the standard rc.d functions) was starting the TG app in the
background but not redirecting stderr anywhere. I'm not sure where
stderr ended up going (probably to a non-existent terminal), but when
the application attempted to write a traceback about a socket error
to stderr, a new exception was raised that was subsequently not
caught, causing WorkerThread.run() to end and hence the thread to
die. CherryPy does not check how many running WorkerThread threads
exist after starting them up, so it never replaced any that ended
prematurely. Eventually (<24 hours) our TG app would run out of
WorkerThread threads and new connections couldn't be processed (they
would be accepted by the main thread but would just hang indefinitely
until the client timed out).

Even with TG configured to output everything to log files, these
tracebacks are still written to stderr. This happens within
WorkerThread.run(), which simply calls traceback.print_exc() for
unhandled exceptions (in this case we were seeing socket timeout
errors). There's no option to write these to the log files, which I
think could be considered a CherryPy bug.
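To make the failure mode concrete, here is a minimal sketch in plain Python (my own simplification, not CherryPy's actual WorkerThread code) of how an error raised while *reporting* an error kills a worker thread for good:

```python
import sys
import threading
import traceback

class BrokenStderr:
    """Stand-in for a stderr attached to a vanished terminal."""
    def write(self, text):
        raise IOError("lost terminal")
    def flush(self):
        raise IOError("lost terminal")

died_from = []

def worker(jobs):
    # Rough analogue of a worker-thread loop (simplified; not the
    # real CherryPy 2.2 code).
    try:
        for job in jobs:
            try:
                job()                    # handle a "request"
            except Exception:
                traceback.print_exc()    # report it ... to sys.stderr
    except Exception as exc:
        # The report itself blew up: run() ends here, the thread
        # dies, and nothing ever starts a replacement.
        died_from.append(str(exc))

def request():
    raise IOError("socket timeout")      # the original, handled error

old_stderr, sys.stderr = sys.stderr, BrokenStderr()
t = threading.Thread(target=worker, args=([request],))
t.start()
t.join()
sys.stderr = old_stderr
print("thread alive:", t.is_alive(), "- killed by:", died_from)
```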

Our init.d script for this app looked like this (stripped down to only the
important bits):
{{{
. /etc/rc.d/init.d/functions
PROG_RUN=/appdir/start-tgapp.py
PROG_CONF=/appdir/turbogears-prod.cfg
PROG_USER=apache
daemon --user $PROG_USER --check $PROG_RUN "( $PROG_RUN $PROG_CONF & )"
}}}

Our fix was to change the daemon line to:
{{{
daemon --user $PROG_USER --check $PROG_RUN "( $PROG_RUN $PROG_CONF 2>/dev/null & )"
}}}
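An alternative (the log path below is just an example, not what we actually deployed) would be to append stderr to a log file instead of discarding it, so the tracebacks are kept:

```shell
# Variant of the fix: keep the tracebacks instead of throwing them
# away. /var/log/tgapp/ is an assumed path; create it beforehand and
# make sure $PROG_USER can write there.
daemon --user $PROG_USER --check $PROG_RUN \
    "( $PROG_RUN $PROG_CONF >>/var/log/tgapp/stderr.log 2>&1 & )"
```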

Cheers,
Chris

Florent Aide

Nov 17, 2007, 6:54:09 AM
to turbo...@googlegroups.com
On Nov 15, 2007 8:31 PM, venkatbo <acha...@gmail.com> wrote:
>
> Hi Chris,
>
> On Nov 14, 3:32 am, Chris Miles <miles.ch...@gmail.com> wrote:
>
> >....
> > After a period of time, where the app has had minimal activity (a few
> > hits per minute at most) the CherryPy WorkerThreads die off and are
> > not replaced...
>
> Recently, I've did some minimal stress testing on the exact same
> version of TG 1.0.4b1:
> Ran the TG-app with 2 simultaneous users over a 24 hr period,
> each making page requests every 5 secs. Ran ok and did not
> see any debug error msgs in my logs then.

Socket timeouts occur when the client kills the connection in an
unclean fashion. This is the kind of thing your robot does not test
right now but that happens often in the real world.
This issue is a real one, and we should try to lobby for a maintenance
release of CP 2.2 (while waiting for a better alternative, which would
be upgrading to ... hush ... still in the secret lab)
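For anyone who wants to add this to their tests: an unclean disconnect can be simulated from plain Python (a sketch, not part of any TG test suite) by closing a client socket with SO_LINGER set to a zero timeout, which makes close() send a TCP RST instead of a polite FIN:

```python
import socket
import struct

# Sketch: simulate a client that kills the connection uncleanly.
# SO_LINGER with linger-on and a 0-second timeout makes close()
# send an RST rather than the normal FIN/ACK shutdown sequence.

srv = socket.socket()                      # throwaway local "server"
srv.bind(("127.0.0.1", 0))
srv.listen(1)

client = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                  struct.pack("ii", 1, 0))  # linger on, 0s -> RST
client.close()                              # the "unclean" disconnect

conn.settimeout(2)
try:
    data = conn.recv(1024)
    outcome = "clean close" if data == b"" else "data"
except (ConnectionResetError, socket.timeout):
    outcome = "reset"
conn.close()
srv.close()
print(outcome)
```

On most systems the server side sees a connection reset rather than an orderly close, which is exactly the case a polite load-testing robot never exercises.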

Florent.
