Serious lock issue in Turbogears app


Richard Clark

Dec 11, 2007, 5:47:46 AM
to TurboGears
Heyas. Got a serious problem with my latest TG app. I've delivered
half a dozen commercial TG apps with no problems, so I don't think it's
a newbie issue. I'm basically looking for ideas on how to diagnose it,
and how to mitigate it.

After a random interval of running, anything between 5 minutes and a day,
the app concerned will stop responding to requests - i.e., all in-progress
or new requests simply hang. An strace indicates the process is waiting
on a futex, which suggests some kind of lock-acquisition jam-up.

This issue appeared mysteriously. The first occurrence prompted
reverting the latest couple of commits, which did not resolve it. The
site has been doubling its load pretty much every week, so it is
possible the problem only became visible once the load (and thus the
chance of the triggering sequence of events) became sufficiently high.

Platform:

Turbogears 1.0.3
SQLObject 9.2
Turbomail 1.2

on Ubuntu.

Diagnosis so far:

- strace indicates a futex wait with no other activity.
- I am unable to perform a thread dump at the point of locking; I cannot
  even shut down TurboGears. Hitting Ctrl-C prints the signal-handler
  message, then hangs. A hard kill is necessary.
- Thread dumps during normal operation show no issues; threads are not
  dying or slowly locking up. The failure appears to be spontaneous.
- No common requests appear to be associated with the failure.
- I am not doing anything tricky wrt threading; I have no thread
  handlers of my own, nor do I access internal functions from anywhere.

Theory so far:

That a lock is being acquired and never released appears to be a given.
It is not a database deadlock; the database logs are not reporting any
issues. My current favorite theory is that the User table, which is
used almost everywhere on the site, is somehow being locked within
SQLObject and never unlocked again - but it's just a theory.

What I need:

Ideas, mostly. How would you go about debugging this? Keep in mind I
have not been able to replicate the error at all on my test platform;
it appears to be associated with volume, and the complexity of the site
means creating a realistic user simulation is simply too hard.

Suggestions as to how to mitigate the issue in the immediate term
would be good as well. Does anyone have suggestions for something that
can automatically kill and restart the service in the event of failure
without me having to code it?

Diez B. Roggisch

Dec 11, 2007, 7:15:40 AM
to turbo...@googlegroups.com

I'm sorry that I can't offer any advice - I just wanted to confirm that I
have observed a similar thing with my TG 1.0.x app. It has been running for
the better part of a year now, with low to medium load (it's mainly an
in-house admin app) - but every now and then (e.g. just yesterday) it will
simply grind to a halt.

I haven't been able to diagnose it so far, but am of course extremely
interested in a solution, both as a committer and a user of TG.

My app also uses SO & Identity + TurboMail.

About the ideas... well, you already found out that locking seems to be
the issue. The obvious thing to do would be to identify all the locks
that are acquired and add debug logging around them, hoping that the log
statements reveal the responsible parts of the code - not much of a
suggestion, though...

Diez

Christopher Arndt

Dec 11, 2007, 9:08:26 AM
to turbo...@googlegroups.com
Richard Clark wrote:
> Suggestions as to how to mitigate the issue in the immediate term
> would be good as well. Does anyone have suggestions for something that
> can automatically kill and restart the service in the event of failure
> without me having to code it?

A rather hackish solution I sometimes use is running a script through
cron every 5 minutes that does the following:

- get the front page of your webapp with urllib and
- catch timeout exceptions
- check the HTTP status code
- check the response time
- search the response text for anything indicating an error
- if there are any problems, restart the webapp*

All in all, maybe 20 lines of code:

import time
import urllib

# FOO_URL, MAX_RESPONSE_TIME, log() and restart_service() are assumed
# to be defined elsewhere in the script.

class ServiceException(Exception):
    pass

try:
    try:
        start_time = time.time()
        ret = urllib.urlopen(FOO_URL)
    except IOError, exc:
        raise ServiceException("Foo server not responding")
    response_time = time.time() - start_time
    log("Response time: %.2f sec." % response_time)
    response = ret.read()
    if response_time > MAX_RESPONSE_TIME:
        raise ServiceException("Response time too high.")
    if "Some error message" in response:
        raise ServiceException("'Some error' error occurred.")
except ServiceException, exc:
    log(exc)
    restart_service()

* all my webapps are running under supervisor, so I just do
"supervisorctl restart myapp"

Chris

n-org

Dec 11, 2007, 6:57:41 PM
to TurboGears
While I've never run into the problem, if you have access to an
OpenSolaris box (or an OS X Leopard box), I'd highly recommend DTrace.

I've used it with other systems to find this sort of problem quickly
and easily (on production systems!) in the past. Other than Python,
DTrace is the most revolutionary tool I've learned in the past year.

Graham Dumpleton

Dec 11, 2007, 7:28:31 PM
to TurboGears
FWIW, in mod_wsgi 2.0, when running a Python web application in daemon
mode, there is both deadlock-timeout detection and inactivity-timeout
detection. The deadlock detection acts as a failsafe for when some
third-party extension module blocks indefinitely without releasing the
Python GIL. The inactivity timeout can be used to detect a lack of
request processing or input/output. This applies when no requests are
arriving and can be used to automatically recycle the process to reclaim
memory, or to detect when ordinary thread-programming deadlocks arise
that leave all request-handler threads blocked. In all cases the process
is automatically shut down and restarted, unblocking the application
without manual intervention. Enabling it only requires an extra option
in the Apache/mod_wsgi configuration - no separate cron jobs or
scripts. :-)
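
For reference, a hypothetical Apache/mod_wsgi snippet (process name and
paths are placeholders; deadlock-timeout and inactivity-timeout are the
options Graham describes, given in seconds):

```apache
# Placeholder names and paths; tune the timeouts to your app.
WSGIDaemonProcess myapp processes=2 threads=15 \
    deadlock-timeout=60 inactivity-timeout=300
WSGIProcessGroup myapp
WSGIScriptAlias / /srv/myapp/app.wsgi
```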

Graham

Chris Miles

Dec 12, 2007, 5:02:03 AM
to TurboGears
Review this thread and my conclusion.
http://groups.google.com/group/turbogears/browse_thread/thread/c0cfb73cab349a73

Basically, I was having a similar problem with a TG 1.0 app and found
that it was actually running out of HTTP server threads. The symptoms
were the same as yours, and I originally suspected a deadlock and
examined strace output, but despite the futex activity that was a red
herring. Just check how many threads the app starts with and watch the
thread count over time; if it drops, there's most probably a problem.
You can also add some debugging to display all the threads from within
the app (which I did) to see how many HTTP server threads are actually
left.
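
A minimal way to display the live threads from within a Python app is a
sketch like the following (how you expose it, e.g. via a debug
controller, is up to the app):

```python
import threading

def dump_threads():
    """Print every live thread; useful for watching the worker count over time."""
    threads = sorted(threading.enumerate(), key=lambda t: t.name)
    for t in threads:
        print("%-30s daemon=%s alive=%s" % (t.name, t.daemon, t.is_alive()))
    return len(threads)

count = dump_threads()
```

Call it periodically (or from a URL) and compare the count against the
number of HTTP server threads the app started with.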

Cheers,
Chris Miles

kevin...@gmail.com

Dec 12, 2007, 8:09:42 AM
to TurboGears
Hi,

Just want to chime in and say "me too". My problems were first
documented here:

http://groups.google.com/group/turbogears/browse_thread/thread/dd29a7ba53f5f798/919b25056247b5b4?lnk=st&q=stuart+turbogears+hang#919b25056247b5b4

and they've never really gone away.

I use SO + Identity + TurboMail

Will have more to post on this issue later.

But I am dead keen to resolve this. We should certainly pool our
efforts.

Stuart

Florent Aide

Dec 12, 2007, 8:28:15 AM
to turbo...@googlegroups.com
On Dec 12, 2007 2:09 PM, kevin...@gmail.com <kevin...@gmail.com> wrote:
>
> Hi,
>
> Just want to chime in and say "me too". My problems were first
> documented here:
>
> http://groups.google.com/group/turbogears/browse_thread/thread/dd29a7ba53f5f798/919b25056247b5b4?lnk=st&q=stuart+turbogears+hang#919b25056247b5b4
>
> and they've never really gone away.
>
> I use SO + Identity + TurboMail
>
> Will have more to post on this issue later.
>
> But I am dead keen to resolve this. We should certainly pool our
> efforts.
>

See my last post in this thread:

http://groups.google.com/group/turbogears/browse_thread/thread/c0cfb73cab349a73

The issue is now either to fix CP 2.2 or upgrade to CP 3 ;-) In the
meantime we need to provide deployment instructions for the right setup,
as Chris Miles pointed out...

Fixing CP 2.2 would mean lobbying Robert and friends to issue a
maintenance release of really old software they no longer "support";
this could prove more difficult to obtain than it seems. (I know how
much work a release costs, and I can assure you it is not something you
do lightly.)
Upgrading to CP 3 will not happen in the 1.0.x lifetime. It may happen
in 1.1, though that is still vaporware at this stage (I need to really
break the config system in order to be able to move that part, and that
means time...).

Florent.

Richard Clark

Dec 12, 2007, 4:38:12 PM
to TurboGears
@florent I saw your thread; my issue is not related. I track socket
timeouts, we don't get many, and they don't correlate with the
application failure. In addition, we are not losing threads over time -
we have the full complement at failure; they're just all locked up.


Florent Aide

Dec 13, 2007, 4:09:28 AM
to turbo...@googlegroups.com
On Dec 12, 2007 10:38 PM, Richard Clark <richar...@gmail.com> wrote:
>
> @florent I saw your thread, my issue is not related. I track socket
> timeouts, we don't get many and they don't correlate with the
> application failure. In addition, we are not losing threads over time,
> we have the full complement at failure, they're just all locked up.

Ok. Let's pool our efforts. Tell us what you find in your next observations.

Florent.

kevin...@gmail.com

Dec 13, 2007, 7:08:38 AM
to TurboGears
My issue seems to be the same as yours, Richard. I've looked at the
locks in PostgreSQL at the time of failure, and there are still a
bunch of locks being held by a bunch of open connections, presumably
held by a bunch of active threads.

What we need is a means of replicating this "in the lab", as it were.
Have you written any test robots to hammer your site with traffic and
all kinds of bad behaviour? I shall have to do this with mine.

Also, it's suspicious that your package configuration is the same as
mine. My problems have occurred across many versions of TG and SO, so
we can rule out any version-specific problems. Similarly, I've seen
the problem on Linux and BSD. So it's either a long-standing flaw in
TG or SO, or it's a problem in TurboMail. Since lots of people use
TG, and only a few people are reporting these problems, I suspect
TurboMail is the culprit. I suspect it is somehow hanging threads
during connection attempts, and the threads are getting stuck in the
middle of serving a page (leaving the Postgres transaction locks in
place).

So, any automated stress test should incorporate the use of TurboMail.

What DB are you using?

Stuart

kevin...@gmail.com

Dec 13, 2007, 7:41:58 AM
to TurboGears
Hi Richard,

How many hits is your site getting per day?

Thanks,

Stuart



kevin...@gmail.com

Dec 16, 2007, 2:17:37 AM
to TurboGears
Come on people. Don't let this thread die. It's a very important
issue.

Stuart



Richard Clark

Dec 16, 2007, 2:50:38 AM
to TurboGears
I'm not entirely sure how many hits we're doing at the moment, in any
useful sense. Only dynamic page views count, I guess: roughly 30,000 a
day, not counting bots, scripts, etc. I'd say the app server fails
roughly 10 times a day at that load, though not evenly spread across
the day.

Richard Clark

Dec 16, 2007, 2:51:50 AM
to TurboGears
I'm using postgres, and have been unable to identify a turbomail-
related lock failure at this point.

Alice McGregor

Dec 16, 2007, 5:06:09 AM
to TurboGears
On Dec 15, 11:51 pm, Richard Clark <richard.cl...@gmail.com> wrote:
> I'm using postgres, and have been unable to identify a turbomail-related lock failure at this point.

Anything I can fix easily and quickly? ;)

-- Alice.

Alberto Valverde

Dec 16, 2007, 5:59:47 AM
to turbo...@googlegroups.com
kevin...@gmail.com wrote:
> My issue seems to be the same as yours Richard. I've looked at the
> locks in PostgreSQL at the time of failure, and there are still a
> bunch of locks being held by a bunch of open connections, presumably
> being held by a bunch of active threads.

Hmm. This reminds me of an issue I had in a web.py app in which I used
SA. The problem was due to unclosed connections when issuing non-ORM
queries from the transaction's connection. This soon depleted all the
connections in the pool, which caused further attempts to check out a
connection to hang forever, waiting for one to become available.

Maybe the issue is caused by some particular sequence of events that,
probably after an exception, leaves unclosed connections behind?

>
> What we need is a means of replicating this "in the lab", as it were.
> Have you written any test robots to hammer your site with traffic and
> all kinds of bad behaviour? I shall have to do this with mine.
>
> Also, it's suspicious that your package configuration is the same as
> mine. My problems have occurred over many versions of TG and SO, so
> we can rule out any version-specific problems. Similarly, I've seen
> the problem in Linux and BSD. So, it's either a long-standing flaw in
> TG or SO, or it's a problem in TurboMail. Since lots of people use
> TG, and only a few people are reporting these problems, I suspect
> TurboMail is the culprit. I suspect it's somehow hanging threads at
> connection attempts, and the threads are getting stuck in the middle
> of a page serve (leaving the Postgres transaction locks in place).
>
> So, any automated stress test should incorporate the use of TurboMail.

IMHO TurboMail doesn't look like the culprit, since it uses its own
thread pool to deliver mail. It doesn't seem likely to me that it could
cause request-serving threads to hang, since queueing mail never blocks
(unless you are using a bounded queue with no mail threads consuming
the work in it, which, IIRC, is not the default behavior)

Alberto

kevin...@gmail.com

Dec 17, 2007, 4:09:24 AM
to TurboGears
> Hmm. This reminds me of an issue I had in a web.py app in which I used
> SA. The problem was due to unclosed connections when issuing non-ORM
> queries from the transaction's connection. This soon depleted all the
> connections in the pool, which caused further attempts to check out a
> connection to hang forever, waiting for one to become available.

** Richard (and others who are having this problem), are you doing raw
SQL on your ORM connection? I know I am.

Stuart

Alberto Valverde

Dec 17, 2007, 5:53:48 AM
to turbo...@googlegroups.com

Does explicitly closing the cursor when you're done with the results
solve it? The safest would be in a try, finally block to ensure it is
closed regardless of any error. This was what I did to solve my issue.

Alberto

Mark Ramm

Dec 18, 2007, 10:00:14 PM
to turbo...@googlegroups.com
> > ** Richard (and others who are having this problem), are you doing raw
> > SQL on your ORM connection? I know I am.
>
> Does explicitly closing the cursor when you're done with the results
> solve it? The safest would be in a try, finally block to ensure it is
> closed regardless of any error. This was what I did to solve my issue.

Anybody still experiencing this issue?

kevin...@gmail.com

Dec 19, 2007, 7:26:06 AM
to TurboGears
> Does explicitly closing the cursor when you're done with the results
> solve it? The safest would be in a try, finally block to ensure it is
> closed regardless of any error. This was what I did to solve my issue.

** I'm not quite sure how to do that. I execute my query as follows:

results = hub.getConnection().queryAll(query_string)

Not sure where the cursor associated with that is hiding, or how to
close it. Can I trouble you for some advice?

Thanks,

Stuart

kevin...@gmail.com

Dec 19, 2007, 7:28:05 AM
to TurboGears
> Anybody still experiencing this issue?

** I would say yes. It's hard for me to tell, since it only happens
under the load of a real site, but I want my real site to stay up.
And so I'm constantly trying to relieve the load that's causing it to
crash :) Makes debugging very difficult.

I will keep the group advised.

Stuart

Alberto Valverde

Dec 19, 2007, 10:33:56 AM
to turbo...@googlegroups.com

Hmmm, my problem was with SQLAlchemy, so perhaps the issue is not related
:/ My SO is so rusty I can't even remember whether SQLObject uses a
connection pool at all.

Anyway, trying won't hurt, I guess... does the "results" object have a
close() method? If so, try calling it after iterating over the results.
This is roughly how my code looks using SA, in case it gives a hint:

cursor = None
try:
    cursor = select(....., engine=engine).execute()
    for r in cursor:
        ...
finally:
    if cursor is not None:
        cursor.close()

Alberto

Richard Clark

Feb 12, 2008, 4:26:38 AM
to TurboGears
I can confirm we are not only still experiencing it, but our
production site is dying from it roughly 3 times an hour. I have been
utterly unable to track down the cause; I have ruled out everything I
can think of, to the maximum extent possible on a production system.
Our only solution at this time is a watchdog that hard-kills the
TurboGears process and restarts it.

I cannot replicate it under test conditions.

I can confirm we are not running out of threads, nor in fact does it
appear that we are hanging within a regular action call. I can make the
latter statement because I now have a memcached list which records every
action called in the last 10 minutes and removes each as it exits. We do
not end up with any hanging calls, so every one completes. We are
currently serving roughly 10,000 dynamic page loads per hour on average,
peaking at 15,000 during rush hour.


Jason Chu

Mar 19, 2008, 3:53:25 PM
to turbo...@googlegroups.com
I know this thread is old and I barely read this list anymore, but I wanted
to share my fix for this issue. Hopefully it's not too late for other
people who have run into this.

It's a bug in SQLObject. We're using an old version, but the same code is
in the newer versions too.

The file is declarative.py, and the problem is in threadSafeMethod.

If you want the quick fix, comment out this line in declarative.py:

cls.__init__ = threadSafeMethod(lock)(cls.__init__)

Below is an explanation of the issue.

Postgres takes all sorts of locks when you do many different things.
threadSafeMethod also creates a lock, but that one is local to a single
Python process.

Let's go over an example. You have two threads (A and B) in the middle
of different DB transactions. Thread A holds locks on tables I and II.
Thread B creates a row in table III that has a foreign key to table I,
so it waits for thread A's lock to be released before actually creating
the row. If thread A then gets context again and tries to create or
access a row in table III, it blocks on the threadSafeMethod's lock.

Postgres will see one thread trying to create a row and the other
thread (the one the former is waiting for a lock from) idle in
transaction. Neither one will make progress.

Usually other requests come in, so new threads pick them up and try to
query tg-visit, but the DB holds certain kinds of locks on tg-visit
rows, so those threads end up waiting for thread A and thread B to give
up, even though they never will.
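
The two-thread scenario above is essentially a classic lock-ordering
deadlock. A minimal self-contained sketch (modern Python; the lock names
are illustrative stand-ins for the Postgres-side lock and the
process-local threadSafeMethod lock):

```python
import threading
import time

db_lock = threading.Lock()    # stands in for the Postgres-side lock
orm_lock = threading.Lock()   # stands in for threadSafeMethod's lock

def thread_a():
    with db_lock:             # A is mid-transaction, holding DB-side locks
        time.sleep(0.2)       # give B time to enter threadSafeMethod
        with orm_lock:        # ...then tries to construct a row: blocks forever
            pass

def thread_b():
    with orm_lock:            # B is inside threadSafeMethod
        time.sleep(0.2)
        with db_lock:         # ...waiting on the DB lock that A holds: blocks
            pass

a = threading.Thread(target=thread_a, daemon=True)
b = threading.Thread(target=thread_b, daemon=True)
a.start()
b.start()
a.join(timeout=1.0)
b.join(timeout=1.0)
print(a.is_alive() and b.is_alive())  # True: neither thread can ever proceed
```

Each thread holds one lock while waiting forever for the other's, which
is exactly why commenting out the `__init__` wrapping removes one side
of the cycle.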

It turns out that the code was originally put into SQLObject at Alberto's
request:

http://sourceforge.net/tracker/index.php?func=detail&aid=1407684&group_id=74338&atid=540674

I've tried recreating his problem, but was never able to.

We've been running a modified version of sqlobject for months in production
with no problems.

If you want this change in SQLObject, someone's going to have to push this
because I don't have time to do the followups.

Jason

Lukasz Szybalski

Mar 19, 2008, 4:20:50 PM
to turbo...@googlegroups.com

I agree with Graham here. You should try switching over to mod_wsgi
and see how that goes.
I would definitely look at the number of requests you receive versus
the number of requests your app can handle. If you move over to Apache
and mod_wsgi, you can test your app with the program "ab" and simulate
10,000 requests.

Since you are using Linux, I would check the logs next: syslog, DB
logs, Apache logs. I would look at the number of connections to the DB
and see whether it peaks at any time.

Lucas

Richard Clark

Mar 19, 2008, 8:07:32 PM
to TurboGears
Thanks Jason. I mostly got around the issue I was having by moving
most queries to a batch process that does SQL directly rather than via
SQLObject, but I still have a significant proportion of the main site
on SQLObject. I'll give it a try and see if it helps.

Florent Aide

Mar 20, 2008, 3:59:20 AM
to turbo...@googlegroups.com
On Wed, Mar 19, 2008 at 8:53 PM, Jason Chu <jc...@xentac.net> wrote:
> I know this thread is old and I barely read this list anymore, but I wanted
> to share my fix for this issue. Hopefully it's not too late for other
> people who have run into this.
>
> It's a bug in SQLObject. We're using an old version, but the same code is
> in the newer versions too.

Thanks Jason!

This is an excellent post... We should make sure to get this
information onto our Wiki and then bug Ian Bicking to fix SO!

Florent.

Diez B. Roggisch

Mar 20, 2008, 4:03:29 AM
to turbo...@googlegroups.com
Florent Aide schrieb:

Ian isn't really involved anymore - but I will give Oleg a hint!

Diez

Florent Aide

Mar 20, 2008, 4:24:46 AM
to turbo...@googlegroups.com
On Thu, Mar 20, 2008 at 9:03 AM, Diez B. Roggisch <de...@web.de> wrote:
>
> Florent Aide schrieb:

>
> Ian isn't really involved anymore - but I will give Oleg a hint!
>
> Diez

I created ticket #1765 to track the progress of the SO fix and our
documentation.

Cheers,
Florent.
