I'm sorry that I can't offer any advice - I just wanted to confirm that I
observed a similar thing with my TG 1.0.x app. It has been running for the
better part of a year now, with low to medium load (it's mainly an in-house
admin app) - but
every now and then (e.g. just yesterday) it will simply grind to a halt.
I haven't been able to diagnose it so far, but am of course extremely
interested in a solution, both as a committer & user of TG.
My app also uses SO & Identity + TurboMail.
About the ideas... well, you already found out that locking seems to be an
issue. The obvious thing to do would be to identify all the locks that are
acquired & add debug logging around them, hoping that the log statements
reveal the responsible parts of the code - not much of a suggestion, though...
Diez
A rather hackish solution I sometimes use is running a script through
cron every 5 minutes that does the following:
- get the front page of your webapp with urllib and
- catch timeout exceptions
- check the HTTP status code
- check the response time
- search the response text for anything indicating an error
- if there are any problems, restart the webapp*
All in all, maybe 20 lines of code:
import time
import urllib

# FOO_URL, MAX_RESPONSE_TIME, log() and restart_service() are
# defined elsewhere in the script.
class ServiceException(Exception):
    pass

try:
    try:
        start_time = time.time()
        ret = urllib.urlopen(FOO_URL)
    except IOError, exc:
        raise ServiceException("Foo server not responding")
    end_time = time.time()
    response_time = end_time - start_time
    log("Response time: %.2f sec." % response_time)
    response = ret.read()
    if response_time > MAX_RESPONSE_TIME:
        raise ServiceException("Response time too high.")
    if "Some error message" in response:
        raise ServiceException("'Some error' error occurred.")
except ServiceException, exc:
    log(exc)
    restart_service()
* all my webapps are running under supervisor, so I just do
"supervisorctl restart myapp"
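For reference, a minimal supervisord program section for such a setup might
look like this (the program name and command are placeholders, not taken from
this thread - adjust them to your own app):

```ini
[program:myapp]
; hypothetical entry; point command at your app's actual start script
command=python start-myapp.py prod.cfg
directory=/srv/myapp
autostart=true
autorestart=true
```

With an entry like this in place, "supervisorctl restart myapp" restarts the
process as described above.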
Chris
See my last post in this thread:
http://groups.google.com/group/turbogears/browse_thread/thread/c0cfb73cab349a73
The issue is now either to fix CP 2.2 or upgrade to CP 3 ;-) in the
meantime we need to provide deployment instructions that use the right
approach, as Chris Miles pointed out...
Fixing CP 2.2 would mean lobbying Robert and friends to issue a
maintenance release of a really old piece of software they don't
"support" anymore; this could prove more difficult to obtain than it
seems. (I know how much work a release costs and I can assure you this
is not something you do lightly.)
Upgrading to CP3 will not happen in the 1.0.x lifetime. It may happen in 1.1,
though, but this is still vaporware at this stage (I need to really
break the config system in order to be able to move that part, and this
means time...).
Florent.
Ok. Let's pool our efforts. Tell us what you find in your next observations.
Florent.
Hmm. This reminds me of an issue I had in a web.py app in which I used
SA. The problem was due to unclosed connections when issuing non-ORM
queries from the transaction's connection. This soon depleted all
connections in the pool, which caused further attempts to check out a
connection to hang forever waiting for one to become available.
Maybe the issue is caused by a particular sequence of events that,
probably after an exception, leaves unclosed connections behind?
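That failure mode can be illustrated with a toy pool (TinyPool and the sizes
here are made up for illustration, not SA's actual pool implementation): once
every connection has been checked out and never returned, the next checkout
blocks until it times out - exactly the "hang forever" symptom above.

```python
import queue

class TinyPool:
    """Hypothetical minimal connection pool; real pools are more elaborate."""
    def __init__(self, size):
        self._conns = queue.Queue()
        for i in range(size):
            self._conns.put("conn-%d" % i)

    def checkout(self, timeout=None):
        # Blocks until a connection is available (or the timeout expires).
        return self._conns.get(timeout=timeout)

    def checkin(self, conn):
        self._conns.put(conn)

pool = TinyPool(size=2)
# Two requests fetch results but never close them, so the
# connections are never checked back in:
leaked = [pool.checkout(), pool.checkout()]
try:
    pool.checkout(timeout=0.2)  # third request: pool is exhausted
    exhausted = False
except queue.Empty:
    exhausted = True
```

With no timeout, that third checkout would block forever, which is how a few
leaked connections can freeze every request-serving thread in the process.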
>
> What we need is a means of replicating this "in the lab", as it were.
> Have you written any test robots to hammer your site with traffic and
> all kinds of bad behaviour? I shall have to do this with mine.
>
> Also, it's suspicious that your package configuration is the same as
> mine. My problems have occurred over many versions of TG and SO, so
> we can rule out any version-specific problems. Similarly, I've seen
> the problem in Linux and BSD. So, it's either a long-standing flaw in
> TG or SO, or it's a problem in TurboMail. Since lots of people use
> TG, and only a few people are reporting these problems, I suspect
> TurboMail is the culprit. I suspect it's somehow hanging threads at
> connection attempts, and the threads are getting stuck in the middle
> of a page serve (leaving the Postgres transaction locks in place).
>
> So, any automated stress test should incorporate the use of TurboMail.
IMHO TurboMail doesn't look like the culprit, since it uses its own
thread pool to deliver mails... It doesn't seem likely to me that it
could cause request-serving threads to hang, since queueing mails never
blocks (unless you use a bounded queue with no mail threads consuming
the work in it, which, IIRC, is not the default behavior)
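The distinction Alberto draws can be shown with plain queue.Queue semantics
(a stand-in for illustration, not TurboMail's actual internals): an unbounded
queue's put() always returns immediately, while a bounded queue with nothing
draining it blocks the producer.

```python
import queue

# Unbounded queue (maxsize=0): enqueueing never blocks the producer,
# which is why handing mail off to a worker pool this way cannot, by
# itself, hang a request-serving thread.
unbounded = queue.Queue()
for i in range(1000):
    unbounded.put(("mail", i))

# Bounded queue with no consumer draining it: once it is full, the
# producer blocks (here only until the timeout, to keep the demo finite).
bounded = queue.Queue(maxsize=1)
bounded.put("first message")
try:
    bounded.put("second message", timeout=0.1)
    producer_blocked = False
except queue.Full:
    producer_blocked = True
```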
Alberto
Does explicitly closing the cursor when you're done with the results
solve it? The safest approach is a try/finally block, to ensure it is
closed regardless of any error. This was what I did to solve my issue.
Alberto
Anybody still experiencing this issue?
Hmmm, my problem was with SQLAlchemy, so perhaps the issue is not related
:/ My SO is so rusty I can't even remember whether SQLObject uses a
connection pool at all.
Anyway, trying won't hurt, I guess... does the "results" object have a
close() method? If so, try calling it after iterating the results. This
is more or less how my code looks using SA, in case it gives a hint:
# execute outside the try, so cursor is always bound in the finally
cursor = select(....., engine=engine).execute()
try:
    for r in cursor:
        ...
finally:
    cursor.close()
Alberto
It's a bug in SQLObject. We're using an old version, but the same code is
in the newer versions too.
The file is declarative.py and the problem is with threadSafeMethod.
If you want the quick fix, comment out this line in declarative.py:
cls.__init__ = threadSafeMethod(lock)(cls.__init__)
Below is an explanation of the issue.
Postgres has all sorts of locks that are created when you do many different
things. The threadSafeMethod also creates a lock, but this one is local to
a single python process.
Let's go over an example. You have two threads (A and B) that are in the
middle of different db transactions. Thread A holds locks on tables I and
II. Thread B creates a row in table III that has a foreign key to table I,
so it will wait for thread A's lock to be released before actually creating
that row. Then, if thread A gets context again and tries to create or
access a row in table III, it will block on the threadSafeMethod's lock
(which thread B is still holding).
Postgres sees one thread trying to create a row and the other thread
(the one whose lock the former thread is waiting on) idle in
transaction. Neither one will make any further progress.
Usually other requests come in, so new threads pick them up and try to query
tg-visit; but the db holds certain kinds of locks on tg-visit rows, so those
threads end up waiting for thread A and thread B to give up their locks,
even though they never will.
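The interleaving described above can be reproduced with two plain
threading.Lock objects standing in for the Postgres table lock and the
threadSafeMethod lock (names and timings here are illustrative, not
SQLObject's actual code):

```python
import threading
import time

process_lock = threading.Lock()   # stands in for threadSafeMethod's lock
table_I_lock = threading.Lock()   # stands in for A's Postgres lock on table I
results = {}

def make_row_waiting_on_table_I():
    # Thread B: __init__ is serialized on process_lock; inside it, the
    # insert into table III waits on A's Postgres lock on table I.
    with process_lock:
        with table_I_lock:
            pass

def thread_a():
    with table_I_lock:      # A is mid-transaction, holding table I
        time.sleep(0.1)     # let B enter __init__ and block on table I
        # A now tries to create a row itself and hits the process-wide
        # lock that B is holding: the classic cross-lock deadlock.
        got = process_lock.acquire(timeout=0.5)
        results["a_deadlocked"] = not got
        if got:
            process_lock.release()

def thread_b():
    time.sleep(0.05)
    make_row_waiting_on_table_I()
    results["b_finished_after_a_gave_up"] = True

ta = threading.Thread(target=thread_a)
tb = threading.Thread(target=thread_b)
ta.start(); tb.start()
ta.join(); tb.join()
```

In the real bug neither side has a timeout, so both threads stall forever;
the timeout here only exists so the demonstration terminates.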
It turns out that the code was originally put into SQLObject on Alberto's
request:
http://sourceforge.net/tracker/index.php?func=detail&aid=1407684&group_id=74338&atid=540674
I've tried recreating his problem, but was never able to.
We've been running a modified version of sqlobject for months in production
with no problems.
If you want this change in SQLObject, someone's going to have to push this
because I don't have time to do the followups.
Jason
I agree with Graham here. You should try switching over to mod_wsgi
and see how that goes.
I would definitely look at the number of requests you get and the number
of requests your app can handle. If you move over to Apache and
mod_wsgi, you can test your app with the program "ab" and simulate 10000
requests.
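For example, such a load test might look like this (the URL and concurrency
level are placeholders, not from this thread):

```shell
# 10000 requests total, 50 concurrent, against a hypothetical local URL
ab -n 10000 -c 50 http://localhost:8080/
```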
Since you are using Linux, I would check the logs next: syslog, db
logs, Apache logs. I would look at the number of connections to the db
and see if it peaks at any time.
Lucas
Thanks Jason!
This is an excellent post... We should make sure to have this
information on our wiki and then bug Ian Bicking to fix SO!
Florent.
Ian isn't really involved anymore - but I will give Oleg a hint!
Diez
I created ticket #1765 to track the progression of the SO fix and our
documentation.
Cheers,
Florent.