Production server stops responding.


mulicheng

Jun 15, 2006, 11:12:16 AM
to TurboGears
I'm running a couple instances of my application in production and
balancing them with lighttpd's mod_proxy. (Thanks all for getting me
on the right track with that.)

On a pretty regular frequency, one of the instances will stop
responding to requests. (The site stays up because there is more than
one instance running.)

Does anybody have any suggestions on how to debug the process that is
still running? I'm not sure if it is something in my business logic
causing the problem or if there is something wrong with TG/Cherrypy.
The process is still listening on the port I started it on, a tcp
connection can be made, but no response is delivered and there is no
debugging information printed.

Thanks
Dennis
mydrawings.com

Kevin Dangoor

Jun 15, 2006, 12:08:57 PM
to turbo...@googlegroups.com
That's a bummer. You might try making a CherryPy filter that logs at
the different points in the request cycle, so that you can get an idea
of where it's getting locked up.

"""Base class for CherryPy filters."""

class BaseFilter(object):
    """
    Base class for filters. Derive new filter classes from this, then
    override some of the methods to add some side-effects.
    """

    def on_start_resource(self):
        """Called before any request processing has been done"""
        pass

    def before_request_body(self):
        """Called after the request header has been read/parsed"""
        pass

    def before_main(self):
        """Called after the request body has been read/parsed"""
        pass

    def before_finalize(self):
        """Called before final output processing"""
        pass

    def before_error_response(self):
        """Called before _cp_on_error and/or finalizing output"""
        pass

    def after_error_response(self):
        """Called after _cp_on_error and finalize"""
        pass

    def on_end_resource(self):
        """Called after finalizing the output (status, header, and body)"""
        pass

    def on_end_request(self):
        """Called when the server closes the request."""
        pass
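A minimal sketch of the kind of logging filter Kevin describes; only the hook names come from BaseFilter above, while the class name, logger name, and timing logic are illustrative:

```python
import logging
import time

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("reqtrace")

class TraceFilter(object):
    """Log each stage of the request cycle; when a request hangs,
    the last stage logged shows roughly where it got stuck."""

    def on_start_resource(self):
        self._start = time.time()
        log.debug("on_start_resource")

    def before_main(self):
        log.debug("before_main +%.3fs", time.time() - self._start)

    def before_finalize(self):
        log.debug("before_finalize +%.3fs", time.time() - self._start)

    def on_end_request(self):
        log.debug("on_end_request +%.3fs", time.time() - self._start)

# CherryPy invokes the hooks in this order for each request:
f = TraceFilter()
f.on_start_resource()
f.before_main()
f.before_finalize()
f.on_end_request()
```

When a worker thread locks up, the request's log trail simply stops after the last stage it reached, which narrows the hang down to one phase of the cycle.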

Simon Belak

Jun 15, 2006, 2:45:17 PM
to turbo...@googlegroups.com
A long shot, but is it possible that this particular process runs out of
available threads (see prod.cfg: server.thread_pool)?
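For reference, these knobs live in prod.cfg under [global]; the values below are illustrative starting points, not recommendations:

```
[global]
server.thread_pool = 30
server.socket_queue_size = 15
```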


Cheers,
Simon

mulicheng

Jun 15, 2006, 3:43:09 PM
to TurboGears

Simon Belak wrote:
> A long shot, but is it possible that this particular process runs out of
> available threads (see prod.cfg: server.thread_pool)?
>

I tried thread_pool settings between 50 and 150 and had the
same results. I also tweaked the socket_queue_size setting but haven't
noticed any difference. From what I read, CherryPy is pretty good at
handling load and staying stable, so I gather something I've done is
probably the culprit. I'm using SQLAlchemy and a few third-party
libraries not shipped standard with TG or Python, so perhaps one of
those libraries is consuming resources or something. Hard to say, but
I'll keep working on it.

-Dennis

fumanchu

Jun 15, 2006, 8:31:20 PM
to TurboGears
Stick an instance of HTTPREPL into your app. That'll help you do live
debugging that you would otherwise do with logging.

http://projects.amor.org/misc/wiki/HTTPREPL


Robert Brewer
System Architect
Amor Ministries
fuma...@amor.org

mike bayer

Jun 17, 2006, 11:36:52 AM
to TurboGears
What version of SQLAlchemy are you using? One thing that was
improved not too long ago (and only within the 0.2 series) is that the
connection pool has a timeout when grabbing a connection, if all
allowed connections are hung open. The 0.1 series, I'm pretty sure, will
just hang indefinitely. That means in 0.2 you'll get an error page
instead of an indefinite hang, but at least that's more of a clue.

Another thing you may want to check for, depending on what DB you're
using, is a deadlock condition. The 0.2 series also does an
unconditional "rollback" on connections returned to the connection
pool, which prevents a DB like Postgres from keeping read locks open on
tables that were accessed within a connection.

Also, in moments of sheer multithreaded desperation, the threadframe
module can be invaluable in illustrating where a hang is occurring,
though it requires some instrumentation of your application:

http://www.majid.info/mylos/stories/2004/06/10/threadframe.html
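On Python 2.5 and later the same idea is available in the stdlib as sys._current_frames(), so no third-party module is needed. A sketch (the function name dump_all_threads and the lock demo are mine):

```python
import sys
import threading
import time
import traceback

def dump_all_threads():
    """Return a stack trace for every live thread; the top frame of a
    hung thread shows exactly where it is blocked."""
    out = []
    for thread_id, frame in sys._current_frames().items():
        out.append("--- thread %s ---\n" % thread_id)
        out.extend(traceback.format_stack(frame))
    return "".join(out)

# Example: a worker blocked on a lock shows up in the dump with its
# acquire() call at the top of its stack.
lock = threading.Lock()
lock.acquire()
worker = threading.Thread(target=lock.acquire)
worker.daemon = True
worker.start()
time.sleep(0.1)
print(dump_all_threads())
lock.release()
```

Wiring a call like this to a signal handler or an admin-only URL lets you take a snapshot of a wedged production process without restarting it.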

mulicheng

Jun 18, 2006, 1:16:18 AM
to TurboGears
I'm on 0.1.7. I'd like to upgrade, really, but I'm working too hard
right now trying to keep the servers up :)

The digg effect.

Anyhow, this does seem like it could be an issue, and I think migrating
to 0.2 is pretty soon on the slate.

mulicheng

Jun 19, 2006, 12:10:06 PM
to TurboGears
Just to add a side note to this thread: I was having lots of problems
keeping two instances running. It had previously been suggested that,
to take advantage of two CPUs, running a couple of instances would do the
trick. I had to restart each instance quite often under that scenario,
though. After going back to one instance, the application has been
running fine. I added further caching (memcache) for db queries and
haven't had any lockups since that time. I think the SQLAlchemy
problem mentioned earlier could be the source of the lockup problem.
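The query caching described here can be sketched with an in-process dict standing in for memcached; the decorator name, key scheme, and TTL are illustrative, and a real deployment would swap the dict for a memcache client:

```python
import time

_cache = {}  # stands in for a memcached client

def cached(ttl=60):
    """Cache a function's return value for `ttl` seconds, keyed on
    the function name and its positional arguments."""
    def deco(fn):
        def wrapper(*args):
            key = (fn.__name__,) + args
            hit = _cache.get(key)
            if hit is not None and time.time() - hit[0] < ttl:
                return hit[1]          # fresh cached value
            value = fn(*args)
            _cache[key] = (time.time(), value)
            return value
        return wrapper
    return deco

calls = []

@cached(ttl=60)
def expensive_query(user_id):
    calls.append(user_id)  # stands in for a real DB round trip
    return {"id": user_id}

expensive_query(1)
expensive_query(1)  # second call is served from the cache
```

Besides saving time, this kind of caching also means fewer threads ever touch a pooled DB connection, which reduces exposure to the connection hang.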

My server is hyperthreaded, so it isn't really two CPUs. In addition,
the physical server is hosting 3 virtual domains (Xen). Mydrawings.com
is the only domain that gets substantial traffic (we got Dugg Saturday,
by the way), but the resources are indeed divided between the 3 domains.
I can't test for sure right now, but this setup could be part of the
problem with running more than one instance on the machine.

Anyhow, I plan to upgrade SQLAlchemy and to get a 2nd physical server
in operation but for the time being, our site is still surviving.

mike bayer

Jun 19, 2006, 7:17:58 PM
to TurboGears
Well, if you get 0.2 running on your multiserver setup, let us know on
the SA list or write to me... a Digg'ed site running SA would be a
good step towards SA moving out of "alpha" :)

mulicheng

Jun 20, 2006, 2:21:54 PM
to TurboGears
It's done: I've deployed SA 0.2.3 to mydrawings.com. I'm NOT using the
threadlocal mod described in the upgrade guide. Instead, I've gone
ahead and converted my code to use sessions and queries.

Here's crossing my fingers that the servers don't lock up any more. (Which
they have still been doing, even after the digg load has calmed a
little.)

mulicheng

Jun 23, 2006, 2:02:14 AM
to TurboGears
Using HTTPREPL, I confirmed that it is indeed SQLAlchemy (0.2.3 now)
locking up. The server is fine for hours; then, when the problem
occurs, each thread that encounters a database access call locks at the
point of connecting to the database. The threads stay locked
indefinitely, and the server stops responding when all threads are
locked.

Hmm, suggestions?

ben.ha...@gmail.com

Jun 23, 2006, 4:23:19 AM
to TurboGears
The only solution I can think of is to use fewer connections than the
lock-up amount. Ideally, you only want to allocate as many connections
as you EVER have running threads in the pool:
5-10 instead of 2xx.
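Ben's sizing advice can also be enforced mechanically: a pool that hands out a fixed set of connections and times out instead of hanging turns a silent lockup into a visible error. A minimal sketch, where object() stands in for a real MySQLdb.connect(...) call and the class name is mine:

```python
import queue  # named Queue on the Python 2.4 used in this thread

class BoundedPool(object):
    """Hand out at most `size` connections; acquire() raises
    queue.Empty after `timeout` seconds instead of blocking forever."""

    def __init__(self, connect, size=5):
        self._q = queue.Queue(size)
        for _ in range(size):
            self._q.put(connect())

    def acquire(self, timeout=10):
        # A timeout here is what distinguishes "error page" from "hang".
        return self._q.get(timeout=timeout)

    def release(self, conn):
        self._q.put(conn)

# object stands in for the real DB connect function
pool = BoundedPool(object, size=2)
conn = pool.acquire()
pool.release(conn)
```

The design choice mirrors what mike bayer describes in SQLAlchemy 0.2: bounded connections plus a checkout timeout, so exhaustion surfaces as an exception with a stack trace rather than forty wedged threads.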

ben.ha...@gmail.com

Jun 23, 2006, 4:43:08 AM
to TurboGears
I had kind of a similar problem with speed in my app, and this is maybe
working better for me (testing is not my forte):

for x in range(self.MAX_CRAWL_THREADS + 1):
    self.con.append(
        [MySQLdb.connect('localhost', 'u', 'p', 'spiderpydb', 3306), 0])

def getconnection(self):
    # Hand out the first free connection, tagging it with a number.
    ins = [z[1] for z in self.con]
    for x in self.con:
        if x[1] == 0:
            x[1] = max(ins) + 1
            return x

def resetconnection(self, threadnumber):
    # Mark the tagged connection as free again.
    for x in self.con:
        if x[1] == threadnumber:
            x[1] = 0

connection = self.getconnection()

cu = connection[0].cursor()
# Parameterized query instead of string interpolation (avoids SQL injection)
rws = cu.execute("SELECT id FROM user WHERE name = %s", (self.user,))
user = cu.fetchone()
cu.executemany("INSERT INTO que (address,time,user_id) VALUES (%s,%s,%s)",
               [(ref, tn, user[0]) for ref in href])
cu.close()

self.resetconnection(connection[1])

mulicheng

Jun 23, 2006, 10:26:10 AM
to TurboGears
I'll look into capping the max connections.
Right now, I have one server locked up that I am looking at. It has 40
threads, all locked.
There are only 8 connections from that server to the database, though.

ben.ha...@gmail.com

Jun 23, 2006, 11:50:37 PM
to TurboGears
This is the scheme I use for SQLObject-type SQL operations. Is
SQLAlchemy better? I've been hearing a bunch about it lately; I'll look
at it. I think that SQLObject is stable, though. This code works for a
few hours at least, I think.

Usage

conpool.getLock()  # lock
conpool.getConnection()
conpool.relLock()  # hub uses its own locking or is thread safe
hub.begin()
selectedURI = URI.select(URI.q.address == site)
selectedURIList = list(selectedURI)
if len(selectedURIList) == 0:
    time_ = time()
    d = str(lines)
    c = URI(address=site, data=d, time=time_)
    hub.commit()
else:
    c = selectedURIList[0]
hub.end()

# this goes in the model.py file
from sqlobject import *
from turbogears.database import AutoConnectHub
from sqlobject.mysql.mysqlconnection import MySQLConnection
import time
import threading
import MySQLdb

threadlocker = threading.Lock()
hub = AutoConnectHub()

def connect(threadIndex):
    """Function to create a connection at the start of the thread"""
    hub.threadConnection = MySQLConnection('db', 'u', 'p', 'localhost', 3306)  # AutoConnectHub()

def synch(func, *args):
    def wrapper(*args):
        try:
            threadlocker.acquire()
            return func(*args)
        finally:
            threadlocker.release()
    return wrapper

class ConnectionPool:
    def __init__(self, timeout=100, checkintervall=10):
        self.timeout = timeout
        self.lastaccess = {}
        self.connections = {}
        self.checkintervall = checkintervall
        self.lastchecked = time.time()
        self.lockingthread = None

    def getConnection(self):
        global hub
        tid = threading._get_ident()
        try:
            con = self.connections[tid]
        except KeyError:
            print 'key error re-connecting'
            self.connections[tid] = MySQLConnection('db', 'u', 'p', 'localhost', 3306)  # hub.threadConnection
            con = self.connections[tid]
        self.lastaccess[tid] = time.time()
        if (self.lastchecked + self.checkintervall) < time.time():
            self.cleanUp()
        return con

    def getLock(self):
        while not self._checklock():
            time.sleep(0.02)
        self.lockingthread = threading._get_ident()
    getLock = synch(getLock)

    def relLock(self):
        self.lockingthread = None

    def _checklock(self):
        if self.lockingthread == threading._get_ident() or self.lockingthread is None:
            return True
        return False

    def cleanUp(self):
        dellist = []
        for con in self.connections:
            if self.lastaccess[con] + self.timeout < time.time():
                self.connections[con].close()
                # del self.lastaccess[con]
                dellist.append(con)
        for l in dellist:
            print 'deleted connections'
            del self.connections[l]

    def kill(self):
        tid = threading._get_ident()
        dellist = []
        self.connections[tid].close()
        # del self.lastaccess[tid]
        dellist.append(tid)
        for l in dellist:
            print 'deleted connection'
            del self.connections[l]

conpool = ConnectionPool()

ben.ha...@gmail.com

Jun 24, 2006, 9:46:38 PM
to TurboGears
My problem with stability is back; I don't suggest parsing PDFs for
hrefs.

Kevin Dangoor

Jun 25, 2006, 11:44:30 AM
to turbo...@googlegroups.com
On Jun 24, 2006, at 9:46 PM, ben.ha...@gmail.com wrote:

>
> My problem with stability is back, I don't suggest parsing pdfs for
> hrefs.

That sounds expensive. You could do that as a background task and
cache the results somewhere...
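A sketch of that background-task approach; parse_links stands in for the expensive PDF href extraction, and the queue/dict plumbing is illustrative rather than any framework API:

```python
import queue
import threading

results = {}          # cache that request handlers read from
jobs = queue.Queue()  # work enqueued by request handlers

def parse_links(doc):
    # stand-in for the expensive PDF href extraction
    return [w for w in doc.split() if w.startswith("http")]

def worker():
    """Drain the job queue, parsing off the request thread."""
    while True:
        name, doc = jobs.get()
        results[name] = parse_links(doc)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
jobs.put(("a.pdf", "see http://example.com for details"))
jobs.join()  # request handlers would just poll `results` instead
```

The request handler then serves whatever is already in the cache (or a "still processing" page) and never blocks a CherryPy worker thread on the parse itself.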

ben.ha...@gmail.com

Jun 26, 2006, 12:00:38 AM
to TurboGears
Well, it has turned into quite a learning experience for me (learning
about web apps). I sure do appreciate TurboGears, by the way; it's a
great framework, probably the best! I tried getting SQLAlchemy up and
running via easy_install, and now in my main app I'm getting this. Do I
need to remove SQLAlchemy for it to work? Or can I just svn revert the
project? I think that is what I did, and it still doesn't work. By the
way, the app is called Spidey Agent. It's a search tool, and so far I've
seen some decent performance with it after some 'tweaking'; it has been
a lot of fun. Some of the apps for TurboGears are amazing. I went to
mydrawings.com and was really impressed. I wish I had a bigger budget
for this spider app; it's seriously slow (only about 8 sites a second
crawling max on a dual Opteron 240) and requires a bunch of resources,
but hopefully I can polish it up and put it somewhere so that someone
else could use it. It needs search description work (results or page
preview) and some kind of page ranking system. It works, though, if you
are patient. So thanks; without TG I wouldn't have been able to make it,
maybe.

Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/_cphttptools.py", line 105, in _run
    self.main()
  File "/usr/lib/python2.4/site-packages/CherryPy-2.2.1-py2.4.egg/cherrypy/_cphttptools.py", line 254, in main
    body = page_handler(*virtual_path, **self.params)
  File "<string>", line 3, in index
  File "/usr/lib/python2.4/site-packages/TurboGears-0.9a6-py2.4.egg/turbogears/controllers.py", line 273, in expose
    output = database.run_with_transaction(
  File "/usr/lib/python2.4/site-packages/TurboGears-0.9a6-py2.4.egg/turbogears/database.py", line 220, in run_with_transaction
    sqlalchemy.objectstore.clear()
AttributeError: 'module' object has no attribute 'objectstore'

ben.ha...@gmail.com

Jun 26, 2006, 2:41:07 AM
to TurboGears
I simply installed SA (I think 0.2) and TG did not run the next day. I
tried removing all .pyc files. I also ran tg-admin sql, deleted the old
database, and created a new blank one. The performance was decent,
something like 1000 inserts a second for a name table. When I took the
code out of my model.py file, TG exploded (that's just a coincidence,
perhaps). Does installing SQLAlchemy break TurboGears?
Seriously, this is weird; how could TG know about SQLAlchemy unless
something happens during easy_install? I registered a metadata, but
that was it. Somebody hack my machine?

Jorge Vargas

Jun 26, 2006, 4:13:33 AM
to turbo...@googlegroups.com

ben.ha...@gmail.com wrote:
> I simply installed SA (I think 0.2) and TG did not run the next day. I
> tried removing all .pyc files. I also ran tg-admin sql, deleted the old
> database, and created a new blank one. The performance was decent,
> something like 1000 inserts a second for a name table. When I took the
> code out of my model.py file, TG exploded (that's just a coincidence,
> perhaps). Does installing SQLAlchemy break TurboGears?

Apparently it does; check this line: http://trac.turbogears.org/turbogears/browser/tags/0.9a6/turbogears/database.py?rev=1524#L215

I remember seeing this a while ago and thinking it could break something, but I couldn't find a logical solution. Now I have it: if SA changes its API (as happened between 0.1 and 0.2) and TG hasn't migrated, then if someone installs the latest SA, well, you get what you're having :)

Fortunately for you, Max noticed this too and committed a patch: http://trac.turbogears.org/turbogears/changeset/1523

Not the prettiest, but it works :)

So you have two ways: #1, get the latest 1.0 branch (trunk may have unstable features), or uninstall SA.

> Seriously, this is weird; how could TG know about SQLAlchemy unless
> something happens during easy_install?

Because when the SA code was added, it was intended to become the main ORM. It still is, but it isn't clear how to migrate in a proper way.

How about using extras_require in setuptools? And a config file entry in app.cfg, maybe?
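Jorge's extras_require idea would look something like this in TurboGears' setup.py; the package names and version pins here are illustrative, not the project's actual metadata:

```python
from setuptools import setup

setup(
    name="TurboGears",
    # ... rest of the existing metadata ...
    extras_require={
        # pulled in only via `easy_install TurboGears[sqlalchemy]`,
        # so a plain install never imports an incompatible SA
        "sqlalchemy": ["SQLAlchemy>=0.2"],
    },
)
```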

> I registered a metadata, but that was it. Somebody hack my machine?

Hehehe, well, yes: someone tricked you into installing a piece of malware called TurboGears that coerces itself to run code on top of SQLAlchemy :)


ben.ha...@gmail.com

Jun 26, 2006, 5:02:38 AM
to TurboGears
Ah, foiled again. Thank you for your deep understanding of this problem;
truly amazing. I figured it would not, no way no how, mess up just to
try SA. Fooey.
