Response hangs when using SQLAlchemy


Bryan

Mar 3, 2009, 4:25:09 PM
to pylons-discuss, pylons...@googlegroups.com
Inside of an XMLRPCController I have a function that inserts items
into a database, and returns 'OK' to the client if everything works.
Everything runs quickly and correctly, the rows are inserted into the
DB, but pylons hangs for about a minute when generating the response.
The client hangs there waiting for a response while pylons does
*something*.

If I comment out the SQLAlchemy orm code, and don't insert the rows
into the database, the 'ok' response is generated instantly and
returned to the client just like all my other XMLRPC functions.

Why would using SQLAlchemy cause pylons to hang when returning a
response? The rows are inserted into the database just fine.

My Code
==================================================
try:
    orm.flush()
    orm.commit()
except Exception, e:
    orm.rollback()  # Rollback transaction
    # Release all connections and expunge objects
    # Don't allow any other code to use session, full of bad data
    orm.remove()
    return Fault(30, 'Could not insert rows.')
return 'OK'

Pylons DEBUG output while the client is left hanging there
==============================================================
[pylons.controllers.core]Calling Response object to return WSGI data
[pylons.controllers.core] Merging pylons.response headers into
start_response call, status: 200 OK

I have found the code that emits this pylons log info.
I am using Pylons 0.9.6.
=============================================================================
def repl_start_response(status, headers, exc_info=None):
    response = pylons.response._current_obj()
    start_response_called.append(None)

    # Copy the headers from the global response
    # XXX: TODO: This should really be done with a more efficient
    # header merging function at some point.
    if log_debug:
        log.debug("Merging pylons.response headers into "
                  "start_response call, status: %s", status)
    response.headers.update(HeaderDict.fromlist(headers))
    headers = response.headers.headeritems()
    for c in pylons.response.cookies.values():
        headers.append(('Set-Cookie', c.output(header='')))
    return start_response(status, headers, exc_info)
self.start_response = repl_start_response

I see that in 0.9.7 this code has been changed to:
=====================================================================
def repl_start_response(status, headers, exc_info=None):
    response = self._py_object.response
    start_response_called.append(None)

    # Copy the headers from the global response
    if log_debug:
        log.debug("Merging pylons.response headers into "
                  "start_response call, status: %s", status)
    for header in response.headerlist:
        if header[0] == 'Set-Cookie' or header[0].startswith('X-'):
            headers.append(header)
    return start_response(status, headers, exc_info)
self.start_response = repl_start_response


How can the code in 0.9.6 cause pylons to hang while returning a
response to the client ONLY when I use SQLAlchemy to insert some rows?


Bryan

Brennan Todd

Mar 3, 2009, 5:16:28 PM
to pylons-...@googlegroups.com, pylons...@googlegroups.com


My Code
==================================================
try:
       orm.flush()
       orm.commit()
except Exception, e:
       orm.rollback()  # Rollback transaction
       # Release all connections and expunge objects
       # Don't allow any other code to use session, full of bad data
       orm.remove()
       return Fault(30, 'Could not insert rows.')
return 'OK'

Shouldn't that orm.remove() call be placed into a finally: clause?  As written, you aren't removing it unless there is an exception.
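A minimal sketch of that try/finally pattern, using a stand-in session object since the rest of Bryan's model isn't shown (the `FakeSession` class and `insert_rows` wrapper here are hypothetical, and the syntax is modern Python rather than the 2.5-era `except Exception, e:`):

```python
class FakeSession(object):
    """Stand-in for a SQLAlchemy scoped session (hypothetical, for illustration)."""
    def __init__(self):
        self.calls = []
    def flush(self):    self.calls.append('flush')
    def commit(self):   self.calls.append('commit')
    def rollback(self): self.calls.append('rollback')
    def remove(self):   self.calls.append('remove')

def insert_rows(orm, fail=False):
    """Commit, roll back on error, and *always* release the session."""
    try:
        orm.flush()
        if fail:
            raise RuntimeError('simulated DB error')
        orm.commit()
    except Exception:
        orm.rollback()
        return 'FAULT'
    finally:
        # The finally: clause runs on success *and* failure,
        # so the session is released either way.
        orm.remove()
    return 'OK'
```

With remove() in finally:, the session is released even on the happy path, which is the gap Brennan is pointing out.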

Bryan

Mar 3, 2009, 5:59:01 PM
to pylons-discuss
orm.remove() is called in my base controller, so it actually does not
even need to happen here, but I put it there during debugging just to
do a sanity check and eliminate SQLAlchemy session issues from the
possible problems.

Paweł Stradomski

Mar 3, 2009, 6:26:13 PM
to pylons-...@googlegroups.com, Bryan, pylons...@googlegroups.com
In Bryan's message dated Tuesday, March 3, 2009:

> Inside of an XMLRPCController I have a function that inserts items
> into a database, and returns 'OK' to the client if everything works.
> Everything runs quickly and correctly, the rows are inserted into the
> DB, but pylons hangs for about a minute when generating the response.
> The client hangs there waiting for a response while pylons does
> *something*.
>

Maybe try to profile your application to check where it spends that time? Try
repoze.profile.

--
Paweł Stradomski

Bryan

Mar 4, 2009, 1:09:10 PM
to pylons-discuss
First off, thanks big time for pointing me towards repoze.profile. I
have integrated it into my app and love it.

The total time for a request without DB stuff is about 0.855 CPU seconds.
The total time for a request with DB stuff is about 1.470 CPU seconds.

These numbers are saying that the entire pylons request/response cycle
is only taking 1.47sec, but my client sits there waiting for a response
for about a minute, even when the client is being run on the same
machine. Could the response be getting tied up in the network layer?
Why would my use of SQLAlchemy and MySQL interfere with how long it
takes a response to get sent to a client?

I have never looked at network traffic directly before, but it seems
like that is what I need to do next. I want to find out exactly when
the request is being sent over the wire. Any suggestions on how to
proceed?
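One rough way to answer that before reaching for a packet sniffer is to time the call from the client side and compare it against the profiler's number. A sketch (the proxy URL and method name in the comments are made up for illustration):

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, wall-clock seconds) for a single call."""
    start = time.time()
    result = fn(*args, **kwargs)
    return result, time.time() - start

# With the era's xmlrpclib this would look roughly like:
#   proxy = xmlrpclib.ServerProxy('http://10.0.45.67:5004/')  # hypothetical URL
#   result, elapsed = timed(proxy.insert_items, rows)          # hypothetical method
# If elapsed is ~60s while repoze.profile reports ~1.5s,
# the time is being lost after the WSGI app returns.
```

The wall-clock number minus the profiled CPU number is roughly the time spent outside the application code.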

Paweł Stradomski

Mar 4, 2009, 1:52:03 PM
to pylons-...@googlegroups.com
In Bryan's message dated Wednesday, March 4, 2009:

> First off, thanks big time for pointing me towards repoze.profile. I
> have integrated it into my app and love it.
>
> The total time for a request without DB stuff is about 0.855 CPU seconds.
> The total time for a request with DB stuff is about 1.470 CPU seconds.
>
> These numbers are saying that the entire pylons request/response cycle
> is only taking 1.47sec, but my client sits there waiting for a response
> for about a minute, even when the client is being run on the same
> machine. Could the response be getting tied up in the network layer?
> Why would my use of SQLAlchemy and MySQL interfere with how long it
> takes a response to get sent to a client?

I'd be more inclined to suspect that some lock is being held and timing
out than to blame the network, especially if the output is identical in
both cases, though you can check that too.

What exactly is the difference between the "with db" and "without db"
versions, code-wise? Could you try removing functionality in small chunks
to see which lines cause the slowdown?

--
Paweł Stradomski

Noah

Mar 3, 2009, 2:17:26 PM
to pylons-...@googlegroups.com, pylons-discuss


On Mar 5, 2009, at 7:09 AM, Bryan <brya...@gmail.com> wrote:

> I have never looked at network traffic directly before, but it seems
> like that is what I need to do next. I want to find out exactly when
> the request is being sent over the wire. Any suggestions on how to
> proceed?

Wireshark and/or strace are miracle workers.

Bryan

Mar 4, 2009, 3:31:07 PM
to pylons-discuss
The only difference is:

orm.flush()
orm.commit()

Where orm is my SQLAlchemy Session. When I comment out these 2 lines,
it works fine. When left uncommented, the actual SQL INSERT happens
quickly, but the client sits there waiting for a response. According
to my repoze.profile report, the pylons request completes in about
1.5sec, but the client keeps waiting.

To rule out the possibility that the server was sending the response
properly and the client was somehow choking on it, I ran tcpdump. I
don't fully understand the output, but comparing the dumps with and
without the DB code, it looks like when the DB INSERT happens there
is a 5 minute pause of some sort.
Notice the 5 minute difference between lines 2 and 3 in the first
tcpdump output below (10.0.45.67 is the server):

Last few lines of tcpdump when using SQLAlchemy
===============================================================================
12:00:23 length 126) 10.0.45.67 > 10.0.45.111 : P ack win 44100
12:00:23 length 40) 10.0.45.111 > 10.0.45.67: ., cksum (correct) ack
win 65434
12:05:22 length 40) 10.0.45.67 > 10.0.45.111 : F, cksum (correct) ack
win 44100
12:05:22 length 40) 10.0.45.111 > 10.0.45.67: ., cksum (correct) ack
win 65434
12:05:22 length 40) 10.0.45.111 > 10.0.45.67: F, cksum (correct) ack
win 65434
12:05:22 length 40) 10.0.45.67 > 10.0.45.111 : ., cksum (correct) ack
win 44100
2366 packets captured
2366 packets received by filter
0 packets dropped by kernel

Last few lines of tcpdump when NOT using SQLAlchemy
===============================================================================
11:59:39 length 126) 10.0.45.67 > 10.0.45.1118: FP ack win 44100
11:59:39 length 40) 10.0.45.1118 > 10.0.45.67: ., cksum (correct) ack
win
11:59:39 length 40) 10.0.45.1118 > 10.0.45.67: ., cksum (correct) ack
win 65434
11:59:39 length 40) 10.0.45.1118 > 10.0.45.67: F, cksum (correct) ack
win 65434
11:59:39 length 40) 10.0.45.67 > 10.0.45.1118: ., cksum (correct) ack
win 44100
215 packets captured
216 packets received by filter
0 packets dropped by kernel

Bryan

Mar 4, 2009, 4:04:20 PM
to pylons-discuss
Using wireshark on the client side, the xmlrpc data is actually
getting back to the client right away. Then there is the large pause
(always right around 5 minutes).
This is how the conversation goes; the 5 minute gap appears only when
using SQLAlchemy with MySQL:

1. Last packet that has actual xmlrpc data sent to client
2. Client sends packet to server (confirmation ??)
----- 5 Minute gap ---------------------------------------------
3. Server sends 1 packet to client
4. Client sends back 2 packets
5. Server sends 1 packet to client

It is the first packet sent by the server *after* the data is
received by the client and the client responds that is stalling.
After the client receives the data, I assume there needs to be some
TCP/IP confirmation and such, and that is what the last 4 packets
exchanged between client and server are. But what is so special about
this packet that it takes the server 300 seconds to send it?


Bryan

Mar 4, 2009, 4:15:13 PM
to pylons-discuss
The plot thickens: While the 5 minute pause is happening, I can kill
the pylons app, and the last 4 packets are still exchanged between
client and server, albeit 5 minutes later. Could paste still be
hanging around after I kill the pylons app?

I have had problems in the past when SQLAlchemy encounters a DB error
and when I try to restart my pylons app, I get a "socket.error: (98,
'Address already in use')" exception that starts in SocketServer.py
and bubbles up to paste/serve.py. And if I do netstat -l at that
point, I see that "something" is listening on my app's port 5004, but
there is no pid or program name there, just a dash. 5 minutes later,
that port no longer has "something" listening on it.

It seems that there is some sort of 5 minute zombie state in Paste or
the TCP/IP stack. How could MySQL+SQLAlchemy activity have this sort
of effect??
The server is running Ubuntu 8.04.
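For reference, the "Address already in use" symptom usually means the old socket is lingering in a TCP wait state; setting SO_REUSEADDR lets a restarted server re-bind immediately. A minimal sketch (the `make_listener` helper is illustrative, not from the thread):

```python
import socket

def make_listener(port=0):
    """Create a listening socket that can re-bind to a port stuck in TIME_WAIT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow re-binding while a previous socket on the same port is
    # still draining in a TCP wait state.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(('127.0.0.1', port))   # port=0 lets the OS pick a free port
    s.listen(5)
    return s

# The stdlib's SocketServer exposes the same option via the
# `allow_reuse_address = True` class attribute on TCPServer subclasses.
```

This doesn't explain the 5 minute stall itself, but it is the standard cure for the restart error Bryan describes.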

Wichert Akkerman

Mar 4, 2009, 4:53:28 PM
to Bryan, pylons-discuss
Previously Bryan wrote:
> Using wireshark on the client side, the xmlrpc data is actually
> getting back to the client right away. Then there is the large pause
> (always right around 5 minutes)

Perhaps you are sending an incorrect Content-Length header and the
client is waiting for more data to arrive?

Wichert.

--
Wichert Akkerman <wic...@wiggy.net>   It is simple to make things.
http://www.wiggy.net/                 It is hard to make things simple.

Bryan

Mar 4, 2009, 5:10:16 PM
to pylons-discuss
The problem seems to be that the server is supposed to send a couple
more confirmation packets, but does not for 5 minutes. So the client
is correct in waiting: it has not received the last few packets of the
transaction yet, even though it has received the actual xmlrpc data.

I have seen a couple of posts on the internet about the "socket.error:
(98, 'Address already in use')" problem, but have never seen an
explanation or solution. Has anyone run into this problem?

On Mar 4, 1:53 pm, Wichert Akkerman <wich...@wiggy.net> wrote:
> Previously Bryan wrote:
> > Using wireshark on the client side, the xmlrpc data is actually
> > getting back to the client right away.  Then there is the large pause
> > (always right around 5 minutes)
>
> Perhaps you are sending an incorrect Content-Length header and the
> client is waiting for more data to arrive?
>
> Wichert.
>
> --
> Wichert Akkerman <wich...@wiggy.net>   It is simple to make things.
> http://www.wiggy.net/                  It is hard to make things simple.

Bryan

Mar 5, 2009, 12:33:31 PM
to pylons-discuss
I have moved my code out of Pylons and into a typical SimpleXMLRPC
server. It works fine there. This problem is somewhere in the
Pylons/WSGI stack.

Bryan

Mar 5, 2009, 12:50:17 PM
to pylons-discuss
Found some more info here:
http://hea-www.harvard.edu/~fine/Tech/addrinuse.html

Apparently, pylons+paste is stalling when it comes to the server
sending its FIN packet to tell the client the transaction is done. It
does not happen when I use SimpleXMLRPCServer and ThreadingMixIn from
the standard library, so it must be something higher in the stack like
paste serve or something.
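A minimal version of that working SimpleXMLRPCServer + ThreadingMixIn setup might look like this (module names shown are the Python 3 spellings; in the thread's Python 2 they were SocketServer and SimpleXMLRPCServer, and the ping function is a made-up stand-in for Bryan's real methods):

```python
import threading
from socketserver import ThreadingMixIn
from xmlrpc.server import SimpleXMLRPCServer

class ThreadedXMLRPCServer(ThreadingMixIn, SimpleXMLRPCServer):
    """XML-RPC server that handles each request in its own thread."""
    # Same knob SocketServer exposes for the 'Address already in use' issue.
    allow_reuse_address = True

def make_server(host='127.0.0.1', port=0):
    """Bind to an OS-assigned port and expose a trivial test method."""
    server = ThreadedXMLRPCServer((host, port), logRequests=False)
    server.register_function(lambda: 'OK', 'ping')
    return server
```

The fact that this bare stack closes the connection promptly while the Pylons stack doesn't is what points the finger above the application code.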

I still don't understand how accessing MySQL through SQLAlchemy could
be the trigger for these kinds of network issues???

Bryan

Mar 5, 2009, 3:02:00 PM
to pylons-discuss
The problem only occurred when I was inserting rows into a MySQL table
that had a guid (char(36)) as the primary key. In my SQLAlchemy
setup, I assigned uuid.uuid1() to be the default for these columns, so
python was generating the guid, and it is sent to the server.

I first found success when I moved the guid generation to the MySQL
server side. I let MySQL generate the guid using its uuid()
function. Everything worked at that point.

Now I do not understand the next part. I changed my code back to how
it originally was, with python generating the guid, and now everything
works. I don't have any clue what is going on at this point. It
works, but I don't know why. When I call the broken function
referenced above, it works now, but nothing in the function's code has
changed; I have been using a test stub function this whole time, so I
know the code has not changed.

Matthew Zwier

Mar 5, 2009, 3:47:28 PM
to pylons-...@googlegroups.com
Could this be something to do with the system entropy pool? If the
uuid module was accessing /dev/random when no entropy was available,
the read from /dev/random would block. If MySQL weren't using
/dev/random, and were using /dev/urandom or something, it would never
block. /dev/random may have replenished itself during your testing
before returning to the original implementation (in Python), hence
you'd have no problem, at least for a while.
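A quick way to see the distinction Matthew is drawing (nothing here is from the thread; it just inspects what the uuid module produces):

```python
import uuid

# uuid4() is 122 bits of randomness from os.urandom(), which reads the
# kernel's non-blocking pool and never stalls.
u4 = uuid.uuid4()

# uuid1() mixes the clock and the MAC address; on some platforms it is
# delegated to a system libuuid, which is where a blocking /dev/random
# read could be hiding.
u1 = uuid.uuid1()

print(u4.version, u1.version)  # prints: 4 1
```

So the two calls look interchangeable at the API level, which is why swapping uuid1() for uuid4() is such a non-obvious fix.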

MZ

Bryan

Mar 5, 2009, 4:31:47 PM
to pylons-discuss
I think you are correct. I restarted the server and the problems came
back. In my SQLAlchemy setup I use uuid.uuid1(). When I changed this
to use uuid4() the problems went away.

I don't have the sources for Python 2.5 (which the server runs), but in
2.6 uuid1 sometimes uses the random module, which accesses random
generators at the OS level. The thing is, in 2.6 the random module
uses urandom, so there should not be a problem. Perhaps in 2.5 the
random module reads from /dev/random?
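For completeness, the uuid4-based default boils down to a one-line helper; the Column line in the comment is a guess at how it would slot into a SQLAlchemy model, since the actual table definition isn't in the thread:

```python
import uuid

def new_guid():
    """char(36)-friendly primary key value; never touches /dev/random."""
    return str(uuid.uuid4())

# In a SQLAlchemy table definition this would plug in roughly as
# (column/table names are made up):
#   Column('id', CHAR(36), primary_key=True, default=new_guid)
```

Passing the callable (not new_guid()) means a fresh GUID is generated per row.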

On Mar 5, 12:47 pm, Matthew Zwier <mczw...@gmail.com> wrote:
> Could this be something to do with the system entropy pool?  If the
> uuid module was accessing /dev/random when no entropy was available,
> the read from /dev/random would block.

Matthew Zwier

Mar 5, 2009, 6:13:54 PM
to pylons-...@googlegroups.com
The 2.5 uuid module tries to use underlying system UUID libraries, so
the read from /dev/random may be in there instead of in the Python
code. Don't know what 2.6 does. Hmm...that's not very helpful. I
know just enough about the internals of Python to be dangerous :)

MZ

Bryan

Mar 5, 2009, 6:32:20 PM
to pylons-discuss
I had never even known about /dev/random

It is working now, but I think the final thing I will do is push the
GUID creation onto the database server instead of doing it in python
code. One less line of code I need to maintain.

This was one of the most difficult bugs I've had to deal with in a
long time. Sometimes it is amazing the path that you traverse in
debugging code. This bug had me looking at the pylons source, then
looking at every packet that crossed the wire, and finally looking at
python sources. I've learned a lot in the last few days. I'm going to
eat ice cream now.

Bryan


On Mar 5, 3:13 pm, Matthew Zwier <mczw...@gmail.com> wrote:
> The 2.5 uuid module tries to use underlying system UUID libraries, so
> the read from /dev/random may be in there instead of in the Python
> code.

Bryan

Mar 6, 2009, 10:09:30 AM
to pylons-discuss
It hit me this morning @ breakfast that my use of uuid1() w/
SQLAlchemy cannot be the problem, at least not the whole problem.

When I used SimpleXMLRPCServer instead of the pylons WSGI stack,
everything worked, even when using uuid1(). Also, even when using
pylons, the rows were being inserted into the DB just fine, so uuid1()
was being called for each new row, and they were getting into the DB
quickly and correctly, no blocking behavior at all.

The real problem was the final TCP packet not being sent to the
client. So I'm back to not understanding why using uuid1() causes the
WSGI+TCP/IP stack to break down, and uuid4() does not.

The only thing I can think of is that uuid1() accesses some information
about the machine's network hardware to create the GUID, and maybe
that access is interfering with TCP/IP? Seems very unlikely.

I know how to make the problem go away, but I keep bringing it up
because I want to submit a bug report, I just don't know which project
I should submit it to. I still don't know the exact problem.