Opening a database connection is expensive; it's taking 150ms. Even
10ms would be far too much for a request that otherwise takes only
30ms. Django threads are reused in FastCGI, and database connections
are reusable; why do threads not reuse their database connections?
Disabling django.db.close_connection fixes this, bringing a trivial
request from 170ms to 35ms after the first.
150ms to localhost also seems grossly expensive, and I'm still
investigating that. However, even if I bring it down to 10ms, that's
still spending 25% of the time for the request on something that
shouldn't be necessary at all.
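The arithmetic behind that 25% figure, for the record (pure Python, no
Django involved):

```python
# Share of total request time spent (re)connecting, given that the
# request itself does ~30ms of real work.
def connect_share(connect_ms, work_ms=30.0):
    return connect_ms / (connect_ms + work_ms)

print(round(connect_share(150.0), 2))  # 0.83 -- connection setup dominates
print(round(connect_share(10.0), 2))   # 0.25 -- the 25% figure above
```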
--
Glenn Maynard
On Thu, Jul 23, 2009 at 5:24 PM, Glenn Maynard <gl...@zewt.org> wrote:
> Why is each thread's database connection closed after each request?
I believe that this is related to Django's shared-nothing-by-default approach.
HTH,
Carlos.
In this case, that's a terrible-performance-by-default approach.
(It's also not a default but the only behavior; I'll probably
submit a patch to add a setting for this if I don't hit any major
problems.)
I can think of obscure cases where this would matter--if something
changes a per-connection setting, like changing the schema search
path, and doesn't put it back--but for most people this is just an
unnecessary chunk of overhead. (I wonder if Postgres has a way to
quickly reset the connection state, without having to tear it down
completely.)
The only other change I've had to make so far is making
TransactionMiddleware always commit or rollback, not just on is_dirty,
so it doesn't leave read locks held after a request. (Of course, that
sometimes adds an SQL COMMIT where there wasn't one before, but I'll
take a 500us per request overhead over 100ms in a heartbeat.)
--
Glenn Maynard
Please do a bit more research and reflection on the topic before you
start submitting patches, because this isn't quite what you're making
it out to be.
In the case of a fairly low-traffic site, you're not going to notice
any real performance difference (since you're not doing enough traffic
for connection overhead to add up). In the case of a high-traffic
site, you almost certainly want some sort of connection-management
utility (like pgpool) regardless of what Django does, in which case it
becomes rather moot (since what you're doing is getting connection
handles from pgpool or something similar).
Meanwhile, the codebase stays much simpler and avoids some pitfalls
with potential resource and state leaks.
(and, in general, I don't believe that connection-management utilities
belong in an ORM; keeping them in a different part of the stack
drastically increases flexibility in precisely the cases where you
need it most)
--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
All pgpool2 does to reset the session to avoid all of these pitfalls
is issue "ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT".
There's nothing complex about that. (With that added, the
TransactionMiddleware change I mentioned earlier isn't needed,
either.)
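For reference, issuing that reset over any DB-API 2.0 connection at the
end of a request looks like this (a sketch; `conn` is assumed to be a
psycopg2-style connection object):

```python
# The pgpool2-style session reset quoted above: abort any open
# transaction, clear session-level settings, restore the original role.
RESET_SQL = "ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT"

def reset_session(conn):
    conn.cursor().execute(RESET_SQL)
```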
I see no need for a complex connection pooling service. You're making
this sound much more complicated than it is, resulting in people
needing to use configurations much more complicated than necessary.
--
Glenn Maynard
Except this is what it turns into. So suppose a patch is added which
does nothing except keep the database connection open; well, that's
problematic because it means a server process/thread that's not
handling a request at the moment is still tying up a handle to a DB
connection. So somebody will say it'd be much better if Django
maintained a pool of connections independent of request/response
cycles, and just doled them out as needed.
And then you need configuration to manage the size of the pool, when
connections get recycled, how to reset connections portably...
And then somebody will notice that it's got all the infrastructure for
pgpool's "poor man's replication" and helpfully submit a patch for
that...
And then we end up in the unenviable state of having wasted a bunch of
time reimplementing features already available in the standard tool
people should've been using from the start, but without getting any of
that tool's useful flexibility (e.g., changing the stuff behind the
pool without changing the application layer).
Or we could just accept that there are tools right now which will do
all this and more for you, and that if you've got enough traffic
through your app that the overhead of DB connections is problematic,
you should be using one of those tools.
What I need is sensible, simple and faster, but since someone else
might want to turn it into something complex and unnecessary, it
shouldn't be done? Sorry, this argument applies to every change
anyone might possibly make. There's no "slippery slope" here. I
don't want a connection pooler (and all the issues that brings, like
not being able to use "local" authentication in Postgres); just to
eliminate needless database reconnections.
--
Glenn Maynard
While the specific thing you personally are asking for might not be
that much, we can't really commit it to Django and then say "Hey, we
only did this for Glenn and nobody else gets to ask for features
building on it". And since there have already been multiple requests
for Django to grow full-blown connection-pooling utilities, I feel
pretty confident that's where it would end up.
If you don't like it, simply detach the signal handler that closes the
connection (and register your own handler that resets the connection).
But Django itself shouldn't add that because it really throws open a
Pandora's box of things which shouldn't be in Django but which will be
demanded anyway.
There are lots of requests for ways to build specific types of
queries, too. You can't really commit a QuerySet method for one thing
and then...
Anyway, even if I were to put together a patch for this, it wouldn't
be for quite a while--not until I've tested and used it in production,
done more extensive benchmarking, and in any case I wouldn't submit
anything but bug reports right now with the tracker so (understandably)
backlogged from the release freeze. So relax, I'm not jumping
head-first into this.
--
Glenn Maynard
By the way, I switched my connections from TCP to a Unix socket with
local authentication and it dropped to 5-10ms. (I suspect it was
setting up SSL to localhost--seems like Postgres should know better,
but I haven't investigated.) I still consider 5-10ms significant (all
delays are cumulative and the framework should have a small baseline
"latency footprint"), but I haven't done enough benchmarking and timing
yet, and obviously that's an order of magnitude lower.
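For anyone wanting to try the same thing, in the Django settings of
that era an empty HOST selects the default Unix socket, and a value
starting with '/' is taken as a socket directory (settings.py
fragment):

```python
# Use a Unix-domain socket with local authentication instead of
# TCP to localhost, avoiding per-connection TCP/SSL setup.
DATABASE_ENGINE = 'postgresql_psycopg2'
DATABASE_HOST = ''   # empty = default local socket; or '/var/run/postgresql'
```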
Also, you may be connecting to a database on another server, or local
authentication may not be available; not reconnecting for each request
just seems like sensible behavior unless you really are using a pooler
and explicitly don't want Django holding connections open. It's easy
to be stuck with a set configuration on a shared server, and if you're
stuck with a 150ms configuration, that's a perceptible delay that's
easily avoided.
I've attached the change as I'm running it now. I could isolate it
outside of a patch (disconnect the signal, as James suggested, and
connect my own), but it does need to know a little about the backend
internals (on Postgres, to set the isolation level; and resetting the
connection will be different with other databases).
(Obviously, this isn't anything like a submittable patch; I'm just
making it available since he asked.)
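In outline, the change amounts to the signal rewiring James described
(a sketch, not the attached patch; `request_finished`,
`close_connection` and `connection` are the Django-era names, and any
signal object with connect()/disconnect() works the same way):

```python
# Replace Django's close-connection-per-request behavior with an
# in-place session reset, keeping the connection open across requests.
def install_connection_reset(request_finished, close_connection, connection,
                             reset=lambda raw: raw.cursor().execute(
                                 "ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT")):
    request_finished.disconnect(close_connection)

    def reset_connection(**kwargs):
        # Only reset if this request actually opened a connection.
        if connection.connection is not None:
            reset(connection.connection)

    request_finished.connect(reset_connection)
    return reset_connection
```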
> I'd just like to take a moment to point out that that simply *cannot*
> be, else the only logical conclusion would be that .5s of latency
> results in 0 sales, which plainly makes no sense.
I suspect the actual formula is something like pow(0.9,
(seconds_latency / 0.050)); in other words, 500ms latency would imply
34.8% as many sales. That's pretty believable.
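Evaluating that guessed model (0.9 per 50ms is speculation on my part,
not measured data):

```python
# Guessed decay model: every additional 50ms multiplies sales by 0.9.
def relative_sales(seconds_latency):
    return 0.9 ** (seconds_latency / 0.050)

print(round(relative_sales(0.5), 4))  # 0.3487 -- i.e. ~34.8% of baseline
```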
--
Glenn Maynard
So, let's walk through this logically.
Suppose we start with a patch which does the sort of thing you
apparently want: hold on to the connection, and reset it after each
request/response cycle. That patch would never get committed, because
it would have even worse side effects (see: "spinning up a server
process ties up a DB handle for as long as the process lives?") and
could quite realistically cause resource starvation.
How do we avoid that? Well, we'd need some way to say "I'm done with
this connection for now but I'll be needing a connection again soon,
so hang on to it to avoid the overhead of re-establishing it". Which
is pretty much the definition of connection pooling.
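"Hang on to it for next time" really is the whole mechanism; a minimal
thread-safe version fits in a dozen lines (illustrative only --
`connect_fn` stands in for a real database connect):

```python
import queue

class ConnectionPool:
    """Minimal fixed-size pool: what 'keep it for the next request' becomes."""
    def __init__(self, connect_fn, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect_fn())  # pay the connect cost once, up front

    def acquire(self):
        return self._pool.get()   # blocks when every handle is checked out

    def release(self, conn):
        self._pool.put(conn)      # hand the still-open connection back
```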
So we see that this isn't a pointless slippery-slope argument:
connection pooling is an inevitable outcome of your approach, just one
you've yet to accept.
But it gets worse: you'd want this mechanism to be independent of the
web server processes/threads in which the connections are used,
because you don't want one connection pool per server
process (since that just reintroduces the problem of latency every
time you spin one up and increases overhead and code complexity) and
you don't want the connection pool to be killed if the process which
hosted it gets recycled by the server. Which points to... an external
connection-pooling utility.
Such utilities already exist and are usable right now. They don't
belong in Django (for many reasons, "don't reinvent the wheel" being
only one of them).
8-16 connections from the 8-16 backend threads I'm starting are going
to swamp PostgreSQL? Sorry, no.
--
Glenn Maynard
If you're using a connection pooler, you don't want to disconnect and
reconnect from it all the time--you want to stay connected and let it
worry about not keeping idle sessions open. That's its job! Django
should keep the connection open, and if people want to run so many
Django backends that this creates too many idle connections, then
*that* is the exact scenario for using a connection pooler.
100 connections, without connection pooling, with full
disconnect/reconnect: 3.385s
100 connections, without connection pooling, with connection reset: 2.681s
100 connections, with pybouncer, with full disconnect/reconnect: 2.908s
100 connections, with pybouncer, with connection reset: 2.754s
You're saying that Django should reconnect constantly, and if that's
too slow, people should use a pooler to fix it. This is precisely
backwards. Django should stay connected, and if people don't want
idle connections to the database, *that* is when you use a pooler.
The current behavior--avoiding idle connections to the database--is
exactly what you're against: the behavior that should be handled by a
pooler. That's what they're for.
--
Glenn Maynard
On some other project's mailing list?
Connection pooling doesn't belong in Django. I've outlined one reason
for that above. Google a bit for things like "django connection pool"
and you'll likely find others (including the fact that it hurts your
flexibility and mingles layers of the stack which don't and shouldn't
need to know about each other's bits).
Thread hijacking. Thanks, always appreciated.
--
Glenn Maynard
--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.