Database pooling vs. persistent connections

Aymeric Augustin

Feb 17, 2013, 6:24:52 AM
to django-d...@googlegroups.com
**tl;dr** I believe that persistent database connections are a good idea.
Please prove me wrong :)

--------------------

Since I didn't know why the idea of adding a connection pooler to Django was
rejected, I did some research before replying to the cx_Oracle SessionPooling
thread.

The best explanation I've found is from Russell:

> To clarify -- we've historically been opposed to adding connection
> pooling to Django for the same reason that we don't include a web
> server in Django -- the capability already exists in third party
> tools, and they're in a position to do a much better job at it than us
> because it's their sole focus. Django doesn't have to be the whole
> stack.

All the discussions boil down to this argument, and the only ticket on the
topic is short on details: https://code.djangoproject.com/ticket/11798

--------------------

The connection pools for Django I've looked at replace "open a connection"
with "take a connection from the pool" and "close a connection" with "return
the connection to the pool". This isn't "real" connection pooling: each worker
holds a connection for the entire duration of each request, regardless of
whether it has an open transaction or not.

This requires as many connections as workers, and thus is essentially
equivalent to persistent database connections, except that connections can be
rotated among workers.

Persistent connections would eliminate the overhead of creating a connection
(IIRC ~50ms/req), which is the most annoying symptom, without incurring the
complexity of a "real" pooler.

They would be a win for small and medium websites that don't manage their
database transactions manually and where the complexity of maintaining an
external connection pooler isn't justified.

Besides, when Django's transaction middleware is enabled, each request is
wrapped in a single transaction, which reserves a connection. In this case, a
connection pooler won't perform better than persistent connections.

Obviously, large websites should use an external pooler to multiplex their
hundreds of connections from workers into tens of connections to their
database and manage their transactions manually. I don't believe persistent
connections to the pooler would hurt in this scenario, but if they do, this
could be optional.

--------------------

AFAICT there are three things to take care of before reusing a connection:

1) restore a pristine transaction state: transaction.rollback() should do;

2) reset all connection settings: the foundation was laid in #19274;

3) check if the connection is still alive, and re-open it otherwise:
- for psycopg2: "SELECT 1";
- for MySQL and Oracle: connection.ping().

Some have argued that persistent connections tie the lifetime of database
connections to the lifetime of workers, but it's easy to store the creation
timestamp and re-open the connection if it exceeds a given max-age.
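
Putting the three checks and the max-age idea together, here is a minimal
sketch, assuming a Django-style DatabaseWrapper; reuse_or_reopen() and the
close_at timestamp are illustrative names, not part of any actual patch:

    import time

    def reuse_or_reopen(wrapper):
        # Run the checks above before reusing wrapper's connection.
        if wrapper.connection is None:
            return  # nothing to reuse; a connection is opened lazily later

        # 1) Restore a pristine transaction state.
        wrapper.connection.rollback()

        # 2) Resetting connection settings is left to the backend (#19274).

        # Re-open the connection if it exceeds its max-age.
        if wrapper.close_at is not None and time.time() >= wrapper.close_at:
            wrapper.close()
            return

        # 3) Check that the connection is still alive; close it otherwise,
        # so that the next cursor() call opens a fresh one.
        try:
            if wrapper.vendor == 'postgresql':
                wrapper.connection.cursor().execute('SELECT 1')
            else:
                wrapper.connection.ping()  # MySQLdb / cx_Oracle
        except Exception:
            wrapper.close()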

So -- did I miss something?

--
Aymeric.

Anssi Kääriäinen

Feb 17, 2013, 1:31:48 PM
to Django developers
On 17 Feb, 13:24, Aymeric Augustin wrote:
I am not yet convinced that poolers implemented inside Django core are
necessary. A major reason for doing #19274 was to allow somewhat easy
creation of 3rd party connection poolers.

I don't see transactional connection pooling as something that forces
including connection pools in Django. A transactional pooler
implementation should be possible outside Django: on every execute,
check which connection to use; when inside a transaction, use the
connection tied to the transaction, otherwise take a free connection
from the pool and use that. The big problem is that Django's
transaction handling doesn't actually know when the connection is
inside a transaction. Fix this, and transactional poolers external
to Django will be possible. (Tying the connection to transaction
managed blocks could work, but then what to do for queries outside any
transaction managed block?)

Instead of implementing poolers inside Django, would it be better to
aim for DBWrapper subclassing (as done in #19274)? The subclassing
approach has some nice properties; for example, one could implement a
"rewrite to prepared statements" feature (basically, some queries would
get automatically converted to use prepared statements at execution
time). This setup should result in nice speedups for some use cases.

Another implementation idea is to have the DB settings contain a
'POOLER' entry. By default this entry is empty, but when defined it
points to a class that has a (very limited) pooler API:
lend_connection(), release_connection() and close_all_connections()
(the last one is needed for test teardown). The connections
themselves could have .reset(), .ping() and so on methods. This is
simple and should also be extensible.
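
A minimal sketch of that 'POOLER' API, assuming the class is handed a
connect() callable that returns new DB-API connections; SimplePooler and
its internals are illustrative, not an existing implementation:

    import Queue  # the Python 2 spelling of the standard library queue

    class SimplePooler(object):
        def __init__(self, connect, max_size=5):
            self._connect = connect
            self._pool = Queue.Queue(max_size)

        def lend_connection(self):
            try:
                conn = self._pool.get_nowait()
            except Queue.Empty:
                conn = self._connect()
            # The suggested per-connection .reset() / .ping() hooks would
            # slot in here; plain rollback() is the lowest common denominator.
            conn.rollback()
            return conn

        def release_connection(self, conn):
            try:
                self._pool.put_nowait(conn)
            except Queue.Full:
                conn.close()

        def close_all_connections(self):
            # Needed for test teardown, as noted above.
            while True:
                try:
                    self._pool.get_nowait().close()
                except Queue.Empty:
                    break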

It seems SQLAlchemy has a mature pooling implementation. So, yet
another approach is to see if SQLAlchemy's pooling implementation
could be reused. (Maybe in conjunction with the above 'POOLER' idea).

I also do believe that persistent database connections are a good
idea. I don't yet believe the implementation must be in Django core...

- Anssi

Carl Meyer

Feb 17, 2013, 5:17:09 PM
to django-d...@googlegroups.com
On 02/17/2013 11:31 AM, Anssi Kääriäinen wrote:
> It seems SQLAlchemy has a mature pooling implementation. So, yet
> another approach is to see if SQLAlchemy's pooling implementation
> could be reused. (Maybe in conjunction with the above 'POOLER' idea).

There is in fact an implementation of reusing SQLAlchemy's connection
pool with Django (via DatabaseWrapper subclassing):
https://github.com/kennethreitz/django-postgrespool/

I have had this in production for several months with no issues, so the
concept certainly works.
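
For reference, wiring it up is roughly the following settings change (from
memory of the project's README -- check it for the exact names, especially
DATABASE_POOL_ARGS):

    DATABASES = {
        'default': {
            'ENGINE': 'django_postgrespool',  # instead of postgresql_psycopg2
            'NAME': 'mydb',
            'USER': 'myuser',
            'PASSWORD': 'secret',
            'HOST': 'localhost',
        }
    }

    # Optional knobs forwarded to SQLAlchemy's pool.
    DATABASE_POOL_ARGS = {
        'max_overflow': 10,
        'pool_size': 5,
        'recycle': 300,
    }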

Carl


Marc Tamlyn

Feb 18, 2013, 4:39:18 AM
to django-d...@googlegroups.com
+1 to django-postgrespool; it works well for me.

Aymeric Augustin

Feb 18, 2013, 7:41:24 AM
to django-d...@googlegroups.com
On 17 Feb 2013, at 19:31, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:

> I am not yet convinced that poolers implemented inside Django core are
> necessary.

In case I wasn't clear enough — I agree that Django core shouldn't provide
connection pooling, because it'll be simplistic and hamper innovation outside
of Django.

> I also do believe that persistent database connections are a good
> idea. I don't yet believe the implementation must be in Django core...


The goal of my proposal is to save the cost (in response time) of establishing
the database connection, in vanilla Django, without compromising reliability.
I'm trying to help people who aren't even aware of the problem :)

The implementation is quite straightforward:
https://github.com/django/django/pull/733

Everything is controlled by a per-connection max-age setting. It can be set
to a low value on rarely used connections, so they get closed quickly, and
are re-opened only when necessary.
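
Concretely, with the setting from the pull request, this looks roughly like
the following (alias names and values are illustrative):

    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            'NAME': 'mydb',
            'CONN_MAX_AGE': 600,  # keep connections for up to 10 minutes
        },
        'archive': {
            'ENGINE': 'django.db.backends.mysql',
            'NAME': 'archive',
            'CONN_MAX_AGE': 0,    # rarely used: close after every request
        },
    }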

Why would this not belong in Django core -- besides historical reasons?

--
Aymeric.



Jacob Kaplan-Moss

Feb 18, 2013, 8:47:05 AM
to django-developers
On Mon, Feb 18, 2013 at 7:41 AM, Aymeric Augustin
<aymeric....@polytechnique.org> wrote:
> The goal of my proposal is to save the cost (in response time) of establishing
> the database connection, in vanilla Django, without compromising reliability.
> I'm trying to help people who aren't even aware of the problem :)
>
> The implementation is quite straightforward:
> https://github.com/django/django/pull/733
>
> Everything is controlled by a per-connection max-age setting. It can be set
> to a low value on rarely used connections, so they get closed quickly, and
> are re-opened only when necessary.

+1 from me.

The overhead of establishing a connection isn't a big deal when you
control your own hardware and can route traffic to the database over a
gigE backplane, but many of our users don't have that luxury. Routing
on AWS seems to be particularly crappy: most of my AWS-hosted sites
show a 75-100ms overhead just to establish a connection. That can be
as much as 90% of my response time! This fix represents a cheap and
easy performance improvement. Let's get it in.

Jacob

Andrey Antukh

Feb 18, 2013, 9:38:00 AM
to django-d...@googlegroups.com
Also, https://github.com/niwibe/djorm-ext-pool was initially based on django-postgrespool but also works with SQLite and MySQL.

The only difference is that it uses monkey patching instead of a separate backend.
We have also used it in production on a project for a while and it has not given us any problems.

Andrey


--
Andrey Antukh - Андрей Антух - <ni...@niwi.be>
http://www.niwi.be/about.html
http://www.kaleidos.net/A5694F/

"Linux is for people who hate Windows, BSD is for people who love UNIX"
"Social Engineer -> Because there is no patch for human stupidity"

Carl Meyer

Feb 18, 2013, 1:02:57 PM
to django-d...@googlegroups.com
+1 from me too. Similar experience on Heroku (which is of course also
AWS), persistent DB connections via django-postgrespool were a quick and
easy win to save a significant chunk of response time. I don't see any
reason Django shouldn't do it by default, since many users won't even
realize how much their response time is being hurt by the lack of it.

Carl

Michael

Feb 18, 2013, 1:51:16 PM
to django-d...@googlegroups.com
I was just alerted to this by the reopening of ticket #18135.

I agree that routing on AWS is slow and adds a large amount of connection overhead. This is something that AWS needs to address.

I will argue, however, that the AWS use case is a reason we need this to be an OPTION, not automatic.

Generally, in my experience, databases are the first place that requires investigation as you start seeing slowdowns while your application grows with users. Relational databases are expensive to scale. They generally require DBAs, massive servers, tons of memory -- and that is when everything goes right. So we do not immediately jump to upgrading the database; that would be just wrong.

Instead, what we do is look for places to cache aggressively. I think that in the future we are going to see more and more places flattening databases and moving to NoSQL, generally in views, but occasionally directly on the pages with JavaScript. This forces less and less reliance on the relational database.

So the future of web frameworks will have less and less reliance on a constrained database. This is a really, really good thing, because AWS has given us the ability to only purchase the hardware that we need when we need it. That means that if one of my sites gets a ton of users right now, my application will scale up the number of EC2 instances. I have confidence that the number of queries directly to my database will not increase directly with that load, because each query is aggressively cached.

My EC2 scaling is directly in proportion to the number of users my application has. RDS, on the other hand, scales exponentially with the number of users that I have [1]. This is consistent with any large application. The database just costs more.

So here is my point: requiring that an application hold a constant connection to the database is really good for small applications, but as the number of front-facing web servers increases, the number of database connections also increases, consuming the most expensive resource of the web application, the database. Leaving open "sleeping" -- I would argue "stale" -- connections doesn't scale cost linearly with usership; instead it increases exponentially.

This should be an option. For the small percentage of people who are noticing the SQL connection time issues (which I would recommend they reach out to an AWS support person about), a persistent connection makes sense. The general Django community should not be exhausting their most expensive resource first.












Jacob Kaplan-Moss

Feb 18, 2013, 2:03:56 PM
to django-developers
On Mon, Feb 18, 2013 at 1:51 PM, Michael <newma...@gmail.com> wrote:
> I will argue, however, that the AWS use case is a reason we need this to be
> an OPTION, not automatic.

Check out Aymeric's patch; it is indeed an option triggered by
DATABASES[CONN_MAX_AGE].

Jacob

Michael

Feb 18, 2013, 2:17:37 PM
to django-d...@googlegroups.com
Unless I am reading it wrong, the check only happens on the next database query. So if I set the max age to 0 and don't touch the database again after the validation on start, the connections are still left sleeping. I really never want to hang onto a DB connection.

Am I reading that wrong? I will get an instance fired up sometime this week to test this out.

Carl Meyer

Feb 18, 2013, 2:21:11 PM
to django-d...@googlegroups.com
On 02/18/2013 12:17 PM, Michael wrote:
> On Mon, Feb 18, 2013 at 2:03 PM, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
>
> On Mon, Feb 18, 2013 at 1:51 PM, Michael <newma...@gmail.com> wrote:
> > I will argue, however, that the AWS use case is a reason we need
> > this to be an OPTION, not automatic.
>
> Check out Aymeric's patch; it is indeed an option triggered by
> DATABASES[CONN_MAX_AGE].
>
>
> Unless I am reading it wrong, the check only happens on the next database
> query. So if I set the max age to 0 and don't touch the database again
> after the validation on start, the connections are still left sleeping. I
> really never want to hang onto a DB connection.

No, it's not on the next DB query, it's at the request_finished signal
(which is the same point when Django will currently close your
connection for you). So the effect with CONN_MAX_AGE=0 is really not
distinguishable from what you get currently.
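
In other words, the hook runs on request_finished, roughly as in this
sketch (close_at stands in for the pull request's max-age bookkeeping and
is a hypothetical attribute here):

    import time

    from django.core.signals import request_finished
    from django.db import connections

    def close_old_connections(**kwargs):
        for conn in connections.all():
            # With CONN_MAX_AGE=0, close_at is always in the past, so the
            # connection closes here -- where Django closed it before.
            if conn.close_at is not None and time.time() >= conn.close_at:
                conn.close()

    request_finished.connect(close_old_connections)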

Carl

Michael

Feb 18, 2013, 2:27:33 PM
to django-d...@googlegroups.com
Thanks, Carl, I had missed that.

Then I have no objection. I will still take a deeper dive this week and let you know how it goes.

Aymeric Augustin

Feb 18, 2013, 2:39:48 PM
to django-d...@googlegroups.com
On 18 Feb 2013, at 19:51, Michael <newma...@gmail.com> wrote:

> I was just alerted to this by the reopening of ticket #18135.

My initial analysis is incomplete. I'll comment on the ticket in a few minutes.

> I will argue, however, that the AWS use case is a reason we need this to be an OPTION, not automatic.

Like Jacob said, persistent connections can be disabled entirely by setting CONN_MAX_AGE = 0.

> So here is my point: requiring that an application hold a constant connection to the database is really good for small applications, but as the number of front-facing web servers increases, the number of database connections also increases, consuming the most expensive resource of the web application, the database. Leaving open "sleeping" -- I would argue "stale" -- connections doesn't scale cost linearly with usership; instead it increases exponentially.

If the majority of your views do not hit a given database, I expect that you'll set a low CONN_MAX_AGE for this database.

> For the small percentage of people who are noticing the SQL connection time issues (which I would recommend they reach out to an AWS support person about), a persistent connection makes sense.

Most Django sites suffer from the database connection establishment delay — actually, all except those that use a connection pooler. AWS is the most egregious example, but it isn't the primary motivation for this patch.

On a random server running both Django and PostgreSQL locally, I measured an overhead of 7ms. If PostgreSQL were running on another machine, the overhead would be higher. If one is aiming for 200ms response times, connection establishment can easily eat 5 to 10% of the processing time.
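
The overhead is easy to measure for yourself; a rough psycopg2 snippet
(credentials are placeholders) is:

    import time

    import psycopg2

    start = time.time()
    conn = psycopg2.connect(dbname='mydb', user='myuser', host='localhost')
    print 'connect: %.1f ms' % ((time.time() - start) * 1000)

    start = time.time()
    cursor = conn.cursor()
    cursor.execute('SELECT 1')
    cursor.fetchone()
    print 'query:   %.1f ms' % ((time.time() - start) * 1000)
    conn.close()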

--
Aymeric.



Aymeric Augustin

Feb 18, 2013, 4:27:38 PM
to django-d...@googlegroups.com
On 18 Feb 2013, at 14:47, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
> This fix represents a cheap and easy performance improvement. Let's get it in.


After working on a few iterations of the patch, I have identified two problems
worth discussing.


Problem #1: Is it worth re-executing the connection setup at the beginning of
every request?

The connection setup varies widely between backends:
- SQLite: none
- PostgreSQL: https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L185-L205
- MySQL: https://github.com/django/django/blob/master/django/db/backends/mysql/base.py#L445-L452
- Oracle: https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L549-L599

The current version of the patch repeats it every time. In theory, this isn't
necessary, and doing it only once would be simpler.

It could be backwards incompatible, for instance, if a developer changes the
connection's time zone. But we can document that setting CONN_MAX_AGE = 0
restores the previous behavior in such cases.


Problem #2: How can Django handle situations where the connection to the
database is lost?

Currently, with MySQL, Django pings the database server every time it creates
a cursor, and reconnects if that fails. This isn't a good practice and this
behavior should be removed: https://code.djangoproject.com/ticket/15119

Other backends don't have an equivalent behavior. If a connection was opened,
Django assumes it works. Worse, if the connection breaks, Django fails to close
it, and keeps the broken connection instead of opening a new one:
https://code.djangoproject.com/ticket/15802

Thus, persistent connections can't make things worse :) but it'd be nice to
improve Django in this area, consistently across all backends.

I have considered four possibilities:

(1) Do nothing. At worst, the connection will be closed after max_age and then
reopened. The worker will return 500s in the meantime. This is the current
implementation.

(2) "Ping" the connection at the beginning of each request, and if it doesn't
work, re-open it. As explained above, this isn't a good practice. Note
that if Django repeats the connection setup before each request, it can
take this opportunity to check that the connection works and reconnect
otherwise. But I'm not convinced I should keep this behavior.

(3) Catch database exceptions in all appropriate places, and if the exception
says that the connection is broken, reconnect. In theory this is the best
solution, but it's complicated to implement. I haven't found a conclusive
way to identify error conditions that warrant a reconnection.

(4) Close all database connections when a request returns a 500. It's a bad
idea because it ties the HTTP and database layers. It could also hide
problems.


Ideas?

--
Aymeric.



Karen Tracey

Feb 18, 2013, 5:30:12 PM
to django-d...@googlegroups.com
Just a couple of random quick thoughts:

Will persistent connections be able to ensure that "bad connections" (e.g. those in state "current transaction is aborted, commands ignored until end of transaction block") don't leak from one request to another?

MySQL, at least, will close a connection from the server side if it is idle for too long (no idea if that is configurable, I've never had reason to investigate...have just noticed the behavior). In the pull request it looks like we only check for the need to close at the end of a request cycle? For a very low activity site it seems those two behaviors could interact badly, with the request cycle finishing code deciding to keep the connection around since it's not too old....long delay until next request while MySQL server closes what it sees as an idle connection...next request tries to use a connection that the server has closed?

Karen

Alex Gaynor

Feb 18, 2013, 5:33:28 PM
to django-d...@googlegroups.com
As far as I know, the MySQL timeout time is configurable, and the default is 8 hours.

Alex




--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero

Karen Tracey

Feb 18, 2013, 5:41:38 PM
to django-d...@googlegroups.com
On Mon, Feb 18, 2013 at 5:33 PM, Alex Gaynor <alex....@gmail.com> wrote:
As far as I know, the MySQL timeout time is configurable, and the default is 8 hours.

Assuming this is what does it: http://dev.mysql.com/doc/refman/5.0/en/server-system-variables.html#sysvar_wait_timeout that sounds right. I guess I have just managed to encounter a fair number of servers that lowered that number. (I've mainly seen it from mysql shell sessions, where I get a message about the server having gone away and the connection being re-established if I've waited "too long" between commands... pretty sure "too long" has never been anywhere near 8 hours).
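
For anyone who wants to check their own server, the timeout is visible
from a Django shell (MySQL only):

    from django.db import connection

    cursor = connection.cursor()
    cursor.execute("SHOW VARIABLES LIKE 'wait_timeout'")
    print cursor.fetchone()  # e.g. ('wait_timeout', '28800'), the 8-hour default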

Karen

Aymeric Augustin

Feb 18, 2013, 5:51:47 PM
to django-d...@googlegroups.com
On 18 Feb 2013, at 23:30, Karen Tracey <kmtr...@gmail.com> wrote:

> Will persistent connections be able to ensure that "bad connections" (e.g. those in state "current transaction is aborted, commands ignored until end of transaction block") don't leak from one request to another?

I'm calling connection.abort() at the end of each request to avoid this. If there are bugs in connection.abort(), I'll try to fix them.

> MySQL, at least, will close a connection from the server side if it is idle for too long (no idea if that is configurable, I've never had reason to investigate...have just noticed the behavior). In the pull request it looks like we only check for the need to close at the end of a request cycle? For a very low activity site it seems those two behaviors could interact badly, with the request cycle finishing code deciding to keep the connection around since it's not too old....long delay until next request while MySQL server closes what it sees as an idle connection...next request tries to use a connection that the server has closed?


Indeed, that could be a problem.

To mitigate it, I could:
- check the connection's max age before each request, and reopen it if it's outdated (in addition to the current check after each request, required to emulate the current behavior);
- implement a "ping" before each request — but I'd really prefer to avoid this overhead;
- disable persistent connections by default :(

--
Aymeric.



Carl Meyer

Feb 18, 2013, 7:31:01 PM
to django-d...@googlegroups.com
On 02/18/2013 02:27 PM, Aymeric Augustin wrote:
> Problem #1: Is it worth re-executing the connection setup at the beginning of
> every request?
>
> The connection setup varies widely between backends:
> - SQLite: none
> - PostgreSQL: https://github.com/django/django/blob/master/django/db/backends/postgresql_psycopg2/base.py#L185-L205
> - MySQL: https://github.com/django/django/blob/master/django/db/backends/mysql/base.py#L445-L452
> - Oracle: https://github.com/django/django/blob/master/django/db/backends/oracle/base.py#L549-L599
>
> The current version of the patch repeats it every time. In theory, this isn't
> necessary, and doing it only once would be simpler.
>
> It could be backwards incompatible, for instance, if a developer changes the
> connection's time zone. But we can document that setting CONN_MAX_AGE = 0
> restores the previous behavior in such cases.

It seems silly to re-run queries per-request that were only ever
intended to be per-connection setup. So I favor the latter. I think this
change will require prominent discussion in the
potentially-backwards-incompatible section of the release notes regardless.

(This could form an argument for keeping the built-in default 0, and
just setting it to a higher number in the project template. But I don't
like polluting the default project template, and I think a majority of
existing upgrading projects will benefit from this without problems, so
I don't actually want to do that.)
I'd be inclined to go for (1), with the intent of moving gradually
towards (3) as specific detectable error conditions that happen in real
life and do warrant closing the connection and opening a new one are
brought to our attention. Unfortunately handling those cases is likely
to require parsing error messages, as pretty much anything related to
DBAPI drivers and error conditions does :/

I tried to dig for the origins of the current MySQL behavior to see if
that would illuminate such a case, but that code goes way back into the
mists of ancient history (specifically, merger of the magic-removal
branch), beyond which the gaze of "git annotate" cannot penetrate.

Option (4) is very bad IMO, and (2) is not much better.

Carl

Mario Briggs

Feb 19, 2013, 12:46:13 AM
to Django developers
Just FYI, in the DB2 backend for Django, we have exposed a custom
database property PCONNECT and have defaulted it to true for a long
time now. What this does is use the persistent connections supported
by the DB2 Python driver
(http://code.google.com/p/ibm-db/wiki/APIs#ibm_db.pconnect), rather
than opening a fresh connection each time. One point to note with this
approach is that it cannot be used with multi-threaded web servers; we
did have a situation with a customer using this with Apache on Windows
(http://groups.google.com/group/ibm_db/browse_thread/thread/eab320359f0d16d8?hl=en_US),
so I am curious how the above approach deals with the multi-threaded
scenario.

regards
Mario Briggs

Anssi Kääriäinen

Feb 19, 2013, 5:04:28 AM
to Django developers
On 19 Feb, 02:31, Carl Meyer <c...@oddbird.net> wrote:
> On 02/18/2013 02:27 PM, Aymeric Augustin wrote:
>
> > Problem #1: Is it worth re-executing the connection setup at the beginning of
> > every request?
>
> > The connection setup varies widely between backends:
> > - SQLite: none
> > - PostgreSQL:https://github.com/django/django/blob/master/django/db/backends/postg...
> > - MySQL:https://github.com/django/django/blob/master/django/db/backends/mysql...
> > - Oracle:https://github.com/django/django/blob/master/django/db/backends/oracl...
>
> > The current version of the patch repeats it every time. In theory, this isn't
> > necessary, and doing it only once would be simpler.
>
> > It could be backwards incompatible, for instance, if a developer changes the
> > connection's time zone. But we can document that setting CONN_MAX_AGE = 0
> > restores the previous behavior in such cases.
>
> It seems silly to re-run queries per-request that were only ever
> intended to be per-connection setup. So I favor the latter. I think this
> change will require prominent discussion in the
> potentially-backwards-incompatible section of the release notes regardless.
>
> (This could form an argument for keeping the built-in default 0, and
> just setting it to a higher number in the project template. But I don't
> like polluting the default project template, and I think a majority of
> existing upgrading projects will benefit from this without problems, so
> I don't actually want to do that.)

Maybe we need another setting for what to do at request start. It does
seem somewhat likely that users could do SET SEARCH_PATH in middleware
to support multitenant setups for example, and expect that setting to
go away when the connection is closed after the request. Any other SET
is a likely candidate for problems in PostgreSQL, and I am sure other
DBs have their equivalent of session state, too. (In this case doing
RESET ALL and running the connection setup again is the right thing to
do in PostgreSQL.)

It would be nice to support this use case, but it may be enough to
document this change clearly in the release notes and point out that
if you have such requirements, you should set max_age to 0. More
features can always be added later on.
I hope this discussion is about what to do at request finish/start
time.

I am very strongly opposed to anything where Django suddenly changes
connections underneath you. At request finish/start this is OK (you
should expect new connections then anyways), but otherwise, if you get
a broken connection, it isn't Django's business to swap the connection
underneath you. There is a reasonable expectation that while you are
using a single connections[alias] in a script, for example, the
underlying connection stays the same the whole time.
Otherwise SET somevar in PostgreSQL could break, for example.

Now, one could argue that SET somevar should not be used with
Django's connections. But this is very, very limiting for some real
world use cases (multitenant and SET SEARCH_PATH for one). There is no
way to actually enforce such a limitation, so it would be
documentation only. In addition, the result is that very rarely you get
a weird (potentially data-corrupting) problem because your connection
was swapped at the wrong moment. That is nearly impossible to debug
(especially if it is not logged either).

If the connection swapping is still wanted, then there must at least
be a way to tell Django that do NOT swap connections unless told to do
so.

I think a good approach would be to mark the connection potentially
broken on errors in queries, and then in request_finished check for
this potentially broken flag. If flag set, then and only then run
ping() / select 1. So, this is a slight modification of no. 3 where
one can mark the connection potentially broken liberally, but the
connection is swapped only when the ping fails, and only in
request_finished. For most requests there should be no overhead as
errors in queries are rare.
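
A sketch of that flag-then-ping approach -- errors_occurred and the cursor
wrapper are hypothetical names here, not existing Django APIs:

    from django.core.signals import request_finished
    from django.db import connections

    class FlaggingCursor(object):
        # Wraps a real cursor; flags the connection when a query raises.
        def __init__(self, cursor, wrapper):
            self.cursor = cursor
            self.wrapper = wrapper

        def execute(self, sql, params=()):
            try:
                return self.cursor.execute(sql, params)
            except Exception:
                self.wrapper.errors_occurred = True  # potentially broken
                raise

    def close_if_broken(**kwargs):
        for conn in connections.all():
            if getattr(conn, 'errors_occurred', False):
                try:
                    conn.cursor().execute('SELECT 1')  # the ping
                    conn.errors_occurred = False
                except Exception:
                    conn.close()  # reopened lazily on the next request

    request_finished.connect(close_if_broken)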

BTW the remark above in Aymeric's post that persistent connections
can't make things worse: I don't believe this. Persistent connections
will keep the broken connection from request to request, and at least
on PostgreSQL a broken connection is correctly closed in request
finish.

- Anssi

Aymeric Augustin

Feb 19, 2013, 5:20:53 AM
to django-d...@googlegroups.com
On 19 Feb 2013, at 11:04, Anssi Kääriäinen <anssi.ka...@thl.fi> wrote:

> Maybe we need another setting for what to do in request.start. It does
> seem somewhat likely that users could do SET SEARCH_PATH in middleware
> to support multitenant setups for example, and expect that set to go
> away when connection is closed after the request. Any other SET is a
> likely candidate for problems in PostgreSQL, and I am sure other DBs
> have their equivalent of session state, too. (In this case doing RESET
> ALL; run connection setup again is the right thing to do in
> PostgreSQL).
>
> It would be nice to support this use case, but just documenting this
> change clearly in the release notes, and point out that if you have
> such requirements, then set max_age to 0. More features can always be
> added later on.

Yes, the short term solution in this case is to set max_age to 0.

I think the long term solution is to modify such middleware to revert its
effects in process_response() — or reset everything in process_request().
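
For the SET SEARCH_PATH example, such a middleware could look like the
following sketch (the tenant lookup is schematic):

    from django.db import connection

    class TenantSearchPathMiddleware(object):
        def process_request(self, request):
            schema = getattr(request, 'tenant_schema', 'public')  # hypothetical
            connection.cursor().execute('SET search_path = %s', [schema])

        def process_response(self, request, response):
            # Revert the session state so a persistent connection starts clean.
            connection.cursor().execute('RESET search_path')
            return response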

I agree with Carl that I should document this thoroughly.


> (…)


> I hope this discussion is about what to do at request finish/start
> time.

You're right, it's very important not to fiddle with connections during
request processing.

> I am very strongly opposed to anything where Django suddenly changes
> connections underneath you. At request finish/start this is OK (you
> should expect new connections then anyways), but otherwise if you get
> broken connection, it isn't Django's business to swap the connection
> underneath you. There is a reasonable expectation that while you are
> using single connections[alias] in a script for example, you can
> expect the underlying connection to be the same for the whole time.
> Otherwise SET somevar in postgresql could break for example.

Even worse, you could break transactional integrity! See this comment:
https://code.djangoproject.com/ticket/15119#comment:9

> I think a good approach would be to mark the connection potentially
> broken on errors in queries, and then in request_finished check for
> this potentially broken flag. If flag set, then and only then run
> ping() / select 1. So, this is a slight modification of no. 3 where
> one can mark the connection potentially broken liberally, but the
> connection is swapped only when the ping fails, and only in
> request_finished. For most requests there should be no overhead as
> errors in queries are rare.

This is an interesting idea. Django already catches database exceptions in
execute() to re-raise them. We could simply set the flag there.

> BTW the remark above in Aymeric's post that persistent connections
> can't make things worse: I don't believe this. Persistent connections
> will keep the broken connection from request to request, and at least
> on PostgreSQL a broken connection is correctly closed in request
> finish.

https://code.djangoproject.com/ticket/15802 says it isn't — although I haven't
confirmed that this bug still exists.

--
Aymeric.



Anssi Kääriäinen

Feb 19, 2013, 5:37:52 AM
to Django developers
On 19 Feb, 12:20, Aymeric Augustin
<aymeric.augus...@polytechnique.org> wrote:
> On 19 Feb 2013, at 11:04, Anssi Kääriäinen <anssi.kaariai...@thl.fi> wrote:
>
> > Maybe we need another setting for what to do at request start. It does
> > seem somewhat likely that users could do SET SEARCH_PATH in middleware
> > to support multitenant setups for example, and expect that setting to
> > go away when the connection is closed after the request. Any other SET
> > is a likely candidate for problems in PostgreSQL, and I am sure other
> > DBs have their equivalent of session state, too. (In this case doing
> > RESET ALL and running the connection setup again is the right thing to
> > do in PostgreSQL.)
>
> > It would be nice to support this use case, but it may be enough to
> > document this change clearly in the release notes and point out that
> > if you have such requirements, you should set max_age to 0. More
> > features can always be added later on.
>
> Yes, the short term solution in this case is to set max_age to 0.
>
> I think the long term solution is to modify such middleware to revert its
> effects in process_response() — or reset everything in process_request().
>
> I agree with Carl that I should document this thoroughly.

Yeah, just doing the reset manually in middleware should be good
enough.

> > I think a good approach would be to mark the connection potentially
> > broken on errors in queries, and then in request_finished check for
> > this potentially broken flag. If flag set, then and only then run
> > ping() / select 1. So, this is a slight modification of no. 3 where
> > one can mark the connection potentially broken liberally, but the
> > connection is swapped only when the ping fails, and only in
> > request_finished. For most requests there should be no overhead as
> > errors in queries are rare.
>
> This is an interesting idea. Django already catches database exceptions in
> execute() to re-raise them. We could simply set the flag there.
>
> > BTW the remark above in Aymeric's post that persistent connections
> > can't make things worse: I don't believe this. Persistent connections
> > will keep the broken connection from request to request, and at least
> > on PostgreSQL a broken connection is correctly closed in request
> > finish.
>
> https://code.djangoproject.com/ticket/15802 says it isn't — although I haven't
> confirmed that this bug still exists.

I just tried this, and broken connections are closed correctly, at least
in master. The first paragraph of comment https://code.djangoproject.com/ticket/15802#comment:20
says this is already fixed in 1.4. The open part of the ticket seems
to be about 1.3 (which is not going to be fixed anyway) or automatic
connection swapping (which is dangerous). So maybe it is time to
re-close the ticket?

- Anssi

Carl Meyer

Feb 19, 2013, 1:31:35 PM
to django-d...@googlegroups.com
On 02/19/2013 03:04 AM, Anssi Kääriäinen wrote:
> I hope this discussion is about what to do at request finish/start
> time.
>
> I am very strongly opposed to anything where Django suddenly changes
> connections underneath you. At request finish/start this is OK (you
> should expect new connections then anyways), but otherwise if you get
> broken connection, it isn't Django's business to swap the connection
> underneath you.

I agree completely. I assumed we were talking about closing the
connection at request-finished time, which would cause a new one to be
opened on the next request.

> I think a good approach would be to mark the connection potentially
> broken on errors in queries, and then in request_finished check for
> this potentially broken flag. If flag set, then and only then run
> ping() / select 1. So, this is a slight modification of no. 3 where
> one can mark the connection potentially broken liberally, but the
> connection is swapped only when the ping fails, and only in
> request_finished. For most requests there should be no overhead as
> errors in queries are rare.

I like this idea. The ping is a minimal cost if you only pay it when
there is a query error during the request, and it's more reliable than
trying to parse error messages to figure out which ones require closing
the connection.

I think we should exclude some classes of errors that we already
distinguish and know they don't mean the connection is broken (e.g.
IntegrityError). If some codebase makes heavy use of algorithms like
get_or_create that rely on a correctly-handled IntegrityError, they
shouldn't need to pay the cost of the ping more frequently.

Carl

Eric Florenzano

Feb 20, 2013, 12:52:09 AM
to django-d...@googlegroups.com
One question: does this proposal include some way of specifying a maximum number of outstanding connections to the database from a single process? I looked but didn't see it in the pull request. In my experience, most PostgreSQL instances don't have a whole lot of breathing room in terms of the number of total outstanding connections they can handle, due to low shared memory defaults etc. This is something that django-postgrespool does by way of SQLAlchemy's pool_size and max_overflow settings.

Alex Gaynor

Feb 20, 2013, 1:11:35 AM
to django-d...@googlegroups.com
I don't really see what point such an option would serve. This is *not a connection pool*. Currently Django uses one open connection per thread; this simply keeps them open between requests. That means that before, you needed O(concurrent threads) connections; now you need O(total active threads) connections. If your database can't handle that many connections, why do you have that many threads running? Peak load would already be making it fall over.

Alex



Shai Berger

Feb 21, 2013, 6:42:19 PM
to django-d...@googlegroups.com
Hi,

On Sunday 17 February 2013, Aymeric Augustin wrote:
> **tl;dr** I believe that persistent database connections are a good idea.
> Please prove me wrong :)
>
> [...]
>
> So -- did I miss something?

I think you haven't -- but several points came to mind, which weren't
discussed in this thread. As far as I can tell, none of them are problems, but
*I* may be missing something, so I thought I should raise them here.

First -- the very mention of this topic reminded me of
https://code.djangoproject.com/ticket/9964, which was my pet bug for about two
years. The point of that bug, though, was to make sure that transactions are
properly closed before (and regardless of whether) the connection is closed, so while
related, it should be unaffected.

Second -- Tasks out of requests. Core developers participating in this
discussion already have this in mind, Aymeric has commented about it in
https://code.djangoproject.com/ticket/17887, but it was left out of the
discussion on the thread. The suggested changes, AFAICT, modify the behavior
around the end of requests only -- for tasks, nobody closed the connection
before, and nobody is going to do anything different now; so that's not
"something missed" either.

Third -- different use patterns of multi-db. AFAICT the change treats all
connections as equals, so no new problems should arise, but I can't shake a
gnawing suspicion that some interaction may pop up. In particular, I'm worried
because not all such patterns are automatically tested; I know my pet pattern
(two aliases for one db, for different transaction behavior -- for logging to the
db, which shouldn't be reverted when the main transaction is rolled back) has
had problems unnoticed by others (see e.g.
https://code.djangoproject.com/ticket/14415).

And last (and probably least) -- coordinated distributed transactions and
two-phase commit are not really supported by Django, and if you want them you
need to write backend-specific code (probably a custom backend -- I haven't had
the [mis]fortune to need to do this yet). I suspect such code would be affected;
I'm not sure if it becomes easier or harder to write. I know the issue is
mostly theoretical, and even if there is a problem with it, that's not a
blocker; still, I thought it was worth a mention.

Thanks,
Shai.

Aymeric Augustin

Feb 22, 2013, 4:21:14 AM
to django-d...@googlegroups.com
Hi Shai,

Thanks for compiling this list! My answers below.

> First -- the very mention of this topic reminded me of
> https://code.djangoproject.com/ticket/9964, which was my pet bug for about two
> years. The point of that bug, though, was to make sure that transactions are
> properly closed before (and regardless if) the connection is closed, so while
> related, it should be unaffected.

If anything, persistent connections will *require* proper transaction termination
at the end of each request cycle.

> Second -- Tasks out of requests. Core developers participating in this
> discussion already have this in mind, Aymeric has commented about it in
> https://code.djangoproject.com/ticket/17887, but it was left out of the
> discussion on the thread. The suggested changes, AFAICT, modify the behavior
> around the end of requests only -- for tasks, nobody closed the connection
> before, and nobody is going to do anything different now; so that's not
> "something missed" either.

Yes, this patch must not affect behavior outside of the request handler.

> Third -- different use patterns of multi-db. AFAICT the change treats all
> connections as equals, so no new problems should arise, but I can't remove a
> gnawing suspicion that some interaction may pop up. In particular, I'm worried
> because not all such patterns are automatically tested; I know my pet pattern
> (two aliases for one db, for different transaction behavior -- for logging into
> db, which shouldn't be reverted when the main transaction is rolled back) has
> had problems unnoticed by others (see e.g.
> https://code.djangoproject.com/ticket/14415).

The problem described in that ticket occurred because of the invalid
assumption that each connection was to a different database.

As far as I can tell I'm not making this assumption anywhere. Transaction state
is managed per-connection, and the representation of a connection in Django
is very straightforward.

> And last (and probably least) -- coordinated distributed transactions and
> two-phase commit are not really supported by Django, and if you want them you need
> to write backend-specific code (probably a custom backend -- I haven't had the
> [mis]fortune to need to do this yet). I suspect such code would be affected,
> I'm not sure if it becomes easier or harder to write. I know the issue is
> mostly theoretical, and even if there is a problem with it, that's not a
> blocker; still, I thought it was worth a mention.


I've never thought about this and I'm not sure what the interactions are. At worst,
you can restore the previous behavior by setting the connection's max age to zero.

--
Aymeric.



Anssi Kääriäinen

Feb 22, 2013, 6:09:26 AM
to Django developers
On 22 Feb, 11:21, Aymeric Augustin
<aymeric.augus...@polytechnique.org> wrote:
> > Second -- Tasks out of requests. Core developers participating in this
> > discussion already have this in mind, Aymeric has commented about it in
> > https://code.djangoproject.com/ticket/17887, but it was left out of the
> > discussion on the thread. The suggested changes, AFAICT, modify the behavior
> > around the end of requests only -- for tasks, nobody closed the connection
> > before, and nobody is going to do anything different now; so that's not
> > "something missed" either.
>
> Yes, this patch must not affect behavior outside of the request handler.

The way connections are handled in requests could be useful for task
queues, too. The tasks should be separated from each other, so the
wanted properties are:
- Make sure you have fresh connection state (transaction state, for
example) per task.
- Make sure that if a connection is closed inside some task, the
connection will be reopened for the next task.
- For performance reasons, always closing connections after each task
isn't wanted.

This is exactly what is wanted in request processing, too.

For example django-celery does already have their version of
persistent connections, see: https://github.com/celery/django-celery/blob/master/djcelery/loaders.py#L84

So, persistent connections will be usable outside request processing
as long as they are opt-in. Supporting persistent connections outside
request processing isn't needed for the initial implementation, of
course.

- Anssi

David Cramer

Feb 28, 2013, 2:09:40 PM
to django-d...@googlegroups.com
Can we please change this so it defaults to off, and just document how to turn it on and in what situations you should turn it on?

In my opinion this default-on feature caters to a very specific audience, and will cause a lot of unexpected behavior with other users.

Here is the tl;dr of an argument for turning it on:

Connections are expensive, we should persist them.

Here is the tl;dr for turning it off:

Connections are expensive, so we don't have a lot of them.

Immediately, for anyone who has configured more workers than they have Postgres connections (which I can only imagine is common among people who haven't set up infrastructure like pgbouncer), things will start blowing up.

Why should this be the default behavior?

Unlike MySQL, which already has cheap connections and doesn't suffer this problem, connections in Postgres add quite a large cost, both on creating them and on holding them open. **Everyone** who cares about their performance on Postgres should already be using pgbouncer (or some alternative) to offset **both** of these, not just the single problem that Django is addressing here.

I will happily defend my opinion in person at PyCon in two weeks if it still remains the default and anyone is willing to listen. If you reach out to the other large (and I don't mean scale) Django users, I imagine you will find a lot of mixed feelings about this default.


Aymeric Augustin

Feb 28, 2013, 3:11:02 PM
to django-d...@googlegroups.com
On 28 Feb 2013, at 20:09, David Cramer <dcr...@gmail.com> wrote:

> Can we please change this so it defaults to off, and just document how to turn it on and in what situations you should turn it on?
>

> In my opinion this default-on feature caters to a very specific audience, and will cause a lot of unexpected behavior with other users.

For the record, the reasoning behind the current default is as follows.

By default, Django uses the database session backend. As a consequence, each worker opens a database connection early in the request cycle (in process_request), and keeps it open until the end of the request. I don't have any figures, but I believe that such sites cannot be dismissed as "a very specific audience".

When a website grows larger (in terms of traffic), its developers optimize it: put sessions into a cache, disable transaction middleware, turn autocommit on, handle transactions manually, use a database connection pooler, etc.

The further along this curve you are, the less persistent connections are useful; they may even become harmful. However, the further along this curve you are, the more you know about database performance too, and the better your ability to determine if persistent connections are useful for your application.

The default is targeted at people who wouldn't know what a good value is and who will get the most benefit from persistent connections.

You obviously have a strong opinion on what a good value is for your applications (namely, 0) and you won't suffer from an unsuitable default.

> Immediately for anyone who has configured more workers than they have Postgres connections (which I can only imagine is common among people who havent setup infrastructure like pgbouncer) things will start blowing up.

This argument was addressed by Alex earlier in the thread and I agree with his answer. I also find it more helpful to blow up at application start rather than when the traffic exceeds some threshold.

> Why should this be the default behavior?
>
> Unlike MySQL, which already has cheap connections and doesn't suffer this problem, connections in Postgres add quite a large cost, both on creation, on allowing them. **Everyone** who cares about their performance on Postgres should already be using pgbouncer (or some alternative) to offset **both** of these, not just a single problem which is what Django is addressing here.

"**Everyone** who cares about their performance on Postgres" can take 5 seconds to set CONN_MAX_AGE to 0.

Persistent connections will improve the situation a little bit for everyone else — people running on a hosted platform, people who don't have the means to configure, run and monitor a pooler efficiently, people who don't even know what a pooler is, etc.

> I will happily defend my opinion in person at PyCon in two weeks if it still remains a default, and anyone is willing to listen, and if you reach out to the other large (and I dont mean scale) Django users I imagine you will find a lot of mixed feelings about this default.

I won't be at PyCon, but several Django committers will. If a core dev wants to change the default to 0, feel free. (Being the author of this feature, I'm obviously biased towards turning it on.)

I don't mind being proved wrong, and I chose to commit this early in the 1.6 release cycle precisely to give plenty of time for such feedback!

--
Aymeric.



Christophe Pettus

Feb 28, 2013, 4:05:14 PM
to django-d...@googlegroups.com
One comment on the patch (which I generally approve of entirely):

It would be helpful to have a backend method that performs the "restore connection between uses" function, rather than just using connection.abort() (of course, the default implementation can use that). For example, on PostgreSQL, ABORT; DISCARD ALL is the recommended way of resetting a connection, so being able to implement that would be great.
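
As a sketch, such a hook could look like this on the PostgreSQL backend;
reset_connection() is a made-up name here, not an existing Django API, and
the autocommit toggle assumes psycopg2 >= 2.4.2:

    from django.db.backends.postgresql_psycopg2 import base

    class DatabaseWrapper(base.DatabaseWrapper):
        def reset_connection(self):
            # Return the session to a pristine state between uses.
            if self.connection is not None:
                self.connection.rollback()          # the ABORT part
                self.connection.autocommit = True   # DISCARD ALL refuses txns
                self.connection.cursor().execute('DISCARD ALL')
                self.connection.autocommit = False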

--
-- Christophe Pettus
x...@thebuild.com

Christophe Pettus

Feb 28, 2013, 4:10:50 PM
to django-d...@googlegroups.com

On Feb 28, 2013, at 11:09 AM, David Cramer wrote:

> Immediately for anyone who has configured more workers than they have Postgres connections (which I can only imagine is common among people who havent setup infrastructure like pgbouncer) things will start blowing up.

If they have this configuration, it's an error. The fact that the error is now surfacing doesn't make it a correct configuration.

Aymeric Augustin

Feb 28, 2013, 4:29:16 PM
to django-d...@googlegroups.com
On 28 Feb 2013, at 22:05, Christophe Pettus <x...@thebuild.com> wrote:

> It would be helpful to have a backend method that performs the "restore connection between uses" function, rather than just using connection.abort() (of course, the default implementation can use that). For example, on PostgreSQL, ABORT; DISCARD ALL is the recommended way of resetting a connection, so being able to implement that would be great.

Good point, filed here: https://code.djangoproject.com/ticket/19948

--
Aymeric.



Michael

Feb 28, 2013, 4:40:06 PM