Proposal: user-friendly API for multi-database support


Simon Willison

Sep 10, 2008, 1:53:15 PM
to Django developers
For those who weren't at DjangoCon, here's the state of play with
regards to multi-db support: Django actually supports multiple
database connections right now: the Query class (in django/db/models/
sql/query.py) accepts a connection argument to its constructor, and
the QuerySet class (django/db/models/query.py) can be passed an
optional Query instance - if you don't pass it one, it will create a
Query that uses the default django.db.connection.

As a result, if you want to talk to a different connection you can do
it right now using a custom manager:

from django.db.models import Manager, sql
from django.db.models.query import QuerySet

class MyManager(Manager):

    def get_query_set(self):
        # my_custom_db_connection is assumed to be defined elsewhere
        query = sql.Query(self.model, my_custom_db_connection)
        return QuerySet(self.model, query)

As Malcolm described it, he's provided the plumbing, now we need to
provide the porcelain in the form of a powerful, user-friendly API to
this stuff. Here's my first attempt at an API design.

Requirements
============

There are a number of important use-cases for multi-db support:

* Simple master-slave replication: SELECT queries are distributed
between slaves, while UPDATE and DELETE statements are sent to
the master.
* Different Django applications live on different databases, e.g.
a forum on one database while the blog lives on another.
* Moving data between different databases - it would be useful if
you could do this using the ORM to help paper over the
differences in SQL syntax.
* Sharding: data in a single Django model is distributed across
multiple databases depending on some kind of heuristic (e.g. by
user ID, or with archived content moved to a different server)
* Replication tricks, for example if you use MySQL's InnoDB for
your data but replicate to a MyISAM server somewhere so you can
use MySQL full-text indexing to search (Flickr used to do this).

I've probably missed some; please feel free to fill in the gaps.

We don't need to solve every problem, but we do need to provide
obvious hooks for how those problems should be solved. Sharding, for
example, is extremely application specific. I don't think Django
should automatically shard your data for you if you specify 'sharding
= True' on a model class, but it should provide documented hooks for
making a custom decision on which database connection should be used
for a query that allow sharding to be implemented without too much
pain.

Different applications on different databases on the other hand is
something Django should support out of the box. Likewise, master-slave
replication is common enough that it would be good to solve it with as
few lines of user-written code as possible (it's the first step most
people take to scale their database after adding caching - and it's a
sweet spot for the kind of read-heavy content sites that Django is
particularly suited for).

Proposed API
============

Here's my first attempt at describing a user-facing API.

First, we need a way of specifying multiple database connections.
Adrian has already expressed an interest in moving to DSNs rather than
individual settings, so I suggest something like this:

DATABASES = {
    'default': 'mysql://foo:bar@localhost/baz',
}

With multiple databases configured this could be:

DATABASES = {
    'master': 'mysql://foo:bar@master/mydb',
    'slave1': 'mysql://foo:bar@slave1/mydb',
    'slave2': 'mysql://foo:bar@slave2/mydb',
    'archive': 'mysql://foo:bar@archive/mydb',
    'default': 'master',
}

There are two types of connection string - DSNs and aliases. A DSN
contains '://' while an alias does not. Aliases can be used even
within the DATABASES setting itself, as with 'default' in the above
example.

It should be possible to use a DSN that has not been defined in the
DATABASES setting. As a result, I propose that anywhere in Django that
accepts a connection alias should also accept a DSN or even a raw DB-
API compliant connection object.
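To make the alias/DSN distinction concrete, here's a minimal sketch of
how such a resolver might behave (the function name and shape are
hypothetical, not existing Django API):

```python
def resolve_connection(value, databases):
    """Follow aliases in a DATABASES-style dict until we reach a DSN
    (contains '://') or a non-string (e.g. a raw DB-API connection)."""
    seen = set()
    # Strings without '://' are aliases; follow the chain.
    while isinstance(value, str) and '://' not in value:
        if value in seen:
            raise ValueError('circular alias: %r' % value)
        seen.add(value)
        value = databases[value]
    return value

DATABASES = {
    'master': 'mysql://foo:bar@master/mydb',
    'default': 'master',
}
```

So resolve_connection('default', DATABASES) would chase 'default' to
'master' and return the master DSN, while a DSN or raw connection
object passes straight through.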

The QuerySet.using() method
---------------------------

Next, we need a way of telling Django which connection to use. I
propose a new queryset method as the lowest level way of doing this,
called 'using':

qs = Article.objects.filter(published__lt=...).using('archive')

"using(alias_or_connection_or_dsn)" simply tells the QuerySet to
execute against a different connection, by updating its
internal .connection attribute.
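As a toy illustration of that behaviour (this is a stand-in class,
not Django's actual QuerySet internals):

```python
import copy

class ToyQuerySet(object):
    """Stand-in illustrating using(); not Django's real QuerySet."""
    def __init__(self, connection='default'):
        self.connection = connection

    def using(self, alias_or_connection_or_dsn):
        # Like filter(), return a modified clone so the original
        # queryset keeps pointing at its old connection.
        clone = copy.copy(self)
        clone.connection = alias_or_connection_or_dsn
        return clone

qs = ToyQuerySet()
archived = qs.using('archive')
```

The clone-and-return shape matters: chaining using() mid-way through
building a queryset shouldn't mutate earlier references to it.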

Other options for this method name include:

with_db()
with_connection()

I preferred "using()" as it reads nicely and doesn't contain an
underscore.

using() represents the lowest level user-facing API. We can cover a
common case (different applications on different databases) with the
following:

class Article(models.Model):
    ...
    class Meta:
        using = 'articles'

This means "use the articles connection for all queries originating
with this model". I'm repurposing the term 'using' here for API
consistency.

Advanced connection selection
-----------------------------

All of the other use-cases above boil down to one key decision: given
a particular database query, which database connection should I
execute the query against?

I propose adding a manager method which is called every time that
decision is made, and which is designed to be over-ridden by advanced
users. Here's the default implementation:

class Manager:
    ...
    def get_connection(self, query):
        from django.db import connection
        return connection  # Use the default connection for everything

Here's an implementation of very simple master-slave replication:

class Manager:
    ...
    def get_connection(self, query):
        if isinstance(query, (InsertQuery, DeleteQuery, UpdateQuery)):
            return 'master'
        else:
            return 'slave'
            # Or if we have more than one slave:
            # return random.choice(['slave1', 'slave2'])  # Footnote [1]

The above would be even easier if InsertQuery, DeleteQuery and
UpdateQuery were all subclasses of a ModificationQuery class (they are
currently all direct subclasses of Query) - then the check could
simply be:

if isinstance(query, ModificationQuery)

We could even ship a MasterSlaveManager that implements a variant of
the above logic in django.contrib.masterslave (more for educational
and marketing purposes than because it's something that's hard to
implement).

Note that of my two example get_connection() methods above, one
returns an actual connection object while the other returns a
connection alias. This makes for a more convenient API, and is
consistent with my earlier suggestion that DSNs, aliases and
connection objects should be interchangeable.

Since the get_connection method has access to the full query object,
even complex sharding schemes based on criteria such as the individual
fields being looked up in the query could be supported reasonably
well.
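As a sketch of what user-ID sharding could look like, here is a toy
manager (the way the user id is pulled out of the query is invented
purely for illustration; a real implementation would introspect the
actual Query object):

```python
class ShardedManager(object):
    """Sketch of user-ID sharding via get_connection(); not real
    Django API."""
    SHARDS = ['shard0', 'shard1', 'shard2', 'shard3']

    def get_connection(self, query):
        # Hypothetical: assume the query exposes its lookups as a dict.
        user_id = getattr(query, 'lookups', {}).get('user_id')
        if user_id is None:
            # Nothing to shard on; a real scheme needs a better answer
            # here (scatter-gather across shards, or an error).
            return 'default'
        # Simple modulo sharding over the configured shard aliases.
        return self.SHARDS[user_id % len(self.SHARDS)]
```

For instance, a query filtered on user_id=7 would land on 'shard3'
under this scheme (7 % 4 == 3).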

Dealing with single queries that span multiple databases
--------------------------------------------------------

Once you have different tables living in different databases there's
always the chance that someone will try to write a query that attempts
to join tables that live on two different database servers. I don't
think we should address this problem at all (aside from maybe
attempting to throw a descriptive error message should it happen) - if
you're scaling across different servers you need to be aware of the
limitations of that approach.

That said, databases like MySQL actually do allow cross-database joins
provided both databases live on the same physical server. Is this
something we should support? I'd like to say "no" and assume that
people who need to do that will be happy rolling their own SQL using a
raw cursor, but maybe I'm wrong and it's actually a common use case.

Connection pooling
------------------

This is where I get completely out of my depth, but it seems like we
might need to implement connection pooling at some point since we are
now maintaining multiple connections to multiple databases. We could
roll our own solution here, but to my knowledge SQLAlchemy has a solid
connection pool implementation which is entirely separate from the
rest of the SQLAlchemy ORM. We could just ensure that if someone needs
connection pooling there's a documented way of integrating the
SQLAlchemy connection pool with Django - that way we don't have an
external dependency on SQLAlchemy for the common case, but people who
need connection pools can still have them.

Backwards compatibility
-----------------------

I think we can do all of the above while maintaining almost 100%
backwards compatibility with Django 1.0. In the absence of a DATABASES
setting we can construct one using Django's current DATABASE_ENGINE /
DATABASE_NAME / etc. settings to figure out the 'default' connection.
Everything else should Just Work as it does already - the only people
who will need to worry are those who have hacked together their own
multi-db support based on Django internals.
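That fallback could be as simple as something like this (the setting
names are the real Django 1.0 ones, but the DSN construction is
simplified: it ignores DATABASE_PORT, URL-encoding and engine-name
mapping):

```python
def databases_from_legacy_settings(settings):
    """Build a DATABASES dict from the old DATABASE_* settings.

    `settings` is anything exposing the pre-multi-db attributes.
    Simplified sketch: no port handling or escaping.
    """
    dsn = '%s://%s:%s@%s/%s' % (
        settings.DATABASE_ENGINE,
        settings.DATABASE_USER,
        settings.DATABASE_PASSWORD,
        settings.DATABASE_HOST or 'localhost',
        settings.DATABASE_NAME,
    )
    return {'default': dsn}
```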

Justification
=============

Why is get_connection() on Manager, not Query or QuerySet?
----------------------------------------------------------

The logic that picks which database connection is used could live in
three potential places: on the manager, on the QuerySet class or on
the Query itself. The manager seems to me like the most natural place
for this to live - users are already used to modifying the manager,
it's trivial to swap in a different manager (e.g. a
MasterSlaveManager) for a given model and the manager class gets to
see all of the queries that go through it, including things like
Article.objects.create(). If there are good reasons it should go on
the Query or QuerySet instead I'd love to hear them.

Why hand get_connection a Query rather than a QuerySet?
-------------------------------------------------------

Because when you call a model's .save() method, a Query object is all
you get. That said, a QuerySet is a higher-level API, so it might be
easier to make a decision by introspecting one than by
introspecting Query. We could get the best of both worlds and define
the function signature as "get_connection(query, queryset=None)" -
passing in both arguments if a queryset is available and only the
first argument otherwise. Seems a bit ugly though.

I expect that some get_connection() methods will just call
query.as_sql() and make a decision based on the string that comes
back.

As an aside, since the get_connection() method is called for every
query it could actually double up as a hook for adding SQL logging or
profiling.

Another aside: maybe pick_connection(query) is a better method name?

Feedback?
=========

I haven't attempted to implement any of the above yet, but from
reading through the code it seems like it would be possible without a
great deal of alterations to Django. So... does this look feasible? Is
the API powerful enough to do what people need? Is this easy to
understand from an end-user point of view?

Cheers,

Simon

[1] - random.choice(['slave1', 'slave2']) is actually a poor way of
doing this, as we probably want to pick a single slave at the
beginning of each request and use that for all following queries. This
probably means using a threadlocal that is reset at the end of each
request via a signal, which is crufty but unavoidable.

Justin Fagnani

Sep 10, 2008, 2:13:28 PM
to django-d...@googlegroups.com
For application-wide db connections, I think it'd be much easier and
more portable to choose the connection in settings.py rather than in a
Model.

Manager.get_connection() is a great idea, but would it also make sense
to allow selecting the db via signals? That way you could make the
decision without modifying an app.

-Justin

Malcolm Tredinnick

Sep 10, 2008, 2:17:39 PM
to django-d...@googlegroups.com
Okay, there's lots to digest here, but a couple of things that need
clarification / addition here that I spotted on the first reading.

Also, as a general thing here, did you go back and read the various
discussions we had about API when the multi-db Summer of Code project
was running? If not, that would be worth doing and incorporating, since
we debated a few alternatives for things back then which will still be
somewhat relevant. Particularly some of the reasons why certain options
didn't feel like good API.

On Wed, 2008-09-10 at 10:53 -0700, Simon Willison wrote:
[...]


> Dealing with single queries that span multiple databases
> --------------------------------------------------------
>
> Once you have different tables living in different databases there's
> always the chance that someone will try to write a query that attempts
> to join tables that live on two different database servers. I don't
> think we should address this problem at all (aside from maybe
> attempting to throw a descriptive error message should it happen) - if
> you're scaling across different servers you need to be aware of the
> limitations of that approach.
>
> That said, databases like MySQL actually do allow cross-database joins
> provided both databases live on the same physical server. Is this
> something we should support? I'd like to say "no" and assume that
> people who need to do that will be happy rolling their own SQL using a
> raw cursor, but maybe I'm wrong and it's actually a common use case.

My gut feeling is that this isn't something to include initially as a
necessary goal, but it's also probably not too hard once the other 95%
is done. My reason for saying 'no' initially is we try to be as portable
as possible and that particular case is very specific. Also, the single
physical server constraint makes it even more specialised. If you need
separate databases for performance reasons, particularly, they're not
going to be on the same physical server.

More significant, however, is that there is a need for custom manager
support when a field is being used in a "related field" context, because
this is precisely when cross-database access happens and/or needs to be
prevented. Right now, when you write blog.entry for a particular "blog"
instance, the queryset used to access "entry" is taken from the default
manager on the Entry model (or possibly even models.Manager if a certain
attribute isn't set -- there's some serious hackery going on internally
that we'll sort out). However, neither of those particular things are
the right answer in multi-db or even "advanced selection" cases. We
really need to be able to say "when traversing this relation, use this
manager (or this initial queryset)". This allows us to, for example,
raise an error if trying to cross a join that isn't permitted. Say, in a
sharded situation or in a Hadoop or Big Table setup. It also provides
the way to determine that the query is about to cross a database barrier
and so should actually become a second query against the other database
whose result is then included in the first query.

That needs a public API, too, and I haven't thought about that problem
at all. The plumbing side of that is pretty easy. I have a
proto-implementation locally that I'm not going to take further yet
until fields/related.py has a bit of an internal rewrite to make things
a little less complicated in there.


>
> Connection pooling
> ------------------
>
> This is where I get completely out of my depth, but it seems like we
> might need to implement connection pooling at some point since we are
> now maintaining multiple connections to multiple databases.

Why does this need to be in the sphere of Django at all? Why wouldn't a
backend that talks correctly to pgpool and whatever the equivalent is
for MySQL be the right solution?

> We could
> roll our own solution here, but to my knowledge SQLAlchemy has a solid
> connection pool implementation which is entirely separate from the
> rest of the SQLAlchemy ORM. We could just ensure that if someone needs
> connection pooling there's a documented way of integrating the
> SQLAlchemy connection pool with Django - that way we don't have an
> external dependency on SQLAlchemy for the common case, but people who
> need connection pools can still have them.
>
> Backwards compatibility
> -----------------------
>
> I think we can do all of the above while maintaining almost 100%
> backwards compatibility with Django 1.0. In the absence of a DATABASES setting we
> can construct one using Django's current DATABASE_ENGINE /
> DATABASE_NAME / etc settings to figure out the 'default' connection.
> Everything else should Just Work as it does already - the only people
> who will need to worry are those who have hacked together their own
> multi-db support based on Django internals.

Yes. I think this can be done in a fully backwards compatible fashion.


>
> Justification
> =============
>
> Why is get_connection() on Manager, not Query or QuerySet?
> ----------------------------------------------------------
>
> The logic that picks which database connection is used could live in
> three potential places: on the manager, on the QuerySet class or on
> the Query itself. The manager seems to me like the most natural place
> for this to live - users are already used to modifying the manager,
> it's trivial to swap in a different manager (e.g. a
> MasterSlaveManager) for a given model and the manager class gets to
> see all of the queries that go through it, including things like
> Article.objects.create(). If there are good reasons it should go on
> the Query or QuerySet instead I'd love to hear them.

As far as the public API goes, it's either the manager or the QuerySet
(that means the concept transfers nicely to non-relational situations
like LDAP where Query might not even exist in something resembling the
current form).

The deciding factor about whether or not it is on the QuerySet (which
would normally be the "natural" choice) is whether it will ever make
sense to want to manually get/set the connection mid-stream. At which
point the method is no longer get_connection(), since it's
multi-purpose. But I suspect this is better controlled by managers (the
decision as to whether to hit master, slave or cache is a property of
fields and models, not of query construction and filtering).

More thoughts once I've had a chance to digest all this.

Regards,
Malcolm

Rock

Sep 10, 2008, 2:25:02 PM
to Django developers
The default setting defines the application-wide db connection.
The Manager mechanism is for overriding the default connection.

Selecting the db via signals makes no sense to me, however a mapping
between apps and databases in settings is worth a moment of thought as
a possible supplement to the Manager approach.

Rock

P.S. for Simon:
I haven't spotted any obvious problems with the proposal so far.
My initial reaction is that I like it. Good work!

Simon Willison

Sep 10, 2008, 3:07:48 PM
to Django developers
On Sep 10, 7:17 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> Also, as a general thing here, did you go back and read the various
> discussions we had about API when the multi-db Summer of Code project
> was running? If not, that would be worth doing and incorporating, since
> we debated a few alternatives for things back then which will still be
> somewhat relevant

I'm pretty sure I did at the time :) The above is a brain dump based
on months of quiet chewing followed by a burst of inspiration from
your Django talk. Are there any threads in particular that you think
are worth revisiting?

Simon Willison

Sep 10, 2008, 3:30:26 PM
to Django developers
On Sep 10, 7:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
> For application-wide db connections, I think it'd be much easier and
> more portable to choose the connection in settings.py rather than in a
> Model.

That's a very interesting point, and one I hadn't considered. It makes
sense to allow people to over-ride the connection used by an
application they didn't write - for example, people may want to tell
Django that django.contrib.auth.User should live in a particular
database. Furthermore, just allowing people to over-ride the
connection used for an existing application isn't enough - you need to
be able to over-ride the default get_connection method, since you
might want to shard Django's built in users (for example).

This is definitely one of the use-cases we need to cover.

I'm not sure of the best way to handle it though. The way I see it the
options are as follows:

1. Monkey-patch the existing User manager.
2. Have a setting which lets you say "for model auth.User, use the
get_connection method defined over here". This is made inelegant by
the fact that settings shouldn't really contain references to actual
function definitions, which means we would probably need to use a
'dotted.path.to.a.function', which is crufty.
3. Use a signal. There isn't much precedent in Django for signals
which alter the way in which something is done - normally signals are
used to inform another part of the code that something has happened.

I'm not overjoyed by any of these options.

Cheers,

Simon

Dan Fairs

Sep 10, 2008, 3:40:12 PM
to django-d...@googlegroups.com
> 2. Have a setting which lets you say "for model auth.User, use the
> get_connection method defined over here". This is made inelegant by
> the fact that settings shouldn't really contain references to actual
> function definitions, which means we would probably need to use a
> 'dotted.path.to.a.function', which is crufty.

The admin takes a registry-based approach to associate ModelAdmin
classes with Models. Could a similar approach work here?

myapp/connections.py:

from django.contrib.multidb import connection
from myapp.models import MyModel

class MyModelConnection(connection.ModelConnection):

    def __call__(self):
        ... return a database connection ...

connection.register(MyModel, MyModelConnection)

I guess there's no reason even for MyModelConnection to be a class; a
callable would do.

Just a thought.

Cheers,
Dan
--
Dan Fairs <dan....@gmail.com> | http://www.stereoplex.com/

Waylan Limberg

Sep 10, 2008, 4:11:08 PM
to django-d...@googlegroups.com
On Wed, Sep 10, 2008 at 3:30 PM, Simon Willison <si...@simonwillison.net> wrote:
>
> On Sep 10, 7:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
>> For application-wide db connections, I think it'd be much easier and
>> more portable to choose the connection in settings.py rather than in a
>> Model.
>
> That's a very interesting point, and one I hadn't considered. It makes
> sense to allow people to over-ride the connection used by an
> application they didn't write - for example, people may want to tell
> Django that django.contrib.auth.User should live in a particular
> database. Furthermore, just allowing people to over-ride the
> connection used for an existing application isn't enough - you need to
> be able to over-ride the default get_connection method, since you
> might want to shard Django's built in users (for example).
>
> This is definitely one of the use-cases we need to cover.
>
> I'm not sure of the best way to handle it though. The way I see it the
> options are as follows:
>

Would this perhaps be easier to do after the Apps-Refactor (#3591)
lands? I'm not really familiar with that ticket, but if we're trying
to set a connection at an app level, that seems like the easiest way
to do it via settings. Something like:

INSTALLED_APPS = (
    app('django.contrib.auth', connection='my_user_db'),
    ...
)

Not sure how that would work for over-riding the default
get_connection method though. We'd probably still be referring to a
callable by 'dotted.path.to.a.function' syntax. And it would apply to
all models in an app, not just some.

Just a thought.

--
----
Waylan Limberg
way...@gmail.com

koenb

Sep 10, 2008, 4:16:47 PM
to Django developers
Just to add a little note: back in May I did some work on multi-db;
some thoughts and code can be found at http://trac.woe-beti.de/ ,
which Ben Ford set up for this. I stopped because Django was becoming
too much of a moving target to keep it in sync (and I did not have the
time).

I would like to point out that my starting point was mainly to be able
to use data from different database ENGINES (like some tables from
postgres and some from mysql).

As far as I can tell, this is not supported currently by the plumbing
Malcolm provided (since the operations settings, e.g. in WhereNode, are
taken from the default connection and not from the passed-in
connection; this is no problem if your databases use the same engine,
but it produces wrong SQL if you have different engines).
I already reported this once in #7258 (which was considered invalid,
probably because I forgot to mention my use case).

Anyway, I am very glad to see some renewed interest in this field!

Koen

Malcolm Tredinnick

Sep 10, 2008, 4:18:38 PM
to django-d...@googlegroups.com

On Wed, 2008-09-10 at 20:40 +0100, Dan Fairs wrote:
> > 2. Have a setting which lets you say "for model auth.User, use the
> > get_connection method defined over here". This is made inelegant by
> > the fact that settings shouldn't really contain references to actual
> function definitions, which means we would probably need to use a
> > 'dotted.path.to.a.function', which is crufty.
>
> The admin takes a registry-based approach to associate ModelAdmin
> classes with Models. Could a similar approach work here?

Oh, please, no! Registration is a very fragile process. It simply
doesn't work very well. It's a bit disappointing that it's the way we
have to do things in places in Django, and if we can avoid it
elsewhere that'd be nice.

Malcolm


Justin Fagnani

Sep 10, 2008, 4:24:53 PM
to django-d...@googlegroups.com
On Wed, Sep 10, 2008 at 12:30 PM, Simon Willison
<si...@simonwillison.net> wrote:
> On Sep 10, 7:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
>> For application-wide db connections, I think it'd be much easier and
>> more portable to choose the connection in settings.py rather than in a
>> Model.
>
> That's a very interesting point, and one I hadn't considered. It makes
> sense to allow people to over-ride the connection used by an
> application they didn't write - for example, people may want to tell
> Django that django.contrib.auth.User should live in a particular
> database. Furthermore, just allowing people to over-ride the
> connection used for an existing application isn't enough - you need to
> be able to over-ride the default get_connection method, since you
> might want to shard Django's built in users (for example).

I think this example highlights the problem with per-Model db
connections: it'll only work if either that model is not related to
the others in the app, or if the other models in the app also use the
same db. This will probably make per-application db connections a much
more common use case than per-Model.

> 2. Have a setting which lets you say "for model auth.User, use the
> get_connection method defined over here". This is made inelegant by
> the fact that settings shouldn't really contain references to actual
> function definitions, which means we would probably need to use a
> 'dotted.path.to.a.function', which is crufty.

Considering that this is how every module, function and class is
referred to in settings, I don't think it'll be that big of a deal. I
especially like Waylan's suggestion.

> 3. Use a signal. There isn't much precedence in Django for signals
> which alter the way in which something is done - normally signals are
> used to inform another part of the code that something has happened.

The nice thing about signals is that it allows any arbitrary scheme
for selecting connections without modifying the application. For the
User case above, you could register a function that chooses a replica
for User queries only on selects which don't join with a model outside
the auth app.

I see your point about not changing how things are done with signals.
I was thinking this would be done most simply by sending the QuerySet
with the signal, but that opens things up to a lot more changes than
just db connections. That could end up being a way to introduce very
hard-to-find bugs. I still like how easy it makes it to customize db
access without altering the app itself.

-Justin

Simon Willison

Sep 10, 2008, 4:56:28 PM
to Django developers
On Sep 10, 9:18 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> Oh, please, no! Registration is a very fragile process. It simply
> doesn't work very well. It's a bit disappointing that it's the way we
> have to do things in places in Django and if we can avoid it
> elsewhere that'd be nice.

I was hoping we could get a discussion going about this at DjangoCon.
Registration is a pattern that comes up /all the time/ in Django:

* Registering models with the admin
* Registering models with databrowse
* Registering template tags
* Registering custom management commands

It's also present in popular third-party apps:

* django-tagging and django-mptt both register models (though both
feel like they should really be some kind of mixin)

We'll also need it in the near future for a couple of in-development
features:

* Registering custom panels with the Django debugging toolbar
* Registering new benchmarks with the metronome profiling tool
* Registering get_connection overrides in the above multi-db proposal

Finally, we've been needing to solve it for projects at work: we have
a CMS that exposes a concept of "custom rows" which can be provided by
applications and dropped in to various places around the site. Guess
what: the rows need to be registered!

There MUST be a good way of doing this. zope.interface? setuptools
entry points? We really, really need to solve this for Django. If we
had a single, supported and documented way of registering things it
would open up a huge amount of potential for plugin-style extension
points and give us a proper solution to a problem we are solving in a
bunch of different ways at the moment.

Any ideas?

Ivan Sagalaev

Sep 10, 2008, 5:15:20 PM
to django-d...@googlegroups.com
Simon Willison wrote:
> * Simple master-slave replication: SELECT queries are distributed
> between slaves, while UPDATE and DELETE statements are sent to
> the master.

It won't work at the statement level. If you have a transaction and do
an UPDATE and then a SELECT, the latter won't see the results of the
former because it is looking at another connection (and another database).

I strongly believe that choosing between a master and a slave is the
decision that should be made on business logic level, not at model level.

Malcolm Tredinnick

Sep 10, 2008, 5:16:16 PM
to django-d...@googlegroups.com

On Wed, 2008-09-10 at 13:56 -0700, Simon Willison wrote:
> On Sep 10, 9:18 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
> wrote:
> > Oh, please, no! Registration is a very fragile process. It simply
> > doesn't work very well. It's a bit disappointing that it's the way we
> > have to do things in places in Django and if we can avoid it
> > elsewhere that'd be nice.
>
> I was hoping we could get a discussion going about this at DjangoCon.
> Registration is a pattern that comes up /all the time/ in Django:

*sigh* Whilst I realise you are very enthusiastic about getting stuff
done at the moment, Simon, it's very hard to juggle 10 different
serious design issues all at once. And, yes, I understand that some of
them overlap, but solving the registration issue isn't really going to
be the main part (or even necessarily any part) of working out a
multiple database API.

Incremental steps, rather than a plan which requires changing 6 things
at once, are really preferable here. It helps code stability and means
we can devote our full attention to just one or two things at a time.
The things you are bringing up are serious issues, but they're not new
issues -- hardly anything that people are suddenly rediscovering from
DjangoCon hasn't been on our radar for months or years. We don't have
to solve them all this week. So, please, let's slow down a bit and take
the time to consider how we can do things in small steps, reserving
large sweeping changes as an "if all else fails" fallback. We might
still end up using some kind of new "registration" alternative in, say,
database connection registration, but that can be phase two or phase
three. Phase one being the manual configuration option.

>
> * Registering models with the admin
> * Registering models with databrowse
> * Registering template tags
> * Registering custom management commands

You don't register custom management commands. They are "discovered",
much as the import system discovers modules: by putting them in a
well-defined location. I'm not sure why template tags aren't done the
same way (a distinguished variable in a file that is imported, saying
"these are the methods that are the tags I'm supplying", similar to
__all__).

As I note below, the current uses of registration aren't all necessary.

> It's also present in popular third-party apps:
>
> * django-tagging and django-mptt both register models (though both
> feel like they should really be some kind of mixin)
>
> We'll also need it in the near future for a couple of in-development
> features:
>
> * Registering custom panels with the Django debugging toolbar
> * Registering new benchmarks with the metronome profiling tool
> * Registering get_connection overrides in the above multi-db proposal
>
> Finally, we've been needing to solve it for projects at work: we have
> a CMS that exposes a concept of "custom rows" which can be provided by
> applications and dropped in to various places around the site. Guess
> what: the rows need to be registered!
>
> There MUST be a good way of doing this. zope.interface?

> setuptools
> entry points?

Whilst I realise that some people want to use setuptools, anything that
*requires* it is going to be a big problem for me, at least. It has some
problems that mean it's really unusable in large sysadmin installations
and it is at odds with the existing packaging system on proper
distributions (you can't be a proper distribution without a proper
packaging system, after all). The maintainer of setuptools has pretty
clearly indicated he isn't interested in fixing the latter problem (look
back at some of the interactions with the Debian guys in the past) and
the former is barely acknowledged as even a problem.

> We really, really need to solve this for Django.

It's not entirely clear that we do, since before you solve something
there has to be a problem. We could make things a bit easier, but it's
quite possibly a case of using the wrong shovel to hammer in your screws
in some cases and in other cases it requires almost no infrastructure.

For those already firing up their replies, note that I carefully wrote
"possibly". I'm asking that people step back and consider whether it's
the right approach at all before we make a better version of thing X.

Regards,
Malcolm

Mike Malone

unread,
Sep 10, 2008, 5:24:12 PM9/10/08
to django-d...@googlegroups.com
Wow... like Malcolm said, lots to digest here.

So to start, the "simple" master-slave replication scenario turns out not to be so simple once you get into the implementation details. Replication lag being what it is, you almost never want to query the slave for every SELECT.

At Pownce, for example, we stick users to the master database for some period of time (a couple of seconds, usually) after they post a new note. The problem here (as Malcolm pointed out) is that related managers use the default manager for the related field. So if I ask for a User's Notes, the default Note manager is used. That manager is, presumably, where the decision is going to be made as to whether the slave or the master should be queried. But the Note manager has no way of knowing whether the User is stuck to the master -- it doesn't even know that there's a User associated with the query...

We've solved this by poking at a lot of the related fields internals. Malcolm helped a lot, and he's probably one of the only people who could have made it happen. It's not that much code, but it relies heavily on internal API and is certainly not something that should be recommended.
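The heuristic itself can be sketched without any internals (this is not Pownce's actual code; the in-process dict stands in for whatever shared cache would really hold the timestamps, and all names are invented):

```python
import time

# Sketch of "stick a user to the master for a few seconds after a write".
# _last_write stands in for a shared cache such as memcached.

STICKY_SECONDS = 5
_last_write = {}  # user_id -> timestamp of that user's last write

def note_saved(user_id, now=None):
    # Called whenever the user posts a new note.
    _last_write[user_id] = time.time() if now is None else now

def connection_for_user(user_id, now=None):
    # A user who wrote recently reads from the master, since the slaves
    # may not have caught up yet; everyone else reads from a slave.
    now = time.time() if now is None else now
    if now - _last_write.get(user_id, float('-inf')) < STICKY_SECONDS:
        return 'master'
    return 'slave'
```

The hard part, as described above, is not this function but getting the related manager to call it with the right user.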

Simon, from your first email it seems you're suggesting that the Manager call Query.as_sql() and then parse the resulting SQL string? That seems like it's going to encourage a lot of hacky/fragile solutions. IMO, the right place for a decision like "should this User's notes come from the master, or the slave?" is on the User model (or maybe User manager), not in the Note manager.

The same problem comes up with sharding. Suppose, for example, Pownce started sharding by User and putting each User's Notes on the same server the User is on. We should be able to call User.notes.all() and get that User's notes, but the Note manager can't easily tell what server it should be querying, since it doesn't know about the User. Again, you could start poking at the internals of Query and try to figure out what's going on, but that doesn't seem like a particularly elegant solution...

Mike

Simon Willison

unread,
Sep 10, 2008, 5:27:14 PM9/10/08
to Django developers
Good point. That also highlights an omission in my original brain-dump
- having a "uses" method on a QuerySet isn't enough, you also need a
way of over-riding the database connection used by a call to
model.save(). Again, I'd propose the same terminology, this time as a
keyword argument.

If you wanted to control which master/slave connection was used from
your business logic, you'd do something like this:

obj = Article.objects.using('master').get(pk=4)
obj.name = 'Hello'
obj.save(using='master')

Ivan: how does that look?


Ivan Sagalaev

unread,
Sep 10, 2008, 5:33:51 PM9/10/08
to django-d...@googlegroups.com
Mike Malone wrote:
> At Pownce, for example, we stick users to the master database for some
> period of time (a couple of seconds, usually) after they post a new
> note.

Another approach, which I took in mysql_replicated[1], is to always serve
the page that the user GETs from a redirect after a successful POST from
the master. It certainly doesn't solve the problem in general but it's good
enough (for us at least). But I'll second that this damn lagging thing
is pretty hard to solve in a general way.

[1]: http://softwaremaniacs.org/soft/mysql_replicated/en/
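A rough sketch of that idea (names invented here, not the real mysql_replicated API): writes always hit the master, and a short-lived cookie set on the POST makes the redirected GET that follows hit the master too.

```python
# Sketch of HTTP-semantics connection picking: writes go to the master,
# and the GET right after a redirect-from-POST is pinned there as well.

PINNING_COOKIE = 'just_wrote'
WRITE_METHODS = ('POST', 'PUT', 'DELETE')

def pick_connection(method, cookies):
    if method in WRITE_METHODS:
        return 'master'
    if PINNING_COOKIE in cookies:   # the GET following a successful POST
        return 'master'
    return 'slave'

def outgoing_cookies(method, cookies):
    # After a write, set the pinning cookie so the next request is pinned.
    if method in WRITE_METHODS:
        return dict(cookies, **{PINNING_COOKIE: '1'})
    return dict(cookies)
```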

Simon Willison

unread,
Sep 10, 2008, 5:40:14 PM9/10/08
to Django developers
On Sep 10, 10:24 pm, "Mike Malone" <mjmal...@gmail.com> wrote:
> At Pownce, for example, we stick users to the master database for some
> period of time (a couple of seconds, usually) after they post a new note.
> The problem here (as Malcolm pointed out) is that related managers use the
> default manager for the related field. So if I ask for a User's Notes, the
> default Note manager is used. That manager is, presumably, where the
> decision is going to be made as to whether the slave or the master should be
> queried. But the Note manager has no way of knowing whether the User is
> stuck to the master -- it doesn't even know that there's a User associated
> with the query...

That's really interesting. I wonder if that invalidates the whole
approach I proposed, or merely means it needs some refining?

> Simon, from your first email it seems you're suggesting that the Manager
> call Query.as_sql() and then parse the resulting SQL string?

Not at all - I'm suggesting the manager pokes around at the query
object itself (what type of query is it, which tables does it touch
etc). I mentioned as_sql as a throw-away remark; I certainly wouldn't
want to suggest implementing connection selection logic in that way.

> IMO, the right place for a decision like "should this User's notes come from
> the master, or the slave?" is on the User model (or maybe User manager),
> not in the Note manager.

It's possible that the Note manager simply won't have enough
information to make that decision - in which case I'd suggest that
solving it is up to the developer. They might choose to use the
'using()' method to force a query through notes to go to the slave,
for example.

> The same problem comes up with sharding. Suppose, for example, Pownce
> started sharding by User and putting each User's Notes on the same server
> the User is on. We should be able to call User.notes.all() and get that
> User's notes, but the Note manager can't easily tell what server it should
> be querying, since it doesn't know about the User. Again, you could start
> poking at the internals of Query and try to figure out what's going on, but
> that doesn't seem like a particularly elegant solution...

Again, my assumption is that in that case it's up to the developer to
ensure queries go to the right place - maybe by adding their own
"get_notes" method to their user object that automatically queries the
correct shard (with the using() method).

I'm not convinced that this stuff can be made invisible to the
developer - if you're sharding things you're in pretty deep, and you
probably want to maintain full control over where your queries are
going.
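The "get_notes" idea might look roughly like this (a sketch: `shard_for_user`, the alias scheme, and the method name are all invented, and the real version would route via the proposed using() method):

```python
# Sketch of developer-controlled shard routing via a helper on the model.

NUM_SHARDS = 4

def shard_for_user(user_id):
    # e.g. notes for user 6 live on the alias 'shard_2'
    return 'shard_%d' % (user_id % NUM_SHARDS)

class User:
    def __init__(self, id):
        self.id = id

    def get_notes_alias(self):
        # With the proposed API this would be something like:
        #   return Note.objects.using(shard_for_user(self.id)).filter(user=self)
        return shard_for_user(self.id)
```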

Ivan Sagalaev

unread,
Sep 10, 2008, 5:44:52 PM9/10/08
to django-d...@googlegroups.com
Simon Willison wrote:
> Good point. That also highlights an omission in my original brain-dump
> - having a "uses" method on a QuerySet isn't enough, you also need a
> way of over-riding the database connection used by a call to
> model.save(). Again, I'd propose the same terminology again as a
> keyword argument.
>
> If you wanted to control which master/slave connection was used from
> your business logic, you'd do something like this:
>
> obj = Article.objects.using('master').get(pk = 4)
> obj.name = 'Hello'
> obj.save(using = 'master')
>
> Ivan: how does that look?

Well... To be sure, save() should always go to the master, because on
slaves you just don't have permission to save anything. So a parameter
to save() is redundant.

More to the point, Mike Malone just described a situation where you want
the same code to query either the master or a slave depending on whether
you're sure that data on the slave has had a chance to sync with the master.

Another thing is that explicit specification of a connection may become
very tedious and non-DRY. For example, one should always use 'master'
when POSTing forms. I don't think requiring users to do that manually
is a good idea.

To be honest, I like the approach that I've implemented in my db backend
for this: using HTTP semantics with the ability to force master or slave
when you need it. However it works with an implicit state which is not a
clean approach compared to a functional one with explicit passing of
connections.

Ivan Sagalaev

unread,
Sep 10, 2008, 5:49:17 PM9/10/08
to django-d...@googlegroups.com
Simon Willison wrote:
> That's really interesting. I wonder if that invalidates the whole
> approach I proposed, or merely means it needs some refining?

As Malcolm has pointed out, you're proposing many things at once :-). I
tend to think that replication, sharding and migration to another db are
very different things and maybe we shouldn't try to solve them with a
single API.

Mike Malone

unread,
Sep 10, 2008, 5:59:43 PM9/10/08
to django-d...@googlegroups.com
On Sep 10, 10:44 pm, Ivan Sagalaev wrote:
> Well... To be sure, save() should always go to the master, because on
> slaves you just don't have permission to save anything. So a parameter
> to save() is redundant.

Not so. There are certainly use-cases for more sophisticated database architectures where, for example, the majority of the database tables are written to the master and replicated to all slaves, while a couple of write-heavy tables are sharded and written directly to individual slaves. More common is a master-master replication strategy, where a particular User (for example) is stuck to one of a pair of database servers that replicate one another. In this case you'd want to be able to specify somehow which server to save() to.

Mike



ab

unread,
Sep 10, 2008, 5:55:12 PM9/10/08
to Django developers
For the API to accept a DSN, alias, or connection anywhere would add
similar code in multiple places. I propose that the aliases are mapped
into django.db.connections. For your example, you could use
django.db.connections.archive. I also propose that you can either
define a single database (as now) or multiple DATABASES (as you
propose) and to define both or neither is an error. But anyways, how
to name/specify a database is semi-bikesheddy and orthogonal to the
issue of how to actually choose the database to be used, which is the
more important one. Here's my take on that:

My biggest problem with Simon's original proposal is that the
connection-choosing logic is too spread out. Sticking stuff on your
Model classes makes sense when the code is "local to that model" --
like methods, metadata, or choosing a connection per-table -- but that
doesn't make sense for a lot of multi-db setups. For complicated stuff
like sharding, I think you'd want all the logic in the same place.

Counter-proposal:
A *project-global* get_connection function, maybe in a location
specified by settings.
Input: the queryset, at least, and probably whatever else you'll
likely want to use: the model class, tables joined with,
fields_accessed?, etc.
Output: a connection object

That would make it easier to write and maintain your multi-db setup
and share logic across models. If you want control at the queryset-
granularity, this could maybe result in a proposed_connection
parameter to get_connection (and get_connection can obviously raise
exceptions as it sees fit). This proposal also solves the problem of
choosing a connection for contrib or 3rd-party applications.
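As a sketch, the counter-proposal's hook might look like this (the registry, the per-app mapping and the argument names are illustrative, not a proposed interface):

```python
# Sketch of a single project-global get_connection hook, located via
# settings, that maps what the ORM knows about a query to a connection.

CONNECTIONS = {'default': 'default-conn', 'logs': 'logs-conn'}
APP_DATABASES = {'logging': 'logs'}  # app label -> connection alias

def get_connection(app_label, tables=(), fields_accessed=()):
    # All per-app / per-model / per-table routing logic lives in one place.
    alias = APP_DATABASES.get(app_label, 'default')
    return CONNECTIONS[alias]
```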

Andrew

Mike Malone

unread,
Sep 10, 2008, 6:03:57 PM9/10/08
to django-d...@googlegroups.com
On Sep 10, 10:40 pm, Simon Willison wrote:
> On Sep 10, 10:24 pm, "Mike Malone" <mjmal...@gmail.com> wrote:
> > At Pownce, for example, we stick users to the master database for some
> > period of time (a couple of seconds, usually) after they post a new note.
> > The problem here (as Malcolm pointed out) is that related managers use the
> > default manager for the related field. So if I ask for a User's Notes, the
> > default Note manager is used. That manager is, presumably, where the
> > decision is going to be made as to whether the slave or the master should be
> > queried. But the Note manager has no way of knowing whether the User is
> > stuck to the master -- it doesn't even know that there's a User associated
> > with the query...
>
> That's really interesting. I wonder if that invalidates the whole
> approach I proposed, or merely means it needs some refining?

I think it just needs refining. My understanding is that related fields was due for a refactor anyways, so this would probably be a good time to do / think about it. I guess my point is that there needs to be some non-internal API for getting at related field information, too. In any case, more thought is required.

Mike

Simon Willison

unread,
Sep 10, 2008, 6:06:54 PM9/10/08
to Django developers
On Sep 10, 10:44 pm, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:
> Well... To be sure, save() should always go to the master, because on
> slaves you just don't have permission to save anything. So a parameter
> to save() is redundant.

It's redundant in the case of a single master, but there are other
situations when you might want full control over where a save() ends
up going (when you have more than one master for example, or in a
situation where you are loading data from one database and saving it
to another as part of an import/export routine).
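The import/export case can be sketched with two dicts standing in for the two databases (with the proposed API the loop would read roughly `for obj in Article.objects.using('old'): obj.save(using='new')`):

```python
# Sketch: copy every row from one "database" to another via an explicit
# save target. The dicts stand in for real connections.

old_db = {1: 'first article', 2: 'second article'}
new_db = {}
DATABASES = {'old': old_db, 'new': new_db}

def save(pk, value, using):
    DATABASES[using][pk] = value

for pk, value in old_db.items():
    save(pk, value, using='new')   # every row is copied across
```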

> To be honest, I like the approach that I've implemented in my db backend
> for this: using HTTP semantics with the ability to force mater or slave
> when you need it. However it works with an implicit state which is not a
> clean approach compared to a functional one with explicit passing of
> connections.

I had your POST v.s. GET method in mind when I was thinking about the
get_connection method - one of the potential things you could do in
that method is look at a thread local that was set to the request
method when the last request came in. This is a bit too much of a hack
to support in Django core but there's nothing to stop end-users using
thread locals in that way if they want HTTP-based master/slave
selection.
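The thread-local trick might look like this (a sketch of end-user code, as noted above, not anything proposed for Django core; all names are invented):

```python
import threading

# Sketch: middleware records the request method in a thread-local, and a
# get_connection override consults it to pick master or slave.

_request_state = threading.local()

def process_request(method):
    # A middleware would call this at the start of each request.
    _request_state.method = method

def get_connection():
    # GETs may read a slave; anything else must see the master.
    if getattr(_request_state, 'method', 'GET') == 'GET':
        return 'slave'
    return 'master'
```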

Malcolm Tredinnick

unread,
Sep 10, 2008, 6:15:42 PM9/10/08
to django-d...@googlegroups.com

On Wed, 2008-09-10 at 15:03 -0700, Mike Malone wrote:
[...]

> I think it just needs refining. My understanding is that related
> fields was due for a refactor anyways, so this would probably be a
> good time to do / think about it. I guess my point is that there needs
> to be some non-internal API for getting at related field information,
> too. In any case, more thought is required.

Agreed, mostly. I'm using this thread as a way of looking at the various
use-cases people are proposing and this will guide a bunch of that
particular refactoring, I suspect. What needs to be exposed is kind of a
consequence of the API we decide to settle on. It's a case where, right
this minute, we (or, rather, I, personally) don't necessarily have a
high confidence level that we understand all the use-cases and can
confidently split them into "things we want to do out of the box",
"things we want to support via extensions" and "moderately crazy stuff".

Regards,
Malcolm


Simon Willison

unread,
Sep 10, 2008, 6:22:34 PM9/10/08
to Django developers
On Sep 10, 10:55 pm, ab <andrewb...@gmail.com> wrote:
> For the api to accept a DSN, alias, or connection anywhere would add
> similar code in multiple places. I propose that the aliases are mapped
> into django.db.connections. For your example, you could use
> django.db.connections.archive.

I'm keen on the DSN/alias/connection trichotomy for the following
reasons:

1. It's more pleasant to hand around short strings than long object
references.
2. Allowing a DSN to be provided means you don't have to pre-define
every single one of your databases in your settings.py. WordPress.com
is an example of why you might not want to do this: they have
literally thousands of databases each hosting a bundle of blogs -
their system simply wouldn't work if every database connection had to
be pre-defined.
3. Allowing an actual connection object to be handed around means you
can take full advantage of Python's duck-typing to do pretty much
anything you like.
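The trichotomy above can be sketched as a single resolver that accepts any of the three forms (the registry, the DSN handling and the `Connection` stand-in are all illustrative):

```python
# Sketch: accept a predefined alias, an ad-hoc DSN, or a connection object.

PREDEFINED = {'master': 'master-conn', 'archive': 'archive-conn'}

class Connection:
    """Stand-in for a real DB-API connection object."""

def resolve_connection(spec):
    if isinstance(spec, Connection):
        return spec                     # duck-typed: use the object directly
    if '://' in spec:
        return 'conn-for:%s' % spec     # a DSN: connect on the fly, no
                                        # settings.py entry required
    return PREDEFINED[spec]             # a short alias from settings
```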

> Counter-proposal:
> A *project-global* get_connection function, maybe in a location
> specified by settings.

That seems like a smart compromise. It allows for over-riding existing
applications and, like you said, keeps all of the tricky connection
logic in a single place.

I'm still interested in the simple case of allowing individual models
to live on different databases, but one way of tackling that could be
the following:

# settings.py
DATABASE_PICKER = 'django.contrib.multidb.per_model_database'

# models.py
class Article(models.Model):
    ...
    class Meta:
        using = 'archive'

The per_model_database connection picker would use model._meta.using
to decide which connection to return.

> Input: the queryset, at least, and probably whatever else you'll
> likely want to use: the model class, tables joined with,
> fields_accessed?, etc.
> Output: a connection object

I still think the input should be a Query rather than a QuerySet
(since a QuerySet isn't available in the case of insert() or
update()). The query object should provide enough information to make
a decision in most (all?) cases, and you can always use query.model to
get back to the original model definition. It might be useful for the
ORM to annotate the query with the corresponding queryset, if one is in
use - such that you can access it as query.queryset from inside the
connection picker.

This is a really enlightening thread (for me at least) - I'm going to
collect together the various requirements that have been highlighted
on a wiki page, probably tomorrow morning.

Andrew Godwin

unread,
Sep 10, 2008, 6:19:40 PM9/10/08
to django-d...@googlegroups.com
ab wrote:
> For the API to accept a DSN, alias, or connection anywhere would add
> similar code in multiple places. I propose that the aliases are mapped
> into django.db.connections. For your example, you could use
> django.db.connections.archive. I also propose that you can either
> define a single database (as now) or multiple DATABASES (as you
> propose) and to define both or neither is an error. But anyways, how
> to name/specify a database is semi-bikesheddy and orthogonal to the
> issue of how to actually choose the database to be used, which is the
> more important one. Here's my take on that:
>
> My biggest problem with Simon's original proposal is that the
> connection-choosing logic is too spread out. Sticking stuff on your
> Model classes makes sense when the code is "local to that model" --
> like methods, metadata, or choosing a connection per-table -- but that
> doesn't make sense for a lot of multi-db setups. For complicated stuff
> like sharding, I think you'd want all the logic in the same place.
>
> Counter-proposal:
> A *project-global* get_connection function, maybe in a location
> specified by settings.
> Input: the queryset, at least, and probably whatever else you'll
> likely want to use: the model class, tables joined with,
> fields_accessed?, etc.
> Output: a connection object

I personally think we need an approach where we provide the framework
that can support sharding and master/slave database layouts, but where
that's not directly built in, since as these discussions seem to show,
everyone has a different need...

A function whacked in settings might work best, especially if there are
ones shipped with Django to do the simpler stuff (like databases set
per-model, something one of the projects I was on was considering, even
though it's very simple); as long as it has access to some kind of
state, it can also then decide that a user was saved in the last second,
and you need to use the master, or something of the sort.

Having features where you put attributes or functions on models or apps
can then be provided by something Django ships with that implements said
scanning of models or apps (I'm pretty sure this is possible, if we
managed to deliver all the appropriate arguments to the function...).

This way, most people will be satisfied by the shipped functions (which
are somewhat analogous to generic views, in a very very roundabout way)
and people who Love Control™ can roll their own.

Just my two cents. I'll scuttle back to migrations libraries now.

Andrew

Robert Lofthouse

unread,
Sep 10, 2008, 8:27:17 PM9/10/08
to Django developers
Wow, I'm with Malcolm on the "steady on, let's take this a bit at a
time" :) Someone has been working hard, but that's a huge amount to
think about.

Reading through some of the comments: most of the functionality
described here would be an explicit mapping and choice made by the
developer; it would be pretty hard to use any form of magic. I'm all
for the setting up of bindings in a settings.py, which you later
access and pass through whenever you do a select/update/delete. I'm
also for per-model connections/connection overrides, with an option to
override on a more general level at the (overseeing) application
level.

I'm against the idea of having functions in settings.py, but I don't
see why they can't be discovered based on definitions in the settings
file.

Will have a proper read through tomorrow, as this thread is
interesting.

Sean Stoops

unread,
Sep 11, 2008, 1:06:30 AM9/11/08
to Django developers
I just wanted to toss out one more use case that I know I would find
especially useful. Legacy databases.

I'd like to see the ability to set up a model (using rough
introspection and then user tweaking) based on a legacy database,
maybe with the options of read-write or read-only access.

Like Robert, I'll have to give this another read through tomorrow
after a little sleep.

-Sean

Jan Oberst

unread,
Sep 11, 2008, 5:13:42 AM9/11/08
to Django developers
> I think we can do all of the above while maintaining almost 100%
> backwards compatibility with Django 1.0. In the absence of a DATABASES
> setting we can construct one using Django's current DATABASE_ENGINE /
> DATABASE_NAME / etc. settings to figure out the 'default' connection.
> Everything else should Just Work as it does already - the only people
> who will need to worry are those who have hacked together their own
> multi-db support based on Django internals.
>

I think sharding really is something every developer would do
differently because it's just so dependent on the actual business logic
and how your models work. That said, maybe there are a few things lots of
people agree on and could build a base for more detailed
implementations.

At our (large-scale) project we chose Django because of the ORM and I
implemented a complete sharding setup on top of the ORM. The important
thing for me was to have quite a bit of magic when it comes to
handling the DB. Because you specify relations between data and even
specify how you use them (by building querysets) there's a lot of
information to draw conclusions from. This can lead to a simple and
plain interface. My goal was to have the same application code run on
a single-DB machine and a sharded DB environment without changing
anything.

The problem with sharding is, that lots of operations won't work. No
JOINs, no DB-side foreign keys, no transactions, no auto_increment
IDs... Another few won't work the way they are supposed to. ORDER BY,
LIMIT, COUNT and so on. But like some of you said that's just the way
it is. If you want to use sharding you should take care of those
exceptions yourself.

Our project is based on profiles. And our profile-based sharding
should put all data that's related to a profile on one shard. My
sharding config now tells the ORM that there's models that are always
on the same DB machine (photo sets and photos and comments for the
photos for example). Now once you have a photo object and trigger that
foreign key (photo.comments_set.all() example [1]) this query will go
directly to the shard the photo itself is on. Since we already know
that shard there's not even a need to look that up.

Sharding exceptions occur when the ORM tries something it can't
handle if the model is sharded. The great thing about this is that we
can reduce the amount of use cases where we actually have to spawn a
query across multiple shards.

This won't work in every environment. I implemented this by hacking
quite a bit of Django source (mainly the descriptors of the ForeignKey
field). It sure would be neat to have a proper API for this. The real
problem with these techniques is not implementing them for yourself,
but implementing them for everyone else. Since I'm the one responsible
for our ORM and models I can hack away just like this. Doing it
properly seems like a whole other story.

-Jan


[1] - A little bit of my model code:
This will put all Profile objects on shards 11-13 and establish
dependencies so that dependent objects and querysets know a priori
which shard to query.

class ProfileShardManager(ShardManager):
    # Tell the model which shards it belongs to.
    use_shards = [11, 12, 13]

    def initial_id_to_shard_mapping(self, id):
        # This is only the initial mapping. If a new profile is created
        # we're doing round robin for all shards. Later we could prefer
        # newly added servers for new profiles.
        return 11 + id % 3

class Profile(Model):
    sharding = ProfileShardManager()

    email = EmailField()
    name = CharField()

class Photoset(Model):
    sharding = ShardDependency(Profile)

    profile = ForeignKey(Profile)
    description = TextField()

class Photo(Model):
    sharding = ShardDependency(Profile, pooled_fields=['photoset'])

    profile = ForeignKey(Profile)
    photoset = ForeignKey(Photoset)
    title = CharField()

class PhotoComment(Model):
    sharding = ShardDependency(Profile, pooled_fields=['photoset'])

    profile = ForeignKey(Profile)
    photo = ForeignKey(Photo)
    text = TextField()

Django shell:
>>> p = Photo.objects.get(pk=5)
# Doing a lookup in the shard index table behind the scenes
# to find the correct database for the Photo object with pk=5
>>> p
<Photo: #5>
>>> p._shard
# Every object knows where it lives
<Shard 11>
>>> p.photoset
# Because of pooled_fields p passes along the shard 11
# we know every related model has to be on shard 11
# we only query shard 11 with that ID
<Photoset: #1>
>>> p.comments_set.all()
# Same applies to reverse foreign keys
[<PhotoComment: #1>, <PhotoComment: #2>, <PhotoComment: #3>, <PhotoComment: #4>]

jeffself

unread,
Sep 11, 2008, 9:03:55 AM9/11/08
to Django developers


On Sep 10, 2:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
> For application-wide db connections, I think it'd be much easier and
> more portable to choose the connection in settings.py rather than in a
> Model.
>

I agree with you, Justin. The applications should remain
db-agnostic.

We really need an enhanced settings.py or something new called
orm.py. There needs to be a way to map a class from your models.py
file to a table in a database manually. I know Django's ORM handles
this for you now automatically, but I think it's going to have to
loosen the reins a bit in order for Django to grow and be more
feasible in the enterprise. Give us the ability to override the
default db and table for a given class. I like how Hibernate handles
this.

Rock

unread,
Sep 11, 2008, 5:14:59 PM9/11/08
to Django developers

FWIW as a possible data point, our team has implemented partitioned
models in Oracle. To control this I added a settings parameter with a
list of models to be partitioned. The actual work is done in the
sqlreset command logic (but really needs to be in the table/index
generation code for SQL creates.) Fortunately nothing else in the ORM
needs to know about the partitioning. (The one exception: a partition
aware truncation function would be a nice addition for supporting
block deletions that line up with the partition boundaries.)

From my standpoint, a settings based per-model specification mechanism
for database info may have value over and above the multi-db scenarios
currently under discussion.

Malcolm Tredinnick

unread,
Sep 11, 2008, 6:29:31 PM9/11/08
to django-d...@googlegroups.com

On Wed, 2008-09-10 at 13:16 -0700, koenb wrote:
[...]

> As far as I can tell, this is not supported currently by the plumbing
> Malcolm provided (since the operations settings eg in WhereNode are
> taken from the default connection and not from the passed in
> connection; this is no problem if your databases are using the same
> engine, but it returns wrong SQL if you have different engines).
> I already reported this once in #7258 (which was considered invalid,
> probably because I forgot to mention my use case).

This is indeed a bug (the fact that the appropriate connection isn't
passed into the "where" addition), but it's a bigger one than you
indicate here. The real fix will be that WhereNode.add() should be
passed the "query" instance that is calling it (which contains a
reference to the database connection amongst many other things). If
processing a new where-constraint required, for example, adding a new
table join, you're out of luck at the moment.

I'm holding off committing this until we branch off the 1.0-bugfixes
branch, since it's only going to go onto trunk (it's essentially a
feature-add).

Regards,
Malcolm


mengel

unread,
Sep 12, 2008, 12:11:59 PM9/12/08
to Django developers


> obj = Article.objects.using('master').get(pk=4)
> obj.name = 'Hello'
> obj.save(using='master')

Shouldn't the object remember the database used to fetch it, and
save() should save it back there? Then you would only have to specify
it like that if you want to fetch it from one database and save it in
another explicitly...

Simon Willison

unread,
Sep 12, 2008, 3:08:43 PM9/12/08
to Django developers
That sounds smart to me. There are cases where you want to select from
one database and then save to another (moving data between different
database servers, re-sharding data for whatever reason) but the common
case will be as you describe. This means an ORM object should have a
special property on it recording which database connection it was
loaded from. It also presumes we won't ever want to support the case
where a model consists of properties loaded from more than one
database (title and date from 'metadata', abstract and content from
'content') - I think this is a perfectly reasonable limitation; if you
want to do something like that you can write your own class that
composes two ORM models behind the scenes.
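In sketch form, the behaviour being discussed might look like the
following. These are plain-Python stand-ins, not Django's actual
classes, and the attribute name `_source_db` is invented for
illustration:

```python
class TrackedInstance:
    """Toy model instance that remembers which connection loaded it."""

    def __init__(self, source_db):
        # Recorded at load time by the (hypothetical) query machinery.
        self._source_db = source_db

    def save(self, using=None):
        # Default to the connection the object came from; an explicit
        # `using` argument overrides it, e.g. when moving data between
        # database servers or re-sharding.
        target = using or self._source_db
        return target


obj = TrackedInstance(source_db='master')
print(obj.save())                  # saves back to 'master'
print(obj.save(using='archive'))   # explicit override: 'archive'
```

The point of the sketch is only that the common case (save where you
loaded from) needs no arguments, while the cross-database move stays
possible via an explicit parameter.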

Cheers,

Simon

Brent Hagany

unread,
Sep 12, 2008, 2:45:40 PM9/12/08
to Django developers
Now I'm nobody special, and I probably don't even belong in this
group, but I do want to voice my support for this part of the original
proposal:

> Dealing with single queries that span multiple databases
> --------------------------------------------------------

> Once you have different tables living in different databases there's
> always the chance that someone will try to write a query that attempts
> to join tables that live on two different database servers. I don't
> think we should address this problem at all (aside from maybe
> attempting to throw a descriptive error message should it happen) - if
> you're scaling across different servers you need to be aware of the
> limitations of that approach.

> That said, databases like MySQL actually do allow cross-database joins
> provided both databases live on the same physical server. Is this
> something we should support? I'd like to say "no" and assume that
> people who need to do that will be happy rolling their own SQL using a
> raw cursor, but maybe I'm wrong and it's actually a common use case.

As a relatively small-time developer, it seems silly to me not to
include this from the get-go. From my point of view, things like
sharding and replication are things that I probably will never need,
unless one of my little projects suddenly makes it big. And maybe I'm
naive, but I don't think that most Django users will ever need those
things either. So, to see this part of the proposal downgraded in
importance because it's perceived not to be a common use case strikes
me as odd. This is, after all, something that a vanilla MySQL
installation supports out of the box (my experience with other dbs is
admittedly slim), while sharding and replication take significantly
more work to set up, and are useful in a much smaller subset of
situations.

For my part, I need/want support for cross-database joins on the same
server for two reasons that seem like they ought to be common. The
first and most important is that I want to use the Django ORM in
situations where the data is already happily living in separate
databases on the same server. Since it often makes sense to organize
data into multiple databases on the same server, many people working
with data that was not specifically meant for use with Django are
going to run into a problem switching to Django for any projects that
use that data. This, at least for me, presents a very large barrier
to introducing Django into many business situations.

Secondly, and I know that this is anal, shoving all your data for
every app into a single database is simply ugly. Databases provide
precious little support for namespacing as it is, and I'd prefer to
use what they have. If I have 15 Django apps installed, I don't want
to see 100 tables when I fire up mysql and do a SHOW TABLES;. I want
those apps to be nicely compartmentalized in the db and in the code,
and not differentiated only by bastardized_underscore_namespacing.

As to Malcolm's response:

> My gut feeling is that this isn't something to include initially as a
> necessary goal, but it's also probably not too hard once the other 95%
> is done. My reason for saying 'no' initially is we try to be as portable
> as possible and that particular case is very specific. Also, the single
> physical server constraint makes it even more specialised. If you need
> separate databases for performance reasons, particularly, they're not
> going to be on the same physical server.

As far as I can tell, the argument about being portable is a good one,
and if that ends up being the reason this doesn't get in right away,
then I can't complain. However, the part about the single server
making this more specialized, and the part about performance are non
sequiturs. The vast majority of web applications run with a single
database server; I don't see how that can be called specialized. And,
the reasons for wanting to do cross-database joins (from my POV) have
nothing to do with performance, so I don't really understand why it's
relevant to this part of the proposal. I understand that performance
is the main driving force behind most of this multi-db work, but there
are other reasons that us small fish are looking forward to a multi-db
api.

Again, I'm not really a Django developer, so I apologize for my
intrusion.

TP

unread,
Sep 12, 2008, 5:46:05 PM9/12/08
to Django developers

Simon Willison wrote:

> That sounds smart to me. There are cases where you want to select from
> one database and then save to another (moving data between different
> database servers, re-sharding data for whatever reason) but the common
> case will be as you describe. This means an ORM object should have a
> special property on it recording which database connection it was
> loaded from. It also presumes we won't ever want to support the case

I apologize for my ignorance as to how django DB connections work, but
will this lead to thread safety issues if one thread loads an object
and passes it into another thread? Or is the underlying Django
connection thread safe?

Also, what if the connection has closed for whatever reason since the
object was loaded (due to DB connection timeout, DB restart, etc)?

Malcolm Tredinnick

unread,
Sep 12, 2008, 5:52:36 PM9/12/08
to django-d...@googlegroups.com

On Fri, 2008-09-12 at 14:46 -0700, TP wrote:
[...]

> Also, what if the connection has closed for whatever reason since the
> object was loaded (due to DB connection timeout, DB restart, etc)?

The django.db.connection object isn't a pure network connection. It's a
DatabaseWrapper class and knows how to handle opening a connection if
necessary. They're talking about reusing the DatabaseWrapper (which ties
it to a particular database) here.

Malcolm

Simon Willison

unread,
Sep 12, 2008, 5:57:19 PM9/12/08
to Django developers
On Sep 12, 10:46 pm, TP <thomaspinckn...@gmail.com> wrote:
> > This means an ORM object should have a
> > special property on it recording which database connection it was
> > loaded from. It also presumes we won't ever want to support the case
>
> I apologize for my ignorance as to how django DB connections work, but
> will this lead to thread safety issues if one thread loads an object
> and passes it into another thread? Or is the underlying Django
> connection thread safe?
>
> Also, what if the connection has closed for whatever reason since the
> object was loaded (due to DB connection timeout, DB restart, etc)?

Sorry, that was a clumsy explanation on my part. I didn't mean store
the actual connection object; I meant store information about which
database was used. How this is stored (DatabaseWrapper instance, DSN
string, connection alias) is up to the implementation, but it should
certainly take the factors you describe into account.

Simon Willison

unread,
Sep 12, 2008, 6:18:37 PM9/12/08
to Django developers
On Sep 12, 7:45 pm, Brent Hagany <brent.hag...@gmail.com> wrote:
> As far as I can tell, the argument about being portable is a good one,
> and if that ends up being the reason this doesn't get in right away,
> then I can't complain.  However, the part about the single server
> making this more specialized, and the part about performance are non
> sequiturs.  The vast majority of web applications run with a single
> database server; I don't see how that can be called specialized.  And,
> the reasons for wanting to do cross-database joins (from my POV) have
> nothing to do with performance, so I don't really understand why it's
> relevant to this part of the proposal.

To be honest, the main reason I proposed avoiding this problem is it
sounded like it might be difficult so I was trying to constrain the
scope a bit, plus I don't know if the ability to do cross-database
joins exists outside of MySQL. From a quick search, it looks like
PostgreSQL doesn't support it: http://www.postgresql.org/docs/faqs.FAQ.html
but Oracle does: http://www.remote-dba.cc/oracle_tips_sql_joins.htm

Anyway, it turns out Django supports cross-database joins in MySQL
already! It's a bit of a hack, but it works right now, with no
modifications needed to Django at all.

The only difference between a cross-database join and a regular join
in MySQL is that in cross-database joins the database name is included
in each table reference:

SELECT database1.table1.column1, database2.table2.column2
FROM database1.table1 ...

So, my first attempt was to add the database name to the db_table
property in class Meta:

class Topic(models.Model):
    name = models.CharField(max_length=100)
    slug = models.SlugField()

    class Meta:
        db_table = "other_database.review_topic"

    def __unicode__(self):
        return self.name

That didn't work, because Django adds escaping quotes around each
table reference. Here's the generated SQL:

SELECT
    `other_database.review_topic`.`id`,
    `other_database.review_topic`.`name`,
    `other_database.review_topic`.`slug`
FROM `other_database.review_topic`

You can probably guess where this is going: if you include the
backticks around the . in your db_table setting everything works just
fine!

class Meta:
    db_table = "other_database`.`review_topic"

Regular selects, joins and everything else appear to Just Work out of
the box. It turns out we don't have to do anything special to enable
cross-database joins / models in different databases accessible
through the same logical connection on MySQL, which is pretty cool.
It's a shame the syntax is hacky though. One slight improvement would
be to allow the db_table property to optionally include its own table
quoting and only add backticks if that string didn't have them
already. That way you could do the following, which is slightly less
hacky:

class Meta:
    db_table = "`other_database`.`review_topic`"
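The "only add backticks if the string doesn't already have them"
improvement Simon describes could be sketched as a small helper like
this. It is an illustration of the idea, not the actual backend code:

```python
def quote_name(name):
    """Wrap a MySQL identifier in backticks, unless the caller has
    already supplied its own quoting (as in the db_table trick above)."""
    if name.startswith('`') and name.endswith('`'):
        # Assume the user quoted it deliberately; leave it alone.
        return name
    return '`%s`' % name


print(quote_name('review_topic'))                     # -> `review_topic`
print(quote_name('`other_database`.`review_topic`'))  # left unchanged
```

With a check like that in the backend's name-quoting path, the fully
quoted form of db_table would work without the mismatched-backtick hack.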

Cheers,

Simon

Brent Hagany

unread,
Sep 12, 2008, 9:28:37 PM9/12/08
to Django developers
Simon, that's great! I think I can overlook the hackiness in my
excitement. I am in your debt.

mengel

unread,
Sep 16, 2008, 2:52:18 PM9/16/08
to Django developers


On Sep 10, 12:53 pm, Simon Willison <si...@simonwillison.net> wrote:
> For those who weren't at DjangoCon, here's the state of play with
> regards to multi-db support: Django actually supports multiple
> database connections right now: the Query class (in django/db/models/
> sql/query.py) accepts a connection argument to its constructor, and
> the QuerySet class (django/db/models/query.py) can be passed an
> optional Query instance - if you don't pass it one, it will create a
> Query that uses the default django.db.connection.

Hmm... so to the first layer of the onion, that statement is true.
But now, how does one get a custom connection? Alternate connection
info needs to be able to be passed in to the DatabaseWrapper
constructors, so that you can call them with different database
information; and then, when BaseDatabaseWrapper.cursor() calls
_cursor(), it should pass in a *copy* of the global settings,
overridden with anything passed in as kwargs to
BaseDatabaseWrapper.__init__() (which is already saved in
self.options). That is, instead of:
def cursor(self):
    from django.conf import settings
    cursor = self._cursor(settings)
    if settings.DEBUG:
        return self.make_debug_cursor(cursor)
    return cursor

We should do:

def cursor(self):
    from django.conf import settings
    # copy first, then update: update() mutates in place and returns None
    mysettings = settings.copy()
    mysettings.update(self.options)
    cursor = self._cursor(mysettings)
    if settings.DEBUG:
        return self.make_debug_cursor(cursor)
    return cursor

And then we can do something like:

from django.db.backends.oracle.base import DatabaseWrapper as OracleDatabaseWrapper

myconnection = OracleDatabaseWrapper(
    DATABASE_HOST='somewhere.example.com',
    DATABASE_USER='tiger',
    ...
)

and then you could use myconnection in your Manager subclass, etc.

Anyhow, that seems to me like a stab at addressing the simple "some
model classes talk to a different database" approach. One could make
a ManagerWithAlternateConnection class that subclasses Manager, takes
a database connection as a parameter, and does exactly that.

mihasya

unread,
Sep 30, 2008, 1:25:00 AM9/30/08
to Django developers
So, I'm still learning my way around the deeper parts of the codebase,
but at first glance the "function pointed to from settings.py"
approach seems best. This way, the connection management is completely
opaque to the model. If that function isn't defined, it can just fall
back to the default connection. This requires minimal changes to the
base model (to start with, anyway), and gives the developer
fine-grained control over what queries go where - as fine-grained as
they want. We can then build on that, adding "shortcuts" for basic use
cases, like defining a cluster of slaves to balance reads on etc. for
more "run-of-the-mill" problems.

A distinct advantage of this approach is that it can be built up very
gradually. We can start with literally nothing - just a hook in the
model class to call that function to determine what connection to use
- and gradually add pre-cooked solutions. The advanced users can take
advantage of it right away, defining their custom shard logic. The
usual single-db Django apps won't even notice. We can then very
quietly and gradually build on it.

Once we get into things like cross-database joins, we would need to
modify the ORM more to be aware of that, but we could take this all
one step at a time, just as Malcolm recommended.

Mike.

Obviously this function could get very complicated for a heavily
sharded setup

mihasya

unread,
Sep 30, 2008, 1:39:12 AM9/30/08
to Django developers
Oops looks like I got distracted and hit send mid-thought. That last
part is supposed to be:

Obviously this function could get very complicated for a heavily
sharded setup, but sharding is complicated by definition, so where
that complexity goes isn't necessarily relevant. Letting this logic
live in a function outside the apps allows for immense flexibility
over time: if you upgrade your database server and this allows you to
host two databases that were previously on distinct machines, all you
have to do is modify that file. In the case of sharding, if you start
out with a dataset that works fine on a single server, you don't do
anything in the function. Once you start sharding, you modify the
function to reflect that.
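A rough sketch of what such a settings-pointed routing function could
look like. The hook name, its signature, and the connection aliases
are all invented for illustration; this is not an API Django actually
exposes:

```python
# Hypothetical hook, referenced from settings.py as something like
# DATABASE_ROUTING_FUNCTION = 'myproject.routing.get_connection_alias'

SHARD_COUNT = 4

def get_connection_alias(model_name, operation, user_id=None):
    """Pick a connection alias for a query.

    Starts out trivially (everything on one server); once sharding is
    needed, only this function changes -- the apps stay untouched.
    """
    if model_name == 'UserProfile' and user_id is not None:
        # Shard user data by user id.
        return 'shard_%d' % (user_id % SHARD_COUNT)
    if operation == 'read':
        return 'slave'    # a round-robin over a slave pool could go here
    return 'default'      # writes go to the master


print(get_connection_alias('UserProfile', 'read', user_id=42))  # shard_2
print(get_connection_alias('Article', 'read'))                  # slave
print(get_connection_alias('Article', 'write'))                 # default
```

Upgrading hardware or starting to shard then means editing this one
function rather than touching any application code, which is the
flexibility argued for above.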

Cheers.

Mike.

Joey Wilhelm

unread,
Oct 15, 2008, 5:43:08 PM10/15/08
to Django developers
Hello all,

I just wanted to toss in my own $0.02 for my particular use case and
my experiences thus far. For all around ease, this is a copy of a
comment I made on ticket #1142 yesterday:
----
What is the current status on this issue? There doesn't appear to have
been any visible movement in several months now, and this is something
which would be very useful in a project I'm working on. I have a
slightly different use case than the previously mentioned ones,
however... and I'm sure it's not an entirely common one, but perhaps
something to take into consideration.

The company I work for is sitting on a fairly large data warehouse for
a number of clients. Each client has their own database, with a common
schema, on one of two MS SQL 2k servers. In the near future, we will
hopefully be moving some of these clients onto a PostgreSQL server
(and in the distant future, it would be nice to move towards
Greenplum; although that should be transparent with psycopg2, afaik).
In addition, there is a separate database already running on the
PostgreSQL server, which stores configuration parameters and which I
am using for the Django system tables.

On every page load, I need Django to connect to database 'A' on the
PostgreSQL server for any Django specific operations, then connect to
an arbitrary database on either of the MS SQL servers, depending on
which of our clients' data is currently being viewed. From what I've
seen so far in the discussion in this ticket and on
MultipleDatabaseSupport, this doesn't sound like it would likely be a
supported scenario, as the connection to be used is determined
dynamically, rather than a static connection chosen per model. In my
case, all models apply across all database connections.

Now, I have made this work thus far with my own connection manager
(http://dpaste.com/hold/84458/) but using it is somewhat kludgy and
I'm sure it's far from a full implementation. I also had to apply a
hack (http://code.google.com/p/django-pyodbc/issues/detail?id=18) to
django-pyodbc and apply a pending patch (#6710) to Django in order to
get it to work. Here is an example of it in use: http://dpaste.com/hold/84465/
----

Having read through the posts on this thread, it appears that this was
actually (somewhat) one of the mentioned approaches. As mentioned in
one posting, however, this violates DRY in many ways. In my case
however, this may be a necessity, unless I find some way to determine
the connection in a piece of middleware, then make that connection
globally available (without overriding the 'system'
django.db.connection object).
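One way to make a per-request connection choice "globally available"
without clobbering the system django.db.connection object is a
thread-local registry that middleware writes to and custom managers
read from. A rough sketch, with all names invented for illustration:

```python
import threading

# Thread-local slot that (hypothetical) middleware fills in per request,
# so concurrent requests on other threads don't see each other's choice.
_active = threading.local()

def set_active_connection(alias):
    """Called from middleware once the current client is known."""
    _active.alias = alias

def get_active_connection(default='default'):
    """Called from a custom manager to pick the connection to query."""
    return getattr(_active, 'alias', default)


set_active_connection('client_42_mssql')
print(get_active_connection())  # -> client_42_mssql
```

Because the slot is per-thread, this sidesteps the obvious race of a
module-level global, though objects handed between threads would still
need to carry their own connection information.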

Now, the above is all good and well, and this has been working fairly
well for me thus far... but I just hit one big stumbling block today.
In fact, it's pretty much a 50 ft. brick wall in my path. The problem
is in the get_db_prep_value() function for some field types. Line 609
of django/db/models/fields/__init__.py is the case I am running into
at the moment. It is specifically referring to django.db.connection,
regardless of what connection may have been tossed around between
Query objects. Now even in most multi-db cases, this wouldn't cause a
problem, because it's just asking the db driver how to convert the
data. In my case, however, django.db.connection is a psycopg2
connection, while I'm actually querying from a pyodbc connection. This
tends to make Django fairly unhappy. In fact, this prompted one of the
most confusing Python error messages I have seen to date: "TypeError:
* wants int"

Looking through this same fields file, it appears that there are a
great many direct references to the global connection object, and I'm
not entirely sure how to work around this at the moment, aside from
perhaps passing the connection to the Field object.

Any input, suggestions, criticism, etc are welcome.

--Joey Wilhelm

crippledcanary

unread,
Oct 23, 2008, 9:56:28 AM10/23/08
to Django developers
What you are talking about here is somewhat covered in
http://code.djangoproject.com/ticket/6148
But here it is called db_schema instead.

In mysql "schema" is aliased to database. In postgres, (and probably
oracle) it actually is a different schema within the same database.
I did some work on that ticket up to about rel 1.0 and I am currently
using it to solve the one-database-per-app use case. I just create a
base model in my app that defines db_schema and let all other app
models inherit from that one.
It's not a perfect solution but works for me.

//Peter

p.s. I would really like someone to continue the development of that
ticket since I have been moved to other projects and don't have the
time for it any more.