Proposal: user-friendly API for multi-database support
 
Simon Willison  
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 10:53:15 -0700 (PDT)
Local: Wed, Sep 10 2008 1:53 pm
Subject: Proposal: user-friendly API for multi-database support
For those who weren't at DjangoCon, here's the state of play with
regards to multi-db support: Django actually supports multiple
database connections right now: the Query class (in django/db/models/
sql/query.py) accepts a connection argument to its constructor, and
the QuerySet class (django/db/models/query.py) can be passed an
optional Query instance - if you don't pass it one, it will create a
Query that uses the default django.db.connection.

As a result, if you want to talk to a different connection you can do
it right now using a custom manager:

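# Import paths as of Django 1.0
from django.db.models import Manager, sql
from django.db.models.query import QuerySet

class MyManager(Manager):

    def get_query_set(self):
        # my_custom_db_connection is a placeholder for your connection object
        query = sql.Query(self.model, my_custom_db_connection)
        return QuerySet(self.model, query)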

As Malcolm described it, he's provided the plumbing, now we need to
provide the porcelain in the form of a powerful, user-friendly API to
this stuff. Here's my first attempt at an API design.

Requirements
============

There are a number of important use-cases for multi-db support:

* Simple master-slave replication: SELECT queries are distributed
  between slaves, while UPDATE and DELETE statements are sent to
  the master.
* Different Django applications live on different databases, e.g.
  a forum on one database while a blog lives on another.
* Moving data between different databases - it would be useful if
  you could do this using the ORM to help paper over the
  differences in SQL syntax.
* Sharding: data in a single Django model is distributed across
  multiple databases depending on some kind of heuristic (e.g. by
  user ID, or with archived content moved to a different server)
* Replication tricks, for example if you use MySQL's InnoDB for
  your data but replicate to a MyISAM server somewhere so you can
  use MySQL full-text indexing to search (Flickr used to do this).

I've probably missed some; please feel free to fill in the gaps.

We don't need to solve every problem, but we do need to provide
obvious hooks for how those problems should be solved. Sharding, for
example, is extremely application specific. I don't think Django
should automatically shard your data for you if you specify 'sharding
= True' on a model class, but it should provide documented hooks for
making a custom decision on which database connection should be used
for a query that allow sharding to be implemented without too much
pain.

Different applications on different databases on the other hand is
something Django should support out of the box. Likewise, master-slave
replication is common enough that it would be good to solve it with as
few lines of user-written code as possible (it's the first step most
people take to scale their database after adding caching - and it's a
sweet spot for the kind of read-heavy content sites that Django is
particularly suited for).

Proposed API
============

Here's my first attempt at describing a user-facing API.

First, we need a way of specifying multiple database connections.
Adrian has already expressed an interest in moving to DSNs rather than
individual settings, so I suggest something like this:

DATABASES = {
    'default': 'mysql://foo:bar@localhost/baz',
}

With multiple databases configured this could be:

DATABASES = {
    'master': 'mysql://foo:bar@master/mydb',
    'slave1': 'mysql://foo:bar@slave1/mydb',
    'slave2': 'mysql://foo:bar@slave2/mydb',
    'archive': 'mysql://foo:bar@archive/mydb',
    'default': 'master',
}

There are two types of connection string - DSNs and aliases. A DSN
contains '://' while an alias does not. Aliases can be used even
within the DATABASES setting itself, as with 'default' in the above
example.

It should be possible to use a DSN that has not been defined in the
DATABASES setting. As a result, I propose that anywhere in Django that
accepts a connection alias should also accept a DSN or even a raw
DB-API compliant connection object.
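
To make the alias/DSN distinction concrete, here is a rough sketch of how a connection string could be resolved. The helper names (is_dsn, resolve_connection_string) are hypothetical and not part of Django; the only rule taken from the proposal is that a DSN contains '://' and an alias does not.

```python
# Hypothetical helpers, not a Django API. DATABASES mirrors the example above.
DATABASES = {
    'master': 'mysql://foo:bar@master/mydb',
    'slave1': 'mysql://foo:bar@slave1/mydb',
    'archive': 'mysql://foo:bar@archive/mydb',
    'default': 'master',
}


def is_dsn(value):
    """A DSN contains '://'; anything else is treated as an alias."""
    return '://' in value


def resolve_connection_string(value, databases=DATABASES):
    """Follow aliases until a DSN is reached, then return that DSN."""
    seen = set()
    while not is_dsn(value):
        if value in seen:
            raise ValueError('Alias loop detected at %r' % value)
        seen.add(value)
        if value not in databases:
            raise ValueError('Unknown connection alias %r' % value)
        value = databases[value]
    return value
```

A DSN that is not listed in DATABASES passes straight through, which matches the idea that aliases and DSNs should be interchangeable.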

The QuerySet.using() method
---------------------------

Next, we need a way of telling Django which connection to use. I
propose a new queryset method as the lowest level way of doing this,
called 'using':

qs = Article.objects.filter(published__lt = ...).using('archive')

"using(alias_or_connection_or_dsn)" simply tells the QuerySet to
execute against a different connection, by updating its
internal .connection attribute.

Other options for this method name include:

with_db()
with_connection()

I preferred "using()" as it reads nicely and doesn't contain an
underscore.
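
For illustration, a toy sketch of how using() might behave. SketchQuerySet is a hypothetical stand-in for the real QuerySet, and returning a clone (rather than mutating in place) is my assumption, borrowed from the way filter() chains.

```python
import copy


class SketchQuerySet(object):
    """Toy stand-in for QuerySet: tracks only filters and a connection."""

    def __init__(self, connection='default'):
        self.connection = connection
        self.filters = {}

    def _clone(self):
        clone = copy.copy(self)
        clone.filters = dict(self.filters)
        return clone

    def filter(self, **kwargs):
        clone = self._clone()
        clone.filters.update(kwargs)
        return clone

    def using(self, alias_or_connection_or_dsn):
        # Only the target connection changes; the filters carry over.
        clone = self._clone()
        clone.connection = alias_or_connection_or_dsn
        return clone
```

With this shape, SketchQuerySet().filter(year=2007).using('archive') leaves the original queryset pointing at the default connection.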

using() represents the lowest level user-facing API. We can cover a
common case (different applications on different databases) with the
following:

class Article(models.Model):
    ...
    class Meta:
        using = 'articles'

This means "use the articles connection for all queries originating
with this model". I'm repurposing the term 'using' here for API
consistency.

Advanced connection selection
-----------------------------

All of the other above use-cases boil down to one key decision: given
a particular database query, which database connection should I
execute the query against?

I propose adding a manager method which is called every time that
decision is made, and which is designed to be over-ridden by advanced
users. Here's the default implementation:

class Manager:
    ...
    def get_connection(self, query):
        from django.db import connection
        return connection # Use the default connection for everything

Here's an implementation which implements very simple master-slave
replication:

import random

class Manager:
    ...
    def get_connection(self, query):
        if isinstance(query, (InsertQuery, DeleteQuery, UpdateQuery)):
            return 'master'
        # With a single slave this would simply be: return 'slave'
        # With more than one slave, pick one at random:
        return random.choice(['slave1', 'slave2']) # Footnote [1]

The above would be even easier if InsertQuery, DeleteQuery and
UpdateQuery were all subclasses of a ModificationQuery class (they are
currently all direct subclasses of Query) - then the check could
simply be:

    if isinstance(query, ModificationQuery):

We could even ship a MasterSlaveManager that implements a variant of
the above logic in django.contrib.masterslave (more for educational
and marketing purposes than because it's something that's hard to
implement).

Note that of my two example get_connection() methods above, one returns
an actual connection object while the other returns a connection alias.
This makes for a more convenient API, and is consistent with my above
suggestion that DSNs, aliases and connection objects should be
interchangeable.

Since the get_connection method has access to the full query object,
even complex sharding schemes based on criteria such as the individual
fields being looked up in the query could be supported reasonably
well.
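
As a rough sketch of the sharding case, assuming the query exposes the fields being looked up: SketchQuery, ShardedManager, the shard aliases and the user_id shard key are all made up for illustration, and the real Query object exposes its lookups quite differently.

```python
class SketchQuery(object):
    """Toy stand-in for Query: just records equality lookups."""

    def __init__(self, lookups):
        self.lookups = lookups


class ShardedManager(object):
    """Hypothetical manager that shards on a user_id lookup."""

    shards = ['shard0', 'shard1', 'shard2']  # assumed connection aliases
    fallback = 'master'

    def get_connection(self, query):
        user_id = query.lookups.get('user_id')
        if user_id is None:
            # No shard key in the query: fall back to a single connection.
            return self.fallback
        return self.shards[user_id % len(self.shards)]
```

The modulo heuristic is only one option; the point is that the decision lives in one overridable method.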

Dealing with single queries that span multiple databases
--------------------------------------------------------

Once you have different tables living in different databases there's
always the chance that someone will try to write a query that attempts
to join tables that live on two different database servers. I don't
think we should address this problem at all (aside from maybe
attempting to throw a descriptive error message should it happen) - if
you're scaling across different servers you need to be aware of the
limitations of that approach.

That said, databases like MySQL actually do allow cross-database joins
provided both databases live on the same physical server. Is this
something we should support? I'd like to say "no" and assume that
people who need to do that will be happy rolling their own SQL using a
raw cursor, but maybe I'm wrong and it's actually a common use case.

Connection pooling
------------------

This is where I get completely out of my depth, but it seems like we
might need to implement connection pooling at some point since we are
now maintaining multiple connections to multiple databases. We could
roll our own solution here, but to my knowledge SQLAlchemy has a solid
connection pool implementation which is entirely separate from the
rest of the SQLAlchemy ORM. We could just ensure that if someone needs
connection pooling there's a documented way of integrating the
SQLAlchemy connection pool with Django - that way we don't have an
external dependency on SQLAlchemy for the common case but people who
need connection pools can still have them.

Backwards compatibility
-----------------------

I think we can do all of the above while maintaining almost 100%
backwards compatibility with Django 1.0. In the absence of a DATABASES setting we
can construct one using Django's current DATABASE_ENGINE /
DATABASE_NAME / etc settings to figure out the 'default' connection.
Everything else should Just Work as it does already - the only people
who will need to worry are those who have hacked together their own
multi-db support based on Django internals.
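
A minimal sketch of that fallback, assuming the DSN layout used in the examples above. databases_from_legacy_settings is a hypothetical helper; it ignores engine-specific details such as sqlite file paths and DATABASE_OPTIONS.

```python
def databases_from_legacy_settings(settings):
    """Build a DATABASES dict from Django 1.0 style DATABASE_* settings.

    `settings` is any object carrying DATABASE_ENGINE, DATABASE_NAME, etc.
    """
    explicit = getattr(settings, 'DATABASES', None)
    if explicit:
        return explicit

    user = getattr(settings, 'DATABASE_USER', '')
    password = getattr(settings, 'DATABASE_PASSWORD', '')
    host = getattr(settings, 'DATABASE_HOST', '') or 'localhost'
    port = getattr(settings, 'DATABASE_PORT', '')

    credentials = user
    if password:
        credentials += ':' + password
    location = host
    if port:
        location += ':' + str(port)
    if credentials:
        location = credentials + '@' + location

    dsn = '%s://%s/%s' % (settings.DATABASE_ENGINE, location,
                          settings.DATABASE_NAME)
    return {'default': dsn}
```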

Justification
=============

Why is get_connection() on Manager, not Query or QuerySet?
----------------------------------------------------------

The logic that picks which database connection is used could live in
three potential places: on the manager, on the QuerySet class or on
the Query itself. The manager seems to me like the most natural place
for this to live - users are already used to modifying the manager,
it's trivial to swap in a different manager (e.g. a
MasterSlaveManager) for a given model and the manager class gets to
see all of the queries that go through it, including things like
Article.objects.create(). If there are good reasons it should go on
the Query or QuerySet instead I'd love to hear them.

Why hand get_connection a Query rather than a QuerySet?
-------------------------------------------------------

Because when you call a model's .save() ...



 
Justin Fagnani
From: "Justin Fagnani" <justin.fagn...@gmail.com>
Date: Wed, 10 Sep 2008 11:13:28 -0700
Local: Wed, Sep 10 2008 2:13 pm
Subject: Re: Proposal: user-friendly API for multi-database support
For application-wide db connections, I think it'd be much easier and
more portable to choose the connection in settings.py rather than in a
Model.

Manager.get_connection() is a great idea, but would it also make sense
to allow selecting the db via signals? That way you could make the
decision without modifying an app.

-Justin


 
Malcolm Tredinnick
From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Wed, 10 Sep 2008 11:17:39 -0700
Local: Wed, Sep 10 2008 2:17 pm
Subject: Re: Proposal: user-friendly API for multi-database support
Okay, there's lots to digest here, but a couple of things that need
clarification / addition here that I spotted on the first reading.

Also, as a general thing here, did you go back and read the various
discussions we had about API when the multi-db Summer of Code project
was running? If not, that would be worth doing and incorporating, since
we debated a few alternatives for things back then which will still be
somewhat relevant. Particularly some of the reasons why certain options
didn't feel like good API.

On Wed, 2008-09-10 at 10:53 -0700, Simon Willison wrote:

[...]

My gut feeling is that this isn't something to include initially as a
necessary goal, but it's also probably not too hard once the other 95%
is done. My reason for saying 'no' initially is we try to be as portable
as possible and that particular case is very specific. Also, the single
physical server constraint makes it even more specialised. If you need
separate databases for performance reasons, particularly, they're not
going to be on the same physical server.

More significantly, however, is that there is a need for custom manager
support when a field is being used in a "related field" context, because
this is precisely when cross-database access happens and/or needs to be
prevented. Right now, when you write blog.entry for a particular "blog"
instance, the queryset used to access "entry" is taken from the default
manager on the Entry model (or possibly even models.Manager if a certain
attribute isn't set -- there's some serious hackery going on internally
that we'll sort out). However, neither of those particular things is
the right answer in multi-db or even "advanced selection" cases. We
really need to be able to say "when traversing this relation, use this
manager (or this initial queryset)". This allows us to, for example,
raise an error if trying to cross a join that isn't permitted. Say, in a
sharded situation or in a Hadoop or Big Table setup. It also provides
the way to determine that the query is about to cross a database barrier
and so should actually become a second query against the other database
whose result is then included in the first query.

That needs a public API, too, and I haven't thought about that problem
at all. The plumbing side of that is pretty easy. I have a
proto-implementation locally that I'm not going to take any further
until fields/related.py has a bit of an internal rewrite to make things
a little less complicated in there.

> Connection pooling
> ------------------

> This is where I get completely out of my depth, but it seems like we
> might need to implement connection pooling at some point since we are
> now maintaining multiple connections to multiple databases.

Why does this need to be in the sphere of Django at all? Why wouldn't a
backend that talks correctly to pgpool and whatever the equivalent is
for MySQL be the right solution?

Yes. I think this can be done in a fully backwards compatible fashion.

As far as the public API goes, it's either the manager or the QuerySet
(that means the concept transfers nicely to non-relational situations
like LDAP where Query might not even exist in something resembling the
current form).

The deciding factor about whether or not it is on the QuerySet (which
would normally be the "natural" choice) is whether it will ever make
sense to want to manually get/set the connection mid-stream. At which
point the method is no longer get_connection(), since it's
multi-purpose. But I suspect this is better controlled by managers (the
decision as to whether to hit master, slave or cache is a property of
fields and models, not of query construction and filtering).

More thoughts once I've had a chance to digest all this.

Regards,
Malcolm


 
Rock
From: Rock <r...@rockhoward.com>
Date: Wed, 10 Sep 2008 11:25:02 -0700 (PDT)
Local: Wed, Sep 10 2008 2:25 pm
Subject: Re: Proposal: user-friendly API for multi-database support
The default setting defines the application-wide db connection.
The Manager mechanism is for overriding the default connection.

Selecting the db via signals makes no sense to me, however a mapping
between apps and databases in settings is worth a moment of thought as
a possible supplement to the Manager approach.

Rock

P.S. for Simon:
I haven't spotted any obvious problems with the proposal so far.
My initial reaction is that I like it. Good work!

On Sep 10, 1:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:


 
Simon Willison
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 12:07:48 -0700 (PDT)
Local: Wed, Sep 10 2008 3:07 pm
Subject: Re: Proposal: user-friendly API for multi-database support
On Sep 10, 7:17 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:

> Also, as a general thing here, did you go back and read the various
> discussions we had about API when the multi-db Summer of Code project
> was running? If not, that would be worth doing and incorporating, since
> we debated a few alternatives for things back then which will still be
> somewhat relevant

I'm pretty sure I did at the time :) The above is a brain dump based
on months of quiet chewing followed by a burst of inspiration from
your Django talk. Are there any threads in particular that you think
are worth revisiting?

 
Simon Willison
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 12:30:26 -0700 (PDT)
Local: Wed, Sep 10 2008 3:30 pm
Subject: Re: Proposal: user-friendly API for multi-database support
On Sep 10, 7:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:

> For application-wide db connections, I think it'd be much easier and
> more portable to choose the connection in settings.py rather than in a
> Model.

That's a very interesting point, and one I hadn't considered. It makes
sense to allow people to over-ride the connection used by an
application they didn't write - for example, people may want to tell
Django that django.contrib.auth.User should live in a particular
database. Furthermore, just allowing people to over-ride the
connection used for an existing application isn't enough - you need to
be able to over-ride the default get_connection method, since you
might want to shard Django's built in users (for example).

This is definitely one of the use-cases we need to cover.

I'm not sure of the best way to handle it though. The way I see it the
options are as follows:

1. Monkey-patch the existing User manager.
2. Have a setting which lets you say "for model auth.User, use the
get_connection method defined over here". This is made inelegant by
the fact that settings shouldn't really contain references to actual
function definitions, which means we would probably need to use a
'dotted.path.to.a.function', which is crufty.
3. Use a signal. There isn't much precedent in Django for signals
which alter the way in which something is done - normally signals are
used to inform another part of the code that something has happened.

I'm not overjoyed by any of these options.

Cheers,

Simon


 
Dan Fairs
From: Dan Fairs <dan.fa...@gmail.com>
Date: Wed, 10 Sep 2008 20:40:12 +0100
Local: Wed, Sep 10 2008 3:40 pm
Subject: Re: Proposal: user-friendly API for multi-database support

> 2. Have a setting which lets you say "for model auth.User, use the
> get_connection method defined over here". This is made inelegant by
> the fact that settings shouldn't really contain references to actual
> function definitions, which means we would probably need to use a
> 'dotted.path.to.a.function', which is crufty.

The admin takes a registry-based approach to associate ModelAdmin  
classes with Models. Could a similar approach work here?

myapp/connections.py:

from django.contrib.multidb import connection
from myapp.models import MyModel

class MyModelConnection(connection.ModelConnection):

    def __call__(self):
        # ... return a database connection ...
        raise NotImplementedError

connection.register(MyModel, MyModelConnection)

I guess there's no reason even for MyModelConnection to be a class; a  
callable would do.
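
A bare-bones sketch of the plain-callable variant, for the sake of discussion. The register and connection_for functions and the module-level dict are hypothetical; nothing like django.contrib.multidb exists.

```python
# Hypothetical registry mapping a model class to a connection chooser.
_registry = {}


def register(model, chooser):
    """Associate a model class with a callable that returns a connection."""
    _registry[model] = chooser


def connection_for(model, default='default'):
    """Return the registered chooser's result, or the default alias."""
    chooser = _registry.get(model)
    if chooser is None:
        return default
    return chooser()
```

Whether the lookup fires depends on connections.py having been imported before the first query, which is the usual weak spot of registration schemes.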

Just a thought.

Cheers,
Dan
--
Dan Fairs <dan.fa...@gmail.com> | http://www.stereoplex.com/


 
Waylan Limberg
From: "Waylan Limberg" <way...@gmail.com>
Date: Wed, 10 Sep 2008 16:11:08 -0400
Local: Wed, Sep 10 2008 4:11 pm
Subject: Re: Proposal: user-friendly API for multi-database support

Would this perhaps be easier to do after the Apps-Refactor (#3591)
lands? I'm not real familiar with that ticket, but if we're trying to
set a connection at an app level - that seems like the easiest way to
do it via settings.  Something like:

    INSTALLED_APPS = (
        app('django.contrib.auth', connection='my_user_db'),
        ...
    )

Not sure how that would work for over-riding the default
get_connection method though. We'd probably still be referring to a
callable by 'dotted.path.to.a.function' syntax. And it would apply to
all models in an app, not just some.

Just a thought.

--
----
Waylan Limberg
way...@gmail.com


 
koenb
From: koenb <koen.bierm...@werk.belgie.be>
Date: Wed, 10 Sep 2008 13:16:47 -0700 (PDT)
Local: Wed, Sep 10 2008 4:16 pm
Subject: Re: Proposal: user-friendly API for multi-database support
Just to add a little note: back in May I did some work on multidb,
some thoughts and some work can be found on http://trac.woe-beti.de/ ,
which Ben Ford set up for this.
I stopped because Django was becoming too much of a moving target to
keep it in sync (and I did not have the time).

I would like to point out that my starting point was mainly to be able
to use data from different database ENGINES (like some tables from
postgres and some from mysql).

As far as I can tell, this is not supported currently by the plumbing
Malcolm provided (since the operations settings, e.g. in WhereNode, are
taken from the default connection and not from the passed-in
connection; this is no problem if your databases are using the same
engine, but it returns wrong SQL if you have different engines).
I already reported this once in #7258 (which was considered invalid,
probably because I forgot to mention my use case).

Anyway, I am very glad to see some renewed interest in this field!

Koen


 
Malcolm Tredinnick
From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Wed, 10 Sep 2008 13:18:38 -0700
Local: Wed, Sep 10 2008 4:18 pm
Subject: Re: Proposal: user-friendly API for multi-database support

On Wed, 2008-09-10 at 20:40 +0100, Dan Fairs wrote:
> > 2. Have a setting which lets you say "for model auth.User, use the
> > get_connection method defined over here". This is made inelegant by
> > the fact that settings shouldn't really contain references to actual
> > function definitions, which means we would probably need to use a
> > 'dotted.path.to.a.function', which is crufty.

> The admin takes a registry-based approach to associate ModelAdmin  
> classes with Models. Could a similar approach work here?

Oh, please, no! Registration is a very fragile process. It simply
doesn't work very well. It's a bit disappointing that we have to do
things that way in places in Django and if we can avoid it
elsewhere that'd be nice.

Malcolm


 
Justin Fagnani
From: "Justin Fagnani" <justin.fagn...@gmail.com>
Date: Wed, 10 Sep 2008 13:24:53 -0700
Local: Wed, Sep 10 2008 4:24 pm
Subject: Re: Proposal: user-friendly API for multi-database support
On Wed, Sep 10, 2008 at 12:30 PM, Simon Willison

<si...@simonwillison.net> wrote:
> On Sep 10, 7:13 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
>> For application-wide db connections, I think it'd be much easier and
>> more portable to choose the connection in settings.py rather than in a
>> Model.

> That's a very interesting point, and one I hadn't considered. It makes
> sense to allow people to over-ride the connection used by an
> application they didn't write - for example, people may want to tell
> Django that django.contrib.auth.User should live in a particular
> database. Furthermore, just allowing people to over-ride the
> connection used for an existing application isn't enough - you need to
> be able to over-ride the default get_connection method, since you
> might want to shard Django's built in users (for example).

I think this example highlights the problem with per-Model db
connections: it'll only work if either that model is not related to
the others in the app, or if the other models in the app also use the
same db. This will probably make per-application db connections a much
more common use case than per-Model.

> 2. Have a setting which lets you say "for model auth.User, use the
> get_connection method defined over here". This is made inelegant by
> the fact that settings shouldn't really contain references to actual
> function definitions, which means we would probably need to use a
> 'dotted.path.to.a.function', which is crufty.

Considering that this is how every module, function and class are
referred to in settings, I don't think it'll be that big of a deal. I
especially like Waylan's suggestion.

> 3. Use a signal. There isn't much precedent in Django for signals
> which alter the way in which something is done - normally signals are
> used to inform another part of the code that something has happened.

The nice thing about signals is that it allows any arbitrary scheme
for selecting connections without modifying the application. For the
User case above, you could register a function that chooses a replica
for User queries only on selects which don't join with a model outside
the auth app.

I see your point about not changing how things are done with signals.
I was thinking this would be done most simply by sending the QuerySet
with the signal, but that opens things up to a lot more changes than
just db connections. That could end up being a way to introduce very hard
to find bugs. I still like how easy it makes it to customize db access
without altering the app itself.

-Justin


 
Discussion subject changed to "Solving registration once and for all?" by Simon Willison
Simon Willison
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 13:56:28 -0700 (PDT)
Local: Wed, Sep 10 2008 4:56 pm
Subject: Solving registration once and for all?
On Sep 10, 9:18 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:

> Oh, please, no! Registration is a very fragile process. It simply
> doesn't work very well. It's a bit disappointing that we have to do
> things that way in places in Django and if we can avoid it
> elsewhere that'd be nice.

I was hoping we could get a discussion going about this at DjangoCon.
Registration is a pattern that comes up /all the time/ in Django:

* Registering models with the admin
* Registering models with databrowse
* Registering template tags
* Registering custom management commands

It's also present in popular third-party apps:

* django-tagging and django-mptt both register models (though both
feel like they should really be some kind of mixin)

We'll also need it in the near future for a couple of in-development
features:

* Registering custom panels with the Django debugging toolbar
* Registering new benchmarks with the metronome profiling tool
* Registering get_connection overrides in the above multi-db proposal

Finally, we've been needing to solve it for projects at work: we have
a CMS that exposes a concept of "custom rows" which can be provided by
applications and dropped in to various places around the site. Guess
what: the rows need to be registered!

There MUST be a good way of doing this. zope.interface? setuptools
entry points? We really, really need to solve this for Django. If we
had a single, supported and documented way of registering things it
would open up a huge amount of potential for plugin-style extension
points and give us a proper solution to a problem we are solving in a
bunch of different ways at the moment.

Any ideas?


 
Discussion subject changed to "Proposal: user-friendly API for multi-database support" by Ivan Sagalaev
Ivan Sagalaev
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Thu, 11 Sep 2008 01:15:20 +0400
Local: Wed, Sep 10 2008 5:15 pm
Subject: Re: Proposal: user-friendly API for multi-database support

Simon Willison wrote:
> * Simple master-slave replication: SELECT queries are distributed
>   between slaves, while UPDATE and DELETE statements are sent to
>   the master.

It won't work at the statement level. If you have a transaction and do an
UPDATE and then a SELECT then the latter won't see results of the former
because it will look into another connection (and another database).

I strongly believe that choosing between a master and a slave is a
decision that should be made at the business logic level, not at the model level.
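
A toy illustration of the stale read, assuming statement-level routing and a slave that has not replicated yet. ToyDatabase and route are hypothetical stand-ins, not real connections.

```python
class ToyDatabase(object):
    """In-memory stand-in for a database server."""

    def __init__(self):
        self.rows = {}


master = ToyDatabase()
slave = ToyDatabase()  # replication has not caught up yet


def route(statement_type):
    """Statement-level routing: writes to the master, reads to the slave."""
    if statement_type in ('INSERT', 'UPDATE', 'DELETE'):
        return master
    return slave


# Two statements inside what the application sees as one transaction:
route('UPDATE').rows['article-1'] = 'new title'  # lands on the master
seen = route('SELECT').rows.get('article-1')     # read from the slave

stale_read = seen is None  # the SELECT does not see the UPDATE
```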


 
Discussion subject changed to "Solving registration once and for all?" by Malcolm Tredinnick
Malcolm Tredinnick
From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Wed, 10 Sep 2008 14:16:16 -0700
Local: Wed, Sep 10 2008 5:16 pm
Subject: Re: Solving registration once and for all?

On Wed, 2008-09-10 at 13:56 -0700, Simon Willison wrote:
> On Sep 10, 9:18 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
> wrote:
> > Oh, please, no! Registration is a very fragile process. It simply
> > doesn't work very well. It's a bit disappointing that it's the way we
> > have to do things that way in places in Django and if we can avoid it
> > elsewhere that'd be nice.

> I was hoping we could get a discussion going about this at DjangoCon.
> Registration is a pattern that comes up /all the time/ in Django:

*sigh* Whilst I realise you are very enthusiastic about getting stuff
done at the moment, Simon, it's very hard to juggle 10 different serious
design issues all at once. And, yes, I understand that some of them
overlap, but solving the registration issue isn't really going to be the
main part (or even necessarily any part) of working out a multiple
database API.

Incremental steps, rather than a plan which requires changing 6 things
at once is really preferable here. It helps code stability and means we
can devote our full attention to just one or two things at a time. The
things you are bringing up are serious issues, but they're not new
issues -- hardly anything that people are suddenly rediscovering from
DjangoCon hasn't been on our radar for months or years. We don't have
to solve them all this week. So, please. Let's slow down a bit and have
the time to consider how we can do things in small steps and require
large sweeping changes only as an "if all else fails" fallback. We might
still end up using some kind of new "registration" alternative in, say,
database connection registration, but that can be phase two or phase
three. Phase one being the manual configuration option.

> * Registering models with the admin
> * Registering models with databrowse
> * Registering template tags
> * Registering custom management commands

You don't register custom management commands. They are "discovered",
similar to import discovering modules to import: by putting them in a
well-defined location. I'm not sure why template tags aren't done the
same way (a distinguished variable in a file that is imported saying
"these are the methods that are the tags I'm supplying", similar to
__all__).
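
A rough sketch of the "distinguished variable" idea described above, for illustration only. The variable name `TEMPLATE_TAGS`, the module name `demo_tags`, and the `discover_tags` helper are all assumptions made up for this example; Django's template tag loading does not work this way.

```python
import importlib
import sys
import types

# Build a throwaway module so the sketch is self-contained; in a real project
# this would be an ordinary file sitting in a well-defined location.
demo = types.ModuleType("demo_tags")
exec(
    "def upper(value):\n"
    "    return value.upper()\n"
    "def _helper(value):\n"
    "    return value\n"
    "TEMPLATE_TAGS = ['upper']  # hypothetical distinguished variable\n",
    demo.__dict__,
)
sys.modules["demo_tags"] = demo


def discover_tags(module_name):
    """Import a module and return only the callables it lists in TEMPLATE_TAGS."""
    module = importlib.import_module(module_name)
    names = getattr(module, "TEMPLATE_TAGS", [])
    return {name: getattr(module, name) for name in names}


tags = discover_tags("demo_tags")
```

Nothing is registered at import time here: the loader imports the module and reads a list, in the same spirit as `__all__`.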

As I note below, the current uses of registration aren't all necessary.

Whilst I realise that some people want to use setuptools, anything that
*requires* it is going to be a big problem for me, at least. It has some
problems that mean it's really unusable in large sysadmin installations
and it is at odds with the existing packaging system on proper
distributions (you can't be a proper distribution without a proper
packaging system, after all). The maintainer of setuptools has pretty
clearly indicated he isn't interested in fixing the latter problem (look
back at some of the interactions with the Debian guys in the past) and
the former is barely acknowledged as even a problem.

>  We really, really need to solve this for Django.

It's not entirely clear that we do, since before you solve something
there has to be a problem. We could make things a bit easier, but it's
quite possibly a case of using the wrong shovel to hammer in your screws
in some cases and in other cases it requires almost no infrastructure.

For those already firing up their replies, note that I carefully wrote
"possibly". I'm asking that people step back and view the issue as
whether it's the right approach before we make a better version of thing
X.

Regards,
Malcolm


 
Discussion subject changed to "Proposal: user-friendly API for multi-database support" by Mike Malone
Mike Malone
From: "Mike Malone" <mjmal...@gmail.com>
Date: Wed, 10 Sep 2008 14:24:12 -0700
Local: Wed, Sep 10 2008 5:24 pm
Subject: Re: Proposal: user-friendly API for multi-database support

Wow... like Malcolm said, lots to digest here.

So to start, the "simple" master-slave replication scenario turns out not to
be so simple once you get into the implementation details. Replication lag
being what it is, you almost never want to query the slave for every SELECT.

At Pownce, for example, we stick users to the master database for some
period of time (a couple of seconds, usually) after they post a new note.
The problem here (as Malcolm pointed out) is that related managers use the
default manager for the related field. So if I ask for a User's Notes, the
default Note manager is used. That manager is, presumably, where the
decision is going to be made as to whether the slave or the master should be
queried. But the Note manager has no way of knowing whether the User is
stuck to the master -- it doesn't even know that there's a User associated
with the query...
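
The "stick to the master for a couple of seconds" policy can be sketched in a few lines. This is a toy illustration, not Pownce's code: the `StickyChooser` class, its method names, and the `"master"`/`"slave"` aliases are assumptions. Note that it needs the user id as input, which is exactly the piece of information the Note manager does not have.

```python
import time


class StickyChooser:
    """Toy chooser: a user reads from the master for a short window after writing."""

    def __init__(self, stick_seconds=2.0, clock=time.monotonic):
        self.stick_seconds = stick_seconds
        self.clock = clock  # injectable so the policy can be tested without sleeping
        self._last_write = {}  # user_id -> time of that user's last write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def alias_for_read(self, user_id):
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.stick_seconds:
            return "master"
        return "slave"
```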

We've solved this by poking at a lot of the related fields' internals.
Malcolm helped a lot, and he's probably one of the only people who could
have made it happen. It's not that much code, but it relies heavily on
internal API and is certainly not something that should be recommended.

Simon, from your first email it seems you're suggesting that the Manager
call Query.as_sql() and then parse the resulting SQL string? That seems like
it's going to encourage a lot of hacky/fragile solutions. IMO, the right
place for a decision like "should this User's notes come from the master, or
the slave?" is on the User model (or maybe User manager), not in the Note
manager.

The same problem comes up with sharding. Suppose, for example, Pownce
started sharding by User and putting each User's Notes on the same server
the User is on. We should be able to call User.notes.all() and get that
User's notes, but the Note manager can't easily tell what server it should
be querying, since it doesn't know about the User. Again, you could start
poking at the internals of Query and try to figure out what's going on, but
that doesn't seem like a particularly elegant solution...

Mike


 
Simon Willison
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 14:27:14 -0700 (PDT)
Local: Wed, Sep 10 2008 5:27 pm
Subject: Re: Proposal: user-friendly API for multi-database support
On Sep 10, 10:15 pm, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:

> Simon Willison wrote:
> > * Simple master-slave replication: SELECT queries are distributed
> >   between slaves, while UPDATE and DELETE statements are sent to
> >   the master.

> It won't work on a statement-level. If you have a transaction and do an
> UPDATE and then a SELECT then the latter won't see results of the former
> because it will look into another connection (and another database).

> I strongly believe that choosing between a master and a slave is the
> decision that should be made on business logic level, not at model level.

Good point. That also highlights an omission in my original brain-dump
- having a "using" method on a QuerySet isn't enough, you also need a
way of over-riding the database connection used by a call to
model.save(). I'd propose the same terminology again, as a
keyword argument.

If you wanted to control which master/slave connection was used from
your business logic, you'd do something like this:

obj = Article.objects.using('master').get(pk = 4)
obj.name = 'Hello'
obj.save(using = 'master')

Ivan: how does that look?


 
Ivan Sagalaev
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Thu, 11 Sep 2008 01:33:51 +0400
Local: Wed, Sep 10 2008 5:33 pm
Subject: Re: Proposal: user-friendly API for multi-database support

Mike Malone wrote:
> At Pownce, for example, we stick users to the master database for some
> period of time (a couple of seconds, usually) after they post a new
> note.

Another approach that I took in mysql_replicated[1] is to serve a page
that user GETs from a redirect after successful POST always from the
master. It certainly doesn't solve the problem in general but it's good
enough (for us at least). But I'll second that this damn lagging thing
is pretty hard to solve in a general way.

[1]: http://softwaremaniacs.org/soft/mysql_replicated/en/


 
Simon Willison
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 14:40:14 -0700 (PDT)
Local: Wed, Sep 10 2008 5:40 pm
Subject: Re: Proposal: user-friendly API for multi-database support
On Sep 10, 10:24 pm, "Mike Malone" <mjmal...@gmail.com> wrote:

> At Pownce, for example, we stick users to the master database for some
> period of time (a couple of seconds, usually) after they post a new note.
> The problem here (as Malcolm pointed out) is that related managers use the
> default manager for the related field. So if I ask for a User's Notes, the
> default Note manager is used. That manager is, presumably, where the
> decision is going to be made as to whether the slave or the master should be
> queried. But the Note manager has no way of knowing whether the User is
> stuck to the master -- it doesn't even know that there's a User associated
> with the query...

That's really interesting. I wonder if that invalidates the whole
approach I proposed, or merely means it needs some refining?

> Simon, from your first email it seems you're suggesting that the Manager
> call Query.as_sql() and then parse the resulting SQL string?

Not at all - I'm suggesting the manager pokes around at the query
object itself (what type of query is it, which tables does it touch
etc). I mentioned as_sql as a throw-away remark; I certainly wouldn't
want to suggest implementing connection selection logic in that way.

> IMO, the right place for a decision like "should this User's notes come from
> the master, or the slave?" is on the User model (or maybe User manager),
> not in the Note manager.

It's possible that the Note manager simply won't have enough
information to make that decision - in which case I'd suggest that
solving it is up to the developer. They might choose to use the
'using()' method to force a query through notes to go to the slave,
for example.

> The same problem comes up with sharding. Suppose, for example, Pownce
> started sharding by User and putting each User's Notes on the same server
> the User is on. We should be able to call User.notes.all() and get that
> User's notes, but the Note manager can't easily tell what server it should
> be querying, since it doesn't know about the User. Again, you could start
> poking at the internals of Query and try to figure out what's going on, but
> that doesn't seem like a particularly elegant solution...

Again, my assumption is that in that case it's up to the developer to
ensure queries go to the right place - maybe by adding their own
"get_notes" method to their user object that automatically queries the
correct shard (with the using() method).
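
As a sketch of what such a developer-written "get_notes" method might look like, the example below uses plain dicts in place of per-shard databases so that it runs on its own. The `SHARDS` layout, the `shard_for` rule, and the `User` class are all hypothetical; with the proposed API, the body of `get_notes` would presumably call `using()` with the chosen alias instead.

```python
# Plain dicts stand in for per-shard databases: alias -> {user_id: [notes]}.
SHARDS = {
    "shard0": {2: ["hello from user 2"]},
    "shard1": {1: ["first note", "second note"]},
}


def shard_for(user_id):
    """Hypothetical placement rule: shard by user id modulo the shard count."""
    return "shard%d" % (user_id % len(SHARDS))


class User:
    def __init__(self, user_id):
        self.id = user_id

    def get_notes(self):
        # With the proposed API this might read something like
        # Note.objects.using(shard_for(self.id)).filter(user=self).
        return list(SHARDS[shard_for(self.id)].get(self.id, []))
```

The shard decision lives on the object that knows which user is involved, so the Note manager never has to guess.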

I'm not convinced that this stuff can be made invisible to the
developer - if you're sharding things you're in pretty deep, and you
probably want to maintain full control over where your queries are
going.


 
Ivan Sagalaev
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Thu, 11 Sep 2008 01:44:52 +0400
Local: Wed, Sep 10 2008 5:44 pm
Subject: Re: Proposal: user-friendly API for multi-database support

Simon Willison wrote:
> Good point. That also highlights an omission in my original brain-dump
> - having a "uses" method on a QuerySet isn't enough, you also need a
> way of over-riding the database connection used by a call to
> model.save(). Again, I'd propose the same terminology again as a
> keyword argument.

> If you wanted to control which master/slave connection was used from
> your business logic, you'd do something like this:

> obj = Article.objects.using('master').get(pk = 4)
> obj.name = 'Hello'
> obj.save(using = 'master')

> Ivan: how does that look?

Well... To be sure save() should always go to master because on slaves
you just don't have permissions to save anything. So a parameter to
save() is redundant.

More to the point, Mike Malone just described a situation when you want
the same code to query either master or slave depending on whether
you're sure that data on slave had a chance to be synced with master.

Another thing is that explicit specification of a connection may become
very tedious and non-DRY. For example, one should always use 'master'
when POSTing forms. I don't think requiring users to do it
manually is a good idea.

To be honest, I like the approach that I've implemented in my db backend
for this: using HTTP semantics with the ability to force master or slave
when you need it. However it works with an implicit state which is not a
clean approach compared to a functional one with explicit passing of
connections.


 
Ivan Sagalaev
From: Ivan Sagalaev <man...@softwaremaniacs.org>
Date: Thu, 11 Sep 2008 01:49:17 +0400
Local: Wed, Sep 10 2008 5:49 pm
Subject: Re: Proposal: user-friendly API for multi-database support

Simon Willison wrote:
> That's really interesting. I wonder if that invalidates the whole
> approach I proposed, or merely means it needs some refining?

As Malcolm has pointed out, you're proposing many things at once :-). I tend
to think that replication, sharding, and migration to another db are very
different things and maybe we shouldn't try to solve them with a single
API.

 
Mike Malone
From: "Mike Malone" <mjmal...@gmail.com>
Date: Wed, 10 Sep 2008 14:59:43 -0700
Local: Wed, Sep 10 2008 5:59 pm
Subject: Re: Proposal: user-friendly API for multi-database support

> Well... To be sure save() should always go to master because on slaves
> you just don't have permissions to save anything. So a parameter to
> save() is redundant.

Not so. There are certainly use-cases for more sophisticated database
architectures where, for example, the majority of the database tables are
written to the master and replicated to all slaves, while a couple of
write-heavy tables are sharded and written directly to individual slaves.
More common is a master-master replication strategy, where a particular User
(for example) is stuck to one of a pair of database servers that replicate
one another. In this case you'd want to be able to specify somehow which
server to save() to.

Mike


 
ab
From: ab <andrewb...@gmail.com>
Date: Wed, 10 Sep 2008 14:55:12 -0700 (PDT)
Local: Wed, Sep 10 2008 5:55 pm
Subject: Re: Proposal: user-friendly API for multi-database support
For the API to accept a DSN, alias, or connection anywhere would add
similar code in multiple places. I propose that the aliases are mapped
into django.db.connections. For your example, you could use
django.db.connections.archive. I also propose that you can either
define a single database (as now) or multiple DATABASES (as you
propose) and to define both or neither is an error. But anyways, how
to name/specify a database is semi-bikesheddy and orthogonal to the
issue of how to actually choose the database to be used, which is the
more important one. Here's my take on that:

My biggest problem with Simon's original proposal is that the
connection-choosing logic is too spread out. Sticking stuff on your
Model classes makes sense when the code is "local to that model" --
like methods, metadata, or choosing a connection per-table -- but that
doesn't make sense for a lot of multi-db setups. For complicated stuff
like sharding, I think you'd want all the logic in the same place.

Counter-proposal:
A *project-global* get_connection function, maybe in a location
specified by settings.
Input: the queryset, at least, and probably whatever else you'll
likely want to use: the model class, tables joined with,
fields_accessed?, etc.
Output: a connection object

That would make it easier to write and maintain your multi-db setup
and share logic across models. If you want control at the queryset-
granularity, this could maybe result in a proposed_connection
parameter to get_connection (and get_connection can obviously raise
exceptions as it sees fit). This proposal also solves the problem of
choosing a connection for contrib or 3rd-party applications.
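
One possible shape for the counter-proposal above, as a minimal sketch. The signature, the `CONNECTIONS` and `MODEL_ALIASES` tables, and the `UnknownConnection` exception are assumptions for illustration; the post itself only specifies "queryset and related information in, connection object out".

```python
class UnknownConnection(LookupError):
    """Raised when a requested alias has no configured connection."""


# Stand-ins for real connection objects, keyed by alias.
CONNECTIONS = {"default": object(), "archive": object()}

# Hypothetical per-project policy, kept in one place rather than on each model.
MODEL_ALIASES = {"ArchivedArticle": "archive"}


def get_connection(model_name, proposed_connection=None):
    """Project-global chooser: one function decides for every model and app."""
    alias = proposed_connection or MODEL_ALIASES.get(model_name, "default")
    if alias not in CONNECTIONS:
        raise UnknownConnection(alias)
    return CONNECTIONS[alias]
```

Because the policy is a single function, contrib and third-party models can be routed without editing their model classes.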

Andrew

On Sep 10, 10:53 am, Simon Willison <si...@simonwillison.net> wrote:

...



 
Mike Malone
From: "Mike Malone" <mjmal...@gmail.com>
Date: Wed, 10 Sep 2008 15:03:57 -0700
Local: Wed, Sep 10 2008 6:03 pm
Subject: Re: Proposal: user-friendly API for multi-database support

I think it just needs refining. My understanding is that related fields was
due for a refactor anyways, so this would probably be a good time to do /
think about it. I guess my point is that there needs to be some non-internal
API for getting at related field information, too. In any case, more thought
is required.

Mike


 
Simon Willison
From: Simon Willison <si...@simonwillison.net>
Date: Wed, 10 Sep 2008 15:06:54 -0700 (PDT)
Local: Wed, Sep 10 2008 6:06 pm
Subject: Re: Proposal: user-friendly API for multi-database support
On Sep 10, 10:44 pm, Ivan Sagalaev <man...@softwaremaniacs.org> wrote:

> Well... To be sure save() should always go to master because on slaves
> you just don't have permissions to save anything. So a parameter to
> save() is redundant.

It's redundant in the case of a single master, but there are other
situations when you might want full control over where a save() ends
up going (when you have more than one master for example, or in a
situation where you are loading data from one database and saving it
to another as part of an import/export routine).

> To be honest, I like the approach that I've implemented in my db backend
> for this: using HTTP semantics with the ability to force mater or slave
> when you need it. However it works with an implicit state which is not a
> clean approach compared to a functional one with explicit passing of
> connections.

I had your POST vs. GET method in mind when I was thinking about the
get_connection method - one of the potential things you could do in
that method is look at a thread local that was set to the request
method when the last request came in. This is a bit too much of a hack
to support in Django core but there's nothing to stop end-users using
thread locals in that way if they want HTTP-based master/slave
selection.
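
The thread-local approach described above might look roughly like this in end-user code. It is a sketch under assumptions: `remember_request_method` would have to be called once per request (for example from middleware), and both function names and the `"master"`/`"slave"` aliases are invented for this example.

```python
import threading

_state = threading.local()


def remember_request_method(method):
    """Record the HTTP method of the request being handled on this thread."""
    _state.method = method.upper()


def get_connection_alias():
    """Send everything to the master during a POST; otherwise read from a slave."""
    method = getattr(_state, "method", "GET")
    return "master" if method == "POST" else "slave"
```

The state is implicit and per-thread, which is the trade-off noted earlier in the thread: convenient, but less clean than passing a connection explicitly.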

 
Malcolm Tredinnick
From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Wed, 10 Sep 2008 15:15:42 -0700
Local: Wed, Sep 10 2008 6:15 pm
Subject: Re: Proposal: user-friendly API for multi-database support

On Wed, 2008-09-10 at 15:03 -0700, Mike Malone wrote:

[...]

> I think it just needs refining. My understanding is that related
> fields was due for a refactor anyways, so this would probably be a
> good time to do / think about it. I guess my point is that there needs
> to be some non-internal API for getting at related field information,
> too. In any case, more thought is required.

Agreed, mostly. I'm using this thread as a way of looking at the various
use-cases people are proposing and this will guide a bunch of that
particular refactoring, I suspect. What needs to be exposed is kind of a
consequence of the API we decide to settle on. It's a case where, right
this minute, we (or, rather, I, personally) don't necessarily have a
high confidence level that we understand all the use-cases and can
confidently split them into "things we want to do out of the box",
"things we want to support via extensions" and "moderately crazy stuff".

Regards,
Malcolm


 