Manager.get_connection() is a great idea, but would it also make sense
to allow selecting the db via signals? That way you could make the
decision without modifying an app.
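To make the idea concrete, here's a minimal sketch of signal-style selection. It deliberately avoids Django's actual signal machinery; the dispatcher, the `db_for_model` name, and the receiver protocol are all invented for illustration:

```python
# A toy signal-style dispatcher for picking a database per model,
# independent of Django's real signals; every name here is illustrative.
class ConnectionSignal:
    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def select(self, model_label, default='default'):
        # First receiver to return a non-None alias wins.
        for receiver in self._receivers:
            alias = receiver(model_label)
            if alias is not None:
                return alias
        return default

db_for_model = ConnectionSignal()

# An app-external receiver: route auth models to a separate database
# without touching the app's own code.
db_for_model.connect(
    lambda label: 'users_db' if label.startswith('auth.') else None
)
```

The point is that the receiver lives outside the app, so the routing decision doesn't require modifying the app at all.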
-Justin
Also, as a general thing here, did you go back and read the various
discussions we had about API when the multi-db Summer of Code project
was running? If not, that would be worth doing and incorporating, since
we debated a few alternatives for things back then which will still be
somewhat relevant. Particularly some of the reasons why certain options
didn't feel like good API.
On Wed, 2008-09-10 at 10:53 -0700, Simon Willison wrote:
[...]
> Dealing with single queries that span multiple databases
> --------------------------------------------------------
>
> Once you have different tables living in different databases there's
> always the chance that someone will try to write a query that attempts
> to join tables that live on two different database servers. I don't
> think we should address this problem at all (aside from maybe
> attempting to throw a descriptive error message should it happen) - if
> you're scaling across different servers you need to be aware of the
> limitations of that approach.
>
> That said, databases like MySQL actually do allow cross-database joins
> provided both databases live on the same physical server. Is this
> something we should support? I'd like to say "no" and assume that
> people who need to do that will be happy rolling their own SQL using a
> raw cursor, but maybe I'm wrong and it's actually a common use case.
My gut feeling is that this isn't something to include initially as a
necessary goal, but it's also probably not too hard once the other 95%
is done. My reason for saying 'no' initially is that we try to be as
portable as possible and that particular case is very specific. Also, the single
physical server constraint makes it even more specialised. If you need
separate databases for performance reasons, particularly, they're not
going to be on the same physical server.
More significant, however, is that there is a need for custom manager
support when a field is being used in a "related field" context, because
this is precisely when cross-database access happens and/or needs to be
prevented. Right now, when you write blog.entry for a particular "blog"
instance, the queryset used to access "entry" is taken from the default
manager on the Entry model (or possibly even models.Manager if a certain
attribute isn't set -- there's some serious hackery going on internally
that we'll sort out). However, neither of those particular things is
the right answer in multi-db or even "advanced selection" cases. We
really need to be able to say "when traversing this relation, use this
manager (or this initial queryset)". This allows us to, for example,
raise an error if trying to cross a join that isn't permitted. Say, in a
sharded situation or in a Hadoop or Big Table setup. It also provides
the way to determine that the query is about to cross a database barrier
and so should actually become a second query against the other database
whose result is then included in the first query.
That needs a public API, too, and I haven't thought about that problem
at all. The plumbing side of that is pretty easy. I have a
proto-implementation locally that I'm not going to take any further yet
until fields/related.py has a bit of an internal rewrite to make things
a little less complicated in there.
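The "when traversing this relation, use this manager" idea could look roughly like the following. This is a sketch only, under my own assumptions: the `db_alias` attribute, the `traverse()` hook, and the exception name are all invented, not an actual proposal for the internal API:

```python
# Sketch: a manager that knows which database its model lives on and can
# refuse to traverse a relation into a different database. All names here
# (db_alias, traverse, CrossDatabaseError) are hypothetical.
class CrossDatabaseError(Exception):
    pass

class BoundManager:
    def __init__(self, db_alias):
        self.db_alias = db_alias

    def traverse(self, other_manager):
        # Imagine this being called when a related-field lookup would
        # join onto other_manager's model.
        if other_manager.db_alias != self.db_alias:
            raise CrossDatabaseError(
                'cannot join across databases %r and %r'
                % (self.db_alias, other_manager.db_alias)
            )
        return other_manager

blogs = BoundManager('default')
entries = BoundManager('default')
archive = BoundManager('archive_db')

blogs.traverse(entries)        # same database: fine
# blogs.traverse(archive)      # would raise CrossDatabaseError
```

The same hook is also where you'd detect that a query is about to cross a database barrier and turn it into a second query instead of raising.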
>
> Connection pooling
> ------------------
>
> This is where I get completely out of my depth, but it seems like we
> might need to implement connection pooling at some point since we are
> now maintaining multiple connections to multiple databases.
Why does this need to be in the sphere of Django at all? Why wouldn't a
backend that talks correctly to pgpool (and whatever the equivalent is
for MySQL) be the right solution?
> We could
> roll our own solution here, but to my knowledge SQLAlchemy has a solid
> connection pool implementation which is entirely separate from the
> rest of the SQLAlchemy ORM. We could just ensure that if someone needs
> connection pooling there's a documented way of integrating the
> SQLAlchemy connection pool with Django - that way we don't have an
> external dependency on SQL Alchemy for the common case but people who
> need connection pools can still have them.
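For what it's worth, the core of a creator-plus-queue pool (the pattern SQLAlchemy's standalone pool module uses) is small. This is a toy illustration only, not Django or SQLAlchemy code; `sqlite3` just stands in for a real network database:

```python
# A toy connection pool: a creator callable plus a bounded queue of
# reusable connections. Purely illustrative.
import queue
import sqlite3

class SimplePool:
    def __init__(self, creator, size=5):
        self._creator = creator
        self._pool = queue.Queue(maxsize=size)

    def connect(self):
        try:
            return self._pool.get_nowait()   # reuse an idle connection
        except queue.Empty:
            return self._creator()           # or open a fresh one

    def release(self, conn):
        try:
            self._pool.put_nowait(conn)      # keep it around for reuse
        except queue.Full:
            conn.close()                     # pool is full: discard

pool = SimplePool(lambda: sqlite3.connect(':memory:'), size=2)

c1 = pool.connect()
pool.release(c1)
c2 = pool.connect()      # same connection object, reused from the pool
```

Which is roughly why an external, already-tested implementation is attractive: the hard parts (thread safety, overflow, connection invalidation) are all in the details this sketch omits.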
>
> Backwards compatibility
> -----------------------
>
> I think we can do all of the above while maintaining almost 100%
> backwards compatibility with Django 1.0. In the absence of a DATABASES setting we
> can construct one using Django's current DATABASE_ENGINE /
> DATABASE_NAME / etc settings to figure out the 'default' connection.
> Everything else should Just Work as it does already - the only people
> who will need to worry are those who have hacked together their own
> multi-db support based on Django internals.
Yes. I think this can be done in a fully backwards compatible fashion.
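The shim is essentially this (a sketch; the DATABASE_* names are the real Django 1.0 settings, but the helper itself and the exact DATABASES shape are assumptions of the proposal):

```python
# Sketch of the backwards-compatibility shim: if no DATABASES setting is
# present, synthesise one from the old-style DATABASE_* settings so the
# 'default' alias keeps working unchanged.
def build_databases(settings):
    if settings.get('DATABASES'):
        return settings['DATABASES']
    return {
        'default': {
            'ENGINE': settings.get('DATABASE_ENGINE', ''),
            'NAME': settings.get('DATABASE_NAME', ''),
            'USER': settings.get('DATABASE_USER', ''),
            'PASSWORD': settings.get('DATABASE_PASSWORD', ''),
            'HOST': settings.get('DATABASE_HOST', ''),
            'PORT': settings.get('DATABASE_PORT', ''),
        }
    }

legacy = {'DATABASE_ENGINE': 'sqlite3', 'DATABASE_NAME': '/tmp/app.db'}
dbs = build_databases(legacy)
```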
>
> Justification
> =============
>
> Why is get_connection() on Manager, not Query or QuerySet?
> ----------------------------------------------------------
>
> The logic that picks which database connection is used could live in
> three potential places: on the manager, on the QuerySet class or on
> the Query itself. The manager seems to me like the most natural place
> for this to live - users are already used to modifying the manager,
> it's trivial to swap in a different manager (e.g. a
> MasterSlaveManager) for a given model and the manager class gets to
> see all of the queries that go through it, including things like
> Article.objects.create(). If there are good reasons it should go on
> the Query or QuerySet instead I'd love to hear them.
As far as the public API goes, it's either the manager or the QuerySet
(that means the concept transfers nicely to non-relational situations
like LDAP where Query might not even exist in something resembling the
current form).
The deciding factor about whether or not it is on the QuerySet (which
would normally be the "natural" choice) is whether it will ever make
sense to want to manually get/set the connection mid-stream. At which
point the method is no longer get_connection(), since it's
multi-purpose. But I suspect this is better controlled by managers (the
decision as to whether to hit master, slave or cache is a property of
fields and models, not of query construction and filtering).
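To illustrate why the manager feels like the right home, here's a sketch of the MasterSlaveManager idea from the proposal. The `get_connection(for_write=...)` signature is an assumption, not settled API:

```python
# Hypothetical MasterSlaveManager: writes go to the master, reads go to
# a randomly chosen replica. Method names are assumptions based on this
# proposal, not existing Django API.
import random

class MasterSlaveManager:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def get_connection(self, for_write=False):
        if for_write or not self.slaves:
            return self.master
        return random.choice(self.slaves)

mgr = MasterSlaveManager('master_db', ['replica1', 'replica2'])
mgr.get_connection(for_write=True)    # always 'master_db'
mgr.get_connection()                  # one of the replicas
```

Because this is a property of the model's manager, everything routed through `Article.objects` (including `create()`) sees the same policy, which is the argument for putting it here rather than on QuerySet.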
More thoughts once I've had a chance to digest all this.
Regards,
Malcolm
The admin takes a registry-based approach to associate ModelAdmin
classes with Models. Could a similar approach work here?
myapp/connections.py:
from django.contrib.multidb import connection
from myapp.models import MyModel
class MyModelConnection(connection.ModelConnection):
    def __call__(self):
        ... return a database connection ...

connection.register(MyModel, MyModelConnection)
I guess there's no reason even for MyModelConnection to be a class; a
callable would do.
Just a thought.
Cheers,
Dan
--
Dan Fairs <dan....@gmail.com> | http://www.stereoplex.com/
Would this perhaps be easier to do after the Apps-Refactor (#3591)
lands? I'm not real familiar with that ticket, but if we're trying to
set a connection on a app level - that seems like the easiest way to
do it via settings. Something like:
INSTALLED_APPS = (
    app('django.contrib.auth', connection='my_user_db'),
    ...
)
Not sure how that would work for overriding the default
get_connection method though. We'd probably still be referring to a
callable by 'dotted.path.to.a.function' syntax. And it would apply to
all models in an app, not just some.
Just a thought.
--
----
Waylan Limberg
way...@gmail.com
Oh, please, no! Registration is a very fragile process. It simply
doesn't work very well. It's a bit disappointing that we have to do
things that way in places in Django, and if we can avoid it elsewhere
that'd be nice.
Malcolm
I think this example highlights the problem with per-Model db
connections: it'll only work if either that model is not related to
the others in the app, or if the other models in the app also use the
same db. This will probably make per-application db connections a much
more common use case than per-Model.
> 2. Have a setting which lets you say "for model auth.User, use the
> get_connection method defined over here". This is made inelegant by
> the fact that settings shouldn't really contain references to actual
> function definitions, which means we would probably need to use a
> 'dotted.path.to.a.function', which is crufty.
Considering that this is how every module, function and class is
referred to in settings, I don't think it'll be that big of a deal. I
especially like Waylan's suggestion.
> 3. Use a signal. There isn't much precedent in Django for signals
> which alter the way in which something is done - normally signals are
> used to inform another part of the code that something has happened.
The nice thing about signals is that it allows any arbitrary scheme
for selecting connections without modifying the application. For the
User case above, you could register a function that chooses a replica
for User queries only on selects which don't join with a model outside
the auth app.
I see your point about not changing how things are done with signals.
I was thinking this would be done most simply by sending the QuerySet
with the signal, but that opens things up to a lot more changes than
just db connections. That could end up being a way to introduce very hard
to find bugs. I still like how easy it makes it to customize db access
without altering the app itself.
-Justin
It won't work on a statement-level. If you have a transaction and do an
UPDATE and then a SELECT then the latter won't see results of the former
because it will look into another connection (and another database).
I strongly believe that choosing between a master and a slave is the
decision that should be made on business logic level, not at model level.
*sigh* Whilst I realise you are very enthusiastic about getting stuff
done at the moment, Simon, it's very hard to juggle 10 different serious
design issues all at once. And, yes, I understand that some of them
overlap, but solving the registration issue isn't really going to be the
main part (or even necessarily any part) of working out a multiple
database API.
Incremental steps, rather than a plan which requires changing 6 things
at once, are really preferable here. It helps code stability and means we
can devote our full attention to just one or two things at a time. The
things you are bringing up are serious issues, but they're not new
issues -- hardly anything people are suddenly rediscovering from
DjangoCon hasn't been on our radar for months or years. We don't have
to solve them all this week. So, please. Let's slow down a bit and have
the time to consider how we can do things in small steps, reserving
large sweeping changes as an "if all else fails" fallback. We might
still end up using some kind of new "registration" alternative in, say,
database connection registration, but that can be phase two or phase
three. Phase one being the manual configuration option.
>
> * Registering models with the admin
> * Registering models with databrowse
> * Registering template tags
> * Registering custom management commands
You don't register custom management commands. They are "discovered",
similar to import discovering modules to import: by putting them in a
well-defined location. I'm not sure why template tags aren't done the
same way (a distinguished variable in a file that is imported saying
"these are the methods that are the tags I'm supplying", similar to
__all__).
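The discovery pattern Malcolm describes is simple to picture: instead of a register() call, a tag library exposes a distinguished variable and the framework collects whatever it names. A sketch, with the `TAGS` variable name and the fake namespace entirely invented for illustration:

```python
# __all__-style discovery: the "module" declares which names are tags,
# and the framework collects only those. No registration call needed.
# (A plain dict stands in for a module namespace here.)
fake_taglib_namespace = {
    'TAGS': ['shout', 'whisper'],
    'shout': lambda s: s.upper(),
    'whisper': lambda s: s.lower(),
    'helper': lambda s: s,      # not listed in TAGS, so not discovered
}

def discover_tags(namespace):
    return {name: namespace[name] for name in namespace.get('TAGS', [])}

tags = discover_tags(fake_taglib_namespace)
tags['shout']('hi')    # 'HI'
```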
As I note below, the current uses of registration aren't all necessary.
> It's also present in popular third-party apps:
>
> * django-tagging and django-mptt both register models (though both
> feel like they should really be some kind of mixin)
>
> We'll also need it in the near future for a couple of in-development
> features:
>
> * Registering custom panels with the Django debugging toolbar
> * Registering new benchmarks with the metronome profiling tool
> * Registering get_connection overrides in the above multi-db proposal
>
> Finally, we've been needing to solve it for projects at work: we have
> a CMS that exposes a concept of "custom rows" which can be provided by
> applications and dropped in to various places around the site. Guess
> what: the rows need to be registered!
>
> There MUST be a good way of doing this. zope.interface?
> setuptools
> entry points?
Whilst I realise that some people want to use setuptools, anything that
*requires* it is going to be a big problem for me, at least. It has some
problems that mean it's really unusable in large sysadmin installations
and it acts at odds with the existing packaging system on proper
distributions (you can't be a proper distribution without a proper
packaging system, after all). The maintainer of setuptools has pretty
clearly indicated he isn't interested in fixing the latter problem (look
back at some of the interactions with the Debian guys in the past) and
the former is barely acknowledged as even a problem.
> We really, really need to solve this for Django.
It's not entirely clear that we do, since before you solve something
there has to be a problem. We could make things a bit easier, but it's
quite possibly a case of using the wrong shovel to hammer in your screws
in some cases and in other cases it requires almost no infrastructure.
For those already firing up their replies, note that I carefully wrote
"possibly". I'm asking that people step back and view the issue as
whether it's the right approach before we make a better version of thing
X.
Regards,
Malcolm
Another approach that I took in mysql_replicated[1] is to always serve
the page that the user GETs from a redirect after a successful POST
from the master. It certainly doesn't solve the problem in general but it's good
enough (for us at least). But I'll second that this damn lagging thing
is pretty hard to solve in a general way.
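The mechanics of that trick reduce to pinning a user to the master for a short window after a write. A sketch, with all names invented (not mysql_replicated's actual API; in practice the pin would travel in a cookie rather than a server-side dict):

```python
# After a successful POST, remember that this user should read from the
# master for a few seconds, so the redirect-after-POST GET sees their
# own write despite replication lag. Illustrative names throughout.
import time

PIN_SECONDS = 5
_pinned_until = {}   # session key -> timestamp; a cookie in practice

def note_write(session_key):
    _pinned_until[session_key] = time.time() + PIN_SECONDS

def db_for_read(session_key):
    if _pinned_until.get(session_key, 0) > time.time():
        return 'master'
    return 'slave'

note_write('abc')
db_for_read('abc')       # 'master' within the pin window
db_for_read('other')     # 'slave' -- this user never wrote anything
```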
Well... To be sure save() should always go to master because on slaves
you just don't have permissions to save anything. So a parameter to
save() is redundant.
More to the point, Mike Malone just described a situation when you want
the same code to query either master or slave depending on whether
you're sure that data on slave had a chance to be synced with master.
Another thing is that explicit specification of a connection may become
very tedious and non-DRY. For example one should always use 'master'
when you're POSTing forms. I don't think requiring users to do it
manually is a good idea.
To be honest, I like the approach that I've implemented in my db backend
for this: using HTTP semantics with the ability to force master or slave
when you need it. However it works with an implicit state which is not a
clean approach compared to a functional one with explicit passing of
connections.
As Malcolm has pointed you're proposing many things at once :-). I tend
to think that replication, sharding, migration to another db are very
different things and maybe we shouldn't try to solve them with a single
API.
On Sep 10, 10:24 pm, "Mike Malone" <mjmal...@gmail.com> wrote:
> At Pownce, for example, we stick users to the master database for some
> period of time (a couple of seconds, usually) after they post a new note.
> The problem here (as Malcolm pointed out) is that related managers use the
> default manager for the related field. So if I ask for a User's Notes, the
> default Note manager is used. That manager is, presumably, where the
> decision is going to be made as to whether the slave or the master should be
> queried. But the Note manager has no way of knowing whether the User is
> stuck to the master -- it doesn't even know that there's a User associated
> with the query...

That's really interesting. I wonder if that invalidates the whole
approach I proposed, or merely means it needs some refining?
> I think it just needs refining. My understanding is that related
> fields was due for a refactor anyways, so this would probably be a
> good time to do / think about it. I guess my point is that there needs
> to be some non-internal API for getting at related field information,
> too. In any case, more thought is required.
Agreed, mostly. I'm using this thread as a way of looking at the various
use-cases people are proposing and this will guide a bunch of that
particular refactoring, I suspect. What needs to be exposed is kind of a
consequence of the API we decide to settle on. It's a case where, right
this minute, we (or, rather, I, personally) don't necessarily have a
high confidence level that we understand all the use-cases and can
confidently split them into "things we want to do out of the box",
"things we want to support via extensions" and "moderately crazy stuff".
Regards,
Malcolm
This is indeed a bug (the fact that the appropriate connection isn't
passed into the "where" addition), but it's a bigger one than you
indicate here. The real fix will be that WhereNode.add() should be
passed the "query" instance that is calling it (which contains a
reference to the database connection amongst many other things). If
processing a new where-constraint required, for example, adding a new
table join, you're out of luck at the moment.
I'm holding off committing this until we branch off the 1.0-bugfixes
branch, since it's only going to go onto trunk (it's essentially a
feature-add).
Regards,
Malcolm