[GSOC] Multiple Database API proposal

Alex Gaynor

Mar 20, 2009, 9:45:48 AM

to django-d...@googlegroups.com

Hello all,

For those who don't know me, I'm a freshman computer science student at
Rensselaer Polytechnic Institute in Troy, New York. I'm on the mailing lists
quite a bit, so you may have seen me around.

A Multiple Database API For Django
==================================

Django currently has the low-level hooks necessary for multiple-database
support, but it has neither a high-level API for using them nor any supporting
infrastructure, documentation, or tests. The purpose of this project would be
to implement the high-level API necessary for the use of multiple databases in
Django, along with the requisite documentation and tests.

There have been several previous proposals for, and implementations of,
multiple-database support in Django, none of which has been complete or gained
sufficient traction in the community to be included in Django itself.
As such, this proposal will specifically address some of the reasons for past
failures, and their remedies.

The API
-------

First there is the API for defining multiple connections. A new setting,
``DATABASES`` (or something similar), will be created. It is a dictionary
mapping each database alias (an internal name) to a dictionary containing the
current ``DATABASE_*`` settings:

.. sourcecode:: python

    DATABASES = {
        'default': {
            'DATABASE_ENGINE': 'postgresql_psycopg2',
            'DATABASE_NAME': 'my_data_base',
            'DATABASE_USER': 'django',
            'DATABASE_PASSWORD': 'super_secret',
        },
        'user': {
            'DATABASE_ENGINE': 'sqlite3',
            'DATABASE_NAME': '/home/django_projects/universal/users.db',
        }
    }

A database with the alias ``default`` will be the default connection (it will
be used if no other one is specified for a query) and will be the direct
replacement for the ``DATABASE_*`` settings. In compliance with Django's
deprecation policy, the ``DATABASE_*`` settings will automatically be handled
as if they were defined in the ``DATABASES`` dict for at least 2 releases.
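
As a rough illustration of that compatibility shim, the sketch below folds
old-style settings into the new dictionary. The ``build_databases`` helper and
the plain-dict ``settings`` argument are assumptions made for this example and
are not part of Django:

```python
# Old-style setting names that would be folded into DATABASES['default'].
LEGACY_KEYS = (
    "DATABASE_ENGINE",
    "DATABASE_NAME",
    "DATABASE_USER",
    "DATABASE_PASSWORD",
    "DATABASE_HOST",
    "DATABASE_PORT",
)


def build_databases(settings):
    """Return a DATABASES-style dict from a plain dict of settings.

    Hypothetical helper: an explicit DATABASES['default'] always wins;
    otherwise the legacy DATABASE_* values are treated as 'default'.
    """
    databases = {
        alias: dict(conf)
        for alias, conf in settings.get("DATABASES", {}).items()
    }
    if "default" not in databases:
        legacy = {key: settings[key] for key in LEGACY_KEYS if key in settings}
        if legacy:
            databases["default"] = legacy
    return databases
```

With this approach a project that only defines ``DATABASE_ENGINE`` and
``DATABASE_NAME`` would keep working unchanged during the deprecation period.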

Next, a ``connections`` object will be implemented in ``django.db``, analogous
to the ``django.db.connection`` object. ``connections`` will be a
dictionary-like object that is subscripted by database alias and lazily
returns a connection to the database. ``django.db.connection`` will remain (at
least for the present; its ultimate fate will be decided by community
consensus) and merely proxy to ``django.db.connections['default']``. Using the
previously defined database setting, this might be used as:

.. sourcecode:: python

    from django.db import connections

    conn = connections['user']
    c = conn.cursor()
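    # cursor.execute() has no portable return value under the DB-API,
    # so fetch the rows from the cursor itself.
    c.execute("""SELECT 1""")
    results = c.fetchall()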
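
One way the lazy, dictionary-like behaviour could work is sketched below. This
is only an illustration under stated assumptions: ``ConnectionHandler`` and
``sqlite_connect`` are hypothetical names, and the standard library's
``sqlite3`` module stands in for a real Django backend:

```python
import sqlite3


class ConnectionHandler:
    """Dictionary-like object that opens each connection on first access."""

    def __init__(self, databases, connect):
        self._databases = databases   # alias -> settings dict
        self._connect = connect       # hypothetical factory: settings dict -> connection
        self._connections = {}        # alias -> already-opened connection

    def __getitem__(self, alias):
        if alias not in self._connections:
            # A KeyError here means the alias was never configured.
            conf = self._databases[alias]
            self._connections[alias] = self._connect(conf)
        return self._connections[alias]


def sqlite_connect(conf):
    # Stand-in for backend-specific connection code.
    return sqlite3.connect(conf["DATABASE_NAME"])


connections = ConnectionHandler(
    {
        "default": {"DATABASE_NAME": ":memory:"},
        "user": {"DATABASE_NAME": ":memory:"},
    },
    sqlite_connect,
)

# Only the 'user' connection is opened here; 'default' stays untouched.
cursor = connections["user"].cursor()
cursor.execute("SELECT 1")
rows = cursor.fetchall()
```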

Now that there is the necessary infrastructure to accompany the very low-level
plumbing, we need our actual API. The high-level API will have 2 components.
First, there will be a ``using()`` method on ``QuerySet`` and ``Manager``
objects. This method simply takes an alias to a connection (and possibly a
connection object itself to allow for dynamic database usage) and makes that
the connection that will be used for that query. Second, a new option will
be created in the inner ``Meta`` class of models. This option will be named
``using`` and will specify the default connection to use for all queries
against this model, overriding the default specified in the settings:

.. sourcecode:: python

    class MyUser(models.Model):
        ...
        class Meta:
            using = 'user'

    # this queries the 'user' database
    MyUser.objects.all()
    # this queries the 'default' database
    MyUser.objects.using('default')
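
The precedence implied above (an explicit ``using()`` call, then the ``using``
Meta option, then the ``default`` alias) could be sketched as follows.
``resolve_alias`` is a hypothetical helper used only to make the ordering
concrete:

```python
def resolve_alias(explicit=None, meta_using=None, default="default"):
    """Pick the connection alias for a query.

    explicit   -- alias passed to using(), if any
    meta_using -- the model's Meta.using option, if any
    default    -- the project-wide fallback alias
    """
    if explicit is not None:
        return explicit
    if meta_using is not None:
        return meta_using
    return default
```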

Lastly, various plumbing will need to be updated to reflect the new multidb
API, such as transactions, breakpoints, management commands, etc.

More Advanced Usage
-------------------

While the above two methods are, strictly speaking, sufficient, they require
the user to write lots of boilerplate code in order to implement advanced
multi-database strategies such as replication and sharding. Therefore we also
introduce the concept of ``DatabaseManagers``, not to be confused with Django's
current managers. ``DatabaseManagers`` are classes that define which connection
should be used for a given query. There are 2 levels at which to specify what
``DatabaseManager`` to use: as a setting, and at the class level. For example,
in one's settings.py one might have:

.. sourcecode:: python

    DEFAULT_DB_MANAGER = 'django.db.multidb.round_robin.Random'

This tells Django that for each query it should use the ``DatabaseManager``
specified at that location, unless it is overridden by the ``using`` Meta
option or the ``using()`` method.

The more granular way to use ``DatabaseManagers`` is to provide them, in place
of a string, as the ``using`` Meta option. Here we pass an instance of the
class we want to use:

.. sourcecode:: python

    class MyModel(models.Model):
        class Meta:
            using = Random(['my_db1', 'my_db2', 'my_db3'])

At this level it can still be overridden by the explicit usage of the
``using()`` method.

But how exactly do ``DatabaseManagers`` work? Let's start with an example:

.. sourcecode:: python

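    import random

    from django.conf import settings

    # ``DatabaseManager`` is the base class proposed here; it does not exist yet.
    class Random(DatabaseManager):
        def __init__(self, dbs=None):
            # Default to every configured alias; copy into a list so that
            # random.choice() can index it.
            self.dbs = list(dbs) if dbs is not None else list(settings.DATABASES)

        def select(self, cls, **params):
            return random.choice(self.dbs)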

        def create(self, cls, **params):
            raise TypeError("Random database manager is intended only for reads")

        def update(self, cls, **params):
            raise TypeError("Random database manager is intended only for reads")

Basically we have 3 methods on a ``DatabaseManager``, plus the ``__init__``
method. ``__init__`` should be callable with no parameters if you
want to make the class the default for your project. ``select()``,
``create()``, and ``update()`` each take the class of the model that the query
is for, plus ``**params``. It has yet to be determined what params should be
passed; ideas include:

* The ``Query`` object for the ``QuerySet`` in question.
* The ``WhereNode`` for the ``Query`` object.
* others...
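
To show how the three methods might divide work in a replication setup, here
is a minimal sketch. Everything in it is an assumption for illustration: the
``DatabaseManager`` base class is the one proposed above, ``MasterReplica`` is
a made-up name, and ``**params`` is ignored because its contents are still
undecided:

```python
import random


class DatabaseManager:
    """Hypothetical base class: each method returns a connection alias."""

    def select(self, cls, **params):
        raise NotImplementedError

    def create(self, cls, **params):
        raise NotImplementedError

    def update(self, cls, **params):
        raise NotImplementedError


class MasterReplica(DatabaseManager):
    """Send writes to one master and spread reads over the replicas."""

    def __init__(self, master="default", replicas=()):
        self.master = master
        self.replicas = list(replicas)

    def select(self, cls, **params):
        # Fall back to the master when no replica is configured.
        if not self.replicas:
            return self.master
        return random.choice(self.replicas)

    def create(self, cls, **params):
        return self.master

    def update(self, cls, **params):
        return self.master
```

Note that this sketch does nothing about read/write consistency: a read issued
right after a write may hit a replica that has not caught up yet.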

Plan of Action
--------------
1) Implement the ``connections`` object. -- 1 day
2) Alter the relevant management commands and anything else to use all
   connections or ``django.db.connections['default']``, depending on which is
   appropriate. -- 1 week
3) Implement the method tracking (command pattern). -- 1 week
4) Implement the ``using()`` method and the ``using`` inner ``Meta`` options.
   -- 1 week
5) Write initial tests and docs, the rest will be written as features are
   implemented, however a large initial set needs to be written. -- 3 weeks
6) Fix up transaction support, the close database signal, anything else in
   transactions.py. -- 2 weeks
7) Add support for the ``DatabaseManager`` for more complex support. -- 2 weeks
8) Time permitting implement a few common replication patterns.

All of these times are fairly aggressive, and there are about 2 weeks to spare,
so those can be used as necessary, or for part #8.

Hurdles
-------

The following is a list of possible technical issues:

* In ``django.db.models.sql.query.Query``, are any tests done on what the
  connection is before the actual SQL construction phase? If so, these need
  to be changed not to do this, since the connection might change at some
  point after that test. If this can't be done, then ``using()`` needs to be
  the first method called on a ``QuerySet``, or at a minimum called before any
  methods that do such testing. Further, if these tests can't be put off, then
  the only option is a callback that's called right when the first ``Query``
  object is constructed. This means Django won't know what type of query it
  would be, rendering the ``DatabaseManager`` impossible.
* Will models need to know which database they came from so that they can be
  saved back correctly?
* Does ``Model.save()`` need to take a ``using`` parameter so new objects can
  be created on a specific database or saved to a new database?
* For backends that use custom query classes, will we need a ``from_query``
  classmethod to transform them? This would require all backends to store
  and use information that is basically less than or equal to what the
  ``Query`` object stores. Also, there needs to be the reverse: a way to go
  from a custom ``Query`` object back to either the Django default or some
  other custom ``Query`` object.
* Foreign keys will basically be handled en passant because of how they are
  implemented, but many-to-many fields will require more thought, especially
  since that SQL isn't in the ``Query`` class.

Solutions
---------

The greatest hurdle is changing the connection after we already have our
``Query`` partly created. The issues here are that we might have done tests
against ``connection.features`` already, and that we might need to switch
either to or from a custom ``Query`` object, amongst other issues. One possible
solution that is very powerful (though quite inelegant) is to have the
``QuerySet`` keep track of all public API method calls against it and what
parameters they took. Then, when the ``connection`` is changed, it will
recreate the ``Query`` object by creating a "blank" one with the new connection
and reapplying all the methods it has stored. This is basically a simple
implementation of the command pattern.
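
A toy version of that record-and-replay idea is sketched below. It is not
Django's ``QuerySet``: ``RecordingQuerySet`` and ``FakeQuery`` are hypothetical
stand-ins, and only ``filter()`` and ``order_by()`` are modelled:

```python
class FakeQuery:
    """Stand-in for Query: remembers which connection it was built for."""

    def __init__(self, alias):
        self.alias = alias
        self.filters = {}
        self.ordering = ()


class RecordingQuerySet:
    """Records public method calls and replays them onto a fresh query."""

    def __init__(self, alias="default", calls=()):
        self.alias = alias
        self._calls = tuple(calls)  # sequence of (method_name, arguments)

    def _clone(self, alias=None, extra=()):
        return RecordingQuerySet(alias or self.alias, self._calls + tuple(extra))

    def filter(self, **kwargs):
        return self._clone(extra=[("filter", kwargs)])

    def order_by(self, *fields):
        return self._clone(extra=[("order_by", fields)])

    def using(self, alias):
        # Switching connections keeps the recorded calls untouched.
        return self._clone(alias=alias)

    def build_query(self):
        # Start from a "blank" query on the chosen connection, then replay.
        query = FakeQuery(self.alias)
        for name, args in self._calls:
            if name == "filter":
                query.filters.update(args)
            elif name == "order_by":
                query.ordering = args
        return query
```

Because the call log is independent of any connection, ``using()`` can be
called at any point in the chain and the resulting query is still built
against the right database.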

I'm here soliciting feedback on both the API, and any potential hurdles I may
have missed.

Thanks,
Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." --Voltaire
"The people's good is the highest law."--Cicero

Tim Chase

Mar 20, 2009, 10:29:43 AM

to django-d...@googlegroups.com

> I'm here soliciting feedback on both the API, and any potential hurdles I
> may have missed.

While my vote may mean little, Alex has certainly been active and
has posted quality code on the mailing list. MultiDB has also been a
frequent issue on the mailing list, so Alex gets my +1.

I'd hope to see "multiple databases" defined a little more
clearly as discussed in this thread[1]. Whether the SoC project
addresses *all* of the facets (wow, lots of work!) or just selects
certain issues, I'd like to see them addressed in the proposal
("addressing federation and load-balancing, but not sharding") to
show that they're being considered during the implementation.
From what I gather in the description, Alex is only proposing
load-balancing.

Depending on which definitions of multidb you plan to address, it
also impacts areas such as aggregation (performing
count/summation over shards requires extra consideration) and
cross-database joining. In the above thread, Malcolm also raises
the issue of read/write consistency when doing load-balancing.
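
To illustrate why aggregation over shards needs extra consideration, here is
a small sketch. The helper names and the per-shard ``(row_count, column_sum)``
tuples are assumptions for the example; a count can simply be summed, but an
average has to be rebuilt from per-shard sums and counts rather than by
averaging the per-shard averages:

```python
def count_across(shard_counts):
    """Total row count is just the sum of the per-shard counts."""
    return sum(shard_counts)


def average_across(shard_stats):
    """Combine per-shard (row_count, column_sum) pairs into one average."""
    stats = list(shard_stats)
    total_rows = sum(rows for rows, _ in stats)
    total_sum = sum(column_sum for _, column_sum in stats)
    if total_rows == 0:
        return None
    return total_sum / total_rows


# Two shards of unequal size: averaging the per-shard averages is wrong.
shards = [(2, 10.0), (8, 20.0)]
naive = sum(column_sum / rows for rows, column_sum in shards) / len(shards)
correct = average_across(shards)
```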

-tim

[1]
http://groups.google.com/group/django-users/browse_thread/thread/663046559fd0f9c1/

Malcolm Tredinnick

Mar 20, 2009, 11:21:07 PM

to django-d...@googlegroups.com

That's one word for it. This stuff will take a lot longer than you think
because testing it along the way will take longer than you think, for a
start. There are lots of traps that we don't even know exist yet.

Fortunately, step 3 won't be needed (I think things would really be
broken if it is), but aiming to get to about where you are at the end of
step 6 by the end of the period feels, in my gut, about right.

This isn't an area where the first implementation will be the right one.
There will be places where you'll want to go back a few days later and
try again. If not, your mentor, whoever that is, will probably suggest
you do. There will be, no doubt, bits of refactoring of the existing
internals required to incorporate some of this and those changes will
need careful testing and consideration (and, often, a couple of passes).
This isn't even intuition on my part any more. I've done enough coding
in that area to know it's a huge, complex beast and first attempts can
always be improved upon.

Even the documentation and tests are things you'll want to come back to
a few times, because writing them out usually triggers (or should trigger)
doubts about the implementation or API, which leads to a bunch of
changes, which leads to the documentation now needing more updating,
etc. Good that you've allowed 3 weeks there, but I suspect it's 3 weeks
spread over a much longer period on the calendar, rather than all
together.

By all means have a few things that you can get to if there's time. And
estimation is one of those impossible things to get right. Aim to get up
to some point (probably end of step 6 or partway through step 7) and
then have stuff to do if it all goes well.

Something that hasn't really been done in past years, though, is
students using the extra time to really improve their code. If you have
a week or two at the end, use it to fix up all the things that could be
better in the first ten weeks' work.

All that being said, your schedule looks roughly reasonable. Estimation
is very difficult, even for people who've been in the industry for
years, so we're not going to expect miracles in the SoC front. Part of
the experience is learning the truth of that.

> and there are about 2 weeks to spare,
> so those can be used as necessary, or for part #8.
>
> Hurdles
> -------
>
> The following are a list of possible technical issues:
>
> * In ``django.db.models.sql.query.Query`` are any tests done on what
>   the connection is before the actual SQL construction phase.

I've discussed this one below, under "solutions".

> If so these need to be changed not to do this, since the connection might
> change at some point after that test. If this can't be done then
> ``using()`` needs to be the first method called on a ``QuerySet``, or at
> a minimum called before any methods that do such testing. Further, if
> these tests can't be put off then the only option is a callback that's
> called right when the first ``Query`` object is constructed, this means
> Django won't know what type of query it would be, rendering the
> ``DatabaseManager`` impossible.
> * Will models need to know which database they came from so that they
>   can be saved back correctly?

Be worth doing. Storing the connection name used to retrieve the model
wouldn't be hard.

> * Does ``Model.save()`` need to take a ``using`` parameter so new objects
>   can be created on a specific database or saved to a new database?

Almost certainly. There has to be some way to specify it. A parameter to
save() seems natural enough.
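
A rough sketch of how those two pieces could fit together follows. None of
this is existing Django API: ``SketchModel``, its ``_db`` attribute and the
``from_db`` constructor are made up for the illustration, and ``save()`` only
reports which alias it would write to:

```python
class SketchModel:
    """Toy model that remembers which connection it belongs to."""

    default_db = "default"

    def __init__(self, **fields):
        self.fields = fields
        self._db = None  # alias the instance was loaded from or last saved to

    @classmethod
    def from_db(cls, alias, **fields):
        # Called by the (imaginary) query machinery when loading a row.
        obj = cls(**fields)
        obj._db = alias
        return obj

    def save(self, using=None):
        # An explicit alias wins, then the remembered one, then the default.
        alias = using or self._db or self.default_db
        self._db = alias
        return alias  # a real implementation would run INSERT/UPDATE here
```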

> * Backends that use custom query classes, will we need a ``from_query``
>   classmethod to transform them. This would require all backends to store
>   and use information that is basically less than or equal to what the
>   ``Query`` object stores. Also, there needs to be the reverse, a way to go
>   from a custom ``Query`` object back to either the Django default or some
>   other custom ``Query`` object.

Can you explain this some more? It sounds like a horrible artifact. I'm
not sure calling the base SQL Query method the super-wrapper of
everything is a good design goal.

> * Foreign keys will basically be handled en passant because of how they
>   are implemented, but many to many fields will require more thought,
>   especially since that SQL isn't in the ``Query`` class.

I'd be happy saying that if you tried to cross a database boundary, it's
an error. It's up to you (the user) to know enough about your data
modeling setup to avoid this. But see what's possible there. That's
certainly an area that can be an error initially and added on later on,
since nothing in the error path will preclude making it not an error
later.

> Solutions
> ---------
>
> The greatest hurdle is changing the connection after we already have our
> ``Query`` partly created. The issues here are that: we might have done
> tests against ``connection.features`` already, we might need to switch
> either to or from a custom ``Query`` object, amongst other issues.

There's possibly some intrusion here, but not a lot. Creating some
delayed evaluation structures to hold various pieces is probably a
reasonable approach for most cases. For example, all uses of quote_name
(and friends) are restricted to the SQL construction phase of things
(as_sql() and things it calls).

Also, I wouldn't be unhappy with a restriction that you can't change the
type of database you are directing to once the queryset construction has
started. The common use-cases here are having a pool of databases all
running on the same type of server. Or read-only and read-write
databases running on the same server, etc. So being able to change the
target of the query late in the process is useful enough to be basically
compulsory -- so that you can pass around a queryset and, much later,
decide whether to read from master or slave, for example -- but being
able to switch to some entirely arbitrary other storage system late in
the game could be something we prohibit. That's probably not too onerous
a restriction.

> One possible solution that is very powerful (though quite inelegant) is
> to have the ``QuerySet`` keep track of all public API method calls against
> it and what parameters they took, then when the ``connection`` is changed
> it will recreate the ``Query`` object by creating a "blank" one with the
> new connection and reapplying all the methods it has stored. This is
> basically a simple implementation of the command pattern.

It's pretty yukky. There's a lot of Python level junk that we
intentionally avoid storing in querysets so that they behave properly as
persistent data structures (clones are independent copies) and can be
pickled without trouble, etc. It would be really bad for performance to
reintroduce those (I did a lot of profiling when developing that stuff
and tried to throw out as much as possible). I think this fortunately
isn't going to be a real issue. I was pretty careful originally to keep
the leakage from django.db.connection into the Query class to as few
places as possible and mostly when we're creating the SQL.

Some cases that might be unavoidable could be replaced with delayed
evaluation objects (essentially encapsulating the command pattern just
for that fragment), which is a bit cleaner.

> I'm here soliciting feedback on both the API, and any potential hurdles I
> may have missed.

Did you go back through the thread from just after DjangoCon last year
and make sure you've covered or intentionally omitted all the use-cases
mentioned there? Would be good to see some discussion of things that
aren't going to be possible with this. I know I rounded up a few people
I'd done consulting for or chatted to about this problem who posted
there (and don't normally post), so it's quite full of industry-backed
experiences. I think you've hit most of the points, but it would be
worth confirming that.

Regards,
Malcolm

Alex Gaynor

Mar 20, 2009, 11:41:48 PM

to django-d...@googlegroups.com

I agree completely (it shouldn't even have been phrased as a question).

> > * Does ``Model.save()`` need to take a ``using`` parameter so new objects
> >   can be created on a specific database or saved to a new database?
>
> Almost certainly. There has to be some way to specify it. A parameter to
> save() seems natural enough.

In terms of the API, I would also think queryset.using('db').create('obj') should pass that using param along.

> > * Backends that use custom query classes, will we need a ``from_query``
> >   classmethod to transform them. This would require all backends to store
> >   and use information that is basically less than or equal to what the
> >   ``Query`` object stores. Also, there needs to be the reverse, a way to
> >   go from a custom ``Query`` object back to either the Django default or
> >   some other custom ``Query`` object.
>
> Can you explain this some more? It sounds like a horrible artifact. I'm
> not sure calling the base SQL Query method the super-wrapper of
> everything is a good design goal.

This was more of a "one possible way to handle" the issue of switching Query classes. Some of this proposal is unfortunately not as organized as I'd like; the solution I presented under the Solutions header solves the same problem, and it is the one I would prefer.

> > * Foreign keys will basically be handled en passant because of how they
> >   are implemented, but many to many fields will require more thought,
> >   especially since that SQL isn't in the ``Query`` class.
>
> I'd be happy saying that if you tried to cross a database boundary, it's
> an error. It's up to you (the user) to know enough about your data
> modeling setup to avoid this. But see what's possible there. That's
> certainly an area that can be an error initially and added on later on,
> since nothing in the error path will preclude making it not an error
> later.

Marty Alchin has suggested that if you want to do this you should use an intermediary model, since otherwise Django has no way to guess which DB the intermediary table is on. I like this suggestion. The location of the SQL may still be an issue, but it does seem to me to be the right longer-term approach.

One suggestion Eric Florenzano had was that we go above and beyond just storing the methods and parameters: we don't even execute them at all until absolutely necessary. This suggestion appeals to me, although thinking about what the code might look like doesn't. Basically, all public queryset API methods wouldn't do what they currently do; they would just do self.status.append((method_name, args, kwargs)). Then, when the QuerySet actually needed to be evaluated for results (that is, at the last moment, after the DB is decided), it would run through and call the real implementation of each of these methods to build up the Query object, and then actually do the query.

> > I'm here soliciting feedback on both the API, and any potential hurdles I
> > may have missed.
>
> Did you go back through the thread from just after DjangoCon last year
> and make sure you've covered or intentionally omitted all the use-cases
> mentioned there? Would be good to see some discussion of things that
> aren't going to be possible with this. I know I rounded up a few people
> I'd done consulting for or chatted to about this problem who posted
> there (and don't normally post), so it's quite full of industry-backed
> experiences. I think you've hit most of the points, but it would be
> worth confirming that.

Yep, I've reread that thread, as well as any others I could find. I've also spent a good bit of time discussing what the API needs with guys who have implemented multiple database APIs for themselves using the existing low level hooks.

> Regards,
> Malcolm

Thanks for all the review Malcolm.

One question that I didn't really ask in the initial post is what parameters a "DatabaseManager" should receive on its methods. One suggestion is the Query object, since that gives the user the maximal amount of information. However, my concern there is that it's not a public API, and having a private API as a part of the public API feels klunky. OTOH there isn't really another data structure that carries around the information that someone writing their sharding logic (or whatever other scheme they want to implement) will inevitably want to have.

Malcolm Tredinnick

Mar 21, 2009, 12:06:35 AM

to django-d...@googlegroups.com

Trimming unused portions of the response to make it readable (which I
should have done the first time around, too)...

On Fri, 2009-03-20 at 23:41 -0400, Alex Gaynor wrote:
> On Fri, Mar 20, 2009 at 11:21 PM, Malcolm Tredinnick
> <mal...@pointy-stick.com> wrote:
> > On Fri, 2009-03-20 at 09:45 -0400, Alex Gaynor wrote:
> > > Hello all,

[...]

> > > The greatest hurdle is changing the connection after we already have
> > > our ``Query`` partly created. The issues here are that: we might have
> > > done tests against ``connection.features`` already, we might need to
> > > switch either to or from a custom ``Query`` object, amongst other
> > > issues.

[...]

Excuse me for a moment whilst I add Eric to a special list I've been
keeping. He's trying to make trouble.

Ok, back now... There are at least two problems with this.

(a) Backwards incompatible in that some querysets would return
noticeably different results before and after that change. It would be
subtle, quiet and very difficult to detect without auditing every line
of code that contributes to a queryset. The worst kind of change for us
to make from the perspective of the users.

(b) Intentionally not done right now and not because I'm whimsical and
arbitrary (although I am). The problem is it requires storing all sorts
of arbitrarily complex Python objects. Which breaks pickling, which
breaks caching. People tend to complain, a lot, about that last bit.

That's why the Where.add() converts things to more basic types when they
are added (via a filter() command). If somebody really needs lazily
evaluated parameters, it's easy enough via a custom Q-like object, but
so far nobody has asked for that if they've gotten stuck doing it. It's
even something we could consider adding to Django, although it's not a
no-brainer given the potential to break caching.
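
A small, self-contained illustration of the pickling point follows. It is not
Django's actual ``Where`` code: ``EagerWhere`` and ``DeferredWhere`` are
hypothetical classes, and "converting to a basic type" is modelled simply as
calling a callable value at the moment it is added:

```python
import pickle


def can_pickle(obj):
    """Report whether pickle can serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


class EagerWhere:
    """Resolves values to plain data as soon as they are added."""

    def __init__(self):
        self.children = []

    def add(self, field, value):
        if callable(value):
            value = value()  # evaluate now, store only the basic result
        self.children.append((field, value))


class DeferredWhere:
    """Stores whatever it is given, including arbitrary callables."""

    def __init__(self):
        self.children = []

    def add(self, field, value):
        self.children.append((field, value))


eager = EagerWhere()
eager.add("id", lambda: 3)

deferred = DeferredWhere()
deferred.add("id", lambda: 3)
```

Here ``can_pickle(eager.children)`` succeeds because only ``("id", 3)`` was
stored, while ``can_pickle(deferred.children)`` fails because a lambda cannot
be pickled, which is the kind of breakage that would hit queryset caching.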

[...]

>
> Thanks for all the review Malcolm.

No problems.

> One question that I didn't really ask in the initial post is what
> parameters should a "DatabaseManager" receive on its methods, one
> suggestion is the Query object, since that gives the user the maximal
> amount of information, however my concerns there are that it's not a
> public API, and having a private API as a part of the public API feels
> klunky.

At first glance, I believe the word you're looking for is "wrong". :-)

Definitely a valid concern.

> OTOH there isn't really another data structure that carries around
> the information that someone writing their sharding logic (or whatever
> other scheme they want to implement) will inevitably want to have.

Two solutions spring to mind, although I haven't thought this through a
lot: it's not particularly germane to the proposal since it's something
we can work out a bit later on. I've got limited time today (something
about a beta release coming up), so I wanted to just get out responses
to the two people who posted items for discussion. I suspect there's a
lot of thinking needed here about the concept as a whole and I want to
do that. Anyway...

One option is to use the piece of public API that is available which
will always be carrying around a Query object: the QuerySet. Query
objects don't exist in isolation. However, this sounds problematic
because the implementation is going to be working at a very low-level --
database managers are only really interesting to Query.as_sql() and its
dependencies. But that leads to the next idea, ...

The other is to work out a better place for this database manager in the
hierarchy. It might be something that lives as an attribute on a
QuerySet. Something like: the user provides a function that picks the
database based on "some information" that is available to it, and the base
method selects the right database to use. Since it lives in the QuerySet
namespace, it can happily access the "query" attribute there without any
encapsulation violations. The database manager then becomes two pieces,
an algorithm on QuerySet (that might just dispatch to the real algorithm
on Query), plus some user-supplied code to make the right selections.
That latter thing could be a callable object if you need the full class
structure. But the stuff QuerySet/Query needs to know about is probably
a much smaller interface than *requiring* a full class. (Did any of that
make sense?)
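
To make that a bit more concrete, here is a rough sketch of the two-piece
split. It is only an assumption about how it might look: ``SketchQuery`` and
``SketchQuerySet`` are toy stand-ins rather than Django classes, and
``shard_by_id`` plays the part of the user-supplied callable:

```python
class SketchQuery:
    """Stand-in for the internal Query: a model name plus equality filters."""

    def __init__(self, model_name, where=None):
        self.model_name = model_name
        self.where = dict(where or {})


class SketchQuerySet:
    """Owns the query and decides, late, which database to use."""

    def __init__(self, query, chooser=None, alias=None):
        self.query = query
        self.chooser = chooser  # user-supplied callable: query -> alias
        self.alias = alias      # explicit alias set through using()

    def filter(self, **kwargs):
        where = dict(self.query.where, **kwargs)
        query = SketchQuery(self.query.model_name, where)
        return SketchQuerySet(query, self.chooser, self.alias)

    def using(self, alias):
        return SketchQuerySet(self.query, self.chooser, alias)

    def resolve_db(self):
        # The algorithm lives on the queryset; the policy is the callable.
        if self.alias is not None:
            return self.alias
        if self.chooser is not None:
            return self.chooser(self.query)
        return "default"


def shard_by_id(query):
    """Example policy: spread rows over two shards by primary key."""
    pk = query.where.get("id")
    if pk is None:
        return "default"
    return "shard%d" % (pk % 2)
```

The callable only sees the query through the queryset that owns it, so user
code never has to be handed the private object directly by the public API.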

I think this -- the database manager concept -- is the part of your
proposal that is most up in the air with respect to what the API looks
like. Which is fine. The fact that it's something to consider is good
enough to know. Certainly put some thought into the problem, but don't
sweat the details too much just yet (in the application period). This is
one of those hard areas where you probably do need to think about it so
much it costs you sleep, you forget to eat and so on.

Regards,
Malcolm

Alex Gaynor

Mar 21, 2009, 12:41:26 AM

to django-d...@googlegroups.com

> > One suggestion Eric Florenzano had was that we go above and beyond
> > just storing the methods and parameters, we don't even execute them
> > at all until absolutely necessary.
>
> Excuse me for a moment whilst I add Eric to a special list I've been
> keeping. He's trying to make trouble.
>
> Ok, back now... There are at least two problems with this.
>
> (a) Backwards incompatible in that some querysets would return
> noticeably different results before and after that change. It would be
> subtle, quiet and very difficult to detect without auditing every line
> of code that contributes to a queryset. The worst kind of change for us
> to make from the perspective of the users.

In what scenario does it return different results? The one place I can think of is:

query = queryset.order_by('I AM NOT A REAL FIELD, HAHA')
render_to_response('template.html', {'q': query})

which would raise an exception in the template instead of in the view.

> (b) Intentionally not done right now and not because I'm whimsical and
> arbitrary (although I am). The problem is it requires storing all sorts
> of arbitrarily complex Python objects. Which breaks pickling, which
> breaks caching. People tend to complain, a lot, about that last bit.
>
> That's why the Where.add() converts things to more basic types when they
> are added (via a filter() command). If somebody really needs lazily
> evaluated parameters, it's easy enough via a custom Q-like object, but
> so far nobody has asked for that if they've gotten stuck doing it. It's
> even something we could consider adding to Django, although it's not a
> no-brainer given the potential to break caching.

I vaguely recall there being a ticket about this that you wontfixed, although that may have been about deferring calling callables :). In any event the caching issue was one I hadn't considered, although one solution would be not to pickle it with the ability to switch to a different query type, it's a bit of a strange restriction, but I don't think it's one that would practically affect people, and it's less restrictive.

[...]

> > Thanks for all the review Malcolm.
>
> No problems.
>
> > One question that I didn't really ask in the initial post is what
> > parameters should a "DatabaseManager" receive on its methods, one
> > suggestion is the Query object, since that gives the user the maximal
> > amount of information, however my concerns there are that it's not a
> > public API, and having a private API as a part of the public API feels
> > klunky.
>
> At first glance, I believe the word you're looking for is "wrong". :-)

Yes, that's the one.

The concept of a database manager is somewhat important as it makes automating your multidb strategy far easier. My concern with just passing a QuerySet is that it doesn't really hold any information. If I want to, say, shard on the id, then I need to poke at the Query (the same goes for any information about the query other than the type, which we already know from the method), and if we always need to actually touch the Query, then passing the QuerySet is a bit of an end run around.

The nice thing about the DatabaseManager concept (as I've conceived it) is that it can be implemented entirely separately from, and after, the rest of the API.

> Regards,
> Malcolm

Thanks,

Malcolm Tredinnick

Mar 21, 2009, 1:25:39 AM

to django-d...@googlegroups.com

On Sat, 2009-03-21 at 00:41 -0400, Alex Gaynor wrote:
>
>
> > One suggestion Eric Florenzano had was that we go above and beyond
> > just storing the methods and parameters, we don't even execute them
> > at all until absolutely necessary.
>
>
> Excuse me for a moment whilst I add Eric to a special list I've been
> keeping. He's trying to make trouble.
>
> Ok, back now... There are at least two problems with this.
>
> (a) Backwards incompatible in that some querysets would return
> noticeably different results before and after that change. It
> would be
> subtle, quiet and very difficult to detect without auditing
> every line
> of code that contributes to a queryset. The worst kind of
> change for us
> to make from the perspective of the users.
>
> In what scenario does it return different results? The one place I can
> think of is:
>
> query = queryset.order_by('I AM NOT A REAL FIELD, HAHA')
> render_to_response('template.html', {'q': query})
>
> which would raise an exception in the template instead of in the view.

It's related to eager/deferred argument evaluation (which is done for
the same reasons): any "smart" object like Q objects would require
changing to handle deferring things correctly. They can currently be
designed to evaluate only once and will work correctly.
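
To illustrate the kind of quiet difference at stake, here is a toy sketch in plain Python. It is not Django code: `EagerFilter` and `DeferredFilter` are hypothetical classes that model "evaluate the argument when the method is called" versus "record the argument and evaluate it only when results are needed".

```python
class EagerFilter:
    """Toy model: the argument is consumed when filter() is called."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, allowed):
        # Evaluated right now, so later changes to `allowed` have no effect.
        return EagerFilter(r for r in self.rows if r in allowed)

    def results(self):
        return self.rows


class DeferredFilter:
    """Toy model: filter() only records its argument for later."""

    def __init__(self, rows, pending=()):
        self.rows = list(rows)
        self.pending = list(pending)

    def filter(self, allowed):
        # Nothing is evaluated yet; we keep a reference to the argument.
        return DeferredFilter(self.rows, self.pending + [allowed])

    def results(self):
        rows = self.rows
        for allowed in self.pending:
            rows = [r for r in rows if r in allowed]
        return rows


allowed_ids = [1, 2]
eager = EagerFilter([1, 2, 3]).filter(allowed_ids)
deferred = DeferredFilter([1, 2, 3]).filter(allowed_ids)

# The argument changes after the call but before the results are read.
allowed_ids.append(3)

eager_rows = eager.results()        # [1, 2]
deferred_rows = deferred.results()  # [1, 2, 3]
```

The same calling code gives different results under the two models, with no error raised anywhere, which is the hard-to-audit behaviour change described above.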

You wrote a really long sentence there that didn't make a lot of sense
(too many prepositions and commas, not enough nouns and full stops).
Unclear which restriction you're arguing against, but the picklability
of querysets is pretty much a requirement. It's something people really
use.

However, before we go too far down this path: this is a very minor
thing. It's unlikely to be required. Adding it "because we can" is an
argument Eric can propose at some much later date if it's not absolutely
*required* for multi-db stuff. I think we won't need to worry about this
at all.

That's never been argued against.

> My concern with just passing a QuerySet is it doesn't really hold any
> information, if I want to say shard on the id then I need to poke at
> the Query(the same for any information about the query other than the
> type which we already know from the method),

Hmm ... maybe. I think you might have the dependency directions reversed
here. Think a bit more about what I wrote with regard to providing some
methods to make the choice. QuerySet/Query could provide the worker
routine which passes necessary information to a callback that is
provided by the user, for example. That's why the design requires
thinking here: there are at least two directions the control could flow
and I suspect you're getting into difficulties from the direction you're
currently approaching (with DatabaseManager controlling the show and doing
all the work).

If DatabaseManager has to poke at Query, we've probably lost, because
then it's tied to that Query class, not to the concept of storage
management selection (which should work with any type of Query object
and even general QuerySets).
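
One possible reading of the "worker routine plus user callback" direction, sketched in plain Python. Every name here (`ToyQuerySet`, `routing_info`, `db_alias`, `shard_by_id`) is hypothetical and only for illustration; none of it is an existing or proposed Django API.

```python
class ToyQuerySet:
    """Hypothetical stand-in for a queryset; not Django's QuerySet."""

    def __init__(self, model_name, filters=None, chooser=None):
        self.model_name = model_name
        self.filters = dict(filters or {})
        self.chooser = chooser  # user-supplied callback, may be None

    def filter(self, **kwargs):
        merged = dict(self.filters)
        merged.update(kwargs)
        return ToyQuerySet(self.model_name, merged, self.chooser)

    def routing_info(self):
        # Worker routine: hand out plain data only, never the internal Query.
        return {"model": self.model_name, "filters": dict(self.filters)}

    def db_alias(self):
        # Control flows from the queryset to the callback, not the reverse.
        if self.chooser is None:
            return "default"
        return self.chooser(self.routing_info())


def shard_by_id(info):
    """User callback: pick a shard from the id filter, if there is one."""
    pk = info["filters"].get("id")
    if pk is None:
        return "default"
    return "shard%d" % (pk % 2)


users = ToyQuerySet("User", chooser=shard_by_id)
alias_with_id = users.filter(id=7).db_alias()
alias_without_id = users.filter(name="x").db_alias()
```

In this shape the callback sees only a small dictionary, so it is not tied to any particular Query class. What that dictionary should contain is exactly the open design question.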

Don't try to solve this now. The concept of this type of utility is a
good one. But it's a problem that requires thinking. So think up a dozen
alternatives and filter them down to two or three.

> and if we always need to actually touch the Query then passing the
> QuerySet is a bit of an end run around.

No. It's encapsulation. You're passing in the public object and only using
methods on the public object.

>
> The nice thing about the DatabaseManager concept (as I've conceived it)
> is that it can be implemented entirely separately and after the rest
> of the API.

The concept isn't dependent on the implementation. It can be added later
whether it's a separate class or a method on Querysets. It's a utility
feature, pretty much by definition, so however it's implemented, it can
be added later (just like multi-db support could always be added later
to the ORM, however it was implemented).

Regards,
Malcolm

Alex Gaynor

Mar 21, 2009, 1:38:56 AM

to django-d...@googlegroups.com

I don't see this as an issue, simply because whatever happens in the instantiation of these objects would be the same for whatever connection was in use.

Just to clear that up, what I was saying was:

When you pickle a QuerySet we build up the entire Query as we would right before SQL execution and then just pickle that. Then the restriction is that you can't change the database type to be used on an unpickled query.
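
A rough sketch of that idea in plain Python, assuming a queryset that records its operations and only compiles them on demand. `RecordingQuerySet`, `using_backend` and the SQL string it builds are all hypothetical; only `pickle` and the `__getstate__` hook are real Python features.

```python
import pickle


class RecordingQuerySet:
    """Hypothetical queryset that records filters and compiles them late."""

    def __init__(self, table, backend="sqlite3"):
        self.table = table
        self.backend = backend
        self.ops = []         # recorded (field, value) filters
        self.compiled = None  # (backend, sql) once the query is built

    def filter(self, field, value):
        self.ops.append((field, value))
        return self

    def using_backend(self, backend):
        # The restriction: a query built for one backend cannot be retargeted.
        if self.compiled is not None:
            raise ValueError("query already built for %s" % self.compiled[0])
        self.backend = backend
        return self

    def _build(self):
        sql = "SELECT * FROM %s" % self.table
        if self.ops:
            sql += " WHERE " + " AND ".join("%s = %r" % op for op in self.ops)
        return (self.backend, sql)

    def __getstate__(self):
        # Pickle the fully built query, as it would be right before execution.
        state = self.__dict__.copy()
        if state["compiled"] is None:
            state["compiled"] = self._build()
        return state


if __name__ == "__main__":
    qs = RecordingQuerySet("users").filter("id", 5)
    restored = pickle.loads(pickle.dumps(qs))
    print(restored.compiled)
```

The original object can still switch backend, because it has not been compiled. The unpickled copy cannot, which is the restriction described above.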

Except right now the public API on a queryset doesn't give you any information about what your "query" is asking for. Therefore we are back to asking what pieces of information do we reasonably want to have to decide what database we are querying, and what's a reasonable format for providing them.

As always, thanks for the time and thoughts,

Ivan Sagalaev

Mar 21, 2009, 6:28:09 AM

to django-d...@googlegroups.com

Alex Gaynor wrote:
> 8) Time permitting implement a few common replication patterns.

I'm not very excited about this point.

To me replication is a major use-case. I suspect most people who move
beyond a single-server setup and beyond 10'000 - 20'000 visitors realize
that replication should just be in place, ensuring performance and
redundancy. In my experience other multi-DB patterns (those covered
by `using()` and Meta-attributes on models) are just *less* common in
practice. So I consider leaving replication to "time permitting" a mistake.

On the other hand, maybe all this work won't break mysql_replicated and
I'll just have to update it to the new db backend interface. There may
be non-trivial things to work out though, such as having separate
master-slave pairs for each data shard.

Eric Florenzano

Mar 21, 2009, 1:45:22 PM

to Django developers

> To me replication is a major use-case. I suspect most people who move
> beyond single server setup and beyond 10'000 - 20'000 visitors realize
> that replication should just be in place ensuring performance and
> redundancy. In my experience other multi-DB patterns (those covered
> by `using()` and Meta-attributes on models) are just *less* common in
> practice. So I consider leaving replication to "time permitting" a mistake.

There's a finite amount of time in GSoC. If he says he will
definitely do it, then something else will probably have to be cut to
make time. Everything else, however, is prerequisite to implementing
the actual replication strategies.

There are almost two GSoC projects here, wrapped into one. First
there's the plumbing in Django's core that just needs to happen.
Second there's the actual APIs built on top of that plumbing. The
former needs to happen before the latter, but in implementing the
latter, some changes will almost certainly need to be made to the
former as assumptions are challenged and implementation details get in
the way.

In any case, I think Alex is the one to do this. He's got a +1 from
me (not that it means much now that I'm on Malcolm's special list).
Speaking of Malcolm's special list...

> One suggestion Eric Florenzano had was that we go above and beyond
> just storing the methods and parameters, we don't even execute them
> at all until absolutely necessary.

I wasn't actually making that suggestion, per se. I was just thinking
out loud that _if_ this type of system were implemented, then it would
open the door for some fun computer-sciency things like performing
transitive reductions on the operations performed on the QuerySet.
That being said, I'm not convinced it's the right way to go because of
its significant added complexity, and because it would make poking
around at the Query object more difficult and generally make the Query
object more opaque.

Thanks,
Eric Florenzano

Alex Gaynor

Mar 21, 2009, 2:15:30 PM

to django-d...@googlegroups.com

On Sat, Mar 21, 2009 at 1:45 PM, Eric Florenzano <flo...@gmail.com> wrote:

> > To me replication is a major use-case. I suspect most people who move
> > beyond single server setup and beyond 10'000 - 20'000 visitors realize
> > that replication should just be in place ensuring performance and
> > redundancy. In my experience other multi-DB patterns (those covered
> > by `using()` and Meta-attributes on models) are just *less* common in
> > practice. So I consider leaving replication to "time permitting" a mistake.
>
> There's a finite amount of time in GSoC. If he says he will
> definitely do it, then something else will probably have to be cut to
> make time. Everything else, however, is prerequisite to implementing
> the actual replication strategies.

Whether I implement any form of multiple database scheme is immaterial to whether it is implementable. The point of the DatabaseManager concept is that you can implement your own replication scheme easily.
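
As a rough illustration of what a user-written replication scheme could look like, here is a minimal sketch. The class name `ReplicationManager` and its methods `alias_for_read` and `alias_for_write` are assumptions made up for this example; the actual DatabaseManager interface has not been settled in this thread.

```python
import itertools


class ReplicationManager:
    """Hypothetical manager: writes go to the master, reads rotate over replicas."""

    def __init__(self, master, replicas):
        self.master = master
        # Fall back to the master if no replicas are configured.
        self._replicas = itertools.cycle(list(replicas) or [master])

    def alias_for_read(self):
        # Simple round-robin over the configured replica aliases.
        return next(self._replicas)

    def alias_for_write(self):
        return self.master


manager = ReplicationManager("master", ["replica1", "replica2"])
reads = [manager.alias_for_read() for _ in range(3)]
write = manager.alias_for_write()
```

A sharded setup with one master-slave pair per shard could, under the same assumptions, hold one such object per shard and pick between them first.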

> There are almost two GSoC projects here, wrapped into one. First
> there's the plumbing in Django's core that just needs to happen.
> Second there's the actual APIs built on top of that plumbing. The
> former needs to happen before the latter, but in implementing the
> latter, some changes will almost certainly need to be made to the
> former as assumptions are challenged and implementation details get in
> the way.
>
> In any case, I think Alex is the one to do this. He's got a +1 from
> me (not that it means much now that I'm on Malcolm's special list).
> Speaking of Malcolm's special list...
>
> > One suggestion Eric Florenzano had was that we go above and beyond
> > just storing the methods and parameters, we don't even execute them
> > at all until absolutely necessary.
>
> I wasn't actually making that suggestion, per se. I was just thinking
> out loud that _if_ this type of system were implemented, then it would
> open the door for some fun computer-sciency things like performing
> transitive reductions on the operations performed on the QuerySet.
> That being said, I'm not convinced it's the right way to go because of
> its significant added complexity, and because it would make poking
> around at the Query object more difficult and generally make the Query
> object more opaque.

Sorry if I was putting words in your mouth, I merely meant the idea came from you.

> Thanks,
> Eric Florenzano

Thanks to all,
