While my vote may mean little, Alex has certainly been active and
has posted quality code on the mailing list. MultiDB has also been a
frequent issue on the mailing list, so Alex gets my +1.
I'd hope to see "multiple databases" defined a little more
clearly as discussed in this thread[1]. Whether the SoC project
addresses *all* of the facets (wow, lots of work!) or just selects
certain issues, I'd like to see them addressed in the proposal
("addressing federation and load-balancing, but not sharding") to
show that they're being considered during the implementation.
From what I gather in the description, Alex is only proposing
load-balancing.
Depending on which definitions of multidb you plan to address, it
also impacts areas such as aggregation (performing
count/summation over shards requires extra consideration) and
cross-database joining. In the above thread, Malcolm also raises
the issue of read/write consistency when doing load-balancing.
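To make the aggregation point concrete, here is a minimal sketch in plain Python (no Django involved; the shard contents and helper names are made up for illustration). Counts and sums can be combined from per-shard results, but an average cannot be obtained by averaging the per-shard averages:

```python
# Hypothetical data: each "shard" holds a slice of the same logical table.
shards = {
    "shard_a": [10, 20, 30],
    "shard_b": [40],
}


def sharded_count(shards):
    # COUNT distributes over shards: add up the per-shard counts.
    return sum(len(rows) for rows in shards.values())


def sharded_sum(shards):
    # SUM distributes the same way.
    return sum(sum(rows) for rows in shards.values())


def sharded_avg(shards):
    # AVG does not distribute: combine the sums and counts first.
    return sharded_sum(shards) / sharded_count(shards)


def naive_avg(shards):
    # Averaging the per-shard averages weights every shard equally,
    # which is wrong when shards differ in size.
    per_shard = [sum(rows) / len(rows) for rows in shards.values()]
    return sum(per_shard) / len(per_shard)
```

With the data above, the correct average and the naive one disagree, which is the kind of extra consideration a sharding-aware proposal would have to spell out.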
-tim
[1] http://groups.google.com/group/django-users/browse_thread/thread/663046559fd0f9c1/
That's one word for it. This stuff will take a lot longer than you think
because testing it along the way will take longer than you think, for a
start. There are lots of traps that we don't even know exist yet.
Fortunately, step 3 won't be needed (I think things would really be
broken if it is), but aiming to get to about where you are at the end of
step 6 by the end of the period feels, in my gut, about right.
This isn't an area where the first implementation will be the right one.
There will be places where you'll want to go back a few days later and
try again. If not, your mentor, whoever that is, will probably suggest
you do. There will be, no doubt, bits of refactoring of the existing
internals required to incorporate some of this and those changes will
need careful testing and consideration (and, often, a couple of passes).
This isn't even intuition on my part any more. I've done enough coding
in that area to know it's a huge, complex beast and first attempts can
always be improved upon.
Even the documentation and tests are things you'll want to come back to
a few times, because writing them out usually triggers (or should
trigger) doubts about the implementation or API, which leads to a bunch of
changes, which leads to the documentation now needing more updating,
etc. Good that you've allowed 3 weeks there, but I suspect it's 3 weeks
spread over a much longer period on the calendar, rather than all
together.
By all means have a few things that you can get to if there's time. And
estimation is one of those impossible things to get right. Aim to get up
to some point (probably end of step 6 or partway through step 7) and
then have stuff to do if it all goes well.
Something that hasn't really been done in past years, though, is
students using the extra time to really improve their code. If you have
a week or two at the end, use it to fix up all the things that could be
better in the first ten weeks' work.
All that being said, your schedule looks roughly reasonable. Estimation
is very difficult, even for people who've been in the industry for
years, so we're not going to expect miracles on the SoC front. Part of
the experience is learning the truth of that.
> and there are about 2 weeks to spare,
> so those can be used as necessary, or for part #8.
>
> Hurdles
> -------
>
> The following is a list of possible technical issues:
>
> * In ``django.db.models.sql.query.Query``, are any tests done on what
>   the connection is before the actual SQL construction phase?
I've discussed this one below, under "solutions".
> If so, these need to be changed not to do this, since the connection
> might change at some point after that test. If this can't be done,
> then ``using()`` needs to be the first method called on a ``QuerySet``,
> or at a minimum called before any methods that do such testing.
> Further, if these tests can't be put off, then the only option is a
> callback that's called right when the first ``Query`` object is
> constructed. This means Django won't know what type of query it would
> be, rendering the ``DatabaseManager`` impossible.
> * Will models need to know which database they came from so that they
>   can be saved back correctly?
Be worth doing. Storing the connection name used to retrieve the model
wouldn't be hard.
> * Does ``Model.save()`` need to take a ``using`` parameter so new
>   objects can be created on a specific database or saved to a new
>   database?
Almost certainly. There has to be some way to specify it. A parameter to
save() seems natural enough.
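A rough sketch of how the two answers above could fit together, using toy stand-ins rather than Django code (the ``Record`` class, the ``DATABASES`` dict and the ``_db`` attribute are all hypothetical): the object remembers the alias it was loaded from, and ``save()`` accepts an optional ``using`` override.

```python
# Toy stand-ins, not Django code: "databases" are plain dicts keyed by alias.
DATABASES = {"default": {}, "archive": {}}


class Record:
    def __init__(self, pk, data, db=None):
        self.pk = pk
        self.data = data
        # Alias of the database this object was loaded from, if any.
        self._db = db

    @classmethod
    def load(cls, pk, using="default"):
        # Remember where the row came from so it can be saved back there.
        return cls(pk, DATABASES[using][pk], db=using)

    def save(self, using=None):
        # An explicit argument wins, then the originating database,
        # then the default.
        alias = using or self._db or "default"
        DATABASES[alias][self.pk] = self.data
        self._db = alias
        return alias
```

Whether an explicit ``using`` should also change the remembered database, as it does here, is a design choice this sketch does not settle.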
> * Backends that use custom query classes: will we need a ``from_query``
>   classmethod to transform them? This would require all backends to
>   store and use information that is basically less than or equal to
>   what the ``Query`` object stores. Also, there needs to be the reverse,
>   a way to go from a custom ``Query`` object back to either the Django
>   default or some other custom ``Query`` object.
Can you explain this some more? It sounds like a horrible artifact. I'm
not sure calling the base SQL Query method the super-wrapper of
everything is a good design goal.
> * Foreign keys will basically be handled en passant because of how
>   they are implemented, but many-to-many fields will require more
>   thought, especially since that SQL isn't in the ``Query`` class.
I'd be happy saying that if you tried to cross a database boundary, it's
an error. It's up to you (the user) to know enough about your data
modeling setup to avoid this. But see what's possible there. That's
certainly an area that can be an error initially and added on later on,
since nothing in the error path will preclude making it not an
error.
> *
>
> Solutions
> ---------
>
> The greatest hurdle is changing the connection after we already have
> our ``Query`` partly created. The issues here are that: we might have
> done tests against ``connection.features`` already, we might need to
> switch either to or from a custom ``Query`` object, amongst other
> issues.
There's possibly some intrusion here, but not a lot. Creating some
delayed evaluation structures to hold various pieces is probably a
reasonable approach for most cases. For example, all uses of quote_name
(and friends) are restricted to the SQL construction phase of things
(as_sql() and things it calls).
Also, I wouldn't be unhappy with a restriction that you can't change the
type of database you are directing to once the queryset construction has
started. The common use-cases here are having a pool of databases all
running on the same type of server. Or read-only and read-write
databases running on the same server, etc. So being able to change the
target of the query late in the process is useful enough to be basically
compulsory -- so that you can pass around a queryset and, much later,
decide whether to read from master or slave, for example -- but being
able to switch to some entirely arbitrary other storage system late in
the game could be something we prohibit. That's probably not too onerous
a restriction.
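As a rough illustration of that restriction (a toy sketch, not Django code; ``LazyQuerySet``, the ``CONNECTIONS`` registry and its aliases are invented for the example), a queryset could allow its target to be changed at any time, but only among connections of the same backend type:

```python
# Hypothetical connection registry: alias -> backend type.
CONNECTIONS = {
    "master": "postgresql",
    "replica": "postgresql",
    "legacy": "sqlite",
}


class LazyQuerySet:
    def __init__(self, using="master", filters=()):
        self._using = using
        self._filters = tuple(filters)

    def filter(self, **kwargs):
        # Each call returns an independent clone, as querysets do.
        new_filters = self._filters + tuple(sorted(kwargs.items()))
        return LazyQuerySet(self._using, new_filters)

    def using(self, alias):
        # Retargeting is allowed late, but only within one backend type.
        current = CONNECTIONS[self._using]
        target = CONNECTIONS[alias]
        if target != current:
            raise ValueError(
                "cannot switch from %s to %s" % (current, target)
            )
        return LazyQuerySet(alias, self._filters)
```

Passing the queryset around and calling ``using("replica")`` much later works; calling ``using("legacy")`` raises an error, because the backend type differs.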
> One possible solution that is very powerful (though quite inelegant)
> is to have the ``QuerySet`` keep track of all public API method calls
> against it and what parameters they took. Then, when the ``connection``
> is changed, it will recreate the ``Query`` object by creating a "blank"
> one with the new connection and reapplying all the methods it has
> stored. This is basically a simple implementation of the command
> pattern.
It's pretty yukky. There's a lot of Python level junk that we
intentionally avoid storing in querysets so that they behave properly as
persistent data structures (clones are independent copies) and can be
pickled without trouble, etc. It would be really bad for performance to
reintroduce those (I did a lot of profiling when developing that stuff
and tried to throw out as much as possible). I think this fortunately
isn't going to be a real issue. I was pretty careful originally to keep
the leakage from django.db.connection into the Query class to as few
places as possible and mostly when we're creating the SQL.
Some cases that might be unavoidable could be replaced with delayed
evaluation objects (essentially encapsulating the command pattern just
for that fragment), which is a bit cleaner.
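One way to picture such a delayed evaluation object, as a sketch only (``DeferredQuoteName`` and the two quoting functions are hypothetical stand-ins, not existing Django classes): the fragment stores a plain string and is resolved against whichever backend's quoting function is in effect when the SQL is built.

```python
class DeferredQuoteName:
    """Holds an identifier and quotes it only once a backend is known."""

    def __init__(self, name):
        # Only a plain string is stored, so nothing backend-specific
        # leaks into the query before SQL construction.
        self.name = name

    def resolve(self, quote_name):
        return quote_name(self.name)


# Hypothetical quoting functions standing in for two different backends.
def quote_with_double_quotes(name):
    return '"%s"' % name


def quote_with_backticks(name):
    return "`%s`" % name


def as_sql(table, quote_name):
    # Quoting happens here, in the SQL construction phase.
    return "SELECT * FROM %s" % table.resolve(quote_name)
```

The same ``DeferredQuoteName("auth_user")`` can then be rendered for either backend, depending on which quoting function ``as_sql()`` is handed.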
> I'm here soliciting feedback on both the API, and any potential
> hurdles I may
> have missed.
Did you go back through the thread from just after DjangoCon last year
and make sure you've covered or intentionally omitted all the use-cases
mentioned there? Would be good to see some discussion of things that
aren't going to be possible with this. I know I rounded up a few people
I'd done consulting for or chatted to about this problem who posted
there (and don't normally post), so it's quite full of industry-backed
experiences. I think you've hit most of the points, but it would be
worth confirming that.
Regards,
Malcolm
On Fri, 2009-03-20 at 23:41 -0400, Alex Gaynor wrote:
>
>
> On Fri, Mar 20, 2009 at 11:21 PM, Malcolm Tredinnick
> <mal...@pointy-stick.com> wrote:
>
>
> On Fri, 2009-03-20 at 09:45 -0400, Alex Gaynor wrote:
> > Hello all,
[...]
> > The greatest hurdle is changing the connection after we already have
> > our ``Query`` partly created. The issues here are that: we might have
> > done tests against ``connection.features`` already, we might need to
> > switch either to or from a custom ``Query`` object, amongst other
> > issues.
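[...]
> One suggestion Eric Florenzano had was that we go above and beyond
> just storing the methods and parameters, we don't even execute them
> at all until absolutely necessary.
Excuse me for a moment whilst I add Eric to a special list I've been
keeping. He's trying to make trouble.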
Ok, back now... There are at least two problems with this.
(a) Backwards incompatible in that some querysets would return
noticeably different results before and after that change. It would be
subtle, quiet and very difficult to detect without auditing every line
of code that contributes to a queryset. The worst kind of change for us
to make from the perspective of the users.
(b) Intentionally not done right now and not because I'm whimsical and
arbitrary (although I am). The problem is it requires storing all sorts
of arbitrarily complex Python objects. Which breaks pickling, which
breaks caching. People tend to complain, a lot, about that last bit.
That's why the Where.add() converts things to more basic types when they
are added (via a filter() command). If somebody really needs lazily
evaluated parameters, it's easy enough via a custom Q-like object, but
so far nobody has asked for that if they've gotten stuck doing it. It's
even something we could consider adding to Django, although it's not a
no-brainer given the potential to break caching.
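The pickling problem in (b) can be seen with nothing but the standard library. This is only an illustration: the filter dictionaries below are made up and are not how Django stores lookups.

```python
import pickle


def can_pickle(obj):
    # Report whether pickle can serialise the object at all.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A filter value converted to a basic type when it was added.
eager_filter = {"created__lte": "2009-03-20"}

# The same filter holding a callable to be evaluated later.
lazy_filter = {"created__lte": lambda: "2009-03-20"}
```

Here ``can_pickle(eager_filter)`` succeeds while ``can_pickle(lazy_filter)`` does not, and anything that cannot be pickled cannot go into a pickle-based cache.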
[...]
>
> Thanks for all the review Malcolm.
No problems.
> One question that I didn't really ask in the initial post is what
> parameters a "DatabaseManager" should receive on its methods. One
> suggestion is the Query object, since that gives the user the maximal
> amount of information; however, my concerns there are that it's not a
> public API, and having a private API as a part of the public API feels
> klunky.
At first glance, I believe the word you're looking for is "wrong". :-)
Definitely a valid concern.
> OTOH there isn't really another data structure that carries around
> the information that someone writing their sharding logic (or whatever
> other scheme they want to implement) will inevitably want to have.
Two solutions spring to mind, although I haven't thought this through a
lot: it's not particularly germane to the proposal since it's something
we can work out a bit later on. I've got limited time today (something
about a beta release coming up), so I wanted to just get out responses
to the two people who posted items for discussion. I suspect there's a
lot of thinking needed here about the concept as a whole and I want to
do that. Anyway...
One option is to use the piece of public API that is available which
will always be carrying around a Query object: the QuerySet. Query
objects don't exist in isolation. However, this sounds problematic
because the implementation is going to be working at a very low-level --
database managers are only really interesting to Query.as_sql() and its
dependencies. But that leads to the next idea, ...
The other is to work out a better place for this database manager in the
hierarchy. It might be something that lives as an attribute on a
QuerySet. Something like the user provides a function that picks the
database based on "some information" that is available to it and the base
method selects the right database to use. Since it lives in the QuerySet
namespace, it can happily access the "query" attribute there without any
encapsulation violations. The database manager then becomes two pieces,
an algorithm on QuerySet (that might just dispatch to the real algorithm
on Query), plus some user-supplied code to make the right selections.
That latter thing could be a callable object if you need the full class
structure. But the stuff QuerySet/Query needs to know about is probably
a much smaller interface than *requiring* a full class. (Did any of that
make sense?)
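If it helps, here is one possible reading of that two-piece split, as a sketch under stated assumptions (the classes are toy stand-ins, and ``db_chooser``, ``_select_db`` and ``read_write_split`` are invented names, not a proposed API): the worker routine lives on the queryset and hands a small set of facts to a user-supplied callable.

```python
class Query:
    # Stand-in for the internal query object; only the queryset touches it.
    def __init__(self, model_name, is_write=False):
        self.model_name = model_name
        self.is_write = is_write


class QuerySet:
    def __init__(self, query, db_chooser=None):
        self.query = query
        self.db_chooser = db_chooser

    def _select_db(self):
        # The worker routine lives on the queryset. It passes the callback
        # a small, explicit set of facts rather than the Query itself.
        if self.db_chooser is None:
            return "default"
        return self.db_chooser(
            model_name=self.query.model_name,
            is_write=self.query.is_write,
        )


def read_write_split(model_name, is_write):
    # User-supplied policy: writes go to the master, reads to a replica.
    return "master" if is_write else "replica"
```

The user code never sees the ``Query`` object, so it stays tied to the idea of choosing storage rather than to one query implementation.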
I think this -- the database manager concept -- is the part of your
proposal that is most up in the air with respect to what the API looks
like. Which is fine. The fact that it's something to consider is good
enough to know. Certainly put some thought into the problem, but don't
sweat the details too much just yet (in the application period). This is
one of those hard areas where you probably do need to think about it so
much it costs you sleep, you forget to eat and so on.
Regards,
Malcolm
It's related to eager/deferred argument evaluation (which is done for
the same reasons): any "smart" object like Q objects would require
changing to handle deferring things correctly. They can currently be
designed to evaluate only once and will work correctly.
You wrote a really long sentence there that didn't make a lot of sense
(too many prepositions and commas, not enough nouns and full stops).
Unclear which restriction you're arguing against, but the picklability
of querysets is pretty much a requirement. It's something people really
use.
However, before we go too far down this path: this is a very minor
thing. It's unlikely to be required. Adding it "because we can" is an
argument Eric can propose at some much later date if it's not absolutely
*required* for multi-db stuff. I think we won't need to worry about this
at all.
That's never been argued against.
> My concern with just passing a QuerySet is it doesn't really hold any
> information, if I want to say shard on the id then I need to poke at
> the Query(the same for any information about the query other than the
> type which we already know from the method),
Hmm ... maybe. I think you might have the dependency directions reversed
here. Think a bit more about what I wrote with regard to providing some
methods to make the choice. QuerySet/Query could provide the worker
routine which passes necessary information to a callback that is
provided by the user, for example. That's why the design requires
thinking here: there are at least two directions the control could flow
and I suspect you're getting into difficulties from the direction you're
currently approaching it (with DatabaseManager controlling the show and
doing all the work).
If DatabaseManager has to poke at Query, we've probably lost, because
then it's tied to that Query class, not to the concept of storage
management selection (which should work with any type of Query object
and even general QuerySets).
Don't try to solve this now. The concept of this type of utility is a
good one. But it's a problem that requires thinking. So think up a dozen
alternatives and filter them down to two or three.
> and if we always need to actually touch the Query then passing the
> QuerySet is a bit of an end run around.
No. It's encapsulation. You're passing in the public object and only use
methods on the public object.
>
> The nice thing about the DatabaseManager concept (as I've conceived it)
> is that it can be implemented entirely separately and after the rest
> of the API.
The concept isn't dependent on the implementation. It can be added later
whether it's a separate class or a method on Querysets. It's a utility
feature, pretty much by definition, so however it's implemented, it can
be added later (just like multi-db support could always be added later
to the ORM, however it was implemented).
Regards,
Malcolm
I'm not very excited about this point.
To me replication is a major use-case. I suspect most people who move
beyond a single-server setup and beyond 10'000 - 20'000 visitors realize
that replication should just be in place to ensure performance and
redundancy. In my experience other multi-DB patterns (those that are
covered by `using()` and Meta attributes on models) are just *less*
common in practice. So I consider leaving replication to "time
permitting" a mistake.
On the other hand, maybe all this work won't break mysql_replicated and
I'll just have to update it to the new db backend interface. There may
be non-trivial things to work out, though, such as having separate
master-slave pairs for each data shard.
> To me replication is a major use-case. I suspect most people who move
> beyond a single-server setup and beyond 10'000 - 20'000 visitors realize
> that replication should just be in place to ensure performance and
> redundancy. In my experience other multi-DB patterns (those that are
> covered by `using()` and Meta attributes on models) are just *less*
> common in practice. So I consider leaving replication to "time
> permitting" a mistake.
There's a finite amount of time in GSoC. If he says he will
definitely do it, then something else will probably have to be cut to
make time. Everything else, however, is prerequisite to implementing
the actual replication strategies.
There are almost two GSoC projects here, wrapped into one. First
there's the plumbing in Django's core that just needs to happen.
Second there's the actual APIs built on top of that plumbing. The
former needs to happen before the latter, but in implementing the
latter, some changes will almost certainly need to be made to the
former as assumptions are challenged and implementation details get in
the way.
In any case, I think Alex is the one to do this. He's got a +1 from
me (not that it means much now that I'm on Malcolm's special list).
Speaking of Malcolm's special list...
> One suggestion Eric Florenzano had was that we go above and beyond
> just storing the methods and parameters, we don't even execute them
> at all until absolutely necessary.
I wasn't actually making that suggestion, per se. I was just thinking
out loud that _if_ this type of system were implemented, then it would
open the door for some fun computer-sciency things like performing
transitive reductions on the operations performed on the QuerySet.
That being said, I'm not convinced it's the right way to go because of
its significant added complexity, and because it would make poking
around at the Query object more difficult and generally make the Query
object more opaque.
Thanks,
Eric Florenzano