[GSOC] Final Multi-DB status Update

12 views
Skip to first unread message

Alex Gaynor

unread,
Aug 18, 2009, 4:24:29 AM8/18/09
to django-d...@googlegroups.com
Hey all,

It seems GSOC has finally come to a close and so I'm giving my final
status update as a part of GSOC (but I'm not going anywhere!). When
we last left off I had just gotten Oracle support working, however
after reviewing with Russ we agreed that the solution was a good bit
too hacky, and the real root of the problem was that the Query class
has 2 functions, one to record information and build a Pythonic
representation of a Query, which is the same for all SQL backends, as
well as to actually generate SQL from this representation, which is
different in the case of Oracle and others. Therefore the solution is
to actually split these up into separate classes, so we can swap out
SQL generators without needing to care about the data collector. In
short that's what I've been working on. Unfortunately this isn't done
at the time of writing (and the end of GSOC), however as I said the
code basically works now, it's just not in a form that would end up
back in Django. But, as I said, I'm not going anywhere. I'm going to
continue to work on this problem, and I'll continue to checkin with
django-developers as design decisions and complications come up.

For now, thanks for all the useful ideas, constructive criticism, and
words of encouragement django-developers has provided as I've worked
on this.

Alex

--
"I disapprove of what you say, but I will defend to the death your
right to say it." -- Voltaire
"The people's good is the highest law." -- Cicero
"Code can always be simpler than you think, but never as simple as you
want" -- Me

JL

unread,
Aug 19, 2009, 6:54:49 PM8/19/09
to Django developers
Hi Alex,

Thanks so much for the effort you've put into this. We've begun using
your code relatively extensively at my work. For what it's worth,
we're an enterprise Java shop that offers a software as a service
product to over 400 customers that are some of the biggest retailers
and manufacturers in the world. Because of multi-db we're now able to
use Django in a useful way to do prototyping, proof of concept work
and internal tools. Hopefully in the future, we can move more toward
Django as well.

Last week over email, you asked me to share my impressions of working
with the APIs with the larger developer audience. The rest of this
email outlines my impressions of working with the API but first a
brief introduction to how we structure our data to provide some
context to the others: Basically we've split our data across multiple
large DB clusters each with the same schema. Our data is then
"sharded" (I use the term loosely) by customer. Essentially, you
select the database cluster to work on by Customer then perform almost
all of your operations against the selected cluster.

Likes of the Multi-db API:

- Definition of the databases in the settings file. Love the way it's
handled in a Python dictionary. Clean elegant and then I can label
each db wtih the actual name of the cluster I'm hitting. I can define
"default" as my local application db which contains my customer-
>cluster mapping and then the actual clusters themselves

- 'using' meta setting for models to set the default database. While
this makes the models less portable and apps a little less re-usable,
it works for my use case. I can leave this setting off for models
that hit my local application db. I've created a second application
that is essentially just a container for models that attach to our
clusters. In this application I set the default 'using' to be our
first cluster. Ultimately though, I question the usefulness of this
setting as it will be hard to use for external applications you may
have imported into your projects without changing their code base.
This will make upgrading any open source projects your using a little
more difficult

- 'using' queryset function. It's elegant and reads well. I do wish
there was something more stateful that I could use (see later)

Dislikes of multi-db API:

- Following foreign key relationships can be tricky and require care
both in the forward and reverse cases. For example, in the forward
case, imagine I have a Customer and I know that customer exists on DB
Cluster 5 (c5 from here on out). I also have a Product and a Category
model where the Product model has a foreign key field pointed at the
Category model and another foreign key field pointed at Customer. I
can query for a product for a given customer doing something like
this:

prd = Product.objects.using('c5').get(customer_ref = mycustomer,
slug=my_product_slug)

Now I might want to go and retrieve the Category that the product
belongs to... maybe to do some output or to find the top n-products or
something to output in a template. My initial gut would tell me to do
something like this:

top_products_in_category = prd.category_ref.get_top_products()

But that would be wrong. Why? Because when I try to look up
prd.category_ref, Django checks against the default cluster instead of
c5 where this customer and all it's data lives and the ORM will
through a DoesNotExist exception. Of course, I can look up my
category and find my top products like this:

top_products_in_category = Category.objects.using('c5').get
(pk=prd.category_ref_id).get_top_products()

but this is clunky and counterintuitive to the way relationship fields
are supposed to work. Where I could see this getting really hairy is
in templates. For example, lets say I'm rendering a product page that
I'd like to include other top products from the category on. I'm
kinda screwed here since I can't dereference the pointer back to the
Category table properly. (Let's not even get into how'd I'd write the
'get_top_products' function at all since that would also have to know
about what cluster the current data set is living on).

This is also a problem (though much less so) when following the
reverse relations. At least here, since I'm working with a queryset,
I can apply a 'using' clause. So imagining I had a category object
and wanted to follow it to products, I could something like this:

prds = my_category.product_ref_set.using('c5').all()

Of course, trying to use this in a template still falls short since I
can't pass a parameter to a function in a template.

I want a Pony:

A great way to fix the above would be to have objects and querysets
'remember' where they originated from and apply it forward to any
requests to related objects or reverse relation lookups. So if I did
this query:

prds = Products.objects.using('c5').all()

Anytime I reference a product in the prds queryset, it knows it came
from c5 so all related look ups aim at that DB. Now when I do things
like:

cat = prds[0].category_ref

Django should know that 'prds' was generated from looking at db c5 and
it should attempt to look up relationships there first. The reverse
API should also be available somehow so I can override where a foriegn
key lookup is done (much like I can already do on reverse look ups
with an extra 'where' clause).

One more pony request: I should be able to spec a 'using' in any of
the short cut functions too ... like get_object_or_404 (you may have
already done this).

Regardless of your current status with the branch, I fully intend to
keep on using this code. Luckily, we're all mysql based here so I
don't have to worry about the custom back end stuff :) Thanks again
for your effort this summer Alex!

Jon.

Alex Gaynor

unread,
Aug 21, 2009, 10:35:04 PM8/21/09
to django-d...@googlegroups.com
Jon thanks for taking the time to give us some of your thoughts on the
API. Hearing how people feel about APIs as they are developed is
always a huge boon.
This idea sort of appeals to me, but I fear there are some edge cases
who's behavior I'd want to figure out first. For example, let's say I
have a model with Meta.using = 'db1' and a 2nd model with Meta.using =
'db2'. When I try to access a foreignkey relation from something on
Model1 to Model2, which DB should it use, the one it came from, or the
default for Model2?

> One more pony request: I should be able to spec a 'using' in any of
> the short cut functions too ... like get_object_or_404 (you may have
> already done this).
>

I'd prefer not to alter get_object_or_404 (or similar helpers),
however don't forget, in addition to Models get_object_or_404 can take
a QuerySet, so you can do

get_object_or_404(MyModel.objects.using('db2'), pk=pk)

And it'll do what you want :)

> Regardless of your current status with the branch, I fully intend to
> keep on using this code.  Luckily, we're all mysql based here so I
> don't have to worry about the custom back end stuff :)  Thanks again
> for your effort this summer Alex!
>

Great to hear!
If anyone has any thoughts on the above "corner case" of that possible
API, or on any part of the idea of remembering where on object comes
from (this would also be true for saving and deleting I think), I'd
love to hear them.

JL

unread,
Aug 23, 2009, 4:21:26 PM8/23/09
to Django developers
So over the weekend I've been thinking about the problem of the two
potential use cases for following relationship fields in Django
models: 1) All (or most) models exist on every db instance and the
data is sharded somehow or 2) Different models exist on different db
instances but all data for a particular model is on that instance.
For these two instances you want different behavior for (1) you want
objects to 'remember' where they came from and to follow relationships
on the originating db (this is my version of the problem) but for (2)
you want relationships followed on the db specified by the
Meta.using. An interesting aside about (2) though is how would you,
as a user, ensure referential integrity across db's at all? Would you
be even able to create the tables using syncdb?

The above aside notwithstanding, I think the way you might want to
handle this would be to follow the pattern of everything being
declarative in Django. Do a quick survey to determine what the
correct default behavior would be based on the majority of users out
there, but also allow users to override the behavior with a field
setting on ForeignKey and ManyToMany fields. It would look something
like this:

class MyModel(models.Model):
fk = models.ForeignKey('someothermodel', follow_origin=True)

Where setting 'follow_origin' to True would cause the behavior of look
ups to remain on the objects originating DB and setting it to False
would cause the lookup to follow the ForiegnKey models Meta.using.
The default value of this parameter would be determined by the results
of your survey.

Of course, doing this solution requires that the objects looked up
need to have a field or method that can retrieve what database they
were originally looked up from. Something like object.origin_db.
Having that look up would potentially be handy regardless if you
started mixing objects looked up across multiple databases in
collections or passing them around as function parameters.

Jon.

Craig Kimerer

unread,
Sep 3, 2009, 7:14:36 PM9/3/09
to django-d...@googlegroups.com
I've spent a little time using this branch and looking at the possibility of using it with my project.  Below is a short list of problems and ponies that I have encountered (or want).

1. It'd be awesome if we could mark certain databases as slaves.  Inserts / deletes / creates / drops would only run on the masters (table creation and deletion specifically).  I can skip the slaves by passing in the databases I want to sync, but I still have the next issue.

2. Only creating tables on the databases specified in 'using'.  It's confusing (to me) connecting to a database, trying to select all my users to find out I am on the wrong db (because the table is empty).  Perhaps tables should only be created on the database they are using.  I don't have a good suggestion for this as if you were sharding data, you would want to create the databases on all tables that this model could potentially live on.  Perhaps using could be be a string or list of connections,

3. I have multiple databases defined (some multiple times).  It would be cool if we could 'ignore' certain databases.  An example, I have 3 MySQL instances running.  MASTER_MAIN_DB, MASTER_OTHER_DB, SLAVE_MAIN_DB.  I want to be able to refer to them all, but also all the contrib apps I am using I want to live on MASTER_MAIN_DB.  So in my settings I have:
DATABASES = {'default': MAINDB_MASTER,
'MASTER_MAIN_DB': MAINDB_MASTER,
'SLAVE_MAIN_DB': MAINDB_SLAVE,
'MASTER_OTHER_DB': OTHERDB_SLAVE
}
Which means that when I run tests, it tries to drop tables on MAINDB_MASTER twice.  Perhaps someone (Alex?) knows of a better way to do this? 

4. I am using ContentTypes, and while running tests, if the default database is not created first, then the tests fail with an exception that the django_content_type table does not exist.  For now I have just hacked it so the default table is created before any of the others.  Perhaps there is a better way to fix this problem than that?


For things like #4, where is the proper place to file a bug about that (if there isn't a bug already)?  Do bugs from Django branches go in the normal tickets filed on djangoproject.com?

Thanks,

Craig

Russell Keith-Magee

unread,
Sep 3, 2009, 8:00:47 PM9/3/09
to django-d...@googlegroups.com
On Fri, Sep 4, 2009 at 7:14 AM, Craig Kimerer<craig....@gmail.com> wrote:
> I've spent a little time using this branch and looking at the possibility of
> using it with my project.  Below is a short list of problems and ponies that
> I have encountered (or want).
>
> 1. It'd be awesome if we could mark certain databases as slaves.  Inserts /
> deletes / creates / drops would only run on the masters (table creation and
> deletion specifically).  I can skip the slaves by passing in the databases I
> want to sync, but I still have the next issue.

So far, the implemented API is pretty low-level - it lets you direct
queries to a particular database, but the practical end-user API for
use cases such as slave/master hasn't been worked on that much. If you
go back to the original specifications, the suggested solution is to
put this in the manager, and so far I can't see any reason why this
won't work.

> 2. Only creating tables on the databases specified in 'using'.  It's
> confusing (to me) connecting to a database, trying to select all my users to
> find out I am on the wrong db (because the table is empty).  Perhaps tables
> should only be created on the database they are using.  I don't have a good
> suggestion for this as if you were sharding data, you would want to create
> the databases on all tables that this model could potentially live on.
> Perhaps using could be be a string or list of connections,

I think this makes the third time I get to tell Alex "I told you so" :-)

I agree that this is a problem. We're still working on a solution. I'm
not sure that the Meta: using approach will be enough - consider the
case where you want contrib.auth (or some app you don't control and
doesn't specify using) to be synchronized to the non-default database.
I'm hoping to sort this out with Alex and some of the core devs during
DjangoCon.

> 3. I have multiple databases defined (some multiple times).  It would be
> cool if we could 'ignore' certain databases.  An example, I have 3 MySQL
> instances running.  MASTER_MAIN_DB, MASTER_OTHER_DB, SLAVE_MAIN_DB.  I want
> to be able to refer to them all, but also all the contrib apps I am using I
> want to live on MASTER_MAIN_DB.  So in my settings I have:
> DATABASES = {'default': MAINDB_MASTER,
> 'MASTER_MAIN_DB': MAINDB_MASTER,
> 'SLAVE_MAIN_DB': MAINDB_SLAVE,
> 'MASTER_OTHER_DB': OTHERDB_SLAVE
> }
> Which means that when I run tests, it tries to drop tables on MAINDB_MASTER
> twice.  Perhaps someone (Alex?) knows of a better way to do this?

Is there any reason (other than clarity) that you want to be able to
explicitly refer to 'default' and 'MASTER_MAIN_DB'? Is there some
reason that it isn't practical to just call 'default' the main-master
database and refer to it as such?

The reason I'm asking is that the duplication you are doing here will
result in you opening two different connections to MAINDB_MASTER. I
can't think of an obvious reason that this would e required. I suspect
we could work around the problem if we had some sort of aliasing in
the DATABASES definition (i.e., set up 'default' as an alias of
'MASTER_MAIN_DB'), but before we add this, I'd like to understand the
use case to see if it is worth the effort (and potential confusion).

> 4. I am using ContentTypes, and while running tests, if the default database
> is not created first, then the tests fail with an exception that the
> django_content_type table does not exist.  For now I have just hacked it so
> the default table is created before any of the others.  Perhaps there is a
> better way to fix this problem than that?

We'll have to look into this. Thanks for the report.

> For things like #4, where is the proper place to file a bug about that (if
> there isn't a bug already)?  Do bugs from Django branches go in the normal
> tickets filed on djangoproject.com?

Yes. There is a soc2009/multidb version identifier in Trac; open your
tickets there and assign them to that version.

Yours,
Russ Magee %-)

Craig Kimerer

unread,
Sep 3, 2009, 8:56:37 PM9/3/09
to django-d...@googlegroups.com
On Thu, Sep 3, 2009 at 5:00 PM, Russell Keith-Magee <freakb...@gmail.com> wrote:

On Fri, Sep 4, 2009 at 7:14 AM, Craig Kimerer<craig....@gmail.com> wrote:
> I've spent a little time using this branch and looking at the possibility of
> using it with my project.  Below is a short list of problems and ponies that
> I have encountered (or want).
>
> 1. It'd be awesome if we could mark certain databases as slaves.  Inserts /
> deletes / creates / drops would only run on the masters (table creation and
> deletion specifically).  I can skip the slaves by passing in the databases I
> want to sync, but I still have the next issue.

So far, the implemented API is pretty low-level - it lets you direct
queries to a particular database, but the practical end-user API for
use cases such as slave/master hasn't been worked on that much. If you
go back to the original specifications, the suggested solution is to
put this in the manager, and so far I can't see any reason why this
won't work.

But if you list them in settings.DATABASES you have the problem below where tables try to get created on the slaves, which leads to errors because tables cannot be created on the slave.  That was my main reason for asking if there was a way to skip them (I should have been more clear). 

Specifying that queries should come from the master or slave will definitely work in the QuerySet though (and I have been able to do that and have had good progress on it).
 

I could do that, it was mainly for clarity.  Everything inside of my app would specify a using of that database.  No other good reason, I can move it over to 'default' and 'default_slave' :). 

It could be potentially useful for a case where I am developing application I_Want_A_Pony and my models were defined as

class SomeModel(models.Model):
...
class Meta:
using = 'Pony_DB'

So users could make them not live on defaults.  Although (I haven't tried) it may be possible to do something like...

I_Want_A_Pony.models.SomeModel.Meta.using = 'Pony_DB'
 

> 4. I am using ContentTypes, and while running tests, if the default database
> is not created first, then the tests fail with an exception that the
> django_content_type table does not exist.  For now I have just hacked it so
> the default table is created before any of the others.  Perhaps there is a
> better way to fix this problem than that?

We'll have to look into this. Thanks for the report.

> For things like #4, where is the proper place to file a bug about that (if
> there isn't a bug already)?  Do bugs from Django branches go in the normal
> tickets filed on djangoproject.com?

Yes. There is a soc2009/multidb version identifier in Trac; open your
tickets there and assign them to that version.

Thanks, I should have inspected the creation of tickets more clearly, I looked at all the open ones and saw nothing open for multi-db.  I will file that.

Yours,
Russ Magee %-)

Thanks for the response,

Craig

Russell Keith-Magee

unread,
Sep 3, 2009, 9:08:32 PM9/3/09
to django-d...@googlegroups.com

At the moment it is true that all tables are created on all databases,
but that won't be true in the final version. This ties in with my
comment on your original point 2 - we need much better ways to
describe what data is on what database. Create/Write/Read access to
that data is part of that specification.

I'm yet to be convinced that `Meta: using` is actually a good thing.
IMHO, it's the very model of a setting that makes it impossible to
re-use your application. The setting will probably survive into the
final version, but I suspect we need a much better mechanism than
`Meta: using` for most common use cases. Again, this comes back to my
comment on your original point 2.

> Thanks for the response,

No problems. Thanks for taking some beta code for a spin and giving us feedback.

Yours,
Russ Magee %-)

JL

unread,
Sep 14, 2009, 12:49:26 PM9/14/09
to Django developers
Russell Said:

"I'm yet to be convinced that `Meta: using` is actually a good thing.
IMHO, it's the very model of a setting that makes it impossible to
re-use your application. The setting will probably survive into the
final version, but I suspect we need a much better mechanism than
`Meta: using` for most common use cases. Again, this comes back to my
comment on your original point 2"

Piling back onto this a little late but I'm definitely a +1 here.
Since writing up my initial feedback I've run into this and similar
problems to what Craig's had and when using the manage.py db
functions. I don't have write access to some of my legacy db's and
when it tries to create tables on syncdb or you're doing things like
resetdb it can fail. Ultimately, I've just accepted it as fact and
just generate the sql and run the sql by hand in the appropriate db's
but I don't think that's going to make for a very long term solution
(and at the very least it'll be something that you get slammed for in
the users group ;)

I definitely agree on the re-use front as well. It'll make
distributing apps harder.

JL.

On Sep 3, 8:08 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> On Fri, Sep 4, 2009 at 8:56 AM, Craig Kimerer<craig.kime...@gmail.com> wrote:
>
> > On Thu, Sep 3, 2009 at 5:00 PM, Russell Keith-Magee <freakboy3...@gmail.com>
> > wrote:
>
> >> On Fri, Sep 4, 2009 at 7:14 AM, Craig Kimerer<craig.kime...@gmail.com>

Alex Gaynor

unread,
Sep 14, 2009, 1:16:16 PM9/14/09
to django-d...@googlegroups.com
FWIW, Russ, Joseph Kocherhans, and I discussed this at the DjangoCon
sprints and our conclusion was to have syncdb only sync a single table
at a time, and to take a --exclude flag (or was it --include?) to
specify what models should be syncd.

Joseph Kocherhans

unread,
Sep 14, 2009, 1:54:16 PM9/14/09
to django-d...@googlegroups.com
On Mon, Sep 14, 2009 at 12:16 PM, Alex Gaynor <alex....@gmail.com> wrote:

FWIW, Russ, Joseph Kocherhans, and I discussed this at the DjangoCon
sprints and our conclusion was to have syncdb only sync a single table
at a time, and to take a --exclude flag (or was it --include?) to
specify what models should be syncd.

Did you mean to say "sync a single db" instead of "sync a single table"? Russ was talking about an --exclude flag at the sprints, but it doesn't really matter either way to me.

Joseph

Alex Gaynor

unread,
Sep 14, 2009, 3:57:13 PM9/14/09
to django-d...@googlegroups.com
Yes :)

Rob Madole

unread,
Sep 29, 2009, 11:48:05 AM9/29/09
to Django developers
Hmm. I just spent some time looking at #11828, and I don't think the
"syncing one db at a time" will work. The first problem this causes
is with anything that subscribes to the post sync signal. Content
type does this, so it can create permissions. If we sync one db at a
time, I don't see how content type can do it's job.

My response in the ticket was to just make sure that default get's
sync'd first, guaranteed. But based on what JL said about resetdb and
running the SQL manually it sounds like the roots go down a little
deeper. I disagree with myself now.

What if syncdb did this in two phases. It goes through every db
connection and creates the appropriate tables. Then do everything
else.

Oh and I certainly think that tables should only exist for the db's
they apply to. I had a yuck moment the first time I opened up the db
and saw those empty tables. This one is going to be a hard sale on
our dev team. I'm anticipating the argument I'm going to have with
the DBA'ish fella on the team.

Rob

On Sep 14, 2:57 pm, Alex Gaynor <alex.gay...@gmail.com> wrote:
> On Mon, Sep 14, 2009 at 1:54 PM, Joseph Kocherhans
>
>
>
>
>
> <jkocherh...@gmail.com> wrote:

Alex Gaynor

unread,
Sep 29, 2009, 11:57:36 AM9/29/09
to django-d...@googlegroups.com
It's worth noting that even if our end solution is to sync all DBs to
all tables you would be able to DROP the tables and not worry about
them. Not saying we should go with that solution (in fact I really
think we shouldn't), but the extraneous table creation isn't a huge
deal in my view.
Reply all
Reply to author
Forward
0 new messages