Switching from soc2009/multidb to trunk; cross db foreign keys

CB

unread,

Dec 29, 2009, 5:13:35 PM12/29/09

to Django users

Hello.

I've been developing on the multdb branch for awhile now (most recent
version just before the dropping of the 'using' meta attribute), and
everything has been going well. Recently though, with multidb landing
in the trunk, I wanted to switch over to the trunk.

I have a 'partitioning' scheme somewhat like this:

----db_extra
class Building(models.Model):
area = models.ForeignKey('area')
...etc

----default
class Area(models.Model):
...etc

And in db_extra, i've defined a custom manager that overrides
get_query_set(), returning .using('db_extra') as per the svn docs.

Now, my admin changelist includes the 'area' foreign key field on
Building. In the multidb branch, all was fine - the admin 'correctly'
retrived the foreign key id number from the db_extra db, and looked it
up locally in default. In the trunk now, I get a TemplateSyntaxError
indicating that the specified Area does not exist.

Clearly, it is now looking to the db_extra db for the related area
instead of looking to default.

Am I doing this correctly? How can I inform django to lookup the
foreign key in this db?

Thanks for any help.

-CB

Russell Keith-Magee

unread,

Dec 30, 2009, 7:21:52 AM12/30/09

to django...@googlegroups.com

On Wed, Dec 30, 2009 at 6:13 AM, CB <cbw...@gmail.com> wrote:
> Hello.
>
> I've been developing on the multdb branch for awhile now (most recent
> version just before the dropping of the 'using' meta attribute), and
> everything has been going well. Recently though, with multidb landing
> in the trunk, I wanted to switch over to the trunk.
>
> I have a 'partitioning' scheme somewhat like this:
>
> ----db_extra
> class Building(models.Model):
> area = models.ForeignKey('area')
> ...etc
>
> ----default
> class Area(models.Model):
> ...etc
>
>
> And in db_extra, i've defined a custom manager that overrides
> get_query_set(), returning .using('db_extra') as per the svn docs.

...

> Am I doing this correctly? How can I inform django to lookup the
> foreign key in this db?

In short, you can't (or shouldn't).

What you're asking the database to do is keep a foreign key value that
is not valid on the database on which it is stored. On databases like
Postgres that have referential integrity, this approach will cause
referential integrity errors.

The multi-db branch had checks for this sort of referential integrity
until just before the merge; however, these checks were relaxed just
prior to commit in order to allow for multi-db to support master-slave
partitioning - in these situations, a model will appear to be on a
different database alias, but won't be a referential integrity failure
since it is actually the same database.

The intention is to restore these referential integrity checks at some
point in the future. We may also need to look at ways of storing
cross-database foreign keys to accommodate the sort of use case that
you describe.

Yours,
Russ Magee %-)

CB

unread,

Dec 30, 2009, 11:25:30 PM12/30/09

to Django users

> In short, you can't (or shouldn't).
>
> What you're asking the database to do is keep a foreign key value that
> is not valid on the database on which it is stored.

Yes, this is true.

> The multi-db branch had checks for this sort of referential integrity
> until just before the merge; however, these checks were relaxed just
> prior to commit in order to allow for multi-db to support master-slave
> partitioning - in these situations, a model will appear to be on a
> different database alias, but won't be a referential integrity failure
> since it is actually the same database.

Yeah, i watched these checks go up and then drop out.

> We may also need to look at ways of storing
> cross-database foreign keys to accommodate the sort of use case that
> you describe.

What about creating a new set of ForeignKey fields that don't actually
do any real foreign key checking? Maybe CrossDBForeignKey, or
RelaxedKey?

I understand if this is off the timetable for 1.2 (as I can see the
dev group is getting ready for freeze, etc) but if I wanted to develop
this route, do you have any pointers? Might I be able to simply
subclass ForeignKey (or one of it's bases, etc), and more or less lie
about the constraint being fulfilled, tell the DB that this is just an
int (like IntegerField), and then convince the Query internals to ask
for my data seperately (making it not eligible for select_related, for
example)?

Either way, i'm going to look into this use case myself and try to
make it work (in a hopefully not /too/ hacky way) . If you or anyone
else has any pointers, please let me know.

Thanks,
-CB

Russell Keith-Magee

unread,

Dec 31, 2009, 12:21:06 AM12/31/09

to django...@googlegroups.com

On Thu, Dec 31, 2009 at 12:25 PM, CB <cbw...@gmail.com> wrote:
>> In short, you can't (or shouldn't).
>>
>> What you're asking the database to do is keep a foreign key value that
>> is not valid on the database on which it is stored.
>
> Yes, this is true.
>
>
>> The multi-db branch had checks for this sort of referential integrity
>> until just before the merge; however, these checks were relaxed just
>> prior to commit in order to allow for multi-db to support master-slave
>> partitioning - in these situations, a model will appear to be on a
>> different database alias, but won't be a referential integrity failure
>> since it is actually the same database.
>
> Yeah, i watched these checks go up and then drop out.
>
>> We may also need to look at ways of storing
>> cross-database foreign keys to accommodate the sort of use case that
>> you describe.
>
> What about creating a new set of ForeignKey fields that don't actually
> do any real foreign key checking? Maybe CrossDBForeignKey, or
> RelaxedKey?

I suspect something like this will ultimately form part of the
solution. There are potentially two types of 'relaxed' FK required -

* A FK when you know that the source data always comes from another
specific database (e.g. User objects are only on the 'auth' database)

* A FK that could point to one of many databases (e.g., Users are
spread across multiple databases based on some sharding strategy).

Analogous fields will need to exist for M2M fields.

> I understand if this is off the timetable for 1.2 (as I can see the
> dev group is getting ready for freeze, etc) but if I wanted to develop
> this route, do you have any pointers? Might I be able to simply
> subclass ForeignKey (or one of it's bases, etc), and more or less lie
> about the constraint being fulfilled, tell the DB that this is just an
> int (like IntegerField), and then convince the Query internals to ask
> for my data seperately (making it not eligible for select_related, for
> example)?

I haven't tried this myself, but subclassing ForeignKey certainly
seems like a reasonable starting point. One complication will be the
SQL that is generated; essentially, you need the field to be a FK, but
*not* generate field constraints. It *might* be easier to start by
subclassing IntegerField, and then introduce an FK-compatible API that
accommodates the remote data source.

> Either way, i'm going to look into this use case myself and try to
> make it work (in a hopefully not /too/ hacky way) . If you or anyone
> else has any pointers, please let me know.

At this point, you have pretty much all the pointers that I personally
have to offer. I'm certainly interested to hear from anyone else on
this topic.

If you manage to get this working, it would be helpful to hear about
any problems you experience. If we're going to put these user-facing
features into trunk in the 1.3 timeframe, it would be helpful to have
pointers/hints/experiences from people who have actually used multi-db
in real-world situations.

To that end - if you find that there is a need to modify the core in
some subtle way (for example, adding a flag/option to disable
select_related), feel free to suggest it. Depending on the proposal,
it might not even be out-of-bounds for 1.2. If you can make the case
for a generally transparent change to API that will make multi-db
practical for a particular use case, it might be covered as a bug fix,
and therefore be an acceptable checkin right up until the RC freeze.

Yours,
Russ Magee %-)

CB

unread,

Dec 31, 2009, 4:26:51 PM12/31/09

to Django users

So I did some basic poking around with managers, etc.

I noticed what 'may be' a 3 part solution to keep a model on one and
only one db:

First, we should modify the initial queryset to use our preferred db.
We can do this by adding a custom manager, with an overridden
get_query_set():

class PartionedManager(models.Manager):
use_for_related_fields = True #Forces this manager to be used
during related lookups
def __init__(self, *args, **kwargs):
self.forced_using = kwargs.pop('using', 'default')
super(PartionedManager, self).__init__(*args, **kwargs)
def get_query_set(self):
return super(PartionedManager, self).get_query_set().using
(self.forced_using)

However, this doesn't solve the ability to .save() into another db, so
lets create an abstract model our real stuff can inherit from:

class PartitionedModel(models.Model):
objects = PartitionedManager(using='aux')
def save(self, *args, **kwargs):
kwargs['using'] = 'adsales' #Makes sure instances are saved in
the proper place
super(AdSalesModel, self).save(*args, **kwargs)
class Meta:
abstract = True

class ConcreteModel(PartitionedModel):
name = models.CharField(max_length=255)

class SomeModelOnTheMainDB(models.Model):
name = models.CharField(max_length=255)
concrete = models.ForeignKey(ConcreteModel)

So far so good. However, relations still throw an error that models
can't be found. The reason for this is:

http://code.djangoproject.com/browser/django/trunk/django/db/models/fields/related.py#L262

The related code instantiates the manager, and then immediately
changes it's db with a .using() clause to wherever the original model
came from (e.g., another db). Since we want to force this, we have two
options:

1) get a django patch that can find another class attribute on the
manager and optionally not apply that using(), or apply a forced using
(), or
2) hack it!:

class PartionedManager(models.Manager):
use_for_related_fields = True #Forces this manager to be used
during related lookups
def __init__(self, *args, **kwargs):
self.forced_using = kwargs.pop('using', 'default')
super(PartionedManager, self).__init__(*args, **kwargs)
def get_query_set(self):
qs = super(PartionedManager, self).get_query_set().using
(self.forced_using)
qs._real_using = qs.using
qs.using = lambda new_using: qs._real_using(qs._db) #Disable
changing db with .using() - required because related manager tries to
set .using after we return the qs
return qs

class PartitionedModel(models.Model):
objects = PartitionedManager(using='aux')
def save(self, *args, **kwargs):
kwargs['using'] = 'adsales' #Makes sure instances are saved in
the proper place
super(AdSalesModel, self).save(*args, **kwargs)
class Meta:
abstract = True

class ConcreteModel(PartitionedModel):
name = models.CharField(max_length=255)

ConcreteModel here is now tied to a db. I'm testing this code in
production, and was wondering what everyone thought?

CB

unread,

Dec 31, 2009, 4:30:14 PM12/31/09

to Django users

*Oops, sorry about the refrences to 'AdSalesModel' and setting
using='adsales' - you can see where I'm using these ideas :)

Change them to PartitionedModel and self.objects.forced_using
respectively.

Alex_Gaynor

unread,

Dec 31, 2009, 4:37:55 PM12/31/09

to Django users

Hrm, it's clear that the current implementation is probably overly
aggressive in changing the DB, there are plenty of cases where a
related model would be on a different database. That being said in
your case it might make sense for using() to be a no-op, since by
definition your manager is supposed to work with a specific DB.

Alex

CB

unread,

Dec 31, 2009, 4:46:29 PM12/31/09

to Django users

> Hrm, it's clear that the current implementation is probably overly
> aggressive in changing the DB

I definitely agree here. It just seems like a useful working solution
for 'now', clearly somethign a bit more elegant would be needed to
make it into Django's core.

> That being said in
> your case it might make sense for using() to be a no-op, since by
> definition your manager is supposed to work with a specific DB.

Yes, and I think this is a somewhat common use case. If we could
convince the related managers not to call .using() on the returned
queryset (perhaps by means of a class attribute on the manager just
like use_for_related_fields), we can drop the override of the .using()
method on the queryset.

Do you think this flag addition is small enough for a bugfix / feature
in the 1.2 timeframe?

CB

unread,

Jan 2, 2010, 4:22:41 PM1/2/10

to Django users

Putting the following code somewhere after settings.py is imported (a
new app called 'startup' in __init__.py, with 'startup' added first in
INSTALLED_APPS is a good place), will allow you to lock models to
databases via a manager's forced_using attribute. This definitely not
'the best way', as monkey patching is a little nasty. However, I'm not
sure this would be accepted as a patch (when formulated as such), and
this works for me, so i'm just sharing it for whoever needs it.

I haven't seen any bugs, so if you use it and find any let me know.

Thanks again to Alex for making the multidb branch.

-CB

------------------------------------

from django.db import models

def hijack_for_partitioning():
"""This class enables you to add a forced_using = 'db_name' to a
manager,
which will lock queries and saves for that model to that
particular db.

In addition, if this is /not/ specified on a custom manager, it
will lock every model to the 'default' db.
"""
_real_save = models.Model.save
def save(self, *args, **kwargs):
kwargs['using'] = type(self).objects.forced_using #Makes sure

instances are saved in the proper place

_real_save(self, *args, **kwargs)
models.Model.save = save

_real_get_query_set = models.Manager.get_query_set
def get_query_set(self):
qs = _real_get_query_set(self)
if hasattr(self, 'forced_using'):
qs = qs.using(self.forced_using)

qs._real_using = qs.using
qs.using = lambda new_using: qs._real_using(qs._db)
#Disable changing db with .using() - required because related manager
tries to set .using after we return the qs
return qs

models.Manager.get_query_set = get_query_set

models.Manager.forced_using = 'default'
models.Manager.use_for_related_fields = True

hijack_for_partitioning()

Russell Keith-Magee

unread,

Jan 2, 2010, 11:03:33 PM1/2/10

to django...@googlegroups.com

On Fri, Jan 1, 2010 at 5:46 AM, CB <cbw...@gmail.com> wrote:

Apologies for taking so long to get back to you on this - the silly
season has been consuming a lot of my time.

> Yes, and I think this is a somewhat common use case. If we could
> convince the related managers not to call .using() on the returned
> queryset (perhaps by means of a class attribute on the manager just
> like use_for_related_fields), we can drop the override of the .using()
> method on the queryset.

I'd go one step further than an attribute - I'd provide a method. The
default implementation would essentially be:

def db_from_related_object(self, instance):
return instance._state.db

The related manager would then use the following:

using = rel_mgr.db_for_source(instance)
if getattr(rel_mgr, 'use_for_related_fields', False):
rel_obj = rel_mgr.using(using).get(**params)
else:
rel_obj = QuerySet(self.field.rel.to).using(using).get(**params)

In your use case, you would override this method on the AdSalesManager:

def db_from_related_object(self, instance):
return 'adsales'

so no matter where the request for an AdSales object comes from, the
adsales database will be used. If you wanted to implement a sharding
strategy, you could put in more complex logic based on instance.

> Do you think this flag addition is small enough for a bugfix / feature
> in the 1.2 timeframe?

Some sort of configuration in this direction will obviously be
required in the long term, and it's arguably a bug in the current
implementation - we need to expose a way to control the using()
arrangements for related fields, and we don't currently do so. Given
that this is a fairly glaring hole in the API, I'm certainly open to
providing a fix in the 1.2 timeframe.

My only question is whether we should try to make this public API for
this release. I'm slightly hesitant to introduce API without a
complete understanding of exactly how multi-db will be used in the
field. For example, you've highlighted a problem on retrieval, but
there will be an analogous problem on assignment - in the current
implementation, book.author = user will force user onto the same
database as book, (or force book onto the same database as user if
book hasn't been db-assigned). This is obviously wrong for the use
case your provide.

So, my suggestion would be to introduce an API to solve this problem,
but prefix the methods we add with underscores to highlight that this
is internal API for now. Come 1.3, we can then change these API
(either dropping the underscores, or replacing the underscore methods
with an entirely new technique) without breaking backwards
compatibility. Anyone using the internal API should be aware (because
of the underscore) that there may be some migration work required to
move from 1.2->1.3.

Yours,
Russ Magee %-)

CB

unread,

Jan 3, 2010, 3:06:13 AM1/3/10

to Django users

> Apologies for taking so long to get back to you on this - the silly
> season has been consuming a lot of my time.

Silly season indeed :)

> I'd go one step further than an attribute - I'd provide a method.

I was considering this, but I was thinking about a hacky/internal API
for 1.2 instead of doing the full implementation that I read was being
thrown around :) I suppose it is an easy logical extention.

> The
> default implementation would essentially be:
>
> def db_from_related_object(self, instance):
> return instance._state.db

I'm not familiar with ._state and not sure if you're referencing
existing attributes on the manager, but I can see what you're getting
at.

> The related manager would then use the following:
>
> using = rel_mgr.db_for_source(instance)
> if getattr(rel_mgr, 'use_for_related_fields', False):
> rel_obj = rel_mgr.using(using).get(**params)
> else:
> rel_obj = QuerySet(self.field.rel.to).using(using).get(**params)

Yes, exactly. I didn't want to jump in much further than I had to, so
I haven't actually patched this (monkey patching is fun, heh) as a
quick guess is it needs to be done in another place for M2M. I have
merely passing familiarity with this stuff, until the other day I
wasn't even aware of these 'magic' descriptors in django.

> In your use case, you would override this method on the AdSalesManager:
>
> def db_from_related_object(self, instance):
> return 'adsales'
>
> so no matter where the request for an AdSales object comes from, the
> adsales database will be used. If you wanted to implement a sharding
> strategy, you could put in more complex logic based on instance.

Agreed.

> Some sort of configuration in this direction will obviously be
> required in the long term, and it's arguably a bug in the current
> implementation - we need to expose a way to control the using()
> arrangements for related fields, and we don't currently do so.

This does seem to be the biggest issue at the moment.

> Given
> that this is a fairly glaring hole in the API, I'm certainly open to
> providing a fix in the 1.2 timeframe.

I'm not too familiar with when exactly this timeframe is - if it's
still a bit away and this doesn't count as a 'feature' (i.e., I have
maybe a week or so) I might try writing up a patch.

> My only question is whether we should try to make this public API for
> this release.

Without an understanding of exactly how soon you want 1.2 out, and as
I'm not exactly the best unit tester... ;) I think keeping this an
internal API might be best for now until someone a bit more
knowledgeable about the internal plumbing than me weighs in on these
ideas.

> I'm slightly hesitant to introduce API without a
> complete understanding of exactly how multi-db will be used in the
> field. For example, you've highlighted a problem on retrieval, but
> there will be an analogous problem on assignment - in the current
> implementation, book.author = user will force user onto the same
> database as book, (or force book onto the same database as user if
> book hasn't been db-assigned). This is obviously wrong for the use
> case your provide.

I totally forgot about assignment! In the monkeypatched code I posted
before, assignment (seemingly) works by segregating the models across
DB (that is, keeping all instances of a model on a single db) by
basically making the default implementation return 'default' instead
of keeping the 'from whence it came' strategy for related models.

We probably should name / detail these partitioning strategies so we
can discuss them easier:

1. Follow Original DB: Always resolve relations by staying on the same
DB as the source relation
2. Force Model DB: Always resolve relations by using

The code I posted above sticks with option 2, and treats an un-
dbassigned model as assigned to 'default'.

> So, my suggestion would be to introduce an API to solve this problem,
> but prefix the methods we add with underscores to highlight that this
> is internal API for now. Come 1.3, we can then change these API
> (either dropping the underscores, or replacing the underscore methods
> with an entirely new technique) without breaking backwards
> compatibility. Anyone using the internal API should be aware (because
> of the underscore) that there may be some migration work required to
> move from 1.2->1.3.

Agreed.

Russell Keith-Magee

unread,

Jan 3, 2010, 9:52:02 AM1/3/10

to django...@googlegroups.com

On Sun, Jan 3, 2010 at 4:06 PM, CB <cbw...@gmail.com> wrote:

>> The
>> default implementation would essentially be:
>>
>> def db_from_related_object(self, instance):
>> return instance._state.db
>
> I'm not familiar with ._state and not sure if you're referencing
> existing attributes on the manager, but I can see what you're getting
> at.

FYI: _state is the per-instance analog of _meta. It's an instance
specific store of instance specific "state" - like the database that
was used to source the object.

>> Given
>> that this is a fairly glaring hole in the API, I'm certainly open to
>> providing a fix in the 1.2 timeframe.
>
> I'm not too familiar with when exactly this timeframe is - if it's
> still a bit away and this doesn't count as a 'feature' (i.e., I have
> maybe a week or so) I might try writing up a patch.

The current timeframe targets a feature freeze on January 26th, with a
full code freeze on March 2. If we can get something in place for the
feature freeze, that would be ideal; depending on the exact nature and
scope of the changes required, I might be able to convince people to
let this sort of change in after the feature freeze.

> We probably should name / detail these partitioning strategies so we
> can discuss them easier:
>
> 1. Follow Original DB: Always resolve relations by staying on the same
> DB as the source relation
> 2. Force Model DB: Always resolve relations by using
>
> The code I posted above sticks with option 2, and treats an un-
> dbassigned model as assigned to 'default'.

There is an additional special case of (2) that is required by
master/slave. Although objects may be retrieved from slave, they are
really 'from' the master database, and two objects retrieved from
different slaves are actually compatible. It was this use case that
originally led to the cross-database checks being relaxed. At the
time, there was discussion about adding a database setting to provide
a point of commonality between databases.

I appreciate that this particular use case doesn't affect your
particular needs, but it should be kept in mind while we are working
on the general data source problem.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,

Jan 19, 2010, 10:31:16 AM1/19/10

to django...@googlegroups.com

Hi CB,

FYI - I've just uploaded a patch to #12540 that implements a strategy
similar to what I described. I'd be interested in hearing any feedback
you may have.

Yours,
Russ Magee %-)

Reply all

Reply to author

Forward