#7052 - Fixing serialization for contrib.contenttypes and contrib.auth

7 views
Skip to first unread message

Russell Keith-Magee

unread,
Dec 3, 2009, 10:33:44 AM12/3/09
to Django Developers
Hi all,

I've been looking at ticket #7052 again. I've got a draft patch up on
Trac, and I'd like feedback on the approach.

Previously, I've been advocating the approach of embedding queries
into the serialization syntax - essentially, interpreting dictionaries
in JSON (and equivalent in other formats) as keyword arguments to a
Model.objects.get() call.

This approach works, but very rapidly gets messy. It's painfully easy
to write a fixture that has circular, nested, or otherwise horribly
knotted dependencies. There is also the issue of serialization - a
query-based syntax is easy to deserialize, but how do you determine
which fields should be included in a serialization?

So, I've taken a different approach with this new patch. The new
approach is much simpler and more explicit than the last. Rather than
trying to embed a query into the serialization language, I've taken a
step back and looked at the problem a different way.

If there is some mechanism that can be used to look up instances of a
model, then by definition, it must be a surrogate key of some kind.
Completely independent of serialization, it would be nifty to be able
to perform lookups based on this surrogate key.

So, lets add a convention for these methods. As an example, consider
contrib.auth Permissions (models have been slightly simplified for
demo purposes).

class PermissionManager(models.Manager):
def get_by_surrogate(self, key):
codename, ct_key = key.split('|',1)
return self.get(
codename=codename,
content_type=ContentType.objects.get_by_surrogate(ct_key)
)

class Permission(models.Model):
name = models.CharField(max_length=50)
content_type = models.ForeignKey(ContentType)
codename = models.CharField(max_length=100)
objects = PermissionManager()

def surrogate_key(self):
return '%s|%s' % (self.codename, self.content_type.surrogate_key())

There are two additions here - a get_by_surrogate() method on the
default manager, and a surrogate_key() method on the model itself. If
I have an instance of permission, I can call p.surrogate_key(), which
will return 'add_user|auth:user'. If I want to get a particular
permission, I can call
Permission.objects.get_by_surrogate('add_user|auth:user'), and that
will resolve to the appropriate permission (or raise an exception if
no answer exists.

So far, this is independent of serialization. These methods could be
useful to end users for looking up permissions or content types.

However, they're also really useful for serialization. The serializers
can use the existence of these methods as cues for changes in
serialization behavior. If a get_by_surrogate() method is found on the
manager, the deserializer will use that method to look up objects; if
surrogate_key() exists on the model, that will be used to serialize
references to the object instead of using primary key values.

So, a JSON serialized reference to a permission that previously read:

{
"pk": 1,
"model": "auth.User",
"fields": {
...
permissions = [ 1,3 ]
}
}

will now read:

{
"pk": 1,
"model": "auth.User",
"fields": {
...
permissions = [ "add_user|auth:user" ,"delete_user|auth:user" ]
}
}

The patch attached to #7052 implements this scheme, and includes
surrogate key definitions for Permission and ContentType, plus tests
to validate that this approach works.

Two possible objections:

* It's a string-based serialization format. This means the developer
will need to write parsing code and a microsyntax to handle composite
surrogate keys (e.g., separating permission name from content type
with a pipe, and separating app_label from model with a colon).

* You can only define one surrogate key. Some models might have more
than one natural serialization.

Personally, I'm comfortable with both of these limitations, but I'm
interested in hearing other opinions.

On a practical note - there is one failing test, which highlights the
one flaw in this scheme that I am aware of. There is still a circular
dependency problem - an object must be deserialized before it can be
referenced using a surrogate key. I have a couple of ideas of how to
address this (essentially fixing it at the dumpdata level with model
ordering), but I wanted to get community approval for the general idea
before I did too much work on fixing the edge cases.

So - opinions?

Yours,
Russ Magee %-)

Luke Plant

unread,
Dec 3, 2009, 6:03:38 PM12/3/09
to django-d...@googlegroups.com
On Thursday 03 December 2009 15:33:44 Russell Keith-Magee wrote:

> So, I've taken a different approach with this new patch. The new
> approach is much simpler and more explicit than the last. Rather
> than trying to embed a query into the serialization language, I've
> taken a step back and looked at the problem a different way.
>
> If there is some mechanism that can be used to look up instances of
> a model, then by definition, it must be a surrogate key of some
> kind. Completely independent of serialization, it would be nifty
> to be able to perform lookups based on this surrogate key.

This approach seems good to me. I had a brief look at the patches on
the ticket, and it was *much* easier to understand your draft patch
than the alternative, which is a good sign, other things being equal.

> Two possible objections:
>
> * It's a string-based serialization format. This means the
> developer will need to write parsing code and a microsyntax to
> handle composite surrogate keys (e.g., separating permission name
> from content type with a pipe, and separating app_label from model
> with a colon).

How easy would it be to fix this? If you used a list of string values,
instead of a single string, wouldn't all the serialization formats
handle this with very little work? Without looking at the code, I
imagine that JSON and YAML at least would. You are still forcing
things to be to be converted to strings and back, but there are
usually standard ways to do that.

> * You can only define one surrogate key. Some models might have
> more than one natural serialization.

I agree that this limitation isn't serious. If you added support for
multiple surrogate keys, you'd need to add methods to specify which
one should be used, and thread that option through all the
serialization code...

I think the limitation could become more significant if other parts of
Django started relying on these surrogate key methods, we should think
about that if it comes up.

> On a practical note - there is one failing test, which highlights
> the one flaw in this scheme that I am aware of. There is still a
> circular dependency problem - an object must be deserialized
> before it can be referenced using a surrogate key.

I presume this problem would exist with the alternative method on that
ticket as well, right?

Regards,

Luke

--
"I am going to let you move around more, just to break up the
mahogany." (True Quotes From Induhviduals, Scott Adams)

Luke Plant || http://lukeplant.me.uk/

Russell Keith-Magee

unread,
Dec 3, 2009, 6:38:31 PM12/3/09
to django-d...@googlegroups.com
Interesting thought. JSON and YAML serialization of list/tuples would
be able to handle this almost transparently - in fact, it probably
works right now if you modify the surrogate implementations to
return/accept lists/tuples.

XML will require a bit more work, but it should still be possible.
I'll take a look and see what I can do.

>>  * You can only define one surrogate key. Some models might have
>>  more than one natural serialization.
>
> I agree that this limitation isn't serious.  If you added support for
> multiple surrogate keys, you'd need to add methods to specify which
> one should be used, and thread that option through all the
> serialization code...
>
> I think the limitation could become more significant if other parts of
> Django started relying on these surrogate key methods, we should think
> about that if it comes up.
>
>> On a practical note - there is one failing test, which highlights
>>  the one flaw in this scheme that I am aware of. There is still a
>>  circular dependency problem - an object must be deserialized
>>  before it can be referenced using a surrogate key.
>
> I presume this problem would exist with the alternative method on that
> ticket as well, right?

Correct. We never arrived at a way to implement serialization under
the old approach, so this never really became an issue.

Yours,
Russ Magee %-)

mattim...@gmail.com

unread,
Dec 3, 2009, 10:46:07 PM12/3/09
to Django developers
Hi Russ,

I do not like this approach. Using the output from this serializer for
anything other than loaddata/dumpdata would be annoying. You would
have no idea which model's fields the separators (pipe and colon) are
delimiting. E.g. in my REST api the related models are serialized to
JSON using my Django Full Serializer and I can just access them as
Javascript objects in my frontend code. Your approach would require
repeating the parsing code in javascript as well as having to know
beforehand which model each and every separator corresponded to (!@#$
%^&*:;.,)!!!. Might be ok for 3 models in django.contrib but not for
hundreds of reusable app's models that might start using this feature.
What about separator clashes?

cheers

Matthew

Russell Keith-Magee

unread,
Dec 3, 2009, 10:56:36 PM12/3/09
to django-d...@googlegroups.com
There are two separate problems here:
1) "Full" serialization
2) Fixing #7052.

As I've said before, full serialization is something I'd like to
tackle, but it's almost orthogonal to the #7052 problem. I can't think
of anything in this solution that would prohibit a later attempt to
provide full serialization - 'utilize surrogate keys' is just one
serialization strategy that could be followed (optionally, if need be)
for foreign keys.

I'm willing to accept the concern about separators, though. Luke's
suggestion to use tuples would fix this - instead of a permission
being rendered (in JSON, for example) as:

permission = "add_user|auth:user"

you would get

permission = ('add_user', ('auth', 'user'))

This avoids separator clashes, and it parsed by the same rules as the
serialization language.It also means that the implementation for
get_by_surrogate() can be a little more meaningful - e.g, for content
type:

def get_by_surrogate(app_label, model):
...

Is this any better in your opinion?

Yours,
Russ Magee %-)

mattim...@gmail.com

unread,
Dec 3, 2009, 11:19:58 PM12/3/09
to Django developers


On Dec 4, 2:56 pm, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> On Fri, Dec 4, 2009 at 11:46 AM, mattimust...@gmail.com
Using tuple is a little better but it isn't much of a leap to go to
full serialization as in the example output here [1] (ignoring the
relations argument syntax for now).
Your patch also makes this the default behaviour for a serializer. So
how then does a consumer (other than the Django deserializer) of the
serialized output know that "('auth', 'user')" belongs to the
contenttypes.contenttype model?

I'm also uncomfortable putting this in the model & manager, but I'm
not clear on the other potential use cases. Making it as an option in
the serializer sits better with me.

Would you see full serialization replacing the surrogate keys strategy
in future or would Django be stuck with supporting both and their
intermingled code?

[1] http://code.google.com/p/wadofstuff/wiki/DjangoFullSerializers#Serializing_a_relation_of_a_relation

cheers
Matthew Flanagan

Russell Keith-Magee

unread,
Dec 3, 2009, 11:54:39 PM12/3/09
to django-d...@googlegroups.com
I'm sorry, but the leap to "full" serializers *really* is a big leap.

With all due respect, and acknowledging that you've built a project
that scratches your personal itches - I don't consider
django-full-serializers to be anything close to the sort of "full"
serializer that would like to see in Django. To be sure -
django-full-serializers contains a whole lot of features that Django's
serializers doesn't have, but IMHO, it's also missing a whole lot of
really important features.

I don't consider the task of "full" serialization to be just a matter
of allowing extra fields and attributes, plus allowing arbitrary depth
of serialization. I consider "full" serialization to be allowing
complete configuration of the serialization process. For example, you
should be able to define a serializer that *doesn't* require the top
level structure of serialized objects to include "pk", "model" and
"fields". I would expect that the following to be a completely
legitimate output of a configured serializer for content types:

[ ('auth', 'user'), ('auth','permissions'), ('admin','logentry') ]

This obviously isn't enough detail to deserialize anything, but this
sort of sanitized output could be extremely useful to an AJAX
consumer.

Getting this degree of customization will be a big job, but that's
what I'm thinking about when I think about "full" serialization.

> Your patch also makes this the default behaviour for a serializer. So
> how then does a consumer (other than the Django deserializer) of the
> serialized output know that "('auth', 'user')" belongs to the
> contenttypes.contenttype model?

The same way that Django's deserializer knows - because it has access
to some metadata. I don't think it's unreasonable to expect that
consumers of serialized data should have some sort of understanding of
what they are consuming - either hardcoded knowledge about the
datatypes they are receiving, or via some metadata description.

Of course, you could embed that metadata into the serialization -
which is what django-full-serializers does. And this sort of embedding
should be possible. But it's not the *only* way to serialize data.

> I'm also uncomfortable putting this in the model & manager, but I'm
> not clear on the other potential use cases. Making it as an option in
> the serializer sits better with me.

The non-serialization use case is that a surrogate key is essentially
a useful way of referring to an object instance. ContentType provides
manager level helpers to retrieve content types based on a Model
instance - this is just extending those capabilities. As a bonus, it
happens to be useful for serialization.

> Would you see full serialization replacing the surrogate keys strategy
> in future or would Django be stuck with supporting both and their
> intermingled code?

Replacing it, at an option. "Full" serialization to me means that you
can choose exactly how a foreign key is serialized and under what
circumstances that would occur. This could mean:
* Use foreign keys always
* Use foreign keys unless a surrogate is available
* Use a custom scheme described by some user-defined function
* Use a combination of these approaches depending on context.

Yours
Russ Magee %-)

Russell Keith-Magee

unread,
Dec 4, 2009, 12:39:47 PM12/4/09
to django-d...@googlegroups.com
On Fri, Dec 4, 2009 at 7:38 AM, Russell Keith-Magee
<freakb...@gmail.com> wrote:
> On Fri, Dec 4, 2009 at 7:03 AM, Luke Plant <L.Pla...@cantab.net> wrote:
>> On Thursday 03 December 2009 15:33:44 Russell Keith-Magee wrote:
>>
>> How easy would it be to fix this? If you used a list of string values,
>> instead of a single string, wouldn't all the serialization formats
>> handle this with very little work?  Without looking at the code, I
>> imagine that JSON and YAML at least would. You are still forcing
>> things to be to be converted to strings and back, but there are
>> usually standard ways to do that.
>
> Interesting thought. JSON and YAML serialization of list/tuples would
> be able to handle this almost transparently - in fact, it probably
> works right now if you modify the surrogate implementations to
> return/accept lists/tuples.

I've just uploaded an updated patch that implements this strategy. A
surrogate key can now be a string or a list/tuple. If it's a string,
the same thing happens as before. If the surrogate key is a
list/tuple, it will be unrolled as arguments to get_by_surrogate. This
means that content types are now serialized in JSON as:

contenttype: ('auth','user')

which is rolled out in code to perform the following query:

ContentType.objects.get_by_surrogate('auth','User')

>>> On a practical note - there is one failing test, which highlights
>>>  the one flaw in this scheme that I am aware of. There is still a
>>>  circular dependency problem - an object must be deserialized
>>>  before it can be referenced using a surrogate key.
>>
>> I presume this problem would exist with the alternative method on that
>> ticket as well, right?
>
> Correct. We never arrived at a way to implement serialization under
> the old approach, so this never really became an issue.

The new patch addresses this issue in two ways.

Firstly, models that define a surrogate key are given priority in
serialization; surrogate key models are serialized first, followed by
normal models.

Secondly, to allow for dependencies between models with surrogate
keys, surrogate keys can define dependencies. For example, the
auth.Permissions model defines it's surrogate key as follows:

def surrogate_key(self):
return (self.codename, self.content_type.app_label, self.content_type.model)
surrogate_key.dependencies = ['contenttypes.contenttype']

This indicates that Permission has a dependency on ContentType, so
Permissions will always be serialized after ContentTypes if they are
in the same fixture.

Documentation is still to come.

Russ %-)

Jacob Kaplan-Moss

unread,
Dec 8, 2009, 11:55:16 AM12/8/09
to django-d...@googlegroups.com
Hi Russ --

This is looking pretty great; I'm a big fan of this approach. I've
looked through the places I've needed to work around this problem with
serializers, and your fix would work in every case, so I think you
nailed it.

I do, however, have one wrench to throw into the works:

As it stands, this gets incredibly close to what we'd need to support
composite primary keys! The primary thing standing in the way of
getting composite PKs is an API to refer to objects by something other
than a single value; your `surrogate_key`/`get_by_surrogate` is that
API.

Now, I'm not suggesting that we shave the CPK yak just to get
serialization of content types. However, it might be worth a bit of
bikeshedding in the API to future-proof it so that 1.3 could grow CPK
support out of this work.

All that to say: perhaps it'd be worth naming it something other than
"surrogate"?

On Fri, Dec 4, 2009 at 11:39 AM, Russell Keith-Magee
<freakb...@gmail.com> wrote:
> Documentation is still to come.

For shame! :) :)

Jacob

Russell Keith-Magee

unread,
Dec 8, 2009, 6:15:51 PM12/8/09
to django-d...@googlegroups.com
On Wed, Dec 9, 2009 at 12:55 AM, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
> Hi Russ --
>
> This is looking pretty great; I'm a big fan of this approach. I've
> looked through the places I've needed to work around this problem with
> serializers, and your fix would work in every case, so I think you
> nailed it.
>
> I do, however, have one wrench to throw into the works:
>
> As it stands, this gets incredibly close to what we'd need to support
> composite primary keys! The primary thing standing in the way of
> getting composite PKs is an API to refer to objects by something other
> than a single value; your `surrogate_key`/`get_by_surrogate` is that
> API.

There's overlap, but it isn't necessarily a perfect match.

Firstly, there's no reason that a surrogate needs to be multiple
columns - it could just be a reference to a different column. For
example, you could use the name of a Author as a surrogate. A single
string works really nicely as a surrogate.

Secondly, a surrogate doesn't need to be . auth.Permission is a
triple, but the second and third values are themselves the surrogate
of the content type.

Lastly, there's no guarantee that a CPK will automatically be an
appropriate surrogate for serialization.

> Now, I'm not suggesting that we shave the CPK yak just to get
> serialization of content types. However, it might be worth a bit of
> bikeshedding in the API to future-proof it so that 1.3 could grow CPK
> support out of this work.
>
> All that to say: perhaps it'd be worth naming it something other than
> "surrogate"?

The overlap between CPK and surrogate wasn't lost on me, but at the
time I couldn't think of an obvious alternate name that wasn't bound
to serialization. In the end, I came to the position that 'surrogate'
was a reasonable name for this particular feature, and we should
mentally reserve 'composite' (i.e., get_composite_key,
get_by_composite, etc) as the magic word for CPK.

I just kicked the idea around the office here, and we came up with two
other options:

* Natural key: on the grounds that we're picking a natural way to
refer to the object.

* Stable key: on the grounds that this whole problem exists because
the primary key *isn't* stable over syncdb calls, and we're picking a
set of columns that will be stable.

I'm happy to entertain any other bikeshed colors. Suggestions?

> On Fri, Dec 4, 2009 at 11:39 AM, Russell Keith-Magee
> <freakb...@gmail.com> wrote:
>> Documentation is still to come.
>
> For shame! :) :)

If it makes you feel any better, I haven't been sleeping well since I
uploaded the patch :-)

Russ %-)

Jacob Kaplan-Moss

unread,
Dec 9, 2009, 10:36:37 AM12/9/09
to django-d...@googlegroups.com
On Tue, Dec 8, 2009 at 5:15 PM, Russell Keith-Magee
<freakb...@gmail.com> wrote:
> There's overlap, but it isn't necessarily a perfect match.

Good points -- you've clearly thought this through farther than me.
Consider the wrench withdrawn :)

If you need another set of eyes on the code, I'm happy to give it a
review later this week and/or during the sprint this weekend.

Jacob

Russell Keith-Magee

unread,
Dec 10, 2009, 7:14:17 PM12/10/09
to django-d...@googlegroups.com
On Wed, Dec 9, 2009 at 11:36 PM, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
> On Tue, Dec 8, 2009 at 5:15 PM, Russell Keith-Magee
> <freakb...@gmail.com> wrote:
>> There's overlap, but it isn't necessarily a perfect match.
>
> Good points -- you've clearly thought this through farther than me.
> Consider the wrench withdrawn :)

You might have withdrawn your wrench, but it has given me pause for
thought. I've decided to go with 'natural key' rather than surrogate.
It seems like a slightly better terminology match (there is only one
by definition, and it's the 'natural' way of referring to an object.

> If you need another set of eyes on the code, I'm happy to give it a
> review later this week and/or during the sprint this weekend.

If you'd be so kind. I've just uploaded RC1 of the patch to the ticket
- this time, with docs!

I've also uploaded RC1 for #7052 (cached templates). If you've got a
spare moment, a review there wouldn't go astray.

Russ %-)

Russell Keith-Magee

unread,
Dec 10, 2009, 7:22:00 PM12/10/09
to django-d...@googlegroups.com
Bah - of course, that should be #6262 (cached templates)

Russ %-)
Reply all
Reply to author
Forward
0 new messages