Next on my pony list for v1.2: #7052 - fixing the serializers to work around the problem of serializing dynamically created objects, such as those produced by contrib.auth and contrib.contenttypes. I need some feedback on how much of this solution we need, want, and are comfortable seeing in trunk. Apologies in advance for the long post.
For those not familiar with the problem - these two apps dynamically create data as part of the syncdb process. As a result, the primary keys for these objects aren't necessarily consistent after a syncdb, so fixtures can't reliably refer to auth permissions or content types. The problem is more general than these two apps specifically, but these two are the ones that most people get bitten by early on in the testing process.
The solution that I've been intending to implement for a while is an extension to Django's serialization syntax: wherever a primary key is legal, we will also allow a dictionary-like structure (whatever the serialization format allows) that equates to the kwargs that will be passed to a Model.objects.get() call.
The deserializer will then do ContentType.objects.get(app_label='otherapp', model='othermodel') to resolve the actual primary key at runtime. Analogous syntax would exist for XML, PyYAML, etc.
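As a rough illustration of the idea (not the code from the #7052 patch), a deserializer could accept either a literal primary key or a dictionary of lookup kwargs wherever a key is expected. The sketch below uses a plain in-memory list in place of ContentType.objects.get(), keyed on ContentType's app_label and model fields; the names resolve_pk and CONTENT_TYPES are invented for the example.

```python
# Stand-in for the content type table; in Django this would be a queryset.
CONTENT_TYPES = [
    {"pk": 7, "app_label": "auth", "model": "permission"},
    {"pk": 12, "app_label": "otherapp", "model": "othermodel"},
]


def resolve_pk(value, rows):
    """Return a primary key from a literal pk or a dict of lookup kwargs."""
    if not isinstance(value, dict):
        # Plain primary key: use it unchanged, as fixtures do today.
        return value
    matches = [
        row for row in rows
        if all(row.get(field) == wanted for field, wanted in value.items())
    ]
    if len(matches) != 1:
        # Like get(), the lookup must identify exactly one row.
        raise LookupError(
            "expected exactly one match for %r, found %d" % (value, len(matches))
        )
    return matches[0]["pk"]
```

With this shape, a fixture entry can say {"app_label": "otherapp", "model": "othermodel"} instead of 12, and the key is resolved against whatever the target database happens to contain.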
Now, there are two parts to the solution. The deserializer is easy - write a handler for the dictionary syntax for primary keys, and you're done. Easy to implement, easy to test.
The serializer isn't so easy, however. Determining when to output a lookup dictionary for a primary key isn't trivial. Here are some options:
Option 1: Ignore the problem
-----------------------------------------
Implement the deserializer, but don't try and solve the serialization problem. Treat the lookup syntax for primary keys as a nifty extra you can exploit by hand if need be. Serialization generates integer primary keys, and you can hand modify fixtures to use lookup syntax if you want to.
Option 2: Add a Meta argument for serialization
--------------------------------------------------------------------
This is essentially what the patch on #7052 currently implements.
Under this approach, a model that is known to engage in dynamic data creation can mark itself for dynamic dumping, indicating the fields that should be used for that dump. For example, ContentTypes would contain something like:
    class Meta:
        ...
        dump_related = ('app_label', 'model')
which indicates the two fields that should be used to construct the lookup dictionary whenever a ContentType object is serialized.
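To make the Option 2 behaviour concrete, here is a minimal sketch using a plain Python class rather than a real Django model. Only the dump_related name comes from the proposal; FakeContentType, PlainObject and dump_reference are hypothetical, and in Django itself the option would presumably be read from the model's _meta rather than from the Meta class directly.

```python
class FakeContentType:
    """Plain-Python stand-in for a model that opts in to lookup dumping."""

    class Meta:
        dump_related = ("app_label", "model")

    def __init__(self, pk, app_label, model):
        self.pk = pk
        self.app_label = app_label
        self.model = model


class PlainObject:
    """Stand-in for a model that does not declare dump_related."""

    def __init__(self, pk):
        self.pk = pk


def dump_reference(obj):
    """Return a lookup dict if the class declares dump_related, else the pk."""
    meta = getattr(obj, "Meta", None)
    fields = getattr(meta, "dump_related", None)
    if not fields:
        return obj.pk
    return {name: getattr(obj, name) for name in fields}
```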
The problem with this approach is that it hard-codes a single aspect of serialization into the model. If someone has a different set of requirements for serializing content types under particular circumstances, they will be out of luck.
Option 3: Add flags/arguments to the serializer to control dynamic dumping
------------------------------------------------------------------------------------------------------------
It might be possible to simplify this a little by saying that when --lookup=contenttypes.contenttype is specified, the first unique_together tuple will be used to construct the lookup.
This puts complete control in the hands of the user at serialization time. However, the syntax isn't especially elegant, particularly given that every single serialization of contenttypes and permissions will, in practice, need to use the --lookup argument.
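A small sketch of how such a flag might be collected and applied, using argparse from the standard library. The --lookup flag exists only in this proposal; parse_lookup_models and reference_for are hypothetical helpers, and the unique values are passed in by hand rather than read from a model.

```python
import argparse


def parse_lookup_models(argv):
    """Collect every --lookup=app_label.model argument into a set of labels."""
    parser = argparse.ArgumentParser(prog="dumpdata-sketch")
    parser.add_argument(
        "--lookup", action="append", default=[], metavar="app_label.model"
    )
    args = parser.parse_args(argv)
    return {label.lower() for label in args.lookup}


def reference_for(model_label, pk, unique_values, lookup_models):
    """Emit a lookup dict only for models the user asked for, else the pk."""
    if model_label in lookup_models:
        return dict(unique_values)
    return pk
```

The drawback described above shows up directly: forget to pass --lookup for a model and its references silently fall back to integer keys.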
Option 4: An all-singing, all-dancing serialization framework rework
------------------------------------------------------------------------------------------------
Django's serialization format is fairly limited, and there have been many proposals to add features to the output format (serializing non-model properties, reverse relations, deep relations, etc). I've been holding off on these in favour of a larger rework of the serialization framework.
In my mind's eye, I have a vision of a serialization framework that would allow for registration of different serialization formats - not just JSON/XML, but the fields and internal structure of a JSON fixture, etc. Describing which fields should be rendered as lookups, how the lookup would be determined, and under what conditions a lookup should be used would all just be configuration items on a serialization definition.
This is obviously a much larger body of work, and certainly wouldn't get done for v1.2 - if only because I haven't done any planning, prototype implementation, or community review.
The good news in all this is that Option 1 isn't mutually exclusive with the other options - we can land Option 1 right now and get the advantages of dynamic lookups, and then worry about how to close the loop as a second problem.
So - feedback welcome. Which option should we pursue?
On Thu, Nov 5, 2009 at 9:29 AM, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> Hi all,
> Next on my pony list for v1.2: #7052 - fixing the serializers to work around the problem of serializing dynamically created objects, such as those produced by contrib.auth and contrib.contenttypes. I need some feedback on how much of this solution we need, want, and are comfortable seeing in trunk. Apologies in advance for the long post.
No worries. This is one of my ponies as well.
> Option 1: Ignore the problem
> -----------------------------------------
> Implement the deserializer, but don't try and solve the serialization problem. Treat the lookup syntax for primary keys as a nifty extra you can exploit by hand if need be. Serialization generates integer primary keys, and you can hand modify fixtures to use lookup syntax if you want to.
I think this is a viable option, as you say just to get the code checked in. A surprising number of people write fixtures by hand, and this would allow them to get nice control over loading. Editing generated ones automatically or by hand afterwards also isn't that hard. That said, I think there is a better middle ground.
> Option 2: Add a Meta argument for serialization
> --------------------------------------------------------------------
> The problem with this approach is that hard-codes a single aspect of serialization into the model. If someone has a different set of requirements for serializing content types under particular circumstances, they will be out of luck.
I agree that this approach feels a bit heavy. I don't know if we really need something on the model that tells us how to dump something. Don't we already have that data? unique_together and unique=True should provide us enough information for making at least a naive implementation.
> It might be possible to simplify this a little by saying that when --lookup=contenttypes.contenttype is specfied, the first unique_together tuple will be used to construct the lookup.
> This puts complete control in the hand of the user at serialization time. However, the syntax isn't especially elegant, especially given that every single serialization of contenttypes and permissions will, in practice, need to use the --lookup argument.
This is somewhere along the lines of the approach that I was considering. I'll explain below.
> Django's serialization format is fairly limited, and there have been many proposals to add features to the output format (serializing non-model properties, reverse relations, deep relations, etc). I've been holding off on these in favour of a larger rework of the serialization framework.
> In my minds eye, I have a vision of a serialization framework that would allow for registration of different serialization formats - not just JSON/XML, but the fields and internal structure of a JSON fixture, etc. Describing which fields should be rendered as lookups, how the lookup would be determined, and under what conditions a lookup should be used would all just be a configuration items on a serialization definition.
> This is obviously a much larger body of work, and certainly wouldn't get done for v1.2 - if only because I haven't done any planning, prototype implementation, or community review.
Yes please, but let's get something usable into 1.2.
> The good news in all this is that Option 1 isn't mutually exclusive to the other options - we can land Option 1 right now and get the advantages of dynamic lookups, and then worry about how to close the loop as a second problem.
> So - feedback welcome. Which option should we pursue?
Well, since you asked for it ;)
I have solved this in a proof of concept way in my playground code[0]. Pinax also took this implementation and adapted it to work with json[1].
The basic approach is that, on serialization, you introspect the model and see whether it has any unique or unique_together fields. If it does, those fields are queried and used as the lookup when outputting the fixture.
This gives us a neat fallback: if the only unique constraint is the automatically generated AutoField, the serializer just outputs the pk reference that is used now.
I think this gives us the most automatic behaviour, outputting lookups where we obviously know the fields are unique. I think combining this with some version of Option 3, allowing users to specify on the command line which fields they wish to output, gets us most of the rest of the way.
My use case for the command line options is normally having a SlugField on a model that isn't declared as unique=True, but whose values are unique for the data set that I have.
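The introspection step can be sketched without Django as a function over a simplified description of a model's fields. This is not the playground or Pinax code referred to above; choose_lookup_fields and the (name, unique, is_auto_pk) tuples are invented for the example, and preferring unique_together over single unique fields is an assumption, since the thread does not specify an order.

```python
def choose_lookup_fields(field_specs, unique_together=()):
    """Pick the fields to use as a lookup, falling back to the pk.

    field_specs is a list of (name, unique, is_auto_pk) tuples;
    unique_together is a sequence of field-name tuples.
    """
    if unique_together:
        # Assumption: the first unique_together tuple wins.
        return tuple(unique_together[0])
    for name, unique, is_auto_pk in field_specs:
        if unique and not is_auto_pk:
            return (name,)
    # Only the auto-generated key is unique: keep today's pk reference.
    return ("pk",)
```

A command-line override for the SlugField case would simply bypass this function and supply the field names directly.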
On Thu, Nov 5, 2009 at 9:29 AM, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> Option 1: Ignore the problem
One step at a time; my preference would be to get this done first, then start looking for better approaches on the serialization side. I'd rather get *something* done than spend time looking for a perfect approach.
That said, I do think Eric's suggestion of automatically introspecting for unique/unique_together is the best idea I've seen yet. Well, the best idea that doesn't require a ground-up rewrite of serialization, that is.
On Nov 5, 10:21 am, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
> That said, I do think Eric's suggestion of automatically introspecting
> for unique/unique_together is the best idea I've seen yet. Well, the
> best idea that doesn't require a ground-up rewrite of serialization,
> that is.
> Next on my pony list for v1.2: #7052 - fixing the serializers to work
> around the problem of serializing dynamically created objects, such as
> those produced by contrib.auth and contrib.contenttypes. I need some
> feedback on how much of this solution we need, want, and are
> comfortable seeing in trunk. Apologies in advance for the long post.
> For those not familiar with the problem - these two apps dynamically
> create data as part of the syncdb process. As a result, the primary
> keys for these objects aren't necessarily consistent after a syncdb,
> so fixtures can't reliably refer to auth permissions or content types.
> The problem is more general than these two apps specifically, but
> these two are the ones that most people get bitten by early on in the
> testing process.
> The solution that I've been intending to implement for a while is an
> extension to Django's serialization syntax: wherever a primary key is
> legal, we will also allow a dictionary-like structure (whatever the
> serialization format allows) that equates to the kwargs that will be
> passed to a Model.objects.get() call.
> The serializer will then do
> ContentType.objects.get(app_name='otherapp', model='othermodel') to
> resolve the actual primary key at runtime. Analogous syntax would
> exist for XML, PyYAML, etc.
> Now, there are two parts to the solution. The deserializer is easy -
> write a handler for the dictionary syntax for primary keys, and you're
> done. Easy to implement, easy to test.
> The serializer isn't so easy, however. Determining when to output a
> lookup dictionary for a primary key isn't trivial. Here are some
> options:
> Option 1: Ignore the problem
> -----------------------------------------
> Implement the deserializer, but don't try and solve the serialization
> problem. Treat the lookup syntax for primary keys as a nifty extra you
> can exploit by hand if need be. Serialization generates integer
> primary keys, and you can hand modify fixtures to use lookup syntax if
> you want to.
+1. I need to ship a default set of groups with permissions assigned
to them in my app. This would make it much easier to do.
> Option 2: Add a Meta argument for serialization
> --------------------------------------------------------------------
> This is essentially what the patch on #7052 currently implements.
> Under this approach, a model that is known to engage in dynamic data
> creation can mark itself for dynamic dumping, indicating the fields
> that should be used for that dump. For example, ContentTypes would
> contain something like:
>     class Meta:
>         ...
>         dump_related = ('app_label','model')
> which indicates the two fields that should be used to construct the
> lookup dictionary whenever a ContentType object is serialized.
> The problem with this approach is that hard-codes a single aspect of
> serialization into the model. If someone has a different set of
> requirements for serializing content types under particular
> circumstances, they will be out of luck.
> Option 3: Add flags/arguments to the serializer to control dynamic dumping
> --------------------------------------------------------------------------- ---------------------------------
> It might be possible to simplify this a little by saying that when
> --lookup=contenttypes.contenttype is specfied, the first
> unique_together tuple will be used to construct the lookup.
> This puts complete control in the hand of the user at serialization
> time. However, the syntax isn't especially elegant, especially given
> that every single serialization of contenttypes and permissions will,
> in practice, need to use the --lookup argument.
> Django's serialization format is fairly limited, and there have been
> many proposals to add features to the output format (serializing
> non-model properties, reverse relations, deep relations, etc). I've
> been holding off on these in favour of a larger rework of the
> serialization framework.
> In my minds eye, I have a vision of a serialization framework that
> would allow for registration of different serialization formats - not
> just JSON/XML, but the fields and internal structure of a JSON
> fixture, etc. Describing which fields should be rendered as lookups,
> how the lookup would be determined, and under what conditions a lookup
> should be used would all just be a configuration items on a
> serialization definition.
> This is obviously a much larger body of work, and certainly wouldn't
> get done for v1.2 - if only because I haven't done any planning,
> prototype implementation, or community review.
I have an obvious bias [1] but this would be my preferred option with
Option 3 being implemented on top of it. I believe my Django Full
Serializers implement 90% of what you are after (there is also a patch
for reverse relations in the issue tracker). It is missing
deserializing "full" serialized models and the ability to customize
the internal structure of the output. I have a test suite for it but
it needs to be extracted/rewritten as it depends on models in an
internal project.
Would it be worth my while reworking my code as a patch against Django
trunk?
> The good news in all this is that Option 1 isn't mutually exclusive to
> the other options - we can land Option 1 right now and get the
> advantages of dynamic lookups, and then worry about how to close the
> loop as a second problem.
> So - feedback welcome. Which option should we pursue?
On Thu, Nov 5, 2009 at 11:46 PM, Eric Holscher <eric.holsc...@gmail.com> wrote:
> On Thu, Nov 5, 2009 at 9:29 AM, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
>> Hi all,
>> Next on my pony list for v1.2: #7052 - fixing the serializers to work around the problem of serializing dynamically created objects, such as those produced by contrib.auth and contrib.contenttypes. I need some feedback on how much of this solution we need, want, and are comfortable seeing in trunk. Apologies in advance for the long post.
> No worries. This is one of my ponies as well.
>> Option 1: Ignore the problem
>> -----------------------------------------
>> Implement the deserializer, but don't try and solve the serialization problem. Treat the lookup syntax for primary keys as a nifty extra you can exploit by hand if need be. Serialization generates integer primary keys, and you can hand modify fixtures to use lookup syntax if you want to.
> I think this is a viable option, as you say just to get the code checked in. A surprising number of people write fixtures by hand, and this would allow them to get nice control over loading. Editing generated ones automatically or by hand afterwards also isn't that hard. That said, I think there is a better middle ground.
>> Option 2: Add a Meta argument for serialization
>> --------------------------------------------------------------------
>> The problem with this approach is that hard-codes a single aspect of serialization into the model. If someone has a different set of requirements for serializing content types under particular circumstances, they will be out of luck.
> I agree that this approach feels a bit heavy. I don't know if we really need something on the model that tells us how to dump something. Don't we already have that data? unique_together and unique=True should provide us enough information for making at least a naive implementation.
>> Option 3: Add flags/arguments to the serializer to control dynamic dumping
>> It might be possible to simplify this a little by saying that when --lookup=contenttypes.contenttype is specfied, the first unique_together tuple will be used to construct the lookup.
>> This puts complete control in the hand of the user at serialization time. However, the syntax isn't especially elegant, especially given that every single serialization of contenttypes and permissions will, in practice, need to use the --lookup argument.
> This is somewhere along the lines of the approach that I was considering. I'll explain below.
>> Option 4: An all-singing, all-dancing serialization framework rework
>> Django's serialization format is fairly limited, and there have been many proposals to add features to the output format (serializing non-model properties, reverse relations, deep relations, etc). I've been holding off on these in favour of a larger rework of the serialization framework.
>> In my minds eye, I have a vision of a serialization framework that would allow for registration of different serialization formats - not just JSON/XML, but the fields and internal structure of a JSON fixture, etc. Describing which fields should be rendered as lookups, how the lookup would be determined, and under what conditions a lookup should be used would all just be a configuration items on a serialization definition.
>> This is obviously a much larger body of work, and certainly wouldn't get done for v1.2 - if only because I haven't done any planning, prototype implementation, or community review.
> Yes please, but lets get something usable into 1.2.
>> The good news in all this is that Option 1 isn't mutually exclusive to the other options - we can land Option 1 right now and get the advantages of dynamic lookups, and then worry about how to close the loop as a second problem.
>> So - feedback welcome. Which option should we pursue?
> Well, since you asked for it ;)
> I have solved this in a proof of concept way in my playground code[0]. Pinax also took this implementation and adapted it to work with json[1].
> The basic approach, is that on serialization, you introspect the model, and see if it has any unique or unique_together fields. If there are these fields present, then those should be queried and used as the unique constraints when outputting the fixture.
> This gives us a neat fallback, where if the only unique constraint is the Autofield generated, then it automatically just outputs the query as the pk query that is used now.
> I think this gives us the most amount of automatic ability, outputting things where we obviously know they are unique. I think combining this with some version of Option 3, allowing users to specify on the command line which fields they wish to output, gets us most of the rest of the way.
I can see what you're driving at here, but I have some reservations.
Firstly, not all models with unique fields need to be serialized as lookups. This introduces a potential efficiency issue. Assigning a literal PK is much faster than a database lookup. Fixture loading is already one of the two weak points in testing speed. Introducing lookups that aren't strictly necessary into fixtures will just make the matter worse.
There is also potentially a complexity issue. At the moment, circular dependencies are easy to resolve - assign a numerical PK, and defer integrity checks to the end of the transaction. Since FK values can themselves be part of unique_together constraints, you could end up with a situation where object 1 needs to look up object 2, but object 2 needs to look up object 1.
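The circular case can be illustrated with graphlib from the standard library (Python 3.9 or later). This is only a sketch of the ordering problem, not anything from the patch: load_order is a hypothetical helper, and lookup_deps is an assumed mapping from each object to the objects its lookups must be able to find first.

```python
from graphlib import CycleError, TopologicalSorter


def load_order(lookup_deps):
    """Return a load order that satisfies every lookup, or None if none exists."""
    try:
        # static_order() yields each node after all of its dependencies.
        return list(TopologicalSorter(lookup_deps).static_order())
    except CycleError:
        # Object 1 needs object 2 and vice versa: no order can work.
        return None
```

With literal primary keys the order never matters, because integrity checks are deferred; with lookups, a cycle leaves the loader with no valid starting point.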
Secondly, not all models that need to be serialized as lookups have unique fields. It's easy to create a management.py trigger to create dynamic content, but there's no requirement that the models involved have unique non-primary keys.
In summary - I think the 'first unique/unique together field' technique is an excellent way to determine which fields should be used for serialization once you have determined that a model should be serialized as a lookup. However, I don't think it's a very good criterion for determining that a lookup is required in the first place.
Manually specifying models at the command line (--lookup) is a long-winded approach, and it's easy to accidentally forget which models need to be serialized as lookups. However, it doesn't suffer from the flaws that the automatic unique-based criterion has. It's not pretty, but it might be enough to tide us over until a full serialization rewrite can happen.
>> In my minds eye, I have a vision of a serialization framework that would allow for registration of different serialization formats - not just JSON/XML, but the fields and internal structure of a JSON fixture, etc. Describing which fields should be rendered as lookups, how the lookup would be determined, and under what conditions a lookup should be used would all just be a configuration items on a serialization definition.
>> This is obviously a much larger body of work, and certainly wouldn't get done for v1.2 - if only because I haven't done any planning, prototype implementation, or community review.
> I have an obvious bias [1] but this would be my preferred option with Option 3 being implemented on top of it. I believe my Django Full Serializers implement 90% of what you are after (there is also a patch for reverse relations in the issue tracker). It is missing deserializing "full" serialized models and the ability to customize the internal structure of the output.
I've seen your Full Serializers before, and it looks like good stuff - but I disagree about your '90% of the problem' analysis. To my mind, including extra properties into the serialized data is the easy part of the problem. The hard bit is getting real control over the serialization format. Once you have good control over the serialization format, serializing extra fields or deep relations is almost a trivial afterthought.
I'm also hesitant to revisit the sins of modelform_factory and generic views. Django has repeatedly discovered that trying to push configuration through the argument list of a function is a recipe for problems in the long term. To wit, from your own docs:
The real solution is to use a full class-based representation.
> Would it be worth my while reworking my code as a patch against Django trunk?
At this point, probably not. A full serialization rewrite is on my list of medium-term things to do (or list of things to try and encourage other people to do :-), which is why the tickets around 'full serialization' have languished for so long. This is one of those occasions where I'm deliberately not fixing part of the problem to encourage attempts to fix the whole problem.
On Fri, Nov 6, 2009 at 1:55 AM, Travis Cline <travis.cl...@gmail.com> wrote:
> On Nov 5, 10:21 am, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
>> That said, I do think Eric's suggestion of automatically introspecting for unique/unique_together is the best idea I've seen yet. Well, the best idea that doesn't require a ground-up rewrite of serialization, that is.
> Needs some work and more tests but some of the bits mentioned above are there.
> My initial aim is to be independent of core but last item on my list of goals is to provide the work as a patch.
Hi Travis,
The patch on #7052 already does the dictionary-based lookup part of the problem, as well as implementing option 2 from my original mail. We probably won't end up using the serialization part (the Meta flag) but the deserializers already work pretty well. I expect that the deserialization/lookup part of the patch will be in trunk quite soon.
> >> In my minds eye, I have a vision of a serialization framework that
> >> would allow for registration of different serialization formats - not
> >> just JSON/XML, but the fields and internal structure of a JSON
> >> fixture, etc. Describing which fields should be rendered as lookups,
> >> how the lookup would be determined, and under what conditions a lookup
> >> should be used would all just be a configuration items on a
> >> serialization definition.
> >> This is obviously a much larger body of work, and certainly wouldn't
> >> get done for v1.2 - if only because I haven't done any planning,
> >> prototype implementation, or community review.
> > I have an obvious bias [1] but this would be my preferred option with
> > Option 3 being implemented on top of it. I believe my Django Full
> > Serializers implement 90% of what you are after (there is also a patch
> > for reverse relations in the issue tracker). It is missing
> > deserializing "full" serialized models and the ability to customize
> > the internal structure of the output.
> I've seen your Full Serializers before, and it looks like good stuff -
> but I disagree about your '90% of the problem' analysis. To my mind,
> including extra properties into the serialized data is the easy part
> of the problem. The hard bit is getting real control over the
> serialization format. Once you have good control over the
> serialization format, serializing extra fields or deep relations is
> almost a trivial afterthought.
> I'm also hesitant to revisit the sins of modelform_factory and generic
> views. Django has repeatedly discovered that trying to push
> configuration through the argument list of a function is a recipe for
> problems in the long term. To whit, from your own docs:
> The real solution is to use a full class-based representation.
I've just been looking at #6735 - Class based Generic Views, and the
patches attached there don't seem to eliminate the convention of using
dicts to pass arguments/configuration to the views. They just make it
easier to override the behaviour.
How do you imagine this being different in (more) class-based
serialization?
The way I think of my current implementation is that the serializer
options are simple configuration (except for very deep relations) that,
if indented well, is not that daunting to maintain.
The current serializer classes have start_serialization,
end_serialization, start_object and end_object methods that can be
overridden. Aren't these enough to handle custom serialization
formats?
> > Would it be worth my while reworking my code as a patch against Django
> > trunk?
> At this point, probably not. A full serialization rewrite is on my
> list of medium-term things to do (or list of things to try and
> encourage other people to do :-), which is why the tickets around
> 'full serialization' have languished for so long. This is one of those
> occasions where I'm deliberately not fixing part of the problem to
> encourage attempts to fix the whole problem.
Given my interest in the area I would be willing to work on this if we
can come to a consensus on what will fix the whole problem.
>> >> In my minds eye, I have a vision of a serialization framework that would allow for registration of different serialization formats - not just JSON/XML, but the fields and internal structure of a JSON fixture, etc. Describing which fields should be rendered as lookups, how the lookup would be determined, and under what conditions a lookup should be used would all just be a configuration items on a serialization definition.
>> >> This is obviously a much larger body of work, and certainly wouldn't get done for v1.2 - if only because I haven't done any planning, prototype implementation, or community review.
>> > I have an obvious bias [1] but this would be my preferred option with Option 3 being implemented on top of it. I believe my Django Full Serializers implement 90% of what you are after (there is also a patch for reverse relations in the issue tracker). It is missing deserializing "full" serialized models and the ability to customize the internal structure of the output.
>> I've seen your Full Serializers before, and it looks like good stuff - but I disagree about your '90% of the problem' analysis. To my mind, including extra properties into the serialized data is the easy part of the problem. The hard bit is getting real control over the serialization format. Once you have good control over the serialization format, serializing extra fields or deep relations is almost a trivial afterthought.
>> I'm also hesitant to revisit the sins of modelform_factory and generic views. Django has repeatedly discovered that trying to push configuration through the argument list of a function is a recipe for problems in the long term. To whit, from your own docs:
>> The real solution is to use a full class-based representation.
> I've just been looking at #6735 - Class based Generic Views, and the patches attached there don't seem to eliminate the convention of using dicts to pass arguments/configuration to the views. They just make it easier to override the behaviour.
> How do you imagine this being different in (more) class-based serialization?
I just had another look at Jacob's github branch for #6735 - I actually wasn't aware that Jacob had made the constructor of the class-based generic views so open-ended.
However, the important feature of class-based generic views isn't the constructor - it's that you can override the functions that return those values. Sure, by default get_foo just returns the foo argument, but the interesting feature of class-based views is that you can replace get_foo with an implementation that does any sort of conditional processing based on state, arguments, etc.
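A minimal sketch of that pattern in plain Python follows. The class names, the paginate_by option and the dict standing in for a request are all hypothetical; this is not the API from Jacob's branch, only the get_foo idea described above.

```python
class SketchListView:
    """Configuration arrives through the constructor, but is read via a method."""

    def __init__(self, paginate_by=None):
        self.paginate_by = paginate_by

    def get_paginate_by(self, request):
        # Default behaviour: just hand back the constructor argument.
        return self.paginate_by


class StaffAwareListView(SketchListView):
    def get_paginate_by(self, request):
        # Overriding the getter allows conditional logic that a plain
        # argument list cannot express.
        if request.get("is_staff"):
            return 100
        return super().get_paginate_by(request)
```

The same shape would apply to a serializer definition: a default answer supplied as configuration, plus a method to override when the answer depends on state.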
> The way I think of my current implementation is that the serializer options are simple configuration (except for very deep relations), and if indented well is not that daunting to maintain.
> The current serializer classes have start_serialization, end_serialization, start_object, end_object methods that can be overridden, aren't these enough to handle custom serialization formats?
Python provides the tools to write HTML headers. Isn't that enough to build a website? :-)
I'm looking at the serialization framework as a way of turning a model, or list of models, or list of models plus some metadata, into a JSON/XML/YAML data structure that can be consumed by an arbitrary endpoint. For example, I don't consider it a given that Django should serialize "pk", "model" and "fields" as the top level structure of an object. I also don't consider it a given that a serialization should necessarily be deserializable.
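One way to picture this, as a sketch only: a serializer whose per-object structure is a method that subclasses can replace. SketchSerializer, FlatSerializer and render_object are invented names, and the records are plain dicts rather than model instances.

```python
import json


class SketchSerializer:
    def serialize(self, records):
        """Render every record and dump the result as JSON."""
        return json.dumps([self.render_object(r) for r in records], sort_keys=True)

    def render_object(self, record):
        # Default layout: the familiar pk / model / fields structure.
        return {
            "pk": record["pk"],
            "model": record["model"],
            "fields": record["fields"],
        }


class FlatSerializer(SketchSerializer):
    def render_object(self, record):
        # A flat layout for some external consumer. It drops the model
        # name, so this output is deliberately not deserializable.
        flat = dict(record["fields"])
        flat["id"] = record["pk"]
        return flat
```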
>> > Would it be worth my while reworking my code as a patch against Django trunk?
>> At this point, probably not. A full serialization rewrite is on my list of medium-term things to do (or list of things to try and encourage other people to do :-), which is why the tickets around 'full serialization' have languished for so long. This is one of those occasions where I'm deliberately not fixing part of the problem to encourage attempts to fix the whole problem.
> Given my interest in the area I would be willing to work on this if we can come to a consensus on what will fix the whole problem.
Excellent. I've got a busy weekend ahead of me, so I don't have time right now to formalize the ideas I have. However, I tried (unsuccessfully) to motivate a couple of students to follow this topic for the GSoC this year; there are a couple of threads where I tried to express the ideas that I had in mind. It's not an ideal starting point, but hopefully it will give you an insight into what I have in mind.
Once my weekend settles down, I'll try to get some ideas down on paper. We probably won't be able to get anything this big into v1.2, but if we can nail down a spec (and maybe an implementation) over the next couple of months, we'll be in a really good position for v1.3 - or at least have a really good external project for the community to use.
(accidentally sent this directly to Russell, he has offered to help me
re-construct this where it belongs)
I am under-the-covers-down-and-naughty familiar with this one.
This one has caused pain for us. We had a situation very similar to
what #7052 outlines and I ended up forking Django just to solve it and
implementing the attached patch. So, from my selfish point of view I
don't care where this one ends up I just want:
# To not fork Django
# To have my tests work
On Nov 5, 9:29 am, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> Next on my pony list for v1.2: #7052 - fixing the serializers to work
> around the problem of serializing dynamically created objects, such as
> those produced by contrib.auth and contrib.contenttypes. I need some
> feedback on how much of this solution we need, want, and are
> comfortable seeing in trunk. Apologies in advance for the long post.
> For those not familiar with the problem - these two apps dynamically
> create data as part of the syncdb process. As a result, the primary
> keys for these objects aren't necessarily consistent after a syncdb,
> so fixtures can't reliably refer to auth permissions or content types.
> The problem is more general than these two apps specifically, but
> these two are the ones that most people get bitten by early on in the
> testing process.
> The solution that I've been intending to implement for a while is an
> extension to Django's serialization syntax: wherever a primary key is
> legal, we will also allow a dictionary-like structure (whatever the
> serialization format allows) that equates to the kwargs that will be
> passed to a Model.objects.get() call.
> The serializer will then do
> ContentType.objects.get(app_name='otherapp', model='othermodel') to
> resolve the actual primary key at runtime. Analogous syntax would
> exist for XML, PyYAML, etc.
> Now, there are two parts to the solution. The deserializer is easy -
> write a handler for the dictionary syntax for primary keys, and you're
> done. Easy to implement, easy to test.
> The serializer isn't so easy, however. Determining when to output a
> lookup dictionary for a primary key isn't trivial. Here are some
> options:
> Option 1: Ignore the problem
> -----------------------------------------
> Implement the deserializer, but don't try and solve the serialization
> problem. Treat the lookup syntax for primary keys as a nifty extra you
> can exploit by hand if need be. Serialization generates integer
> primary keys, and you can hand modify fixtures to use lookup syntax if
> you want to.
> Option 2: Add a Meta argument for serialization
> --------------------------------------------------------------------
> This is essentially what the patch on #7052 currently implements.
> Under this approach, a model that is known to engage in dynamic data
> creation can mark itself for dynamic dumping, indicating the fields
> that should be used for that dump. For example, ContentTypes would
> contain something like:
>     class Meta:
>         ...
>         dump_related = ('app_label','model')
> which indicates the two fields that should be used to construct the
> lookup dictionary whenever a ContentType object is serialized.
> The problem with this approach is that it hard-codes a single aspect of
> serialization into the model. If someone has a different set of
> requirements for serializing content types under particular
> circumstances, they will be out of luck.
-1. But it DOES work. And it works really well. If I remember
correctly, part of the work the patch does is alter the ContentType
model to add the dump_related exactly as you mentioned.
I wonder how nasty it would be if ContentType and Permission used this
in a very undocumented way until the major serialization re-write
could be accomplished.
> Option 3: Add flags/arguments to the serializer to control dynamic dumping
> ---------------------------------------------------------------------------
> It might be possible to simplify this a little by saying that when
> --lookup=contenttypes.contenttype is specified, the first
> unique_together tuple will be used to construct the lookup.
> This puts complete control in the hand of the user at serialization
> time. However, the syntax isn't especially elegant, especially given
> that every single serialization of contenttypes and permissions will,
> in practice, need to use the --lookup argument.
My concern for this one is forgetting to add the option every time I
dump the fixtures. (Which is quite often in our environment). I
would end up writing my own management command to wrap this so I
wouldn't forget.
Let's say dump_related in the Model class is out; I agree this is not
where it belongs. What if we had a slightly altered way of
describing it? You could have something like this:
from django.contrib.contenttypes.models import ContentType
from django.core.serializers import register_serializer_style
I could put this one-liner anywhere. I would probably put it in my
models.py right after the class definition, or I might put it in the
project itself and leave it out of the app altogether. The important
thing is it gets it out of the Model class definition Meta section.
It could also be last-in wins, so if my way of serializing ContentType
differs from the default (which might be built-in to contrib) I could
change it somewhere else. I can't think of a good use case for why
you would need to serialize it differently but Russell did mention it
as a possibility.
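A minimal sketch of how the last-in-wins registration could behave. Only the name register_serializer_style comes from the message above; its signature, the model label strings, and the lookup_fields_for helper are assumptions made up for illustration, and nothing like this exists in django.core.serializers.

```python
# Hypothetical registry mapping "app_label.model" to the fields that identify a row.
_LOOKUP_FIELDS = {}


def register_serializer_style(model_label, lookup_fields):
    """Record which fields identify a model; a later call replaces an earlier one."""
    _LOOKUP_FIELDS[model_label] = tuple(lookup_fields)


def lookup_fields_for(model_label):
    """Return the registered fields, or None to fall back to the integer pk."""
    return _LOOKUP_FIELDS.get(model_label)
```

A default registered by contrib and a later project-level call for the same label would simply overwrite each other in import order, which is the last-in-wins behaviour described.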
One feature that we've been playing with is using a Python module to
seed our serialization. It looks something like this:
from django.contrib.auth.models import *
from myproject.models import *
from myproject.utils.serializer import S #'S' for "seed"
Our system is complex enough that we need to dump subsets of data,
filter by date, and some other junk before we ultimately pass it to
the serializer. I would like to see something like this added to the
next version as well.
> Django's serialization format is fairly limited, and there have been
> many proposals to add features to the output format (serializing
> non-model properties, reverse relations, deep relations, etc). I've
> been holding off on these in favour of a larger rework of the
> serialization framework.
> In my mind's eye, I have a vision of a serialization framework that
> would allow for registration of different serialization formats - not
> just JSON/XML, but the fields and internal structure of a JSON
> fixture, etc. Describing which fields should be rendered as lookups,
> how the lookup would be determined, and under what conditions a lookup
> should be used would all just be configuration items on a
> serialization definition.
> This is obviously a much larger body of work, and certainly wouldn't
> get done for v1.2 - if only because I haven't done any planning,
> prototype implementation, or community review.
> The good news in all this is that Option 1 isn't mutually exclusive to
> the other options - we can land Option 1 right now and get the
> advantages of dynamic lookups, and then worry about how to close the
> loop as a second problem.
> So - feedback welcome. Which option should we pursue?
I've implemented a nearly identical solution for this problem before,
and while it worked, it felt dirty.
It strikes me that the problem is to do with our "surrogate" primary
key ids, which don't relate to the data at all. For most models, this
is fine. The problem here is that the ids do not meaningfully identify
particular rows, and so we can't refer to them without psychic id-
guessing powers.
Wouldn't a simpler solution be to change ContentType and such to have
meaningful primary keys?
For ContentType, at least, this would require either creating a new
surrogate primary key that contained both the app and model names, or
else compound primary key support.
On Sat, Nov 7, 2009 at 8:37 AM, Rob Madole <robmad...@gmail.com> wrote:
> (accidentally sent this directly to Russell, he has offered to help me re-construct this where it belongs)
(and, repeating my response that got sent privately to Rob)
> I am under-the-covers-down-and-naughty familiar with this one.
> This one has caused pain for us. We had a situation very similar to what #7052 outlines and I ended up forking Django just to solve it and implementing the attached patch. So, from my selfish point of view I don't care where this one ends up I just want:
> # To not fork Django
> # To have my tests work
Fear not - my intention is that *something* will get into v1.2. I've started this discussion to try and work out if there is anything simple we can do for the serialization case, but if we can't come to a consensus, I'm more than happy to commit just the Option 1, deserialization only solution.
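For reference, the deserialization-only piece amounts to something like the sketch below. It is an illustration under assumptions, not the patch on #7052: FakeManager is an in-memory stand-in for Model.objects, and resolve_pk is a hypothetical helper name.

```python
class FakeManager:
    """In-memory stand-in for Model.objects; only get(**kwargs) is modelled."""

    def __init__(self, rows):
        self.rows = rows

    def get(self, **kwargs):
        matches = [
            row for row in self.rows
            if all(row.get(key) == value for key, value in kwargs.items())
        ]
        if len(matches) != 1:
            raise LookupError("expected exactly one match, got %d" % len(matches))
        return matches[0]


def resolve_pk(value, manager):
    """Return a primary key from either a raw pk or a lookup dictionary."""
    if isinstance(value, dict):
        # Lookup syntax: the dict becomes the kwargs of a get() call.
        return manager.get(**value)["pk"]
    # Anything else is treated as a literal primary key, as today.
    return value
```

A hand-edited fixture could then carry {"app_label": "otherapp", "model": "othermodel"} wherever an integer content type id would otherwise appear.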
>>     class Meta:
>>         ...
>>         dump_related = ('app_label','model')
>> which indicates the two fields that should be used to construct the lookup dictionary whenever a ContentType object is serialized.
>> The problem with this approach is that it hard-codes a single aspect of serialization into the model. If someone has a different set of requirements for serializing content types under particular circumstances, they will be out of luck.
> -1. But it DOES work. And it works really well. If I remember correctly, part of the work the patch does is alter the ContentType model to add the dump_related exactly as you mentioned.
Correct. If a Meta option was the best option, the patch on the ticket is pretty much pret-a-porter. However, "it works" isn't really the quality bar we're aiming for :-)
> I wonder how nasty it would be if ContentType and Permission used this in a very undocumented way until the major serialization re-write could be accomplished.
This is an interesting approach. My only hesitation is that I'm not sure how I feel about introducing a temporary option, not documenting it, and then sorta-kinda encouraging people to use it. However, if we approach the documentation process carefully, it might be manageable (maybe introduce it into the docs, but mark it as deprecated from the start).
>> Option 3: Add flags/arguments to the serializer to control dynamic dumping
>> ---------------------------------------------------------------------------
>> It might be possible to simplify this a little by saying that when --lookup=contenttypes.contenttype is specified, the first unique_together tuple will be used to construct the lookup.
>> This puts complete control in the hand of the user at serialization time. However, the syntax isn't especially elegant, especially given that every single serialization of contenttypes and permissions will, in practice, need to use the --lookup argument.
> My concern for this one is forgetting to add the option every time I dump the fixtures. (Which is quite often in our environment). I would end up writing my own management command to wrap this so I wouldn't forget.
This is also my primary concern with this technique.
> Let's say dump_related in the Model class is out; I agree this is not where it belongs. What if we had a slightly altered way of describing it? You could have something like this:
> from django.contrib.contenttypes.models import ContentType
> from django.core.serializers import register_serializer_style
> I could put this one-liner anywhere. I would probably put it in my models.py right after the class definition, or I might put it in the project itself and leave it out of the app altogether. The important thing is it gets it out of the Model class definition Meta section. It could also be last-in wins, so if my way of serializing ContentType differs from the default (which might be built-in to contrib) I could change it somewhere else. I can't think of a good use case for why you would need to serialize it differently but Russell did mention it as a possibility.
And that's exactly what I've got in my mind's eye for Option 4 - the complete rewrite. However, I don't want to commit to part of an API without knowing what the rest of the API will look like.
> Our system is complex enough that we need to dump subsets of data, filter by date, and some other junk before we ultimately pass it to the serializer. I would like to see something like this added to the next version as well.
This sounds like it might be out of scope for v1.2 (simply because this is a completely new idea that hasn't been discussed before).
On Sat, Nov 7, 2009 at 1:48 PM, J Meier <jim-goo...@dsdd.org> wrote:
> I've implemented a nearly identical solution for this problem before, and while it worked, it felt dirty.
> It strikes me that the problem is to do with our "surrogate" primary key ids, which don't relate to the data at all. For most models, this is fine. The problem here is that the ids do not meaningfully identify particular rows, and so we can't refer to them without psychic id-guessing powers.
> Wouldn't a simpler solution be to change ContentType and such to have meaningful primary keys?
There are three issues here.
Firstly, changing the definition of ContentType to avoid the problem has been proposed in the past. However, this would be a *massive* backwards incompatibility. Making this change isn't something we would do lightly.
Secondly, the "right way" in a relational sense would be to have a composite primary key over app_label and model. However, Django doesn't currently support composite primary keys. So - in order to follow this approach, we would need to finish a major rework of primary key handling first.
Thirdly, that would solve the problem for ContentType, but not the general problem. The general problem is that it is possible to dynamically create data as part of a syncdb trigger, so any autogenerated primary key isn't useful. Sure - we can change ContentType to avoid the problem, but that doesn't make the problem go away for the general case. It's easy to define a model for which an autogenerated integer primary key is appropriate, which has dynamically created syncdb content. These are the models that aren't currently serializable.
> For ContentType, at least, this would require either creating a new surrogate primary key that contained both the app and model names, or else compound primary key support.
> Or with a compound key,
> {
>     "pk": 1,
>     "model": "myapp.mymodel",
>     "fields": {
>         "name": "foobar",
>         "content_type": ("otherapp","othermodel")
>     }
> }
> This would solve both the serialization and deserialization sides of the problem.
> Am I off the mark?
Not entirely. The idea of having a surrogate primary key is interesting - in essence it isn't that far from Option 2 in my original list.
Option 2 introduced the idea of adding a 'dump_related' key to Meta to define the columns that should be used for serialization. In essence, this is defining the surrogate primary key for the model. In most practical uses, this will be either a unique column, or a unique_together tuple - which means that the grouping will effectively be a surrogate primary key.
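A small sketch of how the serializer side could turn such a declaration into a lookup dictionary. Falling back to the first unique_together tuple is borrowed from the Option 3 discussion and is my assumption here; lookup_fields and build_lookup are hypothetical helpers, and SimpleNamespace objects stand in for a model's Meta and for an instance.

```python
from types import SimpleNamespace


def lookup_fields(meta):
    """Pick the fields that act as the surrogate key for a model."""
    explicit = getattr(meta, "dump_related", None)
    if explicit:
        return tuple(explicit)
    unique_together = getattr(meta, "unique_together", ())
    if unique_together:
        # Assumption: the first unique_together tuple identifies a row.
        return tuple(unique_together[0])
    return None


def build_lookup(instance, fields):
    """Build the kwargs dictionary that would later be passed to get()."""
    return {name: getattr(instance, name) for name in fields}
```

With a Meta that declares ('app_label', 'model') either way, a content type would serialize as {"app_label": ..., "model": ...} rather than as an integer.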
The syntactical difference in your proposal (i.e., 'myapp.mymodel', rather than {'app_label': 'myapp', 'model': 'mymodel'}) requires that the model provide its own serialization/deserialization tools.
So we could say that if a model provides a get_surrogate_key() method and the model manager provides a get_instance_by_surrogate_key() method (not happy with the names, but it's a bikeshed for the moment), the serializer/deserializer will use those methods instead of to_python() on the PK for (de)serializing the model.
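Roughly, that contract could look like the sketch below. The two method names are the placeholder names from the paragraph above; the classes are plain Python stand-ins rather than real Django models, and serialize_reference / deserialize_reference are hypothetical helpers showing where the serializer would call them.

```python
class ContentTypeManager:
    """Stand-in manager holding rows in memory."""

    def __init__(self):
        self._rows = []

    def add(self, instance):
        self._rows.append(instance)

    def get_instance_by_surrogate_key(self, key):
        # Inverse of get_surrogate_key(): split "app_label.model" and look it up.
        app_label, model = key.split(".", 1)
        for row in self._rows:
            if (row.app_label, row.model) == (app_label, model):
                return row
        raise LookupError(key)


class ContentType:
    objects = ContentTypeManager()

    def __init__(self, pk, app_label, model):
        self.pk = pk
        self.app_label = app_label
        self.model = model

    def get_surrogate_key(self):
        return "%s.%s" % (self.app_label, self.model)


def serialize_reference(instance):
    """Emit the surrogate key if the model defines one, else the raw pk."""
    if hasattr(instance, "get_surrogate_key"):
        return instance.get_surrogate_key()
    return instance.pk


def deserialize_reference(value, manager):
    """Resolve a serialized reference back to a primary key."""
    if hasattr(manager, "get_instance_by_surrogate_key"):
        return manager.get_instance_by_surrogate_key(value).pk
    return value
```

Models without the two methods keep today's behaviour, so the integer primary key still round-trips unchanged.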
The downside of this approach is exactly the same as it is for the original Option 2 - it means you can only define one way to serialize the model. However, this may not be such a problem as we're expressing this in terms of surrogate keys, rather than a specific serialization property that might change with the serialization technique.
And, again - this is actually compatible with Option 1. As syntactic sugar, we can still include dictionary based lookup syntax if we want to - we just won't need to provide query based serialization support.