Most likely because it will lead to duplicate data when you dump the
models for a particular app. Often the parent and child are in the same
application and you'll see the data from the parent in two places.
*Very* fiddly to untangle. It might be possible to add an option so that
parent data is optionally dumped when you dump a specific model (as
opposed to the whole app).
> Will it be a workable and necessary solution to add that to my
> proposal?
> Same is the case for Ticket #10201. Can someone please tell me why
> microsecond data was dropped?
Quite probably because MySQL (or possibly just the MySQLdb wrapper)
sucks and can't support microsecond data when it's in a datetime value.
So reinserting that data requires yet another place where we have to set
microseconds to 0. It's one of those cases where we've adopted the
lowest common denominator.
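For illustration, a minimal sketch of that lowest-common-denominator behaviour (the helper name is hypothetical, not Django's actual code):

```python
from datetime import datetime

def to_mysql_datetime(dt):
    # Hypothetical helper: MySQL's DATETIME (as exposed via MySQLdb at
    # the time) has no sub-second precision, so the safe cross-database
    # behaviour is to zero out microseconds before the value is stored
    # or re-serialized.
    return dt.replace(microsecond=0)

dt = datetime(2009, 4, 1, 12, 30, 45, 123456)
print(to_mysql_datetime(dt).isoformat())  # 2009-04-01T12:30:45
```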
Regards,
Malcolm
Why manually wrap it at all? Email clients have been able to handle
wrapping lines sensibly on behalf of the sender for about 20 years now.
Just type normally and only hit Return between paragraphs.
Regards,
Malcolm
> > Will it be a workable and necessary solution to add that to my
> > proposal?
> > Same is the case for Ticket #10201. Can someone please tell me why
> > microsecond data was dropped?
> Quite probably because MySQL (or possibly just the MySQLdb wrapper)
> sucks and can't support microsecond data when it's in a datetime value.
> So reinserting that data requires yet another place where we have to set
> microseconds to 0. It's one of those cases where we've adopted the
> lowest common denominator.
Malcolm has already addressed this, and his analysis is pretty much
spot on. I would only add that the current behaviour can also be
explained by looking at the heritage of the fixture system.
Historically, Django's fixtures have been used as a way of serializing
output for transfer between two Django installations (for example, as
test fixtures). To this end, the serializers have concentrated on
replicating a very database-like structure - that is, the structures
that are serialized closely match the underlying database structures.
In an inheritance situation, child tables don't contain all the data
from the parent table; hence, neither do the serialized structures.
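To illustrate with hand-written dicts approximating fixture records (the model names are made up, and these are a simplified sketch of the serialized shape, not literal dumpdata output):

```python
# Under multi-table inheritance, the child's table holds only its own
# columns plus a pointer to the parent row - so a serializer that
# mirrors table structure emits the parent's fields only on the parent
# record, never duplicated onto the child.
parent_row = {
    "pk": 1,
    "model": "app.place",
    "fields": {"name": "Cafe", "address": "Main St"},
}
child_row = {
    "pk": 1,
    "model": "app.restaurant",
    "fields": {"serves_pizza": True},  # no name/address here
}
```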
Obviously, this focus on representing the database misses an
alternate use case - occasions where serialization is required to
communicate with some other data consumer, such as an AJAX framework. In
my 'big picture' of the ideal serialization SoC project, this is the
problem that needs to be fixed. More on this in later comments.
> Same is the case for Ticket #10201. Can someone please tell me why
> microsecond data was dropped?
Again, Malcolm is on the money. If you can come up with a fix that
enables databases that aren't microsecond-deprived to retain microseconds,
I'm sure it would be a welcome inclusion. Thinking about it, this
shouldn't actually be that hard to achieve.
> Also I am leaving adding extras option to serializers since a patch for it
> has already been submitted(Ticket #5711) and looks like a working
> solution. If you all want something extra to be done there to
> commit it to django trunk, please tell me, I will work on that a bit
> and add it to the proposal.
If you are intending to take on "updating the serializers" as a SoC
project, I would encourage you to include #5711 as part of your
proposal. There may be a patch on #5711, and it may be the right
solution, but the patch isn't even close to being ready for trunk -
for one thing, there are no tests or documentation. Finishing the work
on this ticket would be a very worthwhile contribution.
> Here is my long long long proposal:
...
> ~~~~~~~
> Why?
> ~~~~~~~
>
> - The existing format of the serialized output firstly doesn't specify the
> name of the
> Primary Key(PK henceforth), which is a problem for fields which are
> implicitly set
> as PKs (Ticket #10295).
This ticket is a very small part of a bigger problem. The fact that
the primary key isn't named in the serialization format is of no
consequence to the 'database replication' role for serializers,
evidenced by the extensive test suite that demonstrates round trip
fixture loading. It is only significant when you need to support some
alternate data consumer that needs to know the name of the primary
key.
The bigger issue is that we need to be able to easily reconfigure the
output format of serializers to suit the specific requirements of
other data consumers.
> - The existing format only specifies the PK of the related field, but
> doesn't traverse it
> in depth to specify its fields (Ticket #4656).
> - There are no APIs for the above said requirement.
> - The inherited models fields are not serialized.
Again, these are just variations on the same theme. The real problem
is being able to easily reconfigure the output format.
> Situations/problems arising from attempting to fix the above problems
> - When we allow Serialization to follow relations, it becomes unnatural if
> the related Model is included in every relating model data. The data
> becomes extremely redundant. Consider the following example.
It may be redundant, but it may also be required, depending on
circumstance. This is something that needs to be left in the hands of
the end-user.
> - The way loaddata and dumpdata are handled is changed. The new version of
> this
> loaddata and dumpdata may not be compatible with the fixtures generated
> from
> older versions.
The current serialization format is well known, well understood, and
well suited to the task it was designed to perform. As a result, I
would expect that this format would remain as the 'default' format for
Django fixtures, and be entirely backwards compatible without extra
options/flags.
However, there is an obvious need for alternate formats. These formats
may be dramatically different from the current serialization formats,
and certain output formats may not contain enough data to be used for
later loading - for example, consider the case where you want an AJAX
response that contains a list of (author_name, book_title) tuples.
This structure may be useful to your AJAX application, but won't be
useful for recreating a list of Author and Book records in your
database.
My point is that a serialization format doesn't necessarily have to be
'round-trip'. The existing default format is, and needs to remain that
way. However, the corollary of this is that 'new format' serializers
don't necessarily need to be made available to loaddata/dumpdata.
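As a sketch of such a non-round-trip format (plain dicts stand in for the Author/Book querysets; the data is invented for illustration):

```python
import json

# Stand-in data for a Book queryset with a related Author.
books = [
    {"title": "Dive Into Python", "author": {"name": "Mark Pilgrim"}},
    {"title": "Django Essentials", "author": {"name": "A. N. Other"}},
]

# A flat (author_name, book_title) payload: handy for an AJAX consumer,
# but it has lost the pks, model labels, and field names that would be
# needed to recreate the Author and Book rows - i.e. not round-trip.
payload = json.dumps([(b["author"]["name"], b["title"]) for b in books])
print(payload)
```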
> The project begins with implementing a version-id field in the serialized
> output. This
> field is provided for backwards compatibility. Then it proceeds by
> converting the existing
> PK field which appears as
> {
> "pk": 1,
> "model": "testapp.choice2",
> #...
> to serialize the name of the PK field. I propose it to be presented as:
> {
> "pk": {
> "id": 1
> },
> "model": "testapp.choice2",
> #...
> This change is being proposed keeping in mind that David Crammer's patches
> for
> Ticket #373 gets into Django trunk sometime or the other, since it should
> happen as
> it is a long standing requirement. This representation allows for multiple
> PK fields to
> exist in the model and be serialized correctly.
This isn't a problem you need to worry about. If/when #373 lands, the
default serialization format will also have to change. We don't need
to pre-emptively change it, and the existing serialization format
would be entirely compatible with a world where multiple primary key
models exist. Remember - in the 'database serialization' case, we can
introspect model definitions to see when multiple primary keys exist,
so on deserialization, we will know when "pk" is a single value and
when it is a list/dict/whatever format eventuates to support multiple
primary keys.
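A minimal mimic of that introspection argument (the registry and helper here are hypothetical; in Django the pk field's name is available via Model._meta.pk.name):

```python
# The deserializer can look up the pk field's name from the model
# definition, so the wire format doesn't need to carry it explicitly.
MODEL_REGISTRY = {"testapp.choice2": {"pk_field": "id"}}

def resolve_pk(record):
    # Map the anonymous "pk" value back onto the named field.
    pk_name = MODEL_REGISTRY[record["model"]]["pk_field"]
    return {pk_name: record["pk"]}

print(resolve_pk({"pk": 1, "model": "testapp.choice2"}))  # {'id': 1}
```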
I can see how your proposal addresses #10295, but as I said earlier,
that's a very small scope version of a larger problem. I would suggest
directing your efforts at fixing the bigger problem, rather than a
cosmetic approach to #10295 that is backwards incompatible with all
existing fixtures.
> The second phase, the biggest phase, starts by implementing serializing of
> relations in
> depth. The APIs will be implemented for these things hand-in-hand as the
> features are
> being implemented. An API to specify, what relations to serialize, will be
> provided with
> "relations=(rel1, rel2, ...)" parameter to serialize. Also a parameter to
> specify
> "relation_depth=(N1, N2, ...)" will be provided to serialize the related
> models recursively
There are two ways to interpret ticket #4656, and you have picked my
least favourite of the two. :-)
Option 1 (my preferred interpretation) is to look at this as a
'gathering dependencies' interface to the existing serializers.
For example, you pass a Book object to the serializer. That book
contains a reference to an Author. That author contains a reference to
a City.
In the current serializers, your output fixture only contains the Book
object. This may be a useful fixture, but it has referential integrity
problems - the Book contains a FK reference to a non-existent author.
It would also be nice to be able to say "and also serialize all the
other objects that are required in order to reproduce this full
object" - that is, by serializing the Book, you automatically get all
the related Author and City records. At the moment, the only way to do
this is to dump the entire Book, Author and City tables, and prune out
any data you don't want.
Adding a 'select related' option to the existing serializers would
make it much easier to generate fixtures, or dump parts of a database,
and it requires no changes to the output format at all.
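As a rough illustration of option 1 (plain dicts stand in for model instances, and the "fk" list is a stand-in for walking real foreign key fields; none of these names are Django's):

```python
# Depth-first 'gathering dependencies' sketch: related objects are
# appended before the objects that reference them, giving a
# fixture-friendly load order with no referential integrity holes.
def gather(obj, seen=None):
    seen = seen if seen is not None else []
    if obj in seen:
        return seen
    for dep in obj.get("fk", []):
        gather(dep, seen)
    seen.append(obj)
    return seen

city = {"model": "app.city", "pk": 1, "fk": []}
author = {"model": "app.author", "pk": 1, "fk": [city]}
book = {"model": "app.book", "pk": 1, "fk": [author]}

print([o["model"] for o in gather(book)])
# ['app.city', 'app.author', 'app.book']
```

The existing serializers could then be handed the gathered list unchanged, which is why this interpretation requires no output-format changes.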
Option 2 (your interpretation), is to allow for inline serialization
of related models. All my previous arguments about output format
apply, along with all your arguments about redundant encoding of data.
If you solve the bigger problem of allowing flexible output formats,
then the need to hard-code embedded child model data goes away.
> The project is planned to be completed in 9 phases.
...
> 2. Finalizing Design and Coding Phase I (May 22th – May 31st )
> 3. Testing Phase I (June 1st – June 5th )
As a prior warning - I'm very skeptical of anyone who proposes a
"test" phase that isn't integrated with the "build" phase. If you're
not testing at the same time you are building, how do you know you
have the right result? And if you only test after you build, what happens
when your test reveals a problem with your implementation?
I know line items like this make accountant types happy, but it just
doesn't wash with me. If your implementation, including tests, will
take 3 weeks, then say 3 weeks. Don't say 2 weeks of implementation
followed by 1 week of testing.
> Lastly I want to express my deep commitment for this project and Django.
> I'm fully
> available this summer without any other commitments, will tune my day/night
> rhythm
> as per my mentor's requirement and assure a dedicated work of 35-40
> hours/week.
> Also I will assure that I will continue my commitments with Django well
> after GSoC.
> If you find any part of this proposal is not clear please contact me.
Thanks for taking the time to put together such a comprehensive
proposal. I hope my comments haven't left you too despondent. :-)
If I may offer 10c worth of advice: I see your proposal as containing
two real sections. Part 1 is a set of relatively small, but very
useful modifications to Django's serializers:
1) The dependency discovery interpretation of #4656
2) The ability to include non-model fields (#5711 is one small part
of this), computed fields, class properties, reverse relationships,
etc in serialization output.
3) Microsecond support in times (#10201)
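For item 2, a hypothetical 'extras' option in the spirit of #5711 might look like this (all names here are assumptions for illustration, not the ticket's actual patch):

```python
# A toy model with a computed property that a plain field-based
# serializer would never pick up.
class Book:
    def __init__(self, title, price):
        self.title = title
        self.price = price

    @property
    def price_with_tax(self):
        return round(self.price * 1.10, 2)

def serialize(obj, fields, extras=()):
    # 'extras' lets the caller name non-model attributes (properties,
    # computed values) to include alongside the regular fields.
    data = {f: getattr(obj, f) for f in fields}
    data.update({e: getattr(obj, e) for e in extras})
    return data

print(serialize(Book("Dune", 10.0), ["title"], extras=["price_with_tax"]))
# {'title': 'Dune', 'price_with_tax': 11.0}
```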
You are absolutely correct that these can and should be fixed, and
they are well suited to a SoC project (well understood, well scoped
extensions to existing functionality). These changes would require
only minor (and entirely backwards compatible) modifications to the
existing serializers. By itself, Part 1 isn't enough to fill a SoC -
it's maybe a couple of weeks of effort, if you include the "start of
project bedding in" complications, plus all documentation and testing.
Part 2 is a much bigger problem - the ability to easily specify
alternate serialization formats. This encompasses #10295 and the
embedded-rendering interpretation of #4656, but has much larger
ambitions. The approaches you have proposed would enable us to close
those specific tickets, but don't really address the bigger problem.
To that end, I'd rather leave those tickets open as unsolved problems
until we can come up with a comprehensive solution that addresses the
real problem.
Part 2 is a much bigger body of work, but it needs a concrete proposal
first. This problem is something that has been bouncing around in the
back of my head for a while, but it hasn't really got to the point
where I have any concrete proposal to present to you as a complete
API.
However, all is not lost. While it would be advantageous to have a
complete API proposal before starting work, it isn't completely
necessary. What would be necessary at a minimum is a set of use cases
to provide some sort of scope for what you would like to achieve
(i.e., develop a serialization API that would allow for the following
serialization use cases). Once we have a set of use cases, we can
establish the options that we have for an API, and develop that API
during the 'getting to know you' phase, and even during the initial
development phase of the GSoC project.
Of course, if you already have any ideas on how to specify
user-customizable serialization formats, feel free to knock our socks
off :-)
Yours,
Russ Magee %-)