Most likely because it will lead to duplicate data when you dump the
models for a particular app. Often the parent and child are in the same
application and you'll see the data from the parent in two places.
*Very* fiddly to untangle. It might be possible to add an option so that
parent data is optionally dumped when you dump a specific model (as
opposed to the whole app).
> Will it be a workable and necessary solution to add that to my
> proposal?
> Same is the case for Ticket #10201. Can someone please tell me why
> microsecond data was dropped?
Quite probably because MySQL (or possibly just the MySQLdb wrapper)
sucks and can't support microsecond data when it's in a datetime value.
So reinserting that data requires yet another place where we have to set
microseconds to 0. It's one of those cases where we've adopted the
lowest common denominator.
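For illustration, a minimal sketch of that lowest-common-denominator behaviour (the helper name is hypothetical, not Django's actual code):

```python
from datetime import datetime

def to_mysql_datetime(dt):
    # Hypothetical helper: MySQL's DATETIME (as exposed via MySQLdb at
    # the time) has no sub-second precision, so the safe cross-database
    # behaviour is to zero out microseconds before the value is stored
    # or re-serialized.
    return dt.replace(microsecond=0)

dt = datetime(2009, 4, 1, 12, 30, 45, 123456)
print(to_mysql_datetime(dt).isoformat())  # 2009-04-01T12:30:45
```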
Regards,
Malcolm
Why manually wrap it at all? Email clients have been able to handle
wrapping lines sensibly on behalf of the sender for about 20 years now.
Just type normally and only hit Return between paragraphs.
Regards,
Malcolm
> > Will it be a workable and necessary solution to add that to my
> > proposal?
> > Same is the case for Ticket #10201. Can someone please tell me why
> > microsecond data was dropped?
> Quite probably because MySQL (or possibly just the MySQLdb wrapper)
> sucks and can't support microsecond data when it's in a datetime value.
> So reinserting that data requires yet another place where we have to set
> microseconds to 0. It's one of those cases where we've adopted the
> lowest common denominator.
Malcolm has already addressed this, and his analysis is pretty much
spot on. I would only add that the current behaviour can also be
explained by looking at the heritage of the fixture system.
Historically, Django's fixtures have been used as a way of serializing
output for transfer between two Django installations (for example, as
test fixtures). To this end, the serializers have concentrated on
replicating a very database-like structure - that is, the structures
that are serialized closely match the underlying database structures.
In an inheritance situation, child tables don't contain all the data
from the parent table; hence, neither do the serialized structures.
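To illustrate with hand-written dicts approximating fixture records (the model names are made up, and these are a simplified sketch of the serialized shape, not literal dumpdata output):

```python
# Under multi-table inheritance, the child's table holds only its own
# columns plus a pointer to the parent row - so a serializer that
# mirrors table structure emits the parent's fields only on the parent
# record, never duplicated onto the child.
parent_row = {
    "pk": 1,
    "model": "app.place",
    "fields": {"name": "Cafe", "address": "Main St"},
}
child_row = {
    "pk": 1,
    "model": "app.restaurant",
    "fields": {"serves_pizza": True},  # no name/address here
}
```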
Obviously, this focus on representing the database misses an
alternate use case - occasions where serialization is required to
communicate with some other data consumer, such as an AJAX framework. In
my 'big picture' of the ideal serialization SoC project, this is the
problem that needs to be fixed. More on this in later comments.
> Same is the case for Ticket #10201. Can someone please tell me why
> microsecond data was dropped?
Again, Malcolm is on the money. If you can come up with a fix that
enables databases that aren't microsecond-deprived to retain microseconds,
I'm sure it would be a welcome inclusion. Thinking about it, this
shouldn't actually be that hard to achieve.
> Also I am leaving adding extras option to serializers since a patch for it
> has already been submitted(Ticket #5711) and looks like a working
> solution. If you all want something extra to be done there to
> commit it to django trunk, please tell me, I will work on that a bit
> and add it to the proposal.
If you are intending to take on "updating the serializers" as a SoC
project, I would encourage you to include #5711 as part of your
proposal. There may be a patch on #5711, and it may be the right
solution, but the patch isn't even close to being ready for trunk -
for one thing, there are no tests or documentation. Finishing the work
on this ticket would be a very worthwhile contribution.
> Here is my long long long proposal:
...
> ~~~~~~~
> Why?
> ~~~~~~~
>
> - The existing format of the serialized output firstly doesn't specify the
> name of the
> Primary Key(PK henceforth), which is a problem for fields which are
> implicitly set
> as PKs (Ticket #10295).
This ticket is a very small part of a bigger problem. The fact that
the primary key isn't named in the serialization format is of no
consequence to the 'database replication' role for serializers,
evidenced by the extensive test suite that demonstrates round trip
fixture loading. It is only significant when you need to support some
alternate data consumer that needs to know the name of the primary
key.
The bigger issue is that we need to be able to easily reconfigure the
output format of serializers to suit the specific requirements of
other data consumers.
> - The existing format only specifies the PK of the related field, but
> doesn't traverse it
> in depth to specify its fields (Ticket #4656).
> - There are no APIs for the above said requirement.
> - The inherited models fields are not serialized.
Again, these are just variations on the same theme. The real problem
is being able to easily reconfigure the output format.
> Situations/problems arising from attempting to fix the above problems
> - When we allow Serialization to follow relations, it becomes unnatural if
> the related Model is included in every relating model data. The data
> becomes extremely redundant. Consider the following example.
It may be redundant, but it may also be required, depending on
circumstance. This is something that needs to be left in the hands of
the end-user.
> - The way loaddata and dumpdata are handled is changed. The new version of
> this
> loaddata and dumpdata may not be compatible with the fixtures generated
> from
> older versions.
The current serialization format is well known, well understood, and
well suited to the task it was designed to perform. As a result, I
would expect that this format would remain as the 'default' format for
Django fixtures, and be entirely backwards compatible without extra
options/flags.
However, there is an obvious need for alternate formats. These formats
may be dramatically different from the current serialization formats,
and certain output formats may not contain enough data to be used for
later loading - for example, consider the case where you want an AJAX
response that contains a list of (author_name, book_title) tuples.
This structure may be useful to your AJAX application, but won't be
useful for recreating a list of Author and Book records in your
database.
My point is that a serialization format doesn't necessarily have to be
'round-trip'. The existing default format is, and needs to remain that
way. However, the corollary of this is that 'new format' serializers
don't necessarily need to be made available to loaddata/dumpdata.
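As a sketch of such a non-round-trip format (plain dicts stand in for the Author/Book querysets; the data is invented for illustration):

```python
import json

# Stand-in data for a Book queryset with a related Author.
books = [
    {"title": "Dive Into Python", "author": {"name": "Mark Pilgrim"}},
    {"title": "Django Essentials", "author": {"name": "A. N. Other"}},
]

# A flat (author_name, book_title) payload: handy for an AJAX consumer,
# but it has lost the pks, model labels, and field names that would be
# needed to recreate the Author and Book rows - i.e. not round-trip.
payload = json.dumps([(b["author"]["name"], b["title"]) for b in books])
print(payload)
```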
> The project begins with implementing a version-id field in the serialized
> output. This
> field is provided for backwards compatibility. Then it proceeds by
> converting the existing
> PK field which appears as
> {
> "pk": 1,
> "model": "testapp.choice2",
> #...
> to serialize the name of the PK field. I propose it to be presented as:
> {
> "pk": {
> "id": 1
> },
> "model": "testapp.choice2",
> #...
> This change is being proposed keeping in mind that David Crammer's patches
> for
> Ticket #373 gets into Django trunk sometime or the other, since it should
> happen as
> it is a long standing requirement. This representation allows for multiple
> PK fields to
> exist in the model and be serialized correctly.
This isn't a problem you need to worry about. If/when #373 lands, the
default serialization format will also have to change. We don't need
to pre-emptively change it, and the existing serialization format
would be entirely compatible with a world where multiple primary key
models exist. Remember - in the 'database serialization' case, we can
introspect model definitions to see when multiple primary keys exist,
so on deserialization, we will know when "pk" is a single value and
when it is a list/dict/whatever format eventuates to support multiple
primary keys.
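A minimal mimic of that introspection argument (the registry and helper here are hypothetical; in Django the pk field's name is available via Model._meta.pk.name):

```python
# The deserializer can look up the pk field's name from the model
# definition, so the wire format doesn't need to carry it explicitly.
MODEL_REGISTRY = {"testapp.choice2": {"pk_field": "id"}}

def resolve_pk(record):
    # Map the anonymous "pk" value back onto the named field.
    pk_name = MODEL_REGISTRY[record["model"]]["pk_field"]
    return {pk_name: record["pk"]}

print(resolve_pk({"pk": 1, "model": "testapp.choice2"}))  # {'id': 1}
```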
I can see how your proposal addresses #10295, but as I said earlier,
that's a very small scope version of a larger problem. I would suggest
directing your efforts at fixing the bigger problem, rather than a
cosmetic approach to #10295 that is backwards incompatible with all
existing fixtures.
> The second phase, the biggest phase, starts by implementing serializing of
> relations in
> depth. The APIs will be implemented for these things hand-in-hand as the
> features are
> being implemented. An API to specify, what relations to serialize, will be
> provided with
> "relations=(rel1, rel2, ...)" parameter to serialize. Also a parameter to
> specify
> "relation_depth=(N1, N2, ...)" will be provided to serialize the related
> models recursively
There are two ways to interpret ticket #4656, and you have picked my
least favourite of the two. :-)
Option 1 (my preferred interpretation) is to look at this as a
'gathering dependencies' interface to the existing serializers.
For example, you pass a Book object to the serializer. That book
contains a reference to an Author. That author contains a reference to
a City.
In the current serializers, your output fixture only contains the Book
object. This may be a useful fixture, but it has referential integrity
problems - the Book contains a FK reference to a non-existent author.
It would also be nice to be able to say "and also serialize all the
other objects that are required in order to reproduce this full
object" - that is, by serializing the Book, you automatically get all
the related Author and City records. At the moment, the only way to do
this is to dump the entire Book, Author and City tables, and prune out
any data you don't want.
Adding a 'select related' option to the existing serializers would
make it much easier to generate fixtures, or dump parts of a database,
and it requires no changes to the output format at all.
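As a rough illustration of option 1 (plain dicts stand in for model instances, and the "fk" list is a stand-in for walking real foreign key fields; none of these names are Django's):

```python
# Depth-first 'gathering dependencies' sketch: related objects are
# appended before the objects that reference them, giving a
# fixture-friendly load order with no referential integrity holes.
def gather(obj, seen=None):
    seen = seen if seen is not None else []
    if obj in seen:
        return seen
    for dep in obj.get("fk", []):
        gather(dep, seen)
    seen.append(obj)
    return seen

city = {"model": "app.city", "pk": 1, "fk": []}
author = {"model": "app.author", "pk": 1, "fk": [city]}
book = {"model": "app.book", "pk": 1, "fk": [author]}

print([o["model"] for o in gather(book)])
# ['app.city', 'app.author', 'app.book']
```

The existing serializers could then be handed the gathered list unchanged, which is why this interpretation requires no output-format changes.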
Option 2 (your interpretation), is to allow for inline serialization
of related models. All my previous arguments about output format
apply, along with all your arguments about redundant encoding of data.
If you solve the bigger problem of allowing flexible output formats,
then the need to hard-code embedded child model data goes away.
> The project is planned to be completed in 9 phases.
...
> 2. Finalizing Design and Coding Phase I (May 22th – May 31st )
> 3. Testing Phase I (June 1st – June 5th )
As a prior warning - I'm very skeptical of anyone who proposes a
"test" phase that isn't integrated with the "build" phase. If you're
not testing at the same time you are building, how do you know you
have the right result? And if you only test after you build, what happens
when your test reveals a problem with your implementation?
I know line items like this make accountant types happy, but it just
doesn't wash with me. If your implementation, including tests, will
take 3 weeks, then say 3 weeks. Don't say 2 weeks of implementation
followed by 1 week of testing.
> Lastly I want to express my deep commitment for this project and Django.
> I'm fully
> available this summer without any other commitments, will tune my day/night
> rhythm
> as per my mentor's requirement and assure a dedicated work of 35-40
> hours/week.
> Also I will assure that I will continue my commitments with Django well
> after GSoC.
> If you find any part of this proposal is not clear please contact me.
Thanks for taking the time to put together such a comprehensive
proposal. I hope my comments haven't left you too despondent. :-)
If I may offer 10c worth of advice: I see your proposal as containing
two real sections. Part 1 is a set of relatively small, but very
useful modifications to Django's serializers:
1) The dependency discovery interpretation of #4656
2) The ability to include non-model fields (#5711 is one small part
of this), computed fields, class properties, reverse relationships,
etc in serialization output.
3) Microsecond support in times (#10201)
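For item 2, a hypothetical 'extras' option in the spirit of #5711 might look like this (all names here are assumptions for illustration, not the ticket's actual patch):

```python
# A toy model with a computed property that a plain field-based
# serializer would never pick up.
class Book:
    def __init__(self, title, price):
        self.title = title
        self.price = price

    @property
    def price_with_tax(self):
        return round(self.price * 1.10, 2)

def serialize(obj, fields, extras=()):
    # 'extras' lets the caller name non-model attributes (properties,
    # computed values) to include alongside the regular fields.
    data = {f: getattr(obj, f) for f in fields}
    data.update({e: getattr(obj, e) for e in extras})
    return data

print(serialize(Book("Dune", 10.0), ["title"], extras=["price_with_tax"]))
# {'title': 'Dune', 'price_with_tax': 11.0}
```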
You are absolutely correct that these can and should be fixed, and
they are well suited to a SoC project (well understood, well scoped
extensions to existing functionality). These changes would require
only minor (and entirely backwards compatible) modifications to the
existing serializers. By itself, Part 1 isn't enough to fill a SoC -
it's maybe a couple of weeks of effort, if you include the "start of
project bedding in" complications, plus all documentation and testing.
Part 2 is a much bigger problem - the ability to easily specify
alternate serialization formats. This encompasses #10295 and the
embedded-rendering interpretation of #4656, but has much larger
ambitions. The approaches you have proposed would enable us to close
those specific tickets, but don't really address the bigger problem.
To that end, I'd rather leave those tickets open as unsolved problems
until we can come up with a comprehensive solution that addresses the
real problem.
Part 2 is a much bigger body of work, but it needs a concrete proposal
first. This problem is something that has been bouncing around in the
back of my head for a while, but it hasn't really got to the point
where I have any concrete proposal to present to you as a complete
API.
However, all is not lost. While it would be advantageous to have a
complete API proposal before starting work, it isn't completely
necessary. What would be necessary at a minimum is a set of use cases
to provide some sort of scope for what you would like to achieve
(i.e., develop a serialization API that would allow for the following
serialization use cases). Once we have a set of use cases, we can
establish the options that we have for an API, and develop that API
during the 'getting to know you' phase, and even during the initial
development phase of the GSoC project.
Of course, if you already have any ideas on how to specify
user-customizable serialization formats, feel free to knock our socks
off :-)
Yours,
Russ Magee %-)