[GSoC] Proposal for discussion about Serialization requirements and requesting for Review

Madhusudan C.S

Mar 26, 2009, 12:48:34 PM
to Django developers
Hi all,
    After some discussions with Malcolm on this list and doing some
research based on the pointers he gave me I have come up with a
rough plan of what I want to do this summer for Django. Since we
are running out of time, I have come up with a *rough draft* of the
proposal without full discussion with the Django community about the
features that can be implemented. So this is in no way a *Complete
Proposal* and I don't want to submit until some discussion on this
happens really. Also the required proposal format asks to put the
links of the devel list discussions that led to the proposal, which I don't
have except Malcolm's mails. So I kindly request you all to review my
proposal thoroughly and suggest what I should add to or remove from
it. Please also tell me whether my propositions and assumptions are
correct, and how to fix them where they are not, so that I can submit
my proposal to Google.

*Note: *
  Django doesn't serialize inherited Model fields in the Child Model. I asked
on IRC why this decision was taken but got no response. I searched the
devel list too, but did not get anything on it. I want to add it to my
proposal, but before doing it I wanted to know why this decision was
taken. Will it be a workable and necessary solution to add that to my
proposal?
Same is the case for Ticket #10201. Can someone please tell me why
microsecond data was dropped?

  Also, I am leaving the extras option for serializers out of the proposal,
since a patch for it has already been submitted (Ticket #5711) and looks like
a working solution. If you want something more done there before it can be
committed to Django trunk, please tell me; I will work on that a bit
and add it to the proposal.

Here is my long long long proposal:

Title: Restructuring of existing Serialization format and improvement of APIs

~~~~~~~~~
Abstract
~~~~~~~~~

Greetings!

   I wish to give Django better support for Serialization by building upon the
existing Serialization framework. This project includes extending the format of the
Serialized output that the existing Serializer produces by allowing in-depth traversal
of Relation Fields in a given Model. The project also includes extending the existing
API to specify the depth of the relations to be serialized and the names of the related
models to be serialized. The API also provides backwards compatibility so that older
versions of serialized output keep working with the changes to be introduced. All the
changes will be made keeping in mind 2 important things.
   1. All the changes should be backwards compatible (this can only be broken when a
     very important requirement that improves the serialization manyfold cannot be
     implemented without making backwards incompatible changes and the Django
     community gives a green signal for doing so).
   2. The serialized data should be useful not just within Django apps but
     also for exporting the data for external use and processing.

~~~~~~~
Why?
~~~~~~~

- The existing format of the serialized output firstly doesn't specify the name of the
  Primary Key(PK henceforth), which is a problem for fields which are implicitly set
  as PKs (Ticket #10295).
- The existing format only specifies the PK of the related field, but doesn't traverse it
  in depth to specify its fields (Ticket #4656).
- There are no APIs for the above requirements.
- Fields inherited from a parent Model are not serialized with the child Model.

Situations/problems arising from attempting to fix the above problems:
- When we allow Serialization to follow relations, it becomes unnatural if
  the related Model is included in every relating model data. The data
  becomes extremely redundant. Consider the following example.
 
  class Poll2(models.Model):
      question = models.CharField(max_length=200)
      pub_date = models.DateTimeField('date published')

      def __unicode__(self):
          return self.question


  class Choice2(models.Model):
      poll = models.ForeignKey(Poll2)
      choice = models.CharField(max_length=200)
      votes = models.IntegerField()

      def __unicode__(self):
          return self.choice
 
  Serializing the Choice2 Model might look something like the following if we
allow following of relations:
[
    {
        "pk": 1,
        "model": "testapp.choice2",
        "fields": {
            "votes": 1,
            "poll": [
                {
                    "pk": 1,
                    "model": "testapp.poll2",
                    "fields": {
                        "question": "What's Up?",
                        "pub_date": "2009-03-01 06:00:00"
                    }
                }
            ],
            "choice": "Django"
        }
    },
    {
        "pk": 2,
        "model": "testapp.choice2",
        "fields": {
            "votes": 2,
            "poll": [
                {
                    "pk": 1,
                    "model": "testapp.poll2",
                    "fields": {
                        "question": "What's Up?",
                        "pub_date": "2009-03-01 06:00:00"
                    }
                }
            ],
            "choice": "Python"
        }
    },
    {
        "pk": 3,
        "model": "testapp.choice2",
        "fields": {
            "votes": 4,
            "poll": [
                {
                    "pk": 1,
                    "model": "testapp.poll2",
                    "fields": {
                        "question": "What's Up?",
                        "pub_date": "2009-03-01 06:00:00"
                    }
                }
            ],
            "choice": "Others are useless"
        }
    }
]
  which clearly shows the redundant Poll2 data. Here we are serializing Choice2, of
  course, but serializing Poll2 instead does not give a natural output either: the
  serialized Poll2 contains nothing pertaining to its Choice2 instances. A more
  natural serialization would result from serializing a Poll2 instance that includes
  within itself all the Choice2 instances related to it. This is an obvious
  consequence of how Database schemas are designed by applying Normalization rules.

- The way loaddata and dumpdata are handled changes. The new versions of
  loaddata and dumpdata may not be compatible with fixtures generated by
  older versions.

Most of the above problems have been addressed in the tickets specified, but
the patches need to be dealt with more thoroughly after discussion with the Django
community in general. So design decisions need to be taken for fixing most of the
tickets (which I will do in the community bonding phase).

~~~~~~~
How?
~~~~~~~

  The project begins with implementing a version-id field in the serialized output. This
field is provided for backwards compatibility. Then it proceeds by converting the existing
PK field which appears as
{
    "pk": 1,
    "model": "testapp.choice2",
    #...
to serialize the name of the PK field. I propose it to be presented as:
{
    "pk": {
        "id": 1
    },
    "model": "testapp.choice2",
    #...
  This change is proposed on the assumption that David Cramer's patches for
Ticket #373 get into Django trunk sooner or later, which should happen since
it is a long standing requirement. This representation allows multiple PK fields to
exist in the model and be serialized correctly.
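
A minimal sketch of the serializing side of this PK layout, assuming an object's values are available as a plain mapping. The helper name serialize_pk and the two-column example are illustrative assumptions, not Django API and not taken from the Ticket #373 patches.

```python
def serialize_pk(values, pk_field_names):
    """Build the proposed "pk" mapping: PK field name -> value.

    `values` maps field names to values for one object and
    `pk_field_names` lists its PK field(s). Both are hypothetical
    stand-ins for what a real serializer would read from the model.
    """
    return {name: values[name] for name in pk_field_names}


# Single-column PK, as in the example above.
single = serialize_pk({"id": 1, "choice": "Django"}, ("id",))

# A hypothetical two-column PK, which a bare "pk": 1 could not express.
composite = serialize_pk(
    {"poll": 1, "position": 2, "votes": 4}, ("poll", "position")
)
```

With this shape a deserializer no longer has to guess the PK field name, and a multi-column key needs no further format change.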

  The corresponding changes in the deserializers to process this data will also be made
at this stage. The implementation touches the following parts of Django:
django.core.serializers.python.Serializer.end_object()
django.core.serializers.xml_serializer.Serializer.start_serialization() [It already implements version.]
and related methods and files.

  The project proceeds by splitting the serializer into 2 versions, one handling the
older format and one handling the new format of the serialized output. Which version
of the serializer to use is selected by passing an "old_version=True" parameter to the
serialize method. The deserialize method, however, can decide this by itself by looking
for the new version-id. The django-admin.py loaddata and dumpdata commands will also
get an --old_version option.
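
The version-based choice on the deserializing side could be sketched as below. This assumes the version id is stored under a "version" key on each record and that records without it are in the old format; the key name, the version numbers and the function names are all assumptions made for illustration.

```python
def read_old(record):
    # Existing layout: "pk" is a bare value; the field name is assumed to be "id".
    return {"pk": {"id": record["pk"]}, "fields": record["fields"]}


def read_new(record):
    # Proposed layout: "pk" already maps field names to values.
    return {"pk": dict(record["pk"]), "fields": record["fields"]}


# Hypothetical version ids; records without one are treated as the old format.
READERS = {1: read_old, 2: read_new}


def read_record(record):
    """Dispatch to the reader that matches the record's version id."""
    version = record.get("version", 1)
    if version not in READERS:
        raise ValueError("unsupported serialization version: %r" % (version,))
    return READERS[version](record)
```

Both readers return the same internal shape, so the rest of a deserializer would not need to know which format it was given.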

  The second phase, the biggest phase, starts by implementing serialization of
relations in depth. The APIs will be implemented hand-in-hand with the features. A
"relations=(rel1, rel2, ...)" parameter to serialize will specify which relations to
serialize, and a "relation_depth=(N1, N2, ...)" parameter will specify the depth N up
to which each related model is serialized recursively. Skipping "relations=" means
serializing all the related models of a given model, and skipping "relation_depth="
means serializing to full depth. Skipping both serializes just the PK of the related
models (old style). Further selection of the fields to be serialized in the individual
related models is provided with a DjangoFullSerializers-like syntax, using
dictionaries. An exclude-fields option will be given, similar to DjangoFullSerializers.
Link to DjangoFullSerializers: http://code.google.com/p/wadofstuff/wiki/DjangoFullSerializers
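
A rough sketch of depth-limited relation following, using plain dictionaries in place of model instances so that it runs without Django. It simplifies the proposal in two labeled ways: depth is a single integer rather than one depth per relation, and depth=0 stands for the old PK-only behaviour. The function name serialize_obj is hypothetical.

```python
def serialize_obj(obj, relations=None, depth=0):
    """Serialize obj, following relation fields up to `depth` levels.

    obj is a plain dict standing in for a model instance:
    {"model": ..., "pk": ..., "fields": {...}}. A relation is a field
    whose value is another such dict. relations=None follows every
    relation; depth=0 emits only the related PK (old style).
    """
    out = {"pk": obj["pk"], "model": obj["model"], "fields": {}}
    for name, value in obj["fields"].items():
        if not isinstance(value, dict):
            out["fields"][name] = value  # ordinary field
        elif depth > 0 and (relations is None or name in relations):
            # Follow the relation one level deeper.
            out["fields"][name] = serialize_obj(value, relations, depth - 1)
        else:
            out["fields"][name] = value["pk"]  # old style: PK only
    return out


city = {"model": "app.city", "pk": 7, "fields": {"cityname": "Bangalore"}}
author = {"model": "app.author", "pk": 3, "fields": {"name": "A", "city": city}}
book = {"model": "app.book", "pk": 1, "fields": {"title": "T", "author": author}}
```

Calling serialize_obj(book) gives the existing PK-only output, while depth=2 expands both the author and the author's city.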

  This phase proceeds by providing an optional API parameter, "reverse_relation=[rel1,
rel2]", that is given for the Related Model (Poll2 in the example) rather than for the
Model that relates to it (Choice2). This does a reverse relation lookup, and for each
Related Model instance it serializes all the instances that relate to it, which solves
the data redundancy problem described above. The output looks something like below
if serialized as: serializers.serialize('json', Poll2.objects.all(), reverse_relation=('choice2',))
[
    {
        "pk": 1,
        "model": "testapp.poll2",
        "fields": {
            "question": "What's Up?",
            "pub_date": "2009-03-01 06:00:00"
        },
        "testapp.choice2": [
            {
                "pk": 2,
                "model": "testapp.choice2",
                "fields": {
                    "votes": 2,
                    "choice": "Python"
                }
            },
            {
                "pk": 3,
                "model": "testapp.choice2",
                "fields": {
                    "votes": 4,
                    "choice": "Django"
                }
            }
        ]
    }
]
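
The grouping behind this output can be sketched with plain dictionaries as follows. The function name attach_reverse and its arguments are hypothetical, and the sketch works on already-serialized records rather than on querysets.

```python
from collections import defaultdict


def attach_reverse(parents, children, fk_name, child_label):
    """Nest each child record under the parent its FK points to.

    The FK field is dropped from the nested child because the nesting
    itself carries that information. Input records are not modified.
    """
    by_parent = defaultdict(list)
    for child in children:
        fields = dict(child["fields"])
        parent_pk = fields.pop(fk_name)
        by_parent[parent_pk].append(
            {"pk": child["pk"], "model": child["model"], "fields": fields}
        )
    result = []
    for parent in parents:
        record = dict(parent)
        record[child_label] = by_parent.get(parent["pk"], [])
        result.append(record)
    return result


polls = [{"pk": 1, "model": "testapp.poll2",
          "fields": {"question": "What's Up?", "pub_date": "2009-03-01 06:00:00"}}]
choices = [
    {"pk": 1, "model": "testapp.choice2",
     "fields": {"poll": 1, "choice": "Django", "votes": 1}},
    {"pk": 2, "model": "testapp.choice2",
     "fields": {"poll": 1, "choice": "Python", "votes": 2}},
]
nested = attach_reverse(polls, choices, "poll", "testapp.choice2")
```

Each poll appears once and its choices are listed under it, which is the redundancy reduction the reverse relation is meant to give.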

  This becomes extremely useful when we are exporting data for external processing.
As far as deserializers are concerned, in this case they check whether the serialized
data has any other app.modelname key outside the fields dictionary. If such keys
exist, they are treated as reverse_relation data, and the deserializer constructs both
the Poll2 Model objects and the Choice2 Model objects. Calling save() should recursively
save all the instances. This implementation may not be as easy as it looks. It requires
a lot of design decisions to be taken before implementing these changes.
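
One way the deserializer-side detection could look, again on plain dictionaries. The set of reserved keys, the rule that an app.modelname key contains a dot, and the fk_names mapping (which a real implementation would derive from model metadata) are assumptions made for this sketch.

```python
RESERVED_KEYS = {"pk", "model", "fields"}


def split_reverse_relations(record):
    """Separate a record's own keys from nested reverse-relation data."""
    reverse = {
        key: value for key, value in record.items()
        if key not in RESERVED_KEYS and "." in key
    }
    own = {key: value for key, value in record.items() if key not in reverse}
    return own, reverse


def flatten(record, fk_names):
    """Return the parent record followed by its children, FK restored.

    fk_names maps a child label such as "testapp.choice2" to the name
    of the FK field that points back to the parent.
    """
    own, reverse = split_reverse_relations(record)
    flat = [own]
    for label, children in reverse.items():
        for child in children:
            restored = dict(child)
            restored["fields"] = dict(child["fields"])
            restored["fields"][fk_names[label]] = own["pk"]
            flat.append(restored)
    return flat
```

The flat list is in the existing one-record-per-object shape, so the parent can be saved first and the children afterwards.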

  The above implementation requires changes to the
serializers.base.Serializer.serialize method to handle the newly added parameters.
Reverse lookups will be added here. In-depth serialization of relations will be handled
by possibly adding new methods to the Base class that return the required data. These
methods return the data of multi-level relations recursively, possibly by "yield"ing.
A DeserializedObject.reversed_objects attribute is added to hold the list of reverse
relation instances. The <format>.Deserializers will construct the Model objects by
taking into account only the fields of the current model, not the fields of the related
models. From the related model data they use just the PK field.

  The loaddata and dumpdata commands will optionally use reverse_relations
when given the option --natural. This helps to dump the data with the least redundancy
for exporting.

~~~~~~~~~~~~~~~~~
Benefits to Django
~~~~~~~~~~~~~~~~~
  By the end of this project, Django will have better support for Serialization. It
will support the much requested feature of in-depth Serialization, thereby fixing
ticket #4656. It also fixes #10295. Fixtures and Serialized data become more convenient
to use in Django and externally because Data Redundancy is reduced. Finally, there will
be better API support for all the newly introduced features. The serialized data is made
more generic, keeping in mind possible future additions like multiple PK support, as
well as backwards compatibility.

~~~~~~~~~~~~
Deliverables
~~~~~~~~~~~~
  1. Internal implementation and code for in-depth serialization, reverse relation
    serialization and additional fields.
  2. Additional APIs to support in-depth serialization, to specify relation depth for
    serialization, support for PK field name in the Serialized output and version id.
  3. Also APIs for reverse relations serialization.
  4. Additional options to loaddata and dumpdata commands.
  5. Test Cases for all the newly introduced features.

  Non-Code deliverables include testing performed at 3 different phases to verify
correctness and backwards compatibility, and detailed user and developer
documentation for the new Serializer implementations.

~~~~~~~~
When?
~~~~~~~~

  The project is planned to be completed in 9 phases. Every phase includes documenting
the progress during that phase. The timeline for each of these phases is given below:
  1. Design Decisions and Initial preparation(Community Bonding Period : Already started -
      May 22nd )
        Closely working with Django community to learn more about Django in depth,
        learning code structure of Django, reading documentations related to Django
        internals, reading and understanding the code base of ORM and Serializers
        in depth, reading about other system's Serializers. Communicating and discussing
        with the community about the outstanding issues to resolve the accepted
        tickets. Design decisions I propose are discussed and finalized.

  2. Finalizing Design and Coding Phase I (May 22nd – May 31st )
        Discussions with the Django community in general and my mentor to finalize the
        design decisions for the major portion of the project. Documenting the design
        decisions. Implementing Version-id, PK changes in Serializers and implementing
        deserializers to parse the same. Serializers and deserializers will be split
        to handle both the versions (old and new).
 
  3. Testing Phase I (June 1st – June 5th )
        Writing new test cases and adjusting the existing test cases to make sure
        the phase I changes don't break Django in any way.

  4. Coding Phase II (June 6th – June 21st )
        Serializing relations in-Depth will be implemented in this phase, also the
        corresponding APIs will be added as mentioned in the Details section. Changes
        and additions will be made to both serializers and deserializers for this. Also
        corresponding changes are made for fixtures.

  5. Testing Phase II (June 22nd – June 29th )
        New test cases will be added to ensure Django is still fully backwards
        compatible and the new features pass the test too.

  6. Coding Phase III (June 30th – July 18th )
        Reverse relations serialization will be added. Relevant APIs will be
        implemented. Additions to DeserializedObject and save will be made to contain
        and save reversed_objects. These will be implemented for fixtures too.
        Mid Term evaluations happen during this phase.

  7. Testing Phase III (July 19th – July 26th )
        New test cases will be added for testing reverse relations serialization and
        backwards compatibility.

  8. Requesting for community wide Reviews, testing and evaluation
    (July 27th – August 2nd )
        Final phase of testing of the overall project, obtaining and consolidating the
        results and evaluation of the results. Requesting community to help me in
        final testing.

  9.  Scrubbing Code, Wrap-Up, Documentation (August 3rd – August 10th )
        Fixing major and minor bugs if any and merging the project with the Django
        SVN Trunk. Writing and finalizing the User and Developer documentation.

~~~~~~~~~~~~~~
Where?
~~~~~~~~~~~~~~

   I am already comfortable with the django-devel mailing list and the IRC channel
#djang...@freenode.net. I will be able to contact my mentor through both of these
and will also be available through google-talk (jabber). I am also comfortable
with svn, git and mercurial, since I was the SVN administrator for 2 academic projects
and the git administrator for 1 project.

~~~~~~~~~~
Why Me?
~~~~~~~~~~

  I am a 4th Year undergraduate student pursuing Information Science and Engineering
as a major at BMSCE, Bangalore, India (IST). I have been using and advocating Free and
Open Source Software for the past 5 years, and have been one of the main coordinators of
BMSLUG. I have given various talks and conducted workshops on FOSS tools:
- Most importantly, I recently conducted a Python and *Django* workshop for beginners at
  NIT, Calicut, a premier Institution.
- How to contribute to FOSS? - A Hands-On hackathon using GNUSim8085 as example.
  http://groups.google.com/group/bms-lug/browse_thread/thread/0c9ca2367966727a
- Have been actively participating in various FOSS Communities by reporting bugs to
  communities like Ubuntu, GNOME, RTEMS, KDE.
- I was a major contributor and writer of the KDE's first-ever Handbook.
http://img518.imageshack.us/img518/9796/hb1o.png
http://img518.imageshack.us/img518/4296/hb2.png

I have been contributing patches and code to various FOSS communities, major ones being:
- GNUSim8085 (http://is.gd/p5wZ , http://is.gd/p5xK)
- KDE Step (http://is.gd/oci7)
- RTEMS
- Melange (The GSoC Web App. http://code.google.com/p/soc/source/browse/trunk/AUTHORS)

My Django Work:
I was interested in contributing to Django even before GSoC came to my attention. I
discussed Ticket #373 with David Cramer on #django-dev. I read the Django ORM code
required for that, but could not write any code myself because of University coursework.
I have had some discussions about fixing ticket #8161 on the django-devel list
(http://is.gd/obr2), but unfortunately it was fixed before I got to it. So I am applying
for GSoC as I feel it lowers the barrier to get started.
http://groups.google.com/group/django-developers/browse_thread/thread/5461dae3cf8d5d6a

   I have a fair understanding of Python concepts and have one and a half years of
Python experience. I have a fair understanding of the Django ORM code because of my
previous work. I am getting used to the Serialization code as I write this proposal and
have no problems with it. I have also been using Django for 1 year for some of my web apps.

   Since I have been working with FOSS communities I have a good understanding of
FOSS Development methodologies of communicating with people, using Ticket tracker of
Django, coding and testing.

   Lastly I want to express my deep commitment to this project and Django. I'm fully
available this summer without any other commitments, will tune my day/night rhythm
as per my mentor's requirement and commit to 35-40 hours/week of dedicated work.
I also assure you that I will continue my commitment to Django well after GSoC.
If you find any part of this proposal unclear, please contact me.

~~~~~~~~~~~~~~~~~~~~~~~~
Important Links and URLs
~~~~~~~~~~~~~~~~~~~~~~~~
  My Blog: http://madhusudancs.info
  My CV : http://www.madhusudancs.info/sites/default/files/madhusudancsCV.pdf


--
Thanks and regards,
 Madhusudan.C.S

Blogs at: www.madhusudancs.info
Official Email ID: madhu...@madhusudancs.info

koenb

Mar 26, 2009, 2:05:42 PM
to Django developers
On 26 mrt, 17:48, "Madhusudan C.S" <madhusuda...@gmail.com> wrote:
> Hi all,
> [snipped]

I can't seem to find anything on the problem concerning contenttypes
in your proposal (see [1] for some recent discussion).
It would be nice to see that solved too, since it is one of the
inconveniences that has bitten me several times.

Koen


[1]: http://groups.google.be/group/django-developers/browse_thread/thread/943a36552527a018/f540ef9050aa20ce

Madhusudan C.S

Mar 26, 2009, 2:53:58 PM
to Django developers
Hi all,

What a blunder :( I submitted my proposal the way I will
have to submit it to socghop.appspot.com, with lines manually wrapped
at 80 chars per line, and Google Groups wraps it at 75 chars, making
my proposal look as ugly as possible. I did not realize that it was
75 chars here. Please excuse me, and tell me if my proposal is
unreadable; I will resubmit it with lines wrapped at 70 chars
or so.

Madhusudan C.S

Mar 26, 2009, 2:55:48 PM
to django-d...@googlegroups.com
Hi Koen,


On Thu, Mar 26, 2009 at 11:35 PM, koenb <koen.b...@werk.belgie.be> wrote:

> On 26 mrt, 17:48, "Madhusudan C.S" <madhusuda...@gmail.com> wrote:
> > Hi all,
> > [snipped]
>
> I can't seem to find anything on the problem concerning contenttypes
> in your proposal (see [1] for some recent discussion).
> It would be nice to see that solved too, since it is one of the
> inconveniences that has bitten me several times.

 
Thanks a lot for that link. I will work on that in detail and
add it to my proposal.

Malcolm Tredinnick

Mar 26, 2009, 11:35:11 PM
to djang...@googlegroups.com, Django developers
On Fri, 2009-03-27 at 00:23 +0530, Madhusudan C.S wrote:
> Hi all,
>
> What a blunder :( I submitted my proposal the way I will
> have to submit it to socghop.appspot.com, with lines manually wrapped
> at 80 chars per line, and Google Groups wraps it at 75 chars, making
> my proposal look as ugly as possible. I did not realize that it was
> 75 chars here. Please excuse me, and tell me if my proposal is
> unreadable; I will resubmit it with lines wrapped at 70 chars
> or so.


Why manually wrap it at all? Email clients have been able to handle
wrapping lines sensibly on behalf of the sender for about 20 years now.
Just type normally and only hit Return between paragraphs.

Regards,
Malcolm


Madhusudan C.S

Mar 27, 2009, 2:42:49 PM
to django-d...@googlegroups.com, djang...@googlegroups.com, Malcolm Tredinnick
Hi Malcolm,


  Right. I get it now. Won't do that blunder again :( Some of my friends who participated in previous years of GSoC had told me to manually wrap the text since they felt the text would look ugly after submission to Google's app if it is not wrapped with some small paragraphs appearing as a single huge line and also since wrapping gives a neatly presented look too :(

Madhusudan C.S

Mar 29, 2009, 12:51:40 PM
to djang...@googlegroups.com, Django developers, Russell Keith-Magee
Hi Russell,
   I am extremely thankful to you for spending your invaluable time on a review (err... should I say post-mortem? ;-) ) of my complete proposal. I had kept my fingers crossed for someone who knows the technical aspects to do it, since most of my friends did only a language review (some of them even gave up on seeing the length :( ). I am equally thankful to Malcolm for it.

After a lot of thinking and reviewing, and after spending the whole of yesterday studying how serializers other than Django's work in different languages and frameworks such as PHP, Python (pickle), Java, TurboGears (TurboJSON) and Boost, I have come up with some ideas which mostly depart from what I proposed earlier. From a top-level view I still propose to solve the same problems I suggested in my initial proposal, while also considering the bigger problems you pointed out. Again, this is a very rough draft of my ideas and requires a lot of refining through discussion with you and the rest of the community.

Thanks for the ideas on the Wiki; the reference to ModelAdmin there gave me some ideas to think further. Though this is not a copy, I have borrowed some ideas from the other serializers I studied yesterday. I have also ensured, as far as possible, that this doesn't break the existing Serializer and fixtures in any way, but only adds to them. Please point out if I have gone against this somewhere.

> The bigger issue is that we need to be able to easily
> reconfigure the output format of serializers to suit the
> specific requirements of other data consumers.

The idea that I propose below is mostly to tackle this bigger issue which you pointed out throughout.

Let us consider the same 2 models as before:


class Poll2(models.Model):
    question = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published')

class Choice2(models.Model):
    poll = models.ForeignKey(Poll2)
    choice = models.CharField(max_length=200)
    votes = models.IntegerField()

The user will now be able to construct a class along the lines of ModelAdmin for specifying custom serialization formats. I propose the API based on the following ideas.
The user will be given the option to define a Serializer class that inherits from one of the framework's serializer classes: Base, XML, Python, YAML or JSON. For the moment, to avoid confusion, let me call the new Serializer newserializer (this is only tentative; whether we must rename the framework or just the classes can be decided later). From what I have understood, Python data mainly consists of basic single-valued datatypes and data structures like List, Tuple and Dictionary. Most other complex data types/structures are derived from these types and can thus be represented with their notations.

So our base class defines a set of class attributes that define the notation for these structures, which by default are the same as the Python notations. For example, list_separator will be a 3-tuple containing the enclosing notations and the List item separator: ('[', ']', ','). Similarly, dict_separator is a 4-tuple: ('{', '}', ',', ':'). The last item is for key:value separation. More specialized cases will similarly be defined for the YAML and JSON classes. We can use this approach for XML too. For this case we can pass a tuple of strings with this format:
list_separator = ('<list-name>', '</list-name>', '<>list-value</>')
dict_separator = ('<dict-name>', '</dict-name>', '<dict-key=dict-value></>')
It is important to note here that list-name, dict-name, list-value, dict-value and dict-key are all indicative and are a part of the API (a better naming convention will be developed); they are not placeholders for some other value. That is, those are the names that must always be used consistently, which will be evident from the examples below.

The user can now inherit from one of these classes in his app, depending upon his requirements, and override these class attributes as per the format he wants. The API roughly looks like this for serializing the Poll class in a format similar to JSON notation.

class PollSerializer(newserializer.JSONSerializer):
    list_separator = ('{%', '%}', ':')
    dict_separator = ('{{', '}}', ':', '|')
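
To make the separator idea concrete, here is a small self-contained sketch of a renderer driven by list_separator and dict_separator class attributes. NotationSerializer and its render method are hypothetical stand-ins for the proposed newserializer classes; the sketch does no string escaping and does not cover the XML-style templates.

```python
class NotationSerializer:
    # (open, close, item separator)
    list_separator = ("[", "]", ",")
    # (open, close, item separator, key/value separator)
    dict_separator = ("{", "}", ",", ":")

    def render(self, value):
        """Render nested dicts, lists and scalars using the separators."""
        if isinstance(value, dict):
            opener, closer, item_sep, kv_sep = self.dict_separator
            items = (
                self.render(key) + kv_sep + self.render(val)
                for key, val in value.items()
            )
            return opener + item_sep.join(items) + closer
        if isinstance(value, (list, tuple)):
            opener, closer, item_sep = self.list_separator
            return opener + item_sep.join(self.render(v) for v in value) + closer
        if isinstance(value, str):
            return '"' + value + '"'  # no escaping in this sketch
        return str(value)


class PollNotationSerializer(NotationSerializer):
    # Same overrides as in the PollSerializer example above.
    list_separator = ("{%", "%}", ":")
    dict_separator = ("{{", "}}", ":", "|")
```

Only the class attributes differ between the two classes; the rendering logic is inherited unchanged, which is the point of putting the notation into overridable attributes.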

In addition to this the user can specify the fields to be selected, by over-riding a class attribute, fields. This attribute is a tuple of strings where each item is the name of the field to be serialized. The above class can now be written as follows:

class PollSerializer(newserializer.JSONSerializer):
    list_separator = ('{%', '%}', ':')
    dict_separator = ('{{', '}}', ':', '|')
    fields = ('question', 'pub_date')

Additionally, a class attribute named exclude_fields, a tuple of strings, is added, which is just the complement of the fields attribute (thanks to DjangoFullSerializers for giving this idea).
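
How fields and exclude_fields could combine is sketched below on a plain dictionary of field values. The function select_fields is hypothetical, and the choice to let exclude_fields win when a name appears in both is an assumption, since the proposal does not say.

```python
def select_fields(all_fields, fields=None, exclude_fields=()):
    """Pick the fields to serialize from a name -> value mapping.

    fields=None means "every field"; names in exclude_fields are
    always dropped (an assumption of this sketch).
    """
    names = all_fields.keys() if fields is None else fields
    return {
        name: all_fields[name] for name in names if name not in exclude_fields
    }


poll_fields = {"question": "What's Up?", "pub_date": "2009-03-01 06:00:00"}
only_question = select_fields(poll_fields, fields=("question",))
without_date = select_fields(poll_fields, exclude_fields=("pub_date",))
```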

To solve ticket #5711, I propose a method extra_fields() which returns a dictionary. It must return a dictionary instead of a tuple because most of the time the extra fields are computed/derived fields. Example below:

class PollSerializer(newserializer.JSONSerializer):
   #...
   def extra_fields(self):
       # pub_date stands for the pub_date value of the Poll being serialized
       pub_date_recent = pub_date > '2009-03-15'
       return {'is_recent': pub_date_recent}

One can also specify how a Primary Key is serialized with the pk_serialize() method, which returns a dictionary. This should address the ticket #102. Example below:

class PollSerializer(newserializer.JSONSerializer):
   #...
   def pk_serialize(self):
       return {'pk': pk_value, 'pk name': 'id'}

The dictionary can contain any number of items, but *pk_value* must be used at least once so that the PK value is serialized somewhere. I am still unsure whether I should make this a method or an attribute. Can someone kindly give suggestions?

The serialized output after over-riding the pk_serialize() method looks something like below.
 {
        "pk": 1,
        "pk name": "id",
        "model": "testapp.poll2",
        "fields": {
            "pub_date": "2009-03-01 06:00:00",
            "question": "What's Up?"
        }
 }


An additional model_extras() method can be overridden, which by default returns nothing in the Parent classes. In the derived class, the overridden method can return a dictionary of values which are added to the Model's serialized data. An example of this is the version number of the serialized format. API example:

class PollSerializer(newserializer.JSONSerializer):
   #...
   def model_extras(self):
       return {'version': '2.1'}
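
The three hooks described so far (pk_serialize, extra_fields, model_extras) could be combined into one record roughly as follows. HookSerializer and serialize_one are hypothetical names, and unlike the examples above the hooks here receive their inputs as explicit arguments, because the proposal leaves open how pk_value and the field values reach them.

```python
class HookSerializer:
    def pk_serialize(self, pk_value):
        return {"pk": pk_value}

    def extra_fields(self, fields):
        return {}

    def model_extras(self):
        return {}

    def serialize_one(self, model_label, pk_value, fields):
        """Assemble one record from the three overridable hooks."""
        record = self.pk_serialize(pk_value)
        record["model"] = model_label
        merged = dict(fields)
        merged.update(self.extra_fields(fields))  # computed fields
        record["fields"] = merged
        record.update(self.model_extras())  # e.g. a format version
        return record


class PollHookSerializer(HookSerializer):
    def pk_serialize(self, pk_value):
        return {"pk": pk_value, "pk name": "id"}

    def extra_fields(self, fields):
        # Timestamps in this fixed format compare correctly as strings.
        return {"is_recent": fields["pub_date"] > "2009-03-15"}

    def model_extras(self):
        return {"version": "2.1"}


record = PollHookSerializer().serialize_one(
    "testapp.poll2", 1,
    {"question": "What's Up?", "pub_date": "2009-03-01 06:00:00"},
)
```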

Finally, coming to the big thing, Ticket #4656: I propose 3 class attributes for this. The first one is select_related (as per your suggestion), which is a dictionary. Each key of the dictionary is the name of a Relation Attribute and each value is a dictionary. This inner dictionary can have the keys 'fields' or 'excludefields', whose values are tuples of strings naming the fields in that model to be selected or excluded. If this inner dictionary is empty, the entire model is serialized, using its own Serialization class similar to this one if one is defined, or the existing serializers otherwise.

Example:
class ChoiceSerializer(newserializer.JSONSerializer):
    #...
    select_related = {'poll': {'fields': ('question',)}}

NOTE: I am not very sure if I can implement this within the SoC timeline, but I will include it in the API proposal; if I run out of time, I will continue with this after GSoC. If time permits, well and good, I will implement this too. The value of the 'fields' key in the above dictionary is a tuple of strings, which means I cannot follow a relation on that model. So I wish to also allow dictionaries in this tuple along with the strings. Such a dictionary is again a select_related kind of nested dictionary, which can follow a relation within that relation, and so on.
For the Book, Author, City example you gave, it can look like this:
class BookSerializer(newserializer.JSONSerializer):
    #...
    select_related = {
        'author': {
            'fields': ('name', 'age', {
                'city':{
                    'fields': ('cityname', ...)
                }
            })
        }
    }
 *END NOTE*
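
A sketch of how such a nested select_related specification could be walked, using nested dictionaries in place of related model instances. The function names are hypothetical, the 'excludefields' key is not handled, and an empty spec is taken to mean "keep every field", following the description above.

```python
def apply_spec(obj, spec):
    """Select fields from obj according to one select_related entry.

    obj maps field names to values; related objects are nested dicts.
    Entries of spec["fields"] are either field names or nested
    {relation: spec} dictionaries.
    """
    fields = spec.get("fields")
    if fields is None:
        return dict(obj)  # empty spec: keep every field (shallow copy)
    out = {}
    for entry in fields:
        if isinstance(entry, dict):
            for relation, sub_spec in entry.items():
                out[relation] = apply_spec(obj[relation], sub_spec)
        else:
            out[entry] = obj[entry]
    return out


def serialize_related(obj, select_related):
    """Apply every top-level entry of a select_related dictionary."""
    return {
        relation: apply_spec(obj[relation], spec)
        for relation, spec in select_related.items()
    }


book = {
    "title": "T",
    "author": {
        "name": "N", "age": 40, "email": "n@example.com",
        "city": {"cityname": "Bangalore", "pincode": "560001"},
    },
}
spec = {"author": {"fields": ("name", "age", {"city": {"fields": ("cityname",)}})}}
selected = serialize_related(book, spec)
```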

The rest of the following is within the SoC timeline.
The second of the 3 attributes is the inline_related attribute, which can be set to True. In the parent class this is False. If it is set to True, the Serializer will serialize the select_related relations inline.

The third attribute is reverse_related. It is again a dictionary, similar in structure to the select_related dictionary, with the keys being the names of the Models that relate to this model. For example:

class PollSerializer(newserializer.JSONSerializer):
   #...
   reverse_related = {'choice': {
       'fields': ('choice', 'votes')
   }}

Last but not the least always exists ;-)

The user registers this PollSerializer class with our serializer framework, similar to ModelAdmin, as:
serializer.register.model(Poll, PollSerializer)

Now a question arises: what if the user wants to change only the serialization format, i.e. the notation, and nothing else in the entire app? Should he do the donkey work of copy-pasting list_separator and dict_separator into every class? I feel he need not. For that I propose the following: define a Serializer class, say AppnameSerializer, with whatever app-specific customization he wants (as provided by the API), and then call
serializer.register.app(AppName, AppnameSerializer).

This can be extended to multiple apps too. If he wants to customize a set of apps, he can say:
serializer.register.app(AppSetSerializer, multiple_apps=(App1Name, App2Name, ...)).
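
To make the registration idea concrete, here is a small sketch of a registry in which a model-level registration takes precedence over an app-level one. The class SerializerRegistry, its lookup method, and the use of string labels instead of model and app objects are all assumptions made for this example:

```python
class SerializerRegistry:
    """Maps models and apps to serializer classes (hypothetical sketch)."""

    def __init__(self):
        self._by_model = {}
        self._by_app = {}

    def model(self, model_label, serializer_cls):
        # model_label is an "app.Model" string in this sketch.
        self._by_model[model_label] = serializer_cls

    def app(self, app_names, serializer_cls):
        # Accept a single app label or a tuple of labels.
        if isinstance(app_names, str):
            app_names = (app_names,)
        for name in app_names:
            self._by_app[name] = serializer_cls

    def lookup(self, app_name, model_name, default=None):
        # A model-level registration wins over an app-level one.
        key = f"{app_name}.{model_name}"
        if key in self._by_model:
            return self._by_model[key]
        return self._by_app.get(app_name, default)


class PollSerializer:
    pass


class AppSetSerializer:
    pass


register = SerializerRegistry()
register.model("polls.Poll", PollSerializer)
register.app(("polls", "books"), AppSetSerializer)
```

With this in place, register.lookup("polls", "Poll") returns PollSerializer, while any other model in the polls or books apps falls back to AppSetSerializer.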



On Sat, Mar 28, 2009 at 12:17 PM, Russell Keith-Magee <freakb...@gmail.com> wrote:

On Fri, Mar 27, 2009 at 1:48 AM, Madhusudan C.S <madhus...@gmail.com> wrote:
> Hi all,
> *Note: *
>   Django doesn't serialize inherited Model fields in the Child Model. I
> asked
> on IRC why this decision was taken but got no response. I searched the
> devel list too, but did not get anything on it. I want to add it to my
> proposal, but before doing it I wanted to know why this decision was
> taken. Will it be a workable and necessary solution to add that to my
> proposal?

Malcolm has already addressed this, and his analysis is pretty much
spot on. I would only add that the current behaviour can also be
explained by looking at the heritage of the fixture system.
Historically, Django's fixtures have been used as a way of serializing
output for transfer between two Django installations (for example, as
test fixtures). To this end, the serializers have concentrated on
replicating a very database-like structure - that is, the structures
that are serialized closely match the underlying database structures.
In an inheritance situation, child tables don't contain all the data
from the parent table; hence, neither do the serialized structures.

Obviously, this focus on representing the database misses an obvious
alternate use case - occasions where serialization is required to
communicate to some other data consumer, such as an AJAX framework. In
my 'big picture' of the ideal serialization SoC project, this is the
problem that needs to be fixed. More on in later comments.

OK, got it. This can be taken care of by the *fields* class attribute in the above API.


> Same is the case for Ticket #10201. Can someone please tell me why
> microsecond data was dropped?

Again, Malcolm is on the money. If you can come up with a fix that
enables non-millisecond deprived databases to maintain microseconds,
I'm sure it would be a welcome inclusion. Thinking about it, this
shouldn't actually be that hard to achieve.

I am still not sure how to implement this. The only approach I can think of at the moment is a hard-coded one:
if database_type == 'mysql':  # during deserialization
    value = value.replace(microsecond=0)  # get rid of microseconds info
But I don't feel it is an elegant solution. There may be a better one that I cannot think of right now, so I will exclude it for now. If I find a solution, or someone suggests one, it doesn't hurt to implement it anyway.
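
One possible way to avoid comparing against a backend name is to key the decision on a capability flag instead. The sketch below assumes a hypothetical boolean supports_microseconds supplied by the caller; whether the database backends can provide such a flag is an assumption, not something confirmed in this thread:

```python
import datetime


def normalize_datetime(value, supports_microseconds):
    """Drop microseconds only when the target backend cannot store them."""
    if supports_microseconds:
        return value
    return value.replace(microsecond=0)


stamp = datetime.datetime(2009, 3, 29, 14, 8, 27, 123456)
kept = normalize_datetime(stamp, supports_microseconds=True)
trimmed = normalize_datetime(stamp, supports_microseconds=False)
```

The serialized data would then always carry microseconds, and only the deserializing side would decide whether to discard them.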

>   The project is planned to be completed in 9 phases.
...

>   2. Finalizing Design and Coding Phase I (May 22th – May 31st )
>   3. Testing Phase I (June 1st – June 5th )

As a prior warning - I'm very skeptical of anyone that proposes a
"test" phase that isn't integrated with the "build" phase. If you're
not testing at the same time you are building, then you don't know you
have the right result? If you test after you build, what happens when
your test reveals a problem with your implementation?

I know line items like this make accountant types happy, but it just
doesn't wash with me. If your implementation, including tests, will
take 3 weeks, then say three weeks. Don't say 2 weeks implementation
followed by a 1 week test.

I have not provided the full schedule for my revised proposal, only the APIs, since I feel this is an entirely new approach to serialization that still needs some refining, after which I can prepare a good schedule. He he, I understood what you meant (I think I am one of the accountant types then ;-) since I love that kind of split-up). I am correcting it anyway; I understand the problem you pointed out.

This is a very rough schedule, nowhere close to complete.
Starting May 22:
1. Create the new serialization framework classes. Add list_separator and dict_separator fields. Make sure everything is sane and works as before with all defaults, without breaking existing serializers - 4 weeks.
2. Add support for the additional APIs, namely methods and attributes such as fields, exclude_fields, extra_fields(), pk_serialize(), model_extras(), and test them - 3 weeks.
3. Add support for following relations: the select_related, inline_related and reverse_related class attributes - 4 weeks.
4. Write user and developer documentation, fix minor issues and bugs, communicate and discuss with the community, and scrub the code - 2 weeks.

Thanks for taking the time to put together such a comprehensive
proposal. I hope my comments haven't left you too despondent. :-)

No way. I am very happy that you pointed out where I failed to see the big picture. In fact I took it positively, as I always do when someone points out my mistakes; I understand that people point out mistakes only for my good. Hope my work above reflects that :(
 
However, all is not lost.

I am of the same opinion too. I want to be a Django contributor and I want to be a Django GSoC student too (period)

While it would be advantageous to have a
complete API proposal before starting work, it isn't completely
necessary. What would be necessary at a minimum is a set of use cases
to provide some sort of scope for what you would like to achieve
(i.e., develop a serialization API that would allow for the following
serialization use cases). Once we have a set of use cases, we can
establish the options that we have for an API, and develop that API
during the 'getting to know you' phase, and even during the initial
development phase of the GSoC project.

Of course, if you already have any ideas on how to specify
user-customizable serialization formats, feel free to knock our socks
off :-)

Hope I have covered most of the things I have learnt and that can be done.



P.S. (I think it is not very easy to come up with a revolutionary idea in one single day. So I don't claim it is revolutionary, but I claim it is better than what exists now and what I proposed initially.)

Madhusudan C.S

Mar 29, 2009, 2:08:27 PM
to djang...@googlegroups.com, Django developers, Russell Keith-Magee
Hello all,
    I would also like to add that I am madrazr on #django-dev. Whenever I have tried to ask something there I have not got any response so far. I am not complaining; I understand it is mainly because of timezone differences. I just want to let anyone who wants to tell me anything about my proposal directly to my face ;-) know that I am available for that :D
I will be around whenever I am logged into the channel.

Madhusudan C.S

Apr 1, 2009, 4:25:35 AM
to Russell Keith-Magee, Django developers, Django developers
Hi Russell,
 
  After some more thinking, I have reworked my proposal and come up with the following idea. Here is my draft proposal. I have also submitted it to socghop.appspot.com

Let us consider the following two models for discussion throughout:
  class Poll(models.Model):

      question = models.CharField(max_length=200)
      pub_date = models.DateTimeField('date published')
      creator = models.CharField(max_length=200)
      valid_for = models.IntegerField()  # max_length has no effect on an IntegerField

      def __unicode__(self):
          return self.question


  class Choice(models.Model):

      poll = models.ForeignKey(Poll)
      choice = models.CharField(max_length=200)
      votes = models.IntegerField()

      def __unicode__(self):
          return self.choice

  This project begins by providing ModelAdmin and Feeds framework
like APIs for Serializers, where the user will be able to construct
a class for specifying custom serialization formats. I propose the API
based on the following ideas.

  The user will first define a class inherited from the Serializer
framework. The parent class is a generic base Serializer class. The
user-defined class is then passed as a parameter to the serialize
method we call when we want to serialize the models. Within this class
the user will be able to specify the customized serialization format
in which he desires the output. Since Python has three main built-in
container types (lists, tuples and dictionaries), this format can
contain any of these data structures, nested in any order. Examples:

Example 1:
  class PollSerializer(Serializer):
      custom_format = [("question", "valid_for", "id")]
 
The output in this case will be a list of tuples containing the values
of question, valid_for and id fields. Here the strings are the names
of the fields in the model.

                        OR
Example 2:
  class PollSerializer2(Serializer):
      # The trailing comma makes the outer structure a one-item tuple.
      custom_format = (["question", {
          "valid_for_number_of_days": "valid_for",
          "Poll ID": "id"
      }],)

The output in this case will be a tuple of lists, each containing the
value of question and a dictionary whose values are the valid_for and
id fields and whose keys are their descriptions.
 
The implementation although not trivial, will work as follows:
(This is not final. Final implementation will be worked out by
discussing with the community)
- The custom_format will be checked for the type. The top level
  structure will be decided from this type. "{}" if dictionary, "()"
  if tuple and "[]" if list. In case of XML, the root tag will be
  django-objects. Also its children will have the tag name "object"
  and include model="Model Name" in the tag. This is same as the
  existing XML Serializer till here.

- Further the type of the only item within the top-level structure
  is determined. All the django objects serialized will be of this
  type. In case of XML, the children of "object" tag will be the tags
  having the name "field". The tags will also have name="fieldname"
  and type="FieldType" attributes within this tag. Additionally if
  these field tags are items of the dictionary, they will have a
  description="dictionary_key" attribute in the field tag.
 
- Further each item within the inner object ("question", "valid_for"
  and "id" in the first example) is checked for the type and the
  serialized output will have corresponding type. This is implemented
  recursively from this level. In case of XML, however, the name of
  the tag for further level groupings will have to be chosen in some
  consistent way. My suggestion for now is to name the tags as
  "field1" for the third level in the original custom format structure,
  "field2" for the fourth level in the original custom format
  structure, and so on.
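
The steps above can be sketched for Python output structures (leaving the XML naming question aside) as follows. The functions render and serialize and the stand-in poll objects are hypothetical; this is only one way the recursion could be written, not the final implementation:

```python
from types import SimpleNamespace


def render(obj, fmt):
    """Build output for one object, mirroring the container types in fmt."""
    if isinstance(fmt, dict):
        # Keys are descriptions, values are nested formats.
        return {key: render(obj, value) for key, value in fmt.items()}
    if isinstance(fmt, (list, tuple)):
        # Preserve the container type: a list stays a list, a tuple a tuple.
        return type(fmt)(render(obj, item) for item in fmt)
    # Leaf: a string naming a field on the object.
    return getattr(obj, fmt)


def serialize(objects, custom_format):
    """The top-level container holds exactly one item: the per-object format."""
    (per_object,) = custom_format
    return type(custom_format)(render(obj, per_object) for obj in objects)


polls = [
    SimpleNamespace(id=1, question="What's Up?", valid_for=30),
    SimpleNamespace(id=2, question="Elections 2009", valid_for=60),
]
custom_format = (
    ["question", {"valid_for_number_of_days": "valid_for", "Poll ID": "id"}],
)
output = serialize(polls, custom_format)
```

The type of the top-level container decides the outer structure, and the single item inside it is applied once per object, which matches the first two steps in the list.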

For the second example above, we call the serializer as follows:

  serializer.serialize("json", Poll.objects.all(),
      custom_serializer=PollSerializer2)

The output looks as follows:
(
    ["What's Up?", {
        "valid_for_number_of_days": "30",
        "Poll ID": "1"
        }
    ],
    ["Elections 2009", {
        "valid_for_number_of_days": "60",
        "Poll ID": "2"
        }
    ]
)

Also if we use XML,
  serializer.serialize("xml", Poll.objects.all(),
      custom_serializer=PollSerializer2)

The output looks as follows:

<django-objects version="1.0">
    <object pk="1" model="testapp.poll2">
        <field type="CharField" name="question">What's Up?</field>
        <field>
            <field1 type="IntegerField" name="valid_for" description="valid_for_number_of_days">
                30
            </field1>
            <field1 type="AutoField" name="id" description="Poll ID">
                1
            </field1>
        </field>
    </object>
    <object pk="2" model="testapp.poll2">
        <field type="CharField" name="question">Elections 2009</field>
        <field>
            <field1 type="IntegerField" name="valid_for" description="valid_for_number_of_days">
                60
            </field1>
            <field1 type="AutoField" name="id" description="Poll ID">
                2
            </field1>
        </field>
    </object>
</django-objects>

  Further, when a user wants to include extra fields in the serialized
data, such as additional non-model fields or computed fields, he needs
to specify the method in the class that returns the value of this
field as that item in his format. It should be the method itself, not
a string, so that we can check whether the item is callable and, if
so, call that method and use the return value for serialization. For
example:

Example 3:
  import datetime

  class PollSerializer(Serializer):
      # Defined before custom_format so the bare name resolves in the class body.
      def till_date(self):
          poll = Poll.objects.get(pk=self.pk)
          return poll.pub_date + datetime.timedelta(days=poll.valid_for)

      custom_format = [("question", "valid_for", till_date)]

  Further, an important thing to note here is that whenever a string
passed as an item anywhere in the custom_format does not name a field
in the model, it is serialized as the same string in the final output,
thereby allowing the addition of non-model static data, such as the
version number of the format.
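
The three kinds of leaf items described so far (model field names, callables, and static strings) could be told apart roughly as below. The function resolve_leaf is hypothetical, and passing the current object to the method as an argument is an assumption of this sketch rather than the proposed API, which reads self.pk instead:

```python
import datetime
from types import SimpleNamespace


def resolve_leaf(serializer, obj, item):
    """Resolve one leaf of a custom format for a single object."""
    if callable(item):
        # Computed field: call the serializer method with the current object.
        return item(serializer, obj)
    if hasattr(obj, item):
        # Model field: copy its value.
        return getattr(obj, item)
    # Anything else is static data and passes through unchanged.
    return item


class PollSerializer:
    def till_date(self, poll):
        return poll.pub_date + datetime.timedelta(days=poll.valid_for)

    # "format-v1" names no field, so it is emitted as a literal string.
    custom_format = [("question", "format-v1", till_date)]


poll = SimpleNamespace(
    question="What's Up?",
    pub_date=datetime.datetime(2009, 4, 1),
    valid_for=30,
)
serializer = PollSerializer()
(fields,) = PollSerializer.custom_format
row = tuple(resolve_leaf(serializer, poll, item) for item in fields)
```

One consequence of this rule is that a static string which happens to match a field name would be treated as the field, so the docs would need to point that out.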

  Another point to note here is that, the string specified in the
custom format can also include fields from the Parent Models, thereby
allowing even Parent Model fields to be serialized.

  Further, the user will be clearly informed in the docs that he cannot
pass any arbitrary Django object when calling the serialize()
method with the custom_format parameter, but only objects of the type
for which the custom_format is defined using the ModelSerializer class.
If he does so, it will be flagged as an error.

  Also last but not the least, a select_related parameter will be
added to the serialize method which, when set to True, will automatically
serialize all the related models for this model. Serializing the
related models facilitates the reconstruction of the database tables
for the given model in case there exist any constraints. Further,
the related models will be serialized in a default format.

  Further, if the user knows which models might be selected when
select_related is True, he can provide a parameter like the one below:

  related_custom_serializers={
      "Model1" : Model1Serializer,
      "Model2" : Model2Serializer
  }

  While serializing the related models, the serializer checks whether
related_custom_serializers has an item for the selected model
and serializes in that format if it exists. Example:
  serializer.serialize("json", Poll.objects.all(),
      custom_serializer=PollSerializer2, select_related=True,
      related_custom_serializers={
      "Model1" : Model1Serializer,
      "Model2" : Model2Serializer
      }
  )

(I am very skeptical about the use cases for the above feature, since
select_related is usually needed for round trips and rarely needed for
external applications. Nevertheless I propose it here, "Waiting for
further discussion")
         

NOTE: I must also admit that I am following the other proposal on the same idea; I felt there was no point in hiding it. But providing a custom format was my idea too. I feel I had started on this in my previous proposal itself: I had a very similar idea in mind when I used list and dict separators, but got it wrong. After thinking for a day or so about the weaknesses you pointed out, I came up with the same idea, but was unfortunately late in sending it, since, as you know, I had already got it wrong 2 times :( I wanted to say something sensible the 3rd time and was preparing a more comprehensive solution. I hope it answers almost all the questions you gave as a braindump on the other proposal.


- Thanks and regards,
  Madhusudan.C.S


