[GSoC] Serialization Refactor

Madhusudan C.S

Mar 20, 2009, 9:36:27 AM
to Django developers
Hi all,
   I am a prospective GSoC student who is interested in
working on Django this summer. I have read both mails about
GSoC in full (Malcolm's mail to a prospective student on the
django-gsoc list and Jacob's mail to all prospective
students on the -devel list). Thanks to both of you for such
informative and detailed mails. They really helped me.
I have been using Django for the last 8 months and started
using Google App Engine back in January. Ever since I
started using it, I have felt the need for Django's models
to be supported there, since learning App Engine's models
duplicates effort when we already know Django, and I badly
wanted to see that support in Django. When I decided to
participate in GSoC as a Django student and saw the ideas
list, I was very happy to see that listed as one of the
ideas. By the time I had prepared to discuss it on the list
here, I had read Malcolm's mail on the same topic to one of
the prospective students, and his note on the wiki page. It
really did disappoint me, but I now understand how huge the
project is and how difficult it might turn out to be for
someone who doesn't know Django internals *extremely* well.

   But somehow I don't want to give up working on Django.
I found that the serialization refactor idea interests me.
Can someone please tell me who the point of contact or
most likely mentor for that idea is, or should I discuss it
on the list with everyone here in general?

Also, can I get some more information about it? I went through
wadofstuff, linked on the ideas page, and all the examples given
on their website. It looks seriously interesting to me. Can someone
please tell me what exactly Django expects from a student for
this idea?

I just want to add a few other things I have in mind for this
idea, but this seems to be becoming a very long mail. I will
write my additional points in the next mail and will give a
small intro about myself in a separate mail as well.

--
Thanks and regards,
 Madhusudan.C.S

P.S. Really sorry for such a long mail.

Blogs at: www.madhusudancs.info
Official Email ID: madhu...@madhusudancs.info

Madhusudan C.S

Mar 20, 2009, 4:08:13 PM
to Django developers
Hi all,
   I just wrote 2 mails about myself and my wish to
participate in GSoC as a Django student. Sorry if I am
spamming your inboxes. I just want to keep my mails short
so people who don't want to read everything in there can
skip the mails that are irrelevant to them. Please correct
me wherever I am wrong, and tell me if I am not doing this
the way it should be done here in Django.

I hope I understand what Malcolm and Jacob meant when
they said this:

> We make changes because there are use-cases
> for them, not because we can. So any proposal should
> be driven by trying to fix some existing problem, not
> creating a "wouldn't it be nice if...?" situation.

I want to work on the Serialization Refactor for GSoC. Since
it is still not clear to me exactly what the Django community
requires from that idea, I request any of you to explain a bit
about what is expected of that project. In the meantime,
before I get a response, let me add a few more ideas to it.

I am proposing this idea as a Django user initially, opening
it up for discussion for the rest of the community. I
personally feel this is a missing feature in Django and want
to see it happen as a "Django user" for sure. (Also, please
tell me whether it is worth opening a ticket on this and
sending this idea to the Django users list as well for
additional feedback.)

Let me begin my idea with an interesting use case I have as
a Django user (I hope many other users have felt the same).
I am not sure if this already exists in Django. I assume it
doesn't, from what I have learnt. Please correct me if I am
wrong.

I have a web app written in Django that gets its data in a
serialized format. The data source is a third-party script
that fetches an HTML page from a website, parses the data,
and supplies it to us as JSON. (The page parsed is a
university result sheet; the script has no access to the
underlying results database.) I now want to store this data
in my database, but along with the data provided as JSON I
need to add some administrative fields to the database table
for each serialized record I get.

One could ask why I can't use deserialization. The problem,
as I understand it (maybe I am wrong, please correct me if
so), is that whenever I deserialize the stream I get, I can
only obtain a DeserializedObject wrapping a Django object,
which should contain the full model data, including any PK
fields that exist in the model, and not just a subset of
fields. That is not the case here. I just want the
serialized data I get to become part of the database table,
say a subset of the table's fields alongside others, for
example the time at which the data was recorded in the
database and some indexing information.

One could also suggest a custom field that stores the
serialized data as a string (i.e. as varchar) or something
like that. But from what I understand from the docs, I can
use custom fields for single fields but not for data that
must be split over several fields. That is exactly what is
required here: since I get the marks in JSON, I must be able
to compute things like the class average for a particular
subject, which becomes difficult if I store the JSON data as
a string, because I would need to deserialize the entire
string each time I need access to a single field in it.
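
A minimal sketch of the use case above, for illustration only. It is not Django code: it uses just the standard library (`json`, `sqlite3`) to show incoming JSON records being split across columns, with one administrative column added locally, so that an aggregate such as a subject average can run in SQL. The table name `result`, the column names, and the sample payload are all assumptions; the real scraper output is not shown in this thread.

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical payload, shaped like what the third-party script might emit.
payload = json.dumps([
    {"roll_no": "CS001", "subject": "Maths", "marks": 80},
    {"roll_no": "CS002", "subject": "Maths", "marks": 60},
    {"roll_no": "CS001", "subject": "Physics", "marks": 70},
])

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE result (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- assigned locally, not in the JSON
        roll_no TEXT NOT NULL,
        subject TEXT NOT NULL,
        marks INTEGER NOT NULL,
        recorded_at TEXT NOT NULL              -- administrative column added by the app
    )
    """
)


def store_results(conn, raw_json):
    """Split each JSON record across columns and add the bookkeeping field."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [
        (r["roll_no"], r["subject"], r["marks"], now)
        for r in json.loads(raw_json)
    ]
    conn.executemany(
        "INSERT INTO result (roll_no, subject, marks, recorded_at) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return len(rows)


def subject_average(conn, subject):
    """Aggregate in SQL, which a single JSON string column would not allow."""
    (avg,) = conn.execute(
        "SELECT AVG(marks) FROM result WHERE subject = ?", (subject,)
    ).fetchone()
    return avg


stored = store_results(conn, payload)
maths_avg = subject_average(conn, "Maths")
```

Here the primary key and `recorded_at` come from the application rather than from the JSON, which is the "subset of fields" situation described above.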

So the solution that appears to me now is to add
serialization field support to Django models: something like
a JSONField, with metadata for the JSON field's structure
provided in some way, say by defining a class for its
structure (as we do for ModelAdmin) or by providing it in
class Meta inside the model definition. This can be done
since we will at least already know the format of the
serialized data we receive (obviously we need to know this,
since we cannot process arbitrary serialized data). I hope
this is somewhat similar in idea to what is pointed out as
ModelAdmin in the ideas list on the wiki page.
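
One way to picture the "class describing the JSON structure" idea is the plain-Python sketch below. Everything in it is hypothetical: `JSONStructure`, `ResultStructure`, the `fields` mapping, and the key names are invented for illustration and are not an existing or proposed Django API.

```python
import json


class JSONStructure:
    """Hypothetical base class.

    Subclasses declare `fields` as {json_key: (column_name, expected_type)}.
    """

    fields = {}

    @classmethod
    def flatten(cls, raw_json):
        """Parse one JSON object and return {column_name: value}.

        Keys not listed in `fields` are ignored; missing keys and
        values of the wrong type raise an error.
        """
        data = json.loads(raw_json)
        row = {}
        for key, (column, expected_type) in cls.fields.items():
            if key not in data:
                raise ValueError(f"missing key: {key}")
            value = data[key]
            if not isinstance(value, expected_type):
                raise TypeError(f"{key} should be {expected_type.__name__}")
            row[column] = value
        return row


class ResultStructure(JSONStructure):
    # Declared once, much as a ModelAdmin class declares its options.
    fields = {
        "rollNo": ("roll_no", str),
        "subject": ("subject", str),
        "marks": ("marks", int),
    }


row = ResultStructure.flatten(
    '{"rollNo": "CS001", "subject": "Maths", "marks": 80}'
)
```

The resulting `row` dictionary could then be merged with locally supplied values before saving.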

I would like to add support for JSON and Python serialization
through this project during the Summer of Code period and
take up XML and YAML post GSoC, since I feel that including
those as well would be too much for a 12-week project. Just
my estimate :(

Python serialization support has another interesting use
case, I feel. If we allow Python built-in types, at least
lists, tuples and dictionaries, as fields in Django models,
we will be providing the highest level of object-oriented
abstraction over relational databases. We will make the
lives of Django users easier by letting them use those
Python types without having to worry too much about
normalization. But how we implement them will be interesting
and tricky, and it obviously requires many design decisions
from the Django community. One idea I have now is to apply
the same kind of normalization we would apply by hand to put
a list of values into a relational database: create a new
table for the list items, with a foreign key from that table
to the original table. Just a thought.
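
The normalization idea above can be sketched with the standard library alone. This is an illustration under assumed names (`student`, `student_marks_item`, `save_list`, `load_list`), not a design for how Django would do it: the list lives in a child table that holds a foreign key to the parent row and a position column to preserve order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(
    """
    CREATE TABLE student (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One row per list element, pointing back at the owning row.
    CREATE TABLE student_marks_item (
        id INTEGER PRIMARY KEY,
        student_id INTEGER NOT NULL REFERENCES student(id) ON DELETE CASCADE,
        position INTEGER NOT NULL,
        value INTEGER NOT NULL,
        UNIQUE (student_id, position)
    );
    """
)


def save_list(conn, name, marks):
    """Insert the parent row, then one child row per list element."""
    cur = conn.execute("INSERT INTO student (name) VALUES (?)", (name,))
    student_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO student_marks_item (student_id, position, value) "
        "VALUES (?, ?, ?)",
        [(student_id, i, v) for i, v in enumerate(marks)],
    )
    return student_id


def load_list(conn, student_id):
    """Rebuild the Python list in its original order."""
    rows = conn.execute(
        "SELECT value FROM student_marks_item "
        "WHERE student_id = ? ORDER BY position",
        (student_id,),
    )
    return [value for (value,) in rows]


sid = save_list(conn, "Asha", [80, 60, 70])
marks = load_list(conn, sid)
```

Dictionaries could be handled the same way with a key column in place of `position`; nested values are where the design decisions mentioned above would start to matter.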

Seems like this too became one big huge post :( But I wanted
to explain the idea I had in mind and get feedback from the
community. Please bear with me. I sincerely look forward to
your invaluable feedback. Please ping me if you want any more
details on this, or else if you want to FLAME me ;-)

Hope I have convinced you people with my idea. I am still
waiting for more details about the "Serialization refactor".
What I have said above is just an addition to serialization
support; I hope the two will be somewhat related.

Thanks for all your time. I keep my fingers crossed, hoping
similar use cases exist for a good number of users ;-)


--
Thanks and regards,
 Madhusudan.C.S

Madhusudan C.S

Mar 21, 2009, 1:20:58 AM
to djang...@googlegroups.com, Django developers, Malcolm Tredinnick
Hi Malcolm and all,

On Sat, Mar 21, 2009 at 8:16 AM, Malcolm Tredinnick <mal...@pointy-stick.com> wrote:

> > I want to work on Serialization Refactor for GSoC. Since what
> > the Django community requires exactly from that idea is still
> > not clear to me, I am requesting any of you to explain a bit
> > on what is expected of that project?
>
> Yes, that seems to be the problem here (in fact, it was what I was
> thinking to myself when reading your second mail).
>
> I thought this problem was going to arise. The one-line suggestions on
> the SoC wiki page aren't particularly specific, unfortunately. They also
> aren't QA'd in any real way for practicality or difficulty, so it's a
> bit of a combination of wishlist and brainstorming. A starting point for
> further research, really. The confusion there is our fault, but if you
> view it as a starting point for thinking, that will be a good point.

Ah OK. I did not see it that way, sorry. I thought someone who
wrote there on the wiki had a specific set of things in mind. Will
definitely work on it. Thanks.

> That particular item appears to be very poorly named in the wiki page.
> It's not about refactoring at all (which is changing code around to make
> new functionality easier, or remove redundancy). It's about adding new
> features to the serializers. Enhancing, extending and changing in
> various places, not refactoring.

Getting it now :)

> Now, there are a bunch of things that could be worked on in the
> serialization space. Have a look at the currently open tickets in that
> area (I mean, read them *all*):
> http://code.djangoproject.com/query?status=new&status=assigned&status=reopened&component=Serialization&order=priority

I had already seen most of the tickets in Serialization and ORM before
making this post. Since it is a huge list combined, I had only glanced
through the tickets there. Will get into each of them and will study
them in detail.

> You'll see a few consistent patterns for feature requests and
> awkwardness there (beyond the things that are just basic bugs we have to
> fix at some point). It's also worth having a look at mailing list posts
> (and tickets) that refer to "fixtures", since that's where those things
> are used in Django. You'll start to see problems there with items like
> content type values changing or references to pk value in other models
> that change upon loading.

Oh OK. I think most of the problems of this kind are already filed as
tickets in the tracker. I remember seeing them, like #6233, #7052,
#9422, IIRC. Those are the tickets I have bookmarked here :)
I will also look into Mailing list posts for them.
 
> We'd like to change the serialisation format
> to be a lot more robust when it's referring to other models in any way.
> One possibility is to use a label instead of a value for those fields.
> It can even be designed to be backwards compatible (by adding a version
> field to any new format).
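
To make the "label instead of a value" suggestion concrete, here is a small sketch in plain Python. The record layout, the per-record `version` key, the `{"label": ...}` reference shape, and the `AUTHORS_BY_USERNAME` lookup are all assumptions made for illustration; the thread does not specify what the new format would look like.

```python
# Hypothetical lookup in the *target* database: label -> local primary key.
AUTHORS_BY_USERNAME = {"alice": 17, "bob": 42}


def resolve_author(record):
    """Return the local pk of the record's author for either format version."""
    version = record.get("version", 1)  # old dumps carry no version field
    ref = record["fields"]["author"]
    if version == 1:
        # Raw pk: only correct if pks happen to match across databases.
        return ref
    if version == 2:
        # Label: looked up at load time, so differing pks do not matter.
        return AUTHORS_BY_USERNAME[ref["label"]]
    raise ValueError(f"unsupported serialization version: {version}")


old_style = {"model": "blog.entry", "pk": 1, "fields": {"author": 3}}
new_style = {
    "version": 2,
    "model": "blog.entry",
    "pk": 1,
    "fields": {"author": {"label": "alice"}},
}

old_pk = resolve_author(old_style)
new_pk = resolve_author(new_style)
```

Treating a missing `version` as the old format is what keeps existing fixtures loadable in this sketch.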

Oh OK. I think I need to study a bit more about this and jump
into discussion. I will read the related code.

> Adding support for non-model fields to be serialised is another option.

I am a bit confused here. Can you please tell me what you meant
by a non-model field? Does it mean foreign keys or something like
that?

> Most of the things I see on the DjangoFullSerializers project appear to
> be covered in the tickets in Trac. So the question then becomes whether
> the goal would be to merge in DjangoFullSerializers, but keeping things
> backwards compatible for existing users. Or to take the good ideas and
> merge them in in a more piecemeal fashion. Or work through the general
> problems raised in the serializer and fixture tickets and posts on the
> mailing list.
>
> Hopefully that gives you a bunch of ideas for a bit more research.

Yeah thanks. It definitely gave me a lot of ideas. Working on them right
away.
 
[...]

>
> > So the solution that appears to me now is to add a
> > Serialization Field support to Django Models. Say something
> > like JSONField and provide Meta Data for the JSON Field
> > Structure in some way, say defining a class for its structure
> > (as we do for ModelAdmin) or providing this in Class Meta
> > inside the Model Definition. This can be done since we will
> > already, at least, know what will be the format of serialized
> > data we receive (quite obvious, we need to know this, since
> > we cannot process any random serialized data). Hope this is
> > somewhat similar in idea to what is pointed out as ModelAdmin
> > in the ideas list on the wiki page.
>
> Hmm .. fields that provide serialized data aren't really anything to do
> with the serializer. You can already write them now. In fact, people
> have. I'm not sure where Meta factors into this, either. Writing a new
> custom field type isn't really a Summer of Code project (it's Weekend of
> Code difficulty, really).

Oh, is it possible to deserialize stream data obtained from an external
source into several fields of a database table, alongside other fields
of the table that are filled in by the current app?
I did not know that, sorry :(

I would like to know how they do it. Can someone please point me to
any such examples if they exist? I really never thought it was just
a matter of adding a single new custom field :(

> > I would like to add support for JSON and Python serialization
> > through this project during Summer Of Code period and
> > take up XML and YAML post GSoC since I feel if we include
> > those also it would be too much for 12 weeks project. Just
> > my estimate :(
>
> Any serializer changes would really address all four formats at once,
> since the differences are very minor. Work in that area is really 90%
> getting it working for one type and then 10% getting it working for
> everything else, since converting to string format and converting from
> string format to internal object format are the very last and very first
> things, respectively, done by serialization and deserialization. Have a
> look at the existing code -- if you're going to work on serialization,
> it's probably not too much to hope you've actually looked at the code we
> have now -- and you'll see that the specific format classes don't do
> much extra work at all. The heavy lifting is in methods that work with
> the internal data structures.
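
The point about shared work can be illustrated with a toy example. This is not Django's serializer code: `Result`, `BaseSerializer`, `JSONSerializer`, and `PythonLiteralSerializer` are stand-ins invented here to show the shape of the argument, where the base class converts between objects and plain records and each format class only handles the final conversion to and from a string.

```python
import ast
import json
from dataclasses import asdict, dataclass


@dataclass
class Result:
    """Stand-in for a model instance."""
    pk: int
    subject: str
    marks: int


class BaseSerializer:
    """Shared work: objects <-> plain Python records."""

    def serialize(self, objects):
        records = []
        for obj in objects:
            fields = asdict(obj)
            pk = fields.pop("pk")
            records.append({"pk": pk, "fields": fields})
        return self.dumps(records)  # last step: format-specific

    def deserialize(self, text):
        records = self.loads(text)  # first step: format-specific
        return [Result(pk=r["pk"], **r["fields"]) for r in records]

    def dumps(self, records):
        raise NotImplementedError

    def loads(self, text):
        raise NotImplementedError


class JSONSerializer(BaseSerializer):
    def dumps(self, records):
        return json.dumps(records)

    def loads(self, text):
        return json.loads(text)


class PythonLiteralSerializer(BaseSerializer):
    def dumps(self, records):
        return repr(records)

    def loads(self, text):
        return ast.literal_eval(text)


original = [Result(pk=1, subject="Maths", marks=80)]
via_json = JSONSerializer().deserialize(JSONSerializer().serialize(original))
via_literal = PythonLiteralSerializer().deserialize(
    PythonLiteralSerializer().serialize(original)
)
```

A change to the record layout would be made once in `BaseSerializer` and picked up by every format class.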

Ok will look at the code. 