[GSoC 2013] A second stab at an implementation of composite fields

245 views
Skip to first unread message

Michal Petrucha

unread,
Apr 12, 2013, 9:44:48 AM4/12/13
to django-d...@googlegroups.com
Hello everyone,

Hopefully some of you still remember me as the guy who's been trying
to implement support for composite primary keys by means of composite
model fields.

I haven't provided a whole lot of information on the progress of this
project for... quite a long time, so I'll try to rectify this now and
explain the current standing.

A quick summary of GSoC 2011: I started by implementing
CompositeField, which only acts as a dumb container. The obvious
problem was that it wouldn't work with relationship fields, but other
than that, it works just fine. Most importantly, using it as a primary
key should work just fine.

As far as relationship fields go, we tried to ignore them at first and
get back to them during the second half of GSoC. Two approaches were
considered, one was to special-case CompositeField targets in
ForeignKey and in this case, make a single ForeignKey field manage
multiple database columns directly. This turned out to be really
painful and messy, so we tried another path, which was to turn
ForeignKey into a virtual field and create an auxiliary copy of the
target field in the local model which is supposed to manage the
database column and its raw value. This way, ForeignKey only takes
care of the higher-level relationship stuff.

A large part of the ForeignKey refactor has been done. However, I was
doing it on top of the CompositeField patchset, which makes it a real
PITA to keep it in sync with master. I had managed to keep it in sync
for about a year but this has become increasingly tedious, especially
with all the Py3k changes and recent ORM improvements and cleanups.
Currently my code is rebased on top of a commit from 2012-11-04, which
is already five months in the past.

It might be the case that it's simply because of me abusing git in a
wrong way, but I've been keeping my work as a patchset on top of
master using git rebase -- it helps me keep track of the progress I
did and it is easier for me to find the commit where I did a certain
change and why I did it in case I'm not sure what to do when a merge
conflict appears. (If anyone is willing to teach me a better workflow,
I'd be more than happy to take advice from more experienced users.)

The more I think about it, the less I am convinced that trying to keep
my patchsets in sync with master is worth the effort. When I
implemented CompositeField itself, not having a clear idea of how
ForeignKey would be brought into the equation and what kinds of
abstraction would be required later, I implemented multiple things
(like database creation or SQL WHERE statements) in a way that worked
at that time, but require a more general handling with the
abstractions introduced in the ForeignKey refactor. This basically
means that I first implemented CompositeField and later broke most of
its internal handling within the ORM (which is one of the reasons why
keeping this in sync with master is difficult).

The course of action I'd suggest at this moment is to first port the
ForeignKey refactor on top of current master and make sure everything
is perfectly backwards compatible. This would also create all the
internal abstractions required to implement CompositeField, which
means porting CompositeField on top of the ForeignKey refactor will be
easier (at the very least, it will require less internal changes).


As the past few years have shown, this project is far too
time-consuming for an individual to make any significant progress just
working on it in one's free time. Therefore I'm proposing to do this
as part of the upcoming GSoC. That would mean I'd be working on it
full-time for three months which should be enough to finally push this
over the edge and at least finish this code-wise (and
documentation-wise, of course), hopefully even get the revision
history into a state where it'd be ready for review and eventually
inclusion into master.

If there are no serious objections to the proposed plan, in a couple
of days I'll start working on a timeline for the GSoC project and I'll
also write down the few more serious implementation issues I still
don't really know how to approach.


Have a great weekend everyone.

Michal
signature.asc

Anssi Kääriäinen

unread,
Apr 12, 2013, 10:35:45 AM4/12/13
to Django developers
On 12 huhti, 16:44, Michal Petrucha <michal.petru...@ksp.sk> wrote:
> As far as relationship fields go, we tried to ignore them at first
and
> get back to them during the second half of GSoC. Two approaches were
> considered, one was to special-case CompositeField targets in
> ForeignKey and in this case, make a single ForeignKey field manage
> multiple database columns directly. This turned out to be really
> painful and messy, so we tried another path, which was to turn
> ForeignKey into a virtual field and create an auxiliary copy of the
> target field in the local model which is supposed to manage the
> database column and its raw value. This way, ForeignKey only takes
> care of the higher-level relationship stuff.
>
> A large part of the ForeignKey refactor has been done. However, I was
> doing it on top of the CompositeField patchset, which makes it a real
> PITA to keep it in sync with master. I had managed to keep it in sync
> for about a year but this has become increasingly tedious, especially
> with all the Py3k changes and recent ORM improvements and cleanups.
> Currently my code is rebased on top of a commit from 2012-11-04, which
> is already five months in the past.

ForeignKeys have been changed a lot since 2012-11-04. The introduction
of ForeignObject (which is base for ForeignKey) means that there is
support for multicolumn joins in the ORM now, and that ForeignObject
itself is a non-concrete field (not in ._meta.virtual_fields though).
ForeignObject currently works only on ORM level, so there is no API
for creating multicolumn foreign keys.

In any case, the GSoC proposal should take the introduction of
ForeignObject in account. tests/foreign_object contain some examples
related to ForeignObject functionality, exploring how those examples
work is a good idea.

Another big issue to solve is how to store the fields in model._meta.
IMO a good idea is to add all fields into some attribute ('all_fields'
as name might be good...), then separate the fields for different use
cases based on features the fields has. For example concrete_fields is
[f for f in self.all_fields if f.column], form fields are [f for f in
self.all_fields if f.form_field] (or maybe this should be called
logical fields - a multicolumn foreign key is logical, but not
concrete field, and the fields used by the foreign key are concrete
but not logical fields). The different field lists can be
cached_properties so there will be no problems from performance
perspective.

- Anssi

Michal Petrucha

unread,
Apr 12, 2013, 11:34:13 AM4/12/13
to django-d...@googlegroups.com
On Fri, Apr 12, 2013 at 07:35:45AM -0700, Anssi Kääriäinen wrote:
> On 12 huhti, 16:44, Michal Petrucha <michal.petru...@ksp.sk> wrote:
> ForeignKeys have been changed a lot since 2012-11-04. The introduction
> of ForeignObject (which is base for ForeignKey) means that there is
> support for multicolumn joins in the ORM now, and that ForeignObject
> itself is a non-concrete field (not in ._meta.virtual_fields though).
> ForeignObject currently works only on ORM level, so there is no API
> for creating multicolumn foreign keys.
>
> In any case, the GSoC proposal should take the introduction of
> ForeignObject in account. tests/foreign_object contain some examples
> related to ForeignObject functionality, exploring how those examples
> work is a good idea.

This has been on my TODO list for a few days actually, I was going
through the work done regarding #19385 [1] and noticed this. However,
I still need to give it a more thorough examination to fully get a
grasp on this.

> Another big issue to solve is how to store the fields in model._meta.
> IMO a good idea is to add all fields into some attribute ('all_fields'
> as name might be good...), then separate the fields for different use
> cases based on features the fields has. For example concrete_fields is
> [f for f in self.all_fields if f.column], form fields are [f for f in
> self.all_fields if f.form_field] (or maybe this should be called
> logical fields - a multicolumn foreign key is logical, but not
> concrete field, and the fields used by the foreign key are concrete
> but not logical fields). The different field lists can be
> cached_properties so there will be no problems from performance
> perspective.

As a matter of fact, this is one of the first things I did back in
2011. (-: Almost exactly like that -- _meta.fields contains all fields
(except for M2M), _meta.concrete_fields, _meta.virtual_fields and I
think I also needed a _meta.local_concrete (for non-inherited fields;
I couldn't find a better name for it that wouldn't be too verbose).


[1] https://code.djangoproject.com/ticket/19385
signature.asc

Anssi Kääriäinen

unread,
Apr 17, 2013, 6:49:11 AM4/17/13
to Django developers
On 12 huhti, 18:34, Michal Petrucha <michal.petru...@ksp.sk> wrote:
> On Fri, Apr 12, 2013 at 07:35:45AM -0700, Anssi Kääriäinen wrote:
> > On 12 huhti, 16:44, Michal Petrucha <michal.petru...@ksp.sk> wrote:
> > ForeignKeys have been changed a lot since 2012-11-04. The introduction
> > of ForeignObject (which is base for ForeignKey) means that there is
> > support for multicolumn joins in the ORM now, and that ForeignObject
> > itself is a non-concrete field (not in ._meta.virtual_fields though).
> > ForeignObject currently works only on ORM level, so there is no API
> > for creating multicolumn foreign keys.
>
> > In any case, the GSoC proposal should take the introduction of
> > ForeignObject in account. tests/foreign_object contain some examples
> > related to ForeignObject functionality, exploring how those examples
> > work is a good idea.
>
> This has been on my TODO list for a few days actually, I was going
> through the work done regarding #19385 [1] and noticed this. However,
> I still need to give it a more thorough examination to fully get a
> grasp on this.

The basic idea is that there is a new ForeignObject class. A
ForeignObject basically just takes a related model, and from_fields
and to_fields which are the names of the fields used for the relation.
Then, the ORM knows how to create JOINs, subqueries etc based on a
ForeignObject. The ForeignObject itself is a virtual field, it doesn't
have any concrete fields in the DB.

There is currently zero public API support for using the
ForeignObject. The addition of ForeignObject is there to make the work
on composite fields and multicolumn foreign keys in particular easier.

> > Another big issue to solve is how to store the fields in model._meta.
> > IMO a good idea is to add all fields into some attribute ('all_fields'
> > as name might be good...), then separate the fields for different use
> > cases based on features the fields has. For example concrete_fields is
> > [f for f in self.all_fields if f.column], form fields are [f for f in
> > self.all_fields if f.form_field] (or maybe this should be called
> > logical fields - a multicolumn foreign key is logical, but not
> > concrete field, and the fields used by the foreign key are concrete
> > but not logical fields). The different field lists can be
> > cached_properties so there will be no problems from performance
> > perspective.
>
> As a matter of fact, this is one of the first things I did back in
> 2011. (-: Almost exactly like that -- _meta.fields contains all fields
> (except for M2M), _meta.concrete_fields, _meta.virtual_fields and I
> think I also needed a _meta.local_concrete (for non-inherited fields;
> I couldn't find a better name for it that wouldn't be too verbose).

I took a look at the code and I think the .Meta fields implementation
is good. (Minor point is that there is no need for the field.virtual
flag - it should be enough to check if it has a db column or not).

Do you have any plans for updatable primary keys? Updatable primary
keys would be useful: the current api where obj.lastname = newval;
obj.save() will result in save of new object instead of update of the
lastname is plain ugly. OTOH This problem might be too complex to
include in GSoC - cascading the updates is going to be hard. One
possible solution is to add some way to create the foreign keys as ON
UPDATE CASCADE and let the DB solve the problem.

- Anssi

Michal Petrucha

unread,
Apr 17, 2013, 7:10:46 PM4/17/13
to django-d...@googlegroups.com, Simone Federici
On Wed, Apr 17, 2013 at 03:49:11AM -0700, Anssi Kääriäinen wrote:
> The basic idea is that there is a new ForeignObject class. A
> ForeignObject basically just takes a related model, and from_fields
> and to_fields which are the names of the fields used for the relation.
> Then, the ORM knows how to create JOINs, subqueries etc based on a
> ForeignObject. The ForeignObject itself is a virtual field, it doesn't
> have any concrete fields in the DB.
>
> There is currently zero public API support for using the
> ForeignObject. The addition of ForeignObject is there to make the work
> on composite fields and multicolumn foreign keys in particular easier.

Thanks for the overview. For the sake of completeness, we had a
discussion on IRC earlier today, the outcome of which is that the
ForeignObject refactor will simplify the task immensely. We also came
upon a few concerns which I'll describe in a moment.

> I took a look at the code and I think the .Meta fields implementation
> is good. (Minor point is that there is no need for the field.virtual
> flag - it should be enough to check if it has a db column or not).

Agreed. Actually, the way I tried to go about the implementation, was
to replace field.column with field.columns which would be a tuple and
in case of virtual fields this would be a property which automatically
gathers the columns of all of its enclosed concrete fields, which
means a field.virtual would be necessary, but it looks like this
brings an inappropriate amount of complexity with little benefit.

> Do you have any plans for updatable primary keys? Updatable primary
> keys would be useful: the current api where obj.lastname = newval;
> obj.save() will result in save of new object instead of update of the
> lastname is plain ugly. OTOH This problem might be too complex to
> include in GSoC - cascading the updates is going to be hard. One
> possible solution is to add some way to create the foreign keys as ON
> UPDATE CASCADE and let the DB solve the problem.

Now I get to the issues that I still haven't figured out or would like
the opinion of others on. I'll separate them into sections for better
readability.

Updatable primary keys
----------------------

The one you mention here is the way the ORM finds out whether to
perform an INSERT or UPDATE on model.save(). The current way this
works is just fine for models with an AutoField which is not visible
to the user. The problem is that as soon as you use an explicit
primary key and aren't extremely careful, you'll get unexpected
results.

This isn't directly related to composite primary keys, however, those
make it more apparent. I actually raised this issue two years ago
(IIRC) and I think I discussed it with Carl and possibly others; the
outcome was that we'd just put a note into the docs that would warn
people to be careful not to make any part of primary key user-editable
in the admin and generally be cautious when handling them directly or
otherwise bad things would happen.

I still think it would be good to change this behavior. The downsides
are that
1) it would subtly change behavior for models with explicit primary
keys in a backwards-incompatible way
2) it would require models to hold something like "the last known
value of the primary key in DB"
3) it would cause weird behavior in situations like if you have two
separate instances corresponding to the same DB row, modify the PK
in one and save. I don't really know how much of an issue this
would be.
4) as Anssi said, we would need a cascading mechanism for updates.
With certain backends, this could probably be done by the database
itself, with others, I believe we'd have to do it manually
(SQLite?)...

Again, this is not entirely related to this project, but it would
still be nice to get this done before composite fields are released,
should we agree we want this implemented.

GenericForeignKey
-----------------

Two years ago we gave this some thought and decided to leave this for
a later stage. I think the later stage is here.

It's fairly easy to represent composite values as strings, something
like the following works quite well::

",".join(escape(value) for value in composite_value)

Of course, we can just do this and stick it into the database as the
object_id. The problem is with GenericRelation. This is something that
works on the database level and we can't just ask the database server
to compare a string built this way with a tuple of other values.

Ideally, we'd need to come up with an encoding that would be possible
to reproduce as a SQL expression, preferably with the same result on
all DB backends. (The expression itself can be backend-dependent, the
result should probably not.)

As far as I know, Simone Federici had something that was close to what
we need, but IIRC, it had some issues. Simone, could you please chime
in and provide some more detail on this? Have you succeeded? Was it a
dead end?

I'll admit right away that I'm not enough of a SQL wizard and I most
certainly don't know the quirks of all supported database backends
well enough to come up with a working solution to this. That's why
I'll need a bit of help from more experienced people.

Why I think we need to concern ourselves with this? The admin keeps a
log of changes. This log uses parts of the contenttypes framework.
It's true it does not use GenericForeignKey directly, but it does
practically the same thing.

Now, I think it wouldn't make much sense to merge support for
CompositeFields without them being supported by the admin. For that,
we need a string representation for composite values that we won't
change later -- doing that would invalidate all admin log entries for
models with a composite primary key from previous Django versions.
That, however, means we need to get the right representation *before*
this is merged into master. Otherwise we won't be able to change it
later, thus making it impossible to add support for
GenericForeignKey/GenericRelation.

It might be possible to still go ahead without contenttypes support
and then later use a different representation for contenttypes, but
that would mean we'd have to use one encoding for the admin and
another one for contenttypes. Ew. Not nice IMO.

Or we can just say that GenericRelation won't ever be supported, which
would make it all easier right now, but I don't really like this
either. Especially since there are people who have asked me about
contenttypes support explicitly.

If there's another possibility I haven't thought of, by all means, let
me know, because right now this looks quite bad to me.

Model._meta.fields and friends
------------------------------

In order to add proper support for virtual fields, I had to make
certain modifications to things in _meta, mainly the way the list of
fields is stored. Currently there's _meta.virtual_fields, _meta.fields
and a few others. (Right now, after the ForeignObject/ForeignKey work,
this is somewhat inconsistent as certain virtual field types are in
_meta.virtual_fields and others are in _meta.fields.)

Instead, I put everything into _meta.fields and created aliases for
certain meaningful subsets, some of which are there already (like
concrete_fields, local_concrete_fields, virtual_fields etc.).

If I turn ForeignKey into a virtual field which adds an extra local
field to store the raw id, this changes the contents of _meta.fields
for every model there is containing a ForeignKey. I know the _meta
object is considered private (though I seem to recall some discussions
about making it public which, I hope, has not happened yet), but we
all know there are lots of applications out there that use things in
_meta anyway. This has the potential of breaking some of them.

I don't think there's much we can do about it. We might try to keep
things in _meta the way they were but personally, I don't think it's
worth it -- we'd be cluttering internal APIs just to support people
doing things they should know might break between versions. I just
mention this here because this might have a nontrivial negative
impact.

Database backends
-----------------

This is quite obvious, some things like sql_create_model will require
updates, though these things are generally used from the base database
adaptor classes and not overridden by backends. However, certain
features, like backends.introspection.get_relations will require a
complete rewrite.

Take get_relations for example. This method is responsible for
detecting ForeignKeys and other relationships between database tables.
Currently the data structure it returns is designed to only support
single-column relationships which obviously doesn't have enough
expressive power to encode a composite relationship.

The question I'd like to raise here is, what's the status of the
database backend APIs? Are they private? Are they public? Are they
somewhere in between? I know there are quite a few third-party backend
connectors. Can I just go ahead and change the API of get_relations,
or do we need some kind of backwards compatibility?

Speaking of get_relations, I really look forward to parsing composite
relationships from SQLite CREATE TABLE statements... d-:

Those blasted __in lookups
--------------------------

__in lookups are used quite often, for example, during each
model.delete(), if the delete cascades. For now I do this as a long
CNF expression, which, as I am told, is horribly inefficient on
certain backends.

Say, you wanted to do the following (which is only valid in certain
SQL servers, I don't recall which ones anymore, though):

SELECT a, b, c FROM tbl1, tbl2
WHERE (a, b, c) IN [(1, 2, 3), (4, 5, 6)];

The current expression would be

SELECT a, b, c FROM tbl1, tbl2
WHERE (a = 1 AND b = 2 AND c = 3) OR (a = 4 AND b = 5 AND c = 6);

Might something like the following be a viable alternative? Looks
backend-agnostic to me as well, though it adds two levels of nested
SELECTs...

SELECT a, b, c FROM tbl1, tbl2
WHERE EXISTS (SELECT a1, b1, c1, FROM (SELECT 1 as a, 2 as b, 3 as c
UNION SELECT 4, 5, 6)
WHERE a1=1 AND b1=b AND c1=c);

Of course, the specific expression to be used would be generated by
the database backend, so we could still use
(a, b, c) IN [(1, 2, 3), (4, 5, 6)]
where available...


This would be all for tonight. I hope there isn't anything else, but a
quick scan through my notes doesn't yield anything.

Project timeline is in progress.

Michal
signature.asc

Michal Petrucha

unread,
Apr 19, 2013, 5:17:09 PM4/19/13
to django-d...@googlegroups.com
Hello,

as I promised a week ago, I'm posting a draft of my proposal. It is
far from complete, specifically, it does not contain technical details
on how things are going to be implemented. I'll try to fill these in
as soon as possible, although I wrote more on most of them in my
previous emails to this list. The most important thing is that it
contains a timeline and a description of possible fallbacks.

The proposal is accessible as a public gist [1] where it will be
updated in the future, I'm posting the full text here as well for
convenience.

Have a nice weekend.

Michal

[1] https://gist.github.com/koniiiik/5408673


Composite Fields, vol. 2
""""""""""""""""""""""""

Aim of this project
===================

Since I started working on this two years ago, I managed (with the
help of a few people) to get most of the hard work done. The biggest
part that I didn't get around to implementing was support in the ORM
for multi-column joins. This has, however, been implemented recently
by Jeremy Tiller and Anssi, which means there are only a few things
left to be done.

First of all, the code sitting in my github repo is badly out of date,
which means it needs to be updated to the current state of master.
While I'm at it, I also want to clean up the revision history to get
it as close to a state where it could be just reviewed and merged
directly as possible.

Beside this, there are only the juicier features left that we
initially wanted to leave unsupported and get back to them later.
Those are the following (in no particular order):

- ``GenericForeignKey`` support for ``CompositeField``
- more intelligent handling of ``__in`` lookups
- database instrospection and ``inspectdb``
- detection of a change of the primary key in model instances

Timeline
========

Our exam period starts on May 20. and ends on June 28., which means
that I can't guarantee I will be able to fully dedicate myself to the
project for the first two weeks, however, if nothing goes wrong, I
should be able to pass all exams before June 17.

Also, I intend to go to EuroPython, which means the first week of July
won't be as fruitful as the others, but otherwise I'm ready to work
full time on the project.

Week 1 (Jun 17. - Jun 23.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- porting the required virtual field changes, like a ``VirtualField``
class and ``django.db.models.options`` changes
- revisiting the documentation I wrote two years ago to reflect the
evolution this project has gone through

Week 2 (Jun 24. - Jun 30.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- porting the virtual ``ForeignKey`` patchset on top of master to get
most of the functionality right

Week 3 (Jul 1. - Jul 7.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- EuroPython 2013, Florence
- I intend to spend the full two days of sprints working on this but I
can't promise much more during this week.
- running through the tests and fixing those that need updating,
ironing out the remaining wrinkles

Week 4 (Jul 8. - Jul 14.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- basic ``CompositeField`` functionality, ``CompositeValue`` and
descriptor protocol

Week 5 (Jul 15. - Jul 21.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``db_index``, ``unique`` and ``primary_key`` for ``CompositeField``
- backend-dependent SQL for ``__in`` lookups on ``CompositeField``

Week 6 (Jul 22. - Jul 28.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- the few patches required to make admin work with composite primary
keys

Week 7 (Jul 29. - Aug 4.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- composite ``ForeignKey``: basic functionality, descriptors, database
JOINs

Week 8 (Aug 5. - Aug 11.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- composite ``ForeignKey``: customization of auxiliary fields,
adapting ``ModelForms`` and the admin in case it is necessary

Week 9 (Aug 12. - Aug 18.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``GenericForeignKey`` and ``GenericRelation`` support

Week 10 (Aug 19. - Aug 25.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- database introspection: create composite fields where necessary

Week 11 (Aug 26. - Sep 1.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- database introspection: composite ``ForeignKey``

Week 12 (Sep 2. - Sep 8.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- better handling of modifications to primary keys: detecting the fact
and issuing an appropriate DB query

Week 13 (Sep 9. - Sep 15.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- revision log cleanup
- better handling of modifications to primary keys: cascading primary
key updates

Week 14 (Sep 16. - Sep 22.)
~~~~~~~~~~~~~~~~~~~~~~~~~~
- this is after the suggested “soft pencils down” deadline, therefore
I'll keep this week as a reserve

Deliverables and fallbacks
==========================

As the timeline shows, the first few weeks will be dedicated to
internal refactors of the relationship handling parts of the ORM,
which means the outcome won't be entirely evident or useful by itself.
By the end of the sixth week, I expect the refactor of ``ForeignKey``
to be stable for concrete target fields. Also, the non-relationship
part of ``CompositeField`` functionality should be ported on top of
this.

Next, there is support for composite foreign keys. This is the core of
the work and the absolute minimum I want to squeeze out of this
project to consider it at least partially successful.

All the features that appear later in the timeline are optional,
ordered by the priority with which I would like to implement them. In
the unlikely case of extreme difficulties with the previous parts of
this project some (or possibly all) of them may be left out.
signature.asc

Andrew Godwin

unread,
Apr 24, 2013, 3:06:34 PM4/24/13
to django-d...@googlegroups.com
Hi Michal,

This looks like a good starting point for a proposal (not to mention that we don't doubt that you know about the problem area here!) - a few comments:

 - I'd like a bit more detail on some of your timeline points, in particular what the introspection parts and primary key updates are (personal bias, I suspect)
 - I'd like more details on what's been holding it back since GSOC 2011, and how this GSOC is going to help serve those

Still, I like the general idea. You've mentioned tests and docs too. And I really, really want to land this feature!

Andrew

Michal Petrucha

unread,
Apr 24, 2013, 8:28:38 PM4/24/13
to django-d...@googlegroups.com
Hello,

Last week I posted a draft of a timeline for my proposed project and
only afterward I realized that the timeline by itself does not convey
any meaningful information without explaining each item a little bit
deeper. Here comes the full version which I managed to finally
complete a while ago.

Again, for those who prefer reading the formatted version, the
proposal is also available as a gist:
https://gist.github.com/koniiiik/5408673

Apologies for the huge amounts of text, I seem to have lost track
while writing.

Have an awesome day
Michal



Composite Fields, vol. 2
""""""""""""""""""""""""

Abstract
========

Two years ago, as part of GSoC 2011, I started working on an
implementation of composite fields as a means to support multi-column
primary keys in models. While the project was was successful, the
timeframe wasn't sufficient to finish it and get it into a state where it
could be merged into Django. This year I propose to take the last
remaining steps for this project to be ready and hopefully even extend it
with some extra features which were left out of the initial project.

Aim of this project
===================

Since I started working on this two years ago, I managed (with the help of
a few people) to get most of the hard work done. The biggest part that I
didn't get around to implementing was support in the ORM for multi-column
joins. This has, however, been implemented recently by Jeremy Tillman and
Anssi, which means there are only a few things left to be done.

First of all, the code sitting in my github repo is badly out of date,
which means it needs to be updated to the current state of master. While
I'm at it, I also want to clean up the revision history to get it as close
to a state where it could be just reviewed and merged directly as
possible.

Beside this, there are only the juicier features left that we initially
wanted to leave unsupported and get back to them later. Those are the
following (in no particular order):

- ``GenericForeignKey`` support for ``CompositeField``
- more intelligent handling of ``__in`` lookups
- database instrospection and ``inspectdb``
- detection of a change of the primary key in model instances

Summary of ``CompositeField``
=============================

This section summarizes the basic API as established in the proposal for
GSoC 2011 [1]_.

A ``CompositeField`` requires a list of enclosed regular model fields as
positional arguments, as shown in this example::

class SomeModel(models.Model):
first_field = models.IntegerField()
second_field = models.CharField(max_length=100)
composite = models.CompositeField(first_field, second_field)

The model class then contains a descriptor for the composite field, which
returns a ``CompositeValue`` which is a customized namedtuple, the
descriptor accepts any iterable of the appropriate length. An example
interactive session::

>>> instance = new SomeModel(first_field=47, second_field="some string")
>>> instance.composite
CompositeObject(first_field=47, second_field='some string')
>>> instance.composite.first_field
47
>>> instance.composite[1]
'some string'
>>> instance.composite = (74, "other string")
>>> instance.first_field, instance.second_field
(74, 'other string')

``CompositeField`` supports the following standard field options:
``unique``, ``db_index``, ``primary_key``. The first two will simply add a
corresponding tuple to ``model._meta.unique_together`` or
``model._meta.index_together``. Other field options don't make much sense
in the context of composite fields.

Supported ``QuerySet`` filters will be ``exact`` and ``in``. The former
should be clear enough, the latter is elaborated in a separate section.

It will be possible to use a ``CompositeField`` as a target field of
``ForeignKey``, ``OneToOneField`` and ``ManyToManyField``. This is
described in more detail in the following section.

Changes in ``ForeignKey``
=========================

Currently ``ForeignKey`` is a regular concrete field which manages both
the raw value stored in the database and the higher-level relationship
semantics. Managing the raw value is simple enough for simple
(single-column) targets. However, in the case of a composite target field,
this task becomes more complex. The biggest problem is that many parts of
the ORM work under the assumption that for each database column there is a
model field it can assign the value from the column to. While it might be
possible to lift this restriction, it would be a really complex project by
itself.

On the other hand, there is the abstraction of virtual fields working on
top of other fields which is required for this project anyway. The way
forward would be to use this abstraction for relationship fields.
Currently, ``ForeignKey`` (and by extension ``OneToOneField``) is the only
field whose ``name`` and ``attname`` differ, where ``name`` stores the
value dictated by the semantics of the field and ``attname`` stores the
raw value from the database.

We can use this to our advantage and put an auxiliary field into the
``attname`` of each ``ForeignKey``, which would be of the same database
type as the target field, and turn ``ForeignKey`` into a virtual field on
top of the auxiliary field. This solution has the advantage that it
offloads the need to manage the raw database value off ``ForeignKey`` and
uses a field specifically intended for the task.

In order to keep this backwards compatible and avoid the need to
explicitly create two fields for each ``ForeignKey``, the auxiliary field
needs to be created automatically during the phase where a model class is
created by its metaclass. Initially I implemented this as a method on
``ForeignKey`` which takes the target field and creates its copy, touches
it up and adds it to the model class. However, this requires performing
special tasks with certain types of fields, such as ``AutoField`` which
needs to be turned into an ``IntegerField`` or ``CompositeField`` which
requires copying its enclosed fields as well.

A better approach is to add a method such as ``create_auxiliary_copy`` on
``Field`` which would create all new field instances and add them to the
appropriate model class.

One possible problem with these changes is that they change the contents
of ``_meta.fields`` in each model out there that contains a relationship
field. For example, if a model contains the following fields::

['id',
'name',
'address',
'place_ptr',
'rating',
'serves_hot_dogs',
'serves_pizza',
'chef']

where ``place_ptr`` is a ``OneToOneField`` and ``chef`` is a
``ForeignKey``, after the change it will contain the following list::

['id',
'name',
'address',
'place_ptr',
'place_ptr_id',
'rating',
'serves_hot_dogs',
'serves_pizza',
'chef',
'chef_id']

This causes a lot of failures in the Django test suite, because there are
a lot of tests relying on the contents of ``_meta.fields`` or other
related attributes/properties. (Actually, this example is taken from one
of these tests,
``model_inheritance.tests.ModelInheritanceTests.test_multiple_table``.)
Fixing these is fairly simple, all they need is to add the appropriate
``__id`` fields. However, this raises a concern of how ``_meta`` is
regarded. It has always been a private API officially, but everyone uses
it in their projects anyway. I still think the change is worth it, but it
might be a good idea to include a note about the change in the release
notes.

Porting previous work on top of master
======================================

The first major task of this project is to take the code I wrote as part
of GSoC 2011 and sync it with the current state of master. The order in
which I implemented things two years ago was to implement
``CompositeField`` first and then I did a refactor of ``ForeignKey`` which
is required to make it support ``CompositeField``. This turned out to be
inefficient with respect to the development process, because some parts of
the refactor broke the introduced ``CompositeField`` functionality,
meaning I had to effectively reimplement parts of it again. Also, some
abstractions introduced by the refactor made it possible to rewrite
certain parts in a cleaner way than what was necessary for
``CompositeField`` alone (e.g. database creation or certain features of
``model._meta``).

In light of these findings I am convinced that a better approach would be
to first do the required refactor of ``ForeignKey`` and implement
CompositeField as the next step. This will result in a better maintainable
development branch and a cleaner revision history, making it easier to
review the work before its eventual inclusion into Django.

``__in`` lookups for ``CompositeField``
=======================================

The existing implementation of ``CompositeField`` handles ``__in`` lookups
in the generic, backend-independent ``WhereNode`` class and uses a
disjunctive normal form expression as in the following example::

SELECT a, b, c FROM tbl1, tbl2
WHERE (a = 1 AND b = 2 AND c = 3) OR (a = 4 AND b = 5 AND c = 6);

The problem with this solution is that in cases where the list of values
contains tens or hundreds of tuples, this DNF expression will be extremely
long and the database will have to evaluate it for each and every row,
without a possibility of optimizing the query.

Certain database backends support the following alternative::

SELECT a, b, c FROM tbl1, tbl2
WHERE (a, b, c) IN [(1, 2, 3), (4, 5, 6)];

This would probably be the best option, but it can't be used by SQLite,
for instance. This is also the reason why the DNF expression was
implemented in the first place.

In order to support this more natural syntax, the ``DatabaseOperations``
needs to be extended with a method such as ``composite_in_sql``.

However, this leaves the issue of the inefficient DNF unresolved for
backends without support for tuple literals. For such backends, the
following expression is proposed::

SELECT a, b, c FROM tbl1, tbl2
WHERE EXISTS (SELECT a1, b1, c1, FROM (SELECT 1 as a, 2 as b, 3 as c
UNION SELECT 4, 5, 6)
WHERE a1=1 AND b1=b AND c1=c);

Since both syntaxes are rather generic and at least one of them should fit
any database backend directly, a new flag will be introduced,
``DatabaseFeatures.supports_tuple_literals`` which the default
implementation of ``composite_in_sql`` will consult in order to choose
between the two options.

``contenttypes`` and ``GenericForeignKey``
==========================================

Two years ago we gave this some thought and decided to leave this for
a later stage. I think the later stage is here.

It's fairly easy to represent composite values as strings. Given an
``escape`` function which uniquely escapes commas, something like the
following works quite well::

",".join(escape(value) for value in composite_value)

However, in order to support JOINs generated by ``GenericRelation``, we
need to be able to reproduce exactly the same encoding using an SQL
expression which would be used in the JOIN condition.

Luckily, while thus encoded strings need to be possible to decode in
Python (for example, when retrieving the related object using
``GenericForeignKey`` or when the admin decodes the primary key from URL),
this isn't necessary at the database level. Using SQL we only ever need to
perform this in one direction, that is from a tuple of values into a
string.

That means we can use a generalized version of the function
``django.contrib.admin.utils.quote`` which replaces each unsafe
character with its ASCII value in hexadecimal base, preceded by an escape
character. In this case, only two characters are unsafe -- comma (which is
used to separate the values) and an escape character (which I arbitrarily
chose as '~').

To reproduce this encoding, all values need to be cast to strings and then
for each such string two calls to the ``replace`` functions are made::

replace(replace(CAST (`column` AS text), '~', '~7E'), ',', '~2C')

According to available documentation, all four supported database backends
provide the ``replace`` function. [2]_ [3]_ [4]_ [5]_

Even though the ``replace`` function seems to be available in all major
database servers (even ones not officially supported by Django, including
MSSQL, DB2, Informix and others), this is still probably best left to the
database backend and will be implemented as
``DatabaseOperations.composite_value_to_text_sql``.

Database introspection, ``inspectdb``
=====================================

There are three main goals concerning database introspection in this
project. The first is to ensure the output of ``inspectdb`` remains the
same as it is now for models with simple primary keys and simple foreign
key references, or at least equivalent. While this shouldn't be too
difficult to achieve, it will still be regarded with high importance.

The second goal is to extend ``inspectdb`` to also create a
``CompositeField`` in models where the table contains a composite primary
key. This part shouldn't be too difficult,
``DatabaseIntrospection.get_primary_key_column`` will be renamed to
``get_primary_key`` which will return a tuple of columns and in case the
tuple contains more than one element, an appropriate ``CompositeField``
will be added. This will also require updating
``DatabaseWrapper.check_constraints`` for certain backends since it uses
``get_primary_key_column``.

The third goal is to also make ``inspectdb`` aware of composite foreign
keys. This will need a rewrite of ``get_relations`` which will have to
return a mapping between tuples of columns instead of single columns. It
should also ensure each tuple of columns pointed to by a foreign key gets
a ``CompositeField``. This part will also probably require some changes in
other backend methods as well, especially since each backend has a unique
tangle of introspection methods.

This part requires a tremendous amount of work, because practically every
single change needs to be done four times and needs separate research of
the specific backend. Therefore I can't promise to deliver full support
for all features mentioned in this section for all backends. I'd say
backwards compatibility is a requirement, recognition of composite primary
keys is a highly wanted feature that I'll try to implement for as many
backends as possible and recognition of composite foreign keys would be a
nice extra to have for at least one or two backends.

I'll be implementing the features for the individual backends in the
following order: PostgreSQL, MySQL, SQLite and Oracle. I put PostgreSQL
first because, well, this is the backend with the best support in Django
(and also because it is the one where I'd actually use the features I'm
proposing). Oracle comes last because I don't have any way to test it and
I'm afraid I'd be stabbing in the dark anyway. Of the two remaining
backends I put MySQL first for two reasons. First, I don't think people
need to run ``inspectdb`` on SQLite databases too often (if ever). Second,
on MySQL the task seems marginally easier as the database has
introspection features other than just “give me the SQL statement used to
create this table”, whose parsing is most likely going to be a complete
mess.

All in all, extending ``inspectdb`` features is a tedious and difficult
task with shady outcome, which I'm well aware of. Still, I would like to
try to at least implement the easier parts for the most used backends. It
might quite possibly turn out that I won't manage to implement more than
composite primary key detection for PostgreSQL. This is the reason I keep
this as one of the last features I intend to work on, as shown in the
timeline. It isn't a necessity, we can always just add a note to the docs
that ``inspectdb`` just can't detect certain scenarios and ask people to
edit their models manually.

Updatable primary keys in models
================================

The algorithm that determines what kind of database query to issue on
``model.save()`` is a fairly simple and well-documented one [6]_. If a row
exists in the database with the value of its primary key equal to the
saved object, it is updated, otherwise a new row is inserted. This
behavior is intuitive and works well for models where the primary key is
automatically created by the framework (be it an ``AutoField`` or a parent
link in the case of model inheritance).

However, as soon as the primary key is explicitly created, the behavior
becomes intuitive and might be confusing, for example, to users of the
admin. For instance, say we have the following model::

class Person(models.Model):
first_name = models.CharField(max_length=47)
last_name = models.CharField(max_length=47)
shoe_size = models.PositiveSmallIntegerField()

full_name = models.CompositeField(first_name, last_name,
primary_key=True)

Then we register the model in the admin using the standard one-liner::

admin.site.register(Person)

Since we haven't excluded any fields, all three fields will be editable in
the admin. Now, suppose there's an instance whose ``full_name`` is
``CompositeValue(first_name='Darth', last_name='Vadur')``. A user decides
to fix the last name using the admin, hits the “Save” button and instead
of fixing an existing record, a new one will appear with the new value,
while the old one remains untouched. This behavior is clearly broken from
the point of view of the user.

It can be argued that it is the developer's fault that the database schema
is poorly chosen and that they expose the primary key to their users.
While this may be true in some cases, it is still to some extent a
subjective matter.

Therefore I propose a new behavior for ``model.save()`` where it would
detect a change in the instance's primary key and in that case issue an
``UPDATE`` for the right row, i.e. ``WHERE primary_key = previous_value``.

Of course, just going ahead and changing the behavior in this way for all
models would be backwards incompatible. To do this properly, we would need
to make this an opt-in feature. This can be achieved in multiple ways.

1) add a keyword argument such as ``update_pk`` to ``Model.save``
2) add a new option to ``Model.Meta``, ``updatable_pk``
3) make this a project-wide setting

Option 3 doesn't look pleasant and I think I can safely eliminate that.
Option 2 is somewhat better, although it adds a new ``Meta`` option.
Option 1 is the most flexible solution, however, it does not change the
behavior of the admin, at least not by default. This can be worked around
by overriding the ``save`` method to use a different default::

class MyModel(models.Model):
def save(self, update_pk=True, **kwargs):
kwargs['update_pk'] = update_pk
return super(MyModel, self).save(**kwargs)

To avoid the need to repeat this for each model, a class decorator might
be provided to perform this automatically.

In order to implement this new behavior a little bit of extra complexity
would have to be added to models. Model instances would need to store the
last known value of the primary key as retrieved from the database. On
save it would just find out whether the last known value is present and in
that case issue an ``UPDATE`` using the old value in the ``WHERE``
condition.

So far so good, this could be implemented fairly easily. However, the
problem becomes considerably more difficult as soon as we take into
account the fact that updating a primary key value may break foreign key
references. In order to avoid breaking references the ``on_delete``
mechanism of ``ForeignKey`` would have to be extended to support updates
as well. This means that the collector used by deletion will need to be
extended as well.

The problem becomes particularly nasty if we realize that a ``ForeignKey``
might be part of a primary key, which means the collector needs to keep
track of which field depends on which in a graph of potentially unlimited
size. Compared to this, deletion is simpler as it only needs to find a
list of all affected model instances as opposed to having to keep track of
which field to update using which value.

Given the complexity of this problem and the fact that it is not directly
related to composite fields, this is left as the last feature which will
be implemented only if I manage to finish everything else on time.

Timeline
========

Our exam period starts on May 20. and ends on June 28., which means that I
can't guarantee I will be able to fully dedicate myself to the project for
the first two weeks, however, if nothing goes wrong, I should be able to
pass all exams before June 17.

Also, I intend to go to EuroPython, which means the first week of July
won't be as fruitful as the others, but otherwise I'm ready to work full
time on the project.

Week 1 (Jun 17. - Jun 23.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- porting the required virtual field changes, like a ``VirtualField``
class and ``django.db.models.options`` changes
- revisiting the documentation I wrote two years ago to reflect the
evolution this project has gone through

Week 2 (Jun 24. - Jun 30.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- porting the virtual ``ForeignKey`` patchset on top of master to get most
of the functionality right

Week 3 (Jul 1. - Jul 7.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- EuroPython 2013, Florence
- I intend to spend the full two days of sprints working on this but I
can't promise much more during this week.
- running through the tests and fixing those that need updating, ironing
out the remaining wrinkles

Week 4 (Jul 8. - Jul 14.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- basic ``CompositeField`` functionality, ``CompositeValue`` and
descriptor protocol

Week 5 (Jul 15. - Jul 21.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``db_index``, ``unique`` and ``primary_key`` for ``CompositeField``
- backend-dependent SQL for ``__in`` lookups on ``CompositeField``

Week 6 (Jul 22. - Jul 28.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- the few patches required to make admin work with composite primary keys

Week 7 (Jul 29. - Aug 4.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- composite ``ForeignKey``: basic functionality, descriptors, database
JOINs

Week 8 (Aug 5. - Aug 11.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- composite ``ForeignKey``: customization of auxiliary fields, adapting
``ModelForms`` and the admin in case it is necessary

Week 9 (Aug 12. - Aug 18.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``GenericForeignKey`` and ``GenericRelation`` support

Week 10 (Aug 19. - Aug 25.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- database introspection: create composite fields where necessary

Week 11 (Aug 26. - Sep 1.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- database introspection: composite ``ForeignKey``

Week 12 (Sep 2. - Sep 8.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- better handling of modifications to primary keys: detecting the fact and
issuing an appropriate DB query

Week 13 (Sep 9. - Sep 15.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- revision log cleanup
- better handling of modifications to primary keys: cascading primary key
updates

Week 14 (Sep 16. - Sep 22.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- this is after the suggested “soft pencils down” deadline, therefore I'll
keep this week as a reserve

Deliverables and fallbacks
==========================

As the timeline shows, the first few weeks will be dedicated to internal
refactors of the relationship handling parts of the ORM, which means the
outcome won't be entirely evident or useful by itself. By the end of the
sixth week, I expect the refactor of ``ForeignKey`` to be stable for
concrete target fields. Also, the non-relationship part of
``CompositeField`` functionality should be ported on top of this.

Next, there is support for composite foreign keys. This is the core of the
work and the absolute minimum I want to squeeze out of this project to
consider it at least partially successful.

All the features that appear later in the timeline are optional, ordered
by the priority with which I would like to implement them. In the unlikely
case of extreme difficulties with the previous parts of this project some
(or possibly all) of them may be left out.

A note about the history of this project
========================================

During GSoC 2011 the project progressed steadily, especially during the
first half. The second half was a little bit more blurry because I had no
idea how difficult the task of implementing composite foreign keys would
be before I started to actually omplement them. Still, even though the
second half didn't yield a working implementation (just a half-baked
refactor), it was crucial for me to understand how it can be done.

Unfortunately, soon thereafter school kicked in and along with it
homework, assignments, thesis and other projects related to programming
contests we organize throughout the academic year. This left me with
barely enough time to sync the code with master from time to time, let
alone make any significant progress.

I didn't apply for GSoC 2012 because I knew I wouldn't have enough time to
work on the project because of my thesis and state exams. This would make
it a poor candidate for a GSoC project and even if it did get chosen, it
wouldn't feel fair to other GSoC participants.

Without any deadlines ahead of me and without the GSoC participation, I
was lacking the motivation to keep working on this. I admit, the lack of
progress throughout last summer was largely due to my laziness. I did sync
the code from time to time but I stopped around the time when Aymeric
started pushing all the autogenerated Py3k commits. Sorting out the merge
conflicts and updating a patchset of this scope to the new Py3k
compatibility standards was a really painful process.

I managed to at least get the code through the Py3k changes sometime in
October before the RuPy.eu conference, putting aside everything else. At
that time I was really confident I could keep working on it, but
obviously, that didn't happen.

All in all, the reasons why there hasn't been any significant progress can
be summarized as these two:

1) lack of free time (don't we all know this one?)
2) lack of short-term incentive (a.k.a. laziness)

Why I think the upcoming GSoC will help?

Regarding the first one, having planned half a year in advance for a
Summer of Code project means that I don't have any other plans for the
summer beside this. (Well, except for EuroPython, which I hope will help
this project advance anyway.)

As far as motivation is concerned, there's the obvious one -- a nice
paycheck from Google if I succeed to deliver. However, that's not the only
one, I think the fact that I would have a mentor supervising my work, whom
I respond to throughout the summer, has a really great effect in this
regard.

References
==========

.. [1] GSoC 2011 proposal: Composite Fields
(https://gist.github.com/koniiiik/5450625)
.. [2] PostgreSQL documentation: Other String Functions
(http://www.postgresql.org/docs/9.1/static/functions-string.html#FUNCTIONS-STRING-OTHER)
.. [3] SQLite documentation: Core Functions
(http://www.sqlite.org/lang_corefunc.html#replace)
.. [4] Oracle® Database SQL Reference: REPLACE
(http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions134.htm)
.. [5] MySQL Reference Manual: String Functions
(http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace)
.. [6] Django documentation: Model instance reference
(https://docs.djangoproject.com/en/dev/ref/models/instances/#how-django-knows-to-update-vs-insert)
signature.asc
Reply all
Reply to author
Forward
0 new messages