Hello,
Last week I posted a draft of a timeline for my proposed project and
only afterward I realized that the timeline by itself does not convey
any meaningful information without explaining each item a little bit
deeper. Here comes the full version which I managed to finally
complete a while ago.
Again, for those who prefer reading the formatted version, the
proposal is also available as a gist:
https://gist.github.com/koniiiik/5408673
Apologies for the huge amounts of text, I seem to have lost track
while writing.
Have an awesome day
Michal
Composite Fields, vol. 2
""""""""""""""""""""""""
Abstract
========
Two years ago, as part of GSoC 2011, I started working on an
implementation of composite fields as a means to support multi-column
primary keys in models. While the project was was successful, the
timeframe wasn't sufficient to finish it and get it into a state where it
could be merged into Django. This year I propose to take the last
remaining steps for this project to be ready and hopefully even extend it
with some extra features which were left out of the initial project.
Aim of this project
===================
Since I started working on this two years ago, I managed (with the help of
a few people) to get most of the hard work done. The biggest part that I
didn't get around to implementing was support in the ORM for multi-column
joins. This has, however, been implemented recently by Jeremy Tillman and
Anssi, which means there are only a few things left to be done.
First of all, the code sitting in my github repo is badly out of date,
which means it needs to be updated to the current state of master. While
I'm at it, I also want to clean up the revision history to get it as close
to a state where it could be just reviewed and merged directly as
possible.
Beside this, there are only the juicier features left that we initially
wanted to leave unsupported and get back to them later. Those are the
following (in no particular order):
- ``GenericForeignKey`` support for ``CompositeField``
- more intelligent handling of ``__in`` lookups
- database instrospection and ``inspectdb``
- detection of a change of the primary key in model instances
Summary of ``CompositeField``
=============================
This section summarizes the basic API as established in the proposal for
GSoC 2011 [1]_.
A ``CompositeField`` requires a list of enclosed regular model fields as
positional arguments, as shown in this example::
class SomeModel(models.Model):
first_field = models.IntegerField()
second_field = models.CharField(max_length=100)
composite = models.CompositeField(first_field, second_field)
The model class then contains a descriptor for the composite field, which
returns a ``CompositeValue`` which is a customized namedtuple, the
descriptor accepts any iterable of the appropriate length. An example
interactive session::
>>> instance = new SomeModel(first_field=47, second_field="some string")
>>> instance.composite
CompositeObject(first_field=47, second_field='some string')
>>> instance.composite.first_field
47
>>> instance.composite[1]
'some string'
>>> instance.composite = (74, "other string")
>>> instance.first_field, instance.second_field
(74, 'other string')
``CompositeField`` supports the following standard field options:
``unique``, ``db_index``, ``primary_key``. The first two will simply add a
corresponding tuple to ``model._meta.unique_together`` or
``model._meta.index_together``. Other field options don't make much sense
in the context of composite fields.
Supported ``QuerySet`` filters will be ``exact`` and ``in``. The former
should be clear enough, the latter is elaborated in a separate section.
It will be possible to use a ``CompositeField`` as a target field of
``ForeignKey``, ``OneToOneField`` and ``ManyToManyField``. This is
described in more detail in the following section.
Changes in ``ForeignKey``
=========================
Currently ``ForeignKey`` is a regular concrete field which manages both
the raw value stored in the database and the higher-level relationship
semantics. Managing the raw value is simple enough for simple
(single-column) targets. However, in the case of a composite target field,
this task becomes more complex. The biggest problem is that many parts of
the ORM work under the assumption that for each database column there is a
model field it can assign the value from the column to. While it might be
possible to lift this restriction, it would be a really complex project by
itself.
On the other hand, there is the abstraction of virtual fields working on
top of other fields which is required for this project anyway. The way
forward would be to use this abstraction for relationship fields.
Currently, ``ForeignKey`` (and by extension ``OneToOneField``) is the only
field whose ``name`` and ``attname`` differ, where ``name`` stores the
value dictated by the semantics of the field and ``attname`` stores the
raw value from the database.
We can use this to our advantage and put an auxiliary field into the
``attname`` of each ``ForeignKey``, which would be of the same database
type as the target field, and turn ``ForeignKey`` into a virtual field on
top of the auxiliary field. This solution has the advantage that it
offloads the need to manage the raw database value off ``ForeignKey`` and
uses a field specifically intended for the task.
In order to keep this backwards compatible and avoid the need to
explicitly create two fields for each ``ForeignKey``, the auxiliary field
needs to be created automatically during the phase where a model class is
created by its metaclass. Initially I implemented this as a method on
``ForeignKey`` which takes the target field and creates its copy, touches
it up and adds it to the model class. However, this requires performing
special tasks with certain types of fields, such as ``AutoField`` which
needs to be turned into an ``IntegerField`` or ``CompositeField`` which
requires copying its enclosed fields as well.
A better approach is to add a method such as ``create_auxiliary_copy`` on
``Field`` which would create all new field instances and add them to the
appropriate model class.
One possible problem with these changes is that they change the contents
of ``_meta.fields`` in each model out there that contains a relationship
field. For example, if a model contains the following fields::
['id',
'name',
'address',
'place_ptr',
'rating',
'serves_hot_dogs',
'serves_pizza',
'chef']
where ``place_ptr`` is a ``OneToOneField`` and ``chef`` is a
``ForeignKey``, after the change it will contain the following list::
['id',
'name',
'address',
'place_ptr',
'place_ptr_id',
'rating',
'serves_hot_dogs',
'serves_pizza',
'chef',
'chef_id']
This causes a lot of failures in the Django test suite, because there are
a lot of tests relying on the contents of ``_meta.fields`` or other
related attributes/properties. (Actually, this example is taken from one
of these tests,
``model_inheritance.tests.ModelInheritanceTests.test_multiple_table``.)
Fixing these is fairly simple, all they need is to add the appropriate
``__id`` fields. However, this raises a concern of how ``_meta`` is
regarded. It has always been a private API officially, but everyone uses
it in their projects anyway. I still think the change is worth it, but it
might be a good idea to include a note about the change in the release
notes.
Porting previous work on top of master
======================================
The first major task of this project is to take the code I wrote as part
of GSoC 2011 and sync it with the current state of master. The order in
which I implemented things two years ago was to implement
``CompositeField`` first and then I did a refactor of ``ForeignKey`` which
is required to make it support ``CompositeField``. This turned out to be
inefficient with respect to the development process, because some parts of
the refactor broke the introduced ``CompositeField`` functionality,
meaning I had to effectively reimplement parts of it again. Also, some
abstractions introduced by the refactor made it possible to rewrite
certain parts in a cleaner way than what was necessary for
``CompositeField`` alone (e.g. database creation or certain features of
``model._meta``).
In light of these findings I am convinced that a better approach would be
to first do the required refactor of ``ForeignKey`` and implement
CompositeField as the next step. This will result in a better maintainable
development branch and a cleaner revision history, making it easier to
review the work before its eventual inclusion into Django.
``__in`` lookups for ``CompositeField``
=======================================
The existing implementation of ``CompositeField`` handles ``__in`` lookups
in the generic, backend-independent ``WhereNode`` class and uses a
disjunctive normal form expression as in the following example::
SELECT a, b, c FROM tbl1, tbl2
WHERE (a = 1 AND b = 2 AND c = 3) OR (a = 4 AND b = 5 AND c = 6);
The problem with this solution is that in cases where the list of values
contains tens or hundreds of tuples, this DNF expression will be extremely
long and the database will have to evaluate it for each and every row,
without a possibility of optimizing the query.
Certain database backends support the following alternative::
SELECT a, b, c FROM tbl1, tbl2
WHERE (a, b, c) IN [(1, 2, 3), (4, 5, 6)];
This would probably be the best option, but it can't be used by SQLite,
for instance. This is also the reason why the DNF expression was
implemented in the first place.
In order to support this more natural syntax, the ``DatabaseOperations``
needs to be extended with a method such as ``composite_in_sql``.
However, this leaves the issue of the inefficient DNF unresolved for
backends without support for tuple literals. For such backends, the
following expression is proposed::
SELECT a, b, c FROM tbl1, tbl2
WHERE EXISTS (SELECT a1, b1, c1, FROM (SELECT 1 as a, 2 as b, 3 as c
UNION SELECT 4, 5, 6)
WHERE a1=1 AND b1=b AND c1=c);
Since both syntaxes are rather generic and at least one of them should fit
any database backend directly, a new flag will be introduced,
``DatabaseFeatures.supports_tuple_literals`` which the default
implementation of ``composite_in_sql`` will consult in order to choose
between the two options.
``contenttypes`` and ``GenericForeignKey``
==========================================
Two years ago we gave this some thought and decided to leave this for
a later stage. I think the later stage is here.
It's fairly easy to represent composite values as strings. Given an
``escape`` function which uniquely escapes commas, something like the
following works quite well::
",".join(escape(value) for value in composite_value)
However, in order to support JOINs generated by ``GenericRelation``, we
need to be able to reproduce exactly the same encoding using an SQL
expression which would be used in the JOIN condition.
Luckily, while thus encoded strings need to be possible to decode in
Python (for example, when retrieving the related object using
``GenericForeignKey`` or when the admin decodes the primary key from URL),
this isn't necessary at the database level. Using SQL we only ever need to
perform this in one direction, that is from a tuple of values into a
string.
That means we can use a generalized version of the function
``django.contrib.admin.utils.quote`` which replaces each unsafe
character with its ASCII value in hexadecimal base, preceded by an escape
character. In this case, only two characters are unsafe -- comma (which is
used to separate the values) and an escape character (which I arbitrarily
chose as '~').
To reproduce this encoding, all values need to be cast to strings and then
for each such string two calls to the ``replace`` functions are made::
replace(replace(CAST (`column` AS text), '~', '~7E'), ',', '~2C')
According to available documentation, all four supported database backends
provide the ``replace`` function. [2]_ [3]_ [4]_ [5]_
Even though the ``replace`` function seems to be available in all major
database servers (even ones not officially supported by Django, including
MSSQL, DB2, Informix and others), this is still probably best left to the
database backend and will be implemented as
``DatabaseOperations.composite_value_to_text_sql``.
Database introspection, ``inspectdb``
=====================================
There are three main goals concerning database introspection in this
project. The first is to ensure the output of ``inspectdb`` remains the
same as it is now for models with simple primary keys and simple foreign
key references, or at least equivalent. While this shouldn't be too
difficult to achieve, it will still be regarded with high importance.
The second goal is to extend ``inspectdb`` to also create a
``CompositeField`` in models where the table contains a composite primary
key. This part shouldn't be too difficult,
``DatabaseIntrospection.get_primary_key_column`` will be renamed to
``get_primary_key`` which will return a tuple of columns and in case the
tuple contains more than one element, an appropriate ``CompositeField``
will be added. This will also require updating
``DatabaseWrapper.check_constraints`` for certain backends since it uses
``get_primary_key_column``.
The third goal is to also make ``inspectdb`` aware of composite foreign
keys. This will need a rewrite of ``get_relations`` which will have to
return a mapping between tuples of columns instead of single columns. It
should also ensure each tuple of columns pointed to by a foreign key gets
a ``CompositeField``. This part will also probably require some changes in
other backend methods as well, especially since each backend has a unique
tangle of introspection methods.
This part requires a tremendous amount of work, because practically every
single change needs to be done four times and needs separate research of
the specific backend. Therefore I can't promise to deliver full support
for all features mentioned in this section for all backends. I'd say
backwards compatibility is a requirement, recognition of composite primary
keys is a highly wanted feature that I'll try to implement for as many
backends as possible and recognition of composite foreign keys would be a
nice extra to have for at least one or two backends.
I'll be implementing the features for the individual backends in the
following order: PostgreSQL, MySQL, SQLite and Oracle. I put PostgreSQL
first because, well, this is the backend with the best support in Django
(and also because it is the one where I'd actually use the features I'm
proposing). Oracle comes last because I don't have any way to test it and
I'm afraid I'd be stabbing in the dark anyway. Of the two remaining
backends I put MySQL first for two reasons. First, I don't think people
need to run ``inspectdb`` on SQLite databases too often (if ever). Second,
on MySQL the task seems marginally easier as the database has
introspection features other than just “give me the SQL statement used to
create this table”, whose parsing is most likely going to be a complete
mess.
All in all, extending ``inspectdb`` features is a tedious and difficult
task with shady outcome, which I'm well aware of. Still, I would like to
try to at least implement the easier parts for the most used backends. It
might quite possibly turn out that I won't manage to implement more than
composite primary key detection for PostgreSQL. This is the reason I keep
this as one of the last features I intend to work on, as shown in the
timeline. It isn't a necessity, we can always just add a note to the docs
that ``inspectdb`` just can't detect certain scenarios and ask people to
edit their models manually.
Updatable primary keys in models
================================
The algorithm that determines what kind of database query to issue on
``model.save()`` is a fairly simple and well-documented one [6]_. If a row
exists in the database with the value of its primary key equal to the
saved object, it is updated, otherwise a new row is inserted. This
behavior is intuitive and works well for models where the primary key is
automatically created by the framework (be it an ``AutoField`` or a parent
link in the case of model inheritance).
However, as soon as the primary key is explicitly created, the behavior
becomes intuitive and might be confusing, for example, to users of the
admin. For instance, say we have the following model::
class Person(models.Model):
first_name = models.CharField(max_length=47)
last_name = models.CharField(max_length=47)
shoe_size = models.PositiveSmallIntegerField()
full_name = models.CompositeField(first_name, last_name,
primary_key=True)
Then we register the model in the admin using the standard one-liner::
admin.site.register(Person)
Since we haven't excluded any fields, all three fields will be editable in
the admin. Now, suppose there's an instance whose ``full_name`` is
``CompositeValue(first_name='Darth', last_name='Vadur')``. A user decides
to fix the last name using the admin, hits the “Save” button and instead
of fixing an existing record, a new one will appear with the new value,
while the old one remains untouched. This behavior is clearly broken from
the point of view of the user.
It can be argued that it is the developer's fault that the database schema
is poorly chosen and that they expose the primary key to their users.
While this may be true in some cases, it is still to some extent a
subjective matter.
Therefore I propose a new behavior for ``model.save()`` where it would
detect a change in the instance's primary key and in that case issue an
``UPDATE`` for the right row, i.e. ``WHERE primary_key = previous_value``.
Of course, just going ahead and changing the behavior in this way for all
models would be backwards incompatible. To do this properly, we would need
to make this an opt-in feature. This can be achieved in multiple ways.
1) add a keyword argument such as ``update_pk`` to ``Model.save``
2) add a new option to ``Model.Meta``, ``updatable_pk``
3) make this a project-wide setting
Option 3 doesn't look pleasant and I think I can safely eliminate that.
Option 2 is somewhat better, although it adds a new ``Meta`` option.
Option 1 is the most flexible solution, however, it does not change the
behavior of the admin, at least not by default. This can be worked around
by overriding the ``save`` method to use a different default::
class MyModel(models.Model):
def save(self, update_pk=True, **kwargs):
kwargs['update_pk'] = update_pk
return super(MyModel, self).save(**kwargs)
To avoid the need to repeat this for each model, a class decorator might
be provided to perform this automatically.
In order to implement this new behavior a little bit of extra complexity
would have to be added to models. Model instances would need to store the
last known value of the primary key as retrieved from the database. On
save it would just find out whether the last known value is present and in
that case issue an ``UPDATE`` using the old value in the ``WHERE``
condition.
So far so good, this could be implemented fairly easily. However, the
problem becomes considerably more difficult as soon as we take into
account the fact that updating a primary key value may break foreign key
references. In order to avoid breaking references the ``on_delete``
mechanism of ``ForeignKey`` would have to be extended to support updates
as well. This means that the collector used by deletion will need to be
extended as well.
The problem becomes particularly nasty if we realize that a ``ForeignKey``
might be part of a primary key, which means the collector needs to keep
track of which field depends on which in a graph of potentially unlimited
size. Compared to this, deletion is simpler as it only needs to find a
list of all affected model instances as opposed to having to keep track of
which field to update using which value.
Given the complexity of this problem and the fact that it is not directly
related to composite fields, this is left as the last feature which will
be implemented only if I manage to finish everything else on time.
Timeline
========
Our exam period starts on May 20. and ends on June 28., which means that I
can't guarantee I will be able to fully dedicate myself to the project for
the first two weeks, however, if nothing goes wrong, I should be able to
pass all exams before June 17.
Also, I intend to go to EuroPython, which means the first week of July
won't be as fruitful as the others, but otherwise I'm ready to work full
time on the project.
Week 1 (Jun 17. - Jun 23.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- porting the required virtual field changes, like a ``VirtualField``
class and ``django.db.models.options`` changes
- revisiting the documentation I wrote two years ago to reflect the
evolution this project has gone through
Week 2 (Jun 24. - Jun 30.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- porting the virtual ``ForeignKey`` patchset on top of master to get most
of the functionality right
Week 3 (Jul 1. - Jul 7.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- EuroPython 2013, Florence
- I intend to spend the full two days of sprints working on this but I
can't promise much more during this week.
- running through the tests and fixing those that need updating, ironing
out the remaining wrinkles
Week 4 (Jul 8. - Jul 14.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- basic ``CompositeField`` functionality, ``CompositeValue`` and
descriptor protocol
Week 5 (Jul 15. - Jul 21.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``db_index``, ``unique`` and ``primary_key`` for ``CompositeField``
- backend-dependent SQL for ``__in`` lookups on ``CompositeField``
Week 6 (Jul 22. - Jul 28.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- the few patches required to make admin work with composite primary keys
Week 7 (Jul 29. - Aug 4.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- composite ``ForeignKey``: basic functionality, descriptors, database
JOINs
Week 8 (Aug 5. - Aug 11.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- composite ``ForeignKey``: customization of auxiliary fields, adapting
``ModelForms`` and the admin in case it is necessary
Week 9 (Aug 12. - Aug 18.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- ``GenericForeignKey`` and ``GenericRelation`` support
Week 10 (Aug 19. - Aug 25.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- database introspection: create composite fields where necessary
Week 11 (Aug 26. - Sep 1.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- database introspection: composite ``ForeignKey``
Week 12 (Sep 2. - Sep 8.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- better handling of modifications to primary keys: detecting the fact and
issuing an appropriate DB query
Week 13 (Sep 9. - Sep 15.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- revision log cleanup
- better handling of modifications to primary keys: cascading primary key
updates
Week 14 (Sep 16. - Sep 22.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~
- this is after the suggested “soft pencils down” deadline, therefore I'll
keep this week as a reserve
Deliverables and fallbacks
==========================
As the timeline shows, the first few weeks will be dedicated to internal
refactors of the relationship handling parts of the ORM, which means the
outcome won't be entirely evident or useful by itself. By the end of the
sixth week, I expect the refactor of ``ForeignKey`` to be stable for
concrete target fields. Also, the non-relationship part of
``CompositeField`` functionality should be ported on top of this.
Next, there is support for composite foreign keys. This is the core of the
work and the absolute minimum I want to squeeze out of this project to
consider it at least partially successful.
All the features that appear later in the timeline are optional, ordered
by the priority with which I would like to implement them. In the unlikely
case of extreme difficulties with the previous parts of this project some
(or possibly all) of them may be left out.
A note about the history of this project
========================================
During GSoC 2011 the project progressed steadily, especially during the
first half. The second half was a little bit more blurry because I had no
idea how difficult the task of implementing composite foreign keys would
be before I started to actually omplement them. Still, even though the
second half didn't yield a working implementation (just a half-baked
refactor), it was crucial for me to understand how it can be done.
Unfortunately, soon thereafter school kicked in and along with it
homework, assignments, thesis and other projects related to programming
contests we organize throughout the academic year. This left me with
barely enough time to sync the code with master from time to time, let
alone make any significant progress.
I didn't apply for GSoC 2012 because I knew I wouldn't have enough time to
work on the project because of my thesis and state exams. This would make
it a poor candidate for a GSoC project and even if it did get chosen, it
wouldn't feel fair to other GSoC participants.
Without any deadlines ahead of me and without the GSoC participation, I
was lacking the motivation to keep working on this. I admit, the lack of
progress throughout last summer was largely due to my laziness. I did sync
the code from time to time but I stopped around the time when Aymeric
started pushing all the autogenerated Py3k commits. Sorting out the merge
conflicts and updating a patchset of this scope to the new Py3k
compatibility standards was a really painful process.
I managed to at least get the code through the Py3k changes sometime in
October before the RuPy.eu conference, putting aside everything else. At
that time I was really confident I could keep working on it, but
obviously, that didn't happen.
All in all, the reasons why there hasn't been any significant progress can
be summarized as these two:
1) lack of free time (don't we all know this one?)
2) lack of short-term incentive (a.k.a. laziness)
Why I think the upcoming GSoC will help?
Regarding the first one, having planned half a year in advance for a
Summer of Code project means that I don't have any other plans for the
summer beside this. (Well, except for EuroPython, which I hope will help
this project advance anyway.)
As far as motivation is concerned, there's the obvious one -- a nice
paycheck from Google if I succeed to deliver. However, that's not the only
one, I think the fact that I would have a mentor supervising my work, whom
I respond to throughout the summer, has a really great effect in this
regard.
References
==========
.. [1] GSoC 2011 proposal: Composite Fields
(
https://gist.github.com/koniiiik/5450625)
.. [2] PostgreSQL documentation: Other String Functions
(
http://www.postgresql.org/docs/9.1/static/functions-string.html#FUNCTIONS-STRING-OTHER)
.. [3] SQLite documentation: Core Functions
(
http://www.sqlite.org/lang_corefunc.html#replace)
.. [4] Oracle® Database SQL Reference: REPLACE
(
http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions134.htm)
.. [5] MySQL Reference Manual: String Functions
(
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_replace)
.. [6] Django documentation: Model instance reference
(
https://docs.djangoproject.com/en/dev/ref/models/instances/#how-django-knows-to-update-vs-insert)