#7210 - F() syntax, design feedback required.

756 views
Skip to first unread message

Russell Keith-Magee

unread,
Jan 5, 2009, 8:21:02 AM1/5/09
to Django Developers
Hi all,

I have a design issue on ticket #7210 that requires some feedback from
the community.

For those that haven't been following along, Ticket #7210 is about
adding the ability to reference fields during a query. Two quick
examples of common use cases:

Find all the hotel rooms that have the same number of chairs and beds:
>>> HotelRoom.objects.filter(n_chairs=F('n_beds'))
=> SELECT * from HotelRoom WHERE n_chairs=n_beds;

Update the number of occupants in room 319 the room by 1:
>>> HotelRoom.objects.filter(room=319).update(n_occupants=F('n_occupants') + 1)
=> UPDATE HotelRoom SET n_occupants=n_occupants+1 WHERE id=319;

Nicolas Lara did some work on this ticket in the scope of his
Aggregates work, leading off an earlier patch from Sebastian Noack.
The ticket was accepted for v1.1 as a must-have. My Github repository
contains the work-in-progress [1]; from my testing, it appears
reasonably solid, although there is still one feature addition
pending.

However, the test cases don't currently pass for all backends, which
leads me to a problem of design and scope. The test suite for F()
notation has revealed some interesting inconsistencies with the
handling of various operators on the various database backends.

Consider integer division. 42/15 returns 2 under SQLite and Postgres.
MySQL, however, returns 2.8, which is rounded into 3 if you try to
assign the division result to an integer column. (e.g., UPDATE person
SET age = 42/15 ...).

Postgres doesn't allow modulo arithmetic on floats; MySQL and SQLite do.

Bitwise operators (& and |) on floats are also inconsistently handled.
SQLite and MySQL are in agreement, but Postgres won't allow bitwise
operations on floating point values without explicit casts to integer
types.

This doesn't even get into the issue of handling dates and strings
using overloaded arithmetic operations.

I can see several possible ways forward:

1) Expose the underlying operations, warts and all. Provide a minimal
test suite that checks that it is possible to do each of the
arithmetic operations, but studiously avoid edge cases known to cause
problems. This is the ostrich solution, but it also acknowledges that
there are differences between backends. Documentation would be updated
to note that some operations are not necessarily portable, and users
should check local guides for details.

2) As for (1), but provide a comprehensive test suite that checks the
different expected results for each backends. This doesn't change
anything in the way the feature would work, but it would explicitly
check that the problematic cases return the expected results. This
would improve test coverage, but would make the test suite much more
complicated, and I'm not convinced it would actually improve the
quality of the code. The bits of code that need testing are in the way
queries are rolled into SQL, not in the way that the backends
interpret that SQL.

3) Try to find the common ground and only provide the subset of
operations that actually behave the same under all backends. I suspect
this would mean we would ultimately end up with no operators
available.

4) Put the infrastructure in place to ensure that integer division is
always handled the same way, regardless of backend handling, and do
strong type checking of operators and operations to make sure the
resultant expression is legal and meaningful. I'm not even sure this
option is feasible for all cases (or even possible), but if we want to
maintain complete database independence, we may need to look down this
road.

Personally, I'm leaning towards option 1. We already acknowledge some
inconsistencies between backend (e.g., fixture ordering issues with
InnoDB, case sensitivity options with some collations). I suspect that
by trying to bend the rules to make things consistent, we will end up
making things a lot more brittle.

Opinions? Other options?

[1] http://github.com/freakboy3742/django/tree/query-expressions

Yours,
Russ Magee %-)

alex....@gmail.com

unread,
Jan 5, 2009, 9:54:33 AM1/5/09
to Django developers
I'd vote in favor of number 1. While I'd love to have number 4, it's
simply not possible, as we discovered when looking at aggregates, not
even all versions of SQLite play nice together(yay whatever the hell
version I have!). If someone wants to do something like develop on
SQLite and then switch to MySQL or PgSQL while using inconsistent
operations that's there prerogative.

Alex

On Jan 5, 7:21 am, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:

Ivan Sagalaev

unread,
Jan 5, 2009, 11:17:18 AM1/5/09
to django-d...@googlegroups.com
Russell Keith-Magee wrote:

> 1) Expose the underlying operations, warts and all. Provide a minimal
> test suite

I'm +1 for this option. Here are some comments why I don't like others:

> 2) As for (1), but provide a comprehensive test suite that checks the
> different expected results for each backends.

This will create a maintenance burden. New versions of existing DBs
might change the behavior in question, new backends will be invented
etc. And testing all these warts won't give us anything because it's not

Django's scope anyway. As you yourself said:

> The bits of code that need testing are in the way
> queries are rolled into SQL, not in the way that the backends
> interpret that SQL.


> 3) Try to find the common ground and only provide the subset of
> operations that actually behave the same under all backends. I suspect
> this would mean we would ultimately end up with no operators
> available.

It will also be inconsistent with what we already have. We have
transactions that aren't supported in MyISAM, we have '__search' that
isn't supported in Oracle and SQLite etc. In other words, Django assumes
that users know which backend they use and doesn't try to be a complete
abstraction layer.

> 4) Put the infrastructure in place to ensure that integer division is
> always handled the same way, regardless of backend handling, and do
> strong type checking of operators and operations to make sure the
> resultant expression is legal and meaningful.

Frankly, this just scares me out my pants :-)

Malcolm Tredinnick

unread,
Jan 5, 2009, 6:09:01 PM1/5/09
to django-d...@googlegroups.com

My gut feeling is that this is probably the most pragmatic, in terms of
functionality.

To my mind, whilst Django supports multiple database backends, it's
unrealistic (and not even *currently* true) to say it performs
identically on all backends. Any individual or company is going to be
running their software primarily on one database backend, not chopping
and changing, so they'll know (or should, if they're serious) what the
behaviour is. I strongly suspect Django-using applications are similar
to software development in general: the bulk of code is written for
internal purposes than distribution purposes.

Let's not try to disguise differences that were quite possibly one of
the differentiating features when somebody was making their backend
decision. "Django wraps the database layer; it doesn't replace it."

> 2) As for (1), but provide a comprehensive test suite that checks the
> different expected results for each backends.

We might need to introduce some backend-specific tests over time, but we
can backfill as experience dictates there. I wouldn't let this hold up
the merge.

Regards,
Malcolm


Alex Gaynor

unread,
Jan 7, 2009, 11:36:17 AM1/7/09
to django-d...@googlegroups.com
I've spent some time looking at this trying to implement expansion of related fields as Russ and I discussed last night.  I have a few questions about the design choice that was implemented, why didn't you choose to go with this evaluator class in place of a mechanism more similar to the way aggregates are implemented.  Specifically the issue that has come up is that whatever goes into the Queries where attribute needs to know the table alias that it's going to be on, and while this can quite easily be accomplished by passing it to it at the time evaluate() is called when bump_aliases is called(on an update query for example) this necessitates the relabeling of all the aliases, right now there's a hook for this(that aggregates make use of), but that can't be used since QueryWrapper is being used and opaque strings of SQL are being passed around.  Essentially the question boils down to why don't we defer creating the SQL until it's actually needed?

Alex
--
"I disapprove of what you say, but I will defend to the death your right to say it." --Voltaire
"The people's good is the highest law."--Cicero

Russell Keith-Magee

unread,
Jan 7, 2009, 6:22:40 PM1/7/09
to django-d...@googlegroups.com
On Thu, Jan 8, 2009 at 1:36 AM, Alex Gaynor <alex....@gmail.com> wrote:
> I've spent some time looking at this trying to implement expansion of
> related fields as Russ and I discussed last night. I have a few questions
> about the design choice that was implemented, why didn't you choose to go
> with this evaluator class in place of a mechanism more similar to the way
> aggregates are implemented.

An aggregate is a single class, so there isn't much overhead involved
in creating an SQL-specific duplicate of the aggregate. An expression,
however, is a full tree. Duplicating a tree is a more expensive
operation, so I opted for a mechanism that interprets the tree on
demand, rather than duplicating the tree.

> Specifically the issue that has come up is that
> whatever goes into the Queries where attribute needs to know the table alias
> that it's going to be on, and while this can quite easily be accomplished by
> passing it to it at the time evaluate() is called when bump_aliases is
> called(on an update query for example) this necessitates the relabeling of
> all the aliases, right now there's a hook for this(that aggregates make use
> of), but that can't be used since QueryWrapper is being used and opaque
> strings of SQL are being passed around. Essentially the question boils down
> to why don't we defer creating the SQL until it's actually needed?

A combination of historical and lack of need. Nicolas' code invoked
evaluate when the filter/update clause was added, and since F()
clauses on local attributes aren't subject to re-aliasing, it never
arose as an issue.

Deferring SQL creation strikes me as the right thing to do in this
case. My initial reaction is that the approach required is Rather than
having a single evaluator class, I suspect we may need to insert
instances of SQLEvaluator that remember the root of expression trees
into the SQL-side of the query where evaluate() is currently being
called.

This would make expressions more like Aggregates. F() objects would
have an add_to_query() function, and you would call add_to_query() on
the root of the query expression to create an Evaluator instance.
as_sql() on the SQLEvaluator would then expand the query expression at
time of use.

Russ %-)

Alex Gaynor

unread,
Jan 7, 2009, 6:26:20 PM1/7/09
to django-d...@googlegroups.com
Another problem that occured to me is precisely that, F() objects are trees, so what happens when I do:

Model.objects.filter(field=F('related_model1__field')+F('related_model2__field'))

clearly the process for adding in calls to setup_joins() gets a little more complex.  Hopefully I'll have some more time to look at it either tonight or tomorrow, but I'm juggling a few patches at present, and as you said this feature isn't required for F() to make it into 1.1.

Alex

Alex Gaynor

unread,
Jan 8, 2009, 12:24:34 PM1/8/09
to django-d...@googlegroups.com
Ok, I've spent some time working with this, and I've come to a bit of an impasse.  But firs the good news: use of related fields in filter() works precisely as expected, and correctly now, the code itself probably needs a bit of a patdown since I've changed a lot of things(specifically before any value with an as_sql() was getting turned into a QueryWrapper which wasn't acceptable for F() since the aliases needed to be able to change using the relabel_aliases hook, so that has a bit of a special case).  I also brought over the trim_joins() method from aggregates since it saved on code duplication(yay).

The bad news: UPDATE with related fields explodes, since there is no column by the name in the query, my understanding is one can do UPDATE FROM to handle this, however my SQL is very weak and I know nothing of the portability of this, how it works, or what it ultimately looks like.  So someone with better SQL than I probably has to look at it, however strictly speaking the filter on related could go in without the update on related since they really are quite orthagonal.

Otherwise any feedback is appreciated, all the code is available at:
http://github.com/alex/django/commits/related-expressions

Alex

Malcolm Tredinnick

unread,
Jan 8, 2009, 8:12:28 PM1/8/09
to django-d...@googlegroups.com
Hopefully this will be confusing enough to prove my point:

On Thu, 2009-01-08 at 11:24 -0600, Alex Gaynor wrote:
> Ok, I've spent some time working with this,

5. Clear now?

4. Because it makes it very hard to follow what is going on and is
really sodding annoying. Is your email client broken or something? If
so, get a better tool.

3. Why?

2. Please stop top-posting all the time.

1. This mail demonstrates how hard it is to have to read from bottom to
top and try to work out the context. This is the first line of the
reply. The first line of what I'm replying to refers to stuff way later
in the message you are replying to and completely lacks any context.
Like many others, I have an interest in following what's going on here,
but it's only one thread amongst hundreds of emails I read every day, so
it would be nice if it didn't take a minute to decode what you're
talking about each time.

Thankyou for your consideration.

Malcolm


Alex Gaynor

unread,
Jan 8, 2009, 8:31:44 PM1/8/09
to django-d...@googlegroups.com
Top posting is the default in my mail client(Gmail), it also makes it easier it easier to write, and ultimately it wasn't something I thought about, but whatever.  I'll try to restate my previous message in a manner that makes more sense.

As I said, the usage of F() across related fields now works correctly.  There were several things that had to be overcome to allow this to work.  First any value with an as_sql() method on it was getting converted into a QueryWrapper which didn't work since the F() needed to be able to relabel it's aliases.  To that effect I also added a hook to allow values with relabel_alises methods to be called from inside a WhereNode.  Finally I ported over the trim_join method Russ abstracted as a part of the aggregation work, there wasn't any changes to it, it's join a note.

The actual implementation issue arrise with the usage of F() for related fields in update queries.  Specifically one of my concerns was the portability of the SQL.  I spoke with Michael Trier this afternoon and it appears to me that the SQL is almost totally non portable.  Postgres uses a UPDATE ...SET ...FROM style query.  MySQL uses UPDATE [table list ] SET... style query.  I'm not sure how SQLite or Oracle do it.  I gather that it is possible to do a subquery such a UPDATE table SET field = (SELECT value FROM related_table WHERE "related_table"."id" = "table"."id"); but this is apparently not very performant(has the potential to result in a large number of subqueries in the case of multiple updated fields, in addition to where clauses, etc.).  Therefore I am not sure how to proceed with the update portion.

Hopefully this is more clear than my previous message.

Alex

Malcolm Tredinnick

unread,
Jan 8, 2009, 8:48:15 PM1/8/09
to django-d...@googlegroups.com
On Thu, 2009-01-08 at 19:31 -0600, Alex Gaynor wrote:
[...]

> Top posting is the default in my mail client(Gmail),

It's also the default in Outlook. Didn't make it a good idea there,
either, and it's obstructed clear business communications for over a
decade, sadly (although possibly that's been a mission goal of the
company that produces outlook).

It's not just this thread. It's something I notice you doing regularly
and since you make a lot of replies to long threads, I think you'll help
yourself (and us) by putting the replies in context.

> it also makes it easier it easier to write, and ultimately it wasn't
> something I thought about, but whatever. I'll try to restate my
> previous message in a manner that makes more sense.
>
> As I said, the usage of F() across related fields now works correctly.
> There were several things that had to be overcome to allow this to
> work. First any value with an as_sql() method on it was getting
> converted into a QueryWrapper which didn't work since the F() needed
> to be able to relabel it's aliases. To that effect I also added a
> hook to allow values with relabel_alises methods to be called from
> inside a WhereNode. Finally I ported over the trim_join method Russ
> abstracted as a part of the aggregation work, there wasn't any changes
> to it, it's join a note.
>
> The actual implementation issue arrise with the usage of F() for
> related fields in update queries. Specifically one of my concerns was
> the portability of the SQL. I spoke with Michael Trier this afternoon
> and it appears to me that the SQL is almost totally non portable.

Yes, Michael is wise. Also, welcome to my world. :-)

The queryset update() method is written in the particular way it is as
an attempt to minimise those obstructions, but it is a very hairy area.
I agree. It's important not to get too hung up on this stuff, though.
Many updates are just simple direct updates to a single table. Selects
are more common than updates. So the complex multi-table or other
all-singing, all-dancing constructs (related fields in F()) can afford
to be a little slower and/or use multiple queries. Django is not
intended to be a total replacement for SQL and if somebody needs to use
that type of slow case frequently and wants it fast, there's this really
cool feature we provide called cursor.execute() that they can use
instead.

I've pulled your branch of the code and will have a read over it. I've
been meaning to review all those changes now that Russell's getting
close to done and this is probably the tipping point to make me do so. I
doubt I'll get to it before this evening, but I'll definitely make time
to have a read with full attention.

Regards,
Malcolm


Ian Kelly

unread,
Jan 9, 2009, 4:09:32 PM1/9/09
to django-d...@googlegroups.com
On Thu, Jan 8, 2009 at 6:31 PM, Alex Gaynor <alex....@gmail.com> wrote:
> The actual implementation issue arrise with the usage of F() for related
> fields in update queries. Specifically one of my concerns was the
> portability of the SQL. I spoke with Michael Trier this afternoon and it
> appears to me that the SQL is almost totally non portable. Postgres uses a
> UPDATE ...SET ...FROM style query. MySQL uses UPDATE [table list ] SET...
> style query. I'm not sure how SQLite or Oracle do it. I gather that it is
> possible to do a subquery such a UPDATE table SET field = (SELECT value FROM
> related_table WHERE "related_table"."id" = "table"."id"); but this is
> apparently not very performant(has the potential to result in a large number
> of subqueries in the case of multiple updated fields, in addition to where
> clauses, etc.). Therefore I am not sure how to proceed with the update
> portion.

Don't worry about the speed in Oracle; I understand that the optimizer
handles this well. The query above may look ugly, but under the hood
Oracle is just doing the same join that Postgres does with an UPDATE
... FROM query.

Ian

Ian Kelly

unread,
Jan 9, 2009, 4:16:19 PM1/9/09
to django-d...@googlegroups.com

Also, where multiple fields are pulling values from the same related
table, you can avoid replicating subqueries with the syntax UPDATE
table SET (a, b, c) = (SELECT x, y, z FROM related_table WHERE
related_table.id = table.id);

Ian

Alex Gaynor

unread,
Jan 9, 2009, 4:19:14 PM1/9/09
to django-d...@googlegroups.com
Well the idea is to do that for all DB engines(since it's the only portable option), so while Oracle may optimize it correctly I very much doubt MySQL does :) .  Is the multiple-set portable to other dbs(in any event I think it would come later as an optimization)?

In any event I started looking at this today and it's definetly doable, but I'm not sure how do generate the subquery itself, optimally we could generate this query using the QuerySet itself, but that would require us to regenerate the attribute traversal string("fkey__another_fkey__id") which is probably impossible, so I could use some more input on what the right way to do this is(there's probably something really obvious that I'm missing).

Ian Kelly

unread,
Jan 9, 2009, 4:41:52 PM1/9/09
to django-d...@googlegroups.com
On Fri, Jan 9, 2009 at 2:19 PM, Alex Gaynor <alex....@gmail.com> wrote:
> Well the idea is to do that for all DB engines(since it's the only portable
> option), so while Oracle may optimize it correctly I very much doubt MySQL
> does :) . Is the multiple-set portable to other dbs(in any event I think it
> would come later as an optimization)?

I have no idea -- from a quick glance at the PostgreSQL docs, it
appears to support the multiple-assignment syntax but not the use of
subqueries as rvalues. Probably this bit of SQL generation should be
delegated to the backend rather than trying to do it in a single
cross-database way.

Ian

Reply all
Reply to author
Forward
0 new messages