I have a few suggestions that I think could increase the the
capabilities and expressiveness of Django DB query syntax. The
modifications are aimed at increasing usage of Django DB syntax (in
preference to embedded raw SQL), and increasing the range of SQL
queries that can be expressed in Django DB syntax.
I have started working on a patch that implements these ideas, but I
thought should solicit comments and approval before I went too much
further. I apologise in advance for the long post, but the
modifications are somewhat related.
I am proposing 4 changes:
1) A curried query function that returns SQL queries
2) A 'columns' keyword to restrict column returns
3) Explicit use of INNER JOIN... ON
4) Introducing a Join class for use in the tables clause
All the proposed changes would be backwards compatible.
In more detail:
1) In addition to the 'instance returning' functions like get_list(),
add a curried function that returns the SQL query itself. Where
model.get_list(kwargs) returns the list of model instances,
model.query(kwargs) would return the SQL query that would return the
same list.
This would allow the wider use of Django DB syntax instead of having to
drop into raw SQL query strings. Anywhere that a string describing an
SQL query is currently valid (e.g., in a select or tables clause),
model.query() would be equally valid.
This requires no modification to the way existing query functions
(get_list() etc) are currently handled, as they are already expecting
strings, and model.query() returns a string. This modification only
serves to normalize representation into Django syntax wherever a string
is currently valid, minimizing the need to put raw SQL in your Django
queries.
2) Add a 'columns' keyword that allow users to nominate a specific list
of columns that they wish returned by a query.
This kwarg only make sense in the context of the previous suggestion.
At present, get_list() generates a select that returns all columns of
the queried table. However, if query() is being used to describe a
table/select it is possible that only a small number of columns in the
joined column are required.
If unspecified, all columns would be returned. Calls to get_list() etc,
which require the existence of all columns, would override any provided
columns kwarg (as the get_count() call currently does for order_by,
offset and limit).
select can still be used to augment the column list, if required.
An alternate approach with the same endpoint would be to have an
'exclude' kwarg, which requires users to explicity name columns they
_don't_ want. Personally, I prefer the inclusive approach, but I'm
willing to be convinced otherwise.
3) Make all existing joins explicitly INNER JOIN ... ON, rather than
using the equivalent ', ... WHERE' syntax.
The infrastructure for this is already partially in place; internally,
'join_where' is separated from 'where' in order to clarify OR queries.
Making the join explicitly INNER removes all ambiguity with regards to
the WHERE clause, and opens the door for option 4...
4) Create a Join() class (with InnerJoin and OuterJoin subclasses) that
can be used in the tables clause.
Existing tables clauses would remain unaffected; if a string is
provided, the table named by the string is added to the FROM clause
using ',' notation.
However, a Join object can also be provided. For example:
Polls.get_list(
tables=[
InnerJoin('polls_choices',
on=['polls_polls.id = polls_choices.poll_id'],
as="choices")
])
would be an inner join, whereas:
Polls.get_list(
tables=[
OuterJoin('polls_choices',
direction=LEFT,
on=['polls_polls.id = polls_choices.poll_id'],
as="choices")
])
would be a left outer join. The first argument (table) is required;
It's a string, so model.query() calls would also be valid to allow
joins over subselects. 'direction', 'on', and 'as' are kwargs on the
constructor that modify how the Join is evaluated as SQL. Direction
takes a constant of 'left','right', or 'full'; 'on' is expanded in the
same way as 'where'. If 'as' is not provided, the table name would be
used as a default.
An alternative approach to the direction parameter would be to have a
LeftOuterJoin, RightOuterJoin, and FullOuterJoin subclasses.
So there we have it. 4 changes that would extend the range of Django DB
syntax and reduce the amount of embedded SQL, all while remaining
backward compatible.
Comments? Criticisms?
Russ Magee
There are quite a lot of changes in this area being considered in the
magic-removal branch. So your patch will almost certainly get broken by
those changes.
As I haven't written up yet my full proposal on the query stuff yet (but
it is closely related to the descriptor fields stuff), I can't really
say exactly which bits of your proposals would be workable/necessary.
Here is a hopefully informative example of the kind of thing we (me and
hugo) have been talking about:
eg with
class Person(Model):
name = CharField(maxlength=100)
age = IntegerField()
father = ForeignKey("Person", null=True)
# Strings work as foreign key targets in
# magic removal. Classes still do as well.
mother = ForeignKey("Person", null=True)
Here are a few example queries. The overriding idea is that the query
objects act like builtin sets as possible, but lazily fetched from the
database.
people = Person.objects
people.filter(name__startswith="P") |people.filter(name__startswith="S")
# a union.
( people.filter(name__startswith="Pete") |
people.filter(name__startswith="Simon") ) & people.filter(age__exact=4)
A union and an intersection. Things are clearer with temporaries:
petes = people.filter(name__startswith="Pete")
simons = people.filter(name__startswith="Simon")
age_4 = people.filter(age__exact=4)
res = (petes | simons) & age_4
Dunno what people think about this, it probably depends on how
comfortable they are with sets. However, I think it is a lot less ugly
than the complex= syntax currently.
subqueries would also be supported:
petes.filter(father__in=sims)
would get all the peters whose father is a simon.
I will comment on a few things however...
> 2) Add a 'columns' keyword that allow users to nominate a specific list
> of columns that they wish returned by a query.
>
> This kwarg only make sense in the context of the previous suggestion.
> At present, get_list() generates a select that returns all columns of
> the queried table. However, if query() is being used to describe a
> table/select it is possible that only a small number of columns in the
> joined column are required.
>
> If unspecified, all columns would be returned. Calls to get_list() etc,
> which require the existence of all columns, would override any provided
> columns kwarg (as the get_count() call currently does for order_by,
> offset and limit).
>
> select can still be used to augment the column list, if required.
>
> An alternate approach with the same endpoint would be to have an
> 'exclude' kwarg, which requires users to explicity name columns they
> _don't_ want. Personally, I prefer the inclusive approach, but I'm
> willing to be convinced otherwise.
With descriptor fields, it makes sense to allow the fields themselves to
be used to define this, eg maybe
Person.objects.returning( Person.name, Person.age )
would return an object that lazily acts like an ordered set of tuples of
name and age values.
+1
> 4) Create a Join() class (with InnerJoin and OuterJoin subclasses) that
> can be used in the tables clause.
>
> Existing tables clauses would remain unaffected; if a string is
> provided, the table named by the string is added to the FROM clause
> using ',' notation.
>
> However, a Join object can also be provided. For example:
>
> Polls.get_list(
> tables=[
> InnerJoin('polls_choices',
> on=['polls_polls.id = polls_choices.poll_id'],
> as="choices")
> ])
Exposes raw table names. Embedded string parsing. Bad. Also, what does
it actually return as a result?
I think the main problem with this proposal is readability. You appear
to be thinking in terms of the SQL generated. This is fine if you have
read and understood the generation functions, but is not really
intuitive to someone coming along fresh.
I think the query syntax needs to be designed with use cases in mind,
rather than "if we do this, we can express lots of sql queries". Thats
nice, but only if its going to be comprehensible, and more usable than
raw sql.
> So there we have it. 4 changes that would extend the range of Django DB
> syntax and reduce the amount of embedded SQL, all while remaining
> backward compatible.
The issue is, I don't really get how this join stuff is very different
than embedded sql. It seems to me to simply be sql syntax preparsed into
separate arguments. Also it relies on knowing how the sql queries are
produced, eg alias names, table names, etc. This seems to be asking for
more backwards compatibilty trouble in the future, eg how will it
interact with the real subtyping stuff?
-1.
I'm still reading through your proposal, but one quick note:
On Dec 19, 2005, at 12:30 AM, Russell Keith-Magee wrote:
> 2) A 'columns' keyword to restrict column returns
This already exists; look at the get_values() function: http://
www.djangoproject.com/documentation/db_api/#get-values-kwargs
More later...
Jacob
> This already exists; look at the get_values() function:
Right idea, not quite the complete implementation. 'fields' only
applies to get_values, and the 'fields' kwarg isn't utilized by the sql
query itself. The SQL query still retrieves all columns, and then a
dict(zip()) the results to get a dictionary with the requested subset
of values.
The idea here would be to construct a SELECT that only pulls out the
columns in 'fields'. This isn't of any use for normal use-case
get_list() calls, but it could be useful for populating 'select' or
'tables' kwargs within a get_list() call.
Russ %-)
I figured as much. Is there any sort of estimated timeframe for the
merge of the magic-removal branch into the trunk?
> Here are a few example queries. The overriding idea is that the query
> objects act like builtin sets as possible, but lazily fetched from the
> database.
+1. Love it. This hits one of my biggest bugbears about the existing
syntax - that you can't decompose a big query into little logical
bits, and compose/reuse the bits when you need the query evaluated.
> > 4) Create a Join() class (with InnerJoin and OuterJoin subclasses) that
> > can be used in the tables clause.
>
> Exposes raw table names. Embedded string parsing. Bad.
Granted. However, I can't see any easy way with the current system to
get access to the table/column names. However - the syntax you
suggested for #2:
> Person.objects.returning( Person.name, Person.age )
provides a solution, as long a django db objects have properties that
evaluate to their underlying table/column names. Person.name ->
"polls_person"."name" (or can be interpreted/looked up in some way).
> I think the main problem with this proposal is readability. You appear
> to be thinking in terms of the SQL generated.
Strange that... it's because I was :-) IMHO, once you get outside
relatively simple examples, the Django syntax encourages you to do so.
The 'select', 'tables', and 'where' kwargs are all there to get around
the limitations in Django expressiveness. And beyond encouraging
SQL-like thinking, they even encourage the embedding of raw SQL.
The suggestions I was putting forward were attempts to regularize
syntax with the goal of minimizing the amount of raw SQL required. I'm
happy to abandon all of them if we can develop an alternate syntax
that minimizes the need for any SQL embedding.
+1 to the keeping access to raw SQL queries available (at a cursor
level) for the edge cases, but -1 to putting the edge so close to the
core, requiring this approach for every non-trivial problem.
> I think the query syntax needs to be designed with use cases in mind,
> rather than "if we do this, we can express lots of sql queries". Thats
> nice, but only if its going to be comprehensible, and more usable than
> raw sql.
Absolutely agree - but I would also make one addition - it has to be
comprehensible, more usable, AND have expressive range that covers the
most common SQL use cases. I accept that the ability to pose some
queries will be lost in any elegant syntax, but if you can't cover a
wide range of base use cases, why bother having the syntax at all?
IMHO, as it currently stands, Django is missing 2 major areas of SQL
functionality:
1) queries over 1-N relationships (get me a list of polls whose
choices meet a certain criteria)
2) Aggregate clauses
and nothing I have seen in the magic-removal discussions is addressing
these omissions. I don't see these as edge cases - every SQL tutorial
I have seen that covers more than a single page talks about JOIN and
GROUP BY. The only reason I am dancing around the edges of 'select'
and 'where' is because I need to do both these things, and cannot with
the existing Django syntax.
If you want a use case to work with - here's one:
A Task has a list of Estimates and a list of Activities. Each Estimate
has an integer describing the amount of effort involved, as does each
Activity.
Keeping the individual estimates/activities is important, so you can't
just decompose Task to contain integer Estimate and Activity fields
(unless you can propose an alternative audit trail)
A task is planned if the number of estimates != 0.
A task is complete if it is planned, and sum(estimates) = sum(activity)
Now, retrieve a list of all complete tasks.
For bonus points: A Milestone has a list of Tasks. A Milestone is
complete if it has at least one task, all of which are complete.
Retrieve a list of completed milestones.
If the outcome of the magic-removal is the ability to express this
problem in Django syntax, I'll be a very happy man :-)
Russ Magee