All of these (plus no doubt some other things I don't remember or know
about) are going to impact QuerySets. So I thought I'd just lob out my
plans/ideas for refactoring that part of the code prior to 1.0.
What I would like here is for anybody planning to play in this space to
have a think about whether their particular use-case is going to get
easier or harder to implement under this scheme, so that changes can be
made now.
No really urgent action required on this. I keep bashing away at the
code for this (and tweaking the design) whenever I have time. And I
realise everybody at PyCon already has their head full of other stuff. I
just want to inject the thoughts into people's brains since there are a
bunch of you sitting in Texas talking about "the future".
Apologies for the length. This makes Jacob's admin refactoring post look
short. Life sucks like that sometimes.
How QuerySets work at the moment
---------------------------------
By magic!
Seriously, the QuerySet class performs a few jobs: it constructs the
database query (SQL), retrieves the data by calling the appropriate
backend, and then provides an API for accessing the data in a
Python-friendly format.
There are really two portions that I want to separate here: the
mechanics of getting the data from the database (constructing the query,
particularly) and providing the Python interface via model instances,
etc.
When we construct an SQL query at the moment, it is mostly done by
pushing string fragments around inside a couple of functions in the
QuerySet class. This makes it very difficult to fiddle with query
construction because you need to get at the right string fragment at
just the right moment. You cannot go back later and tweak any portion of
the query. The obvious example of where this is painful is if we
implement slicing functionality for MS SQL Server: that database does
not have the non-SQL-standard LIMIT/OFFSET keywords. Instead, you
construct a sub-select in the FROM clause to get the equivalent
behaviour. I've no doubt screwed up the details slightly there, but the
important point here is that it's basically impossible without really
special-case handling ("if using_ms_sql: ...") to hijack the FROM
construction and not put in LIMIT/OFFSET bits. Oracle has similar,
though not quite as difficult to accommodate, requirements for
extracting an ordered portion of the results, too.
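To make the pagination difference concrete, here is a hedged sketch; the
SQL Server formulation is approximate (as the paragraph above itself
admits about the details), and the table/column names are invented:

```python
# PostgreSQL/MySQL/SQLite: pagination is a simple suffix, so a query
# builder can just append it to the finished string.
limit_offset_style = (
    "SELECT id, name FROM person ORDER BY name LIMIT 10 OFFSET 20"
)

# SQL Server (of this era): no LIMIT/OFFSET. One common workaround
# weaves nested TOP sub-selects into the FROM clause -- something a
# string-appending query builder has no clean way to produce.
top_subselect_style = (
    "SELECT * FROM ("
    " SELECT TOP 10 * FROM ("
    "  SELECT TOP 30 id, name FROM person ORDER BY name ASC"
    " ) AS page ORDER BY name DESC"
    ") AS result ORDER BY name ASC"
)
```

The point is not the exact SQL, but that the second form has to be built
from the inside out, not appended at the end.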
What would the new (QuerySet) world order look like
----------------------------------------------------
What I propose to do is to split out the SQL construction from the body
of QuerySet and make it into a separate class. This class (let's call it
Query, for simplicity) would end up being a "query" attribute on
QuerySets, and its __str__ method would return the appropriate SQL
statement.
The Query class would not push string fragments around. Instead, it
would keep the parts of the SQL statement separated until asked to
combine them (in __str__, for example). So there would be attributes
called, say, select, tables, where, having, grouping, limit and offset.
Initially, "having" and "grouping" wouldn't be used by our current
functionality, but they're there for extending the Query class (and for
use by things like aggregates and custom tweaks).
The exact contents of these attributes aren't really important for the
purposes of this discussion, but they will be Python objects that you
can poke at and shuffle as much as you like. For example, the "select"
attribute would be a list of (column name, alias) pairs (where alias
could be None) and the "tables" attribute would be a (possibly nested)
list of (table name, alias, join type) triples. Since the table names
just need to be convertible to strings, it would be possible to use a
Query class for the table name, giving us a way to inject
SQL-Server-required views into the FROM clause.
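A minimal sketch of what such a Query class might look like; all the
attribute names and the assembly logic here are guesses based on the
description above, not the actual design:

```python
class Query:
    """Keeps SQL fragments separate until asked to combine them."""

    def __init__(self):
        self.select = []    # list of (column, alias) pairs; alias may be None
        self.tables = []    # list of (table, alias, join type) triples
        self.where = []     # where-clause fragments, ANDed together
        self.limit = None
        self.offset = None

    def __str__(self):
        # Only here are the stored fragments combined into a string.
        cols = ', '.join(alias and '%s AS %s' % (col, alias) or col
                         for col, alias in self.select)
        frm = []
        for table, alias, join in self.tables:
            piece = alias and '%s AS %s' % (str(table), alias) or str(table)
            frm.append(join and '%s %s' % (join, piece) or piece)
        sql = 'SELECT %s FROM %s' % (cols, ' '.join(frm))
        if self.where:
            sql += ' WHERE %s' % ' AND '.join(self.where)
        if self.limit is not None:
            sql += ' LIMIT %d' % self.limit
        if self.offset is not None:
            sql += ' OFFSET %d' % self.offset
        return sql
```

Note that the table entries only need to be convertible to strings via
str(), which is what would let a nested Query (suitably parenthesised)
stand in for a table name in the FROM clause.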
The Query class is primarily responsible for generating the SQL string.
However, when I started writing code for this, it rapidly became
apparent that it will need to know which database engine it is
generating SQL for. Tasks like constructing the right SQL for slicing
(LIMIT/OFFSET vs the SQL Server requirements), for example, need that
knowledge to construct the right string.
So, at some level, it might make sense to pass off *all* of the database
interaction to this class. The QuerySet class would create a Query
instance that contains the right information, tell it the database
engine, pass it the values for the parameters (maybe not all in the same
place) and then call "run_the_query_please()" and get back an iterator
over the results.
This helps on another, slightly crazier level, too: if you want to
replace the database-backed ORM with another type of storage, but are
happy to keep using the Django API for querying, all the SQL-specific
stuff is now in one place and you would only need to write your own
Query replacement or subclass. Not a driving requirement, but very much
nice to have, since not everything runs out of databases.
How does this help for custom SQL and extra QuerySet functionality?
-------------------------------------------------------------------
Writing custom SQL fragments now becomes possible, because you have
direct access to the individual pieces of the query before they are
combined into a string. I'm not saying it's trivial, but that's not the
goal. We try to make the common things easy and the hard things
possible. This is in the "hard things are possible" category.
Adding functionality to QuerySets as far as interfacing with the
databases via extra SQL is another motivation here. You would subclass
QuerySet and tell your subclass to use your subclass of Query for its
default query construction (probably overriding QuerySet.__init__ --
details to be determined, but they aren't important at this point
anyway, since there are about 6 ways to do it). Then you can make
whatever changes you want to a subclass (or even straight replacement)
of the Query class. For example, you could add a "sum()" method to the
derived QuerySet -- if we don't have it in the default version -- and
this sum() method would know what method to call in the Query subclass
to add the right things to the query.
To clarify that a little bit, suppose you create MyQuerySet and MyQuery
as subclasses of QuerySet and Query, respectively:
MyQuerySet.sum() knows what method in MyQuery to call and with
what information. It also knows how to present the result. Okay,
sum() is a trivial example, since it's going to return a number,
but, still, it's a number and not a model instance or a sequence
and you could imagine cases where the return type is something
more complex.
MyQuery has a method (let's call it sum()) that knows to add a
sum(*) to the select attribute and possibly something to the
grouping attribute as well. That's all it does: it modifies the
query based on the controls passed in from the caller.
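The pairing described above could be sketched like this; every name here
(query_class, add_sum, etc.) is invented for illustration, and actual
query execution is elided:

```python
class Query:
    def __init__(self):
        self.select = []
        self.tables = []

    def __str__(self):
        return 'SELECT %s FROM %s' % (
            ', '.join(self.select), ', '.join(self.tables))

class MyQuery(Query):
    def add_sum(self, column):
        # All this does is modify the stored fragments; it never runs
        # the query or builds the final string itself.
        self.select.append('SUM(%s)' % column)

class QuerySet:
    query_class = Query   # one possible way to wire in a custom Query

    def __init__(self):
        self.query = self.query_class()

class MyQuerySet(QuerySet):
    query_class = MyQuery

    def sum(self, column):
        # Knows which MyQuery method to call; a real version would
        # then run the query and present row[0] as a number.
        self.query.add_sum(column)
```

The division of labour is the point: MyQuerySet knows how to present the
result, MyQuery knows how to change the fragments.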
Addressing Michael Radjiez's (amongst others) request for somehow
returning the results of custom queries and making it act like a real
QuerySet: I think this can be done, to some extent, by subclassing
QuerySet. Your subclass handles your custom Query modifications and also
supports the normal QuerySet interface, so a call to, say, filter() on
your custom QuerySet will work.
The hard part about integrating custom queries with QuerySets is that
we haven't evaluated your custom portion of the query at the time you
start calling other QuerySet methods on it. So the new methods need to
be able to extend the existing query that is being built up. At the
moment, we cannot do that, because the query construction is fairly dumb
about building up the query fragments and kind of stomps all over any
custom fragments you might want to inject. By keeping all the fragments
in the Query class and not combining them until the final __str__
conversion, it makes it easier to augment an existing query.
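A toy demonstration of that property: because the fragments stay
separate until __str__(), a hand-injected fragment survives later,
"normal" additions instead of being stomped on. Names are illustrative:

```python
class Query:
    def __init__(self, table):
        self.table = table
        self.where = []

    def __str__(self):
        sql = 'SELECT * FROM %s' % self.table
        if self.where:
            sql += ' WHERE %s' % ' AND '.join(self.where)
        return sql

q = Query('tableA, tableB')
# The custom fragment of the kind mentioned elsewhere in this post:
q.where.append('tableA.column1 = tableB.column2')
# A later, ordinary condition can still be added without losing it:
q.where.append("tableA.status = 'live'")
```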
Is this really necessary?
-------------------------
I (and others, especially Russell) have spent quite a while living in
the current QuerySet code fixing bugs with query construction and
scratching our heads wondering how on earth we are going to fix some
other cases. For example, we often don't use the most efficient (or the
correct -- in some cases) type of table join when constructing queries.
Joining QuerySets with particular combinations of disjunctions and
conjunctions when common fields are involved leads to incorrect results.
Thinking about how to join in support for aggregates just gives me a
nose bleed, because constructing the GROUP BY and HAVING clauses is
fiddly. All of these problems are much easier to fix by restructuring
how queries are created -- they are, in fact, the motivation for this
design.
Based on periodic IRC visits and reading django-users, people often
try/want to insert custom output columns in the select portion of a
query, which are then linked to something custom in the where clause,
but we don't make that possible. You cannot currently do things like
insert a custom "tableA.column1 = tableB.column2" fragment in a where
clause. For anything custom like this, the (not totally unacceptable)
solution at the moment is to write out the whole thing as a custom SQL
statement. I think the above should make some of this a bit easier and
the rest no harder than it is now. This is another case where I'm not
sure it's a driving requirement for the changes, since we aren't trying
to be a completely isomorphic Python interface over SQL, so asking
people to drop down to SQL if they want to use SQL functionality isn't
evil. But it's a nice side-effect.
Things I think become easier this way
-------------------------------------
- aggregates and other computed output columns
- custom fragments in portions of a query (as opposed to writing
a whole query for the database)
- adding new query types (GIS?)
- different database syntax (LIMIT/OFFSET in both MS SQL Server
and Oracle)
I don't know of any announced plans/wishes that affect QuerySets that
would become more difficult under this change, so please sing out if you
know of any.
Regards,
Malcolm
I have working code that can handle simple situations (no __
lookups). It works nicely, but to make it work 100% I also found the
need to refactor the queryset. I started on that as well; as a
first step I just broke up the query generation into more detail, so
that _get_sql_clause doesn't return the FROM WHERE ORDER LIMIT as a
block, but separately, so things can be added there... I only started
on this half an hour ago, but it went smoothly. I then ran into
difficulties trying to get the related__lookups__to__work outside Q
objects (I thought I would use this notation to specify which
field to group by, etc.)
this was solved adequately for me by the breakup of _get_sql_clause();
not perfect (by far), but it gives you the ability to append to/replace
individual clauses
The obvious example of where this is painful is if we
> implement slicing functionality for MS SQL Server: that database does
> not have the non-SQL-standard LIMIT/OFFSET keywords. Instead, you
> construct a sub-select in the FROM clause to get the equivalent
> behaviour. I've no doubt screwed up the details slightly there, but the
> important point here is that it's basically impossible without really
> special-case handling ("if using_ms_sql: ...") to hijack the FROM
> construction and not put in LIMIT/OFFSET bits. Oracle has similar, but
> not quite as difficult to accommodate requirements for extracting an
> ordered portion of the results, too.
well, a backend function that would return a base QuerySet class would
solve this with minimal effort; not the best solution, but it would work
I agree it would be great to have this, and am willing to help. I am
not sure about the extent of this overhaul, though
> I (and others, especially Russell) have spent quite a while living in
> the current QuerySet code fixing bugs with query construction and
> scratching our heads wondering how on earth we are going to fix some
> other cases. For example, we often don't use the most efficient (or the
> correct -- in some cases) type of table join when constructing queries.
> Joining QuerySets with particular combinations of disjunctions and
> conjunctions when common fields are involved leads to incorrect results.
> Thinking about how to join in support for aggregates just gives me a
> nose bleed, because constructing the GROUP BY and HAVING clauses is
> fiddly.
it's not that bad - I managed to get GROUP BY working, with things like
sum, min, count etc.
getting HAVING to work is quite simple by overriding filter()
(well, _filter_or_exclude to be precise)
The most painful part of getting GROUP BY working is trying to figure
out how to resolve the related__fields__syntax, since it is only
present for conditions. If we manage to untangle these helper
functions, it would be a great start - it would enable us to specify
ordering in this syntax (the current one can be a nightmare, since when
you don't pay attention, you will end up with a cross-join) and
generally allow other queryset methods to leverage this
information...
this relatively small change, along with some cleanup, should greatly
improve queryset's extensibility, so that every backend could specify
its own (defaulting to ANSIQuerySet) that would handle the corner
cases (LIMIT/OFFSET) and maybe allow some extra functionality (like
Oracle's ROLLUP)
>
> Regards,
> Malcolm
Thanks, great work
--
Honza Král
E-Mail: Honza...@gmail.com
ICQ#: 107471613
Phone: +420 606 678585
I like this a lot. Your approach of putting the SQL construction into
a separate class reminds me much of SQLAlchemy. A quote from their
tutorial [1]:
"""A central concept of SQLAlchemy is that it actually contains two
distinct areas of functionality, one of which builds upon the other. One
is a SQL Construction Language and the other is an Object Relational
Mapper ("ORM" for short). """
SQLAlchemy supports these databases: SQLite, Postgres, MySQL, and
Oracle, as well as developmental support for MS-SQL and Firebird. So our
set is covered.
Why not ditch the current SQL construction code in QuerySet and replace
it with SQLAlchemy? While it adds an external dependency, I feel that
SQLAlchemy has solved these problems much better than Django has. Let's
continue to use our own Django-API for it, but use the machinery of
SQLAlchemy! Really, if you'd think about creating a web application
framework from start today, you wouldn't bother to handle SQL
construction on your own.
There is already the SQLAlchemy branch with exactly this goal in Django,
but it seems to be only merges of trunk without any own code.
The problem of SQL construction isn't trivial. We currently have a few
bugs around multiple joins to the same tables where we create wrong SQL
[2]. These also demonstrate that an API that is able to express all
kinds of SQL statements needs a lot of care. Using SqlAlchemy for the
query construction probably opens a path to also use their mapper when
you really need it.
References:
[1] http://www.sqlalchemy.org/docs/tutorial.myt#tutorial_twoinone
[2] Tickets #1801, #2253, #2496,
Cheers,
Michael
Sounds like you are doing the same sort of split, but without moving it
to a separate class (which is just for maintainability and separation of
responsibility, mostly).
[...]
> > I (and others, especially Russell) have spent quite a while living in
> > the current QuerySet code fixing bugs with query construction and
> > scratching our heads wondering how on earth we are going to fix some
> > other cases. For example, we often don't use the most efficient (or the
> > correct -- in some cases) type of table join when constructing queries.
> > Joining QuerySets with particular combinations of disjunctions and
> > conjunctions when common fields are involved leads to incorrect results.
> > Thinking about how to join in support for aggregates just gives me a
> > nose bleed, because constructing the GROUP BY and HAVING clauses is
> > fiddly.
>
> its not that bad - I managed to get GROUP BY working, with things as
> sum, min, count etc.
> getting HAVING to work is quite simple by overriding the filter() (
> well, _filter_or_exclude to be precise )
You'll start to hit a lot of problems when you try to get this working
correctly for various intersected and joined queries (combinations of
"and" and "or") and possibly in other more advanced usages. Basically,
it gets fun when table aliases start to appear and when you want to use
multiple join types (we pretty much always use "left outer" which is
inefficient in some cases and just plain wrong in some other cases -- it
works most of the time, though). You have to work with things in the
right order. I got GROUP BY and HAVING working in the more
straightforward cases, too. Then the wheels fell right off when I got to
more advanced cases.
Regards,
Malcolm
That's one of the reasons. I can't say I'm completely unhappy with
Django's "batteries included" approach, up to a point.
> I feel that
> SQLAlchemy has solved these problems much better than Django has. Let's
> continue to use our own Django-API for it, but use the machinery of
> SQLAlchemy! Really, if you'd think about creating a web application
> framework from start today, you wouldn't bother to handle SQL
> construction on your own.
>
> There is already the SQLAlchemy branch with exactly this goal in Django,
> but it seems to be only merges of trunk without any own code.
>
> The problem of SQL construction isn't trivial. We currently have a few
> bugs around multiple joins to the same tables where we create wrong SQL
> [2]. These also demonstrate that an API that is able to express all
> kinds of SQL statements needs a lot of care. Using SqlAlchemy for the
> query construction probably opens a path to also use their mapper when
> you really need it.
I like the fact that Django doesn't try to replicate all the
functionality of SQL. It isn't a Python layer over a database; rather, it
is a Python framework that happens to have some persistence available.
Most of the time when people are asking about missing functionality,
they are approaching it from a "I can do this in a database" and trying
to map that to Python. If you turn the view around and just ignore the
database except as a way to have magic persistent storage and be able to
query things that are in persistent storage, there are often alternative
solutions to many of the problems (not always, but often -- and for the
other cases, we encourage direct database interaction).
So I decided a while ago that I'm happy with Django's approach and we
can add things as needed in order to make the Python work easier.
That said, it will be interesting to see what happens in the SQLAlchemy
branch. However, for right now, I would rather just fix our immediate
problems: moving query construction into a separate position and redoing
the main loop there will fix a bunch of bugs that we have with query
construction and provide a cleaner place to separate database from the
Python interface to the ORM.
Regards,
Malcolm
It appears that I'll have more work to do to get GeoTypes in working
order for Django. It's infrequently maintained and is currently
specific to psycopg 1.
I'm working with the initd.org people to see if they'll accept such changes.
I agree that queryset shipping bits of text is not terribly flexible,
and I'm guessing SqlAlchemy is the quickest way to get there. I
personally think SQLAlchemy goes *way* beyond the 80/20 point the DJ
ORM hits, but delegating the hard problems to it is probably useful.
In any case, I have to get to trunk and get GeoTypes updated for me to
care about QuerySet terribly, so my vote shouldn't count for much.
Cheers,
Jeremy
In passing, moving the query construction out of QuerySet, particularly
if we move the database access portions out as well, will make
integrating with SQLAlchemy easier -- since it's "just another data
storage backend". I'm really not against SQLAlchemy in any way -- quite
a happy user of it, in fact -- I'm just not convinced it's needed by
default in Django, but I'd be happy to see it as an option.
Regards,
Malcolm
Malcolm Tredinnick wrote:
> In passing, moving the query construction out of QuerySet, particularly
> if we move the database access portions out as well, will make
> integrating with SQLAlchemy easier -- since it's "just another data
> storage backend".
If you make it a design goal, yes. But I'd think that a generic
pluggable "data storage backend" together with your own backend is much
harder to get right than directly plugging in SQLAlchemy.
> I'm really not against SQLAlchemy in any way -- quite
> a happy user of it, in fact -- I'm just not convinced it's needed by
> default in Django, but I'd be happy to see it as an option.
QuerySet doesn't currently really have a concept of joins. I mean, it
can handle joins to some extent, but it doesn't have an understanding
that goes beyond knowing the SQL syntax. It doesn't understand table
aliases, multiple joins to the same table, different join types, or join
conditions.
I fear that the problems in some of the tickets I mentioned are not
really solvable without such a concept. But this would need a lot of
effort and be more or less a rewrite of QuerySet. I'm usually quite
capable of fixing the bugs in Django that itch me, but I haven't found a
solution for these.
Another point is that you'll get more and more change requests, for
things like aggregates, sorting by functions, outer joins, composite
primary keys. At some point you'll need to define how far Django will go
and put a stop to this. But how can you justify that within the Python
community when there's something like SQLAlchemy around?
Batteries included is nice, but, hey, Django does not implement the
database backend, either. It does not implement the fast-cgi interface.
In my opinion, the strength of Django vs. TurboGears/Pylons is that it
has documented everything in one place, plus the coherent API. The
dependencies are a minor point. You don't need to give this up when you
integrate SQLAlchemy just for SQL construction. This is similar to the
SQL backend and flup: They usually don't turn up in your code, only when
you actually need them.
Michael
Let's stop going down this path. This isn't a thread about whether to
switch to SQLAlchemy. If you want to work on that, there is a branch
already created and a maintainer appointed for that branch. I've got
nothing against it, but it's not an area I'm working on at the moment.
Thanks,
Malcolm
Agreed. Having SQLAlchemy as an option is something we might have for
1.0, but I highly doubt it'll be done in time given the work that
needs to be done in that branch. Switching Django to SQLAlchemy is
something we cannot do right now because we have stated that the
Django database-access API will not break backwards-compatibility
before 1.0, and switching to SQLAlchemy would require us to break that
compatibility in major ways.
And if you want Django/SQLAlchemy integration, the thing to do is get
in touch with us about contributing code to the SQLAlchemy branch. I
know we have a lot of people out there who say they want this feature,
and the only way it's going to happen is if some motivated folks write
the code :)
--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
Malcolm Tredinnick schrieb:
> Let's stop going down this path. This isn't a thread about whether to
> switch to SQLAlchemy. If you want to work on that, there is a branch
> already created and a maintainer appointed for that branch. I've got
> nothing against it, but it's not an area I'm working on at the moment.
Well, sorry, I didn't intend to hijack this thread; I only meant to
point you towards something that, IMO, might save you some effort. But
I'm looking forward to the new structure, no matter how it's done!
Michael
Broadly speaking, +1. When I last gave serious thought to aggregates,
I was starting to have similar ideas on the need for a QuerySet
refactor. Nice to know I'm not completely out on my own.
Of course, the devil is in the detail, and there are some interesting
edge cases where WHERE, HAVING and GROUP BY intersect, but if you're
willing/able to come up with a first draft, I'm happy to throw my
weight into poking holes in it. :-)
> Things I think become easier this way
> -------------------------------------
> - aggregates and other computed output columns
> - custom fragments in portions of a query (as opposed to writing
> a whole query for the database)
> - adding new query types (GIS?)
> - different database syntax (LIMIT/OFFSET in both MS SQL Server
> and Oracle)
>
> I don't know of any announced plans/wishes that affect QuerySets that
> would become more difficult under this change, so please sing out if you
> know of any.
That's a pretty good list. There might be a few others, but these are
the big ones I am aware of.
Yours,
Russ Magee %-)
I went for a skiing trip last week; since there was no snow, I gave
the QuerySet a try:
I implemented a very simple approach, using Query as a class
representing the SQL statement (I didn't actually deal with
querysets, just the SQL)
I use these objects:
class Model - represents a model/table, can be joined with another
model via a path ( 'some__related__lookup'.split( '__' ) ), has a
reference to the query
class Field - has a reference to Model instance so that it can get its
alias (one django model can be used twice in one query), has a
reference to query
operator - a simple dictionary containing functions producing SQL
snippets for expressions, for example:

    {
        'lt': lambda x, y: '%s < %s' % ( x, y ),
    }

more operators can be registered at runtime (custom SQL operators)
class Q - represents a generic condition, can be ORed and ANDed, can be
used as WHERE or HAVING clauses, has the filter( **kwargs )
method, has a reference to the query
class Query - contains the list of Fields to select and Models to select
from, and mainly encapsulates all lookup operations, so you can call
self.query.lookup_path( some_path )
and it will return a correct Field instance representing the field and
add the necessary joins and where conditions to the query in the
background -- this is the main method where most of the work will be
done. To demonstrate, this is how a working order_by() method looks:
    def order_by( self, *args ):
        """
        '-field', 'related__field__to__order__by'
        """
        # Stored under a different name so the assignment doesn't
        # shadow the order_by() method itself.
        self._order_by = []
        for field in args:
            if field.startswith( '-' ):
                order = 'DESC'
                field = field[1:]
            else:
                order = 'ASC'
            field = self.lookup_path( field.split( '__' ) )
            self._order_by.append( ( field, order ) )

rendering the order by:

    if self._order_by:
        output.append( 'ORDER BY' )
        output.append( ', '.join( '%s %s' % ( f.alias, order )
                                  for f, order in self._order_by ) )
the other thing this approach brings is that by overriding Query's
__str__() method, you can modify everything - nothing is stored in
text form; everything is normalized, and only when putting the SQL code
together is it evaluated. So we could even move this (or parts of it)
to the backend and thus work with the special features of the
individual engines, or just allow users to supply their own Query
class. Only the Q object representing the condition is rendered
separately, via the operators.
I will try to clean up the code today and post it along with some
examples. It's just a proof of concept, so it can only deal with simple
forward foreign keys, but I think this approach might just work.
start the poking ;)
>
> > Things I think become easier this way
> > -------------------------------------
> > - aggregates and other computed output columns
> > - custom fragments in portions of a query (as opposed to writing
> > a whole query for the database)
> > - adding new query types (GIS?)
> > - different database syntax (LIMIT/OFFSET in both MS SQL Server
> > and Oracle)
> >
> > I don't know of any announced plans/wishes that affect QuerySets that
> > would become more difficult under this change, so please sing out if you
> > know of any.
>
> That's a pretty good list. There might be a few others, but these are
> the big ones I am aware of.
>
> Yours,
> Russ Magee %-)
here it is:
very simple example:
>>> from django.contrib.admin import models
>>> import query
>>> q = query.Query( models.LogEntry )
>>> q.filter( user__username='aa')
>>> q.order_by( '-user__id' )
>>> print q
SELECT
"t0"."id", "t0"."action_time", "t0"."user_id", "t0"."content_type_id",
"t0"."object_id", "t0"."object_repr", "t0"."action_flag",
"t0"."change_message"
FROM
"django_admin_log" as "t0" INNER JOIN "auth_user" as "user" ON
"t0"."user_id" = "user"."id"
WHERE
"user"."username" = aa
ORDER BY
"user"."id" DESC
>>>
In the code you can find FIXME and TODO marks; FIXME marks something
broken, TODO stands for something missing.
Looking forward to any comments...
We are a bit further along on the GIS work. If possible, I'd like to
avoid becoming dependent on the QuerySet refactoring, but I also don't
want to make it any harder.
So far, the only pain is in the fact that all lookup types share the
same namespace, but may have different semantics based on the field
type. For example, text "contains" is very different from GIS
"contains".
There are two places so far that this causes trouble.
django.db.query.parse_lookup uses a static list of mappings (...
lookup_type not in QUERY_TERMS...) in order to default the lookup type
to "exact".
django.db.query.get_where_clause does the wrong thing for GIS contains. ;-)
Both could be addressed by allowing lookup_type semantics to differ
depending on what Field type is being used.
I think it'd be nice to move .get_where_clause to
django.db.models.fields.Field and subclasses. When a class is first
constructed (or the first time .contribute_to_class is called) it
could register a mapping for a given lookup_type.
    from django.db.models.query import register_lookup_types

    class Field(...):
        ...
        def __init__(self, ...):
            if Field.creation_counter == 0:
                register_lookup_types(self.__class__, ['contains', 'ilike', ...])

        def get_where_clause(lookup_type, table_prefix, field_name, value):
            ...
        get_where_clause = classmethod(get_where_clause)
Then .query.get_where_clause would change to just locate and call the
related field.get_where_clause.
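That dispatch step could be sketched as follows; the names and the SQL
fragments are invented for illustration (ST_Contains is just an example
of a spatial function, not necessarily what the GIS branch uses), but it
shows how "contains" can mean different things per field type:

```python
class Field:
    def get_where_clause(cls, lookup_type, table_prefix, field_name, value):
        if lookup_type == 'contains':
            # Text containment: substring matching.
            return '%s%s LIKE %%s' % (table_prefix, field_name)
        raise TypeError('Invalid lookup_type: %r' % lookup_type)
    get_where_clause = classmethod(get_where_clause)

class GeometryField(Field):
    def get_where_clause(cls, lookup_type, table_prefix, field_name, value):
        if lookup_type == 'contains':
            # Spatial containment, not substring matching.
            return 'ST_Contains(%s%s, %%s)' % (table_prefix, field_name)
        return Field.get_where_clause(lookup_type, table_prefix,
                                      field_name, value)
    get_where_clause = classmethod(get_where_clause)

def get_where_clause(lookup_type, table_prefix, field_name, value, field):
    # The slimmed-down generic version: locate the Field and delegate.
    return field.get_where_clause(lookup_type, table_prefix,
                                  field_name, value)
```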
I think this change could be made (on the gis branch) without too much trouble.
I don't think this harms the QS refactoring discussion, but wanted to
make sure it's OK before proceeding in this direction.
Malcolm?
Cheers,
-Jeremy
(P.S. When I say "we", I mean I'm checking in and consolidating in a
way that could land on Django trunk, but I have yet to write original
code -- Justin Bronn and Travis ? have been doing the bulk of the
work. Thanks guys!)
I have a dim memory that there's a trick required if you do this and I'm
trying to remember what the problem was. It's not coming forward at the
moment... :-(
Looking at the code I've written, I think the only potential problem
that jumps out is that constructing some of these things is
database-engine specific (some of the text matching options,
potentially), so having access to the db backend is necessary. But I'll
have a look at whether I can just transplant them easily to the Field
class.
As a design issue, it's a little leaky in that Field classes are
currently unaware of any database interactions (see next paragraph for
why I like this). However, you are creating something that is fairly
tightly tied to the backend, so it's hard not to have leaky
encapsulation in your case. Tough to see how you could do it otherwise
without also dragging the rest along (well, one way would be to
conditionally call the field if it had a particular method, but that
feels icky in some ways, too).
I would probably call out to another function somewhere to do the real
registration, since I have (pipe?) dreams of wanting to be able to
replace the database work with another backend (e.g. RDF-based,
SQLAlchemy, bdb) easily. But that's a minor issue of where to put the
actual code and whether to have a function call and it's way down the
track.
As an aside on the sample you jotted down: any big reason why you want
it to be a class method here? It doesn't seem to gain a lot in the
usage, as far as I can see, since you're going to be running through
Field *instances* when you are constructing the query in any case and I
don't think we need to access class-level variables in these methods, do
we? [I know why you might want to do it from a "OO purity" stand point,
since it is tied to the class more than the specific instantiation, but
if it doesn't hurt to access it through the instance it's just visual
overkill. That's what I'm wondering about.]
Having thought about it for a whole 10 minutes, I'd say try that path
out and if I think of anything, I'll sing out and help you port it.
Cheers,
Malcolm
> Looking at the code I've written, I think the only potential problem
> that jumps out is that constructing some of these things is
> database-engine specific (some of the text matching options,
> potentially), so having access to the db backend is necessary. But I'll
> have a look at whether I can just transplant them easily to the Field
> class.
Yeah, that occurred to me. I guess I could make the get_where_clause
take backend as a parameter.
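Something like this, perhaps (purely a sketch: the backend object and its `operators` attribute are invented stand-ins, not the real backend interface):

```python
# Hypothetical: a backend object carries an operator table, so the
# field-level method can emit engine-specific SQL fragments without
# the field hard-coding any one database's syntax.

class FakeBackend:
    operators = {
        'exact': '= %s',
        'icontains': 'ILIKE %s',  # e.g. the PostgreSQL spelling
    }

def get_where_clause(backend, lookup_type, table_prefix, field_name, value):
    try:
        op = backend.operators[lookup_type]
    except KeyError:
        raise TypeError("Invalid lookup_type: %r" % lookup_type)
    return "%s%s %s" % (table_prefix, field_name, op), [value]
```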
> Tough to see how you could do it otherwise
> without also dragging the rest along (well, one way would be to
> conditionally call the field if it had a particular method, but that
> feels icky in some ways, too).
Yeah, the current as-yet-uncommitted branch code is copying
get_where_clause, parse_lookup, and lookup_inner with minor changes,
and I don't think that's any better. :)
> I would probably call out to another function somewhere to do the real
> registration,
Sorry, can you pseudo-code this? I'm not sure what you mean by
another function for real registration.
> As an aside on the sample you jotted down: any big reason why you want
> it to be a class method here?
No, in fact I considered making it an instance method which would not
take table_prefix or field_name, but realized this was a somewhat
larger change and didn't handle aliasing, which you'll likely be
addressing in the QS refactor, so I was trying to stay out of the way
for that.
I don't mind doing an instance method at all, if that's what you prefer.
...
> Having thought about it for a whole 10 minutes, I'd say try that path
> out and if I think of anything, I'll sing out and help you port it.
Just waiting for enlightenment on the real registration, and I'll be on my way.
Thanks for the fast response. :)
-Jeremy
I might be making a mountain out of a molehill here. I was thinking out
loud about where the various pieces of code reside.
At the moment we have Field classes knowing nothing about the database
-- both when you use them and in the source code -- and all/most of the
database access bits in django.db.models.query. You seem to be proposing
to move a lot more database-awareness into django.db.models.fields for
good reason.
I was wondering how well this would work if I was going to write
something that looked like a QuerySet, but didn't talk to the database.
That was sort of a hidden side-effect of the refactoring proposal I
wrote up: all the database-interaction portions were moved into a class
(Query) that is an attribute of QuerySet. So QuerySet provides the ORM
API and Query does the database work. I could replace Query with
something that talked to rdflib or SQLAlchemy fairly simply, I was
thinking (or hoping). Moving database intelligence into Field classes
makes that harder.
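In rough terms the split looks like this (class and method names invented for illustration), which is why a database-aware Field class complicates swapping out the bottom layer:

```python
# Sketch of the QuerySet/Query split from the refactoring proposal.
# Anything exposing a results() method could replace the SQL layer
# (an RDF- or SQLAlchemy-backed Query, say).

class SQLQuery:
    """Stand-in for the SQL-building/database-access layer."""
    def results(self):
        # Real code would construct SQL and execute it; stubbed here.
        return [{'id': 1, 'name': 'example'}]

class QuerySet:
    """Python-facing ORM API; knows nothing about storage."""
    def __init__(self, query=None):
        self.query = query or SQLQuery()

    def __iter__(self):
        return iter(self.query.results())
```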
For now, don't worry about this. There'll be some tweaking to do anyway
when the two streams interact (adding GIS and refactoring QuerySets).
Somebody has to go first, so I'll keep an eye on what you're doing. I'm
not going to be (publicly) writing ORM backend replacements on day one
in any case. This is just an eye towards the future.
>
> > As an aside on the sample you jotted down: any big reason why you want
> > it to be a class method here?
>
> No, in fact I considered making it an instance method which would not
> take table_prefix or field_name, but realized this was a somewhat
> larger change and didn't handle aliasing, which you'll likely be
> addressing in the QS refactor, so I was trying to stay out of the way
> for that.
>
> I don't mind doing an instance method at all, if that's what you prefer.
Table prefix you have to pass in, in any case, right? Since it's not part
of the field (it's part of the Options class). The column name is
instance-specific, so you need "self" access, don't you? I'm
confused. :(
I don't have anything against class methods in general. When I see one
used in code, though, I (like others, I suspect) wonder why it's being
used in that way, rather than just a "normal" method. Typically it's
because you access class-level attributes only, so the internals of the
method are easier as a classmethod() (no need for self.__class__ all
over the shop). In this case, I was wondering if you needed that. Don't
worry too much about my foibles here (although I'm not sure how you're
getting the right column name in a class method).
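As an instance method the column question answers itself, since the column name lives on the field (sketch only; attribute names are made up):

```python
# With an instance method, the column name comes from the field
# itself and only the table prefix (alias) needs passing in.

class Field:
    def __init__(self, column):
        self.column = column  # instance-specific, hence "self" access

    def get_where_clause(self, lookup_type, table_prefix, value):
        if lookup_type == 'exact':
            return "%s%s = %%s" % (table_prefix, self.column), [value]
        raise TypeError("Invalid lookup_type: %r" % lookup_type)
```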
Cheers,
Malcolm