---------------------------------------
Status report
We've got pretty far with our App Engine port. For example, the
sessions db and cached_db backends both work unmodified on App Engine.
You can also order results and use basic filter()s as supported by the
low-level App Engine API (gt, gte, lt, lte, exact, pk__in). You can
also use QuerySet.order_by(), QuerySet.delete(), and QuerySet.count(), as
well as Model.save() and Model.delete().
This is our second porting attempt (it's not in the old repository).
Our first attempt had too many conflicts with the multi-db branch
(esp. the one on github). This time we just hacked everything
together. We didn't concentrate on cleaning up the current backend
API. We've also disabled SQL support.
The next step is to move all the hacks into a nice backend API (at the
same time making sure that it won't conflict with multi-db) and
re-enable SQL support. That's where we need help. Also, if you want to
work on SimpleDB support this is the right time to join. The App
Engine backend itself can be handled by Thomas Wanschik and me -
contributions in this area are not absolutely necessary, so please
concentrate on the cleanup if you want to help.
Now to the details (for those who want to contribute).
---------------------------------------
Introducing QueryGlue
The old Django code was distributed across three layers:
* django.db.models.queryset.QuerySet
* django.db.models.sql.query.Query (from now on just sql.Query)
* backend
When a new QuerySet is instantiated (e.g. by calling
Model.objects.all()) it asks the backend for its Query class and then
creates an instance of that class. By default, this class is
sql.Query. Only the Oracle backend has its own Query which subclasses
sql.Query.
Normally, sql.Query builds the query on-the-fly. Whenever you call
QuerySet.filter(<filters>) the filters get put into a
Q(<filters>) and passed to
sql.Query.add_q( Q(...) ).
This function iterates over all filter rules in the Q object and calls
sql.Query.add_filter() for each individual filter.
This in turn directly modifies sql.Query.where which is a tree
structure that represents the WHERE clause. It already contains
information about the JOIN type for each filter (INNER, OUTER), the
fields that get referenced by the filter, the column and table
aliases, and so on. It already does a lot of what we need for
non-relational backends, but it's too SQL-specific.
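To make the call chain concrete, here is a minimal sketch in plain
Python. ToyQuerySet, ToyQuery and this Q are simplified stand-ins written
for illustration only (they are not Django's classes), and the flat
"where" list only hints at the real tree, which also tracks joins and
aliases.

```python
class Q:
    """Toy stand-in for django.db.models.Q: it only holds keyword filters."""

    def __init__(self, **filters):
        # Sorted so the order of rules is deterministic in this sketch.
        self.children = sorted(filters.items())


class ToyQuery:
    """Toy stand-in for sql.Query; it only mimics the call flow."""

    def __init__(self):
        # The real sql.Query.where is a tree, not a flat list.
        self.where = []

    def add_q(self, q):
        # Walk every filter rule in the Q object ...
        for lookup, value in q.children:
            self.add_filter((lookup, value))

    def add_filter(self, filter_expr):
        # ... and record each rule immediately, while the query is built.
        lookup, value = filter_expr
        # Simplification: only "field" or "field__lookup", no relations.
        field, _, lookup_type = lookup.partition("__")
        self.where.append((field, lookup_type or "exact", value))


class ToyQuerySet:
    def __init__(self):
        self.query = ToyQuery()

    def filter(self, **filters):
        # The filters are wrapped in a Q and handed to add_q().
        self.query.add_q(Q(**filters))
        return self


qs = ToyQuerySet().filter(age__gte=18, name="joe")
print(qs.query.where)
```

The point of the sketch is the timing: the low-level structure is
modified on every filter() call, long before the query is executed.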
The current behavior is also a problem for multi-db because it makes
too many assumptions about the storage format of the filter rules. The
user could call QuerySet.using(other_connection) anytime, so QuerySet
shouldn't really work with the low-level sql.Query class before it
actually executes the query.
We've solved this problem by introducing a backend-independent query
representation between QuerySet and the low-level Query (sql.Query,
appengine.Query, etc.). This representation is called QueryGlue. You
can find it in django.db.models.queryglue. It
provides almost exactly the same "public" API as sql.Query (so it can
easily be integrated with QuerySet). Each filter() call gets
translated into a tree structure that is inspired by sql.Query.where,
but it doesn't contain any information about the kind of JOIN.
Instead, it stores high-level important information like whether we're
filtering on a primary key, which columns and tables are involved in a
JOIN, etc.
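A rough sketch of what such a backend-independent filter record could
look like. ToyQueryGlue, the dict layout and the KNOWN_LOOKUPS set are
assumptions made for this illustration; the real class lives in
django.db.models.queryglue and may differ in its details.

```python
# Lookup types listed earlier as supported, plus "in" for pk__in.
KNOWN_LOOKUPS = {"exact", "gt", "gte", "lt", "lte", "in"}


class ToyQueryGlue:
    """Hypothetical backend-independent holder for filter rules."""

    def __init__(self, pk_name="id"):
        self.pk_name = pk_name
        self.filters = []

    def add_filter(self, lookup, value):
        parts = lookup.split("__")
        # A trailing known lookup type is split off; otherwise it is "exact".
        if len(parts) > 1 and parts[-1] in KNOWN_LOOKUPS:
            lookup_type = parts.pop()
        else:
            lookup_type = "exact"
        # High-level facts only: the field path and whether it is the
        # primary key. Nothing about INNER/OUTER joins or table aliases.
        is_pk = parts in (["pk"], [self.pk_name])
        self.filters.append(
            {"path": parts, "lookup": lookup_type, "value": value, "is_pk": is_pk}
        )
        return self


glue = ToyQueryGlue()
glue.add_filter("pk__in", [1, 2]).add_filter("author__name", "joe")
for rule in glue.filters:
    print(rule)
```

A backend can later decide for itself how to turn a path like
['author', 'name'] into a JOIN, a second query, or an error.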
---------------------------------------
The low-level Query class
Once the query needs to be executed (e.g., by calling .count() or by
iterating over the query) the QueryGlue instance creates a new
low-level Query instance which gets the QueryGlue as its only
parameter. Currently, the low-level Query class is hard-coded to
GAEQuery/BaseQuery in django.db.models.nonrelational.query.
Then, QueryGlue calls the respective execution function
(results_iter(), count(), etc.) on the instantiated low-level Query.
Our GAEQuery can now iterate over all filters in QueryGlue.filters and
convert them to an App Engine Query object.
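The deferred hand-off can be sketched as follows. ToyGlue and
InMemoryQuery are hypothetical names, the "datastore" is just a list of
dicts, and passing the query class as a constructor argument is an
assumption of this sketch (in the code described above it is currently
hard-coded).

```python
import operator


class InMemoryQuery:
    """Hypothetical low-level Query; its only parameter is the glue object."""

    ROWS = [
        {"name": "a", "age": 10},
        {"name": "b", "age": 20},
        {"name": "c", "age": 30},
    ]
    OPS = {
        "exact": operator.eq,
        "gt": operator.gt,
        "gte": operator.ge,
        "lt": operator.lt,
        "lte": operator.le,
    }

    def __init__(self, glue):
        self.glue = glue

    def _matches(self, row):
        # Convert every stored filter rule into a check on the row.
        return all(
            self.OPS[lookup](row[column], value)
            for column, lookup, value in self.glue.filters
        )

    def results_iter(self):
        return (row for row in self.ROWS if self._matches(row))

    def count(self):
        return sum(1 for _ in self.results_iter())


class ToyGlue:
    """Collects filters; builds the low-level query only on execution."""

    def __init__(self, query_class):
        self.query_class = query_class
        self.filters = []

    def add_filter(self, column, lookup, value):
        self.filters.append((column, lookup, value))
        return self

    def count(self):
        return self.query_class(self).count()

    def results_iter(self):
        return self.query_class(self).results_iter()


glue = ToyGlue(InMemoryQuery).add_filter("age", "gte", 18).add_filter("age", "lt", 30)
print(glue.count(), [row["name"] for row in glue.results_iter()])
```

Nothing backend-specific exists until count() or results_iter() runs,
which is what leaves room for switching the connection beforehand.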
---------------------------------------
Subqueries
Instead of working with subquery classes we've added delete_bulk(),
insert(), etc. directly to QueryGlue and the low-level Query class. If
sql.Query really needs the current design those functions can still be
routed to the respective subquery instance, but on App Engine it's
easier to handle those operations in a separate function.
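As an illustration of "plain functions instead of subquery classes",
here is a small sketch. The method names insert() and delete_bulk() come
from the text above; their signatures, DictQuery, ToyGlue and the
dict-based store are assumptions of this sketch.

```python
class DictQuery:
    """Hypothetical low-level query backed by a dict keyed on primary key."""

    def __init__(self, store):
        self.store = store

    def insert(self, pk, values):
        # A put simply overwrites, so insert and update look the same here.
        self.store[pk] = dict(values)

    def delete_bulk(self, pk_list):
        for pk in pk_list:
            self.store.pop(pk, None)


class ToyGlue:
    """Forwards write operations straight to the low-level query."""

    def __init__(self, store):
        self.store = store

    def insert(self, pk, values):
        DictQuery(self.store).insert(pk, values)

    def delete_bulk(self, pk_list):
        DictQuery(self.store).delete_bulk(pk_list)


store = {}
glue = ToyGlue(store)
glue.insert(1, {"name": "a"})
glue.insert(2, {"name": "b"})
glue.delete_bulk([1])
print(store)
```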
---------------------------------------
The cleanup
We made a few not-so-clean changes to Django itself. I've attached a
diff, so contributors can easily find all the changes we did to Django
(they're also commented with TODO and GAE):
............................
* disabled multi-table inheritance;
this could be emulated as described on the Django wiki
http://code.djangoproject.com/wiki/NonSqlBackends
See
django/db/models/base.py: line 147
............................
* disabled deletion of related objects in Model.delete() and QuerySet.delete()
See
django/db/models/query.py: lines 1036, 1065
............................
* replaced sql.subqueries.*Query usage with simple functions on a
single Query class (insert_or_update() instead of InsertQuery and
UpdateQuery)
See
django/db/models/query.py: lines 1058, 1088
............................
* commented out distinction between insert and update in
Model.save_base() because there's no such concept in App Engine (and
SimpleDB, AFAIK)
See
django/db/models/base.py: lines 470, 475
............................
The long-term goal is of course to clean this up and move most of
these changes into the backend API.
---------------------------------------
Common non-relational features
The plan is to add support for simple joins and select_related to all
non-relational backends by
either subclassing the backend's Query class on-the-fly with a
JoinQuery or by supporting something like query pre-processors which
can be added above the low-level Query class. We haven't thought about
the details, yet, but I hope you get the idea.
---------------------------------------
SQL layer details
The ugly detail is that sql.subqueries contains specialized query
classes like InsertQuery, DeleteQuery, etc. which subclass the
backend's Query class. This means that currently, the module loading
process jumps around:
* sql/__init__.py imports sql.query and then sql.subqueries
* sql.query creates the base Query class
* after that, sql.query allows the backend to override the Query class
* sql.subqueries creates subclasses which derive from Query
In multi-db in SVN this is uglier because the subquery classes don't
have just one single sql.Query base class from which to derive,
anymore. There can be multiple backends, each with their own sql.Query
class, so the subqueries have to be maintained by the backend (with
some multi-inheritance magic and manual caching of the custom
subclasses).
In multi-db on github this is much cleaner: The backends can't
override sql.Query, anymore. Instead, there's an SQLCompiler class
which can be overridden by the backend to take care of
backend-specific details. sql.Query stores a slightly more abstract
representation of the query. This multi-db branch moves a lot of code
around. That's why we should try to keep as much code as possible
where it is (at least until the branch gets merged into trunk).
---------------------------------------
The source
The test project and our unit tests are here:
http://bitbucket.org/wkornewald/django-testapp/
The modified Django source and the backend is here:
http://bitbucket.org/wkornewald/django-nonrel-hacked/
We've patched the trunk branch. Unfortunately, the branches are
unnamed (I converted the git mirror because the hg mirror's branches
on bitbucket are broken). You should be able to find the right branch
with "hg heads"
and "hg up -C" to it. Normally our branch should be at tip, anyway, so
you don't need to do anything.
When merging you need to find the trunk branch with "hg heads" and "hg
merge <revnum>" with the trunk head. If this becomes a huge problem
we'll switch to the django-trunk mirror, but I wanted to keep the
option to switch to Alex' multidb branch if that's better, so I chose
this sub-optimal Django mirroring solution.
---------------------------------------
Task management
Our tasks are managed in a Google Spreadsheet:
https://spreadsheets.google.com/ccc?key=0AnLqunL-SCJJdE1fM0NzY1JQTXJuZGdEa0huODVfRHc&hl=en
The task list isn't complete, yet. We're working on that.
Bye,
Waldemar Kornewald
I'm unsure what problem you're having here. The backend needs to
return a type that the TimeField can turn into a Python Time object.
TimeField is fairly liberal in what it will accept - DateTime objects,
Time objects, and strings that express a time will all be handled.
As long as your backend returns one of these acceptable types, you're done.
> What's the status of the email backends ticket? There hasn't been any
> reply to Andi Albrecht's latest patch and comment.
> http://code.djangoproject.com/ticket/10355
> This is essential for supporting all kinds of cloud platforms.
We're in the process of doing feature voting for v1.2. Personally, I'm
happy with the state of the patch, but there have been a couple of -1
votes for the patch, which means that some people still need to be
convinced that it's the right thing to do. Once voting is finished, we
may need to revisit this issue on django-dev.
Yours,
Russ Magee %-)
Great. I just wasn't sure whether this was an internal implementation
detail that we'd better not rely on in our backends.
Bye,
Waldemar Kornewald
I should point out that this is one of the specific problems Alex and
I are trying to address in the multi-db refactor. When we've finished,
returning the right query class should be as simple as implementing an
API on the backend.
Yours,
Russ Magee %-)
The current query_class will need to change slightly to support
multi-db, so anything you implement against that interface will
require some rework later on. That said, the fundamental approach
(i.e., the backend tells you what class to use for queries) will still
be there - it will just be used in a slightly different way.
If you want to write (and test) code now, my suggestion would be to
try making your code as clean as possible against the current
interface, with the expectation that there will be some rework once
multi-db lands. The corollary to this is that if you find yourself
needing to make weird and widespread engineering decisions in order to
support the query_class approach, you should stop and wait for
multi-db to land.
Yours,
Russ Magee %-)
In the SVN multi-db branch there is a modified query_class() API.
OTOH, on github it got replaced with SQLCompiler. Are the
query_class() changes already committed somewhere?
Why do you still need query_class() if you already have SQLCompiler?
If this is just about making non-SQL backends work then you'll need
some kind of backend-independent query representation, so
QuerySet.using() can be supported. That's exactly what we've already
done with QueryGlue, so maybe it would be better to reuse what we've
started and finish it together with us, so that none of us wastes time
refactoring everything twice?
Bye,
Waldemar Kornewald
No, they haven't been developed yet. Alex and I did the initial design
work at the DjangoCon sprints, but we haven't actually implemented
anything yet.
> Why do you still need query_class() if you already have SQLCompiler?
> If this is just about making non-SQL backends work then you'll need
> some kind of backend-independent query representation, so
> QuerySet.using() can be supported. That's exactly what we've already
> done with QueryGlue, so maybe it would be better to reuse what we've
> started and finish it together with us, so that none of us wastes time
> refactoring everything twice?
There are two different agents at work here.
We need to split sql.Query from QueryCompiler to support the fact that
the same SQL-like query needs to be rendered in different ways by
different backends. This can be as simple as the character used for
quoting, or as complex as wrapper clauses needed to handle LIMIT and
OFFSET.
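A small sketch of that split: one query object, several compilers. All
class names here are hypothetical and are not taken from the multi-db
branch; the backtick and ROWNUM variants are only there to show the two
kinds of differences mentioned above.

```python
class ToyQuery:
    """Backend-neutral description of a simple SELECT."""

    def __init__(self, table, columns, limit=None):
        self.table = table
        self.columns = columns
        self.limit = limit


class ToyCompiler:
    """Renders a ToyQuery; subclasses override backend-specific pieces."""

    quote_char = '"'

    def quote(self, name):
        return f"{self.quote_char}{name}{self.quote_char}"

    def apply_limit(self, sql, limit):
        return sql if limit is None else f"{sql} LIMIT {limit}"

    def as_sql(self, query):
        columns = ", ".join(self.quote(c) for c in query.columns)
        sql = f"SELECT {columns} FROM {self.quote(query.table)}"
        return self.apply_limit(sql, query.limit)


class BacktickCompiler(ToyCompiler):
    # Simple difference: only the quoting character changes.
    quote_char = "`"


class RownumCompiler(ToyCompiler):
    # Complex difference: the limit needs a wrapper clause.
    def apply_limit(self, sql, limit):
        if limit is None:
            return sql
        return f"SELECT * FROM ({sql}) WHERE ROWNUM <= {limit}"


query = ToyQuery("author", ["id", "name"], limit=5)
for compiler in (ToyCompiler(), BacktickCompiler(), RownumCompiler()):
    print(compiler.as_sql(query))
```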
There is a separate issue of determining if sql.Query is the right
internal structure to use for representing a query.
To date, sql.Query is the right structure for all Django's supported
backends. It might even be the right structure for a non-SQL backend
that provides a SQL-like query layer (AppEngine possibly falls into
this category, as might a SimpleDB backend). However, a CouchDB,
Cassandra or MongoDB backend probably won't get much traction using an
internal query structure that talks about Joins and Where clauses.
So - the intention is to repurpose query_class() slightly. Once
refactored, query_class() will be required to return a class that
implements the Query interface. sql.Query is the only example at
present, but other backends can provide other internal
representations. The call to query_class() will be made in QuerySet -
not as part of the sql.Query construction. In this way, query_class()
becomes the "get me the actual implementation" method on the backend.
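In sketch form, the idea reads roughly like this. The refactored API
does not exist yet, so every name and signature below (ToyBackend,
ToyQuerySet, SQLishQuery, DocumentQuery, a no-argument query_class()) is
an assumption made for illustration.

```python
class SQLishQuery:
    """Stand-in for sql.Query: stores filters as WHERE-like entries."""

    def __init__(self, model):
        self.model = model
        self.where = []

    def add_filter(self, lookup, value):
        self.where.append((lookup, value))


class DocumentQuery:
    """Stand-in for a non-SQL representation: a plain criteria dict."""

    def __init__(self, model):
        self.model = model
        self.criteria = {}

    def add_filter(self, lookup, value):
        self.criteria[lookup] = value


class ToyBackend:
    def __init__(self, query_cls):
        self._query_cls = query_cls

    def query_class(self):
        # "Get me the actual implementation" for this backend.
        return self._query_cls


class ToyQuerySet:
    def __init__(self, model, backend):
        # The call happens in the QuerySet, not inside sql.Query.
        self.query = backend.query_class()(model)

    def filter(self, **filters):
        for lookup, value in filters.items():
            self.query.add_filter(lookup, value)
        return self


sql_qs = ToyQuerySet("Author", ToyBackend(SQLishQuery)).filter(name="joe")
doc_qs = ToyQuerySet("Author", ToyBackend(DocumentQuery)).filter(name="joe")
print(type(sql_qs.query).__name__, type(doc_qs.query).__name__)
```

Both query classes share only the small add_filter() interface; what
they store internally is up to each backend.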
We're *not* trying to build a completely generic internal query
representation. I'm not convinced that such an animal is even possible
in the general case - again, JOIN means something to relational
databases, but doesn't mean much to non-SQL databases. If AppEngine is
able to leverage some of the sql.Query internals, that's great - but I
don't expect that this will be the default situation.
Yours,
Russ Magee %-)
App Engine's datastore API is more similar to MongoDB than SQL. Even
on SimpleDB I don't think that the Where tree is a good idea because
it's way too SQL-specific.
> So - the intention is to repurpose query_class() slightly. Once
> refactored, query_class() will be required to return a class that
> implements the Query interface. sql.Query is the only example at
> present, but other backends can provide other internal
> representations. The call to query_class() will be made in QuerySet -
> not as part of the sql.Query construction. In this way, query_class()
> becomes the "get me the actual implementation" method on the backend.
Why do you want to implement this in multi-db if it's only useful for
non-SQL support? Wouldn't it be better to keep multi-db as-is and add the
query_class() feature to our branch? That would save us lots of
conflicts because we won't have to implement our code twice (once for the
old query_class and once for your version) and we'll probably have to
change your query_class, anyway.
> We're *not* trying to build a completely generic internal query
> representation. I'm not convinced that such an animal is even possible
> in the general case - again, JOIN means something to relational
> databases, but doesn't mean much to non-SQL databases. If AppEngine is
> able to leverage some of the sql.Query internals, that's great - but I
> don't expect that this will be the default situation.
Does this mean you'll remove QuerySet.using()? Otherwise you'd have to
transform an sql.Query to an appengine.Query.
If the generic query representation is not much more detailed than Q
objects then I don't see a big problem, anyway (our QueryGlue can be
easily transformed into sql.Query or any other query type exactly for
that reason). The reason we need QueryGlue is that the queries will
have to be manipulated and interpreted in order to emulate certain
features (e.g., joins) and it's much easier to do this on the final
query tree than on its intermediate states.
Bye,
Waldemar Kornewald
Exactly my point. There is no such thing as a "generic" internal
query. The closest we can hope for is a common interface for objects
that can have Qs, filters, et al. added to them. sql.Query interprets
those Qs and filters as joins. Other backends will require other
interpretations.
>> So - the intention is to repurpose query_class() slightly. Once
>> refactored, query_class() will be required to return a class that
>> implements the Query interface. sql.Query is the only example at
>> present, but other backends can provide other internal
>> representations. The call to query_class() will be made in QuerySet -
>> not as part of the sql.Query construction. In this way, query_class()
>> becomes the "get me the actual implementation" method on the backend.
>
> Why do you want to implement this in multi-db if it's only useful for
> non-SQL support? Wouldn't it be better to keep multi-db as-is and add the
> query_class() feature to our branch? That would save us lots of
> conflicts because we won't have to implement our code twice (once for the
> old query_class and once for your version) and we'll probably have to
> change your query_class, anyway.
Because the way query_class() is currently used causes other problems.
Providing an entry point for multi-db is a bonus.
>> We're *not* trying to build a completely generic internal query
>> representation. I'm not convinced that such an animal is even possible
>> in the general case - again, JOIN means something to relational
>> databases, but doesn't mean much to non-SQL databases. If AppEngine is
>> able to leverage some of the sql.Query internals, that's great - but I
>> don't expect that this will be the default situation.
>
> Does this mean you'll remove QuerySet.using()? Otherwise you'd have to
> transform an sql.Query to an appengine.Query.
QuerySet.using() will continue to exist. However, I expect there will
be some restrictions on when you can call it. Retasking across backend
types will be one of those restrictions.
> If the generic query representation is not much more detailed than Q
> objects then I don't see a big problem, anyway (our QueryGlue can be
> easily transformed into sql.Query or any other query type exactly for
> that reason). The reason we need QueryGlue is that the queries will
> have to be manipulated and interpreted in order to emulate certain
> features (e.g., joins) and it's much easier to do this on the final
> query tree than on its intermediate states.
I need to take a closer look at QueryGlue to be able to offer any
deeper critique of this. I'll put this on my todo list.
Yours,
Russ Magee %-)
Yes, that'll help in our discussions and I hope it'll make clearer why
query_class() should be implemented in our branch rather than in
multi-db (which already works the way it is - without query_class()).
Here's the link:
http://bitbucket.org/wkornewald/django-nonrel-hacked/src/tip/django/db/models/queryglue.py
What QueryGlue does is something like this (though, it's simplified):
queryset.filter(bla__attr=3)
=> gets translated to =>
queryglue.filters_tree.add(( ['bla', 'attr'], 'exact', 3 ))
As you can see, there isn't anything backend-specific in the
filters_tree. It's actually not even that much different from what
sql.Query.add_filter() already does - just without adding information
about joins and other SQL-specific stuff.
Now, an SQL backend can just iterate over filters_tree and call
sql.Query.add_filter() for each child in the tree - this would be the
easiest way to make sql.Query work again in our code. OTOH, the
non-relational backends could inspect the tree and possibly execute
multiple queries - one for each table involved in the query - and then
join the result set in memory (depending on the query and your data
this can be inefficient - or efficient).
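The non-relational case can be sketched like this. The TABLES data, the
fetch() and run_filter() helpers, and the convention that a relation
"bla" is stored as a "bla_id" column pointing at a table named "bla" are
all assumptions of this sketch, not part of our code.

```python
# Two "tables" in a store that cannot join: lists of plain dicts.
TABLES = {
    "entry": [
        {"id": 1, "bla_id": 10},
        {"id": 2, "bla_id": 11},
        {"id": 3, "bla_id": 10},
    ],
    "bla": [
        {"id": 10, "attr": 3},
        {"id": 11, "attr": 4},
    ],
}


def fetch(table, column, value):
    """One single-table query: exact match on one column."""
    return [row for row in TABLES[table] if row[column] == value]


def run_filter(table, path, lookup, value):
    """Execute one filter rule of the form (path, lookup, value)."""
    if lookup != "exact":
        raise NotImplementedError("only 'exact' is handled in this sketch")
    if len(path) == 1:
        return fetch(table, path[0], value)
    # Path spans a relation: query the related table first ...
    relation, attr = path
    related_ids = {row["id"] for row in fetch(relation, attr, value)}
    # ... then join in memory against the main table.
    return [row for row in TABLES[table] if row[relation + "_id"] in related_ids]


# Same rule as in the filters_tree example: (['bla', 'attr'], 'exact', 3)
rows = run_filter("entry", ["bla", "attr"], "exact", 3)
print([row["id"] for row in rows])
```

Whether this is cheap or expensive depends on how many related rows the
first query returns, which is the efficiency caveat mentioned above.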
Bye,
Waldemar Kornewald
I've renamed it to QueryData. With that huge roadblock out of our way,
I hope you're much more likely to help. ;)
Bye,
Waldemar Kornewald