QuerySet refactoring

228 views
Skip to first unread message

Anssi Kääriäinen

unread,
Jun 13, 2012, 2:50:42 AM6/13/12
to Django developers
I am asking for directions about what to do about
django.db.models.sql.query (actually sql.*). I would like to refactor
the code in small incremental steps. However, this will bring internal
API breakages, and will likely add some more bugs temporarily.

While the ORM mostly works, it IMHO needs some polish. The reasons why
I see ORM refactoring as needed:
- The code is too complex and underdocumented currently. Some
examples are query.add_q() and compiler.fill_related_selections().
- Because of the above, adding new features is hard. There are some
long standing bugs which are hard to fix. There are 407 open ORM or
models.Model tickets, of which 247 are bugs).
- I believe the ORM could be made faster. The ORM currently uses 4x
the time in Python compared to the time the DB needs to parse, plan
and execute a query like this: Model.objects.get(pk=1).

Why incremental rewrite instead of total rewrite? Total rewrite will
likely take so much time as to never actually get done. The underlying
structure of the ORM is good enough. There are things which would
likely be done in different way in total rewrite, but there isn't
anything blocker quality.

Why not use SQLAlchemy or some other ORM? I am no expert of
SQLAlchemy, but I believe it doesn't actually do the same thing as
Django's ORM. The complexity of Django's ORM comes from the need to
handle things like subqueries for negated multijoin lookups and
checking when to use LEFT JOIN, when INNER JOIN. SQLAlchemy doesn't do
that as far as I know.

I have no need to make the ORM generic enough for no-SQL databases. I
don't believe generating a generic query.py class for no-SQL databases
is the correct approach. 80% of the code in query.py deal with joins,
null handling, subqueries and things like that. None of those would be
common to no-SQL DBs. Instead, no-SQL databases need to deal with
structured records, which isn't a problem for the SQL side of the ORM.
The common things are lookup handling and the API. The API is already
separate from sql.*. The lookup handling should be, too.

So, what I am trying to do? Things like:

https://code.djangoproject.com/ticket/16759 - patch:
https://github.com/akaariai/django/compare/ticket_16759
- This is a performance improvement for query.clone() which should
alone make the ORM around 20% faster. This is definite DDN stuff, so
don't worry, I won't be committing this one currently.

https://code.djangoproject.com/ticket/17000 - patch:https://github.com/
django/django/pull/92, https://github.com/akaariai/django/compare/refactor_utils_tree
- Make the utils.tree saner to use. Make query.add_q and
where.as_sql() cleaner. Fixes a couple of bugs.

https://code.djangoproject.com/ticket/16715 - somewhat old patch in
ticket.
- Unify and fix join promotion logic.

There are more tickets somewhat ready in Trac.

The above are small steps to making the ORM easier to handle. The long
term goals at the moment are just cleanup. This cleanup should allow
new features (conditional aggregates, custom lookups etc), but is not
the immediate goal.

Anssi Kääriäinen

unread,
Jun 13, 2012, 3:09:59 AM6/13/12
to Django developers
> https://code.djangoproject.com/ticket/16759- patch:https://github.com/akaariai/django/compare/ticket_16759
>   - This is a performance improvement for query.clone() which should
> alone make the ORM around 20% faster. This is definite DDN stuff, so
> don't worry, I won't be committing this one currently.
>
> https://code.djangoproject.com/ticket/17000- patch:https://github.com/
> django/django/pull/92,https://github.com/akaariai/django/compare/refactor_utils_tree
>   - Make the utils.tree saner to use. Make query.add_q and
> where.as_sql() cleaner. Fixes a couple of bugs.
>
> https://code.djangoproject.com/ticket/16715- somewhat old patch in
> ticket.
>   - Unify and fix join promotion logic.
>
> There are more tickets somewhat ready in Trac.
>
> The above are small steps to making the ORM easier to handle. The long
> term goals at the moment are just cleanup. This cleanup should allow
> new features (conditional aggregates, custom lookups etc), but is not
> the immediate goal.

Accidentally clicked send... So, what I am asking is: Is there support
for ORM refactoring, and the "small step at time" way of doing it? If
the ORM refactorings are to be done, it will be hard to get reviews.
In practice I would need to commit patches without full reviews.

For more technical questions: Is changing the API of utils.tree
acceptable? What about removing old undocumented features ("anything
containing add_to_query() will shortcut qs.add_q() - not tested, not
documented"). qs.order_by("tablename.columname") - tested, but not
documented. Syntax predates 1.0). What about the more radical changes,
like using .clone() instead of .deepcopy() in query.clone() (ticket
#16759).

I know I have taken too much stuff under work already... However, the
ORM is what I really like to hack, and would like to concentrate on
that for some time. To do so I will need to feel confident that there
is support for getting the patches committed.

- Anssi

Luke Plant

unread,
Jun 13, 2012, 12:15:06 PM6/13/12
to django-d...@googlegroups.com
On 13/06/12 08:09, Anssi Kääriäinen wrote:

> Accidentally clicked send... So, what I am asking is: Is there support
> for ORM refactoring, and the "small step at time" way of doing it? If
> the ORM refactorings are to be done, it will be hard to get reviews.
> In practice I would need to commit patches without full reviews.
>
> For more technical questions: Is changing the API of utils.tree
> acceptable? What about removing old undocumented features ("anything
> containing add_to_query() will shortcut qs.add_q() - not tested, not
> documented"). qs.order_by("tablename.columname") - tested, but not
> documented. Syntax predates 1.0). What about the more radical changes,
> like using .clone() instead of .deepcopy() in query.clone() (ticket
> #16759).
>
> I know I have taken too much stuff under work already... However, the
> ORM is what I really like to hack, and would like to concentrate on
> that for some time. To do so I will need to feel confident that there
> is support for getting the patches committed.

I think this is a very necessary piece of work. The problem with that
layer of code is that it is very difficult to really grok and therefore
to review patches. It would take almost as much effort to do a review of
a substantial patch as the patch itself. You have to really understand
what Query is doing, and it is a huge class, that takes hours to even
vaguely get the hang of and get in your head.

I would like to be able to help, and if I dropped everything else I
might otherwise do on Django, that should be possible. I think with two
pairs of eyes on these changes, and with a good test suite, we should be
reasonably safe, but getting more review than that would be hard.

I would say that any undocumented APIs are fair game for changing, which
covers all of the Query class and SQLCompiler classes, AFAIK. For
user-facing features, we should be much more slow to break them,
especially if they are tested. qs.order_by("tablename.columnname") is
pretty strange though, I would be happy with getting rid of that,
assuming there is an alternative.

I do actually have another proposal, that has been a growing draft email
on my computer for a few weeks: that we rewrite our backend using
SQLAlchemy Core (not the ORM). I've used SQLAlchemy a bit, and I'm very
confident we could use it profitably. While the ORM works significantly
differently to ours, the lower layer (Core - the SQL Expression
Language) can indeed build queries that can do all the joins etc that we
would need - it's a full wrapper around SQL. Some of the complexity in
our code comes from not having this wrapper.

SQLAlchemy is a really excellent library, and I fear that long term if
we don't switch to it, we will just produce an ad-hoc, informally
specified, buggy, slow implementation of half of SQLAlchemy.

However, that is a longer term aim (Django 2.0 really, for various
reasons), and I'm confident this refactoring would help either way.

Nonetheless, I'll try to post that proposal soon. There is one strategy
for implementing it that might allow it to be started soon, and
developed alongside existing functionality without breaking anything.
(That depends on when we think 'Django 2.0' will happen). If you've got
a particular interest in the ORM, you might be interested in that proposal.

I completely agree regarding how to handle NoSQL as well.

Regards,

Luke


--
"Outside of a dog, a book is a man's best friend... inside of a
dog, it's too dark to read."

Luke Plant || http://lukeplant.me.uk/

Andrew Godwin

unread,
Jun 13, 2012, 3:50:40 PM6/13/12
to django-d...@googlegroups.com
On 13/06/12 17:15, Luke Plant wrote:

> I think this is a very necessary piece of work. The problem with that
> layer of code is that it is very difficult to really grok and therefore
> to review patches. It would take almost as much effort to do a review of
> a substantial patch as the patch itself. You have to really understand
> what Query is doing, and it is a huge class, that takes hours to even
> vaguely get the hang of and get in your head.
>
> I would like to be able to help, and if I dropped everything else I
> might otherwise do on Django, that should be possible. I think with two
> pairs of eyes on these changes, and with a good test suite, we should be
> reasonably safe, but getting more review than that would be hard.

I'm happy to look over any patches as well - while I generally know more
about the actual backends rather than Query, the more eyes the better on
this kind of stuff, I think.

> I do actually have another proposal, that has been a growing draft email
> on my computer for a few weeks: that we rewrite our backend using
> SQLAlchemy Core (not the ORM). I've used SQLAlchemy a bit, and I'm very
> confident we could use it profitably. While the ORM works significantly
> differently to ours, the lower layer (Core - the SQL Expression
> Language) can indeed build queries that can do all the joins etc that we
> would need - it's a full wrapper around SQL. Some of the complexity in
> our code comes from not having this wrapper.
>
> SQLAlchemy is a really excellent library, and I fear that long term if
> we don't switch to it, we will just produce an ad-hoc, informally
> specified, buggy, slow implementation of half of SQLAlchemy.

I'd love to see the full proposal for this - it's been a while since
I've heard rumblings about SQLAlchemy, and I've forgotten what the major
objections/drawbacks were.

It would certainly save me some work if I could rely on it for some DDL
code generation, though because of the way models are defined in terms
of Fields that's unlikely to be possible soon without some serious
mapping and wrapping.

Andrew

Aymeric Augustin

unread,
Jun 14, 2012, 3:59:20 AM6/14/12
to django-d...@googlegroups.com
Hello Anssi,

I'm familiar with the topic since I tried to review some of your
refactoring patches (before you gained the ability to commit them
yourself).

I'm convinced that this refactoring is useful, because it is likely to
fix some bugs, especially in features that were added to the ORM long
after its original design, and it will make ongoing maintenance
easier.

It's also very difficult to review and I'm afraid I may not be able to
help a lot, besides sanity-checking patches. Anyway, ask loudly for
reviews when you have patches ready, and I'll try to join the effort.

Best regards,

--
Aymeric.

Anssi Kääriäinen

unread,
Jun 14, 2012, 4:58:21 AM6/14/12
to Django developers
On 14 kesä, 10:59, Aymeric Augustin
Thanks for all for the offers for reviews.

I think I will go forward with the following plan:
- For small issues I will create a ticket + pull request and
announce that I am going to commit the request. I will wait for couple
of days and commit if no objections or requests for more time are
raised.
- For medium issues I will ask for a review and wait for one.
- For bigger issues (db schemas, deepcopy() vs. clone()) there need
to be some form of consensus that the approach is in fact correct.

Lets see how the above works. If it doesn't, then lets figure out some
other way. A branch for "orm-next" could work also...

I will post a short request here on django-developers, and then detail
the request in the ticket.

For now the big issue needing review is the django.utils.tree/add_q
refactor. https://code.djangoproject.com/ticket/17000#comment:7 and
https://github.com/akaariai/django/commits/refactor_utils_tree

I would like to get the qs deepcopy/clone() thingy moved forward or
closed as wontfix. Luckily this is a well separated issue from rest of
queryset refactor, so there is no hurry for this one:
https://code.djangoproject.com/ticket/16759#comment:33

- Anssi

Steven Cummings

unread,
Jun 26, 2012, 1:41:17 AM6/26/12
to django-d...@googlegroups.com
Sorry for chiming in so late. I very much like the goals of this effort, particularly bringing clarity to some of the internal APIs. Some related points for your consideration:

* Would the rows matched vs. updated issue be resolved or clarified in this effort [1]?
* It seems like the work I had started to expose accurate update/delete counts [2], and further provide the capability for conditional updates and deletes [3] would be more clearly done on the basis of such a refactor. Does that seem accurate to you or will it not make much of a difference there?

Thanks.


--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.


Anssi Kääriäinen

unread,
Jun 26, 2012, 3:31:23 AM6/26/12
to Django developers
On 26 kesä, 08:41, Steven Cummings <estebis...@gmail.com> wrote:
> Sorry for chiming in so late. I very much like the goals of this effort,
> particularly bringing clarity to some of the internal APIs. Some related
> points for your consideration:

No problem. I have been so busy lately that I haven't gotten a single
commit done... Things look a bit calmer, so I hope I will be able to
start working on Django again.

> * Would the rows matched vs. updated issue be resolved or clarified in this
> effort [1]?

I don't believe this one can be changed. Django currently returns rows
matched, and to remain backwards compatible a flag to .update() would
be needed. The flag would change the return value from matched to
changed rows. This doesn't feel right.

> * It seems like the work I had started to expose accurate update/delete
> counts [2], and further provide the capability for conditional updates and
> deletes [3] would be more clearly done on the basis of such a refactor.
> Does that seem accurate to you or will it not make much of a difference
> there?

For the optimistic concurrency control for model .save(), it seems
that splitting the save_base() to smaller parts could allow subclasses
to do whatever needed for good optimistic concurrency control. Another
option is some sort of hook which allows the instance to add
additional conditions to the save.

I must note that the idea of the refactor is not to do any drastic
rewrites. The structure of the ORM will stay mostly the same.
Currently my aim is at the add_q/add_filter stage of the ORM, and I
don't see much cross section with what you are doing at this stage. I
don't have any long term plans of where this all is going, it is just
cleanup/bugfixing of existing code. Most likely the save_base() will
need some refactoring too.

- Anssi

Steven Cummings

unread,
Jun 26, 2012, 11:13:46 AM6/26/12
to django-d...@googlegroups.com
That answers my question. Thanks! Perhaps I'll try to make matched-vs-updated a point of discussion for Django 2.0. As for the other stuff, I'll give consideration to work on save_base and maybe work up a proposal there.
--
Steven



 - Anssi

Reply all
Reply to author
Forward
0 new messages