ORM refactoring

101 views
Skip to first unread message

Anssi Kääriäinen

unread,
Oct 9, 2011, 3:28:25 PM10/9/11
to Django developers
First, if you have time for it, please take a look of #17025 (sql/
WhereNode refactoring). In summary: test suite is 10% faster, creating
a simple objects.filter(pk=1) query is 30% faster, chained .filter()
calls can be multiple times faster and so on. The code is IMHO
cleaner. And it is much better documented. The patch is WIP, but
please try it on real world queries (especially if you have problems
with queryset cloning) and see if it makes any speed difference. And
of course if you can break it.

Now, for the topic of the post. I have worked quite intensively (some
might say too intensively) around different issues in the ORM for the
last couple of weeks, and my take is that while the code in general
makes mostly sense and it is clear that it had a good design to start
with, it has slowly bitrotted into a situation where it is nearly
impossible to understand it. Case in point, take a look of add_q pre
and post #17025 patch. Can you tell from the pre-version what is done
and why?

I would like to work on refactoring the ORM. This would be slow work,
one piece at a time, not a total rewrite. I think there is much
potential in the ORM. It needs to be cleaned first, and then we can
get hacking new and awesome features. I know there are other people
interested in this, too.

For example I have been thinking for some time about something like
F() objects, but which could contain arbitrary SQL and references to
fields in the query. relabel_aliases and cloning would be handled for
you. These could be used in .order_by(), values_list, annotate() and
so on. SQLSnippet could be its name. This would allow us to
deprecate .extra, and would allow for some really fancy extensions.

Using these you could implement an external SQL snippets library, the
PostgreSQL window functions in one library for example. Or custom
aggregate ModelAggregate (fetch related models using array_agg,
something like prefetch_related, but runs in single SQL query). I
really do believe this is possible.

But we need to get the work started on refactoring the ORM first. I do
currently have some extra time, and I am willing to work on the ORM.
But we need some people reviewing the patches and most of all we need
core developers who have enough time to review and commit the ready
patches. Unfortunately the ORM-knowing core developers seem to be
really busy.

I really do not know how to move forward. I guess I am asking all you
what do you think should be done with the ORM. We could take a
completely opposite direction and make it a wrapper around SQLAlchemy
(I saw this mentioned somewhere some time ago). Or decide that the ORM
is just fine now, or that there are higher importance items. Before
continuing my hacking, it would be nice to know if there is support
for ORM refactoring.


Thank you for your time,
- Anssi Kääriäinen

Alexander Schepanovski

unread,
Oct 10, 2011, 2:18:11 AM10/10/11
to Django developers
I use a bunch of hacks and monkey patches to make ORM master. So I
consider a proper cleanup very valuable.
This one for example for more efficient pickling:
https://gist.github.com/974735

Luke Plant

unread,
Oct 10, 2011, 9:13:25 AM10/10/11
to django-d...@googlegroups.com
Hi Anssi,

> I would like to work on refactoring the ORM. This would be slow work,
> one piece at a time, not a total rewrite. I think there is much
> potential in the ORM. It needs to be cleaned first, and then we can
> get hacking new and awesome features. I know there are other people
> interested in this, too.
>
> For example I have been thinking for some time about something like
> F() objects, but which could contain arbitrary SQL and references to
> fields in the query. relabel_aliases and cloning would be handled for
> you. These could be used in .order_by(), values_list, annotate() and
> so on. SQLSnippet could be its name. This would allow us to
> deprecate .extra, and would allow for some really fancy extensions.
>
> Using these you could implement an external SQL snippets library, the
> PostgreSQL window functions in one library for example. Or custom
> aggregate ModelAggregate (fetch related models using array_agg,
> something like prefetch_related, but runs in single SQL query). I
> really do believe this is possible.
>
> But we need to get the work started on refactoring the ORM first. I do
> currently have some extra time, and I am willing to work on the ORM.
> But we need some people reviewing the patches and most of all we need
> core developers who have enough time to review and commit the ready
> patches. Unfortunately the ORM-knowing core developers seem to be
> really busy.

One of the problems is that it can be very hard to review refactorings.
For example, I recently checked in rev 16929 [1] from ivan_virabyan's
patch, and reviewing it was hard, despite the fact that it was a very
good quality patch. The difficulty is that there appear to be a large
number of changes, when in reality the code has just moved around. But
confirming that it has indeed stayed the same is very time consuming if
you do it properly - in fact it seems almost faster to do the same
changes yourself, and verify you get the same result.

This is only thinking out loud, but I'm wondering if that approach might
not be too bad. If instead of a large patch, we have a series of smaller
steps (in a github branch or whatever), perhaps that would be easier to
review? I'm not saying rework any already created patches to match this,
BTW.

We do have pretty good test coverage, which should be a good help, but
one of the problems here is the mutability thing - it is very difficult
to write a test that shows that shows that a new QuerySet is not sharing
data structures with the old in a way that is incorrect and could lead
to the previous QuerySet being modified by the new one (or subsequent ones).

> I really do not know how to move forward. I guess I am asking all you
> what do you think should be done with the ORM. We could take a
> completely opposite direction and make it a wrapper around SQLAlchemy
> (I saw this mentioned somewhere some time ago). Or decide that the ORM
> is just fine now, or that there are higher importance items. Before
> continuing my hacking, it would be nice to know if there is support
> for ORM refactoring.

It was me who suggested the idea of basing the ORM on SQLAlchemy in
future, on my blog [2]. It would be a very large change, especially
things related to the unit-of-work stuff, and I have no idea whether it
is even feasible really.

My concern is that the ORM currently leaves you high and dry when you
come to the end of the queries it can generate. Some people say the
answer is raw SQL at that point, but if you have built up your query
dynamically in any way, or perhaps in a completely Model agnostic way
(as I was doing in django-easyfilters), you are completely stuck,
because you are left with a dynamically created query that you really
cannot usefully manipulate in any way, but which you must manipulate in
order to do the queries you need. I think the best you could do is
generate the SQL from the query as a string (which is actually hard to
do currently due to the way that SQL generation interacts with the
different backends), parse it again with python-sqlparse, manipulate and
pass to QuerySet.raw(). And then somehow cope with the backend specific
stuff. Ugh.

SQLA, I believe, would be very different, because 'core' provides a
complete wrapper around SQL, and the ORM allows you to drop down to core
to do advanced things. And it has a much nicer way of dealing with the
different SQL dialects.

But as I said, I don't know if any kind of rewrite to use SQLA is
feasible, and I don't know whether anyone else in the community or in
the core team has similar leanings - I don't know if any of the core
devs happened to read that blog post.

I do think that these refactorings you are doing are valuable. However,
if you work on this, I think you need to be prepared for the possibility
that the work may be a dead end. It could be an experiment that shows
that a different approach will be needed. If it does that, it will still
have been useful work, but you will need a certain kind of attitude to
still consider it useful work if we end up switching to SQLA :-)

I'm also unconvinced of the SQLSnippet approach at the moment. I think
it could end up becoming like '.extra()' - an extension point that
becomes troublesome because it drops you through the layers of
abstraction too rapidly. I suspect we need a more convincing and
comprehensive lower level before we can add low level extensions. And it
is that thinking that makes me want to more to SQLA with its core/ORM
distinction.

Regards,

Luke


[1] https://code.djangoproject.com/changeset/16929
[2]
http://lukeplant.me.uk/blog/posts/announcing-django-easyfilters-with-some-heresy-on-the-side/

--
"My capacity for happiness you could fit into a matchbox without
taking out the matches first." (Marvin the paranoid android)

Luke Plant || http://lukeplant.me.uk/

Anssi Kääriäinen

unread,
Oct 10, 2011, 10:59:33 AM10/10/11
to Django developers


On Oct 10, 4:13 pm, Luke Plant <L.Plant...@cantab.net> wrote:

> One of the problems is that it can be very hard to review refactorings.
> For example, I recently checked in rev 16929 [1] from ivan_virabyan's
> patch, and reviewing it was hard, despite the fact that it was a very
> good quality patch. The difficulty is that there appear to be a large
> number of changes, when in reality the code has just moved around. But
> confirming that it has indeed stayed the same is very time consuming if
> you do it properly - in fact it seems almost faster to do the same
> changes yourself, and verify you get the same result.

Well, the patch doesn't move a lot of code - it does actually change
the code. Not that it makes it any easier to review the patch :)

> This is only thinking out loud, but I'm wondering if that approach might
> not be too bad. If instead of a large patch, we have a series of smaller
> steps (in a github branch or whatever), perhaps that would be easier to
> review? I'm not saying rework any already created patches to match this,
> BTW.

I too support the idea of smaller patches. But the problem with the
ORM is that the bigger changes tend to cascade. Or at least that was
what happened to the where node refactoring. It might be that the
patch is too big to swallow. But the benefits are real: django
benchmark suite shows 4-5x speed improvement for chained queryset
construction. And 1.5x improvements in many cases. Those are
differences important to many users. Details in the ticket.

> We do have pretty good test coverage, which should be a good help, but
> one of the problems here is the mutability thing - it is very difficult
> to write a test that shows that shows that a new QuerySet is not sharing
> data structures with the old in a way that is incorrect and could lead
> to the previous QuerySet being modified by the new one (or subsequent ones).

The patched version actually does share structures - the leaf nodes of
the where tree. They are only cloned when relabel_aliases is called,
and that together with avoiding deepcopy is where the speed difference
comes. Ok, the where trees are also kept as small as possible, so that
too affects performance.

> It was me who suggested the idea of basing the ORM on SQLAlchemy in
> future, on my blog [2]. It would be a very large change, especially
> things related to the unit-of-work stuff, and I have no idea whether it
> is even feasible really.
>
> My concern is that the ORM currently leaves you high and dry when you
> come to the end of the queries it can generate. Some people say the
> answer is raw SQL at that point, but if you have built up your query
> dynamically in any way, or perhaps in a completely Model agnostic way
> (as I was doing in django-easyfilters), you are completely stuck,
> because you are left with a dynamically created query that you really
> cannot usefully manipulate in any way, but which you must manipulate in
> order to do the queries you need. I think the best you could do is
> generate the SQL from the query as a string (which is actually hard to
> do currently due to the way that SQL generation interacts with the
> different backends), parse it again with python-sqlparse, manipulate and
> pass to QuerySet.raw(). And then somehow cope with the backend specific
> stuff. Ugh.
>
> SQLA, I believe, would be very different, because 'core' provides a
> complete wrapper around SQL, and the ORM allows you to drop down to core
> to do advanced things. And it has a much nicer way of dealing with the
> different SQL dialects.

In the perfect world you could have a SQLA backend, or qs.as_sqla()
method. That would be neat, yes. QuerySet chaining could be hard to
implement, though (or does SQLA have support for something like
that?). But in my opinion this is another argument for refactoring the
ORM. When the ORM has better separation of concerns and better
abstractions between compiler and sql/query.py it will be easier to
write different compilers - be it a compiler for Cassandra or SQLA.

> I do think that these refactorings you are doing are valuable. However,
> if you work on this, I think you need to be prepared for the possibility
> that the work may be a dead end. It could be an experiment that shows
> that a different approach will be needed. If it does that, it will still
> have been useful work, but you will need a certain kind of attitude to
> still consider it useful work if we end up switching to SQLA :-)

I am fine with the code being discarded, no problems with that. I just
want it to be at least somewhat useful... My biggest worry is not
getting any response at all. That means I do work which is not useful
at all.

And trust me, I have already met some dead ends in my refactorings.
You may not have seen them, but they do exist... :)

- Anssi

Carl Meyer

unread,
Oct 10, 2011, 12:18:57 PM10/10/11
to django-d...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Anssi,

Thanks for your work on #17025 - sounds like an impressive improvement.

On 10/09/2011 01:28 PM, Anssi K��ri�inen wrote:
> I really do not know how to move forward. I guess I am asking all you
> what do you think should be done with the ORM. We could take a
> completely opposite direction and make it a wrapper around SQLAlchemy
> (I saw this mentioned somewhere some time ago). Or decide that the ORM
> is just fine now, or that there are higher importance items. Before
> continuing my hacking, it would be nice to know if there is support
> for ORM refactoring.

I am not sure the "wrapper around SQLAlchemy" idea is at all feasible,
for any number of reasons; it's certainly not an idea that should hold
up other improvements.

I do think there are parts of the ORM that could really benefit from the
kind of cleanup and refactoring you're talking about, so consider me in
favor. I can't make specific promises about review, but I have #17025
high on my list to look at, and would make an effort to review any
patches you work on in this area.

Carl
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6TGvEACgkQ8W4rlRKtE2dphgCfa+u2GsqbNuQbZCHjgNkuwKre
+JsAoICOuSAQ82lrE6oPNHmkDIrqKhSA
=K6a2
-----END PGP SIGNATURE-----

Jonas H.

unread,
Oct 10, 2011, 6:27:53 PM10/10/11
to django-d...@googlegroups.com
On 10/10/2011 04:59 PM, Anssi Kääriäinen wrote:
> In the perfect world you could have a SQLA backend, or qs.as_sqla()
> method. That would be neat, yes. QuerySet chaining could be hard to
> implement, though (or does SQLA have support for something like
> that?). But in my opinion this is another argument for refactoring the
> ORM. When the ORM has better separation of concerns and better
> abstractions between compiler and sql/query.py it will be easier to
> write different compilers - be it a compiler for Cassandra or SQLA.

What's required to make different database backends easy are correct
abstractions -- currently the ORM backend abstraction is pretty messed
up, at least it's not written with non-SQL backends in mind.

/Everything/ SQL related should be moved into a base SQL backend. The
code in django.db.models should be completely query language/database
agnostic. Currently, SQL generation is scattered over multiple places
and abstraction layers, django.db.models.sql.* and django.db.backends etc.

Reply all
Reply to author
Forward
0 new messages