ORM Optimization for filtering by related existence

Croepha

unread,

Aug 7, 2014, 7:43:07 PM8/7/14

to django-d...@googlegroups.com

It would be nice if there was some database level optimization for getting objects that have related objects that exist.

This is a slow filter:

qs = [obj for obj in qs if qs.somefield_set.exists()]

Could be faster with something like this:

qs = qs.filter(somefield__exists=True)

Here is some (rough, probably grossly over simplified but working) code:

Code is also available here if it gets garbled: http://dpaste.com/0825PNP

query_template = (

'{not_part} EXISTS (SELECT 1 FROM "{table1}" WHERE '

'"{table1}"."{table1_column}" = "{table2}"."{table2_column}")'

)

def filter_by_reverse_feriegn_key_existance(qs, **kw):

for arg, value in kw.items():

assert arg.endswith('__exists')

if value is True:

not_part = ''

elif value is False:

not_part = 'NOT'

Model = qs.model

for field in arg.rstrip('__exists').split('__'):

field = Model._meta.get_field_by_name(field)[0]

qs = qs.extra(where=[query_template.format(

table1=field.model._meta.db_table,

table1_column=field.field.db_column,

table2=Model._meta.db_table,

table2_column=Model._meta.pk.db_column,

not_part = not_part

)])

Model = field.model

return qs

This works on postgres, and is ~100x faster

I would be interested in any comments

Tim Graham

unread,

Aug 7, 2014, 7:48:22 PM8/7/14

to django-d...@googlegroups.com

Does .filter(somefield__isnull=False) not work for what you're trying to do?

Josh Smeaton

unread,

Aug 7, 2014, 7:49:01 PM8/7/14

to django-d...@googlegroups.com

I think the idea is good, but the implementation you have here isn't desirable. Maybe you could experiment with writing your own custom lookup (https://docs.djangoproject.com/en/dev/howto/custom-lookups/) and see if it fits into the framework that way. AFAIK, anything requiring .extra should be avoided internally, because there are usually better methods for generating SQL and parameters.

Cheers

David Butler

unread,

Aug 7, 2014, 9:16:45 PM8/7/14

to django-d...@googlegroups.com

On Thursday, August 7, 2014 6:48:22 PM UTC-5, Tim Graham wrote:

Does .filter(somefield__isnull=False) not work for what you're trying to do?

If I do .filter(somefield__isnull=False) it tries to do a LEFT OUTER JOIN on the table for somefield instead of a WHERE EXISTS (...) and if that table is very large then it is pretty slow

Perhaps the answer is just to modify what __isnull does instead of making a new lookup

Collin Anderson

unread,

Aug 7, 2014, 9:36:38 PM8/7/14

to django-d...@googlegroups.com

I personally use qs = qs.filter(somefield__gte=1) (works if the id field is an integer)

Curtis Maloney

unread,

Aug 8, 2014, 2:48:44 AM8/8/14

to django-d...@googlegroups.com

Can you create a PR, with docs and tests?

If not, we'll need to find someone brave enough to delve into the ORM and add this [quite valid, IMHO] optimisation.

--

C

--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/d4f2d1bc-326d-43d0-a53c-d2b1c25465cd%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Anssi Kääriäinen

unread,

Aug 8, 2014, 3:17:56 AM8/8/14

to django-d...@googlegroups.com

Unfortunately the problem is a bit deeper in the ORM. In general, when
filtering against a multivalued relation (reverse foreign key or m2m
relation) Django uses joins instead of subqueries. This can result in
duplicating rows if multiple related rows match the filter. The fix
recommended in the docs is to apply .distinct() to the query. Using
distinct can result in slow query execution speed, and many DBAs see
this kind of usage of distinct as a surefire sign that the query is
written poorly (and I agree here).

We really should be using IN/EXISTS queries for filters spanning
multivalued relations, but for historical reasons we don't. We do that
for exclude and negated filters already.

There is also a technical problem which needs to be solved before
changing how __isnull=False works. If we want to make __isnull=False to
do EXISTS/IN query, then the additional icontains filter in the query
qs.filter(somefield__isnull=False, somefield__othercol__icontains='a')
must be pushed to the same IN/EXISTS clause. This isn't easy to do
correctly (and we don't do it correctly for exclude queries currently).

In addition, there are cases where subqueries aren't wanted, older
versions of MySQL for example choke on subqueries.

So, there are a couple of complications for this idea:
- Support for pushing multiple filter conditions to the same subquery
is needed even if we consider support for only isnull=False filter.
- We should consider using IN/EXISTS + subquery for multivalued
relations in filters. This is challenging both technically and from
backwards compatibility perspective.

- Anssi

> it, send an email to django-developers
> +unsub...@googlegroups.com.

Josh Smeaton

unread,

Aug 8, 2014, 3:18:55 AM8/8/14

to django-d...@googlegroups.com

Just to be clear, are you (Curtis) suggesting that the isnull lookup should do an EXISTS when looking at a join? I don't think that's a very good idea - having that construct can be extremely useful. I think adding an __exists lookup to relationships would be a great addition - but that's where the PR/docs/tests would come into it. Adding a custom lookup (or building a 3rd party library that introduces that lookup) means you don't need to get too involved with the ORM - you just need to build out the custom Lookup.

Anssi Kääriäinen

unread,

Aug 8, 2014, 3:33:45 AM8/8/14

to django-d...@googlegroups.com

IIRC relation lookups don't yet use the custom lookups facility, so just
writing a custom lookup might not work. We should turn the related
lookups to use the new lookup system, too, but that wasn't done as the
way relation lookups work is a bit special inside the ORM.

As for addition of .exists lookup - how
should .filter(somefield__exists=False, somefield__othercol=2) work? If
both filters must target the same join, then just writing a custom
lookup doesn't work.

- Anssi

Shai Berger

unread,

Aug 12, 2014, 9:39:06 PM8/12/14

to django-d...@googlegroups.com

On Friday 08 August 2014 10:41:20 Anssi Kääriäinen wrote:
>
> As for addition of .exists lookup - how
> should .filter(somefield__exists=False, somefield__othercol=2) work?

It should be noted that __exists is a little special. For the False case
above,

.filter(somefield__exists=False, somefield__othercol=2)

should always get an empty result set, and for the True case,
somefield__othercol=2 already implies somefield__exists=True. So while it would
be really nice to have sane behavior there, it is not much of a real problem
in practice.

That being said, I suspect adding another level does have the problem Anssi
mentions: .filter(a__b__exists=False, a__c=6) should still use only one join
for a.

Shai.

Anssi Kääriäinen

unread,

Aug 13, 2014, 2:47:47 AM8/13/14

to django-d...@googlegroups.com

On Wed, 2014-08-13 at 04:38 +0300, Shai Berger wrote:
> On Friday 08 August 2014 10:41:20 Anssi Kääriäinen wrote:
> >
> > As for addition of .exists lookup - how
> > should .filter(somefield__exists=False, somefield__othercol=2) work?
>
> It should be noted that __exists is a little special. For the False case
> above,
>
> .filter(somefield__exists=False, somefield__othercol=2)
>
> should always get an empty result set, and for the True case,
> somefield__othercol=2 already implies somefield__exists=True. So while it would
> be really nice to have sane behavior there, it is not much of a real problem
> in practice.

Hmmh, my interpretation was that the above filter would check if there
exists any somefield relations with othercol value=2. But that
interpretation is wrong.

But, how about .filter(Q(somefield__exists=False) |
Q(somefield__othercol=2))? This one should return results only if there
are no somefield related instances at all, or if there are related
instances, then at least one of those instances must have othercol
value=2. So, both filters must target the same join (if my
interpretation is correct...).

- Anssi

Shai Berger

unread,

Aug 13, 2014, 7:40:22 PM8/13/14

to django-d...@googlegroups.com

On Wednesday 13 August 2014 09:55:33 Anssi Kääriäinen wrote:
>
> But, how about .filter(Q(somefield__exists=False) |
> Q(somefield__othercol=2))? This one should return results only if there
> are no somefield related instances at all, or if there are related
> instances, then at least one of those instances must have othercol
> value=2.

Right.

> So, both filters must target the same join

I don't see how this follows -- separate joins will each have the same set of
related records.

Shai.

Anssi Kääriäinen

unread,

Aug 14, 2014, 3:02:50 AM8/14/14

to django-d...@googlegroups.com

On Thu, 2014-08-14 at 02:39 +0300, Shai Berger wrote:
> > So, both filters must target the same join
>
> I don't see how this follows -- separate joins will each have the same set of
> related records.

True, the query results will be correct for exists lookup (but not for
most other lookups if using two joins). The performance will likely be
suboptimal.

It seems __exists filter would be an optimization only in relatively
rare cases. At least PostgreSQL and Oracle can do antijoin
optimizations, and it seems the proposed subquery plan is actually
slower than LEFT JOIN plan on MySQL[1], too.

For example on PostgreSQL both variants of the query generate the exact
same query plan:

akaariai=# explain select * from fooasdf left join barasdf on foo_id =
id where foo_id is null;
QUERY PLAN
----------------------------------------------------------------------
Hash Anti Join (cost=26.48..44.26 rows=2 width=8)
Hash Cond: (fooasdf.id = barasdf.foo_id)
-> Seq Scan on fooasdf (cost=0.00..14.01 rows=1001 width=4)
-> Hash (cost=13.99..13.99 rows=999 width=4)
-> Seq Scan on barasdf (cost=0.00..13.99 rows=999 width=4)

akaariai=# explain select * from fooasdf where not exists (select 1 from
barasdf where foo_id = id);
QUERY PLAN
----------------------------------------------------------------------
Hash Anti Join (cost=26.48..44.26 rows=2 width=4)
Hash Cond: (fooasdf.id = barasdf.foo_id)
-> Seq Scan on fooasdf (cost=0.00..14.01 rows=1001 width=4)
-> Hash (cost=13.99..13.99 rows=999 width=4)
-> Seq Scan on barasdf (cost=0.00..13.99 rows=999 width=4)

We should make related fields use the custom lookups system in any case.
If that is done, it should be relatively easy to add the two joins
exists variant in projects that need it.

- Anssi
[1]http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/

Reply all

Reply to author

Forward