WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

john....@plushrugs.com

unread,

Feb 3, 2016, 11:03:31 AM2/3/16

to Django users

Greetings!

I'm trying to refactor a query to avoid using QuerySet.extra (as per the recommendation here: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#extra)

Here's a simplified version of my code:

# testapp.models

class Parent(models.Model):
    pass

class Child(models.Model):
    parent = models.ForeignKey('testapp.Parent')

This is the query I am attempting to replace:

Parent.objects.extra(where=["""
    NOT EXISTS (SELECT 1 FROM testapp_child WHERE testapp_child.parent_id = testapp_parent.id)
"""])

This is extremely similar to, but not the same as

Parent.objects.exclude(id__in=Child.objects.all().values('parent_id'))

Both queries should return the same rows, but the biggest difference is that PostgreSQL's query planner produces completely different plans. The performance difference can be huge, even for a relatively modest data set. (I believe my data set has something like 270k "parent" instances and 80k "child" instances.)

I tried searching this group as well as the ticket system, and couldn't find a solution to this exact problem (other than using the alternative query).

Thanks for taking the time to read through all of this! I am more than happy to open a ticket, but I thought I should post here first.

More Technical Stuff

Here's the output of EXPLAIN ANALYZE on my actual models/data. I had to apply a LIMIT 100 because if I try to return the entire result, the second one completely hangs my computer. I bolded some relevant bits

Limit (cost=0.71..9.70 rows=100 width=111) (actual time=0.074..0.483 rows=100 loops=1)
-> Merge Anti Join (cost=0.71..22435.69 rows=249613 width=111) (actual time=0.072..0.465 rows=100 loops=1)
        Merge Cond: (catalogue_product.id = catalogue_productcategory.product_id)"
        -> Index Scan using catalogue_product_pkey on catalogue_product (cost=0.42..16986.70 rows=292339 width=111) (actual time=0.020..0.285 rows=150 loops=1)
        -> Index Only Scan using catalogue_productcategory_9bea82de on catalogue_productcategory (cost=0.29..3861.27 rows=81671 width=4) (actual time=0.037..0.070 rows=101 loops=1)
              Heap Fetches: 0
Total runtime: 0.568 ms

Here's the plan for the second query:
Limit (cost=0.00..234229.95 rows=100 width=111) (actual time=31.087..1165.022 rows=100 loops=1)
-> Seq Scan on catalogue_product (cost=0.00..342373919.34 rows=146170 width=111) (actual time=31.087..1164.992 rows=100 loops=1)
        Filter: (NOT (SubPlan 1))
        Rows Removed by Filter: 5
        SubPlan 1
          -> Materialize (cost=0.00..2138.07 rows=81671 width=4) (actual time=0.001..5.539 rows=80818 loops=105)
                -> Seq Scan on catalogue_productcategory u0 (cost=0.00..1409.71 rows=81671 width=4) (actual time=0.005..10.574 rows=81671 loops=1)
Total runtime: 1165.255 ms

As you can see the former is about ~2000 times faster than the latter.

John P.

unread,

Feb 3, 2016, 11:09:24 AM2/3/16

to Django users

Sorry for the double-post, but I just searched the Django Developers group, and found this: "Using EXISTS instead of IN for subqueries" https://groups.google.com/forum/#!searchin/django-developers/WHERE$20EXISTS/django-developers/OG1unUV-MOU/VzkepyuaUDUJ
and this: "Support server-side cursors for queryset iteration in database backends" https://code.djangoproject.com/ticket/16614

I don't think this answers my specific question, because mine is more about trying to explicitly invoke the WHERE EXISTS clause instead of automatically replacing IN with WHERE EXISTS, but I thought I should put them here for anyone interested.

Thanks.

John

John P.

unread,

Feb 10, 2016, 11:02:18 AM2/10/16

to Django users

I posted a ticket here: https://code.djangoproject.com/ticket/26194 which was marked as a duplicate of https://code.djangoproject.com/ticket/25789#comment:2

It was suggested that I try Parent.objects.filter(child=None).distinct() (but that it might perform worse than the WHERE NOT EXISTS query) Here's the plan generated by that query: HashAggregate (cost=216.11..216.54 rows=43 width=103) (actual time=1.553..1.562 rows=25 loops=1) -> Nested Loop Anti Join (cost=0.42..215.15 rows=43 width=103) (actual time=0.015..1.511 rows=25 loops=1) -> Seq Scan on catalogue_category (cost=0.00..104.45 rows=145 width=103) (actual time=0.006..0.154 rows=146 loops=1) -> Index Only Scan using catalogue_productcategory_b583a629 on catalogue_productcategory (cost=0.42..37.27 rows=801 width=4) (actual time=0.009..0.009 rows=1 loops=146) Index Cond: (category_id = catalogue_category.id) Heap Fetches: 87 Total runtime: 1.619 ms
This is only 3 times as slow as the carefully optimized query, which might be an acceptable performance trade-off in some cases.

On Wednesday, February 3, 2016 at 10:03:31 AM UTC-6, John P. wrote:

Reply all

Reply to author

Forward