WHERE EXISTS / WHERE NOT EXISTS clause without using QuerySet.extra

97 views
Skip to first unread message

john....@plushrugs.com

unread,
Feb 3, 2016, 11:03:31 AM2/3/16
to Django users
Greetings!

I'm trying to refactor a query to avoid using QuerySet.extra (as per the recommendation here: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#extra)

Here's a simplified version of my code:

# testapp.models

class Parent(models.Model):
    pass

class Child(models.Model):
    parent = models.ForeignKey('testapp.Parent')

This is the query I am attempting to replace:

Parent.objects.extra(where=["""
    NOT EXISTS (SELECT 1 FROM testapp_child WHERE testapp_child.parent_id = testapp_parent.id)
"""])

This is extremely similar to, but not the same as

Parent.objects.exclude(id__in=Child.objects.all().values('parent_id'))

Both queries should return the same rows, but the biggest difference is that PostgreSQL's query planner produces completely different plans. The performance difference can be huge, even for a relatively modest data set. (I believe my data set has something like 270k "parent" instances and 80k "child" instances.)

I tried searching this group as well as the ticket system, and couldn't find a solution to this exact problem (other than using the alternative query).

Thanks for taking the time to read through all of this! I am more than happy to open a ticket, but I thought I should post here first.


More Technical Stuff

Here's the output of EXPLAIN ANALYZE on my actual models/data. I had to apply a LIMIT 100 because if I try to return the entire result, the second one completely hangs my computer. I bolded some relevant bits


Limit  (cost=0.71..9.70 rows=100 width=111) (actual time=0.074..0.483 rows=100 loops=1)
  ->  Merge Anti Join  (cost=0.71..22435.69 rows=249613 width=111) (actual time=0.072..0.465 rows=100 loops=1)
        Merge Cond: (catalogue_product.id = catalogue_productcategory.product_id)"
        ->  Index Scan using catalogue_product_pkey on catalogue_product  (cost=0.42..16986.70 rows=292339 width=111) (actual time=0.020..0.285 rows=150 loops=1)
        ->  Index Only Scan using catalogue_productcategory_9bea82de on catalogue_productcategory  (cost=0.29..3861.27 rows=81671 width=4) (actual time=0.037..0.070 rows=101 loops=1)
              Heap Fetches: 0
Total runtime: 0.568 ms

Here's the plan for the second query:
Limit  (cost=0.00..234229.95 rows=100 width=111) (actual time=31.087..1165.022 rows=100 loops=1)
  ->  Seq Scan on catalogue_product  (cost=0.00..342373919.34 rows=146170 width=111) (actual time=31.087..1164.992 rows=100 loops=1)
        Filter: (NOT (SubPlan 1))
        Rows Removed by Filter: 5
        SubPlan 1
          ->  Materialize  (cost=0.00..2138.07 rows=81671 width=4) (actual time=0.001..5.539 rows=80818 loops=105)
                ->  Seq Scan on catalogue_productcategory u0  (cost=0.00..1409.71 rows=81671 width=4) (actual time=0.005..10.574 rows=81671 loops=1)
Total runtime: 1165.255 ms

As you can see the former is about ~2000 times faster than the latter.

John P.

unread,
Feb 3, 2016, 11:09:24 AM2/3/16
to Django users
Sorry for the double-post, but I just searched the Django Developers group, and found this: "Using EXISTS instead of IN for subqueries" https://groups.google.com/forum/#!searchin/django-developers/WHERE$20EXISTS/django-developers/OG1unUV-MOU/VzkepyuaUDUJ
and this: "Support server-side cursors for queryset iteration in database backends" https://code.djangoproject.com/ticket/16614

I don't think this answers my specific question, because mine is more about trying to explicitly invoke the WHERE EXISTS clause instead of automatically replacing IN with WHERE EXISTS, but I thought I should put them here for anyone interested.

Thanks.

John

John P.

unread,
Feb 10, 2016, 11:02:18 AM2/10/16
to Django users
I posted a ticket here: https://code.djangoproject.com/ticket/26194 which was marked as a duplicate of https://code.djangoproject.com/ticket/25789#comment:2

It was suggested that I try Parent.objects.filter(child=None).distinct() (but that it might perform worse than the WHERE NOT EXISTS query)

Here's the plan generated by that query:

HashAggregate  (cost=216.11..216.54 rows=43 width=103) (actual time=1.553..1.562 rows=25 loops=1)
  ->  Nested Loop Anti Join  (cost=0.42..215.15 rows=43 width=103) (actual time=0.015..1.511 rows=25 loops=1)
        ->  Seq Scan on catalogue_category  (cost=0.00..104.45 rows=145 width=103) (actual time=0.006..0.154 rows=146 loops=1)
        ->  Index Only Scan using catalogue_productcategory_b583a629 on catalogue_productcategory  (cost=0.42..37.27 rows=801 width=4) (actual time=0.009..0.009 rows=1 loops=146)
              Index Cond: (category_id = catalogue_category.id)
              Heap Fetches: 87
Total runtime: 1.619 ms

This is only 3 times as slow as the carefully optimized query, which might be an acceptable performance trade-off in some cases.


On Wednesday, February 3, 2016 at 10:03:31 AM UTC-6, John P. wrote:
Reply all
Reply to author
Forward
0 new messages