Prefetch and avoiding huge IN clauses

140 views
Skip to first unread message

Erik Cederstrand

unread,
Sep 25, 2015, 6:35:34 AM9/25/15
to Django developers (Contributions to Django itself)
Hi devs,

When prefetching related items for a queryset returning a large amount of items, the generated SQL can be quite inefficient. Here's an example:

class Category(models.Model):
    type
= models.PositiveIntegerField(db_index=True)


class Item(models.Model):
   category
= models.ForeignKey(Category, related_name='%(class)ss')


If I have 50.000 categories of type=5, then "Category.objects.filter(type=5).prefetch_related('items')" will generate an SQL on the Item table with an IN clause containing the 50.000 Category ID's. This is because get_prefetch_queryset() on the Manager classes in django.db.models.related all do this:

def get_prefetch_queryset(self, instances, queryset=None):
    
[...]
    query = {'%s__in' % self.query_field_name: instances}
    queryset
= queryset._next_is_sticky().filter(**query)
   
[...]


While this is great when instances is a short list, we can hope to do better than that when instances is large. A much more efficient query in the above case would be queryset._next_is_sticky().filter(category__type=5). We just need to make sure this alternative query will fetch (at least) the same items as the query with the IN clause would.

I thought I could use Prefetch('items', queryset=Item.objects.filter(category__type=5))to accomplish this, but while the custom queryset *is* used, get_prefetch_queryset() still adds the IN clause unconditionally. This is A) redundant B) sends a huge SQL string over the wire for the database to process, and C) seriously messes up the database query planner, often generating an inefficient execution plan.

I would like to fix this problem. The easiest way seems to be to let Prefetch() tell get_prefetch_queryset() to skip the IN clause. django.db.models.prefetch_one_level() is in charge of passing the Prefetch queryset to get_prefetch_queryset(), so one option would be to add a skip_in_clause attribute to the Prefetch model which prefetch_one_level() would pass to get_prefetch_queryset(). The latter could then do:

def get_prefetch_queryset(self, instances, queryset=None, skip_in_clause=False):
   
[...]
   
if not skip_in_clause:
        query
= {'%s__in' % self.query_field_name: instances}
        queryset
= queryset._next_is_sticky().filter(**query)
   
[...]


In my preliminary testing on real data, this:

Category.objects.filter(type=5).prefetch_related(Prefetch('items', queryset=Item.objects.filter(category__type=5), skip_in_clause=True))

is about 20x faster than the current implementation when the query returns 50.000 categories. The improvement would be greater on larger datasets.

Any comments? If needed, I can provide a pull request with a suggested implementation. Apart from the Prefetch class and prefetch_one_level(), there are 4 instances of get_prefetch_queryset() in django.db.models.related that would need to be changed.


Thanks,
Erik
Reply all
Reply to author
Forward
0 new messages