Hi devs,
When prefetching related items for a queryset returning a large amount of items, the generated SQL can be quite inefficient. Here's an example:
class Category(models.Model):
type = models.PositiveIntegerField(db_index=True)
class Item(models.Model):
category = models.ForeignKey(Category, related_name='%(class)ss')
If I have 50.000 categories of type=5, then "Category.objects.filter(type=5).prefetch_related('items')" will generate an SQL on the Item table with an IN clause containing the 50.000 Category ID's. This is because get_prefetch_queryset() on the Manager classes in django.db.models.related all do this:
def get_prefetch_queryset(self, instances, queryset=None):
[...]
query = {'%s__in' % self.query_field_name: instances}
queryset = queryset._next_is_sticky().filter(**query)
[...]
While this is great when instances is a short list, we can hope to do better than that when instances is large. A much more efficient query in the above case would be queryset._next_is_sticky().filter(category__type=5). We just need to make sure this alternative query will fetch (at least) the same items as the query with the IN clause would.
I thought I could use Prefetch('items', queryset=Item.objects.filter(category__type=5))to accomplish this, but while the custom queryset *is* used, get_prefetch_queryset() still adds the IN clause unconditionally. This is A) redundant B) sends a huge SQL string over the wire for the database to process, and C) seriously messes up the database query planner, often generating an inefficient execution plan.
I would like to fix this problem. The easiest way seems to be to let Prefetch() tell get_prefetch_queryset() to skip the IN clause. django.db.models.prefetch_one_level() is in charge of passing the Prefetch queryset to get_prefetch_queryset(), so one option would be to add a skip_in_clause attribute to the Prefetch model which prefetch_one_level() would pass to get_prefetch_queryset(). The latter could then do:
def get_prefetch_queryset(self, instances, queryset=None, skip_in_clause=False):
[...]
if not skip_in_clause:
query = {'%s__in' % self.query_field_name: instances}
queryset = queryset._next_is_sticky().filter(**query)
[...]
In my preliminary testing on real data, this:
Category.objects.filter(type=5).prefetch_related(Prefetch('items', queryset=Item.objects.filter(category__type=5), skip_in_clause=True))
is about 20x faster than the current implementation when the query returns 50.000 categories. The improvement would be greater on larger datasets.
Any comments? If needed, I can provide a pull request with a suggested implementation. Apart from the Prefetch class and prefetch_one_level(), there are 4 instances of get_prefetch_queryset() in django.db.models.related that would need to be changed.
Thanks,
Erik