[Django] #25464: Allow skipping IN clause on prefetch queries

Django

unread,

Sep 25, 2015, 5:03:57 AM9/25/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries
----------------------------------------------+--------------------
Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new
Component: Database layer (models, ORM) | Version: master
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 1
Easy pickings: 0 | UI/UX: 0
----------------------------------------------+--------------------
When using prefetch_related() on a large queryset, the prefetch query SQL
can be inefficient. Consider this:

Category.objects.filter(type=5).prefetch_related('items')

If 100.000 categories have type=5, then an IN clause with 100.000 Category
IDs is generated to get the Item objects. Even with a custom queryset
using a Prefetch() object, the IN clause is generated, even though it is
A) redundant, B) sends a potentially multi-megabyte SQL statement over the
wire for the database to process, C) may confuse the query planner to
generate an inefficient execution plan, and D) doesn't scale:

Category.objects.filter(type=5).prefetch_related(Prefetch('items',
queryset=Item.objects.filter(category__item=5)))

Pull request https://github.com/django/django/pull/5356 adds the
possibility to skip the IN clause in cases where we are sure that a better
queryset will get (at least) the same items as the IN clause would:

Category.objects.filter(type=5).prefetch_related(Prefetch('items',
queryset=Item.objects.filter(category__item=5),
filter_on_instances=False))

In my tests, this speeds up prefetch_related() by 20x-50x on large
querysets.

--
Ticket URL: <https://code.djangoproject.com/ticket/25464>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Sep 25, 2015, 6:54:35 AM9/25/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries

-------------------------------------+-------------------------------------

Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by mjtamlyn):

* needs_better_patch: => 0
* stage: Unreviewed => Accepted
* needs_tests: => 1
* needs_docs: => 1

Comment:

Accepting the basic idea here. I'd be interested to see some performance
profiles across a few databases here though, it seems your inner query
there is potentially doing an enormous join to do it's filter, I'm
slightly surprised this is that much more efficient.

Patch needs tests and documentation, and also needs to reassure me what
happens if the prefetched queryset contains extraneous rows. Are they
ignored? Does it throw an error? What should the result be?

--
Ticket URL: <https://code.djangoproject.com/ticket/25464#comment:1>

Django

unread,

Sep 25, 2015, 7:56:56 AM9/25/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries

-------------------------------------+-------------------------------------

Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Description changed by timgraham:

Old description:

> When using prefetch_related() on a large queryset, the prefetch query SQL
> can be inefficient. Consider this:
>
> Category.objects.filter(type=5).prefetch_related('items')
>
> If 100.000 categories have type=5, then an IN clause with 100.000
> Category IDs is generated to get the Item objects. Even with a custom
> queryset using a Prefetch() object, the IN clause is generated, even
> though it is A) redundant, B) sends a potentially multi-megabyte SQL
> statement over the wire for the database to process, C) may confuse the
> query planner to generate an inefficient execution plan, and D) doesn't
> scale:
>
> Category.objects.filter(type=5).prefetch_related(Prefetch('items',
> queryset=Item.objects.filter(category__item=5)))
>
> Pull request https://github.com/django/django/pull/5356 adds the
> possibility to skip the IN clause in cases where we are sure that a
> better queryset will get (at least) the same items as the IN clause
> would:
>
> Category.objects.filter(type=5).prefetch_related(Prefetch('items',
> queryset=Item.objects.filter(category__item=5),
> filter_on_instances=False))
>
> In my tests, this speeds up prefetch_related() by 20x-50x on large
> querysets.

New description:

When using prefetch_related() on a large queryset, the prefetch query SQL
can be inefficient. Consider this:
{{{
Category.objects.filter(type=5).prefetch_related('items')
}}}
If 100.000 categories have type=5, then an IN clause with 100.000 Category
IDs is generated to get the Item objects. Even with a custom queryset
using a Prefetch() object, the IN clause is generated, even though it is
A) redundant, B) sends a potentially multi-megabyte SQL statement over the
wire for the database to process, C) may confuse the query planner to
generate an inefficient execution plan, and D) doesn't scale:
{{{
Category.objects.filter(type=5).prefetch_related(Prefetch('items',
queryset=Item.objects.filter(category__item=5)))
}}}
Pull request https://github.com/django/django/pull/5356 adds the
possibility to skip the IN clause in cases where we are sure that a better
queryset will get (at least) the same items as the IN clause would:
{{{
Category.objects.filter(type=5).prefetch_related(Prefetch('items',
queryset=Item.objects.filter(category__item=5),
filter_on_instances=False))
}}}
In my tests, this speeds up prefetch_related() by 20x-50x on large
querysets.

--

--
Ticket URL: <https://code.djangoproject.com/ticket/25464#comment:2>

Django

unread,

Sep 25, 2015, 9:54:25 AM9/25/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries

-------------------------------------+-------------------------------------

Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by ecederstrand):

Thanks. Which profiling would you like? I'm working on PostgreSQL and
could provide EXPLAIN on both types of queries.

I'll try to add documentation some tests to
/django/tests/prefetch_related/

When prefetching is done, the prefetched objects are matched with the
instances by pk. My basic gut feeling is that extraneous rows should
simply be ignored, for cases where it's simply more efficient to fetch a
little more than we need.

--
Ticket URL: <https://code.djangoproject.com/ticket/25464#comment:3>

Django

unread,

Sep 25, 2015, 10:10:07 AM9/25/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries

-------------------------------------+-------------------------------------

Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by charettes):

* cc: charettes (added)

--
Ticket URL: <https://code.djangoproject.com/ticket/25464#comment:4>

Django

unread,

Sep 27, 2015, 11:40:10 AM9/27/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries

-------------------------------------+-------------------------------------

Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by shaib):

* cc: shaib (added)

Old description:

> When using prefetch_related() on a large queryset, the prefetch query SQL
> can be inefficient. Consider this:
> {{{
> Category.objects.filter(type=5).prefetch_related('items')
> }}}
> If 100.000 categories have type=5, then an IN clause with 100.000
> Category IDs is generated to get the Item objects. Even with a custom
> queryset using a Prefetch() object, the IN clause is generated, even
> though it is A) redundant, B) sends a potentially multi-megabyte SQL
> statement over the wire for the database to process, C) may confuse the
> query planner to generate an inefficient execution plan, and D) doesn't
> scale:
> {{{
> Category.objects.filter(type=5).prefetch_related(Prefetch('items',
> queryset=Item.objects.filter(category__item=5)))
> }}}
> Pull request https://github.com/django/django/pull/5356 adds the
> possibility to skip the IN clause in cases where we are sure that a
> better queryset will get (at least) the same items as the IN clause
> would:
> {{{
> Category.objects.filter(type=5).prefetch_related(Prefetch('items',
> queryset=Item.objects.filter(category__item=5),
> filter_on_instances=False))
> }}}
> In my tests, this speeds up prefetch_related() by 20x-50x on large
> querysets.

New description:

When using prefetch_related() on a large queryset, the prefetch query SQL
can be inefficient. Consider this:
{{{
Category.objects.filter(type=5).prefetch_related('items')
}}}
If 100.000 categories have type=5, then an IN clause with 100.000 Category
IDs is generated to get the Item objects. Even with a custom queryset
using a Prefetch() object, the IN clause is generated, even though it is
A) redundant, B) sends a potentially multi-megabyte SQL statement over the
wire for the database to process, C) may confuse the query planner to
generate an inefficient execution plan, and D) doesn't scale:
{{{
Category.objects.filter(type=5).prefetch_related(Prefetch('items',

queryset=Item.objects.filter(category__type=5)))

}}}
Pull request https://github.com/django/django/pull/5356 adds the
possibility to skip the IN clause in cases where we are sure that a better
queryset will get (at least) the same items as the IN clause would:
{{{
Category.objects.filter(type=5).prefetch_related(Prefetch('items',

queryset=Item.objects.filter(category__type=5),

filter_on_instances=False))
}}}
In my tests, this speeds up prefetch_related() by 20x-50x on large
querysets.

--

Comment:

Two notes come to mind:

1) While this may be less "natural" to think of, it seems the query would
be more natural and efficient as
{{{
Item.objects.filter(category__type=5).select_related('category')
}}}
Of course, this would require restructuring the code that handles the
items and categories.

2) Not sure if this is as easy, but I think a better and more general
alternative would be to expose the "joining-in-python" mechanism for
general use. I'm thinking along the lines of
{{{
cats = Category.objects.filter(type=5)
items = Item.Item.objects.filter(category__type=5)
cats.use_prefetched('items', items)
}}}
where `'items'` is the name of the reverse relation, of course, and
`items` could be replaced with any iterable returning `Item` instances.

--
Ticket URL: <https://code.djangoproject.com/ticket/25464#comment:5>

Django

unread,

Nov 7, 2015, 5:35:16 AM11/7/15

to django-...@googlegroups.com

#25464: Allow skipping IN clause on prefetch queries

-------------------------------------+-------------------------------------

Reporter: ecederstrand | Owner: nobody
Type: New feature | Status: new

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by adamchainz):

1) Presumably the reason {{{prefetch_related}}} is being used is to avoid
fetching the category data repeated many times, is {{{select_related}}}
does by including it in the join. You can keep the prefetching with:

{{{
Item.objects.filter(category__type=5).prefetch_related('category')
}}}

2) See #25279 for my simple suggestion of making
{{{prefetch_related_objects}}} a public API, which simply exposes the pre-
existing prefetch code for "joining in python" mechanism. I don't know if
it would actually solve this problem though.

--
Ticket URL: <https://code.djangoproject.com/ticket/25464#comment:6>

Reply all

Reply to author

Forward