Prefetch() with through models

Erik Cederstrand

Sep 16, 2015, 8:54:13 AM
to Django Users
Hi folks,

I'm working on a school timetable app. I want to fetch hundreds of thousands of Lesson instances with prefetched m2m relations (e.g. subjects). My m2m relations use through models (I'm not sure this actually makes a difference here), and I'm running into performance issues because the prefetch query does something along the lines of "SELECT ... FROM lesson_subjects WHERE lesson_id IN [insane_list_of_lesson_ids]".
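
[Editor's note: for reference, a minimal model layout matching this description might look like the following. The model and field names are guesses for illustration, not the actual app's code.]

```python
from django.db import models

class Subject(models.Model):
    name = models.CharField(max_length=100)

class Lesson(models.Model):
    school_id = models.IntegerField()
    start = models.DateTimeField()
    subjects = models.ManyToManyField(Subject, through='LessonSubject',
                                      related_name='lessons')

class LessonSubject(models.Model):
    # The explicit through model behind the lesson_subjects table
    lesson = models.ForeignKey(Lesson)
    subject = models.ForeignKey(Subject)
```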

The initial query on Lesson uses a properly indexed filter on e.g. dates, so I thought I'd try to use the same filter to get the related Subjects via the relation to Lessons:

qs = Subject.objects.filter(lessons__school_id=8, lessons__start__lt=datetime(2015, 8, 1))
Lesson.objects.filter(school_id=8, start__lt=datetime(2015, 8, 1))\
.prefetch_related(Prefetch('subjects', queryset=qs))

I can see that the extra filters are added to the prefetch query, but the huge IN clause is still there.

Am I using Prefetch() wrong? Are there any other techniques to avoid the huge IN clause in prefetch queries?

Thanks,
Erik

Simon Charette

Sep 16, 2015, 5:08:31 PM
to Django users
Hi Erik,

I think the prefetch API uses IN [id1, ..., idn] instead of IN (SELECT ...)
because the default isolation level Django runs with is READ COMMITTED,
and using the latter could return different results for the second query.

e.g.

SELECT id FROM foo WHERE bar = true;
-- Another transaction changes a matching row from bar = true to false.
SELECT id, foo_id FROM baz WHERE foo_id IN (
    SELECT id FROM foo WHERE bar = true
);

If you want to use this approach I'm afraid you'll have to do
what prefetch_related does under the hood by yourself.

e.g.

from collections import defaultdict

prefetched = defaultdict(list)
subjects = Subject.objects.filter(...)
lessons = Lesson.objects.filter(...)
for lesson_subject in Lesson.subjects.through.objects.filter(lesson__in=lessons, subject__in=subjects).select_related('subject').iterator():
    prefetched[lesson_subject.lesson_id].append(lesson_subject.subject)
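
[Editor's note: stripped of the ORM, the grouping step above is a plain bucket-by-key pass over the through-table rows; a minimal stand-alone illustration with tuples in place of through instances:]

```python
from collections import defaultdict

# (lesson_id, subject) pairs standing in for the through-table rows
rows = [(1, 'math'), (1, 'physics'), (2, 'math')]

prefetched = defaultdict(list)
for lesson_id, subject in rows:
    prefetched[lesson_id].append(subject)

print(prefetched[1])  # → ['math', 'physics']
```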

Simon

Mike Dewhirst

Sep 16, 2015, 11:23:45 PM
to django...@googlegroups.com
On 16/09/2015 9:53 AM, Erik Cederstrand wrote:
> Hi folks,
>
> I'm working on a school timetable app. I want to fetch hundreds of
> thousands of Lesson instances with prefetched m2m relations (e.g.
> subjects). My m2m relations use through models (I'm not sure this
> actually makes a difference here), and I'm running into performance
> issues because the prefetch query does something along the lines of
> "SELECT ... FROM lesson_subjects WHERE lesson_id IN
> [insane_list_of_lesson_ids]".

I'm no expert so I'm wondering if len([insane_list_of_lesson_ids]) ==
"hundreds of thousands"?

And if there are that many involved, why wouldn't the infrastructure
groan and creak a little?

I'm not much help but I do remember many years ago a school timetable
programming guru who told me they are definitely not trivial. I think he
was referring to timetable creation given enrolments, student
preferences, curriculum/course requirements, availability of teachers
and location of campuses. One of the main issues (ISTR) was memory - or
lack of it.

Someone other than me is going to have to help on this but I am
interested in the solution.

Good luck

Mike

Erik Cederstrand

Sep 17, 2015, 7:08:18 AM
to Django Users

> Den 16/09/2015 kl. 16.45 skrev Mike Dewhirst <mi...@dewhirst.com.au>:
>
> On 16/09/2015 9:53 AM, Erik Cederstrand wrote:
>> Hi folks,
>>
>> I'm working on a school timetable app. I want to fetch hundreds of
>> thousands of Lesson instances with prefetched m2m relations (e.g.
>> subjects). My m2m relations use through models (I'm not sure this
>> actually makes a difference here), and I'm running into performance
>> issues because the prefetch query does something along the lines of
>> "SELECT ... FROM lesson_subjects WHERE lesson_id IN
>> [insane_list_of_lesson_ids]".
>
> I'm no expert so I'm wondering if len([insane_list_of_lesson_ids]) == "hundreds of thousands"?

In my case, yes. I think the backend might process the query in chunks if the length of the SQL exceeds the max SQL query size.

I've had a look at the prefetch code in Django 1.8. The prefetcher is designed to kick in *after* the QuerySet has constructed the list of objects. It only has access to the resulting list of items, not the original SQL filters, so the only sane way to fetch related objects is to use an IN clause. I can't see any way to replace this with the filters without rewriting the whole prefetch code.

I'll try the route Simon suggested and run the prefetch queries myself. I can probably populate the _prefetched_objects_cache on each object so the optimization is reasonably transparent to consumers. Prefetching 'foo__bar' this way gets more complicated, so I think I'll go with a simple solution first.
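
[Editor's note: an untested sketch of that cache-population idea, assuming the m2m field is named `subjects` and that the cache key matches the field name; the exact key and the list-vs-queryset details should be checked against the Django 1.8 source.]

```python
from collections import defaultdict
from datetime import datetime

def prefetch_subjects(lessons, **lesson_filters):
    # Filter the through table via the *original* indexed Lesson filters
    # instead of an enormous "lesson_id IN (...)" list.
    through = Lesson.subjects.through
    rows = (through.objects
            .filter(**{'lesson__%s' % k: v for k, v in lesson_filters.items()})
            .select_related('subject')
            .iterator())
    prefetched = defaultdict(list)
    for row in rows:
        prefetched[row.lesson_id].append(row.subject)
    for lesson in lessons:
        # _prefetched_objects_cache is what prefetch_related() populates;
        # with the right key, lesson.subjects.all() reads from it.
        lesson._prefetched_objects_cache = {'subjects': prefetched[lesson.pk]}

lessons = list(Lesson.objects.filter(school_id=8, start__lt=datetime(2015, 8, 1)))
prefetch_subjects(lessons, school_id=8, start__lt=datetime(2015, 8, 1))
```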

> I'm not much help but I do remember many years ago a school timetable programming guru who told me they are definitely not trivial. I think he was referring to timetable creation given enrolments, student preferences, curriculum/course requirements, availability of teachers and location of campuses. One of the main issues (ISTR) was memory - or lack of it.

Constraint solving to calculate the optimal timetable given a set of (soft or hard) requirements and costs is mathematically well understood but insanely difficult to get right in practice. Lack of memory and the limited life span of human beings are just some of the issues. Luckily, I'm just working on the output of that calculation :-)

Erik

Javier Guerra Giraldez

Sep 17, 2015, 7:23:03 AM
to django...@googlegroups.com
On Thu, Sep 17, 2015 at 2:07 AM, Erik Cederstrand
<erik+...@cederstrand.dk> wrote:
>> Den 16/09/2015 kl. 16.45 skrev Mike Dewhirst <mi...@dewhirst.com.au>:
>>
>> On 16/09/2015 9:53 AM, Erik Cederstrand wrote:
>>> issues because the prefetch query does something along the lines of
>>> "SELECT ... FROM lesson_subjects WHERE lesson_id IN
>>> [insane_list_of_lesson_ids]".
>>
>> I'm no expert so I'm wondering if len([insane_list_of_lesson_ids]) == "hundreds of thousands"?
>
> In my, case, yes. I think the backend might process the query in chunks if the length of the SQL exceeds the max SQL query size.


Just curious, which RDBMS are you using? I remember that on MySQL
the advice used to be to populate a temporary table and do a JOIN
instead of a very big `xxx IN (yyyy)` statement. I'm not sure if
there's a hard limit, but anything over a few hundred items would be
better with JOIN than with IN (...).

With Oracle, there _is_ a hard limit, and a not very high one. I've
hit it frequently when doing exploratory SQL on SQLDeveloper. For
anything over a thousand items, it's safer to go the temporary table
route.
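
[Editor's note: an illustration of the temporary-table route using the stdlib sqlite3 module so it is self-contained; the pattern is the same on MySQL or Oracle, just with their own TEMPORARY TABLE syntax.]

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE lesson_subjects (lesson_id INTEGER, subject TEXT)')
conn.executemany('INSERT INTO lesson_subjects VALUES (?, ?)',
                 [(1, 'math'), (2, 'physics'), (3, 'art')])

# Instead of "WHERE lesson_id IN (1, 3, ... thousands more ...)",
# load the wanted ids into a temp table and JOIN against it.
conn.execute('CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)')
conn.executemany('INSERT INTO wanted_ids VALUES (?)', [(1,), (3,)])

rows = conn.execute(
    'SELECT ls.lesson_id, ls.subject '
    'FROM lesson_subjects ls JOIN wanted_ids w ON ls.lesson_id = w.id '
    'ORDER BY ls.lesson_id').fetchall()
print(rows)  # → [(1, 'math'), (3, 'art')]
```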

The only one where I haven't been able to hit any limit is PostgreSQL.
There I routinely use many thousands of items and it works beautifully.
I haven't personally tried "hundreds of thousands", but I guess that
if it fits within some maximum textual representation, then it will
handle it.


--
Javier

Erik Cederstrand

Sep 17, 2015, 7:41:55 AM
to Django Users

> Den 17/09/2015 kl. 09.22 skrev Javier Guerra Giraldez <jav...@guerrag.com>:
>
> On Thu, Sep 17, 2015 at 2:07 AM, Erik Cederstrand
> <erik+...@cederstrand.dk> wrote:
>>> Den 16/09/2015 kl. 16.45 skrev Mike Dewhirst <mi...@dewhirst.com.au>:
>>>
>>> On 16/09/2015 9:53 AM, Erik Cederstrand wrote:
>>>> issues because the prefetch query does something along the lines of
>>>> "SELECT ... FROM lesson_subjects WHERE lesson_id IN
>>>> [insane_list_of_lesson_ids]".
>>>
>>> I'm no expert so I'm wondering if len([insane_list_of_lesson_ids]) == "hundreds of thousands"?
>>
>> In my, case, yes. I think the backend might process the query in chunks if the length of the SQL exceeds the max SQL query size.
>
>
> Just curious, which RDBMS are you using? I remember that on MySQL
> there used to be an advice to populate a temporary table and do a JOIN
> instead of very big `xxx IN (yyyy)` statements. I'm not sure if
> there's a hard limit, but anything over a few hundreds would be better
> with JOIN than with IN(...)

I'm using PostgreSQL and yes, I'm sure there are techniques to work around huge IN clauses. It just seems plain wrong to send 1-2MB of textual SQL to an SQL server when my original filters should work just fine, and hopefully much faster.

Erik

Erik Cederstrand

Sep 17, 2015, 9:00:26 PM
to Django Users

> Den 16/09/2015 kl. 19.08 skrev Simon Charette <chare...@gmail.com>:
>
> If you want to use this approach I'm afraid you'll have to do
> what prefetch_related does under the hood by yourself.
>
> e.g.
>
> from collections import defaultdict
>
> prefetched = defaultdict(list)
> subjects = Subject.objects.filter(...)
> lessons = Lesson.objects.filter(...)
> for lesson_subject in Lesson.subjects.through.objects.filter(lesson__in=lessons, subject__in=subjects).select_related('subject').iterator():
>     prefetched[lesson_subject.lesson_id].append(lesson_subject.subject)

I hacked up a quick solution that works for my limited requirements. No chained querysets, limited prefetch recursion, no order_by, extras etc. But maybe someone finds it useful: https://gist.github.com/ecederstrand/6748a3496acdc95e40ef

At least it speeds up my queries dramatically; a 10-50x improvement in my tests so far.

Erik