[Django] #18702: Remove chunked reads from iter(qs)

Django

unread,

Aug 2, 2012, 8:35:12 AM8/2/12

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4
Component: Database | Keywords:
layer (models, ORM) | Has patch: 1
Severity: Normal | Needs tests: 0
Triage Stage: Design | Easy pickings: 0
decision needed |
Needs documentation: 0 |
Patch needs improvement: 0 |
UI/UX: 0 |
-------------------------------------+-------------------------------------
The queryset iterator protocol does convert rows lazily to objects when
iterated. This has two advantages:
1. If one iterates just part of the queryset, there is no need to do
model conversion for all objects.
2. Again, if iterating just part of the qs, some backends allow you to
fetch just part of the rows from the DB (oracle, for example).

However, there are some costs, too:
1. Complexity in the `__iter__` -> _result_iter -> (_results_cache,
_iter) -> _iterator implementation.
2. The lazy fetching costs around 5-10% performance in the case of "for
obj in qs.all()" (1000 objs, 2 fields). For values_list('id') the cost is
around 30%.
3. The current implementation silently discards some exceptions when
doing list(qs). This can be annoying especially when debugging django-core
code.

My take is we are optimizing the wrong case currently. That is, the case
where one wants to consume a queryset only partially, but can't use the
.iterator() method. The case would be something like:
{{{
for obj in qs:
if somecond:
break
# Now, another loop for the same queryset!
for obj in qs:
if someothercond:
break
}}}
If there is no another loop, it is possible to use .iterator(). If one of
the above loops consumes major portion of the qs, then there is no benefit
in doing partial object conversion.

The question is if there are common patterns where the current
implementation is worth the code complexity & performance loss for the
common cases.

I will leave this as DDN, as this change is obviously something that needs
to be considered carefully.

There is a branch implementing the removal of chunked reads at:
https://github.com/akaariai/django/compare/non_chunked_reads

--
Ticket URL: <https://code.djangoproject.com/ticket/18702>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Aug 2, 2012, 1:01:01 PM8/2/12

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

* cc: charette.s@… (added)

Comment:

I have to say it's quite easier to read the implementation with this
cleanup.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:1>

Django

unread,

Feb 25, 2013, 4:55:37 PM2/25/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by akaariai):

A rebased + squashed patch available from:
https://github.com/akaariai/django/commit/ddc85150d0bb82231e3820807f7652d2a9578ab2

There were some changes to the queryset iteration lately (the introduction
of _safe_iterator()). So, currently there are six nested iterators when
iterating over a qs. This is getting a bit complex.

Performance difference between the patched version and master for fetching
ten objects from the DB using "for obj in qs" is 5%-7%. For
values_list('pk') it is around 15%. So, there is a performance difference
but it isn't too big.

The silent discard of some exceptions should be fixed with the
_safe_iterator() patch.

Still, the code clarity difference is quite big. When considering this
patch the question is mainly if there are some use cases where the
"convert only some results to Python" is a big performance win. I don't
believe such cases are common enough to care about, but then again I know
mostly my own usage patterns... To me it seems that in most cases where
the change would matter the user is doing something that will cause bad
performance anyways.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:2>

Django

unread,

Feb 25, 2013, 5:10:19 PM2/25/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

* stage: Design decision needed => Accepted

Comment:

Wow, the code clarity difference is indeed big; and that performance
difference isn't insignificant either. I'm having trouble coming up with
realistic cases where saving memory on partial iteration matters and you
can't just use `.iterator()`, which is our documented solution for when
you can't afford to materialize the whole queryset. So +1 from me on this
patch.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:3>

Django

unread,

Mar 16, 2013, 5:26:57 AM3/16/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by akaariai):

I am going to commit the patch this weekend if nobody objects.

A very short summary: the patch will make "half-iterating" a queryset cost
more (as all the rows are converted upfront), but on the other hand full
iteration will be faster. Cases hit are "obj in largeqs" without needing
the rest of the qs, or "for obj in qs: if somecond: break".

One of the original reasons for the patch, discard of some exceptions, is
fixed by #19895 - but there is a comment in the ticket that says the fix
is causing a serious memory leak. This is hard to verify as there aren't
any details. Still, the comment is from core developer so it is very
likely that the commit does cause memory leak under some conditions.

Rebased patch available from
https://github.com/akaariai/django/tree/non_chunked_iter_squashed.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:4>

Django

unread,

Mar 16, 2013, 11:43:01 AM3/16/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by lukeplant):

I think there are some cases you might have missed:

For example, if someone gets a large result set in a qs, and does `if qs`.
This will currently create just one instance. There could be cases where
this is a reasonable thing to do e.g. if in some cases, but not all, you
will go on to iterate over the whole queryset, this pattern will mean you
only do one DB query.

I am also hoping that at some point we will get better support for doing
chunked reads at the DB level in Postgres. However, I guess that when you
need that, you also need `QuerySet.iterator()`, so this change wouldn't
affect that case much.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:5>

Django

unread,

Mar 16, 2013, 5:03:21 PM3/16/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by akaariai):

Yes, `__bool__` seems to be another case hit. Minor correction is that you
do still convert 100 objects currently, not just one.

It is true that this patch is dealing with a tradeoff. On one hand we have
full resultset iteration performance and code clarity, on the other doing
bool(qs), contains or iteration with break. If the resultset is small it
doesn't matter what will be done (less than 100 objects and the results
are converted in full anyways). If the resultset is large, the user who
cares about performance should be doing something else anyways (unless
using Oracle).

It turns out that it is somewhat easy to support bool() and contains in
"one object at time" manner, but still do full conversion when accessing
full resultset. This way the only regression case is iteration with break.
But then, just use .iterator(). Also, those wanting to use server side
cursors will likely want to use .iterator().

The code is somewhat more complex than in the "always fetch all objects"
but not much.

1000 objects with pk + one field from db with .all() is 10% faster using
the patched code. .iterator() speed isn't changed.

Patch available from
https://github.com/akaariai/django/tree/non_chunked_iter_squashed.

Sidenote about chunked reads: Without better support from Postgres it will
be impossible to automatically use chunked reads. If you don't use WITH
HOLD, the cursor will be unusable outside transactions. Add in autocommit
mode, and it means this is a total no-go (not that it was a good idea even
before). If you use WITH HOLD, then the cursor must be closed explicitly.
So, if you have "if obj in qs" but don't iterate the whole qs it means you
just leaked a cursor (which is even worse than having idle-in-tx
connections). So, a dedicated API is needed in any case.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:6>

Django

unread,

Mar 17, 2013, 4:54:59 AM3/17/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by aaugustin):

My naive opinion is that anything that doesn't use `.iterator()` should
load the entire queryset on the first access. That's the least surprising
behavior.

As explained by Anssi, loading results incrementally from the database
cannot be done transparently.

That's an external, uninformed comment; feel free to ignore me ;)

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:7>

Django

unread,

Mar 17, 2013, 9:56:41 AM3/17/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by akaariai):

I am not sure if I should commit the simple no-optimizations for
`__bool__` and `__contains__` version, or the optimized and bit more
complex version. The added complexity isn't much, but on the other hand I
believe users benefiting from the optimizations are rare.

I am leaning towards committing the more complex optimized version, simply
because there could be users who would get a performance regression
otherwise. I am wondering if this is too cautious. This approach optimizes
use cases that are rare, and in the use cases optimized, the user would
often benefit a lot more by using different code patterns.

Opinions welcome...

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:8>

Django

unread,

Mar 27, 2013, 11:50:36 AM3/27/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by lukeplant):

Replying to [comment:8 akaariai]:

> Opinions welcome...

You have persuaded me at least.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:9>

Django

unread,

May 19, 2013, 2:59:31 PM5/19/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by akaariai):

I think we want to go with the "always convert all objects" approach.
Special casing some uses makes the code complex. And, in most cases we
will win only in the amount of converted objects, not in the amount of
fetched rows.

So, I updated the original
[https://github.com/akaariai/django/compare/non_chunked_reads
non_chunked_reads] branch to apply to master. If somebody could review at
least the release notes changes that would be good...

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:10>

Django

unread,

May 19, 2013, 3:30:23 PM5/19/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Version: 1.4

Comment (by aaugustin):

Yes, that passes my cursory code inspection.

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:11>

Django

unread,

May 21, 2013, 12:37:28 PM5/21/13

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Type: | Status: closed
Cleanup/optimization | Version: 1.4
Component: Database layer | Resolution: fixed

Changes (by Anssi Kääriäinen <akaariai@…>):

* status: new => closed
* resolution: => fixed

Comment:

In [changeset:"70679243d1786e03557c28929f9762a119e3ac14"]:
{{{
#!CommitTicketReference repository=""
revision="70679243d1786e03557c28929f9762a119e3ac14"
Fixed #18702 -- Removed chunked reads from QuerySet iteration
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:12>

Django

unread,

Nov 20, 2014, 9:34:00 AM11/20/14

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Type: | Status: closed
Cleanup/optimization | Version: 1.4

Comment (by Tim Graham <timograham@…>):

In [changeset:"327df551e89a505c5756becee97c40198f38aff2"]:
{{{
#!CommitTicketReference repository=""
revision="327df551e89a505c5756becee97c40198f38aff2"
Fixed #23817 -- Updated docs on QuerySet evaluation

Removed inaccurate info about partial evaluation after refs #18702.
Added information on modifying sliced QuerySets; refs #22503.
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:13>

Django

unread,

Nov 20, 2014, 9:35:55 AM11/20/14

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Type: | Status: closed
Cleanup/optimization | Version: 1.4

Comment (by Tim Graham <timograham@…>):

In [changeset:"593353d8af5b4b8d1a1e627712fe68ed593961d0"]:
{{{
#!CommitTicketReference repository=""
revision="593353d8af5b4b8d1a1e627712fe68ed593961d0"
[1.7.x] Fixed #23817 -- Updated docs on QuerySet evaluation

Removed inaccurate info about partial evaluation after refs #18702.
Added information on modifying sliced QuerySets; refs #22503.

Backport of 327df551e89a505c5756becee97c40198f38aff2 from master
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:15>

Django

unread,

Nov 20, 2014, 9:35:56 AM11/20/14

to django-...@googlegroups.com

#18702: Remove chunked reads from iter(qs)
-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Type: | Status: closed
Cleanup/optimization | Version: 1.4

Comment (by Tim Graham <timograham@…>):

In [changeset:"614dd44d0d2e5f8f12256835c3453f420f54c3b4"]:
{{{
#!CommitTicketReference repository=""
revision="614dd44d0d2e5f8f12256835c3453f420f54c3b4"
[1.6.x] Fixed #23817 -- Updated docs on QuerySet evaluation

Removed inaccurate info about partial evaluation after refs #18702.
Added information on modifying sliced QuerySets; refs #22503.