--
Ticket URL: <https://code.djangoproject.com/ticket/18702>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* cc: charette.s@… (added)
Comment:
I have to say it's quite easier to read the implementation with this
cleanup.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:1>
Comment (by akaariai):
A rebased + squashed patch available from:
https://github.com/akaariai/django/commit/ddc85150d0bb82231e3820807f7652d2a9578ab2
There were some changes to the queryset iteration lately (the introduction
of _safe_iterator()). So, currently there are six nested iterators when
iterating over a qs. This is getting a bit complex.
Performance difference between the patched version and master for fetching
ten objects from the DB using "for obj in qs" is 5%-7%. For
values_list('pk') it is around 15%. So, there is a performance difference
but it isn't too big.
The silent discard of some exceptions should be fixed with the
_safe_iterator() patch.
Still, the code clarity difference is quite big. When considering this
patch the question is mainly if there are some use cases where the
"convert only some results to Python" is a big performance win. I don't
believe such cases are common enough to care about, but then again I know
mostly my own usage patterns... To me it seems that in most cases where
the change would matter the user is doing something that will cause bad
performance anyways.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:2>
* stage: Design decision needed => Accepted
Comment:
Wow, the code clarity difference is indeed big; and that performance
difference isn't insignificant either. I'm having trouble coming up with
realistic cases where saving memory on partial iteration matters and you
can't just use `.iterator()`, which is our documented solution for when
you can't afford to materialize the whole queryset. So +1 from me on this
patch.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:3>
Comment (by akaariai):
I am going to commit the patch this weekend if nobody objects.
A very short summary: the patch will make "half-iterating" a queryset cost
more (as all the rows are converted upfront), but on the other hand full
iteration will be faster. Cases hit are "obj in largeqs" without needing
the rest of the qs, or "for obj in qs: if somecond: break".
One of the original reasons for the patch, discard of some exceptions, is
fixed by #19895 - but there is a comment in the ticket that says the fix
is causing a serious memory leak. This is hard to verify as there aren't
any details. Still, the comment is from core developer so it is very
likely that the commit does cause memory leak under some conditions.
Rebased patch available from
https://github.com/akaariai/django/tree/non_chunked_iter_squashed.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:4>
Comment (by lukeplant):
I think there are some cases you might have missed:
For example, if someone gets a large result set in a qs, and does `if qs`.
This will currently create just one instance. There could be cases where
this is a reasonable thing to do e.g. if in some cases, but not all, you
will go on to iterate over the whole queryset, this pattern will mean you
only do one DB query.
I am also hoping that at some point we will get better support for doing
chunked reads at the DB level in Postgres. However, I guess that when you
need that, you also need `QuerySet.iterator()`, so this change wouldn't
affect that case much.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:5>
Comment (by akaariai):
Yes, `__bool__` seems to be another case hit. Minor correction is that you
do still convert 100 objects currently, not just one.
It is true that this patch is dealing with a tradeoff. On one hand we have
full resultset iteration performance and code clarity, on the other doing
bool(qs), contains or iteration with break. If the resultset is small it
doesn't matter what will be done (less than 100 objects and the results
are converted in full anyways). If the resultset is large, the user who
cares about performance should be doing something else anyways (unless
using Oracle).
It turns out that it is somewhat easy to support bool() and contains in
"one object at time" manner, but still do full conversion when accessing
full resultset. This way the only regression case is iteration with break.
But then, just use .iterator(). Also, those wanting to use server side
cursors will likely want to use .iterator().
The code is somewhat more complex than in the "always fetch all objects"
but not much.
1000 objects with pk + one field from db with .all() is 10% faster using
the patched code. .iterator() speed isn't changed.
Patch available from
https://github.com/akaariai/django/tree/non_chunked_iter_squashed.
Sidenote about chunked reads: Without better support from Postgres it will
be impossible to automatically use chunked reads. If you don't use WITH
HOLD, the cursor will be unusable outside transactions. Add in autocommit
mode, and it means this is a total no-go (not that it was a good idea even
before). If you use WITH HOLD, then the cursor must be closed explicitly.
So, if you have "if obj in qs" but don't iterate the whole qs it means you
just leaked a cursor (which is even worse than having idle-in-tx
connections). So, a dedicated API is needed in any case.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:6>
Comment (by aaugustin):
My naive opinion is that anything that doesn't use `.iterator()` should
load the entire queryset on the first access. That's the least surprising
behavior.
As explained by Anssi, loading results incrementally from the database
cannot be done transparently.
That's an external, uninformed comment; feel free to ignore me ;)
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:7>
Comment (by akaariai):
I am not sure if I should commit the simple no-optimizations for
`__bool__` and `__contains__` version, or the optimized and bit more
complex version. The added complexity isn't much, but on the other hand I
believe users benefiting from the optimizations are rare.
I am leaning towards committing the more complex optimized version, simply
because there could be users who would get a performance regression
otherwise. I am wondering if this is too cautious. This approach optimizes
use cases that are rare, and in the use cases optimized, the user would
often benefit a lot more by using different code patterns.
Opinions welcome...
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:8>
Comment (by lukeplant):
Replying to [comment:8 akaariai]:
> Opinions welcome...
You have persuaded me at least.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:9>
Comment (by akaariai):
I think we want to go with the "always convert all objects" approach.
Special casing some uses makes the code complex. And, in most cases we
will win only in the amount of converted objects, not in the amount of
fetched rows.
So, I updated the original
[https://github.com/akaariai/django/compare/non_chunked_reads
non_chunked_reads] branch to apply to master. If somebody could review at
least the release notes changes that would be good...
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:10>
Comment (by aaugustin):
Yes, that passes my cursory code inspection.
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:11>
* status: new => closed
* resolution: => fixed
Comment:
In [changeset:"70679243d1786e03557c28929f9762a119e3ac14"]:
{{{
#!CommitTicketReference repository=""
revision="70679243d1786e03557c28929f9762a119e3ac14"
Fixed #18702 -- Removed chunked reads from QuerySet iteration
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:12>
Comment (by Tim Graham <timograham@…>):
In [changeset:"327df551e89a505c5756becee97c40198f38aff2"]:
{{{
#!CommitTicketReference repository=""
revision="327df551e89a505c5756becee97c40198f38aff2"
Fixed #23817 -- Updated docs on QuerySet evaluation
Removed inaccurate info about partial evaluation after refs #18702.
Added information on modifying sliced QuerySets; refs #22503.
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:13>
Comment (by Tim Graham <timograham@…>):
In [changeset:"593353d8af5b4b8d1a1e627712fe68ed593961d0"]:
{{{
#!CommitTicketReference repository=""
revision="593353d8af5b4b8d1a1e627712fe68ed593961d0"
[1.7.x] Fixed #23817 -- Updated docs on QuerySet evaluation
Removed inaccurate info about partial evaluation after refs #18702.
Added information on modifying sliced QuerySets; refs #22503.
Backport of 327df551e89a505c5756becee97c40198f38aff2 from master
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:15>
Comment (by Tim Graham <timograham@…>):
In [changeset:"614dd44d0d2e5f8f12256835c3453f420f54c3b4"]:
{{{
#!CommitTicketReference repository=""
revision="614dd44d0d2e5f8f12256835c3453f420f54c3b4"
[1.6.x] Fixed #23817 -- Updated docs on QuerySet evaluation
Removed inaccurate info about partial evaluation after refs #18702.
Added information on modifying sliced QuerySets; refs #22503.
Backport of 327df551e89a505c5756becee97c40198f38aff2 from master
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/18702#comment:14>