I did a reply on the post. Might be some time before it's approved.
The gist is that yes it's that bad if you're using it naively. As long
as you know what's gonna be loaded from the DB you can avoid those
cases pretty easily.
----- Original Message ----- From: "NitinHayaran" <nitinhaya...@gmail.com>
To: "Django developers" <django-developers@googlegroups.com>
Sent: Tuesday, February 17, 2009 6:40 PM
Subject: Is this true. that django really takes a lot of memory?
> Hi All,
> Today i read this article and was wondering whether django orm is
> really that bad.
The documentation includes the statement, "A QuerySet typically reads all of its results and instantiates all of the corresponding objects the first time you access it; iterator() will instead read results and instantiate objects in discrete chunks, yielding them one at a time."
Am I mistaken, or is this not exactly correct? As I understand it, the difference between QuerySet.__iter__ and QuerySet.iterator isn't that the former reads and instantiates everything all at once, but that the former will make use of the QuerySet's result cache, reading from it when available and filling it as a side effect of iteration.
> The documentation includes the statement, "A QuerySet typically reads
> all of its results and instantiates all of the corresponding objects
> the first time you access it; iterator() will instead read results and
> instantiate objects in discrete chunks, yielding them one at a time."
> Am I mistaken, or is this not exactly correct? As I understand it,
> the difference between QuerySet.__iter__ and QuerySet.iterator isn't
> that the former reads and instantiates everything all at once, but
> that the former will make use of the QuerySet's result cache, reading
> from it when available and filling it as a side effect of iteration.
> Ian
Neither is completely correct ;). Both do chunked reads from the
DB(__iter__ using iterator for getting the data), however __iter__ also
caches them, so if you reiterate you don't do a second db query, whereas
iterator doesn't cache them.
Alex
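The distinction described above can be sketched without Django. This is only an illustration under stated assumptions: FakeTable and ToyQuerySet are hypothetical stand-ins (not Django's classes), and the query counter exists purely to make the difference visible. __iter__ pulls rows from iterator() lazily and remembers them; iterator() goes back to the source on every call.

```python
class FakeTable:
    """Hypothetical data source that counts how often it is queried."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.queries = 0

    def fetch(self):
        # Each call stands in for one database query.
        self.queries += 1
        return iter(self.rows)


class ToyQuerySet:
    """Toy model of the caching behaviour; not Django's QuerySet."""

    def __init__(self, table):
        self._table = table
        self._cache = []      # rows seen so far
        self._source = None   # the single underlying iterator, once started

    def iterator(self):
        # Uncached path: every call hits the data source again.
        return self._table.fetch()

    def __iter__(self):
        # Cached path: read from the cache when possible, otherwise pull
        # the next row from the source and store it as a side effect.
        if self._source is None:
            self._source = self.iterator()
        pos = 0
        while True:
            if pos == len(self._cache):
                try:
                    self._cache.append(next(self._source))
                except StopIteration:
                    return
            yield self._cache[pos]
            pos += 1
```

Iterating a ToyQuerySet twice costs one query in this sketch, while calling iterator() twice costs two.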
-- "I disapprove of what you say, but I will defend to the death your right to
say it." --Voltaire
"The people's good is the highest law."--Cicero
On Tue, Feb 17, 2009 at 2:11 PM, Alex Gaynor <alex.gay...@gmail.com> wrote:
...
> Neither is completely correct ;). Both do chunked reads from the
> DB(__iter__ using iterator for getting the data), however __iter__ also
> caches them, so if you reiterate you don't do a second db query, whereas
> iterator doesn't cache them.
If I'm reading it right, it looks like ForNode doesn't use .iterator. I can see why it might be useful to assume QS cache should be used-- maybe the same QS will be repeatedly iterated.
Even so, it seems like it'd be useful to have a built-in filter which uses iter(object)?
{% for question in poll.questions.all()|iterate %} ?
On Tue, Feb 17, 2009 at 3:12 PM, Jeremy Dunck <jdu...@gmail.com> wrote:
> Even so, it seems like it'd be useful to have a built-in filter which
> uses iter(object)?
> {% for question in poll.questions.all()|iterate %}
Ugh.
Sorry, I'm an idiot.
{% for question in poll.questions.all.iterator %} works just fine.
On Tue, Feb 17, 2009 at 3:15 PM, Jeremy Dunck <jdu...@gmail.com> wrote:
...
> {% for question in poll.questions.all.iterator %}
> works just fine.
OK, last one from me.
As a 2.0 wish, I'd like to make .iterator the default behavior, and the cached-version a special case. I realize this point is quite debatable.
However-- is there already a place for 2.0-wishlist sort of things? I know there's no sense discussing it for the 1.x line. So how do we remember these sorts of issues when it comes 2.x time?
On Tue, Feb 17, 2009 at 7:40 AM, NitinHayaran <nitinhaya...@gmail.com> wrote:
> Today i read this article and was wondering whether django orm is
> really that bad.
Well, it's obligatory for me first to say "wow, Blogger sucks", since I can't actually read that post -- I just get a Blogger template with a big white empty space where the article ought to be (looking even at the HTML source, the content just ain't there).
-- "Bureaucrat Conrad, you are technically correct -- the best kind of correct."
On Tue, Feb 17, 2009 at 3:52 PM, James Bennett <ubernost...@gmail.com> wrote:
> On Tue, Feb 17, 2009 at 7:40 AM, NitinHayaran <nitinhaya...@gmail.com> wrote:
>> Today i read this article and was wondering whether django orm is
>> really that bad.
> Well, it's obligatory for me first to say "wow, Blogger sucks", since
> I can't actually read that post -- I just get a Blogger template with
> a big white empty space where the article ought to be (looking even at
> the HTML source, the content just ain't there).
It used to be there. I think the OP deleted the post.
On Tue, Feb 17, 2009 at 3:00 PM, Jeremy Dunck <jdu...@gmail.com> wrote:
> On Tue, Feb 17, 2009 at 3:52 PM, James Bennett <ubernost...@gmail.com> wrote:
>> On Tue, Feb 17, 2009 at 7:40 AM, NitinHayaran <nitinhaya...@gmail.com> wrote:
>>> Today i read this article and was wondering whether django orm is
>>> really that bad.
>> Well, it's obligatory for me first to say "wow, Blogger sucks", since
>> I can't actually read that post -- I just get a Blogger template with
>> a big white empty space where the article ought to be (looking even at
>> the HTML source, the content just ain't there).
> It used to be there. I think the OP deleted the post.
I'm not sure. If you click the blog archive link for February, you can still read the full post. It's only the direct link that's not working, which means that the comments are (AFAIK) inaccessible.
On Tue, 2009-02-17 at 15:20 -0600, Jeremy Dunck wrote:
> On Tue, Feb 17, 2009 at 3:15 PM, Jeremy Dunck <jdu...@gmail.com> wrote:
> ...
> > {% for question in poll.questions.all.iterator %}
> > works just fine.
> OK, last one from me.
> As a 2.0 wish, I'd like to make .iterator the default behavior, and
> the cached-version a special case. I realize this point is quite
> debatable.
I'd be somewhat against this, I think. It's *very* easy to reuse querysets and inadvertently cause extra database queries. Unless you're using really huge querysets, the memory usage is not going to kill you. Pulling back the huge number of results already uses a bunch of memory and that's a property of the db wrapper. There's a multiplier involved for creating Python objects. Since we have a way to not use the caching if somebody wants to optimise on that level and since doing that and then doing a second database access is quite slow, we're trading memory usage for speed and ease of use (and providing a way to improve the former in "expert mode").
I really don't look forward to the five questions a day on django-users about all the database queries that are happening. I know you're only talking about the mythical 2.0, but that doesn't change how people will behave. I'm strongly in favour of keeping Django's primary audience as experienced developers wanting to work faster, but we do have a large non-experienced and even absolute beginner userbase, so simple things that can save them a lot of time aren't to be dismissed out of hand.
> I'd be somewhat against this, I think. It's *very* easy to reuse
> querysets and inadvertently cause extra database queries.
...
> we're trading memory
> usage for speed and ease of use (and providing a way to improve the
> former in "expert mode").
Point taken.
I wish there were some way to issue a warning if _result_cache is filled but __iter__ isn't used more than once. :-/
I could imagine a warning being issued if the functionality offered by .iterator is used more than once. That might be a happy medium-- then I could use .iterator as my default coding practice, and be slapped when I iterate more than once after all.
if settings.DEBUG and self.prior_iteration:
    warnings.warn("dope!")
?
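A slightly fuller sketch of that check, for illustration only. OneShotResults is a hypothetical stand-in rather than Django's QuerySet, the debug argument stands in for settings.DEBUG, and fetch_rows is an assumed callable that returns fresh rows on each call. iterator() is deliberately a plain method, not a generator function, so the warning fires at call time instead of on the first next().

```python
import warnings


class OneShotResults:
    """Hypothetical stand-in for a queryset; not Django's QuerySet."""

    def __init__(self, fetch_rows, debug=True):
        self._fetch_rows = fetch_rows   # assumed: callable returning rows
        self._debug = debug             # stands in for settings.DEBUG
        self.prior_iteration = False

    def iterator(self):
        # Warn on the second and later calls: nothing is cached here,
        # so each call re-fetches the rows from the source.
        if self._debug and self.prior_iteration:
            warnings.warn(
                "iterator() used more than once; results are re-fetched",
                RuntimeWarning,
                stacklevel=2,  # attribute the warning to the caller
            )
        self.prior_iteration = True
        return iter(self._fetch_rows())
```

The first call to iterator() is silent; any later call on the same object emits a RuntimeWarning when the debug flag is set.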
On Tue, 2009-02-17 at 18:57 -0600, Jeremy Dunck wrote:
> On Tue, Feb 17, 2009 at 6:49 PM, Malcolm Tredinnick
> <malc...@pointy-stick.com> wrote:
> ...
> > I'd be somewhat against this, I think. It's *very* easy to reuse
> > querysets and inadvertently cause extra database queries.
> ...
> > we're trading memory
> > usage for speed and ease of use (and providing a way to improve the
> > former in "expert mode").
> Point taken.
> I wish there were some way to issue a warning if _result_cache is
> filled but __iter__ isn't used more than once. :-/
Possible. Requires relying on __del__ being called so that we know when it's not being used any longer. I prefer your other option, however.
> I could imagine a warning being issued if the functionality offered by
> .iterator is used more than once. That might be a happy medium-- then
> I could use .iterator as my default coding practice, and be slapped
> when I iterate more than once after all.
> if settings.DEBUG and self.prior_iteration:
>     warnings.warn("dope!")
This certainly sounds reasonable and doable today without any real overhead. Go ahead and make a patch/ticket.
Discussion subject changed to "Warning on multiple calls to QuerySet.iterator() (was Re: Is this true. that django really takes a lot of memory?)" by Malcolm Tredinnick
On Tue, 2009-02-17 at 19:25 -0600, Jeremy Dunck wrote:
> On Tue, Feb 17, 2009 at 7:13 PM, Malcolm Tredinnick
> <malc...@pointy-stick.com> wrote:
> ...
> >> if settings.DEBUG and self.prior_iteration:
> >>     warnings.warn("dope!")
> > This certainly sounds reasonable and doable today without any real
> > overhead. Go ahead and make a patch/ticket.
> OK.
> Do you think there should be a PerformanceWarning class, or just use
> the default UserWarning?
It should be blue! :-)
Slight preference for using a standard warning type at the moment. Either UserWarning or RuntimeWarning. Only a slight preference, though. Please yourself here.
Please don't ask me what to do about issuing multiple times, because I was thinking about that over lunch just now and it may be fiddly. Issuing a warning always sounds right, since it's commonly going to be a property of a template that will cause this to happen. But if you use that template and hit a problem, you're going to be swamped with warnings. We really need a "once per template" option that obviously doesn't exist in the warnings module. I might be over-thinking it, though. "Always" is probably the right answer.
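To make the "always" versus suppress-repeats trade-off concrete, here is a small sketch using the standard warnings module's filter actions. count_emitted is a hypothetical helper written for this illustration. The stock filters key on message text, category and code location, so nothing here gives the "once per template" behaviour wished for above.

```python
import warnings


def count_emitted(action, message, times=3):
    """Emit the same warning `times` times under the given filter action
    and return how many of them were actually reported."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action)
        for _ in range(times):
            warnings.warn(message, RuntimeWarning)
    return len(caught)
```

Under "always" every emission is reported, which is what would swamp the output for a template that loops. Under "once" only the first emission of a given message and category is reported per process, which avoids the flood but can also hide a second, unrelated offender that happens to use the same message.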
On Tue, Feb 17, 2009 at 7:49 PM, Malcolm Tredinnick
<malc...@pointy-stick.com> wrote:
> I'd be somewhat against this, I think. It's *very* easy to reuse
> querysets and inadvertently cause extra database queries. Unless you're
> using really huge querysets, the memory usage is not going to kill you.
> Pulling back the huge number of results already uses a bunch of memory
> and that's a property of the db wrapper. There's a multiplier involved
> for creating Python objects.
Speaking as someone who has (accidentally) brought down a beefy server by evaluating a reasonably large QuerySet, I'd say there's not a whole lot we can do without impacting usability in other, more vital-to-support scenarios.
When we had our nasty server-crashing query (which thankfully never made it out of staging; that's what staging servers are for and why you should have one to test things before you ever think about deploying), just fetching the data from the DB -- no object instantiation at all -- was a significant drain. Actually trying to instantiate the model objects kicked the usage up even higher, of course, but it was mostly an interesting exercise in watching the memory spike move from place to place as the data worked its way from the DB to the Python process in which Django was running.
(incidentally, the above sort of situation is one reason why a QuerySet limits itself to a certain number of objects displayed in __repr__; the real killer was that an error was being thrown, and as part of the Django debug page it was trying to print the __repr__ of a QuerySet of, IIRC, about half a million objects. A QuerySet doesn't try to do that anymore)
-- "Bureaucrat Conrad, you are technically correct -- the best kind of correct."
> On Tue, 2009-02-17 at 18:57 -0600, Jeremy Dunck wrote:
>> On Tue, Feb 17, 2009 at 6:49 PM, Malcolm Tredinnick
>> <malc...@pointy-stick.com> wrote:
>> ...
>>> I'd be somewhat against this, I think. It's *very* easy to reuse
>>> querysets and inadvertently cause extra database queries.
>> ...
>>> we're trading memory
>>> usage for speed and ease of use (and providing a way to improve the
>>> former in "expert mode").
>> Point taken.
>> I wish there were some way to issue a warning if _result_cache is
>> filled but __iter__ isn't used more than once. :-/
> Possible. Requires relying on __del__ being called so that we know when
> it's not being used any longer. I prefer your other option, however.
Well, since __del__ messes with cyclic GC, one could also make a tool that tracks instances of weakref(QuerySets made, created_at), and at the end of each request, print out a list of querysets which were never reused.
But you'd probably end up finding a lot of cases where, in the future, a cache would help you. And you could start getting the opposite issue, which is what the result cache alleviates: many queries.
Which begs the question, would it even be interesting to know if a QS makes an iterator out of itself more than once? Easy to implement, hm hm.
I think I'll play around with these two ideas next Someday.
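For reference, the weakref-tracking idea above might look roughly like the following. It is a sketch only: TrackedResults is a hypothetical stand-in for a queryset (not Django's QuerySet), and the names _live, iterations and never_reused are invented for illustration. A WeakSet is used so that tracking never keeps an instance alive and no __del__ is needed.

```python
import time
import weakref


class TrackedResults:
    """Hypothetical stand-in for a queryset; not Django's QuerySet."""

    # Weak references only, so tracking never keeps an instance alive.
    _live = weakref.WeakSet()

    def __init__(self, rows):
        self._rows = list(rows)
        self.created_at = time.time()
        self.iterations = 0
        TrackedResults._live.add(self)

    def __iter__(self):
        # Count every iteration so reuse can be detected later.
        self.iterations += 1
        return iter(self._rows)


def never_reused():
    """Return live instances that were iterated at most once."""
    return [r for r in TrackedResults._live if r.iterations <= 1]
```

Something like never_reused() could be called at the end of each request to list the candidates. As noted above, the list only says the cache went unused this time, not that it would never help.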