QuerySet & memory


Tomas Kopecek

Aug 27, 2007, 6:09:08 PM
to django...@googlegroups.com
Hello,
I thought that QuerySets that are iterated over do not allocate memory
for all objects, but a simple test shows otherwise. When I iterate through
SomeModel.objects.all() I get the same memory consumption as with
list(SomeModel.objects.all()). It is very frustrating, because with a test
database (which is definitely not as large as the production one) of
approximately 30,000 records I get about two GB of memory per process.
Iterating through all records in my app is a special case, but a needed one.

So my question is: is there (or should there be) a difference between
iterating over and enumerating the objects? Is there any way to load objects
on demand only, so as to use memory only roughly equal to sizeof(SomeModel)?

--

Tomas Kopecek
e-mail: permonik at mesias.brnonet.cz
ICQ: 114483784

Doug B

Aug 27, 2007, 8:48:56 PM
to Django users
There isn't much difference. Whether you enumerate, slice, or iterate a
queryset, that's when the actual database query occurs. It will pull
in all matching objects in order to save you additional queries.

Why not iterate in batches, by slicing the query? That way, if you set
your step to, say, 100, you'll have at most 100 records in memory at a
time. If records are added or deleted while you're processing, it might not
be so good, though.

count = SomeModel.objects.all().count()

steps = 100
offset = 0
for i in range(0, count, steps):
    offset = offset + steps
    for o in SomeModel.objects.all()[i:offset]:
        pass  # do your stuff


You could also do a query to select all IDs using .values(), iterate over
that and use .get() to fetch each object individually, or filter with
OR'd Q objects to get batches.

value_fetch = SomeModel.objects.all().values("id")
for row in value_fetch:
    o = SomeModel.objects.get(pk=row['id'])
    # do your stuff
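
A rough sketch of the OR'd Q objects variant (the batch size and the use of
pk here are just illustrative):

from django.db.models import Q

ids = [row['id'] for row in SomeModel.objects.all().values("id")]
batch_size = 100
for start in range(0, len(ids), batch_size):
    batch = ids[start:start + batch_size]
    # Build Q(pk=a) | Q(pk=b) | ... for this batch and fetch it in one query.
    q = Q(pk=batch[0])
    for pk in batch[1:]:
        q = q | Q(pk=pk)
    for o in SomeModel.objects.filter(q):
        pass  # do your stuff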

Jeremy Dunck

Aug 27, 2007, 8:55:48 PM
to django...@googlegroups.com
On 8/27/07, Tomas Kopecek <perm...@mesias.brnonet.cz> wrote:
> I thought that QuerySets that are iterated over do not allocate memory
> for all objects.

QuerySet.iterator does what you want.

QuerySet.__iter__ (the Python method that is called from the for loop)
returns an iterator over the results of QuerySet._get_data, which does
store a results cache.

The design fits the expectation that you'll more frequently be
iterating over a smaller result set for the same queryset, and that
shouldn't hit the database twice, so the results are stored. But
that obviously isn't what you want in this case.
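
Roughly, the difference looks like this (SomeModel is just a stand-in for
any model):

qs = SomeModel.objects.all()

# Normal iteration goes through __iter__, which fills the queryset's
# result cache: every instance stays in memory, and a second pass over
# qs is served from that cache instead of hitting the database again.
for obj in qs:
    pass  # do something
for obj in qs:
    pass  # no second query; reuses the cached instances

# iterator() bypasses the cache and yields instances as the rows come
# back, so memory use stays low; iterating twice means two queries.
for obj in SomeModel.objects.all().iterator():
    pass  # do something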

One other point: the DB-API provides cursor.fetchmany, and Django's
iterator uses it correctly. However, some database libraries
default to a client-side cursor, meaning that even though the API
provides chunking semantics, the library still brings back the entire
result set in one go.

I can't remember what psycopg1 does, but psycopg2 defaults to a
client-side cursor. It is possible to get server-side cursors by using
named cursors, but Django doesn't do this. Fixing this has been (low) on my
to-do list for a long time.

In the common case of small result sets, client-side cursors are
generally a win since they do only one hop to the DB and all of the
results fit in memory easily.
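
For reference, this is roughly what the distinction looks like in raw
psycopg2 (the connection parameters and table name here are made up):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection

# Default (unnamed) cursor: client-side. execute() transfers the whole
# result set to the client, even if you only ever call fetchmany().
cur = conn.cursor()
cur.execute("SELECT * FROM someapp_somemodel")
rows = cur.fetchmany(100)  # everything is already in local memory

# Named cursor: psycopg2 declares a server-side cursor, so fetchmany()
# really does pull rows from the server in chunks.
named = conn.cursor(name='big_scan')
named.execute("SELECT * FROM someapp_somemodel")
while True:
    chunk = named.fetchmany(100)
    if not chunk:
        break
    for row in chunk:
        pass  # process one row at a time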

Anyway, either do as Doug B suggests, iterating over slices, or do as
I suggest, directly calling QuerySet.iterator.

Doug B is wrong that there isn't much difference, though. There
certainly is when you have an expensive query whose results don't fit
into memory.

Five Worlds of Software, recommended reading:
http://www.joelonsoftware.com/articles/FiveWorlds.html

James Bennett

Aug 27, 2007, 9:10:13 PM
to django...@googlegroups.com
On 8/27/07, Jeremy Dunck <jdu...@gmail.com> wrote:
> QuerySet.iterator does what you want.

I was going to follow up with a documentation link, but it appears we
lost the documentation for QuerySet.iterator at some point. I've opened a
ticket.

In any case, Jeremy's right: the "iterator" method returns a generator
which fetches the data in chunks and only instantiates objects when
they're actually needed, yielding them one at a time as you iterate
over it. So you can replace a call to 'all()' with a call to
'iterator()', or chain 'iterator()' on after a call to 'filter()', and
you should see greatly improved memory usage for situations where
you're dealing with huge numbers of objects.
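
In other words (SomeModel, the active field, and do_something are just
placeholders):

# Builds the full result cache in memory:
for obj in SomeModel.objects.filter(active=True):
    do_something(obj)

# Streams the rows in chunks and instantiates objects one at a time:
for obj in SomeModel.objects.filter(active=True).iterator():
    do_something(obj)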


--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

Tomas Kopecek

Aug 28, 2007, 1:30:39 PM
to django...@googlegroups.com
James Bennett wrote:

> On 8/27/07, Jeremy Dunck <jdu...@gmail.com> wrote:
>> QuerySet.iterator does what you want.
>
> I was going to follow up with a documentation link, but it appears we
> lost the documentation for QuerySet.iterator at some point. Opened a
> ticket
>
> In any case, Jeremy's right: the "iterator" method returns a generator
> which fetches the data in chunks and only instantiates objects when
> they're actually needed, yielding them one at a time as you iterate
> over it. So you can replace a call to 'all()' with a call to
> 'iterator()', or chain 'iterator()' on after a call to 'filter()', and
> you should see greatly improved memory usage for situations where
> you're dealing with huge numbers of objects.
>
>
Thanks for the responses. I looked at the iterator() code and it does this
correctly. With some experimenting I saw that the MySQL backend is probably
the worst: even with iterator(), MySQL sends the whole result set and the
Python DB driver just imitates cursor semantics. As I said, I get a 2GB
memory footprint. With SQLite I got only 20MB with the same code. So it
looks like I have to think more about switching to some other DBMS or using
explicit slicing.

For me it might be more appropriate to change iterator() to do some
slicing for me (via an explicit LIMIT clause), maybe as a small patch for our
application. I understand that changing it in general would be a bad
design decision.
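
Something like this little helper is what I have in mind (names are
illustrative; each slice becomes a separate LIMIT/OFFSET query):

def sliced_iterator(qs, chunk_size=500):
    # Yield objects from qs in chunk_size pieces, so only one chunk of
    # instances is alive at a time.
    start = 0
    while True:
        chunk = list(qs[start:start + chunk_size])  # LIMIT chunk_size OFFSET start
        if not chunk:
            break
        for obj in chunk:
            yield obj
        start += chunk_size

for obj in sliced_iterator(SomeModel.objects.all()):
    pass  # do the real work here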

So again, thanks for the help.

Jeremy Dunck

unread,
Aug 28, 2007, 4:40:37 PM8/28/07
to django...@googlegroups.com
On 8/28/07, Tomas Kopecek <perm...@mesias.brnonet.cz> wrote:
...

> For me it could be more appropriate to change iterator() to do some
> slicing for me (by explicit LIMIT clause), maybe a small patch for our
> application. I understand, that changing it in general would be a bad
> design decision.

Ick. :) Consider subclassing QuerySet to override __iter__ and do
the chunking yourself.

This is bad in the sense that it does n+1 queries to chunk it, but you
said that it wasn't needed that often.

Also, note that if you need a consistent read (despite other processes
committing between chunk selects), then you'll need to change your
transaction isolation level:
http://en.wikipedia.org/wiki/Transaction_isolation_level

import itertools
import math

from django.db.models.query import QuerySet

class MyQuerySet(QuerySet):
    def __iter__(self):
        count = self.count()
        chunk_size = 1000  # or whatever makes sense to you
        chunks = []
        for i in range(int(math.ceil(float(count) / chunk_size))):
            chunks.append(self[chunk_size * i:chunk_size * (i + 1)].iterator())
        return itertools.chain(*chunks)
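
On the isolation-level point, a minimal sketch, assuming PostgreSQL and that
all the chunked reads run inside one transaction (other backends spell the
statement differently, e.g. MySQL's SET SESSION TRANSACTION ISOLATION
LEVEL ...):

from django.db import connection

# Minimal sketch, assuming PostgreSQL: issue this as the first statement
# of the transaction that performs the chunked reads, so every chunk
# sees the same snapshot despite other processes committing in between.
cursor = connection.cursor()
cursor.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")

for obj in MyQuerySet(SomeModel):
    pass  # all chunks now read from the same snapshot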
