So my question is: Is there (or should there be) a difference between
iterating and enumerating objects? Is there any way to load objects on
demand only, so as to use memory roughly equal to sizeof(SomeModel)?
--
Tomas Kopecek
e-mail: permonik at mesias.brnonet.cz
ICQ: 114483784
Why not iterate in batches, by slicing the query? That way, if you set
your step to, say, 100, you'll have at most 100 records in memory at a
time. If records are added or deleted while your process runs, it might
not work so well, though.
count = SomeModel.objects.all().count()
steps = 100
offset = 0
for i in range(0, count, steps):
    offset = offset + steps
    for o in SomeModel.objects.all()[i:offset]:
        pass  # do your stuff
You could also do a query to select all ids using .values(), and
iterate over that using .get() to fetch each object individually, or
filter with OR'd Q objects to get batches (see the sketch after the
snippet below).
value_fetch = SomeModel.objects.all().values("id")
for row in value_fetch:
    o = SomeModel.objects.get(pk=row['id'])
    # do your stuff
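For the OR'd Q objects variant, a rough sketch might look like the
following; the batch size of 100 is just an arbitrary choice for
illustration:

from django.db.models import Q

ids = [row['id'] for row in SomeModel.objects.all().values('id')]
batch_size = 100
for start in range(0, len(ids), batch_size):
    batch = ids[start:start + batch_size]
    # OR together one Q(pk=...) per id in this batch
    q = Q(pk=batch[0])
    for pk in batch[1:]:
        q = q | Q(pk=pk)
    for o in SomeModel.objects.filter(q):
        pass  # do your stuff
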
QuerySet.iterator does what you want.
QuerySet.__iter__ (the Python method that is called by the for loop)
returns an iterator over the results of QuerySet._get_data, which does
store a results cache.
The design fits the expectation that you'll more frequently be
iterating over a smaller result set for the same queryset, and that
shouldn't hit the database twice, so the results are stored. But
that obviously isn't what you want in this case.
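To make that concrete, here's a small sketch of the difference
(SomeModel is the example model from earlier in the thread):

qs = SomeModel.objects.all()

for o in qs:    # first pass hits the database and fills the cache
    pass
for o in qs:    # second pass reuses the cached results, no new query
    pass

for o in qs.iterator():   # bypasses the cache entirely; results are
    pass                  # fetched in chunks and not kept around
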
One other point-- the DB-API provides cursor.fetchmany, and Django's
iterator uses this correctly. However, some database libraries
default to a client-side cursor, meaning that even though the API
provides chunking semantics, the library still brings back the entire
resultset in one go.
I can't remember what psycopg1 does, but psycopg2 defaults to a
client-side cursor. It is possible to do server-side cursors using
named cursors, but Django doesn't do this. Fixing this has been (low)
on my to-do list for a long time.
In the common case of small result sets, client-side cursors are
generally a win since they do only one hop to the DB and all of the
results fit in memory easily.
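For reference, outside of Django a server-side cursor with psycopg2
looks roughly like this (the connection string and table name are
placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
# Giving the cursor a name makes psycopg2 create a server-side cursor,
# so rows are pulled from the backend in chunks rather than all at once.
cur = conn.cursor(name='big_select')
cur.itersize = 100   # rows fetched per round trip
cur.execute("SELECT id FROM myapp_somemodel")
for row in cur:
    pass   # process one row at a time
cur.close()
conn.close()
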
Anyway, either do as Doug B suggests, iterating over slices, or do as
I suggest, directly calling QuerySet.iterator.
Doug B is wrong that there isn't much difference, though. There
certainly is when you have an expensive query whose results don't fit
into memory.
Five Worlds of Software, recommended reading:
http://www.joelonsoftware.com/articles/FiveWorlds.html
I was going to follow up with a documentation link, but it appears we
lost the documentation for QuerySet.iterator at some point. I've opened
a ticket.
In any case, Jeremy's right: the "iterator" method returns a generator
which fetches the data in chunks and only instantiates objects when
they're actually needed, yielding them one at a time as you iterate
over it. So you can replace a call to 'all()' with a call to
'iterator()', or chain 'iterator()' on after a call to 'filter()', and
you should see greatly improved memory usage for situations where
you're dealing with huge numbers of objects.
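For example (the filter condition below is purely illustrative):

# stream the whole table instead of caching it:
for obj in SomeModel.objects.all().iterator():
    pass  # do your stuff

# or chain it after a filter():
for obj in SomeModel.objects.filter(name__startswith='A').iterator():
    pass
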
--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
For me it could be more appropriate to change iterator() to do some
slicing for me (via an explicit LIMIT clause), maybe as a small patch
for our application. I understand that changing it in general would be
a bad design decision.
So again, thanks for the help.
Ick. :) Consider subclassing QuerySet to override __iter__ and do
the chunking yourself.
This is bad in the sense that it does n+1 queries to chunk it, but you
said that it wasn't needed that often.
Also, note that if you need a consistent read (despite other processes
committing between chunk selects), then you'll need to change your
transaction isolation level:
http://en.wikipedia.org/wiki/Transaction_isolation_level
from django.db.models.query import QuerySet
import itertools
import math

class MyQuerySet(QuerySet):
    def __iter__(self):
        count = self.count()
        chunk_size = 1000  # or whatever makes sense to you
        chunks = []
        for i in range(0, int(math.ceil(float(count) / chunk_size))):
            # each chunk is a sliced queryset, iterated lazily
            chunks.append(self[chunk_size * i:chunk_size * (i + 1)].iterator())
        return itertools.chain(*chunks)
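To wire that in, you'd return MyQuerySet from a custom manager, roughly
along these lines (get_query_set was the manager hook at the time; the
manager name here is made up):

from django.db import models

class ChunkedManager(models.Manager):
    def get_query_set(self):
        # hand back the chunking QuerySet instead of the default one
        return MyQuerySet(self.model)

class SomeModel(models.Model):
    # ... fields ...
    objects = ChunkedManager()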