It would be very handy to have a batching operation, something like
`Queryset.batch(lambda obj: obj.do_something(), batch_size=100)`. This
would page through the queryset in memory, performing the lambda on every
object. Naming is obviously up for discussion. I must have written code to
do something like this dozens of times.
This is quite similar to how a Paginator works, but the pagination API
feels a bit awkward for this use, and using it is not much less code than
writing the loop yourself.
--
Ticket URL: <https://code.djangoproject.com/ticket/26530>
Comment (by timgraham):
Could you give an example of the boilerplate code you intend to replace?
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:1>
Comment (by mjtamlyn):
There are a few ways of approaching it:
Copying roughly a paginator:
{{{
count = qs.count()
pointer = 0
while pointer < count:
    for obj in qs[pointer:pointer + batch_size]:
        do_something(obj)
    pointer += batch_size
}}}
Basing it off a sequential id (the same idea also applies to time series):
{{{
pointer = 0
while True:
    # Work from oldest first so incoming objects during the run
    # will still get processed.
    batch = qs.filter(id__gt=pointer).order_by('id')[:batch_size]
    if not batch:
        break
    for obj in batch:
        pointer = obj.id
        do_something(obj)
}}}
The operation should also ideally apply to a values or values_list
queryset; this is a similar piece of code which doesn't have to worry as
much about memory:
{{{
# Materialise the ids up front; a flat list of ids is small enough.
ids = list(qs.values_list('id', flat=True))
while ids:
    batch, ids = ids[:100], ids[100:]
    queue_task(batch)
}}}
-----
My motivation for this patch is twofold: partly I'm bored of writing
similar code when dealing with large querysets, but I have also seen many
developers debugging issues with their code because they haven't realised
that holding 10k+ queryset objects in memory is problematic. An easy,
documented API, with warnings about why you need it, should help people
become aware of the issue and make it easy for them to fix it.
A better API suggestion could be `for batch in qs.batch(size=100)`. With
that shape, fixing your broken code is quite possibly just a one-line
change.
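For illustration, here is a minimal sketch of how such a method could work
as a standalone generator, built on the pk-pointer approach above (the name
`batch_queryset` and its exact shape are hypothetical, not part of a
proposed patch):
{{{
def batch_queryset(qs, size=100):
    """Yield lists of at most `size` objects from qs, in primary key order."""
    pointer = None
    while True:
        page = qs.order_by('pk')
        if pointer is not None:
            # Keyset pagination: resume after the last pk we saw, so each
            # batch is a separate, bounded query.
            page = page.filter(pk__gt=pointer)
        page = list(page[:size])
        if not page:
            break
        pointer = page[-1].pk
        yield page
}}}
Then `for batch in batch_queryset(qs, size=100): ...` is effectively the
one-line fix described above.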
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:2>
Comment (by akaariai):
If the idea is to do something for each object, then
{{{
for obj in qs.iterator():
    obj.do_something()
}}}
should give you a lot better memory efficiency. Of course, if using
PostgreSQL, the driver will still fetch all the rows into memory.
A very good approach would be to finally tackle the named cursors issue.
Then you could just do:
{{{
for obj in qs.iterator(cursor_size=100):
    obj.do_something()
}}}
and be done with it. The problem with the named cursor approach is that
some databases have limitations, some of them hard to overcome, on what can
be done with the cursor, how transactions work, and so on.
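For readers unfamiliar with the term: a named (server-side) cursor keeps
the result set on the database server and streams rows over in chunks. In
raw psycopg2, outside the ORM, that looks roughly like this (the connection
string and table name are illustrative):
{{{
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # illustrative
with conn.cursor(name='batch_cursor') as cursor:  # a name makes it server-side
    cursor.itersize = 100  # rows fetched per round trip to the server
    cursor.execute("SELECT id, foo FROM myapp_mymodel ORDER BY id")
    for row in cursor:
        ...  # process one row at a time; the full set never lives in memory
}}}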
If you really want batches of objects, then we probably need to use the
pointer approach. Otherwise, iterating through a large queryset will end up
doing queries like `select * from the_query offset 100000 limit 100`, which
is very inefficient, and concurrent modifications could end up introducing
the same object in multiple batches.
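To make the difference concrete, a sketch of the two query shapes
(`last_seen_id` is a hypothetical variable holding the last pk of the
previous batch):
{{{
# Offset paging: the database must scan and discard all preceding rows,
# so each successive page gets slower.
batch = qs.order_by('id')[100000:100100]

# Keyset (pointer) paging: an index lookup jumps straight past the rows
# already processed, and each batch costs roughly the same.
batch = qs.filter(id__gt=last_seen_id).order_by('id')[:100]
}}}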
I'm mildly in favor of adding this, as the addition to the API surface
isn't large, and it is easy to implement the batching in subtly wrong ways.
If we are going for this, then I think the API should be the `for batch in
qs.batch(size=100)` one. The queryset should be ordered in such a way that
the primary key is the only sorting criterion. We can relax that later so
that the primary key or some other unique key only needs to be a suffix of
the order by, but that is a bit harder to do.
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:3>
Comment (by mjtamlyn):
Heh, I actually didn't know about `iterator()`... Good work reading the
docs, Marc.
Batching for background tasks is definitely common though.
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:4>
Comment (by akaariai):
I'm ok with the batch(size=100) approach. I haven't had a need for this
myself, and I'm not sure how common such cases are. But if this is
something commonly needed, then I think the way to go is an ORM method that
does this in a way that is both efficient and correct when run on a table
that is modified concurrently.
Thinking about this a bit more, batching on primary key ordering alone is
the only safe approach if we want to guarantee we never see the same object
in multiple batches. Otherwise, for example:
{{{
for batch in qs.order_by('mod_date', 'pk').batch(size=1000):
    for obj in batch:
        obj.foo = 'bar'
        obj.mod_date = datetime.now()
        obj.save()
}}}
could end up in an infinite loop: each save pushes the object's `mod_date`
forward, so the same object keeps reappearing in later batches.
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:5>
Comment (by timgraham):
The server-side cursors ticket is #16614. Should we mark this as a
duplicate of that one?
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:6>
Comment (by Nagyman):
It would be nice to see a native solution to this issue. We have had to do
this on occasion (at G Adventures) and it is a technique used by django-
haystack,
[https://github.com/django-haystack/django-haystack/blob/cd7380e35c0950861d5fd07142d4c3c2aa442005/haystack/management/commands/update_index.py#L60
here] for example. That project also calls {{{reset_queries()}}}, but I
have not tested the effects of that call (I just copied the pattern).
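For context, a minimal sketch of that pattern (not the haystack code
itself; `do_something` is a placeholder): when `DEBUG=True`, Django logs
every executed query on the connection, so long-running batch loops call
`reset_queries()` to stop that log from growing without bound.
{{{
from django.db import reset_queries

def process_in_batches(qs, batch_size=1000):
    total = qs.count()
    for start in range(0, total, batch_size):
        for obj in qs[start:start + batch_size]:
            do_something(obj)  # placeholder for the real work
        # With DEBUG=True, Django records every query it runs; clearing
        # the log keeps memory flat across a long run.
        reset_queries()
}}}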
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:7>
* stage: Unreviewed => Accepted
--
Ticket URL: <https://code.djangoproject.com/ticket/26530#comment:8>