Large Queryset Calculation In Background?


Nan

Nov 22, 2011, 5:37:59 PM
to Django users
Hi folks --

I need to run a fairly CPU-intensive calculation nightly over a
dataset that's already large and growing quickly. I'm planning to run
this via a cron job, but would like to make sure that it neither eats
up the entire CPU nor locks the database, so that my site can continue
functioning in the meantime. The rough outline of what it needs to do
is as follows:

class OtherThing(models.Model):
    anotherthing = models.ManyToManyField(Whatever)
    ...

class Thing(models.Model):
    other_things = models.ManyToManyField(OtherThing, through='SomethingElse')
    ...

for thing in Thing.objects.select_related('other_things',
                                          'other_things__anotherthing__etc'):
    # this mainly involves serialization to a great depth
    calculated = calculation_on_thing_and_its_otherthings(thing)
    thing.calculated_data = calculated
    thing.save()

Will the above approach lock the database for a while or eat tons of
CPU? Any suggestions? I'm using Django 1.2, btw.

Thanks,
-Nan

Nikolas Stevenson-Molnar

Nov 22, 2011, 6:04:44 PM
to django...@googlegroups.com
I wouldn't expect it to lock the database (though someone with more database expertise should address that). I would expect it to consume significant CPU. If you're on UNIX, you could address this issue by making your process 'nice': http://docs.python.org/library/os.html#os.nice. The nicer a process (the higher the value), the less CPU it will hog. IIRC, nice values default to 0 for processes and range from -20 (biggest CPU usage) to +20 (smallest CPU usage).
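
A minimal sketch of what this looks like from inside the nightly script itself (assuming a UNIX host; the function name here is illustrative):

```python
import os

def lower_cpu_priority(increment=10):
    """Raise this process's niceness so the nightly job yields CPU
    to the web server.

    os.nice() adds `increment` to the current niceness and returns
    the new value; unprivileged processes can only increase it,
    i.e. lower their own priority.
    """
    try:
        return os.nice(increment)
    except (AttributeError, OSError):
        # Not on UNIX, or the increment was rejected.
        return None
```

Call this once at the top of the script, before the heavy loop starts, so the whole calculation runs at the lower priority.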

_Nik

Andre Terra

Nov 22, 2011, 7:51:18 PM
to django...@googlegroups.com
You will definitely need to look into caching those results, perhaps "permanently" in a database.

My recommendation is redis[1] and possibly tools like sebleier's django-redis-cache[2]. Cache invalidation is a pain, I know, but it's pretty much the only way to go.

Long term, you will need to profile the bottlenecks and dive into the django generated SQL to find if you can tune it by refactoring, and possibly switching to either .raw() or .sql() in some cases.

There are plenty of presentations from python/django conferences out there that touch on the subject of ORM optimization, so don't be afraid to google.

Good luck!


Cheers,
AT

[1] http://redis.io
[2] https://github.com/sebleier/django-redis-cache



Nan

Nov 22, 2011, 8:39:42 PM
to Django users

Thanks, Nik: that looks very handy. I didn't realize you could renice
a process from within! So now we just have to figure out whether it
blocks database queries at all, since if it does, then the longer it
runs the bigger a problem we have.

Andre: yes, that's why I'm doing the calculation overnight and caching
it in the DB. But the data it's based on changes almost daily and we
want the calculated version we're working with to be no more than 48
hours old -- hence the nightly recalculation. I'd love tips on how to
get that nightly recalculation not to bring the entire site grinding
to a halt. ;-)

On Nov 22, 7:51 pm, Andre Terra <andrete...@gmail.com> wrote:
> You will definitely need to look into caching those results, perhaps
> "permanently" in a database.
>
> My recommendation is redis[1] and possibly tools like sebleier's
> django-redis-cache[2]. Cache invalidation is a pain, I know, but it's
> pretty much the only way to go.
>
> Long term, you will need to profile the bottlenecks and dive into the
> django generated SQL to find if you can tune it by refactoring, and
> possibly switching to either .raw() or .sql() in some cases.
>
> There are plenty of presentations from python/django conferences out there
> that touch on the subject of ORM optimization, so don't be afraid to google.
>
> Good luck!
>
> Cheers,
> AT
>
> [1]http://redis.io
> [2]https://github.com/sebleier/django-redis-cache
>
> On Tue, Nov 22, 2011 at 9:04 PM, Nikolas Stevenson-Molnar
> <nik.mol...@consbio.org> wrote:
> > I wouldn't expect it to lock the database (though someone with more
> > database expertise should address that). I *would* expect it to consume

Tim Chase

Nov 22, 2011, 10:16:35 PM
to django...@googlegroups.com, Nikolas Stevenson-Molnar
On 11/22/11 17:04, Nikolas Stevenson-Molnar wrote:
> I wouldn't expect it to lock the database (though someone with more
> database expertise should address that). I /would/ expect it to consume
> significant CPU. If you're on UNIX, you could address this issue by
> making your process 'nice'

There's also an "ionice" that should make disk (and possibly
network) I/O operate in a manner kinder to your machine. You'd
launch your cron job with the nice/ionice settings.
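
Sketched as a crontab entry, the two combine like this (the schedule, paths, and `recalculate_things` management command are all hypothetical; `ionice -c 3` is Linux's "idle" I/O scheduling class, from util-linux):

```shell
# Run the nightly recalculation at 2:00 AM at minimal CPU and I/O priority:
# nice -n 19  -> lowest CPU priority
# ionice -c 3 -> "idle" I/O class, only uses disk when nothing else wants it
0 2 * * * nice -n 19 ionice -c 3 /usr/bin/python /path/to/manage.py recalculate_things
```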

-tkc


Nan

Nov 23, 2011, 11:23:12 AM
to Django users

Thanks, Tim -- that looks handy too.

Can anyone comment on the database locking question?

Nikolas Stevenson-Molnar

Nov 23, 2011, 3:09:59 PM
to django...@googlegroups.com
What database are you using? You should be able to find information in
its documentation about the locking behavior for that database. Compare
that with the operations you're running and determine whether they would
result in an exclusive lock.

From your pseudocode, it looks like you're performing a possibly-lengthy
select, followed by lengthy calculations (not involving database
operations), and then a (presumably) quick save. AFAIK, databases never
lock when doing selects, so depending on the nature of your
calculations, you should be fine.

Additionally, if in doubt, I recommend running some tests with the same
database used in your production environment.

_Nik

Tom Evans

Nov 24, 2011, 4:47:14 AM
to django...@googlegroups.com
On Wed, Nov 23, 2011 at 8:09 PM, Nikolas Stevenson-Molnar
<nik.m...@consbio.org> wrote:
> What database are you using? You should be able to find information in
> the documents about the locking behavior for that database. Compare that
> with the operations your running and determine whether they would result
> in an exclusive lock.
>
> From your pseudocode, it looks like you're performing a possibly-lengthy
> select, followed by lengthy calculations (not involving database
> operations), and then a (presumably) quick save. AFAIK, databases never
> lock when doing selects, so depending on the nature of your
> calculations, you should be fine.
>

MySQL will always lock tables for writes when reading from them.
Therefore, any long running query on a mysql table will result in
updates to that table being locked out.

The easiest way around this is with hardware. Use a master-slave DB
setup, and perform your long reads on the slave(s), and all your
writes on the master.
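
In Django (1.2 added multi-database support) this read/write split can be expressed with a database router. A sketch, where the 'slave' and 'default' aliases are assumptions that must match entries in settings.DATABASES:

```python
class ReadSlaveRouter(object):
    """Send reads to the 'slave' alias and writes to 'default' (the master).

    Hypothetical alias names; activate with
    DATABASE_ROUTERS = ['myapp.routers.ReadSlaveRouter'] in settings.
    """

    def db_for_read(self, model, **hints):
        return 'slave'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        # Both aliases point at the same logical data, so relations are fine.
        return True
```

With this in place, the nightly job's big select hits the slave while site traffic keeps writing to the master.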

Cheers

Tom

Nan

Nov 28, 2011, 9:54:40 AM
to Django users

We're using MySQL 5 (don't know off the top of my head what specific
release). I don't think a master/slave DB configuration is something
we can manage to set up at this point.

Querysets are fetched from the database in chunks, right? I imagine
that the select itself is actually quite quick, but do the tables
remain locked between chunks? I.e. if the query returns a total of N
records and Django fetches N/10 records and then performs a bunch of
calculations and saves before returning to the DB for another N/10
records, will the DB be locked during all those calculations/saves?

On Nov 24, 4:47 am, Tom Evans <tevans...@googlemail.com> wrote:
> On Wed, Nov 23, 2011 at 8:09 PM, Nikolas Stevenson-Molnar
> <nik.m...@consbio.org> wrote:

Tom Evans

Nov 28, 2011, 10:10:20 AM
to django...@googlegroups.com
On Mon, Nov 28, 2011 at 2:54 PM, Nan <ring...@gmail.com> wrote:
>
> We're using MySQL 5 (don't know off the top of my head what specific
> release).  I don't think a master/slave DB configuration is something
> we can manage to set up at this point.
>
> Querysets are fetched from the database in chunks, right?  I imagine
> that the select itself is actually quite quick, but do the tables
> remain locked between chunks?  I.e. if the query returns a total of N
> records and Django fetches N/10 records and then performs a bunch of
> calculations and saves before returning to the DB for another N/10
> records, will the DB be locked during all those calculations/saves?
>

Well, it depends upon the database, but no, it doesn't work like that.

With MySQL, the query will cause the server to lock the table until
the client has finished reading all of the result. There are then two
modes to read the result from the server, row-by-row or all-at-once.
Django currently always fetches the entire result all at once,
regardless of how you then fetch the data from the queryset.

Cheers

Tom

Javier Guerra Giraldez

Nov 28, 2011, 10:17:19 AM
to django...@googlegroups.com
On Mon, Nov 28, 2011 at 10:10 AM, Tom Evans <teva...@googlemail.com> wrote:
> Django currently always fetches the entire result all at once,
> regardless of how you then fetch the data from the queryset.

But this result isn't the whole queryset result, it's a chunk of it:
the ORM adds 'LIMIT' arguments to the query. I think the answer to
Nan's question is that there's no lock across chunked reads.

--
Javier

Nan

Nov 28, 2011, 10:19:43 AM
to Django users

On Nov 28, 10:17 am, Javier Guerra Giraldez <jav...@guerrag.com> wrote:

> On Mon, Nov 28, 2011 at 10:10 AM, Tom Evans <tevans...@googlemail.com> wrote:
> > Django currently always fetches the entire result all at once,
> > regardless of how you then fetch the data from the queryset.

Ah, I must have been misunderstanding a discussion[1] on Django-
Developers.

> but this result isn't the whole queryset result, it's a chunk of it.
> the ORM adds 'LIMIT' arguments to the query.  I think the answer to
> Nan's question is that there's no lock across chunked reads.
>
> --
> Javier

Thanks, Javier -- that's a huge help!

[1] http://groups.google.com/group/django-developers/browse_thread/thread/f19040e2e3229d7a#

Tom Evans

Nov 28, 2011, 10:41:39 AM
to django...@googlegroups.com

(NB: you snipped the bit where I say that I am specifically talking
about MySQL - I still am)

No-one mentioned slicing or adding LIMITs into the query - but it is
irrelevant. When you issue a query, the DB tables that are read from
are locked until that data is returned to the client. That happens as
soon as mysql_store_result() in the MySQL C API finishes, at which
point all the data has been transferred from the server to the client.
This happens at the moment that you evaluate your query in django.

When you iterate through or otherwise access that query result in
django, it is no longer talking to the DB server, it does not fetch
any additional data, in chunks or otherwise. The queryset object cache
is built in chunks, but this relates to when Django creates model
instances, not communication with the database.
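
Given that, one way to sidestep the long-running read entirely is to page through the table by primary key, so each query is short and releases its locks before the CPU-heavy calculation runs. A database-free sketch of that loop (in Django terms, `fetch_page` would wrap something like `Thing.objects.filter(pk__gt=last_pk).order_by('pk')[:batch_size]`; the helper and its names are illustrative, not Django API):

```python
def iter_pk_batches(fetch_page, batch_size=500):
    """Yield short, pk-ordered batches instead of one huge result set.

    fetch_page(last_pk, limit) must return up to `limit` (pk, row)
    pairs with pk > last_pk, in ascending pk order -- e.g. a thin
    wrapper around a pk-filtered, LIMIT-ed SELECT.
    """
    last_pk = 0
    while True:
        batch = fetch_page(last_pk, batch_size)
        if not batch:
            return
        yield batch
        last_pk = batch[-1][0]  # resume after the highest pk seen so far
```

Each batch's query completes (and any table lock is released) before the heavy per-batch calculation and saves begin, so the site's own queries can interleave between batches.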

Cheers

Tom
