Django aggregation equivalent of SUM(IF(field = x, 1, 0)) - 40 million row table

626 views
Skip to first unread message

Cal Leeming [Simplicity Media Ltd]

unread,
Jun 16, 2011, 7:20:57 AM6/16/11
to django...@googlegroups.com
Hey guys,

I've picked through as much as the Django docs as possible, but can't seem to find any reference to Sum() allowing any in-method conditionals (i.e. a conditional to Sum() with an in-line IF(), without the use of a WHERE/filter). 

The original query in MySQL is:

mysql> select SUM(is_spam) as is_spam, SUM(is_image_blocked) as is_image_blocked, SUM(IF(dl_job_state = 2, 1, 0)) as dl_job_success COUNT(*) as total_rows from fourchan_post;
+---------+------------------+----------------+------------+
| is_spam | is_image_blocked | dl_job_success | total_rows |
+---------+------------------+----------------+------------+
|    9116 |           266516 |        5010939 |   38832166 |
+---------+------------------+----------------+------------+
1 row in set (3 min 13.14 sec)


In Django, I'm using (incomplete):

>>> Post.objects.aggregate(Count('id'), Sum('is_spam'), Sum('is_image_blocked'), Sum('is_checked'))

So far, the only way I can see to do this, would be to do a filter() before the aggregate, but this will affect the other Sum()'s which would mean multiple queries would be necessary. Normally this wouldn't be an issue, but the table has over 40 million rows lol (and it already takes well over 3 minutes to execute)

Ideally, I'd like to try and find a way (within the ORM), to specify an IF conditional for the Sum(), thus only having to perform a single query. If this isn't possible, I'll put in a feature request for it.

Thanks

Cal

Russell Keith-Magee

unread,
Jun 16, 2011, 7:27:44 AM6/16/11
to django...@googlegroups.com

I haven't looked into your specific use case in detail, but it *might*
be possible to do this with a custom aggregate. SUM, COUNT etc are all
just classes in Django, so you can extend and define your own if you
want.

Ticket #11305 [1] seems to describe almost exactly the feature you are
asking for (although they wanted COUNT, not SUM); in the discussion
for the ticket, Alex gives a code sample implementing conditional
aggregates.

There were also discussions on django-users back when aggregates were
introduced, describing how to define a custom aggregate [2].

Both these references are worth a look.

[1] https://code.djangoproject.com/ticket/11305
[2] http://groups.google.com/group/django-users/browse_thread/thread/bd5a6b329b009cfa

Yours,
Russ Magee %-)

Cal Leeming [Simplicity Media Ltd]

unread,
Jun 16, 2011, 7:41:23 AM6/16/11
to django...@googlegroups.com
Hi Russ,

Thanks for getting back to me. Ticket 11305 looks like what I need, but the code is extremely hacky. The second link you pasted seems to take the right approach, and I agree with you that the F() object should be used.

If I create a sane/clean patch (with F() support) as a stand alone class, would you consider having it merged into the core?

Cal


--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.


Russell Keith-Magee

unread,
Jun 16, 2011, 8:11:40 AM6/16/11
to django...@googlegroups.com
On Thu, Jun 16, 2011 at 7:41 PM, Cal Leeming [Simplicity Media Ltd]
<cal.l...@simplicitymedialtd.co.uk> wrote:
> Hi Russ,
> Thanks for getting back to me. Ticket 11305 looks like what I need, but the
> code is extremely hacky. The second link you pasted seems to take the right
> approach, and I agree with you that the F() object should be used.
> If I create a sane/clean patch (with F() support) as a stand alone class,
> would you consider having it merged into the core?

It depends a little on the exact use case your patch covers. If your
aggregate only covers an obscure edge case, I'd prefer to see it live
externally; but if you can make a case that it's a common use case (or
it's obvious that it is a common use case), then a suitably robust and
tested patch could be considered for core.

Yours,
Russ Magee %-)

Cal Leeming [Simplicity Media Ltd]

unread,
Jun 16, 2011, 8:14:23 AM6/16/11
to django...@googlegroups.com
Hmm, to be honest, most people don't really have the amount of data that we have, and thus would be able to use multiple queries with a filter() instead.

I've had a look through the Django sql aggregates and models aggregates, however I realised that for this to be included in the core, I'd have to add support for all the other databases too. Sadly, I haven't got enough spare time to create/patch/test to this extent, so it looks like I'll have to use a raw() for now.

Thanks for getting back to me anyway though, I'll update the ticket with this discussion.

Cal


Yours,
Russ Magee %-)

Reply all
Reply to author
Forward
0 new messages