QuerySet.distinct and aggregates

2,299 views
Skip to first unread message

Wedg

unread,
Apr 28, 2010, 6:38:40 PM4/28/10
to Django users
I guess I'm writing this to confirm a behaviour and see if there might
be a work around.

It appears that qs.aggregate(Sum('field')) ignores qs.distinct().

If I have something like this:

qs = Model.objects.filter(reverserelation__field=id).distinct()

... then len(qs) will return the correct number of unique items,
however ...

total = qs.aggregate(total=Sum('field')).get('total')

... appears to ignore the qs.distinct() and will sum all the items,
which may not be distinct if the reverse relation isn't one to one.

Is this something that's a bug? Or does it have to do with the SQL
produced by the qs? Is there a workaround for this?

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To post to this group, send email to django...@googlegroups.com.
To unsubscribe from this group, send email to django-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-users?hl=en.

zinckiwi

unread,
Apr 30, 2010, 9:28:48 AM4/30/10
to Django users
Finding it a wee bit hard to follow without actual models. Could you
post them?

Wedg

unread,
May 3, 2010, 9:20:33 PM5/3/10
to Django users
Here's a real simplified version demonstrating what I was (poorly)
trying to explain above:

# models.py

from django.db import models

class Alpha( models.Model ):
name = models.CharField(max_length=10, default='Alpha')

def __unicode__( self ):
return u'%s%s' % (self.name, self.id)

class Beta( models.Model ):
name = models.CharField(max_length=10, default='Beta')

def __unicode__( self ):
return u'%s%s' % (self.name, self.id)

class Gamma( models.Model ):
name = models.CharField(max_length=10, default='Gamma')
alpha = models.ForeignKey(Alpha)
beta = models.ForeignKey(Beta)

def __unicode__( self ):
return u'%s%s' % (self.name, self.id)


# ./manage.py shell output

In [3]: Alpha.objects.create()
Out[3]: <Alpha: Alpha1>

In [4]: Beta.objects.create()
Out[4]: <Beta: Beta1>

In [5]: Beta.objects.create()
Out[5]: <Beta: Beta2>

In [6]: g = Gamma()

In [7]: g.alpha_id = 1

In [8]: g.beta_id = 1

In [9]: g.save()

In [10]: g = Gamma()

In [11]: g.alpha_id = 1

In [12]: g.beta_id = 2

In [13]: g.save()

In [14]: Alpha.objects.filter(gamma__beta__name='Beta')
Out[14]: [<Alpha: Alpha1>, <Alpha: Alpha1>]

In [15]: Alpha.objects.filter(gamma__beta__name='Beta').distinct()
Out[15]: [<Alpha: Alpha1>]

In [16]:
Alpha.objects.filter(gamma__beta__name='Beta').aggregate(models.Sum('id'))
Out[16]: {'id__sum': 2}

In [17]:
Alpha.objects.filter(gamma__beta__name='Beta').distinct().aggregate(models.Sum('id'))
Out[17]: {'id__sum': 2}

As you can see, the aggregate call in 17 ignores the distinct call.

So - is this a known bug/issue, or documented behavior, or something
new? Is there a workaround for this?

zinckiwi

unread,
May 4, 2010, 9:36:09 AM5/4/10
to Django users
> In [16]:
> Alpha.objects.filter(gamma__beta__name='Beta').aggregate(models.Sum('id'))
> Out[16]: {'id__sum': 2}
>
> In [17]:
> Alpha.objects.filter(gamma__beta__name='Beta').distinct().aggregate(models. Sum('id'))
> Out[17]: {'id__sum': 2}
>
> As you can see, the aggregate call in 17 ignores the distinct call.

I think distinct() in aggregation isn't the right way to go about
this. Using values() provides the same sort of functionality, and
seems the way to go from the docs. Try something like (untested and
line-split for clarity):

total = Alpha.objects.filter(gamma__beta__name='Beta')
.values('name')
.annotate(subtotal=models.Sum('id'))
.aggregate(total=models.Sum('subtotal'))
.get('total')

Regards
Scott

Wedg

unread,
May 4, 2010, 4:29:40 PM5/4/10
to Django users
Hrm... didn't quite work...

In [18]: Alpha.objects.filter(gamma__beta__name='Beta').values('name')
Out[18]: [{'name': u'Alpha'}, {'name': u'Alpha'}]

In [19]:
Alpha.objects.filter(gamma__beta__name='Beta').values('name').annotate(subtotal=models.Sum('id'))
Out[19]: [{'subtotal': 2, 'name': u'Alpha'}]

In [24]:
Alpha.objects.filter(gamma__beta__name='Beta').values('name').annotate(subtotal=models.Sum('id')).aggregate(total=models.Sum('subtotal'))
---------------------------------------------------------------------------
DatabaseError Traceback (most recent call
last)

/#snip#/<ipython console> in <module>()

/usr/local/lib/python2.6/dist-packages/django/db/models/query.pyc in
aggregate(self, *args, **kwargs)
311 is_summary=True)
312
--> 313 return query.get_aggregation(using=self.db)
314
315 def count(self):

/usr/local/lib/python2.6/dist-packages/django/db/models/sql/query.pyc
in get_aggregation(self, using)
364 query.related_select_fields = []
365
--> 366 result = query.get_compiler(using).execute_sql(SINGLE)
367 if result is None:
368 result = [None for q in
query.aggregate_select.items()]

/usr/local/lib/python2.6/dist-packages/django/db/models/sql/
compiler.pyc in execute_sql(self, result_type)
725
726 cursor = self.connection.cursor()
--> 727 cursor.execute(sql, params)
728
729 if not result_type:

/usr/local/lib/python2.6/dist-packages/django/db/backends/util.pyc in
execute(self, sql, params)
17 start = time()
18 try:
---> 19 return self.cursor.execute(sql, params)
20 finally:
21 stop = time()

/usr/local/lib/python2.6/dist-packages/django/db/backends/sqlite3/
base.pyc in execute(self, query, params)
198 query = self.convert_query(query)
199 try:
--> 200 return Database.Cursor.execute(self, query,
params)
201 except Database.IntegrityError, e:
202 raise utils.IntegrityError,
utils.IntegrityError(*tuple(e)), sys.exc_info()[2]

DatabaseError: near "FROM": syntax error

Looks like it really doesn't like that aggregate. I might be confused,
but it seems like it wouldn't work anyway, since the annotation
produces 'subtotal':2, where what I'm going for would be 'subtotal':1,
since there's only one distinct model, but I might be missing
something.

It seems like any call to aggregate after values and annotate produces
this error (same error, same line), regardless of whether it
references the annotation.

In [25]:
Alpha.objects.filter(gamma__beta__name='Beta').values('id').annotate(subtotal=models.Sum('id')).aggregate(total=models.Sum('subtotal'))
---------------------------------------------------------------------------
DatabaseError Traceback (most recent call
last)

In [26]:
Alpha.objects.filter(gamma__beta__name='Beta').values('id').annotate(subtotal=models.Sum('id')).aggregate(total=models.Sum('id')
....: )
---------------------------------------------------------------------------
DatabaseError Traceback (most recent call
last)

And I dunno if this is related, but probably is... aggregate fails to
produce anything when called after values:

In [28]: Alpha.objects.filter(gamma__beta__name='Beta').values('id')
Out[28]: [{'id': 1}, {'id': 1}]

In [29]:
Alpha.objects.filter(gamma__beta__name='Beta').values('id').aggregate(models.Sum('id'))
Out[29]: {}

In [30]:
Alpha.objects.filter(gamma__beta__name='Beta').values('name').aggregate(models.Sum('id'))
Out[30]: {}

In [31]:
Alpha.objects.filter(gamma__beta__name='Beta').aggregate(models.Sum('id'))
Out[31]: {'id__sum': 2}

Any other ideas? It's seeming more and more like if I want this done
in the DB, I'm going to have to write some custom SQL. Either that or
do post processing in python after getting the distinct queryset.

- Jake

P.S. Thanks for your attempted help. :)

zinckiwi

unread,
May 5, 2010, 9:02:11 AM5/5/10
to Django users
> In [29]:
> Alpha.objects.filter(gamma__beta__name='Beta').values('id').aggregate(model s.Sum('id'))
> Out[29]: {}
>
> In [30]:
> Alpha.objects.filter(gamma__beta__name='Beta').values('name').aggregate(mod els.Sum('id'))
> Out[30]: {}

That's odd -- I would have expected #29 to work (though not #30). Must
just be a particular case of mixing values, annotation and aggregation
that doesn't mesh. My suggestion was cobbled together from the parts
about values() and the final section about combining annotation and
aggregation here:

http://docs.djangoproject.com/en/1.1/topics/db/aggregation/#values

I suppose what I'd try next, slightly clunkier but a step short of
having to write SQL, is two queries:

qs = Alpha.objects.filter(gamma__beta__name='Beta').distinct())
id_list = [x.id for x in qs]
total =
Alpha.objects.filter(id__in=id_list).aggregate(total=models.Sum('id')).get('total')

Wedg

unread,
May 5, 2010, 12:31:13 PM5/5/10
to Django users
> qs = Alpha.objects.filter(gamma__beta__name='Beta').distinct())
> id_list = [x.id for x in qs]
> total =
> Alpha.objects.filter(id__in=id_list).aggregate(total=models.Sum('id')).get('total')

I'm sure this approach would work fine, however I feel like it might
be a performance issue when the Alpha model (in my real world case, a
Payment model) reaches a few thousand instances in that id_list. Maybe
I'm wrong. If it were only a few dozen or even a few hundred, I would
just do the processing in python and not worry about it. On the other
hand, maybe python has some fancy optimizations for list comprehension
that I'm not aware of, or DBs for processing large WHERE ... IN ...
clauses.

I guess I'll just start writing up the SQL for now and secretly hope
that someone chimes in with a better solution.

zinckiwi

unread,
May 5, 2010, 4:04:20 PM5/5/10
to Django users
I'd try both and compare the processing time. What is the nature of
the Beta.name equivalent in your actual model? If it is a simple field
(e.g. INT) I doubt the overhead would be too bad. What might be
tripping me up is having your Gamma being linked to your Alpha through
an intermediate model. Since you mentioned Alpha is a payment model,
presumably Gamma is one of the constituent parts -- so why would there
not be a direct FK relationship between the two? (Or at least a one
way FK relationship through the intermediate model e.g. Beta->Gamma-
>Alpha.)

Wedg

unread,
May 5, 2010, 5:12:56 PM5/5/10
to Django users
The Beta name field is actually an FK to another model, like this:
Payment (Alpha) <- AppliedPayment (Gamma) -> Invoice (Beta) ->
Location (Name)

The actual call I had tried was
Payment.objects.filter(appliedpayment__invoice__location=loc).distinct().aggregate(Sum('amount'))
(where amount is a field on Payment). I could do
AppliedPayment...aggregate(...), but unfortunately the AppliedPayment
splits the 'amount' into several values (based on revenue type), which
would mean doing several aggregate calls and summing them, which I've
already done when it seemed appropriate (and I just discovered while
writing this that aggregate can take more than one argument, but I
don't know if that results in a single query). For now, when I know
there's generally going to be less than a dozen instances, I've just
done the processing in python, which I'm sure will work fine.

I was hoping for an elegant solution for when I know there's going to
be several thousand plus instances (or hundreds of thousands after a
year or two of deployment). So in the end, I guess I can work with it,
but mostly I was hoping to find out whether the whole respecting
distinct() thing could be made to happen.

Dunno if it's worthwhile filing a ticket for it, or if there is one.
Or maybe just a doc ticket for mentioning explicitly that aggregate
ignores distinct.

Thanks very much for your help.

zinckiwi

unread,
May 6, 2010, 7:50:27 AM5/6/10
to Django users
> The actual call I had tried was
> Payment.objects.filter(appliedpayment__invoice__location=loc).distinct().ag gregate(Sum('amount'))
> (where amount is a field on Payment). I could do
> AppliedPayment...aggregate(...), but unfortunately the AppliedPayment
> splits the 'amount' into several values (based on revenue type), which
> would mean doing several aggregate calls and summing them, which I've
> already done when it seemed appropriate (and I just discovered while
> writing this that aggregate can take more than one argument, but I
> don't know if that results in a single query).

Multiple aggregates within the same aggregate(), I'm 99% sure, are put
into a single query, at least if they're simple and not spanning
complex relationships themselves.

> For now, when I know
> there's generally going to be less than a dozen instances, I've just
> done the processing in python, which I'm sure will work fine.
>
> I was hoping for an elegant solution for when I know there's going to
> be several thousand plus instances (or hundreds of thousands after a
> year or two of deployment). So in the end, I guess I can work with it,
> but mostly I was hoping to find out whether the whole respecting
> distinct() thing could be made to happen.

Yeah, that's a tricky use case no matter what. I don't think there's a
way out of *something* doing some heavy lifting, whether it be python
or the db. But as I've seen mentioned many times (often by
Malcolm...where's a Malcolm bat-signal when you need one? I'd love to
hear his thoughts on this) the ORM isn't designed and can't be
expected to handle 100% of cases. This may indeed be a situation where
a simple extra() actually *is* the elegant solution.

> Dunno if it's worthwhile filing a ticket for it, or if there is one.
> Or maybe just a doc ticket for mentioning explicitly that aggregate
> ignores distinct.

Certainly agree with this.

Sorry I couldn't be more help!

Regards
Scott

Russell Keith-Magee

unread,
May 9, 2010, 8:43:39 PM5/9/10
to django...@googlegroups.com
On Thu, May 6, 2010 at 7:50 PM, zinckiwi <zinc...@gmail.com> wrote:

>> I was hoping for an elegant solution for when I know there's going to
>> be several thousand plus instances (or hundreds of thousands after a
>> year or two of deployment). So in the end, I guess I can work with it,
>> but mostly I was hoping to find out whether the whole respecting
>> distinct() thing could be made to happen.
>
> Yeah, that's a tricky use case no matter what. I don't think there's a
> way out of *something* doing some heavy lifting, whether it be python
> or the db. But as I've seen mentioned many times (often by
> Malcolm...where's a Malcolm bat-signal when you need one? I'd love to
> hear his thoughts on this) the ORM isn't designed and can't be
> expected to handle 100% of cases. This may indeed be a situation where
> a simple extra() actually *is* the elegant solution.

You're correct that the design intent of the ORM isn't to be a 100%
replacement for SQL. If you ever find yourself in a situation where
you're asking the question "How would I represent this SQL query in
the ORM?" or "Will the ORM interpret this query in a particular way?",
then you've almost certainly reached the point where you should be
using a raw query.

As for using extra() - my advice there is to be careful. It's great
for adding a single extra select column (e.g., for a date-based
aggregate that you can't express using Django's own query language).
However, many people seem to view extra() as a crowbar you can use to
bend the ORM into outputting the SQL you want. If you're in that sort
of situation, you should be using a raw query, not extra(). Django 1.2
makes raw queries much easier specifically to avoid problems like
this.

>> Dunno if it's worthwhile filing a ticket for it, or if there is one.
>> Or maybe just a doc ticket for mentioning explicitly that aggregate
>> ignores distinct.

"ignores" isn't really the right way of looking at it - the behavior
of DISTINCT and GROUP BY are very closely related. Consider:

SELECT DISTINCT table.id, table.name, table.size
FROM table
INNER JOIN other ON other.id = table.join_id

is essentially the same as:

SELECT table.id, table.name, table.size
FROM table
INNER JOIN other ON other.id=table.join_id
GROUP BY table.id, table.name, table.size

That is; by grouping on columns in the base table, you effectively get
distinctness on those rows. The GROUP BY clause has the effect of
collapsing duplicates of the result set that have occurred as a result
of joining into a different table.

Given that Django's aggregate clauses handle all the GROUP BY columns
transparently, this is one of those areas of conceptual leakage where
you need to be aware of how SQL handles certain queries.

Yours,
Russ Magee %-)
Reply all
Reply to author
Forward
0 new messages