Aggregate class: a first-attempt

10 views
Skip to first unread message

Tim Chase

unread,
Feb 22, 2007, 1:13:40 PM2/22/07
to django...@googlegroups.com
Below I've pasted my first attempt at an aggregate function
class. Its __init__ takes a queryset (and an optional list of
aggregate functions to perform if, say, you only want the "sum"s
rather than min/max/avg/sum). Thus you should be able to do
things like

>>>class Foo(Model):
... blah = CharField(maxlength=42)
... xxx = IntegerField()
... yyy = SmallIntegerField()
... zzz = PositiveIntegerField()
...
[add some data to Foo]
>>> stats = Aggregate(Foo.objects.filter(xxx__gt=21))

The "stats" object now has properties
xxx_sum
xxx_max
xxx_min
xxx_average
yyy_sum
yyy_max
yyy_min
yyy_average
zzz_sum
zzz_max
zzz_min
zzz_average

which can be accessed to provide the associated stats for the
given model. If you make your own field types, if they're
numeric, you'll want to add them to the set().

The constructor can also be called with the list of functions you
want:

>>> stats = Aggregate(Foo.objects.all(), ['sum'])

in case you just want the *_sum properties.

Future enhancements might include aggregates that don't involve
sum/avg, so one might get the min/max of a date/time/char field.

I haven't yet figured out a way to suppress the order_by portion,
so what's currently in there is an ugly hack. But it would need
to prevent the standard methods from inserting an ORDER BY clause
against a non-aggregated field.

At the moment, it meets my needs, but I'd gladly consider any
feedback on this first draft version of things. Also, if you
find it useful, feel free to shred it into your own purposes as
desired.

-tkc


#############################################
from django.db import backend, connection

NUMERIC_FIELDS = set((
'FloatField',
'IntegerField',
'PositiveIntegerField',
'PositiveSmallIntegerField',
'SmallIntegerField',
))

FUNCTIONS = [
'min',
'max',
'average',
'sum',
]

def is_numeric_field_type(field):
return field.get_internal_type() in NUMERIC_FIELDS

class Aggregate(object):
def __init__(self, queryset, only_type=None):
self._cache = None
self.queryset = queryset.select_related(False)
self.only_type = only_type or FUNCTIONS

def all(self):
self._get_results()

def _get_aggregate_sql(self):
q = self.queryset._clone()
q._order_by = '?' # can't use None as it gets
# overridden if a default Meta.ordering is specified
# Should do some magic to prevent it from adding
# the Meta.ordering if possible
select, sql, params = q._get_sql_clause()
meta = self.queryset.model._meta
# build the SELECT contents:
# fn(table.field) AS field_fn
selects = [[
"%s(%s.%s) AS %s" % (
fn.lower(),
backend.quote_name(meta.db_table),
backend.quote_name(f.column),
backend.quote_name("%s_%s" % (f.column, fn.lower())),
) for f in meta.fields
if is_numeric_field_type(f)
] for fn in self.only_type]
# the sum(selects, []) flattens
# the list-of-lists into just a list
return sum(selects, []), sql, params

def _get_results(self):
if self._cache is None:
select, sql, params = self._get_aggregate_sql()
cursor = connection.cursor()
sql = "SELECT " + ",".join(select) + sql
cursor.execute(sql, params)
results = cursor.fetchone()
field_names = [d[0] for d in cursor.description]
self._cache = dict(zip(field_names, results))
return self._cache

def __getattribute__(self, k):
try:
return object.__getattribute__(self, k)
except:
results = self._get_results()
return results[k]

Honza Král

unread,
Feb 22, 2007, 6:40:10 PM2/22/07
to django...@googlegroups.com

if you add an empty call (no parameters) to order_by(), it will drop
the ORDER BY clause

>
> At the moment, it meets my needs, but I'd gladly consider any
> feedback on this first draft version of things. Also, if you
> find it useful, feel free to shred it into your own purposes as
> desired.

I have another proposition, how about:

>>> quseryset = Model.objects.all()
>>> queryset.aggregate( ( 'name', 'city' ), sum=( 'pay',
'some_other_field' ), avg=( 'pay', 'age' ), count=True )
>>> [
{
'name' : 'some name',
'city' : 'Prague',
'sum' : { 'pay' : 30000, 'some_other_field' : 10000 },
'avg' : { 'pay' : 10000, 'age' : 30 },
'count' : 5,
},
......
]

or something like:

{
( 'some name', 'Prague') : {
'sum' : { 'pay' : 50000, 'some_other_field' : 10000 },
'avg' : { 'pay' : 10000, 'age' : 30 },
'count' : 5,
},
( 'other name', 'other city') : {
....
}
}

it seems reasonably clean interface to me:
fist argument is a tuple of fields on which to call GROUP BY, then
there are several optional kwargs: avg, sum, min, max and count

I am willing to code up a patch for this, it should be pretty
straightforward. If we would want to take it to the next level, we
could allow filter() on such aggregated queryset, that would manifest
itself in the HAVING clause...

what do you think??


--
Honza Král
E-Mail: Honza...@gmail.com
ICQ#: 107471613
Phone: +420 606 678585

Tim Chase

unread,
Feb 22, 2007, 8:46:16 PM2/22/07
to django...@googlegroups.com
>> I haven't yet figured out a way to suppress the order_by portion,
>> so what's currently in there is an ugly hack. But it would need
>> to prevent the standard methods from inserting an ORDER BY clause
>> against a non-aggregated field.
>
> if you add an empty call (no parameters) to order_by(), it will drop
> the ORDER BY clause

I thought I had tried this. I'll have to give it another whack
on Monday when I get back from PyCon. I had problems with it
trying to order the resultset by the Meta.ordering properties,
when those fields weren't part of the aggregate. I'll try it
again to see if I did something wrong or if I unearthed a bug.

> I have another proposition, how about:
>
>>>> quseryset = Model.objects.all()
>>>> queryset.aggregate( ( 'name', 'city' ), sum=( 'pay',
> 'some_other_field' ), avg=( 'pay', 'age' ), count=True )

I like this calling interface as an alternate method for its
fine-tuned control, but there are times it would be nice to not
have to have such tight coupling between the model and the
aggregation. Otherwise, a change in the model requires not only
a change in the template to make use of the results, but in the
view (to request various aggregate functions) to add the desired
aggregation functions.

>>>> [
> {
> 'name' : 'some name',
> 'city' : 'Prague',
> 'sum' : { 'pay' : 30000, 'some_other_field' : 10000 },
> 'avg' : { 'pay' : 10000, 'age' : 30 },
> 'count' : 5,
> },
> ......
> ]

As long as the resulting data can be easily accessed directly by
a template, any number of resulting formats would work just fine.
I'd see using it something like

return render_to_response(
'test.html', {'items': Model.objects.all()}
)

and then within the template refer to

<table>
{% for item in items %}
<tr><td>{{ item.pay }}</td></tr>
{% endfor %}
<tr><td>{{ items.aggregate.sum.pay }}</td></tr>
</table>

or possibly

items = Model.objects.all()
return render_to_response('test.html', {
'items': items,
'stats': items.aggregate(),
})

and then within the template refer to

<table>
{% for item in items %}
<tr><td>{{ item.pay }}</td></tr>
{% endfor %}
<tr><td>{{ stats.sum.pay }}</td></tr>
</table>

> it seems reasonably clean interface to me:
> fist argument is a tuple of fields on which to call GROUP BY, then
> there are several optional kwargs: avg, sum, min, max and count
>
> I am willing to code up a patch for this, it should be pretty
> straightforward. If we would want to take it to the next level, we
> could allow filter() on such aggregated queryset, that would manifest
> itself in the HAVING clause...
>
> what do you think??

I like the ideas you present, particularly your clean interface
for the aggregated results ("results.sum.fieldname" rather than
my munged "results.fieldname_sum") and the clean interface for
specific aggregation. However, as stated above, I am hesitant to
give up the simplicity of just asking for "aggregate everything
possible" to prevent tight coupling. Both sound good to me ("but
I want both specific *and* generic" :)

Feel free to swipe any of the code I provided for any patches you
make.

I think it will be a valuable addition as evaluation can be made
lazy (thus, it has no great impact on existing code), and I've
seen it requested several times in the Django-Users ML archives
where the answer was "Django doesn't do that, but you can drop to
hand-rolled SQL to do it for you." I think it would help the
folks that don't think in SQL.

Just my musings...I appreciate your feedback.

-tkc

Honza Král

unread,
Feb 22, 2007, 9:26:15 PM2/22/07
to django...@googlegroups.com
On 2/23/07, Tim Chase <django...@tim.thechases.com> wrote:
>
> >> I haven't yet figured out a way to suppress the order_by portion,
> >> so what's currently in there is an ugly hack. But it would need
> >> to prevent the standard methods from inserting an ORDER BY clause
> >> against a non-aggregated field.
> >
> > if you add an empty call (no parameters) to order_by(), it will drop
> > the ORDER BY clause
>
> I thought I had tried this. I'll have to give it another whack
> on Monday when I get back from PyCon. I had problems with it
> trying to order the resultset by the Meta.ordering properties,
> when those fields weren't part of the aggregate. I'll try it
> again to see if I did something wrong or if I unearthed a bug.
>
> > I have another proposition, how about:
> >
> >>>> quseryset = Model.objects.all()
> >>>> queryset.aggregate( ( 'name', 'city' ), sum=( 'pay',
> > 'some_other_field' ), avg=( 'pay', 'age' ), count=True )
>
> I like this calling interface as an alternate method for its
> fine-tuned control, but there are times it would be nice to not
> have to have such tight coupling between the model and the
> aggregation. Otherwise, a change in the model requires not only
> a change in the template to make use of the results, but in the
> view (to request various aggregate functions) to add the desired
> aggregation functions.

yes, and that is how it should be in my opinion. Why have the database
calculate stuff you don't want?

>
> >>>> [
> > {
> > 'name' : 'some name',
> > 'city' : 'Prague',
> > 'sum' : { 'pay' : 30000, 'some_other_field' : 10000 },
> > 'avg' : { 'pay' : 10000, 'age' : 30 },
> > 'count' : 5,
> > },
> > ......
> > ]
>
> As long as the resulting data can be easily accessed directly by
> a template, any number of resulting formats would work just fine.

good point, the dictionary one wouldn't work very well in templates

> I'd see using it something like
>
> return render_to_response(
> 'test.html', {'items': Model.objects.all()}
> )
>
> and then within the template refer to
>
> <table>
> {% for item in items %}
> <tr><td>{{ item.pay }}</td></tr>
> {% endfor %}
> <tr><td>{{ items.aggregate.sum.pay }}</td></tr>
> </table>

this just seems wrong - doing something like this in a template just
doesn't feel right, you should calculate the aggregation in the view -
its an expensive operation you should always think through

>
> or possibly
>
> items = Model.objects.all()
> return render_to_response('test.html', {
> 'items': items,
> 'stats': items.aggregate(),
> })

I think this should be:

items = Model.objects.all()
return render_to_response('test.html', {

'stats_by_owner': items.aggregate( ('owner',), count=True,
sum=('visits',) max=('rating', ), min=('rating', 'visits') ),
})

{% for stat in stats_by_owner %}
Owner {{ stat.owner }} has written {{ stat.count }} articles wit
ratings between {{ stat.min.rating }} and {{ stat.max.rating }}. His
work was read {{ stat.sum.visits }} times in total and his least
popular article was read {{ stat.min.visits }} times.
{% endfor %}

the problem here is that if this should make any sense, owner should
be an object, not just a field value (user id in this case), I believe
though that this can be done...

>
> and then within the template refer to
>
> <table>
> {% for item in items %}
> <tr><td>{{ item.pay }}</td></tr>
> {% endfor %}
> <tr><td>{{ stats.sum.pay }}</td></tr>
> </table>
>
> > it seems reasonably clean interface to me:
> > fist argument is a tuple of fields on which to call GROUP BY, then
> > there are several optional kwargs: avg, sum, min, max and count

the first argument should also be optional - in case you want the total

> >
> > I am willing to code up a patch for this, it should be pretty
> > straightforward. If we would want to take it to the next level, we
> > could allow filter() on such aggregated queryset, that would manifest
> > itself in the HAVING clause...
> >
> > what do you think??
>
> I like the ideas you present, particularly your clean interface
> for the aggregated results ("results.sum.fieldname" rather than
> my munged "results.fieldname_sum") and the clean interface for
> specific aggregation. However, as stated above, I am hesitant to
> give up the simplicity of just asking for "aggregate everything
> possible" to prevent tight coupling. Both sound good to me ("but
> I want both specific *and* generic" :)

and what about performance and optimization? do you really want to
calculate even the things you have no intention on using? Tight
coupling with the model here makes sense - you cannot really ask for
aggregation, if you don't know, what you want.

>
> Feel free to swipe any of the code I provided for any patches you
> make.

thanks, I will wait for some more feedback though, I want to get the
interface right before I start coding

>
> I think it will be a valuable addition as evaluation can be made
> lazy (thus, it has no great impact on existing code), and I've
> seen it requested several times in the Django-Users ML archives
> where the answer was "Django doesn't do that, but you can drop to
> hand-rolled SQL to do it for you." I think it would help the
> folks that don't think in SQL.

or the guys that don't want to rely on one DB engine

>
> Just my musings...I appreciate your feedback.

same here ;)

>
> -tkc

Tim Chase

unread,
Feb 23, 2007, 6:59:19 AM2/23/07
to django...@googlegroups.com
>>>>>> quseryset = Model.objects.all()
>>>>>> queryset.aggregate( ( 'name', 'city' ), sum=( 'pay',
>>> 'some_other_field' ), avg=( 'pay', 'age' ), count=True )
>> I like this calling interface as an alternate method for its
>> fine-tuned control, but there are times it would be nice to not
>> have to have such tight coupling between the model and the
>> aggregation. Otherwise, a change in the model requires not only
>> a change in the template to make use of the results, but in the
>> view (to request various aggregate functions) to add the desired
>> aggregation functions.
>
> yes, and that is how it should be in my opinion. Why have the database
> calculate stuff you don't want?

My line of reasoning for allowing it to default to everything
would be the same reason that you can simply call .all() and get
all the fields rather than calling something crazy like

.all(model.field1, model.field3, model.field7)

to just selectively bring back the three fields that you "know"
you want. It gives the template designer access to any of the
fields without needing to resort to involving the coder to make
changes to the view. Any good database will not require a
table/hash-scan for each aggregate, but will gather the aggregate
stats in a single pass, so the cost of doing N fields vs doing
N+1 fields is barely noticeable. And since N is bounded by the
number of fields in the model, this isn't apt to grow out of control.

>> As long as the resulting data can be easily accessed directly by
>> a template, any number of resulting formats would work just fine.
>
> good point, the dictionary one wouldn't work very well in templates

Unless you have some brainstorm for making dictionaries work
nicely in templates, which is another bugaboo I see come up on
the ML occasionally :)

>> and then within the template refer to
>>
>> <table>
>> {% for item in items %}
>> <tr><td>{{ item.pay }}</td></tr>
>> {% endfor %}
>> <tr><td>{{ items.aggregate.sum.pay }}</td></tr>
>> </table>
>
> this just seems wrong - doing something like this in a template just
> doesn't feel right, you should calculate the aggregation in the view -
> its an expensive operation you should always think through

The results should be cached so that if an aggregate is needed by
the template-designer (as per above), the cost happens once with
one additional DB hit. I would be horrified if

<tr>
<td>{{ items.aggregate.sum.pay }}</td>
<td>{{ items.aggregate.sum.hours }}</td>
</tr>

triggered two database hits. :) But if the aggregate stats are
needed, they should be generated on demand, just like .all()
where no DB call actually occurs until its data is used in things
like a {% for x in data %} call. And if it's called, the DB is
only accessed once to bring back the data (okay, glossing over
fetchmany() calls here)


>> items = Model.objects.all()
>> return render_to_response('test.html', {
>> 'items': items,
>> 'stats': items.aggregate(),
>> })
>
> I think this should be:
>
> items = Model.objects.all()
> return render_to_response('test.html', {
> 'stats_by_owner': items.aggregate( ('owner',), count=True,
> sum=('visits',) max=('rating', ), min=('rating', 'visits') ),
> })

As mentioned, I like both syntaxes. :) They seem to return
somewhat different things though. The .aggregate() call would
only ever return one row for the entire dataset. The
parameterized version would return multiple rows based on the
presence of a column-name tuple as the first parameter.
Otherwise, code starts getting unwieldy when you actually do want
all your fields. I have a model with 20+ fields in it for
various types of money (FloatField) and time-durations
(PositiveIntegerField) involved in transactions (you've gotta
love billing information for cell-phone usage). Without the
ability to say "sum and average everything you can", the code
would start looking like

items.aggregate((,), sum=(
'field1',
'field2',
...
'field20',
'field21',
), average=(
'field1',
'field2',
...
'field20',
'field21',
))

which makes for one very ugly view. Even with a folding editor
like Vim, it's still there.

> {% for stat in stats_by_owner %}
> Owner {{ stat.owner }} has written {{ stat.count }} articles wit
> ratings between {{ stat.min.rating }} and {{ stat.max.rating }}. His
> work was read {{ stat.sum.visits }} times in total and his least
> popular article was read {{ stat.min.visits }} times.
> {% endfor %}
>
> the problem here is that if this should make any sense, owner should
> be an object, not just a field value (user id in this case), I believe
> though that this can be done...

I agree...when you have the leading tuple, this is a lovely
syntax. It does have some rough edges where one's model
foolishly has fields such as "min" or "max" and suddenly you've
got a clash of ontologies.


> the first argument should also be optional - in case you want the total

Makes sense.

>> I think it will be a valuable addition as evaluation can be made
>> lazy (thus, it has no great impact on existing code), and I've
>> seen it requested several times in the Django-Users ML archives
>> where the answer was "Django doesn't do that, but you can drop to
>> hand-rolled SQL to do it for you." I think it would help the
>> folks that don't think in SQL.
>
> or the guys that don't want to rely on one DB engine

Do any of the supported DB engines not support groupings? I
thought at least that much was pretty standard. But yes,
requiring a developer to learn a second language (even if it's as
useful as knowing SQL is...) goes against the grain of Django's
"get out of the developer's way and let them just be productive
in Python yet do crazy-powerful stuff" philosophy.

Thanks again for your feedback and ideas.

-tkc

Honza Král

unread,
Feb 23, 2007, 10:12:40 AM2/23/07
to django...@googlegroups.com
On 2/23/07, Tim Chase <django...@tim.thechases.com> wrote:
>
> >>>>>> quseryset = Model.objects.all()
> >>>>>> queryset.aggregate( ( 'name', 'city' ), sum=( 'pay',
> >>> 'some_other_field' ), avg=( 'pay', 'age' ), count=True )
> >> I like this calling interface as an alternate method for its
> >> fine-tuned control, but there are times it would be nice to not
> >> have to have such tight coupling between the model and the
> >> aggregation. Otherwise, a change in the model requires not only
> >> a change in the template to make use of the results, but in the
> >> view (to request various aggregate functions) to add the desired
> >> aggregation functions.
> >
> > yes, and that is how it should be in my opinion. Why have the database
> > calculate stuff you don't want?
>
> My line of reasoning for allowing it to default to everything
> would be the same reason that you can simply call .all() and get
> all the fields rather than calling something crazy like
>
> .all(model.field1, model.field3, model.field7)

this is something completely different - it is easier for the db to
just take all the fields

>
> to just selectively bring back the three fields that you "know"
> you want. It gives the template designer access to any of the
> fields without needing to resort to involving the coder to make
> changes to the view. Any good database will not require a
> table/hash-scan for each aggregate, but will gather the aggregate
> stats in a single pass, so the cost of doing N fields vs doing
> N+1 fields is barely noticeable. And since N is bounded by the
> number of fields in the model, this isn't apt to grow out of control.

actually this is not true - if you wish to do just one MIN(), it will
only use index, no full table scan, myISAM can even use index for
COUNT() and so on....

even if it wasn't noticeable at all, I would still be against it,
because its plain wrong, it wastes resources and allows developers not
to think of what they are doing.

yes, sure, but you are talking two different things here... I say that
it seems wrong to initiate an aggregation from template. Besides, try
doing similar with current querysets - it won't get cached because
every call to filter(), order_by() etc will produce a NEW queryset
with no connection to the old one, so even if you would actually run
the latter, the former will have no access to the data.

well, in this extreme example, I would suggest you use a list:
fields = [ f.name for f in items.model._meta.fields ]
items.aggregate( sum=fields, average=fields, min=fields, max=fields )

not that bad, is it?

>
> which makes for one very ugly view. Even with a folding editor
> like Vim, it's still there.
>
> > {% for stat in stats_by_owner %}
> > Owner {{ stat.owner }} has written {{ stat.count }} articles wit
> > ratings between {{ stat.min.rating }} and {{ stat.max.rating }}. His
> > work was read {{ stat.sum.visits }} times in total and his least
> > popular article was read {{ stat.min.visits }} times.
> > {% endfor %}
> >
> > the problem here is that if this should make any sense, owner should
> > be an object, not just a field value (user id in this case), I believe
> > though that this can be done...
>
> I agree...when you have the leading tuple, this is a lovely
> syntax. It does have some rough edges where one's model
> foolishly has fields such as "min" or "max" and suddenly you've
> got a clash of ontologies.

true, but only if you would want to aggregate by those fields, we
could work around that by simply moving the grouping fields to a
separate dictionary as well:

[
{
'grouped_by' : { 'owner' : XX },
'min' : { 'pay' : 100 },
'max' : { 'pay' : 101},
},
.....
]

>
>
> > the first argument should also be optional - in case you want the total
>
> Makes sense.
>
> >> I think it will be a valuable addition as evaluation can be made
> >> lazy (thus, it has no great impact on existing code), and I've
> >> seen it requested several times in the Django-Users ML archives
> >> where the answer was "Django doesn't do that, but you can drop to
> >> hand-rolled SQL to do it for you." I think it would help the
> >> folks that don't think in SQL.
> >
> > or the guys that don't want to rely on one DB engine
>
> Do any of the supported DB engines not support groupings? I
> thought at least that much was pretty standard. But yes,
> requiring a developer to learn a second language (even if it's as
> useful as knowing SQL is...) goes against the grain of Django's
> "get out of the developer's way and let them just be productive
> in Python yet do crazy-powerful stuff" philosophy.

every sql implementation I know of support group by, the problem is
with other things like quoting etc.

>
> Thanks again for your feedback and ideas.
>
> -tkc
>
>
>
>
> >
>

Tim Chase

unread,
Feb 23, 2007, 3:47:58 PM2/23/07
to django...@googlegroups.com
>> items.aggregate((,), sum=(
>> 'field1',
>> 'field2',
>> ...
>> 'field20',
>> 'field21',
>> ), average=(
>> 'field1',
>> 'field2',
>> ...
>> 'field20',
>> 'field21',
>> ))
>
> well, in this extreme example, I would suggest you use a list:
> fields = [ f.name for f in items.model._meta.fields ]
> items.aggregate( sum=fields, average=fields, min=fields, max=fields )
>
> not that bad, is it?

A nice way to do this, and a good compromise on a clean syntax that is
also easy to do what I need.

> true, but only if you would want to aggregate by those fields, we
> could work around that by simply moving the grouping fields to a
> separate dictionary as well:
>
> [
> {
> 'grouped_by' : { 'owner' : XX },
> 'min' : { 'pay' : 100 },
> 'max' : { 'pay' : 101},
> },
> .....
> ]

Another excellent idea. Pleasantly simple yet a good way to encapsulate
the info while still removing the ambiguity of field-names that conflict
with aggregate-function names. I'm still mildly concerned about being
able to access the contents from within a template, wherein one could
access it by member notation suggested earlier:

stats.grouped_by.owner
stats.sum.pay
stats.max.pay

-tkc

Honza Král

unread,
Feb 24, 2007, 2:00:07 PM2/24/07
to django...@googlegroups.com
I created a ticket for this:

http://code.djangoproject.com/ticket/3566

any comments are welcome

On 2/23/07, Tim Chase <django...@tim.thechases.com> wrote:
>

Reply all
Reply to author
Forward
0 new messages