Aggregate Functions

179 views
Skip to first unread message

Rock

unread,
Mar 1, 2006, 4:51:50 PM3/1/06
to Django developers
My other changeset developed at this week's Django Sprint is much more
extensive and is best explained by the new section that I am adding to
the DB API docs.

Feedback is welcome.

Rock

Aggregate Functions
===================

Aggregate functions perform calculations on columns. Typically
they return a single value. They are in two groups: high_level
and low_level.

High Level Functions
--------------------

The high_level functions are sum(), min(), max(), avg(), stddev()
and median(). Each takes a fieldname as an argument. The type of
the field is checked for correctness as only certain datatypes are
allowed for each of the high level functions.

sum(fieldname)
---------------

Returns the sum of the named field. The field must be an
IntegerField or a FloatField. The returned value corresponds
with the type of the column.

min(fieldname), max(fieldname)
--------------------------------

Returns the minimum or maximum value of the named field. The field
must be an IntegerField, FloatField or DateField. The returned value
corresponds with the type of the field. (This is a string
representation if the field is a DateField.)

avg(fieldname)
---------------

Returns the average of the named field. The field must be an
IntegerField or a FloatField. The returned value is a Float.

stddev(fieldname)
------------------

Returns the standard deviation of the named field. The field must be an
IntegerField or a FloatField. The returned value is a Float.
(Not supported on sqlite3. You get an OperationError exception.)

median(fieldname)
------------------

Returns the median value of the named field. The field
must be an IntegerField, FloatField or DateField. The returned
value corresponds with the type of the field. (This is a string
representation if the column is a DateField.) Unlike the other
functions in this group, this function does not use the DB
supplied capabilities. It fetches all of the values of the field
ordered by that field and returns the middle value. (If there
are an even number of values, the second of the two middle
values is returned.)

Low Level Functions
-------------------

There are two low level functions: get_aggregate() and
get_aggregates(). They do minimal checking and allow for
powerful queries that potentially return multiple values
and/or combine multiple column arithmetically.

The low_level functions take columnnames instead of fieldnames.
You must do your own conversion from fieldname to columnname
if you are taking advantage of the fieldname mapping. (By
default fieldnames and columnnames match each other and so
most users will not have to worry about this distinction.)

get_aggregate(type,columnname)
------------------------------

This function supplies direct support for all database-supplied
aggregate functions. The type parameter is the name of an aggregate
function such as 'SUM', 'VARIANCE' or so forth limited only by
what set of functions your particular database supports. The return
value uses whatever type your database connonically returns. (Most
databases return the same type as the named column, although this
is not the case for some functions such as "avg" or "stddev" which
always returns a Float. Also note that sqlite3 always returns a Float
for all aggregate function.)

Note that the columnname is not explicitly checked for type and
so it is possible to combine columns arithmetically (with care!)
as follows:

Inventory.objects.get_aggregate('AVG','quantity*price')

This returns the average value of the 'quantity' column multiplied
by the 'price' column.

Meals.objects.get_aggregate('MAX','price+tax+tip')

This returns the highest priced meal which is calculated by the
database by adding the 'price', the 'tax' and the 'tip' columns.

(As a repeat warning: Don't forget to get the columnname from your
fieldname if you are using fieldname mapping.)

get_aggregates(types,columnname)
--------------------------------

This function allows a single SQL operation to perform multiple
aggregate functions. The types field is an iterable list of
aggregate function names. The columnname is handled in the same
manner as with the get_aggregate() function. For example:

Inventory.objects.get_aggregates(['AVG','MIN','MAX'],'quantity')

The results are returned in an array.

Usage
-----

Typical use targets all of the rows in the targeted table.
For example:

Articles.objects.sum('wordcount')

However it is possible to combine the aggregate functions with
judicious filtering. For example:

Poll.objects.filter(question__contains='football').min('pub_date')

Exceptions
----------

The most common exceptions encountered when using aggregate functions
are:

FieldDoesNotExist - the columnname is not found.

TypeError - the named column uses an unsupported type.

OperationError - the functype is not supported by the database.

Russell Keith-Magee

unread,
Mar 2, 2006, 8:38:27 AM3/2/06
to django-d...@googlegroups.com
On 3/2/06, Rock <ro...@rockhoward.com> wrote:
>
> My other changeset developed at this week's Django Sprint is much more
> extensive and is best explained by the new section that I am adding to
> the DB API docs.
>
> Feedback is welcome.

Truth be told, a desire to improve aggregates in Django is the reason
that I got involved a few months back (I have a use case that requires
pretty extensive use of aggregates). I was going to wait until post
magic-removal merge to worry about any aggregate proposals - worry
about getting MR out the door first, then worry about other
improvements.

However, if you're going to start playing, I don't to miss out on any
fun :-) Here are some comments on your proposals (including the count
patch from the other thread):

1) I'm -1 on the get_aggregate[s] notation you are suggesting. To me,
as a name, 'get_aggregates' only makes sense as a same if you know
SQL, and the syntax requires end users to learn two notations for
aggregates - avg() and 'AVG', with the latter being very SQL, and not
very python/ORM.

If you are willing to use raw SQL (like get_aggregate
psuedo-requires), you can already do what you are proposing with
relative ease using a select kwarg. select kwargs also allows the
definition of aggregate and non-aggregate 'price+tax+tip' fields.

Also, your proposal doesn't cross-multiply very well - your proposal
is fine if I want 1-n summary stats of 1 field, but not if I want 1-n
summary stats of 1-n fields. (i.e., sum of field 1, avg of field 2).

2) I'm also -1 on the count modification you proposed.
Article.objects.all().count() reads quite obviously as the count of
all articles. However, Article.objects.all().count('title') reads (to
me) like an inelegant way of saying 'a count of all titles'. This
isn't what your proposal would return. What you are proposing can
already be acheived using
Article.objects.filter(fieldname__isnull=False).count(), which, IMHO,
reads better, and is more flexible.

3) I'm +0 on the min, max, avg, stddev and median (where available)
suggestions as you describe. They are all reasonable analogs of the
count() function, and I don't think it would necessarily hurt if they
were to be included. However, I have three things preventing me from
saying +1:

Firstly (and trivially), I would say average() not avg(). The other
abbreviations don't bother me, but for some reason avg() rubs me the
wrong way.

Secondly, in the absence of get_aggregates, if you want to get the
minimum AND the maximum, you need to make two separate SQL queries.
Obviously, this is why you suggested get_aggregates - I just think we
need a better syntax.

A couple of quick suggestions for an alternate syntax for multiple
aggregates in a single query (these ideas are in foetal form - like I
said earlier, I wasn't expecting to have this discussion for a while):

>>> Article.objects.summary(count=True, min='field', average=['field',
'field'], ...)
{'count' = 5, 'min'={'field':1}, 'average'={'field': 3, 'field': 4} ...}

i.e., a kwarg for each aggregate function; COUNT taking true/false,
and SUM/MIN/MAX/AVG/STDEV/MEDIAN taking a single field name, or a
tuple/list of field names. The return value is a dictionary populated
with the summary statistics.

Alternatively, the return value could also be an object:
>>> s = Article.objects.summary(...)
>>> s.count
5
>>> s.min['field'] # (or maybe s.field.min)
1

You could also rearrange the summary method to make the field names
the kwargs of the method:
>>> Article.objects.summary(field1=sum, field2=(sum,average),...)
if you make sum, average, etc functions that can be referenced as
objects. However, this makes count harder to implement (since it
doesn't need to operate on a field).

These options would all allow 1-n fields with 1-n aggregates, and
avoids 'SQL string' notation in ORM queries.

Thirdly, (and most significantly), while avg(), min(), max() etc are
reasonable representations, IMHO they only deal with a small part of
the aggregate story.

Your suggestions all deal with aggregates in the 'summary' sense -
give me an average of a field for these objects. However, in this
context, the field must belong to the object being averaged, and a
query produces either a summary value or a list of objects - you don't
produce both.

While this is important (especially for things like count), IMHO this
misses one very important use case of aggregates: those queries where
you need both a list of objects AND a summary value (or, to put it
another way, queries where the summary value isn't over the table
whose objects we are retreiving).

For example:

class Machine(Model):
id = CharField()

class Part(Model):
name = CharField()
weight = FloatField()
machine = ForeignKey(Machine)

What is the total weight of each machine? The machine consists of
several parts, and the total_weight is the sum of all weights of
related parts.

What you need to be able to do here is to annotate the results
returned by your model with some summary stats. You can currently do
this using select, tables and where kwargs on a filter; however, these
are somewhat complex to set up. IMHO select/where/tables is a
necessary evil; while it is required to allow spanning coverage of
SQL, we should be striving to minimize the use of select/where/tables,
especially when we can find an object-like subset of use cases.

So, I would be aiming to be able to do something like:

>>> m = Machine.objects.annotate('total_weight', part__weight=sum).all()
>>> m[0].total_weight
123

This would do the join required to get weight from the part related
objects, add an extra column to the results returned by SQL, and add
an attribute to objects with the provided alias. You can add multiple
annotations if so required, and filter (or otherwise process) an
annotated query set.

Unresolved issues that I would like to visit:
- queries over annotated fields
- summary stats that are used for queries, but not returned to the end user
- interaction of annotations and queries with group_by/having at the SQL level
- unifying notation for 'summary' and 'annotate'

Like I said earlier, I wasn't expecting to have this conversation for
a while, so my ideas are still a little underdeveloped, but I hope you
get an idea of where I am heading.

Russ Magee %-)

Rock

unread,
Mar 7, 2006, 10:58:12 PM3/7/06
to Django developers
Russ,

It seems that no one else has any thoughts about this, so its just you
and me.

First I note that I agree with you about count. I am taking that out.

Next, I disagree about "average". Using the abbreviated SQLish name for
all of the functions except one is bad. Either all of the names should
be expanded or none. I prefer none since the non-standard SQL aggregate
functions must match the SQL provided name precisely. Might as well
make the standard function names do the same.

The get_aggregate() function is really just the shared code behind
sum(), min(), max(), stddev() and so forth. Arguably it could replaced
by get_aggregates() (or whatever better name you might suggest), but
then it will have a lot of extra setup and teardown code that is
unecessary for the simple cases. Perhaps it should be _aggregate () to
discourage end user use, but there are cases where I would want to use
it. Hmmm.

I will admit that using kwargs is more in line with the rest of Django,
but I really dislike the thought that the mixmaster approach that you
sketched out would be the only interface. I prefer a design that
handles the simplest 80% of the calls trivially with the knowledge that
the other 20% can be accomplished by dropping down to SQL (which, after
all, is how I do the aggregate functions today.) Once we have a good
80/20 design, then a "kitchen sink" approach would be fun to explore
and your suggestions are a good starting point.

Finally, I agree that the name get_aggregate(s) kinda sux but I still
haven't come up with anything better.

Rock

Gary Wilson

unread,
Mar 8, 2006, 12:38:01 AM3/8/06
to Django developers
Rock wrote:
> Finally, I agree that the name get_aggregate(s) kinda sux but I still
> haven't come up with anything better.

Sending to list this time, sorry Rock...

Throwing these into the mix:

Inventory.objects.aggregate(['AVG','MIN','MAX'],'quantity')
Inventory.objects.calculate(['AVG','MIN','MAX'],'quantity')

Of course, you would then allow singular and iterable for the aggregate
functions.

Russell Keith-Magee

unread,
Mar 8, 2006, 1:55:26 AM3/8/06
to django-d...@googlegroups.com
On 3/8/06, Rock <ro...@rockhoward.com> wrote:

It seems that no one else has any thoughts about this, so its just you
and me.

I read it to mean that everyone was concentrating on getting magic-removal finished. MR is all about getting the backwards incompatible changes out of the way in one big jump; once those are sorted out, we can worry about the backwards compatible changes (like adding new API for aggregates). If we don't focus, magic-removal runs the risk of becoming the branch that will never merge...

However - that said, a few comments to give you something to ruminate upon:
 
Next, I disagree about "average". Using the abbreviated SQLish name for
all of the functions except one is bad. Either all of the names should
be expanded or none. I prefer none since the non-standard SQL aggregate
functions must match the SQL provided name precisely. Might as well
make the standard function names do the same.

You have inadvertantly made my point. The only reason avg is a good name is so that you don't have to remember the difference in names when you use your get_aggregate(s) method, and like I said in my last email, I _really_ don't like that function precisely because it leaks raw SQL into the Django API.

Show me somewhere in core (or near core) python libraries that are not SQL related where avg() is used to describe an arithmetic mean, and I'll back down on this one. There is precedent in the Python core for the abbreviation of min, max:

http://docs.python.org/lib/typesseq.html#l2h-167

If you take SciPy as an example of good maths + stats libraries, it would dictate that std() and mean() be the names for STDDEV and AVG.

http://www.scipy.org/doc/api_docs/scipy.stats.stats.html

There are also at least one ASPN recipe to support the idea of stddev() and mean():

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/409413

But a quick search didn't reveal a single example (that isn't SQL bound) that promotes avg().

Look back at the archives when I first brought up aggregates:

http://groups.google.com/group/django-developers/browse_thread/thread/897711c8b86bc5e8/ef491ce3be3e1989

It was made clear to me then that 'SQL does it like X, so lets add X to Django' wouldn't win me any points. Django's ORM isn't about finding a way of representing SQL as Python - it's about getting a consistent, expressive object model, that just happens to be backed by a SQL database. Keep in mind that it could just as well be backed by an object database, or some other persistent store. What will happen to SQL notation if SQL isn't available?

The get_aggregate() function is really just the shared code behind
sum(), min(), max(), stddev() and so forth. Arguably it could replaced
by get_aggregates() (or whatever better name you might suggest), 

The name isn't the only problem: In no particular order:
- It requires the use of raw SQL as a parameter, specified as a string
- It requires the use of strings to identify column names (which string to use? column name? field name? verbose name? plural name? what about error checking?)
- It doesn't provide a way to get the minimum of one field, and the maximum of another using a single SQL call
- It doesn't exhibit good polymorphism - There is no reason that get_aggregate couldn't accept a list as its first parameter as well as a single item
- The method name is get_ in an API moving towards descriptors

Not to put too fine a point on it, but I find very little in get_aggregates that is appealing.

all, is how I do the aggregate functions today.) Once we have a good
80/20 design, then a "kitchen sink" approach would be fun to explore
and your suggestions are a good starting point.

I'm not opposed to the idea of a simple min/max etc API in addition to some mega-aggregate API. However, I don't want to start working on something as big as aggregates until we have a stable base to work on, and we have the attention of all the big players. Finalizing magic-removal has everyone pretty busy at the moment.

_Please_ can we defer this discussion to post magic-removal. Trust me, I won't forget about it - it's the single biggest item I personally want in Django, and I'd rather it be done right, than right now :-)

Russ Magee %-)

Rock

unread,
Mar 8, 2006, 12:18:13 PM3/8/06
to Django developers
> I'm not opposed to the idea of a simple min/max etc API in addition to some
> mega-aggregate API.

I am completely opposed to this.
What I am trying to devise is a non-mega-aggregate function that
handles 80% of the cases easily.
The sum(), min(), max() etc. convenience functions are _not_ that.

What is that is a function to return multiple aggregates performed on a
single column expression.
If we have that, we have something worthy of putting in MR _right now_
(and a tested patch exists.)

Creating a mega API to handle all the other stuff can certainly wait
for after MR.

Agreed? If not, let's discuss what function would handle 80% of use
cases.


FWIW, Jacob agreed with this approach when I sat beside him as I coded
it up. I consider
him a big player. Meanwhile I am not doing anything else with MR, so I
can take the time to
polish up what I have already coded and then we have something usable
now. It can still be
changed in future releases up to 1.0.

Since I have created one of the largest Django apps (managing over 20
million rows), I am
eager to move to MR as anyone. Some minimal aggregate function support
would be a
huge plus for me and that is why I chose to target that during the
recent Sprint.

Reply all
Reply to author
Forward
0 new messages