Aggregation

9 views
Skip to first unread message

Paul Colomiets

unread,
Aug 3, 2007, 5:57:59 AM8/3/07
to sqlal...@googlegroups.com
Hi,

Is there some aggregation ability built-in in the SQLAlchemy?

I want some simple functions like counting rows and determining the last
one.
I know it's something simple to implement with Mapper Extensions, but may be
there are readily available solutions?

And is there interest if I'll make some code on that?
--
Paul.

Michael Bayer

unread,
Aug 3, 2007, 10:18:12 AM8/3/07
to sqlal...@googlegroups.com

Hmmm, do you mean joining relations against a subrelation that uses
an aggregate like MAX ? i'd like to see what you have in mind for this.

Paul Colomiets

unread,
Aug 3, 2007, 10:57:02 AM8/3/07
to sqlal...@googlegroups.com
Michael Bayer wrote:
>
> Hmmm, do you mean joining relations against a subrelation that uses
> an aggregate like MAX ? i'd like to see what you have in mind for this.
Well, I think I've not explained it correctly.

It looks quite like you're saying, but I want that aggregations
to be stored (cached?) in the table.

Suppose you have, a table of articles and comments on them.
I want to have article.last_comment and article.number_of_comments
to be stored in columns and to be updated when I'm saving Comment
object (which e.g. has a foreign key to article).
And I really want that to be atomic updates (update set cnt=cnt+1)
instead of updating counter on Article instance and saving it.

Michael Bayer

unread,
Aug 3, 2007, 11:34:02 AM8/3/07
to sqlal...@googlegroups.com

we've just added the atomic update thing to 0.4 (note the uppercase
Article which produces a column expression):

article.comment_count = Article.comment_count + 1
session.flush()

i wonder though if theres some way that could get "out of sync" with
the actual number. you'd have to be careful to establish this
update. if you wanted to use a mapper extension to do it, then youd
have to issue the UPDATE directly, the change to the "comment_count"
attribute wont get picked up if its established within the flush itself.


Paul Colomiets

unread,
Aug 3, 2007, 3:22:55 PM8/3/07
to sqlal...@googlegroups.com
Michael Bayer wrote:
> we've just added the atomic update thing to 0.4 (note the uppercase
> Article which produces a column expression):
>
> article.comment_count = Article.comment_count + 1
> session.flush()
>
That's great!
It's quite unconvenient here, but has a lot of good use cases.

> i wonder though if theres some way that could get "out of sync" with
> the actual number. you'd have to be careful to establish this
> update. if you wanted to use a mapper extension to do it, then youd
> have to issue the UPDATE directly, the change to the "comment_count"
> attribute wont get picked up if its established within the flush itself.
If you mean out of sync with article instance - it's not
a problem. If you have a highly concurent database you'll
get out of sync instances in any case, if not - field will be
quite up to date.

Well, while writing this e-mail I've realized that there can
be some problems with deletes and updates that probably
can make value out of sync.

I'll just try and give a feedback after that :)
--
Paul.


Paul Colomiets

unread,
Aug 7, 2007, 1:20:29 PM8/7/07
to sqlal...@googlegroups.com
Michael Bayer wrote:
> i wonder though if theres some way that could get "out of sync" with
> the actual number. you'd have to be careful to establish this
> update. if you wanted to use a mapper extension to do it, then youd
> have to issue the UPDATE directly, the change to the "comment_count"
> attribute wont get picked up if its established within the flush itself.

Hi,

Thank you Michael for help, and for new great docs!

I've finally done first POC implementation of this feature.

Basic usage looks like:

import aggregator as a
mapper(Line, lines,
extension=a.Quick(a.Count(blocks.c.lines),
a.Max(blocks.c.lastline, lines.c.id)))

(You also need foreign keys)

I've implemented two interfaces one
that counts like "cnt = cnt+1" (called Quick)
second does query "SELECT count(*) FROM ...."
at each insert/update/delete (called Accurate)

Well, Quick should be accurate as long as you either don't
update foreign keys involved in process, or have only one
Count aggregation on changable key (*), or use transactions.
So it covers most usecases I think. If you like to recalc you can
use Accurate which can be better in some edge cases.
A bit slower on updates but still fast on reads.

(*) - aggregations on update is not implemented yet, but it's
only a matter of time

For more comprehensive description look here:
http://www.mr-pc.kiev.ua/projects/SQLAlchemyAggregation

I have also two issues.

1. func.if_(...) - tries to sql function "if_(...)",
but func._if(...) - "_i(...)"
I believe is typo?

2. If there a way, to handle functions in cross-database manner?

I've need to write something like this, for max function which
exists on sqlite but absent in mysql (and if() visa-versa):

if aggregator.mapper.local_table.\
metadata.bind.url.drivername == 'mysql':
return if_func(A < B, B, A)
else:
return func.max(A, B)

Is there a better way?
(Well, I will decrease number of dots it no better way exists :) )

--
Paul.

Michael Bayer

unread,
Aug 7, 2007, 1:41:25 PM8/7/07
to sqlalchemy

On Aug 7, 1:20 pm, Paul Colomiets <paul.colomi...@gmail.com> wrote:
>
> 1. func.if_(...) - tries to sql function "if_(...)",
> but func._if(...) - "_i(...)"
> I believe is typo?

er, probably. func is trying to sidestep various "_" underscore
attributes i think. do you need to say "_if(...)" ?

> 2. If there a way, to handle functions in cross-database manner?


this is ticket 615 which seems to be becoming very important, so i
think ill move it to 0.4xx milestone and increase priority.

sdo...@sistechnology.com

unread,
Aug 7, 2007, 2:06:36 PM8/7/07
to sqlal...@googlegroups.com

> I've finally done first POC implementation of this feature.
>
> Basic usage looks like:
>
> import aggregator as a
> mapper(Line, lines,
> extension=a.Quick(a.Count(blocks.c.lines),
> a.Max(blocks.c.lastline, lines.c.id)))
>
> (You also need foreign keys)
>

hi, i have similar idea/need within dbcook, although on a somewhat
higher level:
<pre>
cache_results/: (dbcook/SA) add-on for automaticaly-updated database
denormalisation caches of intermediate results, each one depending on
particular pattern of usage. Wishful syntax:

class SomeCacheKlas( Base):
fieldname = cache_aggregator( klas.field, AggrFilterCriteria)
#e.g.
#class Cache4averagePerson( Base):
#age = cache_agregators.Average( Person.age, Filter1 )
#salary = cache_agregators.Sum( Person.salary, Filter2 )
</pre>
i was thinking on using triggers and/or sql-functions but then this is
just one way to do it - and the idea is to hide the implementation.

Can your implementation be extended to use:
a) more complex aggregator expressions (e.g. average(), that is
sum() / count(), and similar dependencies)
b) more complex filters - e.g. not max() on all rows, but on some
subset/select

think of generic report calculations/aggregations, and putting those
into some cacheing table.
Eventualy getting the report being built on-the-run - distributing the
big wait over the atomary updates.


> http://www.mr-pc.kiev.ua/projects/SQLAlchemyAggregation
this gives me 404

Paul Colomiets

unread,
Aug 7, 2007, 3:38:50 PM8/7/07
to sqlal...@googlegroups.com

Paul Colomiets

unread,
Aug 7, 2007, 3:43:09 PM8/7/07
to sqlal...@googlegroups.com
Michael Bayer wrote:
>
> On Aug 7, 1:20 pm, Paul Colomiets <paul.colomi...@gmail.com> wrote:
>
>> 1. func.if_(...) - tries to sql function "if_(...)",
>> but func._if(...) - "_i(...)"
>> I believe is typo?
>>
>
> er, probably. func is trying to sidestep various "_" underscore
> attributes i think. do you need to say "_if(...)" ?

No I need to say "IF" but I can't say func.if() currently I use
getattr('func','if') but It's ugly.

It strips last char when sees first underscore. But should ether strip
first char, or check end of string for an underscore.

Patch attached :)

sqlalchemy_underscore.diff

Paul Colomiets

unread,
Aug 8, 2007, 5:18:24 AM8/8/07
to sqlal...@googlegroups.com
sdo...@sistechnology.com wrote:
> hi, i have similar idea/need within dbcook, although on a somewhat
> higher level:
> <pre>
> cache_results/: (dbcook/SA) add-on for automaticaly-updated database
> denormalisation caches of intermediate results, each one depending on
> particular pattern of usage. Wishful syntax:
>
> class SomeCacheKlas( Base):
> fieldname = cache_aggregator( klas.field, AggrFilterCriteria)
> #e.g.
> #class Cache4averagePerson( Base):
> #age = cache_agregators.Average( Person.age, Filter1 )
> #salary = cache_agregators.Sum( Person.salary, Filter2 )
> </pre>
> i was thinking on using triggers and/or sql-functions but then this is
> just one way to do it - and the idea is to hide the implementation.
>
>
I've seen that, but found no code.
Is it something only planned for dbcook?

> Can your implementation be extended to use:
> a) more complex aggregator expressions (e.g. average(), that is
> sum() / count(), and similar dependencies)
>
Definetly can. Now you can just use property for that:

avg = property(lambda self: self.sum / self.count)

I currently have no idea on how to make it better (less writing?)
Also documenting extension interface is in to do list.


> b) more complex filters - e.g. not max() on all rows, but on some
> subset/select
>

Yes. I'm looking for syntax for that. Functionality seems
very similar for `relation()` so may be
`a.Max(...,primaryjoin=...)` would do.

sdo...@sistechnology.com

unread,
Aug 9, 2007, 1:48:15 AM8/9/07
to sqlal...@googlegroups.com
On Wednesday 08 August 2007 12:18:24 Paul Colomiets wrote:
> sdo...@sistechnology.com wrote:
> > hi, i have similar idea/need within dbcook, although on a
> > somewhat higher level:
> > <pre>
> > cache_results/: (dbcook/SA) add-on for automaticaly-updated
> > database denormalisation caches of intermediate results, each one
> > depending on particular pattern of usage. Wishful syntax:
> >
> > class SomeCacheKlas( Base):
> > fieldname = cache_aggregator( klas.field,
> > AggrFilterCriteria) #e.g.
> > #class Cache4averagePerson( Base):
> > #age = cache_agregators.Average( Person.age, Filter1 )
> > #salary = cache_agregators.Sum( Person.salary, Filter2 )
> > </pre>
> > i was thinking on using triggers and/or sql-functions but then
> > this is just one way to do it - and the idea is to hide the
> > implementation.
>
> I've seen that, but found no code.
> Is it something only planned for dbcook?
yes, to-be-done, as a way to avoid DB to become denormalized in some
particular unplanned/uncontrolled way, that fits some reports and
screws all else. It takes me a great fight to convince users that
denormalization is out of app's bare model... its something u lay
over it.

> > Can your implementation be extended to use:
> > a) more complex aggregator expressions (e.g. average(), that is
> > sum() / count(), and similar dependencies)
>
> Definetly can. Now you can just use property for that:
> avg = property(lambda self: self.sum / self.count)

naaah, sorry, that was too easy. i mean more complex aggregation
functions... i guess it can, if u can do sum=sum+x, then u could do
sq=sq+x*x/2

> > b) more complex filters - e.g. not max() on all rows, but on
> > some subset/select
>
> Yes. I'm looking for syntax for that. Functionality seems
> very similar for `relation()` so may be
> `a.Max(...,primaryjoin=...)` would do.

hmmm... relation... yess, it is like a plural relation but getting a
singular result out of it, and never ever loading the items.
Are u using something along PropertyLoader?

ciao
svil

sdo...@sistechnology.com

unread,
Aug 9, 2007, 2:45:34 AM8/9/07
to sqlal...@googlegroups.com
some comments...

> 2. If there a way, to handle functions in cross-database manner?
> I've need to write something like this, for max function which
> exists on sqlite but absent in mysql (and if() visa-versa):

- u can at least get that 'if...mysql' out of the actual code, just
define a global if_func( usualargs, aggregator=None) and do swicthing
inside there. Or, better, define the if_func() as method on the
aggregator itself (that is, the mapperExtension?) thus separating
somewhat implementation from usage.

- Max.onupdate will recurse forever in the else:

- the min/max has to be redone not to use maxint (or any other
machine/value limit) - one 'if' more..

- why returning a dict if it never contains more than 1 value?
it clutters your code a lot... maybe 'return something' or None (or
raise something) would be better.. eventualy making some generic
updater like:

def do_updates( self, instance, table, fields, method, condition):
updates = {}
for f in fields:
r = getattr( f, method)( self, instance)
if r is not None: updates[ f.target.name] = r
table.update( condition, values=updates).execute()

ah, nevermind.

Paul Colomiets

unread,
Aug 9, 2007, 12:48:39 PM8/9/07
to sqlal...@googlegroups.com
sdo...@sistechnology.com wrote:
> some comments...

>
> - u can at least get that 'if...mysql' out of the actual code, just
> define a global if_func( usualargs, aggregator=None) and do swicthing
> inside there. Or, better, define the if_func() as method on the
> aggregator itself (that is, the mapperExtension?) thus separating
> somewhat implementation from usage.
>
Moved min and max to the aggregator. I woder if I can cache
database type in MapperExtension, or it can be changed
sometimes (e.g. ".tometadata()", or metadata.bind = eng) ?

> - Max.onupdate will recurse forever in the else:
>
Fixed

> - the min/max has to be redone not to use maxint (or any other
> machine/value limit) - one 'if' more..
>
Fixed with max(ifnull(a, b-1), b). Should be ok. There is no
"IF" in sqlite.

> - why returning a dict if it never contains more than 1 value?
> it clutters your code a lot... maybe 'return something' or None (or
> raise something) would be better.. eventualy making some generic
> updater like:
>
> def do_updates( self, instance, table, fields, method, condition):
> updates = {}
> for f in fields:
> r = getattr( f, method)( self, instance)
> if r is not None: updates[ f.target.name] = r
> table.update( condition, values=updates).execute()
>
Were done with average() in mind, which needs to update 2 fields.
Above solution don't work for that. I thought about passing
dict and allow mutating it, or yielding key,value pairs, but that
whould do even more cluttered code.
> ah, nevermind.
Thanks for great review!

Reply all
Reply to author
Forward
0 new messages