Denormalisation, magic, and is it really that useful?

2,511 views
Skip to first unread message

Andrew Godwin

unread,
Sep 22, 2008, 4:03:10 PM9/22/08
to Django developers
So, hello everyone. I figure this list is the best place to ask this,
but please feel free to deride me if not...

After all the talk of multiple databases, and non-relational databases
(bigtable, couchdb, etc.) that went on at DjangoCon and afterward,I've
been thinking about denormali[s/z]ation and how to make it easier in
Django; previously, I had thought denormalisation was something you did
to ruin your database, but I can now see it can be additive as well.

In this vein, last weekend I decided to see how easy it would be to
write a magical DenormalisationField which handled all the synchronising
of data from the relevant tables given a normal ForeignKey relationship;
the results are at
http://www.aeracode.org/2008/9/14/denormalisation-follies/

Basically, yes, it is possible, and a lot of people (eight! for my blog,
that's a lot) seem to like the idea; it certainly saves having to write
that replicate-on-save logic that otherwise goes with this kind of
schema design.

Given the fact that it seems like a reasonably common thing to do in
large-scale applications, I'd like to know what people think about
adding this kind of thing into core (while I have a standalone
implementation, that needs some hacks I'd like to see gone). My main
issues with this idea are:

1) It relies on having a ForeignKey pointing to the other model, so this
might not work for non-relational databases (well, unless ForeignKeys
are still allowed in that case, but they just perform expensive SELECTs
as well as possibly printing abusive messages to stderr).

2) I'm not sure how common a use case this is; it sounds like it might
be verging into the long tail a bit.

3) People also want other things denormalised - count()s, aggregate
queries, etc. Of course, writing fields for these too is similar, and
I'd be happy to do it, but it might mean that even if the case for
denormalisation is good this is only a piece of the puzzle.

So, I'd love people's opinions about where I should take this - to a
dark corner and hide it, to a separate app like django-extensions (which
involves keeping the horrid hacks in there), into a separate app but
with patches to core to make it less hackish (i.e. more signals), or add
it as my 1.1 pony with the proviso that I'm happy to write all the code.

Andrew

Marty Alchin

unread,
Sep 22, 2008, 4:16:47 PM9/22/08
to django-d...@googlegroups.com
On Mon, Sep 22, 2008 at 4:03 PM, Andrew Godwin <andrew...@gmail.com> wrote:
> So, I'd love people's opinions about where I should take this - to a
> dark corner and hide it, to a separate app like django-extensions (which
> involves keeping the horrid hacks in there), into a separate app but
> with patches to core to make it less hackish (i.e. more signals), or add
> it as my 1.1 pony with the proviso that I'm happy to write all the code.

Personally, I think it's best suited for a django-d13n (that's d13n
for denormali(s|z)ation) app that stands on its own. That way, you can
provide a host of other options as you see fit, have other people get
onboard and provide whatever they use, and so on.

I think there's definite merit to what you're doing, but I just don't
see it ever being in core, unless it requires some freakish internal
changes. If you do need changes to core, or you can at least frame
them in a way that would provide wider benefits, you'll definitely get
a fruitful discussion on it, at least.

-Gul

Simon Willison

unread,
Sep 22, 2008, 4:44:56 PM9/22/08
to Django developers
On Sep 22, 9:03 pm, Andrew Godwin <andrewgod...@gmail.com> wrote:
> 2) I'm not sure how common a use case this is; it sounds like it might
> be verging into the long tail a bit.

I think denormalisation is much more common than many people think. I
can think of a couple of reasons you might want to denormalise on
small sites and in relatively simple applications (especially since
small sites often run on low powered VPS hosting where performance
matters):

1. You're performing an expensive transformation on a bunch of markup
(e.g. running Markdown or similar) stored in a database table, and you
want to reduce the overhead by caching the generated HTML in a
"foo_html" field.

2. You're building /anything/ that includes "number of Xs" in a list
view - e.g. a forum that shows "number of replies" next to each
thread. It's impossible to do this in regular SQL without doing n+1
queries - using database aggregation just pushes the extra 'n' queries
in to the database in the form of subselects.

I believe denormalisation that is managed by the ORM itself represents
best practice for both of the above cases - and encouraging best
practice is what Django's all about. I'd love to see something like
this included in Django core.

Cheers,

Simon

Winsley von Spee

unread,
Sep 22, 2008, 5:30:45 PM9/22/08
to django-d...@googlegroups.com
Hi,

this would be very useful for a lot of use cases. View counts that
originally came from a foo viewed bar table with timestamps for every
viewer, or storing some sort of sum calculation from a list of
related objects. Basically all that stuff you would display in an
"overview table" of some sort for which you would normally use
aggregation wich gets very slow very fast on large data sets.

I have written a generic denormalizer to store whole data sets
in denormalized tables to speed up searching.

It basically consists of a Metaclass that constructs a Model out of a
values queryset and a base class that adds method to generate sql to
keep the table up to date. This way most of the time you only have
to write the values query set and for special cases you can still add
"custom" Fields if you have to.

The tables get updated periodically, but it would be quite possible to
do this with signals instantly.

A previous version even generated triggers to keep the tables updated,
but this was dropped afterwards due to other reasons.

Queryset refactor did help a lot to make this sort of stuff easier.
Only quirk right now is that you have to use extra and "force"
queryset.query to use aggregation wich you'll need very often when
denormalizing.

I think a django.contrib.denormalization app would be a very nice
thingy.

+1 from me ;).

Regards

Justin Fagnani

unread,
Sep 22, 2008, 7:21:04 PM9/22/08
to django-d...@googlegroups.com
I think any discussion of denormalization needs to include aggregation
as well, since most uses will involve more than simply looking up one
value form a related model. I also think this is something definitely
to include in core. In my experience at least, denormalization occurs
a lot and leaves a lot of room for mistakes, so it's something a
framework should handle if possible.

Ideally I'd like to see either a calculation field that can take
aggregation expressions and options about how to store and
synchronize.

I imaging something like this syntax:

class Order(models.Model):
# updates the order total if the line items change
subtotal = models.CalculationField(Sum('items.cost'), store=True, sync=True)

# doesn't update the order's customer name if the customer's name changes
customer = models.ForeignKey(Customer)
customer_name = models.CalculationField(F('customer.name'),
store=True, sync=False)

# don't store a rarely used value. calculate on select
weight = CalculationField(Sum('items.weight'), store=False)


This approach would require some changes to the ORM and aggregation,
but it'd be worth it since it'd make denormalization easy, flexible
and less error-prone. Fields would need to be able to contribute more
to queries and table creation, and we'd some simple need type
inference for expressions/aggregation. If calculation support is added
to all fields, or calculations have to have a declared type, then type
inference wouldn't be necessary, but type inference for aggregation
isn't hard.

-Justin

Simon Willison

unread,
Sep 22, 2008, 8:00:48 PM9/22/08
to Django developers
On Sep 23, 12:21 am, "Justin Fagnani" <justin.fagn...@gmail.com>
wrote:
> In my experience at least, denormalization occurs
> a lot and leaves a lot of room for mistakes, so it's something a
> framework should handle if possible.

Just so it's on the record, I'd like any denormalisation tools in
Django to include a mechanism for re-syncronizing them should
something go awry (like direct updates being applied to the database
without keeping the denormalised fields in sync). This mechanism could
then be exposed as a ./manage.py command which could be called
periodically to verify and fix any incorrect data.

David Cramer

unread,
Sep 23, 2008, 3:52:47 AM9/23/08
to Django developers
If you're not doing denormalization in your database, most likely
you're doing something wrong. I really like the approach that is
offered here.

For me, personally, it would be great if this could accept callables
as well. So you could store the username, like so, or you could store
a choices field like:

field = models.IntegerField(choices=CHOICES)
denorm = models.DenormField('self', 'get_field_display') # which
would rock if it was .field.display ;)

You could also make it somehow accept querysets or something similar
for things like count(). I see a lot of possibilities and am a bit
disappointed I didn't come up with something this easy for my use-
cases.

+1 from me!

Andrew Godwin

unread,
Sep 23, 2008, 5:42:40 AM9/23/08
to django-d...@googlegroups.com
David Cramer wrote:
> If you're not doing denormalization in your database, most likely
> you're doing something wrong. I really like the approach that is
> offered here.
>
> For me, personally, it would be great if this could accept callables
> as well. So you could store the username, like so, or you could store
> a choices field like:
>
> field = models.IntegerField(choices=CHOICES)
> denorm = models.DenormField('self', 'get_field_display') # which
> would rock if it was .field.display ;)
>
> You could also make it somehow accept querysets or something similar
> for things like count(). I see a lot of possibilities and am a bit
> disappointed I didn't come up with something this easy for my use-
> cases.
>

The key is making sure you can listen for changes on whatever's at the
other end of your denormalisation. With my current snippet, it listens
for a save on the model the foreignkey points to, then checks for the
right ID; if we start accepting random querysets, then there has to be a
way to resolve that back to conditions the signal listener can understand.

Still, with an
AggregateField(Sandwiches.filter(filling="cheese").count()) it's still
possible to work out that you want to listen on the Sandwiches model,
and you could then fall back to re-running the count on every Sandwich
save, even if it ends up not having a cheese filling.

So, I think the best approach would be one to replicate fields (like my
current DenormField; perhaps call it CopyField or something) and one to
cache aggregates (an AggregateField, like above).

Simon Willison wrote:
> Just so it's on the record, I'd like any denormalisation tools in
> Django to include a mechanism for re-syncronizing them should
> something go awry (like direct updates being applied to the database
> without keeping the denormalised fields in sync). This mechanism could
> then be exposed as a ./manage.py command which could be called
> periodically to verify and fix any incorrect data.

Yes, this I very much agree with. The reason you always layer this stuff
on top of a pre-normalised database is because you can then rebuild the
data after playing with it externally.

Doing so shouldn't be too much of a problem; have a management command
that loads the models, and then just executes the update method on each
of the denormalisationalish fields.

Justin's idea of lazy updating is interesting, and probably quite doable
(as well as what most people will want by default on aggregate queries).

I'm also hoping any kind of aggregate denormalisation will work with any
future extended aggregate support, but if the field just takes a normal
QuerySet, that might Just Work™.

Andrew

Steve Holden

unread,
Sep 23, 2008, 8:44:59 AM9/23/08
to django-d...@googlegroups.com
This appears to be a proposal to re-implement triggers inside Django.

I can see there are benefits if the underlying DB platform won't support
triggers, but wouldn't triggers be the preferred solution when they're
available? That way there is no chance that changes can be made outside
the scope of the denormalization, and hence no need to recompute the
denormalized values.

regards
Steve

Andrew Godwin

unread,
Sep 23, 2008, 8:56:30 AM9/23/08
to django-d...@googlegroups.com
Yes, Steve, it's true that triggers do have much of the same
functionality as this kind of proposal, but as you say, part of the
proposal is to make these hopefully work with future non-SQL databases
(although, admittedly, that's only a small piece of the puzzle).

To be honest, my main drive is to have a nice, easy denormalisation API
for Django; whether this means implementing triggers in all the database
backends and using them behind this kind of API remains to be seen. The
problems I can see with this at the moment are that:

1) If you change a DenormField, there's less writing
migrations/fiddling with the database than if we managed it all in
Django code (where you could just resync)
2) Triggers are a whole part of SQL Django just hasn't touched yet, and
implementing support for them will take a reasonable amount of work in
both debugging and testing for each backend.

I'd really like to go ahead and implement this stuff 'vanilla' at first,
relying on the tried-and-tested field and signal code, and then if it
gets popular, add in the option to run denormalisations using triggers
rather than inside Django. That way, we eventually end up with the best
of both worlds, and get people using more efficient schemas first, which
they can then easily bake into triggers when they come along, and if
their DB supports them.

Andrew

Marty Alchin

unread,
Sep 23, 2008, 9:00:40 AM9/23/08
to django-d...@googlegroups.com
On Tue, Sep 23, 2008 at 8:44 AM, Steve Holden <hold...@gmail.com> wrote:
> This appears to be a proposal to re-implement triggers inside Django.

I suppose it is. But then, perhaps triggers were a re-implementation
of application-based denormalization. ... The chicken? The egg? The
world may never know. :)

> I can see there are benefits if the underlying DB platform won't support
> triggers, but wouldn't triggers be the preferred solution when they're
> available? That way there is no chance that changes can be made outside
> the scope of the denormalization, and hence no need to recompute the
> denormalized values.

Perhaps the biggest downside is that distributed applications can't
make use of it, since triggers aren't cross-platform. Sure, you could
argue that denormalization is best suited for individual projects,
rather than distributed apps, but I could see uses for it. Imagine a
full-blown forum app that wants to denormalize its post counts.

Without some Python-based approach, all I could see is maybe adding a
cross-platform "create trigger" API (ugh) to Django, which an
application could then use to set up its triggers during syncdb.
Otherwise, something like that forum app would have to implement a
trigger for all available backends or just ship with instructions on
how to set it up yourself.

That said, you're right, I would think it's usually safest to put that
stuff directly in the db. That way, if you have other applications
using other languages or perhaps just other ORMs, everything's all
intact. I'm pretty sure I've seen arguments on both sides of the
issue, though, so I'm not sure there's any objective answer to it.

-Gul

Simon Willison

unread,
Sep 23, 2008, 10:04:24 AM9/23/08
to Django developers
On Sep 23, 2:00 pm, "Marty Alchin" <gulop...@gamemusic.org> wrote:
> Without some Python-based approach, all I could see is maybe adding a
> cross-platform "create trigger" API (ugh) to Django, which an
> application could then use to set up its triggers during syncdb.
> Otherwise, something like that forum app would have to implement a
> trigger for all available backends or just ship with instructions on
> how to set it up yourself.

Triggers also aren't supported in Drizzle ( https://launchpad.net/drizzle
), which Django will probably want to target at some point.

Cheers,

Simon

David Durham, Jr.

unread,
Sep 23, 2008, 11:37:06 AM9/23/08
to django-d...@googlegroups.com
On Tue, Sep 23, 2008 at 7:44 AM, Steve Holden <hold...@gmail.com> wrote:
>
> This appears to be a proposal to re-implement triggers inside Django.
>
> I can see there are benefits if the underlying DB platform won't support
> triggers, but wouldn't triggers be the preferred solution when they're
> available? That way there is no chance that changes can be made outside
> the scope of the denormalization, and hence no need to recompute the
> denormalized values.

It would be nice to have an option to make atomic changes to the
normalized and denormalized tables.

-Dave

Justin Fagnani

unread,
Sep 23, 2008, 5:19:14 PM9/23/08
to django-d...@googlegroups.com
On Tue, Sep 23, 2008 at 12:52 AM, David Cramer <dcr...@gmail.com> wrote:>
> For me, personally, it would be great if this could accept callables
> as well. So you could store the username, like so, or you could store
> a choices field like:
>
> field = models.IntegerField(choices=CHOICES)
> denorm = models.DenormField('self', 'get_field_display') # which
> would rock if it was .field.display ;)

I think denormalizing with callables is a very different thing than
denormalizing with expressions that can be evaluated by the database.
Not that they both aren't worth supporting, but db-level expressions
are going to be far easier and more efficient to validate and update
in bulk - no looping in Python, just executing SQL. In this case, I
think your example would be better suited as an FK, for denorm
purposes.

Callables would be useful for something more complicated like
abstracting auto_now to not be limited to dates, that is allowing a
field value to be set by a callable on save, not just create, for any
field type.


On Tue, Sep 23, 2008 at 2:42 AM, Andrew Godwin <andrew...@gmail.com> wrote:
> Still, with an
> AggregateField(Sandwiches.filter(filling="cheese").count()) it's still
> possible to work out that you want to listen on the Sandwiches model,
> and you could then fall back to re-running the count on every Sandwich
> save, even if it ends up not having a cheese filling.

I'm not sure I like the idea of accepting arbitrary QuerySets. It
could just be my point-of-view, but I see automatic denormalization
and aggregation as intimately tied, where a denormalized field is just
a declarative aggregate expression that's optionally cached in a
column. I think this makes it easy to understand and document, since
aggregation queries and calculation fields would support the same
features, and it also allows the implementation to share a lot with
aggregates. It's also better in terms of storing and updating a
calculation: you can calculate on reads or writes for N objects in one
query without N subqueries, though it may involve a lot of joins.

> So, I think the best approach would be one to replicate fields (like my
> current DenormField; perhaps call it CopyField or something) and one to
> cache aggregates (an AggregateField, like above).

I'd also be hesitant to have two separate fields for these cases,
since copying a related field value is just a simple SQL expression. I
think the same calculation field could be used for both, by assuming a
string is a field name:

name = CalculationField('user.name')

or by using F expressions:

name = CalculationField(F('user.name'))


Another approach to calculations, that doesn't necessarily involve
denormalization, would be to use views. Russell talked to me a bit
about this at djangocon, and I think the idea was that if you solve
basing models on views (Isn't it actually possible now? Maybe we need
read-only fields), and then have view creation support in the ORM,
then calculation fields can be implemented with views. I see that
un-stored calculations are re-implementing views, but I don't know
enough about views to know whether they have some performance
advantages over embedding the calculation in a query.

-Justin

David Cramer

unread,
Sep 24, 2008, 3:45:04 AM9/24/08
to Django developers
I just wanted to touch on this point:

models.IntegerField(default=0) does not translate to INT(11) default
0; in SQL :)

Django doesn't nescesarily try to interact with the database properly
when it should, because it requires too much thinking (and more code
that hasn't been written).

I would say ignore triggers on the DB level, until they've been
written in the framework.

David Cramer

unread,
Sep 24, 2008, 3:46:33 AM9/24/08
to Django developers
Denormalized for me personally happens a LOT when it's not just a
simple foreign key. Maybe I want to store a many to many field as
name,name,name in a column (tags). We're already inside of Django so
it should already be available (it's not going to be much slower than
it is executing the subquery in sql to get the denormalized output).

On Sep 23, 4:19 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
> On Tue, Sep 23, 2008 at 12:52 AM, David Cramer <dcra...@gmail.com> wrote:>
> > For me, personally, it would be great if this could accept callables
> > as well. So you could store the username, like so, or you could store
> > a choices field like:
>
> >    field = models.IntegerField(choices=CHOICES)
> >    denorm = models.DenormField('self', 'get_field_display') # which
> > would rock if it was .field.display ;)
>
> I think denormalizing with callables is a very different thing than
> denormalizing with expressions that can be evaluated by the database.
> Not that they both aren't worth supporting, but db-level expressions
> are going to be far easier and more efficient to validate and update
> in bulk - no looping in Python, just executing SQL. In this case, I
> think your example would be better suited as an FK, for denorm
> purposes.
>
> Callables would be useful for something more complicated like
> abstracting auto_now to not be limited to dates, that is allowing a
> field value to be set by a callable on save, not just create, for any
> field type.
>

Andrew Godwin

unread,
Sep 25, 2008, 9:08:29 AM9/25/08
to django-d...@googlegroups.com
David Cramer wrote:
> I would say ignore triggers on the DB level, until they've been
> written in the framework.
>

Yes, this was essentially my point earlier; triggers would be nice to
have from a consistency point of view, but it will be easier and quicker
to reimplement them in Python, not least because it allows more complex
functionality to be put straight in if we want it in the future. Then,
we can always extend it so that some types of fields are implemented
directly as triggers if you want once the API is settled.

With all this in mind, my API proposal would be something like this:

A CopiedField, called like so: user_username =
models.CopiedField("user", "username") [here, "user" is the name of
the ForeignKey to link through]

An AggregateField: post_count =
models.AggregateField(Post.objects.filter(visible=True), "count")

The CopiedField essentially works how my current DenormField does, but
using a foreignkey to avoid you having to specify both it _and_ the
model if you have more than one to the same model (a reasonable amount
of the DenormField code is just for finding which foreignkey to link
through, anyway)

The AggregateField takes a simple queryset (from which it can extract
the model to hook a signal onto), and whenever something in the queryset
is saved (so, either an extra select on each save to see what items are
in the queryset, or just updating the aggregate every time - possibly
with either selectable, since the cost might be higher on either), and
it also takes the function to run on the set - either a builtin name
like "count" here, or an actual function which takes a queryset and
returns a result (in this situation the type of the underlying column
will also have to be specified).

This should probably cover most cases, and has the advantage of people
being able to specify their own in-python aggregate functions (such as
working out the standard deviation of the users' ages) if they tell us
what kind of field to sling the result in.

I'm happy to change the field names, too, CopiedField especially is a
little clunky.

Andrew

David Cramer

unread,
Sep 25, 2008, 9:11:40 AM9/25/08
to django-d...@googlegroups.com
CopiedField sounds a bit off, but otherwise I like the proposed additions.
--
David Cramer
Director of Technology
iBegin
http://www.ibegin.com/

Alex Koshelev

unread,
Sep 26, 2008, 11:45:00 AM9/26/08
to Django developers
Hi, guys!

For a long time I have thoughts to make composition/denormalisation a
little bit easier and reusable. But I have no time to implement my
ideas.
Inspired by Anrew's blog post and its thread I recently wrote some
code. And I think that it really may be useful and cover most of
denormalisation use cases in Django.

And this is an example of Anrew's code but with my CompositionField:

class Picture(models.model):
user = models.ForeignKey(User)
user_username = CompositionField(
native=models.CharField(max_length=150),
trigger=dict(
sender_model=User,
field_holder_getter=lambda user:
user.picture_set.all(),
do=lambda picture, user, signal:
user.username
)
)
image = models.ImageField(upload_to="foo")
title = models.CharField(max_length=100)

Its more verbose but very generic and can be customized. And free
bonus - auto-generated `update_user_username` method.

The code available here [1]. Its very generic but allows to write more
high-level subclasses with some introspection.

Today I've written the blog post about it and my future suggestions.
It is in Russian[2], here is a google-translated version [3] (but with
broken code blocks) that I think can help to understand some concepts.
And it holds more usage examples.

I have many new ideas of how my solution can be improved and I what to
hear you opinions about it.

Thanks.

[1]: http://svn.turbion.org/turbion/trunk/turbion/utils/composition.py
[2]: http://webnewage.org/post/2008/9/26/krasivaya-kompozitsiya/
[3]:
http://translate.google.com/translate?u=http%3A%2F%2Fwebnewage.org%2Fpost%2F2008%2F9%2F26%2Fkrasivaya-kompozitsiya%2F%23comment_720&hl=en&ie=UTF-8&sl=ru&tl=en

On Sep 23, 12:03 am, Andrew Godwin <andrewgod...@gmail.com> wrote:
> So, hello everyone. I figure this list is the best place to ask this,
> but please feel free to deride me if not...
>
> After all the talk of multiple databases, and non-relational databases
> (bigtable, couchdb, etc.) that went on at DjangoCon and afterward,I've
> been thinking about denormali[s/z]ation and how to make it easier in
> Django; previously, I had thought denormalisation was something you did
> to ruin your database, but I can now see it can be additive as well.
>
> In this vein, last weekend I decided to see how easy it would be to
> write a magical DenormalisationField which handled all the synchronising
> of data from the relevant tables given a normal ForeignKey relationship;
> the results are athttp://www.aeracode.org/2008/9/14/denormalisation-follies/

Erik Allik

unread,
Sep 26, 2008, 12:56:39 PM9/26/08
to django-d...@googlegroups.com
I was just wondering.. Why all this abstraction?

Why do we need a separate field for denormalization? Can't we just use
regular fields and simply set up denormalization in a procedural way
in the constructor? All that needs to be done to create a denormalized
field is connect a few signals maybe or do some expensive compuation
after modifying a field. But maybe I'm missing something crucial.

Erik

Alex Koshelev

unread,
Sep 26, 2008, 4:38:14 PM9/26/08
to Django developers
Hi, Erik.

The main purpose is to have declarative form of composition field
calculation definition. Not to write imperative actions/signal
connection/etc.

I've made only flexible generic solution. In future I plan to write
high-level subclasses that will can with minimal input parameters make
all work very comfortable for developers.
And then there will be no need to write all denormalisation
functionality once again and again, from project to project.
Reply all
Reply to author
Forward
0 new messages