After all the talk of multiple databases, and non-relational databases
(bigtable, couchdb, etc.) that went on at DjangoCon and afterward,I've
been thinking about denormali[s/z]ation and how to make it easier in
Django; previously, I had thought denormalisation was something you did
to ruin your database, but I can now see it can be additive as well.
In this vein, last weekend I decided to see how easy it would be to
write a magical DenormalisationField which handled all the synchronising
of data from the relevant tables given a normal ForeignKey relationship;
the results are at
http://www.aeracode.org/2008/9/14/denormalisation-follies/
Basically, yes, it is possible, and a lot of people (eight! for my blog,
that's a lot) seem to like the idea; it certainly saves having to write
that replicate-on-save logic that otherwise goes with this kind of
schema design.
Given the fact that it seems like a reasonably common thing to do in
large-scale applications, I'd like to know what people think about
adding this kind of thing into core (while I have a standalone
implementation, that needs some hacks I'd like to see gone). My main
issues with this idea are:
1) It relies on having a ForeignKey pointing to the other model, so this
might not work for non-relational databases (well, unless ForeignKeys
are still allowed in that case, but they just perform expensive SELECTs
as well as possibly printing abusive messages to stderr).
2) I'm not sure how common a use case this is; it sounds like it might
be verging into the long tail a bit.
3) People also want other things denormalised - count()s, aggregate
queries, etc. Of course, writing fields for these too is similar, and
I'd be happy to do it, but it might mean that even if the case for
denormalisation is good this is only a piece of the puzzle.
So, I'd love people's opinions about where I should take this - to a
dark corner and hide it, to a separate app like django-extensions (which
involves keeping the horrid hacks in there), into a separate app but
with patches to core to make it less hackish (i.e. more signals), or add
it as my 1.1 pony with the proviso that I'm happy to write all the code.
Andrew
Personally, I think it's best suited for a django-d13n (that's d13n
for denormali(s|z)ation) app that stands on its own. That way, you can
provide a host of other options as you see fit, have other people get
onboard and provide whatever they use, and so on.
I think there's definite merit to what you're doing, but I just don't
see it ever being in core, unless it requires some freakish internal
changes. If you do need changes to core, or you can at least frame
them in a way that would provide wider benefits, you'll definitely get
a fruitful discussion on it, at least.
-Gul
this would be very useful for a lot of use cases. View counts that
originally came from a foo viewed bar table with timestamps for every
viewer, or storing some sort of sum calculation from a list of
related objects. Basically all that stuff you would display in an
"overview table" of some sort for which you would normally use
aggregation wich gets very slow very fast on large data sets.
I have written a generic denormalizer to store whole data sets
in denormalized tables to speed up searching.
It basically consists of a Metaclass that constructs a Model out of a
values queryset and a base class that adds method to generate sql to
keep the table up to date. This way most of the time you only have
to write the values query set and for special cases you can still add
"custom" Fields if you have to.
The tables get updated periodically, but it would be quite possible to
do this with signals instantly.
A previous version even generated triggers to keep the tables updated,
but this was dropped afterwards due to other reasons.
Queryset refactor did help a lot to make this sort of stuff easier.
Only quirk right now is that you have to use extra and "force"
queryset.query to use aggregation wich you'll need very often when
denormalizing.
I think a django.contrib.denormalization app would be a very nice
thingy.
+1 from me ;).
Regards
Ideally I'd like to see either a calculation field that can take
aggregation expressions and options about how to store and
synchronize.
I imaging something like this syntax:
class Order(models.Model):
# updates the order total if the line items change
subtotal = models.CalculationField(Sum('items.cost'), store=True, sync=True)
# doesn't update the order's customer name if the customer's name changes
customer = models.ForeignKey(Customer)
customer_name = models.CalculationField(F('customer.name'),
store=True, sync=False)
# don't store a rarely used value. calculate on select
weight = CalculationField(Sum('items.weight'), store=False)
This approach would require some changes to the ORM and aggregation,
but it'd be worth it since it'd make denormalization easy, flexible
and less error-prone. Fields would need to be able to contribute more
to queries and table creation, and we'd some simple need type
inference for expressions/aggregation. If calculation support is added
to all fields, or calculations have to have a declared type, then type
inference wouldn't be necessary, but type inference for aggregation
isn't hard.
-Justin
The key is making sure you can listen for changes on whatever's at the
other end of your denormalisation. With my current snippet, it listens
for a save on the model the foreignkey points to, then checks for the
right ID; if we start accepting random querysets, then there has to be a
way to resolve that back to conditions the signal listener can understand.
Still, with an
AggregateField(Sandwiches.filter(filling="cheese").count()) it's still
possible to work out that you want to listen on the Sandwiches model,
and you could then fall back to re-running the count on every Sandwich
save, even if it ends up not having a cheese filling.
So, I think the best approach would be one to replicate fields (like my
current DenormField; perhaps call it CopyField or something) and one to
cache aggregates (an AggregateField, like above).
Simon Willison wrote:
> Just so it's on the record, I'd like any denormalisation tools in
> Django to include a mechanism for re-syncronizing them should
> something go awry (like direct updates being applied to the database
> without keeping the denormalised fields in sync). This mechanism could
> then be exposed as a ./manage.py command which could be called
> periodically to verify and fix any incorrect data.
Yes, this I very much agree with. The reason you always layer this stuff
on top of a pre-normalised database is because you can then rebuild the
data after playing with it externally.
Doing so shouldn't be too much of a problem; have a management command
that loads the models, and then just executes the update method on each
of the denormalisationalish fields.
Justin's idea of lazy updating is interesting, and probably quite doable
(as well as what most people will want by default on aggregate queries).
I'm also hoping any kind of aggregate denormalisation will work with any
future extended aggregate support, but if the field just takes a normal
QuerySet, that might Just Work™.
Andrew
To be honest, my main drive is to have a nice, easy denormalisation API
for Django; whether this means implementing triggers in all the database
backends and using them behind this kind of API remains to be seen. The
problems I can see with this at the moment are that:
1) If you change a DenormField, there's less writing
migrations/fiddling with the database than if we managed it all in
Django code (where you could just resync)
2) Triggers are a whole part of SQL Django just hasn't touched yet, and
implementing support for them will take a reasonable amount of work in
both debugging and testing for each backend.
I'd really like to go ahead and implement this stuff 'vanilla' at first,
relying on the tried-and-tested field and signal code, and then if it
gets popular, add in the option to run denormalisations using triggers
rather than inside Django. That way, we eventually end up with the best
of both worlds, and get people using more efficient schemas first, which
they can then easily bake into triggers when they come along, and if
their DB supports them.
Andrew
I suppose it is. But then, perhaps triggers were a re-implementation
of application-based denormalization. ... The chicken? The egg? The
world may never know. :)
> I can see there are benefits if the underlying DB platform won't support
> triggers, but wouldn't triggers be the preferred solution when they're
> available? That way there is no chance that changes can be made outside
> the scope of the denormalization, and hence no need to recompute the
> denormalized values.
Perhaps the biggest downside is that distributed applications can't
make use of it, since triggers aren't cross-platform. Sure, you could
argue that denormalization is best suited for individual projects,
rather than distributed apps, but I could see uses for it. Imagine a
full-blown forum app that wants to denormalize its post counts.
Without some Python-based approach, all I could see is maybe adding a
cross-platform "create trigger" API (ugh) to Django, which an
application could then use to set up its triggers during syncdb.
Otherwise, something like that forum app would have to implement a
trigger for all available backends or just ship with instructions on
how to set it up yourself.
That said, you're right, I would think it's usually safest to put that
stuff directly in the db. That way, if you have other applications
using other languages or perhaps just other ORMs, everything's all
intact. I'm pretty sure I've seen arguments on both sides of the
issue, though, so I'm not sure there's any objective answer to it.
-Gul
It would be nice to have an option to make atomic changes to the
normalized and denormalized tables.
-Dave
I think denormalizing with callables is a very different thing than
denormalizing with expressions that can be evaluated by the database.
Not that they both aren't worth supporting, but db-level expressions
are going to be far easier and more efficient to validate and update
in bulk - no looping in Python, just executing SQL. In this case, I
think your example would be better suited as an FK, for denorm
purposes.
Callables would be useful for something more complicated like
abstracting auto_now to not be limited to dates, that is allowing a
field value to be set by a callable on save, not just create, for any
field type.
On Tue, Sep 23, 2008 at 2:42 AM, Andrew Godwin <andrew...@gmail.com> wrote:
> Still, with an
> AggregateField(Sandwiches.filter(filling="cheese").count()) it's still
> possible to work out that you want to listen on the Sandwiches model,
> and you could then fall back to re-running the count on every Sandwich
> save, even if it ends up not having a cheese filling.
I'm not sure I like the idea of accepting arbitrary QuerySets. It
could just be my point-of-view, but I see automatic denormalization
and aggregation as intimately tied, where a denormalized field is just
a declarative aggregate expression that's optionally cached in a
column. I think this makes it easy to understand and document, since
aggregation queries and calculation fields would support the same
features, and it also allows the implementation to share a lot with
aggregates. It's also better in terms of storing and updating a
calculation: you can calculate on reads or writes for N objects in one
query without N subqueries, though it may involve a lot of joins.
> So, I think the best approach would be one to replicate fields (like my
> current DenormField; perhaps call it CopyField or something) and one to
> cache aggregates (an AggregateField, like above).
I'd also be hesitant to have two separate fields for these cases,
since copying a related field value is just a simple SQL expression. I
think the same calculation field could be used for both, by assuming a
string is a field name:
name = CalculationField('user.name')
or by using F expressions:
name = CalculationField(F('user.name'))
Another approach to calculations, that doesn't necessarily involve
denormalization, would be to use views. Russell talked to me a bit
about this at djangocon, and I think the idea was that if you solve
basing models on views (Isn't it actually possible now? Maybe we need
read-only fields), and then have view creation support in the ORM,
then calculation fields can be implemented with views. I see that
un-stored calculations are re-implementing views, but I don't know
enough about views to know whether they have some performance
advantages over embedding the calculation in a query.
-Justin
Yes, this was essentially my point earlier; triggers would be nice to
have from a consistency point of view, but it will be easier and quicker
to reimplement them in Python, not least because it allows more complex
functionality to be put straight in if we want it in the future. Then,
we can always extend it so that some types of fields are implemented
directly as triggers if you want once the API is settled.
With all this in mind, my API proposal would be something like this:
A CopiedField, called like so: user_username =
models.CopiedField("user", "username") [here, "user" is the name of
the ForeignKey to link through]
An AggregateField: post_count =
models.AggregateField(Post.objects.filter(visible=True), "count")
The CopiedField essentially works how my current DenormField does, but
using a foreignkey to avoid you having to specify both it _and_ the
model if you have more than one to the same model (a reasonable amount
of the DenormField code is just for finding which foreignkey to link
through, anyway)
The AggregateField takes a simple queryset (from which it can extract
the model to hook a signal onto), and whenever something in the queryset
is saved (so, either an extra select on each save to see what items are
in the queryset, or just updating the aggregate every time - possibly
with either selectable, since the cost might be higher on either), and
it also takes the function to run on the set - either a builtin name
like "count" here, or an actual function which takes a queryset and
returns a result (in this situation the type of the underlying column
will also have to be specified).
This should probably cover most cases, and has the advantage of people
being able to specify their own in-python aggregate functions (such as
working out the standard deviation of the users' ages) if they tell us
what kind of field to sling the result in.
I'm happy to change the field names, too, CopiedField especially is a
little clunky.
Andrew
Why do we need a separate field for denormalization? Can't we just use
regular fields and simply set up denormalization in a procedural way
in the constructor? All that needs to be done to create a denormalized
field is connect a few signals maybe or do some expensive compuation
after modifying a field. But maybe I'm missing something crucial.
Erik