Design Question - How to simulate efficient joins


Mos

unread,
Oct 19, 2011, 2:28:09 PM10/19/11
to google-a...@googlegroups.com
Hello,

I've got a database-related design question. Let me explain it with an example:

I've got Entities of the kind "Article" with the following properties:
- title
- userId
- ....
- sumOfScore

SumOfScore should be the sum of all related "Score" entities, which have
properties like:
- articleId
- userId
- score

In Pseudo-SQL:   sumOfScore = select sum(score) from Score where score.articleId = article.id

I see two possibilities to design this:

1.)  No sumOfScore property on Article; always query instead:
This means: every time an article is read, I need to run a query for that specific article to calculate the sumOfScore.
Imagine a list of 100 articles shown to a user. This would need an additional 100 queries to the database, just
to show the score for each article.
Nevertheless: this would be my preferred way with a relational DB.  No redundancy and good normalization.
And with SQL you can use a single join-select to fetch all the data.
But it doesn't feel right for Google's BigTable.


2.) Calculate the sumOfScore whenever Score entities change:

This means: whenever a Score entity is added, removed or changed, the related Article updates its sumOfScore property.
Advantage:  No additional queries are needed when reading articles; the sumOfScore is stored redundantly on the entity itself.
Disadvantage:  Every time a score changes, there is one additional query and one additional write (updating the Article entity).  And sumOfScore may get out of sync with the actual Score entities (e.g. if a value is changed via the DB console).
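Option 2 can be sketched in a few lines. This is a plain-Python illustration only: dicts stand in for datastore entities, and in a real App Engine app the Score write and the Article update would need to happen in a single transaction.

```python
# Illustrative sketch of option 2: keep a denormalized sumOfScore on the
# Article and adjust it whenever a Score is written or removed. Plain
# dicts stand in for datastore entities; on App Engine the two writes
# below would belong in one transaction.

articles = {}  # article_id -> {"title": ..., "sumOfScore": ...}
scores = {}    # score_id -> {"articleId": ..., "userId": ..., "score": ...}

def put_score(score_id, article_id, user_id, value):
    """Add or change a Score and keep the Article's sumOfScore in sync."""
    old = scores[score_id]["score"] if score_id in scores else 0
    scores[score_id] = {"articleId": article_id,
                        "userId": user_id,
                        "score": value}
    articles[article_id]["sumOfScore"] += value - old

def remove_score(score_id):
    """Delete a Score and subtract its value from the Article's sum."""
    s = scores.pop(score_id)
    articles[s["articleId"]]["sumOfScore"] -= s["score"]
```

Reading an article then needs no extra query: the sum is already on the entity.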


What do more experienced people think?  Is there a common best practice for such a scenario?

Thanks a lot
Mos

voscausa

unread,
Oct 19, 2011, 5:19:32 PM10/19/11
to google-a...@googlegroups.com
How often articles are read, how they are read (by a user/client or by a task), and the number of articles are also major design criteria.  But in general, the second solution is the best: it is faster and more scalable.

Mos

unread,
Oct 20, 2011, 3:27:07 AM10/20/11
to google-a...@googlegroups.com
What do the JPA or JDO implementations do under the hood?  Variant 1) or 2)?

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To view this discussion on the web visit https://groups.google.com/d/msg/google-appengine/-/Nmg_yBkwf5UJ.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Johan Euphrosine

unread,
Oct 20, 2011, 6:00:30 AM10/20/11
to google-a...@googlegroups.com
Ikai Lan made a codelab at last year's Google I/O Bootcamp on how to
properly implement 2); you can find it here:
http://code.google.com/p/2011-datastore-bootcamp-codelab/downloads/detail?name=2011-datastore-bootcamp-codelab-v2.zip&can=2&q=

IIRC, it shows how to use sharded counters and task queues to
synchronize a denormalized score on user-submitted links.

Looks similar to what you are trying to do with "Article" and "scores".

Hope that helps.
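For reference, the sharded-counter idea can be sketched roughly as follows. This is plain Python standing in for datastore entities; the shard count and names are illustrative, not the codelab's actual code.

```python
import random

NUM_SHARDS = 20  # illustrative; in practice tuned to the expected write rate

# (article_id, shard_index) -> partial sum; each pair stands in for
# one counter-shard entity in the datastore
shards = {}

def add_score(article_id, delta):
    """Each write hits one randomly chosen shard, so concurrent score
    updates contend on different entities instead of a single one."""
    key = (article_id, random.randrange(NUM_SHARDS))
    shards[key] = shards.get(key, 0) + delta

def sum_of_score(article_id):
    """Reads fan out over all shards; in practice the total would be
    cached (e.g. in memcache) or refreshed by a task queue."""
    return sum(v for (aid, _), v in shards.items() if aid == article_id)
```

The point of the shards is write throughput: a single counter entity can only absorb roughly one transactional update per second, while N shards absorb roughly N.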


--
Johan Euphrosine (proppy)
Developer Programs Engineer
Google Developer Relations

Jose Montes de Oca

unread,
Oct 27, 2011, 3:49:28 PM10/27/11
to google-a...@googlegroups.com
Hi Mos,

You should design your data models as for a non-relational database, where having redundancy is just fine.

Plain and simple:

1/ will not scale; it is not suitable for App Engine's non-relational database (the Datastore).

2/ is the right approach; just be careful about what voscausa said.

Hope this helps!

Jose Montes de Oca

Jeff Schnitzer

unread,
Oct 27, 2011, 4:06:16 PM10/27/11
to google-a...@googlegroups.com
I'm going to be the dissenter here.  You should consider #1 if you know that you're always going to have on the order of 100 items.  You can easily cache this value in memcache.

When you get into thousands of items you start to reach the point at which it's not practical to do on-the-fly sums, even just to store the result in memcache.  But below thousands, you might as well just calculate the sum.
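The compute-and-cache approach could look roughly like this; a plain dict plays the role of memcache here, and real code would use the App Engine memcache API with a short expiry instead.

```python
# Sketch of option 1 plus caching: compute the sum on demand, cache it,
# and invalidate the cache on writes. A dict stands in for memcache.

cache = {}

def sum_of_score(article_id, scores):
    """Return the cached sum for an article, computing it from the
    Score entities on a cache miss."""
    key = "sum:" + article_id
    if key not in cache:
        cache[key] = sum(s["score"] for s in scores
                         if s["articleId"] == article_id)
    return cache[key]

def invalidate(article_id):
    """Call whenever a Score for this article changes, so the next
    read recomputes the sum."""
    cache.pop("sum:" + article_id, None)
```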

Pre-aggregating the sum is not necessarily trivial.  If the sum changes more than once per second you need to shard the sum.  Once you get to 100 shards, you've created a system that doesn't perform any better and is a hell of a lot more complicated.

One thing to watch out for is eventual consistency - if your sum needs to be instantaneously correct, a simple query (on the HRD) may provide lagged data.  On the other hand, if you are fetching your articles by key (and have strong consistency configured), you're ok.

Jeff


Jose Montes de Oca

unread,
Oct 27, 2011, 5:16:54 PM10/27/11
to google-a...@googlegroups.com

On Thursday, October 27, 2011 1:06:16 PM UTC-7, Jeff Schnitzer wrote:
> I'm going to be the dissenter here.  You should consider #1 if you know that you're always going to have on the order of 100 items.  You can easily cache this value in memcache.

IMO this depends on the size of the items: fetching 100 items and then traversing them to do a simple sum could add significant overhead at read time that could be avoided by doing the work at write time. (This is the App Engine strategy for queries.)

> When you get into thousands of items you start to reach the point at which it's not practical to do on-the-fly sums, even just to store the result in memcache.  But below thousands, you might as well just calculate the sum.
>
> Pre-aggregating the sum is not necessarily trivial.  If the sum changes more than once per second you need to shard the sum.  Once you get to 100 shards, you've created a system that doesn't perform any better and is a hell of a lot more complicated.
>
> One thing to watch out for is eventual consistency - if your sum needs to be instantaneously correct, a simple query (on the HRD) may provide lagged data.  On the other hand, if you are fetching your articles by key (and have strong consistency configured), you're ok.

If you do an ancestor query you will get consistent data.

Jeff Schnitzer

unread,
Oct 28, 2011, 2:38:27 AM10/28/11
to google-a...@googlegroups.com
On Thu, Oct 27, 2011 at 2:16 PM, Jose Montes de Oca <jfmont...@google.com> wrote:
> On Thursday, October 27, 2011 1:06:16 PM UTC-7, Jeff Schnitzer wrote:
>> I'm going to be the dissenter here.  You should consider #1 if you know that you're always going to have on the order of 100 items.  You can easily cache this value in memcache.
>
> IMO this depends on the size of the items: fetching 100 items and then traversing them to do a simple sum could add significant overhead at read time that could be avoided by doing the work at write time. (This is the App Engine strategy for queries.)

Sure, it depends on entity size... but the "typical" entity is not so large that this is an issue, and the OP's Score entity sounds tiny.  If he can cache for any length of time, computation is a non-issue.

It also depends on the change rate.  If the data changes frequently, it will likely be significantly cheaper and easier to calculate the sum ad hoc rather than pre-calculating it.  Especially if he doesn't need instantaneously accurate results.

> If you do an ancestor query you will get consistent data.

...but only if you have everything in one entity group.  Everything is a tradeoff.

Jeff