Best Practice For Trivial Example (No JOINs)

3 views
Skip to first unread message

Martin Sojka

unread,
Apr 10, 2008, 2:13:07 AM4/10/08
to Google App Engine
What's the best way to approach following trivial example in the
appengine?

Assume that we are coding blog application and each blog entry has
many comments, each comment has one author. On the blog entry page we
want to list all of the comments of that entry plus each comment's
author's name. In the SQL we'd use simple JOIN to do the job and fetch
the comments together with the respective authors at once.

There are no joins in GQL so what's the best practice here?

1. Fetch comments and then fetch the authors one-by-one as the
comments are rendered? Remember there could be tens of comments for
each entry. Doesn't sound too efficient.
2. Fetch comments and then fetch all the authors in one query..
matching the list of authors' keys. Is this even possible in GQL -
something like 'select from User where key in (..., ..., ...)' ?
3. Cache author's name for each comment during the comment's creation?
Remember that if the respective User decides to update his name, all
of her comments need to be updated as well.
4. Other ideas?

Brett Morgan

unread,
Apr 10, 2008, 3:19:00 AM4/10/08
to google-a...@googlegroups.com
Option # 3. Remember, the number of read requests for authors
outnumber the number of times an author changes his name by several
orders of magnitude.

deceiver

unread,
Apr 10, 2008, 4:22:10 AM4/10/08
to Google App Engine
A good way structure db's for big table is to pick the ugliest
solution and that's the one that will work the best. It's a very
strange beast indeed.

Brett Morgan

unread,
Apr 10, 2008, 4:44:11 AM4/10/08
to google-a...@googlegroups.com
It's only ugly to the people who are still wedded to the idea we got
taught in uni that a fully normalised database is pretty.

When I hit industry, I was thrown in the deep end trying to keep high
traffic database backed websites up, the first thing i learned was
that joins were evil. In fact, anything that wasn't just a single
record pull by id from an indexed field hurt.

So we learned to have a normalised edit database, and a fully
denormalised published database that had a record for each page we
were serving. Yes this meant our published database was several orders
of magnitude larger than our edit database. But it was fast. And
that's all that counted.

So, by now, a decade later, I look at a fully normalised database, and
remember the fresh faced uni grads who brought our website down by
adding a small chunk to our main page that required six joins to
render. From memory at least one of those joins required a full table
scan.

Think of it as a space time trade off. We burn disk space to optimise
web page rendering time. When disks are cheap and page rendering time
is gold...

deceiver

unread,
Apr 10, 2008, 5:27:52 AM4/10/08
to Google App Engine
I'm self taught, so I can't say what people think when they are
"properly" schooled. I personally have never had a reason to run a
join except when exporting data for reports, so I see where you are
coming from.

worst I've seen was a regular expression that was used to search
through text blogs, once for each word in the search term and then in
every possible combination of order, done with a php loop tossing out
queries as fast as it could go.
> >  > 4. Other ideas?- Hide quoted text -
>
> - Show quoted text -

Brett Morgan

unread,
Apr 10, 2008, 5:34:00 AM4/10/08
to google-a...@googlegroups.com
On Thu, Apr 10, 2008 at 7:27 PM, deceiver <andy.st...@gmail.com> wrote:
>
> I'm self taught, so I can't say what people think when they are
> "properly" schooled. I personally have never had a reason to run a
> join except when exporting data for reports, so I see where you are
> coming from.

Even though i attended uni, I spent more time arguing with lecturers
than actually learning course material. Something about having started
programming under my dad's tutelage at age 4. I have programmed
everything from vz200s in basic through to AS/400s in RPG/400. And
that was before i wound up at uni, beating up on Haskell. Oh what
days.

But yeah. At the end of the day, it's all about understanding why
things are. Fully normalised databases make sense when you have a
reasonably high ratio of writes to reads. This makes sense because
normalised databases are optimised for the write case. Those of us who
really are self taught learn that emotional beliefs really doesn't
have a place when you are fighting with a cpu =)

> worst I've seen was a regular expression that was used to search
> through text blogs, once for each word in the search term and then in
> every possible combination of order, done with a php loop tossing out
> queries as fast as it could go.

I think i need a strong drink now...

Martin Sojka

unread,
Apr 10, 2008, 6:08:13 AM4/10/08
to Google App Engine
Thanks for great points Brett.. makes sense ;)
I'm also learning -in my own current apps - that it's best to offload
the reads by doing more work in the writes.

On Apr 10, 11:34 am, "Brett Morgan" <brett.mor...@gmail.com> wrote:

Brett Morgan

unread,
Apr 10, 2008, 6:20:21 AM4/10/08
to google-a...@googlegroups.com
I'm glad my rambling makes sense. Means I'm not totally crazy.

Yet =)

btoc

unread,
Apr 11, 2008, 11:06:40 PM4/11/08
to Google App Engine
You need to be careful here. Using option 3 might work but what if you
have a "live author" value that you need to show. For example consider
the case where you want to show the total number of comments created
by that author. The *ONLY* way to do this is by going after the author
object. It just could not scale to update this in all the comment
entities. If you have 1000 authors each with 100 comments and they are
all actively adding comments then you would need to update the comment
count in 100 * 1000 entities ..... not good. This is just a small
example.

Why not store the author key in each comment entity. Then when you
query for the comments pull all the author keys and use db.get(keys).

I'm not sure if this scales either but you really need to consider the
problem domain before blindly copying data everywhere.

Much more consideration of the design needs to be considered in the
application rather than the datastore, somewhat bucking the
traditional SQL database model. This might not be a bad thing either.

On Apr 10, 2:13 am, Martin Sojka <sojka.mar...@gmail.com> wrote:
> What's the best way to approach following trivial example in the
> appengine?
>
> Assume that we are coding blog application and each blog entry has
> many comments, each comment has one author. On the blog entry page we
> want to list all of the comments of that entry plus each comment's
> author's name. In theSQLwe'd use simple JOIN to do the job and fetch

Brett Morgan

unread,
Apr 12, 2008, 2:06:22 AM4/12/08
to google-a...@googlegroups.com
Yeah, there is a lot of learning here for all of us to do. I'm
expecting to spend the next six months getting my head around all of
the tradeoffs of speed of queries vs cost of doing updates. And
getting the code right.

Hopefully we will get some good tutorials up soon...

anvaka

unread,
Apr 12, 2008, 4:11:33 AM4/12/08
to Google App Engine
> 4. Other ideas?

Hello Martin!

I'm not sure either this approach is good from performance
perspective, but you can consider it to.

In short words there is a ReferenceProperty() property type. When this
property references another model class, for say ClassA, then an
instance of ClassA can enumerate all referenced items. You can find
all details about ReferenceProperties here:
http://code.google.com/appengine/docs/datastore/entitiesandmodels.html#References

Now let's drive into details with example. You can copy/paste it to
the Development Console to check it out ( http://localhost:8080/_ah/admin/interactive
). Hope this helps :).

# We should ensure that we've imported 'db' module:
from google.appengine.ext import db

# We create a BlogEnity:
class BlogEntity(db.Model):
body = db.TextProperty()

# Here is our Authors:
class Author(db.Model):
firstName = db.StringProperty()
lastName = db.StringProperty()

# This is most tricky class: a comment to the blog entity.
# Note: it contains couple of ReferenceProperty()
class Comment(db.Model):
commentsTo = db.ReferenceProperty(reference_class=BlogEntity,
collection_name='comments_set')
author = db.ReferenceProperty(Author)
body = db.TextProperty()

# Let's create some default data:

# 1. Our first post:
firstPost = BlogEntity()
firstPost.body = db.Text(u"First article")
firstPost.put()

# 2. Authors John and Smith:
authorJohn = Author()
authorJohn.firstName = u'John'
authorJohn.lastName = u'Johnson'
authorJohn.put()

authorSmith = Author()
authorSmith.firstName = u'Smith'
authorSmith.lastName = u'Smithson'
authorSmith.put()

# 3. Authors' comments:
johnComment = Comment()
johnComment.commentsTo = firstPost
johnComment.author = authorJohn
johnComment.body = db.Text(u"Hey! Here is my comment")
johnComment.put()

smithComment = Comment()
smithComment.commentsTo = firstPost
smithComment.author = authorSmith
smithComment.body = db.Text(u"Great article!")
smithComment.put()

# And here is comments rendering code.
# Note: firstPost object contains 'comments_set' property.
# We've defined it in the 'Comment' class via 'collection_name':
for comment in firstPost.comments_set:
print comment.author.firstName, comment.author.lastName, 'wrote:'
print comment.body

btoc

unread,
Apr 12, 2008, 9:19:12 AM4/12/08
to Google App Engine
The problem with this approach is that each object is being fetched
when it is referenced! This would be very inefficient, I would think.
If you get 100 comments each time you reference author.firstName the
object needs to be fetched (unless it is in cache).

Again I don't know but I feel doing one fetch via db.get(keys) would
be more efficient. In fact I would suspect if caching works correctly
you would do this and then when you do comment.author.firstName it
would have it in cache from the db.get command earlier.

It's hard to know here exactly as the documentation is fairly
incomplete when it comes to this.

On Apr 12, 4:11 am, anvaka <anv...@gmail.com> wrote:
> > 4. Other ideas?
>
> Hello Martin!
>
> I'm not sure either this approach is good from performance
> perspective, but you can consider it to.
>
> In short words there is a ReferenceProperty() property type. When this
> property references another model class, for say ClassA, then an
> instance of ClassA can enumerate all referenced items. You can find
> all details about ReferenceProperties here:http://code.google.com/appengine/docs/datastore/entitiesandmodels.htm...
>
> Now let's drive into details with example. You can copy/paste it to
> the Development Console to check it out (http://localhost:8080/_ah/admin/interactive

anvaka

unread,
Apr 12, 2008, 10:15:05 AM4/12/08
to Google App Engine
btoc,

Yep, I've mentioned about performance consideration in the very
beginning of my previous post. Object is being fetched only if the
referenced entity is not in memory (according to the documentation
http://code.google.com/appengine/docs/datastore/entitiesandmodels.html#References).
BTW can we figure out how many requests did we made to the Bigtable
through the Administration Console? If we are, then we can just
measure it ;).

Anyway, since we have not a "standard" sql here, and there are still
many-many things to learn and understand, I would highly recommend to
recall the first rule of program optimization: don't do it.

PS: Donald Knuth told: "... premature optimization is the root of all
evil". And I trust him.

Martin Sojka

unread,
Apr 12, 2008, 6:08:43 PM4/12/08
to Google App Engine
Thanks for keeping the discussion going, folks.

anvaka, your solution was actually #1 from my list
>1. Fetch comments and then fetch the authors one-by-one as the
comments are rendered? Remember there could be tens of comments for
each entry. Doesn't sound too efficient.

btoc, good example about showing more "frequent changing" author's
info together with each comment. We're getting to problem here since I
guess we must somehow prefetch the authors of the comments we need to
display on the page in this scenario. As you said updating each
comment with this live author's info would be real pain.

Fetching comments, collecting comments' author keys and then fetching
all the authors by these keys sounds as the only way to accomplish
this in GQL efficiently. As you mentioned there is one key question:

Does comment.author use the prefetched author automatically or do we
need to associate authors back with the comments after the fetch?

.. and yes, is there a way to see the number of BigTable requests per
url request?



On Apr 12, 4:15 pm, anvaka <anv...@gmail.com> wrote:
> btoc,
>
> Yep, I've mentioned about performance consideration in the very
> beginning of my previous post. Object is being fetched only if the
> referenced entity is not in memory (according to the documentationhttp://code.google.com/appengine/docs/datastore/entitiesandmodels.htm...).

Martin Sojka

unread,
Apr 12, 2008, 6:28:31 PM4/12/08
to Google App Engine
Another thing to consider...

Brett > Remember, the number of read requests for authors
outnumber the number of times an author changes his name by several
orders of magnitude.

I think we can assume this even with the frequent changing data like
total # of comments by author (if want to display this info). The
number of read requests for the respective author comments (across
whole application total) outnumbers the number of times respective
author makes new comment by several orders of magnitude.

So maybe the real question should be... how efficient/fast is updating
of the 10,000 - 100,000 or even more comment entities at once with new
cached author info during each comment posting?

DennisP

unread,
Apr 12, 2008, 6:41:29 PM4/12/08
to Google App Engine
If it's ok that author's number of posts be a little out of date, you
could store the last-updated-time with the comment or thread, and only
occasionally do the update from author.
> > > incomplete when it comes to this.- Hide quoted text -

Brett Morgan

unread,
Apr 12, 2008, 8:23:10 PM4/12/08
to google-a...@googlegroups.com
Just thinking out loud here.

How about one entry with all the author names, and one entry with all
the comments in a thread.

This way rendering a comment thread would pull back two entries, and
as the thread is rendered, you can print out each author name as
appropriate.

Does that work?

DennisP

unread,
Apr 13, 2008, 3:07:36 AM4/13/08
to Google App Engine
That brings up a question I've been wondering about...how does
concurrency work out? If you store a bunch of stuff in one entry,
which lots of users might update, do you have to worry much about
contention?

Martin Sojka

unread,
Apr 13, 2008, 3:13:13 AM4/13/08
to Google App Engine
Something like this, Brett?

class BlogEntry(db.Model):
body = db.TextProperty()

class Member(db.Model):
firstName = db.StringProperty()
lastName = db.StringProperty()

class Comment(db.Model):
entry = db.ReferenceProperty(reference_class=BlogEntry,
collection_name='comments')
author = db.ReferenceProperty(CommentAuthor)
body = db.TextProperty()

class CommentAuthor(db.Model):
entry = db.ReferenceProperty(reference_class=BlogEntry,
collection_name='commentAuthors')
member = db.ReferenceProperty(Member)
commentsCount = db.IntegerProperty()


Now we can fetch comments and comment authors:
comments = entry.comments
comments = entry.commentAuthors

But since CommentAuthor contains the commentsCount property we would
need to update all commentAuthors again.

Is it possible to do something like this?

class BlogEntry(db.Model):
comments = db.ListProperty(key)
commentAuthors = db.ListProperty(key)



On Apr 13, 2:23 am, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> Just thinking out loud here.
>
> How about one entry with all the author names, and one entry with all
> the comments in a thread.
>
> This way rendering a comment thread would pull back two entries, and
> as the thread is rendered, you can print out each author name as
> appropriate.
>
> Does that work?
>

Brett Morgan

unread,
Apr 13, 2008, 4:28:05 AM4/13/08
to google-a...@googlegroups.com
Heh, it's a question of how often the authors names are being updated.
So in short, yes it's something you have to worry about. =)

Brett Morgan

unread,
Apr 13, 2008, 4:30:25 AM4/13/08
to google-a...@googlegroups.com
Erm, not quite what i had in mind. Uhm.

I'm trying to get some sleep in before i do a production change on
some machines in five hours, so yeah, not fully rational at the
moment. Hah. Me rational. I'm a joke a minute.

DennisP

unread,
Apr 13, 2008, 9:38:27 AM4/13/08
to Google App Engine
Depends on how it's implemented...eg., if the whole thing were written
in Erlang, messages to do updates would just queue up and get
processed in turn. If they're using locking, it's more of a problem.
> >  > - Show quoted text -- Hide quoted text -
Reply all
Reply to author
Forward
0 new messages