Many-to-many JOIN with the Datastore


arbin...@gmail.com

Apr 29, 2008, 11:19:30 AM
to Google App Engine
It turns out my assumption was right: there is a need for better
relational-to-GAE-Datastore examples in the docs.

Following on my one-to-many JOIN post --

http://blog.arbingersys.com/2008/04/google-app-engine-one-to-many-join.html

-- I've created a second tutorial that demonstrates how to, in effect,
do a many-to-many JOIN using the Datastore. It's up on my blog:

http://blog.arbingersys.com/

Permalink:

http://blog.arbingersys.com/2008/04/google-app-engine-many-to-many-join.html

You can download the source and run it in your SDK. Hope you find it
useful.

James

Yagiz Erkan

Apr 29, 2008, 11:44:09 AM
to google-a...@googlegroups.com
James,

Thanks for the blog entries!

Regards,

- Yagiz -

Jeremey Barrett

Apr 29, 2008, 11:50:03 AM
to google-a...@googlegroups.com
Out of curiosity, why not use a join model?

class Library(db.Model):
    name = db.StringProperty()
    address = db.StringProperty()

    def books(self):
        # reverse-reference collection generated by LibraryBook.library
        return (x.book for x in self.librarybook_set)

class Book(db.Model):
    title = db.StringProperty()

    def libraries(self):
        # reverse-reference collection generated by LibraryBook.book
        return (x.library for x in self.librarybook_set)

class LibraryBook(db.Model):
    library = db.ReferenceProperty(Library)
    book = db.ReferenceProperty(Book)

Then you can CRUD the LibraryBook and store quantities there, etc.
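For example, creating and tearing down an association might look roughly
like this (just a sketch against the models above; a
quantity = db.IntegerProperty() on LibraryBook is where per-link data such
as stock counts would live):

from google.appengine.ext import db

# Sketch only -- assumes the Library/Book/LibraryBook models above.
lib = Library(name='Central Library', address='1 Main St')
lib.put()
book = Book(title='Dune')
book.put()

# Create the association as its own entity.
LibraryBook(library=lib, book=book).put()

# Walk the association from either side.
for b in lib.books():
    print b.title
for l in book.libraries():
    print l.name

# Remove the association without touching Library or Book.
for link in LibraryBook.all().filter('library =', lib).filter('book =', book):
    link.delete()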

Regards,
Jeremey.

Edoardo Marcora

Apr 29, 2008, 12:16:21 PM
to Google App Engine
I too think using a join model for many-to-many relationships is more
appropriate (see http://groups.google.com/group/google-appengine/msg/d8b3901937c7765d).
A ReferenceProperty effectively models a many-to-one (or belongs_to)
relationship. From the other side (using the automagically generated
collection) you get a one-to-many (or has_many) relationship. Using
the join model you get a many-to-many (or has_and_belongs_to_many)
relationship. It would be nice to have a one-to-one relationship as
well, but that's easy to code on top of the one-to-many relationship.
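A one-to-one could be sketched roughly like this, for instance (my own
illustration, not a built-in feature): keep a ReferenceProperty on one side
and read a single entity off the auto-generated collection on the other:

from google.appengine.ext import db

# Rough sketch of one-to-one built on top of one-to-many (illustrative only).
class Person(db.Model):
    name = db.StringProperty()

    def profile(self):
        # The reverse collection is technically one-to-many; take at most one.
        return self.profile_set.get()

class Profile(db.Model):
    person = db.ReferenceProperty(Person, collection_name='profile_set')
    bio = db.TextProperty()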


arbin...@gmail.com

Apr 29, 2008, 12:50:39 PM
to Google App Engine
That does look a little better -- at least more analogous to the SQL
counterpart. Maybe I'll play around with it and compare it to what I
came up with.

James



Ben the Indefatigable

Apr 29, 2008, 2:06:45 PM
to Google App Engine
Using the reference property has also been called a "virtual join" on
this forum. You can actually do the select on the intermediary entity
without pulling the referenced entities. This is good if you
denormalize the data you need for your query out of those referenced
entities into your intermediary entity (such as LibraryBook above), to
avoid pulling the additional entities in your common read cases. A
good discussion of many-to-many joins in the datastore should mostly
discuss how to avoid them and find scalable alternatives. :)

arbin...@gmail.com

Apr 29, 2008, 2:19:40 PM
to Google App Engine
Ben: Can you post a code snippet that might demonstrate the
intermediate query sans referenced entities? I think I understand what
you're saying, but something visual would definitely help.

James

arbin...@gmail.com

Apr 29, 2008, 4:26:11 PM
to Google App Engine
I went ahead and played with your "join model" example, Jeremey, and
it's definitely an improvement over what I had. Obtaining books from
libraries and libraries from books becomes quite intuitive:

# Books by library
for book in lib1.books():
    self.response.out.write(book.title + ' ' + book.author + '<br />')

# Libraries by book
for lib in book1.libraries():
    self.response.out.write(
        book1.title + ' ' + book1.author + '-' + lib.name + '<br />')


I still wanted to demonstrate outputting all data in total (notice
from the above that I'm getting all the books at library 'lib1' and
all the libraries that contain 'book1'), so I did the following, using
Libraries and Books models as single collection points for each
entity type:


class Libraries(db.Model):
    notes = db.StringProperty()


class Books(db.Model):
    notes = db.StringProperty()


class Library(db.Model):
    name = db.StringProperty()
    address = db.StringProperty()
    city = db.StringProperty()
    libscol = db.ReferenceProperty(Libraries, collection_name='libscol')

    def books(self):
        return (x.book for x in self.librarybook_set)


class Book(db.Model):
    title = db.StringProperty()
    author = db.StringProperty()
    bookscol = db.ReferenceProperty(Books, collection_name='bookscol')

    def libraries(self):
        return (x.library for x in self.librarybook_set)


class LibraryBook(db.Model):
    library = db.ReferenceProperty(Library)
    book = db.ReferenceProperty(Book)


Then, I was able to output everything from the library perspective and
the book perspective:

# All libraries, all books
for lib in libs.libscol.order('-name'):
    for book in lib.books():
        self.response.out.write(
            lib.name + '-' + book.title + ' ' + book.author + '<br />')


# All books, all libraries
for book in books.bookscol.order('author').order('-title'):
    for lib in book.libraries():
        self.response.out.write(
            book.title + ' ' + book.author + '-' + lib.name + '<br />')


I really just wanted to demonstrate being able to do this. I plan to
post an update to my blog using this code. Is there a better way to do
the last part, which is full, ordered output?

Thanks for all the help. James

Ben the Indefatigable

Apr 29, 2008, 8:01:53 PM
to Google App Engine
Thanks for the response, James. I was talking in terms of scalability.
Since the datastore is not written for this kind of join access on a
page load, it can be made more feasible by denormalizing at least the
fields you will need when displaying all the books in a library
(booktitle) and/or all the libraries a book is in (libraryname):

class LibraryBook(db.Model):
    library = db.ReferenceProperty(Library)
    book = db.ReferenceProperty(Book)
    booktitle = db.StringProperty()
    libraryname = db.StringProperty()

So if your page shows a whole list of library names without actually
pulling the Library entities, you could include links for drilling
down into more details about a particular library, such as the
address, and only at that point would you access the particular
Library entity with:

librarybook.library.address

The entity will not be pulled until you follow that reference.
Denormalizing goes against the intuition at first, I know.


Ben the Indefatigable

Apr 29, 2008, 9:33:53 PM
to Google App Engine
Actually, librarybook.library.address confuses what I was saying. You
would have a key (likely in the URL of the link) so you can go
directly to the Library entity on the new request.
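Roughly, for instance (a sketch; the /library URL, the 'key' parameter name,
and the link variable are just assumptions):

from google.appengine.ext import db

# List page: build the drill-down link from the denormalized name plus the
# raw key, without dereferencing link.library (which would fetch the entity).
lib_key = LibraryBook.library.get_value_for_datastore(link)
self.response.out.write(
    '<a href="/library?key=%s">%s</a><br />' % (lib_key, link.libraryname))

# Detail page: only now is the one Library entity actually fetched.
library = db.get(db.Key(self.request.get('key')))
self.response.out.write(library.name + ', ' + library.address)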

arbin...@gmail.com

Apr 30, 2008, 10:23:08 AM
to Google App Engine
On Apr 30, 1:01 am, Ben the Indefatigable <bcbry...@gmail.com> wrote:
> Since the datastore is not written for this kind of join access on a
> page load, it might be made more feasible by denormalizing at least
> the fields you will need as you are displaying all the books in a
> library (booktitle) and/or all the libraries a book is in
> (libraryname):
>
> class LibraryBook(db.Model):
>     library = db.ReferenceProperty(Library)
>     book = db.ReferenceProperty(Book)
>     booktitle = db.StringProperty()
>     libraryname = db.StringProperty()
>
> ...
>
> Denormalizing goes against the intuition at first, I know.

Ben, thanks, that's helpful. The whole idea of "disk and CPU are
cheap, stop worrying about them and scale" is starting to gel, I
think. I do find it amazing how resistant a mind is to change, though.
I look at the code above, and even knowing why denormalization is
useful in GAE, I still want to worry about the wastefulness of having
both librarybook.booktitle and book.title.

James

Brett Morgan

Apr 30, 2008, 8:36:33 PM
to google-a...@googlegroups.com

Habits take time to change. We, as an industry, have been focusing on
speed and size just about forever. It's only when you pull back from
the micro world and focus on what we are attempting to achieve that
you realise that you can achieve real world usability doing things
that seem counter-intuitive on first pass.

Intuitive, of course, is another way of saying familiar. =)

> James


--

Brett Morgan http://brett.morgan.googlepages.com/

cyj

May 3, 2008, 9:27:48 AM
to Google App Engine
Hi, I am a newbie to both GAE and Python.
I have a question about how to query the m:n model simulated in GAE --
e.g., how do I get the libraries that have book1?


arbin...@gmail.com

May 8, 2008, 12:17:08 PM
to Google App Engine
On May 3, 2:27 pm, cyj <chenyue...@gmail.com> wrote:
> Hi, I am a newbie to both GAE and python.
> I have a question about how to query the m:n model simulated by GAE?
> e.g. get a library has book1.

That's described in this post:

http://blog.arbingersys.com/2008/04/google-app-engine-better-many-to-many.html

Specifically, I pass

'libs_by_book': book1.libraries(),

to the template, which in turn does:

<h4>All libraries with book "{{ forbook }}"</h4>
<table border="1">
{% for lib in libs_by_book %}
  <tr>
    <td>{{ lib.name }}</td>
    <td>{{ lib.address }}</td>
    <td>{{ lib.city }}</td>
  </tr>
{% endfor %}
</table>
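If you'd rather not go through the helper method, the same lookup can be done
by querying the join model directly -- a sketch, assuming the LibraryBook
model from earlier in the thread:

# Find the libraries that hold book1 by querying the join entities.
libs_by_book = [link.library for link in LibraryBook.all().filter('book =', book1)]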

joh...@easypublisher.com

May 12, 2008, 11:59:12 AM
to google-a...@googlegroups.com
On 5/8/08, arbin...@gmail.com <arbin...@gmail.com> wrote:
>
> That's described in this post:
>
> http://blog.arbingersys.com/2008/04/google-app-engine-better-many-to-many.html


Hi,
Here is my modification of James's example.py.
I've cleaned it up a bit for clarity (removing the Libraries and Books
models).
(This is just the example.py replacement.)

http://cluebin.appspot.com/pasted/25

Regards,
/Johan


--
Johan Carlsson
Colliberty Easy Publisher
http://www.easypublisher.com

Andrew Fong

May 14, 2008, 7:15:27 PM
to Google App Engine
I still have issues with denormalization. It's not just a space issue.
The reason normalized databases don't repeat records is to avoid some
confusion down the road. For example, what happens if, in the
LibraryBook example, the Library changes its name? In a normalized
database, you would only have to update one record. Under a
denormalized database, would that entail finding every LibraryBook
that referenced that particular Library and updating it?

If so, it seems that the more denormalized a database is, the more
expensive updates become (even if reads are fast).

Furthermore, it would require anyone updating an entity to understand
the structure of all the entities that reference it. In the
LibraryBook example, updating the name attribute for
Library requires knowing that there is a libraryname attribute in
LibraryBook. Not a big deal for one model, but as the number of models
increases, it's going to get difficult keeping track of which entities
referencing Library have a libraryname attribute, which have a
libraryaddress attribute, and which ones might not have any such
attribute at all -- especially on a multi-person project.

Am I missing something?

-- Andrew


Brett Morgan

May 14, 2008, 8:01:09 PM
to google-a...@googlegroups.com

Yes, all of the above concerns are valid. Yes, denormalisation hurts,
both on disk space, and on correctness.
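To answer the concrete question: yes, a rename becomes a fan-out update over
the join entities. As a rough sketch against the LibraryBook model from
earlier in the thread (batching and transactions left out):

from google.appengine.ext import db

def rename_library(library, new_name):
    # Update the source entity, then fan the new name out to every
    # denormalized copy. For a large fan-out you would do this in batches,
    # ideally from a background process rather than in the user's request.
    library.name = new_name
    library.put()
    links = LibraryBook.all().filter('library =', library).fetch(1000)
    for link in links:
        link.libraryname = new_name
    db.put(links)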

The reason we are doing this is to achieve scale. At scale you wind up
doing a bunch of things that seem wrong, but that are required by the
numbers we are running. Go watch the eBay talks, or read the posts
about how many database instances Facebook is running.

The simple truth is, what we learned in uni was great for the
business-automation apps of small-to-medium enterprises, where the
load was predictable and there was money enough to buy the server
required to handle the load of 50 people doing data entry into an
accounting or business planning and control app.

On the web, we are in a different world. If you get successful, you'll
get slashdotted. Well, these days it's probably more correct to call
it reddited. Or boing-boinged. And suddenly you have to go from 4
servers to forty, to four hundred, to four thousand. Read up on the
story about the iLike guys. They wrote an app that went viral on FB.
And they melted. Needed servers. Yesterday.

What GAE gives you is the ability to handle this, easily. All the
things that GAE makes you do are done with this end game in mind. You
have to write your code such that it can run on 400 app servers spread
across the globe, on Google's infrastructure. You have to deal with
the fact that the transaction engine is distributed. You have to deal
with the fact that queries are slow, and you should really be
publishing entities that match one-to-one with your popular pages. And
you need to hide your updates using Ajax. It's better to give the
user a progress bar than a white screen of death, anyway.

If you aren't interested in serving millions of customers, then this
is likely overkill for you. But if you are, then you have to go
through this worldview change. And yes, it hurts. I'm not going to say
it's easy. It hurt me when I had to go through it back in 2000. It
actually took me about four attempts (a.k.a. webapps that melted
under load) before I got it. But once you make the leap, and
understand that we are breaking rules for a reason, you'll
understand where and when to do it. Every choice has costs and
benefits. Understanding when GAE makes sense is part of this journey
of discovery.

And if any of the above doesn't make sense, feel free to come back
with more questions. =)

Andrew Fong

May 15, 2008, 4:15:00 AM
to Google App Engine
Hmmm, so maybe the proper way to approach the datastore is to think of it
as a pseudo-cache. Let's say we start with a more or less normalized
datastore and do all the joins through a ReferenceProperty -- and if we
notice we're frequently using that reference, we "cache" the referenced
values in the referencing entity. Then we treat updates to the referenced
attribute with the same strategies we'd use for any cached item -- e.g.
wait for the values to propagate via some background task (speaking of
which, how are people doing background tasks in GAE?), whether that's one
that runs periodically or one that runs whenever certain kinds of entities
are updated.

It seems to me that a large part of this could be automated, though. I
really like how the datastore indices are automatically generated in
the index.yaml file without any action on the developer's part. I'm new
to Python and GAE, but how feasible would it be to write a plugin that
automatically does this sort of "caching"?

-- Andrew


Brett Morgan

May 15, 2008, 7:43:15 AM
to google-a...@googlegroups.com
On Thu, May 15, 2008 at 6:15 PM, Andrew Fong <FongA...@gmail.com> wrote:
>
> Hmmm, so maybe the proper way to approach the datastore is to think of it
> as a pseudo-cache. Let's say we start with a more or less normalized
> datastore and do all the joins through a ReferenceProperty -- and if we
> notice we're frequently using that reference, we "cache" the referenced
> values in the referencing entity. Then we treat updates to the referenced
> attribute with the same strategies we'd use for any cached item -- e.g.
> wait for the values to propagate via some background task (speaking of
> which, how are people doing background tasks in GAE?), whether that's one
> that runs periodically or one that runs whenever certain kinds of entities
> are updated.

Yes, splitting things so that you have a primary data source and the
published cache entities makes sense. It's a strategy I've suggested
several times in various situations.

As for background tasks, you could always star my issue:
http://code.google.com/p/googleappengine/issues/detail?id=6

> It seems to me that a large part of this could be automated, though. I
> really like how the datastore indices are automatically generated in
> the index.yaml file without any action on the developer's part. I'm new
> to Python and GAE, but how feasible would it be to write a plugin that
> automatically does this sort of "caching"?

Probably. I haven't thought that far. =)

--

Brett Morgan http://brett.morgan.googlepages.com/

Filip

May 15, 2008, 8:01:51 AM
to Google App Engine
Another method is to unwind links into a single Expando model. Often
you need a record that has a fixed set of references in a list, and
each referenced record has a set of properties. Instead of retrieving
the set of referenced records every time, you could try to include the
referenced records in the parent record (an Expando) as ref1_prop1,
ref1_prop2, etc., where prop1 and prop2 are replaced by actual
property names. That way, you can substantially reduce the number of
datastore queries needed to build a single web page.
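A minimal sketch of the idea (the model names and the ref1_title /
ref1_author property names are my own illustration):

from google.appengine.ext import db

class Book(db.Model):
    title = db.StringProperty()
    author = db.StringProperty()

# Expando parent that carries flattened copies of its referenced books.
class LibraryPage(db.Expando):
    name = db.StringProperty()

def flatten_books(page, books):
    # Copy selected properties of each referenced Book onto the Expando as
    # ref1_title, ref1_author, ref2_title, ... so a single get() of the
    # parent is enough to render the page.
    for i in range(len(books)):
        setattr(page, 'ref%d_title' % (i + 1), books[i].title)
        setattr(page, 'ref%d_author' % (i + 1), books[i].author)
    page.put()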

Now, I wonder if it would be possible to write a plugin script that
automatically converts those ref1, ref2, etc. back into a list you can
loop through as if they were real references. I'll try that.

Is anybody aware of limitations on the number of fields an Expando
model can have?

Filip.