Message from discussion Many-to-many JOIN with the Datastore
Date: Thu, 15 May 2008 01:15:00 -0700 (PDT)
Subject: Re: Many-to-many JOIN with the Datastore
From: Andrew Fong <FongAnd...@gmail.com>
To: Google App Engine <email@example.com>
Hmmm, so maybe the proper way to approach the datastore is to think of it
as a pseudo-cache. Let's say we start with a more or less normalized
datastore and we do all the joins through a ReferenceProperty -- and
if we notice we're frequently using that reference, we "cache" the
referenced values in the referencing entity. And we treat updates to
the referenced attribute using the same strategies we treat updates to
any item that's cached -- e.g. wait for the values to propagate via
some background task (speaking of which, how are people doing
background tasks in GAE?), whether that's one that runs periodically
or whenever certain kinds of entities are updated.
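To make that concrete, here's a plain-Python sketch of the pattern (dicts stand in for datastore entities; real GAE code would use db.Model subclasses and a query on the ReferenceProperty -- all the names below are just illustrative):

```python
# Sketch of "denormalize + background propagation". Entities are plain
# dicts here; in the datastore, library_books would be LibraryBook
# entities found via a query on the cached field's parent reference.

libraries = {
    "lib1": {"name": "Central Library"},
}

# Each LibraryBook "caches" the referenced library's name, so listing
# books never needs an extra fetch per book.
library_books = [
    {"title": "Dune", "library": "lib1", "libraryname": "Central Library"},
    {"title": "Ubik", "library": "lib1", "libraryname": "Central Library"},
]

def rename_library(lib_id, new_name):
    """Update the canonical record; cached copies go stale until propagated."""
    libraries[lib_id]["name"] = new_name

def propagate_library_name(lib_id):
    """The background task: fan the new name out to every referencing book.

    Against the datastore this would be something like
    LibraryBook.all().filter('library =', key), processed in batches.
    """
    new_name = libraries[lib_id]["name"]
    updated = 0
    for book in library_books:
        if book["library"] == lib_id and book["libraryname"] != new_name:
            book["libraryname"] = new_name
            updated += 1
    return updated

rename_library("lib1", "Main Street Library")
print(propagate_library_name("lib1"))  # -> 2 stale copies fixed
```

The write gets more expensive (one update per referencing entity), but that cost moves off the request path into the background task, which is exactly the cache-propagation trade-off.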
It seems to me that a large part of this could be automated though. I
really like how the datastore indices are automatically generated in
the index.yaml file without any action on the developer's part. I'm new
to Python and GAE, but how feasible would it be to write a plugin that
automatically does this sort of "caching"?
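I have no idea whether anything like this exists, but the registration half at least seems straightforward. A purely hypothetical sketch -- every name below is invented, nothing like this ships with GAE -- where declaring a cached field records it, so one generic propagation task knows which kinds cache which fields:

```python
# Hypothetical "auto-caching" registry. The idea: each model declares
# which fields it caches from which source kind, and a single generic
# background task consults the registry when a source entity changes.

CACHED_FIELDS = {}  # source kind -> list of (referencing kind, field name)

def cached_reference(source_kind, field_name):
    """Class decorator: register that this model caches `field_name`
    from entities of `source_kind`."""
    def decorator(cls):
        CACHED_FIELDS.setdefault(source_kind, []).append(
            (cls.__name__, field_name))
        return cls
    return decorator

@cached_reference("Library", "name")
class LibraryBook:
    pass

@cached_reference("Library", "address")
class LibraryCard:
    pass

# When a Library is updated, a generic task can look up exactly which
# kinds cache which of its fields -- no per-model knowledge needed:
print(CACHED_FIELDS["Library"])
# -> [('LibraryBook', 'name'), ('LibraryCard', 'address')]
```

That would at least solve the multi-person-project problem of remembering who caches what, since the registry is the single source of truth.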
On May 14, 5:01 pm, "Brett Morgan" <brett.mor...@gmail.com> wrote:
> On Thu, May 15, 2008 at 9:15 AM, Andrew Fong <FongAnd...@gmail.com> wrote:
> > I still have issues with denormalization. It's not just a space issue.
> > The reason normalized databases don't repeat records is to avoid some
> > confusion down the road. For example, what happens if, in the
> > LibraryBook example, the Library changes its name? In a normalized
> > database, you would only have to update one record. Under a
> > denormalized database, would that entail finding every LibraryBook
> > that referenced that particular Library and updating it?
> > If so, it seems that the more denormalized a database is, the more
> > expensive updates are (even if the reads are fast).
> > Furthermore, it would require anyone trying to update an entity to
> > understand the structure of all the entities that referenced this
> > entity. In the LibraryBook example, updating the name attribute for
> > Library requires knowing that there is a libraryname attribute in
> > LibraryBook. Not a big deal for one model, but as the number of models
> > increases, it's going to get difficult keeping track of which entities
> > referencing Library have a libraryname attribute, which have a
> > libraryaddress attribute, and which ones might not have any such
> > attribute at all -- especially on a multi-person project.
> > Am I missing something?
> > -- Andrew
> Yes, all of the above concerns are valid. Yes, denormalisation hurts,
> both on disk space and on correctness.
> The reason we are doing this is to achieve scale. At scale you wind up
> doing a bunch of things that seem wrong, but that are required by the
> numbers we are running. Go watch the eBay talks. Or read the posts
> about how many database instances Facebook is running.
> The simple truth is, what we learned about in uni was great for the
> business automation apps of small to medium enterprises, where the
> load was predictable, and there was money enough to buy the servers
> required to handle 50 people doing data entry into an accounting or
> business planning and control app.
> On the web, we are in a different world. If you get successful, you'll
> get slashdotted. Well, these days it's probably more correct to call
> it reddited. Or boing boinged. And suddenly you have to go from 4
> servers to forty, to four hundred, to four thousand. Read up on the
> story about the iLike guys. They wrote an app that went viral on FB.
> And they melted. Needed servers. Yesterday.
> What GAE gives you is the ability to handle this, easily. All the
> things that GAE makes you do are done with this end game in mind. You
> have to write your code such that it can run on 400 app servers spread
> across the globe, on Google's infrastructure. You have to deal with
> the fact that the transaction engine is distributed. You have to deal
> with the fact that queries are slow, and you should really be
> publishing entities that match one to one with your popular pages. And
> that you need to hide your updates using Ajax. It's better to give the
> user a progress bar than a white screen of death, anyway.
> If you aren't interested in serving millions of customers, then this
> is likely overkill for you. But if you are, then you have to go
> through this world view change. And yes it hurts. I'm not going to say
> it's easy. It hurt me when I had to go through it back in 2000. It
> actually took me about four attempts (aka, webapps that melted
> under load) before I got it. But, once you make the leap, and
> understand that we are breaking rules for a reason, then you'll
> understand where and when to do it. Every choice has costs and
> benefits. Understanding when GAE makes sense is part of this journey
> of discovery.
> And if any of the above doesn't make sense, feel free to come back
> with more questions. =)