According to the videos released so far, there are no joins available in the query language.
Because if it's the db that's constrained, there's no point in scaling our app, since we can't store a million user rows.
You are looking for read-time functionality. Everything about how Google works is trading disk space for request time. Push your effort into pre-computing things at write time and you will be going with the grain of BigTable.
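To make that concrete, here's a minimal sketch with the Python SDK's db module - the model names are mine, not anything prescribed:

from google.appengine.ext import db

class Article(db.Model):
    title = db.StringProperty()
    posted = db.DateTimeProperty(auto_now_add=True)

class RecentList(db.Model):
    # Precomputed at write time so the read path never queries.
    titles = db.StringListProperty()

def post_article(title):
    Article(title=title).put()
    # Do the "query" work now, once, instead of on every page view.
    recent = (RecentList.get_by_key_name('recent')
              or RecentList(key_name='recent'))
    recent.titles = ([title] + recent.titles)[:10]
    recent.put()

def recent_titles():
    # Read path: one get by key. No query, no sort.
    recent = RecentList.get_by_key_name('recent')
    return recent.titles if recent else []

You burn a little extra disk and write time keeping RecentList up to date, and every page view after that is a single fetch.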
Merge the concepts of Person and Contact, for starters.
In the dim dark past, when relational databases came to the fore, disk was expensive. So we made sure to slice and dice things such that there was no wasted space. Thus instead of optional fields, you created separate tables so that the optional fields could be pulled in using a join.
In this new world where disk space is effectively free, merge those previously split concepts so that the optional fields live in the main object. That's why you keep seeing denormalisation bandied about in this group as a tactic for dealing with BigTable. Make a few large entities with optional fields, instead of lots of small entities.
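A sketch of the Person/Contact merge with the db module - the field names are just illustrative:

from google.appengine.ext import db

class Person(db.Model):
    name = db.StringProperty(required=True)
    # Formerly a separate Contact table reached via a join; now just
    # optional fields that cost nothing when left empty.
    email = db.StringProperty()
    phone = db.StringProperty()
    street = db.StringProperty()
    city = db.StringProperty()

person = Person(key_name='dado', name='Dado', email='dado@example.com')
person.put()
# One get by key pulls in the contact details too - no join required.
fetched = Person.get_by_key_name('dado')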
This is the same lesson we had to learn with RPC. With normal procedure calls, having lots of small calls with a few parameters made sense: stack space needed to be conserved, and the latency of a local call is almost nothing. With RPC, each individual call is expensive, both in computational terms at each end for serialisation and deserialisation, and in raw network latency. Suddenly you had to change the shape of the functions: instead of lots of little calls, a few calls that returned lots of data. It's cheaper to return a copy of the world than to make fifty calls for the small part of the world you were interested in.
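As a toy illustration of the shape change (not any particular RPC framework):

ARTICLES = {1: 'intro', 2: 'joins', 3: 'denormalisation'}

def get_article(article_id):
    # Fine as a local call; fifty of these over a network is fifty
    # round trips of serialisation plus latency.
    return ARTICLES.get(article_id)

def get_articles(article_ids):
    # The RPC-friendly shape: one round trip returns a copy of the world.
    return dict((i, ARTICLES.get(i)) for i in article_ids)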
I guess you are driving a magazine-style site with content in articles, and then summary pages with lists of articles, authors, and a pull quote? Maybe with the articles grouped into genres, or related-article groups? Or issues?
So looking at this, I can divide the site into layers: the main landing page with article teasers, genre pages, individual article pages, and then possibly comment threads and trackback links per article.
We actually have a bunch of information that would be good to keep in a normalised fashion for ease of editing: authors, articles, genres, commenters, and so on. We also want to cache this information in a denormalised fashion for speed, i.e. pre-rendered to html.
So I'd keep both.
Have the CMS editing side interact with the normalised data, with a
nice big fat "publish" button that takes all this nicely normalised
data and generates all the rendered html for the landing page, the
genre pages, and the individual article pages with cross links to
related pages.
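The publish step might look roughly like this - a sketch assuming the db module, with Article as the normalised side and RenderedPage as the denormalised cache (both names are mine):

from google.appengine.ext import db

class Article(db.Model):
    title = db.StringProperty()
    body = db.TextProperty()
    genre = db.StringProperty()

class RenderedPage(db.Model):
    html = db.TextProperty()

def publish():
    articles = Article.all().fetch(1000)
    # Landing page: render all the teasers once, at publish time.
    teasers = ''.join('<h2>%s</h2>' % a.title for a in articles)
    RenderedPage(key_name='landing', html=teasers).put()
    # One pre-rendered page per article, keyed for a cheap single get.
    for a in articles:
        page = '<h1>%s</h1>%s' % (a.title, a.body)
        RenderedPage(key_name='article:%s' % a.key().id_or_name(),
                     html=page).put()

Serving an article is then just a RenderedPage.get_by_key_name() and writing out the html.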
The publish button is going to take a while - it has a bunch of work to do. So do it piecemeal via AJAX, so that you can report progress back to the user with a progress bar or something.
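One way to chunk it, sketched as a webapp handler the progress bar can poll - the handler and batch size are my assumptions, and the models are the ones from the publish sketch above:

from google.appengine.ext import webapp

BATCH = 10

class PublishChunk(webapp.RequestHandler):
    # Article and RenderedPage as defined in the publish sketch above.
    def post(self):
        # Each AJAX call renders one batch, then reports how far we got.
        offset = int(self.request.get('offset', '0'))
        articles = Article.all().fetch(BATCH, offset=offset)
        for a in articles:
            RenderedPage(key_name='article:%s' % a.key().id_or_name(),
                         html='<h1>%s</h1>%s' % (a.title, a.body)).put()
        done = 'true' if len(articles) < BATCH else 'false'
        self.response.out.write('{"next": %d, "done": %s}'
                                % (offset + BATCH, done))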
So, in a bunch of ways, I've completely avoided your question. I think the technique you are using is probably quite good; I just wouldn't do it outside of the "publish" phase. I'm centralising the work in the CMS side so that the end user sees a faster site, because by then it's effectively a flat, published site.
Does that help?
How many connections can you run in parallel from the Firefox plugin?
You have an intelligent client that is parsing the page and figuring out a list of unique identifiers. You then want information keyed by each identifier to augment the page. GreaseMonkey style, I'm guessing.
So, what intelligence do you need server side?
I'd be thinking the most important thing server side is to get the bits shipped quickly. What I'd do is publish the information about each unique id in its own file, formatted as json for ease of use on the client side, plus a json-formatted manifest file that the client can use to map from the uuids to server-side urls. Then spread them out across a bunch of virtual servers, www[1..50].host.com. That means you can fire off as many parallel requests (across the different virtual hosts) as it takes to pull it all down. It looks like you have a question of byte shipping.
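Server side, the manifest can be dead simple. A sketch - the www[1..50].host.com scheme is from above, but using crc32 as the stable hash is my assumption:

import zlib

NUM_SHARDS = 50

def shard_url(uid):
    # The same uid always lands on the same virtual host, so caches stay warm.
    shard = zlib.crc32(uid.encode('utf-8')) % NUM_SHARDS + 1
    return 'http://www%d.host.com/data/%s.json' % (shard, uid)

def build_manifest(uids):
    # Publish this mapping as one json file; the client fetches it first,
    # then fans out across the hosts in parallel.
    return dict((uid, shard_url(uid)) for uid in uids)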
Amusingly enough, I was just reading in jwz's LiveJournal that the original Netscape code had a magic hostname for which it did exactly the trick above to get load balancing. Ahh, nostalgia. Client-side load balancing. Heh.
But even if you do want to stick with GAE for whatever set of reasons, I kinda doubt that query performance is going to hurt you. Where I expect you to hurt is the 500 meg disk space limitation...
Depends what you mean by perform. =)
Each new entity you reference is another instance you are pulling from
DataStore, with the overhead of finding it on the cluster and moving
it across the wire to your appserver.
You have starred issue 6, right? =)
http://code.google.com/p/googleappengine/issues/detail?id=6
> > It looks like you have a question of byte shipping.
>
> Caching the data in static files and using load-balancing client-side
> is not an option, since the app is write-intensive... the users
> interact with each article and thus with the datastore record
> frequently (adding ratings, annotations, assigning tags etc).
I'm guessing you are heading in the direction of having popular-articles pages, recently-annotated-articles pages, and tag pages? These ajaxy interactions can be slower, because the user isn't navigating away from the page. So you can update the pre-generated html content on these requests.
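In code, the ajaxy write can refresh the cached fragment in the same request - a sketch, with the rating fields as illustrative names (a real app would wrap this in db.run_in_transaction):

from google.appengine.ext import db

class Article(db.Model):
    title = db.StringProperty()
    rating_total = db.IntegerProperty(default=0)
    rating_count = db.IntegerProperty(default=0)
    rendered = db.TextProperty()  # the pre-generated html fragment

def add_rating(key_name, stars):
    article = Article.get_by_key_name(key_name)
    article.rating_total += stars
    article.rating_count += 1
    avg = float(article.rating_total) / article.rating_count
    # Re-render now, while the user waits on an async call, so every
    # later page view is a plain fetch of article.rendered.
    article.rendered = ('<h1>%s</h1><p>%.1f stars</p>'
                        % (article.title, avg))
    article.put()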
> > But even if you do want to stick with GAE for whatever set of
> > reasons, I kinda doubt that query performance is going to hurt you.
> > Where I expect you to hurt is the 500 meg disk space limitation...
>
> The reason I am playing around with GAE is because of scalability
> issue (I assume GAE scales but have not seen real-world benchmarking
> yet). The govt site contains, as of today, more than 18 million
> articles and there will be, hopefully, millions of users (biomedical
> scientists and doctors all around the world).
Yeah, I understand why you are using GAE now.
> Being able to scale is
> important. But what I gather from your answers is that the only
> solution (besides the one that I've already found using key names) to
> do this in GAE is to perform a query for each client-supplied uid in a
> for loop... kinda ugly. How difficult would it be to implement an IN
> operator in GQL? Ensuring uniqueness of the value of a property across
> the datastore is another very much lacking feature that has not been
> transferred over from SQL (UNIQUE indices) and Django (unique = True
> property constructor argument).
I'm still figuring out what we can and can't do with the back-end store. And I so can't answer for the AppEngine team on how hard it would be to implement functionality. You could raise a request on the issue tracker tho...
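That said, the key-name trick you mention does batch nicely - a sketch, assuming the entities were stored with the uid as their key_name:

from google.appengine.ext import db

class Record(db.Model):
    data = db.TextProperty()

def fetch_by_uids(uids):
    # One batch round trip instead of a query per uid;
    # missing uids come back as None.
    records = Record.get_by_key_name(uids)
    return dict(zip(uids, records))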
Thoughts?
Bwahaha, I wish. All those toys to play with? I'd be in heaven =)
> Dado
I starred 223 some time back. I still don't understand 178, but I'm being slow this morning.