Getting around joins/in queries

Taylor Gautier

unread,

Jul 27, 2009, 3:10:08 PM7/27/09

to Google App Engine

Hi,

I have a facebook app that naturally would like to perform set
operations on a logged in user's friends against the data in the
application (stored in app engine). This presents a particular
challenge, suppose for example that in my application a logged in user
can subscribe to a particular event, we then have some classes like
this:

User {
Long key;
Long facebookId;
// etc...
}

Event {
Long key;
String name;
// etc...
}

Subscribed {
Long userKey;
Long eventKey;
Long start;
Long end;
//etc...
}

Now, if I have a particular user logged in, who has a set of friends,
I want to know which events that set of friends have subscribed to.
A normal query would be "SELECT * from SUBSCRIBED where userkey in
( <the set of friends> )"

By now, I know that there are no conditional OR operations, joins, or
in queries allowed in app engine.

My initial attempt to solve this problem has resulted in writing out
the intersection at subscription time using an additional table,
something like:

FriendSubscribed extends Subscribed {
Long friendKey;
// etc...
}

// for every friend of the logged in user, write a record so the query
can simply pick up
// all records for that friend
void persistForFriends(Collection<Long> friends, Subscribed
subscribed) {
for (Long id : friends) {
pm.persist(new FriendSubscribed(id, subscribed));
}
}

Now I have shifted the query burden to the write side and I can do a
simple query on the FriendSubscribed table for a particular user which
is nice and fast: "SELECT * from FRIENDSUBSCRIBED where friendkey =
<myuserid>". This returns all the subscriptions that the logged in
user's friends have made (with some slight gotchas, e.g. if the user
adds a new friend, then any subscriptions that the new friend already
has will not be seen - this is tolerable)

So, what's the problem? On facebook, a user typically has order 100
friends. And I suspect many will have 500 and a few will have 1000+.

Even for order 100 friends, at the current write speed (5/s) I can
expect a write operation to take approx. 20s!! This is rather long
for a web operation, and gets worse as you increase the number of
friends.

I am curious if anyone has any ideas for solutions? I have a few
thoughts, but wanted to see what people thought before moving on to
the next optimization. Here are my thoughts:

1) Queue the friend write operation - it's not critical to the core
write operation that all the "join" data be written out, it can always
be re-computed at any point in time by analyzing the Subscribed
records and current friend list for each friend. Only problem is I am
using Java and the scheduled tasks API is not yet supported.
Furthermore, it still seems like a lot of work to go through if we
imagine thousands or hundreds of thousands of users making
subscriptions as each write operation they do gets multiplied by a
factor of 100-1000.

2) Make an owned relationship from Events to Subscribed. Something
like:

Event {
Long key;
String name;

Set<Subscribed> subscribed;
// etc...
}

Now reading an event records gives precisely the information I want
and simply has to be culled down (the intersection of the logged in
user's friends and the Event subscribed list is the list of friends
that are subscribed to that event). The problem I see with this
implementation is high contention. For every subscription event I
have to lock out updates and make sure everyone is writing a coherent
view of the set (don't we have to update the value into the Event
record?) Or I may be wrong and it may be that the ownership
relationship is maintained automatically by the datastore and this
isn't as much of a concern as I think it is as the owned relationship
doesn't really write anything into the owner record, it just does a
join-like query to populate the set on read...that might be ideal??

3) ?? Something else?

Devel63

unread,

Jul 28, 2009, 11:03:09 AM7/28/09

to Google App Engine

I haven't given your situation much thought, but you could write the
list to memcache and have a cron job that reads memcache and does the
writing. You don't get instant results, but the user doesn't have to
wait.

Caution: you aren't going to be able to do queries that return 500 or
1000 results anyway ... it will probably take too long to return. As
I said, I haven't taken the time to really understand what you are
doing, but perhaps you will need to rethink the architecture.

Jason C

unread,

Jul 28, 2009, 3:51:38 PM7/28/09

to Google App Engine

We have some very early code that allows for multiple parallel
queries: http://code.google.com/p/asynctools/ This allows for a
different class of reading and avoids the need to compute/store
(certain types of) additional information.

In fact, it was Facebook friends graph stuff that inspired it in the
first place. Those escapades are detailed here: http://squeeville.com/2009/07/24/asynctools/

j

bFlood

unread,

Jul 28, 2009, 5:54:52 PM7/28/09

to Google App Engine

asynctools - excellent stuff, thx for releasing it!

how have you found the results when compared to running the queries
serially? was it significantly faster? when I was messing around with
the plumbing to make async qgl queries, I noticed they did run in
parallel but the datastore.Iterator._Get calls were running serially,
which I guess is as expected. this seemed to be where a large portion
of the instance cpu was spent so the perf gain was not huge (they were
key_only queries though so maybe that was a factor)

that being said, its very possible/likely I was botching some
important part so I will definitely try out your code. thanks much

cheers
brian

On Jul 28, 3:51 pm, Jason C <jason.a.coll...@gmail.com> wrote:
> We have some very early code that allows for multiple parallel

> queries:http://code.google.com/p/asynctools/This allows for a

Jason C

unread,

Jul 28, 2009, 10:03:40 PM7/28/09

to Google App Engine

asynctools uses RPC callbacks to kick off the Iterators in parallel
too. This means that there are some limits to the type of result sets,
but for a particular set of queries they are, of course, monstrously
faster.

j

On Jul 28, 3:54 pm, bFlood <bflood...@gmail.com> wrote:
> asynctools - excellent stuff, thx for releasing it!
>
> how have you found the results when compared to running the queries
> serially? was it significantly faster? when I was messing around with
> the plumbing to make async qgl queries, I noticed they did run in
> parallel but the datastore.Iterator._Get calls were running serially,
> which I guess is as expected. this seemed to be where a large portion
> of the instance cpu was spent so the perf gain was not huge (they were
> key_only queries though so maybe that was a factor)
>
> that being said, its very possible/likely I was botching some
> important part so I will definitely try out your code. thanks much
>
> cheers
> brian
>
> On Jul 28, 3:51 pm, Jason C <jason.a.coll...@gmail.com> wrote:
>
>
>
> > We have some very early code that allows for multiple parallel

> > queries:http://code.google.com/p/asynctools/Thisallows for a

Mahmoud

unread,

Jul 29, 2009, 10:20:42 AM7/29/09

to Google App Engine

Hi Taylor,

In our application, we had to solve a similar problem. Basically, we
needed to answer the question: "what are my friends doing today?". Our
initial solution was (pseudo code):
events = []
for friend in friends:
friend_events = friend.what_are_you_doing_today()
events.extend(friend_events)

Obviously, this code is executed serially. So for a list of 200
friends, it was taking about 8 seconds to execute! unacceptable.

So to make it faster, we figured we need to execute this code in
*parallel*. Note that db.get(keys) and db.get_by_key_name(key_names)
are parallelized by the datastore ...

So first, we made a new entity that stores what a friend is doing on a
given day. Something like (I'm using your pseudo code to describe the
data structure):
EventsForDay{
Key userKey;
Key[] events;
DateTime day;
}
This is updates whenever a user attends/unattends an event. I'm not
sure why you're looping through friends and writing stuff ....

The trick is we also gave each of those entities a key_name:
<user_key>_<YYYY:MM:DD>. Now, to get stuff quickly, we do:
1. Calculate all the key-names for the EventsForDay entities for the
current user's friends. Easy.
2. db.get_by_key_name(key_names) <= this happens quite fast.
3. To kick things up a notch, use memcache in #2. For entities not
found in memcache, retrieve them from the datastore.

Now for a list of 200 friends, this executes in less than 400ms!

Cheers,
-Mahmoud

Reply all

Reply to author

Forward