Reference existence data modelling

17 views
Skip to first unread message

Drew Sears

unread,
May 28, 2008, 6:19:57 PM5/28/08
to Google App Engine
I'm trying to drink the BigTable Kool-Aid and move towards more data
duplication and less relational mapping, but I wanted to be sure there
isn't a better way to do this:

Let's say I have a Flashcard model with 1000 entities. When a user
logs in, I want to sort the cards by the number of times the user has
viewed it. This might include 250 that the user has viewed once, 250
that the user has viewed twice, and 500 that the user has never
viewed.

With MySQL, this would be something like:

SELECT * FROM `cards` LEFT JOIN `views` ON (cards.id = views.card_id
AND views.user_id = 123) ORDER BY views.id IS NULL DESC, views.total

With GAE, I don't see any clean way to look for objects for which a
specific back reference is missing. My current thought is to populate
an empty "views" record for the user, for every card. That's 1000
writes with just empty placeholder data, which is kind of painful.

On top of this, I have to keep the user's records up to date when new
cards are added. My thought here is just to have the user cache a
timestamp for the most recent card it has a reference record for,
check the timestamp on the most recent real card, and then I can use
the timestamp index to sync up pretty easily.

The data duplication still feels wrong. Am I missing something?

Thanks,
Drew

Mahmoud

unread,
May 29, 2008, 11:30:31 AM5/29/08
to Google App Engine
Drew,
Here are some thoughts:
1. To be consistent with your above nomenclature, one could create an
entity called View. Such an entity would hold a reference to the User,
and an IntegerProperty to hold the count (seen_count). Each time a
user is presented a Flashcard, you create or update such entity. Lets
call the flashcards in this group, SEEN_SET.

2. Now, to present the user with Flashcards, ordered by how many times
they've been seen, one can do the following:
a. Get all the Flashcards in your model. This is easy to query for,
something like Cards.all(). Lets put those entities in a Python
set[1], and lets call it FULL_SET.
b. Get all the Flashcards the user has seen. We've stored those in
step 1. Make sure you sort them by seen_count in your query.
c. Get the difference between FULL_SET and SEEN_SET. So we're
removing all the cards that have already been seen, from the full
set.
So now you really have an UNSEEN_SET.
c. Now, append the SEEN_SET (which is ordered by seen_count, as in
#b) to the end of the UNSEEN_SET. Lists have an 'extend' function.
Sets have a 'union' function.

Now you have your ordered set of cards. The ORDER_BY query in #b will
probably be optimized by the engine. However, the filtering in #c
might be expensive.

-Mahmoud



[1] http://docs.python.org/lib/types-set.html

Drew Sears

unread,
May 29, 2008, 8:10:15 PM5/29/08
to Google App Engine
Thanks, but this isn't really an improvement, it just shifts the cost
from one-time disk writes to a recurring CPU impact. The objective is
to get the "unseen set" without loading and comparing the entire "full
set" and "seen set".

Tom Offermann

unread,
Jun 1, 2008, 9:35:22 PM6/1/08
to google-a...@googlegroups.com
Drew,

You asked a good question. I've been thinking about it for a couple
days, and really, I was hoping you would get some more responses,
because you've described a common scenario.

As you point out, with the GAE datastore, you can't query for
"missing" data the way you can with the LEFT OUTER JOIN. (Your MySQL
query will show the view count for *all* cards, even when there's no
view count record for a particular card.)

Your approach (creating a view record where view_count = 0 for each
card for each user) is probably best, but here's another idea to
consider/test:

For each user, create an unseen_cards list of db.keys, where the each
db.key points to a Card that the user hasn't viewed.

class User(db.Model):
user = db.UserProperty()
...
unseen_cards = db.ListProperty(db.key)

The upside:

* When you add a user, you do 1 write (a single list of 1000
unseen_cards), rather than 1000 writes (1 write for each card). This
means that adding a user should be faster, especially as the number of
flashcards grows.

The downsides:

* Generating the "sorted by view_count" flashcard list is slightly
more complicated, since you have to query on views, then query on
unseen_cards and append the unseen_cards to the list.

* How many db.keys can you keep in a ListProperty? At some point,
you'll have too many cards to handle efficiently in a ListProperty,
but I have no idea when that is.

* Viewing the card for the first time is more complicated, since you
have to remove the db.key from the unseen_cards list and create the
view record. For data consistency, you should do this operation in a
transaction.

Tom

--
My blog: http://offermann.us/

Mahmoud

unread,
Jun 2, 2008, 12:17:07 AM6/2/08
to Google App Engine
> For each user, create an unseen_cards list of db.keys, where the each
> db.key points to a Card that the user hasn't viewed

So would inserting a new flashcard require us to update each user?
This will not scale if you have millions of users, especially in the
absence of batch processing.

Tom Offermann

unread,
Jun 2, 2008, 9:58:51 AM6/2/08
to google-a...@googlegroups.com

Yes. In both Drew's original idea and in my suggestion, you would have
to set the view_count to 0 for each user for every new card.

Obviously, you wouldn't be able to complete millions of updates in a
single Web request without timing out. So, until Google allows
background processes, you would have to do something like add recently
added cards to a user's account when that user logs in, rather than
adding them to all users when you first create the flashcard.

The argument is that somehow you've got to initialize view_count for
every card for every user, otherwise you can't efficiently query for
all flashcards for a user sorted by view_count (including unseen
cards). Unless someone can show a better way to do this query that
doesn't rely on initializing view_count...

Tom
--
Blog: http://offermann.us

Reply all
Reply to author
Forward
0 new messages