Retrieving the index position of an entity

91 views
Skip to first unread message

de Witte

unread,
Nov 1, 2011, 4:18:28 AM11/1/11
to google-a...@googlegroups.com
Hi,

I'm looking for the following:

Is it possible to retrieve the position of an entity in an index table?

Of course, within going through the entire index by retrieving keys.

Ayy tipes would be welcome



David Whittaker

unread,
Nov 2, 2011, 9:54:18 AM11/2/11
to Google App Engine
I doubt BigTable keeps track of index positions... it seems from its
architecture (distributed sorted list of keys) that it would require
updating every entity in the table on every insert to keep track of
each entity's position. So, the question becomes: do you really need
this information, or can your application be restructured to use the
index key itself for whatever purpose you have in mind for the index
key's position?

I can think of at least one example where you might really need this
information: a ranking system that ranks a huge list of entities based
on where they fall in the sorted list of some key (probably a
numerical "strength" value calculated from the rest of the entity's
properties). Let's say the entities represent Users on some
competitive flash gaming site, and JoeNewUser, who has been a member
of the site for 3 weeks and played 734 games of Tower Defense, wants
to know what his rank is. Joe's rank is 73,248, but we don't want to
have to look at the 73,247 people ahead of him (sorted by TotalScore
DESC) to find that out. If your actual use case has nothing in common
with this, let me know, but I think it at least parallels the general
gist of what you are asking for.

What you are looking for is a B-Tree, with the addition of subtree
size in the subtree pointers:
http://en.wikipedia.org/wiki/B-tree

The data model for that might look something like this:
class BTree(Model):
totalScores = ListProperty(int) #a sorted list totalScores, up to
some limit long
user = ListProperty(Key(User)) #same length as totalScores, points
to the User the score belongs to
pointers = ListProperty(Key(BTree)) #pointers to child Nodes
subTreeCount = ListProperty(int) #number of keys in all child
Nodes below this pointer
subTreeSum = ListProperty(int) #a prefix sum (running total)
of subTreeCount +1 for each totalScore

So it's really just a bunch of list properties. You set how big you
want the list to get (bigger is better - fewer datastore calls for
lookups - but keep in mind the 1 MB limit). Follow the procedures on
the wiki page linked above, but on insert, add one to the subTreeCount
for the each pointer you follow (and 1 to every subTreeSum from there
to the right on each node). To find the index of an item you are
searching for, keep track of the subTreeSum of all the subtrees to the
left of the subtree you are following, plus the nodes you pass along
the way, and add them all up to get the index of the final node.
Don't forget to binary search the totalScores list - there's no need
to look at every item in it, since it's already sorted. You may have
to do some linear searching at the end if several users have the same
totalScore, but you can avoid that by using fixed point numbers and
appending the UserID to the key to ensure uniqueness while still
sorting correctly (lexicographically). Split operations are
relatively simple too, since you have all the data you need to
populate the new subTreeCounts and subTreeSums... it's just
bookkeeping.

Looking at the complexity of this, if you have 100 keys per BTree
node, and a million entities, you can find the one you are looking for
and its index in 3 sequential datastore calls, while comparing to 8
keys per node. 1000 keys per page gets you down to 2 datastore
lookups and 10 comparisons per page, but you start spending more and
more CPU time serializing and deserializing the data in and out of the
datastore, so you'll have to play around with it to find the
Goldilocks number. Good luck, and let me know if you need a more
detailed explanation (though it will probably take a few days).

David

Nick Johnson

unread,
Nov 3, 2011, 1:30:41 AM11/3/11
to google-a...@googlegroups.com
Good post, David! Just one thing to add - there is in fact an existing library that already implements this: http://code.google.com/p/google-app-engine-ranklist/

-Nick Johnson


--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.




--
Nick Johnson, Developer Programs Engineer, App Engine


de Witte

unread,
Nov 4, 2011, 7:09:30 AM11/4/11
to google-a...@googlegroups.com
Hi David, Thank you for your detailed post.

It is for a forum application. A topic may have up to 20.000 posts, set to 10 for each page. You can browse through the posts with the next and prev page buttons but we also need a jump to page n field.

For now, we store all keys in the topic entity. It is a blob with a serialized array with some magic to reduce the size. The forum entity does the same for all topics.

We also have a ranking system to the list the active users. It is the entire ranking table divided in blocks of 50.000 users per entity @ 0.7MB. The table doesn't include users with a score of 0 posts. Didn't thought of the B-tree approach, will have second thought about it.


Reply all
Reply to author
Forward
0 new messages