I doubt BigTable keeps track of index positions... it seems from its
architecture (distributed sorted list of keys) that it would require
updating every entity in the table on every insert to keep track of
each entity's position. So, the question becomes: do you really need
this information, or can your application be restructured to use the
index key itself for whatever purpose you have in mind for the index
key's position?
I can think of at least one example where you might really need this
information: a ranking system that ranks a huge list of entities based
on where they fall in the sorted list of some key (probably a
numerical "strength" value calculated from the rest of the entity's
properties). Let's say the entities represent Users on some
competitive flash gaming site, and JoeNewUser, who has been a member
of the site for 3 weeks and played 734 games of Tower Defense, wants
to know what his rank is. Joe's rank is 73,248, but we don't want to
have to look at the 73,247 people ahead of him (sorted by TotalScore
DESC) to find that out. If your actual use case has nothing in common
with this, let me know, but I think it at least parallels the general
gist of what you are asking for.
What you are looking for is a B-Tree, with the addition of subtree
size in the subtree pointers:
http://en.wikipedia.org/wiki/B-tree
The data model for that might look something like this:
class BTree(Model):
totalScores = ListProperty(int) #a sorted list totalScores, up to
some limit long
user = ListProperty(Key(User)) #same length as totalScores, points
to the User the score belongs to
pointers = ListProperty(Key(BTree)) #pointers to child Nodes
subTreeCount = ListProperty(int) #number of keys in all child
Nodes below this pointer
subTreeSum = ListProperty(int) #a prefix sum (running total)
of subTreeCount +1 for each totalScore
So it's really just a bunch of list properties. You set how big you
want the list to get (bigger is better - fewer datastore calls for
lookups - but keep in mind the 1 MB limit). Follow the procedures on
the wiki page linked above, but on insert, add one to the subTreeCount
for the each pointer you follow (and 1 to every subTreeSum from there
to the right on each node). To find the index of an item you are
searching for, keep track of the subTreeSum of all the subtrees to the
left of the subtree you are following, plus the nodes you pass along
the way, and add them all up to get the index of the final node.
Don't forget to binary search the totalScores list - there's no need
to look at every item in it, since it's already sorted. You may have
to do some linear searching at the end if several users have the same
totalScore, but you can avoid that by using fixed point numbers and
appending the UserID to the key to ensure uniqueness while still
sorting correctly (lexicographically). Split operations are
relatively simple too, since you have all the data you need to
populate the new subTreeCounts and subTreeSums... it's just
bookkeeping.
Looking at the complexity of this, if you have 100 keys per BTree
node, and a million entities, you can find the one you are looking for
and its index in 3 sequential datastore calls, while comparing to 8
keys per node. 1000 keys per page gets you down to 2 datastore
lookups and 10 comparisons per page, but you start spending more and
more CPU time serializing and deserializing the data in and out of the
datastore, so you'll have to play around with it to find the
Goldilocks number. Good luck, and let me know if you need a more
detailed explanation (though it will probably take a few days).
David