counting

157 views
Skip to first unread message

Ben Nevile

unread,
Oct 24, 2008, 2:52:06 AM10/24/08
to Google App Engine
Hi,

I have recently started using GAE for components of a large sports-
related Facebook app. One of the contests has hundreds of thousands
of participants, and I need to be able to tell a user at any given
time what place they're in. eg, you are 34,728th out of 234,829
participants.

After spending some time with the docs and browsing this group, it
appears that using the datastore there's no way to accomplish this
relatively mundane task. So do I have to keep a mysql database active
on some other host just so that I can efficiently do this type of
analysis?

Ben

Sylvain

unread,
Oct 24, 2008, 12:51:33 PM10/24/08
to Google App Engine

Ben Nevile

unread,
Oct 24, 2008, 3:39:14 PM10/24/08
to Google App Engine
Hi Sylvain - thanks for the pointer, and although I now know a cool
sharding hack for a shared resource, I don't see how the technique can
be applied to my problem. To summarize, I could have a model indexed
by a single entity called "points", and I want to be able to tell
someone how they rank in terms of points. For someone with say 100
points, in SQL I would do something like

select count(*) from entries where points > 100

Ben



On Oct 24, 9:51 am, Sylvain <sylvain.viv...@gmail.com> wrote:
> Check this :http://groups.google.com/group/google-appengine/browse_frm/thread/e13...

Sylvain

unread,
Oct 24, 2008, 5:33:10 PM10/24/08
to Google App Engine
Sorry, I don't understand exactly the problem but if you want to have
this SQL

select count(*) from entries where points > 100

You can create many "sharded counter"
and from the code use this :

increment("point_%i" % point)

Then, if you want to count, just : get_count("point_%i" % point)

It will create a lot of of "counter" but it will works fine.

Then if you want to count for points > 100
You will have to do many any times : get_count("point_%i" % point) for
point from x to y
else you can create steps : a counter from 100 to 150 , etc,...

I think the only solution to do count(*) (if count(*) > 1000) is to
use sharded counter

if count(*) < 1000 then you can use the "count" keyword.

Regards.

yejun

unread,
Oct 24, 2008, 5:49:45 PM10/24/08
to Google App Engine
His problem is a count needed for every single user. So a single
change will cause a massive update on all count.
On a traditional database it is the position on the index.

Peter Recore

unread,
Oct 25, 2008, 8:27:38 PM10/25/08
to Google App Engine
I'll try to state a more formal version of the problem, then offer my
suggestion.

Problem:

Assume that there is an entity type called Player, and each Player has
a score. A score is an integer. For any given Player P with a score
of X, we want to know how many other Players have scores that are
greater than or equal to X. This is their Rank.

Let's say we have the following Players and scores:

Peter 30
Ben 27
Sylvain 42
Marzia 50
Ryan 12

Peter's Rank would be 3 because Marzia, Sylvain and Peter all have
scores greater than or equal to Peter's score of 30.
Marzia's Rank would be 1, and Ryan's would be 5. The tricky part is
that Ben needs to do this efficiently for hundreds of thousands of
items ( with a fairly frequent number of updates i assume)

Getting this kind of answer is relatively simple on a SQL database,
though I don't honestly know how scalable the typical count(*) > X
solutions are. With the proper indexes, they could probably be very
fast, but i don't know if you'd get lots of contention on writing to
that table/index.

This does seem like a problem that any game with the concept of 'high
score' built in app engine would need to solve. (actually, this
problem probably generalizes even further) I'm pretty sure I will
need something like this at some point myself. So I might as well
take a stab at it now. Here are my initial thoughts... Encouraging or
disparaging comments welcome.

To solve this problem with the GAE datastore, I propose building a
ShardedRanker.
The lowest level item we need to consider is a RankedItem. A
RankedItem has a key, a value, and a reference to a RankingGroup. The
key would be a player's id, and the value would be a player's score.

A ShardedRanker would have references to a number of RankingGroups.
Each RankingGroup would contain a list of RankedItems, and be
responsible for knowing the relative rank of each item it contains.
Each RankingGroup would also be responsible for knowing the highest
and lowest scores its items contain.

When we want to know the rank of a particular item, we can get its
overall rank by combining its relative rank within its RankingGroup
with the total number of items contained in higher ranked Groups.
This will require 1 datastore read to get the info from the
RankingGroup, plus one read for each of the RankingGroups ranked
higher than the 'target' group.

When we insert a new RankedItem, we will need to update the
RankingGroup it belongs to, and maybe the index which tells us which
Group contains which ranges of scores. At some point, if a
RankingGroup gets to big, it would probably be split.

There should be lots of opportunity to memcache things here, reducing
the need for datastore reads. As the total number of items increases,
it might be necessary to create multiple levels of RankingGroups.

-peter
Message has been deleted

Ben Nevile

unread,
Oct 27, 2008, 2:38:21 PM10/27/08
to Google App Engine
Hi Peter,

Thanks first for re-stating the problem with more precise language!

As for your solution, sharding into many different ranking groups is
an interesting approach. One could imagine maintaining a tree of
ranking groups; searching/inserting/querying would require log N reads/
writes. All of it would have to be wrapped up in transactions, which
makes me a bit queasy. Not sure I want to venture down that path when
a cheap server running MySQL can handle the problem so nicely.

I have also been evaluating CouchDB as a next-gen database solution.
Hopefully Google is tracking their progress as well - as the world
leaders in map-reduce implementations, one would expect datastore to
boast features similar to how CouchDB handles its views. To do a
histogram in CouchDB,

map: function(doc) { emit(score, 1); }
reduce: function(keys, values) { return sum(values) }

CouchDB efficiently diffs this view index with every incoming write,
so finding a rank is a quick query.


Ben

Peter Recore

unread,
Oct 27, 2008, 4:14:30 PM10/27/08
to Google App Engine
I started playing around with a solution this weekend, and trying to
make it safe for concurrent users but still fast is definitely the
tricky part. My goal is to limit the need for transactions to the
lower parts of the tree in the majority of cases.

ryan

unread,
Jan 26, 2009, 5:12:19 PM1/26/09
to Peter Recore, google-a...@googlegroups.com
[bump]

the team behind the google code jam app recently published their real
time ranking library for app engine:

http://code.google.com/p/google-app-engine-ranklist/

it uses a bucket-based data structure similar to the one peter
proposed. we'll probably post something about it to the app engine
blog soon.

happy hacking...

Reply all
Reply to author
Forward
0 new messages