sharding counters

22 views
Skip to first unread message

Jim

unread,
Mar 2, 2009, 12:55:48 AM3/2/09
to web2py Web Framework
I'm working towards implementing on Google App Engine. Google
stresses the importance of sharding counters in their environment.
http://code.google.com/appengine/articles/sharding_counters.html

Why? Well, Craigslist keeps a counter for their ads. Every time
someone posts an ad, that counter gets incremented. As of a few weeks
ago, they released Sphinx as their search engine, which means that you
can use that 9-digit ad number in any city.

On GAE, writes are slow - you might have to wait a second or more to
write out one record. Other records will have to wait for the first
one to finish. You can't run Craigslist on GAE.

What does sharding counters mean? As I interpret it, it means knowing
"about" how many records you have. Not exactly. So you get 10 or 20
sub-counters and write to one at random. If you need to know about
how many records you have, you total the 10 or 20 sub-counters to get
an answer. It's an approximate answer but if you've got a lot of
data, hey, it's going to be close enough. And you can get 10 or 20
or 40 writes/second because each time you're grabbing a different
counter.

At least that's the theory. I'm trying to translate the Google
implementation into web2py. I've got increment working except for
memcache. It's doing test0, test1, etc. with the counts properly.
get_count is not working and I'm having trouble figuring out
why counters = db(db.shards.name==name).select() is not
returning any results. That seems to be the most accurate way to
translate GAE/webapp to web2py but my suspicion is it's failing
because test0 is not equal to test. I'm trying to avoid using two
versions of routines - one for web2py w/o GAE, one with - but it might
be necessary.

def test_it():
count=get_count('test')
session.flash = count is + `count`
increment('test')
return

def get_count(name):
"""Retrieve the value for a given sharded counter.

Parameters:
name - The name of the counter
"""
total = memcache.get(name)
if total is None:
print "none"
total = 0
counters = db(db.shards.name==name).select()
for counter in counters:
total += counter.count
print counter.name
memcache.add(name, str(total), 60)
return total

def increment(name):
"""Increment the value for a given sharded counter.

Parameters:
name - The name of the counter
"""

index = random.randint(0, NUM_SHARDS - 1)
shard_name = name + str(index)
try:
counter = db(db.shards.name==shard_name).select()[0]
temp=counter.count+1
counter.update_record(count=temp)
except:
db.shards.insert(name=shard_name, count=1)

# memcache.incr(name)

Eventually I hope to implement the version that allows for increasing
of the number of shards.

Thanks.

Jim

unread,
Mar 2, 2009, 1:48:12 PM3/2/09
to web2py Web Framework
Boiled down, my problem is this:

I want to use

counter = db(db.shards.name=shard_name).select()

but web2py won't allow it. I get

SyntaxError: keyword can't be an expression

I'm trying to avoid writing a Google Query Language query here but
maybe that's the right answer.

On Mar 1, 9:55 pm, Jim <JDeib...@gmail.com> wrote:
> I'm working towards implementing on Google App Engine.   Google
> stresses the importance of sharding counters in their environment.http://code.google.com/appengine/articles/sharding_counters.html

mdipierro

unread,
Mar 2, 2009, 3:20:57 PM3/2/09
to web2py Web Framework
Because it is

counter = db(db.shards.name==shard_name).select()

;-)

Jim

unread,
Mar 2, 2009, 4:25:05 PM3/2/09
to web2py Web Framework
Yes, but that gets me an exact match, right?

If I've sharded my counters the way Google is recommending, I'll have
something like this:

name count
test0 3
test1 2
test2 3
...
test19 2

So I want to find all the records in my shards table that start with
"test" and then total them.

mdipierro

unread,
Mar 2, 2009, 6:27:56 PM3/2/09
to web2py Web Framework
I am lost here. I need to look up what "sharding" means.

Massimo

Jim

unread,
Mar 2, 2009, 7:11:33 PM3/2/09
to web2py Web Framework
In the first message, I give a reference to google's implementation.
I also found http://highscalability.com/numbers-everyone-should-know
helpful:

"Sharded Counters

We always seem to want to keep count of things. But BigTable doesn't
keep a count of entities because it's a key-value store. It's very
good at getting data by keys, it's not interested in how many you
have. So the job of keeping counts is shifted to you.

The naive counter implementation is to lock-read-increment-write. This
is fine if there a low number of writes. But if there are frequent
updates there's high contention. Given the the number of writes that
can be made per second is so limited, a high write load serializes and
slows down the whole process.

The solution is to shard counters. This means:
# Create N counters in parallel.
# Pick a shard to increment transactionally at random for each item
counted.
# To get the real current count sum up all the sharded counters.
# Contention is reduced by 1/N. Writes have been optimized because
they have been spread over the different shards. A bottleneck around
shared state has been removed.

This approach seems counter-intuitive because we are used to a counter
being a single incrementable variable. Reads are cheap so we replace
having a single easily read counter with having to make multiple reads
to recover the actual count. Frequently updated shared variables are
expensive so we shard and parallelize those writes."

Robin B

unread,
Mar 2, 2009, 7:57:15 PM3/2/09
to web2py Web Framework
You must use transactions to atomically increment any counter shard.
Web2py does not have transactions yet, so just use google models for
the sharded counters, and increment/sum them inside/after form.accepts
().

Robin

On Mar 2, 6:11 pm, Jim <JDeib...@gmail.com> wrote:
> In the first message, I give a reference to google's implementation.
> I also foundhttp://highscalability.com/numbers-everyone-should-know
Reply all
Reply to author
Forward
0 new messages