High-performance, high-precision sharded counters in the HRD


Jeff Schnitzer

Aug 20, 2011, 4:50:38 PM8/20/11
to Google App Engine
The short version boils down to two questions:

1) What is the maximum realistic throughput for memcache increment()
on a single counter?

2) The documentation for the HRD says that queries across entity
groups are eventually consistent. Does this extend to __key__ queries
as well? For example, will this be eventually or strongly consistent?

SELECT * FROM Thing WHERE __key__ >= KEY('Thing', 10) AND __key__ <=
KEY('Thing', 20)

-----

The long version, to hopefully open a discussion:

I'm selling various types of virtual goods, each of which has a fixed
number available. The constraints for each product are:

* I can sell up to exactly N items
* I cannot under any circumstances sell N+1
* For some products, N might be hundreds of thousands
* The peak purchase rate is expected to be hundreds per second when a
highly anticipated product goes on sale. Maybe thousands per second.

I have thought of several strategies for building this system, but the
best one requires a counter that can:

1) Be updated hundreds of times per second (thousands?)
2) Give me a transactionally safe count, or at least temporarily err
on the side of overcounting (not undercounting!)

Between memcache and the datastore, I can build this:

* Memcache increment() gives me a current atomic count of purchases.
* Sharded counters give me a scalable persistent record of purchases
(the purchase record can be parented by the count shard and updated in
the same transaction)
* Memcache putIfUntouched() lets me safely initialize the memcache
value as long as I can get a strongly consistent count from the shards
* By ordering memcache increments *before* shard increments for
purchases, and shard decrements *before* memcache decrements for
refunds, I can guarantee that datastore failures will at worst cause
overcounts, which is fine. A timeout on the memcache value means it
will eventually become accurate again. (A rough sketch of this flow
follows below.)
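Roughly, the purchase flow would look like this (an untested Python
sketch for brevity; NUM_SHARDS, CounterShard, Purchase, and the key
names are just illustrations, not anything from the SDK):

import random

from google.appengine.api import memcache
from google.appengine.ext import db

NUM_SHARDS = 100          # illustrative shard count per product
LIMIT = 100000            # N: maximum number of units we may ever sell
CACHE_KEY = 'sold-XYZ'    # hypothetical per-product memcache key

class CounterShard(db.Model):
    count = db.IntegerProperty(default=0)

class Purchase(db.Model):
    user = db.StringProperty()

def buy(user):
    # 1) memcache first: a later datastore failure can then only overcount.
    #    (Initializing the memcache value from a consistent shard sum, as
    #    described above with putIfUntouched/CAS, is omitted here.)
    sold = memcache.incr(CACHE_KEY, initial_value=0)
    if sold is None or sold > LIMIT:
        return False  # sold out, or a memcache hiccup; refuse rather than oversell

    # 2) Bump a random shard and write the purchase record in the same
    #    transaction; the purchase is parented by the shard.
    shard_name = 'XYZ-shard%d' % random.randint(1, NUM_SHARDS)
    def txn():
        shard = (CounterShard.get_by_key_name(shard_name) or
                 CounterShard(key_name=shard_name))
        shard.count += 1
        shard.put()
        Purchase(parent=shard, user=user).put()
    db.run_in_transaction(txn)
    return True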

First question: What is the practical throughput of
memcache.increment() on a single counter? Thousands per second?

Second: The key is being able to get a strongly consistent sum of the
sharded counters. The example of a sharded counter in the App Engine
documentation uses a query to fetch the shards, which will only be
strongly consistent on the M/S datastore.

Since get() is strongly consistent, the obvious solution is to give
the shards predictable names and use a batch get() for XYZ-shard1,
XYZ-shard2, ... XYZ-shard100. OK, but it's not quite as nice as being
able to query for __key__ >= XYZ-shard1 and __key__ <= XYZ-shard100,
because some of the shards might not exist... or I might want to
decrease the number of shards for a running system. Should I just not
care about this case?
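Concretely, the batch get looks something like this (an untested
sketch; missing shards come back as None from db.get(), so they're
easy to skip):

from google.appengine.ext import db

def consistent_total(product_id, num_shards):
    # Predictable key names: XYZ-shard1 ... XYZ-shardN
    keys = [db.Key.from_path('CounterShard', '%s-shard%d' % (product_id, i))
            for i in range(1, num_shards + 1)]
    shards = db.get(keys)  # batch get by key is strongly consistent in HRD
    return sum(s.count for s in shards if s is not None)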

I guess I've probably answered question #2... but if anyone else has
any suggestions for how to build a system like this, I'd love to hear
your thoughts.

Thanks,
Jeff

Stephen

Aug 20, 2011, 5:34:59 PM8/20/11
to google-a...@googlegroups.com
On Sat, Aug 20, 2011 at 9:50 PM, Jeff Schnitzer <je...@infohazard.org> wrote:
>
> Second:  The key is being able to get a strongly consistent sum of the
> sharded counters.

You only really need to know whether widgets are in stock or out of stock.

While widgets are in stock, record the customer's intent to purchase by
placing the order in a pull queue. Bulk-process the queue in batches
of 1000 requests, deducting 1000 from a total-widgets counter. Flip
the out-of-stock bit when you run out.

All orders received get an entity in the datastore marked with
success, or with failure for the stragglers that entered the queue
before the out-of-stock bit could be flipped.

Use Ajax form submission on the front end: the user receives a ticket
immediately and polls the ticket until success or failure.
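Something like this on the consumer side (a rough sketch only; Stock,
Order, and the 'orders' queue name are placeholders):

from google.appengine.api import taskqueue
from google.appengine.ext import db

class Stock(db.Model):
    remaining = db.IntegerProperty()
    sold_out = db.BooleanProperty(default=False)

class Order(db.Model):
    ticket = db.StringProperty()
    success = db.BooleanProperty()

def process_orders():
    q = taskqueue.Queue('orders')
    tasks = q.lease_tasks(lease_seconds=60, max_tasks=1000)
    if not tasks:
        return

    def deduct(wanted):
        stock = Stock.get_by_key_name('XYZ')
        granted = min(wanted, max(stock.remaining, 0))
        stock.remaining -= granted
        if stock.remaining <= 0:
            stock.sold_out = True
        stock.put()
        return granted

    granted = db.run_in_transaction(deduct, len(tasks))

    # The first `granted` orders succeed; the stragglers fail.
    db.put([Order(ticket=t.payload, success=(i < granted))
            for i, t in enumerate(tasks)])
    q.delete_tasks(tasks)

The front end would add a taskqueue.Task(payload=ticket, method='PULL')
to the queue and hand the ticket back to the client for polling.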

Jeff Schnitzer

Aug 20, 2011, 7:49:33 PM8/20/11
to google-a...@googlegroups.com

This is a really interesting idea, especially combined with the
Channel API to push notifications to the client. But I'm worried about
a couple of things:

1) Are there periods when pull task queues go down? There are for
push queues, and they back up occasionally. I would hate for the user
to click "buy" and then wait a significant time (more than a second or
two) before getting an OK to proceed.

2) Wouldn't this require continual looping to see if there is anything
on the task queue? If it checks every 2 seconds, that's potentially 2
seconds of user waiting. Or if this is done by kicking off a push
task, we're back to long waits or failures when the push task system
is backed up or disabled.

3) I'm a little concerned about getting enough parallelism with this.
If the task processor leases 1,000 tasks, they'll be a mishmash of
different kinds. At the very least it will require writing 1,000
purchase records, which will probably take a few seconds even with
async. Maybe the answer is to do smaller chunks and run multiple
processors in parallel, but then we get back to potential collision
and sharding issues. And the total throughput of the system will be
1,000 per (however long it takes) rather than 1,000 per purchasable
item per (however long it takes). This feels solvable, but it might be
a balancing act rather than straightforward engineering.

Just to add a few bits of complexity that I didn't mention the first time:

* The UI must display the # of units left

* There's a shopping cart of sorts whereby the product is deducted
from inventory when it gets added to a cart. It gets returned to the
pool if the transaction hasn't been completed in 15 minutes (similar
to the way Eventbrite works).

However, I don't think this extra complexity is too hard to add to a
basic purchase system that can already handle restricting sales to a
specific # of units.

Jeff

Robert Kluin

Aug 21, 2011, 2:57:10 AM8/21/11
to google-a...@googlegroups.com

Hey Jeff,
  If I remember correctly, I've used the memcache incr function without issue at maybe a couple hundred ops/second.  I've got some processing that is quite similar to what you're describing, and I periodically see keys get into a sort of "stuck" state for a brief time -- maybe 30 seconds to a minute -- where they won't return or take a value.  You'd want to be sure your logic and process can handle that, because I see it daily.
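One way to cope with a stuck key (just a sketch): treat a None return
from incr as a failure and fall back to a slower-but-reliable path
instead of blocking the purchase.

from google.appengine.api import memcache

def safe_incr(key, fallback):
    value = memcache.incr(key, initial_value=0)
    if value is None:
        # Key is unavailable or stuck; fall back to something reliable,
        # e.g. a strongly consistent read of the counter shards.
        return fallback()
    return value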

  Stephen has a neat idea with using pull-queue tasks as a marker.  It would be really great if you could batch the tasks in some way; I think there is an issue open for this (maybe subqueues).  The reliability of inserting tasks has been excellent from what I've seen, as good as or better than any of the other services, so using them as a marker might be a really good way to ensure you don't oversell an item.  In your case, you may not need to care if you aren't immediately able to lease a task.  Just using them as markers might be sufficient to know whether the purchase is OK or not.

  I started writing a process using pull queues for aggregations like this in April.  I use the pull tasks as markers, then I've got a dispatcher process that leases them, groups them, and inserts aggregator tasks to process each batch.  I try to insert a dispatcher task any time something is added to the pull queue.  I use a simple system of batch numbers and rounded timestamps so I don't insert a new dispatcher for every item, made more efficient by memcache markers.
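The marker trick is roughly this (a sketch of the idea; the bucket
size, queue, and URL are made up):

import time

from google.appengine.api import memcache, taskqueue

BUCKET_SECONDS = 5  # rounded-timestamp batching window (illustrative)

def maybe_insert_dispatcher():
    bucket = int(time.time() / BUCKET_SECONDS)
    marker = 'dispatcher-%d' % bucket
    # memcache.add succeeds only for the first caller in this bucket, so
    # roughly one dispatcher task is inserted per bucket rather than one
    # per item added to the pull queue.
    if memcache.add(marker, True, time=BUCKET_SECONDS * 10):
        try:
            taskqueue.add(url='/work/dispatch',
                          name='dispatch-%d' % bucket,
                          countdown=BUCKET_SECONDS)
        except (taskqueue.TaskAlreadyExistsError,
                taskqueue.TombstonedTaskError):
            pass  # another instance already inserted this bucket's dispatcher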

   As for the sharded counter, I would use the predictable names + fetch-by-key method.  That will give you the best chance of not missing something; just keep the number of shards as low as you can.  Each key requires a separate transaction to fetch the entity -- true whether using a query or a fetch by key -- so each additional entity increases the chance that a fetched entity has already been updated.  This is a great spot to use multiple async batch fetches to minimize the actual wall-clock time.  You should be able to come up with a nice way of adding/removing shards to keep this as efficient as possible.
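For the wall-clock point, the async batches could look like this (a
sketch; db.get_async is in the Python SDK, but the model and naming
are made up):

from google.appengine.ext import db

def consistent_total_async(product_id, num_shards, batch_size=20):
    keys = [db.Key.from_path('CounterShard', '%s-shard%d' % (product_id, i))
            for i in range(1, num_shards + 1)]
    # Kick off all the batch gets, then collect the results.
    rpcs = [db.get_async(keys[i:i + batch_size])
            for i in range(0, len(keys), batch_size)]
    total = 0
    for rpc in rpcs:
        total += sum(s.count for s in rpc.get_result() if s is not None)
    return total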

Robert





Stephen

Aug 21, 2011, 9:14:07 AM8/21/11
to google-a...@googlegroups.com
On Sun, Aug 21, 2011 at 12:49 AM, Jeff Schnitzer <je...@infohazard.org> wrote:
>
> Just to add a few bits of complexity that I didn't mention the first time:
>
>  * The UI must display the # of units left

If you're selling up to 1000/s, is this a meaningful number? By the
time the user has read the message it will be wrong.

You could estimate the global sales velocity by recording sales in a
front-end instance's local memory and periodically flushing to
memcache. Given the global total from memcache you could give messages
like: half gone; fewer than 5000 left; estimated all sold in 10 mins.
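For example (a sketch; module-level globals persist between requests
on the same instance, and the key name is made up):

import time

from google.appengine.api import memcache

FLUSH_SECONDS = 5
_local_sales = 0
_last_flush = time.time()

def record_sale():
    global _local_sales, _last_flush
    _local_sales += 1
    if time.time() - _last_flush > FLUSH_SECONDS:
        # Push the locally accumulated count into a shared memcache total.
        memcache.incr('global-sales-XYZ', delta=_local_sales, initial_value=0)
        _local_sales = 0
        _last_flush = time.time()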


Here's another way to sell widgets:

In Phase One, a front-end instance starts up and deducts 1000 widgets
from a datastore counter. It may sell widgets with IDs 1-1000 from
instance memory. It is selling widget 42, not any old widget. When it
sells out, it deducts another 1000 widgets from the global datastore
counter, widgets with IDs 10,001-11,000 say.

Contention on the datastore counter is reduced 1000-fold and
front-ends may sell without contention, and without overselling, as
long as they continue to grab batches of 1000.
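The block lease might look something like this (a sketch; GlobalCounter
is a made-up model):

from google.appengine.ext import db

BLOCK = 1000

class GlobalCounter(db.Model):
    next_id = db.IntegerProperty(default=1)
    limit = db.IntegerProperty()  # total widgets available

def lease_block(product_id):
    def txn():
        c = GlobalCounter.get_by_key_name(product_id)
        if c.next_id > c.limit:
            return None  # sold out globally
        start = c.next_id
        end = min(start + BLOCK - 1, c.limit)
        c.next_id = end + 1
        c.put()
        return (start, end)  # this instance may sell IDs start..end from memory
    return db.run_in_transaction(txn)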

A front-end may crash or be recycled without selling all 1000 widgets.
You may be happy selling 80%, 95%, 99.9%, depending.

However, when a front-end instance grabs the last 1000 widgets from
the global datastore counter, you could enter Phase Two:

Sweep the sales looking for gaps in your blocks of 1000. You could
have your front-ends record when they lease 1000 IDs and when they
sell all 1000 IDs, and checkpoint every N mins if sales die down
before they sell out. Consolidate all the gaps into a new range from
0-N and restart the sale. Start speculatively sweeping once you've
sold 90% of stock so that there is no time gap between a Phase One and
a Phase Two sale. Add phases until 99.N% is sold.

Jan Zawadzki / Hapara

Aug 22, 2011, 9:28:38 AM8/22/11
to Google App Engine
Memcache incr: we have upwards of 200 instances doing this with no
apparent performance penalty.

As Robert pointed out, there are occasions when a value "sticks", so
you will need a way to deal with this gracefully.

J

Ikai Lan (Google)

Aug 22, 2011, 1:37:45 PM8/22/11
to google-a...@googlegroups.com
Jeff,

1. I don't know the exact number, but this method has been used in apps that get thousands of QPS.
2. No, because a __key__ query still goes through an index; there's no entity group root to "transact" on to see if we have the newest version of the data. If you did a batch get with the keys themselves, it would be strongly consistent. Here's my example that does that:


My example uses pregenerated keys with a prefix and a batch get by key for strong consistency, though the write-behind counter cache (used for sorting) is eventually consistent.

--
Ikai Lan 
Developer Programs Engineer, Google App Engine



Ikai Lan (Google)

Aug 22, 2011, 1:38:33 PM8/22/11
to google-a...@googlegroups.com
(and if I had read your entire email, I would have realized that you already figured out that you can do a batch get by key)

--
Ikai Lan 
Developer Programs Engineer, Google App Engine



Kaan Soral

Aug 22, 2011, 3:49:10 PM8/22/11
to Google App Engine
Out of curiosity, are you in the development stage?

I did a lot of unnecessary thinking like this when I was in the
development stage, and after going to production I saw that half of my
work was unnecessary.