Backend performance, compared


Jeff Schnitzer

Aug 7, 2012, 10:10:43 PM
to Google App Engine
If you've been reading various threads on this list you know that
Richard has been having trouble getting his mobile game to run
smoothly on GAE. It's a little unusual because timing is coordinated
precisely:

* At T+0, all clients submit scores
* At T+5s, a reaper process aggregates the scores and builds a result set
* At T+10s, all clients fetch scores

The question is: Where to submit the score data so that the reaper
can fetch and aggregate it?

Here are some answers that didn't work:

* The datastore. Eventual consistency is too eventual to query for
all the scores and get them.
* Pull queues. There's too much of a delay between task insertion
and when it appears for leasing.
* A single backend. One backend cannot handle more than ~80qps.

He eventually got a system working reliably, sharded across ten B1
instances, at a cost (beyond other charges) of ~$600/mo. It can
collect a couple thousand scores within the 5s deadline (barely).

I thought this was insane, so I built a few experiments to see what
other technologies can do, using the exact program logic of Richard's
collector. Here are the results:

The environment: 256MB Rackspace Cloud VPS running Ubuntu 10.04.4 LTS
The cost: $11/mo
The command: ab -c 10000 -n 10000 -r http://theurl (that's 10k
requests, all concurrent).

Node.js: ~2500 qps. Rock solid through multiple test runs, all
complete before the deadline.
Java SimpleHTTP: ~2100 qps. Had to bump heap up to 128MB.
Python Twisted: ~1600 qps. Failed a lot of requests on most test runs.
Python Tornado: ~1500 qps, but rock solid through multiple test runs.

So basically, an $11/mo VPS server running Javascript vastly exceeds
the capabilities of 10 backends at $60/mo each.
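
For the curious, here's roughly what the collector logic looks like, in
the Tornado flavor (an illustrative sketch, not the exact test code;
handler names are mine):

    import json
    import tornado.ioloop
    import tornado.web

    scores = []  # all submissions held in RAM between reaper sweeps

    class SubmitHandler(tornado.web.RequestHandler):
        def post(self):
            # Each client POSTs one score at T+0.
            scores.append(json.loads(self.request.body))
            self.write("ok")

    class ReapHandler(tornado.web.RequestHandler):
        def get(self):
            # The reaper takes the whole set in one fetch and resets it.
            global scores
            batch, scores = scores, []
            self.write(json.dumps(batch))

    app = tornado.web.Application([(r"/submit", SubmitHandler),
                                   (r"/reap", ReapHandler)])

    if __name__ == "__main__":
        app.listen(8080)
        tornado.ioloop.IOLoop.instance().start()

There's nothing clever in there - which is the point.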

Jeff

rern...@gmail.com

Aug 8, 2012, 1:30:07 AM
to google-a...@googlegroups.com
Very interesting result.

Sent from my HTC

Ray

Aug 8, 2012, 3:08:00 AM
to google-a...@googlegroups.com, je...@infohazard.org
$60/m or $600/m?

Kristopher Giesing

Aug 8, 2012, 3:55:35 AM
to google-a...@googlegroups.com, je...@infohazard.org
Terrifying.

So, I assume Richard's medium-term plan is to set up an "outboard" collector endpoint, and have a single GAE backend reap from that?

- Kris

André Pankraz

Aug 8, 2012, 4:08:19 AM
to google-a...@googlegroups.com, je...@infohazard.org
But why are the B4 backends in your example this slow? They are much bigger (RAM) than a 256MB Rackspace VPS and have dedicated fast CPU cores. I have some CPU-intensive tasks that are really OK on backends. Backends are crazy expensive, but slow?!

80 qps? Maybe it's:
* slow network access via HTTP or Tasks or whatever?
* or the maximum of 10 threads?

As I already said in the pros/cons thread - you sometimes need dozens of GAE instances to match two good root servers in terms of qps. The major reason for me is the network latency and the thread restrictions.

Johan Euphrosine

Aug 8, 2012, 4:39:13 AM
to google-a...@googlegroups.com
I believe those results are due to the current limitation of 10 concurrent requests per instance.

There is a feature request to make that configurable:

I made a quick benchmark with urlfetch to a go backend that stores k/v pairs in memory.

While I managed to get 350+ qps out of it, I don't get a performance boost from using a B8 instead of a B1; see below:

This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking proppy-loadtest-backend.appspot.com (be patient)
Completed 300 requests
Completed 600 requests
Completed 900 requests
Completed 1200 requests
Completed 1500 requests
Completed 1800 requests
Completed 2100 requests
Completed 2400 requests
Completed 2700 requests
Completed 3000 requests
Finished 3000 requests


Server Software:        Google
Server Hostname:        proppy-loadtest-backend.appspot.com
Server Port:            80

Document Path:          /?backend=gob8
Document Length:        31 bytes

Concurrency Level:      300
Time taken for tests:   8.480 seconds
Complete requests:      3000
Failed requests:        0
Write errors:           0
Total transferred:      2777169 bytes
HTML transferred:       93000 bytes
Requests per second:    353.77 [#/sec] (mean)
Time per request:       848.006 [ms] (mean)
Time per request:       2.827 [ms] (mean, across all concurrent requests)
Transfer rate:          319.82 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       17   18   0.7     18      31
Processing:   133  769 293.1    763    1904
Waiting:      133  769 293.1    763    1904
Total:        151  787 293.1    780    1921

Percentage of the requests served within a certain time (ms)
  50%    780
  66%    907
  75%    996
  80%   1040
  90%   1162
  95%   1272
  98%   1394
  99%   1475
 100%   1921 (longest request)
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking proppy-loadtest-backend.appspot.com (be patient)
Completed 300 requests
Completed 600 requests
Completed 900 requests
Completed 1200 requests
Completed 1500 requests
Completed 1800 requests
Completed 2100 requests
Completed 2400 requests
Completed 2700 requests
Completed 3000 requests
Finished 3000 requests


Server Software:        Google
Server Hostname:        proppy-loadtest-backend.appspot.com
Server Port:            80

Document Path:          /?backend=gob1
Document Length:        31 bytes

Concurrency Level:      300
Time taken for tests:   8.262 seconds
Complete requests:      3000
Failed requests:        0
Write errors:           0
Total transferred:      2773964 bytes
HTML transferred:       93000 bytes
Requests per second:    363.11 [#/sec] (mean)
Time per request:       826.197 [ms] (mean)
Time per request:       2.754 [ms] (mean, across all concurrent requests)
Transfer rate:          327.88 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       17   18   0.6     18      28
Processing:   136  759 331.8    722    2176
Waiting:      136  758 331.8    722    2176
Total:        154  776 331.8    740    2193

Percentage of the requests served within a certain time (ms)
  50%    740
  66%    881
  75%    978
  80%   1045
  90%   1219
  95%   1373
  98%   1520
  99%   1659
 100%   2193 (longest request)
--
Johan Euphrosine (proppy)
Developer Programs Engineer
Google Developer Relations

vlad

Aug 8, 2012, 12:24:38 PM
to google-a...@googlegroups.com, je...@infohazard.org
Jeff, what storage subsystem did you use on your dedicated host? If you kept all data in RAM then it is just sleight of hand. Overall I am not surprised by the lame performance of backends and pull queues. I am a bit surprised by the cost of doing it in the Datastore. Good work!

Jeff Schnitzer

Aug 8, 2012, 1:27:23 PM
to vlad, google-a...@googlegroups.com
Why do you call it sleight of hand? Storing the data in RAM represents
the actual business need. We're doing not only an apples-to-apples
comparison but an "actual implementation of the business case vs. actual
implementation of the business case" comparison.

Persistent RAM is one of the principal values of Backends, and it
turns out that they aren't very useful for that purpose. Not only is
QPS low but, if you saw my post in Brandon's thread, latency is
erratic and horrible.

Just to give some latency numbers: Urlfetches to the rackspace VPS
are consistently 30-50ms. Not great, but tolerable. Urlfetches to a
backend average about the same - they skew more towards 15-25ms, but
there are a large number of 250ms+ outliers that push up the
arithmetic mean.

No-op fetches F1 to B1:
https://img.skitch.com/20120808-nn749683wqdg5516fy732k71iw.jpg
No-op fetches F1 to B8:
https://img.skitch.com/20120808-gerejtgfd958i691hsiw897wjw.jpg

The test F1 to Rackspace:
https://img.skitch.com/20120808-npj3kutc2p511bw5gujxkm5kg4.jpg

BTW I picked Rackspace for this test because I happened to already
have a VPS there. I would love to see latency numbers for Linode,
EC2, and Google Compute Engine. Does anyone want to run the test?
I'm still waiting for an invite to GCE.

Jeff

Drake

Aug 8, 2012, 1:42:04 PM
to google-a...@googlegroups.com
Jeff, your low volume is affecting your numbers. The more you do, the
more consistent the numbers become. I don't know what black magic
Google does, but the more traffic I have, the faster my instances get,
the faster my datastore reads (and writes) become, and the better the
hit ratio on my memcache.




Jeff Schnitzer

Aug 8, 2012, 1:48:19 PM
to google-a...@googlegroups.com
I do not believe you without quantitative measurements. I see no
reason to believe that something that sucks at low volume will
suddenly stop sucking at high volume, and I have a lot of reason to
distrust your wild assertions.

Also: The system requiring 10 backends is a real-world app with 2k+
simultaneous users. How much more volume do you think it needs?

Jeff

Drake

Aug 8, 2012, 2:18:53 PM
to google-a...@googlegroups.com

Instances:       3 total (1 Resident)
Average QPS:     1.017
Average Latency: 568.7 ms
Average Memory:  61.5 MBytes

Memcache:
  Hit count:        237801
  Miss count:       1940100
  Hit ratio:        10%
  Item count:       43785 item(s)
  Total cache size: 78798944 byte(s)
  Oldest item age:  8 hour(s) 36 min(s) 47 second(s)

Same App serving same content, but in a different Market

 

Instances:       5 total
Average QPS:     41.062
Average Latency: 98.7 ms
Average Memory:  51.6 MBytes

Memcache:
  Hit count:        2610910
  Miss count:       308020
  Hit ratio:        89%
  Item count:       54803 item(s)
  Total cache size: 92765915 byte(s)
  Oldest item age:  8 hour(s) 27 min(s) 31 second(s)

Waleed Abdulla

Aug 8, 2012, 8:44:58 PM
to google-a...@googlegroups.com
Thanks, Jeff. Is it possible to repeat the test with qps < 10 to rule out the limit that Johan pointed out? In other words, how big is the performance difference if you had fewer requests that do more work?




Jeff Schnitzer

Aug 8, 2012, 11:25:19 PM
to google-a...@googlegroups.com
Why are you posting memcache stats in a thread about backends?

Jeff


Kristopher Giesing

Aug 9, 2012, 12:29:11 AM
to google-a...@googlegroups.com, vlad, je...@infohazard.org
I have a Linode instance.  How do I run the test?

Kristopher Giesing

Aug 9, 2012, 12:34:36 AM
to google-a...@googlegroups.com, je...@infohazard.org
This is his evidence that higher-use apps have better performance.  I'm not surprised that memcache hit rates would be higher (since you're more likely to evict your neighbors than vice versa) but the latency difference does surprise me.

I wonder if Google's balancing algorithms tend to starve low-qps apps?  I can imagine a class of caching behavior that would cause that sort of thing, and the caches could be operating at many levels of the stack.

- Kris

vlad

Aug 9, 2012, 1:05:44 AM
to google-a...@googlegroups.com, vlad, je...@infohazard.org
Because I thought that all 3 methods that did not work involved storing data in the Datastore. If you used Backend RAM as storage, well, OK, it is an inventive way to use a Backend, I guess. I think Backends are a wrong feature altogether. It just does not fit with the GAE concept of a scalable, no-configuration system. But I guess we have ourselves to blame. Some of us screamed so much about the 60 sec time limit in front-ends. And this time Google decided to "listen" and gave us Backends :)

Jeff Schnitzer

Aug 9, 2012, 1:16:04 AM
to google-a...@googlegroups.com
On Wed, Aug 8, 2012 at 5:44 PM, Waleed Abdulla <wal...@ninua.com> wrote:
> Thanks, Jeff. Is it possible to repeat the test with qps < 10 to rule out
> the limit that Johan pointed out? In other words, how big is the performance
> difference if you had fewer requests that do more work?

You must mean concurrency less than 10?

I'm not really certain how concurrency relates to this. All the tests
I ran (Node.js, Twisted, Tornado, Simple) were nonblocking servers
with a concurrency of 1. Maybe - just maybe - it would be possible to
increase throughput by using multiple system threads up to the number
of cores available... but then you would lose performance due to
synchronization. Probably significantly. Optimal hardware
utilization is one isolated, single-threaded, nonblocking server per
core.
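
Tornado, for one, can set up exactly that layout with its pre-fork mode.
A sketch, reusing the app object from my earlier snippet - and note that
each forked process gets its own memory, so a RAM-held collector would
have to be reaped per process:

    import tornado.httpserver
    import tornado.ioloop

    server = tornado.httpserver.HTTPServer(app)  # app from the earlier sketch
    server.bind(8080)
    server.start(0)  # 0 = fork one single-threaded process per CPU core
    tornado.ioloop.IOLoop.instance().start()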

I really don't know why backends are slow. Maybe it has something to
do with the request queueing system? Throughput sucks even when
backends are doing noops. Maybe "increased concurrency" would allow
more requests to travel through the queueing system at once... but
it's hard to imagine this helping out the actual server process at
all. More timeslicing and synchronization on a cpu- and memory-bound
problem will reduce performance, not improve it.

Jeff

Kristopher Giesing

Aug 9, 2012, 2:01:26 AM
to google-a...@googlegroups.com, je...@infohazard.org
Do we know for sure that front ends are any faster?  Their individual throughput limits might just be masked by having more of them spin up.

- Kris

Jeff Schnitzer

Aug 9, 2012, 2:21:48 AM
to vlad, google-a...@googlegroups.com
On Wed, Aug 8, 2012 at 10:05 PM, vlad <vlad.tr...@gmail.com> wrote:
> Because I thought that all 3 methods that did not work involved storing data
> in the Datastore. If you used Backend RAM as storage, well, OK, it is an
> inventive way to use a Backend, I guess. I think Backends are a wrong feature
> altogether. It just does not fit with the GAE concept of a scalable,
> no-configuration system. But I guess we have ourselves to blame. Some of us
> screamed so much about the 60 sec time limit in front-ends. And this time
> Google decided to "listen" and gave us Backends :)

Only one of the 3 failed attempts used the datastore; the others were
task queue and single-backend.

I also think that backends are a misfeature, but for different reasons:

* As an in-memory index, they are waaay too expensive for any
reasonable quantity of RAM.
* As a repository of in-game state (ie this case), they don't provide
enough throughput.
* As a way around the 60s timeout for frontends... I'd really just
rather have the ability to run frontend requests longer. And to
define groups of frontends of different sizes.

As Google Compute Engine rolls out, I expect the appeal of backends
will diminish considerably. Which is too bad, because if they were
cheaper and faster they would be really incredibly useful.

Jeff

Jeff Schnitzer

Aug 9, 2012, 2:31:18 AM
to Kristopher Giesing, google-a...@googlegroups.com
On Wed, Aug 8, 2012 at 11:01 PM, Kristopher Giesing
<kris.g...@gmail.com> wrote:
> Do we know for sure that front ends are any faster? Their individual
> throughput limits might just be masked by having more of them spin up.

I expect frontends are about the same. But frontends can't aggregate
scores so they aren't really an issue in this test. If you want to
feel even worse for Richard, he's paying for what seems like an
unreasonable number of F1 instances too... but for some reason that
seems less frustrating than the totally unnecessary need to add a
second equivalent number of backends.

Jeff

Takashi Matsuo

Aug 9, 2012, 3:18:41 AM
to google-a...@googlegroups.com, Kristopher Giesing
First of all, thank you for the detailed cost comparison and the
honest feedback. We really appreciate it.

The data itself is not very far from what I observed in my test.
Certainly, we should try making the backends cheaper. The easiest fix
will be this issue:
http://code.google.com/p/googleappengine/issues/detail?id=7411

It will reduce the cost by 5/8 (if the discount rate is the same as the
one for frontends), so it'll be $360/month for 10 B1s.

Besides the cost, we should also improve the performance itself.

You might say 'It's still insane'; however, I don't think so, because
App Engine backends provide other goodies like high availability,
redundancy, and a maintenance-free nature from the start.

Please think of it this way: would you like to offer a score-server
service for Richard yourself, at a cost of $360/month? Besides running
a single node.js server, you would also need to run a spare server,
create a monitoring/fail-over mechanism, staff a support channel for
when things go wrong, and sometimes diagnose network problems between
App Engine and your server.

If the answer is yes, probably it's a good business chance for you ;)

I'm not saying that App Engine is perfect, but I just wanted to point
out that your comparison lacks (maybe intentionally) consideration of
these important aspects of App Engine.

However, your feedback is still invaluable to us. Thank you as always.

Regards,

-- Takashi



--
Takashi Matsuo | Developer Advocate | tma...@google.com

Johan Euphrosine

Aug 9, 2012, 6:31:54 AM
to google-a...@googlegroups.com
On Thu, Aug 9, 2012 at 7:16 AM, Jeff Schnitzer <je...@infohazard.org> wrote:
On Wed, Aug 8, 2012 at 5:44 PM, Waleed Abdulla <wal...@ninua.com> wrote:
> Thanks, Jeff. Is it possible to repeat the test with qps < 10 to rule out
> the limit that Johan pointed out? In other words, how big is the performance
> difference if you had fewer requests that do more work?

You must mean concurrency less than 10?

I'm not really certain how concurrency relates to this.  All the tests
I ran (Node.js, Twisted, Tornado, Simple) were nonblocking servers
with a concurrency of 1.

<nitpick>
Actually those servers/frameworks do support processing multiple requests concurrently: when one request handler/callback is doing I/O, another request can be processed (just like the Go backend I tested). They just don't do parallelism (by default).
</nitpick>
 
 Maybe - just maybe - it would be possible to
increase throughput by using multiple system threads up to the number
of cores available... but then you would lose performance due to
synchronization.  Probably significantly.  Optimal hardware
utilization is one isolated, single-threaded, nonblocking server per
core.

I really don't know why backends are slow.  Maybe it has something to
do with the request queueing system?  Throughput sucks even when
backends are doing noops.  Maybe "increased concurrency" would allow
more requests to travel through the queueing system at once... but
it's hard to imagine this helping out the actual server process at
all.

Increasing concurrency could help if you are I/O bound, but it will not help if you are truly CPU bound.

For a truly CPU-bound request, you are using as much CPU as the instance class limit allows on all backends < B8, and half of the CPU limit on a B8. So with threadsafe: true, you can only process:
- 1 CPU-bound request at a time on B1/B2/B4
- 2 CPU-bound requests at a time on B8

If you are not completely CPU bound and are idling between CPU bursts, then increasing concurrency would help, just as if you were bound on I/O operations.

Hope that clarifies the situation a bit.
 
 More timeslicing and synchronization on a cpu- and memory-bound
problem will reduce performance, not improve it.

Jeff

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to google-a...@googlegroups.com.
To unsubscribe from this group, send email to google-appengi...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.

Johan Euphrosine

Aug 9, 2012, 8:38:41 AM
to google-a...@googlegroups.com
I just answered a question on the same topic on Stack Overflow:
http://stackoverflow.com/a/11882719/656408

Jeff Schnitzer

Aug 9, 2012, 9:40:34 AM
to google-a...@googlegroups.com
I think I mentioned this before, but there is no I/O in the problem.
It just collects data in RAM (thousands of individual submissions) and
waits for the reaper to come get the entire set in one fetch. This is
why I do not expect concurrency to help.

Jeff

Johan Euphrosine

Aug 9, 2012, 9:47:13 AM
to google-a...@googlegroups.com
On Thu, Aug 9, 2012 at 3:40 PM, Jeff Schnitzer <je...@infohazard.org> wrote:
I think I mentioned this before, but there is no I/O in the problem.
It just collects data in RAM (thousands of individual submissions) and
waits for the reaper to come get the entire set in one fetch.  This is
why I do not expect concurrency to help.

I was just nitpicking because I thought "with a concurrency of 1" applied to "Node.js servers" and not "the tests I ran"; sorry for the misunderstanding.

Jeff Schnitzer

Aug 9, 2012, 10:04:32 AM
to google-a...@googlegroups.com
I generally am sympathetic to "GAE costs more because it offers more".
But there's a finite limit to what that "more" is worth. 2X? 3X?

Consider that those 10 B1s are barely keeping up with a load of 2k
users. A single $11/mo Node instance handles more than 10k users in
my tests. As we've been talking about this, Richard's user #s are
growing, and now he needs more than 10 B1s. So to put this in proper
comparison, even _with_ the discount that isn't currently available,
we're comparing cost to support 10k users:

50 B1s = $1,800/mo
2 Node instances (the second for redundancy) = $22/mo

We're talking almost 100X. Two full orders of magnitude more expensive.

But yes, I would happily offer the score-server service for $360/mo,
let alone $1800! However, Richard is smart enough to run these on his
own. My VPS instances have uptimes of 129 days, and that's just
because I did a kernel upgrade.

BTW, my goal in having this conversation is to encourage you guys to
make backends either: 1) a lot cheaper or 2) perform a *lot* better.
Ideally both.

Jeff

Kristopher Giesing

Aug 9, 2012, 10:36:22 AM
to google-a...@googlegroups.com, vlad, je...@infohazard.org
The reason I'm currently using them is to avoid insane instancing behavior on front ends.  (Which is sad, because I'm basically paying extra to avoid a feature.)

I just commented on your SO post, btw, asking for clarification on whether your rules apply only to backends, because I definitely *don't* see that behavior for front ends.

- Kris

Hernan Liendo

Aug 9, 2012, 10:38:40 AM
to google-a...@googlegroups.com, je...@infohazard.org
This is the best thread this month!

At Racetown we've had similar issues. In our experience, having different services such as GAE and Rackspace in the same project (i.e. scores vs. the rest of the game) tends to complicate things. After a while you need to share data between both technologies, which brings data-migration needs - for instance, sharing users between the Rackspace-hosted score system and the rest of the game.

We did that with a GAE-hosted game whose metrics logging ran on Rackspace. Maintaining that solution was painful because of the Rackspace Linux/Apache maintenance, the monitoring, and the integration between GAE and that service. In the end we opted to pay a little more and keep everything on the GAE side.

On the other hand, I think you could simplify your scoring solution a little. What if you held users' scores in sharded memcache entries, and after a while fetched them all and saved the result to the datastore?

Kristopher Giesing

Aug 9, 2012, 10:00:07 PM
to google-a...@googlegroups.com, je...@infohazard.org
That would work up until your data got evicted at the wrong time...

Though this gets me thinking: it would be really, really nice for many applications to have a *guaranteed* memcached-style in-memory table that you could use to share data between instances (and between FE/BE).  Unlike memcached there would be no automatic management; instead you would manually add/remove items to stay within a hard limit (or put()s would fail).  The hard limit could depend on configuration and maybe you could pay for more if you needed it.
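
To pin the idea down, the interface I'm imagining is roughly this
(purely hypothetical - nothing like it exists in the SDK):

    class SharedTable(object):
        """Cross-instance table with a hard byte budget; no auto-eviction."""

        def __init__(self, limit_bytes):
            self.limit = limit_bytes
            self.used = 0
            self.items = {}

        def put(self, key, value):
            # Fails instead of silently evicting, unlike memcache.
            delta = len(value) - len(self.items.get(key, ""))
            if self.used + delta > self.limit:
                raise MemoryError("over hard limit; remove something first")
            self.items[key] = value
            self.used += delta

        def remove(self, key):
            self.used -= len(self.items.pop(key, ""))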

- Kris

Joakim

Aug 10, 2012, 5:30:35 AM
to google-a...@googlegroups.com
By *guaranteed*, do you mean that this should be kept synchronized across multiple data centers? I ask because in the case of a data center emergency, your app spins up in a different data center. The HR datastore is always in sync; the memcache is always local to the DC.
Comparing the put latency of the datastore and memcache over the last few days, the datastore generally keeps under 50 ms while memcache tends to stay below 30 ms. I'd say the datastore isn't so bad, and adding sync to memcache would probably bring its number up a bit. At that point, this new feature starts to look a lot like the current HR datastore. Add properties for insertion date/time etc., and it could be a queue, right?

I don't mean to shoot you down, but to make you specify what it really is you want, and how it relates to the current offering.

Kristopher Giesing

Aug 10, 2012, 4:45:18 PM
to google-a...@googlegroups.com
Good point.  I had unconsciously assumed that a replicated memcached solution would be much more efficient than a replicated data store solution, but if the replication and not the disk I/O is what's expensive, then that assumption falls down.

Generally speaking, I'm having difficulty architecting my own app in such a way that it's robust across instances that may be "very distant" from one another.  Right now I'm relying on optimistic concurrency in the datastore, but it means my app is much slower than, say, a single instance that held everything in RAM.  It's been tempting to just drop the datastore entirely and limit myself to a single instance, but GAE doesn't seem designed to scale down that way; front end instances spin up apparently for no reason, and back end instances, as this thread reveals, seem to have more performance limitations than one would expect, so I have no idea whether a single instance would actually be able to service my user base.

- Kris

Jeff Schnitzer

Aug 13, 2012, 2:10:17 AM
to google-a...@googlegroups.com
Memcache doesn't work for this application. Completely aside from
reliability issues, there's no memcache instruction for "give me all
the data". At best you can fetch some number of individual keys, but
that brings up two problems:

1) Trying to do a batch fetch of 10k keys (or more)
2) How do you know what keys to fetch in the first place?

#2 is intractable because any practical solution would essentially
eliminate the need for memcache in the first place.

The problem is pretty easily stated: Collect 10k score submissions in
5s and be able to provide a sorted leaderboard 5s later. GAE does not
offer any practical facility capable of accomplishing this.

Jeff

Johan Euphrosine

Aug 13, 2012, 9:33:53 AM
to google-a...@googlegroups.com



On Wed, Aug 8, 2012 at 4:10 AM, Jeff Schnitzer <je...@infohazard.org> wrote:
> [Jeff's original benchmark message, quoted in full - snipped]

Hi Jeff,

Can you share your test methodology?

I would like to reproduce this testing on my side before escalating the results.

In particular I'm wondering if you are calling ab on the backends directly, or on an App Engine frontend that urlfetches to the backend in a handler?

Thanks in advance.
 

Jeff



Jeff Schnitzer

Aug 13, 2012, 10:59:31 AM
to google-a...@googlegroups.com
On Mon, Aug 13, 2012 at 6:33 AM, Johan Euphrosine <pro...@google.com> wrote:
>
> Hi Jeff,
>
> Can you share your test methodology ?
>
> I would like to reproduce this testing on my side before escalating the
> results.
>
> In particular I'm wondering if are you calling ab on the backends directly,
> or on App Engine frontend that urlfetch to the backend in an handler?

I tried both. Going ab -> frontend -> backend is _much_ slower than
going directly to a backend.

You can see the code for testing backends here:

https://github.com/stickfigure/wh-test

There is a bunch of extra code in there you should ignore. The
important methods are:

/noop - returns static string from frontend
/bnoop - urlfetch from frontend to backend which returns static string
/backend/noop - directly to backend to return static string
/away - just does a urlfetch to a rackspacecloud vps instance

There's pretty much no code here. Just handlers that do nothing.

Jeff

Johan Euphrosine

Aug 13, 2012, 11:06:10 AM
to google-a...@googlegroups.com
On Mon, Aug 13, 2012 at 4:59 PM, Jeff Schnitzer <je...@infohazard.org> wrote:
On Mon, Aug 13, 2012 at 6:33 AM, Johan Euphrosine <pro...@google.com> wrote:
>
> Hi Jeff,
>
> Can you share your test methodology ?
>
> I would like to reproduce this testing on my side before escalating the
> results.
>
> In particular I'm wondering if you are calling ab on the backends directly,
> or on an App Engine frontend that urlfetches to the backend in a handler?

I tried both.  Going ab -> frontend -> backend is _much_ slower than
going directly to a backend.

And the numbers you posted were for hitting the VPS instance directly? Or for urlfetching from a frontend?
 

You can see the code for testing backends here:

https://github.com/stickfigure/wh-test

There is a bunch of extra code in there you should ignore.  The
important methods are:

/noop - returns static string from frontend
/bnoop - urlfetch from frontend to backend which returns static string
/backend/noop - directly to backend to return static string
/away - just does a urlfetch to a rackspacecloud vps instance

There's pretty much no code here.  Just handlers that do nothing.

Jeff


Hernan Liendo

Aug 13, 2012, 5:43:33 PM
to google-a...@googlegroups.com, je...@infohazard.org

On Monday, August 13, 2012 3:10:17 AM UTC-3, Jeff Schnitzer wrote:
Memcache doesn't work for this application.  Completely aside from
reliability issues, there's no memcache instruction for "give me all
the data".  At best you can fetch some number of individual keys, but
that brings up two problems:

Reliability comes with costs, as you might know. In some scenarios we've seen, losing a few records out of millions is not a problem.
Maybe that's not your case.
 


 1) Trying to do a batch fetch of 10k keys (or more)

The MemCache.getAll(keys) method could work if you know the keys. Using sharded keys you could reach them all. (I got this inspiration from the way AppStats holds an app's perf stats ;)
10k keys seems like too much, but 1k is working.
Why that many?

 
 2) How do you know what keys to fetch in the first place?

Use keys with no business meaning - for instance, numbers from 0 to 999.
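
Roughly like this (a sketch - the shard count is illustrative, and note
that the read-modify-write on a shard can race, so it only fits if
dropping the occasional submission is acceptable):

    import random
    from google.appengine.api import memcache

    NUM_SHARDS = 1000
    KEYS = ["scores:%d" % i for i in range(NUM_SHARDS)]

    def submit_score(score):
        # Append the score to one randomly chosen shard.
        key = random.choice(KEYS)
        shard = memcache.get(key) or []
        shard.append(score)
        memcache.set(key, shard)

    def reap_scores():
        # One batched fetch of every shard; missing keys were evicted
        # or never written.
        shards = memcache.get_multi(KEYS)
        return [s for shard in shards.values() for s in shard]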
 

#2 is intractable because any practical solution would essentially
eliminate the need for memcache in the first place.

Well, if you already host your whole app on GAE, another 'practical' solution may simply not exist.

 

The problem is pretty easily stated:  Collect 10k score submissions in
5s and be able to provide a sorted leaderboard 5s later.  GAE does not
offer any practical facility capable of accomplishing this.

Yeah, that's right :(

hyperflame

Aug 13, 2012, 6:53:37 PM
to Google App Engine
On Aug 13, 4:43 pm, Hernan Liendo <hernan.lie...@gmail.com> wrote:
> MemCache.getAll(keys) method could work if you know the keys. Using sharded
> keys you could reach them all.

I wrote a simple test application for memcache earlier. I managed to
collect 1k keys reliably, and I would bet that 10k keys would be no
problem. The problem is the amount of data in the values. If you just
wanted to store scores, that would be fine, but IIRC the original
poster needed to store a lot more data than just scores.

On Aug 13, 4:43 pm, Hernan Liendo <hernan.lie...@gmail.com> wrote:
> On Monday, August 13, 2012 3:10:17 AM UTC-3, Jeff Schnitzer wrote:
> > The problem is pretty easily stated:  Collect 10k score submissions in
> > 5s and be able to provide a sorted leaderboard 5s later.  GAE does not
> > offer any practical facility capable of accomplishing this.
>
> Yeah, that's right :(

Personally, I'd like to see memcache made more reliable, and more
guarantees added to it (and I would have no problem paying more).
Right now memcache is "Well, it's there, but it could go out at any
time."

Jeff Schnitzer

Aug 15, 2012, 12:58:22 AM
to google-a...@googlegroups.com
On Mon, Aug 13, 2012 at 8:06 AM, Johan Euphrosine <pro...@google.com> wrote:
>
>> I tried both. Going ab -> frontend -> backend is _much_ slower than
>> going directly to a backend.
>
> And the numbers you posted were for hitting directly the vps instance? or
> urlfetch from a frontend?

I'm not sure which exact numbers you are referring to. The VPS tests
run 'ab' on the same VPS running the server. The 'wh-test' noop tests
ran 'ab' across a 30Mbit link with fairly low latency to
ghs.google.com... basically the best connection I had available.

Given that these are high-concurrency tests, I wouldn't expect latency
to be a significant issue - there's a large pending backlog on the
server in each case.

Jeff
(sorry, I've been travelling in internet dark zones for the last several days)

Jeff Schnitzer

Aug 15, 2012, 2:30:12 AM
to google-a...@googlegroups.com
While it is possible to imagine doing aggregation with memcache -
maybe - it's very hard to imagine it being worth the effort. It's
like using a hammer to punch staples - sure, it's possible, but you'll
break a lot fewer fingers if you just use a stapler. How do you pick
memcache keys for thousands of users that are constantly signing in
and signing out? You can't just assign them 0-9999; you'd need some
sort of coordination system to recycle empty spaces. And now you need
that coordinator to scale because it rapidly mutates.

Memcache doesn't have a list structure. Really, if this was being
done with off-the-shelf components, it's a good fit for Redis. But
even that is overkill; a dozen-line Node.js script can accept HTTP
directly from clients and works just as well.
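
With Redis the whole collector is one list - a sketch with redis-py
(key name illustrative):

    import redis

    r = redis.StrictRedis(host="localhost", port=6379)

    def submit_score(score_json):
        r.rpush("round:scores", score_json)  # O(1) append per submission

    def reap_scores():
        # Grab everything and reset for the next round, atomically.
        pipe = r.pipeline()
        pipe.lrange("round:scores", 0, -1)
        pipe.delete("round:scores")
        scores, _ = pipe.execute()
        return scores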

Jeff

Johan Euphrosine

Aug 15, 2012, 5:07:19 AM
to google-a...@googlegroups.com
On Wed, Aug 15, 2012 at 6:58 AM, Jeff Schnitzer <je...@infohazard.org> wrote:
> On Mon, Aug 13, 2012 at 8:06 AM, Johan Euphrosine <pro...@google.com> wrote:
>>
>>> I tried both. Going ab -> frontend -> backend is _much_ slower than
>>> going directly to a backend.
>>
>> And the numbers you posted were for hitting directly the vps instance? or
>> urlfetch from a frontend?
>
> I'm not sure which exact numbers you are referring to. The VPS tests
> run 'ab' on the same VPS running the server.

I was referring to the 2000+ qps benchmark result you posted at the
beginning of this thread, it's difficult to compare those with backend
QPS if those are coming from running ab against localhost.

It could be interesting to check what QPS you get when running ab
against App Engine frontend urlfetching to your VPS, so we can compare
with App Engine Backend and Compute.

> The 'wh-test' noop tests
> ran 'ab' across a 30Mbit link with fairly low latency to
> ghs.google.com... basically the best connection I had available.
>
> Given that these are high-concurrency tests, I wouldn't expect latency
> to be a significant issue - there's a large pending backlog on the
> server in each case.
>
> Jeff
> (sorry, I've been travelling in internet dark zones for the last several days)
>

Jeff Schnitzer

Aug 15, 2012, 4:00:22 PM
to google-a...@googlegroups.com
On Wed, Aug 15, 2012 at 2:07 AM, Johan Euphrosine <pro...@google.com> wrote:
>
> I was referring to the 2000+ qps benchmark result you posted at the
> beginning of this thread, it's difficult to compare those with backend
> QPS if those are coming from running ab against localhost.
>
> It could be interesting to check what QPS you get when running ab
> against App Engine frontend urlfetching to your VPS, so we can compare
> with App Engine Backend and Compute.

I'll try running some more tests when I get some time, but I don't see
why I would expect any different results. I might expect better
results if 'ab' isn't competing with node for CPU resources. The only
practical difference of moving 'ab' off localhost is whether the
network card driver on the VPS server is dramatically less efficient
than the loopback driver. Maybe it is, but it's likely insignificant
compared to the work done at higher levels of the network stack (like,
say, decoding http in javascript).

This assumes that an unrestricted number of GAE frontends can pass
through any amount of load I am capable of generating, which I take
for granted.

Jeff

Tom Davis

Nov 12, 2012, 11:15:19 AM
to google-a...@googlegroups.com
I can independently verify that the extremely poor (and inconsistent)
performance from Backends hasn't improved. A very similar test server
(collects stats in memory) on a B8 in its own application can only
handle ~50 req/sec. I tried hand-delivering the responses to the
computers, but the Backend was slightly faster.

I am currently battling Enterprise Support to get a straight answer; so
far I've been repeatedly told to run AppStats against a 15-line,
Go-based server that performs no RPCs. Beyond that, the general
recommendation is to use Pull Queues rather than relying on direct
requests to a Backend. I'm not sure why a whole layer of indirection on
top of HTTP would be *faster* than raw requests, but I've experienced
stranger things on App Engine.

FWIW, I haven't noticed any task latency with Pull Queues, and they
seem to have fixed the bug that prevented leasing from the same queue
in parallel. You would still need to tolerate frequent false negatives
(a lease returning 0 tasks when tasks exist) and TransientErrors.
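
In practice that means wrapping every lease in a retry loop, something
like this sketch (queue name and limits are illustrative):

    from google.appengine.api import taskqueue

    def lease_batch(queue_name="scores", attempts=5):
        q = taskqueue.Queue(queue_name)
        for _ in range(attempts):
            try:
                tasks = q.lease_tasks(lease_seconds=30, max_tasks=1000)
            except taskqueue.TransientError:
                continue  # transient by definition; just retry
            if tasks:
                return tasks  # caller processes, then q.delete_tasks(tasks)
        return []  # repeated false negatives; treat as empty this sweep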


Cheers,

Tom

Ajax

Nov 16, 2012, 12:25:29 AM
to google-a...@googlegroups.com, je...@infohazard.org
A couple quick notes...

I can verify Brandon's post about high-throughput apps getting extreme performance while low-throughput test apps don't scale up very well. There is a learning algorithm at play which favors production apps over test apps.  Go figure ;-}



" I might expect better 
results if 'ab' isn't competing with node for CPU resources.  The only 
practical difference of moving 'ab' off localhost is whether the 
network card driver on the VPS server is dramatically less efficient 
than the loopback driver. " - Comparing anything against lo is apples to oranges.  

In regard to having all clients hit the server at the same time, you could probably ramp QPS way up by having each client's results stored into a concurrent queue in RAM, aggregating 25 units per entity, and performing the DS or cache puts asynchronously in batches (first batch ten, subsequent batches 25), such that each client request just pushes to RAM and gets out to free up the request handler, while a single thread asynchronously streams results into the DS.
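
A sketch of that shape (NDB flavor; batch sizes per the ramp above, and
a real version has to mind NDB's per-thread contexts):

    import Queue
    from google.appengine.ext import ndb

    pending = Queue.Queue()  # handlers enqueue an entity and return

    def drain(batch_size=25):
        # Runs on a single background thread; one async batch put per pass.
        batch = []
        while len(batch) < batch_size:
            try:
                batch.append(pending.get_nowait())
            except Queue.Empty:
                break
        if batch:
            ndb.put_multi_async(batch)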

I've found with all performance tuning that having each request lock individually on App Engine services leads to poor and choppy performance, while using a single thread to mass-push entities from shared concurrent queues with async batch puts gives extreme performance.

Also, using known string keys is far, far better than having App Engine generate long keys one at a time for your app. If you can't possibly generate a UUID from your data, at least lease big batches of keys to reduce bottlenecks.