If your bill shoots up due to increased latency, you may not be refunded the charges incurred


Nischal Shetty

Jun 13, 2012, 4:20:23 AM
to google-a...@googlegroups.com
We have been on GAE/J for more than 2 years now. We have 2 products that run on it. A couple of weeks ago, I noticed unusually high latency for one of our products and, as a result, a high number of instances running (I guess if latency increases, the number of instances increases as well to serve new requests).

I logged a production issue (link) and the GAE team took it up swiftly and started work on fixing it. Though the fix took time, I was happy that they stayed in touch while working on the issue.

Since the bump in instances was the result of a degradation of the GAE infrastructure, I thought it was reasonable to ask for a refund of the extra billing charges that were levied.

Our charges are usually in the range of $30 per day, but during those 3 days the charges were $86, $188 and $47 (screenshot attached). That's pretty steep, and it hurts our weekly budget as we're a bootstrapped startup.

When I contacted customer service and asked for a refund, I was told that the SLA is only violated when requests fail with error code 500. Since that wasn't really the case here, we were denied the refund.

In our case it was latency (caused by some problem with App Engine) that made our app take a big hit, which means it isn't covered under the SLA! If the GAE infrastructure degrades again and instances spin up at a crazy rate once more, we will have to pay the charges. I dread the thought of this problem cropping up again and lasting a week.

This can happen to anyone due to any bug in App Engine, so I thought it was worth giving a heads-up. If GAE causes a high number of instances to spin up through no fault of yours, you still end up paying the charges.
Billing History.png

Jeff Schnitzer

Jun 13, 2012, 12:13:23 PM
to google-a...@googlegroups.com
Dear Google: This issue is going to steadily erode the "goodwill" of
even your best customers. It looks really bad.

Long ago it was suggested that one of the advantages of the new
pricing system was that it would be more transparent. A year of
experience later, the new pricing system is dramatically *less*
transparent than the old one. In the old system, I could see what
each request cost to service and predict from that. In the current
system, I have no way of knowing what a request would cost - datastore
ops is easy, but instance time is wildly unpredictable. The only way
to figure out what an app will cost is to run it for a day. And
pricing goes UP when service quality goes DOWN, which is inexcusable.

The silly thing is that for multithreaded apps, the number of
instances required is determined by megacycles used. So now we're
back to (effectively) charging for CPU. The old pricing model, while
screwy for single-threaded apps, makes WAY more sense for
multithreaded apps. A better solution would have been to keep the old
model, increase pricing to sustainable levels, and figure out how to
push everyone onto multithreaded solutions - probably with some sort
of price surcharge.

This is really a mess.

Jeff
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/hDVaF8zzxrQJ.
> To post to this group, send email to google-a...@googlegroups.com.
> To unsubscribe from this group, send email to
> google-appengi...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.

Stefano Ciccarelli

Jun 13, 2012, 12:48:47 PM
to google-a...@googlegroups.com
I have to agree.


2012/6/13 Jeff Schnitzer <je...@infohazard.org>



--
There are 10 kinds of people in the world: those who understand binary and those who don't.

Brandon Wirtz

Jun 13, 2012, 2:33:09 PM
to google-a...@googlegroups.com
I know exactly how much every request cost me.

cpm_usd=0.000308

cpm_usd=0.000175

As to latency: the changes in pricing are no different than when AWS has a
performance issue (actually less so).
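For reference, a back-of-the-envelope sketch (assuming, and this is my reading rather than documented fact, that cpm_usd estimates what 1,000 requests like that one would cost):

```python
# Assumption: cpm_usd from the request logs is an estimated cost, in USD,
# of 1,000 requests like this one. Daily cost is then rate times volume.
def estimated_daily_cost(cpm_usd, requests_per_day):
    """Rough daily bill in USD, assuming uniform traffic."""
    return cpm_usd / 1000.0 * requests_per_day

# At cpm_usd=0.000308, a million requests a day is about $0.31.
```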


alex

Jun 13, 2012, 4:02:27 PM
to google-a...@googlegroups.com
Yeah, it is very unfortunate that the App Engine SLA does not cover increased latency and other malfunctions caused by the internal infrastructure, but it is also true that, for instance, the AWS and Rackspace SLAs do pretty much the same. They all talk about downtime. I'd be curious to see an SLA (for PaaS or IaaS) that covers a service degradation.

A fair comparison would be GAE vs Heroku / Azure / CloudFoundry / OpenShift / <place your preferred>, but then most of them don't even have an SLA. And if they do, they usually refund 25-30% max (vs 50% for GAE).


-- alex

Simon Knott

Jun 13, 2012, 4:09:20 PM
to google-a...@googlegroups.com
I still find it bizarre that we can't cap the maximum number of instances.  

If we could say "I never want any more than 5 instances", then this billing problem goes away - sure, your service will probably be hit with performance issues, but at least you have more control over your outgoing costs.

alex

Jun 13, 2012, 4:20:16 PM
to google-a...@googlegroups.com
Simon, so you're saying that setting Min Pending Latency slider to e.g. 10s does not work for you?

Simon Knott

Jun 13, 2012, 4:51:25 PM
to google-a...@googlegroups.com
That doesn't add a hard limit to the maximum number of instances that GAE can spin up, it's just a hint to the scheduler that it shouldn't spin up a new instance unless the current incoming request will have to wait longer than 10s to process with existing instances.  If GAE is playing up, it's very likely that an incoming request will have to wait this long and boom, you've got runaway instances being spun up.  

At the moment you have no way to limit the instances - if the scheduler went completely awry, they could keep spinning up instances until the cows come home and you have no way to stop it.

Jeff Schnitzer

Jun 13, 2012, 4:52:55 PM
to google-a...@googlegroups.com
On Wed, Jun 13, 2012 at 11:33 AM, Brandon Wirtz <dra...@digerat.com> wrote:
> I know exactly how much every request cost me.
>
> cpm_usd=0.000308
>
> cpm_usd=0.000175

These numbers tell you how much a request cost, but don't tell you how
much another identical request will cost.

Also, what do these numbers mean in the context of a multithreaded
instance? If you have 5 requests running in one instance and the 6th
request hits a fresh instance, how are the costs divided among the 6
requests?

> As to the latency changes in pricing are no different than when AWS has a
> performance issue. (actually less so)

In AWS, a performance degradation in the cluster may reduce your
application effectiveness, but it doesn't change billing - at least
not automatically. Maybe it will cause you to spin up new instances,
but this doesn't seem particularly likely - at least not in a
multithreaded world. A java appserver can easily handle hundreds
(even thousands) of requests blocked on IO, so even under increased
"datastore" (let's say DynamoDB or SimpleDB) latency there's no reason
to expect a need for lots of extra instances. Theoretically, GAE's
multithreaded instances should work the same way... except that they
don't for some reason.

Jeff

Brandon Wirtz

Jun 13, 2012, 5:26:15 PM
to google-a...@googlegroups.com
> These numbers tell you how much a request cost, but don't tell you how
> much another identical request will cost.
>

The variance between my "identical" requests is less than 10%.

In relation to multi-threading, I only see a variance on the downside of a
spike, when I have idle instances. That seems to be less than 10% as well.
If you have lots of idle threads raising your price, you aren't optimizing
your defer queue and steady-state tasks effectively.

We have seen AWS days where cluster congestion raised our bills by 3-4x
when using a specified QoS based on latency/time per request/request queue
depth (the same as tuning the scheduler).

As always, the people who are complaining don't seem to understand how the
scheduler works (which is quite likely a fault of Google's
documentation, or lack thereof). You either pay for QoS that is X, or you
say your users can have a crappy experience, and latency won't make a lick
of difference in your app.

My biggest charge days are not days when "latency" spikes but days when
Memcache gets slow, or seems to "dump" more often than expected. This
hasn't been enough of an issue for us to implement our own memcache on a
backend, primarily because we are using 10+ GB of memcache and can't build
an instance to hold that.




alex

Jun 13, 2012, 5:39:22 PM
to google-a...@googlegroups.com


On Wednesday, June 13, 2012 10:52:55 PM UTC+2, Jeff Schnitzer wrote:
> On Wed, Jun 13, 2012 at 11:33 AM, Brandon Wirtz <dra...@digerat.com> wrote:
>> I know exactly how much every request cost me.
>>
>> cpm_usd=0.000308
>>
>> cpm_usd=0.000175
>
> These numbers tell you how much a request cost, but don't tell you how
> much another identical request will cost.
>
> Also, what do these numbers mean in the context of a multithreaded
> instance?  If you have 5 requests running in one instance and the 6th
> request hits a fresh instance, how are the costs divided among the 6
> requests?
>
>> As to the latency changes in pricing are no different than when AWS has a
>> performance issue. (actually less so)
>
> In AWS, a performance degradation in the cluster may reduce your
> application effectiveness, but it doesn't change billing - at least
> not automatically.  Maybe it will cause you to spin up new instances,
> but this doesn't seem particularly likely - at least not in a
> multithreaded world.

I'll have to disagree here. First off, you're talking about a different kind of instance. On the other hand, if you're talking about some hypothetical app instances, you forgot to mention that someone (or something) will have to manage them. AWS is not running apps, it's running stuff like EC2 instances, which is a totally different thing.

Secondly, for a mid- to high-load app you will definitely end up running more than one EC2 instance (otherwise, you might as well deploy your app on a PC sitting under your desk). If you don't set up autoscaling (and if you do, there's your billing going up already), sooner or later you'll have to do it manually. Either way, there will be additional costs. Not only that, you'll also need to allocate some (human) resources to manage that little cluster your app/whatever is running on.

GAE and AWS are not comparable, though they do have a couple of similar services.

 
> A java appserver can easily handle hundreds
> (even thousands) of requests blocked on IO, so even under increased
> "datastore" (let's say DynamoDB or SimpleDB) latency there's no reason
> to expect a need for lots of extra instances.  Theoretically, GAE's
> multithreaded instances should work the same way... except that they
> don't for some reason.
>
> Jeff

alex

Jun 13, 2012, 5:49:47 PM
to google-a...@googlegroups.com
On Wednesday, June 13, 2012 10:51:25 PM UTC+2, Simon Knott wrote:
> That doesn't add a hard limit to the maximum number of instances that GAE can spin up, it's just a hint to the scheduler that it shouldn't spin up a new instance unless the current incoming request will have to wait longer than 10s to process with existing instances.  If GAE is playing up, it's very likely that an incoming request will have to wait this long and boom, you've got runaway instances being spun up.

Every App Engine service has an RPC client with a number of settings, including deadline/timeout. Plus, I can't imagine what would (successfully) run longer than 10s, hardly even 5s, unless it's a backend or something, and that's a different story.
 

> At the moment you have no way to limit the instances - if the scheduler went completely awry, they could keep spinning up instances until the cows come home and you have no way to stop it.

That sounds like the end of the world. In that case I might as well go and shoot myself.

There are a number of ways to control your billing: better coding, deadline/timeout settings, latency sliders, idle instances, reserved instance hours, backends, etc. It's just not expressed as a "number of instances" parameter, that's it.
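To illustrate the deadline/timeout idea in generic terms (this is a sketch of the concept, not the actual App Engine RPC API; the helper below is made up):

```python
import concurrent.futures

# Generic deadline wrapper: cap how long any one backend call may take,
# so a slow service degrades a single request instead of piling up work.
def call_with_deadline(fn, deadline_sec, *args):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args)
        try:
            return future.result(timeout=deadline_sec)
        except concurrent.futures.TimeoutError:
            return None  # caller falls back or degrades gracefully
```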

Jeff Schnitzer

Jun 13, 2012, 6:15:22 PM
to google-a...@googlegroups.com
On Wed, Jun 13, 2012 at 2:26 PM, Brandon Wirtz <dra...@digerat.com> wrote:
>> These numbers tell you how much a request cost, but don't tell you how
>> much another identical request will cost.
>
> The variance between my "identical" requests is less than 10%.

Only when GAE is behaving normally. As the OP pointed out, this can
vary by a factor of 2-6+ for days at a time.

> In relation to multi-threading, I only see a variance on the downside of a
> Spike, when I have Idle instances. That seems to be less than 10% as well.
> If you have lots of idle threads raising your price you aren't optimizing
> your Defer Queue and steady state tasks effectively.
>
> We have seen AWS days where cluster congestion has raised our bills by 3-4x
> using a Specified QoS based on "latency"/time per request/request queue
> depth. (same as tuning the scheduler).

Sure. However, GAE seems to be particularly sensitive to this issue
because of low concurrency limits.

> As always the people who are complaining don't seem to understand how the
> Scheduler works (which is quite likely a fault in Google's
> Documentation/lack there of).  You either pay for QoS that is X, or you say
> your users can have a crappy experience, and Latency won't make a lick of
> difference in your App.
>
> My biggest charge days are not days that "latency" spikes but days that
> MemCache gets slow, or seems to "dump" more often than expected.  This
> hasn't been enough of an issue for us to implement our own Memcacher as a
> Backend primarily because we are using 10+ gigs of memcache and can't build
> an instance to do that.

Sure... and you probably wouldn't want to either given how insanely
expensive backend RAM is on GAE. But your app is a bit of an oddity,
and fits into a quirk of GAE's pricing model. You primarily consume
two resources that Google doesn't charge for - memcache and edge
caching. This is highly atypical.

Jeff

Jeff Schnitzer

Jun 13, 2012, 6:18:46 PM
to google-a...@googlegroups.com
On Wed, Jun 13, 2012 at 2:39 PM, alex <al...@cloudware.it> wrote:
>
> Secondly, for a mid to high load app you will definitely end up running more
> than one EC2 instance (otherwise, you might as well deploy your app on a PC
> sitting under your desk). If you don't setup Autoscale thing (if you do -
> here's your billing going up already), sooner or later you'll have to do it
> manually. Both ways, there will be additional costs. Not only that, you'll
> also need to allocate some (human) resources to manage that little cluster
> on which your app/whatever is running.

Replace "AWS" with "Elastic Beanstalk" in my comments and yes, it is
very comparable with GAE.

The real question is to what extent a performance degradation in the
datastore (say, DynamoDB) will force an uptick in the # of appserver
instances required. Non-GAE systems aren't very sensitive to this;
doubling the latency of a datastore request may double the number of
threads blocked on I/O in the appserver, but there should be plenty of
headroom in any java appserver. Threads blocked on I/O are silly
cheap these days.
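To put rough numbers on that (hypothetical traffic, via Little's law):

```python
import math

# Little's law: concurrent requests ~= request rate x request latency.
# With a hard per-instance concurrency cap, instance count then tracks
# datastore latency almost linearly. All numbers here are hypothetical.
def instances_needed(requests_per_sec, latency_sec, threads_per_instance=10):
    return math.ceil(requests_per_sec * latency_sec / threads_per_instance)

# 100 req/s at 300 ms needs ~3 instances; the same traffic at 20 s of
# latency needs ~200. With 500 cheap blocked threads per appserver,
# a handful of instances would still cover it.
```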

It's possible that Google can solve this problem entirely by getting
better concurrency out of instances. Is there still a hard limit of
10 threads?

Jeff

Brandon Wirtz

Jun 13, 2012, 6:28:49 PM
to google-a...@googlegroups.com
> Sure... and you probably wouldn't want to either given how insanely
> expensive backend RAM is on GAE. But your app is a bit of an oddity, and
> fits into a quirk of GAE's pricing model. You primarily consume two
> resources that Google doesn't charge for - memcache and edge caching.
> This is highly atypical.
>
> Jeff

I support far more than one type of app. Don't dismiss me just because you
don't like my answers, or because one app I build does one thing that is
optimized for caching.

Even when Memcache and Datastore operations get slow, costs don't fluctuate
much. The instance does other stuff while it is waiting for the APIs to finish.

Small apps (sub-$30 a day) probably see this more than "medium" apps (sub-
$500 a day) because they are more likely not to have enough requests to keep
the spare instance cycles busy.

Again, if you use defer and cron, optimizing your steady state should allow
you to take advantage of these changes, and your price should fluctuate much
less.

If you set your QoS too high, you will have more change in price than you
might like on slow days, but that is the "cost" of having a responsive
application. If you can live with 100ms more wait, then you will find that a
"slow" day has to be a lot slower for you to feel the change in price.

And yes, I optimize the heck out of my apps. If you don't, that's your
failing, not App Engine's.

Oh, and just as a mention, my F4 instances doing pure computation fluctuate
less than 5% in cost on identical operations, except when there are new
updates, which so far have dropped my cost significantly with each update.

If you really want to see performance and cost improvements, use NDB and
watch your app get cheaper every update.



alex

Jun 13, 2012, 6:49:54 PM
to google-a...@googlegroups.com
On Thursday, June 14, 2012 12:18:46 AM UTC+2, Jeff Schnitzer wrote:
On Wed, Jun 13, 2012 at 2:39 PM, alex <al...@cloudware.it> wrote:
>
> Secondly, for a mid to high load app you will definitely end up running more
> than one EC2 instance (otherwise, you might as well deploy your app on a PC
> sitting under your desk). If you don't setup Autoscale thing (if you do -
> here's your billing going up already), sooner or later you'll have to do it
> manually. Both ways, there will be additional costs. Not only that, you'll
> also need to allocate some (human) resources to manage that little cluster
> on which your app/whatever is running.

> Replace "AWS" with "Elastic Beanstalk" in my comments and yes, it is
> very comparable with GAE.

By doing this you're effectively eliminating the other runtimes that GAE supports. That compares just a part of the system.
 

> The real question is to what extent a performance degradation in the
> datastore (say, DynamoDB) will force an uptick in the # of appserver
> instances required.  Non-GAE systems aren't very sensitive to this;
> doubling the latency of a datastore request may double the number of
> threads blocked on I/O in the appserver, but there should be plenty of
> headroom in any java appserver.  Threads blocked on I/O are silly
> cheap these days.

If by non-GAE systems you mean mostly IaaS and stuff like Beanstalk, then of course they are "less sensitive", but again we're talking about different service levels (hence different approaches to solving a specific problem/challenge). Others (e.g. Heroku) simply make you set a fixed # of instances. Well, that's one of the reasons I prefer GAE.
 

> It's possible that Google can solve this problem entirely by getting
> better concurrency out of instances.  Is there still a hard limit of
> 10 threads?

Yes. BTW, 99% of such "problems" I've seen could be effectively solved with push or pull queues.
 


Claude Zervas

Jun 13, 2012, 8:50:42 PM
to google-a...@googlegroups.com
I'm not sure I understand this. I have a very low traffic app and I've set the min idle instances to automatic and the max idle instances to 1. Once I did this I have never been charged for more than one instance even during high latency periods. Before I set this limit to one I had it set to 5 and during those high latency periods I would get the surprise bill from hell for almost no actual traffic.
Google says "You will not be charged for idle instances over the specified maximum," which to me indicates that yes, you can limit the number of instances (at least for billing purposes).
I guess the problem with this is that you get crap latency if your actual traffic does spike regardless of GAE latency issues.
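For what it's worth, the knobs in question would look roughly like this in app.yaml terms (a hedged sketch: today these are Admin Console sliders, and the names/values below are assumptions):

```yaml
# Hypothetical sketch of the scaling knobs, not a verbatim config.
automatic_scaling:
  max_idle_instances: 1       # don't bill for more than one idle instance
  min_pending_latency: 500ms  # let a request wait this long before the
                              # scheduler considers spinning up an instance
```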

Maybe it would be nice if, when AE latency degrades by X%, you got X% more free instances during that period, or something.

On Wednesday, June 13, 2012 1:51:25 PM UTC-7, Simon Knott wrote:
> ...
> At the moment you have no way to limit the instances - if the scheduler went completely awry, they could keep spinning up instances until the cows come home and you have no way to stop it.
> ...

Jeff Schnitzer

Jun 13, 2012, 10:17:24 PM
to google-a...@googlegroups.com
On Wed, Jun 13, 2012 at 3:49 PM, alex <al...@cloudware.it> wrote:
>
> If by Non-GAE systems you mean mostly IaaS and stuff like Beanstalk then of
> course they are "less sensitive", but again we're talking about different
> service levels (hence different approaches in solving a specific
> problem/challenge). Others (e.g. Heroku) simply make you set a fixed # of
> instances. Well, that's one of the reasons I prefer GAE.

The fact that normal appservers can have hundreds of threads blocking
on reads and GAE apparently can't doesn't really seem related to the
"service levels".

I prefer GAE too, but this means I want to congratulate the team for
the many good things they do and hold their feet to the fire when they
do bad things. "Degraded service == more profitable" is a perverse
incentive, and will eventually produce undesirable development
priorities and turn happy customers into angry customers. From a game
design perspective, this is a bad way to structure a business
relationship.

There were problems with the original pricing model, and now we have
problems with the new one. Let's talk about it.

>> It's possible that Google can solve this problem entirely by getting
>> better concurrency out of instances.  Is there still a hard limit of
>> 10 threads?
>
> Yes. BWT, 99% of such "problems" I've seen could be effectively solved with
> push or pull queues.

This doesn't really address the problem. Queues are serviced by
instances. Datastore latency will still cause extra instance spinups.

Jeff

aramanuj

Jun 14, 2012, 1:08:54 AM
to google-a...@googlegroups.com
We've had the exact same problem. The latency increased and the cost shot up by 3 times during a particular week, with no changes on our side.

Regards,
Arun

Nischal Shetty

Jun 14, 2012, 2:18:15 AM
to google-a...@googlegroups.com
On Thursday, June 14, 2012 7:47:24 AM UTC+5:30, Jeff Schnitzer wrote:
> "Degraded service == more profitable" is a perverse
> incentive, and will eventually produce undesirable development
> priorities and turn happy customers into angry customers.

Rightly said. The way things are right now, this is exactly what comes to a customer's mind. Many are suggesting optimization, max idle instances and such, but when latency goes from 300ms to something like 20s (as it did in our case), there's hardly anything you can do on your end (without making your app's users angry).

Michael Hermus

Jun 14, 2012, 8:19:23 AM
to google-a...@googlegroups.com
This, for me, is the crux of this issue. I doubt very much that this was Google's intention, but it now appears that when the infrastructure suffers degradation outside of their control, app owners have to pay more for it.

With all the amazing things that App Engine offers, I don't need it to be perfect, but that certainly doesn't seem to make sense.

Cesium

Jun 14, 2012, 11:09:38 AM
to google-a...@googlegroups.com
Sigh,


Brandon wrote:
> The variance between my "identical" requests is less than 10%.

Yea, me too. Except when it ain't.

I went class 4 ape$hit a couple of weeks ago, when I saw latencies for my app increase by 50x. That's 5000%.

Check out the last 30 days of operation (milliseconds per request):

[chart: Requests/Second (24 hrs)]

Note the correlation between the dates given by Nischal Shetty and the signal at -14d.

Takashi was kind enough to explain:

> Besides that, please keep in mind that the performance of each loading
> request (sure, actually every single request) may vary, because of
> various reasons like actions of other customers in the same cluster,
> the load on the system, or some kind of maintenance happening. App
> Engine is a multi-tenant cloud platform.

Huh. Interesting.  "App Engine is a multi-tenant cloud platform."

Get used to hearing that explanation for your observed QoS.

David


stevep

Jun 14, 2012, 11:48:51 AM
to google-a...@googlegroups.com
Forgone Scheduler improvements == more profitable also...
Engineering: "If we make this Scheduler change, we can halve the total number of instances without any hardware investment!!!!!!!!"
Finance: "Remember you work for the shareholders, not the developers."

Jon McAlister

Jun 14, 2012, 12:54:23 PM
to google-a...@googlegroups.com
I'm an engineer on the App Engine team, and work as the TL for
the scheduler, appserver, and other serving infrastructure. I am
also closely involved with production and reliability issues. I
can offer some perspective here.

Jeff Schnitzer:
} "Degraded service == more profitable" is a perverse
} incentive, and will eventually produce undesirable development
} priorities and turn happy customers into angry customers.

This captures the issue well. It may seem at first like we've got
an incentive, but in truth the second-order effects are a much
stronger incentive for us. We care very much about predictable
reliable performance and continually work to improve here.

stevep:
} Forgone Scheduler improvements == more profitable also...
} Engineering: "If we make this Scheduler change, we can halve the
total number of instances without any hardware investment!!!!!!!!"
} Finance: "Remember you work for the shareholders, not the developers."

I've personally been involved in five projects over the last year
in which we shipped scheduler improvements which reduced the
number of instances needed by an app to run a workload. As with
the above point, it may seem at first like we've got a bad
incentive, but in truth the second-order effects override it. We
want App Engine and the scheduler to get more efficient over
time, and prioritize several projects internally to that effect.
It's in our best interest to see predictable and reliable
behaviors, greater efficiency, and higher performance. It's sort
of a never-ending project for us, as we have to keep up with
infrastructure changes, correctly adapt to new features, keep
complexity down, support new runtimes and computational models,
and the whole time try to make things look effortless (i.e.
automatic, with little-to-no feedback needed from the developer).
It's very important to us.

Jeff Schnitzer:
} Is there still a hard limit of 10 threads?

Yes, but probably not for the reason you expect. The primary
issue we run into is memory management. If we raised the default
to 100, many apps would then see out-of-memory deaths (more than
they do now), and these deaths show up differently for
python/java/go. The right path forward is more intelligent
algorithms wrt memory, providing configurability, and so on. This
is an example of the kinds of projects we work on for the
scheduler, but as with any team we have to prioritize our
projects. I'd recommend filing this (or any other desired
scheduler enhancements) on the public issue tracker so they can
get feedback/data/votes.

Jeff Schnitzer

Jun 14, 2012, 8:01:26 PM
to google-a...@googlegroups.com
On Thu, Jun 14, 2012 at 9:54 AM, Jon McAlister <jon...@google.com> wrote:
> Jeff Schnitzer:
> } "Degraded service == more profitable" is a perverse
> } incentive, and will eventually produce undesirable development
> } priorities and turn happy customers into angry customers.
>
> This captures the issue well. It may seem at first like we've got
> an incentive, but in truth the second-order effects are a much
> stronger incentive for us. We care very much about predictable
> reliable performance and continually work to improve here.

Thank you for the thoughtful response. FWIW, I have full faith that
the GAE team today is doing their best to create a more reliable, more
efficient solution. However, while I am hopeful, it's hard to predict
what Google of 10 or 20 years from now will be like. In 10 years I
expect to still be in business, but with ten times more code that is
all woven into GAE's services. But maybe in 10 years Search will have
crashed, replaced by The Next Thing (!?), and the bean counters will
take complete control.

In the long run, behavior follows incentives. We-the-customers, who
are investing heavily on a highly locked-in platform, would feel a lot
better knowing that both primary and secondary incentives are aligned
in our mutual interest.

> } Is there still a hard limit of 10 threads?
>
> Yes, but probably not for the reason you expect. The primary
> issue we run into is memory management. If we raised the default
> to 100, many apps would then see out-of-memory deaths (more than
> they do now), and these deaths show up differently for
> python/java/go. The right path forward is more intelligent
> algorithms wrt memory, providing configurability, and so on. This
> is an example of the kinds of projects we work on for the
> scheduler, but as with any team we have to prioritize our
> projects. I'd recommend filing this (or any other desired
> scheduler enhancements) on the public issue tracker so they can
> get feedback/data/votes.

Does this limit still apply to larger instances? If an F2 has twice
the memory and CPU of an F1, it should be able to handle the same load
as two F1s. In theory, any app which requires dozens of instances
would be better off switching to F4 frontends. But not if there's
still a 10 thread limit.
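Back-of-the-envelope (the instance prices here are my assumption, roughly F1 at $0.08/hr and F4 at $0.32/hr):

```python
# With a fixed 10-thread cap, an IO-bound app is really buying
# concurrency, and concurrency per dollar strongly favors F1s.
# Prices are assumptions, not quoted from the rate card.
def concurrency_per_dollar(price_per_hour, threads_per_instance=10):
    return threads_per_instance / price_per_hour

# F1: 10 / 0.08 = 125 thread-hours per dollar
# F4: 10 / 0.32 = 31.25, a quarter of the F1 rate despite 4x the CPU
```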

Jeff

Jon McAlister

Jun 15, 2012, 11:34:47 AM
to google-a...@googlegroups.com
On Thu, Jun 14, 2012 at 5:01 PM, Jeff Schnitzer <je...@infohazard.org> wrote:
> On Thu, Jun 14, 2012 at 9:54 AM, Jon McAlister <jon...@google.com> wrote:
>> Jeff Schnitzer:
>> } "Degraded service == more profitable" is a perverse
>> } incentive, and will eventually produce undesirable development
>> } priorities and turn happy customers into angry customers.
>>
>> This captures the issue well. It may seem at first like we've got
>> an incentive, but in truth the second-order effects are a much
>> stronger incentive for us. We care very much about predictable
>> reliable performance and continually work to improve here.
>
> Thank you for the thoughtful response.  FWIW, I have full faith that
> the GAE team today is doing their best to create a more reliable, more
> efficient solution.  However, while I am hopeful, it's hard to predict
> what Google of 10 or 20 years from now will be like.  In 10 years I
> expect to still be in business, but with ten times more code that is
> all woven into GAE's services.  But maybe in 10 years Search will have
> crashed, replaced by The Next Thing (!?), and the bean counters are
> going to take complete control.
>
> In the long run, behavior follows incentives.  We-the-customers, who
> are investing heavily on a highly locked-in platform, would feel a lot
> better knowing that both primary and secondary incentives are aligned
> in our mutual interest.

The primary incentive is the long-term incentive; I don't see how that
would be different between today and N years from now, or when it would
ever be sensible to do what you are describing.

>
>> } Is there still a hard limit of 10 threads?
>>
>> Yes, but probably not for the reason you expect. The primary
>> issue we run into is memory management. If we raised the default
>> to 100, many apps would then see out-of-memory deaths (more than
>> they do now), and these deaths show up differently for
>> python/java/go. The right path forward is more intelligent
>> algorithms wrt memory, providing configurability, and so on. This
>> is an example of the kinds of projects we work on for the
>> scheduler, but as with any team we have to prioritize our
>> projects. I'd recommend filing this (or any other desired
>> scheduler enhancements) on the public issue tracker so they can
>> get feedback/data/votes.
>
> Does this limit still apply to larger instances?  If an F2 has twice
> the memory and CPU of an F1, it should be able to handle the same load
> as two F1s.  In theory, any app which requires dozens of instances
> would be better off switching to F4 frontends.  But not if there's
> still a 10 thread limit.

It does not presently consider this. It seems like a reasonable approach.
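For GAE/J, all of this presupposes the app has opted into concurrent requests at all; without the threadsafe flag each instance serves one request at a time, so the 10-request cap never comes into play. A minimal appengine-web.xml sketch (the app id and version are placeholders):

```xml
<?xml version="1.0" encoding="utf-8"?>
<appengine-web-app xmlns="http://appengine.google.com/ns/1.0">
  <application>your-app-id</application>
  <version>1</version>
  <!-- Opt into concurrent requests. Without this, each instance
       handles a single request at a time, regardless of class. -->
  <threadsafe>true</threadsafe>
</appengine-web-app>
```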

>
> Jeff
>

Michael Hermus

Jun 15, 2012, 1:40:12 PM
to google-a...@googlegroups.com
This really should be documented (if it is already and I missed it, apologies).

Given a hard limit of 10 threads per instance, does it make sense to use anything other than an F1 instance, if you have primarily RPC-blocked requests (which I assume many, if not most, App Engine apps are)?
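Back-of-envelope: if requests are almost entirely RPC-blocked, the 10-slot cap, not CPU, bounds per-instance throughput at roughly slots divided by latency (Little's law), and that bound is identical for an F1 and an F4. A rough sketch, with illustrative (not measured) numbers:

```python
# Rough throughput ceiling for an RPC-bound App Engine instance.
# With N concurrent request slots and an average request latency
# of L seconds, Little's law bounds throughput at N / L req/s.

def max_throughput(concurrent_slots, avg_latency_s):
    """Upper bound on requests/second for one instance."""
    return concurrent_slots / avg_latency_s

# Illustrative: 10 slots, 200 ms average RPC-bound latency.
# An F1 and an F4 both top out at the same ~50 req/s here,
# because the bottleneck is the thread cap, not CPU.
print(max_throughput(10, 0.2))  # 50.0
```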

alex

Jun 15, 2012, 5:47:46 PM
to google-a...@googlegroups.com
I don't personally see it as "degraded == more profitable"; quite the opposite.

And to remind you while you're bashing: AWS (since you mentioned it) has its moments too.

In that particular one, a degraded EBS would degrade a handful of apps even with 1,000 threads. As you might notice, no one sees that as "degraded == more profitable". In fact, at that point one can probably choose between having users be extremely angry (because an app would probably just seem down) and go away, or sustaining a temporary cost increase but at least keeping users "less angry".

stevep

Jun 15, 2012, 6:04:58 PM
to google-a...@googlegroups.com
Jon wrote:
The primary incentive is the long-term incentive; I don't see how that
would be different between today and N years from now. I don't see when
it would ever be sensible to do what you are describing.

A good while ago, I worked for HP. We priced a product at $34.95. It was profitable against an expected installed base of, let's say, 1 million units. That was many, many years ago. Since then a couple of things changed. Our initial installed-base estimates were off by a factor of probably 10: instead of 1 million, the base was 10 million. Revenues and profits were hugely, positively correlated with the size of the installed base, while costs per unit moved in the opposite direction as the base grew. Hence, if HP did nothing, it would have reaped gigantic windfall profits (windfall vs. initial investment). These numbers have Google-sized zeros after them. So do the customer a favor, HP, and lower the price, or at least keep it the same. Second-order effects such as consumer goodwill and brand equity would be great.

The same product now sells for $55 at Costco. When I was working to get the $34.95 product out, I believed (and I think this was true) that my organization was one of the best, most capable consumer products companies EVER. Now I feel a sense of shame at what it evolved into. (I left quite some time ago, but still regret what it has become.) So Jon, stay focused please, but by all means be realistic.

Jeff Schnitzer

Jun 15, 2012, 6:11:54 PM
to google-a...@googlegroups.com
On Fri, Jun 15, 2012 at 2:47 PM, alex <al...@cloudware.it> wrote:
>
> And to remind you while you're bashing,

I am not bashing GAE. I offer constructive criticism of the
ill-conceived parts of GAE, of which there are several. Short of
dragging Googlers off to Las Vegas for an all-you-can-eat hookers and
blow buffet, this is the best way I (as a customer) know how to enact
change. If enough customers complain loudly enough for long enough,
Google will adjust their development priorities.

The only reason AWS has been brought up is because there is an
argument to be made that they suffer from the same perverse incentive
WRT performance. To some extent this is true, but it's certainly less
direct than with GAE. If it were more direct, more people would
probably complain. And even if AWS works "the same way", there's
still no reason to embrace it, especially when GAE is multiple times
more expensive. There is no reason to turn this into a general
discussion of AWS vs GAE.

The fact that my bill can vary by a factor of 6 because of
fluctuations entirely within Google infrastructure is a problem.
Period.

Jeff

Jeff Schnitzer

Jun 15, 2012, 6:32:26 PM
to google-a...@googlegroups.com
On Fri, Jun 15, 2012 at 3:04 PM, stevep <pros...@gmail.com> wrote:
>
> A good while ago, I worked for HP.

This.

Google has been around for a little under 15 years, and most of its
employees for only a tiny fraction of that. The company is barely
adolescent. In my short years on this planet I've watched Microsoft
and IBM go from hero/villain to villain/hero and back again. I've
watched Oracle go from worshipped to despised (for good reason).

Google has a great trajectory. But let's not claim to know a priori
what another 20 years will bring. Long-term thinking gets sacrificed
on the altar of short-term survival by struggling companies *all the
time*.

Also: we're already seeing some cracks. Google has totally screwed
up pricing for the Maps API, to the point that many developers (including
Foursquare, Wikipedia, and myself) have replaced it with cheaper (often
free) OSM-based solutions.

Jeff