502 Server Errors after upgrade?

Bill

Mar 2, 2009, 10:06:09 PM
to Google App Engine
Anyone else having trouble with their apps? I'm getting 502 Server
Errors, and even looking at logs in the console is glacially slow.

Gee

Mar 2, 2009, 10:09:08 PM
to Google App Engine
me too

app id: rotzy

Tony Arkles - home

Mar 2, 2009, 10:12:03 PM
to Google App Engine
Me too

app id: steprep
app id: steprep-demo

B.J.

Mar 2, 2009, 10:12:40 PM
to Google App Engine
me also

Kuber

Mar 2, 2009, 10:28:15 PM
to Google App Engine
I'm running into this problem too.
Is this related to the GAE scheduled maintenance an hour ago?

http://groups.google.com/group/google-appengine-downtime-notify/browse_thread/thread/90c7fde1e5ccbd4b#

Bill

Mar 2, 2009, 10:41:47 PM
to Google App Engine
Don't know, but the timing is suspect.

My apps are up but were dog-slow. Here are the ms/request
charts:

http://static.writertopia.com/chart.png
http://static.writertopia.com/chart2.png

These were pages that served reasonably fast before.
So something is going on I think.

yobin

Mar 2, 2009, 10:46:04 PM
to google-a...@googlegroups.com
me2:(

Bill

Mar 2, 2009, 10:49:50 PM
to Google App Engine
Here's the word:

----- cut from email ------

For a period of 10 minutes after this maintenance completed all apps
saw elevated latencies, and a very small number of apps saw an
increase in error rates. This problem has been fixed, and apps are
now serving normally.

Pete Koomen, App Engine Team

Brenton

Mar 3, 2009, 2:16:16 AM
to Google App Engine
Still bad for us at insightdining.

cz

Mar 3, 2009, 4:09:11 AM
to Google App Engine
Yep, our app is still quite a bit slower than usual but slightly better
than earlier today.
zipimport still seems especially slow... I'm wondering if using
non-trivial 3rd party frameworks such as Django 1.x is such a good idea
on GAE. With this kind of latency I'm thinking maybe not. Perhaps for
high traffic sites where multiple instances of the app remain in
memory, but for medium to low traffic it's really terrible. Any chance
this might get better?

Arun Shanker Prasad

Mar 3, 2009, 4:49:02 AM
to Google App Engine
Hi All,

I am still getting this 502 error at random. Did anyone figure out what
caused it, and is it fixed?

Thanks,
Arun Shanker Prasad.

Nick Winter

Mar 3, 2009, 11:55:10 AM
to Google App Engine
http://code.google.com/status/appengine/detail/serving/2009/03/03#ae-trust-detail-helloworld-get-latency

Just about every day for the past several weeks, there's been elevated
latency like this, usually at similar times of day. It was unfortunate
and frustrating before, but since last night our development is
stalled because every part of App Engine is too slow to do any testing
or data manipulation right now. 5 seconds per request?

I'm confident that the App Engine team will get a handle on the
performance and everything will be shiny once more, but it'd be nice
to hear some word as to what's going on. Are the servers just
overloaded? Did something go wrong with the maintenance last night? Is
anomaly-yellow serving to be expected?

Brandon Thomson

Mar 3, 2009, 12:03:05 PM
to Google App Engine
The yellow is causing me a problem too; a lot of my (very simple)
queries that previously returned quickly are timing out.

proxypy

Mar 3, 2009, 3:48:09 AM
to Google App Engine
I got a 403 Over Quota error, but I have used less than 1% of my quota so far!
my site: proxypy.appspot.com

Who can help me?

Brett Slatkin

Mar 3, 2009, 1:25:32 PM
to google-a...@googlegroups.com
Hi Nick,

We had some unexpected issues during the maintenance last night which
caused elevated latencies and errors for all applications. We resolved
the issue around 8:45pm last night and things have returned to normal
since. Please let me know if you're still seeing any problems.

As for the elevated latency for the dynamic request metric (that you
linked to), this is primarily a product of alert tolerances. We're
still tuning our status site metrics to match real-world expectations
of App Engine performance. You'll notice today that we've raised some
of these tolerances by a little bit, causing many of the lines to go
back to a blue color (i.e., everything OK).

-Brett

Brenton

Mar 3, 2009, 2:04:51 PM
to Google App Engine
It still feels down to me. Our app keeps timing out. Friends' apps
too.

208 - try again in 30 secs. Same as last night.

Brandon Thomson

Mar 3, 2009, 2:08:38 PM
to Google App Engine
Ditto. Datastore performance is very bad too.

Sylvain

Mar 3, 2009, 2:18:32 PM
to Google App Engine
Yes, I don't know why, but today the main request uses 1000 ms-cpu more
than yesterday.
So our CPU quota usage is higher than usual.

Regards

Arun Shanker Prasad

Mar 3, 2009, 2:22:58 PM
to Google App Engine
Hi,

My app still seems to be experiencing this problem. I get 502
Bad Gateway errors at random, and the app is serving very
slowly. Writes are the most expensive, timing out randomly.

Thanks,
Arun Shanker Prasad.

Pete Koomen

Mar 3, 2009, 2:53:27 PM
to Google App Engine
Hi all,

Some apps are still seeing increased latencies--we're working hard to
isolate this. We'll keep you updated as we do.

Pete

Nick Winter

Mar 3, 2009, 3:05:08 PM
to Google App Engine
Yes, I'm still seeing problems. We've been averaging 4 seconds per
request for the past six hours, roughly correlated to the elevated
latency that you say is within tolerance and is averaging 250 ms per
request. Some of that is due to a recently identified performance
issue with one of our common handlers being identified as a "high CPU"
handler and getting sidelined to make room for more performant apps'
handlers. We haven't been able to fix that yet, but it's not just that
handler: it's every dynamic handler.

This isn't just an issue since last night, though; it's been on and
off for two months. We're still seeing widely variable response times
on almost every request. In the admin console, for example, most
requests take a normal second or so, but a fifth of them or so take
between 7-20 seconds to respond. This is stuff like expanding one log
entry with no log messages in it. I know that's not in my code. Is it
instance startup costs? I think so. From my logs, I'll load something
simple like a FAQ page, and it's fine, but whenever I have to start
another instance and import Django and some views, well, that's 7-14
seconds to do it. Have we changed anything in the past two months? A
couple very small things, but mostly our development has been on
another branch. It goes between "working just fine on every request"
and "several seconds to spawn an instance running the same code as
usual" from day to day and time to time. It's just been getting more
frequent and worse as time goes on.

The simplest handler we have, which imports our models, reads one
thing at random from the datastore, and returns it, with no web
framework, normally runs in 50-250 ms and takes 50-80 ms of cpu, but
sometimes (new instance?) takes 1300-2000 ms and 130-300 ms cpu. But I
made an even simpler handler as a test, with no import, that just
prints a random number. Starting a new instance for that only takes a
70-250 ms response time. And if I make that import our models every
time, it doesn't change anything. It's almost like the cost of
starting a new instance is multiplied by the cost of the instantiation
code is multiplied by whether App Engine is feeling cheerful that
hour.
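
One rough way to separate import cost from handler cost is to time the import on its own. This is only a sketch in plain, modern Python, not App Engine code: `time_import` is a made-up helper, and `decimal` is just a stand-in for a heavier module such as a models file or a web framework.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return the seconds spent importing `module_name` in this process."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# The first call may pay the real import cost. The second call finds the
# module already in sys.modules, so it should be close to free.
first = time_import("decimal")
second = time_import("decimal")

print("first import:  %.3f ms" % (first * 1000))
print("second import: %.3f ms" % (second * 1000))
print("cached in sys.modules:", "decimal" in sys.modules)
```

If the first number is large and the second is tiny, the cost sits in module loading rather than in the handler body, which would match the "new instance" pattern described above.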

It could be our code, I could see that. If there's a reason why the
admin console would be so slow at the same times and in the same way
as the rest of our site, maybe it would shed some light. If we're
doing something subtly wrong when we import Django (0.96) and our
models, I could see that, but only if there's a reason why we wouldn't
see it sometimes and would at other times, and why it would have been
getting worse without any code changes.

It would also help a lot to know exactly how this works (from the
Quotas page):
"Applications that are heavily cpu-bound, on the other hand, may incur
some additional latency in long-running requests in order to make room
for other apps sharing the same servers."
Marzia has said that this is per-request, so I'm assuming that certain
handlers will get sidelined, but this is happening site-wide --
otherwise, it would make perfect sense. How is a handler identified as
high-cpu? How long does that classification last?

Thanks,
--Nick (app id: skrit)


Artem

Mar 3, 2009, 1:44:43 PM
to Google App Engine
I am still getting 502s in our App (which is an instance of Rietveld
without any changes except securing it with a password).
app id is: wsl-codereview

We have changed nothing. Problems started last night, but now every
request is a 502.

Artem

Brandon Thomson

Mar 3, 2009, 6:07:23 PM
to Google App Engine
I think my errors are gone now. Thank you Google!

Pete Koomen

Mar 3, 2009, 9:23:19 PM
to Google App Engine
Hi all, we'll continue posting updates about this issue to our status
site and downtime-notify group as we continue investigating:

http://code.google.com/status/appengine
http://groups.google.com/group/google-appengine-downtime-notify

Thanks for your patience, we're still working hard on this one.

Pete

Arun Shanker Prasad

Mar 4, 2009, 12:15:41 AM
to Google App Engine
Hi,

The system status site shows that all services are serving normally now,
but my app is still experiencing DeadlineExceededError on pages that
used to be served in well under 5s, and others are very slow and take
almost twice the normal time to serve.

I know the Google App Engine team is working on resolving this and I
thank them for that, but I made this post since the current status is
shown as Normal and my app is still experiencing errors.

Thanks,
Arun Shanker Prasad.


Arun Shanker Prasad

Mar 4, 2009, 5:15:03 AM
to Google App Engine
Hi,

I am still getting random 502 errors, and the latency of even simple
memcache hits is very high, almost twice what I had a couple of days
before.

The App Engine status page shows everything as normal now, are the
issues resolved? Any updates?

Thanks,
Arun Shanker Prasad.

cz

Mar 4, 2009, 5:56:22 AM
to Google App Engine
Latency is still pretty much killing our site. Dynamic pages that took
1-3 seconds (which is bad enough) before the slowdown still take 10-20
seconds. The app dashboard is super sluggish as well (as Nick pointed
out). I'm hoping this means that the GAE team is still working on it
(if so, thanks guys).


Brenton

Mar 4, 2009, 6:29:12 AM
to Google App Engine
The status page leaves much to be desired. It says everything is OK
now (it's not). Yesterday, it said 'anomaly.' What the hell does
that mean, in the context of large scale hosting?

Clear, concise, timely, and accurate are the hallmarks of a good
status page. At a glance, I should be able to know if there is an
issue, how many people are affected, how long the downtime is expected
to last, and if I can do anything about it. It's nice when there's
some behind the scenes 'this broke - we fixed it,' but that's not
essential.


My latest app is mirrored on multiple domains with Google Apps for
your Domain. In addition to the 502 timeouts that we've been seeing
the last couple days, I have noticed that there is a page on my app
that is broken on one domain but not another. It seems one 'install'
of the app can get a record from the datastore, but another can't.


It's scary to see it be this bad for this long. Google has a
reputation for having the best systems on the 'net. It earned this
distinction, being one of the most responsive sites online even before
it was famous. Bad things happen, and we all understand that;
however, it's been nearly two days. If this was any other host,
people would be sketched out and cancel their service. I don't know
how many defectors you'll see in this case, but I suspect App Engine
will have a harder time borrowing Google's reputation for stability in
the future.


I have a lot of respect for you guys, and I wish you luck. Please get
us flying again. =)

Brett Slatkin

Mar 4, 2009, 3:26:47 PM
to google-a...@googlegroups.com
On Wed, Mar 4, 2009 at 2:56 AM, cz <cze...@gmail.com> wrote:
>
> Latency is still pretty much killing our site. Dynamic pages that took
> 1-3 seconds (which is bad enough) before the slowdown still take 10-20
> seconds. The app dashboard is super sluggish as well (as Nick pointed
> out). I'm hoping this means that the GAE team is still working on it
> (if so, thanks guys).

Sorry for the delay. We're still hard at work trying to resolve these
issues. We will post to the downtime-notify group and the Status Site
as soon as we have an update for you. Thanks for your continued
patience and sorry again for all the trouble!

-Brett
Google App Engine Team

Brett Slatkin

Mar 10, 2009, 8:05:43 PM
to google-a...@googlegroups.com, live...@gmail.com
Hi Nick,

Sorry for the delay here. Been working hard these last few days to
resolve these latency issues, which are looking good now. I'd like to
help you figure out the issue here, so please bear with me since I'm
not looking at your app's source code. =)

I believe Marzia stated elsewhere that once you get into the
100ms+ range for CPU time (runtime only, not including APIs) you will
begin to see this prioritization and additional latency at the request
level. This could account for some of the variability you have seen.
Another thing to keep in mind is the active dynamic requests limit
explained here:

http://code.google.com/appengine/docs/quotas.html#Request_Limits

"An application operating entirely within the free quotas can process
around 30 active dynamic requests simultaneously. This means that an
app whose average server-side request processing time is 75
milliseconds can serve up to (1000 ms/second / 75 ms/request) * 30 =
400 requests/second, independent of the quota system, without
incurring any additional latency. Requests for static files are not
affected by this limit. Applications that are heavily CPU-bound, on
the other hand, may incur some additional latency in long-running
requests in order to make room for other apps sharing the same
servers."
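
The arithmetic in that quote can be checked with a few lines of Python. This is only a sketch: the function name and parameters are made up for illustration, and the figure of 30 simultaneous requests is taken from the Quotas text quoted above.

```python
def max_requests_per_second(avg_request_ms, concurrent_requests=30):
    """Upper bound on throughput when `concurrent_requests` requests can be
    in flight at once and each one takes `avg_request_ms` milliseconds."""
    return 1000.0 * concurrent_requests / avg_request_ms


# The example from the Quotas page: 75 ms per request, 30 at a time.
print(max_requests_per_second(75))

# The same limit with slower requests, for comparison.
for ms in (250, 1000, 4000):
    print("%d ms/request: %.1f requests/second" % (ms, max_requests_per_second(ms)))
```

The bound falls in direct proportion to request time, so a request that takes 4 seconds is limited to a few requests per second across all 30 slots.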


We maximize the throughput of these dynamic requests as much as
possible, taking advantage of App Caching
(http://code.google.com/appengine/docs/python/runtime.html#App_Caching).
This is what enables high load applications to serve at 75ms of
runtime CPU per request. If you're not using App Caching, you could
see a significant impact on your application's latency.


Another thing to think about is how API call latency affects overall
throughput. For example, if you execute a series of Datastore queries
to retrieve many entities, you may be looking at 200-400ms of latency
for a query (or more, depending on the shape of your data). Now
connect this number with the runtime CPU and wall-clock time and
you're looking at 450-650ms of wall-clock time minimum to execute a
single request. Doing simple math you can get your maximum throughput
per instance per second: 1 request/650ms = ~1.5 requests per second
per instance. What if you're doing three of these queries per request?
Then we're looking at a throughput of less than 1 request per second
per instance.
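
A rough sketch of that math in Python. The 250 ms of non-query time is an assumption inferred from the 450-650 ms range above (450 - 200 and 650 - 400); it is not a figure stated anywhere, and the function name is made up for illustration.

```python
def requests_per_second(wall_clock_ms):
    """Throughput of one instance that handles one request at a time."""
    return 1000.0 / wall_clock_ms


RUNTIME_MS = 250  # assumed non-query time per request (inferred, see above)

for queries, query_ms in [(1, 200), (1, 400), (3, 400)]:
    wall_clock_ms = RUNTIME_MS + queries * query_ms
    print("%d query(ies) at %d ms each: %d ms wall-clock, %.2f requests/second"
          % (queries, query_ms, wall_clock_ms, requests_per_second(wall_clock_ms)))
```

Under these assumptions, one 400 ms query gives roughly 1.5 requests per second per instance, and three of them push it below 1.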

Going back to it: There are around 30 active instances available to
your application. We do our best to maximize your use of these
depending on your load and your application's throughput. But one
thing to remember is the faster your app is (in wall-clock latency),
the better we can spread its load across these 30 instances. With
requests that may take 1 second or more to complete, you may see more
variability in your app's latency and throughput. I believe that's
part of what's going on here.

The reason is that we scale your instances to match your sustained
throughput. If your average request takes 500ms and you're sustaining
10 requests per second, then you can serve this load easily using only
5 instances. However, if every so often you get a request that takes
2-5 seconds, then you're going to see some extra latency as the faster
500ms requests slow down for the big guy to go through. If bigger
requests keep coming through, things should balance back out in a
short amount of time; but if the bigger workload is relatively
infrequent this could show up as latency spikes.
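
A back-of-the-envelope version of the instance math, assuming each instance handles one request at a time (the same assumption as the per-instance throughput figures earlier in this message). The relationship used is Little's law: requests in flight = arrival rate x time per request. Both function names are made up for illustration.

```python
import math


def instances_needed(requests_per_second, avg_request_ms):
    """Average number of requests in flight, rounded up to whole instances."""
    return int(math.ceil(requests_per_second * avg_request_ms / 1000.0))


def fast_requests_displaced(slow_request_ms, fast_request_ms):
    """How many fast requests one instance could have served while it was
    busy with a single slow request."""
    return slow_request_ms // fast_request_ms


# 10 requests/second at 500 ms each keeps 5 instances busy.
print("instances for steady load:", instances_needed(10, 500))

# A 2-5 second request ties up one of them for several fast-request slots.
for slow_ms in (2000, 5000):
    print("%d ms request displaces %d fast requests"
          % (slow_ms, fast_requests_displaced(slow_ms, 500)))
```

With only 5 instances sized for the steady load, one slow request removes a fifth of the capacity for its whole duration, which is where the latency spikes for the fast requests would come from.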


There are a few solutions to this. One big one is caching. I assume
you're already doing that, but if you can get any more to reduce
latency, that will definitely help. Another is to profile your
application to find where you spend the majority of your wall-clock
latency (see http://code.google.com/appengine/kb/commontasks.html#profiling).
Precomputing as much information as possible always helps (that's the App Engine
way). But another thing that could help a lot is using background
processing (when that feature is ready) to do the precomputing in the
background; this could isolate you from increased latency you'll see
when more expensive requests go through the system.
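
For the profiling suggestion, a minimal sketch using Python's standard `cProfile` and `pstats` modules. It is written for a current Python rather than the App Engine runtime of the time, and `handle_request` and `profile_call` are hypothetical names standing in for a real handler and wrapper.

```python
import cProfile
import io
import pstats


def handle_request():
    """Stand-in for a request handler with some CPU-bound work."""
    total = 0
    for n in range(100000):
        total += n * n
    return total


def profile_call(func):
    """Run `func` under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and show only the ten most expensive entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


result, report = profile_call(handle_request)
print(report)
```

Note that cProfile reports time inside Python function calls; time spent waiting on API calls would show up as cumulative time in whichever function makes the call.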


Hopefully this information has helped a bit. Please let me know if you
have any questions. I still would very much appreciate your help in
tracking down the latency you're experiencing so we can figure out the
root cause. Thanks,

-Brett

Nick Winter

Mar 11, 2009, 6:54:22 PM
to Google App Engine
Wow, thanks for the attention, Brett!

We just did a big site update for our app, too, and have been running
around trying to fix critical bugs and get users back online, so
you have my sympathies.

I haven't been doing testing since you guys got the latency levels
down, since I did a big blitz right before overhauling the app to be
immune to increased latency (from the user's point of view). We still
have a lot of handlers running a few hundred ms cpu, but they haven't
been getting slammed like they were before, and I got a better version
of my one library that was taking 1300ms to import.

I think you've explained the unresolved mystery of why the handlers
were getting slammed so hard. We were using app caching, weren't
anywhere near 30 instances, and were getting hit bad even with no
datastore access and no instance startup costs--

--but the bit about scaling instances to match sustained throughput is
enlightening. I had thought that when Marzia said that prioritization
was per-request and not per-handler or per-app, that meant this sort
of thing (fast requests being deprioritized) wouldn't happen, and
that's why I was confused. The smaller requests slowing down because
of the big requests (which were mostly happening on instance startup
and were getting deprioritized hard) sounds like my issue. Requests
were very variable on the app, and so a sustained throughput
calculation would be significantly off from second to second or minute
to minute, which may not have helped things.

One thing we found out we could do at the developer chat was store a
lot more than we'd counted on in the memcache. Since we couldn't find
any limits posted, we were being (extremely, as it turns out)
conservative with memcache space. We'll be able to increase
performance in many places by putting more lookup tables in there. I
know that there are few guarantees when it comes to how much will fit
in there, but if Marzia has said that around 100MB is the right
ballpark, then perhaps some info can be put online so others can have
some estimate of its capacity.

After the dust's settled, I'm quite happy with App Engine performance,
and the ease of use has been a dream. Thanks for all your hard work!
--Nick