Open Letter to Ryan Barrett and the AppEngine Team regarding "high availability"


gae123

Sep 15, 2009, 6:01:47 AM
to Google App Engine
Dear Ryan:

I read your Sep. 14 posting on the megastore transition with
interest. I also viewed your presentation at Google I/O. I am very
impressed by the effort the team is putting into making GAE
scalable and highly available, and into minimizing downtime. In your
talk you mentioned multiple times how unacceptable a two-hour downtime
is, and you cited the SVColo incident that brought down Twitter and
FriendFeed a few times. So you seem to get it!!!

Unfortunately, the reality I see and face on the ground for the past
16 months is different. As I am writing this, I have 2 GAE powered
sites down for TWO DAYS[1] and I am still waiting. The root cause?
Simple "index count quotas" resetting issues that Google can solve on
its end in minutes by resetting counters. In a similar situation in
June[2], it took SEVEN DAYS to resolve the problem. The same issue in
January took FOUR DAYS to get resolved[4]. The same issue in October
2008 took THREE DAYS to resolve[5]. Do you find this response
acceptable? If you read the postings I refer to, you will see that most
of the time was wasted in round-trip communication across timezones!!!

Several bugs have been filed on the index quota issues but they have
not been addressed. You cannot believe how frustrating and
unproductive it is to be in my shoes:
1. I don't know when anybody from Google will read my postings...
2. It can take up to 12 hours between replies.
3. For hours, I have to check the discussion group every ten minutes
to see if somebody will read/take action/reply to my messages.
Who can run a successful service with this kind of support?

In addition to investments in multihoming, data centers and all the
other great projects you describe, Google needs to invest in fixing
the relatively simple issues (like [3]) first. This will make us
self-sufficient. Google must also beef up support. A couple of months
ago I posted a note on what Google can do to improve Google App Engine
support[6]. I received a lot of positive feedback from the community,
but the official GAE team did not even reply to the posting.

The situation I currently see makes me wonder whether GAE as a whole
has its priorities right. Right now I definitely feel that the
customer is not the priority. Thanks for reading. I hope this message
will help Google focus on the burning issues first and improve Google
App Engine.

Best Regards,
Peter
http://www.gae123.com


REFERENCES
1. http://groups.google.com/group/google-appengine-python/browse_thread/thread/688aa5d0d376bd9f
2. http://groups.google.com/group/google-appengine/browse_thread/thread/d5c17a116e81d266
3. http://groups.google.com/group/google-appengine/browse_thread/thread/e0c5bd1a19c945c1
4. http://groups.google.com/group/google-appengine/browse_thread/thread/4f98d2c04fac0fb9/715e7def1a361ec9?show_docid=715e7def1a361ec9
5. Direct e-mail communication with Mar. Nic., available upon request
6. http://groups.google.com/group/google-appengine/browse_thread/thread/03fd57a38f130dee

Ram Shanker

Sep 15, 2009, 9:02:41 AM
to Google App Engine
+1 for this ..

I have also been thinking of posting something. My app is not mission
critical, and failures get corrected by the next minute's cron update,
but I still get frustrated when the same entity that was accessed
successfully the last 100 times (read only) suddenly gets a read
Timeout error from the datastore. What if I had something critical, or
plan to do something more important in the future? The situation gets
worse if you MUST provide a hundred *try:except* pairs and
fallbacks... because of these kinds of totally random errors.

Nicholas Brown

Sep 15, 2009, 11:53:33 AM
to Google App Engine
Your criticism seems legit; I just wanted to point out the following:

"* This is a preview release of Google App Engine."

It certainly seems that appengine is less mature than some of the
other cloud offerings. But it also looks to be more promising over the
long term, if they get it right. Combine that with free hosting for
low-load apps, and here we are, despite having to deal with the
growing pains :-)

ryan

Sep 15, 2009, 7:04:24 PM
to Google App Engine
hi peter!

On Sep 15, 3:01 am, gae123 <pa...@gae123.com> wrote:

> Unfortunately, the reality I see and face on the ground for the past
> 16 months is different. As I am writing this, I have 2 GAE powered
> sites down for TWO DAYS[1] and I am still waiting. The root cause?
> Simple "index count quotas" resetting issues that Google can solve on

thanks for the feedback, and apologies for the trouble. you're right,
that particular quota has been troublesome for a long time, and when
people have hit it, the only recourse has been to post to the group
and wait for someone here to fix it manually. we've definitely been
aware of this for a while, and we've known the status quo wasn't good
enough.

happily, your timing is good, since we managed to prioritize this just
a couple weeks ago. in 1.2.6, the index count quota will be handled
much better, both in real time and behind the scenes so that if it
does skew, it will automatically be fixed. we follow a fairly agile
process internally, so we don't usually schedule release dates ahead
of time, but based on our track record so far -
http://code.google.com/p/googleappengine/wiki/SdkReleaseNotes - we've
put out releases roughly once a month, and 1.2.6 will probably follow
that pattern.

i also feel your pain in the support department. we wish we had the
resources to provide more high-touch support! we're not a large team,
though, so we have to ruthlessly prioritize. that often means less
individual support, and it can also mean prioritizing necessary but
invisible internal changes over developer-visible features and bug
fixes.

personally, i favor the approach joel spolsky describes in
http://www.joelonsoftware.com/articles/customerservice.html .
naturally, when a developer hits a problem, we should try to fix it
for them immediately if we have the resources. more importantly,
though, we should try to change app engine so that problem doesn't
happen at all. there are lots of possible app engine improvements like
this, so some of them - like the index count quota - don't get fixed
right away. we definitely try to keep track of them all, though, and
get to them all eventually. (of course, as you mention, we can always
improve the ways we communicate so that you know we're aware of a
problem and do plan to work on it eventually.)

Stephen

Sep 15, 2009, 10:09:57 PM
to Google App Engine

On Sep 16, 12:04 am, ryan <ryanb+appeng...@google.com> wrote:
>
> (of course, as you mention, we can always
> improve the ways we communicate so that you know we're aware of a
> problem and do plan to work on it eventually.)


How will you do this?

ryan

Sep 16, 2009, 2:51:32 AM
to Google App Engine
good question. we currently use the roadmap for big things and the
issue tracker for small things. both are good, but we can always
improve on how we use them. the issue tracker, in particular, we
haven't kept up with as well as we'd like to. we're going to work on
that.

the indices count quota problem, for example, has bugs in the issue
tracker that we could update with status reports when we work on fixes
like the ones coming up in 1.2.6. we'll try to do that more in the
future.

http://code.google.com/p/googleappengine/issues/detail?id=1161
http://code.google.com/p/googleappengine/issues/detail?id=2124

ted stockwell

Sep 16, 2009, 12:10:06 PM
to Google App Engine


On Sep 15, 6:04 pm, ryan <ryanb+appeng...@google.com> wrote:
>
> i also feel your pain in the support department. we wish we had the
> resources to provide more high-touch support! we're not a large team,
> though, so we have to ruthlessly prioritize. that often means less
> individual support, and it can also mean prioritizing necessary but
> invisible internal changes over developer-visible features and bug
> fixes.
>

What about opening the SDK code and accepting contributions from the
community?

There have been statements in the forums from Google people that
eventually the SDK code would be open sourced.
If you do this now I am sure that you will get bug fixes and
extensions from the community (I would personally like to fix 1899,
Asynchronous fetch in Java API).
If you let the community help you fix issues then there will be a lot
less high-touch support required from you.



ryan

Sep 16, 2009, 8:24:27 PM
to Google App Engine
On Sep 16, 9:10 am, ted stockwell <emorn...@gmail.com> wrote:
>
> What about opening the SDK code and accepting contributions from the
> community?

definitely! i assume you mean the java sdk; the python sdk has been
open source for a while. we've had good results from that, including
useful contributions from the community, as you mention, so we
definitely want to do it with java too. it's been the plan from the
beginning; the reason it hasn't happened already is just
prioritization, as usual, since the java sub team is small and has
limited bandwidth.

gae123

Sep 23, 2009, 6:52:43 PM
to Google App Engine
Hi Ryan,

thanks for taking the time to reply to my posting, and thanks for all
the great work that is going on at GAE. I am glad to hear that the
indexes problem will be resolved in 1.2.6; it has been one of the few
problems, but a very acute one when it happens. I am also pleased to
see that Google has been taking time to acknowledge and prioritize bugs
over the past week.

I read http://www.joelonsoftware.com/articles/customerservice.html and
I totally agree with the approach. We all want to not need customer
support, this is the nirvana!!! However, when we do need support, as in
the case of the index quotas bug, we need a timely response. We also
need privacy and a system to track and measure this "timely response".
We all know that regardless of how great the team is, support will
occasionally be needed. So I think Google should reconsider my
suggestions about customer support in GAE
(http://groups.google.com/group/google-appengine/browse_thread/thread/03fd57a38f130dee)
and increase the funding to incorporate them, or an alternative that
meets the goals. I will be waiting to hear Google's roadmap on support;
in the meantime this remains one of the top items on my "why GAE is not
ready for prime time" list.

Best Regards






Stephen

Sep 25, 2009, 2:46:20 PM
to Google App Engine


On Sep 16, 7:51 am, ryan <ryanb+appeng...@google.com> wrote:
> On Sep 15, 7:09 pm, Stephen <sdea...@gmail.com> wrote:
>
> > On Sep 16, 12:04 am, ryan <ryanb+appeng...@google.com> wrote:
>
> > > (of course, as you mention, we can always
> > > improve the ways we communicate so that you know we're aware of a
> > > problem and do plan to work on it eventually.)
>
> > How will you do this?
>
> good question. we currently use the roadmap for big things and the
> issue tracker for small things. both are good, but we can always
> improve on how we use them. the issue tracker, in particular, we
> haven't kept up with as well as we'd like to. we're going to work on
> that.

When are you going to start working on this?

How long will the folks who have starred issue 1695 and posted a dozen
threads in the groups over the last couple of weeks have to wait for a
Googler to acknowledge them?