App Engine Channel API Sept. 15 Outage Postmortem

Takashi Matsuo ♟

Sep 21, 2011, 10:39:26 PM
to google-appengine...@googlegroups.com
Postmortem

This document is the postmortem for the September 15, 2011 App Engine
Channel API outage.


Summary

On September 15, 2011, all applications using the Channel API were
affected by a service outage that lasted, in the worst case,
approximately 15 hours.

During the outage, clients using the Channel API received the JavaScript
error "Service for 'ae' is not registered." These clients did not
receive any messages sent through the Channel API. No other App Engine
functionality was affected because the JavaScript error occurred in an
iframe. Other JavaScript on the application's page remained
functional.


Root Cause

On September 14, a new version of the Google Talk front-end (Talk FE)
servers was pushed to production. The Channel API uses the Google Talk
servers to implement its client-side message receiving code. The
version that was pushed to production contained a bug that caused this
JavaScript error.


Timeline

All times are PDT

9/14 1800 New Talk FE push started
9/14 2300 First user reports Channel API broken on App Engine Google Group
9/15 0245 Production issue ticket filed
9/15 0400 App Engine engineer begins investigation of problem and
identifies cause.
9/15 0530 Rollback of bad code started, clients start coming back online
9/15 0930 Rollback completed, all clients working again.
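
As a quick check on the "approximately 15 hours" figure, the intervals
between timeline entries can be computed directly. The sketch below is
only an illustration: the helper names (parse, hours_between) are
hypothetical, and it assumes every timestamp is in 2011 and in the same
time zone.

```python
from datetime import datetime


def parse(stamp):
    """Parse a timeline stamp such as '9/14 1800' (year 2011 assumed)."""
    return datetime.strptime("2011 " + stamp, "%Y %m/%d %H%M")


def hours_between(start, end):
    """Return the elapsed hours between two timeline stamps."""
    return (parse(end) - parse(start)).total_seconds() / 3600


# Stamps copied from the timeline above.
push_started = "9/14 1800"
first_report = "9/14 2300"
ticket_filed = "9/15 0245"
rollback_started = "9/15 0530"
rollback_done = "9/15 0930"

print("push to full recovery:", hours_between(push_started, rollback_done))
print("push to first user report:", hours_between(push_started, first_report))
print("first report to ticket:", hours_between(first_report, ticket_filed))
print("ticket to rollback start:", hours_between(ticket_filed, rollback_started))
```

The first interval, from the start of the push to the completed
rollback, is the worst-case window quoted in the summary.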


Issues and Fixes

The bug causing the problem was missed during coding. Unit tests that
caught this bug were flaky for other reasons, so the failure was
ignored. These tests have been improved and now run as part of every
check-in.

Automated tests are run against new App Engine releases but were not
run against new Talk FE releases. A new suite of tests is now run
against every new Talk FE build that is a candidate for production.

Automated alerts for some specific services running on Talk FE servers
were misconfigured and didn't fire when usage of the Channel API
service dropped noticeably. These alerts are in the process of being
improved so that engineers will be paged much earlier if service usage
starts falling.
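
To make the idea of a usage-drop alert concrete, here is a minimal
sketch of one possible check. It is not the Talk FE alerting
configuration: the function name, the use of a mean over recent samples
as the baseline, and the 50% threshold are all assumptions chosen for
illustration.

```python
from statistics import mean


def usage_dropped(baseline_samples, current, max_drop=0.5):
    """Return True when current usage is more than max_drop below baseline.

    baseline_samples: recent usage counts from a period of normal traffic
                      (hypothetical input; any usage metric would do).
    current:          the latest usage count.
    max_drop:         fractional drop that should page an engineer
                      (0.5 is an arbitrary illustrative value).
    """
    if not baseline_samples:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(baseline_samples)
    if baseline <= 0:
        return False  # avoid dividing by zero for an idle service
    return (baseline - current) / baseline > max_drop


# Example: usage falls from roughly 1000 to 200 active clients.
if usage_dropped([1000, 1100, 900], 200):
    print("page the on-call engineer")
```

A relative threshold like this only fires on drops measured against
recent normal traffic, so a real alert would also need to account for
daily usage cycles.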

The Channel API does not have its own section on the App Engine Status
page, so developers had no known place to check its status. We
will add a section to the status page for the Channel API.

-- Takashi Matsuo, on behalf of the App Engine Team
