PSA: All Infrastructure Webapps are down


Aaron Gable

Feb 15, 2018, 3:17:16 PM
to Chromium-dev, Chromium OS dev
Almost all Chrome Operations services, both those that users interact with and those that do work behind the scenes, run on AppEngine. AppEngine is currently undergoing a global outage in which writes to its native data storage backend (the cloud datastore) are refused. As a result, the following applications are not expected to work correctly:
  • Monorail
  • Buildbucket
  • The CQ
  • chromium-status
  • Sheriff-o-Matic
  • and many more
I'll provide another update at 1pm, or when we have news about the ongoing outage.

Thanks,
Aaron (oncall Trooper)

Aaron Gable

Feb 15, 2018, 3:44:35 PM
to Aaron Gable, Chromium-dev, Chromium OS dev
Underlying AppEngine services began recovering at 12:34 Pacific; we're watching carefully and attempting to ensure that the recovery process doesn't cause further knock-on outages.

Will post another update at 1:30pm, or when we have additional information.

Aaron Gable

Feb 15, 2018, 4:40:30 PM
to Aaron Gable, Chromium-dev, Chromium OS dev
Underlying AppEngine has almost fully recovered, and our services seem to be operating fine. I reopened the tree long enough for the CQ to land one change, and both the tryjobs and main waterfall jobs that were started as a result seem to be operating fine.

I am reopening the tree now, and will keep an eye on things. This will be the last update unless something else goes wrong.

Thanks for your patience,
Aaron

Aaron Gable

Feb 15, 2018, 5:07:43 PM
to Aaron Gable, Chromium-dev, Chromium OS dev
Update: Swarming has not yet recovered, so although compiles are happening, swarmed tests are not being processed.

Will provide another update at 14:30 Pacific.

Aaron Gable

Feb 15, 2018, 6:40:13 PM
to Aaron Gable, Chromium-dev, Chromium OS dev
Update: Swarming has mostly recovered and is executing jobs successfully, but the frontends are still thrashing a bit and running more instances than are normally necessary. We're keeping the tree closed a little while longer to keep things calm.

Another update before 5pm Pacific.

Aaron Gable

Feb 15, 2018, 7:36:07 PM
to Aaron Gable, Chromium-dev, Chromium OS dev
The number of pending (and soon-to-be-expired) Swarming tasks is now down to 20k, down from 50k a couple hours ago. I currently plan to reopen the tree at 5pm or when the number of pending tasks reaches 2k, whichever comes first.

Aaron

John Budorick

Feb 15, 2018, 9:31:15 PM
to Aaron Gable, Chromium-dev, chromiu...@chromium.org
The number of pending Swarming tasks has now fallen below 8k, with the count slowly but steadily declining since Aaron's last message. We're reopening the tree, throttled, and will continue to monitor pending task count.


--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev