Error handling during downtime

dir Ls

Apr 26, 2019, 12:21:50 PM

to Google App Engine

Cloud Datastore has a 99.95% monthly uptime SLA for multi-region, which translates to slightly more than 20 minutes of permitted downtime per month. Is this downtime likely to happen all at once or intermittently? What kinds of errors should I expect during the downtime? I am trying to work out how the app should respond to end users while Datastore is unavailable. Is it possible that it works for data related to some users but not for others at a given time? I am looking for best-practice guidance for an app that is expected to be usable 24/7, with graceful degradation based on the state of the underlying services. For example, if the downtime is intermittent, users might just reload the page and won't even notice that something went wrong. But if the downtime is prolonged, it might be better to state explicitly that the system is currently inaccessible and ask them to come back after some time.

Tiago (Google Cloud Platform Support)

May 15, 2019, 11:10:35 PM

to Google App Engine

Hello,

The Cloud Datastore SLA intentionally doesn't answer many of the questions posed here: it's extremely hard to predict whether downtime will happen all at once or intermittently, as those events are, by their nature, most often unplanned. Indeed, a quick glance at previous incidents reveals that both kinds have occurred in the past year. When designing your application, it's probably better to abstract over such unknowns and implement general fail-safe mechanisms. For instance, if a write fails, you can catch the Datastore exception and enqueue a task to retry later.

That being said, given the small downtime budget allocated to Cloud Datastore (and taking into consideration its generally reliable behavior in the past), issues are more commonly caused by an implementation that does not follow the general best practices, or by sub-optimal design. Your app's overall reliability will benefit more from giving those topics proper attention during development.

dir Ls

May 16, 2019, 2:39:48 AM

to Google App Engine

Thank you, Tiago, for the response.

> if a write fails, you can catch the Datastore exception and enqueue a task to retry later, etc.

What I would like to know is which exceptions will be thrown that tell me I need to retry later. My app is written in Go, and the Datastore client for Go defines only a few errors, none of which relate to infrastructure-level read/write failures. They all seem to be related to app logic.

https://godoc.org/cloud.google.com/go/datastore#pkg-variables

Harmit Rishi (Cloud Platform Support)

May 16, 2019, 9:29:34 PM

to google-a...@googlegroups.com

Hello,

You may want to look at the "Errors and Error Handling" documentation for the low-level Datastore mode API. Please note that, as the documentation mentions, client libraries may or may not return these same values.

Regardless, you may refer to the chart called "Error Codes". There, you will see the recommended action for each error code. Essentially, the codes whose recommended action is "Retry using exponential backoff" are typically the ones associated with an HTTP 5xx status code, which indicates a server-side incident. Feel free to explore this documentation further and determine whether it applies to your inquiry.

Hope this helps!

dir Ls

May 16, 2019, 10:22:12 PM

to Google App Engine

Perfect, thank you for pointing to this documentation. Exactly what I was looking for.

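I assume I can get to those individual codes in Go using the following code:

s, ok := status.FromError(err)
if !ok {
	// err is not a gRPC status error; s.Code() reports codes.Unknown
}
switch s.Code() {
case codes.Aborted:
...
}

where status and codes are the gRPC packages (google.golang.org/grpc/status and google.golang.org/grpc/codes), going by the datastore/client.go code.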
