More granularity for errors generated by gRPC itself


Ruslan Nigmatullin

Apr 12, 2018, 5:52:10 PM
to grpc.io
Hi,

Is there a chance to add details to errors generated by the gRPC layer itself to distinguish different scenarios, instead of forcing gRPC users to analyze error messages client-side? Parsing error messages is error-prone, as they are not standardized across different languages and can change over time without any warning (and, strictly speaking, are not part of the API).

A few examples we'd like to differentiate:
1.1. ResourceExhausted returned by a Python server when its concurrency limit is exceeded - it's safe to retry the request against a different server (the server is known not to have started processing the request), and this is quite common for Python setups
1.2. ResourceExhausted returned by any server because the metadata/message is too big - retrying is useless, as the message size doesn't change
1.3. ResourceExhausted returned by any server because the response is too big - retrying a non-idempotent request is dangerous, as the server has already processed it once
2.1. Unavailable returned by application logic to indicate that some dependency is down - it may or may not be safe to retry, depending on the specific scenario
2.2. Unavailable returned by the client to indicate that all connections are down - it's safe to retry in the hope that a new connection becomes established
2.3. Unavailable returned by the client to indicate that the current active stream was terminated

We're interested in having individual gRPC-transport-specific error codes for these cases, both for better attribution of failure scenarios (in metrics/tracing) to improve the system's visibility, and in some cases for reacting to them differently.
While some of these cases can be mitigated by ensuring that we always properly attribute our own application-specific errors, gRPC's own errors are still indistinguishable in some scenarios [1].
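For illustration, this is roughly what we end up writing today (a Python sketch; stub.DoSomething and retry_on_another_backend are placeholder names, and the substring match is exactly the fragile part we'd like to avoid):

import grpc

def call_with_fragile_handling(stub, request):
    try:
        return stub.DoSomething(request)
    except grpc.RpcError as e:
        if e.code() == grpc.StatusCode.UNAVAILABLE:
            # Scenarios 2.1-2.3 all arrive here. The only way to tell them
            # apart is the human-readable message, which differs between
            # languages and releases and is not part of the API.
            if "connect" in (e.details() or "").lower():  # hypothetical substring
                return retry_on_another_backend(stub, request)  # placeholder helper
        raise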


Thanks

Muxi Yan

Apr 18, 2018, 2:24:35 PM
to grpc.io
The problem with that is that there is no fine-grained set of codes we could support. In particular, the server application may return status codes with its own definitions. Adding codes is not likely to help the situation for gRPC, since there would always be scenarios that are not covered.

Carl Mastrangelo

Apr 18, 2018, 4:46:12 PM
to grpc.io
Responses inline


On Thursday, April 12, 2018 at 2:52:10 PM UTC-7, Ruslan Nigmatullin wrote:
> Hi,

> Is there a chance to add details to errors generated by the gRPC layer itself to distinguish different scenarios, instead of forcing gRPC users to analyze error messages client-side? Parsing error messages is error-prone, as they are not standardized across different languages and can change over time without any warning (and, strictly speaking, are not part of the API).

> A few examples we'd like to differentiate:
> 1.1. ResourceExhausted returned by a Python server when its concurrency limit is exceeded - it's safe to retry the request against a different server (the server is known not to have started processing the request), and this is quite common for Python setups

Retriable RPCs, which are in the process of being implemented, will allow you to specify whether an error code can be retried. The call will be retried on the next available transport. If you are using the round-robin load balancer, this will pick the next server. Python is doing something nonstandard here, as other languages don't have a global maximum on the number of RPCs.
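For reference, here is roughly what that configuration is expected to look like via the service config, using channel arguments that already exist in the C-core based stacks (the service/method names and backoff values below are made up for illustration, and since retries are still being implemented, treat this as a sketch rather than the final API):

import json
import grpc

# Opt one method into automatic retries on UNAVAILABLE via the service config.
# "example.Frontend" / "GetWidget" are placeholder service/method names.
service_config = json.dumps({
    "methodConfig": [{
        "name": [{"service": "example.Frontend", "method": "GetWidget"}],
        "retryPolicy": {
            "maxAttempts": 3,
            "initialBackoff": "0.1s",
            "maxBackoff": "1s",
            "backoffMultiplier": 2,
            "retryableStatusCodes": ["UNAVAILABLE"],
        },
    }]
})

channel = grpc.insecure_channel(
    "server.example.com:50051",
    options=[
        ("grpc.enable_retries", 1),
        ("grpc.service_config", service_config),
        # round_robin so a retried attempt is sent to a different backend
        ("grpc.lb_policy_name", "round_robin"),
    ],
)

(The Python-specific concurrency cap being referred to is, as far as I know, the maximum_concurrent_rpcs argument to grpc.server(), which rejects excess RPCs with RESOURCE_EXHAUSTED.)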
 
> 1.2. ResourceExhausted returned by any server because the metadata/message is too big - retrying is useless, as the message size doesn't change

These are not generally retriable anyway. What would you do in response to getting this code?
 
> 1.3. ResourceExhausted returned by any server because the response is too big - retrying a non-idempotent request is dangerous, as the server has already processed it once

Again, what would you do about this? The status codes used by gRPC are intended to be handled *automatically*. I don't think there is any automatic action you can take here, so why distinguish?

 
> 2.1. Unavailable returned by application logic to indicate that some dependency is down - it may or may not be safe to retry, depending on the specific scenario

This is something that applications are expected to indicate. In gRPC, you can add additional headers to RPCs. One header you could add is an app-specific sub error code, such as dependency_is_down. You can include additional detail about which dependency is down, and things like that. gRPC gives you the tools to do this, since it varies per instance.
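A rough sketch of that in Python (the "x-app-error" key, the widget_* modules, and the helper functions are all made-up names; trailing metadata is a natural place for a sub-code because it travels with the final status):

import grpc
import widget_pb2        # hypothetical generated code
import widget_pb2_grpc

class WidgetServicer(widget_pb2_grpc.WidgetServiceServicer):
    def GetWidget(self, request, context):
        if not dependency_is_healthy():  # placeholder health check
            # The app-specific sub-code rides along as a trailer next to the status.
            context.set_trailing_metadata((("x-app-error", "dependency_is_down"),))
            context.abort(grpc.StatusCode.UNAVAILABLE, "backing store unreachable")
        return widget_pb2.Widget(id=request.id)

def get_widget(stub, request):
    try:
        return stub.GetWidget(request)
    except grpc.RpcError as e:
        # The raised error is also a grpc.Call, so the trailers are accessible.
        sub_code = dict(e.trailing_metadata() or ()).get("x-app-error")
        if sub_code == "dependency_is_down":
            ...  # attribute separately in metrics, decide whether to retry
        raise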
 
> 2.2. Unavailable returned by the client to indicate that all connections are down - it's safe to retry in the hope that a new connection becomes established

This is not safe to retry. The client would just spin trying to send an RPC and failing repeatedly. There is a special option in gRPC called "wait for ready" which allows you to say an RPC should wait until there is an available transport. The opposite means fail immediately if no transport could be used. (To be clear, this fails when all transports are failing, not when there are no transports. This allows gRPC to establish a new connection for the very first RPC without failing.)
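In Python that looks roughly like this (newer grpcio releases expose it as a per-call flag; the stub and message names are placeholders):

import grpc
import widget_pb2
import widget_pb2_grpc   # hypothetical generated code

channel = grpc.insecure_channel("server.example.com:50051")
stub = widget_pb2_grpc.WidgetServiceStub(channel)

# Default behaviour: fail fast with UNAVAILABLE while all transports are failing.
# With wait_for_ready the RPC is queued until a connection becomes ready,
# bounded by the deadline rather than by spinning on client-side retries.
reply = stub.GetWidget(widget_pb2.GetWidgetRequest(id="42"),
                       wait_for_ready=True,
                       timeout=5.0)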
 
> 2.3. Unavailable returned by the client to indicate that the current active stream was terminated

If this isn't an error condition, why not just use Status OK? I assume you mean graceful termination.

Arpit Baldeva

May 4, 2018, 5:24:19 PM
to grpc.io
There are many scenarios in which an application would want to see custom error codes, even if just for logging/better visibility. The pattern I use:

1. Stick a google::rpc::Status in every response message.
2. Create a message with embedded enums next to the service definition in the proto file. Each enum value is a specific error code that your service returns.
3. If you have an application error, use the above enum to populate the google::rpc::Status with the integer code and a readable message string. You can even send a custom message as an Any for more verbose logging.
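Roughly, the server side of steps 1-3 looks like this (a Python sketch; foo_pb2, FooResponse, FooServiceError, QuotaInfo, and the lookup/QuotaExceeded names are placeholders for your own generated code and application logic, and the google.rpc.Status field is assumed to be called "status"):

import foo_pb2            # hypothetical generated code
import foo_pb2_grpc

class FooServicer(foo_pb2_grpc.FooServiceServicer):
    def GetFoo(self, request, context):
        response = foo_pb2.FooResponse()    # contains a google.rpc.Status field
        try:
            response.value = lookup(request.key)           # placeholder app logic
            response.status.code = foo_pb2.FooServiceError.Type.ERR_OK
        except QuotaExceeded as e:                         # placeholder app error
            response.status.code = foo_pb2.FooServiceError.Type.ERR_QUOTA_EXCEEDED
            response.status.message = "per-user quota exhausted"
            # Optional verbose detail packed into the repeated Any field.
            response.status.details.add().Pack(foo_pb2.QuotaInfo(limit=e.limit))
        return response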

On the client side, you can then cast the integer code to the enum to check the error type, like so:
assertEquals("error code check", FooServiceError.Type.ERR_OK_VALUE, response.getStatus().getCode());

gRPC also generates a helper method to make sure the int-to-enum cast is valid (at least in C++), so you'll need to use that.
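The rough Python equivalent of that validity check, for what it's worth (again with placeholder names), is the generated enum wrapper's Name() lookup, which raises on unknown values:

code = response.status.code
try:
    name = foo_pb2.FooServiceError.Type.Name(code)   # raises ValueError if unknown
except ValueError:
    name = "UNRECOGNIZED"   # e.g. a newer server sent a code this client doesn't know yet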

A little more work than ideal, but it gets the job done.