gRPC C++: question on best practices for client handling of servers going up and down

justin.c...@ansys.com

Nov 20, 2018, 10:22:16 PM
to grpc.io
GRPC Version: 1.3.9
Platform: Windows

I'm working on a prototype application that periodically calculates data and then, in a multi-step process, pushes the data to a server. The design is that the server doesn't need to be up, or it can go down mid-process. The client should not block (or should block as little as possible) between updates if there is a problem pushing data.

A simple model for the client would be:
Loop Until Done
{
 Calculate Data
 If Server Available and No Error Begin Update
 If Server Available and No Error UpdateX (Optional)
 If Server Available and No Error UpdateY (Optional)
 If Server Available and No Error UpdateZ (Optional)
 If Server Available and No Error End Update
}

The client doesn't care whether the server is available, but if it is, the client should push data; on any error, it skips everything else until the next update.
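
In C++ one iteration looks roughly like this (just a sketch: the Updater service, its methods, and the message types are stand-ins for our real generated stub; only the grpc:: types are real):

#include <grpc++/grpc++.h>

// Sketch of one iteration: bail out of the whole update as soon as any
// RPC fails, so the remaining steps are skipped until the next iteration.
void PushUpdate(Updater::Stub& stub, const Payload& data) {
  grpc::ClientContext begin_ctx;               // a fresh ClientContext per RPC
  UpdateReply reply;
  grpc::Status status = stub.BeginUpdate(&begin_ctx, data.begin_request(), &reply);
  if (!status.ok()) return;                    // server unavailable: skip the rest

  grpc::ClientContext x_ctx;
  status = stub.UpdateX(&x_ctx, data.x_request(), &reply);
  if (!status.ok()) return;

  // ... UpdateY and UpdateZ follow the same pattern ...

  grpc::ClientContext end_ctx;
  stub.EndUpdate(&end_ctx, data.end_request(), &reply);
}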

The problem is that if I make a call on the client (and the server isn't available), the first call fails very quickly (~1 sec) and the rest take a "long" time, ~20 sec. It looks like this is due to the reconnect backoff time. I tried setting GRPC_ARG_MAX_RECONNECT_BACKOFF_MS on the channel args to a lower value (2000) but that didn't have any positive effect.

I tried using GetState(true) on the channel to determine whether we need to skip an update. This check returns very quickly, but the channel never seems to get out of the TRANSIENT_FAILURE state after the server is started again (I waited for over 60 seconds). From the documentation, it looked like the parameter to GetState only controls whether a reconnect is attempted when the channel is in the IDLE state.
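
The check itself is roughly this (a sketch; only the grpc:: types and calls are real):

#include <chrono>
#include <grpc++/grpc++.h>

// Sketch of the availability check: ask the channel for its state and,
// via try_to_connect=true, kick off a connection attempt if it is IDLE.
bool ServerAvailable(const std::shared_ptr<grpc::Channel>& channel) {
  grpc_connectivity_state state = channel->GetState(/*try_to_connect=*/true);
  return state == GRPC_CHANNEL_READY;
}

// Variant that gives the channel a short window to recover before giving up.
bool ServerAvailableWithin(const std::shared_ptr<grpc::Channel>& channel,
                           std::chrono::milliseconds wait) {
  grpc_connectivity_state state = channel->GetState(/*try_to_connect=*/true);
  if (state == GRPC_CHANNEL_READY) return true;
  channel->WaitForStateChange(state, std::chrono::system_clock::now() + wait);
  return channel->GetState(/*try_to_connect=*/false) == GRPC_CHANNEL_READY;
}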

What is the best way to achieve the functionality we'd like?

I noticed there is a new GRPC_ARG_MIN_RECONNECT_BACKOFF_MS option added in a later version of gRPC; would that cause the gRPC call to "fail fast" if I upgraded and set it to a low value (~1 sec)?
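
For reference, the channel is created with something like this (a sketch; the target address and credentials are placeholders, and the MIN arg only exists in newer releases):

#include <grpc++/grpc++.h>

std::shared_ptr<grpc::Channel> MakeChannel() {
  grpc::ChannelArguments args;
  // Cap the reconnect backoff; this is the value I set to 2000 with no effect.
  args.SetInt(GRPC_ARG_MAX_RECONNECT_BACKOFF_MS, 2000);
  // Only available in newer gRPC releases:
  // args.SetInt(GRPC_ARG_MIN_RECONNECT_BACKOFF_MS, 1000);
  return grpc::CreateCustomChannel("localhost:50051",
                                   grpc::InsecureChannelCredentials(),
                                   args);
}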

Is there a better way to handle this situation in general?

robert engels

Nov 20, 2018, 11:19:09 PM
to justin.c...@ansys.com, grpc.io
You should track the error after each update, and if it is non-nil, just return… why keep trying the further updates in that loop?

It is also trivial to not even attempt the next loop iteration if it has been less than N ms since the last error.

According to your pseudo code, you already have the ‘server available’ status.
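
Something like this would do it (a sketch in C++, since that's your platform; the cooldown value is whatever N you pick):

#include <chrono>

// Sketch of the cooldown: after a failed update, skip further attempts
// until N ms have elapsed.
class UpdateGate {
 public:
  explicit UpdateGate(std::chrono::milliseconds cooldown) : cooldown_(cooldown) {}

  bool ShouldAttempt() const {
    return std::chrono::steady_clock::now() - last_error_ >= cooldown_;
  }

  void RecordError() { last_error_ = std::chrono::steady_clock::now(); }

 private:
  std::chrono::milliseconds cooldown_;
  std::chrono::steady_clock::time_point last_error_{};  // epoch, so the first attempt is allowed
};

In the loop: skip the iteration when ShouldAttempt() is false, and call RecordError() whenever any of the update RPCs returns a non-OK status.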

justin.c...@ansys.com

Nov 21, 2018, 10:12:15 AM
to grpc.io
I do check the error code after each update and skip the rest of the current iteration's updates if a failure occurred.

I could skip all updates for 20 seconds after a failed update, but that seems less than ideal.

By "server available" I meant using GetState on the channel. The problem I was running into is that if I only call GetState on the channel to see if the server is around, it stays in TRANSIENT_FAILURE "forever" (at least for 60 seconds). I was expecting to see a state change back to IDLE/READY after a bit.

Robert Engels

Nov 21, 2018, 10:16:45 AM
to justin.c...@ansys.com, grpc.io
The other thing to keep in mind is that the way you are "forcing failure" is error prone - the connection is valid because packets are making it through; it is just that it will be very slow due to extreme packet loss. I am not sure this is considered a failure by gRPC. I think you would need to detect slow network connections and abort that server yourself.

justin.c...@ansys.com

Nov 21, 2018, 10:52:43 AM
to grpc.io
I'm not sure I follow you on that one. I am bringing the server up and down myself. Everything works fine if I just make RPC calls on the client and check the error codes. The problem was the ~20 seconds of blocking on subsequent RPC calls during the reconnect, which seems to be due to the backoff algorithm. I was hoping to shrink that wait, if possible, to something smaller. Setting GRPC_ARG_MAX_RECONNECT_BACKOFF_MS to 5000 still seemed to take the full 20 seconds when making an RPC call.

Using GetState on the channel looked like it would get rid of the blocking on a broken connection, but the channel's state doesn't seem to change from TRANSIENT_FAILURE once the server comes back up. I tried the KEEPALIVE_TIME, KEEPALIVE_TIMEOUT and KEEPALIVE_PERMIT_WITHOUT_CALLS channel args, but those didn't seem to trigger a state change on the channel.
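
The keepalive args I tried were set roughly like this (a sketch; the interval values here are placeholders, not my exact numbers):

grpc::ChannelArguments args;
args.SetInt(GRPC_ARG_KEEPALIVE_TIME_MS, 10000);           // send a ping every 10 s
args.SetInt(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 2000);         // wait 2 s for the ping ack
args.SetInt(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);  // ping even with no active RPC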

Seems like the only way to trigger a state change on the channel is to make an actual rpc call.

I think the answer might just be to update to a newer version of gRPC, look at using the MIN_RECONNECT_BACKOFF channel arg, and probably download the source and look at how those variables are used :).

Robert Engels

Nov 21, 2018, 11:17:24 AM
to justin.c...@ansys.com, grpc.io
I thought your original message said you were simulating the server going down using iptables and causing packet loss?

justin.c...@ansys.com

Nov 21, 2018, 11:20:32 AM
to grpc.io
That must have been a different person :). I'm actually taking down the server and restarting it, no simulation of it.

Robert Engels

Nov 21, 2018, 11:24:58 AM
to justin.c...@ansys.com, grpc.io
Sorry. Still, if you forcibly remove the cable or hard-shut the server, the client can't tell the server is down without some sort of ping-pong protocol with a timeout. The TCP/IP timeout is on the order of hours, or minutes if keepalive is set.
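
(One way to at least bound how long any single call can hang on the client side is a per-RPC deadline - a sketch, with an arbitrary 500 ms value:)

grpc::ClientContext ctx;
ctx.set_deadline(std::chrono::system_clock::now() + std::chrono::milliseconds(500));
// The call then returns DEADLINE_EXCEEDED instead of blocking indefinitely
// if the server doesn't answer in time.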

justin.c...@ansys.com

Nov 21, 2018, 12:09:59 PM
to grpc.io
I found a fix for my problem.

Looked at the latest source and there is a test argument, "grpc.testing.fixed_reconnect_backoff_ms", that sets both the min and max backoff times to the given value. Found the source for 1.3.9 on a machine and it is used there as well. Setting that argument to 2000 ms does what I wanted. 1000 ms seemed to be too low a value; the RPC calls continued to fail. 2000 ms seems to reconnect just fine. Once we update to a newer version of gRPC I can change that to set the min and max backoff times directly.
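
Concretely, the change was just this (a sketch; the target and credentials are placeholders):

grpc::ChannelArguments args;
// Test-only argument in the gRPC source (present in 1.3.9 and later):
// forces both the min and max reconnect backoff to the same value.
args.SetInt("grpc.testing.fixed_reconnect_backoff_ms", 2000);
auto channel = grpc::CreateCustomChannel("localhost:50051",
                                         grpc::InsecureChannelCredentials(),
                                         args);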

Robert Engels

Nov 21, 2018, 12:19:45 PM
to justin.c...@ansys.com, grpc.io
I'm glad you got it working. Something doesn't seem right, though, that going from 1 sec to 2 sec causes things to work... I would think that production anomalies could easily cause similar degradation, so I think this solution may be fragile... Still, I'm not sure I 100% get the problem you're having, but if you understand exactly why it works, that's good enough for me :)

Robert Engels

Nov 21, 2018, 12:26:07 PM
to justin.c...@ansys.com, grpc.io
To be clear, if a server is down, there are only two reasons: it is off/failed, or some networking condition does not allow traffic. In either case a client cannot reasonably determine this quickly; by specification this can take a long time, especially if a recent connection attempt was good.

What it can determine quickly is whether there is no process listening on the requested port on the remote machine, or whether there is no route to the requested machine.

If you have lots of outages like the former, you are going to have issues. 

justin.c...@ansys.com

Nov 21, 2018, 2:06:32 PM
to grpc.io
I agree that the difference between backoff values is strange. It seems that values of 1000 ms or less for the min backoff time cause the RPC calls not to reconnect to the server. I just did a quick test with 1100 ms and that worked the 10 times I shut the server down and restarted it. We will see how robust it is with some testing. At this point we are over a year out of date on our version of gRPC, so we should probably update to a newer version before doing much more delving. A lot of work has gone into gRPC over that time.

Thanks!