Fail over takes too long because of TCP retransmission

John Shahid

Nov 20, 2018, 10:33:44 AM

to grp...@googlegroups.com

Hi all,

We just ran into an interesting issue. We are using grpc-go for both
the client and server implementations. There are two instances of the
server deployed for HA. Clients use DNS name lookup and are usually
split evenly between the two servers.

One of the servers had a network issue and wasn't reachable (we were
able to simulate this situation by adding an iptables rule to drop
packets destined for one of the two servers). The DNS server immediately
detects that one of the servers isn't reachable and removes it from the
pool. What we observed is that clients connected to that instance
keep getting "context deadline exceeded" errors for about 15 minutes.
The tcpdump output shows multiple retransmission attempts. The client
eventually (after ~15 minutes) reconnects to the healthy instance.

Is there a way to speed up the failover without changing the number of
TCP retransmissions in `/proc/sys/net/ipv4/tcp_retries2'?

Thanks,

JS

John Shahid

Nov 26, 2018, 2:26:39 PM

to grp...@googlegroups.com

We ended up adding the following to `Dial':

    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time: 10 * time.Second,
    })

This required bumping grpc to a commit that included the fix in
https://github.com/grpc/grpc-go/pull/2307 which sets the
TCP_USER_TIMEOUT socket option on Linux. On a side note, this issue
doesn't affect Windows clients. It looks like the default number of
retransmissions on Windows is much lower than on GNU/Linux.
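
For anyone finding this later, a fuller sketch of that configuration
might look like the following. Only the `Time' value comes from our
setup. The target address, the `Timeout' value and the use of insecure
transport credentials are placeholders for illustration, and the
`insecure' package assumes a reasonably recent grpc-go release:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// Hypothetical target; substitute your own DNS name and port.
	const target = "dns:///my-service.example.com:50051"

	conn, err := grpc.Dial(
		target,
		// Assumption: plaintext transport, for illustration only.
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			// Ping the server after 10s without activity (our setting).
			Time: 10 * time.Second,
			// Assumed value: how long to wait for the ping ack before
			// the connection is considered dead and closed.
			Timeout: 5 * time.Second,
		}),
	)
	if err != nil {
		log.Fatalf("dial %s: %v", target, err)
	}
	defer conn.Close()
}
```

One thing to watch for: a server can reject clients that ping more often
than its keepalive enforcement policy allows, so an interval this short
may need a matching `keepalive.EnforcementPolicy' on the server side.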

Carl Mastrangelo

Nov 29, 2018, 4:19:08 PM

to grpc.io

The main way we solve this is by enabling client-side keep-alives, which IIRC run once every 45 seconds if there are no active RPCs. These are implemented in gRPC as HTTP/2 PING frames. I can't say I know what the status of this is for Go, but in Java this is an actively used feature.
