What is the keepalive feature for? What is the right way to use it?
I want connection drops to be detected quickly and reconnects to happen quickly, so I tried setting low keepalive thresholds:
// We expect this connection to be rock solid.
new ChannelOption("grpc.keepalive_time_ms", 1000),
new ChannelOption("grpc.keepalive_timeout_ms", 1000),
new ChannelOption("grpc.keepalive_permit_without_calls", 1),
That didn't work out well. I now see a lot of "Status(StatusCode=Internal, Detail="keepalive watchdog timeout")" errors (using the .NET client). What's worse, I see such errors for calls that actually succeeded as far as the server knows. This causes headaches with my non-idempotent APIs when the client retries (a keepalive timeout looks like a lost connection, so retrying seems logical, no?).
Clearly I am doing it wrong. So I ask the wider audience: what is the right way to configure gRPC to detect dropped connections and recover from them quickly? Both my client and server run on the same machine, so I expect them to respond to each other very fast.