Hi all,
New to the group and fairly new to gRPC in general, so I'm hoping to get some pointers on where to dig deeper.
Currently we have two services that interact via gRPC, with client and server stubs generated by the protoc compiler. Our server side runs Python (grpcio 1.37.0) and our client side runs Node (grpc-js 1.3.7). Both sides run on Kubernetes in the cloud. We chose gRPC for its typed contracts, polyglot server/client code generation, speed, and, most importantly, its HTTP/2 support for long-running, stateful, lightweight connections.
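For context, the client side is wired up roughly like this. This is just a sketch: the module path, service name, and address are placeholders, not our real generated code.

```ts
import * as grpc from '@grpc/grpc-js';
// Placeholder import: in reality this is whatever protoc generated for our service.
import { WorkerClient } from './gen/worker_grpc_pb';

// One long-lived client per process, pointed at the Contour/Envoy endpoint
// rather than at individual server pods.
const client = new WorkerClient(
  'worker.example.internal:443', // placeholder address
  grpc.credentials.createSsl()
);
```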
Our setup uses Contour/Envoy for load balancing.
The issue right now is that we have requests that:
1) sometimes never even make it to the server side (i.e. they hit their client-specified deadline and fail with DEADLINE_EXCEEDED; see the sketch after this list for how we set those deadlines)
2) take a variable amount of time to be acknowledged by the server side (this may not be an issue in itself, but it might be telling about what is happening)
- The spread is quite wide and varies per gRPC method:
- Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
- Method 2 (run) - p75 8ms, p95 13ms, p99 35ms, max 168ms
- Method 3 (abandon) - p75 7ms, p95 34ms, p99 69ms, max 285ms
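For reference on the client-specified deadlines in (1): each call gets its own deadline, much longer for `run` than for the other methods, since `run` can legitimately take minutes (more on that below). A rough sketch, reusing the placeholder names from the snippet above:

```ts
import * as grpc from '@grpc/grpc-js';
// Same placeholder generated types as in the snippet above.
import { WorkerClient } from './gen/worker_grpc_pb';
import { RunRequest } from './gen/worker_pb';

// Helper: absolute deadline N milliseconds from now.
function deadlineIn(ms: number): Date {
  return new Date(Date.now() + ms);
}

function callRun(client: WorkerClient): void {
  // `run` can take minutes server-side, so it gets a generous deadline;
  // the other methods get a few seconds at most.
  client.run(new RunRequest(), { deadline: deadlineIn(6 * 60 * 1000) }, (err, _res) => {
    if (err && err.code === grpc.status.DEADLINE_EXCEEDED) {
      // The failure mode from (1): the deadline fires client-side and we never
      // find a matching Envoy access log for the request.
      console.error('run deadline exceeded:', err.details);
      return;
    }
    // ...handle other errors / the response
  });
}
```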
When we see requests that never make it to the server side, they simply hit their client-specified timeout window without ever being acknowledged, and for these unsuccessful requests we never see a corresponding Envoy access log. Sometimes a connection only hits a one-off network error like this that goes away on its own; however, we had a recent incident where one of our client pods (in a Kubernetes cluster) could not reach the server at all and thousands of requests failed consecutively over ~24 hours. The pod quickly ran out of memory and we had to restart it.
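On the client side, one thing we could add ourselves is a small wrapper that records each call's status and timing, so failures are visible even when no Envoy log exists. A minimal sketch of what that might look like (placeholder names again; untested):

```ts
import * as grpc from '@grpc/grpc-js';

// Wrap a unary call so its method name, final gRPC status, and elapsed time
// are logged regardless of what happens downstream.
function timedUnaryCall<Req, Res>(
  name: string,
  call: (
    req: Req,
    options: grpc.CallOptions,
    cb: (err: grpc.ServiceError | null, res?: Res) => void
  ) => void,
  req: Req,
  options: grpc.CallOptions
): Promise<Res> {
  const start = Date.now();
  return new Promise((resolve, reject) => {
    call(req, options, (err, res) => {
      const code = err ? grpc.status[err.code] : 'OK';
      console.log(`${name} -> ${code} after ${Date.now() - start}ms`);
      if (err) {
        reject(err);
      } else {
        resolve(res as Res);
      }
    });
  });
}

// Usage with the placeholder client from the earlier snippets:
// await timedUnaryCall('run', client.run.bind(client), new RunRequest(),
//                      { deadline: new Date(Date.now() + 6 * 60 * 1000) });
```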
It is worth noting that some of our calls (e.g. to the `run` method) take up to 5 minutes to process on the server side; every other method should complete in under ~1s.
Any ideas on where this could be happening, or suggestions for extra systems we could add telemetry to, would be helpful!
Cheers,
Aaron