gRPC Kubernetes Timeouts


Aaron Pelz

Aug 24, 2021, 6:28:31 PM
to grpc.io
Hi all, 

New to the group and a little newer to gRPC in general, so hoping to get some pointers on where to dig deeper.

Currently we have two services that interact via gRPC. Their client and server libraries are generated by the protoc compiler. Our server side runs Python (grpcio 1.37.0) and our client side runs Node (grpc-js 1.3.7). Both sides are deployed on Kubernetes in the cloud. We chose gRPC for its typed contracts, polyglot server/client code generation, speed, and, most importantly, its HTTP/2 support for long-running, stateful, lightweight connections.
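
To give a rough idea of the shape of things, the Python server side looks roughly like the sketch below (placeholders only: the service, module, and method names are not our actual generated code).

from concurrent import futures

import grpc

# Placeholder generated modules; ours come from protoc and have different names.
import jobs_pb2
import jobs_pb2_grpc


class JobService(jobs_pb2_grpc.JobServiceServicer):
    def Run(self, request, context):
        # Long-running work; calls to this method can take several minutes.
        return jobs_pb2.RunReply()


def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    jobs_pb2_grpc.add_JobServiceServicer_to_server(JobService(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()


if __name__ == "__main__":
    serve()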

Our setup uses Contour/Envoy load balancing.

The issue right now is that we have requests that
  1) sometimes never even make it to the server side (i.e. they hit their client-specified timeout with DEADLINE_EXCEEDED)
  2) take a variable amount of time to get ack'd by the server side (this may not necessarily be an issue, but it may be telling as to what is happening)
    - This range is quite jarring and varies per gRPC method:
      - Method 1 (initialize) - p75 43ms, p95 208ms, p99 610ms, max 750ms
      - Method 2 (run)        - p75 8ms,  p95 13ms,  p99 35ms,  max 168ms
      - Method 3 (abandon)    - p75 7ms,  p95 34ms,  p99 69ms,  max 285ms

When we see requests that never make it to the server side, they simply hit their client-specified timeout window without ever being acknowledged, and for these unsuccessful requests we never see a corresponding Envoy log. Sometimes this is just a one-off network error that goes away on its own; however, we recently had an incident where one of our client pods (in a Kubernetes cluster) could not reach the server at all and thousands of requests failed consecutively over ~24 hours. Memory was quickly exhausted and we had to restart the pod.

It is worth noting that some of our processes (e.g. calls to the `run` method) take up to 5 minutes to process on the server side. Every other method should take less than ~1s to complete.
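
For illustration, the client-specified timeouts look roughly like the following (sketched in Python for brevity, since our real client is Node/grpc-js; the stub, target, and deadline values are placeholders):

import grpc

# Same placeholder generated modules as above.
import jobs_pb2
import jobs_pb2_grpc

# Placeholder target; in reality the client goes through Contour/Envoy.
channel = grpc.insecure_channel("envoy.internal.example:50051")
stub = jobs_pb2_grpc.JobServiceStub(channel)

try:
    stub.Initialize(jobs_pb2.InitRequest(), timeout=5)  # short deadline, should finish in <1s
    stub.Run(jobs_pb2.RunRequest(), timeout=330)        # `run` may take up to ~5 minutes
except grpc.RpcError as e:
    if e.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
        # The RPC never completed (or was never even acknowledged) within its deadline.
        pass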

Any ideas on where this could be happening, or suggestions for extra places we could add telemetry, would be helpful!

Cheers,
Aaron

Lidi Zheng

Aug 25, 2021, 2:15:29 PM
to grpc.io
Hi,


I would recommend "GRPC_VERBOSITY=debug GRPC_TRACE=api" as a starting point.
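
For example, these have to be in the environment before the gRPC runtime initializes. On the Python side that could mean setting them in the container spec, or (as a quick sketch) before importing grpc:

import os

# Must be set before the gRPC core library is initialized.
os.environ["GRPC_VERBOSITY"] = "debug"
os.environ["GRPC_TRACE"] = "api"

import grpc  # imported only after the tracing environment variables are set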

The RPC latency could be the result of many factors, e.g. the pod's resource limits or the network environment. With the tracing flags enabled, we might be able to see why a particular RPC failed, for example whether it is failing name resolution or failing to connect to the endpoints.

If nothing wrong is observed in the trace log, we could use "netstat" to check for packet loss, or "tcpdump" to see where the traffic is actually going.

Aaron Pelz

Sep 3, 2021, 9:50:08 AM
to grpc.io
Thanks for the tip!  We didn't find anything particularly insightful in the logs, but we did find an issue with a load balancer timeout. We've tweaked it, and the issues have decreased in frequency.

Fotos Georgiadis

Sep 6, 2021, 11:39:49 AM
to Aaron Pelz, grpc.io
Hi Aaron,

We might be facing a similar situation here. Can you shed some light on your resolution?

What was the issue with the timeout and what value did you change it to?

Best,

--
Fotos Georgiadis
Engineering Manager, Service Foundations at Scribd
 Weesperstraat 61 | 1018VN Amsterdam, Netherlands
 +31 (0) 650 73 62 75