Hi,
I'm encountering an intermittent issue with a gRPC-based client-server application running in AWS ECS. Here's the setup:
* A simple gRPC server runs as an ECS task behind an Elastic Load Balancer (ELB).
* The gRPC client is also containerized and runs as an ECS task in the same cluster.
Under normal conditions, RPC calls complete quickly, typically within 2–3 ms and at most around 4 ms. However, the client occasionally reports DEADLINE_EXCEEDED errors.
After enabling trace logging in the gRPC library, I noticed that these errors consistently occur when the ELB's IP address changes. The gRPC client appears to keep trying to connect to the outdated IP address until the call's deadline expires.
Currently, the client uses the default pick_first load balancing policy. From the gRPC load balancing documentation (https://grpc.github.io/grpc/cpp/md_doc_load-balancing.html), it seems that switching to round_robin might better handle scenarios where the server IP changes after the client has already established a connection.
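For concreteness, this is roughly the change I have in mind. It is only a sketch, written in Python with grpcio purely for illustration. The target host name and port are placeholders, and the helper names are made up for this example. It selects round_robin through the grpc.service_config channel argument:

```python
import json

# Placeholder target (an assumption, not the real ELB name). The "dns:///"
# prefix asks the client to resolve the name with its DNS resolver.
TARGET = "dns:///my-elb.example.internal:50051"


def round_robin_channel_options():
    """Build channel options that request round_robin via the service config."""
    service_config = {"loadBalancingConfig": [{"round_robin": {}}]}
    return [("grpc.service_config", json.dumps(service_config))]


def make_channel(target=TARGET):
    """Create an insecure channel using the round_robin options above."""
    # Imported lazily so the options helper can be inspected without grpcio.
    import grpc

    return grpc.insecure_channel(target, options=round_robin_channel_options())
```

I have not confirmed that this alone makes the client pick up the new ELB address, which is what the questions below are about.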
I have a few questions about this issue:
1. Would switching to round_robin mitigate this issue by prompting the client to re-resolve the ELB's name and cycle through the updated IPs?
2. Are there performance or stability trade-offs when using round_robin instead of pick_first?
3. If round_robin is generally more resilient in these dynamic environments, why isn't it the default policy?
Thanks for your time.
-mandeep