Hello,
Setup:
We have gRPC pods running in a Kubernetes cluster, with linkerd as the service mesh. Our gRPC microservices are written in Python (using asyncio for concurrency), with the exception of the entry point, which is written in Go (using the Gin framework). An AWS API Gateway talks to an NLB in front of the Go service, and the Go service communicates with the backends via NodePort services.
Problem:
Requests to our gRPC microservices can take a while to complete: the average is around 8s, and the 99th percentile is up to 25s. To handle the client load we've scaled horizontally and run many pods to serve concurrent requests. However, when we send multiple requests to the system, even sequentially, we sometimes see a new request land on a pod that is already serving an ongoing request. That new request then effectively gets "queued" on the server side (not fully queued; it makes some progress whenever the event loop switches to it). The problem with this queueing is that:
1. The earlier request can start getting starved and eventually time out (we have a hard 30s cap).
2. The newer request may also not get handled in time, and as a result gets starved too.
The symptom we're noticing is 504s, which is what we'd expect given the hard 30s cap.
What's strange is that other pods are available, but for some reason the load balancer isn't routing requests to them intelligently. It's possible that linkerd's routing doesn't work well for our situation (we need to look into this further, but that would require a big overhaul of our system).
One thing I wanted to try is stopping this queueing of requests altogether. I want the service to immediately reject a request if one is already in progress, and have the client retry; the retry will hopefully hit a different pod (this is something I'm trying to test as part of this). To do this, I set maximum_concurrent_rpcs to 1 on the server, roughly as sketched below. However, when I sent multiple requests to the system in parallel, I didn't see any RESOURCE_EXHAUSTED errors.
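For reference, this is roughly how we create the server (a simplified sketch; the servicer class and generated-module names are placeholders for our real ones):

import asyncio
import grpc

# Placeholders for our actual generated code and servicer implementation.
from my_service_pb2_grpc import add_MyServiceServicer_to_server
from my_servicer import MyServicer

async def serve() -> None:
    # Only allow one in-flight RPC per pod; any additional RPC should be
    # rejected with RESOURCE_EXHAUSTED instead of being queued.
    server = grpc.aio.server(maximum_concurrent_rpcs=1)
    add_MyServiceServicer_to_server(MyServicer(), server)
    server.add_insecure_port("[::]:50051")
    await server.start()
    await server.wait_for_termination()

if __name__ == "__main__":
    asyncio.run(serve())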
I then read online that requests may also get queued up on the client side by default in HTTP/2, so I tried:
import grpc

# Attempt to limit this channel to a single concurrent HTTP/2 stream.
channel = grpc.insecure_channel(
    "<some address>",
    options=[("grpc.max_concurrent_streams", 1)],
)
However, I'm not seeing any change here either. (Note: even after I get this Python client working, I'll eventually need to do the same from the Go entry point, so any help on that front would be appreciated as well.)
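For completeness, this is roughly how I'm firing the parallel test requests (again a simplified sketch; the stub, method, and message names are placeholders):

import concurrent.futures
import grpc

# Placeholders for our actual generated modules.
import my_service_pb2
import my_service_pb2_grpc

channel = grpc.insecure_channel(
    "<some address>",
    options=[("grpc.max_concurrent_streams", 1)],
)
stub = my_service_pb2_grpc.MyServiceStub(channel)

def call_once(i: int) -> str:
    try:
        # Long-running unary call; 30s matches our hard cap.
        stub.MyMethod(my_service_pb2.MyRequest(), timeout=30)
        return f"request {i}: OK"
    except grpc.RpcError as e:
        # I'd expect StatusCode.RESOURCE_EXHAUSTED for rejected calls,
        # but I never see it.
        return f"request {i}: {e.code()}"

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    for result in pool.map(call_once, range(5)):
        print(result)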
Questions:
1. How can I get the desired effect here, i.e. have a busy pod immediately reject a second concurrent request?
2. Is there some way to ensure that at least the earlier requests don't get starved by the new requests?
3. Any other advice on how to fix this issue? I'm grasping at straws here.
Thank you!