When running our Java application in MS Azure, we sometimes observe very strange behavior: a long-lived gRPC channel appears to work in only one direction and stops delivering any RPC calls in the opposite direction.
Our setup is that we have three apps connected via gRPC:
A -> B -> C
B usually has a long-lived server-streaming gRPC request to C open, consuming updates from C. When the issue occurs, updates from C keep streaming to B, but no new unary requests made by B reach C.
The unary requests made by B originate from A: B receives a request from A and sends a unary request to C with a deadline copied from the original request. After 20 seconds, B sees an "RPC cancelled" event, which I believe comes from A in response to some kind of timeout.
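For context, the long-lived stream from B to C looks roughly like this (a minimal sketch; UpdatesServiceGrpc, SubscribeRequest and Update are placeholders for our generated classes, not the real names):

    import io.grpc.ManagedChannel;
    import io.grpc.stub.StreamObserver;

    // Sketch of B's long-lived server-streaming call to C.
    // UpdatesServiceGrpc, SubscribeRequest and Update stand in for our generated classes.
    static void subscribeToUpdates(ManagedChannel channelToC) {
        UpdatesServiceGrpc.UpdatesServiceStub asyncStub = UpdatesServiceGrpc.newStub(channelToC);

        asyncStub.streamUpdates(SubscribeRequest.getDefaultInstance(), new StreamObserver<Update>() {
            @Override public void onNext(Update update) {
                // updates from C keep arriving here even while the problem is happening
            }
            @Override public void onError(Throwable t) {
                // we do not see an error here when the unary calls towards C start failing
            }
            @Override public void onCompleted() { }
        });
    }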
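The forwarding on B looks roughly like this (a sketch; MyServiceGrpc, QueryRequest and QueryResponse are placeholders for our generated classes, while Context.current().getDeadline() and withDeadline() are the standard grpc-java calls for copying the incoming deadline, which is essentially what we do):

    import io.grpc.Context;
    import io.grpc.Deadline;
    import io.grpc.ManagedChannel;

    // Sketch of how B forwards A's request to C, copying the incoming deadline.
    // MyServiceGrpc, QueryRequest and QueryResponse are placeholders for our generated classes.
    static QueryResponse forwardToC(ManagedChannel cachedChannelToC, QueryRequest request) {
        // Deadline of the RPC coming in from A, taken from the gRPC Context of the server call.
        Deadline incomingDeadline = Context.current().getDeadline();

        MyServiceGrpc.MyServiceBlockingStub stub = MyServiceGrpc.newBlockingStub(cachedChannelToC);
        if (incomingDeadline != null) {
            stub = stub.withDeadline(incomingDeadline);
        }

        // This is the unary call that never reaches C when the issue occurs;
        // about 20 seconds later we only see the "RPC cancelled" event on B.
        return stub.query(request);
    }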
The problem occurs randomly, and once it does, the channel never recovers.
Debug logging seems to show that when B receives the request from A, it creates a new stub on an existing (cached) channel and attempts to send the request to C, but the request never actually goes out.
If I make B forget the cached channel and create a new one, the unary request to C works fine.
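Our channel caching on B is essentially the following (a simplified sketch, not our real code; the reset() method is the manual workaround that makes the unary call succeed again):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    // Simplified version of B's channel cache (names are illustrative).
    class ChannelCache {
        private final Map<String, ManagedChannel> channels = new ConcurrentHashMap<>();

        ManagedChannel get(String host, int port) {
            return channels.computeIfAbsent(host + ":" + port,
                    key -> ManagedChannelBuilder.forAddress(host, port)
                            .useTransportSecurity()
                            // keepalive options elided here, shown in the next snippet
                            .build());
        }

        // The workaround: forget the cached channel and shut it down, so the next
        // request builds a fresh channel; after that the unary call to C succeeds.
        void reset(String host, int port) {
            ManagedChannel old = channels.remove(host + ":" + port);
            if (old != null) {
                old.shutdown();
            }
        }
    }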
We have keepAlive enabled on these channels, so I am surprised that the keepAlive mechanism does not detect a potential issue with the underlying connection. Is it possible that, because traffic is steadily flowing from C to B, B never pings C to check whether communication in the opposite direction still works?
I suppose we could work around this by adding application-level health checking for every channel, but I thought gRPC already takes care of this.
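By keepAlive I mean client-side keepalive configured on the ManagedChannelBuilder, roughly like this (the values here are illustrative, not necessarily exactly what we deploy):

    import java.util.concurrent.TimeUnit;

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;

    // How the channels in the cache above are configured (values illustrative).
    static ManagedChannel buildChannelToC(String host, int port) {
        return ManagedChannelBuilder.forAddress(host, port)
                .useTransportSecurity()
                .keepAliveTime(30, TimeUnit.SECONDS)     // interval for client-side HTTP/2 keepalive pings
                .keepAliveTimeout(10, TimeUnit.SECONDS)  // close the connection if the PING ACK does not arrive in time
                .build();
    }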
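The workaround I have in mind would be to periodically probe each cached channel and reset it on failure, for example via the standard gRPC health service from grpc-services (a sketch, assuming C exposes grpc.health.v1.Health, which it currently does not):

    import java.util.concurrent.TimeUnit;

    import io.grpc.ManagedChannel;
    import io.grpc.StatusRuntimeException;
    import io.grpc.health.v1.HealthCheckRequest;
    import io.grpc.health.v1.HealthCheckResponse;
    import io.grpc.health.v1.HealthGrpc;

    // Sketch of an application-level probe: ask C's health service over the cached channel
    // to verify that a request in the B -> C direction still gets through.
    static boolean channelStillUsable(ManagedChannel cachedChannelToC) {
        try {
            HealthCheckResponse response = HealthGrpc.newBlockingStub(cachedChannelToC)
                    .withDeadlineAfter(5, TimeUnit.SECONDS)
                    .check(HealthCheckRequest.newBuilder().setService("").build()); // "" = overall server health
            return response.getStatus() == HealthCheckResponse.ServingStatus.SERVING;
        } catch (StatusRuntimeException e) {
            // DEADLINE_EXCEEDED here would match what we see: the request never reaches C.
            return false;
        }
    }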
Any suggestions would be appreciated.