
gRPC channel broken in single direction?


Jiří Krutil

Nov 26, 2024, 10:22:16 AM
to grpc.io
When running our Java application in MS Azure, we sometimes observe very strange behavior in which a long-lived gRPC channel appears to work only in one direction and does not deliver any RPC calls in the opposite direction.

Our setup is that we have three apps connected via gRPC:

A -> B -> C

B usually has long-lived server-streaming gRPC requests open to C, consuming updates from C. When the issue occurs, updates from C are still streaming to B, but no new unary requests made by B make it to C.

The unary requests made by B originate from A: B receives a request from A and sends a unary request to C with a deadline copied from the original request. After 20 seconds, B sees an "RPC cancelled" event, which I believe comes from A in response to some kind of timeout.
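
For illustration, the forwarding step on B looks roughly like this (a simplified sketch, not our exact code; SessionServiceGrpc stands for the stub generated from our proto, and cachedChannel / request / responseObserver are placeholders):

// Simplified sketch of how B forwards the unary call to C (io.grpc.Context / io.grpc.Deadline).
Deadline deadline = Context.current().getDeadline();       // deadline propagated from A's request
SessionServiceGrpc.SessionServiceStub stub =
        SessionServiceGrpc.newStub(cachedChannel)           // reuse the cached long-lived channel to C
                .withDeadline(deadline);                     // copy the remaining deadline onto the outgoing call
stub.stopSession(request, responseObserver);                 // this is the call that never reaches C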

The problem occurs randomly, and once it does, the channel never recovers.

Debug logging seems to show that when B receives the request from A, it creates a new stub using an existing (cached) channel and attempts to send a request to C, but that request is actually never sent.

If I make B forget the cached channel and create a new one, the unary request to C works fine.

We have keepAlive enabled on these channels, so I am surprised that potential issues with the underlying connection are not detected by the keepAlive mechanism. Is it possible that, since traffic is steadily flowing from C to B, B never pings C to check whether communication in the opposite direction works as well?
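
For reference, the channels from B to C are created roughly like this (a simplified sketch; host and port are placeholders, and the keepalive values are what we configure):

// Simplified sketch of how we build the long-lived B -> C channel.
ManagedChannel channel = NettyChannelBuilder.forAddress(host, port)
        .usePlaintext()                           // plain HTTP/2 (the logs show :scheme: http)
        .keepAliveTime(20, TimeUnit.SECONDS)      // send a keepalive ping after 20 s without read activity
        .keepAliveTimeout(2, TimeUnit.SECONDS)    // close the connection if the ping is not acked within 2 s
        .build();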

I suppose we could work around this by adding application-level health checking for every channel, but I thought this was already taken care of by gRPC.
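
If we do add our own health checking, I imagine something like a periodic short-deadline probe per cached channel, for example against the standard grpc.health.v1.Health service (just a sketch, and it assumes C actually registers that service, e.g. via HealthStatusManager from grpc-services):

// Sketch of an application-level per-channel probe using the standard health service.
HealthGrpc.HealthBlockingStub health = HealthGrpc.newBlockingStub(cachedChannel);
try {
    HealthCheckResponse response = health
            .withDeadlineAfter(2, TimeUnit.SECONDS)           // fail fast if the channel is stuck
            .check(HealthCheckRequest.getDefaultInstance());   // empty service name = overall server health
    boolean healthy = response.getStatus() == HealthCheckResponse.ServingStatus.SERVING;
} catch (StatusRuntimeException e) {
    // DEADLINE_EXCEEDED here would suggest the channel is broken in the B -> C direction;
    // at that point we would drop the cached channel and create a fresh one.
}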

Any suggestions would be appreciated.

Yuri Golobokov

Nov 26, 2024, 2:53:01 PM
to grpc.io
How many server streaming calls are open to C at the time of the issue?
Does C limit the maximum number of concurrent streams? (You can see it in the debug logs, in the SETTINGS frame coming from C.)

Jiří Krutil

Nov 27, 2024, 7:44:05 AM
to grpc.io
I cannot confirm the exact number of streaming calls open on C during the last occurrence, but my assumption is that it was just a few.

I have now processed one of the unary calls with debug logging enabled on C, and it shows:

2024-11-27 13:17:37.088 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND SETTINGS: ack=false settings={MAX_CONCURRENT_STREAMS=2147483647, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2024-11-27 13:17:37.089 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2024-11-27 13:17:37.090 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND SETTINGS: ack=false settings={ENABLE_PUSH=0, MAX_CONCURRENT_STREAMS=0, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2024-11-27 13:17:37.090 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND SETTINGS: ack=true
2024-11-27 13:17:37.090 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2024-11-27 13:17:37.091 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND SETTINGS: ack=true
2024-11-27 13:17:37.092 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND HEADERS: streamId=3 headers=GrpcHttp2RequestHeaders[:path: /ra.hub.grpc.api.v1.SessionService/stopSession, :authority: 10.40.80.78, :method: POST, :scheme: http, te: trailers, content-type: application/grpc, user-agent: grpc-java-netty/1.66.0, grpc-accept-encoding: gzip, grpc-timeout: 59969004u] streamDependency=0 weight=16 exclusive=false padding=0 endStream=false
2024-11-27 13:17:37.110 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND DATA: streamId=3 padding=0 endStream=true length=63 bytes=000000003a0a2435373239663665342d396633662d343738382d383132372d32343633636338663339666412120a0a6875622d6575732d3161120455532d31
2024-11-27 13:17:37.110 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND PING: ack=false bytes=1234
2024-11-27 13:17:37.111 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND PING: ack=true bytes=1234
2024-11-27 13:17:37.112 [INFO] (grpc-default-executor-3) org.dett.router.grpc.RouterSessionGrpcService: - Processing stopSession request {uuid: "5729f6e4-9f3f-4788-8127-2463cc8f39fd" hubId { name: "hub-eus-1a" cluster: "US-1" }}
2024-11-27 13:17:37.115 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND HEADERS: streamId=3 headers=GrpcHttp2OutboundHeaders[:status: 200, content-type: application/grpc, grpc-encoding: identity, grpc-accept-encoding: gzip] padding=0 endStream=false
2024-11-27 13:17:37.115 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND DATA: streamId=3 padding=0 endStream=false length=5 bytes=0000000000
2024-11-27 13:17:37.116 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND HEADERS: streamId=3 headers=GrpcHttp2OutboundHeaders[grpc-status: 0] padding=0 endStream=true
2024-11-27 13:17:37.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND PING: ack=false bytes=1234
2024-11-27 13:17:37.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND PING: ack=true bytes=1234
2024-11-27 13:17:57.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] INBOUND PING: ack=false bytes=1111
2024-11-27 13:17:57.118 [DEBUG] (grpc-default-worker-ELG-3-2) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0xab7eb2ae, L:/10.40.80.78:10010 - R:/10.40.80.77:49754] OUTBOUND PING: ack=true bytes=1111

I think this indicates that MAX_CONCURRENT_STREAMS for C as a server is effectively unlimited.

Maybe I have oversimplified in the original post. We actually have 2 instances of A, 2 instances of B and N instances of C.
One observation was that at one point we had a stale channel B1 -> C2 and a stale channel B2 -> C4, while all other channels were working.
In other words, the problem does not affect all channels from a given B instance, or all channels to a given C instance. It seems to be channel-specific.

Just to check one related idea I had -- is it possible that enabling keepAliveWithoutCalls on B could make this issue go away? The reasoning is that if the gRPC client on B sees incoming traffic from C, it may never probe the channel to C, and thus won't detect the issue without this option.
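
In builder terms, the change I have in mind is just this (sketch; builder being the channel builder shown earlier):

builder.keepAliveWithoutCalls(true);   // also send keepalive pings while no calls are outstanding
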
On Tuesday, November 26, 2024 at 8:53:01 PM UTC+1, Yuri Golobokov wrote:

Jiří Krutil

Dec 3, 2024, 4:38:11 AM
to grpc.io
A quick update: it turns out that enabling keepAliveWithoutCalls for component B does not resolve the issue.

I would like to understand what kind of guarantees gRPC actually provides here. Is RPC delivery on a channel best-effort?
If that is the case, it sounds like the only way to resolve this is to add application-level health checking?

Cheers
Jiri

On Wednesday, November 27, 2024 at 1:44:05 PM UTC+1, Jiří Krutil wrote:

Yuri Golobokov

Dec 3, 2024, 2:03:21 PM
to Jiří Krutil, grpc.io
Is /ra.hub.grpc.api.v1.SessionService/stopSession the unary call in question?
From the log it looks completely normal and successfully processed.
It would be nice to have debug logs from C _and_ B covering the time from when the unary call is received on B until it is canceled.

Re keepAliveWithoutCalls -- the streaming calls you have between B and C count as in-flight calls, so there are always active calls on the channel; that is why changing keepAliveWithoutCalls does not make a difference in your case.

What is your client LB policy?


Jiří Krutil

Dec 16, 2024, 9:04:12 AM
to grpc.io
Yes, stopSession is the unary request that is failing.

I can't really post the full logs of B and C here, as they contain some sensitive data.

The occurrence I am looking at has the following in B's log at the time the unary request is made -- this is the request coming in from A:

2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND SETTINGS: ack=false settings={MAX_CONCURRENT_STREAMS=1000, INITIAL_WINDOW_SIZE=1048576, MAX_HEADER_LIST_SIZE=8192}
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND WINDOW_UPDATE: streamId=0 windowSizeIncrement=983041
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND SETTINGS: ack=false settings=[id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842]
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND SETTINGS: ack=true
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND HEADERS: streamId=1 headers=GrpcHttp2RequestHeaders[:path: /ra.hub.grpc.api.v1.SessionService/stopSession, :authority: 10.178.148.210:10004, :method: POST, :scheme: http, te: trailers, grpc-accept-encoding: identity,deflate,gzip, accept-encoding: identity, grpc-timeout: 59996m, user-agent: grpc-node-js/1.9.12, content-type: application/grpc] padding=0 endStream=false
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND DATA: streamId=1 padding=0 endStream=false length=63 bytes=000000003a0a2434623235373935342d626161632d336538332d396132612d65396536323763343362393612120a0a6875622d6575732d3261120455532d32
2024-11-19 07:55:05.090 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] OUTBOUND PING: ack=false bytes=1234
2024-11-19 07:55:05.091 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND DATA: streamId=1 padding=0 endStream=true length=0 bytes=
2024-11-19 07:55:05.091 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND SETTINGS: ack=true
2024-11-19 07:55:05.091 [DEBUG] (grpc-default-executor-284) org.dett.common.grpc.communication.MessageLoggingServerInterceptor: - {"method":"ra.hub.grpc.api.v1.SessionService/stopSession","logId":"ad24cfeb-c99b-43f1-9891-634c08e1a7af","message-direction":"in","message":"{\n  \"uuid\": \"4b257954-baac-3e83-9a2a-e9e627c43b96\",\n  \"hubId\": {\n    \"name\": \"C2a\",\n    \"cluster\": \"US-2\"\n  }\n}","address":"/10.178.148.95:45842"}
2024-11-19 07:55:05.091 [INFO] (grpc-default-executor-284) org.dett.business.server.hub.grpc.BusinessServerSessionGrpcService: - Processing stopSession request #31, uuid: "4b257954-baac-3e83-9a2a-e9e627c43b96" hubId { name: "C2a" cluster: "US-2" }
2024-11-19 07:55:05.091 [INFO] (grpc-default-executor-284) org.dett.business.server.communication.SystemChannelProvider: - Provided channel: C2a/US-2, keep alive time: 20000, keep alive timeout: 2000
2024-11-19 07:55:05.091 [INFO] (grpc-default-executor-284) org.dett.business.server.hub.grpc.BusinessServerSessionGrpcService: - Stream stub created for C2a/US-2; channel state: READY
2024-11-19 07:55:05.091 [DEBUG] (grpc-nio-worker-ELG-1-5) io.grpc.netty.shaded.io.grpc.netty.NettyServerHandler: - [id: 0x755bd2e9, L:/10.178.148.210:10004 - R:/10.178.148.95:45842] INBOUND PING: ack=true bytes=1234

Then follow a number of INBOUND DATA messages from C's IP address on two different stream IDs, as the streaming call updates come in, but no outgoing messages to C are logged.

Later B says:

2024-11-19 07:55:25.085 [INFO] (grpc-default-executor-284) org.dett.business.server.hub.grpc.BusinessServerSessionGrpcService: - onError() invoked for stopSession request #31: io.grpc.StatusRuntimeException: CANCELLED: RPC cancelled

On the C side, I can see a lot of outgoing messages being sent to B on the two streams mentioned above. All of the OUTBOUND DATA messages have the same connection id and carry only those two stream IDs; nothing else.
I guess I could post that traffic here, but it's rather trivial.
This traffic from C to B continues throughout the whole 20 seconds that the cancelled unary RPC is pending.

Is there anything specific we are looking for in the debug logs?

Cheers
Jiri

On Tuesday, December 3, 2024 at 8:03:21 PM UTC+1, Yuri Golobokov wrote: