We are using gRPC (version 1.37.1) for
inter-process communication between our C# process and our C++ process. Both
processes act as both server and client to each other and run on the same machine
over localhost using the HTTP/2 transport. All of the calls are blocking
synchronous unary calls, not bi-directional streaming. Some average(ish)
stats:
From C++->C#: 0-2 calls per second, 0-40 calls per minute
From C#->C++: 0-5 calls per second, 0-200 calls per minute
Intermittently, we were getting one of 3 issues
The topmost one is the most frequent and the one we have the most information about. Usually, what we’ll see is that the client receives the RPC call and runs into an unknown frame type. The subchannel then goes into shutdown and everything usually reconnects fine. We also generally see an embedded error like the following (note that we replaced all __FILE__ instances with __FUNCTION__ in our gRPC source):
win_read","file_line":307,"os_error":"The system detected an invalid pointer address in attempting to use a pointer argument in a call.\r\n","syscall":"WSARecv","wsa_error":10014}]},{"created":"@1622120588.494000000","description":"frame of size 262404 overflows local window of 65535","file":"grpc_core::chttp2::TransportFlowControl::ValidateRecvData","file_line":213}]}
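To make the last part of that error concrete, here is a minimal sketch of the kind of receive-window check that produces a "frame of size ... overflows local window" message. It is not gRPC's actual TransportFlowControl code; the RecvWindow type and its Accept method are hypothetical names used only for illustration, and the only values taken from the log are 262404 and 65535 (the latter is also the HTTP/2 default initial window size).

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for a receiver-side flow-control window.
// This is NOT gRPC's real class; it only illustrates the check.
struct RecvWindow {
  int64_t available = 65535;  // HTTP/2 default initial window size

  // Returns true and consumes window if the frame fits.
  // Returns false and leaves the window unchanged if it would overflow.
  bool Accept(uint32_t frame_size) {
    if (static_cast<int64_t>(frame_size) > available) {
      return false;  // corresponds to "frame of size N overflows local window"
    }
    available -= frame_size;
    return true;
  }
};
```

With these numbers, a frame announcing 262404 bytes is rejected against a 65535-byte window, so the size that was read is far larger than anything the receiver had advertised room for.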
What we’ve seen with the unknown frame type is that the parser handles HEADERS, WINDOW_UPDATE, DATA, WINDOW_UPDATE, then gets a TCP: on_read without a corresponding READ and tries to parse again. It is in this second parse that the parser appears to be at the wrong offset in the buffer, because the unknown frame type, incoming frame size, and incoming stream_id all map to the middle of the RPC call that it just parsed.
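For anyone following along, the sketch below shows why a parser that starts at the wrong offset would report exactly these symptoms. It decodes the 9-byte HTTP/2 frame header (24-bit length, 8-bit type, 8-bit flags, 31-bit stream id) as defined in RFC 7540. The function and type names (FrameHeader, ParseFrameHeader, IsKnownFrameType, MakeDataFrame) are made up for this example and are not gRPC APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fields of the fixed 9-byte HTTP/2 frame header (RFC 7540, section 4.1).
struct FrameHeader {
  uint32_t length;     // 24-bit payload length
  uint8_t type;        // frame type (0x0 DATA ... 0x9 CONTINUATION)
  uint8_t flags;
  uint32_t stream_id;  // 31 bits; the top bit is reserved
};

// Decodes a frame header starting at `offset`. Nothing here checks that
// `offset` is really a frame boundary; that is the caller's job.
FrameHeader ParseFrameHeader(const std::vector<uint8_t>& buf, std::size_t offset) {
  assert(buf.size() >= offset + 9);
  const uint8_t* p = buf.data() + offset;
  FrameHeader h;
  h.length = (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | uint32_t(p[2]);
  h.type = p[3];
  h.flags = p[4];
  h.stream_id = ((uint32_t(p[5]) & 0x7f) << 24) | (uint32_t(p[6]) << 16) |
                (uint32_t(p[7]) << 8) | uint32_t(p[8]);
  return h;
}

// RFC 7540 defines frame types 0x0 through 0x9.
bool IsKnownFrameType(uint8_t type) { return type <= 0x9; }

// Builds a DATA frame (type 0x0, no flags) for illustration.
std::vector<uint8_t> MakeDataFrame(uint32_t stream_id, const std::vector<uint8_t>& payload) {
  const uint32_t len = static_cast<uint32_t>(payload.size());
  std::vector<uint8_t> out = {
      uint8_t(len >> 16), uint8_t(len >> 8), uint8_t(len),
      0x0,  // type: DATA
      0x0,  // flags
      uint8_t((stream_id >> 24) & 0x7f), uint8_t(stream_id >> 16),
      uint8_t(stream_id >> 8), uint8_t(stream_id)};
  out.insert(out.end(), payload.begin(), payload.end());
  return out;
}
```

If you build a DATA frame on stream 1 with a 16-byte payload of 0x41 bytes, parsing at offset 0 gives type 0x0, length 16, and stream id 1. Parsing the same buffer at offset 9 (the start of the payload) gives type 0x41, which is not a known frame type, plus a length of 0x414141 and a stream id of 0x41414141. In other words, message bytes get interpreted as a header, which matches the pattern described above.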
The above is what we were encountering before we changed our code to create a new channel for each RPC call. While we realize this is not great from a performance standpoint, we have seen increased stability since making the change. However, we still occasionally get RPC exceptions. Now the most common is “Unknown”/“Stream Removed” rather than the ones listed above.
Any ideas on what might be going wrong are appreciated.