C++ server stream replies not reaching client

Bryan Schwerer

Mar 19, 2021, 3:59:30 PM
to grpc.io
Hello,

I'm in the long overdue process of updating gRPC from 1.20 to 1.36.1. I am running into an issue where the streaming replies from the server are not reaching the client in about 50% of instances. The failure is binary: either the streaming call works perfectly or it doesn't work at all. After debugging a bit, I turned on the http tracing, and from what I can tell the HTTP messages are received in the client thread, but while `perform_stream_op[s=0x7f0e16937290]: RECV_MESSAGE` is logged in the working case, it isn't in the broken case. No error messages occur.

I've tried various tracers but haven't hit anything. The code follows pretty much the same pattern as the example, and there's no indication that any disconnect has occurred that would cause the call to terminate. Looking at the thread with gdb, it is still sitting in epoll_wait.
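
For reference, the client side is essentially the standard synchronous server-streaming read loop (a simplified sketch; the `myservice` / `StreamReply` names here are placeholders, not our actual proto):

```cpp
#include <iostream>
#include <memory>

#include <grpcpp/grpcpp.h>
#include "myservice.grpc.pb.h"  // hypothetical generated header

// Simplified sketch of one client-side streaming call; each of the two
// streaming calls runs a loop like this in its own thread.
void ReadStream(const std::shared_ptr<grpc::Channel>& channel) {
  auto stub = myservice::MyService::NewStub(channel);

  grpc::ClientContext ctx;
  myservice::StreamRequest request;
  std::unique_ptr<grpc::ClientReader<myservice::StreamReply>> reader(
      stub->Subscribe(&ctx, request));

  myservice::StreamReply reply;
  while (reader->Read(&reply)) {  // in the broken case this never returns a message
    std::cout << reply.DebugString() << std::endl;
  }

  grpc::Status status = reader->Finish();
  if (!status.ok()) {
    std::cerr << "stream failed: " << status.error_message() << std::endl;
  }
}
```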

The process this runs in makes two different synchronous server-streaming calls to the same server, in separate threads. It is also a gRPC server itself. Everything runs over the internal 'lo' interface. Any ideas on where to look to debug this?

Thanks,

Bryan

yas...@google.com

Mar 24, 2021, 1:35:29 PM
to grpc.io
This is pretty strange. It is possible that we are being blocked on flow control. I would make sure that the application layer is actually reading. If I am not mistaken, `perform_stream_op[s=0x7f0e16937290]: RECV_MESSAGE` is a log that is seen at the start of an operation, so its absence would mean that the HTTP/2 layer hasn't yet been instructed to read a message (or that a previous read on the stream hasn't finished yet). Given that you are just updating the gRPC version from 1.20 to 1.36.1, I do not have an answer as to why you would see this without any application changes.

A few questions - 
Do the two streams use the same underlying channel/transport?
Are the clients and the server in the same process?
Is there anything special about the environment this is being run in?

(One way to make sure that the read op is being propagated to the transport layer is to check the logs with the "channel" tracer.)
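
For reference, the tracers are controlled through environment variables, so something like the following (set before any gRPC activity, or simply exported in the shell that launches the process) should work; this assumes a POSIX environment:

```cpp
#include <cstdlib>

int main(int argc, char** argv) {
  // Enable the "channel" tracer alongside the http tracer already in use,
  // before any channels or servers are created. Equivalent to exporting
  // GRPC_TRACE and GRPC_VERBOSITY in the shell.
  setenv("GRPC_TRACE", "channel,http", 1);
  setenv("GRPC_VERBOSITY", "DEBUG", 1);

  // ... create channels / start the server as usual ...
  return 0;
}
```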

Bryan Schwerer

Mar 24, 2021, 2:02:30 PM
to grpc.io
Thanks for replying.

I was able to get a tcpdump capture and run it through the Wireshark dissector. It indicated that there were malformed protobuf fields in the message. I'm guessing the client threw the messages away, although I didn't see a trace message indicating that. Is there some sort of stat I can check? Is it possible that older versions didn't discard malformed messages? I haven't loaded up an old version of our code, but I suspect the problem has always been there. The end of the message has counters and such, so if they were a bit off, no one would notice.

I think we are corrupting the messages on the server side: I turned on -fstack-protector-all and the problem went away. If there's a way to check the message before handing it to the Writer, that may give us more information. We don't use arenas. The message itself is uint32s, bools, and one string. I assume protobuf makes a copy of the string rather than keeping a pointer to the buffer.

yas...@google.com

Mar 24, 2021, 2:23:04 PM
to grpc.io
The deserialization happens at the surface layer rather than the transport layer, unless we suspect that the HTTP/2 frames themselves were malformed. If we suspect the serialization/deserialization code, we can check whether simply serializing the proto to bytes and parsing it back causes issues. Protobuf has utility functions to do this. Alternatively, gRPC has utility functions here: https://github.com/grpc/grpc/blob/master/include/grpcpp/impl/codegen/proto_utils.h
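
For example, a rough sketch of such a check using the generic protobuf Message API (run it on the reply just before handing it to the Writer):

```cpp
#include <memory>
#include <string>

#include <google/protobuf/message.h>
#include <google/protobuf/util/message_differencer.h>

// Serialize the reply to bytes, parse it back into a fresh message of the
// same type, and compare field by field. Returns false if anything about
// the round trip looks wrong.
bool RoundTripsCleanly(const google::protobuf::Message& reply) {
  std::string bytes;
  if (!reply.SerializeToString(&bytes)) return false;

  std::unique_ptr<google::protobuf::Message> parsed(reply.New());
  if (!parsed->ParseFromString(bytes)) return false;

  return google::protobuf::util::MessageDifferencer::Equals(reply, *parsed);
}
```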

I am worried about memory corruption though, so that is certainly something to check.

Bryan Schwerer

Mar 26, 2021, 9:38:08 AM
to grpc.io
It turned out that a structure occasionally had an uninitialized boolean value that was set directly into the reply message. UndefinedBehaviorSanitizer (libubsan) found it for us.
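
For anyone hitting something similar, the bug was roughly of this shape (a simplified illustration, not our actual code; the struct and field names are made up). Building with -fsanitize=undefined is what pulls in libubsan:

```cpp
#include <cstdint>

#include "myservice.grpc.pb.h"  // hypothetical generated header, as above

struct LinkStats {
  uint32_t packets = 0;
  bool link_up;  // never initialized anywhere
};

void FillReply(const LinkStats& stats, myservice::StreamReply* reply) {
  // UBSan reports "load of value N, which is not a valid value for type
  // 'bool'" here whenever the indeterminate byte happens to be neither 0 nor 1.
  reply->set_link_up(stats.link_up);
}
```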