gRPC Python behaves differently under WINE Python vs. native Linux Python


John Forensics

Jan 4, 2022, 10:42:12 AM
to grpc.io
I am running identical Python code under either

WINE python.exe   versus   native Linux python

The behaviour is slightly different, and with the WINE Python variant it eventually leads to losing communication entirely (messages/calls are not received at the Java client side).

The most visible difference I see is the number of HTTP/2 streams that are created. The native Linux runs stay at a low number, with most communication done within one stream, while the Windows (WINE) variant seems to distribute it evenly over more streams.

I do not understand why the gRPC layer would behave differently. I am currently trying to understand the problem using Wireshark, but that's not the right way to do it.

Any tips on how to debug this would be appreciated. I do not see any exceptions explaining why a message isn't received at the Java side.

If people have some hypotheses for me to test, that would be nice as well.
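
In the meantime I will try to turn on gRPC's own tracing. As far as I understand, the Python package wraps the gRPC C core, so the core tracers apply: GRPC_VERBOSITY and GRPC_TRACE just need to be set in the environment of the process before gRPC initializes. A minimal sketch in C++ against the same core (the tracer selection and port are only examples, and setenv is the POSIX spelling):

#include <cstdlib>
#include <memory>

#include <grpcpp/grpcpp.h>

int main() {
  // "http" traces HTTP/2 framing, "flowctl" the flow-control bookkeeping,
  // "tcp" the iomgr socket layer (which is WSASend-based on Windows/WINE).
  setenv("GRPC_TRACE", "http,flowctl,tcp", 1);
  setenv("GRPC_VERBOSITY", "DEBUG", 1);

  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:50051", grpc::InsecureServerCredentials());
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();  // traces are written to stderr while the server runs
}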

Best regards,

Ruud


John Forensics

Jan 5, 2022, 5:54:13 AM
to grpc.io
I managed to extract some logging; the Python side seems to send a frame that is too large.

[id: 0x0e0792ca, L:/127.0.0.1:43718 - R:localhost/127.0.0.1:60630] Sent GOAWAY: lastStreamId '2147483647', errorCode '6', debugData 'Frame length: 2165320 exceeds maximum: 16384'. Forcing shutdown of the connection.
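
For context: error code 6 in a GOAWAY is FRAME_SIZE_ERROR, and 16384 is the HTTP/2 default SETTINGS_MAX_FRAME_SIZE (also the maximum the Java side reports here). A frame length as absurd as 2,165,320 usually means the peer is parsing payload bytes as a frame header because the byte stream got out of sync, rather than gRPC deliberately building a 2 MB frame. A back-of-the-envelope check of what a conforming sender would have done (an illustration of the protocol rule only, not gRPC's framing code):

#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t kMaxFrameSize = 16384;  // default SETTINGS_MAX_FRAME_SIZE
  const std::size_t kPayload = 2165320;     // frame length from the GOAWAY debugData
  const std::size_t kFrames = (kPayload + kMaxFrameSize - 1) / kMaxFrameSize;
  // A conforming sender splits the payload into ~133 DATA frames of at most
  // 16 KiB each instead of emitting one oversized frame.
  std::printf("%zu bytes -> %zu DATA frames of <= %zu bytes\n",
              kPayload, kFrames, kMaxFrameSize);
}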


On Tuesday, January 4, 2022 at 16:42:12 UTC+1, John Forensics wrote:

John Forensics

Jan 5, 2022, 8:32:20 AM
to grpc.io
After sending some data (about 4 chunks of 16 KB out of a single 1 MB gRPC message), the HTTP/2 stream seems to be put on the "stalled" list (whatever that may be).


 World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello W
orld!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello Worl
d!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.Hello World!!!!.'
I0105 12:30:25.218000000   216 src/core/lib/iomgr/timer_generic.cc:558]   .. shard[13] popped 0
I0105 12:30:25.258000000   208 src/core/lib/surface/completion_queue.cc:1074] RETURN_EVENT[00484AA0]: QUEUE_TIMEOUT
I0105 12:30:25.267000000   212 src/core/lib/iomgr/executor.cc:294] EXECUTOR (default-executor) try to schedule 00499258 (short) to thread 0
I0105 12:30:25.267000000   216 src/core/lib/iomgr/timer_generic.cc:613]   .. result --> 1, shard[13]->min_deadline 10496 --> 12695, now=11765
I0105 12:30:25.267000000   208 src/core/lib/surface/completion_queue.cc:970] grpc_completion_queue_next(cq=00484AA0, deadline=gpr_timespec { tv_sec: 1641385825, tv_nsec: 341971712, clock_type: 1 }, reserved=00000000)
I0105 12:30:25.268000000   196 src/core/lib/iomgr/executor.cc:242] EXECUTOR (default-executor) [0]: execute
I0105 12:30:25.268000000   216 src/core/lib/iomgr/timer_generic.cc:517]   .. shard[14]: heap_empty=true
I0105 12:30:25.268000000   196 src/core/lib/iomgr/executor.cc:123] EXECUTOR (default-executor) run 00499258
I0105 12:30:25.268000000   216 src/core/lib/iomgr/timer_generic.cc:492]   .. shard[14]->queue_deadline_cap --> 12694
I0105 12:30:25.268000000   196 src/core/ext/transport/chttp2/transport/chttp2_transport.cc:800] W:00497270 SERVER [ipv4:10.1.0.1:41200] state WRITING+MORE -> WRITING [continue writing]
I0105 12:30:25.269000000   216 src/core/lib/iomgr/timer_generic.cc:558]   .. shard[14] popped 0
I0105 12:30:25.269000000   196 src/core/ext/transport/chttp2/transport/stream_lists.cc:71] 00497270[5][svr]: pop from writing
I0105 12:30:25.269000000   216 src/core/lib/iomgr/timer_generic.cc:613]   .. result --> 1, shard[14]->min_deadline 10496 --> 12695, now=11765
I0105 12:30:25.269000000   196 src/core/ext/transport/chttp2/transport/stream_lists.cc:71] 00497270[5][svr]: pop from writable
I0105 12:30:25.269000000   216 src/core/lib/iomgr/timer_generic.cc:517]   .. shard[7]: heap_empty=true
I0105 12:30:25.270000000   196 src/core/ext/transport/chttp2/transport/writing.cc:443] W:00497270 SERVER[5] im-(sent,send)=(1,0) announce=0
I0105 12:30:25.270000000   216 src/core/lib/iomgr/timer_generic.cc:492]   .. shard[7]->queue_deadline_cap --> 12764
D0105 12:30:25.270000000   196 src/core/ext/transport/chttp2/transport/flow_control.cc:117] 00498C78[5][svr] | s updt sent | trw:                             0, tlw:                       4183485, taw:                       4187469, srw:                           288, slw:                       4187521, saw:                       4187521
I0105 12:30:25.270000000   216 src/core/lib/iomgr/timer_generic.cc:558]   .. shard[7] popped 0
D0105 12:30:25.270000000   196 src/core/ext/transport/chttp2/transport/writing.cc:163] ipv4:10.1.0.1:41200:00497270 stream 5 moved to stalled list by transport. This is FULLY expected to happen in a healthy program that is not seeing flow control stalls. However, if you know that there are unwanted stalls, here is some helpful data: [fc:pending=656:pending-compressed=0:flowed=1048288:peer_initwin=4187516:t_win=0:s_win=288:s_delta=-1048288]
I0105 12:30:25.271000000   216 src/core/lib/iomgr/timer_generic.cc:613]   .. result --> 1, shard[7]->min_deadline 10626 --> 12765, now=11765
I0105 12:30:25.271000000   196 src/core/ext/transport/chttp2/transport/stream_lists.cc:125] 00497270[5][svr]: add to stalled_by_transport
I0105 12:30:25.271000000   216 src/core/lib/iomgr/timer_generic.cc:517]   .. shard[9]: heap_empty=true
D0105 12:30:25.271000000   196 src/core/ext/transport/chttp2/transport/flow_control.cc:117] 00498C78[0][svr] | t updt sent | trw:                             0, tlw:                       4183485, taw:                       4187469, srw:                              , slw:                              , saw:                              
I0105 12:30:25.271000000   216 src/core/lib/iomgr/timer_generic.cc:492]   .. shard[9]->queue_deadline_cap --> 12764
I0105 12:30:25.272000000   196 src/core/ext/transport/chttp2/transport/chttp2_transport.cc:800] W:00497270 SERVER [ipv4:10.1.0.1:41200] state WRITING -> IDLE [begin writing nothing]
I0105 12:30:25.272000000   216 src/core/lib/iomgr/timer_generic.cc:558]   .. shard[9] popped 0
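
To make that "helpful data" easier to read, here is a small parser I threw together (sketch only; my reading is that t_win is the connection-level send window and s_win the stream-level one, and flowed=1048288 plus the remaining s_win=288 is exactly 1 MiB, so the transport appears to have no send window left after roughly one full message):

#include <cstdio>
#include <sstream>
#include <string>

int main() {
  // The bracketed data from the "moved to stalled list" line above.
  const std::string data =
      "fc:pending=656:pending-compressed=0:flowed=1048288:"
      "peer_initwin=4187516:t_win=0:s_win=288:s_delta=-1048288";
  std::istringstream in(data);
  std::string field;
  while (std::getline(in, field, ':')) {
    const auto eq = field.find('=');
    if (eq == std::string::npos) continue;  // skips the leading "fc" tag
    std::printf("%-20s %s\n", field.substr(0, eq).c_str(),
                field.substr(eq + 1).c_str());
  }
}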


On Wednesday, January 5, 2022 at 11:54:13 UTC+1, John Forensics wrote:

John Forensics

Jan 5, 2022, 8:52:09 AM
to grpc.io
So the stream is stalled because the congestion window is closed. I just verified this with tcpdump and tcptrace, and can confirm that's the problem.

https://github.com/grpc/grpc/blob/2e3e3ccc3d5c40010a6a5555d6f824a5024176a7/src/core/ext/transport/chttp2/transport/writing.cc#L449

This explains why excessive logging and making my program inefficient help (e.g. turning on gRPC's own debug logging).

Which changes the question: why isn't the stream unstalled when the congestion window opens again?

This also seems like something that could differ at the low-level socket API between Linux, Windows and WINE.

On Wednesday, January 5, 2022 at 14:32:20 UTC+1, John Forensics wrote:

John Forensics

Jan 5, 2022, 10:53:34 AM
to grpc.io

Could one of the developers please elaborate on the case where:
* The sender has been blocked (briefly) on a closed TCP window, leaving a large set of gRPC chunks waiting on the sender side of the HTTP/2 stream
* The receiver has not sent an outbound WINDOW_UPDATE to signal that it is OK to send more data.

My current impression is that the Python side (the sender) will not continue sending, even though the TCP congestion window is open again.
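
To make the interaction in those two bullets concrete, here is a toy model of the HTTP/2 side of it (an illustration of the protocol rule only, not gRPC internals): once the send window is spent, only a WINDOW_UPDATE from the receiver lets more data flow, and whatever the TCP congestion window underneath does afterwards no longer matters.

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Toy model of an HTTP/2 send window; the numbers are examples, not the trace.
struct SendWindow {
  int64_t window;  // bytes the peer currently allows us to send
  int64_t queued;  // bytes waiting inside the HTTP/2 layer

  void Queue(int64_t n) { queued += n; }

  // Hand as much as flow control permits down to TCP; returns bytes written.
  int64_t Flush() {
    const int64_t n = std::min(window, queued);
    window -= n;
    queued -= n;
    return n;
  }

  // Only a WINDOW_UPDATE from the receiver replenishes the window.
  void OnWindowUpdate(int64_t increment) { window += increment; }
};

int main() {
  SendWindow w{/*window=*/65535, /*queued=*/0};  // HTTP/2 default initial window
  w.Queue(1048576);                              // queue a 1 MiB gRPC message
  std::printf("first flush: %lld bytes, %lld still queued\n",
              static_cast<long long>(w.Flush()),
              static_cast<long long>(w.queued));
  // If the receiver never sees those bytes, it never sends WINDOW_UPDATE,
  // the window stays at 0, and every later flush moves nothing -- even if
  // TCP's congestion window has long since reopened.
  std::printf("second flush: %lld bytes\n",
              static_cast<long long>(w.Flush()));
}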


On Wednesday, January 5, 2022 at 14:52:09 UTC+1, John Forensics wrote:

John Forensics

Jan 6, 2022, 9:02:11 AM
to grpc.io
Still getting closed:

On Windows, gRPC handles its sends using WSASend with asynchronous I/O. When it runs into a TCP-congested situation, WSASend should report WSAEWOULDBLOCK; under WINE, however, the WSASend calls made by gRPC always return "0":

grep "Ret  ws2_32.WSASend" /tmp/stderr | sort | uniq -c
      1 00000d4:Ret  ws2_32.WSASend() retval=00000000 ret=027f06db
     36 00d0:Ret  ws2_32.WSASend() retval=00000000 ret=027f06db
     15 00d4:Ret  ws2_32.WSASend() retval=00000000 ret=027f06db
      1 gen00d0:Ret  ws2_32.WSASend() retval=00000000 ret=027f06db
      1 _manager00d0:Ret  ws2_32.WSASend() retval=00000000 ret=027f06db
      1 MES00d0:Ret  ws2_32.WSASend() retval=00000000 ret=027f06db

Also, all calls to WSASend have the LPWSAOVERLAPPED and LPWSAOVERLAPPED_COMPLETION_ROUTINE arguments set to zero (NULL):

Call ws2_32.WSASend(000000fc,03cbf428,00000001,03cbf40c,00000000,00000000,00000000) ret=027f06db

int WSAAPI WSASend(
  [in]  SOCKET                             s,
  [in]  LPWSABUF                           lpBuffers,
  [in]  DWORD                              dwBufferCount,
  [out] LPDWORD                            lpNumberOfBytesSent,
  [in]  DWORD                              dwFlags,
  [in]  LPWSAOVERLAPPED                    lpOverlapped,
  [in]  LPWSAOVERLAPPED_COMPLETION_ROUTINE lpCompletionRoutine
);

This matches the code in /src/core/lib/iomgr/tcp_windows.cc

  /* First, let's try a synchronous, non-blocking write. */
  status = WSASend(socket->socket, buffers, (DWORD)tcp->write_slices->count,
                   &bytes_sent, 0, NULL, NULL);
  info->wsa_error = status == 0 ? 0 : WSAGetLastError();

  /* We would kind of expect to get a WSAEWOULDBLOCK here, especially on a busy
     connection that has its send queue filled up. But if we don't, then we can
     avoid doing an async write operation at all. */
  if (info->wsa_error != WSAEWOULDBLOCK) {
    grpc_error_handle error = status == 0
                                  ? GRPC_ERROR_NONE
                                  : GRPC_WSA_ERROR(info->wsa_error, "WSASend");
    grpc_core::ExecCtx::Run(DEBUG_LOCATION, cb, error);
    if (allocated) gpr_free(allocated);
    return;
  }


So the "dispute" for this API probably is, is WSASend on Overlapped IO allowed to send partially and report the lpNumberOfBytesSent < number of bytes in the buffer.

Anyone got an opinion about that?
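
For what it's worth, the decision the synchronous path above would have to make can be sketched like this (illustration only, not the change in my PR; "total" stands for tcp->write_slices->length, the slice buffer's total byte count, and 10035 is WSAEWOULDBLOCK):

#include <cstddef>
#include <cstdio>

// Mirrors the variables in the tcp_windows.cc excerpt above. A retval of 0
// from WSASend is only a full success when lpNumberOfBytesSent covers the
// whole buffer; otherwise the remainder would still have to go through the
// overlapped/async path (which appears to be the case WINE triggers).
enum class SyncWrite { kDone, kPartial, kWouldBlock, kError };

SyncWrite Classify(int status, int wsa_error, std::size_t bytes_sent,
                   std::size_t total) {
  if (status != 0 && wsa_error == 10035)  // WSAEWOULDBLOCK
    return SyncWrite::kWouldBlock;        // expected on a congested connection
  if (status != 0) return SyncWrite::kError;
  if (bytes_sent < total) return SyncWrite::kPartial;  // retval 0, short write
  return SyncWrite::kDone;
}

int main() {
  // The situation suggested by the WINE trace: retval 0, but only part of the
  // roughly 1 MB buffer actually taken by the socket.
  std::printf("%d\n", static_cast<int>(Classify(/*status=*/0, /*wsa_error=*/0,
                                                /*bytes_sent=*/65536,
                                                /*total=*/1048576)));
}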


On Wednesday, January 5, 2022 at 16:53:34 UTC+1, John Forensics wrote:

John Forensics

Jan 11, 2022, 2:39:00 AM
to grpc.io
I created a PR for this:



On Thursday, January 6, 2022 at 15:02:11 UTC+1, John Forensics wrote: