v1.19.1, C++
Linux
We have synchronous client and synchronous server applications. In our case, client streams data and server listens most of the time. Client closes connection after 1 or 2 days based on configs.
In our client, on disconnect with server we check for channel status and reconnect to server.
We have seen issues with connection drops and client blocks forever. To deal with it, we enabled keep-alive pings following this document (https://github.com/grpc/grpc/blob/master/doc/keepalive.md). The parameters we used for both client and server are:
GRPC_ARG_KEEPALIVE_TIME_MS = 10000
GRPC_ARG_KEEPALIVE_TIMEOUT_MS = 2000
GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS = 1
GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS = 10000
GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS = 5000
GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA = 0
(I don't have any other GRPC variables set other than these)
We expect to see client and server to send keepalive packets periodically and only disconnect if the connection between them or other end is down.
I see periodically (10 mins) the connection at client is dropped and our retry logic connects again.
At Client:
machine-host:/home/abc# GRPC_VERBOSITY=debug ./client
D0417 02:26:19.773717438 21870 ev_posix.cc:170] Using polling engine: epollex
D0417 02:26:19.774037411 21870 dns_resolver_ares.cc:488] Using ares dns resolver
I0417 02:26:19.775079731 21870 socket_utils_common_posix.cc:349] Disabling AF_INET6 sockets because ::1 is not available.
I0417 02:26:19.776441917 21870 subchannel.cc:1025] New connected subchannel at 0x3fff74020830 for subchannel 0x3fff740210a0
E0417 02:36:21.775425004 21898 chttp2_transport.cc:2765] ipv4:3.1.1.2:9011: Keepalive watchdog fired. Closing transport.
I0417 02:36:39.774677222 21898 subchannel.cc:1025] New connected subchannel at 0x3fff58001980 for subchannel 0x3fff740210a0
E0417 02:46:41.774407146 21898 chttp2_transport.cc:2765] ipv4:3.1.1.2:9011: Keepalive watchdog fired. Closing transport.
I0417 02:46:59.774638182 21898 subchannel.cc:1025] New connected subchannel at 0x3fff50001170 for subchannel 0x3fff740210a0
E0417 02:57:01.774407382 21898 chttp2_transport.cc:2765] ipv4:3.1.1.2:9011: Keepalive watchdog fired. Closing transport.
I0417 02:57:19.774682903 21898 subchannel.cc:1025] New connected subchannel at 0x3fff58002450 for subchannel 0x3fff740210a0
At Server:
machine-host:/home/abc# GRPC_VERBOSITY=debug ./grpc_listener 3.1.1.2 9011
D0417 02:26:11.047268367 8603 ev_posix.cc:170] Using polling engine: epollex
D0417 02:26:11.047769221 8603 dns_resolver_ares.cc:488] Using ares dns resolver
I0417 02:26:11.047961621 8603 server_builder.cc:281] Synchronous server. Num CQs: 1, Min pollers: 1, Max Pollers: 2, CQ timeout (msec): 10000
I0417 02:26:11.048415621 8603 socket_utils_common_posix.cc:349] Disabling AF_INET6 sockets because ::1 is not available.
<no other grpc logs printed>
Whenever this reconnect happens we see increment in eventfd and eventpoll inodes of client application (2 new inodes created). As the connection drop and reconnect happens for every 10 mins, we see 2 new inodes created by client. Due to this, once we hit the max (1024) file descriptors our program is crashing. Similar to this issue: #19767 [same stack trace seen]
I have few questions.
No
To experiment further to see keep alive packets in tcpdump, I have set following keep alive params at client end and nothing at client end:
GRPC_ARG_KEEPALIVE_TIME_MS = 10000
GRPC_ARG_KEEPALIVE_TIMEOUT_MS = 2000
GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS = 1
I expect here that keep-alive packets are exchanged every 10 secs but I never saw any packet exchange b/w client and server except data and after 5 minutes or so, I see some packets flow (length 17, 60 and 13 shown below) and then FINISH packet is sent from client to server. I am not sure what the other packets that flown between client and server. Are they http2 ping packets?
3.1.1.1.43788 -> client
3.1.1.2.9011 -> server
23:13:36.774898 IP 3.1.1.1.43788 > 3.1.1.2.9011: Flags [P.], seq 943:1213, ack 90, win 115, options [nop,nop,TS val 2434255 ecr 2409302], length 270 ----------------> This is payload data from client to server.
23:13:36.774960 IP 3.1.1.2.9011 > 3.1.1.1.43788: Flags [.], ack 1213, win 130, options [nop,nop,TS val 2434294 ecr 2434255], length 0
23:13:41.369246 IP 3.1.1.1.43788 > 3.1.1.2.9011: Flags [P.], seq 1213:1230, ack 90, win 115, options [nop,nop,TS val 2435403 ecr 2434294], length 17
23:13:41.369308 IP 3.1.1.2.9011 > 3.1.1.1.43788: Flags [.], ack 1230, win 130, options [nop,nop,TS val 2435443 ecr 2435403], length 0
23:13:41.369730 IP 3.1.1.2.9011 > 3.1.1.1.43788: Flags [P.], seq 90:150, ack 1230, win 130, options [nop,nop,TS val 2435443 ecr 2435403], length 60
23:13:41.370142 IP 3.1.1.1.43788 > 3.1.1.2.9011: Flags [.], ack 150, win 115, options [nop,nop,TS val 2435403 ecr 2435443], length 0
23:13:43.369560 IP 3.1.1.1.43788 > 3.1.1.2.9011: Flags [P.], seq 1230:1243, ack 150, win 115, options [nop,nop,TS val 2435903 ecr 2435443], length 13
23:13:43.369620 IP 3.1.1.1.43788 > 3.1.1.2.9011: Flags [F.], seq 1243, ack 150, win 115, options [nop,nop,TS val 2435903 ecr 2435443], length 0 ----------------> This is finish from client to server.
23:13:43.369929 IP 3.1.1.2.9011 > 3.1.1.1.43788: Flags [F.], seq 150, ack 1244, win 130, options [nop,nop,TS val 2435943 ecr 2435903], length 0
23:13:43.370380 IP 3.1.1.1.43788 > 3.1.1.2.9011: Flags [.], ack 151, win 115, options [nop,nop,TS val 2435904 ecr 2435943], length 0
After this, reconnect happens and those packets are seen. Same sequence happens for every 5 minutes.