Hello,
grpc C++, RELEASE_13_1, running on Android 4.4.
My grpc client fails to re-connect to a server if I physically disconnect the network, wait some time (> 10 mins), and then reconnect the network.
GRPC_TRACE is set to "connectivity_state" and I periodically check the state by calling channel->GetState(false):
1. Starting client (4 streaming services multiplexed on 1 channel):
16:57:09.281237 [ grpc] <DEBUG> CONWATCH: 0x7548dabc pick_first: get IDLE
16:57:09.282513 [ grpc] <DEBUG> SET: 0x75485058 client_channel: IDLE --> IDLE [new_lb+resolver]
16:57:09.284304 [ grpc] <DEBUG> CONWATCH: 0x7548dabc pick_first: from IDLE [cur=IDLE] notify=0x7548c33c
16:57:09.285091 [ grpc] <DEBUG> CONWATCH: 0x7548ca74 subchannel: from IDLE [cur=IDLE] notify=0x7548c364
16:57:09.287457 [ grpc] <DEBUG> SET: 0x7548ca74 subchannel: IDLE --> CONNECTING [state_change]
16:57:09.289711 [ grpc] <DEBUG> SET: 0x7548dabc pick_first: IDLE --> CONNECTING [connecting_changed]
16:57:09.291326 [ grpc] <DEBUG> CONWATCH: 0x7548ca74 subchannel: from CONNECTING [cur=CONNECTING] notify=0x7548c364
16:57:09.292361 [ grpc] <DEBUG> SET: 0x75485058 client_channel: IDLE --> CONNECTING [lb_changed]
16:57:09.293598 [ grpc] <DEBUG> CONWATCH: 0x7548dabc pick_first: from CONNECTING [cur=CONNECTING] notify=0x75485d1c
16:57:09.559146 [ grpc] <DEBUG> CONWATCH: 0x754b626c client_transport: from READY [cur=READY] notify=0x754b0c98
16:57:09.560757 [ grpc] <DEBUG> SET: 0x7548ca74 subchannel: CONNECTING --> READY [connected]
16:57:09.581935 [ grpc] <DEBUG> SET: 0x7548dabc pick_first: CONNECTING --> READY [connecting_ready]
16:57:09.588590 [ grpc] <DEBUG> CONWATCH: 0x754b626c client_transport: from READY [cur=READY] notify=0x7548da94
16:57:09.589425 [ grpc] <DEBUG> SET: 0x75485058 client_channel: CONNECTING --> READY [lb_changed]
16:57:09.590138 [ grpc] <DEBUG> CONWATCH: 0x7548dabc pick_first: from READY [cur=READY] notify=0x7548a7fc
2. Checking connectivity:
16:57:54.025261 [ grpc] <DEBUG> CONWATCH: 0x75485058 client_channel: get READY
3. Disconnect the cable (the client keeps writing to the server, filling the output socket buffer up to a point):
16:59:39.033484 [ grpc] <DEBUG> CONWATCH: 0x75485058 client_channel: get READY
Both connections to the server remain open.
4. After 18 minutes or so:
17:17:39.616776 [ grpc] <DEBUG> SET: 0x754b626c client_transport: READY --> FATAL_FAILURE [close_transport]
17:17:39.618827 [ grpc] <DEBUG> SET: 0x7548dabc pick_first: READY --> FATAL_FAILURE [selected_changed]
17:17:39.619160 [ grpc] <DEBUG> SET: 0x7548ca74 subchannel: READY --> FATAL_FAILURE [reflect_child]
17:17:39.619854 [ grpc] <DEBUG> SET: 0x75485058 client_channel: READY --> TRANSIENT_FAILURE [lb_changed]
At this point all services fail. We retry by closing the failed service instances and opening new ones. We're using the blocking API, so the new service instances block in the ClientReaderWriter constructor on the completion queue.
Both connections to the server are closed. All of this seems OK so far.
5. A few minutes later:
17:19:01.573006 [ grpc] <ERROR> getaddrinfo: No address associated with hostname
17:19:01.573913 [ grpc] <DEBUG> SET: 0x7548dabc pick_first: FATAL_FAILURE --> FATAL_FAILURE [shutdown]
17:19:01.574632 [ grpc] <DEBUG> CONWATCH: 0x754b626c client_transport: unsubscribe notify=0x7548da94
17:19:01.575053 [ grpc] <DEBUG> SET: 0x75485058 client_channel: TRANSIENT_FAILURE --> TRANSIENT_FAILURE [new_lb+resolver]
6. Reconnect cable:
17:23:55.667141 [ grpc] <DEBUG> CONWATCH: 0x75485058 client_channel: get TRANSIENT_FAILURE
The channel is still in TRANSIENT_FAILURE, and the client never attempts to reconnect.
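For anyone trying to reproduce this, the SET lines are the ones that record state changes, so it helps to pull them out of a long capture. Below is a minimal C++ sketch for that, assuming every SET line keeps the `SET: <address> <name>: <FROM> --> <TO> [<reason>]` shape seen in the trace above. `Transition`, `ParseSetLine`, and `ExtractTransitions` are hypothetical log-analysis helpers, not gRPC APIs.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// One state change from a GRPC_TRACE=connectivity_state "SET:" line.
// Hypothetical helper type for log analysis; not part of gRPC.
struct Transition {
  std::string address;  // object address, e.g. "0x75485058"
  std::string name;     // tracker name, e.g. "client_channel"
  std::string from;     // previous state, e.g. "READY"
  std::string to;       // new state, e.g. "TRANSIENT_FAILURE"
  std::string reason;   // bracketed reason, e.g. "lb_changed"
};

// Parses one trace line. Returns std::nullopt for anything that is not a
// SET entry (CONWATCH lines, ERROR lines, free text).
std::optional<Transition> ParseSetLine(const std::string& line) {
  // Assumed shape: SET: <addr> <name>: <FROM> --> <TO> [<reason>]
  static const std::regex kSetLine(
      R"(SET: (0x[0-9a-fA-F]+) (\w+): (\w+) --> (\w+) \[([^\]]*)\])");
  std::smatch m;
  if (!std::regex_search(line, m, kSetLine)) return std::nullopt;
  return Transition{m[1].str(), m[2].str(), m[3].str(), m[4].str(),
                    m[5].str()};
}

// Keeps only the SET entries from a captured log, in their original order.
std::vector<Transition> ExtractTransitions(
    const std::vector<std::string>& lines) {
  std::vector<Transition> out;
  for (const auto& line : lines) {
    if (auto t = ParseSetLine(line)) out.push_back(*t);
  }
  return out;
}
```

Feeding it the lines from step 4 onward and keeping only the `client_channel` entries leaves READY --> TRANSIENT_FAILURE [lb_changed] followed by TRANSIENT_FAILURE --> TRANSIENT_FAILURE [new_lb+resolver], with no later transition.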
Any ideas? Thanks!