Grpc XDS Connection failure and retries

158 views
Skip to first unread message

Kartik Raval

unread,
Aug 17, 2022, 8:13:20 AM8/17/22
to grpc.io
Hi,

My setup details:
Client/Server using - GRPC-Go v1.48.0
Istio-XDS with Traffic policy - Load balancer with consistent hash to route to server.
Client and Server running as separate pods on k8s cluster.

Issue:
1. Server side does xds.NewGRPCServer()
2. Client side connects using URI - xds:///chatservice:<port>
3. Client and server are able to exchange messages.
4. Server is terminated manually.
5. Client side loses the connection and retries once, but that attempt fails and then it never retries again. 
6. Server side is restarted and now available. 
7. Client side does not retry and never connects.
8. If Client calls Dial again, then it is able to reconnect. But not automatically. 
9. It is observed that if I turn off XDS, the GRPC client is able to reconnect automatically to the server. 

Can someone pl help and let me know if I am missing something or can try some configuration options that can help in this case.

Thanks.
-kartik.


Logs of client side for reference:

2022/08/17 12:04:38 INFO: [core] [Channel #1 SubChannel #5] Subchannel Connectivity change to IDLE

2022/08/17 12:04:38 INFO: [transport] transport: loopyWriter.run returning. connection error: desc = "transport is closing"

2022/08/17 12:04:38 INFO: [xds] [ring-hash-lb 0xc000839880] handle SubConn state change: 0xc00007fb20, IDLE

2022/08/17 12:04:38 INFO: [xds] [priority-lb 0xc000495490] Balancer state update from locality priority-0-0, new state: {ConnectivityState:IDLE Picker:0xc000840960}

2022/08/17 12:04:38 WARNING: [xds] ciu, cn, cu: priority-0-0, priority-0-0, priority-0-0

2022/08/17 12:04:38 INFO: [xds] [xds-cluster-manager-lb 0xc0003efb00] Balancer state update from locality cluster:outbound|16000||chatservice.default.svc.cluster.local, new state: {Connec

tivityState:IDLE Picker:0xc000840960}

2022/08/17 12:04:38 INFO: [xds] [xds-cluster-manager-lb 0xc0003efb00] Child pickers: map[cluster:outbound|16000||chatservice.default.svc.cluster.local:picker:0xc000840960,state:IDLE,state

ToAggregate:IDLE]

2022/08/17 12:04:38 INFO: [core] [Channel #1] Channel Connectivity change to IDLE

2022/08/17 12:04:38 INFO: [xds] [priority-lb 0xc000495490] switching to ("priority-0-0", 0) in syncPriority

2022/08/17 12:04:38 Send -  Msg-3

2022/08/17 12:04:38 INFO: [core] [Channel #1 SubChannel #5] Subchannel Connectivity change to CONNECTING

2022/08/17 12:04:38 INFO: [core] [Channel #1 SubChannel #5] Subchannel picks a new address "172.20.0.170:16000" to connect

2022/08/17 12:04:38 INFO: [xds] [ring-hash-lb 0xc000839880] handle SubConn state change: 0xc00007fb20, CONNECTING

2022/08/17 12:04:38 INFO: [xds] [priority-lb 0xc000495490] Balancer state update from locality priority-0-0, new state: {ConnectivityState:CONNECTING Picker:0xc00004f1d0}

2022/08/17 12:04:38 WARNING: [xds] ciu, cn, cu: priority-0-0, priority-0-0, priority-0-0

2022/08/17 12:04:38 INFO: [xds] [xds-cluster-manager-lb 0xc0003efb00] Balancer state update from locality cluster:outbound|16000||chatservice.default.svc.cluster.local, new state: {ConnectivityState:CONNECTING Picker:0xc00004f1d0}

2022/08/17 12:04:38 INFO: [xds] [xds-cluster-manager-lb 0xc0003efb00] Child pickers: map[cluster:outbound|16000||chatservice.default.svc.cluster.local:picker:0xc00004f1d0,state:CONNECTING,stateToAggregate:CONNECTING]

2022/08/17 12:04:38 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING

2022/08/17 12:04:38 INFO: [xds] [priority-lb 0xc000495490] switching to ("priority-0-0", 0) in syncPriority

2022/08/17 12:04:38 WARNING: [core] [Channel #1 SubChannel #5] grpc: addrConn.createTransport failed to connect to {

  "Addr": "172.20.0.170:16000",

  "ServerName": "chatservice:16000",

  "Attributes": {},

  "BalancerAttributes": {},

  "Type": 0,

  "Metadata": null

}. Err: connection error: desc = "transport: Error while dialing dial tcp 172.20.0.170:16000: connect: connection refused"

2022/08/17 12:04:38 INFO: [core] [Channel #1 SubChannel #5] Subchannel Connectivity change to TRANSIENT_FAILURE

2022/08/17 12:04:38 INFO: [xds] [ring-hash-lb 0xc000839880] handle SubConn state change: 0xc00007fb20, TRANSIENT_FAILURE

2022/08/17 12:04:38 INFO: [xds] [priority-lb 0xc000495490] Balancer state update from locality priority-0-0, new state: {ConnectivityState:TRANSIENT_FAILURE Picker:0xc00004f3b0}

2022/08/17 12:04:38 WARNING: [xds] ciu, cn, cu: priority-0-0, priority-0-0, priority-0-0

2022/08/17 12:04:38 INFO: [xds] [xds-cluster-manager-lb 0xc0003efb00] Balancer state update from locality cluster:outbound|16000||chatservice.default.svc.cluster.local, new state: {ConnectivityState:TRANSIENT_FAILURE Picker:0xc00004f3b0}

2022/08/17 12:04:38 INFO: [xds] [xds-cluster-manager-lb 0xc0003efb00] Child pickers: map[cluster:outbound|16000||chatservice.default.svc.cluster.local:picker:0xc00004f3b0,state:TRANSIENT_FAILURE,stateToAggregate:TRANSIENT_FAILURE]

2022/08/17 12:04:38 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE

2022/08/17 12:04:38 INFO: [xds] [priority-lb 0xc000495490] switching to ("priority-0-0", 0) in syncPriority

2022/08/17 12:04:38 Error when calling SayHello: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing dial tcp 172.20.0.170:16000: connect: connection refused"

2022/08/17 12:04:39 INFO: [core] [Channel #1 SubChannel #5] Subchannel Connectivity change to IDLE

2022/08/17 12:04:39 INFO: [xds] [ring-hash-lb 0xc000839880] handle SubConn state change: 0xc00007fb20, IDLE




Easwar Swaminathan

unread,
Aug 22, 2022, 1:11:21 PM8/22/22
to grpc.io
Thanks for reporting this issue. We have root caused the issue and identified a bug in the `ring_hash` LB policy implementation. Will be sending a PR on the grpc-go repo soon.

Kartik Raval

unread,
Sep 20, 2022, 1:16:20 AM9/20/22
to grpc.io
Thanks for taking this up. Awaiting next GRPC release to try out again.
Reply all
Reply to author
Forward
0 new messages