Client-Side xDS Not Handling Removed Endpoints


Lawrence Finn

May 18, 2023, 9:52:47 AM
to grpc.io
Hi all.

I've been experimenting with client-side xDS and noticed odd behavior during a massive scale-down on the server side. If an entire locality disappears from the endpoint assignment, the gRPC client does not remove any of that locality's hosts from the load balancer. I've seen this with both the Go client and the Java client. For example, if us-east-1c disappears, the client keeps trying to connect to it.
2023/05/17 23:56:24 INFO: [xds] [xds-client 0xc00083cdc0] ADS response received: {
  "versionInfo": "2023-05-17T22:41:11Z/3164",
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "clusterName": "outbound|6565||myservice.svc.cluster.local",
      "endpoints": [
        {
          "locality": {
            "region": "us-east-1",
            "zone": "us-east-1c"
          }

Then I remove 1a
2023/05/17 23:58:32 INFO: Resource with name: outbound|6565||myservice.svc.cluster.local, type: *endpointv3.ClusterLoadAssignment, contains: {
  "clusterName": "outbound|6565||myservice.svc.cluster.local",
  "endpoints": [
    {
      "locality": {
        "region": "us-east-1",
        "zone": "us-east-1a"
      },

I then see a lot of log entries like these:

2023/05/18 00:01:43 INFO: [xds] [weighted-target-lb 0xc000eac660] Balancer state update from locality {"region":"us-east-1","zone":"us-east-1c"}, new state: {ConnectivityState:TRANSIENT_FAILURE Picker:0xc0005439c0}
2023/05/18 00:01:44 WARNING: [core] [Channel #1 SubChannel #12] grpc: addrConn.createTransport failed to connect to {
  "Addr": "IP_HERE:6565",
  "ServerName": "myservice.svc.cluster.local:6565",
  "Attributes": {},
  "BalancerAttributes": {},
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: error while dialing: dial tcp IP_HERE:6565: i/o timeout"
2023/05/18 00:01:44 INFO: [core] [Channel #1 SubChannel #12] Subchannel Connectivity change to TRANSIENT_FAILURE
2023/05/18 00:01:44 INFO: [balancer] base.baseBalancer: handle SubConn state change: 0xc0007d0020, TRANSIENT_FAILURE
2023/05/18 00:01:44 INFO: [xds] [weighted-target-lb 0xc000eac660] Balancer state update from locality {"region":"us-east-1","zone":"us-east-1c"}, new state: {ConnectivityState:TRANSIENT_FAILURE Picker:0xc00076efc0}

Eric Anderson

May 19, 2023, 11:58:27 AM
to Lawrence Finn, grpc.io
There's a 15-minute timer for localities; see gRFC A56 (Priority LB Policy). The set of localities a control plane provides can be unstable, so old localities are preserved for a while in case the control plane changes its mind.

Lawrence Finn

May 19, 2023, 2:59:26 PM
to grpc.io
Oh interesting, thanks for the follow-up. The wording makes it sound like that "child" will be marked as deactivated and then the timer to remove it starts, so requests will no longer be routed there from the moment of deactivation? Or am I misunderstanding that?