The gRPC Java client would have tried refreshing the IP addresses, but what most likely happened is a timing gap while the headless service was scaling up. When the old set of pods went down, they would have sent a GOAWAY on the connections the client had established (and since you use keepalive on the client, it is all the more likely the GOAWAY was not lost to the client). That GOAWAY immediately triggers a re-resolution, but DNS may still return the old addresses because the new pods have not come up yet, so the connection attempts fail and the RPCs fail with them. Once all the addresses fail to connect, name re-resolution is triggered again and a reconnection is scheduled after a backoff time dictated by the connection backoff policy in your service config. You can try one of the following options:
1. Using the connection backoff policy to wait for less time, or configuring retry policies so that RPCs are retried for longer (see the first sketch after this list).
2. Forcing a channel reconnect with ManagedChannel.resetConnectBackoff to reset the backoff timer and trigger a re-resolution and reconnect (second sketch below).
3. Using waitForReady in CallOptions so the RPC waits for the channel to become ready instead of failing immediately (third sketch below).
4. Actively watching the channel state with ManagedChannel.getState or ManagedChannel.notifyWhenStateChanged to know when the channel becomes READY (fourth sketch below).
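For option 1, here is a minimal sketch of attaching a per-method retry policy through the service config on the channel builder. The service name `example.Echo`, the target `dns:///my-headless-svc:50051`, and the attempt/backoff values are placeholders for your own setup:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.List;
import java.util.Map;

public final class RetryingChannel {
  // Sketch: a channel whose service config retries UNAVAILABLE RPCs with backoff,
  // covering the window where DNS still returns the old pod addresses.
  // "example.Echo" and the target are placeholders for your own service.
  static ManagedChannel build() {
    Map<String, Object> retryPolicy = Map.of(
        "maxAttempts", 5.0,              // service-config numbers must be doubles
        "initialBackoff", "0.5s",
        "maxBackoff", "10s",
        "backoffMultiplier", 2.0,
        "retryableStatusCodes", List.of("UNAVAILABLE"));
    Map<String, Object> methodConfig = Map.of(
        "name", List.of(Map.of("service", "example.Echo")),
        "retryPolicy", retryPolicy);

    return ManagedChannelBuilder.forTarget("dns:///my-headless-svc:50051")
        .defaultServiceConfig(Map.of("methodConfig", List.of(methodConfig)))
        .enableRetry()                   // retries must be explicitly enabled on older gRPC Java versions
        .usePlaintext()
        .build();
  }
}
```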
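For option 2, a sketch of clearing the connect backoff after an RPC fails with UNAVAILABLE, so the next attempt re-resolves DNS immediately instead of waiting out the backoff timer. The `Callable` wrapper is just an illustrative way to pass in whatever blocking call you actually make:

```java
import io.grpc.ManagedChannel;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;
import java.util.concurrent.Callable;

public final class BackoffReset {
  // Sketch: run a blocking RPC and, if it fails because no backend is reachable,
  // reset the channel's connect backoff to trigger an immediate re-resolution
  // and reconnect attempt.
  static <T> T callAndResetOnUnavailable(ManagedChannel channel, Callable<T> rpc) throws Exception {
    try {
      return rpc.call();
    } catch (StatusRuntimeException e) {
      if (e.getStatus().getCode() == Status.Code.UNAVAILABLE) {
        channel.resetConnectBackoff();   // clears the backoff timer on the channel
      }
      throw e;
    }
  }
}
```

Usage would look like `callAndResetOnUnavailable(channel, () -> stub.echo(request))`, where `stub.echo(request)` stands in for your own blocking call.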
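For option 3, wait-for-ready can be set on any generated stub, which applies it to the underlying CallOptions; pairing it with a deadline keeps a channel that never recovers from holding RPCs forever. The 10-second deadline below is an arbitrary example value:

```java
import io.grpc.stub.AbstractStub;
import java.util.concurrent.TimeUnit;

public final class WaitForReadyStubs {
  // Sketch: make a stub's RPCs queue while the channel is reconnecting
  // (wait-for-ready) instead of failing fast, with a deadline as a safety net.
  static <S extends AbstractStub<S>> S waitForReady(S stub) {
    return stub
        .withWaitForReady()                       // sets wait-for-ready on the CallOptions
        .withDeadlineAfter(10, TimeUnit.SECONDS); // placeholder deadline; tune for your workload
  }
}
```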
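For option 4, a sketch that watches the channel state and completes a future once it turns READY, using getState(true) to kick off a connection attempt and notifyWhenStateChanged to avoid busy-polling:

```java
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;
import java.util.concurrent.CompletableFuture;

public final class ChannelReadiness {
  // Sketch: complete a future once the channel reaches READY.
  // Callers can block on the future (with their own timeout) before issuing RPCs.
  static CompletableFuture<Void> whenReady(ManagedChannel channel) {
    CompletableFuture<Void> ready = new CompletableFuture<>();
    watch(channel, ready);
    return ready;
  }

  private static void watch(ManagedChannel channel, CompletableFuture<Void> ready) {
    // getState(true) also asks the channel to start connecting if it is IDLE.
    ConnectivityState state = channel.getState(true);
    if (state == ConnectivityState.READY) {
      ready.complete(null);
      return;
    }
    // Re-check whenever the state leaves its current value.
    channel.notifyWhenStateChanged(state, () -> watch(channel, ready));
  }
}
```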