I'm trying to set up a simple Python gRPC client and server, where the client uses round-robin load balancing against a single DNS record that lists multiple server instances.
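For context, a round-robin channel over a DNS record is typically created with an explicit `dns:///` target, so gRPC's DNS resolver turns each returned address into its own subchannel. The sketch below assumes a hypothetical hostname `my-service.example.com` and port 10000. It requests `round_robin` through the `grpc.service_config` channel option, an alternative to the `grpc.lb_policy_name` option used in the sample client. The helper names are illustrative. `grpc` is imported inside `make_channel` only so the option builders can be used without grpcio installed.

```python
import json
from typing import List, Tuple


def dns_target(host: str, port: int) -> str:
    """Build a target that gRPC's DNS resolver expands into one address per record."""
    return f"dns:///{host}:{port}"


def round_robin_options() -> List[Tuple[str, str]]:
    """Channel options that request the round_robin LB policy via a service config."""
    service_config = {"loadBalancingConfig": [{"round_robin": {}}]}
    return [("grpc.service_config", json.dumps(service_config))]


def make_channel(host: str, port: int):
    # Imported lazily so the helpers above work without grpcio installed.
    import grpc

    return grpc.insecure_channel(dns_target(host, port), options=round_robin_options())
```

Usage would be `channel = make_channel("my-service.example.com", 10000)`. Spelling out the `dns:///` scheme makes it explicit which resolver handles the name, rather than relying on the default.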
At first I can connect and issue requests without problems. However, when I redeploy my servers I get behavior that I had hoped the client library would handle automatically. My redeploy works like this: I bring up the new servers, point the DNS record at the new servers' IPs, and then destroy the old servers. Everything works until the old servers are destroyed. At that point I get a couple of UNAVAILABLE errors, followed by DEADLINE_EXCEEDED errors that continue until I kill the client.

From what I understand, when the subchannels go down (i.e. the server instances are killed), the channel should re-resolve the DNS record and try to connect to the new instances. Am I interpreting this incorrectly? Is there a channel and/or server option I need to set for this to work?
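Two settings commonly affect how quickly a client picks up a DNS change. On the client, `grpc.dns_min_time_between_resolutions_ms` rate-limits DNS re-resolution. gRPC C-core, which grpcio wraps, defaults this to 30 seconds. On the server, `grpc.max_connection_age_ms` closes connections once they reach a given age. This prompts clients to reconnect, which typically also triggers re-resolution. `grpc.max_connection_age_grace_ms` gives in-flight RPCs time to finish. This is a sketch that assumes these are the relevant knobs for this deployment. The values are illustrative, not recommendations, and whether they resolve the behavior described may depend on the grpcio version.

```python
from concurrent import futures
from typing import List, Tuple

# Illustrative values only (milliseconds); tune for the actual deployment.
DNS_MIN_RERESOLUTION_MS = 5_000
MAX_CONNECTION_AGE_MS = 60_000
MAX_CONNECTION_AGE_GRACE_MS = 10_000


def client_reresolution_options() -> List[Tuple[str, int]]:
    """Client channel option: lower the minimum interval between DNS re-resolutions."""
    return [("grpc.dns_min_time_between_resolutions_ms", DNS_MIN_RERESOLUTION_MS)]


def server_connection_age_options() -> List[Tuple[str, int]]:
    """Server options: close connections after a maximum age so clients reconnect."""
    return [
        ("grpc.max_connection_age_ms", MAX_CONNECTION_AGE_MS),
        ("grpc.max_connection_age_grace_ms", MAX_CONNECTION_AGE_GRACE_MS),
    ]


def make_server(max_workers: int = 10):
    # Imported lazily so the option builders work without grpcio installed.
    import grpc

    return grpc.server(
        futures.ThreadPoolExecutor(max_workers=max_workers),
        options=server_connection_age_options(),
    )
```

On the client, these options would be combined with the load-balancing option, e.g. `options=[("grpc.lb_policy_name", "round_robin")] + client_reresolution_options()`. On the server, the servicer is registered and `add_insecure_port`/`start` are called on the object returned by `make_server()` as usual.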
Sample client below:
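import time

import grpc
import test_pb2_grpc  # generated from the service's .proto

# Target as posted; in the real setup this is the DNS name listing all server instances.
channel = grpc.insecure_channel("localhost:10000", options=(("grpc.lb_policy_name", "round_robin"),))
fut = grpc.channel_ready_future(channel)
fut.result()  # block until the channel reports READY
print("done waiting")
stub = test_pb2_grpc.TestStub(channel)
# `request` is a test_pb2 request message built elsewhere (not shown).
while True:
    try:
        print(stub.Test(request, timeout=5))
    except grpc.RpcError as e:
        print("{} {}".format(time.time(), e))
    time.sleep(0.5)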
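One client-side change that may help while no subchannel is connected is to pass `wait_for_ready=True` to the stub call. This makes the RPC wait for the channel to become READY, up to its deadline, instead of failing fast with UNAVAILABLE. The sketch below is the sample client's polling loop written against any callable with the stub method's calling convention, so it can be exercised without a server. `poll` and its parameters are hypothetical names. In real use, `call` would be `stub.Test` and `error_type` would be `grpc.RpcError`. This does not make the channel discover the new servers. It only changes how RPCs behave while reconnection is in progress, so persistent DEADLINE_EXCEEDED errors would still point to the channel never reaching the new instances.

```python
import time
from typing import Any, Callable, List, Tuple, Type


def poll(
    call: Callable[..., Any],
    request: Any,
    *,
    iterations: int,
    error_type: Type[BaseException] = Exception,
    timeout_s: float = 5.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[bool, Any]]:
    """Issue `iterations` RPCs, recording (succeeded, response_or_error) for each."""
    results: List[Tuple[bool, Any]] = []
    for _ in range(iterations):
        try:
            # wait_for_ready=True queues the RPC until the channel is READY or the
            # deadline expires, rather than failing immediately with UNAVAILABLE.
            response = call(request, timeout=timeout_s, wait_for_ready=True)
            results.append((True, response))
        except error_type as e:
            results.append((False, e))
        sleep(interval_s)
    return results
```

For example, `poll(stub.Test, request, iterations=100, error_type=grpc.RpcError)` returns one entry per attempt. The `sleep` parameter exists only so the loop can be tested without real delays.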