DNS requery not happening with round_robin load balancing in docker swarm

daniel...@noknow.nz

Oct 26, 2017, 1:42:35 AM
to grpc.io
I'm trying to get a gRPC client/server running using Node.js under Docker swarm.

The gRPC server is set up using Docker's `endpoint_mode: dnsrr`, so that DNS has multiple A records (one for each instance of the server). The gRPC client runs as another Docker service, created with the gRPC channel option `{'grpc.lb_policy_name': 'round_robin'}`. With this setup, everything runs nicely and requests are load balanced between the gRPC server instances.
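For reference, the relevant part of the stack file looks roughly like this (service and image names here are placeholders, not my actual project's names):

```yaml
version: "3.3"
services:
  pingserver:
    image: pingserver:latest
    deploy:
      replicas: 2
      # dnsrr: each replica gets its own A record under the service name,
      # instead of a single virtual IP (vip) in front of the tasks.
      endpoint_mode: dnsrr
  pingclient:
    image: pingclient:latest
```

With `dnsrr`, a lookup of `pingserver` from inside the overlay network returns one A record per running task, which is what lets the client-side round_robin policy see all the backends.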

If I restart the gRPC servers and they come back with the same IP addresses they previously had, everything continues to work well: the client is able to re-establish connectivity. However, if the IP addresses change, the client never manages to establish contact with the new server instances.

My understanding was that as soon as all subchannels were unavailable, gRPC was supposed to re-query DNS. This doesn't appear to be happening.

I was using version 1.6.6, but have also tried 1.7.0-pre1 with the same results.

Appreciate any help or guidance anyone is able to provide.

Regards,
Daniel

Michael Lumish

Oct 26, 2017, 12:03:33 PM
to daniel...@noknow.nz, grpc.io
To clarify, are you saying that after your client loses its connection to every server, it never reestablishes a connection with any of them?

To view this discussion on the web visit https://groups.google.com/d/msgid/grpc-io/e180fc34-0dd0-4523-84e3-1f387651f01f%40googlegroups.com.

daniel...@noknow.nz

Oct 26, 2017, 4:01:17 PM
to grpc.io
Hi Michael,


On Friday, October 27, 2017 at 5:03:33 AM UTC+13, Michael Lumish wrote:
> To clarify, are you saying that after your client loses its connection to every server, it never reestablishes a connection with any of them?


Yes, exactly.  

I have a small project and a bash script that demonstrates this. If you've got a Linux/Mac machine with Node.js and Docker running in swarm mode, I can share it with you. Essentially it starts 1 client and 2 server instances (the client just sends a 'ping' request every 2 seconds; the server sends a response indicating which server replied). All runs well and shows load balancing between them. Then the script shuts down both instances of the server and starts them up again in a way that forces them onto different IP addresses. The client is never able to reconnect with the 2 new instances on the different IP addresses.

Regards,
Daniel

David Garcia Quintas

Nov 1, 2017, 7:48:15 PM
to grpc.io
Hi Daniel,

Do the port numbers of the servers also change? If not, it'd be helpful if you could provide me with the logs produced when running with the following environment variables set: `GRPC_VERBOSITY=debug GRPC_TRACE=client_channel,round_robin`
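Concretely, set these in the environment of the client process before starting it (the trace output goes to stderr):

```shell
# Turn on gRPC-core debug logging and trace the client channel
# plus the round_robin LB policy.
export GRPC_VERBOSITY=debug
export GRPC_TRACE=client_channel,round_robin
```

Then launch the client as usual and capture stderr, e.g. `node client.js 2> pingclient.log` (where `client.js` stands in for the actual entry point).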

David Garcia Quintas

Nov 1, 2017, 8:42:41 PM
to grpc.io
Another consideration: has DNS been updated to reflect the new servers' IPs? The way things work is:
  • (DNS) names such as foo.com resolve to a set of IP addresses
  • LB policies are created over that set of IP addresses (internally there's one subchannel per IP)
  • If all subchannels go into shutdown (e.g. when all servers die), the LB policy will also die. This will result in a request for re-resolution of the name under which the channel was created (in this case, the DNS name). A new LB policy will be created from the results of this re-resolution.
Which is why, if 1) port numbers change (DNS doesn't provide port information) and/or 2) DNS doesn't resolve to the servers' new addresses, the new LB policy won't contain valid subchannels.
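The lifecycle above can be sketched as a toy model (this is an illustration of the described behavior, not the actual gRPC C-core code):

```javascript
// Toy model: one subchannel per resolved address; once every subchannel
// has shut down, the policy dies and re-resolution of the name is requested.
class RoundRobinPolicy {
  constructor(addresses, requestReresolution) {
    this.requestReresolution = requestReresolution;
    this.subchannels = addresses.map((addr) => ({ addr, state: 'CONNECTING' }));
  }

  // Called when the server behind `addr` goes away.
  onSubchannelShutdown(addr) {
    const sc = this.subchannels.find((s) => s.addr === addr);
    if (sc) sc.state = 'SHUTDOWN';
    if (this.subchannels.every((s) => s.state === 'SHUTDOWN')) {
      // All subchannels dead: ask for a fresh resolution; a new policy
      // would then be built from whatever addresses come back.
      this.requestReresolution();
    }
  }
}

let reresolutions = 0;
const policy = new RoundRobinPolicy(
  ['10.0.0.4:3000', '10.0.0.5:3000'],
  () => { reresolutions += 1; }
);

policy.onSubchannelShutdown('10.0.0.4:3000'); // one server dies: nothing yet
policy.onSubchannelShutdown('10.0.0.5:3000'); // last one dies: re-resolve
console.log(reresolutions); // 1
```

Note that if the fresh resolution returns addresses with no live backends, the new policy starts out with dead subchannels, which is the failure mode discussed below in this thread.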

daniel...@noknow.nz

Nov 2, 2017, 3:59:26 AM
to grpc.io
Hi David,

The ports don't change; they remain the same (port 3000) for all server instances. DNS is updated with the new IP addresses to show this; the logs show the results of DNS lookups (by Node.js) when each request is sent to the server. A log from the client with debug logging enabled is attached.

Hope you are able to spot something.

Regards,
Daniel
pingclient.log
pingserver.log
timeline.log

David Garcia Quintas

Nov 7, 2017, 7:43:38 PM
to grpc.io
Hey Daniel,

I see what's going on: 10.0.0.{4,5} go down and re-resolution is triggered, but before DNS has been updated, so round robin gets 10.0.0.{4,5} again. These addresses will never again have a backend behind them, so RR gets caught in a retry loop, unaware of the DNS update pointing to the available 10.0.0.{7,8}. The fix for this issue is almost ready to be merged (see here), and consists of actively re-requesting resolution in more cases, not only when the LB policy goes from healthy to unhealthy (right now we don't re-resolve if we stay unhealthy without ever having been healthy).
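The race can be illustrated with a toy timeline (again a sketch of the described behavior, not the real implementation): the single re-query fires before swarm DNS has been updated, and because the policy never becomes healthy afterwards, DNS is never asked again.

```javascript
// DNS answers over time: the re-query triggered when all subchannels die
// races the swarm DNS update and still returns the old addresses; the
// updated answer exists but is never requested under the old behavior.
const dnsAnswers = [
  ['10.0.0.4:3000', '10.0.0.5:3000'], // stale: servers already gone
  ['10.0.0.7:3000', '10.0.0.8:3000'], // fresh: never looked up
];
let lookups = 0;
const resolve = () => dnsAnswers[Math.min(lookups++, dnsAnswers.length - 1)];

// Old behavior: exactly one re-resolution, on the healthy -> unhealthy
// transition. The stale addresses have no backends, so round_robin spins
// in a connect-retry loop without ever asking DNS again.
const addresses = resolve();

console.log(lookups);   // 1
console.log(addresses); // [ '10.0.0.4:3000', '10.0.0.5:3000' ]
```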

tl;dr: please check back once https://github.com/grpc/grpc/pull/12829 has been merged, which should happen within 1-2 weeks. I'm adding Juanli (the author of that change) to this thread.

daniel...@noknow.nz

Nov 8, 2017, 10:28:23 PM
to grpc.io
Hi David,

That's great news, and it makes perfect sense why it wasn't working. I'll keep an eye out for the PR to be merged, test again, and let you know the outcome.