grpc stops forward progress if DNS resolve has 0 addresses


Peter Hurley

Jul 29, 2022, 5:27:45 PM
to grpc.io
Hi,

We're trying to debug/repro a problem we observe in production that is triggered by an empty c-ares resolution.

What we observe is that the client channel connection stalls if the DNS resolution for the hostname comes back empty, i.e., the server list is empty:
   2022-04-12_04:43:14.45451 I0412 04:43:14.454464756 5638 pick_first.cc:147] Pick First 0x3bf5400 created.
   2022-04-12_04:43:14.45455 I0412 04:43:14.454521631 5638 pick_first.cc:266] Pick First 0x3bf5400 received update with 0 addresses
   2022-04-12_04:43:14.45461 I0412 04:43:14.454584964 5638 subchannel_list.h:363] [pick_first 0x3bf5400] Creating subchannel list 0x3c5a6c0 for 0 subchannels

No further activity for that client channel occurs.

We've been unable to reproduce this failure in testing, and would appreciate any pointers:
  • what is supposed to re-kick a new DNS resolve if the server list is empty?
  • where to check in the resolver code for an empty server list?
  • or any other ideas for how to track down the problem

We're using grpc v1.36.4 w/ libcares2 1.14

Regards,
Peter Hurley

AJ Heller

Aug 5, 2022, 8:35:21 PM
to grpc.io
That's mysterious. Do you know what the state of the DNS records is when this occurs? And would it be possible for you to upgrade your gRPC library and try to reproduce this? v1.36.4 is over a year old, and a fair handful of bug fixes have gone in since then.

> We've been unable to reproduce this failure in testing, and would appreciate any pointers:

Regarding that, are you able to reproduce the conditions in which the failure occurs, or are they maybe not fully understood? e.g., run a local DNS server for testing, and modify its records.

Peter Hurley

Aug 10, 2022, 9:41:07 AM
to AJ Heller, grpc.io
Thanks for the reply.

> And would it be possible for you to upgrade your gRPC library and try to reproduce this? 
I didn't see any similar issue (marked fixed or not) in https://github.com/grpc/grpc/issues; we were hoping the community could confirm whether this has been observed and fixed already but went unreported on GitHub.

> v1.36.4 is over a year old, and a fair handful of bug fixes have gone in since then.
We're using the still-experimental TLSCredentials, so every version bump is non-trivial, and we've already found and fixed a number of core bugs ourselves, so it'll be a while before we upgrade again in production.

> Regarding that, are you able to reproduce the conditions in which the failure occurs, or are they maybe not fully understood? e.g., run a local DNS server for testing, and modify its records.
Yeah, the exact conditions are not well understood, but it is almost certainly happening during a restart of the local caching dnsmasq server due to intermittent connection loss.



Mark D. Roth

Aug 17, 2022, 1:35:47 PM
to Peter Hurley, AJ Heller, grpc.io
Can you try running with the following environment variables set, and share the log?  That might help us figure out what's going on here.

GRPC_VERBOSITY=DEBUG
GRPC_TRACE=client_channel_routing,pick_first,cares_resolver

In general, the c-ares resolver should return an error when there's an empty address list, so it should automatically retry the resolution periodically until it succeeds.  The only exception I see in the code is if there are balancer addresses successfully returned, but that shouldn't be the case if you're using pick_first.  Unless maybe you're using a service config in DNS, but the service config lookup is failing also?

Anyway, getting some additional logs will probably help us understand what's going wrong here.
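
To make the expected behavior described above concrete, here is a minimal toy sketch in Python. It is not gRPC's actual resolver code: `resolve_with_retry`, `EmptyResolution`, the injected `lookup` and `sleep` callables, and the backoff values are all hypothetical names and numbers chosen for illustration. It only models the idea that an empty address list is treated as an error and retried with backoff until a non-empty list comes back.

```python
from typing import Callable, List, Optional


class EmptyResolution(Exception):
    """Hypothetical error: the lookup succeeded but yielded no addresses."""


def resolve_with_retry(
    lookup: Callable[[], List[str]],
    sleep: Callable[[float], None],
    initial_backoff: float = 1.0,   # illustrative value, not a gRPC default
    multiplier: float = 2.0,        # illustrative value, not a gRPC default
    max_backoff: float = 30.0,      # illustrative value, not a gRPC default
    max_attempts: Optional[int] = None,
) -> List[str]:
    """Call lookup() until it returns a non-empty address list.

    An empty list is handled the same way as a lookup failure: wait for
    the current backoff, grow the backoff up to max_backoff, and retry.
    If max_attempts is set, the last error is re-raised once it is reached.
    """
    backoff = initial_backoff
    attempt = 0
    while True:
        attempt += 1
        try:
            addresses = lookup()
            if not addresses:
                # Treat "0 addresses" as an error rather than a valid result.
                raise EmptyResolution("lookup returned 0 addresses")
            return addresses
        except (EmptyResolution, OSError):
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(backoff)
            backoff = min(backoff * multiplier, max_backoff)
```

In a real program `sleep` would typically be `time.sleep` and `lookup` would wrap an actual DNS query; both are passed in here so the retry logic can be exercised without a network or real delays. The stall reported in this thread corresponds to the case where nothing like this loop is driving a new lookup after the empty result.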



--
Mark D. Roth <ro...@google.com>
Software Engineer
Google, Inc.

Chi Jameson

Aug 29, 2022, 6:25:06 PM
to grpc.io
Hello!

We've been able to locate where the client channel stops attempting to reconnect, but haven't found how or why the c-ares resolver successfully passes a 0-address list to the pick_first load balancer. What appears to be happening is that it hits this 0 addresses check and reports TRANSIENT_FAILURE, but then the client channel never responds beyond that. We've seen this same freeze happen in v1.46.4 at the same subchannel list check, but I don't know whether it's possible to hit that if statement in practice, since my reproduction method is simply to pass an empty address list to that method. Actually getting an empty c-ares result to reproduce the behavior we're seeing has proven difficult, as most of the time the resolver behaves as you mentioned.

What normally listens for the UpdateState from the channel_control_helper? That might give us a good hint for why the client channel stops after that point.

Thanks!
Chi
Cisco Meraki

Mark D. Roth

Aug 31, 2022, 11:19:44 AM
to Chi Jameson, grpc.io
Looking at our code more closely, it looks like there is a bug here.  If the resolver returns an error for the addresses on the very first resolution attempt, it looks like we will get into a state where nothing will re-resolve.

It looks like this bug has been here for a long time, so I'm surprised no one has run into it until now.  It definitely needs to be fixed, but it'll take a bit of work to make all the pieces work together the right way.  Can you please file an issue and tag me on it?

Thanks very much for reporting this!
