Hello,
I was setting up a test Consul cluster and testing round robin DNS when I made a startling discovery about the default behaviour of many apps that have switched from gethostbyname() to getaddrinfo() for name resolution. getaddrinfo() sorts the returned addresses internally. When it is handed a randomized round robin list, it sorts that list into a nearly deterministic order, so only one or two hosts receive almost all of the traffic while some hosts are never hit. I encountered this while testing a service with 3 members by running curl in a loop on an Ubuntu trusty host. The issue is widespread because it comes from upstream glibc implementing rule 9 of RFC 3484. It has been an issue for over a decade, but it is becoming more of a problem as more and more applications and libraries switch to getaddrinfo() for IPv6 capability, believing it is a drop-in replacement for gethostbyname(). This seriously limits the usefulness of Consul for load balancing internal services for legacy applications.
Patching glibc on all of our hosts is impractical given the many distros involved. It occurs to me that the best workaround is a Consul configuration option that simply breaks out of the for loop after a single entry. This gives getaddrinfo() nothing to sort and restores effective round robin behaviour for the applications. The only downside is that applications with failover capability no longer get a list of addresses to work through. This should be mitigated by Consul health checks, and it is the unfortunate price of getting our broken round robin back. Below is the local diff. Let me know if I should do a pull request.
diff --git a/command/agent/config.go b/command/agent/config.go
index c036591..664010c 100644
--- a/command/agent/config.go
+++ b/command/agent/config.go
@@ -76,6 +76,14 @@ type DNSConfig struct {
// returned by default for UDP.
EnableTruncate bool `mapstructure:"enable_truncate"`
+ // EnableSingleton is used to override default behavior
+ // of DNS and return only a single host in a round robin
+ // DNS service request rather than all healthy service
+ // members. This is a work around for systems using
+ // getaddrinfo using rule 9 sorting from RFC3484 which
+ // breaks round robin DNS.
+ EnableSingleton bool `mapstructure:"enable_singleton"`
+
// MaxStale is used to bound how stale of a result is
// accepted for a DNS lookup. This can be used with
// AllowStale to limit how old of a value is served up.
@@ -1034,6 +1042,9 @@ func MergeConfig(a, b *Config) *Config {
if b.DNSConfig.EnableTruncate {
result.DNSConfig.EnableTruncate = true
}
+ if b.DNSConfig.EnableSingleton {
+ result.DNSConfig.EnableSingleton = true
+ }
if b.DNSConfig.MaxStale != 0 {
result.DNSConfig.MaxStale = b.DNSConfig.MaxStale
}
diff --git a/command/agent/dns.go b/command/agent/dns.go
index 33db8ba..1cee5c7 100644
--- a/command/agent/dns.go
+++ b/command/agent/dns.go
@@ -665,6 +665,9 @@ func (d *DNSServer) serviceNodeRecords(nodes structs.CheckServiceNodes, req, res
if records != nil {
resp.Answer = append(resp.Answer, records...)
}
+ if d.config.EnableSingleton {
+ break
+ }
}
}