We were testing a T1 gateway for failover on an 18.04 lab system and noticed this issue. The lab system runs release 18.04 in an HA configuration with three servers:
- 10.20.2.34 Primary
- 10.20.2.35 Secondary
- 10.20.2.36 Secondary
The phones are four Polycom VVX300/1/400 units running firmware release 5.5.2. DNS is the standard sipXcom default configuration - no custom DNS rules.
10.20.2.35/.36 are added as domain aliases in the system.
I first tested inbound and outbound calls with all servers in the HA cluster active to confirm there were no issues, then shut down the primary server. Calls from three of the four phones would not work - either via the gateway or to each other. I port-mirrored the switch and captured packet traces to see whether the problem phones were issuing INVITEs to the sipXcom secondary servers - they were not. A mongo db.registrar.find() query showed that the three problem phones were registered to the primary server, while the one working phone was registered to a secondary server. I enabled DNS debug on one of the problem Polycom phones - here is sample output:
0708164602|dns |1|00|doDnsSrvLookupForARecordList(tcp): Doing DNS lookup (port 5060)
0708164602|dns |1|00|doDnsLookupForList for record A: hostname '10.20.2.34' attempting..
0708164602|dns |1|00|doDnsLookupForList(A): returning passed in ipAddress '10.20.2.34'
0708164602|dns |1|00|doDnsSrvLookupForARecordList(tcp): kept port at 5060
0708164602|dns |1|00|doDnsLookupForList for record A: hostname '10.20.2.34' attempting..
0708164602|dns |1|00|doDnsLookupForList(A): returning passed in ipAddress '10.20.2.34'
0708164607|sip |4|00|Failed to connect to [10.20.2.34:5060] : Error[Operation now in progress]
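The "Failed to connect" line above is the phone still trying TCP 5060 on the downed primary. The same check can be reproduced from any host on the LAN with a plain socket probe (a generic sketch, not Polycom code):

```python
import socket

def sip_tcp_reachable(host, port=5060, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the
    timeout, False otherwise (refused, unreachable, or timed out -- the
    state the phone log above is reporting)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the primary shut down, `sip_tcp_reachable("10.20.2.34")` should come back False while the two secondaries still answer - confirming the failure is in server selection, not connectivity.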
I used the same phones to connect to a 17.04 HA cluster and noticed two things. First, on the 17.04 system (default DNS configuration with equal weightings), the phones registered in round-robin fashion: the first phone registered to 10.20.2.34, the second to 10.20.2.35, the third to 10.20.2.36, and the fourth to 10.20.2.34. On the 18.04 system, by contrast, one phone registered to 10.20.2.35 and all the other phones registered to the 10.20.2.34 primary server - not what I expected. I then shut down the primary server on the 17.04 cluster and enabled DNS debug on one of the phones - here is the output:
0708165408|dns |2|00|Found DNS cache entry for lvtest.com(A)
0708165408|dns |1|00|doDnsLookupForList for record SRV: hostname '_sip._tcp.lvtest.com' attempting..
0708165408|dns |2|00|Hit in Dynamic DNS cache for _sip._tcp.voip1.flintgrp.com(SRV) expires in 1733 seconds
0708165408|dns |2|00|dnsRandomizeSubset: For sum 30 1 byte rand # (reg 0.117647 x randNum 25) = 2.941176 -> ceil 3.000000 final 3
0708165408|dns |1|00|dnsRandomizeSubset: SRV selected 1 and inserted at -1 priority 30 weight 10 sum 10 target 'pbx.lvtest.com'
0708165408|dns |2|00|dnsRandomizeSubset: For sum 20 1 byte rand # (reg 0.078431 x randNum 52) = 4.078432 -> ceil 5.000000 final 5
0708165408|dns |1|00|dnsRandomizeSubset: SRV selected 2 and inserted at -1 priority 30 weight 10 sum 10 target 'pbx3.lvtest.com'
0708165408|dns |2|00|dnsRandomizeSubset: For sum 30 1 byte rand # (reg 0.117647 x randNum 237) = 27.882353 -> ceil 28.000000 final 28
0708165408|dns |1|00|dnsRandomizeSubset: SRV selected 1 and inserted at -1 priority 30 weight 10 sum 30 target 'pbx2.lvtest.com'
0708165408|dns |2|00|dnsRandomizeSubset: For sum 20 1 byte rand # (reg 0.078431 x randNum 145) = 11.372549 -> ceil 12.000000 final 12
0708165408|dns |1|00|dnsRandomizeSubset: SRV selected 2 and inserted at -1 priority 30 weight 10 sum 20 target 'pbx3.lvtest.com'
0708165408|dns |1|00|dnsSrv2A calling LookupForList(A) for 'pbx2.lvtest.com' port 5060
0708165408|dns |1|00|doDnsLookupForList for record A: hostname 'pbx2.lvtest.com' attempting..
0708165408|dns |2|00|Hit in Dynamic DNS cache for pbx2.lvtest.com(A) expires in 1733 seconds
0708165408|dns |1|00|dnsSrv2A calling LookupForList(A) for 'pbx3.lvtest.com' port 5060
0708165408|dns |1|00|doDnsLookupForList for record A: hostname 'pbx3.lvtest.com' attempting..
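The dnsRandomizeSubset lines above show the phone doing RFC 2782-style ordering of the SRV answers: records are sorted by ascending priority, and within a priority group each record's chance of being tried first is proportional to its weight. A minimal sketch of that selection logic in Python (the record shape and function name are my own, not the Polycom implementation):

```python
import random

def srv_order(records):
    """Order SRV records per RFC 2782: ascending priority first, then
    weighted-random selection within each priority group (higher
    weight -> more likely to be chosen earlier)."""
    ordered = []
    by_priority = {}
    for rec in records:
        by_priority.setdefault(rec["priority"], []).append(rec)
    for priority in sorted(by_priority):
        group = list(by_priority[priority])
        while group:
            # Pick a point in [0, total weight]; the record whose running
            # weight sum first reaches it is selected next.
            total = sum(r["weight"] for r in group)
            pick = random.uniform(0, total)
            running = 0
            for r in group:
                running += r["weight"]
                if pick <= running:
                    ordered.append(r)
                    group.remove(r)
                    break
    return ordered
```

With the default configuration (all servers at equal priority and weight), every target is equally likely to end up first in the list, which is exactly what produces the round-robin-looking spread of registrations seen on 17.04.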
I eyeballed the default DNS records on the two HA clusters and did not notice any anomalies. I did notice that both 18.04 and 17.04 are running CentOS 6.7, but the 18.04 system is on a later CentOS build. Given that 18.04 neither registers the phones in round-robin fashion across all servers nor respects the priorities in the default configuration, I wonder whether a DNS gremlin has crept into 18.04.
I'd be glad to open a Jira on this issue.
All the best
Peter