Phones Fail in Release 18.04 High Availability


pmkr...@gmail.com

Jul 8, 2018, 5:27:26 PM
to sipxcom-users
We were testing a T1 gateway for failover using an 18.04 lab system and noticed this issue. The lab system is running release 18.04 HA with three servers:
  • 10.20.2.34 Primary
  • 10.20.2.35 Secondary
  • 10.20.2.36 Secondary
The four phones are Polycom VVX 300/301/400 models running firmware release 5.5.2. DNS uses the standard Sipxcom defaults, with no custom DNS rules. 10.20.2.35/26 are added as domain aliases into the system.

I first tested inbound and outbound calls with all servers in the HA cluster active to confirm there were no issues. The primary server was then shut down. Calls from 3 of the 4 phones failed, both via the gateway and between the phones. I port-mirrored the switch and captured packet traces to see whether the problem phones were sending INVITEs to the Sipxcom secondary servers; they were not. A mongo db.registrar.find() query showed that the three problem phones were registered to the primary server, while the one working phone was registered to a secondary server. I enabled DNS debug on one of the problem Polycom phones; sample output follows:

0708164602|dns  |1|00|doDnsSrvLookupForARecordList(tcp): Doing DNS lookup (port 5060)
0708164602|dns  |1|00|doDnsLookupForList for record A: hostname '10.20.2.34' attempting..
0708164602|dns  |1|00|doDnsLookupForList(A): returning passed in ipAddress '10.20.2.34'
0708164602|dns  |1|00|doDnsSrvLookupForARecordList(tcp): kept port at 5060
0708164602|dns  |1|00|doDnsLookupForList for record A: hostname '10.20.2.34' attempting..
0708164602|dns  |1|00|doDnsLookupForList(A): returning passed in ipAddress '10.20.2.34'
0708164607|sip  |4|00|Failed to connect to [10.20.2.34:5060] : Error[Operation now in progress]
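The log above shows the phone being handed '10.20.2.34' as a hostname and returning it unchanged, with no SRV lookup. A minimal sketch of how such lines could be spotted in a captured log follows. It assumes the pipe-delimited layout seen above (timestamp, component, level, a fourth field, message); the function name and the SAMPLE list are illustrative only and not part of any Polycom or Sipxcom tooling.

```python
import re
from typing import Iterable, List

# Message text taken from the log excerpt above.
PASSED_IN_IP = re.compile(r"returning passed in ipAddress '([^']+)'")


def literal_ip_targets(lines: Iterable[str]) -> List[str]:
    """Return the distinct IP literals the phone resolved without a DNS query."""
    found: List[str] = []
    for line in lines:
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue  # not in the assumed five-field layout
        component, message = parts[1].strip(), parts[4]
        if component != "dns":
            continue
        match = PASSED_IN_IP.search(message)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found


SAMPLE = [
    "0708164602|dns  |1|00|doDnsLookupForList for record A: hostname '10.20.2.34' attempting..",
    "0708164602|dns  |1|00|doDnsLookupForList(A): returning passed in ipAddress '10.20.2.34'",
    "0708164607|sip  |4|00|Failed to connect to [10.20.2.34:5060] : Error[Operation now in progress]",
]

if __name__ == "__main__":
    print(literal_ip_targets(SAMPLE))
```

A non-empty result suggests the phone was given a fixed address for its SIP target rather than a domain name, so DNS SRV failover would not apply to it.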

I then connected the same phones to a 17.04 HA cluster and noticed a difference in registration behavior. On the 17.04 system (default DNS configuration with equal weightings), phones registered in a round-robin fashion: the first phone registered to 10.20.2.34, the second to 10.20.2.35, the third to 10.20.2.36, and the fourth to 10.20.2.34. On the 18.04 system, one phone registered to 10.20.2.35 and all other phones registered to the 10.20.2.34 primary server, which is not what I expected. I then shut down the primary server on the 17.04 cluster and enabled DNS debug on one of the phones; here is the output:

0708165408|dns  |2|00|Found DNS cache entry for lvtest.com(A)
0708165408|dns  |1|00|doDnsLookupForList for record SRV: hostname '_sip._tcp.lvtest.com' attempting..
0708165408|dns  |2|00|Hit in Dynamic DNS cache for _sip._tcp.voip1.flintgrp.com(SRV) expires in 1733 seconds
0708165408|dns  |2|00|dnsRandomizeSubset: For sum 30 1 byte rand # (reg 0.117647 x randNum 25) = 2.941176 -> ceil 3.000000 final 3
0708165408|dns  |1|00|dnsRandomizeSubset: SRV selected 1 and inserted at -1 priority    30 weight    10 sum    10 target pbx.lvtest.com'
0708165408|dns  |2|00|dnsRandomizeSubset: For sum 20 1 byte rand # (reg 0.078431 x randNum 52) = 4.078432 -> ceil 5.000000 final 5
0708165408|dns  |1|00|dnsRandomizeSubset: SRV selected 2 and inserted at -1 priority    30 weight    10 sum    10 target 'pbx3.lvtest.com'
0708165408|dns  |2|00|dnsRandomizeSubset: For sum 30 1 byte rand # (reg 0.117647 x randNum 237) = 27.882353 -> ceil 28.000000 final 28
0708165408|dns  |1|00|dnsRandomizeSubset: SRV selected 1 and inserted at -1 priority    30 weight    10 sum    30 target 'pbx2.lvtest.com'
0708165408|dns  |2|00|dnsRandomizeSubset: For sum 20 1 byte rand # (reg 0.078431 x randNum 145) = 11.372549 -> ceil 12.000000 final 12
0708165408|dns  |1|00|dnsRandomizeSubset: SRV selected 2 and inserted at -1 priority    30 weight    10 sum    20 target 'pbx3.lvtest.com''
0708165408|dns  |1|00|dnsSrv2A calling LookupForList(A) for 'pbx2.lvtest.com' port 5060
0708165408|dns  |1|00|doDnsLookupForList for record A: hostname 'pbx2.lvtest.com' attempting..
0708165408|dns  |2|00|Hit in Dynamic DNS cache for pbx2.lvtest.com(A) expires in 1733 seconds
0708165408|dns  |1|00|dnsSrv2A calling LookupForList(A) for 'pbx3.lvtest.com' port 5060
0708165408|dns  |1|00|doDnsLookupForList for record A: hostname 'pbx3.lvtest.com' attempting..
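The dnsRandomizeSubset lines above appear to show the phone picking among SRV records of equal priority and weight. As a rough illustration, here is a sketch of the weighted ordering described in RFC 2782. This is not the Polycom implementation, whose internals are not visible from the log; the priority, weight, port, and target names are copied from the log output, and the class and function names are made up for the example.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records: List[SrvRecord],
              rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Order SRV records: lowest priority first, weighted random within a priority."""
    rng = rng or random.Random()
    ordered: List[SrvRecord] = []
    for priority in sorted({r.priority for r in records}):
        # RFC 2782 places weight-0 records at the front of the candidate list.
        pool = sorted((r for r in records if r.priority == priority),
                      key=lambda r: r.weight != 0)
        while pool:
            total = sum(r.weight for r in pool)
            pick = rng.randint(0, total)  # uniform over 0..total inclusive
            running = 0
            for i, rec in enumerate(pool):
                running += rec.weight
                if running >= pick:
                    ordered.append(pool.pop(i))
                    break
    return ordered


if __name__ == "__main__":
    cluster = [
        SrvRecord(30, 10, 5060, "pbx.lvtest.com"),
        SrvRecord(30, 10, 5060, "pbx2.lvtest.com"),
        SrvRecord(30, 10, 5060, "pbx3.lvtest.com"),
    ]
    for rec in order_srv(cluster):
        print(rec.target)
```

With equal weights, each run produces a different ordering, which would account for registrations spreading across the three servers and for a phone moving on to the next target when the first one is down.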

I eyeballed the default DNS records on the two HA clusters and did not notice any anomalies. Both 18.04 and 17.04 are running CentOS 6.7, but the 18.04 system is on a later CentOS build. Given that 18.04 doesn't register the phones in a round-robin fashion across all servers or respect the priorities in the default configuration, I wonder whether a DNS gremlin has crept into 18.04.

Be glad to open a Jira on this issue.

All the best
Peter

iuliu...@ezuce.com

Jul 11, 2018, 6:51:07 AM
to sipxcom-users
Hi, Peter, can you confirm that DNS was enabled on the secondary servers?

iuliu...@ezuce.com

Jul 11, 2018, 8:41:49 AM
to sipxcom-users
Peter, we have tried to reproduce the issue you reported in the lab but did not get the same results. On a three-server cluster with 4 phones, the registrations were spread on all three nodes. Shutting down the primary causes the first calls from each phone to be completed with a slight delay, due to the time necessary for the primary DNS to time out. It looks like the phones in your configuration cannot reach another DNS server, either because the DNS service is not running on the remaining servers or because the DHCP service did not provide them to the phone.

pmkr...@gmail.com

Jul 11, 2018, 12:38:12 PM
to sipxcom-users
Many thanks for the follow-up here - much appreciated. I moved the phones back to the 18.04 server and tested again - the issue still persists. I validated that the phones are getting the right domain name addresses via DHCP (bootp response) and that DNS is running on the secondary servers. My next step is to restart the DNS servers and test again. Let me know if there are diagnostics you would like me to capture.

All the best
Peter

iuliu...@ezuce.com

Jul 12, 2018, 3:19:32 AM
to sipxcom-users
Hi, Peter, can you try manually adding one of the other servers in the cluster as DNS Alt. Server in the phone's Advanced->Network Configuration? In theory this scenario should occur only if the phone does not receive or use DNS data properly. Should the primary DNS fail, the phone has to go to the next DNS server automatically once the request times out. Subsequent requests are made to the secondary DNS, so the next call(s) should be processed instantly. Also, please allow a few seconds or tens of seconds for the MongoDB re-election to take place. Once the heartbeat between the Mongo nodes fails, a re-election of the primary is triggered. The cluster functions only when a primary exists; this is the only way to ensure database consistency.
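The DNS fallback described above can be sketched as follows. This is only an illustration of the expected behavior, not phone firmware code: the query callable is a hypothetical stand-in for a real DNS lookup against one server, and the host name in the usage example is invented.

```python
from typing import Callable, List, Sequence, Tuple


def resolve_with_fallback(
    name: str,
    servers: Sequence[str],
    query: Callable[[str, str], List[str]],
) -> Tuple[str, List[str]]:
    """Try each DNS server in order; return (server_used, addresses)."""
    errors: List[str] = []
    for server in servers:
        try:
            return server, query(server, name)
        except OSError as exc:  # TimeoutError is a subclass of OSError
            errors.append(f"{server}: {exc}")
    raise OSError("all DNS servers failed: " + "; ".join(errors))


if __name__ == "__main__":
    def fake_query(server: str, name: str) -> List[str]:
        # Hypothetical stand-in: the primary is down, the others answer.
        if server == "10.20.2.34":
            raise TimeoutError("no response")
        return ["10.20.2.35"]

    print(resolve_with_fallback("pbx.example.test",
                                ["10.20.2.34", "10.20.2.35"], fake_query))
```

Each unreachable server costs one timeout before the next one is tried, which matches the slight delay on the first call after the primary goes down.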

pmkr...@gmail.com

Jul 12, 2018, 11:34:40 AM
to sipxcom-users
I got to the bottom of this issue - this 18.04 lab system was also being used to test BLA in an HA environment (I had completely forgotten about this). I received an update to an old Jira saying that changing the firewall settings for sipxsaa from cluster to public would allow BLA to work again without provisioning the Phone->SIP Servers->IP address. This works (and thank you for making BLA work again in HA/standalone environments without this workaround). But in testing BLA on HA, I found that shared lines would work for anywhere from several hours to 2-3 days and then fail again. So I hardcoded the Phone->SIP Servers->IP address with the address of the primary server - this provisions the voIpProt.SIP.outboundProxy.address parameter on the phones with shared lines. This change bypasses DNS for those phones in failover scenarios. When I removed this provisioning, failover worked again. Regarding BLA on HA, I'm finding that a restart of the sipxsaa process gets BLA working again on shared lines. I'm trying to gather more information on the HA BLA issue.

Again, many thanks for your assistance. All the best Peter
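For anyone chasing a similar failover problem, a quick way to check for this condition is to look for phones whose voIpProt.SIP.outboundProxy.address is an IP literal rather than a domain name. A minimal sketch follows, assuming the provisioned parameters have already been loaded into a plain dictionary per phone; the loading step, the phone labels, and the function names are hypothetical.

```python
import ipaddress
from typing import Dict, List

# Parameter name as provisioned on the phones with shared lines.
PARAM = "voIpProt.SIP.outboundProxy.address"


def is_ip_literal(value: str) -> bool:
    """True if value parses as an IPv4 or IPv6 address rather than a host name."""
    try:
        ipaddress.ip_address(value.strip())
        return True
    except ValueError:
        return False


def phones_bypassing_dns(configs: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the phones whose outbound proxy is pinned to a single IP address."""
    return sorted(
        phone for phone, params in configs.items()
        if is_ip_literal(params.get(PARAM, ""))
    )


if __name__ == "__main__":
    # Hypothetical data: phone labels and values are for illustration only.
    configs = {
        "phone-a": {PARAM: "10.20.2.34"},
        "phone-b": {},
        "phone-c": {PARAM: "lvtest.com"},
    }
    print(phones_bypassing_dns(configs))
```

Any phone this reports sends its SIP traffic to one fixed server and will not follow DNS SRV records to a secondary when that server is shut down.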

Michael Picher

Jul 12, 2018, 11:42:45 AM
to Peter Krautle, sipxcom-users
Yea, that bla / blf implementation is buggy at best. It keels over around 1000 users too. Just to be clear for others, bla/blf is not HA, just the rest of the clusterable services.

You may want a cron to just restart sipxsaa once in a while (or daily if it lasts that long). We tried to fix it but ultimately we had to build something new that is in the commercial version of code.

Thanks,
  Mike

Michael Picher, VP of Product Management
eZuce, Inc.


pmkr...@gmail.com

Jul 12, 2018, 6:43:51 PM
to sipxcom-users
On small standalone systems for small and medium businesses, BLA and presence have been very stable. BLA on HA will solve some deployments where call park is problematic from an operational perspective (e.g. paging). Yes, the current lab work is to determine the right interval at which to restart sipxsaa.

Peter