Timing issue with Chrome Webrtc/TURN

1,029 views
Skip to first unread message

ste...@ipcortex.co.uk

unread,
May 30, 2015, 5:20:17 AM5/30/15
to discuss...@googlegroups.com
Hi all,

A colleague and I have spent a few of days digging into the issue we have, and while we've reached a reproducible scenario, we cannot explain why. I was hoping someone could suggest where to look next.

We are successfully using JsSIP to connect our Chrome Web-App via a 'coturn' rfc5766 STUN/TURN server to Asterisk. Asterisk does not have TURN configured, but understands relay candidates, and talks to them fine. Calls work fine both ways. In fact we can change many variables such as using TCP and UDP, using TLS or unencrypted STUN and TURN, and it will continue to work as expected, but there is one thing that will reliably break it...

If Chrome is receiving an inbound call, and is the active party (DTLS master for the RTP setup) the network RTT between Chrome and the TURN server must not drop below about 40ms, otherwise Chrome never moves from 'ICEConnectionStateChecking' to 'ICEConnectionStateConnected'.

This is massively reliable so much so that if we add a 50ms artificial delay to outbound packets on the TURN server, everything always works. If we remove the delay so the response drops to 20ms or less, it always fails, and if we set a packet delay to make the RTT about 39ms, then calls fail/succeed about 50% of the time.

When things fail, the RTP and RTCP can be seen flowing correctly from Asterisk via TURN to Chrome, and chrome://webrtc-internals acknowledges that. Outbound from Chrome to TURN, RTCP is being sent, but no RTP. Asterisk is happy in ICE-state-connected, Chrome appears to be stuck. We've taken several Wireshark traces, mostly at the TURN server, and it seems that Chrome simply never decides to send the Channel-Bind commands, which is a precursor to doing the DTLS setup. Up to that point all the traces both good and bad seem to be going okay.

Replacing Chrome with Firefox also works perfectly regardless of the RTT, though that introduces so many variables it may not be a valid data point.

I would love to enable STUN/TURN debugging on Chrome, but none of the methods I've found on-line cause anything useful to be logged :(

Thanks for any pointers,
Steve


ste...@ipcortex.co.uk

unread,
Jun 1, 2015, 1:11:11 PM6/1/15
to discuss...@googlegroups.com
On Saturday, 30 May 2015 10:20:17 UTC+1, ste...@ipcortex.co.uk wrote:
[snip]
I would love to enable STUN/TURN debugging on Chrome, but none of the methods I've found on-line cause anything useful to be logged :(



In the absence of any Chrome debug, I have continue with some very detailed analyses of PCAP traces from both Chrome and from The TURN server, and have a theory as to what makes it work/fail.

When setting up a new Channel, Chrome appears to need to see a successful bind-user going both ways before it will accept a channel as being valid (quite reasonable).

When the sequence goes as follows, all is OK:
- Chrome receives Bind-User from remote via relay
- Chrome sends CreatePerm to relay
- Chrome sends Bind-OK to remote via relay
- Chrome receives CreatePerm-OK from relay
- Chrome sends Bind-User to remote via relay
- Chrome receives Bind-OK from remote via relay
- Chrome receives Bind-User from remote via relay
- Chrome sends Bind-OK to remote via relay
(The order of the last 2 binds/responses is unimportant.)

When the sequence goes as follows, all is UNHAPPY:
- Chrome receives Bind-User from remote via relay
- Chrome sends CreatePerm to relay
- Chrome sends Bind-OK to remote via relay
- Chrome receives CreatePerm-OK from relay
- Chrome sends Bind-User to remote via relay
- Chrome receives Bind-OK from remote via relay
(Note that the remote does not send a repeated Bind-User.)

The implication is the Chrome expects a Bind-User from the remote end AFTER the CreatePerm has completed, and that it is "forgetting" that is already received and acknowledged one. In fact, there is no reason for the far-end to send a 2nd Bind-User unless the first expires. In fact the only reason that the working example works is that the slow-Internet link causes the remote-end to retransmit the first Bind-User because the response is sufficiently late that it thinks it has gone missing!

Comments? Am I missing something? Is the above making any sense?

Thanks.
Steve

Christoffer Jansson

unread,
Jun 3, 2015, 4:52:30 AM6/3/15
to discuss...@googlegroups.com, ste...@ipcortex.co.uk
Hi,

When you mention nothing useful logged, what logging have you enabled, what Chrome version and on what OS are you testing this (Windows currently requires "--no-sandbox"  in order to enable logging and also webrtc logging has been broken there for a while)?

/Chris

ste...@ipcortex.co.uk

unread,
Jun 3, 2015, 5:22:57 AM6/3/15
to discuss...@googlegroups.com, ste...@ipcortex.co.uk
Hi,

I was unaware of the --no-sandbox flag, so I'll give that a go. When I say nothing useful, I mean no mention of STUN/TURN/ICE/BIND/CHANNEL of any sort appears in the logs. I've been using:

--enable-logging --v=4 --vmodule=*source/talk/*=4 --vmodule=*libjingle/source/talk/*=4 --vmodule=*media/audio/*=4

Based on a random list of suggestions in posts back in 2013 and 2014.

I will try with '--no-sandbox' and feed back any more information I can gather.

Thanks,
Steve

ste...@ipcortex.co.uk

unread,
Jun 4, 2015, 7:04:21 AM6/4/15
to discuss...@googlegroups.com
It turns out that what I needed is:

    --no-sandbox --enable-logging --v=1 --vmodule=*webrtc/*=4

Which revealed a whole host of information, which when combined with a copy of the WebRTC source tree suggests the following,

From my current understanding, most of the action seems to take place in:
    ./webrtc/p2p/base/port.cc
which is supported by event handlers in:
    ./webrtc/p2p/base/p2ptransportchannel.cc

As per my earlier post, there are two flows which differ only slightly, and only because of far-end latency issues. In code terms, the not-working case never calls SignalUseCandidate() before SignalStateChange(), even though the connection has progressed far enough that it would probably be appropriate to do so. This results in SwitchBestConnectionTo() never being called.

The two flows are as follows, using text from the logfile, and abbreviating the flow to concentrate on the RTCP via TURN prflx connection that needs to come up.

Working:
  • "Received STUN ping", set_read_state()
  • "Sent STUN ping response"
  • "Sending STUN ping"
  • "Create permission for x.x.x.x:x succeeded"
  • "Received STUN ping"
    (This is a far-end retransmit due to high latency)
  • "Sent STUN ping response"
  • SignalUseCandidate(), "Not switching...not writable"
    (Sets 'pending_best_connection_')
  • set_write_state(), SignalStateChange(), "Switching best connection...now writable"
    (Uses 'pending_best_connection_')
  • "New best connection"
  • "Received STUN ping response"
  • "Channel bind for x.x.x.x:x succeeded"
Non-Working:
  • "Received STUN ping", set_read_state()
  • "Sent STUN ping response"
  • "Sending STUN ping"
  • "Create permission for x.x.x.x:x succeeded"
  • set_write_state(), SignalStateChange()
    ('pending_best_connection_' is NULL, so NoOp)
  • "Received STUN ping response"
At this point, the connection state is "CRWS" and should be proceeding, but nothing causes SignalUseCandidate() to be called, so it is never considered. Should SignalUseCandidate() be called conditionally within MaybeAddPrflxCandidate() if the connection is already progressed this far???

Comments and feedback requested (pretty please). I will attach the two chome_debug.log files in case it helps.

Thanks,
Steve
chrome_log.tgz

Nazmus Shakeeb

unread,
Jun 5, 2015, 6:33:43 AM6/5/15
to discuss...@googlegroups.com
Can you please add capture file for failure and success case.  

ste...@ipcortex.co.uk

unread,
Jun 5, 2015, 12:44:52 PM6/5/15
to discuss...@googlegroups.com
I believe the attached are 3 sample good and 3 sample bad files captured in Wireshark on a Windows 7 laptop.

Regards,
Steve


On Friday, 5 June 2015 11:33:43 UTC+1, Nazmus Shakeeb wrote:
Can you please add capture file for failure and success case.  


[snip] 
chrome_pcapng.tar.gz

Nazmus Shakeeb

unread,
Jun 6, 2015, 6:45:32 AM6/6/15
to discuss...@googlegroups.com
Can you please run this test again with 2 changes.

   1. disable endpoint permission in the turn server.
   
   2. disable stun in the asterisk server. 
       or 
       add only host candidate when chrome receives both host and server reflexive address from asterisk.     

In this configuration asterisk will have no server reflexive candidate. So chrome will not send any check request via TURN server unless it receives one.

For #1 coturn rfc5766 STUN/TURN server has --server-relay option to disable endpoint permission. 

In this setup call will be established between peer reflexive candidate and relay.  

Nazmus Shakeeb

unread,
Jun 8, 2015, 5:37:22 AM6/8/15
to discuss...@googlegroups.com
Could you test using the given configuration ?


On Saturday, May 30, 2015 at 3:20:17 PM UTC+6, ste...@ipcortex.co.uk wrote:

ste...@ipcortex.co.uk

unread,
Jun 8, 2015, 5:38:51 AM6/8/15
to discuss...@googlegroups.com
Hi,

Before I make my TURN server completely insecure, could you perhaps explain what benefit will be gained by making these two changes?

I understand that the STUN server provides a "mostly useless" SRFLX candidate from Asterisk, but this is easily ignored in the logs.

I do not understand how '--server-relay' will help debugging though. As per the documentation "The rule is: if you do not understand what is this option about and why would you need it, then you absolutely must NOT use it under any circumstances." Even the RFC explains how it will allow malicious use of the TURN server.

Regards,
Steve

Nazmus Shakeeb

unread,
Jun 8, 2015, 9:54:04 AM6/8/15
to discuss...@googlegroups.com
>>I understand that the STUN server provides a "mostly useless" SRFLX candidate from Asterisk, but this is easily ignored in the logs.

Some Nat device when receives unsolicited request, block the sender for sometimes. Here callee sent  ICE check request to the caller NAT so caller NAT may block TURN server. 

If you disable Asterisk's server reflexive candidate this will not happen.

This will creates another issue , as callee will not get server reflexive candidate of caller , he can not create permission on the TURN server , so it will not work unless you disable permission in the turn.

You can also see in wire shark that Asterisk is the ICE controller , Asterisk should sent "USE CANDIDATE" to declare the end of ICE check.  Here Asterisk did not do it. So it seems that Asterisk is not getting response of its ICE check request.


>>I do not understand how '--server-relay' will help debugging though. As per the documentation "The rule is: if you do not understand what is this option about and why would you need it, then you absolutely must NOT use it under any circumstances." Even the RFC >>explains how it will allow malicious use of the TURN server.

This is for debugging. I thought you will do it in your test server not in the production server. Although this security feature is not very useful as it relay on the source address of UDP packet which can be easily faked and 

Some NAT devices may have multiple IP addresses and can use different IP addresses for different request so TURN will not always work for these type of NAT unless you set --server-relay option.



 
        



On Saturday, May 30, 2015 at 3:20:17 PM UTC+6, ste...@ipcortex.co.uk wrote:

ste...@ipcortex.co.uk

unread,
Jun 8, 2015, 10:38:36 AM6/8/15
to discuss...@googlegroups.com
Thank you for the clarification. 

Some NAT device when receives unsolicited request, block the sender for sometimes. Here callee sent ICE check request to the caller NAT so caller NAT may block TURN server. 

We know that the "unsolicited request" is not a problem because the same "unsolicited request" is sent in the good-trace that works perfectly, so is not setting some kind of administrative block on the TURN server. In fact Chrome continues to try and use the TURN server but never accepts the candidate. We can also see that Chrome requests the permission and that the permission is granted - The Chrome logs show this succeeding on the good and on the bad trace.

You can also see in wire shark that Asterisk is the ICE controller , Asterisk should sent "USE CANDIDATE" to declare the end of ICE check.  Here Asterisk did not do it. So it seems that Asterisk is not getting response of its ICE check request.

I have checked the STUN messages from Asterisk and they DO contain the correct  "USE-CANDIDATE" and "PRIORITY" flags - These are not properly decoded by Wireshark on the client-side because it is encapsulated in a STUN "Send Message" header, but it is still present.

This is for debugging. I thought you will do it in your test server not in the production server.

You are correct that this is a test-only server that we have created specifically to diagnose this problem, but it is still risky as it can open it up as a DoS relay, which I would rather avoid, particularly as I have just published its IP address here! :)

Some NAT devices may have multiple IP addresses and can use different IP addresses for different request

That is an interesting thought, but this is definitely not the case here. Also, this would still be true in both the working and not-working example so is unlikely to be related to this issue.

As per the Chrome logs, the setup goes just fine in both good and bad cases, but Chrome simply chooses never to use the connection that it knows it has available in the bad case because of the order things happen.

Regards,
Steve

Nazmus Shakeeb

unread,
Jun 11, 2015, 10:17:36 AM6/11/15
to discuss...@googlegroups.com
Can you please verify that Asterisk is using regular nomination ? 

I think it is possible to configure nomination type in the asterisk ( in the configuration file or while compiling source code)

The issue may be related to using aggressive nomination in the Asterisk side.

if you search good or bad log using "Adding connection from peer reflexive candidate" , you will get the candidates by which communication should proceed 

then if you search "Received STUN ping,"  you will get the ICE check request received by chrome from asterisk using this candidates. 

For success case you will find  "Received STUN ping," after getting peer reflexive candidate. For failure case you will not get any using these candidates

if Asterisk use regular nomination it must send second STUN check with use candidate.

On Saturday, May 30, 2015 at 3:20:17 PM UTC+6, ste...@ipcortex.co.uk wrote:

ste...@ipcortex.co.uk

unread,
Jun 11, 2015, 10:42:42 AM6/11/15
to discuss...@googlegroups.com
OK. I looked into that - I may have to do some experiments to see if that helps.

Thanks for the pointer I will let you know how it goes.

Steve

ste...@ipcortex.co.uk

unread,
Jun 12, 2015, 5:04:03 AM6/12/15
to discuss...@googlegroups.com
Thanks once again for the pointer.

It turns out that libpjnath, which is used by Asterisk for ICE, defaults to aggressive nomination.

Turning that mode off has certainly helped, though I still think there is some scope to improve Chrome's behaviour in this circumstance. Chrome does after-all have a single well defined candidate in its hand, so probably ought to consider using it???

I will be continuing to experiment over time, but can now proceed without horrible hacks on my TURN server to delay every packet :)

Steve

Nazmus Shakeeb

unread,
Jun 12, 2015, 5:10:16 AM6/12/15
to discuss...@googlegroups.com
You are right. This issue can be solved in chrome also.  


On Saturday, May 30, 2015 at 3:20:17 PM UTC+6, ste...@ipcortex.co.uk wrote:
Reply all
Reply to author
Forward
0 new messages