Problem with Connect error response 447 when sending a Connect request for connecting to a control connection.

Markus Wallén

Oct 6, 2014, 7:03:29 AM
to turn-server-project...@googlegroups.com

Hi,

I have some trouble with Connect requests. Sometimes, but not always, I get a 447 (Connection Timeout or Failure) error from the TURN server when one of the peers sends a Connect request.

I'll try to describe our setup as clearly as possible. Most of it just follows the steps in the RFC, so it should not be anything new or odd, but I want to make clear what I'm doing and hopefully get an answer that helps me solve my problem.

The system consists of two types of Peers. Peers that are “contactable” (they do not initiate contact themselves) and Peers that are “contacters” (they initiate contact). I'll refer to the contactable Peers as Peer1 and the contacters as Peer2.

Peer1 always has an allocation on the TURN server, made over a control connection, with a permission on it that only allows the TURN server to connect to it. The allocation and the permission are refreshed regularly.

When Peer2 wants to contact Peer1, Peer2 sets up its own allocation over a control connection and sends a Connect request asking the TURN server to connect to Peer1's control connection. If the Connect request succeeds, both Peer1 and Peer2 open a new connection to the TURN server and each sends a ConnectionBind, which establishes a data connection between the two Peers through the TURN server.

Sometimes, as stated before, I get a 447 error response to the Connect request from Peer2, even though I can see that Peer1 has an active control connection to the TURN server.
To see what is happening inside the TURN server, I added a debug print at line 1302 of src/apps/relay/ns_ioalib_engine_impl.c that prints evutil_socket_error_to_string(evutil_socket_geterror(bufferevent_getfd(bev))). It reports Connection refused.

These two flows describe what I see when it succeeds in creating a data connection and when it fails.

The connection flow (success):
--------------------
Peer1 -> Allocation -> TURN server
TURN server -> Allocation response: turnserver:peer1 -> Peer1
Peer1 -> CreatePermission (allowed: TURN server) -> TURN server
Peer1 -> Refresh allocation -> TURN server
.
.
.
Peer2 -> Allocation -> TURN server
TURN server -> Allocation response: turnserver:peer2 -> Peer2
Peer2 -> Connect Request(turnserver:peer1) -> TURN server
TURN server -> ConnectionAttempt(connectionid:1, xor-peer-address:(turnserver:peer2)) -> Peer1
TURN server -> ConnectResponse(connectionid:1) -> Peer2
Peer1 -> new connection -> TURN server
Peer1 -> Connection Bind(connectionid:1) -> TURN server

Peer2 -> new connection -> TURN server
Peer2 -> Connection Bind(connectionid:1) -> TURN server

 

The connection flow (fail):
--------------------
Peer1 -> Allocation -> TURN server
TURN server -> Allocation response: turnserver:peer1 -> Peer1
Peer1 -> CreatePermission (allowed: TURN server) -> TURN server
Peer1 -> Refresh allocation -> TURN server
.
.
.
Peer2 -> Allocation -> TURN server
TURN server -> Allocation response: turnserver:peer2 -> Peer2
Peer2 -> Connect Request(turnserver:peer1) -> TURN server
TURN server -> ConnectResponse(error:447) -> Peer2
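
For reference, the Connect request in the flows above can be sketched at the byte level. This is a minimal illustration, assuming the STUN framing from RFC 5389 and the method and attribute numbers from RFC 5766 and RFC 6062; the function names and the example address are hypothetical, and a real request would also carry the authentication attributes (USERNAME, REALM, NONCE, MESSAGE-INTEGRITY), which are left out here.

```python
import ipaddress
import os
import struct

MAGIC_COOKIE = 0x2112A442        # fixed STUN magic cookie (RFC 5389)
METHOD_CONNECT = 0x000A          # Connect request (RFC 6062)
ATTR_XOR_PEER_ADDRESS = 0x0012   # XOR-PEER-ADDRESS (RFC 5766)


def xor_peer_address(ip: str, port: int) -> bytes:
    """Encode an IPv4 XOR-PEER-ADDRESS attribute (type, length, value)."""
    x_port = port ^ (MAGIC_COOKIE >> 16)
    x_addr = int(ipaddress.IPv4Address(ip)) ^ MAGIC_COOKIE
    # Value layout: reserved byte, family (0x01 = IPv4), X-Port, X-Address.
    value = struct.pack("!BBHI", 0, 0x01, x_port, x_addr)
    return struct.pack("!HH", ATTR_XOR_PEER_ADDRESS, len(value)) + value


def build_connect_request(peer_ip: str, peer_port: int, transaction_id: bytes) -> bytes:
    """Build a bare Connect request without any authentication attributes."""
    if len(transaction_id) != 12:
        raise ValueError("transaction id must be 12 bytes")
    body = xor_peer_address(peer_ip, peer_port)
    # Header: message type, body length, magic cookie, 96-bit transaction id.
    header = struct.pack("!HHI", METHOD_CONNECT, len(body), MAGIC_COOKIE)
    return header + transaction_id + body


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address standing in for turnserver:peer1.
    print(build_connect_request("192.0.2.10", 65002, os.urandom(12)).hex())
```

The peer address placed in the request is the relay address that the other side received in its Allocation response (turnserver:peer1 in the flows).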

Oleg Moskalenko

Oct 6, 2014, 11:22:35 AM
to Markus Wallén, turn-server-project...@googlegroups.com
Markus, which version are you running?

The connectionids in the successful flow must be different for different peers.

How frequent are the failures?

What is the network topology? Is the -x parameter used in the setup?

Oleg

--
You received this message because you are subscribed to the Google Groups "TURN Server (Open-Source project)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to turn-server-project-rfc57...@googlegroups.com.
To post to this group, send email to turn-server-project...@googlegroups.com.
Visit this group at http://groups.google.com/group/turn-server-project-rfc5766-turn-server.
For more options, visit https://groups.google.com/d/optout.

Oleg Moskalenko

Oct 6, 2014, 11:50:25 AM
to Markus Wallén, turn-server-project...@googlegroups.com
Markus, in your debug print, could you please add the addresses? Just print the contents of ret->remote_addr and ret->local_addr, using the function addr_to_string to convert them to human-readable form. That will tell us who is connecting, from where and to where.

Regards,
Oleg

Markus Wallén

Oct 7, 2014, 9:35:28 AM
to turn-server-project...@googlegroups.com, markus....@gmail.com
I'm running 3.2.4.4

- The connectionids in the successful flow must be different for different peers.
Yes, the flow could be clearer about that. I should have written something like connectionid:xx to show that "some" connectionid is used.

- Is the -x parameter used in the setup?
No, it's not set as far as I know. What is it for?

- What is the network topology?
Peer1 is behind a router. The router and Peer2 are connected directly to the internet. The TURN server runs on Amazon.

- How frequent is the rate of the failures?
Hard to tell, but it seemed to occur "after a while" (maybe a day or two) of running the peers.
We suspected that it might be related to the max FD limit, so we changed that to 999999.
We had quite a lot of connections to the server, which led us to suspect that it might be something port related.
We tried decreasing the min/max port span to 11 ports, and then it was really easy to reproduce the 447 problem. We expected to get 508 at first, but in our prints we got a mix of these:
Error detected: errno == Connection reset by peer
Remote address: 172.xx.yy.zzz:65003, Local address: 172.xx.yy.zzz:65010
Error detected: errno == Connection reset by peer
Remote address: 172.xx.yy.zzz:65003, Local address: 172.xx.yy.zzz:65005
Error detected: errno == Connection refused
Remote address: 172.xx.yy.zzz:65000, Local address: 172.xx.yy.zzz:65005
Error detected: errno == Connection refused
Remote address: 172.xx.yy.zzz:65006, Local address: 172.xx.yy.zzz:65010
Error detected: errno == Connection refused
Remote address: 172.xx.yy.zzz:65002, Local address: 172.xx.yy.zzz:65008
Error detected: errno == Connection refused
Remote address: 172.xx.yy.zzz:65006, Local address: 172.xx.yy.zzz:65009
Error detected: errno == Connection refused
Remote address: 172.xx.yy.zzz:65002, Local address: 172.xx.yy.zzz:65004

After one of the restarts we did get a 508 when peers tried to connect, and no prints. We reduced the number of peers and then the turn server accepted allocations again, but with the 447 problem when trying to do a connect request.

Oleg Moskalenko

Oct 7, 2014, 3:20:35 PM
to turn-server-project...@googlegroups.com, markus....@gmail.com

Markus, see below:

On Tuesday, October 7, 2014 6:35:28 AM UTC-7, Markus Wallén wrote:
> I'm running 3.2.4.4

Great.

> - Is the -x parameter used in the setup?
> No, it's not set as far as I know. What is it for?

I meant the "-X" option, or --external-ip. You say that you are running on Amazon. That means that you do have to use the -X option.

> We tried decreasing the min/max port span to 11 ports, and then it was really easy to reproduce the 447 problem. We expected to get 508 at first, but in our prints we got a mix of these:
> [...]
> After one of the restarts we did get a 508 when peers tried to connect, and no prints. We reduced the number of peers and then the turn server accepted allocations again, but with the 447 problem when trying to do a connect request.

I tested the TCP relay use case with a limited port range. I get the 508 error very reliably, in 100% of cases.

So, we have two facts to deal with:

1) You have an Amazon TURN server setup, subject to Amazon firewall machinery;
2) In some port ranges (like 65000+) the connection problem is more pronounced.

Together, it tells me that it very well may be an Amazon firewall issue. I was asking you to include the additional address debug messages. As far as I can see, you did it, and the output tells us that it is failing to connect from "internal" to "internal" address (172.*). I was afraid that there is an address translation issue, when the connection is going from "internal" to "external" address or back - but this is obviously not the case.

So, I'd recommend a few things:

1) Investigate the Amazon firewall settings.
2) Ask Amazon whether they have any known issues with that.
3) Try to allocate several internal/external address pairs on the same server, to give the server more resources.

I'll perform more testing on our Amazon instance, to see if I can find anything.

Regards,
Oleg

Oleg Moskalenko

Oct 8, 2014, 2:20:48 AM
to turn-server-project...@googlegroups.com, markus....@gmail.com
I tried to reproduce the problem with our Amazon image instance and could not get error 447. The back-to-back connections always succeed; if I set too few ports and start too many clients, then I always get the 508 error, and never 447.

So my suspicion is that something is not right in your Amazon instance or in your firewall. Are you using our suggested image, or did you set it up yourself, from scratch?

Oleg

Markus Wallén

Oct 9, 2014, 6:50:51 AM
to turn-server-project...@googlegroups.com, markus....@gmail.com
I noticed that sometimes allocations from my peers end up on the same port in the turn server. Is this expected behaviour? I'll continue to look into this and see if I can describe a case where I get these issues.

Oleg Moskalenko

Oct 9, 2014, 11:26:08 AM
to Markus Wallén, turn-server-project...@googlegroups.com
On Thu, Oct 9, 2014 at 3:50 AM, Markus Wallén <markus....@gmail.com> wrote:
> I noticed that sometimes allocations from my peers end up on the same port in the turn server. Is this expected behaviour?

Two different TURN clients cannot have their relay endpoints on the same endpoint (IP:port), simultaneously. If you do ever observe such a situation then this is a bug.

 


Oleg Moskalenko

Oct 9, 2014, 12:32:28 PM
to Markus Wallén, turn-server-project...@googlegroups.com
Another thing is that the TCP relay (as specified in RFC 6062) cannot work perfectly on older Linux kernels, due to kernel limitations. For production systems, you have to use either a newer kernel (3.9 or newer), a patched older kernel (2.6.32-431 or newer), or a FreeBSD system.

Markus, you have to be sure that you interpret and describe the situation correctly.

The term "peer" is reserved in TURN for the "plain protocol" network entities: the "peers" are not aware of the TURN server, and they communicate with it over plain network protocols. For the network entities that send requests to the TURN server over the TURN protocol, we reserve the term "TURN clients".

Different TURN clients cannot have allocations (the relay endpoints) on the same IP:port pair. But if the same TURN client is establishing connections to different peers then those peers will "talk" to the same remote TCP address - the peers will be contacted from the same relay endpoint.
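
The uniqueness rule above can be illustrated with a toy relay-port pool. This is a hypothetical sketch, not the TURN server's actual allocator: the class and method names are invented for illustration, and the only behaviour taken from the thread is that each allocation gets its own port and that an exhausted range is answered with 508 (Insufficient Capacity).

```python
class InsufficientCapacity(Exception):
    """Stands in for a 508 (Insufficient Capacity) error response."""


class RelayPortPool:
    """Toy pool that hands each relay port to at most one client at a time."""

    def __init__(self, min_port: int, max_port: int) -> None:
        self._free = set(range(min_port, max_port + 1))
        self._owner = {}  # port -> client id currently holding it

    def allocate(self, client_id: str) -> int:
        if not self._free:
            raise InsufficientCapacity("508: no free relay ports")
        port = min(self._free)      # deterministic choice, for readability only
        self._free.remove(port)
        self._owner[port] = client_id
        return port

    def release(self, port: int) -> None:
        # Only ports that are currently owned go back into the free set.
        if self._owner.pop(port, None) is not None:
            self._free.add(port)


if __name__ == "__main__":
    pool = RelayPortPool(65000, 65010)   # the 11-port span used in the thread
    print([pool.allocate(f"client{i}") for i in range(11)])
```

With a pool like this, two live allocations can never share a port; a twelfth client on the 11-port span gets the 508 equivalent instead.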


Regards,
Oleg



Oleg Moskalenko

Oct 12, 2014, 9:04:50 PM
to turn-server-project...@googlegroups.com, markus....@gmail.com
Markus, I think I have found the answer to your question.

See the function ioa_create_connecting_tcp_relay_socket() and the comments inside it. The problem is indeed in the old Linux kernel implementation. RFC 6062 cannot be implemented as written there, so we use a workaround. But the workaround results in a condition where the relay endpoint becomes temporarily unavailable.

To resolve the problem, you have to compile and run the TURN server on a more modern OS. The recommended examples are:

1) FreeBSD 9.x+
2) Linux with kernels 3.9+ (ArchLinux, Fedora, CentOS/RedHat 7, Ubuntu 14+, a contemporary Amazon Linux);
3) an older Linux with patched kernel (like CentOS 6.5).
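
The thread does not name the kernel feature involved. A plausible candidate (an assumption, not confirmed here) is SO_REUSEPORT, which Linux gained in 3.9: RFC 6062 needs the relay address to accept incoming TCP connections while also originating outgoing ones from the same IP:port. The sketch below is not the TURN server's code; it only shows, with hypothetical function names, that a listening socket and a connecting socket can share one local port when both set SO_REUSEPORT.

```python
import socket


def reuseport_socket() -> socket.socket:
    """Create a TCP socket that is allowed to share its local port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT is missing on platforms that do not support it.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    return s


def listen_and_connect_from_same_port(peer_addr):
    """Listen on a local port and also connect out to peer_addr from it."""
    listener = reuseport_socket()
    listener.bind(("127.0.0.1", 0))      # let the OS pick the "relay" port
    listener.listen(8)
    relay_addr = listener.getsockname()

    outgoing = reuseport_socket()
    outgoing.bind(relay_addr)            # same IP:port as the listener
    outgoing.connect(peer_addr)
    return listener, outgoing


if __name__ == "__main__":
    peer = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    peer.bind(("127.0.0.1", 0))
    peer.listen(1)
    listener, outgoing = listen_and_connect_from_same_port(peer.getsockname())
    conn, remote = peer.accept()
    print("peer sees connection from", remote, "listener is on", listener.getsockname())
    for s in (conn, outgoing, listener, peer):
        s.close()
```

Without such an option, a server would have to stop listening on the relay port while it connects out, which would match the "temporarily unavailable" relay endpoint described above.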

Regards,
Oleg

Markus Wallén

Oct 13, 2014, 3:58:46 AM
to turn-server-project...@googlegroups.com, markus....@gmail.com
I realize that I used the term Peer in the wrong context.
What I've seen is that two different TURN clients that create an allocation can get allocated to the same port on the TURN server. Is this expected behavior? When a Peer wants to contact a TURN client through the allocated port, how will the TURN server know which TURN client the peer wants to connect to? I hope I'm describing the situation properly (with the correct terms); otherwise I'll try to explain it better. In my case, some of my TURN clients are peers as well, which might make it a little confusing and is probably why I mix things up.

I'm using Ubuntu 14.04 with Kernel 3.13, which I think should be ok.

Regards,
Markus

Oleg Moskalenko

Oct 13, 2014, 4:09:37 AM
to Markus Wallén, turn-server-project...@googlegroups.com
On Mon, Oct 13, 2014 at 12:58 AM, Markus Wallén <markus....@gmail.com> wrote:
> I realize that I used the term Peer in the wrong context.
> What I've seen is that two different TURN clients that create an allocation can get allocated to the same port on the TURN server. Is this expected behavior?

No, that's not expected behavior, and I actually cannot imagine how it could happen. Of course everything is possible, but that's highly improbable. Are you sure that you are interpreting your observations correctly?

> When a Peer wants to contact a TURN client through the allocated port, how will the TURN server know which TURN client the peer wants to connect to?

Every allocation must have a unique port.

> I'm using Ubuntu 14.04 with Kernel 3.13, which I think should be ok.

That's great. But did you compile the server yourself, in the same environment? The Debian/Ubuntu images in the download area on the download server are compiled with an older kernel.

Regards,
Oleg


Markus Wallén

Oct 15, 2014, 3:05:05 AM
to turn-server-project...@googlegroups.com, markus....@gmail.com


On Monday, October 13, 2014 at 10:09:37 UTC+2, Oleg Moskalenko wrote:
>> What I've seen is that two different TURN clients that create an allocation can get allocated to the same port on the TURN server. Is this expected behavior?
>
> No, that's not expected behavior, and I actually cannot imagine how it could happen. Of course everything is possible, but that's highly improbable. Are you sure that you are interpreting your observations correctly?

I'm pretty sure.

This is what I see when the issue occurs, after running the following command and then typing ps:
netcat localhost 5766
ps
    4) id=000000000000003711, user <turnclient1>:
      started 668 secs ago
      expiring in 292 secs
      client protocol TCP, relay protocol TCP
      client addr aaa.bbb.ccc.ddd:54406, server addr 172.xxx.yyy.zzz:3478
      relay addr 172.xxx.yyy.zzz:65002
      fingerprints enforced: OFF
      mobile: OFF
      usage: rp=6, rb=784, sp=5, sb=468
       rate: r=0, s=0, total=0 (bytes per sec)
      peers:
          172.xxx.yyy.zzz

    9) id=000000000000004028, user <turnclient2>:
      started 5 secs ago
      expiring in 595 secs
      client protocol TCP, relay protocol TCP
      client addr eee.fff.ggg.hhh:40265, server addr 172.xxx.yyy.zzz:3478
      relay addr 172.xxx.yyy.zzz:65002
      fingerprints enforced: OFF
      mobile: OFF
      usage: rp=3, rb=344, sp=2, sb=220
       rate: r=0, s=0, total=0 (bytes per sec)
      peers:
          172.xxx.yyy.zzz:65003

I have not pasted in all clients. I still use a limited port span between 65000 and 65010 to be able to reproduce the problem "easily". Most of the time, though, the clients get a unique port. If I interpret the information correctly, both of these clients are allocated on 65002.
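
A listing like the one above can be checked for duplicates mechanically. The following is a small sketch that assumes the session layout shown in this post (a numbered "id=..., user <...>" line followed by a "relay addr" line); the regular expressions and the function name are hypothetical and may need adjusting for other server versions.

```python
import re
from collections import defaultdict

# Patterns modelled on the listing in this post; other versions may differ.
SESSION_RE = re.compile(r"^\s*\d+\)\s+id=(\d+),\s+user\s+<([^>]*)>")
RELAY_RE = re.compile(r"^\s*relay addr\s+(\S+)")

SAMPLE = """\
    4) id=000000000000003711, user <turnclient1>:
      client addr aaa.bbb.ccc.ddd:54406, server addr 172.xxx.yyy.zzz:3478
      relay addr 172.xxx.yyy.zzz:65002

    9) id=000000000000004028, user <turnclient2>:
      client addr eee.fff.ggg.hhh:40265, server addr 172.xxx.yyy.zzz:3478
      relay addr 172.xxx.yyy.zzz:65002
"""


def duplicate_relay_addrs(ps_output: str) -> dict:
    """Map each relay address used by more than one session to its users."""
    by_relay = defaultdict(list)
    current_user = None
    for line in ps_output.splitlines():
        session = SESSION_RE.match(line)
        if session:
            current_user = session.group(2)
            continue
        relay = RELAY_RE.match(line)
        if relay and current_user is not None:
            by_relay[relay.group(1)].append(current_user)
    return {addr: users for addr, users in by_relay.items() if len(users) > 1}


if __name__ == "__main__":
    print(duplicate_relay_addrs(SAMPLE))
```

An empty result means every listed session has its own relay address.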

 
>> When a Peer wants to contact a TURN client through the allocated port, how will the TURN server know which TURN client the peer wants to connect to?
>
> Every allocation must have a unique port.

Ok. Good, that's what I thought.

>> I'm using Ubuntu 14.04 with Kernel 3.13, which I think should be ok.
>
> That's great. But did you compile the server yourself, in the same environment? The Debian/Ubuntu images in the download area on the download server are compiled with an older kernel.

Yes, it's compiled for that server, in that environment.

Regards
Markus 

Oleg Moskalenko

Oct 15, 2014, 12:55:01 PM
to turn-server-project...@googlegroups.com, markus....@gmail.com
The situation you describe should not be possible with the current code, and I cannot imagine how it could happen. That is definitely not intended behaviour.

I will have to write a dedicated test to find out whether I can reproduce the problem.

Can you try the latest code from SVN, the (as yet) unreleased version 3.2.4.5? I fixed a number of memory-related problems, which may help.

One (maybe) strange thing is that both allocations are on the same TCP server thread, but that is probably just a coincidence.

Oleg

Oleg Moskalenko

Oct 16, 2014, 5:09:13 AM
to turn-server-project...@googlegroups.com, markus....@gmail.com
I wrote a dedicated test and ran it for hours; it went through millions of allocations successfully. I could not reproduce the port duplication problem.

I did find (and fix) a potential problem, but only with RTCP relay endpoints (UDP). That is not your case.

Maybe you are observing a statistics anachronism, where one session has ended but has not yet been reported to the telnet statistics, while a second session with the same port has already been reported. But that is very unlikely.

As I said before, please try running the latest code from SVN, and let me know whether you still see the problem.

Oleg

Markus Wallén

Oct 27, 2014, 10:04:36 AM
to turn-server-project...@googlegroups.com
Finally I had the time to re-run my case.
Using the old version (3.2.4.4), it was quite easy to reproduce the error. I actually got around 20-25 allocations on a server with 7 open ports.
Using the new version (3.2.4.5) and doing the same test, I have so far been unable to reproduce the error. So whatever you did seems to have fixed the problems I was experiencing.

Thanks! 

Oleg Moskalenko

Oct 27, 2014, 12:27:38 PM
to Markus Wallén, turn-server-project...@googlegroups.com
That's a strange case... But I am glad that the problem is gone.
