Hi
I have spent days hunting down a connection problem without any luck. I'm trying to implement a relatively simple one2one Call with Kurento.
Attached are logs of cases that worked (they have "working" in the filename) and logs of cases that failed (they have "notworking" in the filename). The files without "debug" in the filename were captured with the log level set to "3,Kurento*:4,kms*:4,sdp*:4,webrtc*:4,*rtpendpoint:4,rtp*handler:4,rtpsynchronizer:4,agnosticbin:4". The files with "debug" in the filename were captured with the log level set to "3,Kurento*:4,kms*:5,sdp*:5,webrtc*:5,*rtpendpoint:5,rtp*handler:4,rtpsynchronizer:4,agnosticbin:4".
Any help or new input is greatly appreciated!
Description of the Problem:
In about 30% of cases, the WebRTC connection cannot be established. Unfortunately I can't find any pattern in when the connection succeeds and when it fails; it seems completely random. I'm in the same network, using the same devices, the same TURN server and the same signalling protocol, but in 30% of cases the connection cannot be established.
When I run the application locally, it seems to work much more reliably: the connection can be established almost 100% of the time (or maybe even 100% of the time, I have tested so many times that I lost track). I set up the infrastructure locally with Docker and run the different containers (TURN, Kurento, Signalling) in separate networks to mimic a production deployment.
We experience the same behavior in both our development and production environments. In our development environment we have absolutely no firewalls in place, so that doesn't seem to be the problem.
What I have tried to find the cause of the Problem:
Mostly I have been comparing logs of cases that worked and cases that didn't work but I have failed to find any significant difference between them that could point me to the problem.
I have tested the WebRTC connection over the TURN server (with Firefox and the force_relay flag) and over Kurento directly, but in both cases the connection fails in ~30% of cases.
I have tried filtering out all ICE candidates that are not relay candidates, so that only relay candidates are used.
I have sniffed the traffic between our signalling server (which also controls Kurento) and Kurento to look for differences in the JSON-RPC messages exchanged, but they appear to be essentially the same.
I have sniffed the client traffic of a successful and an unsuccessful connection but could not spot a significant difference.
I have simplified the Kurento media pipeline (no recording, no Hubs) but the behavior is the same.
I have used different browsers (Chrome, Firefox and a native iOS implementation) but the behavior is the same.
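
For reference, the relay-only filtering mentioned above can be sketched roughly like this. It is a minimal illustration and not my actual code: it assumes the candidates pass through the signalling server as plain candidate strings, the function names `candidate_type` and `keep_relay_only` are made up for this example, and the sample addresses are documentation addresses rather than real ones.

```python
from typing import List, Optional


def candidate_type(candidate_line: str) -> Optional[str]:
    """Return the value after 'typ' (host, srflx, prflx or relay), or None."""
    tokens = candidate_line.split()
    # In an ICE candidate string the type follows the literal token "typ".
    if "typ" in tokens:
        idx = tokens.index("typ")
        if idx + 1 < len(tokens):
            return tokens[idx + 1]
    return None


def keep_relay_only(candidates: List[str]) -> List[str]:
    """Drop every candidate whose type is not 'relay'."""
    return [c for c in candidates if candidate_type(c) == "relay"]


if __name__ == "__main__":
    # Hypothetical sample candidates, one of each common type.
    sample = [
        "candidate:1 1 UDP 2122252543 192.0.2.10 50000 typ host",
        "candidate:2 1 UDP 1686052863 198.51.100.7 50001 typ srflx raddr 192.0.2.10 rport 50000",
        "candidate:3 1 UDP 41885439 203.0.113.5 3478 typ relay raddr 198.51.100.7 rport 50001",
    ]
    for line in keep_relay_only(sample):
        print(line)
```

With a filter of this kind applied to the candidates in both directions, only the TURN path should remain available, which is the setup in which the connection still failed in about 30% of cases.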