Videoroom getting frozen


Imobach González Sosa

May 6, 2015, 7:35:54 AM
to meetech...@googlegroups.com
Hi all,

We're developing Jangouts[1] (a sort of «Google Hangouts» clone) and we're relying on Janus. It seems to work quite well but, sometimes, video/audio get frozen for everyone and, to be honest, I am not able to find the problem in the Janus logs.

I've seen a lot of NACKs and packet retransmissions, so I'm not sure if it has something to do with bandwidth or something like that. I'm attaching the logs of the full session (compressed), just in case someone can spot the problem there.

Please, let me know if you want us to do more tests.

Thanks a lot in advance!

[1] https://github.com/ancorgs/jangouts

janus-20150506-debug.log.bz2

Lorenzo Miniero

May 6, 2015, 8:36:18 AM
to meetech...@googlegroups.com
Hi,

love the name! :-D
I'll give it a try myself as soon as I can, as I'm always curious about projects that use Janus. :-)

About the freeze issues, the causes may be different, and I'm not sure the log alone can help. It may be a keyframe that has been lost, for instance, causing all differential packets to be discarded by the decoder until a new keyframe is received. Do you configure a FIR/PLI frequency for rooms, or did you disable that part? A regular FIR/PLI (e.g., every 5-10 seconds) may help make sure that in such cases a keyframe is received much sooner: without that, browsers usually send a FIR every 100 seconds instead.

If it's not FIR/PLI related (e.g., video never recovers for some publishers) it may instead be a different issue, e.g., video that is not being received by a publisher anymore, or for some reason not being broadcasted to (all or some?) participants. One additional reason may be ICE related: if the publisher is using TURN and the binding stops for some reason, none of the media being sent would be relayed by the TURN server to Janus anymore. As you can see, there are several potential causes, and the only way to investigate is to check the available sources.
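For reference, a room with a regular keyframe request can be defined along these lines in janus.plugin.videoroom.cfg (the room ID and the other values here are just illustrative):

```
[1234]
description = Demo room
publishers = 6
bitrate = 128000
fir_freq = 10      ; ask publishers for a keyframe every 10 seconds
```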

The best way to check this is to look at both chrome://webrtc-internals (or, in Firefox's case, about:webrtc) and the Janus admin API. The latter will provide, for each handle, details about the number of bytes sent/received per media type, and it has a per-second view as well. Of course it also provides other useful pieces of information, such as the ICE state, the currently used candidates, and so on. Using our admin demo page it can be a bit messy to look for the right handle, as you'd have to crawl through all of them until you find the publisher handle you're interested in, and maybe some of the viewers as well (just to verify, for instance, whether it's a specific viewer that's not receiving anything anymore or something else). You may want to script something that automates that process for you. Please also note that there are no async events for admin stuff: it's all request/response based, so you need to re-issue a query about a handle to get new information about it.
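To give an idea, the admin API queries mentioned above are plain JSON messages POSTed to the admin endpoint (by default http://yourserver:7088/admin); a minimal sketch of their shape, where the secret and the session/handle IDs below are placeholders:

```javascript
// Minimal helpers that build Janus admin API requests. The message names
// (list_sessions, list_handles, handle_info) are the ones the admin API
// understands; the admin_secret value and the IDs are placeholders.
function adminRequest(janus, extra) {
  return Object.assign({
    janus: janus,
    transaction: String(Date.now()),   // any unique string works
    admin_secret: "janusoverlord"      // placeholder, set in janus.cfg
  }, extra || {});
}

const listSessions = adminRequest("list_sessions");
const listHandles  = adminRequest("list_handles", { session_id: 12345 });   // hypothetical id
const handleInfo   = adminRequest("handle_info",
                                  { session_id: 12345, handle_id: 67890 }); // hypothetical ids
```

POST each of these to the admin endpoint and repeat the handle_info query whenever you want fresh stats, since, as said, there are no async admin events.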

By checking the combined info on client (internals) and server (admin API) sides, there should be additional pieces to the puzzle.

Hope that helps,
Lorenzo

Ancor Gonzalez Sosa

May 7, 2015, 6:00:28 AM
to meetech...@googlegroups.com
Replying inline.


On Wednesday, May 6, 2015 at 14:36:18 (UTC+2), Lorenzo Miniero wrote:
Hi,

love the name! :-D
I'll give it a try myself as soon as I can, as I'm always curious about projects that use Janus. :-)

About the freeze issues, the causes may be different, and I'm not sure the log alone can help. It may be a keyframe that has been lost, for instance, causing all differential packets to be discarded by the decoder until a new key frame is received. Do you configure a FIR/PLI frequency for rooms, or did you disable that part? A regular FIR/PLI (e.g., every 5-10 seconds) may help make sure that in such cases a key frame is received much sooner: without that, browsers usually send a FIR every 100 seconds instead.


We have fir_freq = 10 in all the rooms. So that shouldn't be the problem.
 
If it's not FIR/PLI related (e.g., video never recovers for some publishers) it may instead be a different issue, e.g., video that is not being received by a publisher anymore, or for some reason not being broadcasted to (all or some?) participants.

I assume you meant "video not being received FROM a publisher". Anyway, if that happens (let's say a bandwidth problem for one of the participants), do you mean that the problem can propagate to the whole system and cause the whole room to freeze? That sounds like quite an undesirable behavior.
 
One additional reason may be ICE related: if the publisher is using TURN and the binding stops for some reason, none of the media that is being sent would be relayed by the TURN server to Janus anymore. As you can see, there are several potential causes, and the only way to investigate is check the available sources.

We saw this on the logs.

STUN-CLIENT(srflx(IP4:10.100.201.16:58265/UDP|stun.l.google.com:19302)): Timed out

Could it be the culprit, or does it look more like a symptom or a consequence?
If it's the culprit, what's your recommendation?
 
The best way to check this is to look at both the chrome://webrtc-internals (or in case of Firefox, about:webrtc) and the Janus admin API. The latter will provide, for each handle, details about the number of bytes that have been sent/received per media type, and it has a per-second view as well. Of course it also provides other useful pieces of information, as for instance the ICE state, currently used candidates, and so on. Using our admin demo page it can be a bit messy to look for the right handle, as you'd have to crawl through all of them until you find the publisher handle you're interested in, and maybe some of the viewers as well (just to verify, for instance, if it's a specific viewer that's not receiving anything anymore or something else). You may want to script something that automates that process for you. Please also notice that there's no async event for admin stuff: it's all request/response based, so you need to refresh a query about a handle to get new information about it.

Debugging all that info sounds like too much for us. All we know about WebRTC is that Janus makes everything easy for us. :-)

By checking the combined info on client (internals) and server (admin API) sides, there should be additional pieces to the puzzle.

Hope that helps,
Lorenzo

Thanks a lot.
 

Lorenzo Miniero

May 7, 2015, 1:58:38 PM
to meetech...@googlegroups.com, anc...@gmail.com
On Thursday, May 7, 2015 at 12:00:28 UTC+2, Ancor Gonzalez Sosa wrote:
Replying inline.

On Wednesday, May 6, 2015 at 14:36:18 (UTC+2), Lorenzo Miniero wrote:
Hi,

love the name! :-D
I'll give it a try myself as soon as I can, as I'm always curious about projects that use Janus. :-)

About the freeze issues, the causes may be different, and I'm not sure the log alone can help. It may be a keyframe that has been lost, for instance, causing all differential packets to be discarded by the decoder until a new key frame is received. Do you configure a FIR/PLI frequency for rooms, or did you disable that part? A regular FIR/PLI (e.g., every 5-10 seconds) may help make sure that in such cases a key frame is received much sooner: without that, browsers usually send a FIR every 100 seconds instead.


We have fir_freq = 10 in all the rooms. So that shouldn't be the problem.
 
If it's not FIR/PLI related (e.g., video never recovers for some publishers) it may instead be a different issue, e.g., video that is not being received by a publisher anymore, or for some reason not being broadcasted to (all or some?) participants.

I assume you meant "video not being received FROM a publisher". Anyway, if that happens (let's say a bandwidth problem in one of the participants), do you mean that the problem can propagate to the whole system and cause the whole room to freeze? That looks like a quite undesirable feature.


Yes, I meant FROM. I thought only some videos were getting frozen, which is why I thought about the possible cause above. If one of the publishers can't send its frames to Janus anymore, all its viewers are obviously not going to receive them anymore: this translates into frozen video/audio from the viewers' perspective. If a few viewers can't get the video for some reason, the same thing happens, but for them only.

A complete freeze of all videos for everybody is something different, and not something we ever experienced. Try to make sure you're not pushing the forced video bitrate too high, especially if you're seeing a lot of nacks. The Janus videoroom plugin notifies about slow link events (both on the uplink and downlink sides) so you can make use of that feedback to configure the publishers' bitrate accordingly.
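To make that concrete, here is a minimal sketch of a bitrate step-down driven by those slow link events; the 25% step and the floor value are arbitrary choices of this sketch, and the slowLink callback wiring shown in the comment is roughly that of janus.js:

```javascript
// On each slow link event for a publisher, step the target bitrate down
// by 25% (arbitrary), never below a floor; the resulting value would then
// be sent to the plugin in a videoroom "configure" request.
function nextBitrate(current, floor) {
  return Math.max(Math.floor(current * 0.75), floor);
}

// Approximate janus.js wiring (pluginHandle obtained elsewhere):
//   slowLink: function(uplink, nacks) {
//     currentBitrate = nextBitrate(currentBitrate, 128000);
//     pluginHandle.send({ message: { request: "configure", bitrate: currentBitrate } });
//   }

nextBitrate(512000, 128000); // 384000
nextBitrate(128000, 128000); // stays at the 128000 floor
```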


 
One additional reason may be ICE related: if the publisher is using TURN and the binding stops for some reason, none of the media that is being sent would be relayed by the TURN server to Janus anymore. As you can see, there are several potential causes, and the only way to investigate is check the available sources.

We saw this on the logs.

STUN-CLIENT(srflx(IP4:10.100.201.16:58265/UDP|stun.l.google.com:19302)): Timed out

Can it be the culprit or looks more like a symptom or a consequence?
If it's the culprit, what's your recommendation?


Did you configure a STUN server for Janus to use? Remember that a STUN server in janus.cfg applies to Janus alone (that is, Janus getting STUN candidates for itself), not to its clients. A STUN/TURN configuration for clients needs to be done in JavaScript. That said, I don't think it's relevant: a failure to get srflx candidates could result in no media connectivity being established at all (if Janus is behind a NAT), not in frozen video after it was working.
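A sketch of that client-side part: STUN/TURN servers go into the options used to create the janus.js session, not into janus.cfg. All URLs and credentials below are placeholders:

```javascript
// Build the options object for the janus.js session constructor with
// client-side STUN/TURN servers; janus.cfg's STUN setting only affects
// the candidates Janus gathers for itself. All values are placeholders.
function clientIceConfig(stunUrl, turnUrl, user, pass) {
  return {
    server: "https://yourserver:8089/janus",  // placeholder gateway URL
    iceServers: [
      { urls: stunUrl },
      { urls: turnUrl, username: user, credential: pass }
    ]
  };
}

// e.g. new Janus(clientIceConfig(...)) in the browser:
const cfg = clientIceConfig("stun:stun.example.org:3478",
                            "turn:turn.example.org:3478", "user", "secret");
```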

 
 
The best way to check this is to look at both the chrome://webrtc-internals (or in case of Firefox, about:webrtc) and the Janus admin API. The latter will provide, for each handle, details about the number of bytes that have been sent/received per media type, and it has a per-second view as well. Of course it also provides other useful pieces of information, as for instance the ICE state, currently used candidates, and so on. Using our admin demo page it can be a bit messy to look for the right handle, as you'd have to crawl through all of them until you find the publisher handle you're interested in, and maybe some of the viewers as well (just to verify, for instance, if it's a specific viewer that's not receiving anything anymore or something else). You may want to script something that automates that process for you. Please also notice that there's no async event for admin stuff: it's all request/response based, so you need to refresh a query about a handle to get new information about it.

Debugging all that info sounds like too much for us. All we know about WebRTC is that Janus makes everything easy for us. :-)



I understand it's a burden, but within the context of multimedia applications involving WebRTC, especially when such issues are happening, that's unfortunately unavoidable at times. As I said, try focusing on publishers for now. Look for the publisher-related handles in the admin page (you'll notice the plugin-specific info in the handle is from the videoroom plugin and related to a publisher type, with info on the ID of the publisher itself) and check if they're sending data at all when you get freezes. Hitting the refresh icon for that handle will give you up-to-date info for it. If data is flowing correctly there, you might want to check a viewer handle for that specific publisher, and see if media is being forwarded there. For viewers it's probably easier to check the webrtc-internals stuff, as it should show the amount of data being received in real time. The getBitrate() method in janus.js can also be used for the purpose: if you're using your own JavaScript code, do something similar to extract the bitrate from the PeerConnection stats.
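If you extract the bitrate from the PeerConnection stats yourself, the computation is just a delta of byte counters over time; a minimal sketch, leaving out the wiring to getStats() and the exact stat names:

```javascript
// Bitrate in kbit/s from two successive byte-counter readings
// (e.g. bytesReceived from PeerConnection stats) taken elapsedMs apart.
function bitrateKbps(prevBytes, curBytes, elapsedMs) {
  // bits divided by milliseconds is the same as kbits per second
  return Math.round(((curBytes - prevBytes) * 8) / elapsedMs);
}

bitrateKbps(0, 64000, 2000); // 64000 bytes in 2s -> 256 kbit/s
```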

Apart from that, I guess the usual Wireshark/tcpdump checks could also help investigate what's wrong. It may be simply a matter of network connectivity issues on the server side.

Cheers,
Lorenzo

Ancor Gonzalez Sosa

May 8, 2015, 12:06:20 PM
to meetech...@googlegroups.com
On Wednesday, May 6, 2015 at 14:36:18 (UTC+2), Lorenzo Miniero wrote:
Hi,

love the name! :-D
I'll give it a try myself as soon as I can, as I'm always curious about projects that use Janus. :-)

Ancor Gonzalez Sosa

May 8, 2015, 12:20:59 PM
to meetech...@googlegroups.com, anc...@gmail.com


On Thursday, May 7, 2015 at 19:58:38 (UTC+2), Lorenzo Miniero wrote:
On Thursday, May 7, 2015 at 12:00:28 UTC+2, Ancor Gonzalez Sosa wrote:
Replying inline.

On Wednesday, May 6, 2015 at 14:36:18 (UTC+2), Lorenzo Miniero wrote:
Hi,

love the name! :-D
I'll give it a try myself as soon as I can, as I'm always curious about projects that use Janus. :-)

About the freeze issues, the causes may be different, and I'm not sure the log alone can help. It may be a keyframe that has been lost, for instance, causing all differential packets to be discarded by the decoder until a new key frame is received. Do you configure a FIR/PLI frequency for rooms, or did you disable that part? A regular FIR/PLI (e.g., every 5-10 seconds) may help make sure that in such cases a key frame is received much sooner: without that, browsers usually send a FIR every 100 seconds instead.


We have fir_freq = 10 in all the rooms. So that shouldn't be the problem.
 
If it's not FIR/PLI related (e.g., video never recovers for some publishers) it may instead be a different issue, e.g., video that is not being received by a publisher anymore, or for some reason not being broadcasted to (all or some?) participants.

I assume you meant "video not being received FROM a publisher". Anyway, if that happens (let's say a bandwidth problem in one of the participants), do you mean that the problem can propagate to the whole system and cause the whole room to freeze? That looks like a quite undesirable feature.


Yes I meant FROM. I thought only some videos were getting frozen, which is why I thought about the possible cause above. If one of the publishers can't send its frames to Janus anymore, all its viewers are obviously not going to receive them anymore: this translates in frozen video/audio from the viewer's perspective. If a few viewers can't get the video for some reason, same thing for them only.

A complete freeze of all videos for everybody is something different, and not something we ever experienced. Try to make sure you're not pushing the forced video bitrate too high, especially if you're seeing a lot of nacks. The Janus videoroom plugin notifies about slow link events (both on the uplink and downlink sides) so you can make use of that feedback to configure the publishers' bitrate accordingly.

We are seeing a lot of NACKs and slow link notifications. We have "bitrate = 64000" in the room, which should mean a very low bitrate for everybody, shouldn't it?

We will do further debugging, but according to your explanations it's starting to seem like we are simply running out of bandwidth on the server.

Lorenzo Miniero

May 8, 2015, 12:35:52 PM
to meetech...@googlegroups.com, anc...@gmail.com
64000 is very low and prone to freezing the video while encoding (I know for a fact that below that the video won't work; I'm not sure what happens when tiptoeing around that threshold). I'd recommend 128000 instead.

L.

Lorenzo Miniero

May 8, 2015, 12:36:09 PM
to meetech...@googlegroups.com, anc...@gmail.com
Thanks, we'll give that a try!

L.

Wilbert Jackson

May 8, 2015, 3:51:04 PM
to meetech...@googlegroups.com
I posted an issue about seeing a lot of NACKs and the video freezing a few weeks ago, I believe, but got no response. This sounds like the same issue as ours. What we found (tested on Chrome only) was that, when adjusting our REMB settings, some nodes could keep up with the encoding and some could not. As we adjusted our REMB to accommodate the slower nodes, Chrome's bandwidth adjustment algorithm would detect how much real bandwidth was available and adjust its bandwidth setting for all nodes, causing a lot of NACKs on some slower nodes. The increased number of NACKs also caused the server to bog down and node bandwidth to drop to zero. As a test, we set a timer in our JavaScript MCU plugin app to adjust the REMB setting to 256k every so many seconds to override Chrome's bandwidth adjustment, and we have not had the same problem since.

wilbert jackson

Lorenzo Miniero

May 11, 2015, 6:24:03 AM
to meetech...@googlegroups.com, wilbert...@gmail.com
That shouldn't be needed, as the latest version of the videoroom plugin (not sure when we added this) already sends a REMB every 5 seconds.


L.

Wilbert Jackson

May 11, 2015, 3:14:16 PM
to meetech...@googlegroups.com, wilbert...@gmail.com
If we take the timer-based REMB setting out and just set the REMB bitrate cap, the cap stays in place for a while, then the Chrome-controlled bandwidth adjustment bitrate drifts higher than the cap setting. The couple of screenshots below show this. The session is between a desktop device running over a LAN and a tablet device running over WiFi. We set the bitrate cap to 256 kbit/s and then monitor the Chrome getStats() info, as well as the real bandwidth available through the WiFi connection, using an app we wrote. About an hour after setting the REMB cap, the bitrate started to increase and fluctuated between 450 kbit/s and 750 kbit/s, with corresponding Janus NACK messages, over a 6-hour period when last checked. Chrome's NACKs Sent also increased, and the Chrome frame rate dropped from 15 fps to around 9 fps, causing lip sync problems.

Thanks

Wilbert Jackson

Lorenzo Miniero

May 11, 2015, 3:28:47 PM
to meetech...@googlegroups.com, wilbert...@gmail.com
What I'm puzzled about is that a manual "configure" with a specified bitrate results in the same REMB being sent as the one we automatically trigger every 5 seconds, so I'm not sure why one has an effect and the other doesn't. It may be that for some reason that part of the plugin code is not being invoked, but that shouldn't happen: have you tried adding some debug lines to see if they're ever called? The only other thing I can think of is that the automated REMB is only triggered as a result of the incoming_rtp callback: if no RTP packet is received, the 5-second check is not done and the mechanism is not triggered. Anyway, that shouldn't be the cause, as REMB only makes sense for publishers, and publishers do send RTP on a regular basis.

L.

Wilbert Jackson

May 11, 2015, 3:55:45 PM
to meetech...@googlegroups.com, wilbert...@gmail.com
No I have not tried adding debug messages to see if the plugin REMB timer code is being called. I will try that.

Thanks

Wilbert Jackson

May 12, 2015, 12:59:48 PM
to meetech...@googlegroups.com, wilbert...@gmail.com


The 5-second check to send a REMB is working. What happens in Chrome when we monitor the getStats() info is:

1. The REMB cap gets set and sets the sender's googAvailableSendBandwidth and googTargetEncBitrate to the cap value.

2. The sender's googActualEncBitrate value constantly changes and will have kbit/s values greater and less than the googAvailableSendBandwidth and googTargetEncBitrate cap values. In one test we set the REMB cap to 350 kbit/s and watched the googTargetEncBitrate increase to 487 kbit/s, with an average of about 395 kbit/s.

3. Whenever the googTargetEncBitrate was greater than the REMB cap setting, the googTransmitBitrate would fluctuate, increasing to a value greater than the REMB cap. The receiver's bitrate value would increase above its REMB cap setting. This in turn caused the server to issue the "Just got some nacks we should handle" message.

4. At some points while the test was running, the googTransmitBitrate would jump to values around 800 kbit/s and the server would then display a slow link message.

Bottom line: watching the getStats() info shows that the googTransmitBitrate and googReceiveBitrate fluctuate above and below the REMB cap value, with above being the more common case. It looks like the Chrome congestion algorithm is changing the sender REMB in response to changes in the receiver's bandwidth. If the sender REMB cap is set to 256 kbit/s or below, the receiver bandwidth drops too low and the sender encoding starts dropping too. See the screenshot below.



wilbert jackson


Ancor Gonzalez Sosa

May 14, 2015, 12:38:07 PM
to meetech...@googlegroups.com
On Wednesday, May 6, 2015 at 13:35:54 (UTC+2), Imobach González Sosa wrote:
Hi all,

We're developing Jangouts[1] (some kind of «Google Hangouts» clone) and we're relying on Janus. It seems to work quite ok but, sometimes, video/audio get frozen for everyone and, to be honest, I am not able to find the problem in Janus logs.

I've seen a lot of NACKS and packets retransmissions, so I'm no sure if it has something to do with bandwidth or something like that. I'm attaching the logs of the full session (they are compressed) just in case someone could spot the problem there.

Please, let me know if you want us to do more tests.

Actually, we have done several more tests. Since, from Lorenzo's replies and from our own observations, everything started to look like a bandwidth problem, we created a test instance on Amazon EC2 (where the bandwidth should be more than enough). Then we did a call for testers and got an AWESOME response, with a lot of people joining the #jangouts channel on irc.freenode.net just to be available for us to waste their time every time we need a wave of incoming connections.

So, first of all, this is an invitation to help us debug the problem: just ping us (imobach or ancorgs) in the #jangouts channel on freenode. We can even provide root access to the server for debugging purposes, and we have a legion of fellow open source enthusiasts just waiting for a call to jump on the server in some kind of organized DDoS. :-)

During those stress tests we observed 4 different kinds of problems (all of them appearing when the number of people connected to the room exceeded a safe limit of 8-9):

 (1) Some people were not able to join
 (2) Some people joined and subscribed successfully to the already existing feeds, but nobody noticed them: they didn't appear on the other participants' screens. I'm not sure at which point the communication broke (in just one direction).
 (3) Janus segfaulted a couple of times :-(
 (4) Some data channels were flaky

About (2).

I'm having a look at the logs right now (of course, I can provide the full logs if needed), studying the case of one of the affected users.
I will provide more info as soon as I have a full understanding of what happened (it's taking me longer than expected).

About (3).

All I can do is provide this backtrace. I hope it helps somehow.

#0  srtp_get_stream (srtp=srtp@entry=0x0, ssrc=3690470643) at srtp/srtp.c:1810
#1  0x00007f6d1e6ebfeb in srtp_unprotect (ctx=0x0, srtp_hdr=srtp_hdr@entry=0x7f6c1270bb90, pkt_octet_len=pkt_octet_len@entry=0x7f6c1270b984)
    at srtp/srtp.c:1488
#2  0x0000000000428884 in janus_ice_cb_nice_recv (agent=<optimized out>, stream_id=<optimized out>, component_id=<optimized out>, len=88,
    buf=0x7f6c1270bb90 "\001\001", ice=0x7f6cee5bce00) at ice.c:1183
#3  0x00007f6d20512fb6 in ?? () from /usr/lib64/libnice.so.10
#4  0x00007f6d1de2c11b in socket_source_dispatch (source=source@entry=0x7f6cee374590, callback=<optimized out>, user_data=<optimized out>)
    at gsocket.c:3264
#5  0x00007f6d1fffe166 in g_main_dispatch (context=0x7f6cedfb0440) at gmain.c:3066
#6  g_main_context_dispatch (context=context@entry=0x7f6cedfb0440) at gmain.c:3642
#7  0x00007f6d1fffe4b8 in g_main_context_iterate (context=0x7f6cedfb0440, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>)
    at gmain.c:3713
#8  0x00007f6d1fffe8ba in g_main_loop_run (loop=0x7f6cee3fcb40) at gmain.c:3907
#9  0x0000000000423131 in janus_ice_thread (data=0x7f6c9557b7c0) at ice.c:1436
#10 0x00007f6d20022e15 in g_thread_proxy (data=0x7f6cdc0c4c00) at gthread.c:798
#11 0x00007f6d1e4cf0a4 in start_thread (arg=0x7f6c1271c700) at pthread_create.c:309
#12 0x00007f6d1e2047fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I'll try to provide more information about (2) as soon as possible. If we can help by providing full logs or any other information, don't hesitate to ask. And if you want to experience the crashes first-hand, just contact us on IRC. :-)

Cheers.

Lorenzo Miniero

May 15, 2015, 5:08:55 AM
to meetech...@googlegroups.com, anc...@gmail.com
Great, and thanks for doing this, as it also provides a nice opportunity for stress testing! I think I'll be able to chime in at the beginning of next week (sorry, can't make it earlier than that). Of course I encourage all other members of the group to give it a try and help us all make Janus a better place ;-)

About the number of people exceeding 8-9, also take into account potential limitations on the client side, as the CPU/memory consumption there might get to a point where the clients can't handle it anymore. In such a case, the capture/encoding/transmission might also be affected, which could result in missing packets sent to Janus, Janus asking for retransmission of those lost packets, and the client not being able to cope (with viewers getting a bad experience as a result). So estimating the resource usage on that side might help exclude some causes.

About other bits, just some quick thoughts:

(1) the inability to join can depend on several factors... if it's always the same people, try to make sure they can indeed do WebRTC from a network perspective, e.g., that a TURN server is available. Debugging the ICE state on both server and client side might help.

(2) this might mean that the publisher stream was not successfully established for them, and as such their availability was not notified. This might be related to the excessive resource usage we discussed and/or to different issues: again, looking at the admin API for that specific handle might give more info on the cause of the issue (e.g., ICE failure or something else).

(3) this looks like an invalid SRTP context, maybe the result of a destroyed handle that was still being used; I do see (ctx=0x0) in fact, so maybe we should add a check for that;

(4) meaning that some messages were not received/relayed, or that they could not be established?

Thanks again for this experiment, hope to join you soon!
Lorenzo

Ancor Gonzalez Sosa

May 15, 2015, 7:35:57 AM
to meetech...@googlegroups.com, anc...@gmail.com
On Friday, May 15, 2015 at 11:08:55 (UTC+2), Lorenzo Miniero wrote:
Great, and thanks for doing this, as it also provides a nice opportunity for stress testing! I' think I'll be able to chime in at the beginning of next week (sorry, can't make it earlier than that). Of course I encourage all other members of the group to give it a try and help us all make Janus a better place ;-)

About the number of people exceeding 8-9, also take into account potential limitations on the client side, as the CPU/mem consumption there might get up to a point where they can't handle it anymore. In such a case, also the capture/encoding/transmission might be affected, which could result in missing packets sent to Janus, Janus asking for retransmission of those lost packets, and the client not being able to cope (and viewers who would get a bad experience as a result). So estimating the usage of resources on that side might help exclude some causes.

One note about resource usage on clients. Although you advised us to set the bitrate to 128k, we kept it at 64k for this experiment, because we didn't want to change a lot of different settings in every attempt (if you change more than one setting, you never know what the real culprit was).

About other bits, just some quick thoughts:

(1) the inability to join can depend on several factors... if it's always the same people, try to make sure they can indeed do WebRTC FROM a network perspective, e.g., that a TURN is available. Debugging the ICE state on both server and client side might help.

They can join perfectly when the number of participants is lower.
 
(2) this might mean that the publisher stream was not successfully established for them, and as such their availability was not notified. This might be related to the excessive resources we discussed about and/or different issues: again looking at the admin API for that specific handle might have more info on the cause of the issue (eg. ICE failure or something else).

I had a closer look at the logs and I'm certainly not sure I understand what is going on. I took one affected user as an example and, chronologically, I see this in the logs:

 - She gets the event with the list of participants.
 - A lot of notifications to the already-logged-in people about the new girl on the block
 - A lot of people sending a "join" to become her listeners
 - One "attached" message for her
 - Some more join requests from listeners
 - Five more "attached" messages for her again (I didn't expect them at all)
 - Another listener join
 - Tons (and I mean TONS) of unpublished events with her id
 - Even more join requests from listeners (resulting in errors, since the feed is no longer available).

I guess there are a lot of retransmissions, delays and similar things going on.

(3) looks like an invalid SRTP context, maybe a result of a destroyed handle that was still used; I do see a (ctx=0x0) infact, maybe we should add a check for that;

There are a lot of errors like this in the log; only some of them resulted in segfaults.
[ERR] [ice.c:janus_ice_cb_nice_recv:1190] [xxx] SRTP unprotect error: err_status_auth_fail


(4) meaning that some messages were not received/relayed, or that they could not be established?

Actually, this is the point that worries me the least right now.
 
Thanks again for this experiment, hope to join you soon!

That would be awesome.

See you!

Lorenzo Miniero

May 19, 2015, 10:54:20 AM
to meetech...@googlegroups.com, anc...@gmail.com
Can you try updating Janus? Pierce provided a patch that should fix the sometimes incorrect and heavy behaviour that Janus had with respect to NACKs and retransmissions, so this might help in your case.

Lorenzo

Ancor Gonzalez Sosa

May 19, 2015, 12:10:52 PM
to meetech...@googlegroups.com, anc...@gmail.com
On Tuesday, May 19, 2015 at 16:54:20 (UTC+2), Lorenzo Miniero wrote:
Can you try updating Janus? Pierce provided a patch that should fix the sometimes incorrect and heavy behaviour that Janus had with respect to NACKs and retransmissions, so this might help in your case.

Thanks for the advice. I have just updated the (open)SUSE packages in the repositories and in our testing instance. I will do a new call for massive testing tomorrow afternoon, so everybody feel free to join us in the #jangouts channel at irc.freenode.net around 4pm CEST. The more people we have, the better we can debug the problem. :-)

Cheers.

Lorenzo Miniero

May 20, 2015, 8:56:19 AM
to meetech...@googlegroups.com, anc...@gmail.com
I did some more fixes this morning, so a further update could help :-)
I'll try to pop in for some tests later, although I can't promise I'll be able to stay long...

Lorenzo

Wilbert Jackson

May 20, 2015, 3:45:13 PM
to meetech...@googlegroups.com, anc...@gmail.com
Lorenzo,

These changes are some great work. We have been running our video MCU app all day with one client (a Nexus tablet) connected at 512 kbit/s and three connected at 1 Mbit/s. When the tablet was connected at 1 Mbit/s we saw some slow link and NACK messages. When we set the tablet to a lower REMB of 512 kbit/s, the messages stopped. Frame rates are 30 fps. We have tried this test setup before and always ran into many, many NACK messages being generated when increasing the REMB above 256 kbit/s. You have seen my posts on the forum about the issue. Nice work!

wilbert jackson

Lorenzo Miniero

unread,
May 20, 2015, 4:33:32 PM5/20/15
to meetech...@googlegroups.com, wilbert...@gmail.com, anc...@gmail.com
Good to know they improved your scenario as well!

Lorenzo

Ancor Gonzalez Sosa

unread,
May 21, 2015, 1:55:15 AM5/21/15
to meetech...@googlegroups.com, anc...@gmail.com
On Wednesday, May 20, 2015 at 21:45:13 (UTC+2), Wilbert Jackson wrote:
Lorenzo,

These changes are great work.

Yes!

Just for the record (Lorenzo already knows because he was there), it also dramatically improved our use case. Not a single crash of Janus and no weird behavior. Moreover, the server was able to handle several more streams than before.

Great work!
 

Lorenzo Miniero

unread,
May 21, 2015, 4:20:46 AM5/21/15
to meetech...@googlegroups.com, anc...@gmail.com
Most of the credit should go to Pierce, as he did the fixes on the NACK stuff ;-)

Ancor Gonzalez Sosa

unread,
May 22, 2015, 4:39:04 AM5/22/15
to meetech...@googlegroups.com
On Thursday, May 21, 2015 at 7:55:15 (UTC+2), Ancor Gonzalez Sosa wrote:
Just for the record (Lorenzo already knows because he was there), it also dramatically improved our use case. Not a single crash of Janus and no weird behavior. Moreover, the server was able to handle several more streams than before.

Just another detail about the mentioned test, for the record.

We observed that most clients were transmitting around 200 kbit/s, although the room was capped at 64 kbit/s. We thought it was related to the issues described by Wilbert in this thread (clients ignoring the room cap after a while). The solution seemed to be sending a "configure" request from the client side once in a while to set the bitrate back to 64k (or whatever is set for the room).

I did a preliminary implementation, and the problem seems to be something else. Neither Firefox nor Chromium actually ignores the cap set server-side, so it looks like I don't need the extra client-side "configure" call after all (which, on the other hand, works nicely to change the bitrate for any other purpose).
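In case it helps anyone, a periodic client-side "configure" like the one mentioned above could be sketched roughly like this with janus.js (the pluginHandle, interval and bitrate values here are illustrative assumptions, not actual Jangouts code):

```javascript
// Hedged sketch, not Jangouts code: periodically re-send the VideoRoom
// "configure" request so the publisher's bitrate cap is re-applied.
// `pluginHandle` is assumed to be a janus.js handle attached to
// "janus.plugin.videoroom".

function makeConfigure(bitrate) {
  // The VideoRoom plugin accepts a "configure" request with a bitrate
  // in bits per second, which Janus turns into a REMB cap.
  return { request: "configure", bitrate: bitrate };
}

function startBitrateRefresh(pluginHandle, bitrate, intervalMs) {
  // Returns the timer id so the caller can stop it with clearInterval().
  return setInterval(function () {
    pluginHandle.send({ message: makeConfigure(bitrate) });
  }, intervalMs);
}
```

e.g. `startBitrateRefresh(videoroomHandle, 64000, 10000)` would push the 64k cap every 10 seconds.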

The real problem is that Firefox doesn't seem to allow setting the cap below 192 kbit/s. Setting a room to 64k, 128k or 192k has the same effect: Chromium honors the limit, while Firefox caps the bitrate at 192k no matter how much lower the value is. With higher bitrates everything works as expected. Sending the "configure" from the client side changes nothing: any bitrate lower than 192 is treated as 192 by Firefox, no matter where the REMB originates.

I hope this information is useful to somebody. Of course, if you know a way to enforce a smaller bitrate in Firefox, I'd be glad to hear it.

Cheers.

 

Lorenzo Miniero

unread,
May 22, 2015, 4:46:01 AM5/22/15
to meetech...@googlegroups.com, anc...@gmail.com
Good find, I wasn't aware of that... thanks!

L.

Ancor Gonzalez Sosa

unread,
May 22, 2015, 5:26:09 AM5/22/15
to meetech...@googlegroups.com
On Friday, May 22, 2015 at 10:39:04 (UTC+2), Ancor Gonzalez Sosa wrote:
Just another detail about the mentioned test, for the records.

We observed that most clients were transmitting around 200 kbit/s, although the room was capped at 64 kbit/s. We thought it was related to the issues described by Wilbert in this thread (clients ignoring the room cap after a while). The solution seemed to be sending a "configure" request from the client side once in a while to set the bitrate back to 64k (or whatever is set for the room).

I did a preliminary implementation, and the problem seems to be something else. Neither Firefox nor Chromium actually ignores the cap set server-side, so it looks like I don't need the extra client-side "configure" call after all (which, on the other hand, works nicely to change the bitrate for any other purpose).

The real problem is that Firefox doesn't seem to allow setting the cap below 192 kbit/s. Setting a room to 64k, 128k or 192k has the same effect: Chromium honors the limit, while Firefox caps the bitrate at 192k no matter how much lower the value is. With higher bitrates everything works as expected. Sending the "configure" from the client side changes nothing: any bitrate lower than 192 is treated as 192 by Firefox, no matter where the REMB originates.

I hope this information is useful to somebody. Of course, if you know a way to enforce a smaller bitrate in Firefox, I'd be glad to hear it.

Replying to my own question: it can be adjusted via this setting in about:config (set to 200 by default):
 media.peerconnection.video.min_bitrate

I tested it and it works.
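For reference, the same change can be made persistent via a user.js preference in the Firefox profile (a sketch; the value appears to be in kbit/s, matching the 200 default mentioned above):

```
// In the Firefox profile's user.js (equivalent to editing about:config):
// lower the minimum video encoder bitrate from the 200 kbit/s default.
user_pref("media.peerconnection.video.min_bitrate", 64);
```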

Cheers.
 