Concurrent webcam activations with more than 40 participants

Fabrice Rouillier

Feb 15, 2021, 1:07:03 PM
to BigBlueButton-Setup
Hi all,

We are definitely facing serious problems with 2.2.31 (behind a NAT on the Scaleway cloud) as soon as there are more than about 40 participants in a session: only very few webcams can be switched on, exactly as described in https://github.com/bigbluebutton/bigbluebutton/issues/11099

We did not face this kind of problem with 2.2.26.

Would rolling back to 2.2.26 be the best solution for now?
(We had to cancel important meetings, and I do not think users will give us another chance if problems occur again.)

Fred Dixon

Feb 15, 2021, 2:46:13 PM
to BigBlueButton-.
Hi Fabrice,

Have you enabled parallel Kurento servers, see



Regards,... Fred

--
You received this message because you are subscribed to the Google Groups "BigBlueButton-Setup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigbluebutton-s...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigbluebutton-setup/db0e59a1-7075-4b1a-af05-009885052550n%40googlegroups.com.


--
BigBlueButton Developer

Like BigBlueButton?  Tweet us at @bigbluebutton

Fabrice Rouillier

Feb 15, 2021, 3:50:13 PM
to BigBlueButton-Setup

Hi Fred,

I have tried many things, including enabling the parallel Kurento servers, using a coturn server (I remember that version 2.2.26 suffered from a bad default TURN server), and almost all the tricks I could find in several places (see for example https://github.com/manishkatyan/bbb-optimize).

On a well-suited Scaleway instance, the server works like a charm with rooms of fewer than 40 users, whatever the number of rooms, but any room with more than about 50 users is a nightmare.

For example, 10 rooms with 10 users and 10 cams each (100 simultaneous webcams on) work perfectly on a 32-vCPU instance, but one room with 50 users runs into many difficulties whenever the users try to switch their cams on.

It looks like a race problem at the beginning of the session, since some time later in the same session it became possible to have a dozen webcams on simultaneously.

Regards,

Fabrice.

basisbit

Feb 16, 2021, 5:34:01 AM
to BigBlueButton-Setup
Fabrice, if you need support for starting many webcam streams at the same time, I'd suggest you contact Kurento for commercial support: https://www.kurento.org/contact

Fabrice Rouillier

Feb 16, 2021, 8:35:50 AM
to BigBlueButton-Setup
Hi basisbit

Should I understand that this "limitation" is exclusively due to Kurento?

Do we have any information about it (something like a maximum number of streams)?

Regards,

Fabrice.

sd...@distancelearning.cloud

Feb 16, 2021, 9:09:02 AM
to bigbluebu...@googlegroups.com

You are not going to be able to have 50 cams on in one room, in any current version of BBB.

50x50 = 2500 streams for that meeting. The BBB client browser simply cannot handle that many peer connections and video decoding processes.

You will have disconnects and timeouts for sure, and disappoint your end users.
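
As a rough illustration of that estimate (a sketch only: `estimate_streams` is a made-up helper, and it assumes every shared camera is delivered once to every participant, which is how the 50x50 figure is obtained):

```shell
#!/bin/sh
# Rough stream count for one meeting, assuming each shared camera is
# delivered once to every participant (as in the 50x50 estimate).
# estimate_streams is a hypothetical helper, not part of BBB.
estimate_streams() {
    cams=$1          # number of webcams switched on
    participants=$2  # number of users in the room
    echo $(( cams * participants ))
}

estimate_streams 50 50   # every one of 50 users shares a camera: prints 2500
estimate_streams 15 50   # 15 cameras in a 50-user room: prints 750
```

The second call shows how quickly the count drops when only some of the participants share a camera.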

 

Using video pagination and limiting the number to <5 for viewers may help.

 

You said it worked in .26?  That’s hard to believe,  but you can use bbb-install to build this older version if it worked better for you.

 

Regards,

Stephen

Fabrice Rouillier

Feb 16, 2021, 9:55:21 AM
to BigBlueButton-Setup

Hi Stephen,

"You not going to be able to have 50 cams on in one room, in any current version of BBB."

The goal is not to have 50 webcams open but, say, 10 or 20 (more is just useless) simultaneously, with a total number of participants greater than 50. Currently it is difficult to have more than 2 or 3 webcams on simultaneously as soon as the number of participants exceeds 50; problems occur at connection time.

I used to have a dozen webcams open simultaneously with more than 100 participants on version 2.2.26. For example, in this video we were streaming part of a BBB room with OBS, with more than 100 participants, and you can see that we had more than 12 webcams on simultaneously (a few minutes before the beginning of the event we had more than 20 webcams) and it worked perfectly: https://www.twitch.tv/videos/646628448


Regards,

Fabrice.

sd...@distancelearning.cloud

Feb 16, 2021, 10:31:21 AM
to bigbluebu...@googlegroups.com

So you're saying that even if you have 1 meeting with just 2 to 3 cams, you are not able to get past 50 users on your Scaleway instance? And this same instance on .26 had no issues?

What size instance are you using? Are these 3 standard streams, or HD?

It would be interesting to see htop running on the server when the issues start, to see what's happening on the server.

Fabrice Rouillier

Feb 16, 2021, 10:48:57 AM
to BigBlueButton-Setup
On Tuesday, February 16, 2021 at 16:31:21 UTC+1, distancelearning.cloud wrote:

> So you're saying that even if you have 1 meeting with just 2 to 3 cams,

Precisely. It is almost the same situation as described in the issue mentioned earlier, with "have all the users enable their webcam sharing" replaced by "between 10 and 20 users enable their webcam sharing".

> And this same instance on .26 had no issues?

Right.

> What size instance are you using? Are these 3 standard streams, or HD?

In the last attempt it was a 32-vCPU instance with 128 GB of RAM.

> It would be interesting to see htop running on the server when the issues start, to see what's happening on the server.

The load did not exceed 30% at any moment during the last experiments, but I didn't capture more precise information (I was the host of the session ...).

If it helps, I noticed the following:
- there were unexplained load peaks, as if some program were racing
- when using a coturn server, there are almost always 401 errors (incoming packet message processed, error 401: Unauthorized) in the logs when clients are connecting (maybe this could confirm the "race" phenomenon?).

Note that removing the coturn server and going back to the default configuration with the Google STUN server does not solve the problem.

Regards,

Fabrice.


basisbit

Feb 16, 2021, 11:23:07 AM
to BigBlueButton-Setup
Please post a screenshot of htop, sorted by CPU usage. I'd bet some money on nginx/meteor hitting 100% on that server, which is ridiculously sized.

TLDR: I'd suggest changing your setup to a couple of dedicated servers with 8 or 12 CPU cores. Much better performance and much cheaper (depending on the hardware and data center provider used). Whoever chose to do a single BBB setup on a Scaleway GP1-L probably did not read https://docs.bigbluebutton.org/2.2/install.html#minimum-server-requirements . If you have to use the Scaleway GP1-L, setting up 6 BBB servers on each GP1-L server using VMs (for example KVM) or linux containers might be your best approach.

Fabrice Rouillier

Feb 16, 2021, 12:02:17 PM
to BigBlueButton-Setup
"If you have to use the Scaleway GP1-L, setting up 6 BBB servers on each GP1-L server using VMs (for example KVM) or linux containers might be your best approach."

I do not "need" to use a GP1-L; I just happened to have a GP1-L the last time I had trouble.

That is what I have been doing almost every week for about a year ... Maybe you should have a look at the set of scripts I wrote for setting up VMs automatically in many situations:
https://gitlab.inria.fr/rouillie/bbb or have a look at the Scaleway testimonial about what I have been doing with their cloud and BBB for several months, https://www.scaleway.com/en/customer-testimonials/animath/, and at the kinds of large events we have been organizing with BBB since April 2020: https://www.inria.fr/fr/parlons-maths-dematherialisation

So please, try to give me a little credit when I say that something changed between version 2.2.26 and version 2.2.31, and believe me when I tell you that I have read almost every line of the official documentation several times, run several tests, and organized several events with dozens of meetings in parallel.

I am not a specialist in Kurento or in WebRTC, I am just a poor researcher in mathematics, but I know how to check the load of my machine, and I set up machines running BBB almost every week on different instances, including physical machines, with or without NAT.

So I am just trying to spend time in order to understand, and my guess is that this could benefit the development.

> Please post a screenshot of htop, sorted by CPU usage. I'd bet some money on nginx/meteor hitting 100% on that server, which is ridiculously sized.

We are talking about a GP1-L (32 cpu, 128 GB of RAM) not about a DEV-L (4 CPU 8 GB of RAM)
We are talking about a room with around 50 people where you cannot have between 10 and 20 people open their webcams.
The load never exceeded 50% (I have set a Prometheus/Grafana alert) and the machine was, at that moment, clearly oversized (because I had set it up to host several VMs in the morning).

As already said, it used to work perfectly with twice the number of people and twice the number of webcams, without problems or load limits, as shown in the video I posted earlier.


basisbit

Feb 16, 2021, 12:37:17 PM
to BigBlueButton-Setup
Okay, maybe I would have lost that money on that bet :D

> We are talking about a GP1-L (32 cpu, 128 GB of RAM) not about a DEV-L (4 CPU 8 GB of RAM) 

Oh, still a slight chance... You can use a 1024-core system and it still has the problem of NodeJS being limited to a single CPU core thread, because the bbb-html5 code does all the work in the event thread instead of handing off the workload to worker threads. Depending on how the virtualization is set up, there might also be other VMs using the same CPU core, thus easily halving the CPU core time available to NodeJS.

So, please post a screenshot of htop made while this problem is happening.



Please update your scripts README.md to not tell people to open tcp/1935 on their firewall.

Fabrice Rouillier

Feb 16, 2021, 12:48:11 PM
to BigBlueButton-Setup
On Tuesday, February 16, 2021 at 18:37:17 UTC+1, basisbit wrote:
> Okay, maybe I would have lost that money on that bet :D
>
> > We are talking about a GP1-L (32 cpu, 128 GB of RAM) not about a DEV-L (4 CPU 8 GB of RAM)
>
> Oh, still a slight chance... You can use a 1024-core system and it still has the problem of NodeJS being limited to a single CPU core thread, because the bbb-html5 code does all the work in the event thread instead of handing off the workload to worker threads.

In that case, the situation might be better with BBB-dev-alphax with x>3, which runs several nodes in parallel, right? I am currently using a dev version on one of our servers and I find it rather stable.
(Note that this still does not explain the difference with 2.2.26.)

In the same spirit, would it make sense to increase the number of threads in Kurento's configuration?

> So, please post a screenshot of htop made while this problem is happening.

Sure, but I hope not to face it again :-)

Fabrice.
 

Paulo Lanzarin

Feb 16, 2021, 8:56:06 PM
to bigbluebu...@googlegroups.com
Fabrice,

The only part that "significantly" changed between .26 and .31 is Kurento (regarding the knowledge area of the problem you're referring to).
It got re-merged with upstream Kurento (version 6.15). Besides that, there weren't any bulky changes to video, client or server-side.
And the merge with 6.15 didn't bring up many changes as well (hence the "significantly").

The one suspected regression that was mentioned in the issue you refer to (the worker pool rewrite thing) was shipped way back in August, 2.2.23.
It's hardly likely to be that, as I mentioned in the issue. There's only one commit in 6.15 that smells like it could bring a regression of
sorts: the one that adds frame gap monitoring probes to try to compensate for recording gaps. Although it's not very likely to be the culprit,
I won't rule it out.

That version bump happened in .30. I think you can rollback to 2.2.26 via bbb-install (or even install it in a new VM and keep it running
for a field trial), and then see if the problem pops up again or not. I don't remember the command to do that. Somebody here in the list might,
I'm not familiar enough with bbb-install. See the P.S. 1 of this mail for an alternative to rolling back only Kurento.

100x12-20 cams is a dangerous scenario, though. That's a ceiling of ~2k streams, which I consider a dangerous number.
Kurento never scaled vertically well. My rule of thumb (which people say is too conservative but I still think is a dangerous number)
is 80 streams per core up to a safe ceiling of 800 streams (regardless of core count).
When you go over that number, whether it works ok or not largely depends on the scenario, eg: if camera sharing is spread over during the
session's timeline, it should be better; if everything is shared at once, it probably will be spotty. And the list of possible scenarios is big.
See P.S. 2 for theoretical mitigations.
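
That rule of thumb can be written down as a small helper. This is only a sketch of the heuristic as stated (80 streams per core, capped at 800 regardless of core count); the function name is hypothetical and the numbers are a personal estimate, not a measured limit:

```shell
#!/bin/sh
# Rule-of-thumb ceiling from this mail: 80 streams per core,
# capped at 800 streams whatever the core count.
# safe_stream_ceiling is a hypothetical helper, not a BBB or Kurento tool.
safe_stream_ceiling() {
    cores=$1
    ceiling=$(( cores * 80 ))
    if [ "$ceiling" -gt 800 ]; then
        ceiling=800
    fi
    echo "$ceiling"
}

safe_stream_ceiling 8    # prints 640
safe_stream_ceiling 32   # prints 800: extra cores do not raise the ceiling
```

Under this heuristic, the ~2k streams of a 100-participant room with 20 cams is well above the ceiling on any single machine.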

My current load testing setup is kinda borked, so I can't run a comparison load test myself to see if there's a regression until I fix it (I will, and I will
get back to you once I do it).

Please let me know if you try the rollback and stop seeing the problem. Otherwise, if you don't plan to do so, also please let me know
and I'll find a way to speed up a load test on my side.

s,

prlanzarin

========

P.S. 1:

If you do not want to rollback the whole BBB, there's a second way: running a previous Kurento version via Docker.
It should be straightforward. See (presuming you've installed docker in your environment):

1 - If you're using 3 KMSs: sudo systemctl stop kurento-media-server-8888.service bbb-webrtc-sfu
If you're using just one KMS:  sudo systemctl stop kurento-media-server.service bbb-webrtc-sfu

2 -
```
docker run -d --name <name_of_your_container> --network=host -v /var/kurento:/var/kurento -e PORT=8888 -e KURENTO_LOGS_PATH=/var/log/kurento-media-server -e RTP_MAX_PORT=<udp-range-max> -e RTP_MIN_PORT=<udp-range-min> -e GST_DEBUG='Kurento*:5' -e NETWORK_INTERFACES=<your_main_network_interface> kurento/kurento-media-server:6.14.0
```
Please replace <*> with your env-specific values.
3 - sudo systemctl restart bbb-webrtc-sfu


If you try the docker thing, please keep an eye on BigBlueButton restarts. The packaged Kurento will not be able to spin up correctly if the docker is up (and vice-versa). So I recommend leaving the environment prepared
(image installed, but down), and turning stuff on (step 1, docker restart <name_of_your_container>, step 3) before your field trial.
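
The field-trial sequence described above (step 1, `docker restart`, step 3) could be wrapped in a small script along these lines. This is a sketch, not something shipped with BBB: the `run` and `switch_to_docker_kms` helpers and the `DRY_RUN` switch are assumptions added for illustration, and by default the script only prints the commands instead of executing them:

```shell
#!/bin/sh
# Switch from the packaged Kurento to the prepared Docker container.
# DRY_RUN=1 (the default here) only prints the commands.
# run, switch_to_docker_kms and DRY_RUN are hypothetical names.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

switch_to_docker_kms() {
    container=$1   # the name given to `docker run --name`
    multi_kms=$2   # "yes" if three KMS instances are in use
    if [ "$multi_kms" = "yes" ]; then
        kms_unit=kurento-media-server-8888.service
    else
        kms_unit=kurento-media-server.service
    fi
    run sudo systemctl stop "$kms_unit" bbb-webrtc-sfu   # step 1
    run docker restart "$container"                      # start the prepared container
    run sudo systemctl restart bbb-webrtc-sfu            # step 3
}
```

Calling `switch_to_docker_kms <name_of_your_container> no` prints the three commands for a single-KMS setup; setting `DRY_RUN=0` would execute them.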


P.S. 2:

There are theoretical approaches to alleviate the problematic scenarios (> 1k streams), but we've not implemented (or finished) any of them yet. For example:
  - Negotiation priority queues based on media directions (ie prioritize negotiating sendonly streams, aka shared cameras, over recvonly streams, aka subscribers)
  - Negotiation pacing/throttling: batch negotiation in queues of N streams based on the current number of streams in the meeting (avoid spiralling)
  - https://github.com/bigbluebutton/bbb-webrtc-sfu/issues/65
    * This: should be able to take Kurento off BBB's main server for big scenarios like yours (over 1k streams). Run multiple, smaller Kurento VMs. Round robin them. Keep them under the safe threshold. Be more resource efficient.
    * I need to fix the problems I mentioned in the issue for that to be production ready.

========


Fabrice Rouillier

Feb 17, 2021, 2:44:03 AM
to BigBlueButton-Setup
Dear Paulo, 

Thanks a lot for your detailed answer. I have a few questions (sorry in advance if some are stupid):

- Is the global maximum of 800, with a per-core maximum of 80, related to the number of threads set in kurento.conf.json (which is set to 10)? Would it be a good idea to try to increase this value?
- Shall we multiply 800 by 3 when using the multi-Kurento option? (I understood that we should not, because the 3 are used for different purposes.)
- Do we count this maximum per room or globally for the server?
- Is it a count of simultaneous connections, not of simultaneous webcams switched on?
- Is nodejs also a second bottleneck?

With a limit set to 1000 streams, I understand that at most 10 vCPUs will be used by Kurento independently of the configuration, and thus I deduce that it is almost useless to have a total greater than something like 24 vCPUs (adding 4 CPUs for nodejs, 4 CPUs for FreeSWITCH and 6 extra for other services) for 1 single room. Am I right?

Currently, with 2.2.31 and default settings, it is almost a miracle if one can switch on 2 cameras and share 1 screen in good conditions in a room of 120 users (so "only" 360 streams).
On the other hand, 10 rooms with 10 webcams each didn't cause trouble last Sunday.

So I will try the two following experiments in parallel (thanks for your instructions, I am sufficiently comfortable with the installation to use them):

- configure one server with the current (development?) version and try to push it to its maximum (find n such that n cameras can be switched on simultaneously with 100 users),
  with the very latest configuration modifications suggested in several places (most seem to be present in the development version).

- configure servers with the following versions:
       2.2.15: I used it in May with rooms with 100 participants and 15 simultaneous webcams (so around 1500 streams)
       2.2.26: used in September with 150 participants and 3 or 4 simultaneous webcams, without problems, and it was still fast and responsive.

Stefan L.

Feb 17, 2021, 3:43:08 AM
to bigbluebu...@googlegroups.com
Please do not run old versions of BBB (especially not anything older than 2.2.27) in public. Those are very easy to hack and get root shell on.
People have already spotted automated attacks for some of those vulnerabilities, which then install backdoors and hide themselves.
Also, Fabrice, you violate GDPR if you ignore above advice and have humans use that server.


Fabrice Rouillier

Feb 17, 2021, 4:04:06 AM
to BigBlueButton-Setup
> Also, Fabrice, you violate GDPR if you ignore above advice and have humans use that server.

:-)

Every year we organize the "Alkindi" competition on our servers, and the main sponsor is the DGSE (the French equivalent of the CIA);
see here for a description: https://concours-alkindi.fr/main.html#/

As several DGSE officials (including the director) have to connect, I am rather confident about the legal risks I take, since we are already very strictly audited.

Our public human users do not have any account on our BBB servers. We also use a coturn server to avoid calls to Google servers.

Moreover, the servers are created a few minutes before the events and destroyed (or stopped) a few minutes after, since our scripts make it possible to set up an entire VM in less than 5 minutes.

Regards,

Fabrice.

Paulo Lanzarin

Feb 17, 2021, 6:01:53 AM
to bigbluebu...@googlegroups.com
Fabrice,


> - Is the global maximum of 800 for a cor maximum of 80 related with the number of threads set in kurento.conf.json (which is set to 10) ? Would it be a good idea to try to increase this value ?

It isn't. Increasing that value (the media set thread count) has no effect AFAIK (it even can make things worse). But you could try it yourself if you're curious, because the data I have about
what I'm saying is 1.5 years old and I really don't want to go looking after it in my backup HDs :).


> - Shall we multiply 800 by 3 when using the multi-kurento option ? (I did understand that no because the 3 were used for different purposes)

Yes. But for each media type separately. And there's a critical point in spinning multiple KMS in the same machine where context switching diminishing returns
kick in, so you can put infinite Kurentos to get infinite streams in the same VM.


> Do we count this maximum per room of globally for the server?

Globally.


> - Is it a count for simultaneous connexions  not simultaneous webcams switched on ? 

Simultaneous streams (in + out).


> Currently, with 2.2.31 and default settings, it is almost a miracle if one can switch on 2 cameras and share 1 screen in good conditions in a room of 120 users (so "only" 360 streams).
> On the other hand, 10 rooms with 10 webcams each didn't cause trouble last Sunday.

This isn't normal. 360 is a very safe number, which makes me believe there is some sort of regression somewhere OR your environment changed in a way that triggers the
problem somehow. The fact that the 10*10*10 case works properly makes me think it might have something to do with 1-N scenarios rather than N-N. I'll try to run a webinar scenario load test myself
(usually I do 150-3 cam-1 screen) ASAP and get back to you.


Fabrice Rouillier

Feb 17, 2021, 7:19:29 AM
to BigBlueButton-Setup
> > Currently, with 2.2.31 and default settings, it is almost a miracle if one can switch on 2 cameras and share 1 screen in good conditions in a room of 120 users (so "only" 360 streams).
> > On the other hand, 10 rooms with 10 webcams each didn't cause trouble last Sunday.
>
> This isn't normal.

I am sure it is not.
So here are some of the settings that caused failures:

- 2.2.31 installed in January using bbb-install.sh with coturn enabled (both on the TURN server and on the BBB server).
-> It was not possible to switch on more than 2 webcams at the opening in a room with 50 people.
A second attempt was made after going back to a configuration without coturn (just replacing the stun-turn configuration file), without much more success.
-> With 30 people in a room, no problem was noticed; several webcams could be opened.

- 2.2.31 installed in February with a coturn server
-> In a room with 130 people, it was almost impossible to switch on more than 2 cameras (plus a shared screen) at the opening, but we got 4 webcams later.

> (usually I do 150-3 cam-1 screen) ASAP and get back to you.

This is very kind of you. Tell me if I can give more information or run tests on my side.

Regards,

Fabrice.
 

Paulo Lanzarin

Feb 17, 2021, 7:19:43 AM
to bigbluebu...@googlegroups.com
Fabrice,

A corrigendum to two things in my last mails:


> Yes. But for each media type separately. And there's a critical point in spinning multiple KMS in the same machine where context switching diminishing returns
kick in, so you can put infinite Kurentos to get infinite streams in the same VM.

* [...] so you CAN'T put infinite Kurentos [...]

> docker run -d --name <name_of_your_container> --network=host -v /var/kurento:/var/kurento -e PORT=8888 -e KURENTO_LOGS_PATH=/var/log/kurento-media-server -e RTP_MAX_PORT=<udp-range-max> -e RTP_MIN_PORT=<udp-range-min> -e GST_DEBUG='Kurento*:5' -e NETWORK_INTERFACES=<your_main_network_interface> kurento/kurento-media-server:6.14.0

Some environment variables had the names wrong. I was thinking about the 6140 version we use at the company I work for, not the upstream KMS image.
Here's the mapping for kurento/kurento-media-server:6.14.0:
  - RTP_MIN|MAX_PORT -> KMS_MIN|MAX_PORT
  - NETWORK_INTERFACES -> KMS_NETWORK_INTERFACES

Also, if your server is behind NAT, don't forget to set its external IP address or a STUN server (KMS_EXTERNAL_ADDRESS, KMS_STUN_IP, KMS_STUN_PORT).
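
Putting that mapping together, the command from P.S. 1 would look roughly like what the sketch below prints. It only applies the renames listed in this mail and leaves the other variables as they were in the original command; `kms_docker_cmd` is a made-up helper, the values in the example call (container name, interface, port range, address) are placeholders, and the function prints the command rather than running it:

```shell
#!/bin/sh
# Prints the docker command from P.S. 1 with the variable names
# corrected for the upstream kurento/kurento-media-server:6.14.0 image.
# kms_docker_cmd is a hypothetical helper; it does not run docker.
kms_docker_cmd() {
    name=$1         # container name
    iface=$2        # main network interface
    min_port=$3     # lower bound of the UDP range
    max_port=$4     # upper bound of the UDP range
    external_ip=$5  # public address, for servers behind NAT
    printf '%s ' docker run -d --name "$name" --network=host \
        -v /var/kurento:/var/kurento \
        -e PORT=8888 \
        -e KURENTO_LOGS_PATH=/var/log/kurento-media-server \
        -e KMS_MAX_PORT="$max_port" \
        -e KMS_MIN_PORT="$min_port" \
        -e "GST_DEBUG='Kurento*:5'" \
        -e KMS_NETWORK_INTERFACES="$iface" \
        -e KMS_EXTERNAL_ADDRESS="$external_ip" \
        kurento/kurento-media-server:6.14.0
    echo
}

# Placeholder values only; replace them with your env-specific ones.
kms_docker_cmd kms614 eth0 24577 32768 203.0.113.10
```

Printing first makes it easy to review the final command before pasting it; KMS_STUN_IP and KMS_STUN_PORT could be used instead of KMS_EXTERNAL_ADDRESS, as noted above.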


s,

prlanzarin

Fabrice Rouillier

Feb 17, 2021, 7:27:31 AM
to BigBlueButton-Setup
Thanks for the erratum.

> Also, if your server is behind NAT, don't forget to set its external IP address or a STUN server (KMS_EXTERNAL_ADDRESS, KMS_STUN_IP, KMS_STUN_PORT).

Yes, I should clarify: in all these experiments the server is behind a NAT. I have only one server that is not behind a NAT, but it is a small one.

Fabrice 

Fabrice Rouillier

Feb 17, 2021, 8:45:40 AM
to BigBlueButton-Setup
The last experiment, a few minutes ago with 60 participants, was ***catastrophic***.

Configuration  
**************

- GP1-M Scaleway : 16 vcpu 64 GB of RAM

- BBB 2.2.31 behind a NAT on the Scaleway network in Paris.
Server initially configured with a coturn server using bbb-install.sh (1); coturn was removed for the test by replacing the file /usr/share/bbb-web/WEB-INF/classes/spring/turn-stun-servers.xml.with_turn by the original one.

The server was switched off for the night, which induces a change of local IP that we correct automatically at boot time with a script (2).

(1) and (2) were executed by the install_bbb.sh and change_internal_ip.sh scripts from https://gitlab.inria.fr/rouillie/bbb. The first one runs a slight modification of the official bbb-install.sh in which I only changed the way internal/external IPs are detected in the get_IP() function, since the original ones do not properly detect our IPs on our network; the second is very short and modifies the files where the local IP appears.

- Other modifications : 
 /usr/share/meteor/bundle/systemd_start.sh  :  PORT=3000 /usr/share/$NODE_VERSION/bin/node --max-old-space-size=4096 --max_semi_space_size=128 main.js
/etc/bigbluebutton/bbb-conf/apply-config.s : enableMultipleKurentos 
/etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini : externalIPv4=Public-IP-of-BBB-Server
/usr/share/meteor/bundle/programs/server/assets/app/config/settings.yml :  baseTimeout to 60000

Experiment:
************

- Starting with ONE webcam: quite slow to be activated, but it works
- 5 minutes later, a second webcam: it showed a black square for 2 minutes and then worked
- 5 minutes later: switch off all cameras and then ask a total of 5 participants to switch on their webcams immediately ... It turned out that after 4 minutes only a few participants could see some of the 5 webcams, as if the streams were opened sequentially and very slowly.
- In parallel, trying to switch on my webcam in a separate room where I was alone led to a 2003 error.

Here are the htop at the beginning of the session, the htop during the connection of the 5 webcams, and the Grafana report.

(basisbit: you have definitely lost your bet xD)

Regards,

Fabrice.
Image collée à 2021-2-17 14-10.jpg
Image collée à 2021-2-17 13-52.jpg
Image collée à 2021-2-17 13-50.jpg

sd...@distancelearning.cloud

Feb 17, 2021, 8:59:52 AM
to bigbluebu...@googlegroups.com

Are the 60 users bots from one server, or real users from different locations?

Also multipleKurentos is a newer option…   did .26 have this?   If you ran the same test with 1 kurento… does it behave the same.

 

 

 


Fabrice Rouillier

Feb 17, 2021, 9:04:48 AM
to BigBlueButton-Setup
> Are the 60 users bots from one server, or real users from different locations?

Real users.

> Also multipleKurentos is a newer option…   did .26 have this?   If you ran the same test with 1 kurento… does it behave the same.

No, .26 didn't. We ran the same test with 1 Kurento and saw the same behavior; we also tried with coturn, without success.

The problems appeared with BBB 2.2.31, and we have not found any rational explanation after several tests with several installations.



Paulo Lanzarin

Feb 17, 2021, 9:08:38 AM
to bigbluebu...@googlegroups.com
Fabrice,

Please run

$ grep -i "CPU exhausted" * -r

in /var/log/kurento-media-server/
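
A possible variant of that check, as a sketch: it counts matching lines per log file instead of printing them, so an empty result means no file contains the message. The helper name is made up; the default directory is the one given above.

```shell
#!/bin/sh
# Count "CPU exhausted" lines per file under a Kurento log directory.
# count_cpu_exhausted is a hypothetical helper, not a BBB tool.
count_cpu_exhausted() {
    log_dir=${1:-/var/log/kurento-media-server}
    # -r: recurse, -i: ignore case, -c: print "file:count" for each file;
    # the second grep drops the files with a count of 0.
    grep -ric "CPU exhausted" "$log_dir" | grep -v ':0$'
}
```

Running `count_cpu_exhausted` with no argument inspects the default directory; passing another path checks a copy of the logs instead.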



Paulo Lanzarin

Feb 17, 2021, 9:13:05 AM
to bigbluebu...@googlegroups.com
Also, people keep bringing up the multiple Kurento stuff.
Everyone should always use it. It's very improbable it has anything to do with this, even more when
Fabrice confirms it has the same behaviour with one or three.
The core code that made it possible has been in the project for 2 years now
(https://github.com/mconf/bbb-webrtc-sfu/commit/7d989030ec127fa1358fd162610f5b8ec47a8e40). It's not fresh stuff.

Fabrice Rouillier

Feb 17, 2021, 9:30:23 AM
to BigBlueButton-Setup
> $ grep -i "CPU exhausted" * -r
>
> in /var/log/kurento-media-server/


The output is empty.

This is consistent with the htop output and the Grafana report. The behavior is very strange, just as if the bindings to the streams were simply very slow.

Fabrice. 

Paulo Lanzarin

Feb 17, 2021, 9:32:06 AM
to bigbluebu...@googlegroups.com
Are you able to share any logs, or would you be constrained by GDPR?


Fabrice Rouillier

Feb 17, 2021, 9:39:32 AM
to BigBlueButton-Setup
On Wednesday, February 17, 2021 at 15:32:06 UTC+1, plan...@gmail.com wrote:
> Are you able to share any logs, or would you be constrained by GDPR?

I am constrained (it was a private session in my research institute).

I was thinking that, among all the BBB users, it is rather unlikely that only our servers are affected; if that is the case, we should perhaps have a look at the cloud itself.

Cheers,

Fabrice.

basisbit

Feb 17, 2021, 9:50:54 AM
to BigBlueButton-Setup
... I asked for a screenshot of htop sorted by CPU usage (percentage), not by total CPU time since boot. Please provide the htop screenshot.

Fabrice Rouillier

Feb 17, 2021, 9:56:08 AM
to BigBlueButton-Setup
You can easily check, by performing a simple addition, that the main processes are listed.

I didn't have direct access to the machine; a colleague took the htop screenshots. We are putting a lot of energy into helping, and we are just doing our best.

sd...@distancelearning.cloud

Feb 17, 2021, 9:59:40 AM
to bigbluebu...@googlegroups.com

Fabrice, is your instance in Scaleway Paris, or another region?

 

 


Fabrice Rouillier

Feb 17, 2021, 10:01:54 AM
to BigBlueButton-Setup
> Fabrice,  your instance in paris scaleway? Or another region?

Scaleway Paris, precisely fr-par-1.


Fabrice.
 

Fabrice Rouillier

Feb 17, 2021, 10:16:38 AM
to BigBlueButton-Setup

To be precise: from only the first 4 lines (the addition to perform) and from 16 x (average load) = (total load) (oops, sorry, there was also a multiplication), I can show that no process takes 100% of the load (in fact, no process took more than 70% of the load).

So you lost your bet xD

13:50: the first 4 lines add up to 160% of CPU while the total load was 219%, so the combined load of ALL the other processes is less than 60%, and thus none of them reached 100%.
13:52: the first 4 lines add up to 130% of CPU while the total load was 198%, so the combined load of ALL the other processes is less than 70%, and thus none of them reached 100%.
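
For anyone who wants to redo the arithmetic above, here is a minimal Python sketch. The function name is made up for illustration; the figures are the ones quoted in this message. It only bounds the processes outside the first four lines: whatever they use together is the total minus the sum of those four lines, so no single one of them can exceed that remainder.

```python
def max_load_of_others(top_lines_pct, total_pct):
    """Upper bound (in % of one core) on any process outside the listed top lines.

    All unlisted processes share the remainder total_pct - top_lines_pct,
    so none of them individually can use more than that.
    """
    return total_pct - top_lines_pct


# (time, sum of the first 4 lines, total load), as quoted in the message.
samples = [("13:50", 160, 219), ("13:52", 130, 198)]

for time, top4, total in samples:
    bound = max_load_of_others(top4, total)
    print(f"{time}: every other process used at most {bound}% (below 100%: {bound < 100})")
```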
 

Paulo Lanzarin

Feb 18, 2021, 7:55:00 AM
to bigbluebu...@googlegroups.com
I was following this up with Fabrice in private.

There seems to be something with the provider, specifically, which is stalling Kurento's signalling
socket if enough pressure is applied.

It doesn't seem to be a KMS regression since we tested it with a rolled-back Kurento as well.

More investigation is still needed to figure out what is happening in that VM.


Fabrice Rouillier

Feb 18, 2021, 9:29:18 AM
to BigBlueButton-Setup
As Paulo said last night (many thanks to him for the time spent), the good news is that we can now reproduce the problem easily.

- I have contacted the provider; they are studying the problem right now.
- In parallel, over the next few days I will test other setups, from bare metal on the same cloud to physical machines on another network, behind a NAT or not.

The upside is that we will then have a full protocol for testing installations :-)

Paulo Lanzarin

Feb 22, 2021, 7:54:38 AM
to bigbluebu...@googlegroups.com
Fabrice and I went a bit further on this.

It seems it is kind of a regression, but it only manifests itself in some specific providers
(but I'm not yet sure if it's a generalized issue in that specific provider or just the region
we were testing in).

So, the epilogue:

I had enabled mDNS candidate processing in bbb-webrtc-sfu v2.2.23, which landed in v2.2.30.
The rationale on why I enabled it is described in https://github.com/bigbluebutton/bbb-webrtc-sfu/pull/76.
I only enabled it because it'd be cool to cover the intranet-with-no-STUN-nor-TURN scenarios, but that's life.

This seems to have triggered a problem in Fabrice's VMs (Scaleway) where the DNS queries generated
by that were stalling Kurento somehow. Pcaps indicate replayed responses AND missing query responses,
so some of the signalling threads might be hogged by those derelict mDNS queries.

I've not yet isolated the side effect to its core, but I've disabled it again in the aforementioned PR and that should
land in 2.2.32. Taking into account that I've not yet isolated the side-effects precisely, there's also the possibility that we're wrong
and that enabling mDNS is not the culprit. But that's where bisecting + load testing led me to.

Disabling mDNS resolve should have no effect on properly configured BBB instances.

So, to review our quick-fix process:
  1. change kurentoAllowMDNSCandidates to false in /usr/local/bigbluebutton/bbb-webrtc-sfu/config/default.yml
  2. sudo systemctl restart bbb-webrtc-sfu
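
For install scripts, step 1 could be automated along these lines. This is only a sketch: the helper name is hypothetical, and it assumes the option appears in default.yml as a plain `kurentoAllowMDNSCandidates: <value>` line, which should be checked against the actual file. It works on the file's text, so it can be tried on a copy first.

```python
import re

# Matches the option at any indentation and captures everything up to its value.
_MDNS_LINE = re.compile(
    r"^([ \t]*kurentoAllowMDNSCandidates[ \t]*:[ \t]*)\S+", re.MULTILINE
)


def disable_mdns_candidates(yaml_text: str) -> str:
    """Return yaml_text with kurentoAllowMDNSCandidates set to false.

    Raises ValueError if the option is not present, rather than silently
    returning the text unchanged.
    """
    new_text, count = _MDNS_LINE.subn(r"\g<1>false", yaml_text)
    if count == 0:
        raise ValueError("kurentoAllowMDNSCandidates not found")
    return new_text


if __name__ == "__main__":
    # Illustrative input only; the real file has many more options.
    sample = "kurentoAllowMDNSCandidates: true\n"
    print(disable_mdns_candidates(sample), end="")
```

After writing the result back to /usr/local/bigbluebutton/bbb-webrtc-sfu/config/default.yml, step 2 (restarting bbb-webrtc-sfu) is still needed for the change to take effect.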

Further configs I've suggested to Fabrice that are not crucial, but that will improve responsiveness to video/listen only/screenshare a lot
in that specific scenario (BBB and Greenlight running in the same NAT VM):

 - Add the networkInterfaces config option to your install scripts. It's configured in /etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini. You should set it to whichever is the default, public network interface on your server (in this case, ens2).
    * If the NIF has unwanted IPs (i.e. private intranet IPs for VPS comms which are not used for media exchange), you can add them to ipIgnoreList.
 - Please set niceAgentIceTcp (/etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini) to 0 if you're not using TCP for media comms (that's BBB's default, so it can be set to 0).
 - /etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini -> Set externalIPv4 if your server is behind NAT.
 - /etc/kurento/modules/kurento/WebRtcEndpoint.conf.ini -> stunServerAddress/stunServerPort: this config is superseded by externalIPv4 in the absolute majority of scenarios, so you don't need to set it. The cases where you do need to set it are on-premises deployments where you need media to work both over the intranet and the internet at the same time.
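
As a rough illustration of the list above, an install script could generate the relevant WebRtcEndpoint.conf.ini lines like this. It is a sketch under assumptions: the function name is hypothetical, the `key=value` layout is assumed from the usual .ini convention, and 203.0.113.10 is a documentation placeholder, not a real server address. Only the option names and the ens2 interface come from the message above; ipIgnoreList is left out because its value format is not given here.

```python
def render_webrtc_endpoint_settings(interface, external_ipv4=None, ice_tcp=False):
    """Build key=value lines for the options discussed above.

    interface      public network interface used for media (e.g. "ens2")
    external_ipv4  public IPv4 of the server; only emitted when behind NAT
    ice_tcp        False writes niceAgentIceTcp=0 (no TCP for media)
    """
    lines = [
        f"networkInterfaces={interface}",
        f"niceAgentIceTcp={1 if ice_tcp else 0}",
    ]
    if external_ipv4 is not None:
        lines.append(f"externalIPv4={external_ipv4}")
    return "\n".join(lines) + "\n"


# Placeholder address; replace with the server's real public IPv4.
print(render_webrtc_endpoint_settings("ens2", external_ipv4="203.0.113.10"), end="")
```

stunServerAddress/stunServerPort are deliberately not emitted, in line with the note that externalIPv4 supersedes them in most setups.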


s,

prlanzarin

Paulo Lanzarin

Feb 22, 2021, 8:04:28 AM
to bigbluebu...@googlegroups.com
Correction:


> I had enabled mDNS candidate processing in bbb-webrtc-sfu v2.2.23, which landed in v2.2.30.

Fabrice Rouillier

Feb 22, 2021, 8:11:55 AM
to BigBlueButton-Setup
> Fabrice and I went a bit further on this.

Ha ha, friendly guy, you were the main investigator in this story :-)

> Further configs I've suggested to Fabrice that are not crucial, but that will improve responsiveness to video/listen only/screenshare a lot
> in that specific scenario (BBB and Greenlight running in the same NAT VM):

I tested with bots on several machines (on the Scaleway cloud, on a public research network, behind a NAT or not, using coturn or not), with a couple of other options such as multiple Kurento servers, and got excellent results each time.

I confirm the impressive improvements; the modified version of BBB works amazingly well.

Fabrice.
