2.5.4 - Long meeting issue: after a couple of hours, joining audio became slow, then impossible


Pablo Pico

Aug 25, 2022, 12:53:04 PM
to BigBlueButton-dev
Hello,
Using BigBlueButton 2.5.4, I found an issue in a long meeting with more than 100 users, 0 webcams, and 0 screen shares, just the whiteboard. After two hours, joining audio became slower and slower until it was impossible, and random disconnections started to appear. Everything got worse as time passed.

This happened yesterday. I can confirm that the issue only started when the meeting was more than 2 hours old. That, and other reasons I explain here, make me believe it was not directly related to the number of users.

Facts:
  • Users started to get disconnected or kicked out at random after more than 2 hours of class.
  • The teacher reported this, and a colleague at another location and I joined to observe the issue. We were dropped too, several times and at random moments, so I could not find any pattern.
  • After being dropped from audio or from the entire meeting, we would try again, and this was the one consistent thing: as time passed, joining audio became slower and harder until it was almost impossible. It would time out.
  • When the problem started, I discovered that using "listen only" instead of "microphone" mode looked like an immediate workaround, because it would join really fast.
  • But the problem kept advancing during the next hour, and random disconnections increased until the situation was completely unacceptable.
  • Most of the time when I got kicked, the displayed message was misleading, because it sounded as if someone else had kicked me out. That was not the case. Sometimes, especially when I joined in listen-only mode, I would only get disconnected from audio instead of being kicked out.
  • The server CPU was always low (less than 5%). There was nothing unusual about it. In fact, there was only one meeting at the time.
  • We started other meetings during the issue to confirm it was limited to that one meeting.
  • At some point the situation got so bad that the teacher was dropped too. We asked him to end the meeting and start it again.
  • After the meeting was restarted, it behaved as it had during the first two hours: no issues.

I would like to ask for your help.
  1. Has anyone experienced this?
  2. Does anyone know the possible root cause? I tried to imagine possible causes. Maybe something that grows without limit during the meeting and is proportional to the number of attendees? I looked at the chat. They used it, but it did not look big enough to have an impact.

Also, I would like to ask this:
Meteor is nowadays split across multiple processes, but I would like to know whether a single meeting takes advantage of this. In other words, can a single meeting use multiple processes or just one?

A few extra details that might be of interest:
  • Rule out any CPU or bandwidth issues: the server is monitored with Grafana and it never went above 5% CPU.
  • Since the meeting was bigger than usual, restrictions were applied: no private chat, no editing of shared notes, no participants list, no webcams, no microphone. Basically only the moderator was speaking. Participants could only use the public chat, so the meeting would stay "light". And at first we really had zero issues.
  • We know of two directives capable of kicking users out:
    allowDuplicateExtUserid=false (we were not allowing duplicate users)
    maxInactivityTimeoutMinutes=90
    I mention these in case there is a bug in either of them. I have disabled both in case we get another chance soon, but since that specific class is unhappy with what happened, the teacher will be using something else for that class.
  • Several times during the issue we opened another meeting on the same server and everything ran smoothly.

Note that the slowness we saw was not in the HTML5 client as a whole; it was in joining audio, plus the disconnections.

21UCS521 Jerome Abel

Aug 25, 2022, 1:51:34 PM
to BigBlueButton-dev
You will need bigger servers or a load balancer pool. Bare-metal servers are preferred.

Fred Dixon

Aug 25, 2022, 5:23:13 PM
to bigblueb...@googlegroups.com
Hi Pablo,

Thanks for reporting this.

> Most of the times when i got kicked the displayed message was misleading because it sounded like if someone else kicked me out. But that was not the case. Sometimes, especially when i joined in listen only mode i would only get disconnected from audio instead of being kicked out.

That message "You have been disconnected" was an unfortunate wording -- no one kicked you out; rather, the server lost connection and initiated a disconnect.  We're going to be modifying that message in an upcoming 2.5.x update.

> Does anyone know what the possible root cause for this is? I tried to imagine possible causes. Maybe something that grows without limit during the meeting and is proportional to the number of attendees? I looked to the chat. They used it but it did not look that big to be able to impact.

...

> The server CPU was always low (less than 5%). There was nothing unusual with it. In fact, there was only one meeting at the time.

Ok, this is a puzzle.  Were there any processes running at 100%?


Can you say more about what was happening in the class?  Was it only one teacher talking, or a variety of people?  Was there any presentation, or lots of presentations?  Was there little chat, or lots of chat?  Etc.

Your post was very good.  Any additional information you can share as to what was occurring during the meeting will help us narrow it down.


Regards,... Fred


--
You received this message because you are subscribed to the Google Groups "BigBlueButton-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigbluebutton-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigbluebutton-dev/1602c30d-9143-4acc-972d-ec9e0fb48839n%40googlegroups.com.


--
BigBlueButton Developer

Like BigBlueButton?  Tweet us at @bigbluebutton

Pablo Pico

Aug 26, 2022, 4:47:10 PM
to BigBlueButton-dev
Hello,

Jerome: thanks for your reply, but this is not a matter of needing a bigger server (the server has 24 cores and could easily handle 600 users). Nor is it a balancing matter, because a balancer puts each meeting on one server, and this issue showed up in a single meeting on a server with no other meetings.

Fred, thanks for your response.
Today's experience strengthens the hypothesis that this is somehow related to meeting duration.

We convinced the teacher to try his class again. There were about 140 people and 5 hours of class, recording most of the time.
My colleague and I kept monitoring everything, including CPU with htop and Grafana, to reconfirm that CPU cannot be the issue. Not even one core was above 10%.
At first, when the meeting was new, we could not see any issues. We kept a couple of test users inside the meeting and a couple more joining and leaving every now and then.
Then, when the meeting was more than 90 minutes old, the issue appeared. Connecting the microphone was really slow. Sometimes during the echo test we would get a disconnection alert. Sometimes it would connect.

We guessed we were going to have the same issue. Because we did not want them to stop liking BBB, we asked the teacher to completely restart the meeting during the break they take after the first two hours of class. He did restart the meeting, and the new meeting again started with zero issues. Moreover, we used a different server (one that has been in use and stable for a long time) for the second part of the class.
And again the issue appeared, with the same behavior!
Again, at first, when this second meeting was new, everything was good and smooth. We had to wait about 2 more hours to see the slowness again.

Symptoms:
  • First: a noticeable slowness when connecting with microphone (which in this meeting was only possible for moderators).
  • Second: during the echo test you get an alert showing an audio disconnection, but it still connects.
  • Third: joining the meeting with microphone becomes slower and slower until it becomes almost impossible. Again, keep in mind that this only affects a user who joins late, when the meeting is old.
  • Fourth: random disconnections.
  • Fifth: it is impossible to rejoin with microphone, and if you succeed you get kicked out quickly.
  • Sixth: "listen only" mode seems less prone to being kicked out of the meeting completely, and just gets disconnected from audio.
Our tests were done mostly with moderators because we wanted to compare the ability to join with mic vs. "listen only".
Since this meeting never got as old as the one two days ago, we never reached the point where many users started to get disconnected and/or even "listen only" mode became ineffective.

Interesting things we observed:
  • The teacher never lost connection and apparently the students did not have major issues, but keep in mind that, unlike our test users, these users connected when the meeting was still new.
  • Fortunately, if you never get disconnected or kicked out of the meeting, you will not notice any issues.
  • As the meeting grows older, the chance of getting disconnected appears to grow.
  • The first time we experienced this, two days ago, it was worse than today, apparently because we did not restart the meeting soon enough.
  • Again we ran test meetings on the same server during the issue to confirm that the issue was not global to the server but specific to the meeting.
  • The first server (used for the first meeting) had no other meetings. The second server had a few other meetings from other classes, and they had no issues. Both servers run the same 2.5.4 version of BBB, so I conclude that the problem only appears under the specific conditions of this meeting and not in just any meeting.

The meeting had:
  • Only one mic open (the teacher) all the time. No screen share, just the whiteboard with a presentation of 141 pages.
  • Participants had no access to the microphone (all forced to listen only). No webcams at all. There was very low usage of public chat. Private chat was restricted. The participants list was also restricted.
  • During the last hour of the meeting there were fewer than 90 users left and the issue remained.

Things to consider before jumping to conclusions:
  • We have been using BigBlueButton for almost 2 years and are very familiar with the technical side.
  • I can confirm that this is not a connectivity issue, because we used test users from different locations and different ISPs, and the issues were consistent across all our connections, especially when the meeting was old enough.

I found this: https://github.com/bigbluebutton/bigbluebutton/issues/15070 and I believe that our issue is related to the one described in that thread.
We will try to reproduce the issue with a long meeting of just a couple of users, keeping everything else as similar as we can. I will keep you posted.

elem hsb

Aug 28, 2022, 9:48:57 AM
to bigblueb...@googlegroups.com
Hi,

I reproduced the same behaviour, as a one-man show.

Started with the mic connected, then tested after more than 90 minutes:
- connecting the cam gave error 1020; connecting was impossible
- disconnecting the mic and reconnecting gave error 1020 too; connecting was impossible

Workaround:
- Briefly leave the meeting and join again; after that, connecting works like a charm.

Setup:
  BBB 2.5.5, fullaudio, Google Chrome on macOS.

regards

Fred Dixon

Aug 28, 2022, 10:04:39 AM
to bigblueb...@googlegroups.com
> i reproduced the same behaviour, as a ONE man show.

Ok.  Can you try this on 


and let us know if you can reproduce.

Regards,... Fred


  



elem hsb

Aug 28, 2022, 6:01:33 PM
to bigblueb...@googlegroups.com
Dear Fred,

I can't reproduce it on https://test25.bigbluebutton.org/

Client version: 2846
BigBlueButton Version: 2.5.5

regards

Fred Dixon

Aug 30, 2022, 7:05:59 AM
to bigblueb...@googlegroups.com
Hi Elem,

> Setup:
>  BBB 2.5.5, fullaudio, Google Chrome on MacOS.

We don't have fullaudio enabled on https://test25.bigbluebutton.org/.   It's still experimental, see


Can you disable fullaudio and try it again?  Let us know if you have the same experience as on https://test25.bigbluebutton.org/.


Regards,... Fred



elem hsb

Aug 31, 2022, 10:48:46 AM
to bigblueb...@googlegroups.com
Hi Fred,


On Aug 30, 2022, at 13:05, Fred Dixon <ffd...@gmail.com> wrote:

> Can you disable fullaudio and try it again?  Let us know if you have the same experience as on https://test25.bigbluebutton.org/.

I have disabled fullAudio and tested again.

It fails again, on all reconnects, not only on audio:

failed 1002 on new audio connect
failed 1020 on new cam connect
failed 1102 on screenshare connect

It seems like a session / user key has expired.

Rejoining the running session works fine; all connects are possible again.

Best regards

elem  


Fred Dixon

Aug 31, 2022, 12:14:26 PM
to bigblueb...@googlegroups.com
Hi Elem,

> It faults again, on all reconnects, not only on audio ...

> failed 1002 on new audio connect
> failed 1020 on new cam connect
> failed 1102 on screenshare connect

> It seems like a session / user key has expired ...


Not sure what is happening -- something is different between (A) your BigBlueButton server and its network vs. (B) https://test25.bigbluebutton.org/ and its network.

I would open a session on both your server and https://test25.bigbluebutton.org/ at the same time and see if yours drops.  If so, is there anything different between (A) and (B) that might give clues?

Regards,.. Fred





Paulo Lanzarin

Aug 31, 2022, 12:14:52 PM
to bigblueb...@googlegroups.com
> It seems like a session / user key has expired ...
>
> re-join to the running session works fine, all connects are now possible again.

You might be onto something.  All those socket paths you mentioned (default microphone, cam, screen sharing)
go through the /checkAuthorization endpoint, which requires a valid session + valid sessionToken + valid session cookie
to allow the connection.

If that is the case, something might be making that information stale on the server somehow.
If you can reliably reproduce that, please open the browser's network tab, filter by WS,
and then trigger the problem again. Take note of the newly opened WS entry and expand all its requests. Send it here either as text
or as a screenshot, if possible.



elem hsb

Aug 31, 2022, 4:50:39 PM
to bigblueb...@googlegroups.com
Hi Paulo,

Here are the results: the cam connect at 16:51 works fine; the reconnect at 22:21 fails.

nginx errors on the server:

2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed (111: Connection refused) while connecting to upstream, client: 2003:f7:e711:7700:2552:5206:eeb4:663f, server: bbb5.fk4.hs-bremen.de, request: "GET /html5client/null HTTP/2.0", upstream: "http://127.0.0.1:4107/html5client/null", host: "bbb5.fk4.hs-bremen.de", referrer: "https://bbb5.fk4.hs-bremen.de/html5client/join?sessionToken=zf9dmaojn7cft5me"
2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed (111: Connection refused) while connecting to upstream, client: 2003:f7:e711:7700:2552:5206:eeb4:663f, server: bbb5.fk4.hs-bremen.de, request: "GET /html5client/null HTTP/2.0", upstream: "http://127.0.0.1:4106/html5client/null", host: "bbb5.fk4.hs-bremen.de", referrer: "https://bbb5.fk4.hs-bremen.de/html5client/join?sessionToken=zf9dmaojn7cft5me"
2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed (111: Connection refused) while connecting to upstream, client: 2003:f7:e711:7700:2552:5206:eeb4:663f, server: bbb5.fk4.hs-bremen.de, request: "GET /html5client/null HTTP/2.0", upstream: "http://127.0.0.1:4105/html5client/null", host: "bbb5.fk4.hs-bremen.de", referrer: "https://bbb5.fk4.hs-bremen.de/html5client/join?sessionToken=zf9dmaojn7cft5me"
2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed (111: Connection refused) while connecting to upstream, client: 2003:f7:e711:7700:2552:5206:eeb4:663f, server: bbb5.fk4.hs-bremen.de, request: "GET /html5client/null HTTP/2.0", upstream: "http://127.0.0.1:4104/html5client/null", host: "bbb5.fk4.hs-bremen.de", referrer: "https://bbb5.fk4.hs-bremen.de/html5client/join?sessionToken=zf9dmaojn7cft5me"
2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed (111: Connection refused) while connecting to upstream, client: 2003:f7:e711:7700:2552:5206:eeb4:663f, server: bbb5.fk4.hs-bremen.de, request: "GET /html5client/null HTTP/2.0", upstream: "http://127.0.0.1:4103/html5client/null", host: "bbb5.fk4.hs-bremen.de", referrer: "https://bbb5.fk4.hs-bremen.de/html5client/join?sessionToken=zf9dmaojn7cft5me"
2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed (111: Connection refused) while connecting to upstream, client: 2003:f7:e711:7700:2552:5206:eeb4:663f, server: bbb5.fk4.hs-bremen.de, request: "GET /html5client/null HTTP/2.0", upstream: "http://127.0.0.1:4102/html5client/null", host: "bbb5.fk4.hs-bremen.de", referrer: "https://bbb5.fk4.hs-bremen.de/html5client/join?sessionToken=zf9dmaojn7cft5me"
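
A quick way to summarize log lines like the ones above is to collect the distinct upstream ports that refused the connection. This is only a sketch: the function name `refusedUpstreamPorts` is hypothetical, and the sample input is an abbreviated reconstruction of the lines above (client, server and referrer fields omitted), not a new log capture.

```javascript
// Hypothetical helper: list the distinct local upstream ports that nginx
// reported as "Connection refused" in an error log excerpt.
function refusedUpstreamPorts(logText) {
  const ports = new Set();
  // Matches one log line at a time, since "." does not cross newlines.
  const pattern = /connect\(\) failed \(111: Connection refused\).*?upstream: "http:\/\/127\.0\.0\.1:(\d+)\//g;
  for (const match of logText.matchAll(pattern)) {
    ports.add(Number(match[1]));
  }
  return [...ports].sort((a, b) => a - b);
}

// Abbreviated copies of the six log lines quoted above.
const sampleLog = [4107, 4106, 4105, 4104, 4103, 4102]
  .map((port) =>
    '2022/08/31 22:23:52 [error] 3161#3161: *248 connect() failed ' +
    '(111: Connection refused) while connecting to upstream, ' +
    'request: "GET /html5client/null HTTP/2.0", ' +
    `upstream: "http://127.0.0.1:${port}/html5client/null"`)
  .join('\n');

const refusedPorts = refusedUpstreamPorts(sampleLog);
console.log(refusedPorts); // [ 4102, 4103, 4104, 4105, 4106, 4107 ]
```

In this excerpt every one of the six upstream ports refused the same request within the same second.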


At the moment I am wondering about the client's clock drift of -2:36 minutes ...

Regards, elem


Jerome Loubens

Sep 1, 2022, 2:23:59 AM
to BigBlueButton-dev
Hello
I had a 1020 error when opening the camera; updating Ubuntu did not solve anything.
So I updated via bbb-install-2.5.sh, reconfigured with my settings, and everything worked again.
I don't know if you have the same problem, but in the end, package updates aren't great; you should be able to deactivate them to avoid small problems.
Have a good day.

elem hsb

Sep 1, 2022, 5:22:58 AM
to bigblueb...@googlegroups.com
Hi Paulo, Hi Fred,

I am sorry, it now happens on test25.bigbluebutton.org as well; see the screenshot.

Audio had been stable since 23:52; joining the cam after a long time fails.
Adblock Plus was disabled and has now been uninstalled; the result is the same.

regards, elem






Paulo Lanzarin

Sep 1, 2022, 11:59:35 AM
to bigblueb...@googlegroups.com
👍 will dig into this

Pablo Pico

Sep 5, 2022, 4:22:20 PM
to BigBlueButton-dev
Hello Fred and everyone.
Thank you for your replies.

I have dedicated several days to testing and understanding the bug, and I have some final conclusions and proposed solutions to share, hoping they will be useful to the BigBlueButton developers and community.

The issue has nothing to do with the duration of the meeting, as I implied in my first post. Even so, from now on I will use "long meeting" to mean a meeting that has already accumulated lots of chat messages and annotations by the time a user joins it.

Conclusions for version 2.5:
  1. The issue has to do with the number of chat messages and annotations (drawings on the whiteboard) and with users who join late.
  2. There is a performance issue affecting the user's experience when joining long meetings. It can severely slow down connecting to audio (specifically with microphone) right after joining the meeting, which is the most common action when joining. Moreover, trying to connect as soon as the user joins may result in the user's removal from the meeting.
  3. This issue is made worse by the unfortunate wording of the message displayed when the user gets disconnected. The message says "You have been removed from the meeting" (screenshot: upload.png), which creates confusion.
  4. When a user joins a meeting, the user receives, via websocket, every chat message and every annotation, one by one.
  5. The server sends a lot of duplicated messages. All of that adds up to several MB of data to receive, which takes several seconds (maybe minutes) depending on bandwidth and possibly other factors. Note: a screenshot and some notes about this are at the end of this post.
  6. When a user joins a long meeting, he cannot join with microphone before all the data just mentioned has been downloaded. After clicking the "microphone" icon he sees a frozen GUI for several seconds or minutes until one of these 3 things happens:
    1. The attempt times out and probably connects after an automatic retry (he sees an alert saying he was disconnected from audio, but he still connects).
    2. He is removed from the meeting (when the meeting has too many chat messages or annotations this is the most likely outcome; in some severe cases it happens practically always).
    3. He joins the meeting with no audio; the dialog disappears with no message and the user has to retry manually.
  7. The more Node.js processes the server has configured in bbb-html-with-roles.conf, the bigger the problem, because more processes mean more duplicated messages.

Conclusions for version 2.6 (2.6.0-alpha.2):
While running these tests I was reading the code a bit to better understand a few things, and in the git repository I found good work by @germanocaumo in: https://github.com/bigbluebutton/bigbluebutton/tree/96c54a2acb28caa67cd772e43d5e06198bf99e6c/bigbluebutton-html5/imports/api/annotations/server/methods that should improve all this.
I installed 2.6.0-alpha.2 and ran all the tests again to see the impact of this improvement. It helped, but it does not completely fix the issue.
By the way, great work on the upcoming new whiteboard!
  1. The issue persists, although some improvements in the right direction (in my opinion) have been made.
  2. Version 2.6.0-alpha.2 already improves long-meeting join performance by sending chat as one bulk message over the websocket (instead of one message at a time).
  3. But the annotations issue remains, and again, the more Node.js processes, the more repeated messages. Note that once the user has joined, websocket messages for annotations are handled by "stream-annotations". "stream-annotations" is not part of this issue because it does not send any duplicated messages; in fact, it groups several pending annotations into one message.

Proposed fix:

  • The first thing, I think, is to fix the bug of repeated messages. In case it helps, I wrote some code and tested it on my 2.6 server. It is not a solution, but it may help to understand the repeated-messages issue. At the end of this post I include some screenshots that show how bad this is.
  • Group the websocket messages on user join. In 2.6 the chat messages already do this, and so does stream-annotations. But "annotations" fails to do this when the user joins, and apparently it has a bug that results in different message ids for the same annotation.
  • Take advantage of the "permessage-deflate" WebSocket extension. This will reduce the amount of data transferred over the websocket. And once messages are sent grouped, instead of one by one, this improvement will probably matter more. Please see below how to achieve this easily, along with some extra notes.
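
To make the grouping idea in the list above concrete, here is a minimal sketch of sending one message per whiteboard instead of one per annotation. It is not BigBlueButton code: `groupByWhiteboard`, `sendGrouped`, the `send` callback and the payload shape are all hypothetical, and the only assumption taken from this thread is that each annotation carries an `id` and a `wbId`.

```javascript
// Hypothetical sketch: batch annotations per whiteboard before sending.
function groupByWhiteboard(annotations) {
  const groups = new Map();
  for (const annotation of annotations) {
    if (!groups.has(annotation.wbId)) groups.set(annotation.wbId, []);
    groups.get(annotation.wbId).push(annotation);
  }
  return groups;
}

// `send` stands in for whatever actually publishes to the client.
function sendGrouped(annotations, send) {
  let messagesSent = 0;
  for (const [wbId, batch] of groupByWhiteboard(annotations)) {
    send({ whiteboardId: wbId, annotations: batch }); // one message per whiteboard
    messagesSent += 1;
  }
  return messagesSent;
}

// Usage: ten rectangles on one whiteboard become a single message.
const rectangles = Array.from({ length: 10 }, (_, i) => ({
  id: `rect-${i}`,
  wbId: 'wb/1',
  type: 'rectangle',
}));
const outbox = [];
const sentCount = sendGrouped(rectangles, (message) => outbox.push(message));
console.log(sentCount, outbox[0].annotations.length); // 1 10
```

With grouping, the number of websocket messages on join depends on the number of whiteboards rather than on the number of annotations.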

A few extra comments:
  • During the real meetings that started all this for me, we had many users. All these tests, however, were done with 1 to 4 users.
  • During the real meetings another issue arose: users would get disconnected at random. There seemed to be a strong correlation between the issue I explained here and the probability of getting randomly disconnected. I say this because the longer the meeting, the higher the chance of disconnection, meaning more people started to get disconnected without explanation.
  • I did not run any test to be 100% sure (because I do not see how to simulate a crowded meeting), but I suspect that this issue becomes worse in big meetings for moderators (probably some websocket messages grow in proportion to the number of participants?) and for users whose connections are not very fast.
  • There was probably a lot of confusion during the real meetings because of the "removed from meeting" message, and probably some of the users simply had a connection issue. However, I had the impression that the meeting became very strict about disconnecting users, probably because of those 2 factors: long meeting + many users. Again, no high CPU was involved in any of this, not even on a single core.
  • I am not sure whether all of our issues would be covered by the proposed solutions, but if rejoining becomes easier it will surely help a lot in case of a disconnection.

P.S.: FYI, after running tests with and without full audio in version 2.5, I found that it does not fix the problem. Moreover, I do not see any difference in functionality when using full audio in that version after following the instructions published on the official site. There was one small difference when enabling full audio: the dialog box to connect the microphone would appear later (probably after downloading some of the websocket messages first), which was both good and bad. Good, because you would start trying to connect audio some seconds later, increasing the chance of success. Bad, because a user who does not see the dialog soon enough may take it as an error and react unexpectedly.


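Extra: Code to send only unique annotations, and some notes about it:
In function sendAnnotationHelper:

// Keep one annotation per `id`; a later duplicate replaces an earlier one.
const keyUniqueAnnotations = 'id';
const arrayUniqueAnnotations = [...new Map(annotations.map((item) =>
  [item[keyUniqueAnnotations], item])).values()];

// Previous line (commented out), grouping by whiteboard id:
// _.each(_.groupBy(annotations, "wbId"), whiteboardAnnotations => {
// Replacement, iterating over the de-duplicated array:
_.each(_.groupBy(arrayUniqueAnnotations, "type"), whiteboardAnnotations => {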

---
Keep in mind that this is probably not the right fix. I do not know why there are repeated messages in the first place. I tried to group the messages (and apparently the original code intends to do so by whiteboard id) but I failed: they are always sent one by one.
I could not debug this properly, but I saw that the developer clearly intended to group the messages and then send them to RedisPubSub.publishUserMessage. However, I could never get them grouped.
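
The de-duplication step above can be tried in isolation. The following is a self-contained sketch of the same Map-based technique outside of BigBlueButton; the function name `uniqueById` and the sample data are made up for illustration.

```javascript
// Keep one item per key using a Map: a later duplicate overwrites the value
// of an earlier one, while the position of the first occurrence is kept.
function uniqueById(annotations, key = 'id') {
  return [...new Map(annotations.map((item) => [item[key], item])).values()];
}

// Usage: the same triangle arrives three times, plus one rectangle.
const received = [
  { id: 'a', type: 'triangle' },
  { id: 'a', type: 'triangle' },
  { id: 'b', type: 'rectangle' },
  { id: 'a', type: 'triangle' },
];
const unique = uniqueById(received);
console.log(unique.map((annotation) => annotation.id)); // [ 'a', 'b' ]
```

This only hides duplicates at the point where it is applied; it does not explain why they were produced in the first place.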


Extra: How to use "permessage-deflate":
Modern browsers send a header like this:
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits

That tells the server that the browser can handle compressed messages. But, as of now, BigBlueButton ignores that header.
The easiest way to take advantage of this is to edit the /usr/share/meteor/bundle/systemd_start_frontend.sh file:
#export SERVER_WEBSOCKET_COMPRESSION=0
export SERVER_WEBSOCKET_COMPRESSION='{"level":5, "maxWindowBits":13, "memLevel":7, "requestMaxWindowBits":13}'

The first parameter, "level", is best set between 4 and 6. All three values give good compression for little CPU, with 6 giving the highest compression and 4 using the least CPU. Other compression levels are not suitable for this situation.

I already have this on my 2.5.4 production server with level 4 and I have not seen any noticeable CPU impact, nor do I think there should be any.
This is what the response headers look like after applying this:
upload2.png

After applying this fix I saved about 90% of the data transferred via websocket in a long meeting.
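
To get a feel for why repetitive annotation messages compress so well, here is a rough sketch using Node's built-in zlib with the same level, window bits and memLevel as the setting above. It is an approximation only: it compresses a synthetic payload as one raw-deflate stream, whereas a real permessage-deflate connection compresses message by message, and the message shape is a simplified, made-up version of the frames shown in this post. No particular saving is claimed for real traffic.

```javascript
const zlib = require('zlib');

// Same values as the SERVER_WEBSOCKET_COMPRESSION example above.
const settings = JSON.parse(
  '{"level":5, "maxWindowBits":13, "memLevel":7, "requestMaxWindowBits":13}'
);

// Synthetic stand-in for an "added" annotation message (simplified shape).
function makeMessage(index) {
  return JSON.stringify({
    msg: 'added',
    collection: 'annotations',
    id: `doc-${index}`,
    fields: {
      id: 'shape-1',
      annotationInfo: { type: 'rectangle', size: [474.79, 347.41] },
    },
  });
}

const payload = Buffer.from(
  Array.from({ length: 100 }, (_, index) => makeMessage(index)).join('\n')
);

// permessage-deflate is built on raw DEFLATE, hence deflateRawSync.
const compressed = zlib.deflateRawSync(payload, {
  level: settings.level,
  windowBits: settings.maxWindowBits,
  memLevel: settings.memLevel,
});

const savedPercent = 100 * (1 - compressed.length / payload.length);
console.log(
  `${payload.length} bytes -> ${compressed.length} bytes ` +
  `(${savedPercent.toFixed(1)}% saved)`
);
```

The more the messages repeat each other, the larger the saving, which is why compression and de-duplication address overlapping parts of the same problem.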


Extra: Screenshot and notes about repeated messages:
This screenshot shows 107 messages, filtered by the word "rect", received when I had just joined a meeting. There should have been 10 messages because there were 10 rectangles. Hopefully, after the fix, there will be only one or a few messages for the whole whiteboard.
excessive-annotation-messages-few-rectangles.png
Test #1. Draw 10 rectangles, then join with another user:
on 2.5.4 = I received 107 rectangle messages
on 2.6.0-alpha.2 = I received 15 rectangle messages. I could not find out why some are repeated and some are not.

Test #2. Draw 1 rectangle, then join with another user:
on 2.5.4 = 7 rectangle messages (NUMBER_OF_BACKEND_NODEJS_PROCESSES=3, NUMBER_OF_FRONTEND_NODEJS_PROCESSES=6). Note that if the number of processes is reduced, the number of messages tends to decrease.
on 2.6.0-alpha.2 = sometimes 1 and sometimes 2 (apparently the number of processes did not matter much in this version). It was not clear why the results vary with the same steps.


Extra: Websocket repeated message example:
This message is the same triangle sent 3 times (usually more in v2.5.4). The field id is the same in all of them, but the message id differs for some of them.
a["{"msg":"added","collection":"annotations","id":"9E82ZaK7aaWYr3jef","fields":{"id":"0f1d67c7-e0ce-4953-1058-f459ed13cbf5","meetingId":"f4d29f359ccea5fe23da85606b44e7cd13de578a-1661909019963","userId":"w_xiqq1swzeeut","annotationInfo":{"size":[474.79,347.41],"style":{"isFilled":false,"size":"small","scale":1,"color":"black","dash":"draw"},"label":"","rotation":0,"id":"0f1d67c7-e0ce-4953-1058-f459ed13cbf5","labelPoint":[0.5,0.5],"type":"triangle","parentId":"1","childIndex":1.5,"name":"Triangle","point":[849.71,341.17]},"wbId":"97b150e9176a5ded62f150009509d6390450cadd-1661909020050/1","whiteboardId":"97b150e9176a5ded62f150009509d6390450cadd-1661909020050/1"}}"]
a["{"msg":"added","collection":"annotations","id":"9E82ZaK7aaWYr3jef","fields":{"id":"0f1d67c7-e0ce-4953-1058-f459ed13cbf5","meetingId":"f4d29f359ccea5fe23da85606b44e7cd13de578a-1661909019963","userId":"w_xiqq1swzeeut","annotationInfo":{"size":[474.79,347.41],"style":{"isFilled":false,"size":"small","scale":1,"color":"black","dash":"draw"},"label":"","rotation":0,"id":"0f1d67c7-e0ce-4953-1058-f459ed13cbf5","labelPoint":[0.5,0.5],"type":"triangle","parentId":"1","childIndex":1.5,"name":"Triangle","point":[849.71,341.17]},"wbId":"97b150e9176a5ded62f150009509d6390450cadd-1661909020050/1","whiteboardId":"97b150e9176a5ded62f150009509d6390450cadd-1661909020050/1"}}"]
a["{"msg":"added","collection":"annotations","id":"NjHthuAb54qeB7qvn","fields":{"id":"0f1d67c7-e0ce-4953-1058-f459ed13cbf5","meetingId":"f4d29f359ccea5fe23da85606b44e7cd13de578a-1661909019963","userId":"w_xiqq1swzeeut","annotationInfo":{"size":[474.79,347.41],"style":{"isFilled":false,"size":"small","scale":1,"color":"black","dash":"draw"},"label":"","rotation":0,"id":"0f1d67c7-e0ce-4953-1058-f459ed13cbf5","labelPoint":[0.5,0.5],"type":"triangle","parentId":"1","childIndex":1.5,"name":"Triangle","point":[849.71,341.17]},"wbId":"97b150e9176a5ded62f150009509d6390450cadd-1661909020050/1","whiteboardId":"97b150e9176a5ded62f150009509d6390450cadd-1661909020050/1"}}"]
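
Counting such repeats by hand in the browser's network tab is tedious, so here is a small sketch that does it for a list of captured frames. It assumes the frames are SockJS array frames (the letter "a" followed by a JSON array of strings, each string being one JSON message) with their escaping intact; the function names are hypothetical, and the sample frames are rebuilt from the ids in the example above with most fields omitted.

```javascript
// Decode one SockJS array frame into parsed messages.
function decodeFrame(frame) {
  if (!frame.startsWith('a')) return [];
  return JSON.parse(frame.slice(1)).map((text) => JSON.parse(text));
}

// Count how many "added" annotation messages were received per annotation id
// (the id inside `fields`, not the per-message document id).
function countByAnnotationId(frames) {
  const counts = new Map();
  for (const frame of frames) {
    for (const message of decodeFrame(frame)) {
      if (message.msg !== 'added' || message.collection !== 'annotations') continue;
      const annotationId = message.fields.id;
      counts.set(annotationId, (counts.get(annotationId) || 0) + 1);
    }
  }
  return counts;
}

// Helper used only to build properly escaped sample frames.
function encodeFrame(message) {
  return 'a' + JSON.stringify([JSON.stringify(message)]);
}

const triangle = {
  id: '0f1d67c7-e0ce-4953-1058-f459ed13cbf5',
  annotationInfo: { type: 'triangle' },
};
const frames = [
  encodeFrame({ msg: 'added', collection: 'annotations', id: '9E82ZaK7aaWYr3jef', fields: triangle }),
  encodeFrame({ msg: 'added', collection: 'annotations', id: '9E82ZaK7aaWYr3jef', fields: triangle }),
  encodeFrame({ msg: 'added', collection: 'annotations', id: 'NjHthuAb54qeB7qvn', fields: triangle }),
];

const duplicateCounts = countByAnnotationId(frames);
console.log(duplicateCounts.get(triangle.id)); // 3
```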

Regards,

Pablo - BBBPlugin

Pablo Pico

Sep 5, 2022, 4:24:29 PM
to BigBlueButton-dev
I missed this screenshot in "Extra: Screenshot and notes about repeated messages":

"Respuesta" is Spanish for "Response".
excessive-annotation-messages-few-rectangles.png

Paulo Lanzarin

Sep 27, 2022, 3:35:18 PM
to bigblueb...@googlegroups.com
elem,

The "socket times out after a long time in a meeting" issue turned out to be a bit tangential to the OP's topic,
but I'll give a final follow-up on it since I finally got around to taking a look at it.
See https://github.com/bigbluebutton/bigbluebutton/pull/15741.
The PR doesn't really fix the problem, but outlines it and provides a mitigation of sorts.
We'll have to figure out a way to properly deal with session invalidation.

Germano Caumo Carniel

Sep 28, 2022, 10:20:37 AM
to BigBlueButton-dev
Hello, sorry for the late response.
First of all, many thanks for the detailed and in-depth tests you made, @Pablo. After a lot of investigation in the code I found 2 bugs that could be causing the annotations websocket behaviour you saw. I have now opened a PR with a possible fix (only for 2.6 for now; it will need to be backported to 2.5):

https://github.com/bigbluebutton/bigbluebutton/pull/15745

Fred Dixon

Oct 27, 2022, 4:23:17 PM
to bigblueb...@googlegroups.com
Hi Pablo,

We just released 2.5.7 (now 2.5.8 with a minor update) that has work to reduce the 403 disconnects.

Can you run some tests on your end and let us know if you see improvements relative to the earlier 2.5.4 version?

Regards,... Fred 
