Websocket sometimes unavailable (error code 1002)


Kim-Alexander Brodowski
May 4, 2020, 10:28:01 AM
to BigBlueButton-dev

Hello,

I'm working for IServ GmbH (https://iserv.eu/) and we've integrated BBB into our IServ Schoolserver as a direct response to the SARS-CoV-2 pandemic. We're currently hosting about 260 active BBB instances (the exact number varies with demand) plus a couple of spares and development instances.

First of all, let me be clear that we use BBB in an unsupported scenario: we run BBB inside a Docker container. This makes deployment within our infrastructure much easier and ensures consistency across many instances. It also means we don't need to update our toolchains to work with Ubuntu 16.04, as all of our software is designed to run on Debian stable/testing. Running inside a Docker container requires a couple of workarounds I'm not going to go into here, and I don't think they matter much for the problem described below. We also don't use Scalelite or Greenlight, but instead just BBB's API.

The problem: Sometimes, essentially at random, FreeSWITCH appears to stop accepting new connections. To us the problem is visible through nginx's log file (all times in UTC):

2020/05/04 09:24:59 [error] 5448#5448: *734320 connect() failed (110: Connection timed out) while connecting to upstream, client: 198.18.1.1, server: bbbhag06.videoconference.iserv.eu, request: "GET /ws HTTP/1.1", upstream: "https://116.202.241.247:7443/ws", host: "bbbhag06.videoconference.iserv.eu"

This problem occurs seemingly at random, at different times on different servers. I have not yet succeeded in finding much of a pattern. As you can see, the service has not been restarted:

root@bbbhag06:/# systemctl status freeswitch
● freeswitch.service - freeswitch
Loaded: loaded (/lib/systemd/system/freeswitch.service; disabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/freeswitch.service.d
└─iserv.conf, override.conf
Active: active (running) since Wed 2020-04-29 00:11:17 UTC; 5 days ago
Main PID: 5205 (freeswitch)
CGroup: /docker/d6ec8a55ae21fbb69816a0e1867b54d8939c0321330574fa5b6fb974cec1e8ee/system.slice/freeswitch.service
└─5205 /opt/freeswitch/bin/freeswitch -u freeswitch -g daemon -ncwait -nonat -nf

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

I mention this explicitly because at least once a similar issue was caused by FreeSWITCH crashing, which doesn't appear to be the case here. The event time above corresponds to the time frame where we expect the most load on the system; however, we're nowhere near our capacity limits, nor anywhere near the limits configured for FreeSWITCH (max-sessions/sessions-per-second). If past experience is anything to go by, already-connected users don't experience issues.

The only thing the affected servers appear to have in common is that they all have rather high CPU core counts. That being said, this is only anecdotal evidence and might just be a red herring.

The freeswitch.log has not been much help due to its verbosity, so I've changed the log level to warning. Apart from numerous "[WARNING] switch_rtp.c:845 No audio stun for a long time!" entries, these are the only lines logged around that time:

55a95f4e-60ef-4c27-9042-62fa6a792c6f 2020-05-04 09:25:09.397541 [ERR] switch_rtp.c:6027 RTCP UNPROTECT ERR
55a95f4e-60ef-4c27-9042-62fa6a792c6f 2020-05-04 09:25:09.397541 [WARNING] switch_rtp.c:6952 INVALID RTCP PACKET TYPE 51 VER 3 LEN 17388
6a510796-d384-496e-8a4b-aba5732f4b44 2020-05-04 09:33:56.577544 [WARNING] switch_rtp.c:6257 SRTP audio unprotect failed with code 9 (replay check failed) 74 bytes 10 errors
7c1ca9b3-90ca-4982-b812-810e7c1afadc 2020-05-04 09:38:57.077548 [ERR] switch_rtp.c:6027 RTCP UNPROTECT ERR
7c1ca9b3-90ca-4982-b812-810e7c1afadc 2020-05-04 09:38:57.077548 [WARNING] switch_rtp.c:6952 INVALID RTCP PACKET TYPE 0 VER 2 LEN 144
3c96fe2d-4526-4206-9ce7-8c7b6385a615 2020-05-04 09:41:03.257541 [ERR] switch_rtp.c:3266 audio Handshake failure 1. This may happen when you use legacy DTLS v1.0 (legacyDTLS channel var is set) but endpoint requires DTLS v1.2.

I've added monitoring that watches nginx's error.log and informs me when servers are having this issue. I can't say for certain how many of our servers are affected or how often this happens, since these changes haven't been released yet, but I'm quite sure it doesn't happen all that often. With the number of servers and meetings we're hosting it does, however, affect service quality, especially because load balancing prefers servers with lower load, which usually ends up prioritizing the servers that are having issues.
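For illustration, a minimal version of such a check could look like the following. This is a sketch, not our production tooling; the log path and the grep pattern are assumptions based on the nginx error line quoted above.

```shell
#!/bin/sh
# Sketch: detect nginx failing to reach FreeSWITCH's websocket upstream.
# ERROR_LOG path and the match pattern are illustrative assumptions.
ERROR_LOG="${ERROR_LOG:-/var/log/nginx/error.log}"

# Count lines where nginx could not connect to the :7443/ws upstream.
ws_upstream_errors() {
    grep -c 'connect() failed.*upstream.*:7443/ws' "$1"
}

if [ -r "$ERROR_LOG" ]; then
    count=$(ws_upstream_errors "$ERROR_LOG") || count=0
    if [ "$count" -gt 0 ]; then
        echo "WARNING: $count websocket upstream failures in $ERROR_LOG"
    fi
fi
```

In practice you would run this periodically (cron or a systemd timer) and only alert on errors newer than the last check, e.g. by tracking the log offset.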

The problem does magically fix itself after a while, without any intervention from our side.

I hope someone will be able to help me debug this.

One last thing: I'd like to express my gratitude to everyone contributing to BBB and its components. We started using BBB when we began working from home at the beginning of March due to SARS-CoV-2. We've made cloud-hosted BBB available at no additional charge to our customers, as we saw demand for a privacy-friendly video conferencing solution and had already seen its potential in house. BBB has made communication a lot easier on our end.


Kind regards,

Kim-A. Brodowski

Richard Alam
May 4, 2020, 10:33:38 AM
to BigBlueButton-dev
Hi,


Paulo is working with the FS devs investigating this issue. He is able to reproduce the problem and has forwarded information to the FS devs.

Hope we'll be able to resolve it soon.

Richard
--
You received this message because you are subscribed to the Google Groups "BigBlueButton-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigbluebutton-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bigbluebutton-dev/5162e20d-eefd-4c13-b7e3-311d00ff69ac%40googlegroups.com.



Richard Alam
May 15, 2020, 9:36:13 AM
to BigBlueButton-dev
Hi,

Just a quick update.

We are testing a fix provided by FS devs.


If things go well, it'll be in the next update.

Richard

Fred Dixon
May 15, 2020, 12:27:24 PM
to BigBlueButton-dev
Hi Kim-A. Brodowski,

The fix that Richard referenced will be in 2.2.11; however, we would be interested in having you try this latest build of FreeSWITCH and let us know if you see any improvements.  We just pushed the development packages for BigBlueButton 2.2 to


echo "deb https://ubuntu.bigbluebutton.org/xenial-22-dev bigbluebutton-xenial main" | sudo tee /etc/apt/sources.list.d/bigbluebutton.list

and in particular it has FreeSWITCH

bbb-freeswitch-core_2.2.0-109_amd64.deb


If you could test this developer build of 2.2.11 on some of your servers, that would be very helpful.


Regards,... fred



--
BigBlueButton Developer

Like BigBlueButton?  Tweet us at @bigbluebutton

Kim-Alexander Brodowski
May 15, 2020, 1:10:44 PM
to BigBlueButton-dev
Hello,

I'm currently working with my colleagues on our deployment plan. We're probably going to roll out these changes on our testing branch to about 10 servers later this weekend. Thank you so much for your support.

Kind regards,
Kim-A. Brodowski

Kim-Alexander Brodowski
May 18, 2020, 4:21:36 AM
to bigblueb...@googlegroups.com

Good morning,

it looks like this did not fix the problems for us. To give you an idea of the scope: of our roughly 260 BigBlueButton instances, 6 were affected by the problem this morning. One of them had BigBlueButton 2.2.11 with bbb-freeswitch-core version 2:2.2.0-109 installed.

---

Kind regards,
Kim-Alexander Brodowski



Kim-Alexander Brodowski
May 18, 2020, 4:51:41 AM
to bigblueb...@googlegroups.com

Upon further inspection, the server running 2.2.11 might just have experienced a FreeSWITCH crash, which has the same symptoms but most likely a different underlying cause:

May 18 09:48:19 bbbitg01.videoconference.iserv.eu systemd[1]: freeswitch.service: Main process exited, code=killed, status=6/ABRT
May 18 09:48:19 bbbitg01.videoconference.iserv.eu systemd[1]: freeswitch.service: Unit entered failed state.
May 18 09:48:19 bbbitg01.videoconference.iserv.eu systemd[1]: freeswitch.service: Failed with result 'signal'.
May 18 09:48:20 bbbitg01.videoconference.iserv.eu systemd[1]: freeswitch.service: Service hold-off time over, scheduling restart.

Kind regards,
Kim-Alexander Brodowski


Kim-Alexander Brodowski
May 18, 2020, 5:49:57 AM
to BigBlueButton-dev
Hello again,


first of all, sorry for being a bit spammy here. I've extracted the backtrace of the crash mentioned above:


#0 0x00007f359e35f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1 0x00007f359e36102a in __GI_abort () at abort.c:89
#2 0x00007f359e357bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x7f35953c8e7a "orq->orq_queue", file=file@entry=0x7f35953c874e "nta.c",
line=line@entry=8101, function=function@entry=0x7f35953c9ce0 <__PRETTY_FUNCTION__.18131> "outgoing_create") at assert.c:92
#3 0x00007f359e357c82 in __GI___assert_fail (assertion=assertion@entry=0x7f35953c8e7a "orq->orq_queue", file=file@entry=0x7f35953c874e "nta.c", line=line@entry=8101,
function=function@entry=0x7f35953c9ce0 <__PRETTY_FUNCTION__.18131> "outgoing_create") at assert.c:101
#4 0x00007f35952c0aab in outgoing_create (agent=agent@entry=0x7f3570001e40, callback=<optimized out>, callback@entry=0x7f35952d92bf <nua_client_orq_response>,
magic=magic@entry=0x7f3570903830, route_url=<optimized out>, route_url@entry=0x0, tpn=<optimized out>, tpn@entry=0x0, msg=msg@entry=0x7f3082b78390,
tag=0x7f3595630cc0 <ntatag_default_proxy>, value=139867511457056) at nta.c:8101
#5 0x00007f35952c1a23 in nta_outgoing_mcreate (agent=0x7f3570001e40, callback=callback@entry=0x7f35952d92bf <nua_client_orq_response>, magic=0x7f3570903830,
route_url=route_url@entry=0x0, msg=msg@entry=0x7f3082b78390, tag=tag@entry=0x7f3595630cc0 <ntatag_default_proxy>, value=139867511457056) at nta.c:7485
#6 0x00007f35952d94ae in nua_base_client_request (cr=cr@entry=0x7f3570903830, msg=msg@entry=0x7f3082b78390, sip=sip@entry=0x7f3082b78488, tags=tags@entry=0x7f359509faf0)
at nua_client.c:960
#7 0x00007f35952da3e3 in nua_base_client_trequest (cr=0x7f3570903830, msg=0x7f3082b78390, sip=0x7f3082b78488, tag=<optimized out>, value=<optimized out>) at nua_client.c:928
#8 0x00007f35952ea477 in nua_bye_client_request (cr=<optimized out>, msg=<optimized out>, sip=<optimized out>, tags=<optimized out>) at nua_session.c:3819
#9 0x00007f35952d9b06 in nua_client_request_sendmsg (cr=cr@entry=0x7f3570903830) at nua_client.c:903
#10 0x00007f35952da420 in nua_client_request_try (cr=cr@entry=0x7f3570903830) at nua_client.c:710
#11 0x00007f35952d8a29 in nua_client_init_request0 (cr=cr@entry=0x7f3570903830) at nua_client.c:607
#12 0x00007f35952d8a63 in nua_client_init_request (cr=cr@entry=0x7f3570903830) at nua_client.c:444
#13 0x00007f35952d8c17 in nua_client_create (nh=nh@entry=0x7f35709ecb50, event=event@entry=33, methods=methods@entry=0x7f359562c300 <nua_bye_client_methods>,
tags=tags@entry=0x7f2ef40677a0) at nua_client.c:201
#14 0x00007f35952ef160 in nua_stack_bye (nua=nua@entry=0x7f3584025eb0, nh=nh@entry=0x7f35709ecb50, e=e@entry=nua_r_bye, tags=tags@entry=0x7f2ef40677a0) at nua_session.c:3776
#15 0x00007f35952f113d in nua_stack_signal (nua=0x7f3584025eb0, msg=<optimized out>, ee=0x7f2ef4067778) at nua_stack.c:607
#16 0x00007f35953203e2 in su_base_port_execute_msgs (queue=0x0, queue@entry=0x7f2ef4067730) at su_base_port.c:280
#17 0x00007f35953206b7 in su_base_port_getmsgs (self=0x7f35700008c0) at su_base_port.c:202
#18 0x00007f3595320810 in su_base_port_run (self=0x7f35700008c0) at su_base_port.c:335
#19 0x00007f359532605d in su_port_run (self=<optimized out>) at su_port.h:326
#20 su_root_run (self=<optimized out>) at su_root.c:819
#21 0x00007f3595324df1 in su_pthread_port_clone_main (varg=0x7f3595153720) at su_pthread_port.c:343
#22 0x00007f359e6fb6ba in start_thread (arg=0x7f35950a0700) at pthread_create.c:333
#23 0x00007f359e43141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

This looks a lot like what has already been reported at https://github.com/signalwire/freeswitch/issues/145. As of now, 15 servers have reported websocket-related issues this morning.


Kind regards,
Kim-Alexander Brodowski



Fred Dixon
May 18, 2020, 10:59:20 AM
to BigBlueButton-dev
Hi Kim,

We've just pushed an update to xenial-22-dev which has FreeSWITCH 1.10.3, which is what we are planning to use for 2.2.12. FreeSWITCH 1.10.3 was released a few days ago.


If you can try this latest build, we'd be interested to hear the results.


Regards,... Fred



Paulo Lanzarin
May 18, 2020, 11:30:07 AM
to BigBlueButton-dev
Also, Kim: can you please get the output of /opt/freeswitch/bin/fs_cli -x version from a 2.2.11 server that got the WS stack stuck?
Thanks.

Fred Dixon
May 18, 2020, 12:25:24 PM
to BigBlueButton-dev
Hi Kim,

If you upgrade xenial-22-dev (which is a work-in-progress for 2.2.12), you should have this packaged version of FreeSWITCH

 bbb-freeswitch-core_2.2.0-112_amd64.deb

Regards,... Fred


Kim-Alexander Brodowski
May 18, 2020, 12:32:35 PM
to BigBlueButton-dev

I'll try, but as mentioned above, the server running 2.2.11 that experienced issues this morning had an FS crash rather than just not responding to websocket requests. That might have been a false alarm.

As far as FreeSWITCH 1.10.3 and BBB 2.2.12 are concerned: I'd like to know whether https://github.com/signalwire/freeswitch/pull/626 is included there. Sadly, the source code for BBB's FS versions does not appear to be publicly available, and at the very least the Debian package changelogs don't offer much insight.

I've updated our monitoring scripts to include an indicator of the cause of the websocket issues we're experiencing, so in the coming days I should be able to easily see whether servers are unavailable because FS stopped responding or because it crashed outright.
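The distinction can be derived from the journal: a crash leaves a "Main process exited" line behind, while a hang does not. A hypothetical sketch of such a classifier (not our actual monitoring code):

```shell
#!/bin/sh
# Hypothetical classifier: decide from journal text on stdin whether
# FreeSWITCH crashed ("crash") or is merely unresponsive ("hang").
classify_fs_outage() {
    if grep -q 'freeswitch.service: Main process exited' -; then
        echo crash
    else
        echo hang
    fi
}

# Usage (sketch): journalctl -u freeswitch --since "-1 hour" | classify_fs_outage
```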

Paulo Lanzarin
May 18, 2020, 12:44:06 PM
to BigBlueButton-dev
2.2.11 includes #626, and so will 2.2.12. As far as #626 is concerned, it seems we fixed one WS lock breakpoint, but not all of them.
There still seem to be some scenarios where the sofia WS stack gets stuck. We're looking into it.

Next time you get a stuck server, please try to gather the following info if possible. If the logs have been rotated, try to match the log dates to the time when the WS stack got stuck:
 - output of fs_cli -x "show channels"
 - output of sudo netstat -putan | grep -i "7443\|5066\|5060"
 - latest freeswitch log file
 - nginx's error.log (/var/log/nginx/error.log)
 - akka-bbb-fsesl's latest log (/var/log/bbb-fsesl-akka)
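For convenience, the steps above can be bundled into a small collection script. This is a sketch; the paths are the defaults mentioned in this thread, and the helper and directory naming are my own (run it as root, or prefix the commands with sudo):

```shell
#!/bin/sh
# Sketch: collect BBB websocket-outage diagnostics into one archive.
# Paths are the defaults from this thread; adjust for your install.
outdir="ws-debug-$(hostname)-$(date +%Y%m%dT%H%M%S)"
mkdir -p "$outdir"

# Capture a command's output without letting a failure abort the script.
capture() {
    label="$1"; shift
    "$@" > "$outdir/$label.txt" 2>&1 || echo "(command failed)" >> "$outdir/$label.txt"
}

capture channels /opt/freeswitch/bin/fs_cli -x "show channels"
capture sockets  sh -c 'netstat -putan | grep -i "7443\|5066\|5060"'

# Copy the relevant logs if present.
for f in /opt/freeswitch/log/freeswitch.log \
         /var/log/nginx/error.log \
         /var/log/bbb-fsesl-akka/*.log; do
    [ -e "$f" ] && cp "$f" "$outdir/" || true
done

tar czf "$outdir.tar.gz" "$outdir"
echo "collected into $outdir.tar.gz"
```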

regards,

Paulo.

Fred Dixon
May 18, 2020, 1:34:18 PM
to BigBlueButton-dev
Hi Kim,

> As far as FreeSWITCH 1.10.3 and BBB 2.2.12 is concerned: I'd like to know whether https://github.com/signalwire/freeswitch/pull/626 is included there.

To ensure we had the patch, we rebuilt the FreeSWITCH package a few hours ago from their master branch (which does have the websocket fix). If you use xenial-22-dev, you should get

 bbb-freeswitch-core_2.2.0-112_amd64.deb


# /opt/freeswitch/bin/fs_cli -x version
FreeSWITCH Version 1.10.4-dev+git~20200515T214527Z~e1f63ce5b1~64bit (git e1f63ce 2020-05-15 21:45:27Z 64bit)


Regards,... Fred



Kim-Alexander Brodowski
May 19, 2020, 4:29:45 AM
to BigBlueButton-dev
Good morning,

I've sent Paulo the requested log files directly to protect the privacy of our users. I hope this helps.

Kind regards,
Kim-Alexander Brodowski

basisbit
May 27, 2020, 11:05:15 PM
to BigBlueButton-dev
Hi all, 

yesterday we experienced the same issue, which caused a service interruption of half an hour on one of our servers that was under rather high load. Connections to domainname.tld/ws either timed out or returned HTTP status 404. After finding out that this is a known problem, we at least wanted to reduce the impact on users if it happens again, and automate recovery of the affected server.

So during the last few hours we developed and tested a systemd service which checks whether the FreeSWITCH websocket port is frozen and, if necessary, restarts FreeSWITCH to hopefully resolve the problem. It then waits 10 minutes before checking again to avoid flapping. All users lose their audio connection, but after a minute or so they can simply reconnect audio in the HTML5 client by pressing the audio button and doing the echo test again. They don't get kicked from the session.

A .deb package for easy installation and the source can be found here: https://gitlab.senfcall.de/johannesbarthel/freeswitch-watchdog
(make sure that hostname -f returns the correct FQDN for your server so that the https certificate is trusted)
You can check the service journal for logging information to see if it works for you. After starting, the service waits 10 minutes initially (so it does not restart FreeSWITCH right after server boot), so be sure to wait long enough when testing this. Please report whether this helped you or whether you encountered any problems using it.
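The core idea can be sketched in a few lines of plain shell (the linked package is a more robust Rust implementation; HOST, the probe timeout, and the back-off value below are illustrative assumptions):

```shell
#!/bin/sh
# Sketch of a FreeSWITCH websocket watchdog. A healthy endpoint answers
# quickly (even with an HTTP error); a frozen sofia WS stack lets the
# connection time out instead.
HOST="${HOST:-$(hostname -f)}"
PROBE_TIMEOUT=5

ws_alive() {
    curl -s -o /dev/null --max-time "$PROBE_TIMEOUT" "https://$HOST/ws"
    [ "$?" -ne 28 ]   # curl exit code 28 means the probe timed out
}

# Only probe and restart when invoked with --check, so the function can
# also be sourced and reused (e.g. from a systemd timer unit).
if [ "${1:-}" = "--check" ]; then
    if ! ws_alive; then
        echo "websocket probe timed out; restarting freeswitch"
        systemctl restart freeswitch
        sleep 600   # back off so we don't restart in a loop
    fi
fi
```

Note the probe treats any fast answer, including a 404, as "alive"; only a timeout triggers the restart, matching the symptom described above.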

Best regards,
basisbit

Fred Dixon
May 27, 2020, 11:30:01 PM
to BigBlueButton-dev
Hi Basisbit,

Thanks for sharing this! And you did it in Rust too :-).

Regards,... Fred
