Benchmarking "streaming plugin" with RTP inbound traffic


Chris Stasiak

Oct 16, 2015, 3:26:07 AM
to meetecho-janus
Hello,

I am benchmarking Janus, specifically the streaming plugin with RTP inbound traffic. The goal is to estimate the maximum number of clients per Janus instance, where each client receives 5 Mbit/s video/audio in "realtime" (camera-to-screen in less than 400 ms over a transatlantic connection).

In general it works well until the Janus process reaches 100% CPU. In my test Janus runs with the following configuration:
- 2 x Intel Xeon E5-2620 v3 (24 cores)
- 32 GB RAM
- 10 Gbit/s up/down link
- Debian Wheezy

The clients are simulated with AWS instances implementing a "WebRTC stress load". You can find more details in our blog post: http://www.cargomedia.ch/2015/10/15/headless-chrome-on-ec2.html

With the current setup I can handle a publisher stream of 5 Mbit/s and re-stream it to up to 25 WebRTC clients at 5 Mbit/s each.

Observations:
- Janus uses 100% of a single core; the remaining 23 cores are basically idle
- It uses 130 Mbit/s of uplink (our max is 10 Gbit/s)

In general, beyond 25 clients the delivery delay rises from 400 ms to 3+ seconds.

Is there any chance to optimise that? Could you please advise me on what I am doing wrong?

Lorenzo Miniero

Oct 16, 2015, 4:34:31 AM
to meetecho-janus
On Friday, October 16, 2015 at 9:26:07 AM UTC+2, Chris Stasiak wrote:
Hello,

I am benchmarking Janus, specifically the streaming plugin with RTP inbound traffic. The goal is to estimate the maximum number of clients per Janus instance, where each client receives 5 Mbit/s video/audio in "realtime" (camera-to-screen in less than 400 ms over a transatlantic connection).



Cool!

 
In general it works well until the Janus process reaches 100% CPU. In my test Janus runs with the following configuration:
- 2 x Intel Xeon E5-2620 v3 (24 cores)
- 32 GB RAM
- 10 Gbit/s up/down link
- Debian Wheezy

The clients are simulated with AWS instances implementing a "WebRTC stress load". You can find more details in our blog post: http://www.cargomedia.ch/2015/10/15/headless-chrome-on-ec2.html



Nice approach. We've done something similar in our own stress tests, though we leveraged Selenium for the purpose (see the performance paper here for details: https://janus.conf.meetecho.com/citeus).

With the current setup I can handle a publisher stream of 5 Mbit/s and re-stream it to up to 25 WebRTC clients at 5 Mbit/s each.

Observations:
- Janus uses 100% of a single core; the remaining 23 cores are basically idle


Janus doesn't have an explicit way of handling multiple cores at the moment. I guess that's something we'll have to tackle sooner or later. The way we handle this in production is to use the multicore machines we own as hosts in a Dockerized environment: we have multiple Janus instances running in their own Docker containers, which allows us to either target different scenarios or distribute load among isolated nodes.

If not via Docker, try configuring and running two different Janus instances at the same time on your high-performance machine, and create the streaming mountpoint on both. Have the camera feed both mountpoints and try to distribute the viewers across them. This should have each Janus instance use its own core, which should improve things for the time being.

 
- It uses 130 Mbit/s of uplink (our max is 10 Gbit/s)



This is pretty much in line with what I expected: 5 Mbit/s * 25 users = 125 Mbit/s, and if you take into account signalling, ICE checks, SRTP overhead, retransmissions and the like, 130 Mbit/s makes sense. The bottleneck in your case is obviously not the bandwidth, but the fact that Janus has reached the peak of the processing power it has available.

Distributing the load as suggested above should allow for more bandwidth to be used.

 
In general, beyond 25 clients the delivery delay rises from 400 ms to 3+ seconds.



The 100% CPU might explain that: Janus and the restreaming bits apparently struggle to keep pace, and so internal queues grow. In our stress tests (the ones I linked above) we served many more viewers, but the video bitrate was smaller: we never tested 5 Mbit/s video streams. We actually stopped before reaching the peak, as the bottleneck in our case was bandwidth, IIRC.


Is there any chance to optimise that? Could you please advise me on what I am doing wrong?


See above, and I hope that helps. Keep us posted on these interesting results!
Lorenzo 

Lorenzo Miniero

Oct 16, 2015, 5:29:03 AM
to meetecho-janus
Hi Chris,

Answering myself about the multiple core support: I just talked to my colleague who ran the stress tests, and he told me Janus actually used all the cores it had available in our scenarios. Looking around, this is consistent with what I read about the topic, which is that, unless you prevent it explicitly (sched_setaffinity?), the kernel should automatically try to schedule processes (or threads of the same process) onto all available cores. We make heavy use of multiple threads, so that should indeed be happening.
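As a quick sanity check on the kernel-scheduling point above, here is a minimal sketch (my own illustration, not Janus code) that uses sched_getaffinity to see how many cores a process is allowed to run on; unless someone has narrowed the affinity mask, it should report all of them:

```c
/* Minimal sketch: ask the kernel which cores this process may run on.
 * By default the affinity mask covers all cores, so threads of the same
 * process can be scheduled anywhere. _GNU_SOURCE is needed on Linux for
 * sched_getaffinity() and the CPU_* macros. */
#define _GNU_SOURCE
#include <sched.h>

/* Return the number of cores the current process is allowed to run on,
 * or -1 on error. */
int allowed_cores(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

On an unrestricted system this returns the full core count, which matches the observation that Janus threads do spread across all cores.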

Was the issue in your case that delay increased when 100% was reached, or that things started working badly in general? Did you try to push beyond 25 users to see what happened? How many users did you manage to serve, delay notwithstanding, before something broke?

I'm wondering if in your case the issue is the single mountpoint trying to serve all the users. As of now, when you configure a mountpoint, a thread is created to receive data. When a packet arrives, that thread iterates over all viewers and tells the core to send the packet to each of them. That said, this thread is not responsible for actually sending data to the viewers: pushing data to the core just results in copying the packet and enqueuing it for delivery, and then the dedicated thread for each viewer actually sends it.

As such, there is one different thing you can try: create a second mountpoint on the same server, and then send the same data to both of them. Distribute viewers across the two mountpoints and check if this results in a better use of the available cores. If it does, then it means something can be optimized in the mountpoint thread or in the pushing mechanism.
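The enqueue-then-send split described above can be sketched roughly as follows. This is a hypothetical illustration of the pattern, not the actual Janus implementation: the mountpoint thread only copies each packet into a per-viewer queue, and the viewer's own thread pops packets and does the actual send.

```c
/* Hypothetical sketch of the fan-out model described above (not the
 * actual Janus code): one per-viewer bounded queue, cheap enqueue from
 * the mountpoint thread, blocking dequeue from the viewer thread. */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define QUEUE_LEN 64

typedef struct {
    char *pkts[QUEUE_LEN];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t nonempty;
} viewer_queue;

void queue_init(viewer_queue *q) {
    memset(q, 0, sizeof(*q));
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->nonempty, NULL);
}

/* Mountpoint thread: copy the packet and enqueue it (cheap), then move
 * on to the next viewer. Returns 0 on success, -1 on drop/failure. */
int queue_push(viewer_queue *q, const char *pkt, size_t len) {
    char *copy = malloc(len);
    if (!copy)
        return -1;
    memcpy(copy, pkt, len);
    pthread_mutex_lock(&q->lock);
    if (q->count == QUEUE_LEN) {    /* queue full: drop this packet */
        pthread_mutex_unlock(&q->lock);
        free(copy);
        return -1;
    }
    q->pkts[q->tail] = copy;
    q->tail = (q->tail + 1) % QUEUE_LEN;
    q->count++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Per-viewer thread: block until a packet is available, then take
 * ownership of it and (in the real system) send it over the wire. */
char *queue_pop(viewer_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    char *pkt = q->pkts[q->head];
    q->head = (q->head + 1) % QUEUE_LEN;
    q->count--;
    pthread_mutex_unlock(&q->lock);
    return pkt;
}
```

The key property is that the mountpoint thread's per-viewer cost is one malloc+memcpy+enqueue, so with many viewers that single loop is still where the CPU time concentrates, consistent with the one-hot-thread symptom reported here.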

Thanks,
Lorenzo

Chris Stasiak

Oct 16, 2015, 6:34:02 AM
to meetecho-janus
Hi Lorenzo,

Thanks for all the information.

For the next benchmark I am going to scale vertically on a single bare-metal HPC instance by running 2 scenarios:
- multiple Janus instances
- using Docker

In both cases I expect to observe better CPU utilization. I will share my results as soon as possible.


Answering your questions:
1. It looks like Janus threads are distributed across all cores, but they consume very little CPU. Only one thread (the main one, I guess) uses max CPU (when many clients are connected).
2. I have tried to scale up to 100 clients at 5 Mbit/s each. Up to 25 clients there is ~400 ms delay. Up to 50 clients it works with a delivery delay of 2-3 seconds. Up to 100 clients there are missing frames and the delay is ~7 seconds.
3. I have tried the same with 2 publishers streaming into 2 separate mountpoints. The result is the same, meaning the limitation seems to be in distributing to the WebRTC endpoints (I could achieve good quality up to 25 clients total for mp1+mp2).

As you said, most likely there is a problem with looping/pushing data to many listeners when the stream bitrate is high!

Will investigate deeper and keep you updated.

Thanks,
Chris

Lorenzo Miniero

Oct 16, 2015, 11:57:39 AM
to meetecho-janus
On Friday, October 16, 2015 at 12:34:02 PM UTC+2, Chris Stasiak wrote:
Hi Lorenzo,

Thanks for all the information.

For the next benchmark I am going to scale vertically on a single bare-metal HPC instance by running 2 scenarios:
- multiple Janus instances
- using Docker

In both cases I expect to observe better CPU utilization. I will share my results as soon as possible.


Answering your questions:
1. It looks like Janus threads are distributed across all cores, but they consume very little CPU. Only one thread (the main one, I guess) uses max CPU (when many clients are connected).
2. I have tried to scale up to 100 clients at 5 Mbit/s each. Up to 25 clients there is ~400 ms delay. Up to 50 clients it works with a delivery delay of 2-3 seconds. Up to 100 clients there are missing frames and the delay is ~7 seconds.
3. I have tried the same with 2 publishers streaming into 2 separate mountpoints. The result is the same, meaning the limitation seems to be in distributing to the WebRTC endpoints (I could achieve good quality up to 25 clients total for mp1+mp2).



With the two mountpoints, did you see the load being distributed across two CPUs, or were both being served by the same one?
Looking forward to more results!

L.

Амнон Исраэли

Oct 16, 2015, 3:36:00 PM
to meetecho-janus
I use the streaming plugin on
2 x Xeon(R) CPU E5410 @ 2.33GHz (8 cores)

I stream to 5 mountpoints:
1 - video
4 - audio

Each audio mountpoint is a translation of the same video content.

On the web server I gather video+audio. We translate into 24 languages, but for testing I only did 4. In the future I plan to use all the translations.

30 clients / ~1 Mbit/s per client (video+audio)

~10% load, distributed among two CPUs

CentOS 7

 

Chris Stasiak

Oct 19, 2015, 6:45:57 AM
to meetecho-janus
Please have a look at some more results:


  1. Architecture

    1. Server description

      1. 2.9GHz Xeon E5-2690, 32GB RAM, 10gbps ethernet

      2. Debian Wheezy

    2. Publisher description

      1. Live RTP inbound from Gstreamer

      2. 2Mbit and 5Mbit incoming stream

    3. Subscriber description

      1. Headless Chrome instances on EC2 http://www.cargomedia.ch/2015/10/15/headless-chrome-on-ec2.html

      2. Up to 75 WebRTC subscribers per AWS instance


  2. Benchmark phases

    1. Single mountpoint per single Janus instance

    2. Two mountpoints per single Janus instance

    3. Two Janus instances with one mountpoint each.


  3. Results

    1. Spreadsheet with results: https://docs.google.com/a/cargomedia.ch/spreadsheets/d/1ViHanQXDHRkLsToYG3SOH2W-7VqZGWLu7nYselx9fWs/edit

       For a single or double mountpoint per Janus instance the CPU seems to behave identically. Only one process is active.

    2. The throughput limit has been reached at 310 Mbit/s per Janus instance (single and double mountpoint). Above that, the video starts stuttering.

    3. For 2 separate Janus instances the performance and throughput doubled. The limit of 310 Mbit/s still exists per single Janus instance, but with 2 Janus instances we could stream up to 600 Mbit/s. Adding further instances seems to work reliably.


Hope you can find this useful!

Regards,
Chris

Lorenzo Miniero

Oct 19, 2015, 6:54:04 AM
to meetecho-janus
Thanks for the update! The spreadsheet requires permission to access, just asked for it.

These are indeed interesting results. I thought that with the double mountpoint the related threads might be assigned to different cores, but that doesn't seem to be (always?) the case. I guess we could add an option to force a mountpoint onto a specific core (e.g., via pthread_setaffinity_np), but that may not be very intuitive.
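For reference, the affinity option mentioned above would boil down to something like the following sketch. This is purely illustrative (pin_current_thread is a hypothetical helper, not an existing Janus function), and pthread_setaffinity_np is Linux-specific and non-portable (hence the _np suffix):

```c
/* Hypothetical helper: pin the calling thread (e.g., a mountpoint's
 * relay thread) to one specific core. Linux-only: _GNU_SOURCE exposes
 * pthread_setaffinity_np() and the CPU_* macros. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Returns 0 on success, or an errno-style error code on failure. */
int pin_current_thread(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

A mountpoint thread calling pin_current_thread(n) at startup would stay on core n, which is why exposing this as a config option is possible but, as said, not very intuitive for users.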

The throughput limit may be a consequence of the CPU limit being reached: limiting the operations Janus has to do (e.g., limiting debugging, forcing bundle/rtcp-mux) or forcing compiler optimizations (not sure what we have now, maybe -O2?) might speed things up. Anyway, we'll have to figure out if and how we can leverage the available cores even in the single mountpoint scenario. I'm not sure that would be easy/possible considering that, as discussed, this is usually taken care of automatically by the kernel, and there are no obvious solutions in C.

L.

Chris Stasiak

Oct 19, 2015, 7:16:38 AM
to meetecho-janus
Sorry, the spreadsheet should be publicly available now.

My observations:
- Agreed, the throughput limitation is strongly related to the CPU. The second test was done with an upgraded CPU (from an Intel Xeon E5-2620 to a 2.9GHz Xeon E5-2690) and I confirm there is a significant increase in performance.
- The threads (peers) are also distributed over all cores (with utilization of less than 2-3% each), but there is always a single thread (the main one) which consumes 100% of a core (with higher outbound traffic), for a single or double mountpoint.
- When the mountpoint main thread (single/double test) reaches max CPU, it is easy to observe that most of the webrtc/peer threads are at 0% CPU. It looks like data is not pushed to them anymore?

Some compiler optimization would help for sure! However, better thread distribution per mountpoint would increase throughput significantly, imo.

Regards,
Chris

Lorenzo Miniero

Oct 19, 2015, 7:22:47 AM
to meetecho-janus
On Monday, October 19, 2015 at 1:16:38 PM UTC+2, Chris Stasiak wrote:
Sorry, the spreadsheet should be publicly available now.

My observations:
- Agreed, the throughput limitation is strongly related to the CPU. The second test was done with an upgraded CPU (from an Intel Xeon E5-2620 to a 2.9GHz Xeon E5-2690) and I confirm there is a significant increase in performance.
- The threads (peers) are also distributed over all cores (with utilization of less than 2-3% each), but there is always a single thread (the main one) which consumes 100% of a core (with higher outbound traffic), for a single or double mountpoint.
- When the mountpoint main thread (single/double test) reaches max CPU, it is easy to observe that most of the webrtc/peer threads are at 0% CPU. It looks like data is not pushed to them anymore?



Are you sure the more or less idle threads are the peer ones? I'm not sure the naming we use (we use GThread for threads) is reflected in anything but the logging, so those threads may actually be something else (e.g., HTTP/WebSockets requests being served).

Have you tried attaching, via gdb, to the exact pid of the thread that seems to be working much harder than the others? This should allow you to spot which part of the code is being invoked and, as a consequence, which of the threads we create in Janus is running it.

 
Some compiler optimization would help for sure! However, better thread distribution per mountpoint would increase throughput significantly, imo.



As I was saying, there's no obvious way to do that in C. Other languages have helpers for this; C doesn't, and it is mostly left to the kernel. If there really is a single thread doing a lot while the others aren't, it's not the threads we have to distribute better, but their responsibilities.

L.