High CPU Utilization Between Distros


John Burke

Nov 4, 2021, 2:35:36 PM
to rtpengine
Hey Everyone,

There's been some discussion on GitHub about the CPU increase across Linux distros (#1387, #1379).  I've run into this same issue when upgrading rtpengine versions, which took me from Debian 9 to 10/11.  I initially upgraded from mr7.0.0.1/Debian 9 to the latest version running on Debian 11, but noticed that it was running at roughly 2x the CPU load.

I've gone through the exercise of running head-to-head tests with the same source code compiled on different Debian versions, and the results show that Debian 9 runs at half the load compared to Debian 10/11... regardless of rtpengine version (see attached).  I've confirmed the kernel module is properly taking the traffic by watching /proc/rtpengine/0/list.

Since this is just my lab, I've taken the recommendations from #1379 and disabled ALL CPU mitigations to see whether they were causing the increased load:

grep . /sys/devices/system/cpu/vulnerabilities/*
/sys/devices/system/cpu/vulnerabilities/itlb_multihit:KVM: Vulnerable
/sys/devices/system/cpu/vulnerabilities/l1tf:Mitigation: PTE Inversion; VMX: vulnerable
/sys/devices/system/cpu/vulnerabilities/mds:Vulnerable; SMT vulnerable
/sys/devices/system/cpu/vulnerabilities/meltdown:Vulnerable
/sys/devices/system/cpu/vulnerabilities/spec_store_bypass:Vulnerable
/sys/devices/system/cpu/vulnerabilities/spectre_v1:Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
/sys/devices/system/cpu/vulnerabilities/spectre_v2:Vulnerable, STIBP: disabled
/sys/devices/system/cpu/vulnerabilities/srbds:Not affected
/sys/devices/system/cpu/vulnerabilities/tsx_async_abort:Not affected

However, this had no effect; the system still ran at 2x the load.
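
For reference, the usual way to get the system into this state on Debian is via the kernel command line (a rough sketch, assuming GRUB is the bootloader; older kernels that predate the umbrella mitigations= option need the individual flags instead):

# /etc/default/grub (excerpt): append mitigations=off to whatever is already in the default cmdline
GRUB_CMDLINE_LINUX_DEFAULT="quiet mitigations=off"

# regenerate the GRUB config and reboot for it to take effect
update-grub
reboot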

Anyone have any suggestions / ideas?

Thanks,
John Burke
user_cpu.png
kernel_cpu.png
total_cpu.png

Richard Fuchs

Nov 4, 2021, 3:02:01 PM
to rtpe...@googlegroups.com
So which kernel versions were these done with? (Each Debian release may offer different kernels, e.g. depending on whether the -backports repo is enabled...)

Do you have the option of trying different kernels on one system while leaving everything else alone? (e.g. a new kernel on Deb 9 or an older one on Deb 10?)

Is this a VM or on bare metal?

Which process does the CPU usage get accounted to?

John Burke

Nov 4, 2021, 3:45:50 PM
to rtpengine
Tests are run on bare metal, same server.

Kernel versions:
debian  9 = 4.9.0-16-amd64
debian 10 = 4.19.0-18-amd64

Yes, I can run some tests with separate kernels.  I can try a bisect approach on debian 10 and see if I can pinpoint a specific kernel version that introduces the CPU spike... unless you had a specific version you wanted to test.

As for the CPU usage, I attached a capture.  It looks like the CPU is accounted to the parent process with equal distribution to the child processes.

Thanks,
John
htop_cpu.png

Richard Fuchs

Nov 4, 2021, 6:24:14 PM
to rtpe...@googlegroups.com
On 04/11/2021 15.45, [EXT] John Burke wrote:
> Tests are run on bare metal, same server.
>
> Kernel versions:
> debian  9 = 4.9.0-16-amd64
> debian 10 = 4.19.0-18-amd64
>
> Yes, I can run some tests with separate kernels.  I can try a bisect approach on debian 10 and see if I can pinpoint a specific kernel version that introduces the CPU spike... unless you had a specific version you wanted to test.

I'd say trying Deb 9 with the kernel from the -backports repo (which should be 4.19, similar to Deb 10) would be a good start.
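
Something along these lines should pull it in (a sketch assuming the stretch-backports entry is already in sources.list; add it first if not):

# /etc/apt/sources.list: backports entry for Debian 9 (stretch)
deb http://deb.debian.org/debian stretch-backports main

apt-get update
apt-get -t stretch-backports install linux-image-amd64
# reboot into the new kernel afterwards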


> As for the CPU usage, I attached a capture.  It looks like the CPU is accounted to the parent process with equal distribution to the child processes.

That's odd, because the parent process doesn't really do anything. What does regular top with the -H option say? (I think htop might be summing up the usage somehow...) Also, what about the ksoftirqd processes? (Those would be what the kernel forwarding shows up as.)
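
For example, something like this gives a per-thread view plus the ksoftirqd usage (assuming the daemon process is simply named rtpengine):

# per-thread CPU usage of rtpengine itself
top -H -p "$(pidof -s rtpengine)"

# quick snapshot of the busiest threads, including ksoftirqd
ps -eLo pid,comm,pcpu --sort=-pcpu | grep -E 'rtpengine|ksoftirqd' | head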

John Burke

Nov 5, 2021, 2:01:57 PM
to rtpengine
Yeah, it appears that htop tries to sum up the child processes... and it doesn't show the kernel-side processes, which I wasn't aware of.

I took your advice and ran Deb 9 with the backported 4.19 kernel and observed the ~2x CPU increase.  Attached are snapshots (via top -H) of the load during the tests.  What's odd is that with the backported kernel there are 2 kernel threads, but with the base kernel just 1.  I'm not sure whether the /# suffix refers to the table id used by the kernel module, but in both tests the table id was 0.

iptables rules:
-A INPUT -p udp -m udp -j RTPENGINE --id 0
-A INPUT -p udp -m udp -j ACCEPT



top_debian9_bpkernel.png
top_debian9_basekernel.png

Richard Fuchs

Nov 5, 2021, 2:37:32 PM
to rtpe...@googlegroups.com
On 05/11/2021 14.01, [EXT] John Burke wrote:
> Yeah, it appears that htop tries to sum up the child processes... and it doesn't show the kernel-side processes, which I wasn't aware of.
>
> I took your advice and ran Deb 9 with the backported 4.19 kernel and observed the ~2x CPU increase.  Attached are snapshots (via top -H) of the load during the tests.  What's odd is that with the backported kernel there are 2 kernel threads, but with the base kernel just 1.  I'm not sure whether the /# suffix refers to the table id used by the kernel module, but in both tests the table id was 0.

That is related to how incoming packets are distributed to soft IRQ handlers on different CPU cores by the kernel and the NIC driver. There's one ksoftirq thread per CPU core. Look at how many RX queues your NIC provides, how they're distributed to different cores, and how packets are assigned to queues (all of which can be done through ethtool).
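
For example (eth0 is just a placeholder for the actual interface name):

# how many RX/TX queues (channels) the NIC exposes
ethtool -l eth0

# which header fields are hashed to pick an RX queue for UDP/IPv4 flows
ethtool -n eth0 rx-flow-hash udp4

# the RSS indirection table (hash bucket -> queue mapping)
ethtool -x eth0

# how the NIC queue interrupts are spread across CPUs
grep eth0 /proc/interrupts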

Cheers

Volodymyr Fedorov

Nov 5, 2021, 2:38:20 PM
to rtpengine
Hi John,
I can see quite high ksoftirqd usage; what network card are you using?
Could you try:
ethtool -N eth2 rx-flow-hash udp4 sdfn
But please substitute eth2 with your real adapter name.
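
The current hash setting can be checked before and after the change with:

ethtool -n eth2 rx-flow-hash udp4
# after the change this should list source/destination addresses and ports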

John Burke

Nov 11, 2021, 9:07:51 PM
to rtpengine
Thank you for the feedback! As you both identified, my test environment was showing high CPU due to network card limitations. I'm not sure why Deb 9 vs. Deb 10/11 showed such a CPU difference there, but it's a bit of a moot point since we never reach the NIC limits on our production hardware.

I reset and went back to reviewing the production env.  I noticed that the CPU increase looks to be correlated to processing packets within userspace.  Setting the endpoint-learning option to immediate brings the CPU level down to what we normally see out of the server running mr7.x.  It appears that the 3 sec learning phase where packets traverse userspace is causing the CPU load.  I have confirmed that this 3 sec learning phase also exists on our servers running mr7.x where the load is low, so the packet throughput in userspace should be the same between rtpengine versions.
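
For anyone else who wants to try this, the relevant setting is just the following (ini-style config sketch; the option should also be accepted as --endpoint-learning on the command line):

# /etc/rtpengine/rtpengine.conf (excerpt)
[rtpengine]
# one of: delayed (default), immediate, off, heuristic
endpoint-learning = immediate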

I set up head-to-head testing on production hardware.  Attached are the thread loads (top -H), where mr10.x uses quite a bit more CPU than mr7.x under the same packets-per-second load.  The poller threads seem to be the culprit.

Thanks,
John
top_mr7.x.png
top_mr10.x.png

Alex Lutay

Nov 12, 2021, 9:29:43 AM
to rtpengine
Hi John,

Make sure your calls are still being processed in kernel mode...
and that no transcoding etc. is in use.
The latest rtpengine reports much better internal statistics:

> root@sp2:~# rtpengine-ctl list totals
>
> Statistics over currently running sessions:
...
> Transcoded media :0
> Packets per second (userspace) :0
> Bytes per second (userspace) :0
> Errors per second (userspace) :0
> Packets per second (kernel) :0
> Bytes per second (kernel) :0
> Errors per second (kernel) :0
> Packets per second (total) :0
> Bytes per second (total) :0
> Errors per second (total) :0
> Userspace-only media streams :0
> Kernel-only media streams :0
> Mixed kernel/userspace media streams :0
...

On 11/12/21 3:07 AM, [EXT] John Burke wrote:
...
> I reset and went back to reviewing the production env.  I noticed that
> the CPU increase looks to be correlated to processing packets within
> userspace.  Setting the endpoint-learning option to /immediate/ brings
> the CPU level down to what we normally see out of the server running
> mr7.x.  It appears that the 3 sec learning phase where packets traverse
> userspace is causing the CPU load.  I have confirmed that this 3 sec
> learning phase also exists on our servers running mr7.x where the load
> is low, so the packet throughput in userspace should be the same between
> rtpengine versions.
...


--
Alex Lutay

John Burke

Nov 16, 2021, 6:01:44 PM
to rtpengine
Wanted to follow up.  I found the CPU increase is related to userspace processing and the Linux kernel version.  Taking the same code base and running userspace-only tests results in roughly +150% CPU on Debian 10 and +200% on Debian 11 compared to Debian 9.  The results can be duplicated by compiling and running the older kernels on Debian 11, where the extra 150%/200% goes away again.  I saw no noticeable load increase when operating purely in kernelspace (i.e. OFF / IMMEDIATE endpoint-learning).  I'm guessing the numbers are less extreme on newer-generation processors (I spot-checked a few and the delta was closer to +130% between Debian 9 and 11).

Depending on traffic quality, this may or may not cause a noticeable load impact.  I was analyzing shorter-duration traffic, which causes a noticeable increase in load since the userspace-to-kernel packet ratio is quite high.  For longer-duration traffic, this is less of an issue.

I have a PR that I'm going to submit shortly to allow setting the endpoint-learning option on a per-call basis.  That way, for lower-quality traffic (where NAT is usually not an issue), CPU can be spared via the OFF/IMMEDIATE setting, while endpoint learning can still be enforced for known NAT'd traffic.
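
Purely to illustrate the idea (hypothetical syntax, not necessarily what the PR ends up using): the controlling proxy would include a per-call endpoint-learning key in the ng offer for calls where it knows NAT is not in play, e.g. a decoded ng message along the lines of:

# hypothetical ng offer, shown decoded (the wire format is bencode);
# the per-call "endpoint-learning" key is the part the PR would add
{ "command": "offer",
  "call-id": "...",
  "from-tag": "...",
  "sdp": "...",
  "endpoint-learning": "off" }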

Appreciate all the help with this one!

Volodymyr Fedorov

Nov 17, 2021, 1:24:27 AM
to John Burke, rtpengine
Hi John,
One more thing: did you try disabling all the mitigations and redoing all your tests on the latest Debian releases?
Br,
Vova

--
Best regards,
Volodymyr

John Burke

Nov 17, 2021, 1:37:34 PM
to rtpengine
Hey Vova,

I failed to mention that, but yes, I did run an equal set of tests with mitigations disabled.  The results showed minimal impact: ~5% CPU savings on both Deb 10 and Deb 11.

Thanks,
John

Michael Zimpel

Jan 24, 2022, 6:23:24 AM
to rtpengine
Hi all,

we also observed this issue on our servers running rtpengine.

The latest testing was with mr9.4.1.6 on Debian 9 / 10 / 11. Softirq load is much higher starting with Debian 10.

With kernel 4.9.0-17-amd64 our rtpengine performance is very good.
Grafana attached.
------------------------------------------------------------------------------
top - 11:55:06 up 2 days, 11:12,  1 user,  load average: 0,09, 0,06, 0,05
Tasks: 215 total,   1 running, 214 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu1  :  0,0 us,  0,0 sy,  0,0 ni, 99,7 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu2  :  0,3 us,  0,7 sy,  0,0 ni, 99,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu3  :  0,7 us,  0,4 sy,  0,0 ni, 98,9 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu4  :  1,0 us,  0,3 sy,  0,0 ni, 98,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu5  :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu6  :  0,4 us,  0,0 sy,  0,0 ni, 99,3 id,  0,0 wa,  0,0 hi,  0,4 si,  0,0 st
%Cpu7  :  0,3 us,  0,7 sy,  0,0 ni, 99,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu8  :  0,0 us,  0,4 sy,  0,0 ni, 99,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu9  :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu10 :  0,0 us,  0,0 sy,  0,0 ni, 99,7 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu11 :  0,0 us,  0,0 sy,  0,0 ni,100,0 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%Cpu12 :  0,0 us,  0,3 sy,  0,0 ni, 99,3 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu13 :  0,0 us,  0,0 sy,  0,0 ni, 99,7 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu14 :  0,3 us,  0,3 sy,  0,0 ni, 99,0 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
%Cpu15 :  0,0 us,  0,7 sy,  0,0 ni, 99,0 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st

Refcount:    2
Control PID: 3970
Targets:     2328
------------------------------------------------------------------------------


With kernel 5.10.0-9-amd64 the softirqs are much higher.
Grafana attached.
------------------------------------------------------------------------------
top - 11:55:18 up 3 days,  2:46,  1 user,  load average: 0,26, 0,25, 0,16
Tasks: 208 total,   2 running, 206 sleeping,   0 stopped,   0 zombie
%CPU0  :  0,4 us,  0,4 sy,  0,0 ni, 91,1 id,  0,0 wa,  0,0 hi,  8,0 si,  0,0 st
%CPU1  :  0,0 us,  0,9 sy,  0,0 ni, 88,5 id,  0,0 wa,  0,0 hi, 10,6 si,  0,0 st
%CPU2  :  0,0 us,  0,4 sy,  0,0 ni, 97,4 id,  0,0 wa,  0,0 hi,  2,2 si,  0,0 st
%CPU3  :  0,4 us,  2,5 sy,  0,0 ni, 97,1 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%CPU4  :  0,0 us,  0,0 sy,  0,0 ni, 86,6 id,  0,0 wa,  0,0 hi, 13,4 si,  0,0 st
%CPU5  :  0,0 us,  0,9 sy,  0,0 ni, 94,8 id,  0,0 wa,  0,0 hi,  4,4 si,  0,0 st
%CPU6  :  0,9 us,  0,0 sy,  0,0 ni, 94,3 id,  0,0 wa,  0,0 hi,  4,8 si,  0,0 st
%CPU7  :  0,0 us,  0,8 sy,  0,0 ni, 68,9 id,  0,0 wa,  0,0 hi, 30,3 si,  0,0 st
%CPU8  :  0,4 us,  1,3 sy,  0,0 ni, 98,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%CPU9  :  0,0 us,  0,0 sy,  0,0 ni, 73,2 id,  0,0 wa,  0,0 hi, 26,8 si,  0,0 st
%CPU10 :  0,8 us,  0,8 sy,  0,0 ni, 98,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%CPU11 :  0,5 us,  0,9 sy,  0,0 ni, 89,1 id,  0,0 wa,  0,0 hi,  9,5 si,  0,0 st
%CPU12 :  0,8 us,  1,3 sy,  0,0 ni, 97,9 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
%CPU13 :  0,9 us,  1,7 sy,  0,0 ni, 84,6 id,  0,0 wa,  0,0 hi, 12,8 si,  0,0 st
%CPU14 :  0,0 us,  0,4 sy,  0,0 ni, 98,7 id,  0,0 wa,  0,0 hi,  0,9 si,  0,0 st
%CPU15 :  0,4 us,  0,4 sy,  0,0 ni, 61,6 id,  0,0 wa,  0,0 hi, 37,6 si,  0,0 st

Refcount:    1
Control PID: 1522
Targets:     2278
------------------------------------------------------------------------------


Kernel forwarding was working; this was checked via /proc/rtpengine/0/status (and list). Our measurements are way above the 200% CPU increase reported by John, which might be caused by our traffic profile. But the same traffic profile can easily run 11,000+ targets on 4.9.0-17-amd64 with softirqs not going above 30-40% CPU usage.
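
In case it is useful for anyone comparing kernels: the per-core softirq share from the snapshots above can also be watched over time, e.g. with mpstat from the sysstat package or via the raw NET_RX counters:

# %soft column per CPU, refreshed every second
mpstat -P ALL 1

# raw NET_RX softirq counters per CPU
watch -d -n1 'grep -E "CPU|NET_RX" /proc/softirqs'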

Best regards
Michael
4.9.0-17-amd64.jpg
5.10.0-9-amd64.jpg

John Burke

Jan 24, 2022, 8:12:48 AM
to Michael Zimpel, rtpengine
Hey Michael,

Did you see the same results when changing the endpoint-learning config?

Thanks,
John