[PATCH] perftune.py - ignore num rx queues when choosing default cpuset mode

Eliran Sinvani

<eliransin@scylladb.com>

Mar 24, 2020, 3:53:14 AM

to seastar-dev@googlegroups.com, avi@scylladb.com, vladz@scylladb.com, glauber@scylladb.com, Eliran Sinvani

The theory suggested that when we have enough rx queues it would be
better to spread the IRQs across all of the CPUs, which means Seastar's
shards will share them with the rx queue interrupt processing, instead
of setting aside some CPUs only for the sake of the queue interrupt processing.
However, experience shows that in practice this approach is bad for our
high-percentile latencies.
This commit changes the default CPU distribution mode to take into
account only the box size (CPUs and hyperthreads) and make the decision
solely based on this.
The new rule of thumb is derived from the assumption that whenever we
can afford to spare a CPU/hyperthread for interrupt processing alone
we should prefer to do so, even if the load can be nicely and
symmetrically split between the units, since the context switches impact
our high-percentile latencies.

Note: The default configuration selection covers a wide range of cases
but sometimes it is essential to customize the configuration in order
to achieve best results.

Fixes #729
Ref #308

Signed-off-by: Eliran Sinvani <elir...@scylladb.com>
---
scripts/perftune.py | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/scripts/perftune.py b/scripts/perftune.py
index d83ad2bc..04ef30d6 100755
--- a/scripts/perftune.py
+++ b/scripts/perftune.py
@@ -771,12 +771,10 @@ class NetPerfTuner(PerfTunerBase):
         """
         Returns the default configuration mode for the given interface.
         """
-        rx_queues_count = self.__get_rx_queue_count(iface)
-
         num_cores = int(run_hwloc_calc(['--number-of', 'core', 'machine:0', '--restrict', self.args.cpu_mask]))
         num_PUs = int(run_hwloc_calc(['--number-of', 'PU', 'machine:0', '--restrict', self.args.cpu_mask]))
 
-        if num_PUs <= 4 or rx_queues_count == num_PUs:
+        if num_PUs <= 4:
             return PerfTunerBase.SupportedModes.mq
         elif num_cores <= 4:
             return PerfTunerBase.SupportedModes.sq
@@ -1205,8 +1203,8 @@ Modes description:
spreads NAPIs' handling between all CPUs.

If there isn't any mode given script will use a default mode:
- - If number of physical CPU cores per Rx HW queue is greater than 4 - use the 'sq-split' mode.
- - Otherwise, if number of hyperthreads per Rx HW queue is greater than 4 - use the 'sq' mode.
+ - If number of physical CPU greater than 4 - use the 'sq-split' mode.
+ - Otherwise, if number of hyperthreads is greater than 4 - use the 'sq' mode.
- Otherwise use the 'mq' mode.

Default values:
--
2.24.1
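
For readers comparing the two rules, the default-mode decision before and after this patch can be sketched as a pair of standalone functions. This is only an illustration: `Mode`, `default_mode_old` and `default_mode_new` are hypothetical names, the counts are passed in directly instead of being obtained through hwloc, and the final `sq_split` fallback is assumed from the help text in the diff, since the hunk does not show that branch.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical stand-in for PerfTunerBase.SupportedModes."""
    mq = "mq"
    sq = "sq"
    sq_split = "sq_split"


def default_mode_old(num_cores, num_pus, rx_queues_count):
    """Rule before the patch: one Rx queue per PU also selects 'mq'."""
    if num_pus <= 4 or rx_queues_count == num_pus:
        return Mode.mq
    elif num_cores <= 4:
        return Mode.sq
    # Assumed fallback, taken from the help text rather than the hunk.
    return Mode.sq_split


def default_mode_new(num_cores, num_pus):
    """Rule after the patch: only the machine size is considered."""
    if num_pus <= 4:
        return Mode.mq
    elif num_cores <= 4:
        return Mode.sq
    # Assumed fallback, taken from the help text rather than the hunk.
    return Mode.sq_split
```

Under these assumptions, a machine with 8 cores, 16 PUs and 16 Rx queues moves from `mq` under the old rule to `sq_split` under the new one, while machines with 4 or fewer PUs keep `mq` either way.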

Avi Kivity

<avi@scylladb.com>

Mar 25, 2020, 4:01:56 AM

to Eliran Sinvani, seastar-dev@googlegroups.com, vladz@scylladb.com, glauber@scylladb.com

On 3/24/20 9:53 AM, Eliran Sinvani wrote:
> The theory suggested that when we have enough rx queues it will be
> better to spread the IRQs across all of the CPUs which means Seastars
> shards will share them with the rx queue interrupt processing, instead
> of cutting some CPUs only for the sake of the queue interrupt processing.
> However in practice experience shows that this approach is bad for our
> high percentage latencies.
> This commit changes the default CPU distribution mode to only take into
> account the BOX size (CPUS and Hyperthreads) and make the decision
> solely based on this.
> The new rule of thumb is derived from the assumption that whenever we
> can afford to spare a CPU/Hyperthread for interrupt processing alone
> we should prefer to do so even if the load can be nicely and
> symmetrically split between the units since the context switches impacts
> our high percentage latencies.

Is there any understanding why the previous rule was wrong? Any
measurements?

Eliran Sinvani

<eliransin@scylladb.com>

Mar 25, 2020, 4:29:21 AM

to Avi Kivity, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

On Wed, Mar 25, 2020 at 10:01 AM Avi Kivity <a...@scylladb.com> wrote:

> On 3/24/20 9:53 AM, Eliran Sinvani wrote:
> > The theory suggested that when we have enough rx queues it will be
> > better to spread the IRQs across all of the CPUs which means Seastars
> > shards will share them with the rx queue interrupt processing, instead
> > of cutting some CPUs only for the sake of the queue interrupt processing.
> > However in practice experience shows that this approach is bad for our
> > high percentage latencies.
> > This commit changes the default CPU distribution mode to only take into
> > account the BOX size (CPUS and Hyperthreads) and make the decision
> > solely based on this.
> > The new rule of thumb is derived from the assumption that whenever we
> > can afford to spare a CPU/Hyperthread for interrupt processing alone
> > we should prefer to do so even if the load can be nicely and
> > symmetrically split between the units since the context switches impacts
> > our high percentage latencies.
>
> Is there any understanding why the previous rule was wrong? Any
> measurements?

There are no measurements. There is @Glauber Costa's and @Vladislav Zolotarov's field experience.

As for understanding, there is an explanation, if that's what you mean: having interrupt context switches on every shard increases latencies for the operations waiting in line while the ISR is executing.

I will ask @Roy Dahan about how to run a latency measurement test on i3en.

Another point to consider: even if this configuration is fine, why isn't it also correct when, for example, rx_queues_count == (num_PUs + 1), or in a symmetric case like rx_queues_count == (2*num_PUs)?

So IMHO, either way, something is not right with this condition as it is today :)

Avi Kivity

<avi@scylladb.com>

Mar 25, 2020, 5:24:39 AM

to Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

On 3/25/20 10:29 AM, Eliran Sinvani wrote:

> On Wed, Mar 25, 2020 at 10:01 AM Avi Kivity <a...@scylladb.com> wrote:
>
>> On 3/24/20 9:53 AM, Eliran Sinvani wrote:
>>> The theory suggested that when we have enough rx queues it will be
>>> better to spread the IRQs across all of the CPUs which means Seastars
>>> shards will share them with the rx queue interrupt processing, instead
>>> of cutting some CPUs only for the sake of the queue interrupt processing.
>>> However in practice experience shows that this approach is bad for our
>>> high percentage latencies.
>>> This commit changes the default CPU distribution mode to only take into
>>> account the BOX size (CPUS and Hyperthreads) and make the decision
>>> solely based on this.
>>> The new rule of thumb is derived from the assumption that whenever we
>>> can afford to spare a CPU/Hyperthread for interrupt processing alone
>>> we should prefer to do so even if the load can be nicely and
>>> symmetrically split between the units since the context switches impacts
>>> our high percentage latencies.
>>
>> Is there any understanding why the previous rule was wrong? Any
>> measurements?
>
> There are no measurements. There is @Glauber Costa's and @Vladislav Zolotarov's field experience.
>
> About understanding, there is an explanation if that's what you mean. Having interrupt context switches on every
> shard increases latencies for the operations waiting in line while the ISR is executing.

Having fewer cores to process work, and cross-core communications also increases latency.

In the extreme case, if the networking code becomes the bottleneck, throughput is severely reduced.

> I will ask @Roy Dahan about how to run a latency measurement test on i3en.
>
> Another point of consideration, even if this configuration is fine, why isn't it also correct when for example:
>
> rx_queues_count == (num_PUs + 1) or a symmetric case like: rx_queues_count == (2*num_PUs) ?
>
> So IMHO either way something is not right with this condition as it is today :)

Maybe, but if you want to make a change here you need to come with better justification than "derived from the assumption".

Eliran Sinvani

<eliransin@scylladb.com>

Mar 25, 2020, 5:35:23 AM

to Avi Kivity, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

OK, I will do some measurements in order to justify the change. Is doing a comparison on Scylla latency and throughput enough?

My plan is to pick a machine that meets the above condition (i3en) and run performance tests (latency and throughput) with the two configurations: the current one and the one that would be chosen if the condition were removed. Do you think this is enough?

Vladislav Zolotarov

<vladz@scylladb.com>

Mar 26, 2020, 5:44:06 PM

to Eliran Sinvani, seastar-dev@googlegroups.com, avi@scylladb.com, glauber@scylladb.com

You lost a "cores" in the "physical CPU cores greater..."

Vladislav Zolotarov

<vladz@scylladb.com>

Mar 26, 2020, 5:45:56 PM

to Eliran Sinvani, Avi Kivity, Roy Dahan, seastar-dev, Glauber Costa

Drivers are not going to allocate more RSS-capable HW queues than there are PUs (hyperthreads) in the system.

Avi Kivity

<avi@scylladb.com>

Mar 29, 2020, 3:47:38 AM

to Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

I'd like to see some analysis. That is, not just measurements of the throughput numbers, but some proof of what the problem really is.

For example, is it caused by the kernel softirq thread interfering with scheduling? Or are the connections imbalanced across cores so that some cores have more work than others?

Eliran Sinvani

<eliransin@scylladb.com>

Mar 29, 2020, 6:41:42 AM

to Avi Kivity, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

OK, I'll take a stab at it. @Glauber Costa @Vladislav Zolotarov feel free to advise, as you agreed that this change is necessary.
I am willing to do the footwork (and the think work, within my limited experience); a way to prove to Avi the necessity/legitimacy of this change will help :D

Pekka Enberg

<penberg@scylladb.com>

Mar 30, 2020, 2:47:45 AM

to Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

On Sun, Mar 29, 2020 at 10:47 AM Avi Kivity <a...@scylladb.com> wrote:
> OK, I will do some measurements in order to justify the change. Is doing a comparison on Scylla latency and throughput enough?
>
> I'd like to see some analysis. That is, not just measurements of the throughput numbers, but some proof that the problem really is.

What kind of proof are you looking for?

You get higher latency because the kernel networking softirq thread is
stealing time from your application thread. Seastar does not like
this, just like it does not like other applications running on its
cores.

And, to make things worse, the kernel threads are processing packets
that might not even be consumed by the co-located application thread.
The worst case scenario is that your application thread is constantly
being interrupted to process packets for another application thread.

Isolating network processing on its own dedicated cores eliminates the
interference. This comes at the (potential) cost of reduced maximum
throughput if you don't dedicate enough cores to the network stack and
wastes cores if you dedicate too many.

- Pekka

Avi Kivity

<avi@scylladb.com>

Mar 30, 2020, 3:30:11 AM

to Pekka Enberg, Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

On 3/30/20 9:47 AM, Pekka Enberg wrote:
> On Sun, Mar 29, 2020 at 10:47 AM Avi Kivity <a...@scylladb.com> wrote:
>> OK, I will do some measurements in order to justify the change. Is doing a comparison on Scylla latency and throughput enough?
>>
>> I'd like to see some analysis. That is, not just measurements of the throughput numbers, but some proof that the problem really is.
> What kind of proof are you looking for?
>
> You get higher latency because the kernel networking softirq thread is
> stealing time from your application thread.

That doesn't explain higher latency. If softirq steals 50 usec out of
every 500 usec poll period, the latency would not increase appreciably.

If it steals more than 50 usec per 500 usec poll period, then on a
10-core system, moving it to a dedicated core would bottleneck on that
core and performance would suffer.

Now, it's possible that softirq is more efficient on a dedicated core
and therefore bottlenecks later, but this is handwaving.

> Seastar does not like
> this, just like it does not like other applications running on its
> cores.
>
> And, to make things worse, the kernel threads are processing packets
> that might not even be consumed by the co-located application thread.
> The worst case scenario is that your application thread is constantly
> being interrupted to process packets for another application thread.

The kernel has software flow steering and so this should be minimal. It
can be bad if hardware connection distribution is bad, but this is the
sort of thing I want measurement and analysis for.

Pekka Enberg

<penberg@scylladb.com>

Mar 30, 2020, 3:49:36 AM

to Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

Hi Avi,

On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
>
> On 3/30/20 9:47 AM, Pekka Enberg wrote:
> > On Sun, Mar 29, 2020 at 10:47 AM Avi Kivity <a...@scylladb.com> wrote:
> >> OK, I will do some measurements in order to justify the change. Is doing a comparison on Scylla latency and throughput enough?
> >>
> >> I'd like to see some analysis. That is, not just measurements of the throughput numbers, but some proof that the problem really is.
> > What kind of proof are you looking for?
> >
> > You get higher latency because the kernel networking softirq thread is
> > stealing time from your application thread.
>
> That doesn't explain higher latency. If softirq steals 50 usec out of
> every 500 usec poll period, the latency would not increase appreciably.

The default RX softirq budget is 2 ms:

[penberg@nero linux]$ cat /proc/sys/net/core/netdev_budget_usecs
2000
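
The same value can be read programmatically, for example when collecting the settings of a test machine alongside latency results. The following is a minimal sketch; `read_net_core_sysctl` is a hypothetical helper name, and the `base` parameter exists only so the function can be pointed at a different directory.

```python
from pathlib import Path


def read_net_core_sysctl(name="netdev_budget_usecs", base="/proc/sys/net/core"):
    """Return the integer value of a net.core sysctl, or None if unavailable."""
    path = Path(base) / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # The file is missing, unreadable, or does not hold a plain integer.
        return None
```

On the machine shown above, `read_net_core_sysctl()` would return the same 2000 (microseconds) that `cat` printed.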

On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
> If it steals more than 50 usec per 500 usec poll period, then on a
> 10-core system, moving it to a dedicated core would bottleneck on that
> core and performance would suffer.

I don't understand what you mean here. Yes, moving to dedicated cores
can indeed limit _maximum throughput_ if you don't have enough cores
to sustain the network traffic. But why would the poll period matter
here at all? If the cores are dedicated for networking, then you will
just process packets in a busy loop.

On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
> Now, it's possible that softirq is more efficient on a dedicated core
> and therefore bottlenecks later, but this is handwaving.

You say "handwaving", I say "well-known".

See for example, the following papers on the topic of dedicating CPU
cores for network stack processing:

https://www.research.ibm.com/haifa/dept/stt/pubs/isostack-final.pdf

https://people.mpi-sws.org/~antoinek/documents/19eurosys_tas.pdf

> > Seastar does not like
> > this, just like it does not like other applications running on its
> > cores.
> >
> > And, to make things worse, the kernel threads are processing packets
> > that might not even be consumed by the co-located application thread.
> > The worst case scenario is that your application thread is constantly
> > being interrupted to process packets for another application thread.
>
> The kernel has software flow steering and so this shoud be minimal. It
> can be bad if hardware connection distribution is bad, but this is the
> sort of thing I want measurement and analysis for.

How does software flow steering help here? Hardware steering causes
the interference because packets are steered to a core that's running
an unrelated application thread.

Software steering arguably makes things worse because instead of
processing a packet in one step, you now first inspect it on core N to
determine the target core, and then perform the protocol stack
processing on core M, which requires cache lines to be moved there
too.

- Pekka

Avi Kivity

<avi@scylladb.com>

Mar 30, 2020, 4:00:36 AM

to Pekka Enberg, Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

On 3/30/20 10:49 AM, Pekka Enberg wrote:
> Hi Avi,
>
> On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
>> On 3/30/20 9:47 AM, Pekka Enberg wrote:
>>> On Sun, Mar 29, 2020 at 10:47 AM Avi Kivity <a...@scylladb.com> wrote:
>>>> OK, I will do some measurements in order to justify the change. Is doing a comparison on Scylla latency and throughput enough?
>>>>
>>>> I'd like to see some analysis. That is, not just measurements of the throughput numbers, but some proof that the problem really is.
>>> What kind of proof are you looking for?
>>>
>>> You get higher latency because the kernel networking softirq thread is
>>> stealing time from your application thread.
>> That doesn't explain higher latency. If softirq steals 50 usec out of
>> every 500 usec poll period, the latency would not increase appreciably.
> The default RX softirq budget is 2 ms:
>
> [penberg@nero linux]$ cat /proc/sys/net/core/netdev_budget_usecs
> 2000

Maybe it should be reduced, like we tune other parameters.

>
> On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
>> If it steals more than 50 usec per 500 usec poll period, then on a
>> 10-core system, moving it to a dedicated core would bottleneck on that
>> core and performance would suffer.
> I don't understand what you mean here. Yes, moving to dedicated cores
> can indeed limit _maximum throughput_ if you don't have enough cores
> to sustain the network traffic. But why would the poll period matter
> here at all? If the cores are dedicated for networking, then you will
> just process packets in a busy loop.

I'm talking about the case where cores are not dedicated for networking.
I'm asserting we can't be using more than 10% of the core for networking
(on average), because if we did, then the move to a dedicated networking
core would cause that core to become a bottleneck.
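
The arithmetic behind this argument can be sketched as follows. It assumes, as a simplification, that the networking work costs the same on a dedicated core as it does when spread over the application cores; the function names are hypothetical.

```python
from fractions import Fraction


def dedicated_core_utilization(num_shards, stolen_usec, period_usec):
    """Load on a single dedicated networking core if it absorbed the softirq
    time currently taken from each of `num_shards` application cores."""
    per_core_share = Fraction(stolen_usec, period_usec)
    return num_shards * per_core_share


def is_bottleneck(num_shards, stolen_usec, period_usec):
    """True when the combined networking work exceeds one core's capacity."""
    return dedicated_core_utilization(num_shards, stolen_usec, period_usec) > 1
```

With 10 cores each giving up 50 usec out of every 500 usec, `dedicated_core_utilization(10, 50, 500)` is exactly 1, so a single dedicated core would be fully loaded; anything above 10% per core pushes it past saturation.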

>
> On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
>> Now, it's possible that softirq is more efficient on a dedicated core
>> and therefore bottlenecks later, but this is handwaving.
> You say "handwaving", I say "well-known".
>
> See for example, the following papers on the topic of dedicating CPU
> cores for network stack processing:
>
> https://www.research.ibm.com/haifa/dept/stt/pubs/isostack-final.pdf
>
> https://people.mpi-sws.org/~antoinek/documents/19eurosys_tas.pdf

Those papers are evidence, but don't translate directly. The
environments are different enough to merit our own measurement.

>>> Seastar does not like
>>> this, just like it does not like other applications running on its
>>> cores.
>>>
>>> And, to make things worse, the kernel threads are processing packets
>>> that might not even be consumed by the co-located application thread.
>>> The worst case scenario is that your application thread is constantly
>>> being interrupted to process packets for another application thread.
>> The kernel has software flow steering and so this shoud be minimal. It
>> can be bad if hardware connection distribution is bad, but this is the
>> sort of thing I want measurement and analysis for.
> How does software flow steering help here? Hardware steering causes
> the interference because packets are steered to a core that's running
> an unrelated application thread.
>
> Software steering arguably makes things worse because instead of
> processing a packet in one step, you now first inspect it on core N to
> determine the target core, and then perform the protocol stack
> processing on core M, which requires cache lines to be moved there
> too.

Software flow steering is very cheap compared to tcp. In fact the packet
doesn't have to be touched at all, because the hardware (at least some
NICs) provide the hash in the rx descriptor.

Hardware flow steering is bad when the number of queues is smaller than
the number of hardware threads, because it concentrates that work on a
subset of our shards, but that case is already considered by the script
(and asymmetric configuration is used).

Eliran, testing with i3 will be useless because it has a small number of
hardware queues.

Pekka Enberg

<penberg@scylladb.com>

Mar 30, 2020, 5:23:29 AM

to Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Vladislav Zolotarov, Glauber Costa

On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
> > > That doesn't explain higher latency. If softirq steals 50 usec out of
> > > every 500 usec poll period, the latency would not increase appreciably.

On 3/30/20 10:49 AM, Pekka Enberg wrote:
> > The default RX softirq budget is 2 ms:
> >
> > [penberg@nero linux]$ cat /proc/sys/net/core/netdev_budget_usecs
> > 2000

On Mon, Mar 30, 2020 at 11:00 AM Avi Kivity <a...@scylladb.com> wrote:
> Maybe it should be reduced, like we tune other parameters.

Yeah, definitely worth exploring.

On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
> > > If it steals more than 50 usec per 500 usec poll period, then on a
> > > 10-core system, moving it to a dedicated core would bottleneck on that
> > > core and performance would suffer.

On 3/30/20 10:49 AM, Pekka Enberg wrote:
> > I don't understand what you mean here. Yes, moving to dedicated cores
> > can indeed limit _maximum throughput_ if you don't have enough cores
> > to sustain the network traffic. But why would the poll period matter
> > here at all? If the cores are dedicated for networking, then you will
> > just process packets in a busy loop.

On Mon, Mar 30, 2020 at 11:00 AM Avi Kivity <a...@scylladb.com> wrote:
> I'm talking about the case where cores are not dedicated for networking.
> I'm asserting we can't be using more than 10% of the core for networking
> (on average), because if we did, then the move to a dedicated networking
> core would cause that core to become a bottleneck.

Sure.

Btw, I am not claiming a *single* CPU core is sufficient to keep up
with 10+ Gbps NICs. The time budget to process a packet is simply too
small for Linux networking stack to cope:

https://lwn.net/Articles/629155/

On Mon, Mar 30, 2020 at 10:30 AM Avi Kivity <a...@scylladb.com> wrote:
> > > Now, it's possible that softirq is more efficient on a dedicated core
> > > and therefore bottlenecks later, but this is handwaving.
> > You say "handwaving", I say "well-known".

On 3/30/20 10:49 AM, Pekka Enberg wrote:
> > See for example, the following papers on the topic of dedicating CPU
> > cores for network stack processing:
> >
> > https://www.research.ibm.com/haifa/dept/stt/pubs/isostack-final.pdf
> >
> > https://people.mpi-sws.org/~antoinek/documents/19eurosys_tas.pdf

On Mon, Mar 30, 2020 at 11:00 AM Avi Kivity <a...@scylladb.com> wrote:
> Those papers are evidence, but don't translate directly. The
> environments are different enough to merit our own measurement.

No disagreement over measurements, but I fail to see how the
environment is significantly different.

The main issue is the scheduler, AFAICT. Look at the flow of a request
through the system:

(1) A packet arrives on NIC RX queue on CPU X. This CPU is running an
application thread, which is preempted.

(2) CPU X performs software flow steering, allocates SKBs, forwards
them to the per-CPU queue of CPU Y, and, in some (rare) cases, sends
an IPI to CPU Y.

(3) Application thread on CPU Y is preempted to run the networking
stack. SKBs are processed by CPU Y and deallocated (SLUB remote free
is expensive, but I doubt it matters here).

So request latency is at the mercy of *two* Linux scheduling decisions
at points (2) and (3).

The CentOS kernel, at least, uses CONFIG_PREEMPT_VOLUNTARY:

https://git.centos.org/rpms/kernel/blob/c8/f/SOURCES/kernel-x86_64.config#_4381

This means that application thread is not preempted unless it invokes
a system call (CONFIG_PREEMPT_VOLUNTARY relies on sprinkled kernel
preemption points). Of course, Seastar tries very hard *not* to invoke
system calls, so your packet might be waiting in a queue for a while.

Also, as packets kept piling up while we waited for the softirq to
run, it's going to preempt the application threads for maximum
duration of its budget (easy to verify with trace-cmd, btw). So now
you ended up totally ruining the latency of some fraction of your
requests.

On 3/30/20 10:49 AM, Pekka Enberg wrote:
> > How does software flow steering help here? Hardware steering causes
> > the interference because packets are steered to a core that's running
> > an unrelated application thread.
> >
> > Software steering arguably makes things worse because instead of
> > processing a packet in one step, you now first inspect it on core N to
> > determine the target core, and then perform the protocol stack
> > processing on core M, which requires cache lines to be moved there
> > too.

On Mon, Mar 30, 2020 at 11:00 AM Avi Kivity <a...@scylladb.com> wrote:
> Software flow steering is very cheap compared to tcp. In fact the packet
> doesn't have to be touched at all, because the hardware (at least some
> NICs) provide the hash in the rx descriptor.

Right, I was thinking of RPS, but you are talking about RFS. I see
that get_rps_cpu() looks up the CPU where recvmsg() was last executed
on a socket using a hash that can be provided by the hardware.

On Mon, Mar 30, 2020 at 11:00 AM Avi Kivity <a...@scylladb.com> wrote:
> Hardware flow steering is bad when the number of queues is smaller than
> the number of hardware threads, because it concentrates that work on a
> subset of our shards, but that case is already considered by the script
> (and asymmetric configuration is used).

Hardware flow steering is not optimal if it does not match the
application-level sharding. RFS seems to address this issue by
establishing a mapping based on which CPU ran recvmsg() last. However,
it seems to me it's not that cheap from latency point of view because
of the scheduler.

- Pekka

Vladislav Zolotarov

<vladz@scylladb.com>

Mar 30, 2020, 9:07:12 PM

to Pekka Enberg, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Glauber Costa

Avi, Pekka, please, keep in mind that SoftIRQ and HardIRQ are two
different contexts.
NIC HardIRQ (a.k.a. ISR) is going to preempt whatever runs at the moment
on the CPU (except for other ISRs, in which case the arbitration is a
little more complex), disable its own interrupt and schedule the SoftIRQ
(NAPI).

NAPI will run the next time SoftIRQ is scheduled to run and these days
it runs in a thread called ksoftirqd/XYZ, where XYZ is the index of the
corresponding CPU.
ksoftirqd's priority is supposed to be "low".
On my machine it's 80 - the same as the one scylla thread has got. (I'd
guess it's the default - was too lazy to verify that).

But let's get back to what is going to happen next in our Rx path:

NAPI will try to handle up to budget packets (different for each driver,
usually 128) and if there isn't any packet left - will re-enable the
interrupt.
If there ARE packets left to process it will "tell SoftIRQ to run it
again in the next iteration" (I don't want to go into too much detail
here).

The latter mode is called "NAPI polling mode".

RPS packets backlog is scheduled on a remote CPU as a special NAPI
instance which "consumes" packets from its list instead of from the real
HW: https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L6113
Except for that it undergoes the same scheduling rules.

Each NAPI instance is going to be scheduled for running one after
another in the context of the corresponding CPU SoftIRQ (NET_RX, to be
specific):
https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L6616
And NET_RX will limit its own runtime by two factors:

/proc/sys/net/core/netdev_budget
/proc/sys/net/core/netdev_budget_usecs

The latter has already been mentioned by Pekka, and the former limits the
maximum number of packets allowed to be handled by all NAPI instances
called in a single NET_RX iteration:
https://elixir.bootlin.com/linux/latest/source/net/core/dev.c#L6645
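
The interaction between the per-NAPI budget and the per-iteration packet limit can be illustrated with a toy model. This is a deliberately simplified sketch and not the kernel's code: the time limit (netdev_budget_usecs) is not modelled, every entry in `pending` is assumed to be a NAPI instance already scheduled on this CPU, and the function and parameter names are hypothetical.

```python
def net_rx_round(pending, napi_budget, netdev_budget):
    """Toy model of a single NET_RX iteration on one CPU.

    pending       -- packets waiting on each scheduled NAPI instance
    napi_budget   -- max packets one NAPI poll may handle
    netdev_budget -- max packets for the whole iteration (all instances)

    Returns (handled, still_scheduled): the number of packets handled and the
    indices of instances that have to be polled again in a later iteration.
    """
    handled = 0
    still_scheduled = []
    for i, queued in enumerate(pending):
        if netdev_budget <= 0:
            # Iteration limit reached: this instance waits for the next round.
            still_scheduled.append(i)
            continue
        work = min(queued, napi_budget)
        handled += work
        netdev_budget -= work
        if queued - work > 0:
            # Packets left: stay in polling mode, interrupt stays disabled.
            still_scheduled.append(i)
        # Otherwise the queue is drained and its interrupt is re-enabled.
    return handled, still_scheduled
```

For example, with illustrative values `net_rx_round([10, 200], napi_budget=128, netdev_budget=300)`, the first instance drains and goes back to interrupt mode, while the second stays in polling mode with packets still queued.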

So, to sum up:
* If your app is running on the same CPU where your NIC IRQ is going to
be triggered, there is a good chance that there will be 2 context
switches for each HW interrupt: first by the HW interrupt and
then by SoftIRQ.
* If NAPI is in polling mode then there's going to be only a single
context switch per NET_RX invocation (no HW
interrupt).
* If all that NET_RX needs to invoke are RPS pollers then there is
going to be only a single context switch IF IPIs are rare.
* The more packets there are to handle and the more NAPI instances are
scheduled on the same CPU, the higher the chance that NAPIs
scheduled on this CPU are going to be called in polling mode, thereby
improving the CPU efficiency.

Therefore we can see that unless the "IRQ CPU" gets saturated its (CPU
wise) efficiency of handling egress packets is going to be better (at
least not worse) than in the case of single HW queue per CPU:
* Fewer context switches on Scylla CPUs.
* NAPI stays longer in the polling mode on IRQ CPUs == fewer HW
interrupts.

In the one-IRQ-per-CPU config case the only benefit is that the CPU
saturation point is harder to achieve.
From all other points this configuration seems to be worse.

>
> (2) CPU X performs software flow steering, allocates SKBs, forwards
> them to the per-CPU queue of CPU Y, and, in some (rare) cases, sends
> an IPI to CPU Y.
>
> (3) Application thread on CPU Y is preempted to run the networking
> stack. SKBs are processed by CPU Y and deallocated (SLUB remote free
> is expensive, but I doubt it matters here).

Just to clarify: application on the CPU Y is not supposed/going to be
preempted right away when SoftIRQ is scheduled due to reasons I
mentioned above (ksoftirqd has the same priority as your application).
SoftIRQ will only be scheduled to run at some later point in time
(according to scheduler rules). I see Pekka describes it in more detail
below.

>
> So request latency is at the mercy of *two* Linux scheduling decisions
> at points (2) and (3).

There is nothing about "mercy" or "chance" in (2) - skbs are steered to
a very well defined CPU(s) decided according to RPS/RFS configuration
(we configure both for any configuration).

I think what Avi said is true for both RPS and RFS. The hash value is
delivered on a packet's CQE on proper NICs.

>
> On Mon, Mar 30, 2020 at 11:00 AM Avi Kivity <a...@scylladb.com> wrote:
>> Hardware flow steering is bad when the number of queues is smaller than
>> the number of hardware threads, because it concentrates that work on a
>> subset of our shards, but that case is already considered by the script
>> (and asymmetric configuration is used).
> Hardware flow steering is not optimal if it does not match the
> application-level sharding.

Correct.
In other words, hash-based steering (HW or SW (RPS) - doesn't matter)
is, with very high probability, not going to be optimal in the Seastar case
because the hash will likely decide on a queue that belongs to a
different CPU than the one where your thread runs.
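
To put a rough number on "very high probability": if the hash is assumed to spread flows uniformly over the CPUs and independently of where the consuming shard runs (an idealisation, not a measurement), the expected share of flows landing on the wrong CPU is 1 - 1/N. The helper name below is hypothetical.

```python
from fractions import Fraction


def misdirected_fraction(num_cpus):
    """Expected fraction of flows whose hash selects a CPU other than the one
    running the consuming shard, assuming a uniform, independent hash."""
    if num_cpus < 1:
        raise ValueError("num_cpus must be positive")
    return 1 - Fraction(1, num_cpus)
```

Under that assumption, on a 16-CPU machine 15 out of 16 flows would be hashed to a CPU other than the one that consumes them.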

> RFS seems to address this issue by
> establishing a mapping based on which CPU ran recvmsg() last. However,
> it seems to me it's not that cheap from latency point of view because
> of the scheduler.

AFAICT there isn't a better alternative except for user space networking
(a-la DPDK + either choosing specific TCP ports in order to ensure a
"correct" HW RSS choice or configuring HW filtering rules).

The main win of SQ_SPLIT mode is that Scylla CPUs don't have NIC HW
interrupts and forceful context switches that accompany them.
The second big win is that NET_RX SoftIRQs that are going to run on
Scylla CPUs in SQ/SQ_SPLIT modes are going to process packets that are
later going to be consumed by the thread on the same CPU which is
supposed to yield better cache coherency.

Our observations show that in the MQ mode NAPI is rarely going to be in
the polling mode hence the context switching penalty is going to be the
highest.
Plus (although not formally proved) I expect the CPU cache efficiency to
be worse due to reasons I mentioned above.

>
> - Pekka

Vladislav Zolotarov

<vladz@scylladb.com>

Mar 30, 2020, 9:13:55 PM

to Eliran Sinvani, Avi Kivity, Roy Dahan, seastar-dev, Glauber Costa

What we are looking to demonstrate here is what I described in my previous email: a high level of context switches due to a high rate of HW interrupts on Scylla CPUs in MQ mode.
Then we will reason based on high-percentile latencies.
The test should be a latency test with a lot of small packets.

I think a default c-s schema (payload of 300 bytes) with the Seastar CPU load at about 50% should be a good start.
Let's have a call and I'll help you with setting up the testing environment.

Pekka Enberg

<penberg@scylladb.com>

Mar 31, 2020, 7:10:29 AM

to Vladislav Zolotarov, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Glauber Costa

Hi Vlad,

(Thanks for filling in many of the gaps I had in my explanation of how
the Linux network stack works!)

On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> Therefore we can see that unless the "IRQ CPU" gets saturated its (CPU
> wise) efficiency of handling egress packets is going to be better (at

You mean "ingress" (incoming) here, right?

On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> least not worse) than in the case of single HW queue per CPU:
> * Less context switches on Scylla CPUs.
> * NAPI stays longer in the polling mode on IRQ CPUs == less HW
> interrupts.
>
> In the one-IRQ-per-CPU config case the only benefit is that the CPU
> saturation point is harder to achieve.
> From all other points this configuration seems to be worse.

Yeah, very important point about the "IRQ CPU" configuration
(dedicated CPUs for the network stack) about transitioning and staying
in NAPI polling mode, which I indeed missed!

On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > So request latency is at the mercy of *two* Linux scheduling decisions
> > at points (2) and (3).
>
> There is nothing about "mercy" or "chance" in (2) - skbs are steered to
> a very well defined CPU(s) decided according to RPS/RFS configuration
> (we configure both for any configuration).

Let me be more explicit: at the mercy of the Linux scheduler _running_ the
network RX ksoftirqd thread.

You are of course absolutely correct that the _steering_ decision is
deterministic. However, this just means the packets are in the right
queue, but request _latency_ is determined by when that queue is
processed, and that's where the Linux kernel scheduling decisions come
in.

On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > RFS seems to address this issue by
> > establishing a mapping based on which CPU ran recvmsg() last. However,
> > it seems to me it's not that cheap from latency point of view because
> > of the scheduler.
>
> AFAICT there isn't a better alternative except for user space networking
> (a la DPDK + either choosing specific TCP ports in order to ensure a
> "correct" HW RSS choice or configuring HW filtering rules).

Kernel-bypass is going to be more efficient. One nice trick from MICA
is to open a port per shard and program the NIC flow controller to
steer packets to a specific core based on the port. This way, a
shard-aware driver could steer a request to the per-shard port and
have the request delivered directly to the correct CPU core by the
NIC. Another approach is to use XDP/eBPF to inspect the CQL requests
and steer to the correct CPU.
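
To make the port-per-shard idea concrete, here is a minimal sketch of
the mapping only. It is not MICA's or Scylla's code: the base port, the
function names and the dict standing in for NIC flow rules are all
hypothetical.

```python
BASE_PORT = 10000  # hypothetical: first port of the per-shard range


def shard_port(shard_id):
    """Port a shard-aware client would connect to in order to reach a shard."""
    return BASE_PORT + shard_id


def build_steering_table(num_shards):
    """Map destination port -> CPU core.

    This models the rules that would be programmed into the NIC; here it
    is just a dict.
    """
    return {shard_port(shard): shard for shard in range(num_shards)}


def core_for_packet(dst_port, table):
    """Core that receives the packet, or None to fall back to default steering."""
    return table.get(dst_port)


if __name__ == "__main__":
    table = build_steering_table(num_shards=8)
    print(core_for_packet(shard_port(5), table))  # request aimed at shard 5
    print(core_for_packet(9042, table))           # port outside the range
```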

On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> The main win of SQ_SPLIT mode is that Scylla CPUs don't have NIC HW
> interrupts and forceful context switches that accompany them.
> The second big win is that NET_RX SoftIRQs that are going to run on
> Scylla CPUs in SQ/SQ_SPLIT modes are going to process packets that are
> later going to be consumed by the thread on the same CPU which is
> supposed to yield better cache coherency.
>
> Our observations show that in the MQ mode NAPI is rarely going to be in
> the polling mode hence the context switching penalty is going to be the
> highest.
> Plus (although not formally proved) I expect the CPU cache efficiency to
> be worse due to reasons I mentioned above.

Just to clarify that I understand what you are saying here: SQ_SPLIT
== "IRQ CPU" == dedicated CPU cores for networking stack, correct?

- Pekka

Vladislav Zolotarov

<vladz@scylladb.com>

Mar 31, 2020, 10:08:56 AM

to Pekka Enberg, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Glauber Costa

On 3/31/20 7:10 AM, Pekka Enberg wrote:

> Hi Vlad,
>
> (Thanks for filling in many of the gaps I had in my explanation of how
> the Linux network stack works!)
>
> On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > Therefore we can see that unless the "IRQ CPU" gets saturated its (CPU
> > wise) efficiency of handling egress packets is going to be better (at
>
> You mean "ingress" (incoming) here, right?

Right. Sorry... ;)


> On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > least not worse) than in the case of single HW queue per CPU:
> > * Less context switches on Scylla CPUs.
> > * NAPI stays longer in the polling mode on IRQ CPUs == less HW
> > interrupts.
> >
> > In the one-IRQ-per-CPU config case the only benefit is that the CPU
> > saturation point is harder to achieve.
> > From all other points this configuration seems to be worse.
>
> Yeah, very important point about the "IRQ CPU" configuration
> (dedicated CPUs for the network stack) about transitioning and staying
> in NAPI polling mode, which I indeed missed!
>
> On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > > So request latency is at the mercy of *two* Linux scheduling decisions
> > > at points (2) and (3).
> >
> > There is nothing about "mercy" or "chance" in (2) - skbs are steered to
> > a very well defined CPU(s) decided according to RPS/RFS configuration
> > (we configure both for any configuration).
>
> Let me be more explicit: at the mercy of the Linux scheduler _running_ the
> network RX ksoftirqd thread.
>
> You are of course absolutely correct that the _steering_ decision is
> deterministic. However, this just means the packets are in the right
> queue, but request _latency_ is determined by when that queue is
> processed, and that's where the Linux kernel scheduling decisions come
> in.

Agree.
However, note that these "delays" cut both ways: on the one hand they can increase latency, but on the other hand they may (and most likely do) improve it by increasing the number of packets processed in a single call, thereby decreasing the number of context switches required to process X packets.
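
The batching side of that trade-off is plain arithmetic. A minimal
sketch follows; the batch sizes are illustrative and not measured, and
the function name is made up.

```python
import math


def softirq_runs(packets, batch_size):
    """Processing runs (and hence context switches) needed to drain
    `packets` when each run handles up to `batch_size` packets."""
    return math.ceil(packets / batch_size)


if __name__ == "__main__":
    for batch in (1, 4, 16, 64):
        print(f"batch={batch:2d}: {softirq_runs(1000, batch)} runs "
              f"for 1000 packets")
```

With one packet per run, 1000 packets cost 1000 runs; with 16 packets
per run the same traffic costs 63.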


> On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > > RFS seems to address this issue by
> > > establishing a mapping based on which CPU ran recvmsg() last. However,
> > > it seems to me it's not that cheap from latency point of view because
> > > of the scheduler.
> >
> > AFAICT there isn't a better alternative except for user space networking
> > (a la DPDK + either choosing specific TCP ports in order to ensure a
> > "correct" HW RSS choice or configuring HW filtering rules).
>
> Kernel-bypass is going to be more efficient. One nice trick from MICA
> is to open a port per shard and program the NIC flow controller to
> steer packets to a specific core based on the port.

True. These are the "HW filtering rules" I mentioned above in the DPDK context.
And you are right - allocating a dedicated per-shard HW queue could be better than MQ, but it is unlikely to be better than the SQ/SQ_SPLIT modes.
The main disadvantage I see in co-locating the HW queue IRQ (which I guess is part of your idea) and your application on the same CPU is the context switches caused by ISRs, which would immediately preempt the application.

It will all come down to what is more (or less) efficient: handling a NIC ISR or receiving an IPI.

> This way, a
> shard-aware driver could steer a request to the per-shard port and
> have the request delivered directly to the correct CPU core by the
> NIC. Another approach is to use XDP/eBPF to inspect the CQL requests
> and steer to the correct CPU.

The same here.
We would not need to receive IPIs and would have good buffer locality, but we would also have ISRs and a heavy NAPI part, which unpins device buffers and repopulates the ring, running on the application CPU.


> On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > The main win of SQ_SPLIT mode is that Scylla CPUs don't have NIC HW
> > interrupts and forceful context switches that accompany them.
> > The second big win is that NET_RX SoftIRQs that are going to run on
> > Scylla CPUs in SQ/SQ_SPLIT modes are going to process packets that are
> > later going to be consumed by the thread on the same CPU which is
> > supposed to yield better cache coherency.
> >
> > Our observations show that in the MQ mode NAPI is rarely going to be in
> > the polling mode hence the context switching penalty is going to be the
> > highest.
> > Plus (although not formally proved) I expect the CPU cache efficiency to
> > be worse due to reasons I mentioned above.
>
> Just to clarify that I understand what you are saying here: SQ_SPLIT
> == "IRQ CPU" == dedicated CPU cores for networking stack, correct?

Sure, let me clarify.

SQ: A single dedicated HT for handling NIC IRQs.
SQ_SPLIT: A single dedicated core (usually 2 HTs) for handling NIC IRQs.

For SQ and SQ_SPLIT modes Scylla shards are not running on IRQ CPUs.

MQ: IRQs are evenly distributed and pinned among all present CPUs and are co-located with Scylla shards.
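
The three modes can be sketched as a split of the CPU set. This is a
minimal illustration and not perftune.py's code: the topology format,
the function name and the choice of the first core as the IRQ core are
all assumptions made for the example.

```python
def split_cpus(mode, cores):
    """Return (irq_cpus, shard_cpus) for a topology given as a list of
    cores, each core being a list of its hyperthread (PU) ids."""
    all_cpus = [pu for core in cores for pu in core]
    if mode == "mq":
        # IRQs are spread over all CPUs and share them with the shards.
        return all_cpus, all_cpus
    if mode == "sq":
        irq_cpus = [cores[0][0]]   # one dedicated hyperthread
    elif mode == "sq_split":
        irq_cpus = list(cores[0])  # one dedicated core, all its HTs
    else:
        raise ValueError(f"unknown mode: {mode}")
    # In SQ and SQ_SPLIT the shards run everywhere except the IRQ CPUs.
    shard_cpus = [pu for pu in all_cpus if pu not in irq_cpus]
    return irq_cpus, shard_cpus


if __name__ == "__main__":
    topology = [[0, 4], [1, 5], [2, 6], [3, 7]]  # 4 cores x 2 HTs
    for mode in ("mq", "sq", "sq_split"):
        print(mode, split_cpus(mode, topology))
```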


> - Pekka

Pekka Enberg

<penberg@scylladb.com>

Mar 31, 2020, 10:17:58 AM

to Vladislav Zolotarov, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Glauber Costa

On Tue, Mar 31, 2020 at 5:08 PM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> Sure, let me clarify.
>
> SQ: A single dedicated HT for handling NIC IRQs.
> SQ_SPLIT: A single dedicated core (usually 2 HTs) for handling NIC IRQs.
>
> For SQ and SQ_SPLIT modes Scylla shards are not running on IRQ CPUs.
>
> MQ: IRQs are evenly distributed and pinned among all present CPUs and are co-located with Scylla shards.

Thanks for the clarification!

Btw, the "SQ" scenario sounds a bit odd to be honest. You'd still get
some interference from sharing physical core resources between the
IRQ-servicing logical core and the application thread logical core.

- Pekka

Vladislav Zolotarov

<vladz@scylladb.com>

Mar 31, 2020, 11:22:13 AM

to Pekka Enberg, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Glauber Costa

You are absolutely right.
We only use it on very small boxes where we don't want to sacrifice too much Scylla CPU firepower by cutting off 2 CPUs.
The heuristic says that we should always cut less than 25% of the total CPUs:

        if num_PUs <= 4:
            return PerfTunerBase.SupportedModes.mq

        elif num_cores <= 4:
            return PerfTunerBase.SupportedModes.sq
        else:
            return PerfTunerBase.SupportedModes.sq_split

PUs == HTs
cores == full CPU cores


> - Pekka

Pekka Enberg

<penberg@scylladb.com>

Mar 31, 2020, 12:09:35 PM

to Vladislav Zolotarov, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Glauber Costa

Yeah, makes sense, thanks for the explanation Vlad!

- Pekka

Vladislav Zolotarov

<vladz@scylladb.com>

Jun 15, 2020, 9:14:17 AM

to Pekka Enberg, Avi Kivity, Eliran Sinvani, Roy Dahan, seastar-dev, Eyal Gutkind, Guy Carmin

Ping.
Guys, I was sure it's been merged already!.. :D

We are hitting the long-standing
https://github.com/scylladb/scylla-enterprise/issues/1250 at Grab.
Please consider merging this patch and backporting it to 2019.1.

Thanks.

>
> - Pekka

Pekka Enberg

<penberg@scylladb.com>

Jun 15, 2020, 9:53:26 AM

to Vladislav Zolotarov, Avi Kivity, Eliran Sinvani, Roy Dahan, Eyal Gutkind, Guy Carmin, seastar-dev

On Mon, Jun 15, 2020 at 4:14 PM Vladislav Zolotarov <vl...@scylladb.com> wrote:

> Ping.
> Guys, I was sure it's been merged already!.. :D

I am fine with this patch:

Reviewed-by: Pekka Enberg <pen...@scylladb.com>

But it's Avi's call because he wanted more analysis, I think.

- Pekka

Vladislav Zolotarov

<vladz@scylladb.com>

Jun 15, 2020, 11:42:26 AM

to Pekka Enberg, Avi Kivity, Eliran Sinvani, Roy Dahan, Eyal Gutkind, Guy Carmin, seastar-dev