On 3/24/20 9:53 AM, Eliran Sinvani wrote:
> The theory suggested that when we have enough rx queues it will be
> better to spread the IRQs across all of the CPUs, which means Seastar's
> shards will share them with the rx queue interrupt processing, instead
> of cutting some CPUs only for the sake of the queue interrupt processing.
> However, in practice, experience shows that this approach is bad for our
> high percentile latencies.
> This commit changes the default CPU distribution mode to only take into
> account the box size (CPUs and hyperthreads) and make the decision
> solely based on this.
> The new rule of thumb is derived from the assumption that whenever we
> can afford to spare a CPU/hyperthread for interrupt processing alone
> we should prefer to do so, even if the load can be nicely and
> symmetrically split between the units, since the context switches impact
> our high percentile latencies.
Is there any understanding why the previous rule was wrong? Any
measurements?
On Wed, Mar 25, 2020 at 10:01 AM Avi Kivity <a...@scylladb.com> wrote:
Is there any understanding why the previous rule was wrong? Any
measurements?
There are no measurements, only field experience from @Glauber Costa and @Vladislav Zolotarov.
As for understanding, there is an explanation, if that's what you mean: having interrupt context switches on every shard increases latency for the operations waiting in line while the ISR is executing.
Having fewer cores to process work, and cross-core communications also increases latency.
In the extreme case, if the networking code becomes the
bottleneck, throughput is severely reduced.
I will ask @Roy Dahan how to run a latency measurement test on i3en.
Another point to consider: even if this configuration is fine, why isn't it also correct when, for example, rx_queues_count == (num_PUs + 1), or in a symmetric case like rx_queues_count == (2*num_PUs)?
So IMHO, either way, something is not right with this condition as it stands today :)
Maybe, but if you want to make a change here you need to come up
with a better justification than "derived from the assumption".
I'd like to see some analysis. That is, not just measurements of the throughput numbers, but some proof of what the problem really is.
For example, is it caused by the kernel softirq thread
interfering with scheduling? Or are the connections imbalanced
across cores so that some cores have more work than others?
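One way to start on the imbalance question is to look at how NET_RX softirq work is spread across CPUs. The following is only a minimal sketch: it parses text in the layout of /proc/softirqs (a header row of CPU names, then one row per softirq type), and the helper names net_rx_per_cpu and imbalance are hypothetical, not part of perftune.py or Seastar.

```python
def net_rx_per_cpu(softirqs_text):
    """Return {cpu_name: NET_RX count} parsed from /proc/softirqs-style text."""
    lines = softirqs_text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    for line in lines[1:]:
        name, _, counts = line.partition(":")
        if name.strip() == "NET_RX":
            values = [int(v) for v in counts.split()]
            return dict(zip(cpus, values))
    raise ValueError("no NET_RX row found")


def imbalance(counts):
    """Busiest CPU's count divided by a perfectly even share (1.0 == balanced)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return max(counts.values()) * len(counts) / total
```

The counters are cumulative since boot, so for a load test it makes more sense to read the file before and after the run and compare the differences rather than the raw values.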
Hi Vlad,
(Thanks for filling in many of the gaps I had in my explanation of how the Linux network stack works!)
On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> Therefore we can see that unless the "IRQ CPU" gets saturated its (CPU wise)
> efficiency of handling egress packets is going to be better (at
You mean "ingress" (incoming) here, right?
On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> least not worse) than in the case of single HW queue per CPU:
> * Less context switches on Scylla CPUs.
> * NAPI stays longer in the polling mode on IRQ CPUs == less HW interrupts.
> In the one-IRQ-per-CPU config case the only benefit is that the CPU
> saturation point is harder to achieve. From all other points this
> configuration seems to be worse.
Yeah, very important point about the "IRQ CPU" configuration (dedicated CPUs for the network stack) about transitioning and staying in NAPI polling mode, which I indeed missed!
On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > So request latency is at the mercy of *two* Linux scheduling decisions at
> > points (2) and (3).
> There is nothing about "mercy" or "chance" in (2) - skbs are steered to a
> very well defined CPU(s) decided according to RPS/RFS configuration (we
> configure both for any configuration).
Let me be more explicit: at the mercy of Linux scheduler _running_ the network RX ksoftirq thread. You are of course absolutely correct that the _steering_ decision is deterministic. However, this just means the packets are in the right queue, but request _latency_ is determined by when that queue is processed, and that's where the Linux kernel scheduling decisions come in.
On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> > RFS seems to address this issue by establishing a mapping based on which
> > CPU ran recvmsg() last. However, it seems to me it's not that cheap from
> > a latency point of view because of the scheduler.
> AFAICT there isn't a better alternative except for user space networking
> (a la DPDK + either choosing specific TCP ports in order to ensure a
> "correct" HW RSS choice or configuring HW filtering rules).
Kernel-bypass is going to be more efficient. One nice trick from MICA is to open a port per shard and program the NIC flow controller to steer packets to a specific core based on the port.
This way, a shard-aware driver could steer a request to the per-shard port and have the request delivered directly to the correct CPU core by the NIC. Another approach is to use XDP/eBPF to inspect the CQL requests and steer to the correct CPU.
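To make the port-per-shard idea concrete, here is a rough sketch of the mapping a shard-aware driver and the NIC flow rules would have to agree on. Everything in it is an assumption for illustration: the base port 20000, the contiguous port range, and the function names are hypothetical and do not describe an existing Scylla or driver feature.

```python
def port_for_shard(base_port, shard, num_shards):
    """Client side: pick the listening port that belongs to a given shard."""
    if not 0 <= shard < num_shards:
        raise ValueError("shard out of range")
    return base_port + shard


def shard_for_port(base_port, port, num_shards):
    """Flow-rule side: recover the shard (and so the target core) from the port."""
    shard = port - base_port
    if not 0 <= shard < num_shards:
        raise ValueError("port is not a per-shard port")
    return shard


# Hypothetical layout: 8 shards listening on ports 20000..20007.
BASE_PORT = 20000
NUM_SHARDS = 8
port = port_for_shard(BASE_PORT, 3, NUM_SHARDS)
core = shard_for_port(BASE_PORT, port, NUM_SHARDS)
```

The point of the trick is that the destination port alone determines the core, so no scheduler decision or cross-core hop sits between the NIC and the shard that owns the request.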
On Tue, Mar 31, 2020 at 4:07 AM Vladislav Zolotarov <vl...@scylladb.com> wrote:
> The main win of SQ_SPLIT mode is that Scylla CPUs don't have NIC HW
> interrupts and forceful context switches that accompany them. The second
> big win is that NET_RX SoftIRQs that are going to run on Scylla CPUs in
> SQ/SQ_SPLIT modes are going to process packets that are later going to be
> consumed by the thread on the same CPU which is supposed to yield better
> cache coherency. Our observations show that in the MQ mode NAPI is rarely
> going to be in the polling mode hence the context switching penalty is
> going to be the highest. Plus (although not formally proved) I expect the
> CPU cache efficiency to be worse due to reasons I mentioned above.
Just to clarify that I understand what you are saying here: SQ_SPLIT == "IRQ CPU" == dedicated CPU cores for networking stack, correct?
- Pekka
    if num_PUs <= 4:
        return PerfTunerBase.SupportedModes.mq
    elif num_cores <= 4:
        return PerfTunerBase.SupportedModes.sq
    else:
        return PerfTunerBase.SupportedModes.sq_split
- Pekka
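For anyone who wants to try the rule quoted above against a few box sizes, it can be wrapped in a small self-contained function. This is a sketch only: the Mode enum and default_mode are stand-ins for PerfTunerBase.SupportedModes and the surrounding perftune.py code, and it assumes num_pus counts hardware threads while num_cores counts physical cores.

```python
from enum import Enum


class Mode(Enum):
    # Stand-in for PerfTunerBase.SupportedModes (hypothetical, for illustration).
    MQ = "mq"
    SQ = "sq"
    SQ_SPLIT = "sq_split"


def default_mode(num_pus, num_cores):
    """Pick the default mode from the box size alone, as in the quoted rule."""
    if num_pus <= 4:
        return Mode.MQ        # too small to spare a CPU for IRQs
    elif num_cores <= 4:
        return Mode.SQ
    else:
        return Mode.SQ_SPLIT  # enough cores to dedicate some to IRQ handling


# Example box sizes (hardware threads, physical cores).
for pus, cores in [(4, 2), (8, 4), (16, 8)]:
    print(pus, cores, default_mode(pus, cores).value)
```

Note that the number of rx queues does not appear anywhere in the decision, which is exactly the change under discussion in this thread.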
Ping.
Guys, I was sure it's been merged already!.. :D
- Pekka