rq_affinity doesn't seem to work?

Jiang, Dave

unread,

Jul 12, 2011, 3:10:02 PM7/12/11

to

Jens,
I'm doing some performance tuning for the Intel isci SAS controller driver, and I noticed some interesting numbers with mpstat. Looking at the numbers it seems that rq_affinity is not moving the request completion to the request submission CPU. Using fio to saturate the system with 512B I/Os, I noticed that all I/Os are bound to the CPUs (CPUs 6 and 7) that service the hard irqs. I have put in a quick hack in the driver so that it records the CPU during request construction and then I try to steer the scsi->done() calls to the request CPUs. With this simple hack, mpstat shows that the soft irq contexts are now distributed. I observed significant performance increase. The iowait% gone from 30s and 40s to low single digit approaching 0. Any ideas what could be happening with the rq_affinity logic? I'm assuming rq_affinity should behave the way my hacked solution is behaving. This is running on an 8 core single CPU SandyBridge based system with hyper-threading turned off. The two MSIX interrupts on the controller are tied to CPU 6 and 7 respectively via /proc/irq/X/smp_affinity. I'm running fio with 8 SAS disks and 8 threads.

no rq_affinity:
09:23:31 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:23:36 AM all 9.65 0.00 41.75 23.60 0.00 24.98 0.00 0.00 0.03
09:23:36 AM 0 13.40 0.00 59.60 27.00 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 1 14.00 0.00 58.80 27.20 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 2 13.20 0.00 57.40 29.40 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 3 12.40 0.00 57.00 30.60 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 4 12.60 0.00 52.80 34.60 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 5 11.62 0.00 48.30 40.08 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 6 0.00 0.00 0.20 0.00 0.00 99.80 0.00 0.00 0.00
09:23:36 AM 7 0.00 0.00 0.00 0.00 0.00 99.80 0.00 0.00 0.20

with rq_affinity:
09:25:04 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:25:09 AM all 9.50 0.00 42.32 23.19 0.00 24.99 0.00 0.00 0.00
09:25:09 AM 0 13.80 0.00 61.60 24.60 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 1 13.03 0.00 60.32 26.65 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 2 12.83 0.00 58.52 28.66 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 3 12.20 0.00 56.60 31.20 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 4 12.20 0.00 52.40 35.40 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 5 11.78 0.00 49.30 38.92 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 6 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
09:25:09 AM 7 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00

with soft irq steering:
09:31:57 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:32:02 AM all 12.73 0.00 46.82 1.63 8.03 28.59 0.00 0.00 2.20
09:32:02 AM 0 16.20 0.00 55.00 3.20 10.20 15.40 0.00 0.00 0.00
09:32:02 AM 1 15.60 0.00 57.60 0.00 10.00 16.80 0.00 0.00 0.00
09:32:02 AM 2 16.03 0.00 56.91 0.20 10.62 16.23 0.00 0.00 0.00
09:32:02 AM 3 15.77 0.00 58.48 0.20 10.18 15.17 0.00 0.00 0.20
09:32:02 AM 4 16.17 0.00 56.09 0.00 10.18 17.56 0.00 0.00 0.00
09:32:02 AM 5 16.00 0.00 56.60 0.20 10.60 16.60 0.00 0.00 0.00
09:32:02 AM 6 3.41 0.00 18.64 3.81 0.80 60.52 0.00 0.00 12.83
09:32:02 AM 7 2.79 0.00 14.97 5.79 1.40 70.26 0.00 0.00 4.79
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Jens Axboe

unread,

Jul 12, 2011, 4:40:01 PM7/12/11

to

On 2011-07-12 21:03, Jiang, Dave wrote:
> Jens,
> I'm doing some performance tuning for the Intel isci SAS controller
> driver, and I noticed some interesting numbers with mpstat. Looking at
> the numbers it seems that rq_affinity is not moving the request
> completion to the request submission CPU. Using fio to saturate the
> system with 512B I/Os, I noticed that all I/Os are bound to the CPUs
> (CPUs 6 and 7) that service the hard irqs. I have put in a quick hack
> in the driver so that it records the CPU during request construction
> and then I try to steer the scsi->done() calls to the request CPUs.
> With this simple hack, mpstat shows that the soft irq contexts are now
> distributed. I observed significant performance increase. The iowait%
> gone from 30s and 40s to low single digit approaching 0. Any ideas
> what could be happening with the rq_affinity logic? I'm assuming
> rq_affinity should behave the way my hacked solution is behaving. This
> is running on an 8 core single CPU SandyBridge based system with
> hyper-threading turned off. The two MSIX interrupts on the controller
> are tied to CPU 6 and 7 respectively via /proc/irq/X/smp_affinity. I'm
> running fio with 8 SAS disks and 8 threads.

It's probably the grouping, we need to do something about that. Does the
below patch make it behave as you expect?

diff --git a/block/blk.h b/block/blk.h
index d658628..17d53d8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -157,6 +157,7 @@ static inline int queue_congestion_off_threshold(struct request_queue *q)

static inline int blk_cpu_to_group(int cpu)
{
+#if 0
int group = NR_CPUS;
#ifdef CONFIG_SCHED_MC
const struct cpumask *mask = cpu_coregroup_mask(cpu);
@@ -168,6 +169,7 @@ static inline int blk_cpu_to_group(int cpu)
#endif
if (likely(group < NR_CPUS))
return group;
+#endif
return cpu;
}

--
Jens Axboe

Jiang, Dave

unread,

Jul 12, 2011, 5:20:02 PM7/12/11

to

Yep that is it.

02:14:12 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:14:17 PM all 11.98 0.00 46.62 1.18 0.00 37.79 0.00 0.00 2.43
02:14:17 PM 0 15.43 0.00 55.31 0.00 0.00 29.26 0.00 0.00 0.00
02:14:17 PM 1 14.83 0.00 56.71 0.00 0.00 28.46 0.00 0.00 0.00
02:14:17 PM 2 14.80 0.00 56.00 0.00 0.00 29.20 0.00 0.00 0.00
02:14:17 PM 3 14.63 0.00 57.11 0.00 0.00 28.26 0.00 0.00 0.00
02:14:17 PM 4 14.80 0.00 57.60 0.00 0.00 27.60 0.00 0.00 0.00
02:14:17 PM 5 15.03 0.00 56.11 0.00 0.00 28.86 0.00 0.00 0.00
02:14:17 PM 6 3.79 0.00 20.16 5.99 0.00 59.68 0.00 0.00 10.38
02:14:17 PM 7 2.80 0.00 14.20 3.20 0.00 70.80 0.00 0.00 9.00

Matthew Wilcox

unread,

Jul 13, 2011, 1:20:02 PM7/13/11

to

On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
> It's probably the grouping, we need to do something about that. Does the
> below patch make it behave as you expect?

"something", absolutely. But there is benefit from doing some aggregation
(we tried disabling it entirely with the "well-known OLTP benchmark" and
performance went down).

Ideally we'd do something like "if the softirq is taking up more than 10%
of a core, split the grouping". Do we have enough stats to do that kind
of monitoring?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

Jens Axboe

unread,

Jul 13, 2011, 2:10:02 PM7/13/11

to

On 2011-07-13 19:10, Matthew Wilcox wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does the
>> below patch make it behave as you expect?
>
> "something", absolutely. But there is benefit from doing some aggregation
> (we tried disabling it entirely with the "well-known OLTP benchmark" and
> performance went down).

Yep, that's why the current solution is somewhat middle of the road...

> Ideally we'd do something like "if the softirq is taking up more than 10%
> of a core, split the grouping". Do we have enough stats to do that kind
> of monitoring?

I don't think we have those stats, though it could/should be pulled from
the ksoftirqX threads. We could have some metric, ala

dest_cpu = get_group_completion_cpu(rq->cpu);
if (ksoftirqd_of(dest_cpu) >= 90% busy)
dest_cpu = rq->cpu;

to send things completely local to the submitter of the IO, IFF the
current CPU is close to running at full tilt.

--
Jens Axboe

Roland Dreier

unread,

Jul 14, 2011, 1:10:01 PM7/14/11

to

On Wed, Jul 13, 2011 at 10:10 AM, Matthew Wilcox <mat...@wil.cx> wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does the
>> below patch make it behave as you expect?
>
> "something", absolutely. But there is benefit from doing some aggregation
> (we tried disabling it entirely with the "well-known OLTP benchmark" and
> performance went down).
>
> Ideally we'd do something like "if the softirq is taking up more than 10%
> of a core, split the grouping". Do we have enough stats to do that kind
> of monitoring?

What platform was your "OLTP benchmark" on? It seems that as the
number of cores per package goes up, this grouping becomes too coarse,
since almost everyone will have SCHED_MC set in the code:

static inline int blk_cpu_to_group(int cpu)
{

int group = NR_CPUS;
#ifdef CONFIG_SCHED_MC
const struct cpumask *mask = cpu_coregroup_mask(cpu);

group = cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
group = cpumask_first(topology_thread_cpumask(cpu));
#else
return cpu;

#endif
if (likely(group < NR_CPUS))
return group;

return cpu;
}

and so we use cpumask_first(cpu_coregroup_mask(cpu)). And from

const struct cpumask *cpu_coregroup_mask(int cpu)
{
struct cpuinfo_x86 *c = &cpu_data(cpu);
/*
* For perf, we return last level cache shared map.
* And for power savings, we return cpu_core_map
*/
if ((sched_mc_power_savings || sched_smt_power_savings) &&
!(cpu_has(c, X86_FEATURE_AMD_DCM)))
return cpu_core_mask(cpu);
else
return cpu_llc_shared_mask(cpu);
}

in the "max performance" case, we use cpu_llc_shared_mask().

The problem as we've seen it is that on a dual-socket Westmere (Xeon
56xx) system, we have two sockets with 6 cores (12 threads) each, all
sharing L3 cache, and so we end up with all block softirqs on only 2
out of 24 threads, which is not enough to handle all the IOPS that
fast storage can provide.

It's not clear to me what the right answer or tradeoffs are here. It
might make sense to use only one hyperthread per core for block
softirqs. As I understand the Westmere cache topology, there's not
really an obvious intermediate step -- all the cores in a package
share the L3, and then each core has its own L2.

Limiting softirqs to 10% of a core seems a bit low, since we seem to
be able to use more than 100% of a core handling block softirqs, and
anyway magic numbers like that seem to always be wrong sometimes.
Perhaps we could use the queue length on the destination CPU as a
proxy for how busy ksoftirq is?

- R.

Dan Williams

unread,

Jul 15, 2011, 4:30:02 PM7/15/11

to

This is likely too aggressive (untested / need to confirm it resolves
the isci issue), but it's at least straightforward to determine, and I
wonder if it prevents the regression Matthew is seeing. It assumes that
the once we have naturally spilled from the irq return path to ksoftirqd
that this cpu is having trouble keeping up with the load.

??

diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..9c7ba87 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
init_request_from_bio(req, bio);

if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
- bio_flagged(bio, BIO_CPU_AFFINE)) {
- req->cpu = blk_cpu_to_group(get_cpu());
- put_cpu();
- }
+ bio_flagged(bio, BIO_CPU_AFFINE))
+ req->cpu = smp_processor_id();

plug = current->plug;
if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c216..720918f 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -101,17 +101,21 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
.notifier_call = blk_cpu_notify,
};

+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
void __blk_complete_request(struct request *req)
{
+ int ccpu, cpu, group_ccpu, group_cpu;
struct request_queue *q = req->q;
+ struct task_struct *tsk;
unsigned long flags;
- int ccpu, cpu, group_cpu;

BUG_ON(!q->softirq_done_fn);

local_irq_save(flags);
cpu = smp_processor_id();
group_cpu = blk_cpu_to_group(cpu);
+ tsk = per_cpu(ksoftirqd, cpu);

/*
* Select completion CPU
@@ -120,8 +124,15 @@ void __blk_complete_request(struct request *req)
ccpu = req->cpu;
else
ccpu = cpu;
+ group_ccpu = blk_cpu_to_group(ccpu);

- if (ccpu == cpu || ccpu == group_cpu) {
+ /*
+ * try to skip a remote softirq-trigger if the completion is
+ * within the same group, but not if local softirqs have already
+ * spilled to ksoftirqd
+ */
+ if (ccpu == cpu ||
+ (group_ccpu == group_cpu && tsk->state != TASK_RUNNING)) {
struct list_head *list;
do_local:
list = &__get_cpu_var(blk_cpu_done);

ersatz splatt

unread,

Jul 15, 2011, 7:50:01 PM7/15/11

to

On Thu, Jul 14, 2011 at 10:02 AM, Roland Dreier <rol...@purestorage.com> wrote:

> The problem as we've seen it is that on a dual-socket Westmere (Xeon
> 56xx) system, we have two sockets with 6 cores (12 threads) each, all
> sharing L3 cache, and so we end up with all block softirqs on only 2
> out of 24 threads, which is not enough to handle all the IOPS that
> fast storage can provide.

I have a dual socket system with Tylersburg chipset (approximately
Westmere I gather).
With two Xeon X5660 packages I get this when running with more iops
potential than the system can handle:

02:15:00 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:15:02 PM all 2.76 0.00 30.40 28.28 0.00 13.74
0.00 0.00 24.81
02:15:02 PM 0 0.00 0.00 0.00 0.00 0.00 100.00
0.00 0.00 0.00
02:15:02 PM 1 0.00 0.00 0.50 0.00 0.00 0.00
0.00 0.00 99.50
02:15:02 PM 2 3.02 0.00 36.68 52.26 0.00 8.04
0.00 0.00 0.00
02:15:02 PM 3 2.50 0.00 36.00 54.50 0.00 7.00
0.00 0.00 0.00
02:15:02 PM 4 5.47 0.00 64.18 18.91 0.00 11.44
0.00 0.00 0.00
02:15:02 PM 5 3.02 0.00 37.69 53.27 0.00 6.03
0.00 0.00 0.00
02:15:02 PM 6 0.00 0.00 0.50 0.00 0.00 91.54
0.00 0.00 7.96
02:15:02 PM 7 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 100.00
02:15:02 PM 8 3.00 0.00 35.50 55.00 0.00 6.50
0.00 0.00 0.00
02:15:02 PM 9 3.02 0.00 39.70 50.25 0.00 7.04
0.00 0.00 0.00
02:15:02 PM 10 3.50 0.00 36.50 53.00 0.00 7.00
0.00 0.00 0.00
02:15:02 PM 11 6.53 0.00 70.85 9.05 0.00 13.57
0.00 0.00 0.00
02:15:02 PM 12 0.00 0.00 0.57 0.00 0.00 0.00
0.00 0.00 99.43
02:15:02 PM 13 3.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 97.00
02:15:02 PM 14 2.50 0.00 36.50 54.00 0.00 7.00
0.00 0.00 0.00
02:15:02 PM 15 3.52 0.00 36.18 53.77 0.00 6.53
0.00 0.00 0.00
02:15:02 PM 16 5.00 0.00 64.00 21.00 0.00 10.00
0.00 0.00 0.00
02:15:02 PM 17 3.02 0.00 37.19 52.76 0.00 7.04
0.00 0.00 0.00
02:15:02 PM 18 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 100.00
02:15:02 PM 19 0.00 0.00 1.01 0.00 0.00 0.00
0.00 0.00 98.99
02:15:02 PM 20 3.48 0.00 38.31 52.24 0.00 5.97
0.00 0.00 0.00
02:15:02 PM 21 5.50 0.00 63.00 18.50 0.00 13.00
0.00 0.00 0.00
02:15:02 PM 22 2.50 0.00 35.00 54.50 0.00 8.00
0.00 0.00 0.00
02:15:02 PM 23 5.03 0.00 58.79 23.62 0.00 12.56
0.00 0.00 0.00

By "more IOPS potential than the system can handle", I mean that with
about a quarter of the targets I get the same figure. The HBA is
known to handle more than twice the IOPS I'm seeing.

I'm using 16 targets with fio driving one target with each core you
see sys activity on. You can see that two additional cores are
getting weighed down -- 0 and 6. Is that indicative of the
bottleneck?

These results are without using any of the patches suggested in this
e-mail thread. I'll have to try and see if they help.

What is the top number of IOPS I should hope for with this system and
the Linux kernel?
Dave Jiang (or anyone else) -- can you share the max IOPS that you are seeing?

ersatz splatt

unread,

Jul 15, 2011, 10:20:02 PM7/15/11

to

With the quickest and easiest fix (the first suggestion from Jens
Axboe), I was able to get another 20%+ in IOPS. Thank you.

(Pardon the previous ugly wrap on the data. I'm not sure how to stop
that with my e-mail vendor)

Driving more IOPS on the same system looks like this for me:

CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle

all 2.85 0.00 31.37 12.05 0.00 14.84 0.00 0.00 38.90
0 2.44 0.00 0.00 0.00 0.00 4.39 0.00 0.00 93.17
1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
2 1.51 0.00 23.12 70.85 0.00 4.52 0.00 0.00 0.00
3 5.05 0.00 51.01 19.70 0.00 24.24 0.00 0.00 0.00
4 5.47 0.00 62.19 1.00 0.00 31.34 0.00 0.00 0.00
5 4.00 0.00 50.00 22.50 0.00 23.50 0.00 0.00 0.00
6 0.00 0.00 0.00 0.00 0.00 0.47 0.00 0.00 99.53
7 0.00 0.00 0.22 0.00 0.00 0.00 0.00 0.00 99.78
8 4.48 0.00 53.23 16.92 0.00 25.37 0.00 0.00 0.00
9 4.48 0.00 50.25 19.40 0.00 25.87 0.00 0.00 0.00
10 5.53 0.00 63.82 0.50 0.00 30.15 0.00 0.00 0.00
11 3.50 0.00 52.00 20.50 0.00 24.00 0.00 0.00 0.00
12 0.50 0.00 1.00 1.49 0.00 0.00 0.00 0.00 97.01
13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
14 3.50 0.00 43.50 35.50 0.00 17.50 0.00 0.00 0.00
15 4.02 0.00 51.26 20.60 0.00 24.12 0.00 0.00 0.00
16 6.03 0.00 57.29 8.54 0.00 28.14 0.00 0.00 0.00
17 4.50 0.00 49.00 25.00 0.00 21.50 0.00 0.00 0.00

18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00

19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
20 4.98 0.00 57.21 11.44 0.00 26.37 0.00 0.00 0.00
21 4.50 0.00 54.00 16.00 0.00 25.50 0.00 0.00 0.00
22 5.50 0.00 58.00 7.00 0.00 29.50 0.00 0.00 0.00
23 4.00 0.00 49.50 22.50 0.00 24.00 0.00 0.00 0.00

I'm happy to have the performance improvement, but I would like to
know how I could do much better. The storage hardware is all capable
of about twice the IOPS I'm getting now.

I see that "sys" is eating most of the CPU time at this point. What
do I need to fix? Is fio too heavy in implementation? ... or is this
a scsi midlayer bottleneck?

I would be happy to get advice on what I should do to better
illuminate the bottleneck.

Christoph Hellwig

unread,

Jul 15, 2011, 10:50:01 PM7/15/11

to

On Fri, Jul 15, 2011 at 04:43:44PM -0700, ersatz splatt wrote:
> I have a dual socket system with Tylersburg chipset (approximately
> Westmere I gather).
> With two Xeon X5660 packages I get this when running with more iops
> potential than the system can handle:

What HBA do you use? Does it already have a lockless ->queuecommand?