no rq_affinity:
09:23:31 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:23:36 AM all 9.65 0.00 41.75 23.60 0.00 24.98 0.00 0.00 0.03
09:23:36 AM 0 13.40 0.00 59.60 27.00 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 1 14.00 0.00 58.80 27.20 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 2 13.20 0.00 57.40 29.40 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 3 12.40 0.00 57.00 30.60 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 4 12.60 0.00 52.80 34.60 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 5 11.62 0.00 48.30 40.08 0.00 0.00 0.00 0.00 0.00
09:23:36 AM 6 0.00 0.00 0.20 0.00 0.00 99.80 0.00 0.00 0.00
09:23:36 AM 7 0.00 0.00 0.00 0.00 0.00 99.80 0.00 0.00 0.20
with rq_affinity:
09:25:04 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:25:09 AM all 9.50 0.00 42.32 23.19 0.00 24.99 0.00 0.00 0.00
09:25:09 AM 0 13.80 0.00 61.60 24.60 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 1 13.03 0.00 60.32 26.65 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 2 12.83 0.00 58.52 28.66 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 3 12.20 0.00 56.60 31.20 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 4 12.20 0.00 52.40 35.40 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 5 11.78 0.00 49.30 38.92 0.00 0.00 0.00 0.00 0.00
09:25:09 AM 6 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
09:25:09 AM 7 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
with soft irq steering:
09:31:57 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
09:32:02 AM all 12.73 0.00 46.82 1.63 8.03 28.59 0.00 0.00 2.20
09:32:02 AM 0 16.20 0.00 55.00 3.20 10.20 15.40 0.00 0.00 0.00
09:32:02 AM 1 15.60 0.00 57.60 0.00 10.00 16.80 0.00 0.00 0.00
09:32:02 AM 2 16.03 0.00 56.91 0.20 10.62 16.23 0.00 0.00 0.00
09:32:02 AM 3 15.77 0.00 58.48 0.20 10.18 15.17 0.00 0.00 0.20
09:32:02 AM 4 16.17 0.00 56.09 0.00 10.18 17.56 0.00 0.00 0.00
09:32:02 AM 5 16.00 0.00 56.60 0.20 10.60 16.60 0.00 0.00 0.00
09:32:02 AM 6 3.41 0.00 18.64 3.81 0.80 60.52 0.00 0.00 12.83
09:32:02 AM 7 2.79 0.00 14.97 5.79 1.40 70.26 0.00 0.00 4.79
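For reference, the rq_affinity knob toggled in the runs above maps to the
queue's QUEUE_FLAG_SAME_COMP flag, which is what __blk_complete_request()
consults when picking a completion CPU. From memory (not a verbatim quote),
writing to /sys/block/<dev>/queue/rq_affinity boils down to roughly:

	spin_lock_irq(q->queue_lock);
	if (val)
		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
	else
		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
	spin_unlock_irq(q->queue_lock);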
It's probably the grouping; we need to do something about that. Does the
patch below make it behave as you expect?
diff --git a/block/blk.h b/block/blk.h
index d658628..17d53d8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -157,6 +157,7 @@ static inline int queue_congestion_off_threshold(struct request_queue *q)
 static inline int blk_cpu_to_group(int cpu)
 {
+#if 0
 	int group = NR_CPUS;
 #ifdef CONFIG_SCHED_MC
 	const struct cpumask *mask = cpu_coregroup_mask(cpu);
@@ -168,6 +169,7 @@ static inline int blk_cpu_to_group(int cpu)
 #endif
 	if (likely(group < NR_CPUS))
 		return group;
+#endif
 	return cpu;
 }
--
Jens Axboe
Yep that is it.
02:14:12 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:14:17 PM all 11.98 0.00 46.62 1.18 0.00 37.79 0.00 0.00 2.43
02:14:17 PM 0 15.43 0.00 55.31 0.00 0.00 29.26 0.00 0.00 0.00
02:14:17 PM 1 14.83 0.00 56.71 0.00 0.00 28.46 0.00 0.00 0.00
02:14:17 PM 2 14.80 0.00 56.00 0.00 0.00 29.20 0.00 0.00 0.00
02:14:17 PM 3 14.63 0.00 57.11 0.00 0.00 28.26 0.00 0.00 0.00
02:14:17 PM 4 14.80 0.00 57.60 0.00 0.00 27.60 0.00 0.00 0.00
02:14:17 PM 5 15.03 0.00 56.11 0.00 0.00 28.86 0.00 0.00 0.00
02:14:17 PM 6 3.79 0.00 20.16 5.99 0.00 59.68 0.00 0.00 10.38
02:14:17 PM 7 2.80 0.00 14.20 3.20 0.00 70.80 0.00 0.00 9.00
"something", absolutely. But there is benefit from doing some aggregation
(we tried disabling it entirely with the "well-known OLTP benchmark" and
performance went down).
Ideally we'd do something like "if the softirq is taking up more than 10%
of a core, split the grouping". Do we have enough stats to do that kind
of monitoring?
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
Yep, that's why the current solution is somewhat middle of the road...
> Ideally we'd do something like "if the softirq is taking up more than 10%
> of a core, split the grouping". Do we have enough stats to do that kind
> of monitoring?
I don't think we have those stats, though they could/should be pulled from
the ksoftirqd threads. We could have some metric, a la

	dest_cpu = get_group_completion_cpu(rq->cpu);
	if (ksoftirqd_of(dest_cpu) >= 90% busy)
		dest_cpu = rq->cpu;

to send things completely local to the submitter of the IO, iff the
group completion CPU is close to running at full tilt.
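Something along those lines could lean on the per-cpu ksoftirqd task pointer
that already exists. A rough, untested sketch -- is_ksoftirqd_busy() is made
up, and it substitutes "currently runnable" for a real 90%-busy measurement:

DECLARE_PER_CPU(struct task_struct *, ksoftirqd);

/*
 * Treat the group CPU's ksoftirqd being runnable as "that CPU is already
 * saturated with softirq work" and fall back to completing on the
 * submitting CPU instead.
 */
static bool is_ksoftirqd_busy(int cpu)
{
	struct task_struct *tsk = per_cpu(ksoftirqd, cpu);

	return tsk && tsk->state == TASK_RUNNING;
}

/* and at the completion-steering site, roughly: */
	dest_cpu = blk_cpu_to_group(rq->cpu);
	if (is_ksoftirqd_busy(dest_cpu))
		dest_cpu = rq->cpu;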
--
Jens Axboe
What platform was your "OLTP benchmark" on? It seems that as the
number of cores per package goes up, this grouping becomes too coarse,
since almost everyone will have CONFIG_SCHED_MC set. In the code:
static inline int blk_cpu_to_group(int cpu)
{
	int group = NR_CPUS;
#ifdef CONFIG_SCHED_MC
	const struct cpumask *mask = cpu_coregroup_mask(cpu);
	group = cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
	group = cpumask_first(topology_thread_cpumask(cpu));
#else
	return cpu;
#endif
	if (likely(group < NR_CPUS))
		return group;
	return cpu;
}
and so we use cpumask_first(cpu_coregroup_mask(cpu)). And from the x86 topology code:
const struct cpumask *cpu_coregroup_mask(int cpu)
{
	struct cpuinfo_x86 *c = &cpu_data(cpu);
	/*
	 * For perf, we return last level cache shared map.
	 * And for power savings, we return cpu_core_map
	 */
	if ((sched_mc_power_savings || sched_smt_power_savings) &&
	    !(cpu_has(c, X86_FEATURE_AMD_DCM)))
		return cpu_core_mask(cpu);
	else
		return cpu_llc_shared_mask(cpu);
}
in the "max performance" case, we use cpu_llc_shared_mask().
The problem as we've seen it is that on a dual-socket Westmere (Xeon
56xx) system, we have two sockets with 6 cores (12 threads) each, all
sharing L3 cache, and so we end up with all block softirqs on only 2
out of 24 threads, which is not enough to handle all the IOPS that
fast storage can provide.
It's not clear to me what the right answer or tradeoffs are here. It
might make sense to use only one hyperthread per core for block
softirqs. As I understand the Westmere cache topology, there's not
really an obvious intermediate step -- all the cores in a package
share the L3, and then each core has its own L2.
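For what it's worth, a "one completion CPU per physical core" policy would be
easy to express with the existing topology masks. An untested sketch (the
function name is made up, and this hasn't been measured):

/*
 * Hypothetical alternative to blk_cpu_to_group(): group completions by SMT
 * siblings only, so the first thread of each physical core handles that
 * core's block softirqs instead of funnelling a whole LLC domain onto one
 * thread.
 */
static inline int blk_cpu_to_core_group(int cpu)
{
#ifdef CONFIG_SCHED_SMT
	int group = cpumask_first(topology_thread_cpumask(cpu));

	if (likely(group < NR_CPUS))
		return group;
#endif
	return cpu;
}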
Limiting softirqs to 10% of a core seems a bit low, since we seem to
be able to use more than 100% of a core handling block softirqs, and
anyway magic numbers like that always seem to end up wrong for some workload.
Perhaps we could use the queue length on the destination CPU as a
proxy for how busy ksoftirq is?
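One cheap approximation would be a per-cpu counter next to the existing
blk_cpu_done list. A sketch -- blk_done_pending and the threshold are invented
for illustration; the counter would be bumped where requests are queued to
blk_cpu_done and dropped in blk_done_softirq():

/* completions already waiting on a given CPU's block softirq */
static DEFINE_PER_CPU(unsigned int, blk_done_pending);

/* heuristic: is the destination CPU too backed up to take remote completions? */
static bool blk_dest_cpu_backed_up(int cpu)
{
	return per_cpu(blk_done_pending, cpu) > 128;	/* arbitrary threshold */
}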
- R.
This is likely too aggressive (untested / need to confirm it resolves
the isci issue), but it's at least straightforward to determine, and I
wonder if it prevents the regression Matthew is seeing. It assumes that
once we have naturally spilled from the irq return path to ksoftirqd,
this cpu is having trouble keeping up with the load.
??
diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..9c7ba87 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c216..720918f 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -101,17 +101,21 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 	.notifier_call	= blk_cpu_notify,
 };
+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_ccpu, group_cpu;
 	struct request_queue *q = req->q;
+	struct task_struct *tsk;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 	BUG_ON(!q->softirq_done_fn);
 	local_irq_save(flags);
 	cpu = smp_processor_id();
 	group_cpu = blk_cpu_to_group(cpu);
+	tsk = per_cpu(ksoftirqd, cpu);
 	/*
	 * Select completion CPU
@@ -120,8 +124,15 @@ void __blk_complete_request(struct request *req)
 		ccpu = req->cpu;
 	else
 		ccpu = cpu;
+	group_ccpu = blk_cpu_to_group(ccpu);
-	if (ccpu == cpu || ccpu == group_cpu) {
+	/*
+	 * try to skip a remote softirq-trigger if the completion is
+	 * within the same group, but not if local softirqs have already
+	 * spilled to ksoftirqd
+	 */
+	if (ccpu == cpu ||
+	    (group_ccpu == group_cpu && tsk->state != TASK_RUNNING)) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);
> The problem as we've seen it is that on a dual-socket Westmere (Xeon
> 56xx) system, we have two sockets with 6 cores (12 threads) each, all
> sharing L3 cache, and so we end up with all block softirqs on only 2
> out of 24 threads, which is not enough to handle all the IOPS that
> fast storage can provide.
I have a dual socket system with Tylersburg chipset (approximately
Westmere I gather).
With two Xeon X5660 packages I get this when running with more iops
potential than the system can handle:
02:15:00 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
02:15:02 PM all 2.76 0.00 30.40 28.28 0.00 13.74 0.00 0.00 24.81
02:15:02 PM 0 0.00 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00
02:15:02 PM 1 0.00 0.00 0.50 0.00 0.00 0.00 0.00 0.00 99.50
02:15:02 PM 2 3.02 0.00 36.68 52.26 0.00 8.04 0.00 0.00 0.00
02:15:02 PM 3 2.50 0.00 36.00 54.50 0.00 7.00 0.00 0.00 0.00
02:15:02 PM 4 5.47 0.00 64.18 18.91 0.00 11.44 0.00 0.00 0.00
02:15:02 PM 5 3.02 0.00 37.69 53.27 0.00 6.03 0.00 0.00 0.00
02:15:02 PM 6 0.00 0.00 0.50 0.00 0.00 91.54 0.00 0.00 7.96
02:15:02 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:15:02 PM 8 3.00 0.00 35.50 55.00 0.00 6.50 0.00 0.00 0.00
02:15:02 PM 9 3.02 0.00 39.70 50.25 0.00 7.04 0.00 0.00 0.00
02:15:02 PM 10 3.50 0.00 36.50 53.00 0.00 7.00 0.00 0.00 0.00
02:15:02 PM 11 6.53 0.00 70.85 9.05 0.00 13.57 0.00 0.00 0.00
02:15:02 PM 12 0.00 0.00 0.57 0.00 0.00 0.00 0.00 0.00 99.43
02:15:02 PM 13 3.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 97.00
02:15:02 PM 14 2.50 0.00 36.50 54.00 0.00 7.00 0.00 0.00 0.00
02:15:02 PM 15 3.52 0.00 36.18 53.77 0.00 6.53 0.00 0.00 0.00
02:15:02 PM 16 5.00 0.00 64.00 21.00 0.00 10.00 0.00 0.00 0.00
02:15:02 PM 17 3.02 0.00 37.19 52.76 0.00 7.04 0.00 0.00 0.00
02:15:02 PM 18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
02:15:02 PM 19 0.00 0.00 1.01 0.00 0.00 0.00 0.00 0.00 98.99
02:15:02 PM 20 3.48 0.00 38.31 52.24 0.00 5.97 0.00 0.00 0.00
02:15:02 PM 21 5.50 0.00 63.00 18.50 0.00 13.00 0.00 0.00 0.00
02:15:02 PM 22 2.50 0.00 35.00 54.50 0.00 8.00 0.00 0.00 0.00
02:15:02 PM 23 5.03 0.00 58.79 23.62 0.00 12.56 0.00 0.00 0.00
By "more IOPS potential than the system can handle", I mean that with
about a quarter of the targets I get the same figure. The HBA is
known to handle more than twice the IOPS I'm seeing.
I'm using 16 targets, with fio driving one target from each core you
see sys activity on. You can see that two additional cores are
getting weighed down -- 0 and 6. Is that indicative of the
bottleneck?
These results are without using any of the patches suggested in this
e-mail thread. I'll have to try and see if they help.
What is the top number of IOPS I should hope for with this system and
the Linux kernel?
Dave Jiang (or anyone else) -- can you share the max IOPS that you are seeing?
Driving more IOPS on the same system looks like this for me:
CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
all 2.85 0.00 31.37 12.05 0.00 14.84 0.00 0.00 38.90
0 2.44 0.00 0.00 0.00 0.00 4.39 0.00 0.00 93.17
1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
2 1.51 0.00 23.12 70.85 0.00 4.52 0.00 0.00 0.00
3 5.05 0.00 51.01 19.70 0.00 24.24 0.00 0.00 0.00
4 5.47 0.00 62.19 1.00 0.00 31.34 0.00 0.00 0.00
5 4.00 0.00 50.00 22.50 0.00 23.50 0.00 0.00 0.00
6 0.00 0.00 0.00 0.00 0.00 0.47 0.00 0.00 99.53
7 0.00 0.00 0.22 0.00 0.00 0.00 0.00 0.00 99.78
8 4.48 0.00 53.23 16.92 0.00 25.37 0.00 0.00 0.00
9 4.48 0.00 50.25 19.40 0.00 25.87 0.00 0.00 0.00
10 5.53 0.00 63.82 0.50 0.00 30.15 0.00 0.00 0.00
11 3.50 0.00 52.00 20.50 0.00 24.00 0.00 0.00 0.00
12 0.50 0.00 1.00 1.49 0.00 0.00 0.00 0.00 97.01
13 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
14 3.50 0.00 43.50 35.50 0.00 17.50 0.00 0.00 0.00
15 4.02 0.00 51.26 20.60 0.00 24.12 0.00 0.00 0.00
16 6.03 0.00 57.29 8.54 0.00 28.14 0.00 0.00 0.00
17 4.50 0.00 49.00 25.00 0.00 21.50 0.00 0.00 0.00
18 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
19 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
20 4.98 0.00 57.21 11.44 0.00 26.37 0.00 0.00 0.00
21 4.50 0.00 54.00 16.00 0.00 25.50 0.00 0.00 0.00
22 5.50 0.00 58.00 7.00 0.00 29.50 0.00 0.00 0.00
23 4.00 0.00 49.50 22.50 0.00 24.00 0.00 0.00 0.00
I'm happy to have the performance improvement, but I would like to
know how I could do much better. The storage hardware is all capable
of about twice the IOPS I'm getting now.
I see that "sys" is eating most of the CPU time at this point. What
do I need to fix? Is fio too heavy in implementation? ... or is this
a scsi midlayer bottleneck?
I would be happy to get advice on what I should do to better
illuminate the bottleneck.
What HBA do you use? Does it already have a lockless ->queuecommand?
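"Lockless" here meaning a ->queuecommand that runs without taking the
Scsi_Host host_lock. Drivers that were not converted after the queuecommand
interface change are typically wrapped with the DEF_SCSI_QCMD() macro, which
still serializes every submission under host_lock. A from-memory paraphrase,
with hypothetical mydrv_* names:

/*
 * Roughly what DEF_SCSI_QCMD() generates for an old-style, host_lock
 * serialized driver (paraphrased, not verbatim):
 */
static int mydrv_queuecommand_locked(struct Scsi_Host *shost,
				     struct scsi_cmnd *cmd)
{
	unsigned long irq_flags;
	int rc;

	spin_lock_irqsave(shost->host_lock, irq_flags);
	rc = mydrv_queuecommand_lck(cmd, cmd->scsi_done);
	spin_unlock_irqrestore(shost->host_lock, irq_flags);
	return rc;
}

/*
 * A "lockless" driver instead implements ->queuecommand directly, with no
 * host-wide lock in the submission path:
 */
static int mydrv_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
{
	return mydrv_issue_to_hw(shost, cmd);	/* hypothetical helper */
}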