Any suggestions, prior experience on this is welcome.
Thanks,
M.
--
> Hi All,
> Our video recorder application uses memcpy for every frame - about 2KB
> of data per frame on an Intel® Atom™ Z5xx processor.
> With the default 2.6.35 kernel we got 19.6 fps. But it seems the
> kernel-implemented memcpy is suboptimal, because when we replaced it
> with an optimized one (using SSSE3, exact patches are currently being
> finalized) we obtained 22 fps, a gain of 12.2%.
SSE3 in the kernel memcpy would be incredibly expensive;
it would need a full FPU save for every call and preemption
disabled.
I haven't seen your patches, but until you get all that
right (and add a lot more overhead to most copies) you
currently have a good chance of corrupting user FPU state.
> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.
It depends on the CPU.
There have been some improvements for Atom on newer kernels,
I believe.
But then kernel memcpy is usually optimized for relatively
small copies (<= 4K) because very few kernel loads do more.
-Andi
--
a...@linux.intel.com -- Speaking for myself only
> Hi All,
> Our video recorder application uses memcpy for every frame - about 2KB
> of data per frame on an Intel® Atom™ Z5xx processor.
> With the default 2.6.35 kernel we got 19.6 fps. But it seems the
> kernel-implemented memcpy is suboptimal, because when we replaced it
> with an optimized one (using SSSE3, exact patches are currently being
> finalized) we obtained 22 fps, a gain of 12.2%.
> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.
> 2. Does the default kernel configuration for i386 include the best
> memcpy implementation (AMD 3DNOW, __builtin_memcpy .... etc)
>
> Any suggestions, prior experience on this is welcome.
Sounds very interesting - it would be nice to see 'perf record' +
'perf report' profiles done on that workload, before and after your
patches.
The thing is, we obviously want to achieve those gains of 12.2% fps
and while we probably do not want to switch the kernel's memcpy to
SSE right now (the save/restore costs are significant), we could
certainly try to optimize the specific codepath that your video
playback path is hitting.
If it's some bulk memcpy in a key video driver then we could offer a
bulk-optimized x86 memcpy variant which could be called from that
driver - and that could use SSE3 as well.
So yes, if the speedup is real then i'm sure we can achieve that
speedup - but exact profiles and measurements would have to be shown.
Thanks,
Ingo
FWIW, I've been playing with SSE memcpy version for the kernel recently
too, here's what I have so far:
First of all, I did a trace of all the memcpy buffer sizes used while
building a kernel, see attached kernel_build.sizes.
On the one hand, there is a large number of small chunks copied (1.1M
of the 1.2M calls total) and, on the other, a relatively small number
of larger-sized copies (256 - 2048 bytes), about 100K in total,
which account for the larger cumulative amount of data copied: 138MB
of 175MB total. So, if the buffer copied is big enough, the context
save/restore cost might be something we're willing to pay.
I've implemented the SSE memcpy first in userspace to measure the
speedup vs memcpy_64 we have right now:
Benchmarking with 10000 iterations, average results:
size XM MM speedup
119 540.58 449.491 0.8314969419
189 296.318 263.507 0.8892692985
206 297.949 271.399 0.9108923485
224 255.565 235.38 0.9210161798
221 299.383 276.628 0.9239941159
245 299.806 279.432 0.9320430545
369 314.774 316.89 1.006721324
425 327.536 330.475 1.00897153
439 330.847 334.532 1.01113687
458 333.159 340.124 1.020904708
503 334.44 352.166 1.053003229
767 375.612 429.949 1.144661625
870 358.888 312.572 0.8709465025
882 394.297 454.977 1.153893229
925 403.82 472.56 1.170222413
1009 407.147 490.171 1.203915735
1525 512.059 660.133 1.289174911
1737 556.85 725.552 1.302958536
1778 533.839 711.59 1.332965994
1864 558.06 745.317 1.335549882
2039 585.915 813.806 1.388949687
3068 766.462 1105.56 1.442422252
3471 883.983 1239.99 1.40272883
3570 895.822 1266.74 1.414057295
3748 906.832 1302.4 1.436212771
4086 957.649 1486.93 1.552686041
6130 1238.45 1996.42 1.612023046
6961 1413.11 2201.55 1.557939181
7162 1385.5 2216.49 1.59977178
7499 1440.87 2330.12 1.617158856
8182 1610.74 2720.45 1.688950194
12273 2307.86 4042.88 1.751787902
13924 2431.8 4224.48 1.737184756
14335 2469.4 4218.82 1.708440514
15018 2675.67 1904.07 0.711622886
16374 2989.75 5296.26 1.771470902
24564 4262.15 7696.86 1.805863077
27852 4362.53 3347.72 0.7673805572
28672 5122.8 7113.14 1.388524413
30033 4874.62 8740.04 1.792967931
32768 6014.78 7564.2 1.257603505
49142 14464.2 21114.2 1.459757233
55702 16055 23496.8 1.463523623
57339 16725.7 24553.8 1.46803388
60073 17451.5 24407.3 1.398579162
The sizes are paired with randomly generated misalignments to test the implementation.
I've implemented the SSE memcpy similar to arch/x86/lib/mmx_32.c and did
some kernel build traces:
with SSE memcpy
===============
Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):
3301761.517649 task-clock # 24.001 CPUs utilized ( +- 1.48% )
520,658 context-switches # 0.000 M/sec ( +- 0.25% )
63,845 CPU-migrations # 0.000 M/sec ( +- 0.58% )
26,070,835 page-faults # 0.008 M/sec ( +- 0.00% )
1,812,482,599,021 cycles # 0.549 GHz ( +- 0.85% ) [64.55%]
551,783,051,492 stalled-cycles-frontend # 30.44% frontend cycles idle ( +- 0.98% ) [65.64%]
444,996,901,060 stalled-cycles-backend # 24.55% backend cycles idle ( +- 1.15% ) [67.16%]
1,488,917,931,766 instructions # 0.82 insns per cycle
# 0.37 stalled cycles per insn ( +- 0.91% ) [69.25%]
340,575,978,517 branches # 103.150 M/sec ( +- 0.99% ) [68.29%]
21,519,667,206 branch-misses # 6.32% of all branches ( +- 1.09% ) [65.11%]
137.567155255 seconds time elapsed ( +- 1.48% )
plain 3.0
=========
Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):
3504754.425527 task-clock # 24.001 CPUs utilized ( +- 1.31% )
518,139 context-switches # 0.000 M/sec ( +- 0.32% )
61,790 CPU-migrations # 0.000 M/sec ( +- 0.73% )
26,056,947 page-faults # 0.007 M/sec ( +- 0.00% )
1,826,757,751,616 cycles # 0.521 GHz ( +- 0.66% ) [63.86%]
557,800,617,954 stalled-cycles-frontend # 30.54% frontend cycles idle ( +- 0.79% ) [64.65%]
443,950,768,357 stalled-cycles-backend # 24.30% backend cycles idle ( +- 0.60% ) [67.07%]
1,469,707,613,500 instructions # 0.80 insns per cycle
# 0.38 stalled cycles per insn ( +- 0.68% ) [69.98%]
335,560,565,070 branches # 95.744 M/sec ( +- 0.67% ) [69.09%]
21,365,279,176 branch-misses # 6.37% of all branches ( +- 0.65% ) [65.36%]
146.025263276 seconds time elapsed ( +- 1.31% )
So, although kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing 9 secs build time improvement, i.e.
something around 6%. We're executing a bit more instructions but I'd say
the amount of data moved per instruction is higher due to the quadword
moves.
Here's the SSE memcpy version I got so far, I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks
like netperf and stuff to see whether we see any positive results there.
The SYSTEM_RUNNING check is to take care of early boot situations where
we can't handle FPU exceptions but we use memcpy. There's an aligned and
misaligned variant which should handle any buffers and sizes, although
I've set the SSE memcpy threshold at a buffer size of at least 512 bytes
to cover the context save/restore cost somewhat.
Comments are much appreciated! :-)
--
From 385519e844f3466f500774c2c37afe44691ef8d2 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borisla...@amd.com>
Date: Thu, 11 Aug 2011 18:43:08 +0200
Subject: [PATCH] SSE3 memcpy in C
Signed-off-by: Borislav Petkov <borisla...@amd.com>
---
arch/x86/include/asm/string_64.h | 14 ++++-
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/sse_memcpy_64.c | 133 ++++++++++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/lib/sse_memcpy_64.c
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..7bd51bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
#define __HAVE_ARCH_MEMCPY 1
#ifndef CONFIG_KMEMCHECK
+extern void *__memcpy(void *to, const void *from, size_t len);
+extern void *__sse_memcpy(void *to, const void *from, size_t len);
#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
-extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+ size_t __len = (len); \
+ void *__ret; \
+ if (__len >= 512) \
+ __ret = __sse_memcpy((dst), (src), __len); \
+ else \
+ __ret = __memcpy((dst), (src), __len); \
+ __ret; \
+})
#else
-extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) \
({ \
size_t __len = (len); \
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f2479f1..5f90709 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -36,7 +36,7 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
- obj-y += iomap_copy_64.o
+ obj-y += iomap_copy_64.o sse_memcpy_64.o
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
lib-y += thunk_64.o clear_page_64.o copy_page_64.o
lib-y += memmove_64.o memset_64.o
diff --git a/arch/x86/lib/sse_memcpy_64.c b/arch/x86/lib/sse_memcpy_64.c
new file mode 100644
index 0000000..b53fc31
--- /dev/null
+++ b/arch/x86/lib/sse_memcpy_64.c
@@ -0,0 +1,133 @@
+#include <linux/module.h>
+
+#include <asm/i387.h>
+#include <asm/string_64.h>
+
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+ unsigned long src = (unsigned long)from;
+ unsigned long dst = (unsigned long)to;
+ void *p = to;
+ int i;
+
+ if (in_interrupt())
+ return __memcpy(to, from, len);
+
+ if (system_state != SYSTEM_RUNNING)
+ return __memcpy(to, from, len);
+
+ kernel_fpu_begin();
+
+ /* check alignment */
+ if ((src ^ dst) & 0xf)
+ goto unaligned;
+
+ if (src & 0xf) {
+ u8 chunk = 0x10 - (src & 0xf);
+
+ /* copy chunk until next 16-byte */
+ __memcpy(to, from, chunk);
+ len -= chunk;
+ to += chunk;
+ from += chunk;
+ }
+
+ /*
+ * copy in 256 Byte portions
+ */
+ for (i = 0; i < (len & ~0xff); i += 256) {
+ asm volatile(
+ "movaps 0x0(%0), %%xmm0\n\t"
+ "movaps 0x10(%0), %%xmm1\n\t"
+ "movaps 0x20(%0), %%xmm2\n\t"
+ "movaps 0x30(%0), %%xmm3\n\t"
+ "movaps 0x40(%0), %%xmm4\n\t"
+ "movaps 0x50(%0), %%xmm5\n\t"
+ "movaps 0x60(%0), %%xmm6\n\t"
+ "movaps 0x70(%0), %%xmm7\n\t"
+ "movaps 0x80(%0), %%xmm8\n\t"
+ "movaps 0x90(%0), %%xmm9\n\t"
+ "movaps 0xa0(%0), %%xmm10\n\t"
+ "movaps 0xb0(%0), %%xmm11\n\t"
+ "movaps 0xc0(%0), %%xmm12\n\t"
+ "movaps 0xd0(%0), %%xmm13\n\t"
+ "movaps 0xe0(%0), %%xmm14\n\t"
+ "movaps 0xf0(%0), %%xmm15\n\t"
+
+ "movaps %%xmm0, 0x0(%1)\n\t"
+ "movaps %%xmm1, 0x10(%1)\n\t"
+ "movaps %%xmm2, 0x20(%1)\n\t"
+ "movaps %%xmm3, 0x30(%1)\n\t"
+ "movaps %%xmm4, 0x40(%1)\n\t"
+ "movaps %%xmm5, 0x50(%1)\n\t"
+ "movaps %%xmm6, 0x60(%1)\n\t"
+ "movaps %%xmm7, 0x70(%1)\n\t"
+ "movaps %%xmm8, 0x80(%1)\n\t"
+ "movaps %%xmm9, 0x90(%1)\n\t"
+ "movaps %%xmm10, 0xa0(%1)\n\t"
+ "movaps %%xmm11, 0xb0(%1)\n\t"
+ "movaps %%xmm12, 0xc0(%1)\n\t"
+ "movaps %%xmm13, 0xd0(%1)\n\t"
+ "movaps %%xmm14, 0xe0(%1)\n\t"
+ "movaps %%xmm15, 0xf0(%1)\n\t"
+ : : "r" (from), "r" (to) : "memory");
+
+ from += 256;
+ to += 256;
+ }
+
+ goto trailer;
+
+unaligned:
+ /*
+ * copy in 256 Byte portions unaligned
+ */
+ for (i = 0; i < (len & ~0xff); i += 256) {
+ asm volatile(
+ "movups 0x0(%0), %%xmm0\n\t"
+ "movups 0x10(%0), %%xmm1\n\t"
+ "movups 0x20(%0), %%xmm2\n\t"
+ "movups 0x30(%0), %%xmm3\n\t"
+ "movups 0x40(%0), %%xmm4\n\t"
+ "movups 0x50(%0), %%xmm5\n\t"
+ "movups 0x60(%0), %%xmm6\n\t"
+ "movups 0x70(%0), %%xmm7\n\t"
+ "movups 0x80(%0), %%xmm8\n\t"
+ "movups 0x90(%0), %%xmm9\n\t"
+ "movups 0xa0(%0), %%xmm10\n\t"
+ "movups 0xb0(%0), %%xmm11\n\t"
+ "movups 0xc0(%0), %%xmm12\n\t"
+ "movups 0xd0(%0), %%xmm13\n\t"
+ "movups 0xe0(%0), %%xmm14\n\t"
+ "movups 0xf0(%0), %%xmm15\n\t"
+
+ "movups %%xmm0, 0x0(%1)\n\t"
+ "movups %%xmm1, 0x10(%1)\n\t"
+ "movups %%xmm2, 0x20(%1)\n\t"
+ "movups %%xmm3, 0x30(%1)\n\t"
+ "movups %%xmm4, 0x40(%1)\n\t"
+ "movups %%xmm5, 0x50(%1)\n\t"
+ "movups %%xmm6, 0x60(%1)\n\t"
+ "movups %%xmm7, 0x70(%1)\n\t"
+ "movups %%xmm8, 0x80(%1)\n\t"
+ "movups %%xmm9, 0x90(%1)\n\t"
+ "movups %%xmm10, 0xa0(%1)\n\t"
+ "movups %%xmm11, 0xb0(%1)\n\t"
+ "movups %%xmm12, 0xc0(%1)\n\t"
+ "movups %%xmm13, 0xd0(%1)\n\t"
+ "movups %%xmm14, 0xe0(%1)\n\t"
+ "movups %%xmm15, 0xf0(%1)\n\t"
+ : : "r" (from), "r" (to) : "memory");
+
+ from += 256;
+ to += 256;
+ }
+
+trailer:
+ __memcpy(to, from, len & 0xff);
+
+ kernel_fpu_end();
+
+ return p;
+}
+EXPORT_SYMBOL_GPL(__sse_memcpy);
--
1.7.6.134.gcf13f6
--
Regards/Gruss,
Boris.
Please, no. Do not inline every memcpy invocation.
This is pure bloat (considering how many memcpy calls there are)
and it doesn't even win anything in speed, since there will be
a function call either way.
Put the __len >= 512 check inside your memcpy instead.
You may do the check if you know that __len is constant:
if (__builtin_constant_p(__len) && __len >= 512) ...
because in this case gcc will evaluate it at compile-time.
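Roughly something like this, just as an untested sketch (the non-constant
size check then lives inside __sse_memcpy itself):

#define memcpy(dst, src, len)                                   \
({                                                              \
        size_t __len = (len);                                   \
        void *__ret;                                            \
        if (__builtin_constant_p(__len) && __len < 512)         \
                __ret = __memcpy((dst), (src), __len);          \
        else                                                    \
                __ret = __sse_memcpy((dst), (src), __len);      \
        __ret;                                                  \
})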
--
vda
In the __len < 512 case, this would actually cause two function calls:
first the __sse_memcpy one and then the __memcpy one.
> You may do the check if you know that __len is constant:
> if (__builtin_constant_p(__len) && __len >= 512) ...
> because in this case gcc will evaluate it at compile-time.
That could justify the bloat at least partially.
Actually, I had a version which sticks sse_memcpy code into memcpy_64.S
and that would save us both the function call and the bloat. I might
return to that one if it turns out that SSE memcpy makes sense for the
kernel.
Thanks.
--
Regards/Gruss,
Boris.
Boris, thanks for the patch. Looking at it:
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+ unsigned long src = (unsigned long)from;
+ unsigned long dst = (unsigned long)to;
+ void *p = to;
+ int i;
+
+ if (in_interrupt())
+ return __memcpy(to, from, len);
So what is the reason we cannot use sse_memcpy in interrupt context?
(FPU registers not saved?)
My question is still not answered. There are 3 versions of memcpy in kernel:
*********************** arch/x86/include/asm/string_32.h ***********************
#ifndef CONFIG_KMEMCHECK

#if (__GNUC__ >= 4)
#define memcpy(t, f, n) __builtin_memcpy(t, f, n)
#else
#define memcpy(t, f, n) \
	(__builtin_constant_p((n)) \
	 ? __constant_memcpy((t), (f), (n)) \
	 : __memcpy((t), (f), (n)))
#endif
#else
/*
 * kmemcheck becomes very happy if we use the REP instructions unconditionally,
 * because it means that we know both memory operands in advance.
 */
#define memcpy(t, f, n) __memcpy((t), (f), (n))
#endif
********************************************************************************
I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy()) as this
is valid only for AMD and not for the Atom Z5xx series.
That leaves __memcpy, __constant_memcpy and __builtin_memcpy.
I have a hunch we were using __builtin_memcpy by default, because my
GCC version is >= 4 and CONFIG_KMEMCHECK is not defined.
Can someone confirm which of these 3 is used with i386_defconfig?
And, again with i386_defconfig, which workloads provide the best
results with the default implementation?
thanks,
M.
You didn't notice the "else".
>> You may do the check if you know that __len is constant:
>> if (__builtin_constant_p(__len) && __len >= 512) ...
>> because in this case gcc will evaluate it at compile-time.
>
> That could justify the bloat at least partially.
There will be no bloat in this case.
--
vda
I don't think you ever get #NM as a result of kernel_fpu_begin, but you
can certainly have problems when kernel_fpu_begin nests by accident.
There's irq_fpu_usable() for this.
(irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
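A hypothetical guard using it could look roughly like this (untested
sketch, copy loops elided):

void *__sse_memcpy(void *to, const void *from, size_t len)
{
        /* fall back to the plain copy whenever FPU use isn't safe here */
        if (!irq_fpu_usable() || system_state != SYSTEM_RUNNING)
                return __memcpy(to, from, len);

        kernel_fpu_begin();
        /* ... aligned/unaligned movaps loops as in the patch ... */
        kernel_fpu_end();

        return to;
}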
--Andy
Because, AFAICT, when we handle an #NM exception while running
sse_memcpy in an IRQ handler, we might need to allocate the FPU save
state area, which, in turn, can sleep. Then we might get another IRQ
while sleeping and we could deadlock.
But let me stress on the "AFAICT" above, someone who actually knows the
FPU code should correct me if I'm missing something.
> My question is still not answered. There are 3 versions of memcpy in
Yes, on 32-bit you're using the compiler-supplied version
__builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
and above. Reportedly, using __builtin_memcpy generates better code.
Btw, my version of SSE memcpy is 64-bit only.
--
Regards/Gruss,
Boris.
Oh I didn't know about irq_fpu_usable(), thanks.
But irq_fpu_usable() still checks !in_interrupt(), which means that we
don't want to run SSE instructions in IRQ context. OTOH, we are still
considered fine when running with CR0.TS set. So what happens when we
get an #NM as a result of executing an FPU instruction in an IRQ
handler? We will have to do init_fpu() on the current task if the
latter hasn't used math yet, and do the slab allocation of the FPU
context area (I'm looking at math_state_restore, btw).
Thanks.
--
Regards/Gruss,
Boris.
IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
an interrupt and TS=1, then we know that we're not in a
kernel_fpu_begin section, so it's safe to start one (and do clts).
IMO this code is not very good, and I plan to fix it sooner or later.
I want kernel_fpu_begin (or its equivalent*) to be very fast and
usable from any context whatsoever. Mucking with TS is slower than a
complete save and restore of YMM state.
(*) kernel_fpu_begin is a bad name. It's only safe to use integer
instructions inside a kernel_fpu_begin section because MXCSR (and the
387 equivalent) could contain garbage.
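If anyone really wanted FP in such a section, it would hypothetically
also have to load sane control/status words first, something like this
(sketch only, made-up helper name):

static inline void kernel_fpu_sanitize_ctl(void)
{
        unsigned int mxcsr = 0x1f80;    /* power-on default, all exceptions masked */

        asm volatile("fninit");         /* reset x87 control/status/tag words */
        asm volatile("ldmxcsr %0" : : "m" (mxcsr));
}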
--Andy
Doh, yes, I see it now. This way we save the math state of the current
process if needed and "disable" #NM exceptions until kernel_fpu_end() by
clearing CR0.TS, sure. Thanks.
> IMO this code is not very good, and I plan to fix it sooner or later.
Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
You could probably reuse some bits from there. The patchset should be in
tip/x86/xsave.
> I want kernel_fpu_begin (or its equivalent*) to be very fast and
> usable from any context whatsoever. Mucking with TS is slower than a
> complete save and restore of YMM state.
Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
This would obviate the need to muck with contexts but that could get
expensive wrt stack operations. The advantage is that I'm not dealing
with the whole FPU state but only with 16 XMM regs. I should probably
dust off that version again and retest.
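From memory, that variant had roughly this shape (sketch only, not the
code I actually had):

static void *__sse_memcpy_stacksave(void *to, const void *from, size_t len)
{
        /* local, 16-byte aligned spill area for the four XMM regs we'd use */
        u8 xmm_save[16 * 4 + 15];
        u8 *save = PTR_ALIGN(&xmm_save[0], 16);
        void *ret = to;

        /* assumes CR0.TS is already clear, otherwise the movaps would #NM */
        preempt_disable();
        asm volatile("movaps %%xmm0, 0x00(%0)\n\t"
                     "movaps %%xmm1, 0x10(%0)\n\t"
                     "movaps %%xmm2, 0x20(%0)\n\t"
                     "movaps %%xmm3, 0x30(%0)\n\t"
                     : : "r" (save) : "memory");

        /* ... copy loop restricted to xmm0-xmm3 ... */

        asm volatile("movaps 0x00(%0), %%xmm0\n\t"
                     "movaps 0x10(%0), %%xmm1\n\t"
                     "movaps 0x20(%0), %%xmm2\n\t"
                     "movaps 0x30(%0), %%xmm3\n\t"
                     : : "r" (save));
        preempt_enable();

        return ret;
}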
Or, if we want to use SSE stuff in the kernel, we might think of
allocating its own FPU context(s) and handle those...
> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.
Well, do we want to use floating point instructions in the kernel?
Thanks.
--
Regards/Gruss,
Boris.
Uh... no, it just means you have to initialize the settings. It's a
perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
-hpa
I prefer get_xstate / put_xstate, but this could rapidly devolve into
bikeshedding. :)
--Andy
I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
80 ns and a full state save+restore is only ~60 ns. Without
infrastructure changes, I don't think you can avoid the clts and stts.
You might be able to get away with turning off IRQs, reading CR0 to
check TS, pushing XMM regs, and being very certain that you don't
accidentally generate any VEX-coded instructions.
>
> Or, if we want to use SSE stuff in the kernel, we might think of
> allocating its own FPU context(s) and handle those...
I'm thinking of having a stack of FPU states to parallel irq stacks
and IST stacks. It gets a little hairy when code inside
kernel_fpu_begin traps for a non-irq non-IST reason, though.
Fortunately, those are rare and all of the EX_TABLE users could mark
xmm regs as clobbered (except for copy_from_user...). Keeping
kernel_fpu_begin non-preemptable makes it less bad because the extra
FPU state can be per-cpu and not per-task.
This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
The major speedup will come from saving state in kernel_fpu_begin but
not restoring it until the code in entry_??.S restores registers.
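In very rough C, that per-CPU stack of FPU states could look something
like this (all names made up, nothing like it exists in the tree,
TS/xsave handling omitted):

#define KFPU_MAX_NEST   4

/* one fxsave area per nesting level, per CPU */
static DEFINE_PER_CPU(struct i387_fxsave_struct, kfpu_state[KFPU_MAX_NEST]);
static DEFINE_PER_CPU(int, kfpu_depth);

static void kfpu_push(void)
{
        int d;

        preempt_disable();
        d = __this_cpu_read(kfpu_depth);
        BUG_ON(d >= KFPU_MAX_NEST);
        /* save whatever FPU/SSE state the interrupted context had */
        asm volatile("fxsave %0" : "=m" (__get_cpu_var(kfpu_state)[d]));
        __this_cpu_write(kfpu_depth, d + 1);
}

static void kfpu_pop(void)
{
        int d = __this_cpu_read(kfpu_depth) - 1;

        asm volatile("fxrstor %0" : : "m" (__get_cpu_var(kfpu_state)[d]));
        __this_cpu_write(kfpu_depth, d);
        preempt_enable();
}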
>
>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>
> Well, do we want to use floating point instructions in the kernel?
The only use I could find is in staging.
--Andy
a) Quite.
b) xstate is not architecture-neutral.
-hpa
Are there any architecture-neutral users of this thing? If I were
writing generic code, I would expect:
kernel_fpu_begin();
foo *= 1.5;
kernel_fpu_end();
to work, but I would not expect:
kernel_fpu_begin();
use_xmm_registers();
kernel_fpu_end();
to make any sense.
Since the former does not actually work, I would hope that there is no
non-x86-specific user.
--Andy
Look at the RAID-6 code, for example. It makes the various
architecture-specific codes look more similar.
-hpa
Yeah, probably.
> You might be able to get away with turning off IRQs, reading CR0 to
> check TS, pushing XMM regs, and being very certain that you don't
> accidentally generate any VEX-coded instructions.
That's ok - I'm using movaps/movups. But, the problem is that I still
need to save FPU state if the task I'm interrupting has been using FPU
instructions. So, I can't get away without saving the context in which
case I don't need to save the XMM regs anyway.
>> Or, if we want to use SSE stuff in the kernel, we might think of
>> allocating its own FPU context(s) and handle those...
>
> I'm thinking of having a stack of FPU states to parallel irq stacks
> and IST stacks.
... I'm guessing with the same nesting as hardirqs? Making FPU
instructions usable in irq contexts too.
> It gets a little hairy when code inside kernel_fpu_begin traps for a
> non-irq non-IST reason, though.
How does that happen? You're in the kernel with preemption disabled and
TS cleared, what would cause the #NM? I think that if you need to switch
context, you simply "push" the current FPU context, allocate a new one
and clts as part of the FPU context switching, no?
> Fortunately, those are rare and all of the EX_TABLE users could mark
> xmm regs as clobbered (except for copy_from_user...).
Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
shows reasonable speedup there, we might need to make those work too.
> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
> extra FPU state can be per-cpu and not per-task.
Yep.
> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>
> The major speedup will come from saving state in kernel_fpu_begin but
> not restoring it until the code in entry_??.S restores registers.
But you'd need to save each kernel FPU state when nesting, no?
>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>
>> Well, do we want to use floating point instructions in the kernel?
>
> The only use I could find is in staging.
Exactly my point - I think we should do it only when it's really worth
the trouble.
--
Regards/Gruss,
Boris.
Not #NM, but page faults can happen too (even just accessing vmalloc space).
>
>> Fortunately, those are rare and all of the EX_TABLE users could mark
>> xmm regs as clobbered (except for copy_from_user...).
>
> Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> shows reasonable speedup there, we might need to make those work too.
I'm a little surprised that SSE beats fast string operations, but I
guess benchmarking always wins.
>
>> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
>> extra FPU state can be per-cpu and not per-task.
>
> Yep.
>
>> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>>
>> The major speedup will come from saving state in kernel_fpu_begin but
>> not restoring it until the code in entry_??.S restores registers.
>
> But you'd need to save each kernel FPU state when nesting, no?
>
Yes. But we don't nest that much, and the save/restore isn't all that
expensive. And we don't have to save/restore unless kernel entries
nest and both entries try to use kernel_fpu_begin at the same time.
This whole project may take awhile. The code in there is a
poorly-documented mess, even after Hans' cleanups. (It's a lot worse
without them, though.)
--Andy
If by fast string operations you mean X86_FEATURE_ERMS, then that's
Intel-only and that actually would need to be benchmarked separately.
Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
dunno about rep; movsb's enhanced rep string tricks Intel does.
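In any case the SSE path would have to be gated on CPU features at
runtime anyway; a simplistic sketch of the dispatch (the real thing
should rather use the alternatives machinery, and the ERMS bit is an
assumption here):

static void *(*memcpy_best)(void *, const void *, size_t) = __memcpy;

static int __init memcpy_select(void)
{
        /* prefer rep;movs whenever the CPU claims to be good at it */
        if (boot_cpu_has(X86_FEATURE_ERMS) || boot_cpu_has(X86_FEATURE_REP_GOOD))
                memcpy_best = __memcpy;
        else if (boot_cpu_has(X86_FEATURE_XMM2))
                memcpy_best = __sse_memcpy;
        return 0;
}
core_initcall(memcpy_select);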
> Yes. But we don't nest that much, and the save/restore isn't all that
> expensive. And we don't have to save/restore unless kernel entries
> nest and both entries try to use kernel_fpu_begin at the same time.
Yep.
> This whole project may take awhile. The code in there is a
> poorly-documented mess, even after Hans' cleanups. (It's a lot worse
> without them, though.)
Oh yeah, this code could use lotsa scrubbing :)
--
Regards/Gruss,
Boris.
I meant X86_FEATURE_REP_GOOD. (That may also be Intel-only, but it
sounds like rep;movsq might move whole cachelines on cpus at least a
few generations back.) I don't know if any ERMS cpus exist yet.
> Benchmarking with 10000 iterations, average results:
> size XM MM speedup
> 119 540.58 449.491 0.8314969419
> 12273 2307.86 4042.88 1.751787902
> 13924 2431.8 4224.48 1.737184756
> 14335 2469.4 4218.82 1.708440514
> 15018 2675.67 1904.07 0.711622886
> 16374 2989.75 5296.26 1.771470902
> 24564 4262.15 7696.86 1.805863077
> 27852 4362.53 3347.72 0.7673805572
> 28672 5122.8 7113.14 1.388524413
> 30033 4874.62 8740.04 1.792967931
The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
really good about this till we understand what happened for those two cases.
Also, anytime I see "10000 iterations", I ask myself if the benchmark rigging
took proper note of hot/cold cache issues. That *may* explain the two oddball
results we see above - but not knowing more about how it was benched, it's hard
to say.
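FWIW, one way to take cache state out of the equation is to explicitly
flush both buffers before each timed iteration, roughly (userspace
sketch, assuming 64-byte cache lines):

#include <stddef.h>

static void flush_buffer(const void *buf, size_t len)
{
        const char *p = buf;
        const char *end = p + len;

        for (; p < end; p += 64)
                asm volatile("clflush (%0)" : : "r" (p) : "memory");
        asm volatile("mfence" : : : "memory");  /* order the flushes vs. the copy */
}

/* then, before each timed iteration:
 *      flush_buffer(src, size);
 *      flush_buffer(dst, size);
 */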
We would rather use a 32-bit patch. Have you already got a 32-bit
patch? How can I use SSE3 for 32-bit?
I don't think you have submitted the 64-bit patch to mainline.
Is there still work ongoing on this?
Regards,
Melwyn
Nope, only 64-bit for now, sorry.
> How can I use SSE3 for 32-bit?
Well, OTTOMH, you have only 8 xmm regs in 32-bit instead of 16, which
should halve the performance of the 64-bit version in a perfect world.
However, we don't know how the performance of a 32-bit SSE memcpy
version behaves vs the gcc builtin one - that would require benchmarking
too.
But other than that, I don't see a problem with having a 32-bit version.
> I don't think you have submitted the 64-bit patch to mainline.
> Is there still work ongoing on this?
Yeah, we are currently benchmarking it to see whether it actually makes
sense to even have SSE memcpy in the kernel.
--
Regards/Gruss,
Boris.
Yep.
> Also, anytime I see "10000 iterations", I ask myself if the benchmark
> rigging took proper note of hot/cold cache issues. That *may* explain
> the two oddball results we see above - but not knowing more about how
> it was benched, it's hard to say.
Yeah, the more scrutiny this gets the better. So I've cleaned up my
setup and have attached it.
xm_mem.c does the benchmarking and in bench_memcpy() there's the
sse_memcpy call which is the SSE memcpy implementation using inline asm.
It looks like gcc produces pretty crappy code here because if I replace
the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
same function but in pure asm - I get much better numbers, sometimes
even over 2x. It all depends on the alignment of the buffers though.
Also, those numbers don't include the context saving/restoring which the
kernel does for us.
7491 1509.89 2346.94 1.554378381
8170 2166.81 2857.78 1.318890326
12277 2659.03 4179.31 1.571744176
13907 2571.24 4125.7 1.604558427
14319 2638.74 5799.67 2.19789466 <----
14993 2752.42 4413.85 1.603625603
16371 3479.11 5562.65 1.59887055
So please take a look and let me know what you think.
Thanks.
--
Regards/Gruss,
Boris.
2011/8/16 Borislav Petkov <b...@amd64.org>:
This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
and I finally figured out why. I also extended the test to an optimized avx memcpy,
but I think the kernel memcpy will always win in the aligned case.
Those numbers you posted aren't right, it seems. It depends a lot on the alignment;
for example, if both are aligned to 64 relative to each other,
kernel memcpy will beat avx memcpy on my machine.
I replaced the malloc calls with memalign(65536, size + 256) so I could toy
around with the alignments a little. This explains why for some sizes, kernel
memcpy was faster than sse memcpy in the test results you had.
When (src & 63) == (dst & 63), it seems that kernel memcpy always wins; otherwise
avx memcpy might.
If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned compared to each other.
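For reference, the alignment toying boils down to something like this
(sketch):

#include <malloc.h>
#include <string.h>

static void run_case(size_t size, unsigned int src_off, unsigned int dst_off)
{
        /* over-align the allocations, then pick the misalignment by hand */
        char *src_base = memalign(65536, size + 256);
        char *dst_base = memalign(65536, size + 256);
        char *src = src_base + src_off;         /* e.g. 12 */
        char *dst = dst_base + dst_off;         /* e.g. 44, differs mod 64 */

        memset(src, 0x5a, size);
        /* ... time the memcpy variants on (dst, src, size) here ... */

        free(src_base);
        free(dst_base);
}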
Cheers,
Maarten
---
Attached: my modified version of the sse memcpy you posted.
I changed it a bit, and used avx, but some of the other changes might
be better for your sse memcpy too.
"rep movs" is generally optimized in microcode on most modern Intel
CPU's for some easyish cases, and it will outperform just about
anything.
Atom is a notable exception, but if you expect performance on any
general loads from Atom, you need to get your head examined. Atom is a
disaster for anything but tuned loops.
The "easyish cases" depend on microarchitecture. They are improving,
so long-term "rep movs" is the best way regardless, but for most
current ones it's something like "source aligned to 8 bytes *and*
source and destination are equal "mod 64"".
And that's true in a lot of common situations. It's true for the page
copy, for example, and it's often true for big user "read()/write()"
calls (but "often" may not be "often enough" - high-performance
userland should strive to align read/write buffers to 64 bytes, for
example).
Many other cases of "memcpy()" are the fairly small, constant-sized
ones, where the optimal strategy tends to be "move words by hand".
Linus
Yeah,
this probably makes enabling SSE memcpy in the kernel a task
with diminishing returns. There are also the additional costs of
saving/restoring FPU context in the kernel which eat into any SSE
speedup.
And then there's the additional I$ pressure because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
smallest (two-byte) instructions I could use - in the AVX case they can
get up to 4 Bytes of length with the VEX prefix and the additional SIB,
size override, etc. fields.
Oh, and then there's copy_*_user which also does fault handling and
replacing that with a SSE version of memcpy could get quite hairy quite
fast.
Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
when I get the time to see whether it still makes sense, at all.
Thanks.
--
Regards/Gruss,
Boris.
For example, 3754 bytes with src misalignment 4 and target misalignment 20
takes 1185 units with avx memcpy, but 1480 units with kernel memcpy.
The modified testcase is attached. I did some optimizations in avx memcpy,
but I fear I may be missing something: when I tried to put it in the kernel, it
complained about SATA errors I never had before, so I immediately went for
the power button to prevent more errors. Fortunately it only corrupted some
kernel object files, and btrfs threw checksum errors. :)
All in all I think testing in userspace is safer, you might want to run it on an
idle cpu with schedtool, with a high fifo priority, and set cpufreq governor to
performance.
~Maarten
I think for bigger memcpy's it might make sense to demand stricter
alignment. What are your numbers for (0/0)? In my case it seems
that kernel memcpy is always faster for that. In fact, it seems
src&63 == dst&63 is generally faster with kernel memcpy.
Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings:
WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()
The most persistent ones appear to be the btrfs *_extent_buffer
helpers; they get the most warnings on my system. Apart from that,
there's not much to gain on my system, since the alignment is already
close to optimal.
My ext4 /home doesn't throw warnings, so I'd gain the most
by figuring out if I could improve btrfs/extent_io.c in some way.
The patch for triggering those warnings is below; change it to WARN_ON
if you want to see which one happens the most for you.
I was pleasantly surprised though.
>> The modified testcase is attached. I did some optimizations in avx
>> memcpy, but I fear I may be missing something: when I tried to put it
>> in the kernel, it complained about SATA errors I never had before,
>> so I immediately went for the power button to prevent more errors.
>> Fortunately it only corrupted some kernel object files, and btrfs
>> threw checksum errors. :)
> Well, your version should do something similar to what _mmx_memcpy does:
> save FPU state and not execute in IRQ context.
>
>> All in all I think testing in userspace is safer, you might want to
>> run it on an idle cpu with schedtool, with a high fifo priority, and
>> set cpufreq governor to performance.
> No, you need a generic system with default settings - otherwise it is
> blatant benchmark lying :-)
diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..77180bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
#ifndef CONFIG_KMEMCHECK
#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+ size_t __len = (len); \
+ const void *__src = (src); \
+ void *__dst = (dst); \
+ WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \
+ memcpy(__dst, __src, __len); \
+})
#else
extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) \
Actually,
assuming alignment matters, I'd need to redo the trace_printk run I did
initially on buffer sizes:
http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached)
to get a more sensible grasp on the alignment of kernel buffers along
with their sizes and to see whether we're doing a lot of unaligned large
buffer copies in the kernel. I seriously doubt that, though - we should
be doing everything pagewise anyway, so...
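(The instrumentation for such a rerun would be something as dumb as
this, sketch-wise:)

#define memcpy(dst, src, len)                                           \
({                                                                      \
        void *__d = (dst);                                              \
        const void *__s = (src);                                        \
        size_t __l = (len);                                             \
        trace_printk("len=%zu src%%64=%lu dst%%64=%lu\n", __l,          \
                     (unsigned long)__s & 63,                           \
                     (unsigned long)__d & 63);                          \
        __memcpy(__d, __s, __l);                                        \
})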
Concerning numbers, I ran your version again and sorted the output by
speedup. The highest scores are:
30037(12/44) 5566.4 12797.2 2.299011642
28672(12/44) 5512.97 12588.7 2.283467991
30037(28/60) 5610.34 12732.7 2.269502799
27852(12/44) 5398.36 12242.4 2.267803859
30037(4/36) 5585.02 12598.6 2.25578257
28672(28/60) 5499.11 12317.5 2.239914033
27852(28/60) 5349.78 11918.9 2.227919527
27852(20/52) 5335.92 11750.7 2.202186795
24576(12/44) 4991.37 10987.2 2.201247446
and this is pretty cool. Here are the (0/0) cases:
8192(0/0) 2627.82 3038.43 1.156255766
12288(0/0) 3116.62 3675.98 1.179475031
13926(0/0) 3330.04 4077.08 1.224334839
14336(0/0) 3377.95 4067.24 1.204055286
15018(0/0) 3465.3 4215.3 1.216430725
16384(0/0) 3623.33 4442.38 1.226050715
24576(0/0) 4629.53 6021.81 1.300737559
27852(0/0) 5026.69 6619.26 1.316823133
28672(0/0) 5157.73 6831.39 1.324495749
30037(0/0) 5322.01 6978.36 1.3112261
It is not 2x anymore but still.
Anyway, looking at the buffer sizes, they're rather ridiculous and even
if we get them in some workload, they won't repeat n times per second to
be relevant. So we'll see...
Thanks.
--
Regards/Gruss,
Boris.
Yeah, this is what my trace of a kernel build showed too:
Bytes Count
===== =====
...
224 3
225 3
227 3
244 1
254 5
255 13
256 21708
512 21746
848 12907
1920 36536
2048 21708
OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for
example when shuffling network buffers to/from userspace. Converting
those to SSE memcpy might not be as easy as memcpy itself, though.
> Yes, there are exceptions that copy into big vmalloc'ed buffers, but
> they don't tend to matter. Things like module loading etc.
Too small a number of repetitions to matter, yes.
--
Regards/Gruss,
Boris.