x86 memcpy performance

melwyn lobo

unread,

Aug 12, 2011, 2:00:02 PM8/12/11

to

Hi All,
Our Video recorder application uses memcpy for every frame. About 2KB
data every frame on Intel® Atom™ Z5xx processor.
With default 2.6.35 kernel we got 19.6 fps. But it seems kernel
implemented memcpy is suboptimal, because when we replaced
with an optmized one (using ssse3, exact patches are currently being
finalized) ew obtained 22fps a gain of 12.2 %.
C0 residency also reduced from 75% to 67%. This means power benefits too.
My questions:
1. Is kernel memcpy profiled for optimal performance.
2. Does the default kernel configuration for i386 include the best
memcpy implementation (AMD 3DNOW, __builtin_memcpy .... etc)

Any suggestions, prior experience on this is welcome.

Thanks,
M.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Andi Kleen

unread,

Aug 12, 2011, 2:40:01 PM8/12/11

to

melwyn lobo <linux....@gmail.com> writes:

> Hi All,
> Our Video recorder application uses memcpy for every frame. About 2KB
> data every frame on Intel® Atom™ Z5xx processor.
> With default 2.6.35 kernel we got 19.6 fps. But it seems kernel
> implemented memcpy is suboptimal, because when we replaced
> with an optmized one (using ssse3, exact patches are currently being
> finalized) ew obtained 22fps a gain of 12.2 %.

SSE3 in the kernel memcpy would be incredible expensive,
it would need a full FPU saving for every call and preemption
disabled.

I haven't seen your patches, but until you get all that
right (and add a lot more overhead to most copies) you
have a good change currently to corrupting user FPU state.

> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.

It depends on the CPU

There have been some improvements for Atom on newer kernels
I believe.

But then kernel memcpy is usually optimized for relatively
small copies (<= 4K) because very few kernel loads do more.

-Andi

--
a...@linux.intel.com -- Speaking for myself only

Ingo Molnar

unread,

Aug 12, 2011, 4:00:02 PM8/12/11

to

* melwyn lobo <linux....@gmail.com> wrote:

> Hi All,
> Our Video recorder application uses memcpy for every frame. About 2KB
> data every frame on Intel® Atom™ Z5xx processor.
> With default 2.6.35 kernel we got 19.6 fps. But it seems kernel
> implemented memcpy is suboptimal, because when we replaced
> with an optmized one (using ssse3, exact patches are currently being
> finalized) ew obtained 22fps a gain of 12.2 %.
> C0 residency also reduced from 75% to 67%. This means power benefits too.
> My questions:
> 1. Is kernel memcpy profiled for optimal performance.
> 2. Does the default kernel configuration for i386 include the best
> memcpy implementation (AMD 3DNOW, __builtin_memcpy .... etc)
>
> Any suggestions, prior experience on this is welcome.

Sounds very interesting - it would be nice to see 'perf record' +
'perf report' profiles done on that workload, before and after your
patches.

The thing is, we obviously want to achieve those gains of 12.2% fps
and while we probably do not want to switch the kernel's memcpy to
SSE right now (the save/restore costs are significant), we could
certainly try to optimize the specific codepath that your video
playback path is hitting.

If it's some bulk memcpy in a key video driver then we could offer a
bulk-optimized x86 memcpy variant which could be called from that
driver - and that could use SSE3 as well.

So yes, if the speedup is real then i'm sure we can achieve that
speedup - but exact profiles and measurements would have to be shown.

Thanks,

Ingo

Borislav Petkov

unread,

Aug 14, 2011, 6:00:02 AM8/14/11

to

On Fri, Aug 12, 2011 at 09:52:20PM +0200, Ingo Molnar wrote:
> Sounds very interesting - it would be nice to see 'perf record' +
> 'perf report' profiles done on that workload, before and after your
> patches.

FWIW, I've been playing with SSE memcpy version for the kernel recently
too, here's what I have so far:

First of all, I did a trace of all the memcpy buffer sizes used while
building a kernel, see attached kernel_build.sizes.

On the one hand, there is a large amount of small chunks copied (1.1M
of 1.2M calls total), and, on the other, a relatively small amount of
larger sized mem copies (256 - 2048 bytes) which are about 100K in total
but which account for the larger cumulative amount of data copied: 138MB
of 175MB total. So, if the buffer copied is big enough, the context
save/restore cost might be something we're willing to pay.

I've implemented the SSE memcpy first in userspace to measure the
speedup vs memcpy_64 we have right now:

Benchmarking with 10000 iterations, average results:
size XM MM speedup
119 540.58 449.491 0.8314969419
189 296.318 263.507 0.8892692985
206 297.949 271.399 0.9108923485
224 255.565 235.38 0.9210161798
221 299.383 276.628 0.9239941159
245 299.806 279.432 0.9320430545
369 314.774 316.89 1.006721324
425 327.536 330.475 1.00897153
439 330.847 334.532 1.01113687
458 333.159 340.124 1.020904708
503 334.44 352.166 1.053003229
767 375.612 429.949 1.144661625
870 358.888 312.572 0.8709465025
882 394.297 454.977 1.153893229
925 403.82 472.56 1.170222413
1009 407.147 490.171 1.203915735
1525 512.059 660.133 1.289174911
1737 556.85 725.552 1.302958536
1778 533.839 711.59 1.332965994
1864 558.06 745.317 1.335549882
2039 585.915 813.806 1.388949687
3068 766.462 1105.56 1.442422252
3471 883.983 1239.99 1.40272883
3570 895.822 1266.74 1.414057295
3748 906.832 1302.4 1.436212771
4086 957.649 1486.93 1.552686041
6130 1238.45 1996.42 1.612023046
6961 1413.11 2201.55 1.557939181
7162 1385.5 2216.49 1.59977178
7499 1440.87 2330.12 1.617158856
8182 1610.74 2720.45 1.688950194
12273 2307.86 4042.88 1.751787902
13924 2431.8 4224.48 1.737184756
14335 2469.4 4218.82 1.708440514
15018 2675.67 1904.07 0.711622886
16374 2989.75 5296.26 1.771470902
24564 4262.15 7696.86 1.805863077
27852 4362.53 3347.72 0.7673805572
28672 5122.8 7113.14 1.388524413
30033 4874.62 8740.04 1.792967931
32768 6014.78 7564.2 1.257603505
49142 14464.2 21114.2 1.459757233
55702 16055 23496.8 1.463523623
57339 16725.7 24553.8 1.46803388
60073 17451.5 24407.3 1.398579162

Size is with randomly generated misalignment to test the implementation.

I've implemented the SSE memcpy similar to arch/x86/lib/mmx_32.c and did
some kernel build traces:

with SSE memcpy
===============

Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

3301761.517649 task-clock # 24.001 CPUs utilized ( +- 1.48% )
520,658 context-switches # 0.000 M/sec ( +- 0.25% )
63,845 CPU-migrations # 0.000 M/sec ( +- 0.58% )
26,070,835 page-faults # 0.008 M/sec ( +- 0.00% )
1,812,482,599,021 cycles # 0.549 GHz ( +- 0.85% ) [64.55%]
551,783,051,492 stalled-cycles-frontend # 30.44% frontend cycles idle ( +- 0.98% ) [65.64%]
444,996,901,060 stalled-cycles-backend # 24.55% backend cycles idle ( +- 1.15% ) [67.16%]
1,488,917,931,766 instructions # 0.82 insns per cycle
# 0.37 stalled cycles per insn ( +- 0.91% ) [69.25%]
340,575,978,517 branches # 103.150 M/sec ( +- 0.99% ) [68.29%]
21,519,667,206 branch-misses # 6.32% of all branches ( +- 1.09% ) [65.11%]

137.567155255 seconds time elapsed ( +- 1.48% )

plain 3.0
=========

Performance counter stats for '/root/boris/bin/build-kernel.sh' (10 runs):

3504754.425527 task-clock # 24.001 CPUs utilized ( +- 1.31% )
518,139 context-switches # 0.000 M/sec ( +- 0.32% )
61,790 CPU-migrations # 0.000 M/sec ( +- 0.73% )
26,056,947 page-faults # 0.007 M/sec ( +- 0.00% )
1,826,757,751,616 cycles # 0.521 GHz ( +- 0.66% ) [63.86%]
557,800,617,954 stalled-cycles-frontend # 30.54% frontend cycles idle ( +- 0.79% ) [64.65%]
443,950,768,357 stalled-cycles-backend # 24.30% backend cycles idle ( +- 0.60% ) [67.07%]
1,469,707,613,500 instructions # 0.80 insns per cycle
# 0.38 stalled cycles per insn ( +- 0.68% ) [69.98%]
335,560,565,070 branches # 95.744 M/sec ( +- 0.67% ) [69.09%]
21,365,279,176 branch-misses # 6.37% of all branches ( +- 0.65% ) [65.36%]

146.025263276 seconds time elapsed ( +- 1.31% )

So, although kernel build is probably not the proper workload for an
SSE memcpy routine, I'm seeing 9 secs build time improvement, i.e.
something around 6%. We're executing a bit more instructions but I'd say
the amount of data moved per instruction is higher due to the quadword
moves.

Here's the SSE memcpy version I got so far, I haven't wired in the
proper CPU feature detection yet because we want to run more benchmarks
like netperf and stuff to see whether we see any positive results there.

The SYSTEM_RUNNING check is to take care of early boot situations where
we can't handle FPU exceptions but we use memcpy. There's an aligned and
misaligned variant which should handle any buffers and sizes although
I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
cover context save/restore somewhat.

Comments are much appreciated! :-)

--
From 385519e844f3466f500774c2c37afe44691ef8d2 Mon Sep 17 00:00:00 2001
From: Borislav Petkov <borisla...@amd.com>
Date: Thu, 11 Aug 2011 18:43:08 +0200
Subject: [PATCH] SSE3 memcpy in C

Signed-off-by: Borislav Petkov <borisla...@amd.com>
---
arch/x86/include/asm/string_64.h | 14 ++++-
arch/x86/lib/Makefile | 2 +-
arch/x86/lib/sse_memcpy_64.c | 133 ++++++++++++++++++++++++++++++++++++++
3 files changed, 146 insertions(+), 3 deletions(-)
create mode 100644 arch/x86/lib/sse_memcpy_64.c

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..7bd51bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t

#define __HAVE_ARCH_MEMCPY 1
#ifndef CONFIG_KMEMCHECK
+extern void *__memcpy(void *to, const void *from, size_t len);
+extern void *__sse_memcpy(void *to, const void *from, size_t len);
#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
-extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+ size_t __len = (len); \
+ void *__ret; \
+ if (__len >= 512) \
+ __ret = __sse_memcpy((dst), (src), __len); \
+ else \
+ __ret = __memcpy((dst), (src), __len); \
+ __ret; \
+})
#else
-extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) \
({ \
size_t __len = (len); \
diff --git a/arch/x86/lib/Makefile b/arch/x86/lib/Makefile
index f2479f1..5f90709 100644
--- a/arch/x86/lib/Makefile
+++ b/arch/x86/lib/Makefile
@@ -36,7 +36,7 @@ ifneq ($(CONFIG_X86_CMPXCHG64),y)
endif
lib-$(CONFIG_X86_USE_3DNOW) += mmx_32.o
else
- obj-y += iomap_copy_64.o
+ obj-y += iomap_copy_64.o sse_memcpy_64.o
lib-y += csum-partial_64.o csum-copy_64.o csum-wrappers_64.o
lib-y += thunk_64.o clear_page_64.o copy_page_64.o
lib-y += memmove_64.o memset_64.o
diff --git a/arch/x86/lib/sse_memcpy_64.c b/arch/x86/lib/sse_memcpy_64.c
new file mode 100644
index 0000000..b53fc31
--- /dev/null
+++ b/arch/x86/lib/sse_memcpy_64.c
@@ -0,0 +1,133 @@
+#include <linux/module.h>
+
+#include <asm/i387.h>
+#include <asm/string_64.h>
+
+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+ unsigned long src = (unsigned long)from;
+ unsigned long dst = (unsigned long)to;
+ void *p = to;
+ int i;
+
+ if (in_interrupt())
+ return __memcpy(to, from, len);
+
+ if (system_state != SYSTEM_RUNNING)
+ return __memcpy(to, from, len);
+
+ kernel_fpu_begin();
+
+ /* check alignment */
+ if ((src ^ dst) & 0xf)
+ goto unaligned;
+
+ if (src & 0xf) {
+ u8 chunk = 0x10 - (src & 0xf);
+
+ /* copy chunk until next 16-byte */
+ __memcpy(to, from, chunk);
+ len -= chunk;
+ to += chunk;
+ from += chunk;
+ }
+
+ /*
+ * copy in 256 Byte portions
+ */
+ for (i = 0; i < (len & ~0xff); i += 256) {
+ asm volatile(
+ "movaps 0x0(%0), %%xmm0\n\t"
+ "movaps 0x10(%0), %%xmm1\n\t"
+ "movaps 0x20(%0), %%xmm2\n\t"
+ "movaps 0x30(%0), %%xmm3\n\t"
+ "movaps 0x40(%0), %%xmm4\n\t"
+ "movaps 0x50(%0), %%xmm5\n\t"
+ "movaps 0x60(%0), %%xmm6\n\t"
+ "movaps 0x70(%0), %%xmm7\n\t"
+ "movaps 0x80(%0), %%xmm8\n\t"
+ "movaps 0x90(%0), %%xmm9\n\t"
+ "movaps 0xa0(%0), %%xmm10\n\t"
+ "movaps 0xb0(%0), %%xmm11\n\t"
+ "movaps 0xc0(%0), %%xmm12\n\t"
+ "movaps 0xd0(%0), %%xmm13\n\t"
+ "movaps 0xe0(%0), %%xmm14\n\t"
+ "movaps 0xf0(%0), %%xmm15\n\t"
+
+ "movaps %%xmm0, 0x0(%1)\n\t"
+ "movaps %%xmm1, 0x10(%1)\n\t"
+ "movaps %%xmm2, 0x20(%1)\n\t"
+ "movaps %%xmm3, 0x30(%1)\n\t"
+ "movaps %%xmm4, 0x40(%1)\n\t"
+ "movaps %%xmm5, 0x50(%1)\n\t"
+ "movaps %%xmm6, 0x60(%1)\n\t"
+ "movaps %%xmm7, 0x70(%1)\n\t"
+ "movaps %%xmm8, 0x80(%1)\n\t"
+ "movaps %%xmm9, 0x90(%1)\n\t"
+ "movaps %%xmm10, 0xa0(%1)\n\t"
+ "movaps %%xmm11, 0xb0(%1)\n\t"
+ "movaps %%xmm12, 0xc0(%1)\n\t"
+ "movaps %%xmm13, 0xd0(%1)\n\t"
+ "movaps %%xmm14, 0xe0(%1)\n\t"
+ "movaps %%xmm15, 0xf0(%1)\n\t"
+ : : "r" (from), "r" (to) : "memory");
+
+ from += 256;
+ to += 256;
+ }
+
+ goto trailer;
+
+unaligned:
+ /*
+ * copy in 256 Byte portions unaligned
+ */
+ for (i = 0; i < (len & ~0xff); i += 256) {
+ asm volatile(
+ "movups 0x0(%0), %%xmm0\n\t"
+ "movups 0x10(%0), %%xmm1\n\t"
+ "movups 0x20(%0), %%xmm2\n\t"
+ "movups 0x30(%0), %%xmm3\n\t"
+ "movups 0x40(%0), %%xmm4\n\t"
+ "movups 0x50(%0), %%xmm5\n\t"
+ "movups 0x60(%0), %%xmm6\n\t"
+ "movups 0x70(%0), %%xmm7\n\t"
+ "movups 0x80(%0), %%xmm8\n\t"
+ "movups 0x90(%0), %%xmm9\n\t"
+ "movups 0xa0(%0), %%xmm10\n\t"
+ "movups 0xb0(%0), %%xmm11\n\t"
+ "movups 0xc0(%0), %%xmm12\n\t"
+ "movups 0xd0(%0), %%xmm13\n\t"
+ "movups 0xe0(%0), %%xmm14\n\t"
+ "movups 0xf0(%0), %%xmm15\n\t"
+
+ "movups %%xmm0, 0x0(%1)\n\t"
+ "movups %%xmm1, 0x10(%1)\n\t"
+ "movups %%xmm2, 0x20(%1)\n\t"
+ "movups %%xmm3, 0x30(%1)\n\t"
+ "movups %%xmm4, 0x40(%1)\n\t"
+ "movups %%xmm5, 0x50(%1)\n\t"
+ "movups %%xmm6, 0x60(%1)\n\t"
+ "movups %%xmm7, 0x70(%1)\n\t"
+ "movups %%xmm8, 0x80(%1)\n\t"
+ "movups %%xmm9, 0x90(%1)\n\t"
+ "movups %%xmm10, 0xa0(%1)\n\t"
+ "movups %%xmm11, 0xb0(%1)\n\t"
+ "movups %%xmm12, 0xc0(%1)\n\t"
+ "movups %%xmm13, 0xd0(%1)\n\t"
+ "movups %%xmm14, 0xe0(%1)\n\t"
+ "movups %%xmm15, 0xf0(%1)\n\t"
+ : : "r" (from), "r" (to) : "memory");
+
+ from += 256;
+ to += 256;
+ }
+
+trailer:
+ __memcpy(to, from, len & 0xff);
+
+ kernel_fpu_end();
+
+ return p;
+}
+EXPORT_SYMBOL_GPL(__sse_memcpy);
--
1.7.6.134.gcf13f6

--
Regards/Gruss,
Boris.

kernel_build.sizes

Denys Vlasenko

unread,

Aug 14, 2011, 7:20:02 AM8/14/11

to

On Sunday 14 August 2011 11:59, Borislav Petkov wrote:
> Here's the SSE memcpy version I got so far, I haven't wired in the
> proper CPU feature detection yet because we want to run more benchmarks
> like netperf and stuff to see whether we see any positive results there.
>
> The SYSTEM_RUNNING check is to take care of early boot situations where
> we can't handle FPU exceptions but we use memcpy. There's an aligned and
> misaligned variant which should handle any buffers and sizes although
> I've set the SSE memcpy threshold at 512 Bytes buffersize the least to
> cover context save/restore somewhat.
>
> Comments are much appreciated! :-)
>

> --- a/arch/x86/include/asm/string_64.h
> +++ b/arch/x86/include/asm/string_64.h
> @@ -28,10 +28,20 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
>
> #define __HAVE_ARCH_MEMCPY 1
> #ifndef CONFIG_KMEMCHECK
> +extern void *__memcpy(void *to, const void *from, size_t len);
> +extern void *__sse_memcpy(void *to, const void *from, size_t len);
> #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
> -extern void *memcpy(void *to, const void *from, size_t len);
> +#define memcpy(dst, src, len) \
> +({ \
> + size_t __len = (len); \
> + void *__ret; \
> + if (__len >= 512) \
> + __ret = __sse_memcpy((dst), (src), __len); \
> + else \
> + __ret = __memcpy((dst), (src), __len); \
> + __ret; \
> +})

Please, no. Do not inline every memcpy invocation.
This is pure bloat (comsidering how many memcpy calls there are)
and it doesn't even win anything in speed, since there will be
a fucntion call either way.
Put the __len >= 512 check inside your memcpy instead.

You may do the check if you know that __len is constant:
if (__builtin_constant_p(__len) && __len >= 512) ...
because in this case gcc will evaluate it at compile-time.

--
vda

Borislav Petkov

unread,

Aug 14, 2011, 8:50:02 AM8/14/11

to

In the __len < 512 case, this would actually cause two function calls,
actually: once the __sse_memcpy and then the __memcpy one.

> You may do the check if you know that __len is constant:
> if (__builtin_constant_p(__len) && __len >= 512) ...
> because in this case gcc will evaluate it at compile-time.

That could justify the bloat at least partially.

Actually, I had a version which sticks sse_memcpy code into memcpy_64.S
and that would save us both the function call and the bloat. I might
return to that one if it turns out that SSE memcpy makes sense for the
kernel.

Thanks.

--
Regards/Gruss,
Boris.

melwyn lobo

unread,

Aug 15, 2011, 9:30:02 AM8/15/11

to

Hi,
Was on a vacation for last two days. Thanks for the good insights into
the issue.
Ingo, unfortunately the data we have is on a soon to be released
platform and strictly confidential at this stage.

Boris, thanks for the patch. On seeing your patch:

+void *__sse_memcpy(void *to, const void *from, size_t len)
+{
+ unsigned long src = (unsigned long)from;
+ unsigned long dst = (unsigned long)to;
+ void *p = to;
+ int i;
+
+ if (in_interrupt())
+ return __memcpy(to, from, len)

So what is the reason we cannot use sse_memcpy in interrupt context.
(fpu registers not saved ? )
My question is still not answered. There are 3 versions of memcpy in kernel:

***********************************arch/x86/include/asm/string_32.h******************************
179 #ifndef CONFIG_KMEMCHECK
180
181 #if (__GNUC__ >= 4)
182 #define memcpy(t, f, n) __builtin_memcpy(t, f, n)
183 #else
184 #define memcpy(t, f, n) \
185 (__builtin_constant_p((n)) \
186 ? __constant_memcpy((t), (f), (n)) \
187 : __memcpy((t), (f), (n)))
188 #endif
189 #else
190 /*
191 * kmemcheck becomes very happy if we use the REP instructions
unconditionally,
192 * because it means that we know both memory operands in advance.
193 */
194 #define memcpy(t, f, n) __memcpy((t), (f), (n))
195 #endif
196
197
****************************************************************************************.
I will ignore CONFIG_X86_USE_3DNOW (including mmx_memcpy() ) as this
is valid only for AMD and not for Atom Z5xx series.
This means __memcpy, __constant_memcpy, __builtin_memcpy .
I have a hunch by default we were using __builtin_memcpy. This is
because I see my GCC version >=4 and CONFIG_KMEMCHECK not defined.
Can someone confirm of these 3 which is used, with i386_defconfig.
Again with i386_defconfig which workloads provide the best results
with the default implementation.

thanks,
M.

Denys Vlasenko

unread,

Aug 15, 2011, 9:50:02 AM8/15/11

to

On Sun, Aug 14, 2011 at 2:40 PM, Borislav Petkov <b...@alien8.de> wrote:
>> > + � if (__len >= 512) � � � � � � � � � � � � � � � � � � � \
>> > + � � � � � __ret = __sse_memcpy((dst), (src), __len); � � �\
>> > + � else � � � � � � � � � � � � � � � � � � � � � � � � � �\
>> > + � � � � � __ret = __memcpy((dst), (src), __len); � � � � �\
>> > + � __ret; � � � � � � � � � � � � � � � � � � � � � � � � �\
>> > +})
>>
>> Please, no. Do not inline every memcpy invocation.
>> This is pure bloat (comsidering how many memcpy calls there are)
>> and it doesn't even win anything in speed, since there will be
>> a fucntion call either way.
>> Put the __len >= 512 check inside your memcpy instead.
>
> In the __len < 512 case, this would actually cause two function calls,
> actually: once the __sse_memcpy and then the __memcpy one.

You didn't notice the "else".

>> You may do the check if you know that __len is constant:
>> if (__builtin_constant_p(__len) && __len >= 512) ...
>> because in this case gcc will evaluate it at compile-time.
>
> That could justify the bloat at least partially.

There will be no bloat in this case.
--
vda

Andy Lutomirski

unread,

Aug 15, 2011, 11:00:03 AM8/15/11

to

On 08/15/2011 10:55 AM, Borislav Petkov wrote:

> On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
>> Hi,
>> Was on a vacation for last two days. Thanks for the good insights into
>> the issue.
>> Ingo, unfortunately the data we have is on a soon to be released
>> platform and strictly confidential at this stage.
>>
>> Boris, thanks for the patch. On seeing your patch:
>> +void *__sse_memcpy(void *to, const void *from, size_t len)
>> +{
>> + unsigned long src = (unsigned long)from;
>> + unsigned long dst = (unsigned long)to;
>> + void *p = to;
>> + int i;
>> +
>> + if (in_interrupt())
>> + return __memcpy(to, from, len)
>> So what is the reason we cannot use sse_memcpy in interrupt context.
>> (fpu registers not saved ? )
>

> Because, AFAICT, when we handle an #NM exception while running
> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
> area, which in turn, can sleep. Then, we might get another IRQ while
> sleeping and we should be deadlocked.
>
> But let me stress on the "AFAICT" above, someone who actually knows the
> FPU code should correct me if I'm missing something.

I don't think you ever get #NM as a result of kernel_fpu_begin, but you
can certainly have problems when kernel_fpu_begin nests by accident.
There's irq_fpu_usable() for this.

(irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)

--Andy

Borislav Petkov

unread,

Aug 15, 2011, 11:00:03 AM8/15/11

to

On Mon, 15 August, 2011 3:27 pm, melwyn lobo wrote:
> Hi,
> Was on a vacation for last two days. Thanks for the good insights into
> the issue.
> Ingo, unfortunately the data we have is on a soon to be released
> platform and strictly confidential at this stage.
>
> Boris, thanks for the patch. On seeing your patch:
> +void *__sse_memcpy(void *to, const void *from, size_t len)
> +{
> + unsigned long src = (unsigned long)from;
> + unsigned long dst = (unsigned long)to;
> + void *p = to;
> + int i;
> +
> + if (in_interrupt())
> + return __memcpy(to, from, len)
> So what is the reason we cannot use sse_memcpy in interrupt context.
> (fpu registers not saved ? )

Because, AFAICT, when we handle an #NM exception while running
sse_memcpy in an IRQ handler, we might need to allocate FPU save state
area, which in turn, can sleep. Then, we might get another IRQ while
sleeping and we should be deadlocked.

But let me stress on the "AFAICT" above, someone who actually knows the
FPU code should correct me if I'm missing something.

> My question is still not answered. There are 3 versions of memcpy in

Yes, on 32-bit you're using the compiler-supplied version
__builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
and above. Reportedly, using __builtin_memcpy generates better code.

Btw, my version of SSE memcpy is 64-bit only.

--
Regards/Gruss,
Boris.

Borislav Petkov

unread,

Aug 15, 2011, 11:40:02 AM8/15/11

to

On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>> (fpu registers not saved ? )
>>
>> Because, AFAICT, when we handle an #NM exception while running
>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>> area, which in turn, can sleep. Then, we might get another IRQ while
>> sleeping and we should be deadlocked.
>>
>> But let me stress on the "AFAICT" above, someone who actually knows the
>> FPU code should correct me if I'm missing something.
>
> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
> can certainly have problems when kernel_fpu_begin nests by accident.
> There's irq_fpu_usable() for this.
>
> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)

Oh I didn't know about irq_fpu_usable(), thanks.

But still, irq_fpu_usable() still checks !in_interrupt() which means
that we don't want to run SSE instructions in IRQ context. OTOH, we
still are fine when running with CR0.TS. So what happens when we get an
#NM as a result of executing an FPU instruction in an IRQ handler? We
will have to do init_fpu() on the current task if the last hasn't used
math yet and do the slab allocation of the FPU context area (I'm looking
at math_state_restore, btw).

Thanks.

--
Regards/Gruss,
Boris.

Andrew Lutomirski

unread,

Aug 15, 2011, 11:40:03 AM8/15/11

to

On Mon, Aug 15, 2011 at 11:29 AM, Borislav Petkov <b...@alien8.de> wrote:
> On Mon, 15 August, 2011 4:59 pm, Andy Lutomirski wrote:
>>>> So what is the reason we cannot use sse_memcpy in interrupt context.
>>>> (fpu registers not saved ? )
>>>
>>> Because, AFAICT, when we handle an #NM exception while running
>>> sse_memcpy in an IRQ handler, we might need to allocate FPU save state
>>> area, which in turn, can sleep. Then, we might get another IRQ while
>>> sleeping and we should be deadlocked.
>>>
>>> But let me stress on the "AFAICT" above, someone who actually knows the
>>> FPU code should correct me if I'm missing something.
>>
>> I don't think you ever get #NM as a result of kernel_fpu_begin, but you
>> can certainly have problems when kernel_fpu_begin nests by accident.
>> There's irq_fpu_usable() for this.
>>
>> (irq_fpu_usable() reads cr0 sometimes and I suspect it can be slow.)
>
> Oh I didn't know about irq_fpu_usable(), thanks.
>
> But still, irq_fpu_usable() still checks !in_interrupt() which means
> that we don't want to run SSE instructions in IRQ context. OTOH, we
> still are fine when running with CR0.TS. So what happens when we get an
> #NM as a result of executing an FPU instruction in an IRQ handler? We
> will have to do init_fpu() on the current task if the last hasn't used
> math yet and do the slab allocation of the FPU context area (I'm looking
> at math_state_restore, btw).

IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
an interrupt and TS=1, when we know that we're not in a
kernel_fpu_begin section, so it's safe to start one (and do clts).

IMO this code is not very good, and I plan to fix it sooner or later.
I want kernel_fpu_begin (or its equivalent*) to be very fast and
usable from any context whatsoever. Mucking with TS is slower than a
complete save and restore of YMM state.

(*) kernel_fpu_begin is a bad name. It's only safe to use integer
instructions inside a kernel_fpu_begin section because MXCSR (and the
387 equivalent) could contain garbage.

--Andy

Borislav Petkov

unread,

Aug 15, 2011, 12:20:01 PM8/15/11

to

On Mon, 15 August, 2011 5:36 pm, Andrew Lutomirski wrote:
>> But still, irq_fpu_usable() still checks !in_interrupt() which means
>> that we don't want to run SSE instructions in IRQ context. OTOH, we
>> still are fine when running with CR0.TS. So what happens when we get an
>> #NM as a result of executing an FPU instruction in an IRQ handler? We
>> will have to do init_fpu() on the current task if the last hasn't used
>> math yet and do the slab allocation of the FPU context area (I'm looking
>> at math_state_restore, btw).
>
> IIRC kernel_fpu_begin does clts, so #NM won't happen. But if we're in
> an interrupt and TS=1, when we know that we're not in a
> kernel_fpu_begin section, so it's safe to start one (and do clts).

Doh, yes, I see it now. This way we save the math state of the current
process if needed and "disable" #NM exceptions until kernel_fpu_end() by
clearing CR0.TS, sure. Thanks.

> IMO this code is not very good, and I plan to fix it sooner or later.

Yep. Also, AFAIR, Hans did some FPU cleanup as part of his xsave rework.
You could probably reuse some bits from there. The patchset should be in
tip/x86/xsave.

> I want kernel_fpu_begin (or its equivalent*) to be very fast and
> usable from any context whatsoever. Mucking with TS is slower than a
> complete save and restore of YMM state.

Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
This would obviate the need to muck with contexts but that could get
expensive wrt stack operations. The advantage is that I'm not dealing
with the whole FPU state but only with 16 XMM regs. I should probably
dust off that version again and retest.

Or, if we want to use SSE stuff in the kernel, we might think of
allocating its own FPU context(s) and handle those...

> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.

Well, do we want to use floating point instructions in the kernel?

Thanks.

--
Regards/Gruss,
Boris.

H. Peter Anvin

unread,

Aug 15, 2011, 12:20:02 PM8/15/11

to

On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>
> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
> instructions inside a kernel_fpu_begin section because MXCSR (and the
> 387 equivalent) could contain garbage.
>

Uh... no, it just means you have to initialize the settings. It's a
perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.

-hpa

Andrew Lutomirski

unread,

Aug 15, 2011, 1:00:02 PM8/15/11

to

On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <h...@zytor.com> wrote:
> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>
>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>>
>
> Uh... no, it just means you have to initialize the settings. It's a
> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.

I prefer get_xstate / put_xstate, but this could rapidly devolve into
bikeshedding. :)

--Andy

Andrew Lutomirski

unread,

Aug 15, 2011, 1:10:03 PM8/15/11

to

I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
80 ns and a full state save+restore is only ~60 ns. Without
infrastructure changes, I don't think you can avoid the clts and stts.

You might be able to get away with turning off IRQs, reading CR0 to
check TS, pushing XMM regs, and being very certain that you don't
accidentally generate any VEX-coded instructions.

>
> Or, if we want to use SSE stuff in the kernel, we might think of
> allocating its own FPU context(s) and handle those...

I'm thinking of having a stack of FPU states to parallel irq stacks
and IST stacks. It gets a little hairy when code inside
kernel_fpu_begin traps for a non-irq non-IST reason, though.
Fortunately, those are rare and all of the EX_TABLE users could mark
xmm regs as clobbered (except for copy_from_user...). Keeping
kernel_fpu_begin non-preemptable makes it less bad because the extra
FPU state can be per-cpu and not per-task.

This is extra fun on 32 bit, which IIRC doesn't have IST stacks.

The major speedup will come from saving state in kernel_fpu_begin but
not restoring it until the code in entry_??.S restores registers.

>
>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>> 387 equivalent) could contain garbage.
>
> Well, do we want to use floating point instructions in the kernel?

The only use I could find is in staging.

--Andy

H. Peter Anvin

unread,

Aug 15, 2011, 2:30:02 PM8/15/11

to

On 08/15/2011 09:58 AM, Andrew Lutomirski wrote:
> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <h...@zytor.com> wrote:
>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>>
>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>>
>>
>> Uh... no, it just means you have to initialize the settings. It's a
>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
>
> I prefer get_xstate / put_xstate, but this could rapidly devolve into
> bikeshedding. :)
>

a) Quite.

b) xstate is not architecture-neutral.

-hpa

Andrew Lutomirski

unread,

Aug 15, 2011, 2:40:01 PM8/15/11

to

On Mon, Aug 15, 2011 at 2:26 PM, H. Peter Anvin <h...@zytor.com> wrote:
> On 08/15/2011 09:58 AM, Andrew Lutomirski wrote:
>> On Mon, Aug 15, 2011 at 12:12 PM, H. Peter Anvin <h...@zytor.com> wrote:
>>> On 08/15/2011 08:36 AM, Andrew Lutomirski wrote:
>>>>
>>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>>> 387 equivalent) could contain garbage.
>>>>
>>>
>>> Uh... no, it just means you have to initialize the settings. It's a
>>> perfectly good name, it's called kernel_fpu_begin, not kernel_fp_begin.
>>
>> I prefer get_xstate / put_xstate, but this could rapidly devolve into
>> bikeshedding. :)
>>
>
> a) Quite.
>
> b) xstate is not architecture-neutral.

Are there any architecture-neutral users of this thing? If I were
writing generic code, I would expect:

kernel_fpu_begin();
foo *= 1.5;
kernel_fpu_end();

to work, but I would not expect:

kernel_fpu_begin();
use_xmm_registers();
kernel_fpu_end();

to make any sense.

Since the former does not actually work, I would hope that there is no
non-x86-specific user.

--Andy

H. Peter Anvin

unread,

Aug 15, 2011, 3:00:02 PM8/15/11

to

On 08/15/2011 11:35 AM, Andrew Lutomirski wrote:
>
> Are there any architecture-neutral users of this thing?

Look at the RAID-6 code, for example. It makes the various
architecture-specific codes look more similar.

-hpa

Borislav Petkov

unread,

Aug 15, 2011, 3:00:03 PM8/15/11

to

On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>> Well, I had a SSE memcpy which saved/restored the XMM regs on the stack.
>> This would obviate the need to muck with contexts but that could get
>> expensive wrt stack operations. The advantage is that I'm not dealing
>> with the whole FPU state but only with 16 XMM regs. I should probably
>> dust off that version again and retest.
>
> I bet it won't be a significant win. On Sandy Bridge, clts/stts takes
> 80 ns and a full state save+restore is only ~60 ns.
> Without infrastructure changes, I don't think you can avoid the clts
> and stts.

Yeah, probably.

> You might be able to get away with turning off IRQs, reading CR0 to
> check TS, pushing XMM regs, and being very certain that you don't
> accidentally generate any VEX-coded instructions.

That's ok - I'm using movaps/movups. But, the problem is that I still
need to save FPU state if the task I'm interrupting has been using FPU
instructions. So, I can't get away without saving the context in which
case I don't need to save the XMM regs anyway.

>> Or, if we want to use SSE stuff in the kernel, we might think of
>> allocating its own FPU context(s) and handle those...
>
> I'm thinking of having a stack of FPU states to parallel irq stacks
> and IST stacks.

... I'm guessing with the same nesting as hardirqs? Making FPU
instructions usable in irq contexts too.

> It gets a little hairy when code inside kernel_fpu_begin traps for a
> non-irq non-IST reason, though.

How does that happen? You're in the kernel with preemption disabled and
TS cleared, what would cause the #NM? I think that if you need to switch
context, you simply "push" the current FPU context, allocate a new one
and clts as part of the FPU context switching, no?

> Fortunately, those are rare and all of the EX_TABLE users could mark
> xmm regs as clobbered (except for copy_from_user...).

Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
shows reasonable speedup there, we might need to make those work too.

> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
> extra FPU state can be per-cpu and not per-task.

Yep.

> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>
> The major speedup will come from saving state in kernel_fpu_begin but
> not restoring it until the code in entry_??.S restores registers.

But you'd need to save each kernel FPU state when nesting, no?

>>> (*) kernel_fpu_begin is a bad name. It's only safe to use integer
>>> instructions inside a kernel_fpu_begin section because MXCSR (and the
>>> 387 equivalent) could contain garbage.
>>
>> Well, do we want to use floating point instructions in the kernel?
>
> The only use I could find is in staging.

Exactly my point - I think we should do it only when it's really worth
the trouble.

--
Regards/Gruss,
Boris.

Andrew Lutomirski

unread,

Aug 15, 2011, 3:20:03 PM8/15/11

to

On Mon, Aug 15, 2011 at 2:49 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Mon, 15 August, 2011 7:04 pm, Andrew Lutomirski wrote:
>>> Or, if we want to use SSE stuff in the kernel, we might think of
>>> allocating its own FPU context(s) and handle those...
>>
>> I'm thinking of having a stack of FPU states to parallel irq stacks
>> and IST stacks.
>
> ... I'm guessing with the same nesting as hardirqs? Making FPU
> instructions usable in irq contexts too.
>
>> It gets a little hairy when code inside kernel_fpu_begin traps for a
>> non-irq non-IST reason, though.
>
> How does that happen? You're in the kernel with preemption disabled and
> TS cleared, what would cause the #NM? I think that if you need to switch
> context, you simply "push" the current FPU context, allocate a new one
> and clts as part of the FPU context switching, no?

Not #NM, but page faults can happen too (even just accessing vmalloc space).

>
>> Fortunately, those are rare and all of the EX_TABLE users could mark
>> xmm regs as clobbered (except for copy_from_user...).
>
> Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> shows reasonable speedup there, we might need to make those work too.

I'm a little surprised that SSE beats fast string operations, but I
guess benchmarking always wins.

>
>> Keeping kernel_fpu_begin non-preemptable makes it less bad because the
>> extra FPU state can be per-cpu and not per-task.
>
> Yep.
>
>> This is extra fun on 32 bit, which IIRC doesn't have IST stacks.
>>
>> The major speedup will come from saving state in kernel_fpu_begin but
>> not restoring it until the code in entry_??.S restores registers.
>
> But you'd need to save each kernel FPU state when nesting, no?
>

Yes. But we don't nest that much, and the save/restore isn't all that
expensive. And we don't have to save/restore unless kernel entries
nest and both entries try to use kernel_fpu_begin at the same time.

This whole project may take awhile. The code in there is a
poorly-documented mess, even after Hans' cleanups. (It's a lot worse
without them, though.)

--Andy

Borislav Petkov

unread,

Aug 15, 2011, 4:10:03 PM8/15/11

to

On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote:
> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
> > shows reasonable speedup there, we might need to make those work too.
>
> I'm a little surprised that SSE beats fast string operations, but I
> guess benchmarking always wins.

If by fast string operations you mean X86_FEATURE_ERMS, then that's
Intel-only and that actually would need to be benchmarked separately.
Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
dunno about rep; movsb's enhanced rep string tricks Intel does.

> Yes. But we don't nest that much, and the save/restore isn't all that
> expensive. And we don't have to save/restore unless kernel entries
> nest and both entries try to use kernel_fpu_begin at the same time.

Yep.

> This whole project may take awhile. The code in there is a
> poorly-documented mess, even after Hans' cleanups. (It's a lot worse
> without them, though.)

Oh yeah, this code could use lotsa scrubbing :)

--
Regards/Gruss,
Boris.

Andrew Lutomirski

unread,

Aug 15, 2011, 4:10:03 PM8/15/11

to

On Mon, Aug 15, 2011 at 4:05 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Mon, Aug 15, 2011 at 03:11:40PM -0400, Andrew Lutomirski wrote:
>> > Well, copy_from_user... does a bunch of rep; movsq - if the SSE version
>> > shows reasonable speedup there, we might need to make those work too.
>>
>> I'm a little surprised that SSE beats fast string operations, but I
>> guess benchmarking always wins.
>
> If by fast string operations you mean X86_FEATURE_ERMS, then that's
> Intel-only and that actually would need to be benchmarked separately.
> Currently, I see speedup for large(r) buffers only vs rep; movsq. But I
> dunno about rep; movsb's enhanced rep string tricks Intel does.

I meant X86_FEATURE_REP_GOOD. (That may also be Intel-only, but it
sounds like rep;movsq might move whole cachelines on cpus at least a
few generations back.) I don't know if any ERMS cpus exist yet.

Valdis.K...@vt.edu

unread,

Aug 15, 2011, 10:40:01 PM8/15/11

to

On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:

> Benchmarking with 10000 iterations, average results:
> size XM MM speedup
> 119 540.58 449.491 0.8314969419

> 12273 2307.86 4042.88 1.751787902

> 13924 2431.8 4224.48 1.737184756
> 14335 2469.4 4218.82 1.708440514
> 15018 2675.67 1904.07 0.711622886
> 16374 2989.75 5296.26 1.771470902
> 24564 4262.15 7696.86 1.805863077
> 27852 4362.53 3347.72 0.7673805572
> 28672 5122.8 7113.14 1.388524413
> 30033 4874.62 8740.04 1.792967931

The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
really good about this till we understand what happened for those two cases.

Also, anytime I see "10000 iterations", I ask myself if the benchmark rigging
took proper note of hot/cold cache issues. That *may* explain the two oddball
results we see above - but not knowing more about how it was benched, it's hard
to say.

melwyn lobo

unread,

Aug 16, 2011, 3:20:02 AM8/16/11

to

> Yes, on 32-bit you're using the compiler-supplied version
> __builtin_memcpy when CONFIG_KMEMCHECK=n and your gcc is of version 4
> and above. Reportedly, using __builtin_memcpy generates better code.
>
> Btw, my version of SSE memcpy is 64-bit only.
>
> --
> Regards/Gruss,
> Boris.
>
>

We would rather use the 32 bit patch. Have you already got a 32 bit
patch. How can I use sse3 for 32 bit.
I don't think you have submitted 64 bit patch in the mainline.
Is there still work ongoing on this.

Regards,
Melwyn

Borislav Petkov

unread,

Aug 16, 2011, 3:50:01 AM8/16/11

to

On Tue, Aug 16, 2011 at 12:49:28PM +0530, melwyn lobo wrote:
> We would rather use the 32 bit patch. Have you already got a 32 bit
> patch.

Nope, only 64-bit for now, sorry.

> How can I use sse3 for 32 bit.

Well, OTTOMH, you have only 8 xmm regs in 32-bit instead of 16, which
should halve the performance of the 64-bit version in a perfect world.
However, we don't know how the performance of a 32-bit SSE memcpy
version behaves vs the gcc builtin one - that would require benchmarking
too.

But other than that, I don't see a problem with having a 32-bit version.

> I don't think you have submitted 64 bit patch in the mainline.
> Is there still work ongoing on this.

Yeah, we are currently benchmarking it to see whether it actually makes
sense to even have SSE memcpy in the kernel.

--
Regards/Gruss,
Boris.

Borislav Petkov

unread,

Aug 16, 2011, 8:20:02 AM8/16/11

to

On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.K...@vt.edu wrote:
> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>
> > Benchmarking with 10000 iterations, average results:
> > size XM MM speedup
> > 119 540.58 449.491 0.8314969419
>
> > 12273 2307.86 4042.88 1.751787902
> > 13924 2431.8 4224.48 1.737184756
> > 14335 2469.4 4218.82 1.708440514
> > 15018 2675.67 1904.07 0.711622886
> > 16374 2989.75 5296.26 1.771470902
> > 24564 4262.15 7696.86 1.805863077
> > 27852 4362.53 3347.72 0.7673805572
> > 28672 5122.8 7113.14 1.388524413
> > 30033 4874.62 8740.04 1.792967931
>
> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
> really good about this till we understand what happened for those two cases.

Yep.

> Also, anytime I see "10000 iterations", I ask myself if the benchmark
> rigging took proper note of hot/cold cache issues. That *may* explain
> the two oddball results we see above - but not knowing more about how
> it was benched, it's hard to say.

Yeah, the more scrutiny this gets the better. So I've cleaned up my
setup and have attached it.

xm_mem.c does the benchmarking and in bench_memcpy() there's the
sse_memcpy call which is the SSE memcpy implementation using inline asm.
It looks like gcc produces pretty crappy code here because if I replace
the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
same function but in pure asm - I get much better numbers, sometimes
even over 2x. It all depends on the alignment of the buffers though.
Also, those numbers don't include the context saving/restoring which the
kernel does for us.

7491 1509.89 2346.94 1.554378381
8170 2166.81 2857.78 1.318890326
12277 2659.03 4179.31 1.571744176
13907 2571.24 4125.7 1.604558427
14319 2638.74 5799.67 2.19789466 <----
14993 2752.42 4413.85 1.603625603
16371 3479.11 5562.65 1.59887055

So please take a look and let me know what you think.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

sse_memcpy.tar.bz2

Maarten Lankhorst

unread,

Sep 1, 2011, 11:20:01 AM9/1/11

to

Hey,

2011/8/16 Borislav Petkov <b...@amd64.org>:

This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
and I finally figured out why. I also extended the test to an optimized avx memcpy,
but I think the kernel memcpy will always win in the aligned case.

Those numbers you posted aren't right it seems. It depends a lot on the alignment,
for example if both are aligned to 64 relative to each other,
kernel memcpy will win from avx memcpy on my machine.

I replaced the malloc calls with memalign(65536, size + 256) so I could toy
around with the alignments a little. This explains why for some sizes, kernel
memcpy was faster than sse memcpy in the test results you had.
When (src & 63 == dst & 63), it seems that kernel memcpy always wins, otherwise
avx memcpy might.

If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned compared to each other.

Cheers,
Maarten

---
Attached: my modified version of the sse memcpy you posted.

I changed it a bit, and used avx, but some of the other changes might
be better for your sse memcpy too.

ym_memcpy.txt

Linus Torvalds

unread,

Sep 1, 2011, 12:20:01 PM9/1/11

to

On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
<m.b.la...@gmail.com> wrote:
>
> This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> and I finally figured out why. I also extended the test to an optimized avx memcpy,
> but I think the kernel memcpy will always win in the aligned case.

"rep movs" is generally optimized in microcode on most modern Intel
CPU's for some easyish cases, and it will outperform just about
anything.

Atom is a notable exception, but if you expect performance on any
general loads from Atom, you need to get your head examined. Atom is a
disaster for anything but tuned loops.

The "easyish cases" depend on microarchitecture. They are improving,
so long-term "rep movs" is the best way regardless, but for most
current ones it's something like "source aligned to 8 bytes *and*
source and destination are equal "mod 64"".

And that's true in a lot of common situations. It's true for the page
copy, for example, and it's often true for big user "read()/write()"
calls (but "often" may not be "often enough" - high-performance
userland should strive to align read/write buffers to 64 bytes, for
example).

Many other cases of "memcpy()" are the fairly small, constant-sized
ones, where the optimal strategy tends to be "move words by hand".

Linus

Borislav Petkov

unread,

Sep 8, 2011, 4:40:02 AM9/8/11

to

On Thu, Sep 01, 2011 at 09:18:32AM -0700, Linus Torvalds wrote:
> On Thu, Sep 1, 2011 at 8:15 AM, Maarten Lankhorst
> <m.b.la...@gmail.com> wrote:
> >
> > This work intrigued me, in some cases kernel memcpy was a lot faster than sse memcpy,
> > and I finally figured out why. I also extended the test to an optimized avx memcpy,
> > but I think the kernel memcpy will always win in the aligned case.
>
> "rep movs" is generally optimized in microcode on most modern Intel
> CPU's for some easyish cases, and it will outperform just about
> anything.
>
> Atom is a notable exception, but if you expect performance on any
> general loads from Atom, you need to get your head examined. Atom is a
> disaster for anything but tuned loops.
>
> The "easyish cases" depend on microarchitecture. They are improving,
> so long-term "rep movs" is the best way regardless, but for most
> current ones it's something like "source aligned to 8 bytes *and*
> source and destination are equal "mod 64"".
>
> And that's true in a lot of common situations. It's true for the page
> copy, for example, and it's often true for big user "read()/write()"
> calls (but "often" may not be "often enough" - high-performance
> userland should strive to align read/write buffers to 64 bytes, for
> example).
>
> Many other cases of "memcpy()" are the fairly small, constant-sized
> ones, where the optimal strategy tends to be "move words by hand".

Yeah,

this probably makes enabling SSE memcpy in the kernel a task
with diminishing returns. There are also the additional costs of
saving/restoring FPU context in the kernel which eat off from any SSE
speedup.

And then there's the additional I$ pressure because "rep movs" is
much smaller than all those mov[au]ps stanzas. Btw, mov[au]ps are the
smallest (two-byte) instructions I could use - in the AVX case they can
get up to 4 Bytes of length with the VEX prefix and the additional SIB,
size override, etc. fields.

Oh, and then there's copy_*_user which also does fault handling and
replacing that with a SSE version of memcpy could get quite hairy quite
fast.

Anyway, I'll try to benchmark an asm version of SSE memcpy in the kernel
when I get the time to see whether it still makes sense, at all.

Thanks.

--
Regards/Gruss,
Boris.

Maarten Lankhorst

unread,

Sep 8, 2011, 7:00:02 AM9/8/11

to

I have changed your sse memcpy to test various alignments with
source/destination offsets instead of random, from that you can
see that you don't really get a speedup at all. It seems to be more
a case of 'kernel memcpy is significantly slower with some alignments',
than 'avx memcpy is just that much faster'.

For example 3754 with src misalignment 4 and target misalignment 20
takes 1185 units on avx memcpy, but 1480 units with kernel memcpy

The modified testcase is attached, I did some optimizations in avx memcpy,
but I fear I may be missing something, when I tried to put it in the kernel, it
complained about sata errors I never had before, so I immediately went for
the power button to prevent more errors, fortunately it only corrupted some
kernel object files, and btrfs threw checksum errors. :)

All in all I think testing in userspace is safer, you might want to run it on an
idle cpu with schedtool, with a high fifo priority, and set cpufreq governor to
performance.

~Maarten

memcpy.tar.gz

Borislav Petkov

unread,

Sep 9, 2011, 4:20:03 AM9/9/11

to

On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
> I have changed your sse memcpy to test various alignments with
> source/destination offsets instead of random, from that you can
> see that you don't really get a speedup at all. It seems to be more
> a case of 'kernel memcpy is significantly slower with some alignments',
> than 'avx memcpy is just that much faster'.
>
> For example 3754 with src misalignment 4 and target misalignment 20
> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy

Right, so the idea is to check whether with the bigger buffer sizes
(and misaligned, although this should not be that often the case in
the kernel) the SSE version would outperform a "rep movs" with ucode
optimizations not kicking in.

With your version modified back to SSE memcpy (don't have an AVX box
right now) I get on an AMD F10h:

...
16384(12/40) 4756.24 7867.74 1.654192552
16384(40/12) 5067.81 6068.71 1.197500008
16384(12/44) 4341.3 8474.96 1.952172387
16384(44/12) 4277.13 7107.64 1.661777347
16384(12/48) 4989.16 7964.54 1.596369011
16384(48/12) 4644.94 6499.5 1.399264281
...

which looks like pretty nice numbers to me. I can't say whether there
ever is 16K buffer we copy in the kernel but if there were... But <16K
buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
As I said, best it would be to put it in the kernel and run a bunch of
benchmarks...

> The modified testcase is attached, I did some optimizations in avx
> memcpy, but I fear I may be missing something, when I tried to put it
> in the kernel, it complained about sata errors I never had before,
> so I immediately went for the power button to prevent more errors,
> fortunately it only corrupted some kernel object files, and btrfs
> threw checksum errors. :)

Well, your version should do something similar to what _mmx_memcpy does:
save FPU state and not execute in IRQ context.

> All in all I think testing in userspace is safer, you might want to
> run it on an idle cpu with schedtool, with a high fifo priority, and
> set cpufreq governor to performance.

No, you need a generic system with default settings - otherwise it is
blatant benchmark lying :-)

Maarten Lankhorst

unread,

Sep 9, 2011, 6:40:01 AM9/9/11

to

Hey,

I think for bigger memcpy's it might make sense to demand stricter
alignment. What are your numbers for (0/0) ? In my case it seems
that kernel memcpy is always faster for that. In fact, it seems
src&63 == dst&63 is generally faster with kernel memcpy.

Patching my tree to WARN_ON_ONCE for when this condition isn't true, I get the following warnings:

WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()

The most persistent one appears to be the btrfs' *_extent_buffer,
it gets the most warnings on my system. Apart from that on my
system there's not much to gain, since the alignment is already
close to optimal.

My ext4 /home doesn't throw warnings, so I'd gain the most
by figuring out if I could improve btrfs/extent_io.c in some way.
The patch for triggering those warnings is below, change to WARN_ON
if you want to see which one happens the most for you.

I was pleasantly surprised though.

>> The modified testcase is attached, I did some optimizations in avx
>> memcpy, but I fear I may be missing something, when I tried to put it
>> in the kernel, it complained about sata errors I never had before,
>> so I immediately went for the power button to prevent more errors,
>> fortunately it only corrupted some kernel object files, and btrfs
>> threw checksum errors. :)
> Well, your version should do something similar to what _mmx_memcpy does:
> save FPU state and not execute in IRQ context.
>
>> All in all I think testing in userspace is safer, you might want to
>> run it on an idle cpu with schedtool, with a high fifo priority, and
>> set cpufreq governor to performance.
> No, you need a generic system with default settings - otherwise it is
> blatant benchmark lying :-)

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..77180bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
#ifndef CONFIG_KMEMCHECK
#if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+ size_t __len = (len); \
+ const void *__src = (src); \
+ void *__dst = (dst); \
+ WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \
+ memcpy(__dst, __src, __len); \
+})
#else
extern void *__memcpy(void *to, const void *from, size_t len);
#define memcpy(dst, src, len) \

Maarten Lankhorst

unread,

Sep 9, 2011, 7:30:01 AM9/9/11

to

Hey just a followup on btrfs,

The btrfs one which happens far more often than all others is read_extent_buffer,
but most of them are page aligned on destination. This means that for me,
avx memcpy might be 10% slower or 10% faster, depending on the specific source
alignment, so avx memcpy wouldn't help much.

This specific one happened far more than any of the other memcpy usages, and
ignoring the check when destination is page aligned, most of them are gone.

In short: I don't think I can get a speedup by using avx memcpy in-kernel.

YMMV, if it does speed up for you, I'd love to see concrete numbers. And not only worst
case, but for the common aligned cases too. Or some concrete numbers that misaligned
happens a lot for you.

~Maarten

Borislav Petkov

unread,

Sep 9, 2011, 9:50:02 AM9/9/11

to

On Fri, Sep 09, 2011 at 01:23:09PM +0200, Maarten Lankhorst wrote:
> This specific one happened far more than any of the other memcpy usages, and
> ignoring the check when destination is page aligned, most of them are gone.
>
> In short: I don't think I can get a speedup by using avx memcpy in-kernel.
>
> YMMV, if it does speed up for you, I'd love to see concrete numbers. And not only worst
> case, but for the common aligned cases too. Or some concrete numbers that misaligned
> happens a lot for you.

Actually,

assuming alignment matters, I'd need to redo the trace_printk run I did
initially on buffer sizes:

http://marc.info/?l=linux-kernel&m=131331602309340 (kernel_build.sizes attached)

to get a more sensible grasp on the alignment of kernel buffers along
with their sizes and to see whether we're doing a lot of unaligned large
buffer copies in the kernel. I seriously doubt that, though, we should
be doing everything pagewise anyway so...

Concerning numbers, I ran your version again and sorted the output by
speedup. The highest scores are:

30037(12/44) 5566.4 12797.2 2.299011642
28672(12/44) 5512.97 12588.7 2.283467991
30037(28/60) 5610.34 12732.7 2.269502799
27852(12/44) 5398.36 12242.4 2.267803859
30037(4/36) 5585.02 12598.6 2.25578257
28672(28/60) 5499.11 12317.5 2.239914033
27852(28/60) 5349.78 11918.9 2.227919527
27852(20/52) 5335.92 11750.7 2.202186795
24576(12/44) 4991.37 10987.2 2.201247446

and this is pretty cool. Here are the (0/0) cases:

8192(0/0) 2627.82 3038.43 1.156255766
12288(0/0) 3116.62 3675.98 1.179475031
13926(0/0) 3330.04 4077.08 1.224334839
14336(0/0) 3377.95 4067.24 1.204055286
15018(0/0) 3465.3 4215.3 1.216430725
16384(0/0) 3623.33 4442.38 1.226050715
24576(0/0) 4629.53 6021.81 1.300737559
27852(0/0) 5026.69 6619.26 1.316823133
28672(0/0) 5157.73 6831.39 1.324495749
30037(0/0) 5322.01 6978.36 1.3112261

It is not 2x anymore but still.

Anyway, looking at the buffer sizes, they're rather ridiculous and even
if we get them in some workload, they won't repeat n times per second to
be relevant. So we'll see...

Thanks.

--
Regards/Gruss,
Boris.

kernel_build.sizes

Linus Torvalds

unread,

Sep 9, 2011, 10:50:03 AM9/9/11

to

On Fri, Sep 9, 2011 at 1:14 AM, Borislav Petkov <b...@alien8.de> wrote:
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were...

Kernel memcpy's are basically almost always smaller than a page size,
because that tends to be the fundamental allocation size.

Yes, there are exceptions that copy into big vmalloc'ed buffers, but
they don't tend to matter. Things like module loading etc.

Linus

Borislav Petkov

unread,

Sep 9, 2011, 11:40:02 AM9/9/11

to

On Fri, Sep 09, 2011 at 07:39:18AM -0700, Linus Torvalds wrote:
> Kernel memcpy's are basically almost always smaller than a page size,
> because that tends to be the fundamental allocation size.

Yeah, this is what my trace of a kernel build showed too:

Bytes Count
===== =====

...

224 3
225 3
227 3
244 1
254 5
255 13
256 21708
512 21746
848 12907
1920 36536
2048 21708

OTOH, I keep thinking that copy_*_user might be doing bigger sizes, for
example when shuffling network buffers to/from userspace. Converting
those to SSE memcpy might not be as easy as memcpy itself, though.

> Yes, there are exceptions that copy into big vmalloc'ed buffers, but
> they don't tend to matter. Things like module loading etc.

Too small a number of repetitions to matter, yes.

--
Regards/Gruss,
Boris.

melwyn lobo

unread,

Dec 5, 2011, 7:30:02 AM12/5/11

to

The driver has a loop of memcpy the source and destination addresses
based on a runtime computed value and confuses the compiler on the
alignement.
So instead of generating neat 32 bit memcpy, gcc generates "rep movsb"
Example code snippet:
src = (char *)kmap(bo->pages[idx]);
src += offset;
memcpy(des, src, len);
Be replacing ssse3 only for memcpy of length larger than 1K bytes (for
my driver typical length are 2k metadata from SRAM to DDR) I think
overheads of FPU save and restore can be forgiven.
Will SSSE3 work for unlaigned pointers as well ? If it doesn't I am
lucky for past 6 months :)

melwyn lobo

unread,

Dec 5, 2011, 8:00:02 AM12/5/11

to

Will AVX work on Intel ATOM. I guess not. Then is this now not the
time for having architecture dependant definitions for basic cpu
intensive tasks

Alan Cox

unread,

Dec 5, 2011, 9:40:01 AM12/5/11

to

> Will AVX work on Intel ATOM. I guess not. Then is this now not the
> time for having architecture dependant definitions for basic cpu
> intensive tasks

It's pretty much a necessity if you want to fine tune some of this.

> > If you want to speed up memcpy, I think your best bet is to find out why it's
> > so much slower when src and dst aren't 64-byte aligned compared to each other.

rep mov on most x86 processors is an extremely optimised path. The 64
byte alignment behaviour is to be expected given the processor cache line
size.

Alan