
REP MOVSB performance (was: Hardware simulation in Forth)


Anton Ertl, Sep 19, 2017, 10:09:42 AM

Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
>On Tue, 12 Sep 2017 08:24:51 GMT
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> REP MOVSB is slow. Very slow.
>
>Do you have any references for that claim?

I was conflating the results of my CMOVE speed tests (which don't use
REP MOVSB, however), with some disappointing experiences that I had
with REP MOVSQ (which was slower than a simple loop for the block size
I used). So I decided to do a more in-depth measurement of REP MOVSB
vs. some alternatives. I wrote a microbenchmark that copies a buffer
to a non-overlapping buffer, with both buffers independently starting
at offsets from 0 to 4095 (for the "aligned" results, offsets are
aligned to 32 bytes); the copying is done with REP MOVSB, and libc's
memmove, and memcpy.
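
For reference, the REP MOVSB copy boils down to something like the
following GCC inline asm (a sketch only; repmovsb_copy is a
hypothetical name, the actual benchmark code is in the package linked
below):

#include <stddef.h>

/* copy n bytes with REP MOVSB: RDI = dst, RSI = src, RCX = count */
static void repmovsb_copy(void *dst, const void *src, size_t n)
{
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      : : "memory");
}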

You find the benchmark on
<http://www.complang.tuwien.ac.at/anton/move/> (not in a
nice-to-download package yet).

You find the results below, and my observations here:

* REP MOVSB is slower than memcpy for some block sizes (especially
<1KB) on all platforms, and for all block sizes on some platforms
(Penryn, Sandy Bridge, unaligned Ivy Bridge, Zen), and often not
just by a little. In theory the hardware people should know how to
get the best performance out of their hardware, but in practice,
that seems hard to achieve.

* Aligned buffers help REP MOVSB a lot, surprisingly especially at
larger block sizes. I would have expected that hardware can deal
with that better than software, which needs (predicted) branches to
deal with that efficiently. Once you pay for misalignment, an odd
block size does not cost extra.

* Startup overhead is high for REP MOVSB; some implementations are
faster for one byte, but then even slower for 8 bytes. On balance, if
I had to
choose between REP MOVSB and an implementation that eschews REP
MOVSB, I would choose the latter, because of the bad performance for
small block sizes. Viewed another way, thanks to the startup
overhead I have to implement something relatively complex for CMOVE
that may use REP MOVSB, but only for large block sizes.

* There is a surprising gap between memcpy and memmove performance;
sometimes memcpy is faster, sometimes memmove. In theory, for this
benchmark memcpy should never be slower than memmove, and memmove
should only be slower by a three-instruction sequence that contains
a predictable loop (so the actual copying code can start right
away). Also, in those cases where REP MOVSB is faster, memmove and
memcpy should use it (in this benchmark), and the extra cost should
just be a few checks.

Looking at these results, it is all the more ridiculous to have a
memcpy separate from memmove. If the effort spent on the separate
routines went into a memmove that uses rep movsb where profitable,
both routines would perform better.

* Enhanced REP MOVSB/STOSB (starting with Ivy Bridge; CPU flag erms;
see the cpuid sketch after this list) is mentioned as a feature in
Intel's optimization manual, but the
difference between Sandy Bridge and Ivy Bridge in REP MOVSB
performance is not bigger than other differences that do not get a
separate flag. The biggest difference is seen at the lower counts,
e.g., 53 (Ivy) vs. 173 cycles for blocksize 128.

* repmovsb (unaligned) has a 22x cycle count improvement between
Penryn (2007) and Skylake (2015). The cycle count improvement from
K8 (2003/2005) to Zen (2017) on repmovsb aligned is a factor of 15.
So there is still a lot of progress in some areas.

* The improvement in memmove/memcpy performance from glibc 2.3.6/glibc
2.7 to glibc 2.24 is probably for a good part in the software and
for a smaller part in the hardware. I cannot run a newer statically
linked binary on an older kernel ("Fatal: kernel too old"), so I
built a statically linked binary on the glibc 2.3.6 system, and ran
it on the Zen hardware. The glibc 2.24 memmove is faster by a
factor of about 3 for the larger block sizes, and not quite a factor
of 2 for memcpy. The better memmove/memcpy cycle count over K8 is
due to this software improvement and a factor of almost 2 hardware
improvement.

* It is strange that memmove is close to memcpy on Haswell and
Skylake, but is much slower on Zen. Different code paths at work?
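
The erms flag mentioned above can be checked via CPUID leaf 7,
subleaf 0, EBX bit 9; a minimal sketch (have_erms is a hypothetical
helper, not part of the benchmark):

#include <stdio.h>

/* returns 1 if the CPU reports Enhanced REP MOVSB/STOSB (erms) */
static int have_erms(void)
{
    unsigned int eax = 7, ebx, ecx = 0, edx;
    __asm__ ("cpuid" : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
    return (ebx >> 9) & 1;
}

int main(void)
{
    printf("erms: %d\n", have_erms());
    return 0;
}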

Things that this microbenchmark does not cover, and that may have a
significant influence on performance:

* Using the results; supposedly REP MOVSB has an advantage there because
of weaker ordering requirements of the stores (or is that about
independent instructions? the optimization manual is unclear). I
have not seen any benchmark that demonstrates that.

* In real applications other code will compete for I-cache space with
the monstrous implementations of memmove and memcpy in glibc (one
memmove I looked at had 11KB of machine code).

* This microbenchmark uses the same block size all the time, which is
a good case for branch prediction for memmove and memcpy. A less
predictable size may slow down memmove and memcpy (and possibly some
implementations of REP MOVSB).

You can find more discussion on these issues on
<https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy>.

Results are in cycles per iteration (i.e. buffer copying work plus
some loop and call overhead).

Penryn (Xeon 5450), glibc 2.7
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
21 86 104 142 221 378 691 1319 2575 5086 10106 21276 repmovsb
16 30 68 97 97 135 211 362 665 1287 2499 5031 memmove
20 21 39 48 72 120 210 391 853 1685 3360 6773 memcpy
21 85 103 135 175 195 234 314 472 789 1424 2875 repmovsb aligned
16 30 35 39 47 60 94 160 291 554 1105 2646 memmove aligned
20 20 19 20 26 47 81 164 360 653 1239 2693 memcpy aligned
21 86 103 141 220 377 690 1318 2573 5084 10108 21275 repmovsb blksz-1
18 28 56 77 82 120 198 348 651 1276 2499 5015 memmove blksz-1
21 18 29 49 72 120 210 389 851 1682 3357 6771 memcpy blksz-1

Sandy Bridge (Xeon E3-1220) eglibc 2.11.3
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
19 83 100 129 174 183 206 268 398 653 1164 2236 repmovsb
14 28 44 56 79 127 230 430 830 1674 3287 6521 memmove
18 19 29 31 37 49 87 161 261 459 857 1703 memcpy
18 81 100 129 173 179 195 228 301 448 737 1357 repmovsb aligned
15 28 31 35 38 46 76 141 267 550 1075 2151 memmove aligned
19 19 17 17 23 35 65 125 194 314 555 1086 memcpy aligned
18 83 99 128 174 181 205 267 397 651 1162 2233 repmovsb blksz-1
16 26 42 54 77 126 226 426 833 1675 3286 6523 memmove blksz-1
19 16 15 32 36 50 86 161 260 459 858 1705 memcpy blksz-1

Ivy Bridge (Core i3-3227U), glibc 2.23
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
41 41 42 42 54 61 75 117 218 421 838 1658 repmovsb
14 14 15 15 17 45 64 102 173 319 615 1437 memmove
17 19 13 17 20 34 53 90 166 338 647 1439 memcpy
42 41 41 42 53 60 71 96 158 287 557 1093 repmovsb aligned
13 13 14 14 15 27 42 72 136 265 545 1341 memmove aligned
16 18 12 16 18 30 47 79 153 291 551 1241 memcpy aligned
53 41 42 42 54 68 82 123 225 427 833 1656 repmovsb blksz-1
14 14 15 15 18 45 63 102 172 319 614 1434 memmove blksz-1
17 20 13 17 20 34 53 91 166 338 647 1438 memcpy blksz-1

Haswell (Core i7-4690K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
38 38 38 38 45 51 64 100 171 306 576 1135 repmovsb
10 10 11 11 14 30 48 86 149 282 567 1414 memmove
11 12 9 12 15 29 48 86 167 324 628 1415 memcpy
39 39 39 39 46 50 58 74 106 170 298 581 repmovsb aligned
11 11 12 12 13 26 38 67 132 260 531 1362 memmove aligned
12 13 10 15 15 24 37 69 134 277 534 1236 memcpy aligned
50 38 38 38 47 52 66 104 175 310 579 1148 repmovsb blksz-1
10 10 11 11 15 29 47 83 149 280 567 1374 memmove blksz-1
10 11 9 12 15 29 48 86 161 324 628 1417 memcpy blksz-1

Skylake (Core i5-6600K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
33 33 33 33 40 44 54 76 130 237 460 974 repmovsb
10 10 10 10 12 24 40 75 145 302 570 1384 memmove
11 12 8 10 13 26 45 84 160 312 606 1316 memcpy
33 33 33 33 41 45 53 69 101 175 302 564 repmovsb aligned
11 11 11 11 12 24 37 72 141 285 558 1369 memmove aligned
13 14 10 12 15 23 40 75 151 288 562 1267 memcpy aligned
60 33 33 33 43 47 57 78 132 238 460 952 repmovsb blksz-1
10 10 11 11 12 24 40 75 145 301 570 1411 memmove blksz-1
10 11 8 10 13 26 45 84 164 312 606 1347 memcpy blksz-1

Goldmont (Celeron J3455), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
49 48 48 50 54 63 81 123 213 392 831 2681 repmovsb
10 8 8 19 19 37 66 109 206 398 861 2700 memmove
10 8 8 19 19 37 65 109 206 398 863 2699 memcpy
49 48 48 50 54 62 78 111 177 309 635 2130 repmovsb aligned
11 9 9 19 19 37 65 106 197 312 633 2157 memmove aligned
11 9 9 19 19 37 65 106 197 312 634 2157 memcpy aligned
38 53 64 66 70 78 95 137 226 405 831 2689 repmovsb blksz-1
10 9 8 13 19 37 65 109 206 409 835 2714 memmove blksz-1
10 9 8 13 19 37 65 109 206 409 829 2706 memcpy blksz-1

K8 (Athlon 64 X2 4400+), glibc 2.3.6
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
21 28 54 90 162 307 595 1171 2325 4632 9244 18467 repmovsb
17 40 69 80 104 161 253 433 794 1514 2955 5836 memmove
24 31 57 82 98 129 199 323 570 1064 2053 4032 memcpy
21 28 53 87 155 292 566 1113 2206 4394 8768 17516 repmovsb aligned
17 40 33 37 46 68 118 234 451 834 1635 3237 memmove aligned
24 31 56 45 54 72 120 193 338 627 1207 2367 memcpy aligned
17 27 53 89 161 306 594 1171 2325 4629 9248 18461 repmovsb blksz-1
17 37 61 81 105 152 251 433 792 1513 2952 5825 memmove blksz-1
20 30 56 83 100 130 202 325 572 1067 2054 4030 memcpy blksz-1

K10 (Phenom II X2 560), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
15 22 48 84 157 309 566 1080 2107 4161 8270 16487 repmovsb
16 35 56 69 104 152 262 456 839 1604 3135 6201 memmove
16 19 13 19 31 68 114 226 408 774 1505 2968 memcpy
14 21 48 85 158 122 154 219 348 606 1122 2155 repmovsb aligned
16 39 35 38 46 63 95 190 364 664 1268 2583 memmove aligned
19 21 13 20 25 56 89 177 306 566 1084 2121 memcpy aligned
14 21 47 83 155 300 565 1079 2106 4160 8269 16487 repmovsb blksz-1
17 32 55 68 91 156 261 454 837 1602 3131 6190 memmove blksz-1
17 23 13 18 30 69 114 228 411 774 1508 2966 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 33 57 105 110 119 140 184 321 599 1160 2324 repmovsb
13 14 13 14 30 42 65 107 175 325 600 1222 memmove
10 10 11 12 30 43 67 113 185 329 604 1226 memcpy
25 33 57 83 87 95 111 143 207 335 594 1136 repmovsb aligned
12 13 12 13 16 24 40 72 136 264 536 1094 memmove aligned
11 11 12 11 21 27 42 74 139 267 541 1092 memcpy aligned
23 32 56 90 110 120 140 184 321 600 1160 2324 repmovsb blksz-1
13 13 14 13 30 42 67 108 176 325 599 1219 memmove blksz-1
10 10 11 12 31 43 67 113 185 331 604 1221 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.3.6 (-static)
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 32 56 106 111 119 140 184 321 600 1161 2334 repmovsb
10 18 29 36 49 77 132 263 501 940 1816 3581 memmove
26 34 59 80 88 102 133 198 342 599 1114 2182 memcpy
25 33 56 85 89 97 113 145 209 337 595 1145 repmovsb aligned
10 18 20 19 24 40 72 137 286 542 1054 2110 memmove aligned
26 34 59 50 55 70 100 165 311 567 1079 2126 memcpy aligned
22 32 56 90 111 119 142 184 321 600 1161 2338 repmovsb blksz-1
8 16 29 36 49 76 131 261 499 938 1814 3582 memmove blksz-1
24 33 58 82 88 101 134 198 345 602 1117 2184 memcpy blksz-1

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: http://www.forth200x.org/forth200x.html
EuroForth 2017: http://euro.theforth.net/

Terje Mathisen, Sep 19, 2017, 10:46:11 AM

Anton Ertl wrote:
> Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
>> On Tue, 12 Sep 2017 08:24:51 GMT
>> an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>
>>> REP MOVSB is slow. Very slow.
>>
>> Do you have any references for that claim?
>
> I was conflating the results of my CMOVE speed tests (which don't use
> REP MOVSB, however), with some disappointing experiences that I had
> with REP MOVSQ (which was slower than a simple loop for the block size
> I used). So I decided to do a more in-depth measurement of REP MOVSB
> vs. some alternatives. I wrote a microbenchmark that copies a buffer
> to a non-overlapping buffer, with both buffers independently starting
> at offsets from 0 to 4095 (for the "aligned" results, offsets are
> aligned to 32 bytes); the copying is done with REP MOVSB, and libc's
> memmove, and memcpy.
>
> You find the benchmark on
> <http://www.complang.tuwien.ac.at/anton/move/> (not in a
> nice-to-download package yet).
>
> You find the results below, and my observations here:

This is wonderful work Anton, very nice.

The only thing I'm really missing is a few odd block sizes, i.e.
memcpy(a, b, len) where len is 0..31 modulo 32.

Particularly for relatively small lengths of misaligned buffers I would
expect modern ("fast strings") hardware to beat most sw implementations.

Handling a misaligned starting point, a couple of 32-byte blocks and a
misaligned tail end would be somewhat painful in sw. Using masked writes
to handle the tail might even need a lookup table for the mask.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Anton Ertl, Sep 19, 2017, 11:28:13 AM

Terje Mathisen <terje.m...@tmsw.no> writes:
>Anton Ertl wrote:
>> You find the benchmark on
>> <http://www.complang.tuwien.ac.at/anton/move/> (not in a
>> nice-to-download package yet).

Nice to download now available:

http://www.complang.tuwien.ac.at/anton/move/move.zip

>The only thing I'm really missing is a few odd block sizes, i.e.
>memcpy(a, b, len) where len is 0..31 modulo 32.

The blksz-1 results give 31 mod 32 for block sizes of 32 and
higher, and also provide len=7. Given that they are close to the
(unaligned blksize) results, I did not bother looking for other odd
block sizes.

But if you want more, just change the Makefile.

>Particularly for relatively small lengths of misaligned buffers I would
>expect modern ("fast strings") hardware to beat most sw implementations.

I had expected that, too, but unfortunately, that's not the case.

>Handling a misaligned starting point, a couple of 32-byte blocks and a
>misaligned tail end would be somewhat painful in sw. Using masked writes
>to handle the tail might even need a lookup table for the mask.

Do we have masked writes before AVX512? Are they efficient?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

already...@yahoo.com, Sep 19, 2017, 12:50:53 PM

On Tuesday, September 19, 2017 at 6:28:13 PM UTC+3, Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
> >Anton Ertl wrote:
> >> You find the benchmark on
> >> <http://www.complang.tuwien.ac.at/anton/move/> (not in a
> >> nice-to-download package yet).
>
> Nice to download now available:
>
> http://www.complang.tuwien.ac.at/anton/move/move.zip
>
> >The only thing I'm really missing is a few odd block sizes, i.e.
> >memcpy(a, b, len) where len is 0..31 modulo 32.
>
> the blksz-1 results give 31 mod 32 for the blocksizes of 32 and
> higher, and also provide len=7. Given that they are close to the
> (unaligned blksize) results, I did not bother looking for other odd
> block sizes.
>
> But if you want more, just change the Makefile.
>
> >Particularly for relatively small lengths of misaligned buffers I would
> >expect modern ("fast strings") hardware to beat most sw implementations.
>
> I had expected that, too, but unfortunately, that's not the case.
>
> >Handling a misaligned starting point, a couple of 32-byte blocks and a
> >misaligned tail end would be somewhat painful in sw. Using masked writes
> >to handle the tail might even need a lookup table for the mask.
>
> Do we have masked writes before AVX512?

For 4B granularity - yes.
vmaskmovps is a part of original AVX.

For 1B granularity there exist maskmovq (available since SSE) and maskmovdqu (SSE2), but both generate non-temporal stores, so using them for small memory copies is probably a bad idea.

> Are they efficient?
>

vmaskmovps is good enough for handling of last SIMD word of the line in something like SGEMM. Which is not too demanding.

Terje Mathisen, Sep 19, 2017, 1:21:15 PM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> Particularly for relatively small lengths of misaligned buffers I
>> would expect modern ("fast strings") hardware to beat most sw
>> implementations.
>
> I had expected that, too, but unfortunately, that's not the case.
:-(
>
>> Handling a misaligned starting point, a couple of 32-byte blocks
>> and a misaligned tail end would be somewhat painful in sw. Using
>> masked writes to handle the tail might even need a lookup table for
>> the mask.
>
> Do we have masked writes before AVX512? Are they efficient?

Sure!

MASKMOVDQU writes 16 bytes using another 16-byte mask value to determine
which bytes to actually write (using the top bit in each byte), with the
destination address being the same as for STOS/MOVS i.e. the DI/EDI/RDI
register.

This operation has been available since SSE2, so every single 64-bit
capable cpu is guaranteed to have this opcode.

With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.

Since it allows unaligned target addresses it should be possible to use
this directly for both the first and last block of a memcpy() operation.
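
The SSE2 intrinsic for this is _mm_maskmoveu_si128; a minimal sketch
(masked_store16 and the mask construction are my own illustration, and
note the non-temporal-store caveats discussed below):

#include <emmintrin.h>

/* store the first n (0..16) bytes of x to dst; a byte is written when
   the top bit of its mask byte is set */
static void masked_store16(char *dst, __m128i x, int n)
{
    const __m128i idx = _mm_setr_epi8(0,1,2,3,4,5,6,7,
                                      8,9,10,11,12,13,14,15);
    __m128i mask = _mm_cmpgt_epi8(_mm_set1_epi8((char)n), idx);
    _mm_maskmoveu_si128(x, mask, dst);
}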

timca...@aol.com, Sep 19, 2017, 3:23:55 PM

From the Intel manual:

The MASKMOVDQU instruction generates a non-temporal hint to the processor to minimize cache pollution. The non-temporal hint is implemented by using a write combining (WC) memory type protocol (see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10, of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1). Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MASKMOVDQU instructions if multiple processors might use different memory types to read/write the destination memory locations.

Behavior with a mask of all 0s is as follows:
• No data will be written to memory.
• Signaling of breakpoints (code or data) is not guaranteed; different processor implementations may signal or not signal these breakpoints.
• Exceptions associated with addressing memory and page faults may still be signaled (implementation dependent).
• If the destination memory region is mapped as UC or WP, enforcement of associated semantics for these memory types is not guaranteed (that is, is reserved) and is implementation-specific.

Anton Ertl, Sep 20, 2017, 3:43:29 AM

Terje Mathisen <terje.m...@tmsw.no> writes:
>> Do we have masked writes before AVX512? Are they efficient?
>
>Sure!
>
>MASKMOVDQU writes 16 bytes using another 16-byte mask value to determine
>which bytes to actually write (using the top bit in each byte), with the
>destination address being the same as for STOS/MOVS i.e. the DI/EDI/RDI
>register.
>
>This operation has been available since SSE2, so every single 64-bit
>capable cpu is guaranteed to have this opcode.
>
>With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.

According to <http://www.felixcloutier.com/x86/MASKMOVDQU.html>, only
the 128-bit version is allowed. Given that both the SSE2 and the
AVX128 version support 2 operands, I wonder what the difference
between these versions is.

- anton
--

Terje Mathisen, Sep 20, 2017, 4:28:12 AM

Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> This operation has been available since SSE2, so every single 64-bit
>> capable cpu is guaranteed to have this opcode.
>>
>> With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.
>
> According to <http://www.felixcloutier.com/x86/MASKMOVDQU.html>, only
> the 128-bit version is allowed. Given that both the SSE2 and the
> AVX128 version support 2 operands, I wonder what the difference
> between these versions is.

Huh? That means the only difference is that you have twice as many
working registers, but that's almost certainly not the limiter for a
fast memcpy() implementation unless you get into really big blocks on a
memory system with a huge turnaround penalty:

32 16-byte regs give you 512 bytes of buffer space, so still not into
page size range...

Anton Ertl, Sep 20, 2017, 6:30:23 AM

already...@yahoo.com writes:
>On Tuesday, September 19, 2017 at 6:28:13 PM UTC+3, Anton Ertl wrote:
>> Do we have masked writes before AVX512?
>
>For 4B granularity - yes.
>vmaskmovps is a part of original AVX.
>
>For 1B granularity there exist maskmovq (available since SSE) and maskmovdqu (SSE2), but both generates non-temporal stores, so using them for small memory copy is probably a bad idea.

Lots of restrictions on that one :-(. Otherwise it would be ideal for
dealing with the first, last, or only bytes of a memcpy/memmove with
little branching.

What else can we do? If the block is larger than 32 bytes, we can
just do an unaligned 32-byte store at the start address, followed by
aligned stores, followed by an unaligned 32-byte store for the final
bytes; I hope the store buffer logic does ok for these partial
overwrites. For shorter blocks, we can use the same scheme with lower
granularity. Unfortunately, selecting between granularities requires
branches, and possibly incurs mispredictions. We probably can use
vmaskmovps to reduce the amount of branching needed and the overall
code size.
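
A rough sketch of that scheme for non-overlapping buffers and n >= 32
(my own illustration, not the benchmarked code; copy_ge32 is a
hypothetical name):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static void copy_ge32(char *dst, const char *src, size_t n)
{
    /* unaligned 32-byte copies at both ends; they may overlap the
       aligned stores in the middle */
    __m256i head = _mm256_loadu_si256((const __m256i *)src);
    __m256i tail = _mm256_loadu_si256((const __m256i *)(src + n - 32));
    char *d = dst + 32 - ((uintptr_t)dst & 31);  /* first aligned store */
    const char *s = src + (d - dst);
    char *dend = dst + n - 32;

    _mm256_storeu_si256((__m256i *)dst, head);
    for (; d < dend; d += 32, s += 32)
        _mm256_store_si256((__m256i *)d,
                           _mm256_loadu_si256((const __m256i *)s));
    _mm256_storeu_si256((__m256i *)dend, tail);
}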

Terje Mathisen, Sep 20, 2017, 6:54:50 AM

Anton Ertl wrote:
> already...@yahoo.com writes:
>> On Tuesday, September 19, 2017 at 6:28:13 PM UTC+3, Anton Ertl wrote:
>>> Do we have masked writes before AVX512?
>>
>> For 4B granularity - yes.
>> vmaskmovps is a part of original AVX.
>>
>> For 1B granularity there exist maskmovq (available since SSE) and maskmovdqu (SSE2), but both generates non-temporal stores, so using them for small memory copy is probably a bad idea.
>
> Lots of restrictions on that one:-(. Otherwise it would be ideal for
> dealing with the first, last, or only bytes of a memcpy/memmove with
> little branching.
>
> What else can we do? If the block is larger than 32 bytes, we can
> just do an unaligned 32-byte store at the start address, followed by
> aligned stores, followed by an unaligned 32-byte store for the final
> bytes; I hope the store buffer logic does ok for these partial
> overwrites. For shorter blocks, we can use the same scheme with lower
> granularity. Unfortunately, selecting between granularities requires
> branches, and possibly incurs mispredictions. We probably can use
> vmovmaskps to reduce the amount of branching needed and the overall
> code size.

I really, really want REP MOVSB to be(come) intelligent enough to avoid
multi-cycle startup overhead and do the actual transfer as a bunch of
cache line load/store operations, even if that requires a hefty byte
shifter to handle relative misalignment.

already...@yahoo.com, Sep 20, 2017, 7:08:43 AM

On Wednesday, September 20, 2017 at 11:28:12 AM UTC+3, Terje Mathisen wrote:
> Anton Ertl wrote:
> > Terje Mathisen <terje.m...@tmsw.no> writes:
> >> This operation has been available since SSE2, so every single 64-bit
> >> capable cpu is guaranteed to have this opcode.
> >>
> >> With AVX you get the corresponding 256-bit VMASKMOVDQU operation as well.
> >
> > According to <http://www.felixcloutier.com/x86/MASKMOVDQU.html>, only
> > the 128-bit version is allowed. Given that both the SSE2 and the
> > AVX128 version support 2 operands, I wonder what the difference
> > between these versions is.
>
> Huh? That means the only difference is that you have twice as many
> working registers,

No, you don't. It's an AVX instruction we are talking about, not AVX256.
Functionality appears to be exactly the same, only encoding is different.

Anton Ertl, Sep 20, 2017, 11:49:07 AM

Terje Mathisen <terje.m...@tmsw.no> writes:
>I really, really want REP MOVSB to be(come) intelligent enough to avoid
>multi-cycle startup overhead and do the actual transfer as a bunch of
>cache line load/store operations, even if that requires a hefty byte
>shifter to handle relative misalignment.

The hefty byte shifters are already there, to handle misaligned AVX
loads and stores. The main missing thing is the logic to generate the
appropriate (possibly masked) loads, stores, and masks for the first
and last stores. One might also think about a bit of hardware to
avoid loading the same 32-byte line twice, or alternatively, to allow
consecutive unaligned stores at one store per cycle (with some
store-buffer smarts), and I guess there is relatively little extra
hardware necessary.

Anton Ertl, Sep 20, 2017, 12:54:34 PM

an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Results are in cycles per iteration (i.e. buffer copying work plus
>some loop and call overhead).

New results: Bonnell (inserted before Goldmont), and Excavator
(between K10 and Zen).

Bonnell (Atom 330), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
40 54 102 166 295 552 1066 2094 4150 8263 16492 38320 repmovsb
55 58 74 98 154 195 246 346 535 913 1669 4560 memmove
47 50 66 89 151 189 239 339 529 906 1662 4531 memcpy
40 54 102 167 296 554 1070 2103 4169 8299 16567 38166 repmovsb aligned
54 54 60 71 109 126 158 222 365 622 1135 4485 memmove aligned
46 46 52 63 97 115 147 212 357 614 1127 4187 memcpy aligned
35 52 100 164 293 550 1064 2092 4148 8260 16489 38315 repmovsb blksz-1
52 60 74 98 153 195 250 346 536 914 1669 4553 memmove blksz-1
44 52 67 89 151 189 240 339 529 906 1662 4531 memcpy blksz-1

Excavator (Athlon X4 845), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 32 66 113 123 139 169 237 383 675 1306 2833 repmovsb
19 19 20 19 24 65 67 108 189 361 719 1982 memmove
16 16 21 21 24 67 86 120 211 376 732 2036 memcpy
26 34 68 89 98 115 126 162 262 405 746 2028 repmovsb aligned
20 20 21 21 19 32 55 97 167 312 674 1987 memmove aligned
22 22 23 24 25 31 53 96 169 325 664 1989 memcpy aligned
24 31 65 116 122 139 168 236 382 673 1304 2884 repmovsb blksz-1
18 19 20 21 24 63 67 107 190 352 713 2000 memmove blksz-1
19 16 21 22 25 66 87 119 212 376 730 2026 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 33 57 105 110 119 140 184 321 599 1160 2324 repmovsb
13 14 13 14 30 42 65 107 175 325 600 1222 memmove
10 10 11 12 30 43 67 113 185 329 604 1226 memcpy
25 33 57 83 87 95 111 143 207 335 594 1136 repmovsb aligned
12 13 12 13 16 24 40 72 136 264 536 1094 memmove aligned
11 11 12 11 21 27 42 74 139 267 541 1092 memcpy aligned
23 32 56 90 110 120 140 184 321 600 1160 2324 repmovsb blksz-1
13 13 14 13 30 42 67 108 176 325 599 1219 memmove blksz-1
10 10 11 12 31 43 67 113 185 331 604 1221 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.3.6 (-static)
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 32 56 106 111 119 140 184 321 600 1161 2334 repmovsb
10 18 29 36 49 77 132 263 501 940 1816 3581 memmove
26 34 59 80 88 102 133 198 342 599 1114 2182 memcpy
25 33 56 85 89 97 113 145 209 337 595 1145 repmovsb aligned
10 18 20 19 24 40 72 137 286 542 1054 2110 memmove aligned
26 34 59 50 55 70 100 165 311 567 1079 2126 memcpy aligned
22 32 56 90 111 119 142 184 321 600 1161 2338 repmovsb blksz-1
8 16 29 36 49 76 131 261 499 938 1814 3582 memmove blksz-1
24 33 58 82 88 101 134 198 345 602 1117 2184 memcpy blksz-1

- anton
--

Anton Ertl, Sep 23, 2017, 1:12:37 PM

already...@yahoo.com writes:
>On Tuesday, September 19, 2017 at 6:28:13 PM UTC+3, Anton Ertl wrote:
>> Terje Mathisen <terje.m...@tmsw.no> writes:
>> >Handling a misaligned starting point, a couple of 32-byte blocks and a
>> >misaligned tail end would be somewhat painful in sw. Using masked writes
>> >to handle the tail might even need a lookup table for the mask.

Actually, the only problem is short blocks. For longer ones, my
solution just uses unaligned full-width loads and stores on start and
end, and unaligned loads and aligned stores in between, with the start
and end handling possibly overlapping the in-between stuff.

>> Do we have masked writes before AVX512?
...
>vmaskmovps is good enough for handling of last SIMD word of the line in something like SGEMM. Which is not too demanding.

It turns out that this very much depends on the implementation. On
Skylake, it's ok, Haswell not so great, and on Zen it is abysmally
slow.

I implemented two variants of memcpy, one ("sse") using MOVDQU (SSE2),
and using a lot of different cases for short blocks (basically one
case for every power of 2); the other ("avx") uses AVX and
VMASKMOVPS to cover everything between block lengths 5 and 63, and
therefore has fewer cases and fewer branches.
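
The masked-copy idea can be sketched as follows (my own simplified
illustration for lengths that are multiples of 4, up to 32 bytes;
mask_tab and copy_4n are hypothetical names; the actual "avx" code
covers lengths 5-63 with two masked loads and two masked stores, as
noted below):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* 8 ones followed by 8 zeros; loading 8 ints at offset 8-lanes yields
   a mask with the first `lanes' 4-byte lanes enabled */
static const int32_t mask_tab[16] __attribute__((aligned(64))) = {
    -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0
};

static void copy_4n(char *dst, const char *src, size_t n)
{
    size_t lanes = n / 4;   /* 0..8 */
    __m256i m = _mm256_loadu_si256((const __m256i *)(mask_tab + (8 - lanes)));
    __m256 x = _mm256_maskload_ps((const float *)src, m);
    _mm256_maskstore_ps((float *)dst, m, x);
}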

I also wrote another benchmark "random" that varies the block lengths
among a number of given block lengths. I use one set of block lengths
("anti-sse") that covers all the cases of the "sse" implementation,
and a "anti-avx" that covers all the cases of the "avx"
implementation. These variants are intended to produce the worst-case
branch prediction for "sse" and "avx", and to check this, in addition
to the cycles per memcpy, also the branch mispredictions per memcpy
are reported. Note that the other benchmarks use the same block
length for the whole run and are a best case for the branch predictor.
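
A sketch of how such a driver might look (my own illustration; the
table size matches the 4K-entry table mentioned further down, but
lentab, fill_lentab, and run are hypothetical names; the real code is
in move.zip):

#include <stdlib.h>
#include <string.h>

#define TABSIZE 4096

static size_t lentab[TABSIZE];

/* fill the table with lengths picked randomly from the given set */
static void fill_lentab(const size_t *set, size_t nset)
{
    for (size_t i = 0; i < TABSIZE; i++)
        lentab[i] = set[rand() % nset];
}

/* each call sees a hard-to-predict block length */
static void run(char *dst, const char *src, long iters)
{
    for (long i = 0; i < iters; i++)
        memcpy(dst, src, lentab[i % TABSIZE]);
}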

I'll spare you (and me) the results for all the machines, and just
present some recent ones: Skylake, Haswell, and Zen.

The "sse" implementation is competetive to (and, on Zen, faster than)
glibc's memcpy (which also uses SSE AFAIK), and it weighs in at 202
bytes (when compiled with gcc-4.9 -O).

The "avx" implementation produces the best results at most block sizes
on the Intel chips, and also costs 202 bytes of code (plus 64 bytes of
data) when compiled with gcc-4.9 -O. It is beaten by "repmovsb" at
block size 16k, probably because avx suffers L1 conflict misses there,
while repmovsb does not store into L1.

On Zen the "avx" version has a similar speed to "sse", because Zen has
a 16-byte-wide memory system (one might expect that an unaligned AVX
load causes only 3 memory accesses (compared to 4 for two unaligned
SSE loads), but there is no speedup of avx over sse visible for the
unaligned cases). On Zen we see a huge slowdown for the cases where
vmaskmovps is used, so that instruction seems to be quite slow (4
vmaskmovps instructions are used in these cases, two loads and two
stores). I guess I should have a variant of "avx" that uses more
cases in order to avoid vmaskmovps.

Looking at the "random" results, all the software implementations are
still below the startup overhead of REP MOVSB despite suffering branch
mispredictions. The avx version has lower branch mispredictions (even
on the anti-avx version) and is faster than the sse version on this
benchmark on the Intel chips, so a fast vmovmaskps does pay off.

Still, even anti-sse is not as bad for sse as I would have expected.
There are around three branches used to reach each case, and I would
expect a 50% branch miss rate from random case selection, i.e., 1.5
mispredictions. Maybe the 4K entry table of random case-selecting
block lengths in the random benchmark is too small to thrash the
branch predictor.

In any case, as long as the hardware manufacturers don't improve their
REP MOVSB implementations to reduce the startup overhead, software
implementations are doing fine even when the block lengths vary.

One oddity occurred when I compiled the code on a machine with gcc-7.2.
Instead of compiling the intrinsics into one AVX256 instruction each
as I had intended, gcc-7.2 compiled

x = _mm256_loadu_si256((__m256i *)(dlast+off));
_mm256_storeu_si256((__m256i *)dlast, x);

into

ac: c5 fa 6f 04 37 vmovdqu (%rdi,%rsi,1),%xmm0
b1: c4 e3 7d 18 44 37 10 vinsertf128 $0x1,0x10(%rdi,%rsi,1),%ymm0,%ymm0
b8: 01
b9: c5 f8 11 07 vmovups %xmm0,(%rdi)
bd: c4 e3 7d 19 47 10 01 vextractf128 $0x1,%ymm0,0x10(%rdi)

and the result is significantly slower on this (Zen-based) machine
than the code produced by gcc-6.3, which happens to translate as I
intended.

You can find the code at
http://www.complang.tuwien.ac.at/anton/move/move.zip

Haswell (Core i7-4790K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
39 39 39 39 46 52 64 99 171 306 575 1132 repmovsb
14 14 15 15 17 30 48 85 150 281 570 1370 memmove
15 16 13 16 19 32 48 86 161 327 631 1420 memcpy
10 11 14 16 21 30 50 86 167 317 619 1381 sse
10 14 14 18 19 24 35 55 98 195 379 880 avx
38 38 38 38 45 49 57 73 105 169 297 581 repmovsb aligned
14 14 15 15 16 26 38 68 132 278 535 1340 memmove aligned
15 16 13 16 18 27 38 77 134 282 539 1240 memcpy aligned
10 11 12 16 22 30 46 78 149 275 531 1173 sse aligned
10 15 13 15 15 21 29 61 77 150 277 647 avx aligned
51 39 39 39 48 53 67 104 175 309 579 1144 repmovsb blksz-1
14 14 15 15 18 29 48 85 150 281 554 1377 memmove blksz-1
14 15 13 16 19 32 48 86 161 327 631 1426 memcpy blksz-1
11 11 13 17 21 30 50 87 167 317 619 1380 sse blksz-1
10 14 14 13 18 24 35 55 98 196 379 881 avx blksz-1
anti-avx anti-sse
43 0.00 39 0.00 repmovsb random
32 0.84 33 0.89 memmove random
31 0.85 31 0.86 memcpy random
27 0.81 28 1.00 sse random
26 0.73 22 0.44 avx random

Skylake (Core i5-6600K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
32 32 32 32 40 44 54 76 131 236 478 938 repmovsb
14 14 14 14 15 27 43 77 147 305 573 1417 memmove
13 14 10 12 14 27 46 85 165 313 607 1350 memcpy
8 10 12 14 18 25 41 90 161 309 604 1329 sse
8 10 10 14 13 17 25 42 94 157 307 728 avx
32 32 32 32 40 44 54 68 101 174 301 563 repmovsb aligned
14 14 14 14 15 25 38 73 142 288 561 1376 memmove aligned
14 15 11 13 14 24 41 76 153 289 563 1267 memcpy aligned
10 12 14 16 20 25 38 91 149 286 560 1225 sse aligned
10 12 12 15 14 18 24 40 90 143 276 645 avx aligned
61 32 32 32 43 47 57 78 131 238 459 952 repmovsb blksz-1
14 14 15 15 16 26 42 77 146 304 572 1414 memmove blksz-1
12 13 10 12 14 27 46 85 165 313 607 1355 memcpy blksz-1
9 10 12 14 18 24 41 87 161 309 604 1336 sse blksz-1
11 10 10 10 13 17 25 42 94 157 308 737 avx blksz-1
anti-avx anti-sse
39 0.00 35 0.00 repmovsb random
34 0.84 34 0.89 memmove random
30 0.81 30 0.84 memcpy random
26 0.77 29 0.97 sse random
24 0.69 18 0.41 avx random

Zen (Ryzen 5 1600X), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
25 33 57 106 110 120 140 184 321 599 1160 2323 repmovsb
16 16 16 17 32 43 66 107 177 328 601 1225 memmove
13 13 14 13 38 49 73 116 188 336 610 1233 memcpy
11 10 11 15 19 24 38 70 158 283 540 1142 sse
10 50 50 15 17 25 41 75 139 283 538 1129 avx
25 33 57 83 87 95 111 143 207 335 594 1136 repmovsb aligned
16 16 16 17 19 25 41 73 138 266 540 1101 memmove aligned
13 13 14 13 23 28 43 75 140 268 543 1096 memcpy aligned
11 10 11 12 14 20 34 68 154 278 534 1090 sse aligned
10 50 50 12 11 18 34 66 130 276 532 1086 avx aligned
23 32 56 89 111 120 140 184 321 599 1160 2323 repmovsb blksz-1
16 16 17 16 33 44 67 108 178 328 601 1226 memmove blksz-1
13 13 14 14 39 50 74 116 189 337 609 1237 memcpy blksz-1
12 10 11 15 19 24 39 73 160 285 542 1140 sse blksz-1
11 50 50 50 17 25 41 75 140 284 539 1128 avx blksz-1
anti-avx anti-sse
57 0.00 43 0.00 repmovsb random
45 0.97 41 0.88 memmove random
45 1.00 41 0.88 memcpy random
30 0.73 31 0.77 sse random
36 0.75 51 0.63 avx random

pco...@gmail.com, Sep 27, 2017, 12:44:31 AM

Are you sure statically linking glibc doesn't defeat the runtime selection of memcpy functions? It normally does that during dynamic-linker resolution.

On Saturday, September 23, 2017 at 2:12:37 PM UTC-3, Anton Ertl wrote:
> Actually, the only problem is short blocks. For longer ones, my
> solution just uses unaligned full-width loads and stores on start and
> end, and unaligned loads and aligned stores in between, with the start
> and end handling possibly overlapping the in-between stuff.

For short blocks, current glibc uses two (potentially-overlapping) unaligned loads from the start and end of the block, then stores into the destination. So it doesn't even have to check for overlap in memmove. It branches to select a block size, though: (64B zmm if available), 32B ymm, 16B xmm, or 8B/4B/2B integer regs.

See some good descriptive comments in the glibc source code:
https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19

It wasn't that long ago that a different strategy was used.

> >> Do we have masked writes before AVX512?
> ...
> >vmaskmovps is good enough for handling of last SIMD word of the line in something like SGEMM. Which is not too demanding.
>
> It turns out that this very much depends on the implementation. On
> Skylake, it's ok, Haswell not so great, and on Zen it is abysmally
> slow.

Even on Intel chips, Intel's optimization manual warns of pitfalls when the masked portion would have faulted. It avoids any correctness problems, but an unaligned copy at the end of a memory page before an unmapped page could be slow.

Intel says (11.9 CONDITIONAL SIMD PACKED LOADS AND STORES) https://software.intel.com/en-us/articles/intel-sdm#optimization:

Masked loads including an illegal address range do not result in an exception if the range is under a zero mask value. However, the processor may take a multi-hundred-cycle “assist” to determine that no part of the illegal range have a one mask value. This assist may occur even when the mask is “zero” and it seems obvious to the programmer that the load should not be executed.

* Use VMASKMOV only in cases where VMOVUPS cannot be used.
* Use VMASKMOV on 32Byte aligned addresses if possible.

For a libc memcpy implementation, it's probably best to avoid performance gotchas like that unless the upside for the normal case is significant. The penalty might be similar to an FP denormal.

Generating a mask from an unaligned load on a ..., -1, -1, 0, 0, ... constant is pretty good, but could itself cache-miss. It might be less good for code that calls memcpy only occasionally (always after dirtying enough cache to evict that buffer). But still with small buffer sizes so one cache-miss during startup is significant? I guess in that case memcpy isn't really a hotspot.

Anyway, if you test this and it does well enough in practice, it's worth considering VMASKMOVPS / VPMASKMOVD.


Byte-granularity (V)MASKMOVDQU is not worth considering at all. There's no way to disable the NT hint, so it forcibly evicts the destination from cache. Also, Agner Fog says it's 10 uops on Skylake (6 ALU, 4 for ports 2/3). So even apart from evicting the destination, it has one per 6 cycle throughput. If it saved a branch mispredict every time, the high uop / throughput cost could still be worth it, but branch prediction probably isn't that bad. (And the NT behaviour makes it unusable anyway except at the end of very large copies.)

> One oddity occured when I compiled the code on a machine with gcc-7.2.
> Instead of compiling the intrinsics into one AVX256 instruction each
> as I had intended, gcc-7.2 compiled
>
> x = _mm256_loadu_si256((__m256i *)(dlast+off));
> _mm256_storeu_si256((__m256i *)dlast, x);
>
> into
>
> ac: c5 fa 6f 04 37 vmovdqu (%rdi,%rsi,1),%xmm0
> b1: c4 e3 7d 18 44 37 10 vinsertf128 $0x1,0x10(%rdi,%rsi,1),%ymm0,%ymm0
> b8: 01
> b9: c5 f8 11 07 vmovups %xmm0,(%rdi)
> bd: c4 e3 7d 19 47 10 01 vextractf128 $0x1,%ymm0,0x10(%rdi)
>
> and the result is significantly slower on this (Zen-based) machine
> than the code produced by gcc-6.3, which happens to translate as I
> intended.

You probably compiled with -mtune=generic, which unfortunately includes both

-mavx256-split-unaligned-load and -mavx256-split-unaligned-store. gcc7 "fixed" these options to apply to integer-vector loads as well as FP, even when -mavx2 is enabled. (See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 re: the lack of ISA-extension-selection awareness in tuning options.)

Gcc's current tuning for -march/-mtune=znver1 is no-split unaligned loads, but still split unaligned 256b stores, IIRC. If your results indicate that split stores are not a win for Zen, please let the gcc devs know.

Maybe this Sandybridge tuning can be removed from tune=generic, especially since a lot of times functions using unaligned loads do actually get aligned data at run-time. Split loads can only ever be a win if data is less than 32B aligned. IIRC, SnB is not terrible on data that is 16B-aligned, efficiently handling 32B even splits across cache lines because it loads in 2 cycles anyway.

Terje Mathisen, Sep 27, 2017, 2:30:18 AM

pco...@gmail.com wrote:
> Are you sure statically linking glibc doesn't defeat the runtime
> selection of memcpy functions? It normally does that during
> dynamic-linker resolution.
>
> On Saturday, September 23, 2017 at 2:12:37 PM UTC-3, Anton Ertl
> wrote:
>> Actually, the only problem is short blocks. For longer ones, my
>> solution just uses unaligned full-width loads and stores on start
>> and end, and unaligned loads and aligned stores in between, with
>> the start and end handling possibly overlapping the in-between
>> stuff.
>
> For short blocks, current glibc uses two (potentially-overlapping)
> unaligned loads from the start and end of the block, then stores into
> the destination. So it doesn't even have to check for overlap in
> memmove. It branches to select a block size, though: (64B zmm if
> available), 32B ymm, 16B xmm, or 8B/4B/2B integer regs.

It also has to handle very short lengths, i.e. less than a wide
register! I.e. you can safely use 32-byte ops only when the length is at
least 32 bytes, and 16-byte ops for 16+ len. At this point it will quickly
become quite expensive to handle all the possible sizes, while still
guaranteeing no reads or writes outside the buffer limits:

if (len <= 64) { // Can be handled with two 256-bit AVX load/store ops
    if (len > 32) {
        avx0 = unaligned_load32(src);
        avx1 = unaligned_load32(src+len-32);
        unaligned_store32(dst, avx0);
        unaligned_store32(dst+len-32, avx1);
    }
    else if (len > 16) {
        sse0 = unaligned_load16(src);
        sse1 = unaligned_load16(src+len-16);
        unaligned_store16(dst, sse0);
        unaligned_store16(dst+len-16, sse1);
    }
    else if (len > 8) {
        reg0 = unaligned_load8(src);
        reg1 = unaligned_load8(src+len-8);
        unaligned_store8(dst, reg0);
        unaligned_store8(dst+len-8, reg1);
    }
    ... Still need to handle 0 to 8 byte operations :-(

At this point it becomes tempting to use a bit scan operation on the
length and use the result to branch directly to an optimized version for
each log2(len) possibility, although you might want to special-case very
large lengths in order to reduce the branch table size, but only those
that are longer than what you can fit inside CPU registers.

Blocks of up to 1024 bytes will fit in maximum 32 registers of 32 bytes
each, but at those sizes I would test if it made sense to only use
unaligned ops for the first and last block and a series of aligned
(store) ops for the middle part.

I would initialize (at library startup) that branch table with function
pointers that are optimized for the current cpu, the alternative is to
just have a global function pointer for the main memmove() entry point.

No matter how you do it you end up with an indirect branch/call which
will mispredict on many cpus.

One caveat: Using these often overlapping dual load/store ops means that
memmove() cannot safely be used to send a block of data into IO space,
only to normal RAM.

>
> See some good descriptive comments in the glibc source code:
> https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19

I try to never read library source code, mostly because it is fun to
figure this stuff out by myself, but also so I can truthfully say that I
didn't borrow something I read. :-)

already...@yahoo.com, Sep 27, 2017, 6:49:05 AM

On Wednesday, September 20, 2017 at 7:54:34 PM UTC+3, Anton Ertl wrote:
> an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> >Results are in cycles per iteration (i.e. buffer copying work plus
> >some loop and call overhead).
>
> New results: Bonnell (inserted before Goldmont), and Excavator
> (between K10 and Zen).
>
> Bonnell (Atom 330), glibc 2.24
> 1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
> 40 54 102 166 295 552 1066 2094 4150 8263 16492 38320 repmovsb
> 55 58 74 98 154 195 246 346 535 913 1669 4560 memmove
> 47 50 66 89 151 189 239 339 529 906 1662 4531 memcpy
> 40 54 102 167 296 554 1070 2103 4169 8299 16567 38166 repmovsb aligned
> 54 54 60 71 109 126 158 222 365 622 1135 4485 memmove aligned
> 46 46 52 63 97 115 147 212 357 614 1127 4187 memcpy aligned
> 35 52 100 164 293 550 1064 2092 4148 8260 16489 38315 repmovsb blksz-1
> 52 60 74 98 153 195 250 346 536 914 1669 4553 memmove blksz-1
> 44 52 67 89 151 189 240 339 529 906 1662 4531 memcpy blksz-1
>

It looks like on Bonnell they did not implement "fast string" enhancements.
Which means that repmovsD is probably much faster than repmovsB.

pco...@gmail.com, Sep 27, 2017, 11:33:46 AM

On Wednesday, September 27, 2017 at 3:30:18 AM UTC-3, Terje Mathisen wrote:
> pco...@gmail.com wrote:
> > It branches to select a block size, though: (64B zmm if
> > available), 32B ymm, 16B xmm, or 8B/4B/2B integer regs.
>
> It also has to handle very short lengths, i.e. less than a wide
> register!

That's why it branches to select a block size...

> I try to never read library source code

I think it's crazy to put a lot of time into profiling glibc's implementation without looking at what it does. Otherwise how will you even know what test cases it might have trouble with?

> At this point it becomes tempting to use a bit scan operation on the
> length and use the result to branch directly to an optimized version for
> each log2(len) possibility

That's an interesting idea for small copies. You could construct it to not need a jump table, but instead have each block of 2 loads + 2 stores padded to the same length so you can *compute* the jump target address instead of loading it, reducing mispredict penalty.

Each block works for size = 2^n + 1 up to 2^(n+1), so we need to bitscan (size-1). (Actually, glibc just uses ranges like between_8_15 for anything smaller than 2 max-size vectors)


.intel_syntax noprefix
dec rdx
cmp rdx, SMALL_BLOCK_THRESHOLD
ja .Llarge_block_or_0

lea rcx, [RIP + .Lcopy_1_to_2_bytes]

xor eax, eax # bsr leaves the dest unmodified on input = 0
bsr eax, edx # so size=1 or size=2 (edx=0 or 1) both leave eax=0

lea eax, [rax + rax*2] # eax *= 3
lea rcx, [rcx + rax * 8] # target + 24 * bsr(size-1)
jmp rcx # rdx=size-1

.p2align 4
.Lcopy_1_to_2_bytes:
movzx ecx, byte [rsi]
movzx esi, byte [rsi + rdx]
mov [rdi+rdx], sil
mov [rdi], cl
ret

## pad to 24B
.Lcopy_3_to_4_bytes:
# assert(. - .Lcopy_1_to_2_bytes == 24)
movzx ecx, word [rsi]
movzx esi, word [rsi + rdx - 1]
mov [rdi+rdx - 1], si
mov [rdi], cx
ret

## pad to 24B
.Lcopy_5_to_7_bytes:
# assert(. - .Lcopy_1_to_2_bytes == 24*2)
mov ecx, [rsi]
mov esi, [rsi + rdx - 3]
mov [rdi+rdx - 3], esi
mov [rdi], ecx
ret

... and so on up to
vmovdqu ymm0, [rsi]
vmovdqu ymm1, [rsi + rdx - 31]
vmovdqu [rdi+rdx - 31], ymm1
vmovdqu [rdi], ymm0
vzeroupper
ret

Storing the first vector last is probably good, to avoid store-forwarding stalls when it's read. Reading it first, too, could go either way, IDK. Glibc isn't consistent between integer and xmm.

24B is exactly enough for 4x vmovdqu + vzeroupper + ret with these addressing modes.

20: c5 fe 6f 06 vmovdqu ymm0,YMMWORD PTR [rsi]
24: c5 fe 6f 4c 16 e1 vmovdqu ymm1,YMMWORD PTR [rsi+rdx*1-0x1f]
2a: c5 fe 7f 4c 17 e1 vmovdqu YMMWORD PTR [rdi+rdx*1-0x1f],ymm1
30: c5 fe 7f 07 vmovdqu YMMWORD PTR [rdi],ymm0
34: c5 f8 77 vzeroupper
37: c3 ret

Even for blocks where the jump target is only 8B from the end of a cache line, the first instruction is still decodeable, so the memory system can start working on the load right away even if it takes an extra code-fetch to decode the stores.

Possibly a 1B load from src and dst would be useful as a prefetch in case memcpy and the buffers are both cold in cache, but that's probably more overhead than we want if everything is hot in cache.

> One caveat: Using these often overlapping dual load/store ops means that
> memmove() cannot safely be used to send a block of data into IO space,
> only to normal RAM.

If operation sizes or store order matters, you can't use memmove anyway!

Writing to device memory probably makes unaligned and overlapping a lot less efficient, though, so even if it was safe you probably wouldn't want to use it.

Terje Mathisen, Sep 27, 2017, 2:20:40 PM

pco...@gmail.com wrote:
> On Wednesday, September 27, 2017 at 3:30:18 AM UTC-3, Terje Mathisen
>> At this point it becomes tempting to use a bit scan operation on
>> the length and use the result to branch directly to an optimized
>> version for each log2(len) possibility
>
> That's an interesting idea for small copies. You could construct it
> to not need a jump table, but instead have each block of 2 loads + 2
> stores padded to the same length so you can *compute* the jump
> target address instead of loading it, reducing mispredict penalty.

This is the way I used to do it in my 16-bit x86 asm days. :-)

If this was memcpy instead of memmove it would be even more tempting to
jump into an array of regsize moves, but for memmove we have to either
load everything up front or do the copy in opposite direction if the
buffers overlap badly.
> 24B is exactly enough for 4x vmovdqu + vzeroupper + ret with
> these addressing modes.

We only need enough space for the second-largest (next to last) block,
since it is OK if the last one needs a bit more, i.e. maybe 20/21 is
sufficient?

Assuming 21, we can calculate that as 3*7, or as ecx += (4*5+1)*eax:

add ecx,eax
lea eax,[eax*4+eax]
lea ecx,[ecx+eax*4]

Travis Downs, Sep 28, 2017, 2:40:00 AM

On Tuesday, September 19, 2017 at 7:09:42 AM UTC-7, Anton Ertl wrote:

> So I decided to do a more in-depth measurement of REP MOVSB
> vs. some alternatives. I wrote a microbenchmark that copies a buffer
> to a non-overlapping buffer, with both buffers independently starting
> at offsets from 0 to 4095 (for the "aligned" results, offsets are
> aligned to 32 bytes); the copying is done with REP MOVSB, and libc's
> memmove, and memcpy.

Excellent work. I added a link back here for that stackoverflow post at https://stackoverflow.com/a/43574756/149138 , although I don't have time at the moment to better integrate the findings.


> * REP MOVSB is slower than memcpy for some block sizes (especially
> <1KB) on all platforms, and for all block sizes on some platforms
> (Penryn, Sandy Bridge, unaligned Ivy Bridge, Zen), and often not
> just by a little. In theory the hardware people should know how to
> get the best performance out of their hardware, but in practice,
> that seems hard to achieve.

Indeed, it seems like a big missed opportunity.

Travis Downs, Sep 28, 2017, 2:46:27 AM

On Wednesday, September 27, 2017 at 8:33:46 AM UTC-7, pco...@gmail.com wrote:

> vmovdqu ymm0, [rsi]
> vmovdqu ymm1, [rsi + rdx - 31]
> vmovdqu [rdi+rdx - 31], ymm1
> vmovdqu [rdi], ymm0
> vzeroupper
> ret

It's a shame about the vzeroupper. That's not a particularly cheap instruction (4 uops) on recent Intel, but on recent AMD (Ryzen) it is particularly terrible at 17 ops and one per 6 cycles! So using ymm regs for copies from 32 to 63 bytes is probably a pessimization on recent AMD.

It might even be a pessimization on Intel when you consider the "AVX256 transition" penalties and "AVX turbo". At least using ymm regs even for mov triggers the former, not sure about the latter.

Anton Ertl, Sep 28, 2017, 4:34:55 AM

pco...@gmail.com writes:
>Are you sure statically linking glibc doesn't defeat the runtime selection
>of memcpy functions? It normally does that during dynamic-linker
>resolution.

I expect that it does not, but I doubt that glibc-2.3.6 has any
Zen-specific runtime selection. I used static linking for glibc 2.3.6
only to compare glibc-2.3.6 to (dynamically linked) glibc-2.24, and I
benchmarked the result on Zen. But if you want to provide better
results by using dynamically linked glibc-2.3.6 and dynamically linked
glibc-2.24 on the same machine, go ahead!

>On Saturday, September 23, 2017 at 2:12:37 PM UTC-3, Anton Ertl wrote:
>> Actually, the only problem is short blocks. For longer ones, my
>> solution just uses unaligned full-width loads and stores on start and
>> end, and unaligned loads and aligned stores in between, with the start
>> and end handling possibly overlapping the in-between stuff.
>
>For short blocks, current glibc uses two (potentially-overlapping)
>unaligned loads from the start and end of the block, then stores into
>the destination.

Yes, that's what my {sse,avx}mem{cpy,move} do, too.

>So it doesn't even have to check for overlap in memmove. It branches
>to select a block size, though: (64B zmm if available), 32B ymm, 16B
>xmm, or 8B/4B/2B integer regs.

Same here. I use the following cases:

sse     avx
0
1
2-4
5-8     5-63
9-16    >63
17-32
>32

With the ">" cases using unrolling by a factor of 2 and therefore
having another branch after the final iteration.

>See some good descriptive comments in the glibc source code:
>https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19

Thanks. Since which glibc version is this used?

I see that this code has extra cases for up to 8*VEC_SIZE and uses
unrolling by a factor of 4. I chose to switch to the loop variant
ASAP, to reduce cases and thus code size and branch mispredictions
(the loop branch also has mispredictions, so having extra cases does
not necessarily increase the mispredictions).

I also chose unrolling by a factor of 2 (glibc uses 4), because
no-unrolling is not competitive with the glibc versions I compared to
(unrolling factor 2 is), because of code size, and because of the
branch mispredictions for the final iterations. Hmm, I guess that
using an extra case for 4*VEC_SIZE would allow me to avoid the check
(and potential branch misprediction) for the final iteration. But it
would increase the code size.

>> >> Do we have masked writes before AVX512?
>> ...
>> >vmaskmovps is good enough for handling of last SIMD word of the
>> >line in something like SGEMM. Which is not too demanding.
>>
>> It turns out that this very much depends on the implementation. On
>> Skylake, it's ok, Haswell not so great, and on Zen it is abysmally
>> slow.
>
>Even on Intel chips, Intel's optimization manual warns of pitfalls when
>the masked portion would have faulted. It avoids any correctness
>problems, but an unaligned copy at the end of a memory page before an
>unmapped page could be slow.

My benchmark does not exercise that (at least not intentionally).
Another thing to do.

>For a libc memcpy implementation, it's probably best to avoid
>performance gotchas like that unless the upside for the normal case is
>significant.

The random benchmark gives an idea of the branch prediction upside.
E.g., for Skylake:

anti-avx anti-sse
26 0.77 29 0.97 sse random
24 0.69 18 0.41 avx random

So we see 2-11 cycles speedup, mainly from better branch prediction
(the ns are relatively small, so the block size difference between SSE
and AVX should not play a role).

>Generating a mask from an unaligned load on a ..., -1, -1, 0, 0, ...
>constant is pretty good, but could itself cache-miss. It might be less
>good for code that calls memcpy only occasionally (always after
>dirtying enough cache to evict that buffer).

Yes, but the alternative is to have more cases, which needs more code,
and may cause an I-cache miss instead, and also increases branch
mispredictions. On balance I think that VMASKMOVPS is the way to
go if it is fast (i.e., not on Zen, we will have to see about Intel).

In any case, this constant array should be cache-line-aligned, so that
it does not take more cache lines than necessary.

>Byte-granularity (V)MASKMOVDQU is not worth considering at all.
>There's no way to disable the NT hint, so it forcibly evicts the
>destination from cache.

NT is that bad? Ouch!

>> One oddity occurred when I compiled the code on a machine with gcc-7.2.
>> Instead of compiling the intrinsics into one AVX256 instruction each
>> as I had intended, gcc-7.2 compiled
>>
>> x = _mm256_loadu_si256((__m256i *)(dlast+off));
>> _mm256_storeu_si256((__m256i *)dlast, x);
>>
>> into
>>
>> ac: c5 fa 6f 04 37 vmovdqu (%rdi,%rsi,1),%xmm0
>> b1: c4 e3 7d 18 44 37 10 vinsertf128 $0x1,0x10(%rdi,%rsi,1),%ymm0,%ymm0
>> b8: 01
>> b9: c5 f8 11 07 vmovups %xmm0,(%rdi)
>> bd: c4 e3 7d 19 47 10 01 vextractf128 $0x1,%ymm0,0x10(%rdi)
>>
>> and the result is significantly slower on this (Zen-based) machine
>> than the code produced by gcc-6.3, which happens to translate as I
>> intended.
>
>You probably compiled with -mtune=generic, which unfortunately includes
>both
>
>-mavx256-split-unaligned-load and -mavx256-split-unaligned-store. gcc7
>"fixed" these options to apply to integer-vector loads as well as FP,
>even when -mavx2 is enabled. (See
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 re: the lack of
>ISA-extension-selection awareness in tuning options.)

I used -O -mavx; but, more to the point, I used Intel intrinsics, and
expect a 1:1 correspondence between intrinsic and machine instruction.

But this shows the wisdom of the author of
<https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html>
(you?) to use assembly language instead of C with intrinsics.

>Gcc's current tuning for -march/-mtune=znver1 is no-split unaligned
>loads, but still split unaligned 256b stores, IIRC. If your results
>indicate that split stores are not a win for Zen, please let the gcc
>devs know.

I'll leave that to people who still have faith in gcc devs. They are
free to use my code and data, which are available.

>Maybe this Sandybridge tuning can be removed from tune=generic,
>especially since a lot of times functions using unaligned loads do
>actually get aligned data at run-time. Split loads can only ever be a
>win if data is less than 32B aligned. IIRC, SnB is not terrible on data
>that is 16B-aligned, efficiently handling 32B even splits across cache
>lines because it loads in 2 cycles anyway.

Here are results for avxmemmove using unsplit stores (gcc-4.9.2) for
Sandy Bridge:

64 128 256 512 blksz
23 26 41 73 avx
20 19 25 38 avx aligned
23 26 42 73 avx blksz-1

And here with split stores (gcc-7.2):

64 128 256 512 blksz
18 21 32 56 avx
19 20 28 44 avx aligned
24 21 32 56 avx blksz-1

Looks like unsplit unaligned stores are really expensive on Sandy
Bridge.

Anton Ertl

unread,
Sep 28, 2017, 4:42:49 AM9/28/17
to
Travis Downs <travis...@gmail.com> writes:
>On Wednesday, September 27, 2017 at 8:33:46 AM UTC-7, pco...@gmail.com wrote:
>
>> vmovdqu ymm0, [rsi]
>> vmovdqu ymm1, [rsi + rdx - 31]
>> vmovdqu [rdi+rdx - 31], ymm1
>> vmovdqu [rdi], ymm0
>> vzeroupper
>> ret
>
>It's a shame about the vzeroupper. That's not a particularly cheap
>instruction (4 uops) on recent Intel, but on recent AMD (Ryzen) it is
>particularly terrible at 17 ops and one per 6 cycles! So using ymm regs
>for copies from 32 to 63 bytes is probably a pessimization on recent AMD.

I don't think (but have not measured) that AMD CPUs benefit from
vzeroupper. All of them treat ymm registers as two 128-bit physical
registers, while vzeroupper is a workaround for problems with Intel's
256-bit register implementations.

In my measurements, SSE and AVX memcpy are equally fast on Zen (except
where I use vmaskmovps, which is slow on Zen).

>It might even be a pessimization on Intel when you consider the "AVX256
>transition" penalties

The vzeroupper is a workaround for that AFAIK.

pco...@gmail.com

unread,
Sep 28, 2017, 11:15:32 AM9/28/17
to
On Thursday, September 28, 2017 at 5:34:55 AM UTC-3, Anton Ertl wrote:
> pco...@gmail.com writes:
> >Are you sure statically linking glibc doesn't defeat the runtime
> >selection of memcpy functions? It normally does that during
> >dynamic-linker resolution.
>
> I expect that it does not, but I doubt that glibc-2.3.6 has any
> Zen-specific runtime selection.

I guess for memcpy, there's probably just SSE and AVX. If glibc-2.3.6 had any AVX versions, they probably aren't getting used. For string functions, there are SSSE3 versions.

> But if you want to provide better
> results by using dynamically linked glibc-2.3.6 and dynamically linked
> glibc-2.24 on the same machine, go ahead!

If I had a Zen, I'd probably single-step a call to `memcpy` (after dynamic linker lazy loading was done) to figure out which function it actually used. Then grab the source for that and create a stand-alone version of that function called glibc236_memcpy and call it directly from the benchmark.


> >See some good descriptive comments in the glibc source code:
> >https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19
>
> Thanks. Since which glibc version is this used?

IDK, I didn't do a git blame. (And no, I didn't write it. I just found it while investigating what exactly glibc does for memcpy/memset.)


> I see that this code has extra cases for up to 8*VEC_SIZE and uses
> unrolling by a factor of 4. I chose to switch to the loop variant
> ASAP, to reduce cases and thus code size and branch mispredictions
> (the loop branch also has mispredictions, so having extra cases does
> not necessarily increase the mispredictions).

Fun fact: Skylake can correctly predict the not-taken loop exit for iteration counts up to 22 or 23 (I forget which). With more than that, you get one mispredict per loop.

Unrolling gets more work done per mispredict in the worst case, where the iteration count is exactly 24 (i.e. you don't get one mispredict per call until twice as large a size). IDK if this is significant, or worth the code-size cost, but worth keeping in mind.


> >Generating a mask from an unaligned load on a ..., -1, -1, 0, 0,
>
> Yes, but the alternative is to have more cases, which needs more code,
> and may cause an I-cache miss instead, and also increases branch
> mispredictions. On the balance I think that VMASKMOVPS is the way to
> go if it is fast (i.e., not on Zen, we will have to see about Intel).

Yeah, it's a tradeoff. The code fetch is nearly sequential, though, so probably at least within the same DRAM page, and may already be in flight by the time the first instructions are executed.

I normally think about optimizing small loops that touch a lot of data, though, not programs with bloated code that would suffer a lot from more I-cache pressure from a large memcpy. Maybe for a lot of programs that make a lot of use of memcpy, your idea would be a win.


> >Byte-granularity (V)MASKMOVDQU is not worth considering at all. There's
> >no way to disable the NT hint, so it forcibly evicts the destination
> >from cache.
>
> NT is that bad? Ouch!

It's a feature, not a bug. It means you don't need CLFLUSH after an NT store to make sure data is really in DRAM, in case that's needed for non-coherent DMA. (Although I think DMA is normally cache-coherent on modern x86, since the memory controller is inside the CPU and can probe L3 cache).
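In code, the usual pattern is just a fence after the NT stores rather
than any CLFLUSH. A minimal sketch, not from the benchmark (dst must be
16-byte aligned, and the function name is made up):

#include <stddef.h>
#include <emmintrin.h>

static void fill_nt(__m128i *dst, __m128i value, size_t n_vecs)
{
  for (size_t i = 0; i < n_vecs; i++)
    _mm_stream_si128(&dst[i], value);  /* MOVNTDQ: bypasses/evicts the cache */
  _mm_sfence();  /* order the weakly-ordered NT stores before later stores,
                    e.g. a flag telling a device the buffer is ready */
}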



> >-mavx256-split-unaligned-load and -mavx256-split-unaligned-store. gcc7
> >"fixed" these options to apply to integer-vector loads as well as FP,
> >even when -mavx2 is enabled. (See
> >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 re: the lack of
> >ISA-extension-selection awareness in tuning options.)
>
> I used -O -mavx, but I especially used Intel intrinsics, and expect a
> 1:1 correspondence between intrinsic and machine instruction.

I agree that specific "optimization" is surprising if you weren't expecting it, but it does kind of make sense. I don't think -mavx256-split-unaligned-load is a good default for tune=generic these days, though. That's why you should use -march=whatever instead of just -mavx whenever possible.

It's really too bad there isn't a tune=generic-avx2, to tune for all CPUs that support AVX2 and ignore those that don't.

Compilers do lots of optimizations, like compiling _mm_extract_epi32(v, 0) to a movd eax, xmm0 instead of a pextrd eax, xmm0, 0. And also folding loads into memory operands for ALU instructions, or even optimizing away store/load on a local array.

It's usually a good thing, especially clang is often good at optimizing intrinsics to use better shuffles. (But sometimes it makes worse choices...)
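For example (a trivial sketch, function name made up), gcc and clang at -O2 typically compile

#include <smmintrin.h>

int first_lane(__m128i v)
{
  return _mm_extract_epi32(v, 0);   /* index 0: usually becomes movd eax, xmm0 */
}

to a plain movd rather than a pextrd with an immediate of 0.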


> Here are results for avxmemmove using unsplit stores (gcc-4.9.2) for
> Sandy Bridge:
>
> 64 128 256 512 blksz
> 23 26 41 73 avx
> 20 19 25 38 avx aligned
> 23 26 42 73 avx blksz-1
>
> And here with split stores (gcc-7.2):
>
> 64 128 256 512 blksz
> 18 21 32 56 avx
> 19 20 28 44 avx aligned
> 24 21 32 56 avx blksz-1

Ok, interesting. It's not catastrophically bad on SnB, and that's for memcpy. Anything with some computation mixed in will suffer less.

It's definitely still good for tune=sandybridge, but you could argue it's not so bad on SnB that other CPUs need to suffer for it with tune=generic. Especially since the penalty only occurs when the data is actually misaligned at runtime, but gcc will split your loads/stores any time it can't prove at compile time that the data is *always* aligned (i.e. where vmovaps would be safe).

gcc is really dumb about implementing it, too: it doesn't use split loads/stores as part of other shuffles, so code with 256b loads + 256b shuffles ends up completely brain-dead: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136

Anton Ertl

unread,
Sep 28, 2017, 1:37:58 PM9/28/17
to
pco...@gmail.com writes:
>On Thursday, September 28, 2017 at 5:34:55 AM UTC-3, Anton Ertl wrote:
>> pco...@gmail.com writes:
>> >Are you sure statically linking glibc doesn't defeat the runtime
>> >selection of memcpy functions? It normally does that during
>> >dynamic-linker resolution.
>>
>> I expect that it does not, but I doubt that glibc-2.3.6 has any
>> Zen-specific runtime selection.
>
>I guess for memcpy, there's probably just SSE and AVX. If glibc-2.3.6
>had any AVX versions, they probably aren't getting used.

2.3.6 was released in 2005, so I really doubt that it supports AVX.
Looking at the results for the 2.19 (2014) memcpy on Haswell and
Skylake (very similar to my ssememcpy and slower than avxmemcpy), I
guess that the 2.19 memcpy still used only SSE2.

>If I had a Zen, I'd probably single-step a call to `memcpy` (after
>dynamic linker lazy loading was done) to figure out which function it
>actually used. Then grab the source for that and create a stand-alone
>version of that function called glibc236_memcpy and call it directly
>from the benchmark.

I don't think Zen is a particularly interesting target here, because
SSE and AVX have mostly the same performance. But I have the newest
glibc on it and wanted to compare with the oldest glibc I have
available, so I ended up using Zen for that.

>> I see that this code has extra cases for up to 8*VEC_SIZE and uses
>> unrolling by a factor of 4. I chose to switch to the loop variant
>> ASAP, to reduce cases and thus code size and branch mispredictions
>> (the loop branch also has mispredictions, so having extra cases does
>> not necessarily increase the mispredictions).
>
>Fun fact: Skylake can correctly predict the not-taken loop exit for
>iteration counts up to 22 or 23 (I forget which). With more than that,
>you get one mispredict per loop.

Unfortunately, it's history-based prediction, so you will see
mispredictions even for lower counts if the counts vary randomly.

I did an investigation of counted-loop prediction earlier
<2017Mar1...@mips.complang.tuwien.ac.at>.

>I normally think about optimizing small loops that touch a lot of data,
>though, not programs with bloated code that would suffer a lot from more
>I-cache pressure from a large memcpy. Maybe for a lot of programs that
>make a lot of use of memcpy, your idea would be a win.

For a general-purpose library routine like memmove, I think that we
should look for a good balance for most cases. And that includes
calling it now and then in bloated programs. An 11KB monster like the
glibc memcpy (or was it memmove?) that I looked at some time ago may
be good for some benchmarks (but not better than my small ssememcpy),
but will have bad performance in other cases.

>> Here are results for avxmemmove using unsplit stores (gcc-4.9.2) for
>> Sandy Bridge:
>>
>> 64 128 256 512 blksz
>> 23 26 41 73 avx
>> 20 19 25 38 avx aligned
>> 23 26 42 73 avx blksz-1
>>
>> And here with split stores (gcc-7.2):
>>
>> 64 128 256 512 blksz
>> 18 21 32 56 avx
>> 19 20 28 44 avx aligned
>> 24 21 32 56 avx blksz-1

The strange thing is that the difference between the unsplit case and
the split case increases with the block size for the unaligned memcpy
("avx"), even though the number of truly unaligned vmovdqu stores is
always 2. Maybe gcc-7.2 also does something for unaligned loads that
helps Sandy Bridge. I need to take a closer look.

pco...@gmail.com

unread,
Sep 29, 2017, 1:16:57 AM9/29/17
to
On Thursday, September 28, 2017 at 2:37:58 PM UTC-3, Anton Ertl wrote:
> >Fun fact: Skylake can correctly predict the not-taken loop exit for
> >iteration counts up to 22 or 23 (I forget which). With more than that,
> >you get one mispredict per loop.
>
> Unfortunately, it's history-based prediction, so you will see
> mispredictions even for lower counts if the counts vary randomly.

Yes, of course. I *suspect* that a lot of programs that spend lots of time in memcpy use one or a few constant sizes (at least over short time periods of milliseconds to seconds). Is that even close to accurate? Obviously performing badly for varying counts is not acceptable, but optimizing a bit for repeated calls with the same count is probably not too bad for memcpy.


> For a general-purpose library routine like memmove, I think that we
> should look for a good balance for most cases. And that includes
> calling it now and then in bloated programs. An 11KB monster like the
> glibc memcpy (or was it memmove?) that I looked it some time ago may
> be good for some benchmarks (but not better than my small ssememcpy),
> but will have bad performance in other cases.

Agreed, that sounds ridiculous, and probably a symptom of tuning for microbenchmarks. (Although I wonder if that includes some CPU-dispatching alternatives.) But a 1 for 1 trade of code-size for data-size isn't bad, and not using any data is nice.

Anyway, if your idea really does perform well in a lot of cases, then that's great. It would be cool if there's something even better than what glibc is doing now.

> The strange thing is that the difference between the unsplit case and
> the split case increases with the block size for the unaligned memcpy
> ("avx"), even though the number of really unaligned store-vmovdqu-s is
> always 2. Maybe gcc-7.2 also does something for unaligned loads that
> helps Sandy Bridge. I need to take a closer look.

Are you using loadu / storeu intrinsics in the aligned loop? gcc will still split them if it fails to prove the addresses are actually aligned.

There are separate tune options for load vs. store. Current Zen tuning enables split 256b stores, but not split 256b loads, IIRC. Use -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store to disable both. (Or use -mtune=haswell to do that + other things, like optimize for macro-fusion, which IIRC tune=generic still doesn't do.)

Sounds like you should look at the compiler's asm output to check that you're getting what you expect.
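For example (a minimal sketch, function names made up): with gcc-7 and
plain -mavx (so -mtune=generic), the loadu/storeu version below gets
split into 128-bit halves, as seen earlier in this thread, while the
aligned-intrinsic version does not, because the split options only apply
to unaligned 256-bit accesses:

#include <immintrin.h>

void copy32_unaligned(char *dst, const char *src)
{
  __m256i v = _mm256_loadu_si256((const __m256i *)src);  /* may be split */
  _mm256_storeu_si256((__m256i *)dst, v);
}

void copy32_aligned(char *dst, const char *src)
{
  __m256i v = _mm256_load_si256((const __m256i *)src);   /* vmovdqa, not split */
  _mm256_store_si256((__m256i *)dst, v);
}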

timca...@aol.com

unread,
Oct 20, 2017, 1:41:35 PM10/20/17
to

Interesting snippet from the new "Intel® Architecture
Instruction Set Extensions and Future Features
Programming Reference" just posted:
https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf

Fast Short REP MOV - scheduled for Ice Lake & Later processors.

- Tim

Terje Mathisen

unread,
Oct 21, 2017, 1:50:07 AM10/21/17
to
Very short of detail indeed, but I'll still say: Finally!

Andy Glew stated here at least a decade ago that he argued for getting
this around the PPro timeframe, so two decades ago...

Compared to the Galois Field instructions this will have a somewhat
larger user base. :-)

Anton Ertl

unread,
Oct 21, 2017, 2:36:59 AM10/21/17
to
Terje Mathisen <terje.m...@tmsw.no> writes:
>timca...@aol.com wrote:
>> Fast Short REP MOV - scheduled for Ice Lake & Later processors.
>>
>> - Tim
>>
>Very short of detail indeed, but I'll still say: Finally!

I'll wait for the thing to arrive before judging it.

>Andy Glew stated here at least a decade ago that he argued for getting
>this around the PPro timeframe, so two decades ago...

I dimly remember that he wrote that this was a later insight, and that
during the P6 design he was caught in the RISC ideology and did not
want to add special hardware to make that fast.

- anton
--

Terje Mathisen

unread,
Oct 21, 2017, 9:39:21 AM10/21/17
to
Anton Ertl wrote:
> Terje Mathisen <terje.m...@tmsw.no> writes:
>> timca...@aol.com wrote:
>>> Fast Short REP MOV - scheduled for Ice Lake & Later processors.
>>>
>>> - Tim
>>>
>> Very short of detail indeed, but I'll still say: Finally!
>
> I'll wait for the thing to arrive before judging it.
>
>> Andy Glew stated here at least a decade ago that he argued for
>> getting this around the PPro timeframe, so two decades ago...
>
> I dimly remember that he wrote that this was a later insight, and
> that during the P6 design he was caught in the RISC ideology and did
> not want to add special hardware to make that fast.

OK, it will still be at least two decades before these CPUs get into
user hands. :-)