Rod Pemberton <EmailN...@voenflacbe.cpm> writes:
>On Tue, 12 Sep 2017 08:24:51 GMT
>an...@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> REP MOVSB is slow. Very slow.
>
>Do you have any references for that claim?
I was conflating the results of my CMOVE speed tests (which don't use
REP MOVSB, however), with some disappointing experiences that I had
with REP MOVSQ (which was slower than a simple loop for the block size
I used). So I decided to do a more in-depth measurement of REP MOVSB
vs. some alternatives. I wrote a microbenchmark that copies a buffer
to a non-overlapping buffer, with both buffers independently starting
at offsets from 0 to 4095 (for the "aligned" results, offsets are
aligned to 32 bytes); the copying is done with REP MOVSB and with
libc's memmove and memcpy.
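A minimal sketch of the kind of copy routine and offset sweep described above, assuming GCC or Clang inline assembly on x86. It is not the actual benchmark code: the names copy_fn, copy_rep_movsb and sweep are hypothetical, the sweep checks the copies instead of timing them, and on non-x86 targets the routine falls back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Common signature of memcpy, memmove and the REP MOVSB wrapper. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Copy n bytes with REP MOVSB. Assumes the direction flag is clear, as
   the usual x86 ABIs guarantee on function entry. On other targets this
   falls back to a byte loop (assumption: only there so it compiles). */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}

enum { MAX_OFFSET = 4096, MAX_SIZE = 16384 };

static unsigned char src_buf[MAX_OFFSET + MAX_SIZE];
static unsigned char dst_buf[MAX_OFFSET + MAX_SIZE];

/* Copy `size` bytes for every (source offset, destination offset) pair
   0, step, 2*step, ... below MAX_OFFSET and return the number of wrong
   copies. A real benchmark would time this loop instead of checking it. */
static unsigned long sweep(copy_fn copy, size_t size, size_t step)
{
    unsigned long wrong = 0;

    assert(size <= MAX_SIZE && step > 0);
    for (size_t i = 0; i < sizeof src_buf; i++)
        src_buf[i] = (unsigned char)(i * 31u + 7u);

    for (size_t so = 0; so < MAX_OFFSET; so += step)
        for (size_t dof = 0; dof < MAX_OFFSET; dof += step) {
            memset(dst_buf + dof, 0, size);
            copy(dst_buf + dof, src_buf + so, size);
            wrong += memcmp(dst_buf + dof, src_buf + so, size) != 0;
        }
    return wrong;
}
```

With this sketch, sweep(copy_rep_movsb, 128, 1) would visit every offset pair, and a step of 32 corresponds to the "aligned" case; memcpy and memmove can be passed as the first argument in the same way.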
You can find the benchmark at
<http://www.complang.tuwien.ac.at/anton/move/> (not in a
nice-to-download package yet).
You can find the results below; my observations are:
* REP MOVSB is slower than memcpy for some block sizes (especially
<1KB) on all platforms, and for all block sizes on some platforms
(Penryn, Sandy Bridge, unaligned Ivy Bridge, Zen), and often not
just by a little. In theory the hardware people should know how to
get the best performance out of their hardware, but in practice,
that seems hard to achieve.
* Aligned buffers help REP MOVSB a lot and, surprisingly, most of all
at larger block sizes. I would have expected hardware to deal with
misalignment better than software, which needs (predicted) branches
to deal with it efficiently. Once you pay for misalignment, an odd
block size does not cost extra.
* Startup overhead is high for REP MOVSB; some CPUs do better for one
byte, but are then even slower for 8. On balance, if I had to
choose between REP MOVSB and an implementation that eschews REP
MOVSB, I would choose the latter, because of the bad performance for
small block sizes. Viewed another way, thanks to the startup
overhead I have to implement something relatively complex for CMOVE
that may use REP MOVSB, but only for large block sizes.
* There is a surprising gap between memcpy and memmove performance;
sometimes memcpy is faster, sometimes memmove. In theory, for this
benchmark memcpy should never be slower than memmove, and memmove
should only be slower by a three-instruction sequence that contains
a predictable branch (so the actual copying code can start right
away). Also, in those cases where REP MOVSB is faster, memmove and
memcpy should use it (in this benchmark), and the extra cost should
just be a few checks.
Looking at these results, it is all the more ridiculous to have a
memcpy separate from memmove. If they spent the effort that they
spend on the separate routines on a memmove that uses rep movsb
where profitable, they would see better performance for both
routines.
* Enhanced REP MOVSB/STOSB (starting with Ivy Bridge; CPU flag erms)
is mentioned as a feature in Intel's optimization manual, but the
difference between Sandy Bridge and Ivy Bridge in REP MOVSB
performance is not bigger than other differences that do not get a
separate flag. The biggest difference is seen at the lower counts,
e.g., 53 (Ivy Bridge) vs. 173 (Sandy Bridge) cycles for block size
128, aligned.
* repmovsb (unaligned) has a 22x cycle count improvement between
Penryn (2007) and Skylake (2015) at block size 16K. The cycle count
improvement from K8 (2003/2005) to Zen (2017) on repmovsb aligned is
a factor of 15. So there is still a lot of progress in some areas.
* The improvement in memmove/memcpy performance from glibc 2.3.6/glibc
2.7 to glibc 2.24 is probably due for a good part to the software and
for a smaller part to the hardware. I cannot run a newer statically
linked binary on an older kernel ("Fatal: kernel too old"), so I
built a statically linked binary on the glibc 2.3.6 system, and ran
it on the Zen hardware. The glibc 2.24 memmove is faster by a
factor of about 3 for the larger block sizes, and not quite a factor
of 2 for memcpy. The better memmove/memcpy cycle count over K8 is
due to this software improvement and a factor of almost 2 hardware
improvement.
* It is strange that memmove is close to memcpy on Haswell and
Skylake, but is much slower on Zen. Different code paths at work?
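A minimal sketch of the structure argued for above: one memmove-style routine whose overlap check is a subtract, an unsigned compare and a branch, and which uses REP MOVSB only above a size cutoff. The name move_sketch and the cutoff REP_MOVSB_MIN are hypothetical (the value is purely illustrative and would have to be tuned per CPU), the small-block loops are stand-ins for an optimized copy, and the inline assembly assumes GCC or Clang on x86. This is not how glibc does it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff, illustrative only: below it the startup overhead
   of REP MOVSB is assumed to dominate. */
#define REP_MOVSB_MIN 2048

/* Forward copy with REP MOVSB; byte loop on non-x86 targets. */
static void forward_rep_movsb(unsigned char *d, const unsigned char *s,
                              size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
#else
    while (n--)
        *d++ = *s++;
#endif
}

static void *move_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    /* Overlap check: subtract, unsigned compare, branch. A forward copy
       is safe unless dst lies inside [src, src + n). If dst is below
       src, the subtraction wraps around to a huge value. */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        if (n >= REP_MOVSB_MIN) {
            forward_rep_movsb(d, s, n);
        } else {
            /* Stand-in for an optimized small-block copy. */
            for (size_t i = 0; i < n; i++)
                d[i] = s[i];
        }
    } else {
        /* dst overlaps the tail of src: copy backwards. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

For non-overlapping buffers, as in this benchmark, the check always takes the same direction, so the branch is well predicted and the copy itself can start right away.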
Things that this microbenchmark does not cover, and that may have a
significant influence on performance:
* Using the results; supposedly REP MOVSB has an advantage there because
of weaker ordering requirements of the stores (or is that about
independent instructions? the optimization manual is unclear). I
have not seen any benchmark that demonstrates that.
* In real applications other code will compete for I-cache space with
the monstrous implementations of memmove and memcpy in glibc (one
memmove I looked at had 11KB of machine code).
* This microbenchmark uses the same block size all the time, which is
a good case for branch prediction for memmove and memcpy. A less
predictable size may slow down memmove and memcpy (and possibly some
implementations of REP MOVSB).
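One way to probe the last point would be to feed the copy routines a precomputed sequence of varying block sizes. The following is a minimal sketch under that assumption; fill_sizes and its parameters are hypothetical names, and the generator is a plain xorshift32 chosen only because it is small and deterministic (it has a slight modulo bias, which should not matter here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift32: a small deterministic pseudo-random generator.
   The state must never be zero. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill sizes[0..count) with block sizes in the range 1..max_size.
   Precomputing them keeps the generator out of the timed loop. */
static void fill_sizes(size_t *sizes, size_t count, size_t max_size,
                       uint32_t seed)
{
    uint32_t state = seed ? seed : 1u;

    assert(max_size > 0);
    for (size_t i = 0; i < count; i++)
        sizes[i] = 1 + xorshift32(&state) % max_size;
}
```

A timed loop would then call the copy routine with sizes[i % count] instead of a constant, so the size-dependent branches in memmove and memcpy no longer see the same size every time.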
You can find more discussion of these issues at
<https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy>.
Results are in cycles per iteration (i.e. buffer copying work plus
some loop and call overhead).

Penryn (Xeon 5450), glibc 2.7
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    21    86   104   142   221   378   691  1319  2575  5086 10106 21276 repmovsb
    16    30    68    97    97   135   211   362   665  1287  2499  5031 memmove
    20    21    39    48    72   120   210   391   853  1685  3360  6773 memcpy
    21    85   103   135   175   195   234   314   472   789  1424  2875 repmovsb aligned
    16    30    35    39    47    60    94   160   291   554  1105  2646 memmove aligned
    20    20    19    20    26    47    81   164   360   653  1239  2693 memcpy aligned
    21    86   103   141   220   377   690  1318  2573  5084 10108 21275 repmovsb blksz-1
    18    28    56    77    82   120   198   348   651  1276  2499  5015 memmove blksz-1
    21    18    29    49    72   120   210   389   851  1682  3357  6771 memcpy blksz-1

Sandy Bridge (Xeon E3-1220), eglibc 2.11.3
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    19    83   100   129   174   183   206   268   398   653  1164  2236 repmovsb
    14    28    44    56    79   127   230   430   830  1674  3287  6521 memmove
    18    19    29    31    37    49    87   161   261   459   857  1703 memcpy
    18    81   100   129   173   179   195   228   301   448   737  1357 repmovsb aligned
    15    28    31    35    38    46    76   141   267   550  1075  2151 memmove aligned
    19    19    17    17    23    35    65   125   194   314   555  1086 memcpy aligned
    18    83    99   128   174   181   205   267   397   651  1162  2233 repmovsb blksz-1
    16    26    42    54    77   126   226   426   833  1675  3286  6523 memmove blksz-1
    19    16    15    32    36    50    86   161   260   459   858  1705 memcpy blksz-1

Ivy Bridge (Core i3-3227U), glibc 2.23
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    41    41    42    42    54    61    75   117   218   421   838  1658 repmovsb
    14    14    15    15    17    45    64   102   173   319   615  1437 memmove
    17    19    13    17    20    34    53    90   166   338   647  1439 memcpy
    42    41    41    42    53    60    71    96   158   287   557  1093 repmovsb aligned
    13    13    14    14    15    27    42    72   136   265   545  1341 memmove aligned
    16    18    12    16    18    30    47    79   153   291   551  1241 memcpy aligned
    53    41    42    42    54    68    82   123   225   427   833  1656 repmovsb blksz-1
    14    14    15    15    18    45    63   102   172   319   614  1434 memmove blksz-1
    17    20    13    17    20    34    53    91   166   338   647  1438 memcpy blksz-1

Haswell (Core i7-4690K), glibc 2.19
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    38    38    38    38    45    51    64   100   171   306   576  1135 repmovsb
    10    10    11    11    14    30    48    86   149   282   567  1414 memmove
    11    12     9    12    15    29    48    86   167   324   628  1415 memcpy
    39    39    39    39    46    50    58    74   106   170   298   581 repmovsb aligned
    11    11    12    12    13    26    38    67   132   260   531  1362 memmove aligned
    12    13    10    15    15    24    37    69   134   277   534  1236 memcpy aligned
    50    38    38    38    47    52    66   104   175   310   579  1148 repmovsb blksz-1
    10    10    11    11    15    29    47    83   149   280   567  1374 memmove blksz-1
    10    11     9    12    15    29    48    86   161   324   628  1417 memcpy blksz-1

Skylake (Core i5-6600K), glibc 2.19
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    33    33    33    33    40    44    54    76   130   237   460   974 repmovsb
    10    10    10    10    12    24    40    75   145   302   570  1384 memmove
    11    12     8    10    13    26    45    84   160   312   606  1316 memcpy
    33    33    33    33    41    45    53    69   101   175   302   564 repmovsb aligned
    11    11    11    11    12    24    37    72   141   285   558  1369 memmove aligned
    13    14    10    12    15    23    40    75   151   288   562  1267 memcpy aligned
    60    33    33    33    43    47    57    78   132   238   460   952 repmovsb blksz-1
    10    10    11    11    12    24    40    75   145   301   570  1411 memmove blksz-1
    10    11     8    10    13    26    45    84   164   312   606  1347 memcpy blksz-1

Goldmont (Celeron J3455), glibc 2.24
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    49    48    48    50    54    63    81   123   213   392   831  2681 repmovsb
    10     8     8    19    19    37    66   109   206   398   861  2700 memmove
    10     8     8    19    19    37    65   109   206   398   863  2699 memcpy
    49    48    48    50    54    62    78   111   177   309   635  2130 repmovsb aligned
    11     9     9    19    19    37    65   106   197   312   633  2157 memmove aligned
    11     9     9    19    19    37    65   106   197   312   634  2157 memcpy aligned
    38    53    64    66    70    78    95   137   226   405   831  2689 repmovsb blksz-1
    10     9     8    13    19    37    65   109   206   409   835  2714 memmove blksz-1
    10     9     8    13    19    37    65   109   206   409   829  2706 memcpy blksz-1

K8 (Athlon 64 X2 4400+), glibc 2.3.6
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    21    28    54    90   162   307   595  1171  2325  4632  9244 18467 repmovsb
    17    40    69    80   104   161   253   433   794  1514  2955  5836 memmove
    24    31    57    82    98   129   199   323   570  1064  2053  4032 memcpy
    21    28    53    87   155   292   566  1113  2206  4394  8768 17516 repmovsb aligned
    17    40    33    37    46    68   118   234   451   834  1635  3237 memmove aligned
    24    31    56    45    54    72   120   193   338   627  1207  2367 memcpy aligned
    17    27    53    89   161   306   594  1171  2325  4629  9248 18461 repmovsb blksz-1
    17    37    61    81   105   152   251   433   792  1513  2952  5825 memmove blksz-1
    20    30    56    83   100   130   202   325   572  1067  2054  4030 memcpy blksz-1

K10 (Phenom II X2 560), glibc 2.19
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    15    22    48    84   157   309   566  1080  2107  4161  8270 16487 repmovsb
    16    35    56    69   104   152   262   456   839  1604  3135  6201 memmove
    16    19    13    19    31    68   114   226   408   774  1505  2968 memcpy
    14    21    48    85   158   122   154   219   348   606  1122  2155 repmovsb aligned
    16    39    35    38    46    63    95   190   364   664  1268  2583 memmove aligned
    19    21    13    20    25    56    89   177   306   566  1084  2121 memcpy aligned
    14    21    47    83   155   300   565  1079  2106  4160  8269 16487 repmovsb blksz-1
    17    32    55    68    91   156   261   454   837  1602  3131  6190 memmove blksz-1
    17    23    13    18    30    69   114   228   411   774  1508  2966 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.24
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    25    33    57   105   110   119   140   184   321   599  1160  2324 repmovsb
    13    14    13    14    30    42    65   107   175   325   600  1222 memmove
    10    10    11    12    30    43    67   113   185   329   604  1226 memcpy
    25    33    57    83    87    95   111   143   207   335   594  1136 repmovsb aligned
    12    13    12    13    16    24    40    72   136   264   536  1094 memmove aligned
    11    11    12    11    21    27    42    74   139   267   541  1092 memcpy aligned
    23    32    56    90   110   120   140   184   321   600  1160  2324 repmovsb blksz-1
    13    13    14    13    30    42    67   108   176   325   599  1219 memmove blksz-1
    10    10    11    12    31    43    67   113   185   331   604  1221 memcpy blksz-1

Zen (Ryzen 5 1600X), glibc 2.3.6 (-static)
     1     8    32    64   128   256   512    1K    2K    4K    8K   16K block size
    25    32    56   106   111   119   140   184   321   600  1161  2334 repmovsb
    10    18    29    36    49    77   132   263   501   940  1816  3581 memmove
    26    34    59    80    88   102   133   198   342   599  1114  2182 memcpy
    25    33    56    85    89    97   113   145   209   337   595  1145 repmovsb aligned
    10    18    20    19    24    40    72   137   286   542  1054  2110 memmove aligned
    26    34    59    50    55    70   100   165   311   567  1079  2126 memcpy aligned
    22    32    56    90   111   119   142   184   321   600  1161  2338 repmovsb blksz-1
     8    16    29    36    49    76   131   261   499   938  1814  3582 memmove blksz-1
    24    33    58    82    88   101   134   198   345   602  1117  2184 memcpy blksz-1
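
The improvement factors quoted in the observations can be rechecked from the 16K columns of the tables. The following is a small sketch of that arithmetic; which table entries were actually used for each quoted factor is an assumption (16K block size, rows as named in the strings), and speedup and print_factors are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of two cycle counts: how many times faster the new one is. */
static double speedup(double old_cycles, double new_cycles)
{
    assert(new_cycles > 0);
    return old_cycles / new_cycles;
}

/* Print the factors, using the 16K entries from the tables above. */
static void print_factors(void)
{
    printf("repmovsb unaligned, Penryn -> Skylake:        %.1f\n",
           speedup(21276, 974));
    printf("repmovsb aligned, K8 -> Zen:                  %.1f\n",
           speedup(17516, 1136));
    printf("memmove on Zen, glibc 2.3.6 -> 2.24:          %.1f\n",
           speedup(3581, 1222));
    printf("memcpy on Zen, glibc 2.3.6 -> 2.24:           %.1f\n",
           speedup(2182, 1226));
    printf("memmove with glibc 2.3.6, K8 -> Zen:          %.1f\n",
           speedup(5836, 3581));
    printf("memcpy with glibc 2.3.6, K8 -> Zen:           %.1f\n",
           speedup(4032, 2182));
}
```

The last two lines compare the same glibc 2.3.6 code on K8 and on Zen, which is one way to separate the hardware contribution from the software one.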