VVM question

Thomas Koenig

Aug 22, 2021, 12:34:21 PM

Hi,

a question regarding VVM.

Take the following simplified version of Fortran's MAXLOC intrinsic,
which returns the position of the array element with the maximum
value (the first if there are many).

int m2(int * const restrict a, int n)
{
    int m, nm;
    int i;

    m = INT_MIN;
    nm = -1;
    for (i=0; i<n; i++)
    {
        if (a[i] > m)
        {
            m = a[i];
            nm = i;
        }
    }
    return nm;
}

An SIMD version with m lanes would probably determine the maximum
value for each lane separately and, at the end of the loop, return
the smallest index of the largest value, so something like

for (i=0; i<n; i+=n_lanes)
{
    if (a[i] > m[0])
    {
        m[0] = a[i];
        nm[0] = i;
    }
    if (a[i+1] > m[1])
    {
        m[1] = a[i+1];
        nm[1] = i + 1;
    }
    ...
}
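
The cross-lane cleanup at the end might look like this sketch (a hedged
illustration, assuming the per-lane results are in m[] and nm[]; the
tie-break keeps the smallest index, to match the scalar semantics):

    int best = m[0], best_i = nm[0];
    for (int lane = 1; lane < n_lanes; lane++) {
        /* larger value wins; on a tie, the smaller index wins */
        if (m[lane] > best || (m[lane] == best && nm[lane] < best_i)) {
            best = m[lane];
            best_i = nm[lane];
        }
    }
    return best_i;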

How would VVM handle that? Could it also use a similar parallel
approach, from just translating the scalar code?

Anton Ertl

Aug 22, 2021, 1:54:27 PM

Thomas Koenig <tko...@netcologne.de> writes:
>Hi,
>
>a question regarding VVM.
>
>Take the following simplified version of Fortran's MAXLOC intrinsic,
>which returns the position of the array element with the maximum
>value (the first if there are many).

This is actually a significant part of the inner loop of Jon Bentley's
Traveling Salesman example which we looked at in the thread that
begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
already...@yahoo.com presented a vectorized version of the loop
over the whole array (rather than a loop that ends as soon as it finds
something closer than before) in
<b2aed821-2b7e-456d...@googlegroups.com>, and I
discussed it in <2016Nov1...@mips.complang.tuwien.ac.at>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Terje Mathisen

Aug 22, 2021, 3:41:29 PM

A vector version of that, VVM or SIMD, would probably run better with
predicates/conditional moves so as to remove all internal branching from
the core iteration.

One issue is of course that predicates or CMOVs come with both some setup
time and latency limitations, so you might actually need to unroll the
code to use twice as many accumulators. If you do so, then you can afford
the parallel compare and the pairs of conditional moves of a new max value.
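
A minimal branchless sketch in C, assuming the compiler turns the
ternaries into CMOVs or predicated operations (the function name is mine):

    #include <limits.h>

    int m2_branchless(const int *a, int n)
    {
        int m = INT_MIN, nm = -1;
        for (int i = 0; i < n; i++) {
            int take = a[i] > m;   /* 0 or 1; no data-dependent branch */
            m  = take ? a[i] : m;  /* candidate for a conditional move */
            nm = take ? i    : nm;
        }
        return nm;
    }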

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

Aug 22, 2021, 4:19:53 PM

Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Thomas Koenig wrote:

>> Take the following simplified version of Fortran's MAXLOC intrinsic,
>> which returns the position of the array element with the maximum
>> value (the first if there are many).

[...]

> A vector version of that, VVM or SIMD, would probably run better with
> predicates/conditional moves so as to remove all internal branching from
> the core iteration.
>
> One issue is of course that predicates or CMOVs come with both some setup
> time and latency limitations, so you might actually need to unroll the
> code to use twice as many accumulators. If you do so, then you can afford
> the parallel compare and the pairs of conditional moves of a new max value.

FYI, there is some AVX2 code (which I did not write) at
https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121

Thomas Koenig

Aug 22, 2021, 4:22:30 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> This is actually a significant part of the inner loop of Jon Bentley's
> Traveling Salesman example which we looked at in the thread that
> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.

Is there a way to access those?

Google Groups always asks me for a userid, which I do not have.

Anton Ertl

Aug 22, 2021, 5:39:18 PM

Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> This is actually a significant part of the inner loop of Jon Bentley's
>> Traveling Salesman example which we looked at in the thread that
>> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
>
>Is there a way to access those?

http://al.howardknight.net/

Bookmark it immediately!

MitchAlsup

Aug 22, 2021, 9:11:06 PM

On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
> Hi,
>
> a question regarding VVM.
>
> Take the following simplified version of Fortran's MAXLOC intrinsic,
> which returns the position of the array element with the maximum
> value (the first if there are many).
>
> int m2(int * const restrict a, int n)
> {
>     int m, nm;
>     int i;
>
>     m = INT_MIN;
>     nm = -1;
>     for (i=0; i<n; i++)
>     {
>         if (a[i] > m)
>         {
>             m = a[i];
>             nm = i;
>         }
>     }
>     return nm;
> }
<
GLOBAL m2
ENTRY m2
m2:
    MOV   R3,#0x8000000000000000  // m = INT_MIN (a max search starts at the minimum)
    MOV   R4,#-1                  // nm = -1
    MOV   R5,#0                   // i = 0
top:
    VEC   R8,{R3,R4}
    LDW   R6,[R1+R5<<2]           // a[i]
    CMP   R7,R6,R3
    PGT   R7,{2,TT}
    MOV   R3,R6                   // Be careful on this assignment
    MOV   R4,R5                   // Be careful on this assignment
    LOOP  LT,R5,#1,R2
    MOV   R1,R4                   // return nm, not m
    RET
>
> An SIMD version with m lanes would probably determine the maximum
> value for each lane separately and, at the end of the loop, return
> the smallest index of the largest value, so something like
>
> for (i=0; i<n; i+=n_lanes)
> {
> if (a[i] > m[0])
> {
> m[0] = a[i];
> nm[0] = i;
> }
> if (a[i+1] > m[1])
> {
> m[1] = a[i+1];
> nm[1] = i + 1;
> }
> ...
> }
>
> How would VVM handle that? Could it also use a similar parallel
> approach, from just translating the scalar code?
<
The VEC instruction tells the loop to watch for modifications to R3 and R4
and to obey loop-carried dependencies.

Thomas Koenig

Aug 23, 2021, 1:43:02 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>> This is actually a significant part of the inner loop of Jon Bentley's
>>> Traveling Salesman example which we looked at in the thread that
>>> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
>>
>>Is there a way to access those?
>
> http://al.howardknight.net/

Thanks, very good link!

It does not do threading, unfortunately.

> Bookmark it immediately!

Done.

Thomas Koenig

Aug 23, 2021, 1:44:45 AM

MitchAlsup <Mitch...@aol.com> schrieb:
> On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
[...]
I'm afraid that does not answer my question; at least, I do not
understand it that way.

Will it run several iterations in parallel without source code
modification, or not?

Thomas Koenig

Aug 23, 2021, 1:54:59 AM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Thomas Koenig <tko...@netcologne.de> writes:
>>Hi,
>>
>>a question regarding VVM.
>>
>>Take the following simplified version of Fortran's MAXLOC intrinsic,
>>which returns the position of the array element with the maximum
>>value (the first if there are many).
>
> This is actually a significant part of the inner loop of Jon Bentley's
> Traveling Salesman example which we looked at in the thread that
> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
> already...@yahoo.com presented a vectorized version of the loop
> over the whole array (rather than a loop that ends as soon as it finds
> something closer than before) in
><b2aed821-2b7e-456d...@googlegroups.com>, and I
> discussed it in <2016Nov1...@mips.complang.tuwien.ac.at>.

From what I read in your article, the effect of the code seems
so-so and depends on the architecture.

Would it be possible to give the code at
https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740) a spin
to see if it does better? It worked well on Zen 1 despite that
architecture only "faking" AVX2 with 128-bit registers.

Anton Ertl

unread,
Aug 23, 2021, 6:55:30 AM8/23/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> This is actually a significant part of the inner loop of Jon Bentley's
>> Traveling Salesman example which we looked at in the thread that
>> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
>> already...@yahoo.com presented a vectorized version of the loop
>> over the whole array (rather than a loop that ends as soon as it finds
>> something closer than before) in
>><b2aed821-2b7e-456d...@googlegroups.com>, and I
>> discussed it in <2016Nov1...@mips.complang.tuwien.ac.at>.
>
>From what I read in your article, the effect of the code seems
>so-so and depends on the architecture.

It also depends on the array size (between 1 and 10000 in my case).
Assuming the crossover point is, say, 2000, it's probably best to use
AVX "branchless" for the first 2000 elements, and then continue with
AVX hard. I wanted to look into "branchless" some more, but, as
usual, other things needed my attention and so I did not pursue it
further.

>Would it be possible to give the code at
>https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
>part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740) a spin
>to see if it does better? It worked well on Zen 1 despite that
>architecture only "faking" AVX2 with 128-bit registers.

I could not compile it on Debian 11 ("relocation R_X86_64_32S against
`.data' can not be used when making a PIE object; recompile with
-fPIE"; this means that the assembly code contains an absolute address
and should be replaced with a rip-relative address), so I compiled it
on Debian 8 (gcc-4.9.2).

Below is what I see. What does it mean?

On Skylake:

# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.250980 0.198142 0.328205 0.227758
256 0.378698 0.351648 1.000000 0.479401
512 0.498054 0.453901 0.486692 0.609524
1024 0.533889 0.509453 0.499025 0.727273
2048 0.549946 0.558952 0.515869 0.768769
4096 0.560022 0.562174 0.465243 0.821830
8192 0.562560 0.563179 0.616496 0.836260
16384 0.568376 0.566568 0.840464 1.221957
32768 0.569482 0.568612 0.998598 1.522960
65536 0.569640 0.569839 1.227496 2.413316
131072 0.569141 0.570295 1.334039 1.866857
262144 0.570032 0.568262 1.389593 1.929232
524288 0.569357 0.566879 1.508152 1.673972
1048576 0.561443 0.555999 1.533845 1.503037
2097152 0.560509 0.560691 1.458560 1.509459
4194304 0.559187 0.560557 1.456157 1.503564
8388608 0.561024 0.560462 1.494831 1.514211
16777216 0.560297 0.559024 1.496209 1.510765
33554432 0.559756 0.560659 1.501258 1.512948
67108864 0.559765 0.560249 1.507910 1.512386
134217728 0.560098 0.560409 1.506587 1.515123
268435456 0.560284 0.560472 1.509522 1.516031
536870912 0.559883 0.560436 1.508366 1.516430

536870912 0.560183 0.560181 1.509494 1.516893
268435456 0.560113 0.560441 1.507528 1.516041
134217728 0.559948 0.560224 1.509935 1.519144
67108864 0.561124 0.561204 1.505807 1.519437
33554432 0.560492 0.559996 1.518871 1.521890
16777216 0.561216 0.560984 1.501925 1.512587
8388608 0.560717 0.560970 1.481175 1.511185
4194304 0.559531 0.560643 1.456585 1.486993
2097152 0.558511 0.561203 1.401787 1.453253
1048576 0.562318 0.558435 1.330910 1.345867
524288 0.570140 0.567899 1.630715 1.883340
262144 0.570461 0.569180 2.328472 2.177891
131072 0.570325 0.570285 2.357071 2.205857
65536 0.570335 0.569650 1.842244 1.968639
32768 0.569601 0.569621 1.501879 1.559045
16384 0.567431 0.566254 1.169951 1.247259
8192 0.563566 0.562097 0.786936 0.833876
4096 0.564187 0.553214 0.518219 0.807253
2048 0.552617 0.553813 0.516911 0.791957
1024 0.540084 0.522983 0.510469 0.725212
512 0.509960 0.454707 0.490421 0.621359
256 0.477612 0.391437 0.468864 0.481203
128 0.450704 0.421053 0.400000 0.278261

On Zen 3:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.198142 0.177285 0.673684 0.374269
256 0.374269 0.336842 1.122807 0.962406
512 0.561404 0.396285 1.924812 1.347368
1024 0.585812 0.402200 3.368421 2.694737
2048 0.612440 0.411410 2.836565 3.592982
4096 0.612440 0.417789 2.629012 4.145749
8192 0.626683 0.419414 2.661468 5.013464
16384 0.629428 0.420232 2.728847 5.258023
32768 0.629887 0.420437 2.703184 5.226156
65536 0.630809 0.420642 3.193762 5.372684
131072 0.631040 0.422911 3.101855 5.331164
262144 0.629772 0.420719 3.108845 5.360160
524288 0.630751 0.420783 2.422235 5.402135
1048576 0.631184 0.420764 2.097454 5.460935
2097152 0.631162 0.422665 1.937176 5.322937
4194304 0.630690 0.421456 1.999682 4.177444
8388608 0.627358 0.420154 2.007738 3.061207
16777216 0.618820 0.418670 2.549933 2.558949
33554432 0.621342 0.418237 2.360456 2.400368
67108864 0.623856 0.418145 2.394890 2.419667
134217728 0.625304 0.417954 2.421189 2.449932
268435456 0.626265 0.417947 2.452580 2.475416
536870912 0.626237 0.417929 2.441393 2.459276

536870912 0.626340 0.417924 2.446799 2.465893
268435456 0.626351 0.417885 2.438281 2.455598
134217728 0.626210 0.417981 2.430257 2.454633
67108864 0.621872 0.418221 2.435699 2.456569
33554432 0.615973 0.418177 2.423352 2.464991
16777216 0.615432 0.418174 2.377893 2.401301
8388608 0.616635 0.418836 2.178726 2.195453
4194304 0.630528 0.420659 2.144731 2.754934
2097152 0.631170 0.420805 4.347582 5.448535
1048576 0.629830 0.420802 4.644690 5.443698
524288 0.631126 0.420757 4.547479 5.419109
262144 0.630809 0.420796 4.442065 5.410609
131072 0.634289 0.420539 4.382799 5.314735
65536 0.630578 0.420744 4.422132 5.475021
32768 0.630809 0.420027 4.311579 5.322937
16384 0.632196 0.419823 4.311579 5.194673
8192 0.628510 0.419414 4.145749 4.899522
4096 0.623061 0.417789 3.992203 4.145749
2048 0.612440 0.411410 3.849624 3.170279
1024 0.585812 0.408293 2.994152 2.245614
512 0.561404 0.374269 2.245614 1.347368
256 0.481203 0.354571 1.347368 0.748538
128 0.421053 0.336842 0.842105 0.421053

On Zen 2:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.224561 0.210526 0.561404 0.481203
256 0.449123 0.320802 1.347368 1.122807
512 0.561404 0.384962 2.245614 1.347368
1024 0.573348 0.390542 2.994152 3.368421
2048 0.579513 0.396285 2.245614 0.769925
4096 0.602176 0.399220 1.996101 3.368421
8192 0.597172 0.399961 2.092999 4.311579
16384 0.557052 0.407522 1.774312 4.311579
32768 0.602597 0.402575 2.103209 4.311579
65536 0.611571 0.401077 1.783487 4.175863
131072 0.601651 0.407859 1.825007 4.508841
262144 0.608228 0.405629 1.760277 4.453535
524288 0.608362 0.405533 1.707133 4.460735
1048576 0.603678 0.402904 1.525548 4.465066
2097152 0.601913 0.401237 1.524494 4.422487
4194304 0.552595 0.388290 1.598199 1.880156
8388608 0.502226 0.367968 1.548654 1.485521
16777216 0.518826 0.359816 1.446895 1.508812
33554432 0.533121 0.365163 1.498006 1.497000
67108864 0.497297 0.364485 1.489156 1.489815
134217728 0.503081 0.363182 1.485826 1.497298
268435456 0.502331 0.362487 1.476280 1.483783
536870912 0.497604 0.363058 1.471117 1.486186

536870912 0.501185 0.362092 1.474724 1.484411
268435456 0.505392 0.362086 1.480385 1.487473
134217728 0.505914 0.362678 1.477032 1.491003
67108864 0.502462 0.365261 1.489096 1.491641
33554432 0.510745 0.369752 1.499425 1.512902
16777216 0.288380 0.365712 1.486786 1.508833
8388608 0.508408 0.369453 1.474655 1.501383
4194304 0.549125 0.384958 1.482000 1.811350
2097152 0.608269 0.403455 1.714878 4.018949
1048576 0.608483 0.405545 3.159389 4.377932
524288 0.608416 0.407738 3.235707 4.465066
262144 0.608389 0.407787 3.234190 4.476656
131072 0.611463 0.407859 3.229647 4.473752
65536 0.604710 0.407714 3.241789 4.611314
32768 0.610705 0.407137 3.217596 4.377238
16384 0.606411 0.402951 3.170279 4.268890
8192 0.605559 0.406753 3.079699 3.992203
4096 0.592250 0.405224 3.079699 3.592982
2048 0.592250 0.402200 3.170279 3.170279
1024 0.573348 0.390542 2.994152 2.245614
512 0.538947 0.384962 1.924812 1.347368
256 0.481203 0.374269 1.122807 2.245614
128 0.481203 0.336842 1.684211 0.421053

On Zen:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.126984 0.122605 0.053872 0.126984
256 0.187135 0.182336 0.215488 0.374269
512 0.384384 0.278867 0.346883 0.677249
1024 0.270899 0.251721 0.451499 0.917563
2048 0.474074 0.326948 0.462511 0.729345
4096 0.421399 0.336621 0.702332 1.094017
8192 0.434266 0.334641 0.603596 0.702332
16384 0.435097 0.340397 0.682326 0.995867
32768 0.382607 0.336372 0.682837 0.835066
65536 0.247377 0.270015 0.651322 0.729637
131072 0.474074 0.332836 0.756942 0.902998
262144 0.460115 0.331397 0.741676 0.737171
524288 0.386670 0.310371 0.690772 0.679873
1048576 0.447765 0.263413 0.621113 0.639005
2097152 0.457797 0.327805 0.721960 0.666532
4194304 0.463389 0.329218 0.782020 0.794948
8388608 0.467029 0.329897 0.859277 0.855042
16777216 0.484859 0.323049 0.903128 1.051807
33554432 0.485079 0.334300 1.365597 1.332054
67108864 0.484846 0.334430 1.382202 1.341124
134217728 0.484798 0.334393 1.388014 1.344886
268435456 0.483813 0.334464 1.384969 1.343667
536870912 0.484345 0.334486 1.380940 1.349843

536870912 0.484308 0.334349 1.383726 1.346343
268435456 0.482901 0.334368 1.390845 1.350834
134217728 0.483038 0.334411 1.389173 1.348572
67108864 0.483145 0.334543 1.382173 1.339797
33554432 0.482469 0.334347 1.373056 1.339054
16777216 0.484889 0.334234 1.352430 1.320905
8388608 0.485658 0.334543 1.318118 1.295186
4194304 0.482499 0.333650 1.255438 1.245587
2097152 0.499248 0.337774 1.263211 1.287671
1048576 0.513389 0.342309 1.319222 1.791556
524288 0.513271 0.342462 1.909223 1.699365
262144 0.513307 0.342414 2.043148 1.760585
131072 0.513235 0.342350 1.728817 1.918277
65536 0.512801 0.342189 1.603916 1.882569
32768 0.511648 0.341932 1.961686 1.932531
16384 0.509643 0.341163 1.865209 1.850045
8192 0.504558 0.339635 1.996101 1.820444
4096 0.494686 0.336621 1.835125 1.835125
2048 0.474074 0.328838 1.723906 1.723906
1024 0.444444 0.323232 1.497076 1.673203
512 0.374269 0.296296 1.292929 1.422222
256 0.323232 0.263374 1.015873 1.185185
128 0.296296 0.222222 0.888889 0.507937

On Tiger Lake:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.507937 0.416938 0.220690 0.066082
256 0.677249 0.448336 0.523517 0.549356
512 0.986513 0.881239 0.702332 0.990329
1024 1.365333 1.253366 0.519007 1.102260
2048 1.505882 1.455579 0.634449 1.074502
4096 1.618972 1.579637 0.647282 1.081595
8192 1.666395 1.436185 0.598349 1.104788
16384 1.697296 1.689420 0.625845 1.077824
32768 1.655786 1.708892 0.699737 1.084818
65536 1.723180 1.713583 0.619556 1.340698
131072 1.718369 1.680906 1.005015 1.388548
262144 1.728601 1.710509 1.225676 1.425385
524288 1.523752 1.505083 1.162664 2.004450
1048576 1.510044 1.498694 1.268084 1.857777
2097152 1.322782 1.301467 1.209695 1.561971
4194304 1.313021 1.283718 1.168607 1.616382
8388608 1.302382 1.315042 1.158387 1.744677
16777216 1.295171 1.300065 1.203086 1.742706
33554432 1.298210 1.300364 1.193553 1.702402
67108864 1.298965 1.298582 1.201180 1.711111
134217728 1.290168 1.295982 1.405890 1.719377
268435456 1.298625 1.290871 1.724766 1.714571
536870912 1.298432 1.286179 1.714161 1.720285

536870912 1.292826 1.297217 1.724536 1.724241
268435456 1.299574 1.298605 1.726204 1.708710
134217728 1.296023 1.292664 1.737698 1.722198
67108864 1.298964 1.297574 1.727979 1.718513
33554432 1.299467 1.193327 1.731739 1.719353
16777216 1.301645 1.293439 1.708782 1.709346
8388608 1.296258 1.290181 1.618332 1.638156
4194304 1.263116 1.282948 1.546646 1.573795
2097152 1.296268 1.273609 1.428083 1.517347
1048576 1.580714 1.527439 1.327631 1.708135
524288 1.722903 1.713628 2.477650 2.922518
262144 1.729501 1.727769 2.557078 2.947558
131072 1.728407 1.725903 1.135826 1.759970
65536 1.725494 1.720240 1.119126 1.474376
32768 1.719383 1.710765 0.812819 1.093615
16384 1.702234 1.686638 0.811491 1.110629
8192 1.682136 1.641354 0.804873 1.096947
4096 1.652279 1.412414 0.803925 1.081881
2048 1.543331 1.426184 0.735104 1.067779
1024 1.216152 1.278402 1.992218 0.959700
512 1.221957 1.201878 0.695652 1.089362
256 1.000000 0.583144 0.744186 0.992248
128 0.761905 0.512000 0.677249 0.882759

Stephen Fuld

Aug 23, 2021, 10:55:45 AM

This hearkens back to the thread we had some months ago on reductions in
VVM. I think the answer is "mostly not". I say this because the full
cache line load streaming capability is sort of doing multiple loads in
parallel, but the compare part of the loop will not use multiple
ALUs in parallel, even if they are available.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

Aug 23, 2021, 11:44:53 AM

Yes, iterations will run in parallel on multiple lanes.
However, any lane that writes to R3 or R4 will cause a serial dependency
at LOOP and will be backed up, much like branch repair, and played out again.
<
So, let us postulate that we have 4 lanes, and the loop is zipping through
iterations lickety-split, and Lane[k] in iteration[j] performs a write to R3 and R4.
Lanes[k+1] and any future iterations[j+1] are cancelled, and the next iteration
begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
run as expected.
<
In effect, the loop runs as expected, but this kind of dependency causes
a "blip" in execution width.

Anton Ertl

Aug 23, 2021, 11:49:14 AM

Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>On 8/22/2021 10:44 PM, Thomas Koenig wrote:
|int m2(int * const restrict a, int n)
|{
|    int m, nm;
|    int i;
|
|    m = INT_MIN;
|    nm = -1;
|    for (i=0; i<n; i++)
|    {
|        if (a[i] > m)
|        {
|            m = a[i];
|            nm = i;
|        }
|    }
|    return nm;
|}
...
>> Will it run several iterations in parallel without source code
>> modification, or not?
>
>This hearkens back to the thread we had some months ago on reductions in
>VVM. I think the answer is "mostly not". I say this because the full
>cache line load streaming capability is sort of doing multiple loads in
>parallel, but the the compare part of the loop will not use multiple
>ALUs in parallel, even if they are available.

Why not? Consider this equivalent code:

int m2(int * const restrict a, int n)
{
    int m, nm;
    int i;

    m = INT_MIN;
    nm = -1;
    i=0;
    while (i<n) {
        while (i<n && a[i]<=m)
            i++;
        if (a[i] > m) {
            m = a[i];
            nm = i;
        }
        i++;
    }
    return nm;
}

Now look at the inner loop. It is so easy to vectorize that even VVM
may be able to do it (maybe even auto-vectorizing compilers). Of
course, at first it will have very short trip counts, but they
increase the further through the array you work, as the probability to
find an element larger than the largest one up to now decreases
(unless the array is sorted).

Stephen Fuld

Aug 23, 2021, 12:02:47 PM

Ahhh! I didn't understand that. So in the case of summing the elements
of an unsigned integer vector, it is the writes to the "running sum"
register that cause the serial dependency and thus prevent parallel
additions. That makes sense.

Thomas Koenig

Aug 23, 2021, 1:40:32 PM

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
A strangeness that some distributors have put into compilers
recently. Luckily enough, the base version of gcc does not do this.

> recompile with
> -fPIE"; this means that the assembly code contains an absolute address
> and should be replaced with a rip-relative address), so I compiled it
> on Debian 8 (gcc-4.9.2).
>
> Below is what I see. What does it mean?

The numbers mean average iterations per cycle. "n" is the vector
length. The "normal" version is the code as I posted it. The
"expect" version uses __builtin_expect to tell the compiler that
finding a new maximum seems unlikely. AVX2 is an AVX2 version of
the code, and AVX2_unroll is an unrolled version of AVX2.

Numbers go up and then down again to have some reproducibility.
I suspect the "going down" numbers are more reliable, so I'll
look at those.


> On Skylake:

> 536870912 0.560183 0.560181 1.509494 1.516893
> 268435456 0.560113 0.560441 1.507528 1.516041
> 134217728 0.559948 0.560224 1.509935 1.519144

So, for a very long vector: 0.56 iterations per cycle for normal
code, 1.52 iterations for the AVX2 code. Almost a factor of
three, not bad.

[...]

> 1024 0.540084 0.522983 0.510469 0.725212
> 512 0.509960 0.454707 0.490421 0.621359
> 256 0.477612 0.391437 0.468864 0.481203
> 128 0.450704 0.421053 0.400000 0.278261

Don't use the unrolled AVX2 stuff on short vectors, I suppose,
but at least the slowdown for AVX2 against normal code is slight.

> On Zen 3:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 0.626340 0.417924 2.446799 2.465893
> 268435456 0.626351 0.417885 2.438281 2.455598

Scalar code is about par with Skylake, the AVX2 code is better.
Strange that the __builtin_expect code is slower, but that may
just be the rather old compiler.

> 256 0.481203 0.354571 1.347368 0.748538
> 128 0.421053 0.336842 0.842105 0.421053

The clear winner for Zen3: The AVX2 stuff without unrolling.

> On Zen 2:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 0.501185 0.362092 1.474724 1.484411
> 268435456 0.505392 0.362086 1.480385 1.487473

> 256 0.481203 0.374269 1.122807 2.245614
> 128 0.481203 0.336842 1.684211 0.421053

Again, slower than Zen3, but still AVX2 wins hands-down.

> On Zen:
># Ints per cycle

>
> 536870912 0.484308 0.334349 1.383726 1.346343

> 512 0.374269 0.296296 1.292929 1.422222
> 256 0.323232 0.263374 1.015873 1.185185
> 128 0.296296 0.222222 0.888889 0.507937

Again, a clear win for the non-unrolled AVX2 code.
>
> On Tiger Lake:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 1.292826 1.297217 1.724536 1.724241
> 268435456 1.299574 1.298605 1.726204 1.708710

The scalar variant is _very_ good, AVX2 does gain some, but not
as much as the other architectures, especially when

> 512 1.221957 1.201878 0.695652 1.089362
> 256 1.000000 0.583144 0.744186 0.992248
> 128 0.761905 0.512000 0.677249 0.882759

it seems to get slower towards the end (but the numbers still are a
bit erratic).

AVX2 without unrolling seems to be the clear winner for all
architectures you checked, especially the AMD ones, except for Tiger
Lake, which combines excellent performance of the scalar loop
with lackluster performance on AVX2. Maybe they figured that,
while they do support the instructions, performance was not
so important for them after all. For a processor intended for
the mobile market, that makes sense.

Thomas Koenig

Aug 23, 2021, 1:50:56 PM

Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
> On 8/23/2021 8:44 AM, MitchAlsup wrote:

>> Yes, iterations will run in parallel on multiple lanes.
>> However, any lane that writes to R3 or R4 will cause a serial dependency
>> at LOOP and will be backed up, much like branch repair, and played out again.
>> <
>> So, let us postulate that we have 4 lanes, and the loop is zipping through
>> iterations lickety-split, and Lane[k] in iteration[j] performs a write to R3 and R4.
>> Lanes[k+1] and any future iterations[j+1] are cancelled, and the next iteration
>> begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
>> run as expected.
>> <
>> In effect, the loop runs as expected, but this kind of dependency causes
>> a "blip" in execution width.

Good explanation, thanks.

>
> Ahhh! I didn't understand that. So in the case of summing the elements
> of an unsigned integer vector, it is the writes to the "running sum"
> register that causes the serial dependency and thus prevents parallel
> additions. That makes sense.

So (moving the goalposts towards summation here), VVM-optimized code
could look like

for (i=0; i<n; i+=m) {
    for (j=0; j<m; j++)
        s[j] += a[i+j];
}

with suitable postprocessing (and pre-processing if n
is not divisible by m).
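
For illustration, the postprocessing would be a final pass over the m
partial sums, something like:

    for (t = 0, j = 0; j < m; j++)
        t += s[j];    /* combine the partial sums */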

Hm. This doesn't really make it more elegant than doing the same kind
of thing in SIMD.

Or how should a reduction be written?

Stephen Fuld

Aug 23, 2021, 2:16:47 PM

I think that depends upon whether the order of the operations is
potentially significant. For example, if the values are signed, you may
hit an underflow/overflow at an intermediate step that gets "cancelled
out" by doing multiple intermediate sums then a final "sum of
intermediates" step. Many people have pointed out the issues with doing
the multiply/adds needed for an inner product in parallel. That is why
I specified unsigned integers in the vector to be summed.

So for full generality, you have to do one element at a time. You code
it that way, and VVM executes it that way, except with the benefit of
the full cache width loads. If you know that order is not significant
(i.e. summing unsigned integers), you would have to unroll the loop in
the source code, which would allow parallel additions and thus buy you
improved performance. A final add outside the loop gets the grand total.

There is the question of how much to unroll the loop. I think you
probably want to unroll it four times in the source code. That way, you
get maximum performance on any system with up to four integer units
without source code changes. I don't think you can do eight, as you
might exceed the VVM instruction limit. You could certainly do two, and
that would work, but you would be giving away performance on a CPU with
4 integer units.
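
A minimal sketch of that four-way unrolling for an unsigned sum (my
illustration; it assumes n is a multiple of four and omits the remainder
handling):

    #include <stdint.h>

    uint32_t sum4(const uint32_t *a, int n)
    {
        /* four independent accumulators -> four parallel add chains */
        uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < n; i += 4) {
            s0 += a[i];
            s1 += a[i+1];
            s2 += a[i+2];
            s3 += a[i+3];
        }
        return (s0 + s1) + (s2 + s3);  /* final adds outside the loop */
    }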


A question for Mitch. Suppose you unroll the loop for summing a vector
of unsigned integers. In VVM, the load for the first element causes a
cache line to be loaded into a streaming buffer. The next load, for the
second element, wants the next entry in the buffer. So, does VVM
recognize this all as a "dense reference" even though the references are
not from the same instruction?

MitchAlsup

Aug 23, 2021, 2:40:44 PM

As the Loop is installed in the stations, memory reference address patterns are
examined. If the address patterns are based on indexing off of the register used
in the LOOP instruction, then certain inferences can be made. The determination
of dense is one of these.
<
On the other hand it is easy to code gather scatter in which the pointers/indexes
are dense and the indirect data not.
<
Dense, in a VVM sense, is that several iterations of the loop can all access one
streaming buffer (avoid cache and TLB) so that other stuff (gather/scatter memory
refs) have access through normal cache paths.
<
Back to the posed question:
<
If the programmer unrolled the loop by hand (like DGEMM without transposes),
the LDs would need to be coded using offsets from the index register to be
recognized as dense:

    MOV   Ri,#0
    VEC   R8,{}
    LDD   R4,[R2,Ri<<3]
    LDD   R5,[R2,Ri<<3+8]
    LDD   R6,[R2,Ri<<3+16]
    LDD   R7,[R2,Ri<<3+24]
    ...
    LOOP  LT,Ri,#4,Rmax
<
The above code would be recognized as dense.
<
    MOV   Ri,#0
    ADD   R9,R2,#8
    ADD   R10,R2,#16
    ADD   R11,R2,#24
    VEC   R8,{}
    LDD   R4,[R2,Ri<<3]
    LDD   R5,[R9,Ri<<3]
    LDD   R6,[R10,Ri<<3]
    LDD   R7,[R11,Ri<<3]
    ...
    LOOP  LT,Ri,#4,Rmax
<
This loop is harder to recognize as dense--even though the number of words
in the loop is less.
<
As the loop is being installed in the stations, the first iteration is performed,
so many of the address patterns can be detected using actual AGEN addresses,
not just instruction patterns--so the second case might or might not be recognized.

Thomas Koenig

Aug 23, 2021, 5:29:19 PM

MitchAlsup <Mitch...@aol.com> schrieb:
Hm... all possible, but less elegant than it could be. All the
manual unrolling and autovectorization and... rears its ugly
head again.

With all the mechanisms that VVM already offers, a way for the
programmer or a programming language to specify that operations
such as summation can be done in any order would be a very useful
addition.

Suggestion:

A variant of the VEC instruction, which does not specify a special
register to keep the address in (which can be hardwired if there
is no space in the thread header). This leaves five bits for
"reduction" registers, which specify that operations on that
register can be done in any order in the loop.

This would be a perfect match for OpenMP's reduction clause or
for the planned REDUCTION addition to Fortran's DO CONCURRENT.

It would not have a 1:1 match for C semantics, sure, but this
should not pose a problem, I hope :-)

MitchAlsup

Aug 23, 2021, 5:55:26 PM

That might be one way...........
<
My preferred means is to make a way to specify that a function unit is
performing a reduction, and that it should not deliver its value at the
end of its calculation, but hold on to it and use it in the next calculation.
<
So, a FMAC reduction would take the form of::

    FMAC  --,--,--,Rsum
    VEC   Rx,{}
    LDD   RA,[Ra+Ri<<3]
    FMAC  --,RA,RB,--
    LOOP  LT,Ri,#1,Rmax
    FMAC  Rsum,--,--,--
<
or something like that, where "--" means there is no operand or result being
specified: use the last operand that showed up, and make the destination
into an operand for the next cycle.
<
The multiplier ALREADY has the ability to perform accumulates
every cycle at the wide adder (3×52+52-bit incrementer); all we need
is an ISA way to specify feeding back the last intermediate result as
an operand to the next calculation.
<
The major problem is where does one store the state on an interrupt
taken inside of the loop. I am letting my subconscious dwell on it right
now.

Stephen Fuld

Aug 23, 2021, 10:51:45 PM

I am probably missing something here. To me the main advantage of
allowing out-of-order summations (using summations here as shorthand for
other similar types of operations) was to allow the hardware to make use of
multiple functional units. That is, a core with two adders could, if
allowed, complete the summation in about half the time. Without that, I
don't see any advantage of out-of-order summations on VVM. If I am
wrong, please explain. If I am right, see below.



> Suggestion:
>
> A variant of the VEC instruction, which does not specify a special
> register to keep the address in (which can be hardwired if there
> is no space in the thread header). This leaves five bits for
> "reduction" registters, which specify that operations on that
> register can be done in any order in the loop.

Doing the operations in a different order isn't the problem. You need a
way to allow/specify the two partial sums to be added together in the
end. I don't see your proposal as doing that. And, of course, it is
limited to five registers which must be specified in the hardware design.



>
> This would be a perfect match for OpenMP's reduction clause or
> for the planned REDUCTION addition to Fortran's DO CONCURRENT.

I am not an OpenMP person, and my knowledge of Fortran is old, so could
you please give a brief explanation of what these two things do? Thanks.

Thomas Koenig

Aug 24, 2021, 2:27:51 AM

Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
Or, equivalently, I have been explaining things badly :-)

> To me the main advantage of
> allowing out of order summations (using summations here as shorthand for
> other similar type operations), was to allow the hardware to make use of
> multiple functional units.

Yes.

> That is, a core with two adders could, if
> allowed, complete the summation in about half the time.

Yes.

>Without that, I
> don't see any advantage of out of order summations on VVM. If I am
> wrong, please explain. If I am right, see below.

Seeing below.

>
>
>
>> Suggestion:
>>
>> A variant of the VEC instruction, which does not specify a special
>> register to keep the address in (which can be hardwired if there
>> is no space in the thread header). This leaves five bits for
>> "reduction" registters, which specify that operations on that
>> register can be done in any order in the loop.
>
> Doing the operations in a different order isn't the problem.

It's one half of the problem.

The way VVM is currently specified, it's strictly in-order semantics:
you write down a C loop, and the hardware delivers the results
exactly in the order you wrote them down. This would have to be
changed.


> You need a
> way to allow/specify the two partial sums to be added together in the
> end.

That as well.

>I don't see your proposal as doing that.

I thought I had implied it, but it was obviously not clear enough.


> And, of course, it is
> limited to five registers which must be specified in the hardware design.

Five reductions in a loop would be plenty; it is usually one, more
rarely two.

>> This would be a perfect match for OpenMP's reduction clause or
>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>
> I am not an OpenMP person, and my knowledge of Fortran is old, so could
> you please give a brief explanation of what these two things do? Thanks.

#pragma omp simd reduction(+:var)

before a loop will tell the compiler that it can go wild
with the order of the loop iterations, but that "var" will be used
in a summation reduction.

DO CONCURRENT also runs loops in an unspecified order,
the REDUCTION clause would then allow to, for example,
sum up all elements.
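
For concreteness, a minimal sketch of the pragma applied to a summation
loop (my own illustration):

    double sum(const double *a, int n)
    {
        double var = 0.0;
    #pragma omp simd reduction(+:var)
        for (int i = 0; i < n; i++)
            var += a[i];   /* compiler may reassociate across lanes */
        return var;
    }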

One problem with C and similar languages is that you have
to specify an ordering of the loop explicitly, which shapes
programmers' thinking and also shapes intermediate languages
for compilers...

Anton Ertl

Aug 24, 2021, 3:24:16 AM

Thomas Koenig <tko...@netcologne.de> writes:
>The numbers mean average iterations per cycle.

Actually per rdtsc unit (which have not been CPU cycles for over a
decade). And you time only a single run through the loop, so any
disturbance (rdtsc wobble, interrupt, etc.) will be very visible. You
also see one rdtsc (on average) in the result.

>"n" is the vector
>length. The "normal" version is the code as I posted it. The
>"expect" version uses __builtin_expect to tell the compiler that
>finding a new maximum seems unlikely.

I compiled with gcc-4.9 -O -std=c99, and this gives the following loops:

ml m2
mov (%rdi,%rdx,4),%ecx mov (%rdi,%rdx,4),%ecx
cmp %r8d,%ecx cmp %r8d,%ecx
jle 40077e <ml+0x21> jle 4007a9 <m2+0x21>
mov %edx,%eax mov %edx,%eax
mov %ecx,%r8d mov %ecx,%r8d
add $0x1,%rdx add $0x1,%rdx
cmp %edx,%esi cmp %edx,%esi
jg 400771 <ml+0x14> jg 40079c <m2+0x14>

So the same loop. Differences in performance may be from code
alignment (does that play a role with uCode caches?) or from
disturbances.

>AVX2 is an AVX2 version of
>the code, and AVX2_unroll is an unrolled version of AVX2.
>
>Numbers go up and then down again to have some reproducibility.
>I suspect the "going down" numbers are more reliable, so I'll
>look at those.
>
>
>> On Skylake:
>
>> 536870912 0.560183 0.560181 1.509494 1.516893
>> 268435456 0.560113 0.560441 1.507528 1.516041
>> 134217728 0.559948 0.560224 1.509935 1.519144
>
>So, for a very long vector: 0.56 iterations per cycle for normal
>code, 1.52 iterations for the AVX2 code. Almost a factor of
>three, not bad.

Given that all CPUs show higher AVX values in between, the loop seems
to run into some limit at some point. My first guess would be the L2
or L3 cache bandwidth, but the edge is not close to either limit:

before
edge L2 L3
1MB 0.25MB 6MB Skylake
8MB 0.5MB 32MB Zen3
4MB/8MB 0.5MB 16MB Zen2 (Zen2 allocates L3 only from a 16MB slice)
2MB/4MB 0.5MB 8MB Zen (Zen allocates L3 only from an 8MB slice).
2MB 1.5MB 8MB Tiger Lake

Given that the AVX code is branchless apart from the loop-back edge,
the limit cannot be branch predictor capacity.

Whatever the limit is, the result for the huge arrays show that limit,
not the capabilities of the SIMD units. For that better look at the
best SIMD results:

n normal expect AVX2 AVX2_unroll
131072 0.570325 0.570285 2.357071 2.205857 Skylake
1048576 0.629830 0.420802 4.644690 5.443698 Zen 3
262144 0.608389 0.407787 3.234190 4.476656 Zen 2
262144 0.513307 0.342414 2.043148 1.760585 Zen
262144 1.729501 1.727769 2.557078 2.947558 Tiger Lake

So Zen with its 128-bit SIMD units is worst, and Tiger Lake is better
than Skylake. It's surprising that Zen2 and Zen3 are so much better
than Tiger Lake. We see a speedup by a factor 8.6 of AVX2_unroll over
normal on Zen3, which is very impressive.

Let's also compare the n=512M values for across CPUs:

n normal expect AVX2 AVX2_unroll
536870912 0.560183 0.560181 1.509494 1.516893 Skylake
536870912 0.626340 0.417924 2.446799 2.465893 Zen 3
536870912 0.501185 0.362092 1.474724 1.484411 Zen 2
536870912 0.484308 0.334349 1.383726 1.346343 Zen
536870912 1.292826 1.297217 1.724536 1.724241 Tiger Lake

The limit seems to be similar on Skylake, Zen 2, and Zen, slightly
higher on Tiger Lake, and much higher on Zen 3. Given that main
memory bandwidth is an issue at that size, it would be good to know
how long an rdtsc tick is.

>> 1024 0.540084 0.522983 0.510469 0.725212
>> 512 0.509960 0.454707 0.490421 0.621359
>> 256 0.477612 0.391437 0.468864 0.481203
>> 128 0.450704 0.421053 0.400000 0.278261
>
>Don't use the unrolled AVX2 stuff on short vectors, I suppose,
>but at least the slowdown for AVX2 against normal code is slight.

I find it surprising that your branchless AVX2 code does not show a
speedup compared to the scalar code, which I expect to enter the if
code ~4.85 times for n=128, and have a branch misprediction every
time. And n=128 is not that small that SIMD should not provide a nice
benefit (only 16 iterations through the AVX2 inner loop).

Let's compare the n=128 cases for all CPUs:

n normal expect AVX2 AVX2_unroll
128 0.450704 0.421053 0.400000 0.278261 Skylake
128 0.421053 0.336842 0.842105 0.421053 Zen 3
128 0.481203 0.336842 1.684211 0.421053 Zen 2
128 0.296296 0.222222 0.888889 0.507937 Zen
128 0.761905 0.512000 0.677249 0.882759 Tiger Lake

So for the Zens AVX2 provides a nice speedup for n=128 (especially Zen
2). Skylake is not so great here. Maybe it's the 256-bit wakeup
slowdown. Tiger Lake already shows a speedup from unrolling at n=128.

>> On Zen 3:
>># Ints per cycle
>># n normal expect AVX2 AVX2_unroll
>
>> 536870912 0.626340 0.417924 2.446799 2.465893
>> 268435456 0.626351 0.417885 2.438281 2.455598
>
>Scalar code is about par with Skylake, the AVX2 code is better.
>Strange that the __builtin_expect code is slower, but that may
>just be the rather old compiler.

Apparently the different code alignment rubs Zen3 very much the wrong
way.

>The clear winner for Zen3: The AVX2 stuff without unrolling.

Not if n>=4k.

>> On Tiger Lake:
>># Ints per cycle
>># n normal expect AVX2 AVX2_unroll
>
>> 536870912 1.292826 1.297217 1.724536 1.724241
>> 268435456 1.299574 1.298605 1.726204 1.708710
>
>The scalar variant is _very_ good

More than 1.7 iterations/rdtsc unit at the high point. Either the
rdtsc unit is very far from the cycle time, or this processor can do
more than one back-to-back add in one cycle. My guess is it's the
former. The base clock of this CPU (Core i5-1135G7) seems to be
2.4GHz, the turbo 4.2GHz. If rdtsc uses the base clock and the
benchmark runs at max turbo, that would result in 1.75 cycles/rdtsc
unit, and the results fit that nicely. Still, 1 cycle/iteration of
perform 2 taken branches per cycle (while, e.g., on Skylake, each
taken branch costs a cycle, resulting in at most 0.5 iterations/cycle
if we assume that a non-taken if branch results in a branch
misprediction).

Getting the result in cycles and in ns would be useful.

>AVX2 without unrolling seems to be the clear winner for all
>architectures you checked, especially the AMD ones, except for Tiger
>Lake, which combines excellent performance with of the scalar loop
>wiht lackluster performance on AVX2. Maybe they figured that,
>while they do support the instructions, performance was not
>so inoportant for them after all. For a processor intended for
>the mobile market, that makes sense.

Tiger Lake was not designed for the mobile market, it ended up being
sold only there because of the fab difficulties that Intel had. They
put in AVX-512 because they thought that SIMD performance is important
for the intended markets of this core. My guess is that the latency
per iteration of the vpcmpgtd-vpblendvb recurrence is relatively long;
according to https://www.agner.org/optimize/instruction_tables.pdf,
this recurrence has 3 cycles of latency; should not be so bad. Hmm.
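
For reference, the recurrence in question looks roughly like this AVX2
sketch (one 8-lane step; my reconstruction, not the attachment's actual
code):

    #include <immintrin.h>

    /* One step of the max+index recurrence: the blends depend on the
       previous iteration's vmax, so their latency bounds the loop. */
    static inline void step(__m256i *vmax, __m256i *vidx,
                            __m256i vals, __m256i vcur)
    {
        __m256i gt = _mm256_cmpgt_epi32(vals, *vmax);  /* vpcmpgtd  */
        *vmax = _mm256_blendv_epi8(*vmax, vals, gt);   /* vpblendvb */
        *vidx = _mm256_blendv_epi8(*vidx, vcur, gt);
    }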

Terje Mathisen

Aug 24, 2021, 3:41:02 AM

The main issue is that the loop is buggy! The inner loop can exit because
(i<n) fails, at which point the next line's "a[i] > m" becomes UB.

Modifying it to while (i+1<n && a[i] <= m) would work I think, but it is
easier to check the index below:

    while (i<n) {
        while (i<n && a[i]<=m)
            i++;
        if (i < n) {
            m = a[i];
            nm = i;
        }
        i++;
    }


> course, at first it will have very short trip counts, but they
> increase the further through the array you work, as the probability to
> find an element larger than the largest one up to now decreases
> (unless the array is sorted).

Searching for a max value in N random elements expects log(N) hits, so
yes, it is usually OK.

Terje Mathisen

Aug 24, 2021, 4:10:39 AM

The eventual solution for all this will be similar to Mitch's FMAC
accumulator, i.e. a form of super-accumulator which allows one or more
elements to be added per cycle, while delaying all inexact/rounding to
the very end.

A carry-save exact accumulator with ~1100 paired bits would only use a
single full adder (2 or 3 gate delays?) to accept a new input, right?

I am not sure what is the best way for such a beast to handle both
additions and subtractions: Do you need to invert/negate the value to be
subtracted?

Thomas Koenig

Aug 24, 2021, 5:58:52 AM

Terje Mathisen <terje.m...@tmsw.no> schrieb:
> Thomas Koenig wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
[some snippage, hopefully context-preserving]

>>>> With all the mechanisms that VVM already offers, a way for the
>>>> programmer or a programming language to specify that operations
>>>> such as summation can be done in any order would be a very useful
>>>> addition.

[...]

>>>> Suggestion:
>>>>
>>>> A variant of the VEC instruction, which does not specify a special
>>>> register to keep the address in (which can be hardwired if there
>>>> is no space in the thread header). This leaves five bits for
>>>> "reduction" registters, which specify that operations on that
>>>> register can be done in any order in the loop.

[...]

>> The way VVM is currently specified, it's stricly in-order semantics
>> you write down a C loop, and the hardware delivers the results
>> exactly in the order you wrote down. This would have to be
>> changed.
>>
>>
>>> You need a
>>> way to allow/specify the two partial sums to be added together in the
>>> end.
>>
>> That as well.

[...]

> The eventual solution for all this will be similar to Mitch's FMAC
> accumulator, i.e. a form of super-accumulator which allows one or more
> elements to be added per cycle, while delaying all inexact/rounding to
> the very end.
>
> A carry-save exact accumulator with ~1100 paired bits would only use a
> single full adder (2 or 3 gate delays?) to accept a new input, right?

Depends on the number of functional units you have. If you have
eight, for a high-performance CPU, you would need four layers
of adders to reduce the number of numbers to be added to two.

It might make sense to go to a signed digit implementation for
the 1100-bit adder (even though that doubles the number of bits of
storage needed for the register) and only do the carry propagation
once, upon storing. Adding one binary number to a signed digit
number should be cheap enough to be done in half a cycle, so adding
one number per cycle throughput sounds feasible.

> I am not sure what is the best way for such a beast to handle both
> additions and subtractions: Do you need to invert/negate the value to be
> subtracted?

Probably easiest.

However, there are also other operations in reduction, multiplication
for example...

Stephen Fuld

Aug 24, 2021, 10:59:31 AM

Sort of. If you code the two statements
C := A + B
D := E + F
it may be, say because of a cache miss on the load of B, that the second
addition occurs before the first. I think it is more accurate to say
that the hardware makes it appear with correct semantics, even if
internally, it takes some liberties with the ordering, as long as the
result is the same.

> This would have to be
> changed.

If it makes a difference, yes. That is why I keep going back to summing
a vector of unsigned integers, where it doesn't. But if it does, then
you need some syntax like you describe below to tell the hardware it is
OK. Then you need some mechanism, which the compiler can generate, to
tell the hardware whether it matters or not.



>> You need a
>> way to allow/specify the two partial sums to be added together in the
>> end.
>
> That as well.
>
>> I don't see your proposal as doing that.
>
> I thought I had implied it, but it was obviously not clear enough.

The problem is, you can't use a single accumulator as the hardware can't
do two/four adds to the same accumulator in the same cycle. If you are
doing say 2/4 partial sums, you need 2/4 places to store them (e.g.
registers), say in the event of an interrupt, and you need to tell the
hardware that, at the end, it needs to add the partial sums. It is this
syntax in the VVM ISA that is not there.


>> And, of course, it is
>> limited to five registers which must be specified in the hardware design.
>
> Five reductions in a loop would be plenty, it is usually one, or more
> rarely two.

Sure. I was thinking more about the fact that the choice of five, say
R1-R5, limits the compiler's register allocation flexibility. Perhaps
this isn't a big deal, but it is something.


>>> This would be a perfect match for OpenMP's reduction clause or
>>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>>
>> I am not an OpenMP person, and my knowledge of Fortran is old, so could
>> you please give a brief explanation of what these two things do? Thanks.
>
> #pragma omp simd reduction(+:var)
>
> before a loop will tell the compiler that it can go wild
> with the sequence of loops but that "var" will be used
> in a summation reduction.
>
> DO CONCURRENT also runs loops in an unspecified order,
> the REDUCTION clause would then allow to, for example,
> sum up all elements.

That makes excellent sense. Thank you.



> One problems with C and similar languages is that you have
> to specify an ordering of the loop explicitly,

So you need something like a pragma to perform similarly to the
DO CONCURRENT REDUCTION clause in Fortran.


> which shapes
> programmer's thinking and also shapes intermediate languages
> for compilers...

Are you saying something about C programmers versus say Fortran
programmers? :-)

As for intermediate languages, if they can solve it for Fortran, I guess
they could do the same for C.

MitchAlsup

Aug 24, 2021, 11:25:41 AM

The word you are looking for is "reduce", or "reduction":
I want this series of calculations reduced to a single number.
<
> other similar type operations), was to allow the hardware to make use of
> multiple functional units. That is, a core with two adders could, if
> allowed, complete the summation in about half the time.
<
This comes with numeric "issues", and as such, I do not want compilers
creating code to use wide resources WITHOUT some word from the
programmer saying it is "OK this time" (#pragma or such).
<
> Without that, I
> don't see any advantage of out of order summations on VVM. If I am
> wrong, please explain. If I am right, see below.
> > Suggestion:
> >
> > A variant of the VEC instruction, which does not specify a special
> > register to keep the address in (which can be hardwired if there
> > is no space in the thread header). This leaves five bits for
> > "reduction" registters, which specify that operations on that
> > register can be done in any order in the loop.
<
> Doing the operations in a different order isn't the problem. You need a
> way to allow/specify the two partial sums to be added together in the
> end. I don't see your proposal as doing that. And, of course, it is
> limited to five registers which must be specified in the hardware design.
<
Done properly, you want each loop performing Kahan-Babuška summation,
not lossy, clumsy double FP.
<
Kahan-Babuška summation should be known to and embedded in the compiler.
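
For reference, a sketch of the Kahan-Babuška (Neumaier) recurrence in C;
in an unrolled loop, each partial sum would carry its own compensation
term:

    #include <math.h>

    double kb_sum(const double *a, int n)
    {
        double s = 0.0, c = 0.0;       /* running sum + compensation */
        for (int i = 0; i < n; i++) {
            double t = s + a[i];
            if (fabs(s) >= fabs(a[i]))
                c += (s - t) + a[i];   /* recover bits lost from a[i] */
            else
                c += (a[i] - t) + s;   /* recover bits lost from s */
            s = t;
        }
        return s + c;                  /* single final correction */
    }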

MitchAlsup

Aug 24, 2021, 11:28:56 AM

Who chooses time versus numerical precision? Human, or software.
IEEE 754-1985 took hardware off the list of who could choose.
<
Even in integer code, unrolled summation changes where exceptions
get thrown (if thrown at all).

MitchAlsup

Aug 24, 2021, 11:37:40 AM

A carry save accumulator uses a 4-input-2-output compressor. With
true complement signals, this is 2 gates of delay (3-input XOR is 1 gate
in TC form). Without TC signals it is 3 gates.
>
> I am not sure what is the best way for such a beast to handle both
> additions and subtractions: Do you need to invert/negate the value to be
> subtracted?
<
The key property is that the value spinning around in the accumulator
keeps the polarity it started with (+ or - but not both), while
the multiplier can produce + or - on a per-calculation basis.
But in practice one can work in the negations at the cost of a
single additional gate of delay.
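
As a software model of the idea, here is a 3-2 carry-save step over one
64-bit word (a sketch only: a real ~1100-bit accumulator would chain
many such words, and the cross-word carry is ignored here):

    #include <stdint.h>

    /* (s, c) <- s + c + x in redundant carry-save form; no carry
       propagates more than one bit position per step. */
    static void csa_add(uint64_t *s, uint64_t *c, uint64_t x)
    {
        uint64_t sum   = *s ^ *c ^ x;                     /* 3-input XOR */
        uint64_t carry = (*s & *c) | (*s & x) | (*c & x); /* majority    */
        *s = sum;
        *c = carry << 1;
    }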

Stephen Fuld

Aug 24, 2021, 11:42:32 AM

On 8/24/2021 8:25 AM, MitchAlsup wrote:
> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:

snip

>>> With all the mechanisms that VVM already offers, a way for the
>>> programmer or a programming language to specify that operations
>>> such as summation can be done in any order would be a very useful
>>> addition.
>> I am probably missing something here. To me the main advantage of
>> allowing out of order summations (using summations here as shorthand for
> <
> The word you are looking for is "reduce", or "reduction"
> I want this series of calculations reduced to a single number.


Yes, I worded that clumsily. I probably should have used the word
"operation" to indicate whatever is needed to perform the reduction.
But using simple summation as an example makes saying things like
"partial sums" meaningful. Is there a generally accepted better term
for the intermediate results of a reduction?


>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units. That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
> <
> This comes with numeric "issues". As with such, I do not want compilers
> creating code to use wide resources WITHOUT some word from the
> programmer saying it is "OK this time". (#pragma or such)

Absolutely agreed. Thomas has said such things are coming, at least
for Fortran, and adding a pragma for C doesn't seem unreasonable, as it
potentially allows a huge performance improvement. But clearly it
shouldn't be the default.



> <
>> Without that, I
>> don't see any advantage of out of order summations on VVM. If I am
>> wrong, please explain. If I am right, see below.
>>> Suggestion:
>>>
>>> A variant of the VEC instruction, which does not specify a special
>>> register to keep the address in (which can be hardwired if there
>>> is no space in the thread header). This leaves five bits for
>>> "reduction" registters, which specify that operations on that
>>> register can be done in any order in the loop.
> <
>> Doing the operations in a different order isn't the problem. You need a
>> way to allow/specify the two partial sums to be added together in the
>> end. I don't see your proposal as doing that. And, of course, it is
>> limited to five registers which must be specified in the hardware design.
> <
> Done properly, you want each loop performing Kahan-Babuška summations
> not lossy clumsy double FP.
> <
> Kahan-Babuška summation should be known to and embedded in the compiler.

While I agree with that, it still doesn't address the issues of needing
multiple intermediate-result "places" (to allow multiple partial
reductions (is that the right word?) to proceed in parallel), and how to
combine them at the end. You need some ISA syntax to specify such things.

MitchAlsup

Aug 24, 2021, 11:43:40 AM

Each layer of 4-2 compression costs 2 gates of delay: 8-2 costs 4 gates,
16-2 costs 6 gates, ...
>
> It might make sense to go to a signed digit implementation for
> the 1100-bit adder (even though that doubles the number of bits of
> storage needed for the register) and only do the carry propagation
> once, upon storing.
<
Carry save is perfectly adequate for an accumulator as carries are
only moving forward ln2( #inputs ) per cycle. IT is easy to build
a Carry Save Find First circuit, so you know where in that 1076 bit
accumulator is the most likely highest bit of significance. {You
will only be off by 1 at most}
<
> Adding one binary number to a signed digit
> number should be cheap enough to be done in half a cycle, so adding
> one number per cycle throughput sounds feasible.
<
> > I am not sure what is the best way for such a beast to handle both
> > additions and subtractions: Do you need to invert/negate the value to be
> > subtracted?
<
New value into summation can be + or -
Current running sum cannot (or can at another couple of gate delays.)

MitchAlsup

unread,
Aug 24, 2021, 11:52:55 AM8/24/21
to
On Tuesday, August 24, 2021 at 10:42:32 AM UTC-5, Stephen Fuld wrote:
> On 8/24/2021 8:25 AM, MitchAlsup wrote:
> > On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
> >> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> snip
> >>> With all the mechanisms that VVM already offers, a way for the
> >>> programmer or a programming language to specify that operations
> >>> such as summation can be done in any order would be a very useful
> >>> addition.
> >> I am probably missing something here. To me the main advantage of
> >> allowing out of order summations (using summations here as shorthand for
> > <
> > The word you are looking for is "reduce", or "reduction"
> > I want this series of calculations reduced to a single number.
> Yes, I worded that clumsily. I probably should have used the word
> "operation" to indicate whatever is needed to perform the reduction.
> But using simple summation as an example, makes saying things like
> "partial sums" meaningful. Is there a generally accepted better term
> for the intermediate results of a reduction?
<
As far as I know: accumulator is about as good as it gets.
A single "core" can work on a single reduction.
It takes multiple cores to work on multiple reductions simultaneously.
cores, by themselves, are multiple clock cycles apart, and so are not,
in general, allowed to know anything about what the others are up to.
<
To properly sum multiple reductions one needs a way to ship the
multiple 1076-bit reductions to a common adder/normalizer/rounder.
One HAS to perform a single rounding to get the correct final result
in a Kahan sense (754 has not gone this far, yet).
<
unums (posits) use the word "quire" as the accumulator.

Terje Mathisen

unread,
Aug 24, 2021, 12:13:26 PM8/24/21
to
You can round each super-acc result to 109 bits or more, at which point
you are guaranteed to get the same result when you add these two
together and then do a final round to double, as if you actually did the
full 1100-bit merge and then rounded, right?

...

No, it is easy to construct a counter-example. :-(

OTOH, I expect that such a super-acc with redundant storage would have a
way to canonicalize it while storing, and at this point it is easy to
merge multiple accumulators as unsigned integer arrays.

If the hw also reports the position of the most significant bit, then
the merging can be handled more efficiently by starting at that position
and adding all the accumulator results, then iterating over subsequent
64-bit blocks until the end or until no change is possible at the
rounding point.
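
(A minimal sketch of that merge, assuming the canonicalized
accumulator is stored as an array of 64-bit limbs, least significant
first; the limb count here is a guess based on the 1076-bit figure:)

#include <stdint.h>

#define LIMBS 17 /* 17 * 64 = 1088 bits >= 1076 */

/* Add accumulator src into dst as one wide unsigned integer,
   propagating carries limb by limb. */
void acc_merge(uint64_t dst[LIMBS], const uint64_t src[LIMBS])
{
    uint64_t carry = 0;
    for (int i = 0; i < LIMBS; i++) {
        uint64_t t = dst[i] + carry;
        carry = (t < carry);          /* overflow from adding carry-in */
        t += src[i];
        carry += (t < src[i]);        /* overflow from adding the limb */
        dst[i] = t;
    }
    /* a nonzero final carry would mean the accumulator overflowed */
}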

Stephen Fuld

unread,
Aug 24, 2021, 12:24:16 PM8/24/21
to
Sure. Apparently, I am not making myself clear. I want to be able
*optionally* to take advantage of multiple functional units in a single
core to speed up a single reduction. As an example, if I want to sum
the elements of an array, and I have two FUs that can do the addition, I
want each to sum half the elements in parallel, then a final add to
combine the two partial sums.

As was pointed out earlier, you can certainly get this by unrolling the
loop, then doing the final add of the two partial sums outside of the
loop, as in the sketch below. But this makes the code "specify" the
number of lanes (i.e. the unroll factor), thus making it hardware-model
specific (i.e. it can't take advantage of a more powerful model with say
four appropriate FUs).
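
(The unrolled version being described, as a minimal sketch; the lane
count 2 is baked into the source, which is exactly the problem. n is
assumed even for brevity:)

/* Two independent partial sums let two FP adders run in parallel;
   the combine happens once, outside the loop. */
double sum2(const double * restrict a, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1; /* the combining step the ISA would otherwise express */
}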

One of the beauties of VVM is that, for most loops, you code it for one
unit, but the hardware "automagically" and transparently invokes
multiple FUs to speed things up. I want to be able to extend this
capability (transparently taking advantage of multiple FUs if they are
available), to reductions, *where the programmer/compiler knows it will
be OK to do so*.

Thomas Koenig

unread,
Aug 24, 2021, 12:26:36 PM8/24/21
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Tuesday, August 24, 2021 at 4:58:52 AM UTC-5, Thomas Koenig wrote:

>> Depends on the number of functional units you have. If you have
>> eight, for a high-performance CPU, you would need four layers
>> of adders to reduce the number of numbers to be added to two.
><
> Each layer of 4-2 compression costs 2 gates of delay; 8-2 costs 4 gates,
> 16-2 costs 6 gates, and so on.

So 4-2 compression is what people are now using, not
one of the (many) other possibilities, and not the full
adders that I assumed?

I didn't know that, thanks for the info.

>>

>> It might make sense to go to a signed digit implementation for
>> the 1100-bit adder (even though that doubles the number of bits of
>> storage needed for the register) and only do the carry propagation
>> once, upon storing.
><
> Carry save is perfectly adequate for an accumulator as carries are
> only moving forward ln2( #inputs ) per cycle. IT is easy to build
> a Carry Save Find First circuit, so you know where in that 1076 bit
> accumulator is the most likely highest bit of significance. {You
> will only be off by 1 at most}

Interesting.

Thomas Koenig

unread,
Aug 24, 2021, 12:38:50 PM8/24/21
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:

>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units. That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
><
> This comes with numeric "issues". As with such, I do not want compilers
> creating code to use wide resources WITHOUT some word from the
> programmer saying it is "OK this time". (#pragma or such)

Agreed - this should be specified in the language (and preferably
not with a #pragma, which is sort of an evil hack, but for things
that are standardized like OpenMP it is acceptable).

Unfortunately, it takes a big hammer - ignoring all these niceties -
to get code vectorized on SIMD systems, so a big hammer is what is
supplied, with seductively named compiler options such as -Ofast
or -ffast-math.
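
(For example, given the plain reduction below, gcc and clang will
normally keep the source evaluation order; it is -ffast-math, or the
narrower -fassociative-math family, that licenses splitting it into
SIMD partial sums:)

/* Vectorizes only when the compiler may reassociate, e.g.
   gcc -O3 -ffast-math (or -fassociative-math plus its prerequisites). */
double dot(const double * restrict x, const double * restrict y, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}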

(There's a double-language pun lurking here. "fast" means "almost"
in German, and one could vary an old saying about experiments to
"Fast programs, fast results, fast richtig" (where the last part
means "almost correct"). Maybe something for Anton's office wall?)

I appreciate what VVM can do here, by keeping the sequential
dependencies alive. There are simply cases where the language
does not require this, as with DO CONCURRENT or #pragma simd,
where these semantics do not apply and where more speed would
help (which is the point you made above).

>> Doing the operations in a different order isn't the problem. You need a
>> way to allow/specify the two partial sums to be added together in the
>> end. I don't see your proposal as doing that. And, of course, it is
>> limited to five registers which must be specified in the hardware design.
><
> Done properly, you want each loop performing Kahan-Babuška summations
> not lossy clumsy double FP.
><
> Kahan-Babuška summation should be known to and embedded in the compiler.

Depends if you need it or not. It is a considerable overhead /
slowdown, and putting it in for all cases also would not be a
good thing.

MitchAlsup

unread,
Aug 24, 2021, 12:43:51 PM8/24/21
to
On some implementations it may have considerable overhead (6 FP ops,
mostly dependent, instead of 1); on others it costs no more than FMAC all by
itself. In my 66000 ISA it can be encoded in 2 instructions and performed
in 1 (CARRY is an instruction-modifier and does not necessarily execute)
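
(For reference, a scalar Kahan-Babuška sketch in C, in the Neumaier
variant; the loop body is roughly the "6 FP ops, mostly dependent"
being traded against a single add:)

#include <math.h>

/* comp gathers the low-order bits lost by each addition and is
   folded back in once at the end. */
double kb_sum(const double *a, int n)
{
    double s = 0.0, comp = 0.0;
    for (int i = 0; i < n; i++) {
        double t = s + a[i];
        if (fabs(s) >= fabs(a[i]))
            comp += (s - t) + a[i]; /* low bits of a[i] were lost */
        else
            comp += (a[i] - t) + s; /* low bits of s were lost */
        s = t;
    }
    return s + comp;
}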

Thomas Koenig

unread,
Aug 24, 2021, 12:46:09 PM8/24/21
to
MitchAlsup <Mitch...@aol.com> schrieb:

> Carry save is perfectly adequate for an accumulator as carries are
> only moving forward ln2( #inputs ) per cycle. It is easy to build
> a Carry Save Find First circuit, so you know where in that 1076 bit
> accumulator is the most likely highest bit of significance. {You
> will only be off by 1 at most}

Having second thoughts here...

Assuming I calculate

01..111111111....111000
+ 00..000000000....001000
=========================
10..000000000....000000

with the 1076 bit accumulator, the carry would still have to be
propagated all the way, across more than 1024 bits if I am unlucky.
This is why I suggested signed digits, because carries only propagate
a single position there (if done correctly).

How would a carry-save adder deal with this situation?

MitchAlsup

unread,
Aug 24, 2021, 1:14:10 PM8/24/21
to
On Tuesday, August 24, 2021 at 11:46:09 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > Carry save is perfectly adequate for an accumulator as carries are
> > only moving forward ln2( #inputs ) per cycle. It is easy to build
> > a Carry Save Find First circuit, so you know where in that 1076 bit
> > accumulator is the most likely highest bit of significance. {You
> > will only be off by 1 at most}
> Having second thoughts here...
>
> Assuming I calculate
>
> 01..111111111....111000
> + 00..000000000....001000
> =========================
> 10..000000000....000000
>
> with the 1076 bit accumulator, the carry would still have to be
> propagated all the way, across more than 1024 bits if I am unlucky.
<
But this propagation takes place in the adder, not in the accumulator,
which remains carry save during all the iterations. The adder is used
only after all of the reductions from all of the lanes arrive and are
compressed into a final carry save result. This adder will be of the
Kogge-Stone variety, where the middle order bits are the last to resolve,
allowing one to begin finding the point of greatest significance
early.
<
> This is why I suggested signed digits, because carries only propagate
> a single position there (if done correctly).
>
> How would a carry-save adder deal with this situation?
>
You are going to get carry save results out of the multiplier
every cycle. In carry save form, you use 2 signals for each bit.
We know fast multipliers perform their work in not just carry
save but also True-Complement (to make the XORs fast),
so I assume we have carry-save and true-complement out
of the multiplier.
<
Anytime one gets a value out of a latch (or flip-flop) one has
access to both true and complement values. The accumulator
"runs" through this flip-flop, so we have both carry save and true
complement data from the 'flop.
<
Thus we have 2 bits from the multiplier and 2 bits from the
accumulator flop, and we can add these 2 values in 2 gates
of delay, producing carry save output, which we then flop.
<
Since this addition only costs 2 gates of delay, we could take
results from 3 lanes and 1 accumulator in 4 gates, 7 lanes
and 1 accumulator in 6 gates: the reduction can easily take
as many as 16 lanes per cycle. I doubt anyone is going to
build a machine "that wide".
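
(A software model of that arrangement may make it concrete; a minimal
sketch, with a 64-bit pair standing in for the 1076-bit register:)

#include <stdint.h>

/* The running accumulator is two bit-vectors: carry save form. */
typedef struct { uint64_t sum, carry; } csa_acc;

/* One 3:2 compression step: fold a new value in with XOR/majority.
   No carry moves more than one position, whatever the operands. */
static void csa_add(csa_acc *acc, uint64_t x)
{
    uint64_t s = acc->sum ^ acc->carry ^ x;
    uint64_t c = ((acc->sum & acc->carry) |
                  (acc->sum & x) |
                  (acc->carry & x)) << 1;
    acc->sum = s;
    acc->carry = c;
}

/* The single full carry propagation, done once at the end
   (the Kogge-Stone adder's job in hardware). */
static uint64_t csa_resolve(const csa_acc *acc)
{
    return acc->sum + acc->carry;
}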