VVM question

Thomas Koenig

unread,

Aug 22, 2021, 12:34:21 PM8/22/21

to

Hi,

a question regarding VVM.

Take the following simplified version of Fortran's MAXLOC intrinsic,
which returns the position of the array element with the maximum
value (the first if there are many).

int m2(int * const restrict a, int n)
{
int m, nm;
int i;

m = INT_MIN;
nm = -1;
for (i=0; i<n; i++)
{
if (a[i] > m)
{
m = a[i];
nm = i;
}
}
return nm;
}

An SIMD version with m lanes would probably determine the maximum
value for each lane separately and, at the end of the loop, return
the smallest index of the largest value, so something like

for (i=0; i<n; i+=n_lanes)
{
if (a[i] > m[0])
{
m[0] = a[i];
nm[0] = i;
}
if (a[i+1] > m[1])
{
m[1] = a[i+1];
nm[1] = i + 1;
}
...
}

How would VVM handle that? Could it also use a similar parallel
approach, from just translating the scalar code?

Anton Ertl

unread,

Aug 22, 2021, 1:54:27 PM8/22/21

to

Thomas Koenig <tko...@netcologne.de> writes:
>Hi,
>
>a question regarding VVM.
>
>Take the following simplified version of Fortran's MAXLOC intrinsic,
>which returns the position of the array element with the maximum
>value (the first if there are many).

This is actually a significant part of the inner loop of Jon Bentley's
Traveling Salesman example which we looked at in the thread that
begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
already...@yahoo.com presented a vectorized version of the loop
over the whole array (rather than a loop that ends as soon as it finds
something closer than before) in
<b2aed821-2b7e-456d...@googlegroups.com>, and I
discussed it in <2016Nov1...@mips.complang.tuwien.ac.at>.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Terje Mathisen

unread,

Aug 22, 2021, 3:41:29 PM8/22/21

to

A verctor version of that, VMM or SIMD, would probably run better with
predicates/conditional moves so as to remove all internal branching from
the core iteration.

One issue is of course that predicates or CMOVs require both some setup
time and latency limitations, so you might actually need to unroll the
code to use twice as many accumulators. If you do so then you can afford
the parallel compare and the pairs of conditional moves of a new max value.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

unread,

Aug 22, 2021, 4:19:53 PM8/22/21

to

Terje Mathisen <terje.m...@tmsw.no> schrieb:

> Thomas Koenig wrote:

>> Take the following simplified version of Fortran's MAXLOC intrinsic,
>> which returns the position of the array element with the maximum
>> value (the first if there are many).

[...]

> A verctor version of that, VMM or SIMD, would probably run better with
> predicates/conditional moves so as to remove all internal branching from
> the core iteration.
>
> One issue is of course that predicates or CMOVs require both some setup
> time and latency limitations, so you might actually need to unroll the
> code to use twice as many accumulators. If you do so then you can afford
> the parallel compare and the pairs of conditional moves of a new max value.

FYI, there is some AVX2 code (which I did not write) at
https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121

Thomas Koenig

unread,

Aug 22, 2021, 4:22:30 PM8/22/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> This is actually a significant part of the inner loop of Jon Bentley's
> Traveling Salesman example which we looked at in the thread that
> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.

Is there a way to access those?

Google Groups always asks me for a userid, which I do not have.

Anton Ertl

unread,

Aug 22, 2021, 5:39:18 PM8/22/21

to

Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> This is actually a significant part of the inner loop of Jon Bentley's
>> Traveling Salesman example which we looked at in the thread that
>> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
>
>Is there a way to access those?

http://al.howardknight.net/

Bookmark it immediately!

MitchAlsup

unread,

Aug 22, 2021, 9:11:06 PM8/22/21

to

On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:
> Hi,
>
> a question regarding VVM.
>
> Take the following simplified version of Fortran's MAXLOC intrinsic,
> which returns the position of the array element with the maximum
> value (the first if there are many).
>
> int m2(int * const restrict a, int n)
> {
> int m, nm;
> int i;
>
> m = INT_MIN;
> nm = -1;
> for (i=0; i<n; i++)
> {
> if (a[i] > m)
> {
> m = a[i];
> nm = i;
> }
> }
> return nm;
> }
<

GLOBAL m2
ENTRY m2
m2:
MOV R3,#0x7FFFFFFFFFFFFFFF
MOV R4,#-1
MOV R5,#0
top:
VEC R8,{R3,R4}
LDW R6,[R1+R5<<2]
CMP R7,R6,R3
PGT R7,{2,TT}
MOV R3,R6 // Be careful on this assignment
MOV R4,R5 // Be careful on this assignment
LOOP LT,R5,#1,R2
MOV R1,R3
RET

>
> An SIMD version with m lanes would probably determine the maximum
> value for each lane separately and, at the end of the loop, return
> the smallest index of the largest value, so something like
>
> for (i=0; i<n; i+=n_lanes)
> {
> if (a[i] > m[0])
> {
> m[0] = a[i];
> nm[0] = i;
> }
> if (a[i+1] > m[1])
> {
> m[1] = a[i+1];
> nm[1] = i + 1;
> }
> ...
> }
>
> How would VVM handle that? Could it also use a similar parallel
> approach, from just translating the scalar code?
<

The VEC instruction tells each loop to watch for modifications to R3 and R4
and obey loop carried dependencies.

Thomas Koenig

unread,

Aug 23, 2021, 1:43:02 AM8/23/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> Thomas Koenig <tko...@netcologne.de> writes:
>>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>>> This is actually a significant part of the inner loop of Jon Bentley's
>>> Traveling Salesman example which we looked at in the thread that
>>> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
>>
>>Is there a way to access those?
>
> http://al.howardknight.net/

Thanks, very good link!

It does not do threading, unfortunately.

> Bookmark it immediately!

Done.

Thomas Koenig

unread,

Aug 23, 2021, 1:44:45 AM8/23/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> On Sunday, August 22, 2021 at 11:34:21 AM UTC-5, Thomas Koenig wrote:

[...]

I'm afraid that does not answer my question, at least I do not
understand it this way.

Will it run several iterations in parallel without source code
modification, or not?

Thomas Koenig

unread,

Aug 23, 2021, 1:54:59 AM8/23/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> Thomas Koenig <tko...@netcologne.de> writes:
>>Hi,
>>
>>a question regarding VVM.
>>
>>Take the following simplified version of Fortran's MAXLOC intrinsic,
>>which returns the position of the array element with the maximum
>>value (the first if there are many).
>
> This is actually a significant part of the inner loop of Jon Bentley's
> Traveling Salesman example which we looked at in the thread that
> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
> already...@yahoo.com presented a vectorized version of the loop
> over the whole array (rather than a loop that ends as soon as it finds
> something closer than before) in
><b2aed821-2b7e-456d...@googlegroups.com>, and I
> discussed it in <2016Nov1...@mips.complang.tuwien.ac.at>.

From what I read in your article, the effect of the code seems
so-so and depend on the architecture.

Would it be possible to give the code at
https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 a spin
to see if it does better? It worked well on Zen 1 despite that
architecture only "faking" AVX2 with 128-bit registers.

Anton Ertl

unread,

Aug 23, 2021, 6:55:30 AM8/23/21

to

Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

>> This is actually a significant part of the inner loop of Jon Bentley's
>> Traveling Salesman example which we looked at in the thread that
>> begins with <2016Nov1...@mips.complang.tuwien.ac.at>.
>> already...@yahoo.com presented a vectorized version of the loop
>> over the whole array (rather than a loop that ends as soon as it finds
>> something closer than before) in
>><b2aed821-2b7e-456d...@googlegroups.com>, and I
>> discussed it in <2016Nov1...@mips.complang.tuwien.ac.at>.
>
>From what I read in your article, the effect of the code seems
>so-so and depend on the architecture.

It also depends on the array size (between 1 and 10000 in my case).
Assuming the crossover point is, say, 2000, it's probably best to use
AVX "branchless" for the first 2000 elements, and then continue with
AVX hard. I wanted to look into "branchless" some more, but, as
usual, other things needed my attention and so I did not pursue it
further.

>Would it be possible to give the code at
>https://gcc.gnu.org/bugzilla/attachment.cgi?id=44121 (which is
>part of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85740 a spin
>to see if it does better? It worked well on Zen 1 despite that
>architecture only "faking" AVX2 with 128-bit registers.

I could not compile it on Debian 11 ("relocation R_X86_64_32S against
`.data' can not be used when making a PIE object; recompile with
-fPIE"; this means that the assembly code contains an absolute address
and should be replaced with a rip-relative address), so I compiled it
on Debian 8 (gcc-4.9.2).

Below is what I see. What does it mean?

On Skylake:

# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.250980 0.198142 0.328205 0.227758
256 0.378698 0.351648 1.000000 0.479401
512 0.498054 0.453901 0.486692 0.609524
1024 0.533889 0.509453 0.499025 0.727273
2048 0.549946 0.558952 0.515869 0.768769
4096 0.560022 0.562174 0.465243 0.821830
8192 0.562560 0.563179 0.616496 0.836260
16384 0.568376 0.566568 0.840464 1.221957
32768 0.569482 0.568612 0.998598 1.522960
65536 0.569640 0.569839 1.227496 2.413316
131072 0.569141 0.570295 1.334039 1.866857
262144 0.570032 0.568262 1.389593 1.929232
524288 0.569357 0.566879 1.508152 1.673972
1048576 0.561443 0.555999 1.533845 1.503037
2097152 0.560509 0.560691 1.458560 1.509459
4194304 0.559187 0.560557 1.456157 1.503564
8388608 0.561024 0.560462 1.494831 1.514211
16777216 0.560297 0.559024 1.496209 1.510765
33554432 0.559756 0.560659 1.501258 1.512948
67108864 0.559765 0.560249 1.507910 1.512386
134217728 0.560098 0.560409 1.506587 1.515123
268435456 0.560284 0.560472 1.509522 1.516031
536870912 0.559883 0.560436 1.508366 1.516430

536870912 0.560183 0.560181 1.509494 1.516893
268435456 0.560113 0.560441 1.507528 1.516041
134217728 0.559948 0.560224 1.509935 1.519144
67108864 0.561124 0.561204 1.505807 1.519437
33554432 0.560492 0.559996 1.518871 1.521890
16777216 0.561216 0.560984 1.501925 1.512587
8388608 0.560717 0.560970 1.481175 1.511185
4194304 0.559531 0.560643 1.456585 1.486993
2097152 0.558511 0.561203 1.401787 1.453253
1048576 0.562318 0.558435 1.330910 1.345867
524288 0.570140 0.567899 1.630715 1.883340
262144 0.570461 0.569180 2.328472 2.177891
131072 0.570325 0.570285 2.357071 2.205857
65536 0.570335 0.569650 1.842244 1.968639
32768 0.569601 0.569621 1.501879 1.559045
16384 0.567431 0.566254 1.169951 1.247259
8192 0.563566 0.562097 0.786936 0.833876
4096 0.564187 0.553214 0.518219 0.807253
2048 0.552617 0.553813 0.516911 0.791957
1024 0.540084 0.522983 0.510469 0.725212
512 0.509960 0.454707 0.490421 0.621359
256 0.477612 0.391437 0.468864 0.481203
128 0.450704 0.421053 0.400000 0.278261

On Zen 3:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.198142 0.177285 0.673684 0.374269
256 0.374269 0.336842 1.122807 0.962406
512 0.561404 0.396285 1.924812 1.347368
1024 0.585812 0.402200 3.368421 2.694737
2048 0.612440 0.411410 2.836565 3.592982
4096 0.612440 0.417789 2.629012 4.145749
8192 0.626683 0.419414 2.661468 5.013464
16384 0.629428 0.420232 2.728847 5.258023
32768 0.629887 0.420437 2.703184 5.226156
65536 0.630809 0.420642 3.193762 5.372684
131072 0.631040 0.422911 3.101855 5.331164
262144 0.629772 0.420719 3.108845 5.360160
524288 0.630751 0.420783 2.422235 5.402135
1048576 0.631184 0.420764 2.097454 5.460935
2097152 0.631162 0.422665 1.937176 5.322937
4194304 0.630690 0.421456 1.999682 4.177444
8388608 0.627358 0.420154 2.007738 3.061207
16777216 0.618820 0.418670 2.549933 2.558949
33554432 0.621342 0.418237 2.360456 2.400368
67108864 0.623856 0.418145 2.394890 2.419667
134217728 0.625304 0.417954 2.421189 2.449932
268435456 0.626265 0.417947 2.452580 2.475416
536870912 0.626237 0.417929 2.441393 2.459276

536870912 0.626340 0.417924 2.446799 2.465893
268435456 0.626351 0.417885 2.438281 2.455598
134217728 0.626210 0.417981 2.430257 2.454633
67108864 0.621872 0.418221 2.435699 2.456569
33554432 0.615973 0.418177 2.423352 2.464991
16777216 0.615432 0.418174 2.377893 2.401301
8388608 0.616635 0.418836 2.178726 2.195453
4194304 0.630528 0.420659 2.144731 2.754934
2097152 0.631170 0.420805 4.347582 5.448535
1048576 0.629830 0.420802 4.644690 5.443698
524288 0.631126 0.420757 4.547479 5.419109
262144 0.630809 0.420796 4.442065 5.410609
131072 0.634289 0.420539 4.382799 5.314735
65536 0.630578 0.420744 4.422132 5.475021
32768 0.630809 0.420027 4.311579 5.322937
16384 0.632196 0.419823 4.311579 5.194673
8192 0.628510 0.419414 4.145749 4.899522
4096 0.623061 0.417789 3.992203 4.145749
2048 0.612440 0.411410 3.849624 3.170279
1024 0.585812 0.408293 2.994152 2.245614
512 0.561404 0.374269 2.245614 1.347368
256 0.481203 0.354571 1.347368 0.748538
128 0.421053 0.336842 0.842105 0.421053

On Zen 2:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.224561 0.210526 0.561404 0.481203
256 0.449123 0.320802 1.347368 1.122807
512 0.561404 0.384962 2.245614 1.347368
1024 0.573348 0.390542 2.994152 3.368421
2048 0.579513 0.396285 2.245614 0.769925
4096 0.602176 0.399220 1.996101 3.368421
8192 0.597172 0.399961 2.092999 4.311579
16384 0.557052 0.407522 1.774312 4.311579
32768 0.602597 0.402575 2.103209 4.311579
65536 0.611571 0.401077 1.783487 4.175863
131072 0.601651 0.407859 1.825007 4.508841
262144 0.608228 0.405629 1.760277 4.453535
524288 0.608362 0.405533 1.707133 4.460735
1048576 0.603678 0.402904 1.525548 4.465066
2097152 0.601913 0.401237 1.524494 4.422487
4194304 0.552595 0.388290 1.598199 1.880156
8388608 0.502226 0.367968 1.548654 1.485521
16777216 0.518826 0.359816 1.446895 1.508812
33554432 0.533121 0.365163 1.498006 1.497000
67108864 0.497297 0.364485 1.489156 1.489815
134217728 0.503081 0.363182 1.485826 1.497298
268435456 0.502331 0.362487 1.476280 1.483783
536870912 0.497604 0.363058 1.471117 1.486186

536870912 0.501185 0.362092 1.474724 1.484411
268435456 0.505392 0.362086 1.480385 1.487473
134217728 0.505914 0.362678 1.477032 1.491003
67108864 0.502462 0.365261 1.489096 1.491641
33554432 0.510745 0.369752 1.499425 1.512902
16777216 0.288380 0.365712 1.486786 1.508833
8388608 0.508408 0.369453 1.474655 1.501383
4194304 0.549125 0.384958 1.482000 1.811350
2097152 0.608269 0.403455 1.714878 4.018949
1048576 0.608483 0.405545 3.159389 4.377932
524288 0.608416 0.407738 3.235707 4.465066
262144 0.608389 0.407787 3.234190 4.476656
131072 0.611463 0.407859 3.229647 4.473752
65536 0.604710 0.407714 3.241789 4.611314
32768 0.610705 0.407137 3.217596 4.377238
16384 0.606411 0.402951 3.170279 4.268890
8192 0.605559 0.406753 3.079699 3.992203
4096 0.592250 0.405224 3.079699 3.592982
2048 0.592250 0.402200 3.170279 3.170279
1024 0.573348 0.390542 2.994152 2.245614
512 0.538947 0.384962 1.924812 1.347368
256 0.481203 0.374269 1.122807 2.245614
128 0.481203 0.336842 1.684211 0.421053

On Zen:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.126984 0.122605 0.053872 0.126984
256 0.187135 0.182336 0.215488 0.374269
512 0.384384 0.278867 0.346883 0.677249
1024 0.270899 0.251721 0.451499 0.917563
2048 0.474074 0.326948 0.462511 0.729345
4096 0.421399 0.336621 0.702332 1.094017
8192 0.434266 0.334641 0.603596 0.702332
16384 0.435097 0.340397 0.682326 0.995867
32768 0.382607 0.336372 0.682837 0.835066
65536 0.247377 0.270015 0.651322 0.729637
131072 0.474074 0.332836 0.756942 0.902998
262144 0.460115 0.331397 0.741676 0.737171
524288 0.386670 0.310371 0.690772 0.679873
1048576 0.447765 0.263413 0.621113 0.639005
2097152 0.457797 0.327805 0.721960 0.666532
4194304 0.463389 0.329218 0.782020 0.794948
8388608 0.467029 0.329897 0.859277 0.855042
16777216 0.484859 0.323049 0.903128 1.051807
33554432 0.485079 0.334300 1.365597 1.332054
67108864 0.484846 0.334430 1.382202 1.341124
134217728 0.484798 0.334393 1.388014 1.344886
268435456 0.483813 0.334464 1.384969 1.343667
536870912 0.484345 0.334486 1.380940 1.349843

536870912 0.484308 0.334349 1.383726 1.346343
268435456 0.482901 0.334368 1.390845 1.350834
134217728 0.483038 0.334411 1.389173 1.348572
67108864 0.483145 0.334543 1.382173 1.339797
33554432 0.482469 0.334347 1.373056 1.339054
16777216 0.484889 0.334234 1.352430 1.320905
8388608 0.485658 0.334543 1.318118 1.295186
4194304 0.482499 0.333650 1.255438 1.245587
2097152 0.499248 0.337774 1.263211 1.287671
1048576 0.513389 0.342309 1.319222 1.791556
524288 0.513271 0.342462 1.909223 1.699365
262144 0.513307 0.342414 2.043148 1.760585
131072 0.513235 0.342350 1.728817 1.918277
65536 0.512801 0.342189 1.603916 1.882569
32768 0.511648 0.341932 1.961686 1.932531
16384 0.509643 0.341163 1.865209 1.850045
8192 0.504558 0.339635 1.996101 1.820444
4096 0.494686 0.336621 1.835125 1.835125
2048 0.474074 0.328838 1.723906 1.723906
1024 0.444444 0.323232 1.497076 1.673203
512 0.374269 0.296296 1.292929 1.422222
256 0.323232 0.263374 1.015873 1.185185
128 0.296296 0.222222 0.888889 0.507937

On Tiger Lake:
# Ints per cycle
# n normal expect AVX2 AVX2_unroll
128 0.507937 0.416938 0.220690 0.066082
256 0.677249 0.448336 0.523517 0.549356
512 0.986513 0.881239 0.702332 0.990329
1024 1.365333 1.253366 0.519007 1.102260
2048 1.505882 1.455579 0.634449 1.074502
4096 1.618972 1.579637 0.647282 1.081595
8192 1.666395 1.436185 0.598349 1.104788
16384 1.697296 1.689420 0.625845 1.077824
32768 1.655786 1.708892 0.699737 1.084818
65536 1.723180 1.713583 0.619556 1.340698
131072 1.718369 1.680906 1.005015 1.388548
262144 1.728601 1.710509 1.225676 1.425385
524288 1.523752 1.505083 1.162664 2.004450
1048576 1.510044 1.498694 1.268084 1.857777
2097152 1.322782 1.301467 1.209695 1.561971
4194304 1.313021 1.283718 1.168607 1.616382
8388608 1.302382 1.315042 1.158387 1.744677
16777216 1.295171 1.300065 1.203086 1.742706
33554432 1.298210 1.300364 1.193553 1.702402
67108864 1.298965 1.298582 1.201180 1.711111
134217728 1.290168 1.295982 1.405890 1.719377
268435456 1.298625 1.290871 1.724766 1.714571
536870912 1.298432 1.286179 1.714161 1.720285

536870912 1.292826 1.297217 1.724536 1.724241
268435456 1.299574 1.298605 1.726204 1.708710
134217728 1.296023 1.292664 1.737698 1.722198
67108864 1.298964 1.297574 1.727979 1.718513
33554432 1.299467 1.193327 1.731739 1.719353
16777216 1.301645 1.293439 1.708782 1.709346
8388608 1.296258 1.290181 1.618332 1.638156
4194304 1.263116 1.282948 1.546646 1.573795
2097152 1.296268 1.273609 1.428083 1.517347
1048576 1.580714 1.527439 1.327631 1.708135
524288 1.722903 1.713628 2.477650 2.922518
262144 1.729501 1.727769 2.557078 2.947558
131072 1.728407 1.725903 1.135826 1.759970
65536 1.725494 1.720240 1.119126 1.474376
32768 1.719383 1.710765 0.812819 1.093615
16384 1.702234 1.686638 0.811491 1.110629
8192 1.682136 1.641354 0.804873 1.096947
4096 1.652279 1.412414 0.803925 1.081881
2048 1.543331 1.426184 0.735104 1.067779
1024 1.216152 1.278402 1.992218 0.959700
512 1.221957 1.201878 0.695652 1.089362
256 1.000000 0.583144 0.744186 0.992248
128 0.761905 0.512000 0.677249 0.882759

Stephen Fuld

unread,

Aug 23, 2021, 10:55:45 AM8/23/21

to

This hearkens back to the thread we had some months ago on reductions in
VVM. I think the answer is "mostly not". I say this because the full
cache line load streaming capability is sort of doing multiple loads in
parallel, but the the compare part of the loop will not use multiple
ALUs in parallel, even if they are available.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

MitchAlsup

unread,

Aug 23, 2021, 11:44:53 AM8/23/21

to

Yes iterations will run in parallel on multiple lanes.
However, any lane that writes to R3 or R4 will cause a serial dependency
at LOOP and will be backed up, much like branch repair, and played out again.
<
So, let us postulate that we have a 4-lanes, and the loop is zipping through
iterations lickity split, and Lane[k] in iteration[j] performs a write to R3 and R4.
Lanes[K+1] and any future iterations[j+1] are cancelled, and the next iteration
begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
run like expected.
<
In effect, the loop runs as expected, but this kind of dependency causes
a "blip" in execution width.

Anton Ertl

unread,

Aug 23, 2021, 11:49:14 AM8/23/21

to

Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>On 8/22/2021 10:44 PM, Thomas Koenig wrote:
|int m2(int * const restrict a, int n)
|{
| int m, nm;
| int i;
|
| m = INT_MIN;
| nm = -1;
| for (i=0; i<n; i++)
| {
| if (a[i] > m)
| {
| m = a[i];
| nm = i;
| }
| }
| return nm;
|}

...

>> Will it run several iterations in parallel without source code
>> modification, or not?
>
>This hearkens back to the thread we had some months ago on reductions in
>VVM. I think the answer is "mostly not". I say this because the full
>cache line load streaming capability is sort of doing multiple loads in
>parallel, but the the compare part of the loop will not use multiple
>ALUs in parallel, even if they are available.

Why not? Consider this as the following equivalent code:

int m2(int * const restrict a, int n)
{
int m, nm;
int i;

m = INT_MIN;
nm = -1;

i=0;
while (i<n) {
while (i<n && a[i]<=m)
i++;

if (a[i] > m) {
m = a[i];
nm = i;
}

i++;
}
return nm;
}

Now look at the inner loop. It is so easy to vectorize that even VVM
may be able to do it (maybe even auto-vectorizing compilers). Of
course, at first it will have very short trip counts, but they
increase the further through the array you work, as the probability to
find an element larger than the largest one up to now decreases
(unless the array is sorted).

Stephen Fuld

unread,

Aug 23, 2021, 12:02:47 PM8/23/21

to

Ahhh! I didn't understand that. So in the case of summing the elements
of an unsigned integer vector, it is the writes to the "running sum"
register that causes the serial dependency and thus prevents parallel
additions. That makes sense.

Thomas Koenig

unread,

Aug 23, 2021, 1:40:32 PM8/23/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

A strangeness that some distributors have put into compilers
recently. Luckily enough, the base version of gcc does not do this.

> recompile with
> -fPIE"; this means that the assembly code contains an absolute address
> and should be replaced with a rip-relative address), so I compiled it
> on Debian 8 (gcc-4.9.2).
>
> Below is what I see. What does it mean?

The numbers mean average iterations per cycle. "n" is the vector
length. The "normal" version is the code as I posted it. The
"expect" version uses __builtin_expect to tell the compiler that
finding a new maximum seems unlikely. AVX2 is an AVX2 version of
the code, and AVX2_unroll is an unrolled version of AVX2.

Numbers go up and then down again to have some reproducibility.
I suspect the "going down" numbers are more reliable, so I'll
look at those.

> On Skylake:

> 536870912 0.560183 0.560181 1.509494 1.516893
> 268435456 0.560113 0.560441 1.507528 1.516041
> 134217728 0.559948 0.560224 1.509935 1.519144

So, for a very long vector: 0.56 iterations per cycle for normal
code, 1.52 iterations for the AVX2 code. Almost a factor of
three, not bad.

[...]

> 1024 0.540084 0.522983 0.510469 0.725212
> 512 0.509960 0.454707 0.490421 0.621359
> 256 0.477612 0.391437 0.468864 0.481203
> 128 0.450704 0.421053 0.400000 0.278261

Don't use the unrolled AVX2 stuff on short vectors, I suppose,
but at least the slowdown for AVX2 against normal code is slight.

> On Zen 3:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 0.626340 0.417924 2.446799 2.465893
> 268435456 0.626351 0.417885 2.438281 2.455598

Scalar code is about par with Skylake, the AVX2 code is better.
Strange that the __builtin_expect code is slower, but that may
just be the rather old compiler.

> 256 0.481203 0.354571 1.347368 0.748538
> 128 0.421053 0.336842 0.842105 0.421053

The clear winner for Zen3: The AVX2 stuff without unrolling.

> On Zen 2:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 0.501185 0.362092 1.474724 1.484411
> 268435456 0.505392 0.362086 1.480385 1.487473

> 256 0.481203 0.374269 1.122807 2.245614
> 128 0.481203 0.336842 1.684211 0.421053

Again, slower than Zen3, but still AVX2 wins hands-down.

> On Zen:
># Ints per cycle

>

> 536870912 0.484308 0.334349 1.383726 1.346343

> 512 0.374269 0.296296 1.292929 1.422222
> 256 0.323232 0.263374 1.015873 1.185185
> 128 0.296296 0.222222 0.888889 0.507937

Again, a clear win for the non-unrolled AVX2 code.

>
> On Tiger Lake:
># Ints per cycle
># n normal expect AVX2 AVX2_unroll

> 536870912 1.292826 1.297217 1.724536 1.724241
> 268435456 1.299574 1.298605 1.726204 1.708710

The scalar variant is _very_ good, AVX2 does gain some, but not
as much as the other architectures, especially when

> 512 1.221957 1.201878 0.695652 1.089362
> 256 1.000000 0.583144 0.744186 0.992248
> 128 0.761905 0.512000 0.677249 0.882759

it seems to get slower towards the end (but the numbers still are a
bit erratic).

AVX2 without unrolling seems to be the clear winner for all
architectures you checked, especially the AMD ones, except for Tiger
Lake, which combines excellent performance with of the scalar loop
wiht lackluster performance on AVX2. Maybe they figured that,
while they do support the instructions, performance was not
so inoportant for them after all. For a processor intended for
the mobile market, that makes sense.

Thomas Koenig

unread,

Aug 23, 2021, 1:50:56 PM8/23/21

to

Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:

> On 8/23/2021 8:44 AM, MitchAlsup wrote:

>> Yes iterations will run in parallel on multiple lanes.
>> However, any lane that writes to R3 or R4 will cause a serial dependency
>> at LOOP and will be backed up, much like branch repair, and played out again.
>> <
>> So, let us postulate that we have a 4-lanes, and the loop is zipping through
>> iterations lickity split, and Lane[k] in iteration[j] performs a write to R3 and R4.
>> Lanes[K+1] and any future iterations[j+1] are cancelled, and the next iteration
>> begins with Lane[k+1] in Lane[0], so the lanes "take" a skew, but otherwise
>> run like expected.
>> <
>> In effect, the loop runs as expected, but this kind of dependency causes
>> a "blip" in execution width.

Good explanation, thanks.

>
> Ahhh! I didn't understand that. So in the case of summing the elements
> of an unsigned integer vector, it is the writes to the "running sum"
> register that causes the serial dependency and thus prevents parallel
> additions. That makes sense.

So (moving the goalposts towards summation here), VVM-optimized code
could look like

for (i=0; i<n; i+=m) {
for (j=0; j<m; j++)
s[i+j] += a[i+j];
}

with suitable postprocessing (and pre-processing if n
is not divisible by m).

Hm. This doesn't really make it more elegant than doing the same kind
of thing in SIMD.

Or how should a reduction be written?

Stephen Fuld

unread,

Aug 23, 2021, 2:16:47 PM8/23/21

to

I think that depends upon whether the order of the operations is
potentially significant. For example, if the values are signed, you may
hit an underflow/overflow at an intermediate step that gets "cancelled
out" by doing multiple intermediate sums then a final "sum of
intermediates" step. Many people have pointed out the issues with doing
the multiply/adds needed for an inner product in parallel. That is why
I specified unsigned integers in the vector to be summed.

So for full generality, you have to do one element at a time. You code
it that way, and VVM executes it that way, except with the benefit of
the full cache width loads. If you know that order is not significant
(i.e. summing unsigned integers), you would have to unroll the loop in
the source code, which would allow parallel additions) and thus buy you
improved performance. A final add outside the loop gets the grand total.

There is the question of how much to unroll the loop. I think you
probably want to unroll it four times in the source code. That way, you
get maximum performance on any system with up to four integer units
without source code changes. I don't think you can do eight, as you
might exceed the VVM instruction limit. You could certainly do two, and
that would work, but you would be giving away performance on a CPU with
4 integer units.

A question for Mitch. Suppose you unroll the loop for summing a vector
of unsigned integers. In VVM, the load for the first element causes a
cache line to be loaded into a streaming buffer. The next load, for the
second element, wants the next entry in the buffer. So, does VVM
recognize this all as a "dense reference" even though the references are
not from the same instruction?

MitchAlsup

unread,

Aug 23, 2021, 2:40:44 PM8/23/21

to

As the Loop is installed in the stations, memory reference address patterns are
examined. If the address pattern are based on indexing off of the register used
in the LOOP instruction, then certain inferences can be made. The determination
of dense is one of these.
<
On the other hand it is easy to code gather scatter in which the pointers/indexes
are dense and the indirect data not.
<
Dense, in a VVM sense, is that several iterations of the loop can all access one
streaming buffer (avoid cache and TLB) so that other stuff (gather/scatter memory
refs) have access through normal cache paths.
<
Back to the posed question:
<
If the programmer unrolled the loop by hand (like DGEMM without transposes):
The LDs would need to be coded using offsets from the index register to be
recognized as dense::

MOV Ri,#0
VEC R8,{}
LDD R4,[R2,Ri<<3]
LDD R5,[R2,Ri<<3+8]
LDD R6,[R2,Ri<<3+16]
LDD R7,[R2,Ri<<3+24]
...
LOOP LT,Ri,#4,Rmax
<
The above code would be recognized as dense.
<
MOV Ri,#0
ADD R9,R2,#8
ADD R9,R2,#16
ADD R10,R2,#24
VEC R8,{}
LDD R4,[R2,Ri<<3]
LDD R5,[R8,Ri<<3+8]
LDD R6,[R9,Ri<<3+16]
LDD R7,[R10,Ri<<3+24]
...
LOOP LT,Ri,#4,Rmax
<
This loop is harder to recognize as dense--even though the number of words
in the loop is less.
<
As the loop is being installed in the stations, the first iteration is performed,
so many of the address patterns can be detected using actual AGEN addresses
not just instructions patterns--so the second case might or might no be recognized.

Thomas Koenig

unread,

Aug 23, 2021, 5:29:19 PM8/23/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

Hm... all possible, but less elegant that it could be. All the
manual unrolling and autovectorization and... rears its ugly
head again.

With all the mechanisms that VVM already offers, a way for the
programmer or a programming language to specify that operations
such as summation can be done in any order would be a very useful
addition.

Suggestion:

A variant of the VEC instruction, which does not specify a special
register to keep the address in (which can be hardwired if there
is no space in the thread header). This leaves five bits for
"reduction" registters, which specify that operations on that
register can be done in any order in the loop.

This would be a perfect match for OpenMP's reduction clause or
for the planned REDUCTION addition to Fortran's DO CONCURRENT.

It would not have a 1:1 match for C semantics, sure, but this
should not pose a problem, I hope :-)

MitchAlsup

unread,

Aug 23, 2021, 5:55:26 PM8/23/21

to

That might be one way...........
<
My preferred means is to make a way to specify that a function unit is
performing a reduction, and that it should not deliver its value at the
end of its calculation, but hold on to it an use it in the next calculation.
<
So, a FMAC reduction would take the form of::

FMAC --,--,--,Rsum
VEC Rx,{}
LDD RA,[Ra+Ri<<3]
FMAC --,RA,RB,--
LOOP LT,Ri,#1,Rmax
FMAC Rsum,--,--,--
<
or something like that. where "--" means there is no operand or result being
specified, use the last operand that showed up, and make the destination
into an operand for the next cycle.
<
The multiplier ALREADY has the ability to perform accumulates
every cycle at the wide adder (3×52+52-bit incrementer), all we need
is an ISA way to specify feed back the last intermediate result as
an operand to the next calculation.
<
The major problem is where does one store the state on an interrupt
taken inside of the loop. I am letting my subconscious dwell on it right
now.

Stephen Fuld

unread,

Aug 23, 2021, 10:51:45 PM8/23/21

to

I am probably missing something here. To me the main advantage of
allowing out of order summations (using summations here as shorthand for
other similar type operations), was to allow the hardware to make use of
multiple functional units. That is, a core with two adders could, if
allowed, complete the summation in about half the time. Without that, I
don't see any advantage of out of order summations on VVM. If I am
wrong, please explain. If I am right, see below.

> Suggestion:
>
> A variant of the VEC instruction, which does not specify a special
> register to keep the address in (which can be hardwired if there
> is no space in the thread header). This leaves five bits for
> "reduction" registters, which specify that operations on that
> register can be done in any order in the loop.

Doing the operations in a different order isn't the problem. You need a
way to allow/specify the two partial sums to be added together in the
end. I don't see your proposal as doing that. And, of course, it is
limited to five registers which must be specified in the hardware design.

>
> This would be a perfect match for OpenMP's reduction clause or
> for the planned REDUCTION addition to Fortran's DO CONCURRENT.

I am not an OpenMP person, and my knowledge of Fortran is old, so could
you please give a brief explanation of what these two things do? Thanks.

Thomas Koenig

unread,

Aug 24, 2021, 2:27:51 AM8/24/21

to

Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:

Or, equvalently, I have been explaining things badly :-)

> To me the main advantage of
> allowing out of order summations (using summations here as shorthand for
> other similar type operations), was to allow the hardware to make use of
> multiple functional units.

Yes.

> That is, a core with two adders could, if
> allowed, complete the summation in about half the time.

Yes.

>Without that, I
> don't see any advantage of out of order summations on VVM. If I am
> wrong, please explain. If I am right, see below.

Seeing below.

>
>
>
>> Suggestion:
>>
>> A variant of the VEC instruction, which does not specify a special
>> register to keep the address in (which can be hardwired if there
>> is no space in the thread header). This leaves five bits for
>> "reduction" registters, which specify that operations on that
>> register can be done in any order in the loop.
>
> Doing the operations in a different order isn't the problem.

It's one half of the problem.

The way VVM is currently specified, it's stricly in-order semantics
you write down a C loop, and the hardware delivers the results
exactly in the order you wrote down. This would have to be
changed.

> You need a
> way to allow/specify the two partial sums to be added together in the
> end.

That as well.

>I don't see your proposal as doing that.

I thought I had implied it, but it was obviously not clear enough.

> And, of course, it is
> limited to five registers which must be specified in the hardware design.

Five reductions in a loop would be plenty, it is usually one, or more
rarely two.

>> This would be a perfect match for OpenMP's reduction clause or
>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>
> I am not an OpenMP person, and my knowledge of Fortran is old, so could
> you please give a brief explanation of what these two things do? Thanks.

#pragma omp simd reduction(+:var)

before a loop will tell the compiler that it can go wild
with the sequence of loops but that "var" will be used
in a summation reduction.

DO CONCURRENT also runs loops in an unspecified order,
the REDUCTION clause would then allow to, for example,
sum up all elements.

One problems with C and similar languages is that you have
to specify an ordering of the loop explicitly, which shapes
programmer's thinking and also shapes intermediate languages
for compilers...

Anton Ertl

unread,

Aug 24, 2021, 3:24:16 AM8/24/21

to

Thomas Koenig <tko...@netcologne.de> writes:
>The numbers mean average iterations per cycle.

Actually per rdtsc unit (which have not been CPU cycles for over a
decade). And you time only a single run through the loop, so any
disturbance (rdtsc wobble, interrupt, etc.) will be very visible. You
also see one rtdsc (on average) in the result.

>"n" is the vector
>length. The "normal" version is the code as I posted it. The
>"expect" version uses __builtin_expect to tell the compiler that
>finding a new maximum seems unlikely.

I compiled with gcc-4.9 -O -std=c99, and this gives the following loops:

ml m2
mov (%rdi,%rdx,4),%ecx mov (%rdi,%rdx,4),%ecx
cmp %r8d,%ecx cmp %r8d,%ecx
jle 40077e <ml+0x21> jle 4007a9 <m2+0x21>
mov %edx,%eax mov %edx,%eax
mov %ecx,%r8d mov %ecx,%r8d
add $0x1,%rdx add $0x1,%rdx
cmp %edx,%esi cmp %edx,%esi
jg 400771 <ml+0x14> jg 40079c <m2+0x14>

So the same loop. Differences in performance may be from code
alignment (does that play a role with uCode caches?) or from
disturbances.

>AVX2 is an AVX2 version of
>the code, and AVX2_unroll is an unrolled version of AVX2.
>
>Numbers go up and then down again to have some reproducibility.
>I suspect the "going down" numbers are more reliable, so I'll
>look at those.
>
>
>> On Skylake:
>
>> 536870912 0.560183 0.560181 1.509494 1.516893
>> 268435456 0.560113 0.560441 1.507528 1.516041
>> 134217728 0.559948 0.560224 1.509935 1.519144
>
>So, for a very long vector: 0.56 iterations per cycle for normal
>code, 1.52 iterations for the AVX2 code. Almost a factor of
>three, not bad.

Given that all CPUs show higher AVX values in between, the loop seems
to run into some limit at some point. My first guess would be the L2
or L3 cache bandwidth, but the edge is not close to either limit:

before
edge L2 L3
1MB 0.25MB 6MB Skylake
8MB 0.5MB 32MB Zen3
4MB/8MB 0.5MB 16MB Zen2 (Zen2 allocates L3 only from a 16MB slice)
2MB/4MB 0.5MB 8MB Zen (Zen allocates L3 only from an 8MB slice).
2MB 1.5MB 8MB Tiger Lake

Given that the AVX code is branchless apart from the loop-back edge,
the limit cannot be branch predictor capacity.

Whatever the limit is, the result for the huge arrays show that limit,
not the capabilities of the SIMD units. For that better look at the
best SIMD results:

n normal expect AVX2 AVX2_unroll

131072 0.570325 0.570285 2.357071 2.205857 Skylake
1048576 0.629830 0.420802 4.644690 5.443698 Zen 3
262144 0.608389 0.407787 3.234190 4.476656 Zen 2
262144 0.513307 0.342414 2.043148 1.760585 Zen
262144 1.729501 1.727769 2.557078 2.947558 Tiger Lake

So Zen with its 128-bit SIMD units is worst, and Tiger Lake is better
than Skylake. It's surprising that Zen2 and Zen3 are so much better
than Tiger Lake. We see a speedup by a factor 8.6 of AVX2_unroll over
normal on Zen3, which is very impressive.

Let's also compare the n=512M values for across CPUs:

n normal expect AVX2 AVX2_unroll

536870912 0.560183 0.560181 1.509494 1.516893 Skylake
536870912 0.626340 0.417924 2.446799 2.465893 Zen 3
536870912 0.501185 0.362092 1.474724 1.484411 Zen 2
536870912 0.484308 0.334349 1.383726 1.346343 Zen
536870912 1.292826 1.297217 1.724536 1.724241 Tiger Lake

The limit seems to be similar on Skylake, Zen 2, and Zen, slightly
higher on Tiger Lake, and much higher on Zen 3. Given that main
memory bandwidth is an issue at that size, it would be good to know
how long an rtdsc tick is.

>> 1024 0.540084 0.522983 0.510469 0.725212
>> 512 0.509960 0.454707 0.490421 0.621359
>> 256 0.477612 0.391437 0.468864 0.481203
>> 128 0.450704 0.421053 0.400000 0.278261
>
>Don't use the unrolled AVX2 stuff on short vectors, I suppose,
>but at least the slowdown for AVX2 against normal code is slight.

I find it surprising that your branchless AVX2 code does not show a
speedup compared to the scalar code, which I expect to enter the if
code ~4.85 times for n=128, and have a branch misprediction every
time. And n=128 is not that small that SIMD should not provide a nice
benefit (only 16 iterations through the AVX2 inner loop).

Let's compare the n=128 cases for all CPUs:

n normal expect AVX2 AVX2_unroll

128 0.450704 0.421053 0.400000 0.278261 Skylake
128 0.421053 0.336842 0.842105 0.421053 Zen 3
128 0.481203 0.336842 1.684211 0.421053 Zen 2
128 0.296296 0.222222 0.888889 0.507937 Zen
128 0.761905 0.512000 0.677249 0.882759 Tiger Lake

So for the Zens AVX2 provides a nice speedup for n=128 (especially Zen
2). Skylake is not so great here. Maybe it's the 256-bit wakeup
slowdown. Tiger Lake already shows a speedup from unrolling at n=128.

>> On Zen 3:
>># Ints per cycle
>># n normal expect AVX2 AVX2_unroll
>
>> 536870912 0.626340 0.417924 2.446799 2.465893
>> 268435456 0.626351 0.417885 2.438281 2.455598
>
>Scalar code is about par with Skylake, the AVX2 code is better.
>Strange that the __builtin_expect code is slower, but that may
>just be the rather old compiler.

Apparently the different code alignment rubs Zen3 very much the wrong
way.

>The clear winner for Zen3: The AVX2 stuff without unrolling.

Not if n>=4k.

>> On Tiger Lake:
>># Ints per cycle
>># n normal expect AVX2 AVX2_unroll
>
>> 536870912 1.292826 1.297217 1.724536 1.724241
>> 268435456 1.299574 1.298605 1.726204 1.708710
>
>The scalar variant is _very_ good

More than 1.7 iterations/rtdsc unit at the high point. Either the
rtdsc unit is very far from the cycle time, or this processor can do
more than one back-to-back add in one cycle. My guess it's the
former. The base clock of this CPU (Core i5-1135G7) seems to be
2.4GHz, the turbo 4.2GHz. If the rtdsc uses the base clock and the
benchmark runs at max turbo, that would result in 1.75 cycles/rtdsc
unit, and the results fit that nicely. Still, 1 cycle/iteration of
the loop above is a very nice result; it means that Tiger Lake can
perform 2 taken branches per cycle (while, e.g., on Skylake, each
taken branch costs a cycle, resulting in at most 0.5 iterations/cycle
if we assume that a non-taken if branch results in a branch
misprediction).

Getting the result in cycles and in ns would be useful.

>AVX2 without unrolling seems to be the clear winner for all
>architectures you checked, especially the AMD ones, except for Tiger
>Lake, which combines excellent performance with of the scalar loop
>wiht lackluster performance on AVX2. Maybe they figured that,
>while they do support the instructions, performance was not
>so inoportant for them after all. For a processor intended for
>the mobile market, that makes sense.

Tiger Lake was not designed for the mobile market, it ended up being
sold only there because of the fab difficulties that Intel had. They
put in AVX-512 because they thought that SIMD performance is important
for the intended markets of this core. My guess is that the latency
per iteration of the vpcmpgtd-vpblendvb recurrence is relatively long;
according to https://www.agner.org/optimize/instruction_tables.pdf,
this recurrence has 3 cycles of latency; should not be so bad. Hmm.

Terje Mathisen

unread,

Aug 24, 2021, 3:41:02 AM8/24/21

to

The main issue is that the loop is buggy! The inner loop can exit due to
(i<n), at which point the next line "a[i] > m" becomes UB.

Modifying it to while (i+1<n && a[i] <= m) would work I think, but it is
easier to check the index below:

while (i<n) {
while (i<n && a[i]<=m)
i++;

if (i < n) {

m = a[i];
nm = i;
}
i++;

> course, at first it will have very short trip counts, but they
> increase the further through the array you work, as the probability to
> find an element larger than the largest one up to now decreases
> (unless the array is sorted).

Searching for a max value in N random elements expects log(n) hits, so
yes it is usually OK.

Terje Mathisen

unread,

Aug 24, 2021, 4:10:39 AM8/24/21

to

The eventual solution for all this will be similar to Mitch's FMAC
accumulator, i.e. a form of super-accumulator which allows one or more
elements to be added per cycle, while delaying all inexact/rounding to
the very end.

A carry-save exact accumulator with ~1100 paired bits would only use a
single full adder (2 or 3 gate delays?) to accept a new input, right?

I am not sure what is the best way for such a beast to handle both
additions and subtractions: Do you need to invert/negate the value to be
subtracted?

Thomas Koenig

unread,

Aug 24, 2021, 5:58:52 AM8/24/21

to

Terje Mathisen <terje.m...@tmsw.no> schrieb:

> Thomas Koenig wrote:
>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:

[some snippage, hopefully context-preserving]

>>>> With all the mechanisms that VVM already offers, a way for the
>>>> programmer or a programming language to specify that operations
>>>> such as summation can be done in any order would be a very useful
>>>> addition.

[...]

>>>> Suggestion:
>>>>
>>>> A variant of the VEC instruction, which does not specify a special
>>>> register to keep the address in (which can be hardwired if there
>>>> is no space in the thread header). This leaves five bits for
>>>> "reduction" registters, which specify that operations on that
>>>> register can be done in any order in the loop.

[...]

>> The way VVM is currently specified, it's stricly in-order semantics
>> you write down a C loop, and the hardware delivers the results
>> exactly in the order you wrote down. This would have to be
>> changed.
>>
>>
>>> You need a
>>> way to allow/specify the two partial sums to be added together in the
>>> end.
>>
>> That as well.

[...]

> The eventual solution for all this will be similar to Mitch's FMAC
> accumulator, i.e. a form of super-accumulator which allows one or more
> elements to be added per cycle, while delaying all inexact/rounding to
> the very end.
>
> A carry-save exact accumulator with ~1100 paired bits would only use a
> single full adder (2 or 3 gate delays?) to accept a new input, right?

Depends on the number of functional units you have. If you have
eight, for a high-performance CPU, you would need four layers
of adders to reduce the number of numbers to be added to two.

It might make sense to go to a signed digit implementation for
the 1100-bit adder (even though that doubles the number of bits of
storage needed for the register) and only do the carry propagation
once, upon storing. Adding one binary number to a signed digit
number should be cheap enough to be done in half a cycle, so adding
one number per cycle throughput sounds feasible.

> I am not sure what is the best way for such a beast to handle both
> additions and subtractions: Do you need to invert/negate the value to be
> subtracted?

Probably easiest.

However, there are also other operations in reduction, multiplication
for example...

Stephen Fuld

unread,

Aug 24, 2021, 10:59:31 AM8/24/21

to

Sort of. If you code the two statements
C := A + B
D := E + F
it may be, say because of a cache miss on the load of B, that the second
addition occurs before the first. I think it is more accurate to say
that the hardware makes it appear with correct semantics, even if
internally, it takes some liberties with the ordering, as long as the
result is the same.

> This would have to be
> changed.

If it makes a difference, yes. That is why I keep going back to summing
a vector of unsigned integers, where it doesn't. But if it does, then
you need some syntax like you describe below to tell the hardware it is
OK. Then you need some mechanism in the hardware that the compiler can
generate to tell the hardware weather it matters or not.

>> You need a
>> way to allow/specify the two partial sums to be added together in the
>> end.
>
> That as well.
>
>> I don't see your proposal as doing that.
>
> I thought I had implied it, but it was obviously not clear enough.

The problem is, you can't use a single accumulator as the hardware can't
do two/four adds to the same accumulator in the same cycle. If you are
doing say 2/4 partial sums, you need 2/4 places to store them (e.g.
registers), say in the event of an interrupt, and you need to tell the
hardware that, at the end, it needs to add the partial sums. It is this
syntax in the VVM ISA that is not there.

>> And, of course, it is
>> limited to five registers which must be specified in the hardware design.
>
> Five reductions in a loop would be plenty, it is usually one, or more
> rarely two.

Sure. I was thinking more about the fact that the choice of five, say
R1-R5, limits the compiler's register allocation flexibility. Perhaps
this isn't a big deal, but it is something.

>>> This would be a perfect match for OpenMP's reduction clause or
>>> for the planned REDUCTION addition to Fortran's DO CONCURRENT.
>>
>> I am not an OpenMP person, and my knowledge of Fortran is old, so could
>> you please give a brief explanation of what these two things do? Thanks.
>
> #pragma omp simd reduction(+:var)
>
> before a loop will tell the compiler that it can go wild
> with the sequence of loops but that "var" will be used
> in a summation reduction.
>
> DO CONCURRENT also runs loops in an unspecified order,
> the REDUCTION clause would then allow to, for example,
> sum up all elements.

That makes excellent sense. Thank you.

> One problems with C and similar languages is that you have
> to specify an ordering of the loop explicitly,

So you need something like a pragma to perform similarly to the
Concurrrent Reduction clause in Fortran.

> which shapes
> programmer's thinking and also shapes intermediate languages
> for compilers...

Are you saying something about C programmers versus say Fortran
programmers? :-)

As for intermediate languages, if they can solve it for Fortran, I guess
they could do the same for C.

MitchAlsup

unread,

Aug 24, 2021, 11:25:41 AM8/24/21

to

The word you are looking for is "reduce", or "reduction"
I want this series of calculations reduced to a single number.

<
> other similar type operations), was to allow the hardware to make use of
> multiple functional units. That is, a core with two adders could, if
> allowed, complete the summation in about half the time.
<

This comes with numeric "issues". As with such, I do not want compilers
creating code to use wide resources WITHOUT some word from the
programmer saying it is "OK this time". (#pragma or such)

<
> Without that, I
> don't see any advantage of out of order summations on VVM. If I am
> wrong, please explain. If I am right, see below.
> > Suggestion:
> >
> > A variant of the VEC instruction, which does not specify a special
> > register to keep the address in (which can be hardwired if there
> > is no space in the thread header). This leaves five bits for
> > "reduction" registters, which specify that operations on that
> > register can be done in any order in the loop.
<
> Doing the operations in a different order isn't the problem. You need a
> way to allow/specify the two partial sums to be added together in the
> end. I don't see your proposal as doing that. And, of course, it is
> limited to five registers which must be specified in the hardware design.
<

Done properly, you want each loop performing Kahan-Babuška summations
not lossy clumsy double FP.
<
Kahan-Babuška summation should be known to and embedded in the compiler.

MitchAlsup

unread,

Aug 24, 2021, 11:28:56 AM8/24/21

to

Who chooses time versus numerical precision ? Human, of software.
IEEE 754-1985 took hardware off the list of who could choose.
<
Even in integer code, unrolled summation changes where exceptions
get thrown (if thrown at all).

MitchAlsup

unread,

Aug 24, 2021, 11:37:40 AM8/24/21

to

A carry save accumulator uses a 4-input-2-output compressor. With
true complement signals, this is 2 gates of delay (3-input XOR is 1 gate
in TC form). Without TC signals it is 3 gates.

>
> I am not sure what is the best way for such a beast to handle both
> additions and subtractions: Do you need to invert/negate the value to be
> subtracted?
<

The key property is that the value spinning around in the accumulator
remains the polarity of when it started (+ or - but not both) while
the multiplier can produce + or - on a per calculation basis.
But in practice one can work in the negations at the cost of a
single additional gate of delay.

Stephen Fuld

unread,

Aug 24, 2021, 11:42:32 AM8/24/21

to

On 8/24/2021 8:25 AM, MitchAlsup wrote:
> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
>> On 8/23/2021 2:29 PM, Thomas Koenig wrote:

snip

>>> With all the mechanisms that VVM already offers, a way for the
>>> programmer or a programming language to specify that operations
>>> such as summation can be done in any order would be a very useful
>>> addition.
>> I am probably missing something here. To me the main advantage of
>> allowing out of order summations (using summations here as shorthand for
> <
> The word you are looking for is "reduce", or "reduction"
> I want this series of calculations reduced to a single number.

Yes, I worded that clumsily. I probably should have used the word
"operation" to indicate whatever is needed to perform the reduction.
But using simple summation as an example, makes saying things like
"partial sums" meaningful. Is there a generally accepted better term
for the intermediate results of a reduction?

>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units. That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
> <
> This comes with numeric "issues". As with such, I do not want compilers
> creating code to use wide resources WITHOUT some word from the
> programmer saying it is "OK this time". (#pragma or such)

Absolutely agreed. But Thomas has said such things are coming, at least
for Fortran. But adding a pragma for C doesn't seem unreasonable, as it
potentially allows a huge performance improvement. But clearly it
shouldn't be the default.

> <
>> Without that, I
>> don't see any advantage of out of order summations on VVM. If I am
>> wrong, please explain. If I am right, see below.
>>> Suggestion:
>>>
>>> A variant of the VEC instruction, which does not specify a special
>>> register to keep the address in (which can be hardwired if there
>>> is no space in the thread header). This leaves five bits for
>>> "reduction" registters, which specify that operations on that
>>> register can be done in any order in the loop.
> <
>> Doing the operations in a different order isn't the problem. You need a
>> way to allow/specify the two partial sums to be added together in the
>> end. I don't see your proposal as doing that. And, of course, it is
>> limited to five registers which must be specified in the hardware design.
> <
> Done properly, you want each loop performing Kahan-Babuška summations
> not lossy clumsy double FP.
> <
> Kahan-Babuška summation should be known to and embedded in the compiler.

While I agree with that, it still doesn't address the issues of needing
multiple intermediate results "places"(to allow multiple partial
reductions (is that the right word?) to proceed in parallel, and how to
combine them at the end. You need some ISA syntax to specify such things.

MitchAlsup

unread,

Aug 24, 2021, 11:43:40 AM8/24/21

to

Each layer of 4-2 compression costs 2 gates of delay. 8-2 4 gates
16-2 6 gates......

>
> It might make sense to go to a signed digit implementation for
> the 1100-bit adder (even though that doubles the number of bits of
> storage needed for the register) and only do the carry propagation
> once, upon storing.
<

Carry save is perfectly adequate for an accumulator as carries are
only moving forward ln2( #inputs ) per cycle. IT is easy to build
a Carry Save Find First circuit, so you know where in that 1076 bit
accumulator is the most likely highest bit of significance. {You
will only be off by 1 at most}

<
< Adding one binary number to a signed digit
> number should be cheap enough to be done in half a cycle, so adding
> one number per cycle throughput sounds feasible.
<
> > I am not sure what is the best way for such a beast to handle both
> > additions and subtractions: Do you need to invert/negate the value to be
> > subtracted?
<

New value into summation can be + or -
Current running sum cannot (or can at another couple of gate delays.)

MitchAlsup

unread,

Aug 24, 2021, 11:52:55 AM8/24/21

to

On Tuesday, August 24, 2021 at 10:42:32 AM UTC-5, Stephen Fuld wrote:
> On 8/24/2021 8:25 AM, MitchAlsup wrote:
> > On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:
> >> On 8/23/2021 2:29 PM, Thomas Koenig wrote:
> snip
> >>> With all the mechanisms that VVM already offers, a way for the
> >>> programmer or a programming language to specify that operations
> >>> such as summation can be done in any order would be a very useful
> >>> addition.
> >> I am probably missing something here. To me the main advantage of
> >> allowing out of order summations (using summations here as shorthand for
> > <
> > The word you are looking for is "reduce", or "reduction"
> > I want this series of calculations reduced to a single number.
> Yes, I worded that clumsily. I probably should have used the word
> "operation" to indicate whatever is needed to perform the reduction.
> But using simple summation as an example, makes saying things like
> "partial sums" meaningful. Is there a generally accepted better term
> for the intermediate results of a reduction?
<

As far as I know: accumulator is about as good as it gets.

A single "core" can work on a single reduction.
It takes multiple cores to work on multiple reductions simultaneously.
cores, by themselves, are multiple clock cycles apart, and so are not
in general, allowed to know anything about what the others are up to.
<
To properly sum multiple reductions one needs a way to ship the
multiple 1076-bit reductions to a common adder/normalizer/rounder.
One HAS to perform a single rounding to get the correct final result
in a Kahan sense (754 has not gone this far, yet).
<
unums (posits) use the word "quire" as the accumulator.

Terje Mathisen

unread,

Aug 24, 2021, 12:13:26 PM8/24/21

to

You can round each super-acc result to 109 bits or more, at which point
you are guaranteed to get the same result when you add these two
together and then do a final round to double, as if you actually did the
full 1100-bit merge and then rounded, right?

...

No, it is easy to contruct a counter-example. :-(

OTOH, I expect that such a super-acc with redundant storage would have a
way to canonicalize it while storing, and at this point it is easy to
merge multiple accumulators as unsigned integer arrays.

If the hw also reports the position of the most significant bit, then
the merging can be handled more efficiently by starting at that position
and add all the accumulator results, then iterate over subsequent 64-bit
blocks until the end or until no change is possible at the rounding point.

Stephen Fuld

unread,

Aug 24, 2021, 12:24:16 PM8/24/21

to

Sure. Apparently, I am not making myself clear. I want to be able
*optionally* to take advantage of multiple functional units in a single
core to speed up a single reduction. As an example, if I want to sum
the elements of an array, and I have two FUs that can do the addition, I
want each to sum half the elements in parallel, then a final add to
combine the two partial sums.

As was pointed out earlier, you can certainly get this by unrolling the
loop, then doing the final add of the two partial sums outside of the
loop. But this makes the code "specify" the number of lanes (i.e. how
many times the loop is unrolled by), thus makes it hardware model
specific (i.e. it can't take advantage of a more powerful model with say
four appropriate FUs).

One of the beauties of VVM is that, for most loops, you code it for one
unit, but the hardware "automagically" and transparently invokes
multiple FUs to speed things up. I want to be able to extend this
capability (transparently taking advantage of multiple FUs if they are
available), to reductions, *where the programmer/compiler knows it will
be OK to do so*.

Thomas Koenig

unread,

Aug 24, 2021, 12:26:36 PM8/24/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> On Tuesday, August 24, 2021 at 4:58:52 AM UTC-5, Thomas Koenig wrote:

>> Depends on the number of functional units you have. If you have
>> eight, for a high-performance CPU, you would need four layers
>> of adders to reduce the number of numbers to be added to two.
><
> Each layer of 4-2 compression costs 2 gates of delay. 8-2 4 gates
> 16-2 6 gates......

So 4-2 compression is currently what people are now using, not
one of the (many) other possibilities, and not the full
adders that I assumed?

I didn't know that, thanks for the info.

>>

>> It might make sense to go to a signed digit implementation for
>> the 1100-bit adder (even though that doubles the number of bits of
>> storage needed for the register) and only do the carry propagation
>> once, upon storing.
><
> Carry save is perfectly adequate for an accumulator as carries are
> only moving forward ln2( #inputs ) per cycle. IT is easy to build
> a Carry Save Find First circuit, so you know where in that 1076 bit
> accumulator is the most likely highest bit of significance. {You
> will only be off by 1 at most}

Interesting.

Thomas Koenig

unread,

Aug 24, 2021, 12:38:50 PM8/24/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> On Monday, August 23, 2021 at 9:51:45 PM UTC-5, Stephen Fuld wrote:

>> other similar type operations), was to allow the hardware to make use of
>> multiple functional units. That is, a core with two adders could, if
>> allowed, complete the summation in about half the time.
><
> This comes with numeric "issues". As with such, I do not want compilers
> creating code to use wide resources WITHOUT some word from the
> programmer saying it is "OK this time". (#pragma or such)

Agreed - this should be specified in the language (and preferably
not with a #pragma, which are sort of an evil hack, but for things
that are standardized like OpenMP that is accepatble).

Unfortunately, it takes a big hammer to get code to be vectorized on
SIMD systems, such as ignoring all these niceties, so a big hammer
is what is supplied, with seductively named compiler options such
as -Ofast or -ffast-math.

(There's a double-langauge pun lurking here. "fast" means "almost"
in German, and one could vary an old saying about experiments to
"Fast programs, fast results, fast richtig" (where the last part
means "almost correct"). Maybe something for Anton's office wall?)

I appreciate what VVM can do here, by keeping the sequential
dependencies alive. There's simply a case where the language
does not specify this, as in DO CONCURRENT or #pragma simd,
when these semantics do not apply and where more speed would
help (which is the point you made above).

>> Doing the operations in a different order isn't the problem. You need a
>> way to allow/specify the two partial sums to be added together in the
>> end. I don't see your proposal as doing that. And, of course, it is
>> limited to five registers which must be specified in the hardware design.
><
> Done properly, you want each loop performing Kahan-Babuška summations
> not lossy clumsy double FP.
><
> Kahan-Babuška summation should be known to and embedded in the compiler.

Depends if you need it or not. It is a considerable overhead /
slowdown, and putting it in for all cases also would not be a
good thing.

MitchAlsup

unread,

Aug 24, 2021, 12:43:51 PM8/24/21

to

On some implementations is may have considerable overhead (6 FP ops
mostly dependent, instead of 1) on others it cost no more than FMAC all by
itself. In my 66000 ISA it can be encoded in 2 instructions and performed
in 1 (CARRY is an instruction-modifier and does not necessarily execute)

Thomas Koenig

unread,

Aug 24, 2021, 12:46:09 PM8/24/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> Carry save is perfectly adequate for an accumulator as carries are
> only moving forward ln2( #inputs ) per cycle. IT is easy to build
> a Carry Save Find First circuit, so you know where in that 1076 bit
> accumulator is the most likely highest bit of significance. {You
> will only be off by 1 at most}

Having second thoughts here...

Assuming I calculate

01..111111111....111000
+ 00..000000000....001000
=========================
10..000000000....000000

with the 1076 bit accumulator, the carry would still have to be
propagated all the way, across more than 1024 bits if I am unlucky.
This is why I suggested signed digits, because carries only propagate
a single position there (if done correctly).

How would a carry-save adder deal with this situation?

MitchAlsup

unread,

Aug 24, 2021, 1:14:10 PM8/24/21

to

On Tuesday, August 24, 2021 at 11:46:09 AM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > Carry save is perfectly adequate for an accumulator as carries are
> > only moving forward ln2( #inputs ) per cycle. IT is easy to build
> > a Carry Save Find First circuit, so you know where in that 1076 bit
> > accumulator is the most likely highest bit of significance. {You
> > will only be off by 1 at most}
> Having second thoughts here...
>
> Assuming I calculate
>
> 01..111111111....111000
> + 00..000000000....001000
> =========================
> 10..000000000....000000
>
> with the 1076 bit accumulator, the carry would still have to be
> propagated all the way, across more than 1024 bits if I am unlucky.
<

But this propagation takes place in the adder not in the accumulator
which remains carry save during all the iterations. The adder is used
only after all of the reductions from all of the lanes arrive and are
compressed into a final carry save result. This adder will be of Kogee
Stone variety, where the middle order bits are the last to resolve,
allowing one to begin finding the point of greatest significance
early.

<
> This is why I suggested signed digits, because carries only propagate
> a single position there (if done correctly).
>
> How would a carry-save adder deal with this situation?
>

You are going to get carry save results out of the multiplier
every cycle. In carry save form, you use 2 signals for each bit.
We know fast multipliers perform their work in not just carry
save but also True-Complement (to make the XORs fast),
so I assume we have carry-save and true-complement out
of the multiplier.
<
Anytime one gets a value out of a latch (of flip-flop) one has
access to both true and compliment values. The accumulator
"runs" through this flip-flop, so we have both carry save and true
complement data from the 'flop.
<
Thus we have 2-bits from the multiplier and 2 bits from the
accumulator flop and we can add these 2 values in 2 gates
of delay producing carry save output, which we then flop.
<
Since this addition only costs 2 gates of delay, we could take
results from 3 lanes and 1 accumulator in 4 gates, 7 lanes
and 1 accumulator in 6 gates: the reduction can easily take
as many as 16 lanes, per cycle. I doubt anyone is going to
build a machine "that wide".

luke.l...@gmail.com

unread,

Aug 27, 2021, 5:36:28 PM8/27/21

to

On Monday, August 23, 2021 at 6:44:45 AM UTC+1, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:

> Will it run several iterations in parallel without source code
> modification, or not?

yes, by running the exact same instructions in in-flight out-of-order multi-issue micro-architecture.

VVM took me 2 years to understand rhe concept, it was a lightbulb moment but only after i had a brainwave in SVP64. more of a 2x4 cluebat smack to the side of the head, but hey.

to conceptually understand VVM in full,.first take either Cray-style Vectors or "SIMD where it is implemented as a horizontal for-loop". doesn't matter which: the important part is:

you have a for-loop on the current instruction, it goes from 0 to LEN-1

thus, instruction1 you have a for-loop from element 0 to element LEN+1
instruction 2: for-loop from element 0 to element LEN+1

got that bit so far?

Vertical-First, of which VVM is a type, you do this:

START LOOP
instruction 1 element 0
instruction 2 element 0
instruction 3 element 0
LOOP BACK
instruction 1 element 1
instruction 2 element 1
....
LOOP BACK
instruction 1 element LEN-1
Instruction 2 element LEN-1
instruction 3 element LEN-1
END LOOP

that's it.

that's all there is to it. if you understand this difference,
between Horizontal-First and Vertical-First element/instruction
processing order, you understand VVM.

thus you can see:

* a simple single issue may do this dead easy, exactly
as the scalar code is written
* a multi issue version may analyse the loop and shove multiple
overlapping elements into overlapping in-flight buffers.

there is no actual optimisation, it just turns out that
the LOOP instruction declares various things such
as the loop invariant, and which registers can be
considered "Vectorised".

[somebody please do check this:
the limitation as i understand it is that those registers
marked as "Vectorised" *cannot* have data passed in to
them. as best i have been able to tell you *must* use
LD to get data into a Vector element and you likewise
*must* ST that element towards the end of the loop]

if you do this then the HW is happy due to it detecting
that the elements are in fact loop independent, and need
only concern itself about Memory R/W Hazards.

SVP64's Vertical First Mode on the other hand has
*actual* mapping to *actual* registers one to one
with the conceptual element numbering illustrated
above, and consequently you *do not* have to rely
on LD/ST. SVP64 is however a 64bit ISA so not
especially compact.

l.

MitchAlsup

unread,

Aug 27, 2021, 5:43:23 PM8/27/21

to

The difference between Scalar, Vector, and Loop Carried is::
a) Scalar registers are read once put into stations and then
....just used to supply operands to instructions
b) Vector registers capture a result produced in this iteration
c) Loop Carried registers capture a result produced in a
....previous iteration.
<
The distinction is used in the operand capture portion of
the stations.

>
> if you do this then the HW is happy due to it detecting
> that the elements are in fact loop independent, and need
> only concern itself about Memory R/W Hazards.
<

One of the things I learned when building wide OoO machines
is the stunning nature of how LITTLE OoOness is needed to
get within spitting distance of "all you can get" performance.

luke.l...@gmail.com

unread,

Aug 27, 2021, 6:05:20 PM8/27/21

to

On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:

> The major problem is where does one store the state on an interrupt
> taken inside of the loop. I am letting my subconscious dwell on it right
> now.

the answer is, without an actual register, you can't.

even if it was a Special Purpose register for storing
in-flight data, it's still a register.

if it was an actual GPR/FPR, actually named
in the instruction (and given as a src/dst into
the FMAC as the accumulator) *now* you
have a register to actually store in-flight
data during an interrupt. it can even be
contextswitched.

a lot of things in SVP64 look overly complex until precise interrupt
handling is thrown into the pot.

l.

luke.l...@gmail.com

unread,

Aug 27, 2021, 6:15:09 PM8/27/21

to

On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:

> My preferred means is to make a way to specify that a function unit is
> performing a reduction, and that it should not deliver its value at the
> end of its calculation, but hold on to it an use it in the next calculation.

in SVP64 we do not have explicit reduction or iteration instructions:
iteration is performed simply by issuing a Vector-Context on top
of a base instruction where OH LOOK! the difference between the
register number src snd dest happens to be exactly one, and OH LOOK!
on each time round the multi-issue-capable execution backend
you get spammed with ADD r1, r2 ADD r2, r3 ADD r3, r4 etc etc
which looks an awful lot like iteration / reduction and is inherently
precise exception interruptible.

we are however also working out an explicit (fixed, predictable)
paralleliseable reduction schedule, it is quite tricky. but, as a fixed
schedule, even non-commutative operations can be thrown at it.

l.

luke.l...@gmail.com

unread,

Aug 27, 2021, 6:23:10 PM8/27/21

to

On Tuesday, August 24, 2021 at 5:24:16 PM UTC+1, Stephen Fuld wrote:

> As was pointed out earlier, you can certainly get this by unrolling the
> loop, then doing the final add of the two partial sums outside of the
> loop. But this makes the code "specify" the number of lanes (i.e. how
> many times the loop is unrolled by), thus makes it hardware model
> specific (i.e. it can't take advantage of a more powerful model with say
> four appropriate FUs).

we designed a reduction "schedule" which does exactly that and
is not tied to the microarchitectural width.

it is a recursive tree algorithm (butterfly-like) as part of the SVP64
Specification. hardware will be required to implement that
algorithm or one that produces the exact same result.

it is possible in other words if you are willing to make the decision
to make it part of the specification of the ISA. ARM did something
similar i believe

however i am not sure that a Vertical-First ISA such as VVM the
concept of Horizontal Reduction even makes sense. i have a strong
feeling they are mutually exclusively incompatible.

l.

MitchAlsup

unread,

Aug 27, 2021, 7:55:31 PM8/27/21

to

On Friday, August 27, 2021 at 5:05:20 PM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
>
> > The major problem is where does one store the state on an interrupt
> > taken inside of the loop. I am letting my subconscious dwell on it right
> > now.
<

Reminding others that this state is 1076 bits long.

<
> the answer is, without an actual register, you can't.
>
> even if it was a Special Purpose register for storing
> in-flight data, it's still a register.
>
> if it was an actual GPR/FPR, actually named
> in the instruction (and given as a src/dst into
> the FMAC as the accumulator) *now* you
> have a register to actually store in-flight
> data during an interrupt. it can even be
> context switched.
<

OK, so what register do you have that is big enough to hold all 1076 bits ?

luke.l...@gmail.com

unread,

Aug 27, 2021, 8:56:46 PM8/27/21

to

On Saturday, August 28, 2021 at 12:55:31 AM UTC+1, MitchAlsup wrote:

> OK, so what register do you have that is big enough to hold all 1076 bits ?

none, i'm limiting things to 64 bit, so that'd need to be
done across multiple registers (long arithmetic, big
integer math).

one option to explore: the iteration/reduction in SVP64, the
parallel schedule, *requires* a Vector src/dest for use as
an accumulator, because it is used to store partial parallel
results as part of the tree reduction. this has an advantage
in that it solves the precise exception problem.

the only reason this is possible is because SVP64 has explicit
not implicit Vector registers, mapped onto the (enlargened)
scalar regfile.

if VVM was similarly extended to have *actual* Vector registers
as opposed to its current design of only mapping elements onto
in-flight Reservation Stations, *then* there would be somewhere
to put large in-flight data (such as 1076 bits).

l

EricP

unread,

Aug 28, 2021, 10:17:41 AM8/28/21

to

Did you decide against the barf-buffer?

luke.l...@gmail.com

unread,

Aug 28, 2021, 10:30:30 AM8/28/21

to

On Saturday, August 28, 2021 at 3:17:41 PM UTC+1, EricP wrote:
> Did you decide against the barf-buffer?

https://m.youtube.com/watch?v=KvzrguhmK0o&t=17

MitchAlsup

unread,

Aug 28, 2021, 1:46:40 PM8/28/21

to

No, I was simply reinforcing the notion that IEEE 754 requires an infinitely
precise result prior to rounding. And in the case of reductions, 1076-bits
is enough, while 64 and 128 definitely are not.
<
Any advancement of 754 needs to remain aware of the progress of unums
in order to stay the preferred FP standard. Unums have the quire, 754 needs
something at least as good.

Terje Mathisen

unread,

Aug 29, 2021, 4:09:58 PM8/29/21

to

MitchAlsup wrote:
> On Friday, August 27, 2021 at 5:05:20 PM UTC-5, luke.l...@gmail.com wrote:
>> On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
>>
>>> The major problem is where does one store the state on an interrupt
>>> taken inside of the loop. I am letting my subconscious dwell on it right
>>> now.
> <
> Reminding others that this state is 1076 bits long.

I would in fact (strongly) prefer for this super accumulator to have
quite a few guard bits! I.e. if I add & subtract a bunch of values then
depending upon the order I might have a temporary overflow which
wouldn't have happened with some other reduction order.

How many would be enough?

For sum reductions 32 extra bits would allow arbitrary ordering of at
least a few billion items, so that's plenty. The additional cost of
going to 1108 instead of 1076 bits seems miniscule.
:-)

We still need some (really fast!) storage if this resource needs to be
interruptable, otherwise we would be left with a (memory-mapped?)
accelerator which you would ask the OS to allocate to you for the
duration, but with the additional cost that you would need to write all
the results to this unit. That could still be OK if it is on-chip and
can accept several store operations/cycle.

Doing it this way removes the interruptability requirement, at the cost
of needing a driver to handle arbitration.

EricP

unread,

Aug 29, 2021, 4:26:22 PM8/29/21

to

Terje Mathisen wrote:
> MitchAlsup wrote:
>> On Friday, August 27, 2021 at 5:05:20 PM UTC-5, luke.l...@gmail.com
>> wrote:
>>> On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
>>>
>>>> The major problem is where does one store the state on an interrupt
>>>> taken inside of the loop. I am letting my subconscious dwell on it
>>>> right
>>>> now.
>> <
>> Reminding others that this state is 1076 bits long.
>
> I would in fact (strongly) prefer for this super accumulator to have
> quite a few guard bits! I.e. if I add & subtract a bunch of values then
> depending upon the order I might have a temporary overflow which
> wouldn't have happened with some other reduction order.
>
> How many would be enough?
>
> For sum reductions 32 extra bits would allow arbitrary ordering of at
> least a few billion items, so that's plenty. The additional cost of
> going to 1108 instead of 1076 bits seems miniscule.
> :-)

If My 66000 can have 32 arch registers doesn't that allow 32 FMA's for
cross-loop registers, plus per-loop registers are in addition to that,
to all be in-flight on an in-order core when that interrupt arrives?

So it could be looking at saving something like 64 * 1108 bits.

Or maybe I have misunderstood.

MitchAlsup

unread,

Aug 29, 2021, 4:39:10 PM8/29/21

to

On Sunday, August 29, 2021 at 3:26:22 PM UTC-5, EricP wrote:
> Terje Mathisen wrote:
> > MitchAlsup wrote:
> >> On Friday, August 27, 2021 at 5:05:20 PM UTC-5, luke.l...@gmail.com
> >> wrote:
> >>> On Monday, August 23, 2021 at 10:55:26 PM UTC+1, MitchAlsup wrote:
> >>>
> >>>> The major problem is where does one store the state on an interrupt
> >>>> taken inside of the loop. I am letting my subconscious dwell on it
> >>>> right
> >>>> now.
> >> <
> >> Reminding others that this state is 1076 bits long.
> >
> > I would in fact (strongly) prefer for this super accumulator to have
> > quite a few guard bits! I.e. if I add & subtract a bunch of values then
> > depending upon the order I might have a temporary overflow which
> > wouldn't have happened with some other reduction order.
> >
> > How many would be enough?
> >
> > For sum reductions 32 extra bits would allow arbitrary ordering of at
> > least a few billion items, so that's plenty. The additional cost of
> > going to 1108 instead of 1076 bits seems miniscule.
> > :-)
<
> If My 66000 can have 32 arch registers doesn't that allow 32 FMA's for
> cross-loop registers, plus per-loop registers are in addition to that,
> to all be in-flight on an in-order core when that interrupt arrives?
<

An FMAC is 4 or 5 cycles of visible latency.

>
> So it could be looking at saving something like 64 * 1108 bits.
<

Building an accumulation of FMULs one might have 4 (or 8)
lanes of FMULs sending 4(|8)×106 bit products per cycle to a
single Accumulator which accumulates them into a 1076-bit
accumulator.
<
So, I only see 1 of these "big" registers per reduction.
<
What you want is to perform the calculation leading to the reduction
as wide as possible and then perform the accumulation of the reduction
as wide as possible to a single carry-save result per cycle. Whether
accumulation is AND, OR, XOR, IADD, UADD, FADD makes no real
difference to the accumulation.

Thomas Koenig

unread,

Aug 29, 2021, 5:25:20 PM8/29/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> What you want is to perform the calculation leading to the reduction
> as wide as possible and then perform the accumulation of the reduction
> as wide as possible to a single carry-save result per cycle. Whether
> accumulation is AND, OR, XOR, IADD, UADD, FADD makes no real
> difference to the accumulation.

Adding four or eight FMAC results into a single one... would the
carry-save adder also have to be the whole width? Anything
else might be too complicated, I guess.

If one result from the carry-save adder comes in every cycle,
I assume that final addition would also have to pipelined?

An, one other remark: The reduction in question could also be
a multiplication.

A 1076*106 bit multiplier sounds challenging...

MitchAlsup

unread,

Aug 29, 2021, 5:34:27 PM8/29/21

to

On Sunday, August 29, 2021 at 4:25:20 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > What you want is to perform the calculation leading to the reduction
> > as wide as possible and then perform the accumulation of the reduction
> > as wide as possible to a single carry-save result per cycle. Whether
> > accumulation is AND, OR, XOR, IADD, UADD, FADD makes no real
> > difference to the accumulation.
<
> Adding four or eight FMAC results into a single one... would the
> carry-save adder also have to be the whole width? Anything
> else might be too complicated, I guess.
<

Congratulations for your "Captain Obvious" remark.

>
> If one result from the carry-save adder comes in every cycle,
> I assume that final addition would also have to pipelined?
<

Addition, Normalization, and Rounding are all after the accumulate
stage. As are Underflow, Overflow,.....

>
> An, one other remark: The reduction in question could also be
> a multiplication.
<

No. Multiplication is NOT a reduction found in physics or math.
It is a producer of things to become reduced.

>
> A 1076*106 bit multiplier sounds challenging...

It is just 17×17 = 288 { 64*64 multiplications }.

luke.l...@gmail.com

unread,

Aug 29, 2021, 5:41:26 PM8/29/21

to

On Sunday, August 29, 2021 at 9:09:58 PM UTC+1, Terje Mathisen wrote:

> We still need some (really fast!) storage if this resource needs to be
> interruptable, otherwise we would be left with a (memory-mapped?)
> accelerator which you would ask the OS to allocate to you for the
> duration, but with the additional cost that you would need to write all
> the results to this unit. That could still be OK if it is on-chip and
> can accept several store operations/cycle.

the typical way would be to have a batch of SPRs/CSRs
(special purpose / control status registers) each 64 bit,
the 1000-bit value is spread across multiple of them.

> Doing it this way removes the interruptability requirement, at the cost
> of needing a driver to handle arbitration.

if the contents were accessible via QTY 16 64 bit SPRs then save/restore
is a matter of LD/ST of 16 additional SPRs. it is a hell of a lot of state,
probably best done through conditional checking to see if the state is
actually in use.

in a Horizontal-First Vector ISA this problem does not occur, you simply
declare that the entire state in the Reduction Sum is to be thrown away, if
interrupted, and this is perfectly possible because Horizontal-First
Vectors *only* execute the one Vector instruction, in full, all elements,
before moving on to the next instruction.

Vertical-First you are a bit screwed: element-based accumulation
on one instruction, followed by others that have nothing to do
with the Reduction Sum, then round the loop again, can you "unwind"
the entire lot back to *before* the Reduction Sum started?
(by using Shadowing or Transaction snapshots)
maybe the answer there is yes.

that is still an awful lot of in-flight Shadowed state to be holding in buffers,
and, worse, a simple single-issue in-order system (one of the
supported designs of VVM) would not cope.

i think this is why in SVP64 we're supporting both Horizontal and
Vertical Vector Modes, because there are advantages (and disadvantages)
of each, and they complement each other.

l.

Stephen Fuld

unread,

Aug 29, 2021, 6:08:15 PM8/29/21

to

I may be missing something, but why not have the space defined in the
the thread header, right after the area allocated for register saves.
(Adding an extra say 150 bytes per thread is insignificant.) You would
have a bit set to indicate whether the large accumulator is in use or
not (Carry meta instruction immediately proceeding the VEC
instruction?), so it would only need to be saved/restored if it was
actually in use. You might need variants of the load/store multiple to
indicate the accumulator as the target/source for the instruction. But
since the ISR isn't going to use the accumulator, the store can be lazy.
There is an extra time penalty to restart the thread using the
accumulator, but it should be pretty rare, and only hurts him. Thus, no
extra hardware (perhaps it uses some level of the cache, TBD), and no
driver required.

MitchAlsup

unread,

Aug 29, 2021, 6:29:33 PM8/29/21

to

On Sunday, August 29, 2021 at 4:41:26 PM UTC-5, luke.l...@gmail.com wrote:
> On Sunday, August 29, 2021 at 9:09:58 PM UTC+1, Terje Mathisen wrote:
>
> > We still need some (really fast!) storage if this resource needs to be
> > interruptable, otherwise we would be left with a (memory-mapped?)
> > accelerator which you would ask the OS to allocate to you for the
> > duration, but with the additional cost that you would need to write all
> > the results to this unit. That could still be OK if it is on-chip and
> > can accept several store operations/cycle.
<
> the typical way would be to have a batch of SPRs/CSRs
> (special purpose / control status registers) each 64 bit,
> the 1000-bit value is spread across multiple of them.
<
> > Doing it this way removes the interruptability requirement, at the cost
> > of needing a driver to handle arbitration.
<
> if the contents were accessible via QTY 16 64 bit SPRs then save/restore
> is a matter of LD/ST of 16 additional SPRs. it is a hell of a lot of state,
> probably best done through conditional checking to see if the state is
> actually in use.
>
> in a Horizontal-First Vector ISA this problem does not occur, you simply
> declare that the entire state in the Reduction Sum is to be thrown away, if
> interrupted, and this is perfectly possible because Horizontal-First
> Vectors *only* execute the one Vector instruction, in full, all elements,
> before moving on to the next instruction.
<

I should remind you that the reduction is over a loop of vectorized
calculations. You want the accurate reduction of 1B calculations,
not 64 (or whatever). So the problem does not exist for the H-F vector
but still remains for a loop of H-F vectors. How do you get the 1076
bits from the first loop into the accumulator for the 2nd, 3rd, 4th loops ?

>
> Vertical-First you are a bit screwed: element-based accumulation
> on one instruction, followed by others that have nothing to do
> with the Reduction Sum, then round the loop again, can you "unwind"
> the entire lot back to *before* the Reduction Sum started?
> (by using Shadowing or Transaction snapshots)
> maybe the answer there is yes.
<

Screwedness is the same. It just appears at slightly different boundaries.

>
> that is still an awful lot of in-flight Shadowed state to be holding in buffers,
> and, worse, a simple single-issue in-order system (one of the
> supported designs of VVM) would not cope.
<

Which is why I recognize the reduction problem but have not converged
to a implementable solution.

luke.l...@gmail.com

unread,

Aug 29, 2021, 6:30:28 PM8/29/21

to

On Sunday, August 29, 2021 at 11:08:15 PM UTC+1, Stephen Fuld wrote:

> I may be missing something, but why not have the space defined in the
> the thread header, right after the area allocated for register saves.
> (Adding an extra say 150 bytes per thread is insignificant.) You would
> have a bit set to indicate whether the large accumulator is in use or
> not (Carry meta instruction immediately proceeding the VEC
> instruction?), so it would only need to be saved/restored if it was
> actually in use.

funny, crossover, this is exactly the idea i suggested, too.
whether it be done CISC style (the hardware performs the save)
or RISC style (explicit instructions in the ISR) it is still quite
expensive. in the RISC case the ISR must always have at least
the insteuction testing the bit, plus a branch to jump over
the save.

i'm not seeing many other viable options though, if Vertical-Mode
is to be preserved in VVM and this feature added.

would it not be better to attempt long-integer math instead? support
both massive FP numbers spread across multiple GPRs as well
as big integer math, and do it that way?

l.

MitchAlsup

unread,

Aug 29, 2021, 7:35:57 PM8/29/21

to

On Sunday, August 29, 2021 at 5:30:28 PM UTC-5, luke.l...@gmail.com wrote:
> On Sunday, August 29, 2021 at 11:08:15 PM UTC+1, Stephen Fuld wrote:
>
> > I may be missing something, but why not have the space defined in the
> > the thread header, right after the area allocated for register saves.
> > (Adding an extra say 150 bytes per thread is insignificant.) You would
> > have a bit set to indicate whether the large accumulator is in use or
> > not (Carry meta instruction immediately proceeding the VEC
> > instruction?), so it would only need to be saved/restored if it was
> > actually in use.
<

Currently, I have 320 bytes of Thread State (256 Bytes of which are registers).
I would like to put these things on 512Byte boundaries. Leaving 1536
bits left over. So, technically, one per thread(=process) would fit.
<
I have been using this space for Exception logout between the time of
raising an exception and when exception is processed by handler. But
nothing is in concrete yet. This logout allows a disabled exception to
logout the fact that an instruction raised this exception, and hold onto
it while the exception remains disabled. But when reEnabled, the enabled
and raised exception not only causes control transfer, but has data to
feed the handler.

<
> funny, crossover, this is exactly the idea i suggested, too.
> whether it be done CISC style (the hardware performs the save)
> or RISC style (explicit instructions in the ISR) it is still quite
> expensive. in the RISC case the ISR must always have at least
> the insteuction testing the bit, plus a branch to jump over
> the save.
<

Not just expensive in time, but in complexity.

>
> i'm not seeing many other viable options though, if Vertical-Mode
> is to be preserved in VVM and this feature added.
>
> would it not be better to attempt long-integer math instead? support
> both massive FP numbers spread across multiple GPRs as well
> as big integer math, and do it that way?
<

Kahan-Babbuška (IEEE 754-2019) summation is already in My 66000 ISA.
<
This in a numerical sense is being compared and contrasted with quires
in unums of which IEEE fails on a technical sense (numeric purity) and
wins on a practical sense (implementation difficulty).
>
> l.

Anton Ertl

unread,

Aug 30, 2021, 11:53:00 AM8/30/21

to

Thomas Koenig <tko...@netcologne.de> writes:
>An, one other remark: The reduction in question could also be
>a multiplication.
>
>A 1076*106 bit multiplier sounds challenging...

Mitch Alsup pointed out that multiplicative reduction is rare.

One other aspect is that multiplication does not produce cancellation,
so there is no need for such a large accumulator. Exponent overflow
is much more likely to become a problem when you perform
multiplicative reduction on a large set of numbers.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

luke.l...@gmail.com

unread,

Aug 30, 2021, 12:58:37 PM8/30/21

to

On Monday, August 30, 2021 at 12:35:57 AM UTC+1, MitchAlsup wrote:

> I have been using this space for Exception logout between the time of
> raising an exception and when exception is processed by handler. But
> nothing is in concrete yet. This logout allows a disabled exception to
> logout the fact that an instruction raised this exception, and hold onto
> it while the exception remains disabled. But when reEnabled, the enabled
> and raised exception not only causes control transfer, but has data to
> feed the handler.

you will perhaps be either fascinated or amused that i could read the
words but they melted my brain and i couldn't understand it.
perhaps because i lack context? if you'd like to use that as an
opportunity to clarify things i'm more than happy to hear further
elaboration.

> <
> > funny, crossover, this is exactly the idea i suggested, too.
> > whether it be done CISC style (the hardware performs the save)
> > or RISC style (explicit instructions in the ISR) it is still quite
> > expensive. in the RISC case the ISR must always have at least
> > the insteuction testing the bit, plus a branch to jump over
> > the save.
> <
> Not just expensive in time, but in complexity.
> >
> > i'm not seeing many other viable options though, if Vertical-Mode
> > is to be preserved in VVM and this feature added.
> >
> > would it not be better to attempt long-integer math instead? support
> > both massive FP numbers spread across multiple GPRs as well
> > as big integer math, and do it that way?
> <
> Kahan-Babbuška (IEEE 754-2019) summation is already in My 66000 ISA.

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

interesting. so the intermediate result is created, but the remainder
is lost. *however*, by *subtracting* the intermediate result from the
current sum, the difference (which may be quite small) is dropped
into a secondary accumulator.

nice trick.

is that being applied at a hardware level or as an expected algorithm to be
used in software?

l.

Stephen Fuld

unread,

Aug 30, 2021, 1:30:13 PM8/30/21

to

On 8/29/2021 4:35 PM, MitchAlsup wrote:
> On Sunday, August 29, 2021 at 5:30:28 PM UTC-5, luke.l...@gmail.com wrote:
>> On Sunday, August 29, 2021 at 11:08:15 PM UTC+1, Stephen Fuld wrote:
>>
>>> I may be missing something, but why not have the space defined in the
>>> the thread header, right after the area allocated for register saves.
>>> (Adding an extra say 150 bytes per thread is insignificant.) You would
>>> have a bit set to indicate whether the large accumulator is in use or
>>> not (Carry meta instruction immediately proceeding the VEC
>>> instruction?), so it would only need to be saved/restored if it was
>>> actually in use.
> <
> Currently, I have 320 bytes of Thread State (256 Bytes of which are registers).
> I would like to put these things on 512Byte boundaries. Leaving 1536
> bits left over. So, technically, one per thread(=process) would fit.
> <
> I have been using this space for Exception logout between the time of
> raising an exception and when exception is processed by handler. But
> nothing is in concrete yet. This logout allows a disabled exception to
> logout the fact that an instruction raised this exception, and hold onto
> it while the exception remains disabled. But when reEnabled, the enabled
> and raised exception not only causes control transfer, but has data to
> feed the handler.

OK, that is certainly reasonable, but I am not sure that it is a
problem. At first blush, the space's use as a save area for the
accumulator only occurs when the using thread is being switched out,
which means that, by definition, exceptions weren't disabled. The
thread using it will then not be active, so it can't take an exception.
So I think you can use the same space for both uses, as long as you
have a bit in there somewhere to indicate which use is the current one.

MitchAlsup

unread,

Aug 30, 2021, 1:45:24 PM8/30/21

to

On Monday, August 30, 2021 at 11:58:37 AM UTC-5, luke.l...@gmail.com wrote:
> On Monday, August 30, 2021 at 12:35:57 AM UTC+1, MitchAlsup wrote:
>
> > I have been using this space for Exception logout between the time of
> > raising an exception and when exception is processed by handler. But
> > nothing is in concrete yet. This logout allows a disabled exception to
> > logout the fact that an instruction raised this exception, and hold onto
> > it while the exception remains disabled. But when reEnabled, the enabled
> > and raised exception not only causes control transfer, but has data to
> > feed the handler.
<
> you will perhaps be either fascinated or amused that i could read the
> words but they melted my brain and i couldn't understand it.
> perhaps because i lack context? if you'd like to use that as an
> opportunity to clarify things i'm more than happy to hear further
> elaboration.
<

I use the word Thread whereas most Unix/Linux would use the
word process. A Thread in my nomenclature can be a u/l thread
under a process, a process itself, or an OS thread servicing the
process. A Thread contains a Root Pointer to memory mapping
tables, and contains State: PSW, register file, an Exception Raised
register and an Exception Enabled Register. These later 2 enable the
Thread to control when he takes exception control transfer.
<
When an Exception is Raised, a bit of state is stored so the
Exception Handler, when it runs, knows what to do. That an
exception was raised is recorded in the Raised register,
What caused the exception, where it was caused, is recorded
elsewhere. Right now this elsewhere is concatenated to the
area containing the Thread State; it could be moved elsewhere.
IP (in Thread State) points at the offending instruction,...

> > <
> > > funny, crossover, this is exactly the idea i suggested, too.
> > > whether it be done CISC style (the hardware performs the save)
> > > or RISC style (explicit instructions in the ISR) it is still quite
> > > expensive. in the RISC case the ISR must always have at least
> > > the insteuction testing the bit, plus a branch to jump over
> > > the save.
> > <
> > Not just expensive in time, but in complexity.
> > >
> > > i'm not seeing many other viable options though, if Vertical-Mode
> > > is to be preserved in VVM and this feature added.
> > >
> > > would it not be better to attempt long-integer math instead? support
> > > both massive FP numbers spread across multiple GPRs as well
> > > as big integer math, and do it that way?
> > <
> > Kahan-Babbuška (IEEE 754-2019) summation is already in My 66000 ISA.
> https://en.wikipedia.org/wiki/Kahan_summation_algorithm
>
> interesting. so the intermediate result is created, but the remainder
> is lost. *however*, by *subtracting* the intermediate result from the
> current sum, the difference (which may be quite small) is dropped
> into a secondary accumulator.
>
> nice trick.
>
> is that being applied at a hardware level or as an expected algorithm to be
> used in software?
<

Kahan-Babbuška summation in My 66000 is 1 instruction-modifier,
and 1 FMAC instruction. It is easier in HW to do these things than in SW.
>
> l.

MitchAlsup

unread,

Aug 30, 2021, 1:50:18 PM8/30/21

to

Yes, as far as HW exceptions are concerned. Software can also "throw"
exceptions at a Thread, Inter Processor Communication in an instruction.
<
I am slowly grinding towards using a slight superset of these mechanisms
to support ADA-style rendezvous mostly in HW. For these purposes I
need a "message" space of at least 8 doublewords forward and backward
through the call-accept boundary.

Ivan Godard

unread,

Aug 30, 2021, 4:17:00 PM8/30/21

to

On 8/30/2021 8:45 AM, Anton Ertl wrote:
> Thomas Koenig <tko...@netcologne.de> writes:
>> An, one other remark: The reduction in question could also be
>> a multiplication.
>>
>> A 1076*106 bit multiplier sounds challenging...
>
> Mitch Alsup pointed out that multiplicative reduction is rare.
>
> One other aspect is that multiplication does not produce cancellation,
> so there is no need for such a large accumulator. Exponent overflow
> is much more likely to become a problem when you perform
> multiplicative reduction on a large set of numbers.
>
> - anton
>

Underflow too; denorms have their own issues.

Ivan Godard

unread,

Aug 30, 2021, 4:19:17 PM8/30/21

to

Some hardware. The problem for most ISAs is the paired result, with the
write-port consequences.

MitchAlsup

unread,

Aug 30, 2021, 6:01:27 PM8/30/21

to

But in My 66000, the functionality is accessed through CARRY which provides
the second operand input and provides a register for to additional result.
No pairing or sharing.

Ivan Godard

unread,

Aug 30, 2021, 6:21:19 PM8/30/21

to

Er - "some"?

MitchAlsup

unread,

Aug 30, 2021, 9:22:13 PM8/30/21

to

Well the registers are not paired, nor are they shared.
<
The registers are not paired because the registers can just as easily be R9,R23
as R5,R6; this holds for both operands and results.
<
The registers are not shared because the CARRY supplied register is used for both
operand and result.

Ivan Godard

unread,

Aug 30, 2021, 10:47:57 PM8/30/21

to

Terminology :-( Talking about the datapath issue of multidrop, not about
the encoding issue of register pairs; sorry. You and I both ("some")
have multidrop, but it's a complication widely omitted.

Thomas Koenig

unread,

Aug 31, 2021, 2:42:54 AM8/31/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> No. Multiplication is NOT a reduction found in physics or math.
> It is a producer of things to become reduced.

It's at least found in the OpenMP specs
https://www.openmp.org/spec-html/5.0/openmpsu107.html
(which does not mean that it is used frequently).

Terje Mathisen

unread,

Aug 31, 2021, 3:44:36 AM8/31/21

to

Not really for a super-accumulator: You just align the mantissa bits,
based on the exponent (inserting the hidden bit unless the exponent is
zero), and from then on there's just the carry-save full adder array.

The actual 53-bit input aligner with a ~1100 bit target is a substantial
piece of hardware, the shifter from the deep lagoon. :-)

This is the point where I expect Mitch to pipe in and say that it is
only 3-5 more gate delays than a normal FMAC aligner/normalizer, and of
course fully pipelineable. :-)

luke.l...@gmail.com

unread,

Sep 30, 2022, 9:15:57 PM9/30/22

to

On Tuesday, August 24, 2021 at 8:41:02 AM UTC+1, Terje Mathisen wrote:
> Anton Ertl wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >> On 8/22/2021 10:44 PM, Thomas Koenig wrote:
> > |int m2(int * const restrict a, int n)
> > |{
> > | int m, nm;
> > | int i;
> > |
> > | m = INT_MIN;
> > | nm = -1;
> > | for (i=0; i<n; i++)
> > | {
> > | if (a[i] > m)
> > | {
> > | m = a[i];
> > | nm = i;
> > | }
> > | }
> > | return nm;
> > |}

going back to the original

> Modifying it to while (i+1<n && a[i] <= m) would work I think, but it is
> easier to check the index below:
> while (i<n) {
> while (i<n && a[i]<=m)
> i++;
> if (i < n) {
> m = a[i];
> nm = i;
> }
> i++;

it turns out this one is remarkably simple: use the characteristics
of mapreduce mode to overwrite nm from a parallel
comparison, it is not strictly necessary to perform the
additional skipping (while (i<n && a[i]<=m) i++)

SVP64 mapreduce mode is a misnomer, it actually simply
allows scalar results to continue to be written to during
looping, where normally the first scalar result written causes
early-termination of the horizontal vector loop.

> int m2(int * const restrict a, int n)
> {
> int m, nm;
> int i;
>
> m = INT_MIN;
> nm = -1;

first prepare offsets

setvl vl=16
sv.svstep *offs, srcstep # creates sequence 0 1 2 .. vl-1
li r8, 8 # use this with madd later

and nm-potential

li nm, -1
li nmbase, 0

next use CTR mode

> for (i=0; i<n; i++)
> {

setvl r1,0,16,0,1,1 # VL=r1=MIN(MAXVL,CTR)

perform the load of the vector data, elstrided:

sv.ld/els *a, 8(a_ptr)

> if (a[i] > m)
> {

comparison:

sv.cmp cr0.gt, *a, m

next, these two both use mapreduce mode with predication

> m = a[i];

sv.ori/mr/m=gt m,*a,0

> nm = i;

add-overwriting base with vector-offset into nm

sv.add/mr/m=gt nm, nmbase, *offs

> }

nmbase must be incremented by vl

add nmbase, nmbase, r1 # r1 contains copy of vl

TODO also remember to update ptr to a by 8*VL

madd a_ptr, r8, r1, a_ptr

branch and subtract VL from CTR if not end of loop

sv.bc/all 16, *0, loop

> }
> return nm;

mr r3, nmbase
blr

the trick here is the multiple overwrites of nm and m, even though
they are scalar, the usual "termination of looping because scalar result"
is switched *off* in mapreduce mode.
thus due to the vector predication the last successful conditional overwrite
"wins"

the second trick was to use a base nm which increments by VL, from a
vector of sequential constants.

there will be better/other methods, using data-dependent failfirst
will stop at the first compare-fail but will need "continuation"
(2 nested loops) whereas the mapreduce one is really simple but
relies on WaW hazards to be eliminated

Quadibloc

unread,

Oct 1, 2022, 2:12:11 AM10/1/22

to

On Sunday, August 22, 2021 at 11:54:59 PM UTC-6, Thomas Koenig wrote:
> It worked well on Zen 1 despite that
> architecture only "faking" AVX2 with 128-bit registers.

You can fake AVX-256 with a 128-bit ALU, but I don't see
how you can fake it if your *registers* are too short.

Assuming this is a typo, this is the same sort of thing
they're doing in the current generation with AVX-512.

John Savard

Anton Ertl

unread,

Oct 1, 2022, 3:59:29 AM10/1/22

to

Quadibloc <jsa...@ecn.ab.ca> writes:
>You can fake AVX-256 with a 128-bit ALU, but I don't see
>how you can fake it if your *registers* are too short.

Use two 128-bit registers to implement a YMM register. You can see it
nicely if you measure the number of physical registers:

ROB XMM YMM
size GPRs regs regs year microarchitecture
128 92 128 64 2015 Carrizo (Excavator)
192 145 144 66 2017 Zen
224 141 147 143 2019 Zen2

Note that the number of YMM registers is half (or less) that of XMM
registers on Carrizo and Zen.

luke.l...@gmail.com

unread,

Apr 26, 2023, 11:51:28 AM4/26/23

to

On Tuesday, August 24, 2021 at 8:41:02 AM UTC+1, Terje Mathisen wrote:

> Modifying it to while (i+1<n && a[i] <= m) would work I think, but it is
> easier to check the index below:
> while (i<n) {
> while (i<n && a[i]<=m)
> i++;
> if (i < n) {
> m = a[i];
> nm = i;
> }
> i++;

while (i<n) {
// skip up to first max

while (i<n && a[i]<=m) i++;

// continue as long as picking new m
while (i<n && a[i]>m) {

m = a[i];
nm = i;

i++;
}
}

sorry, reviving this one as i am looking to implement it in SVP64 assembler, does the above look reasonable as a way to parallelise this?

phase 1: while (i<n && a[i]<=m) i++;

a batch can be loaded, each one tested against m, and the vector length truncated at the first fail (a[i] > m)

phase 2: while (i<n && a[i]>m) { m = a[i]; nm = i; i++; }

*another* batch loaded, tested against m, and this time the new m is always updated, looping ends when m is no longer to be updated.

of course if alternating greater less greater less appears in the data this is a sequential algorithm!

l.

Terje Mathisen

unread,

Apr 26, 2023, 1:16:14 PM4/26/23

to

I think your code looks good.

If we want to min-max optimize for minimum time on maximally bad input,
then another algorithm is probably needed.

luke.l...@gmail.com

unread,

Apr 26, 2023, 1:33:48 PM4/26/23

to

On Wednesday, April 26, 2023 at 6:16:14 PM UTC+1, Terje Mathisen wrote:

> luke.l...@gmail.com wrote:
> > while (i<n) {
> > // skip up to first max
> > while (i<n && a[i]<=m) i++;
> > // continue as long as picking new m
> > while (i<n && a[i]>m) {
> > m = a[i];
> > nm = i;
> > i++;

> I think your code looks good.

thanks for checking, Terje.

i was able to use a cheat of SVP64 which uses
a scalar as an "accumulator":

sv.max. *1, *1, *0

which loop-unrolls as:

for i in range(VL):
GPR(1+i) = MAX(GPR(1+i), GPR(0+i)
CR[1+i].GT = 1 if MAX was true else 0

which, with a little thought, you should spot that that
cascades through to the last element as the biggest
number.

so not only do you get the last entry in the Vector set to
the maximum, you also get an "Audit Trail" in the
Vector of Condition Register Fields of when a max
was detected.

and *that* you can then use to grab the index of the
last-successful MAX-was-actually-greater.

> If we want to min-max optimize for minimum time on maximally bad input,
> then another algorithm is probably needed.

yyeah i added a Parallel Reduction Schedule into Simple-V,
which performs a... damn there's a word for it - "work-priority"?
more efficient rather than the low-latency one which does
duplicate work in some cases.

anyway, if you *only* wanted the mix/max of a vector, then
a Parallel Reduction would be the obvious choice.

but finding the index? that's tricky. i'm still thinking about it.

l.

MitchAlsup

unread,

Apr 26, 2023, 1:57:22 PM4/26/23

to

On Wednesday, April 26, 2023 at 10:51:28 AM UTC-5, luke.l...@gmail.com wrote:
> On Tuesday, August 24, 2021 at 8:41:02 AM UTC+1, Terje Mathisen wrote:

> while (i<n) {
> // skip up to first max
> while (i<n && a[i]<=m) i++;
> // continue as long as picking new m
> while (i<n && a[i]>m) {
> m = a[i];
> nm = i;
> i++;
> }
> }
<

The best VVM code would be:
for( i = 1, m = a[0], nm = 0; i < max; i++ )
if( a[i] >= m )
m = a[i], nm = i;
<
Compiling into:
<
MOV Ri,#1
LDD Rm,[Ra]
MOV Rnm,#0
VEC Rt,{}
LDD Rai,[Ra,Ri<<3]
FCMP Rs,Rai,Rm
PGE Rs,TT
MOV Rm,Rai
MOV Rnm,Ri
LOOP LE,Ri,#1,Rmax
<
Since this is a thread on VVM...............
<
The loop can run as wide as the cache port, starting K new LDDs
every cycle. {where K is 2 or 4 or 8 depending on the width of the
cache port(s)}
<
When a new maximum is found, there may be a 1-to-(K-1)-cycle
hiccup in the loop timing (depending on available resources and
strings of new maxima.) {So, if you run into one new maximum
there is a 1-cycle hiccup, but if you run into 7 successive new
maxima there is a 7-cycle hiccup.}
<
So, while your code looks like it will do what you want; the target
ISA is what is driving you to excess expression complexity.

luke.l...@gmail.com

unread,

Apr 26, 2023, 2:28:35 PM4/26/23

to

On Wednesday, April 26, 2023 at 6:57:22 PM UTC+1, MitchAlsup wrote:

> The best VVM code would be:
> for( i = 1, m = a[0], nm = 0; i < max; i++ )
> if( a[i] >= m )
> m = a[i], nm = i;

relying on the scalar-autovectorisation from hazard identification, yes.

> Since this is a thread on VVM...............

yippeee :)

> <
> The loop can run as wide as the cache port, starting K new LDDs
> every cycle. {where K is 2 or 4 or 8 depending on the width of the
> cache port(s)}

multi-issue (OoO) speculative LDs, yes.

> <
> When a new maximum is found, there may be a 1-to-(K-1)-cycle
> hiccup in the loop timing (depending on available resources and
> strings of new maxima.) {So, if you run into one new maximum
> there is a 1-cycle hiccup, but if you run into 7 successive new
> maxima there is a 7-cycle hiccup.}

yes, this was the issue i ran smack into with an early version
of an algorithm, the amount of parallelism is data-dependent.

> So, while your code looks like it will do what you want; the target
> ISA is what is driving you to excess expression complexity.

SVP64 Vertical-First could achieve the same simplicity, i wanted
to see if there was a parallel-processing way, exploiting the
Horizontal-First mode.

l.