# Parallelization of Reductions in VVM

### Stephen Fuld

Sep 14, 2022, 12:01:08 PM
Based on threads here several months ago, I was under the, apparently
mistaken, understanding that VVM would not parallelize reductions. But
based on later posts, I understood that, at least under some conditions,
the hardware would parallelize reductions. So, I want to be sure of my
basic understanding, then explore the envelope of "parallelizability".

So, just to be sure, first let me define parallelization as taking
advantage of multiple arithmetic functional units (of the appropriate
type) to perform multiple operations of the reduction in parallel, thus
reducing the total elapsed time of the reduction by a factor approaching
the number of functional units available. i.e. if you have two integer
FUs, the summation of the elements of an array would take about half the
time as if you had only one FU. This includes doing what is necessary
to combine the partial results at loop exit.

Let's start simple. Say we want the sum of an array of 1,000 elements.
My current understanding is that if those elements are floating point
numbers, the calculation will not be parallelized, but if they are
integers, the calculation will be parallelized. Is that correct?

If so, it implies that the hardware will recognize that the two loops,
which are the same except for a different add instruction, will be
treated differently.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

### Ivan Godard

Sep 14, 2022, 1:05:03 PM
The problem with all reduction parallelization is that it gives the
wrong result if FP is involved, because FP is not associative. VVM
doesn't help.

### Stephen Fuld

Sep 14, 2022, 1:23:45 PM
Oh, I agree (but see below). Which is why I posed the question about
integer only reduction. In fact, it was integer reductions that I had
thought several months ago were not parallelized, but then several weeks
ago thought they were.

That is why I asked the question.

But one quibble with your statement. It is not if FP is involved, but
only if FP arithmetic is involved. I.e., if you want the MAX value of an
array of floating point numbers, I think that should be parallelizable.
Right?

### MitchAlsup

Sep 14, 2022, 2:19:17 PM
If you build something like a Posit Quire then FP addition is associative.
If, on the other hand, you build Kahan-Babuška summation, FP addition
is almost associative in practice.
<
That is, the problem is the arithmetic specification, not how the loop
is performed.
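For reference, the Kahan-Babuška style summation Mitch mentions can be sketched like this (a minimal version of the Neumaier variant; the function name is mine): a running compensation term recovers most of each add's rounding error, making the result nearly insensitive to summation order.

```c
#include <assert.h>
#include <math.h>

/* Kahan-Babuška (Neumaier) compensated summation sketch.
   After each add, the error term that rounding discarded is
   recovered into c and folded back in at the end. */
double kb_sum(const double *a, long n)
{
    double s = 0.0, c = 0.0;
    for (long i = 0; i < n; i++) {
        double t = s + a[i];
        if (fabs(s) >= fabs(a[i]))
            c += (s - t) + a[i];   /* low bits of a[i] were lost in t */
        else
            c += (a[i] - t) + s;   /* low bits of s were lost in t */
        s = t;
    }
    return s + c;
}
```

Note this must be compiled without reassociation (no -ffast-math), or the compensation terms cancel algebraically.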

### Thomas Koenig

Sep 15, 2022, 2:07:40 AM
Ivan Godard <iv...@millcomputing.com> schrieb:
Define "wrong".

If you define that whatever order the loop is written in a language
like C is the only right one, then your statement above applies.

However, if changing the order of summation would generate a wrong
result in every case, then every matrix multiplication optimization
algorithm out there would produce wrong results.

There are also languages which specify something like a sum directly,
like Fortran's sum intrinsic, where no ordering is guaranteed.

Some people explicitly tell the compiler they don't care about
floating point ordering by specifying options like -ffast-math.
Some of these know what they are doing, some don't.

Some people also do reductions by hand (IIRC, there are examples
in the reference BLAS).

Some programming languages like Fortran or extensions like
OpenMP specify reductions directly, without specifying an order.
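An OpenMP reduction clause, for instance, names only the operation and the variable; the specification leaves the combination order to the implementation. A minimal sketch (integer addition chosen here so the result is order-independent either way; without OpenMP enabled the pragma is simply ignored and the loop runs serially):

```c
#include <assert.h>

/* The reduction(+:s) clause tells the compiler/runtime to form
   per-thread partial sums and combine them at the end, in an
   unspecified order. */
long sum_omp(const long *a, long n)
{
    long s = 0;
#pragma omp parallel for reduction(+:s)
    for (long i = 0; i < n; i++)
        s += a[i];
    return s;
}
```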

Floating-point reduction (vector-matrix product) is arguably the
most important operation performed in scientific computing and
engineering, because it is the basis for the Krylov subspace
methods for solving large linear systems, which are needed for just
about any calculation of differential equations involving more than
one dimension (including strength calculations and fluid dynamics).

Next time you drive a car, sit in a train or bus, or fly a plane,
remember that it was designed with the help of floating point
reductions.

### Terje Mathisen

Sep 15, 2022, 5:23:11 AM
Ivan Godard wrote:
> On 9/14/2022 9:00 AM, Stephen Fuld wrote:
>> [snip]
>
> The problem with all reduction parallelization is that it gives the
> wrong result if FP is involved, because FP is not associative. VVM
> doesn't help.
>
This is probably _the_ main reason why I expect to see
super-accumulators, i.e. ~1100 bits x 2 (carry-save) is enough to ingest
a new double prec value in a fraction of a clock cycle.

When all reductions are exact, the order does not matter, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

### Michael S

Sep 15, 2022, 6:37:27 AM
On Thursday, September 15, 2022 at 12:23:11 PM UTC+3, Terje Mathisen wrote:
>
> This is probably _the_ main reason why I expect to see
> super-accumulators, i.e. ~1100 bits x 2 (carry-save) is enough to ingest
> a new double prec value in a fraction of a clock cycle.
>
> When all reductions are exact, the order does not matter, right?
>

How big is "all"?
With normal SIMD we, at least, have a definite answer.
With any sort of "scalable" SIMD, including VVM, we have no answer.

And, of course, even when we have the definite answer, the answer
(i.e result of reduction, based on super-accumulators) does not
match the result that is considered "correct" by serial HLLs.
One can argue that our answer is better, but it's not the same.

### Anton Ertl

Sep 15, 2022, 7:31:04 AM
Thomas Koenig <tko...@netcologne.de> writes:
>Ivan Godard <iv...@millcomputing.com> schrieb:
>> The problem with all reduction parallelization is that it gives the
>> wrong result if FP is involved, because FP is not associative. VVM
>> doesn't help.
>
>Define "wrong".
>
>If you define that whatever order the loop is written in a language
>language like C is the only right one, then your statement
>above applies.

In the context of VVM, the right result is what you get when you
perform the same instructions in scalar mode.

>However, if changing the order of summation would generate a wrong
>result in every case, then every matrix multiplication optimization
>algorithm out there would produce wrong results.

Matrix multiplication has so much parallelism that many optimizations
are possible without reassociation. Basically, all multiplications
can be performed in parallel, and each element of the result matrix is
independent of the computations for all the other elements, so they
can all be performed in parallel (e.g., for a 1000x1000 result matrix,
there is a parallelism of 1000000 even if you compute each element
completely sequentially). The only part where associativity plays a
role is the additions. Just don't reassociate them, and you get the
same result.

- anton

### Thomas Koenig

Sep 15, 2022, 11:37:52 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
As far as I can see, that would exclude all cache-friendly blocked
matrix multiplication algorithms. Not very useful, I'm afraid.

### Stephen Fuld

Sep 15, 2022, 1:07:42 PM
discussion on what VVM could or should do (I really do.), but I am just
asking what it *will* do. Is it the case that it will parallelize say
summing the values in an array if they are integers, but not if they are
floating point?

Thanks.

On 9/14/2022 9:00 AM, Stephen Fuld wrote:

### MitchAlsup

Sep 15, 2022, 1:43:43 PM
On Thursday, September 15, 2022 at 12:07:42 PM UTC-5, Stephen Fuld wrote:
> discussion on what VVM could or should do (I really do.), but I am just
> asking what it *will* do. Is it the case that it will parallelize say
> summing the values in an array if they are integers, but not if they are
> floating point?
>
> Thanks.
<
VVM is an architectural specification. It is like asking what an x86
processor can do--well these things came out with MMX at 1GHz
and then proceeded to where we are today.
<
What you are asking about is implementation. There can be <proper>
VVM implementations that perform no better than pure scalar code.
and occupy no more area than adding a few sequences to the pipeline.
While this would be "sad" it is not an improper implementation. Then
there can be implementations where VVM has 2× to 8× the arithmetic
circuitry as the "main data path" providing 4×-32× performance gains.
<
Which do you want me to address is the question !
<
For my part, I am looking at a ¼ cache line wide arithmetic addition
as a reasonable VVM implementation. This contains arithmetic to
perform {16-byte, 8-half, 4-word, 2-double} sized calculations with
"several" inbound buffers and 1-or-2 outbound buffers. Whether this
uses the existing FPU or incarnates another is undecided.

### Stephen Fuld

Sep 15, 2022, 5:27:35 PM
On 9/15/2022 10:43 AM, MitchAlsup wrote:
> On Thursday, September 15, 2022 at 12:07:42 PM UTC-5, Stephen Fuld wrote:
>> discussion on what VVM could or should do (I really do.), but I am just
>> asking what it *will* do. Is it the case that it will parallelize say
>> summing the values in an array if they are integers, but not if they are
>> floating point?
>>
>> Thanks.

First, thanks for responding.

> <
> VVM is an architectural specification. It is like asking what an x86
> processor can do--well these things came out with MMX at 1GHz
> and then proceeded to where we are today.
> <
> What you are asking about is implementation. There can be <proper>
> VVM implementations that perform no better than pure scalar code.
> and occupy no more area than adding a few sequences to the pipeline.
> While this would be "sad" it is not an improper implementation. Then
> there can be implementations where VVM has 2× to 8× the arithmetic
> circuitry as the "main data path" providing 4×-32× performance gains.

I had considered using the word "implementation", but I thought, perhaps
incorrectly, that the fact that different implementations could produce
arithmetically different results (in this case depending upon whether FP
math was parallelized) was something that should be specified in the
architecture. But I accept that it certainly could be "implementation".

> <
> Which do you want me to address is the question !

> <
> For my part, I am looking at a ¼ cache line wide arithmetic addition
> as a reasonable VVM implementation. This contains arithmetic to
> perform {16-byte, 8-half, 4-word, 2-double} sized calculations with
> "several" inbound buffers and 1-or-2 outbound buffers.

So integer reductions could be parallelized. Good. We can explore this
more in the future.

> Whether this
> uses the existing FPU or incarnates another is undecided.

Ahhh! OK. It seems apparent that there are people who favor each way,
or at least there are people who are willing to give up on "numerically
correct" results in order to get them faster. Perhaps an argument could
be made for supporting both. I certainly am not competent to know the

### MitchAlsup

Sep 15, 2022, 6:16:49 PM
And as I alluded to above:: There is no reason one could not implement
a 754-version of Posit-Quires.
<
At this point, most of it falls into the category of "is it worth it" not
"can it be done". There will be "potential" applications where VVM is
not highly valued (controllers) and those where it is mandatory (HPC).
Who is footing the bill {both engineering and manufacturing costs},
and what is the expected volume {thousands per month -> millions per
week}?

### Thomas Koenig

Sep 16, 2022, 2:55:33 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Thursday, September 15, 2022 at 12:07:42 PM UTC-5, Stephen Fuld wrote:
>> discussion on what VVM could or should do (I really do.), but I am just
>> asking what it *will* do. Is it the case that it will parallelize say
>> summing the values in an array if they are integers, but not if they are
>> floating point?
>>
>> Thanks.
><
> VVM is an architectural specification. It is like asking what an x86
> processor can do--well these things came out with MMX at 1GHz
> and then proceeded to where we are today.
><
> What you are asking about is implementation. There can be <proper>
> VVM implementations that perform no better than pure scalar code.
> and occupy no more area than adding a few sequences to the pipeline.
> While this would be "sad" it is not an improper implementation. Then
> there can be implementations where VVM has 2× to 8× the arithmetic
> circuitry as the "main data path" providing 4×-32× performance gains.
><
> Which do you want me to address is the question !

It's not my question, but...

As far as I understand, VVM prescribes that loops are executed as they
would be in C. It has no special support for reductions. So,
if you want to calculate a dot product, there is no special
support.

Generated code for

```c
double dot_product (double *const restrict a, double const *restrict b, long int n)
{
    double s = 0.0;
    for (long int i = 0; i < n; i++)
        s += a[i] * b[i];

    return s;
}
```

is

```asm
; %bb.0:
	enter	r0,r0,0,0
	cmp	r4,r3,#1
	blt	r4,.LBB0_1
	mov	r4,#0x0000000000000000
.LBB0_3:	; %.lr.ph
	; =>This Inner Loop Header: Depth=1
	vec	r5,{r4}
	ldd	r6,[r1]
	ldd	r7,[r2]
	fmac	r4,r6,r7,r4
	loop	ne,r3,#0,#-1
; %bb.4:	; %._crit_edge
	mov	r1,r4
	exit	r0,r0,0,0
.LBB0_1:
	mov	r4,#0x0000000000000000
	mov	r1,r4
	exit	r0,r0,0,0
```

so there is no special advantage to VVM here - the bottleneck
is the FMAC instruction. With a FMAC latency of four cycles, this
loop cannot run faster than 1/4 of the clock.

In order to get around this, the compiler or the programmer has to
delve into the bag of tricks developed for SIMD "vectorization":
unrolling and loop peeling, specifically. FMAC can be assumed
to be pipelined, so that is possible.

So, it would have to look something like

```c
double dot_product (double *const restrict a, double const *restrict b, long int n)
{
    double s0, s1, s2, s3, s4, s5, s6, s7, s_extra;
    long int i;

    if (n < 0)
        return 0.0;

    s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = 0.0;
    for (i = 0; i < n % 8; i++)
        s0 = s0 + a[i] * b[i];

    for (; i < n; i += 8)
    {
        s0 += a[i] * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
        s4 += a[i+4] * b[i+4];
        s5 += a[i+5] * b[i+5];
        s6 += a[i+6] * b[i+6];
        s7 += a[i+7] * b[i+7];
    }
    s_extra = 0.0;

    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s_extra;
}
```

which is more error-prone (no guarantee for correctness of the
above code) and which also does not vectorize with the current
compiler (at least from what I could throw at it).

So, VVM is great as far as it goes, but it does not offer a
good path to reductions. That is still missing, I think.

### Anton Ertl

Sep 16, 2022, 4:35:54 AM
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Matrix multiplication has so much parallelism that many optimizations
>> are possible without reassociation. Basically, all multiplications
>> can be performed in parallel, and each element of the result matrix is
>> independent of the computations for all the other elements, so they
>> can all be performed in parallel (e.g., for a 1000x1000 result matrix,
>> there is a parallelism of 1000000 even if you compute each element
>> completely sequentially. The only part where the association plays a
>> role is the additions. Just don't reassociate them, and you get the
>> same result.
>
>As far as I can see, that would exclude all cache-friendly blocked
>matrix multiplication algorithms.

What do you see that makes you think so?

I see that, on the contrary, it is trivial to perform cache blocking
without reassociation: I multiply a subblock of A with a subblock of B
and add the result to a subblock of C. Then I work on the next
subblocks. As long as I process the subblocks in any of the wide
variety of orders that do not reassociate, the result is unchanged.
And as far as I can see, among these orders there are some that are
optimal for cache reuse.
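Anton's scheme can be sketched directly (block and matrix sizes here are small illustrative assumptions): block over i, j, and k for cache reuse, but visit k-blocks, and k within each block, in ascending order. Each C[i][j] then sees its additions in exactly the same order as the naive triple loop, so the FP result is bitwise identical; no reassociation has taken place.

```c
#include <assert.h>
#include <string.h>

#define N  8   /* matrix size, kept small for illustration */
#define BS 4   /* block size dividing N */

void matmul_naive(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    memset(C, 0, N * N * sizeof(double));
    for (int kk = 0; kk < N; kk += BS)        /* k-blocks in order */
        for (int ii = 0; ii < N; ii += BS)
            for (int jj = 0; jj < N; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int j = jj; j < jj + BS; j++)
                        for (int k = kk; k < kk + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```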

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'

### Thomas Koenig

Sep 16, 2022, 6:39:52 AM
Thomas Koenig <tko...@netcologne.de> schrieb:

[Reduction]

> So, it would have to look something like
>
> [snip]
>
> which is more error-prone (no guarantee for correctness of the
> above code) and which also does not vectorize with the current
> compiler (at least from what I could throw at it).

Ah, I finally succeeded. I have to pass an optimization option to
clang already, or opt and llc will not optimize. Strange...

Anyway, the central loop now looks like

```asm
	vec	r13,{r4,r5,r7,r8,r9,r10,r11,r12}
	ldd	r14,[r1,r6<<3,0]
	ldd	r15,[r1,r6<<3,8]
	ldd	r30,[r2,r6<<3,0]
	ldd	r29,[r2,r6<<3,8]
	fmac	r4,r15,r29,r4
	fmac	r5,r14,r30,r5
	ldd	r14,[r1,r6<<3,40]
	ldd	r15,[r1,r6<<3,32]
	ldd	r30,[r1,r6<<3,16]
	ldd	r29,[r1,r6<<3,24]
	ldd	r28,[r2,r6<<3,40]
	ldd	r27,[r2,r6<<3,32]
	ldd	r26,[r2,r6<<3,16]
	ldd	r25,[r2,r6<<3,24]
	fmac	r8,r29,r25,r8
	fmac	r7,r30,r26,r7
	fmac	r9,r15,r27,r9
	fmac	r10,r14,r28,r10
	ldd	r14,[r1,r6<<3,48]
	ldd	r15,[r1,r6<<3,56]
	ldd	r30,[r2,r6<<3,48]
	ldd	r29,[r2,r6<<3,56]
	fmac	r12,r15,r29,r12
	fmac	r11,r14,r30,r11
	loop	lt,r6,r3,#8
```

which should be good for eight FMAs every four cycles if
the memory can keep up and the ALU is wide enough.

Can't test complex, because it seems to make a function
call.

### Stephen Fuld

Sep 16, 2022, 12:27:27 PM
Oh! I missed the significance of that post. :-( And I had never heard
of quires, so I did a little research. What I found primarily was that
I am not qualified to make an intelligent contribution to that debate. :-(

> <
> At this point, most of it falls into the category of "is it worth it" not
> "can it be done". There will be "potential" applications where VVM is
> not highly valued (controllers) and those where it is mandatory (HPC).

And many where it is desirable but not mandatory. And it may be that
there are applications where VVM is highly desirable, but accurate
parallelized floating-point reductions are required, or not. Several
choices.

> Who is footing the bill {both engineering and manufacturing costs}
> and what is the expected volume {thousands per month->millions per
> week}.

As always.

Thanks for the details and being patient with me.

### Stephen Fuld

Sep 22, 2022, 12:29:18 PM
On 9/15/2022 2:27 PM, Stephen Fuld wrote:
> On 9/15/2022 10:43 AM, MitchAlsup wrote:

snip

>> For my part, I am looking at a ¼ cache line wide arithmetic addition
>> as a reasonable VVM implementation. This contains arithmetic to
>> perform {16-byte, 8-half, 4-word, 2-double} sized calculations with
>> "several" inbound buffers and 1-or-2 outbound buffers.
>
> So integer reductions could be parallelized.  Good.  We can explore this
> more in the future.

The future is now. :-)

From your description, it looks to me like you are looking at the
technique of splitting up a long add into several shorter adds by
selectively blocking carry propagation. If so, I have several questions.

1. Consider the case of summing an array of 256 word-sized (32-bit)
integers where the sum is to be put into a 64-bit integer. What happens
if one or more of your intermediate sums exceeds 32 bits? If you are
using the suppressed-carries technique, I think you could get overflows
in the intermediate adds that wouldn't show up in a non-parallelized
version.
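The hazard in question 1 can be sketched like this (the two-lane split and all names are my illustrative assumptions, not a description of any actual VVM implementation): a serial widening loop cannot overflow here, but 32-bit partial sums kept in parallel lanes can wrap even though the final 64-bit total is representable.

```c
#include <assert.h>
#include <stdint.h>

/* Serial reduction, widening each add to 64 bits: no overflow
   possible for 256 or fewer int32 inputs. */
int64_t sum_widening(const int32_t *a, int n)
{
    int64_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical lane-split reduction keeping 32-bit partials
   (n assumed even).  The unsigned casts make the wraparound
   well-defined in C; the point is that the partials can wrap. */
int64_t sum_lanes32(const int32_t *a, int n)
{
    int32_t lane0 = 0, lane1 = 0;
    for (int i = 0; i < n; i += 2) {
        lane0 = (int32_t)((uint32_t)lane0 + (uint32_t)a[i]);
        lane1 = (int32_t)((uint32_t)lane1 + (uint32_t)a[i + 1]);
    }
    return (int64_t)lane0 + lane1;    /* widened too late */
}
```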

2. What about reductions like "find the maximum value of any element in
an array"? I don't think this is amenable to the suppressed-carry
technique, while it falls out naturally in a multiple-functional-unit
approach.
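The multiple-functional-unit shape of question 2 can be sketched as follows (a hedged illustration; the two-lane arrangement is my assumption): max is associative and commutative, so each lane keeps its own running maximum and the partials are combined at loop exit, matching the serial result exactly (for integers, and for FP max as well, NaN handling aside).

```c
#include <assert.h>
#include <stdint.h>

/* Two-lane max reduction sketch (n assumed even and > 0).
   Each "functional unit" tracks its own maximum; the partials
   are merged once at loop exit. */
int32_t max_lanes(const int32_t *a, int n)
{
    int32_t m0 = a[0], m1 = a[1];
    for (int i = 2; i < n; i += 2) {
        if (a[i]     > m0) m0 = a[i];
        if (a[i + 1] > m1) m1 = a[i + 1];
    }
    return m0 > m1 ? m0 : m1;   /* combine partial results */
}
```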