Stephen Fuld, Sep 14, 2022, 12:01:08 PM
Based on threads here several months ago, I was under the apparently mistaken understanding that VVM would not parallelize reductions. But based on later posts, I understood that, at least under some conditions, the hardware would parallelize reductions. So, I want to be sure of my basic understanding, then explore the envelope of "parallelizability".

So, just to be sure, first let me define parallelization as taking advantage of multiple arithmetic functional units (of the appropriate type) to perform multiple operations of the reduction in parallel, thus reducing the total elapsed time of the reduction by a factor approaching the number of functional units available. I.e., if you have two integer FUs, the summation of the elements of an array would take about half the time it would with only one FU. This includes doing whatever is necessary to combine the partial results at loop exit.

Let's start simple. Say we want the sum of an array of 1,000 elements. My current understanding is that if those elements are floating point numbers, the calculation will not be parallelized, but if they are integers, it will be. Is that correct?

If so, it implies that the hardware will recognize that the two loops, which are the same except for a different add instruction, must be treated differently.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Ivan Godard, Sep 14, 2022, 1:05:03 PM
The problem with all reduction parallelization is that it gives the wrong result if FP is involved, because FP is not associative. VVM doesn't help.
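For concreteness, a minimal C demonstration of that non-associativity (the values are chosen only to make the rounding visible):

    #include <stdio.h>

    int main(void)
    {
        double big = 1.0e16, one = 1.0;
        /* Serial order: each 1.0 is lost, because 1.0e16 + 1.0
           rounds back to 1.0e16 (the spacing of doubles there is 2.0). */
        double serial = (big + one) + one;
        /* Reassociated order: the two 1.0s combine first and survive. */
        double reassoc = big + (one + one);
        printf("serial  = %.1f\nreassoc = %.1f\n", serial, reassoc);
        return 0;   /* the two results differ by 2.0 */
    }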

Stephen Fuld, Sep 14, 2022, 1:23:45 PM
So that leaves integer only reduction. In fact, it was integer reductions that I had thought several months ago were not parallelized, but then several weeks ago thought they were. That is why I asked the question.

But one quibble with your statement. It is not if FP is involved, but only if FP arithmetic is involved. I.e., if you want the MAX value of an array of floating point numbers, I think that should be parallelizable. Right?

Sep 14, 2022, 2:19:17 PM
If, on the other hand, you build Kahan-Babuška summation, FP addition is almost associative in practice.
<
That is, the problem is the arithmetic specification, not how the loop is performed.
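For reference, a textbook C sketch of Kahan-Babuška (compensated) summation; this is the standard scheme, not code from this thread:

    #include <math.h>

    /* Carries the rounding error of each addition forward in a
       correction term c, making the result far less sensitive to
       the order in which the elements are combined. */
    double sum_kb(const double *a, long n)
    {
        double s = 0.0, c = 0.0;
        for (long i = 0; i < n; i++) {
            double t = s + a[i];
            if (fabs(s) >= fabs(a[i]))
                c += (s - t) + a[i];   /* recover the bits lost from a[i] */
            else
                c += (a[i] - t) + s;   /* recover the bits lost from s */
            s = t;
        }
        return s + c;
    }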

Thomas Koenig, Sep 15, 2022, 2:07:40 AM
Ivan Godard <iv...@millcomputing.com> schrieb:
> The problem with all reduction parallelization is that it gives the
> wrong result if FP is involved, because FP is not associative. VVM
> doesn't help.

Define "wrong".

If you define that whatever order the loop is written in, in a language like C, is the only right one, then your statement above applies.

However, if changing the order of summation would generate a wrong result in every case, then every matrix multiplication optimization algorithm out there would produce wrong results.

There are also languages which specify something like a sum directly, like Fortran's sum intrinsic, where no ordering is guaranteed.

Some people explicitly tell the compiler they don't care about floating point ordering by specifying options like -ffast-math. Some of these know what they are doing, some don't.

Some people also do reductions by hand (IIRC, there are examples in the reference BLAS).

Some programming languages like Fortran, or extensions like OpenMP, specify reductions directly, without specifying an order.

Floating-point reduction (the vector-matrix product) is arguably the most important operation performed in scientific computing and engineering, because it is the basis for the Krylov subspace methods for solving large linear equations, which are needed for just about any calculation of differential equations involving more than one dimension (including strength calculations and fluid dynamics).

Next time you drive a car, sit in a train or bus, or fly a plane, remember that it was designed with the help of floating point reductions.
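For instance, an OpenMP reduction in C (a standard-conforming sketch, not code from the thread):

    /* OpenMP guarantees *a* reduction, but the order in which the
       per-thread partial sums are combined is unspecified, so the
       floating-point result may vary with thread count or run. */
    double sum(const double *a, long n)
    {
        double s = 0.0;
    #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }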

Terje Mathisen, Sep 15, 2022, 5:23:11 AM
Ivan Godard wrote:
> On 9/14/2022 9:00 AM, Stephen Fuld wrote:
>> Based on threads here several months ago, I was under the apparently
>> mistaken understanding that VVM would not parallelize reductions.
>> But based on later posts, I understood that, at least under some
>> conditions, the hardware would parallelize reductions. So, I want to
>> be sure of my basic understanding, then explore the envelope of
>> "parallelizability".
>>
>> So, just to be sure, first let me define parallelization as taking
>> advantage of multiple arithmetic functional units (of the appropriate
>> type) to perform multiple operations of the reduction in parallel,
>> thus reducing the total elapsed time of the reduction by a factor
>> approaching the number of functional units available. I.e., if you
>> have two integer FUs, the summation of the elements of an array would
>> take about half the time it would with only one FU. This includes
>> doing whatever is necessary to combine the partial results at loop exit.
>>
>> Let's start simple. Say we want the sum of an array of 1,000
>> elements. My current understanding is that if those elements are
>> floating point numbers, the calculation will not be parallelized, but
>> if they are integers, it will be. Is that correct?
>>
>> If so, it implies that the hardware will recognize that the two loops,
>> which are the same except for a different add instruction, must be
>> treated differently.
>
> The problem with all reduction parallelization is that it gives the
> wrong result if FP is involved, because FP is not associative. VVM
> doesn't help.

This is probably _the_ main reason why I expect to see super-accumulators, i.e. ~1100 bits x 2 (carry-save) is enough to ingest a new double prec value in a fraction of a clock cycle.

When all reductions are exact, the order does not matter, right?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Sep 15, 2022, 6:37:27 AM
On Thursday, September 15, 2022 at 12:23:11 PM UTC+3, Terje Mathisen wrote:
>
> This is probably _the_ main reason why I expect to see
> super-accumulators, i.e. ~1100 bits x 2 (carry-save) is enough to ingest
> a new double prec value in a fraction of a clock cycle.
>
> When all reductions are exact, the order does not matter, right?
>

How big is "all"?

With normal SIMD we, at least, have a definite answer. With any sort of "scalable" SIMD, including VVM, we have no answer. And, of course, even when we have the definite answer, the answer (i.e. the result of the reduction based on super-accumulators) does not match the result that is considered "correct" by serial HLLs. One can argue that our answer is better, but it's not the same.
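To make the width-dependence concrete, a plain C sketch (illustrative only) of how a W-lane SIMD reduction effectively associates the terms:

    /* Strided partial sums: lane j accumulates a[j], a[j+W],
       a[j+2W], ..., and the lanes are combined at the end. For a
       fixed W the result is deterministic, but different W (as with
       scalable vector hardware) can round differently and give a
       different sum than the serial loop. Assumes 1 <= W <= 64. */
    double sum_lanes(const double *a, long n, int W)
    {
        double lane[64] = { 0.0 };
        for (long i = 0; i < n; i++)
            lane[i % W] += a[i];
        double s = 0.0;
        for (int j = 0; j < W; j++)   /* horizontal combine */
            s += lane[j];
        return s;
    }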

Anton Ertl, Sep 15, 2022, 7:31:04 AM
Thomas Koenig <tko...@netcologne.de> writes:
>Ivan Godard <iv...@millcomputing.com> schrieb:
>> The problem with all reduction parallelization is that it gives the
>> wrong result if FP is involved, because FP is not associative. VVM
>> doesn't help.
>
>Define "wrong".
>
>If you define that whatever order the loop is written in, in a language
>like C, is the only right one, then your statement above applies.

In the context of VVM, the right result is what you get when you perform the same instructions in scalar mode.

>However, if changing the order of summation would generate a wrong
>result in every case, then every matrix multiplication optimization
>algorithm out there would produce wrong results.

Matrix multiplication has so much parallelism that many optimizations are possible without reassociation. Basically, all multiplications can be performed in parallel, and each element of the result matrix is independent of the computations for all the other elements, so they can all be performed in parallel (e.g., for a 1000x1000 result matrix, there is a parallelism of 1000000 even if you compute each element completely sequentially). The only part where associativity plays a role is the additions. Just don't reassociate them, and you get the same result.

- anton

Thomas Koenig, Sep 15, 2022, 11:37:52 AM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> Matrix multiplication has so much parallelism that many optimizations
> are possible without reassociation. [...] The only part where
> associativity plays a role is the additions. Just don't reassociate
> them, and you get the same result.

As far as I can see, that would exclude all cache-friendly blocked matrix multiplication algorithms. Not very useful, I'm afraid.

Stephen Fuld, Sep 15, 2022, 1:07:42 PM
Mitch, please help me out here. While I do appreciate all the discussion on what VVM could or should do (I really do), I am just asking what it *will* do. Is it the case that it will parallelize, say, summing the values in an array if they are integers, but not if they are floating point?

Thanks.

MitchAlsup, Sep 15, 2022, 1:43:43 PM
On Thursday, September 15, 2022 at 12:07:42 PM UTC-5, Stephen Fuld wrote:
> Mitch, please help me out here. While I do appreciate all the
> discussion on what VVM could or should do (I really do), I am just
> asking what it *will* do. Is it the case that it will parallelize, say,
> summing the values in an array if they are integers, but not if they
> are floating point?
>
> Thanks.
<
VVM is an architectural specification. It is like asking what an x86 processor can do--well these things came out with MMX at 1GHz and then proceeded to where we are today.
<
What you are asking about is implementation. There can be <proper> VVM implementations that perform no better than pure scalar code, and occupy no more area than adding a few sequences to the pipeline. While this would be "sad" it is not an improper implementation. Then there can be implementations where VVM has 2× to 8× the arithmetic circuitry of the "main data path", providing 4×-32× performance gains.
<
Which do you want me to address is the question!
<
For my part, I am looking at a ¼ cache line wide arithmetic addition as a reasonable VVM implementation. This contains arithmetic to perform {16-byte, 8-half, 4-word, 2-double} sized calculations with "several" inbound buffers and 1-or-2 outbound buffers. Whether this uses the existing FPU or incarnates another is undecided.

Stephen Fuld, Sep 15, 2022, 5:27:35 PM
On 9/15/2022 10:43 AM, MitchAlsup wrote:
> On Thursday, September 15, 2022 at 12:07:42 PM UTC-5, Stephen Fuld wrote:
>> Mitch, please help me out here. While I do appreciate all the
>> discussion on what VVM could or should do (I really do), I am just
>> asking what it *will* do. Is it the case that it will parallelize, say,
>> summing the values in an array if they are integers, but not if they
>> are floating point?
>>
>> Thanks.

First, thanks for responding.

> <
> VVM is an architectural specification. It is like asking what an x86
> processor can do--well these things came out with MMX at 1GHz
> and then proceeded to where we are today.
> <
> What you are asking about is implementation. There can be <proper>
> VVM implementations that perform no better than pure scalar code,
> and occupy no more area than adding a few sequences to the pipeline.
> While this would be "sad" it is not an improper implementation. Then
> there can be implementations where VVM has 2× to 8× the arithmetic
> circuitry of the "main data path", providing 4×-32× performance gains.

I guess I had thought, apparently incorrectly, that the fact that different implementations could produce arithmetically different results (in this case depending upon whether FP math was parallelized) was something that should be specified in the architecture. But I accept that it certainly could be "implementation".

> <
> Which do you want me to address is the question!

Well, certainly not about your "sad" implementation. :-)

> <
> For my part, I am looking at a ¼ cache line wide arithmetic addition
> as a reasonable VVM implementation. This contains arithmetic to
> perform {16-byte, 8-half, 4-word, 2-double} sized calculations with
> "several" inbound buffers and 1-or-2 outbound buffers.

So integer reductions could be parallelized. Good. We can explore this more in the future.

> Whether this
> uses the existing FPU or incarnates another is undecided.

It seems that fast FP reductions are wanted, or at least there are people who are willing to give up on "numerically correct" results in order to get them faster. Perhaps an argument could be made for supporting both. I certainly am not competent to know the "right" answer.

MitchAlsup, Sep 15, 2022, 6:16:49 PM
What you are describing is a 754-version of Posit-Quires.
<
At this point, most of it falls into the category of "is it worth it", not "can it be done". There will be "potential" applications where VVM is not highly valued (controllers) and those where it is mandatory (HPC). Who is footing the bill {both engineering and manufacturing costs} and what is the expected volume {thousands per month -> millions per week}?

Thomas Koenig, Sep 16, 2022, 2:55:33 AM
MitchAlsup <Mitch...@aol.com> schrieb:
> On Thursday, September 15, 2022 at 12:07:42 PM UTC-5, Stephen Fuld wrote:
>> Mitch, please help me out here. While I do appreciate all the
>> discussion on what VVM could or should do (I really do), I am just
>> asking what it *will* do. Is it the case that it will parallelize, say,
>> summing the values in an array if they are integers, but not if they
>> are floating point?
>>
>> Thanks.
> <
> VVM is an architectural specification. It is like asking what an x86
> processor can do--well these things came out with MMX at 1GHz
> and then proceeded to where we are today.
> <
> What you are asking about is implementation. There can be <proper>
> VVM implementations that perform no better than pure scalar code,
> and occupy no more area than adding a few sequences to the pipeline.
> While this would be "sad" it is not an improper implementation. Then
> there can be implementations where VVM has 2× to 8× the arithmetic
> circuitry of the "main data path", providing 4×-32× performance gains.
> <
> Which do you want me to address is the question!

It's not my question, but...

As far as I understand, VVM prescribes that loops are executed as they would be in C. It has no special support for reductions. So, if you want to calculate a dot product, there is no special support.

Generated code for

    double dot_product (double *const restrict a, double const *restrict b, long int n)
    {
      double s = 0.0;
      for (long int i = 0; i < n; i++)
        s += a[i] * b[i];
      return s;
    }

is

    ; %bb.0:
            enter   r0,r0,0,0
            add     r30,sp,#8
            cmp     r4,r3,#1
            blt     r4,.LBB0_1
    ; %bb.2:                          ; %.lr.ph.preheader
            mov     r4,#0x0000000000000000
    .LBB0_3:                          ; %.lr.ph
                                      ; =>This Inner Loop Header: Depth=1
            vec     r5,{r4}
            ldd     r6,[r1]
            ldd     r7,[r2]
            fmac    r4,r6,r7,r4
            add     r2,r2,#8
            add     r1,r1,#8
            loop    ne,r3,#0,#-1
    ; %bb.4:                          ; %._crit_edge
            mov     r1,r4
            exit    r0,r0,0,0
    .LBB0_1:
            mov     r4,#0x0000000000000000
            mov     r1,r4
            exit    r0,r0,0,0

so there is no special advantage to VVM here - the bottleneck is the FMAC instruction. With an FMAC latency of four cycles, this loop cannot run faster than 1/4 of the clock.

In order to get around this, compiler or programmer have to delve into the bag of tricks developed for SIMD "vectorization": unrolling and loop peeling, specifically. FMAC can be assumed to be pipelined, so that is possible.

So, it would have to look something like

    double dot_product (double *const restrict a, double const *restrict b, long int n)
    {
      double s0, s1, s2, s3, s4, s5, s6, s7, s_extra;
      long int i;

      if (n < 0)
        return 0.0;

      s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = 0.0;
      for (i = 0; i < n % 8; i++)
        s0 = s0 + a[i] * b[i];

      for (i = i; i < n; i += 8)
        {
          s0 += a[i] * b[i];
          s1 += a[i+1] * b[i+1];
          s2 += a[i+2] * b[i+2];
          s3 += a[i+3] * b[i+3];
          s4 += a[i+4] * b[i+4];
          s5 += a[i+5] * b[i+5];
          s6 += a[i+6] * b[i+6];
          s7 += a[i+7] * b[i+7];
        }
      s_extra = 0.0;

      return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s_extra;
    }

which is more error-prone (no guarantee for correctness of the above code) and which also does not vectorize with the current compiler (at least from what I could throw at it).

So, VVM is great as far as it goes, but it does not offer a good path to reductions. That is still missing, I think.

Anton Ertl, Sep 16, 2022, 4:35:54 AM
Thomas Koenig <tko...@netcologne.de> writes:
>Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
>> Matrix multiplication has so much parallelism that many optimizations
>> are possible without reassociation. Basically, all multiplications
>> can be performed in parallel, and each element of the result matrix is
>> independent of the computations for all the other elements, so they
>> can all be performed in parallel (e.g., for a 1000x1000 result matrix,
>> there is a parallelism of 1000000 even if you compute each element
>> completely sequentially). The only part where associativity plays a
>> role is the additions. Just don't reassociate them, and you get the
>> same result.
>
>As far as I can see, that would exclude all cache-friendly blocked
>matrix multiplication algorithms.

What do you see that makes you think so?

I see that, on the contrary, it is trivial to perform cache blocking without reassociation: I multiply a subblock of A with a subblock of B and add the result to a subblock of C. Then I work on the next subblocks. As long as I process the subblocks in any of the wide variety of orders that do not reassociate, the result is unchanged. And as far as I can see, among these orders there are some that are optimal for cache reuse.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>
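A C sketch of that claim (illustrative, not from the thread; the block size is an arbitrary choice): because the kk blocks are visited in ascending order, every C[i][j] accumulates its products for k = 0..n-1 in exactly the same order as the naive triple loop, so the FP result is bit-identical while the blocking still buys cache reuse:

    #define BS 64   /* block size: tune to cache; value is an assumption */

    /* C += A * B for n x n row-major matrices, blocked for cache
       without reassociating any of the additions into C[i][j]. */
    void matmul_blocked(int n, const double *A, const double *B, double *C)
    {
        for (int kk = 0; kk < n; kk += BS)          /* ascending k blocks */
            for (int ii = 0; ii < n; ii += BS)
                for (int jj = 0; jj < n; jj += BS)
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++)
                            for (int j = jj; j < jj + BS && j < n; j++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }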

Thomas Koenig, Sep 16, 2022, 6:39:52 AM
Thomas Koenig <tko...@netcologne.de> schrieb:

[Reduction]

> So, it would have to look something like
>
> double dot_product (double *const restrict a, double const *restrict b, long int n)
> {
>   double s0, s1, s2, s3, s4, s5, s6, s7, s_extra;
>   long int i;
>
>   if (n < 0)
>     return 0.0;
>
>   s0 = s1 = s2 = s3 = s4 = s5 = s6 = s7 = 0.0;
>   for (i = 0; i < n % 8; i++)
>     s0 = s0 + a[i] * b[i];
>
>   for (i = i; i < n; i += 8)
>     {
>       s0 += a[i] * b[i];
>       s1 += a[i+1] * b[i+1];
>       s2 += a[i+2] * b[i+2];
>       s3 += a[i+3] * b[i+3];
>       s4 += a[i+4] * b[i+4];
>       s5 += a[i+5] * b[i+5];
>       s6 += a[i+6] * b[i+6];
>       s7 += a[i+7] * b[i+7];
>     }
>   s_extra = 0.0;
>
>   return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7 + s_extra;
> }
>
> which is more error-prone (no guarantee for correctness of the
> above code) and which also does not vectorize with the current
> compiler (at least from what I could throw at it).

Ah, I finally succeeded. I have to pass an optimization option to clang itself, or opt and llc will not optimize. Strange...

Anyway, the central loop now looks like

    vec     r13,{r4,r5,r7,r8,r9,r10,r11,r12}
    ldd     r14,[r1,r6<<3,0]
    ldd     r15,[r1,r6<<3,8]
    ldd     r30,[r2,r6<<3,0]
    ldd     r29,[r2,r6<<3,8]
    fmac    r4,r15,r29,r4
    fmac    r5,r14,r30,r5
    ldd     r14,[r1,r6<<3,40]
    ldd     r15,[r1,r6<<3,32]
    ldd     r30,[r1,r6<<3,16]
    ldd     r29,[r1,r6<<3,24]
    ldd     r28,[r2,r6<<3,40]
    ldd     r27,[r2,r6<<3,32]
    ldd     r26,[r2,r6<<3,16]
    ldd     r25,[r2,r6<<3,24]
    fmac    r8,r29,r25,r8
    fmac    r7,r30,r26,r7
    fmac    r9,r15,r27,r9
    fmac    r10,r14,r28,r10
    ldd     r14,[r1,r6<<3,48]
    ldd     r15,[r1,r6<<3,56]
    ldd     r30,[r2,r6<<3,48]
    ldd     r29,[r2,r6<<3,56]
    fmac    r12,r15,r29,r12
    fmac    r11,r14,r30,r11
    loop    lt,r6,r3,#8

which should be good for eight FMAs every four cycles if the memory can keep up and the ALU is wide enough.

Can't test complex, because it seems to make a function call.
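As a back-of-the-envelope check on the unroll factor: with the 4-cycle FMAC latency assumed earlier and the issue rate this loop implies, 2 FMACs per cycle ("eight FMAs every four cycles"), a pipelined reduction needs at least latency x issue rate = 4 x 2 = 8 independent partial sums in flight, which is why the source was unrolled onto s0..s7.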

Stephen Fuld, Sep 16, 2022, 12:27:27 PM
I had not heard of quires, so I did a little research. What I found primarily was that I am not qualified to make an intelligent contribution to that debate. :-(

> <
> At this point, most of it falls into the category of "is it worth it",
> not "can it be done". There will be "potential" applications where VVM
> is not highly valued (controllers) and those where it is mandatory (HPC).

And there are applications where VVM is highly desirable, but where accurate parallelized floating point reductions may or may not be required. Several choices.

> Who is footing the bill {both engineering and manufacturing costs}
> and what is the expected volume {thousands per month -> millions per
> week}?

Thanks for the details and being patient with me.

Stephen Fuld, Sep 22, 2022, 12:29:18 PM
On 9/15/2022 2:27 PM, Stephen Fuld wrote:
> On 9/15/2022 10:43 AM, MitchAlsup wrote:

snip

>> For my part, I am looking at a ¼ cache line wide arithmetic addition
>> as a reasonable VVM implementation. This contains arithmetic to
>> perform {16-byte, 8-half, 4-word, 2-double} sized calculations with
>> "several" inbound buffers and 1-or-2 outbound buffers.
>
> So integer reductions could be parallelized. Good. We can explore this
> more in the future.

From your description, it looks to me like you are looking at the technique of splitting up a long add into several shorter adds by selectively blocking carry propagation. If so, I have several questions (a sketch of the suppressed-carry idea follows the questions).

1. Consider the case of summing an array of 256 word sized (32 bit) integers where the sum is to be put into a 64 bit integer. What happens if one or more of your intermediate sums exceeds 32 bits? If you are using the suppressed carries technique, I think you could get overflows in the intermediate adds that wouldn't show up in a non-parallelized version.

2. What about reductions like "find the maximum value of any element in an array"? I don't think this is amenable to the suppressed carry technique, while it falls out naturally in a multiple functional unit approach.
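To make question 1 concrete, a generic C sketch of the suppressed-carry (SWAR) idea, not Mitch's actual design: two 32-bit lanes packed in one 64-bit register, with the carry out of the low lane masked off so it cannot reach the high lane:

    #include <stdint.h>

    /* One 64-bit adder acting as two independent 32-bit adders.
       Suppressing the inter-lane carry is what enables the split,
       and it is also the overflow hazard of question 1: each lane
       wraps mod 2^32 with no carry out. */
    static uint64_t padd32x2(uint64_t x, uint64_t y)
    {
        const uint64_t LO = 0x00000000FFFFFFFFull;
        uint64_t lo = (x & LO) + (y & LO);     /* may carry into bit 32 */
        uint64_t hi = (x >> 32) + (y >> 32);   /* high lanes; carry-out lost */
        return (lo & LO) | (hi << 32);         /* drop the inter-lane carry */
    }

Summing 256 words two at a time this way yields two 32-bit partial sums; unless the hardware widens the lanes or tracks the suppressed carries, a partial sum above 2^32 wraps even though the final 64-bit total would have fit, which is exactly the scenario question 1 describes.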
