
Fortran aliasing


robert....@oracle.com

May 30, 2012, 11:41:16 PM
The following program demonstrates a problem I found a few weeks ago.

      SUBROUTINE SUBR(B, C)
        DIMENSION B(10), C(5)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
      END

      PROGRAM COPY
        DIMENSION A(10)
        DO 10 I = 1,10
          A(I) = 1.0
   10   CONTINUE
        CALL SUBR(A, A(1:10:2))
        PRINT *, A
      END

For a few days, I thought the implementation I support did not
handle this code correctly. I even developed a fix for the
problem. Then, I read the aliasing rules more carefully, and
I discovered that the program is not standard conforming.

The program was compiled and run on a variety of systems,
and it produced consistent results on all of them.

Robert Corbett

glen herrmannsfeldt

May 31, 2012, 12:01:06 AM
robert....@oracle.com wrote:
> The following program demonstrates a problem I found a few weeks ago.

I don't see any aliasing. Also, why is C not used?

>       SUBROUTINE SUBR(B, C)
>         DIMENSION B(10), C(5)
>         DO 10 I = 1, 10
>           B(I) = 2.0
>    10   CONTINUE
>       END

>       PROGRAM COPY
>         DIMENSION A(10)
>         DO 10 I = 1,10
>           A(I) = 1.0
>    10   CONTINUE
>         CALL SUBR(A, A(1:10:2))
>         PRINT *, A
>       END

> For a few days, I thought the implementation I support did not
> handle this code correctly. I even developed a fix for the
> problem. Then, I read the aliasing rules more carefully, and
> I discovered that the program is not standard conforming.

-- glen

robert....@oracle.com

May 31, 2012, 12:14:40 AM
On May 30, 9:01 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> robert.corb...@oracle.com wrote:
> > The following program demonstrates a problem I found a few weeks ago.
>
> I don't see any aliasing. Also, why is C not used?

The dummy arguments B and C reference some of the same
elements of the array A. That is the aliasing. C is
not used because the program is obviously not standard
conforming if C is used.
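
A sketch of such a variant (not part of the original program): if SUBR
also defined C, both dummies would be used to define overlapping
elements of A, and the violation would be plain:

      SUBROUTINE SUBR(B, C)
        DIMENSION B(10), C(5)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
!       hypothetical extra statement: C now defines A(1),
!       which B has also defined
        C(1) = 3.0
      END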

Bob Corbett

William Clodius

May 31, 2012, 12:22:34 AM
A(1:10) is aliased with A(1:10:2). The general behavior he illustrates
doesn't depend on whether anything is done with C. If the
implementation used copy-in/copy-out, the result would depend on whether
the copy-out of A(1:10:2) occurred before or after the copy-out of
A(1:10).
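
In effect, such an implementation transforms the call into something
like this sketch (TMP is a hypothetical compiler temporary):

      REAL TMP(5)
!     copy-in: pack the non-contiguous section into the temporary
      TMP = A(1:10:2)
      CALL SUBR(A, TMP)
!     copy-out: the saved values overwrite A(1), A(3), ..., A(9)
      A(1:10:2) = TMP

If the copy-out happens after SUBR has set all of A to 2.0, the
odd-indexed elements revert to 1.0, giving an alternating pattern of
1.0s and 2.0s.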

--
Bill Clodius
los the lost and net the pet to email

Richard Maine

May 31, 2012, 12:57:29 AM
William Clodius <wclo...@lost-alamos.pet> wrote:

> glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
>
> > robert....@oracle.com wrote:
> > > The following program demonstrates a problem I found a few weeks ago.
> >
> > I don't see any aliasing. Also, why is C not used?

> A(1:10) is aliased with A(1:10:2). The general behavior he illustrates
> doesn't depend on whether anything is done with C. If the
> implementation used copy-in/copy-out, the result would depend on whether
> the copy-out of A(1:10:2) occurred before or after the copy-out of
> A(1:10).

Yup. I agree with Bill's analysis (and with Bob's conclusion that the
code is nonstandard). As Bob says, the problem becomes more "obvious" if
C is used, but the aliasing exists anyway. If one wants to be picky
about the standard-speak, the standard doesn't actually define the term
"aliasing" (perhaps it might simplify some rules to do so, but then
maybe not), but it does define that this particular situation makes the
code invalid. I just find it easier to explain (but I don't need to, as
Bill already did so) by using the concept of aliasing.

--
Richard Maine | Good judgment comes from experience;
email: last name at domain . net | experience comes from bad judgment.
domain: summertriangle | -- Mark Twain

Phillip Helbig---undress to reply

May 31, 2012, 2:18:29 AM
In article
<cb370060-8fa0-4f80...@v9g2000yqm.googlegroups.com>,
robert....@oracle.com writes:

>       SUBROUTINE SUBR(B, C)
>         DIMENSION B(10), C(5)
>         DO 10 I = 1, 10
>           B(I) = 2.0
>    10   CONTINUE
>       END
>
>       PROGRAM COPY
>         DIMENSION A(10)
>         DO 10 I = 1,10
>           A(I) = 1.0
>    10   CONTINUE
>         CALL SUBR(A, A(1:10:2))
>         PRINT *, A
>       END

> The program was compiled and run on a variety of systems,
> and it produced consistent results on all of them.

I get

1.000000 2.000000 1.000000 2.000000 1.000000
2.000000 1.000000 2.000000 1.000000 2.000000

Is that what you get?

I'm surprised. COPY sets all elements of A to 1.0. A is then passed
to SUBR, whose dummy argument B sets them all to 2.0.

C seems irrelevant; nothing is done with it in SUBR. So

      SUBROUTINE SUBR(B)
        DIMENSION B(10)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
      END

      PROGRAM COPY
        DIMENSION A(10)
        DO 10 I = 1,10
          A(I) = 1.0
   10   CONTINUE
        CALL SUBR(A)
        PRINT *, A
      END

However, this produces

2.000000 2.000000 2.000000 2.000000 2.000000
2.000000 2.000000 2.000000 2.000000 2.000000

as expected.

What is going on? SUBR gets two arguments: A and half of A. The first
argument is changed, the second is not. My guess is that the compiler
would be allowed to produce the output "all elements 2.0" in the first
case as well, though the result shown is also allowed.

Maybe the result could be understood as follows. Since nothing happens
to C, this is equivalent to

      DO 10 I = 1, 5
        C(I) = C(I)
   10 CONTINUE

which (re)sets these values to 1.0.

glen herrmannsfeldt

May 31, 2012, 2:38:42 AM
William Clodius <wclo...@lost-alamos.pet> wrote:

(snip)
>> >       SUBROUTINE SUBR(B, C)
>> >         DIMENSION B(10), C(5)
>> >         DO 10 I = 1, 10
>> >           B(I) = 2.0
>> >    10   CONTINUE
>> >       END

>> >       PROGRAM COPY
>> >         DIMENSION A(10)
>> >         DO 10 I = 1,10
>> >           A(I) = 1.0
>> >    10   CONTINUE
>> >         CALL SUBR(A, A(1:10:2))
>> >         PRINT *, A
>> >       END

(snip)
> A(1:10) is aliased with A(1:10:2). The general behavior he illustrates
> doesn't depend on whether anything is done with C. If the
> implementation used copy-in/copy-out, the result would depend on whether
> the copy-out of A(1:10:2) occurred before or after the copy-out of
> A(1:10).

I think some time ago I wondered about this, but then was
somehow convinced that it wasn't a problem.

None that I know of do actual copy-in/copy-out for arrays.
Some do a copy before the call, in the case of (potentially)
non-contiguous arrays.

I did once try some tests to detect problems caused by array
copies, but wasn't able to cause any such problems.

-- glen

robert....@oracle.com

May 31, 2012, 3:15:03 AM
On May 30, 11:18 pm, hel...@astro.multiCLOTHESvax.de (Phillip Helbig---
undress to reply) wrote:
> In article
> <cb370060-8fa0-4f80-8d4a-c5eb46641...@v9g2000yqm.googlegroups.com>,
> robert.corb...@oracle.com writes:
> >       SUBROUTINE SUBR(B, C)
> >         DIMENSION B(10), C(5)
> >         DO 10 I = 1, 10
> >           B(I) = 2.0
> >    10   CONTINUE
> >       END
>
> >       PROGRAM COPY
> >         DIMENSION A(10)
> >         DO 10 I = 1,10
> >           A(I) = 1.0
> >    10   CONTINUE
> >         CALL SUBR(A, A(1:10:2))
> >         PRINT *, A
> >       END
> > The program was compiled and run on a variety of systems,
> > and it produced consistent results on all of them.
>
> I get
>
>   1.000000       2.000000       1.000000       2.000000       1.000000
>   2.000000       1.000000       2.000000       1.000000       2.000000
>
> Is that what you get?

Yes, it is.

> I'm surprised.

I thought some people would be surprised.
The array section A(1:10:2) is not contiguous. The array C is
an explicit-shape array, which for most implementations is
required to be contiguous. The common way of working around
this inconsistency is to make a contiguous copy of the elements
of the section and pass that to C. After the return, the
elements of the copy are copied back to the corresponding
elements of the array A.

Bob Corbett

Phillip Helbig---undress to reply

May 31, 2012, 3:52:16 AM
In article
<e28906f6-25e2-4159...@h9g2000yqi.googlegroups.com>,
robert....@oracle.com writes:

> The array section A(1:10:2) is not contiguous. The array C is
> an explicit-shape array, which for most implementations is
> required to be contiguous. The common way of working around
> this inconsistency is to make a contiguous copy of the elements
> of the section and pass that to C. After the return, the
> elements of the copy are copied back to the corresponding
> elements of the array A.

Makes sense. However, would a compiler be allowed to optimize this
away, since nothing is done with C?

Ian Harvey

May 31, 2012, 7:19:50 AM
A compiler would be allowed to do this if it were clever enough
(noting that the code is still non-conforming regardless of what the
compiler does). With Fortran's separate compilation model, however,
the compiler typically doesn't know that nothing is done with the C
argument when it compiles the call to SUBR.

Richard Maine

May 31, 2012, 11:08:39 AM
Phillip Helbig---undress to reply <hel...@astro.multiCLOTHESvax.de>
wrote:
The compiler is allowed to do absolutely anything, since the code is
nonconforming and in a way that does not require compiler diagnostics.

James Van Buskirk

May 31, 2012, 3:16:53 PM
<robert....@oracle.com> wrote in message
news:e28906f6-25e2-4159...@h9g2000yqi.googlegroups.com...

> The array section A(1:10:2) is not contiguous. The array C is
> an explicit-shape array, which for most implementations is
> required to be contiguous. The common way of working around
> this inconsistency is to make a contiguous copy of the elements
> of the section and pass that to C. After the return, the
> elements of the copy are copied back to the corresponding
> elements of the array A.

Fun example, Bob! For even more fun, we could try permuting the
order of arguments B and C and making A itself discontiguous:

C:\gfortran\clf\copy_test>type ct1.for
      SUBROUTINE SUBR(B, C)
        DIMENSION B(10), C(5)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
      END

      PROGRAM COPY
        DIMENSION A(10)
        DO 10 I = 1,10
          A(I) = 1.0
   10   CONTINUE
        CALL SUBR(A, A(1:10:2))
        PRINT *, A
      END

C:\gfortran\clf\copy_test>gfortran ct1.for -oct1

C:\gfortran\clf\copy_test>ct1
 1.00000000 2.00000000 1.00000000 2.00000000 1.00000000
 2.00000000 1.00000000 2.00000000 1.00000000 2.00000000

C:\gfortran\clf\copy_test>type ct2.for
      SUBROUTINE SUBR(C, B)
        DIMENSION B(10), C(5)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
      END

      PROGRAM COPY
        DIMENSION A(10)
        DO 10 I = 1,10
          A(I) = 1.0
   10   CONTINUE
        CALL SUBR(A(1:10:2), A)
        PRINT *, A
      END

C:\gfortran\clf\copy_test>gfortran ct2.for -oct2

C:\gfortran\clf\copy_test>ct2
 1.00000000 2.00000000 1.00000000 2.00000000 1.00000000
 2.00000000 1.00000000 2.00000000 1.00000000 2.00000000

C:\gfortran\clf\copy_test>type ct3.for
      SUBROUTINE SUBR(B, C)
        DIMENSION B(10), C(5)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
      END

      PROGRAM COPY
        DIMENSION D(20)
        TARGET D
        REAL(KIND(D)), POINTER :: A(:)
        A => D(1:20:2)
        DO 10 I = 1,10
          A(I) = 1.0
   10   CONTINUE
        CALL SUBR(A, A(1:10:2))
        PRINT *, A
      END

C:\gfortran\clf\copy_test>gfortran ct3.for -oct3

C:\gfortran\clf\copy_test>ct3
 1.00000000 2.00000000 1.00000000 2.00000000 1.00000000
 2.00000000 1.00000000 2.00000000 1.00000000 2.00000000

C:\gfortran\clf\copy_test>type ct4.for
      SUBROUTINE SUBR(C, B)
        DIMENSION B(10), C(5)
        DO 10 I = 1, 10
          B(I) = 2.0
   10   CONTINUE
      END

      PROGRAM COPY
        DIMENSION D(20)
        TARGET D
        REAL(KIND(D)), POINTER :: A(:)
        A => D(1:20:2)
        DO 10 I = 1,10
          A(I) = 1.0
   10   CONTINUE
        CALL SUBR(A(1:10:2), A)
        PRINT *, A
      END

C:\gfortran\clf\copy_test>gfortran ct4.for -oct4

C:\gfortran\clf\copy_test>ct4
 2.00000000 2.00000000 2.00000000 2.00000000 2.00000000
 2.00000000 2.00000000 2.00000000 2.00000000 2.00000000

For ct3.for and ct4.for we may expect results to depend upon compiler
and optimization level, because there is then no certain order in
which copy-in/copy-out is performed.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end


William Clodius

May 31, 2012, 11:21:10 PM
Is it consistent with

CALL SUBR(A(1:10:1), A(1:10:2))

which encourages two copy-in/copy-outs?

robert....@oracle.com

May 31, 2012, 11:48:57 PM
Yes, I got the same results with that change.

I got different results with

CALL SUBR(A(10:1:-1), A(1:10:2))

which more strongly encourages two sets of copies.
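
A sketch of the two sets of copies that call encourages (T1 and T2 are
hypothetical compiler temporaries):

      REAL T1(10), T2(5)
!     both actual arguments are non-contiguous sections, so
!     both get packed into temporaries before the call
      T1 = A(10:1:-1)
      T2 = A(1:10:2)
      CALL SUBR(T1, T2)
!     two copy-outs: their relative order decides which values
!     survive in the overlapping elements of A
      A(10:1:-1) = T1
      A(1:10:2) = T2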

Bob Corbett

Ron Shepard

Jun 1, 2012, 9:18:34 PM
In article <1kkz9kq.1evufv0fzo6ksN%wclo...@lost-alamos.pet>,
wclo...@lost-alamos.pet (William Clodius) wrote:

> Is it consistent with
>
> CALL SUBR(A(1:10:1), A(1:10:2))
>
> which encourages two copy-in/copy-outs?

I don't think this has been mentioned before, but the reason
copy-in/copy-out is done is because the interface to SUBR is
implicit. The above code demonstrates this very well. If there
were an explicit interface with assumed shape dummy arguments, then
no copy-in/copy-out would occur, but also I think most compilers
would then recognize that the code is illegal due to the alias and
print an error message at compile time. If SUBR() were a module
subroutine rather than an external subroutine, then it would have an
explicit interface through the USE statement.
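
A sketch of that arrangement (the module and declarations are an
illustration, not code from this thread):

      MODULE SUBS
      CONTAINS
        SUBROUTINE SUBR(B, C)
!         assumed-shape dummies: descriptors are passed, so no
!         copy-in/copy-out is needed
          REAL :: B(:), C(:)
          B = 2.0
        END SUBROUTINE SUBR
      END MODULE SUBS

With USE SUBS in the main program, the interface is explicit and the
compiler sees both actual arguments at the call site.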

$.02 -Ron Shepard

Dick Hendrickson

Jun 1, 2012, 10:33:49 PM
Err, uhh, sort of no; what you say is true, but it's not really all of
the issue. It's easy to think about copy-in/copy-out for non-contiguous
arrays; but, it's really a much more general problem. Modern optimizing
compilers almost always use [pseudo]copy-in/copy-out for integer scalars.

Given a subroutine like
Subroutine xxx (array, N)
integer array(100)
DO I = 1,100
N = N + something or other
Array(I) = N - something else
enddo
end
If you look at the generated code you'll think the compiler (rather
inefficiently) copies N into a register during the call, modifies it
during the loop, and copies it out during the return process. No
discontiguous arrays, no special module handling, implicit or explicit
doesn't matter, nothing to alert the compiler. People would be
surprised and disappointed if N were loaded and stored from memory on
each iteration.
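
In effect, the generated code behaves like this sketch (NREG stands in
for the register; the updates are placeholders for the "something or
other" above):

!     load N into a register once at entry
      NREG = N
      DO 10 I = 1, 100
        NREG = NREG + 1
        ARRAY(I) = NREG - 1
   10 CONTINUE
!     store the register copy back once at return
      N = NREG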

Then if you call it with
integer array(100000)
read *, j
call xxx (array, array(j))
things fall apart due to normal expected optimization if j is less than
100.

The anti-aliasing rule is designed to allow the compiler to do all sorts
of optimizations.

Dick Hendrickson

Richard Maine

Jun 2, 2012, 12:27:02 AM
Ron Shepard <ron-s...@NOSPAM.comcast.net> wrote:

> In article <1kkz9kq.1evufv0fzo6ksN%wclo...@lost-alamos.pet>,
> wclo...@lost-alamos.pet (William Clodius) wrote:
>
> > Is it consistent with
> >
> > CALL SUBR(A(1:10:1), A(1:10:2))
> >
> > which encourages two copy-in/copy-outs?
>
> I don't think this has been mentioned before, but the reason
> copy-in/copy-out is done is because the interface to SUBR is
> implicit.

Um, no. I completely disagree. The reason for the copy-in/copy-out is
that the dummy argument is explicit shape.

> If there
> were an explicit interface with assumed shape dummy arguments, then
> no copy-in/copy-out would occur,

You appear to be conflating two things and attributing the issue to the
wrong one. It is the assumed-shape that would keep copy-in/copy-out from
happening. (That's not true 100% of the time either; you sometimes can
get copy behavior with assumed shape, but as a rule you most often won't
unless you go out of your way to ask for it, which you can do with some
compilers). It is true that assumed shape requires an explicit
interface, but the converse is not so. An explicit interface does not
require assumed shape.

If it were true that it was the implicit interface that was causing the
copy, then just making the interface explicit would change that. There
would be no need to bring up assumed shape at all. The old "try to
change only one thing at a time when debugging" philosophy.

glen herrmannsfeldt

Jun 2, 2012, 1:40:53 AM
Dick Hendrickson <dick.hen...@att.net> wrote:
(snip)
> arrays; but, it's really a much more general problem. Modern optimizing
> compilers almost always use [pseudo]copy-in/copy-out for integer scalars.

> Given a subroutine like
> Subroutine xxx (array, N)
> integer array(100)
> DO I = 1,100
> N = N + something or other
> Array(I) = N - something else
> enddo
> end

I would probably use a local variable in the loop, and then
copy to N myself.

> If you look at the generated code you'll think the compiler (rather
> inefficiently) copies N into a register during the call, modifies it
> during the loop, and copies it out during the return process. No
> discontiguous arrays, no special module handling, implicit or explicit
> doesn't matter, nothing to alert the compiler. People would be
> surprised and disappointed if N were loaded and stored from memory on
> each iteration.

The OS/360 compilers always make a local copy for scalars.
I haven't seen any others that do it so consistently, but I will
believe that some optimizers would do it.

> Then if you call it with
> integer array(100000)
> read *, j
> call xxx (array, array(j))
> things fall apart due to normal expected optimization if j is
> less than 100.

Reminds me of ALGOL and Call by name. Though the usual cases are
more like:

call yyy(array(j),j)

In ALGOL, if you change j inside yyy, you get a different array element.

-- glen

Richard Maine

Jun 2, 2012, 2:13:45 AM
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

> Dick Hendrickson <dick.hen...@att.net> wrote:
> (snip)
> > arrays; but, it's really a much more general problem. Modern optimizing
> > compilers almost always use [pseudo]copy-in/copy-out for integer scalars.
>
> > Given a subroutine like
> > Subroutine xxx (array, N)
> > integer array(100)
> > DO I = 1,100
> > N = N + something or other
> > Array(I) = N - something else
> > enddo
> > end
>
> I would probably use a local variable in the loop, and then
> copy to N myself.

The point, however, is not how you might write the code differently, but
what a compiler is allowed to do with the code as is. Plus, of course,
the example is intentionally trivial, but illustrates behavior that can
arise in cases that aren't so trivial.

> > If you look at the generated code you'll think the compiler (rather
> > inefficiently) copies N into a register during the call, modifies it
> > during the loop, and copies it out during the return process. No
> > discontiguous arrays, no special module handling, implicit or explicit
> > doesn't matter, nothing to alert the compiler. People would be
> > surprised and disappointed if N were loaded and stored from memory on
> > each iteration.
>
> The OS/360 compilers always make a local copy for scalars.
> I haven't seen any others that do it so consistently, but I will
> believe that some optimizers would do it.

I think you might have read what Dick said too hastily. He's not talking
about copying into a local variable that lives in memory. He is talking
about copying the value into a register. Darn near all compilers tend to
keep values in registers during loops when optimization is on. That's
such a basic technique that you might even see it with optimization
turned off. I have trouble believing that you haven't seen other
compilers that do that kind of thing pretty consistently.

This isn't what people are usually talking about when they refer to
copy-in, but it is, in fact, a copy, even though not to memory.

Ron Shepard

Jun 2, 2012, 2:28:57 AM
In article <1kl14i3.6o77p1l79l36N%nos...@see.signature>,
nos...@see.signature (Richard Maine) wrote:

> Ron Shepard <ron-s...@NOSPAM.comcast.net> wrote:
>
> > In article <1kkz9kq.1evufv0fzo6ksN%wclo...@lost-alamos.pet>,
> > wclo...@lost-alamos.pet (William Clodius) wrote:
> >
> > > Is it consistent with
> > >
> > > CALL SUBR(A(1:10:1), A(1:10:2))
> > >
> > > which encourages two copy-in/copy-outs?
> >
> > I don't think this has been mentioned before, but the reason
> > copy-in/copy-out is done is because the interface to SUBR is
> > implicit.
>
> Um, no. I completely disagree. The reason for the copy-in/copy-out is
> that the dummy argument is explicit shape.
>
> > If there
> > were an explicit interface with assumed shape dummy arguments, then
> > no copy-in/copy-out would occur,
>
> You appear to be conflating two things and attributing the issue to the
> wrong one. It is the assumed-shape that would keep copy-in/copy-out from
> happening.

Yes, I should have reversed the order of those two sentences. But
both things are required because (as you say) you cannot have
assumed shape without having an explicit interface.

> (That's not true 100% of the time either; you sometimes can
> get copy behavior with assumed shape, but as a rule you most often won't
> unless you go out of your way to ask for it, which you can do with some
> compilers). It is true that assumed shape requires an explicit
> interface, but the converse is not so. An explicit interface does not
> require assumed shape.
>
> If it were true that it was the implicit interface that was causing the
> copy, then just making the interface explicit would change that. There
> would be no need to bring up assumed shape at all. The old "try to
> change only one thing at a time when debugging" philosophy.

Yes, I agree again, but in this situation you cannot do only one of
those things to avoid the copy-in/copy-out, you must do both. And
if you do both, then the compiler will probably recognize the
underlying alias error, something else that cannot be done with only
an implicit interface.

$.02 -Ron Shepard

Ron Shepard

Jun 2, 2012, 2:40:33 AM
In article <1kl19d9.k98sq31svc9t2N%nos...@see.signature>,
nos...@see.signature (Richard Maine) wrote:

> I think you might have read what Dick said too hastily. He's not talking
> about copying into a local variable that lives in memory. He is talking
> about copying the value into a register. Darn near all compilers tend to
> keep values in registers during loops when optimization is on. That's
> such a basic technique that you might even see it with optimization
> turned off. I have trouble believing that you haven't seen other
> compilers that do that kind of thing pretty consistently.
>
> This isn't what people are usually talking about when they refer to
> copy-in, but it is, in fact, a copy, even though not to memory.

There is hardware that exists that is based on pipelined operations
in which the intermediate results exist neither in memory nor in
registers. The floating point value is generated in three pieces on
separate memory cycles. First the exponent is generated and passed
down the pipeline, then the high-order bits are generated and passed
on, and then the low-order bits are generated.

Vector and SIMD hardware also does something similar, where
intermediate results are consumed (and sometimes generated and
consumed several times) before they are actually stored in either
registers or memory locations.

From an alias perspective (or avoiding an apparent alias), these
situations also count as copies of data.

$.02 -Ron Shepard

Tobias Burnus

Jun 2, 2012, 2:43:54 AM
Dick Hendrickson wrote:
> The anti-aliasing rule is designed to allow the compiler to do all sorts
> of optimizations.

One real-world example is the following invalid code. It does not
involve copy-in/copy-out issues, but it too stems from passing the
same argument twice to a procedure. The code fails when the temporary
array "inte" is optimized away, which happens with one compiler when
optimization is turned on: the stores into oupu then overwrite
elements of inpu before they have been read.

(By chance, for this code, the ordering of the generated assembler
instructions only leads to unexpected results on the 32-bit and not on
the 64-bit x86 platform.)

call iei4ei(in4,in4)   ! the same array is passed as both arguments
...
subroutine iei4ei(inpu,oupu)
integer(kind=1) :: inpu(4)
integer(kind=1) :: oupu(4)
! Local
integer(kind=1) :: inte(4)
!
inte(:) = inpu(:)   ! temporary copy of the input bytes
oupu(4) = inte(1)   ! reverse the byte order
oupu(3) = inte(2)
oupu(2) = inte(3)
oupu(1) = inte(4)
end subroutine iei4ei

Tobias

glen herrmannsfeldt

Jun 2, 2012, 5:37:21 AM
Ron Shepard <ron-s...@nospam.comcast.net> wrote:

(snip)
> There is hardware that exists that is based on pipelined operations
> in which the intermediate results exist neither in memory nor in
> registers. The floating point value is generated in three pieces on
> separate memory cycles. First the exponent is generated and passed
> down the pipeline, then the high-order bits are generated and passed
> on, and then the low-order bits are generated.

And, in addition, register renaming to pass along such values,
especially on register starved machines.

> Vector and SIMD hardware also does something similar, where
> intermediate results are consumed (and sometimes generated and
> consumed several times) before they are actually stored in either
> registers or memory locations.

> From an alias perspective (or avoiding an apparent alias), these
> situations also count as copies of data.

Some might be noticed even without aliasing.

-- glen

Ron Shepard

Jun 2, 2012, 12:24:25 PM
In article <ron-shepard-BA53...@news60.forteinc.com>,
Ron Shepard <ron-s...@NOSPAM.comcast.net> wrote:

> There is hardware that exists that is based on pipelined operations
> in which the intermediate results exist neither in memory nor in
> registers. The floating point value is generated in three pieces on
> separate memory cycles. First the exponent is generated and passed
> down the pipeline, then the high-order bits are generated and passed
> on, and then the low-order bits are generated.

I think I wrote this backwards. I think it is the low-order bits
that are generated first, and then the high-order bits become
available on the next cycle.

> Vector and SIMD hardware also does something similar, where
> intermediate results are consumed (and sometimes generated and
> consumed several times) before they are actually stored in either
> registers or memory locations.

Another example of this is the fused multiply-add that is now fairly
common in instruction sets: a=a+x*y. This is used, for example, in
computing dot products. The result of the multiplication is never
stored in memory or a register (at least not a normal addressable
register); it is directed into the adder, where the result is updated
in a normal addressable register. If "a" is aliased to "x" and/or
"y" in the Fortran source code, then the result using the register
copy of "a" would be different from the result where everything is
stored back to memory during the dot product. This is a particular
kind of optimization using explicit register copies for the result
"a" with a pipelined operation.

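For instance, in a dot-product loop the product X(I)*Y(I) never lands
anywhere addressable; it flows straight into the accumulation (a
sketch):

      S = 0.0
      DO 10 I = 1, N
!       one fused multiply-add per iteration
        S = S + X(I)*Y(I)
   10 CONTINUE
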
glen herrmannsfeldt

Jun 2, 2012, 2:08:25 PM
Ron Shepard <ron-s...@nospam.comcast.net> wrote:

(snip)
> I think I wrote this backwards. I think it is the low-order bits
> that are generated first, and then the high-order bits become
> available on the next cycle.

Well, for add and subtract usually the low bits first, and
for multiply and divide the high bits first. But you will
never see that, as it will always complete the operation
before giving you the bits.

The IBM 360/91 was for many years a favorite machine to discuss
in books on pipelined processors. (And also the machine much of
my Fortran programming was done on.) Besides all the above, it
does out of order execution, such that the results of later
instructions might make it into registers before earlier ones.

Many processors now do out-of-order execution, but it
is usual to still do retirement (including storing values in
actual registers) in order. That takes a lot of logic to keep
results around, which the 360/91 doesn't have. One result is
the imprecise interrupt. If there is a floating point divide
followed by a floating point add, the add may finish first.
Now, if the divide overflows and an exception is generated,
all instructions in execution have to be completed before
the interrupt, and the address stored will be after the add.

Often there is the "multiple imprecise interrupt," as
more exceptions occur while the pipeline is being flushed.

One consequence of imprecise interrupts that mattered for Fortran
was that there was no fixup for alignment errors. It was not hard to write
a COMMON statement such that some variables were not aligned
on the required boundary. After not so long, programmers learned
to avoid this, but on many machines an interrupt routine would
attempt a fixup by copying the data, redoing the operation,
and then copying it back. With the imprecise interrupt on
the 360/91, that wasn't possible.

Then again, fixup was slow enough that it was usual to fix
the COMMON instead.

-- glen

Ron Shepard

Jun 2, 2012, 6:35:13 PM
In article <jqdkqo$tbe$1...@speranza.aioe.org>,
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

> Ron Shepard <ron-s...@nospam.comcast.net> wrote:
>
> (snip)
> > I think I wrote this backwards. I think it is the low-order bits
> > that are generated first, and then the high-order bits become
> > available on the next cycle.
>
> Well, for add and subtract usually the low bits first, and
> for multiply and divide the high bits first. But you will
> never see that, as it will always complete the operation
> before giving you the bits.

On the machines where I saw this, the full result was most certainly
NOT completed before the bits were passed on to the next stage. For
example, on the FPS-164 an add took two cycles to complete the
result and a multiply took three cycles to complete. This was a
VLIW type of machine where you could program different parts of the
instruction. It was possible to pipe the results of the multiplier
directly into the adder, and the bits showed up cycle by cycle as I
described above. It was possible, for example, to execute a dot
product with one instruction, and after the pipes were full, you
would get an effective execution rate of one add and one multiply
per clock cycle. The add, multiply, address increments, test, and
branch would all fit in one instruction word which just kept looping
on itself. But within any one instruction step, the results from
the adds and the multiplies were never complete, just parts of the
operations were done and then passed on to the next stage.

I think there were other VLIW type machines that also allowed this
kind of pipelining. This was in the early 80's before the RISC type
processors became popular. The RISC idea also included FP
instructions that took several cycles to complete, but as you say
the results were typically not available for further processing
until the last cycle had been executed. An exception of sorts was
that some RISC machines had a fused multiply-add instruction that
took fewer cycles to complete than the sequence of separate multiply
and add instructions, presumably because it avoided storage of the
intermediate result from the multiplier. I say "presumably" because
it was all done in hardware, and the programmer could not see the
microinstructions under the covers like you could on the VLIW
machines.

$.02 -Ron Shepard

glen herrmannsfeldt

Jun 2, 2012, 9:13:55 PM
Ron Shepard <ron-s...@nospam.comcast.net> wrote:

(snip)
>> Well, for add and subtract usually the low bits first, and
>> for multiply and divide the high bits first. But you will
>> never see that, as it will always complete the operation
>> before giving you the bits.

> On the machines where I saw this, the full result was most certainly
> NOT completed before the bits were passed on to the next stage. For
> example, on the FPS-164 an add took two cycles to complete the
> result and a multiply took three cycles to complete. This was a
> VLIW type of machine where you could program different parts of the
> instruction.

I never used the FPS machine, though I did work with some people
who did. Yes, strange machines.

> I think there were other VLIW type machines that also allowed this
> kind of pipelining. This was in the early 80's before the RISC type
> processors became popular. The RISC idea also included FP
> instructions that took several cycles to complete, but as you say
> the results were typically not available for further processing
> until the last cycle had been executed.

Some RISC machines had the one cycle per instruction goal, and
since multiply took more than one cycle, had a multiply-step
instruction. You have to execute it the appropriate number of
times!

-- glen