That's the utterly conventional way to do it, as on everything from VAX to x86
to ARM, yes. It is more convenient for assembly language programmers and needs
less setup code before the loop.
Having loads scale the index by the size of the load (only) is absolutely an
option I've been considering. It's cheap to do and doesn't need any extra
encoding space.
The problem is that adding an index to a store is NOT under consideration,
because it would add an expensive new feature to the ISA: a store with three
source registers (base, index and data). Ok, that's not so much a problem for
FP stores, where the data comes from the FP register file, but it definitely
is for integer ones.
The point is that adding an index for loads (but not stores) is very cheap. And
adding EA writeback for stores (but not loads) is very cheap. And the asymmetry
between those two has an unexpected synergy -- as shown in my code example --
IFF the index for the loads is *not* scaled.
Of course you can easily argue that ...
saxpy:  ; float *dstp, *ap, *xp, *yp; int count
        beqz    count,2f
        addi    dstp,dstp,-4
        li      i,0
1:
        flw     a,(ap,i<<2)
        flw     x,(xp,i<<2)
        flw     y,(yp,i<<2)
        fmadd.s dst,a,x,y
        fsw     dst,4(dstp)!
        addi    i,i,1
        bne     i,count,1b
2:      ret
... is better than ...
saxpy:  ; float *dstp, *ap, *xp, *yp; int count
        beqz    count,2f
        addi    dstp,dstp,-4
        sub     ap,ap,dstp
        sub     xp,xp,dstp
        sub     yp,yp,dstp
        slli    count,count,2
        add     limit,dstp,count
1:
        flw     a,(ap,dstp)
        flw     x,(xp,dstp)
        flw     y,(yp,dstp)
        fmadd.s dst,a,x,y
        fsw     dst,4(dstp)!
        bne     dstp,limit,1b
2:      ret
... because of the smaller setup code, despite the one extra instruction in the
loop. It's certainly much better than the version with five bump instructions
in the loop.
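For anyone who wants to check the address arithmetic, here is a minimal C
sketch of the two loops above. The function names saxpy_scaled and
saxpy_ptrdiff are made up for illustration, and the uintptr_t casts assume a
flat address space (converting an integer back to a pointer is
implementation-defined in C). They only stand in for what the proposed
(base,index) loads and the 4(dstp)! store would compute.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the first loop: element index i, scaled by 4 inside each load. */
void saxpy_scaled(float *dstp, const float *ap, const float *xp,
                  const float *yp, int count)
{
    assert(count >= 0 && sizeof(float) == 4);  /* assumptions of this sketch */
    if (count == 0) return;                    /* beqz count,2f */
    uintptr_t d = (uintptr_t)dstp - 4;         /* addi dstp,dstp,-4 */
    uintptr_t i = 0;                           /* li i,0 */
    do {
        float a = *(const float *)((uintptr_t)ap + (i << 2)); /* flw a,(ap,i<<2) */
        float x = *(const float *)((uintptr_t)xp + (i << 2)); /* flw x,(xp,i<<2) */
        float y = *(const float *)((uintptr_t)yp + (i << 2)); /* flw y,(yp,i<<2) */
        d += 4;                                /* fsw dst,4(dstp)! : bump first, */
        *(float *)d = a * x + y;               /* then store (fmadd.s is fused;  */
        i += 1;                                /* plain a*x+y may round twice)   */
    } while (i != (uintptr_t)count);           /* bne i,count,1b */
}

/* Model of the second loop: the source pointers are turned into offsets from
   the moving dstp, so the store's writeback advances all three loads too. */
void saxpy_ptrdiff(float *dstp, const float *ap, const float *xp,
                   const float *yp, int count)
{
    assert(count >= 0 && sizeof(float) == 4);
    if (count == 0) return;                    /* beqz count,2f */
    uintptr_t d  = (uintptr_t)dstp - 4;        /* addi dstp,dstp,-4 */
    uintptr_t ao = (uintptr_t)ap - d;          /* sub ap,ap,dstp */
    uintptr_t xo = (uintptr_t)xp - d;          /* sub xp,xp,dstp */
    uintptr_t yo = (uintptr_t)yp - d;          /* sub yp,yp,dstp */
    uintptr_t limit = d + ((uintptr_t)count << 2); /* slli ; add limit,... */
    do {
        float a = *(const float *)(ao + d);    /* flw a,(ap,dstp) */
        float x = *(const float *)(xo + d);    /* flw x,(xp,dstp) */
        float y = *(const float *)(yo + d);    /* flw y,(yp,dstp) */
        d += 4;                                /* fsw dst,4(dstp)! */
        *(float *)d = a * x + y;
    } while (d != limit);                      /* bne dstp,limit,1b */
}
```

Both functions should write the same results for the same inputs. The second
one trades the in-loop index bump for five extra setup instructions.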
The scaled-index version also makes it possible for the dst, a, x and y arrays
to all have completely different element sizes.
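As a hypothetical illustration of the element-size point (the function name
axpy_mixed and the choice of types are mine, picked arbitrarily), a loop like
the following shares one element index across arrays of 1-, 2-, 4- and 8-byte
elements. With size-scaled loads each access would shift i by its own element
size, while the store's writeback steps dst by its own stride. A single shared
byte offset could not do this, because the arrays advance at different rates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mixed-width variant: dst is double, a is float,
   x is int16_t, y is int8_t. One element index i serves every load. */
void axpy_mixed(double *dstp, const float *ap, const int16_t *xp,
                const int8_t *yp, int count)
{
    assert(count >= 0);                  /* assumption of this sketch */
    for (int i = 0; i < count; ++i) {
        /* Conceptually the loads read ap + (i << 2), xp + (i << 1) and
           yp + (i << 0); the store steps dstp by 8 bytes per iteration. */
        dstp[i] = (double)ap[i] * xp[i] + yp[i];
    }
}
```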
PowerPC is an interesting case. It has both indexed loads and stores (though
not scaled), and (as Luke just discovered) EA writeback for both loads and
stores.
Given the C code:
void saxpy(float *dstp, float *ap, float *xp, float *yp, int count) {
    for (int i = 0; i < count; ++i) {
        float a = ap[i];
        float x = xp[i];
        float y = yp[i];
        float dst = a * x + y;
        dstp[i] = dst;
    }
}
With -O2 PowerPC gcc 7.5.0 produces:
00000000 <saxpy>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 2c 07 00 00 cmpwi r7,0
8: 40 81 00 50 ble 58 <saxpy+0x58>
c: 54 e9 10 3a rlwinm r9,r7,2,0,29
10: 38 84 ff fc addi r4,r4,-4
14: 39 29 ff fc addi r9,r9,-4
18: 38 a5 ff fc addi r5,r5,-4
1c: 55 29 f0 be rlwinm r9,r9,30,2,31
20: 38 c6 ff fc addi r6,r6,-4
24: 39 29 00 01 addi r9,r9,1
28: 38 63 ff fc addi r3,r3,-4
2c: 7d 29 03 a6 mtctr r9
30: 60 00 00 00 nop
34: 60 00 00 00 nop
38: 60 00 00 00 nop
3c: 60 00 00 00 nop
40: c4 04 00 04 lfsu f0,4(r4)
44: c5 65 00 04 lfsu f11,4(r5)
48: c5 86 00 04 lfsu f12,4(r6)
4c: ec 00 62 fa fmadds f0,f0,f11,f12
50: d4 03 00 04 stfsu f0,4(r3)
54: 42 00 ff ec bdnz 40 <saxpy+0x40>
58: 38 21 00 10 addi r1,r1,16
5c: 4e 80 00 20 blr
For speed, gcc prefers the EA writeback form, despite the huge loop setup --
and even NOP pads it for alignment.
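Roughly, in C terms, the -O2 loop above has the following shape. This is only
my reading of the disassembly: the function name saxpy_update_form is made up,
and the uintptr_t casts assume a flat address space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the -O2 loop: every pointer is biased back by one
   element, then each access pre-increments its own pointer (update form). */
void saxpy_update_form(float *dstp, const float *ap, const float *xp,
                       const float *yp, int count)
{
    assert(sizeof(float) == 4);          /* the strides below hardcode 4 */
    if (count <= 0) return;              /* cmpwi r7,0 ; ble */

    uintptr_t d = (uintptr_t)dstp - 4;   /* addi r3,r3,-4 */
    uintptr_t a = (uintptr_t)ap - 4;     /* addi r4,r4,-4 */
    uintptr_t x = (uintptr_t)xp - 4;     /* addi r5,r5,-4 */
    uintptr_t y = (uintptr_t)yp - 4;     /* addi r6,r6,-4 */
    uint32_t ctr = (uint32_t)count;      /* mtctr r9 (r9 works out to count) */

    do {
        a += 4; float fa = *(const float *)a;   /* lfsu f0,4(r4) */
        x += 4; float fx = *(const float *)x;   /* lfsu f11,4(r5) */
        y += 4; float fy = *(const float *)y;   /* lfsu f12,4(r6) */
        d += 4; *(float *)d = fa * fx + fy;     /* fmadds ; stfsu f0,4(r3) */
    } while (--ctr != 0);                       /* bdnz */
}
```

Four pointers are bumped per iteration, but each bump is free because it is
folded into the load or store that uses it.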
With -Os:
00000000 <saxpy>:
0: 94 21 ff f0 stwu r1,-16(r1)
4: 2f 87 00 00 cmpwi cr7,r7,0
8: 39 20 00 00 li r9,0
c: 39 47 00 01 addi r10,r7,1
10: 40 9c 00 08 bge cr7,18 <saxpy+0x18>
14: 39 40 00 01 li r10,1
18: 2c 0a 00 01 cmpwi r10,1
1c: 39 4a ff ff addi r10,r10,-1
20: 40 82 00 0c bne 2c <saxpy+0x2c>
24: 38 21 00 10 addi r1,r1,16
28: 4e 80 00 20 blr
2c: 7c 04 4c 2e lfsx f0,r4,r9
30: 7d 65 4c 2e lfsx f11,r5,r9
34: 7d 86 4c 2e lfsx f12,r6,r9
38: ec 00 62 fa fmadds f0,f0,f11,f12
3c: 7c 03 4d 2e stfsx f0,r3,r9
40: 39 29 00 04 addi r9,r9,4
44: 4b ff ff d4 b 18 <saxpy+0x18>
Size optimization prefers the indexed form. I don't know why speed optimization
doesn't as well. (I'm ignoring the speed-pessimising loop structure here.)
RISC-V gcc 9.2.0 fakes indexed instructions with an add followed by a load or
store, bumping a single counter rather than bumping multiple pointers. It
doesn't make any difference to the overall speed or code size.
0000000000000000 <saxpy>:
0: 02e05a63 blez a4,34 <.L1>
4: 00271313 slli t1,a4,0x2
8: 4781 li a5,0
000000000000000a <.L3>:
a: 00f68733 add a4,a3,a5
e: 00f588b3 add a7,a1,a5
12: 00f60833 add a6,a2,a5
16: 00072707 flw fa4,0(a4)
1a: 0008a787 flw fa5,0(a7)
1e: 00082687 flw fa3,0(a6)
22: 00f50733 add a4,a0,a5
26: 0791 addi a5,a5,4
28: 70d7f7c3 fmadd.s fa5,fa5,fa3,fa4
2c: 00f72027 fsw fa5,0(a4)
30: fcf31de3 bne t1,a5,a <.L3>
0000000000000034 <.L1>:
34: 8082 ret
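In C terms, the loop in this RISC-V listing has roughly the following shape.
Again this is a sketch of my reading of the disassembly, with a made-up
function name (saxpy_add_then_access) and uintptr_t casts that assume a flat
address space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C model of the RISC-V loop: one byte-offset counter, and an
   explicit add in front of every load and store to form the address. */
void saxpy_add_then_access(float *dstp, const float *ap, const float *xp,
                           const float *yp, int count)
{
    assert(sizeof(float) == 4);                    /* strides hardcode 4 */
    if (count <= 0) return;                        /* blez a4,.L1 */

    uintptr_t limit = (uintptr_t)count << 2;       /* slli t1,a4,0x2 */
    uintptr_t off = 0;                             /* li a5,0 */
    do {
        uintptr_t py = (uintptr_t)yp + off;        /* add a4,a3,a5 */
        uintptr_t pa = (uintptr_t)ap + off;        /* add a7,a1,a5 */
        uintptr_t px = (uintptr_t)xp + off;        /* add a6,a2,a5 */
        float y = *(const float *)py;              /* flw fa4,0(a4) */
        float a = *(const float *)pa;              /* flw fa5,0(a7) */
        float x = *(const float *)px;              /* flw fa3,0(a6) */
        uintptr_t pd = (uintptr_t)dstp + off;      /* add a4,a0,a5 */
        off += 4;                                  /* addi a5,a5,4 */
        *(float *)pd = a * x + y;                  /* fmadd.s ; fsw fa5,0(a4) */
    } while (off != limit);                        /* bne t1,a5,.L3 */
}
```

The four explicit adds are exactly what an indexed addressing mode would fold
into the memory instructions.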