That's the utterly conventional way to do it, as on everything from VAX to x86
to ARM, yes. It is more convenient for assembly language programmers and needs
less setup code before the loop.
Having loads scale the index by the size of the load (only) is absolutely an
option I've been considering. It's cheap to do and doesn't need any extra
encoding space.
The problem is that adding an index to a store is NOT under consideration,
because it adds an expensive new feature to the ISA: a store that needs three
source registers (data, base and index). OK, that's not so much a problem for
FP stores, where the data comes from the FP register file, but it definitely
is for integer ones.
The point is that adding an index for loads (but not stores) is very cheap. And
adding EA writeback for stores (but not loads) is very cheap. And the asymmetry
between those two has unexpected synergy -- as shown in my code example -- IFF
the index for the loads is *not* scaled.
Of course you can easily argue that ...
saxpy: ; float *dstp, *ap, *xp, *yp; int count
   beqz count,2f
   addi dstp,dstp,-4
   li i,0
1:
   flw a,(ap,i<<2)
   flw x,(xp,i<<2)
   flw y,(yp,i<<2)
   fmadd.s dst,a,x,y
   fsw dst,4(dstp)!
   addi i,i,1
   bne i,count,1b
2: ret
... is better than ...
saxpy: ; float *dstp, *ap, *xp, *yp; int count
   beqz count,2f
   addi dstp,dstp,-4
   sub ap,ap,dstp
   sub xp,xp,dstp
   sub yp,yp,dstp
   slli count,count,2
   add limit,dstp,count
1:
   flw a,(ap,dstp)
   flw x,(xp,dstp)
   flw y,(yp,dstp)
   fmadd.s dst,a,x,y
   fsw dst,4(dstp)!
   bne dstp,limit,1b
2: ret
... because of the smaller setup code, despite the one extra instruction in the
loop. It's certainly much better than the version with five bump instructions
in the loop.
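For anyone who finds the rebasing trick in the second listing easier to follow in C, here is a rough model of it. It's only a sketch: the function name is made up, it does the address arithmetic in uintptr_t (so it assumes a flat address space where integers convert back to usable pointers), and it tests count <= 0 where the assembly only tests for zero.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the unscaled-index loop; the name is illustrative.
   Assumption: uintptr_t arithmetic round-trips to valid pointers (flat
   address space). The sources are rebased against the destination pointer,
   so the destination pointer doubles as the (unscaled) load index. */
void saxpy_rebased(float *dstp, const float *ap, const float *xp,
                   const float *yp, int count)
{
    if (count <= 0) return;                          /* asm: beqz count,2f */

    uintptr_t d = (uintptr_t)dstp - sizeof(float);   /* addi dstp,dstp,-4 */
    uintptr_t a = (uintptr_t)ap - d;                 /* sub ap,ap,dstp    */
    uintptr_t x = (uintptr_t)xp - d;                 /* sub xp,xp,dstp    */
    uintptr_t y = (uintptr_t)yp - d;                 /* sub yp,yp,dstp    */
    uintptr_t limit = d + (uintptr_t)count * sizeof(float); /* slli; add  */

    do {
        float fa, fx, fy, r;
        memcpy(&fa, (const void *)(a + d), sizeof fa);  /* flw a,(ap,dstp) */
        memcpy(&fx, (const void *)(x + d), sizeof fx);  /* flw x,(xp,dstp) */
        memcpy(&fy, (const void *)(y + d), sizeof fy);  /* flw y,(yp,dstp) */
        r = fa * fx + fy;                               /* fmadd.s         */
        d += sizeof(float);                 /* fsw dst,4(dstp)! : the EA   */
        memcpy((void *)d, &r, sizeof r);    /* is written back, then used  */
    } while (d != limit);                   /* bne dstp,limit,1b           */
}
```

The only value that changes per iteration is d, which is exactly why the store writeback pays for all three load addresses.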
The scaled-index version also makes it possible for the dst, a, x and y arrays
to all have completely different element sizes.
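A small C sketch of why that works with a scaled index: one element counter serves every array, because each access scales it by its own size. The helper scaled_ea and the choice of element types are purely illustrative assumptions, not part of any proposal.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: effective address of a load whose index is scaled
   by the size of the access, i.e. base + i * size. */
static const void *scaled_ea(const void *base, size_t i, size_t size)
{
    return (const unsigned char *)base + i * size;
}

/* Illustrative mixed-size variant: float a, 16-bit x, double y and dst.
   A single index i is shared by all three loads. */
void saxpy_mixed(double *dstp, const float *ap, const int16_t *xp,
                 const double *yp, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        float a;
        int16_t x;
        double y;
        memcpy(&a, scaled_ea(ap, i, sizeof a), sizeof a);  /* i scaled by 4 */
        memcpy(&x, scaled_ea(xp, i, sizeof x), sizeof x);  /* i scaled by 2 */
        memcpy(&y, scaled_ea(yp, i, sizeof y), sizeof y);  /* i scaled by 8 */
        *dstp++ = (double)a * x + y;   /* store bumps by its own size */
    }
}
```

With an unscaled index shared between the loads and the store pointer, every array has to advance by the same number of bytes per iteration, so this variant isn't available.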
PowerPC is an interesting case. It has both indexed loads and stores (though
not scaled), and (as Luke just discovered) EA writeback for both loads and
stores.
Given the C code:
void saxpy(float *dstp, float *ap, float *xp, float *yp, int count) {
  for (int i = 0; i<count; ++i){
      float a = ap[i];
      float x = xp[i];
      float y = yp[i];
      float dst = a * x + y;
      dstp[i] = dst;
  }
}
With -O2, PowerPC gcc 7.5.0 produces:
00000000 <saxpy>:
   0:   94 21 ff f0     stwu    r1,-16(r1)
   4:   2c 07 00 00     cmpwi   r7,0
   8:   40 81 00 50     ble     58 <saxpy+0x58>
   c:   54 e9 10 3a     rlwinm  r9,r7,2,0,29
  10:   38 84 ff fc     addi    r4,r4,-4
  14:   39 29 ff fc     addi    r9,r9,-4
  18:   38 a5 ff fc     addi    r5,r5,-4
  1c:   55 29 f0 be     rlwinm  r9,r9,30,2,31
  20:   38 c6 ff fc     addi    r6,r6,-4
  24:   39 29 00 01     addi    r9,r9,1
  28:   38 63 ff fc     addi    r3,r3,-4
  2c:   7d 29 03 a6     mtctr   r9
  30:   60 00 00 00     nop
  34:   60 00 00 00     nop
  38:   60 00 00 00     nop
  3c:   60 00 00 00     nop
  40:   c4 04 00 04     lfsu    f0,4(r4)
  44:   c5 65 00 04     lfsu    f11,4(r5)
  48:   c5 86 00 04     lfsu    f12,4(r6)
  4c:   ec 00 62 fa     fmadds  f0,f0,f11,f12
  50:   d4 03 00 04     stfsu   f0,4(r3)
  54:   42 00 ff ec     bdnz    40 <saxpy+0x40>
  58:   38 21 00 10     addi    r1,r1,16
  5c:   4e 80 00 20     blr
For speed, gcc prefers the EA writeback form, despite the huge loop setup --
and it even pads with NOPs to align the loop.
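Part of that setup is just computing the trip count for CTR. As a sanity check on my reading of the two rlwinm instructions (a shift left by 2 and a logical shift right by 2), here is a small C model. The function names are mine, and mask32 only handles the non-wrapping case mb <= me, which is all this listing uses.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by sh bits. */
static uint32_t rotl32(uint32_t v, unsigned sh)
{
    sh &= 31u;
    return sh ? (v << sh) | (v >> (32u - sh)) : v;
}

/* Ones from bit mb to bit me inclusive, in IBM numbering (bit 0 = MSB).
   Assumes mb <= me; wrap-around masks are not modelled. */
static uint32_t mask32(unsigned mb, unsigned me)
{
    return (0xFFFFFFFFu >> mb) & (0xFFFFFFFFu << (31u - me));
}

/* rlwinm RA,RS,SH,MB,ME: rotate left, then AND with the mask. */
uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
    return rotl32(rs, sh) & mask32(mb, me);
}

/* Hypothetical model of the CTR setup in the -O2 listing (count > 0). */
uint32_t ppc_o2_trip_count(int32_t count)
{
    uint32_t r9 = rlwinm((uint32_t)count, 2, 0, 29);  /* count * 4       */
    r9 -= 4u;                                         /* addi r9,r9,-4   */
    r9 = rlwinm(r9, 30, 2, 31);                       /* logical >> 2    */
    r9 += 1u;                                         /* addi r9,r9,1    */
    return r9;                                        /* mtctr r9        */
}
```

For ordinary positive counts this just gives count back, so four of the setup instructions do no useful work beyond moving count into CTR.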
With -Os:
00000000 <saxpy>:
   0:   94 21 ff f0     stwu    r1,-16(r1)
   4:   2f 87 00 00     cmpwi   cr7,r7,0
   8:   39 20 00 00     li      r9,0
   c:   39 47 00 01     addi    r10,r7,1
  10:   40 9c 00 08     bge     cr7,18 <saxpy+0x18>
  14:   39 40 00 01     li      r10,1
  18:   2c 0a 00 01     cmpwi   r10,1
  1c:   39 4a ff ff     addi    r10,r10,-1
  20:   40 82 00 0c     bne     2c <saxpy+0x2c>
  24:   38 21 00 10     addi    r1,r1,16
  28:   4e 80 00 20     blr
  2c:   7c 04 4c 2e     lfsx    f0,r4,r9
  30:   7d 65 4c 2e     lfsx    f11,r5,r9
  34:   7d 86 4c 2e     lfsx    f12,r6,r9
  38:   ec 00 62 fa     fmadds  f0,f0,f11,f12
  3c:   7c 03 4d 2e     stfsx   f0,r3,r9
  40:   39 29 00 04     addi    r9,r9,4
  44:   4b ff ff d4     b       18 <saxpy+0x18>
Size optimization prefers the indexed form. I don't know why speed optimization
doesn't as well. (I'm ignoring the speed-pessimising loop structure here.)
RISC-V gcc 9.2.0 fakes indexed loads and stores with an add followed by the
load or store, bumping a single counter rather than bumping multiple pointers.
It doesn't make any difference to the overall speed or code size.
0000000000000000 <saxpy>:
   0:   02e05a63                blez    a4,34 <.L1>
   4:   00271313                slli    t1,a4,0x2
   8:   4781                    li      a5,0
000000000000000a <.L3>:
   a:   00f68733                add     a4,a3,a5
   e:   00f588b3                add     a7,a1,a5
  12:   00f60833                add     a6,a2,a5
  16:   00072707                flw     fa4,0(a4)
  1a:   0008a787                flw     fa5,0(a7)
  1e:   00082687                flw     fa3,0(a6)
  22:   00f50733                add     a4,a0,a5
  26:   0791                    addi    a5,a5,4
  28:   70d7f7c3                fmadd.s fa5,fa5,fa3,fa4
  2c:   00f72027                fsw     fa5,0(a4)
  30:   fcf31de3                bne     t1,a5,a <.L3>
0000000000000034 <.L1>:
  34:   8082                    ret
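Read back into C, the RISC-V loop above looks roughly like the following: one byte offset shared by all four arrays, with each address formed by an explicit add. This is my reconstruction from the listing (a0..a4 being dstp, ap, xp, yp, count); the function name is invented.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical C rendering of the RISC-V gcc output: a single byte
   offset is bumped, and every access adds it to its own base pointer. */
void saxpy_byte_offset(float *dstp, const float *ap, const float *xp,
                       const float *yp, int count)
{
    if (count <= 0) return;                        /* blez a4,.L1    */
    size_t limit = (size_t)count * sizeof(float);  /* slli t1,a4,2   */
    size_t off = 0;                                /* li a5,0        */

    do {
        float a, x, y, r;
        /* add a4,a3,a5 ; flw fa4,0(a4) */
        memcpy(&y, (const unsigned char *)yp + off, sizeof y);
        /* add a7,a1,a5 ; flw fa5,0(a7) */
        memcpy(&a, (const unsigned char *)ap + off, sizeof a);
        /* add a6,a2,a5 ; flw fa3,0(a6) */
        memcpy(&x, (const unsigned char *)xp + off, sizeof x);
        r = a * x + y;                             /* fmadd.s        */
        /* add a4,a0,a5 ; fsw fa5,0(a4) */
        memcpy((unsigned char *)dstp + off, &r, sizeof r);
        off += sizeof(float);                      /* addi a5,a5,4   */
    } while (off != limit);                        /* bne t1,a5,.L3  */
}
```

Each of the four explicit adds is what an indexed addressing mode would fold into the load or store.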