On Monday, May 8, 2023 at 3:36:24 AM UTC+1,
robf...@gmail.com wrote:
> loop1:
> LOADG g16,[r1+r2*]
> STOREG g16,[r3+r2++*]
> BLTU r2,1000,.loop1
>
> I must look at adding string instructions back into the instruction set.
yeah, can i suggest you really don't do that? what happens if you want
to support UCS-2 (strncpyW)? then UCS-4? more than that: the
concepts needed to efficiently support strings have to be
added anyway, so why not make them first-class concepts
at the ISA level?
(i am assuming a Horizontal-First Vector ISA here: this does
not apply to Mitch's 66000 which is Vertical-First)
first thing: Fault-First is needed (the paper calls these
"first-fault" loads). explained here:
https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf
this basically is a contractual declaration, "i want you to
load *UP TO* a set maximum number of elements, and
to TELL me how many were actually loaded"
second: extend that same concept onto data: "i want you
to perform some operation *UP TO* a set maximum
number of elements, but if as part of that *ELEMENT*
there is a test that fails, STOP and tell me where you
stopped".
the first concept allows you to safely issue LOADs
knowing full well that no page-fault or other exception
will occur on any element after the first, because the hardware
is ORDERED to truncate the Vector Length rather than raise them.
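
a minimal python sketch of that first contract, assuming a toy
memory model. the names (fault_first_load, mapped_pages) and the
4096-byte page size are made up for illustration and are not taken
from any ISA spec:

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


def fault_first_load(memory, mapped_pages, addr, max_vl):
    """Load up to max_vl bytes starting at addr; return (data, vl).

    Element 0 faults normally. Any later element that would fault
    truncates the vector length instead of raising.
    """
    data = []
    for i in range(max_vl):
        ea = addr + i
        if ea // PAGE_SIZE not in mapped_pages:
            if i == 0:
                raise MemoryError("fault on element 0 at %#x" % ea)
            break  # truncate: report only what was actually loaded
        data.append(memory.get(ea, 0))
    return data, len(data)
```

asking for 8 bytes two bytes before the end of the only mapped
page gives back 2 elements and VL=2, with no exception raised.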
the second concept allows you to detect e.g. a null-chr
within a sequential block, but still expressed as a Vector
operation.
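
the second concept can be sketched the same way. this is only a
behavioural model: fail_first_cmp_zero and its inclusive flag are
hypothetical names, with inclusive=True standing in for the
"up to and including" truncation:

```python
def fail_first_cmp_zero(elements, vl, inclusive=True):
    """Return the truncated VL after testing elements[0:vl] against zero.

    Scanning stops at the first element equal to zero. With
    inclusive=True the zero element itself is counted in the new VL.
    """
    for i in range(vl):
        if elements[i] == 0:
            return i + 1 if inclusive else i
    return vl  # no element failed the test: VL is unchanged
```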
the combination of these two allows you to speculatively
load massive parallel blocks of sequential data, that are
then tested in parallel for zero, after which it is plain
sailing to perform the copy.
at all times the Vector Length remains within required
bounds, having been first truncated to take care of potential
exceptions and then having been truncated up to (and
including) the null-chr.
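
putting the two together, here is a rough python model of the two
C loops (copy, then zero-fill) done in batches. it is a sketch of
the behaviour only, not of exact instruction semantics.
strncpy_model is a made-up name, and the end of the src buffer
stands in for an unmapped page:

```python
MAXVL = 4  # assumed maximum vector length for this sketch


def strncpy_model(src, n, maxvl=MAXVL):
    """Model strncpy(dest, src, n) in batches of up to maxvl bytes."""
    dest = bytearray()
    ctr = n            # remaining byte count
    pos = 0            # current offset into src
    found_null = False
    # chr-copy loop
    while ctr > 0 and not found_null:
        vl = min(ctr, maxvl)              # VL = MIN(CTR, MAXVL)
        vl = min(vl, len(src) - pos)      # fault-first truncation
        if vl == 0:
            raise MemoryError("element 0 faulted")
        chunk = src[pos:pos + vl]
        if 0 in chunk:                    # data-dependent fail-first
            vl = chunk.index(0) + 1       # up to and including the null
            found_null = True
        dest += chunk[:vl]                # store exactly VL bytes
        pos += vl
        ctr -= vl
    # zeroing loop
    while ctr > 0:
        vl = min(ctr, maxvl)
        dest += bytes(vl)                 # store VL zeros
        ctr -= vl
    return bytes(dest)
```

strncpy_model(b"hello\0xx", 8) copies "hell", then "o" plus the
null (VL truncated from 4 to 2), then zero-fills the last 2 bytes.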
note at lines 52 and 55 of the listing below that the load and the
store are both "post-increment" (/pi).
this is a Vector Load where hardware is permitted to notice
that where the fundamental element operation is a *Scalar*
Load-with-Update, a repeated run of Updates can
be optimised out to only hit the register file with the very
last of those Updates.
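
a small python illustration of why that optimisation is safe,
assuming a fixed step per element. both function names are
hypothetical. the element addresses and the final register value
are identical, only the number of register-file writes differs:

```python
def post_increment_elementwise(base, step, vl):
    """Each element accesses the current address, then updates the register."""
    reg, eas, regfile_writes = base, [], 0
    for _ in range(vl):
        eas.append(reg)      # post-increment: access at the old address
        reg += step          # scalar Update, written back every element
        regfile_writes += 1
    return eas, reg, regfile_writes


def post_increment_coalesced(base, step, vl):
    """Same addresses, but only the last Update reaches the register file."""
    eas = [base + i * step for i in range(vl)]
    return eas, base + vl * step, (1 if vl else 0)
```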
of course all of this is completely irrelevant for a Vertical-First
ISA (or an ISA with Vertical-First Vectorisation Mode),
because everything looks to a Vertical-First ISA (such as
Mitch's 66000) like Scalar Looping.
with Horizontal-First, on the other hand, you know that a
large batch of Element-operations is going to hit the
back-end, and consequently may micro-code a much more
efficient suite of operations that takes up far fewer resources
than if the individual element operations were naively
thrown into Execute. (a good example is the big-integer
3-in 2-out multiply instruction we are proposing to Power ISA,
which uses one of the Read-regs and one of the Write-regs as
a 64-bit carry. when chained: 1st operation 3-in 1-out, middle-ops
2-in 1-out, last-op 2-in 2-out).
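
to make the carry chain concrete, here is a python sketch of a
3-in 2-out multiply-add used for big-integer times 64-bit scalar.
mul_add_carry and bigint_mul_scalar are illustrative names, and
this models only the arithmetic, not the proposed instruction's
encoding or register usage:

```python
MASK64 = (1 << 64) - 1


def mul_add_carry(a, b, carry_in):
    """3-in 2-out: return (low 64 bits, high 64 bits) of a*b + carry_in."""
    total = a * b + carry_in   # always fits in 128 bits for 64-bit inputs
    return total & MASK64, total >> 64


def bigint_mul_scalar(limbs, scalar):
    """Multiply a little-endian list of 64-bit limbs by a 64-bit scalar."""
    out, carry = [], 0
    for limb in limbs:
        # the carry out of one element is the carry in of the next
        lo, carry = mul_add_carry(limb, scalar, carry)
        out.append(lo)
    out.append(carry)          # only the final carry needs to be kept
    return out
```

inside the loop the carry only ever passes from one element to the
next, which is why a chained implementation could keep it out of
the register file for everything except the final write.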
https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py;hb=HEAD#l36
44 "mtspr 9, 3", # move r3 to CTR
45 "addi 0,0,0", # initialise r0 to zero
46 # chr-copy loop starts here:
47 #   for (i = 0; i < n && src[i] != '\0'; i++)
48 #       dest[i] = src[i];
49 # VL (and r1) = MIN(CTR,MAXVL=4)
50 "setvl 1,0,%d,0,1,1" % maxvl,
51 # load VL bytes (update r10 addr)
52 "sv.lbzu/pi *16, 1(10)", # should be /lf here as well
53 "sv.cmpi/ff=eq/vli *0,1,*16,0", # cmp against zero, truncate VL
54 # store VL bytes (update r12 addr)
55 "sv.stbu/pi *16, 1(12)",
56 "sv.bc/all 0, *2, -0x1c", # test CTR, stop if cmpi failed
57 # zeroing loop starts here:
58 #   for ( ; i < n; i++)
59 #       dest[i] = '\0';
60 # VL (and r1) = MIN(CTR,MAXVL=4)
61 "setvl 1,0,%d,0,1,1" % maxvl,
62 # store VL zeros (update r12 addr)
63 "sv.stbu/pi 0, 1(12)",
64 "sv.bc 16, *0, -0xc", # dec CTR by VL, stop at zero