
Load/Store with auto-increment


Marcus

May 6, 2023, 10:59:03 AM
Load/store with auto-increment/decrement can reduce the number of
instructions in many loops (especially those that mostly iterate over
arrays of data). It can also be used in function prologues and epilogues
(for push/pop functionality).
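
A minimal C sketch of the kind of loop this targets (a plain byte copy;
the pointer bump on each access is what a load/store with auto-increment
would fold into the memory instruction):

  void copy_bytes(char *dst, const char *src, int n)
  {
      while (n--)
          *dst++ = *src++;  /* two accesses and two address updates
                               per iteration for the mode to absorb */
  }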

For a long time I had dismissed load/store with auto-increment for my
ISA (MRISC32). The reason is that a load operation with auto-increment
would have TWO results (the loaded value and the updated address base),
which would be a complication (all other instructions have at most one
result).

However, a couple of days ago I realized that store operations do not
have any result, so I could add instructions for store with auto-
increment, and still only have one result. I have a pretty good idea
of how to do it (instruction encoding etc), and it would fit fairly
well (the only oddity would be that the result register is not the
first register address in the instruction word, but the second register
address, which requires some more MUX:ing in the decoding stages).

The next question is: What flavors should I have?

- Post-increment (most common?)
- Post-decrement
- Pre-increment
- Pre-decrement (second most common?)

The "pre" variants would possibly add more logic to critical paths (e.g.
add more gate delay in the AGU before the address is ready for the
memory stage).
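
In C terms, the four flavors correspond to (a rough sketch of the
intended semantics, not actual MRISC32 syntax):

  x = *p++;  /* post-increment: access at p, then p += size */
  x = *p--;  /* post-decrement: access at p, then p -= size */
  x = *++p;  /* pre-increment:  p += size, then access at p */
  *--p = x;  /* pre-decrement:  p -= size, then access at p (a push) */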

Any thoughts? Is it worth it?

/Marcus

John Levine

May 6, 2023, 11:38:20 AM
According to Marcus <m.de...@this.bitsnbites.eu>:
>Load/store with auto-increment/decrement can reduce the number of
>instructions in many loops (especially those that mostly iterate over
>arrays of data). It can also be used in function prologues and epilogues
>(for push/pop functionality). ...

>Any thoughts? Is it worth it?

Autoincrement was quite popular in the 1960s and 70s. The DEC 12 and
18 bit minis and the DG Nova had a version of it where specific
addresses would autoincrement or decrement when used as indirect
addresses. I did a fair amount of PDP-8 programming and those
autoincrement locations were precious, which said as much about the
limits of the 8's instruction set as anything else.

The PDP-11 generalized this to useful modes -(R) and (R)+ to
predecrement or postincrement any register when used as an address,
which is how it handled stacks and the simple cases of stepping
through a string or array.

It also had indirect versions of both, @(R)+ which was useful for
stepping through an array of pointers (one instruction dispatch for
threaded code or coroutines) and @-(R) which turned out to be useless
and was dropped in the VAX.

Here it is 50 years later and they're all gone. I think the increase
in code density wasn't worth the contortions to ensure that your data
structures fit the few cases that the autoincrement modes handled. It
also made it harder to parallelize and pipeline stuff since address
modes had side effects that had to be scheduled around or potentially
unwound in a page fault.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Marcus

May 6, 2023, 1:36:30 PM
On 2023-05-06, John Levine wrote:
> Here it is 50 years later and they're all gone. I think the increase
> in code density wasn't worth the contortions to ensure that your data
> structures fit the few cases that the autoincrement modes handled. It
> also made it harder to parallelize and pipeline stuff since address
> modes had side effects that had to be scheduled around or potentially
> unwound in a page fault.

Actually, ARM has auto-increment (even AArch64). I think that if you
limit what you can do (not the crazy multi-memory-access instructions
that were popular in CISC, e.g. 68k), you should not have any problems
with page fault handling etc. Unless...

Does the auto-increment instruction implicitly introduce a data-
dependency that's also dependent on the memory operation to complete?
Is there any real difference compared to doing the memory operation
and the address increment in two separate instructions (in an OoO
machine)?

/Marcus

MitchAlsup

May 6, 2023, 2:15:41 PM
On Saturday, May 6, 2023 at 9:59:03 AM UTC-5, Marcus wrote:
> Load/store with auto-increment/decrement can reduce the number of
> instructions in many loops (especially those that mostly iterate over
> arrays of data). It can also be used in function prologues and epilogues
> (for push/pop functionality).
<
Can it actually save instructions ??
<
p = <some address>;
q = <some other address>;
for( i = 0; i < max; i++ )
    *p++ = *q++;
<
LDA Rp,[IP,,displacement1]
LDA Rq,[IP,,displacement2]
MOV Ri,#0
VEC Rt,{}
top_of_loop:
LDSW Rqm,[Rq+Ri<<2]
STW Rqm,[Rp+Ri<<2]
LOOP LE,Ri,#1,Rmax
end_of_loop:
>
Which instruction can be saved in this loop??
<
> For a long time I had dismissed load/store with auto-increment for my
> ISA (MRISC32). The reason is that a load operation with auto-increment
> would have TWO results (the loaded value and the updated address base),
<
That is the first problem.
<
> which would be a complication (all other instructions have at most one
> result).
>
> However, a couple of days ago I realized that store operations do not
> have any result, so I could add instructions for store with auto-
> increment, and still only have one result. I have a pretty good idea
> of how to do it (instruction encoding etc), and it would fit fairly
> well (the only oddity would be that the result register is not the
> first register address in the instruction word, but the second register
> address, which requires some more MUX:ing in the decoding stages).
<
So, autoincrement on STs only ??
>
> The next question is: What flavors should I have?
>
> - Post-increment (most common?)
> - Post-decrement
> - Pre-increment
> - Pre-decrement (second most common?)
<
Not having these eliminates having to choose.
>
> The "pre" variants would possibly add more logic to critical paths (e.g.
> add more gate delay in the AGU before the address is ready for the
> memory stage).
>
> Any thoughts? Is it worth it?
<
In my opinion, needing autoincrements is a sign of a weak ISA and
possibly that of a less than stellar compiler.
>
> /Marcus

MitchAlsup

May 6, 2023, 2:17:22 PM
On Saturday, May 6, 2023 at 12:36:30 PM UTC-5, Marcus wrote:
> On 2023-05-06, John Levine wrote:
> > Here it is 50 years later and they're all gone. I think the increase
> > in code density wasn't worth the contortions to ensure that your data
> > structures fit the few cases that the autoincrement modes handled. It
> > also made it harder to parallelize and pipeline stuff since address
> > modes had side effects that had to be scheduled around or potentially
> > unwound in a page fault.
> Actually, ARM has auto-increment (even AArch64). I think that if you
> limit what you can do (not the crazy multi-memory-access instructions
> that were popular in CISC, e.g. 68k), you should not have any problems
> with page fault handling etc. Unless...
>
> Does the auto-increment instruction implicitly introduce a data-
> dependency that's also dependent on the memory operation to complete?
<
Not necessarily, but it does create a base-register to base-register
dependency on uses of the addressing register. So, memory is not
compromised, but use of the register can be.

Thomas Koenig

May 6, 2023, 5:04:37 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> Load/store with auto-increment/decrement can reduce the number of
> instructions in many loops (especially those that mostly iterate over
> arrays of data). It can also be used in function prologues and epilogues
> (for push/pop functionality).

One step further: You can have something like POWER's load and
store with update. For example,

ldux rt,ra,rb

will load a doubleword from the address ra + rb and set ra to
ra + rb, or

ldu rt,num(ra)

will load rt from num + ra and set ra = ra + num.

You can simulate autoincrement/autodecrement if you write

ldu rt,8(ra)

or

ldu rt,-8(ra)

respectively.

> For a long time I had dismissed load/store with auto-increment for my
> ISA (MRISC32). The reason is that a load operation with auto-increment
> would have TWO results (the loaded value and the updated address base),
> which would be a complication (all other instructions have at most one
> result).

Exactly.

> However, a couple of days ago I realized that store operations do not
> have any result, so I could add instructions for store with auto-
> increment, and still only have one result.

That would create a rather weird asymmetry between load and store.
It could also create problems for the compiler - I'm not sure that
gcc is set up to easily handle different addressing modes for load
and store.

> I have a pretty good idea
> of how to do it (instruction encoding etc), and it would fit fairly
> well (the only oddity would be that the result register is not the
> first register address in the instruction word, but the second register
> address, which requires some more MUX:ing in the decoding stages).
>
> The next question is: What flavors should I have?
>
> - Post-increment (most common?)
> - Post-decrement
> - Pre-increment
> - Pre-decrement (second most common?)

If you want to save instructions in a loop and have a "compare to zero"
instruction (which I seem to remember you do), then a negative index
could be something else to try.

Consider transforming

for (int i=0; i<n; i++)
    a[i] = b[i] + 2;

into

int *ap = a + n;
int *bp = b + n;
for (int i=-n; i != 0; i++)
    ap[i] = bp[i] + 2;

and expressing the body of the loop as

start:
    ldd  r1,rb,-ri   ; ri holds -i, counting down from n
    addi r1,r1,2
    std  r1,ra,-ri
    sub  ri,ri,1
    bne0 ri,start    ; loop while i != 0

Hmm... is there any ISA which allows for both negative and positive
indexing?

> The "pre" variants would possibly add more logic to critical paths (e.g.
> add more gate delay in the AGU before the address is ready for the
> memory stage).
>
> Any thoughts? Is it worth it?

Not sure it is - this kind of instruction will be split into two
micro-instructions on any OoO machine, and probably for in-order,
as well.

MitchAlsup

May 6, 2023, 10:00:37 PM
Consider a string of *p++
a = *p++;
b = *p++;
c = *p++;
<
Here we see the failure of the ++ or -- notation.
The LD of b is dependent on the ++ of a
The LD of c is dependent on the ++ of b
Whereas if the above was written::
<
a = p[0];
b = p[1];
c = p[2];
p +=3;
<
Now all three LDs are independent and can issue/execute/retire
simultaneously. Also, the add to p is independent, so we took
3 "instructions" that were serially dependent and make them into
4 instructions that are completely independent in all phases of
execution.

BGB

May 6, 2023, 10:58:39 PM
I skipped auto-increment as it typically saves "hardly anything" (at
best) and adds an awkward case that needs to be decomposed into two
sub-operations (in most other cases).

So, I didn't really feel it was "worth it".

It could almost make sense on a 1-wide machine, except that one needs to
add one of the main expensive parts of a 2-wide machine in order to
support it (and on a superscalar machine, the increment would likely end
up running in parallel with some other op anyways).

...


For register save/restore, maybe it makes sense:
But, one can use normal displacement loads/stores and a single big
adjustment instead;
Things like "*ptr++" could use it, but are still not common enough to
make it significant (combined with the thing of the "ptr++" part usually
just running in parallel with another op anyways).


>>
>> /Marcus

robf...@gmail.com

May 7, 2023, 3:36:19 AM
Auto inc/dec can be difficult for the compiler to make use of. Sometimes
the p++ will end up as a separate add anyway. If there is scaled indexed
addressing often loop increment vars can be used, and the loop
increment is needed anyway.
p[n] = q[n];
n++;
I used extra bits available in load / store instruction to indicate the
cache-ability of data. Requires compiler support though.

Having a push instruction can be handy, and good for code density if it
can push multiple registers in a single instruction.

I have multi-register loads and stores in groups of eight registers for
Thor. Based on filling up the entire cache line with register data then
issuing a single load or store operation.


Anton Ertl

May 7, 2023, 8:43:31 AM
Marcus <m.de...@this.bitsnbites.eu> writes:
>Load/store with auto-increment/decrement can reduce the number of
>instructions in many loops (especially those that mostly iterate over
>arrays of data).

Yes.

If you do it only for stores, as suggested below, it could be used for
loops that read from one or more arrays and write to one array, all
with the same stride, as follows (in pseudo-C-code):

/* read from a and b, write to c */
da = a-c;
db = b-c;
for (...) {
    *c = c[da] * c[db];
    c += stride;
}

the "c+=stride" could become the autoincrement of the store.

>It can also be used in function prologues and epilogues
>(for push/pop functionality).

Not so great, because it introduces data dependencies between the
stores that you then have to get rid of if you want to support more
than one store per cycle. As for the pops, those are loads, and here
the autoincrement would require an additional write port to the
register file, as you point out below; plus it would introduce data
dependencies that you don't want (many cores support more than one
load per cycle).

>The next question is: What flavors should I have?
>
>- Post-increment (most common?)
>- Post-decrement
>- Pre-increment
>- Pre-decrement (second most common?)
>
>The "pre" variants would possibly add more logic to critical paths (e.g.
>add more gate delay in the AGU before the address is ready for the
>memory stage).

You typically have memory-access instructions that include an addition
in the address computation; in that case pre obviously has no extra
cost. The cost of the addition can be reduced (eliminated) with a
technique called sum-addressed memory. OTOH, IA-64 supports only
memory accesses of an address given in a register, so here the
architects apparently thought that sum-addressed memory is still too
slow.

Increment vs. decrement: If your store supports reading two registers
for address computation (in addition to the data register), you can
put the stride in a register, making the whole question moot. Even if
you only support reading one register in addition to the data, you can
have a sign-extended constant stride, again giving you both increment
and decrement options. Note that having a store that does not support
the sum of two registers, but does support autoincrement, and a load
that supports the sum of two registers as address means that both
loads and stores can read two registers and write one register, which
may be useful for certain microarchitectural approaches.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Anton Ertl

May 7, 2023, 12:00:09 PM
John Levine <jo...@taugh.com> writes:
>Here it is 50 years later and they're all gone.

PowerPC and ARM A32 are still there. And there's even a new
architecture with auto-increment: ARM A64.

>I think the increase
>in code density wasn't worth the contortions to ensure that your data
>structures fit the few cases that the autoincrement modes handled.

Are you thinking of the DSPs that do not have displacement addressing,
but have auto-increment, leading to a number of papers on how the
compiler should arrange the variables to make best use of that?

With displacement addressing no such contortions are necessary.

>It also made it harder to parallelize and pipeline stuff since address
>modes had side effects that had to be scheduled around or potentially
>unwound in a page fault.

Pipelining was apparently no problem, as evidenced by several early
RISCs (ARM (A32), HPPA, PowerPC) having auto-increment. Just don't
write the address register before verifying the address. And
parallelizing is no problem, either: IA-64 was designed for Explicitly
Parallel Instruction Computing, and has auto-increment.

Scott Lurndal

May 7, 2023, 12:05:06 PM
Can the compiler not recognize the first pattern and convert
it into the second form under the as-if rule?

John Levine

May 7, 2023, 12:56:03 PM
It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>PowerPC and ARM A32 are still there. And there's even a new
>architecture with auto-increment: ARM A64.

I need to take a look.

>>I think the increase
>>in code density wasn't worth the contortions to ensure that your data
>>structures fit the few cases that the autoincrement modes handled.
>
>Are you thinking of the DSPs that do not have displacement addressing,
>but have auto-increment, leading to a number of papers on how the
>compiler should arrange the variables to make best use of that?

Autoincrement only increments by the size of a single datum so it
works for strings and vectors, not for arrays of structures or 2-D
arrays. Compare it to the 360's BXLE loop closing instruction which
put the stride in a register so it could be whatever you wanted.
It also had base+index, which the VAX did too, but the PDP-11 only
sort of did, if you used absolute addresses instead of a base.

On the PDP-11 autoincrement allowed a two instruction string copy loop:

c: movb (r1)+,(r2)+
bnz c ; loop if the byte wasn't zero

but how useful is that now? I don't know.

>With displacement addressing no such contortions are necessary.

I don't see how that solves the stride problem. Or did you mean
something else?

>>It also made it harder to parallelize and pipeline stuff since address
>>modes had side effects that had to be scheduled around or potentially
>>unwound in a page fault.
>
>Pipelining was apparently no problem, as evidenced by several early
>RISCs (ARM (A32), HPPA, PowerPC) having auto-increment. Just don't
>write the address register before verifying the address. ...

Do they have the kind of hazards that the -11 and Vax did, where you could
autoincrement the same register more than once in a single instruction, or
use the incremented register as an operand? That made things messy.

David Brown

May 7, 2023, 1:47:24 PM
Yes, and compilers have done such conversions for decades. (Of course,
that assumes you are not dealing with external data, or expressions that
could alias each other.)

David Brown

May 7, 2023, 1:49:24 PM
On 07/05/2023 18:55, John Levine wrote:
> It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>> PowerPC and ARM A32 are still there. And there's even a new
>> architecture with auto-increment: ARM A64.
>
> I need to take a look.
>
>>> I think the increase
>>> in code density wasn't worth the contortions to ensure that your data
>>> structures fit the few cases that the autoincrement modes handled.
>>
>> Are you thinking of the DSPs that do not have displacement addressing,
>> but have auto-increment, leading to a number of papers on how the
>> compiler should arrange the variables to make best use of that?
>
> Autoincrement only increments by the size of a single datum so it
> works for strings and vectors, not for arrays of structures or 2-D
> arrays. Compare it to the 360's BXLE loop closing instruction which
> put the stride in a register so it could be whatever you wanted.
> It also had base+index, which the VAX did too, but the PDP-11 only
> sort of did, if you used absolute addresses instead of a base.
>
> On the PDP-11 autoincrement allowed a two instruction string copy loop:
>
> c: movb (r1)+,(r2)+
> bnz c ; loop if the byte wasn't zero
>
> but how useful is that now? I don't know.
>

Similar instructions would be used for copying memory blocks, and that
is very useful!


Thomas Koenig

May 7, 2023, 1:58:59 PM
Scott Lurndal <sc...@slp53.sl.home> schrieb:
Of course:

void bar (int a, int b, int c);

void foo (int *p)
{
    int a, b, c;
    a = *p++;
    b = *p++;
    c = *p++;
    bar (a, b, c);
}

results in

    lw a2,8(a0)
    lw a1,4(a0)
    lw a0,0(a0)
    tail bar

on RISC-V, for example (aarch64 plays games with load double,
so it's a bit harder to read).

But I believe Mitch was referring to the assembler equivalent, where
p would be held in a register.

Autodecrement and autoincrement are done on the 386ff. How do they
avoid the register dependency on the stack register? Special handling?
Instruction fusing?

John Levine

May 7, 2023, 2:12:10 PM
It appears that David Brown <david...@hesbynett.no> said:
>> On the PDP-11 autoincrement allowed a two instruction string copy loop:
>>
>> c: movb (r1)+,(r2)+
>> bnz c ; loop if the byte wasn't zero
>>
>> but how useful is that now? I don't know.
>
>Similar instructions would be used for copying memory blocks, and that
>is very useful!

Not really. On modern computers you want to copy in ways that make
best use of the multiple registers so you're more likely to do a
sequence of loads followed by a sequence of stores, maybe with shift
and mask in between if they're not aligned, then move on to the next
block. You could use autoincrement but you'll probably get better
performance with instructions that clearly don't depend on each other
so they can run in parallel, e.g.

; r8 is source, r9 is dest
loop:
ld r1,0[r8]
ld r2,8[r8]
ld r3,16[r8]
ld r4,24[r8]
; shift and mask to align if needed
st r1,0[r9]
st r2,8[r9]
st r3,16[r9]
st r4,24[r9]

addi r8,#32
addi r9,#32
branch if not done to loop

MitchAlsup

May 7, 2023, 2:29:26 PM
A) the compiler is so allowed
B) once the compiler is doing this, wanting auto{inc,dec} in your
ISA evaporates.

MitchAlsup

May 7, 2023, 2:31:52 PM
Except you are moving blocks 1 byte at a time--which was fine for PDP-11
days and for the era when 16 bits of addressing "was sufficient".

MitchAlsup

May 7, 2023, 2:33:39 PM
They did not--they just "ate" the latency and register conflicts.
But in general, the Great-Big execution window made all those
"go away".

MitchAlsup

May 7, 2023, 2:36:25 PM
On Sunday, May 7, 2023 at 1:12:10 PM UTC-5, John Levine wrote:
> It appears that David Brown <david...@hesbynett.no> said:
> >> On the PDP-11 autoincrement allowed a two instruction string copy loop:
> >>
> >> c: movb (r1)+,(r2)+
> >> bnz c ; loop if the byte wasn't zero
> >>
> >> but how useful is that now? I don't know.
> >
> >Similar instructions would be used for copying memory blocks, and that
> >is very useful!
> Not really. On modern computers you want to copy in ways that make
> best use of the multiple registers so you're more likely to do a
> sequence of loads followed by a sequence of stores, maybe with shift
> and mask in between if they're not aligned, then move on to the next
> block. You could use autoincrement but you'll probably get better
> performance with instructions that clearly don't depend on each other
> so they can run in parallel, e.g.
<
Or you can (put into ISA and) use MM (memory to memory move)
<
MM Rcount,Rfrom,Rto
<
And rest assured that HW will simply do the optimal thing for that
implementation {up to 1 cache line per cycle.}

Stephen Fuld

May 7, 2023, 3:12:13 PM
On 5/7/2023 9:55 AM, John Levine wrote:

snip

> Autoincrement only increments by the size of a single datum so it
> works for strings and vectors, not for arrays of structures or 2-D
> arrays. Compare it to the 360's BXLE loop closing instruction which
> put the stride in a register so it could be whatever you wanted.

Or the 1108 which allowed you to specify, with an instruction bit, that
the high order half of an index register is added to the low order half
(which is all that was used for address calculation) after the memory
address is computed.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Thomas Koenig

May 7, 2023, 4:49:20 PM
Thomas Koenig <tko...@netcologne.de> schrieb:

> Autodecrement and autoincrement are done on the 386ff. How do they
> avoid the register dependency on the stack register? Special handling?
> Instruction fusing?

Seems like they have a dedicated stack engine for the
purpose. Agner Fog (who else) has a nice explanation at
https://agner.org/optimize/microarchitecture.pdf . Basically,
there is an extra stage in the pipeline for handling stack pointers
and for inserting stack synchronization micro-ops.

That is one level of complexity that address + offset addressing
relative to the stack pointer solves nicely.

BGB

May 7, 2023, 5:27:52 PM
On 5/7/2023 7:07 AM, Anton Ertl wrote:
> Marcus <m.de...@this.bitsnbites.eu> writes:
>> Load/store with auto-increment/decrement can reduce the number of
>> instructions in many loops (especially those that mostly iterate over
>> arrays of data).
>
> Yes.
>
> If you do it only for stores, as suggested below, it could be used for
> loops that read from one or more arrays and write to one array, all
> with the same stride, as follows (in pseudo-C-code):
>
> /* read from a and b, write to c */
> da=a-c;
> db=b-c;
> for (...) {
> *c = c[da] * c[db];
> c+=stride;
> }
>
> the "c+=stride" could become the autoincrement of the store.
>

Not all instructions are created equal.

Fewer instructions may not be a win if these instructions would result
in a higher latency.


>> It can also be used in function prologues and epilogues
>> (for push/pop functionality).
>
> Not so great, because it introduces data dependencies between the
> stores that you then have to get rid of if you want to support more
> than one store per cycle. As for the pops, those are loads, and here
> the autoincrement would require an additional write port to the
> register file, as you point out below; plus it would introduce data
> dependencies that you don't want (many cores support more than one
> load per cycle).
>

But, it is kinda moot as, say:
MOV.Q R13, @-SP
MOV.Q R12, @-SP
MOV.Q R11, @-SP
MOV.Q R10, @-SP
MOV.Q R9, @-SP
MOV.Q R8, @-SP

Only saves 1 instruction vs, say:
ADD -48, SP
MOV.Q R13, (SP, 40)
MOV.Q R12, (SP, 32)
MOV.Q R11, (SP, 24)
MOV.Q R10, (SP, 16)
MOV.Q R9, (SP, 8)
MOV.Q R8, (SP, 0)

Depending on how it is implemented, the dependency issues on the shared
register could actually make the use of auto-increment slower than the
use of fixed displacement loads/stores (and, if one needs to wait the
whole latency of a load or store for the increment's write-back to
finish, using auto-increment in this way is likely "dead on arrival").


I can also note that an earlier form of BJX2 had PUSH/POP instructions,
but these were removed. Noting the above, it is probably not all that
hard to guess why...
Nothing to add here.

> - anton

MitchAlsup

May 7, 2023, 5:47:45 PM
On Sunday, May 7, 2023 at 4:27:52 PM UTC-5, BGB wrote:
> On 5/7/2023 7:07 AM, Anton Ertl wrote:
> > Marcus <m.de...@this.bitsnbites.eu> writes:
> >> Load/store with auto-increment/decrement can reduce the number of
> >> instructions in many loops (especially those that mostly iterate over
> >> arrays of data).
> >
> > Yes.
> >
> > If you do it only for stores, as suggested below, it could be used for
> > loops that read from one or more arrays and write to one array, all
> > with the same stride, as follows (in pseudo-C-code):
> >
> > /* read from a and b, write to c */
> > da=a-c;
> > db=b-c;
> > for (...) {
> > *c = c[da] * c[db];
> > c+=stride;
> > }
> >
> > the "c+=stride" could become the autoincrement of the store.
> >
> Not all instructions are created equal.
>
> Fewer instructions may not be a win if these instructions would result
> in a higher latency.
<
But eliminating sequential dependencies is almost always a win
because it directly addresses latency.
<
> >> It can also be used in function prologues and epilogues
> >> (for push/pop functionality).
> >
> > Not so great, because it introduces data dependencies between the
> > stores that you then have to get rid of if you want to support more
> > than one store per cycle. As for the pops, those are loads, and here
> > the autoincrement would require an additional write port to the
> > register file, as you point out below; plus it would introduce data
> > dependencies that you don't want (many cores support more than one
> > load per cycle).
> >
> But, is kinda moot as, say:
> MOV.Q R13, @-SP
> MOV.Q R12, @-SP
> MOV.Q R11, @-SP
> MOV.Q R10, @-SP
> MOV.Q R9, @-SP
> MOV.Q R8, @-SP
>
> Only saves 1 instruction vs, say:
> ADD -48, SP
> MOV.Q R13, (SP, 40)
> MOV.Q R12, (SP, 32)
> MOV.Q R11, (SP, 24)
> MOV.Q R10, (SP, 16)
> MOV.Q R9, (SP, 8)
> MOV.Q R8, (SP, 0)
<
If you actually wanted to save instructions you would::
<
MOV.Q R13:R8,@-SP
<
So the argument of saving 1 instruction becomes moot--you can save 5
instructions.

robf...@gmail.com

May 7, 2023, 10:36:24 PM
Got me thinking of how auto adjust addressing could be added to the Thor
core. There is a bit available in the scaled indexed addressing mode, so I
shoehorned in post-inc, pre-dec modes. This should work with group
register load and store too allowing auto increment for:

loop1:
LOADG g16,[r1+r2*]
STOREG g16,[r3+r2++*]
BLTU r2,1000,.loop1

I must look at adding string instructions back into the instruction set.
Previously there were copy, set, and compare string instructions. It
is tempting to add a REP instruction modifier to the ISA. It could be a
modified branch instruction because the displacement is not needed.

RLTU r55,1000,"RR"
LOADG g16,[r1+r2*]
STOREG g16,[r3+r2++*]

David Brown

May 8, 2023, 3:06:06 AM
Of course you would move the data in bigger sizes - as big as you can,
based on your (i.e., the compiler's) knowledge of alignments, sizes, etc.

Anton Ertl

May 8, 2023, 4:10:11 AM
Thomas Koenig <tko...@netcologne.de> writes:
>void bar (int a, int b, int c);
>
>void foo (int *p)
>{
> int a, b, c;
> a = *p++;
> b = *p++;
> c = *p++;
> bar (a, b, c);
>}
>
>results in
>
> lw a2,8(a0)
> lw a1,4(a0)
> lw a0,0(a0)
> tail bar
>
>on RISC-V, for example (aarch64 plays games with load double,
>so it's a bit harder to read).
>
>But I believe Mitch was referring to the assembler equivalent, where
>p would be held in a register.
>
>Autodecrement and autoincrement are done on the 386ff. How do they
>avoid the register dependency on the stack register? Special handling?

Yes. My understanding is that they do something similar in the
decoding hardware to what the compiler does for the code above (and of
course the hardware probably does not eliminate the update of the
stack pointer as dead code).

luke.l...@gmail.com

May 8, 2023, 11:09:38 AM
On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:

> loop1:
> LOADG g16,[r1+r2*]
> STOREG g16,[r3+r2++*]
> BLTU r2,1000,.loop1
>
> I must look at adding string instructions back into the instruction set.

yeah can i suggest really don't do that. what happens if you want
to support UCS-2 (strncpyW)? then UCS-4? more than that: the
concepts needed to efficiently support strings, well you have to
add them anyway so why not make them first-order concepts
at the ISA level?

(i am assuming a Horizontal-First Vector ISA here: this does
not apply to Mitch's 66000 which is Vertical-First)

first thing: Fault-First is needed. explained here:
https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf

this basically is a contractual declaration, "i want you to
load *UP TO* a set maximum number of elements, and
to TELL me how many were actually loaded"

second: extend that same concept onto data: "i want you
to perform some operation *UP TO* a set maximum
number of elements, but if as part of that *ELEMENT*
there is a test that fails, STOP and tell me where you
stopped".

the first concept allows you to safely issue LOADs
knowing full well that no page-fault or other exception
will occur, because the hardware is ORDERED to avoid
them.

the second concept allows you to detect e.g. a null-chr
within a sequential block, but still expressed as a Vector
operation.

the combination of these two allows you to speculatively
load massive parallel blocks of sequential data, that are
then tested in parallel for zero, after which it is plain
sailing to perform the copy.

at all times the Vector Length remains within required
bounds, having been first truncated to take care of potential
exceptions and then having been truncated up to (and
including) the null-chr.

note at lines 52 and 55 that they are both "post-increment".
this is a Vector Load where hardware is permitted to notice
that where the fundamental element operation is a *Scalar*
Load-with-Update, a repeated run of Updates can
be optimised out to only hit the register file with the very
last of those Updates.

of course all of this is completely irrelevant for a Vertical-First
ISA (or an ISA with Vertical-First Vectorisation Mode),
because everything looks to a Vertical-First ISA (such as
Mitch's 66000) like Scalar Looping.

Horizontal-First on the other hand you know that a
large batch of Element-operations are going to hit the
back-end and consequently may micro-code a much more
efficient suite of operations that take up far less resources
than if the individual element operations were naively
thrown into Execute. (a good example is the big-integer
3-in 2-out multiply instruction we are proposing to Power ISA,
which uses one of the Read-regs and one of the Write-regs as
a 64-bit carry. when chained: 1st operation: 3-in 1-out middle-ops
2-in 1-out last-op 2-in 2-out).

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py;hb=HEAD#l36

44 "mtspr 9, 3", # move r3 to CTR
45 "addi 0,0,0", # initialise r0 to zero
46 # chr-copy loop starts here:
47 # for (i = 0; i < n && src[i] != '\0'; i++)
48 # dest[i] = src[i];
49 # VL (and r1) = MIN(CTR,MAXVL=4)
50 "setvl 1,0,%d,0,1,1" % maxvl,
51 # load VL bytes (update r10 addr)
52 "sv.lbzu/pi *16, 1(10)", # should be /lf here as well
53 "sv.cmpi/ff=eq/vli *0,1,*16,0", # cmp against zero, truncate VL
54 # store VL bytes (update r12 addr)
55 "sv.stbu/pi *16, 1(12)",
56 "sv.bc/all 0, *2, -0x1c", # test CTR, stop if cmpi failed
57 # zeroing loop starts here:
58 # for ( ; i < n; i++)
59 # dest[i] = '\0';
60 # VL (and r1) = MIN(CTR,MAXVL=4)
61 "setvl 1,0,%d,0,1,1" % maxvl,
62 # store VL zeros (update r12 addr)
63 "sv.stbu/pi 0, 1(12)",
64 "sv.bc 16, *0, -0xc", # dec CTR by VL, stop at zero

luke.l...@gmail.com

May 8, 2023, 11:15:36 AM
On Sunday, May 7, 2023 at 5:00:09 PM UTC+1, Anton Ertl wrote:
> John Levine <jo...@taugh.com> writes:
> >Here it is 50 years later and they're all gone.
> PowerPC and ARM A32 are still there.

yyep.

> >also made it harder to parallelize and pipeline stuff since address
> >modes had side effects that had to be scheduled around or potentially
> >unwound in a page fault.

see https://groups.google.com/g/comp.arch/c/_-dp_ZU6TN0/m/G1lzn4M3BgAJ
for reference to Load/Store Fault-First. only useful in Horizontal-First
ISAs (Vertical-First avoids the problem entirely).

> Pipelining was apparently no problem, as evidenced by several early
> RISCs (ARM (A32), HPPA, PowerPC) having auto-increment.

note that Power ISA Architects debated 20+ years ago whether
to add both pre- and post- Update (not quite the same as
auto-increment but you can consider RB or an Immediate to
be "the amount to auto-increment by" which is real handy).

due to space considerations (it's a hell of a lot of instructions
to add) they went with pre-update, on the basis that post-update
may be synthesised by (ha ha) performing a subtract *outside*
of the loop prior to entering the loop.
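
the trick in C terms (a small sketch): bias the pointer once before the
loop, and every post-update access becomes a pre-update one:

  p = base;        /* with post-update */
  while (n--)
      x = *p++;

  p = base - 1;    /* same accesses with pre-update only */
  while (n--)
      x = *++p;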

sigh :) it works...

l.

luke.l...@gmail.com

May 8, 2023, 11:42:36 AM
On Sunday, May 7, 2023 at 3:00:37 AM UTC+1, MitchAlsup wrote:
> Consider a string of *p++
> a = *p++;
> b = *p++;
> c = *p++;
> <
> Here we see the failure of the ++ or -- notation.
> The LD of b is dependent on the ++ of a
> The LD of c is dependent on the ++ of b
> Whereas if the above was written::
> <
> a = p[0];
> b = p[1];
> c = p[2];
> p +=3;

in my mind this is the sort of thing that a compiler pass
should recognise, and perform a miniature AST-rewrite.

at which point *another* pass could spot that if it allocates
a b and c in consecutive registers it may also perform
a 3-long Vector LD. but at that point we are straying into
the bottomless-money-pit of Auto-Vectorisation...

l.

luke.l...@gmail.com

May 8, 2023, 12:06:49 PM
On Saturday, May 6, 2023 at 4:38:20 PM UTC+1, John Levine wrote:

> Here it is 50 years later and they're all gone. I think the increase
> in code density wasn't worth the contortions to ensure that your data
> structures fit the few cases that the autoincrement modes handled.

i thought that too ("few modes") until i realised that you can use
LD-with-Update in a Vector Loop with zero-checking to perform
linked-list-pointer-chasing in a single instruction.

> It
> also made it harder to parallelize and pipeline stuff since address
> modes had side effects that had to be scheduled around or potentially
> unwound in a page fault.

i mentioned in another post about ARM SVE Load-Fault-First
which helps there. i suspect that even Vertical-First ISAs
would have the same issues, once amortisation has been
carried out at the back-end (multiple loops merged into
back-end SIMD).

see ARM SVE paper about pointer-chasing (figure 6)
https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf

i realised that a repeated-application-of-LD-ST-Update
can chase down the linked-list whilst also dropping
the list structure pointers into consecutive registers.
by also then adding Data-Dependent Fail-First (check
if the data loaded is NULL) you can get the Vector
Operation to stop at or after the NULL, and truncate
such that subsequent Vector operations do not attempt
to go beyond the NULL.

that's a *big* application of auto-update.

also you can use the same instruction to chase double-linked
lists *simultaneously* by making the offset of the updated
register be 2 away from the read-address instead of 1:

sv.ldu/ff=NULL *x+2, *x

what that is doing is, it is reading the address from
sequential registers starting at x, but it is *storing*
the address loaded at registers starting at x+2.
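
a scalar C model of one element step (a sketch, with regs[] standing in
for the register file and the return value being the truncated VL):

  #include <stdint.h>

  int chase(uintptr_t *regs, int x, int VL)
  {
      for (int i = 0; i < VL; i++) {
          regs[x + 2 + i] = *(uintptr_t *)regs[x + i]; /* load via x[i] */
          if (regs[x + 2 + i] == 0)                    /* ff=NULL:      */
              return i + 1;                            /* truncate VL   */
      }
      return VL;
  }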

consequently it can be either chasing a single double-linked
list *or* chasing two single-linked-lists, terminating at
the first NULL. at which point to be honest things get
slightly messy as you have to work out which list is
valid, sigh (as you can tell this is a WIP).

l.

Scott Lurndal

May 8, 2023, 12:22:10 PM
"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:
>
>> loop1:
>> LOADG g16,[r1+r2*]
>> STOREG g16,[r3+r2++*]
>> BLTU r2,1000,.loop1
>>
>> I must look at adding string instructions back into the instruction set.
>
>
>yeah can i suggest really don't do that. what happens if you want
>to support UCS-2 (strncpyW)? then UCS-4? more than that: the
>concepts needed to efficiently support strings, well you have to
>add them anyway so why not make them first-order concepts
>at the ISA level?

UTF8 should be good enough for everything; best to deprecate
UCS-2 et al.

John Levine

May 8, 2023, 1:20:39 PM
It appears that John Levine <jo...@taugh.com> said:
>It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>>PowerPC and ARM A32 are still there. And there's even a new
>>architecture with auto-increment: ARM A64.
>
>I need to take a look.

I took a look at ARM and what they did is quite clever. The increment
or decrement amount is a field in the instruction, so up to the field
size (8 bits plus sign as I recall) you can have whatever stride you
want. It also has only one address per instruction so you don't have
the issues you did on the PDP-11 and Vax.

For block memory copies, there are three instructions, roughly prolog,
body, epilog, to do it. They don't seem to use autoincrement.

Scott Lurndal

May 8, 2023, 1:29:47 PM
John Levine <jo...@taugh.com> writes:
>It appears that John Levine <jo...@taugh.com> said:
>>It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>>>PowerPC and ARM A32 are still there. And there's even a new
>>>architecture with auto-increment: ARM A64.
>>
>>I need to take a look.
>
>I took a look at ARM and what they did is quite clever. The increment
>or decrement amount is a field in the instruction, so up to the field
>size (8 bits plus sign as I recall) you can have whatever stride you
>want. It also has only one address per instruction so you don't have
>the issues you did on the PDP-11 and Vax.
>
>For block memory copies, there are three instructions, roughly prolog,
>body, epilog, to do it. They don't seem to use autoincrement.

Those instructions (FEAT_MOPS) are very new - I'm not aware of any shipping
ARMv8 processors that support them yet.

Like the VAX MOVC3/5 instructions, FEAT_MOPS updates registers and allows
synchronous (e.g. page fault) and asynchronous (interrupt) exceptions
during operation, updating the registers appropriately. There is a
special exception that
may be caused if the thread is moved to a different CPU during a copy.

John Dallman

May 8, 2023, 2:23:12 PM
In article <Yia6M.2700564$iU59....@fx14.iad>, sc...@slp53.sl.home
(Scott Lurndal) wrote:

> Those instructions (FEAT_MOPS) are very new - I'm not aware of any
> shipping ARMv8 processors that support them yet.

They appear to be an ARMv9 feature.
<https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CPYFPTN--CPYFMTN--CPYFETN--Memory-Copy-Forward-only--reads-and-writes-unprivileged-and-non-temporal->

Those are available in Qualcomm Snapdragon 7 Gen 1 onwards and Snapdragon
8 Gen 1 onwards, and MediaTek Dimensity 9000 chips. So there are
several models of Android 'phone that have them, but not much else. I
have some Snapdragon 8 Gen 1 development kit devices that use them at
work; the chip was announced in November '21 and 'phones appeared in
summer '22.

This is a situation where manufacturers that use ARM core designs can get
ahead of fully custom designs: the Apple M-series chips aren't ARMv9 yet.


The ARM Neoverse V2, N2 and E2 cores support ARMv9, but they were
announced last September and nothing with them has shipped yet.

John

BGB

May 8, 2023, 2:32:37 PM
For many use-cases (transmission and storage), UTF-8 is a sane default,
but there are cases where UTF-8 is not ideal, such as inside console
displays or text editors.

Still makes sense to keep UTF-16 support around for the cases where it
is useful.



Though, for an "advanced" text interface, it usually makes sense to have
additional bits per character cell, say:
(31:28): Background Color
(27:24): Foreground Color
(23:20): Attribute Flags
(19: 0): Codepoint

Or 64-bits if one wants more color-depth and/or things like font size
(or additional attribute modifiers, such as skin-tone modifier for
emojis, etc).

This mostly allowing the text rendering to work in a typical "stream of
character cells" sense.


Though, this sort of approach is generally unable to represent things
like "Zalgo text" (formed by using an excessive number of diacritics and
similar over each letter), and I am not entirely sure how "standard"
text-rendering deals with this sort of thing.

Say, a "straightforward" implementation with 64-bit character cells only
allowing for 1 or 2 diacritics per character.

Then again, it doesn't seem to work in the other text editors I use
anyways, so the inability to represent it is likely a non-issue in most
use cases. (Say, the text editor will strip off most of the diacritics
leaving only the base text).


Well, and similarly approaches like representing each character cell as
a small pixel bitmap (say, 16 colors from a per-cell palette), also
wouldn't be able to represent "Zalgo text" (say, if each cell bitmap
only allows the character to extend 50% out each side of its nominal
bounds).

This is with a 32x32 bitmap per character cell (assuming nominal 16x16
text rendering), but this would need ~ 1K per rendered character cell
(horridly impractical).

Then again, it is possible that only a fixed number of such characters
could exist at any moment, and then be treated as "transient virtual
characters".

Seems almost like many of the text layout renderers are operating directly
on a raster image though, without using intermediate character cells.

...


I don't really bother with any of this for TestKern, which (at present)
doesn't even support the full BMP, and what little is supported is
limited to what can be represented directly in 8x8x1 pixel character cells.


John Levine

May 8, 2023, 2:33:07 PM
It appears that David Brown <david...@hesbynett.no> said:
>> Except you are moving blocks 1-byte at a time--which was fine for PDP-11 days
>> and for the era of 16-bits "was sufficient" addressing.
>
>Of course you would move the data in bigger sizes - as big as you can,
>based on your (i.e., the compiler's) knowledge of alignments, sizes, etc.

Which, as discussed in a lot of other messages, you do with groups of
loads and stores where autoincrement isn't very useful.

As I said a few messages ago, I can see how the ARM version with the
stride in the instruction could be useful for stepping through arrays,
but I wouldn't want to get cleverer than that.

John Levine

May 8, 2023, 2:37:39 PM
According to BGB <cr8...@gmail.com>:
>> UTF8 should be good enough for everything; best to deprecate
>> UCS-2 et al.
>
>For many use-cases (transmission and storage), UTF-8 is a sane default,
>but there are cases where UTF-8 is not ideal, such as inside console
>displays or text editors.
>
>Still makes sense to keep UTF-16 support around for the cases where it
>is useful.

I can see UCS-2 if you're willing to permanently limit yourself to the
subset of Unicode it supports. UTF-16 with surrogate pairs is about as
pessimal an encoding as I can imagine. It's not fixed length, it
doesn't sort consistently like UTF-8 does, and it's not even very
compact.

In your text editor if you have room I'd say use UTF-32 everywhere, if
not, store stuff in UTF-8 and expand it to UTF-32 when you're working
with it.
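
The expansion step is cheap; for example, a minimal UTF-8-to-UTF-32
decode in C (a sketch that skips validation of continuation bytes,
overlong forms, and surrogates):

  #include <stdint.h>

  /* decode one scalar value at *s, advancing *s past it */
  uint32_t utf8_next(const unsigned char **s)
  {
      const unsigned char *p = *s;
      uint32_t cp;
      int len;
      if      (p[0] < 0x80) { cp = p[0];        len = 1; }
      else if (p[0] < 0xE0) { cp = p[0] & 0x1F; len = 2; }
      else if (p[0] < 0xF0) { cp = p[0] & 0x0F; len = 3; }
      else                  { cp = p[0] & 0x07; len = 4; }
      for (int i = 1; i < len; i++)
          cp = (cp << 6) | (p[i] & 0x3F);
      *s = p + len;
      return cp;
  }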

Scott Lurndal

May 8, 2023, 2:44:28 PM
V9 has a number of "optional" features, similar to ARMv8, which
defines multiple "versions" that require certain features
(e.g. v8.1 through v8.8).

N2 is V9.0, which doesn't include FEAT_MOPS. See the TRM for N2.

https://developer.arm.com/documentation/102099/0000/The-Neoverse-N2--core

If the feature is supported, the TRM will indicate the appropriate
value in the MOPS field of ID_AA64ISAR2_EL1, which the current
N2 Cores do not support.

As for the snapdragon 7 (Cortex-A710) processors, they do not support
FEAT_MOPS (which are part of a version after V9.0). MOPS is also
allowed in ARMv8.8 implementations (I am not aware of any extant v8.8 chips).

https://developer.arm.com/documentation/101800/latest

Scott Lurndal

May 8, 2023, 2:48:14 PM
BGB <cr8...@gmail.com> writes:
>On 5/8/2023 11:22 AM, Scott Lurndal wrote:
>> "luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>>> On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:
>>>
>>>> loop1:
>>>> LOADG g16,[r1+r2*]
>>>> STOREG g16,[r3+r2++*]
>>>> BLTU r2,1000,.loop1
>>>>
>>>> I must look at adding string instructions back into the instruction set.
>>>
>>>
>>> yeah can i suggest really don't do that. what happens if you want
>>> to support UCS-2 (strncpyW)? then UCS-4? more than that: the
>>> concepts needed to efficiently support strings, well you have to
>>> add them anyway so why not make them first-order concepts
>>> at the ISA level?
>>
>> UTF8 should be good enough for everything; best to deprecate
>> UCS-2 et al.
>>
>
>For many use-cases (transmission and storage), UTF-8 is a sane default,
>but there are cases where UTF-8 is not ideal, such as inside console
>displays or text editors.

I disagree with that. Linux-based systems, for example, have no problem using UTF-8
exclusively for editors, x-terms and any other i18n'd application.

>
>Still makes sense to keep UTF-16 support around for the cases where it
>is useful.

It's a painful and non-universal mechanism. Certainly not worth adding
support in the processor for it.

John Dallman

May 8, 2023, 2:56:23 PM
In article <cqb6M.534496$Olad....@fx35.iad>, sc...@slp53.sl.home
(Scott Lurndal) wrote:

> V9 has a number of "optional" features, similar to ARMv8,
> which defines multiple "versions" that require certain features
> (e.g. v8.1 through v8.8).
>
> N2 is V9.0, which doesn't include FEAT_MOPS. See the TRM for N2.

Oh, rats. I was under the impression that ARMv9 was uniform, but clearly
this is wrong.

John

luke.l...@gmail.com

May 8, 2023, 3:10:12 PM
On Monday, May 8, 2023 at 7:56:23 PM UTC+1, John Dallman wrote:

> Oh, rats. I was under the impression that ARMv9 was uniform, but clearly
> this is wrong.

not only is it non-uniform there is silicon errata making different hardware
completely binary-incompatible. of course ARM does not care because they
sell to "Silicon Partners" not end-users, and nobody has noticed because
all those disparate systems run Android which is Java bytecode. therefore
as long as workarounds for the errors are compiled into the *java interpreter*
nobody even notices.

step out of that android apps box and start compiling native binaries you are into a
world of pain.

l.

Scott Lurndal

May 8, 2023, 3:28:33 PM
"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>On Monday, May 8, 2023 at 7:56:23 PM UTC+1, John Dallman wrote:
>
>> Oh, rats. I was under the impression that ARMv9 was uniform, but clearly
>> this is wrong.
>
>not only is it non-uniform there is silicon errata making different hardware
>completely binary-incompatible. of course ARM does not care because they
>sell to "Silicon Partners" not end-users, and nobody has noticed because
>all those disparate systems run Android which is Java bytecode. therefore
>as long as workarounds for the errors are compiled into the *java interpreter*
>nobody even notices.

Actually, the android phones run the linux operating system and
assorted native utilities; the application-level android runtime
(dalvik) executes bytecode.

Similar to the CPUID instruction on intel, ARM provides registers that
describe which features are implemented on each chip. Software is
expected to not use unimplemented features by testing to see if they're
available. Now, those registers aren't available to user-mode code, but
at least in linux, the OS provides an interface that applications can
use to determine which features are implemented.

Fundamentally, it's no different than the many generations of Intel
and AMD processors each of which implement different sets of SSE/MMX/SGX
et alia features.

>
>step out of that android apps box and start compiling native binaries you are into a
>world of pain.

Some examples would be useful. All of our chips are ARMv8 (and now ARMv9)
and we've had no "incompatibilities" between generations other than newer
chips have newer features (the ID registers allow the OS to determine
which features are supported).

BGB

May 8, 2023, 4:29:17 PM
On 5/8/2023 1:47 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 5/8/2023 11:22 AM, Scott Lurndal wrote:
>>> "luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>>>> On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:
>>>>
>>>>> loop1:
>>>>> LOADG g16,[r1+r2*]
>>>>> STOREG g16,[r3+r2++*]
>>>>> BLTU r2,1000,.loop1
>>>>>
>>>>> I must look at adding string instructions back into the instruction set.
>>>>
>>>>
>>>> yeah can i suggest really don't do that. what happens if you want
>>>> to support UCS-2 (strncpyW)? then UCS-4? more than that: the
>>>> concepts needed to efficiently support strings, well you have to
>>>> add them anyway so why not make them first-order concepts
>>>> at the ISA level?
>>>
>>> UTF8 should be good enough for everything; best to deprecate
>>> UCS-2 et al.
>>>
>>
>> For many use-cases (transmission and storage), UTF-8 is a sane default,
>> but there are cases where UTF-8 is not ideal, such as inside console
>> displays or text editors.
>
> I disagree with that. Linux-based systems, for example, have no problem using UTF-8
> exclusively for editors, x-terms and any other i18n'd application.
>

As noted, UTF-8 makes sense for "transmission", say, sending text to or
from the console; or in the files loaded or saved from a text editor, etc...


Trying to process, edit, and redraw text *directly* in UTF-8 form
internally would be a massive PITA, and would be computationally
expensive, hence why a "character cell" approach is useful. But, as
noted, 32 or 64 bit cells usually make more sense here as, for things
like "syntax highlighting" etc, it makes sense to mark out the
text-colors in the editor buffers (rather than during the redraw process).

Things like variable-size text rendering add some complexity, but these
are mostly keeping track of the width of each character cell, and the
maximum height for the cells in the row.


Then when one saves out the text, or copies it to the OS clipboard, etc,
it is converted back to UTF-8 (or UTF-16).


As for fonts, there are various strategies:
8x8x1, 8x16x1, or 16x16x1 bitmap
Works, but fairly limited, does not deal with resizable text.
Small pixel bitmap (say, 16x16 or 32x32, 2..8 bpp)
Can deal with things like emojis, but not really resizable.
Signed Distance Fields
Resizable, but less ideal for full-color images (1).
Small vector images for each glyph
Traditional form of True-Type Fonts
Needlessly expensive to draw glyphs this way.


Scaling bitmap fonts with either nearest-neighbor or bilinear
interpolation does not give good-looking text (nearest neighbor giving
inconsistent jagged edges, bilinear giving blurry text).

So, "Signed Distance Fields" are a good workaround, but mostly make the
most sense for representing monochrome images.

Effectively, "good" 8-color results require a 6 component image, with 2
components per color bit. For a monochrome image, an SDF would need a
2-component image.
A 16-color image would need 8 components to represent effectively with
an SDF.

An SDF can be done using 1 component per channel, but the edge quality
isn't as good (the one-component form encodes a combined XY distance
from an edge, while the 2-component form separately encodes the X and Y
distances).

The usual algorithm is to interpolate the texels using bilinear
interpolation or similar, and then threshold the results per color bit
(one can then feed this through a small color palette). Traditionally,
this process is done in a fragment shader or similar.
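
A minimal C sketch of that per-pixel step (bilinear_sample() being a
hypothetical helper that returns the interpolated field value in 0..1):

  extern float bilinear_sample(const float *sdf, int w, int h,
                               float u, float v);

  /* one color bit: "inside" where the distance field crosses 0.5 */
  static int sdf_bit(const float *sdf, int w, int h, float u, float v)
  {
      return bilinear_sample(sdf, w, h, u, v) >= 0.5f;
  }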


I guess traditionally, one uses a 256x256 texture for every 256 glyphs,
with 16x16 texels per glyph.

Here, the full Unicode BMP would need 256 textures, or roughly 8MB if
each SDF is encoded using DXT1. Though, one trick is to store the glyphs
as a 16x16x1 bitmap font, and then dynamically convert blocks of
glyphs into SDF form (this is how some of my past 3D engines had worked
IIRC).


Though, currently, I haven't really gotten to this stage yet with
TestKern, still just sorta using 8x8x1 pixel bitmap fonts for now.

And, at the moment, I am experimenting with 640x400 and 800x600
256-color modes, and have started working on adding mouse support
(somewhat needed if I add any sort of GUI to this).


In this case, the 640x400 8-bpp mode has the advantage that it needs less
memory bandwidth, so the screen is slightly less of a broken jittery
mess (and also, the 800x600 mode currently uses a non-standard 36Hz
refresh).

I guess one possibility could be to give the display hardware an
interface to talk directly with the DDR controller (and effectively bypass
the L2 cache). Mostly as the properties the L2 cache adds are "not
particularly optimal" for the access patterns of screen-refresh.

An "L2 bypass path" could potentially be able to sustain high enough
bandwidth to avoid the screen looking like a broken mess when trying to
operate at "slightly higher" resolutions.


There are pros/cons between 256-color and color-cell:
Color cell gives better color fidelity, but more graphical artifacts;
256-color has fewer obvious artifacts, but the color fidelity kinda
sucks (going the RGB555 -> Indexed route; with a "generic" palette);
Drawing the screen image using ordered dither sorta helps, but also
doesn't look particularly good either.

Apparently Half-Life had used this approach (rendering internally using
RGB555 but then reducing the final image back down to 256 color in the
software renderer), but IIRC it looked a lot better than what I am
currently getting.

These images sort of showing the issues I am dealing with:
https://twitter.com/cr88192/status/1654288824669708290

One showing the issue that plagues the 640x400 hi-color mode (and also
800x600 modes), and the other showing the "kinda meh" color rendition
with a fixed 256-color "OS palette" (of the options tested, this being
the palette layout that got the lowest RMSE in my collection of test
images).

Well, along with the 256-color image showing a bug that I have fixed (it
was a bug when doing a partial update of copying the internal
framebuffer to VRAM).

Note that the screen framebuffer is still internally drawn in RGB555,
and then converted to 256-color when being copied into VRAM (well, as
opposed to feeding i