
Load/Store with auto-increment


Marcus

May 6, 2023, 10:59:03 AM
Load/store with auto-increment/decrement can reduce the number of
instructions in many loops (especially those that mostly iterate over
arrays of data). It can also be used in function prologues and epilogues
(for push/pop functionality).
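
A minimal C sketch of the kind of loop this targets (a plain byte copy;
the pointer bump on each access is what a load/store with auto-increment
would fold into the memory instruction):

  void copy_bytes(char *dst, const char *src, int n)
  {
      while (n--)
          *dst++ = *src++;  /* two accesses and two address updates
                               per iteration for the mode to absorb */
  }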

For a long time I had dismissed load/store with auto-increment for my
ISA (MRISC32). The reason is that a load operation with auto-increment
would have TWO results (the loaded value and the updated address base),
which would be a complication (all other instructions have at most one
result).

However, a couple of days ago I realized that store operations do not
have any result, so I could add instructions for store with auto-
increment, and still only have one result. I have a pretty good idea
of how to do it (instruction encoding etc), and it would fit fairly
well (the only oddity would be that the result register is not the
first register address in the instruction word, but the second register
address, which requires some more MUX:ing in the decoding stages).

The next question is: What flavors should I have?

- Post-increment (most common?)
- Post-decrement
- Pre-increment
- Pre-decrement (second most common?)

The "pre" variants would possibly add more logic to critical paths (e.g.
add more gate delay in the AGU before the address is ready for the
memory stage).
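
In C terms, the four flavors correspond to (a rough sketch of the
intended semantics, not actual MRISC32 syntax):

  x = *p++;  /* post-increment: access at p, then p += size */
  x = *p--;  /* post-decrement: access at p, then p -= size */
  x = *++p;  /* pre-increment:  p += size, then access at p */
  *--p = x;  /* pre-decrement:  p -= size, then access at p (a push) */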

Any thoughts? Is it worth it?

/Marcus

John Levine

May 6, 2023, 11:38:20 AM
According to Marcus <m.de...@this.bitsnbites.eu>:
>Load/store with auto-increment/decrement can reduce the number of
>instructions in many loops (especially those that mostly iterate over
>arrays of data). It can also be used in function prologues and epilogues
>(for push/pop functionality). ...

>Any thoughts? Is it worth it?

Autoincrement was quite popular in the 1960s and 70s. The DEC 12 and
18 bit minis and the DG Nova had a version of it where specific
addresses would autoincrement or decrement when used as indirect
addresses. I did a fair amount of PDP-8 programming and those
autoincrement locations were precious, which said as much about the
limits of the 8's instruction set as anything else.

The PDP-11 generalized this to useful modes -(R) and (R)+ to
predecrement or postincrement any register when used as an address,
which is how it handled stacks and the simple cases of stepping
through a string or array.

It also had indirect versions of both, @(R)+ which was useful for
stepping through an array of pointers (one instruction dispatch for
threaded code or coroutines) and @-(R) which turned out to be useless
and was dropped in the VAX.

Here it is 50 years later and they're all gone. I think the increase
in code density wasn't worth the contortions to ensure that your data
structures fit the few cases that the autoincrement modes handled. It
also made it harder to parallelize and pipeline stuff since address
modes had side effects that had to be scheduled around or potentially
unwound in a page fault.

--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

Marcus

May 6, 2023, 1:36:30 PM
On 2023-05-06, John Levine wrote:
> Here it is 50 years later and they're all gone. I think the increase
> in code density wasn't worth the contortions to ensure that your data
> structures fit the few cases that the autoincrement modes handled. It
> also made it harder to parallelize and pipeline stuff since address
> modes had side effects that had to be scheduled around or potentially
> unwound in a page fault.

Actually, ARM has auto-increment (even AArch64). I think that if you
limit what you can do (not the crazy multi-memory-access instructions
that were popular in CISC, e.g. 68k), you should not have any problems
with page fault handling etc. Unless...

Does the auto-increment instruction implicitly introduce a data-
dependency that's also dependent on the memory operation to complete?
Is there any real difference compared to doing the memory operation
and the address increment in two separate instructions (in an OoO
machine)?

/Marcus

MitchAlsup

May 6, 2023, 2:15:41 PM
On Saturday, May 6, 2023 at 9:59:03 AM UTC-5, Marcus wrote:
> Load/store with auto-increment/decrement can reduce the number of
> instructions in many loops (especially those that mostly iterate over
> arrays of data). It can also be used in function prologues and epilogues
> (for push/pop functionality).
<
Can it actually save instructions ??
<
p = <some address>;
q = <some other address>;
for( i = 0; i < max; i++ )
    *p++ = *q++;
<
LDA Rp,[IP,,displacement1]
LDA Rq,[IP,,displacement2]
MOV Ri,#0
VEC Rt,{}
top_of_loop:
LDSW Rqm,[Rq+Ri<<2]
STW Rqm,[Rp+Ri<<2]
LOOP LE,Ri,#1,Rmax
end_of_loop:
>
Which instruction can be saved in this loop??
<
> For a long time I had dismissed load/store with auto-increment for my
> ISA (MRISC32). The reason is that a load operation with auto-increment
> would have TWO results (the loaded value and the updated address base),
<
That is the first problem.
<
> which would be a complication (all other instructions have at most one
> result).
>
> However, a couple of days ago I realized that store operations do not
> have any result, so I could add instructions for store with auto-
> increment, and still only have one result. I have a pretty good idea
> of how to do it (instruction encoding etc), and it would fit fairly
> well (the only oddity would be that the result register is not the
> first register address in the instruction word, but the second register
> address, which requires some more MUX:ing in the decoding stages).
<
So, autoincrement on STs only ??
>
> The next question is: What flavors should I have?
>
> - Post-increment (most common?)
> - Post-decrement
> - Pre-increment
> - Pre-decrement (second most common?)
<
Not having these eliminates having to choose.
>
> The "pre" variants would possibly add more logic to critical paths (e.g.
> add more gate delay in the AGU before the address is ready for the
> memory stage).
>
> Any thoughts? Is it worth it?
<
In my opinion, needing autoincrements is a sign of a weak ISA and
possibly that of a less than stellar compiler.
>
> /Marcus

MitchAlsup

May 6, 2023, 2:17:22 PM
On Saturday, May 6, 2023 at 12:36:30 PM UTC-5, Marcus wrote:
> On 2023-05-06, John Levine wrote:
> > Here it is 50 years later and they're all gone. I think the increase
> > in code density wasn't worth the contortions to ensure that your data
> > structures fit the few cases that the autoincrement modes handled. It
> > also made it harder to parallelize and pipeline stuff since address
> > modes had side effects that had to be scheduled around or potentially
> > unwound in a page fault.
> Actually, ARM has auto-increment (even AArch64). I think that if you
> limit what you can do (not the crazy multi-memory-access instructions
> that were popular in CISC, e.g. 68k), you should not have any problems
> with page fault handling etc. Unless...
>
> Does the auto-increment instruction implicitly introduce a data-
> dependency that's also dependent on the memory operation to complete?
<
Not necessarily, but it does create a base-register to base-register
dependency on uses of the addressing register. So, memory is not
compromised, but use of the register can be.

Thomas Koenig

May 6, 2023, 5:04:37 PM
Marcus <m.de...@this.bitsnbites.eu> schrieb:
> Load/store with auto-increment/decrement can reduce the number of
> instructions in many loops (especially those that mostly iterate over
> arrays of data). It can also be used in function prologues and epilogues
> (for push/pop functionality).

One step further: You can have something like POWER's load and
store with update. For example,

ldux rt,ra,rb

will load a doubleword from the address ra + rb and set ra to
ra + rb, or

ldu rt,num(ra)

will load rt from num + ra and set ra = ra + num.

You can simulate autoincrement/autodecrement if you write

ldu rt,8(ra)

or

ldu rt,-8(ra)

respectively.

> For a long time I had dismissed load/store with auto-increment for my
> ISA (MRISC32). The reason is that a load operation with auto-increment
> would have TWO results (the loaded value and the updated address base),
> which would be a complication (all other instructions have at most one
> result).

Exactly.

> However, a couple of days ago I realized that store operations do not
> have any result, so I could add instructions for store with auto-
> increment, and still only have one result.

That would create a rather weird asymmetry between load and store.
It could also create problems for the compiler - I'm not sure that
gcc is set up to easily handle different addressing modes for load
and store.

> I have a pretty good idea
> of how to do it (instruction encoding etc), and it would fit fairly
> well (the only oddity would be that the result register is not the
> first register address in the instruction word, but the second register
> address, which requires some more MUX:ing in the decoding stages).
>
> The next question is: What flavors should I have?
>
> - Post-increment (most common?)
> - Post-decrement
> - Pre-increment
> - Pre-decrement (second most common?)

If you want to save instructions in a loop and have a "compare to zero"
instruction (which I seem to remember you do), then a negative index
could be something else to try.

Consider transforming

for (int i=0; i<n; i++)
    a[i] = b[i] + 2;

into

int *ap = a + n;
int *bp = b + n;
for (int i=-n; i != 0; i++)
    ap[i] = bp[i] + 2;

and expressing the body of the loop as

start:
    ldd  r1,rb,-ri   ; ri holds -i, counting down from n
    addi r1,r1,2
    std  r1,ra,-ri
    sub  ri,ri,1
    bne0 ri,start    ; loop while i != 0

Hmm... is there any ISA which allows for both negative and positive
indexing?

> The "pre" variants would possibly add more logic to critical paths (e.g.
> add more gate delay in the AGU before the address is ready for the
> memory stage).
>
> Any thoughts? Is it worth it?

Not sure it is - this kind of instruction will be split into two
micro-instructions on any OoO machine, and probably for in-order,
as well.

MitchAlsup

May 6, 2023, 10:00:37 PM
Consider a string of *p++
a = *p++;
b = *p++;
c = *p++;
<
Here we see the failure of the ++ or -- notation.
The LD of b is dependent on the ++ of a
The LD of c is dependent on the ++ of b
Whereas if the above was written::
<
a = p[0];
b = p[1];
c = p[2];
p +=3;
<
Now all three LDs are independent and can issue/execute/retire
simultaneously. Also, the add to p is independent, so we took
3 "instructions" that were serially dependent and make them into
4 instructions that are completely independent in all phases of
execution.

BGB

May 6, 2023, 10:58:39 PM
I skipped auto-increment as it typically saves "hardly anything" (at
best) and adds an awkward case that needs to be decomposed into two
sub-operations (in most other cases).

So, I didn't really feel it was "worth it".

It could almost make sense on a 1-wide machine, except that one needs to
add one of the main expensive parts of a 2-wide machine in order to
support it (and on a superscalar machine, the increment would likely end
up running in parallel with some other op anyways).

...


For register save/restore, maybe it makes sense:
But, one can use normal displacement loads/stores and a single big
adjustment instead;
Things like "*ptr++" could use it, but are still not common enough to
make it significant (combined with the thing of the "ptr++" part usually
just running in parallel with another op anyways).


>>
>> /Marcus

robf...@gmail.com

May 7, 2023, 3:36:19 AM
Auto inc/dec can be difficult for the compiler to make use of. Sometimes
the p++ will end up as a separate add anyway. If there is scaled indexed
addressing often loop increment vars can be used, and the loop
increment is needed anyway.
p[n] = q[n];
n++;
I used extra bits available in load / store instruction to indicate the
cache-ability of data. Requires compiler support though.

Having a push instruction can be handy, and good for code density if it
can push multiple registers in a single instruction.

I have multi-register loads and stores in groups of eight registers for
Thor. Based on filling up the entire cache line with register data then
issuing a single load or store operation.


Anton Ertl

May 7, 2023, 8:43:31 AM
Marcus <m.de...@this.bitsnbites.eu> writes:
>Load/store with auto-increment/decrement can reduce the number of
>instructions in many loops (especially those that mostly iterate over
>arrays of data).

Yes.

If you do it only for stores, as suggested below, it could be used for
loops that read from one or more arrays and write to one array, all
with the same stride, as follows (in pseudo-C-code):

/* read from a and b, write to c */
da = a-c;
db = b-c;
for (...) {
    *c = c[da] * c[db];
    c += stride;
}

the "c+=stride" could become the autoincrement of the store.

>It can also be used in function prologues and epilogues
>(for push/pop functionality).

Not so great, because it introduces data dependencies between the
stores that you then have to get rid of if you want to support more
than one store per cycle. As for the pops, those are loads, and here
the autoincrement would require an additional write port to the
register file, as you point out below; plus it would introduce data
dependencies that you don't want (many cores support more than one
load per cycle).

>The next question is: What flavors should I have?
>
>- Post-increment (most common?)
>- Post-decrement
>- Pre-increment
>- Pre-decrement (second most common?)
>
>The "pre" variants would possibly add more logic to critical paths (e.g.
>add more gate delay in the AGU before the address is ready for the
>memory stage).

You typically have memory-access instructions that include an addition
in the address computation; in that case pre obviously has no extra
cost. The cost of the addition can be reduced (eliminated) with a
technique called sum-addressed memory. OTOH, IA-64 supports only
memory accesses of an address given in a register, so here the
architects apparently thought that sum-addressed memory is still too
slow.

Increment vs. decrement: If your store supports reading two registers
for address computation (in addition to the data register), you can
put the stride in a register, making the whole question moot. Even if
you only support reading one register in addition to the data, you can
have a sign-extended constant stride, again giving you both increment
and decrement options. Note that having a store that does not support
the sum of two registers, but does support autoincrement, and a load
that supports the sum of two registers as address means that both
loads and stores can read two registers and write one register, which
may be useful for certain microarchitectural approaches.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Anton Ertl

May 7, 2023, 12:00:09 PM
John Levine <jo...@taugh.com> writes:
>Here it is 50 years later and they're all gone.

PowerPC and ARM A32 are still there. And there's even a new
architecture with auto-increment: ARM A64.

>I think the increase
>in code density wasn't worth the contortions to ensure that your data
>structures fit the few cases that the autoincrement modes handled.

Are you thinking of the DSPs that do not have displacement addressing,
but have auto-increment, leading to a number of papers on how the
compiler should arrange the variables to make best use of that?

With displacement addressing no such contortions are necessary.

>It also made it harder to parallelize and pipeline stuff since address
>modes had side effects that had to be scheduled around or potentially
>unwound in a page fault.

Pipelining was apparently no problem, as evidenced by several early
RISCs (ARM (A32), HPPA, PowerPC) having auto-increment. Just don't
write the address register before verifying the address. And
parallelizing is no problem, either: IA-64 was designed for Explicitly
Parallel Instruction Computing, and has auto-increment.

Scott Lurndal

May 7, 2023, 12:05:06 PM
Can the compiler not recognize the first pattern and convert
it into the second form under the as-if rule?

John Levine

May 7, 2023, 12:56:03 PM
It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>PowerPC and ARM A32 are still there. And there's even a new
>architecture with auto-increment: ARM A64.

I need to take a look.

>>I think the increase
>>in code density wasn't worth the contortions to ensure that your data
>>structures fit the few cases that the autoincrement modes handled.
>
>Are you thinking of the DSPs that do not have displacement addressing,
>but have auto-increment, leading to a number of papers on how the
>compiler should arrange the variables to make best use of that?

Autoincrement only increments by the size of a single datum so it
works for strings and vectors, not for arrays of structures or 2-D
arrays. Compare it to the 360's BXLE loop closing instruction which
put the stride in a register so it could be whatever you wanted.
It also had base+index, which the VAX did too, but the PDP-11 only
sort of did, if you used absolute addresses instead of a base.

On the PDP-11 autoincrement allowed a two instruction string copy loop:

c: movb (r1)+,(r2)+
bnz c ; loop if the byte wasn't zero

but how useful is that now? I don't know.

>With displacement addressing no such contortions are necessary.

I don't see how that solves the stride problem. Or did you mean
something else?

>>It also made it harder to parallelize and pipeline stuff since address
>>modes had side effects that had to be scheduled around or potentially
>>unwound in a page fault.
>
>Pipelining was apparently no problem, as evidenced by several early
>RISCs (ARM (A32), HPPA, PowerPC) having auto-increment. Just don't
>write the address register before verifying the address. ...

Do they have the kind of hazards that the -11 and Vax did, where you could
autoincrement the same register more than once in a single instruction, or
use the incremented register as an operand? That made things messy.

David Brown

May 7, 2023, 1:47:24 PM
Yes, and compilers have done such conversions for decades. (Of course,
that assumes you are not dealing with external data, or expressions that
could alias each other.)

David Brown

May 7, 2023, 1:49:24 PM
On 07/05/2023 18:55, John Levine wrote:
> It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>> PowerPC and ARM A32 are still there. And there's even a new
>> architecture with auto-increment: ARM A64.
>
> I need to take a look.
>
>>> I think the increase
>>> in code density wasn't worth the contortions to ensure that your data
>>> structures fit the few cases that the autoincrement modes handled.
>>
>> Are you thinking of the DSPs that do not have displacement addressing,
>> but have auto-increment, leading to a number of papers on how the
>> compiler should arrange the variables to make best use of that?
>
> Autoincrement only increments by the size of a single datum so it
> works for strings and vectors, not for arrays of structures or 2-D
> arrays. Compare it to the 360's BXLE loop closing instruction which
> put the stride in a register so it could be whatever you wanted.
> It also had base+index, which the VAX did too, but the PDP-11 only
> sort of did, if you used absolute addresses instead of a base.
>
> On the PDP-11 autoincrement allowed a two instruction string copy loop:
>
> c: movb (r1)+,(r2)+
> bnz c ; loop if the byte wasn't zero
>
> but how useful is that now? I don't know.
>

Similar instructions would be used for copying memory blocks, and that
is very useful!


Thomas Koenig

May 7, 2023, 1:58:59 PM
Scott Lurndal <sc...@slp53.sl.home> schrieb:
Of course:

void bar (int a, int b, int c);

void foo (int *p)
{
    int a, b, c;
    a = *p++;
    b = *p++;
    c = *p++;
    bar (a, b, c);
}

results in

    lw a2,8(a0)
    lw a1,4(a0)
    lw a0,0(a0)
    tail bar

on RISC-V, for example (aarch64 plays games with load double,
so it's a bit harder to read).

But I believe Mitch was referring to the assembler equivalent, where
p would be held in a register.

Autodecrement and autoincrement are done on the 386ff. How do they
avoid the register dependency on the stack register? Special handling?
Instruction fusing?

John Levine

May 7, 2023, 2:12:10 PM
It appears that David Brown <david...@hesbynett.no> said:
>> On the PDP-11 autoincrement allowed a two instruction string copy loop:
>>
>> c: movb (r1)+,(r2)+
>> bnz c ; loop if the byte wasn't zero
>>
>> but how useful is that now? I don't know.
>
>Similar instructions would be used for copying memory blocks, and that
>is very useful!

Not really. On modern computers you want to copy in ways that make
best use of the multiple registers so you're more likely to do a
sequence of loads followed by a sequence of stores, maybe with shift
and mask in between if they're not aligned, then move on to the next
block. You could use autoincrement but you'll probably get better
performance with instructions that clearly don't depend on each other
so they can run in parallel, e.g.

; r8 is source, r9 is dest
loop:
ld r1,0[r8]
ld r2,8[r8]
ld r3,16[r8]
ld r4,24[r8]
; shift and mask to align if needed
st r1,0[r9]
st r2,8[r9]
st r3,16[r9]
st r4,24[r9]

addi r8,#32
addi r9,#32
branch if not done to loop

MitchAlsup

May 7, 2023, 2:29:26 PM
A) the compiler is so allowed
B) once the compiler is doing this, wanting auto{inc,dec} in your
ISA evaporates.

MitchAlsup

May 7, 2023, 2:31:52 PM
Except you are moving blocks 1 byte at a time--which was fine for PDP-11
days and for the era when 16 bits of addressing "was sufficient".

MitchAlsup

May 7, 2023, 2:33:39 PM
They did not--they just "ate" the latency and register conflicts.
But in general, the Great-Big execution window made all those
"go away".

MitchAlsup

May 7, 2023, 2:36:25 PM
On Sunday, May 7, 2023 at 1:12:10 PM UTC-5, John Levine wrote:
> It appears that David Brown <david...@hesbynett.no> said:
> >> On the PDP-11 autoincrement allowed a two instruction string copy loop:
> >>
> >> c: movb (r1)+,(r2)+
> >> bnz c ; loop if the byte wasn't zero
> >>
> >> but how useful is that now? I don't know.
> >
> >Similar instructions would be used for copying memory blocks, and that
> >is very useful!
> Not really. On modern computers you want to copy in ways that make
> best use of the multiple registers so you're more likely to do a
> sequence of loads followed by a sequence of stores, maybe with shift
> and mask in between if they're not aligned, then move on to the next
> block. You could use autoincrement but you'll probably get better
> performance with instructions that clearly don't depend on each other
> so they can run in parallel, e.g.
<
Or you can (put into ISA and) use MM (memory to memory move)
<
MM Rcount,Rfrom,Rto
<
And rest assured that HW will simply do the optimal thing for that
implementation {up to 1 cache line per cycle.}

Stephen Fuld

May 7, 2023, 3:12:13 PM
On 5/7/2023 9:55 AM, John Levine wrote:

snip

> Autoincrement only increments by the size of a single datum so it
> works for strings and vectors, not for arrays of structures or 2-D
> arrays. Compare it to the 360's BXLE loop closing instruction which
> put the stride in a register so it could be whatever you wanted.

Or the 1108 which allowed you to specify, with an instruction bit, that
the high order half of an index register is added to the low order half
(which is all that was used for address calculation) after the memory
address is computed.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Thomas Koenig

May 7, 2023, 4:49:20 PM
Thomas Koenig <tko...@netcologne.de> schrieb:

> Autodecrement and autoincrement are done on the 386ff. How do they
> avoid the register dependency on the stack register? Special handling?
> Instruction fusing?

Seems like they have a dedicated stack engine for the
purpose. Agner Fog (who else) has a nice explanation at
https://agner.org/optimize/microarchitecture.pdf . Basically,
there is an extra stage in the pipeline for handling stack pointers
and for inserting stack synchronization micro-ops.

That is one level of complexity that address + offset addressing
relative to the stack pointer solves nicely.

BGB

May 7, 2023, 5:27:52 PM
On 5/7/2023 7:07 AM, Anton Ertl wrote:
> Marcus <m.de...@this.bitsnbites.eu> writes:
>> Load/store with auto-increment/decrement can reduce the number of
>> instructions in many loops (especially those that mostly iterate over
>> arrays of data).
>
> Yes.
>
> If you do it only for stores, as suggested below, it could be used for
> loops that read from one or more arrays and write to one array, all
> with the same stride, as follows (in pseudo-C-code):
>
> /* read from a and b, write to c */
> da=a-c;
> db=b-c;
> for (...) {
> *c = c[da] * c[db];
> c+=stride;
> }
>
> the "c+=stride" could become the autoincrement of the store.
>

Not all instructions are created equal.

Fewer instructions may not be a win if these instructions would result
in a higher latency.


>> It can also be used in function prologues and epilogues
>> (for push/pop functionality).
>
> Not so great, because it introduces data dependencies between the
> stores that you then have to get rid of if you want to support more
> than one store per cycle. As for the pops, those are loads, and here
> the autoincrement would require an additional write port to the
> register file, as you point out below; plus it would introduce data
> dependencies that you don't want (many cores support more than one
> load per cycle).
>

But, it is kinda moot as, say:
MOV.Q R13, @-SP
MOV.Q R12, @-SP
MOV.Q R11, @-SP
MOV.Q R10, @-SP
MOV.Q R9, @-SP
MOV.Q R8, @-SP

Only saves 1 instruction vs, say:
ADD -48, SP
MOV.Q R13, (SP, 40)
MOV.Q R12, (SP, 32)
MOV.Q R11, (SP, 24)
MOV.Q R10, (SP, 16)
MOV.Q R9, (SP, 8)
MOV.Q R8, (SP, 0)

Depending on how it is implemented, the dependency issues on the shared
register could actually make the use of auto-increment slower than the
use of fixed displacement loads/stores (and, if one needs to wait the
whole latency of a load or store for the increment's write-back to
finish, using auto-increment in this way is likely "dead on arrival").


I can also note that an earlier form of BJX2 had PUSH/POP instructions,
but these were removed. Noting the above, it is probably not all that
hard to guess why...
Nothing to add here.

> - anton

MitchAlsup

May 7, 2023, 5:47:45 PM
On Sunday, May 7, 2023 at 4:27:52 PM UTC-5, BGB wrote:
> On 5/7/2023 7:07 AM, Anton Ertl wrote:
> > Marcus <m.de...@this.bitsnbites.eu> writes:
> >> Load/store with auto-increment/decrement can reduce the number of
> >> instructions in many loops (especially those that mostly iterate over
> >> arrays of data).
> >
> > Yes.
> >
> > If you do it only for stores, as suggested below, it could be used for
> > loops that read from one or more arrays and write to one array, all
> > with the same stride, as follows (in pseudo-C-code):
> >
> > /* read from a and b, write to c */
> > da=a-c;
> > db=b-c;
> > for (...) {
> > *c = c[da] * c[db];
> > c+=stride;
> > }
> >
> > the "c+=stride" could become the autoincrement of the store.
> >
> Not all instructions are created equal.
>
> Fewer instructions may not be a win if these instructions would result
> in a higher latency.
<
But eliminating sequential dependencies is almost always a win
because it directly addresses latency.
<
> >> It can also be used in function prologues and epilogues
> >> (for push/pop functionality).
> >
> > Not so great, because it introduces data dependencies between the
> > stores that you then have to get rid of if you want to support more
> > than one store per cycle. As for the pops, those are loads, and here
> > the autoincrement would require an additional write port to the
> > register file, as you point out below; plus it would introduce data
> > dependencies that you don't want (many cores support more than one
> > load per cycle).
> >
> But, is kinda moot as, say:
> MOV.Q R13, @-SP
> MOV.Q R12, @-SP
> MOV.Q R11, @-SP
> MOV.Q R10, @-SP
> MOV.Q R9, @-SP
> MOV.Q R8, @-SP
>
> Only saves 1 instruction vs, say:
> ADD -48, SP
> MOV.Q R13, (SP, 40)
> MOV.Q R12, (SP, 32)
> MOV.Q R11, (SP, 24)
> MOV.Q R10, (SP, 16)
> MOV.Q R9, (SP, 8)
> MOV.Q R8, (SP, 0)
<
If you actually wanted to save instructions you would::
<
MOV.Q R13:R8,@-SP
<
So the argument of saving 1 instruction becomes moot--you can save 5
instructions.

robf...@gmail.com

May 7, 2023, 10:36:24 PM
Got me thinking of how auto adjust addressing could be added to the Thor
core. There is a bit available in the scaled indexed addressing mode, so I
shoehorned in post-inc, pre-dec modes. This should work with group
register load and store too allowing auto increment for:

loop1:
LOADG g16,[r1+r2*]
STOREG g16,[r3+r2++*]
BLTU r2,1000,.loop1

I must look at adding string instructions back into the instruction set.
Previously there were copy, set, and compare string instructions. It
is tempting to add a REP instruction modifier to the ISA. It could be a
modified branch instruction because the displacement is not needed.

RLTU r55,1000,"RR"
LOADG g16,[r1+r2*]
STOREG g16,[r3+r2++*]

David Brown

May 8, 2023, 3:06:06 AM
Of course you would move the data in bigger sizes - as big as you can,
based on your (i.e., the compiler's) knowledge of alignments, sizes, etc.

Anton Ertl

May 8, 2023, 4:10:11 AM
Thomas Koenig <tko...@netcologne.de> writes:
>void bar (int a, int b, int c);
>
>void foo (int *p)
>{
> int a, b, c;
> a = *p++;
> b = *p++;
> c = *p++;
> bar (a, b, c);
>}
>
>results in
>
> lw a2,8(a0)
> lw a1,4(a0)
> lw a0,0(a0)
> tail bar
>
>on RISC-V, for example (aarch64 plays games with load double,
>so it's a bit harder to read).
>
>But I believe Mitch was referring to the assembler equivalent, where
>p would be held in a register.
>
>Autodecrement and autoincrement are done on the 386ff. How do they
>avoid the register dependency on the stack register? Special handling?

Yes. My understanding is that they do something similar in the
decoding hardware to what the compiler does for the code above (and of
course the hardware probably does not eliminate the update of the
stack pointer as dead code).

luke.l...@gmail.com

May 8, 2023, 11:09:38 AM
On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:

> loop1:
> LOADG g16,[r1+r2*]
> STOREG g16,[r3+r2++*]
> BLTU r2,1000,.loop1
>
> I must look at adding string instructions back into the instruction set.

yeah can i suggest really don't do that. what happens if you want
to support UCS-2 (strncpyW)? then UCS-4? more than that: the
concepts needed to efficiently support strings, well you have to
add them anyway so why not make them first-order concepts
at the ISA level?

(i am assuming a Horizontal-First Vector ISA here: this does
not apply to Mitch's 66000 which is Vertical-First)

first thing: Fault-First is needed. explained here:
https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf

this basically is a contractual declaration, "i want you to
load *UP TO* a set maximum number of elements, and
to TELL me how many were actually loaded"

second: extend that same concept onto data: "i want you
to perform some operation *UP TO* a set maximum
number of elements, but if as part of that *ELEMENT*
there is a test that fails, STOP and tell me where you
stopped".

the first concept allows you to safely issue LOADs
knowing full well that no page-fault or other exception
will occur, because the hardware is ORDERED to avoid
them.

the second concept allows you to detect e.g. a null-chr
within a sequential block, but still expressed as a Vector
operation.

the combination of these two allows you to speculatively
load massive parallel blocks of sequential data, that are
then tested in parallel for zero, after which it is plain
sailing to perform the copy.

at all times the Vector Length remains within required
bounds, having been first truncated to take care of potential
exceptions and then having been truncated up to (and
including) the null-chr.

note at lines 52 and 55 that they are both "post-increment".
this is a Vector Load where hardware is permitted to notice
that where the fundamental element operation is a *Scalar*
Load-with-Update, a repeated run of Updates can
be optimised out to only hit the register file with the very
last of those Updates.

of course all of this is completely irrelevant for a Vertical-First
ISA (or an ISA with Vertical-First Vectorisation Mode),
because everything looks to a Vertical-First ISA (such as
Mitch's 66000) like Scalar Looping.

Horizontal-First on the other hand you know that a
large batch of Element-operations are going to hit the
back-end and consequently may micro-code a much more
efficient suite of operations that take up far less resources
than if the individual element operations were naively
thrown into Execute. (a good example is the big-integer
3-in 2-out multiply instruction we are proposing to Power ISA,
which uses one of the Read-regs and one of the Write-regs as
a 64-bit carry. when chained: 1st operation: 3-in 1-out middle-ops
2-in 1-out last-op 2-in 2-out).

https://git.libre-soc.org/?p=openpower-isa.git;a=blob;f=src/openpower/decoder/isa/test_caller_svp64_ldst.py;hb=HEAD#l36

44 "mtspr 9, 3", # move r3 to CTR
45 "addi 0,0,0", # initialise r0 to zero
46 # chr-copy loop starts here:
47 # for (i = 0; i < n && src[i] != '\0'; i++)
48 # dest[i] = src[i];
49 # VL (and r1) = MIN(CTR,MAXVL=4)
50 "setvl 1,0,%d,0,1,1" % maxvl,
51 # load VL bytes (update r10 addr)
52 "sv.lbzu/pi *16, 1(10)", # should be /lf here as well
53 "sv.cmpi/ff=eq/vli *0,1,*16,0", # cmp against zero, truncate VL
54 # store VL bytes (update r12 addr)
55 "sv.stbu/pi *16, 1(12)",
56 "sv.bc/all 0, *2, -0x1c", # test CTR, stop if cmpi failed
57 # zeroing loop starts here:
58 # for ( ; i < n; i++)
59 # dest[i] = '\0';
60 # VL (and r1) = MIN(CTR,MAXVL=4)
61 "setvl 1,0,%d,0,1,1" % maxvl,
62 # store VL zeros (update r12 addr)
63 "sv.stbu/pi 0, 1(12)",
64 "sv.bc 16, *0, -0xc", # dec CTR by VL, stop at zero

luke.l...@gmail.com

May 8, 2023, 11:15:36 AM
On Sunday, May 7, 2023 at 5:00:09 PM UTC+1, Anton Ertl wrote:
> John Levine <jo...@taugh.com> writes:
> >Here it is 50 years later and they're all gone.
> PowerPC and ARM A32 are still there.

yyep.

> >also made it harder to parallelize and pipeline stuff since address
> >modes had side effects that had to be scheduled around or potentially
> >unwound in a page fault.

see https://groups.google.com/g/comp.arch/c/_-dp_ZU6TN0/m/G1lzn4M3BgAJ
for reference to Load/Store Fault-First. only useful in Horizontal-First
ISAs (Vertical-First avoids the problem entirely).

> Pipelining was apparently no problem, as evidenced by several early
> RISCs (ARM (A32), HPPA, PowerPC) having auto-increment.

note that Power ISA Architects debated 20+ years ago whether
to add both pre- and post- Update (not quite the same as
auto-increment but you can consider RB or an Immediate to
be "the amount to auto-increment by" which is real handy).

due to space considerations (it's a hell of a lot of instructions
to add) they went with pre-update, on the basis that post-update
may be synthesised by (ha ha) performing a subtract *outside*
of the loop prior to entering the loop.
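
the trick in C terms (a small sketch): bias the pointer once before the
loop, and every post-update access becomes a pre-update one:

  p = base;        /* with post-update */
  while (n--)
      x = *p++;

  p = base - 1;    /* same accesses with pre-update only */
  while (n--)
      x = *++p;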

sigh :) it works...

l.

luke.l...@gmail.com

May 8, 2023, 11:42:36 AM
On Sunday, May 7, 2023 at 3:00:37 AM UTC+1, MitchAlsup wrote:
> Consider a string of *p++
> a = *p++;
> b = *p++;
> c = *p++;
> <
> Here we see the failure of the ++ or -- notation.
> The LD of b is dependent on the ++ of a
> The LD of c is dependent on the ++ of b
> Whereas if the above was written::
> <
> a = p[0];
> b = p[1];
> c = p[2];
> p +=3;

in my mind this is the sort of thing that a compiler pass
should recognise, and perform a miniature AST-rewrite.

at which point *another* pass could spot that if it allocates
a b and c in consecutive registers it may also perform
a 3-long Vector LD. but at that point we are straying into
the bottomless-money-pit of Auto-Vectorisation...

l.

luke.l...@gmail.com

May 8, 2023, 12:06:49 PM
On Saturday, May 6, 2023 at 4:38:20 PM UTC+1, John Levine wrote:

> Here it is 50 years later and they're all gone. I think the increase
> in code density wasn't worth the contortions to ensure that your data
> structures fit the few cases that the autoincrement modes handled.

i thought that too ("few modes") until i realised that you can use
LD-with-Update in a Vector Loop with zero-checking to perform
linked-list-pointer-chasing in a single instruction.

> It
> also made it harder to parallelize and pipeline stuff since address
> modes had side effects that had to be scheduled around or potentially
> unwound in a page fault.

i mentioned in another post about ARM SVE Load-Fault-First
which helps there. i suspect that even Vertical-First ISAs
would have the same issues, once amortisation has been
carried out at the back-end (multiple loops merged into
back-end SIMD).

see ARM SVE paper about pointer-chasing (figure 6)
https://alastairreid.github.io/papers/sve-ieee-micro-2017.pdf

i realised that a repeated-application-of-LD-ST-Update
can chase down the linked-list whilst also dropping
the list structure pointers into consecutive registers.
by also then adding Data-Dependent Fail-First (check
if the data loaded is NULL) you can get the Vector
Operation to stop at or after the NULL, and truncate
such that subsequent Vector operations do not attempt
to go beyond the NULL.

that's a *big* application of auto-update.

also you can use the same instruction to chase double-linked
lists *simultaneously* by making the offset of the updated
register be 2 away from the read-address instead of 1:

sv.ldu/ff=NULL *x+2, *x

what that is doing is, it is reading the address from
sequential registers starting at x, but it is *storing*
the address loaded at registers starting at x+2.
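
a scalar C model of one element step (a sketch, with regs[] standing in
for the register file and the return value being the truncated VL):

  #include <stdint.h>

  int chase(uintptr_t *regs, int x, int VL)
  {
      for (int i = 0; i < VL; i++) {
          regs[x + 2 + i] = *(uintptr_t *)regs[x + i]; /* load via x[i] */
          if (regs[x + 2 + i] == 0)                    /* ff=NULL:      */
              return i + 1;                            /* truncate VL   */
      }
      return VL;
  }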

consequently it can be either chasing a single double-linked
list *or* chasing two single-linked-lists, terminating at
the first NULL. at which point to be honest things get
slightly messy as you have to work out which list is
valid, sigh (as you can tell this is a WIP).

l.

Scott Lurndal

May 8, 2023, 12:22:10 PM
"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:
>
>> loop1:
>> LOADG g16,[r1+r2*]
>> STOREG g16,[r3+r2++*]
>> BLTU r2,1000,.loop1
>>
>> I must look at adding string instructions back into the instruction set.
>
>
>yeah can i suggest really don't do that. what happens if you want
>to support UCS-2 (strncpyW)? then UCS-4? more than that: the
>concepts needed to efficiently support strings, well you have to
>add them anyway so why not make them first-order concepts
>at the ISA level?

UTF8 should be good enough for everything; best to deprecate
UCS-2 et al.

John Levine

May 8, 2023, 1:20:39 PM
It appears that John Levine <jo...@taugh.com> said:
>It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>>PowerPC and ARM A32 are still there. And there's even a new
>>architecture with auto-increment: ARM A64.
>
>I need to take a look.

I took a look at ARM and what they did is quite clever. The increment
or decrement amount is a field in the instruction, so up to the field
size (8 bits plus sign as I recall) you can have whatever stride you
want. It also has only one address per instruction so you don't have
the issues you did on the PDP-11 and Vax.

For block memory copies, there are three instructions, roughly prolog,
body, epilog, to do it. They don't seem to use autoincrement.

Scott Lurndal

May 8, 2023, 1:29:47 PM
John Levine <jo...@taugh.com> writes:
>It appears that John Levine <jo...@taugh.com> said:
>>It appears that Anton Ertl <an...@mips.complang.tuwien.ac.at> said:
>>>PowerPC and ARM A32 are still there. And there's even a new
>>>architecture with auto-increment: ARM A64.
>>
>>I need to take a look.
>
>I took a look at ARM and what they did is quite clever. The increment
>or decrement amount is a field in the instruction, so up to the field
>size (8 bits plus sign as I recall) you can have whatever stride you
>want. It also has only one address per instruction so you don't have
>the issues you did on the PDP-11 and Vax.
>
>For block memory copies, there are three instructions, roughly prolog,
>body, epilog, to do it. They don't seem to use autoincrement.

Those instructions (FEAT_MOPS) are very new - I'm not aware of any shipping
ARMv8 processors that support them yet.

Like the VAX MOVC3/5 instructions, FEAT_MOPS updates registers and allows
synchronous (e.g. page fault) and asynchronous (interrupt) exceptions
during operation, updating the registers appropriately. There is a
special exception that
may be caused if the thread is moved to a different CPU during a copy.

John Dallman

May 8, 2023, 2:23:12 PM
In article <Yia6M.2700564$iU59....@fx14.iad>, sc...@slp53.sl.home
(Scott Lurndal) wrote:

> Those instructions (FEAT_MOPS) are very new - I'm not aware of any
> shipping ARMv8 processors that support them yet.

They appear to be an ARMv9 feature.
<https://developer.arm.com/documentation/ddi0602/2021-12/Base-Instructions/CPYFPTN--CPYFMTN--CPYFETN--Memory-Copy-Forward-only--reads-and-writes-unprivileged-and-non-temporal->

Those are available in Qualcomm Snapdragon 7 Gen 1 onwards and Snapdragon
8 Gen 1 onwards, and MediaTek Dimensity 9000 chips. So there are
several models of Android 'phone that have them, but not much else. I
have some Snapdragon 8 Gen 1 development kit devices that use them at
work; the chip was announced in November '21 and 'phones appeared in
summer '22.

This is a situation where manufacturers that use ARM core designs can get
ahead of fully custom designs: the Apple M-series chips aren't ARMv9 yet.


The ARM Neoverse V2, N2 and E2 cores support ARMv9, but they were
announced last September and nothing with them has shipped yet.

John

BGB

May 8, 2023, 2:32:37 PM
For many use-cases (transmission and storage), UTF-8 is a sane default,
but there are cases where UTF-8 is not ideal, such as inside console
displays or text editors.

Still makes sense to keep UTF-16 support around for the cases where it
is useful.



Though, for an "advanced" text interface, it usually makes sense to have
additional bits per character cell, say:
(31:28): Background Color
(27:24): Foreground Color
(23:20): Attribute Flags
(19: 0): Codepoint

Or 64-bits if one wants more color-depth and/or things like font size
(or additional attribute modifiers, such as skin-tone modifier for
emojis, etc).

This mostly allowing the text rendering to work in a typical "stream of
character cells" sense.


Though, this sort of approach is generally unable to represent things
like "Zalgo text" (formed by using an excessive number of diacritics and
similar over each letter), and I am not entirely sure how "standard"
text-rendering deals with this sort of thing.

Say, a "straightforward" implementation with 64-bit character cells only
allowing for 1 or 2 diacritics per character.

Then again, it doesn't seem to work in the other text editors I use
anyways, so the inability to represent it is likely a non-issue in most
use cases. (Say, the text editor will strip off most of the diacritics
leaving only the base text).


Well, and similarly approaches like representing each character cell as
a small pixel bitmap (say, 16 colors from a per-cell palette), also
wouldn't be able to represent "Zalgo text" (say, if each cell bitmap
only allows the character to extend 50% out each side of its nominal
bounds).

This is with a 32x32 bitmap per character cell (assuming nominal 16x16
text rendering), but this would need ~ 1K per rendered character cell
(horridly impractical).

Then again, it is possible that only a fixed number of such characters
could exist at any moment, and then be treated as "transient virtual
characters".

Seems almost like many of the text layout renderers are operating directly
on a raster image though, without using intermediate character cells.

...


I don't really bother with any of this for TestKern, which (at present)
doesn't even support the full BMP, and what little is supported is
limited to what can be represented directly in 8x8x1 pixel character cells.


John Levine

May 8, 2023, 2:33:07 PM
It appears that David Brown <david...@hesbynett.no> said:
>> Except you are moving blocks 1-byte at a time--which was fine for PDP-11 days
>> and for the era of 16-bits "was sufficient" addressing.
>
>Of course you would move the data in bigger sizes - as big as you can,
>based on your (i.e., the compiler's) knowledge of alignments, sizes, etc.

Which, as discussed in a lot of other messages, you do with groups of
loads and stores where autoincrement isn't very useful.

As I said a few messages ago, I can see how the ARM version with the
stride in the instruction could be useful for stepping through arrays,
but I wouldn't want to get cleverer than that.

John Levine

May 8, 2023, 2:37:39 PM
According to BGB <cr8...@gmail.com>:
>> UTF8 should be good enough for everything; best to deprecate
>> UCS-2 et al.
>
>For many use-cases (transmission and storage), UTF-8 is a sane default,
>but there are cases where UTF-8 is not ideal, such as inside console
>displays or text editors.
>
>Still makes sense to keep UTF-16 support around for the cases where it
>is useful.

I can see UCS-2 if you're willing to permanently limit yourself to the
subset of Unicode it supports. UTF-16 with surrogate pairs is about as
pessimal an encoding as I can imagine. It's not fixed length, it
doesn't sort consistently like UTF-8 does, and it's not even very
compact.

In your text editor if you have room I'd say use UTF-32 everywhere, if
not, store stuff in UTF-8 and expand it to UTF-32 when you're working
with it.
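
The expansion step is cheap; for example, a minimal UTF-8-to-UTF-32
decode in C (a sketch that skips validation of continuation bytes,
overlong forms, and surrogates):

  #include <stdint.h>

  /* decode one scalar value at *s, advancing *s past it */
  uint32_t utf8_next(const unsigned char **s)
  {
      const unsigned char *p = *s;
      uint32_t cp;
      int len;
      if      (p[0] < 0x80) { cp = p[0];        len = 1; }
      else if (p[0] < 0xE0) { cp = p[0] & 0x1F; len = 2; }
      else if (p[0] < 0xF0) { cp = p[0] & 0x0F; len = 3; }
      else                  { cp = p[0] & 0x07; len = 4; }
      for (int i = 1; i < len; i++)
          cp = (cp << 6) | (p[i] & 0x3F);
      *s = p + len;
      return cp;
  }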

Scott Lurndal

May 8, 2023, 2:44:28 PM
V9 has a number of "optional" features, similar to ARMv8, which
defines multiple "versions" that require certain features
(e.g. v8.1 through v8.8).

N2 is V9.0, which doesn't include FEAT_MOPS. See the TRM for N2.

https://developer.arm.com/documentation/102099/0000/The-Neoverse-N2--core

If the feature is supported, the TRM will indicate the appropriate
value in the MOPS field of ID_AA64ISAR2_EL1, which the current
N2 Cores do not support.

As for the snapdragon 7 (Cortex-A710) processors, they do not support
FEAT_MOPS (which are part of a version after V9.0). MOPS is also
allowed in ARMv8.8 implementations (I am not aware of any extant v8.8 chips).

https://developer.arm.com/documentation/101800/latest

Scott Lurndal

May 8, 2023, 2:48:14 PM
BGB <cr8...@gmail.com> writes:
>On 5/8/2023 11:22 AM, Scott Lurndal wrote:
>> "luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>>> On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:
>>>
>>>> loop1:
>>>> LOADG g16,[r1+r2*]
>>>> STOREG g16,[r3+r2++*]
>>>> BLTU r2,1000,.loop1
>>>>
>>>> I must look at adding string instructions back into the instruction set.
>>>
>>>
>>> yeah can i suggest really don't do that. what happens if you want
>>> to support UCS-2 (strncpyW)? then UCS-4? more than that: the
>>> concepts needed to efficiently support strings, well you have to
>>> add them anyway so why not make them first-order concepts
>>> at the ISA level?
>>
>> UTF8 should be good enough for everything; best to deprecate
>> UCS-2 et al.
>>
>
>For many use-cases (transmission and storage), UTF-8 is a sane default,
>but there are cases where UTF-8 is not ideal, such as inside console
>displays or text editors.

I disagree with that. Linux-based systems, for example, have no problem using UTF-8
exclusively for editors, x-terms and any other i18n'd application.

>
>Still makes sense to keep UTF-16 support around for the cases where it
>is useful.

It's a painful and non-universal mechanism. Certainly not worth adding
support in the processor for it.

John Dallman

May 8, 2023, 2:56:23 PM
In article <cqb6M.534496$Olad....@fx35.iad>, sc...@slp53.sl.home
(Scott Lurndal) wrote:

> V9 has a number of "optional" features, similar to ARMv8,
> which defines multiple "versions" that require certain features
> (e.g. v8.1 through v8.8).
>
> N2 is V9.0, which doesn't include FEAT_MOPS. See the TRM for N2.

Oh, rats. I was under the impression that ARMv9 was uniform, but clearly
this is wrong.

John

luke.l...@gmail.com

May 8, 2023, 3:10:12 PM
On Monday, May 8, 2023 at 7:56:23 PM UTC+1, John Dallman wrote:

> Oh, rats. I was under the impression that ARMv9 was uniform, but clearly
> this is wrong.

not only is it non-uniform there is silicon errata making different hardware
completely binary-incompatible. of course ARM does not care because they
sell to "Silicon Partners" not end-users, and nobody has noticed because
all those disparate systems run Android which is Java bytecode. therefore
as long as workarounds for the errors are compiled into the *java interpreter*
nobody even notices.

step out of that android apps box and start compiling native binaries you are into a
world of pain.

l.

Scott Lurndal

May 8, 2023, 3:28:33 PM
"luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>On Monday, May 8, 2023 at 7:56:23 PM UTC+1, John Dallman wrote:
>
>> Oh, rats. I was under the impression that ARMv9 was uniform, but clearly
>> this is wrong.
>
>not only is it non-uniform there is silicon errata making different hardware
>completely binary-incompatible. of course ARM does not care because they
>sell to "Silicon Partners" not end-users, and nobody has noticed because
>all those disparate systems run Android which is Java bytecode. therefore
>as long as workarounds for the errors are compiled into the *java interpreter*
>nobody even notices.

Actually, the android phones run the linux operating system and
assorted native utilities; the application-level android runtime
(dalvik) executes bytecode.

Similar to the CPUID instruction on intel, ARM provides registers that
describe which features are implemented on each chip. Software is
expected to not use unimplemented features by testing to see if they're
available. Now, those registers aren't available to user-mode code, but
at least in linux, the OS provides an interface that applications can
use to determine which features are implemented.

Fundamentally, it's no different than the many generations of Intel
and AMD processors each of which implement different sets of SSE/MMX/SGX
et alia features.

>
>step out of that android apps box and start compiling native binaries you are into a
>world of pain.

Some examples would be useful. All of our chips are ARMv8 (and now ARMv9)
and we've had no "incompatibilities" between generations other than newer
chips have newer features (the ID registers allow the OS to determine
which features are supported).

BGB

May 8, 2023, 4:29:17 PM
On 5/8/2023 1:47 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 5/8/2023 11:22 AM, Scott Lurndal wrote:
>>> "luke.l...@gmail.com" <luke.l...@gmail.com> writes:
>>>> On Monday, May 8, 2023 at 3:36:24 AM UTC+1, robf...@gmail.com wrote:
>>>>
>>>>> loop1:
>>>>> LOADG g16,[r1+r2*]
>>>>> STOREG g16,[r3+r2++*]
>>>>> BLTU r2,1000,.loop1
>>>>>
>>>>> I must look at adding string instructions back into the instruction set.
>>>>
>>>>
>>>> yeah can i suggest really don't do that. what happens if you want
>>>> to support UCS-2 (strncpyW)? then UCS-4? more than that: the
>>>> concepts needed to efficiently support strings, well you have to
>>>> add them anyway so why not make them first-order concepts
>>>> at the ISA level?
>>>
>>> UTF8 should be good enough for everything; best to deprecate
>>> UCS-2 et al.
>>>
>>
>> For many use-cases (transmission and storage), UTF-8 is a sane default,
>> but there are cases where UTF-8 is not ideal, such as inside console
>> displays or text editors.
>
> I disagree with that. Linux-based systems, for example, have no problem using UTF-8
> exclusively for editors, x-terms and any other i18n'd application.
>

As noted, UTF-8 makes sense for "transmission", say, sending text to or
from the console; or in the files loaded or saved from a text editor, etc...


Trying to process, edit, and redraw text *directly* in UTF-8 form
internally would be a massive PITA, and would be computationally
expensive, hence why a "character cell" approach is useful. But, as
noted, 32 or 64 bit cells usually make more sense here as, for things
like "syntax highlighting" etc, it makes sense to mark out the
text-colors in the editor buffers (rather than during the redraw process).

Things like variable-size text rendering add some complexity, but these
are mostly keeping track of the width of each character cell, and the
maximum height for the cells in the row.


Then when one saves out the text, or copies it to the OS clipboard, etc,
it is converted back to UTF-8 (or UTF-16).


As for fonts, there are various strategies:
8x8x1, 8x16x1, or 16x16x1 bitmap
Works, but fairly limited, does not deal with resizable text.
Small pixel bitmap (say, 16x16 or 32x32, 2..8 bpp)
Can deal with things like emojis, but not really resizable.
Signed Distance Fields
Resizable, but less ideal for full-color images (1).
Small vector images for each glyph
Traditional form of True-Type Fonts
Needlessly expensive to draw glyphs this way.


Scaling bitmap fonts with either nearest-neighbor or bilinear
interpolation does not give good-looking text (nearest neighbor giving
inconsistent jagged edges, bilinear giving blurry text).

So, "Signed Distance Fields" are a good workaround, but mostly make the
most sense for representing monochrome images.

Effectively, "good" 8-color results require a 6 component image, with 2
components per color bit. For a monochrome image, an SDF would need a
2-component image.
A 16-color image would need 8 components to represent effectively with
an SDF.

An SDF can be done using 1 component per channel, but the edge quality
isn't as good (the one-component form encodes a combined XY distance
from an edge, while the 2-component form separately encodes the X and Y
distances).

The usual algorithm is to interpolate the texels using bilinear
interpolation or similar, and then threshold the results per color bit
(one can then feed this through a small color palette). Traditionally,
this process is done in a fragment shader or similar.
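
A minimal C sketch of that per-pixel step (bilinear_sample() being a
hypothetical helper that returns the interpolated field value in 0..1):

  extern float bilinear_sample(const float *sdf, int w, int h,
                               float u, float v);

  /* one color bit: "inside" where the distance field crosses 0.5 */
  static int sdf_bit(const float *sdf, int w, int h, float u, float v)
  {
      return bilinear_sample(sdf, w, h, u, v) >= 0.5f;
  }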


I guess traditionally, one uses a 256x256 texture for every 256 glyphs,
with 16x16 texels per glyph.

Here, the full Unicode BMP would need 256 textures, or roughly 8MB if
each SDF is encoded using DXT1. Though, one trick is to store the glyphs
as a 16x16x1 bitmap font, and then dynamically convert blocks of
glyphs into SDF form (this is how some of my past 3D engines had worked
IIRC).


Though, currently, I haven't really gotten to this stage yet with
TestKern, still just sorta using 8x8x1 pixel bitmap fonts for now.

And, at the moment, I am experimenting with 640x400 and 800x600
256-color modes, and have started working on adding mouse support
(somewhat needed if I add any sort of GUI to this).


In this case, the 640x400 8-bpp mode has the advantage that it needs less
memory bandwidth, so the screen is slightly less of a broken jittery
mess (and also, the 800x600 mode currently uses a non-standard 36Hz
refresh).

I guess one possibility could be to give the display hardware an
interface to talk directly with the DDR controller (and effectively bypass
the L2 cache). Mostly as the properties the L2 cache adds are "not
particularly optimal" for the access patterns of screen-refresh.

An "L2 bypass path" could potentially be able to sustain high enough
bandwidth to avoid the screen looking like a broken mess when trying to
operate at "slightly higher" resolutions.


There are pros/cons between 256-color and color-cell:
Color cell gives better color fidelity, but more graphical artifacts;
256-color has fewer obvious artifacts, but the color fidelity kinda
sucks (going the RGB555 -> Indexed route; with a "generic" palette);
Drawing the screen image using ordered dither sorta helps, but also
doesn't look particularly good either.

Apparently Half-Life had used this approach (rendering internally using
RGB555 but then reducing the final image back down to 256 color in the
software renderer), but IIRC it looked a lot better than what I am
currently getting.

These images sort of showing the issues I am dealing with:
https://twitter.com/cr88192/status/1654288824669708290

One showing the issue that plagues the 640x400 hi-color mode (and also
800x600 modes), and the other showing the "kinda meh" color rendition
with a fixed 256-color "OS palette" (of the options tested, this being
the palette layout that got the lowest RMSE in my collection of test
images).

Well, along with the 256-color image showing a bug that I have fixed (it
was a bug when doing a partial update of copying the internal
framebuffer to VRAM).

Note that the screen framebuffer is still internally drawn in RGB555,
and then converted to 256-color when being copied into VRAM (well, as
opposed to feeding i