
Faster Sort Algorithms


MitchAlsup

Jun 7, 2023, 5:40:22 PM
An AI bot has figured out some faster sorting algorithms.

https://www.nature.com/articles/s41586-023-06004-9

Looking at the C++* code output from the paper::

void variable_sort_2( int length, int *a )
{
    switch( length )
    {
    case 0:
    case 1:
        return;
    case 2:
        int temp = a[0];
        a[0] = (a[1] < a[0] ) ? a[1] : a[0];
        a[1] = (a[1] < temp ) ? temp : a[1];
    }
}

There is x86 asm code in the article, consisting of 11 instructions.

My 66000 ASM

variable_sort_2:
    CMP   R3,R1,#2
    PLT   R3,TTTTTT
    LD    R4,[R2,0]
    LD    R5,[R2,4]
    MAX   R6,R4,R5
    MIN   R7,R4,R5
    ST    R7,[R2,0]
    ST    R6,[R2,4]
    RET

For a count of 9 instructions.
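In portable C, the same select-based min/max pattern might look like the sketch below (my paraphrase of the idea, not the paper's code; `sort2_minmax` is a made-up name):

```c
#include <assert.h>

/* Branchless sort of two ints via min/max selects, mirroring the
   MAX/MIN instruction pair above. A sketch, not the paper's code. */
static void sort2_minmax(int length, int *a)
{
    if (length < 2)
        return;
    int lo = (a[0] < a[1]) ? a[0] : a[1];  /* MIN */
    int hi = (a[0] < a[1]) ? a[1] : a[0];  /* MAX */
    a[0] = lo;
    a[1] = hi;
}
```

Whether a compiler turns the ternaries into real MIN/MAX (or CMOV) instructions is target- and optimizer-dependent.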

(*) it looks like C code to me instead of C++ code........

BGB

Jun 7, 2023, 9:32:20 PM
Hmm, if "hand compiling" (very loose translation):
    CMPGT   1, R4         //1c
    BF      .L0           //2c (*)
    MOV.L   (R5, 0), R6   //2c
    MOV.L   (R5, 4), R7   //3c
    CMPGT   R7, R6        //1c
    MOV.L?T R7, (R5, 0)   //1c
    MOV.L?T R6, (R5, 4)   //1c
.L0:
    RTS                   //2c

Length: 8 ops.
Latency: around 13 clock cycles (best case).

*: Depending on branch predictor (2c | 8c).
"RTS?F" could have been used, but this would have ended up with a higher
total latency.



> (*) it looks like C code to me instead of C++ code........

Yeah.

If I were to write similar in C:
void variable_sort_2(int length, int *arr)
{
    int t0, t1;
    if (length > 1)
    {
        t0 = arr[0]; t1 = arr[1];
        if (t0 > t1)
            { arr[0] = t1; arr[1] = t0; }
    }
}

Could generalize it further (to larger N), but decided to leave it out
(starts getting pretty hairy pretty quick).
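For what it's worth, the next step up (N=3) in the same style might look like this (a sketch via conditional swaps; `cswap` and `variable_sort_3` are names I made up, not anyone's actual code):

```c
#include <assert.h>

/* Swap *x and *y if out of order: the building block for small sorts. */
static void cswap(int *x, int *y)
{
    if (*x > *y) { int t = *x; *x = *y; *y = t; }
}

/* Sort up to 3 elements, in the same style as variable_sort_2.
   Comparator order (0,1)(1,2)(0,1) is a valid 3-element network. */
static void variable_sort_3(int length, int *arr)
{
    if (length > 1)
        cswap(&arr[0], &arr[1]);
    if (length > 2)
    {
        cswap(&arr[1], &arr[2]);
        cswap(&arr[0], &arr[1]);
    }
}
```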



In other news, generalized the:
long long x, y;
y=x*10;
Transform to handle all patterns up to:
y=(x<<S0)+(x<<S1)+(x<<S2)+(x<<S3);
Which can cover "most" of the values between 3 and 64 (but drops off
pretty quickly for larger values).

Despite its limitations, it seems to have an unexpectedly high hit rate
for long-long multiply.
Granted, the most common numbers in this case seem to be:
35, 10, 3, 5, 2097, ...

Excluding powers of 2 which are turned into shifts.
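As a worked example of the transform, the constants from the list above decompose into shift-and-add form like this (the exact patterns a compiler would pick are my guess; function names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add expansions of constant multiplies, of the kind the
   transform above targets (up to four shifted terms). */
static int64_t mul10(int64_t x)   { return (x << 3) + (x << 1); }           /* 8+2 */
static int64_t mul35(int64_t x)   { return (x << 5) + (x << 1) + x; }       /* 32+2+1 */
static int64_t mul2097(int64_t x) { return (x << 11) + (x << 5) + (x << 4) + x; } /* 2048+32+16+1 */
```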

I guess leaving an open question of how many more "obvious cases" I may
have missed...

...


Terje Mathisen

Jun 8, 2023, 3:26:25 AM
MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9
>
> Looking at the C++* code output from the paper::
>
> void variable_sort_2( int length, int *a )
> {
> switch( length )
> {
> case 0:
> case 1:
> return;
> case 2:
> int temp = a[0];
> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> a[1] = (a[1] < temp ) ? temp : a[1];
> }
> }

This is branchless sorting, something I started to investigate decades
ago: every time I've done so, I've found that since I force write
operations to store back the sorted pair, even when it was already in
order, the cost of those stores outweighs the advantage I got from being
branchless.

Things might be different now, but I did start with the assumption that
stores back to an already $L1 resident cache line would be very close to
free.

I have seen similar effects from branchless compressed token extraction,
where I keep reloading the source words even when the source
byte/word/dword index is the same, and here I also have big problems
beating naive code based on branch prediction working well.


> There is x86 asm code in the article, consisting of 11 instructions.
>
> My 66000 ASM
>
> variable_sort_2:
> CMP R3,R1,#2
> PLT R3,TTTTTT
> LD R4,[R2,0]
> LD R5,[R2,4]
> MAX R6,R4,R5
> MIN R7,R4,R5
> ST R7,[R2,0]
> ST R5,[R2,4]
> RET
>
> For a count of 9 instructions.

That's nice, but see my caveat about the actual cost of "free" stores.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Thomas Koenig

Jun 8, 2023, 8:02:23 AM
Terje Mathisen <terje.m...@tmsw.no> wrote:
Hmm... so better code could then be

    cmp  r1,r1,#2
    pne  r1,0,FFFFFF
    ldsw r1,[r2]
    ldsw r3,[r2,4]
    cmp  r4,r1,r3
    ple  r4,0,FF
    stw  r3,[r2]
    stw  r1,[r2,4]
    ret

? Also branchless, and does not do the store if the condition is
not fulfilled.

EricP

Jun 8, 2023, 10:51:20 AM
MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9

Ummm... in fig 3.b they show the original sort3 result as

A = mem[0]
B = mem[1]
C = mem[2]

mem[0] = min(A,B,C)
mem[1] = max(min(A,C),B)
mem[2] = max(A,C)

and in fig 3.c they show the AI optimized sort3 result as

mem[0] = min(A,B)
mem[1] = max(min(A,C),B)
mem[2] = max(A,C)

If initially A=2, B=3, C=1 then the original produces

mem[0] = min(1,2,3) = 1
mem[1] = max(min(2,1),3) = 3
mem[2] = max(2,1) = 2

and the AI optimized produces

mem[0] = min(2,3) = 2
mem[1] = max(min(2,1),3) = 3
mem[2] = max(2,1) = 2

neither of them is the correct ascending sort order.
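The claim is easy to machine-check; a throwaway C snippet reproducing the figures' dataflow (assuming plain integer min/max; function names are mine):

```c
#include <assert.h>

static int min2(int a, int b) { return a < b ? a : b; }
static int max2(int a, int b) { return a > b ? a : b; }
static int min3(int a, int b, int c) { return min2(a, min2(b, c)); }

/* Evaluate the fig. 3b expressions (as quoted above) into out[3]. */
static void fig3b(int A, int B, int C, int *out)
{
    out[0] = min3(A, B, C);
    out[1] = max2(min2(A, C), B);
    out[2] = max2(A, C);
}

/* Evaluate the fig. 3c (AI-optimized) expressions into out[3]. */
static void fig3c(int A, int B, int C, int *out)
{
    out[0] = min2(A, B);
    out[1] = max2(min2(A, C), B);
    out[2] = max2(A, C);
}
```

For A=2, B=3, C=1 this gives {1,3,2} and {2,3,2} respectively, confirming that neither matches the sorted order {1,2,3}.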






Marcus

Jun 8, 2023, 2:49:24 PM
For MRISC32 I get one branch:

variable_sort_2:
    sne   r1, r1, #2
    bs    r1, 1f
    ldw   r3, [r2]
    ldw   r1, [r2, #4]
    min   r4, r1, r3
    max   r1, r1, r3
    stw   r4, [r2]
    stw   r1, [r2, #4]
1:  ret

/Marcus

Stephen Fuld

Jun 8, 2023, 3:27:10 PM
On 6/8/2023 12:26 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> An AI bot has figured out some faster sorting algorithms.
>>
>> https://www.nature.com/articles/s41586-023-06004-9
>>
>> Looking at the C++* code output from the paper::
>>
>> void variable_sort_2( int length, int *a )
>> {
>>       switch( length )
>>       {
>>        case 0:
>>        case 1:
>>                      return;
>>        case 2:
>>                      int temp = a[0];
>>                      a[0] = (a[1] < a[0] ) ? a[1] : a[0];
>>                      a[1] = (a[1] < temp ) ? temp : a[1];
>>       }
>> }
>
> This is branchless sorting, something I started to investigate decades
> ago: Every time I've done so I've found that since I force write
> operations to store back the sorted pair, even when it was already in
> order, the cost of those stores outweigh the advantage I got from being
> branchless.
>
> Things might be different now, but I did start with the assumption that
> stores back to an already $L1 resident cache line would be very close to
> free.

Note that the paper did give actual, experimental results. They showed
a not necessarily huge, but statistically significant improvement.

https://www.nature.com/articles/s41586-023-06004-9#Sec6

and the data given in table 1 linked to in that section.

There is some more discussion of the results in the news release related
to the article at

https://www.nature.com/articles/d41586-023-01883-4

together with a comment in the last paragraph that they would like to
apply the same methods to hardware design!


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Terje Mathisen

Jun 9, 2023, 2:07:19 AM
That's great, I'm reading the article now!
>
> https://www.nature.com/articles/s41586-023-06004-9#Sec6
>
> and the data given in table 1 linked to in that section.
>
> There is some more discussion of the results in the news release related
> to the article at
>
> https://www.nature.com/articles/d41586-023-01883-4

I'll have to check that more closely, but my immediate reaction to the
first part (better latency for VarSortN than human code) is that (a)
their assumption that human code doesn't take advantage of previous
compares is at least flawed, and, more importantly, branchless x86 code
that uses CMOVs to get rid of the branches is almost always beaten by
less sophisticated code which does use branching to skip already-sorted
parts.

I.e. I have had to work hard to generate random or even pessimal inputs
to get consistent speedups.

More importantly, their comparisons are always against human branchless
code, not also against hybrid/branchy code, where the actual input data
strongly impacts the results. If they can use this approach to beat the
big iron sorting benchmarks, then they have something really worthwhile!

They emphasize how VarSort4 works by sorting the first 3 elements, then
adding in the last element in the length=4 case, but this is afaik just
a final insertion sort. Not at all surprisingly, this saves a lot of
instructions vs having to write and branch to three totally separate
Sort2, Sort3, Sort4 sections.

I'm in fact more optimistic about sorting lengths that fit inside a SIMD
register, using MIN/MAX operations or parallel compares and shuffles.

>
> together with a comment in the last paragraph that they would like to
> apply the same methods to hardware design!

This is probably more relevant: When you (i.e Mitch) can apply hw
resources in parallel, finding shorter paths to the final results looks
to me a lot like network optimizations that VLSI tools have employed for
decades.

Terje Mathisen

Jun 9, 2023, 2:22:34 AM
I finally got to the part where they got a measured 1.7% speedup for
large sorts done with LLVM libc++ sort, with their patch accepted.

Well done, even if less amazing than the huge initial claims.

Re the final part about VarInt coding/decoding: this was recently done
with some quite clever SIMD code, so that large parts could be handled
without branches or additional loads.

Terje

Michael S

Jun 9, 2023, 6:15:31 AM
Your asm can be literally translated into 9 x86 (AVX) instructions.
https://godbolt.org/z/Gbz1Kh56f
However, I suspect that on the majority of real-world x86 cores the 11
original instructions are faster.


> (*) it looks like C code to me instead of C++ code........

To me it looks legal in both languages.

Michael S

Jun 9, 2023, 6:30:12 AM
On Friday, June 9, 2023 at 9:22:34 AM UTC+3, Terje Mathisen wrote:
> I finally got to the part where they got a measured 1.7% speedup for
> large sorts done with LLVM libc++ sort, with their patch accepted.
>

I would think that for large sorts you can find 1-2 or maybe even 3%
of speed by playing with the "small segment" threshold that decides where
to switch from quicksort to insertion sort. But this gain will be very target
and size dependent.

> Well done, even if less amazing than the huge initial claims.

Still, to me the acceptance of the patch looks like a mistake on the part
of the libc++ maintainers. The gain by itself is not worth the hassle of the
change. Maybe there were other factors, like the new code being less messy
than the one it replaced?

>
> Re the final part about VarInt codeing/decoding, this was recently done
> with some quite clever SIMD code, so that large parts could be handled
> without branches or additiona loads.
>

IMHO, for practical purposes, sorting short segments is too small a part of
bigger sorts to bother with heavy optimizations.
Which does not mean that it couldn't be done for enjoyment.


MitchAlsup

Jun 9, 2023, 11:24:57 AM
Which is why it is not C++ -- it uses none of the C++ features which are above C.

Bill Findlay

Jun 9, 2023, 11:37:39 AM
On 9 Jun 2023, MitchAlsup wrote
(in article <fd0b5dc1-6328-4a88...@googlegroups.com>):

> On Friday, June 9, 2023 at 5:15:31 AM UTC-5, Michael S wrote:
KDF9, designed ca. 1960:

YA0; YA1; MAX; =YA1; =YA0; {VR;}

It sets the V flipflop if the values had to be reversed in the NEST,
"VR;" clears V if you don't care about that.
--
Bill Findlay


Scott Lurndal

Jun 9, 2023, 11:45:05 AM
MitchAlsup <Mitch...@aol.com> writes:
>On Friday, June 9, 2023 at 5:15:31 AM UTC-5, Michael S wrote:
>> On Thursday, June 8, 2023 at 12:40:22 AM UTC+3, MitchAlsup wrote:
>> > An AI bot has figured out some faster sorting algorithms.
>> >
>> > https://www.nature.com/articles/s41586-023-06004-9
>> >
>> > Looking at the C++* code output from the paper::
>> >
>> > void variable_sort_2( int length, int *a )
>> > {
>> > switch( length )
>> > {
>> > case 0:
>> > case 1:
>> > return;
>> > case 2:
>> > int temp = a[0];
>> > a[0] = (a[1] < a[0] ) ? a[1] : a[0];
>> > a[1] = (a[1] < temp ) ? temp : a[1];
>> > }
>> > }
>> >
>> > There is x86 asm code in the article, consisting of 11 instructions.
>> >
>> > My 66000 ASM
>> >
>> > variable_sort_2:
>> > CMP R3,R1,#2
>> > PLT R3,TTTTTT
>> > LD R4,[R2,0]
>> > LD R5,[R2,4]
>> > MAX R6,R4,R5
>> > MIN R7,R4,R5
>> > ST R7,[R2,0]
>> > ST R5,[R2,4]
>> > RET
>> >
>> > For a count of 9 instructions.
>> >
>> Your asm can be literally translated into 9 x86 (AVX) instructions.
>> https://godbolt.org/z/Gbz1Kh56f
>> However I suspect that on the majority of real-world x86 cores the 11
>> original instructions are faster.
>> > (*) it looks like C code to me instead of C++ code........
>> To me it looks legal in both languages.
>
>Which is why it is not C++ -- it uses none of the C++ features which are
>above C.

That doesn't mean that it isn't C++. Many of the soi-disant "C++ features"
are syntactic sugar. The good parts (object orientation and data encapsulation)
can be used without using all the cruft.

MitchAlsup

Jun 9, 2023, 12:29:06 PM
I remember a program written for a C obfuscation contest many years ago
which was compilable in C, compilable in Fortran, and compilable in Pascal.
<
Does that make that program any of them ??

EricP

Jun 9, 2023, 12:54:51 PM
EricP wrote:
> MitchAlsup wrote:
>> An AI bot has figured out some faster sorting algorithms.
>>
>> https://www.nature.com/articles/s41586-023-06004-9
>
> Ummm... in fig 3.b they show the original sort3 result as
> and in fig 3.c they show the AI optimized sort3 result as
>
> neither of them is the correct ascending sort order.

I think I see what they were talking about. Those figures were examples of
arbitrary sort networks of 3 and 4 variables, and have nothing to do with
the sort3 and sort4 routines.

Anyway, the code AlphaDev produced is here:

https://github.com/deepmind/alphadev/blob/main/sort_functions_test.cc

and is almost the same code I would have written as it results from
doing conditional swaps. For Sort3AlphaDev my only difference is I would
put the "mov (%0), %%r8d" at the front to get the load going ASAP.
Same for the other SortN's.

So I don't see what optimization AlphaDev could have done.

Remember that CMOV is often implemented as two uOps, one to move the
source if condition True, the other to copy the original dest if False.
That creates a dependency between the original dest register and
subsequent users of that same register that is not apparent in the code.
While register MOV can sometimes be optimized away by HW,
CMOV for both the True and False paths cannot.

The actual performance issues are obscured by the assembler.
Using CSWAP as a conditional register swap macro highlights the differences.

Sort3 is uninteresting because it is just
Sort3 (a, b, c)
CSWAP_GT (a,b)
CSWAP_GT (b,c)
CSWAP_GT (a,b)

Sort4 and others are interesting because some operations can be done
in parallel if a cpu can launch multiple ALU instructions per clock.

Sort4 (a, b, c, d)
CSWAP_GT (a,b)
CSWAP_GT (c,d)
CSWAP_GT (a,c)
CSWAP_GT (b,d)
CSWAP_GT (b,c)
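Spelled out in C, the conditional-swap building block and the two routines might look like this (a sketch; `CSWAP_GT` here is just a guarded register swap, not a real instruction, and for N=4 I use the textbook 5-comparator network (0,1)(2,3)(0,2)(1,3)(1,2)):

```c
#include <assert.h>

/* Conditional swap: exchange a and b if a > b. */
#define CSWAP_GT(a, b) \
    do { if ((a) > (b)) { int t_ = (a); (a) = (b); (b) = t_; } } while (0)

static void sort3(int *a, int *b, int *c)
{
    CSWAP_GT(*a, *b);
    CSWAP_GT(*b, *c);
    CSWAP_GT(*a, *b);
}

/* Standard 5-comparator network; the first two pairs and the next
   two pairs are independent, so they can run in parallel. */
static void sort4(int *a, int *b, int *c, int *d)
{
    CSWAP_GT(*a, *b);
    CSWAP_GT(*c, *d);
    CSWAP_GT(*a, *c);
    CSWAP_GT(*b, *d);
    CSWAP_GT(*b, *c);
}
```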

For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
instruction along with SWAP rd,rs.

I also have Load Immediate Conditional LIC, Load Conditional LDC, and
Store Conditional STC. LDC and STC have nothing to do with atomics.

LIC and LDC allow you to conditionally overwrite a register value.
STC allows you to optimize away the store if you didn't change a value.

For LDC and STC, one idea I had, rather than adding a condition register
(which would mean STC had 4 source registers), was to test whether
the base address register is zero/non-zero as the disable/enable.

Applied to the above register sort routines, using CSWAP replaces a MOV
and 2 CMOVs. One could set a register bit flag to 1 if any swap occurs,
then later test that flag and have an STC skip the write-back if unchanged.
In this SortN case that's probably more trouble than it's worth.






MitchAlsup

Jun 9, 2023, 3:43:17 PM
On Friday, June 9, 2023 at 11:54:51 AM UTC-5, EricP wrote:
> EricP wrote:
> > MitchAlsup wrote:
> >> An AI bot has figured out some faster sorting algorithms.
> >>
> >> https://www.nature.com/articles/s41586-023-06004-9
> >
> > Ummm... in fig 3.b they show the original sort3 result as
> > and in fig 3.c they show the AI optimized sort3 result as
> >
> > neither of them is the correct ascending sort order.
> I think see what they were talking about. Those figures were examples of
> arbitrary sort networks of 3 and 4 variables, and nothing to do with
> the sort3 and sort4 routines.
>
> Anyway, the code AlphaDev produced is here:
>
> https://github.com/deepmind/alphadev/blob/main/sort_functions_test.cc
>
> and is almost the same code I would have written as it results from
> doing conditional swaps. For Sort3AlphaDev my only difference is I would
> put the "mov (%0), %%r8d" at the front to get the load going ASAP.
> Same for the other SortN's.
>
> So I don't see what optimization AlphaDev could have done.
>
> Remember that CMOV is often implemented as two uOps, one to move the
> source if condition True, the other to copy the original dest if False.
<
This is why MAX and MIN instructions are simply better: fewer operands,
just as easy a choice.
<
> That creates a dependency between the original dest register and
> subsequent users of that same register that is not apparent in the code.
> While register MOV can sometimes be optimized away by HW,
> CMOV for both the True and False paths cannot.
>
> The actual performance issues are obscured by the assembler.
<
Errrrrrr........ISA is the problem, ASM is simply how ISA is spelled using ASCII.
<
> Using CSWAP as a conditional register swap macro highlights the differences.
>
> Sort3 is uninteresting because it is just
> Sort3 (a, b, c)
> CSWAP_GT (a,b)
> CSWAP_GT (b,c)
> CSWAP_GT (a,b)
<
// All single result instructions
MIN TA,A,B
MAX TB,A,B
MIN TC,TB,C
MAX C,TB,C
MIN A,TA,TC
MAX B,TA,TC
<
I see no particular reason that MAX and MIN cannot be issued together
(in parallel) so we have 3 cycles of 2 instruction each.
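In C, the corresponding single-result min/max network looks like the sketch below (my rendering of the idea; whether a compiler maps the selects to real MIN/MAX ops is target-dependent):

```c
#include <assert.h>

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Branchless 3-element ascending sort built only from single-result
   min/max: three MIN/MAX pairs, each pair internally independent. */
static void sort3_minmax(int *p)
{
    int lo  = imin(p[0], p[1]);
    int hi  = imax(p[0], p[1]);
    int mid = imin(hi, p[2]);
    p[2] = imax(hi, p[2]);
    p[0] = imin(lo, mid);
    p[1] = imax(lo, mid);
}
```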
>
> Sort4 and others are interesting because some operations can be done
> in parallel if a cpu can launch multiple ALU instructions per clock.
>
> Sort4 (a, b, c, d)
> CSWAP_GT (a,b)
> CSWAP_GT (c,d)
> CSWAP_GT (b,c)
> CSWAP_GT (a,b)
> CSWAP_GT (c,d)
>
> For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
> instruction along with SWAP rd,rs.
<
2 results per instruction is contrary to the tenets of RISC.
And most/many of us think condition codes are contrary to the tenets of RISC.

Terje Mathisen

Jun 10, 2023, 4:12:54 PM
Re our recent sort thread:

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md

cpp_vqsort_new() seems like the current winner in the generic sort
library race, with gains almost certainly well above the quoted 1.7% for
the AI-derived micro-optimizations that got into LLVM's libc++.

EricP

Jun 11, 2023, 9:44:06 AM
MitchAlsup wrote:
> On Friday, June 9, 2023 at 11:54:51 AM UTC-5, EricP wrote:
>>
>> For ISA design, compare and swap CSWAP_cc rd,rs could probably be an
>> instruction along with SWAP rd,rs.
> <
> 2 results per instruction is contrary to the tenets of RISC.
> And most/many of us thing condition codes are contrary to tenets of RISC.

I was driven to instructions with 2-dest results by my desire to
(a) not have condition codes and (b) wanting to handle add with carry
in a straightforward way. That leads to many new wide-result instructions.

I see the RISC tenets as guidelines advising designers to
consider the benefits and costs rather than hidebound dogma.
And that cost/benefit consideration is made in the context of the
logic budgets of the current day, not locked to a date in the mid-1980s.

Variable-length instructions open the ISA to much useful functionality,
like full-size immediate operands, but require a more complicated fetch
unit than fixed-size. This cost is generally accepted today.

I found your arguments for the utility of store immediate instructions
convincing but that requires allowing for up to two immediates,
one for the addressing mode and one for data. The cost of that internally
is it requires extra decode logic, extra flip-flops in the front end to
hold the second immediate and extra muxes to route it, and that puts
pressure on the budget designs.

Having up to two dest registers opens an ISA to a whole host of new and
useful instructions. Mostly these are double-wide results for ADD, MUL,
various shifts and bit fields. It also allows integer DIV to return
both the quotient and remainder, and SWAP or CSWAP, plus many others.

As to the cost of 2 dest registers, at the low end for an in-order,
single issue pipeline with 1 register write port, taking an extra clock
to write a second result does not add any complexity.

For a multi-lane OoO uArch, my concern was for Rename to NOT require
two register allocation units for each lane, which would add greatly to
the cost and complexity and mostly sit unused, burning power.

My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
split across two consecutive lanes. Rename allocates the 1-dest registers
to uOps as normal, then fuses the two pieces and outputs a single uOp to
one output lane to Dispatch.

The OoO writeback of multiple dests is no more expensive because it
already requires multiple result and forwarding buses and write ports
so that it can catch up after bubbles.

With plausible solutions for each problem area, and the costs for 2-dest
registers under control, I green-lighted it for general use by new
instructions.

So "CSWAP rdX, rdY, rsA, rsB, rsCC" slides right into the ISA:
IF (rsCC) rdX = rsA, rdY = rsB ELSE rdX = rsB, rdY = rsA END IF
It needs 3 source registers, but FMA and ADD-wide already need that,
and 2 dest registers, and could be implemented as a single ALU uOp.




Anton Ertl

Jun 11, 2023, 10:57:39 AM
EricP <ThatWould...@thevillage.com> writes:
>I was driven to instructions with 2-dest results by my desire to
>(a) not have condition codes and (b) wanting to handle add with carry
>in a straight forward way.

I have expanded my answer to that problem into a paper:
<http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.

Of course, the question is whether you consider this answer
straightforward.

>Variable length instructions opens the ISA to much useful functionality
>like full size immediate operands but requires a more complicated fetch
>unit than fixed size. This cost is generally accepted today.

Counterexample: ARM A64 (and ARM has ample experience with
variable-size instructions).

>As to the cost of 2 dest registers, at the low end for an in-order,
>single issue pipeline with 1 register write port, taking an extra clock
>to write a second result does not add any complexity.

At least not much.

But what's the difference from


add s0 = a, b
sltu carry0 = s0,a

which also takes two cycles? Or do you have a three-input
add-with-carry? That would certainly solve the major problem that
MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
carry-in.
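That two-instruction sequence corresponds to the usual C idiom for recovering the carry-out of an unsigned add (a sketch of the `sltu` trick, not code from the paper; the function name is mine):

```c
#include <assert.h>
#include <stdint.h>

/* Compute sum and carry-out of a full 64-bit add: the software
   equivalent of "add s0 = a, b; sltu carry0 = s0, a". */
static uint64_t add_carry(uint64_t a, uint64_t b, uint64_t *carry)
{
    uint64_t s = a + b;   /* wraps mod 2^64 */
    *carry = (s < a);     /* sum less than an operand => unsigned overflow */
    return s;
}
```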

>For a multi-lane OoO uArch my concern was for Rename to NOT require
>two register allocation units for each lane, adding greatly to the cost
>and complexity, and which would mostly sit unused burning power.
>
>My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>split across two consecutive lanes. Rename allocates the 1-dest registers
>to uOps as normal, then fuses the two pieces and outputs a single uOp to
>one output lane to Dispatch.

I wonder if that's what ARM are doing for their multi-result
instructions (load-pair with pre/post-indexed addressing mode consumes
three write ports).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

robf...@gmail.com

Jun 11, 2023, 12:59:33 PM
On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
> >I was driven to instructions with 2-dest results by my desire to
> >(a) not have condition codes and (b) wanting to handle add with carry
> >in a straight forward way.
> I have expanded my answer to that problem into a paper:
> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>
> Of course, the question is whether you consider this answer
> straightforward.
> >Variable length instructions opens the ISA to much useful functionality
> >like full size immediate operands but requires a more complicated fetch
> >unit than fixed size. This cost is generally accepted today.
> Counterexample: ARM A64 (and ARM has ample experience with
> variable-size instructions).

I just decided not long ago to go with a fixed-size instruction, because
the variable-length calculations ended up on the critical timing path for
an FPGA design. Postfixes can effectively extend the instruction with constants.

> >As to the cost of 2 dest registers, at the low end for an in-order,
> >single issue pipeline with 1 register write port, taking an extra clock
> >to write a second result does not add any complexity.
> At least not much.
>
> But what's the difference from
>
>
> add s0 = a, b
> sltu carry0 = s0,a
>
> which also takes two cycles? Or do you have a three-input
> add-with-carry? That would certainly solve the major problem that
> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> carry-in.
> >For a multi-lane OoO uArch my concern was for Rename to NOT require
> >two register allocation units for each lane, adding greatly to the cost
> >and complexity, and which would mostly sit unused burning power.
> >
> >My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> >split across two consecutive lanes. Rename allocates the 1-dest registers
> >to uOps as normal, then fuses the two pieces and outputs a single uOp to
> >one output lane to Dispatch.
> I wonder if that's what ARM are doing for their multi-result
> instructions (load-pair with pre/post-indexed addressing mode consumes
> three write ports).
>
An idea I have been toying with is for some operations caching the operands
and results. So, if a divide is done the remainder is available immediately
without having to do a full divide. Works for multiply high too. I think it should
work well with reciprocal calculations too, where the same reciprocal is used
in a loop for instance.

To deal with extended precision ADD in the past I have used a three operand
ADD, combined with a carry generate instruction. The carry generate
instruction takes the same operands as the add instruction but returns the
carry out instead of the sum. It can then be added in the next instruction. It
is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
I think the ADD and the carry-generate can be done in the same clock cycle
in a superscalar CPU.
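A software model of that pairing might look like the sketch below (my rendering of the idea, not robf's ISA; `add3`/`carry3` are made-up names standing in for the three-operand ADD and the matching carry-generate):

```c
#include <assert.h>
#include <stdint.h>

/* Three-operand add: returns the low 64 bits of a + b + c. */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t c)
{
    return a + b + c;
}

/* Carry-generate: same operands, returns the carry-out (0, 1, or 2)
   of the full a + b + c sum instead of the sum itself. */
static uint64_t carry3(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t ab = a + b, s = ab + c;
    return (uint64_t)(ab < a) + (uint64_t)(s < ab);
}

/* 128-bit add from 64-bit limbs, using the ADD / carry-generate pair;
   both low-limb operations see the same operands, as in the post. */
static void add128(uint64_t alo, uint64_t ahi, uint64_t blo, uint64_t bhi,
                   uint64_t *rlo, uint64_t *rhi)
{
    *rlo = add3(alo, blo, 0);
    *rhi = add3(ahi, bhi, carry3(alo, blo, 0));
}
```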

BGB

Jun 11, 2023, 3:47:51 PM
On 6/11/2023 11:59 AM, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>> Variable length instructions opens the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
> I just decided not to long ago to go with a fixed sized instruction because
> variable length calculations became on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.
>

I went with prefixes.

If I designed another (new) variable-length ISA, it is probable I would
still go with prefixes.

Though, I would likely start (this time) with the assumption of the
32-bit instructions as baseline, and the 16-bit as a shorthand. My
encoding is currently a little messy in that (in its original form) the
32-bit instructions were themselves a prefix encoding (of two 16-bit parts).



Granted, prefixes have the downside of turning the immediate values into
confetti... But, then again, there is RISC-V with fixed-length
instructions and the immediate fields are still confetti...


Doesn't really affect the CPU too much, but admittedly it is kind of a pain
for the assembler and linker (the linker needing a larger number of reloc
types). Though, one can reduce the annoyance, say, by making all PC-rel
offsets either Byte or Word (depending on the op), and all GBR-Rel ops
Byte based, ... At least cutting down on the needed number of reloc types.


Well, except when the encoding rules change from one instruction to
another, which is admittedly thus far why I haven't bothered with
RISC-V's 'C' extension.

Like, RVC is sort of like someone looked at Thumb and was then like
"Hold my beer!".


One other option would be, say, to have variable length ops with a
size/form bucket, say:
000: 32-bit op, 3R Forms
001: 32-bit op, 3RI Forms
010: 64-bit op, 3R/4R/etc forms
011: 64-bit op, 3RI Forms, Imm32
100: 96-bit op, -
101: 96-bit op, 3RI Forms, Imm64


Where, say:
000z-zzzz-zzdd-dddd ssss-sstt-tttt-zzzz //3R
001z-zzzz-zzdd-dddd ssss-ssii-iiii-iiii //3RI (Imm10)

011z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm32)
iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm

101z-zzzz-zzdd-dddd ssss-sszz-zzzz-zzzz //3RI (Imm64)
iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm
iiii-iiii-iiii-iiii iiii-iiii-iiii-iiii //Imm

Where, say, the Imm32 and Imm64 encodings operate in (mostly) the same
space as the 3R encodings, but will normally suppress the Rt field in
favor of an immediate. The Imm10 encodings would represent a slightly
different encoding space (likely exclusively Load/Store ops and ALU
Immediate instructions).



This would be nicer for an assembler and linker, but would implicitly
mean that the instruction stream would no longer be "self-aligning".

Say, with a self-aligning stream, if one starts decoding/disassembling
instructions from an arbitrary location in the "text" sections, the
decoding will realign with the instruction stream within a few
instruction words (a property useful for tools like debuggers).
Decoding without that property is possible, but likely to get rather
annoying, rather fast.

One almost may as well have some designated architectural scratchpad
registers in this case.

Say:
DMULU R4, R5 //MACH:MACL = R4*R5
STS MACL, R2 // R2 = MACL

But, yeah, I know of an ISA that did the multiplier this way (cough, SH).

Though, to be fair, it is less bad than doing the multiplier via MMIO
(cough, MSP430).



In my case, the 32-bit widening multiply ops generate a 64-bit result:
MULU.L R4, R5, R2 //Zero-Extended
DMULU.L R4, R5, R2 //Widening

The 64-bit ops are only available in low/high variants, and slow. Had
considered a widening variant (with a 128-bit result), but given these
were only really added for sake of the RISC-V 'M' extension, and 'M'
doesn't have widening ops, I didn't bother in this case.


Though, a widening version would at least (in theory) make it more
useful for implementing 128-bit multiply, nevermind if it would still
currently be faster to build 128-bit multiply using the 32-bit multiply
ops (but, this part could be streamlined slightly if I added helper ops
to do the high/low and high-high multiplies without needing to use a
bunch of shift instructions...).
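For reference, the usual decomposition of a 64x64-&gt;128 unsigned multiply into four 32x32-&gt;64 widening partial products looks like this (a generic sketch of the shift-heavy pattern, not BJX2-specific code):

```c
#include <assert.h>
#include <stdint.h>

/* 64x64 -> 128 unsigned multiply from four 32x32 -> 64 partial
   products; the shifts and recombination are exactly what dedicated
   high/low helper ops would streamline. */
static void mul64x64_128(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;   /* low  x low  */
    uint64_t p01 = a0 * b1;   /* low  x high */
    uint64_t p10 = a1 * b0;   /* high x low  */
    uint64_t p11 = a1 * b1;   /* high x high */

    uint64_t mid  = p01 + (p00 >> 32);       /* cannot overflow 64 bits */
    uint64_t mid2 = p10 + (uint32_t)mid;

    *lo = (mid2 << 32) | (uint32_t)p00;
    *hi = p11 + (mid >> 32) + (mid2 >> 32);
}
```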



Well, and the (slow) 64-bit multiplier was still left out to try to
shoe-horn a 3-wide version of the BJX2 core into an XC7S50 (along with a
bunch of other features).

But, I still need to find a way to shave off some more LUTs, as I want
to be able to have a CPI based camera-interface module, and I still
don't currently have the LUT budget for this.

... decided to leave out going into the specifics of the CPI camera
interface or how the module would work (or my annoyances trying to fit
all this into the XC7S50 on the Arty S7...).

But, also annoyingly: I can't use my Nexys A7 board either, as it
doesn't have enough unassigned IO (all those
LEDs/switches/7-segment-display/VGA-port/... eating up much of the
FPGA's total IO pins). Its "sibling" (the Arty A7) mostly just leaves a
lot more of this to PMODs and similar.

MitchAlsup
Jun 11, 2023, 4:48:51 PM
On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:

> > >Variable length instructions opens the ISA to much useful functionality
> > >like full size immediate operands but requires a more complicated fetch
> > >unit than fixed size. This cost is generally accepted today.
> > Counterexample: ARM A64 (and ARM has ample experience with
> > variable-size instructions).
<
> I just decided not too long ago to go with a fixed-size instruction because
> variable length calculations became on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.
<
I found an encoding scheme whereby determining the length of an
instruction is 4 gates of delay covering 1-5 word instructions. It is
entirely parallel and can be treeified at 1 gate per doubling of the
number of instructions decoded. Thus, in 1 stage of logic one can
parse 16 instructions out of 32 words.
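As a software analogue of the treeified parse (my sketch, not the actual My 66000 circuit): treat each word position as having a candidate length, then mark instruction starts by pointer doubling, which takes about log2(n) rounds instead of n serial hops. The `len`/`next`/`start` names are mine; in hardware the marks would stay in unary (one-hot) form and feed the muxes directly.

```c
#include <assert.h>

#define MAXW 64

/* Hypothetical model: len[i] (>= 1, in words) is the length an
 * instruction would have if one started at word i; start[] marks the
 * words where instructions actually begin.  Serial reference: */
static void starts_serial(const int len[], int n, int start[])
{
    for (int i = 0; i < n; i++) start[i] = 0;
    for (int i = 0; i < n; i += len[i]) start[i] = 1;
}

/* Log-depth version: after round k, next[i] skips 2^k instructions,
 * so all starts are marked in ~log2(n) rounds. */
static void starts_parallel(const int len[], int n, int start[])
{
    int next[MAXW], next2[MAXW], start2[MAXW];
    for (int i = 0; i < n; i++) {
        start[i] = 0;
        next[i] = (i + len[i] < n) ? i + len[i] : n;
    }
    start[0] = 1;
    for (int step = 1; step < n; step *= 2) {
        for (int i = 0; i < n; i++) {
            start2[i] = start[i];
            next2[i] = (next[i] < n) ? next[next[i]] : n;   /* double the hop */
        }
        for (int i = 0; i < n; i++)
            if (start[i] && next[i] < n) start2[next[i]] = 1;  /* mark next start */
        for (int i = 0; i < n; i++) {
            start[i] = start2[i];
            next[i] = next2[i];
        }
    }
}
```

Only chains that begin at word 0 matter, so the garbage `len` values at non-start positions never affect the result.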
<
Now, of course, all this takes place in unary form, which is the natural
form of hardware parsing of bit strings into useful containers. Why unary ?
Unary becomes the inputs to multiplexers--obviating the need for binary
to unary decoding.
<
Perhaps your variable length decoding mechanism was not very refined.
<
> > >As to the cost of 2 dest registers, at the low end for an in-order,
> > >single issue pipeline with 1 register write port, taking an extra clock
> > >to write a second result does not add any complexity.
<
For My 66000 CARRY, I get rid of the complexity of 2 result registers
by having a recirculating bus, aptly called the Carry-Bus, that recirculates
results to operands until the end of the CARRY shadow. And if the last
calculation does not produce a carry-out, the result does not get written.
Much of the time the carry result is never "written" into the register file.
<
> > At least not much.
> >
> > But what's the difference from
> >
> >
> > add s0 = a, b
> > sltu carry0 = s0,a
> >
> > which also takes two cycles? Or do you have a three-input
> > add-with-carry? That would certainly solve the major problem that
> > MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> > carry-in.
> > >For a multi-lane OoO uArch my concern was for Rename to NOT require
> > >two register allocation units for each lane, adding greatly to the cost
> > >and complexity, and which would mostly sit unused burning power.
> > >
> > >My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
> > >split across two consecutive lanes. Rename allocates the 1-dest registers
> > >to uOps as normal, then fuses the two pieces and outputs a single uOp to
> > >one output lane to Dispatch.
> > I wonder if that's what ARM are doing for their multi-result
> > instructions (load-pair with pre/post-indexed addressing mode consumes
> > three write ports).
> >
> An idea I have been toying with is for some operations caching the operands
> and results. So, if a divide is done the remainder is available immediately
> without having to do a full divide. Works for multiply high too. I think it should
> work well with reciprocal calculations too, where the same reciprocal is used
> in a loop for instance.
<
My VEC instruction informs the hardware which registers hold live results when
the LOOP terminates. This means a majority of results produced inside a loop
never need to actually get written into the register file, the result just has to be
passed on as an operand. This saves power.
<
>
> To deal with extended precision ADD in the past I have used a three operand
> ADD, combined with a carry generate instruction. The carry generate
> instruction takes the same operands as the add instruction but returns the
> carry out instead of the sum. It can then be added in the next instruction. It
> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> I think the ADD and the carry-generate can be done in the same clock cycle
> in a superscalar CPU.
<
My 66000 - 256-bit integer add::
<
CARRY R16,{{I}{IO}{IO}{O}}
ADD R12,R4,R8 // carry Out only
ADD R13,R5,R9 // Carry In and Out
ADD R14,R6,R10 // Carry In and Out
ADD R15,R7,R11 // Carry In only
<
Here, in typical use, the carry register does not have to be read from the
register file nor does it have to get written at the end, either.
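In portable C, the shadowed sequence computes the equivalent of the following (a sketch of the arithmetic only; `add256` and the limb ordering are my own framing, least-significant limb first):

```c
#include <assert.h>
#include <stdint.h>

/* 256-bit add over four 64-bit limbs.  Mirrors the shadowed ADDs:
 * the first produces only a carry out, the middle two take a carry
 * in and pass one out, the last only consumes a carry. */
static void add256(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t s = a[i] + b[i];
        unsigned c1 = (s < a[i]);    /* carry out of a[i] + b[i]        */
        r[i] = s + carry;
        carry = c1 | (r[i] < s);     /* at most one of the two can fire */
    }
}
```

The `carry` local plays the role of the Carry-Bus: it recirculates between iterations without ever touching the "register file" (the arrays).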

MitchAlsup
Jun 11, 2023, 4:58:15 PM
There is no particular reason to use RISC-V ISA as anything but a
stick in the sand......
>
> Doesn't really affect the CPU too much, but admittedly is kind of a pain
> for the assembler and linker (linker needing a larger number of reloc
> types). Though, one can reduce the annoyance, say, by making all PC-rel
> offsets either Byte or Word (depending on the op), and all GBR-Rel ops
> Byte based, ... At least cutting down on the needed number of reloc types.
>
I did not end up having to need any of that.
>
> Well, except when the encoding rules change from one instruction to
> another, which is admittedly thus far why I haven't bothered with
> RISC-V's 'C' extension.
>
> Like, RVC is sort of like someone looked at Thumb and was then like
> "Hold my beer!".
>
>
> One other option would be, say, to have variable length ops with a
> size/form bucket, say:
> 000: 32-bit op, 3R Forms
> 001: 32-bit op, 3RI Forms
> 010: 64-bit op, 3R/4R/etc forms
> 011: 64-bit op, 3RI Forms, Imm32
> 100: 96-bit op, -
> 101: 96-bit op, 3RI Forms, Imm64
<
For my part, every instruction gets 1 immediate and/or 1 displacement.
Calculation instructions get 1 immediate
Loads get 1 displacement
Stores get 1 displacement and 1 immediate
<
Also note: everything needed to determine the length of an instruction
is available in the first 32-bits of the instruction.
<
> My rationale was that the compiler would be "pretty good" at simplifying
the instructions such that when it ran into a DAG where a node needed
2 constants, a vast majority of the time it could perform the calculation
> at compile time. Evidence has borne this out.
Once you bite off on the fact that you need to support (double) FMACs,
all of the calculation resources are present to do integer multiply
fast... and it is used often enough to justify making it smell like an
integer ADD in format and operand routing.

EricP
Jun 11, 2023, 8:11:09 PM
The various integer ADDs formats I came up with
(single and wide means one or two integer register specifiers)

#1 ADD single = single + single (3 registers)
ADDTS trap signed overflow
ADDTU trap unsigned carry
#2 ADDW2 wide = single + single (4 registers)
#3 ADDW3 wide = single + single + single (5 registers)
#4 ADDWW wide = wide + single (5 registers)

#2 and #3 are for propagating a carry horizontally along a chain.
#4 is for the equivalent of a carry-save add on a column.
The My 66000 CARRY prefix can do an arbitrarily long horizontal
carry propagate but didn't seem to have a way to specify #4.
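For concreteness, here is the arithmetic those formats compute, modeling a "wide" as a (lo, hi) register pair. This is my reading of the semantics; the `wide_t`/`addw2`/`addw3`/`addww` names just mirror the mnemonics above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t lo, hi; } wide_t;   /* two integer registers */

/* ADDW2: wide = single + single (the carry lands in the high register) */
static wide_t addw2(uint64_t a, uint64_t b)
{
    wide_t w;
    w.lo = a + b;
    w.hi = (w.lo < a);
    return w;
}

/* ADDW3: wide = single + single + single (hi can be 0, 1, or 2) */
static wide_t addw3(uint64_t a, uint64_t b, uint64_t c)
{
    wide_t w = addw2(a, b);
    uint64_t s = w.lo + c;
    w.hi += (s < w.lo);
    w.lo = s;
    return w;
}

/* ADDWW: wide = wide + single (the carry-save-style column add) */
static wide_t addww(wide_t a, uint64_t b)
{
    wide_t w;
    w.lo = a.lo + b;
    w.hi = a.hi + (w.lo < a.lo);
    return w;
}
```

ADDW2/ADDW3 propagate a carry horizontally along a chain of limbs; ADDWW accumulates a column without losing the overflow.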


BGB
Jun 11, 2023, 8:15:32 PM
RISC-V is effectively the "ISA de jure" at this moment in time.

Whether or not it is actually the best ISA possible in an engineering sense.

But, yeah, some of us would prefer that, if 20 bits are going to be put
next to each other and used as a 20 bit value, then the bits would also
be a contiguous 20 bit value...

Or, that there is some semblance of "actually has a good reason" for the
encodings of some instructions to differ from other instructions, ...


OTOH:
FEii_jjjj-F2nm_Zekk
Ends up with a 33 bit immediate as:
{ Ei, ii, jjjj, kk }

Which, while not quite as nice as a contiguous value, is still better
IMO than a bunch of bits with no particular order relative to each other.

Well, there is:
F8Zn_iiii
Which does have a contiguous immediate at the expense of being
inconsistent with the encodings for the rest of the ISA, but then again:
F8ni_Zeii
Is a bit more chewed up, so it is a tradeoff.


>>
>> Doesn't really affect the CPU too much, but admittedly is kind of a pain
>> for the assembler and linker (linker needing a larger number of reloc
>> types). Though, one can reduce the annoyance, say, by making all PC-rel
>> offsets either Byte or Word (depending on the op), and all GBR-Rel ops
>> Byte based, ... At least cutting down on the needed number of reloc types.
>>
> I did not end up having to need any of that.

In my case, Load/Store displacements were normally element-size scaled,
but for the PC-Rel and GBR-Rel cases, I ended up special casing them to
always use a 1-byte element scale.

Also sorta works given PC and GBR aren't normally accessible from GPR
space, and come at the expense of two other registers not being allowed
as a base register for memory Load/Store ops.

The alternative having been, instead, to have fewer GPRs and cut off
more of the register space for SPRs.


>>
>> Well, except when the encoding rules change from one instruction to
>> another, which is admittedly thus far why I haven't bothered with
>> RISC-V's 'C' extension.
>>
>> Like, RVC is sort of like someone looked at Thumb and was then like
>> "Hold my beer!".
>>
>>
>> One other option would be, say, to have variable length ops with a
>> size/form bucket, say:
>> 000: 32-bit op, 3R Forms
>> 001: 32-bit op, 3RI Forms
>> 010: 64-bit op, 3R/4R/etc forms
>> 011: 64-bit op, 3RI Forms, Imm32
>> 100: 96-bit op, -
>> 101: 96-bit op, 3RI Forms, Imm64
> <
> For my part, every instruction gets 1 immediate and/or 1 displacement.
> Calculation instructions get 1 immediate
> Loads get 1 displacement
> Stores get 1 displacement and 1 immediate
> <
> Also note: everything needed to determine the length of an instruction
> is available in the first 32-bits of the instruction.
> <
> My rationale was that the compiler would be "pretty good" at simplifying
> the instructions such that when it ran into a DAG where a node needed
> 2 constants, a vast majority of the time it could perform the calculation
> at compile time. Evidence has borne this out.

OK.

As noted, I don't really have any 2-immediate cases.

I didn't really see many cases where dealing with 2 immediate cases
seemed "worth it".


The rare partial exceptions being, say:
Store a small constant to memory;
Shift a small value by a constant.
Say: Imm4<<Imm6

But, these cases aren't quite "common enough" to justify the cost.
The FMUL and IMUL cases still don't share resources.

Binary64 FMUL is slightly more expensive than 32-bit IMUL, but trying to
combine them would likely cost more than it would save, and would also
cost in terms of IMUL latency.

The resources needed for a fast 64-bit integer multiply are somewhat
more than those needed for a "corner cutting" Binary64 multiply (or vs a
slow Shift-Add multiplier).


But, integer multiply is common enough that one does sort of want an at
least want it to "not suck".


robf...@gmail.com
Jun 11, 2023, 8:21:03 PM
On Sunday, June 11, 2023 at 4:48:51 PM UTC-4, MitchAlsup wrote:
> On Sunday, June 11, 2023 at 11:59:33 AM UTC-5, robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>
> > > >Variable length instructions opens the ISA to much useful functionality
> > > >like full size immediate operands but requires a more complicated fetch
> > > >unit than fixed size. This cost is generally accepted today.
> > > Counterexample: ARM A64 (and ARM has ample experience with
> > > variable-size instructions).
> <
> > I just decided not too long ago to go with a fixed-size instruction because
> > variable length calculations became on the critical timing path for an FPGA
> > design. Postfixes can effectively extend the instruction with constants.
> <
> I found an encoding scheme whereby determining the length of an
> instruction is 4 gates of delay covering 1-5 word instructions. It is
> entirely parallel and can be treeified at 1 gate per doubling of the
> number of instructions decoded. Thus, in 1 stage of logic one can
> parse 16-instructions out of 32-words.
> <
> Now, of course, all this takes place in unary form, which is the natural
> form of hardware parsing of bit strings into useful containers. Why unary ?
> Unary becomes the inputs to multiplexers--obviating the need for binary
> to unary decoding.
> <
> Perhaps your variable length decoding mechanism was not very refined.

I think it was probably okay, just a lack of pipelining at the fetch stage slowed
things down. The length was determined by a lookup of the first eight bits of
the instruction. I wanted to compact the length bits down to fractions of bits,
I thought it somewhat wasteful to have them for every instruction. The lookup
table was only 1 LUT deep, but a second layer of LUTs was needed for
decoding. Part of the issue was doing too much at the fetch stage without
pipelining it. The instruction alignment using a 512+ wide bit shift was also
being done in the same clock cycle. It was fed from the output of an odd, even
mux. A consequence of using 40-bit wide instructions. So, that took a couple
of layers of LUTs. On top of that LUTs were being read combo style from the
LUT ram. And on top of that NOP instructions were being inserted for cache
misses, and IRQ instructions for interrupts. So, another 3:1 mux in the fetch
stage. It needed more pipelining. I found it would be fast enough though by
removing the length lookup. Other pieces of the design showed up on the
critical path. It is scary how much logic takes place in one cycle. I did not
want to add extra pipeline stages at fetch as that creates more flushing of
instructions during branches.

I hit a kind of minimum useful cycle time when the result forwarding from
the output of the ALU back to the input appeared on the timing path. Removing
the forwarding muxes would cut the performance in half, so it was better to
go with a longer clock cycle. That gave me some leeway with the rest of the
design.

I had to slow the whole design down though as it got too large for the FPGA.
Reduced the size of the register file by using block RAMs instead of LUTs and
they needed to be clocked at double the CPU rate. Removed an ALU so an
FPU could be supported. A double frequency clock is fed to some parts of the
design like the multiplier / divider, square root.

Saving up for a larger FPGA.

A lot more could be fit into the FPGA at the cost of some performance. The
core is only about 62k LUTs with an FPU. The rest of the system including
MMU takes about 35k LUTs. I have yet to add the neural network accelerator,
NNA, and graphics accelerator.

BGB
Jun 11, 2023, 11:29:55 PM
Hmm...

My case:
BJX2 Core, with all of the current "fancy stuff" enabled:
3-wide, multiple ISA modes, 96-bit VAs, "all the stuff", ...
~ 40k LUT.
Current version which fits into the XC7S50:
3-wide (6R3W), FPU and SIMD, 48-bit VAs
~ 28k LUT
RISC-like subset:
1-wide, 32-bit address space, no FPU or MMU
3R1W register file.
~ 10k LUT

There was, at one point, a variant with a 4R2W register file, which was
unattractive as it had nearly the cost of the 3-wide case, while being
barely more capable than the 1-wide case.


Smallest cores I have done:
~ 5k LUTs
These being roughly:
16 GPRs (32-bit), 16-bit instructions, aligned-only memory
No FPU or MMU, no multiply, fixed-shift only
2R1W register file.

The 5k LUT cores would be fairly minimalist even vs RISC-V.
The only real way I know of to get down to this size is to omit such
niceties as unaligned memory access and a shift unit.

Also these had "L1 caches" which would only hold a single cache line
(having an array of cache-lines significantly increases the cost of the
L1 cache; but also leads to a fairly drastic performance improvement).

...


As-is, I can do:
2 cores on XC7A200T (device: 134k LUT)
1 core on XC7A100T (device: 63k LUT)
1 core (reduced features) on XC7S50 (device: 32k LUT)
1 core (RISC-like only) on XC7S25 (device: 14k LUT)

I could maybe fit 3 cores or do a dedicated GPU core or similar on the
XC7A200T.



I technically now have a board with an XC7K325T (gotten for surprisingly
cheap off AliExpress), but haven't had much luck with getting the
open-source toolchain working, and the free version of Vivado doesn't
support this FPGA, ...

Also the open-source tools are a little sketchy, as they apparently
depend on having a specific version of the Linux version of Vivado that
gets invoked in the background (and a proper FOSS toolchain should have
no dependence on the proprietary toolchain it is meant to replace).


Biggest in free version seems to be XC7K160T (but, QMTECH wasn't selling
any boards with this chip, nor is anyone else selling boards with these
for "affordable" prices).

The XC7A200T has more LUTs than the XC7K160T, but the latter could run
the core at higher clock speeds.

robf...@gmail.com
Jun 12, 2023, 2:02:04 AM
A 10r2w register file is being used for the Thor core. It is quite large. It
was originally made from LUTs and could run at 50 MHz. But it is now
block RAM. 3 source operand registers, plus the target register and
predicate register, for two lanes of execution. But it now also supports
multiple register sets, should I get around to a SMT version of the core.

>
> Smallest cores I have done:
> ~ 5k LUTs
> These being roughly:
> 16 GPRs (32-bit), 16-bit instructions, aligned-only memory
> No FPU or MMU, no multiply, fixed-shift only
> 2R1W register file.
>
> The 5k LUT cores would be fairly minimalist even vs RISC-V.
> The only real way I know of to get down to this size being to omit such
> niceties as unaligned memory access and a shift unit.
>
> Also these had "L1 caches" which would only hold a single cache line
> (having an array of cache-lines significantly increases the cost of the
> L1 cache; but also leads to a fairly drastic performance improvement).
>

I started small, limited by an XC4035? FPGA. Around 600 LUTs. I had
wire-wrapped my own board with 64kx8 DRAM attached and composite video
output. I have constantly gotten to larger and larger cores. I still have a
couple of the original chips somewhere. Maybe collector's items now.

> ...
>
>
> As-is, I can do:
> 2 cores on XC7A200T (device: 134k LUT)
> 1 core on XC7A100T (device: 63k LUT)
> 1 core (reduced features) on XC7S50 (device: 32k LUT)
> 1 core (RISC-like only) on XC7S25 (device: 14k LUT)
>
> I could maybe fit 3 cores or do a dedicated GPU core or similar on the
> XC7A200T.
>
>
>
> I technically now have a board with an XC7K325T (gotten for surprisingly
> cheap off AliExpress), but haven't had much luck with getting the
> open-source toolchain working, and the free version of Vivado doesn't
> support this FPGA, ...

I was looking at an XC7K410T on eBay, but the 410 was not listed as
supported by the open-source tools; the 420 is. I could not find a 420
for sale.

>
> Also the open-source tools are a little sketchy, as they apparently
> depend on having a specific version of the Linux version of Vivado that
> gets invoked in the background (and a proper FOSS toolchain should have
> no dependence on the proprietary toolchain it is meant to replace).

That is good to know. It sounds like the open-source tools have a ways to
go yet.

Terje Mathisen
Jun 12, 2023, 2:51:33 AM
Presumably, the new exact AugmentedAddition/Multiplication FP ops are a
part of that new set. :-)

robf...@gmail.com
Jun 12, 2023, 3:06:40 AM
Ooops. I think that was a XC4005/4010.

BGB
Jun 12, 2023, 5:18:30 AM
Yeah...

I had looked at a 12R6W regfile at one point, but quick estimates were
that this was not going to be friendly to resource cost...

My current 6R3W regfile is implemented in LUTRAM.
Each normal instruction is only reserved 2 register ports;
Instructions that need more than 2 ports will use multiple lanes.


This mostly works with 64 GPRs and my current strategy of only having a
single set of registers.

Having multiple bank-swapped register sets would be a problem though...


>>
>> Smallest cores I have done:
>> ~ 5k LUTs
>> These being roughly:
>> 16 GPRs (32-bit), 16-bit instructions, aligned-only memory
>> No FPU or MMU, no multiply, fixed-shift only
>> 2R1W register file.
>>
>> The 5k LUT cores would be fairly minimalist even vs RISC-V.
>> The only real way I know of to get down to this size being to omit such
>> niceties as unaligned memory access and a shift unit.
>>
>> Also these had "L1 caches" which would only hold a single cache line
>> (having an array of cache-lines significantly increases the cost of the
>> L1 cache; but also leads to a fairly drastic performance improvement).
>>
>
> I started small, limited by a XC4035? FPGA. A round 600 LUTs. I had wire-
> wrapped my own board with 64kx8 dram attached and composite video
> output. I have constantly gotten to larger and larger cores. I still have a
> couple of the original chips somewhere. Maybe collector's items now.
>

OK.

I got into things more recently, when the 7-series chips were already in
common use.

First board I got had an XC7S50.


Trying to keep size reasonable, but always more stuff to burn LUTs on.

For example, just went and added an experimental hack so that the
integer multiplier can communicate with the Shift-Add unit, and if the
multiplier sees a division it can handle by turning it into a multiply
by reciprocal, it can signal the divide unit to let the multiplier
handle it instead.

This tweak can boost Dhrystone score to a little over 80k (0.91 DMIPS/MHz).
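The usual trick behind this kind of hack is multiplying by the fixed-point reciprocal ceil(2^64/d) and keeping the high half of a widening product. A generic sketch follows (BJX2's internal mechanism may well differ; `div_by_recip` is my name). It relies on the GCC/Clang `unsigned __int128` extension for the 64x64 widening multiply:

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 32-bit division via one widening multiply:
 *   q = floor(x * ceil(2^64 / d) / 2^64)
 * which is exact for every 32-bit x when d is a 32-bit divisor,
 * because the reciprocal's error is below 2^-32. */
static uint32_t div_by_recip(uint32_t x, uint32_t d)
{
    if (d == 1)
        return x;                          /* 2^64/1 won't fit in 64 bits */
    uint64_t recip = ~(uint64_t)0 / d + 1; /* == ceil(2^64 / d)           */
    unsigned __int128 p = (unsigned __int128)x * recip;  /* 64x64 -> 128  */
    return (uint32_t)(p >> 64);
}
```

From the multiplier's point of view this is just one more widening multiply plus a shift, which is why it can steal the work from a slow shift-subtract divider.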


>> ...
>>
>>
>> As-is, I can do:
>> 2 cores on XC7A200T (device: 134k LUT)
>> 1 core on XC7A100T (device: 63k LUT)
>> 1 core (reduced features) on XC7S50 (device: 32k LUT)
>> 1 core (RISC-like only) on XC7S25 (device: 14k LUT)
>>
>> I could maybe fit 3 cores or do a dedicated GPU core or similar on the
>> XC7A200T.
>>
>>
>>
>> I technically now have a board with an XC7K325T (gotten for surprisingly
>> cheap off AliExpress), but haven't had much luck with getting the
>> open-source toolchain working, and the free version of Vivado doesn't
>> support this FPGA, ...
>
> I was looking at a XC7K410T on ebay, but the 410 was not listed as
> supported by the open-source tools; the 420 is. I could not find a 420
> for sale.
>

Well, and not supported by the free Vivado WebPack either...


Annoying that Vivado goes from free to "very expensive".
Would have been nice if one could be like, "Hey, I have this FPGA board
I got for $100 off AliExpress, can I just give you guys, say, $35 or
something so I can generate bitsteams for it?..."


>>
>> Also the open-source tools are a little sketchy, as they apparently
>> depend on having a specific version of the Linux version of Vivado that
>> gets invoked in the background (and a proper FOSS toolchain should have
>> no dependence on the proprietary toolchain it is meant to replace).
>
> That is good to know. It sounds like the open-source tools have a ways to
> go yet.
>

Seems so...

EricP
Jun 12, 2023, 1:12:37 PM
Anton Ertl wrote:
> EricP <ThatWould...@thevillage.com> writes:
>> I was driven to instructions with 2-dest results by my desire to
>> (a) not have condition codes and (b) wanting to handle add with carry
>> in a straight forward way.
>
> I have expanded my answer to that problem into a paper:
> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>
> Of course, the question is whether you consider this answer
> straightforward.
>
>> Variable length instructions opens the ISA to much useful functionality
>> like full size immediate operands but requires a more complicated fetch
>> unit than fixed size. This cost is generally accepted today.
>
> Counterexample: ARM A64 (and ARM has ample experience with
> variable-size instructions).

Yes, it currently uses aligned 4-byte instruction words.
It also has condition codes, addressing modes like pre/post index, and
load/store pair, so the RISC tenets are already abandoned.

If they want to expand their ISA for, say, instructions requiring more
registers or longer immediates then they would pretty much have to do so
in 4-byte granules. That could result in lots of pad 0's.

>> As to the cost of 2 dest registers, at the low end for an in-order,
>> single issue pipeline with 1 register write port, taking an extra clock
>> to write a second result does not add any complexity.
>
> At least not much.
>
> But what's the difference from
>
>
> add s0 = a, b
> sltu carry0 = s0,a
>
> which also takes two cycles? Or do you have a three-input
> add-with-carry? That would certainly solve the major problem that
> MIPS, Alpha, and RISC-V have for this stuff, which is dealing with the
> carry-in.

Yeah, I crossed the Rubicon and went with 2 and 3 input, wide output.
It's not just add. There are so many other instructions that compute
most or all of a wide result, then throw the high part away.
Then a second instruction repeats the first but throws the low part away.

Better to have a single instruction that does what the customer wants.

>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>> two register allocation units for each lane, adding greatly to the cost
>> and complexity, and which would mostly sit unused burning power.
>>
>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>> split across two consecutive lanes. Rename allocates the 1-dest registers
>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>> one output lane to Dispatch.
>
> I wonder if that's what ARM are doing for their multi-result
> instructions (load-pair with pre/post-indexed addressing mode consumes
> three write ports).
>
> - anton

Hard to guess without knowing more about the uArch as different design
approaches, separate future register file (FRF) (aka ROB) and
architecture register file (ARF), or merged physical register file (PRF),
have different allocation, free, and renaming mechanisms.
And this all interacts with branch checkpoint/rollback,
how uOp scheduling and launch works, etc.


Scott Lurndal
Jun 12, 2023, 1:39:40 PM
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>>
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>
>>> Variable length instructions opens the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>>
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
>Yes, it is currently aligned 4-byte instruction words.
>It also has condition codes, address modes like pre/post index,
>load/store pair. So the risc tenets are already abandon.
>
>If they want to expand their ISA for, say, instructions requiring more
>registers or longer immediates then they would pretty much have to do so
>in 4-byte granules. That could result in lots of pad 0's.

They already have instructions that require eight registers. They're
just required to be numbered consecutively.

MitchAlsup
Jun 12, 2023, 2:03:05 PM
Mc 88120 used a PRF with logical CAMs for the architectural registers,
regular decoders for the physical registers, and a history table of CAM
valid bits to back up the architectural "names" and for allocation of
registers. We could back the mispredicted branch up and be inserting
instructions at the actual target back into the window in the same clock.
...................

EricP
Jun 12, 2023, 2:19:07 PM
robf...@gmail.com wrote:
> On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
>> EricP <ThatWould...@thevillage.com> writes:
>>> I was driven to instructions with 2-dest results by my desire to
>>> (a) not have condition codes and (b) wanting to handle add with carry
>>> in a straight forward way.
>> I have expanded my answer to that problem into a paper:
>> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
>>
>> Of course, the question is whether you consider this answer
>> straightforward.
>>> Variable length instructions opens the ISA to much useful functionality
>>> like full size immediate operands but requires a more complicated fetch
>>> unit than fixed size. This cost is generally accepted today.
>> Counterexample: ARM A64 (and ARM has ample experience with
>> variable-size instructions).
>
> I just decided not too long ago to go with a fixed-size instruction because
> variable length calculations became on the critical timing path for an FPGA
> design. Postfixes can effectively extend the instruction with constants.

You say elsewhere that yours is a 40 bit fixed instruction,
presumably byte aligned. That byte alignment kinda nullifies
most of the advantages of fixed size.

The savings for fixed size come from being aligned on a nice boundary,
these days usually 4 bytes.

Savings come from not having to deal with instruction straddles:
fetch buffer straddles, line straddles, page straddles.
Straddles require extra sequencing logic, machine states, clocks,
stalls because it only has half of something, dealing with stall aborts.

Then there is the instruction alignment shifter. For 4-byte aligned
instructions and 16-byte fetch buffers that's 32 * 4:1 muxes, while
byte aligned needs 8 * 16:1 muxes plus 8 * 5:1 muxes to deal with
partially fetched straddles.

And since handling the 4-byte aligned instructions is so simple
they might be run directly into the decoder from the fetch buffer,
combining fetch and decode into a single stage for a shorter pipeline.
Whereas the byte aligned likely requires a separate fetch stage.

All in all, it's just another brick in the wall.

Variable length needs all of the above too, but gets flexibility
to add more kinds of instructions in return, although that does
lay more complexity pressure onto the decoder too.

>>> For a multi-lane OoO uArch my concern was for Rename to NOT require
>>> two register allocation units for each lane, adding greatly to the cost
>>> and complexity, and which would mostly sit unused burning power.
>>>
>>> My solution is for Decode to spit out 2-dest instructions as two 1-dest uOps
>>> split across two consecutive lanes. Rename allocates the 1-dest registers
>>> to uOps as normal, then fuses the two pieces and outputs a single uOp to
>>> one output lane to Dispatch.
>> I wonder if that's what ARM are doing for their multi-result
>> instructions (load-pair with pre/post-indexed addressing mode consumes
>> three write ports).
>>
> An idea I have been toying with is for some operations caching the operands
> and results. So, if a divide is done the remainder is available immediately
> without having to do a full divide. Works for multiply high too. I think it should
> work well with reciprocal calculations too, where the same reciprocal is used
> in a loop for instance.
>
> To deal with extended precision ADD in the past I have used a three operand
> ADD, combined with a carry generate instruction. The carry generate
> instruction takes the same operands as the add instruction but returns the
> carry out instead of the sum. It can then be added in the next instruction. It
> is a bit ugly but if one really wants to do multi-word arithmetic it is possible.
> I think the ADD and the carry-generate can be done in the same clock cycle
> in a superscalar CPU.

That's what I did at first too. Sticking with risc tenet of 2 source,
1 dest registers I had ADD for the low part and ADDH for high carry out.

The 2 sources means each of those is really a half adder so it takes
4 instructions to do a full add with carry-in and carry-out.
Breaking with purist risc and allowing 3 source, 1 dest registers
keeps it to the two ADD and ADDH instructions.

Those xxxH instructions show up all over the place. MULHU, MULHS.
The 3 source allows some wide shift and bit field instructions,
but the 1 dest restriction still means you have both xxx and xxxH variants.

And in many of them you are doing most or all the full work
then throwing away either the top or bottom half.
Then repeating it immediately for the other half.

All of that goes back to the 1-dest restriction and the assumption
that doing otherwise adds too much complexity.
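The ADD plus carry-generate pairing can be sketched in C. This is a hedged illustration: `add_lo`/`add_hi` are hypothetical names standing in for the two instructions, and the carry generate is nothing more than an unsigned compare of the truncated sum against an operand:

```c
#include <stdint.h>

/* Sketch of the ADD/ADDH split: the carry generate takes the same
   operands as the add but returns the carry out instead of the sum. */
static uint64_t add_lo(uint64_t a, uint64_t b) { return a + b; }
static uint64_t add_hi(uint64_t a, uint64_t b) { return (a + b) < a; } /* carry out */

/* 128-bit add from 64-bit limbs, built only from the two half-adder ops */
void add128(uint64_t alo, uint64_t ahi, uint64_t blo, uint64_t bhi,
            uint64_t *rlo, uint64_t *rhi)
{
    uint64_t lo = add_lo(alo, blo);
    uint64_t c  = add_hi(alo, blo);
    *rlo = lo;
    *rhi = ahi + bhi + c;   /* the 3-source ADD of the non-purist variant */
}
```

With 3-source ADD/ADDH, each middle limb of a wider add collapses back to two instructions; with only 2-source forms it takes four.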



MitchAlsup

unread,
Jun 12, 2023, 4:24:57 PM6/12/23
to
CARRY solves these problems, the carry-bus solves the routing problem.
Most of the time, the extra register does not need to be read at the
beginning or written at the end...............

Thomas Koenig

unread,
Jun 12, 2023, 5:05:16 PM6/12/23
to
Hmm.. a couple of thoughts.

Assume we have the three-operand instructions

MIN Rt,Ra,Rb,Rc
MID Rt,Ra,Rb,Rc
MAX Rt,Ra,Rb,Rc

which would, as the name suggests, select the minimum, the midpoint
and the maximum of three values. Are these instructions which
could reasonably be done in a single cycle, and issued in parallel
on an OoO machine? That should certainly speed up three-way
sorting, if this is reasonable.

Another idea, potentially even wilder (and probably worse :-)
A "load double and sort" instruction, so

LS Rmin,Rmax,[Rs]

would load the maximum of [Rs] and [Rs+offs] into Rmax and the
minimum of [Rs] and [Rs+offs], with offs being the size of the
data loaded.

Comments? Too horrible even to think about? Has been tried and
does not work because...?

EricP

unread,
Jun 12, 2023, 6:21:57 PM6/12/23
to
Many sorts are not just for an integer vector but also associated tag values.
That is,
Sort (a, b, taga, tagb)
if a > b then
swap(a,b)
swap(taga,tagb)
end if
end Sort

The CMP, CMOV, CMOV way can handle tags, as can CMP CSWAP.
But MIN and MAX approach can't.
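EricP's tagged compare-and-swap maps naturally onto one CMP feeding four CMOVs. A branch-free C sketch (the function name is hypothetical):

```c
#include <stdint.h>

/* Branch-free compare-and-swap of (key, tag) pairs, in the spirit of
   CMP + CMOV: a single compare drives all four conditional moves. */
void sort2_tagged(int64_t *a, int64_t *b, int64_t *taga, int64_t *tagb)
{
    int swap = *a > *b;              /* CMP */
    int64_t ka = *a, kb = *b, ta = *taga, tb = *tagb;
    *a    = swap ? kb : ka;          /* four CMOVs off the one compare */
    *b    = swap ? ka : kb;
    *taga = swap ? tb : ta;
    *tagb = swap ? ta : tb;
}
```

MIN/MAX alone can order the keys but loses which input each key came from, which is exactly what the tag moves preserve.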

MitchAlsup

unread,
Jun 12, 2023, 6:30:51 PM6/12/23
to
On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
> Hmm.. a couple of thoughts.
>
> Assumme we have the three-operand instructions
>
> MIN Rt,Ra,Rb,Rc
> MID Rt,Ra,Rb,Rc
> MAX Rt,Ra,Rb,Rc
<
In the 1980s and very early '90s, some designers thought about using
CAM sort arrays* in hardware. Software would load up a number of
values and write them into the "sorter", after loading all values, soft-
ware would read out all the value in order. No instruction was needed
to cause the sorter to sort; the writing of the values itself performed the sorting.
<
These never went anywhere--draw your own conclusions.
<
(*) CAMs can be made to do things other than exact match detection.
>
> which would, as the name suggests, select the minimum, the mitpoint
> and the maximum of three values. Are these instructions which
> could reasonably be done in a single cycle, and issued in parallel
> on an OoO machine? That should certainly speed up three-way
> sorting, if this is reasonable.
<
Yours is only the 3-operand version, the array I mentioned above
was 32 entry or 64 entry (its been too long).
>
> Another idea, potentially even wilder (and probably worse :-)
> A "load double and sort" instruction, so
>
> LS Rmin,Rmax,[Rs]
>
> would load the maximum of [Rs] and [Rs+offs] into Rmax and the
> minimum of [Rs] and [Rs+offs], with offs being the size of the
> data loaded.
>
> Comments? Too horrible even to think about? Has been tried and
> does not work because...?
<
I, personally, do not think it a good idea to add a comparison stage
to the already difficult AGEN-CACHE-ALIGN portion of the memory
pipeline.

MitchAlsup

unread,
Jun 12, 2023, 6:36:22 PM6/12/23
to
Many sort algorithms allow the user to supply the subroutine that determines
the sort order::
<
SORT( A, length, comparison )
{
for( j = 0; j < length-1; j++ )
for( i = 0; i < length; i++ )
if( comparison( A[i], A[j] ) )
SWAP( A[i], A[j] );
}
<
Yes, I know this is not a good sort routine.

robf...@gmail.com

unread,
Jun 13, 2023, 2:09:57 AM6/13/23
to
On Monday, June 12, 2023 at 2:19:07 PM UTC-4, EricP wrote:
> robf...@gmail.com wrote:
> > On Sunday, June 11, 2023 at 10:57:39 AM UTC-4, Anton Ertl wrote:
> >> EricP <ThatWould...@thevillage.com> writes:
> >>> I was driven to instructions with 2-dest results by my desire to
> >>> (a) not have condition codes and (b) wanting to handle add with carry
> >>> in a straight forward way.
> >> I have expanded my answer to that problem into a paper:
> >> <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
> >>
> >> Of course, the question is whether you consider this answer
> >> straightforward.
> >>> Variable length instructions opens the ISA to much useful functionality
> >>> like full size immediate operands but requires a more complicated fetch
> >>> unit than fixed size. This cost is generally accepted today.
> >> Counterexample: ARM A64 (and ARM has ample experience with
> >> variable-size instructions).
> >
> > I just decided not to long ago to go with a fixed sized instruction because
> > variable length calculations became on the critical timing path for an FPGA
> > design. Postfixes can effectively extend the instruction with constants.
> You say elsewhere that yours is a 40 bit fixed instruction,
> presumably byte aligned. That byte alignment kinda nullifies
> most of the advantages of fixed size.

Yes.
>
> The savings for fixed size come from being aligned on a nice boundary,
> these days usually 4 bytes.

I think this may make it easier to hack things. There is less entropy.

> Savings come from not having to deal with instruction straddles:
> fetch buffer straddles, line straddles, page straddles.
> Straddles require extra sequencing logic, machine states, clocks,
> stalls because it only has half of something, dealing with stall aborts.

Straddles can be handled by using even, odd pairs of lines and fetching
both at the same time, no sequencing logic, state machines or stalls.

> Then there is instruction alignment shifter. For 4-byte aligned
> instructions and 16-byte fetch buffers that's a 32 * 4:1 muxes while
> byte aligned is 8 * 16:1 muxes plus 8 * 5:1 muxes to deal with
> partial fetched straddles.
>
> And since handling the 4-byte aligned instructions is so simple
> they might be run directly into the decoder from the fetch buffer,
> combining fetch and decode into a single stage for a shorter pipeline.
> Whereas the byte aligned likely requires a separate fetch stage.
>
> All in all, it's just another brick in the wall.

I have been considering using 128-bit bundles containing three
40-bit instructions in fixed locations. That would get back some of the
benefits of the 4-byte aligned instructions, but it wastes 6% of memory.
The I$ could be made 120-bits wide so no wastage there.
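Since 40 bits is exactly 5 bytes, the three slots of such a bundle land on byte boundaries, so slot extraction needs no bit shifter at all. A sketch, assuming a little-endian layout with slots at bytes 0, 5, and 10:

```c
#include <stdint.h>

/* Extract instruction slot 0..2 from a 128-bit bundle holding three
   40-bit instructions; 40 bits = 5 bytes, so each slot is byte aligned. */
uint64_t bundle_slot(const uint8_t bundle[16], int slot)
{
    const uint8_t *p = bundle + slot * 5;
    uint64_t v = 0;
    for (int i = 4; i >= 0; i--)     /* assemble little-endian */
        v = (v << 8) | p[i];
    return v;
}
```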
Yes. It is ugly and a lot of instructions. It is a simple approach. There are
about 300 instructions in the Thor ISA. Trying to keep it under 400. I read
somewhere for the linguistic elements, most people can remember about
400 different things. For example 400 DOS calls. Go beyond that and it is
a lookup nightmare.

>
> And in many of them you are doing most or all the full work
> then throwing away either the top or bottom half.
> Then repeating it immediately for the other half.

The work is not wasted if the results are cached.

Thomas Koenig

unread,
Jun 13, 2023, 2:50:10 AM6/13/23
to
MitchAlsup <Mitch...@aol.com> schrieb:
> On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
>> Hmm.. a couple of thoughts.
>>
>> Assumme we have the three-operand instructions
>>
>> MIN Rt,Ra,Rb,Rc
>> MID Rt,Ra,Rb,Rc
>> MAX Rt,Ra,Rb,Rc
><
> In the 1980s and very early '90s, some designers thought about using
> CAM sort arrays* in hardware. Software would load up a number of
> values and write them into the "sorter", after loading all values, soft-
> ware would read out all the value in order. No instruction was needed
> to cause the sorter to sort, the writing of values in performed sorting.
><
> These never went anywhere--draw your own conclusions.

The pendulum of reintegration swung the other way, quickly.

Although there appears to be a SORTL instruction on zSystem, according to
https://blog.share.org/Article/peeking-under-the-hood-of-sort-acceleration-on-z15
which appears to bring some improvement, so...

><
> (*) CAMs can be made to do things other than exact match detection.

Interesting.

MIN and MAX of three registers is straightforward two instructions,
but MID is more challenging. It could be something like

MIN Rtmp1,Ra,Rb
MIN Rtmp2,Ra,Rc
MIN Rtmp3,Rb,Rc
MAX Rtmp1,Rtmp1,Rtmp2
MAX Rtmp1,Rtmp1,Rtmp3

with (assuming single-cycle latency) anything between three
(for a sufficiently wide OoO machine) and five cycles of latency.

The last two instructions would be a three-way MAX, so this
could be shortened to

MIN Rtmp1,Ra,Rb
MIN Rtmp2,Ra,Rc
MIN Rtmp3,Rb,Rc
MAX Rtmp1,Rtmp1,Rtmp2,Rtmp3

in that case, for between two and four cycles (assuming that
three-way MAX is a single cycle). For OoO, MID is then probably
not worth it, especially if it cannot be done in a single
cycle.

Three-way sort could then be

MIN Rmin,Ra,Rb,Rc
MAX Rmax,Ra,Rb,Rc
MIN Rtmp1,Ra,Rb
MIN Rtmp2,Ra,Rc
MIN Rtmp3,Rb,Rc
MAX Rmid,Rtmp1,Rtmp2,Rtmp3

so between two and six cycles.
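The sequence above, transliterated to C with two-operand `min2`/`max2` standing in for the hypothetical register forms:

```c
/* Three-way sort built only from MIN/MAX-style operations, mirroring
   the six-instruction sequence: min of all, max of all, and the mid
   as the maximum of the three pairwise minimums. */
static int min2(int a, int b) { return a < b ? a : b; }
static int max2(int a, int b) { return a > b ? a : b; }
static int max3(int a, int b, int c) { return max2(max2(a, b), c); }

void sort3(int a, int b, int c, int *lo, int *mid, int *hi)
{
    *lo  = min2(min2(a, b), c);                      /* MIN Rmin,Ra,Rb,Rc */
    *hi  = max3(a, b, c);                            /* MAX Rmax,Ra,Rb,Rc */
    *mid = max3(min2(a, b), min2(a, c), min2(b, c)); /* pairwise MINs, 3-way MAX */
}
```

Note the scheme handles duplicates correctly, since the mid of, say, {5, 5, 1} is the max of the pairwise mins {5, 1, 1}.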

Anton Ertl

unread,
Jun 13, 2023, 4:35:19 AM6/13/23
to
EricP <ThatWould...@thevillage.com> writes:
[ARM A64]
>It also has condition codes, address modes like pre/post index,
>load/store pair. So the risc tenets are already abandon.

You mean the MIPS tenets. Condition codes are in SPARC (and, I guess,
Berkeley RISC), PA-RISC, and Power(PC). Autoincrement addressing
modes (however they are called) are in PA-RISC and Power(PC).
Load/Store multiple are in Power(PC).

The biggest deviation from classical RISC principles here seems
load/store pair (or multiple), because it means that several cache
lines and several pages can be accessed by one load or store. But
given that unaligned accesses have won (even RV32I (the bottom tier of
RISC-V) supports misaligned accesses), so any load can access two
cache lines and two pages, load/store pair is no additional burden
apart from the register port issue, which is what we are discussing
here. The load/store pair instructions certainly are a good way to
make the most of the expensive load/store units without incurring the
problems of load/store multiple.

Terje Mathisen

unread,
Jun 13, 2023, 4:36:16 AM6/13/23
to
When writing SW emulation code for FADD/FSUB it is really nice to have
an instruction which takes two FP variables and compares/sorts them by
magnitude...

I.e. ignore the top bit, then do a compare on exp+mantissa and return
with (A,B) where |A| >= |B|

At this point it becomes easy to handle the rest of the algorithm.
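A C sketch of that magnitude sort, assuming finite IEEE-754 doubles: with the sign bit shifted away, exponent+mantissa compare correctly as one unsigned integer:

```c
#include <stdint.h>
#include <string.h>

/* Order two doubles by magnitude via their bit patterns. For finite
   IEEE values, dropping the sign bit leaves exponent+mantissa, which
   is monotone in |x| under an unsigned integer compare. */
void sort_by_magnitude(double *a, double *b)   /* ensures |*a| >= |*b| */
{
    uint64_t ia, ib;
    memcpy(&ia, a, 8);
    memcpy(&ib, b, 8);
    if ((ia << 1) < (ib << 1)) {   /* left shift discards the sign bit */
        double t = *a; *a = *b; *b = t;
    }
}
```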

Terje Mathisen

unread,
Jun 13, 2023, 4:42:57 AM6/13/23
to
I've previously looked at doing an approximate sort with the (possibly
truncated) key in the top part of a 64-bit container and the index of
the full record in the lower part. (The splitting point could vary
depending upon how many items you need to sort and/or the key size.)

This would be a stable sort when the full keys fit, for larger keys I
would need a final mini-sort for each stretch of identical extracted
keys. The key here being that MIN/MAX would work very well.

For very wide keys, like text strings with potentially very long
identical prefixes, I would need a better algorithm!
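A sketch of the packed key+index scheme, here with a 32-bit key in the top half and the record index below it; an ordinary unsigned sort of the packed words is then automatically stable, because ties on the key fall through to the index:

```c
#include <stdint.h>
#include <stdlib.h>

/* Pack (key << 32) | index, sort the packed words as plain unsigned
   64-bit values: primary order by key, ties broken by original index,
   which is exactly the stability property. */
static int cmp_u64(const void *p, const void *q)
{
    uint64_t a = *(const uint64_t *)p, b = *(const uint64_t *)q;
    return (a > b) - (a < b);
}

void sort_keys(const uint32_t *keys, uint32_t n, uint64_t *packed)
{
    for (uint32_t i = 0; i < n; i++)
        packed[i] = ((uint64_t)keys[i] << 32) | i;
    qsort(packed, n, sizeof packed[0], cmp_u64);
    /* packed[j] & 0xFFFFFFFF is now the original index of rank j */
}
```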

Terje Mathisen

unread,
Jun 13, 2023, 5:01:35 AM6/13/23
to
MitchAlsup wrote:
> On Monday, June 12, 2023 at 4:05:16 PM UTC-5, Thomas Koenig wrote:
>> Hmm.. a couple of thoughts.
>>
>> Assumme we have the three-operand instructions
>>
>> MIN Rt,Ra,Rb,Rc
>> MID Rt,Ra,Rb,Rc
>> MAX Rt,Ra,Rb,Rc
> <
> In the 1980s and very early '90s, some designers thought about using
> CAM sort arrays* in hardware. Software would load up a number of
> values and write them into the "sorter", after loading all values, soft-
> ware would read out all the value in order. No instruction was needed
> to cause the sorter to sort, the writing of values in performed sorting.
> <
> These never went anywhere--draw your own conclusions.

Oh, but the sorting chip was actually produced, and was supposed to be
the "secret sauce" behind FAST (which Microsoft acquired for $2B, and is
still the core of Bing search).

You are absolutely correct that it never actually went anywhere, because
FAST had some very smart people who quickly figured out that this was a
SW and not a HW problem.

> <
> (*) CAMs can be made to do things other than exact match detection.
>>
>> which would, as the name suggests, select the minimum, the mitpoint
>> and the maximum of three values. Are these instructions which
>> could reasonably be done in a single cycle, and issued in parallel
>> on an OoO machine? That should certainly speed up three-way
>> sorting, if this is reasonable.
> <
> Yours is only the 3-operand version, the array I mentioned above
> was 32 entry or 64 entry (its been too long).

IBM had an earlier patent for a ladder layout where each item would
compare itself with its horizontal neighbor, then (during load) the
loser would drop down to the level below, making room for the incoming
loser from the layer above. On the next (bus) cycle after loading had
finished with the last item, the process would reverse and the winner
would be produced from the top, with each layer winner bubbling up on
each cycle.

I.e. N bus cycles to load N items, then N more cycles to return the
sorted data.

IBM also decided to not produce this chip.

Terje Mathisen

unread,
Jun 13, 2023, 5:59:52 AM6/13/23
to
Mea Culpa!

I misremembered the FAST chip: It was of course intended for Search, not
Sort, returning the N best matches, but still a dead end.

Thomas Koenig

unread,
Jun 13, 2023, 6:15:20 AM6/13/23
to
MitchAlsup <Mitch...@aol.com> schrieb:

> Many sort algorithms allow the user to supply the subroutine that determines
> the sort order::
><
> SORT( A, length, comparison )
> {
> for( j = 0; j < length-1; j++ )
> for( i = 0; I < length; i++ )
> if( comparison( A[i], A[j] )
> SWAP( A[i], A[j] );
> }

An argument for either templates or LTO if there ever was one.
Making a function call to compare two integers or floats sounds
wrong, however low the overhead may be.
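The cost being pointed at is the indirect call per compare. A sketch contrasting the two paths: qsort calls back through a pointer for every pair, while a type-specialized sort (what templates or LTO inlining buy you) compares in a register:

```c
#include <stdlib.h>

/* Generic path: one indirect call per comparison. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Specialized path: the compare is a single inlined instruction.
   (Insertion sort just to keep the example short.) */
static void sort_int_inline(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i - 1;
        while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
        a[j + 1] = v;
    }
}
/* usage: qsort(a, n, sizeof *a, cmp_int);  vs.  sort_int_inline(a, n); */
```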

Thomas Koenig

unread,
Jun 13, 2023, 6:26:55 AM6/13/23
to
robf...@gmail.com <robf...@gmail.com> schrieb:

> I have been considering using 128-bit bundles containing three
> 40-bit instructions in fixed locations. That would get back some of the
> benefits of the 4-byte aligned instructions, but it wastes 6% of memory.
> The I$ could be made 120-bits wide so no wastage there.

We previously discussed 128-bit bundles with instructions being
a multiple of 20 bits, with stop bits in the header.

Frequent instructions like "add ra,rb,rb" or "add ra,rb,#1" could
then fall in the 20-bit category.

If you manage 40% of your instructions to use 20 bits and 60% to use
40 bits, you break even in code density with a 32-bit encoding.
But 40 bits are a better fit if you have more than 32 registers.

Putting 32-bit or 64-bit constants into the stream is less
convenient, and will lead to losses in code density vs.,
for example, Mitch's constant encoding.

You'll need to run a frequency analysis on instructions to see
if it makes sense for you or not.

Michael S

unread,
Jun 13, 2023, 8:30:27 AM6/13/23
to
The key question is: how often do people sort arrays of numbers?
My guess is, in fields of signal and image processing they do it not very
rarely, but also not often. And typical size here is from 2-3 to low hundreds.

In other fields of computing, like commercial, scientific, data analytics,
sorting arrays of numbers is *extremely* rare.

What is not rare is sorting arrays of records by numeric key.
Of which most important case is sorting small records consisting of
numeric key and pointer-or-index.

Max and min instruction are probably far less useful in this mighty
important case than they are in the less important case of sorting arrays of numbers.


Scott Lurndal

unread,
Jun 13, 2023, 9:37:10 AM6/13/23
to
MitchAlsup <Mitch...@aol.com> writes:
>On Monday, June 12, 2023 at 4:05:16=E2=80=AFPM UTC-5, Thomas Koenig wrote:
>> Hmm.. a couple of thoughts.=20
>>=20
>> Assumme we have the three-operand instructions=20
>>=20
>> MIN Rt,Ra,Rb,Rc=20
>> MID Rt,Ra,Rb,Rc=20
>> MAX Rt,Ra,Rb,Rc=20
><
>In the 1980s and very early '90s, some designers thought about using=20
>CAM sort arrays* in hardware. Software would load up a number of=20
>values and write them into the "sorter", after loading all values, soft-
>ware would read out all the value in order. No instruction was needed
>to cause the sorter to sort, the writing of values in performed sorting.
><
>These never went anywhere--draw your own conclusions.
><
>(*) CAMs can be made to do things other than exact match detection.
>>=20
>> which would, as the name suggests, select the minimum, the mitpoint=20
>> and the maximum of three values. Are these instructions which=20
>> could reasonably be done in a single cycle, and issued in parallel=20
>> on an OoO machine? That should certainly speed up three-way=20
>> sorting, if this is reasonable.=20
><
>Yours is only the 3-operand version, the array I mentioned above=20
>was 32 entry or 64 entry (its been too long).
>>=20
>> Another idea, potentially even wilder (and probably worse :-)=20
>> A "load double and sort" instruction, so=20
>>=20
>> LS Rmin,Rmax,[Rs]=20
>>=20
>> would load the maximum of [Rs] and [Rs+offs] into Rmax and the=20
>> minimum of [Rs] and [Rs+offs], with offs being the size of the=20
>> data loaded.=20
>>=20
>> Comments? Too horrible even to think about? Has been tried and=20
>> does not work because...?
><
>I, personally, do not think it a good idea to add a comparison stage=20
>to the already difficult AGEN-CACHE-ALIGN portion of the memory
>pipeline.

Well, does it have to done in the pipeline? Or at the memory
controller.

For instance, ARM64 has STSMAX (Atomic store signed maximum):

Atomic signed maximum on word in memory, without return,
atomically loads an 64-bit byte from memory, compares
it against the value held in a register, and stores the
larger value back to memory, treating the values as signed
numbers.

Granted, not particularly performant in a tight integer sort
loop.
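On targets without such an instruction, atomic maximum falls out of a compare-exchange loop; a C11 sketch of what STSMAX collapses into a single memory-side operation:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomic signed maximum without a dedicated instruction: retry a
   compare-exchange while the stored value is still smaller than ours. */
void atomic_max_i64(_Atomic int64_t *p, int64_t v)
{
    int64_t old = atomic_load(p);
    while (old < v &&
           !atomic_compare_exchange_weak(p, &old, v))
        ;   /* failed CAS reloads old; loop exits once old >= v or we won */
}
```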

robf...@gmail.com

unread,
Jun 13, 2023, 11:10:13 AM6/13/23
to
How about double-wide min and max? 1/2 key and 1/2 pointer?

robf...@gmail.com

unread,
Jun 13, 2023, 11:40:11 AM6/13/23
to
On Tuesday, June 13, 2023 at 6:26:55 AM UTC-4, Thomas Koenig wrote:
> robf...@gmail.com <robf...@gmail.com> schrieb:
> > I have been considering using 128-bit bundles containing three
> > 40-bit instructions in fixed locations. That would get back some of the
> > benefits of the 4-byte aligned instructions, but it wastes 6% of memory.
> > The I$ could be made 120-bits wide so no wastage there.
> We previously discussed 128-bit bundles with instructions being
> a multiple of 20 bits, with stop bits in the header.
>
I tried 120-bit bundles, but it makes things more complex than byte
alignment. The bundle must be found on the cache-line, then high order
PC bits used to select the bundle and low order PC bits used to select
the word. There are two multiplies by 120/40 involved. For byte alignment
the low order PC bits can be multiplied by eight. Computing the branch
displacements was a headache when modifying the assembler, as an extra
byte needed to be inserted every 15 bytes. The bundles also make moving
code around for PIC more challenging.

> Frequent instructions like "add ra,rb,rb" or "add ra,rb,#1" could
> then fall in the 20-bit category.
>
> If you manage 40% of your instructions to use 20 bits and 60% to use
> 40 bits, you break even in code density with a 32-bit encoding.
> But 40 bits are a better fit if you have more than 32 registers.

I think it would be difficult to make good use of 20-bit instructions as
there are 64 GPRs. Most instructions also have predicates and format
fields for vector operations. The 40-bit instruction is packed. I ran out of
registers in the ABI even with 64, as some are used for micro-code. I think
the code density may be less than 20% larger than a 32-bit code as
sometimes 32-bit code needs extra instructions to encode an operation.

> Putting 32-bit or 64-bit constants into the stream is less
> convenient, and will lead to losses in code density vs.,
> for example, Mitch's constant encoding.

Yea, about 20% is lost in code density. Constants are encoded as NOPs
following the instruction. The NOP opcode takes up some space.
>
> You'll need to run a frequency analysis on instructions to see
> if it makes sense for you or not.

With byte aligned code it may be possible to use 24 bit instructions. I had
variable instruction lengths in a previous iteration and it did improve code
density significantly. I guess I am after a faster clock cycle time and less
hardware with this machine.

MitchAlsup

unread,
Jun 13, 2023, 2:30:16 PM6/13/23
to
Hardware could perform 3 comparisons in parallel A CMP B, A CMP C,
and B CMP C. Then using the > and < indications from the 3 comparisons
perform::
<
SRT3 {Ra, Rb, Rc} -> {Ra, Rb, Rc}
<
In one pass through a 2 cycle pipeline. Latency = 2 throughput = 1.
By reusing the register names fewer fields are required in the instruction.

MitchAlsup

unread,
Jun 13, 2023, 2:34:01 PM6/13/23
to
What about EBCDIC strings ??
What about ASCII strings ??
What about UTF8 strings ??

Terje Mathisen

unread,
Jun 13, 2023, 3:40:52 PM6/13/23
to
I am guessing that the fastest approach might be similar to the C
single-byte file read call, which is actually a macro that picks the
next byte from the file buffer, only falling back to an actual function
call when the buffer is empty.

You probably need some C++ template magic to make this all work
semi-portably across multiple record key types?

I remember writing sort code 35+ years ago that would extract and
serialize all the keys, together with a record index, so that the
comparison function could always use an inline REPE CMPSB. This meant
that some types of sub-key parts needed to be inverted or negated in
order to get the desired sort direction, but this way I only did all
this once per record instead of twice for every pair comparison.

Michael S

unread,
Jun 13, 2023, 3:52:11 PM6/13/23
to
On the GPR side or on the SIMD side, assuming that your machine has SIMD side
in the first place.

On GPR side:
Such instruction, apart from complexity of implementation, is going to require twice
as many rename resources vs "normal" instruction, so good implementation will likely
make it half the throughput of "normal" instructions. So, the whole enterprise of replacing
1 compare + 4 conditional moves with Min2 + Max2 will improve performance by 20%.
I don't find it very attractive.

On SIMD side.
For 128-bit SIMD - great idea. Very easy to implement. Flexible and useful.

For wider SIMD - how do you define it? There are two options:

A. Keys are in lower 64 bits (or 32 bits). The rest of register is payload.

B. Each 128-bit superlane has the key of its own, so for 512b register you do
4 independent Min2 or Max2 operations at once.

I don't know which option is more useful in practice. I do know however that
option A is much harder to implement in hardware than optiion B.



John Levine

unread,
Jun 13, 2023, 4:09:44 PM6/13/23
to
According to MitchAlsup <Mitch...@aol.com>:
On the systems I know, the sort routine treats the data as a set of
byte strings, and the default comparison is a string compare.

Not entirely by accident, this gives a reasonable answer if the string
is EBCDIC, ASCII, or UTF-8. Digits sort in 0-9 order and alphabetics
sort reasonably. Sorting UTF-8 gives the same order as if you decoded
them to code points and sorted them.
--
Regards,
John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly
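The code-point-order property is easy to demonstrate: UTF-8 lead bytes grow with the code point, so a plain byte compare suffices. A small check (the literals are just illustrative):

```c
#include <string.h>

/* UTF-8 was designed so byte-wise comparison matches code-point order:
   "z" is U+007A (byte 7A), "é" is U+00E9 (bytes C3 A9), and since
   strcmp compares as unsigned char, 0x7A < 0xC3 gives the same order
   as 0x7A < 0xE9 on the decoded code points. */
int utf8_order_demo(void)
{
    const char *z = "z";
    const char *e_acute = "\xC3\xA9";
    return strcmp(z, e_acute) < 0;
}
```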

Scott Lurndal

unread,
Jun 13, 2023, 4:46:04 PM6/13/23
to
John Levine <jo...@taugh.com> writes:
>According to MitchAlsup <Mitch...@aol.com>:
>>On Tuesday, June 13, 2023 at 5:15:20 AM UTC-5, Thomas Koenig wrote:
>>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> > Many sort algorithms allow the user to supply the subroutine that determines
>>> > the sort order::
>>> ><
>>> > SORT( A, length, comparison )
>>> > {
>>> > for( j = 0; j < length-1; j++ )
>>> > for( i = 0; I < length; i++ )
>>> > if( comparison( A[i], A[j] )
>>> > SWAP( A[i], A[j] );
>>> > }
>>> An argument for either templates or LTO if there ever was one.
>>> Making a function call to compare two integers or floats sounds
>>> wrong, however low the overhead may be.
>><
>>What about EBCIDIC strings ??
>>What about ASCII strings ??
>>What about UTF8 strings ??
>
>On the systems I know, the sort routine treats the data as a set of
>byte strings, and the default comparison is a string compare.
>
>Not entirely by accident, this gives a reasonable answer if the string
>is EBCDIC, ASCII, or UTF-8. Digits sort in 0-9 order and alphabetics
>sort reasonably. Sorting UTF-8 gives the same order as if you decoded
>them to code points and sorted them.

One key difference - in ASCII digits collate before all alpha characters;
the reverse is true for EBCDIC causing a different sort order for
alphanumeric keys.

Thomas Koenig

unread,
Jun 17, 2023, 2:45:51 PM6/17/23
to
MitchAlsup <Mitch...@aol.com> schrieb:
To do this in two cycles would this require writing three registers
at the same time (so at least a 3R3W register file), correct?
Or is there some way of making this less expensive?

It might actually make sense to split this into two instructions for
more flexibility, to make both direct and index sort possible.

So something like

CMP3GT Rt,Rs1,Rs2,Rs3
SRT3 Rt,Ra,Rb,Rc

where CMP3GT could generate a bit pattern indicating which way
the target registers of SRT3 should be swapped (for example 010
001 100 if the value of Ra should be moved into Rb, the value of
Rb into Ra and the value of Rc should be left unchanged).

MitchAlsup

unread,
Jun 17, 2023, 3:01:38 PM6/17/23
to
Yes, that is the implication of throughput = 1
<
> Or is there some way of making this less expensive?
<
There are ways if you have access to raw transistors and wires. It gets
expensive fast if you only have access to gates from a library.

BGB

unread,
Jun 19, 2023, 10:35:33 AM6/19/23
to
Though, stricmp/strnicmp is more of a problem:
ASCII: Easy enough.
1252: Also works, just needs more rules.
UTF-8: Simple case: ASCII only. Non-simple case, "Oh No".

Would need different implementations to support both 1252 and UTF-8 and give sensible results...

But, any progress on my projects is currently stalled at the moment due to being in the middle of a multi-day blackout... (posting from phone, with an intermittent connection).

John Levine

unread,
Jun 19, 2023, 2:25:43 PM6/19/23
to
According to BGB <cr8...@gmail.com>:
>> >sort reasonably. Sorting UTF-8 gives the same order as if you decoded
>> >them to code points and sorted them.

>Though, stricmp/strnicmp is more of a problem:
>ASCII: Easy enough.
>1252: Also works, just needs more rules.
>UTF-8: Simple case: ASCII only. Non-simple case, "Oh No".
>
>Woud need different implementations to support both 1252 and UTF-8 and give sensible results...

What's not sensible about sorting UTF-8 into code point order?

If you want a sort order that matches a language's dictionary order,
that's much harder, but it's not trivial in ASCII, either.

BGB

unread,
Jun 19, 2023, 3:55:35 PM6/19/23
to
On Monday, June 19, 2023 at 1:25:43 PM UTC-5, John Levine wrote:
> According to BGB <cr8...@gmail.com>:
> >> >sort reasonably. Sorting UTF-8 gives the same order as if you decoded
> >> >them to code points and sorted them.
> >Though, stricmp/strnicmp is more of a problem:
> >ASCII: Easy enough.
> >1252: Also works, just needs more rules.
> >UTF-8: Simple case: ASCII only. Non-simple case, "Oh No".
> >
> >Woud need different implementations to support both 1252 and UTF-8 and give sensible results...
> What's not sensible about sorting UTF-8 into code point order?
>

The stricmp/strnicmp (AKA: strcasecmp/strncasecmp or _stricmp/_strnicmp) functions do a case-insensitive compare.

This requires internally normalizing the case, which requires knowing the character encoding (if, say, the Latin1 characters are also intended to be case-insensitive).

I am not sure what the "standard" solution is though (the descriptions I had found make no reference to locale or character encoding).

Or, one could make a case that these functions only apply to ASCII (and/or 1252) and have separate "_stricmp_u8" functions and similar?...

> If you want a sort order that matches a language's dictionary order,
> that's much harder, but it's not trivial in ASCII, either.

Granted, but easy enough to define 'A' == 'a' for sake of case-insensitive compare, and ignore others.

Then there are the accented characters, which are mostly a matter of mapping.

The German sharp S (ß) has no obvious upper-case mapping, and making it compare equal to 'ss'/'SS' seems like too much of a hack...

Can ignore most of the rest of the Unicode space as (aside from Greek and Cyrillic and similar) most other alphabets do not have case.


Status:
Still in blackout, but parents have gone and got a generator. Mostly using it to try to keep the refrigerator cold... Apparently the utility company estimates about a week to get power back online (Tulsa's grid kinda got wrecked by 100mph winds...).

MitchAlsup

unread,
Jun 19, 2023, 5:47:39 PM6/19/23
to
On Monday, June 19, 2023 at 2:55:35 PM UTC-5, BGB wrote:
> On Monday, June 19, 2023 at 1:25:43 PM UTC-5, John Levine wrote:
> > According to BGB <cr8...@gmail.com>:
> > >> >sort reasonably. Sorting UTF-8 gives the same order as if you decoded
> > >> >them to code points and sorted them.
> > >Though, stricmp/strnicmp is more of a problem:
> > >ASCII: Easy enough.
> > >1252: Also works, just needs more rules.
> > >UTF-8: Simple case: ASCII only. Non-simple case, "Oh No".
> > >
> > >Woud need different implementations to support both 1252 and UTF-8 and give sensible results...
> > What's not sensible about sorting UTF-8 into code point order?
> >
> The stricmp/strnicmp (AKA: strcasecmp/strncasecmp or _stricmp/_strnicmp) functions do a case-insensitive compare.
>
> This requires internally normalizing the case, which requires knowing the character encoding (if, say, the Latin1 characters are also intended to be case-insensitive).
>
{Beginning of text to be commented}
> I am not sure what the "standard" solution is though (the descriptions I had found make no reference to locale or character encoding).
>
> Or, one could make a case that these functions only apply to ASCII (and/or 1252) and have separate "_stricmp_u8" functions and similar?...
> > If you want a sort order that matches a language's dictionary order,
> > that's much harder, but it's not trivial in ASCII, either.
> Granted, but easy enough to define 'A' == 'a' for sake of case-insensitive compare, and ignore others.
>
> Then there are the accented characters, which are mostly a matter of mapping.
>
> The German "hard S" character has no obvious case mapping, and making it compare equal to 'ss'/'SS' seems like too much of a hack...
>
> Can ignore most of the rest of the Unicode space as (aside from Greek and Cyrillic and similar) most other alphabets do not have case.
{End of text being commented}
<
A perfect case illustrating "another layer of indirection" saves the day::
<
if( table[string1[i]] == table[string2[i]] ) continue;
<
The table solves the ASCII versus EBCDIC versus Fieldata issue,
and solves the {German, Greek, Cyrillic,...} problems.
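The one-liner above can be fleshed out into a minimal sketch. The table contents and the `table_stricmp` name here are purely illustrative (only ASCII folding is filled in), not the actual table being proposed:

```c
/* Illustrative fold table: maps each byte to a canonical value.
   Only ASCII 'a'..'z' is folded here; a real table would also
   cover EBCDIC, Fieldata, or the accented 1252 range as needed. */
static unsigned char fold[256];

static void init_fold(void)
{
    for (int i = 0; i < 256; i++)
        fold[i] = (unsigned char)i;
    for (int c = 'a'; c <= 'z'; c++)
        fold[c] = (unsigned char)(c - 'a' + 'A');
}

/* Case-insensitive compare: one table indirection per byte. */
int table_stricmp(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;
    for (;;) {
        int c1 = fold[*p1++], c2 = fold[*p2++];
        if (c1 != c2) return c1 - c2;
        if (c1 == 0)  return 0;
    }
}
```

Swapping in a different table handles a different encoding or language without touching the compare loop.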
>
>
> Status:
> Still in blackout, but parents have went and got a generator. Mostly using it to try to keep the refrigerator cold... Apparently the utility company estimates about a week to get power back online (Tulsa's grid kinda got wrecked by 100mph winds...).
<
Makes me wonder about what electricity would cost if the transmission lines were able to withstand 100 MPH winds.....

BGB

unread,
Jun 19, 2023, 6:53:17 PM6/19/23
to
Yeah, NTFS sorta used this approach IIRC, and embedded the tables into the filesystem.

Looking around more, I found out:
Conventionally, it is based on locale settings (which in my case, likely means "Assume 1252 by default unless locale is set to 'C.UTF-8' or similar").

Apparently, the more standard behavior is to normalize to lower case before the compare, but the C library I am using normalizes to upper case. If I remember, I may need to fix this once power is back...

Before power went out, had worked some on speeding up these functions, as it turns out they were eating a lot of CPU time in some programs (had reworked them some to work on blocks of 8 chars at a time, but this would only work for
ASCII/1252).
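The 8-chars-at-a-time idea (for the ASCII-only subset) can be sketched with the usual SWAR bit tricks; this is a generic illustration, not the actual implementation:

```c
#include <stdint.h>

/* Upper-case 8 ASCII bytes at once (SWAR). Valid only when all
   eight bytes are plain ASCII (high bit clear) -- no interlane
   carries can occur in that case. */
uint64_t toupper8_ascii(uint64_t v)
{
    const uint64_t hi = 0x8080808080808080ULL;
    /* bit 7 of each lane set iff byte >= 'a' (0x80 - 0x61 = 0x1F) */
    uint64_t ge_a = (v + 0x1F1F1F1F1F1F1F1FULL) & hi;
    /* bit 7 of each lane set iff byte >  'z' (0x80 - 0x7B = 0x05) */
    uint64_t gt_z = (v + 0x0505050505050505ULL) & hi;
    /* 0x20 in each lane that holds a lowercase letter */
    uint64_t mask = (ge_a & ~gt_z) >> 2;
    return v ^ mask;   /* clear bit 5 on just those lanes */
}
```

A block compare would load 8 bytes from each string (e.g. via memcpy), fold both with this, and only fall back to the byte-at-a-time path when a non-ASCII byte or the terminator shows up.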

Some of this is related to trying to hunt down remaining cases where programs behave differently when built on BJX2 than on native x86 targets...

> >
> >
> > Status:
> > Still in blackout, but parents have went and got a generator. Mostly using it to try to keep the refrigerator cold... Apparently the utility company estimates about a week to get power back online (Tulsa's grid kinda got wrecked by 100mph winds...).
> <
> Makes me wonder about what electricity would cost if the transmission lines were able to withstand 100 MPH winds.....

Dunno. Could bury the lines, but this probably wouldn't work well for high voltage transmission lines. Buried lines aren't really a thing here, mostly just wooden utility poles.

Like, say, if 100kV lines required, say, 8-10" of fiberglass and silicone insulation or similar, this would likely be impractical.

Would likely be easier for 960V and 30kV lines though (eg, "most of them"), leaving mostly 100kV to 500kV on the big pylons... IIRC, for 30kV it is something like 2 inch silicone or similar.


Wind knocked over a bunch of trees and also a bunch of the utility poles, ... Most of the Tulsa metropolitan area was knocked out.

I guess during the storm in some places, there were a lot of trees and other debris that had taken flight, ... Lots of flashes where the grid was getting torn up, and a whole lot of lightning as well.

Storm was relatively short, but was enough to make the whole house rumble at one point as it passed over. I guess in some places, buildings got damaged and windows blown-in, ...

John Levine

unread,
Jun 19, 2023, 7:02:41 PM6/19/23
to
According to MitchAlsup <Mitch...@aol.com>:
>> > If you want a sort order that matches a language's dictionary order,
>> > that's much harder, but it's not trivial in ASCII, either.
>> Granted, but easy enough to define 'A' == 'a' for sake of case-insensitive compare, and ignore others.

If you just want English case folding it's easy. If you want dictionary order, or any other
language, it's much harder.

>> The German "hard S" character has no obvious case mapping, and making it compare equal to 'ss'/'SS' seems like too much of a hack...

This sort of stuff is very language specific. In German, ö is usually
equivalent to oe, but in Scandinavian languages it is not. Even worse,
in Unicode there are two ways to represent ö, a single composed
character or a plain "o" followed by a combining diaeresis. People spend
years trying to get this right.

>> Can ignore most of the rest of the Unicode space as (aside from Greek and Cyrillic and similar) most other alphabets do not have case.

Yeah, but they have variants. Are the traditional and simplified
versions of Chinese characters equivalent or not? The answer is "it depends".

>A perfect case illustrating "another layer of indirection" saves the day::
><
> if( table[string1[i]] == table[string2[i]] ) continue;

That is so very, very, not adequate.

>Makes me wonder about what electricity would cost if the transmission lines were able to withstand 100 MPH winds.....

Well, you could visit New York City where they're all underground so it's not a problem.

BGB

unread,
Jun 19, 2023, 9:28:14 PM6/19/23
to
On Monday, June 19, 2023 at 6:02:41 PM UTC-5, John Levine wrote:
> According to MitchAlsup <Mitch...@aol.com>:
> >> > If you want a sort order that matches a language's dictionary order,
> >> > that's much harder, but it's not trivial in ASCII, either.
> >> Granted, but easy enough to define 'A' == 'a' for sake of case-insensitive compare, and ignore others.
> If you just want English case folding it's easy. If you want dictionary order, or any other
> language, it's much harder.
> >> The German "hard S" character has no obvious case mapping, and making it compare equal to 'ss'/'SS' seems like too much of a hack...
> This sort of stuff is very lanuage specific. In German, ö is usually
> equivalent to oe, but in Scandanavian languages it is not. Even worse,
> in Unicode there are two ways to represent ö, a single composed
> character or a combining diaresis followed by a plain "o". People spend
> years trying to get this right.

I am probably going to take the simpler route and assume that 'O'=='o', and likewise for the accented/umlauted forms (not sure how to type these on Android with a Bluetooth keyboard). But, matching them with oe, probably not.

In this case, "hard S" would not have any case translation.

Probably, toupper/tolower will assume the existing "ambiguous 1252/unicode hybrid" space.

Will ignore collating rules, since those are strcoll's business; and seemingly the rules for "minimal valid implementation" allow treating strcoll as an alias to strcmp...

> >> Can ignore most of the rest of the Unicode space as (aside from Greek and Cyrillic and similar) most other alphabets do not have case.
> Yeah, but they have variants. Are the traditional and simplified
> versions of Chinese characters equivalent or not? The answer is "it depends".

I am probably going to conveniently ignore all this.

Probably Latin, Greek, Cyrillic, then call it done.
Probably also the existing practice of assuming 1252 and UTF-8, ignoring the existence of all the other codepages.

> >A perfect case illustrating "another layer of indirection" saves the day::
> ><
> > if( table[string1[i]] == table[string2[i]] ) continue;
> That is so very, very, not adequate.
> >Makes me wonder about what electricity would cost if the transmission lines were able to withstand 100 MPH winds.....
> Well, you could visit New York City where they're all underground so it's not a problem.

They could do this, and probably then avoid needing to repair the grid every "Tornado season", maybe also have fewer power bumps and short blackouts (common annoyance, power going out for 1-2 hours every 1 or 2 months or so, and short bumps maybe a few times per week, ...).

But, if you reduce maintenance work, then one risks people complaining about the loss of jobs, ...

OTOH, it would add cost.

It is sort of like how people go and say that things like public mass transit are a socialist pipe dream...

Like, theoretically, there are busses, but if the busses only have a limited number of stops and have a 2 hour wait time on most routes, not very useful...

Like, a common argument being that the only "safe" and "viable" form of transportation is to get an overly large pickup truck, ... Roads mostly full of large pickups and SUVs (and the occasional person driving a sedan style car).

Terje Mathisen

unread,
Jun 20, 2023, 1:35:43 AM6/20/23
to
John Levine wrote:
> According to BGB <cr8...@gmail.com>:
>>>> sort reasonably. Sorting UTF-8 gives the same order as if you decoded
>>>> them to code points and sorted them.
>
>> Though, stricmp/strnicmp is more of a problem:
>> ASCII: Easy enough.
>> 1252: Also works, just needs more rules.
>> UTF-8: Simple case: ASCII only. Non-simple case, "Oh No".
>>
>> Woud need different implementations to support both 1252 and UTF-8 and give sensible results...
>
> What's not sensible about sorting UTF-8 into code point order?
>
> If you want a sort order that matches a language's dictionary order,
> that's much harder, but it's not trivial in ASCII, either.

Much, much harder: A is first, Å (Aring) is last in Norway and Denmark.

AA sorts the same as Å. I.e. you need to look for letter combinations
and sort them as a totally different code point.

Many years ago I handled this by creating sort keys which were
pre-digested into collating order, i.e. so that they could be sorted
with a pure binary byte array compare (like strcmp()).
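Roughly sketched, the pre-digested key approach looks like this; the `weight` table and `make_sort_key` name are invented for illustration (only Å gets a non-trivial weight, and the AA→Å sequence rewrite described above is left out):

```c
#include <string.h>

/* Hypothetical collation weights: ASCII keeps its natural order,
   and Å (0xC5/0xE5 in Latin-1) gets a weight just past 'Z' so it
   sorts last, per Danish/Norwegian rules. Not a real locale table. */
static unsigned char weight[256];

static void init_weight(void)
{
    for (int i = 0; i < 256; i++)
        weight[i] = (unsigned char)i;
    weight[0xC5] = 'Z' + 1;   /* Å sorts after Z */
    weight[0xE5] = 'Z' + 1;   /* å folds to the same weight */
}

/* Build a sort key once per string. A real version would also
   rewrite "AA"/"aa" to the Å weight while emitting. */
void make_sort_key(const char *src, char *key, size_t keylen)
{
    size_t i = 0;
    while (src[i] && i + 1 < keylen) {
        key[i] = (char)weight[(unsigned char)src[i]];
        i++;
    }
    key[i] = 0;
}
```

Once make_sort_key has run, an ordinary strcmp() (or memcmp()) on the keys gives collating order, so the table lookup happens once per key rather than twice per comparison.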

Terje Mathisen

unread,
Jun 20, 2023, 1:57:14 AM6/20/23
to
For speed you really want to do that lookup just once per key, instead
of twice per comparison. The only possible exception would be when the
keys are so long that storing a second copy runs out of memory (or $L2
cache working set).

>>
>>
>> Status:
>> Still in blackout, but parents have went and got a generator. Mostly using it to try to keep the refrigerator cold... Apparently the utility company estimates about a week to get power back online (Tulsa's grid kinda got wrecked by 100mph winds...).
> <
> Makes me wonder about what electricity would cost if the transmission lines were able to withstand 100 MPH winds.....

Digging 400KV cables into the ground is about 3X more costly than
overhead power lines, but that does of course not translate into a 3X
cost for electricity, just for that specific part of the long distance
transmission. You could maybe add 3-10 cent/kWh?

Terje

> <
>>> --
>>> Regards,
>>> John Levine, jo...@taugh.com, Primary Perpetrator of "The Internet for Dummies",
>>> Please consider the environment before reading this e-mail. https://jl.ly


BGB

unread,
Jun 20, 2023, 12:36:05 PM6/20/23
to
Lookup table versus rules is a tradeoff. Early on, rules-based logic was the winner here, but in a more recent test, lookup tables were faster, at least for ASCII/1252 range. For the rest of the BMP space, a lookup table would be too large. Also, for UTF-8, the logic for decoding codepoints would probably be a bigger issue than the logic for case-conversion.
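For the UTF-8 case, even before any case folding, every comparison step has to decode a code point. A minimal, non-validating decoder sketch (the `utf8_next` helper is invented for illustration; it does not reject overlong forms or surrogates):

```c
#include <stdint.h>

/* Return the code point at *s and advance the pointer past it.
   Malformed lead bytes yield U+FFFD; continuation bytes are only
   consumed while they match the 10xxxxxx pattern. */
uint32_t utf8_next(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else                            { cp = 0xFFFD;      extra = 0; }

    p++;
    while (extra-- > 0 && (*p & 0xC0) == 0x80)
        cp = (cp << 6) | (*p++ & 0x3F);

    *s = p;
    return cp;
}
```

A UTF-8 stricmp would run this on both strings in lockstep and only then apply whatever case-folding table or rules it uses, which is why the decode logic dominates.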

Did read somewhere that for case insensitive compare with Unicode, the rule is more "case normalization" rather than simply converting to lower or upper case, though for the Latin alphabet and similar, the rule is lower case. IIRC, the idea was that Cyrillic (IIRC, not sure) and a few other alphabets normalized to upper case instead for the case-insensitive compare.

> >>
> >>
> >> Status:
> >> Still in blackout, but parents have went and got a generator. Mostly using it to try to keep the refrigerator cold... Apparently the utility company estimates about a week to get power back online (Tulsa's grid kinda got wrecked by 100mph winds...).
> > <
> > Makes me wonder about what electricity would cost if the transmission lines were able to withstand 100 MPH winds.....
> Digging 400KV cables into the ground is about 3X more costly than
> overhead power lines, but that does of course not translate into a 3X
> cost for electricity, just for that specific part of the long distance
> transmission. You could maybe add 3-10 cent/kWh?
>

Yeah, not sure what you would do for 400kV.

One possible guess for insulating the wire would be, say: Central conductor wrapped with fiberglass and silicone; thick layer of silica sand with some fiberglass and silicone (to add stability and keep water out), outer layer likely more fiberglass and silicone (to add durability). Solid rubber insulation would likely be cost prohibitive I think.

Insulating the wire would likely cost more than the conductor in this case. Would almost make sense to use thicker conductors operating at lower voltages to reduce the cost of the insulation (say, using a lot more 30kV or 50kV lines).


Otherwise, still blackout.
Did read that apparently they are sending over a bunch of utility trucks from Texas and similar to help out...

Terje Mathisen

unread,
Jun 21, 2023, 8:55:01 AM6/21/23
to
BGB wrote:
> On Tuesday, June 20, 2023 at 12:57:14 AM UTC-5, Terje Mathisen
>>> transmission lines were able to withstand 100 MPH winds.....
>> Digging 400KV cables into the ground is about 3X more costly than
>> overhead power lines, but that does of course not translate into a
>> 3X cost for electricity, just for that specific part of the long
>> distance transmission. You could maybe add 3-10 cent/kWh?
>>
>
> Yeah, not sure what you would do for 400kV.
>
> One possible guess for insulating the wire would be, say: Central
> conductor wrapped with fiberglass and silicone; thick layer of silica
> sand with some fiberglass and silicone (to add stability and keep
> water out), outer layer likely more fiberglass and silicone (to add
> durability). Solid rubber insulation would likely be cost prohibitive
> I think.
>
> Insulating the wire would likely cost more than the conductor in this
> case. Would almost make sense to use thicker conductors operating at
> lower voltages to reduce the cost of the insulation (say, using a lot
> more 30kV or 50kV lines).

Below-ground cables, either directly deposited with a sand/gravel
wrapping, or inside a concrete culvert is well known technology. The
cables are interesting, with a lot of layers for electrical
conductivity, pulling strength and insulation.

Up in Rauland/Telemark the local power company is proactively digging
trenches and laying cables to replace overhead conductors in all the
places where history has shown ice buildup and/or heavy storm treefall
to be a problem.

Terje

BGB

unread,
Jun 21, 2023, 7:34:17 PM6/21/23
to
On 6/21/2023 7:54 AM, Terje Mathisen wrote:
> BGB wrote:
>> On Tuesday, June 20, 2023 at 12:57:14 AM UTC-5, Terje Mathisen
Makes sense.

Had looked, and it seems that insulating the cables for these voltages
doesn't require quite as much insulation as I had thought previously.


But, here, after a number of days (3 full days, Saturday and today being
partial days), power is now back on...


Though, there is a partial story for this:
Before this, I was off searching for my Arty-S7 board, but wasn't having
much luck finding it in my room, eventually I got partly frustrated, and
was sort of like "G-d, if you can point out where it is at, this would
be convenient." (Though, this was a sort of selfish request, which is
not necessarily a good thing; was half hoping for a sign to be like
"Yeah, it is over there"...).

A short time later, there was the weather alert for the approaching
storm. I posted a message about it on Twitter, then ended up putting PC
into hibernate mode. Maybe a few minutes later, the storm hit and
knocked out power.

It kinda sucked, but did on-off continue the search for the Arty-S7
board. Pretty much as soon as I found it, the power came back on...

Though, granted, the storm would have happened either way (not like
anything I am doing would matter enough to warrant something like this),
but like, weird stuff like this does happen sometimes (though usually on
a much smaller scale).


Like, I don't really see or hear anything directly. But, if I consider
doing something He wouldn't necessarily approve of, then everything
starts going to crap.

Almost sort of like in the "Book of Jonah" in this sense, just without
any obvious "mission" (or "message") in this case (nor any sort of
direct/tangible confirmation of His presence...).


Well, other examples were cases like, say, at one point I was going to
try to express interest in a female (who I possibly had a chance with),
but before I could give her a note doing so, a gas line exploded and
everyone had to be evacuated from the building. But, yeah, just a long
stream of seemingly random events sort of like this.

Did sort of barter that if He showed me where it was at, I would give
some sort of public acknowledgement of this. Though not exactly what I
had in mind, but I guess I am still under some obligation to give it
mention...

People can think what they will...
Not sure this would accomplish anything, in any case.

Not sure if there is anything else I should mention here.
Will not try to claim to be exactly a model example of piety though, in
any case...


I guess presumably now I can get back to "business as usual" (working on
my projects and similar; and then needing to divide my time with
machining parts out of metal, ...).


George Neuner

unread,
Jun 24, 2023, 12:16:05 AM6/24/23
to
On Tue, 20 Jun 2023 07:57:09 +0200, Terje Mathisen
<terje.m...@tmsw.no> wrote:

>Digging 400KV cables into the ground is about 3X more costly than
>overhead power lines, but that does of course not translate into a 3X
>cost for electricity, just for that specific part of the long distance
>transmission. You could maybe add 3-10 cent/kWh?

That's just simple installation ... in the US at least you would have
to add in the costs of obtaining permission to dig: state, local, and
maybe federal also depending on where. Then there would be endless
lawsuits from environmentalists worried about ground squirrels (or
whatever) being displaced by digging, and from neighbors (if there are
any) concerned about how the digging would be disruptive to their
lives, or how the buried lines might be dangerous to them.

The higher the line voltage, the more opposition you would face. In
many places local 10..20 KV lines do get buried [they are in my
community], but people regularly see those same voltage wires on poles
so they aren't terribly afraid of them. However, start talking about
burying 100+KV lines that normally are strung on /towers/ protected by
fences ... then people start to freak out.

By the time you started digging, you would already have spent 10..15
times the simple installation cost just getting permission to do it.


And note that you would face exactly the same obstacles trying to put
up a line of transmission towers - however the line of towers would be
far less costly than digging a ditch of the same length. You can guess
what happens.


>Terje

YMMV,
George

BGB

unread,
Jun 24, 2023, 2:33:35 AM6/24/23
to
Going back and forth to my shop, I can see that many of the utility
poles have been replaced.

A lot of the old wooden poles broke off, so they dig a hole and put in a
new pole, in a few cases next to the stump of the old broken-off pole
(where they just sort of awkwardly tied the new pole onto what was left
of the old pole using steel straps).

Some of the places where they put in the new poles, they are not
particularly straight. So, new wooden poles, but installed leaning over
at weird angles.

Well, along with some amount of questionable looking wiring. Compared
with the "relatively orderly" way the lines are usually put up, there
are now cases where wires are run in ways that look somewhat more haphazard.



In a few places, there were rusty looking steel utility poles instead.
These ones seem to have pretty much survived.

Seems like they "could" have put in more steel or concrete poles, since
these appear to have been better able to survive getting blown over.


Also a lot of broken off branches strewn about, and some broken off trees.

On the street where my shop is, the power is back on, despite a few of
the utility poles being seriously leaned over and the power-lines are
basically laying on the ground (luckily the residential style
power-lines seem to have a sort of rubber sheath).

Not sure how well it would go if someone drove into the power-lines though.

In some cases, it looks like the base of the pole just sort of dislodged
itself from the ground and the whole pole just sorta fell over.


It all doesn't look particularly "elegant" at the moment...



Ironically, it was sort of a very different setup on Guam, where despite
many things (at the ground level or in buildings) being a bit
questionable, the utility poles were basically all tall and imposing
pillars of solid concrete.

So, the power lines and utility poles were significantly taller than the
height of the trees, and typically like a 2 foot diameter. The cables
would run down the side of the pole, and then branch off to the houses.

Along one of the major roads, there was basically a series of very large
utility poles (~ 3 or 4 feet in diameter, also very tall). People would
periodically crash their cars into the poles, the cars would often end
up partly wrapped around the pole, but the poles were pretty much
entirely unaffected.

Though, typically whenever this happened, people would go and set up a
makeshift shrine on the pole, usually with flowers and pictures of
whoever passed on after crashing into the pole.

Well, also on Guam, the wind would also sometimes blow hard enough to
blow out one's windows. The houses generally were made with solid
concrete walls and roofs (basically, like big concrete boxes), and
usually with steel doors and steel window shutters (and sealed concrete
slab floors). One would close the shutters during storms to hopefully
keep the wind from blowing out one's windows (with wind speeds during
"typhoon season" sometimes high enough to flip and roll cars and
similar, ...).

Things like wood-frame housing "just wasn't a thing" there.
And the power would still usually survive a storm, ...

Then again, I think it was sort of the idea that most of the
construction practices there came from the US Military...


Also, house fires were less of an issue, since even if ones' stuff
caught fire, the house itself would remain intact (apart from maybe some
scorch marks; and fires would usually be confined to a single room, ...).

Though, radon buildup in houses was an issue (so, say, one would get a
radon pump installed, which pulls a vacuum in a pit dug under part of
the house's slab, and then vents it out through the roof).



But, yeah, in Tulsa, nearly every time it rains too hard or similar,
there is a blackout...

Also, wood-frame houses don't hold up particularly well to a tornado (or
high wind speeds in general), ...


>
>> Terje
>
> YMMV,
> George

Skybuck Flying

unread,
Jun 24, 2023, 2:52:08 AM6/24/23
to
On Wednesday, June 7, 2023 at 11:40:22 PM UTC+2, MitchAlsup wrote:
> An AI bot has figured out some faster sorting algorithms.
>
> https://www.nature.com/articles/s41586-023-06004-9
>
> Looking at the C++* code output from the paper::
>
> void variable_sort_2( int length, int *a )
> {
> switch( length )
> {
> case 0:
> case 1:
> return;
> case 2:
> int temp = a[0];
> a[0] = (a[1] < a[0] ) ? a[1] : a[0];
> a[1] = (a[1] < temp ) ? temp : a[1];
> }
> }
>
> There is x86 asm code in the article, consisting of 11 instructions.
>
> My 66000 ASM
>
> variable_sort_2:
> CMP R3,R1,#2
> PLT R3,TTTTTT
> LD R4,[R2,0]
> LD R5,[R2,4]
> MAX R6,R4,R5
> MIN R7,R4,R5
> ST R7,[R2,0]
> ST R5,[R2,4]
> RET
>
> For a count of 9 instructions.
>
> (*) it looks like C code to me instead of C++ code........

Not very impressive to sort only 2 numbers ?!

Where is the rest ?! ;)

Bye,
Skybuck.

Scott Lurndal

unread,
Jun 24, 2023, 2:41:34 PM6/24/23
to
George Neuner <gneu...@comcast.net> writes:
>On Tue, 20 Jun 2023 07:57:09 +0200, Terje Mathisen
><terje.m...@tmsw.no> wrote:
>
>>Digging 400KV cables into the ground is about 3X more costly than
>>overhead power lines, but that does of course not translate into a 3X
>>cost for electricity, just for that specific part of the long distance
>>transmission. You could maybe add 3-10 cent/kWh?
>
>That's just simple installation ... in the US at least you would have
>to add in the costs of obtaining permission to dig: state, local, and
>maybe federal also depending on where. Then there would be endless
>lawsuits from environmentalists worried about ground squirrels (or
>whatever) being displaced by digging, and from neighbors (if there are
>any) concerned about how the digging would be disruptive to their
>lives, or how the buried lines might be dangerous to them.

Hasn't stopped PG&E.

https://www.pge.com/en_US/residential/customer-service/other-services/electric-undergrounding-program/electric-undergrounding-program.page

>
>The higher the line voltage, the more opposition you would face. In
>many places local 10..20 KV lines do get buried [they are in my
>community], but people regularly see those same voltage wires on poles
>so they aren't terribly afraid of them.

And these are the lines most likely to be a problem in
a storm or wildfire; think rural area distribution networks.

> However, start talking about
>burying 100+KV lines that normally are strung on /towers/ protected by
>fences ... then people start to freak out.

The towers for these are far more robust than the lodgepole pines used
for the 20kV distribution network and much more resilient to natural
disaster or falling trees.