Mitch's 66000?


Kyle Hayes

Nov 3, 2019, 11:06:13 AM
I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(

Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?

I tried searching on Google (that's where I read comp.arch) but I get a huge number of results that are just references to the name "66000" and nothing particularly substantive. After the first 1500 or so, I gave up.

Thanks for any pointers!

Best,
Kyle

MitchAlsup

Nov 3, 2019, 12:26:04 PM
On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
>
> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?

It's "My 66000"; note the lower-case 'y'.
It's "Alsup".

The Virtual Vector Method (VVM) documentation is not ready for prime time.

It is a method for adding vector processing to a CPU/ISA that vectorizes
loops instead of vectorizing instructions.

In fact VVM adds but 2 instructions to the ISA: one denotes which registers
in the loop are vectorized and which are scalar (scalar registers are static
over the loop, vector registers change at least once per loop); the other
denotes the loop termination and looping condition.

VVM is perfectly capable of vectorizing small string loops and loops up to
about the size of the inner loop of FFTs (29 instructions in My 66000 ISA).

VVM utilizes the instruction queuing logic in the CPU to give the appearance
of vector processing while operating in such a partial order that precise
exceptions are both possible and practical. There are no artificial boundaries
on how many times the loop runs: it can execute for anywhere from 1 to
infinity go-arounds.

In VVM there is no vector register file, vector operands and results remain
in the data-flow portions of the data path (they don't get read from the RF,
nor do they get written to the RF, mostly). Scalar operands are read once and
held in the instruction queueing mechanisms, and used as many times as needed.

It is likely that during a vectorized loop the FETCH and DECODE stages will
be idle (after prefetching the instructions that follow loop termination).

Memory addresses are generated in order per loop,
Loads are hoisted above stores only after disambiguation,
Stores are delayed below loads only after disambiguation,
Calculations are computed in function unit order once per loop,
There can be multiple "lanes" of memory reference,
There can be multiple lanes of Calculations per Function Unit,

The compiler gets the illusion of vector registers (making porting easier)

This adds almost no HW to the core, certainly no VRF, no Vector only FUs,...
It utilizes Reservation Stations or similar instruction queueing facilities
already present. In short it provides the vast majority of the "bang" for
almost no "bucks", while complying with precise exceptions--which appear
to be from a scalar instruction stream (no additional OS software needed).

For example: a simple string copy routine in My 66000 assembler:

loop:
LDSB R7,[Rp+Ri]
STB R7,[Rq+Ri]
ADD Ri,Ri,1
BNZ R7,loop

vectorizes into:

loop: VEC {{VSV}{VSV}{VV}{V}}
LDSB R7,[Rp+Ri]
STB R7,[Rq+Ri]
LOOP NZ,R7,Ri,1

And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
nor written on a per cycle basis, R7 and Ri will be written to the RF when the
loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
is eliminated, while the speed of processing is increased. The VEC instruction
is examined once (DECODE) and is not part of the execution of the loop. The
LOOP instruction performs a comparison and an incrementation and conditionally
transfers control to after the VEC instruction.
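
To make the semantics concrete, here is a minimal Python sketch of the two loops above, with memory modelled as a bytearray. It is an illustration only, not a description of the hardware: the function names are invented for this note, and it assumes that LOOP increments Ri on the final pass just as the scalar ADD does, which the post does not spell out. It says nothing about timing; it only shows that both forms copy the same bytes and stop at the same terminator.

```python
def ldsb(mem, addr):
    """Load one byte and sign-extend it, as the LDSB mnemonic suggests."""
    b = mem[addr]
    return b - 256 if b >= 128 else b


def copy_scalar(mem, rp, rq):
    """Model of the 4-instruction scalar loop."""
    ri = 0
    while True:
        r7 = ldsb(mem, rp + ri)      # LDSB R7,[Rp+Ri]
        mem[rq + ri] = r7 & 0xFF     # STB  R7,[Rq+Ri]
        ri += 1                      # ADD  Ri,Ri,1
        if r7 == 0:                  # BNZ  R7,loop
            break
    return ri


def copy_vvm(mem, rp, rq):
    """Model of the VEC/LOOP form; LOOP fuses compare, increment and branch."""
    ri = 0
    again = True
    while again:
        r7 = ldsb(mem, rp + ri)      # LDSB R7,[Rp+Ri]
        mem[rq + ri] = r7 & 0xFF     # STB  R7,[Rq+Ri]
        ri, again = ri + 1, r7 != 0  # LOOP NZ,R7,Ri,1 (assumed semantics)
    return ri
```

Both functions return the final value of Ri, i.e. the number of bytes copied including the terminating zero.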

The above vectorized loop can run at 1 loop per cycle if there are 2 lanes
of memory. That same code can run 4 loops per cycle if there are 8 lanes of
memory. And so on.
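
The arithmetic behind those figures can be sketched as follows, assuming memory lanes are the only limit and that each pass of the string copy makes two memory references (one load, one store). The function name is made up for this note, and the result is only an upper bound under that assumption.

```python
from fractions import Fraction


def iterations_per_cycle(memory_lanes, mem_refs_per_iteration):
    """Upper bound on loop passes per cycle if memory lanes are the bottleneck."""
    return Fraction(memory_lanes, mem_refs_per_iteration)


# The string copy has one LDSB and one STB per pass.
two_lanes = iterations_per_cycle(2, 2)    # 1 pass per cycle
eight_lanes = iterations_per_cycle(8, 2)  # 4 passes per cycle
```

This matches the numbers quoted above: 2 lanes give 1 loop per cycle and 8 lanes give 4.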

A small CPU (1-wide in order) can perform VVM with a few sequences added to
the control unit. So the cost to the really small machine is virtually zero.

As implementations get bigger they can choose to implement the number of lanes
that make sense to the rest of their design point.

All machine versions can run the same expression of the Vectorized loop.

Memory aliasing is handled by running the loops in a restricted partial order
so that if the software expression of the loop contains memory aliasing, the
loop runs slower obeying memory dependencies, yet if the same loop contains
no memory dependencies, it runs faster. SW has to do nothing to obtain this
performance and protection (unlike vector machines with vector registers).
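
A toy model of that behaviour, in Python, is sketched below. It is not the actual hardware mechanism (the post does not describe one); it simply batches up to `lanes` passes of a byte-copy loop per step and cuts a batch short whenever a pass would load an address that an earlier pass in the same batch stores to. All names are hypothetical.

```python
def copy_sequential(mem, src, dst, n):
    """Reference behaviour: one pass at a time, in program order."""
    for i in range(n):
        mem[dst + i] = mem[src + i]


def copy_lanes(mem, src, dst, n, lanes):
    """Copy n bytes, batching up to `lanes` passes per step.

    Returns the number of steps taken. A batch ends early when a pass
    would load from an address written by an earlier pass in the batch.
    """
    steps = 0
    i = 0
    while i < n:
        pending_stores = set()
        batch = []
        while i < n and len(batch) < lanes:
            if src + i in pending_stores:  # load depends on an in-flight store
                break
            batch.append(i)
            pending_stores.add(dst + i)
            i += 1
        values = [mem[src + k] for k in batch]  # all loads of the batch
        for k, v in zip(batch, values):         # then all stores, in order
            mem[dst + k] = v
        steps += 1
    return steps
```

With overlapping buffers (for example `dst = src + 1`) every batch shrinks to a single pass, so the copy takes `n` steps but still matches `copy_sequential`; with disjoint buffers the same call finishes in about `n / lanes` steps.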

BGB

Nov 3, 2019, 1:11:28 PM
On 11/3/2019 11:26 AM, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
>> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
>>
>> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
>
> Its "My 66000" note: lower case 'y'.
> Its "Alsup".
>

I think the bigger issue here is the apparent lack of any publicly
visible documentation for the ISA (at least, that I am aware of).

Partial contrast is my BJX2 ISA, which at least has stuff on GitHub
which people can look at if they are interested in where I am going with
this.


Though, my posts don't always match the ISA spec 1:1, as sometimes
details may end up changing or being implemented differently or features
may end up dropped, or I may be writing speculatively as-if a certain
instruction existed, ...

I may need to do some cleanup / revision at some point, as the design
has accumulated some unnecessary baggage in a few areas, ... ( It is
possible a lot of the stuff not currently implemented in the 'E' and 'F'
profiles may end up being dropped and/or redesigned; with the 'F'
profile being used as the basis of a partially redesigned version of the
core ISA )


Did start to speculate maybe "My 66000" is intended to be a proprietary
/ commercial ISA, or it could have to do with patent reasons, or
something... But, I don't really know, and at best can only speculate.

Started writing something about this, but then kind of got distracted
and went down a rabbit hole about things like FAT32+LFN's and ExFAT
being kinda pointless for the majority of use cases (and probably is
"only really a thing" due to a likely attempt at a money-grab on MS's
part; since they can no longer make any money off FAT32 due to the
relevant patents having expired; and even then, as written, wouldn't
have likely applied to a VFAT driver in a Unix-style OS in the
first-place as the actual claims were essentially N/A).

In this way, it was sorta like the MIPS related suits (suing over
instructions which weren't even implemented by the infringing party), or
S3TC (sort of a long-standing bogeyman hindering compressed-texture
support in MesaGL and similar, ...).

...

MitchAlsup

Nov 3, 2019, 1:18:06 PM
On Sunday, November 3, 2019 at 12:11:28 PM UTC-6, BGB wrote:
> On 11/3/2019 11:26 AM, MitchAlsup wrote:
> > On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> >> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
> >>
> >> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
> >
> > Its "My 66000" note: lower case 'y'.
> > Its "Alsup".
> >
>
> I think the bigger issue here is the apparent lack of any publicly
> visible documentation for the ISA (at least, that I am aware of).

I just don't have (or know of) a place to put it, or a way to update it
somewhat regularly once it is there.
>
> Partial contrast is my BJX2 ISA, which at least has stuff on GitHub
> which people can look at if they are interested in where I am going with
> this.
>
>
> Though, my posts don't always match the ISA spec 1:1, as sometimes
> details may end up changing or being implemented differently or features
> may end up dropped, or I may be writing speculatively as-if a certain
> instruction existed, ...
>
> I may need to do some cleanup / revision at some point, as the design
> has accumulated some unnecessary baggage in a few areas, ... ( It is
> possible a lot of the stuff not currently implemented in the 'E' and 'F'
> profiles may end up being dropped and/or redesigned; with the 'F'
> profile being used as the basis of a partially redesigned version of the
> core ISA )
>
>
> Did start to speculate maybe "MY 66000" is intended to be a proprietary
> / commercial ISA, or it could have to do with patent reasons, or
> something... But, I don't really know, and at best can only speculate.

It is not supposed to be proprietary at all, and it is intended to be given
away at zero cost to the receiver. The only restriction is that acknowledgment
of the originator (me) be carried forward by any recipient.
>
> Started writing something about this, but then kind of got distracted
> and went down a rabbit hole about things like FAT32+LFN's and ExFAT
> being kinda pointless for the majority of use cases (and probably is
> "only really a thing" due to a likely attempt at a money-grab on MS's
> part; since they can no longer make any money off FAT32 due to the
> relevant patents having expired; and even then, as written, wouldn't
> have likely applied to a VFAT driver in a Unix-style OS in the
> first-place as the actual claims were essentially N/A).
>
> In this way, it was sorta like the MIPS related suits (suing over
> instructions which weren't even implemented by the infringing party), or
> S3TC (sort of a long-standing bogeyman hindering compressed-texture
> support in MesaGL and similar, ...).

It is stuff like this that left SIMD instructions out--I prefer to perform
SIMD with lanes of coordinated calculation rather than blowing up ISA.

Stephen Fuld

Nov 3, 2019, 1:41:27 PM
Since the loop is using bytes, and assuming you have a D-cache, don't
you require fewer lanes of memory, i.e. only read or write memory once
per cache line, not per byte? Or by "lanes of memory" do you mean
load/store units within the CPU?



>
> A small CPU (1-wide in order) can perform VVM with a few sequences added to
> the control unit. So the cost to the really small machine is virtually zero.
>
> As implementations get bigger they can choose to implement the number of lanes
> that make sense to the rest of their design point.
>
> All machine versions can run the same expression of the Vectorized loop.
>
> Memory aliasing is handled by running the loops in a restricted partial order
> so that if the software expression of the loop contains memory aliasing, the
> loop runs slower obeying memory dependencies, yet if the same loop contains
> no memory dependencies, it runs faster. SW has to do nothing to obtain this
> performance and protection (unlike vector machines with vector registers).

I think you are sort of "burying the lead", perhaps because it is
obvious to you.

With VVM, you can automagically use as many ALUs as the hardware
provides without any changes to your source code. So you get more than
the capabilities of a typical vector processor for very little extra
"stuff", either in the hardware or software. As such it is quite
different from Luke's RISC-V proposal.


In case it isn't obvious, I am a big fan! :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Kyle Hayes

Nov 3, 2019, 4:54:08 PM
On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
> On 11/3/2019 9:26 AM, MitchAlsup wrote:
> > On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> >> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
> >>
> >> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
> >
> > Its "My 66000" note: lower case 'y'.
> > Its "Alsup".

(Responding to Mitch)

Apologies, not enough caffeine when I wrote that :-(

Stayed up way too late last night for work and I used to work with a Mick Alsop. In the morning fog what was in the brain and what went out the fingers did not match :-(

Sort of a cached incoherency protocol...

> > The virtual Vector Method (VVM) documentation is not ready for prime time.
> >
> > It is a method for adding vector processing to a CPU/ISA that vectorizes
> > loops instead of vectorizing instructions.
> >
> > In fact VVM adds but 2 instructions to the ISA, one denoting which registers
> > in the loop are vectorized and which are Scalar. Scalar registers are static
> > over the loop, vector registers change at least once per loop; the other
> > denoting the loop termination and looping condition.
> >

This is the VEC and LOOP instructions below?
Just to make sure I understand:

Rp = source string pointer
Rq = destination string pointer
R7 = byte being copied
Ri = loop index

> > vectorizes into:
> >
> > loop: VEC {{VSV}{VSV}{VV}{V}}
> > LDSB R7,[Rp+Ri]
> > STB R7,[Rq+Ri]
> > LOOP NZ,R7,Ri,1

And this is where my stupidity shows. I can't figure out what the {{VSV}{VSV}{VV}{V}} mean.

I thought maybe it applied to the instruction groups below it with each {} "sub" section being one instruction... But I am not sure about that because there are four such {} groupings but only three instructions below VEC.

Taking this one instruction at a time:

LDSB R7, [Rp+Ri]

OK, so we want R7 to be a "vector" and Ri to be a "vector" since the value changes per loop iteration. Rp stays as a scalar. Hence {VSV}?

For STB this makes sense in the same way.

For LOOP... Not sure. Why {VV} and {V}? The first {VV} makes some sense as the R7 and Ri arguments of the instruction should be "vectorized". But I am not sure what happens with the last {V}.

Can I buy a clue, please?

> > And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
> > with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
> > nor written on a per cycle basis, R7 and Ri will be written to the RF when the
> > loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
> > is eliminated, while the speed of processing is increased. The VEC instruction
> > is examined once (DECODE) and is not part of the execution of the loop. The
> > LOOP instruction performs a comparison and an incrementation and conditionally
> > transfers control to after the VEC instruction.

The power savings are very impressive!

I think I am still a bit muddled here. This sounds like there is some more-or-less implicit state that is stored (maybe in a CR or something) when VEC "executes" (it doesn't, really) such that the branch target is saved. So can these VEC/LOOP blocks nest? What happens if you jump out of the middle of such a block?

This seems like a lot going on in the LOOP instruction. Maybe it is not as much as I thought. Looks like two read ports and a write port if I treat it as if it is a real loop instruction.

> > The above vectorized loop can run at 1 loop per cycle if there are 2 lanes
> > of memory. That same code can run 4 loops per cycle if there are 8 lanes of
> > memory. And so on.
>
>
> Since the loop is using bytes, and assuming you have a D-cache, don't
> you require fewer lanes of memory, i.e. only read or write memory once
> per cache line, not per byte? Or by "lanes of memory" do you mean
> load/store units within the CPU?

(Responding to Stephen)

Hmm, interesting point about the D$. I was reading "lanes of memory" as memory controllers/memory banks.

What I can see of this seems very, very efficient. I think :-)

> > A small CPU (1-wide in order) can perform VVM with a few sequences added to
> > the control unit. So the cost to the really small machine is virtually zero.
> >
> > As implementations get bigger they can choose to implement the number of lanes
> > that make sense to the rest of their design point.
> >
> > All machine versions can run the same expression of the Vectorized loop.
> >
> > Memory aliasing is handled by running the loops in a restricted partial order
> > so that if the software expression of the loop contains memory aliasing, the
> > loop runs slower obeying memory dependencies, yet if the same loop contains
> > no memory dependencies, it runs faster. SW has to do nothing to obtain this
> > performance and protection (unlike vector machines with vector registers).
>
> I think you are sort of "burying the lead", perhaps because it is
> obvious to you.
>
> With VVM, you can automagically use as many ALUs as the hardware
> provides without any changes to your source code. So you get more than
> the capabilities of a typical vector processor for very little extra
> "stuff", either in the hardware or software. As such is is quite
> different from Luke's RISC V proposal.

One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.

Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).

Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.

> In case it isn't obvious, I am a big fan! :-)

I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!

Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!

SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.

I've just started reading up on RISC-V's vector proposal as well and so far it reminds me a lot of SVE.

Best,
Kyle


BGB

Nov 3, 2019, 5:16:01 PM
On 11/3/2019 12:18 PM, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 12:11:28 PM UTC-6, BGB wrote:
>> On 11/3/2019 11:26 AM, MitchAlsup wrote:
>>> On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
>>>> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
>>>>
>>>> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
>>>
>>> Its "My 66000" note: lower case 'y'.
>>> Its "Alsup".
>>>
>>
>> I think the bigger issue here is the apparent lack of any publicly
>> visible documentation for the ISA (at least, that I am aware of).
>
> I just don't have/know-of a place to put it, and after it is there how
> to update it somewhat regularly.

OK. I was sticking some stuff in the GitHub Wiki thingy, but it kinda
leaves something to be desired (at least when using it to parse the
MediaWiki format; barely works and is prone to misparse stuff and
otherwise screw up).


If there were something sort of like a hybrid of MediaWiki and Adobe
Acrobat (*), this might be useful sometimes, but alas...

*: Say, accepts its input as a pile of files in a MediaWiki-like format
(I prefer this to MarkDown variants), produces competent HTML and PDF
output, supports operation both as a standalone desktop application and
online via a web-browser, is ideally open-source, reasonably easy to
build (IOW: not the usual FOSS 3rd party dependency hell),
cross-platform, ...


>>
>> Partial contrast is my BJX2 ISA, which at least has stuff on GitHub
>> which people can look at if they are interested in where I am going with
>> this.
>>
>>
>> Though, my posts don't always match the ISA spec 1:1, as sometimes
>> details may end up changing or being implemented differently or features
>> may end up dropped, or I may be writing speculatively as-if a certain
>> instruction existed, ...
>>
>> I may need to do some cleanup / revision at some point, as the design
>> has accumulated some unnecessary baggage in a few areas, ... ( It is
>> possible a lot of the stuff not currently implemented in the 'E' and 'F'
>> profiles may end up being dropped and/or redesigned; with the 'F'
>> profile being used as the basis of a partially redesigned version of the
>> core ISA )
>>
>>
>> Did start to speculate maybe "MY 66000" is intended to be a proprietary
>> / commercial ISA, or it could have to do with patent reasons, or
>> something... But, I don't really know, and at best can only speculate.
>
> It is not supposed to be proprietary at all, and it is intended to be given
> away at zero cost to receiver. The only restriction is that the originator
> (me) be transferred/acknowledged by any recipient.

OK.


My intended policy for BJX2 falls in the "people can mostly do whatever
with it" camp. If it gets popular, alternate competing implementations
and not-binary-compatible variants seem almost inevitable, but the ideal
case would be if binary incompatibility (and mutually incompatible ISA
extensions) can be kept small.

This latter point would require getting the ISA to a stage where
"freezing" the core parts of the ISA makes sense. I don't feel I am
quite to this stage yet, but I am gradually getting closer (though, at
this point, what I have now will probably be at least
"reasonably close" to the final product).


>>
>> Started writing something about this, but then kind of got distracted
>> and went down a rabbit hole about things like FAT32+LFN's and ExFAT
>> being kinda pointless for the majority of use cases (and probably is
>> "only really a thing" due to a likely attempt at a money-grab on MS's
>> part; since they can no longer make any money off FAT32 due to the
>> relevant patents having expired; and even then, as written, wouldn't
>> have likely applied to a VFAT driver in a Unix-style OS in the
>> first-place as the actual claims were essentially N/A).
>>
>> In this way, it was sorta like the MIPS related suits (suing over
>> instructions which weren't even implemented by the infringing party), or
>> S3TC (sort of a long-standing bogeyman hindering compressed-texture
>> support in MesaGL and similar, ...).
>
> It is stuff like this that left SIMD instructions out--I prefer to perform
> SIMD with lanes of coordinated calculation rather than blowing up ISA.

A person can also do SIMD without going to the same level of absurd
extremes as SSE or NEON...


Both a basic SIMD variant and explicitly parallel execute lanes exist in
BJX2, though relying on the latter is a little heavyweight in the sense
that there isn't currently a good way to gloss over implementation
differences here (such as the number of execute lanes, or which types of
operations are allowed in which lanes), and 3x integer lanes have less
throughput than 4x packed words (though 3x lanes + SIMD could potentially
reach around 12x word-ops at a time).


It is likely, if this were used in a more "application focused"
environment, binaries would primarily be distributed in a bytecode or a
"sanitized" scalar subset of the ISA (which could then be, as-needed,
dynamically-translated into the variant used on the hardware; or run
directly at potentially diminished performance).

In this case, the translation/specialization stage would serve a similar
role to superscalar support in a CPU, just handled in software (probably
done AOT, with the translated binaries being cached somewhere).


A bytecode format could be used, which would likely need to serve a few
criteria:
Can be used to generate reasonably efficient code for multiple targets;
Natively handles languages like C or C++ (among others);
Probably stack-based, with implicit or explicit static types;
Ideally, can be made flexible WRT details like "sizeof(void *)", but
need not gloss over the C library or preprocessor defines (more
problematic);
Can natively handle concepts like pointers and "goto" without creating
an awkward/convoluted mess (unlike, say, WASM);
...


The latter case (a sanitized BJX2 subset) could probably be done with
minimal annotation or metadata, mostly it requires:
All static parts of the control flow graph either be statically
reachable or have some sort of metadata/annotation to indicate their
existence (possibly in the form of "null exports" for anything only
reachable via function pointers);
No support for self-modification of any statically modified code;
Predefined methods for things like loading function pointers;
...

This is less general, but would allow a comparably simpler translator
than would be needed for a higher-level bytecode format. A
sanity-checker could also be used to verify that the code conforms to
the subset (code would be rejected if it goes outside the defined subset).


It should also be possible to "safely" run untrusted BJX2 code more
directly on the CPU (via such a subset) without the need for
address-space trickery (or using x86 segmentation and a hackish ISA
subset, as in the case of the Google NaCl thing), as the ISA design
allows imposing Unix-like access-right checking at the MMU level (in
addition to the usual user/supervisor split).
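
For readers unfamiliar with the idea, "Unix-like access-right checking" applied to a page can be sketched roughly as below. This is only an assumption-laden illustration in Python: the parameter names, the 9-bit rwxrwxrwx mode layout and the owner/group/other selection order are borrowed from ordinary Unix file permissions, and are not taken from the BJX2 specification.

```python
# Permission bits, as in Unix file modes.
R, W, X = 4, 2, 1


def access_allowed(page_uid, page_gid, page_mode, task_uid, task_gid, want):
    """Return True if a task may perform the `want` access on a page.

    page_mode is a 9-bit value such as 0o750 (owner rwx, group r-x, other ---).
    All names here are hypothetical; real hardware would do this per TLB entry.
    """
    if task_uid == page_uid:
        perms = (page_mode >> 6) & 7  # owner bits
    elif task_gid == page_gid:
        perms = (page_mode >> 3) & 7  # group bits
    else:
        perms = page_mode & 7         # everyone else
    return (perms & want) == want
```

For example, with a page owned by ID 10, group 20 and mode 0o750, a task running as 11/20 could read and execute the page but not write it, and a task outside both the owner and the group would get no access at all.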


It may sound a little silly, but this idea may have been partly inspired
by how things like door locks were presented in "Tron 2.0"; sub-programs
essentially needing dedicated keys rather than having free rein of the
whole address space seemed like it could be helpful in terms of
security, and doesn't add all that much in terms of cost (apart from a
little overhead in the page-tables and TLBs).


There are currently a few limitations though (to the "VUGID" system):
Pretty much no existing software or OS's will support this;
The ID's will necessarily exist in a separate address space from those
used for file-access rights checking;
The range of UID's / GID's available to an application is fairly limited;
Actually using this feature (in any non-trivial way) will be kind of a pain;
...

Note that User and Supervisor spaces would essentially also have their
own separate VUGID spaces.


For the time being, TestKern will also use VUGID in place of separate
address spaces for applications (the main address space for each program
would essentially be "Apps/PID"; and there may be a system call for
allocating new VUID/VGID values). It is likely separate per-process
address spaces will be added eventually though.

I suspect at most only a rare few programs will actually have reason to
care that VUGID exists.


As noted (likely cruft/hackery) in TestKern:
Single address space, cooperative scheduling;
VFS system calls will always pass absolute paths;
CWD/PWD will be handled via the syscall wrappers;
The shell, basic commands, ... are all rolled together.
Sort of like Busybox/Toybox, but also includes the shell;
...

I am still debating some details, but it is likely I will not spawn a
separate shell process for handling shell scripts. Instead there will
likely be a combined shell-script interpreter that handles everything
(with separate contexts for different shell scripts).

All this may or may not take place in the kernel, still TBD.
Cases like "#!/bin/sh" would be handled "as magic", rather than by
spawning a new shell instance to run each script (though, for other
programs, "#!" will launch it as a new instance).

Combined with rolling basic commands directly into the shell (similar to
how it works in DOS/Windows) should be able to reduce overhead somewhat.


<snip>

rick.c...@gmail.com

Nov 3, 2019, 6:33:07 PM
On Sunday, November 3, 2019 at 1:18:06 PM UTC-5, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 12:11:28 PM UTC-6, BGB wrote:
> > I think the bigger issue here is the apparent lack of any publicly
> > visible documentation for the ISA (at least, that I am aware of).
>
> I just don't have/know-of a place to put it, and after it is there how
> to update it somewhat regularly.

You could try Open Cores:

https://opencores.org/forum/Cores

You would likely be the big fish there, sir, garnering a
quick following.

--
Rick C. Hodgin

MitchAlsup

Nov 3, 2019, 7:25:14 PM
On Sunday, November 3, 2019 at 3:54:08 PM UTC-6, Kyle Hayes wrote:
> On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
> > On 11/3/2019 9:26 AM, MitchAlsup wrote:
> > > On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> > >> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
> > >>
> > >> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
> > >
> > > Its "My 66000" note: lower case 'y'.
> > > Its "Alsup".
>
> (Responding to Mitch)
>
> Apologies, not enough caffeine when I wrote that :-(
>
> Stayed up way too late last night for work and I used to work with a Mick Alsop. In the morning fog what was in the brain and what went out the fingers did not match :-(
>
> Sort of a cached incoherency protocol...
>
> > > The virtual Vector Method (VVM) documentation is not ready for prime time.
> > >
> > > It is a method for adding vector processing to a CPU/ISA that vectorizes
> > > loops instead of vectorizing instructions.
> > >
> > > In fact VVM adds but 2 instructions to the ISA, one denoting which registers
> > > in the loop are vectorized and which are Scalar. Scalar registers are static
> > > over the loop, vector registers change at least once per loop; the other
> > > denoting the loop termination and looping condition.
> > >
>
> This is the VEC and LOOP instructions below?

Yes.
Yes
>
> > > vectorizes into:
> > >
> > > loop: VEC {{VSV}{VSV}{VV}{V}}
> > > LDSB R7,[Rp+Ri]
> > > STB R7,[Rq+Ri]
> > > LOOP NZ,R7,Ri,1
>
> And this is where my stupidity shows. I can't figure out what the {{VSV}{VSV}{VV}{V}} mean.

Right now, it is simply syntactic sugar.
>
> I thought maybe it applied to the instruction groups below it with each {} "sub" section being on instruction... But I am not sure about that because there are four such {} groupings but only three instructions below VEC.
>
> Taking this one instruction at a time:
>
> LDSB R7, [Rp+Ri]
>
> OK, so we want R7 to be a "vector" and Ri to be a "vector" since the value changes per loop iteration. Rp stays as a scalar. Hence {VSV}?

Yes.
>
> For STB this makes sense in the same way.
>
> For LOOP... Not sure. Why {VV} and {V}? The first {VV} make some sense as the R7 and Ri arguments of the instruction should be "vectorized". But I am not sure what happens with the last {V}.

Ri is the loop control vector (changing every iteration)--and thus not scalar.
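
As an illustration of the rule stated earlier in the thread (scalar registers are static over the loop, vector registers change at least once per loop), the per-instruction V/S pattern can be derived mechanically from which registers the loop body writes. The Python sketch below is hypothetical: the operand lists are written out by hand, the function name is invented, and the trailing {V} group of the VEC annotation is not modelled.

```python
def annotate(loop_body):
    """Return one V/S string per instruction.

    loop_body is a list of (mnemonic, operands_in_order, registers_written).
    A register is 'V' if any instruction in the loop writes it, else 'S'.
    """
    written = {reg for _m, _ops, dests in loop_body for reg in dests}
    return [
        "".join("V" if reg in written else "S" for reg in operands)
        for _m, operands, _dests in loop_body
    ]


# The string copy loop: LDSB writes R7, LOOP increments Ri, STB writes no register.
STRING_COPY = [
    ("LDSB", ["R7", "Rp", "Ri"], ["R7"]),
    ("STB",  ["R7", "Rq", "Ri"], []),
    ("LOOP", ["R7", "Ri"],       ["Ri"]),
]

patterns = annotate(STRING_COPY)
```

Under these assumptions `patterns` holds one string per instruction, matching the first three groups of {{VSV}{VSV}{VV}{V}}: R7 and Ri come out as vector, Rp and Rq as scalar.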
>
> Can I buy a clue, please?
>
> > > And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
> > > with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
> > > nor written on a per cycle basis, R7 and Ri will be written to the RF when the
> > > loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
> > > is eliminated, while the speed of processing is increased. The VEC instruction
> > > is examined once (DECODE) and is not part of the execution of the loop. The
> > > LOOP instruction performs a comparison and an incrementation and conditionally
> > > transfers control to after the VEC instruction.
>
> The power savings are very impressive!
>
> I think I am still a bit muddled here. This sounds like there is some more-or-less implicit state that is stored (maybe in a CR or something) when VEC "executes" (it doesn't, really) such that the branch target is saved. So can these VEC/LOOP blocks nest? What happens if you jump out of the middle of such a block?

The only implicit state is the address of the top of the loop.
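
For readers following along, here is a minimal C sketch of what the three-instruction loop above computes when run as plain scalar code. The function name byte_copy is hypothetical, and it assumes that LOOP NZ,R7,Ri,1 adds 1 to Ri and branches back while R7 is non-zero (the thread only says LOOP "performs a comparison and an incrementation"):

```c
#include <assert.h>
#include <stddef.h>

/* Scalar meaning of the VEC/LOOP example: copy bytes from p to q up to
 * and including the terminating zero byte. Returns the final index,
 * i.e. the value Ri would hold when the loop exits.
 * Assumption: LOOP NZ,R7,Ri,1 increments Ri by 1 and loops while R7 != 0. */
size_t byte_copy(signed char *q, const signed char *p)
{
    assert(p != NULL && q != NULL);
    size_t i = 0;        /* Ri: changes every iteration, hence "vector" */
    signed char r7;      /* R7: the loaded byte, also changes every iteration */
    do {
        r7 = p[i];       /* LDSB R7,[Rp+Ri]  (Rp is loop-invariant: scalar) */
        q[i] = r7;       /* STB  R7,[Rq+Ri]  (Rq is loop-invariant: scalar) */
        i += 1;          /* LOOP ...,Ri,1    */
    } while (r7 != 0);   /* LOOP NZ,R7       */
    return i;
}
```

Rp and Rq never change inside the loop, which is why they carry the S tag, while R7 and Ri carry the V tag.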
>
> This seems like a lot going on in the LOOP instruction. Maybe it is not as much as I thought. Looks like two read ports and a write port if I treat it as if it is a real loop instruction.
>
> > > The above vectorized loop can run at 1 loop per cycle if there are 2 lanes
> > > of memory. That same code can run 4 loops per cycle if there are 8 lanes of
> > > memory. And so on.
> >
> >
> > Since the loop is using bytes, and assuming you have a D-cache, don't
> > you require fewer lanes of memory, i.e. only read or write memory once
> > per cache line, not per byte? Or by "lanes of memory" do you mean
> > load/store units within the CPU?
>
> (Responding to Stephen)
>
> Hmm, interesting point about the D$. I was reading "lanes of memory" as memory controllers/memory banks.

Lanes into the cache.
>
> What I can see of this seems very, very efficient. I think :-)
>
> > > A small CPU (1-wide in order) can perform VVM with a few sequences added to
> > > the control unit. So the cost to the really small machine is virtually zero.
> > >
> > > As implementations get bigger they can choose to implement the number of lanes
> > > that make sense to the rest of their design point.
> > >
> > > All machine versions can run the same expression of the Vectorized loop.
> > >
> > > Memory aliasing is handled by running the loops in a restricted partial order
> > > so that if the software expression of the loop contains memory aliasing, the
> > > loop runs slower obeying memory dependencies, yet if the same loop contains
> > > no memory dependencies, it runs faster. SW has to do nothing to obtain this
> > > performance and protection (unlike vector machines with vector registers).
> >
> > I think you are sort of "burying the lead", perhaps because it is
> > obvious to you.
> >
> > With VVM, you can automagically use as many ALUs as the hardware
> > provides without any changes to your source code. So you get more than
> > the capabilities of a typical vector processor for very little extra
> > "stuff", either in the hardware or software. As such it is quite
> > different from Luke's RISC V proposal.
>
> One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.

RISC-V has vector registers, VVM does not.
>
> Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).

Obviously.
>
> Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.
>
> > In case it isn't obvious, I am a big fan! :-)
>
> I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
>
> Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!

VEC uses an immediate field to decorate K following instructions with vector
or scalar designations so that the various kinds of instruction queueing can
be properly dedicated to the loop at hand.
>
> SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.

And thus way more expensive in HW.

Kyle Hayes

Nov 3, 2019, 11:31:53 PM
On Sunday, November 3, 2019 at 4:25:14 PM UTC-8, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 3:54:08 PM UTC-6, Kyle Hayes wrote:
> > On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
> > > On 11/3/2019 9:26 AM, MitchAlsup wrote:
[snip]
> > > > vectorizes into:
> > > >
> > > > loop: VEC {{VSV}{VSV}{VV}{V}}
> > > > LDSB R7,[Rp+Ri]
> > > > STB R7,[Rq+Ri]
> > > > LOOP NZ,R7,Ri,1
> >
> > And this is where my stupidity shows. I can't figure out what the {{VSV}{VSV}{VV}{V}} means.
>
> Right now, it is simply syntactic sugar.
> >
> > I thought maybe it applied to the instruction groups below it with each {} "sub" section being one instruction... But I am not sure about that because there are four such {} groupings but only three instructions below VEC.
> >
> > Taking this one instruction at a time:
> >
> > LDSB R7, [Rp+Ri]
> >
> > OK, so we want R7 to be a "vector" and Ri to be a "vector" since the value changes per loop iteration. Rp stays as a scalar. Hence {VSV}?
>
> Yes.
> >
> > For STB this makes sense in the same way.
> >
> > For LOOP... Not sure. Why {VV} and {V}? The first {VV} makes some sense as the R7 and Ri arguments of the instruction should be "vectorized". But I am not sure what happens with the last {V}.
>
> Ri is the loop control vector (changing every iteration)--and thus not scalar.

Er... but wasn't that covered by the {VV} part?

LOOP NZ,R7,Ri,1

So the third group in VEC was {VV}. Does that apply to R7 and Ri above or to something else?

> > > > And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
> > > > with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
> > > > nor written on a per cycle basis, R7 and Ri will be written to the RF when the
> > > > loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
> > > > is eliminated, while the speed of processing is increased. The VEC instruction
> > > > is examined once (DECODE) and is not part of the execution of the loop. The
> > > > LOOP instruction performs a comparison and an incrementation and conditionally
> > > > transfers control to after the VEC instruction.
> >
> > The power savings are very impressive!
> >
> > I think I am still a bit muddled here. This sounds like there is some more-or-less implicit state that is stored (maybe in a CR or something) when VEC "executes" (it doesn't, really) such that the branch target is saved. So can these VEC/LOOP blocks nest? What happens if you jump out of the middle of such a block?
>
> The only implicit state is the address of the top of the loop.

I was less than clear. That's what I meant by branch target. I was still thinking of the non-vectorized version with a branch at the bottom to hop back to the start of the loop.
Did I misunderstand? There are two proposals for "vectorizing" RISC-V. One seems vaguely similar to ARM's SVE and has architectural vector registers. The other is Luke's proposal which does not use vector registers. At least that is how I understood them.

> >
> > Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).
>
> Obviously.
> >
> > Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.
> >
> > > In case it isn't obvious, I am a big fan! :-)
> >
> > I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
> >
> > Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
>
> VEC uses an immediate field to decorate K following instructions with vector
> or scalar designations so that the various kinds of instruction queueing can
> be properly dedicated to the loop at hand.

What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.

The VEC instruction has a fairly limited window unless it can use a lot of bits. Does the My 66000 use variable length instructions? I thought I saw that it does...

I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.

I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.

> > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
> >
> And thus way more expensive in HW.

Which then encourages multiple instruction set dialects.

Though RISC-V is already doing that. I fear that will be a problem for RISC-V. It will really be a family of somewhat-related ISAs rather than a single one. x86 tends (more or less) to have each new generation of ISA be a superset of the previous generation. RISC-V is definitely heading in a more patchwork direction.

It will be interesting to see how ARM pushes (if it does) SVE. They have three multiple-data extensions now: NEON, SVE and the other new one for more deeply embedded systems, M...something.

Thanks for all the info!

Best,
Kyle

MitchAlsup

Nov 4, 2019, 11:28:41 AM
Both R7 and Ri
When I had long conversations (e-mail) with Luke, his proposal still had
real vector registers and a chunk length of 64.
> >
> > RISC-V has vector registers, VVM does not.
>
> Did I misunderstand? There are two proposals for "vectorizing" RISC-V. One seems vaguely similar to ARM's SVE and has architectural vector registers. The other is Luke's proposal which does not use vector registers. At least that is how I understood them.
>
> > >
> > > Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).
> >
> > Obviously.
> > >
> > > Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.
> > >
> > > > In case it isn't obvious, I am a big fan! :-)
> > >
> > > I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
> > >
> > > Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
> >
> > VEC uses an immediate field to decorate K following instructions with vector
> > or scalar designations so that the various kinds of instruction queueing can
> > be properly dedicated to the loop at hand.
>
> What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.

This is a follow-on of my predication scheme. In My 66000, there are predicate
instructions (like branch instructions) but use the immediate field to cast
a predicate (execute or don't execute) over a number of sequentially following
instructions. The predicate instruction also performs a conditional test and
uses this to determine if the instructions execute or not.

The VEC instruction contains a 16-bit, 32-bit, or 64-bit designator for the
{VSV} fields (in little endian notation; R->L)
>
> The VEC instruction has a fairly limited window unless it can use a lot of bits. Does the My 66000 use variable length instructions? I thought I saw that it does...

Yes, My 66000 has variable length instructions, but all extensions of the
instruction are simply data. My 66000 has::

OP Rd,Rs1,Rs2
OP Rd,Rs1,-Rs2
OP Rd,-Rs1,Rs2
OP Rd,-Rs1,-Rs2

OP Rd,Rs,IMM16
OP Rd,Rs,IMM32
OP Rd,IMM32,Rs
OP Rd,Rs,IMM64
OP Rd,IMM64,Rs

3OP Rd,Rs1,Rs2,Rs3
3OP Rd,Rs1,Rs2,-Rs3
3OP Rd,Rs1,-Rs2,Rs3
3OP Rd,Rs1,-Rs2,-Rs3

MEM Rd,[Rb+DISP16]
MEM Rd,[Rb+Ri<<s]
MEM Rd,[Rb+DISP32]
MEM Rd,[Rb+Ri<<s+DISP32]
MEM Rd,[Rb+DISP64]
MEM Rd,[Rb+Ri<<s+DISP64]

ST IMM32,[any of the MEMs]
ST IMM64,[any of the MEMs]
>
> I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.

While this is my general inclination, this part has not solidified.
>
> I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.
>
> > > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.

This is what I call:: "Blowing up the ISA" to fit Vectors/SIMD/...
> > >
> > And thus way more expensive in HW.
>
> Which then encourages multiple instruction set dialects.

Which is why VVM adds but 2 instructions.
>
> Though RISC-V is already doing that. I fear that will be a problem for RISC-V. It will really be a family of somewhat-related ISAs rather than a single one. x86 tends (more or less) to have each new generation of ISA be a superset of the previous generation. RISC-V is definitely heading in a more patchwork direction.

Only the magnitude of the problem is unknown within RISC-V

Kyle Hayes

Nov 4, 2019, 9:00:37 PM
On Monday, November 4, 2019 at 8:28:41 AM UTC-8, MitchAlsup wrote:
[big snip]
> > > > One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.
>
> When I had long conversations (e-mail) with Luke, his proposal still had
> real vector registers and a chunk length of 64.

I think it has changed significantly from that. Now there are no new instructions and a special register (or set) is used in a way that reminds me a bit of your VEC instruction to tag certain registers as vector registers. However, instead of actually being vector registers, the CPU will use consecutive scalar registers for the vector elements.

So some parts seem mildly similar to your ideas in My 66000 (the use of a prefix, though in Luke's case it may just be a MOV to a special register) where others (use of existing scalars as vector elements) are quite different. There seems to be an inherent limit on scalability of Luke's idea when you run out of scalar registers. In reading through what he's posted, it was not clear to me whether you could use physical registers or only architectural registers. It seemed like the latter, but I did not dig that deep yet.

In the My 66000 (and I assume Luke's proposal) there must be some sort of way to handle traps and interrupts within a vectorized loop. How does My 66000 deal with that? With Luke's scheme, using scalar registers, it seems like there is no real difference between the vectorized state and a normal loop. However, I can't remember if he keeps the index in a scalar as well.

> > > VEC uses an immediate field to decorate K following instructions with vector
> > > or scalar designations so that the various kinds of instruction queueing can
> > > be properly dedicated to the loop at hand.
> >
> > What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.
>
> This is a follow-on of my predication scheme. In My 66000, there are predicate
> instructions (like branch instructions) but use the immediate field to cast
> a predicate (execute or don't execute) over a number of sequentially following
> instructions. The predicate instruction also performs a conditional test and
> uses this to determine if the instructions execute or not.
>
> The VEC instruction contains a 16-bit, 32-bit, or 64-bit designator for the
> {VSV} fields (in little endian notation; R->L)

Ah, OK. Even a simplistic 3-bits-per-instruction format for VEC would give you a 21-instruction "shadow" after the VEC instruction with a 64-bit immediate. That's quite large and should allow for many reasonable loops to be vectorized!
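
The arithmetic behind that estimate can be sketched in C. This assumes a fixed number of V/S bits per instruction, which is only a guess at the encoding (the thread says it has not solidified); shadow_length is a hypothetical helper name:

```c
#include <assert.h>

/* Number of following instructions a VEC designator could cover,
 * assuming every instruction consumes the same number of V/S bits.
 * This fixed-width scheme is an assumption, not the My 66000 encoding. */
unsigned shadow_length(unsigned designator_bits, unsigned bits_per_insn)
{
    assert(bits_per_insn > 0);
    return designator_bits / bits_per_insn;  /* leftover bits are unused */
}
```

With 3 bits per instruction, the 16-, 32- and 64-bit designators would cover 5, 10 and 21 instructions respectively.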

In the case where predicated instructions or vectorized instructions are interleaved with a function call, how are the extra state bits handled? On function return you'd need to know which succeeding instructions were predicated to not be executed and which were or which scalars were vectorized and which were not.

As above, for traps, this seems like someplace you need to be able to store state, but here, you would not be switching from user to supervisor mode (if you have that).

In Luke's proposal for RISC-V I seem to remember that the vectorization state was in a special CSR that could be read and restored back by user code. So as long as that state was pushed on the stack and restored back on a function call/return or a trap it seemed to work.

However, in My 66000, the state of the "vector" isn't in architecturally visible registers. I am probably missing something obvious :-(

> Yes, My 66000 has variable length instructions, but all extensions of the
> instruction are simply data. My 66000 has::
>
> OP Rd,Rs1,Rs2
> OP Rd,Rs1,-Rs2
> OP Rd,-Rs1,Rs2
> OP Rd,-Rs1,-Rs2

Do these negative variants provide sufficient extra code density to warrant the opcode space? It seems like it would not be much extra hardware as most of this would already be in the ALU and it is more of a matter of selecting the right functions.

> OP Rd,Rs,IMM16

No OP Rd, IMM16, Rs?

> OP Rd,Rs,IMM32
> OP Rd,IMM32,Rs
> OP Rd,Rs,IMM64
> OP Rd,IMM64,Rs

These can take care of constant loads too?

> 3OP Rd,Rs1,Rs2,Rs3
> 3OP Rd,Rs1,Rs2,-Rs3
> 3OP Rd,Rs1,-Rs2,Rs3
> 3OP Rd,Rs1,-Rs2,-Rs3

These seem good for code density for calculating various more complicated addresses. Is that the intent?

> MEM Rd,[Rb+DISP16]

No MEM Rd, [Rb+Ri<<s+IMM16]?

> MEM Rd,[Rb+Ri<<s]
> MEM Rd,[Rb+DISP32]
> MEM Rd,[Rb+Ri<<s+DISP32]
> MEM Rd,[Rb+DISP64]
> MEM Rd,[Rb+Ri<<s+DISP64]

These are mostly loads?

Can these be done relative to the PC?

Are these all available in byte, 16-bit word, 32-bit word and 64-bit word forms (for integers)?

> ST IMM32,[any of the MEMs]
> ST IMM64,[any of the MEMs]

I noticed that many of the 16-bit immediate versions do not always have symmetrical pairs. No need or do they fall into different instruction formats than the 32 and 64-bit versions?

> >
> > I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.
>
> While this is my general inclination, this part has not solidified.

I see that you have some instructions with 4 registers. Just disallow those from VEC blocks? It seems like a waste to use up 4 bits in the VEC instruction per target instruction when most will have 2 or 3 registers.

> > I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.

And this is wrong. You could have an instruction in the middle of a VEC block that does not do anything with vectors. So you do need that encoding. Rats.

> > > > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
>
> This is what I call:: "Blowing up the ISA" to fit Vectors/SIMD/...

Somewhere, maybe in Luke's proposal, the comment was made that the number of instructions Intel has had to add for all the various sizes of vector and the different combinations of element sizes is combinatorial (or exponential, I can't remember and haven't done the math, it was the wrong kind of scaling). I think it was Luke's proposal because there was quite a bit about how to define the various element, sub-element, and vector lengths.

> > Though RISC-V is already doing that. I fear that will be a problem for RISC-V. It will really be a family of somewhat-related ISAs rather than a single one. x86 tends (more or less) to have each new generation of ISA be a superset of the previous generation. RISC-V is definitely heading in a more patchwork direction.
>
> Only the magnitude of the problem is unknown within RISC-V

I think it is already fairly bad. Different implementations have different, non-standard instruction sets already. As long as the implementors keep GCC and/or LLVM up to date that is not too painful, but that rarely happens. Usually they hack one version of GCC/LLVM and that is where the effort stops.

Best,
Kyle

MitchAlsup

Nov 5, 2019, 11:39:08 AM
On Monday, November 4, 2019 at 8:00:37 PM UTC-6, Kyle Hayes wrote:
> On Monday, November 4, 2019 at 8:28:41 AM UTC-8, MitchAlsup wrote:
> [big snip]
> > > > > One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.
> >
> > When I had long conversations (e-mail) with Luke, his proposal still had
> > real vector registers and a chunk length of 64.
>
> I think it has changed significantly from that. Now there are no new instructions and a special register (or set) is used in a way that reminds me a bit of your VEC instruction to tag certain registers as vector registers. However, instead of actually being vector registers, the CPU will use consecutive scalar registers for the vector elements.
>
> So some parts seem mildly similar to your ideas in My 66000 (the use of a prefix, though in Luke's case it may just be a MOV to a special register) where others (use of existing scalars as vector elements) are quite different. There seems to be an inherent limit on scalability of Luke's idea when you run out of scalar registers. In reading through what he's posted, it was not clear to me whether you could use physical registers or only architectural registers. It seemed like the latter, but I did not dig that deep yet.
>
> In the My 66000 (and I assume Luke's proposal) there must be some sort of way to handle traps and interrupts within a vectorized loop. How does My 66000 deal with that? With Luke's scheme, using scalar registers, it seems like there is no real difference between the vectorized state and a normal loop. However, I can't remember if he keeps the index in a scalar as well.

Since the instructions "in" the loop are in fact Scalar instructions, an
exception is taken as if it were a Scalar instruction. If the exception
is repaired and control returns, the rest of the loop is run in Scalar
mode, and the LOOP instruction will transfer control to the VEC instruction,
and Vector mode reconvenes.
>
> > > > VEC uses an immediate field to decorate K following instructions with vector
> > > > or scalar designations so that the various kinds of instruction queueing can
> > > > be properly dedicated to the loop at hand.
> > >
> > > What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.
> >
> > This is a follow-on of my predication scheme. In My 66000, there are predicate
> > instructions (like branch instructions) but use the immediate field to cast
> > a predicate (execute or don't execute) over a number of sequentially following
> > instructions. The predicate instruction also performs a conditional test and
> > uses this to determine if the instructions execute or not.
> >
> > The VEC instruction contains a 16-bit, 32-bit, or 64-bit designator for the
> > {VSV} fields (in little endian notation; R->L)
>
> Ah, OK. Even a simplistic 3-bits-per-instruction format for VEC would give you a 21-instruction "shadow" after the VEC instruction with a 64-bit immediate. That's quite large and should allow for many reasonable loops to be vectorized!
>
> In the case where predicated instructions or vectorized instructions are interleaved with a function call, how are the extra state bits handled? On function return you'd need to know which succeeding instructions were predicated to not be executed and which were or which scalars were vectorized and which were not.

The My 66000 ISA has a part of a control register that handles predication.
This part is replicated in the machine based on the capabilities of the
vector lanes--1 set per loop. A 1 means execute, a 0 means skip.
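
As a rough illustration of the "1 means execute, 0 means skip" convention, here is a small C sketch. Everything about it is hypothetical except that convention: the "instructions" are modeled as simple additions, and the assumption is that bit i of the mask governs the i-th instruction in the shadow, lowest bit first:

```c
#include <assert.h>
#include <stdint.h>

/* Apply a predicate mask to a shadow of n "instructions", each of which
 * adds addends[i] to an accumulator. Bit i of mask: 1 = execute, 0 = skip.
 * The LSB-first bit order is an assumption made for this sketch. */
int run_shadow(uint64_t mask, const int *addends, unsigned n, int acc)
{
    assert(n <= 64);
    for (unsigned i = 0; i < n; i++) {
        if ((mask >> i) & 1u)      /* predicate bit set: instruction executes */
            acc += addends[i];
        /* predicate bit clear: instruction is skipped, no branch needed */
    }
    return acc;
}
```

A mask of 0x5 over three instructions would execute the first and third and skip the second.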
>
> As above, for traps, this seems like someplace you need to be able to store state, but here, you would not be switching from user to supervisor mode (if you have that).

There is a Quad DoubleWord of state {PC, Modes, Enables, Raised} that contains
program state.
>
> In Luke's proposal for RISC-V I seem to remember that the vectorization state was in a special CSR that could be read and restored back by user code. So as long as that state was pushed on the stack and restored back on a function call/return or a trap it seemed to work.
>
> However, in My 66000, the state of the "vector" isn't in architecturally visible registers. I am probably missing something obvious :-(

There are instructions to Read, Write, and Exchange DoubleWords with
processor state.
>
> > Yes, My 66000 has variable length instructions, but all extensions of the
> > instruction are simply data. My 66000 has::
> >
> > OP Rd,Rs1,Rs2
> > OP Rd,Rs1,-Rs2
> > OP Rd,-Rs1,Rs2
> > OP Rd,-Rs1,-Rs2
>
> Do these negative variants provide sufficient extra code density to warrant the opcode space? It seems like it would not be much extra hardware as most of this would already be in the ALU and it is more of a matter of selecting the right functions.

First of all, there is no additional HW (hint: an adder can be used as a
subtractor, ...).
Secondly, one loses the need for a SUB instruction, while gaining an AND-NOT
instruction. But the gain is small.

Thirdly, it is EXACTLY these bits which are used to govern the attachment
of large immediates and displacements to each instruction. So, I needed
the bits for other purposes--and it is these purposes that save instructions.
Something between 6% and 9% of MIPS 1 instructions were used to paste bits
together only to be used as an operand. In My 66000 none of this is necessary.
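
A minimal C sketch of how operand sign bits can subsume SUB and provide AND-NOT. The function names add_op and and_op are hypothetical, and reading the "-" on a logical operand as bitwise complement is an assumption inferred from the AND-NOT remark above, not something spelled out here:

```c
#include <assert.h>
#include <stdint.h>

/* OP Rd,+/-Rs1,+/-Rs2 modeled for ADD: each source may be negated.
 * Unsigned arithmetic is used so negation wraps instead of overflowing. */
uint64_t add_op(uint64_t rs1, int neg1, uint64_t rs2, int neg2)
{
    uint64_t a = neg1 ? (uint64_t)0 - rs1 : rs1;
    uint64_t b = neg2 ? (uint64_t)0 - rs2 : rs2;
    return a + b;      /* ADD Rd,Rs1,-Rs2 behaves as SUB */
}

/* The same two bits modeled for AND, assuming "-" means bitwise invert. */
uint64_t and_op(uint64_t rs1, int inv1, uint64_t rs2, int inv2)
{
    uint64_t a = inv1 ? ~rs1 : rs1;
    uint64_t b = inv2 ? ~rs2 : rs2;
    return a & b;      /* AND Rd,Rs1,-Rs2 behaves as AND-NOT */
}
```

So add_op(7, 0, 5, 1) gives 2 without a separate subtract opcode.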
>
> > OP Rd,Rs,IMM16
>
> No OP Rd, IMM16, Rs?

One has to draw the line somewhere.
>
> > OP Rd,Rs,IMM32
> > OP Rd,IMM32,Rs
> > OP Rd,Rs,IMM64
> > OP Rd,IMM64,Rs
>
> These can take care of constant loads too?

Yes.
>
> > 3OP Rd,Rs1,Rs2,Rs3
> > 3OP Rd,Rs1,Rs2,-Rs3
> > 3OP Rd,Rs1,-Rs2,Rs3
> > 3OP Rd,Rs1,-Rs2,-Rs3
>
> These seem good for code density for calculating various more complicated addresses. Is that the intent?

Yes.
>
> > MEM Rd,[Rb+DISP16]
>
> No MEM Rd, [Rb+Ri<<s+IMM16]?

No, no bits to encode with.
>
> > MEM Rd,[Rb+Ri<<s]
> > MEM Rd,[Rb+DISP32]
> > MEM Rd,[Rb+Ri<<s+DISP32]
> > MEM Rd,[Rb+DISP64]
> > MEM Rd,[Rb+Ri<<s+DISP64]
>
> These are mostly loads?
7 LDs, 1 LEA, 4 STs, 1 PREfetch, 1 <post>PUSH
>
> Can these be done relative to the PC?

Rbase = R0 -> IP
Rindex = R0 -> no index
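
Those two rules can be sketched as an effective-address calculation in C. The 32-entry register file, the 0..3 scale range and the function name effective_address are assumptions made for illustration; only the R0 conventions come from the lines above:

```c
#include <assert.h>
#include <stdint.h>

/* Effective address for MEM Rd,[Rb+Ri<<s+DISP].
 * Rb == R0 selects the IP as base; Ri == R0 means no index.
 * Register count (32) and scale range (0..3) are assumed. */
uint64_t effective_address(const uint64_t regs[32], uint64_t ip,
                           unsigned rb, unsigned ri, unsigned s, int64_t disp)
{
    assert(rb < 32 && ri < 32 && s < 4);
    uint64_t base  = (rb == 0) ? ip : regs[rb];
    uint64_t index = (ri == 0) ? 0 : (regs[ri] << s);
    return base + index + (uint64_t)disp;   /* wraps modulo 2^64 */
}
```

One encoding thus covers base+displacement, base+scaled-index and IP-relative addressing.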
>
> Are these all available in byte, 16-bit word, 32-bit word and 64-bit word forms (for integers)?

Both signed and unsigned.
In addition, the memory model supports misaligned accesses.

>
> > ST IMM32,[any of the MEMs]
> > ST IMM64,[any of the MEMs]
>
> I noticed that many of the 16-bit immediate versions do not always have symmetrical pairs. No need or do they fall into different instruction formats than the 32 and 64-bit versions?

Pairs of LDs:: LDSB and LDUB are paired with STB; in both DISP16 form and
{DISP32, DISP64} forms.
>
> > >
> > > I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.
> >
> > While this is my general inclination, this part has not solidified.
>
> I see that you have some instructions with 4 registers. Just disallow those from VEC blocks? It seems like a waste to use up 4 bits in the VEC instruction per target instruction when most will have 2 or 3 registers.

FMAC is the workhorse of floating point arithmetic. Restricting it from
vectorization would be BAD.
>
> > > I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.

The plan is to take the immediate and then as each instruction is DECODEd
attach as many bits to the instruction as that instruction requires. So,
instructions with immediates consume fewer bits from the casting vector.
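
A rough C sketch of that plan, under stated assumptions: bits are peeled off the low end of the designator first (matching the "little endian, R->L" remark earlier), each instruction takes one bit per register operand, and 1 means vector. The name attach_bits and the exact bit order within a group are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Peel V/S bits off a VEC casting vector, lowest bits first.
 * reg_counts[i] is the number of register operands of the i-th
 * instruction after VEC; tags[i] receives that instruction's bits
 * (1 = vector, 0 = scalar). Returns the total number of bits consumed. */
unsigned attach_bits(uint64_t designator, const unsigned *reg_counts,
                     unsigned n, uint8_t *tags)
{
    unsigned used = 0;
    for (unsigned i = 0; i < n; i++) {
        unsigned k = reg_counts[i];
        if (k == 0) {               /* no register operands: nothing to tag */
            tags[i] = 0;
            continue;
        }
        assert(k <= 4 && used + k <= 64);
        tags[i] = (uint8_t)((designator >> used) & ((1u << k) - 1u));
        used += k;                  /* fewer registers, fewer bits consumed */
    }
    return used;
}
```

An instruction with an immediate in place of a register has a smaller operand count, so it leaves more of the designator for the instructions that follow, and a 4-register FMAC simply takes 4 bits.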

Stephen Fuld

Nov 5, 2019, 12:00:32 PM
On 11/3/2019 4:25 PM, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 3:54:08 PM UTC-6, Kyle Hayes wrote:
>> On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
>>> On 11/3/2019 9:26 AM, MitchAlsup wrote:


snip
Hmmmm! While three lanes into cache seems fine for a 1 wide machine,
having 12 lanes for a four wide seems like a lot. Is that typical for
such a machine?

I am beginning to think that "byte move" may not be the best example to
show off VVM. While it is simple to code and to comprehend, it is so
memory access bound that you might not get to take full advantage of VVM.

You may recall that some time ago, I spent some time trying to develop
an "industrial strength" version using 8 bytes per load/store, but still
giving the correct semantics on protection faults, etc. I never got
something that I was satisfied with. I freely admit that may be my
failure, not VVM's.

I was trying to do up to within 8 bytes of a page boundary in 8 byte
chunks, then switching to single byte chunks for 8 bytes, then back to 8
byte chunks. The problems stemmed from requiring two conditions to
terminate the loop, a zero byte, and a page boundary. Perhaps there is
a better way.

But given that, I expect a non VVM routine that moves up to 8 bytes at a
time (assuming you have your "compare any byte to zero" instruction),
would be faster and use less power than a byte at a time VVM routine.

MitchAlsup

Nov 5, 2019, 12:30:18 PM
No, not for either. A 6-wide machine should have between 2 and 3 lanes into
the cache. You want the memory BW to be essentially balanced with the calc BW
and the control-flow BW.

So, on the Mc 88120, a 6-wide machine, we had 3 AGEN ports to cache, 1 FP+,
1 FP*, and 1 BC unit. Each of those units had INT and LOG calculations,
and the 3 MEM units had shifters.
>
> I am beginning to think that "byte move" may not be the best example to
> show off VVM.

It isn't, but what it does show is the clean transition from scalar to vector.

> While it is simple to code and to comprehend, it is so
> memory access bound that you might not get to take full advantage of VVM.
>
> You may recall that some time ago, I spent some time trying to develop
> an "industrial strength" version using 8 bytes per load/store, but still
> giving the correct semantics on protection faults, etc. I never got
> something that I was satisfied with. I freely admit that may be my
> failure, not VVMs.
>
> I was trying to do up to within 8 bytes of a page boundary in 8 byte
> chunks, then switching to single byte chunks for 8 bytes, then back to 8
> byte chunks. The problems stemmed from requiring two conditions to
> terminate the loop, a zero byte, and a page boundary. Perhaps there is
> a better way.

Any taken branch automagically terminates the vector loop. Thus one does simple
flow control in a loop with predication, and you branch out of the loop for
any reason you desire. So:

for( i = 0; i < MAX; i++ )
if( A[i] > B[i] ) break;

works just fine. You can also predicate yourself past the LOOP instruction
and leave the loop.

Stephen Fuld

unread,
Nov 5, 2019, 1:28:21 PM11/5/19
to
Yes, I certainly agree with that. Is there a better example that shows
the transition but doesn't have byte move's drawbacks? Perhaps vector
addition?



>
>> While it is simple to code and to comprehend, it is so
>> memory access bound that you might not get to take full advantage of VVM.
>>
>> You may recall that some time ago, I spent some time trying to develop
>> an "industrial strength" version using 8 bytes per load/store, but still
>> giving the correct semantics on protection faults, etc. I never got
>> something that I was satisfied with. I freely admit that may be my
>> failure, not VVMs.
>>
>> I was trying to do up to within 8 bytes of a page boundary in 8 byte
>> chunks, then switching to single byte chunks for 8 bytes, then back to 8
>> byte chunks. The problems stemmed from requiring two conditions to
>> terminate the loop, a zero byte, and a page boundary. Perhaps there is
>> a better way.
>
> Any taken branch automagically terminates the vector loop. Thus one does simple
> flow control in a loop with predication, and you branch out of the loop for
> any reason you desire. So:
>
> for( i = 0; i < MAX; i++ )
> if( A[i] > B[i] ) break;
>
> works just fine.


Ahhh! That may be the answer. Branch out of the loop when the address
gets within 8 of a page boundary, then do a byte at a time move loop for
up to 8 bytes, then go back to the top to finish the rest of the
transfer, if needed. i.e. a "regular" loop that has nested within it,
two VVM loops, one for 8 bytes at a time, the other for one byte at a time.

I'll have to think about that.

Thanks.

Rick C. Hodgin

unread,
Nov 5, 2019, 1:58:53 PM11/5/19
to
On 11/3/2019 12:26 PM, MitchAlsup wrote:
> For example:: A simple string copy routine in My 66000 Assembler:
>
> loop:
> LDSB R7,[Rp+Ri]
> STB R7,[Rq+Ri]
> ADD Ri,Ri,1
> BNZ R7,loop
>
> vectorizes into:
>
> loop: VEC {{VSV}{VSV}{VV}{V}}
> LDSB R7,[Rp+Ri]
> STB R7,[Rq+Ri]
> LOOP NZ,R7,Ri,1


I think this concept is brilliant, Mitch. You've inspired me to
add a similar feature for general purpose computing:

setup:
vecf32 r1,r7,4 ; 32-bit fp data, starting at *r1, *r7

; Integer form:
; veci32 r3,r4,8 ; 64-bit int data, starting at *r3, *r4
;
; The vector registers are 128-bits or 256-bits wide, so
; whatever would fit in there is assumed.

loop:
; Perform logical ops on things related to r1, r7, which are
; pointers to the start of data. Processing is conducted
; across the 4-fp-wide loaded vectors in parallel in lieu
; of the encoded fp reg counterparts. Other registers ref-
; erenced will be only for input. Changes in r1 or r7 will
; be used to access prior / next vector items in memory.
;
; All calculations using r1 and r7, which correspond to f1,
; f7 for floating point regs, will be extended out to the
; v1, v7 registers used for this operation. r1 and r7 are
; pointers to data, f1 and f7 are used to reference the ops
; to perform in parallel, and v1 and v7 are where the work
; is actually conducted.
;
; In this way, you can use scalar operations on f1 and f7,
; which when you're in vecf32 mode, for example, are aliased
; placeholders for the real operation which takes place in
; the vector engine.
;
; A very simple way to encode general purpose vector ops.
;
; It would be interesting to allow a conditional branch op
; on the vector data, which would trap to the OS, which would
; create four new threads to process each of the items in
; parallel.

done:
vecend ; Done with vector aliasing

Quite a concept, sir.

--
Rick C. Hodgin

MitchAlsup

unread,
Nov 5, 2019, 7:58:56 PM11/5/19
to
My 66000 has an inherently misaligned memory system (cache), so for the 90%-ile
cases, you don't have to worry about those kinds of boundaries.

Bruce Hoult

unread,
Nov 6, 2019, 12:29:35 AM11/6/19
to
On Sunday, November 3, 2019 at 1:54:08 PM UTC-8, Kyle Hayes wrote:
> I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
>
> Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
>
> SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
>
> I've just started reading up on RISC-V's vector proposal as well and so far it reminds me a lot of SVE.

The other way around. The Berkeley people have been working on this (Cray-like) style of vector processing for a *long* time. They invented RISC-V in 2010 as a side project so they would have an open ISA as the control processor for the vector processor they were already working on.

It seems to be little-known that ARM hasn't shipped anything with SVE yet, and I believe hasn't even announced anything. (Fujitsu have announced something, I don't know the hardware status of it). SiFive last week announced the U87 out-of-order CPU core (comparable to ARM's 2015 A72) with vector unit, planned to be available in the second half of 2020.

Stephen Fuld

unread,
Nov 6, 2019, 1:03:09 AM11/6/19
to
Agreed, of course. The effect on performance should be minimal, but you
do have to handle the infrequent event of crossing a 4K boundary and
potentially causing a protection exception. If it weren't for that,
this would be a lot easier. :-)


My thoughts now are to compute the number of 8 byte chunks from the
starting source address to the next 4K boundary, then do the same for
the destination address. Take the minimum of those two. Call that the
counter.

Then do the first VVM loop using 64 bit loads and stores, but add a
decrement of the counter by 8, a test for zero, and a branch out of the
first loop if it is zero.

If you never get to a 4K boundary, which will happen most of the time,
you use the loop instruction to test for any zero byte. When you find
it, you exit the VVM loop, and store the remaining up to 8 bytes. You
are done.

At the target of the branch out of the loop, you enter a second VVM loop
that does one byte at a time, for up to 8 bytes. Within that loop you
test for zero. This will get you over the 4K boundary, or you will find
the zero termination byte. If there is no zero termination byte, when
the loop is done, you branch back to the very beginning and repeat the
whole process for the next, up to 4K chunk.

Does that seem reasonable?
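
A minimal C sketch of the scheme described above, written as plain scalar loops rather than VVM code. It assumes 4 KiB pages, and it stands in for the "compare any byte to zero" instruction with a common bit trick. The names page_safe_strcpy, PAGE_BYTES, and has_zero_byte are made up for illustration. Note that the 8 byte load may read up to 7 bytes past the terminator (always inside the same page), which is the premise of this thread but is outside what the C standard guarantees for an arbitrary string.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES ((size_t)4096)   /* assumed page size */

/* Nonzero if any of the 8 bytes in v is zero (classic bit trick). */
static int has_zero_byte(uint64_t v)
{
    return ((v - UINT64_C(0x0101010101010101)) & ~v &
            UINT64_C(0x8080808080808080)) != 0;
}

/* Bytes from p up to (not including) the next page boundary. */
static size_t bytes_to_page_end(const void *p)
{
    return PAGE_BYTES - ((uintptr_t)p & (PAGE_BYTES - 1));
}

/* Hypothetical name: copies a NUL-terminated string, never letting an
   8 byte access straddle a page boundary on either side. */
char *page_safe_strcpy(char *dst, const char *src)
{
    char *d = dst;
    for (;;) {
        size_t rs = bytes_to_page_end(src);
        size_t rd = bytes_to_page_end(d);
        size_t room = rs < rd ? rs : rd;      /* the "counter" */

        /* First loop: 8 bytes at a time while both pointers stay
           inside their current pages. */
        while (room >= 8) {
            uint64_t w;
            memcpy(&w, src, 8);
            if (has_zero_byte(w))
                break;                        /* terminator is in this chunk */
            memcpy(d, &w, 8);
            src += 8; d += 8; room -= 8;
        }

        /* Second loop: up to 8 single bytes. This either finds the
           terminator or steps over the page boundary. */
        for (int i = 0; i < 8; i++) {
            char c = *src++;
            *d++ = c;
            if (c == 0)
                return dst;
        }
        /* No terminator yet: go back and recompute for the next page. */
    }
}
```

The boundary arithmetic is done once per trip around the outer loop, so for strings that never reach a page boundary it is paid only once.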

Terje Mathisen

unread,
Nov 6, 2019, 5:00:07 AM11/6/19
to
Stephen Fuld wrote:
> My thoughts now are to compute the number of 8 byte chunks from the
> starting source address to the next 4k boundary, then do the same for
> the destination address.  Take the minimum of those two.  Call that the
> counter
>
> Then do the first VVM loop using 64 bit loads and stores, but adding a
> decrement the counter by 8, test for zero and if so branch out of the
> first loop.

For max speed you really want cache line transfers, so 64 instead of 8
bytes per move.

>
> If you never get to a 4K boundary, which will happen most of the time,
> you use the loop instruction to test for any zero byte.  When you find
> it, you exit the VVM loop, and store the remaining up to 8 bytes.  You
> are done.
>
> At the target of the branch out of the loop, you enter a second VVM loop
> that does one byte at a time, for up to 8 bytes.  Within that loop you
> test for zero. This will get you over the 4K boundary, or you will find
> the zero termination byte.  If there is no zero termination byte, when
> the loop is done, you branch back to the very beginning and repeat the
> whole process for the next, up to 4K chunk.
>
> Does that seem reasonable?

It will indeed work, you just have to do a lot of calculations for the
pair of boundary (page limit) conditions.

It might in fact be faster to use the "aligned access only!" approach as
long as you have a fast double-wide shifter/byte aligner so that you can
use only aligned loads and stores, but Mill's approach with None or NaR
flagging of all out of bounds bytes is far simpler and allows much
smaller and easier to verify code.

A half-way approach would allow misaligned stores of all full blocks
(words/cache lines), i.e. no internal terminator found, but only aligned
loads: This is fairly easy to setup by just doing an initial aligned
block load, then a byte shuffle/shift to move the bytes down to the
beginning of the register to align them, then the normal test for a
terminator before we use a masked store to write only these bytes.
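
A rough C sketch of this half-way approach, using 8 byte blocks. It is only an illustration: copy_aligned_loads is a made-up name, the "byte shuffle" is modelled by indexing into a local copy of the block, and the "masked store" is modelled by a memcpy of just the valid bytes. The aligned loads read a few bytes before the string start and after the terminator, so the sketch assumes the string lives inside a larger buffer that covers the enclosing aligned blocks.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name. Loads are always 8 byte aligned; stores of full
   blocks may be misaligned; the final store writes only the bytes up to
   and including the terminator. */
char *copy_aligned_loads(char *dst, const char *src)
{
    char *d = dst;
    unsigned skip = (unsigned)((uintptr_t)src & 7);     /* bytes before src */
    const unsigned char *p = (const unsigned char *)src - skip;
    unsigned char blk[8];

    memcpy(blk, p, 8);                  /* initial aligned block load */
    const unsigned char *v = blk + skip; /* "shift down" to the first byte */
    size_t n = 8 - skip;                 /* valid bytes in this block */

    for (;;) {
        const unsigned char *z = memchr(v, 0, n);
        if (z != NULL) {
            /* Terminator found: store only up to and including it. */
            memcpy(d, v, (size_t)(z - v) + 1);
            return dst;
        }
        memcpy(d, v, n);                /* full store, possibly misaligned */
        d += n;
        p += 8;
        memcpy(blk, p, 8);              /* next aligned block load */
        v = blk;
        n = 8;
    }
}
```

Because an aligned 8 byte load can never straddle a page, the load side needs no boundary arithmetic at all in this sketch.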

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

unread,
Nov 6, 2019, 6:21:30 AM11/6/19
to
Yeah... SIMD seems like a simpler solution...

It has its drawbacks, granted, but if its scope is kept fairly limited
it doesn't totally ruin the ISA it is glued onto while still offering
some useful gains over pure scalar code (and without adding an excessive
amount of cost and complexity to the implementation).

already...@yahoo.com

unread,
Nov 6, 2019, 7:09:07 AM11/6/19
to
On Wednesday, November 6, 2019 at 7:29:35 AM UTC+2, Bruce Hoult wrote:
> On Sunday, November 3, 2019 at 1:54:08 PM UTC-8, Kyle Hayes wrote:
> > I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
> >
> > Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
> >
> > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
> >
> > I've just started reading up on RISC-V's vector proposal as well and so far it reminds me a lot of SVE.
>
> The other way around. The Berkeley people have been working on this (Cray-like) style of vector processing for a *long* time. They invented RISC-V in 2010 as a side project so they would have an open ISA as the control processor for the vector processor they were already working on.
>
> It seems to be little-known that ARM hasn't shipped anything with SVE yet, and I believe hasn't even announced anything.

Correct, but I'd think that it is well-known to everybody who can spell "SVE".

> (Fujitsu have announced something, I don't know the hardware status of it).


https://www.fujitsu.com/global/about/resources/news/press-releases/2019/0415-01.html

English is not my native tongue, so I am not sure what the word "concludes" really means in this context.
But it sounds like they have a chip that is more working than not.

Bruce Hoult

unread,
Nov 6, 2019, 9:10:43 AM11/6/19
to
On Wednesday, November 6, 2019 at 4:09:07 AM UTC-8, already...@yahoo.com wrote:
> > (Fujitsu have announced something, I don't know the hardware status of it).
>
>
> https://www.fujitsu.com/global/about/resources/news/press-releases/2019/0415-01.html
>
> English is not my native tongue so I am not sure what the word "concludes" really means in this context.
> But it it sounds like they have chip that is more works than not.

It means Fujitsu and RIKEN (the research institute) have agreed what words to put in the contract, and signed it.

Beginning global sales in the second half of fiscal 2019 means signing contracts with other customers, for later delivery.

They don't expect to have a machine available for actual use until 2021 or 2022.

That suggests to me that they haven't taped out production chips yet, but probably have near final RTL tested in FPGAs or Palladiums or whatever and maybe low volume test chips.

NaN

unread,
Nov 6, 2019, 10:59:48 AM11/6/19
to

http://www.ssken.gr.jp/MAINSITE/event/2019/20190820-hpcf/lecture-04/20190820_HPCF_shinjo.pdf

On Wednesday, November 6, 2019 at 11:10:43 PM UTC+9, Bruce Hoult wrote:

Stephen Fuld

unread,
Nov 6, 2019, 11:17:01 AM11/6/19
to
On 11/6/2019 2:00 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> My thoughts now are to compute the number of 8 byte chunks from the
>> starting source address to the next 4k boundary, then do the same for
>> the destination address.  Take the minimum of those two.  Call that
>> the counter
>>
>> Then do the first VVM loop using 64 bit loads and stores, but adding a
>> decrement the counter by 8, test for zero and if so branch out of the
>> first loop.
>
> For max speed you really want cache line transfers, so 64 instead of 8
> bytes per move.

I may be wrong, but I don't think the My 66000 has instructions to do
that. But note that on a multi lane CPU, you can get closer to that, as
the different ALUs will access consecutive 64 bit blocks at the same
time. I don't know if Mitch's hardware can take advantage of that.




>> If you never get to a 4K boundary, which will happen most of the time,
>> you use the loop instruction to test for any zero byte.  When you find
>> it, you exit the VVM loop, and store the remaining up to 8 bytes.  You
>> are done.
>>
>> At the target of the branch out of the loop, you enter a second VVM
>> loop that does one byte at a time, for up to 8 bytes.  Within that
>> loop you test for zero. This will get you over the 4K boundary, or you
>> will find the zero termination byte.  If there is no zero termination
>> byte, when the loop is done, you branch back to the very beginning and
>> repeat the whole process for the next, up to 4K chunk.
>>
>> Does that seem reasonable?
>
> It will indeed work, you just have to do a lot of calculations for the
> pair of boundary (page limit) conditions.

"A lot" seems excessive, as it involves an AND and a subtract for the
source and destination, which can be overlapped, and a min function (I
don't know what instructions will be generated for that). But for the
vast majority of cases, you only have to do this once.


> It might in fact be faster to use the "aligned access only!" approach as
> long as you have a fast double-wide shifter/byte aligner so that you can
> use only aligned loads and stores,


Could be. I don't know what shift instructions are available on My 66000.



> but Mill's approach with None or NaR
> flagging of all out of bounds bytes is far simpler and allows much
> smaller and easier to verify code.

Yes, but Mill doesn't have, or at least hasn't disclosed, anything like
the VVM.


> A half-way approach would allow misaligned stores of all full blocks
> (words/cache lines), i.e. no internal terminator found, but only aligned
> loads: This is fairly easy to setup by just doing an initial aligned
> block load, then a byte shuffle/shift to move the bytes down to the
> beginning of the register to align them, then the normal test for a
> terminator before we use a masked store to write only these bytes.


Doesn't this fail to write the last few bytes in the last page before the
write protection boundary?

MitchAlsup

unread,
Nov 6, 2019, 11:54:31 AM11/6/19
to
On Wednesday, November 6, 2019 at 10:17:01 AM UTC-6, Stephen Fuld wrote:
> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
> > Stephen Fuld wrote:
> >> My thoughts now are to compute the number of 8 byte chunks from the
> >> starting source address to the next 4k boundary, then do the same for
> >> the destination address.  Take the minimum of those two.  Call that
> >> the counter
> >>
> >> Then do the first VVM loop using 64 bit loads and stores, but adding a
> >> decrement the counter by 8, test for zero and if so branch out of the
> >> first loop.
> >
> > For max speed you really want cache line transfers, so 64 instead of 8
> > bytes per move.
>
> I may be wrong, but I don't think the My 66000 has instructions to do
> that. But note that on a multi lane CPU, you can get closer to that, as
> the different ALUs will access consecutive 64 bit blocks at the same
> time. I don't know if Mitch's hardware can take advantage of that.
>
The My 66000 does have cache line sized loads and stores--these must be
cache line aligned and must be MOD 8 register aligned.
>
>
>
> >> If you never get to a 4K boundary, which will happen most of the time,
> >> you use the loop instruction to test for any zero byte.  When you find
> >> it, you exit the VVM loop, and store the remaining up to 8 bytes.  You
> >> are done.
> >>
> >> At the target of the branch out of the loop, you enter a second VVM
> >> loop that does one byte at a time, for up to 8 bytes.  Within that
> >> loop you test for zero. This will get you over the 4K boundary, or you
> >> will find the zero termination byte.  If there is no zero termination
> >> byte, when the loop is done, you branch back to the very beginning and
> >> repeat the whole process for the next, up to 4K chunk.
> >>
> >> Does that seem reasonable?
> >
> > It will indeed work, you just have to do a lot of calculations for the
> > pair of boundary (page limit) conditions.
>
> "A lot" seems excessive, as it involves an AND and a subtract for the
> source and destination, which can be overlapped, and a min function (I
> don't know what instructions will be generated for that). But for the
> vast majority of cases, you only have to do this once.
>
>
> > It might in fact be faster to use the "aligned access only!" approach as
> > long as you have a fast double-wide shifter/byte aligner so that you can
> > use only aligned loads and stores,
>
>
> Could be. I don't know what shift instructions are available on My 6600.

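Shift Left
unsigned SL( unsigned S1, unsigned S2 )
{
    unsigned w = S2<37:32>,     // field width (0 means full width)
             o = S2< 5: 0>;     // shift offset
    if( w != 0 )
    {
        S1 <<= 64 - w;          // keep only the low w bits
        S1 >>= 64 - w - o;      // place that field at bit offset o
    }
    else
        S1 <<= o;               // plain left shift
    return S1;
}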

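Shift Right
signed SR( signed S1, unsigned S2 )
{
    unsigned w = S2<37:32>,     // field width (0 means full width)
             o = S2< 5: 0>;     // field offset
    if( w != 0 )
    {
        S1 <<= 64 - w - o;      // move the top bit of the field to bit 63
        S1 >>= 64 - w;          // signed shift: sign-extended w-bit field
    }
    else
        S1 >>= o;
    return S1;
}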

Insert
unsigned INS( unsigned S1, unsigned S2, unsigned S3 )
{
    unsigned m,
             w = S2<37:32>,     // field width (0 means full width)
             o = S2< 5: 0>;     // field offset
    if( w != 0 )
        m = (~(~0 << w )) << o; // w one-bits starting at bit o
    else
        m = ~0 << o;
    S3 <<= o;
    return ( S1 & ~m ) | ( S3 & m );
}
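
For anyone who wants to experiment with these field semantics on an ordinary compiler, here is a hedged C sketch of the insert and the unsigned field extract using explicit 64 bit types. The S2<37:32> and S2<5:0> fields become shifts and masks. The helper names (make_ctl, field_width, field_offset, ins, ext_u) are invented for this sketch and are not My 66000 mnemonics, and it assumes w + o does not exceed 64.

```c
#include <stdint.h>

/* Hypothetical helper: build a control operand with the width in
   bits 37..32 and the offset in bits 5..0. */
uint64_t make_ctl(unsigned w, unsigned o)
{
    return ((uint64_t)(w & 0x3F) << 32) | (uint64_t)(o & 0x3F);
}

static unsigned field_width(uint64_t s2)  { return (unsigned)((s2 >> 32) & 0x3F); }
static unsigned field_offset(uint64_t s2) { return (unsigned)(s2 & 0x3F); }

/* Mask of w one-bits starting at bit o; w == 0 means "all bits from o up". */
static uint64_t field_mask(unsigned w, unsigned o)
{
    uint64_t m = (w != 0) ? ((UINT64_C(1) << w) - 1) : ~UINT64_C(0);
    return m << o;
}

/* Insert: replace the field of s1 selected by s2 with the low bits of s3. */
uint64_t ins(uint64_t s1, uint64_t s2, uint64_t s3)
{
    unsigned o = field_offset(s2);
    uint64_t m = field_mask(field_width(s2), o);
    return (s1 & ~m) | ((s3 << o) & m);
}

/* Unsigned extract: the w-bit field at offset o, right-justified.
   With w == 0 this is a plain logical shift right by o. */
uint64_t ext_u(uint64_t s1, uint64_t s2)
{
    unsigned w = field_width(s2), o = field_offset(s2);
    uint64_t v = s1 >> o;
    return (w != 0) ? (v & ((UINT64_C(1) << w) - 1)) : v;
}
```

For example, ins(0, make_ctl(8, 8), 0xAB) places 0xAB into bits 15..8, and ext_u of that result with the same control operand recovers 0xAB.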

Stephen Fuld

unread,
Nov 6, 2019, 12:09:04 PM11/6/19
to
On 11/6/2019 8:54 AM, MitchAlsup wrote:
> On Wednesday, November 6, 2019 at 10:17:01 AM UTC-6, Stephen Fuld wrote:
>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>> Stephen Fuld wrote:
>>>> My thoughts now are to compute the number of 8 byte chunks from the
>>>> starting source address to the next 4k boundary, then do the same for
>>>> the destination address.  Take the minimum of those two.  Call that
>>>> the counter
>>>>
>>>> Then do the first VVM loop using 64 bit loads and stores, but adding a
>>>> decrement the counter by 8, test for zero and if so branch out of the
>>>> first loop.
>>>
>>> For max speed you really want cache line transfers, so 64 instead of 8
>>> bytes per move.
>>
>> I may be wrong, but I don't think the My 66000 has instructions to do
>> that. But note that on a multi lane CPU, you can get closer to that, as
>> the different ALUs will access consecutive 64 bit blocks at the same
>> time. I don't know if Mitch's hardware can take advantage of that.
>>
> The My 66000 does have cache line sized loads and stores--these must be
> cache line aligned and must be MOD 8 register aligned.

Interesting. How does this interact with VVM? It seems like a lot of
internal storage.


snip
So no double wide shifts as Terje was proposing using?

MitchAlsup

unread,
Nov 6, 2019, 12:59:34 PM11/6/19
to
On Wednesday, November 6, 2019 at 11:09:04 AM UTC-6, Stephen Fuld wrote:
> On 11/6/2019 8:54 AM, MitchAlsup wrote:
> > On Wednesday, November 6, 2019 at 10:17:01 AM UTC-6, Stephen Fuld wrote:
> >> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
> >>> Stephen Fuld wrote:
> >>>> My thoughts now are to compute the number of 8 byte chunks from the
> >>>> starting source address to the next 4k boundary, then do the same for
> >>>> the destination address.  Take the minimum of those two.  Call that
> >>>> the counter
> >>>>
> >>>> Then do the first VVM loop using 64 bit loads and stores, but adding a
> >>>> decrement the counter by 8, test for zero and if so branch out of the
> >>>> first loop.
> >>>
> >>> For max speed you really want cache line transfers, so 64 instead of 8
> >>> bytes per move.
> >>
> >> I may be wrong, but I don't think the My 66000 has instructions to do
> >> that. But note that on a multi lane CPU, you can get closer to that, as
> >> the different ALUs will access consecutive 64 bit blocks at the same
> >> time. I don't know if Mitch's hardware can take advantage of that.
> >>
> > The My 66000 does have cache line sized loads and stores--these must be
> > cache line aligned and must be MOD 8 register aligned.
>
> Interesting. How does this interact with VVM? It seems like a of
> internal storage.

HW in the My 66000 architecture is responsible for loading and storing of
Register File at context switches. RF are defined to be on Quad-Line
boundaries, and these instructions access that already existing infrastructure.
Since HW does the loading and storing of RFs, the RF has been configured such
that even a 1-wide machine can write 4-registers per cycle (and similarly
read 4 registers per cycle). So these instructions make use of wider data-
paths than the execution portions have.

This is one of the reasons a context switch has only 12-cycles of overhead.
The 4-cache lines read and the 4-cache lines written completely overlap
(5-total cycles)

Ivan Godard

unread,
Nov 6, 2019, 1:07:27 PM11/6/19
to
Doesn't have, doesn't want.

Does have SIMD of all element sizes. SIMD operands are treated like
scalar, with statically known sizes and counts, but in parallel. SIMD is
intended for things like pixels or complex numbers.

Separately, dynamically sized sequences are called streams. What's on
the other end of a stream can be hardware or software, in or out of your
protection domain. Think Unix pipes. Details NYF.

Terje Mathisen

unread,
Nov 6, 2019, 2:03:17 PM11/6/19
to
Stephen Fuld wrote:
> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>> but Mill's approach with None or NaR flagging of all out of bounds
>> bytes is far simpler and allows much smaller and easier to verify code.
>
> Yes, but Mill doesn't have, or at least hasn't disclosed, anything like
> the VVM.

With 128-bit vector regs you get most of the benefit, with NaR/None
taking care of all the ugly/hard prequel/sequel/alignment issues.
>
>
>> A half-way approach would allow misaligned stores of all full blocks
>> (words/cache lines), i.e. no internal terminator found, but only
>> aligned loads: This is fairly easy to setup by just doing an initial
>> aligned block load, then a byte shuffle/shift to move the bytes down
>> to the beginning of the register to align them, then the normal test
>> for a terminator before we use a masked store to write only these bytes.
>
>
> Doesn't this not write the last few bytes in the last page before the
> write protection boundary?

No: This is the idea behind only writing the full blocks this way, they
are by definition always safe, even if misaligned.

The final block, i.e. with the found terminator will be written using a
masked store (i.e. no access past the final terminator byte), so that is
also safe.

Stephen Fuld

unread,
Nov 6, 2019, 2:04:44 PM11/6/19
to
Obviously, it is tempting to use these for byte move, but I guess you
would have to have instructions within the VVM loop to test for zero
byte in the first seven registers, and use the LOOP instruction to test
for zero in the last eight bytes.

Also, you increase by a factor of 8 the probability of hitting a 4k
boundary.

It complicates the clean up code, as you have to handle multiple registers.

And, of course, it requires start-up code to load/store the first
unaligned bytes. You didn't say you have double width shifts, so
alignment is more complicated.
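
To make the clean-up cost concrete, here is a small C sketch of the zero test over one 64 byte line held as eight 64 bit words. It is ordinary scalar code, not VVM, and first_zero_in_line is a hypothetical name. Each word is assembled in little-endian order so that the lowest flagged byte is the first zero byte in memory order.

```c
#include <stdint.h>

/* Hypothetical helper: index (0..63) of the first zero byte in a
   64 byte line, or 64 if the line contains no zero byte. */
unsigned first_zero_in_line(const unsigned char line[64])
{
    const uint64_t low  = UINT64_C(0x0101010101010101);
    const uint64_t high = UINT64_C(0x8080808080808080);

    for (unsigned r = 0; r < 8; r++) {          /* one "register" per pass */
        uint64_t w = 0;
        for (unsigned b = 0; b < 8; b++)
            w |= (uint64_t)line[8 * r + b] << (8 * b);

        /* Flags the 0x80 bit of every zero byte; any false positives can
           only sit above a genuine zero byte, so the lowest flag is exact. */
        uint64_t hit = (w - low) & ~w & high;
        if (hit != 0) {
            unsigned b = 0;
            while ((hit & 0x80) == 0) {
                hit >>= 8;
                b++;
            }
            return 8 * r + b;
        }
    }
    return 64;
}
```

The extra work compared with the 8 byte version is the walk over eight registers, plus deciding how many of them to store once the terminator is located.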

More to think about. Ugh! :-)

Stephen Fuld

unread,
Nov 6, 2019, 2:09:56 PM11/6/19
to
Right. Maybe not useful for byte move. IIRC, the size in bits of
each belt position is 64, so you could use the SIMD capability for 8
bytes at a time, but no more. Is that right?


> Separately, dynamically sized sequences are called streams. What's on
> the other end of a stream can be hardware or software, in or out of your
> protection domain. Think Unix pipes. Details NYF.


You just keep whetting our appetites. :-)

Stephen Fuld

unread,
Nov 6, 2019, 2:25:15 PM11/6/19
to
On 11/6/2019 11:03 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>> but Mill's approach with None or NaR flagging of all out of bounds
>>> bytes is far simpler and allows much smaller and easier to verify code.
>>
>> Yes, but Mill doesn't have, or at least hasn't disclosed, anything
>> like the VVM.
>
> With 128-bit vector regs you get most of the benefit, with NaR/None
> taking care of all the ugly/hard prequel/sequel/alignment issues.
>>
>>
>>> A half-way approach would allow misaligned stores of all full blocks
>>> (words/cache lines), i.e. no internal terminator found, but only
>>> aligned loads: This is fairly easy to setup by just doing an initial
>>> aligned block load, then a byte shuffle/shift to move the bytes down
>>> to the beginning of the register to align them, then the normal test
>>> for a terminator before we use a masked store to write only these bytes.
>>
>>
>> Doesn't this not write the last few bytes in the last page before the
>> write protection boundary?
>
> No: This is the idea behind only writing the full blocks this way, they
> are by definition always safe, even if misaligned.

I'm sorry, I don't understand. In the paragraph describing this
approach, you say "allow misaligned stores of all full blocks" Doesn't
this mean that a store could fault and leave the last few bytes of the
page before the fault unwritten? Note I am talking about blocks other
than the final one.

Stephen Fuld

unread,
Nov 6, 2019, 2:30:31 PM11/6/19
to
On 11/6/2019 11:03 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>> but Mill's approach with None or NaR flagging of all out of bounds
>>> bytes is far simpler and allows much smaller and easier to verify code.
>>
>> Yes, but Mill doesn't have, or at least hasn't disclosed, anything
>> like the VVM.
>
> With 128-bit vector regs you get most of the benefit, with NaR/None
> taking care of all the ugly/hard prequel/sequel/alignment issues.

I understand. I mis-remembered that the vector registers were 64 bits.
Sorry Ivan. :-(

And I suppose you could have multiple of these going on at the same time
(wide issue), with the actual number being model dependent. It would be
up to the compiler and the specializer to extract the parallelism. Right?

NaN

unread,
Nov 6, 2019, 3:58:16 PM11/6/19
to
If the 66000 has variable-length instruction words and is an alignment-free architecture, is the interconnect between the instruction cache and the instruction register then at least 2N bytes wide (where N bytes is the maximum instruction length; strictly, 2N-1)?
This is possible provided the program is not self-modifying (no stores into instructions).
Does that mean instruction prefetch support needs a wider path, and that such a wider path is then sufficient?

BTW, if we try applying this baseline to data, how are stores handled under the alignment-free condition?

MitchAlsup

unread,
Nov 6, 2019, 4:25:29 PM11/6/19
to
On Wednesday, November 6, 2019 at 2:58:16 PM UTC-6, NaN wrote:
> If 66000 is variable instruction word length and alignment-free architecture then interconnection network between instruction cache and instruction register is 2N bytes (where maximum N byes word length, strictly it is 2N-1) at least?

The ICache read access width is 4 Aligned Words (in small implementations).

> This is possible in case of that program is not self-writing (no stores for instructions).

Instruction FETCH is defined as being as far into the past as possible, so
a Store into Instruction space (enabled by TLB.W) can write into the
instruction memory. But the Store instruction has no concept as to how far
earlier any FETCHes may have transpired. In addition, if the instruction
is in the Instruction Buffer, it may be used multiple times before the
store is recognized.

> Does that mean of instruction prefetch support needing wider path so then such the wider path is sufficient?

The FETCH path is 4 Words wide per cycle, and there is a 16-20 word instruction
buffer.
>
> BTW, if we try applying this baseline to data, how to store under alignment-free condition?

1-wide machine:

The general state is that after a control transfer, FETCH issues 3 FETCHes
in a row and examines each as it arrives. If the first quad Word of
instructions contains a branch, the target of the branch will be FETCHed
in the 4th cycle. When the predicted stream has fewer than 3 remaining
instructions, it will FETCH the subsequent instructions. Thus, the FETCHer
is attempting to stage up branch target instructions, maintain the momentum
of Inst issue, and decrease the branch penalty close to zero.

Wider machines are allowed to do what is appropriate for them.

If you want to disturb the instructions in some context, you do it from a
different context. Context switching flushes the IB which is the only way
to synchronize stores and FETCHes.

MitchAlsup

unread,
Nov 6, 2019, 4:27:47 PM11/6/19
to
On Wednesday, November 6, 2019 at 2:58:16 PM UTC-6, NaN wrote:
>
> BTW, if we try applying this baseline to data, how to store under alignment-free condition?

You don't want to be modifying instructions that are visible to some running
context. You want to allocate new memory, write the entire subroutine, then
remove write permission and enable execute permission, and then post the
address of where you put the new subroutine.
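
On a POSIX system the same sequence can be sketched with mmap and mprotect. This is only an analogy in C for a conventional OS, not a description of My 66000 mechanisms. publish_code and slot are hypothetical names, and the sketch copies the bytes without ever executing them.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: copy len bytes of code into fresh memory, flip the
   mapping from writable to executable, then post its address in *slot.
   Returns the new address, or NULL on failure. */
void *publish_code(const void *bytes, size_t len, void *volatile *slot)
{
    /* 1. Allocate new memory that no running context can see yet. */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return NULL;

    /* 2. Write the entire subroutine while the mapping is still writable. */
    memcpy(mem, bytes, len);

    /* 3. Remove write permission and enable execute permission. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return NULL;
    }

    /* 4. Only now post the address where others can find it. */
    *slot = mem;
    return mem;
}
```

The ordering is the point: nothing can fetch from the new region until its address is posted, and by then it is no longer writable.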

NaN

unread,
Nov 6, 2019, 4:45:30 PM11/6/19
to
Mitch-san,

Do you mean that the 66000 ISA does not have variable-length instruction words (e.g. 32, 64, 96, and 128 bits)?

Best,
NaN

NaN

unread,
Nov 6, 2019, 4:57:20 PM11/6/19
to
If data words are variable length, say at most D bytes, then D byte-wide data cache subsets, with an address decoder to select the cache(s) and a shared tag memory, are indeed sufficient. If a one-byte cache line is too narrow, we can make an S-byte line, which probably supports accesses of up to S stride across the D subsets.
It is a kind of chunking, similar to scalarization in GPUs.

NaN

NaN

unread,
Nov 6, 2019, 5:13:55 PM11/6/19
to
On Thursday, November 7, 2019 at 6:57:20 AM UTC+9, NaN wrote:
> If data word is variable length, let us say maximum D-byte, then D byte-line data cache sub-sets with address decoder to select cache(s) and with shared tag memory is indeed sufficient. If the byte cache-line is too narrow then we can make S-byte line which probably supports maximum S-stride access across D subsets.
> Just like a chunking, similar to scalarization in GPUs.
>
> NaN

Maybe the correct statement is that a byte vector of at most D stride and S length fits in one cache line.

Stefan Monnier

unread,
Nov 6, 2019, 5:23:58 PM11/6/19
to
> Since the instructions "in" the loop are in fact Scalar instructions, an
> exception is taken as if it were a Scalar instruction. If the exception
> is repaired and control returns, the rest of the loop is run in Scalar
> mode, and the LOOP instruction will transfer control to the VEC instruction,
> and Vector mode reconvenes.

IIUC by "rest of the loop" you really mean "rest of the loop's body",
aka a single iteration of the loop is performed before going back to
"vector mode", right?


Stefan

Ivan Godard

Nov 6, 2019, 6:07:20 PM
SIMD is up to the config's maximum scalar size, which is part of the
tapeout-time configuration. Machines with quad have up to 16-way SIMD,
and so on; there's no architectural limit, but there are physical cost
considerations because all belt positions must be able to accommodate
the max. A FP-banger config would likely have 16-byte width for {D,D}
complex use, as would a database mainframe but for 33-digit COBOL
numeric and DBMS row searching. The Mill is a general purpose
architecture :-)

There is limited support for combining SIMD and vectors in that the
vector elements in a loop can be SIMD objects. One of the videos shows
how this implements strcpy(), and byte move works the same way. Look for
the smear() operation.

>> Separately, dynamically sized sequences are called streams. What's on
>> the other end of a stream can be hardware or software, in or out of
>> your protection domain. Think Unix pipes. Details NYF.
>
>
> You just keep whetting our appetites.  :-)

I try :-)

Ivan Godard

Nov 6, 2019, 6:13:00 PM
On 11/6/2019 11:30 AM, Stephen Fuld wrote:
> On 11/6/2019 11:03 AM, Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>>> but Mill's approach with None or NaR flagging of all out of bounds
>>>> bytes is far simpler and allows much smaller and easier to verify code.
>>>
>>> Yes, but Mill doesn't have, or at least hasn't disclosed, anything
>>> like the VVM.
>>
>> With 128-bit vector regs you get most of the benefit, with NaR/None
>> taking care of all the ugly/hard prequel/sequel/alignment issues.
>
> I understand.  I mis-remembered that the vector registers were 64 bits.
> Sorry Ivan.  :-(

Operand size is per-member configured.

> And I suppose you could have multiple of these going on at the same time
> (wide issue), with the actual number being model dependent.  It would be
> up to the compiler and the specializer to extract the parallelism.  Right?

Yes, though the specializer does not yet do auto-vectorization.

Alex McDonald

Nov 6, 2019, 6:33:40 PM
On 06-Nov-19 12:09, already...@yahoo.com wrote:
> https://www.fujitsu.com/global/about/resources/news/press-releases/2019/0415-01.html
>
> English is not my native tongue so I am not sure what the word "concludes" really means in this context.
> But it sounds like they have a chip that works more than not.

Concluded as in "agreed". It's being used for dramatic effect; to give a
sense of the hours devoted to hard fought negotiations but now -- a
contract. Compare & contrast with "finalised"; this can be found
describing agreements between parties that have been advised by their
lawyers of the eye-watering legal bills should they go to court.

--
Alex

NaN

Nov 6, 2019, 6:40:21 PM
Ivan-san,

How about the complexity caused by the semantic gap between the programming language and the logical architecture?

Best,
S.Takano

On Thursday, November 7, 2019 at 8:07:20 AM UTC+9, Ivan Godard wrote:

MitchAlsup

Nov 6, 2019, 8:11:10 PM
On Wednesday, November 6, 2019 at 3:45:30 PM UTC-6, NaN wrote:
> Mitch-san,
>
> Do you mean that the 66000 ISA does not have a variable instruction word length (e.g., 32, 64, 96, and 128 bits)?

My 66000 ISA IS variable length, 1,2,3,4,5 words in length.
>
> Best,
> NaN

MitchAlsup

Nov 6, 2019, 8:12:24 PM
Basically, yes.
>
>
> Stefan

NaN

Nov 7, 2019, 12:47:59 AM
> > Do you mean that the 66000 ISA does not have a variable instruction word length (e.g., 32, 64, 96, and 128 bits)?
> My 66000 ISA IS variable length, 1,2,3,4,5 words in length.

Before talking about the 66000, I would like to confirm your definition of the term “instruction word”.
Is an “instruction word” the unit length that can be fetched and steered by the instruction fetcher, with multiple words composing a single instruction, rather than a VLIW-like approach?

If so, then your definition is equivalent to my definition of a byte-based variable instruction word, just extending the byte to a word.
So, I think that an instruction buffer at least (2*5-1)*W bits wide (so a 16-word-wide instruction buffer for word alignment) is needed to juggle a maximum 5-word instruction (a single word is W bits) so that it can be fetched in one cycle.
I think standard DDR memory accesses up to 32 bytes at a time; how do you handle external memory access over such a narrow interface if your W is 4 bytes?

Best,
S.Takano

Bruce Hoult

Nov 7, 2019, 1:01:38 AM
The entire opcode is in the first word, yes? The rest is just raw literals/addresses/offsets?

Terje Mathisen

Nov 7, 2019, 2:33:18 AM
If you get a trap here, then you should! The only way you can get such a
trap is because you are trying to write at least one byte past what you
have legal access to.

The only thing we really have to worry about is out-of-bounds accesses
that occur because we are using SIMD operations to replace byte-by-byte
moves, right?

These can occur either on the read or the store side:

By only reading aligned blocks, we know that no individual read operation
can cross a protection boundary, since these boundaries are at least
page aligned.

On the store side we are good as long as we never write any bytes past
the final location that would have been stored to using the byte-based loop.

Please note that the read approach will never work on a platform with
byte (or just sub-block) hardware protection boundaries; there you
either have to do it the slow way, or have user-level traps allowing you
to blindly use block-based moves, then recover with a byte loop in the
trap handler if triggered.
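
The read-side argument above can be checked with a little address arithmetic. The following is only a sketch: the 4 KiB page and 16-byte block sizes are assumed for illustration (nothing in the thread fixes them), and crosses_page and align_down are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sizes, not taken from the thread. The argument only needs
   the block size to be a power of two that divides the page size. */
enum { PAGE_BYTES = 4096, BLOCK_BYTES = 16 };

/* True if the byte range [addr, addr + len) touches more than one page.
   Requires len >= 1. */
static bool crosses_page(uintptr_t addr, uintptr_t len)
{
    return (addr / PAGE_BYTES) != ((addr + len - 1) / PAGE_BYTES);
}

/* Round addr down to the start of the aligned block that contains it. */
static uintptr_t align_down(uintptr_t addr)
{
    return addr & ~(uintptr_t)(BLOCK_BYTES - 1);
}
```

For example, an unaligned 16-byte read at address 4090 straddles the boundary at 4096, while the aligned block containing 4090 starts at 4080 and ends at 4095, entirely inside the first page.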

Ivan Godard

Nov 7, 2019, 3:08:34 AM
Or use metadata

Terje Mathisen

Nov 7, 2019, 4:04:06 AM
Sure, as on the Mill which I've pointed out several times is doing it
the Right Way. :-)

MitchAlsup

Nov 7, 2019, 12:09:54 PM
On Wednesday, November 6, 2019 at 11:47:59 PM UTC-6, NaN wrote:
> > > Do you mean that the 66000 ISA does not have a variable instruction word length (e.g., 32, 64, 96, and 128 bits)?
> > My 66000 ISA IS variable length, 1,2,3,4,5 words in length.
>
> Before talking about the 66000, I would like to confirm your definition of the term “instruction word”.

An instruction consists of all the words needed to completely satisfy the
requirements set out by the instruction specifier.

An instruction specifier is the first word of an instruction.

A word is 32-bits.

> Is an “instruction word” the unit length that can be fetched and steered by the instruction fetcher, with multiple words composing a single instruction, rather than a VLIW-like approach?

An instruction contains all of the necessary parts to completely specify
some kind of calculation, or control transfer, or memory reference.
>
> If so, then your definition is equivalent to my definition of a byte-based variable instruction word, just extending the byte to a word.

Probably, I just use aligned 4-byte containers.

> So, I think that an instruction buffer at least (2*5-1)*W bits wide (so a 16-word-wide instruction buffer for word alignment) is needed to juggle a maximum 5-word instruction (a single word is W bits) so that it can be fetched in one cycle.

Technically, one could fetch them 1 word at a time, using multiple cycles
to assemble a single executable instruction. But even my lowest conceived
implementation still fetches 128 aligned bits per cycle.

> I think standard DDR memory accesses up to 32 bytes at a time; how do you handle external memory access over such a narrow interface if your W is 4 bytes?

Caches, with a cache line width of 512 bits, or 4 quad words.
>
> Best,
> S.Takano
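
As a rough worked example of the numbers in this exchange: with 32-bit words, a 128-bit aligned fetch holds 4 words, and an instruction of at most 5 words can start at any word offset. The sketch below counts how many aligned fetch blocks such an instruction touches. The function names are hypothetical, and this says nothing about how an actual My 66000 front end is built.

```c
/* Figures from the thread: 32-bit words, 128 aligned bits fetched per
   cycle (4 words), instructions of 1 to 5 words. */
enum { WORDS_PER_FETCH = 4, MAX_INSN_WORDS = 5 };

/* Number of aligned fetch blocks touched by an instruction that starts at
   word address start_word and is len_words long (len_words >= 1). */
static unsigned fetch_blocks(unsigned start_word, unsigned len_words)
{
    unsigned first = start_word / WORDS_PER_FETCH;
    unsigned last = (start_word + len_words - 1) / WORDS_PER_FETCH;
    return last - first + 1;
}

/* Worst case over every start offset within a block and every length. */
static unsigned worst_case_fetch_blocks(void)
{
    unsigned worst = 0;
    for (unsigned off = 0; off < WORDS_PER_FETCH; off++) {
        for (unsigned len = 1; len <= MAX_INSN_WORDS; len++) {
            unsigned n = fetch_blocks(off, len);
            if (n > worst)
                worst = n;
        }
    }
    return worst;
}
```

Under these assumptions even the longest instruction at the least favorable offset (5 words starting at word 3 of a block) touches only two aligned 128-bit blocks.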

MitchAlsup

Nov 7, 2019, 12:11:10 PM
The first word contains everything except {32, 64}-bit {immediates and
displacements.}
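
One way the 1 to 5 word range could follow from this description is sketched below. The specifier_info structure is a hypothetical decoded view of the first word; the real My 66000 bit layout is not given in the thread, and the assumption that an immediate and a displacement can each independently be absent, 32-bit, or 64-bit is mine.

```c
/* Hypothetical: size of an optional field that follows the first word,
   expressed directly as a count of 32-bit words. */
typedef enum { FIELD_NONE = 0, FIELD_32 = 1, FIELD_64 = 2 } field_size;

/* Hypothetical decoded view of the instruction specifier (first word). */
typedef struct {
    field_size immediate;
    field_size displacement;
} specifier_info;

/* Total instruction length in 32-bit words: the specifier itself plus
   one word per 32 bits of immediate and of displacement. */
static unsigned insn_words(specifier_info s)
{
    return 1u + (unsigned)s.immediate + (unsigned)s.displacement;
}
```

With no extra fields the length is 1 word, and with a 64-bit immediate plus a 64-bit displacement it is 5 words, matching the stated range.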

Stephen Fuld

Nov 7, 2019, 12:33:42 PM
Right.

>
> These can occur either on the read or the store side:
>
> By only reading aligned blocks, we know that no individual read operation
> can cross a protection boundary, since these boundaries are at least
> page aligned.


Yes.


>
> On the store side we are good as long as we never write any bytes past
> the final location that would have been stored to using the byte-based
> loop.


Yes, but consider the following example. The source is 10 bytes long
(though we don't know that at the start), and page aligned. The
destination address is three bytes before a page boundary, and the next
page is beyond the limit. You load the first 8 bytes and find no
termination byte. So you do a store. We agreed above that in this
scenario, there is no need to align the stores, so you attempt the
store. It gets a protection fault because it can't store the fourth
(and subsequent) bytes. Since the store failed, it didn't write bytes
1-3. But these bytes would have been written in a byte by byte loop.

What am I missing?
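
The scenario in this post can be reproduced with a toy memory model. Everything here is an assumption made for illustration: the region type, the all-or-nothing store_block, and the two copy routines are hypothetical and only model the observable difference between a byte loop and an 8-byte block copy when the destination runs out after 3 bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy destination: only the first `limit` bytes are writable; a store that
   touches any byte at or beyond `limit` faults. */
typedef struct {
    unsigned char *bytes;
    size_t limit;
} region;

/* All-or-nothing store, like a single wide store instruction that takes a
   protection fault before writing anything. */
static bool store_block(region *dst, size_t off,
                        const unsigned char *src, size_t n)
{
    if (off + n > dst->limit)
        return false;                 /* fault: nothing is written */
    memcpy(dst->bytes + off, src, n);
    return true;
}

/* Byte-by-byte string copy; returns the bytes written before the copy
   completed or faulted. */
static size_t copy_bytewise(region *dst, size_t off, const char *src)
{
    size_t written = 0;
    for (;;) {
        unsigned char c = (unsigned char)src[written];
        if (!store_block(dst, off + written, &c, 1))
            return written;
        written++;
        if (c == '\0')
            return written;
    }
}

/* Block copy: 8 bytes per store while no terminator is seen, then one
   final shorter store that ends exactly at the terminator. */
static size_t copy_blockwise(region *dst, size_t off, const char *src)
{
    size_t written = 0;
    for (;;) {
        size_t n = 0;
        while (n < 8 && src[written + n] != '\0')
            n++;
        if (n < 8)
            n++;                      /* include the terminator */
        if (!store_block(dst, off + written,
                         (const unsigned char *)src + written, n))
            return written;
        written += n;
        if (src[written - 1] == '\0')
            return written;
    }
}
```

With a 10-byte source such as "ABCDEFGHI" (nine characters plus the terminator) and a destination limit of 3, the byte loop leaves 3 bytes written before faulting, while the block version faults on its first store and leaves nothing written.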

EricP

Nov 7, 2019, 2:04:57 PM
The behavior change is more than just not writing those 3 bytes.

With the failure behaving that way, you are changing the page fault
behavior slightly, since to succeed both output pages in the
straddle must be resident and valid, and in the TLB.
If the same happened for source page straddles, then the
minimum resident data page set size is 4 pages, not 2.
I know it's not a big change, but it is different.

It might affect routines inside an OS that require pages
to be pinned while worked on.
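
The 2-versus-4 page count can be illustrated with a small calculation. This is a sketch only: the 4 KiB page size and the helper names are assumptions, and it assumes the source and destination ranges share no pages.

```c
#include <stdint.h>

enum { PAGE_BYTES = 4096 };   /* illustrative page size */

/* Number of pages touched by the byte range [addr, addr + len), len >= 1. */
static unsigned pages_touched(uintptr_t addr, uintptr_t len)
{
    return (unsigned)((addr + len - 1) / PAGE_BYTES - addr / PAGE_BYTES) + 1u;
}

/* Data pages that must be resident at once for one load of len bytes from
   src followed by one store of len bytes to dst, assuming the two ranges
   do not share any page. */
static unsigned resident_pages_needed(uintptr_t src, uintptr_t dst,
                                      uintptr_t len)
{
    return pages_touched(src, len) + pages_touched(dst, len);
}
```

A byte loop (len of 1) never needs more than one source page and one destination page at a time, whereas an 8-byte step whose source and destination both straddle a page boundary needs four.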



Stefan Monnier

Nov 7, 2019, 2:18:02 PM