
Mitch's 66000?


Kyle Hayes

Nov 3, 2019, 11:06:13 AM
I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(

Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?

I tried searching on Google (that's where I read comp.arch) but I get a huge number of results that are just references to the name "66000" and nothing particularly substantive. After the first 1500 or so, I gave up.

Thanks for any pointers!

Best,
Kyle

MitchAlsup

Nov 3, 2019, 12:26:04 PM
On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
>
> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?

Its "My 66000" note: lower case 'y'.
Its "Alsup".

The Virtual Vector Method (VVM) documentation is not ready for prime time.

It is a method for adding vector processing to a CPU/ISA that vectorizes
loops instead of vectorizing instructions.

In fact VVM adds but 2 instructions to the ISA: one denoting which registers
in the loop are vectorized and which are scalar (scalar registers are static
over the loop; vector registers change at least once per iteration); the other
denoting the loop termination and looping condition.

VVM is perfectly capable of vectorizing small string loops and loops up to
about the size of the inner loop of an FFT (29 instructions in My 66000 ISA).

VVM utilizes the instruction queuing logic in the CPU to give the appearance
of vector processing while operating in such a partial order that precise
exceptions are both possible and practical. There are no artificial boundaries
on the size of the loop; the loop can execute for anywhere from 1 to an
unbounded number of go-arounds.

In VVM there is no vector register file; vector operands and results remain
in the data-flow portions of the data path (they don't get read from the RF,
nor do they get written to the RF--mostly). Scalar operands are read once,
held in the instruction queueing mechanisms, and used as many times as needed.

It is likely that during a vectorized loop the FETCH and DECODE stages will
be idle (after prefetching the instructions that follow loop termination).

Memory addresses are generated in order per loop,
Loads are hoisted above loads only after disambiguation,
Stores are delayed below loads only after disambiguation,
Calculations are computed in function unit order once per loop,
There can be multiple "lanes" of memory reference,
There can be multiple lanes of Calculations per Function Unit.

The compiler gets the illusion of vector registers (making porting easier).

This adds almost no HW to the core, certainly no VRF, no vector-only FUs,...
It utilizes Reservation Stations or similar instruction queueing facilities
already present. In short it provides the vast majority of the "bang" for
almost no "bucks", while complying with precise exceptions--which appear
to be from a scalar instruction stream (no additional OS software needed).

For example:: A simple string copy routine in My 66000 Assembler:

loop:
LDSB R7,[Rp+Ri] // load signed byte: R7 = source byte at [Rp+Ri]
STB R7,[Rq+Ri] // store byte to destination at [Rq+Ri]
ADD Ri,Ri,1 // advance the index
BNZ R7,loop // repeat until the NUL byte has been copied

vectorizes into:

loop: VEC {{VSV}{VSV}{VV}{V}} // marks each register Vector or Scalar
LDSB R7,[Rp+Ri] // R7, Ri vector; Rp scalar
STB R7,[Rq+Ri] // R7, Ri vector; Rq scalar
LOOP NZ,R7,Ri,1 // increment Ri, loop back while R7 != 0
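
In C terms, both forms implement roughly the loop below (a sketch: the
function and variable names are mine, with src ~ Rp, dst ~ Rq, i ~ Ri,
and c ~ R7):

void copy_string(signed char *dst, const signed char *src)
{
    long i = 0;
    signed char c;
    do {
        c = src[i];       /* LDSB R7,[Rp+Ri] */
        dst[i] = c;       /* STB  R7,[Rq+Ri] */
        i++;              /* ADD  Ri,Ri,1 (folded into LOOP when vectorized) */
    } while (c != 0);     /* BNZ  R7,loop  (LOOP NZ,R7,Ri,1 when vectorized) */
}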

And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
nor written on a per cycle basis, R7 and Ri will be written to the RF when the
loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
is eliminated, while the speed of processing is increased. The VEC instruction
is examined once (DECODE) and is not part of the execution of the loop. The
LOOP instruction performs a comparison and an incrementation and conditionally
transfers control to after the VEC instruction.

The above vectorized loop can run at 1 loop per cycle if there are 2 lanes
of memory (each iteration issues one load and one store). That same code can
run 4 loops per cycle if there are 8 lanes of memory. And so on.

A small CPU (1-wide in order) can perform VVM with a few sequences added to
the control unit. So the cost to the really small machine is virtually zero.

As implementations get bigger they can choose to implement the number of lanes
that make sense to the rest of their design point.

All machine versions can run the same expression of the Vectorized loop.

Memory aliasing is handled by running the loops in a restricted partial order:
if the software expression of the loop contains memory aliasing, the loop runs
slower, obeying the memory dependencies; if the same loop contains no memory
dependencies, it runs faster. SW has to do nothing to obtain this performance
and protection (unlike vector machines with vector registers).
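
For a concrete illustration (a hedged sketch--the loop bodies are mine,
not taken from the VVM documentation), compare:

/* Carries a memory dependence: iteration i reads the element that
   iteration i-1 wrote, so the hardware must honor that ordering and
   the loop runs more serially: */
void prefix_sum(long *a, const long *b, long n)
{
    for (long i = 1; i < n; i++)
        a[i] = a[i-1] + b[i];
}

/* No cross-iteration memory dependence: the same loop machinery can
   spread iterations across however many lanes the implementation has: */
void vec_add(long *a, const long *b, const long *c, long n)
{
    for (long i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}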

BGB

Nov 3, 2019, 1:11:28 PM
On 11/3/2019 11:26 AM, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
>> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
>>
>> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
>
> Its "My 66000" note: lower case 'y'.
> Its "Alsup".
>

I think the bigger issue here is the apparent lack of any publicly
visible documentation for the ISA (at least, that I am aware of).

Partial contrast is my BJX2 ISA, which at least has stuff on GitHub
which people can look at if they are interested in where I am going with
this.


Though, my posts don't always match the ISA spec 1:1, as sometimes
details may end up changing or being implemented differently or features
may end up dropped, or I may be writing speculatively as-if a certain
instruction existed, ...

I may need to do some cleanup / revision at some point, as the design
has accumulated some unnecessary baggage in a few areas, ... ( It is
possible a lot of the stuff not currently implemented in the 'E' and 'F'
profiles may end up being dropped and/or redesigned; with the 'F'
profile being used as the basis of a partially redesigned version of the
core ISA )


Did start to speculate maybe "MY 66000" is intended to be a proprietary
/ commercial ISA, or it could have to do with patent reasons, or
something... But, I don't really know, and at best can only speculate.

Started writing something about this, but then kind of got distracted
and went down a rabbit hole about things like FAT32+LFN's and ExFAT
being kinda pointless for the majority of use cases (and probably is
"only really a thing" due to a likely attempt at a money-grab on MS's
part; since they can no longer make any money off FAT32 due to the
relevant patents having expired; and even then, as written, wouldn't
have likely applied to a VFAT driver in a Unix-style OS in the
first-place as the actual claims were essentially N/A).

In this way, it was sorta like the MIPS related suits (suing over
instructions which weren't even implemented by the infringing party), or
S3TC (sort of a long-standing bogeyman hindering compressed-texture
support in MesaGL and similar, ...).

...

MitchAlsup

Nov 3, 2019, 1:18:06 PM
On Sunday, November 3, 2019 at 12:11:28 PM UTC-6, BGB wrote:
> On 11/3/2019 11:26 AM, MitchAlsup wrote:
> > On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> >> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
> >>
> >> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
> >
> > Its "My 66000" note: lower case 'y'.
> > Its "Alsup".
> >
>
> I think the bigger issue here is the apparent lack of any publicly
> visible documentation for the ISA (at least, that I am aware of).

I just don't have/know-of a place to put it, and after it is there how
to update it somewhat regularly.
>
> Partial contrast is my BJX2 ISA, which at least has stuff on GitHub
> which people can look at if they are interested in where I am going with
> this.
>
>
> Though, my posts don't always match the ISA spec 1:1, as sometimes
> details may end up changing or being implemented differently or features
> may end up dropped, or I may be writing speculatively as-if a certain
> instruction existed, ...
>
> I may need to do some cleanup / revision at some point, as the design
> has accumulated some unnecessary baggage in a few areas, ... ( It is
> possible a lot of the stuff not currently implemented in the 'E' and 'F'
> profiles may end up being dropped and/or redesigned; with the 'F'
> profile being used as the basis of a partially redesigned version of the
> core ISA )
>
>
> Did start to speculate maybe "MY 66000" is intended to be a proprietary
> / commercial ISA, or it could have to do with patent reasons, or
> something... But, I don't really know, and at best can only speculate.

It is not supposed to be proprietary at all, and it is intended to be given
away at zero cost to the receiver. The only restriction is that the originator
(me) be acknowledged, and that acknowledgment be transferred along, by any
recipient.
>
> Started writing something about this, but then kind of got distracted
> and went down a rabbit hole about things like FAT32+LFN's and ExFAT
> being kinda pointless for the majority of use cases (and probably is
> "only really a thing" due to a likely attempt at a money-grab on MS's
> part; since they can no longer make any money off FAT32 due to the
> relevant patents having expired; and even then, as written, wouldn't
> have likely applied to a VFAT driver in a Unix-style OS in the
> first-place as the actual claims were essentially N/A).
>
> In this way, it was sorta like the MIPS related suits (suing over
> instructions which weren't even implemented by the infringing party), or
> S3TC (sort of a long-standing bogeyman hindering compressed-texture
> support in MesaGL and similar, ...).

It is stuff like this that left SIMD instructions out--I prefer to perform
SIMD with lanes of coordinated calculation rather than blowing up the ISA.

Stephen Fuld

Nov 3, 2019, 1:41:27 PM
Since the loop is using bytes, and assuming you have a D-cache, don't
you require fewer lanes of memory, i.e. only read or write memory once
per cache line, not per byte? Or by "lanes of memory" do you mean
load/store units within the CPU?



>
> A small CPU (1-wide in order) can perform VVM with a few sequences added to
> the control unit. So the cost to the really small machine is virtually zero.
>
> As implementations get bigger they can choose to implement the number of lanes
> that make sense to the rest of their design point.
>
> All machine versions can run the same expression of the Vectorized loop.
>
> Memory aliasing is handled by running the loops in a restricted partial order
> so that if the software expression of the loop contains memory aliasing, the
> loop runs slower obeying memory dependencies, yet if the same loop contains
> no memory dependencies, it runs faster. SW has to do nothing to obtain this
> performance and protection (unlike vector machines with vector registers).

I think you are sort of "burying the lead", perhaps because it is
obvious to you.

With VVM, you can automagically use as many ALUs as the hardware
provides without any changes to your source code. So you get more than
the capabilities of a typical vector processor for very little extra
"stuff", either in the hardware or software. As such is is quite
different from Luke's RISC V proposal.


In case it isn't obvious, I am a big fan! :-)


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Kyle Hayes

Nov 3, 2019, 4:54:08 PM
On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
> On 11/3/2019 9:26 AM, MitchAlsup wrote:
> > On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> >> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
> >>
> >> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
> >
> > Its "My 66000" note: lower case 'y'.
> > Its "Alsup".

(Responding to Mitch)

Apologies, not enough caffeine when I wrote that :-(

Stayed up way too late last night for work and I used to work with a Mick Alsop. In the morning fog what was in the brain and what went out the fingers did not match :-(

Sort of a cached incoherency protocol...

> > The virtual Vector Method (VVM) documentation is not ready for prime time.
> >
> > It is a method for adding vector processing to a CPU/ISA that vectorizes
> > loops instead of vectorizing instructions.
> >
> > In fact VVM adds but 2 instructions to the ISA, one denoting which registers
> > in the loop are vectorized and which are Scalar. Scalar registers are static
> > over the loop, vector registers change at least once per loop; the other
> > denoting the loop termination and looping condition.
> >

This is the VEC and LOOP instructions below?
Just to make sure I understand:

Rp = source string pointer
Rq = destination string pointer
R7 = byte being copied
Ri = loop index

> > vectorizes into:
> >
> > loop: VEC {{VSV}{VSV}{VV}{V}}
> > LDSB R7,[Rp+Ri]
> > STB R7,[Rq+Ri]
> > LOOP NZ,R7,Ri,1

And this is where my stupidity shows. I can't figure out what the {{VSV}{VSV}{VV}{V}} mean.

I thought maybe it applied to the instruction groups below it with each {} "sub" section being one instruction... But I am not sure about that because there are four such {} groupings but only three instructions below VEC.

Taking this one instruction at a time:

LDSB R7, [Rp+Ri]

OK, so we want R7 to be a "vector" and Ri to be a "vector" since the value changes per loop iteration. Rp stays as a scalar. Hence {VSV}?

For STB this makes sense in the same way.

For LOOP... Not sure. Why {VV} and {V}? The first {VV} makes some sense as the R7 and Ri arguments of the instruction should be "vectorized". But I am not sure what happens with the last {V}.

Can I buy a clue, please?

> > And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
> > with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
> > nor written on a per cycle basis, R7 and Ri will be written to the RF when the
> > loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
> > is eliminated, while the speed of processing is increased. The VEC instruction
> > is examined once (DECODE) and is not part of the execution of the loop. The
> > LOOP instruction performs a comparison and an incrementation and conditionally
> > transfers control to after the VEC instruction.

The power savings are very impressive!

I think I am still a bit muddled here. This sounds like there is some more-or-less implicit state that is stored (maybe in a CR or something) when VEC "executes" (it doesn't, really) such that the branch target is saved. So can these VEC/LOOP blocks nest? What happens if you jump out of the middle of such a block?

This seems like a lot going on in the LOOP instruction. Maybe it is not as much as I thought. Looks like two read ports and a write port if I treat it as if it is a real loop instruction.

> > The above vectorized loop can run at 1 loop per cycle if there are 2 lanes
> > of memory. That same code can run 4 loops per cycle if there are 8 lanes of
> > memory. And so on.
>
>
> Since the loop is using bytes, and assuming you have a D-cache, don't
> you require fewer lanes of memory, i.e. only read or write memory once
> per cache line, not per byte? Or by "lanes of memory" do you mean
> load/store units within the CPU?

(Responding to Stephen)

Hmm, interesting point about the D$. I was reading "lanes of memory" as memory controllers/memory banks.

What I can see of this seems very, very efficient. I think :-)

> > A small CPU (1-wide in order) can perform VVM with a few sequences added to
> > the control unit. So the cost to the really small machine is virtually zero.
> >
> > As implementations get bigger they can choose to implement the number of lanes
> > that make sense to the rest of their design point.
> >
> > All machine versions can run the same expression of the Vectorized loop.
> >
> > Memory aliasing is handled by running the loops in a restricted partial order
> > so that if the software expression of the loop contains memory aliasing, the
> > loop runs slower obeying memory dependencies, yet if the same loop contains
> > no memory dependencies, it runs faster. SW has to do nothing to obtain this
> > performance and protection (unlike vector machines with vector registers).
>
> I think you are sort of "burying the lead", perhaps because it is
> obvious to you.
>
> With VVM, you can automagically use as many ALUs as the hardware
> provides without any changes to your source code. So you get more than
> the capabilities of a typical vector processor for very little extra
> "stuff", either in the hardware or software. As such is is quite
> different from Luke's RISC V proposal.

One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.

Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).

Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.

> In case it isn't obvious, I am a big fan! :-)

I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!

Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!

SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.

I've just started reading up on RISC-V's vector proposal as well and so far it reminds me a lot of SVE.

Best,
Kyle


BGB

Nov 3, 2019, 5:16:01 PM
On 11/3/2019 12:18 PM, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 12:11:28 PM UTC-6, BGB wrote:
>> On 11/3/2019 11:26 AM, MitchAlsup wrote:
>>> On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
>>>> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
>>>>
>>>> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
>>>
>>> Its "My 66000" note: lower case 'y'.
>>> Its "Alsup".
>>>
>>
>> I think the bigger issue here is the apparent lack of any publicly
>> visible documentation for the ISA (at least, that I am aware of).
>
> I just don't have/know-of a place to put it, and after it is there how
> to update it somewhat regularly.

OK. I was sticking some stuff in the GitHub Wiki thingy, but it kinda
leaves something to be desired (at least when using it to parse the
MediaWiki format; barely works and is prone to misparse stuff and
otherwise screw up).


If there were something sort of like a hybrid of MediaWiki and Adobe
Acrobat (*), this might be useful sometimes, but alas...

*: Say, accepts its input as a pile of files in a MediaWiki-like format
(I prefer this to MarkDown variants), produces competent HTML and PDF
output, supports operation both as a standalone desktop application and
online via a web-browser, is ideally open-source, reasonably easy to
build (IOW: not the usual FOSS 3rd party dependency hell),
cross-platform, ...


>>
>> Partial contrast is my BJX2 ISA, which at least has stuff on GitHub
>> which people can look at if they are interested in where I am going with
>> this.
>>
>>
>> Though, my posts don't always match the ISA spec 1:1, as sometimes
>> details may end up changing or being implemented differently or features
>> may end up dropped, or I may be writing speculatively as-if a certain
>> instruction existed, ...
>>
>> I may need to do some cleanup / revision at some point, as the design
>> has accumulated some unnecessary baggage in a few areas, ... ( It is
>> possible a lot of the stuff not currently implemented in the 'E' and 'F'
>> profiles may end up being dropped and/or redesigned; with the 'F'
>> profile being used as the basis of a partially redesigned version of the
>> core ISA )
>>
>>
>> Did start to speculate maybe "MY 66000" is intended to be a proprietary
>> / commercial ISA, or it could have to do with patent reasons, or
>> something... But, I don't really know, and at best can only speculate.
>
> It is not supposed to be proprietary at all, and it is intended to be given
> away at zero cost to receiver. The only restriction is that the originator
> (me) be transferred/acknowledged by any recipient.

OK.


My intended policy for BJX2 falls in the "people can mostly do whatever
with it" camp. If it gets popular, alternate competing implementations
and not-binary-compatible variants seem almost inevitable, but the ideal
case would be if binary incompatibility (and mutually incompatible ISA
extensions) can be kept small.

This latter point would require getting the ISA to a stage where
"freezing" the core parts of the ISA makes sense. I don't feel I am
quite to this stage yet, but I am gradually getting closer (though, at
this point, what I have now will probably be at least "reasonably
close" to the final product).


>>
>> Started writing something about this, but then kind of got distracted
>> and went down a rabbit hole about things like FAT32+LFN's and ExFAT
>> being kinda pointless for the majority of use cases (and probably is
>> "only really a thing" due to a likely attempt at a money-grab on MS's
>> part; since they can no longer make any money off FAT32 due to the
>> relevant patents having expired; and even then, as written, wouldn't
>> have likely applied to a VFAT driver in a Unix-style OS in the
>> first-place as the actual claims were essentially N/A).
>>
>> In this way, it was sorta like the MIPS related suits (suing over
>> instructions which weren't even implemented by the infringing party), or
>> S3TC (sort of a long-standing bogeyman hindering compressed-texture
>> support in MesaGL and similar, ...).
>
> It is stuff like this that left SIMD instructions out--I prefer to perform
> SIMD with lanes of coordinated calculation rather than blowing up ISA.

A person can also do SIMD without going to the same level of absurd
extremes as SSE or NEON...


Both a basic SIMD variant and explicitly parallel execute lanes exist in
BJX2, though relying on the latter is a little heavyweight in the sense
that there isn't currently a good way to gloss over implementation
differences here (such as the number of execute lanes, or which types of
operations are allowed in which lanes), and 3x integer lanes have less
throughput than 4x packed words, let alone 3x lanes + SIMD, which could
reach around 12 word-ops at a time.


It is likely, if this were used in a more "application focused"
environment, binaries would primarily be distributed in a bytecode or a
"sanitized" scalar subset of the ISA (which could then be, as-needed,
dynamically-translated into the variant used on the hardware; or ran
directly at potentially diminished performance).

In this case, the translation/specialization stage would serve a similar
role to superscalar support in a CPU, just handled in software (probably
done AOT, with the translated binaries being cached somewhere).


A bytecode format could be used, which would likely need to serve a few
criteria:
Can be used to generate reasonably efficient code for multiple targets;
Natively handles languages like C or C++ (among others);
Probably stack-based, with implicit or explicit static types;
Ideally, can be made flexible WRT details like "sizeof(void *)", but
need not gloss over the C library or preprocessor defines (more
problematic);
Can natively handle concepts like pointers and "goto" without creating
an awkward/convoluted mess (unlike, say, WASM);
...


The latter case (a sanitized BJX2 subset) could probably be done with
minimal annotation or metadata, mostly it requires:
All static parts of the control flow graph either be statically
reachable or have some sort of metadata/annotation to indicate their
existence (possibly in the form of "null exports" for anything only
reachable via function pointers);
No support for self-modification of any statically modified code;
Predefined methods for things like loading function pointers;
...

This is less general, but would allow a comparably simpler translator
than would be needed for a higher-level bytecode format. A
sanity-checker could also be used to verify that the code conforms to
the subset (code would be rejected if it goes outside the defined subset).


It should also be possible to "safely" run untrusted BJX2 code more
directly on the CPU (via such a subset) without the need for
address-space trickery (or using x86 segmentation and a hackish ISA
subset, as in the case of the Google NaCl thing), as the ISA design
allows imposing Unix-like access-right checking at the MMU level (in
addition to the usual user/supervisor split).


It may sound a little silly, but this idea may have been partly inspired
by how things like door locks were presented in "Tron 2.0"; sub-programs
essentially needing dedicated keys rather than having free reign of the
whole address space seemed like it could be helpful in terms of
security, and doesn't add all that much in terms of cost (apart from a
little overhead in the page-tables and TLBs).


There are currently a few limitations though (to the "VUGID" system):
Pretty much no existing software or OS's will support this;
The ID's will necessarily exist in a separate address space from those
used for file-access rights checking;
The range of UID's / GID's available to an application is fairly limited;
Actually using this feature (in any non-trivial way) will be kind of a pain;
...

Note that User and Supervisor spaces would essentially also have their
own separate VUGID spaces.


For the time being, TestKern will also use VUGID in place of separate
address spaces for applications (the main address space for each program
would essentially be "Apps/PID"; and there may be a system call for
allocating new VUID/VGID values). It is likely separate per-process
address spaces will be added eventually though.

I suspect at most only a rare few programs will actually have reason to
care that VUGID exists.


As noted (likely cruft/hackery) in TestKern:
Single address space, cooperative scheduling;
VFS system calls will always pass absolute paths;
CWD/PWD will be handled via the syscall wrappers;
The shell, basic commands, ... are all rolled together.
Sort of like Busybox/Toybox, but also includes the shell;
...

I am still debating some details, but it is likely I will not spawn a
separate shell process for handling shell scripts. Instead there will
likely be a combined shell-script interpreter that handles everything
(with separate contexts for different shell scripts).

All this may or may not take place in the kernel, still TBD.
Cases like "#!/bin/sh" would be handled "as magic", rather than by
spawning a new shell instance to run each script (though, for other
programs, "#!" will launch it as a new instance).

Combined with rolling basic commands directly into the shell (similar to
how it works in DOS/Windows), this should reduce overhead somewhat.


<snip>

rick.c...@gmail.com

Nov 3, 2019, 6:33:07 PM
On Sunday, November 3, 2019 at 1:18:06 PM UTC-5, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 12:11:28 PM UTC-6, BGB wrote:
> > I think the bigger issue here is the apparent lack of any publicly
> > visible documentation for the ISA (at least, that I am aware of).
>
> I just don't have/know-of a place to put it, and after it is there how
> to update it somewhat regularly.

You could try Open Cores:

https://opencores.org/forum/Cores

You would likely be the big fish there, sir, garnering a
quick following.

--
Rick C. Hodgin

MitchAlsup

Nov 3, 2019, 7:25:14 PM
On Sunday, November 3, 2019 at 3:54:08 PM UTC-6, Kyle Hayes wrote:
> On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
> > On 11/3/2019 9:26 AM, MitchAlsup wrote:
> > > On Sunday, November 3, 2019 at 10:06:13 AM UTC-6, Kyle Hayes wrote:
> > >> I see many references to Mitch Alsop's ME 66000 (did I get the name correct?). Is there a wiki or something I can look at? I am really intrigued by the hints about the vector handling. Actually, I am a bit confused by it :-(
> > >>
> > >> Is it similar to what Luke is proposing for RISC-V? Or am I completely in the weeds?
> > >
> > > Its "My 66000" note: lower case 'y'.
> > > Its "Alsup".
>
> (Responding to Mitch)
>
> Apologies, not enough caffeine when I wrote that :-(
>
> Stayed up way too late last night for work and I used to work with a Mick Alsop. In the morning fog what was in the brain and what went out the fingers did not match :-(
>
> Sort of a cached incoherency protocol...
>
> > > The virtual Vector Method (VVM) documentation is not ready for prime time.
> > >
> > > It is a method for adding vector processing to a CPU/ISA that vectorizes
> > > loops instead of vectorizing instructions.
> > >
> > > In fact VVM adds but 2 instructions to the ISA, one denoting which registers
> > > in the loop are vectorized and which are Scalar. Scalar registers are static
> > > over the loop, vector registers change at least once per loop; the other
> > > denoting the loop termination and looping condition.
> > >
>
> This is the VEC and LOOP instructions below?

Yes.
Yes
>
> > > vectorizes into:
> > >
> > > loop: VEC {{VSV}{VSV}{VV}{V}}
> > > LDSB R7,[Rp+Ri]
> > > STB R7,[Rq+Ri]
> > > LOOP NZ,R7,Ri,1
>
> And this is where my stupidity shows. I can't figure out what the {{VSV}{VSV}{VV}{V}} mean.

Right now, it is simply syntactic sugar.
>
> I thought maybe it applied to the instruction groups below it with each {} "sub" section being on instruction... But I am not sure about that because there are four such {} groupings but only three instructions below VEC.
>
> Taking this one instruction at a time:
>
> LDSB R7, [Rp+Ri]
>
> OK, so we want R7 to be a "vector" and Ri to be a "vector" since the value changes per loop iteration. Rp stays as a scalar. Hence {VSV}?

Yes.
>
> For STB this makes sense in the same way.
>
> For LOOP... Not sure. Why {VV} and {V}? The first {VV} make some sense as the R7 and Ri arguments of the instruction should be "vectorized". But I am not sure what happens with the last {V}.

Ri is the loop control vector (changing every iteration)--and thus not scalar.
>
> Can I buy a clue, please?
>
> > > And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
> > > with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
> > > nor written on a per cycle basis, R7 and Ri will be written to the RF when the
> > > loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
> > > is eliminated, while the speed of processing is increased. The VEC instruction
> > > is examined once (DECODE) and is not part of the execution of the loop. The
> > > LOOP instruction performs a comparison and an incrementation and conditionally
> > > transfers control to after the VEC instruction.
>
> The power savings are very impressive!
>
> I think I am still a bit muddled here. This sounds like there is some more-or-less implicit state that is stored (maybe in a CR or something) when VEC "executes" (it doesn't, really) such that the branch target is saved. So can these VEC/LOOP blocks nest? What happens if you jump out of the middle of such a block?

The only implicit state is the address of the top of the loop.
>
> This seems like a lot going on in the LOOP instruction. Maybe it is not as much as I thought. Looks like two read ports and a write port if I treat it as if it is a real loop instruction.
>
> > > The above vectorized loop can run at 1 loop per cycle if there are 2 lanes
> > > of memory. That same code can run 4 loops per cycle if there are 8 lanes of
> > > memory. And so on.
> >
> >
> > Since the loop is using bytes, and assuming you have a D-cache, don't
> > you require fewer lanes of memory, i.e. only read or write memory once
> > per cache line, not per byte? Or by "lanes of memory" do you mean
> > load/store units within the CPU?
>
> (Responding to Stephen)
>
> Hmm, interesting point about the D$. I was reading "lanes of memory" as memory controllers/memory banks.

Lanes into the cache.
>
> What I can see of this seems very, very efficient. I think :-)
>
> > > A small CPU (1-wide in order) can perform VVM with a few sequences added to
> > > the control unit. So the cost to the really small machine is virtually zero.
> > >
> > > As implementations get bigger they can choose to implement the number of lanes
> > > that make sense to the rest of their design point.
> > >
> > > All machine versions can run the same expression of the Vectorized loop.
> > >
> > > Memory aliasing is handled by running the loops in a restricted partial order
> > > so that if the software expression of the loop contains memory aliasing, the
> > > loop runs slower obeying memory dependencies, yet if the same loop contains
> > > no memory dependencies, it runs faster. SW has to do nothing to obtain this
> > > performance and protection (unlike vector machines with vector registers).
> >
> > I think you are sort of "burying the lead", perhaps because it is
> > obvious to you.
> >
> > With VVM, you can automagically use as many ALUs as the hardware
> > provides without any changes to your source code. So you get more than
> > the capabilities of a typical vector processor for very little extra
> > "stuff", either in the hardware or software. As such is is quite
> > different from Luke's RISC V proposal.
>
> One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.

RISC-V has vector registers, VVM does not.
>
> Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).

Obviously.
>
> Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.
>
> > In case it isn't obvious, I am a big fan! :-)
>
> I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
>
> Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!

VEC uses an immediate field to decorate K following instructions with vector
or scalar designations so that the various kinds of instruction queueing can
be properly dedicated to the loop at hand.
>
> SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.

And thus way more expensive in HW.

Kyle Hayes

Nov 3, 2019, 11:31:53 PM
On Sunday, November 3, 2019 at 4:25:14 PM UTC-8, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 3:54:08 PM UTC-6, Kyle Hayes wrote:
> > On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
> > > On 11/3/2019 9:26 AM, MitchAlsup wrote:
[snip]
> > > > vectorizes into:
> > > >
> > > > loop: VEC {{VSV}{VSV}{VV}{V}}
> > > > LDSB R7,[Rp+Ri]
> > > > STB R7,[Rq+Ri]
> > > > LOOP NZ,R7,Ri,1
> >
> > And this is where my stupidity shows. I can't figure out what the {{VSV}{VSV}{VV}{V}} mean.
>
> Right now, it is simply syntactic sugar.
> >
> > I thought maybe it applied to the instruction groups below it with each {} "sub" section being on instruction... But I am not sure about that because there are four such {} groupings but only three instructions below VEC.
> >
> > Taking this one instruction at a time:
> >
> > LDSB R7, [Rp+Ri]
> >
> > OK, so we want R7 to be a "vector" and Ri to be a "vector" since the value changes per loop iteration. Rp stays as a scalar. Hence {VSV}?
>
> Yes.
> >
> > For STB this makes sense in the same way.
> >
> > For LOOP... Not sure. Why {VV} and {V}? The first {VV} make some sense as the R7 and Ri arguments of the instruction should be "vectorized". But I am not sure what happens with the last {V}.
>
> Ri is the loop control vector (changing every iteration)--and thus not scalar.

Er... but wasn't that covered by the {VV} part?

LOOP NZ,R7,Ri,1

So the third group in VEC was {VV}. Does that apply to R7 and Ri above or to something else?

> > > > And now it runs 4 instructions (the Scalar loop) in at most 3 cycles per loop
> > > > with the FETCH, PARSE, DECODE stages of the pipeline IDLE. The RF is not read
> > > > nor written on a per cycle basis, R7 and Ri will be written to the RF when the
> > > > loop terminates. Thus somewhere around 50% of the power dissipation of the CPU
> > > > is eliminated, while the speed of processing is increased. The VEC instruction
> > > > is examined once (DECODE) and is not part of the execution of the loop. The
> > > > LOOP instruction performs a comparison and an incrementation and conditionally
> > > > transfers control to after the VEC instruction.
> >
> > The power savings are very impressive!
> >
> > I think I am still a bit muddled here. This sounds like there is some more-or-less implicit state that is stored (maybe in a CR or something) when VEC "executes" (it doesn't, really) such that the branch target is saved. So can these VEC/LOOP blocks nest? What happens if you jump out of the middle of such a block?
>
> The only implicit state is the address of the top of the loop.

I was less than clear. That's what I meant by branch target. I was still thinking of the non-vectorized version with a branch at the bottom to hop back to the start of the loop.

> RISC-V has vector registers, VVM does not.

Did I misunderstand? There are two proposals for "vectorizing" RISC-V. One seems vaguely similar to ARM's SVE and has architectural vector registers. The other is Luke's proposal, which does not use vector registers. At least that is how I understood them.

> >
> > Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).
>
> Obviously.
> >
> > Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.
> >
> > > In case it isn't obvious, I am a big fan! :-)
> >
> > I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
> >
> > Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
>
> VEC uses an immediate field to decorate K following instructions with vector
> or scalar designations so that the various kinds of instruction queueing can
> be properly dedicated to the loop at hand.

What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.

The VEC instruction has a fairly limited window unless it can use a lot of bits. Does the My 66000 use variable length instructions? I thought I saw that it does...

I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.

I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.

> > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
> >
> And thus way more expensive in HW.

Which then encourages multiple instruction set dialects.

Though RISC-V is already doing that. I fear that will be a problem for RISC-V. It will really be a family of somewhat-related ISAs rather than a single one. x86 tends (more or less) to have each new generation of ISA be a superset of the previous generation. RISC-V is definitely heading in a more patchwork direction.

It will be interesting to see how ARM pushes (if it does) SVE. They have three multiple-data extensions now: NEON, SVE and the other new one for more deeply embedded systems, M...something.

Thanks for all the info!

Best,
Kyle

MitchAlsup

Nov 4, 2019, 11:28:41 AM
> So the third group in VEC was {VV}. Does that apply to R7 and Ri above or to something else?

Both R7 and Ri.

> > > One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.

When I had long conversations (e-mail) with Luke, his proposal still had
real vector registers and a chunk length of 64.
> >
> > RISC-V has vector registers, VVM does not.
>
> Did I misunderstand? There are two proposals for "vectorizing" RISC-V. One seems vaguely similar to ARM's SVE and has architectural vector registers. The other is Luke's proposal which does not use vector registers. At least that is how I understood them.
>
> > >
> > > Both will need a lot of memory bandwidth to keep the performance up on larger loops (if the loop is something like a string copy).
> >
> > Obviously.
> > >
> > > Both rely on metadata instructions to set state within the CPU to make up for a lack of extra bits in the instructions. VEC in this case.
> > >
> > > > In case it isn't obvious, I am a big fan! :-)
> > >
> > > I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
> > >
> > > Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
> >
> > VEC uses an immediate field to decorate K following instructions with vector
> > or scalar designations so that the various kinds of instruction queueing can
> > be properly dedicated to the loop at hand.
>
> What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.

This is a follow-on of my predication scheme. In My 66000, there are predicate
instructions (like branch instructions) which use the immediate field to cast
a predicate (execute or don't execute) over a number of sequentially following
instructions. The predicate instruction also performs a conditional test and
uses this to determine whether the instructions execute or not.
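
As a C-level picture of what one predicate shadow covers (a hedged
sketch; the function is mine, and no actual My 66000 mnemonics are
shown since the encodings are not given here):

/* The predicate instruction performs the test (x < 0) and casts
   execute/don't-execute designations over the compiled then/else
   arms that follow it, so no branch is needed: */
long abs_value(long x)
{
    long y;
    if (x < 0)      /* predicate: test + shadow over the next 2 insts */
        y = -x;     /* designated "execute when true"  */
    else
        y = x;      /* designated "execute when false" */
    return y;
}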

The VEC instruction contains a 16-bit, 32-bit, or 64-bit designator for the
{VSV} fields (in little endian notation; R->L)
>
> The VEC instruction has a fairly limited window unless it can use a lot of bits. Does the My 66000 use variable length instructions? I thought I saw that it does...

Yes, My 66000 has variable length instructions, but all extensions of the
instruction are simply data. My 66000 has::

OP Rd,Rs1,Rs2
OP Rd,Rs1,-Rs2
OP Rd,-Rs1,Rs2
OP Rd,-Rs1,-Rs2

OP Rd,Rs,IMM16
OP Rd,Rs,IMM32
OP Rd,IMM32,Rs
OP Rd,Rs,IMM64
OP Rd,IMM64,Rs

3OP Rd,Rs1,Rs2,Rs3
3OP Rd,Rs1,Rs2,-Rs3
3OP Rd,Rs1,-Rs2,Rs3
3OP Rd,Rs1,-Rs2,-Rs3

MEM Rd,[Rb+DISP16]
MEM Rd,[Rb+Ri<<s]
MEM Rd,[Rb+DISP32]
MEM Rd,[Rb+Ri<<s+DISP32]
MEM Rd,[Rb+DISP64]
MEM Rd,[Rb+Ri<<s+DISP64]

ST IMM32,[any of the MEMs]
ST IMM64,[any of the MEMs]
>
> I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.

While this is my general inclination, this part has not solidified.
>
> I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.
>
> > > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.

This is what I call:: "Blowing up the ISA" to fit Vectors/SIMD/...
> > >
> > And thus way more expensive in HW.
>
> Which then encourages multiple instruction set dialects.

Which is why VVM adds but 2 instructions.
>
> Though RISC-V is already doing that. I fear that will be a problem for RISC-V. It will really be a family of somewhat-related ISAs rather than a single one. x86 tends (more or less) to have each new generation of ISA be a superset of the previous generation. RISC-V is definitely heading in a more patchwork direction.

Only the magnitude of the problem is unknown within RISC-V

Kyle Hayes

Nov 4, 2019, 9:00:37 PM
On Monday, November 4, 2019 at 8:28:41 AM UTC-8, MitchAlsup wrote:
[big snip]
> > > > One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.
>
> When I had long conversations (e-mail) with Luke, his proposal still had
> real vector registers and a chunk length of 64.

I think it has changed significantly from that. Now there are no new instructions and a special register (or set) is used in a way that reminds me a bit of your VEC instruction to tag certain registers as vector registers. However, instead of actually being vector registers, the CPU will use consecutive scalar registers for the vector elements.

So some parts seem mildly similar to your ideas in My 66000 (the use of a prefix, though in Luke's case it may just be a MOV to a special register) where others (use of existing scalars as vector elements) are quite different. There seems to be an inherent limit on scalability of Luke's idea when you run out of scalar registers. In reading through what he's posted, it was not clear to me whether you could use physical registers or only architectural registers. It seemed like the latter, but I did not dig that deep yet.

In the My 66000 (and I assume Luke's proposal) there must be some sort of way to handle traps and interrupts within a vectorized loop. How does My 66000 deal with that? With Luke's scheme, using scalar registers, it seems like there is no real difference between the vectorized state and a normal loop. However, I can't remember if he keeps the index in a scalar as well.

> > > VEC uses an immediate field to decorate K following instructions with vector
> > > or scalar designations so that the various kinds of instruction queueing can
> > > be properly dedicated to the loop at hand.
> >
> > What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.
>
> This is a follow-on of my predication scheme. In My 66000, there are predicate
> instructions (like branch instructions) but use the immediate field to cast
> a predicate (execute or don't execute) over a number of sequentially following
> instructions. The predicate instruction also performs a conditional test and
> uses this to determine if the instructions execute or not.
>
> The VEC instruction contains a 16-bit, 32-bit, or 64-bit designator for the
> {VSV} fields (in little endian notation; R->L)

Ah, OK. Even a simplistic 3-bits-per-instruction format for VEC would give you a 21-instruction "shadow" after the VEC instruction with a 64-bit immediate. That's quite large and should allow for many reasonable loops to be vectorized!
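
Working that out for each designator size (my own arithmetic, assuming
3 designator bits per instruction):

#include <stdio.h>
int main(void)
{
    /* 16-, 32-, and 64-bit designators as described above; prints
       shadows of 5, 10, and 21 instructions respectively */
    for (int bits = 16; bits <= 64; bits *= 2)
        printf("%2d-bit designator -> %d-instruction shadow\n",
               bits, bits / 3);
    return 0;
}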

In the case where predicated instructions or vectorized instructions are interleaved with a function call, how are the extra state bits handled? On function return you'd need to know which succeeding instructions were predicated to not be executed and which were or which scalars were vectorized and which were not.

As above, for traps, this seems like someplace you need to be able to store state, but here, you would not be switching from user to supervisor mode (if you have that).

In Luke's proposal for RISC-V I seem to remember that the vectorization state was in a special CSR that could be read and restored back by user code. So as long as that state was pushed on the stack and restored back on a function call/return or a trap it seemed to work.

However, in My 66000, the state of the "vector" isn't in architecturally visible registers. I am probably missing something obvious :-(

> Yes, My 66000 has variable length instructions, but all extensions of the
> instruction are simply data. My 66000 has::
>
> OP Rd,Rs1,Rs2
> OP Rd,Rs1,-RS2
> OP Rd,-RS1,RS2
> OP Rd,-RS1,-RS2

Do these negative variants provide sufficient extra code density to warrant the opcode space? It seems like it would not be much extra hardware as most of this would already be in the ALU and it is more of a matter of selecting the right functions.

> OP Rd,Rs,IMM16

No OP Rd, IMM16, Rs?

> OP Rd,Rs,IMM32
> OP Rd,IMM32,Rs
> OP Rd,Rs,IMM64
> OP Rd,IMM64,Rs

These can take care of constant loads too?

> 3OP Rd,Rs1,Rs2,Rs3
> 3OP Rd,Rs1,Rs2,-Rs3
> 3OP Rd,Rs1,-Rs2,Rs3
> 3OP Rd,Rs1,-Rs2,-Rs3

These seem good for code density for calculating various more complicated addresses. Is that the intent?

> MEM Rd,[Rb+DISP16]

No MEM Rd, [Rb+Ri<<s+IMM16]?

> MEM Rd,[Rb+Ri<<s]
> MEM Rd,[Rb+DISP32]
> MEM Rd,[Rb+Ri<<s+DISP32]
> MEM Rd,[Rb+DISP64]
> MEM Rd,[Rb+Ri<<s+DISP64]

These are mostly loads?

Can these be done relative to the PC?

Are these all available in byte, 16-bit word, 32-bit word and 64-bit word forms (for integers)?

> ST IMM32,[any of the MEMs]
> ST IMM64,[any of the MEMs]

I noticed that many of the 16-bit immediate versions do not always have symmetrical pairs. No need or do they fall into different instruction formats than the 32 and 64-bit versions?

> >
> > I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.
>
> While this is my general inclination, this part has not solidified.

I see that you have some instructions with 4 registers. Just disallow those from VEC blocks? It seems like a waste to use up 4 bits in the VEC instruction per target instruction when most will have 2 or 3 registers.

> > I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.

And this is wrong. You could have an instruction in the middle of a VEC block that does not do anything with vectors. So you do need that encoding. Rats.

> > > > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
>
> This is what I call:: "Blowing up the ISA" to fit Vectors/SIMD/...

Somewhere, maybe in Luke's proposal, the comment was made that the number of instructions Intel has had to add for all the various sizes of vector and the different combinations of element sizes is combinatorial (or exponential, I can't remember and haven't done the math, it was the wrong kind of scaling). I think it was Luke's proposal because there was quite a bit about how to define the various element, sub-element, and vector lengths.

> > Though RISC-V is already doing that. I fear that will be a problem for RISC-V. It will really be a family of somewhat-related ISAs rather than a single one. x86 tends (more or less) to have each new generation of ISA be a superset of the previous generation. RISC-V is definitely heading in a more patchwork direction.
>
> Only the magnitude of the problem is unknown within RISC-V

I think it is already fairly bad. Different implementations have different, not-standard, instruction sets already. As long as the implementors keep GCC and/or LLVM up to date that is not too painful, but that rarely happens. Usually they hack one version of GCC/LLVM and that is where the effort stops.

Best,
Kyle

MitchAlsup

Nov 5, 2019, 11:39:08 AM
On Monday, November 4, 2019 at 8:00:37 PM UTC-6, Kyle Hayes wrote:
> On Monday, November 4, 2019 at 8:28:41 AM UTC-8, MitchAlsup wrote:
> [big snip]
> > > > > One big difference I see (and that may just be lack of comprehension on my part!) is that Luke's RISC-V proposal uses registers to "unroll" (not quite the right term) the loop. The 66000 method does not. This seems to be a fundamental limit in the RISC-V proposal.
> >
> > When I had long conversations (e-mail) with Luke, his proposal still had
> > real vector registers and a chunk length of 64.
>
> I think it has changed significantly from that. Now there are no new instructions and a special register (or set) is used in a way that reminds me a bit of your VEC instruction to tag certain registers as vector registers. However, instead of actually being vector registers, the CPU will use consecutive scalar registers for the vector elements.
>
> So some parts seem mildly similar to your ideas in My 66000 (the use of a prefix, though in Luke's case it may just be a MOV to a special register) where others (use of existing scalars as vector elements) are quite different. There seems to be an inherent limit on scalability of Luke's idea when you run out of scalar registers. In reading through what he's posted, it was not clear to me whether you could use physical registers or only architectural registers. It seemed like the latter, but I did not dig that deep yet.
>
> In the My 66000 (and I assume Luke's proposal) there must be some sort of way to handle traps and interrupts within a vectorized loop. How does My 66000 deal with that? With Luke's scheme, using scalar registers, it seems like there is no real difference between the vectorized state and a normal loop. However, I can't remember if he keeps the index in a scalar as well.

Since the instructions "in" the loop are in fact Scalar instructions, an
exception is taken as if it were a Scalar instruction. If the exception
is repaired and control returns, the rest of the loop is run in Scalar
mode, the LOOP instruction transfers control to the VEC instruction,
and Vector mode reconvenes.
>
> > > > VEC uses an immediate field to decorate K following instructions with vector
> > > > or scalar designations so that the various kinds of instruction queueing can
> > > > be properly dedicated to the loop at hand.
> > >
> > > What was the tradeoff here? I assume that another way to do this would have been to have a bit per register position in each instruction indicating whether it was a "vector" or not. That's a lot of bits.
> >
> > This is a follow-on of my predication scheme. In My 66000, there are predicate
> > instructions (like branch instructions) but use the immediate field to cast
> > a predicate (execute or don't execute) over a number of sequentially following
> > instructions. The predicate instruction also performs a conditional test and
> > uses this to determine if the instructions execute or not.
> >
> > The VEC instruction contains a 16-bit, 32-bit, or 64-bit designator for the
> > {VSV} fields (in little endian notation; R->L)
>
> Ah, OK. Even a simplistic 3-bits-per-instruction format for VEC would give you a 21-instruction "shadow" after the VEC instruction with a 64-bit immediate. That's quite large and should allow for many reasonable loops to be vectorized!
>
> In the case where predicated instructions or vectorized instructions are interleaved with a function call, how are the extra state bits handled? On function return you'd need to know which succeeding instructions were predicated to not be executed and which were or which scalars were vectorized and which were not.

The My 66000 ISA has a part of a control register that handles predication.
This part is replicated in the machine based on the capabilities of the
vector lanes--1 set per loop. a 1 means execute, a 0 means skip.
>
> As above, for traps, this seems like someplace you need to be able to store state, but here, you would not be switching from user to supervisor mode (if you have that).

There is a Quad DoubleWord of state {PC, Modes, Enables, Raised} that contains
program state.
>
> In Luke's proposal for RISC-V I seem to remember that the vectorization state was in a special CSR that could be read and restored back by user code. So as long as that state was pushed on the stack and restored back on a function call/return or a trap it seemed to work.
>
> However, in My 66000, the state of the "vector" isn't in architecturally visible registers. I am probably missing something obvious :-(

There are instructions to Read, Write, and Exchange DoubleWords with
processor state.
>
> > Yes, My 66000 has variable length instructions, but all extensions of the
> > instruction are simply data. My 66000 has::
> >
> > OP Rd,Rs1,Rs2
> > OP Rd,Rs1,-RS2
> > OP Rd,-RS1,RS2
> > OP Rd,-RS1,-RS2
>
> Do these negative variants provide sufficient extra code density to warrant the opcode space? It seems like it would not be much extra hardware as most of this would already be in the ALU and it is more of a matter of selecting the right functions.

First of all, there is no additional HW (Hint: an Adder can be used as a
subtractor,...)
Secondly, one loses the need for a SUB instruction, while gaining an AND-NOT
instruction. But that gain is small.

Thirdly, it is EXACTLY these bits which are used to govern the attachment
of large immediates and displacements to each instruction. So, I needed
the bits for other purposes--and it is these purposes that save instructions.
Something between 6% and 9% of MIPS 1 instructions were used to paste bits
together only to be used as an operand. In My 66000 none of this is necessary.
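
A hedged illustration of that pasting cost (MIPS syntax from memory; the
My 66000 immediate syntax below is a guess modeled on the instruction
formats listed above):

# MIPS: a 32-bit constant costs an extra instruction just to build it
lui  $t0, 0x1234          # upper 16 bits
ori  $t0, $t0, 0x5678     # paste in the lower 16 bits
addu $t1, $t1, $t0        # finally use it

; My 66000: the 32-bit immediate is simply part of the instruction
ADD  R2,R2,#0x12345678    ; OP Rd,Rs,IMM32 form
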
>
> > OP Rd,Rs,IMM16
>
> No OP Rd, IMM16, Rs?

One has to draw the line somewhere.
>
> > OP Rd,Rs,IMM32
> > OP Rd,IMM32,Rs
> > OP Rd,Rs,IMM64
> > OP Rd,IMM64,Rs
>
> These can take care of constant loads too?

Yes.
>
> > 3OP Rd,Rs1,Rs2,Rs3
> > 3OP Rd,Rs1,Rs2,-Rs3
> > 3OP Rd,Rs1,-Rs2,Rs3
> > 3OP Rd,Rs1,-Rs2,-Rs3
>
> These seem good for code density for calculating various more complicated addresses. Is that the intent?

Yes.
>
> > MEM Rd,[Rb+DISP16]
>
> No MEM Rd, [Rb+Ri<<s+IMM16]?

No, no bits to encode with.
>
> > MEM Rd,[Rb+Ri<<s]
> > MEM Rd,[Rb+DISP32]
> > MEM Rd,[Rb+Ri<<s+DISP32]
> > MEM Rd,[Rb+DISP64]
> > MEM Rd,[Rb+Ri<<s+DISP64]
>
> These are mostly loads?
7 LDs, 1 LEA, 4 STs, 1 PREfetch, 1 <post>PUSH
>
> Can these be done relative to the PC?

Rbase = R0 -> IP
Rindex = R0 -> no index
>
> Are these all available in byte, 16-bit word, 32-bit word and 64-bit word forms (for integers)?

Both signed and unsigned.
In addition, the memory model is misaligned (misaligned accesses are supported in hardware).

>
> > ST IMM32,[any of the MEMs]
> > ST IMM64,[any of the MEMs]
>
> I noticed that many of the 16-bit immediate versions do not always have symmetrical pairs. No need or do they fall into different instruction formats than the 32 and 64-bit versions?

Pairs of LDs:: LDSB and LDUB are paired with STB; in both DISP16 form and
{DISP32, DISP64} forms.
>
> > >
> > > I am purely guessing here, but the groups (all triples?) of {VSV} would need to be the same size as the number of registers in the target instruction. Might be easier to just have everything as a triple and ignore some bits if the instruction does not use three registers.
> >
> > While this is my general inclination, this part has not solidified.
>
> I see that you have some instructions with 4 registers. Just disallow those from VEC blocks? It seems like a waste to use up 4 bits in the VEC instruction per target instruction when most will have 2 or 3 registers.

FMAC is the workhorse of floating point arithmetic. Restricting it from
vectorization would be BAD.
>
> > > I guess that some alternate encoding would be possible as well. If an instruction is outside the window of a VEC instruction, then all the registers are scalar. So that's one possibility that does not need to be encoded in the bits in VEC.

The plan is to take the immediate and then as each instruction is DECODEd
attach as many bits to the instruction as that instruction requires. So,
instructions with immediates consume fewer bits from the casting vector.
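
In C-like pseudocode, my reading of that plan (not Mitch's spec; the
names are mine):

#include <stdint.h>

typedef struct { int reg_operands; uint8_t vs_bits; } Insn;

/* Peel {V,S} designator bits off the VEC casting vector at DECODE:
   one bit per register operand, low bits first; operands supplied
   by immediates consume no bits. */
static void assign_vs_bits(Insn *insn, int n, uint64_t cast)
{
    for (int i = 0; i < n; i++) {
        int k = insn[i].reg_operands;              /* e.g. 4 for FMAC */
        insn[i].vs_bits = (uint8_t)(cast & ((1u << k) - 1u));
        cast >>= k;
    }
}
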

Stephen Fuld

unread,
Nov 5, 2019, 12:00:32 PM11/5/19
to
On 11/3/2019 4:25 PM, MitchAlsup wrote:
> On Sunday, November 3, 2019 at 3:54:08 PM UTC-6, Kyle Hayes wrote:
>> On Sunday, November 3, 2019 at 10:41:27 AM UTC-8, Stephen Fuld wrote:
>>> On 11/3/2019 9:26 AM, MitchAlsup wrote:


snip
Hmmmm! While three lanes into cache seems fine for a 1-wide machine,
having 12 lanes for a four-wide seems like a lot. Is that typical for
such a machine?

I am beginning to think that "byte move" may not be the best example to
show off VVM. While it is simple to code and to comprehend, it is so
memory access bound that you might not get to take full advantage of VVM.

You may recall that some time ago, I spent some time trying to develop
an "industrial strength" version using 8 bytes per load/store, but still
giving the correct semantics on protection faults, etc. I never got
something that I was satisfied with. I freely admit that may be my
failure, not VVMs.

I was trying to do up to within 8 bytes of a page boundary in 8 byte
chunks, then switching to single byte chunks for 8 bytes, then back to 8
byte chunks. The problems stemmed from requiring two conditions to
terminate the loop, a zero byte, and a page boundary. Perhaps there is
a better way.

But given that, I expect a non VVM routine that moves up to 8 bytes at a
time (assuming you have your "compare any byte to zero" instruction),
would be faster and use less power than a byte at a time VVM routine.

MitchAlsup

unread,
Nov 5, 2019, 12:30:18 PM11/5/19
to
No, not for either. A 6-wide machine should have between 2 and 3 lanes into
the cache. You want the memory BW to be essentially balanced with the calc BW
and the control-flow BW.

So, on the Mc 88120, a 6-wide machine, we had 3 AGEN ports to cache, 1 FP+,
1 FP*, and 1 BC unit. Each of those units had INT and LOG calculations,
and the 3 MEM units had shifters.
>
> I am beginning to think that "byte move" may not be the best example to
> show off VVM.

It isn't, but what it does show is the clean transition from scalar to vector.

> While it is simple to code and to comprehend, it is so
> memory access bound that you might not get to take full advantage of VVM.
>
> You may recall that some time ago, I spent some time trying to develop
> an "industrial strength" version using 8 bytes per load/store, but still
> giving the correct semantics on protection faults, etc. I never got
> something that I was satisfied with. I freely admit that may be my
> failure, not VVMs.
>
> I was trying to do up to within 8 bytes of a page boundary in 8 byte
> chunks, then switching to single byte chunks for 8 bytes, then back to 8
> byte chunks. The problems stemmed from requiring two conditions to
> terminate the loop, a zero byte, and a page boundary. Perhaps there is
> a better way.

Any taken branch automagically terminates the vector loop. Thus one does simple
flow control in a loop with predication, and you branch out of the loop for
any reason you desire. So:

for( i = 0; i < MAX; i++ )
if( A[i] > B[i] ) break;

works just fine. You can also predicate yourself past the LOOP instruction
and leave the loop.
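
A hedged assembler rendering of that loop (patterned on the VVM examples
in this thread; the LDD, CMP, and BGT mnemonics are guesses):

loop: VEC {...}
    LDD R5,[Ra+Ri<<3] ; A[i]
    LDD R6,[Rb+Ri<<3] ; B[i]
    CMP R7,R5,R6
    BGT R7,exit ; the taken branch terminates the vector loop
    LOOP LT,Ri,Rmax,1
exit:
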

Stephen Fuld

unread,
Nov 5, 2019, 1:28:21 PM11/5/19
to
Yes, I certainly agree with that. Is there a better example that shows
the transition but doesn't have byte move's drawbacks? Perhaps vector
addition?



>
>> While it is simple to code and to comprehend, it is so
>> memory access bound that you might not get to take full advantage of VVM.
>>
>> You may recall that some time ago, I spent some time trying to develop
>> an "industrial strength" version using 8 bytes per load/store, but still
>> giving the correct semantics on protection faults, etc. I never got
>> something that I was satisfied with. I freely admit that may be my
>> failure, not VVMs.
>>
>> I was trying to do up to within 8 bytes of a page boundary in 8 byte
>> chunks, then switching to single byte chunks for 8 bytes, then back to 8
>> byte chunks. The problems stemmed from requiring two conditions to
>> terminate the loop, a zero byte, and a page boundary. Perhaps there is
>> a better way.
>
> Any taken branch automagically terminates the vector loop. Thus one does simple
> flow control in a loop with predication, and you branch out of the loop for
> any reason you desire. So:
>
> for( i = 0; i < MAX; i++ )
> if( A[i] > B[i] ) break;
>
> works just fine.


Ahhh! That may be the answer. Branch out of the loop when the address
gets within 8 of a page boundary, then do a byte at a time move loop for
up to 8 bytes, then go back to the top to finish the rest of the
transfer, if needed. i.e. a "regular" loop that has nested within it,
two VVM loops, one for 8 bytes at a time, the other for one byte at a time.

I'll have to think about that.

Thanks.

Rick C. Hodgin

unread,
Nov 5, 2019, 1:58:53 PM11/5/19
to
On 11/3/2019 12:26 PM, MitchAlsup wrote:
> For example:: A simple string copy routine in My 66000 Assembler:
>
> loop:
> LDSB R7,[Rp+Ri]
> STB R7,[Rq+Ri]
> ADD Ri,Ri,1
> BNZ R7,loop
>
> vectorizes into:
>
> loop: VEC {{VSV}{VSV}{VV}{V}}
> LDSB R7,[Rp+Ri]
> STB R7,[Rq+Ri]
> LOOP NZ,R7,Ri,1


I think this concept is brilliant, Mitch. You've inspired me to
add a similar feature for general purpose computing:

setup:
vecf32 r1,r7,4 ; 32-bit fp data, starting at *r1, *r7

; Integer form:
; veci32 r3,r4,8 ; 64-bit int data, starting at *r3, *r4
;
; The vector registers are 128-bits or 256-bits wide, so
; whatever would fit in there is assumed.

loop:
; Perform logical ops on things related to r1, r7, which are
; pointers to the start of data. Processing is conducted
; across the 4-fp-wide loaded vectors in parallel in lieu
; of the encoded fp reg counterparts. Other registers ref-
; erenced will be only for input. Changes in r1 or r7 will
; be used to access prior / next vector items in memory.
;
; All calculations using r1 and r7, which correspond to f1,
; f7 for floating point regs, will be extended out to the
; v1, v7 registers used for this operation. r1 and r7 are
; pointers to data, f1 and f7 are used to reference the ops
; to perform in parallel, and v1 and v7 are where the work
; is actually conducted.
;
; In this way, you can use scalar operations on f1 and f7,
; which when you're in vecf32 mode, for example, are aliased
; placeholders for the real operation which takes place in
; the vector engine.
;
; A very simple way to encode general purpose vector ops.
;
; It would be interesting to allow a conditional branch op
; on the vector data, which would trap to the OS, which would
; create four new threads to process each of the items in
; parallel.

done:
vecend ; Done with vector aliasing

Quite a concept, sir.

--
Rick C. Hodgin

MitchAlsup

unread,
Nov 5, 2019, 7:58:56 PM11/5/19
to
My 66000 has an inherently misaligned memory system (cache), so for the 90%-ile
cases, you don't have to worry about those kinds of boundaries.

Bruce Hoult

unread,
Nov 6, 2019, 12:29:35 AM11/6/19
to
On Sunday, November 3, 2019 at 1:54:08 PM UTC-8, Kyle Hayes wrote:
> I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
>
> Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
>
> SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
>
> I've just started reading up on RISC-V's vector proposal as well and so far it reminds me a lot of SVE.

The other way around. The Berkeley people have been working on this (Cray-like) style of vector processing for a *long* time. They invented RISC-V in 2010 as a side project so they would have an open ISA as the control processor for the vector processor they were already working on.

It seems to be little-known that ARM hasn't shipped anything with SVE yet, and I believe hasn't even announced anything. (Fujitsu have announced something, I don't know the hardware status of it). SiFive last week announced the U87 out-of-order CPU core (comparable to ARM's 2015 A72) with a vector unit, planned to be available in the second half of 2020.

Stephen Fuld

unread,
Nov 6, 2019, 1:03:09 AM11/6/19
to
Agreed, of course. The effect on performance should be minimal, but you
do have to handle the infrequent event of crossing a 4K boundary and
potentially causing a protection exception. If it weren't for that,
this would be a lot easier. :-)


My thoughts now are to compute the number of 8 byte chunks from the
starting source address to the next 4k boundary, then do the same for
the destination address. Take the minimum of those two. Call that the
counter.

Then do the first VVM loop using 64 bit loads and stores, but adding a
decrement the counter by 8, test for zero and if so branch out of the
first loop.

If you never get to a 4K boundary, which will happen most of the time,
you use the loop instruction to test for any zero byte. When you find
it, you exit the VVM loop, and store the remaining up to 8 bytes. You
are done.

At the target of the branch out of the loop, you enter a second VVM loop
that does one byte at a time, for up to 8 bytes. Within that loop you
test for zero. This will get you over the 4K boundary, or you will find
the zero termination byte. If there is no zero termination byte, when
the loop is done, you branch back to the very beginning and repeat the
whole process for the next, up to 4K chunk.

Does that seem reasonable?
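
In C, the control structure I have in mind is roughly the following
(illustrative only; has_zero_byte() stands in for the "compare any byte
to zero" test, and the two inner loops are the ones VVM would vectorize):

#include <stdint.h>
#include <string.h>

static int has_zero_byte(uint64_t w)    /* classic SWAR zero-byte test */
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

char *strcpy_sketch(char *dst, const char *src)
{
    char *d = dst;
    for (;;) {
        /* 8-byte chunks until within 8 bytes of either page boundary */
        size_t ns = 4096 - ((uintptr_t)src & 4095);
        size_t nd = 4096 - ((uintptr_t)d   & 4095);
        size_t counter = ns < nd ? ns : nd;
        while (counter >= 8) {
            uint64_t w;
            memcpy(&w, src, 8);              /* stays within the page */
            if (has_zero_byte(w))            /* terminator in sight   */
                goto tail;
            memcpy(d, &w, 8);
            src += 8; d += 8; counter -= 8;
        }
        /* byte loop to step over the boundary (second VVM loop) */
        for (int i = 0; i < 8; i++)
            if ((*d++ = *src++) == '\0')
                return dst;
        /* no terminator yet: back to the top for the next chunk */
    }
tail:
    while ((*d++ = *src++) != '\0')
        ;                                    /* last few bytes */
    return dst;
}
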

Terje Mathisen

unread,
Nov 6, 2019, 5:00:07 AM11/6/19
to
Stephen Fuld wrote:
> My thoughts now are to compute the number of 8 byte chunks from the
> starting source address to the next 4k boundary, then do the same for
> the destination address.  Take the minimum of those two.  Call that the
> counter
>
> Then do the first VVM loop using 64 bit loads and stores, but adding a
> decrement the counter by 8, test for zero and if so branch out of the
> first loop.

For max speed you really want cache line transfers, so 64 instead of 8
bytes per move.

>
> If you never get to a 4K boundary, which will happen most of the time,
> you use the loop instruction to test for any zero byte.  When you find
> it, you exit the VVM loop, and store the remaining up to 8 bytes.  You
> are done.
>
> At the target of the branch out of the loop, you enter a second VVM loop
> that does one byte at a time, for up to 8 bytes.  Within that loop you
> test for zero. This will get you over the 4K boundary, or you will find
> the zero termination byte.  If there is no zero termination byte, when
> the loop is done, you branch back to the very beginning and repeat the
> whole process for the next, up to 4K chunk.
>
> Does that seem reasonable?

It will indeed work, you just have to do a lot of calculations for the
pair of boundary (page limit) conditions.

It might in fact be faster to use the "aligned access only!" approach as
long as you have a fast double-wide shifter/byte aligner so that you can
use only aligned loads and stores, but Mill's approach with None or NaR
flagging of all out of bounds bytes is far simpler and allows much
smaller and easier to verify code.

A half-way approach would allow misaligned stores of all full blocks
(words/cache lines), i.e. no internal terminator found, but only aligned
loads: This is fairly easy to setup by just doing an initial aligned
block load, then a byte shuffle/shift to move the bytes down to the
beginning of the register to align them, then the normal test for a
terminator before we use a masked store to write only these bytes.
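
A hedged C rendering of the idea, with one simplification: a short
byte-copy prologue aligns the loads in place of the in-register shuffle,
and a byte loop emulates the masked store:

#include <stdint.h>
#include <string.h>

static int has_zero_byte(uint64_t w)       /* SWAR zero-byte test again */
{
    return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

char *strcpy_halfway(char *dst, const char *src)
{
    char *d = dst;
    while ((uintptr_t)src & 7)             /* align the *loads* only */
        if ((*d++ = *src++) == '\0')
            return dst;
    const uint64_t *p = (const uint64_t *)src;
    for (;;) {
        uint64_t w = *p++;                 /* aligned: cannot cross a page */
        if (has_zero_byte(w)) {
            /* "masked store": write only up to and including the NUL */
            const char *s = (const char *)(p - 1);
            while ((*d++ = *s++) != '\0')
                ;
            return dst;
        }
        memcpy(d, &w, 8);                  /* misaligned full block: safe */
        d += 8;
    }
}
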

Terje


--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

BGB

unread,
Nov 6, 2019, 6:21:30 AM11/6/19
to
Yeah... SIMD seems like a simpler solution...

It has its drawbacks, granted, but if its scope is kept fairly limited
it doesn't totally ruin the ISA it is glued onto while still offering
some useful gains over pure scalar code (and without adding an excessive
amount of cost and complexity to the implementation).

already...@yahoo.com

unread,
Nov 6, 2019, 7:09:07 AM11/6/19
to
On Wednesday, November 6, 2019 at 7:29:35 AM UTC+2, Bruce Hoult wrote:
> On Sunday, November 3, 2019 at 1:54:08 PM UTC-8, Kyle Hayes wrote:
> > I am interested in both. Whenever I look at the mess that Intel made with SSE, AVX, AVX2... Yuck. At least Intel set the bar very low!
> >
> > Mitch's ideas, Luke's ideas and things like ARM's SVE seem sooooooo much better!
> >
> > SVE is a lot more "conservative" in the sense that there are actual vector registers behind the instructions and you need lots of instructions to deal with different element sizes etc. What you do get over AVX is that one set of code can run efficiently on multiple implementations.
> >
> > I've just started reading up on RISC-V's vector proposal as well and so far it reminds me a lot of SVE.
>
> The other way around. The Berkeley people have been working on this (Cray-like) style of vector processing for a *long* time. They invented RISC-V in 2010 as a side project so they would have an open ISA as the control processor for the vector processor they were already working on.
>
> It seems to be little-known that ARM hasn't shipped anything with SVE yet, and I believe hasn't even announced anything.

Correct, but I'd think that it is well-known to everybody who can spell "SVE".

> (Fujitsu have announced something, I don't know the hardware status of it).


https://www.fujitsu.com/global/about/resources/news/press-releases/2019/0415-01.html

English is not my native tongue so I am not sure what the word "concludes" really means in this context.
But it sounds like they have a chip that works more than not.

Bruce Hoult

unread,
Nov 6, 2019, 9:10:43 AM11/6/19
to
On Wednesday, November 6, 2019 at 4:09:07 AM UTC-8, already...@yahoo.com wrote:
> > (Fujitsu have announced something, I don't know the hardware status of it).
>
>
> https://www.fujitsu.com/global/about/resources/news/press-releases/2019/0415-01.html
>
> English is not my native tongue so I am not sure what the word "concludes" really means in this context.
> But it sounds like they have a chip that works more than not.

It means Fujitsu and RIKEN (the research institute) have agreed what words to put in the contract, and signed it.

Beginning global sales in the second half of fiscal 2019 means signing contracts with other customers, for later delivery.

They don't expect to have a machine available for actual use until 2021 or 2022.

That suggests to me that they haven't taped out production chips yet, but probably have near final RTL tested in FPGAs or Palladiums or whatever and maybe low volume test chips.

NaN

unread,
Nov 6, 2019, 10:59:48 AM11/6/19
to

http://www.ssken.gr.jp/MAINSITE/event/2019/20190820-hpcf/lecture-04/20190820_HPCF_shinjo.pdf

On Wednesday, November 6, 2019 at 11:10:43 PM UTC+9, Bruce Hoult wrote:

Stephen Fuld

unread,
Nov 6, 2019, 11:17:01 AM11/6/19
to
On 11/6/2019 2:00 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> My thoughts now are to compute the number of 8 byte chunks from the
>> starting source address to the next 4k boundary, then do the same for
>> the destination address.  Take the minimum of those two.  Call that
>> the counter
>>
>> Then do the first VVM loop using 64 bit loads and stores, but adding a
>> decrement the counter by 8, test for zero and if so branch out of the
>> first loop.
>
> For max speed you really want cache line transfers, so 64 instead of 8
> bytes per move.

I may be wrong, but I don't think the My 66000 has instructions to do
that. But note that on a multi-lane CPU, you can get closer to that, as
the different ALUs will access consecutive 64-bit blocks at the same
time. I don't know if Mitch's hardware can take advantage of that.




>> If you never get to a 4K boundary, which will happen most of the time,
>> you use the loop instruction to test for any zero byte.  When you find
>> it, you exit the VVM loop, and store the remaining up to 8 bytes.  You
>> are done.
>>
>> At the target of the branch out of the loop, you enter a second VVM
>> loop that does one byte at a time, for up to 8 bytes.  Within that
>> loop you test for zero. This will get you over the 4K boundary, or you
>> will find the zero termination byte.  If there is no zero termination
>> byte, when the loop is done, you branch back to the very beginning and
>> repeat the whole process for the next, up to 4K chunk.
>>
>> Does that seem reasonable?
>
> It will indeed work, you just have to do a lot of calculations for the
> pair of boundary (page limit) conditions.

"A lot" seems excessive, as it involves an AND and a subtract for the
source and destination, which can be overlapped, and a min function (I
don't know what instructions will be generated for that). But for the
vast majority of cases, you only have to do this once.
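
Concretely (a sketch, assuming 4K pages):

/* bytes remaining to the next 4K boundary, source and destination */
size_t ns = 4096 - ((uintptr_t)src & 4095);   /* AND + subtract */
size_t nd = 4096 - ((uintptr_t)dst & 4095);
size_t counter = ns < nd ? ns : nd;           /* min */
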


> It might in fact be faster to use the "aligned access only!" approach as
> long as you have a fast double-wide shifter/byte aligner so that you can
> use only aligned loads and stores,


Could be. I don't know what shift instructions are available on the My 66000.



> but Mill's approach with None or NaR
> flagging of all out of bounds bytes is far simpler and allows much
> smaller and easier to verify code.

Yes, but Mill doesn't have, or at least hasn't disclosed, anything like
the VVM.


> A half-way approach would allow misaligned stores of all full blocks
> (words/cache lines), i.e. no internal terminator found, but only aligned
> loads: This is fairly easy to setup by just doing an initial aligned
> block load, then a byte shuffle/shift to move the bytes down to the
> beginning of the register to align them, then the normal test for a
> terminator before we use a masked store to write only these bytes.


Doesn't this not write the last few bytes in the last page before the
write protection boundary?

MitchAlsup

unread,
Nov 6, 2019, 11:54:31 AM11/6/19
to
On Wednesday, November 6, 2019 at 10:17:01 AM UTC-6, Stephen Fuld wrote:
> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
> > Stephen Fuld wrote:
> >> My thoughts now are to compute the number of 8 byte chunks from the
> >> starting source address to the next 4k boundary, then do the same for
> >> the destination address.  Take the minimum of those two.  Call that
> >> the counter
> >>
> >> Then do the first VVM loop using 64 bit loads and stores, but adding a
> >> decrement the counter by 8, test for zero and if so branch out of the
> >> first loop.
> >
> > For max speed you really want cache line transfers, so 64 instead of 8
> > bytes per move.
>
> I may be wrong, but I don't think the My 66000 has instructions to do
> that. But note that on a multi lane CPU, you can get closer to that, as
> the different ALUs will access consecutive 64 bit blocks at the same
> time. I don't know if Mitch's hardware can take advantage of that.
>
The My 66000 does have cache line sized loads and stores--these must be
cache line aligned and must be MOD 8 register aligned.
>
>
>
> >> If you never get to a 4K boundary, which will happen most of the time,
> >> you use the loop instruction to test for any zero byte.  When you find
> >> it, you exit the VVM loop, and store the remaining up to 8 bytes.  You
> >> are done.
> >>
> >> At the target of the branch out of the loop, you enter a second VVM
> >> loop that does one byte at a time, for up to 8 bytes.  Within that
> >> loop you test for zero. This will get you over the 4K boundary, or you
> >> will find the zero termination byte.  If there is no zero termination
> >> byte, when the loop is done, you branch back to the very beginning and
> >> repeat the whole process for the next, up to 4K chunk.
> >>
> >> Does that seem reasonable?
> >
> > It will indeed work, you just have to do a lot of calculations for the
> > pair of boundary (page limit) conditions.
>
> "A lot" seems excessive, as it involves an AND and a subtract for the
> source and destination, which can be overlapped, and a min function (I
> don't know what instructions will be generated for that). But for the
> vast majority of cases, you only have to do this once.
>
>
> > It might in fact be faster to use the "aligned access only!" approach as
> > long as you have a fast double-wide shifter/byte aligner so that you can
> > use only aligned loads and stores,
>
>
> Could be. I don't know what shift instructions are available on My 6600.

Shift Left
unsigned SL( unsigned S1, unsigned S2 )
{
    unsigned w = S2<37:32>,     // field width
             o = S2< 5: 0>;     // field offset
    if( w != 0 )
    {                           // deposit the low w bits at offset o
        S1 <<= 64 - w;
        S1 >>= 64 - w - o;
    }
    else
        S1 <<= o;               // plain shift left
    return S1;
}

Shift Right
signed SR( signed S1, unsigned S2 )
{
    unsigned w = S2<37:32>,
             o = S2< 5: 0>;
    if( w != 0 )
    {                           // sign-extended extract of w bits at offset o
        S1 <<= 64 - w - o;
        S1 >>= 64 - w;
    }
    else
        S1 >>= o;               // plain shift right
    return S1;
}

Insert
unsigned INS( unsigned S1, unsigned S2, unsigned S3 )
{
    unsigned m,
             w = S2<37:32>,
             o = S2< 5: 0>;
    if( w != 0 )
        m = (~(~0 << w )) << o; // w-bit mask at offset o
    else
        m = ~0 << o;
    S3 <<= o;
    return ( S1 & ~m ) | ( S3 & m );
}
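
A usage sketch of the encoding above (w in S2<37:32>, o in S2<5:0>; the
packing is my reading of the pseudocode, not a spec quote):

/* Extract the third byte of x, sign-extended: w = 8, o = 16. */
uint64_t x  = 0xDEADBEEF;
uint64_t s2 = ((uint64_t)8 << 32) | 16;
int64_t  b  = SR(x, s2);   /* same result as (int8_t)(x >> 16),
                              here 0xAD sign-extended */
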

Stephen Fuld

unread,
Nov 6, 2019, 12:09:04 PM11/6/19
to
On 11/6/2019 8:54 AM, MitchAlsup wrote:
> On Wednesday, November 6, 2019 at 10:17:01 AM UTC-6, Stephen Fuld wrote:
>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>> Stephen Fuld wrote:
>>>> My thoughts now are to compute the number of 8 byte chunks from the
>>>> starting source address to the next 4k boundary, then do the same for
>>>> the destination address.  Take the minimum of those two.  Call that
>>>> the counter
>>>>
>>>> Then do the first VVM loop using 64 bit loads and stores, but adding a
>>>> decrement the counter by 8, test for zero and if so branch out of the
>>>> first loop.
>>>
>>> For max speed you really want cache line transfers, so 64 instead of 8
>>> bytes per move.
>>
>> I may be wrong, but I don't think the My 66000 has instructions to do
>> that. But note that on a multi lane CPU, you can get closer to that, as
>> the different ALUs will access consecutive 64 bit blocks at the same
>> time. I don't know if Mitch's hardware can take advantage of that.
>>
> The My 66000 does have cache line sized loads and stores--these must be
> cache line aligned and must be MOD 8 register aligned.

Interesting. How does this interact with VVM? It seems like a lot of
internal storage.


snip
So no double wide shifts as Terje was proposing using?

MitchAlsup

unread,
Nov 6, 2019, 12:59:34 PM11/6/19
to
On Wednesday, November 6, 2019 at 11:09:04 AM UTC-6, Stephen Fuld wrote:
> On 11/6/2019 8:54 AM, MitchAlsup wrote:
> > On Wednesday, November 6, 2019 at 10:17:01 AM UTC-6, Stephen Fuld wrote:
> >> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
> >>> Stephen Fuld wrote:
> >>>> My thoughts now are to compute the number of 8 byte chunks from the
> >>>> starting source address to the next 4k boundary, then do the same for
> >>>> the destination address.  Take the minimum of those two.  Call that
> >>>> the counter
> >>>>
> >>>> Then do the first VVM loop using 64 bit loads and stores, but adding a
> >>>> decrement the counter by 8, test for zero and if so branch out of the
> >>>> first loop.
> >>>
> >>> For max speed you really want cache line transfers, so 64 instead of 8
> >>> bytes per move.
> >>
> >> I may be wrong, but I don't think the My 66000 has instructions to do
> >> that. But note that on a multi lane CPU, you can get closer to that, as
> >> the different ALUs will access consecutive 64 bit blocks at the same
> >> time. I don't know if Mitch's hardware can take advantage of that.
> >>
> > The My 66000 does have cache line sized loads and stores--these must be
> > cache line aligned and must be MOD 8 register aligned.
>
> Interesting. How does this interact with VVM? It seems like a lot of
> internal storage.

HW in the My 66000 architecture is responsible for loading and storing of
Register File at context switches. RF are defined to be on Quad-Line
boundaries, and these instructions access that already existing infrastructure.
Since HW does the loading and storing of RFs, the RF has been configured such
that even a 1-wide machine can write 4-registers per cycle (and similarly
read 4 registers per cycle). So these instructions make use of wider data-
paths than the execution portions have.

This is one of the reasons a context switch has only 12 cycles of overhead.
The 4 cache lines read and the 4 cache lines written completely overlap
(5 total cycles).

Ivan Godard

unread,
Nov 6, 2019, 1:07:27 PM11/6/19
to
Doesn't have, doesn't want.

Does have SIMD of all element sizes. SIMD operands are treated like
scalar, with statically known sizes and counts, but in parallel. SIMD is
intended for things like pixels or complex numbers.

Separately, dynamically sized sequences are called streams. What's on
the other end of a stream can be hardware or software, in or out of your
protection domain. Think Unix pipes. Details NYF.

Terje Mathisen

unread,
Nov 6, 2019, 2:03:17 PM11/6/19
to
Stephen Fuld wrote:
> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>> but Mill's approach with None or NaR flagging of all out of bounds
>> bytes is far simpler and allows much smaller and easier to verify code.
>
> Yes, but Mill doesn't have, or at least hasn't disclosed, anything like
> the VVM.

With 128-bit vector regs you get most of the benefit, with NaR/None
taking care of all the ugly/hard prequel/sequel/alignment issues.
>
>
>> A half-way approach would allow misaligned stores of all full blocks
>> (words/cache lines), i.e. no internal terminator found, but only
>> aligned loads: This is fairly easy to setup by just doing an initial
>> aligned block load, then a byte shuffle/shift to move the bytes down
>> to the beginning of the register to align them, then the normal test
>> for a terminator before we use a masked store to write only these bytes.
>
>
> Doesn't this not write the last few bytes in the last page before the
> write protection boundary?

No: This is the idea behind only writing the full blocks this way; they
are by definition always safe, even if misaligned.

The final block, i.e. with the found terminator will be written using a
masked store (i.e. no access past the final terminator byte), so that is
also safe.

Stephen Fuld

unread,
Nov 6, 2019, 2:04:44 PM11/6/19
to
Obviously, it is tempting to use these for byte move, but I guess you
would have to have instructions within the VVM loop to test for zero
byte in the first seven registers, and use the LOOP instruction to test
for zero in the last eight bytes.

Also, you increase by a factor of 8 the probability of hitting a 4k
boundary.

It complicates the clean up code, as you have to handle multiple registers.

And, of course, it requires start-up code to load/store the first
unaligned bytes. You didn't say you have double width shifts, so
alignment is more complicated.

More to think about. Ugh! :-)

Stephen Fuld

unread,
Nov 6, 2019, 2:09:56 PM11/6/19
to
Right. Maybe not useful for byte move. IIRC, the size in bits of
each belt position is 64, so with SIMD you could use the capability for 8
bytes at a time, but no more. Is that right?


> Separately, dynamically sized sequences are called streams. What's on
> the other end of a stream can be hardware or software, in or out of your
> protection domain. Think Unix pipes. Details NYF.


You just keep whetting our appetites. :-)

Stephen Fuld

unread,
Nov 6, 2019, 2:25:15 PM11/6/19
to
On 11/6/2019 11:03 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>> but Mill's approach with None or NaR flagging of all out of bounds
>>> bytes is far simpler and allows much smaller and easier to verify code.
>>
>> Yes, but Mill doesn't have, or at least hasn't disclosed, anything
>> like the VVM.
>
> With 128-bit vector regs you get most of the benefit, with NaR/None
> taking care of all the ugly/hard prequel/sequel/alignment issues.
>>
>>
>>> A half-way approach would allow misaligned stores of all full blocks
>>> (words/cache lines), i.e. no internal terminator found, but only
>>> aligned loads: This is fairly easy to setup by just doing an initial
>>> aligned block load, then a byte shuffle/shift to move the bytes down
>>> to the beginning of the register to align them, then the normal test
>>> for a terminator before we use a masked store to write only these bytes.
>>
>>
>> Doesn't this not write the last few bytes in the last page before the
>> write protection boundary?
>
> No: This is the idea behind only writing the full blocks this way, they
> are by definition always safe, even if misaligned.

I'm sorry, I don't understand. In the paragraph describing this
approach, you say "allow misaligned stores of all full blocks". Doesn't
this mean that a store could fault and leave the last few bytes of the
page before the fault unwritten? Note I am talking about blocks other
than the final one.

Stephen Fuld

unread,
Nov 6, 2019, 2:30:31 PM11/6/19
to
On 11/6/2019 11:03 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>> but Mill's approach with None or NaR flagging of all out of bounds
>>> bytes is far simpler and allows much smaller and easier to verify code.
>>
>> Yes, but Mill doesn't have, or at least hasn't disclosed, anything
>> like the VVM.
>
> With 128-bit vector regs you get most of the benefit, with NaR/None
> taking care of all the ugly/hard prequel/sequel/alignment issues.

I understand. I mis-remembered that the vector registers were 64 bits.
Sorry Ivan. :-(

And I suppose you could have multiple of these going on at the same time
(wide issue), with the actual number being model dependent. It would be
up to the compiler and the specializer to extract the parallelism. Right?

NaN

unread,
Nov 6, 2019, 3:58:16 PM11/6/19
to
If the 66000 has variable instruction word length and an alignment-free architecture, then the interconnection network between the instruction cache and the instruction register is at least 2N bytes (where N bytes is the maximum word length; strictly it is 2N-1)?
This is possible in the case where the program is not self-modifying (no stores to instructions).
Does that mean instruction prefetch support needs a wider path, and that such a wider path is then sufficient?

BTW, if we try applying this baseline to data, how do stores work under the alignment-free condition?

MitchAlsup

unread,
Nov 6, 2019, 4:25:29 PM11/6/19
to
On Wednesday, November 6, 2019 at 2:58:16 PM UTC-6, NaN wrote:
> If 66000 is variable instruction word length and alignment-free architecture then interconnection network between instruction cache and instruction register is 2N bytes (where maximum N byes word length, strictly it is 2N-1) at least?

The ICache read access width is 4 Aligned Words (in small implementations).

> This is possible in case of that program is not self-writing (no stores for instructions).

Instruction FETCH is defined as being as far into the past as possible, so
a Store into Instruction space (enabled by TLB.W) can write into the
instruction memory. But the Store instruction has no concept as to how far
earlier any FETCHes may have transpired. In addition, if the instruction
is in the Instruction Buffer, it may be used multiple times before the
store is recognized.

> Does that mean of instruction prefetch support needing wider path so then such the wider path is sufficient?

The FETCH path is 4 Words wide per cycle, and there is a 16-20 word instruction
buffer.
>
> BTW, if we try applying this baseline to data, how to store under alignment-free condition?

1-wide machine:

The general state is that after a control transfer, FETCH issues 3 FETCHes
in a row, and examines them after each arrive. If the first quad Word of
instructions contains a branch, the target of the branch will be FETCHed
in the 4th cycle. When the predicted stream has less than 3 remaining
instructions it will FETCH its subsequent. Thus, the FETCHer is attempting
to stage up branch target instructions and maintain the momentum of Inst
issue, and decrease the branch penalty close to zero.

Wider machines are allowed to do what is appropriate for them.

If you want to disturb the instructions in some context, you do it from a
different context. Context switching flushes the IB which is the only way
to synchronize stores and FETCHes.

MitchAlsup

unread,
Nov 6, 2019, 4:27:47 PM11/6/19
to
On Wednesday, November 6, 2019 at 2:58:16 PM UTC-6, NaN wrote:
>
> BTW, if we try applying this baseline to data, how to store under alignment-free condition?

You don't want to be modifying instructions that are visible to some running
context. You want to allocate new memory, write the entire subroutine, then
remove write permission and enable execute permission, and only then post
the address of where you put the new subroutine.
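
In POSIX terms (not My 66000-specific), that pattern looks roughly like:

#include <string.h>
#include <sys/mman.h>

void *publish_code(const void *code, size_t len)
{
    /* 1. allocate fresh, writable, non-executable memory */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* 2. write the entire subroutine */
    memcpy(p, code, len);
    /* 3. remove write permission, enable execute */
    if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0)
        return NULL;
    /* 4. only now post the address for other contexts to call */
    return p;
}
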

NaN

unread,
Nov 6, 2019, 4:45:30 PM11/6/19
to
Mitch-san,

Do you mean that the 66000 ISA is not variable instruction word length (e.g. 32, 64, 96, and 128-bit lengths)?

Best,
NaN

NaN

unread,
Nov 6, 2019, 4:57:20 PM11/6/19
to
If the data word is variable length, let us say maximum D bytes, then D byte-line data cache sub-sets, with an address decoder to select the cache(s) and with shared tag memory, is indeed sufficient. If the byte cache line is too narrow then we can make an S-byte line, which probably supports maximum S-stride access across the D subsets.
Just like chunking, similar to scalarization in GPUs.

NaN

NaN

unread,
Nov 6, 2019, 5:13:55 PM11/6/19
to
On Thursday, November 7, 2019 at 6:57:20 AM UTC+9, NaN wrote:
> If data word is variable length, let us say maximum D-byte, then D byte-line data cache sub-sets with address decoder to select cache(s) and with shared tag memory is indeed sufficient. If the byte cache-line is too narrow then we can make S-byte line which probably supports maximum S-stride access across D subsets.
> Just like a chunking, similar to scalarization in GPUs.
>
> NaN

Maybe a maximum D-stride, S-length byte vector within one cache line is correct.

Stefan Monnier

unread,
Nov 6, 2019, 5:23:58 PM11/6/19
to
> Since the instructions "in" the loop are in fact Scalar instructions, an
> exception is taken as if it were a Scalar instruction. If the exception
> is repaired and control returns, the rest of the loop is run in Scalar
> mode, and the LOOP instruction will transfer control to the VEC instruction,
> and Vector mode reconveins.

IIUC by "rest of the loop" you really mean "rest of the loop's body",
aka a single iteration of the loop is performed before going back to
"vector mode", right?


Stefan

Ivan Godard

unread,
Nov 6, 2019, 6:07:20 PM11/6/19
to
SIMD is up to the config's maximum scalar size, which is part of the
tapeout-time configuration. Machines with quad have up to 16-way SIMD,
and so on; there's no architectural limit, but there are physical cost
considerations because all belt positions must be able to accommodate
the max. An FP-banger config would likely have 16-byte width for {D,D}
complex use, as would a database mainframe, but for 33-digit COBOL
numerics and DBMS row searching. The Mill is a general purpose
architecture :-)

There is limited support for combining SIMD and vectors in that the
vector elements in a loop can be SIMD objects. One of the videos shows
how this implements strcpy(), and byte move works the same way. Look for
the smear() operation.
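
For readers who haven't seen the talk, a generic rendering of the smear
idea (plain C standing in for the vector ops; this is not Mill's actual
operation set):

/* One vector step of strcpy: an inclusive "smear" of the (byte == 0)
   compare masks off every lane after the first NUL, and the store is
   masked accordingly. Returns nonzero once the terminator is copied. */
static int strcpy_step(unsigned char *dst, const unsigned char *v, int n)
{
    int seen = 0;
    for (int i = 0; i < n; i++) {
        int keep = !seen;        /* lanes up to and including first NUL */
        seen |= (v[i] == 0);
        if (keep)
            dst[i] = v[i];       /* masked store */
    }
    return seen;
}
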

>> Separately, dynamically sized sequences are called streams. What's on
>> the other end of a stream can be hardware or software, in or out of
>> your protection domain. Think Unix pipes. Details NYF.
>
>
> You just keep whetting our appetites.  :-)

I try :-)

Ivan Godard

unread,
Nov 6, 2019, 6:13:00 PM11/6/19
to
On 11/6/2019 11:30 AM, Stephen Fuld wrote:
> On 11/6/2019 11:03 AM, Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>> On 11/6/2019 2:00 AM, Terje Mathisen wrote:
>>>> but Mill's approach with None or NaR flagging of all out of bounds
>>>> bytes is far simpler and allows much smaller and easier to verify code.
>>>
>>> Yes, but Mill doesn't have, or at least hasn't disclosed, anything
>>> like the VVM.
>>
>> With 128-bit vector regs you get most of the benefit, with NaR/None
>> taking care of all the ugly/hard prequel/sequel/alignment issues.
>
> I understand.  I mis-remembered that the vector registers were 64 bits.
> Sorry Ivan.  :-(

Operand size is per-member configured.

> And I suppose you could have multiple of these going on at the same time
> (wide issue), with the actual number being model dependent.  It would be
> up to the compiler and the specializer to extract the parallelism.  Right?

Yes, though the specializer does not yet do auto-vectorization.

Alex McDonald

unread,
Nov 6, 2019, 6:33:40 PM11/6/19
to
On 06-Nov-19 12:09, already...@yahoo.com wrote:
> https://www.fujitsu.com/global/about/resources/news/press-releases/2019/0415-01.html
>
> English is not my native tongue so I am not sure what the word "concludes" really means in this context.
> But it sounds like they have a chip that works more than not.

Concluded as in "agreed". It's being used for dramatic effect, to give a
sense of the hours devoted to hard-fought negotiations but now -- a
contract. Compare & contrast with "finalised"; this can be found
describing agreements between parties that have been advised by their
lawyers of the eye-watering legal bills should they go to court.

--
Alex

NaN

unread,
Nov 6, 2019, 6:40:21 PM11/6/19
to
Ivan-san,

How about the complexity caused by the semantic gap between the programming language and the logical architecture?

Best,
S.Takano

On Thursday, November 7, 2019 at 8:07:20 AM UTC+9, Ivan Godard wrote:

MitchAlsup

unread,
Nov 6, 2019, 8:11:10 PM11/6/19
to
On Wednesday, November 6, 2019 at 3:45:30 PM UTC-6, NaN wrote:
> Mitch-san,
>
> Do you mean of that 66000 ISA is not variable instruction word length (ex. 32, 64, 96, and 128-bit length)?

My 66000 ISA IS variable length: 1, 2, 3, 4, or 5 words.
>
> Best,
> NaN

MitchAlsup

unread,
Nov 6, 2019, 8:12:24 PM11/6/19
to
Basically, yes.
>
>
> Stefan

NaN

unread,
Nov 7, 2019, 12:47:59 AM11/7/19
to
> > Do you mean of that 66000 ISA is not variable instruction word length (ex. 32, 64, 96, and 128-bit length)?
> My 66000 ISA IS variable length, 1,2,3,4,5 words in length.

Before talking about the 66000, I would like to confirm your definition of the term “instruction word”.
Is an “instruction word” the unit length which can be fetched and steered by the instruction fetcher, with multiple words composing a single instruction, rather than a VLIW-like approach?

If so, then your definition is equivalent to my definition of a byte-based variable instruction word, just extending the byte to a word.
So, I think an instruction buffer at least (2*5-1)*W bits wide (i.e., a 16-word-wide instruction buffer for word alignment) is needed for a max 5-word instruction length (a single word being W bits) to juggle a fetch in a cycle.
I think standard DDR memory accesses up to 32 bytes, so how is external memory access handled on such a narrow interface if your W is 4 bytes?

Best,
S.Takano

Bruce Hoult

unread,
Nov 7, 2019, 1:01:38 AM11/7/19
to
The entire opcode is in the first word, yes? The rest is just raw literals/addresses/offsets?

Terje Mathisen

unread,
Nov 7, 2019, 2:33:18 AM11/7/19
to
If you get a trap here, then you should! The only way you can get such a
trap is because you are trying to write at least one byte past what you
have legal access to.

The only thing we have to really worry about is out-of-bounds accesses
that occur because we are using SIMD operations to replace byte-by-byte
moves, right?

These can occur either on the read or the store side:

By only reading aligned blocks, we know that no individual read operation
can cross a protection boundary, since these boundaries are at least
page aligned.

On the store side we are good as long as we never write any bytes past
the final location that would have been stored to using the byte-based loop.

Please note that the read approach will never work on a platform with
byte (or just sub-block) hardware protection boundaries, there you
either have to do it the slow way, or have user-level traps allowing you
to blindly use block-based moves, then recover with a byte loop in the
trap handler if triggered.

Ivan Godard

unread,
Nov 7, 2019, 3:08:34 AM11/7/19
to
Or use metadata

Terje Mathisen

unread,
Nov 7, 2019, 4:04:06 AM11/7/19
to
Sure, as on the Mill which I've pointed out several times is doing it
the Right Way. :-)

MitchAlsup

unread,
Nov 7, 2019, 12:09:54 PM11/7/19
to
On Wednesday, November 6, 2019 at 11:47:59 PM UTC-6, NaN wrote:
> > > Do you mean of that 66000 ISA is not variable instruction word length (ex. 32, 64, 96, and 128-bit length)?
> > My 66000 ISA IS variable length, 1,2,3,4,5 words in length.
>
> Before talking about 66000, I would like to confirm your definition of word “instruction word”.

An instruction consists of all the words needed to completely satisfy the
requirements set out by the instruction specifier.

An instruction specifier is the first word of an instruction.

A word is 32-bits.

> Is “instruction word”, unit length which can be fetched and steered by instruction fetcher, and multiple word composes a single instruction, not as VLIW like approach?

An instruction contains all of the necessary parts to completely specify
some kind of calculation, or control transfer, or memory reference.
>
> If it is so then your definition is equivalent to my definition of a byte-based variable instruction word just extending the byte to word in your definition.

Probably, I just use aligned 4-byte containers.

> So, I think that (2*5-1)*W-bit width instruction buffer at least (so, 16-word width instr buff is needed for word-alignment) is needed for max 5-word (a single word is W-bit) instruction length to juggling to fetch in a cycle.

Technically, one could fetch them 1 word at a time, using multiple cycles
to assemble a single executable instruction. But even my lowest conceived
implementation still fetches 128 aligned bits per cycle.

> I think standard DDR memory is up to 32Bytes to access, then how handling external memory access on such the narrow interface if your W is 4-byte?

Caches, with cache line width of 512-bits or 4 quad words.
>
> Best,
> S.Takano

MitchAlsup

unread,
Nov 7, 2019, 12:11:10 PM11/7/19
to
The first word contains everything except {32, 64}-bit immediates and
displacements.

Stephen Fuld

unread,
Nov 7, 2019, 12:33:42 PM11/7/19
to
Right.

>
> These can occur either on the read or the store side:
>
> By only reading aligned blocks, we know that no indivdual read operation
> can cross a protection boundary, since these boundaries are at least
> page aligned.


Yes.


>
> On the store side we are good as long as we never write any bytes past
> the final location that would have been stored to using the byte-based
> loop.


Yes, but consider the following example. The source is 10 bytes long
(though we don't know that at the start), and page aligned. The
destination address is three bytes before a page boundary, and the next
page is beyond the limit. You load the first 8 bytes and find no
termination byte. So you do a store. We agreed above that in this
scenario, there is no need to align the stores, so you attempt the
store. It gets a protection fault because it can't store the fourth
(and subsequent) bytes. Since the store failed, it didn't write bytes
1-3. But these bytes would have been written in a byte by byte loop.

What am I missing?

EricP

unread,
Nov 7, 2019, 2:04:57 PM11/7/19
to
The behavior change is more than just not writing those 3 bytes.

With failure behavior that way, you are changing the page fault
behavior slightly since to succeed both the output pages in the
straddle must be resident and valid, and in the TLB.
If the same happened for source page straddles then the
minimum resident data page set size is 4 pages, not 2.
I know its not a big change, but it is different.

It might affect routines inside an OS that require pages
to be pinned while worked on.



Stefan Monnier

unread,
Nov 7, 2019, 2:18:02 PM11/7/19
to
> The virtual Vector Method (VVM) documentation is not ready for prime time.
> It is a method for adding vector processing to a CPU/ISA that vectorizes
> loops instead of vectorizing instructions.

IIUC your VVM system basically tells the CPU that a loop is coming and
that during this loop the instruction fetch&decode as well as the
register writeback can be temporarily suspended, right?

Beside a reduction in power-consumption, what are the other benefits
(e.g. in terms of performance) compared to running the same loop without
the VEC&LOOP instruction?


Stefan

Terje Mathisen

unread,
Nov 7, 2019, 2:22:08 PM11/7/19
to
Nothing, except that the byte loop would also fail, and trap in the same
location. I.e. this is a program bug and _should_ trap.

If the trap handler fixes this by allocating the next page and
restarting the op, then everything is fine again, and nothing got
written twice, in either the SIMD or byte alternatives.

With no such trap handler I'm happy with leaving the crash dump slightly
different; debugging it will be just about the same.

I.e. I don't really care that the program will "misbehave" when failing.

If we insist on your interpretation then almost all SIMD optimizations
would be faulty.

Stephen Fuld

unread,
Nov 7, 2019, 2:45:57 PM11/7/19
to
Well, I was just responding to what I thought was the requirement. That
is, it has to behave the same as if it was a byte to byte copy, even in
the event of failure. If that isn't a requirement, then things get
*much* simpler. How much simpler depends upon how much you are willing
to relax the requirement in the failure case.

Specifically, for the case of the My 66000's VVM, If you can live with
an arbitrary number of bytes not written on failure, then a 64 bit at a
time load/store loop with VVM works fine, (no extra alignment code
required as MY 66000 supports unaligned loads and stores) and you get a
factor of 8 speedup over the byte at a time version.

I haven't worked out the case of using the 8 double word loads and
stores. You might be able to get almost another factor of 8 speedup.
Not quite, as I think you have to code the termination byte tests for
the first 7 double words explicitly within the loop.
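
A hedged assembler sketch of that relaxed 64-bit loop (modeled on Mitch's
examples upthread; LDD/STD and the "no zero byte in R7" LOOP condition,
written NZ8 here, are invented names):

loop: VEC {...}
    LDD R7,[Rp+Ri<<3] ; 8 bytes at a time, misalignment is fine
    STD R7,[Rq+Ri<<3]
    LOOP NZ8,R7,Ri,1 ; hypothetical "no byte of R7 is zero" condition
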

MitchAlsup

unread,
Nov 7, 2019, 3:15:28 PM11/7/19
to
On Thursday, November 7, 2019 at 1:18:02 PM UTC-6, Stefan Monnier wrote:
> > The virtual Vector Method (VVM) documentation is not ready for prime time.
> > It is a method for adding vector processing to a CPU/ISA that vectorizes
> > loops instead of vectorizing instructions.
>
> IIUC your VVM system basically tells the CPU that a loop is coming and
> that during this loop the instruction fetch&decode as well as the
> register writeback can be temporarily suspended, right?

That more or less captures the spirit of VVM.
>
> Beside a reduction in power-consumption, what are the other benefits
> (e.g. in terms of performance) compared to running the same loop without
> the VEC&LOOP instruction?

A 1-wide machine can run at 3 IPC for significant amounts of time.

Consider a small byte move loop in a machine with 2 LD slots, 1 ST slot,
1 INT slot, 1 Br slot, and 1 FP slot.

loop:
LDB R7,[Rp+Ri]
STB R7,[Rq+Ri]
ADD Ri,Ri,1
CMP Rc,Ri,Rmax
BLT loop

This gets converted into the following loop after vectorizing:

loop: VEC {{VSV}{VSV}{V}}
LDB R7,[Rp+Ri]
STB R7,[Rq+Ri]
LOOP LT,Ri,Rmax,1

VEC is syntactic sugar and, while examined during decode, it merely provides salt for the execution of the loop (DECODEd but not executed). So, now the
loop contains 3 instructions doing the work of 5.

We load the LDB into the LD slot,
We load the STB into the ST slot,
We load the LOOP into the LOOPing slot (Br),
And we let the machine run.

With 1 memory access per cycle, we get 2.5 IPC, for as long as the cache
and TLB have hits.

With 2 memory accesses per cycle, we can get 5 IPC.....

This thing is, there are lots of little loops, VVM expresses these in fewer
instructions, and enables all instructions in the loop to execute based solely
on register and memory dependencies. While executing these loops, the FETCH
and DECODE stages are basically idle.

Consider a reservation station style machine. VVM adds a bit of state to the
logic, such that SCALAR values are captured once, VECTOR values are captured
once per loop, and the RS entry remains valid until the loop terminates.
VECTOR values in the loop do not have to be written to RF during the loop
but are written when the loop terminates.
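
As a data-structure sketch of that extra state (my paraphrase of the
paragraph above; the field names are mine, not from the spec):

#include <stdint.h>

struct rs_entry {
    uint32_t opcode;
    uint64_t operand[3];
    uint8_t  is_vector[3]; /* from the VEC casting bits               */
    uint8_t  valid[3];     /* scalar: captured once, stays valid;
                              vector: re-captured each loop iteration */
    uint8_t  loop_live;    /* entry persists until LOOP terminates;
                              vector results reach the RF only then   */
};
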

>
>
> Stefan

MitchAlsup

unread,
Nov 7, 2019, 3:17:58 PM11/7/19
to
You will surely run into some other saturation boundary before you get
this second factor of 8×. {Cache, L1<->L2, L2<->outer memory hierarchy}

> Not quite, as I think you have to code the termination byte tests for
> the first 7 double words explicitly within the loop.

But still, you get the first factor of <about> 3 for essentially zero cost.

Stefan Monnier

unread,
Nov 7, 2019, 3:58:16 PM11/7/19
to
>> Beside a reduction in power-consumption, what are the other benefits
>> (e.g. in terms of performance) compared to running the same loop without
>> the VEC&LOOP instruction?
> A 1-wide machine can run at 3 IPC for significant amounts of time.

IIUC this is because the "1-wide" limit is assumed to be imposed by the
decoder, right? I guess it also lifts the width-limit imposed by the
write-back, since that's also suspended.
You'd still need to be able to retire 3 instructions per cycle, tho, no?

> This thing is, there are lots of little loops, VVM expresses these in
> fewer instructions, and enables all instructions in the loop to
> execute based solely on register and memory dependencies. While
> executing these loops, the FETCH and DECODE stages are
> basically idle.

Right. It's a kind of "build your own vector instruction" toolkit.
If you look at it from the point of view of a C programmer, it comes
with many restrictions, but if you look at it from the point of view of
Cray-style vectors, it gives you a lot more flexibility than "vector
chaining".


Stefan

NaN

unread,
Nov 7, 2019, 4:16:00 PM11/7/19
to
Mitch-san,

On Friday, November 8, 2019 at 2:09:54 AM UTC+9, MitchAlsup wrote:
> > Before talking about 66000, I would like to confirm your definition of word “instruction word”.
> An instruction consists of all the words needed to completely satisfy the
> requirements set out by the instruction specifier.
> An instruction specifier is the first word of an instruction.
> A word is 32-bits.

If the 66000 really supports 3- to 5-word instruction lengths, then it actually needs a (2*5-1)-word-wide I/F to fetch in a cycle, as follows:
<---------------- I/F WIDTH ---------------->
[w00][w01][w02][w03][w04][w10][w11][w12][w13]
[w14][w20][w21][w22][w23][w24][w30][w31][w32]
[w33][w34][w40][w41][w42][w43][w44][w50][w51]
[w52][w53][w54][w60][w61][w62][w63][w64][w70]
[w71][w72][w73][w74][w80][w81][w82][w83][w84]

where [wij] is j-th word for i-th instruction.
These are all the access patterns for the worst case of the 5-word instruction length.

> > Is “instruction word”, unit length which can be fetched and steered by instruction fetcher, and multiple word composes a single instruction, not as VLIW like approach?
> An instruction contains all of the necessary parts to completely specify
> some kind of calculation, or control transfer, or memory reference.

Yes, a variable-length instruction needs a specifier bit-field in the first word to indicate its length (how many words the instruction takes).
5 length types need a ceil(log2(5)) = 3-bit field for the specifier in the first word, right?

> > I think standard DDR memory is up to 32Bytes to access, then how handling external memory access on such the narrow interface if your W is 4-byte?
> Caches, with cache line width of 512-bits or 4 quad words.

Then, multiple "steps" are necessary for an icache miss caused by a 5-word instruction in the worst case, right?
And does it have to be taken into account when fitting the working set into the icache? (Although that is application-dependent.)
The cache is, of course, there to bridge the bandwidth gap in the memory hierarchy, but the 5-word case adds some complexity in the worst case, even if it does not occur frequently, I think.

Best,
NaN

NaN

unread,
Nov 7, 2019, 4:20:12 PM11/7/19
to
Mitch-san,

Anyway, where does the name 66000 come from? (What is the meaning of the name?)

Best,
NaN

already...@yahoo.com

unread,
Nov 7, 2019, 4:48:49 PM11/7/19
to
Mitch is a big fan of the CDC 6600 and was an architect of the Motorola 88000.

MitchAlsup

unread,
Nov 7, 2019, 5:08:52 PM11/7/19
to
On Thursday, November 7, 2019 at 2:58:16 PM UTC-6, Stefan Monnier wrote:
> >> Beside a reduction in power-consumption, what are the other benefits
> >> (e.g. in terms of performance) compared to running the same loop without
> >> the VEC&LOOP instruction?
> > A 1-wide machine can run at 3 IPC for significant amounts of time.
>
> IIUC this is because the "1-wide" limit is assumed to be imposed by the
> decoder, right? I guess it also lifts the width-limit imposed by the
> write-back, since that's also suspended.
> You'd still need to be able to retire 3 instructions per cycle, tho, no?

No, that is a subtle side effect of vectorizing the loop: the instruction
(per se) exists for the duration of the loop! So it never retires until
the loop completes.

>
> > This thing is, there are lots of little loops, VVM expresses these in
> > fewer instructions, and enables all instructions in the loop to
> > execute based solely on register and memory dependencies. While
> > executing these loops, the FETCH and DECODE stages are
> > basically idle.
>
> Right. It's a kind of "build your own vector instruction" toolkit.
> If you look at it from the point of view of a C programmer, it comes
> with many restrictions, but if you look at it from the point of view of
> Cray-style vectors, it gives you a lot more flexibility than "vector
> chaining".

And at very low cost (no VRF,...)
>
>
> Stefan

NaN

unread,
Nov 7, 2019, 5:10:13 PM11/7/19
to
Thanks, already-san.
Then I think "MI66000" (MItch's 66000), "MT66000" (Mitch Tech's 66000), and/or "MC66000" (borrowing MC from Motorola) are candidates.

Best,
NaN

MitchAlsup

unread,
Nov 7, 2019, 5:18:19 PM11/7/19
to
On Thursday, November 7, 2019 at 3:16:00 PM UTC-6, NaN wrote:
> Mitch-san,
>
> On Friday, November 8, 2019 at 2:09:54 AM UTC+9, MitchAlsup wrote:
> > > Before talking about 66000, I would like to confirm your definition of the term “instruction word”.
> > An instruction consists of all the words needed to completely satisfy the
> > requirements set out by the instruction specifier.
> > An instruction specifier is the first word of an instruction.
> > A word is 32-bits.
>
> If My 66000 really supports 3- to 5-word instructions, then it actually needs a (2*5-1)-word-wide fetch interface (I/F) to fetch any instruction in a single cycle, as follows:
> <---------------- I/F WIDTH ---------------->
> [w00][w01][w02][w03][w04][w10][w11][w12][w13]
> [w14][w20][w21][w22][w23][w24][w30][w31][w32]
> [w33][w34][w40][w41][w42][w43][w44][w50][w51]
> [w52][w53][w54][w60][w61][w62][w63][w64][w70]
> [w71][w72][w73][w74][w80][w81][w82][w83][w84]
>
> where [wij] is the j-th word of the i-th instruction.

No, because of 2 things: if all of an instruction is not present, it cannot
proceed past DECODE in the front-end pipeline. That is, you can stall waiting
for more words to arrive.

And, more importantly, the very vast majority of instructions are single word
in length.

In practice fetching 4-words per cycle is more than adequate for 1-wide
and 2-wide machines.

> These are all the access patterns for the worst case of the 5-word instruction length.

Worst case is dealt with by stalling until the rest of the instruction arrives.
>
> > > Is an “instruction word” the unit length which can be fetched and steered by the instruction fetcher, with multiple words composing a single instruction, rather than a VLIW-like approach?
> > An instruction contains all of the necessary parts to completely specify
> > some kind of calculation, or control transfer, or memory reference.
>
> Yes, a variable-length instruction needs a specifier bit-field in the first word to indicate its length (how many words the instruction has).
> 5 lengths need a ceil(log_2(5)) = 3-bit field for the specifier in the first word, right?

No, I use 4 bits. When the first bit is 0, the 3 bits do OTHER things to DECODE,
other things I want the ISA to be able to do (illustrated way above);
when it is 1, the 3 bits define immediates in lieu of registers.
So the real cost is 1 bit.
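
A hypothetical decode sketch in C. The field position (top 4 bits here) and the 3-bit-to-extra-words mapping are invented for illustration; only the overall shape, 1 selector bit plus 3 bits that either steer DECODE or select immediates, comes from the description above:

#include <stdint.h>

static int insn_words(uint32_t first_word)
{
    unsigned spec = (first_word >> 28) & 0xFu;  /* assumed field position */
    if ((spec & 0x8u) == 0)
        return 1;   /* 0xxx: the 3 bits do other DECODE things; 1 word   */
    /* 1xxx: the 3 bits select which operands become immediates and how
       big they are; a stand-in mapping giving lengths of 2..5 words.    */
    static const int extra[8] = { 1, 1, 2, 2, 3, 3, 4, 4 };
    return 1 + extra[spec & 0x7u];
}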

>
> > > I think standard DDR memory accesses up to 32 bytes, so how do you handle external memory access on such a narrow interface if your W is 4 bytes?
> > Caches, with cache line width of 512-bits or 4 quad words.
>
> Then, multiple "steps" are necessary for an icache miss caused by a 5-word instruction in the worst case, right?

DECODE has to wait for all of the instruction words to arrive before pushing
the instruction into EXECUTE, yes. Other than that, no.

> And does it have to be taken into account when fitting the working set into the icache? (Although that is application-dependent.)

DECODE has to wait for all of the instruction words to arrive before pushing
the instruction into EXECUTE, yes. Other than that, no.

> The cache is, of course, there to bridge the bandwidth gap in the memory hierarchy, but the 5-word case adds some complexity in the worst case, even if it does not occur frequently, I think.

I think you think improperly.
>
> Best,
> NaN

MitchAlsup

unread,
Nov 7, 2019, 5:19:42 PM11/7/19
to
Cray (Seymour) named the CDC 6600.
I postpended a "0" and prepended "My " to denote that it is MY architecture.

Stefan Monnier

unread,
Nov 7, 2019, 5:25:30 PM11/7/19
to
>> >> Beside a reduction in power-consumption, what are the other benefits
>> >> (e.g. in terms of performance) compared to running the same loop without
>> >> the VEC&LOOP instruction?
>> > A 1-wide machine can run at 3 IPC for significant amounts of time.
>> IIUC this is because the "1-wide" limit is assumed to be imposed by the
>> decoder, right? I guess it also lifts the width-limit imposed by the
>> write-back, since that's also suspended.
>> You'd still need to be able to retire 3 instructions per cycle, tho, no?
> No, that is a subtle side effect of vectorizing the loop: the instruction
> (per se) exists for the duration of the loop! So it never retires until
> the loop completes.

Then how do you get back the "in-order" semantics when you notice
a mispredicted branch or an exception (e.g. on a memory access)?
Maybe it's not "retirement" strictly speaking, but you need to keep
track of the relative ordering somewhere.

>> Right. It's a kind of "build your own vector instruction" toolkit.
>> If you look at it from the point of view of a C programmer, it comes
>> with many restrictions, but if you look at it from the point of view of
>> Cray-style vectors, it gives you a lot more flexibility than "vector
>> chaining".
> And at very low cost (no VRF,...)

That's because it's a "memory-to-memory" kind of vector
architecture, basically.


Stefan

NaN

unread,
Nov 7, 2019, 5:31:05 PM11/7/19
to
> > If My 66000 really supports 3- to 5-word instructions, then it actually needs a (2*5-1)-word-wide fetch interface (I/F) to fetch any instruction in a single cycle, as follows:
> > <---------------- I/F WIDTH ---------------->
> > [w00][w01][w02][w03][w04][w10][w11][w12][w13]
> > [w14][w20][w21][w22][w23][w24][w30][w31][w32]
> > [w33][w34][w40][w41][w42][w43][w44][w50][w51]
> > [w52][w53][w54][w60][w61][w62][w63][w64][w70]
> > [w71][w72][w73][w74][w80][w81][w82][w83][w84]
> > where [wij] is the j-th word of the i-th instruction.
>
> No, because of 2 things: if all of an instruction is not present it cannot
> proceed past DECODE in the front end pipeline. That is you can stall waiting
> for more words to arrive.

I am talking about an instruction fetch approach that handles the worst case (a 5-word instruction) in a single cycle.
I think this simplifies the control after the fetch stage by having a single timing pattern.
If stalling is sufficient this does not matter, but it can be single-cycle fetching.

Best,
NaN

MitchAlsup

unread,
Nov 7, 2019, 5:56:19 PM11/7/19
to
On Thursday, November 7, 2019 at 4:25:30 PM UTC-6, Stefan Monnier wrote:
> >> >> Beside a reduction in power-consumption, what are the other benefits
> >> >> (e.g. in terms of performance) compared to running the same loop without
> >> >> the VEC&LOOP instruction?
> >> > A 1-wide machine can run at 3 IPC for significant amounts of time.
> >> IIUC this is because the "1-wide" limit is assumed to be imposed by the
> >> decoder, right? I guess it also lifts the width-limit imposed by the
> >> write-back, since that's also suspended.
> >> You'd still need to be able to retire 3 instructions per cycle, tho, no?
> > No, that is a subtle side effect of vectorizing the loop: the instruction
> > (per se) exists for the duration of the loop! So it never retires until
> > the loop completes.
>
> Then how do you get back the "in-order" semantics when you notice
> a mispredicted branch or an exception (e.g. on a memory access)?
> Maybe it's not "retirement" strictly speaking, but you need to keep
> track of the relative ordering somewhere.

Yes, you do, but because the vector loop is NOT out-of-order with respect
to itself, the problem is greatly reduced. Basically the HW is performing
loop unrolling, AGENing memory addresses IN ORDER (this is what enables
loads to advance in front of stores without alias problems) and performing
loop calculations IN ORDER per FU, and not performing register writes
(so there is little to clean up). All you have to do is to find the last
value of all registers in {V} status and write them while transferring
control to handler.
>
> >> Right. It's a kind of "build your own vector instruction" toolkit.
> >> If you look at it from the point of view of a C programmer, it comes
> >> with many restrictions, but if you look at it from the point of view of
> >> Cray-style vectors, it gives you a lot more flexibility than "vector
> >> chaining".
> > And at very low cost (no VRF,...)
>
> That's because it's a "memory-to-memory" kind of vector
> architecture, basically.

But it is NOT Memory to Memory--it is <essentially> RS-to-RS data flow.
And it has the low latencies of RS-to-RS data flow.
>
>
> Stefan

MitchAlsup

unread,
Nov 7, 2019, 6:04:49 PM11/7/19
to
Consider the smallest pipeline I am considering: a 1-wide machine with a
2-stage DECODE that I call PARSE and DECODE.

In cycle k   we send out an address m to the I$.
In cycle k+1 we send out an address ((m & ~15) + 16) to the I$.
             we receive 4 words from the I$
             // let us postulate that 3 of these are usable
             // for our 5-word instruction.
             we PARSE the first instruction and ship it to DECODE
In cycle k+2 we send out an address ((m & ~15) + 32) to the I$.
             we receive 4 more words from the I$
             we DECODE the first instruction
// at this point all 5 needed words are present so the DECODEd instruction
// is pushed into EXECUTE.

So, an empty Inst Buffer still does not result in "eating" a stall!
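
A toy model in C of that timing claim, assuming 4-word fetches with one fetch completing per cycle: whatever the start offset of a 5-word instruction within a fetch block, two fetches suffice.

#include <stdio.h>

int main(void)
{
    /* For every possible word offset of a 5-word instruction within a
       4-word fetch block, count the fetches needed to buffer all 5
       words; the answer is always 2, i.e. ready by cycle k+2.        */
    for (int off = 0; off < 4; off++) {
        int have = 4 - off;                 /* usable words, 1st fetch */
        int fetches = 1;
        while (have < 5) {
            have += 4;
            fetches++;
        }
        printf("start offset %d: %d fetches\n", off, fetches);
    }
    return 0;
}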

>
> Best,
> NaN

Terje Mathisen

unread,
Nov 8, 2019, 2:42:01 AM11/8/19
to
Stephen Fuld wrote:
> On 11/7/2019 11:22 AM, Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>> What am I missing?
>>
>> Nothing, except that the byte loop would also fail, and trap in the
>> same location. I.e. this is a program bug and _should_ trap.
>>
>> If the trap handler fixes this by allocating the next page and
>> restarting the op, then everything is fine again, and nothing got
>> written twice, in either the SIMD or byte alternatives.
>>
>> With no such trap handler I'm happy with leaving the crash dump
>> slightly different, debuggung it will be just about the same.
>>
>> I.e. I don't really care that the program will "misbehave" when failing.
>>
>> If we insist on your interpetation then almost all SIMD optimizations
>> would be faulty.
>
>
> Well, I was just responding to what I thought was the requirement.  That
> is, it has to behave the same as if it was a byte to byte copy, even in
> the event of failure.  If that isn't a requirement, then things get
> *much* simpler.  How much simpler depends upon how much you are willing
> to relax the requirement in the failure case.

Sure. :-)
>
> Specifically, for the case of the My 66000's VVM, If you can live with
> an arbitrary number of bytes not written on failure, then a 64 bit at a
> time load/store loop with VVM works fine, (no extra alignment code
> required as MY 66000 supports unaligned loads and stores) and you get a
> factor of 8 speedup over the byte at a time version.

No, that would NOT work, since if you use misaligned loads, then you can
trap when crossing a protection boundary coming directly after the
actual terminator!

I.e. for C-style strings (or other data structure with embedded
terminator signal), you cannot perform any loads that cross a boundary
before you know that it is safe to do so, i.e. no terminator found yet.

Using only aligned load & store ops is of course sufficient to handle
all cases, but it does require an internal alignment stage to handle any
relative adjustment between source & dest. With SIMD your options are
usually either a two-source byte shuffle or a two-source shifter, with
the latter being _very_ rare for SIMD ops.

Having just a single-source byte shuffle means that you also have to do a
mask/combine with the previous remainder stage.

When you have a fast misaligned load op it might actually make sense to
load the data twice: First aligned to check for the terminator location,
and then misaligned so as to allow aligned stores?

I.e. using/abusing the byte alignment handler in the cache interface to
remove the problem. :-)

Brett

unread,
Nov 8, 2019, 3:28:51 AM11/8/19
to
The mistake of variable length instructions.
(Yes i was the VLIW fan boy of this group, people grow.)

Let’s say you want to be modern and scale up to 16 wide execution over the
next decade or two, do you really want the hassles of variable length
instructions?

Also you need something better than antique RISC ideas to realistically
scale up to 16 wide.
Mill does split decode, and there are lots of opportunities and variants
here.

How about on a function call the forward part is fixed width instructions
and the backward part is cache line(s) of bulk data constants and offsets
needed by the forward instructions.

Most CPUs have only two read ports and one write port, and a simple mistake
like having a write-through L1 cache, as Bulldozer did, can cripple performance.
Having a read only scratchpad loaded on function call gives you a nice bump
in read performance programmers will flock to, not realizing how little it
really matters. ;)
ARM crushed the supposedly superior MIPS; marketing matters, and something new
and different and better matters. ARM was a game player with four different
instruction sets; MIPS had mostly one, plus a 64-bit extension.

You can afford 32 bits or so of layout description at the start of the
scratch to make addressing this scratch easier for the following
instructions. Your short branch offset is now a scratch index with
potentially a large offset, so you get shorter fixed sized instructions.
24bit opcodes are possible. I realize this can cost a cycle on a function
call, don’t worry about it, you save on smaller opcodes. Perceived
performance matters more than real performance, like ARM.

You can take this to another level in version two by adding a zero page
that is not cache coherent for more read and write bandwidth, treated as a
second or extended register file for easy access instead of a funny memory
address.
Cache line loads and stores would manage the zero page, bulk loading and
storing those registers. Which you will sell as better bandwidth
performance against byte read writes to and from registers. Do all the
small work in registers, with bulk access to dram.

Forget about writing a compiler for the zero page; offer a peephole optimizer
that is fed with compiler directives tagging pointers that the zero page
should try to load.

As a hardware guy these ideas may scare you, and the performance uplift of
the first versions may be minimal. But ask programmers what they think, it
will be like waving a cat toy in front of a cat. Cats that have deep
pocketbooks and no interest in ordinary RISC chip variants.

You may be able to get paid to explore these ideas by venture capitalists.


Bruce Hoult

unread,
Nov 8, 2019, 7:38:07 AM11/8/19
to
On Friday, November 8, 2019 at 12:28:51 AM UTC-8, Brett wrote:
> The mistake of variable length instructions.
> (Yes i was the VLIW fan boy of this group, people grow.)
>
> Let’s say you want to be modern and scale up to 16 wide execution over the
> next decade or two, do you really want the hassles of variable length
> instructions?

There are vastly different degrees of variability between "none at all" (e.g. fixed 32 bit instructions in traditional RISCs) and "ridiculous" (e.g. "any integral number of bytes from 1 to 15" in x86_64).

In between you have Mitch's 4,8,12,16, or 20 bytes, and RISC-V RV64GC's and Thumb2's 2 or 4 bytes.

Two lengths feels like it should be more easily scalable to wider decode than five lengths. At least all three of those (My 66000, RISC-V, Thumb2) decode the length by looking at just a few bits in the first chunk.

One argument on Mitch's side is that his variability comes from including large literals and offsets within an instruction, allowing 1 instruction to do the work that would require several instructions in RISC-V and ARM (either assembling a large value into a temporary register from several small chunks, or else loading it from a constant pool).

The counter-argument is that such large constants and offsets are rare, and the important uses of them are inside loops where (if you have enough registers) you can move the loading of them outside the loop.

This seems to come down to some kind of aesthetics vs hard data argument.

paul wallich

unread,
Nov 8, 2019, 8:56:35 AM11/8/19
to
On 11/8/19 7:38 AM, Bruce Hoult wrote:
> On Friday, November 8, 2019 at 12:28:51 AM UTC-8, Brett wrote:
>> The mistake of variable length instructions.
>> (Yes i was the VLIW fan boy of this group, people grow.)
>>
>> Let’s say you want to be modern and scale up to 16 wide execution over the
>> next decade or two, do you really want the hassles of variable length
>> instructions?
[...]
> Two lengths feels like it should be more easily scalable to wider decode than five lengths. At least all three of those (My 66000, RISC-V, Thumb2) decode the length by looking at just a few bits in the first chunk.
>
> One argument on Mitch's side is that his variability comes from including large literals and offsets within an instruction, allowing 1 instruction to do the work that would require several instructions in RISC-V and ARM (either assembling a large value into a temporary register from several small chunks, or else loading it from a constant pool).
>
> The counter-argument is that such large constants and offsets are rare, and the important uses of them are inside loops where (if you have enough registers) you can move the loading of them outside the loop.
>
> This seems to come down to some kind of aesthetics vs hard data argument.

Speaking of hard data, it seems to me that in many situations nowadays
performance is dominated by worst-case behavior (not that worst-case
behavior often happens, but rather that it's so bad that it swamps
everything else). So shaving a few percent off worst-case occurrence
rates or timings might be worth more than even double-digit percentage
improvements in maximum performance.

Data?

paul


NaN

unread,
Nov 8, 2019, 9:30:59 AM11/8/19
to
On Monday, November 4, 2019 at 2:26:04 AM UTC+9, MitchAlsup wrote:
> It is a method for adding vector processing to a CPU/ISA that vectorizes
> loops instead of vectorizing instructions.
> In fact VVM adds but 2 instructions to the ISA, one denoting which registers
> in the loop are vectorized and which are Scalar. Scalar registers are static
> over the loop, vector registers change at least once per loop; the other
> denoting the loop termination and looping condition.
> VVM is perfectly capable of vectorizing small string loops and loops upto
> about the size of the inner loop of FFTs (29 instructions in My 66000 ISA).

Then, how do you handle "inter-iteration" data dependencies?
The read/write duration in clock cycles for an inter-iteration data dependency is decided by the dependency distance;

for (i=N; i<MAX; i++) Array[i] = S*Array[i - N] + T;

where S and T are scalars (e.g. coefficient and offset, respectively).
The dependency distance in this tiny example is N: the Array element, or the register allocated to it, has to be kept live for at least N-1 cycles.
Or the arithmetic operation takes at least N-1 cycles after the load.
Does this distance not damage precise exceptions?

In addition,
> For example:: A simple string copy routine in My 66000 Assembler:
> loop:
> LDSB R7,[Rp+Ri]
> STB R7,[Rq+Ri]
> ADD Ri,Ri,1
> BNZ R7,loop

This implies that there is no inter-iteration dependency in the loop, and it focuses on "chainability" between functional units (FUs; bypassing between FUs driven by true data-dependency detection) within an iteration, i.e. on "intra-iteration" data dependency.
I like the chaining approach invented by Cray-san; this is the heart of Mitch-san's vectorization method, I think.
Does My 66000 not address such inter-iteration data dependencies, or can it handle them as well?

> vectorizes into:
> loop: VEC {{VSV}{VSV}{VV}{V}}
> LDSB R7,[Rp+Ri]
> STB R7,[Rq+Ri]
> LOOP NZ,R7,Ri,1

What is the meaning of {{VSV}{VSV}{VV}{V}}, and why not indicate the register numbers in the RF?
I think the VEC instruction can be a two-word instruction if the architecture has a 32-entry register file for 32-bit data: the first word can indicate VEC and Ri++, and the second can hold a set of flags addressing which registers are used as "pseudo" vector-chain registers. When the second word is the immediate 0x0000_8003, that means the 1st, 2nd, and 16th registers are the vector-chain registers in the 32-entry RF, and the others are "scalars" within the loop.
This removes the special "LOOP NZ,R7,Ri,1" instruction in favor of a simple traditional "BNZ R7,loop".
So, where the original My 66000 needs two special instructions for vectorization, now we need only one special instruction, right?
And such a new VEC instruction is executed only the first time; therefore the two-word VEC instruction is low-cost?

Best,
S.Takano

Stephen Fuld

unread,
Nov 8, 2019, 10:57:52 AM11/8/19
to
Ahhhh! Good point. For the VVM case, you could start with a byte loop
for up to 7 bytes to get to the point where the loads in the subsequent
double word loop are aligned. This is pretty easy to code, and
shouldn't add much overhead - no need for any shuffles.
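
In C, that shape might look like the sketch below; the SWAR zero-byte test and all names are mine, and the possibly misaligned 8-byte store is written as memcpy, which the hardware would do as a single unaligned store:

#include <stdint.h>
#include <string.h>

static char *copy_string(char *dst, const char *src)
{
    while ((uintptr_t)src & 7) {            /* at most 7 iterations    */
        if ((*dst++ = *src++) == '\0')
            return dst - 1;
    }
    const uint64_t *s64 = (const uint64_t *)src;
    for (;;) {
        uint64_t w;
        memcpy(&w, s64, 8);                 /* aligned load: cannot    */
                                            /* trap past the terminator*/
        if ((w - 0x0101010101010101ull) & ~w & 0x8080808080808080ull)
            break;                          /* some byte of w is zero  */
        memcpy(dst, &w, 8);                 /* possibly misaligned store */
        dst += 8;
        s64++;
    }
    const char *tail = (const char *)s64;   /* word with the terminator*/
    while ((*dst++ = *tail++) != '\0')
        ;
    return dst - 1;
}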

Anton Ertl

unread,
Nov 8, 2019, 11:21:30 AM11/8/19
to
paul wallich <p...@panix.com> writes:
>Speaking of hard data, it seems to me that in many situations nowadays
>performance is dominated by worst-case behavior (not that worst-case
>behavior often happens, but rather that it's so bad that it swamps
>everything else).

What do you have in mind?

I have certainly observed huge improvements in worst-case behaviour in
certain areas on Intel CPUs. In particular, a Core 2 Duo E8400
(Wolfdale) takes ~160 cycles on a page-crossing unaligned load, a
Sandy Bridge or Ivy Bridge only ~30. The Opteron 270 (K8) fares
better at 6 cycles, but has this penalty for every unaligned access
<2015Mar3...@mips.complang.tuwien.ac.at>.

If you are thinking about DRAM latency, that has become much worse in
terms of clock cycles in the 1990s with the GHz race on, but since the
clock rates stabilized, DRAM latency in cycles has stabilized, too.

>So shaving a few percent off worst-case occurrence
>rates or timings might be worth more than even double-digit percentage
>improvements in maximum performance.

It always depends on where the bottleneck is. If the bottleneck is
executed instructions, you have to reduce that; if the bottleneck is
DRAM accesses, you have to reduce that; if the bottleneck is
dependencies, you have to reduce that; if the bottleneck is resources,
you have to reduce resource consumption; etc.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

Stephen Fuld

unread,
Nov 8, 2019, 12:10:59 PM11/8/19
to
On 11/8/2019 4:38 AM, Bruce Hoult wrote:
> On Friday, November 8, 2019 at 12:28:51 AM UTC-8, Brett wrote:
>> The mistake of variable length instructions.
>> (Yes i was the VLIW fan boy of this group, people grow.)
>>
>> Let’s say you want to be modern and scale up to 16 wide execution over the
>> next decade or two, do you really want the hassles of variable length
>> instructions?
>
> There are vastly different degrees of variability between "none at all" (e.g. fixed 32 bit instructions in traditional RISCs) and "ridiculous" (e.g. "any integral number of bytes from 1 to 15" in x86_64).
>
> In between you have Mitch's 4,8,12,16, or 20 bytes, and RISC-V RV64GC's and Thumb2's 2 or 4 bytes.
>
> Two lengths feels like it should be more easily scalable to wider decode than five lengths. At least all three of those (My 66000, RISC-V, Thumb2) decode the length by looking at just a few bits in the first chunk.
>
> One argument on Mitch's side is that his variability comes from including large literals and offsets within an instruction, allowing 1 instruction to do the work that would require several instructions in RISC-V and ARM (either assembling a large value into a temporary register from several small chunks, or else loading it from a constant pool).

The comparison versus the constant pool is interesting. The savings
are more than they first appear.

You free up a load/store/agen unit in the CPU.

Getting the data into the ALU will never result in an additional cache
miss.

The data is read as part of a sequential stream versus a likely random
memory access.

Presumably putting the data in the I cache versus the D cache is a win.



> The counter-argument is that such large constants and offsets are rare, and the important uses of them are inside loops where (if you have enough registers) you can move the loading of them outside the loop.


And if you have it in a constant pool, you only have it once, while you
might need it multiple times in different places in the instruction
stream, though I expect the space savings to be tiny.


>
> This seems to come down to some kind of aesthetics vs hard data argument.


Mitch has presented numbers for the overall savings. I presume they
came from actual instruction traces, etc.

MitchAlsup

unread,
Nov 8, 2019, 12:36:01 PM11/8/19
to
My 66000:
Any LD/ST that accesses across a page/protection boundary where one side
is not accessible will take a trap before any data is read or written.
This remains precise for both Scalar and Vector.

MitchAlsup

unread,
Nov 8, 2019, 12:47:27 PM11/8/19
to
On Friday, November 8, 2019 at 2:28:51 AM UTC-6, Brett wrote:
> MitchAlsup <?????@aol.com> wrote:
> > On Thursday, November 7, 2019 at 12:01:38 AM UTC-6, Bruce Hoult wrote:
> >> On Wednesday, November 6, 2019 at 5:11:10 PM UTC-8, MitchAlsup wrote:
> >>> On Wednesday, November 6, 2019 at 3:45:30 PM UTC-6, NaN wrote:
> >>>> Mitch-san,
> >>>>
> >>>> Do you mean of that 66000 ISA is not variable instruction word length
> >>>> (ex. 32, 64, 96, and 128-bit length)?
> >>>
> >>> My 66000 ISA IS variable length, 1,2,3,4,5 words in length.
> >>
> >> The entire opcode is in the first word, yes? The rest is just raw
> >> literals/addresses/offsets?
> >
> > The first word contains everything except {32, 64}-bit {immediates and
> > displacements.}
> >
>
> The mistake of variable length instructions.
> (Yes i was the VLIW fan boy of this group, people grow.)
I was an ardent first-generation-RISC advocate until I saw how easy it was
to support VLI. Yes, people do grow.
I was an ardent Aligned Only memory advocate until I saw how easy it was
for HW to make the problem go away. Yes, people do grow.
>
> Let’s say you want to be modern and scale up to 16 wide execution over the
> next decade or two, do you really want the hassles of variable length
> instructions?

Yes, because it gets rid of 6%-9% of the dynamic instructions. In addition
all immediates and displacements can be protected in E-only pages (no RW).
The Immediates and DIsplacements come through the I$ and do not pollute
the D$ or DTLB.
>
> Also you need something better than antique RISC ideas to realistically
> scale up to 16 wide.

Well, first of all, I can get up to 7-wide in a single stage of DECODE
due to the requirement of having an Instruction Buffer. I don't think
Instruction Issue will ever get 16-wide, but if it did, you have to be
in a position of taking 3 branches per cycle, and the only way I know of
doing that is a packet (or trace) cache for instructions.

> Mill does split decode, and there are lots of opportunities and variants
> here.
>
> How about on a function call the forward part is fixed width instructions
> and the backward part is cache line(s) of bulk data constants and offsets
> needed by the forward instructions.

My guess is that 92%-95% of all My 66000 instructions are single word.
>
> Most CPUs have only two read ports and one write port, and a simple mistake
> like having a write-through L1 cache, as Bulldozer did, can cripple performance.

So, why would anyone do that?

> Having a read only scratchpad loaded on function call gives you a nice bump
> in read performance programmers will flock to, not realizing how little it
> really matters. ;)
> ARM crushed the supposedly superior MIPS; marketing matters, and something new
> and different and better matters. ARM was a game player with four different
> instruction sets; MIPS had mostly one, plus a 64-bit extension.
>
> You can afford 32 bits or so of layout description at the start of the
> scratch to make addressing this scratch easier for the following
> instructions. Your short branch offset is now a scratch index with
> potentially a large offset, so you get shorter fixed sized instructions.
> 24bit opcodes are possible. I realize this can cost a cycle on a function
> call, don’t worry about it, you save on smaller opcodes. Perceived
> performance matters more than real performance, like ARM.

With an instruction buffer, PARSE can scan ahead, find CALL instructions and
FETCH the target before the CALL instruction gets into DECODE, saving call
latency.

MitchAlsup

unread,
Nov 8, 2019, 12:51:26 PM11/8/19
to
Large constants may be rare, but if the cost to support them is low, and they
provide other useful benefits, then they are worth it.
>
> This seems to come down to some kind of aesthetics vs hard data argument.

They save 6%-9% in the instruction issued+retired count, the constants come
in through the I$ (higher hit rate) under E permission (not W-able) and
do not pollute the D$ or DTLB (preserving performance). To me, this seems
to state, they don't add much, but they add it where you want it added.

MitchAlsup

unread,
Nov 8, 2019, 12:58:54 PM11/8/19
to
On Friday, November 8, 2019 at 8:30:59 AM UTC-6, NaN wrote:
> On Monday, November 4, 2019 at 2:26:04 AM UTC+9, MitchAlsup wrote:
> > It is a method for adding vector processing to a CPU/ISA that vectorizes
> > loops instead of vectorizing instructions.
> > In fact VVM adds but 2 instructions to the ISA, one denoting which registers
> > in the loop are vectorized and which are Scalar. Scalar registers are static
> > over the loop, vector registers change at least once per loop; the other
> > denoting the loop termination and looping condition.
> > VVM is perfectly capable of vectorizing small string loops and loops upto
> > about the size of the inner loop of FFTs (29 instructions in My 66000 ISA).
>
> Then, how do you handle "inter-iteration" data dependencies?
> The read/write duration in clock cycles for an inter-iteration data dependency is decided by the dependency distance;
>
> for (i=N; i<MAX; i++) Array[i] = S*Array[i - N] + T;
>
> where S and T are scalars (e.g. coefficient and offset, respectively).
> The dependency distance in this tiny example is N: the Array element, or the register allocated to it, has to be kept live for at least N-1 cycles.

Any machine will stretch out to a maximal size (i.e., execution window) and
if Array[i-N] is live inside the execution window, we have a loop carried
dependency. This will be detected by comparing the LD addresses to the already
generated ST addresses, and delivering result data in lieu of memory data.

For those situations where N is bigger than the execution window, the same
code runs at vector performance.
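
A toy model in C of that detection step. Structure and names are assumptions; real hardware does this with CAMs, and partial overlaps need more care than this same-address equality check:

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t addr;     /* store address, generated in program order    */
    uint64_t data;     /* result waiting to be written                 */
    bool     pending;  /* not yet drained to the cache                 */
} store_entry;

/* If the load hits a pending store, hand it the result data in lieu of
   memory data; scan youngest-first so the latest store wins.          */
static bool load_forward(const store_entry *stq, int n,
                         uint64_t ld_addr, uint64_t *out)
{
    for (int i = n - 1; i >= 0; i--) {
        if (stq[i].pending && stq[i].addr == ld_addr) {
            *out = stq[i].data;
            return true;
        }
    }
    return false;      /* no overlap: the load may be hoisted          */
}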

> Or the arithmetic operation takes at least N-1 cycles after the load.
> Does this distance not damage precise exceptions?
>
> In addition,
> > For example:: A simple string copy routine in My 66000 Assembler:
> > loop:
> > LDSB R7,[Rp+Ri]
> > STB R7,[Rq+Ri]
> > ADD Ri,Ri,1
> > BNZ R7,loop
>
> This implies that there is no inter-iteration dependency in the loop, and it focuses on "chainability" between functional units (FUs; bypassing between FUs driven by true data-dependency detection) within an iteration, i.e. on "intra-iteration" data dependency.
> I like the chaining approach invented by Cray-san; this is the heart of Mitch-san's vectorization method, I think.
> Does My 66000 not address such inter-iteration data dependencies, or can it handle them as well?

Mitch thinks that HW should take care of dependencies, and SW should take care
of expressing algorithms.
>
> > vectorizes into:
> > loop: VEC {{VSV}{VSV}{VV}{V}}
> > LDSB R7,[Rp+Ri]
> > STB R7,[Rq+Ri]
> > LOOP NZ,R7,Ri,1
>
> What is the meaning of {{VSV}{VSV}{VV}{V}}, and why not indicate the register numbers in the RF?

There are no bits in the instruction to denote S versus V. The VEC instruction
casts bits forward to provide said information.

> I think the VEC instruction can be a two-word instruction if the architecture has a 32-entry register file for 32-bit data: the first word can indicate VEC and Ri++, and the second can hold a set of flags addressing which registers are used as "pseudo" vector-chain registers. When the second word is the immediate 0x0000_8003, that means the 1st, 2nd, and 16th registers are the vector-chain registers in the 32-entry RF, and the others are "scalars" within the loop.

VEC can have a 16-bit immediate, a 32-bit immediate, or a 64-bit immediate.

MitchAlsup

unread,
Nov 8, 2019, 1:02:27 PM11/8/19
to
I am not in that position yet. The data comes from a distillation of data from
Hennessy and Patterson concentrating on instruction data from RISC CPUs. It
is not better than hand-waving accuracy.

Ivan Godard

unread,
Nov 8, 2019, 1:38:52 PM11/8/19
to
On 11/8/2019 12:28 AM, Brett wrote:
> MitchAlsup <Mitch...@aol.com> wrote:
>> On Thursday, November 7, 2019 at 12:01:38 AM UTC-6, Bruce Hoult wrote:
>>> On Wednesday, November 6, 2019 at 5:11:10 PM UTC-8, MitchAlsup wrote:
>>>> On Wednesday, November 6, 2019 at 3:45:30 PM UTC-6, NaN wrote:
>>>>> Mitch-san,
>>>>>
>>>>> Do you mean of that 66000 ISA is not variable instruction word length
>>>>> (ex. 32, 64, 96, and 128-bit length)?
>>>>
>>>> My 66000 ISA IS variable length, 1,2,3,4,5 words in length.
>>>
>>> The entire opcode is in the first word, yes? The rest is just raw
>>> literals/addresses/offsets?
>>
>> The first word contains everything except {32, 64}-bit {immediates and
>> displacements.}
>>
>
> The mistake of variable length instructions.
> (Yes i was the VLIW fan boy of this group, people grow.)
>
> Let’s say you want to be modern and scale up to 16 wide execution over the
> next decade or two, do you really want the hassles of variable length
> instructions?
>
> Also you need something better than antique RISC ideas to realistically
> scale up to 16 wide.
> Mill does split decode, and there are lots of opportunities and variants
> here.

Split stream is only part of it. The real issue is that operations vary
greatly in the amount of entropy they must express.

You can encode to match the entropy, and you wind up with the x86 and
everything from 1 to 20+ byte ops and an exponential linear parse
problem. Or you can encode with everything maximal length and you get
the Itanium and all that entropy (and memory) waste.

To cut this knot you pretty much must adopt some kind of a bundle
concept, where the bundles directly encode the bundle length rather
than having the parser figure it out. Even Mitch's vector ops that spray
metadata over subsequent ops can be seen as a sort of bundle. Instructions
within a bundle can be different lengths but with fixed known placement,
so both the parse and entropy problems go away. While the bundles
themselves are varying length, the number of bundles per cycle can be
kept low enough for practical parsing.

Like your suggestion, Mill also dedicates special placement for pure
data such as constants and offsets. One of the six bundles (per cycle)
in the Mill instruction stream is just a byte array. The contents gets
divvied up among the ops in the other bundles, and so far we don't see
any problem with the contained data being byte aligned, although we
might have to go to word alignment if the shifter in large members gets
too expensive.

Ivan Godard

unread,
Nov 8, 2019, 1:55:26 PM11/8/19
to
On 11/8/2019 4:38 AM, Bruce Hoult wrote:
Only if you restrict the usage to "traditional" constants. If you have
cheap wide literals then you can use them in many other places besides
arithmetic constants and offsets.

Mill uses far more literal bytes for function (and branch) arg/result
than it does for arithmetic constants of any kind. A Mill call with five
args is one ~2 byte op with seven bytes (before compression) for all
args and the offset, favorably comparing with an x86 using PUSH. RISC
needs six ops and 24 bytes, yes?

Of course, that makes a Mill call() a CISC op. Horrors!

Bruce Hoult

unread,
Nov 8, 2019, 2:21:53 PM11/8/19
to
It should be pretty easy to do the experiment for real in RISC-V land. Instruction encodings for how to do longer instructions are defined, there are plenty of FPGA-compatible open-source cores on github that could be extended (or just modify one of the software emulators), and it would be easy to teach the existing gcc or llvm back ends to use instructions with large immediates/offsets.

I'd certainly like to see the experiment done. But right now we have the U87 (Out-of-Order CPU with vector unit) to get out, along with software support for it.

6%-9% fewer instructions executed per program is in the range that might well end up being negated by fewer instructions per cycle or fewer cycles per second or more Joules per program if you weren't careful.

Bruce Hoult

unread,
Nov 8, 2019, 2:32:20 PM11/8/19
to
Sorry, could you clarify that? Is that 2 byte opcode, 5x 1 byte for operands, and 2 bytes for call offset?

RISC-V will do that with 5x 2 byte instructions to move five operands from arbitrary registers to a0-a4, plus a 4 byte JAL instruction to call a function within a +/- 1 MB range. So that's 14 bytes vs 9. Except that a good number of your function arguments might well be generated into the correct place without copying them.

Terje Mathisen

unread,
Nov 8, 2019, 2:33:40 PM11/8/19
to
I would definitely prefer to load 8 aligned bytes from the masked start
address, check for the terminator byte and if found, store the bytes up
to and including that byte and return.

If not found in the misaligned prequel, then I would load the next 8
bytes and combine the bottom bytes with the valid bytes from the first
dword, using a double-wide SHR.

This full word is written to the destination, then the dest pointer is
updated by the number of bytes in the prequel, so that when we now write
the full (second) dword, we will rewrite 1-7 bytes.

From real code, quite short strings are so common that we have to make
that more or less the fast path.
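
The combine step in C, assuming little-endian byte order; o must be 1..7 here, since a shift by 64 would be undefined, so the exactly-aligned case has to take a separate path:

#include <stdint.h>

/* o = byte offset of the string within the first aligned dword, 1..7. */
static uint64_t combine(uint64_t lo, uint64_t hi, unsigned o)
{
    return (lo >> (8 * o)) | (hi << (8 * (8 - o)));
}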

Brett

unread,
Nov 8, 2019, 2:41:25 PM11/8/19
to
I am guessing that Mill has a special stack and so you do not pay the cache
read/write cost of a traditional stack ~80% of the time.

I had been thinking that the cache 2 read 1 write was a major limit, but
Intel is wider than that now and there was not a major performance
improvement.

Terje Mathisen

unread,
Nov 8, 2019, 3:20:12 PM11/8/19
to
I'm pretty sure (80% ?) Mitch is right, i.e. having a very cheap way to
encode all constants, small and large, as inline immediates, will in
fact turn out to be a measurable/significant performance/power win.

The main benefit is probably from improved cache hit rates: Inline code
is very easy to get to nearly 100%, branching is the cause of almost all
I$ misses.

Ivan Godard

unread,
Nov 8, 2019, 3:24:24 PM11/8/19
to
Mill Gold config is ~28 wide; population is not yet tuned but I doubt
the total will change much. The specializer does aggressive
if-conversion and speculation, so few performance-important loops have
internal control flow. For such loops it is easy to saturate the core by
simple unrolling or software pipelining.

However, different loops saturate on different things - load paths,
store paths, FP units, integer units, the special Mill-only units - and
branch units. Gold has four branch units, but that is probably
over-provisioned. Four lets us do four-way unroll/pipeline, but we
almost certainly run out of something else before we can unroll that
much. So you have to take "28 wide" with a grain of salt - while 28-way
is physically possible on a Gold, in real code you'll run out of
something between 16- and 20-way.

Incidentally, I suspect Mitch's difficulty with three-way branching is
due to him needing a prediction per branch. Mill exit prediction avoids
that, and needs only one prediction per EBB, which is easily tractable
regardless of the number of branch instructions in that EBB.

MitchAlsup

unread,
Nov 8, 2019, 3:35:51 PM11/8/19
to
I was not suggesting that predicting 3 branches per cycle was hard, just
that it becomes necessary (in some form) as width gets that wide.

In Mc 88120 we predicted a branch every cycle and could take a branch every
cycle. So in cycle k one fetches a packet at address m, and in cycle k+1
one can fetch from any of {return address, jump predictor address, sequential
address or taken address}. This was 1990. We used an agree predictor and
2 bits in the packet to drive this logic. When a packet was built it was
built on the observed path and could contain instructions from multiple
basic blocks with one conditional and any number of unconditional branches.
The branch predictor agreed with the observed flow or not.

In K9 we would predict 2 branches per cycle also based on previously observed
instruction flow order. It all scales nicely once you get rid of certain cruft
(like taken/not-taken versus observed/not-observed; having to calculate the
addresses, guessing which conditions,...)

Stephen Fuld

unread,
Nov 8, 2019, 3:48:58 PM11/8/19
to
On 11/8/2019 11:33 AM, Terje Mathisen wrote:
> Stephen Fuld wrote:
>> On 11/7/2019 11:41 PM, Terje Mathisen wrote:
>>> No, that would NOT work, since if you use misaligned loads, then you
>>> can trap when crossing a protection boundary coming directly after
>>> the actual terminator!
>>
>>
>> Ahhhh!  Good point.  For the VVM case, you could start with a byte
>> loop for up to 7 bytes to get to the point where the loads in the
>> subsequent double word loop are aligned.  This is pretty easy to code,
>> and shouldn't add much overhead - no need for any shuffles.
>
> I would definitely prefer to load 8 aligned bytes from the masked start
> address, check for the terminator byte and if found, store the bytes up
> to an including that byte and return.


Doesn't that cause a problem if there is a zero byte after the masked
start address but before the unmasked starting address? e.g. The start
address is byte 5 of 8, but byte 2 happens to be zero. You can get
around that, but it does require some extra work.


> If not found in the misaligned prequel, then I would load the next 8
> bytes and combine the bottom bytes with the valid bytes from the first
> dword, using a double-wide SHR.

But apparently the My 66000 doesn't have double wide shifts.


> This full word is written to the destination, then the dest pointer is
> updated by the number of bytes in the prequel, so that when we now write
> the full (second) dword, we will rewrite 1-7 bytes.
>
> From real code, quite short strings are so common that we have to make
> that more or less the fast path.
>
> Terje
>



Ivan Godard

unread,
Nov 8, 2019, 4:20:05 PM11/8/19
to
On 11/8/2019 11:32 AM, Bruce Hoult wrote:
> On Friday, November 8, 2019 at 10:55:26 AM UTC-8, Ivan Godard wrote:
>> On 11/8/2019 4:38 AM, Bruce Hoult wrote:
>>> On Friday, November 8, 2019 at 12:28:51 AM UTC-8, Brett wrote:
>>>> The mistake of variable length instructions.
>>>> (Yes i was the VLIW fan boy of this group, people grow.)
>>>>
>>>> Let’s say you want to be modern and scale up to 16 wide execution over the
>>>> next decade or two, do you really want the hassles of variable length
>>>> instructions?
>>>
>>> There are vastly different degrees of variability between "none at all" (e.g. fixed 32 bit instructions in traditional RISCs) and "ridiculous" (e.g. "any integral number of bytes from 1 to 15" in x86_64).
>>>
>>> In between you have Mitch's 4,8,12,16, or 20 bytes, and RISC-V RV64GC's and Thumb2's 2 or 4 bytes.
>>>
>>> Two lengths feels like it should be more easily scalable to wider decode than five lengths. At least all three of those (My 66000, RISC-V, Thumb2) decode the length by looking at just a few bits in the first chunk.
>>>
>>> One argument on Mitch's side is that his variability comes from including large literals and offsets within an instruction, allowing 1 instruction to do the work that would require several instructions in RISC-V and ARM (either assembling a large value into a temporary register from several small chunks, or else loading it from a constant pool).
>>>
>>> The counter-argument is that such large constants and offsets are rare, and the important uses of them are inside loops where (if you have enough registers) you can move the loading of them outside the loop.
>>
>> Only if you restrict the usage to "traditional" constants. If you have
>> cheap wide literals then you can use them in many other places besides
>> arithmetic constants an offsets.
>>
>> Mill uses far more literal bytes for function (and branch) arg/result
>> than it does for arithmetic constants of any kind. A Mill call with five
>> args is one ~2 byte op with seven bytes (before compression) for all
>> args and the offset, favorably comparing with an x86 using PUSH. RISC
>> needs six ops and 24 bytes, yes?
>
> Sorry, could you clarify that? Is that 2 byte opcode, 5x 1 byte for operands, and 2 bytes for call offset?

Things are bit fields inside a Mill bundle - if you know where the
fields are then it's as cheap to have things bit aligned as byte
aligned. Sizes vary by member, so I'll take the mid-range Silver as an
example. Silver has a morsel size of 4 bits, and each flow slot has an
associated immediate that is 0/1/2/4 bytes, and 0/1/2/3 associated
morsels, for a total bit count from 0 to 44 bits of data available. Ops
can carry over into an adjacent slot's data allocation; there are enough
trailing data-only slots that the biggest op in the last opcode slot
gets enough bits for its maximum needs.

This is separate from the opcode proper, whose size varies by slot
because each opcode slot has an entropy-optimal encoding whose size
depends on what other ops are supported in that slot. I'll assume 16
bits for the opcode, off by a handful of bits at the most.

The call uses one opcode slot (16), 5 arg morsels (5x4) and one offset
(0/1/2/4x8, 32 worst case) for a total of 68 bits or a hair over 8 bytes
worst case. A long time ago we looked at the average offset size and
IIRC it was ~10 bits, which would drop the whole call to under 7 bytes
on average.

Then there's the post-decode advantages too - only one op to issue vs.
six. That's the kind of thing that makes comparing Mill ILP difficult:
sure, a Silver is 20-wide - but three of those can be calls, so is it
20-wide or 35-wide?

> RISC-V will do that with 5x 2 byte instructions to move five operands from arbitrary registers to a0-a4, plus a 4 byte JAL instruction to call a function within a +/- 1 MB range. So that's 14 bytes vs 9. Except that a good number of your function arguments might well be generated into the correct place without copying them.
>

Mill 4-byte (max) offsets give a 4GB max range, very unlikely to be
used, while the 2-byte offset size (64k range) will miss some that your
1M range still can reach. Generating args in the right place is possible
for either architecture but more annoying in the temporal addressing of
a Mill, and a morsel is cheap enough that we don't bother trying.

There's security implications here: Mill can call to untrusted code
across protection boundaries, and the callee gets only the args
explicitly listed in the call. I don't know how or even whether you
handle exploit/bugs issues in calls, but I suspect that you have to
spill and clear the entire register set to do a semantically reliable call.

And just clearing isn't enough, because you don't have any way to poison
the unpassed registers; using a cleared but uninitialized zero is just
as much a bug as is reading the register droppings of your caller.
Security (and correctness even in the absence of security issues) is a
major focus in the Mill design - and one that seems to be ignored in
RISC architectures.

All in all, Mill seems to have the same 2x density advantage that we
have seen against other RISCs. We've always felt that code density vs.
any RISC was ignorable (though not as bad as Itanium of course), and our
real competitive target was 8086 code, which I consider to be the gold
standard of code density. Eyeball on 86 hand code when we last looked (a
while ago) showed roughly similar density depending on application. Not
bad considering that a Mill has 64-bit addresses vs. 16.

Terje Mathisen

unread,
Nov 8, 2019, 5:07:02 PM11/8/19
to
Stephen Fuld wrote:
> On 11/8/2019 11:33 AM, Terje Mathisen wrote:
>> Stephen Fuld wrote:
>>> On 11/7/2019 11:41 PM, Terje Mathisen wrote:
>>>> No, that would NOT work, since if you use misaligned loads, then you
>>>> can trap when crossing a protection boundary coming directly after
>>>> the actual terminator!
>>>
>>>
>>> Ahhhh!  Good point.  For the VVM case, you could start with a byte
>>> loop for up to 7 bytes to get to the point where the loads in the
>>> subsequent double word loop are aligned.  This is pretty easy to
>>> code, and shouldn't add much overhead - no need for any shuffles.
>>
>> I would definitely prefer to load 8 aligned bytes from the masked
>> start address, check for the terminator byte and if found, store the
>> bytes up to and including that byte and return.
>
>
> Doesn't that cause a problem if there is a zero byte after the masked
> start address but before the unmasked starting address?  e.g. The start
> address is byte 5 of 8, but byte 2 happens to be zero. You can get
> around that, but it does require some extra work.

I guessed that you would spot that little loophole. :-)

Yes, you do have to OR in some non-zero bottom bytes, probably using a
misaligned load from a 14-byte array of 7 0xff bytes followed by 7 zero
bytes.
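
The same idea with a computed mask instead of the table; a little-endian sketch (the table variant avoids the variable shift):

#include <stdint.h>

/* o = number of bytes before the real start within the aligned dword,
   0..7 (little-endian). Forcing those low bytes to 0xff means they can
   never be mistaken for the terminator.                               */
static uint64_t guard_prequel(uint64_t word, unsigned o)
{
    uint64_t mask = (1ull << (8 * o)) - 1;  /* low o bytes all-ones    */
    return word | mask;
}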

The alternative is actually quite nice, using a partial store of the
first N bytes needed to align the source pointer, then using only aligned loads:

uint64_t *aligned_source = (uint64_t *) ((intptr_t) source & ~7);
uint64_t *dest64 = (uint64_t *) dest;

unsigned source_byte_offset = (unsigned) ((intptr_t) source & 7);
unsigned source_align_bytes = 8 - source_byte_offset;

uint64_t first_word = *aligned_source++;
first_word >>= 8 * source_byte_offset; // zero bytes shifted in on top
unsigned first_zero = _first_null(first_word); // Intrinsic

// Intrinsic, write the first 1-8 bytes (assumed to clamp at 8)
_masked_store_n(dest64, first_word, first_zero + 1);

if (first_zero < source_align_bytes) { // prequel exit
    return;
}
dest64 = (uint64_t *) (dest + source_align_bytes);
uint64_t w;
do {
    w = *aligned_source++;
    first_zero = _first_null(w);
    if (first_zero < 8) break;
    *dest64++ = w;
} while (1);
_masked_store_n(dest64, w, first_zero + 1);

This code is of course better supported by the SIMD instruction set; using
that, it would consist of nearly all compiler intrinsics. :-)
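
For reference, one portable way to implement a _first_null() like the intrinsic assumed above (little-endian; __builtin_ctzll is the GCC/Clang builtin, so this is a sketch rather than the actual intrinsic):

#include <stdint.h>

/* Returns the index (0..7) of the first zero byte of w, or 8 if w
   contains no zero byte.                                             */
static unsigned first_null(uint64_t w)
{
    uint64_t z = (w - 0x0101010101010101ull)
               & ~w & 0x8080808080808080ull;
    return z ? (unsigned)__builtin_ctzll(z) / 8u : 8u;
}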

Ivan Godard

unread,
Nov 8, 2019, 5:26:48 PM11/8/19
to
We pass args on the belt up to the length of the belt (16 on Silver),
like register machines passing in the regs. Between the belt and the
scratchpad, we only create a stack frame when there's a local whose
address is taken and passed in a call.

> I had been thinking that the cache 2 read 1 write was a major limit, but
> Intel is wider than that now and there was not a major performance
> improvement.

There isn't much gain until you get from 2x1 to 4x2 and
unroll/pipeline. Assuming that you have enough regs (or Mill equivalent)
so that spill/fill is rare, and that constants are in the code and not
loaded.