On Sunday, June 6, 2021 at 11:31:15 AM UTC-5, BGB wrote:
> On 6/6/2021 5:34 AM, Michael S wrote:
> > On Sunday, June 6, 2021 at 1:24:27 AM UTC+3, MitchAlsup wrote:
> >> On Saturday, June 5, 2021 at 4:12:21 PM UTC-5, Paul A. Clayton wrote:
> >>> On Friday, June 4, 2021 at 1:20:59 PM UTC-4, MitchAlsup wrote:
> >>>> On Friday, June 4, 2021 at 12:01:09 PM UTC-5, Quadibloc wrote:
> >>>>> On Friday, May 14, 2021 at 1:21:26 PM UTC-6, Thomas Koenig wrote:
> >>>>>
> >>>>>> It killed off too many RISC architectures. There would eventually
> >>>>>> have been a consolidation, but x86_64 was not the right architecture
> >>>>>> to consolidate to...
> >>>>> So true, but there was little choice then. Now we have a second chance,
> >>>>> ARM.
> >>>> <
> >>>> Not much of a choice:
> >>>> a) mud pie
> >>>> b) mud pudding
> >>>
> >>> Are you aware that AArch64 (64-bit ARM ISA) is a relatively clean RISC and that ARM is no longer developing high-performance A-profile cores that support the 32-bit ISAs? (I think an abstraction layer would be better than a traditional ISA, but the business case for such is weaker. My 66000 is better than AArch64, but AArch64 does not seem like mud pudding.)
> >> <
> >> While I value your opinion, you, too, will come to a different conclusion in a few years.
> >
> >
> > Does it mean that you expect that in a few years (how many?) the best available (== shipping, sold and bought, *not* paper) high-performance CPU would be neither mud pie nor mud pudding nor mud pastry (i.e. POWER)?
> >
> Weighing in with my opinions here...
>
> It seems almost inevitable that an ISA will end up messy in some areas,
> absent frequent redesign efforts, which would in turn break binary
> compatibility.
<
This is one of the BIG reasons My 66000 has about 30% of the major OpCode space
unallocated at present. This gives plenty of room to grow, without having to squirrel
things into odd corners.
>
> Adding features, and trying to avoid breaking existing binaries, will
> mean that new features don't necessarily mesh cleanly with older
> features, ...
<
This is another BIG reason I ended up preferring instruction modifiers over instructions
when extending the ISA to cover seldom-used features (CARRY is the prime example).
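As a rough behavioral sketch of what a CARRY-style modifier buys (this is a Python illustration of the idea, not actual My 66000 encodings or semantics): the modifier names a register to carry state through a run of ordinary ADDs, so the base ADD never needs condition-code flags, yet multi-precision arithmetic stays one instruction per limb.

```python
# Hypothetical sketch: a CARRY "modifier" threads carries through an
# explicit register across the shadowed ADDs, so the base ADD
# instruction itself needs no flags register.

MASK64 = (1 << 64) - 1

def carry_chain_add(a_limbs, b_limbs):
    """Add two multi-precision numbers held as little-endian 64-bit
    limbs, threading the carry the way a CARRY-shadowed ADD chain
    would."""
    carry = 0
    out = []
    for a, b in zip(a_limbs, b_limbs):
        s = a + b + carry
        out.append(s & MASK64)       # low 64 bits go to the destination
        carry = s >> 64              # carry lives in a named register
    return out, carry
```

The point of the modifier formulation: the common case (no CARRY prefix) pays nothing for a feature it never uses.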
>
>
> So, eg, AArch64 deals with 32 and 64 bit operations in a roughly similar
> way to x86-64, but then has funkiness to deal with sign or zero
> extending inputs (rather than keeping the values themselves in a sign or
> zero extended form).
<
I completely agree.
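Concretely, the funkiness in question is the AArch64 extended-operand form (e.g. "add x0, x1, w2, sxtw"): the 32-bit input gets sign-extended at each point of use, rather than the register always holding the value in extended form. A small Python model of that per-use extension (helper names are mine):

```python
# Model of a per-use sign extension, as in AArch64's
# "add x0, x1, w2, sxtw": the extension happens in the operand path,
# not in how the register value is kept.

MASK64 = (1 << 64) - 1

def sxtw(w):
    """Sign-extend the low 32 bits of w to a 64-bit value."""
    w &= 0xFFFFFFFF
    return ((w ^ 0x80000000) - 0x80000000) & MASK64

def add_sxtw(x1, w2):
    """add x1, <sign-extended w2>, modulo 2^64."""
    return (x1 + sxtw(w2)) & MASK64
```

An ISA that instead keeps 32-bit results in registers already sign- or zero-extended avoids spending encoding bits on the per-operand extension option.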
>
> It also uses condition-code flags, which personally I am not
> particularly a fan of (I consider the "1 bit predicate" to be a
> different system).
<
I completely agree, here, too.
>
>
> One other drawback with most traditional RISC designs is that to run
> multiple ops in parallel, it is necessary for the hardware to be able to
> figure out whether or not parallel execution is possible, which is more
> difficult and more resource intensive than encoding it explicitly.
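The grouping check being described grows quadratically with issue width: every later instruction's sources must be compared against every earlier instruction's destination within the group. A toy Python version of just the RAW portion of that comparator matrix:

```python
# Why wide in-order grouping logic is expensive: for a group of n
# instructions there are n*(n-1)/2 source-vs-destination comparisons
# for RAW hazards alone (WAW and WAR add more of the same).

def find_raw_hazards(group):
    """group: list of (dest_reg, src_regs). Returns pairs (i, j)
    meaning instruction j reads instruction i's result, so they
    cannot issue in the same cycle on a bypass-free machine."""
    hazards = []
    for j, (_, srcs_j) in enumerate(group):
        for i in range(j):               # the quadratic comparator array
            dest_i, _ = group[i]
            if dest_i in srcs_j:
                hazards.append((i, j))
    return hazards
```

In hardware every one of those comparisons is a parallel comparator, paid for every cycle, which is the resource cost the quoted paragraph is pointing at.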
<
When I built the "wide" machine, we configured the instruction cache to
have more bits per instruction to encode the intra-instruction dependencies.
If an instruction in the packet consumed a result as an operand, the register
specifier had its high-order bit (HoB) set and the lower part of the field
pointed at the instruction that would deliver said result. When an instruction
produced a result that did not survive the end of the packet, we marked it as
dead so we did not even allocate a destination register for it.
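A sketch of that widened specifier (the field widths here are my guesses, not the real machine's): one extra high-order bit per register specifier selects between "architectural register" and "result of packet slot k".

```python
# Guessed widths: 5-bit register number plus 1 HoB, giving a 6-bit
# specifier in the widened I-cache. HoB set => the low bits name the
# packet slot producing the operand, not a register.

SPEC_BITS = 6
HOB = 1 << (SPEC_BITS - 1)

def forward_from_slot(slot):
    """Encode 'this operand comes from packet slot <slot>'."""
    return HOB | slot

def decode_spec(spec):
    """Return ('slot', k) for an intra-packet forward, else ('reg', r)."""
    if spec & HOB:
        return ('slot', spec & (HOB - 1))
    return ('reg', spec)
```

With the dependency pre-resolved into the specifier, the issue stage reads a pointer instead of running the comparator matrix.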
<
This isolated all of this fairly hard O(n^2)-to-O(n^3) work into the packet
builder, which had execution-window cycles in which to build packets and write
the instruction cache. When running out of the I-cache we decoded only 1
instruction per cycle and used the inter-packet dependency logic to deal with
dependencies. We remembered these dependencies while the instructions executed,
and then used the info when packing instructions into the packet.
<
When running in-packet, you get 6-8 instructions per access and can transfer
control to 2 (or 3) different target addresses for the next fetch. There is no
arithmetic in the selection process, merely a "tag" from the packet and "take"
(or "agree") bits from the predictor(s).
<
Were I to do this today, I would do 2 instructions per cycle running out of the I-cache.
>
> Though it is a tradeoff, in that now compiled code needs to care about
> things like how wide the target machine is, ...
<
The major thing the compiler should be concerned with is generating the
fewest instructions possible to calculate the semantics of the program;
AND the fewest control transfers possible (i.e., use PRED instead of Branch).
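To illustrate the PRED-over-branch preference (a behavioral toy, not My 66000 PRED semantics): a predicate instruction casts a shadow over the following then-group and else-group, both of which are fetched in-line; only the selected group's writes take effect, and no control transfer ever happens.

```python
# Toy model of predicated execution: both arms sit in the straight-line
# instruction stream; the predicate selects which arm's writes commit.
# No branch means no branch mispredict and no fetch redirect.

def pred_execute(cond, then_ops, else_ops, state):
    """then_ops/else_ops: lists of (dest_reg, value) writes.
    Apply only the selected group's writes to 'state'."""
    for reg, val in (then_ops if cond else else_ops):
        state[reg] = val
    return state
```

For short, unbiased if/else bodies this turns a hard-to-predict branch into plain data flow, which is exactly the "fewest control transfers" rule above.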
>
> So, a wider machine would likely need to fall back to using the
> superscalar approach if compiling code specifically for that width is
> not viable.
<
I want code compiled for a 1-wide in-order machine to run within spitting
distance (10%-ish) of the best code possible for the Great Big Out-of-Order
Machine.
>
>
>
>
> I was almost doing OK in my current ISA, though recent efforts towards
> expanding the GPR space to R0..R63 haven't been super clean.
>
> But, why? Mostly because when writing things like rasterizer and
> edge-walking loops in my OpenGL rasterizer (in ASM) I was frequently
> running into issues with running out of GPRs and needing to fall back to
> (gasp) spilling variables to memory.
<
I consider a rasterizer to be a dedicated unit that spits out long vectors
of pixels that need to be passed through a pixel-processing kernel. You feed
it triangles; it feeds you vectors of pixels with interpolated coordinates.
That is: this is really one of those things you want dedicated HW for if
you want to run fast.
<
I understand that the HW BGB is working with does not provide the area
to be able to do this.