On 7/13/2022 10:49 PM, Andy wrote:
> On 10/07/22 15:12, BGB wrote:
>
>
>> I am left wondering if I could make it fit, at least in a basic sense,
>> on something like an XC7A100T. Dev-boards with these (such as the
>> Nexys A7) are available for around ~ $270 or so last I looked (and
>> this is basically what I am using for my BJX2 Core).
>>
>>
>> At least at a superficial level, the IA-64 ISA isn't *that* far beyond
>> what I have already done with the BJX2 ISA.
>>
>
> If you say so, looks like Mount Everest to me though...
>
It is complicated, granted, but at a basic level most of the parts are
not *that* complicated. The main problem, as noted, would be the
stupidly large register file.
>
>>
>> Most obvious difference is that the IA-64 register file would be
>> significantly larger. Would also probably need to omit the IA-32
>> decoder, ...
>>
>
> Perhaps something smaller, Transmeta Crusoe or Efficeon maybe, only 64
> registers if you include the deep speculation, 32 if you skip that, and
> the IA-32 decoding is just a re-assemble of the Code Morphing firmware
> you can find on the internet.
>
I had considered a few times maybe trying to do an x86 emulator on BJX2,
but this is one of those "never got around to it" issues.
Would need to go directly to JIT though, as there is pretty much no hope
of usable performance with a conventional interpreter.
And, on a 50 MHz CPU core, I would probably be lucky if it even matched
the performance of the original IBM PC.
Would likely also need instructions to allow faking the behavior of
x86-style ALU and branch ops (my ISA lacks condition codes, and these
would be expensive to emulate).
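As a rough sketch (in C, with made-up names), this is roughly what an
emulator ends up doing for every guest ADD when the flags have to be
materialized in software:

#include <stdint.h>

/* Hypothetical sketch: recomputing x86 EFLAGS for a 32-bit ADD entirely
   in software, the way an interpreter or naive JIT would have to when
   the host ISA has no condition codes. */
typedef struct {
    uint8_t cf, zf, sf, of;   /* guest flag bits kept as plain bytes */
} X86Flags;

static uint32_t emu_add32(uint32_t a, uint32_t b, X86Flags *fl)
{
    uint32_t r = a + b;
    fl->cf = (r < a);                        /* unsigned carry out */
    fl->zf = (r == 0);
    fl->sf = (r >> 31) & 1;
    fl->of = (~(a ^ b) & (a ^ r)) >> 31;     /* signed overflow    */
    return r;
}

Four or five extra ALU ops per guest instruction adds up quickly at
50 MHz, which is a big part of why helper instructions (or a JIT that
only materializes flags when a branch actually consumes them) would be
needed.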
>
>> In this case, the idea would partly be to emulate parts of the ISA on
>> top of itself (likely via hardware traps).
>>
>>
>> If I were to do it via a modified BJX2 core, would potentially replace
>> the RISC-V alt-mode with an IA-64 alt-mode, and considerably expand
>> the size of the register file and similar.
>
> Hmmm,
>
In any case, not going to do this; it was more of a hypothetical.
>>
>> Though, this looks concerning, the amount of expansions needed would
>> likely push the core beyond the resource limits of the XC7A100T.
>
> Maybe skipping the great big Intel CPU cores is for the best. ;-)
>
Probably true.
I had previously wanted to buy a board with an XC7A200T (Nexys Video),
but lacked money.
Now it seems they are sold out pretty much everywhere...
>> If I were to approach the register file design in a similar way to
>> what I have done with my BJX2 core, I will effectively need a
>> 512-entry register file (likely also 8R4W if using 64-bit ports).
>> Probably "more sane" to use multiple smaller register files.
>>
>> This seems a little absurd...
>
> Agreed
>
>>
>> This might require a bigger FPGA...
>
> Oh no...
>
>>
>> And or come up with a more cost-effective way to implement such a
>> register file.
>
> Possibly, not sure myself.
>
>
>>> Then there's the issue of the compiler to deal with, I imagine
>>> progress in VLIW scheduling compiler research has continued on since
>>> Itanium effectively died, but would anyone be motivated enough to
>>> collect the latest advancements and update a compiler just for the
>>> Itanium machines still working out there?
>>>
>>
>> AFAIK, GCC can target IA-64.
>> Not sure how good its code generation is.
>> Apparently the target has been deprecated though.
>>
>
> Always wondered how good GCC would be at generating code for a VLIW CPU.
> I just assumed those so inclined would steal whatever language front-end
> they could find and write the bulk of the VLIW specific compiler from
> scratch.
>
Dunno. I wrote my whole compiler from scratch.
But, given how much it sucks with my own ISA, and what would be needed
for "good" results with IA-64, it would likely be straight up terrible...
>
>>> There are probably far better / smaller / easier VLIW style cores to
>>> study and replicate in a FPGA than Itanium I think.
>>>
>>
>> It seems like I am one of the (relatively few) people doing VLIW on
>> FPGA (at all).
>>
>
> Aside from the odd DSP-core, you might be right.
>
Possibly.
I suspect the sinking of the Itanium did a lot to sour the reputation
of VLIW in general.
Like, Itanium did for VLIW what the Hindenburg did for airships...
> If only Transmeta could have held on a little longer, or did things
> differently, like opening up the internal instruction set so that
> hackers and compiler writers could have targeted more optimal GCC code
> generation to their cores... they might still be around with huge sales
> in Android devices right now, and VLIW research could have got the
> injection of resources it needed to gain the performance needed to stay
> competitive at least.
>
> Although Nvidia isn't exactly setting the world on fire in CPU sales
> either...
>
Yeah, quite possible, it could have been interesting.
Emulation is one of those areas where one is almost invariably going to
take a loss; so if compilers had been able to target the underlying ISA
directly, it could maybe have been more competitive.
It might also have helped not to set themselves up as "compete with
Intel or bust".
>
>> Most of the other people I know of, are doing RISC variants (and/or
>> RISC-V implementations).
>>
>
> RISC is pretty much the text-book common denominator these days.
>
Pretty much.
> I kinda hope VLIW makes a mainstream comeback somehow, the current
> CISC/RISC duopoly doesn't seem particularly healthy for the long term view.
>
> Maybe massive machine learning trained compilers can make a dent in the
> software side of the VLIW equation?
>
Not sure here.
In my case, it is kind of a lot of fiddly work.
I recently got things a little better by fiddling a fair bit with the
logic for shuffling instructions around; it tries to reduce interlocking
and improve the cases for bundling.
Then I ended up needing to add logic to limit how much shuffling it
does, and to hash and cache the results of intermediate comparisons,
mostly because the "more advanced" shuffling cases were starting to
make the process take an unreasonably long time.
Part of the issue seems to be that the shuffling process only takes a
limited window into account at a time (which I expanded from 3 to 7
instructions), but at each point within the window there isn't
necessarily agreement as to which option is lowest cost.
The only real alternative would be to evaluate the entire basic block
for each possible swapping decision.
Though, this only really happens in a minority of cases (most cases
don't have quite so much "instruction mobility").
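In sketch form, the shuffling pass behaves something like the following
(the names and data structures here are made up for illustration, not
the actual BGBCC code): it walks a small window, swaps adjacent ops only
when it is both legal (no register dependence, no possible memory
aliasing, neither op "immovable") and looks cheaper, and memoizes the
pairwise checks so the more aggressive cases don't blow up compile time.

#include <stdint.h>
#include <string.h>

/* Illustrative only; hypothetical names. */
#define WIN_SIZE   7
#define CACHE_SZ   4096

typedef struct {
    uint32_t id;         /* stable index within the block          */
    uint32_t reads;      /* bitmask of registers read              */
    uint32_t writes;     /* bitmask of registers written           */
    int      is_mem;     /* load or store                          */
    int      immovable;  /* jumbo-form / relocation: never moved   */
    int      cost;       /* crude latency estimate                 */
} Op;

typedef struct { uint64_t key; int8_t val; } CacheEnt;
static CacheEnt swap_cache[CACHE_SZ];  /* val: 0=unknown, 1=yes, -1=no */

static int may_swap(const Op *a, const Op *b)
{
    uint64_t key = ((uint64_t)a->id << 32) | b->id;
    uint32_t h = (uint32_t)(key * 2654435761u) & (CACHE_SZ - 1);
    if (swap_cache[h].key == key && swap_cache[h].val)
        return swap_cache[h].val > 0;

    int ok = 1;
    if (a->immovable || b->immovable)
        ok = 0;                                  /* pinned in place     */
    else if ((a->writes & (b->reads | b->writes)) || (b->writes & a->reads))
        ok = 0;                                  /* register dependence */
    else if (a->is_mem && b->is_mem)
        ok = 0;      /* assume memory ops alias; see heuristics below   */

    swap_cache[h].key = key;
    swap_cache[h].val = ok ? 1 : -1;
    return ok;
}

/* One greedy pass over a basic block: within each window, bubble the
   cheaper op earlier when the swap is legal. */
static void shuffle_block(Op *ops, int n)
{
    memset(swap_cache, 0, sizeof(swap_cache));
    for (int i = 0; i + 1 < n; i++) {
        int lim = (i + WIN_SIZE < n) ? i + WIN_SIZE : n;
        for (int j = i; j + 1 < lim; j++) {
            if (ops[j + 1].cost < ops[j].cost &&
                may_swap(&ops[j], &ops[j + 1])) {
                Op t = ops[j]; ops[j] = ops[j + 1]; ops[j + 1] = t;
            }
        }
    }
}

The real pass also has to agree on a cost model across the whole window,
which is where the "no agreement on which option is lowest cost" problem
shows up.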
Some gains here were due to adding heuristics to infer "non-aliasing"
memory accesses, say:
Same base register but non-overlapping displacements;
SP and non-SP in some combinations;
SP and GBR (the stack and globals are never the same memory);
...
But not as much can be done with indexed loads/stores here, since there
is little basis for inferring overlap or non-overlap between these
cases.
Have also made the inference that SP- or GBR-based loads/stores can't
alias with an indexed load/store; while a little hand-wavy, this is
"probably true".
Though, Jumbo-form instructions and instructions with relocations
(typically the same instruction) are classified as "immovable" and thus
can't be moved (though the window is now large enough that it can
shuffle "around" these instructions in many cases).
>
>> Looks like pretty much no one is bothering with soft-core processors
>> for IA-64.
>
> I'm thinking that's probably for the best. ;-)
>
Probably true, after thinking more on it.
>
>>> But then with VLIW the hardware is just half the battle, you still
>>> need to program it so that it runs at near peak performance, which I
>>> take it is the harder part, again YMMV.
>>>
>>
>> Yeah...
>>
>> With my existing ISA, my C compiler gets nowhere near the full speed
>> of what is possible. Can do a little better by writing hand-optimized
>> ASM, but this doesn't scale very well.
>
>
> Seems to me that is the nub of the issue, if your WEX hardware is pretty
> much working as intended then getting decent code generation out of your
> compiler might be the best bang for your buck.
>
Yeah, on the hardware side, WEX is fairly straightforward: the CPU
doesn't need to figure anything out, it just sees the bits and goes.
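Purely as an illustration (the flag bit and its position here are
invented, not the real BJX2 encoding), "see the bits and go" amounts to
something like:

#include <stdint.h>

/* Hypothetical "execute with next" flag; not the real encoding. */
#define WEX_BIT  (1u << 27)

/* How many 32-bit instruction words form the bundle starting at 'iw',
   for a 3-wide machine. No dependence analysis, just flag checks. */
static int bundle_width(const uint32_t *iw)
{
    int n = 1;
    while (n < 3 && (iw[n - 1] & WEX_BIT))
        n++;
    return n;
}

All the hard decisions about what can actually run in parallel were
already made by whoever set (or didn't set) that bit.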
Compiler is a whole different matter though, but recently I have been
making at least a little progress here.
Did get a bit of a speedup in some cases by switching the L2 back to
being direct-mapped.
Doom and ROTT seem to get a little faster.
Quake and Hexen aren't really affected either way.
Got a little bit of a speedup in Heretic, but this was mostly after
recompiling it (it seemed to benefit from some of my more recent
compiler work). Roughly:
Heretic: 8 to 10 fps before the recompile, 12 to 16 fps after;
Hexen: unaffected by the recompile, still mostly stuck at around 8 fps;
ROTT: from around 10 to 12 fps, up to around 16 to 20;
Doom: now semi-consistently pulling off upwards of 20 fps.
Weirdly, the gains from switching the L2 back to DM only seem to really
come into effect "after" recompiling code with the improved shuffling
and bundling. Not sure why this would be.
Though, curiously, despite the better numbers and the theoretical
reduction in the number of interlocks (since the compiler is now
optimizing to avoid them), there would also appear to be a relative
increase in the number of clock cycles spent on interlock stalls (odd).
Then again, it could be due to a reduction in the number of non-bundled
instructions, which reduces the number of instructions sitting around to
"absorb" these clock cycles.
> I'm sure there's still plenty of research papers to be read on the
> subject, and if you happen to invent some new way to efficiently pack
> many operations into a string of wide words, well, fortune and glory and
> that jazz could be yours for the taking...
>
> Or possibly the 8000lb gorilla will stomp on your neck and steal your
> lunch money just like it did to Transmeta...
>
> Too early to tell I guess... :-)
>
If anyone uses my stuff, it is probably a win.
If they want to pay me to keep working on it, that is better...