On 2/19/2017 5:01 PM, jacko wrote:
>> I considered doing a custom 64-bit SH-based ISA, but haven't formally
>> done so yet.
>>
>> so, this is basically using the good ol' SH4 ISA. there is the slightly
>> newer 4A and 2A ISA's, which have some tempting instructions, but it
>> isn't known safe to use them as there may(?) still exist patents which
>> cover them.
>
> a more final version spec
> https://dl.dropboxusercontent.com/u/1615413/Own%20Work/68k2-PC%23d12.pdf
> with only a few added instructions and some new addressing modes.
> There's even a table at the end to show the free instruction space.
>
> Most of the patents come from implementation of instructions.
>
here is a summary spec showing the current 32-bit ISA:
https://github.com/cr88192/bgbtech_shxemu/wiki/SH-ISA
the Hitachi SH-4 spec is also usable, though a few rarely used
SH-4 instructions aren't implemented, and my MMU works differently.
here is a more official spec of the ISA:
http://www.st.com/content/ccc/resource/technical/document/user_manual/69/23/ed/be/9b/ed/44/da/CD00147165.pdf/files/CD00147165.pdf/jcr:content/translations/en.CD00147165.pdf
don't currently have an up-to-date / complete spec for the 64-bit ISA
idea (which basically needs to not conflict with the 32-bit ISA).
I may modify it some, as the old design had 4 modes:
32-bit mode;
a mode with 32-bit addressing and 32/64-bit data ops;
a mode with 64-bit addressing and 16/32-bit data ops;
a mode with 64-bit addressing and 32/64-bit data ops.
I may reduce this down to a simple mode-select, and if 32-bit addressing
is desired, it would be handled more like in X32.
originally, the 4 modes were partly because in an earlier version,
32/64-bit data ops would completely eliminate 16-bit W ops.
if I limited it to a single-bit mode select, the only W loads/stores
available would be:
85m0 MOV.W @Rm, R0
81n0 MOV.W R0, @Rn
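as a rough sketch of what recognizing those two forms could look like in a
decoder (C here; the struct and function names are made up for illustration
and are not taken from the emulator). it relies on the standard SH-4
encodings MOV.W @(disp,Rm),R0 = 85md and MOV.W R0,@(disp,Rn) = 81nd, with
only the disp=0 case accepted:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical decoder helper; names are illustrative only.
 * SH-4 encodes MOV.W @(disp,Rm),R0 as 1000 0101 mmmm dddd and
 * MOV.W R0,@(disp,Rn) as 1000 0001 nnnn dddd. With disp == 0 these
 * are the 85m0 / 81n0 forms listed above. */
typedef struct {
    bool is_store;   /* true for 81n0 (R0 -> memory), false for 85m0 */
    unsigned reg;    /* Rm for the load, Rn for the store */
} WordMove;

/* Returns true and fills *out if op is one of the two disp==0 word forms. */
static bool decode_word_move(uint16_t op, WordMove *out)
{
    if ((op & 0x000F) != 0)
        return false;                 /* only the zero-displacement form */

    switch (op & 0xFF00) {
    case 0x8500: out->is_store = false; break;   /* MOV.W @Rm, R0 */
    case 0x8100: out->is_store = true;  break;   /* MOV.W R0, @Rn */
    default:     return false;
    }
    out->reg = (op >> 4) & 0xF;
    return true;
}
```

for example, 0x8530 would decode as a load through R3, and 0x8170 as a
store through R7.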
>> also, I wouldn't have a ready-made compiler supporting a custom ISA; I
>> would either need to modify GCC to do so (ick!), or write a dedicated C
>> compiler/backend (its own set of issues).
>
> This is a big plus to only need a small mod.
>
taking a 32-bit ISA and making it 64-bit is an easy feature for
emulators and assemblers, but C compilers deal with it much less gracefully.
though, in this case, the C ABI would be mostly unchanged (apart from
being 64-bit), which is very much unlike the case for x86 vs x86-64
(where we now have 3 somewhat different ABIs).
>> unlike the SH5 ISA (a completely new ISA for 64-bit mode), my ISA would
>> have more worked like in x86-64, quietly expanding the registers to 64
>> bits, and allowing many existing opcodes to function in various widths.
>
> I used the size = 11 option and removed many 68020+ instructions, and
> relocated a lot.
>
SH-4 had instructions for Byte/Word/Long.
I needed Byte/Word/Long/Quad, but didn't want extensive modification to
the existing ISA, or to require significantly more opcodes.
>> most 32-bit ops remained 32-bits, but many 16-bit ops could be promoted
>> to 64-bits (with a few 32-bit ops promoting, and certain redundant
>> special cases remaining as 16-bit forms).
>
> I used some of the 8 bit instructions as 64, and the 8 bit ones in
> more complex patterns as they are mainly IO rare, and often maskable in
> 16 bit.
>
>> this would partially penalize using 16-bit types in 64-bit mode, which
>> would generally require less efficient instruction sequences.
>
> Yes, this would be a problem with an addressing mode which is capable
> of using a 64 bit register as 4 * 16 bit registers.
>
no direct analog here.
the issue here is mostly with loads/stores, where in the simple case,
one has:
MOV.W @Rm, R0
MOV.W R0, @Rn
rather than:
MOV.W @Rm, Rn
MOV.W Rm, @Rn
MOV.W @(Rm, disp), R0
MOV.W R0, @(Rn, disp)
...
this means a displacement+word load would require something like:
MOV R3, R8
ADD #6, R8
MOV.W @R8, R0
rather than:
MOV.W @(R3,6), R0
...
the word case is already penalized vs the long case, which has:
MOV.L @(Rm, disp), Rn
though, could use a similar hack to the word case and be like:
5nm0 MOV.W @Rm, Rn
given:
5nm0 MOV.L @(Rm, 0), Rn
is functionally equivalent to:
6nm2 MOV.L @Rm, Rn
previously, this I-form was spec'ed as a way to allow 64-bit loads when
dealing mostly with 16-bit data; but it is a tradeoff vs making all this
depend on context bits (like it is with the FPU instructions).
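to illustrate why the 5nm0 slot is redundant (and so could be reassigned),
here is a small C sketch of the two standard SH-4 encodings and their
effective-address calculation. the helper names are hypothetical, and the
register file is just a plain array for the sake of the example:

```c
#include <stdint.h>

/* Encodings as given in the SH-4 manual:
 *   MOV.L @(disp,Rm),Rn = 0101 nnnn mmmm dddd  (disp is scaled by 4)
 *   MOV.L @Rm,Rn        = 0110 nnnn mmmm 0010
 * Function names below are illustrative, not from the emulator. */
static uint16_t enc_movl_disp(unsigned n, unsigned m, unsigned disp4)
{
    return (uint16_t)(0x5000 | ((n & 0xF) << 8) | ((m & 0xF) << 4) | (disp4 & 0xF));
}

static uint16_t enc_movl_ind(unsigned n, unsigned m)
{
    return (uint16_t)(0x6002 | ((n & 0xF) << 8) | ((m & 0xF) << 4));
}

/* Effective address of either form; regs is a stand-in register file. */
static uint32_t movl_ea(uint16_t op, const uint32_t regs[16])
{
    uint32_t base = regs[(op >> 4) & 0xF];
    if ((op & 0xF000) == 0x5000)
        return base + 4u * (op & 0xF);   /* 5nmd: Rm + disp*4 */
    return base;                         /* 6nm2: Rm */
}
```

with disp=0 both forms read from the same address, so giving 5nm0 a
different meaning loses nothing that 6nm2 doesn't already provide.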
>> also how the FPU worked was tweaked a bit, the FPU now having 16 double
>> registers (vs 16x float or 8x split-double).
>
> There is no hard spec on the length of the FP registers.
>
in SH4, they were fixed at 32-bits, and dealing with 64-bit doubles
effectively worked with them split in-half across a pair of registers;
similar to the EDX:EAX system in x86.
in the emulator, because the halves are backwards of the useful order,
this generally means that double operations involve loading like:
MOV RAX, [...]
ROL RAX, 32
MOVQ XMMn, RAX
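the same load-and-swap can be written in C roughly as below. this is a
sketch under two assumptions: the host is little-endian (as x86-64 is),
and the emulated context stores the FR registers as consecutive 32-bit
words with the even register holding the high half of the double. the
function name is made up:

```c
#include <stdint.h>
#include <string.h>

/* Assumed layout (not confirmed against the emulator source):
 * fr[2k] holds the high 32 bits of double k, fr[2k+1] the low 32 bits.
 * Assumes a little-endian host. */
static double load_double_pair(const uint32_t *fr, unsigned k)
{
    uint64_t raw;
    double d;

    memcpy(&raw, &fr[2 * k], sizeof raw);   /* MOV RAX, [...]   */
    raw = (raw << 32) | (raw >> 32);        /* ROL RAX, 32      */
    memcpy(&d, &raw, sizeof d);             /* MOVQ XMMn, RAX   */
    return d;
}
```

a plain 64-bit load would put the high half in the low bits, hence the
rotate before reinterpreting the bits as a double.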
GCC is helpful here in that it seems to treat "double" more as a
guideline than as an actual requirement, so much of the time, even if
"double" is given to the compiler, it still emits code working on 32-bit
float values...
it seems to have a sort of cleverness where, if it sees that the end
result is truncated to float, any of the calculations which lead to it
are truncated to float as well.
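a minimal C illustration of the kind of case I mean (function names are
made up). for a single multiply of two float inputs, rounding the double
result to float gives the same value as doing the multiply in float, so a
compiler is allowed to narrow it; whether GCC does so for a given
expression and set of flags is something to check in the generated code
rather than assume:

```c
/* Written with double arithmetic, but the result is truncated to float. */
static float scale_via_double(float a, float b)
{
    double t = (double)a * (double)b;
    return (float)t;
}

/* The same computation done directly in single precision. */
static float scale_via_float(float a, float b)
{
    return a * b;
}
```

the product of two 24-bit float mantissas fits exactly in a double, so
both versions round the same exact value to float once.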
>> unclear would be if this were the most sane way to do this.
>>
>> as-before, instructions were still fixed-width 16-bit (it was otherwise
>> still basically the SH4 ISA).
> <snip>
>> super-ops aren't really great for raw MIPS counts, as they are currently
>> counted as fewer instructions by the emulator (1), but are generally
>> useful for improving performance.
> <snip>
>> also I seem to now be often hitting 60 FPS in Quake 1 in 640x480, so...
>>
> <snip>
>> also debating whether to use the real hardware GPU (via OpenGL), or
>> going the "technically simpler in this case" route of using a software
>> rasterizer (makes most sense if using precooked screen-space triangles).
>
> OpenGLES2 is the most popular "on the market"
>
though, OTOH, a real/modern GPU is rather unlikely to be available if an
implementation were done with an FPGA. dealing with potentially a
software-emulated GPU, or something more along the lines of the S3 Virge
(say: limited fixed-function triangle drawing, *), seems more viable.
*: though the actual Virge seems to do polygon drawing, and apparently
is fed a single primitive at a time; I was more thinking of triangles
fed in via a queue.
I am left wondering how little I can get away with and still have Quake3
be able to work ok (I did a software-rendered OpenGL before, but this time
I am considering something a little more limited).
GLES2, though, is not likely to work well with a Virge-like design; it
would likely need either SW emulation or a GPU a bit more advanced
than can likely be shoved into an FPGA (alongside a few CPU cores and
some IO peripherals).
like, even if the GPU is effectively just a multicore processor, making
typical fragment shader code perform well is itself non-trivial (more so
if the GPU cores are too small to afford things like an FPU and SIMD,
say, if SH2 cores were used as a GPU).
sadly, the emulator is currently itself a bit faster than what I could
generally expect from a possible FPGA version.
like, right now (after some more JIT optimization, 1), experimentally I
am getting ~ 410-460 MIPS from the emulator (for a single thread), and
can effectively get ~1.6k MIPS with 4 threads.
1: added a basic register allocator, which per-trace may map SH4
registers onto x86-64 registers (currently R12D..R14D, EBX, and ESI).
tested using EBP for register allocation as well, but for some
unexplained reason, performance fell off a cliff while doing so.
the register allocator gained an average of about 50 MIPS or so (vs
always using loads/stores to the emulated CPU context).
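as a rough idea of what a per-trace allocator of this sort can look like,
here is a C sketch that simply gives the host slots to the most-used guest
registers in a trace. this is an assumed policy for illustration only, not
necessarily what the emulator does; the names and the "uses" list format
are hypothetical, and the 5 slots stand for R12D..R14D, EBX, and ESI:

```c
#include <stdint.h>

#define NUM_GUEST_REGS 16
#define NUM_HOST_SLOTS 5    /* stands for R12D..R14D, EBX, ESI */

/* Hypothetical allocator: uses[] lists the guest register numbers touched
 * by a trace, in order. On return, map[r] is a host slot index
 * (0..NUM_HOST_SLOTS-1) or -1 if guest register r stays in the in-memory
 * CPU context. Slots go to the most frequently used registers first;
 * ties go to the lower-numbered register. */
static void alloc_trace_regs(const uint8_t *uses, int n_uses,
                             int8_t map[NUM_GUEST_REGS])
{
    int count[NUM_GUEST_REGS] = {0};
    int i, r, slot;

    for (i = 0; i < n_uses; i++)
        count[uses[i] & 0xF]++;
    for (r = 0; r < NUM_GUEST_REGS; r++)
        map[r] = -1;

    for (slot = 0; slot < NUM_HOST_SLOTS; slot++) {
        int best = -1;
        for (r = 0; r < NUM_GUEST_REGS; r++) {
            if (map[r] < 0 && count[r] > 0 &&
                (best < 0 || count[r] > count[best]))
                best = r;
        }
        if (best < 0)
            break;                    /* no more registers used by the trace */
        map[best] = (int8_t)slot;
    }
}
```

anything left at -1 would keep going through loads/stores to the emulated
CPU context, as before.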