Hi Bobby,
About LLVM: yes, requesting compilation from it is very slow. I'd have to use interpretation as the first tier, and then have the code switch to LLVM once a block has been executed a certain number of times, a bit like Java with Sun's HotSpot compiler.
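To make the tiering idea concrete, here is a rough sketch in C of how a block could count its executions and switch tiers. Everything in it is hypothetical: block_info, run_block, HOT_THRESHOLD and the demo_* stand-ins are names I made up for illustration, and the real compile step would be a call into LLVM rather than the stub shown.

```c
#include <stddef.h>
#include <stdint.h>

#define HOT_THRESHOLD 1000u  /* hypothetical: executions before a block counts as hot */

typedef void (*native_fn)(void);
typedef void (*interp_fn)(uint32_t guest_pc);
typedef native_fn (*compile_fn)(uint32_t guest_pc);

/* Hypothetical per-block bookkeeping. */
typedef struct {
    uint32_t exec_count;  /* times the block was entered while still interpreted */
    native_fn native;     /* NULL until the slow compiler has produced code */
} block_info;

/* Run one block: interpret while it is cold, compile once it becomes hot. */
static void run_block(block_info *b, uint32_t pc,
                      interp_fn interpret, compile_fn compile)
{
    if (b->native == NULL && ++b->exec_count >= HOT_THRESHOLD)
        b->native = compile(pc);  /* if this returns NULL, it is retried next time */

    if (b->native != NULL)
        b->native();
    else
        interpret(pc);
}

/* Stand-ins so the sketch can be exercised without a real compiler. */
static unsigned demo_interp_runs, demo_native_runs, demo_compiles;
static void demo_native(void) { demo_native_runs++; }
static void demo_interpret(uint32_t pc) { (void)pc; demo_interp_runs++; }
static native_fn demo_compile(uint32_t pc) { (void)pc; demo_compiles++; return demo_native; }
```

With these stand-ins, the first HOT_THRESHOLD - 1 calls to run_block go through demo_interpret, and the call that reaches the threshold compiles once and runs native code from then on.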
As for the core, here are some of my observations, in no particular order:
1. The Hacktarux JIT depends on the Cached Interpreter's behavior and structures, and adds things that are not needed when using only the Cached Interpreter. For example, there's a structure for each instruction in the Cached Interpreter [1] that is either 98 bytes [2] or 156 bytes [3]. That memory is only required by the Hacktarux JIT and is wasted otherwise, and in an array of the resulting structure, most elements are not aligned to a cache line. Modifying the signature or behavior of C functions whose calls are baked into native code generated by the Hacktarux JIT (memory accessors, gen_interupt, etc.) breaks it.
2. The Hacktarux JIT does not use blocks very well. If there's a jump to an opcode that hasn't yet been recompiled, the entire page is recompiled, in a way that any of the opcodes may then be the target of a jump. So if the native code used for the target of the jump has registers allocated, the "jump_wrapper" is executed first to load those [5]. Additionally, there can be no optimisation of runs of opcodes, due to the need to be able to jump to any of them later. (Say you jump to 0x8003_2C4C and the code is [LUI $4, 0x8020; ORI $4, $4, 0xC140; LW $4, 0($4)]. The constant memory reference cannot be optimised, because other code could jump to 0x8003_2C54, skipping the LUI+ORI.)
Forming different code blocks according to which instruction is jumped-to would make more efficient code and get rid of the register-loading jump stubs, because then each block would start off with nothing allocated and load exactly what it needs. That would also allow constant propagation for things like the code above, [LUI $4, 0x8020; ORI $4, $4, 0xC140; LW $4, 0($4)], which is really a reference to the constant address 0x8020_C140 in the N64 address space and can be turned directly into a load from *(uint32_t*) ((uint8_t*) rdram + 0x20C140). Those are pretty common references in N64 games. If the code is jumped-to at the third instruction, the LW, then a separate block that assumes no known value in $4 would be made.
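As a rough illustration of the constant propagation described above, here is a sketch in C of a per-block tracker that a recompiler could consult while translating. All names (const_state, cs_lui, cs_ori, cs_lw_rdram_offset, cs_invalidate) and the rdram_size parameter are hypothetical and not part of the core. It only tracks the low 32 bits of each register and assumes that only KSEG0/KSEG1 addresses (0x8000_0000 to 0xBFFF_FFFF, which bypass the TLB) are worth folding.

```c
#include <stdint.h>

/* Hypothetical per-block tracker: which guest registers hold known constants. */
typedef struct {
    uint32_t known_mask;  /* bit n set => value[n] is valid */
    uint32_t value[32];   /* low 32 bits of each known register */
} const_state;

/* Start of a block: nothing is known except $0, which is always 0. */
static void cs_reset(const_state *cs)
{
    cs->known_mask = 1u;
    cs->value[0] = 0;
}

/* Forget a register, e.g. after it receives a value loaded from memory. */
static void cs_invalidate(const_state *cs, int rt)
{
    if (rt != 0)
        cs->known_mask &= ~(1u << rt);
}

/* LUI rt, imm: rt becomes the constant imm << 16. */
static void cs_lui(const_state *cs, int rt, uint16_t imm)
{
    if (rt == 0)
        return;
    cs->value[rt] = (uint32_t)imm << 16;
    cs->known_mask |= 1u << rt;
}

/* ORI rt, rs, imm: constant only if rs is; the immediate is zero-extended. */
static void cs_ori(const_state *cs, int rt, int rs, uint16_t imm)
{
    if (rt == 0)
        return;
    if (cs->known_mask & (1u << rs)) {
        cs->value[rt] = cs->value[rs] | imm;
        cs->known_mask |= 1u << rt;
    } else {
        cs_invalidate(cs, rt);
    }
}

/* LW rt, imm(base): returns 1 and sets *offset when the address is a known,
 * word-aligned KSEG0/KSEG1 address that falls inside RDRAM. */
static int cs_lw_rdram_offset(const const_state *cs, int base, int16_t imm,
                              uint32_t rdram_size, uint32_t *offset)
{
    uint32_t addr, phys;

    if (!(cs->known_mask & (1u << base)))
        return 0;
    addr = cs->value[base] + (uint32_t)(int32_t)imm;  /* offset is sign-extended */
    if ((addr & 0xC0000000u) != 0x80000000u)
        return 0;
    phys = addr & 0x1FFFFFFFu;
    if ((phys & 3u) != 0 || phys >= rdram_size)
        return 0;
    *offset = phys;
    return 1;
}
```

For the block entered at the LUI, feeding it LUI $4, 0x8020 and ORI $4, $4, 0xC140 makes the LW resolve to RDRAM offset 0x20C140. For the block entered at the LW, the state is fresh from cs_reset, $4 is unknown, and the function returns 0, so the generic memory accessor would be emitted instead.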
3. Use of global variables. All memory accessors work by reading the value of 'address'. Read accessors (LB, LH, LW...) then read the value of 'rdword' and store through that pointer; write accessors (SB, SH, SW...) then read the value of 'cpu_byte', 'hword', 'word' or 'dword'. The caller has stored those values in memory, and the memory accessor must reload them from memory. Since this happens millions of times per second, performance would be better if the values could simply be passed as function parameters, which go into registers in most ABIs. But this would break the Hacktarux JIT. (Not the New Dynarec, because it has its own memory accessors.)
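To show the difference in calling style, here is a simplified sketch in C. The globals 'address', 'rdword' and 'word' mirror the names mentioned above, but their exact types here are my assumption, and sketch_ram, load_u32, store_u32 and the *_param functions are hypothetical stand-ins rather than the core's actual accessors. Byte order and memory-mapped I/O are ignored.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for guest memory: 4 KiB, word accesses only. */
static uint8_t sketch_ram[0x1000];

static uint32_t load_u32(uint32_t addr)
{
    uint32_t v;
    /* The mask keeps the access word-aligned and inside sketch_ram. */
    memcpy(&v, &sketch_ram[addr & (sizeof sketch_ram - 4)], sizeof v);
    return v;
}

static void store_u32(uint32_t addr, uint32_t v)
{
    memcpy(&sketch_ram[addr & (sizeof sketch_ram - 4)], &v, sizeof v);
}

/* Global-variable style: the caller stores its operands to memory first. */
static uint32_t address;   /* guest address to access */
static uint64_t *rdword;   /* where a read accessor stores its result */
static uint32_t word;      /* value for a 32-bit write accessor */

static void read_word_via_globals(void)  { *rdword = load_u32(address); }
static void write_word_via_globals(void) { store_u32(address, word); }

/* Parameter style: operands and result can stay in registers. */
static uint32_t read_word_param(uint32_t addr)              { return load_u32(addr); }
static void     write_word_param(uint32_t addr, uint32_t v) { store_u32(addr, v); }
```

In the first style, a caller has to write 'address' (and 'word' or 'rdword') before the call, and the accessor reloads them. In the second, the same information travels in argument registers under most ABIs, which is the saving described above.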
4. Empty stubs are required for the Hacktarux JIT. empty_dynarec.c must be compiled and linked in even when its functions would be unused (i.e. !DYNAREC or NEW_DYNAREC [4]). I'm sure there's a technical reason, though I'm not sure what it is.
5. The Pure Interpreter depends on the Cached Interpreter. The Pure Interpreter asks the Cached Interpreter to prefetch 2 opcodes, the one at PC and its possible delay slot. This time, only 2 precomp_instr structures are used so it doesn't waste memory, and the structures are likely to be always in cache, but the Pure Interpreter is not self-contained.
Keeping all the memory-related code, the Coprocessor 0 and FPU opcodes, the TLB, exceptions and interrupts in their separate files is a nice touch; it's just the various interpreter and JIT drivers that all go back to the Cached Interpreter. Compatibility with the Hacktarux JIT may also be holding the core back.
Regards,
Neb.
[1] https://github.com/mupen64plus/mupen64plus-core/blob/fe84dea/src/r4300/recomp.h#L69-L70
[2] https://github.com/mupen64plus/mupen64plus-core/blob/fe84dea/src/r4300/x86/assemble_struct.h
[3] https://github.com/mupen64plus/mupen64plus-core/blob/fe84dea/src/r4300/x86_64/assemble_struct.h
[4] https://github.com/mupen64plus/mupen64plus-core/blob/9e5e1da/projects/unix/Makefile#L453-L490
[5] https://github.com/mupen64plus/mupen64plus-core/blob/fe84dea/src/r4300/x86/rjump.c#L57-L60