The descriptions of the mechanisms I could find made it sound like it
would get stuck in its elongated state until some sort of "NO-OP" passes
through and it would contract back down.
Though, granted, the shortcut mechanisms in my ring-bus have a similar
issue. The shortcuts would not work under high levels of bus activity,
but as-is it is pretty rare for much more than a few messages to be on
the ring at any given time (unlike the execute pipeline, which would be
fairly dense much of the time).
>>>>
>>>> I still consider the non-stalling L1 I$ to be experimental:
>>>> It appears to change behaviors in some cases (*1);
>>>> If the I$ deadlocks, the pipelines' deadlock-detect doesn't trigger.
>>> <
>>> FETCH/DECODE gets stalled when::
>>> a) you don't know what address to fetch from (indirects)
>>> b) you don't have an instruction from the address you fetched from.
>>> c) you are waiting for the instructions to arrive (miss and pipe delays)
> <
>> I can avoid stalling the rest of the pipeline by inserting a stream
>> of zero-length NOPs, which more or less works as expected.
> <
> At some cost in power........
Yeah, Vivado seems to imply this as well.
>>
>> However, it seems the differences are not entirely invisible to code
>> running on the CPU core (mostly in the context of interrupt handling).
>>
>> The other difference was that there was some logic to detect if the
>> pipeline is deadlocked (say, if the 'hold' signal is active for 64k
>> cycles), which then dumps the state of the pipeline via $display statements.
> <
> Why would the pipeline deadlock? It is either cruising along or stalled
> waiting on something that WILL happen.
Usually, either:
There was a bug somewhere which caused the pipeline to stall indefinitely;
Something went wrong (such as a sanity check failing) and code triggered
a breakpoint.
In these cases, the state of the pipeline at the moment the stall
occurred can help give clues as to what had gone wrong (such as a
misbehaving instruction, etc).
The NOP padding logic in effect has the drawback that it interferes with
my debugging aid.
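As a rough model of that deadlock-detect (the 64K threshold is from the above; the rest is made-up C standing in for what would be a counter plus $display dumps in Verilog):

```c
#include <stdint.h>

/* Stall watchdog model: count consecutive cycles with 'hold' asserted,
 * reset whenever the pipeline makes progress, trip at 64K cycles (at
 * which point the real thing dumps pipeline state via $display). */
typedef struct { uint32_t stall_count; int tripped; } StallWatchdog;

static void watchdog_tick(StallWatchdog *wd, int hold) {
    if (!hold) {
        wd->stall_count = 0;   /* progress: reset the count */
        return;
    }
    if (++wd->stall_count >= 65536)
        wd->tripped = 1;       /* deadlock-detect fires */
}
```

Padding with NOPs keeps 'hold' from staying asserted for long runs, which is presumably why it defeats this.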
Yeah, it is possible. I could treat LR like a normal register update,
but would have to rework the BSR mechanism (likely by making it decode
BSR in BJX2 into a similar form used by JAL in RISC-V mode; and updating
LR via a register store rather than via a side-channel).
Could be.
I just sorta widened out the TLB, L1 caches, and bus interfaces
directly. Pretty much everything else remains 48-bits (so, eg, the PC
advance, branch predictor, ... have no need to be aware of XMOV).
The implementation for TLB was, basically:
Do some hackery to use two adjacent TLB entries like a giant TLB entry;
Do some hackery to LDTLB so that the entries end up adjacent;
...
This effectively cuts the TLB from 4-way to 2-way when working with
96-bit addresses, but the alternative would have doubled the Block-RAM
needed for the TLB (the main alternative having been to widen the TLB
entries to 256 bits, and ignore the high address bits if XMOV isn't
being used).
Though, this other option would make more sense if XMOV ends up being
used as much more than a gimmick (as 2-way has a much higher TLB miss
rate than 4-way).
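FWIW, the pairing can be modeled in C roughly like this (purely illustrative: the tag handling, valid bits, and fixed even/odd pairing rule here are assumptions, not the actual TLB format):

```c
#include <stdint.h>

/* Illustrative model of pairing adjacent TLB ways for wide addresses:
 * a 4-way set of entries with 48-bit VA tags. A 96-bit lookup treats
 * ways (0,1) and (2,3) as two double-wide entries holding the low and
 * high halves of the tag, so associativity is effectively halved. */
typedef struct { uint64_t tag48; int valid; } TlbWay;
typedef struct { TlbWay way[4]; } TlbSet;

static int lookup48(const TlbSet *s, uint64_t va48) {
    for (int i = 0; i < 4; i++)            /* 4 candidate ways */
        if (s->way[i].valid && s->way[i].tag48 == va48)
            return i;
    return -1;
}

static int lookup96(const TlbSet *s, uint64_t lo48, uint64_t hi48) {
    for (int i = 0; i < 4; i += 2)         /* only 2 candidate pairs */
        if (s->way[i].valid && s->way[i + 1].valid &&
            s->way[i].tag48 == lo48 && s->way[i + 1].tag48 == hi48)
            return i;
    return -1;
}
```

The LDTLB-side hackery then just has to guarantee the two halves land in an adjacent pair.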
>>>>
>>>> For related funkiness, it is likely that things like kernel-related
>>>> address ranges would need to be mapped into multiple quadrants (as
>>>> opposed to being able to treat it like a 96-bit linear address space in
>>>> this sense).
>>> <
>>> Why did you go to 96-bits? What does this abstraction buy that 64-bits
>>> does not suffice ?
>> Partly because:
>> I already had 48 bit pointers;
>> I could glue two 48 bit pointers together;
>> 48+48=96.
> <
> Yes, but what do you really gain if the physical address space remains 48-bits ?
I don't need more than 48-bits for physical addresses for now.
It isn't like we have *anywhere near* the 256TB limit in any available
FPGA boards, nor connected up to a single processor.
Actually, with the boards I have access to, even 48 bits is overkill;
could have still gotten along OK with the 29-bit physical address scheme
used by the SH4 with the current FPGA boards.
Though, granted, even the Arty board has enough RAM to push SH4's scheme
to its limit.
But, large virtual address spaces have other use-cases.
>>
>> This also leaves some bits left over for things like type-tags and
>> handling Java-style bounded arrays.
> <
> Is there something wrong with a Dope Vector to haul arrays around ?
This can be a fallback case for when the array bounds can't be
shoehorned into the pointer.
But, storing things like array bounds in memory has a few drawbacks:
Need to always provide memory for the descriptor;
Array accesses tend to need multiple memory accesses;
...
Though, granted, accessing such an array using tag-refs does require
logic in the handler:
Are we using the array in the inline format?
If yes, check bounds directly, do access.
If no, fetch bounds from memory, access array indirectly.
This can be faster/cheaper for smaller arrays with a 0-based index, but
imposes a penalty for larger arrays or for non-zero-based arrays.
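In C terms, the handler logic sketched above amounts to roughly the following (the tag values and bit layout are invented for illustration: assume a 4-bit type tag in [63:60], a 12-bit inline bound in [59:48], and the address in [47:0]):

```c
#include <stdint.h>

/* Illustrative sketch of the tag-ref array-load dispatch. */
#define TAG_INLINE_INT_ARR 0x1  /* hypothetical tag: inline int[] */
#define TAG_BOXED_INT_ARR  0x2  /* hypothetical tag: descriptor int[] */

typedef struct { int64_t lo, hi; int32_t *data; } DopeVector;

static int load_index_si(uint64_t ref, int64_t idx, int32_t *out) {
    int tag = (int)(ref >> 60);
    uint64_t addr = ref & 0xFFFFFFFFFFFFull;
    if (tag == TAG_INLINE_INT_ARR) {        /* inline format */
        int64_t bound = (int64_t)((ref >> 48) & 0xFFF);
        if ((uint64_t)idx >= (uint64_t)bound)
            return -1;                      /* bounds-check failed */
        *out = ((int32_t *)(uintptr_t)addr)[idx]; /* one memory access */
        return 0;
    }
    if (tag == TAG_BOXED_INT_ARR) {         /* fetch bounds from memory */
        DopeVector *dv = (DopeVector *)(uintptr_t)addr;
        if (idx < dv->lo || idx > dv->hi)
            return -1;
        *out = dv->data[idx - dv->lo];      /* extra memory accesses */
        return 0;
    }
    return -1;                              /* not an int[]: bail */
}
```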
One can debate though whether it would be faster to implement "int[]"
as, say:
Call runtime handler, which does tag-check handling and special cases;
Always do memory accesses to do bounds-checks
Inline call to "throw" logic for bounds-check failures.
At present, I went with the former, using handlers with names like
"__lvo_loadindex_si" and similar, which are generally written in ASM.
In most of these cases, the handler is specialized for the index type of
the array (stuff is assumed to "have gone terribly wrong" if, say, an
"object[]" array is passed where an "int[]" was expected, *1).
*1: Unlike normal C types, the tagged type-system doesn't generally
allow one to freely cast whatever to whatever else (it instead does
runtime checks).
Another (slower) mechanism is to use __variant, which first needs to
determine that it is looking at an array, what the base type of the
array is, ... all of which is stuff we already know with an "int[]" type.
It is possible there could be a "do bounds-checked load/store else
branch" instruction, but this would only seem justified if I were
intending to primarily run languages like Java or C# (rather than C).
It could benefit my BS2 language, which also uses Java-style arrays, but
thus far BS2 hasn't really been used much. This feature wouldn't really
help much with JavaScript or Python (the needed mechanism would be
considerably more complicated, *2).
Say:
MOVBC.L (Rm, Ri), Rn, Label
Checks that Rm represents an inline 0 based array:
If not, Branch
Checks that Ri is in the bounds encoded in Rm:
If not, Branch
Else, Do a Memory Load (no Branch)
The branched-to label would call the runtime handler.
While, arguably, it could be more efficient to have a "Do operation and
branch, else NOP" instruction (it would save needing ~2 unconditional
branches per array operation), the drawback is that it would require the
core to simultaneously both initiate a branch and access memory.
Eg:
MOVABC.L (R8, R9), R10, .L0
BRA .L1
.L0:
/* Slow path. */
MOV R8, R4 | MOV R9, R5
BSR __lvo_loadindex_si
MOV R2, R10
.L1:
*2: Open question being if it would be worthwhile for someone to try to
design a hardware level ISA intended to try to make languages like
Python run efficiently...
>>
>> So, say, something like:
>> int[] arr = new int[100000000];
>> Can encode the 'int[]' array type and bounds directly into the pointer.
> <
> What if you got ::
> int[] arr = new int[-10000:99990000];
> <
> This is where Dope Vectors come in.
Granted. The current scheme (with 48-bit addresses) puts a serious limit
onto the sizes of array that can be expressed via the inline format, and
inline bounds only work with 0-based arrays.
Eg:
int[] arr1=new int[1024]; //OK, can use inline format
int[] arr2=new int[8192]; //Not OK, needs to use in-memory descriptor.
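As a quick predicate, assuming a 12-bit bound field (a width merely consistent with the 1024-OK/8192-not example above, not necessarily the real one):

```c
/* Hypothetical check: can an array use the inline-bounds pointer
 * format? Only 0-based arrays whose element count fits the assumed
 * bound field qualify; anything else falls back to a descriptor. */
static int fits_inline(long base, long count, int bound_bits) {
    if (base != 0)
        return 0;                  /* non-zero-based: descriptor only */
    return count >= 0 && count < (1L << bound_bits);
}
```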
With a 96-bit pointer, one can conceivably extend the array bounds up to
28 bits or so.
I am also using the extra bits from the page-tables and TLB entries for
the VUGID system.
But, yeah, page-table entry layout is generally:
[63:48]: ACLID (or VUGID)
[47:12]: Physical Page Address
[11: 0]: Page Flag Bits
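Decoding that layout is then simple bit-slicing, e.g.:

```c
#include <stdint.h>

/* Field extraction for the 64-bit PTE layout above:
 * [63:48] ACLID (or VUGID), [47:12] physical page address (extracted
 * here as a page number), [11:0] page flag bits. */
static uint32_t pte_aclid(uint64_t pte) { return (uint32_t)(pte >> 48); }
static uint64_t pte_ppn(uint64_t pte)   { return (pte >> 12) & 0xFFFFFFFFFull; }
static uint32_t pte_flags(uint64_t pte) { return (uint32_t)(pte & 0xFFF); }
```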
However, the address space widens to 96 bits in the L1 caches and TLBs,
so in effect, it gives a logical 96-bit virtual address space, with a
48-bit physical space.
However, to map such large virtual addresses, one may still need a
deeply nested page-table.
The parts that remain 48 bits are mostly:
The AGU;
The PC advance;
The branch predictor;
...
These parts operate on 48-bit addresses, as before, and then the
"quadrant address" is glued on before submitting the result to the L1
caches.
For XMOV ops, the quadrant address comes from the adjacent paired register:
[127:112], More Tag Bits
[111: 64], Quadrant Address
[ 63: 48], Type Tag Bits (Mostly ignored, *1)
[ 47: 0], Base Address (Within Quadrant)
The low part behaves as it would for normal Load/Store ops.
Whereas, for normal MOV.x, the high-order bits come from GBH.
So, say:
VA[95:48] = GBH[47:0]
VA[47: 0] = AGU[47:0]
if(XMOV)
  VA[95:48] = XRm[47:0]
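Or, the same selection as a slightly more explicit C model (the struct split is just for illustration):

```c
#include <stdint.h>

typedef struct { uint64_t hi48, lo48; } Va96; /* VA[95:48], VA[47:0] */

/* 96-bit VA composition as described above: the low 48 bits always
 * come from the AGU; the high 48 bits come from GBH for normal MOV.x,
 * or from the paired register (XRm) for XMOV ops. */
static Va96 compose_va(uint64_t agu48, uint64_t gbh48,
                       uint64_t xrm48, int is_xmov) {
    Va96 va;
    va.lo48 = agu48 & 0xFFFFFFFFFFFFull;
    va.hi48 = (is_xmov ? xrm48 : gbh48) & 0xFFFFFFFFFFFFull;
    return va;
}
```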
*1: Though one possibility is:
If in "Quadrant Add" mode, and:
Bits "[63:61] are 110"
Then:
VA[95:64]=GBH[47:16]
VA[63:48]=GBH[15: 0] +
{ 2'b0, Rm[60:47] } +
{ 15'h0, AGU[47] ^ Rm[47] }; //Carry
VA[ 47]=0 // Force userland VA
VA[46: 0]=AGU[46: 0]
Pro:
Backwards compatible with existing code;
Allows for 2 or 8 exabytes (pseudo-linear);
Less nasty than some other variants.
Cons:
Ugly hack;
Bakes part of the type-tag system into the ISA.
Misc:
Would not allow a misaligned access across a 128TB boundary.
>>
>> Making the page-table pages bigger seems to increase the effective space
>> requirements for the page table much faster than it reduces the number
>> of page-table levels.
> <
> This is where level skipping comes in. But this is ALSO why you don't make
> the pages so big. I would have ended up with 6-level tables had I used 4Kb
> pages, this is reduced to 5-levels with 8Kb tables. The 4Kb version used
> {4Kb, 2Mb, 1Gb, ½Tb, ¼Eb,..} leaving the top table (page) covering only ¼
> of this actual space. The 8Kb version used {8Kb, 8Mb, 8Gb, 8Tb, 8Eb,..}
> and uses ½ of the top level table actual space. But level skipping at Root
makes most of these "go away" most of the time. Level skipping in the
> middle increases the table density.
Yeah.
I looked into it, and making the page table pages bigger than the basic
page size is likely to hurt a lot more than it gains.
My MMU uses base page-sizes of 4K, 16K, and 64K.
For TestKern, I have mostly been going with a page size of 16K.
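FWIW, the level-count arithmetic can be sanity checked with a quick formula (assuming radix-style tables of 8-byte PTEs throughout):

```c
/* Rough level-count formula for a radix-style page table: each level
 * resolves (page_shift - pte_shift) bits of VA above the in-page
 * offset. Assumes power-of-two page and PTE sizes. */
static int levels_needed(int va_bits, int page_shift, int pte_shift) {
    int idx_bits = page_shift - pte_shift;      /* index bits per level */
    int remain   = va_bits - page_shift;        /* VA bits left to map  */
    return (remain + idx_bits - 1) / idx_bits;  /* ceiling division     */
}
```

E.g.: a full 64-bit VA with 4K pages gives ceil((64-12)/9)=6 levels; the 5-level 8K case above comes out if 63 bits are translated, ceil((63-13)/10)=5; and a 48-bit VA with 16K pages gives ceil((48-14)/11)=4 levels.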
>>>>
>>>>
>>>> There is also now a flag which controls whether user-mode code is
>>>> allowed to use XMOV instructions. This could allow for restricting a
>>>> program to a 32-bit or 48-bit userland within a larger 96-bit address space.
>>> <
>>> I think the fact that you need ::
>>> a) the XMOV instruction
>>> b) a mode bit associated with it
>>> is a first level indicator that you have gone down the wrong rabbit hole.
>>> <
>>> In contrast, My 66000 is completely flat
>>>
>> Quite possibly...
>> It is neither a particularly simple nor elegant design.
>> I was trying to come up with something "cheap" here.
> <
> simple, elegant, cheap :: choose any 2.
Yeah, I had more been aiming for simple and cheap.
Also, had taken some amount of influence from a lot of 80s and 90s era
technology, and the occasional obscure reference here and there.