
Misc: Non-Stalling Pipeline, 96-bit XMOV, thoughts and direction


BGB

Oct 26, 2021, 3:21:26 AM
Thus far, the pipeline in BJX2 has worked in a certain way:
L1 D$ misses, the whole pipeline stalls;
L1 I$ misses, the whole pipeline stalls.

However, I recently realized:
L1 D$ can potentially be made to instead store-back the result
asynchronously, with an interlock being used in place of a stall
(instructions/bundles will only move into the EX stages once their
dependencies are available);
L1 I$ can also be made to inject NOPs and set PcStep to 0 rather than stall.

The latter mechanism had recently been added as a special case to deal
with a bug related to TLB miss handling with the L1 I$, where the
pipeline could possibly end up executing "stale garbage". Something was
needed to let the pipeline move forwards (so that it could initiate the
ISR) while preferably not letting the pipeline execute garbage
instructions.

This case has experimentally been extended to deal with general L1 I$
misses as well, which can also make timing easier (*). In this case, the
logic for handling an L1 I$ miss no longer needs to drive a stall across
the whole rest of the pipeline.

*: Though in Vivado timing does not improve much, the LUT cost went
down and the power-usage estimate also drops, which seems like an
improvement.


It also seems to slightly improve performance in cases where branches
trigger an L1 miss, since the two delays are no longer additive (the L1
miss-handling part of the branch and the pipeline-flushing part can now
happen in parallel with each other).

Apart from the pipeline now executing NOPs on an I$ miss, there isn't a
huge functional difference. Note that PC1 <= PC+Step, so if Step==0,
then PC1 <= PC.
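
To illustrate (a minimal Verilog sketch; icMiss, icData, ifBundle,
ifPcStep, and NOP_BUNDLE are invented names, not BJX2's actual signals):

  // On an I$ miss, fetch presents a NOP bundle with a step of zero, so
  // PC1 <= PC + 0 holds PC in place while the miss is serviced, without
  // asserting a stall across the rest of the pipeline.
  localparam [63:0] NOP_BUNDLE = 64'h0;  // stand-in, not the real encoding
  reg [63:0] ifBundle;
  reg [ 3:0] ifPcStep;
  always @(*)
  begin
    if (icMiss)
    begin
      ifBundle = NOP_BUNDLE;  // harmless filler for the decode stages
      ifPcStep = 4'h0;        // PC1 <= PC, so fetch retries next cycle
    end
    else
    begin
      ifBundle = icData;      // instruction bundle from the L1 I$
      ifPcStep = icStep;      // actual bundle length
    end
  end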


Doing similar for the L1 D$ is also a possibility, and could potentially
reduce performance cost in the case of L1 D$ misses, since a miss would
no longer need to stall the execution of bundles which don't immediately
depend on the result.


It doesn't seem like too far of a stretch to imagine a pipeline where
much of the logic is effectively non-stalling, and instead uses
buffering and interlocks to move data between stages.

Though, for the time being, some level of stalling may be unavoidable.

But, say:
(PF IF), Self-Stall (Spews NOPs on L1 Miss)
(ID1 ID2), Interlock Stall (Triggered by ID2)
(EX1 EX2 EX3 WB), No Stall

It is likely that in this case, Load and Store operations would need to
be either partially (or entirely) moved into a sort of FIFO queue structure.


Though, even with a queue, there may still need to be a stall if the
queue is finite size and "gets full" (say, something like memset comes
along and does stores at a faster rate than they can be handled).

Another possibility that comes up with this is whether loads and stores
could be semi-independent, and it would be possible to handle a hit on a
load while at the same time waiting for a cache miss on a store (or vice
versa), though possibly with a "clash heuristic" (not allowing a load to
proceed if there is a pending store to the same cache line).
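
A sketch of the clash check itself (signal names invented; 32-byte cache
lines assumed, so bits [47:5] identify the line):

  // A load may bypass a pending store miss only if it targets a
  // different cache line.
  wire clash   = (ldAddr[47:5] == pendStAddr[47:5]);
  wire ldMayGo = ldValid && !(pendStValid && clash);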

It seems like if something like this could reduce the number of cycles
spent on dealing with L1 misses, it could be worthwhile.


Though, some recent compiler optimizations / bug-fixes do appear to
have somewhat reduced the "cycles spent on cache miss" heuristic. One of
these bugs resulted in pretty much every "obj->obj2" operation, where
obj2 was a structure type, doing an extra unnecessary address
calculation and memory store (mostly due to old/stale logic in the
compiler).

Not sure how many compiler bugs of this sort remain, though I can at
least say that the generated machine code does seem to have a visible
reduction in "entirely pointless" operations (though, in some cases,
reducing the "visual fluff" from one set of pointless operations makes
another set of pointless operations more evident).

The relative number of cycles spent on inter-instruction interlocks
seems to be going up as well. For this case, I would likely need to
figure out a good way to get the compiler to load values into registers
several cycles before they are needed in an expression.

There seem to still be some unresolved bugs which lead to crashes in
some cases when using virtual memory on the CPU core.



Recently, I have also gone and implemented the 96-bit XMOV extensions.
At present:
When used, the 96-bit addressing mode reduces TLB associativity by half;
It was either this or double the Block-RAM cost of the TLB.
Some trickery was used to limit the resource cost impact on the L1 caches;
Several registers gain "high-register" halves (PC, GBR, VBR, ...).
PCH:PC, GBH:GBR, VBH:VBR, ...


The impact on resource cost and timing was a bit smaller than originally
expected. I was originally expecting it to be "very expensive", but this
isn't really the case.


Some specifics are still being fine-tuned.
For example:
TBR was going to be extended, but I decided against this;
SP and LR were extended, but I have since dropped this.
Extending these would add unnecessary complexity to the C ABI;
The added complexity was also not good for timing.
SP is relative to GBH;
LR is relative to PCH.



At present:
Load/Store addressing with a single register is handled as GBH:Rn;
Branches are handled as PCH:PC or PCH:Rn;
Any narrow jumps are local;
XMOV, Load/Store ops can use a register pair as an address;
JMPX can be used to do a long jump to a register pair.

So, in terms of ISA mechanics, it functions sort of like 8086-style
segmented addressing, just with the results appended into a 96-bit
linear address and fed through the TLB.
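
As a rough sketch of the difference (signal names invented): where the
8086 computes (seg<<4)+offset with overlapping segments, here the
quadrant bits sit strictly above the 48-bit in-quadrant address:

  // 8086-style:     linear = (seg << 4) + offset   (segments overlap)
  // Quadrant-style: high bits are concatenated, never added with overlap
  wire [95:0] va = isXmov ? { xRmHi[47:0], aguOut[47:0] }  // XMOV: pair reg
                          : { gbh  [47:0], aguOut[47:0] }; // default: GBH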

It is possible that the TLB could be used to mimic a segmented
addressing mode.

Similarly, the extended address space may also be treated like multiple
smaller 32 or 48 bit spaces.

I have generally been leaning towards calling this addressing scheme
"quadrant" addressing, for various reasons. I feel it is "sufficiently
different" from traditional segmented addressing schemes to likely
justify its own term (and its mechanism is very different from segmented
addressing as understood on the 286 or 386).


Pointers are still generally kept as 64-bits with a 48-bit address by
default.

However, '__far' or '__huge' may be used to get the larger pointers.
Though, the '__near', '__far', and '__huge' keywords will depend on the
current pointer size:
sizeof(void *)==4:
  __near: 32-bit
  __far:  64-bit
  __huge: 128-bit
sizeof(void *)==8:
  __near: 64-bit
  __far:  128-bit
  __huge: 128-bit

But, still working out some of the details here...


Any thoughts?...

MitchAlsup

Oct 26, 2021, 10:27:09 AM
Skid buffers (how one makes a stalling pipeline into a non-stalling pipeline)
consume(create) lots of flip-flops (area), but do not consume much power.
<
Used properly, a skid buffer on the D$ pipe could allow for hit under miss
cache access.

BGB

Oct 26, 2021, 7:31:03 PM
Yeah, still need to look more into it.

If I could eliminate the stalling from the L1 D$, it is potentially
possible that this could help some with performance and timing.

I still consider the non-stalling L1 I$ to be experimental:
It appears to change behaviors in some cases (*1);
If the I$ deadlocks, the pipelines' deadlock-detect doesn't trigger.

*1: Annoyingly, I still don't have things entirely stable with virtual
memory it seems (the remaining bugs seem fairly esoteric and difficult
to isolate, generally take multiple hours of simulation time to
recreate, and poking at nearly anything in the Verilog causes them to
move around; possibly a timing issue or race condition somewhere).
Though, virtual memory seems to be working "for the most part".

...



Still not sure though why increasing the ring-bus address width, TLB
width, L1 cache address width, ... was all fairly cheap, but...

Adding more paths from which the high-order address bits can come, or
supporting an expanded link register, is fairly bad.


Say, for XMOV:
  Op is Wide Address: High-bits come from Other Register.
  Default: High-bits come from GBH.
Is OK, but:
  Op is Wide Address: High-bits come from Other Register.
  Base is PC: High-bits come from PCH.
  Base is TBR: High-bits come from TBH.
  Base is SP: High-bits come from SPH.
  Default: High-bits come from GBH.
Is much worse.


Likewise trying to be able to capture and restore the high order bits
for the Link Register also basically ruined timing.

Though, this could be an adverse reaction with the RISC-V mode, which
ends up somewhat increasing the logic for the "save captured
Link-Register state" case (since RISC-V mode allows directing the
captured PC to an arbitrary register; vs just to LR).

This does favor a "simpler mode" where generally everything is assumed
to be within a single quadrant by default (with the main exception being
to have GBH/PCH and similar separate mostly for sake of things like
system-calls and interrupt handling).

For related funkiness, it is likely that things like kernel-related
address ranges would need to be mapped into multiple quadrants (as
opposed to being able to treat it like a 96-bit linear address space in
this sense).


Though, it is possible that a sub-mode could allow full 64-bit linear
addressing (64k quadrants, or "64 kiloquad" ...). Such a mode could more
directly mimic something like x86-64 or ARM64 addressing modes.

Eg, essentially:
  VA = (Quadrant[47:0]<<48) +
       ZExt(Base[63:0]) +
       (SExt(Index[33:0])<<Sc);

Though, timing is pretty tight on this mode...

It also would not work for executable code (as-is) nor would using it be
compatible with my existing runtime library, ...


Have yet to decide on the page-table schemes though; it is either some
sort of split-level scheme, or just adding a lot more page-table levels.

But, dealing with these huge address spaces via a 9-level page table
(with 16K pages) or similar seems a bit absurd.


There is also now a flag which controls whether user-mode code is
allowed to use XMOV instructions. This could allow for restricting a
program to a 32-bit or 48-bit userland within a larger 96-bit address space.

MitchAlsup

Oct 26, 2021, 8:20:46 PM
On Tuesday, October 26, 2021 at 6:31:03 PM UTC-5, BGB wrote:
> On 10/26/2021 9:27 AM, MitchAlsup wrote:
> > On Tuesday, October 26, 2021 at 2:21:26 AM UTC-5, BGB wrote:
> >> Thus far, the pipeline in BJX2 has worked in a certain way:
> >> L1 D$ misses, the whole pipeline stalls;
> >> L1 I$ misses, the whole pipeline stalls.
> >>
> >> However, I recently realized:
<snip>
> >> But, still working out some of the details here...
> >>
> >>
> >> Any thoughts?...
> > <
> > Skid buffers (how one makes a stalling pipeline into a non-stalling pipeline)
> > consume(create) lots of flip-flops (area), but do not consume much power.
> > <
> > Used properly, a skid buffer on the D$ pipe could allow for hit under miss
> > cache access.
> >
> Yeah, still need to look more into it.
>
> If I could eliminate the stalling from the L1 D$, it is potentially
> possible that this could help some with performance and timing.
<
Let us postulate that your LD pipeline has 3 cycles {Agen, D$, and
LdAlign}. You could throw a skid buffer at the end with 3 entries,
each entry holding the result of an instruction passing down that
pipeline (whether it made it or not).
<
Now assume that D$ takes a miss, so when the skid buffer captures
the miss, you stall new entries into the execution of memory refs,
and the skid buffer has all the information you need to restart the
miss from the D$, possibly even providing the address for the
returning data. The other MRs which get captured can be allowed
to complete (danger Will Robinson) or replayed after the D$ is repaired
from the skid buffer entry.
<
A simple shift register scheduler only has one flip-flop in the buffer
active at any point in time, so clock and flip-flop power is a complete
wash.
<
The rest is about what data you chose to put in the skid buffer.
>
> I still consider the non-stalling L1 I$ to be experimental:
> It appears to change behaviors in some cases (*1);
> If the I$ deadlocks, the pipelines' deadlock-detect doesn't trigger.
<
FETCH/DECODE gets stalled when::
a) you don't know what address to fetch from (indirects)
b) you don't have an instruction from the address you fetched from.
c) you are waiting for the instructions to arrive (miss and pipe delays)
>
<snip>
>
> Still not sure though why seemingly increasing the ring-bus address
> width, TLB width, L1 cache address width, ... were fairly cheap, but...
>
> Adding more paths from which high-order address bits can come from, or
> supporting an expanded link register, is fairly bad.
>
>
> Say, for XMOV:
> Op is Wide Address: High-bits come from Other Register.
> Default: High-bits come from GBH.
> Is OK, but:
> Op is Wide Address: High-bits come from Other Register.
> Base is PC: High-bits come from PCH.
> Base is TBR: High-bits come from TBH.
> Base is SP: High-bits come from SPH.
> Default: High-bits come from GBH.
> Is much worse.
<
My 66000 has none of this; the only accommodation it makes to AGEN
is that displacements can be 16-bits, 32-bits, or 64-bits, and these are
sorted out in DECODE and not dynamically. All addresses, all TLB stuff,
all PTEs, all memory accesses are 64-bit in size. For 99% of accesses
all the HoBs come from the base register.
>
>
> Likewise trying to be able to capture and restore the high order bits
> for the Link Register also basically ruined timing.
<
Yep, try not doing that.
>
> Though, this could be an adverse reaction with the RISC-V mode, which
> ends up somewhat increasing the logic for the "save captured
> Link-Register state" case (since RISC-V mode allows directing the
> captured PC to an arbitrary register; vs just to LR).
<
My 66000 link register is R0 and captures the entire 64-bit return address.
99%+ of the time this is IP+4.
<
If you got rid of ½ of the mux entries, and added pipe delay cycles, would
it then make timing ?
>
> This does favor a "simpler mode" where generally everything is assumed
> to be within a single quadrant by default (with the main exception being
> to have GBH/PCH and similar separate mostly for sake of things like
> system-calls and interrupt handling).
<
I don't use a GBH per se; I can access globals directly, or indirectly
(IP rel), or SW can load the address of the globals (or thread locals)
and put it in a register of SW's choosing.
>
> For related funkiness, it is likely that things like kernel-related
> address ranges would need to be mapped into multiple quadrants (as
> opposed to being able to treat it like a 96-bit linear address space in
> this sense).
<
Why did you go to 96-bits? What does this abstraction buy that 64-bits
does not suffice ?
>
>
> Though, it is possible that a sub-mode could allow full 64-bit linear
> addressing (64k quadrants, or "64 kiloquad" ...). Such a mode could more
> directly mimic something like x86-64 or ARM64 addressing modes.
>
> Eg, essentially:
> VA = (Quadrant[47:0]<<48) +
> ZExt(Base[63:0]) +
> (SExt(Index[33:0])<<Sc);
>
> Though, timing is pretty tight on this mode...
<
I am adding a "cast" (in the same way the CARRY instruction casts instruction
decode bits over a few successive instructions) to the memory reference
instructions so a LD can 'LD from' a different address space and a ST can 'ST
to' a different address space, then providing a means for a semi-privileged
unit of code to fetch the root pointer for which the tos/froms are appropriate
and make this register available as an alternate ROOT pointer for MMUing of
the accesses. {Very PDP-11/70-ish--and eliminates the NEED for the OS
to be in the same address space as the users for which it is providing
services.}
>
> It also would not work for executable code (as-is) nor would using it be
> compatible with my existing runtime library, ...
>
>
> Have yet to decide on the page-table schemes though; it is either some
> sort of split-level scheme, or just adding a lot more page-table levels.
<
My 66000 has a 5-level page table walk. The root pointer has a Level indicator
that chooses the size of the virtual address space {23-bits, 33-bits, 43-bits,
53-bits, 63-bits, or PA=VA}. Each PTP has a level indicator so we can skip
levels in the tablewalk by trimming the VA space at the start, or by skipping
levels in the middle. Small stuff like 'cat' can be translated with only 2 pages
of MMU tables--mapping only dozens of actual code and data pages in memory.
>
> But, dealing with these huge address spaces via a 9-level page table
> (with 16K pages) or similar seems a bit absurd.
<
Yep, don't do that to yourself.
>
>
> There is also now a flag which controls whether user-mode code is
> allowed to use XMOV instructions. This could allow for restricting a
> program to a 32-bit or 48-bit userland within a larger 96-bit address space.
<
I think the fact that you need ::
a) the XMOV instruction
b) a mode bit associated with it
is a first level indicator that you have gone down the wrong rabbit hole.
<
In contrast, My 66000 is completely flat

BGB

Oct 27, 2021, 2:26:40 AM
I think I understand how a skid buffer works in a basic sense (from
some information I could gather), but less clear is how to adapt it to
my existing pipeline.

The descriptions I read made it seem like it would effectively "stretch
out" the pipeline (stuff goes in one-side, and comes back out a variable
number of cycles later).

But, then, either the pipeline is N-cycles longer, or I need some way to
fork it off for memory accesses and then re-join the results back into
the register-file after the fact.


>>
>> I still consider the non-stalling L1 I$ to be experimental:
>> It appears to change behaviors in some cases (*1);
>> If the I$ deadlocks, the pipelines' deadlock-detect doesn't trigger.
> <
> FETCH/DECODE gets stalled when::
> a) you don't know what address to fetch from (indirects)
> b) you don't have an instruction from the address you fetched from.
> c) you are waiting for the instructions to arrive (miss and pipe delays)

I can avoid the stalling the rest of the pipeline by inserting a stream
of zero-length NOPs, which more or less works as expected.

However, it seems the differences are not entirely invisible to code
running on the CPU core (mostly in the context of interrupt handling).

The other difference was that there was some logic to detect if the
pipeline is deadlocked (say, if the 'hold' signal is active for 64k
cycles), which then dumps the state of the pipeline via $display statements.

Granted, this logic could also be modified to treat a zero-length NOP
like a hold. But, even then, it would just show the pipeline contents as
NOPs rather than the state of the pipeline at the point the deadlock
occurred.


Though, granted, doing this does make timing better and seems to
slightly improve performance (though, Vivado's power-use estimate also
goes up somewhat).


>>
> <snip>
>>
>> Still not sure though why seemingly increasing the ring-bus address
>> width, TLB width, L1 cache address width, ... were fairly cheap, but...
>>
>> Adding more paths from which high-order address bits can come from, or
>> supporting an expanded link register, is fairly bad.
>>
>>
>> Say, for XMOV:
>> Op is Wide Address: High-bits come from Other Register.
>> Default: High-bits come from GBH.
>> Is OK, but:
>> Op is Wide Address: High-bits come from Other Register.
>> Base is PC: High-bits come from PCH.
>> Base is TBR: High-bits come from TBH.
>> Base is SP: High-bits come from SPH.
>> Default: High-bits come from GBH.
>> Is much worse.
> <
> My 66000 has none of this, the only accommodation it makes to AGEN
> is displacements can be 16-bits, 32-bits, or 64-bits, and these are sorted out
> in DECODE and not dynamically. All addresses, all TLB stuff, all PTEs, all
> memory accesses are 64-bit in size. for 99% of accesses all the HoBs
> come from the base register.

Yeah...

The main part of the AGU ignores that any of the XMOV stuff exists, but
it is just sort of hacked on.

I guess it can be noted that the model was sort of like the 8086 segment
registers:
PCH ~= CS
GBH ~= DS
SPH ~= SS (Dropped)


>>
>>
>> Likewise trying to be able to capture and restore the high order bits
>> for the Link Register also basically ruined timing.
> <
> Yep, try not doing that.

I made it so that LR is local to a single "quadrant".
If one needs an inter-quadrant function call, a thunk would be needed.

>>
>> Though, this could be an adverse reaction with the RISC-V mode, which
>> ends up somewhat increasing the logic for the "save captured
>> Link-Register state" case (since RISC-V mode allows directing the
>> captured PC to an arbitrary register; vs just to LR).
> <
> My 66000 link register is R0 and captures the entire 64-bit return address.
> 99%+ of the time this is IP+4.
> <
> If you got rid of ½ of the mux entries, and added pipe delay cycles, would
> it then make timing ?

It also works if I disable RISC-V mode, where:
  Baseline pipeline:
    Are we doing a JSR/BSR?
      Yes: save old PC to LR;
        (Test) Save PCH to LRH
      No: Do nothing.
  RISC-V Enabled:
    Are we doing a JSR/BSR?
      Is ((Rn==DLR) || (Rn==ZZR))?
        Yes: save old PC to LR;
          (Test) Save PCH to LRH
        Else:
          Put captured value on the GPR port.
      No: Do nothing.

RISC-V's JAL and JALR are routed through the mechanisms for BSR and JSR.



>>
>> This does favor a "simpler mode" where generally everything is assumed
>> to be within a single quadrant by default (with the main exception being
>> to have GBH/PCH and similar separate mostly for sake of things like
>> system-calls and interrupt handling).
> <
> I don't use a GBH per-seé, I can access globals directly, or indiretly (IP rel)
> or SW can load the address of the globals (or thread locals) and put it in a
> register of SW choosing.

There is GBR, but with this, there is GBH:GBR.

GBH was meant to be the high-half of GBR, but it has since become "the
base address for the 48-bit address space within the 96-bit space."

Mostly this was because it turns out it was a lot more expensive to
widen everything else out, than to more or less leave everything else
thinking it was still operating in the original 48-bit space and then
using some last-minute bit-pasting hackery.


>>
>> For related funkiness, it is likely that things like kernel-related
>> address ranges would need to be mapped into multiple quadrants (as
>> opposed to being able to treat it like a 96-bit linear address space in
>> this sense).
> <
> Why did you go to 96-bits? What does this abstraction buy that 64-bits
> does not suffice ?

Partly because:
I already had 48 bit pointers;
I could glue two 48 bit pointers together;
48+48=96.

This also leaves some bits left over for things like type-tags and
handling Java-style bounded arrays.

So, say, something like:
int[] arr = new int[100000000];
Can encode the 'int[]' array type and bounds directly into the pointer.


Had considered 112 bits, but this would have cost a lot more and added
more issues. Bit shuffling to make the address "more linear" would have
also broken the symmetry between 48-bit Ld/St and 96-bit Ld/St. I wanted
the 48-bit space to be a sliver of the 96-bit space, rather than
effectively having several different address spaces (which would require
considerable bit-twiddling to convert between).

I also had, in any case, to retain binary compatibility with code
compiled for the existing 48-bit space.

So, while not particularly elegant, the current design seemed to be "the
lesser of the evils" from where I started out.

But, had I been starting from a clean slate, I might have considered
just going over to a 64-bit flat address space instead.



The existing logic is basically:
Do address calculations the same as in 48-bit mode;
Glue on another 48 bits from elsewhere.

Within a given quadrant, nearly all the existing logic works just as it
would with 48-bit addresses.

In theory, I could expand to 64-bits, except:
My runtime sticks tagging data in the high 16 bits;
The runtime assumes these bits will be ignored by the CPU;
LR(63:48) is used to capture various operating-state flags;
...

So, in this sense, I engineered myself into a corner.
However, for most things near-term, 48-bit should be sufficient;
It allows for type-tagging the pointers (for dynamic-typed objects);
Carry propagation is faster on 48b than 64b.


In effect, C on BJX2 isn't using a pure "bare address pointer" system,
but rather includes an integrated dynamic type system, which may be used
internally in some cases. These features generally carve off a lot of
the high-order bits from the pointer to be used for things like
type-tags and similar.

So, rather than a bare address, we have:
An object pointer, which may have a type-tag, and 48-bit address;
A bounds-checked array-object type;
A 62-bit fixnum;
A 62-bit flonum;
...

I had generally tried to avoid canonizing the dynamic type-system in
hardware, instead leaving these bits as canonically "ignored" for memory
load/store.

Though the contents of these bits, as seen via LR and friends, are
defined for other things:
PC and LR (63:48): Some Status/Control Flags
GBR (63:48): FPU Status/Control
...

>>
>>
>> Though, it is possible that a sub-mode could allow full 64-bit linear
>> addressing (64k quadrants, or "64 kiloquad" ...). Such a mode could more
>> directly mimic something like x86-64 or ARM64 addressing modes.
>>
>> Eg, essentially:
>> VA = (Quadrant[47:0]<<48) +
>> ZExt(Base[63:0]) +
>> (SExt(Index[33:0])<<Sc);
>>
>> Though, timing is pretty tight on this mode...
> <
> I am adding a "cast" (in the same way the CARRY instruction casts instruction
> decode bits over a few successive instructions) to the memory reference
> instructions so a LD can 'LD from' a different address space and a ST can 'ST
> to' a different address space, then providing a means for a semi-privileged
> unit of code to fetch the root pointer for which the tos/froms are appropriate
> and make this register available as an alternate ROOT pointer for MMUing of
> the accesses. {Very PDP-11/70-ish--and eliminates the NEED for the OS
> to be in the same address space as the users for which it is providing
> services.}

I was trying to think up possible ways base-addressing could work and be
extended to mimic a 64-bit mode.
The above would be one possible way to allow for a 64-bit address space.

But:
Timing wasn't happy;
My existing C Runtime and C ABI would require modification to work with
64-bit bare-address pointers.

It could make sense for code running in RISC-V mode, since GCC and the
RISC-V ABI do not assume the existence of dynamic-type-tagging bits.


Another possible hack could be:
Expand Rm[63:0] to Ra[95:0] as:
  Ra = { GBH[47:16], Rm[47:32], Rm[63:48], Rm[31:0] };

Which is also a bit of an ugly hack, but could better mimic the
appearance of a 64-bit flat address space (it is actually less linear
than it was before).


>>
>> It also would not work for executable code (as-is) nor would using it be
>> compatible with my existing runtime library, ...
>>
>>
>> Have yet to decide on the page-table schemes though; it is either some
>> sort of split-level scheme, or just adding a lot more page-table levels.
> <
> My 66000 has a 5-level page table walk. The root pointer has a Level indicator
> that chooses the size of the virtual address space {23-bits, 33-bits, 43-bits,
> 53-bits, 63-bits, or PA=VA}. Each PTP has a level indicator so we can skip
> level in the tablewalk by trimming the VA space at the start, or by skipping
> levels in the middle. Small stuff like 'cat' can be translated with only 2 pages
> of MMU tables--mapping only dozens of actual code and data pages in memory.

Yeah, could work.


>>
>> But, dealing with these huge address spaces via a 9-level page table
>> (with 16K pages) or similar seems a bit absurd.
> <
> Yep, don't do that to yourself.

I guess it is possible that each page-directory entry could have a
level-skip indicator.

This could then compact "sparse" parts of the virtual address space by
skipping over several levels of page table.


Well, or allow the page-tables to use a bigger logical page size than
the MMU page size, say:
MMU page size is 16K;
Page-Table Page-Size is 64K.

Though, this would still require a 7-level page-table to cover a 96-bit
space.

Making the page-table pages bigger seems to increase the effective space
requirements for the page table much faster than it reduces the number
of page-table levels.


>>
>>
>> There is also now a flag which controls whether user-mode code is
>> allowed to use XMOV instructions. This could allow for restricting a
>> program to a 32-bit or 48-bit userland within a larger 96-bit address space.
> <
> I think the fact that you need ::
> a) the XMOV instruction
> b) a mode bit associated with it
> is a first level indicator that you have gone down the wrong rabbit hole.
> <
> In contrast, My 66000 is completely flat
>

Quite possibly...
It is neither a particularly simple nor elegant design.
I was trying to come up with something "cheap" here.


The current XMOV encoding is the F0nm-8eoZ block, which gives a full
set of Load/Store ops (which operate on 128-bit pointers holding a
96-bit address).


It does not include LEA though, since this can be done either:
With normal LEA ops (low-order bits);
Or, with ALUX ops (as full-width 128-bit integer operations).

These generally also only provide for 5-bit displacement encodings (in a
32-bit encoding), but I expect them to be used rarely.


The mode bit in this case is mostly because one doesn't necessarily want
programs to have full access to the address space, and disabling these
ops will effectively sandbox the program within its own 48-bit sliver of
the larger address space.

MitchAlsup

Oct 27, 2021, 12:46:37 PM
Done correctly, the skid buffer adds nothing to the cycle count of the
pipeline when the pipe is not stalled.
<
It merely provides a landing zone for instructions when there is a stall
so as to allow the stalls to catch up to the (now buffered) instructions.
This SHOULD take tension out of the computation of the stalls; and
alleviate timing problem there-associated.
>
> But, then, either the pipeline is N-cycles longer, or I need some way to
> fork it off for memory accesses and then re-join the results back into
> the register-file after the fact.
<
As I said, the skid buffer adds no cycle count to the pipeline when the
pipeline is not stalled.
> >>
> >> I still consider the non-stalling L1 I$ to be experimental:
> >> It appears to change behaviors in some cases (*1);
> >> If the I$ deadlocks, the pipelines' deadlock-detect doesn't trigger.
> > <
> > FETCH/DECODE gets stalled when::
> > a) you don't know what address to fetch from (indirects)
> > b) you don't have an instruction from the address you fetched from.
> > c) you are waiting for the instructions to arrive (miss and pipe delays)
<
> I can avoid the stalling the rest of the pipeline by inserting a stream
> of zero-length NOPs, which more or less works as expected.
<
At some cost in power........
>
> However, it seems the differences are not entirely invisible to code
> running on the CPU core (mostly in the context of interrupt handling).
>
> The other difference was that there was some logic to detect if the
> pipeline is deadlocked (say, if the 'hold' signal is active for 64k
> cycles), which then dumps the state of the pipeline via $display statements.
<
Why would the pipeline deadlock? It is either cruising along or stalled
waiting on something that WILL happen.
<snip>
> It also works if I disable RISC-V mode, where:
> Baseline pipeline:
> Are we doing a JSR/BSR?
> Yes, save old PC to LR;
<
This can be done as late as writeback if you carry PS down the pipe.
> (Test) Save PCH to LRH
> No, Do nothing.
> RISC-V Enabled:
> Are we doing a JSR/BSR?
> Is ((Rn==DLR) || (Rn==ZZR))
> Yes, save old PC to LR;
Once again this does not have to happen until writeback
> (Test) Save PCH to LRH
> Else:
> Put captured value on the GPR port.
> No, Do nothing.
>
> RISC-V's JAL and JALR are routed through the mechanisms for BSR and JSR.
> >>
> >> This does favor a "simpler mode" where generally everything is assumed
> >> to be within a single quadrant by default (with the main exception being
> >> to have GBH/PCH and similar separate mostly for sake of things like
> >> system-calls and interrupt handling).
> > <
> > I don't use a GBH per-seé, I can access globals directly, or indiretly (IP rel)
> > or SW can load the address of the globals (or thread locals) and put it in a
> > register of SW choosing.
> There is GBR, but with this, there is GBH:GBR.
>
> GBH was meant to be the high-half of GBR, but it has sense become "the
> base address for the 48-bit address space within the 96-bit space."
>
> Mostly this was because it turns out it was a lot more expensive to
> widen everything else out, than to more or less leave everything else
> thinking it was still operating in the original 48-bit space and then
> using some last-minute bit-pasting hackery.
<
Suggestion:: make the 48-bit stuff fast, and add cycles to the 96-bit stuff.
> >>
> >> For related funkiness, it is likely that things like kernel-related
> >> address ranges would need to be mapped into multiple quadrants (as
> >> opposed to being able to treat it like a 96-bit linear address space in
> >> this sense).
> > <
> > Why did you go to 96-bits? What does this abstraction buy that 64-bits
> > does not suffice ?
> Partly because:
> I already had 48 bit pointers;
> I could glue two 48 bit pointers together;
> 48+48=96.
<
Yes, but what do you really gain if the physical address space remains 48-bits ?
>
> This also leaves some bits left over for things like type-tags and
> handling Java-style bounded arrays.
<
Is there something wrong with a Dope Vector to haul arrays around ?
>
> So, say, something like:
> int[] arr = new int[100000000];
> Can encode the 'int[]' array type and bounds directly into the pointer.
<
What if you got ::
int[] arr = new int[-10000:99990000];
<
This is where Dope Vectors come in.
You stated above that you only really have a 48-bit VaS with other bits "pasted
on". So, with 16K pages, you should be able to have {16Kb, 32Mb, 64GB, and
48-bit VaS.} But I will note, all of these 48-bit entries are wasting 1/3rd of the
space being used in the mapping tables. This is a reason I went all the way
to 64-bits--get it done and over with so I can share logic across implementations.
>
> Making the page-table pages bigger seems to increase the effective space
> requirements for the page table much faster than it reduces the number
> of page-table levels.
<
This is where level skipping comes in. But this is ALSO why you don't make
the pages so big. I would have ended up with 6-level tables had I used 4Kb
pages; this is reduced to 5-levels with 8Kb tables. The 4Kb version used
{4Kb, 2Mb, 1Gb, ½Tb, ¼Eb,..} leaving the top table (page) covering only ¼
of this actual space. The 8Kb version used {8Kb, 8Mb, 8Gb, 8Tb, 8Eb,..}
and uses ½ of the top level table actual space. But level skipping at Root
makes most of these "go away" most of the time. Level skipping in the
middle increases the table density.
> >>
> >>
> >> There is also now a flag which controls whether user-mode code is
> >> allowed to use XMOV instructions. This could allow for restricting a
> >> program to a 32-bit or 48-bit userland within a larger 96-bit address space.
> > <
> > I think the fact that you need ::
> > a) the XMOV instruction
> > b) a mode bit associated with it
> > is a first level indicator that you have gone down the wrong rabbit hole.
> > <
> > In contrast, My 66000 is completely flat
> >
> Quite possibly...
> It is neither a particularly simple nor elegant design.
> I was trying to come up with something "cheap" here.
<
simple, elegant, cheap :: choose any 2.

BGB

Oct 27, 2021, 5:27:08 PM
The descriptions of the mechanisms I could find made it sound like it
would get stuck in its elongated state until some sort of "NO-OP" passes
through and it would contract back down.

Though, granted, the shortcut mechanisms in my ring-bus have a similar
issue. The shortcuts would not work under high levels of bus activity,
but as-is it is pretty rare for much more than a few messages to be on
the ring at any given time (unlike the execute pipeline, which would be
fairly dense much of the time).

>>>>
>>>> I still consider the non-stalling L1 I$ to be experimental:
>>>> It appears to change behaviors in some cases (*1);
>>>> If the I$ deadlocks, the pipelines' deadlock-detect doesn't trigger.
>>> <
>>> FETCH/DECODE gets stalled when::
>>> a) you don't know what address to fetch from (indirects)
>>> b) you don't have an instruction from the address you fetched from.
>>> c) you are waiting for the instructions to arrive (miss and pipe delays)
> <
>> I can avoid the stalling the rest of the pipeline by inserting a stream
>> of zero-length NOPs, which more or less works as expected.
> <
> At some cost in power........

Yeah, Vivado seems to also imply this as well.

>>
>> However, it seems the differences are not entirely invisible to code
>> running on the CPU core (mostly in the context of interrupt handling).
>>
>> The other difference was that there was some logic to detect if the
>> pipeline is deadlocked (say, if the 'hold' signal is active for 64k
>> cycles), which then dumps the state of the pipeline via $display statements.
> <
> Why would the pipeline deadlock. It is either cruising along or stalled
> waiting on something that WILL happen.

Usually, either:
There was a bug somewhere which caused the pipeline to stall indefinitely;
Something went wrong (such as a sanity check failing) and code triggered
a breakpoint.

In these cases, the state of the pipeline at the moment the stall
occurred can help give clues as to what had gone wrong (such as a
misbehaving instruction, etc).

The NOP padding logic in effect has the drawback that it interferes with
my debugging aid.

> This can be done as late as writeback if you carry PS down the pipe.

Yeah, it is possible. I could treat LR like a normal register update,
but would have to rework the BSR mechanism (likely by making it decode
BSR in BJX2 into a similar form used by JAL in RISC-V mode; and updating
LR via a register store rather than via a side-channel).

> Suggestion:: make the 48-bit stuff fast, and add cycles to the 96-bit stuff.

Could be.

I just sorta widened out the TLB, L1 caches, and bus interfaces
directly. Pretty much everything else remains 48-bits (so, eg, the PC
advance, branch predictor, ... have no need to be aware of XMOV).


The implementation for TLB was, basically:
Do some hackery to use two adjacent TLB entries like a giant TLB entry;
Do some hackery to LDTLB so that the entries end up adjacent;
...

This effectively cuts the TLB from 4-way to 2-way when working with
96-bit addresses, but the alternative would have doubled the Block-RAM
needed for the TLB (the main alternative having been to widen the TLB
entries to 256 bits, and ignore the high address bits if XMOV isn't
being used).

Though, this other option would make more sense if XMOV ends up being
used as much more than a gimmick (as 2-way has a much higher TLB miss
rate than 4-way).
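
Conceptually the pairing might look something like this (hypothetical
names; entry width and layout guessed):

  // Two adjacent TLB ways read out as one double-width logical entry
  // when a 96-bit address is in play, halving associativity from 4-way
  // to 2-way instead of doubling the Block-RAM width.
  wire [255:0] wideWayA = { tlbWay1, tlbWay0 };
  wire [255:0] wideWayB = { tlbWay3, tlbWay2 };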


>>>>
>>>> For related funkiness, it is likely that things like kernel-related
>>>> address ranges would need to be mapped into multiple quadrants (as
>>>> opposed to being able to treat it like a 96-bit linear address space in
>>>> this sense).
>>> <
>>> Why did you go to 96-bits? What does this abstraction buy that 64-bits
>>> does not suffice ?
>> Partly because:
>> I already had 48 bit pointers;
>> I could glue two 48 bit pointers together;
>> 48+48=96.
> <
> Yes, but what do you really gain if the physical address space remains 48-bits ?

I don't need more than 48-bits for physical addresses for now.

It isn't like we have *anywhere near* the 256TB limit in any available
FPGA boards, nor connected up to a single processor.


Actually, with the boards I have access to, even 48 bits is overkill; I
could have still gotten along OK with the 29-bit physical address scheme
used by the SH4 on the current FPGA boards.

Though, granted, even the Arty board has enough RAM to push SH4's scheme
to its limit.


But, large virtual address spaces have other use-cases.


>>
>> This also leaves some bits left over for things like type-tags and
>> handling Java-style bounded arrays.
> <
> Is there something wrong with a Dope Vector to haul arrays around ?

This can be a fallback case for when the array bounds can't be
shoehorned into the pointer.

But, storing things like array bounds in memory has a few drawbacks:
Need to always provide memory for the descriptor;
Array accesses tend to need multiple memory accesses;
...

Though, granted, accessing such an array using tag-refs does require
logic in the handler:
Are we using the array in the inline format?
If yes, check bounds directly, do access.
If no, fetch bounds from memory, access array indirectly.

This can be faster/cheaper for smaller arrays with a 0-based index, but
imposes a penalty for larger arrays or for non-zero-based arrays.

One can debate though whether it would be faster to implement "int[]"
as, say:
Call a runtime handler, which does tag-check handling and special cases;
Or: Always do memory accesses for the bounds-checks, with an inline
call to "throw" logic for bounds-check failures.


At present, I went with the former; the handlers have names like
"__lvo_loadindex_si" and similar, and are generally written in ASM.
In most of these cases, the handler is specialized for the index type of
the array (stuff is assumed to "have gone terribly wrong" if an
"object[]" array is passed to "int[]" or similar, *1).

*1: Unlike normal C types, the tagged type-system doesn't generally
allow one to freely cast whatever to whatever else (it instead does
runtime checks).


Another (slower) mechanism is to use __variant, which first needs to
determine that it is looking at an array, what is the base type of the
array, ... Which is stuff we already know with an "int[]" type.


It is possible there could be a "do bounds-checked load/store else
branch" instruction, but this would only seem justified if I were
intending to primarily run languages like Java or C# (rather than C).

It could benefit my BS2 language, which also uses Java-style arrays, but
thus far BS2 hasn't really been used much. This feature wouldn't really
help much with JavaScript or Python (the needed mechanism would be
considerably more complicated, *2).


Say:
MOVBC.L (Rm, Ri), Rn, Label
Checks that Rm represents an inline 0 based array:
If not, Branch
Checks that Ri is in the bounds encoded in Rm:
If not, Branch
Else, Do a Memory Load (no Branch)

The branched-to label would call the runtime handler.


While, arguably, it could be more efficient to have a "Do operation and
branch, else NOP" instruction (it would save needing ~2 unconditional
branches per array operation), the drawback is that it requires the core
to be able to simultaneously both initiate a branch and access memory.

Eg:
MOVABC.L (R8, R9), R10, .L0
BRA .L1
.L0:
/* Slow path. */
MOV R8, R4 | MOV R9, R5
BSR __lvo_loadindex_si
MOV R2, R10
.L1:


*2: An open question is whether it would be worthwhile for someone to
try to design a hardware-level ISA intended to make languages like
Python run efficiently...


>>
>> So, say, something like:
>> int[] arr = new int[100000000];
>> Can encode the 'int[]' array type and bounds directly into the pointer.
> <
> What if you got ::
> int[] arr = new int[-10000:99990000];
> <
> This is where Dope Vectors come in.

Granted. The current scheme (with 48-bit addresses) puts a serious limit
onto the sizes of array that can be expressed via the inline format, and
inline bounds only work with 0-based arrays.

Eg:
int[] arr1=new int[1024]; //OK, can use inline format
int[] arr2=new int[8192]; //Not OK, needs to use in-memory descriptor.

With a 96-bit pointer, one can conceivably extend the array bounds up to
28 bits or so.

> You stated above that you only really have a 48-bit VaS with other bits "pasted
> on". So, with 16K pages, you should be able to have {16Kb, 32Mb, 64GB, and
> 48-bit VaS.} But I will note, all of these 48-bit entries are wasting 1/3rd of the
> space being used in the mapping tables.

I am also using the extra bits from the page-tables and TLB entries for
the VUGID system.

But, yeah, page-table entry layout is generally:
[63:48]: ACLID (or VUGID)
[47:12]: Physical Page Address
[11: 0]: Page Flag Bits


However, the address space widens to 96 bits in the L1 caches and TLBs,
so in effect, it gives a logical 96-bit virtual address space, with a
48-bit physical space.

However, to map such large virtual addresses, one may still need a
deeply nested page-table.


The parts that remain 48 bits are mostly:
The AGU;
The PC advance;
The branch predictor;
...

These parts operate on 48-bit addresses, as before, and then the
"quadrant address" is glued on before submitting the result to the L1
caches.


For XMOV ops, the quadrant address comes from the adjacent paired register:
[127:112], More Tag Bits
[111: 64], Quadrant Address
[ 63: 48], Type Tag Bits (Mostly ignored, *)
[ 47: 0], Base Address (Within Quadrant)

The low part behaves as it would for normal Load/Store ops.
Just for normal MOV.x, the high-order bits come from GBH.

So, say:
  VA[95:48] = GBH[47:0]
  VA[47: 0] = AGU[47:0]
  if(XMOV)
    VA[95:48] = XRm[47:0]


*1: Though one possibility is:
  If in "Quadrant Add" mode, and Bits[63:61] are 110, Then:
    VA[95:64] = GBH[47:16]
    VA[63:48] = GBH[15: 0] +
                { 2'b0, Rm[60:47] } +
                { 15'h0, AGU[47] ^ Rm[47] };  //Carry
    VA[   47] = 0  //Force userland VA
    VA[46: 0] = AGU[46: 0]

Pro:
Backwards compatible with existing code;
Allows for 2 or 8 exabytes (pseudo-linear);
Less nasty than some other variants.

Cons:
Ugly hack;
Bakes part of the type-tag system into the ISA.

Misc:
Would not allow a misaligned access across a 128TB boundary.

>>
>> Making the page-table pages bigger seems to increase the effective space
>> requirements for the page table much faster than it reduces the number
>> of page-table levels.
> <
> This is where level skipping comes in. But this is ALSO why you don't make
> the pages so big. I would have ended up with 6-level tables had I used 4Kb
> pages, this is reduced to 5-levels with 8Kb tables. The 4Kb version used
> {4Kb, 2Mb, 1Gb, ½Tb, ¼Eb,..} leaving the top table (page) covering only ¼
> of this actual space. The 8Kb version used {8Kb, 8Mb, 8Gb, 8Tb, 8Eb,..}
> and uses ½ of the top level table actual space. But level skipping at Root
> makes most of these "go away" most of he time. Level skipping in the
> middle increases the table density.

Yeah.

I looked into it, and making the page table pages bigger than the basic
page size is likely to hurt a lot more than it gains.

My MMU uses base page-sizes of 4K, 16K, and 64K.
For TestKern, I have mostly been going with a page size of 16K.


>>>>
>>>>
>>>> There is also now a flag which controls whether user-mode code is
>>>> allowed to use XMOV instructions. This could allow for restricting a
>>>> program to a 32-bit or 48-bit userland within a larger 96-bit address space.
>>> <
>>> I think the fact that you need ::
>>> a) the XMOV instruction
>>> b) a mode bit associated with it
>>> is a first level indicator that you have gone down the wrong rabbit hole.
>>> <
>>> In contrast, My 66000 is completely flat
>>>
>> Quite possibly...
>> It is neither a particularly simple nor elegant design.
>> I was trying to come up with something "cheap" here.
> <
> simple, elegant, cheap :: choose any 2.

Yeah, I had more been aiming for simple and cheap.

Also, I had taken some amount of influence from a lot of 80s- and
90s-era technology, and the occasional obscure reference here and there.

MitchAlsup

Oct 27, 2021, 6:24:46 PM
On Wednesday, October 27, 2021 at 4:27:08 PM UTC-5, BGB wrote:
> On 10/27/2021 11:46 AM, MitchAlsup wrote:
> > On Wednesday, October 27, 2021 at 1:26:40 AM UTC-5, BGB wrote:
> >> On 10/26/2021 7:20 PM, MitchAlsup wrote:

> > <
> >> I think I understand how a skid butter works in a basic sense (from some
> >> information I could gather), but less clear is how to adapt it to my
> >> existing pipeline.
> >>
> >> The descriptions I read made it seem like it would effectively "stretch
> >> out" the pipeline (stuff goes in one-side, and comes back out a variable
> >> number of cycles later).
> > <
> > done correctly the skid buffer adds nothing to the cycle count of the
> > pipeline when the pipe is not stalled.
> > <
> > It merely provides a landing zone for instructions when there is a stall
> > so as to allow the stalls to catch up to the (now buffered) instructions.
> > This SHOULD take tension out of the computation of the stalls; and
> > alleviate timing problem there-associated.
> >>
> >> But, then, either the pipeline is N-cycles longer, or I need some way to
> >> fork it off for memory accesses and then re-join the results back into
> >> the register-file after the fact.
> > <
> > As I said, the skid buffer adds no cycle count to the pipeline when the
> > pipeline is not stalled.
<
> The descriptions of the mechanisms I could find made it sound like it
> would get stuck in its elongated state until some sort of "NO-OP" passes
> through and it would contract back down.
<
As the stall comes to a conclusion, you sequence the instructions out of the
skid buffer before you inject new stuff into the pipeline. In the case of the D$,
you may want/need the VA to setup the cache for the returning data, ...
So, sometimes the skid buffer replays to the front of its 3-4-beat pipe
and sometimes it simply finishes from the skid buffer. But once the machine
is back completely unstalled, it just acts like 4 flip-flops as a stage in the
pipe with a 4-way mux to the next stage. The reason the skid buffer exists
is so that the pipe can be stall free, and the skid buffer eats all of the stalls.
Do not try to make it do anything else.
>
<snip>
> > Why would the pipeline deadlock. It is either cruising along or stalled
> > waiting on something that WILL happen.
> Usually, either:
> There was a bug somewhere which caused the pipeline to stall indefinitely;
> Something went wrong (such as a sanity check failing) and code triggered
> a breakpoint.
>
> In these cases, the state of the pipeline at the moment the stall
> occurred can help give clued as to what had gone wrong (such as a
> misbehaving instruction, etc).
<
This information is all in the stalled skid buffer--so you COULD take the staging
flops off the scan path,.......
>
> The NOP padding logic in effect has the drawback that it interferes with
> my debugging aide.
<
I always went with a marble--marble present means clock the stage,
marble absent means do not clock the stage. One could consider the
marble an invisible NoOp, but that is beyond what is actually necessary.
<snip>
> >>> If you got rid of ½ of the mux entries, and added pipe delay cycles, would
> >>> it then make timing ?
> >> It also works if I disable RISC-V mode, where:
> >> Baseline pipeline:
> >> Are we doing a JSR/BSR?
> >> Yes, save old PC to LR;
> > This can be done as late as writeback if you carry PS down the pipe.
> Yeah, it is possible. I could treat LR like a normal register update,
> but would have to rework the BSR mechanism (likely by making it decode
> BSR in BJX2 into a similar form used by JAL in RISC-V mode; and updating
> LR via a register store rather than via a side-channel).
<
This sounds like the source of a timing problem, to me.
You have a side band port into LR and you are using it while the end of the
pipe is also writing to R[k].
<
Whereas: if there was only one write port into the file, they would all be done in
writeback.
<
<snip>
> >> Mostly this was because it turns out it was a lot more expensive to
> >> widen everything else out, than to more or less leave everything else
> >> thinking it was still operating in the original 48-bit space and then
> >> using some last-minute bit-pasting hackery.
> > <
> > Suggestion:: make the 48-bit stuff fast, and add cycles to the 96-bit stuff.
> Could be.
>
> I just sorta widened out the TLB, L1 caches, and bus interfaces
> directly. Pretty much everything else remains 48-bits (so, eg, the PC
> advance, branch predictor, ... have no need to be aware of XMOV).
>
>
> The implementation for TLB was, basically:
> Do some hackery to use two adjacent TLB entries like a giant TLB entry;
> Do some hackery to LDTLB so that the entries end up adjacent;
> ...
>
> This effectively cuts the TLB from 4-way to 2-way when working with
> 96-bit addresses, but the alternative would have doubled the Block-RAM
> needed for the TLB (the main alternative having been to widen the TLB
> entries to 256 bits, and ignore the high address bits if XMOV isn't
> being used).
>
> Though, this other option would make more sense if XMOV ends up being
> used as much more than a gimmick (as 2-way has a much higher TLB miss
> rate than 4-way).
<
the above sounds dangerous...............
Conceptually they are stored in memory; but there is no reason they cannot
be promoted into registers (assuming you don't run out.)
> ...
>
> Though, granted, accessing such an array using tag-refs does require
> logic in the handler:
> Are we using the array in the inline format?
<
If you conceptually put them in memory (local stack) and the register
allocator puts them in registers, you don't have to make these kinds of
decisions.
<
> If yes, check bounds directly, do access.
> If no, fetch bounds from memory, access array indirectly.
<
It all becomes optimizations of classical CSE form.
Or fortran slices:: equivalent to::
<
int[] arr = new int[ -12345;678910, by 13]
<
corresponding to the do loop:
for( i = arr.first_index, i < arr.final_index, i+=arr.span )
arr[i] = blah; // look ma, no index checks.
<
I would start with Dope Vectors and then see how to configure the optimizer such
that there is vanishingly small overhead in the real use cases.
>
<snip>
> > You stated above that you only really have a 48-bit VaS with other bits "pasted
> > on". So, with 16K pages, you should be able to have {16Kb, 32Mb, 64GB, and
> > 48-bit VaS.} But I will note, all of these 48-bit entries are wasting 1/3rd of the
> > space being used in the mapping tables. This is a reason I went all the way
> > to 64-bits--get it done and over with so I can share logic across implementations.
> I am also using the extra bits from the page-tables and TLB entries for
> the VUGID system.
>
> But, yeah, page-table entry layout is generally:
> [63:48]: ACLID (or VUGID)
> [47:12]: Physical Page Address
> [11: 0]: Page Flag Bits
>
I am using the PA<63:24> of the PTE as its SSID.
So all sharing of actual page tables results in sharing of the TLB entries,
and all sharing of the PTP entries results in sharing of the
TableWalkAccelerator entries.
{But then I have coherent TLB (and MMU)}
>
> However, the address space widens to 96 bits in the L1 caches and TLBs,
> so in effect, it gives a logical 96-bit virtual address space, with a
> 48-bit physical space.
<
Seems like a serious overkill.
>
> However, to map such large virtual addresses, one may still need a
> deeply nested page-table.
<snip>
> > <
> > simple, elegant, cheap :: choose any 2.
> Yeah, I had more been aiming for simple and cheap.
>
> Also, had taken some amounts of influence from a lot of 80s and 90s era
> technology, and the occasional obscure reference here and there.
<
80s and 90s technology should be converted from 32-bit to 64-bit without
even blinking an eye.
<
Undo the damage of aligned memory in HW; no pairing or sharing of
registers in ISA encoding; include all 12 forms of branch (including NaNs),
...
<
Add IEEE 754-2008 (FMAC) and new rounding mode
<
large flat VAS and PAS.
<
And get rid of the "an instruction is exactly 32-bits in size" but retain the
notion that an instruction can cause at most 1 exception.

Marcus

Oct 28, 2021, 3:24:29 PM
My interpretation and implementation of a skid buffer:

https://github.com/mrisc32/mrisc32-a1/blob/master/rtl/common/skid_buffer.vhd

Essentially:

- When there's no stall, pass through input (unregistered).
- When there's a stall, hold on to the last input (registered).

It does not add cycle count, but it does add gate delay (a MUX).
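
As a rough Verilog rendering of the same idea (a sketch based on the
description above, not the actual mrisc32-a1 code; port names invented):

  module skid_buffer #(parameter W = 64) (
      input              clk, rst,
      input      [W-1:0] inData,  input  inValid,  output inReady,
      output     [W-1:0] outData, output outValid, input  outReady
  );
      reg [W-1:0] skidData;
      reg         skidValid;

      // Pass the input through while empty; present the held value
      // during a stall (the extra MUX is the added gate delay).
      assign outData  = skidValid ? skidData : inData;
      assign outValid = skidValid | inValid;
      assign inReady  = !skidValid;

      always @(posedge clk)
      begin
          if (rst)
              skidValid <= 1'b0;
          else if (outReady)
              skidValid <= 1'b0;        // downstream drained the buffer
          else if (inValid && !skidValid)
          begin
              skidData  <= inData;      // stall: hold on to the last input
              skidValid <= 1'b1;
          end
      end
  endmodule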

/Marcus

MitchAlsup

Oct 28, 2021, 4:05:53 PM
You want at least as many skid buffer entries as the length of the non-stalling
pipeline stage count: 3 or 4 for a skid buffer for an AGEN/DCache/ALIGN pipeline.
<
We just maintained a unary pointer for the head and tail with a transition
vector of 0001->0010->0100->1000->0001 with the added constraint that
head could not equal tail (over-full).
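<
In Verilog the unary pointers are just one-hot registers advanced by a
1-bit rotate (a sketch; names invented):

  reg  [3:0] head, tail;                         // one-hot entry pointers
  wire [3:0] headNext = { head[2:0], head[3] };  // 0001->0010->0100->1000
  wire full = (headNext == tail);                // head may not catch tail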
<
We used the skid buffer as a register (the 4 entries were already on the
scan path) and added a mux (tail) to select which entry goes to the
successive stage.
>
> /Marcus

BGB

Oct 28, 2021, 4:51:21 PM
OK. This is a bit simpler than some of the stuff I read, which described
something more like a variable-length FIFO (under the same term).

I guess something like this could help reduce the latency of stall
propagation.


Harder challenge though is trying to figure out how to (effectively)
implement a non-stalling L1 D$.

Current front-runner idea is:
No-Miss:
EX1, AGU
EX2, L1 does its thing (Hits)
EX3, Get result
WB, Stores Result

Miss:
EX1, AGU
EX2, L1 does its thing (Misses)
EX3, Gets a "Miss Flag"
WB, Marks register as "Not Ready Yet"

So, ID2 would see the Not-Ready-Yet flag, and trigger the interlock
mechanism (inserts NOPs into the EX stages).
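
Something like a per-register ready mask could model this (a C sketch
under an assumed 64-register encoding; not the actual Verilog):

    #include <stdint.h>

    /* One ready bit per GPR; WB clears a bit on a miss, and the refill
       path sets it again once the L1 writes the value back. */
    static uint64_t reg_ready = ~(uint64_t)0;

    /* ID2: hold the bundle (issue NOPs into EX1) until every source
       register it reads has its ready bit set. */
    int id2_must_interlock(const int *src, int n)
    {
        for (int i = 0; i < n; i++)
            if (!((reg_ready >> src[i]) & 1))
                return 1;
        return 0;
    }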

At some point, the L1 finishes what it was doing, and then the result
gets written back into the register file. How to best approach this part
at the moment is less obvious.

If the pipeline is interlocked, could just stick it into the pipeline.
If there is a stall, then it needs to be handled by the register file
itself.

One option could be to overload Lane 3's write port:
ZZR or Stalled, use this port.
Otherwise: We may have a problem...

For a possible way to implement WEX-6W, I had considered a 4th
write-port, but not sure I want to go this direction yet (the relative
cost of adding write ports is "pretty steep").

Likewise, putting the value into a FIFO and using forwarding would also
be expensive, but probably less so than the 4th port. Still need to get
the value written back somehow though.

Another possibility being that the GPR File can also stall the pipeline,
which it can then use for internal operations like "getting stuff
written into the register file" (this could possibly also be used for
the "conjoined register files" approach to WEX-6W as well).

Though, it is also possible that the "4th write port" could also be
implemented as a "virtual port" which stalls the pipeline if needed to
complete its action (eg: via stealing Lane 3's write port).



If another memory request comes along, the pipeline would still need to
stall.

Problem: This wouldn't gain much, as much of the miss-heavy code also
tends to have lots of memory operations.

The likely strategy would be to rework the L1 D$ such that the
miss-handling is independent of the "front-end" request-handling logic
(similar has already been done with the L2 cache).

Though, this would likely mean reworking the L1 to handle misses via a
FIFO, and likely still needing to trigger a stall if a new request
clashes with the request in this FIFO (same cache line);
Since allowing such an operation could likely violate sequential
consistency assumptions;
Would also still need to stall if the FIFO is full.
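
The clash check itself would be cheap; in C terms, something like this
(line size and FIFO depth are assumptions for the sketch):

    #include <stdint.h>

    #define LINE_SHIFT 5   /* assumed 32-byte cache lines */

    /* Stall a new request if it hits the same cache line as any miss
       still pending in the FIFO (letting it proceed could violate the
       sequential-consistency assumptions mentioned above). */
    int clashes(const uint64_t *pend, int n, uint64_t addr)
    {
        for (int i = 0; i < n; i++)
            if ((pend[i] >> LINE_SHIFT) == (addr >> LINE_SHIFT))
                return 1;
        return 0;
    }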

It is likely though that a FIFO much longer than 2 or 4 would be
impractical. Also, would need a mechanism to feed resolved missed
requests back through the Load/Store mechanism, just now needing to send
loaded results back to the Register File (rather than sending them to EX3).

This seems a bit complicated, so not faced off against this part yet.

It was simpler with the L2, as the requests just sorta cycled around the
ring, and could be either handled immediately, or skipped (would try
again the next time they come around the ring).

A ring-FIFO could also be used for the L1, but ideally want a mechanism
that can both give 1-cycle per operation throughput on hits as well as
deal well with misses (if I lose the 1-cycle throughput on hits, any
possible gains on misses are now effectively moot).


This is looking less like something within the reach of a simple
retrofit onto the existing pipeline, and more like possibly needing to
effectively re-imagine the L1's load/store mechanism around the use of a
FIFO and interlocks...


> /Marcus

MitchAlsup

Oct 28, 2021, 5:46:25 PM
In general, for a LD you read the tags, the data, and the TLB in EX2
Then in EX3 you check the tag against the TLB and determine hit.
Determination of hit will take about ½ of the cycle all by itself.
<
In general, for a ST you read the tags, the TLB and write the end of the
store pipeline into the (not being otherwise used) Data Cache RAMs.
In EX3 you check for a hit, and if you got a hit, you enable the store
to write when the next store enters the pipeline.
<
Never try to perform the tag check in the cycle you read the data cache !!
<
If you have a skid buffer, the skid buffer control is the only place the
miss is directed in the cycle it is calculated. The skid buffer will absorb
the miss ready to replay it when the stall goes away. The rest of the pipeline
continues to make forward progress and has a skid buffer entry available
for each instruction dropped into the data cache pipeline.
<
Effectively, the D$ stall propagates in EX4! not in EX3.
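<
A behavioral C sketch of the delayed-store idea (the names and the
one-deep pending buffer are my assumptions, not a particular design):

    #include <stdint.h>

    #define LINES 256
    static uint32_t dcache_data[LINES];

    typedef struct { uint32_t idx, data; int valid; } pend_st;
    static pend_st pending;   /* the "end of the store pipeline" */

    /* A ST probes tags/TLB in EX2 and learns hit/miss in EX3.  Its data
       write happens later, in the cycle the *next* store enters: that
       store is only probing tags, so the data RAM port is free. */
    void store_enter(uint32_t idx, uint32_t data, int hit)
    {
        if (pending.valid)
            dcache_data[pending.idx % LINES] = pending.data;
        pending.idx   = idx;
        pending.data  = data;
        pending.valid = hit;   /* only a hit is allowed to commit */
    }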
>
> So, ID2 would see the Not-Ready-Yet flag, and trigger the interlock
> mechanism (inserts NOPs into the EX stages).
>
> At some point, the L1 finishes what it was doing, and then the result
> gets written back into the register file. How to best approach this part
> at the moment is less obvious.
<
The skid buffer is set up to finish the memory ref; all it needs is for
the stall to go away. It captures the data when it gets written
to the D$, aligns in the subsequent cycle (as normal), and writes in WB
(which is now the skid-buffer-recovered normal).
<
No new or different RF porting.
>
> If the pipeline is interlocked, could just stick it into the pipeline.
> If there is a stall, then it needs to be handled by the register file
> itself.
>
> One option could be to overload Lane 3's write port:
> ZZR or Stalled, use this port.
> Otherwise: We may have a problem...
>
> For a possible way to implement WEX-6W, I had considered a 4th
> write-port, but not sure I want to go this direction yet (the relative
> cost of adding write ports is "pretty steep").
>
> Likewise, putting the value into a FIFO and using forwarding would also
> be expensive, but probably less so than the 4th port. Still need to get
> the value written back somehow though.
>
> Another possibility being that the GPR File can also stall the pipeline,
> which it can then use for internal operations like "getting stuff
> written into the register file" (this could possibly also be used for
> the "conjoined register files" approach to WEX-6W as well).
>
> Though, it is also possible that the "4th write port" could also be
> implemented as a "virtual port" which stalls the pipeline if needed to
> complete its action (eg: via stealing Lane 3's write port).
>
>
>
> If another memory request comes along, the pipeline would still need to
> stall.
<
Look into hit-under-miss Cache pipelining.
Look into miss-under-miss cache pipelining.
Each buys about ½ of what got lost by the stalls.
>
> Problem: This wouldn't gain much, as much of the miss-heavy code also
> tends to have lots of memory operations.
<
Then you have the scheduling problem of multiple read-out and write-in
beats of moving cache lines into and out of the caches.
>
> The likely strategy would be to rework the L1 D$ such that the
> miss-handling is independent of the "front-end" request-handling logic
> (similar has already been done with the L2 cache).
>
> Though, this would likely mean reworking the L1 to handle misses via a
> FIFO, and likely still needing to trigger a stall if a new request
> clashes with the request in this FIFO (same cache line);
> Since allowing such an operation could likely violate sequential
> consistency assumptions;
> Would also still need to stall if the FIFO is full.
<
So, until you build a memory dependence matrix, you end up being in the
position of having to perform memory references in program order.
In these situations, miss-under-miss caching is about as good as you
can do.
>
> It is likely though that a FIFO much longer than 2 or 4 would be
> impractical. Also, would need a mechanism to feed resolved missed
> requests back through the Load/Store mechanism, just now needing to send
> loaded results back to the Register File (rather than sending them to EX3).
<
I allot 8 miss buffers, but you are right: the partially-ordered D$ can only generate
1-2-3-4 requests, the I$ 1-2, and the MMU can use the other 1-2. Snoops might
need to borrow one of these under some circumstances, too.
>
> This seems a bit complicated, so not faced off against this part yet.
>
> It was simpler with the L2, as the requests just sorta cycled around the
> ring, and could be either handled immediately, or skipped (would try
> again the next time they come around the ring).
>
> A ring-FIFO could also be used for the L1, but ideally want a mechanism
> that can both give 1-cycle per operation throughput on hits as well as
> deal well with misses (if I lose the 1-cycle throughput on hits, any
> possible gains on misses are now effectively moot).
>
>
> This is looking less like something within the reach of a simple
> retrofit onto the existing pipeline, and more like possibly needing to
> effectively re-imagine the L1's load/store mechanism around the use of a
> FIFO and interlocks...
<
Finally, you reach the point where you know the direction to take.
>
>
> > /Marcus

BGB

Oct 28, 2021, 6:24:03 PM
Yeah, something like this could be possible.

Could capture the values from the last "Non-NOP" bundles, and display those.


> <snip>
>>>>> If you got rid of ½ of the mux entries, and added pipe delay cycles, would
>>>>> it then make timing ?
>>>> It also works if I disable RISC-V mode, where:
>>>> Baseline pipeline:
>>>> Are we doing a JSR/BSR?
>>>> Yes, save old PC to LR;
>>> This can be done as late as writeback if you carry PS down the pipe.
>> Yeah, it is possible. I could treat LR like a normal register update,
>> but would have to rework the BSR mechanism (likely by making it decode
>> BSR in BJX2 into a similar form used by JAL in RISC-V mode; and updating
>> LR via a register store rather than via a side-channel).
> <
> This sounds like the source of a timing problem, to me.
> You have a side band port into LR and you are using it while the end of the
> pipe is also writing to R[k].
> <
> Whereas: if there was only one write port into the file, they would all be done in
> writeback.
> <

Possibly, though two instructions trying to modify LR at the same time
from different parts of the pipeline isn't super likely.

Do have to avoid adding too much latency though, as while it might be
cheaper to add extra forwarding delays (to make passing timing easier),
it tends to cause side-channel registers to immediately revert to
whatever was their prior state.

Though, this tends to be the primary difference between GPRs and SPRs
and CRs:
SPRs and CRs generally have side-channels, whereas the GPRs do not.


> <snip>
>>>> Mostly this was because it turns out it was a lot more expensive to
>>>> widen everything else out, than to more or less leave everything else
>>>> thinking it was still operating in the original 48-bit space and then
>>>> using some last-minute bit-pasting hackery.
>>> <
>>> Suggestion:: make the 48-bit stuff fast, and add cycles to the 96-bit stuff.
>> Could be.
>>
>> I just sorta widened out the TLB, L1 caches, and bus interfaces
>> directly. Pretty much everything else remains 48-bits (so, eg, the PC
>> advance, branch predictor, ... have no need to be aware of XMOV).
>>
>>
>> The implementation for TLB was, basically:
>> Do some hackery to use two adjacent TLB entries like a giant TLB entry;
>> Do some hackery to LDTLB so that the entries end up adjacent;
>> ...
>>
>> This effectively cuts the TLB from 4-way to 2-way when working with
>> 96-bit addresses, but the alternative would have doubled the Block-RAM
>> needed for the TLB (the main alternative having been to widen the TLB
>> entries to 256 bits, and ignore the high address bits if XMOV isn't
>> being used).
>>
>> Though, this other option would make more sense if XMOV ends up being
>> used as much more than a gimmick (as 2-way has a much higher TLB miss
>> rate than 4-way).
> <
> the above sounds dangerous...............

The intent is encoded explicitly into the TLBE's loaded via LDTLB (so, a
high-half will not be confused for a low-half or vice versa).

I had previously considered adding a LDXTLB instruction, but ended up
not adding it because it would have required some way to throw 256 bits
at the problem all at the same time, rather than two LDTLB's which are
able to load 128 bits at a time.

Once the two halves arrive at the MMU, it then loads both of them into
the TLB.

Internally, the LDTLB is essentially a special type of MMIO operation
which targets the TLB, and so needed to fit through the existing path.

Some logic was also added to the TLB such that if a TLB load would cut a
2-part entry in half, it will discard both halves.
I guess this could be possible, but would assume that the array bounds
checking was handled at the 3AC IR level (rather than at the
instruction-emitter level).

Actually, the handling of some of this gets messy, as some parts are
handled in the front-end (via being transformed into function calls).

Other parts are handled in logic for emitting sequences of machine-code
instructions (treating the runtime calls like pseudo-instructions, and
then setting a flag that warns that any scratch registers within this
basic block may be nuked without warning).

This is more-or-less also how things like the multiply and divide
related runtime calls are handled, ...
It is possible; though I guess it might be better if they were not
handled by generating runtime calls in the instruction emitter.

>>
> <snip>
>>> You stated above that you only really have a 48-bit VaS with other bits "pasted
>>> on". So, with 16K pages, you should be able to have {16Kb, 32Mb, 64GB, and
>>> 48-bit VaS.} But I will note, all of these 48-bit entries are wasting 1/3rd of the
>>> space being used in the mapping tables. This is a reason I went all the way
>>> to 64-bits--get it done and over with so I can share logic across implementations.
>> I am also using the extra bits from the page-tables and TLB entries for
>> the VUGID system.
>>
>> But, yeah, page-table entry layout is generally:
>> [63:48]: ACLID (or VUGID)
>> [47:12]: Physical Page Address
>> [11: 0]: Page Flag Bits
>>
> I am using the PA<63:24> of the PTE as its SSID.
> So all sharing of actual page tables results in sharing of the TLB entries,
> and all sharing of the PTP entries results in sharing of the TableWalkAccelerator
> entries.
> {But then I have a coherent TLB (and MMU).}

OK.

>>
>> However, the address space widens to 96 bits in the L1 caches and TLBs,
>> so in effect, it gives a logical 96-bit virtual address space, with a
>> 48-bit physical space.
> <
> Seems like a serious overkill.

Probably, though I used some "cost-saving trickery" in the L1s.


>>
>> However, to map such large virtual addresses, one may still need a
>> deeply nested page-table.
> <snip>
>>> <
>>> simple, elegant, cheap :: choose any 2.
>> Yeah, I had more been aiming for simple and cheap.
>>
>> Also, had taken some amounts of influence from a lot of 80s and 90s era
>> technology, and the occasional obscure reference here and there.
> <
> 80s and 90s technology should be converted from 32-bit to 64-bit without even
> blinking an eye.
> <
> Undo the damage of aligned memory in HW; no pairing or sharing of
> registers in ISA encoding; include all 12 forms of branch (including NaNs);
> ...
> <
> Add IEEE 754-2008 (FMAC) and new rounding mode
> <
> large flat VAS and PAS.
> <
> And get rid of the "an instruction is exactly 32-bits in size" rule, but retain the
> notion that an instruction can cause at most 1 exception.
>

It was 80s/90s tech, but not the classic RISC architectures, more stuff
like:
NES, SNES, Sega Genesis, Commodore 64, IBM PC, ...
Or: 8086, 6502, 65C816, M68K, ...
Then, 90s tech like: SuperH, MSP430, Itanium, ...

All of the fun ways people could add features while trying to keep costs
under control (and the ways these consoles could pull off most of their
graphical stuff with what was effectively "text mode on crack", *).


*: Comparably, BJX2's display hardware is far less elaborate. I ended up
going more for compressed bitmap modes, rather than multi-planar
text-modes with hardware scroll/tilt and hardware sprites and similar. I
could in premise, add hardware sprites, but not a strong use-case as of yet.

Also, most ways to cost-effectively implement these would somewhat limit
the number of layers and on-screen sprites (say, 2 layers and 4 sprites
per scanline; possibly ~ 16 sprites on-screen), ... Idea being to
process 1 layer and 1 sprite per clock cycle, limited partly by the
number of clock-cycles per VGA pixel.



And, I guess, SuperH, MSP430, and the M68K were all influenced by the
PDP-11, ... (Granted, despite me taking some design influences from
these ISAs, where I have ended up is quite different from a PDP-11).


Also, not to forget the 8086 and 65C816, ...
PCH ~= CS ~= PB
GBH ~= DS ~= DB

Actually, thinking about it, XMOV has a lot more in common with the
addressing in the 65C816 and friends than it does with the 8086. Just,
much bigger than the 65C816.


But, I am doing it this way because there are cases where larger address
spaces, or multiple virtual address spaces, are desirable. Likewise, a
mode which moved all addressing over to explicit 128-bit pointer types
would be too much of a burden on software.


Also, for most non-C languages, the 48-bit addresses with 16-bits for
tag data is likely to be better than pointers which only contain
addresses. There are likely to be more programs which need things like
efficient dispatch for dynamic type-checks than those which will need
more than 256TB (and those which do need more than this can probably
also afford to use bigger pointers).

In theory, one could also directly load/store pointers in NaN-boxed form
as well, though personally I feel NaN-boxing is a bit of a waste:
It is much better if one can have 62 bit fixnum and flonum types, ...

LSB tagging would have had its own tradeoffs, and needing to access an
object's memory to be able to do type-based dispatch is undesirable (if
one can instead dispatch based on the pointer itself).
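
For example, dispatching on the high 16 tag bits is just shifts and
compares on the pointer itself (the tag values below are made up):

    #include <stdint.h>

    #define TAG_OF(p)   ((uint16_t)((p) >> 48))
    #define ADDR_OF(p)  ((p) & 0x0000FFFFFFFFFFFFull)

    /* hypothetical tag assignments, purely for illustration */
    enum { TAG_OBJECT = 0x0001, TAG_FIXNUM = 0x0002, TAG_CONS = 0x0003 };

    int is_fixnum(uint64_t ptr)
    {
        /* no memory access needed: the type rides in the pointer */
        return TAG_OF(ptr) == TAG_FIXNUM;
    }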


...

robf...@gmail.com

Oct 28, 2021, 7:19:52 PM
>I could in premise, add hardware sprites, but not a strong use-case as of yet.

I think most modern graphics controllers have some form of sprites. Sprites are useful to implement the mouse pointer, cursors and spinners for instance.

>Also, most ways to cost-effectively implement these would somewhat limit
>the number of layers and on-screen sprites (say, 2 layers and 4 sprites
>per scanline; possibly ~ 16 sprites on-screen), ... Idea being to
>process 1 layer and 1 sprite per clock cycle, limited partly by the
>number of clock-cycles per VGA pixel.

Dual ported block ram can be used to store sprite image data so fetching the data for display does not have to come through the dram. Multiple block rams can be accessed in parallel. There is lots of bandwidth available then for simultaneous overlapping display with lots of colors.

Speaking of 96-bit addressing: for Thor2021, my latest project, a segmented address system is used that allows the physical address to be larger than 64-bit while using a 64-bit or smaller virtual address. I figure needing more than a 64-bit address is unlikely, but Thor has a way to support a larger space if needed. It is difficult to bolt a larger address space on at a later time. The segment descriptor base field has room for up to 128 bits, 64 of which are currently unused.

Like an x86, by setting up the segments appropriately a flat memory model can be used. Code address registers (link registers, ip, etc.) are 96 bits in size to accommodate a 32-bit selector. Most of the time, using “near” addressing, only the low-order 64 bits of the register need be saved and restored.

The jump and call instructions only support a 34-bit address field, so to use more, a code address register must be loaded with the target and a register-based jump performed. An issue with supporting an address space over 64 bits is that getting the linear or physical address requires a pair of high/low instructions: LLAL and LLAH (load linear address low, and load linear address high), in the vein of a LEA instruction.

With 64 GPRs and selector-register specifiers for loads and stores, many instructions are 48 bits in size.


BGB

Oct 29, 2021, 1:38:08 AM
On 10/28/2021 6:19 PM, robf...@gmail.com wrote:
>> I could in premise, add hardware sprites, but not a strong use-case as of yet.
>
> I think most modern graphics controllers have some form of sprites. Sprites are useful to implement the mouse pointer, cursors and spinners for instance.
>

Yeah.

I had considered possibly adding hardware sprites for this purpose.

However, the FPGA board I am using does not offer a good way to use a
mouse, short of getting PMod with a PS2 or USB interface.

The board does have a USB -> Virtual PS2 interface, but it only really
works for hooking up a USB keyboard or similar (it can't deal with
hooking up both a mouse and keyboard). Likewise, there doesn't appear to
be a way (that I am aware of) to access the USB from the FPGA directly.
So, would likely need an additional USB interface via PMod or similar.


>> Also, most ways to cost-effectively implement these would somewhat limit
>> the number of layers and on-screen sprites (say, 2 layers and 4 sprites
>> per scanline; possibly ~ 16 sprites on-screen), ... Idea being to
>> process 1 layer and 1 sprite per clock cycle, limited partly by the
>> number of clock-cycles per VGA pixel.
>
> Dual ported block ram can be used to store sprite image data so fetching the data for display does not have to come through the dram. Multiple block rams can be accessed in parallel. There is lots of bandwidth available then for simultaneous overlapping display with lots of colors.
>

At present, I have a VRAM cache, which functions in a similar way to an
L1 cache. It streams contents to and from DRAM, and provides an MMIO
interface for modifying VRAM contents (itself partly a holdover; it was
either this or modify my test programs to target a framebuffer in RAM).

This replaced the original block-RAM framebuffer, and technically now
allows higher resolutions (such as 640x480), and the freed-up Block-RAM
allowed expanding the L2 cache to 256K.


Things like Font RAM / etc are still stored in Block-RAM, and any
sprites would likely use either Font-RAM or another specialized
interface. Though, might need a wider interface than that currently used
for font glyphs (say, 8x8x2, allowing for 3 colors per cell).


However, the likely cheapest way to implement some of this in hardware
would likely be:
Cycle through each layer, at one layer per clock-cycle, and then fill in
the pixel (if the layer pixel is non-transparent, replace whatever was
the prior pixel value).

Ability to scroll layers with no parallax would be simpler (offset
register added to raster position), whereas inter-layer parallax (like
on the Genesis) would be a bit more complicated as it would require the
ability to deal with each layer fully independent of the others (and
would exclude some "quick and dirty" possibilities).


Sprites would similarly cycle one-per-cycle, checking the raster
position against the sprite, and filling in a pixel from the sprite if
there is a hit.

There would likely need to be a limit of how many sprites could be
active per-scanline, but the number of sprites on-screen could be larger
if they are sorted by raster order. Say, there is a rover, and whenever
the raster position is past the bottom-right corner of the current
sprite, it will try to advance the rover.

Depending on how it is implemented, there could be graphical glitches in
cases where multiple sprites overlap.

Avoiding some of these issues would likely require evaluating multiple
sprites per clock-cycle.


Possible use-cases:
Mouse cursor;
Could do side-scrolling games with somewhat higher resolutions and
faster refresh rates than what is currently possible with bitmap graphics.



> Speaking on 96-bit addressing. For Thor2021, my latest project, a segmented address system is used that allows the physical address to be larger than 64-bit while using a 64-bit or smaller virtual address. I figure needing more than a 64-bit address is unlikely, but Thor has a way to support a larger space if needed. It is difficult to bolt a larger address space on at a later time. The segment descriptor base field has room for up to 128 bits. 64 of which are currently unused. Like an x86, by setting up the segments appropriately a flat memory model can be used. Code address registers (link registers, ip, etc.) are 96 bits in size to accommodate a 32-bit selector. Most of the time, using “near” addressing, only the low order 64-bits of the register need be saved and restored. The jump and call instructions only support a 34-bit address field so to use more a code address register must be loaded with the target and register based jump performed. An issue with supporting an address space over 64-bits is getting the linear or physical address requires a pair of high/low instructions. LLAL and LLAH (load linear address low, and load linear address high), along the vein of a LEA instruction.
> With 64-GPRs and selector registers specs for loads and stores many instructions are 48-bit in size.
>

As noted, BJX2 uses 64-bit GPRs, pointers use 64-bits of storage, ...
Just the high 16 bits were left for use as tag bits, leaving the low 48
for the address.

Initially, this was intended to be an implementation-level detail, but
has ended up being "canonized" to some extent.

The TLB format, page table structures, ... all ended up being based
around the assumption of a 48-bit address, and even this was difficult
to keep working reliably on a timing front, with the funkiness of using
a 33 bit index/displacement for load/store address calculations.

Similarly, there are now some Abs48 branch ops, ...


Or:
Base[47:0] + SExt48(Index[32:0])<<Sc
Where: Sc=0/1/2/3.

So, it is really a 36-bit add for the low-part, and carry-select for the
high 12 bits.



I have been experimenting with some cases to allow larger logical
pointers (eg: 60 bit), but the situation isn't great from the timing
front. Granted, it is a little faster if one doesn't propagate the carry
from the low part.

I guess the more notable feature was that the L1 caches and TLB survived
the larger virtual address space with only a fairly minor increase in
resource cost.


I also figured I may as well add this stuff in the near term, as it is
likely that if/when the project became more mature, such an expansion
would no longer be viable.

And, while 48-bits is plenty for now, if my ISA survives, it is possible
it may be hitting its limit in 20 to 40 years or so.


I sort of initially expected that I would implement it as a test, it
would turn out to be too expensive and/or ruin timing, and then I could
put it into a corner and forget about it for a while.

However, surprisingly enough, it seems like the FPGA hardly noticed...

Similarly, some ISA design provisions were made for supporting 96 bit
physical addressing as well (though, the L1 D$ will need further tweaks
to deal with a larger physical space).



I don't expect this feature will make much difference for most normal
applications (they stick with the same pointers as before).

But, for those things which need to go bigger, 96 bits should be enough
to last a little while...

robf...@gmail.com

Oct 31, 2021, 12:20:46 AM
>Cycle through each layer, at one layer per clock-cycle, and then fill in
>the pixel (if the layer pixel is non-transparent, replace whatever was
>the prior pixel value).

I think it is possible to handle all the layers in a single clock, which greatly
simplifies things. For my sprite controller, I have just a display priority tree
implemented as a loop. All the sprites test for active display at the same time
during the same clock cycle (lots of comparators, but no cascaded logic), and
then one is selected using a priority tree.

Simplified:
integer n;
always @* begin
  // default: pass the external input through
  pix_out = external_input;
  // priority tree: the lowest-numbered active sprite wins
  for (n = 31; n >= 0; n = n - 1)
    if (sprite_active[n])
      pix_out = sprite_data[n];
end

A display of 32 sprites seems to work (all can be on the same line at the same
time due to the enormous bandwidth provided by block-ram (32 read ports)).
Although the controller is parameterized so that fewer sprites may be used.
Sprite data is DMA’d into caches optionally during vertical blank or on demand.
If the sprite images are stable for several frames then the dram bandwidth used
is not so much.
A slow video clock, 40MHz is in use (800x600 display), so there may be timing
issues at a higher clock rate. That can be handled by using fewer sprites and
more controllers.
I am using 56x36x16bpp (32kcolor+alpha) sprites. The controller is very flexible
with programmable widths and heights and color depth. There is 1 32kBit block
ram per sprite, so the image may be up to 4096 bytes in size.

I am building a cpu to test the sprite controller and other I/O and graphics cores.
I managed to get a very simple demo working using the rfPower core and have
a short video of it. Bugs in the processor limit the demo.

>And, while 48-bits is plenty for now, if my ISA survives, it is possible
>it may be hitting its limit in 20 to 40 years or so.

I figure it may take 20 years to get my cpu core working well :)


BGB

Nov 11, 2021, 5:21:07 PM
( This got lost in the mix. Stuff going on, and Thunderbird being
crash-prone doesn't help. )
Doing it all at the same time is not cheap though; this is the main
issue here.

But, yeah, I also have an FM Synthesis module, which has 16 FM channels,
but in effect only processes one channel at a time, and then accumulates
the results.

This works out a bit easier though because one has a lot more cycles to
loop though the channels.


My current display output module runs at 50 MHz, producing an output
pixel every 2 clock cycles (or one pixel every 4 clocks at 320 x 200).
Some other resolutions are supported, but at non-standard timings
(generally they run on the same clock).


So, for a 320x200 4-sprite mode, one can cycle 4 sprites, and select any
pixel hits (or transparent if no hit).


>> And, while 48-bits is plenty for now, if my ISA survives, it is possible
>> it may be hitting its limit in 20 to 40 years or so.
>
> I figure it may take 20 years to get my cpu core working well :)
>

Time from when I started messing with all of this:
~ 5 years.

Basically works, though the amount of time I spent fiddling with
debugging stuff and fiddling with my C compiler gets tiring.


Though, did recently see a video where someone was messing around with a
386SX-25, and it apparently only got a Dhrystone score of ~ 4800 (4.8k),
which implies that the ~ 69k I am currently getting runs circles around
a 386SX.

Still lame that Quake framerates kinda suck at 50MHz though, but Quake
would be a little more playable at 100MHz.



Brief timeline of my life:
before 2000: Early Years (*1)
2000-2003: High School; wrote a Scheme interpreter and similar.
2004-2008: Misc Projects (*2)
2009-2013: BGBTech Engine
2014-2016: BGBTech2 Engine
2017/2018: BJX1
2019-2021: BJX2

*1: My coding skills during this era were pretty bad, and nothing all
that notable was written.

*2: I worked on stuff with no particular direction, but a lot of stuff I
was working on at this time was rolled together into my later projects.

Things like both BGBScript and BGBCC emerged in this area.

I had also experimented with things like a 3D chat room (built on top of
XMPP), which was part of the codebase that would later become the
BGBTech engine. There was XMPP and XML-RPC, and some of the parts
written for XML-RPC were later reused as part of the BGBScript-VM and BGBCC.

The original DOM was replaced fairly early, as it didn't scale very
well, so:
Original DOM was libxml2 or similar IIRC;
Replacement DOM was a partial mock-up, but stripped down.

In the original BSVM and BGBCC, I later replaced this with a somewhat
redesigned system "BCCX", which somewhat abstracted away a lot of the
DOM machinery, and built new API interfaces better suited to ASTs.

Around a similar time to BGBCC being written (based originally on a fork
of the BSVM), a re-implementation of the BSVM was created mostly
building on top of a Scheme derived core.

The Scheme-core-based version of the BSVM was used in the BGBTech engine.


A redesigned core was used for BGBScript2, which was used in the
BGBTech2 engine, which had used a JSON based AST system (another version
of the language was implemented for BGBCC, and is much less mature
because I am mostly focusing on C).

I could potentially also support TypeScript, since the core language for
TypeScript isn't too far off from my own languages (both being part of a
larger soup of "languages more-or-less descended from JavaScript").


Some other changes (over time):
Eliminated the use of bare strings in many contexts;
Eliminated the use of linked lists for attributes;
Attributes could hold literal value types (rather than strings);
...

Just recently, I partly redesigned BCCX (now "BCCX2") which mostly
eliminated the use of linked lists for the nodes (replacing it with an
array-indexing abstraction).

This was mostly for performance and memory-footprint reasons.

As can be noted, the logical structure of BGBCC ASTs is loosely
descended from the way data is represented in XML-RPC.


Can also note that despite the obvious superficial difference between
XML Nodes and JSON objects, the BCCX2 system uses a fairly similar
structure internally to what was used in the BS2VM for dynamic objects,
namely:
Objects have a fixed number of key/value fields (16b key, 64b value);
Objects split up B-Tree style when one exceeds the number of key/value
pairs.

Though, there was a minor difference:
BCCX and BCCX2 uses a 4b type tag and 12b symbol index;
BS2VM had used a flat 16b table (with a combined symbol, flags, and
type-descriptor).

In concept though, one could use a 16-bit index into a 64-bit descriptor
table, say:
(15: 0): Interned index into a table of symbol-name strings.
(27:16): Interned index into a table of type signature strings.
(39:28): Interned index into a table of flag/attribute strings.
(63:40): Basic Modifier Flags (Unpacked)
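
Packing that descriptor in C would look roughly like this (a sketch;
field widths as listed above, function name is mine):

    #include <stdint.h>

    static inline uint64_t mk_desc(unsigned name, unsigned type,
                                   unsigned attr, unsigned flags)
    {
        return  (uint64_t)(name  &   0xFFFFu)        |  /* (15: 0) */
               ((uint64_t)(type  &    0xFFFu) << 16) |  /* (27:16) */
               ((uint64_t)(attr  &    0xFFFu) << 28) |  /* (39:28) */
               ((uint64_t)(flags & 0xFFFFFFu) << 40);   /* (63:40) */
    }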

A similar descriptor format could potentially also be used for static
class definitions (with some extra metadata to locate string tables and
similar, *). Generally, static classes follow a struct-like layout (with
contents for any derived classes being glued onto the end of those of
parent classes).

Though one possible source of concern is if this can cross paths with
the C toplevel (the C toplevel cares not for respecting 16-bit index
limits). Works better for class/instance fields though, as there tends
to be less variability here (a lot more reused field names, ...).

*: The ClassInfo for an object can have a self-reference RVA which can
be used to find the image base and similar, which can be used in-turn to
locate other metadata for the dynamic runtime.


Though, the system used by BCCX is faster for field lookups by name as
one can simply do a masked-compare. In the BS2VM, it was necessary to
already know the full type/etc of the field one wanted to access, which
put some constraints on the handling of dynamic objects (one could
either have fields which could be determined statically, or which were
"purely dynamic" and simply contained "variant").

An intermediate option would be, say, 2.14, with a 14-bit bare symbol
name, and a 2-bit tag:
00: Variant / ObjRef Type
01: Integer (anything 'long' and smaller)
10: Double
11: Complex Descriptor (works as above).

The compiler/runtime then figures out what category of value it wants in
advance, but can otherwise use straightforward name-driven lookups (with
up to 16K symbol names). Lookups would still be bare 16-bit equality
checks as far as the runtime functions are concerned (the compiler
doesn't need to care that much; these are opaque 16-bit handles as far
as the compiler is concerned, with their values having been synthesized
by the runtime library).
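
A sketch of the 2.14 keys and the equality-scan lookup (macro and
function names are mine, not from BGBCC):

    #include <stdint.h>

    enum { T_VARIANT = 0, T_INTEGER = 1, T_DOUBLE = 2, T_COMPLEX = 3 };

    /* 2-bit category tag in the top bits, 14-bit symbol name below */
    #define KEY(tag, sym)  ((uint16_t)(((tag) << 14) | ((sym) & 0x3FFF)))

    /* The runtime's lookup is a bare 16-bit equality scan over the
       object's key slots; the compiler treats keys as opaque handles. */
    int find_field(const uint16_t *keys, int n, uint16_t key)
    {
        for (int i = 0; i < n; i++)
            if (keys[i] == key)
                return i;
        return -1;
    }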

Main thing that would affect the compiler (directly) would be if it
later turned out that 16-bit dynamic object symbols were insufficient.

...
