My original bus was fairly slow:
Put a request on the bus; as it propagates, each layer of the bus
holds the request until it reaches the destination, which sends back an
OK signal. The OK signal returns back up the bus to the sender, and the
sender then switches to sending an IDLE signal. The whole process
repeats as the bus "tears down", and when it is done, the OK signal
switches to READY, and the bus may then accept another request.
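As a rough illustration (a C sketch, not the actual Verilog; the signal
names and layer count here are made up), the full round trip looks
something like:

#include <stdio.h>

enum opm { OPM_IDLE, OPM_REQ };          /* sender-side signal */
enum ok  { OK_READY, OK_BUSY, OK_OK };   /* response-side signal */

#define LAYERS 4   /* hypothetical bus depth */

int main(void)
{
    enum opm opm_sig[LAYERS + 1] = { OPM_REQ };  /* [0]=sender, [LAYERS]=device */
    enum ok  ok_sig [LAYERS + 1] = { OK_READY };
    int cycle, i;

    for (cycle = 0; cycle < 5 * LAYERS; cycle++) {
        for (i = LAYERS; i > 0; i--)        /* request ripples down, 1 layer/cycle */
            opm_sig[i] = opm_sig[i - 1];
        for (i = 0; i < LAYERS; i++)        /* OK ripples back up, 1 layer/cycle */
            ok_sig[i] = ok_sig[i + 1];
        ok_sig[LAYERS] = (opm_sig[LAYERS] == OPM_REQ) ? OK_OK : OK_READY;
        if (opm_sig[0] == OPM_REQ && ok_sig[0] == OK_OK)
            opm_sig[0] = OPM_IDLE;          /* sender starts the teardown */
        printf("cycle %2d: sender sees ok=%d\n", cycle, (int)ok_sig[0]);
    }
    return 0;   /* bus is free again once the sender sees OK_READY */
}

The point being, a single transfer burns several full traversals of the
bus before the next one can start.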
This bus could only handle a single active request at a time, and no
further requests could initiate (anywhere) until the prior request had
finished.
Experimentally, I was hard-pressed to get much over about 6MB/sec over
this bus with 128-bit transfers... (but could get it up to around
16MB/sec with 256-bit SWAP messages). As noted, this kinda sucked...
I then replaced this with a ring-bus:
Every device on the ring passes messages from input to output, and is
able to drop messages onto the bus, or remove/replace messages as
appropriate. If not handled immediately, they circle the ring until they
can be handled.
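In C-ish form (the message fields and names here are just illustrative,
not the actual ring-bus encoding), each node's behavior per clock edge
is roughly:

#include <stdint.h>

struct ring_msg {
    int      valid;
    int      dest;       /* which node the message is addressed to */
    int      opcode;
    uint64_t addr;
    uint64_t data;
};

/* Stub: whatever the node does with a message addressed to it,
   e.g. turning a load request into a load response. */
static struct ring_msg handle_msg(struct ring_msg in)
{
    struct ring_msg resp = in;
    resp.dest = 0;   /* placeholder: respond back to the requester */
    return resp;
}

/* One clock edge at one node: consume/replace, inject, or forward. */
static struct ring_msg ring_node_step(int my_id, struct ring_msg in,
                                      struct ring_msg *pending)
{
    if (in.valid && in.dest == my_id)
        return handle_msg(in);            /* remove/replace the message */
    if (!in.valid && pending->valid) {
        struct ring_msg out = *pending;   /* drop our own message onto the ring */
        pending->valid = 0;
        return out;
    }
    return in;                            /* otherwise just pass it along */
}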
This bus was considerably faster, but still seems to suffer from latency
issues.
In this case, the latency of the ring bus was higher than that of the
original bus, but it had the advantage that the L1 cache could
effectively drop 4 consecutive requests onto the bus and then (in
theory) have them all handled within a single trip around the ring.
Theoretically, the bus could move 800MB/sec at 50MHz, but practically
seems to achieve around 70MB/s (which is in turn affected by things
that affect ring latency, like enabling/disabling various "shortcut
paths" or enabling/disabling the second CPU core).
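For reference, the 800MB/sec figure is just the ring width times the
clock, assuming one 128-bit message slot advancing per cycle (the
payload width here is my inference from the numbers):

#define RING_PAYLOAD_BYTES 16        /* 128-bit data payload (assumed) */
#define RING_CLOCK_HZ      50000000  /* 50 MHz */
/* 16 bytes/cycle * 50 MHz = 800,000,000 bytes/sec, i.e. 800 MB/sec peak,
   before ring latency and contention eat into it. */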
A point-to-point message-passing bus would be possible, and could have
lower latency, but was not done mostly because it seemed more
complicated and expensive than the ring design.
If one has two endpoints, both can achieve around 70MB/s if L2 hits, but
this drops off if the external RAM accesses become the limiting factor.
The RAM interface uses a modified version of the original bus, in which
both the OPM and OK signals are augmented with sequence numbers: when
the sequence number sent on OPM comes back via the OK signal, one can
immediately move on to the next request (incrementing the sequence number).
While this interface still only allows a single request at a time, this
change effectively doubles the throughput. The main reason for using
this interface to talk to external RAM is that the interface works
across clock-domain crossings (as-is, the ring-bus requests can't
survive a clock-domain crossing).
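As a sketch of the requester side (field names and widths here are
guesses; the real thing lives in Verilog on both sides of the
clock-domain crossing):

#include <stdint.h>

struct ram_bus {
    uint8_t opm;       /* request opcode                               */
    uint8_t req_seq;   /* sequence number sent out with the request    */
    uint8_t ok;        /* response status                              */
    uint8_t ok_seq;    /* sequence number echoed back by the RAM side  */
};

/* Try to issue a new request; returns 1 if issued this cycle.
   The prior request counts as done as soon as its sequence number
   shows up on the OK side, so there is no separate teardown phase. */
static int ram_try_issue(struct ram_bus *bus, uint8_t opm, uint8_t *seq)
{
    if (bus->ok_seq != *seq)
        return 0;                  /* still waiting on the prior request */
    *seq = (uint8_t)(*seq + 1);    /* bump the sequence number */
    bus->opm     = opm;
    bus->req_seq = *seq;
    return 1;
}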
Most of the MMIO devices are still operating on a narrower version of
the original bus, say:
5b: OPM
28b: Addr
64b: DataIn
64b: DataOut
2b: OK
Where, OPM:
00-000: IDLE
00-zzz: Special Command (if zzz!=000)
01-010: Load DWORD (MMIO)
01-011: Load QWORD (MMIO)
01-111: Load TILE (RAM, Old)
10-010: Store DWORD (MMIO)
10-011: Store QWORD (MMIO)
10-111: Store TILE (RAM, Old)
11-010: Swap DWORD (MMIO, Unused)
11-011: Swap QWORD (MMIO, Unused)
11-111: Swap TILE (RAM, Old)
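Or, as C constants (the names are mine; the values are just the 5-bit
encodings from the table, with the "zzz" special commands omitted):

#define OPM_IDLE         0x00   /* 00-000 */
#define OPM_LOAD_DWORD   0x0A   /* 01-010 */
#define OPM_LOAD_QWORD   0x0B   /* 01-011 */
#define OPM_LOAD_TILE    0x0F   /* 01-111 */
#define OPM_STORE_DWORD  0x12   /* 10-010 */
#define OPM_STORE_QWORD  0x13   /* 10-011 */
#define OPM_STORE_TILE   0x17   /* 10-111 */
#define OPM_SWAP_DWORD   0x1A   /* 11-010 */
#define OPM_SWAP_QWORD   0x1B   /* 11-011 */
#define OPM_SWAP_TILE    0x1F   /* 11-111 */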
The ring-bus went over to an 8-bit OPM format, which increases the range
of messages that can be sent.
One advantage of the old bus is that the device-side logic is fairly
simple. Typically, the OPM/Addr/Data signals would be mirrored to all of
the devices, with each device having its own OK and DataOut signal.
A sort of crossbar existed, where whichever device sets its OK value to
something other than READY has its OK and Data signals passed back up
the bus.
Also it works because MMIO only allows a single active request at a time
(and the MMIO bus interface on the ringbus will effectively serialize
all accesses into the MMIO space on a "first come, first served" basis).
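Behaviorally (as a C stand-in for what is a combinational mux in the
Verilog; the names and the READY encoding are assumptions), the
response side amounts to:

#include <stdint.h>

#define NUM_DEVS  8   /* however many MMIO devices hang off the bus */
#define OK_READY  0   /* assumed encoding for the READY state */

static void mmio_resp_mux(const uint8_t dev_ok[NUM_DEVS],
                          const uint64_t dev_data[NUM_DEVS],
                          uint8_t *ok_out, uint64_t *data_out)
{
    int i;
    *ok_out   = OK_READY;
    *data_out = 0;
    for (i = 0; i < NUM_DEVS; i++) {
        if (dev_ok[i] != OK_READY) {   /* this device owns the response */
            *ok_out   = dev_ok[i];
            *data_out = dev_data[i];
            return;
        }
    }
}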
Note that accessing MMIO is comparatively slow.
Some devices, like the display / VRAM module, have been partly moved
over to the ringbus (with the screen's frame-buffer mapped into RAM),
but still use the MMIO interface for access to display control
registers and similar.
The SDcard interface still goes over MMIO, but ended up being modified
to allow sending/receiving 8 bytes at a time over SPI (with 8-bit
transfers, accessing the MMIO bus was a bigger source of latency than
actually sending bytes over SPI at 5MHz).
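So, the driver side ends up doing one 64-bit MMIO access per 8 SPI
bytes rather than 8 single-byte accesses; something like the following
(the register names and addresses are placeholders, not the real map):

#include <stdint.h>

#define SPI_DATA8 (*(volatile uint64_t *)0xF000C000u)  /* hypothetical address */
#define SPI_CTRL  (*(volatile uint32_t *)0xF000C008u)  /* hypothetical address */
#define SPI_BUSY  1u

/* Shift 8 bytes out (and 8 bytes in) with two MMIO accesses plus a
   status poll, instead of 8+ single-byte MMIO accesses each way. */
static uint64_t sd_spi_xfer8(uint64_t out)
{
    SPI_DATA8 = out;              /* one MMIO store queues 8 bytes for SPI */
    while (SPI_CTRL & SPI_BUSY)
        ;                         /* wait for the 8-byte shift to complete */
    return SPI_DATA8;             /* one MMIO load fetches the 8 bytes received */
}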
As-is, I am running the SDcard at 12.5 MHz:
16.7MHz and 25MHz did not work reliably;
Going over 25MHz was out-of-spec;
Even with 8-byte transfers, MMIO access can still become a bottleneck.
A UHS-II interface could in theory run at similar speeds to RAM, but
would likely need a different interface to make use of this.
One possibility would be to map the SDcard into the physical address
space as a huge non-volatile RAM-like space (on the ring-bus). Had
considered this on and off a few times, but didn't get to it.
Effectively, it would require redesigning the whole SDcard and
filesystem interface (essentially moving nearly all of the SDcard logic
into hardware).
> Multiple devices access the main DRAM memory via a memory controller.
> Several devices that are bus masters have their own ports to the memory
> controller and do not use up time on the main system bus tree. The
> frame buffer has a streaming data port. The frame buffer streaming cache
> is 8kB and loaded in 1kB strips at 800MB/s from the DRAM IIRC. Other
> devices share a system cache which is only 16kB due to limited number
> block RAMs. There are about a half dozen read ports, so the block RAMs
> are replicated. With all the ports accessing simultaneously there could
> be 8*40*16 MB/s being transferred, or about 5.1 GB/s for reads.
>
I had put everything on the ring-bus, with the L2 also serving as the
bridge to access external DRAM (via a direct connection to the DDR
interface module).
> The CPU itself has only L1 caches of 8kB I$ and 16kB D$. The D$ can be
> dual ported, but is not configured that way ATM due to resource
> limitations. The caches will request data in blocks the size of a cache
> line. A cache line is broken into four consecutive 128-bit accesses. So,
> data comes back from the boot ROM in a burst at 640 MB/s.
>
In my case:
L1 I$: 16K or 32K
32K helps notably with GLQuake and similar.
Doom works well with 16K.
L1 D$: 16K or 32K
Mostly 32K works well.
Had tried 64K, but bad for timing, and little effect on performance.
IIRC, had evaluated running the CPU at 25MHz with 128K L1 caches and a
small L2 cache, but modeling this had shown that performance would suck
(even if nearly all of the instructions had a 1-cycle latency).
> IIRC there were no display issues with an 800x600x16 bpp display, but I
> could not get Thor to do much more than clear the screen. So, it was a
> display of random dots that was stable. There is a separate text display
> controller with its own dedicated block RAM for displays.
>
My display module is a little weird, as it was based around a
cell-oriented design:
Cells are typically 128 or 256 bits, representing 8x8 pixels.
Text and 2bpp color-cell modes use 128-bit cells, say:
( 29: 0): Pair of 15-bit colors;
( 31:30): 10
( 61:32): Misc
( 63:62): 00
(127:64): Pixel bits, 8x8x1 bit, raster order
The 4bpp color-cell mode is more like:
( 29: 0): Colors A/B
( 31: 30): 11
( 61: 32): Colors C/D
( 63: 62): 11
( 93: 64): Colors E/F
( 95: 94): 00
(125: 96): Colors G/H
(127:126): 00
(159:128): Pixels A/B (4x4x2)
(191:160): Pixels C/D (4x4x2)
(223:192): Pixels E/F (4x4x2)
(255:224): Pixels G/H (4x4x2)
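As a decoding sketch for the 128-bit 2-color cell listed above (the
struct and function are mine, and the exact bit-to-pixel ordering
within the 8x8 block is a guess at "raster order"):

#include <stdint.h>

struct cell128 { uint64_t lo, hi; };   /* lo = bits 63:0, hi = bits 127:64 */

/* Returns the 15-bit color of pixel (x, y) within an 8x8 cell. */
static uint16_t cell2_pixel(struct cell128 c, int x, int y)
{
    uint16_t col_a = (uint16_t)( c.lo        & 0x7FFF);  /* bits 14: 0 */
    uint16_t col_b = (uint16_t)((c.lo >> 15) & 0x7FFF);  /* bits 29:15 */
    int bit = (int)((c.hi >> (y * 8 + x)) & 1);          /* 8x8x1 pixel bits */
    return bit ? col_b : col_a;
}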
In the bitmapped modes:
128-bit cell selects 256-color modes (4x4 pixels)
256-bit cell selects hi-color modes (4x4 pixels)
So:
640x400 would be configured as 160x100 cells.
800x600 would be configured as 200x150 cells.
The 800x600 256-color mode held up OK when I had the display module
outputting at a non-standard 36Hz refresh, but increasing this to a more
standard 72Hz blows out the memory bandwidth.
Theoretically, the DDR RAM interface could support these resolutions if
all the timing and latency was good. But, not so good when it is
implemented by the display module hammering out a series of prefetch
requests over the ring-bus just ahead of the current raster position.
Though, the cell-oriented display modes still work better than my
attempt at a linear framebuffer mode (due to cache/timing issues, not
even a 320x200 linear framebuffer mode worked without looking like a
broken mess).
I suspect this is because, with the cell-oriented modes, each cell has 4
or 8 chances for the prefetch to succeed before it actually gets drawn,
whereas in the linear raster mode, there is only 1 chance.
It is likely that a linear framebuffer would require two stages:
Prefetch 1: Somewhat ahead of current raster position, hopefully gets
data into L2;
Prefetch 2: Closer to the raster position, intended to actually fetch
the pixel data.
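Something like the following (placeholder C; the real logic would sit
in the display module's Verilog, and the look-ahead distances are
made-up figures):

#include <stdint.h>

#define FAR_AHEAD   256   /* pixels ahead for stage 1 (made-up figure) */
#define NEAR_AHEAD   64   /* pixels ahead for stage 2 (made-up figure) */

static uint32_t fb_base = 0x01000000;   /* hypothetical framebuffer address */

static uint32_t fb_addr_of(int pixel_index)   /* 1 byte/pixel, 256-color mode */
{
    return fb_base + (uint32_t)pixel_index;
}

/* Stubs standing in for ring-bus prefetch/fetch messages. */
static void prefetch_into_l2(uint32_t addr) { (void)addr; }
static void fetch_pixels    (uint32_t addr) { (void)addr; }

static void scanout_prefetch(int raster_pos)
{
    /* Stage 1: well ahead of the beam; hopefully lands the data in L2. */
    prefetch_into_l2(fb_addr_of(raster_pos + FAR_AHEAD));

    /* Stage 2: closer to the beam; goes after the actual pixel data. */
    fetch_pixels(fb_addr_of(raster_pos + NEAR_AHEAD));
}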
Prefetches are used here rather than actual loads, mostly because these
will get cleaned up quickly, whereas with actual fetches, a back-log
scenario would result in the whole bus getting clogged up with
unresolved requests.
However, the CPU can use normal loads, since the CPU will patiently wait
for the previous request(s) to finish before doing anything else (and
thus avoids flooding the ring-bus with requests).
However, a downside of prefetches is that one has to keep asking the L2
cache each time whether or not it has the data in question yet.
As for the "BJX2 doesn't always generate smaller .text than RISC-V"
issue: went looking at the ASM, and noted there is a big difference:
GCC "-Os" generates very tight and efficient code, but needs to work
within the limits of what the ISA provides;
BGBCC has a bit more to work with, but the relative quality of the
generated code is fairly poor in comparison.
Like, say:
MOV.Q R8, (SP, 40)
.lbl:
MOV.Q (SP, 40), R8
//BGBCC: "Sure why not?..."
...
MOV R2, R9
MOV R9, R2
BRA .lbl
//BGBCC: "Seems fine to me..."
So, I look at the ASM, and once again groan at how crappy a lot of it is.
Or:
if(!ptr)
...
BGBCC was failing to go down the logic path that would have allowed it
to use the BREQ/BRNE instructions (so it was always producing a two-op
sequence).
Have noticed that code that writes, say:
if(ptr==NULL)
...
Ends up using a 3-instruction sequence, because it doesn't recognize
this pattern as being the same as the "!ptr" case, ...
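One way to catch this would be a front-end normalization that folds a
compare against a literal NULL/0 into the same zero-branch path as
"!ptr"; a sketch of the kind of check involved (the IR node and field
names are made up):

enum cmp_op { CMP_EQ, CMP_NE };

struct cmp {
    enum cmp_op op;
    int lhs_reg;         /* register holding the pointer */
    int rhs_is_const0;   /* right-hand side is a literal 0 / NULL */
};

/* Returns nonzero if the comparison can feed the BREQ/BRNE-against-zero
   style branch instead of a separate compare + branch sequence. */
static int can_use_zero_branch(struct cmp c)
{
    return c.rhs_is_const0 && (c.op == CMP_EQ || c.op == CMP_NE);
}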
Did at least find a few more "low hanging fruit" cases that shaved a few
more kB off the binary.
Well, and also added a case to partially optimize:
return(bar());
To merge the 3AC "RET" into the "CSRV" operation, and thus save the use
of a temporary (and roughly two otherwise unnecessary MOV instructions
whenever this happens).
But, ironically, it was still "mostly" generating code with fewer
instructions, despite the still relatively weak code generation at times.
Also it seems:
void foo()
{
//does nothing
}
void bar()
{
...
foo();
...
}
GCC seems to be clever enough to realize that "foo()" does nothing, and
will eliminate the function and function call entirely.
BGBCC has no such optimization.
...