
64 bit 68080 CPU


Brett

Jul 31, 2022, 3:55:32 PM


Here is the 64-bit 68080 CPU, interesting.
Surprising that the economics to build such a thing exist.

http://www.apollo-core.com/index.htm?page=coding&tl=1

Mostly Amiga upgrades with antique Apollo workstations also mentioned, and
probably lots of embedded systems for machinery?

Josh Vanderhoof

Jul 31, 2022, 5:08:08 PM
That's really cool! The SAGA Amiga chipset looks interesting as well.
Chunky pixels and bilinear z buffered texture mapping on top of AGA. I
had no idea such a thing existed. Neat!

Quadibloc

Aug 1, 2022, 12:40:03 AM
On Sunday, July 31, 2022 at 1:55:32 PM UTC-6, gg...@yahoo.com wrote:

> Surprising that the economics to build such a thing exist.

It's a soft core, in Verilog or VHDL, that gets used to program
an FPGA, so the economics aren't so daunting as to be all that
surprising.

John Savard

BGB

Aug 1, 2022, 2:26:12 AM
But, what sort of FPGA, exactly?...


Their claimed stats seem a little higher than what I could expect from
something in a similar class as a Spartan or Artix based on my
experience thus far.

In one of the images, albeit low res, I can sort of make out an Altera
logo; FPGA appears to be a Cyclone 3, but can't make out anything beyond
this (eg: what size of Cyclone III ?...).


Looking stuff up, it would appear that Cyclone 3 stats are in a similar
range to the Artix-7 family, albeit it would appear to be balanced
towards more logic elements but less block RAM.

Not sure much beyond this; relative comparisons between Xilinx and
Altera FPGAs are a bit sparse, particularly for the lower-end families.


Then again, I guess retro-computers like the Amiga are not cheap, so it
seems plausible they could justify the costs of (potentially) using a
"slightly expensive" FPGA on their Amiga upgrade boards (so, in any
case, probably not using the low-end entries of the product line).

Their boards also seem to come with significantly more RAM than a lot of
the "low cost" FPGA dev boards, ...

...



Then again, I guess it is a question of whether there is a good way to
get a significant increase in how much performance I can get out of an
FPGA, without a significant resource-cost increase?...

I guess main areas to look at would be:
Find a way to reduce both the cost and latency of the L1 caches;
Find a way to reduce the cost of dealing with pipeline stall signals;
Try to find a way to make the interrupt dispatch mechanism cheaper;
...

Well, and would also help for performance:
Figure out some way to make L2 misses and DRAM access both cheaper and
lower latency.

...



Well, and further reaching issues, say, whether a Reg/Mem ISA could
compare well with a Load/Store ISA?... Can use fewer instructions, but
how to do Reg/Mem ops without introducing significant complexities or
penalty cases?...

...


> John Savard

Anton Ertl

Aug 1, 2022, 5:01:04 AM
BGB <cr8...@gmail.com> writes:
> Find a way to reduce the cost of dealing with pipeline stall signals;

It seems to me that OoO helps with that, and the 68080 is OoO.

>Well, and further reaching issues, say, whether a Reg/Mem ISA could
>compare well with a Load/Store ISA?... Can use fewer instructions, but
>how to do Reg/Mem ops without introducing significant complexities or
>penalty cases?...

Intel and AMD are doing it just fine with OoO implementations. The
68020 (and consequently the 68080) has the additional complexity of
memory-indirect addressing, but the main problem I see here is one of
verification (can you guarantee forward progress in the presence of
TLB misses and page faults), not performance problems.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

Theo

Aug 1, 2022, 5:06:23 AM
BGB <cr8...@gmail.com> wrote:
> But, what sort of FPGA, exactly?...

Cyclone V 5CEFA5F23C, 77k LE:
http://www.apollo-computer.com/icedrakev4.php

> Looking stuff up, it would appear that Cyclone 3 stats are in a similar
> range to the Artix-7 family, albeit it would appear to be balanced
> towards more logic elements but less block RAM.
>
> Not sure much beyond this, relative comparisons between Xilinx and
> Altera FPGAs is a bit sparse, particularly for the lower-end families.

Cyclone is Altera's 'cheap' FPGA line, fitting between the MAX CPLDs and the
bigger Arria and Stratix parts. 'Cheap' means ~$100 list price, so not
cheap for the rest of us. Cyclone V is the old mainstream part, Cyclone
10LP is I think a rebrand of the Cyclone IV, and Cyclone 10GX is higher end
with transceivers.

Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
pretty antique as far as Arm cores go). Those are pretty comparable with
the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
there's no transceivers and no Arm, hence it's at the cheap end of the line
(the A5 meaning 77k LE is the middle of the range).

https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html

This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).
The I/Os are typically good to drive DDR3, which is what the Arm uses for
DRAM.

Mouser will sell me one for $127 (in MOQ 60), and the price they get from
their distributor is almost certainly less.

I'm not quite sure how it matches up with Xilinx, but I'd expect an Artix is
probably comparable.

With 800Kbyte of BRAM I think you could make some decent caches - after all
the Amiga 500 only had 512Kbyte DRAM to begin with.

Theo

EricP

Aug 1, 2022, 11:44:39 AM
BGB wrote:
>
> Then again, I guess it is a question of whether there is a good way to
> get a significant increase in how much performance I can get out of an
> FPGA, without a significant resource-cost increase?...
>
> I guess main areas to look at would be:
> Find a way to reduce both the cost and latency of the L1 caches;

IIRC you were having cross clock-domain issues at one point where
it took many clocks to synchronize, and still had reliability issues.
Is this still the case?

Also are you still using a token ring to talk to L1?

> Find a way to reduce the cost of dealing with pipeline stall signals;

Which costs are you referring to, FPGA logic elements or pipeline latency?

If it is the second, are you using a global pipeline stall signal where
if any stage stalls they all stall (I think you were at one point)?
I've mentioned 'elastic' pipeline stages previously that use local
stalls to allow bubbles to compact out.
But they are more than twice the LE cost.
Instead of one set of FF for each stage, it is 2 sets of FF, plus a MUX
to select between FF, plus control logic.

> Try to find a way to make the interrupt dispatch mechanism cheaper;

What are you currently doing?

> Well, and would also help for performance:
> Figure out some way to make L2 misses and DRAM access both cheaper and
> lower latency.

Is this also across a clock domain?



MitchAlsup

Aug 1, 2022, 12:56:28 PM
On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>
> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> compare well with a Load/Store ISA?... Can use fewer instructions, but
> how to do Reg/Mem ops without introducing significant complexities or
> penalty cases?...
>
You build a pipeline which has a pre-execute stage (which calculates
AGEN) and then a stage or 2 of cache access, and then you get to the
normal execute---writeback part of the pipeline. I have called this the
360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
machines used such a pipeline.
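As a rough illustration of that stage ordering (not the actual S.E.L or IBM design, just a toy in-order walk-through of one Reg/Mem add), the key point is that AGEN and the cache read happen before the normal execute stage:

```python
# Minimal sketch of the "360 pipeline" described above: address
# generation and cache access precede execute, so a Reg/Mem
# instruction needs no micro-op split. Stage names follow the post.

STAGES = ["FETCH", "DECODE", "AGEN", "CACHE", "EXECUTE", "WRITEB"]

def run_load_op(regs, mem, op):
    """op = ('add', rd, rs, base, disp): rd = rs + mem[regs[base] + disp]."""
    trace = []
    _, rd, rs, base, disp = op
    trace.append("FETCH")         # fetch the instruction word
    trace.append("DECODE")        # decode fields, read base register
    addr = regs[base] + disp      # AGEN: pre-execute address generation
    trace.append("AGEN")
    operand = mem[addr]           # CACHE: data cache read (1-2 stages)
    trace.append("CACHE")
    result = regs[rs] + operand   # EXECUTE: ordinary ALU stage
    trace.append("EXECUTE")
    regs[rd] = result             # WRITEB: register writeback
    trace.append("WRITEB")
    return trace

regs = {"r1": 0, "r2": 5, "r3": 100}
mem = {108: 7}
trace = run_load_op(regs, mem, ("add", "r1", "r2", "r3", 8))
# One pass through all six stages; the memory operand is ready by EXECUTE.
```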

BGB

Aug 1, 2022, 2:58:04 PM
On 8/1/2022 10:44 AM, EricP wrote:
> BGB wrote:
>>
>> Then again, I guess it is a question of whether there is a good way to
>> get a significant increase in how much performance I can get out of an
>> FPGA, without a significant resource-cost increase?...
>>
>> I guess main areas to look at would be:
>>   Find a way to reduce both the cost and latency of the L1 caches;
>
> IIRC you were having cross clock-domain issues at one point where
> it took many clocks to synchronize, and still had reliability issues.
> Is this still the case?
>

This is for L2<->DRAM, pretty much everything else is running on a
global 50MHz clock.

The issue is that for 16/32K L1 caches, the fastest I can currently run
them is around 50MHz. Otherwise, it seems the logic for fetching the
cache line data and tag bits from the arrays, and checking for
match/mismatch, takes too many nanoseconds (mostly with routes that
zigzag across a big part of the FPGA).


This part can pass timing easier with smaller arrays (LUTRAM), but then
the L1 is only around 2K, and the hit rate "kinda sucks".


The L2 has it a little easier, because it can stick in an extra delay
cycle between fetching the block from the array, and checking whether or
not it matches (otherwise, the 256K cache would be difficult even at 50MHz).
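The index/tag split and the tag compare on that critical path can be sketched roughly as follows (geometry is illustrative, not BJX2's actual configuration):

```python
# Toy direct-mapped L1 lookup: 32K cache, 16-byte lines -> 2048 lines.
# The tag compare after the array read is the path described above as
# limiting the clock to ~50MHz on the larger arrays.

LINE_BITS = 4                  # 16-byte cache lines
INDEX_BITS = 11                # 2048 lines = 32K
NUM_LINES = 1 << INDEX_BITS

tags = [None] * NUM_LINES      # tag array (LUTRAM or BRAM in hardware)
data = [b"\x00" * 16] * NUM_LINES

def split(addr):
    index = (addr >> LINE_BITS) & (NUM_LINES - 1)
    tag = addr >> (LINE_BITS + INDEX_BITS)
    return index, tag

def lookup(addr):
    """Return (hit, line): fetch from the arrays, then check for match."""
    index, tag = split(addr)
    line = data[index]         # array fetch (one clock edge)
    hit = tags[index] == tag   # tag compare (the timing-critical part)
    return hit, line

def fill(addr, line):
    index, tag = split(addr)
    tags[index] = tag
    data[index] = line

fill(0x12345670, b"A" * 16)
hit1, _ = lookup(0x12345678)   # same line -> hit
hit2, _ = lookup(0x22345678)   # same index, different tag -> miss
```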


> Also are you still using a token ring to talk to L1?
>

The L1<->L2 interface uses a big token-ring style bus.

The L1 caches plug directly into the main pipeline (effectively, the L1s
and main pipeline operate in lock-step with each other).


While it seems like the ring-bus would have a pretty high latency, on
average the latency is compensated for by the ability to send several
requests over the bus at the same time (overall performance being
significantly higher than my original "one request at a time" bus).


From what I can gather, the OpenCores Wishbone bus operates in a
similar way to my original bus; I am not sure if/how they avoid it
suffering from poor performance as a result.

Something like AXI seems a fair bit more complicated, and like it would
likely result in a significantly higher resource cost than the ring-bus.
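The latency-versus-throughput trade-off of the ring bus can be seen in a back-of-envelope model (the cycle counts here are made up for illustration, not measured from BJX2):

```python
# Why a ring bus can win despite higher per-request latency: requests
# overlap in flight, so total time is roughly latency + (N - 1) rather
# than N * latency for a one-request-at-a-time bus.

def one_at_a_time(n_requests, latency):
    # Original bus: the next request cannot start until the previous
    # one completes, so latencies add up serially.
    return n_requests * latency

def ring_bus(n_requests, hops):
    # Ring: assume one new request can enter per cycle; each takes
    # `hops` cycles around the ring, but requests overlap in flight.
    return hops + (n_requests - 1)

serial = one_at_a_time(8, 10)   # 8 requests, 10-cycle bus -> 80 cycles
ringed = ring_bus(8, 16)        # 16-hop ring, overlapped -> 23 cycles
```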


>>   Find a way to reduce the cost of dealing with pipeline stall signals;
>
> Which costs are you referring to, FPGA logic elements or pipeline latency?
>
> If it is the second, are you using a global pipeline stall signal where
> if any stage stalls they all stall (I think you were at one point)?

Yeah. If the L1 D$ stalls, it asserts a global stall. The entire
pipeline stalls.


This includes:
All the forwarding between the main pipeline stages;
All the forwarding within the various 'EX' stage units
This includes the FPUs and SIMD FPUs;
Forwarding within the internal stages within the L1 caches;
...


There are actually two stalls, A and B:
A: Stalls *everything*;
B: Stalls only the Fetch and Decode stages
Used for interlocks (injects NOPs into the EX stages).


In this case, the stall makes sure that the memory access always
finishes on the EX3 stage regardless of how long the memory access took
in reality (the FPU and similar may also assert this stall, since FPU
operations take a lot longer than what the 3 EX stages can accommodate).
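The two stall signals can be modeled as a small state-update function over the pipeline stages (stage names follow the post; this is a toy model, not the Verilog):

```python
# Toy model of the two stalls: stall A freezes the whole pipeline,
# stall B freezes only Fetch/Decode and injects a NOP bubble into EX
# (the interlock case).

def step(pipe, stall_a=False, stall_b=False):
    """pipe: dict of stage -> instruction. Returns the next state."""
    if stall_a:
        return dict(pipe)              # everything holds in place
    nxt = {}
    nxt["WB"]  = pipe["EX3"]
    nxt["EX3"] = pipe["EX2"]
    nxt["EX2"] = pipe["EX1"]
    if stall_b:
        nxt["EX1"] = "NOP"             # bubble injected into EX
        nxt["ID"]  = pipe["ID"]        # front end holds
        nxt["IF"]  = pipe["IF"]
    else:
        nxt["EX1"] = pipe["ID"]
        nxt["ID"]  = pipe["IF"]
        nxt["IF"]  = "next-insn"
    return nxt

pipe = {"IF": "i5", "ID": "i4", "EX1": "i3",
        "EX2": "i2", "EX3": "i1", "WB": "i0"}
frozen = step(pipe, stall_a=True)        # D$ miss: nothing moves
interlocked = step(pipe, stall_b=True)   # interlock: EX gets a bubble
```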


I had noted that some other cores had used a FIFO queue for accessing
memory, but this leaves the issue of getting Load results written back
to the register file (and needing to keep track somehow of when Load
results have arrived).

This seems more complicated, though, than the "whole pipeline stalls if
the L1 misses" approach.


I had experimented before with making the L1 I$ generate a stream of
0-length NOP bundles on miss, rather than stall the pipeline, but was
faced with "technical issues" with making this work reliably, so mostly
stuck with the "stall the whole pipeline" mechanism (though, the
0-length NOP bundles are still needed to be able to handle I$ TLB misses
effectively).



> I've mentioned 'elastic' pipeline stages previously that use local
> stalls to allow bubbles to compact out.
> But they are more than twice the LE cost.
> Instead of one set of FF for each stage, it is 2 sets of FF, plus a MUX
> to select between FF, plus control logic.
>

LUT cost is already a big issue.
LUTs and BRAMs are the main resources I have mostly used up.

Still have plenty of DSP48s left though...
My CPU core only needs so much low-precision multiply though.


>>   Try to find a way to make the interrupt dispatch mechanism cheaper;
>
> What are you currently doing?
>

Current mechanism (ISA level):
Save SR state into EXSR;
Twiddle some bits in SR;
MD, RB, and BL are set (Sets to Supervisor+ISR mode);
WXE and WX2 copied from VBR (51:50)
Swap SP and SSP;
Save PC to SPC;
Jump to a computed address relative to VBR.
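As a rough state-transition model of that ISA-level sequence (register names follow the post; the SR bit positions here are hypothetical, not BJX2's real layout, and the WXE/WX2 copy from VBR(51:50) is omitted):

```python
# Minimal sketch of the ISA-level interrupt dispatch sequence above.
# Bit positions for MD/RB/BL are illustrative assumptions.

SR_MD, SR_RB, SR_BL = 1 << 30, 1 << 29, 1 << 28   # hypothetical bits

def dispatch_interrupt(cpu, vector_offset):
    cpu["EXSR"] = cpu["SR"]                         # save SR state into EXSR
    cpu["SR"] |= SR_MD | SR_RB | SR_BL              # Supervisor+ISR mode bits
    cpu["SP"], cpu["SSP"] = cpu["SSP"], cpu["SP"]   # swap SP and SSP
    cpu["SPC"] = cpu["PC"]                          # save PC to SPC
    cpu["PC"] = cpu["VBR"] + vector_offset          # jump relative to VBR

cpu = {"SR": 0, "EXSR": 0, "SP": 0x8000, "SSP": 0xF000,
       "PC": 0x1234, "SPC": 0, "VBR": 0xC000}
dispatch_interrupt(cpu, 0x10)   # hypothetical vector slot at VBR+0x10
```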

Internally, the CPU also does:
Figure out which pipeline stage we can validly revert to;
Revert DLR, DHR, LR, SP, etc, to their values at that pipeline stage;
Normal GPR writes are handled via an "invalidate" flag:
This blocks the register's value at the WB stage.
...

The logic for forwarding and then reverting the state of some of the
various special registers is expensive. But, special handling is needed
for any register which may be updated via side-channel mechanisms
(rather than via normal GPR style access patterns).

This is not needed for SRs/CRs where the side-channel is effectively
read-only (TTB, MMCR, KRR, GBR, etc), since these are only modified via
the WB stage (and thus can use the same mechanism as the GPRs).

A likely change would be:
* Eliminate the writable side-channels for DLR, DHR, SP, and LR.
** DLR/DHR: Would become read-only side-channels;
** SP could become mostly GPR-like
*** Given PUSH/POP no longer exist in the ISA.
*** SP <-> SSP swap could be handled in the instruction decoder.
** LR update would be handled via a GPR style update
*** This is already used for RISC-V mode.

Some of this would require "awkward" rewrites to parts of the Verilog
code, and I hadn't poked at it yet.


>> Well, and would also help for performance:
>> Figure out some way to make L2 misses and DRAM access both cheaper and
>> lower latency.
>
> Is this also across a clock domain?
>

Yes.


The DDR controller logic operates at a higher clock speed than
everything else, so it needs a clock-domain crossing to access.

Still uses the original "one request at a time" bus, with several
request types:
Load (fetch cache line from DDR)
Store (write cache line to DDR)
Swap (do a combined Store + Load)

The majority of accesses here are Load and Swap.


Given this uses 64B cache lines, this is able to mostly cover a lot of
the clock-domain crossing and state-transition overheads.

But, access latency is still "not great", and the LUT costs of having
this part work with 64B (512-bit) cache lines, is pretty steep.

...

BGB

Aug 1, 2022, 2:58:32 PM
On 8/1/2022 4:06 AM, Theo wrote:
> BGB <cr8...@gmail.com> wrote:
>> But, what sort of FPGA, exactly?...
>
> Cyclone V 5CEFA5F23C, 77k LE:
> http://www.apollo-computer.com/icedrakev4.php
>

From what I can gather, vs an XC7A100T:
Fewer LUTs (kinda hard to compare directly here);
Less Block RAM;
More IO pins;
Faster maximum internal clock speed.


>> Looking stuff up, it would appear that Cyclone 3 stats are in a similar
>> range to the Artix-7 family, albeit it would appear to be balanced
>> towards more logic elements but less block RAM.
>>
>> Not sure much beyond this, relative comparisons between Xilinx and
>> Altera FPGAs is a bit sparse, particularly for the lower-end families.
>
> Cyclone is Altera's 'cheap' FPGA line, fitting between the MAX CPLDs and the
> bigger Arria and Stratix parts. 'Cheap' means ~$100 list price, so not
> cheap for the rest of us. Cyclone V is the old mainstream part, Cyclone
> 10LP is I think a rebrand of the Cyclone IV, and Cyclone 10GX is higher end
> with transceivers.
>
> Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
> pretty antique as far as Arm cores go). Those are pretty comparable with
> the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
> there's no transceivers and no Arm, hence it's at the cheap end of the line
> (the A5 meaning 77k LE is the middle of the range).
>
> https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html
>
> This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).
> The I/Os are typically good to drive DDR3, which is what the Arm uses for
> DRAM.
>

On Artix, one is typically using FPGA logic and IO to drive the DDR.

In my case, I am driving the DDR2 module at 50MHz on the board I have.


Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
in this case, and this is pushing the limits of the IO pin speeds.

One could drive it faster (via SERDES) but still not climbed that
ladder. Theoretical estimates were typically showing only modest
improvement from faster RAM IO speeds here.

Also still not climbed the AXI ladder either (would be needed to use
Vivado's MIG tool).


> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
> their distributor is almost certainly less.
>
> I'm not quite sure how it matches up with Xilinx, but I'd expect an Artix is
> probably comparable.
>

Yeah.


> With 800Kbyte of BRAM I think you could make some decent caches - after all
> the Amiga 500 only had 512Kbyte DRAM to begin with.
>

Probably.



In my case, with an XC7A100T, and a single CPU core, the maxed out
settings are basically:
256K L2 + 64K L1 I$ + 64K L1 D$

But, a fair chunk of block-RAM is eaten by internal overheads, like
tagging arrays (particularly for the L1s, which are roughly 50% tagging
overhead in this case).

Overhead is somewhat less for L2, given the L2 is using 64B cache lines
rather than 16B cache lines.

Where:
64B lines in the L2 allow OK bandwidth to external RAM;
16B lines in the L1 keep the LUT cost more modest.

A case could possibly be made here for making both be 32B though.



I am mostly using 32K L1s at the moment, given:
Timing doesn't really like 64K L1s;
The bigger L1 caches would only slightly improve hit rate.


I got the bigger L2 at the cost of dropping the use of Block-RAM for the
framebuffer, and instead mapping it to RAM (via the L2 cache).

Practically, still mostly limited to around 128K VRAM:
My original MMIO interface design doesn't deal very well with more than
128K;
Going much bigger than this, and the L2 cache DRAM can't keep up with
"keeping the VGA refresh fed".


So, the actual "usable part" of the L2 is reduced somewhat, partly as it
gets repeatedly hit by the screen refresh (as it sweeps across the
frame-buffer at around 60 times per second).

Ironically though, because of the way it works, the bandwidth is
actually "slightly less" in the 800x600 mode, because it is still using
128K (at around 2bpp), but the effective screen refresh speed drops from
60Hz down to 36Hz (still not yet confirmed to work on a real monitor).


Performance in 640x400 and 800x600 mode is partly limited also by things
like the need to use a color-cell encoder.

640x400 using a 4-bpp color cell format:
4x4 cells with two RGB555 endpoints, 2b per pixel (A, B, 3/8 + 5/8)
Packed into 256b 8x8 blocks though.
800x600 using a 2-bpp color cell format:
8x8 cells with RGB555 endpoints, 1b per pixel (A, B)
Looks "kinda awful" for Doom and similar.
Arguably "less awful" than a 4-color mode would look:
"Ah yeah, Black/White/Cyan/Magenta"...
This cell format works pretty good for text and similar at least.


If one uses the "draw into 16bpp framebuffer and then color-cell encode
this to VRAM" approach, hard to get much over 10Hz in 640x400 mode or
6Hz in 800x600 (not really all that usable for games or similar; would
probably work mostly OK for GUI or similar).

An optimization though is to flag which parts of the screen have been
updated, and then skipping color-cell encoding unchanged parts of the
screen.


Say, for a GUI-like scenario, running Doom or similar:
Doom draws to its internal framebuffer;
This is copied to the window's backbuffer via a bitmap draw operation;
Relevant parts of windows' bitmap are marked dirty.
If a window has been marked dirty:
Color or pattern fill the screen buffer
This is itself kinda expensive at 640x400 (roughly a 512K memset)
Window stack is redrawn to display's screen buffer;
Color-cell encode screen buffer, copy to VRAM.
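The dirty-flag optimization above can be sketched at cell granularity (the cell grid and sizes here are illustrative):

```python
# Sketch of skipping color-cell encoding for unchanged screen regions:
# only cells marked dirty since the last flush get re-encoded.

CELLS_X, CELLS_Y = 80, 50             # e.g. 640x400 at 8x8 cells

class Screen:
    def __init__(self):
        self.dirty = [[False] * CELLS_X for _ in range(CELLS_Y)]

    def mark_dirty(self, x0, y0, x1, y1):    # in cell coordinates
        for y in range(y0, y1):
            for x in range(x0, x1):
                self.dirty[y][x] = True

    def flush(self):
        """Encode only dirty cells (stand-in for encode+VRAM copy),
        then clear the flags; returns how many cells were touched."""
        encoded = 0
        for y in range(CELLS_Y):
            for x in range(CELLS_X):
                if self.dirty[y][x]:
                    encoded += 1
                    self.dirty[y][x] = False
        return encoded

s = Screen()
s.mark_dirty(10, 10, 50, 35)   # one window redrawn: 40x25 cells
n = s.flush()                  # 1000 cells instead of all 4000
```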

Not entirely sure how early GUIs worked, so maybe they had more
efficient approaches.


For 320x200 (what I am mostly using for 3D stuff), the 128K of VRAM can
give a bitmapped RGB555 mode, which is a little more usable for this
(mostly need to redraw frames and then copy them to VRAM unchanged).


> Theo

BGB

Aug 1, 2022, 4:15:00 PM
On 8/1/2022 3:54 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> Find a way to reduce the cost of dealing with pipeline stall signals;
>
> It seems to me that OoO helps with that, and the 68080 is OoO.
>

Possible, I would have figured OoO would have been a bit steep for an
FPGA, but they seem to be managing with what looks like "not too
unreasonable" FPGAs, so dunno...


>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>> how to do Reg/Mem ops without introducing significant complexities or
>> penalty cases?...
>
> Intel and AMD are doing it just fine with OoO implementations. The
> 68020 (and consequently the 68080) has the additional complexity of
> memory-indirect addressing, but the main problem I see here is one of
> verification (can you guarantee forward progress in the presence of
> TLB misses and page faults), not performance problems.
>

To support Reg/Mem "in general", seems like it would need mechanisms for:
Perform a Load, then perform the operation (Load-Op);
Ability to perform certain ops directly in the L1 cache.


Latter is assuming that only a subset of operations support memory as a
destination, which appears to be the case in x86 at least
(interestingly, both RISC-V's 'A' extension and SH-2A ended up with some
operations to operate directly on memory in roughly the same scenarios).

In my case, a limited form of LoadOp already exist for the "FMOV.S" and
the "LDTEX" instruction (these perform work on the loaded value in EX3).

Something like Load+ADDS.L or similar is not entirely implausible (could
probably fit within ~ 1 cycle).


However, something like extending EX to 5 stages, or needing "split this
operation into micro-ops" logic, would be a little steep. The former
would adversely affect branch latency (and increase LUT cost); the
latter would cause these instructions to perform rather poorly
(defeating the purpose of adding them).

Other options would likely require rethinking the pipeline (such as
trying to find some way to shove a Load into the ID stages; and all of
the consequences this would entail).

Say, for example, if ID1/ID2 had some special read ports, and special
cases for "Assume op is Load or Load-Op", which could then allow Load-Op
within similar pipeline latency, but would require some additional
interlock-stage handling.


Adding an EX4 and EX5 stage could be argued for on the basis that, while
it would increase branch latency and LUT cost, it could potentially also
be used to allow for fully pipelined FPU instructions (eg: could allow
turning FADD and FMUL from 6C operations into 5L/1T operations).

...

John Dallman

Aug 1, 2022, 4:27:37 PM
In article <tc6mng$fm5h$1...@dont-email.me>, gg...@yahoo.com (Brett) wrote:

> Mostly Amiga upgrades with antique Apollo workstations also
> mentioned, and probably lots of embedded systems for machinery?

It all looks to be Amiga to me; ApolloOS is a fork of
<https://en.wikipedia.org/wiki/AROS_Research_Operating_System>, which is
a compatible re-creation of AmigaOS 3.1.

John

MitchAlsup

Aug 1, 2022, 5:38:38 PM
On Monday, August 1, 2022 at 3:15:00 PM UTC-5, BGB wrote:
> On 8/1/2022 3:54 AM, Anton Ertl wrote:
> > BGB <cr8...@gmail.com> writes:
> >> Find a way to reduce the cost of dealing with pipeline stall signals;
> >
> > It seems to me that OoO helps with that, and the 68080 is OoO.
> >
> Possible, I would have figured OoO would have been a bit steep for an
> FPGA, but they seem to be managing with what looks like "not too
> unreasonable" FPGAs, so dunno...
> >> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >> how to do Reg/Mem ops without introducing significant complexities or
> >> penalty cases?...
> >
> > Intel and AMD are doing it just fine with OoO implementations. The
> > 68020 (and consequently the 68080) has the additional complexity of
> > memory-indirect addressing, but the main problem I see here is one of
> > verification (can you guarantee forward progress in the presence of
> > TLB misses and page faults), not performance problems.
> >
> To support Reg/Mem "in general", seems like it would need mechanisms for:
> Perform a Load, then perform the operation (Load-Op);
<
For these ISAs, you build the pipeline as::
<
|FETCH |DECODE| AGEN |CACHE|EXECUTE|WRITEB|
<
As I stated above. No OoO is needed, but OoO does not hurt, either.
<
> Ability to perform certain ops directly in the L1 cache.
>
A 16KB L1 cache (small) is already bigger than the register file, forwarding
logic, and all integer execution stuff, and other interfaces this section talks
to. Adding RMW operations to the cache does not add "that much" logic or
"that much" to verification.

Michael S

Aug 1, 2022, 6:18:41 PM
Maybe things have changed for the better in recent months but, say, a year ago
it was practically impossible to buy Cyclone-4E/10LP or Cyclone-5 from
official distributors. I.e., formally they would accept orders, but with lead
times of 50-60 weeks, and even that with no guarantee of delivery.

> I'm not quite sure how it matches up with Xilinx, but I'd expect an Artix is
> probably comparable.

Except that the Xilinx 7 family has more trouble than Cyclone-5 dealing with
"traditional I/O", i.e. anything non-differential and above 1.8V.
In that regard the 28nm Artix-7 is more similar to the 20nm Arria-10/Cyclone-10GX
than to the 28nm Cyclone-5.

BGB

Aug 1, 2022, 7:28:52 PM
On 8/1/2022 4:38 PM, MitchAlsup wrote:
> On Monday, August 1, 2022 at 3:15:00 PM UTC-5, BGB wrote:
>> On 8/1/2022 3:54 AM, Anton Ertl wrote:
>>> BGB <cr8...@gmail.com> writes:
>>>> Find a way to reduce the cost of dealing with pipeline stall signals;
>>>
>>> It seems to me that OoO helps with that, and the 68080 is OoO.
>>>
>> Possible, I would have figured OoO would have been a bit steep for an
>> FPGA, but they seem to be managing with what looks like "not too
>> unreasonable" FPGAs, so dunno...
>>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>>>> compare well with a Load/Store ISA?... Can use fewer instructions, but
>>>> how to do Reg/Mem ops without introducing significant complexities or
>>>> penalty cases?...
>>>
>>> Intel and AMD are doing it just fine with OoO implementations. The
>>> 68020 (and consequently the 68080) has the additional complexity of
>>> memory-indirect addressing, but the main problem I see here is one of
>>> verification (can you guarantee forward progress in the presence of
>>> TLB misses and page faults), not performance problems.
>>>
>> To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
> <
> For these ISAs, you build the pipeline as::
> <
> |FETCH |DECODE| AGEN |CACHE|EXECUTE|WRITEB|
> <
> As I stated above. No OoO is needed, but OoO does not hurt, either.
> <

I was thinking of in-order...

In my case, the pipeline is:
~ PF (overlaps with ID1)
IF (Instruction Fetch)
ID1 (Decode)
ID2 (Register Fetch)
EX1
EX2
EX3
~ WB (Pseudo stage)

Could, in theory, move AGU from EX1 to ID2 to buy an extra cycle here,
but probably could not move it to ID1 without causing problems.

This could, in premise, allow Load-Op in EX2 and EX3 without lengthening
the pipeline.


As-is, Binary32 load and LDTEX have their logic shoved into the EX3
stage. Could in theory allow for a few 32-bit ALU ops or similar.

Though, while Load+ADD occurs semi-frequently, it looks like Load+CMP is
a somewhat more common case:
CMPxx Imm, (Rm, Disp)
Would likely be able to "hit" pretty often, if it existed.

So, a few common sequences here:
MOV.L (Rm, Disp), Rs; CMPxx Imm, Rs
MOV.L (Rm1, Disp), Rs; MOV.L (Rm2, Disp), Rt; CMPxx Rt, Rs
MOV.L (Rm, Disp), Rs; ADDS.L Rs, Imm, Rn
MOV.L (Rm, Disp), Rs; ADDS.L Rs, Rt, Rn


I suspect a lot of the Load+CMP sequences are for cases where a loop
counter is used with a "for()" loop or similar, but then ends up being
evicted.

Another common case here seems to be:
MOV.L (Rm, Disp), Rt; ADD Imm, Rt; MOV.L Rt, (Rm, Disp)

This seems to be another common case of the loop counter being evicted.

Then again, I suspect my ranking logic may not be counting the loop
counters as being "inside" the loop (except when referenced within the
loop body).


>> Ability to perform certain ops directly in the L1 cache.
>>
> A 16KB L1 cache (small) is already bigger than the register file, forwarding
> logic, and all integer execution stuff, and other interfaces this section talks
> to. Adding RMW operations to the cache does not add "that much" logic or
> "that much" to verification.


The way L1 works in my case:
AGU happens (externally);
Calculate index and similar to fetch from (low bits of address);
-- (Clock Edge, Fetch goes here)
Check whether or not request missed;
Extract value from cache lines;
-- (Clock Edge)
Generate modified cache line with store value inserted back;
Initiate store of modified cache lines;
-- (Clock Edge, Store happens here)


In theory, could shove a few 32-bit ALU ops in there (between the Load
and Store parts of the logic), and would need this if I wanted to
support the RISC-V 'A' extension, but, "Is it worth it?"...

But, then again, if it could deal effectively with:
"i++;" (where 'i' is currently located on the stack)
It could potentially be worthwhile from a performance POV.
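A toy model of that idea (a small ALU sitting between the L1 load path and store path, so an RMW on a cached stack slot completes in place), purely illustrative rather than BJX2's actual Verilog:

```python
# In-cache RMW sketch: load from the array, run a small ALU op on the
# value, write the modified value back on the store path. The old
# value can also feed Rn (Load-Op / XCHG style).

OPS = {
    "add": lambda old, v: old + v,
    "sub": lambda old, v: old - v,
    "and": lambda old, v: old & v,
    "or":  lambda old, v: old | v,
    "xor": lambda old, v: old ^ v,
}

def l1_rmw(cache, addr, op, value):
    old = cache[addr]                        # load path: fetch from array
    new = OPS[op](old, value) & 0xFFFFFFFF   # ALU between the two paths
    cache[addr] = new                        # store path: write back
    return old, new

cache = {0x1000: 41}
old, new = l1_rmw(cache, 0x1000, "add", 1)   # "i++" on a cached slot
```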


If I were to add these instructions, they would probably be as Op64
encodings or similar though.

But, even as such, "i++;" as an Op64 is arguably still more compact than
"i++;" as "Load; ADD; Store" (with 32-bit instruction encodings), and
could be made ~ 1 cycle, vs ~ 6 cycles for the 3-op sequence.

Would still need to think a bit about how these instructions would be
encoded though.
Possibly:
FFw0_0Vpp_F1nm_Zedd ?

Well, or hack it onto the existing RiMOV encoding:
FFw0_Pvdd_F0nm_0eoZ
P: 0..7: Ld/St, XCHG.x, ADD.x, SUB.x, -, AND.x, OR.x, XOR.x
8..F: Same, but, Rn is an Imm6u? (Store Only)
Loads understood as Load-Op, and Stores as RMW.

Such an encoding would theoretically allow for stuff like, say:
SUB.W R39, (R4, R6, 122)
ADD.B 55, (R4, R6, 69)
ADDS.L (R4, R6, 44), R45


While CMPxx is technically more common than ADD/SUB by a quick skim,
this would be harder to add without stepping on some ugly edge cases.

It would also likely be limited to Byte/Word/DWord "for reasons".


Might make sense to model some of this first, to try to see if it would
gain enough to make it worthwhile (well, it is either that, or add it to
the Verilog first to see if it can be added without effectively dropping
a nuke on resource cost or timing...).


Then again, I don't expect costs to be quite as bad as my failed "add a
second Load port" experiment, and this does have "could allow supporting
the RV64 'A' extension and similar" as an upshot (well, and possibly
also help if I wanted to add an x86 JIT compiler, but this would still
likely also need helper logic for EFLAGS emulation and similar).

MitchAlsup

Aug 1, 2022, 8:43:25 PM
Drop this into the Store Buffer and opportunistically wait for a WB
cycle into the Cache.
<
> -- (Clock Edge, Store happens here)
>
When there is a ST or no-mem-ref you can commit the pending store
to the data portion of the cache.

BGB
Aug 2, 2022, 12:53:32 AM
Went and added a small ALU into the L1 cache as an experiment, which
should theoretically be able to handle both LoadOp and StoreOp cases.
Cheaper than expected; timing seems to have survived (though it is a
little tight now).

The ALU basically sits between the logic for doing a load, and the logic
for doing a store, and if doing a LoadOp or StoreOp, it calculates the
value and puts it on both the loaded-value and stored-value paths (so it
goes where it needs to go).




New encodings are based on the existing RiMOV encodings:
FFw0_Pvdd_F0nm_0eoZ

Z: Gives the type of the Load/Store:
0..3: ST.B/ST.W/ST.L/ST.Q (Rm, Disp17s)
4..7: ST.B/ST.W/ST.L/ST.Q (Rm, Ro*Sc, Disp9u)
8..B: LD.B/LD.W/LD.L/LD.Q (Rm, Disp17s)
C..F: LD.B/LD.W/LD.L/LD.Q (Rm, Ro*Sc, Disp9u)
    e: Sign vs Zero Extend Loads; turns Store into LEA
v: Encodes scale and similar for Ro.
P: Encodes the Operator.
Ld/St, XCHG, ADD, SUB (Mem-Reg), SUB (Reg-Mem), AND, OR, XOR
8..F: Imm6u Store variants;
Load/LEA/...=Reserved
May be used for "other operations".

Load Operation: Perform the operation, Result goes into Rn
Store Operation: Perform the operation, Result goes into Mem

XCHG:
Encoded like a Load;
Value from Register is stored;
Rn gets filled with the value previously held in memory;
Effectively, the two values "pass by each other" in this case.

While an immediate limited to 6 bits in a 64-bit encoding is possibly
"kinda weak", it is probably still better than "not at all" (the
previous situation), and still allows the usual "load constant into a
register and store the register" semantics.

It is sufficient for INC/DEC, which is the main use-case it would likely
need to address.


Will need to do some more testing to try to determine its effectiveness
(if compiler support is added).

Anton Ertl
Aug 2, 2022, 7:22:26 AM
BGB <cr8...@gmail.com> writes:
>To support Reg/Mem "in general", seems like it would need mechanisms for:
> Perform a Load, then perform the operation (Load-Op);

Sure, that's what is done in every implementation.

> Ability to perform certain ops directly in the L1 cache.

I don't know any implementation that does that, although there have
been funky memory subsystems that supported fetch-and-add or other
synchronization primitives; AFAIK they do it in the remote memory
controller, not in the controlled memory, though.

>Latter is assuming that only a subset of operations support memory as a
>destination, which appears to be the case in x86 at least
>(interestingly, both RISC-V's 'A' extension

That's the atomic extension, i.e., what I called "synchronization
primitives" above. These operations are unlikely to be fast relative
to non-atomic operations.

>However, something like extending EX to 5 stages, or needing "split this
>operation into micro-ops" logic, would be a little steep. The former
>would adversely affect branch latency

Use a branch predictor, like the big boys.

>the
>latter would cause these instructions to rather perform poorly
>(defeating the purpose of adding them).

That's the way the 486 and Pentium went. Yes, load-and-op
instructions took just as long as a load and an op. I wonder how a
486 with an additional EX stage would have performed: one load-and-op
per cycle would increase performance, but you would have to wait
another cycle before a conditional branch resolves, you would need
more bypasses and more area overall.

BGB
Aug 2, 2022, 4:27:39 PM
On 8/2/2022 6:08 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
>
> Sure, that's what is done in every implementation.
>
>> Ability to perform certain ops directly in the L1 cache.
>
> I don't know any implementation that does that, although there have
> been funky memory subsystems that supported fetch-and-add or other
> synchronization primitives; AFAIK they do it in the remote memory
> controller, not in the controlled memory, though.
>

I did an experiment where I put a mini-ALU in the L1 cache.

Interestingly, it sorta works:
No significant change to architecture (vs Load/Store);
Resource cost is modest;
Timing Seems to survive;
...

Drawbacks:
Doesn't work for general operations;
A few 'useful' cases (like direct CMP against memory) are left out;
No good way to route a status flag update out of this.

Compare would require doing the flag update in EX3, after the result
arrives, but this wouldn't save much over the 2-op sequence.


This trick wouldn't work for a full ISA (like x86), but is probably OK
for "well, we'll stick a few ALU ops here".

Would also be fully insufficient for an ISA like M68K.



>> Latter is assuming that only a subset of operations support memory as a
>> destination, which appears to be the case in x86 at least
>> (interestingly, both RISC-V's 'A' extension
>
> That's the atomic extension, i.e., what I called "synchronization
> primitives" above. These operations are unlikely to be fast relative
> to non-atomic operations.
>

Possibly true.

But, it does add a limited set of RMW operations.


The extension I added should be more or less able to emulate the
behavior of the 'A' extension, but is nowhere near as far reaching as x86.

Then again, much past things like basic ALU operations, the compilers
tend to treat x86-64 as if it were Load/Store, so it may not be a huge loss.

Well, and also x86 tends to only have LoadOp forms of most instructions,
with StoreOp limited primarily to things like basic ALU instructions and
similar.


Still not entirely sure how I will use them in my compiler (where
everything was written around the assumption of Load/Store); this will
be another thing to resolve for now.

Maybe would add special-cases to a few of the 3AC ops, where if trying
to do a binary op and the arguments are not in registers (and this
extension is enabled), will do a few extra checks and maybe use the
LoadOp or StoreOp encodings if appropriate.


>> However, something like extending EX to 5 stages, or needing "split this
>> operation into micro-ops" logic, would be a little steep. The former
>> would adversely affect branch latency
>
> Use a branch predictor, like the big boys.
>

I do use a branch predictor, but latency would still be latency on
misprediction.


>> the
>> latter would cause these instructions to rather perform poorly
>> (defeating the purpose of adding them).
>
> That's the way the 486 and Pentium went. Yes, load-and-op
> instructions took just as long as a load and an op. I wonder how a
> 486 with an additional EX stage would have performed: one load-and-op
> per cycle would increase performance, but you would have to wait
> another cycle before a conditional branch resolves, you would need
> more bypasses and more area overall.
>

Yeah, dunno. It is a mystery.


I can wonder, how did stuff back then perform as well as it did? By most
of my current metrics, performance should have been "kinda awful" with
386 and 486 PCs, but they still ran things like Doom and Win95 and
similar pretty well.

Well, also mysteries like how things like JPEG and ZIP were "fast",
where I am currently only getting:
~ 0.4 Mpix/sec from JPEG decoding;
~ 2 MB/s from Deflate (decoding);
...
Which doesn't really seem all that fast.


Well, some era appropriate video codecs also sorta work, but I have to
compress them further because, while I have enough CPU power to play
CRAM video, I don't generally have the IO bandwidth (and by the time one
gets the bitrate low enough, it looks like broken crap). Similar codec
designs with an extra LZ stage thrown on work pretty OK though (can do
320x200 at 30fps).


MPEG is still a little bit of a stretch though (unless I do it at
160x120 or 200x144 or something...).

For a moment, I was having (~ childhood) memories of movies on VCD being
playable on PCs, but then remembered that this was on a Pentium, so
probably doesn't really count for whether or not they would have worked
acceptably on a 486.

Well, I also have memories of things like FMV games and similar from
that era. The game basically being a stack of CDs with poor quality
video, most not offering much in terms of either gameplay or replay value.



In my case, Doom runs in the double-digits, but only rarely gets up near
the 32 fps limit.

Granted, I am using RGB555, drawing to an off-screen buffer (followed by
a "blit"), and using RGB alpha-blending to implement things like screen
color flashes, which possibly adds cost in a few areas.


Well, and for example, games like Hexen and Quake2 had faked
alpha-blending via using lookup tables (rather than using the RGB values).

Doom's original "invisibility" effect was also done using colormap
trickery (whereas my port had switched to doing it via RGB math after
switching stuff over to 16-bit pixels), ...


With my newer TKGDI experiment, it is possible I could revisit the
original idea of doing everything with RGB555, and maybe look at the
possibility of going back to 8-bit indexed color for some things here
(and then convert during "blit").

Mostly this would require me to add indexed-color bitmap support to
TKGDI, say, traditional approach:
BITMAPINFOHEADER:
biBitCount=8, biClrUsed=256, biClrImportant=256, ...
Then one appends the color palette after the end of the BITMAPINFOHEADER
structure (at 32 bits/color for whatever reason).

Where, the blit operation being responsible for the index-color to
RGB555 conversion.

Could maybe go further, and have a color-cell encoder that can operate
on index-color input (with a table of precomputed Y values and similar),
but this would be kinda moot if window backing buffers and the main
screen buffer were all still RGB555.


But, this still leaves a mystery of how things like Win95 and similar
were so responsive on the fairly limited hardware of the time. Though, I
can guess they didn't use per-window backing buffers (but then how does
one do the window stack redraw without windows drawing on top of each
other, ... ?).

Well, also on this era of hardware, they were using raw bitmapped
framebuffers in a hardware native format, rather than trying to feed
everything through a color-cell encoder (*1).

Well, say, because 640x400x16bpp would need 512K and 32MB/s for the
screen-refresh, and there isn't enough bandwidth to pull off the display
refresh in this case (it would turn into a broken mess).

But, OTOH, 640x400 16-color RGBI looks awful, ...

Which is part of why I originally had used a color-cell display to begin
with (as did a lot of the early game consoles and similar).



*1: This takes blocks of 4x4 pixels, figures out a "dark color" and a
"light color" (converts RGB->Y for this), and then generates a 2-bpp
interpolation value per pixel (interpolates between the A and B
endpoints). This process appears to be a performance bottleneck in the
640x400 mode.

Tested several options, the interpolation bits are generated with a
process like:
rcp=896/((ymax-ymin)+1); //cached (shared between all the pixels)
ix=(((pix_y-avg_y)*rcp)>>8)+2; //per pixel
block=(block<<2)|ix;
Generally, with 2 passes over all the pixels:
First pass, figures out the ranges and color endpoints;
Second pass, maps the pixels to 2-bpp values.

This use of a multiply was the faster approach in this case, where I had
also tested, eg:
ix=(pix_y>=avg_y)?((pix_y>=avg_hi_y)?3:2):((pix_y>=avg_lo_y)?1:0);
But, this was slower than the use of a per-pixel multiply.

The use of a multiply here tends to be faster on x86 as well.


For "higher quality" encoders, there might be multiple gamma functions,
and some fancier math for calculating endpoints (cluster averaging), but
this is a bit too slow for real-time encoding on the BJX2 core
(normally, one would want to select a "gamma curve" which maximizes
contrast between the high and low sets; and then calculate endpoints
roughly representing a weighted average of the extremes and the centroid
regions of each set of pixels).

For speed, one mostly has to live with a single gamma function, and
merely using the minimum and maximum values, ...


Though, one other option would be normalizing the window backing buffers
to my UTX2 format, with UTX2 also being used for the main screen buffer.
This would allow the screen-redraw and conversion into the VRAM format
to be faster.

Mostly, would still prefer Doom to still be able to have double-digit
framerates in a "Full GUI" mode.



Then again, from some of the videos I have seen of people showing off
running Doom on RISC-V soft-cores, they are often getting low
single-digit framerates (so, I am probably not doing too horribly in
this sense).


Like, at least in 320x200 (RGB555 mode), on a 50MHz core, I am getting
double-digit framerates.


> - anton

Thomas Koenig
Aug 2, 2022, 4:39:16 PM
Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
> BGB <cr8...@gmail.com> writes:
>>To support Reg/Mem "in general", seems like it would need mechanisms for:
>> Perform a Load, then perform the operation (Load-Op);
>
> Sure, that's what is done in every implementation.
>
>> Ability to perform certain ops directly in the L1 cache.
>
> I don't know any implementation that does that, although there have
> been funky memory subsystems that supported fetch-and-add or other
> synchronization primitives; AFAIK they do it in the remote memory
> controller, not in the controlled memory, though.

The Nova might count, if you consider its memory to be a cache
(well, not really, but it was the closest memory to the CPU, so...).
It had ISZ (increment and skip if zero) and DSZ (decrement and
skip if zero), which apparently were done in the memory subsystem.
Slow, but it saved a register, and registers were in short supply.

Marcus
Aug 2, 2022, 4:54:38 PM
Don't you need two data access points along such a pipeline?

BGB
Aug 2, 2022, 5:32:54 PM
It is a mystery.


I guess, assuming LdOp:
IF ID RA MA EX1 EX2 EX3 WB
IF ID RA MA EX1 EX2 EX3 ST
RA: Register Access and AGEN
MA: Memory Access (Load)
WB: Write Back (Register File)
ST: Store to Memory

Load : IF ID RA MA +++ +++ +++ WB
Store : IF ID RA -- --- --- --- ST
Reg Op (1L): IF ID RA -- EX1 +++ +++ WB
Reg Op (2L): IF ID RA -- EX1 EX2 +++ WB
Reg Op (3L): IF ID RA -- EX1 EX2 EX3 WB
StoreOp(2L): IF ID RA MA EX1 EX2 --- ST
StoreOp(3L): IF ID RA MA EX1 EX2 EX3 ST
--: Unused Stage (No Forward)
++: Unused Stage (May Forward)


This would likely require tighter coupling between the pipeline and L1
cache though, since these would happen in lockstep albeit with a longer
delay than for Load/Store (this would complicate things like memory
consistency, since it would now be possible for the MA stage of
following instructions to evict resident cache lines before the Store
stage of previous instructions).


It is quite possible that, in such a design, if the 'RA' stage generates
an L1 cache index which collides with a store that is already in-flight,
the pipeline would need to interlock.

Though, this would create an extra penalty, as load/store-collision
interlocks are likely to be a serious issue in many cases (would add a 3
or 4 cycle penalty whenever an operation tries to access a cache-line
with an in-flight store, which is likely to be fairly common in areas
like the stack and similar).

Though, one possibility would be to only cause an interlock if this
access would generate an L1 miss.

It is likely that MA would need to be treated like an Execute stage.

...

MitchAlsup
Aug 2, 2022, 5:41:32 PM
Sorry cannot parse your question.
<
But the AGEN unit at the front of the pipeline is speculative, and all
actual calculations are done after inbound memory references (if any)
have shown up.

Marcus
Aug 3, 2022, 3:41:42 AM
Sorry, I meant: In the pre-execute stages you need to read memory
operands, no? And later in the pipeline you need to write to (or read
from?) memory? Thus you would need (at least) two concurrent ports to
the L1D$?

/Marcus

EricP
Aug 3, 2022, 12:33:58 PM
The thing is that a fixed layout pipeline can only accommodate
the specific situations that it is designed for.
Forwarding allows limited topological rearrangement.
Putting an extra AGEN at an early pipeline stage makes
all uOps perform an extra stage, which adds extra latency
that makes it costlier to fill in bubbles.

A pipeline might dynamically rearrange while maintaining In-Order (InO)
simplicity such that it can fill in bubbles as best as possible.
(I have a mental picture of a dynamic Pert chart.
It is not OoO but it does allow concurrency to fill in bubbles.)

For example, a ST with immediate data and immediate address doesn't
need either a RR Register Read stage or AGEN, and can go straight
from Decode to LSU. That ST can launch concurrent with an earlier
RR-ALU uOp, or a following RR-ALU op can launch concurrent with ST.
ST doesn't need the WB stage so a subsequent uOp can use that stage.

MitchAlsup
Aug 3, 2022, 2:14:04 PM
No !! There is a trick to all of this:
<
Consider the cache as {tag, TLB, data}
LDs read {tag, TLB, data} in the early stage
STs read {tag, TLB} in the early stage and use {data} in the later stage.
Anytime there is not a AGEN or a ST in the early stage, the later ST
stage can use {data}
<
This, BTW, is the HP store pipeline patented circa 1988, just applied
to LD-Op pipeline design.
>
> /Marcus

MitchAlsup
Aug 3, 2022, 2:20:02 PM
On Wednesday, August 3, 2022 at 11:33:58 AM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
> >> Well, and further reaching issues, say, whether a Reg/Mem ISA could
> >> compare well with a Load/Store ISA?... Can use fewer instructions, but
> >> how to do Reg/Mem ops without introducing significant complexities or
> >> penalty cases?...
> >>
> > You build a pipeline which has a pre-execute stage (which calculates
> > AGEN) and then a stage or 2 of cache access, and then you get to the
> > normal execute---writeback part of the pipeline. I have called this the
> > 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
> > machines used such a pipeline.
> The thing is that a fixed layout pipeline can only accommodate
> the specific situations that it is designed for.
> Forwarding allows limited topological rearrangement.
> Putting an extra AGEN at an early pipeline stage makes
<
It is not "extra"; there is only 1 AGEN and it is in the early part of the pipe.
Only through AGEN can you access {tag, TLB}. There is not another
AGEN in EXECUTE; you already "missed" the cache access port.
<
> all uOps perform an extra stage, which adds extra latency
> that makes it costlier to fill in bubbles.
<
One of the obvious drawbacks. TANSTAAFL.
>
> A pipeline might dynamically rearrange while maintaining In-Order (InO)
> simplicity such that it can fill in bubbles as best as possible.
> (I have a mental picture of a dynamic Pert chart.
> It is not OoO but it does allow concurrency to fill in bubbles.)
<
a GBOoO machine would not want such a pipeline.
>
> For example, a ST with immediate data and immediate address doesn't
> need either a RR Register Read stage or AGEN, and can go straight
> from Decode to LSU.
<
What if there is another memory reference in the cycle preceding
ST #7,[global_a]? Even though there is no arithmetic that needs to occur,
there are pipeline "events" to monitor, and in the end it is easier to let
the pipeline "flow" naturally.

BGB
Aug 3, 2022, 2:27:21 PM
Not necessarily.

While a Store also needs the data from a Load, in principle one could
delay the store part of the mechanism by N clock cycles and allow the
value for the Store part to arrive from the main pipeline N cycles after
the Load part.


Main issue is that this creates some potential memory consistency issues:
What happens if a following instruction would lead to an L1 miss?
What if one tries to access the line again before it has stored?
...

Scenario, two consecutive stores to the same index,
Say, 01100100 and 01120100:
First store, Fetches index 008;
Second store, Misses on 008, Loads a different cache line;
First store writes its result to index 008;
Second store writes its result to index 008;
The result of the first store is lost.

Scenario, two consecutive stores to the same line,
Say, 01100100 and 01100108:
First store, Fetches index 008;
Second store, Fetches index 008 (No miss this time);
First store writes its result to index 008;
Second store writes its result to index 008;
However, this line is stale, lacking the prior store;
The result of the first store is lost.

Scenario, store followed by load of same address,
Say, both 01100100:
Store, Fetches index 008;
Load, Fetches index 008 (No miss this time);
First store writes its result to index 008;
But, Load has a stale value.


The simplest option would be to generate an interlock stall if the new
instruction falls into the same spot in the L1 cache as an in-flight
store, but this is "not ideal" for performance (things like stack spill
and fill are likely to perform poorly in this case; these often involve
mixed loads and stores to adjacent memory addresses).

One could instead have an "Early Store" and a "Late Store" (would only
require interlocking on a "Late Store"), but this creates a new problem:
What if a prior "Late Store" and a following "Early Store" happen to
land on the same clock cycle? This is also not good.

These cases could be handled by a bunch of special case interlock
checks, but this is not ideal.

...



If designing an ISA, one possible option is to have only LoadOp as a
generic case, but then StoreOp only for simple ALU instructions. In this
case, these ALU ops could be put into the L1 cache directly, and the
Late Store mechanism and scenario could be eliminated (or, at least,
reduced to a 1 cycle delay).

This could be done while adding roughly 1 cycle of pipeline latency
on-average vs a pure Load/Store design.

At least from superficial checks, x86 seems to mostly fit this latter
pattern.

However, x86 has the drawback that pretty much every ALU instruction
updates status bits in EFLAGS/rFLAGS, which seems like a bit of a
hassle, since one would have to route in the flags updates from several
different areas.

Though, a likely option would be that, rather than routing the bits
linearly through all the ALUs, they could be expanded along each path
to, say:
00: Bit is clear (No Change)
01: Bit is set (No Change)
10: Clear the bit (Changed)
11: Set the bit (Changed)

In which case, the different paths can "update" the flags bits
independent of each other.



As can be noted, in BJX2 I have partly ended up with something a little
more limited:
A few LoadOp and StoreOp cases being shoved into the L1 cache;
A few LoadOp style instructions which shim in the "Op" part onto the end
of the "Load" mechanism.

This avoids adding extra latency, but makes these instructions only able
to exist as special cases, and generally limited to 1-cycle operations.

Previous examples:
FMOV.S (Rm, Disp), Rn //Load Binary32 -> Binary64
FMOV.H (Rm, Disp), Rn //Load Binary16 -> Binary64
LDTEX (Rm, Ri), Rn //Load from Texture


Newer cases (with ALU in L1):
ADDS.L (...), Rn //LoadOp, Rn=Rn+Mem
ADDU.L (...), Rn //LoadOp
SUBS.L (...), Rn //LoadOp, Rn=Mem-Rn
SUBU.L (...), Rn //LoadOp
RSUBS.L (...), Rn //LoadOp, Rn=Rn-Mem
RSUBU.L (...), Rn //LoadOp

ADDS.L Rn, (...) //StoreOp, Mem=Mem+Rn
ADDU.L Rn, (...) //StoreOp
SUBS.L Rn, (...) //StoreOp, Mem=Mem-Rn
SUBU.L Rn, (...) //StoreOp

ADDS.L Imm6u, (...) //StoreOp, Mem=Mem+Imm6u
ADDU.L Imm6u, (...) //StoreOp
SUBS.L Imm6u, (...) //StoreOp, Mem=Mem-Imm6u
SUBU.L Imm6u, (...) //StoreOp

AND.L (...), Rn //LoadOp, Rn=Mem&Rn
AND.L Rn, (...) //StoreOp, Mem=Mem&Rn
AND.L Imm6u, (...) //StoreOp, Mem=Mem&Imm6u

XCHG.L (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn
XCHG.Q (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn

...

Also adds new encodings for:
MOV.L Imm6u, (...) //Store Imm6 encodings.
MOV.L Imm6n, (...) //XCHG Imm6 encodings.
...

The encodings cover Byte, Word, DWord, and QWord operations (with signed
and unsigned variants for LoadOp).


Will assume Byte and Word will also work (this is more the hassle of
adding all of these cases to my compiler and emulator, than it is an
issue for the Verilog). So, a fairly minor change to the Verilog, but
expanding it out would add a big chunk of new instructions and encodings
to the listing (annoyingly, listing would be a lot longer if I fully
expanded out all of the encodings which exist due to internal
combinations of features); along with a whole bunch of new instruction
mnemonics to deal with it.


Will not cover immediate values bigger than 6 bits, but OTOH, if one
needs an immediate bigger than this, loading a constant into a register
isn't all that expensive.



I am still on the fence about whether QWORD ops should be supported here
(mostly due to the higher cost and latency of 64-bit ADD/SUB vs 32-bit),
but it makes sense for things like pointer increment/decrement (though,
these are more likely to be in registers, as one tends to be less likely
to increment a pointer without otherwise interacting with it in some
other way, such as a pointer dereference or similar).


Implicitly, this extension will (presumably) have the prior RiMOV
extension as a prerequisite, so if one gets this, they also get the (Rm,
Ri*Sc, Disp) addressing mode and similar.

For now, I will consider all this to be an "experimental" extension.


> /Marcus

MitchAlsup
Aug 3, 2022, 3:46:10 PM
Why not:
<
RSUBS.L Rn,(...) // StoreOp Mem=Rn-Mem
RSUBU.L Rn,(...) // StoreOp
<
??
>
> ADDS.L Imm6u, (...) //StoreOp, Mem=Mem+Imm6u
> ADDU.L Imm6u, (...) //StoreOp
> SUBS.L Imm6u, (...) //StoreOp, Mem=Mem-Imm6u
> SUBU.L Imm6u, (...) //StoreOp
<
Why not:
<
RSUBS.L Imm6u,(...) // StoreOp Mem=Imm6u-Mem
RSUBU.L Imm6u,(...) // StoreOp
<
??
>
> AND.L (...), Rn //LoadOp, Rn=Mem&Rn
> AND.L Rn, (...) //StoreOp, Mem=Mem&Rn
> AND.L Imm6u, (...) //StoreOp, Mem=Mem&Imm6u
>
> XCHG.L (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn
> XCHG.Q (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn
>
> ...
>
> Also adds new encodings for:
> MOV.L Imm6u, (...) //Store Imm6 encodings.
> MOV.L Imm6n, (...) //XCHG Imm6 encodings.
> ...
>
> The encodings cover Byte, Word, DWord, and QWord operations (with signed
> and unsigned variants for LoadOp).
<
It is this great waste of entropy which caused these kinds of ISAs to
drop out of favor.

BGB
Aug 3, 2022, 5:34:47 PM
On 8/3/2022 11:32 AM, EricP wrote:
> MitchAlsup wrote:
>> On Monday, August 1, 2022 at 1:26:12 AM UTC-5, BGB wrote:
>>> Well, and further reaching issues, say, whether a Reg/Mem ISA could
>>> compare well with a Load/Store ISA?... Can use fewer instructions,
>>> but how to do Reg/Mem ops without introducing significant
>>> complexities or penalty cases?...
>> You build a pipeline which has a pre-execute stage (which calculates
>> AGEN) and then a stage or 2 of cache access, and then you get to the
>> normal execute---writeback part of the pipeline. I have called this the
>> 360 pipeline (or IBM pipeline) several times in the past, here. The S.E.L
>> machines used such a pipeline.
>
> The thing is that a fixed layout pipeline can only accommodate
> the specific situations that it is designed for.
> Forwarding allows limited topological rearrangement.
> Putting an extra AGEN at an early pipeline stage makes
> all uOps perform an extra stage, which add extra latency
> that makes it costlier to fill in bubbles.
>

This is a big drawback:
Best cases for generic LoadOp add latency.

At least found a way to support a few of these cases (in a non-generic
way) in my case without adding a latency penalty (or changing the
pipeline). Maybe not ideal for timing, but sorta works.


Would make a lot more sense for an ISA like x86, where presumably one is
already prepared to pay these penalties for an in-order core as an
artifact of the ISA design.


> A pipeline might dynamically rearrange while maintaining In-Order (InO)
> simplicity such that can fill in bubbles as best as possible.
> (I have a mental picture of a dynamic Pert chart.
> It is not OoO but it does allow concurrency to fill in bubbles.)
>
> For example, a ST with immediate data and immediate address doesn't
> need either a RR Register Read stage or AGEN, and can go straight
> from Decode to LSU. That ST can launch concurrent with an earlier
> RR-ALU uOp, or a following RR-ALU op can launch concurrent with ST.
> ST doesn't need the WB stage so a subsequent uOp can use that stage.
>

Dunno there.

Ironically, I am seemingly gradually getting closer to how one would
design a "reasonable cost" x86 core...

Main remaining ugly part is mostly the instruction decoder, which is
likely mostly a matter of having a bunch of duplicated logic for "What
if a Mod/Rm byte happens right here?".


I am almost to the point where I could try to implement such a thing if
I wanted to (and, ironically, an x86-64 core may well be cheaper than an
IA-64 core, on account of IA-64's stupidly large register file).

Much less confident about performance though.
It is like a bit of an enigma:
Sometimes, x86's performance is unexpectedly fast;
Sometimes, it is meh, or downright terrible (*).


*: Like the early versions of the Atom (such as in an ASUS Eee), where a
RasPi can seemingly run circles around it in terms of general performance.

Meanwhile, the original MS-DOS builds of Doom are seemingly unexpectedly
fast, whereas x86 builds based on back-ports of the Linuxdoom source
release seem to be dragging around a boat anchor (in comparison). Well,
then there is Hexen which, despite being based on the Doom engine,
manages to somehow be almost as slow as Quake.



I could still consider trying to write an x86 emulator on top of BJX2; I
wrote a basic x86 emulator once before (which in a way was the origin of
a lot of the designs used in my later emulators and interpreters).

Basically, previous to this, many of my interpreters had used a fairly
naive strategy:
Decode an instruction;
Execute an instruction;
Decode next instruction;
Execute next instruction;
...
Generally spinning in a loop (which then bottle-necked on the "decode
and dispatch" part of the process).

When at first I tried writing an x86 emulator, this was no longer a
workable strategy, as the logic for pattern matching the instructions
was way too slow for handling this part inline.


So, solution:
Decode a trace of instructions in advance, and have each instruction as
a struct with "do your thing" function pointer (mostly eliminating the
decoding cost from the running execution time).

Then, there was another trick of detecting cases where future
instructions in a trace would mask out the previous instructions' EFLAGS
updates, allowing them to be replaced with faster "non EFLAGS updating"
variants (worked OK, as most of the EFLAGS updates were being masked).

However, I have doubts this would be sufficient to give usable
performance on an otherwise already slow CPU core.

But, I suspect I could probably do better this time around, if I got
around to it (in later emulators, I had realized a few more potentially
relevant tricks).



At this point, even on my PC, my BJX2 emulator seemingly can't maintain
real-time emulation much over ~150MHz; it is mostly bogged down with
handling Load/Store operations.

Apart from stuff related to Load/Store handling, no other "major
hotspots" in the profiler.

There are special lookup hint cases to speed up Load/Store cases (eg:
caching previously accessed page addresses and pointers, as a sort of
small emulator-side TLB cache, vs always needing to go through the main
TLB and looking up a memory span), but these aren't really sufficient to
fully defeat this issue.


The main trace loop:
while(tr && (cyc>0))
{ cyc-=tr->n_cyc; tr=tr->Run(tr); }

Is still making a showing in the profile, implying the emulator is still
running at close to full speed (and a lot of the longer trace dispatch
functions are still making a showing, so the problem isn't mostly one of
being overrun with overly short instruction traces).

...


However, granted, for normal use the emulator just needs to be faster
than what the FPGA version can do.

Kinda annoying on a RasPi though, as at present the RasPi can't emulate
the BJX2 core at much faster than about 16 MHz, and my previous attempts
to use a JIT compiler (and NEON instructions) to improve performance on
ARM were decidedly unsuccessful.

Granted, my attempts at running TKRA-GL on a RasPi "weren't so hot"
either; it seemingly kinda sucks at it (much as doing so on an
early-2000s laptop did). Neither is really fast enough to give a
particularly usable GLQuake experience with software GL, despite having
a fairly significant clock-speed advantage.

Extrapolating backwards, this would imply that trying to run software
rasterized OpenGL or similar on a 486 or similar would have been
borderline glacial.



But, it seems like my current ISA is putting up more of a fight than
previous ISAs at "emulate the ISA at a higher clock speed".

However, most of the previous ISAs:
Didn't have to account for bundling:
Superscalar would have a similar effect here;
More ops get a "free ride" in the clock-cycle accounting.
Had a lower density of Load/Store ops:
Limited address modes, needing ALU ops for address calcs, ...;
This would make "cheap to emulate" ops more common.


In this case, the emulator is seemingly running into a limit of not
being able to get much under around 20 (real) clock-cycles per emulated
instruction (at least, with a high density of Load/Store instructions).

Though, as long as it stays below an average of around 60 clock-cycles
per emulated instruction, this is fast enough to keep up with real-time
emulation of a 50MHz CPU core.



Though, if emulating x86 on BJX2, there would at least be the advantage
that I could map the x86 virtual address space into the BJX2 virtual
address space, and then make use of the hardware MMU for most of the
address translation (even if the reverse isn't really true).


But, I still have serious doubts as to whether it could be fast enough
for something like Doom to be playable. Then again, the RasPi seemingly
fails at this task as well.

Neither QEMU nor DOSBox on a RasPi gives usable performance for playing
Doom (it was basically a slide show when I tested it).

Running Doom in my BJX2 emulator on a RasPi still somehow manages to
give more playable performance than the RasPi port of DOSBox.


Well, and the relative oddity that Software GL in the emulator isn't all
that much slower than trying to do it using a natively compiled version.

This is very much unlike Doom; which is apparently able to run crazy
fast in a native ARM build.

Performance is weird sometimes...


...

BGB

unread,
Aug 3, 2022, 5:43:26 PM
to
Actually, these cases should exist as well, just didn't think to mention it.


>>
>> ADDS.L Imm6u, (...) //StoreOp, Mem=Mem+Imm6u
>> ADDU.L Imm6u, (...) //StoreOp
>> SUBS.L Imm6u, (...) //StoreOp, Mem=Mem-Imm6u
>> SUBU.L Imm6u, (...) //StoreOp
> <
> Why not:
> <
> RSUBS.L Imm6u,(...) // StoreOp Mem=Imm6u-Mem
> RSUBU.L Imm6u,(...) // StoreOp
> <
> ??

Likewise, this wasn't exactly an exhaustive list of every new
"instruction" which pops into existence as a side effect of this feature...


>>
>> AND.L (...), Rn //LoadOp, Rn=Mem&Rn
>> AND.L Rn, (...) //StoreOp, Mem=Mem&Rn
>> AND.L Imm6u, (...) //StoreOp, Mem=Mem&Imm6u
>>
>> XCHG.L (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn
>> XCHG.Q (...), Rn //Ld+St, Rn'=Mem | Mem'=Rn
>>
>> ...
>>
>> Also adds new encodings for:
>> MOV.L Imm6u, (...) //Store Imm6 encodings.
>> MOV.L Imm6n, (...) //XCHG Imm6 encodings.
>> ...
>>
>> The encodings cover Byte, Word, DWord, and QWord operations (with signed
>> and unsigned variants for LoadOp).
> <
> It is this great waste of entropy which caused these kinds of ISAs to
> drop out of favor.

Given these were a hack onto the RiMOV encodings (already using a 64-bit
instruction format), the added entropy cost was at least in a part of
the space where it didn't eat into the 32-bit encoding space.

There is not currently any plan to migrate these encodings into the
32-bit part of the encoding space.


But, yeah, something like:
AND.B 63, (R4, 0)
Would take 8 bytes to encode, unlike x86 where its equivalent could be
encoded in 3 bytes.

Torbjorn Lindgren

unread,
Aug 3, 2022, 7:44:07 PM
to
BGB <cr8...@gmail.com> wrote:
>On 8/1/2022 4:06 AM, Theo wrote:
>> BGB <cr8...@gmail.com> wrote:
>>> But, what sort of FPGA, exactly?...
>>
>> Cyclone V 5CEFA5F23C, 77k LE:
>> http://www.apollo-computer.com/icedrakev4.php
>
> From what I can gather, vs an XC7A100T:
> Fewer LUTs (kinda hard to compare directly here);
> Less Block RAM;
> More IO pins;
> Faster maximum internal clock speed.
>
[...]
>> Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
>> pretty antique as far as Arm cores go). Those are pretty comparable with
>> the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
>> there's no transceivers and no Arm, hence it's at the cheap end of the line
>> (the A5 meaning 77k LE is the middle of the range).
>>
>> https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html
>>
>> This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).

Yeah, the A2 to A9 summary[1] lists memory as Mb/Gb/Gb/Mb/Mb; this is
clearly wrong (and I confirmed it's all Mb via other sources).

The part they use is the 5CEA5; if we follow that link we get to the
Intel ARK page, which includes lots of different order codes, three of
them "5CEFA5F23C" models (different speed grades in the same
package & pin-out; C6 is the fastest).


>> The I/Os are typically good to drive DDR3, which is what the Arm uses for
>> DRAM.
>
>On Artix, one is typically using FPGA logic and IO to drive the DDR.
>In my case, I am driving the DDR2 module at 50MHz on the board I have.

According to the Intel ARK page[2] the 5CEA5 model has "hard memory
controllers" which support DDR2, DDR3 and LPDDR2. The Product Table
confirms this and also reveals that the A5 (and up) actually has two
"hard memory controller (FPGA)" blocks.


>Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
>in this case, and this is pushing the limits of the IO pin speeds.

If I'm reading Intel's External Memory Interface Spec Estimator
[4] correctly, it reveals that the Cyclone V's memory channels are "up
to 40-bit" (i.e. like mobile, not like PC) and that these models can do
up to 80-bit (confirming the dual-channel).

It looks like the faster C6 & C7 grades can run DDR3-800 (400MHz) if
you restrict it to a single chip select, though it needs DDR3-1066
rated memory chips for that due to an errata. The slower C8 only
supports up to DDR3-666 using DDR3-800 rated memory chips. Speed with
two chip selects is a bit lower (DDR3-666 for C6/C7, DDR3-606 for C8).

It's possible Intel wrote MHz and meant MT/s; if so, the numbers would
be half this, but 800MT/s is AFAIK the slowest JEDEC DDR3 standard
speed, which hints these are the correct numbers.

The figures for DDR2 are the same (800MT/s) and LPDDR2 is a bit slower
(666MT/s, can't do 2 chip selects), these speeds are also completely
reasonable for DDR2 and LPDDR2. I suspect the reason for supporting
three different memory interfaces is to give the designers more
choices in memory sizes.

800MT/s and a 64-bit/8-byte wide memory interface gives us a best case
total "interface speed" of 6.4 GB/s. Obviously it will never HIT that
but if the implementation is competent there's no reason it couldn't
be capable of 3-5 GB/s.
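That back-of-envelope figure is just transfers-per-second times bus width in bytes; as a trivial helper (hypothetical name, purely for illustration):

```c
/* Back-of-envelope peak DRAM interface bandwidth:
 * transfers per second times bus width in bytes, reported in GB/s. */
static double peak_bandwidth_gbs(double mega_transfers, int bus_bits)
{
    return (mega_transfers * 1e6) * (bus_bits / 8) / 1e9;
}
```

E.g. 800 MT/s on a 64-bit bus gives the 6.4 GB/s quoted above; real throughput is then whatever fraction of that the controller and access pattern manage to sustain.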

Given that all the "68080" boards I found have 512MB of (soldered)
memory that seems like it should be sufficient bandwidth assuming they
have L1 and L2 caches on the CPU which seems very likely.

AFAIK your core has order(s?) of magnitude less memory bandwidth than
this (the 80MHz above also hints at that).

Hmmm! Looks like the Cyclone V is the ONLY low-end Intel/Altera FPGA
with hard memory controllers; even some of the mid-range models don't
have them. So those have to rely on much slower soft memory
controllers.

So perhaps that's the secret sauce!


>> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
>> their distributor is almost certainly less.

It says "non-stocked", so I suspect all this would do is put in an
order with their supplier, and then Mouser comes back to you with some
kind of estimate (or "when we get it"!) - I'd definitely check with
them before ordering any non-stocked parts.

The fact that there are suppliers in Asia that want $450 and $650 for
it respectively, and provide NO volume discount at all, kind of
suggests availability via "proper" channels may be very limited.

Certainly the Cyclone V SE A6 (SE is the SOC variant with lots of
extra hard stuff) that the Terasic DE10-Nano uses is in very short
supply - the MISTer game FPGA system uses this so I know the lead
times are very long.


1. https://ark.intel.com/content/www/us/en/ark/products/series/141579/cyclone-v-e-fpga.html
2. https://ark.intel.com/content/www/us/en/ark/products/210443/cyclone-v-5cea5-fpga.html
3. https://www.intel.com/content/dam/support/us/en/programmable/support-resources/bulk-container/pdfs/literature/pt/cyclone-v-product-table.pdf
4. https://www.intel.com/content/www/us/en/support/programmable/support-resources/support-centers/emif-spec-estimator.html

BGB

unread,
Aug 3, 2022, 11:08:05 PM
to
On 8/3/2022 6:44 PM, Torbjorn Lindgren wrote:
> BGB <cr8...@gmail.com> wrote:
>> On 8/1/2022 4:06 AM, Theo wrote:
>>> BGB <cr8...@gmail.com> wrote:
>>>> But, what sort of FPGA, exactly?...
>>>
>>> Cyclone V 5CEFA5F23C, 77k LE:
>>> http://www.apollo-computer.com/icedrakev4.php
>>
>> From what I can gather, vs an XC7A100T:
>> Fewer LUTs (kinda hard to compare directly here);
>> Less Block RAM;
>> More IO pins;
>> Faster maximum internal clock speed.
>>
> [...]
>>> Cyclone V go up to 300K LE, and can have an Arm Cortex A9 on them (yes,
>>> pretty antique as far as Arm cores go). Those are pretty comparable with
>>> the Zynq in Xilinx-land. This one is a Cyclone V E version, which means
>>> there's no transceivers and no Arm, hence it's at the cheap end of the line
>>> (the A5 meaning 77k LE is the middle of the range).
>>>
>>> https://www.intel.com/content/www/us/en/products/details/fpga/cyclone/v/e.html
>>>
>>> This one has 4.8Mbit of BRAM (think the 'Gb' on that table is a typo).
>
> Yeah, the A2 to A9 summary[1] lists memory as Mb/Gb/Gb/Mb/Mb, this is
> clearly wrong (and I confirmed it's all Mb via other sources).
>

Yeah, probably nothing in this category is going to have Gb of Block RAM...


> The part they use is the SCEA5, if we follow that link we get to the
> Intel ARK page which includes lots of different order codes, three of
> them are "5CEFA5F23C" models (different speed grades in the same
> package & pin-out, C6 is the fastest).
>

The FPGA I am using is a "-1" speed grade, which in Artix-7 terms, is
basically the slowest.


>
>>> The I/Os are typically good to drive DDR3, which is what the Arm uses for
>>> DRAM.
>>
>> On Artix, one is typically using FPGA logic and IO to drive the DDR.
>> In my case, I am driving the DDR2 module at 50MHz on the board I have.
>
> According to the Intel ARK page[2] the 5CSEA5 model has "hard memory
> controllers" which support DDR2, DDR3 and LPDDR2. The Product Table
> confirms this and also reveals the A5 (and up) actually has two "hard
> memory controller (FPGA)".
>

Yes, but no hard controllers on Artix-7, where one only has soft
controllers.

Usual idea is that Xilinx wants people to use Vivado's MIG tool, but
then one would need to deal with AXI.

In theory, one could use SERDES (the RAM being connected up to SERDES
capable pins), but the specifics of how to use it are a bit sparse, and
most of the "official" stuff here is mostly "Instantiate these random IP
Cores from the IP Catalog...".

Me: "How about NO."
Don't really want "IP Cores", nor do I necessarily want to deal with AXI.

I would prefer it if they bothered actually documenting their FPGA
primitives, vs just endlessly being like "Invoke X from the IP
Catalog Wizard"...


In my case, I am mostly testing stuff in simulations using Verilator,
with the testbenches using a mock-up of the RAM chip based on
descriptions from various RAM module datasheets and similar (which was,
luckily, apparently accurate enough to allow interfacing with the actual
RAM chips; however, the actual standards here are behind JEDEC
paywalls, so I don't have these...).


>
>> Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
>> in this case, and this is pushing the limits of the IO pin speeds.
>
> IF I'm reading the Intel's External Memory Interface Spec Estimator
> [4] correctly it reveals that Cyclone V's memory channels are "up to
> 40-bit" (IE like mobile, not like PC) and that these models can do up
> to 80-bit (confirming the dual-channel).
>

Didn't mention it, but the board I am using has a 16-bit RAM interface
(128MB DDR2, 16-bit, 15ns CAS latency).

Some other boards are using 8-bit DDR (32MB or 64MB modules), and some
lower-end boards are using 4-bit QSPI (512K or 1024K). This sort of
thing seems more common with XC7S25 and XC7A35T based boards.


But, running a 16-bit RAM module at 50 MHz isn't that high of RAM
bandwidth in the best case...

I am effectively running the RAM module in a low-power standby mode (DLL
disabled, with 3-1-0 timings, ...).

This isn't really a proper way to use the chip, but seems to work in
this case.



> It looks like the faster C6 & C7 grade can run DDR3-800 (400MHz) if
> you restrict it to a single chip select though it needs DDR3-1066
> rated memory chips for that due to an errata. The slower C8 only
> support up to DDR3-666 using DDR3-800 rated memory chips. Speed with
> two chip select is a bit lower (DDR3-666 for C6/C7, DDR3-606 for C8).
>
> It's possible Intel wrote MHz and meant MT/s, if so the numbers would
> half this but 800MT is AFAIK the slowest JEDEC DDR3 standard speed
> which hints these are the correct numbers.
>
> The figures for DDR2 are the same (800MT/s) and LPDDR2 is a bit slower
> (666MT/s, can't do 2 chip selects), these speeds are also completely
> reasonable for DDR2 and LPDDR2. I suspect the reason for supporting
> three different memory interfaces is to give the designers more
> choices in memory sizes.
>
> 800MT/s and a 64-bit/8-byte wide memory interface gives us a best case
> total "interface speed" of 6.4 GB/s. Obviously it will never HIT that
> but if the implementation is competent there's no reason it couldn't
> be capable of 3-5 GB/s.
>
> Given that all the "68080" boards I found have 512MB of (soldered)
> memory that seems like it should be sufficient bandwidth assuming they
> have L1 and L2 caches on the CPU which seems very likely.
>

Yeah.


> AFAIK your core has order(s?) of magnitude less memory bandwidth than
> this (the 80MHz above also hints at that).
>

Peak unidirectional DDR bandwidth is ~ 90 MB/s in my case at 50 MHz
(Unidirectional Load or Store), with a bidirectional speed of ~ 54 MB/s
(SWAP).

Theoretical extrapolated speed from the DDR tables, for running the RAM
at 50 MHz with a 16-bit RAM interface: 100 MB/s.
Seems it is pretty close...



For memcpy(), average-case is generally closer to around 26-30 MB/s.


For accesses within the L1 or L2, things are generally a bit faster:
L1:
Memcpy ~ 270 MB/s (hard limit = 400)
Memset ~ 407 MB/s (hard limit = 800)
Memload ~ 483 MB/s (hard limit = 800)
L2:
Memcpy ~ 77 MB/s
Memset ~ 142 MB/s
Memload ~ 223 MB/s
DDR:
Memcpy ~ 27 MB/s
Memset ~ 56 MB/s
Memload ~ 78 MB/s


Given L1 speeds are greater than 50% of the hard-limit, this means that
the majority of the L1 local accesses are 1-cycle. The hard limit here
is due to the clock speed (50 MHz), access width (128 bit), and single
memory port.


Vs theoretical limits:
Memcpy is 50% of theoretical limit (Swap Only);
It is 79% of adjusted limit (Load+Swap);
Memset is 62% of theoretical limit;
Memload is 87% of theoretical limit.

Where, the limit here would be if the caches and ring bus did not add
any additional latency.

Some of the overhead here is due to a multiplier effect, where, say, 1
access in the L1 may result in 2 accesses to the L2, and 4 to DRAM
(though, the latter is reduced due to the DDR controller having SWAP
operation, turning this scenario into 1,2,2).


The overhead of memcpy can partly be explained by this case
effectively resulting in a Load+Swap access pattern for DDR, which
lowers the limit down to 34MB/s.



> Hmmm! Looks like the Cyclone V is the ONLY low-end Intel/Altera FPGA
> with hard memory controllers, even some of the mid-range models
> doesn't have it. So those have to rely on much slower soft memory
> controllers.
>
> So perhaps that's the secret sauce!
>

Quite possibly...


>
>>> Mouser will sell me one for $127 (in MOQ 60), and the price they get from
>>> their distributor is almost certainly less.
>
> It say "non-stocked" so I suspect all this would do is to put in an
> order with their supplier and then Mouser comes back to you with some
> kind of estimate (or "when we get it"!) - I'd definitely check with
> them before ordering any non-stocked parts.
>
> The fact that there's suppliers in Asia that want $450 and $650 for it
> respectively and provides NO volume discount at all kind of suggest
> availability via "proper" channels may be very limited.
>
> Certainly the Cyclone V SE A6 (SE is the SOC variant with lots of
> extra hard stuff) that the Terasic DE10-Nano uses is in very short
> supply - the MISTer game FPGA system uses this so I know the lead
> times are very long.
>
>
> 1. https://ark.intel.com/content/www/us/en/ark/products/series/141579/cyclone-v-e-fpga.html
> 2. https://ark.intel.com/content/www/us/en/ark/products/210443/cyclone-v-5cea5-fpga.html
> 3. https://www.intel.com/content/dam/support/us/en/programmable/support-resources/bulk-container/pdfs/literature/pt/cyclone-v-product-table.pdf
> 4. https://www.intel.com/content/www/us/en/support/programmable/support-resources/support-centers/emif-spec-estimator.html

There is also a bit of a shortage of Xilinx based boards right now...
One of the boards I had been tempted to order is now pretty much
universally sold out.



robf...@gmail.com

unread,
Aug 5, 2022, 9:52:44 PM
to

> Yes, but no hard controllers on Artix-7, where one only has soft
> controllers.
>
> Usual idea is that Xilinx wants people to use Vivado's MIG tool, but
> then one would need to deal with AXI.

One does not have to use the AXI interface. It is an option in the MIG tool.

> In theory, one could use SERDES (the RAM being connected up to SERDES

I believe this is what the Xilinx core does. The softcore can interface to the
DDR RAM at full speed. Probably why there is not a hard core. The SERDES and
other components are more general in nature and can be applied for other
interfacing.

> I would more prefer if they more bothered actually documenting their
> FPGA primitives, vs just endlessly being like "Invoke X from the IP
> Catalog Wizard"...

I have found Xilinx to generally have good documentation. There are many
user guides available, describing the IP cores and operation.

> >> Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
> >> in this case, and this is pushing the limits of the IO pin speeds.

For my system, 800MT/s (400 MHz clock) DDR3 is being used, driven by the Xilinx soft core.
This is using the Artix-7 -1 (slow part). The DDR RAM is 16 bits wide, so that is 1.6GB/s.

I have built my own system read-cache core and multi-port memory controller to
try and make use of the bandwidth. The pipeline for the Xilinx core is pretty deep; it
is something like 25 clock cycles, but then it can transfer every clock. The core
breaks data into 16-byte chunks so that a lower clock frequency can be used.


BGB

unread,
Aug 5, 2022, 11:35:50 PM
to
On 8/5/2022 8:52 PM, robf...@gmail.com wrote:
>
>> Yes, but no hard controllers on Artix-7, where one only has soft
>> controllers.
>>
>> Usual idea is that Xilinx wants people to use Vivado's MIG tool, but
>> then one would need to deal with AXI.
>
> One does not have to use the AXI interface. It is an option in the MIG tool.
>

Might have to look into it, all the stuff I had read had said it used AXI.


>> In theory, one could use SERDES (the RAM being connected up to SERDES
>
> I believe this is what the Xilinx core does. The softcore can interface to the
> DDR RAM at full speed. Probably why there is not a hard core. The SERDES and
> other components are more general in nature and can be applied for other
> interfacing.
>

Probably true enough.

As noted, my DDR controller doesn't use SERDES, but this was partly
because I did not know about SERDES when I wrote it. I had just sorta
figured people were writing tight FIFOs and running them at high clock
speeds, partly as some of the early RAM controller code I had looked at
was working this way.


>> I would more prefer if they more bothered actually documenting their
>> FPGA primitives, vs just endlessly being like "Invoke X from the IP
>> Catalog Wizard"...
>
> I have found Xilinx to generally have good documentation. There are many
> user guides available, describing the IP cores and operation.
>

But the assumption seems to be that people use the IP Cores, and not try
to use the SERDES directly. Decent documentation for the actual FPGA
primitives is a bit more lacking here.

If possible, I also want to write Verilog that does not depend on the
specifics of Xilinx tooling (eg: relatively generic Verilog, that I can
also fill in the gaps and simulate in Verilator, or maybe synthesize in
Quartus if I decide I want to run it on a Cyclone V or something,
ideally while keeping all of the toolchain specific stuff to a minimum,
...).

The IP Cores seem to go against this, seeming almost like a trap to try
to create vendor lock-in.


>>>> Can sort-of drive RAM at 75MHz via IO pins, but reliability is lacking
>>>> in this case, and this is pushing the limits of the IO pin speeds.
>
> For my system 800MHz (400 MHz clock) DDR3 is being used, driven by the Xilinx softcore.
> This is using the Artix7 -1 (slow part). The DDR RAM is 16 bits wide, so that 1.6GB/s.
>

I am using a board with 16-bit DDR2, its rated RAM speed being 667 MHz.


> I have built my own system read cache core and multi-port memory controller to
> try an make use of the bandwidth. The pipeline for the Xilinx core is pretty deep, it
> is something like 25 clock cycles. But then it can transfer every clock. The core
> breaks data into 16 byte chunks so that a lower clock frequency can be used.
>

OK.
