multi-way LOAD/STORE buffers and misalignment


lkcl

Mar 24, 2020, 7:53:29 AM
we're doing a LD/ST buffer at the moment, and some design review and
constructive feedback and advice would be greatly appreciated.
unlike many "first time processors", we cannot possibly get away
with a simple single-issue LD/ST design as the first iteration because
of the vector engine being effectively a multi-issue multiplier that,
from a single instruction, hammers the internal engine with a *batch*
of actual operations.

this includes LD/ST and so consequently we need a multi-issue-capable
LD/ST buffer right from the start. it is being documented, here:

https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

with Mitch's help i already have a RAW / WAR dependency detection matrix
in place, which detects (in-order) *batches* of LDs, *batches* of STs,
and *batches* of atomic operations (defined by having LD/ST both set).

additionally, we have (again, thanks to Mitch), an "address clash detection"
Matrix, where the original design concept from Mitch uses bits 4 thru 11
(stopping at a VM page) to say "this LD/ST definitely doesn't overlap with
any others, but it might have picked up a few that *potentially* don't
overlap, but hey better safe than sorry".

this is quite efficient because the number of bits used to detect a clash
is drastically reduced, which is important given that it's an NxN/2 set of
binary-address comparators involved.

however we *modified* this idea, by turning the addr+len (len=1/2/4/8 byte)
into a *byte-mask*, where the byte-mask is effectively nothing more than
(or, "can be used as") byte-enable lines on the L1 cache memory read/write.

this means turning len=1 into 0b1, len=2 into 0b11, len=4 into 0b1111
or len=8 into 0b1111 1111 and then shifting it by the LSBs of the address
(bits 0-3).

however of course, that could "wrap over" to a second cache line, therefore
we split it into two:

* 1st cache line, take bits 0..15 of the mask, use addr[4:] as the address
* 2nd cache line, take bits 16..23 of the mask, use addr[4:] PLUS ONE as the address

now all LDs are potentially *two* LDs, and all STs are potentially *two* STs,
requiring synchronisation.
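
(a minimal python sketch of the mask-generation and the two-line split, purely
illustrative - the function and parameter names here are made up, the real
nmigen code lives in the repo linked above:)

    def ldst_to_masks(addr, ldlen, linebytes=16):
        # ldlen is 1/2/4/8: turn it into an unary byte-mask, then shift it
        # by the LSBs of the address (bits 0-3)
        mask = (1 << ldlen) - 1              # len=4 -> 0b1111
        mask <<= addr & (linebytes - 1)      # shift by addr[0:3]
        line = addr >> 4                     # addr[4:] selects the cache line
        lo = mask & ((1 << linebytes) - 1)   # bytes landing in the 1st line
        hi = mask >> linebytes               # spill-over into the 2nd line
        ops = [(line, lo)]
        if hi:                               # misaligned: a 2nd LD/ST needed
            ops.append((line + 1, hi))
        return ops

e.g. ldst_to_masks(0xe, 4) comes back as two (line, mask) pairs: exactly the
"every LD is potentially *two* LDs" case.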

this does interestingly mean that the "address clash detection" Matrix is
now doubled in size, because each LD/ST Computation Unit can now issue *two*
LDs/STs, not one.

there is a big advantage to partial-conversion of the LD/STs to unary-masked
format:

1) those "masks" we can now detect in the address-clash unit, whether there
are overlaps, by simple "unary" (ANDing).

2) once a batch of guaranteed-non-overlapping LDs (or STs, but not both at
once) have been identified, the ORing of those masks *pre-identifies*,
where the remaining MSBs are identical, which LDs and which STs are
*part of the same cache line*.

3) the ORed masks can be dropped - as-is - directly onto byte-level
read/write-enable lines on the L1 Cache underlying memory - with zero
shifting or other messing about.
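
(to make points 1-3 above concrete, a rough python sketch - it assumes the
requests have already been split per-cache-line as described, and are all LDs
or all STs:)

    def clash(req_a, req_b):
        # unary overlap check: same cache line AND any shared byte column
        (line_a, mask_a), (line_b, mask_b) = req_a, req_b
        return line_a == line_b and (mask_a & mask_b) != 0

    def merge_per_line(reqs):
        # OR together the masks of guaranteed-non-overlapping requests that
        # share a line address: the result drops as-is onto the byte-level
        # read/write-enable lines of that line
        merged = {}
        for line, mask in reqs:
            merged[line] = merged.get(line, 0) | mask
        return merged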

so that's the bits that we "know".

the bits that we *don't* know are what is the best L1 cache arrangement
to use. we discussed the possibility of an 8-way set-associative cache
with 512-bit wide data (!!!) - yes, really, because we will have a minimum
of 4 LD/ST operations @ 64-bit occurring every cycle (in the first version
of the processor: this *will* go up in later iterations).

one thing that would be particularly nice to avoid would be multi-ported
SRAM on the L1 Cache Memories. i thought it may be possible to
"stripe" the L1 Cache, such that bits 4-5 of every address go through to,
effectively, a completely separate (isolated, independent) L1 Cache.

* Addr bits [0:3] already turned into byte-map mask (enable lines)
* Addr bits [4:5] used to select one of four INDEPENDENT caches
* Addr bits [6:end] used "as normal"
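
(in python terms the striping is just a field-split of the address - a sketch
of the idea, not committed code:)

    def split_addr(addr):
        mask_shift = addr & 0xf         # bits [0:3] -> unary byte-mask shift
        bank       = (addr >> 4) & 0x3  # bits [4:5] -> one of 4 independent L1s
        rest       = addr >> 6          # bits [6:]  -> index/tag "as normal"
        return mask_shift, bank, rest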

the only downside of this being: LD/STs that happen to be on the same
64-byte boundary would be "penalised" by being only single-issue-capable.

do we care? i'm not sure that we do!

_should_ we care? that's the question i'd like to know the answer to :)
an alternative is that we have to multi-port the L1 Cache memory, which
would be quite expensive.

what alternatives are there? is there a way to, for example, have an
8-way set-associative cache with 8 independent input/output data paths,
yet they do not require the trick of a full split using addr bits [4:5],
but at the same time do not require 4R4W SRAMs?

i *think* this is doable, with the potential for some wait cycles if
one of the 8 ways happens to have request contention. honestly i don't know
what's best here.

advice and discussion appreciated as always,

l.

lkcl

Mar 24, 2020, 10:19:50 AM
do most systems simply assume that all L1 caches are effectively 1R1W as
far as the core is concerned, then "amalgamate" cache access into as close
to full cache lines as possible, and, if greater bandwidth is required,
simply "up the cache line width" from e.g. 16 bytes to 64 bytes?

i can't say i'm particularly happy with such an arrangement, because if
LDs/STs from a single vector instruction are to be issued on 64-byte
"strides", they're guaranteed to end up as single-issue.

how the hell is this done?? every cache i've seen is effectively 1R1W.
i found *one* paywalled paper on "dual-ported N-way cache"
https://ieeexplore.ieee.org/document/4734919
however its focus appears not to be on multi-porting but on power
reduction.

there is also this: http://sadve.cs.illinois.edu/Publications/isca00.pdf

which describes a "reconfigurable" cache, subdividable down into
partitions - howeverrr agaaaain: it's not multi-ported: it's more about
subdividing caches by "processor activity". this is probably now
called "ASID" (the paper is from 2000) - Address Space Identifier.

frustratingly-lost, here.

l.

MitchAlsup

Mar 24, 2020, 11:27:05 AM
Yes, you do want to care:: you want to care to "get it correct" even
at some minor cost in perf.
>
> _should_ we care? that's the question i'd like to know the answer to :)
> an alternative is that we have to multi-port the L1 Cache memory, which
> would be quite expensive.
>
> what alternatives are there? is there a way to, for example, have an
> 8-way set-associative cache with 8 independent input/output data paths,
> yet they do not require the trick of a full split using addr bits [4:5],
> but at the same time do not require 4R4W SRAMs?

What you want to know are the dimensions of the SRAM macros. Given these
dimensions you are in a position to start making good choices; without them
you are not. A typical SRAM dimension might be 256-bits wide and 64 words
deep (2KBytes).

Since you want to read/write 512-bits per port, you will need 2 SRAM
macros per port (per cycle). So if you are making 8-way SA cache, you
will need at least 16 SRAM macros (32KBytes)--and (A N D) you still
have but a single port!

On the other hand, dropping back to 4-way set, and that same SRAM array
can now support 2-ports of 4-way SA each. In this organization you will
be using more lower-address bits to choose port[0] versus port[1]. And
you will see 40% conflicts if you are really running 2 requests per cycle.
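
(back-of-envelope python for that arithmetic, using the assumed 256-bit x
64-word macro - a sanity check, not a floorplan:)

    MACRO_BITS  = 256            # assumed macro width
    MACRO_WORDS = 64             # assumed macro depth: 2 KiB per macro
    PORT_BITS   = 512            # desired read/write width per port

    macros_per_512b_access = PORT_BITS // MACRO_BITS          # = 2
    eight_way_one_port     = 8 * macros_per_512b_access       # = 16 macros (32 KiB), 1 port
    four_way_two_ports     = 2 * 4 * macros_per_512b_access   # = 16 macros (32 KiB), 2 ports

same 16-macro budget either way: the trade is settedness against portedness.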

However, if you know the access pattern is "dense", then you know that
LD[i][k..k+3] are all in the same line, and you can THEN make one access
for every 4 actual LDs and then sort it out with multiplexers. And NOW we
have enough ports and width to use 1024 bits per cycle throughput in the
cache. {You can determine an access pattern will be dense when the loop
increment is +1 (or -1) and the loop index is used in scaled form in
memref instructions:

LD Rx,[Rb+Ri<<scale+DISP]

} Other forms of dense access are harder for HW to detect and utilize.
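
(a trivial python check of the "dense" property: with a +1 increment and
scaled indexing, the next four 64-bit accesses normally share a line - this
assumes the group does not itself straddle a line boundary:)

    def dense_hits_one_line(base, k, scale=3, linebytes=64):
        # addresses of LD[k..k+3] at Rb + (i << scale), loop increment +1
        addrs = [base + (i << scale) for i in range(k, k + 4)]
        return len({a // linebytes for a in addrs}) == 1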

MitchAlsup

Mar 24, 2020, 11:36:16 AM
On Tuesday, March 24, 2020 at 9:19:50 AM UTC-5, lkcl wrote:
> do most systems simply assume that all L1 caches are effectively 1R1W as
> far as the core is concerned, then "amalgamate" cache access into as close
> to full cache lines as possible, and, if greater bandwidth is required,
> simply "up the cache line width" from e.g. 16 bytes to 64 bytes?

You should assume (if you want a high clock rate) that SRAM macros are
1R or 1W per cycle not 1R AND 1W per cycle.

Architecturally, it is a lot cleaner to use 1R AND 1W per cycle SRAM macros
but this will cost you in clock rate as you are effectively running the
SRAM macros on a 2× clock.

These days, what I prefer is to read/write wide into a buffer (call it an L0
if you will) and use multiplexers to deliver the data. This buffer will
have 4-8 entries.
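
(a behavioural toy model of such a buffer - entries keyed by line address,
byte-masks merged on the way in, whole lines exchanged with the SRAM macros;
the names and the merge policy are illustrative only:)

    class L0Buffer:
        def __init__(self, nentries=8):
            self.nentries = nentries
            self.entries = {}            # line_addr -> merged byte-mask

        def request(self, line_addr, mask):
            # merge with an existing entry for the same line, else claim a
            # free entry; False means the requester waits a cycle
            if line_addr in self.entries:
                self.entries[line_addr] |= mask
                return True
            if len(self.entries) < self.nentries:
                self.entries[line_addr] = mask
                return True
            return False

        def drain_one(self):
            # one whole line per cycle goes to/from the SRAM macros
            return self.entries.popitem() if self.entries else None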
>
> i can't say i'm particularly happy with such an arrangement, because if
> LDs/STs from a single vector instruction are to be issued on 64-byte
> "strides", they're guaranteed to end up as single-issue.

But you are only orchestrating data to/from this buffer every 4 LDs/STs--
wider and slower.
>
> how the hell is this done?? every cache i've seen is effectively 1R1W.
> i found *one* paywalled paper on "dual-ported N-way cache"
> https://ieeexplore.ieee.org/document/4734919
> however its focus appears not to be on multi-porting but on power
> reduction.
>
> there is also this: http://sadve.cs.illinois.edu/Publications/isca00.pdf
>
> which describes a "reconfigureable" cache, subdivideable down into
> partitions - howeverrr agaaaain: it's not multi-ported: it's more about
> subdividing caches by "processor activity". this is probably now
> called "ASID" (the paper is from 2000) - Address Space Identifier.
>
> frustratingly-lost, here.

Typical problem with academia--it does not want to "play the game" with the
logic technology of the current day.

The building blocks are the SRAM macros. Consider each macro having 1R OR
1W per cycle--add to this some buffers so that you are accessing the SRAM
macros once every 4 cycles (64-bit containers) per data line, and sort
it out with flip-flops and multiplexers.

You might want to give up some settedness for portedness. But it all comes
down to the number of SRAM macros you start out with.
>
> l.

Ivan Godard

Mar 24, 2020, 11:50:16 AM
Very much not my expertise, so this comment may well be inane or wrong.

We don't have load-store buffers, but still have to merge requests,
which is done at the L1 victim buffers. Like you, there can be multiple
requests each cycle, although ours come from independent ops rather than
vectorization. One thing that caused us a lot of thought that (if I
understand it) you are not addressing: preserving semantic order.
Mitch's (hence your) vector logic can pick up several different streams
in the raw scalar loop, and those may potentially overlap. A scalar
store-load-store-load sequence is unambiguous; the vectorized version is
not.

It is not even sufficient to detect that the initial address recurrence
overlaps and avoid setting up a vector, at least if your vectors support
negative strides and/or different strides in different but concurrent
streams, which they really must. You can get two vectors sliding over
each other in opposite directions or one overtaking another, and
determining the order semantics when the overlap happens is non-trivial
except in Fortran.

Assuming that you resolve ordering issues, there is then the alignment
and isolation issue you address above. We initially required natural
alignment, but gave it up on business/market grounds. However, while we
have to support unaligned access, we decided that we didn't have to make
it fast. Consequently the load/store units check for boundary-crossing
and if necessary double-pump the request. The difference between our
approach and your description is that we do the split at the LS FU, not
at the other end of the request stream at the buffers. Because our
request stream is strictly ordered, we can split any request while
preserving semantic order.

However, the request stream has a fixed number of slots per cycle. If
all the slots are taken by LS requests then there's no slot for the
splitee. On a slow machine one could stall for a cycle, but our firehose
takes a couple of cycles to throw a stall, due to speed-of-light issues.
Consequently the request stream, and pretty much all other streams in
the hierarchy, use a predictive slot assignment algorithm where the
source maintains a high/low-water count of the projected state at the
sink end, and throttles itself as necessary. The source does not send a
request unless it knows that the sink will have catch capacity when the
request arrives (will arrive, not is sent). This is *not* a handshake.

This algorithm is essentially identical to the predictive throttling
used in network packet handling, q.v. The throttle logic, which is
widely used in a Mill, is called "Drano", because it keeps the pipes
open. Of course, excess split requests on a saturated request stream
will eventually exceed the Drano buffering capacity and the source will
issue a stall, early enough (it's a predictive system, after all) that
the stall can take effect before the sink overflows.

If I understand your system (questionable) you are able to handle
boundary overrun by doubling the access capacity so you can do two
(derived from one) requests at once. This seems to be wasteful IMO,
because you will very rarely see boundary crossing requests. You propose
to be able to do two at once, but do not have the ability to do two at
once from two requests. Mill Drano has only the ability to do one at a
time (per stream slot), but the slots are rarely full in a cycle, and
the Drano buffer lets requests spill over into other cycles. The cost
for us is that a split request may cause other requests to be delayed a
cycle (if there are no free slots in the current cycle) or trigger a
stall (if there are no free slots within the reach of the Drano).

Your proposal is immune to the Mill's potential delay, but it leaves you
searching for double-ported hardware. I suggest that you consider saying
that users of boundary-crossing references should accept a potential
delay as the cost of software flexibility. A bit of cooperation from the
rest of the ISA will help; for example, rather than support arbitrary
arithmetic and assignment to the stack pointer, you could make it a
specReg and provide an update operation that always allocates in whole
aligned lines, so the stack frame is always aligned and your compiler
can lay out data so that ordinary references never cross boundaries. Of
course, that might be too CISCy for you.

lkcl

Mar 24, 2020, 12:18:32 PM
On Tuesday, March 24, 2020 at 3:36:16 PM UTC, MitchAlsup wrote:
> On Tuesday, March 24, 2020 at 9:19:50 AM UTC-5, lkcl wrote:
> > do most systems simply assume that all L1 caches are effectively 1R1W as
> > far as the core is concerned, then "amalgamate" cache access into as close
> > to full cache lines as possible, and, if greater bandwidth is required,
> > simply "up the cache line width" from e.g. 16 bytes to 64 bytes?
>
> You should assume (if you want a high clock rate) that SRAM macros are
> 1R or 1W per cycle not 1R AND 1W per cycle.

ok. appreciated. i'll ask Staf: he's actually doing (test chip is taping
out this month) a Libre Cell Library, including SRAMs, DRAMs, GPIOs and
standard cells as well.

> Architecturally, it is a lot cleaner to use 1R AND 1W per cycle SRAM macros
> but this will cost you in clock rate as you are effectively running the
> SRAM macros on a 2× clock.

hmmm

> These days, what I prefer is to read/write wide into a buffer (call it an L0
> if you will) and use multiplexers to deliver the data. This buffer will
> have 4-8 entries.

interestingly, that's what we started discussing: a mini-cache of, yes,
only around 8 cache-lines.

so what you're saying is: have multi-porting by way of multiplexers (4-in, 4-out, or in our case 8, 12 or even 16-in on the LDSTCompUnit side, 4-out on the L0 cache side), group things into cache-lines and then hit the L1 cache one line at a time from the L0 cache/buffer?


> > i can't say i'm particularly happy with such an arrangement, because if
> > LDs/STs from a single vector instruction are to be issued on 64-byte
> > "strides", they're guaranteed to end up as single-issue.
>
> But you are only orchestrating data to/from this buffer every 4 LDs/STs--
> wider and slower.

... except, the Vector Engine that we have, it can actually issue 4 LDs/STs
per clock cycle (actually i was thinking of putting in 6 or 8 LD/ST CompUnits).
if *all* of those are on 64-byte strides, they'll each hit a *different* cache line
(@64-byte line width), which means those 4 LDs/STs grind to a halt, each only able
to access its cache line on every 4th cycle.

if instead we split the L1 cache into two halves, at least that comes down
to every 2nd cycle...

ahhh. and the L0 cache/buffer will "fix" the problem of odd-even addressing
causing slow-down just because two LDs/STs happened to both have "odd" addresses.


> You might want to give up some settedness for portedness. But it all comes
> down to the number of SRAM macros you start out with.

we can do what we want (within reason) because we're using a Libre SRAM
library. downside: we're using a Libre SRAM library and *have* to do it
ourselves... :)

l.

lkcl

Mar 24, 2020, 12:25:53 PM
On Tuesday, March 24, 2020 at 3:27:05 PM UTC, MitchAlsup wrote:

> Yes, you do want to care:: you want to care to "get it correct" even
> at some minor cost in perf.

happy to have something that's correct and working at this moment in time.

> What you want to know are the dimensions of the SRAM macros. Given these
> dimensions you are in a position to start making good choices, without
> you are not. A typical SRAM dimension might be 256-bits wide and 64-words
> deap (2KBytes).

we're able to build / design our own.

> Since you want to read/write 512-bits per port, you will need 2 SRAM
> macros per port (per cycle). So if you are making 8-way SA cache, you
> will need at least 16 SRAM macros (32KBytes)--and (A N D) you still
> have but a single port!
>
> On the other hand, dropping back to 4-way set, and that same SRAM array
> can now support 2-ports of 4-way SA each. In this organization you will
> be using more lower-address bits to choose port[0] versus port[1]. And
> you will see 40% conflicts if you are really running 2 request per cycle.

ok. this makes sense (in combination with the L0 cache/buffer from your
other message).

> However, if you know the access pattern is "dense", then you know that
> LD[i][k..k+3] are all in the same line,

ha! we have a detection system for that! by turning the low address
bits plus the LD/ST-len (1/2/4/8) into a shifted byte-mask, we know
which bytes are going to be accessed in any given cache line, and we can
then merge them into a single line-request, by noting that the high
bits of any address are all equal.

the Address-Conflict-Matcher will *already* have excluded anything that
could overlap, so we *know* that, in the unary byte-masks, they are
going to be uniquely-set. i.e. no two LDs (or two STs) will ever "overlap"
at the cache byte-line *column* level.


> and you can THEN make one access
> for every 4 actual LDs an then sort it out with multiplexers. And NOW we
> have enough ports and width to use 1024 bits per cycle throughput in the
> cache. {You can determine an access pattern will be dense when the loop
> increment is +1 (or -1) and the loop index is used in scaled form in
> memref instructions:
>
> LD Rx,[Rb+Ri<<scale+DISP]
>
> } Other forms of dense access are harder for HW to detect and utilize.

i believe we have it covered by turning the accesses into an unary byte-mask
of the exact same width as the cache-line.

l.

lkcl

Mar 24, 2020, 1:02:49 PM
fortunately, our vector engine is effectively nothing more than a multi-issue scalar system. thus by the time instruction-issue is over (which is where the
vector-thing sits) vectorisation is entirely gone.

mitch's address-conflict and memory-hazard-matrix system detects:

* sequences of load-load-load, separating them from store-store-store-store,
whilst *also preserving the order*

* identifies address-conflicts (in a slightly-overzealous way)


> It is not even sufficient to detect that the initial address recurrence
> overlaps and avoid setting up a vector, at least if your vectors support
> negative strides and/or different strides in different but concurrent
> streams, which they really must. You can get two vectors sliding over
> each other in opposite directions or one overtaking another, and
> determining the order semantics when the overlap happens is non-trivial
> except in Fortran.

again: we do not have a problem, here, because the "vectors" are not actually
vectors, per se, they're just an instruction to the *issue* engine to sit
with the PC "frozen", blurting out a whack-load of *scalar* instructions,
incrementing the *register* numbers each time round the "vector" hardware
for-loop.

thus we can "pretend" that those scalar instructions (which happen to have
sequential register numbers but the same operation) were actually really,
*really* scalar operations actually coming from the I-Cache, as in, they
were, actually, really, *really*, actual binary operations @ PC, PC+4, PC+8
and so on.
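
(the "frozen PC, hardware for-loop" idea in a few lines of python -
issue_scalar is a made-up stand-in for handing an op to the multi-issue
scalar engine:)

    def issue_scalar(opcode, rd, rs1, rs2):
        # stand-in: hand one element-operation to the scalar OoO engine
        print(f"{opcode} r{rd}, r{rs1}, r{rs2}")

    def issue_vector(opcode, rd, rs1, rs2, VL):
        # PC stays frozen: one "vector" instruction blurts out VL scalar
        # operations, incrementing the register numbers each time round;
        # only afterwards does the PC move on to the next instruction
        for i in range(VL):
            issue_scalar(opcode, rd + i, rs1 + i, rs2 + i)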

all we need to do then is design a (bog-standard) Multi-Issue *scalar*
execution engine, and all of the problems of Vector overlap detection
etc. all go away.

to be replaced by lovely, shiny new ones of course :)

> Assuming that you resolve ordering issues, there is then the alignment
> and isolation issue you address above. We initially required natural
> alignment, but gave it up on business/market grounds. However, while we
> have to support unaligned access, we decided that we didn't have to make
> it fast.

:)

> Consequently the load/store units check for boundary-crossing
> and if necessary double-pump the request. The difference between our
> approach and your description is that we do the split at the LS FU,

the plan is to have the split directly after the LD/ST Function Unit:
effectively having (and therefore making it potentially wait for) two
LD/ST ports.

we'd anticipate that for the most part the 2nd port would go completely
unused, as it's only going to be active when mis-aligned memory requests
occur.

> not
> at the other end of the request stream at the buffers. Because our
> request stream is strictly ordered, we can split any request while
> preserving semantic order.

yeah the plan is to turn the LSBs of the address into that unary masking
thing. we won't ever hit even the buffers with bits 0-3 of the address:
it will be unary byte-mask lines 0-15.

and the Memory-Order Matrix takes care of load/store ordering. i'm trusting
Mitch's advice that, whilst *batches* of loads have to be ordered compared
to *batches* of stores (load-load-load store-store load-load *must* be
3xLD then 2xST then 2xLD), for a Weak Memory Ordering system, the LDs
(or STs) *within* a batch do *not* have to be ordered.
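
(a sketch of what "batches" means here, in python - group a program-ordered
LD/ST sequence by kind, keep the batches ordered, let order *within* a batch
float; illustrative only:)

    def batch_ldsts(ops):
        # e.g. ["LD","LD","LD","ST","ST","LD","LD"] -> [3xLD], [2xST], [2xLD]
        batches = []
        for op in ops:
            if batches and batches[-1][0] == op:
                batches[-1].append(op)   # same kind: joins the current batch
            else:
                batches.append([op])     # kind changed: new ordered batch
        return batches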


> However, the request stream has a fixed number of slots per cycle. If
> all the slots are taken by LS requests then there's no slot for the
> splitee.

based on Mitch's assessment that for WMO, the ordering of any given LD (or ST)
in a "batch" is not important, it didn't occur to me to make any distinction
between the 2-way split operations.

for a TSO system, hell yes it would matter. and, because we're now doing
POWER ISA, i know that TSO really does matter on IO-memory-accesses, so
we'll cross that bridge when we come to it. (i believe POWER 3.0B copes
with this by simply banning misaligned IO-memory accesses)

> On a slow machine one could stall for a cycle, but our firehose
> takes a couple of cycles to throw a stall, due to speed-of-light issues.

:)

> Consequently the request stream, and pretty much all other streams in
> the hierarchy, use a predictive slot assignment algorithm where the
> source maintains a high/low-water count of the projected state at the
> sink end, and throttles itself as necessary. The source does not send a
> request unless it knows that the sink will have catch capacity when the
> request arrives (will arrive, not is sent). This is *not* a handshake.
>
> This algorithm is essentially identical to the predictive throttling
> used in network packet handling, q.v. The throttle logic, which is
> widely used in a Mill, is called "Drano", because it keeps the pipes
> open.

good name :)

> Of course, excess split requests on a saturated request stream
> will eventually exceed the Drano buffering capacity and the source will
> issue a stall, early enough (it's a predictive system, after all) that
> the stall can take effect before the sink overflows.
>
> If I understand your system (questionable) you are able to handle
> boundary overrun by doubling the access capacity so you can do two
> (derived from one) requests at once.

and let the buffers sort it out, yes. ok, so the idea is to have N (N=6 or 8)
LD/ST Function Units, each with 2 ports, and to try to get 4-way porting
to the L0 cache/buffer.

now, unless there's a really, really good reason (and i am getting some
hints from what you describe that there might be), i wasn't going to
"prioritise" the non-misaligned LD/ST operations, on the basis that, well,
it's all gotta come out at some point [hopefully not as poop...]

> This seems to be wasteful IMO,
> because you will very rarely see boundary crossing requests. You propose
> to be able to do two at once, but do not have the ability to do two at
> once from two requests. Mill Drano has only the ability to do one at a
> time (per stream slot), but the slots are rarely full in a cycle, and
> the Drano buffer lets requests spill over into other cycles. The cost
> for us is that a split request may cause other requests to be delayed a
> cycle (if there are no free slots in the current cycle) or trigger a
> stall (if there are no free slots within the reach of the Drano).
>
> Your proposal is immune to the Mill's potential delay, but it leaves you
> searching for double-ported hardware.

ah no i wondered *if* double-ported (or quad-ported) L1 caches exist,
and from what Mitch was saying, not even 1R1W caches exist, they're
1R-*or*-1W caches.

so that idea is shot.

now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
12-in (or 16-in) Priority Muxer to allow access to those 4 ports.

12-in because if we have 6 LD/ST Function Units, each will have its own
splitter thus doubling the FU ports in effect. 16-in if we have 8 LD/ST FUs.

i don't mind having a 12-or-16 to 4 multiplexer: i wrote a Multi-port
Priority-Picker for this purpose, just last week.

so let's then say that we have 2 of those LD/ST units issuing mis-aligned LDs,
and another 2 FUs issuing aligned LDs. this effectively means we have *six*
cache-line-aligned LD/STs being issued:

* 2 from the 1st mis-aligned LD
* 2 from the 2nd mis-aligned LD
* 1 from the 1st aligned LD
* 1 from the 2nd aligned LD

depending on what the 4-simultaneous PriorityPicker happens to pick, we could
have *any* of those 6 be picked.

now... should we do something such that mis-aligned LDs/STs are *reduced* in priority? given that we have this byte-mask "cache-line detection system", i'm not so sure that we should try that, because the more that is thrown at the L0 cache/buffer, the more opportunity there could be for "merging" the unary byte-masks into full cache lines.

i honestly don't know, here.

> I suggest that you consider saying
> that users of boundary-crossing references should accept a potential
> delay as the cost of software flexibility.

i think... i think we're going to have to see how it goes, and (thank you
for the heads-up) keep a close eye on that.

one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
count. full cache lines get top priority.

i can see, now that you mention it, that a large number of mis-aligned LD/STs
will, due to using double the number of cache lines of an aligned LD/ST,
clearly cause throughput slow-down.

however *if* those misalignments happen to be mergeable with other memory requests (because of the cache-byte-line bitmaps), actually it will not be as bad as it sounds.

the caveat being: they have to be within the same address (bits 0-3 excluded
from that), and pretty much the same cycle. for a sequential memory read/write
that happened to be accidentally mis-aligned, this condition would hold true.

however with random misaligned memory accesses, we'd end up with 1/2 throughput.


> A bit of cooperation from the
> rest of the ISA will help; for example, rather than support arbitrary
> arithmetic and assignment to the stack pointer, you could make it a
> specReg and provide an update operation that always allocates in whole
> aligned lines, so the stack frame is always aligned and your compiler
> can lay out data so that ordinary references never cross boundaries. Of
> course, that might be too CISCy for you.

unfortunately we're stuck with POWER ISA. actually... i *say* "stuck":
we're planning on extending it behind a "safety barrier" of escape-sequencing,
so we couuuld conceivably do strange and wonderful things, if justified.

l.

Ivan Godard

Mar 24, 2020, 2:18:07 PM
On 3/24/2020 10:02 AM, lkcl wrote:
> On Tuesday, March 24, 2020 at 3:50:16 PM UTC, Ivan Godard wrote:

<snip>
You are way beyond me. Some questions:

1) have you allowed for two scalar references being not congruent but
overlapping and both are boundary crossing? If they are both stores, for
example, it would be unfortunate if the update effect was a1-b1-a2-b2
instead of a1-a2-b1-b2.

2) Your L0 is our L1 victim buffer. With 6 LS units you have a 12-way
mux to reach the four ports; we have a 6 way mux because of the upstream
Drano; I agree that a 12-to-4 mux is unimportant. However, you also have
12 wiresets from the LS to the muxes, which in wire terms are remote. We
would have six, so in effect you have doubled the operand width in order
to support boundary crossing. Our hardware guys have in general been
much more concerned with wire area than with logic; how do you weight
these in your design?

3) I'm doubtful that the "whack-load issue" will remove all ordering
issues. Yes, it clearly works within any single stream, and for multiple
disjoint streams. And I can see that conjoint streams can have the
ordering of overlapping whack-loads disambiguated with enough care. But
it seems to me that such care is not doable in practical hardware, at
least with the fanout you propose. If one stream is double-at-a-time,
one is word-at-a-time, and one is short-at-a-time, and they are on
different byte alignments, then as they cross each other the conflict
detector will correctly order D1-D2-D3-D4 and so for the other streams,
but the multi-stream will not be d1-w1-s1-d2-w2-s2-etc, but will be
interleaved as something like d1-w1-s1-s2-w2-w3-d3-etc.

Now you can get around this by having the subscript correspond to the
iteration instead of the address. Then its unambiguous that
d1-w1-s1-d2-w2-s2... But you have nothing in the ISA that tells you what
an iteration is. Mitch has his loop-vec operation pair that gives him
that info, but you can only pick it up from the addresses. Think of this
loop:
int A[N]; short B[N*2];
for (int i = 0; i < N; ++i)
    sum += A[i] + B[2*i] + B[2*i+1];
Is your hardware going to make one B stream, or two? How are you going
to have the order in the whack-load be a1-b1-b2-a2-b3-b4...? and what if
they overlap? I admit I don't understand the conflict avoidance logic
proposed, but I also don't see anything that suggests things like this
are handled.

MitchAlsup

Mar 24, 2020, 3:15:33 PM
We used to do double ported 1R AND 1W SRAMs and were forced to 1R or 1W
SRAMs by the technology. In 1µ technology it was not hard to get SRAMs
to read and write in a single 16-gate cycle. But somewhere in the 0.18µ
era SRAMs became slower because of transistor leakage, bit-line capacitance,
and body effects of <still> planar transistors. You can still get SRAMs that
are fast enough to run a 20-gate machine cycle and hit them 2× per cycle,
but not if the library SRAM makes YOU put the flip-flops at the edge of
the SRAM and these flip-flops are timed on the std clock.

Caches, on the other hand can be 2 ported (or even 4 ported) using std SRAMs
as long as you are not using 1 SRAM macro more than once per cycle. The
rest is simply wiring the macros up however you deem appropriate.

Note:: In order to improve the tag-TLB-hit logic speed, Athlon and Opteron
gated the raw bit lines of the Tag and TLB into an analog XOR gate and then
used sense amplifiers to decide if the bits matched. They did not actually
bother to read out the translated bits from either the Tag array or the
TLB array. This necessitates these two arrays be side-by-side.
>
> so that idea is shot.
>
> now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
> 12-in (or 16-in) Priority Muxer to allow access to those 4 ports.
>
> 12-in because if we have 6 LD/ST Function Units, each will have its own
> splitter thus doubling the FU ports in effect. 16-in if we have 8 LD/ST FUs.

All of this split stuff resolves about gate (delay) 6 whereas the higher
order AGEN bits do not resolve until at least gate 12-13 (64-bit addresses).
So there is plenty (PLENTY) of time to do bank selection and split analysis
while the rest of the AGEN bits are resolving. By the time you get to the
L0 multiplexers, there should be no "priorities" and no "indecision" as to
what this cycle is doing. (i.e., move it back 1/3rd cycle)
>
> i don't mind having a 12-or-16 to 4 multiplexer: i wrote a Multi-port
> Priority-Picker for this purpose, just last week.
>
> so let's then say that we have 2 of those LD/ST units issuing mis-aligned LDs,
> and another 4 FUs issue aligned LDs. this effectively means we have *six*
> cache-line-aligned LD/STs being issued:
>
> * 2 from the 1st mis-aligned LD
> * 2 from the 2nd mis-aligned LD
> * 1 from the 1st aligned LD
> * 1 from the 2nd aligned LD
>
> depending on what the 4-simultaneous PriorityPicker happens to pick, we could
> have *any* of those 6 be picked.

I have priority pickers that can do a unary pick for 72 inputs in 4 gates of
delay (without using more than 4-input NAND gates and 3-input NOR gates).
That is very fast, and as wide as you could ever need. But for 1 from 4
there is 1 gate of delay (from a flip flop where you have access to both
polarities), 1 from 12 is 2 gates, 1 from 36 is 3 gates.

But these should be performed while the HoBs are resolving.
>
> now... should we do something such that mis-aligned LDs/STs are *reduced* in priority? given that we have this byte-mask "cache-line detection system", i'm not so sure that we should try that, because the more that is thrown at the L0 cache/buffer, the more opportunity there could be for "merging" the unary byte-masks into full cache lines.

Just guessing: If you get a misaligned access, you probably don't have the second
line address (even though it's only an INC away), you will likely have to
perform splits over 2 cycles--then depending on address compares and access
density, you may (or may not) have to delay other memory refs to maintain
memory order.
>
> i honestly don't know, here.
>
> > I suggest that you consider saying
> > that users of boundary-crossing references should accept a potential
> > delay as the cost of software flexibility.
>
> i think... i think we're going to have to see how it goes, and (thank you
> for the heads-up) keep a close eye on that.
>
> one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
> count. full cache lines get top priority.

The "density" concept tells you 90%+ of what you want to know. fast and
simple.
>
> i can see, now that you mention it, that a large number of mis-aligned LD/STs
> will, due to using double the normal number of cache lines for a non-misaligned
> LD/ST, would clearly cause throughput slow-down.

What you are going to want to do is to get 2 cache lines in the L0 and
then have this L0 service as many requests as you can afford to build
multiplexers for.

lkcl

Mar 24, 2020, 5:47:22 PM
On Tuesday, March 24, 2020 at 6:18:07 PM UTC, Ivan Godard wrote:

>
> You are way beyond me.

sorry! it makes sense in my head :) however due to the nature of our project (educational as well as profitable) that's not good enough so i have to do better.

> Some questions:
>
> 1) have you allowed for two scalar references being not congruent but
> overlapping and both are boundary crossing? If they are both stores, for
> example, it would be unfortunate if the update effect was a1-b1-a2-b2
> instead of a1-a2-b1-b2.

a1 a2 would.. *click* you are right. in one version of the Address Conflict detection Matrix I had several binary address detection comparators, with the (address+16) being done *inside* the Matrix.

given that this required *three* comparisons (define a2 = a1+16, b2 = b1+16) to detect any clashes:

* a1 == b1, a1mask[0..15] & b1mask[0..15] nonzero

* a1 == b2, a1mask[0..7] & b1mask[16..23] nonzero

* vice-versa, a2 == b1

i considered this to be excessive (3x the number of binary comparators in an 8x8 Matrix is a *lot* of comparators), so i decided instead to split the ports.

however.. wait... no, the ordering of the LDSTs back at the *Dependency Hazard* phase, the LDs and STs are *already* split into ordered groups, well before they get to the "Address Conflict" phase.

therefore it is guaranteed that the AC phase will *only* see batches of LDs or batches of STs, it will *never* see both.

so we are good. whew :)


> 2) Your L0 is our L1 victim buffer.

ah ok. same thing, different name.

> With 6 LS units you have a 12-way
> mux to reach the four ports;

sigh yes. i am wondering if we can do just an N-in, 4-out Mux onto an 8-entry L0, and it would only fill up entirely if 4 previous entries were not flushed to the L1 cache.

it makes for a bottleneck but reduces wires somewhat.

> we have a 6 way mux because of the upstream
> Drano; I agree that a 12-to-4 mux is unimportant. However, you also have
> 12 wiresets from the LS to the muxes, which in wire terms are remote. We
> would have six, so in effect you have doubled the operand width in order
> to support boundary crossing. Our hardware guys have in general been
> much more concerned with wire area than with logic; how do you weight
> these in your design?

i don't knoow, i'm not happy about it. we're talking full 128 bit cache lines, QTY 12 (or 16) being MUXed 4 ways into 8 sets of latches.

woo.

...use several layers? :)

>
> 3) I'm doubtful that the "whack-load issue" will remove all ordering
> issues.

oh it will generate a ton of them.

however Mitch explained that when you use those (biiig) Dependency Matrices in unary regnum and FunctionUnit addressing (rather than the more traditional binary numbering used in the CDC6600 patent on its Q-Tables), multi-issue is a simple matter of transitively (cascade) collecting register dependency hazards from previous instructions.

thus if you have ADD r1, r10, r20 and ADD r2, r11, r21 in the next instruction, you create a "fake" write dependency between r2 and r1, and fake read hazard deps between r10/11 and r20/21, in that same clock cycle.

this is enough to preserve the instruction order in an intriguing single-cycle way.

and in the case of our vector-is-actually-multiissue-scalar it should be trivially obvious how to generate the transitive cascade of dependencies inside the vector for-loop as it increments the register numbers.
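
(very loosely, in python - a toy of the same-cycle ordering chain only; the
real unary Dependency Matrices also record the genuine RAW/WAR register
hazards, which this sketch leaves out:)

    def fake_order_chain(n):
        # dep[i][j] True means "instruction i must wait for instruction j".
        # for n instructions issued in the *same* cycle, chaining each one
        # to its predecessor is enough: the cascade is transitive, so
        # program order within the cycle is preserved
        dep = [[False] * n for _ in range(n)]
        for i in range(1, n):
            dep[i][i - 1] = True
        return dep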


> Yes, it clearly works within any single stream, and for multiple
> disjoint streams. And I can see that conjoint streams can have the
> ordering of overlapping whack-loads disambiguated with enough care.

there are several critical pre-components, each of which (thank god) is independent.

* generating a ton of scalar operations, such that we can use standard multi issue hardware, vectorisation "goes away"

* standard register RW Hazard detection, by using unary numbering, allows us to ensure register ordering is preserved

* the RAW WAR Memory Ordering Detection Matrix allows us to detect batches of LDs and separate them from batches of STs *before*...

* ... the Address Clash Detection Matrix gets to see the batch of (only) LDs or (only) STs, even after they've gone through splitting.


> But
> it seems to me that such care is not doable in practical hardware, at
> least with the fanout you propose. If one stream is double-at-a-time,
> one is word-at-a-time, and one is short-at-a-time, and they are on
> different byte alignments, then as they cross each other the conflict
> detector will correctly order D1-D2-D3-D4 and so for the other streams,
> but the multi-stream will not be d1-w1-s1-d2-w2-s2-etc, but will be
> interleaved as something like d1-w1-s1-s2-w2-w3-d3-etc.

ah, remember: by this point, with all the other guarantees provided by the various Matrices, we have a bunch of *bytes* (16 of them), with their associated cache byte read/write enable line, where we *know*, thanks to the AddressConflict Matrix, that there are no two LDST FUs in this batch trying to read/write to/from the same cache byte column.

also, LD batches were *already* separated from ST batches.

> Now you can get around this by having the subscript correspond to the
> iteration instead of the address. Then its unambiguous that
> d1-w1-s1-d2-w2-s2...

all these have been turned into unary bytemaps.

d1 = 0b1111 1111 0000 0000
w1 = 0b0000 0000 1111 0000
s1 = 0b0000 0000 0000 1100
w2 = 0b0000 1111 0000 0000
s2 = 0b0000 0000 0011 0000

therefore *back at the AddressConflict Matrix* (assuming those were all the same Addr[4:48] binary addresses) the ACM would *already* have prevented d1 and w2 (their LEN+Addr[0..3] encoded in this unary mask form) from progressing simultaneously to the L0 Cache/buffer (aka L1 victim buffer).

etc.

d1 s1 and w1 would not produce conflicts, therefore all 3 of those would be PERMITTED to proceed to the L0 phase (allowed simultaneous access to the Muxers).
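
(running those five masks through a greedy non-clash pick, purely as a paper
exercise in python:)

    reqs = {                            # the masks above, same line address
        "d1": 0b1111_1111_0000_0000,
        "w1": 0b0000_0000_1111_0000,
        "s1": 0b0000_0000_0000_1100,
        "w2": 0b0000_1111_0000_0000,
        "s2": 0b0000_0000_0011_0000,
    }
    granted, merged = [], 0
    for name, mask in reqs.items():
        if merged & mask:               # unary clash with an already-granted req
            continue                    # -> held back this cycle (w2, s2 here)
        granted.append(name)
        merged |= mask                  # the OR lands on the byte-enable lines
    # granted == ['d1', 'w1', 's1'], merged == 0b1111_1111_1111_1100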

> But you have nothing in the ISA that tells you what
> an iteration is.

i know. i am so happy not to have to deal with that.

we are critically reliant however on this being an absolutely stonkingly high performance OoO engine.

i cannot imagine the hell that anyone would go through if they tried to stick to a traditional in-order multi-issue microarchitecture.


> Mitch has his loop-vec operation pair that gives him
> that info, but you can only pick it up from the addresses. Think of this
> loop:
> int a[N]; short B[N*2];
> for (int i = 0; i < N; ++i)
> sum += A[i] + B[2*i] + B[2*i+1];

this shouuuld result in near simultaneous B LDs ending up in the LDST buffers, where the lack of unary-encoded clashes in bits 0..3 of the address+LEN encoding as a cacheline column byte-enable line will result in them being merged into the same cacheline request in the same cycle.

however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1

> Is your hardware going to make one B stream, or two?

two at the FU level, merged through unary mask detection and clash avoidance onto the same cache line.

> How are you going
> to have the order in the whack-load be a1-b1-b2-a2-b3-b4...? and what if
> they overlap?

they are not permitted to.

> I admit I don't understand the conflict avoidence logic
> proposed, but I also don't see anything that suggests things like this
> are handled.

it is. however it is only by each Matrix taking care of guarantees further down the chain that it all hangs together.

if the MemHazard Matrix did not detect and separate out batches of LDs from batches of STs, in instruction order, we would be screwed at the AddressConflict detection phase.

if the AddressConflict stage did not detect bytemask overlaps and exclude them, we would be screwed at the L1 cache level due to multiple requests for R/W at the same cacheline column.

etc. etc.

l.

lkcl

Mar 24, 2020, 6:00:08 PM
On Tuesday, March 24, 2020 at 9:47:22 PM UTC, lkcl wrote:
> On Tuesday, March 24, 2020 at 6:18:07 PM UTC, Ivan Godard wrote:

> > With 6 LS units you have a 12-way
> > mux to reach the four ports;
>
> sigh yes. i am wondering if we can do just an N-in, 4-out Mux onto an 8-entry L0, and it would only fill up entirely if 4 previous entries were not flushed to the L1 cache.
>
> it makes for a bottleneck but reduces wires somewhat.

ohh gooOdddd i just thought of a horrible way to avoid those wires somewhat.

although you turn Addr[0..3] plus len=1,2,4,8 into an unary shifted bytemask back at the Address Conflict Matrices, you do *not* throw this down a massive 12x4 way Mux

you put the *binary* address and *binary* len through the Mux phase and *recreate* the unary masks on the other side of the Mux.

yuck.

likewise the data Mux bus need actually only be 8 bytes wide, although it will be a full cache line wide at the L0 cachebuffer (aka L1 victim buffer in Mill terminology).

l.

lkcl

Mar 24, 2020, 6:27:08 PM
On Tuesday, March 24, 2020 at 7:15:33 PM UTC, MitchAlsup wrote:

> We used to do double ported 1R AND 1W SRAMs and were forced to 1R or 1W
> SRAMs by the technology. In 1µ technology it was not hard to get SRAMs
> to read and write in a single 16-gate cycle. But somewhere in the 0.18µ
> era SRAms became slower because of transistor leakage, bit-line capacitance,
> and body effects of <still> planar transistors. You can still get SRAMs that
> are fast enough to run a 20-gate machine cycle and hit them 2× per cycle,
> but not if the library SRAM makes YOU put the flip-flops at the edge of
> the SRAM and these flip-flops are timed on the std clock.

i will alert Staf so he can take a look at this, thank you for the insight.

> Caches, on the other hand can be 2 ported (or even 4 ported) using std SRAMs
> as long as you are not using 1 SRAM macro more than once per cycle. The
> rest is simply wiring the macros up however you deem appropriate.

i *believe* that by a nice coincidence due to the unary mask clash detection we will:

A. never have reads occurring at the same time as writes
B. never have a double access to the same cache line *or* column, in the same cycle.

this would seem to satisfy the criteria you describe?

> Note:: In order to improve the tag-TLB-hit logic speed, Athlon and Opteron
> gated the raw bit lines of the Tag and TLB into a analog XOR gate and then
> used sense amplifiers to decide if the bits matched.

holy cow!

> They did not actually
> bother to read out the translated bits from either the Tag array or the
> TLB array. This necessitates these two arrays be side-by-side.
> >
> > so that idea is shot.
> >
> > now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
> > 12-in (or 16-in) Priority Muxer to allow access to those 4 ports.
> >
> > 12-in because if we have 6 LD/ST Function Units, each will have its own
> > splitter thus doubling the FU ports in effect. 16-in if we have 8 LD/ST FUs.
>
> All of this split stuff resolves about gate (delay) 6 whereas the higher
> order AGEN bits do not resolve until at least gate 12-13 (64-bit addresses).

48. actually 48-11 because bits 5-11 are done back in the AddressConflict Matrix already, and i will (be attempting to) re-use those comparisons further down (in the L0 cache/buffer).

this might not turn out to be sane due to the amount of routing, as Ivan points out.


> So there is plenty (PLENTY) of time to do bank selection and split analysis
> while the rest of the AGEN bits are resolving. By the time you get to the
> L0 multiplexers, there should be no "priorities" and no "indecision" as to
> what this cycle is doing. (i.e., move it back 1/3rd cycle)

i did consider not having a Priority Picker cascade at all and instead, if there are 6 Function Units, just having a 12-entry LD/ST L0 cache/buffer.

however that means 12 48(64) bit ADDR comparators in the L0 cache/buffer and i am not happy about that.

8 is... tolerable.


> > depending on what the 4-simultaneous PriorityPicker happens to pick, we could
> > have *any* of those 6 be picked.
>
> I have priority pickers that can do a unary pick for 72 inputs in 4 gates of
> delay (without using more than 4-input NAND gates and 3-input NOR gates.

is that using the same PP design as in the Scoreboard Mechanics book chapters?

> That is very fast, and as wide as you could ever need. But for 1 from 4
> there is 1 gate of delay (from a flip flop where you have access to both
> polarities), 1 from 12 is 2 gates, 1 from 36 is 3 gates.

you have slightly lost me here. do you mean, a cascade of PriorityPickers?

1st one MASKs out the input, 2nd one adds another bit to the mask, 3rd likewise. result is that each successive PriorityPicker gets fewer and fewer bits to pick from.

perhaps this is the naive approach
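
(the naive cascade in python, for reference - each successive picker sees the
input with all earlier winners masked out:)

    def pick_lowest(reqs):
        # classic single-output priority picker: isolate the lowest set bit
        return reqs & -reqs

    def multi_pick(reqs, nports=4):
        grants = []
        for _ in range(nports):
            g = pick_lowest(reqs)
            if not g:
                break
            grants.append(g)            # one-hot grant, one per L0 port
            reqs &= ~g                  # mask the winner out for the next pick
        return grants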

> But these should be performed while the HoBs are resolving.

HoBs... higher order bits.

> >
> > now... should we do something such that mis-aligned LDs/STs are *reduced* in priority? given that we have this byte-mask "cache-line detection system", i'm not so sure that we should try that, because the more that is thrown at the L0 cache/buffer, the more opportunity there could be for "merging" the unary byte-masks into full cache lines.
>
> Just guessing: If you get a misaligned, you probably don't have the second
> line address (even though its only an INC away), you will likely have to
> perform splits over 2 cycles--then depending on address compares and access
> density, you may (or may not) have to delay other memory refs to maintain
> memory order.

by having the rules be Weak Memory Model rather than TSO, and by having the RAW/WAR Mem Hazard Matrix detect and separate LDs from STs, my understanding is that even if split up into dozens of single-byte operations, actual order is irrelevant [for WMO and non-atomic accesses only].

the misaligned will definitely decrease throughput in the general mem access case, not a lot we can do about that.


> >
> > i honestly don't know, here.
> >
> > > I suggest that you consider saying
> > > that users of boundary-crossing references should accept a potential
> > > delay as the cost of software flexibility.
> >
> > i think... i think we're going to have to see how it goes, and (thank you
> > for the heads-up) keep a close eye on that.
> >
> > one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
> > count. full cache lines get top priority.
>
> The "density" concept tells you 90%+ of what you want to know. fast and
> simple.
> >
> > i can see, now that you mention it, that a large number of mis-aligned LD/STs
> > will, due to using double the normal number of cache lines for a non-misaligned
> > LD/ST, would clearly cause throughput slow-down.
>
> What you are going to want to do is to get 2 cache lines in the L0 and
> then have this L0 service as many requests as you can afford to build
> multiplexers for.

i think we can manage that and not scream in the middle of the night at the number of wires.

the more L0 entries we have, the more opportunities for unary bytemask merging.

unfortunately that also increases full addr comparison logic.

tradeoffs...

l.

MitchAlsup

Mar 24, 2020, 7:11:12 PM
Probably not: however the one you saw was a precursor and not much slower.
Realistically, if the picker is working from less than 12 inputs it won't
matter.

lkcl

Mar 25, 2020, 6:46:41 AM
On Tuesday, March 24, 2020 at 11:11:12 PM UTC, MitchAlsup wrote:
> On Tuesday, March 24, 2020 at 5:27:08 PM UTC-5, lkcl wrote:
> > On Tuesday, March 24, 2020 at 7:15:33 PM UTC, MitchAlsup wrote:
> > > I have priority pickers that can do a unary pick for 72 inputs in 4 gates of
> > > delay (without using more than 4-input NAND gates and 3-input NOR gates.
> >
> > is that using the same PP design as in the Scoreboard Mechanics book chapters?
>
> Probably not: however the one you saw was a precursor and not much slower.
> Realistically, if the picker is working from less than 12 inputs it won't
> matter.

hm ok i'm intrigued now.

i just deliberately created a 16-in 16-out PriorityPicker and ran it
through the yosys "synth" command just to see what kind of StdCells
it would create and whether it would be optimised / flattened at all.
looks like it was!
https://libre-riscv.org/3d_gpu/priority_picker_16_yosys.png

the 16-long chain of NOTs-and-AND-cascades seems to have been turned
into StdCells (ANDNOT, ORNOT, AND, NOT, OR) only 4 deep.


(for anyone else who may be going "err what on earth are they talking about??")

here is a pair of priority-pickers in the "precursor" form Mitch describes:
https://libre-riscv.org/3d_gpu/group_pick_rd_rel.jpg

and here's a cascade of them:
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/800x-multi_priority_picker.png

input is just "i", pp0-3 block is each one of the priority-pickers
however inside each PP block i additionally OR the outputs together
so as to produce a single-bit output indicating that the PP is "active"

outputs are idxo_0-3 plus all four enable bits from each PP which gives
an easy way to tell which PP is "active".

a quick "flatten" followed by "synth" command in yosys shows that the
longest path (through Std Cell Libraries, not gates) is 8 long.

if i were to reorganise the masking to no longer be a chain but instead
to be accumulated (duplicated) it might get a little flatter.

i think we can get away with using this even for 16-in and 4-out?


links to the source code implementing those (in nmigen) can be found
by following the links here:
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

l.

lkcl

Mar 25, 2020, 8:40:18 AM
On Tuesday, March 24, 2020 at 7:15:33 PM UTC, MitchAlsup wrote:
> On Tuesday, March 24, 2020 at 12:02:49 PM UTC-5, lkcl wrote:
> > ah no i wondered *if* double-ported (or quad-ported) L1 caches exist,
> > and from what Mitch was saying, not even 1R1W caches exist, they're
> > 1R-*or*-1W caches.
>
> We used to do double ported 1R AND 1W SRAMs and were forced to 1R or 1W
> SRAMs by the technology.

i heard back from Staf (very busy, thank you again if you read this).
he said that for this initial test SRAM Cell Library he's providing a
synchronous 1R-or-1W SRAM cell, which we can use to create dual, triple
etc. ported blocks (2R1W, 3R1W ...) by putting 1R-or-1W SRAMs together
in parallel.

in this first 180nm chip we just have to live with that and eat performance.

it does mean however we're not paying $NNN,NNN(,N?NN?) in NREs.
USD $600 per sq.mm - that's it. no mask charges. we're damn lucky,
and it was *only* possible for us to get that because the entire design
is Libre-Licensed.

l.

Ivan Godard

Mar 27, 2020, 10:32:50 AM
On 3/24/2020 2:47 PM, lkcl wrote:
> On Tuesday, March 24, 2020 at 6:18:07 PM UTC, Ivan Godard wrote:
>
>>
>> You are way beyond me.
>
> sorry! it makes sense in my head :) however due to the nature of our project (educational as well as profitable) that's not good enough so i have to do better.
>
>> Some questions:
>>
>> 1) have you allowed for two scalar references being not congruent but
>> overlapping and both are boundary crossing? If they are both stores, for
>> example, it would be unfortunate if the update effect was a1-b1-a2-b2
>> instead of a1-a2-b1-b2.
>
> a1 a2 would.. *click* you are right. in one version of the Address Conflict detection Matrix I had several binary address detection comparators, with the (address+16) being done *inside* the Matrix.
>
> given that this required *three* comparisons (define a2 = a1+16, b2 = b1+16) to detect any clashes:
>
> * a1 == b1, a1mask[0..15] & b1mask[0..15] nonzero
>
> * a1 == b2, a1 mask[0..7] & b1mask[16..23] nonzero
>
> * viceversa a2 == b1
>
> i considered this to be excessive (3x the number of binary comparators in an 8x8 Matrix is a *lot* of comparators) i decided instead to split the ports.
>
> however.. wait... no, the ordering of the LDSTs back at the *Dependency Hazard* phase, the LDs and STs are *already* split into ordered groups, well before they get to the "Address Conflict" phase.
>
> therefore it is guaranteed that the AC phase will *only* see batches of LDs or batches of STs, it will *never* see both.
>
> so we are good. whew :)

If you say so :-)

>> 2) Your L0 is our L1 victim buffer.
>
> ah ok. same thing, different name.

Well, a little different, because they get filled from the L1 banks as
well as from the FUs, and stores don't need line fetches.

>> With 6 LS units you have a 12-way
>> mux to reach the four ports;
>
> sigh yes. i am wondering if we can do just an N-in, 4-out Mux onto an 8-entry L0, and it would only fill up entirely if 4 previous entries were not flushed to the L1 cache.
>
> it makes for a bottleneck but reduces wires somewhat.
>
>> we have a 6 way mux because of the upstream
>> Drano; I agree that a 12-to-4 mux is unimportant. However, you also have
>> 12 wiresets from the LS to the muxes, which in wire terms are remote. We
>> would have six, so in effect you have doubled the operand width in order
>> to support boundary crossing. Our hardware guys have in general been
>> much more concerned with wire area than with logic; how do you weight
>> these in your design?
>
> i don't knoow, i'm not happy about it. we're talking full 128 bit cache lines, QTY 12 (or 16) being MUXed 4 ways into 8 sets of latches.
>
> woo.
>
> ...use several layers? :)

Money and power. And did I mention money?

>>
>> 3) I'm doubtful that the "whack-load issue" will remove all ordering
>> issues.
>
> oh it will generate a ton of them.
>
> however Mitch explained that when you use those (biiig) Dependency Matrices in unary regnum and FunctionUnit addressing (rather than the more traditional binary numbering used in the CDC6600 patent on its Q-Tables), multi-issue is a simple matter of transitively (cascade) collecting register dependency hazards from previous instructions.
>
> thus if you have ADD r1, r10, r20 and ADD r2, r11, r21 in the next instruction, you create a "fake" write dependency between r2 and r1, and fake read hazard deps between r10/11 and r20/21, in that same clock cycle.
>
> this is enough to preserve the instruction order in an intriguing single-cycle way.

Not sure about that. What about instructions that don't use registers
that you can chain? Or rather, use the same ones? Code like:
store r4, *r4;
store r4, *r4+1;
Here there is an order dependency between the two stores, but there's
only one register involved and no updates or value creation. I suppose
you can create a fake dependency r4->r4, but that could read in either
direction. And it could be followed by "store r4, *r4+2;" too. How do you
resolve this without an order-tag? Or maybe the AGU output gets a fake
regnum, with memrefs cracked in decode into something like "rfake =
LEA(r4); store r4, *rfake; rfake1 = LEA(r4+1); store r4, *rfake1;"? That
should work, but chews up a lot of renamers.

It's easy to see how OOO works in the easy cases, but it's perversities
like this that make me a dedicated IO guy.

> and in the case of our vector-is-actually-multiissue-scalar it should be trivially obvious how to generate the transitive cascade of dependencies inside the vector for-loop as it increments the register numbers.
>
>
>> Yes, it clearly works within any single stream, and for multiple
>> disjoint streams. And I can see that conjoint streams can have the
>> ordering of overlapping whack-loads disambiguated with enough care.
>
> there are several critical pre-components, each of which (thank god) is independent.
>
> * generating a ton of scalar operations, such that we can use standard multi issue hardware, vectorisation "goes away"

You get no argument from me on that one, except for "standard". Standard
doesn't scale enough.

> * standard register RW Hazard detection, by using unary numbering, allows us to ensure register ordering is preserved
>
> * the RAW WAR Memory Ordering Detection Matrix allows us to detect batches of LDs and separate them from batches of STs *before*...
>
> * ... the Address Clash Detection Matrix gets to see the batch of (only) LDs or (only) STs, even after they've gone through splitting.
>

The example above was WAW ordering. What about that?

>> But
>> it seems to me that such care is not doable in practical hardware, at
>> least with the fanout you propose. If one stream is double-at-a-time,
>> one is word-at-a-time, and one is short-at-a-time, and they are on
>> different byte alignments, then as they cross each other the conflict
>> detector will correctly order D1-D2-D3-D4 and so for the other streams,
>> but the multi-stream will not be d1-w1-s1-d2-w2-s2-etc, but will be
>> interleaved as something like d1-w1-s1-s2-w2-w3-d3-etc.
>
> ah, remember: by this point, with all the other guarantees provided by the various Matrices, we have a bunch of *bytes* (16 of them), with their associated cache byte read/write enable line, where we *know*, thanks to the AddressConflict Matrix, that there are no two LDST FUs in this batch trying to read/write to/from the same cache byte column.
>
> also, LD batches were *already* separated from ST batches.
>
>> Now you can get around this by having the subscript correspond to the
>> iteration instead of the address. Then its unambiguous that
>> d1-w1-s1-d2-w2-s2...
>
> all these have been turned into unary bytemaps.
>
> d1 = 0b1111 1111 0000 0000
> w1 = 0b0000 0000 1111 0000
> s1 = 0b0000 0000 0000 1100
> w2 = 0b0000 1111 0000 0000
> s2 = 0b0000 0000 0011 0000
>
> therefore *back at the AddressConflict Matrix* (assuming those were all the same Addr[4:48] binary addresses) the ACM would *already* have prevented LEN+Addr[0..3] d1 and w2 when encoded in this unary mask form from progressing simultaneously to the L0 Cache/buffer (aka L1 victim buffer).
>
> etc.

Hmm. The picture helps; thank you. BTW, figuring out that d1/w1/s1 are
non-conflicting is polynomial, but figuring out which is the best of the
various intra-non-conflicting but inter-conflicting sets to let through
is the NP bin-packing problem. I'm guessing that you don't try, and just
use a maximal munch on the ordered request stream, which is probably
nlogn and would rarely cost you a cycle over optimal.
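
To make the greedy version concrete, a small Python sketch (purely
illustrative; greedy_admit and its shape are my own naming), using the
unary masks from the quoted post:

    def greedy_admit(requests):
        # admit a request if its bytemask does not overlap the union of
        # masks already admitted this cycle; the ORed union is what would
        # drop straight onto the byte-enable lines.
        admitted, deferred, union = [], [], 0
        for name, mask in requests:
            if mask & union:
                deferred.append(name)     # clash: try again next cycle
            else:
                admitted.append(name)
                union |= mask
        return admitted, deferred, union

    d1 = 0b1111_1111_0000_0000
    w1 = 0b0000_0000_1111_0000
    s1 = 0b0000_0000_0000_1100
    w2 = 0b0000_1111_0000_0000
    s2 = 0b0000_0000_0011_0000

    print(greedy_admit([("d1", d1), ("w1", w1), ("s1", s1),
                        ("w2", w2), ("s2", s2)]))
    # -> d1, w1, s1 admitted; w2 clashes with d1, s2 clashes with w1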

The byte conflict mask approach is actually how we do byte merge too,
except that our request stream is ordered by definition and your unary
mask is architecturally visible to us as the valid-bit on every byte in
the hierarchy short of DRAM.

Your explanation thus pushes my doubts to the Address Conflict Matrix.
Do I take it correctly that ACM detects at line granularity? One
observation of actual code is that spatial locality is at sub-line
granularity: a reference to a line is followed by another reference to
the same line with high probability. A conflict detect can split a
per-address-space stream of requests into several per-line streams of
requests; the cost is a set of associative comparators (fast) and a miss
detect and evict (slow).

Still, if I understand you (?) you cannot handle an intra-line sequence
that has more than one read-write or write-read transition, because you
have to accumulate a mask at each transition. W-W-W is OK, and R-R-R,
but not W-R-W or R-W-R; not in one cycle anyway. We can, with the
as-of-retire load semantics Mill has; we just reissue a read that got
clobbered. We also detect conflict at the byte(-range) level, so
conflict is very rare, whereas locality says that it will be frequent in
your line-granularity ACM.

It's interesting that Mill, starting from the semantics, and you/Mitch,
starting from the gates, have arrived at very similar designs in this area.

> d1 s1 and w1 would not produce conflicts, therefore all 3 of those would be PERMITTED to proceed to the L0 phase (allowed simultaneous access to the Muxers).
>
>> But you have nothing in the ISA that tells you what
>> an iteration is.
>
> i know. i am so happy not to have to deal with that.
>
> we are critically reliant however on this being an absolutely stonkingly high performance OoO engine.
>
> i cannot imagine the hell that anyone would go through if they tried to stick to a traditional in-order multi-issue microarchitecture.
>
>
>> Mitch has his loop-vec operation pair that gives him
>> that info, but you can only pick it up from the addresses. Think of this
>> loop:
>> int a[N]; short B[N*2];
>> for (int i = 0; i < N; ++i)
>> sum += A[i] + B[2*i] + B[2*i+1];
>
> this shouuuld result in near simultaneous B LDs ending up in the LDST buffers, where the lack of unary-encoded clashes in bits 0..3 of the address+LEN encoding as a cacheline column byte-enable line will result in them being merged into the same cacheline request in the same cycle.
>
> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1

And so you get away from SIMD. That will work fine for code working on
full-scalar objects. It's not going to work well for sub-scalar data.
You don't have the hardware capacity to use eight LS units to block
load/store a byte array, nor eight ALUs to add one to a byte. Or does
Mitch still plan on having GBOOO figure out SIMDization in hardware? We
gave up on that, and all ops work on SIMD too.

>> Is your hardware going to make one B stream, or two?
>
> two at the FU level, merged through unary mask detection and clash avoidance onto the same cache line.
>
>> How are you going
>> to have the order in the whack-load be a1-b1-b2-a2-b3-b4...? and what if
>> they overlap?
>
> they are not permitted to.
>
>> I admit I don't understand the conflict avoidence logic
>> proposed, but I also don't see anything that suggests things like this
>> are handled.
>
> it is. however it is only by each Matrix taking care of guarantees further down the chain that it all hangs together.
>
> if the MemHazard Matrix did not detect and separate out batches of LDs from batches of STs, in instruction order, we would be screwed at the AddressConflict detection phase.
>
> if the AddressConflict stage did not detect bytemask overlaps and exclude them, we would be screwed at the L1 cache level due to multiple requests for R/W at the same cacheline column.
>
> etc. etc.

Bon voyage :-)

Ivan Godard

unread,
Mar 27, 2020, 11:27:45 AM3/27/20
to
You will deeply regret using a Weak Memory Model.

MitchAlsup

unread,
Mar 27, 2020, 12:32:57 PM3/27/20
to
Is there? I fail to see any difference between

ST R4,[R4]
ST R4,[R4+1]
and
ST R4,[R4+1]
ST R4,[R4]

In both cases, the same values get written to the same memory locations.
GBOoO yes
SIMD no
But VVM makes loops faster than scalar code can ever be.

Ivan Godard

unread,
Mar 27, 2020, 12:54:25 PM3/27/20
to
On 3/27/2020 9:32 AM, MitchAlsup wrote:
> On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:

<snip>

>> Not sure about that. What about instructions that don't use registers
>> that you can chain? Or rather, use the same ones? Code like:
>> store r4, *r4;
>> store r4, *r4+1;
>> Here there is an order dependency between the two stores,
>
> Is there?: I fail to see any difference between
>
> ST R4,[R4]
> ST R4,[R4+1]
> and
> ST R4,[R4+1]
> St R4,[R4]
>
> In both cases, the same values get written to the same memory locations.
>

If r4 is 0x00010203, then the first example results in memory following
*r4 being the five bytes 0x0000010203, while the second produces
0x0001020303.

Perhaps my notation confused you. The "+1" is not scaled; it offsets the
store by one byte in memory.

WAW hazard is a thing.
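
For the record, a throwaway Python check of that byte arithmetic
(assuming 4-byte big-endian stores, and using offsets 0 and 1 into a
scratch buffer in place of the literal address; the point is only that
the two orders leave different bytes behind):

    def store32(mem, addr, value):
        # store a 32-bit value big-endian at byte address addr
        for i in range(4):
            mem[addr + i] = (value >> (8 * (3 - i))) & 0xFF

    r4 = 0x00010203

    a = bytearray(8)
    store32(a, 0, r4)    # ST R4,[R4]
    store32(a, 1, r4)    # ST R4,[R4+1]

    b = bytearray(8)
    store32(b, 1, r4)    # ST R4,[R4+1]
    store32(b, 0, r4)    # ST R4,[R4]

    print(a[:5].hex())   # 0000010203
    print(b[:5].hex())   # 0001020303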

Ivan Godard

unread,
Mar 27, 2020, 1:11:52 PM3/27/20
to
On 3/27/2020 9:32 AM, MitchAlsup wrote:
> On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:

<snip>

>>> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
>>
>> And so you get away from SIMD. That will work fine for code working on
>> full-scalar objects. It's not goind to work well for sub-scalar data.
>> You don't have the hardware capacity to use eight LS units to block
>> load/store a byte array, nor eight ALUs to add one to a byte. Or does
>> Mitch still plan on having GBOOO figure out SIMDization in hardware? We
>> gave up on that, and all ops work on SIMD too.
>
> GBOoO yes
> SIMD no
> But VVM makes loops faster than scalar code can ever be.

Ah, no, I don't think you can claim that. When you come right down to
it, VVM is itself scalar. Loops are more than loads and stores, and the
iteration interval is bounded by the ratio of FU count and op demand;
you are not magically creating ALUs on the fly. So you are not going to
be faster than any other approach with the same bounds, software
pipelining for example.

What you *can* claim is to do it more efficiently than most: you avoid
decode costs in a clever way (we have to use an I$0, and that's going to
be more expensive, albeit more flexible and general than VVM). And you
can sharply down-scale in the memory paths by agglutinating the
requests. For code that is efficiency-bound in the memory paths you
*will* be faster, but not in the general case.

Mind you, I really like VVM. It's clearly (IANAHG) better than legacy.
But it's not better than everything possible in scalar.

Stephen Fuld

unread,
Mar 27, 2020, 1:32:02 PM3/27/20
to
Minor quibble. Unless your ISA has a three input, one output increment
and conditional branch instruction, the LOOP instruction saves one cycle
per loop iteration.



> Mind you, I really like VMM. It's clearly (IANAHG) better than legacy.
> But it's not better than everything possible in scalar.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Ivan Godard

unread,
Mar 27, 2020, 1:48:36 PM3/27/20
to
Yes, that's nice. We crack those using three ops and phasing to get one
iteration per cycle. But really, our two-op increment/compare gang is
just an encoding form of a 3-input, 2-output op that is split over
two slots so we can keep the slots at 2-in/1-out while still handling the
oddball ops.

MitchAlsup

unread,
Mar 27, 2020, 2:17:48 PM3/27/20
to
On Friday, March 27, 2020 at 12:11:52 PM UTC-5, Ivan Godard wrote:
> On 3/27/2020 9:32 AM, MitchAlsup wrote:
> > On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:
>
> <snip>
>
> >>> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
> >>
> >> And so you get away from SIMD. That will work fine for code working on
> >> full-scalar objects. It's not goind to work well for sub-scalar data.
> >> You don't have the hardware capacity to use eight LS units to block
> >> load/store a byte array, nor eight ALUs to add one to a byte. Or does
> >> Mitch still plan on having GBOOO figure out SIMDization in hardware? We
> >> gave up on that, and all ops work on SIMD too.
> >
> > GBOoO yes
> > SIMD no
> > But VVM makes loops faster than scalar code can ever be.
>
> Ah, no, I don't think you can claim that. When you come right down to
> it, VVM is itself scalar. Loops are more than loads and stores, and the
> iteration interval is bounded by the ratio of FU count and op demand;
> you are not magically creating ALUs on the fly. So you are not going to
> be faster than any other approach with the same bounds, software
> pipelining for example.

Scalar code is FETCH-ISSUE bound and Latency bound.

Vectorized Loops are not FETCH-ISSUE bound (completely removing that
obstacle.)

Vectorized Loads in a Loop can determine if this load also satisfies
one or more additional load so the AGEN and read cache portions don't
have to be performed. Thus 1 load can satisfy 1-64 actual loads in
terms of bringing the data close enough so that the data can be obtained
through multiplexing.

Vectorized stores do similarly.

The ADD/CMP/BCND is 3 SCALAR instructions but 1 Vector instruction and
executes in 1 cycle instead of minimum 3.

Even a GBOoO machine with plenty of resources cannot be faster.

So the real question is how much calculation is required in the vector loop
versus how many calculation resources are available. And for a good class
of calculations, from null up to several per iteration, VVM is invariably faster than scalar
(staying within the same ISA).
>
> What you *can* claim is to do it more efficiently than most: you avoid
> decode costs in a clever way (we have to use an I$0, and that's going to
> be more expensive, albeit more flexible and general than VMM). And you
> can sharply down-scale in the memory paths by agglutinating the
> requests. For code that is efficiency-bound in the memory paths you
> *will* be faster, but not in the general case.
>
> Mind you, I really like VMM. It's clearly (IANAHG) better than legacy.
> But it's not better than everything possible in scalar.

I will grant you that distinction.

MitchAlsup

unread,
Mar 27, 2020, 2:21:05 PM3/27/20
to
Saves 2 cycles.
The ADD and the CMP are performed simultaneously with a 3-input adder doing the ADD and the CMP (n.e., SUB) at the same time.

{ Ri = Ri + Rinc
cmp = Rmax - (Ri + Rinc)
BLE cmp, Loop } 1 cycle.
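
Roughly, as a Python-level semantic sketch (the branch sense below is my
reading of the BLE; the point is that the updated index and the
comparison come out of the same cycle):

    def loop_step(Ri, Rinc, Rmax):
        # fused increment/compare/branch: one pass through the adders
        # yields both the new index and the loop-continue condition
        new_Ri = Ri + Rinc
        cmp = Rmax - new_Ri          # 3-input add: Rmax + ~Ri + ~Rinc + 2
        return new_Ri, cmp >= 0      # keep looping while the index has not passed Rmax

    i, taken = 0, True
    while taken:
        i, taken = loop_step(i, 1, 10)   # i ends at 11, one past Rmax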

Ivan Godard

unread,
Mar 27, 2020, 2:58:08 PM3/27/20
to
On 3/27/2020 11:17 AM, MitchAlsup wrote:
> On Friday, March 27, 2020 at 12:11:52 PM UTC-5, Ivan Godard wrote:
>> On 3/27/2020 9:32 AM, MitchAlsup wrote:
>>> On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:
>>
>> <snip>
>>
>>>>> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
>>>>
>>>> And so you get away from SIMD. That will work fine for code working on
>>>> full-scalar objects. It's not goind to work well for sub-scalar data.
>>>> You don't have the hardware capacity to use eight LS units to block
>>>> load/store a byte array, nor eight ALUs to add one to a byte. Or does
>>>> Mitch still plan on having GBOOO figure out SIMDization in hardware? We
>>>> gave up on that, and all ops work on SIMD too.
>>>
>>> GBOoO yes
>>> SIMD no
>>> But VVM makes loops faster than scalar code can ever be.
>>
>> Ah, no, I don't think you can claim that. When you come right down to
>> it, VVM is itself scalar. Loops are more than loads and stores, and the
>> iteration interval is bounded by the ratio of FU count and op demand;
>> you are not magically creating ALUs on the fly. So you are not going to
>> be faster than any other approach with the same bounds, software
>> pipelining for example.

Perhaps we have a terminological issue.

> Scalar code is FETCH-ISSUE bound and Latency bound.

In small loops (8 lines, including internal control flow if any) we are
not fetching at all, and some members are not decoding either. I think
FETCH-ISSUE is only a problem for OOO and then only if there's no trace
buffer. If you disagree then please explain where the bottleneck is.

Software pipelining removes all latency issues except load LLC misses. I
grant you that VVM acts like a smart prefetch, but smart prefetch does
not require VVM.

> Vectorized Loops are not FETCH-ISSUE bound (completely removing that
> obstacle.)
>
> Vectorized Loads in a Loop can determine if this load also satisfies
> one or more additional load so the AGEN and read cache portions don't
> have to be performed. Thus 1 load can satisfy 1-64 actual loads in
> terms of bringing the data close enough so that the data can be obtained
> through multiplexing.

Yes; I granted you the efficiency, it's the necessarily superior speed I
don't buy.

> Vectorized stores do similarly.
>
> The ADD/CMP/BCND is 3 SCALAR instructions but 1 Vector instruction and
> executes in 1 cycle instead of minimum 3.
>
> Even a GBOoO machine with plenty of resources cannot be faster.

Funny, a LBIO Mill does ADD/CMP/BCND equivalent in one cycle too. You're
good, but you are not unequaled :-)

> So the real question is how much calculation is required in the vector loop
> versus how many calculation resources are available. And for a good class
> of calculations from NULL to several VVM is invariable faster than scalar
> (staying within the same ISA).

It's invariably faster than legacy, but that's a very low bar. It's not
invariably faster than possible.

Sorry; remember, I do like it.


MitchAlsup

unread,
Mar 27, 2020, 3:21:30 PM3/27/20
to
The width of Issue in the machine in my head.
>
> Software pipelining removes all latency issues except load LLC misses. I
> grant you the VVM acts like a smart prefetch, but smart prefetch does
> not require VVM.
>
> > Vectorized Loops are not FETCH-ISSUE bound (completely removing that
> > obstacle.)
> >
> > Vectorized Loads in a Loop can determine if this load also satisfies
> > one or more additional load so the AGEN and read cache portions don't
> > have to be performed. Thus 1 load can satisfy 1-64 actual loads in
> > terms of bringing the data close enough so that the data can be obtained
> > through multiplexing.
>
> Yes; I granted you the efficiency, it's the necessarily superior speed I
> don't buy.

What, doing 2-8-64 loops in a single clock on a 1-wide machine is not
necessarily faster!?!?! What planet did you come from?
>
> > Vectorized stores do similarly.
> >
> > The ADD/CMP/BCND is 3 SCALAR instructions but 1 Vector instruction and
> > executes in 1 cycle instead of minimum 3.
> >
> > Even a GBOoO machine with plenty of resources cannot be faster.
>
> Funny, a LBIO Mill does ADD/CMP/BCND equivalent in one cycle too. You're
> good, but you are not unequaled :-)

MILL is far from LB; I will grant you the IO part.

Ivan Godard

unread,
Mar 27, 2020, 7:43:57 PM3/27/20
to
I don't understand you. You must have a meaning for "loop", "1-wide", or
"clock" that doesn't match mine.

In:
for (int i = 0; i < N; ++i)
A[i]++;
on a machine with one ALU, the iteration interval is one clock. It is
not possible to get 2, or 8, or 64 iterations per clock without 2, 8, or
64 ALUs to increment the A[i] elements.

Yes, you can do the loads and stores and the index recurrence of that
iteration in parallel in that cycle; so can other architectures.

Now please explain what you mean that is different from what I thought I
heard.

lkcl

unread,
Mar 28, 2020, 2:01:24 AM3/28/20
to
On Friday, March 27, 2020 at 2:32:50 PM UTC, Ivan Godard wrote:
> On 3/24/2020 2:47 PM, lkcl wrote:

> > so we are good. whew :)
>
> If you say so :-)

(i'd better be...)

> >> 2) Your L0 is our L1 victim buffer.
> >
> > ah ok. same thing, different name.
>
> Well, a little different, because they get filled from the L1 banks as
> well as from the FUs, and stores don't need line fetches.

hum, hum... is this because it's also being filled by SMP coherence requests that come from other cores, etc.?

> > ...use several layers? :)
>
> Money and power. And did I mention money?

this is why we are using coriolis2 (and NLNet)

> > this is enough to preserve the instruction order in an intriguing single-cycle way.
>
> Not sure about that. What about instructions that don't use registers
> that you can chain? Or rather, use the same ones? Code like:
> store r4, *r4;
> store r4, *r4+1;
> Here there is an order dependency between the two stores, but there's
> only one register involved and no updates or value creation.

reg read write dependency hazards - not memory hazards - would be detected and stop them from overlapping.

i do have a scheme for detecting wasted writes (i called it "nameless registers") however we are running out of time to implement optimisations.

so aside from that, it might be an IO store that you genuinely want to do...

wait... it's a ST: put the contents of r4 into the location pointed to by r4, and into r4+1.

yes of course those can be *issued* in parallel. they are reg read dependencies.

so there are 2 cases. the first is if they are, say, ST.byte: then there are no memory overlaps, all hunky-dory; *r4 and *r4+1 fall (maybe) into the same 16-byte cache line and get done in 1 go, otherwise they get done across odd/even cache lines (we plan to have that dual-banked L1, using bit 4 of addr to select them).

if however they are ST.dwords then the AddressConflict Matcher kicks in.

the AddressMatcher notices two things:

1. that the top bits 4 to 11 are identical AND
2. that the bytemasks created from the 8-byte len, shifted by address bits [0..3], have an overlap.

therefore the two will NOT be permitted to proceed simultaneously and MUST be done independently.

unlike the reg dependency checking (done before ACM phase) this part does *not* have anything to do with register numbers themselves.

although there is opportunity for detection and merging of the overlaps we will probably leave it for later.
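
(to make (1) and (2) concrete, here is a rough python-only sketch of the
bytemask construction and the per-pair conflict test - illustrative names,
comparing the full line address for simplicity where the real ACM only
looks at bits 4 to 11, and with the exact placement of the spill-over
bits being my own reading:)

    def bytemasks(addr, length):
        # (addr, len in bytes) -> byte-enable masks for up to two 16-byte
        # cache lines: ((1 << len) - 1) shifted by addr[0:4], with any
        # spill-over landing on line addr[4:] + 1
        mask = ((1 << length) - 1) << (addr & 0xF)
        line = addr >> 4
        return (line, mask & 0xFFFF), (line + 1, mask >> 16)

    def acm_conflict(a, b):
        # one Address Conflict Matrix cell: same line AND overlapping masks
        return any(la == lb and (ma & mb)
                   for la, ma in bytemasks(*a) if ma
                   for lb, mb in bytemasks(*b) if mb)

    # the ST.dword case: two 8-byte stores one byte apart must serialise
    print(acm_conflict((0x1000, 8), (0x1001, 8)))   # True
    # the ST.byte case: no overlap, so both can go in the same cycle
    print(acm_conflict((0x1000, 1), (0x1001, 1)))   # False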


> I suppose
> you can create a fake dependency r4->r4, but that could read in either
> direction. And it could be followed by "store r4, r4+2;" too.

yes.

> How do you
> resolve this without an order-tag?

i did not mention, i also added a FU-FU LDST order preserving matrix, it has been a while (code written last year), working in