
multi-way LOAD/STORE buffers and misalignment


lkcl

Mar 24, 2020, 7:53:29 AM

we're doing a LD/ST buffer at the moment, and some design review and
constructive feedback and advice would be greatly appreciated.
unlike many "first time" processors, we cannot possibly get away
with a simple single-issue LD/ST design as the first iteration, because
the vector engine is effectively a multi-issue multiplier: from a single
instruction, it hammers the internal engine with a *batch* of actual
operations.

this includes LD/ST, so we need a multi-issue-capable
LD/ST buffer right from the start. it is being documented here:

https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

with Mitch's help i already have a RAW / WAR dependency detection matrix
in place, which detects (in-order) *batches* of LDs, *batches* of STs,
and *batches* of atomic operations (defined by having LD/ST both set).

additionally, we have (again, thanks to Mitch) an "address clash detection"
Matrix, where the original design concept from Mitch uses bits 4 thru 11
(stopping at a VM page) to say "this LD/ST definitely doesn't overlap with
any others" - although it may also flag a few that don't *actually*
overlap, on the basis of "better safe than sorry".

this is quite efficient because the number of bits used to detect a clash
is drastically reduced, which is important given that an NxN/2 set of
binary-address comparators is involved.

however we *modified* this idea, by turning the addr+len (len=1/2/4/8 byte)
into a *byte-mask*, where the byte-mask is effectively nothing more than
(or, "can be used as") byte-enable lines on the L1 cache memory read/write.

this means turning len=1 into 0b1, len=2 into 0b11, len=4 into 0b1111
or len=8 into 0b1111 1111 and then shifting it by the LSBs of the address
(bits 0-3).

however of course, that could "wrap over" to a second cache line, therefore
we split it into two:

* 1st cache line, take bits 0..15 of the mask, use addr[4:] as the address
* 2nd cache line, take bits 16..23 of the mask, use addr[4:] PLUS ONE as the address

now all LDs are potentially *two* LDs, and all STs are potentially *two* STs,
requiring synchronisation.
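
(to make that concrete, a quick sketch of the mask-generation in plain
python - purely illustrative, not the actual nmigen code; the function name
is made up, and the 16-byte cache line follows from addr bits 0-3 being the
byte-offset:)

    def ldst_byte_masks(addr, ld_len):
        # ld_len is 1, 2, 4 or 8 bytes
        mask = (1 << ld_len) - 1         # len=1 -> 0b1, len=2 -> 0b11, etc.
        shifted = mask << (addr & 0xF)   # shift by addr bits 0-3
        line = addr >> 4                 # addr[4:] = cache-line address
        mask1 = shifted & 0xFFFF         # bytes landing in this cache line
        mask2 = shifted >> 16            # bytes spilling into line PLUS ONE
        return line, mask1, mask2

    # e.g. an 8-byte access at 0x102C crosses the line boundary:
    # ldst_byte_masks(0x102C, 8) -> (0x102, 0xF000, 0x000F)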

this does interestingly mean that the "address clash detection" Matrix is
now doubled in size, because each LD/ST Computation Unit can now issue *two*
LDs/STs, not one.

there is a big advantage to partial-conversion of the LD/STs to unary-masked
format:

1) using those "masks", the address-clash unit can now detect whether there
are overlaps, by simple "unary" ANDing.

2) once a batch of guaranteed-non-overlapping LDs (or STs, but not both at
once) has been identified, the ORing of those masks *pre-identifies*,
where the remaining MSBs are identical, which LDs and which STs are
*part of the same cache line*.

3) the ORed masks can be dropped - as-is - directly onto byte-level
read/write-enable lines on the L1 Cache underlying memory - with zero
shifting or other messing about.
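
(again purely illustrative - a plain-python sketch of how those unary masks
get used, assuming the 16-bit per-line byte-masks from above:)

    def masks_clash(mask_a, mask_b):
        # two LDs (or two STs) on the same cache line overlap iff nonzero
        return (mask_a & mask_b) != 0

    def merge_byte_enables(masks):
        # ORing guaranteed-non-overlapping masks gives the byte-enable lines
        # that can be dropped straight onto the L1 cache memory
        enable = 0
        for m in masks:
            assert not masks_clash(enable, m)  # guaranteed by the clash Matrix
            enable |= m
        return enable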

so that's the bits that we "know".

the bits that we *don't* know are what is the best L1 cache arrangement
to use. we discussed the possibility of an 8-way set-associative cache
with 512-bit wide data (!!!) - yes, really, because we will have a minimum
of 4 LD/ST operations @ 64-bit occurring every cycle (in the first version
of the processor: this *will* go up in later iterations).

one thing that would be particularly nice to avoid would be multi-ported
SRAM on the L1 Cache Memories. i thought it may be possible to
"stripe" the L1 Cache, such that bits 4-5 of every address go through to,
effectively, a completely separate (isolated, independent) L1 Cache.

* Addr bits [0:3] already turned into byte-map mask (enable lines)
* Addr bits [4:5] used to select one of four INDEPENDENT caches
* Addr bits [6:end] used "as normal"
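
(as a hypothetical decode, just to illustrate the striping - names made up:)

    def split_address(addr):
        byte_mask_shift = addr & 0xF          # bits [0:3]: unary byte-mask shift
        bank            = (addr >> 4) & 0b11  # bits [4:5]: one of 4 independent caches
        line_index      = addr >> 6           # bits [6:]:  used "as normal"
        return byte_mask_shift, bank, line_index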

the only downside of this being: LD/STs that happen to be on the same
64-byte boundary would be "penalised" by being only single-issue-capable.

do we care? i'm not sure that we do!

_should_ we care? that's the question i'd like to know the answer to :)
an alternative is that we have to multi-port the L1 Cache memory, which
would be quite expensive.

what alternatives are there? is there a way to, for example, have an
8-way set-associative cache with 8 independent input/output data paths,
yet they do not require the trick of a full split using addr bits [4:5],
but at the same time do not require 4R4W SRAMs?

i *think* this is doable, with the potential for some wait cycles if
one of the 8 ways happens to have request contention. honestly i don't know
what's best here.

advice and discussion appreciated as always,

l.

lkcl

Mar 24, 2020, 10:19:50 AM

do most systems simply assume that all L1 caches are effectively 1R1W as
far as the core is concerned, then "amalgamate" cache access into as close
to full cache lines as possible, and, if greater bandwidth is required,
simply "up the cache line width" from e.g. 16 bytes to 64 bytes?

i can't say i'm particularly happy with such an arrangement, because if
LDs/STs from a single vector instruction are to be issued on 64-byte
"strides", they're guaranteed to end up as single-issue.

how the hell is this done?? every cache i've seen is effectively 1R1W.
i found *one* paywalled paper on "dual-ported N-way cache"
https://ieeexplore.ieee.org/document/4734919
however its focus appears not to be on multi-porting but on power
reduction.

there is also this: http://sadve.cs.illinois.edu/Publications/isca00.pdf

which describes a "reconfigurable" cache, subdividable down into
partitions - howeverrr agaaaain: it's not multi-ported: it's more about
subdividing caches by "processor activity". this is probably now
called "ASID" (the paper is from 2000) - Address Space Identifier.

frustratingly-lost, here.

l.

MitchAlsup

Mar 24, 2020, 11:27:05 AM

Yes, you do want to care:: you want to care to "get it correct" even
at some minor cost in perf.
>
> _should_ we care? that's the question i'd like to know the answer to :)
> an alternative is that we have to multi-port the L1 Cache memory, which
> would be quite expensive.
>
> what alternatives are there? is there a way to, for example, have an
> 8-way set-associative cache with 8 independent input/output data paths,
> yet they do not require the trick of a full split using addr bits [4:5],
> but at the same time do not require 4R4W SRAMs?

What you want to know are the dimensions of the SRAM macros. Given these
dimensions you are in a position to start making good choices; without them
you are not. A typical SRAM dimension might be 256-bits wide and 64-words
deep (2KBytes).

Since you want to read/write 512-bits per port, you will need 2 SRAM
macros per port (per cycle). So if you are making 8-way SA cache, you
will need at least 16 SRAM macros (32KBytes)--and (A N D) you still
have but a single port!

On the other hand, drop back to 4-way set, and that same SRAM array
can now support 2 ports of 4-way SA each. In this organization you will
be using more lower-address bits to choose port[0] versus port[1]. And
you will see 40% conflicts if you are really running 2 requests per cycle.
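
Back-of-envelope arithmetic for the above, as a sketch (assuming the 2KB
macro dimensions given, 512-bit accesses, and one access per macro per cycle):

    MACRO_BITS = 256                  # each macro: 256 bits wide x 64 words
    MACRO_KB = MACRO_BITS * 64 // 8 // 1024          # = 2 KBytes per macro

    def macros_needed(ways, ports, access_bits=512):
        per_way_per_port = access_bits // MACRO_BITS  # 2 macros to get 512 bits
        return ways * ports * per_way_per_port

    print(macros_needed(8, 1))   # 16 macros (32KB), but still only 1 port
    print(macros_needed(4, 2))   # same 16 macros, now 2 ports of 4-way SA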

However, if you know the access pattern is "dense", then you know that
LD[i][k..k+3] are all in the same line, and you can THEN make one access
for every 4 actual LDs and then sort it out with multiplexers. And NOW we
have enough ports and width to use 1024 bits per cycle throughput in the
cache. {You can determine an access pattern will be dense when the loop
increment is +1 (or -1) and the loop index is used in scaled form in
memref instructions:

LD Rx,[Rb+Ri<<scale+DISP]

} Other forms of dense access are harder for HW to detect and utilize.

MitchAlsup

Mar 24, 2020, 11:36:16 AM

On Tuesday, March 24, 2020 at 9:19:50 AM UTC-5, lkcl wrote:
> do most systems simply assume that all L1 caches are effectively 1R1W as
> far as the core is concerned, then "amalgamate" cache access into as close
> to full cache lines as possible, and, if greater bandwidth is required,
> simply "up the cache line width" from e.g. 16 bytes to 64 bytes?

You should assume (if you want a high clock rate) that SRAM macros are
1R or 1W per cycle not 1R AND 1W per cycle.

Architecturally, it is a lot cleaner to use 1R AND 1W per cycle SRAM macros
but this will cost you in clock rate as you are effectively running the
SRAM macros on a 2× clock.

These days, what I prefer is to read/write wide into a buffer (call it an L0
if you will) and use multiplexers to deliver the data. This buffer will
have 4-8 entries.
>
> i can't say i'm particularly happy with such an arrangement, because if
> LDs/STs from a single vector instruction are to be issued on 64-byte
> "strides", they're guaranteed to end up as single-issue.

But you are only orchestrating data to/from this buffer every 4 LDs/STs--
wider and slower.
>
> how the hell is this done?? every cache i've seen is effectively 1R1W.
> i found *one* paywalled paper on "dual-ported N-way cache"
> https://ieeexplore.ieee.org/document/4734919
> however its focus appears not to be on multi-porting but on power
> reduction.
>
> there is also this: http://sadve.cs.illinois.edu/Publications/isca00.pdf
>
> which describes a "reconfigureable" cache, subdivideable down into
> partitions - howeverrr agaaaain: it's not multi-ported: it's more about
> subdividing caches by "processor activity". this is probably now
> called "ASID" (the paper is from 2000) - Address Space Identifier.
>
> frustratingly-lost, here.

Typical problem with academia--does not want to "play the game" with the
logic technology of the current day.

The building blocks are the SRAM macros. Consider each macro having 1R OR
1W per cycle--add to this some buffers so that you are accessing the SRAM
macros once every 4 cycles (64-bit containers) per data line, and sort
it out with flip-flops and multiplexers.

You might want to give up some settedness for portedness. But it all comes
down to the number of SRAM macros you start out with.
>
> l.

Ivan Godard

Mar 24, 2020, 11:50:16 AM

Very much not my expertise, so this comment may well be inane or wrong.

We don't have load-store buffers, but still have to merge requests,
which is done at the L1 victim buffers. Like you, there can be multiple
requests each cycle, although ours come from independent ops rather than
vectorization. One thing that caused us a lot of thought that (if I
understand it) you are not addressing: preserving semantic order.
Mitch's (hence your) vector logic can pick up several different streams
in the raw scalar loop, and those may potentially overlap. A scalar
store-load-store-load sequence is unambiguous; the vectorized version is
not.

It is not even sufficient to detect that the initial address recurrence
overlaps and avoid setting up a vector, at least if your vectors support
negative strides and/or different strides in different but concurrent
streams, which they really must. You can get two vectors sliding over
each other in opposite directions or one overtaking another, and
determining the order semantics when the overlap happens is non-trivial
except in Fortran.

Assuming that you resolve ordering issues, there is then the alignment
and isolation issue you address above. We initially required natural
alignment, but gave it up on business/market grounds. However, while we
have to support unaligned access, we decided that we didn't have to make
it fast. Consequently the load/store units check for boundary-crossing
and if necessary double-pump the request. The difference between our
approach and your description is that we do the split at the LS FU, not
at the other end of the request stream at the buffers. Because our
request stream is strictly ordered, we can split any request while
preserving semantic order.

However, the request stream has a fixed number of slots per cycle. If
all the slots are taken by LS requests then there's no slot for the
splitee. On a slow machine one could stall for a cycle, but our firehose
takes a couple of cycles to throw a stall, due to speed-of-light issues.
Consequently the request stream, and pretty much all other streams in
the hierarchy, use a predictive slot assignment algorithm where the
source maintains a high/low-water count of the projected state at the
sink end, and throttles itself as necessary. The source does not send a
request unless it knows that the sink will have catch capacity when the
request arrives (will arrive, not is sent). This is *not* a handshake.

This algorithm is essentially identical to the predictive throttling
used in network packet handling, q.v. The throttle logic, which is
widely used in a Mill, is called "Drano", because it keeps the pipes
open. Of course, excess split requests on a saturated request stream
will eventually exceed the Drano buffering capacity and the source will
issue a stall, early enough (it's a predictive system, after all) that
the stall can take effect before the sink overflows.

If I understand your system (questionable) you are able to handle
boundary overrun by doubling the access capacity so you can do two
(derived from one) requests at once. This seems to be wasteful IMO,
because you will very rarely see boundary crossing requests. You propose
to be able to do two at once, but do not have the ability to do two at
once from two requests. Mill Drano has only the ability to do one at a
time (per stream slot), but the slots are rarely full in a cycle, and
the Drano buffer lets requests spill over into other cycles. The cost
for us is that a split request may cause other requests to be delayed a
cycle (if there are no free slots in the current cycle) or trigger a
stall (if there are no free slots within the reach of the Drano).

Your proposal is immune to the Mill's potential delay, but it leaves you
searching for double-ported hardware. I suggest that you consider saying
that users of boundary-crossing references should accept a potential
delay as the cost of software flexibility. A bit of cooperation from the
rest of the ISA will help; for example, rather than support arbitrary
arithmetic and assignment to the stack pointer, you could make it a
specReg and provide an update operation that always allocates in whole
aligned lines, so the stack frame is always aligned and your compiler
can lay out data so that ordinary references never cross boundaries. Of
course, that might be too CISCy for you.

lkcl

Mar 24, 2020, 12:18:32 PM

On Tuesday, March 24, 2020 at 3:36:16 PM UTC, MitchAlsup wrote:
> On Tuesday, March 24, 2020 at 9:19:50 AM UTC-5, lkcl wrote:
> > do most systems simply assume that all L1 caches are effectively 1R1W as
> > far as the core is concerned, then "amalgamate" cache access into as close
> > to full cache lines as possible, and, if greater bandwidth is required,
> > simply "up the cache line width" from e.g. 16 bytes to 64 bytes?
>
> You should assume (if you want a high clock rate) that SRAM macros are
> 1R or 1W per cycle not 1R AND 1W per cycle.

ok. appreciated. i'll ask Staf: he's actually doing (test chip is taping
out this month) a Libre Cell Library, including SRAMs, DRAMs, GPIOs and
standard cells as well.

> Architecturally, it is a lot cleaner to use 1R AND 1W per cycle SRAM macros
> but this will cost you in clock rate as you are effectively running the
> SRAM macros on a 2× clock.

hmmm

> These days, what I prefer is to read/write wide into a buffer (call it an L0
> if you will) and use multiplexers to deliver the data. This buffer will
> have 4-8 entries.

interestingly, that's what we started discussing: a mini-cache of, yes,
only around 8 cache-lines.

so what you're saying is: have multi-porting by way of multiplexers (4-in, 4-out, or in our case 8, 12 or even 16-in on the LDSTCompUnit side, 4-out on the L0 cache side), group things into cache-lines and then hit the L1 cache one line at a time from the L0 cache/buffer?


> > i can't say i'm particularly happy with such an arrangement, because if
> > LDs/STs from a single vector instruction are to be issued on 64-byte
> > "strides", they're guaranteed to end up as single-issue.
>
> But you are only orchestrating data to/from this buffer every 4 LDs/STs--
> wider and slower.

... except, the Vector Engine that we have, it can actually issue 4 LDs/STs
per clock cycle (actually i was thinking of putting in 6 or 8 LD/ST CompUnits).
if *all* of those are on 64-byte strides, they'll hit the same cache line
(@64-byte width), that means those 4 LDs/STs grind to a halt, only being able
to access alternate cache lines on every 4th cycle.

if instead we split the L1 cache into two halves, at least that comes down
to every 2nd cycle...

ahhh. and the L0 cache/buffer will "fix" the problem of odd-even addressing
causing slow-down just because two LDs/STs happened to both have "odd" addresses.


> You might want to give up some settedness for portedness. But it all comes
> down to the number of SRAM macros you start out with.

we can do what we want (within reason) because we're using a Libre SRAM
library. downside: we're using a Libre SRAM library and *have* to do it
ourselves... :)

l.

lkcl

Mar 24, 2020, 12:25:53 PM

On Tuesday, March 24, 2020 at 3:27:05 PM UTC, MitchAlsup wrote:

> Yes, you do want to care:: you want to care to "get it correct" even
> at some minor cost in perf.

happy to have something that's correct and working at this moment in time.

> What you want to know are the dimensions of the SRAM macros. Given these
> dimensions you are in a position to start making good choices, without
> you are not. A typical SRAM dimension might be 256-bits wide and 64-words
> deap (2KBytes).

we're able to build / design our own.

> Since you want to read/write 512-bits per port, you will need 2 SRAM
> macros per port (per cycle). So if you are making 8-way SA cache, you
> will need at least 16 SRAM macros (32KBytes)--and (A N D) you still
> have but a single port!
>
> On the other hand, dropping back to 4-way set, and that same SRAM array
> can now support 2-ports of 4-way SA each. In this organization you will
> be using more lower-address bits to choose port[0] versus port[1]. And
> you will see 40% conflicts if you are really running 2 request per cycle.

ok. this makes sense (in combination with the L0 cache/buffer from your
other message).

> However, if you know the access pattern is "dense", then you know that
> LD[i][k..k+3] are all in the same line,

ha! we have a detection system for that! by turning the low address
bits plus the LD/ST-len (1/2/4/8) into a shifted byte-mask, we know
which bytes are going to be accessed in any given cache line, and we can
then merge them into a single line-request, by noting when the high
bits of the addresses are all equal.

the Address-Conflict-Matcher will *already* have excluded anything that
could overlap, so we *know* that, in the unary byte-masks, they are
going to be uniquely-set. i.e. no two LDs (or two STs) will ever "overlap"
at the cache byte-line *column* level.


> and you can THEN make one access
> for every 4 actual LDs an then sort it out with multiplexers. And NOW we
> have enough ports and width to use 1024 bits per cycle throughput in the
> cache. {You can determine an access pattern will be dense when the loop
> increment is +1 (or -1) and the loop index is used in scaled form in
> memref instructions:
>
> LD Rx,[Rb+Ri<<scale+DISP]
>
> } Other forms of dense access are harder for HW to detect and utilize.

i believe we have it covered by turning the accesses into a unary byte-mask
of the exact same width as the cache-line.

l.

lkcl

Mar 24, 2020, 1:02:49 PM

fortunately, our vector engine is effectively nothing more than a multi-issue scalar system. thus by the time instruction-issue is over (which is where the
vector-thing sits) vectorisation is entirely gone.

mitch's address-conflict and memory-hazard-matrix system detects:

* sequences of load-load-load and separates them from store-store-store-store
whilst *also preserving the order*

* identifies address-conflicts (in a slightly-overzealous way)


> It is not even sufficient to detect that the initial address recurrence
> overlaps and avoid setting up a vector, at least if your vectors support
> negative strides and/or different strides in different but concurrent
> streams, which they really must. You can get two vectors sliding over
> each other in opposite directions or one overtaking another, and
> determining the order semantics when the overlap happens is non-trivial
> except in Fortran.

again: we do not have a problem, here, because the "vectors" are not actually
vectors, per se, they're just an instruction to the *issue* engine to sit
with the PC "frozen", blurting out a whack-load of *scalar* instructions,
incrementing the *register* numbers on each time round the "vector" hardware
for-loop.

thus we can "pretend" that those scalar instructions (which happen to have
sequential register numbers but the same operation) were actually really,
*really* scalar operations actually coming from the I-Cache, as in, they
were, actually, really, *really*, actual binary operations @ PC, PC+4, PC+8
and so on.

all we need to do then is design a (bog-standard) Multi-Issue *scalar*
execution engine, and all of the problems of Vector overlap detection
etc. all go away.

to be replaced by lovely, shiny new ones of course :)
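
(a rough sketch of what that issue-side hardware for-loop amounts to, in
python pseudo-code - the names are made up, the register-number incrementing
is the point:)

    def issue_vector_op(issue, op, rd, rs1, rs2, vl):
        # PC stays frozen while the issue engine blurts out VL scalar ops,
        # each with incremented register numbers
        for i in range(vl):
            issue(op, rd + i, rs1 + i, rs2 + i)
        # downstream, these are indistinguishable from VL genuinely-scalar
        # instructions fetched from the I-Cache at PC, PC+4, PC+8 ...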

> Assuming that you resolve ordering issues, there is then the alignment
> and isolation issue you address above. We initially required natural
> alignment, but gave it up on business/market grounds. However, while we
> have to support unaligned access, we decided that we didn't have to make
> it fast.

:)

> Consequently the load/store units check for boundary-crossing
> and if necessary double-pump the request. The difference between our
> approach and your description is that we do the split at the LS FU,

the plan is to have the split directly after the LD/ST Function Unit:
effectively having (and therefore making it potentially wait for) two
LD/ST ports.

we'd anticipate that for the most part the 2nd port would go completely
unused, as it's only going to be active when mis-aligned memory requests
occur.

> not
> at the other end of the request stream at the buffers. Because our
> request stream is strictly ordered, we can split any request while
> preserving semantic order.

yeah the plan is to turn the LSBs of the address into that unary masking
thing. we won't ever hit even the buffers with bits 0-3 of the address:
it will be unary byte-mask lines 0-15.

and the Memory-Order Matrix takes care of load/store ordering. i'm trusting
Mitch's advice that, whilst *batches* of loads have to be ordered with respect
to *batches* of stores (load-load-load store-store load-load *must* be
3xLD then 2xST then 2xLD), for a Weak Memory Ordering system the ordering
of LDs (or STs) *within* a batch does *not* matter.
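
(sketch of that batching rule, purely illustrative:)

    from itertools import groupby

    ops = ["LD", "LD", "LD", "ST", "ST", "LD", "LD"]
    batches = [list(g) for _, g in groupby(ops)]
    # -> [3 x LD], [2 x ST], [2 x LD]: batches complete in program order,
    #    but ops *within* a batch may complete in any order (WMO, non-atomic)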


> However, the request stream has a fixed number of slots per cycle. If
> all the slots are taken by LS requests then there's no slot for the
> splitee.

based on Mitch's assessment that for WMO, the ordering of any given LD (or ST)
in a "batch" is not important, it didn't occur to me to make any distinction
between the 2-way split operations.

for a TSO system, hell yes it would matter. and, because we're now doing
POWER ISA, i know that TSO really does matter on IO-memory-accesses, so
we'll cross that bridge when we come to it. (i believe POWER 3.0B copes
with this by simply banning misaligned IO-memory accesses)

> On a slow machine one could stall for a cycle, but our firehose
> takes a couple of cycles to throw a stall, due to speed-of-light issues.

:)

> Consequently the request stream, and pretty much all other streams in
> the hierarchy, use a predictive slot assignment algorithm where the
> source maintains a high/low-water count of the projected state at the
> sink end, and throttles itself as necessary. The source does not send a
> request unless it knows that the sink will have catch capacity when the
> request arrives (will arrive, not is sent). This is *not* a handshake.
>
> This algorithm is essentially identical to the predictive throttling
> used in network packet handling, q.v. The throttle logic, which is
> widely used in a Mill, is called "Drano", because it keeps the pipes
> open.

good name :)

> Of course, excess split requests on a saturated request stream
> will eventually exceed the Drano buffering capacity and the source will
> issue a stall, early enough (it's a predictive system, after all) that
> the stall can take effect before the sink overflows.
>
> If I understand your system (questionable) you are able to handle
> boundary overrun by doubling the access capacity so you can do two
> (derived from one) requests at once.

and let the buffers sort it out, yes. ok, so the idea is to have N (N=6 or 8)
LD/ST Function Units, each with 2 ports, and to try to get 4-way porting
to the L0 cache/buffer.

now, unless there's a really, really good reason (and i am getting some
hints from what you describe that there might be), i wasn't going to
"prioritise" the non-misaligned LD/ST operations, on the basis that, well,
it's all gotta come out at some point [hopefully not as poop...]

> This seems to be wasteful IMO,
> because you will very rarely see boundary crossing requests. You propose
> to be able to do two at once, but do not have the ability to do two at
> once from two requests. Mill Drano has only the ability to do one at a
> time (per stream slot), but the slots are rarely full in a cycle, and
> the Drano buffer lets requests spill over into other cycles. The cost
> for us is that a split request may cause other requests to be delayed a
> cycle (if there are no free slots in the current cycle) or trigger a
> stall (if there are no free slots within the reach of the Drano).
>
> Your proposal is immune to the Mill's potential delay, but it leaves you
> searching for double-ported hardware.

ah no i wondered *if* double-ported (or quad-ported) L1 caches exist,
and from what Mitch was saying, not even 1R1W caches exist, they're
1R-*or*-1W caches.

so that idea is shot.

now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
12-in (or 16-in) Priority Muxer to allow access to those 4 ports.

12-in because if we have 6 LD/ST Function Units, each will have its own
splitter thus doubling the FU ports in effect. 16-in if we have 8 LD/ST FUs.

i don't mind having a 12-or-16 to 4 multiplexer: i wrote a Multi-port
Priority-Picker for this purpose, just last week.

so let's then say that we have 2 of those LD/ST units issuing mis-aligned LDs,
and another 2 FUs issue aligned LDs. this effectively means we have *six*
cache-line-aligned LD/STs being issued:

* 2 from the 1st mis-aligned LD
* 2 from the 2nd mis-aligned LD
* 1 from the 1st aligned LD
* 1 from the 2nd aligned LD

depending on what the 4-simultaneous PriorityPicker happens to pick, we could
have *any* of those 6 be picked.

now... should we do something such that mis-aligned LDs/STs are *reduced* in priority? given that we have this byte-mask "cache-line detection system", i'm not so sure that we should try that, because the more that is thrown at the L0 cache/buffer, the more opportunity there could be for "merging" the unary byte-masks into full cache lines.

i honestly don't know, here.

> I suggest that you consider saying
> that users of boundary-crossing references should accept a potential
> delay as the cost of software flexibility.

i think... i think we're going to have to see how it goes, and (thank you
for the heads-up) keep a close eye on that.

one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
count. full cache lines get top priority.
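
(roughly, something like this - hypothetical sketch, not implemented:)

    def line_priority(merged_byte_mask):
        # popcount of the ORed byte-enables: 16 = full cache line, top priority
        return bin(merged_byte_mask).count("1")

    # e.g. service candidates in order of descending density:
    # sorted(candidates, key=lambda c: line_priority(c.mask), reverse=True)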

i can see, now that you mention it, that a large number of mis-aligned LD/STs
will, due to using double the number of cache lines of a non-misaligned
LD/ST, clearly cause throughput slow-down.

however *if* those misalignments happen to be mergeable with other memory requests (because of the cache-byte-line bitmaps), actually it will not be as bad as it sounds.

the caveat being: they have to be within the same address (bits 0-3 excluded
from that), and pretty much the same cycle. for a sequential memory read/write
that happened to be accidentally mis-aligned, this condition would hold true.

however for random misaligned memory access we'd end up with 1/2 throughput.


> A bit of cooperation from the
> rest of the ISA will help; for example, rather than support arbitrary
> arithmetic and assignment to the stack pointer, you could make it a
> specReg and provide an update operation that always allocates in whole
> aligned lines, so the stack frame is always aligned and your compiler
> can lay out data so that ordinary references never cross boundaries. Of
> course, that might be too CISCy for you.

unfortunately we're stuck with POWER ISA. actually... i *say* "stuck":
we're planning on extending it behind a "safety barrier" of escape-sequencing,
so we couuuld conceivably do strange and wonderful things, if justified.

l.

Ivan Godard

Mar 24, 2020, 2:18:07 PM

On 3/24/2020 10:02 AM, lkcl wrote:
> On Tuesday, March 24, 2020 at 3:50:16 PM UTC, Ivan Godard wrote:

<snip>
You are way beyond me. Some questions:

1) have you allowed for two scalar references being not congruent but
overlapping and both are boundary crossing? If they are both stores, for
example, it would be unfortunate if the update effect was a1-b1-a2-b2
instead of a1-a2-b1-b2.

2) Your L0 is our L1 victim buffer. With 6 LS units you have a 12-way
mux to reach the four ports; we have a 6 way mux because of the upstream
Drano; I agree that a 12-to-4 mux is unimportant. However, you also have
12 wiresets from the LS to the muxes, which in wire terms are remote. We
would have six, so in effect you have doubled the operand width in order
to support boundary crossing. Our hardware guys have in general been
much more concerned with wire area than with logic; how do you weight
these in your design?

3) I'm doubtful that the "whack-load issue" will remove all ordering
issues. Yes, it clearly works within any single stream, and for multiple
disjoint streams. And I can see that conjoint streams can have the
ordering of overlapping whack-loads disambiguated with enough care. But
it seems to me that such care is not doable in practical hardware, at
least with the fanout you propose. If one stream is double-at-a-time,
one is word-at-a-time, and one is short-at-a-time, and they are on
different byte alignments, then as they cross each other the conflict
detector will correctly order D1-D2-D3-D4 and so for the other streams,
but the multi-stream will not be d1-w1-s1-d2-w2-s2-etc, but will be
interleaved as something like d1-w1-s1-s2-w2-w3-d3-etc.

Now you can get around this by having the subscript correspond to the
iteration instead of the address. Then it's unambiguous that
d1-w1-s1-d2-w2-s2... But you have nothing in the ISA that tells you what
an iteration is. Mitch has his loop-vec operation pair that gives him
that info, but you can only pick it up from the addresses. Think of this
loop:
int A[N]; short B[N*2];
for (int i = 0; i < N; ++i)
    sum += A[i] + B[2*i] + B[2*i+1];
Is your hardware going to make one B stream, or two? How are you going
to have the order in the whack-load be a1-b1-b2-a2-b3-b4...? and what if
they overlap? I admit I don't understand the conflict avoidance logic
proposed, but I also don't see anything that suggests things like this
are handled.

MitchAlsup

Mar 24, 2020, 3:15:33 PM

We used to do double ported 1R AND 1W SRAMs and were forced to 1R or 1W
SRAMs by the technology. In 1µ technology it was not hard to get SRAMs
to read and write in a single 16-gate cycle. But somewhere in the 0.18µ
era SRAMs became slower because of transistor leakage, bit-line capacitance,
and body effects of <still> planar transistors. You can still get SRAMs that
are fast enough to run a 20-gate machine cycle and hit them 2× per cycle,
but not if the library SRAM makes YOU put the flip-flops at the edge of
the SRAM and these flip-flops are timed on the std clock.

Caches, on the other hand can be 2 ported (or even 4 ported) using std SRAMs
as long as you are not using 1 SRAM macro more than once per cycle. The
rest is simply wiring the macros up however you deem appropriate.

Note:: In order to improve the tag-TLB-hit logic speed, Athlon and Opteron
gated the raw bit lines of the Tag and TLB into an analog XOR gate and then
used sense amplifiers to decide if the bits matched. They did not actually
bother to read out the translated bits from either the Tag array or the
TLB array. This necessitates these two arrays be side-by-side.
>
> so that idea is shot.
>
> now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
> 12-in (or 16-in) Priority Muxer to allow access to those 4 ports.
>
> 12-in because if we have 6 LD/ST Function Units, each will have its own
> splitter thus doubling the FU ports in effect. 16-in if we have 8 LD/ST FUs.

All of this split stuff resolves about gate (delay) 6 whereas the higher
order AGEN bits do not resolve until at least gate 12-13 (64-bit addresses).
So there is plenty (PLENTY) of time to do bank selection and split analysis
while the rest of the AGEN bits are resolving. By the time you get to the
L0 multiplexers, there should be no "priorities" and no "indecision" as to
what this cycle is doing. (i.e., move it back 1/3rd cycle)
>
> i don't mind having a 12-or-16 to 4 multiplexer: i wrote a Multi-port
> Priority-Picker for this purpose, just last week.
>
> so let's then say that we have 2 of those LD/ST units issuing mis-aligned LDs,
> and another 4 FUs issue aligned LDs. this effectively means we have *six*
> cache-line-aligned LD/STs being issued:
>
> * 2 from the 1st mis-aligned LD
> * 2 from the 2nd mis-aligned LD
> * 1 from the 1st aligned LD
> * 1 from the 2nd aligned LD
>
> depending on what the 4-simultaneous PriorityPicker happens to pick, we could
> have *any* of those 6 be picked.

I have priority pickers that can do a unary pick for 72 inputs in 4 gates of
delay (without using more than 4-input NAND gates and 3-input NOR gates).
That is very fast, and as wide as you could ever need. But for 1 from 4
there is 1 gate of delay (from a flip flop where you have access to both
polarities), 1 from 12 is 2 gates, 1 from 36 is 3 gates.

But these should be performed while the HoBs are resolving.
>
> now... should we do something such that mis-aligned LDs/STs are *reduced* in priority? given that we have this byte-mask "cache-line detection system", i'm not so sure that we should try that, because the more that is thrown at the L0 cache/buffer, the more opportunity there could be for "merging" the unary byte-masks into full cache lines.

Just guessing: If you get a misaligned, you probably don't have the second
line address (even though it's only an INC away), you will likely have to
perform splits over 2 cycles--then depending on address compares and access
density, you may (or may not) have to delay other memory refs to maintain
memory order.
>
> i honestly don't know, here.
>
> > I suggest that you consider saying
> > that users of boundary-crossing references should accept a potential
> > delay as the cost of software flexibility.
>
> i think... i think we're going to have to see how it goes, and (thank you
> for the heads-up) keep a close eye on that.
>
> one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
> count. full cache lines get top priority.

The "density" concept tells you 90%+ of what you want to know. fast and
simple.
>
> i can see, now that you mention it, that a large number of mis-aligned LD/STs
> will, due to using double the normal number of cache lines for a non-misaligned
> LD/ST, would clearly cause throughput slow-down.

What you are going to want to do is to get 2 cache lines in the L0 and
then have this L0 service as many requests as you can afford to build
multiplexers for.

lkcl

Mar 24, 2020, 5:47:22 PM

On Tuesday, March 24, 2020 at 6:18:07 PM UTC, Ivan Godard wrote:

>
> You are way beyond me.

sorry! it makes sense in my head :) however due to the nature of our project (educational as well as profitable) that's not good enough so i have to do better.

> Some questions:
>
> 1) have you allowed for two scalar references being not congruent but
> overlapping and both are boundary crossing? If they are both stores, for
> example, it would be unfortunate if the update effect was a1-b1-a2-b2
> instead of a1-a2-b1-b2.

a1 a2 would.. *click* you are right. in one version of the Address Conflict detection Matrix I had several binary address detection comparators, with the (address+16) being done *inside* the Matrix.

given that this required *three* comparisons (define a2 = a1+16, b2 = b1+16) to detect any clashes:

* a1 == b1, a1mask[0..15] & b1mask[0..15] nonzero

* a1 == b2, a1mask[0..7] & b1mask[16..23] nonzero

* vice-versa, a2 == b1

i considered this to be excessive (3x the number of binary comparators in an 8x8 Matrix is a *lot* of comparators), so i decided instead to split the ports.

however.. wait... no: back at the *Dependency Hazard* phase, the LDs and STs are *already* split into ordered groups, well before they get to the "Address Conflict" phase.

therefore it is guaranteed that the AC phase will *only* see batches of LDs or batches of STs, it will *never* see both.

so we are good. whew :)


> 2) Your L0 is our L1 victim buffer.

ah ok. same thing, different name.

> With 6 LS units you have a 12-way
> mux to reach the four ports;

sigh yes. i am wondering if we can do just an N-in, 4-out Mux onto an 8-entry L0, and it would only fill up entirely if 4 previous entries were not flushed to the L1 cache.

it makes for a bottleneck but reduces wires somewhat.

> we have a 6 way mux because of the upstream
> Drano; I agree that a 12-to-4 mux is unimportant. However, you also have
> 12 wiresets from the LS to the muxes, which in wire terms are remote. We
> would have six, so in effect you have doubled the operand width in order
> to support boundary crossing. Our hardware guys have in general been
> much more concerned with wire area than with logic; how do you weight
> these in your design?

i don't knoow, i'm not happy about it. we're talking full 128 bit cache lines, QTY 12 (or 16) being MUXed 4 ways into 8 sets of latches.

woo.

...use several layers? :)

>
> 3) I'm doubtful that the "whack-load issue" will remove all ordering
> issues.

oh it will generate a ton of them.

however Mitch explained that when you use those (biiig) Dependency Matrices in unary regnum and FunctionUnit addressing (rather than the more traditional binary numbering used in the CDC6600 patent on its Q-Tables), multi-issue is a simple matter of transitively (cascade) collecting register dependency hazards from previous instructions.

thus if you have ADD r1, r10, r20 and ADD r2, r11, r21 in the next instruction, you create a "fake" write dependency between r2 and r1, and fake read hazard deps between r10/11 and r20/21, in that same clock cycle.

this is enough to preserve the instruction order in an intriguing single-cycle way.

and in the case of our vector-is-actually-multiissue-scalar it should be trivially obvious how to generate the transitive cascade of dependencies inside the vector for-loop as it increments the register numbers.
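
(loosely, my understanding of that cascade, sketched in python with made-up names - unary register bitmaps per instruction, each one inheriting the hazards of the instructions ahead of it in the same cycle:)

    def cascade_same_cycle(instrs):
        # instrs: list of (write_bitmap, read_bitmap) in program order
        acc_w = acc_r = 0
        out = []
        for w, r in instrs:
            out.append((w | acc_w, r | acc_r))  # "fake" deps on earlier instrs
            acc_w |= w
            acc_r |= r
        return out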


> Yes, it clearly works within any single stream, and for multiple
> disjoint streams. And I can see that conjoint streams can have the
> ordering of overlapping whack-loads disambiguated with enough care.

there are several critical pre-components, each of which (thank god) is independent.

* generating a ton of scalar operations, such that we can use standard multi issue hardware, vectorisation "goes away"

* standard register RW Hazard detection, by using unary numbering, allows us to ensure register ordering is preserved

* the RAW WAR Memory Ordering Detection Matrix allows us to detect batches of LDs and separate them from batches of STs *before*...

* ... the Address Clash Detection Matrix gets to see the batch of (only) LDs or (only) STs, even after they've gone through splitting.


> But
> it seems to me that such care is not doable in practical hardware, at
> least with the fanout you propose. If one stream is double-at-a-time,
> one is word-at-a-time, and one is short-at-a-time, and they are on
> different byte alignments, then as they cross each other the conflict
> detector will correctly order D1-D2-D3-D4 and so for the other streams,
> but the multi-stream will not be d1-w1-s1-d2-w2-s2-etc, but will be
> interleaved as something like d1-w1-s1-s2-w2-w3-d3-etc.

ah, remember: by this point, with all the other guarantees provided by the various Matrices, we have a bunch of *bytes* (16 of them), with their associated cache byte read/write enable line, where we *know*, thanks to the AddressConflict Matrix, that there are no two LDST FUs in this batch trying to read/write to/from the same cache byte column.

also, LD batches were *already* separated from ST batches.

> Now you can get around this by having the subscript correspond to the
> iteration instead of the address. Then its unambiguous that
> d1-w1-s1-d2-w2-s2...

all these have been turned into unary bytemaps.

d1 = 0b1111 1111 0000 0000
w1 = 0b0000 0000 1111 0000
s1 = 0b0000 0000 0000 1100
w2 = 0b0000 1111 0000 0000
s2 = 0b0000 0000 0011 0000

therefore *back at the AddressConflict Matrix* (assuming those were all the same Addr[4:48] binary addresses) the ACM would *already* have prevented d1 and w2 (their LEN+Addr[0..3] encoded in this unary mask form) from progressing simultaneously to the L0 Cache/buffer (aka L1 victim buffer).

etc.

d1 s1 and w1 would not produce conflicts, therefore all 3 of those would be PERMITTED to proceed to the L0 phase (allowed simultaneous access to the Muxers).
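
(quick sanity-check of those masks in python:)

    d1, w1, s1, w2, s2 = 0xFF00, 0x00F0, 0x000C, 0x0F00, 0x0030
    assert d1 & w2                               # overlap: ACM holds one of d1/w2 back
    assert not (d1 & w1 or d1 & s1 or w1 & s1)   # d1, w1, s1 may all proceed together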

> But you have nothing in the ISA that tells you what
> an iteration is.

i know. i am so happy not to have to deal with that.

we are critically reliant however on this being an absolutely stonkingly high performance OoO engine.

i cannot imagine the hell that anyone would go through if they tried to stick to a traditional in-order multi-issue microarchitecture.


> Mitch has his loop-vec operation pair that gives him
> that info, but you can only pick it up from the addresses. Think of this
> loop:
> int a[N]; short B[N*2];
> for (int i = 0; i < N; ++i)
> sum += A[i] + B[2*i] + B[2*i+1];

this shouuuld result in near simultaneous B LDs ending up in the LDST buffers, where the lack of unary-encoded clashes in bits 0..3 of the address+LEN encoding as a cacheline column byte-enable line will result in them being merged into the same cacheline request in the same cycle.

however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1

> Is your hardware going to make one B stream, or two?

two at the FU level, merged through unary mask detection and clash avoidance onto the same cache line.

> How are you going
> to have the order in the whack-load be a1-b1-b2-a2-b3-b4...? and what if
> they overlap?

they are not permitted to.

> I admit I don't understand the conflict avoidence logic
> proposed, but I also don't see anything that suggests things like this
> are handled.

it is. however it is only by each Matrix taking care of guarantees further down the chain that it all hangs together.

if the MemHazard Matrix did not detect and separate out batches of LDs from batches of STs, in instruction order, we would be screwed at the AddressConflict detection phase.

if the AddressConflict stage did not detect bytemask overlaps and exclude them, we would be screwed at the L1 cache level due to multiple requests for R/W at the same cacheline column.

etc. etc.

l.

lkcl

Mar 24, 2020, 6:00:08 PM

On Tuesday, March 24, 2020 at 9:47:22 PM UTC, lkcl wrote:
> On Tuesday, March 24, 2020 at 6:18:07 PM UTC, Ivan Godard wrote:

> > With 6 LS units you have a 12-way
> > mux to reach the four ports;
>
> sigh yes. i am wondering if we can do just an N-in, 4-out Mux onto an 8-entry L0, and it would only fill up entirely if 4 previous entries were not flushed to the L1 cache.
>
> it makes for a bottleneck but reduces wires somewhat.

ohh gooOdddd i just thought of a horrible way to avoid those wires somewhat.

although you turn Addr[0..3] plus len=1,2,4,8 into a unary shifted bytemask back at the Address Conflict Matrices, you do *not* throw this down a massive 12x4-way Mux:

you put the *binary* address and *binary* len through the Mux phase and *recreate* the unary masks on the other side of the Mux.

yuck.

likewise the data Mux bus need only actually be 8 bytes wide, although it will be a full cache line wide at the L0 cachebuffer (aka L1 victim buffer in Mill terminology).

l.

lkcl

Mar 24, 2020, 6:27:08 PM

On Tuesday, March 24, 2020 at 7:15:33 PM UTC, MitchAlsup wrote:

> We used to do double ported 1R AND 1W SRAMs and were forced to 1R or 1W
> SRAMs by the technology. In 1µ technology it was not hard to get SRAMs
> to read and write in a single 16-gate cycle. But somewhere in the 0.18µ
> era SRAms became slower because of transistor leakage, bit-line capacitance,
> and body effects of <still> planar transistors. You can still get SRAMs that
> are fast enough to run a 20-gate machine cycle and hit them 2× per cycle,
> but not if the library SRAM makes YOU put the flip-flops at the edge of
> the SRAM and these flip-flops are timed on the std clock.

i will alert Staf so he can take a look at this, thank you for the insight.

> Caches, on the other hand can be 2 ported (or even 4 ported) using std SRAMs
> as long as you are not using 1 SRAM macro more than once per cycle. The
> rest is simply wiring the macros up however you deem appropriate.

i *believe* that by a nice coincidence due to the unary mask clash detection we will:

A. never have reads occurring at the same time as writes
B. never have a double access to the same cache line *or* column, in the same cycle.

this would seem to satisfy the criteria you describe?

> Note:: In order to improve the tag-TLB-hit logic speed, Athlon and Opteron
> gated the raw bit lines of the Tag and TLB into a analog XOR gate and then
> used sense amplifiers to decide if the bits matched.

holy cow!

> They did not actually
> bother to read out the translated bits from either the Tag array or the
> TLB array. This necessitates these two arrays be side-by-side.
> >
> > so that idea is shot.
> >
> > now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
> > 12-in (or 16-in) Priority Muxer to allow access to those 4 ports.
> >
> > 12-in because if we have 6 LD/ST Function Units, each will have its own
> > splitter thus doubling the FU ports in effect. 16-in if we have 8 LD/ST FUs.
>
> All of this split stuff resolves about gate (delay) 6 whereas the higher
> order AGEN bits do not resolve until at least gate 12-13 (64-bit addresses).

48. actually 48-11 because bits 5-11 are done back in the AddressConflict Matrix already, and i will (be attempting to) re-use those comparisons further down (in the L0 cache/buffer).

this might not turn out to be sane due to the amount of routing, as Ivan points out.


> So there is plenty (PLENTY) of time to do bank selection and split analysis
> while the rest of the AGEN bits are resolving. By the time you get to the
> L0 multiplexers, there should be no "priorities" and no "indecision" as to
> what this cycle is doing. (i.e., move it back 1/3rd cycle)

i did consider not having a Priority Picker cascade at all and instead, if there are 6 Function Units, just having a 12-entry LD/ST L0 cache/buffer.

however that means 12 48(64) bit ADDR comparators in the L0 cache/buffer and i am not happy about that.

8 is... tolerable.


> > depending on what the 4-simultaneous PriorityPicker happens to pick, we could
> > have *any* of those 6 be picked.
>
> I have priority pickers that can do a unary pick for 72 inputs in 4 gates of
> delay (without using more than 4-input NAND gates and 3-input NOR gates.

is that using the same PP design as in the Scoreboard Mechanics book chapters?

> That is very fast, and as wide as you could ever need. But for 1 from 4
> there is 1 gate of delay (from a flip flop where you have access to both
> polarities), 1 from 12 is 2 gates, 1 from 36 is 3 gates.

you have slightly lost me here. do you mean, a cascade of PriorityPickers?

1st one MASKs out the input, 2nd one adds another bit to the mask, 3rd likewise. result is that each successive PriorityPicker gets less and less bits to pick from.

perhaps this is the naive approach

> But these should be performed while the HoBs are resolving.

HoBs... higher order bits.

> >
> > now... should we do something such that mis-aligned LDs/STs are *reduced* in priority? given that we have this byte-mask "cache-line detection system", i'm not so sure that we should try that, because the more that is thrown at the L0 cache/buffer, the more opportunity there could be for "merging" the unary byte-masks into full cache lines.
>
> Just guessing: If you get a misaligned, you probably don't have the second
> line address (even though its only an INC away), you will likely have to
> perform splits over 2 cycles--then depending on address compares and access
> density, you may (or may not) have to delay other memory refs to maintain
> memory order.

by having the rules be Weak Memory Model not TSO, and by having the RAW WAR Mem Hazard Matrix detect and separate LDs from STs, my understanding is that even if split up into dozens of single-byte operations, actual order is irrelevant [for WMO and nonatomic accesses only].

the misaligned will definitely decrease throughput in the general mem access case, not a lot we can do about that.


> >
> > i honestly don't know, here.
> >
> > > I suggest that you consider saying
> > > that users of boundary-crossing references should accept a potential
> > > delay as the cost of software flexibility.
> >
> > i think... i think we're going to have to see how it goes, and (thank you
> > for the heads-up) keep a close eye on that.
> >
> > one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
> > count. full cache lines get top priority.
>
> The "density" concept tells you 90%+ of what you want to know. fast and
> simple.
> >
> > i can see, now that you mention it, that a large number of mis-aligned LD/STs
> > will, due to using double the normal number of cache lines for a non-misaligned
> > LD/ST, would clearly cause throughput slow-down.
>
> What you are going to want to do is to get 2 cache lines in the L0 and
> then have this L0 service as many requests as you can afford to build
> multiplexers for.

i think we can manage that and not scream in the middle of the night at the number of wires.

the more L0 entries we have, the more opportunities for unary bytemask merging.

unfortunately that also increases full addr comparison logic.

tradeoffs...

l.

MitchAlsup

Mar 24, 2020, 7:11:12 PM

Probably not: however the one you saw was a precursor and not much slower.
Realistically, if the picker is working from less than 12 inputs it won't
matter.

lkcl

Mar 25, 2020, 6:46:41 AM

On Tuesday, March 24, 2020 at 11:11:12 PM UTC, MitchAlsup wrote:
> On Tuesday, March 24, 2020 at 5:27:08 PM UTC-5, lkcl wrote:
> > On Tuesday, March 24, 2020 at 7:15:33 PM UTC, MitchAlsup wrote:
> > > I have priority pickers that can do a unary pick for 72 inputs in 4 gates of
> > > delay (without using more than 4-input NAND gates and 3-input NOR gates.
> >
> > is that using the same PP design as in the Scoreboard Mechanics book chapters?
>
> Probably not: however the one you saw was a precursor and not much slower.
> Realistically, if the picker is working from less than 12 inputs it won't
> matter.

hm ok i'm intrigued now.

i just deliberately created a 16-in 16-out PriorityPicker and ran it
through the yosys "synth" command just to see what kind of StdCells
it would create and whether it would be optimised / flattened at all.
looks like it was!
https://libre-riscv.org/3d_gpu/priority_picker_16_yosys.png

the 16-long chain of NOTs-and-AND-cascades seems to have been turned
into StdCells (ANDNOT, ORNOT, AND, NOT, OR) only 4 deep.


(for anyone else who may be going "err what on earth are they talking about??")

here is a pair of priority-pickers in the "precursor" form Mitch describes:
https://libre-riscv.org/3d_gpu/group_pick_rd_rel.jpg

and here's a cascade of them:
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/800x-multi_priority_picker.png

input is just "i"; each pp0-3 block is one of the priority-pickers.
however, inside each PP block i additionally OR the outputs together
so as to produce a single-bit output indicating that the PP is "active".

outputs are idxo_0-3 plus all four enable bits from each PP, which gives
an easy way to tell which PPs are "active".

a quick "flatten" followed by "synth" command in yosys shows that the
longest path (through Std Cell Libraries, not gates) is 8 long.

if i were to reorganise the masking to no longer be a chain but instead
to be accumulated (duplicated) it might get a little flatter.

i think we can get away with using this even for 16-in and 4-out?
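
(for reference, a plain-python approximation of the cascade behaviour -
not the actual nmigen code linked below:)

    def multi_priority_pick(req, n_out):
        # each stage picks the lowest remaining request bit, then masks it
        # out so the next picker in the chain cannot pick it again
        picks = []
        for _ in range(n_out):
            lowest = req & -req        # isolate lowest set bit (0 if none left)
            picks.append(lowest)
            req &= ~lowest
        return picks

    # multi_priority_pick(0b1011_0100, 4)
    #   -> [0b100, 0b10000, 0b100000, 0b10000000]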


links to the source code implementing those (in nmigen) can be found
by following the links here:
https://libre-riscv.org/3d_gpu/architecture/6600scoreboard/

l.

lkcl

Mar 25, 2020, 8:40:18 AM

On Tuesday, March 24, 2020 at 7:15:33 PM UTC, MitchAlsup wrote:
> On Tuesday, March 24, 2020 at 12:02:49 PM UTC-5, lkcl wrote:
> > ah no i wondered *if* double-ported (or quad-ported) L1 caches exist,
> > and from what Mitch was saying, not even 1R1W caches exist, they're
> > 1R-*or*-1W caches.
>
> We used to do double ported 1R AND 1W SRAMs and were forced to 1R or 1W
> SRAMs by the technology.

i heard back from Staf (very busy, thank you again if you read this).
he said that for this initial test SRAM Cell Library he's providing a
synchronous 1R-or-1W SRAM cell, which we can use to create dual, triple
etc. ported blocks (2R1W, 3R1W ...) by putting 1R-or-1W SRAMs together
in parallel.

in this first 180nm chip we just have to live with that and eat performance.

it does mean however we're not paying $NNN,NNN(,N?NN?) in NREs.
USD $600 per sq.mm - that's it. no mask charges. we're damn lucky,
and it was *only* possible for us to get that because the entire design
is Libre-Licensed.

l.

Ivan Godard

Mar 27, 2020, 10:32:50 AM

On 3/24/2020 2:47 PM, lkcl wrote:
> On Tuesday, March 24, 2020 at 6:18:07 PM UTC, Ivan Godard wrote:
>
>>
>> You are way beyond me.
>
> sorry! it makes sense in my head :) however due to the nature of our project (educational as well as profitable) that's not good enough so i have to do better.
>
>> Some questions:
>>
>> 1) have you allowed for two scalar references being not congruent but
>> overlapping and both are boundary crossing? If they are both stores, for
>> example, it would be unfortunate if the update effect was a1-b1-a2-b2
>> instead of a1-a2-b1-b2.
>
> a1 a2 would.. *click* you are right. in one version of the Address Conflict detection Matrix I had several binary address detection comparators, with the (address+16) being done *inside* the Matrix.
>
> given that this required *three* comparisons (define a2 = a1+16, b2 = b1+16) to detect any clashes:
>
> * a1 == b1, a1mask[0..15] & b1mask[0..15] nonzero
>
> * a1 == b2, a1 mask[0..7] & b1mask[16..23] nonzero
>
> * viceversa a2 == b1
>
> i considered this to be excessive (3x the number of binary comparators in an 8x8 Matrix is a *lot* of comparators) i decided instead to split the ports.
>
> however.. wait... no, the ordering of the LDSTs back at the *Dependency Hazard* phase, the LDs and STs are *already* split into ordered groups, well before they get to the "Address Conflict" phase.
>
> therefore it is guaranteed that the AC phase will *only* see batches of LDs or batches of STs, it will *never* see both.
>
> so we are good. whew :)

If you say so :-)

>> 2) Your L0 is our L1 victim buffer.
>
> ah ok. same thing, different name.

Well, a little different, because they get filled from the L1 banks as
well as from the FUs, and stores don't need line fetches.

>> With 6 LS units you have a 12-way
>> mux to reach the four ports;
>
> sigh yes. i am wondering if we can do just an N-in, 4-out Mux onto an 8-entry L0, and it would only fill up entirely if 4 previous entries were not flushed to the L1 cache.
>
> it makes for a bottleneck but reduces wires somewhat.
>
>> we have a 6 way mux because of the upstream
>> Drano; I agree that a 12-to-4 mux is unimportant. However, you also have
>> 12 wiresets from the LS to the muxes, which in wire terms are remote. We
>> would have six, so in effect you have doubled the operand width in order
>> to support boundary crossing. Our hardware guys have in general been
>> much more concerned with wire area than with logic; how do you weight
>> these in your design?
>
> i don't knoow, i'm not happy about it. we're talking full 128 bit cache lines, QTY 12 (or 16) being MUXed 4 ways into 8 sets of latches.
>
> woo.
>
> ...use several layers? :)

Money and power. And did I mention money?

>>
>> 3) I'm doubtful that the "whack-load issue" will remove all ordering
>> issues.
>
> oh it will generate a ton of them.
>
> however Mitch explained that when you use those (biiig) Dependency Matrices in unary regnum and FunctionUnit addressing (rather than the more traditional binary numbering used in the CDC6600 patent on its Q-Tables), multi-issue is a simple matter of transitively (cascade) collecting register dependency hazards from previous instructions.
>
> thus if you have ADD r1, r10, r20 and ADD r2, r11, r21 in the next instruction, you create a "fake" write dependency between r2 and r1, and fake read hazard deps between r10/11 and r20/21, in that same clock cycle.
>
> this is enough to preserve the instruction order in an intriguing single-cycle way.

Not sure about that. What about instructions that don't use registers
that you can chain? Or rather, use the same ones? Code like:
store r4, *r4;
store r4, *r4+1;
Here there is an order dependency between the two stores, but there's
only one register involved and no updates or value creation. I suppose
you can create a fake dependency r4->r4, but that could read in either
direction. And it could be followed by "store r4, r4+2;" too. How do you
resolve this without an order-tag? Or maybe the AGU output gets a fake
regnum, with memrefs cracked in decode into something like "rfake =
LEA(r4); store r4, *rfake; rfake1 = LEA(r4+1); store r4, *rfake1;"? That
should work, but chews up a lot of renamers.

It's easy to see how OOO works in the easy cases, but it's perversities
like this that make me a dedicated IO guy.

> and in the case of our vector-is-actually-multiissue-scalar it should be trivially obvious how to generate the transitive cascade of dependencies inside the vector for-loop as it increments the register numbers.
>
>
>> Yes, it clearly works within any single stream, and for multiple
>> disjoint streams. And I can see that conjoint streams can have the
>> ordering of overlapping whack-loads disambiguated with enough care.
>
> there are several critical pre-components, each of which (thank god) is independent.
>
> * generating a ton of scalar operations, such that we can use standard multi issue hardware, vectorisation "goes away"

You get no argument from me on that one, except for "standard". Standard
doesn't scale enough.

> * standard register RW Hazard detection, by using unary numbering, allows us to ensure register ordering is preserved
>
> * the RAW WAR Memory Ordering Detection Matrix allows us to detect batches of LDs and separate them from batches of STs *before*...
>
> * ... the Address Clash Detection Matrix gets to see the batch of (only) LDs or (only) STs, even after they've gone through splitting.
>

The example above was WAW ordering. What about that?

>> But
>> it seems to me that such care is not doable in practical hardware, at
>> least with the fanout you propose. If one stream is double-at-a-time,
>> one is word-at-a-time, and one is short-at-a-time, and they are on
>> different byte alignments, then as they cross each other the conflict
>> detector will correctly order D1-D2-D3-D4 and so for the other streams,
>> but the multi-stream will not be d1-w1-s1-d2-w2-s2-etc, but will be
>> interleaved as something like d1-w1-s1-s2-w2-w3-d3-etc.
>
> ah, remember: by this point, with all the other guarantees provided by the various Matrices, we have a bunch of *bytes* (16 of them), with their associated cache byte read/write enable line, where we *know*, thanks to the AddressConflict Matrix, that there are no two LDST FUs in this batch trying to read/write to/from the same cache byte column.
>
> also, LD batches were *already* separated from ST batches.
>
>> Now you can get around this by having the subscript correspond to the
>> iteration instead of the address. Then its unambiguous that
>> d1-w1-s1-d2-w2-s2...
>
> all these have been turned into unary bytemaps.
>
> d1 = 0b1111 1111 0000 0000
> w1 = 0b0000 0000 1111 0000
> s1 = 0b0000 0000 0000 1100
> w2 = 0b0000 1111 0000 0000
> s2 = 0b0000 0000 0011 0000
>
> therefore *back at the AddressConflict Matrix* (assuming those were all the same Addr[4:48] binary addresses) the ACM would *already* have prevented LEN+Addr[0..3] d1 and w2 when encoded in this unary mask form from progressing simultaneously to the L0 Cache/buffer (aka L1 victim buffer).
>
> etc.

Hmm. The picture helps; thank you. BTW, figuring out that d1/w1/s1 are
non-conflicting is polynomial, but figuring out which is the best of the
various intra-non-conflicting but inter-conflicting sets to let through
is the NP bin-packing problem. I'm guessing that you don't try, and just
use a maximal munch on the ordered request stream, which is probably
nlogn and would rarely cost you a cycle over optimal.

The byte conflict mask approach is actually how we do byte merge too,
except that our request stream is ordered by definition and your unary
mask is architecturally visible to us as the valid-bit on every byte in
the hierarchy short of DRAM.

Your explanation thus pushes my doubts to the Address Conflict Matrix.
Do I take it correctly that ACM detects at line granularity? One
observation of actual code is that spatial locality is at sub-line
granularity: a reference to a line is followed by another reference to
the same line with high probability. A conflict detect can split a
per-address-space stream of requests into several per-line streams of
requests; the cost is a set of associative comparators (fast) and a miss
detect and evict (slow).

Still, if I understand you (?) you cannot handle an intra-line sequence
that has more than one read-write or write-read transition, because you
have to accumulate a mask at each transition. W-W-W is OK, and R-R-R,
but not W-R-W or R-W-R; not in one cycle anyway. We can, with the
as-of-retire load semantics Mill has; we just reissue a read that got
clobbered. We also detect conflict at the byte(-range) level, so
conflict is very rare, whereas locality says that it will be frequent in
your line-granularity ACM.

It's interesting that Mill, starting from the semantics, and you/Mitch,
starting from the gates, have arrived at very similar designs in this area.

> d1 s1 and w1 would not produce conflicts, therefore all 3 of those would be PERMITTED to proceed to the L0 phase (allowed simultaneous access to the Muxers).
>
>> But you have nothing in the ISA that tells you what
>> an iteration is.
>
> i know. i am so happy not to have to deal with that.
>
> we are critically reliant however on this being an absolutely stonkingly high performance OoO engine.
>
> i cannot imagine the hell that anyone woukd go through if they tried to stick to a traditional in-order multi-issue microarchitecture.
>
>
>> Mitch has his loop-vec operation pair that gives him
>> that info, but you can only pick it up from the addresses. Think of this
>> loop:
>> int a[N]; short B[N*2];
>> for (int i = 0; i < N; ++i)
>> sum += A[i] + B[2*i] + B[2*i+1];
>
> this shouuuld result in near simultaneous B LDs ending up in the LDST buffers, where the lack of unary-encoded clashes in bits 0..3 of the address+LEN encoding as a cacheline column byte-enable line will result in them being merged into the same cacheline request in the same cycle.
>
> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1

And so you get away from SIMD. That will work fine for code working on
full-scalar objects. It's not going to work well for sub-scalar data.
You don't have the hardware capacity to use eight LS units to block
load/store a byte array, nor eight ALUs to add one to a byte. Or does
Mitch still plan on having GBOOO figure out SIMDization in hardware? We
gave up on that, and all ops work on SIMD too.

>> Is your hardware going to make one B stream, or two?
>
> two at the FU level, merged through unary mask detection and clash avoidance onto the same cache line.
>
>> How are you going
>> to have the order in the whack-load be a1-b1-b2-a2-b3-b4...? and what if
>> they overlap?
>
> they are not permitted to.
>
>> I admit I don't understand the conflict avoidence logic
>> proposed, but I also don't see anything that suggests things like this
>> are handled.
>
> it is. however it is only by each Matrix taking care of guarantees further down the chain that it all hangs together.
>
> if the MemHazard Matrix did not detect and separate out batches of LDs from batches of STs, in instruction order, we would be screwed at the AddressConflict detection phase.
>
> if the AddressConflict stage did not detect bytemask overlaps and exclude them, we would be screwed at the L1 cache level due to multiple requests for R/W at the same cacheline column.
>
> etc. etc.

Bon voyage :-)

Ivan Godard

Mar 27, 2020, 11:27:45 AM
You will deeply regret using Weak Memory Model.

MitchAlsup

Mar 27, 2020, 12:32:57 PM
Is there? I fail to see any difference between

ST R4,[R4]
ST R4,[R4+1]
and
ST R4,[R4+1]
ST R4,[R4]

In both cases, the same values get written to the same memory locations.
GBOoO yes
SIMD no
But VVM makes loops faster than scalar code can ever be.

Ivan Godard

Mar 27, 2020, 12:54:25 PM
On 3/27/2020 9:32 AM, MitchAlsup wrote:
> On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:

<snip>

>> Not sure about that. What about instructions that don't use registers
>> that you can chain? Or rather, use the same ones? Code like:
>> store r4, *r4;
>> store r4, *r4+1;
>> Here there is an order dependency between the two stores,
>
> Is there?: I fail to see any difference between
>
> ST R4,[R4]
> ST R4,[R4+1]
> and
> ST R4,[R4+1]
> St R4,[R4]
>
> In both cases, the same values get written to the same memory locations.
>

If r4 is 0x00010203, then the first example results in memory following
*r4 being the five bytes 0x0000010203, while the second produces
0x0001020303.

Perhaps my notation confused you. The "+1" is not scaled; it offsets the
store by one byte in memory.

WAW hazard is a thing.

Ivan Godard

Mar 27, 2020, 1:11:52 PM
On 3/27/2020 9:32 AM, MitchAlsup wrote:
> On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:

<snip>

>>> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
>>
>> And so you get away from SIMD. That will work fine for code working on
>> full-scalar objects. It's not goind to work well for sub-scalar data.
>> You don't have the hardware capacity to use eight LS units to block
>> load/store a byte array, nor eight ALUs to add one to a byte. Or does
>> Mitch still plan on having GBOOO figure out SIMDization in hardware? We
>> gave up on that, and all ops work on SIMD too.
>
> GBOoO yes
> SIMD no
> But VVM makes loops faster than scalar code can ever be.

Ah, no, I don't think you can claim that. When you come right down to
it, VVM is itself scalar. Loops are more than loads and stores, and the
iteration interval is bounded by the ratio of FU count and op demand;
you are not magically creating ALUs on the fly. So you are not going to
be faster than any other approach with the same bounds, software
pipelining for example.

What you *can* claim is to do it more efficiently than most: you avoid
decode costs in a clever way (we have to use an I$0, and that's going to
be more expensive, albeit more flexible and general than VVM). And you
can sharply down-scale in the memory paths by agglutinating the
requests. For code that is efficiency-bound in the memory paths you
*will* be faster, but not in the general case.

Mind you, I really like VVM. It's clearly (IANAHG) better than legacy.
But it's not better than everything possible in scalar.

Stephen Fuld

Mar 27, 2020, 1:32:02 PM
Minor quibble. Unless your ISA has a three input, one output increment
and conditional branch instruction, the LOOP instruction saves one cycle
per loop iteration.



> Mind you, I really like VMM. It's clearly (IANAHG) better than legacy.
> But it's not better than everything possible in scalar.


--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Ivan Godard

Mar 27, 2020, 1:48:36 PM
Yes, that's nice. We crack those using three ops and phasing to get one
iteration per cycle. But really, our two-op increment/compare gang is
just an encoding form of a 3-input, 2-output op that is split over
two slots so we can keep the slots 2->1 while still handling the oddball
ops.

MitchAlsup

Mar 27, 2020, 2:17:48 PM
On Friday, March 27, 2020 at 12:11:52 PM UTC-5, Ivan Godard wrote:
> On 3/27/2020 9:32 AM, MitchAlsup wrote:
> > On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:
>
> <snip>
>
> >>> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
> >>
> >> And so you get away from SIMD. That will work fine for code working on
> >> full-scalar objects. It's not goind to work well for sub-scalar data.
> >> You don't have the hardware capacity to use eight LS units to block
> >> load/store a byte array, nor eight ALUs to add one to a byte. Or does
> >> Mitch still plan on having GBOOO figure out SIMDization in hardware? We
> >> gave up on that, and all ops work on SIMD too.
> >
> > GBOoO yes
> > SIMD no
> > But VVM makes loops faster than scalar code can ever be.
>
> Ah, no, I don't think you can claim that. When you come right down to
> it, VVM is itself scalar. Loops are more than loads and stores, and the
> iteration interval is bounded by the ratio of FU count and op demand;
> you are not magically creating ALUs on the fly. So you are not going to
> be faster than any other approach with the same bounds, software
> pipelining for example.

Scalar code is FETCH-ISSUE bound and Latency bound.

Vectorized Loops are not FETCH-ISSUE bound (completely removing that
obstacle.)

Vectorized Loads in a Loop can determine if this load also satisfies
one or more additional load so the AGEN and read cache portions don't
have to be performed. Thus 1 load can satisfy 1-64 actual loads in
terms of bringing the data close enough so that the data can be obtained
through multiplexing.

Vectorized stores do similarly.

The ADD/CMP/BCND is 3 SCALAR instructions but 1 Vector instruction and
executes in 1 cycle instead of minimum 3.

Even a GBOoO machine with plenty of resources cannot be faster.

So the real question is how much calculation is required in the vector loop
versus how many calculation resources are available. And for a good class
of calculations, from NULL up to several, VVM is invariably faster than scalar
(staying within the same ISA).
>
> What you *can* claim is to do it more efficiently than most: you avoid
> decode costs in a clever way (we have to use an I$0, and that's going to
> be more expensive, albeit more flexible and general than VMM). And you
> can sharply down-scale in the memory paths by agglutinating the
> requests. For code that is efficiency-bound in the memory paths you
> *will* be faster, but not in the general case.
>
> Mind you, I really like VMM. It's clearly (IANAHG) better than legacy.
> But it's not better than everything possible in scalar.

I will grant you that distinction.

MitchAlsup

Mar 27, 2020, 2:21:05 PM
Saves 2 cycles.
The ADD and the CMP are performed simultaneously, with a 3-input adder doing the ADD and the CMP (in effect, a SUB) at the same time.

{ Ri = Ri + Rinc
cmp = Rmax - (Ri + Rinc)
BLE cmp, Loop } 1 cycle.

Ivan Godard

Mar 27, 2020, 2:58:08 PM
On 3/27/2020 11:17 AM, MitchAlsup wrote:
> On Friday, March 27, 2020 at 12:11:52 PM UTC-5, Ivan Godard wrote:
>> On 3/27/2020 9:32 AM, MitchAlsup wrote:
>>> On Friday, March 27, 2020 at 9:32:50 AM UTC-5, Ivan Godard wrote:
>>
>> <snip>
>>
>>>>> however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
>>>>
>>>> And so you get away from SIMD. That will work fine for code working on
>>>> full-scalar objects. It's not goind to work well for sub-scalar data.
>>>> You don't have the hardware capacity to use eight LS units to block
>>>> load/store a byte array, nor eight ALUs to add one to a byte. Or does
>>>> Mitch still plan on having GBOOO figure out SIMDization in hardware? We
>>>> gave up on that, and all ops work on SIMD too.
>>>
>>> GBOoO yes
>>> SIMD no
>>> But VVM makes loops faster than scalar code can ever be.
>>
>> Ah, no, I don't think you can claim that. When you come right down to
>> it, VVM is itself scalar. Loops are more than loads and stores, and the
>> iteration interval is bounded by the ratio of FU count and op demand;
>> you are not magically creating ALUs on the fly. So you are not going to
>> be faster than any other approach with the same bounds, software
>> pipelining for example.

Perhaps we have a terminological issue.

> Scalar code is FETCH-ISSUE bound and Latency bound.

In small loops (8 lines, including internal control flow if any) we are
not fetching at all, and some members are not decoding either. I think
FETCH-ISSUE is only a problem for OOO and then only if there's no trace
buffer. If you disagree then please explain where the bottleneck is.

Software pipelining removes all latency issues except load LLC misses. I
grant you the VVM acts like a smart prefetch, but smart prefetch does
not require VVM.

> Vectorized Loops are not FETCH-ISSUE bound (completely removing that
> obstacle.)
>
> Vectorized Loads in a Loop can determine if this load also satisfies
> one or more additional load so the AGEN and read cache portions don't
> have to be performed. Thus 1 load can satisfy 1-64 actual loads in
> terms of bringing the data close enough so that the data can be obtained
> through multiplexing.

Yes; I granted you the efficiency, it's the necessarily superior speed I
don't buy.

> Vectorized stores do similarly.
>
> The ADD/CMP/BCND is 3 SCALAR instructions but 1 Vector instruction and
> executes in 1 cycle instead of minimum 3.
>
> Even a GBOoO machine with plenty of resources cannot be faster.

Funny, an LBIO Mill does the ADD/CMP/BCND equivalent in one cycle too. You're
good, but you are not unequaled :-)

> So the real question is how much calculation is required in the vector loop
> versus how many calculation resources are available. And for a good class
> of calculations from NULL to several VVM is invariable faster than scalar
> (staying within the same ISA).

It's invariably faster than legacy, but that's a very low bar. It's not
invariably faster than possible.

Sorry; remember, I do like it.


MitchAlsup

Mar 27, 2020, 3:21:30 PM
The width of Issue in the machine in my head.
>
> Software pipelining removes all latency issues except load LLC misses. I
> grant you the VVM acts like a smart prefetch, but smart prefetch does
> not require VVM.
>
> > Vectorized Loops are not FETCH-ISSUE bound (completely removing that
> > obstacle.)
> >
> > Vectorized Loads in a Loop can determine if this load also satisfies
> > one or more additional load so the AGEN and read cache portions don't
> > have to be performed. Thus 1 load can satisfy 1-64 actual loads in
> > terms of bringing the data close enough so that the data can be obtained
> > through multiplexing.
>
> Yes; I granted you the efficiency, it's the necessarily superior speed I
> don't buy.

What, doing 2-8-64 loops in a single clock on a 1-wide machine is not
necessarily faster!?!?! What planet did you come from?
>
> > Vectorized stores do similarly.
> >
> > The ADD/CMP/BCND is 3 SCALAR instructions but 1 Vector instruction and
> > executes in 1 cycle instead of minimum 3.
> >
> > Even a GBOoO machine with plenty of resources cannot be faster.
>
> Funny, a LBIO Mill does ADD/CMP/BCND equivalent in one cycle too. You're
> good, but you are not unequaled :-)

MILL is far from LB; I will grant you the IO part.

Ivan Godard

Mar 27, 2020, 7:43:57 PM
I don't understand you. You must have a meaning for "loop", "1-wide", or
"clock" that doesn't match mine.

In:
for (int i = 0; i < N; ++i)
A[i]++;
on a machine with one ALU, the iteration interval is one clock. It is
not possible to get 2, or 8, or 64 iterations per clock without 2, 8, or
64 ALUs to increment the A[i] elements.

Yes, you can do the loads and stores and the index recurrence of that
iteration in parallel in that cycle; so can other architectures.

Now please explain what you mean that is different from what I thought I
heard.

lkcl

Mar 28, 2020, 2:01:24 AM
On Friday, March 27, 2020 at 2:32:50 PM UTC, Ivan Godard wrote:
> On 3/24/2020 2:47 PM, lkcl wrote:

> > so we are good. whew :)
>
> If you say so :-)

(i'd better be...)

> >> 2) Your L0 is our L1 victim buffer.
> >
> > ah ok. same thing, different name.
>
> Well, a little different, because they get filled from the L1 banks as
> well as from the FUs, and stores don't need line fetches.

hum, hum... this because it's also being filled by SMP coherence requests that come from other cores etc?

> > ...use several layers? :)
>
> Money and power. And did I mention money?

this is why we are using coriolis2 (and NLNet)

> > this is enough to preserve the instruction order in an intriguing single-cycle way.
>
> Not sure about that. What about instructions that don't use registers
> that you can chain? Or rather, use the same ones? Code like:
> store r4, *r4;
> store r4, *r4+1;
> Here there is an order dependency between the two stores, but there's
> only one register involved and no updates or value creation.

reg read write dependency hazards - not memory hazards - would be detected and stop them from overlapping.

i do have a scheme for detecting wasted writes (i called it "nameless registers") however we are running out of time to implement optimisations.

so aside from that might be an IO store that you want to genuinely do...

wait... it's a ST. put contents of r4 into location pointed to by r4. and r4+1

yes of course those can be *issued* in parallel. they are reg read dependencies.

so there are 2 cases. first is if they are say ST.byte then there are no memory overlaps, all hunky dory, *r4 and *r4+1 fall (maybe) into the same 16 byte cache line, get done in 1 go, otherwise they get done across odd even cache lines (we plan to have that dual banked L1 using bit 4 of addr to select them)

if however they are ST.dwords then the AddressConflict Matcher kicks in.

the AddressMatcher notices two things:

1. that the top bits 4 to 11 are identical, AND
2. the bytemasks created from the 8-byte len shifted by address bits [0..3] have an overlap.

therefore the two will NOT be permitted to proceed simultaneously and MUST be done independently.

unlike the reg dependency checking (done before ACM phase) this part does *not* have anything to do with register numbers themselves.

although there is opportunity for detection and merging of the overlaps we will probably leave it for later.
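
(in rough plain-python terms - the field positions are illustrative
assumptions on my part, not the actual nmigen - the two checks look
something like this:)

def bytemask(addr, ln):
    # unary byte-enable mask: ln bytes, shifted by the LSBs addr[0..3].
    # a misaligned access spills past bit 15; the real design splits that
    # spill onto the next cache line.
    return ((1 << ln) - 1) << (addr & 0xF)

def conflicts(addr_a, len_a, addr_b, len_b):
    same_row = ((addr_a >> 4) & 0xFF) == ((addr_b >> 4) & 0xFF)  # bits 4-11 identical
    overlap = (bytemask(addr_a, len_a) & bytemask(addr_b, len_b)) != 0
    return same_row and overlap

# the ST.dword example above: *r4 and *r4+1, with r4 = 0x1008
print(conflicts(0x1008, 8, 0x1009, 8))   # True: must not proceed simultaneously
print(conflicts(0x1008, 1, 0x1009, 1))   # False: the ST.byte case does not overlap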


> I suppose
> you can create a fake dependency r4->r4, but that could read in either
> direction. And it could be followed by "store r4, r4+2;" too.

yes.

> How do you
> resolve this without an order-tag?

i did not mention, i also added a FU-FU LDST order preserving matrix, it has been a while (code written last year), working in conjunction with AddressConflict Hazard detection.

so if AC is noted, then the FUFU matrix does not let the operation proceed until that hazard is dropped. in this case that hazard only gets dropped when the *r4 st gets notification back that it has completed.

*then* because that operation no longer exists in the Matrices, the address conflict bytemask overlap no longer exists, and the *r4+1 is free and clear to proceed.


> Or maybe the AGU output gets a fake
> regnum, with memrefs cracked in decode into someting like "rfake =
> LEA(r4); store r4, *rfake; rfake1 = LEA(r4+1); store r4, *rfake1;"? That
> should work, but chews up a lot of renamers.

are we getting confused about LD and ST here? no, don't think so...

the LDST Comp Units have their own ALUs. just adders. the addresses get stored in latches within those two LDSTALUs.

so no need in my mind for rfake or rfake1 although conceptually they do exist, as anonymous, unnamed latches somewhere in the AGEN ALU.

these AGENs also get held in the AC matrices (perhaps unnecessarily) i will investigate that during an optimisation phase.


> It's easy to see how OOO works in the easy cases, but it's perversities
> like this that make me a dedicated IO guy.

whereas for me, in-order is just bollocks. every solution to every problem is: stall.

all efforts and i mean all to workaround that start with the word "but" and go downhill from there.

but you can't do variable pipelines.

but you can't do pipeline bypasses.

but interrupts have to be masked.

but you must destroy interrupt latency.





> > * generating a ton of scalar operations, such that we can use standard multi issue hardware, vectorisation "goes away"
>
> You get no argument from me on that one, except for "standard". Standard
> doesn't scale enough.

yehh i also didn't mention, we will be doing a SIMD ALU backend. the plan is to detect some sequential operations such as element striding etc. and fuse them into ALU operations of the width of the SIMD unit.

even FP16 element strided LDs/STs can be fused into 64 bit LD/STs because you are not supposed to touch the actual bits (no NaN conversion etc)

however yes, at some point we hit a scale wall long before the 64-entry VLEN theoretical maximum. hypothetically we could extend to 8-way multi-issue, however the internal bus architecture - maybe 6,000 data wires wide? - has me slightly freaked out.

i mean, we can _do_ it... :)


> The example above was WAW ordering. What about that?

yep back at the reg DM phase. hang on are you sure you are not confusing LD and ST?

oh you mean *memory* WaW not reg WaW.

yes, we are not ignoring it, just separating them and not doing any optimisation (yet)


> > all these have been turned into unary bytemaps.
> >
> > d1 = 0b1111 1111 0000 0000
> > w1 = 0b0000 0000 1111 0000
> > s1 = 0b0000 0000 0000 1100
> > w2 = 0b0000 1111 0000 0000
> > s2 = 0b0000 0000 0011 0000
> >
> > therefore *back at the AddressConflict Matrix* (assuming those were all the same Addr[4:48] binary addresses) the ACM would *already* have prevented LEN+Addr[0..3] d1 and w2 when encoded in this unary mask form from progressing simultaneously to the L0 Cache/buffer (aka L1 victim buffer).
> >
> > etc.
>
> Hmm. The picture helps; thank you. BTW, figuring out that d1/w1/s1 are
> non-conflicting is polynomial, but figuring out which is the best of the
> various intra-non-conflicting but inter-conflicting sets to let through
> is the NP bin-packing problem. I'm guessing that you don't try,

correct. we reviewed queues, rejected them, and decided to just select the first arbitrary ADDR and use that one as a filter, to check and find any opportunities for simultaneous cache-line reads/writes.

this turns O(N^2) into two O(N) operations: one that finds the first free buffer entry, one that does binary address comparisons with all other pending operations.

matches have their bytemasks merged.
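
(roughly, in plain python - the names and the entry format are made up
purely for illustration:)

def pick_and_merge(entries):
    # entries: list of (row_addr, bytemask), already guaranteed non-overlapping.
    # pick the first entry, then merge every other entry on the same row.
    row, mask = entries[0]                        # O(N): find first valid entry
    merged = [0]
    for i, (r, m) in enumerate(entries[1:], 1):   # O(N): binary addr compare vs the rest
        if r == row:
            mask |= m                             # OR the bytemasks together
            merged.append(i)
    return row, mask, merged                      # one cache-line access covers them all

entries = [(0x40, 0x00FF), (0x40, 0xF000), (0x50, 0x000F), (0x40, 0x0F00)]
print(pick_and_merge(entries))   # merges entries 0, 1 and 3 onto row 0x40, mask 0xffff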

> and just
> use a maximal munch on the ordered request stream, which is probably
> nlogn and would rarely cost you a cycle over optimal.

we really, *really* want simple and straightforward algorithms for a first cut.

i would be interested to hear more about this idea "maximal munch" some time. after the first tapeout.

>
> The byte conflict mask approach is actually how we do byte merge too,

cool.

> except that our request stream is ordered by definition and your unary
> mask is architecturally visible to us as the valid-bit on every byte in
> the hierarchy short of DRAM.

yes, likewise i would expect the bytemask to propagate down through the L1 and L2 caches, and even to the DDR1 SDRAM we are going to use in the first 180nm ASIC (because it's on opencores) - doesn't that have byte-select lines?


> Your explanation thus pushes my doubts to the Address Conflict Matrix.
> Do I take it correctly that ACM detects at line granularity?

mitch's original uses bits 5 thru 11 to give "overopportunistic" matching.

by grabbing *more* than is actually conflicting, although opportunities for optimisation are missed, at least no conflicts occur; and the algorithm turned out, mitch reported, not to significantly impact real-world performance.

yes we are going with cache line BINARY comparison bits FOUR thru 11 on address AND unary mask overlap as the "conflict detect trigger".


> One
> observation of actual code is that spatial locality is at sub-line
> granularity: a reference to a line is followed by another reference to
> the same line with high probability.

yyep. hence the double detect. binary cache line, unary bytemask.

> A conflict detect can split a
> per-address-space stream of requests into several per-line streams of
> requests; the cost is a set of associative comparators (fast) and a miss
> detect and evict (slow).

nnope. that's the beauty of Mitch's scheme. you only need a Pascal Triangle of 5 thru 11 binary address comparators

no need to check upper bits, no need for set associative tags.

let us say that you do indeed get two VM pages with the same bits 12 and up of the address.

o dear the bits 5 thru 11 are the same, this is accidentally included as a "conflict", whoopsie, one of them has to wait, so what?

this "hit" on performance, in real world scenarios, is not as bad as it seems.

> Still, if I understand you (?) you cannot handle an intra-line sequence
> that has more than one read-write or write-read transition, because you
> have to accumulate a mask at each transition. W-W-W is OK, and R-R-R,
> but not W-R-W or R-W-R; not in one cycle anyway.

err... maayybee? :) he said.

aw god i think we might.

no, hang on. LDs get batched and hold up STs. STs get batched and hold up LDs. that's done right back at the Memory Hazard RaW WaR Matrix, long before addresses are analysed.

so yes, W R W would get turned into
W
pause and flush L0
R
pause and flush L0
W

so it's not that it's not handled, it's avoided entirely.

> We can, with the
> as-of-retire load semantics Mill has; we just reissue a read that got
> klobbered. We also detect conflict at the byte(-range) level, co
> conflict is very rare, whereas locality says that it will be frequent in
> your line-granularity AGM.
>
> It's interesting that Mill, starting from the semantics, and you/Mitch,
> starting from the gates, have arrived at very similar designs in this area.

:) hey i am just running with ideas that make sense and trusting they'll work out

> > however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
>
> And so you get away from SIMD. That will work fine for code working on
> full-scalar objects. It's not goind to work well for sub-scalar data.
> You don't have the hardware capacity to use eight LS units to block
> load/store a byte array, nor eight ALUs to add one to a byte.

to keep the discussion sane i left out the bit about pre-batching back at the VLEN Loop phase, because we do have a 64-bit-wide dynamically-partitionable SIMD ALU.

yes really: gates open up and turn add, shift, etc. - even MUL - into 8x8 or 4x16 or 2x32 or 1x64. hypothetically we can do mixed layouts like 8/16/8/32 but i reaaaally don't want to go there.
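
(the partitioning trick, modelled very roughly in plain python - this is
my sketch, not the actual nmigen, and the real ALU partitions far more
than just the adder: a gate at each byte boundary either propagates the
carry, for wide elements, or blocks it, for narrow ones.)

def partitioned_add(a, b, part_mask):
    # 64-bit add built from 8 byte-wide adds; part_mask bit i lets the carry
    # propagate from byte i into byte i+1.  0b1111111 = one 64-bit add,
    # 0b1010101 = four 16-bit adds, 0b0000000 = eight independent 8-bit adds.
    result = 0
    carry = 0
    for i in range(8):
        s = ((a >> (8 * i)) & 0xFF) + ((b >> (8 * i)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * i)
        carry = (s >> 8) & ((part_mask >> i) & 1)   # carry gated at the partition point
    return result

a, b = 0x0000_0000_FFFF_FFFF, 1
print(hex(partitioned_add(a, b, 0b1111111)))   # 1x64: 0x100000000
print(hex(partitioned_add(a, b, 0b1110111)))   # 2x32: 0x0 (low word wraps)
print(hex(partitioned_add(a, b, 0b0000000)))   # 8x8:  0xffffff00 (only low byte wraps)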

thus by the time we get to that "scalar" part, actually what we will mostly be hitting the ALUs (and LDST) with is 64 bit operations.

obviously, nonstrided LDSTs not so much.

> Or does
> Mitch still plan on having GBOOO figure out SIMDization in hardware?

can't speak for Mitch, he can take care of himself :)

> We
> gave up on that, and all ops work on SIMD too.

the only reason LibreSOC can even remotely consider it is because of the union typecasting of the regfile to uint8 uint16 uint32 uint64 arrays that is part of the SV Specification.

> Bon voyage :-)

:)

lkcl

Mar 28, 2020, 10:56:35 AM
https://libre-riscv.org/3d_gpu/ld_st_comp_unit.png
(note to Ivan)

see ALU middle-left (ish), the ADD thing, which has latches (rectangles with funny triangles), on both the input and output. this is the AGEN phase, clocked by way of SR-Latches Issue, then GoRead, then GoAddr.

so we don't need separate "registers" (virtual or otherwise) for LD/ST operations, because these (unnamed) latches already capture the (computed)
address.

*that* AGEN'd address - not the register number nor the contents *of* the
register - is what is passed through to the AddressConflict Matrices.

l.

MitchAlsup

Mar 28, 2020, 1:57:14 PM
Depending on the type of A[], one LD can determine if it got 1 unit of data,
8 units or 64 units. So, the AGEN only has to be performed on a DIV 2**k
rate. Once performed, knowing the access pattern is dense, we can access
A[i] from the cache line buffer as many times per cycle as we choose to
build read ports into the cache line buffer.

Given 1 ALU, the A[*]++ can be performed once per cycle.

Let us postulate that A is Double Precision, every 8 loop iterations,
1 AGEN is performed, and 1 cache line buffer pushed back into the
memory hierarchy.

In a pure scalar ISA, there will be 6 instructions in the loop. Thus, we
can execute (the equivalent of) 6 scalar instructions every cycle in a
1-wide machine (ala Mc88100 µArchitecture).
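
(to make the arithmetic concrete - a toy python model, 64-byte lines and
unit-stride doubles assumed, not anything resembling real hardware:)

N = 32
agens = 0
line = None                       # cache line currently held in the line buffer
for i in range(N):                # for (i = 0; i < N; i++) A[i]++;  A is double[]
    addr = 0x1000 + 8 * i
    if line != addr >> 6:         # dense access: AGEN + line movement only on a new line
        agens += 1
        line = addr >> 6
print(agens, "AGENs for", N, "iterations")   # 4 AGENs: one per 8 iterations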

MitchAlsup

Mar 28, 2020, 1:59:15 PM
On Saturday, March 28, 2020 at 9:56:35 AM UTC-5, lkcl wrote:
> https://libre-riscv.org/3d_gpu/ld_st_comp_unit.png
> (note to Ivan)

People should look at this and recognize, this is pipelining with
L A T C H es (not flip-flops).

Ivan Godard

Mar 28, 2020, 2:51:19 PM
OK, all that I got; the problem is terminological.

May I suggest that you, in your public writing, say only that you get
more "operations per cycle" or "instructions per cycle", not that you
get more "iterations per cycle"? The latter phrasing is confusing to
those of us who have a well-defined meaning for "iteration".

Ivan Godard

Mar 28, 2020, 3:04:04 PM
On 3/28/2020 10:59 AM, MitchAlsup wrote:
> On Saturday, March 28, 2020 at 9:56:35 AM UTC-5, lkcl wrote:
>> https://libre-riscv.org/3d_gpu/ld_st_comp_unit.png
>> (note to Ivan)
>
> People should look at this and recognize, this is pipelining with
> L A T C H es (not flip-flops).

Beg your pardon, but for some of us this is programming with hieroglyphics.

MitchAlsup

Mar 28, 2020, 3:52:39 PM
I can imagine a "bigger" machine that indeed has more ALUs and other
resources that can actually perform 2,4,8 iterations per cycle. The
machine I can do all by myself does not have these kinds of resources.

The term "iteration" is a software one, not necessarily a hardware one.
AND I am trying to use it within is SW understood nomenclature. Straight
line code gets "executed" while loops get (their straight line code)
"iterated".

But, to a certain extent, that is why we are playing this out in this
forum, to explore a wider field and to get terminology that a majority
of us can both agree on and its meanings; reducing the surprise coefficient
for later readers.

MitchAlsup

Mar 28, 2020, 3:59:55 PM
A latch is a unit of storage that has both a transparent state (output =
input, with essentially no delay) and a captured state (output = output)
which is held as long as someone wants it held.

while( Clock is HIGH )
output = input;

A flip-flop is a unit of storage that has no transparent state and operates
as:

when( Clock Goes HIGH )
output = input;

And transmits the input state to the output state during the rising edge
of the "clock".

A Latch is about 1/3rd the size of a flip-flop and 1/2 the delay.

The CDC6600 was a latch machine.

One can guarantee that a latch based pipeline has no <forward> race
conditions when there are at least 3 latches in the pipeline and no
more than 2 are ever both transparent at the same time.

There is no software version of a latch, except as pertains to I/O
device registers which may change between successive reads.
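
(as a software analogy only - a toy python model of the two behaviours
above, for anyone happier reading code than timing rules:)

class Latch:
    def __init__(self): self.q = 0
    def sample(self, clk, d):
        if clk:                    # transparent while the clock is HIGH
            self.q = d
        return self.q              # otherwise hold the captured value

class FlipFlop:
    def __init__(self): self.q = 0; self.prev = 0
    def sample(self, clk, d):
        if clk and not self.prev:  # capture only on the rising edge
            self.q = d
        self.prev = clk
        return self.q

clks = [(0, 1), (1, 2), (1, 3), (0, 4), (1, 5)]       # (clock, data) pairs
l, f = Latch(), FlipFlop()
print([(l.sample(c, d), f.sample(c, d)) for c, d in clks])
# [(0, 0), (2, 2), (3, 2), (3, 2), (5, 5)] - the latch follows the data while
# the clock stays HIGH; the flip-flop only changes on the rising edge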

Stephen Fuld

Mar 28, 2020, 5:02:08 PM
Is this a fair and accurate summary?

An IO CPU with VVM can perform an iteration in something over 2
instructions. (one for the add, one for the loop, plus occasionally
additional cycles for the loads or stores when the internal buffer runs
out)

An IO CPU without VVM would require something like 5-6 cycles per
iteration, a load, an add, a store, and 2-3 instructions for the compare
and conditional branch at the end.

An OoO CPU, without VVM, but with two load/store functional units, one
integer ALU and one branch unit would take about 1-2 cycle per
iteration, as the each of the operations could be overlapped with all
the others, except perhaps part of the loop termination.

MitchAlsup

Mar 28, 2020, 6:35:41 PM
There is no reason the ADD cannot be performed in parallel with the
LOOP (ADD/CMP/BLE); all one needs is an ADDer and a LOOPer (3-input
integer adder). Thus, I see 1 cycle per loop iteration, as the vast
majority of everything in the next loop is dependent on the loop
Iterator getting incremented.
>
> An IO CPU without VVM would require something like 5-6 cycles per
> iteration, a load, an add, a store, and 2-3 instructions for the compare
> and conditional branch at the end.

As I stated. It is 6 in My 66000 ISA.
>
> An OoO CPU, without VVM, but with two load/store functional units, one
> integer ALU and one branch unit would take about 1-2 cycle per
> iteration, as the each of the operations could be overlapped with all
> the others, except perhaps part of the loop termination.

This depends on the kind of thing A[*] is. If it is integer, then the
integer adder would be used for the ADD of A[i] and for the Add of loop
index, and possibly the CMP of the loop iteration. Thus, it depends
on how many resources are provided.

On the other hand, if A[*] is floating point, then subtract 1 from the above, but be in a position to eat 3-4 more cycles of latency (even if
that latency is fully pipelined.)

Stephen Fuld

Mar 28, 2020, 7:10:01 PM
OK. I took "in order" to mean that the store must at least start before
the loop end instruction could start. Is that not a correct definition?

Ivan Godard

Mar 28, 2020, 7:23:41 PM
Qualified: *some* IO would if they have inadequate compilers. With a
decent compiler the IO will do all of it in one cycle using software
pipelining.

MitchAlsup

Mar 28, 2020, 8:07:57 PM
How very von Neumann of you.

Basically, In Order means I do not have a Scoreboard or Reservation
Station instruction dispatch mechanism; although at this point I must
remain fluid in my nomenclature because some kind of instruction queuing
seems to be mandatory if VVM does anything more than just shutting down
IFetch.

Certainly the goal is to be able to perform the semantic content of
every instruction in the Loop in as few cycles as possible. By recognizing
those situations where the work of one instruction provides a lot of
support for some future instruction, those (2**k-1) instructions do
not actually have to consume many of the resources one would initially
think.

MitchAlsup

Mar 28, 2020, 8:09:16 PM
One could also run out of registers and be unable to fully software pipeline
a given loop.

Ivan Godard

Mar 29, 2020, 12:11:48 AM
Both software and hardware pipelining have unbounded architectural
parallelism - until they run into a provisioning boundary. Which of the
many provisioning limits they hit depends on the test code and the
provision, but that reflects marketing decision, not the choice of
architecture. There's no inherent advantage to either IO or OOO; it's
all a bang-for-the-buck over the intended customer use-cases.

Terje Mathisen

Mar 29, 2020, 8:27:49 AM
>>>> 1-wide machine (ala Mc88100 µArchitecture).
With good programming, the totally in-order Pentium (one regular and one
very limited pipe) could run my word count algorithm at a real/measured
1.5 cycles per character, or 3 cycles per pair of input bytes.

Load a pair of input bytes, lookup the classifications for both chars,
combine this with the previous pair classification and use the
combination to lookup the word & line increments and add them to the
running sum.

Yeah, that does sound like more than 3 cycles... :-)
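
(in rough plain-python form - table contents are only illustrative, and
the byte-pairing plus the actual Pentium scheduling are left out - the
inner loop is essentially:)

import string

cls = [1] * 256                      # 0 = separator, 1 = word char, 2 = newline
for c in string.whitespace:
    cls[ord(c)] = 0
cls[ord('\n')] = 2

inc = [[(0, 0)] * 3 for _ in range(3)]        # inc[prev][cur] = (words+, lines+)
inc[0][1] = inc[2][1] = (1, 0)                # separator/newline -> word char: new word
inc[0][2] = inc[1][2] = inc[2][2] = (0, 1)    # anything -> newline: new line

def wc(data, prev=0):
    words = lines = 0
    for b in data:                   # one classification + one increment lookup per char
        w, l = inc[prev][cls[b]]
        words += w
        lines += l
        prev = cls[b]
    return words, lines

print(wc(b"two words\nand a second line\n"))   # (6, 2)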
>
> An OoO CPU, without VVM, but with two load/store functional units, one
> integer ALU and one branch unit would take about 1-2 cycle per
> iteration, as the each of the operations could be overlapped with all
> the others, except perhaps part of the loop termination.

With enough OoO/hardware unrolling these algorithms can all run in a
single cycle since there are no iteration-to-iteration dependency chains.

In fact, for either integer or fp A[] arrays to be processed, we can
relatively easily get to 2/4/8 iterations/cycle since it can be
trivially SIMD vectorized.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Ivan Godard

Mar 29, 2020, 9:06:45 AM
SIMDization is never trivial except in toys that define away the
complications. One reason I like Mitch's vector is because it recognizes
those problems and avoids them.

For example - what if A[13] is SNAN?

MitchAlsup

Mar 29, 2020, 11:48:21 AM
What if::

sizeof(A)/sizeof(A[i]) = 3

A lot of boundary effects for SIMD that are not present in VVM.

Terje Mathisen

Mar 29, 2020, 12:45:16 PM
If we have a single array of length 3, then the code cannot be time
critical, right?

If we have a large number of such 3-wide chunks then we can still employ
4-wide SIMD, but now with gated stores of only the first three elements.

Please don't take this the wrong way: I believe your VVM feature is
really, _really_ nice, specifically because it is so very general.

For almost all specific problems, it is possible to do as well or better
with hand-optimized code for each of them, even though this is a
"solution" which doesn't scale at all.

Ivan Godard

Mar 29, 2020, 4:23:15 PM
I vote we release "Terje V.2, Now Fully Scalable!". But what license to use?

MitchAlsup

Mar 29, 2020, 5:46:11 PM
On Sunday, March 29, 2020 at 11:45:16 AM UTC-5, Terje Mathisen wrote:

<snip>

> If we have a single array of length 3, then the code cannot be time
> critical, right?

I W I L L give you that::

However at iterations=3 VVM already has a speed advantage over Scalar,
almost eliminating any reason not to use VVM on loops where it (at first
glance) seems appropriate.
>
> If we have a large number of such 3-wide chunks then we can still employ
> 4-wide SIMD, but now with gated stores of only the first three elements.

Most likely eating 4/3rds the power using all 4 SIMD lanes.

Terje Mathisen

Mar 30, 2020, 3:50:45 AM
That is somewhat disingenuous, in that you, Mitch, have taught us how
much more power it takes to get data from RAM into the cpu compared to
doing quite a few FP operations on said data.

Even in your 3-wide example, said blocks of 3 elements will be embedded
in 64-byte cache lines, so the actual loads from RAM will be by line,
whether the engine is SIMD or VVM, right?

Doing 33% extra fp ops is probably down in the noise, power-wise.

lkcl

Apr 19, 2020, 5:22:03 PM
https://libre-soc.org/3d_gpu/mem_l0_to_l1_bridge_v2.png

reopening an older discussion, here, because of some minor design changes.

recall that (with thanks to input from Mitch) we have an address-match triangular matrix that catches (over-zealously) LD/STs that could overlap, by comparing addr[4:11]. however by excluding addr[0..3] this misses some opportunities to merge LDSTs at the cache-line level.

recall that we augmented and improved this by turning addr[0:3] plus the LDSTlen into a bitmask, and performing a simple mask-wise AND in order to fully determine, even down to the byte level, which LD/STs do not overlap.

thus by the time we want to maximise the LD/ST Cache line usage, by merging as many LD/STs onto the L1 as possible, we *already know* that the requests in the LDST Buffer do *not* overlap, right down to the byte level.

thus we only need to identify those requests that have the same complete address, and that means comparing bits 12:48 (or whatever).

here's the problem: that's 36 XOR gates times 8 (if the L0 cache/buffer is 16 entries).

i consider that to be excessive / cause for concern, however that simply may be inexperience.

firstly: is this a lot? do we need to be concerned about power consumption here?

second, is there anything here that can be done to reduce that?

one potential idea is to carry metadata from the LDST Issue Unit, indicating whether a batch of individual LDSTs were part of a sequential Vector LD/ST. in the case where the FUs were sequential, and it was known that the top address bits 12:48 did not change, we could use a "tag" to indicate which addresses were the same, thus skipping quite a few of those 12-48 bit comparisons.

another idea was to try to use L1 Cache Tags. however, that runs into problems because even just to ascertain which LD/STs *have* the same 12-48 bits in the L0 cache/buffer, you *still have to make comparisons*!

wark-wark

the fail there is because we are trying to do up to 16 simultaneous merges of L0 cache LD/ST addresses onto a single (1R1W) L1 cache.

fortunately it is only actually 8 (7) maximum 12-48 bit addr compares because each LDST Function Unit generates a pair of operations that cover the same (potentially) misaligned operation.

thus we know that for each pair:

* the 2nd LDST request only activates on a misaligned request
* the diff between 1st and 2nd addr is +1 on bits 4-11
* bits 12-48 *do not change* (we trap when bits 4-11 are all 1s)

thus in effect there are only 7 comparisons of bits 12-48 required, not 15.
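
(to make the pairing explicit, a rough plain-python sketch of one FU's
split - 16-byte rows assumed, and this is only my reading of the
all-ones trap condition above:)

def split_misaligned(addr, ln):
    # one FU request -> up to two (row, bytemask) halves;
    # the 2nd half only exists if the access is misaligned across a row.
    full = ((1 << ln) - 1) << (addr & 0xF)   # unary bytemask, may spill past bit 15
    row = (addr >> 4) & 0xFF                 # address bits 4-11
    if row == 0xFF and (full >> 16):
        raise Exception("rollover into bits 12+: trap instead of handling it")
    first = (row, full & 0xFFFF)
    second = (row + 1, full >> 16) if (full >> 16) else None
    return first, second                     # bits 12-48 are identical for both halves

print(split_misaligned(0x2004, 8))   # fits in one row: second half is None
print(split_misaligned(0x200C, 8))   # crosses the row: (row, lo-mask), (row+1, hi-mask)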

anyone any bright ideas, here?

l.

MitchAlsup

Apr 19, 2020, 5:28:30 PM
On Sunday, April 19, 2020 at 4:22:03 PM UTC-5, lkcl wrote:
> https://libre-soc.org/3d_gpu/mem_l0_to_l1_bridge_v2.png
>
> reopening an older discussion, here, because of some minor design changes.
>
> recall that (with thanks to input from Mitch) we have an address-match triangular matrix that catches (over-zealously) LD/STs that could overlap, by comparing addr[4:11]. however by excluding addr[0..3] this misses some opportunities to merge LDSTs at the cache-line

So, you start with the ability to determine "Can't be the same as" (below)

>
> recall that we augmented and improved this by turning addr[0:3] plus the LDSTlen into a bitmask, and performing simple mask-wise AND in order to fully determine even diwn to the byte level which LD/STs do not overlap.
>
> thus by the time we want to maximise the LD/ST Cache line usage, by merging as many LD/STs onto the L1 as possible, we *already know* that the requests in the LDST Buffer do *not* overlap, right down to the byte level.
>
> thus we only need to identify those requests that have the same complete address, and that means comparing bits 12:48 (or whatever).
>
> here's the problem: that's 36 XOR gates times 8 (if the L0 cache/buffer is 16 entries).
>
<snip>
>
> anyone any bright ideas, here?

And you attempt to morph it into "has to be the same as" and wonder why
there are so many comparators !?!

>
> l.

lkcl

Apr 19, 2020, 5:43:18 PM
On Sunday, April 19, 2020 at 10:28:30 PM UTC+1, MitchAlsup wrote:
> On Sunday, April 19, 2020 at 4:22:03 PM UTC-5, lkcl wrote:
> > https://libre-soc.org/3d_gpu/mem_l0_to_l1_bridge_v2.png
> >
> > reopening an older discussion, here, because of some minor design changes.
> >
> > recall that (with thanks to input from Mitch) we have an address-match triangular matrix that catches (over-zealously) LD/STs that could overlap, by comparing addr[4:11]. however by excluding addr[0..3] this misses some opportunities to merge LDSTs at the cache-line
>
> So, you start with the ability to determine "Can't be the same as" (below)

> > [lots of waffling, by me]

> And you attempt to morph it into "has to be the same as" and wonder why
> there are so many comparators !?!

yerrrrs, i knooow :)

actually in the triangular matrix it's still a lot of comparators. 7+6+5+4+3+2+1 equals 28, and bits 4-11 is um 8 bits, so that's 224 XOR gates (eek!)

i have a suspicion that, with a little thought, the vector tag trick might be applicable to those 4-11 bits just as it would to 12-48.

hmmm ...



lkcl

Apr 19, 2020, 6:20:12 PM
HA!

if you have sequential LD/ST operations being generated from a Vector LD/ST
operation, the "comparison" criterion - detecting whether the carry out of bits 0-3 has rolled over - is:

bit 4!

and detecting whether that carry has also rolled over on bits 4-11 is:

bit 12!

ta-daaa :)

we are back to single-bit comparisons, and that can be done at the unary level. of course, if there *isn't* a match, you have to do the full address comparison.
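
(illustrating with plain python - unit-stride 64-bit element LDs assumed,
so a single-bit comparison per neighbouring pair stands in for the full
address compare:)

base = 0x3FE0
addrs = [base + 8 * i for i in range(8)]     # sequential dword LDs from a vector op

prev = addrs[0]
for a in addrs[1:]:
    new_line = ((a ^ prev) >> 4) & 1         # bit 4 toggled: carry out of bits 0-3
    new_span = ((a ^ prev) >> 12) & 1        # bit 12 toggled: carry out of bits 4-11
    print(hex(a),
          "new cache line" if new_line else "same cache line",
          "+ bits 4-11 rolled over" if new_span else "")
    prev = a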

this leaves an interesting situation, where you have a mix of:

* some addresses that matched sequentially
* some addresses that you know specifically *failed* a sequential match
(bit 4 or 12 changed)
* some addresses where you know that they were not sequential.

have to think about this some more.

l.

lkcl

Apr 22, 2020, 8:46:35 AM
https://libre-soc.org/3d_gpu/twin_l0_cache_buffer.jpg

this is mainly for Ivan. i worked out a couple of niggles: i realised
that, with each Function Unit's mis-aligned address always crossing over
to a new cache line, Addr[4] is *always* going to be mutually-exclusively
different between the two halves of any (mis-)aligned, split LD/ST pair of requests.

thus the idea to farm Addr[4]==0 to one separate L0 cache/buffer and
Addr[4]==1 to another and still be able to do useful optimisations
to reduce XOR-gate power consumption, is meaningless.

the solution: use only a single L0 cache/buffer that happens to have
twin ports: one for Addr[4]==0 and one for Addr[4]==1.

this has the added advantage of cutting the number of rows required in half,
back down to equal the number of FunctionUnits rather than double the number
of FUs.

also, Multiplexing is reduced to a 2-in 2-out non-delaying non-exclusionary
crossbar that requires *no* arbitration.

why?

because, remember, back at the FU, the two addresses are *mutually-exclusive*
in bit 4 of the address.

that means that either:

* low-port addr[4] == 0 and hi-port addr[4] == 1 - set the xbar "straight"
* low-port addr[4] == 1 and hi-port addr[4] == 0 - set the xbar "cross"
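
(a tiny plain-python model of that selection - the request/payload naming
is mine, not the actual code:)

def route_pair(lo_req, hi_req):
    # lo_req / hi_req: (addr, payload) from one FU's split pair; bit 4 of the
    # two addresses is guaranteed mutually exclusive, so one bit sets the xbar.
    cross = (lo_req[0] >> 4) & 1             # 0 = "straight", 1 = "cross"
    left, right = (hi_req, lo_req) if cross else (lo_req, hi_req)
    assert ((left[0] >> 4) & 1) == 0         # left port always serves addr[4]==0
    assert ((right[0] >> 4) & 1) == 1        # right port always serves addr[4]==1
    return left, right

# misaligned pair: 1st half lands in a row with addr[4]==1, 2nd half in the next row
print(route_pair((0x1018, "lo-mask"), (0x1028, "hi-mask")))   # xbar set to "cross"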

this results in:

* the left port of L0 always serving addr[4]==0 and
* the right port of L0 always serving addr[4]==1

exactly as when we had two separate L0 caches. the differences:

* only 8 rows
* addr[12:48] XOR-comparison results are shared between both ports

notes:

* the optimisation for vector LD/STs (which will result in significant
power usage reduction) may still be applied.

* those crossbars *only* need to go 2-in/2-out to the 2 banks in the row.
they do *NOT* need to go to other rows.

* thus we do not have massive 16-way to 16-way multiplexers. we have
instead 8x 2-in/2-out crossbars

* the interface to each L1 cache is a pure-and-simple single-cycle
read-or-write. this means that the L1 cache may use straight (fast)
1R-or-1W SRAM and thus may be single-cycle operational.

l.