On Friday, March 27, 2020 at 2:32:50 PM UTC, Ivan Godard wrote:
> On 3/24/2020 2:47 PM, lkcl wrote:
> > so we are good. whew :)
>
> If you say so :-)
(i'd better be...)
> >> 2) Your L0 is our L1 victim buffer.
> >
> > ah ok. same thing, different name.
>
> Well, a little different, because they get filled from the L1 banks as
> well as from the FUs, and stores don't need line fetches.
hum, hum... is this because it's also being filled by SMP coherence requests coming from other cores, etc.?
> > ...use several layers? :)
>
> Money and power. And did I mention money?
this is why we are using coriolis2 (and NLNet)
> > this is enough to preserve the instruction order in an intriguing single-cycle way.
>
> Not sure about that. What about instructions that don't use registers
> that you can chain? Or rather, use the same ones? Code like:
> store r4, *r4;
> store r4, *r4+1;
> Here there is an order dependency between the two stores, but there's
> only one register involved and no updates or value creation.
register read/write dependency hazards - not memory hazards - would be detected and would stop them from overlapping.
i do have a scheme for detecting wasted writes (i called it "nameless registers"); however, we are running out of time to implement optimisations.
so aside from that, it might be an IO store that you genuinely want to do...
wait... it's a ST: put the contents of r4 into the location pointed to by r4, and into r4+1.
yes, of course those can be *issued* in parallel: they are reg read dependencies.
so there are 2 cases. first: if they are, say, ST.byte, then there are no memory overlaps, all hunky-dory. *r4 and *r4+1 either fall (maybe) into the same 16-byte cache line and get done in one go, or they get done across odd/even cache lines (we plan to have that dual-banked L1, using bit 4 of the address to select the bank).
if however they are ST.dwords then the AddressConflict Matcher kicks in.
the AddressMatcher notices two things:
1. the top bits, 4 to 11, are identical, AND
2. the bytemask maps, created from the 8-byte length shifted by address bits [0..3], have an overlap.
therefore the two will NOT be permitted to proceed simultaneously and MUST be done independently.
unlike the reg dependency checking (done before ACM phase) this part does *not* have anything to do with register numbers themselves.
although there is opportunity for detection and merging of the overlaps we will probably leave it for later.
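(a toy Python model of that double check, purely illustrative - the bit positions are from the description above, everything else is my naming, not the actual nmigen code:)

```python
# Toy model of the AddressConflict check: two memory ops conflict when
# their cache-line address bits [4..11] are identical AND the unary
# bytemasks (operand width shifted by address bits [0..3]) overlap.

def bytemask(addr: int, width: int) -> int:
    """Unary mask of the bytes touched within a 16-byte cache line."""
    return ((1 << width) - 1) << (addr & 0xF)

def conflicts(addr_a: int, width_a: int, addr_b: int, width_b: int) -> bool:
    # binary comparison of address bits 4 thru 11 (line granularity)
    same_line = ((addr_a >> 4) & 0xFF) == ((addr_b >> 4) & 0xFF)
    # unary bytemask overlap within the line
    overlap = bytemask(addr_a, width_a) & bytemask(addr_b, width_b)
    return same_line and bool(overlap)

# ST.dword to *r4 and *r4+1 (8-byte writes, one byte apart): conflict
assert conflicts(0x1000, 8, 0x1001, 8)
# ST.byte to *r4 and *r4+1: no bytemask overlap, may proceed together
assert not conflicts(0x1000, 1, 0x1001, 1)
```

note that because bits 12 and up are never compared, two addresses from different pages that happen to share bits 4..11 register as a "conflict" too - the deliberate over-matching discussed further down.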
> I suppose
> you can create a fake dependency r4->r4, but that could read in either
> direction. And it could be followed by "store r4, r4+2;" too.
yes.
> How do you
> resolve this without an order-tag?
i did not mention: i also added an FU-FU LDST order-preserving matrix, working in conjunction with AddressConflict hazard detection. it has been a while (the code was written last year).
so if AC is noted, then the FUFU matrix does not let the operation proceed until that hazard is dropped. in this case that hazard only gets dropped when the *r4 st gets notification back that it has completed.
*then* because that operation no longer exists in the Matrices, the address conflict bytemask overlap no longer exists, and the *r4+1 is free and clear to proceed.
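(a minimal sketch of that wait-then-release behaviour - the class and method names are mine, invented for illustration:)

```python
# Toy model of the FU-FU LDST order-preserving matrix: when an address
# conflict is noted, the younger op records a wait bit against the older
# FU, and may only proceed once that FU signals completion and its bits
# are dropped from the matrix.

class FUFUMatrix:
    def __init__(self, n_fus: int):
        # waits[y] = set of FUs that FU y must wait on
        self.waits = [set() for _ in range(n_fus)]

    def note_conflict(self, younger: int, older: int):
        self.waits[younger].add(older)

    def complete(self, fu: int):
        # hazard dropped when the older op reports completion
        for w in self.waits:
            w.discard(fu)

    def may_proceed(self, fu: int) -> bool:
        return not self.waits[fu]

m = FUFUMatrix(4)
m.note_conflict(younger=1, older=0)   # *r4+1 ST waits on *r4 ST
assert not m.may_proceed(1)
m.complete(0)                         # *r4 ST notifies completion
assert m.may_proceed(1)               # *r4+1 free and clear to proceed
```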
> Or maybe the AGU output gets a fake
> regnum, with memrefs cracked in decode into something like "rfake =
> LEA(r4); store r4, *rfake; rfake1 = LEA(r4+1); store r4, *rfake1;"? That
> should work, but chews up a lot of renamers.
are we getting confused about LD and ST here? no, i don't think so...
the LDST Comp Units have their own ALUs. just adders. the addresses get stored in latches within those two LDSTALUs.
so no need in my mind for rfake or rfake1, although conceptually they do exist, as anonymous unnamed latches somewhere in the AGEN ALU.
these AGENs also get held in the AC matrices (perhaps unnecessarily); i will investigate that during an optimisation phase.
> It's easy to see how OOO works in the easy cases, but it's perversities
> like this that make me a dedicated IO guy.
whereas for me, in-order is just bollocks. every solution to every problem is: stall.
all efforts - and i mean all - to work around that start with the word "but" and go downhill from there.
but you can't do variable pipelines.
but you can't do pipeline bypasses.
but interrupts have to be masked.
but you must destroy interrupt latency.
> > * generating a ton of scalar operations, such that we can use standard multi issue hardware, vectorisation "goes away"
>
> You get no argument from me on that one, except for "standard". Standard
> doesn't scale enough.
yeah, i also didn't mention: we will be doing a SIMD ALU backend. the plan is to detect certain sequential operations, such as element striding, and fuse them into ALU operations of the width of the SIMD unit.
even FP16 element-strided LDs/STs can be fused into 64-bit LD/STs, because you are not supposed to touch the actual bits (no NaN conversion etc.)
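(a toy sketch of that fusion rule - the function name, alignment requirement and four-element grouping are my illustrative assumptions, not the actual detection logic:)

```python
# Toy model of strided-LD/ST fusion: four unit-strided FP16 element
# loads from consecutive addresses can be issued as a single 64-bit
# load, because the bits are moved untouched (no NaN normalisation).

def can_fuse_fp16x4(addrs) -> bool:
    """True if four addresses are consecutive 2-byte slots fitting one
    aligned 64-bit access."""
    base = addrs[0]
    consecutive = all(a == base + 2 * i for i, a in enumerate(addrs))
    aligned = (base % 8) == 0
    return consecutive and aligned

# unit stride, 8-byte aligned: fuse into one 64-bit LD
assert can_fuse_fp16x4([0x100, 0x102, 0x104, 0x106])
# stride of 4 bytes: elements are not contiguous, no fusion
assert not can_fuse_fp16x4([0x100, 0x104, 0x108, 0x10C])
```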
however, yes, at some point we hit a scale wall, long before the 64-entry VLEN theoretical maximum. hypothetically we could extend to 8-way multi-issue, however the internal bus architecture (maybe 6,000 data wires wide?) has me slightly freaked out.
i mean, we can _do_ it... :)
> The example above was WAW ordering. What about that?
yep, back at the reg DM phase. hang on, are you sure you are not confusing LD and ST?
oh, you mean *memory* WaW, not reg WaW.
yes: we are not ignoring it, just separating them out and not doing any optimisation (yet)
> > all these have been turned into unary bytemaps.
> >
> > d1 = 0b1111 1111 0000 0000
> > w1 = 0b0000 0000 1111 0000
> > s1 = 0b0000 0000 0000 1100
> > w2 = 0b0000 1111 0000 0000
> > s2 = 0b0000 0000 0011 0000
> >
> > therefore *back at the AddressConflict Matrix* (assuming those were all the same Addr[4:48] binary addresses) the ACM would *already* have prevented LEN+Addr[0..3] d1 and w2 when encoded in this unary mask form from progressing simultaneously to the L0 Cache/buffer (aka L1 victim buffer).
> >
> > etc.
>
> Hmm. The picture helps; thank you. BTW, figuring out that d1/w1/s1 are
> non-conflicting is polynomial, but figuring out which is the best of the
> various intra-non-conflicting but inter-conflicting sets to let through
> is the NP bin-packing problem. I'm guessing that you don't try,
correct. we reviewed queues, rejected them, and decided to just select the first arbitrary ADDR and use that one as a filter to check for and find any opportunities for simultaneous cache-line reads/writes.
that turns O(N^2) into two O(N) operations: one that finds the first free buffer entry, and one that does binary address comparisons against all other pending operations.
matches have their bytemasks merged.
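(sketched as a toy Python function - the data layout of (line address, bytemask) pairs is my illustrative assumption:)

```python
# Toy model of the two-O(N)-pass scheme: take the first pending op, scan
# the rest once for matching cache-line addresses, and OR the bytemasks
# of the matches together so the line access happens in one go.

def merge_first_line(pending):
    """pending: list of (line_addr, bytemask) tuples.
    Returns the merged first entry plus the unmerged remainder."""
    first_line, mask = pending[0]        # arbitrary: first free entry
    rest = []
    for line, m in pending[1:]:          # single O(N) comparison pass
        if line == first_line:
            mask |= m                    # matches: bytemasks merged
        else:
            rest.append((line, m))
    return (first_line, mask), rest

merged, remaining = merge_first_line([(0x40, 0x000F), (0x41, 0x00F0),
                                      (0x40, 0x0F00)])
assert merged == (0x40, 0x0F0F)          # two ops on line 0x40 combined
assert remaining == [(0x41, 0x00F0)]
```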
> and just
> use a maximal munch on the ordered request stream, which is probably
> nlogn and would rarely cost you a cycle over optimal.
we really, *really* want simple and straightforward algorithms for a first cut.
i would be interested to hear more about this idea "maximal munch" some time. after the first tapeout.
>
> The byte conflict mask approach is actually how we do byte merge too,
cool.
> except that our request stream is ordered by definition and your unary
> mask is architecturally visible to us as the valid-bit on every byte in
> the hierarchy short of DRAM.
yes, likewise i would expect the bytemask to propagate down through the L1 and L2 caches. and since we are going to do DDR1 SDRAM in the first 180nm ASIC (because it's on opencores): doesn't that have byte-select lines?
> Your explanation thus pushes my doubts to the Address Conflict Matrix.
> Do I take it correctly that ACM detects at line granularity?
mitch's original uses bits 5 thru 11, to give "overopportunistic" matching.
by grabbing *more* than is actually conflicting, opportunities for optimisation are missed, but at least no conflicts ever slip through; and the algorithm turned out, mitch reported, not to significantly impact real workload performance.
yes: we are going with BINARY comparison of cache-line address bits FOUR thru 11, AND unary mask overlap, as the "conflict detect trigger".
> One
> observation of actual code is that spatial locality is at sub-line
> granularity: a reference to a line is followed by another reference to
> the same line with high probability.
yyep. hence the double detect. binary cache line, unary bytemask.
> A conflict detect can split a
> per-address-space stream of requests into several per-line streams of
> requests; the cost is a set of associative comparators (fast) and a miss
> detect and evict (slow).
nnope. that's the beauty of Mitch's scheme: you only need a Pascal Triangle of binary address comparators on bits 5 thru 11.
no need to check upper bits, no need for set associative tags.
let us say that you do indeed get two addresses from different VM pages (bits 12 and up of the address differ).
oh dear, bits 5 thru 11 are the same, so this is accidentally included as a "conflict". whoopsie: one of them has to wait. so what?
this "hit" on performance, in real world scenarios, is not as bad as it seems.
> Still, if I understand you (?) you cannot handle an intra-line sequence
> that has more than one read-write or write-read transition, because you
> have to accumulate a mask at each transition. W-W-W is OK, and R-R-R,
> but not W-R-W or R-W-R; not in one cycle anyway.
err... maayybee? :) he said.
aw god i think we might.
no, hang on. LDs get batched and hold up STs. STs get batched and hold up LDs. that's done right back at the Memory Hazard RaW WaR Matrix, long before addresses are analysed.
so yes, W R W would get turned into
W
pause and flush L0
R
pause and flush L0
W
so it's not that it's not handled, it's avoided entirely.
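(the batching rule sketched as a toy generator - purely illustrative, the real thing is done in the hazard matrices, not software:)

```python
# Toy model of the LD/ST batching described above: requests are grouped
# into runs of same-direction ops, and each direction change forces the
# previous batch to drain (flush L0) before the next batch may start.

def batch(ops):
    """ops: sequence of 'R'/'W'. Yields lists that may proceed together."""
    current = []
    for op in ops:
        if current and op != current[0]:
            yield current            # direction change: drain the batch
            current = []
        current.append(op)
    if current:
        yield current

# W-R-W degenerates into three single-op batches, each drained in turn
assert list(batch("WRW")) == [["W"], ["R"], ["W"]]
# W-W-W and R-R-R runs stay batched together
assert list(batch("WWWRRR")) == [["W", "W", "W"], ["R", "R", "R"]]
```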
> We can, with the
> as-of-retire load semantics Mill has; we just reissue a read that got
> clobbered. We also detect conflict at the byte(-range) level, so
> conflict is very rare, whereas locality says that it will be frequent in
> your line-granularity ACM.
>
> It's interesting that Mill, starting from the semantics, and you/Mitch,
> starting from the gates, have arrived at very similar designs in this area.
:) hey i am just running with ideas that make sense and trusting they'll work out
> > however *initially* they will be separate LDs, thus 2 separate FUs will be allocated, one for Bi and one for Bi+1
>
> And so you get away from SIMD. That will work fine for code working on
> full-scalar objects. It's not going to work well for sub-scalar data.
> You don't have the hardware capacity to use eight LS units to block
> load/store a byte array, nor eight ALUs to add one to a byte.
to keep the discussion sane i left out the bit about pre-batching, back at the VLEN Loop phase, because we do have a 64-bit-wide dynamically partitionable SIMD ALU.
yes, really: gates open up and turn add, shift, etc. (even MUL) into 8x8 or 4x16 or 2x32 or 1x64. hypothetically we could do mixed widths like 8/16/8/32 but i reaaaally don't want to go there.
thus by the time we get to that "scalar" part, what we will mostly be hitting the ALUs (and LDST) with is actually 64-bit operations.
obviously, nonstrided LDSTs not so much.
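(a software model of the dynamically partitionable adder - a minimal sketch, not the gate-level design: a 64-bit add where the "gate" at each element boundary either lets the carry propagate or cuts it, giving 8x8 / 4x16 / 2x32 / 1x64 from the same datapath:)

```python
# Toy model of a partitioned 64-bit SIMD add: carries are cut at every
# elwidth-bit boundary, so each lane wraps independently.

def partitioned_add(a: int, b: int, elwidth: int) -> int:
    """64-bit add with carry chains cut at elwidth-bit boundaries."""
    mask = (1 << elwidth) - 1
    result = 0
    for shift in range(0, 64, elwidth):       # each lane independent
        lane = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane & mask) << shift      # carry out of lane is cut
    return result

# 8x8: byte 0 wraps, no carry into byte 1
assert partitioned_add(0xFF, 0x01, 8) == 0x00
# 1x64: same operands, carry propagates into byte 1
assert partitioned_add(0xFF, 0x01, 64) == 0x100
```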
> Or does
> Mitch still plan on having GBOOO figure out SIMDization in hardware?
can't speak for Mitch; he can take care of himself :)
> We
> gave up on that, and all ops work on SIMD too.
the only reason LibreSOC can even remotely consider it is because of the union typecasting of the regfile to uint8 uint16 uint32 uint64 arrays that is part of the SV Specification.
> Bon voyage :-)
:)