fortunately, our vector engine is effectively nothing more than a
multi-issue scalar system. thus, by the time instruction issue is over
(which is where the vectorisation sits), the vectorisation is entirely
gone.
mitch's address-conflict and memory-hazard-matrix system:
* separates sequences of load-load-load from store-store-store-store,
whilst *also preserving the order*
* identifies address-conflicts (in a slightly over-zealous way: see
the sketch below)
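(a quick python sketch of one plausible reading of "slightly
over-zealous": conflicts detected at cache-line granularity, bits 0-3
of the address masked off, so two accesses to the same line "conflict"
even when their bytes don't actually overlap. purely illustrative.)

```python
def lines_conflict(addr_a: int, addr_b: int) -> bool:
    # hypothetical sketch: two accesses "conflict" if they touch the
    # same 16-byte line, even if the bytes within it do not overlap.
    return (addr_a & ~0xF) == (addr_b & ~0xF)

assert lines_conflict(0x1000, 0x1008)      # same line: flagged anyway
assert not lines_conflict(0x1000, 0x1010)  # different lines: no conflict
```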
> It is not even sufficient to detect that the initial address recurrence
> overlaps and avoid setting up a vector, at least if your vectors support
> negative strides and/or different strides in different but concurrent
> streams, which they really must. You can get two vectors sliding over
> each other in opposite directions or one overtaking another, and
> determining the order semantics when the overlap happens is non-trivial
> except in Fortran.
again: we do not have a problem here, because the "vectors" are not
actually vectors, per se: they're just an instruction to the *issue*
engine to sit with the PC "frozen", blurting out a whack-load of
*scalar* instructions, incrementing the *register* numbers each time
round the "vector" hardware for-loop.
thus we can "pretend" that those scalar instructions (which happen to have
sequential register numbers but the same operation) were actually really,
*really* scalar operations actually coming from the I-Cache, as in, they
were, actually, really, *really*, actual binary operations @ PC, PC+4, PC+8
and so on.
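(to illustrate, a minimal behavioural sketch in python of that
"frozen PC" for-loop; ScalarOp and the operand layout are purely
illustrative, not the actual issue-engine datapath:)

```python
from collections import namedtuple

ScalarOp = namedtuple("ScalarOp", ["pc", "op", "rd", "ra", "rb"])

def expand_vector(pc, op, rd, ra, rb, vl):
    """issue engine sits with the PC frozen, emitting vl *scalar*
    operations, incrementing the register numbers each time round.
    downstream pretends they came from pc, pc+4, pc+8 and so on."""
    for i in range(vl):
        yield ScalarOp(pc + 4 * i, op, rd + i, ra + i, rb + i)

for s in expand_vector(pc=0x1000, op="add", rd=8, ra=16, rb=24, vl=4):
    print(s)   # add r8,r16,r24 @ 0x1000 ... add r11,r19,r27 @ 0x100c
```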
all we need to do then is design a (bog-standard) Multi-Issue *scalar*
execution engine, and all of the problems of Vector overlap detection
etc. simply go away.
to be replaced by lovely, shiny new ones of course :)
> Assuming that you resolve ordering issues, there is then the alignment
> and isolation issue you address above. We initially required natural
> alignment, but gave it up on business/market grounds. However, while we
> have to support unaligned access, we decided that we didn't have to make
> it fast.
:)
> Consequently the load/store units check for boundary-crossing
> and if necessary double-pump the request. The difference between our
> approach and your description is that we do the split at the LS FU,
the plan is to do the split directly after the LD/ST Function Unit:
each FU effectively has (and may therefore have to wait on) two LD/ST
ports.
we'd anticipate that for the most part the 2nd port would go completely
unused, as it's only going to be active when mis-aligned memory requests
occur.
> not
> at the other end of the request stream at the buffers. Because our
> request stream is strictly ordered, we can split any request while
> preserving semantic order.
yeah, the plan is to turn the LSBs of the address into that unary
masking scheme: bits 0-3 of the address will never even hit the
buffers, they are presented instead as unary byte-mask lines 0-15.
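(a sketch of that LSB-to-unary conversion, including the two-port
split for a line-crossing access; the 16-bit mask width matches the
byte-mask lines 0-15 above, everything else is illustrative:)

```python
def byte_masks(addr: int, nbytes: int):
    """convert an access into (line_addr, 16-bit unary byte-mask)
    pairs: one pair if contained within a 16-byte line, two pairs
    if it crosses the boundary (the 2nd pair goes to the 2nd port)."""
    assert 1 <= nbytes <= 16
    line = addr & ~0xF
    mask = ((1 << nbytes) - 1) << (addr & 0xF)
    out = [(line, mask & 0xFFFF)]            # bytes in the first line
    if mask >> 16:                           # boundary crossed:
        out.append((line + 16, mask >> 16))  # remainder, 2nd port
    return out

assert byte_masks(0x100, 8) == [(0x100, 0x00FF)]                   # aligned
assert byte_masks(0x10C, 8) == [(0x100, 0xF000), (0x110, 0x000F)]  # split
```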
the Memory-Order Matrix then takes care of load/store ordering. i'm
trusting Mitch's advice that, whilst *batches* of loads have to be
ordered relative to *batches* of stores (load-load-load store-store
load-load *must* be 3xLD then 2xST then 2xLD), for a Weak Memory
Ordering system the ordering of LDs (or STs) *within* a batch does
*not* have to be preserved.
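(the batch rule as a quick python sanity-check, using the exact
example above:)

```python
from itertools import groupby

def batches(ops):
    """program-ordered LD/ST stream -> ordered (kind, count) batches.
    batches must complete in order; ops *within* a batch need not."""
    return [(k, len(list(g))) for k, g in groupby(ops)]

assert batches(["LD"]*3 + ["ST"]*2 + ["LD"]*2) == \
       [("LD", 3), ("ST", 2), ("LD", 2)]   # 3xLD then 2xST then 2xLD
```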
> However, the request stream has a fixed number of slots per cycle. If
> all the slots are taken by LS requests then there's no slot for the
> splitee.
based on Mitch's assessment that for WMO the ordering of any given LD
(or ST) within a "batch" is not important, it didn't occur to me to
make any distinction between the two halves of a split operation.
for a TSO system, hell yes it would matter. and, because we're now doing
POWER ISA, i know that TSO really does matter on IO-memory-accesses, so
we'll cross that bridge when we come to it. (i believe POWER 3.0B copes
with this by simply banning misaligned IO-memory accesses)
> On a slow machine one could stall for a cycle, but our firehose
> takes a couple of cycles to throw a stall, due to speed-of-light issues.
:)
> Consequently the request stream, and pretty much all other streams in
> the hierarchy, use a predictive slot assignment algorithm where the
> source maintains a high/low-water count of the projected state at the
> sink end, and throttles itself as necessary. The source does not send a
> request unless it knows that the sink will have catch capacity when the
> request arrives (will arrive, not is sent). This is *not* a handshake.
>
> This algorithm is essentially identical to the predictive throttling
> used in network packet handling, q.v. The throttle logic, which is
> widely used in a Mill, is called "Drano", because it keeps the pipes
> open.
good name :)
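(a tiny python model of that style of credit-based predictive
throttling, as i understand the description; the single counter and
the names are my simplifications, not the actual Mill mechanism:)

```python
class CreditThrottle:
    """source-side model: track projected free slots at the sink and
    refuse to send unless arrival capacity is guaranteed. no handshake:
    the sink's drain notifications arrive some cycles late, which the
    initial credit count must already account for."""
    def __init__(self, sink_slots: int):
        self.credits = sink_slots       # projected free sink capacity

    def try_send(self) -> bool:
        if self.credits == 0:
            return False                # throttle (eventually: stall)
        self.credits -= 1
        return True

    def on_drain(self):                 # delayed notice from the sink
        self.credits += 1
```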
> Of course, excess split requests on a saturated request stream
> will eventually exceed the Drano buffering capacity and the source will
> issue a stall, early enough (it's a predictive system, after all) that
> the stall can take effect before the sink overflows.
>
> If I understand your system (questionable) you are able to handle
> boundary overrun by doubling the access capacity so you can do two
> (derived from one) requests at once.
and let the buffers sort it out, yes. ok, so the idea is to have N (N=6 or 8)
LD/ST Function Units, each with 2 ports, and to try to get 4-way porting
to the L0 cache/buffer.
now, unless there's a really, really good reason (and i am getting some
hints from what you describe that there might be), i wasn't going to
"prioritise" the non-misaligned LD/ST operations, on the basis that, well,
it's all gotta come out at some point [hopefully not as poop...]
> This seems to be wasteful IMO,
> because you will very rarely see boundary crossing requests. You propose
> to be able to do two at once, but do not have the ability to do two at
> once from two requests. Mill Drano has only the ability to do one at a
> time (per stream slot), but the slots are rarely full in a cycle, and
> the Drano buffer lets requests spill over into other cycles. The cost
> for us is that a split request may cause other requests to be delayed a
> cycle (if there are no free slots in the current cycle) or trigger a
> stall (if there are no free slots within the reach of the Drano).
>
> Your proposal is immune to the Mill's potential delay, but it leaves you
> searching for double-ported hardware.
ah no: i wondered *whether* double-ported (or quad-ported) L1 caches
exist, and from what Mitch was saying, not even 1R1W caches exist:
they're 1R-*or*-1W caches.
so that idea is shot.
now, let's say that we have a 4-ported L0 cache/buffer, 8-entry, which has a
12-in (or 16-in) Priority Muxer to allow access to those 4 ports.
12-in because, with 6 LD/ST Function Units, each has its own splitter,
doubling the effective number of FU ports; 16-in if we have 8 LD/ST FUs.
i don't mind having a 12-or-16-to-4 multiplexer: i wrote a Multi-port
Priority-Picker for exactly this purpose just last week.
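(a behavioural model of a multi-port priority-picker, simplified so
that the lowest-numbered requester always wins the first port; the
real picker may distribute priority more fairly, so treat this as a
sketch of the concept only:)

```python
def multi_priority_pick(reqs: int, nports: int):
    """reqs: bitmask of 12 (or 16) requesters. returns up to nports
    one-hot picks, guaranteed mutually exclusive."""
    picks = []
    for _ in range(nports):
        if not reqs:
            break
        lowest = reqs & -reqs      # isolate lowest set request bit
        picks.append(lowest)
        reqs &= ~lowest            # knock it out for the next port
    return picks

# 12 requesters (6 LD/ST FUs x 2 after the splitters), 4 L0 ports:
assert multi_priority_pick(0b101100110010, 4) == \
       [0b10, 0b10000, 0b100000, 0b100000000]
```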
so let's say that we have 2 of those LD/ST units issuing mis-aligned
LDs, and another 2 FUs issuing aligned LDs. this effectively means we
have *six* cache-line-aligned LD/STs being issued:
* 2 from the 1st mis-aligned LD
* 2 from the 2nd mis-aligned LD
* 1 from the 1st aligned LD
* 1 from the 2nd aligned LD
depending on what the 4-simultaneous PriorityPicker happens to choose,
*any* 4 of those 6 could be picked.
now... should we do something such that mis-aligned LDs/STs are
*reduced* in priority? given that we have this byte-mask "cache-line
detection system", i'm not so sure that we should try that, because
the more that is thrown at the L0 cache/buffer, the more opportunity
there could be for "merging" the unary byte-masks into full cache
lines.
i honestly don't know, here.
> I suggest that you consider saying
> that users of boundary-crossing references should accept a potential
> delay as the cost of software flexibility.
i think... i think we're going to have to see how it goes, and (thank you
for the heads-up) keep a close eye on that.
one of the "refinements" that was considered was to have a popcount on the ORing of the cache-byte-line bitmaps, to detect which ones have the highest
count. full cache lines get top priority.
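(sketched in python: OR the byte-masks per line, popcount the result,
fullest lines first. purely illustrative:)

```python
def merge_priority(requests):
    """requests: list of (line_addr, 16-bit byte-mask). OR together
    masks targeting the same line, then rank lines by popcount so
    that full (or fullest) cache lines get picked first."""
    merged = {}
    for line, mask in requests:
        merged[line] = merged.get(line, 0) | mask
    return sorted(merged, key=lambda l: bin(merged[l]).count("1"),
                  reverse=True)

reqs = [(0x100, 0x00FF), (0x100, 0xFF00),  # two halves merge: full line
        (0x110, 0x000F)]                   # lone quarter of a line
assert merge_priority(reqs) == [0x100, 0x110]
```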
i can see, now that you mention it, that a large number of mis-aligned
LD/STs, due to using double the number of cache lines that an aligned
LD/ST needs, would clearly cause a throughput slow-down.
however *if* those misalignments happen to be mergeable with other
memory requests (because of the cache-byte-line bitmaps), it will
actually not be as bad as it sounds.
the caveat being: the requests have to be to the same cache line
(bits 0-3 of the address excluded) and in pretty much the same cycle.
for a sequential memory read/write that happened to be accidentally
mis-aligned, this condition would hold true. for random misaligned
memory accesses, however, we'd end up with 1/2 throughput.
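(a quick sanity-check of that 1/2-throughput claim, counting 16-byte
lines touched; the addresses are made up:)

```python
def lines_touched(addrs, nbytes=8):
    lines = set()
    for a in addrs:
        lines.add(a & ~0xF)
        if (a & 0xF) + nbytes > 16:        # crosses into the next line
            lines.add((a & ~0xF) + 16)
    return len(lines)

seq = [0x104 + 8 * i for i in range(8)]    # sequential, mis-aligned
rnd = [0x1009, 0x2309, 0x4509, 0x6709,     # scattered, mis-aligned
       0x8909, 0xab09, 0xcd09, 0xef09]
assert lines_touched(seq) == 5    # merges: only ~1 extra line overall
assert lines_touched(rnd) == 16   # 2 lines per access: 1/2 throughput
```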
> A bit of cooperation from the
> rest of the ISA will help; for example, rather than support arbitrary
> arithmetic and assignment to the stack pointer, you could make it a
> specReg and provide an update operation that always allocates in whole
> aligned lines, so the stack frame is always aligned and your compiler
> can lay out data so that ordinary references never cross boundaries. Of
> course, that might be too CISCy for you.
unfortunately we're stuck with the POWER ISA. actually... i *say*
"stuck": we're planning on extending it behind a "safety barrier" of
escape-sequencing, so we could conceivably do strange and wonderful
things, if justified.
l.