Decoding Instructions in Parallel

Quadibloc

Jan 6, 2024, 11:36:35 AM

Given that I do not know a whole lot about how cache
coherency is done, and Mitch asked me what approach
I was planning to take...

I went on a web search to find more information on
the subject.

I learned that MSI went to MESI... and then there were
a bunch of "ownership" schemes, such as Berkeley,
Illinois, Firefly, and Dragon.

By 1999, AMD seems to have done something in that area
with MOESI, and later on Intel came up with MESIF instead,
where "F", for Forwarding, is _like_ owned data, but it
is also saved to RAM. Engineers at Intel recently also
wrote papers on "MOESI Prime", which has primed versions
of two of the MOESI states to avoid the cache coherency
mechanism causing RowHammer-like behavior.

Anyways... there was something else I found while looking
this stuff up.

I had noted that one of the reasons for offering the
programmer a choice of writing programs with 32-bit
long instructions and nothing but 32-bit long instructions,
or using block headers for blocks of 256 bits in code,
was to allow instructions to be decoded in parallel.

Mitch pointed out that one could just start decoding
in parallel at every possible instruction start location,
while also, in parallel, quickly resolving instruction
lengths so as to find which decodes result in executions.

I acknowledged that one could certainly do that, but
since it was somewhat wasteful of heat and electricity,
I didn't think of this as describing a _typical_
implementation of my ISA (and hence parallel decoding
was still an excuse for having a block structure rather
than conventional CISC-like variable-length instructions).

Well, one of my search results showed that this was how
they did it on the first 64-bit Opterons, from AMD, so
that explains why this technique came so readily to
Mitch's mind!

John Savard

MitchAlsup

Jan 6, 2024, 2:16:59 PM

Quadibloc wrote:

> Given that I do not know a whole lot about how cache
> coherency is done, and Mitch asked me what approach
> I was planning to take...

> I went on a web search to find more information on
> the subject.

> I learned that MSI went to MESI... and then there were
> a bunch of "ownership" schemes, such as Berkeley,
> Illinois, Firefly, and Dragon.

> By 1999, AMD seems to have done something in that area
> with MOESI, and later on Intel came up with MESIF instead,
> where "F", for Forwarding, is _like_ owned data, but it
> is also saved to RAM. Engineers at Intel recently also
> wrote papers on "MOESI Prime", which has primed versions
> of two of the MOESI states to avoid the cache coherency
> mechanism causing RowHammer-like behavior.

The OWNED state represents the concept that this copy is the
only valid copy, so you better not lose it. A request can
arrive back with OWNED data (in some protocols) and now the
recipient is in charge of not losing it.

> Anyways... there was something else I found while looking
> this stuff up.

> I had noted that one of the reasons for offering the
> programmer a choice of writing programs with 32-bit
> long instructions and nothing but 32-bit long instructions,
> or using block headers for blocks of 256 bits in code,
> was to allow instructions to be decoded in parallel.

> Mitch pointed out that one could just start decoding
> in parallel at every possible instruction start location,

Consider reading 4 words at a time out of ICache. Even
before one compares the tag and selects the data to be
decoded, one can apply a block of logic 40-gates in
size and 4-gates of delay and have unary pointers to
the {Next instruction, any displacement, any constant}
by the time the tags have been compared and the 4-words
are then gated out with these extra pointers (8-bits)
on top of the 128-bits of instructions.

Each Next instruction pointer selects its successor, and
a tree of these resolves 2->4->8->16 at 1 more gate of
delay each. {Higher exponents seem accessible if desired}
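One way to read the successor tree described above is as pointer doubling. The sketch below is an illustrative software model, not the actual gate-level logic; the function name `resolve_starts` and the `lengths` list (a speculative length computed at every word position) are assumptions made for the example.

```python
def resolve_starts(lengths, first=0):
    """Return the word positions that begin a real instruction.

    lengths[i] is the length in words an instruction WOULD have if one
    started at word i. It is computed speculatively for every position,
    so entries at non-start positions are garbage and get ignored.
    """
    n = len(lengths)
    # Each position points at its successor; index n is an off-the-end sink.
    nxt = [min(i + lengths[i], n) for i in range(n)] + [n]
    is_start = [False] * (n + 1)
    is_start[first] = True
    hop = 1
    while hop < n:
        # Every known start marks the position 'hop' instructions ahead.
        for i in [p for p in range(n) if is_start[p]]:
            is_start[nxt[i]] = True
        # Double the reach of every pointer (1 -> 2 -> 4 -> 8 hops).
        nxt = [nxt[nxt[i]] for i in range(n + 1)]
        hop *= 2
    return [i for i in range(n) if is_start[i]]


# Word 2 carries a bogus speculative length; it is never selected.
print(resolve_starts([1, 2, 9, 1, 3, 1, 1, 2]))  # [0, 1, 3, 4, 7]
```

Each pass of the loop corresponds to one level of the tree, so 16 positions resolve in four passes.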

> while also, in parallel, quickly resolving instruction
> lengths so as to find which decodes result in executions.

Generally one associates DECODE with when logical registers
are applied to either the physical register file or to the
register renamer. These are ports one must use efficiently,
and if possible the stage before DECODE (I call it PARSE)
routes instructions to suitable DECODERs {especially
important in ISAs with multiple register files {GPR, FP,
SIMD}}.

> I acknowledged that one could certainly do that, but
> since it was somewhat wasteful of heat and electricity,

Separating PARSE from DECODE minimizes the waste heat
as all we are doing is looking at enough bits to route
the instruction to somewhere it can be efficiently DECODEd.
DECODE accesses the register ports and does all sorts of big
gate count decoding; PARSE uses tiny pattern decoders only to
route instructions.

> I didn't think of this as describing a _typical_
> implementation of my ISA (and hence parallel decoding
> was still an excuse for having a block structure rather
> than conventional CISC-like variable-length instructions).

> Well, one of my search results showed that this was how
> they did it on the first 64-bit Opterons, from AMD, so
> that explains why this technique came so readily to
> Mitch's mind!

Burned in solid. Opteron used a trailing marker bit so we
knew whether we were looking at the last byte of an instruction
(or not). My 66000 uses 4 Major OpCode patterns from 001xxx
and then uses 4 bit positions {15,14,13,11} to decode all
VLE size information.
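As a rough illustration of the trailing-marker idea, the sketch below assumes one end-of-instruction bit is available for each fetched byte. The function name and the list representation are hypothetical; the thread does not say how or where the marker bits were stored.

```python
def starts_from_end_bits(end_bits, first_is_start=True):
    """Return byte positions that begin an instruction.

    end_bits[i] is 1 when byte i is the last byte of an instruction.
    A byte starts an instruction when the byte before it ended one.
    """
    starts = []
    prev_ended = first_is_start  # assume the fetch window begins on a boundary
    for i, bit in enumerate(end_bits):
        if prev_ended:
            starts.append(i)
        prev_ended = bool(bit)
    return starts


print(starts_from_end_bits([0, 0, 1, 1, 0, 1, 0, 0]))  # [0, 3, 4, 6]
```

Each start depends only on its neighbour's marker bit, so in this model every position can be checked independently with no serial length chain.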

> John Savard

BGB

Jan 7, 2024, 11:13:47 PM

On 1/6/2024 10:36 AM, Quadibloc wrote:
> Given that I do not know a whole lot about how cache
> coherency is done, and Mitch asked me what approach
> I was planning to take...
>
> I went on a web search to find more information on
> the subject.
>
> I learned that MSI went to MESI... and then there were
> a bunch of "ownership" schemes, such as Berkeley,
> Illinois, Firefly, and Dragon.
>
> By 1999, AMD seems to have done something in that area
> with MOESI, and later on Intel came up with MESIF instead,
> where "F", for Forwarding, is _like_ owned data, but it
> is also saved to RAM. Engineers at Intel recently also
> wrote papers on "MOESI Prime", which has primed versions
> of two of the MOESI states to avoid the cache coherency
> mechanism causing RowHammer-like behavior.
>

I still haven't bothered with this...

Though, yeah, not having cache coherence between cores does make for an
ugly situation in that conventional threading doesn't work if one
schedules multiple threads for the same process on different CPU cores
(and cases where memory sharing is being used may require manual cache
flushing or eviction).

So, proper cache-coherence is still to-do. Need to come up with
something "hopefully cheap" though.

That, or maybe try to convince people to do multithreaded programming
without the assistance of conventional cache coherence (... yeah ...).

Though, at least with direct-mapped caches, it is possible to use dummy
buffers and pointer trickery to knock stuff out of the cache. So, say,
one can write algorithms in ways where shared memory access alternately
accesses the shared memory object, and an alternate dummy address (with
accesses being performed in such a way that cores will knock dirty lines
out to RAM, discard any stale values, and then retrieve the "up to date"
values from RAM).
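A minimal sketch of the address arithmetic behind this trick, assuming a direct-mapped cache indexed by the same address bits the program sees. The cache size, line size, and function names are made up for illustration and are not taken from the post.

```python
CACHE_BYTES = 16 * 1024  # hypothetical direct-mapped cache size
LINE_BYTES = 16          # hypothetical cache line size


def cache_index(addr):
    """Which cache line slot an address maps to."""
    return (addr // LINE_BYTES) % (CACHE_BYTES // LINE_BYTES)


def cache_tag(addr):
    """The tag that distinguishes addresses sharing one slot."""
    return addr // CACHE_BYTES


def evicting_alias(addr, dummy_base):
    """Address inside a dummy buffer that lands in the same slot as addr.

    dummy_base must be aligned to CACHE_BYTES and the buffer must be
    CACHE_BYTES long. Touching the alias forces the line holding addr
    out of the cache, writing it back if it was dirty.
    """
    assert dummy_base % CACHE_BYTES == 0
    return dummy_base + (addr % CACHE_BYTES)


shared = 0x12345
alias = evicting_alias(shared, dummy_base=0x80000)
print(hex(alias))  # 0x82345
```

Here `shared` and `alias` have the same index but different tags, which is what makes one access displace the other.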

The practice is questionable, though, as it does not work with associative
caches (and would require multiple sets of accesses to various addresses
to deal with multi-level eviction, say, to get things out of the L2
cache and into DRAM, and/or convoluted access patterns to evict things
from a 2-way cache, ...).

Then again, maybe one can argue that by the time one is using
associative caches, one can probably justify having proper cache
coherence?...

Well, there is this, and there is accessing memory through a "no-cache"
address (which has an auto-evict mechanism), but this mechanism seems to
be somehow slower than just going through MMIO (or using sets of
alternating memory addresses to knock things out of the various cache
levels).

> Anyways... there was something else I found while looking
> this stuff up.
>
> I had noted that one of the reasons for offering the
> programmer a choice of writing programs with 32-bit
> long instructions and nothing but 32-bit long instructions,
> or using block headers for blocks of 256 bits in code,
> was to allow instructions to be decoded in parallel.
>

Yeah, this is part of why only 32-bit encodings ended up allowed in
bundles...

Allowing variable length instructions in bundles would increase the
number of decoders required (and more complicated/expensive logic to MUX
the outputs of those decoders).

> Mitch pointed out that one could just start decoding
> in parallel at every possible instruction start location,
> while also, in parallel, quickly resolving instruction
> lengths so as to find which decodes result in executions.
>
> I acknowledged that one could certainly do that, but
> since it was somewhat wasteful of heat and electricity,
> I didn't think of this as describing a _typical_
> implementation of my ISA (and hence parallel decoding
> was still an excuse for having a block structure rather
> than conventional CISC-like variable-length instructions).
>

In my case, the same basic logic was overloaded for both bundles and
64/96 bit instructions. As far as decoding is concerned, the jumbo
prefixes are instructions (just with some horizontal decoding magic
glued on).

> Well, one of my search results showed that this was how
> they did it on the first 64-bit Opterons, from AMD, so
> that explains why this technique came so readily to
> Mitch's mind!
>

But, not necessarily cheap.

> John Savard

EricP

Jan 8, 2024, 12:20:15 PM

MitchAlsup wrote:
> Quadibloc wrote:
>
>> Given that I do not know a whole lot about how cache
>> coherency is done, and Mitch asked me what approach
>> I was planning to take...
>
>> I went on a web search to find more information on
>> the subject.
>
>> I learned that MSI went to MESI... and then there were
>> a bunch of "ownership" schemes, such as Berkeley,
>> Illinois, Firefly, and Dragon.
>
>> By 1999, AMD seems to have done something in that area
>> with MOESI, and later on Intel came up with MESIF instead,
>> where "F", for Forwarding, is _like_ owned data, but it
>> is also saved to RAM. Engineers at Intel recently also
>> wrote papers on "MOESI Prime", which has primed versions
>> of two of the MOESI states to avoid the cache coherency
>> mechanism causing RowHammer-like behavior.

The Forward state is to address the issue of who should respond to a
request for a shared copy of a line when there are multiple sharers.
If multiple sharers respond it could flood a requester with redundant
messages.

The Directory Controller (DC) records which lines are held in each core
in what state. It remembers the most recent core to read-share a line
as the Forward state, on the assumption that copy is most likely still
resident, while the prior readers are tracked in a Shared state.

The cache with the line in a Forward state is told to send a shared copy to
a read-shared requester, who becomes the line's new Forward state holder.
If no Forward copy is available the DC reads from DRAM.
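The Forward-state bookkeeping described above can be sketched as a toy directory. This is a simplified model under stated assumptions: the class and method names are hypothetical, only read-share requests and evictions are modelled, and writes and invalidations are left out.

```python
class Directory:
    """Toy directory controller tracking only Forward and Shared holders."""

    def __init__(self):
        self.forward = {}  # line address -> core holding the Forward copy
        self.sharers = {}  # line address -> set of cores in Shared state

    def read_shared(self, line, core):
        """Handle a read-share request; return who supplies the data."""
        holder = self.forward.get(line)
        if holder == core:
            return core  # requester already holds the Forward copy
        sharers = self.sharers.setdefault(line, set())
        sharers.discard(core)
        if holder is None:
            source = "DRAM"  # no Forward copy available
        else:
            source = holder
            sharers.add(holder)  # previous Forward holder drops to Shared
        self.forward[line] = core  # most recent reader becomes Forward
        return source

    def evict(self, line, core):
        """Record that a core silently dropped its copy of the line."""
        if self.forward.get(line) == core:
            del self.forward[line]
        self.sharers.get(line, set()).discard(core)


d = Directory()
print(d.read_shared(0x40, core=0))  # DRAM
print(d.read_shared(0x40, core=1))  # 0
```

Only one cache ever answers a request, which is the point of the state: the remaining sharers stay silent.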

> The OWNED state represents the concept that this copy is the
> only valid copy, so you better not lose it. A request can
> arrive back with OWNED data (in some protocols) and now the recipient is
> in charge of not losing it.

Also, OWNED is the modified-shared state, where the owner modifies a line
and then shares read-only copies of it. The ownership can be passed to a
new cache without writing it back to DRAM or invalidating the shared copies.
To modify the line again the owner has to invalidate all the shared copies
first to return it to the Exclusive state.
When the owner eventually evicts the line, it is responsible for writing
it back to DRAM.

MitchAlsup

Jan 8, 2024, 5:17:08 PM

Granted

> When the owner eventually evicts the line, it is responsible for writing
> it back to DRAM.

Or sending it to another cache that can take OWNERship over it.
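The OWNED behaviour discussed in this subthread can be sketched as a small state model for a single cache line. It is a simplification, not any vendor's protocol: the `Line` class and its methods are hypothetical, data movement is not modelled, and a re-write by the owner lands in MODIFIED here (the usual MOESI name for a dirty sole copy) rather than the Exclusive state named above.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Line:
    """Per-core states for one cache line, plus a DRAM write-back counter."""

    def __init__(self, n_cores):
        self.state = [State.INVALID] * n_cores
        self.dram_writes = 0

    def read(self, core):
        if self.state[core] is not State.INVALID:
            return  # hit, nothing changes
        others = [c for c, s in enumerate(self.state) if s is not State.INVALID]
        for c in others:
            if self.state[c] is State.MODIFIED:
                self.state[c] = State.OWNED  # dirty copy is shared, not written back
            elif self.state[c] is State.EXCLUSIVE:
                self.state[c] = State.SHARED
        self.state[core] = State.SHARED if others else State.EXCLUSIVE

    def write(self, core):
        # Invalidate every other copy before modifying the line.
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = State.INVALID
        self.state[core] = State.MODIFIED

    def evict(self, core, new_owner=None):
        dirty = self.state[core] in (State.MODIFIED, State.OWNED)
        self.state[core] = State.INVALID
        if not dirty:
            return
        if new_owner is not None and self.state[new_owner] is State.SHARED:
            self.state[new_owner] = State.OWNED  # hand ownership to a sharer
        else:
            self.dram_writes += 1  # nobody took ownership, write back to DRAM


line = Line(3)
line.write(0)
line.read(1)
line.evict(0, new_owner=1)
print(line.state[1], line.dram_writes)  # State.OWNED 0
```

The write-back counter stays at zero until a dirty line is evicted with no cache left to take ownership.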
