
Misc. architectural features


Elijah Stone

Jan 21, 2023, 12:51:40 AM
I will lead with the caveat that I am not a hardware person. However, I am
interested in computer architecture, and am curious whether any of the
following architectural features have been proposed or studied at all in the
past. If somebody could comment on them, or point at existing resources, that
would be great. Thanks!

Optional branches

An optional branch is one that, just like a regular branch, must be taken if
its condition is true. However, if the condition turns out to be false, the
hardware may, at its discretion, either take the branch or not take the
branch. The advantage is that, if the branch was predicted taken, but the
condition turned out to be false, it is not necessary to roll back any state.

The idea is that software can use these in cases where it has a fast path and
a slow path, and the fast path works for only a subset of inputs, where the
slow path works for all inputs. It's a performance win when the fast path is
fast enough to be worth having, but the difference in performance between the
fast and slow paths is less than that of a mispredict.
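The tradeoff above can be made concrete with a toy cost model (the cycle counts below are invented purely for illustration and do not reflect any real microarchitecture):

```python
def expected_cost(p_slow, accuracy, fast=5, slow=12, flush=15,
                  optional=False):
    """Expected cycles for one execution of the branch.

    p_slow:   probability the condition is true (slow path required)
    accuracy: branch-predictor accuracy
    With a regular branch, any wrong prediction costs a pipeline flush
    on top of the correct path.  With an optional branch, predicting
    "taken" (slow path) when the condition is false is architecturally
    allowed, so no rollback is needed in that case.
    """
    hit, miss = accuracy, 1 - accuracy
    cost_true = hit * slow + miss * (flush + slow)   # same either way
    if optional:
        cost_false = hit * fast + miss * slow        # just run slow path
    else:
        cost_false = hit * fast + miss * (flush + fast)
    return p_slow * cost_true + (1 - p_slow) * cost_false

regular = expected_cost(p_slow=0.1, accuracy=0.9)
opt = expected_cost(p_slow=0.1, accuracy=0.9, optional=True)
print(regular, opt)  # 7.2 vs 6.48 with these made-up numbers
```

With these numbers the optional branch comes out ahead, matching the stated condition: slow - fast = 7 cycles is less than the 15-cycle flush.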

This feature actually already exists on GPUs
(https://registry.khronos.org/OpenGL/extensions/ARB/ARB_shader_group_vote.txt),
but branches obviously work very differently on GPUs.

Aliasing tags

Compilers can take advantage of these transparently, with no source-level
changes (though such changes may help). Instructions that
operate on memory locations can specify a tag with a few bits; the behaviour
is undefined if any two memory locations with different tags alias. This
simplifies memory disambiguation, removing the need to spend so many resources
tracking and predicting it.

Lightweight fences can be provided, possibly implicitly at subroutine
boundaries, such that two operations on opposite sides of a fence are _always_
allowed to alias.
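To make the contract concrete, here is a toy software checker for it (the class and method names are invented; a real implementation would live in the load/store unit, not in software):

```python
class AliasTagChecker:
    """Models the rule: within a fence-delimited region, behaviour is
    undefined if two accesses with different tags touch the same address."""

    def __init__(self):
        self.seen = {}  # address -> tag observed in the current region

    def access(self, addr, tag):
        prev = self.seen.setdefault(addr, tag)
        if prev != tag:
            raise RuntimeError(
                f"UB: address {addr:#x} accessed with tags {prev} and {tag}")

    def fence(self):
        # Across a fence, any two operations are again allowed to alias.
        self.seen.clear()

chk = AliasTagChecker()
chk.access(0x1000, tag=0)   # loads the compiler attributes to array A
chk.access(0x2000, tag=1)   # stores to a supposedly disjoint array B
chk.fence()                 # e.g. an implicit fence at a call boundary
chk.access(0x1000, tag=1)   # fine: the fence reset the no-alias obligation
```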

Coalesceable branches

This one I'm least sure of.  The idea is to do less bookkeeping for things
like GC barriers and bounds checks, where the slow path is very slow and can
afford to do some of the state reconstruction in software.  A coalesceable
branch is a special type of branch with an associated tag.  If the
condition of a coalesceable branch is true, you can either take that branch,
_or_ take any previous coalesceable branch with the same tag. There can again
be lightweight fences to enable local reasoning in code generation.
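One way to read this (my interpretation, with invented names): each tag gets a sticky pending flag, checks merely set it, and a later branch with the same tag fires one shared slow path that re-derives in software which check actually failed:

```python
class CoalesceableBranches:
    """Toy model: a true check with some tag may be serviced by any later
    coalesceable branch carrying that tag, so the slow path must
    reconstruct which condition(s) actually fired."""

    def __init__(self):
        self.pending = set()   # tags with an unserviced true condition

    def check(self, tag, cond):
        if cond:
            self.pending.add(tag)   # cheap: no per-check precise state

    def branch(self, tag, slow_path):
        # A fence-like point: if anything with this tag fired, take the
        # slow path once, here, rather than precisely at the failing check.
        if tag in self.pending:
            self.pending.discard(tag)
            slow_path()

values, bound = [1, 9, 2, 8], 5
cb = CoalesceableBranches()
failures = []
for v in values:
    cb.check("bounds", v >= bound)               # just ORs into a flag
cb.branch("bounds", lambda: failures.extend(
    i for i, v in enumerate(values) if v >= bound))  # software rescan
print(failures)  # [1, 3]
```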

---------------------------------------------------------------------------

There's a common theme here: restrictions are added on the software side, and
freedom added to the hardware side. The hardware could choose to not make use
of that freedom, and continue operating as it always has. I'm not quite sure
what to make of that.

Thoughts?

-E

Paul A. Clayton

Jun 20, 2023, 1:12:59 PM
On 1/21/23 12:51 AM, Elijah Stone wrote:
> I will lead with the caveat that I am not a hardware person.
> However, I am interested in computer architecture, and am curious
> whether any of the following architectural features have been
> proposed or studied at all in the past.  If somebody could comment
> on them, or point at existing resources, that would be great.
> Thanks!

I have not done a search, but I do not remember reading about any
of these features.

> Optional branches
>
> An optional branch is one that, just like a regular branch, must
> be taken if its condition is true.  However, if the condition
> turns out to be false, the hardware may, at its discretion, either
> take the branch or not take the branch.  The advantage is that, if
> the branch was predicted taken, but the condition turned out to be
> false, it is not necessary to roll back any state.
>
> The idea is that software can use these in cases where it has a
> fast path and a slow path, and the fast path works for only a
> subset of inputs, where the slow path works for all inputs.  It's
> a performance win when the fast path is fast enough to be worth
> having, but the difference in performance between the fast and
> slow paths is less than that of a mispredict.

This has several minor issues. First, a compiler for most
languages would not have the information to know when the slow
path is semantically identical or *especially* if it is close
enough for the intended use. Requiring special annotation to use
the feature will hinder software adoption.

The feature also has somewhat limited utility; the slow path has
to be valid in both cases, and the expected residue of the slow
path after condition resolution should be less resource-consuming
than redirecting the branch and executing the fast path.

Most programmers are probably not inclined to measure or estimate
the relative costs of slow and fast paths and branch misprediction
recovery costs. In addition, the actual cost/benefit would depend
strongly on branch prediction accuracy. With binary compatibility,
one does not want to be stuck with a feature which was somewhat
useful in one microarchitecture but more bother than value in
other microarchitectures (e.g., improved branch prediction,
cheaper branch recovery [including earlier resolution], low
confidence predictions having slower and more energy efficient
execution [a slow path prediction seems likely to have lower
confidence]).

In some ways this is similar to static branch direction hints.
Although such have been implemented and they could be used in
conjunction with dynamic prediction (e.g., providing a basis for
agree prediction, filtering prediction types), the benefit was
reduced by inaccurate hints (which can be from different biases
depending on inputs — which can even fool profile-based
optimization — or programmer/compiler mistakes) and improvements
in prediction techniques.

By the way, you might be interested in the idea of random branches
(Edward Lee and Craig Zilles, "Branch-on-Random", 2008), which
proposed a probabilistic branch in support of software sampling.
Hardware support for highly configurable performance counters
reduces the utility of such, but the idea is interesting. One
might also consider using the entropy string as a precomputed
branch direction buffer (or prediction stream); a 50% probable
branch using an entropy stream is the same as an ordinary branch
using a precomputed direction stream (multi-way branches could
also be handled by such).
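That equivalence can be sketched in a few lines (purely illustrative): a branch that consumes one bit per execution from a stream is Branch-on-Random when the stream is random bits with P(taken) = p, and a "perfectly predicted" branch when the stream is a precomputed direction trace:

```python
import random

def take_directions(n_branches, stream):
    """Each branch execution consumes one bit from the stream; the bit
    *is* the direction taken (no condition is evaluated at all)."""
    it = iter(stream)
    return [next(it) for _ in range(n_branches)]

# Branch-on-Random: the stream is random bits with P(taken) = 0.25,
# e.g. to enter a profiling slow path on a sample of executions.
rng = random.Random(42)
sampling = take_directions(8, (rng.random() < 0.25 for _ in range(8)))

# Precomputed-direction branch: the stream is a direction trace computed
# ahead of time, so every "prediction" resolves correctly by construction.
trace = [True, False, False, True]
replayed = take_directions(4, iter(trace))
print(replayed)  # [True, False, False, True]
```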

Branch direction (prediction) precomputation has been broadly
suggested (e.g., Bill Appelbe et al., "Hoisting Branch Conditions
— Improving Super-Scalar Processor Performance", 1996; Robert S.
Chappell et al., "Difficult-Path Branch Prediction Using
Subordinate Microthreads", 2002; and Yasuo Ishii, "Fused Two-Level
Branch Prediction With Ahead Calculation", 2006 [which proposed
caching predictions from high-latency neural predictors]),
especially for loops, whether one iteration ahead (Virgil
Andronache et al., "A Perfect Branch Prediction Technique for
Conditional Loops", 1998) or across multiple iterations (Rami
Sheikh et al., "Control-Flow Decoupling", 2012).

> This feature is actually already on gpus
> (https://registry.khronos.org/OpenGL/extensions/ARB/ARB_shader_group_vote.txt),
> but obviously branches work very differently on gpus.
>
> Aliasing tags
>
> These can be transparently taken advantage of by compilers with no
> source-level changes (though the latter may be helpful).
> Instructions that operate on memory locations can specify a tag
> with a few bits; the behaviour is undefined if any two memory
> locations with different tags alias.  This simplifies memory
> disambiguation, removing the need to spend so many resources
> tracking and predicting it.
>
> Lightweight fences can be provided, possibly implicitly at
> subroutine boundaries, such that two operations on opposite sides
> of a fence are _always_ allowed to alias.

I imagined a similar mechanism; it addresses a real problem
(improving store forwarding prediction accuracy and/or cost and
potentially facilitating cache clustering). However, with somewhat
frequent fences, the scope for benefit seems very limited. For
alias prediction, the benefit is relatively small — dynamic alias
prediction is fairly accurate (even with a simple predictor) and
replay can be somewhat inexpensive given that a forwarding store
has to wait for the address to be confirmed (ignoring energy/heat
costs, wasted execution in a single-threaded core is cheap).

For cache clustering/banking, such would not seem to be enough of
a guarantee. Access-granular information would not work well with
cache block granular allocation. Also, fences would seem to
motivate changing the alias tag from a cluster/bank selector to (a
contributor to) the index of a bank predictor. (Cache clustering
has other issues; in addition to utilization imbalance, cross-
cluster communication would be difficult to avoid [references in
one cluster pointing to contents in another and general operand
sharing — the latter is more tractable because functional units
can be placed close]. Since the potential benefit of cache
clustering is significant — two or three times the capacity [cache
clusters east, west, and possibly north of the compute units] at
similar latency — I think more research is justified, but the
challenges are significant and some research has been done.)

For alias prediction, I suspect such would not be worthwhile, but
there may be other uses for this kind of information.

> Coalesceable branches
>
> This one I'm least sure of.  The idea is to do less bookkeeping
> for things like gc barriers and bounds checks, where the slow path
> is very slow and can afford to do some of the state reconstruction
> in software.  A coalesceable branch is a special type of branch,
> which has a tag associated.  If the condition of a coalesceable
> branch is true, you can either take that branch, _or_ take any
> previous coalesceable branch with the same tag.  There can again
> be lightweight fences to enable local reasoning in code generation.

I do not understand the intended use for this. I get the
impression that this is proposing something like imprecise
exceptions where software manages state consistency. I think you
are trying to exploit the low frequency of exceptions (such as
bounds check failures) and already slow handling of such special
cases to reduce the ordinary overhead of tracking state precisely.

I.e., rather than having hardware recover the precise state at a
given coalesceable branch and have to wait until the branch is
retired before committing later instructions, a sticky condition
would be set (if the condition is determined late, the branch
would be skipped rather than holding up execution waiting) and
eventually another coalesceable branch with the same tag would be
triggered and fix up the state.

The functionality I describe — which might not be what you meant —
could have some benefits, but _selectively_ undoing/redoing work
is more complex than checkpointing and retrying. Limited imprecise
exceptions have been implemented with hardware/software
collaboration, but I suspect hardware is generally better at
detail tracking. Software might have more information about what
is "don't care" (logically dead values, cheaply recomputed/
reloaded values) and where checkpointing would be most effective,
but efficient communication between the layers seems challenging.

Itanium had speculation check instructions which would effectively
generate a branch to fix-up code. Even though the speculation and
fix-up were localized to one specific load (if I remember
correctly), I get the impression that compilers did not use this
effectively or even necessarily correctly. Requiring fix-up code
to handle multiple possible exception origins seems likely to be
even less tractable.

(This brings to mind transactional memory where some operations
can escape the transaction such that less state needs to be
checkpointed, monitored, and recovered. Similarly, partial
completions and retries have been proposed, but the complexity is
daunting and the utility limited.)

>
> ---------------------------------------------------------------------------
>
>
> There's a common theme here: restrictions are added on the
> software side, and freedom added to the hardware side.  The
> hardware could choose to not make use of that freedom, and
> continue operating as it always has.  I'm not quite sure what to
> make of that.
>
> Thoughts?

I certainly approve of looking at hardware-software codesign.
Software hints and quasi-directives are a way of caching
information in the "highest level cache" of compiled code (source
code might be analogous to main memory); I think there is scope
for not having hardware always dynamically regenerate useful
information. (The Mill team have proposed more persistently
caching control flow metadata, but there may be other metadata
worth preserving.)

Cache sizing (including associativity and level depth),
replacement, and sharing are difficult problems. Prefetching can
help but has issues with timeliness (too soon wastes buffer space,
too late reduces benefit) and accuracy (useless prefetching wastes
bandwidth and buffer capacity and adds tracking overhead).
Dynamically improved/discovered hint-like information can be
discarded rather than written back, but knowing when the
information is valuable enough to retain is a challenge.

By the way, I do think quasi-directives and "conservative filters"
are interesting. Cache-block-allocate instructions are an example;
on a cache hit zeroing can be avoided while providing the same
avoidance of read for ownership (and it can work on non-cacheable
memory; if the potential accidental communication is not a
concern, as in some embedded uses, even cache misses could avoid
block zeroing). Unlike cache-block-zero instructions, such can be
ignored without impacting architectural guarantees. Unlike a hint,
such can change architectural state (e.g., by zeroing memory that
is not yet overwritten or — ignoring the security issue —
inserting "arbitrary" data).
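The ignorable-versus-architectural distinction can be sketched with a toy cache model (the class and method names are invented; the zeroing variant is in the spirit of PowerPC's dcbz):

```python
class ToyCache:
    def __init__(self):
        self.lines = {}                      # block address -> 64B contents

    def block_zero(self, addr):
        # Architectural: the block *must* read as zero afterwards,
        # so an implementation cannot drop this.
        self.lines[addr] = bytes(64)

    def block_allocate(self, addr, implement=True):
        # Quasi-directive: establish the line without a read-for-ownership.
        # Because the program promises to overwrite the whole block before
        # reading it, an implementation may ignore this entirely.
        if implement:
            self.lines.setdefault(addr, bytes(64))

c1, c2 = ToyCache(), ToyCache()
c1.block_allocate(0x100, implement=True)
c2.block_allocate(0x100, implement=False)    # legal: it is ignorable
# Either way, the program now overwrites the whole block before reading:
for c in (c1, c2):
    c.lines[0x100] = b"\x01" * 64
assert c1.lines == c2.lines                  # same architectural outcome
```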

I do not know if any of the above is useful, but Juneteenth gave
me a two day weekend with the libraries (where I work) closed,
which reduced my casual Internet use and time spent hanging around
work and removed some commute time and work time. (Yes, this time
off has generated a significant comp.arch posting spurt.)