Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

An Alternative to Predication

97 views
Skip to first unread message

Quadibloc

unread,
Apr 6, 2023, 11:52:46 AM4/6/23
to
Branches are bad!

This is why many ISAs which intend to achieve high performance
include the option of changing the flow of execution through
predicated instructions.

But that doesn't seem to be an ideal solution, because the instructions
that aren't executed still at least get fetched from memory, and likely
are processed in some other ways as well.

Why are branches bad?

Essentially, it's because they cause pipeline flushes - the computer
reads ahead in the code, perhaps assuming that any conditional
branches aren't taken, or using sophisticated branch prediction logic,
so that it's only a branch if it guesses wrong... and now the results
from reading ahead are all useless.

Hmm. Perhaps my idea won''t solve this problem.

I was thinking that perhaps one could use a pair of instructions -

BLBB, ELBB - begin local branch block, end local branch block

which, like Mitch's VVM instructions, delimit code which otherwise
isn't modified. What it directs is that the block of code be given priority
to be saved in the micro-op cache because it contains lots of little
local branches to carry out some complicated logic.

It can also serve as a warning that the branches are hard to predict, and
so either the computer looks ahead down both paths if possible, or it
avoids looking ahead, letting other threads run instead, so only latency,
but not throughput, is impacted by the branches.

John Savard

MitchAlsup

unread,
Apr 6, 2023, 1:12:54 PM4/6/23
to
On Thursday, April 6, 2023 at 10:52:46 AM UTC-5, Quadibloc wrote:
> Branches are bad!
>
> This is why many ISAs which intend to achieve high performance
> include the option of changing the flow of execution through
> predicated instructions.
>
> But that doesn't seem to be an ideal solution, because the instructions
> that aren't executed still at least get fetched from memory, and likely
> are processed in some other ways as well.
>
> Why are branches bad?
>
> Essentially, it's because they cause pipeline flushes - the computer
> reads ahead in the code, perhaps assuming that any conditional
> branches aren't taken, or using sophisticated branch prediction logic,
> so that it's only a branch if it guesses wrong... and now the results
> from reading ahead are all useless.
<
Predication accepts reading the instructions not executed in order
to avoid disrupting the front-end. In other words, If the front-end will
fetch the first instruction at the convergence point by the time the
branch in the then clause would execute, then predication has saved
you the pipeline flush !! Pipeline flushes are expensive in power and
sometimes in cycles.
>
> Hmm. Perhaps my idea won''t solve this problem.
>
> I was thinking that perhaps one could use a pair of instructions -
>
> BLBB, ELBB - begin local branch block, end local branch block
>
> which, like Mitch's VVM instructions, delimit code which otherwise
> isn't modified. What it directs is that the block of code be given priority
> to be saved in the micro-op cache because it contains lots of little
> local branches to carry out some complicated logic.
<
That is too many instructions. My 66000 Predication uses 1 PRED
instruction to describe the then-clause extent and the else-clause
extent and by implication the sum of then and else clauses which
is the convergence point. And does this in 1 instruction. Anytime
a predicated instruction sequence has both a then-clause and an
else clause, you have saved 1 branch instruction.

luke.l...@gmail.com

unread,
Apr 7, 2023, 8:02:37 AM4/7/23
to
On Thursday, April 6, 2023 at 4:52:46 PM UTC+1, Quadibloc wrote:
> Branches are bad!
>
> This is why many ISAs which intend to achieve high performance
> include the option of changing the flow of execution through
> predicated instructions.
>
> But that doesn't seem to be an ideal solution, because the instructions
> that aren't executed still at least get fetched from memory, and likely
> are processed in some other ways as well.

look up Zero-Overhead-Loop-Control.

https://ieeexplore.ieee.org/document/1692906
https://www.researchgate.net/publication/3351728_Zero-overhead_loop_controller_that_implements_multimedia_algorithms

it achieved a whopping 45% reduction in instruction execution
count on a simple in-order system compared to more complex
micro-architectures, for complex MPEG motion-estimation.

*SIX* level nested *conditional* zero-overhead-loops.

absolutely astonishing. some benchmarks reduced by *eighty*
percent instruction count.

ST Micro were so impressed they actually put it into a
production silicon product.

l.

Terje Mathisen

unread,
Apr 7, 2023, 9:51:55 AM4/7/23
to
The most obvious limitation here is probably that they are comparing
against scalar multimedia code, the specific case where SIMD already
provides the largest speedups. I.e. the Pentium-MMX with 8-wide byte
operations overlayed on the 64-bit x87 FP mantissas were fast enough
that the Zoran SoftDVD player was able to keep up with zero frame skips.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

unread,
Apr 7, 2023, 10:40:29 AM4/7/23
to
On Friday, April 7, 2023 at 7:02:37 AM UTC-5, luke.l...@gmail.com wrote:
> On Thursday, April 6, 2023 at 4:52:46 PM UTC+1, Quadibloc wrote:
> > Branches are bad!
> >
> > This is why many ISAs which intend to achieve high performance
> > include the option of changing the flow of execution through
> > predicated instructions.
> >
> > But that doesn't seem to be an ideal solution, because the instructions
> > that aren't executed still at least get fetched from memory, and likely
> > are processed in some other ways as well.
> look up Zero-Overhead-Loop-Control.
<
This is what the LOOP instruction does in VVM.

luke.l...@gmail.com

unread,
Apr 7, 2023, 7:00:55 PM4/7/23
to
On Friday, April 7, 2023 at 2:51:55 PM UTC+1, Terje Mathisen wrote:

> The most obvious limitation here is probably that they are comparing
> against scalar multimedia code, the specific case where SIMD already
> provides the largest speedups. I.e. the Pentium-MMX with 8-wide byte
> operations overlayed on the 64-bit x87 FP mantissas were fast enough
> that the Zoran SoftDVD player was able to keep up with zero frame skips.

it would be unfair to compare apples v oranges. thinking about it:
if ZOLC were to be applied to a SIMD baseline core it would equally
cut the number of branch-conditionals etc. but yes not the amount
of operations done *inside* any given loop.

but the specific problem with MPEG motion-estimation is that
it is an absolute bitch-of-a-task. there's six levels of nested loops
in the algorithm, with conditional jumping in and out of two or
more levels *and back*, which wreaks absolute havoc with any
traditional CPU branch-prediction, regardless of whether it has
SIMD or not.

the bit you missed Terje is that the CPU that had ZOLC added to
it *had no branch-prediction* and yet still managed to piss all over
cores that did.

l.

luke.l...@gmail.com

unread,
Apr 7, 2023, 7:04:04 PM4/7/23
to
On Friday, April 7, 2023 at 3:40:29 PM UTC+1, MitchAlsup wrote:


> This is what the LOOP instruction does in VVM.

aiui, only one level, though, Mitch (only one LOOP which
needs no test-and-branch-back). ZOLC had as many
*nested* loops on a stack that was as large as your
architectural state context-switch could tolerate,
and the ability to make decisions to push/pop/swap
which loop to jump into or out of, at any time, in
a 100% forward-predictable fashion that resulted
in *zero stalls* even though ZOLC Loops could be
6-deep-nested.

that's... insanely impressive.

l.

MitchAlsup

unread,
Apr 7, 2023, 9:10:26 PM4/7/23
to
On Friday, April 7, 2023 at 6:04:04 PM UTC-5, luke.l...@gmail.com wrote:
> On Friday, April 7, 2023 at 3:40:29 PM UTC+1, MitchAlsup wrote:
>
>
> > This is what the LOOP instruction does in VVM.
> aiui, only one level, though, Mitch (only one LOOP which
> needs no test-and-branch-back). ZOLC had as many
> *nested* loops on a stack that was as large as your
> architectural state context-switch could tolerate,
<
90% of the gain is in the innermost loop.

Stephen Fuld

unread,
Apr 8, 2023, 2:00:30 AM4/8/23
to
On 4/7/2023 4:00 PM, luke.l...@gmail.com wrote:
> On Friday, April 7, 2023 at 2:51:55 PM UTC+1, Terje Mathisen wrote:
>
>> The most obvious limitation here is probably that they are comparing
>> against scalar multimedia code, the specific case where SIMD already
>> provides the largest speedups. I.e. the Pentium-MMX with 8-wide byte
>> operations overlayed on the 64-bit x87 FP mantissas were fast enough
>> that the Zoran SoftDVD player was able to keep up with zero frame skips.
>
> it would be unfair to compare apples v oranges. thinking about it:
> if ZOLC were to be applied to a SIMD baseline core it would equally
> cut the number of branch-conditionals etc. but yes not the amount
> of operations done *inside* any given loop.
>
> but the specific problem with MPEG motion-estimation is that
> it is an absolute bitch-of-a-task. there's six levels of nested loops
> in the algorithm, with conditional jumping in and out of two or
> more levels *and back*, which wreaks absolute havoc with any
> traditional CPU branch-prediction, regardless of whether it has
> SIMD or not.

If there is source code readily available for this, it seems like it
might be an interesting benchmark program, both for compilers and for
architectures.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

luke.l...@gmail.com

unread,
Apr 8, 2023, 4:28:51 AM4/8/23
to
On Saturday, April 8, 2023 at 7:00:30 AM UTC+1, Stephen Fuld wrote:

> > but the specific problem with MPEG motion-estimation is that
> > it is an absolute bitch-of-a-task. there's six levels of nested loops
> > in the algorithm, with conditional jumping in and out of two or
> > more levels *and back*, which wreaks absolute havoc with any
> > traditional CPU branch-prediction, regardless of whether it has
> > SIMD or not.
> If there is source code readily available for this, it seems like it
> might be an interesting benchmark program, both for compilers and for
> architectures.

it was an extremely complex research project, funded back in
2007-2008 by the European Union. it took me a long time to
piece together and because the people funded pulled the
contents off the public website pretty much immediately after
the grant was finished, i had to email one of the authors to
get a copy of the VHDL.

the compiler-augmentation scripts, however, which subdivide
(mark) programs identifying Basic Blocks, etc. etc., are all
still available online.

yes of course MPEG source code is readily available,
try ffmpeg.

l.

luke.l...@gmail.com

unread,
Apr 8, 2023, 4:31:25 AM4/8/23
to
On Saturday, April 8, 2023 at 2:10:26 AM UTC+1, MitchAlsup wrote:

> 90% of the gain is in the innermost loop.

indeed. VVM is pretty high bang-per-buck in that regard,
and doesnt have the (associated) disadvantage of ZOLC
nested-looping that the compilers have no idea how it works.
the original researchers had to create special workaround
tools that inserted assembler "shims" into binaries.

instead VVM's single-loop benefits are brain-dead-simple
for compilers to add.

l.

Terje Mathisen

unread,
Apr 8, 2023, 7:26:59 AM4/8/23
to
I did not miss that, and yes, it is impressive. I am just saying that
(just like FPGA) it is not quite that large an improvement when you
compare against optimized/competent code.

Quadibloc

unread,
Apr 8, 2023, 10:48:41 AM4/8/23
to
On Saturday, April 8, 2023 at 2:28:51 AM UTC-6, luke.l...@gmail.com wrote:

> it was an extremely complex research project, funded back in
> 2007-2008 by the European Union. it took me a long time to
> piece together and because the people funded pulled the
> contents off the public website pretty much immediately after
> the grant was finished,

Would it be possible that the Wayback Machine might help with that?

John Savard

Brian G. Lucas

unread,
Apr 12, 2023, 3:18:20 PM4/12/23
to
If only "compilers" could do it by themselves (ChatGPT?) Otherwise the poor
compiler author must do it. I'm not as smart as I used to be, but I'm not
"brain-dead" yet :-)

brian
> l.

0 new messages