Let's say I have some code like this:
and. 23,25,27 # branch based on this
lha 3,0(29) # (almost) always cached
addi 28,28,1
srawi 27,27,1
li 23,32
cmpwi 1,28,7 # (this has nothing to do with this branch)
dcbt 23,29
bc 4,2,.L7
which doesn't do that (since I can't find anything else useful to do
before the bc, really). Will the presence of the dcbt (a serializing
instruction with a 2-cycle latency) have any effect to sort of "extend"
the execution of the 6 instructions to allow the branch to resolve rather
than be predicted? In this particular case, the static branch prediction
in the 603e will do little good, since it's about a 50/50 chance the
branch will be followed. Assume that I will get a L1 cache hit for the
lha, BTW - that's pretty much guaranteed in this case. Would moving the
dcbt up a few instructions help in any way? This is all based on the
assumption that the dcvbt will actually execute, of course.
Alternatively, would it be preferable to drop some noops in here to allow
the branch to resolve, or should I just eat any misprediction penalty?
Finally, is Tim Olson going to yell at me and say that this is only
4% of my program? ;) ;) ;)
--
Latvian Orthodox Priest: Is there one aspect of the faith that you find
particularly attractive?
George Costanza: I think - the hats.
-- Seinfeld
> Finally, is Tim Olson going to yell at me and say that this is only
> 4% of my program? ;) ;) ;)
I don't know about Tim, but you raise a good point, measure and make
sure that code is worth the attention you are paying to it before you go on
with cycle counting.
Further, why not a hint or two about what this code does? There might
be all kinds of other avenues to look at.
Rob
>> Finally, is Tim Olson going to yell at me and say that this is only
>> 4% of my program? ;) ;) ;)
> I don't know about Tim, but you raise a good point, measure and make
>sure that code is worth the attention you are paying to it before you go on
>with cycle counting.
I was referring to a discussion I had with Tim about JPEG/MPEG a few
days ago in csma. Thus the winking smileys.
> Further, why not a hint or two about what this code does? There might
>be all kinds of other avenues to look at.
It will be executing quite a bit. Basically, it is the code that avoids
doing an 8-point IDCT on a row of coefficients in an MPEG player - some
of the instructions are common to whether or not the branch is followed
(i.e., the instructions for the loop and the cache hint, as well as the
load of the DC coefficient and the shift of the mask for the marker which
tells us whether or not we have to do the math for the IDCT or can cheat,
which is set up at another point in the program). It will run 8 times
for every block that requires an IDCT. My guess is that a 640x480 MPEG
would require about 100,000 IDCTs per second (assuming 12-bit YCbCr),
so this would be executed 800,000 times per second in that case (assuming
that some mythical 603e could handle such a high bitrate stream).
However, I present the example to gain some insight on the PowerPC more
than as an effort to seriously enhance performance. I should have made
that more clear. My approach is already much faster than the approach
taken in all of the other open-source MPEG players that I have seen.
(Although I will (at some point) investigate using a faster algorithm
when I have actually to do the math).
> Rob Barris posted the following to comp.sys.powerpc.tech:
>
> >> Finally, is Tim Olson going to yell at me and say that this is only
> >> 4% of my program? ;) ;) ;)
>
> > I don't know about Tim, but you raise a good point, measure and make
> >sure that code is worth the attention you are paying to it before you go on
> >with cycle counting.
>
> I was referring to a discussion I had with Tim about JPEG/MPEG a few
> days ago in csma. Thus the winking smileys.
>
> > Further, why not a hint or two about what this code does? There might
> >be all kinds of other avenues to look at.
>
> It will be executing quite a bit. Basically, it is the code that avoids
> doing an 8-point IDCT on a row of coefficients in an MPEG player - some
> of the instructions are common to whether or not the branch is followed
> (i.e., the instructions for the loop and the cache hint, as well as the
> load of the DC coefficient and the shift of the mask for the marker which
> tells us whether or not we have to do the math for the IDCT or can cheat,
> which is set up at another point in the program). It will run 8 times
> for every block that requires an IDCT. My guess is that a 640x480 MPEG
> would require about 100,000 IDCTs per second (assuming 12-bit YCbCr),
> so this would be executed 800,000 times per second in that case (assuming
> that some mythical 603e could handle such a high bitrate stream).
disclaimer, I know diddly about the guts of MPEG format.
So, do you have an 8-bit mask of some kind indicating which rows of a given
8x8 macroblock you need to do the full IDCT math on? As in, 11110000
indicating that the first four 8-pixel rows must be processed fully but the
other four rows can be cheated on?
If so, have you gathered any real world data on which patterns of that mask
crop up most frequently? I hope I understand your description above
correctly.
My usual approach when faced with poorly predictable branches is to try and
find a way to merge such choices into "batches" and do them separately,
rather than hop back and forth between modes. Then you can start doing
stuff like loop unrolling and getting rid of many dec/test/branch paths
too.
Also, I might try to exploit non-uniformity in the distributions of typical
input data (what if the pattern 01010101 shows up 70% of the time? that
sort of thing).
Rob
| The 603e User Manual recommends that 9 instructions be placed between an
| instruction that affects a CR and the subsequent conditional branch to
| avoid the need for branch prediction, I believe.
Branches are "executed" by the BPU right after the branch is fetched, so a
conditional branch can avoid prediction only if the prefetch buffer (6
entries) is full between the rename write of the condition-setting op and
the fetch of the branch:
Fetch -> br
XX
XX
XX
XX
Decode/Disp-> XX
Exec -> XX
Write rename: and.
| Let's say I have some code like this:
|
| and. 23,25,27 # branch based on this
| lha 3,0(29) # (almost) always cached
| addi 28,28,1
| srawi 27,27,1
| li 23,32
| cmpwi 1,28,7 # (this has nothing to do with this branch)
| dcbt 23,29
| bc 4,2,.L7
|
| which doesn't do that (since I can't find anything else useful to do
| before the bc, really). Will the presence of the dcbt (a serializing
| instruction with a 2-cycle latency) have any effect to sort of "extend"
| the execution of the 6 instructions to allow the branch to resolve rather
| than be predicted?
No; the DCBT will still be in the prefetch buffer at this time:
Cycle Fetch Decode Execute Write
0 and. lha
1 addi srawi and. lha
2 li cmpwi addi and. lha
3 dcbt bc srawi li addi and. lha
| Would moving the dcbt up a few instructions help in any way?
No, not really; the prefetch buffer will continue to fill even if the rest
of the pipeline is stalled.
| Alternatively, would it be preferable to drop some noops in here to allow
| the branch to resolve, or should I just eat any misprediction penalty?
I think real NOPs (ori r0, r0, 0) are folded out of the pipeline on the
603e like branches are, so they won't help here as spacing between the
and. and the branch. You could perhaps add a dummy ALU op and a dummy
load, which would increase your runtime by 1 cycle but still provide 2
instructions of spacing.
The alternative is to take a branch mispredict hit 50% of the time, which
would add an average of ~ 2 cycles to your code fragment.
| Finally, is Tim Olson going to yell at me and say that this is only
| 4% of my program? ;) ;) ;)
Well, is it? ;-)
--
-- Tim Olson
>> > Further, why not a hint or two about what this code does? There might
>> >be all kinds of other avenues to look at.
>> It will be executing quite a bit. Basically, it is the code that avoids
>> doing an 8-point IDCT on a row of coefficients in an MPEG player - some
>> of the instructions are common to whether or not the branch is followed
>> (i.e., the instructions for the loop and the cache hint, as well as the
>> load of the DC coefficient and the shift of the mask for the marker which
>> tells us whether or not we have to do the math for the IDCT or can cheat,
>> which is set up at another point in the program). It will run 8 times
>> for every block that requires an IDCT. My guess is that a 640x480 MPEG
>> would require about 100,000 IDCTs per second (assuming 12-bit YCbCr),
>> so this would be executed 800,000 times per second in that case (assuming
>> that some mythical 603e could handle such a high bitrate stream).
>disclaimer, I know diddly about the guts of MPEG format.
I like to think that I am one of a few thousand people in the world
who do. Unless they are teaching this in compsci these days. ;)
>So, do you have an 8-bit mask of some kind indicating which rows of a given
>8x8 macroblock you need to do the full IDCT math on? As in, 11110000
>indicating that the first four 8-pixel rows must be processed fully but the
>other four rows can be cheated on?
Exactly. I find that I can "hide" the necessary code in portions of the
Huffman code that otherwise would leave the integer unit idle - there's a
point where you have to go to a LUT a couple of times in quick succession
(and the fact that the number of non-zero coefficients is usually pretty
small helps here as well). So I get to the IDCT with a marker byte
that cost nearly nothing (at least on the 603e, with a minimum 2-cycle
load latency) to generate. This is better than the typical approach,
which loads the 7 AC coefficients and ORs them together to determine
whether the work needs to be done, IMO, since I really only have a few
instructions worth of overhead to dispose of an entire row quite often
(taking into account that some of the instructions are common to both
potential branch targets). Maynard probably does something similar
(well, probably even better), but the open-source decoders do that
OR thingie, which is a real waste of cycles.
>If so, have you gathered any real world data on which patterns of that mask
>crop up most frequently? I hope I understand your description above
>correctly.
No, but I could guess (probably accurately) that there are more likely
rows to be skipped near the bottom than near the top. Such is the nature
of MPEG and JPEG. That being said, it really is impossible to predict.
>My usual approach when faced with poorly predictable branches is to try and
>find a way to merge such choices into "batches" and do them separately,
>rather than hop back and forth between modes. Then you can start doing
>stuff like loop unrolling and getting rid of many dec/test/branch paths
>too.
I don't want to wind up in a situation where I am killing the I-cache
with several different functions, though.
>Also, I might try to exploit non-uniformity in the distributions of typical
>input data (what if the pattern 01010101 shows up 70% of the time? that
>sort of thing).
I doubt that I'd run into that situation. The number of rows that
you will get to "cheat" on is related to the variety of the values
(whether luma or chroma) in the 8x8 block. For example, if you have a
block that is all one color (unlikely, but possible), you can probably
cheat on all 8 rows; a block that has something like a person's hair
probably won't give you any rows to cheat on (since the block is loaded
with high-frequency coefficients). Note that this doesn't take error
prediction in P and B frames into account; I'm not sure how that would
play out (Maynard?!?!?!) The bottom line is that this stuff is not really
predictable at all, which suggests to me that a 604e or 750 probably
wouldn't do any better than a 603e wrt branch prediction. Totally random.
But, again, my question was really more academic than practical for that
particular routine - I presented the code fragment as an example more than
as something I expected to get a big win on.
Thanks for your interest, though. It's good to know that the PowerPC
community is as good as the Linux community in that experienced people
are willing to take the time to help out those who have questions about
somewhat esoteric issues like this one. I hope that you have the time
at some point to devote a machine to Linux/PowerPC (if you haven't
already) and can become active on the mailing lists (and perhaps you
could lend your experience to the kernel developers at times).
>| The 603e User Manual recommends that 9 instructions be placed between an
>| instruction that affects a CR and the subsequent conditional branch to
>| avoid the need for branch prediction, I believe.
>Branches are "executed" by the BPU right after the branch is fetched, so a
>conditional branch can avoid prediction only if the prefetch buffer (6
>entries) is full between the rename write of the condition-setting op and
>the fetch of the branch:
>Fetch -> br
> XX
> XX
> XX
> XX
>Decode/Disp-> XX
>Exec -> XX
>Write rename: and.
Does this mean that the spacing is _more_ for the 604e? (I know this is
a FAQ, and in the user manual, but I'll pose it anyway). How does the
750 do here?
>| Let's say I have some code like this:
>|
>| and. 23,25,27 # branch based on this
>| lha 3,0(29) # (almost) always cached
>| addi 28,28,1
>| srawi 27,27,1
>| li 23,32
>| cmpwi 1,28,7 # (this has nothing to do with this branch)
>| dcbt 23,29
>| bc 4,2,.L7
>|
>| which doesn't do that (since I can't find anything else useful to do
>| before the bc, really). Will the presence of the dcbt (a serializing
>| instruction with a 2-cycle latency) have any effect to sort of "extend"
>| the execution of the 6 instructions to allow the branch to resolve rather
>| than be predicted?
>No; the DCBT will still be in the prefetch buffer at this time:
>Cycle Fetch Decode Execute Write
> 0 and. lha
> 1 addi srawi and. lha
> 2 li cmpwi addi and. lha
> 3 dcbt bc srawi li addi and. lha
>| Would moving the dcbt up a few instructions help in any way?
>No, not really; the prefetch buffer will continue to fill even if the rest
>of the pipeline is stalled.
How does the dcbt affect the following instructions? Let's say the
next few pairs are:
slwi
addi
rlwimi
addi
stw
stw
Will the dcbt hurt any of these in any way, being serializing and all?
>| Alternatively, would it be preferable to drop some noops in here to allow
>| the branch to resolve, or should I just eat any misprediction penalty?
>I think real NOPs (ori r0, r0, 0) are folded out of the pipeline on the
>603e like branches are, so they won't help here as spacing between the
>and. and the branch. You could perhaps add a dummy ALU op and a dummy
>load, which would increase your runtime by 1 cycle but still provide 2
>instructions of spacing.
>The alternative is to take a branch mispredict hit 50% of the time, which
>would add an average of ~ 2 cycles to your code fragment.
Two cycles per miss or two cycles on average? _Just_curious_! I take it
that here it doesn't really make that much difference.
>| Finally, is Tim Olson going to yell at me and say that this is only
>| 4% of my program? ;) ;) ;)
>Well, is it? ;-)
I haven't checked. The whole routine is the IDCT, of which I know you
are very familiar. This is the part that avoids doing rows; it probably
is less than 1% of the whole program, since I have already optimized
my approach to this issue pretty well. I was mostly curious about this
question, seeing that I am knee-deep in some assembly, and the only reason
you start writing assembly in applications is for speed. :)
So I figured I would try to learn something from the real experts. There
are other things that I am fooling with, so other questions will probably
arise.
Thanks.
| Does this mean that the spacing is _more_ for the 604e? (I know this is
| a FAQ, and in the user manual, but I'll pose it anyway).
In the 604e design, you cannot avoid prediction for conditional branches,
even if the branch condition is known well ahead of time -- an unfortunate
aspect of the design. The prefetch logic is totally decoupled from the
condition code registers and only looks at the dynamic branch prediction
bits.
| How does the
| 750 do here?
Instructions are prefetched up to 4 at a time (instead of 2 at a time for
603e); this causes branches to be folded out earlier in the case where a
branch is followed by another branch within 4 instructions (one of the
optimizations for a somewhat common case). However, in your case the
limit is still the prefetch buffer depth, so I think the requirement is
the same.
| Will the dcbt hurt any of these in any way, being serializing and all?
The implementation of DCBT on the 603e isn't very good, and is almost
never beneficial (many times it's a pessimization). On the 604 and 750,
the DCBT acts like a store with respect to the pipeline effects (takes a
dispatch slot, takes up an entry in the store buffer).
| >The alternative is to take a branch mispredict hit 50% of the time, which
| >would add an average of ~ 2 cycles to your code fragment.
|
| Two cycles per miss or two cycles on average? _Just_curious_! I take it
| that here it doesn't really make that much difference.
Two on average (4 per mispredict correction).
--
-- Tim Olson
Jason executes a branch depending on one of eight bits in a register:
> It will be executing quite a bit. Basically, it is the code that avoids
> doing an 8-point IDCT on a row of coefficients in an MPEG player - some
> of the instructions are common to whether or not the branch is followed
> (i.e., the instructions for the loop and the cache hint, as well as the
> load of the DC coefficient and the shift of the mask for the marker which
> tells us whether or not we have to do the math for the IDCT or can cheat,
> which is set up at another point in the program). It will run 8 times
> for every block that requires an IDCT. My guess is that a 640x480 MPEG
> would require about 100,000 IDCTs per second (assuming 12-bit YCbCr),
> so this would be executed 800,000 times per second in that case (assuming
> that some mythical 603e could handle such a high bitrate stream).
Brutest force method: Could you copy these eight bits into the CCR with
mtcrf, and duplicate your loop eight times? Definitely no branch
prediction anymore, and you save a few more instructions.
>| Does this mean that the spacing is _more_ for the 604e? (I know this is
>| a FAQ, and in the user manual, but I'll pose it anyway).
>In the 604e design, you cannot avoid prediction for conditional branches,
>even if the branch condition is known well ahead of time -- an unfortunate
>aspect of the design. The prefetch logic is totally decoupled from the
>condition code registers and only looks at the dynamic branch prediction
>bits.
Okay, then consider this. Certain things in MPEG require values to be
clipped to between 0 and 255. It occurred to me that using an unrolled
loop with conditionals instead of a clipping table might be a performance
win on the 603e, if I can put enough instructions between the cmpwi
instructions and the bc instructions, since the L2 cache latency on most
of these 603e boxes is awful (according to lmbench, L2 cache latency on
my PowerBase is over 200 nsec). I can probably figure that most values
will fall through (i.e., not be clipped, which makes the necessity of
this code particularly regrettable!) From the benchmark numbers that
I've seen, a L1 cache miss isn't as painful on most of the 604e boxes
(although some of those Tanzanias might suffer a bit). I don't know that
I can be sure that the clipping table will stay in L1, especially on
a 603e (obviously, or I wouldn't be asking) and it seems to make sense
to do this stuff piecemeal to avoid loading/storing the actual video
data repeatedly, rather than all at once to get the clipping table into
the L1 cache once per frame.
Might something like this actually hurt a 604e, since I can't count on
branch resolution? (I guess that the clipping table would almost always
be best on the 750, assuming it stays (at least) in L2).
>| How does the
>| 750 do here?
>Instructions are prefetched up to 4 at a time (instead of 2 at a time for
>603e); this causes branches to be folded out earlier in the case where a
>branch is followed by another branch within 4 instructions (one of the
>optimizations for a somewhat common case). However, in your case the
>limit is still the prefetch buffer depth, so I think the requirement is
>the same.
It also occurred to me than in the above case a 603e might be starved
for instructions in the prefetch buffer, if a lot of branches are being
fetched/folded. I guess that the 750 wouldn't have that problem.
>| Will the dcbt hurt any of these in any way, being serializing and all?
>The implementation of DCBT on the 603e isn't very good, and is almost
>never beneficial (many times it's a pessimization). On the 604 and 750,
>the DCBT acts like a store with respect to the pipeline effects (takes a
>dispatch slot, takes up an entry in the store buffer).
At least some Linux/PowerPC kernel versions turn off DCBT on the 603e.
I asked Cort Dougan (lead kernel programmer) about this, in fact - he has
experimented with cache instructions in the kernel quite a bit. I also
hacked a kernel to enable them, and didn't see any particular effect
either way on some code that used DCBT (but there were no consecutive
DCBTs in that code). I take it that DCBT can lead to some good performance
improvement on the 604e/750, though, when used judiciously?
>| >The alternative is to take a branch mispredict hit 50% of the time, which
>| >would add an average of ~ 2 cycles to your code fragment.
>| Two cycles per miss or two cycles on average? _Just_curious_! I take it
>| that here it doesn't really make that much difference.
>Two on average (4 per mispredict correction).
So I could pick up a cycle with a couple of dummy instructions. I doubt
that this will buy me more than a few hundred thousand cycles per second,
though. Well, even if nothing from this thread speeds up anyone's program,
it's interesting to learn about this stuff.
Hmmm - possibly, although I'd be a little worried about code density on
a 603e with its 16K I-cache (the whole loop is about 80 instructions,
IIRC).
> Tim Olson posted the following to comp.sys.powerpc.tech:
>
> >| Does this mean that the spacing is _more_ for the 604e? (I know this is
> >| a FAQ, and in the user manual, but I'll pose it anyway).
>
> >In the 604e design, you cannot avoid prediction for conditional branches,
> >even if the branch condition is known well ahead of time -- an unfortunate
> >aspect of the design. The prefetch logic is totally decoupled from the
> >condition code registers and only looks at the dynamic branch prediction
> >bits.
>
> Okay, then consider this. Certain things in MPEG require values to be
> clipped to between 0 and 255. It occurred to me that using an unrolled
> loop with conditionals instead of a clipping table might be a performance
> win on the 603e...
Or better yet, see if you can come up with a branchless form for the
clamping that you need to do. I know it is possible for things like min
and max from the PPC Compiler Writer's Guide IIRC. There's a whole section
of interesting hand-written and GNU-Superoptimizer-written code fragements
in the appendix that might give some hints.
When you say "values to be clipped", I assume you have some intermediate
result like a sum or product that needs to be put in line, that is more
than 8 bits already? As opposed to having to do a test prior to the
calculation to decide if the outcome is going to overflow or not.
Rob
> although I'd be a little worried about code density on a 603e with
> its 16K I-cache (the whole loop is about 80 instructions, IIRC).
80 times 8 = 640 instructions, times 4 = 2560 bytes. Room to spare?
-=- Andrew Klossner (and...@teleport.com)
>> >| Does this mean that the spacing is _more_ for the 604e? (I know this is
>> >| a FAQ, and in the user manual, but I'll pose it anyway).
>> >In the 604e design, you cannot avoid prediction for conditional branches,
>> >even if the branch condition is known well ahead of time -- an unfortunate
>> >aspect of the design. The prefetch logic is totally decoupled from the
>> >condition code registers and only looks at the dynamic branch prediction
>> >bits.
>> Okay, then consider this. Certain things in MPEG require values to be
>> clipped to between 0 and 255. It occurred to me that using an unrolled
>> loop with conditionals instead of a clipping table might be a performance
>> win on the 603e...
> Or better yet, see if you can come up with a branchless form for the
>clamping that you need to do. I know it is possible for things like min
>and max from the PPC Compiler Writer's Guide IIRC. There's a whole section
>of interesting hand-written and GNU-Superoptimizer-written code fragements
>in the appendix that might give some hints.
Is that GNU as in "free software" or something else? I'll take a look;
that compiler's guide is an interesting document anyway.
> When you say "values to be clipped", I assume you have some intermediate
>result like a sum or product that needs to be put in line, that is more
>than 8 bits already? As opposed to having to do a test prior to the
>calculation to decide if the outcome is going to overflow or not.
The former - IDCT results and colorspace conversion results come to mind
as two examples where this happens. With the colorspace conversion, the
lookup table probably makes sense, the you'd typically do the whole image
at once, and hope that the LUT would stay in L1 once loaded, but in the
IDCT phase, you might do an 8x8 block, and if the values get stuffed
straight into the image, add 128 to each one and then make sure the
(unsigned int) values are between 0 and 255 before stuffing them into
the image. The LUT approach merges the add 128 and the clipping portion
into a single load instruction, but might lead to expensive cache misses
on a 603e-based machine. (Also, if you have to add the IDCT output to
something from another frame, you don't get the benefit of saving the
add instruction anyway).
I was thinking of something like:
addi 128 # 1/2 cycle
do something unrelated
compare to 0 # 1 cycle for both
compare to 255
do a bunch of other stuff # 8 more instructions
branch somewhereelse if gt 0
li 0 # probably one cycle if a li
# is required
b somewhere
somewhereelse:
branch somewhere if lt 255
li 255
somewhere:
do more similar stuff
which seems to lead to 1.5 cycles of overhead if the li is not necessary
(most of the time, I think) and an additional cycle if it is. On the
750, you probably would figure that a L1 miss to a LUT is not going to
hurt very much, but we could be looking at 40 cycles or so on this
PowerBase for each miss, and it seems to make sense to build the final
output for the image right at the end of the IDCT when the values are
still in registers, so I don't know how much of a LUT will stay in L1
throughout the decode process for a frame. However, fetching the three
branch instructions so often might starve the CPU for instructions
here on a 603e (and I think that you'd want to use the LUT on a 750
anyway, so its better fetcher is a moot point).
>>> Could you copy these eight bits into the CCR with mtcrf, and
>>> duplicate your loop eight times?
>> although I'd be a little worried about code density on a 603e with
>> its 16K I-cache (the whole loop is about 80 instructions, IIRC).
>80 times 8 = 640 instructions, times 4 = 2560 bytes. Room to spare?
16K cache; lots of other code. I'm not sure one way or the other.
A useful technique is to test the top 24 bits of a register against zero
(easily done in a single PowerPC instruction). This will tell you if you
need to clip at all. If that test comes up non-zero a single branch testing
the sign will tell you which way to clamp. Since clamping is usually rare,
this saves executing a branch most of the time.
In C this is roughly:
if ((x & 0xffffff00) != 0)
if (x < 0)
x = 0;
else
x = 255;
(I learned this from David DiGiacomo)
-Z-
On PowerPC, something like:
and.
cmpwi (test against 0)
... (eight instructions)
bc A (usually will take the branch)
bc B
li 0
b A
B
li 255
A
This probably eliminates the "branches might swamp the fetcher" issue
on the 603e. Looks good - thanks.
> On PowerPC, something like:
> and.
> cmpwi (test against 0)
> ...
The cmpwi here is unnecessary. The "and." sets the condition codes.
If you really need to avoid branches:
# start with x in r3
li r4,0xff
subfc r4,r3,r4 # set carry if x > 0xff (unsigned compare)
# .. so carry is set if we need to clip
li r5,0
subfze r5,r5 # r5 = 0 if carry, -1 if no carry, so:
# r5 = -1 if we need to clip, else 0
srawi r6,r3,31 # r6 = -1 if x<0, else 0
rlwinm r6,r6,0,24,31 # r6 = 255 if x<0, else 0
and r6,r6,r5 # r6 = 255 if x<0 and we need to clip, else 0
andc r7,r3,r5 # r7 = x if not clipping, 0 if clipping
or r3,r7,r6 # r3 = x if not clipping, else (x<0)?0:255
Rearrange to improve scheduling:
li r4,0xff
li r5,0
srawi r6,r3,31
subfc r4,r3,r4
subfze r5,r5
rlwinm r6,r6,0,24,31
andc r7,r3,r5
and r6,r6,r5
or r3,r7,r6
-=- Andrew Klossner (and...@teleport.com)
>> On PowerPC, something like:
>> and.
>> cmpwi (test against 0)
>> ...
>The cmpwi here is unnecessary. The "and." sets the condition codes.
The cmpwi was for the second if (if (x<0)), not for the first one,
so I believe that it is necessary (how can the and. test the very
portion of the GPR that it masks off?) I think I was unclear as
to my intent there, though. :)
>If you really need to avoid branches:
No, just looking for the best way to arrange the branches to keep
the 603e pipelines filled.
> # start with x in r3
> li r4,0xff
> subfc r4,r3,r4 # set carry if x > 0xff (unsigned compare)
> # .. so carry is set if we need to clip
> li r5,0
> subfze r5,r5 # r5 = 0 if carry, -1 if no carry, so:
> # r5 = -1 if we need to clip, else 0
> srawi r6,r3,31 # r6 = -1 if x<0, else 0
> rlwinm r6,r6,0,24,31 # r6 = 255 if x<0, else 0
> and r6,r6,r5 # r6 = 255 if x<0 and we need to clip, else 0
> andc r7,r3,r5 # r7 = x if not clipping, 0 if clipping
> or r3,r7,r6 # r3 = x if not clipping, else (x<0)?0:255
>Rearrange to improve scheduling:
>
> li r4,0xff
> li r5,0
> srawi r6,r3,31
> subfc r4,r3,r4
> subfze r5,r5
> rlwinm r6,r6,0,24,31
> andc r7,r3,r5
> and r6,r6,r5
> or r3,r7,r6
I think that the branching approach is much better on a 603e, since with
Zalman's approach the overhead is probably less than 2 cycles per value
(here it would be a lot more, and uses too many registers for what I am
doing IMO, since I will be dealing with 8 values at a time, and the code
can be interspersed with the tail end of the IDCT for scheduling reasons
with the IDCT using some other registers as well.) One problem with
Zalman's approach in this case on PowerPC, though, is that the and. always
will use CR0, right? I'd like to set up more than two CRs at a time,
so as to space out the compares and the branches adequately and force
resolution of all branches in the prefetch queue, so it might be best to
alternate between an and./cmpwi and a cmpwi/cmpwi from value to value,
even though the and./cmpwi approach would probably be superior otherwise.
A document appeared on Motorolas webpage that I havent seen before:
http://www.mot.com/SPS/PowerPC/teksupport/teklibrary/papers/opt603.pdf
Lots of things about optimising 603e code. Some is well-known to anyone
knowing about RISC processors, but some things in it are implementation
details that I havent seen described anywhere (for example, the 603e and
750 can do only three floating point operations every four cycles; this
document says why. And it describes when a dbnz might be slower than a
compare and branch).
The and. operation masks of the low 8 bits, not the high 24. So it indeed
should setup the condition codes exactly for both tests. Think of it as two
booleans "are all the top 24 bits 0?" and "Is the sign bit set?".
Obviously, the sign bit is in the top 24 bits.
I'm pretty sure you can uses a rlwinm. instruction to avoid burning a
register for the mask by the way.
: One problem with Zalman's approach in this case on PowerPC, though, is that
: the and. always will use CR0, right?
Yep. One reason why I prefer compare-and-branch architectures to condition
codes. I don't know if the 603e does any CR renaming or how much CR moves
cost. (Ideally, the CR moves could be done fro free since they issue to a
different unit and very early in the pipe. But I don't know.)
-Z-
> ja...@jhste1.dyn.ml.org (Jason S.) writes:
> > Zalman Stern posted the following to comp.sys.powerpc.tech:
> > >Jason S. (ja...@jhste1.dyn.ml.org) wrote:
> > >[Clipping to range [0..255] ...]
> >
> > >A useful technique is to test the top 24 bits of a register against zero
> > >(easily done in a single PowerPC instruction). This will tell you if you
> > >need to clip at all. If that test comes up non-zero a single branch testing
> > >the sign will tell you which way to clamp. Since clamping is usually rare,
> > >this saves executing a branch most of the time.
> >
> > >In C this is roughly:
> > > if ((x & 0xffffff00) != 0)
> > > if (x < 0)
> > > x = 0;
> > > else
> > > x = 255;
> >
> > >(I learned this from David DiGiacomo)
> >
> > On PowerPC, something like:
> >
> > and.
> > cmpwi (test against 0)
> > ... (eight instructions)
> > bc A (usually will take the branch)
> > bc B
> > li 0
> > b A
> > B
> > li 255
> > A
> >
> > This probably eliminates the "branches might swamp the fetcher" issue
> > on the 603e. Looks good - thanks.
>
> Just how out-of-range are the inputs likely to be? If they're either
> negative or within 0 .. 511 then you can do it without branching or
> loads at all, by examining bits 0 and 23:
I had an idea for branchless clamping, here's some pidgin pseudocode. It
relies on a signed right shift "smearing" the sign bit of a calculated
difference vs a constant, across all 32 bits of a register.
; if r10 is bigger than 255, make it 255.
r11 = 255 - r10
r11 = r11 >> 31 (signed right shift, negative input yields all 1's)
r10 = (r10 | r11) & 255.
If r10 goes in >255, it comes out 255.
That seems pretty clean to me. A move, a subtract from a constant, the
shift right to smear the sign bit of the result down, one OR and one AND.
(I think the AND can be postponed or simply dropped, if you just store the
byte out to memory).
Of course every single instruction depends on the one before it, hmmm. And
it only handles the positive case, but I suspect one could be constructed
for clamping at the negative end too:
; if r10 is negative, we want to make it zero.
given r10
r11 = ~ (r10 >> 31) ; r11 = ~ ( (r10<0) ? 0xFFFFFFFF : 0 )
r10 = r10 & r11.
Does this make sense? (untested post lunch ramblings)
Rob
>: The cmpwi was for the second if (if (x<0)), not for the first one,
>: so I believe that it is necessary (how can the and. test the very
>: portion of the GPR that it masks off?) I think I was unclear as
>: to my intent there, though. :)
>The and. operation masks of the low 8 bits, not the high 24. So it indeed
>should setup the condition codes exactly for both tests. Think of it as two
>booleans "are all the top 24 bits 0?" and "Is the sign bit set?".
>Obviously, the sign bit is in the top 24 bits.
Doh. Forgive me for not thinking it through.
>I'm pretty sure you can uses a rlwinm. instruction to avoid burning a
>register for the mask by the way.
Yeah, that's good.
>: One problem with Zalman's approach in this case on PowerPC, though, is that
>: the and. always will use CR0, right?
>Yep. One reason why I prefer compare-and-branch architectures to condition
>codes. I don't know if the 603e does any CR renaming or how much CR moves
>cost. (Ideally, the CR moves could be done fro free since they issue to a
>different unit and very early in the pipe. But I don't know.)
Sounds like something like
rlwinm. # for first value
cmpwi 0 # for second value
cmpwi 255
other stuff # at least 7 insns (there's plenty that can be
# done in my program at this point)
bc to 2 # no need to clamp?
bc to 1 # clamp to 0 or 255?
li 0
b to 2
1
li 255
2
rlwinm. # for third value
bc to 3 # second value 0?
li 0
b to 4
3
bc to 4 # second value 255?
li 255
4
other stuff # at least 4 insns (including 2 cmpwi insns
# for fourth value)
bc # for third value
and so forth could be the best bet, by alternating between
branch-and-compare and your approach to avoid the CR0 problem while
alleviating the problem of the prefetch buffer running out of instructions
for the hungry pipelines.
>Just how out-of-range are the inputs likely to be? If they're either
>negative or within 0 .. 511 then you can do it without branching or
>loads at all, by examining bits 0 and 23:
>underflowMask = ((x >> 31) & 1) - 1;
>overflowMask = -((x >> 8) & 1);
>x = x | overflowMask;
>x = x & underflowMask;
>That's six instructions on PPC -- three clock cycles if interleaved with
>other things. Either mask the result afterwards, or just store it as a byte.
>If a positive overflow could be bigger -- in the range 256 .. 65535 -- then
>you could get "overflowMask" by ensuring the 8 LSBs are not all zero and doing
>a count-leading-zeroes. The result will be in the range 16 .. 23 if there is
>an overflow, 24 .. 31 if there is not. Examine bit 28 to form the mask:
I think that it normally would not fall very far out of range. My guess is
that the clamping is unnecessary much of the time (I haven't tested this
empirically, though, but I'm guessing that you only clamp less than 20%,
and perhaps even less than 10% of the time).
It seems like the hybrid Zalman/compare-and-branch approach might be
best, since my major concern is to keep the branch instructions from
swamping the prefetch buffer, not branch misprediction problems, since
this can be coded so that all branches are folded without prediction
on 603e/750. Zalman's rlwinm. is 1/2 cycle in this code, and a pair of
compares is 1 cycle; the li instructions that are occasionally executed
might be 1/2 or 1 cycle, depending on whether the prefetch buffer runs out
of instructions, I guess (not an issue with the 750). Zalman's approach
saves the 603e/750 from having to fold a second conditional branch most
of the time, but is limited by the fact that the rlwinm. always uses
CR0, so by alternating the two, I wind up with the happy medium of 1/2
cycle and 1 branch in the majority case half the time, and 1 cycle and
a pair of branches in the majority case the other half of the time,
without having to mess with the condition registers. (Also, throwing
in compares on a 603e isn't necessarily a bad thing, since they can
be interleaved with anything, making scheduling the code a lot simpler).
Just how out-of-range are the inputs likely to be? If they're either
negative or within 0 .. 511 then you can do it without branching or
loads at all, by examining bits 0 and 23:
underflowMask = ((x >> 31) & 1) - 1;
overflowMask = -((x >> 8) & 1);
x = x | overflowMask;
x = x & underflowMask;
That's six instructions on PPC -- three clock cycles if interleaved with
other things. Either mask the result afterwards, or just store it as a byte.
If a positive overflow could be bigger -- in the range 256 .. 65535 -- then
you could get "overflowMask" by ensuring the 8 LSBs are not all zero and doing
a count-leading-zeroes. The result will be in the range 16 .. 23 if there is
an overflow, 24 .. 31 if there is not. Examine bit 28 to form the mask:
overflowMask = -((cntlzw(x | 1) >> 3) & 1)
That adds another two instructions, making eight.
The question is: is this enough instructions that you might as well just
eat the branch?
-- Bruce
--
'We have no intention of shipping another bloated operating system and
forcing that down the throats of our Windows customers'
-- Paul Maritz, Microsoft Group Vice President
No, the and. and cmpwi are for two different 'if' statements, and will
be using different sets of condition codes.
> Tim Olson posted the following to comp.sys.powerpc.tech:
>
> >| Does this mean that the spacing is _more_ for the 604e? (I know this is
> >| a FAQ, and in the user manual, but I'll pose it anyway).
>
> >In the 604e design, you cannot avoid prediction for conditional branches,
> >even if the branch condition is known well ahead of time -- an unfortunate
> >aspect of the design. The prefetch logic is totally decoupled from the
> >condition code registers and only looks at the dynamic branch prediction
> >bits.
>
> Okay, then consider this. Certain things in MPEG require values to be
> clipped to between 0 and 255. It occurred to me that using an unrolled
> loop with conditionals instead of a clipping table might be a performance
> win on the 603e, if I can put enough instructions between the cmpwi
> instructions and the bc instructions, since the L2 cache latency on most
> of these 603e boxes is awful (according to lmbench, L2 cache latency on
> my PowerBase is over 200 nsec). I can probably figure that most values
> will fall through (i.e., not be clipped, which makes the necessity of
> this code particularly regrettable!) From the benchmark numbers that
> I've seen, a L1 cache miss isn't as painful on most of the 604e boxes
> (although some of those Tanzanias might suffer a bit). I don't know that
> I can be sure that the clipping table will stay in L1, especially on
> a 603e (obviously, or I wouldn't be asking) and it seems to make sense
> to do this stuff piecemeal to avoid loading/storing the actual video
> data repeatedly, rather than all at once to get the clipping table into
> the L1 cache once per frame.
However the clipping to 0..255 does not need to be done at the IDCT stage.
What comes out of the IDCT is not 0..255. It is nominally -255..255.
That will get added to 8-bit data in the motion comp.
Thus what this boils down to is that
€ what comes out of the IDCT pretty much HAS to be SINT16s, not UINT8s.
€ there's no point in pinning it there.
€ you do have to pin when you add it in the motion comp loops, and there's
too little work happening in those loops to worry about anything but using
a pin table.
Maynard
--
My opinion only
> >| Will the dcbt hurt any of these in any way, being serializing and all?
>
> >The implementation of DCBT on the 603e isn't very good, and is almost
> >never beneficial (many times it's a pessimization). On the 604 and 750,
> >the DCBT acts like a store with respect to the pipeline effects (takes a
> >dispatch slot, takes up an entry in the store buffer).
>
> At least some Linux/PowerPC kernel versions turn off DCBT on the 603e.
> I asked Cort Dougan (lead kernel programmer) about this, in fact - he has
> experimented with cache instructions in the kernel quite a bit. I also
> hacked a kernel to enable them, and didn't see any particular effect
> either way on some code that used DCBT (but there were no consecutive
> DCBTs in that code). I take it that DCBT can lead to some good performance
> improvement on the 604e/750, though, when used judiciously?
My experience has been that on a 604 various types of code (MPEG loops,
Sorenson loops, YUV blit loops), have sped up by about 25% through use of
these instructions, about 12% from DCBT and about 12% from DCBZ.
This is using DCBT in the most limited way possible
---only one DCBT is every active at a time
---each DCBT is retired (via a lbz to the same address) before issuing a
followup DCBT.
My belief (please correct me if I'm wrong, Tim) is that following this
protocol will get DCBT to not hurt you on 603, while gaining most of the
benefit available on 604+. My belief is that the code mentioned above also
improves about 25% on a 603e, but I haven't done as much rigorous timing
there---though the Q&D timings I have done have shown that it has never
hurt the code.
> Zalman Stern posted the following to comp.sys.powerpc.tech:
> >Jason S. (ja...@jhste1.dyn.ml.org) wrote:
> >[Clipping to range [0..255] ...]
>
> >A useful technique is to test the top 24 bits of a register against zero
> >(easily done in a single PowerPC instruction). This will tell you if you
> >need to clip at all. If that test comes up non-zero a single branch testing
> >the sign will tell you which way to clamp. Since clamping is usually rare,
> >this saves executing a branch most of the time.
>
> >In C this is roughly:
> > if ((x & 0xffffff00) != 0)
> > if (x < 0)
> > x = 0;
> > else
> > x = 255;
>
> >(I learned this from David DiGiacomo)
>
> On PowerPC, something like:
>
> and.
> cmpwi (test against 0)
> ... (eight instructions)
> bc A (usually will take the branch)
> bc B
> li 0
> b A
> B
> li 255
> A
>
> This probably eliminates the "branches might swamp the fetcher" issue
> on the 603e. Looks good - thanks.
>
You can do better.
rlwimn. r0, x, 0, 0, 23
^^^ 0==no rotate, (0, 23)==mask against bits 0..23 ie top 24.
....
bne cr0, @inRange
... nasty crud here to catch out of range.
@inRange:
rlwimn etc are your friends for any of this sort of nonsense.
> Just how out-of-range are the inputs likely to be? If they're either
> negative or within 0 .. 511 then you can do it without branching or
> loads at all, by examining bits 0 and 23:
>
> underflowMask = ((x >> 31) & 1) - 1;
> overflowMask = -((x >> 8) & 1);
> x = x | overflowMask;
> x = x & underflowMask;
>
> That's six instructions on PPC -- three clock cycles if interleaved with
> other things. Either mask the result afterwards, or just store it as a byte.
Very interesting, Bruce. Where I need to pin, I just whack a lookup table,
and my concerns are not with that missing in L1, but with the fact that I
pack calculations and have to unpack to perform the lookup loads. The
above may avoid that. It may not be a win in the end since my code is
pretty tightly interleaved anyway, but it's worth looking at.
See ya,
!!! just how packed are they? Clearly not four 8-bit values to a word, otherwise
overflows are going to destroy other values.
Whatever size your data values are, you're going to need at least two extra bits per
value in the packed representation to be able to use the above. So you could get
four six-bit values or three eight-bit values per 32-bit word. If you can get
underflow then you need to use excess-512 notation (i.e. set the high bit for +ve
numbers, clear it for -ve) to catch and contain the underflow carry chain.
Other that that, I think the above could be used for packed pinning with very
little modification: for four six-bit values (looks simpler in hex :-) ...
underflowMask = 0x40404040 - ((x >> 7) & 0x01010101);
overflowMask = 0x40404040 - ((x >> 6) & 0x01010101);
x = x | overflowMask;
x = x & underflowMask;
That's a couple more instructions because you don't get to use rlwinm to
do the bit extraction masking, but then you're doing more useful work :-)
You might also need a couple omre instructions to set up the top two bits
the way you want them afterwards.
Ciao.
>However the clipping to 0..255 does not need to be done at the IDCT stage.
>What comes out of the IDCT is not 0..255. It is nominally -255..255.
>That will get added to 8-bit data in the motion comp.
>Thus what this boils down to is that
>€ what comes out of the IDCT pretty much HAS to be SINT16s, not UINT8s.
>€ there's no point in pinning it there.
>€ you do have to pin when you add it in the motion comp loops, and there's
>too little work happening in those loops to worry about anything but using
>a pin table.
As I described more fully in email, the structure of the code that I am
working with is probably different than yours (it goes straight from
IDCT to sticking the output in the frame, either by adding 128 to the
values or by adding them to something that has already been computed
somewhere else, and then pinning the values, and then dropping them
into the frame). I'm just posting this so that I don't look really dumb,
not to contradict anything that you are saying. :)
It's likely that I should look at the overall approach that the MSSG
code with which I started takes, and perhaps try to rewrite the whole
thing to be faster. However, I'm still taking babysteps at this point.
(I do aspire to being the Maynard Handley of Linux/PowerPC, eventually,
but we all have to start somewhere). ;)
--
"...and a more offensive spectacle I cannot recall."
-- Newman
>One problem with
>Zalman's approach in this case on PowerPC, though, is that the and. always
>will use CR0, right? I'd like to set up more than two CRs at a time,
>so as to space out the compares and the branches adequately and force
>resolution of all branches in the prefetch queue, so it might be best to
>alternate between an and./cmpwi and a cmpwi/cmpwi from value to value,
>even though the and./cmpwi approach would probably be superior otherwise.
Since you're testing that 0 <= x <= 255, you can also test this using an
unsigned compare against 255 since negative numbers will appear to be
large.
Zalman suggested (and Maynard improved to):
rlwinm. r0,r3,0,0,23 (aka clrrwi. r0,r3,8)
so you can also use:
cmplwi cr3,r3,0x0100
The only tricky part is that this compare's result is the LT bit of the
specified CR field, not the EQ bit.
Incidentally, I love this thread. Not only is is about microoptimizing for
PPC, but this entire function would benefit immensely from AltiVec.
+------------------------------------------------------------+
| Alexander M. Rosenberg <mailto:alexr@_spies.com> |
| Nobody cares what I say. Remove the underscore to mail me. |
>>One problem with
>>Zalman's approach in this case on PowerPC, though, is that the and. always
>>will use CR0, right? I'd like to set up more than two CRs at a time,
>>so as to space out the compares and the branches adequately and force
>>resolution of all branches in the prefetch queue, so it might be best to
>>alternate between an and./cmpwi and a cmpwi/cmpwi from value to value,
>>even though the and./cmpwi approach would probably be superior otherwise.
>Since you're testing that 0 <= x <= 255, you can also test this using an
>unsigned compare against 255 since negative numbers will appear to be
>large.
>Zalman suggested (and Maynard improved to):
>rlwinm. r0,r3,0,0,23 (aka clrrwi. r0,r3,8)
>so you can also use:
>cmplwi cr3,r3,0x0100
>The only tricky part is that this compare's result is the LT bit of the
>specified CR field, not the EQ bit.
And this is probably the ideal approach if I stick to branches.
>Incidentally, I love this thread. Not only is is about microoptimizing for
>PPC, but this entire function would benefit immensely from AltiVec.
This group needs more of these threads.
--
"...and a more offensive spectacle I cannot recall."
-- Newman (on x86)
Which raises the issue of spreading the compares across the condition
registers, using condition register logicals to combine them and skipping a
whole bunch of clamping operations with a single branch. This will win if
clamping is indeed rare.
-Z-
Unfortunately, CR logicals are execution serializing.
On Wed, 28 Oct 1998 15:06:52 -0800
al...@I.HATE.SPAM (Alex Rosenberg)
responded back ,
>>
Unfortunately, CR logicals are execution serializing.
<<
That is not strictly true.
Alex Rosenberg used the phrase "execution serializing", which I take to mean
"execution synchronizing" (If I am wrong about that, please skip the rest of
this commentary).
As I understand the overall interest at this point in the thread, the real
objective is to test a number of conditions, and to exploit the opportunity to
do the tests simultaneously by targetting the results to separate places within
the Condition Register. Specifically, we are trying to avoid a code sequence
that keeps placing the results into CR0.
The testing capability of the PowerPCs instruction set is not as symetric as we
would like. Some tests only allow results to be placed into CR0. So the quest
becomes a search for data format and relations that will allow a test that uses
a syntax where more flexibility is possible.
The compare instructions, such as
>: cmplwi cr3,r3,0x0100
are not execution synchronizing, as far as I know. You should be able to deploy
several of these without actually stalling.
Furthermore, the PowerPC Architecture does not specify the Condition Register
logical operations (such as CROR), as execution synchronizing. Obviously there
is dependency back to the instructions that create the condition bit image.
But much could be left to the implementation on this question.
It is atleast conceivable that two compare instructions could generate
intermediate CRx values that never make it to the condition register, but are
(potentially) intercepted by a CR logical operation. All of that, with another
parallel instruction generating yet another CRx subfield bit image.
One of the most important things that RISC chips do is to eliminate the
congestion at the condition register, and implementations might have a number
of holding places for the condition register subfield images.
CR logical operations can concievably execute against intermediate hold
positions for CRx subfields. Even if an implementation posts a CRx result
prior to execution of a CR logical, this does not serialize posts to other CRx
subfields not involved in the sequence.
A critical design opportunity in RISC is to isolate the portions of the CR
register.
Of course there is dependency between the instructions that reference the same
CR subfield.
So at some point the CR logicals, and the subsequent conditional branch must
wait for the results.
But if your data structures and algorithm permit it, then multiplexing the
conditions from numerous tests down to a single logically combined condition
register subfield image is viable, and not inherently synchronizing according
to the architecture. Implementation details will vary, however.
If I am not mistaken, this thread starting a long way back referenced a 603
(and perhaps a 603e). Can anyone comment about this implementation
specifically, with regard to execution synchronization on Condition Register
logicals.
The branch instructions
Robert Rayhawk
RKRa...@aol.com
Sure. From the 603e User's Manual, page 2-35:
"Condition register locail instructions, shown in table 2-28,
and the Move Condition Register Field (mcrf) instruction are
also defined as flow control instructions, although they are
executed by the system register unit (SRU). Most instructions
executed by the SRU are completion-serialized to maintain
system state; that is, the instruction is held for execution
in the SRU until all prior instructions issued have
completed."
(Table 2-28 shows crand, cror, crxor, crnand, crnor, creqv, crandc,
crorc, mcrf.)
As I understand it, the idea was to move CR close to the branch unit
(hence away from the integer unit) for branch prediction and folding.
-=- Andrew Klossner (and...@teleport.com)
>>>
> Unfortunately, CR logicals are execution serializing.
><<
>
>That is not strictly true.
>
>Alex Rosenberg used the phrase "execution serializing", which I take to mean
>"execution synchronizing" (If I am wrong about that, please skip the rest of
>this commentary).
From the MPC750 User Manual, section 6.3.3.2 (pretty much the same in all
the other PPC manuals as well):
"Execution-serialized instructions are dispatched, held in the functional
unit and do not execute until all prior instructions have completed."
And as pointed out by Andrew Klossner, all the cr-logicals fall into this
category.
>Furthermore, the PowerPC Architecture does not specify the Condition Register
>logical operations (such as CROR), as execution synchronizing.
While it may not be an architectural requirement, it's an implementation
fact for recent designs, which happen to be the target of the code in
question.
However, and this is just to take the discussion to the limit of what I see as
valuable implications of some of the proposed coding alternatives. I do accept
the points about execution serializing (and I have now learned to recognize
that as a distinct term from execution synchronizing).
The cmp... instructions are not execution serializing. So, generally, if
complex condition testing can be deployed across two or more CRn fields. It is
advantageous to discover data structures that will allow exploitation of the
cmp... instruction which do permit explication of the CRn field. And to avoid
congestion early in the CR0 field which is implicit for some compare
instructions. This allows parallel dispatch and execution of those cmp..
instructions.
Furthermore, and I am not being argumentative here, this is extremely technical
stuff, the CRlogicals are execution serializing on the 603, but they are
completion-serializing. That is potentially useful.
Logically prior instructions will complete (and the architected full CR
register will be constructed) before the CRlogicals execute. Yet the CRlogical
instructions can and will dispatch even while prior instructions execute AND,
because the CRlogicals are not dispatch serializing, instuction which follow
the CRlogical can be dispatched while the CRlogical executes.
(and to really push this: the 603 is basically a two-up dispatcher. You could
dispatch the CRlogical in parallel with a 'following' instruction, and although
the CRlogical waits, the 'following instruction proceeds to execution, and if I
am not mistaked perhaps to completion (assuming no interdependencies)).
The CRlogical can not execute simultaneously with any prior instruction.
However, at the very leasts they can execute simulaneously with following
instructions.
The 603 does not go into a single thread mode when it sights a CRlogical. But
the CPU is a little bit naive about things. CRlogicals wait for ANY and ALL
prior instructions to complete, before executing. The CRlogicals have no
ability to distinguish prior instructions that do from instructions that do not
effect the CRn fields.
One portion of this discussion considered taking an instruction or two from
some following task, and excellerating it into the currently considered code
sequence to get as much distance as possible from CR image generation and
branch point to eliminate branch speculation by the CPU. I think that strategy
has not been contradicted generally. But excellerating any instruction (from
anywhere) infront of a CRlogical does not help in getting the CR settled down
before the CRlogical waits.
The CRlogical just waits for anything you put infront of it. So oprimization
here involves avoiding anything that takes more than one cycle to execute
infront of a CRlogical. Which for the 603 then becomes a principle with
manifold implications.
Best Wishes,
Robert Rayhawk
RKRa...@aol.com