Software pipelining is a technique to reorder loops to achieve more
parallelism at the instruction level. I was wondering if this
optimization is mainly beneficial for processors that have multiple
functional units like VLIWs, or can significant performance
improvements be also achieved for RISC processors with a restricted
number of functional unis like just two 4-stage pipelines?
I was also wondering if software pipelining can benefit from profiling
information, i.e. can the knowledge about frequently executed paths be
exploited for a better pipelining? Do you know any successful
applications?
Regards,
Tim
forloop (...) {
a[i] = b[i]; // stmt 1
c[i] = a[i]; // stmt 2
d[i] = c[i]; // stmt 3
}
there is a clear true dependence b/w stmt 1 , 2 and 3 which constrains
their execution order. If each statement takes one unit time then
each iteration will take 3 unit time, so if there are 2 iterations
then loop execution will take 6 unit time.
But if we unroll the loop and the reorder the instruction execution
order in such a way that all the non-dependent statements can execute
simultaneously then we can save on execution time. i.e.
1. Unrolling step .
a[0] = b[0]
c[0] = a[0]
d[0] = c[0]
a[1] = b[1]
c[1] = a[1]
d[1] = c[1]
2. After Instruction scheduling step.
a[0] = b[0]
a[1] = b[1]
c[0] = a[0]
c[1] = a[1]
d[0] = c[0]
d[1] = c[1]
Thus instructions in each of above pair of instruction can run
parallely. Thus if the architecture has multiple functional units then
we can run the above loop in 3 units times. Also we can club the
instruction present in each group in a long VLIW format if the
processor is VLIW, this will reduce the initial loop into few VLIW
instructions.
Optimization is called pipelining because we are able to overlap the
execution of multiple instruction after optimizations.
Thanks
- Jatin Bhateja
> It's a combination of loop unrolling and instruction scheduling. e.g
Thank you, but I know how software pipelining is working. :-)
My question was if also a significant performance increase for RISC
architectures with a restricted number of functional units can be
expected when software pipelining is applied.
And if profiling might be exploited here.
I'm not sure about significant increase, but it seems that even with
a small number of FUs pipelining could be used to hide latency (e.g.
memory access). But if FUs are already busy without software
pipelining, then pipelining is not going to make them any busier.
--
Pertti
=Neeraj
> Also, the overheads of modulo scheduling can make it worse.
> Overheads include, more registers, prolog and epilog instructions.
> As Tim said, only advantage is in hiding latencies such as memory
> access and floating point.
Software pipelining is usually beneficial for loops with a high number
of iterations.
S
ps: In order to hide memory latencies, non blocking caches should be
present in the underlying cpu. This is not common in vliw processors.
You need more than 1 functional unit to take advantage of pipelining.
Be it RISC or VLIW instruction format, that has no bearing on how many
functional units a processor can have,
The itanium processor has a RISC instruction format, and it has a
notion of an EPIC (explicitly parallel instruction) bundle. The
compiler will have to generate a bundle wherein multiple instructions
will be handed to the processor and scheduled for parallel execution
and it will yield a good performance if each of those instructions can
utilize different functional units at the same time.
> And if profiling might be exploited here.
I didn't get you there i.e. profiling doesn't seem related to
pipelining as described above.
regards
-kamal
I believe software pipelining can improve performance even when there
are only a few functional units.
For example, suppose the following code is run on a machine with a
single FU.
All the
It is just incrementing an array of 10 elements.
ldc r0, 0
L1:
0. load r1, 0(r0) # latency 2
1. add r1, r1, 1 # latency 1
2. store 0(r0), r1 # latency 1
3. add r0, r0, 1 # latency 1
4. cmp r0, 10 # latency 1
5. blt L1 # latency 1
For a non-SW pipelined code, a single iteration would take 7 cycles.
Now, for the SW pipelined code, a single iteration would take 6 cycles
on average.
## prologue
op0_0
noop
op1_0
op2_0
op3_0
op4_0
######## 1st iteration of steady state
op0_1
op5_0
op1_1
op2_1
op3_1
op4_1
######## 2nd iteration of steady state
op0_2
op5_1
op1_2
op2_2
op3_2
op4_2
########
.............
The degree of performance increase of a SW pipelined loop over a non-
SW pipelined loop depends on several factors, including the number of
resources (FUs), whether there are recurrences in the loop, the number
of iterations and also register pressure.
This was just a question if it would make sense to somehow
combine software pipelining with the most frequently executed
path (determined by profiling). I can't imagine any useful
combination of these two techniques, but maybe there are some?
Regards,
Tim
Even though I understand that this is an example, I still
have the feeling that with just a single functional unit, the
benefits of pipelining are marginal, in your example it's just
about 1 cycle. But I also agree that this might have some
positive effect when the loop has a high iteration count.
Software pipelining does not provide much performance
increase for loops with low trip counts, but it increases
code size. So using it anywhere else but in the frequently
executed paths would be rather ill adviced. Profiling is a
good way to find those paths.
--
Pertti
Actually, the benefits of software pipelining can be significant, even
for a single functional unit, if that functional unit has a multistage
pipeline. Assuming the number of iterations is sufficiently large to
mask the prologue and epiloque, the number of cycles required to
execute any given iteration doesn't really matter. The key is to
minimize the number of cycles needed to start each successive
iteration, otherwise known as the minimum iteration interval (MII).
Best,
Christopher Glaeser
Yes, sure it (is/was) applied very often for unrolling various byte/small
data handling, when handling each item individually costs less than or is
similar to the prologue/epilogue of the loop!
Regards
Armel