Example:
FOR I= 1, 2
FOR J= 1, N
A(I, J) = B(I, J)
ENDDO
ENDDO
This is a perfect candidate for loop jamming because it can
take advantage of cache locality (assuming the elements in
the first dimension are contiguous).
The result of a loop unroll and jam is...
FOR J = 1, N
A(1, J) = B(1, J)
A(2, J) = B(2, J)
ENDDO
Note, this same optimization can be achieved by first
performing a loop interchange, and then a complete unroll.
However, this is one complete optimization.
Loop-jamming is often used as a synonym for loop-fusion:
do i = 1, n
a(i) = b(i) + c(i)
enddo
do i = 1, n
e(i) = f(i) + g(i)
enddo
becomes:
do i = 1, n
a(i) = b(i) + c(i)
e(i) = f(i) + g(i)
enddo
Often used to jam many congruent Fortran 90 array assignment statements
into one loop to reduce loop overhead, where doing so does not introduce
any loop-carried-dependences.
Best regards,
Joel Williamson
A search for "loop jamming" (also called fusion) in AltaVista,
yielded:
http://www.ontek.com/mikey/current.html
"The idea is to combine adjacent loops which loop over the same range of
the same variable."
http://happy.dt.uh.edu/~hu/4328/note.html
"If there are two loops with no data dependencies adjacent to one
another, push them together."
http://www.bme.jhu.edu/programs/courses/580.483/opt.techniques.html
"Programmers often encounter two or more independent, vectorizable loops
with the same iteration count. By combining two or more of these loops
into one vector loop, loop control overhead can be reduced."
Tim
Not true. In the extreme case, performance can actually decrease. This
can occur, for example, when the fusion of the loops cause cache
conflicts that were not present when the loops are run separately.
See http://www.nullstone.com/htmls/category/fusion.htm for a brief
description of loop fusion.
Best regards,
Christopher Glaeser c...@nullstone.com
Nullstone Corporation http://www.nullstone.com
Of course, both cases can be true (but for different extreme cases).
--
Stanley Chow; sc...@nortel.ca, (613) 763-2831
Nortel/BNR, PO Box 3511 Station C, Ottawa, Ontario
I understood "In the extreme case, the slower loop is not slowed down"
to mean there are *no* programs that will slow as a result of loop
jamming. I contend such programs *do* exist. These two assertions can
not both be true. Either such programs exist, or they do not.
If I have misinterpreted the meaning of the original sentence, then
please explain what is "extreme" about the "extreme case". An example
would be helpful.
Best regards,
Christopher Glaeser c...@nullstone.com
Nullstone Corporation http://www.nullstone.com
Tim
Gene W
--
-------------------------------------------------------------------
Gene Wagenbreth Programming for Dollars
Every compiler optimization performed has a risk of producing slower
code. A good compiler can catch many of these instances by analysis
before the transformation; however, no compiler can catch all of
these instances.
Usually, a rule of thumb is to perform the transformation if
it results in a performance enhancement a majority of the time,
and the downside risk is minimal.
Loop jamming/fusion increases the number of generated instructions,
and can thus introduce page faulting in the body of the loop where no page
fault existed before. Obviously, this would slow down your code.
I took "extreme" to apply to the time penalty to the slow loop; as
opposed to applying to the set of programs. To paraphrase the original
statement:
Loop jamming will result in a loop that is, in general, slower than
each of the original loops. The time penalty will depend on many
factors (including cache size, serial dependency, multiple ALU) and
in the extreme case, the time penalty can be zero wrt the slower loop.
--
Stanley Chow; sc...@nortel.ca, (613) 763-2831
Nortel/BNR, PO Box 3511 Station C, Ottawa, Ontario
> If you are after high performance, f90 array syntax can get in your way. The
> compiler must translate the f90 to the equivalent of an f77 DO loop while
> generating code. It must perform loop jamming, and sometimes array demotion.
> Performance depends on how the compiler does this. The user has no control,
> and usually can not find out what the compiler has done. Rewriting the code
> by hand in f77 Do loops is often the best thing to do.
Well, the the whole point of using f90's array operations is lost,
isn't it? But I'm sure the APL/J crowd will point out to you that
they did this type of optimization oh, three decades ago. If your f90
compiler doesn't do its work properly, return it to the manufacturer
while complaining loudly.
Jan
<cliff...@Eng.Sun.COM> wrote:
>Actually, it's pretty easy for the jammed loop to run faster than the
>slowest loop, if the cache locality is right.
Then the compiler was not doing a good job for the slow loop - it should
have managed the cache better. In the extreme :-), it could invent
the "extra" loop and use loop jamming to speed up the slow loop.
--
Stanley Chow; sc...@nortel.ca, (613) 763-2831
Nortel/BNR, PO Box 3511 Station C, Ottawa, Ontario