Agreed, that is the next step for hmmer. (I think you must mean two vectorizable loops out of the three distributed loops.)
My hope is that with some tuning, this pass and Sanjoy’s new range-check elimination pass should take us there.
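To illustrate the transformation in question, a minimal sketch (simplified; the actual hmmer loop is considerably more involved, and all names here are illustrative):

  // Sketch: the recurrence on d[] blocks vectorization of the whole body.
  void before(int n, float *a, const float *b, const float *c,
              float *d, const float *e) {
    for (int i = 1; i < n; ++i) {
      a[i] = b[i] + c[i];      // independent across iterations
      d[i] = d[i - 1] * e[i];  // loop-carried dependence
    }
  }

  // After distribution: the first loop vectorizes, the second stays scalar.
  void after(int n, float *a, const float *b, const float *c,
             float *d, const float *e) {
    for (int i = 1; i < n; ++i)
      a[i] = b[i] + c[i];
    for (int i = 1; i < n; ++i)
      d[i] = d[i - 1] * e[i];
  }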
Adam
My plan is to proceed with the following steps:
- Bring the current functionality to trunk by splitting off smaller patches from the current patch and completing them. The final commit will enable loop distribution with a command-line flag or a loop hint.
- Explore and fine-tune the proper cost model for loop distribution to allow partial vectorization. This is essentially whether to partition and what these partitions should be. Currently instructions are mapped to partitions using a simple heuristic to create vectorizable partitions. We may need to interact with the vectorizer to make sure the vectorization will actually happen and it will be overall profitable.
- Explore other potentials for loop distribution, e.g.:
  - Partial vectorization of loops that can't be if-converted (see the sketch after this list)
  - Classic loop distribution to improve spatial locality
  - Compute the Program Dependence Graph rather than the list of unsafe memory accesses and allow reordering of memory operations
  - Distribute a loop in order to recognize parts as loop idioms
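For the first item, a hedged sketch of what partial vectorization of a non-if-convertible loop could look like (illustrative names; the compressing store is what defeats if-conversion):

  // The store through p cannot be if-converted because its address
  // depends on how many prior iterations matched. Distribution can
  // still split off the multiply into its own, vectorizable loop.
  int *sketch(int n, float *a, const float *b, const float *c,
              int *out, float t) {
    int *p = out;
    for (int i = 0; i < n; ++i) {
      a[i] = b[i] * c[i];  // becomes a vectorizable loop after distribution
      if (a[i] > t)
        *p++ = i;          // stays in a scalar loop
    }
    return p;
  }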
Long term, loop distribution could also become a transformation utility (Transform/Util). That way, the loop transformation passes could use it to strip the loop of parts that inhibit the given optimization.
Please let me know if you have feedback either on the design or on the next steps.
Thanks,
Adam
Thanks for working on this! We definitely need this capability in LLVM (and for more than just enabling vectorization).
This is good. It would be nice if this new analysis could have the same interface as the DependenceAnalysis pass, so that it will be easy to switch between them. I think that, eventually, we'll want to switch everything to use something like DependenceAnalysis, at least at higher optimization levels.
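Purely as a hypothetical illustration of what a shared interface could look like (these names are invented, not existing LLVM classes; DependenceAnalysis::depends is the existing signature being mirrored):

  // Invented for illustration; not an existing LLVM class.
  struct LoopDependenceQuery {
    virtual ~LoopDependenceQuery() {}
    // Mirrors DependenceAnalysis::depends(): a null result means the
    // two instructions provably do not depend on each other.
    virtual std::unique_ptr<Dependence>
    depends(Instruction *Src, Instruction *Dst,
            bool PossiblyLoopIndependent) = 0;
  };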
>
>
> - It also reuses the Run-time Memory Check code from the Loop Vectorizer.
>   The hmmer loop requires memchecks. This is again captured by the same
>   LoopAccessAnalysis class.
I think this is also reasonable; we just want to make sure that we don't end up with duplicate memory checks. I've seen cases in the past where the vectorizer inserted checks that should have been eliminated as duplicates of other loop guards; SE's guard-domination checking may need to be improved.
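For reference, each memcheck boils down to a test of roughly this shape (a source-level sketch; the real checks are emitted as IR on pointer ranges):

  // Runtime no-overlap test between two ranges of n floats; the
  // transformed loop runs only when the test passes, with a scalar
  // fallback otherwise.
  bool noOverlap(const float *a, const float *b, int n) {
    return a + n <= b || b + n <= a;
  }

If both the distribution pass and the vectorizer emit this for the same pointer pair, the second copy is exactly the redundancy to avoid.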
This sounds reasonable.
>
>
> My plan is to proceed with the following steps:
>
>
> - Bring the current functionality to trunk by splitting off smaller patches
>   from the current patch and completing them. The final commit will enable
>   loop distribution with a command-line flag or a loop hint.
Okay, please do.
>
>
> - Explore and fine-tune the proper cost model for loop distribution to allow
>   partial vectorization. This is essentially whether to partition and what
>   these partitions should be. Currently instructions are mapped to partitions
>   using a simple heuristic to create vectorizable partitions. We may need to
>   interact with the vectorizer to make sure the vectorization will actually
>   happen and it will be overall profitable.
I think this sounds reasonable. Splitting to enable vectorization is important; one reason to have this process tightly integrated with vectorization is so that it can properly integrate with the vectorizer's register-pressure checking (we might split to reduce register pressure, thus enabling more interleaving, at least when doing so does not decrease spatial locality).
Independent of vectorization, loop splitting is important for reducing register pressure within loops (e.g. loops with too many phis, but that are splittable, could be split to prevent intra-iteration spilling). Also very important is splitting to reduce the number of hardware prefetching streams a loop uses. On every system on which I've worked, the hardware prefetchers have a finite set of resources for sustaining prefetching streams (5-10 per thread, depending on the architecture). When a loop requires more streams than this, performance suffers greatly, and splitting the loop is highly profitable. I'd definitely like us to hit these two use cases too.
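As a sketch of the stream case (the array count and the stream budget are illustrative):

  // Eight distinct arrays means roughly eight prefetch streams; on a
  // machine sustaining ~5 streams per thread, this loop thrashes the
  // prefetcher.
  void before(int n, float *a, const float *b, const float *c, const float *d,
              float *e, const float *f, const float *g, const float *h) {
    for (int i = 0; i < n; ++i) {
      a[i] = b[i] + c[i] + d[i];
      e[i] = f[i] + g[i] + h[i];
    }
  }

  // Split into two loops of four streams each, both fit the budget.
  void after(int n, float *a, const float *b, const float *c, const float *d,
             float *e, const float *f, const float *g, const float *h) {
    for (int i = 0; i < n; ++i)
      a[i] = b[i] + c[i] + d[i];
    for (int i = 0; i < n; ++i)
      e[i] = f[i] + g[i] + h[i];
  }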
>
>
> - Explore other potentials for loop distribution, e.g.:
>   - Partial vectorization of loops that can't be if-converted
>   - Classic loop distribution to improve spatial locality
>   - Compute the Program Dependence Graph rather than the list of unsafe
>     memory accesses and allow reordering of memory operations
This would also be quite nice to have.
> - Distribute a loop in order to recognize parts as loop idioms
Indeed, once you have the partitions, splitting out a memcpy, etc. should not be hard; this is not always profitable, however.
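For illustration (a sketch; whether the call beats the fused loop depends on the trip count and on locality):

  #include <cstring>

  // Mixed loop: a copy interleaved with a reduction.
  float before(int n, float *a, const float *b, const float *c) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
      a[i] = b[i];       // copy partition
      s += c[i] * c[i];  // compute partition
    }
    return s;
  }

  // After distribution, loop-idiom recognition can replace the copy
  // partition with a single library call.
  float after(int n, float *a, const float *b, const float *c) {
    std::memcpy(a, b, n * sizeof(float));
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
      s += c[i] * c[i];
    return s;
  }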
>
>
> Long term, loop distribution could also become a transformation utility
> (Transform/Util). That way, the loop transformation passes could use it to
> strip the loop of parts that inhibit the given optimization.
This sounds good, but we still may want to schedule the transformation itself (late in the pipeline). We don't want to limit register-pressure-induced loop splitting, for example, to vectorizable loops.
Thanks again,
Hal
>
>
> Please let me know if you have feedback either on the design or on the next
> steps.
>
>
> Thanks,
> Adam
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLV...@cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
> > Long term, loop distribution could also become a transformation utility
> > (Transform/Util). That way, the loop transformation passes could use it to
> > strip the loop of parts that inhibit the given optimization.
>
> This sounds good, but we still may want to schedule the transformation
> itself (late in the pipeline). We don't want to limit
> register-pressure-induced loop splitting, for example, to vectorizable
> loops.
This sounds like a good idea, thank you. I’ll experiment with it when I get this far.
Adam
We don't exactly. JumpThreading could handle the constant-range cases, but that also leaves a lot on the table (and we don't run it after loop vectorization regardless). What we do have is SE's isLoopEntryGuardedByCond, which I thought was used by the loop vectorizer to avoid adding unnecessary guards, but I don't see that now, so I might be wrong.
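For reference, that query would be used roughly like this to suppress a redundant memcheck (isLoopEntryGuardedByCond is the existing ScalarEvolution entry point; PtrAEnd and PtrBStart are illustrative SCEVs for the end of one accessed range and the start of the other):

  // Sketch: if a dominating condition already proves A's range ends
  // at or before B's start, the runtime overlap check is redundant.
  if (SE->isLoopEntryGuardedByCond(L, ICmpInst::ICMP_ULE,
                                   PtrAEnd, PtrBStart)) {
    // skip emitting the memcheck for this pointer pair
  }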
Not offhand. The interleaving factor is limited by the number of available registers. I recall running across loops with many phi values; those phis limit the registers available for the intermediate values that interleaving requires. If the loop can be split, reducing the number of registers needed for phis, that can increase the interleaving factor that is possible.
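A minimal sketch of that situation (the accumulator counts are illustrative):

  // Before: eight accumulators (phis) live across one loop compete
  // with the registers the vectorizer would use for interleaving.
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0, t0 = 0, t1 = 0, t2 = 0, t3 = 0;
  for (int i = 0; i < n; ++i) {
    s0 += a[i]; s1 += b[i]; s2 += c[i]; s3 += d[i];
    t0 += a[i] * a[i]; t1 += b[i] * b[i];
    t2 += c[i] * c[i]; t3 += d[i] * d[i];
  }

  // After splitting: four phis per loop, so each loop leaves more
  // registers free for a higher interleave factor.
  for (int i = 0; i < n; ++i) {
    s0 += a[i]; s1 += b[i]; s2 += c[i]; s3 += d[i];
  }
  for (int i = 0; i < n; ++i) {
    t0 += a[i] * a[i]; t1 += b[i] * b[i];
    t2 += c[i] * c[i]; t3 += d[i] * d[i];
  }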
Thanks again,
Hal