Skip modifying a Scop's Region when some part of PPCGCodeGeneration fails, to pass it to IslScheduleOptimizer


llvmres...@iith.ac.in

Jun 10, 2017, 7:22:03 AM
to Polly Development
Is it possible to make PPCGCodeGeneration skip code-generating a Scop unless all previous steps of the pass are successful? This question is in the context of my patch [D34054] Introduce a hybrid target to generate code for either the GPU or CPU <https://reviews.llvm.org/D34054>.

I see that the pass has 4 steps overall (please point out if I'm missing anything):
  1. ppcg optimization: success is indicated by (PPCGGen->tree != NULL)
  2. GPU code generation: success is indicated by (GPUNodeBuilder::BuildSuccessful == true)
  3. Static cost estimation: running the kernel on the GPU is considered effective if (NodeBuilder.DeepestSequential <= NodeBuilder.DeepestParallel)
  4. Modifying the Scop's Region to code-generate the Scop
If ppcg optimisation fails, the Region is protected by this guard in PPCGCodeGeneration.cpp:

    if (PPCGGen->tree)
      generateCode(isl_ast_node_copy(PPCGGen->tree), PPCGProg);

However, if step 1 succeeds, generateCode modifies the Region even when step 2 or step 3 fails. Although the polly.merge_new_and_old block (represented by the SplitBlock object) is set to point to the original code when step 2 or 3 fails, the analysis information of the Scop has already been invalidated by then.

This precludes the possibility of attempting CPU optimisations on the Scop by passing it to IslScheduleOptimizer, which is what my patch [D34054] proposes.

I'd like to know if the following changes could help skip code-generating the Scop when step 2 or 3 fails:
diff --git a/lib/CodeGen/PPCGCodeGeneration.cpp b/lib/CodeGen/PPCGCodeGeneration.cpp
index f45b9ac..ab0d57a 100644
--- a/lib/CodeGen/PPCGCodeGeneration.cpp
+++ b/lib/CodeGen/PPCGCodeGeneration.cpp
@@ -2619,7 +2619,9 @@ public:
     ScopAnnotator Annotator;
     Annotator.buildAliasScopes(*S);
 
-    Region *R = &S->getRegion();
+    Region *R_orig = &S->getRegion(), *R;
+    Region R_copy = Region(*R_orig);
+    R = &R_copy;
 
     simplifyRegion(R, DT, LI, RI);
 
@@ -2662,11 +2664,10 @@ public:
     /// In case a sequential kernel has more surrounding loops as any parallel
     /// kernel, the SCoP is probably mostly sequential. Hence, there is no
     /// point in running it on a GPU.
-    if (NodeBuilder.DeepestSequential > NodeBuilder.DeepestParallel)
-      SplitBlock->getTerminator()->setOperand(0, Builder.getFalse());
+    if (NodeBuilder.DeepestSequential > NodeBuilder.DeepestParallel
+               || !NodeBuilder.BuildSuccessful )
+      *(R_orig) = R_copy;
 
-    if (!NodeBuilder.BuildSuccessful)
-      SplitBlock->getTerminator()->setOperand(0, Builder.getFalse());
   }
 
   bool runOnScop(Scop &CurrentScop) override {

llvmres...@iith.ac.in

Jun 10, 2017, 9:27:59 AM
to Polly Development, Tobias Grosser, SANJAY SRIVALLABH SINGAPURAM
@Tobias, could you share your thoughts?

Tobias Grosser

Jun 11, 2017, 11:11:12 PM
to llvmres...@iith.ac.in, Polly Development
Hi Sanjay,

thanks for pushing forward!

All this looks very hacky. I suggest instead scanning the AST once more
before calling generateCode to compute DeepestSequential and
DeepestParallel, and then only generating code if it is deemed
profitable!

Best,
Tobias
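
As a rough illustration of the suggestion above, the profitability check can be computed by a read-only walk over the AST before any code is generated. This is only a sketch on a toy tree: `AstNode`, `Depths`, `scanDepths`, `isProfitable`, `makeKernel` and `makeBlock` are hypothetical names, not isl or Polly APIs, and treating a kernel as parallel when it has at least one grid dimension is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Toy stand-in for an AST node; this is NOT the isl_ast_node API.
struct AstNode {
  enum class Kind { Block, Kernel };
  Kind kind = Kind::Block;
  int loopDims = 0;              // Kernel only: surrounding loop dimensions.
  int gridDims = 0;              // Kernel only: 0 models a sequential kernel.
  std::vector<AstNode> children; // Block only.
};

struct Depths {
  int deepestSequential = 0;
  int deepestParallel = 0;
};

// Read-only walk: records the deepest sequential and the deepest parallel
// kernel without generating or modifying anything.
void scanDepths(const AstNode &node, Depths &d) {
  if (node.kind == AstNode::Kind::Kernel) {
    assert(node.loopDims >= 0 && node.gridDims >= 0);
    int &slot = node.gridDims > 0 ? d.deepestParallel : d.deepestSequential;
    slot = std::max(slot, node.loopDims);
    return;
  }
  for (const AstNode &child : node.children)
    scanDepths(child, d);
}

// Mirrors the condition quoted in the thread:
// DeepestSequential <= DeepestParallel means GPU execution looks worthwhile.
bool isProfitable(const AstNode &root) {
  Depths d;
  scanDepths(root, d);
  return d.deepestSequential <= d.deepestParallel;
}

// Small helpers to build toy trees.
AstNode makeKernel(int loopDims, int gridDims) {
  AstNode n;
  n.kind = AstNode::Kind::Kernel;
  n.loopDims = loopDims;
  n.gridDims = gridDims;
  return n;
}

AstNode makeBlock(std::vector<AstNode> children) {
  AstNode n;
  n.children = std::move(children);
  return n;
}
```

With this shape, a caller would evaluate `isProfitable(root)` first and call generateCode only when it returns true, so the Region is never touched for a SCoP that is judged mostly sequential.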


SANJAY SRIVALLABH SINGAPURAM

Jun 12, 2017, 6:24:17 AM
to Tobias Grosser, Polly Development
Hello Tobias,

Can you elaborate on what you meant by "scan the AST before calling generateCode"?

I can check for "NodeBuilder.DeepestSequential > NodeBuilder.DeepestParallel" (in a way) before calling generateCode.

But to check whether "NodeBuilder.BuildSuccessful == true", generateCode has to be called. BuildSuccessful is set in finalizeKernelArguments and finalizeKernelFunction, which are indirectly called by generateCode.

SANJAY SRIVALLABH SINGAPURAM

Jun 18, 2017, 2:56:25 AM
to Tobias Grosser, Polly Development
@tobias ping

SANJAY SRIVALLABH SINGAPURAM

Jun 18, 2017, 4:38:28 AM
to Tobias Grosser, Polly Development
Hello Tobias,

The final value of BuildSuccessful can be known only after GPUModule is built, i.e. after "2660:   NodeBuilder.create(Root)".

Would you recommend the following scheme?
1. Build the GPUModule and store the ASMString of each "kernel" user node in a cache.
2. Check BuildSuccessful.
 2.1 If BuildSuccessful == false, bail out.
 2.2 Otherwise, parse and code-generate the schedule tree.
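
The scheme above could be sketched roughly as follows. This is a simplified model, not Polly code: `KernelCache`, `BuildFn`, `CommitFn`, `buildAllKernels` and `generateIfBuildSucceeds` are hypothetical names, the build callback stands in for whatever produces a kernel's ASM string, and the commit callback stands in for the host-side code generation that modifies the Region.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical cache: kernel name -> generated ASM string.
struct KernelCache {
  std::map<std::string, std::string> asmByName;
};

// Stand-in for kernel generation: fills `out` and returns true on success.
using BuildFn = std::function<bool(const std::string &name, std::string &out)>;
// Stand-in for the host code generation that would modify the Region.
using CommitFn = std::function<void(const KernelCache &)>;

// Step 1: build every kernel and stage its ASM string. On the first failure
// return false and leave `cache` unchanged.
bool buildAllKernels(const std::vector<std::string> &kernelNames,
                     const BuildFn &build, KernelCache &cache) {
  assert(build && "a build callback is required");
  KernelCache staged;
  for (const std::string &name : kernelNames) {
    std::string text;
    if (!build(name, text))
      return false;
    staged.asmByName[name] = std::move(text);
  }
  cache = std::move(staged);
  return true;
}

// Steps 2, 2.1 and 2.2: commit only when all kernels were built.
bool generateIfBuildSucceeds(const std::vector<std::string> &kernelNames,
                             const BuildFn &build, const CommitFn &commit) {
  assert(commit && "a commit callback is required");
  KernelCache cache;
  if (!buildAllKernels(kernelNames, build, cache))
    return false; // 2.1: bail out, nothing has been modified.
  commit(cache);  // 2.2: safe to code-generate using the cached kernels.
  return true;
}
```

The point of the split is that everything that can fail runs before the commit step, so a failed build leaves the original code and its analysis information intact for a later CPU optimisation attempt.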


Tobias Grosser

Jun 19, 2017, 12:35:07 AM
to SANJAY SRIVALLABH SINGAPURAM, Polly Development
Hi Sanjay,

there are two reasons why code generation in the GPU path may fail.
First, because we write to scalar values, which is not allowed since we
do not yet support adding synchronization statements for them. Second,
because the GPU kernel cannot be generated. I personally do not think we
should put a large effort into trying to recover from this (i.e., to
generate GPU code). These are supposed to be exceptional situations
indicating missing features on our side.

It seems in your experiments we fail here rather often. Is this due to
scalar writes or due to problems in the kernel PTX code generation?

Best,
Tobias

Siddharth Bhat

Jun 19, 2017, 3:29:59 AM
to Tobias Grosser, SANJAY SRIVALLABH SINGAPURAM, Polly Development

If it fails at verifyModule, IIRC you can pass verifyModule a raw_ostream, which it will write the errors to (I'm debugging similar problems :) ). Having that output is nice.

Cheers,
Siddharth.

Sending this from my phone, please excuse any typos!

SANJAY SRIVALLABH SINGAPURAM

Jun 19, 2017, 10:09:41 AM
to Tobias Grosser, Polly Development
I'm thinking of a new function, "generateASM", that generates and caches the kernels by traversing the isl_ast Root and considering only the "kernel" nodes. This would involve moving the code that is responsible just for generating the PTX string into generateASM, including the statements that compute DeepestSequential and DeepestParallel and set BuildSuccessful.

I've implemented a rudimentary traversal of the AST in this diff. Please share your thoughts.
On Mon, Jun 19, 2017 at 10:05 AM Tobias Grosser <tob...@grosser.es> wrote:
> Hi Sanjay,
>
> there are two reasons why code generation in the GPU path may fail. One,
> because we write to scalar values which are not allowed, for which we do
> not support to add synchronization statements yet. Second, because the
> GPU kernel cannot be generated. I personally do not think we should
> put a large effort in trying to recover from this (i.e., to generate GPU
> code). These are supposed to be exceptional situations indicating
> missing features on our side.

For the patch "Introduce a hybrid target ..." to be most effective, the Region associated with the Scop should remain untouched so that IslScheduleOptimizer can optimize it. This is why I think it is important to recover from a failed code generation.

> It seems in your experiments we fail here rather often. Is this due to
> scalar writes or due to problems in the kernel PTX code generation?

Out of the 11 kernels that ppcg is able to optimize in PolyBench.jl (excluding one which was crashing), 4 failed during code generation: 3 failed at verifyModule with the error "Instruction does not dominate all uses!", and the other was declared cost-ineffective.

SANJAY SRIVALLABH SINGAPURAM

Jun 19, 2017, 10:28:30 AM
to Tobias Grosser, Polly Development
Hello Siddharth,

That's how I came to know what caused an empty kernel string here.