[PPCG] Loop increments by more than 1 lead to inefficient GPU code

l.hu...@gmail.com

Aug 14, 2019, 12:26:30 PM

to isl Development

Hi PPCG team,

This is Lars, a student at the University of Muenster, Germany. I am currently working on my bachelor thesis (supervisors in CC), in which I try to connect polyhedral compilation with Multi-Dimensional Homomorphisms (MDHs).

I have a question: in some cases PPCG does not seem to generate efficient GPU code, possibly because a loop optimization is missing. Below is an example:

#define N 64

int main(){

    int a[N] = {0};
    int b[N] = {1};

    for(int i = 0; i < N; i += 2){
        a[i] = b[i];
    }
}


The PPCG-generated CUDA kernel (with --target=cuda --pet-autodetect) uses only half of the launched threads, which seems inefficient to me:

__global__ void kernel0(int *a, int *b)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;

    if (t0 % 2 == 0)
      a[32 * b0 + t0] = b[32 * b0 + t0];
}


The C output of PPCG is unchanged apart from the variable names:

    for (int c0 = 0; c0 <= 63; c0 += 2)
        a[c0] = b[c0];


Version used:
./ppcg --version
clang version 8.0.0 (tags/RELEASE_800/final)
pet-0.11.2
isl-0.21-105-g98bb6b88-GMP
ppcg-0.08.2


I expected the loop "for(int i = 0; i < N; i += 2){...}" to be transformed into something like "for(int i = 0; i < N/2; ++i)" with correspondingly rewritten array accesses ("a[2*i]" and "b[2*i]"), so that no launched threads sit idle. Is there any way to make PPCG/isl perform such a transformation? Maybe I am missing a command-line option?
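
A minimal C sketch of the rewrite described above. The function names are hypothetical and only serve the illustration, and the normalized bound N/2 matches the original loop only when N is even (as it is here):

```c
#include <assert.h>
#include <string.h>

#define N 64

/* Original loop: i takes the values 0, 2, 4, ..., N - 2. */
static void copy_strided(int *a, const int *b)
{
    for (int i = 0; i < N; i += 2)
        a[i] = b[i];
}

/* Expected form: unit-stride loop over j = i / 2,
 * with the array accesses rewritten to 2 * j. */
static void copy_normalized(int *a, const int *b)
{
    assert(N % 2 == 0); /* N / 2 iterations only match the original for even N */
    for (int j = 0; j < N / 2; ++j)
        a[2 * j] = b[2 * j];
}

/* Returns 1 if both variants write the same values to the same elements. */
static int variants_agree(void)
{
    int b[N], a1[N], a2[N];

    for (int k = 0; k < N; ++k) {
        b[k] = k + 1;
        a1[k] = a2[k] = -1; /* sentinel marks elements that are never written */
    }
    copy_strided(a1, b);
    copy_normalized(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Calling variants_agree() compares the two loops element by element, including the odd elements that neither loop should touch.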

Many thanks in advance for your help!

Best,
Lars

Sven Verdoolaege

Aug 14, 2019, 5:44:42 PM

to l.hu...@gmail.com, isl Development

On Wed, Aug 14, 2019 at 09:26:30AM -0700, l.hu...@gmail.com wrote:
> I have a question: in some cases PPCG does not seem to generate efficient GPU code, possibly because a loop optimization is missing. Below is an example:
>
> #define N 64
>
> int main(){
>
>     int a[N] = {0};
>     int b[N] = {1};
>
>     for(int i = 0; i < N; i += 2){
>         a[i] = b[i];
>     }
> }
>
>
> The PPCG-generated CUDA kernel (with --target=cuda --pet-autodetect) uses only half of the launched threads, which seems inefficient to me:

True.

> The C output of PPCG is unchanged apart from the variable names:
>     for (int c0 = 0; c0 <= 63; c0 += 2)
>         a[c0] = b[c0];

In this case, I don't see any reason to produce anything else.

> I expected the loop "for(int i = 0; i < N; i += 2){...}" to be transformed into something like "for(int i = 0; i < N/2; ++i)" with correspondingly rewritten array accesses ("a[2*i]" and "b[2*i]"), so that no launched threads sit idle. Is there any way to make PPCG/isl perform such a transformation? Maybe I am missing a command-line option?

This is not implemented in PPCG.
You could look for strides in the schedule domain that is mapped
to blocks/threads and exploit them to reduce the domain.
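
To illustrate the idea, here is a host-side C emulation of the two thread mappings. It is not PPCG code and not CUDA. The function names are hypothetical, and the launch configuration (N / 32 blocks of 32 threads) is an assumption inferred from the index expression 32 * b0 + t0 in kernel0. The first function mirrors the guarded kernel; the second divides the mapped domain by its stride of 2 so that every launched thread does work.

```c
#include <assert.h>

#define N 64
#define THREADS_PER_BLOCK 32

/* Guarded mapping, as in kernel0: one thread per index, odd threads idle.
 * Returns the number of emulated threads; *active counts those doing work. */
static int run_guarded(int *a, const int *b, int *active)
{
    int launched = 0;

    *active = 0;
    for (int b0 = 0; b0 < N / THREADS_PER_BLOCK; ++b0)
        for (int t0 = 0; t0 < THREADS_PER_BLOCK; ++t0) {
            ++launched;
            if (t0 % 2 == 0) {
                a[THREADS_PER_BLOCK * b0 + t0] = b[THREADS_PER_BLOCK * b0 + t0];
                ++(*active);
            }
        }
    return launched;
}

/* Stride removed: the domain { i : 0 <= i < N, i mod 2 = 0 } is replaced
 * by { j : 0 <= j < N / 2 } with i = 2 * j, so no guard is needed. */
static int run_compressed(int *a, const int *b, int *active)
{
    int launched = 0;

    assert(N % (2 * THREADS_PER_BLOCK) == 0); /* keeps the grid exact */
    *active = 0;
    for (int b0 = 0; b0 < (N / 2) / THREADS_PER_BLOCK; ++b0)
        for (int t0 = 0; t0 < THREADS_PER_BLOCK; ++t0) {
            int j = THREADS_PER_BLOCK * b0 + t0;

            ++launched;
            a[2 * j] = b[2 * j];
            ++(*active);
        }
    return launched;
}
```

With N = 64, the guarded mapping emulates 64 threads of which 32 are active, while the compressed mapping emulates 32 threads that are all active. The memory accesses still touch only every other element, so the access pattern itself stays strided.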

skimo

Sven Verdoolaege

Aug 5, 2021, 5:07:06 PM

to l.hu...@gmail.com, isl Development

Since you were working on your bachelor thesis back then,
you're probably not interested in this anymore,
but for the record, PPCG master will now remove the stride
(at least in simple cases).

skimo
