On Wed, Aug 14, 2019 at 09:26:30AM -0700,
l.hu...@gmail.com wrote:
> Please let me ask you a question: it seems that in some cases, PPCG does not generate efficient GPU code, perhaps because a loop optimization is missing. Here is an example:
>
> #define N 64
>
> int main() {
>     int a[N] = {0};
>     int b[N] = {1};
>
>     for (int i = 0; i < N; i += 2) {
>         a[i] = b[i];
>     }
> }
>
>
> The CUDA kernel that PPCG generates (with --target=cuda --pet-autodetect) uses only half of the launched threads, which seems inefficient to me:
True.
> The C output of PPCG shows no change apart from renamed variables:
> for (int c0 = 0; c0 <= 63; c0 += 2)
> a[c0] = b[c0];
In this case, I don't see any reason to produce anything else.
> I expected a transformation of the loop "for (int i = 0; i < N; i += 2) {...}" into something like "for (int i = 0; i < N/2; ++i)", with the array accesses rewritten accordingly ("a[2*i]" and "b[2*i]"), so that no launched threads sit idle. Is there any way to make PPCG/isl perform such a transformation? Maybe I am missing a command-line option?
This is not implemented in PPCG.
You could look for strides in the schedule domain that is mapped
to blocks/threads and exploit them to reduce the domain.
skimo