I was trying to run an FFT computation with attached Transformations (though that's not important here). As input arguments, however, I passed not a regular Array (class Array(gpuarray.GPUArray)) but a slice of one:
fft = FFT(Type(self.data_dtype, shape=(sats_num, dopp_num, ca_samples)), axes=(2,))
fft.parameter.input.connect(prev_tr, prev_tr.out, indata=prev_tr.indata, f=prev_tr.f, phases=prev_tr.phases,
    epochs_per_coh=prev_tr.epochs_per_coh, coh_starts=prev_tr.coh_starts, exp_lut_data=prev_tr.exp_lut_data)
fft.parameter.output.connect(post_tr, post_tr.fftout, result=post_tr.result, prn=post_tr.prn)
fftc = fft.compile(self.thr, compiler_options=nvcc_opts)
fftc(self._inter_res_dev, self._prn_dev, self._samples_dev, self._f_dev[non_coh_i],
    self._phases_dev[non_coh_i], self._epochs_per_coh_dev, self._starts_dev[non_coh_i],
As you can see, some arguments of the fftc call are slices. However, I found that the address of this data in my CUDA code is always the same, which means the slicing is not taken into account. Array::__getitem__(self, index) looks correct: it creates a new array with new strides and offset, but the same base_data.
But Kernel::prepared_call(self, *args) then ignores everything except base_data, so the slicing has no effect.
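To make the behaviour concrete, here is a minimal pure-Python model of what I think is happening. It is only a sketch: FakeArray, launch_base_only and launch_with_offset are hypothetical stand-ins I made up for illustration, not reikna or PyCUDA classes, and strides are counted in elements rather than bytes.

```python
# Pure-Python model of a strided view. All names here are hypothetical
# stand-ins for illustration; none of them is reikna or PyCUDA API.

class FakeArray:
    def __init__(self, base_data, shape, strides, offset=0):
        self.base_data = base_data  # shared underlying buffer (a list here)
        self.shape = shape
        self.strides = strides      # in elements, not bytes, for simplicity
        self.offset = offset        # where this view starts inside base_data

    def __getitem__(self, index):
        # Integer index on the first axis: same buffer, shifted offset.
        if not 0 <= index < self.shape[0]:
            raise IndexError(index)
        return FakeArray(
            self.base_data,
            self.shape[1:],
            self.strides[1:],
            self.offset + index * self.strides[0],
        )


def launch_base_only(arr):
    # Models a call that forwards only base_data: the "kernel" always
    # starts reading at element 0 of the buffer.
    return arr.base_data[0]


def launch_with_offset(arr):
    # Models a call that also applies the view's offset.
    return arr.base_data[arr.offset]


full = FakeArray(list(range(12)), shape=(3, 4), strides=(4, 1))
row = full[2]

print(row.base_data is full.base_data)  # the slice shares the buffer
print(row.offset)                       # but starts 2 * 4 elements in
print(launch_base_only(row))            # offset ignored
print(launch_with_offset(row))          # offset applied
```

In this model the slice carries the right offset, but a call that looks only at base_data reads from the start of the buffer for every index, which matches what I observe.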
Is that done intentionally?
My idea was to upload the whole data set once and then process it one portion at a time. I can't process everything at once in parallel, because I need to integrate the portions.
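The access pattern I have in mind looks roughly like the sketch below. Again this is pure Python with hypothetical names (process_portion, integrate); the sum of squares is just a placeholder for the real per-portion work, and passing the start index explicitly is only an assumption about how the offset could be handed over if slices are not supported.

```python
# Sketch of "upload once, process portion by portion, integrate".
# process_portion and integrate are hypothetical names; the list stands in
# for data that already lives on the device.

def process_portion(buffer, start, length):
    # Stand-in for one kernel call: it receives the whole buffer plus an
    # explicit start index and only touches its own portion.
    stop = min(start + length, len(buffer))
    return sum(buffer[i] * buffer[i] for i in range(start, stop))


def integrate(buffer, portion_len):
    # Portions are handled sequentially because each partial result is
    # accumulated into the running total.
    total = 0
    for start in range(0, len(buffer), portion_len):
        total += process_portion(buffer, start, portion_len)
    return total


samples = [1, 2, 3, 4, 5, 6]     # "uploaded" once
result = integrate(samples, 2)   # processed two elements at a time
print(result)
```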