FFT implementation


Gregor Taylor

Jan 14, 2021, 10:22:42 AM
to reikna
Hello, 

Just trying to implement a basic FFT to get my head around how all of this works - GPU programming and OpenCL are all new to me. My code is:

import numpy as np
import matplotlib.pyplot as plt
from reikna.cluda import dtypes, ocl_api 
from reikna.fft import FFT 
from reikna.core import Annotation, Type, Transformation, Parameter
from reikna.algorithms import PureParallel

# A transformation that transforms a real array to a complex one
# by adding a zero imaginary part
def get_complex_trf(arr):
    complex_dtype = dtypes.complex_for(arr.dtype)
    return Transformation(
            [Parameter('output', Annotation(Type(complex_dtype, arr.shape), 'o')),
             Parameter('input', Annotation(arr, 'i'))],
    """
            ${output.store_same}(
                COMPLEX_CTR(${output.ctype})(
                    ${input.load_same},
                    0));
            """)

#data
x=np.arange(0,1,1/1024)
w=10*2*np.pi
y=np.cos(w*x)+2*np.cos(2*w*x)+3*np.cos(3*w*x)+4*np.cos(4*w*x)+5*np.cos(5*w*x)

#reikna
api=ocl_api()
device=api.get_platforms()[0].get_devices()[2]
print('Performing on {}'.format(device))
thread=api.Thread(device)

trf = get_complex_trf(y)
fft=FFT(trf.output)
fft.parameter.input.connect(trf, trf.output, new_input=trf.input)
fftc=fft.compile(thread)

arr_dev=thread.to_device(y)
res_dev=thread.array(y.shape, np.complex64)
fftc(res_dev, arr_dev)
result=res_dev.get()

Which is pretty much lifted from the example page. However, the output is not what I expect. If I perform np.fft on the original y data then I get the expected output - spikes at 10/20/30/40/50 - but when I perform it with Reikna on the GPU it does not look the same. I think I'm missing something obvious here but I can't see it. I've tried checking the output of the transform with PureParallel.from_trf, and also removing the transform altogether and just changing the datatype with np.astype before passing to the FFT, but the output remains different and I have no clues from this what the issue might be.
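For reference, the CPU-side check I'm comparing against is just numpy (same signal as above, peaks expected at bins 10 through 50):

```python
import numpy as np

x = np.arange(0, 1, 1/1024)
w = 10 * 2 * np.pi
# y = cos(wx) + 2cos(2wx) + ... + 5cos(5wx), as in the code above
y = sum((k + 1) * np.cos((k + 1) * w * x) for k in range(5))

spectrum = np.abs(np.fft.fft(y))
# the five strongest bins in the first half of the spectrum
peaks = sorted(int(i) for i in np.argsort(spectrum[:512])[-5:])
print(peaks)  # [10, 20, 30, 40, 50]
```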

Any tips or help appreciated.

Thanks,
Gregor

Bogdan Opanchuk

Jan 14, 2021, 1:56:15 PM
to reikna
Hi Gregor,

The problem here is that your `y` array is `float64`, so the transformation you create is `float64 -> complex128`, and the `FFT` (which is created based on the transformation signature) is `complex128 -> complex128`. Then you give it a `complex64` array for the output, which results in garbage data (or a segfault in my case). 

A way to fix that would be to convert `y` to `float32` in advance, for example. Other solutions are possible depending on your actual code layout and whether you want a single- or double-precision FFT.

You can see what arrays computations and transformations expect by printing their `signature` attribute.
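To see the byte-level mismatch in plain numpy (not reikna code, just an illustration): reinterpreting a complex128 buffer as complex64 scrambles both the element count and the values, which is exactly the "garbage data" you observed:

```python
import numpy as np

a = np.full(4, 3 + 4j, dtype=np.complex128)  # what the FFT actually writes
b = a.view(np.complex64)                     # what a complex64 output array reads

print(a.shape, b.shape)  # (4,) (8,) - same bytes, twice as many "elements"
print(b[0])              # nothing like 3+4j
```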

Gregor Taylor

Jan 14, 2021, 3:13:31 PM
to reikna
Ahh that's it! Thanks so much for your help. I just altered the input to be float32 and that gave expected output. 

One thing though - I left my output array as 'complex64' and got the expected output, but if I alter it to 'complex128', which I think is what the FFT outputs (from what you say), then the result is nonsensical?

Once again thanks for your help - now I can start making it do useful things!

Bogdan Opanchuk

Jan 14, 2021, 3:50:28 PM
to reikna
Hm, I'm not sure what you're talking about. If I just change `res_dev=thread.array(y.shape, np.complex64)` to `res_dev=thread.array(y.shape, np.complex128)` in your example, I get the expected result (peaks at specific frequencies). 

Gregor Taylor

Jan 15, 2021, 2:48:52 AM
to reikna
I think I get it. In my last message I had changed the initial y array to float32. This was then promoted in the transformation to the complex64 I had originally specified.

If I leave the original y array as float64 then it is promoted to complex128, and just changing the final output array to complex128 works as well.

Both result in the correct answer - but I think just specifying the correct output array (complex128) is simpler and cleaner. 
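So for the record, the pairing is float32 -> complex64 and float64 -> complex128 - numpy's own promotion rules agree with what the transformation produces:

```python
import numpy as np

# single precision: a float32 input pairs with a complex64 FFT
assert np.result_type(np.float32, np.complex64) == np.complex64
# double precision: a float64 input forces everything up to complex128
assert np.result_type(np.float64, np.complex64) == np.complex128
```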

Thanks for your help!

Gregor Taylor

Jan 20, 2021, 6:04:03 AM
to reikna
Hi Bogdan, 
Sorry but I have a further question - if you would be so kind. 
I have now moved on a bit and I am trying to implement a test computation where I have two datasets: I want to FFT one, multiply it by the second, and then IFFT the result. I have written a multiply transformation and reuse the complex transformation from before. I have tested both of these in isolation with PureParallel.from_trf and they work as expected. My code is as follows - apologies, the data is not real as it is pulled from a larger file.
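For reference, the numpy equivalent of what I'm trying to compute on the GPU (with stand-in random data of the same dtypes) is:

```python
import numpy as np

rng = np.random.default_rng(0)
data1 = rng.integers(100, size=4096).astype(np.complex128)  # the multiplier
data2 = rng.integers(100, size=4096).astype(np.float64)     # the signal to FFT

# FFT one dataset, multiply by the other, then IFFT the result
expected = np.fft.ifft(np.fft.fft(data2) * data1)
print(expected.dtype, expected.shape)  # complex128 (4096,)
```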

import numpy as np
from reikna.cluda import dtypes, ocl_api, functions
from reikna.fft import FFT
from reikna.core import Annotation, Type, Transformation, Parameter, Computation
from reikna.algorithms import PureParallel
import reikna.helpers as helpers

def get_complex_trf(arr):
    complex_dtype = dtypes.complex_for(arr.dtype)
    return Transformation(
            [Parameter('output', Annotation(Type(complex_dtype, arr.shape), 'o')),
             Parameter('input', Annotation(arr, 'i'))],
    """
            ${output.store_same}(
                COMPLEX_CTR(${output.ctype})(
                    ${input.load_same},
                    0));
            """)

def get_multiply_trf(arr):
    return Transformation(
        [Parameter('output', Annotation(arr, 'o')),
         Parameter('input1', Annotation(arr, 'i')),
         Parameter('input2', Annotation(arr, 'i'))],
        "${output.store_same}(${mul}(${input1.load_same}, ${input2.load_same}));",
        connectors=['output', 'input1'],
        render_kwds=dict(mul=functions.mul(arr.dtype, arr.dtype, out_dtype=arr.dtype))
    )

class TestComp(Computation):
    def __init__(self, arr1, arr2):
        Computation.__init__(self, [
            Parameter('output', Annotation(arr2, 'o')),
            Parameter('input1', Annotation(arr1, 'i')),
            Parameter('input2', Annotation(arr2, 'i'))])
    def _build_plan(self, plan_factory, device_params, output, input1, input2):
        plan=plan_factory()

        complex_trf = get_complex_trf(input1)
        mul_trf=get_multiply_trf(input2)

        fft=FFT(complex_trf.output)
        fft.parameter.input.connect(complex_trf, complex_trf.output, new_input=complex_trf.input)
        fft.parameter.output.connect(mul_trf, mul_trf.input1, IRFArr=mul_trf.input2, op=mul_trf.output)

        int_arr=plan.temp_array_like(input2)
        plan.computation_call(fft, int_arr, input2, input1, inverse=0)

        ifft=FFT(int_arr)
        output_arr=plan.temp_array_like(int_arr)
        plan.computation_call(ifft, output_arr, int_arr, inverse=1)

        return plan


api=ocl_api()
device=api.get_platforms()[0].get_devices()[2]
print('Performing on {}'.format(device))
thread=api.Thread(device)

#data1 is a complex128
#data2 is float64

data1_dev=thread.to_device(data1)
data2_dev=thread.to_device(data2)
res_dev=thread.array(data2.shape, np.complex128)

planC=TestComp(data2_dev, data1_dev)
planC.compile(thread)

planC(res_dev, data2_dev, data1_dev)
result=res_dev.get()

I have worked through several issues to get here, but now at the line planC.compile(thread) I get a large stream of errors: first a failed-to-compile warning and a large stream of output (it seems to print functions from reikna.cluda to the console), then the Traceback ends in:

 File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/reikna/cluda/api.py", line 563, in compile_static
    return StaticKernel(self, template_src, name, global_size,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/reikna/cluda/api.py", line 783, in __init__
    program = Program(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/reikna/cluda/api.py", line 654, in __init__
    self._program = thr._create_program(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/reikna/cluda/api.py", line 503, in _create_program
    program = self._compile(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/reikna/cluda/ocl.py", line 146, in _compile
    return cl.Program(self._context, src).build(options=options, cache_dir=temp_dir)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyopencl/__init__.py", line 537, in build
    self._prg, was_cached = self._build_and_catch_errors(
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pyopencl/__init__.py", line 585, in _build_and_catch_errors
    raise err
pyopencl._cl.RuntimeError: clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE - clBuildProgram failed: BUILD_PROGRAM_FAILURE

Followed by the incredibly helpful OpenCL error:

SC failed. No reason given.

Any help appreciated, I have been staring at it for a few days now and cannot see the error. This is on an AMD Radeon Pro 5300M in a MacBook Pro. 

Many thanks,
Gregor

Bogdan Opanchuk

Jan 21, 2021, 1:04:30 AM
to reikna
Hm, this is tricky. This is most likely the fault of the OpenCL driver. It compiles on my macbook's Vega 20 without problems (with data1 and data2 sizes set to 2^10 and, to test the limits, 2^16 - which are you using?). I've bumped into similar bugs before, where you give the compiler some correct code and it just fails (or silently produces incorrect results, which is worse) - usually on the Apple platform, too. Paradoxically, nVidia's OpenCL implementation has given me the fewest problems overall.

My standard procedure for these cases is:
- extract the CL source that causes the error (should be shown in the error you have, albeit with line numbers; I usually just modify the related exception handler in Reikna to get rid of them temporarily)
- combine it with a simple C program that just calls OpenCL to compile the provided source and reports errors
- simplify it as much as possible while making sure it still produces the error
- report it to Apple
I've been through this maybe 5 or 6 times over the last 10 years, and, to their credit, they did fix those bugs - but it takes them up to half a year. So this wouldn't really help you right now, even if you did it. 

Do you have the capability to try it on a different OpenCL platform, or even on CUDA? What macOS version are you using? I'm on 10.15.7; if you're on 10.13 or 10.14, updating might help - they fixed one of those bugs in 10.15, if I remember correctly.

Gregor Taylor

Jan 21, 2021, 5:13:42 AM
to reikna
I'm on OS 11.1.

I've done some further digging this morning and fixed a few issues in the code. I had a temp_array as the output of the IFFT for some reason, when it clearly should have been the computation's output parameter. See the updated code here:

import numpy as np
from reikna.cluda import dtypes, ocl_api, functions
from reikna.fft import FFT
from reikna.core import Annotation, Type, Transformation, Parameter, Computation
from reikna.algorithms import PureParallel
import reikna.helpers as helpers
import matplotlib.pyplot as plt

# (get_complex_trf, get_multiply_trf and the TestComp class are unchanged from
# my previous message, except that the last call in _build_plan now writes to
# the output parameter rather than a temporary:)
        plan.computation_call(ifft, output, int_arr, inverse=1)

        return plan

api=ocl_api()
device=api.get_platforms()[0].get_devices()[2]
print('Performing on {}'.format(device))
thread=api.Thread(device)

i=4096

#data1 is complex128, data2 is float64
data1=np.random.randint(100, size=i)
data1=data1.astype('complex128')
data2=np.random.randint(100, size=i)
data2=data2.astype('float64')

data1_dev=thread.to_device(data1)
data2_dev=thread.to_device(data2)
res_dev=thread.array(data2.shape, np.complex128)        

planC=TestComp(data2_dev, data1_dev).compile(thread)        
planC(res_dev, data2_dev, data1_dev)
result=res_dev.get()

The problem seems to be the data size, which in the above I've assigned to 'i' so it can be varied easily. In my previous attempts I had this set to 4096, as this is the typical data size I am dealing with, and that produces the failure-to-compile error from my earlier message. If I alter it to other values such as 2^10 or 2^16 (as you tried) then it works as expected! Could you possibly try 4096 on your machine to see if the issue exists there?
I have tried every 2^N up to N=16 and the only one that fails for me is N=12 - annoyingly, the size most of my data comes in.

I've tried it with some sensible data whose results I know what to expect from as well - just to check it wasn't silently producing incorrect results, as you said sometimes happens. The results are what I expect, provided the data size is not 2^12.

Many thanks,
Gregor 

Bogdan Opanchuk

Jan 21, 2021, 4:15:57 PM
to reikna
Interesting - I get the compilation failure for 2^12 too. I isolated the problem: the error occurs when one uses more than 32768 bytes of local memory (on my machine, at least), despite OpenCL reporting the maximum local memory size as 65536. I filed bug 8977645 with Apple (in case their bugs are public). It may take them a while to get around to it.

So, what can you do right now? Reikna attempts to make FFTs workgroup-local, and the larger the problem size, the more local memory is required for that. For 4096 elements it needs 4352 * sizeof(double) bytes of local memory (it's spaced to minimize bank conflicts). It thinks it can do that since OpenCL reports a maximum of 65536, but the driver still raises the error. A quick and dirty solution is to override the maximum local memory size (do it once after the Thread object is created):

thread.device_params.local_mem_size = 32768

This will mean that Reikna will use a slower non-local FFT version, but at least it will work (or, at least, it works for me - try it and tell me if it helps).
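The arithmetic above, spelled out (all numbers from this thread, sizes in bytes):

```python
n = 4096                   # FFT size
padded = 4352              # n plus spacing to minimize bank conflicts
bytes_needed = padded * 8  # sizeof(double)

# 34816 bytes: over the 32768 that actually works on this driver,
# but under the 65536 maximum that OpenCL reports
print(bytes_needed)  # 34816
```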

Another approach is to keep the local FFT version, but to remove the spacing in local memory, so it fits in 4096 bytes. This may turn out faster than the non-local version, but the problem is, there's no API to do that. 

Gregor Taylor

Jan 22, 2021, 4:37:36 AM
to reikna
Great that you've submitted it to Apple - do they give feedback when they've sorted issues, or is it more a case of just trying it after each update to see if it works?

Yesterday I got round the problem by padding out the data to size=8192 (0-padding 2048 each side). This worked. This morning I timed that approach on some test data, then also overrode the max local memory size as you suggested and timed that too - in this case, overriding the memory size is faster. I think I will continue with this method in the meantime and wait for Apple to address the 4096 bug.

Many thanks for your help on this, much appreciated. Now to work out how to do multiple FFTs in parallel in an efficient way! 

Thanks,
Gregor

Bogdan Opanchuk

Jan 22, 2021, 7:20:10 PM
to reikna
> do they feedback of when they've sorted issues or is it more of we just try it after each update to see if it works? 

Unfortunately Apple's bug tracker is not public, so you can't subscribe to updates, but I will receive a notification. They don't do hotfixes (unless it's a security issue), so yeah, it'll be bundled into some minor version bump. 

> Yesterday I got round the problem by padding out the data so that they were size=8192 (0-padding 2048 each side).

Note that you can do that without actually allocating a larger array - just via a transformation (and a matching output transformation to drop the excess data). 
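In numpy terms, the padding workaround looks like this (a transformation would produce the same padded input on the fly, without allocating the larger array):

```python
import numpy as np

data = np.random.rand(4096)

# 0-pad 2048 each side to reach the working size of 8192
padded = np.zeros(8192)
padded[2048:2048 + 4096] = data

spec = np.fft.fft(padded)
print(spec.shape)  # (8192,)
```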
