[clFFT] Problems with certain 2D transform sizes


Alex

Dec 13, 2013, 9:21:25 AM
to clm...@googlegroups.com
Hi there,

I noticed that certain transform sizes give invalid results for me. I am testing complex-to-complex, single-precision, out-of-place transforms by comparing x == ifft2(fft2(x)). Usually I get relative RMS errors around 1e-7, but for some transform sizes the application crashes (AMD platform, CPU) or gives relative errors > 1e3 (NVIDIA platform, GPU).
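For readers who want to reproduce the kind of check described above, here is a minimal sketch of a round-trip test with a relative RMS error metric. It is not the poster's test code: it uses a naive DFT in pure Python (double precision, so the error floor is far below the 1e-7 seen in single precision), and the names `dft2`, `relative_rms_error` and `round_trip_error` are made up for illustration.

```python
import cmath
import math
import random

def dft(values, sign):
    """Naive O(N^2) DFT; sign=-1 is forward, sign=+1 is the unscaled inverse."""
    n = len(values)
    return [sum(values[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def dft2(grid, sign):
    """2D transform: transform every row, then every column."""
    rows = [dft(row, sign) for row in grid]
    cols = [dft(list(col), sign) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major

def relative_rms_error(reference, result):
    """sqrt(sum |result - reference|^2 / sum |reference|^2) over all elements."""
    num = sum(abs(b - a) ** 2
              for ref_row, res_row in zip(reference, result)
              for a, b in zip(ref_row, res_row))
    den = sum(abs(a) ** 2 for row in reference for a in row)
    return math.sqrt(num / den)

def round_trip_error(height, width, seed=0):
    """Compare x with ifft2(fft2(x)) for random complex input."""
    rng = random.Random(seed)
    x = [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(width)]
         for _ in range(height)]
    spectrum = dft2(x, -1)
    back = dft2(spectrum, +1)
    scale = 1.0 / (height * width)  # inverse transform carries the 1/N factor
    back = [[v * scale for v in row] for row in back]
    return relative_rms_error(x, back)

print(round_trip_error(8, 16))
```

A healthy transform keeps this value near machine precision; values above 1, like the 1e3 reported here, mean the output is unrelated to the input rather than merely inaccurate.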

All the failing transform sizes I found are powers of two in both dimensions:

x = 2^n, y= 2^m with n>=9 && m>=10 (starting with 512 x 1024 not working)
x = 2^n, y= 2^m with n>=10 && m>=4 (starting with 1024 x 32 not working)

All transform sizes I tested that include a factor of 3 or 5 in either x or y work. For example, 1024x1024 does not work, while 1536x1024 does.
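The two rules above can be written as a small predicate, which is handy for skipping or flagging sizes in a test sweep. This is only a sketch of the sizes reported in this post, not a statement about the library in general. Note that the second rule is written with m >= 4 although its example (1024 x 32) corresponds to m = 5; the sketch follows the inequality as written. The function names are hypothetical.

```python
def log2_exact(value):
    """Return k if value == 2**k (k >= 0), otherwise None."""
    if value < 1 or value & (value - 1):
        return None
    return value.bit_length() - 1

def reported_as_failing(x, y):
    """True if x by y matches one of the failing patterns from the post."""
    n, m = log2_exact(x), log2_exact(y)
    if n is None or m is None:
        # Sizes with a factor of 3 or 5 were reported as working.
        return False
    return (n >= 9 and m >= 10) or (n >= 10 and m >= 4)

for size in [(512, 1024), (1024, 32), (1024, 1024), (1536, 1024), (256, 256)]:
    print(size, reported_as_failing(*size))
```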


Has anybody else seen similar effects? Is there any transform size restriction besides x*y <= 2^24?

Pavan Yalamanchili

Dec 13, 2013, 9:22:55 AM
to Alex, clm...@googlegroups.com

Hi Alex,

Are you normalizing the values correctly?


Alex

Dec 13, 2013, 9:25:56 AM
to clm...@googlegroups.com, Alex
Hi Pavan,

I do not change the default values for the plans, so fft and ifft should scale correctly, as far as I understand the documentation.

Furthermore, the same code works for many other transform sizes.
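To illustrate what the normalization question is about: with a forward scale of 1 and a backward scale of 1/N (which, as far as I know, is what the clFFT defaults amount to), a round trip returns the input. If the 1/N factor is missing, every value comes back multiplied by N. The sketch below shows this with a naive 1D DFT in pure Python; it does not call clFFT, and the `scale` parameter is only a stand-in for a plan's scale setting.

```python
import cmath

def dft(values, sign, scale=1.0):
    """Naive DFT with an explicit scale factor applied to every output."""
    n = len(values)
    return [scale * sum(values[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
                        for k in range(n))
            for j in range(n)]

x = [1 + 2j, -0.5j, 3.0, 0.25 - 1j]
n = len(x)

spectrum = dft(x, -1)                      # forward, scale 1.0
unscaled = dft(spectrum, +1)               # backward without scaling
scaled = dft(spectrum, +1, scale=1.0 / n)  # backward with scale 1/N

print(max(abs(u - n * v) for u, v in zip(unscaled, x)))  # unscaled equals N * x
print(max(abs(s - v) for s, v in zip(scaled, x)))        # scaled equals x
```

A missing scale factor produces an error that is the same for every transform size of equal N, so it would not explain failures limited to particular power-of-two shapes.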

Pavan Yalamanchili

Dec 13, 2013, 9:36:39 AM
to Alex, clm...@googlegroups.com
I just noticed that you said NVIDIA GPUs. There are a couple of known bugs for the NVIDIA platform. 

1) Value errors like you are seeing.
If you are using the "master" branch, you will see the problem. This has been fixed in the "develop" branch.

2) Kernel failures when m >=10 and n >=10 for 2D FFTs.
A workaround has been committed to our fork (https://github.com/accelereyes/clFFT) and is awaiting merge. Our fork contains the fix for (1) as well; use its "develop" branch.

Let us know how things go on your end.

--
Pavan Yalamanchili
AccelerEyes

Alex

Dec 13, 2013, 10:46:16 AM
to clm...@googlegroups.com, Alex
Just did the test again using the develop branch of your fork.

Now all transform sizes I am testing work on the NVIDIA GPU as well as on the CPU using the Intel platform. Thanks a lot!

Unfortunately I still get crashes on the AMD platform, which is not too important for me, but still an issue.

Since we are discussing transform sizes: have there been any efforts to overcome the 2^24 limit? Are there any plans?

Pavan Yalamanchili

Dec 13, 2013, 4:50:00 PM
to Alex, clm...@googlegroups.com
Hi Alex,

I've just tested the following configurations on AMD 7970 and all of them work as expected (i.e. no error in values).

Precision: Single and Double.
Complexity: Real and Complex.
8 <= m <= (12 for single, 11 for double)
8 <= n <= (12 for single, 11 for double)

Plan details:
Scale: Default and (1/(number of elements) for ifft2).
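The configuration list above can be expanded into an explicit test matrix, which makes it easier to compare against another setup. This is only a sketch under two assumptions: that m and n are base-2 exponents of the two dimension lengths (as elsewhere in this thread), and that every combination of precision, complexity and scale was run. The names are hypothetical.

```python
from itertools import product

# Largest exponent tested per precision, taken from the list above.
MAX_EXPONENT = {"single": 12, "double": 11}

def build_configurations():
    """Enumerate (precision, kind, scale, x, y) tuples for the tested range."""
    configs = []
    for precision, limit in MAX_EXPONENT.items():
        exponents = range(8, limit + 1)
        for kind, scale, m, n in product(("real", "complex"),
                                         ("default", "1/N"),
                                         exponents, exponents):
            configs.append((precision, kind, scale, 2 ** m, 2 ** n))
    return configs

configs = build_configurations()
print(len(configs))
# Every tested size stays within the 2^24 element limit discussed in this thread.
print(all(x * y <= 2 ** 24 for _, _, _, x, y in configs))
```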

We use complex interleaved layout in our library that uses clFFT. So I could not change that easily.

Please reply with any differences between this configuration and yours.

It would also be great if you can create a ticket over here (https://github.com/clMathLibraries/clFFT)

Kent Knox

Dec 13, 2013, 6:23:45 PM
to Pavan Yalamanchili, Alex, clm...@googlegroups.com
Hi Alex~

If I understood your original email correctly, you are saying that (AMD platform, CPU) is the CPU device when you are using the AMD OpenCL runtime. Likewise, when you say (NVIDIA platform, GPU) you mean a GPU device on the NVIDIA OpenCL runtime.

I do not think I see you reporting results for (AMD platform, GPU), which Pavan sees as working; do you have any? It would help us if you could attach the output of the clinfo utility from your machines.

With regard to the capability of the library for sizes above 2^24, this limit is imposed by a known issue in the library itself.  The clfftEnqueueTransform() call is recursive, and when a dimension length is passed to it that is greater than the amount of LDS memory available on the device (a single FFT row has to fit within LDS memory), the enqueue makes the FFT row smaller by reshaping it, in essence folding a 1D FFT into a 2D FFT.  For dimension lengths that are greater than 2^24, this recursion happens multiple times and we produce invalid results.  This limit can be removed as soon as the problem is triaged and fixed; I have no ETA on this unfortunately.
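The folding described above is, in general terms, the classic Cooley-Tukey decomposition of a length-N transform with N = N1 * N2 into short transforms along two axes plus a twiddle multiplication. The sketch below shows that textbook idea in pure Python and checks it against a direct DFT. It is not clFFT's implementation and says nothing about where the recursion bug sits; `folded_dft` and its parameter names are made up for illustration.

```python
import cmath

def dft(values):
    """Naive forward DFT used as both building block and reference."""
    n = len(values)
    return [sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n)
                for k in range(n))
            for j in range(n)]

def folded_dft(values, n1_len, n2_len):
    """Compute a length n1_len * n2_len DFT as a 2D problem."""
    total = n1_len * n2_len
    assert len(values) == total
    # Step 1: n1_len short transforms of length n2_len over strided slices,
    # i.e. input index n = n1 + n1_len * n2.
    inner = [dft(values[n1::n1_len]) for n1 in range(n1_len)]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i * n1 * j2 / N).
    for n1 in range(n1_len):
        for j2 in range(n2_len):
            inner[n1][j2] *= cmath.exp(-2j * cmath.pi * n1 * j2 / total)
    # Step 3: n2_len transforms of length n1_len along the other axis,
    # written to output index j = n2_len * j1 + j2.
    out = [0j] * total
    for j2 in range(n2_len):
        column = dft([inner[n1][j2] for n1 in range(n1_len)])
        for j1 in range(n1_len):
            out[n2_len * j1 + j2] = column[j1]
    return out

signal = [complex(k % 5, -(k % 3)) for k in range(12)]
direct = dft(signal)
folded = folded_dft(signal, 3, 4)
print(max(abs(a - b) for a, b in zip(direct, folded)))
```

Each short transform only needs n1_len or n2_len elements at a time, which is the point of folding when a full row does not fit in local memory. Applying the same split again to one of the short transforms gives the repeated recursion mentioned above.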

Kent


Alex

Dec 18, 2013, 9:34:12 AM
to clm...@googlegroups.com, Pavan Yalamanchili, Alex
Hi,

I just tested the FFT on an AMD GPU and got wrong results for the same transform sizes; in contrast to the AMD CPU device, however, it does not crash.

Here is the clinfo for that system:

Number of platforms:                             2
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.1
  Platform Name:                                 Intel(R) OpenCL
  Platform Vendor:                               Intel(R) Corporation
  Platform Extensions:                           cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9_media_sharing
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.2 AMD-APP (1348.4)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing


  Platform Name:                                 Intel(R) OpenCL
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_CPU
  Device ID:                                     32902
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           1024
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 2
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    2
  Max clock frequency:                           3400Mhz
  Address bits:                                  64
  Max memory allocation:                         4288605184
  Image support:                                 Yes
  Max number of images read arguments:           480
  Max number of images write arguments:          480
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    480
  Max size of kernel argument:                   3840
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               No
    Round to +ve and infinity:                   No
    IEEE754-2008 fused multiply-add:             No
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    262144
  Global memory size:                            17154420736
  Constant buffer size:                          131072
  Max number of constant args:                   480
  Local memory type:                             Global
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     128
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    301
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     Yes
  Queue properties:
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   00000000005F6470
  Name:                                          Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  Vendor:                                        Intel(R) Corporation
  Device OpenCL C version:                       OpenCL C 1.1
  Driver version:                                1.1
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.1 (Build 31360.31441)
  Extensions:                                    cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9_media_sharing


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Device ID:                                     4098
  Board name:                                    AMD Radeon HD 6900M Series
  Device Topology:                               PCI[ B#1, D#0, F#0 ]
  Max compute units:                             12
  Max work items dimensions:                     3
    Max work items[0]:                           256
    Max work items[1]:                           256
    Max work items[2]:                           256
  Max work group size:                           256
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    0
  Max clock frequency:                           680Mhz
  Address bits:                                  32
  Max memory allocation:                         536870912
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            1073741824
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   000007FED53D4E10
  Name:                                          Barts
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                1348.4 (VM)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 AMD-APP (1348.4)
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_dx9_media_sharing cl_amd_image2d_from_buffer_read_only


  Device Type:                                   CL_DEVICE_TYPE_CPU
  Device ID:                                     4098
  Board name:
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           1024
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  8
  Preferred vector width double:                 4
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     8
  Native vector width double:                    4
  Max clock frequency:                           3400Mhz
  Address bits:                                  64
  Max memory allocation:                         4288605184
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   4096
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    32768
  Global memory size:                            17154420736
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Global
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     1
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    301
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     Yes
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   000007FED53D4E10
  Name:                                          Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
  Vendor:                                        GenuineIntel
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                1348.4 (sse2,avx)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 AMD-APP (1348.4)
  Extensions:                                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing

Kent Knox

Dec 19, 2013, 11:18:16 AM
to Alex, clm...@googlegroups.com
Thank you for your platform output.

I can see that you have two OpenCL runtime stacks installed; I wonder if that could be interfering with our library. Are you using our 'client' app to get test results, or did you write your own test code? Could you share a repro case?

Have you tried compiling and running our googletest test suite?  If a test fails with that, it could be easier for me to reproduce here.

Kent

Alex

Dec 20, 2013, 6:19:49 AM
to clm...@googlegroups.com, Alex
I just tried to test using client.exe, but there are some problems, possibly because of the three OpenCL runtime stacks I have installed:

client.exe -i gives:

OpenCL platform [ 0 ]:
    CL_PLATFORM_PROFILE:     FULL_PROFILE
    CL_PLATFORM_VERSION:     OpenCL 1.1 CUDA 6.0.1
    CL_PLATFORM_NAME:        NVIDIA CUDA
    CL_PLATFORM_VENDOR:      NVIDIA Corporation
    CL_PLATFORM_EXTENSIONS:  cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll


OpenCL platform [ 1 ]:
    CL_PLATFORM_PROFILE:     FULL_PROFILE
    CL_PLATFORM_VERSION:     OpenCL 1.2 AMD-APP (1084.4)
    CL_PLATFORM_NAME:        AMD Accelerated Parallel Processing
    CL_PLATFORM_VENDOR:      Advanced Micro Devices, Inc.
    CL_PLATFORM_EXTENSIONS:  cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing cl_khr_d3d11_sharing

OpenCL platform [ 2 ]:
    CL_PLATFORM_PROFILE:     FULL_PROFILE
    CL_PLATFORM_VERSION:     OpenCL 1.2
    CL_PLATFORM_NAME:        Intel(R) OpenCL
    CL_PLATFORM_VENDOR:      Intel(R) Corporation
    CL_PLATFORM_EXTENSIONS:  cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9_media_sharing cl_khr_dx9_media_sharing cl_khr_d3d11_sharing

OpenCL devices [ 0 ]:
    CL_DEVICE_NAME:                      Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
    CL_DEVICE_VERSION:                   OpenCL 1.2 (Build 63463)
    CL_DRIVER_VERSION:                   1.2
    CL_DEVICE_TYPE:                      CPU
    CL_DEVICE_MAX_CLOCK_FREQUENCY:       3400
    CL_DEVICE_ADDRESS_BITS:              64
    CL_DEVICE_AVAILABLE:                 TRUE
    CL_DEVICE_COMPILER_AVAILABLE:        TRUE
    CL_DEVICE_OPENCL_C_VERSION:          OpenCL C 1.2
    CL_DEVICE_MAX_WORK_GROUP_SIZE:       1024
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS:  3
                         Dimension[ 0 ]  1024
                         Dimension[ 1 ]  1024
                         Dimension[ 2 ]  1024
    CL_DEVICE_HOST_UNIFIED_MEMORY:       TRUE
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE:  131072 ( 128 KB )
    CL_DEVICE_LOCAL_MEM_SIZE:            32768 ( 32 KB )
    CL_DEVICE_GLOBAL_MEM_SIZE:           34295431168 ( 32706 MB )
    CL_DEVICE_MAX_MEM_ALLOC_SIZE:        8573857792 ( 8176 MB )
    CL_DEVICE_EXTENSIONS:                cl_khr_fp64 cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_intel_printf cl_ext_device_fission cl_intel_exec_by_local_thread cl_khr_gl_sharing cl_intel_dx9_media_sharing cl_khr_dx9_media_sharing cl_khr_d3d11_sharing



                Internal Client Test *****PASS*****

client -g gives:

OPENCL_V_THROWERROR< CLFFT_DEVICE_NOT_FOUND > (385): Getting OpenCL devices ( ::clGetDeviceIDs() )
clFFT error condition reported:
OPENCL_V_THROWERROR< CLFFT_DEVICE_NOT_FOUND > (385): Getting OpenCL devices ( ::clGetDeviceIDs() )


It seems that there are problems with multiple OpenCL runtime stacks.

Kent Knox

Dec 20, 2013, 2:52:35 PM
to Alex, clm...@googlegroups.com
I think I understand the problem you see with 'client': even though the example executable prints all 3 platforms, it automatically chooses the last one, which is the Intel stack in your case. You specify that you want the GPU device, but the Intel platform doesn't have one. Therefore, "device not found". I'm working on a way to explicitly pick the platform and device in a branch I have in development.
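The selection behavior described above can be modeled in a few lines. This is a hypothetical model, not the client's source: the platform order is taken from the `client.exe -i` output earlier in the thread, while the device lists for the NVIDIA and AMD platforms are assumptions made for illustration. All function and class names are invented.

```python
class DeviceNotFound(Exception):
    """Stand-in for the CLFFT_DEVICE_NOT_FOUND condition."""

# Platform order as printed by `client.exe -i`; device lists are assumed,
# except that the Intel platform exposes only a CPU device.
PLATFORMS = [
    {"name": "NVIDIA CUDA", "devices": ["GPU"]},
    {"name": "AMD Accelerated Parallel Processing", "devices": ["GPU", "CPU"]},
    {"name": "Intel(R) OpenCL", "devices": ["CPU"]},
]

def pick_last_platform(platforms, device_type):
    """Model of the described behavior: always use the last platform."""
    platform = platforms[-1]
    if device_type not in platform["devices"]:
        raise DeviceNotFound(f"no {device_type} on {platform['name']}")
    return platform["name"], device_type

def pick_explicit(platforms, platform_index, device_type):
    """Model of an explicit choice of platform and device type."""
    platform = platforms[platform_index]
    if device_type not in platform["devices"]:
        raise DeviceNotFound(f"no {device_type} on {platform['name']}")
    return platform["name"], device_type

try:
    pick_last_platform(PLATFORMS, "GPU")
except DeviceNotFound as err:
    print("device not found:", err)

print(pick_explicit(PLATFORMS, 0, "GPU"))
```

With an explicit platform index, the GPU request succeeds regardless of which runtime happens to be enumerated last.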


As for now, I don't think this problem is related to the one you reported earlier.

Kent