On 1/30/20, Dan Wills <
gda...@gmail.com> wrote:
> That is very interesting to hear Tim, do you think that's true regardless
> of whether the rule is CPU or OpenCL? I've done some somewhat
> large-neighborhood stuff in glsl and it seems to manage somewhat ok.
Yes, though the details of memory access patterns vary, and FFT scans
through all the data once for each possible power of 2 stride from 1
element up to the full N^2 elements. (Also, N must be a power of 2
unless you want to write your own FFT codes from scratch). On GPU you
need to ensure coalesced reads/writes for the smaller strides and
avoid bank conflicts for the really large strides. On CPU, the
challenge is cache hit rate, perhaps using streaming prefetch when
possible, etc. Ignoring all that (and assuming the CPU and GPU
programmers address memory access problems optimally) we can just look
at the computational complexity.
If the kernel is KxK in size, and the pattern being simulated is NxN,
then the normal calculation takes O(K^2 * N^2) per frame. But
reaction-diffusion can be done in the frequency domain. It takes O(N^2
* log(N^2)) for the FFT and the same for the inverse FFT, plus O(N^2)
to perform the diffusion (convolution) in the frequency domain. The
difference is O(N^2 * K^2) versus O(N^2 * (1+4 log(N))). If the kernel
is large enough, then 1+4 log(N) is less than K^2.
> I absolutely love FFT (I wrote a plugin that brought FFTW-based
> FFT-for-images to the compositing app we use at work 'Shake' (now killed by
> apple)) but am not yet across what a GPU-based solution to FFT looks like,
> guess I'd better get googling!
Nvidia provides cuFFT library for use in CUDA programs. For OpenCL and
other non-CUDA I'm sure there is source code somewhere.
> On Fri, Jan 24, 2020 at 12:22 AM Tim Hutton <
tim.h...@gmail.com> wrote:
>> With large-neighborhood rules like SmoothLife and Lenia, FFT will
>> probably be the fastest way to run them. It's on my list to look into for Ready
>> but at low priority.
--
Robert Munafo