In the code below, I have a simple for-loop that I'd like to replace with a faster, vectorized, Numba-parallelized implementation as well as a CUDA implementation.
import numpy as np

b = np.array([9, 8100, -60, 7], dtype=np.float64)
a = np.array([584, -11, 23, 79, 1001, 0, -19], dtype=np.float64)
m = 3
n = b.shape[0]
l = n - m + 1
k = a.shape[0] - m + 1
QT = np.array([-85224., 181461., 580047., 8108811., 10149.])
QT_first = QT.copy()

# Placeholder statistics; in the real code these are pre-computed from a and b
b_mean = np.ones(l)
b_stddev = np.ones(l)
a_stddev = np.ones(k)

out = [None] * l
for i in range(1, l):
    # Shift the sliding dot product: drop the oldest term, add the newest
    QT[1:] = QT[:k-1] - b[i-1] * a[:k-1] + b[i-1+m] * a[-(k-1):]
    QT[0] = QT_first[i]
    # Update: this is not the REAL calculation below but a proxy.
    # Use QT above to do something with the ith element of out.
    # As i updates in each iteration, QT changes.
    out[i] = np.argmin((QT + b_mean[i] * m) / (b_stddev[i] * m * a_stddev))
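For reference, the QT[1:] line is the standard constant-time shift of a sliding dot product. A quick self-contained NumPy check (using the sample data above) that the recurrence reproduces direct window dot products:

```python
import numpy as np

b = np.array([9, 8100, -60, 7], dtype=np.float64)
a = np.array([584, -11, 23, 79, 1001, 0, -19], dtype=np.float64)
m = 3
k = a.shape[0] - m + 1

def sliding_dots(i):
    # Direct dot products of window b[i:i+m] against every window of a
    return np.array([np.dot(b[i:i + m], a[j:j + m]) for j in range(k)])

QT = sliding_dots(0)  # equals the hard-coded initial QT in the question
i = 1
QT_new = np.empty(k)
# Recurrence: reuse QT from the previous row, shifted by one
QT_new[1:] = QT[:k - 1] - b[i - 1] * a[:k - 1] + b[i - 1 + m] * a[-(k - 1):]
# QT_new[0] is not set here; in the question's code it comes from QT_first
```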
In my real function, the input arrays a and b can be variable in length and very long. Note that QT depends on m and on the length of b, both of which will always be provided. Also, one might be tempted to recommend some sort of traditional convolution, but convolution does not solve my problem: convolving only gives me the final QT, whereas I actually need each intermediate QT for another calculation (see the argmin line, which depends on pre-computed quantities derived from the input arrays) before updating it for the next iteration of the for-loop.
What is the best way to replace the for-loop with Numba so that it is faster on CPU?
Is it possible to use multiple threads with nogil and prange in this instance?
What is the best way to replace the for-loop with Numba so that I can port this to GPU as well with CUDA Jit?
I would greatly appreciate any help in porting this code over to Numba so that I can leverage parallel CPU computation as well as GPU CUDA computation.