Hi,
I am working on GPU L_op support for Cholesky factorization and triangular solve.
I implemented them using theano/tensor/slinalg.py as a reference, but I found that my task only speeds up by 20%.
Am I correct to say there is no GPU implementation of
theano.tensor.outer
theano.tensor.diag
theano.tensor.diagonal
?
If I call them on a GpuArray, would Theano copy the input to the host, use NumPy to do the computation, and then copy the result back to the GPU?
Do I have to implement GpuAllocDiag and GpuExtractDiag and then add them to theano/gpuarray/opt.py to run the code fully on the GPU?
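
For reference, a minimal sketch of how one could check for such a host fallback in a compiled graph. It assumes the transfer ops are named HostFromGpu and GpuFromHost, as in theano/gpuarray/basic_ops.py. The helper count_transfers is hypothetical and not part of Theano; it only looks at op class names.

```python
from collections import Counter

# Class names of the transfer ops in theano/gpuarray/basic_ops.py.
TRANSFER_OPS = ("HostFromGpu", "GpuFromHost")


def count_transfers(apply_nodes):
    """Count host<->GPU transfer nodes by the class name of each node's op.

    `apply_nodes` is any iterable of objects with an `.op` attribute,
    e.g. the list returned by `f.maker.fgraph.toposort()`.
    """
    counts = Counter(type(node.op).__name__ for node in apply_nodes)
    return {name: counts[name] for name in TRANSFER_OPS}


# Intended use with a compiled function f (needs Theano and a GPU device):
#   print(count_transfers(f.maker.fgraph.toposort()))
```

Transfers for the inputs and outputs of the function are expected. Extra HostFromGpu / GpuFromHost pairs around outer, diag or diagonal in the middle of the graph would suggest those ops still run on the host.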
Best,
wonghang