GPU implementation of outer(.), diag(.), diagonal(.)


Wong Hang

Oct 23, 2018, 3:56:42 AM
to theano-dev
Hi,

I am working on GPU L_op support for Cholesky factorization and triangular solve.

I implemented them with reference to theano/tensor/slinalg.py, but I found that this speeds up my task by only 20%.

Am I correct to say that there is no GPU implementation of the following?

theano.tensor.outer
theano.tensor.diag
theano.tensor.diagonal

If I call them on a GpuArray, would Theano copy the input to the host, use numpy to do the work, and then copy the result back to the GPU?

Do I have to implement GpuAllocDiag and GpuExtractDiag and then add them to theano/gpuarray/opt.py to run the code fully on the GPU?

Best,
wonghang

Arnaud Bergeron

Oct 23, 2018, 2:25:10 PM
to thean...@googlegroups.com
GpuAllocDiag and GpuExtractDiag are implemented in gpuarray/subtensor.py
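
For reference, a minimal NumPy sketch of the two operations those op names suggest: building a diagonal matrix from a vector ("alloc diag") and reading the diagonal back out of a matrix ("extract diag"). This only illustrates the CPU-side semantics with numpy; it is an assumption that the GPU ops mirror it, and the variable names are made up for the example.

```python
import numpy as np

# Hypothetical input vector, used only for illustration.
v = np.array([1.0, 2.0, 3.0])

# "Alloc diag": place a vector on the diagonal of a new square matrix.
m = np.diag(v)

# "Extract diag": read the main diagonal of a matrix back as a vector.
d = np.diagonal(m)
```

The round trip returns the original vector, which is a quick sanity check when comparing a CPU result with a GPU one.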

As for outer, it is implemented as GpuGer in gpuarray/blas.py
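
To see why an outer product can be served by a GER op, here is a small NumPy sketch of the standard BLAS GER rank-1 update, A + alpha * x * y^T. The helper name ger and the sample vectors are hypothetical; it is an assumption that GpuGer follows the same BLAS convention.

```python
import numpy as np

def ger(A, alpha, x, y):
    """Reference rank-1 update in the BLAS GER sense: A + alpha * x y^T."""
    return A + alpha * np.outer(x, y)

# Hypothetical sample vectors.
x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0, 5.0])

# With A = 0 and alpha = 1, the GER update reduces to a plain outer product.
outer_via_ger = ger(np.zeros((x.size, y.size)), 1.0, x, y)
```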

But you shouldn't have to use any of those directly, because they should be substituted for their CPU equivalents automatically. If you are having speed problems, you can try running with profile=True to find out where the bottleneck is.



Wong Hang

Oct 23, 2018, 11:59:30 PM
to thean...@googlegroups.com
Thanks. I've got it now.
I found that the bottleneck is in the reduction rather than in the Cholesky factorization and triangular solve. I have already saved time in GpuFromHost and gained a speedup from cusolver/cublas.

