Cholesky decomposition slow

294 views

Simon Ebner

unread,
Mar 7, 2015, 5:45:36 AM3/7/15
to theano...@googlegroups.com
Hi all,

I want to do computations that rely heavily on the Cholesky decomposition. Writing a small benchmark for tensor.slinalg.Cholesky, I noticed that the implementation is not as fast as I had hoped. As far as I can tell it is not optimized for GPUs yet but relies on the scipy implementation?
Doing a bit of a Google search, I found several CUDA implementations of fast Cholesky decomposition on the GPU. Before I try to include that code in my Theano environment, could you let me know whether you decided on purpose not to implement a fast Cholesky decomposition on the GPU? Furthermore, since I'm fairly new to Theano, I'm not completely confident about how best to incorporate CUDA code into my existing Theano code. Is it sensible to create a custom Op with optimized C code?

Best,
Simon

Pascal Lamblin

unread,
Mar 9, 2015, 6:01:18 PM3/9/15
to theano...@googlegroups.com
On Sat, Mar 07, 2015, Simon Ebner wrote:
> I want to do computations where I rely heavily on the Cholesky
> decomposition. Writing a small benchmark for tensor.slinalg.Cholesky, I
> noticed that the implementation is not as fast as I hoped. As far as I can
> tell it is not optimized for GPUs yet but relies on the scipy
> implementation?

That's correct. Moreover, the implementation of the gradient through Cholesky
is probably extremely slow, since it is a bunch of loops in Python.
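[Editor's note: for anyone reimplementing this, the reverse-mode rule for Cholesky can be written in a few dense NumPy lines and checked against finite differences. This is a sketch of the standard pullback formula, not Theano's actual implementation; a real Op would use triangular solves instead of the generic np.linalg.solve used here.]

```python
import numpy as np

def chol_backward(L, Lbar):
    """Reverse-mode gradient of A -> cholesky(A), dense NumPy sketch.

    Given L (lower Cholesky factor of A) and Lbar = df/dL, returns
    Abar such that df = sum(Abar * dA) for symmetric perturbations dA.
    """
    # Phi(X): lower triangle of X with the diagonal halved
    P = np.tril(L.T @ Lbar)
    P -= 0.5 * np.diag(np.diag(P))
    # Abar = L^{-T} P^T L^{-1}; np.linalg.solve stands in for triangular solves
    tmp = np.linalg.solve(L.T, P.T)       # L^{-T} @ P^T
    return np.linalg.solve(L.T, tmp.T).T  # (...) @ L^{-1}

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)               # symmetric positive definite
L = np.linalg.cholesky(A)

W = rng.standard_normal((n, n))           # cost f(A) = sum(W * cholesky(A))
Abar = chol_backward(L, W)

# Check against a central finite difference along a symmetric direction E
E = rng.standard_normal((n, n))
E = 0.5 * (E + E.T)
h = 1e-6
f = lambda X: np.sum(W * np.linalg.cholesky(X))
fd = (f(A + h * E) - f(A - h * E)) / (2 * h)
pred = np.sum(Abar * E)
```

In float64 the analytic value and the finite difference agree to many digits, which is the kind of check a GPU implementation of L_op can be validated against.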

> Doing a bit of a google seach I found several cuda implementations for fast
> Cholesky decompositions on the GPU. Before I try to include that code into
> my theano environment, could you let me know whether you decided not to
> implement fast Cholesky decomposition on the GPU on purpose?

The only reason is that it has not been a priority for any of the developers.
If you are interested in integrating that, the help would be welcome :)

> Furthermore, since I'm fairly new to theano I'm not completely
> confident how to incorporate cuda code best into my existing theano
> code. Is the sensible to create a custom OP with optimized C-Code?

That would be the best way, yes. There is documentation available for that at [1].

An alternative that may be faster to implement would be to use an
existing Python wrapper (for instance using scikits.cuda), and implement
a Theano Op with Python code only, where that Python code would call the
wrapper. This has been done for instance for the 3D FFT convolution on GPU.

[1] http://deeplearning.net/software/theano/tutorial/extending_theano_c.html
--
Pascal

Annie Marsden

unread,
May 24, 2016, 3:30:45 PM5/24/16
to theano-users, simon...@gmail.com
Hi there,

I'm just wondering if something like this did in fact get included. I'm also writing code that relies heavily on Cholesky decomposition and would really benefit from an optimized version for GPUs. 

Thanks,
Annie

Sander Stepanov

unread,
May 24, 2016, 3:45:08 PM5/24/16
to theano...@googlegroups.com
Cholesky is not stable, either.

--

---
You received this message because you are subscribed to the Google Groups "theano-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to theano-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Frédéric Bastien

unread,
May 25, 2016, 8:32:05 AM5/25/16
to theano-users, simon...@gmail.com

Nobody worked on that.

--

Frédéric Bastien

unread,
May 25, 2016, 8:32:06 AM5/25/16
to theano-users

From memory, we call NumPy or SciPy on the CPU. Can you be more precise? From memory, NumPy and SciPy have parameters or different implementations with a speed vs. precision trade-off.

Fred
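[Editor's note: one such knob on the CPU path is SciPy's check_finite flag, which skips input validation but calls the same LAPACK potrf underneath; cho_factor/cho_solve also let one factorization be reused for many right-hand sides. A small sketch, assuming a standard SciPy install:]

```python
import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve

rng = np.random.default_rng(0)
n = 500
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)   # symmetric positive definite

# check_finite=False skips the NaN/Inf scan of the input; the same
# LAPACK potrf runs underneath, so the factor is identical
L_checked = cholesky(A, lower=True)
L_fast = cholesky(A, lower=True, check_finite=False)

# cho_factor/cho_solve reuse one factorization for many right-hand sides
c, low = cho_factor(A)
b = rng.standard_normal(n)
x = cho_solve((c, low), b)
```

These flags trade safety checks for speed rather than precision, but they are the main tunables on the scipy path.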

Paul Baggenstoss

unread,
Feb 5, 2020, 10:33:59 AM2/5/20
to theano-users
Hi Simon, I was wondering if you got anywhere with a faster Cholesky for Theano. I also use it a lot and have found it to be unstable on the GPU.
Paul

Wong Hang

unread,
Feb 5, 2020, 10:56:14 AM2/5/20
to theano...@googlegroups.com

Hi,

The GPU Cholesky decomposition relies on cuSOLVER or MAGMA. I believe NVIDIA knows their hardware well, and cuSOLVER should provide the most efficient result.

Although the Cholesky decomposition is numerically very stable, when I wrote the test cases I found that I ran into trouble on relatively small matrices when using single precision.

Are you using single precision on a big matrix?
If not, try computing the condition number of the matrix to see if it is too big.

If it is not too big, then it may be a bug. I also need the Cholesky operator, so please send me the matrix and I will be willing to fix it.

Best,

On Thu, Feb 6, 2020 at 0:34, Paul Baggenstoss <p.m.bag...@ieee.org> wrote:

Paul Baggenstoss

unread,
Feb 5, 2020, 11:30:10 AM2/5/20
to theano-users
Hi Simon, I have uploaded a MATLAB-format file with the matrix Cll, which is the original matrix; R_cpu, which was produced on the CPU by slinalg.Cholesky(); and R_cuda, which was produced by the same function but on the GPU (I think it uses theano.gpuarray.linalg.GpuCholesky()). Both used the same precision (float32), so they should give the same results. But you can see that at the end of the diagonal, the values go wild. It appears to be numerical error.
Thanks in advance!
Paul





test.mat

Paul Baggenstoss

unread,
Feb 5, 2020, 11:32:43 AM2/5/20
to theano-users
Hi Simon, I forgot to mention that I use the gradient of Cholesky, and this has even more error than the Cholesky decomposition, but I assume that is because of a bug in Cholesky itself.
Paul

Paul Baggenstoss

unread,
Feb 6, 2020, 4:28:08 AM2/6/20
to theano-users
Simon,
I did more digging and have some more information. I tested theano.gpuarray.linalg.GpuMagmaCholesky() on float32 and it looks good: the result is exactly the same as on the CPU. So the problem seems to be in cuSOLVER. The problem is that theano.gpuarray.linalg.GpuMagmaCholesky()(Cll) does not define a gradient and works only for float32. I installed the latest magma-2.5.2, which has support for double-precision Cholesky (dpotrf), but Theano seems to use its own copy of the MAGMA source. Not sure how that works. Can I force Theano to use magma-2.5.2? If not, it seems feasible to borrow the gradient from theano.gpuarray.linalg.GpuCholesky() and add support for float64 as well. Thoughts?
Paul

Wong Hang

unread,
Feb 6, 2020, 4:39:40 AM2/6/20
to theano...@googlegroups.com
Hi Paul,

I haven't tested your matrix on my machine; I won't have time until the weekend.
I don't think cuSOLVER would produce the same result on GPU and CPU: cuSOLVER parallelizes in a sophisticated way to improve performance, but the error should stay within a threshold.

If the MAGMA Cholesky decomposition is more stable, it is possible to implement a gradient operator the way GpuCholesky does: just add support for float64 and implement the L_op method.

Best regards,
wonghang

Paul Baggenstoss <p.m.bag...@ieee.org> wrote on Thu, Feb 6, 2020 at 6:28 PM:

Paul Baggenstoss

unread,
Feb 6, 2020, 8:53:55 AM2/6/20
to theano-users

Hello again. 
     So I added 64-bit support to theano/gpuarray/c_code/magma_cholesky.c and to theano/gpuarray/linalg.py in the function GpuMagmaCholesky(). I attached the files. It now works for 32-bit and 64-bit and has a gradient. The numerical problem is gone.
  But (and this is a big BUT) it seems to be at least a factor of 2 slower than the CPU. Any thoughts on this?
Paul
linalg.py
magma_cholesky.c

Paul Baggenstoss

unread,
Feb 7, 2020, 7:49:43 AM2/7/20
to theano-users
Hi wonghang, sorry to pester you with emails, but I have some interesting timing information.
I ran a process using different processors and ways of computing Cholesky().
The results are surprising.

GpuMagmaCholesky()                    9.0 sec
slinalg.Cholesky() (uses cuSOLVER)    2.9 sec
CPU                                   1.9 sec

It looks like it pays to just use the CPU!

That doesn't make any sense!
Paul

Wong Hang

unread,
Feb 7, 2020, 8:49:27 AM2/7/20
to theano...@googlegroups.com
Hi all,

I found that the Cholesky factorization unit test no longer works.
The values returned are completely wrong; it looks like a memory error.
I checked that if I skip the tril call, the value returned by cuSOLVER is correct.
There must be something wrong in libgpuarray.

======================================================================
ERROR: test_dense_chol_lower (theano.gpuarray.tests.test_linalg.TestGpuCholesky64)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/wonghang/github/Theano/theano/gpuarray/tests/test_linalg.py", line 327, in test_dense_chol_lower
    self.compare_gpu_cholesky_to_np(A_val, lower=lower, inplace=inplace)
  File "/home/wonghang/github/Theano/theano/gpuarray/tests/test_linalg.py", line 280, in compare_gpu_cholesky_to_np
    utt.assert_allclose(chol_A_res, chol_A_val)
  File "/home/wonghang/github/Theano/theano/tests/unittest_tools.py", line 358, in assert_allclose
    raise WrongValue(expected, value, rtol, atol)
theano.tests.unittest_tools.WrongValue: WrongValue
           : shape, dtype, strides, min, max, n_inf, n_nan:
  Expected : (3, 3) float64 (24, 8) 1.078578362e-314 1.0548793676823098 0 0
  Value    : (3, 3) float64 (24, 8) 0.0 1.5121774155893968 0 0
  expected    : [[2.00683310e-314 3.46328020e-001 1.07857836e-314]
 [2.29026158e-001 1.05487937e+000 4.86725043e-001]
 [2.07913268e-001 4.16263205e-001 1.04157477e+000]]
  value    : [[1.51217742 0.         0.        ]
 [0.22902616 1.05487937 0.        ]
 [0.20791327 0.41626321 1.04157477]]
  Max Abs Diff:  1.5121774155893968
  Mean Abs Diff:  0.2605811643516005
  Median Abs Diff:  1.078578362e-314
  Std Abs Diff:  0.4752077922970366
  Max Rel Diff:  inf
  Mean Rel Diff:  inf
  Median Rel Diff:  1.3335589252099037e-16
  Std Rel Diff:  nan

  rtol, atol: 1e-05 1e-08


======================================================================
ERROR: test_diag_chol (theano.gpuarray.tests.test_linalg.TestGpuCholesky64)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/wonghang/github/Theano/theano/gpuarray/tests/test_linalg.py", line 317, in test_diag_chol
    self.compare_gpu_cholesky_to_np(A_val, lower=lower, inplace=inplace)
  File "/home/wonghang/github/Theano/theano/gpuarray/tests/test_linalg.py", line 280, in compare_gpu_cholesky_to_np
    utt.assert_allclose(chol_A_res, chol_A_val)
  File "/home/wonghang/github/Theano/theano/tests/unittest_tools.py", line 358, in assert_allclose
    raise WrongValue(expected, value, rtol, atol)
theano.tests.unittest_tools.WrongValue: WrongValue
           : shape, dtype, strides, min, max, n_inf, n_nan:
  Expected : (5, 5) float64 (40, 8) 0.0 1.3969459393428005 0 0
  Value    : (5, 5) float64 (40, 8) 0.0 1.3969459393428005 0 0
  expected    : [[1.26525335e-314 0.00000000e+000 0.00000000e+000 0.00000000e+000
  0.00000000e+000]
 [0.00000000e+000 2.01543086e-314 0.00000000e+000 0.00000000e+000
  0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 1.29480282e+000 0.00000000e+000
  0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 1.31448015e+000
  0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000
  1.39694594e+000]]
  value    : [[1.3040081  0.         0.         0.         0.        ]
 [0.         1.35800834 0.         0.         0.        ]
 [0.         0.         1.29480282 0.         0.        ]
 [0.         0.         0.         1.31448015 0.        ]
 [0.         0.         0.         0.         1.39694594]]
  Max Abs Diff:  1.3580083368118308
  Mean Abs Diff:  0.106480657426342
  Median Abs Diff:  0.0
  Std Abs Diff:  0.361174224138967
  Max Rel Diff:  nan
  Mean Rel Diff:  nan
  Median Rel Diff:  nan
  Std Rel Diff:  nan

  rtol, atol: 1e-05 1e-08


----------------------------------------------------------------------
Ran 40 tests in 12.218s

FAILED (errors=2, skipped=16)

Please use the revision 07cd4ad56054c279442ee28413b26939f4c03632 of libgpuarray

Use the following command to install an old version of libgpuarray:

$ cd libgpuarray
$ git checkout 07cd4ad56054c279442ee28413b26939f4c03632 .
$ mkdir cmake
$ cd cmake
$ cmake ..
$ make
$ sudo make install
$ sudo ldconfig
$ cd ..
$ python3 setup.py install

and then run your theano code again. I think it would work now.
I will check the code in libgpuarray later. Let me raise an issue first.

Best,
wonghang

Paul Baggenstoss <p.m.bag...@ieee.org> wrote on Fri, Feb 7, 2020 at 9:49 PM:

Wong Hang

unread,
Feb 7, 2020, 9:18:23 AM2/7/20
to theano...@googlegroups.com
I suddenly got the HEAD version of libgpuarray to work.
I found that if I increase the size of the matrix, the error appears.
The first few rows of the matrix are correct, and then there are errors in the remaining rows.
I guess there is a synchronization or memory bug.

$ python3 cho.py
row #0: err=0 (max=0)
row #1: err=0 (max=0)
row #2: err=0 (max=0)
row #3: err=1.77636e-15 (max=1.77636e-15)
row #4: err=0 (max=0)
row #5: err=1.77982e-15 (max=1.77636e-15)
row #6: err=1.14439e-16 (max=1.11022e-16)
row #7: err=6.245e-17 (max=5.55112e-17)
row #8: err=1.79104e-15 (max=1.77636e-15)
row #9: err=1.84778e-15 (max=1.77636e-15)
row #10: err=1.83628e-15 (max=1.77636e-15)
row #11: err=7.13054e-16 (max=6.66134e-16)
row #12: err=8.55484e-17 (max=8.32667e-17)
row #13: err=7.19641e-16 (max=4.44089e-16)
row #14: err=2.30555e-16 (max=1.11022e-16)
row #15: err=1.93574e-15 (max=1.77636e-15)
row #16: err=3.61888e-16 (max=2.22045e-16)
row #17: err=1.94548e-15 (max=1.77636e-15)
row #18: err=1.81003e-15 (max=1.77636e-15)
row #19: err=1.85793e-15 (max=1.77636e-15)
row #20: err=1.93489e-15 (max=1.77636e-15)
row #21: err=2.10577e-15 (max=1.77636e-15)
row #22: err=9.14588e-16 (max=4.44089e-16)
row #23: err=7.63657e-16 (max=4.44089e-16)
row #24: err=1.42114e-15 (max=8.88178e-16)
row #25: err=3.80154e-15 (max=3.55271e-15)
row #26: err=3.66222e-15 (max=3.55271e-15)
row #27: err=1.06328e-15 (max=8.88178e-16)
row #28: err=2.31959e-15 (max=1.77636e-15)
row #29: err=3.65102e-15 (max=3.55271e-15)
row #30: err=9.84652e-16 (max=4.44089e-16)
row #31: err=1.98222e-15 (max=1.33227e-15)
row #32: err=1.69428e-15 (max=8.88178e-16)
row #33: err=2.39616e-15 (max=1.77636e-15)
row #34: err=1.29213e-15 (max=8.88178e-16)
row #35: err=1.04169e-15 (max=4.44089e-16)
row #36: err=2.56552e-15 (max=1.77636e-15)
row #37: err=1.92892e-15 (max=8.88178e-16)
row #38: err=2.20448e-15 (max=1.77636e-15)
row #39: err=1.49001e-15 (max=6.66134e-16)
row #40: err=1.17059e-15 (max=5.55112e-16)
row #41: err=1.77533e-15 (max=8.88178e-16)
row #42: err=2.27739e-15 (max=1.77636e-15)
row #43: err=1.47627e-15 (max=6.66134e-16)
row #44: err=2.09264e-15 (max=1.33227e-15)
row #45: err=1.81502e-15 (max=8.88178e-16)
row #46: err=1.84387e-15 (max=8.88178e-16)
row #47: err=1.06552e-15 (max=4.44089e-16)
row #48: err=2.76471e-15 (max=1.77636e-15)
row #49: err=2.18163e-15 (max=1.33227e-15)
row #50: err=3.22704e-15 (max=1.77636e-15)
row #51: err=3.64846e-15 (max=1.77636e-15)
row #52: err=1.66905e-15 (max=6.66134e-16)
row #53: err=1.81576e-15 (max=1.11022e-15)
row #54: err=2.41371e-15 (max=1.77636e-15)
row #55: err=3.9903e-15 (max=3.55271e-15)
row #56: err=3.00212e-15 (max=1.77636e-15)
row #57: err=3.06269e-15 (max=1.77636e-15)
row #58: err=2.50664e-15 (max=1.77636e-15)
row #59: err=3.85325e-15 (max=3.55271e-15)
row #60: err=3.55556e-15 (max=1.77636e-15)
row #61: err=2.1962e-15 (max=8.88178e-16)
row #62: err=3.49413e-15 (max=1.77636e-15)
row #63: err=3.29766e-15 (max=1.77636e-15)
row #64: err=2.4585e-15 (max=1.33227e-15)
row #65: err=2.12112e-15 (max=8.88178e-16)
row #66: err=3.71809e-15 (max=1.77636e-15)
row #67: err=2.7659e-15 (max=8.88178e-16)
row #68: err=3.32757e-15 (max=1.77636e-15)
row #69: err=2.41245e-15 (max=8.60423e-16)
row #70: err=3.99688e-15 (max=1.9984e-15)
row #71: err=2.52257e-15 (max=8.88178e-16)
row #72: err=3.55973e-15 (max=1.55431e-15)
row #73: err=2.7763e-15 (max=8.88178e-16)
row #74: err=4.40704e-15 (max=3.55271e-15)
row #75: err=3.55809e-15 (max=1.77636e-15)
row #76: err=3.04663e-15 (max=1.77636e-15)
row #77: err=2.85651e-15 (max=1.11022e-15)
row #78: err=4.05814e-15 (max=1.77636e-15)
row #79: err=3.33612e-15 (max=1.32533e-15)
row #80: err=3.20748e-15 (max=1.77636e-15)
row #81: err=3.8984e-15 (max=1.77636e-15)
row #82: err=3.5669e-15 (max=1.22125e-15)
row #83: err=4.28332e-15 (max=2.22045e-15)
row #84: err=3.64221e-15 (max=1.33227e-15)
row #85: err=4.83762e-15 (max=3.55271e-15)
row #86: err=4.0986e-15 (max=1.77636e-15)
row #87: err=3.60163e-15 (max=1.77636e-15)
row #88: err=5.06272e-15 (max=3.55271e-15)
row #89: err=3.68688e-15 (max=1.77636e-15)
row #90: err=7.07646e-15 (max=5.32907e-15)
row #91: err=3.83584e-15 (max=1.05471e-15)
row #92: err=4.50821e-15 (max=1.77636e-15)
row #93: err=5.47632e-15 (max=1.77636e-15)
row #94: err=4.46046e-15 (max=1.44329e-15)
row #95: err=5.61405e-15 (max=3.55271e-15)
row #96: err=5.06176e-15 (max=2.22045e-15)
row #97: err=3.81964e-15 (max=1.55431e-15)
row #98: err=4.37526e-15 (max=1.77636e-15)
row #99: err=3.98392e-15 (max=1.55431e-15)
row #100: err=4.91222e-15 (max=1.77636e-15)
row #101: err=3.35853e-15 (max=1.22125e-15)
row #102: err=4.78829e-15 (max=2.22045e-15)
row #103: err=4.60413e-15 (max=1.33227e-15)
row #104: err=4.5791e-15 (max=1.38778e-15)
row #105: err=5.45668e-15 (max=1.9984e-15)
row #106: err=7.5096e-15 (max=3.55271e-15)
row #107: err=4.63925e-15 (max=1.33227e-15)
row #108: err=5.44862e-15 (max=2.44249e-15)
row #109: err=4.83685e-15 (max=2.22045e-15)
row #110: err=4.11954e-15 (max=1.55431e-15)
row #111: err=5.48967e-15 (max=1.9984e-15)
row #112: err=4.78231e-15 (max=1.77636e-15)
row #113: err=6.65255e-15 (max=2.22045e-15)
row #114: err=6.33143e-15 (max=3.55271e-15)
row #115: err=7.17902e-15 (max=3.21965e-15)
row #116: err=6.00826e-15 (max=1.83187e-15)
row #117: err=6.52156e-15 (max=2.22045e-15)
row #118: err=4.56739e-15 (max=1.55431e-15)
row #119: err=5.78508e-15 (max=2.22045e-15)
row #120: err=6.4643e-15 (max=2.08167e-15)
row #121: err=4.31762e-15 (max=1.33227e-15)
row #122: err=7.30575e-15 (max=3.55271e-15)
row #123: err=5.16371e-15 (max=1.55431e-15)
row #124: err=6.8954e-15 (max=2.66454e-15)
row #125: err=6.68844e-15 (max=1.9984e-15)
row #126: err=6.36886e-15 (max=2.10942e-15)
row #127: err=8.18275e-15 (max=3.10862e-15)
row #128: err=7.58721e-15 (max=2.9976e-15)
row #129: err=8.76019e-15 (max=2.44249e-15)
row #130: err=8.60251e-15 (max=4.16334e-15)
row #131: err=7.45057e-15 (max=1.88738e-15)
row #132: err=7.273e-15 (max=1.9984e-15)
row #133: err=8.46628e-15 (max=2.44249e-15)
row #134: err=6.03992e-15 (max=1.9984e-15)
row #135: err=8.54499e-15 (max=3.55271e-15)
row #136: err=7.33755e-15 (max=3.10862e-15)
row #137: err=1.32453e-14 (max=7.10543e-15)
row #138: err=9.21473e-15 (max=2.88658e-15)
row #139: err=1.38584e-14 (max=8.21565e-15)
row #140: err=9.92134e-15 (max=4.77396e-15)
row #141: err=8.12191e-15 (max=3.55271e-15)
row #142: err=8.54742e-15 (max=3.05311e-15)
row #143: err=1.1525e-14 (max=3.9968e-15)
row #144: err=9.56483e-15 (max=3.55271e-15)
row #145: err=7.57599e-15 (max=2.16493e-15)
row #146: err=9.08358e-15 (max=3.77476e-15)
row #147: err=1.261e-14 (max=4.16334e-15)
row #148: err=1.04084e-14 (max=3.88578e-15)
row #149: err=1.52547e-14 (max=6.21725e-15)
row #150: err=1.34445e-14 (max=6.21725e-15)
row #151: err=1.28415e-14 (max=5.32907e-15)
row #152: err=1.37001e-14 (max=4.88498e-15)
row #153: err=104.091 (max=38.2524)
row #154: err=90.9855 (max=28.555)
row #155: err=114.057 (max=35.5966)
row #156: err=90.1876 (max=34.3175)
row #157: err=114.274 (max=41.0308)
row #158: err=68.7615 (max=29.8493)
row #159: err=102.592 (max=45.7777)
row #160: err=88.559 (max=39.3841)
row #161: err=102.897 (max=37.4962)
row #162: err=89.7443 (max=39.1052)
row #163: err=91.8647 (max=40.6695)
row #164: err=92.5436 (max=39.478)
row #165: err=67.0603 (max=22.3479)
row #166: err=97.741 (max=35.374)
row #167: err=88.4444 (max=33.1283)
row #168: err=66.4308 (max=29.6943)
row #169: err=76.6372 (max=40.7606)
row #170: err=68.7239 (max=28.0245)
row #171: err=91.2993 (max=48.1353)
row #172: err=94.0889 (max=48.0026)
row #173: err=76.6705 (max=33.9253)
row #174: err=78.756 (max=39.5833)
row #175: err=51.6685 (max=29.4995)
row #176: err=74.8719 (max=28.4035)
row #177: err=82.6127 (max=35.2276)
row #178: err=43.8165 (max=20.9576)
row #179: err=67.3553 (max=27.4942)
row #180: err=74.5054 (max=39.5853)
row #181: err=52.8585 (max=29.805)
row #182: err=54.6962 (max=22.4845)
row #183: err=49.1812 (max=26.9341)
row #184: err=79.5791 (max=37.3105)
row #185: err=36.5226 (max=22.6301)
row #186: err=54.368 (max=37.7491)
row #187: err=31.9472 (max=16.7787)
row #188: err=59.4599 (max=33.4338)
row #189: err=67.0638 (max=49.7558)
row #190: err=54.539 (max=42.0158)
row #191: err=29.0013 (max=17.6628)
row #192: err=55.0378 (max=27.5013)
row #193: err=36.5066 (max=33.2416)
row #194: err=22.4157 (max=13.6764)
row #195: err=36.426 (max=29.0035)
row #196: err=24.4191 (max=22.5652)
row #197: err=27.3912 (max=25.9949)
row #198: err=0.915223 (max=0.915223)
row #199: err=3.60679e-13 (max=2.98261e-13)
494.5201252308407 49.755829752019224
494.5201252308407 49.755829752019224

I attached my test code in this message.

Wong Hang <wong...@gmail.com> wrote on Fri, Feb 7, 2020 at 10:49 PM:
cho.py

Wong Hang

unread,
Feb 7, 2020, 10:16:36 AM2/7/20
to theano...@googlegroups.com
I am quite sure I once got correct results even when the matrix was of size >1000.
Let me do more research and testing later and get back to you.

Wong Hang <wong...@gmail.com> wrote on Fri, Feb 7, 2020 at 11:18 PM:

Paul Baggenstoss

unread,
Feb 7, 2020, 10:30:43 AM2/7/20
to theano-users
Hi Wong Hang,
  Yes, that's what I saw: the errors started near the end of the matrix, and after that the numbers appeared random.
I'll try the older version and let you know what I find.
Paul
Paul Baggenstoss <p.m.ba...@ieee.org> wrote on Fri, Feb 7, 2020 at 9:49 PM:

Wong Hang

unread,
Feb 8, 2020, 2:49:29 AM2/8/20
to theano...@googlegroups.com
Hi Paul,

I think I fixed the issue. Please check the PR https://github.com/Theano/libgpuarray/pull/589
and you can try my branch of libgpuarray to see if it works for you.

For your implementation of MagmaCholesky, you can set profile = True in ~/.theanorc to see where the bottleneck of your implementation is.

[global]
profile = True


Paul Baggenstoss <p.m.bag...@ieee.org> wrote on Sat, Feb 8, 2020 at 12:30 AM:

Paul Baggenstoss

unread,
Feb 8, 2020, 8:20:07 AM2/8/20
to theano-users

Hi Wonghang,
I am now getting the same results for GPU and CPU, so it looks like it is fixed. I still need to do more checking, but it looks good so far.
I did not get your branch using the git commands you sent me; I had to manually download basic.py from your pull request. Not sure why...

Nice work,
Paul

Wong Hang

unread,
Feb 8, 2020, 9:11:52 AM2/8/20
to theano...@googlegroups.com
Hi Paul,

The previous commands were for checking out an old commit of libgpuarray.

To checkout my fix:

cd libgpuarray
git checkout fix_tri_kernel_sched

Best,
wonghang

Paul Baggenstoss <p.m.bag...@ieee.org> wrote on Sat, Feb 8, 2020 at 10:20 PM:

Paul Baggenstoss

unread,
Feb 10, 2020, 6:46:26 AM2/10/20
to theano-users
Hi Wonghang,
    So I am working toward making the Cholesky problem faster, and by that I also mean triangular solvers like GpuCublasTriangularSolve(). We typically do the Cholesky decomposition, then solve linear systems involving the Cholesky factor, which has upper or lower triangular form.

So I started with GpuCublasTriangularSolve(). It has a lot of overhead from all the GPU array matrix copying and creation, which adds to the overhead of theano.scan, which is necessary to work with batches. So I thought it would be much better to have batched versions of Cholesky() and GpuCublasTriangularSolve(). I have created working versions of these batched routines (attached). It's actually pretty simple: I just index into the data by batch, then call the GPU solver, potrf() for Cholesky and trsv() for the triangular solve. I loop over the batch like this:

    if theano.config.floatX == 'float32':
        wordlen = 4
    else:
        wordlen = 8
    for ib in range(nb):
        trsv(ctx.cublas_handle, uplo, trans, diag, n,
             A_ptr + ib*n*n*wordlen, lda, b_ptr + ib*n*m*wordlen, 1)


There are a few small gotchas, such as that the GPU routines expect F-ordered data, but to index as I did above, the data has to be C-ordered. So the data has to be C-ordered by batch, but F-ordered within each batch!
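[Editor's note: this layout gotcha can be checked from NumPy without a GPU. For a C-contiguous (nb, n, n) array, consecutive batch items are exactly n*n*itemsize bytes apart, which is the ib*n*n*wordlen pointer arithmetic above, and each C-contiguous slice viewed through .T is an F-contiguous view, one way to get the column-major data a solver expects without copying. A sketch:]

```python
import numpy as np

nb, n = 4, 3
wordlen = np.dtype(np.float32).itemsize        # 4 for float32, 8 for float64
A = np.zeros((nb, n, n), dtype=np.float32)     # C-contiguous across the batch

# batch item ib starts at byte offset ib * n * n * wordlen
batch_stride = A.strides[0]

# a C-contiguous (n, n) slice, seen through .T, is an F-contiguous view --
# the transpose trick for handing column-major data to BLAS-style routines
slice_is_c = A[0].flags['C_CONTIGUOUS']
transpose_is_f = A[0].T.flags['F_CONTIGUOUS']
```

Of course handing the transposed view to a Cholesky routine factors A[ib].T rather than A[ib], so either the matrices must be stored transposed (harmless for symmetric input) or the lower/upper flag flipped accordingly.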

 The problem I have, and where I will need help, is the gradients. Although I can compute the gradient in L_op correctly (I think), theano.gradient is not happy with the shape of the matrices that L_op returns. This is probably because gradient does not understand that the data is batched.

   Do you think you can help with this? I think it is over my head.
Paul


GpuCublasTriangularSolveBatch.py
GpuCholeskyBatch.py

Wong Hang

unread,
Feb 10, 2020, 10:22:20 PM2/10/20
to theano...@googlegroups.com
Hi Paul,

I am not quite sure what you are going to do. 

If you want a batched version of Cholesky and TriangularSolve, you will end up with "nb" copies of the gradients.
What are you going to do with those "nb" copies?
AFAIK, theano.grad only accepts a scalar cost. If you need a Jacobian, you can only implement it with theano.scan, and I know theano.scan is inefficient.


Best,
wonghang

Paul Baggenstoss <p.m.bag...@ieee.org> wrote on Mon, Feb 10, 2020 at 8:46 PM:

Paul Baggenstoss

unread,
Feb 11, 2020, 7:16:08 AM2/11/20
to theano-users
Hi wonghang,
    I am trying to develop batched versions of both GpuCholesky and GpuCublasTriangularSolve().
I was confused about what to do in L_op with the extra batch dimension, but after carefully reading the docs again, I now understand it. L_op is just a linearized and transposed version of the Op, so the input of L_op has the same shape and form as the output of the Op, and vice versa. Now I've got it. The attached functions are tested and work on batches. Let 'nb' be the batch size, so Cll has dimensions (nb, n, n) and the right-hand side er has dimensions (nb, n). Then to solve Cll * dl = er for dl, I use:

          R = theano.gpuarray.linalg.GpuCholeskyBatch(lower=True)(Cll)
          lam1= theano.gpuarray.linalg.GpuCublasTriangularSolveBatch(lower=True)(R,er)
          R_T=R.dimshuffle(0,2,1)
          dl= theano.gpuarray.linalg.GpuCublasTriangularSolveBatch(lower=False)(R_T,lam1)

This does the classical Cholesky followed by back-substitution. This code runs in 4.0 seconds; when I use scan over the non-batched versions, it runs in 10.3 seconds, so the batched version is about 2.5 times faster!
That makes a big difference to me, as I can get a result before I go home.
Paul







linalg.py

Paul Baggenstoss

unread,
Feb 13, 2020, 6:04:15 AM2/13/20
to theano-users
Hi Wonghang,
     Good news. I thought about the issue you pointed out, that scan runs serially, and that it would be necessary to write a truly parallel GPU implementation. So after looking at some CUDA tutorials, I was able to get Cholesky to run as a batch, with each batch sample running on an individual core.

In order to interface it to Theano as a new class, I had to add it as a new function to MAGMA and link it into libmagma.so. It is working, produces the correct answer, and is even faster. I don't know yet how much faster, because I have not separated out the time of the back-substitution, but the execution time went from 2.9 sec to 2.0 sec. Once I isolate just the Cholesky routine, I think the speedup will be substantial. And once I also get back-substitution parallelized, solving the symmetric matrix problem should be very fast. I think the method I used (linking the CUDA code into libmagma.so) is just temporary; once I get some routines finished, it should be moved to a separate library.

I hope to get some help with that, because my knowledge of the inner workings of Theano is limited.

Let me know if you (or anyone) is interested in this project.

Paul


Wong Hang

unread,
Feb 13, 2020, 6:46:24 AM2/13/20
to theano...@googlegroups.com
Hi Paul,

I am struggling with my own project right now, so I don't think I have time to work on any side projects.
Theano is no longer maintained. One of my friends, who works as a data scientist, told me that most researchers in academia have already switched to PyTorch.
I recently looked at PyTorch; it appears to have a batched Cholesky operator: https://github.com/pytorch/pytorch/pull/14017

Best,
wonghang

Paul Baggenstoss <p.m.bag...@ieee.org> wrote on Thu, Feb 13, 2020 at 8:04 PM: