Intel MKL

tr...@cornell.edu

unread,
Jul 5, 2013, 4:04:04 PM7/5/13
to ceres-...@googlegroups.com
Hi,

I've recently started using Ceres and have gotten great results. I'm currently modeling a dense problem, using DENSE_QR for the linear solve. I was wondering if there is an easy way to use Intel MKL as the backend. The Eigen documentation says that, with suitable defines and libraries linked in, it can substitute MKL routines for its own, which several colleagues have said can improve performance compared to Eigen's QR solver. I can poke through the CMake files and see if I can find a suitable variable to put the libraries in, or define a new one, but I thought I would check first in case anyone has already done this.
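
The Eigen mechanism I have in mind is roughly this (a minimal sketch; the exact link line depends on the MKL installation):

    // Define before including any Eigen header; with the MKL libraries linked
    // in, Eigen routes its supported dense kernels through MKL.
    #define EIGEN_USE_MKL_ALL
    #include <Eigen/Dense>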

Thanks, Tim

Sameer Agarwal

unread,
Jul 5, 2013, 5:51:37 PM7/5/13
to ceres-...@googlegroups.com
Hi Tim,

Having a higher-performance QR factorization is a great idea.

But instead of going via Eigen, I think there is a better way.

Ceres already links against LAPACK/BLAS in certain circumstances. It is simpler to just use the QR factorization from whatever LAPACK is linked in, when present. So if the user has linked against Intel MKL as their BLAS/LAPACK library, they get MKL's implementation of the QR decomposition. This approach will also work with OpenBLAS and ATLAS, which is nice since Intel MKL is a commercial library and not everyone can or is willing to pay for it.

This would require doing two things.

1. Extend the dense_qr_solver inside Ceres to use the LAPACK routines when they are available.
2. Modify the main CMake file to search for and link against LAPACK/BLAS even when SuiteSparse is not being used (a rough sketch is below).
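
On the CMake side it could look something like this (a sketch only; the define and variable names here are hypothetical, not what Ceres actually uses):

    # Uses CMake's stock FindLAPACK/FindBLAS modules.
    find_package(LAPACK)
    if (LAPACK_FOUND)
      # Hypothetical define guarding the LAPACK code path in dense_qr_solver.
      add_definitions(-DCERES_USE_LAPACK)
      list(APPEND CERES_LIBRARIES ${LAPACK_LIBRARIES} ${BLAS_LIBRARIES})
    endif ()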

We would very much appreciate a patch implementing this.

Sameer




tr...@cornell.edu

unread,
Jul 9, 2013, 8:55:57 AM7/9/13
to ceres-...@googlegroups.com
Ok, I'll look into it.

Sameer Agarwal

unread,
Jul 11, 2013, 1:03:34 PM7/11/13
to ceres-...@googlegroups.com
Timothy,

I did some experiments comparing DGELS from single-threaded ATLAS against householderQR in Eigen3, and Eigen wins: by a large amount for small matrices, and by 10% or so for large matrices.

It's possible that Intel MKL is better than ATLAS, and that threading will make a difference. But I would like to see evidence of that before we modify the DENSE_QR solver in Ceres.

What matrix sizes are of interest to you? And do you have any numbers indicating that the DGELS in MKL is faster than Eigen's Householder QR factorization?

When comparing the speed of MKL to Eigen, it's important to make sure that your matrices are column major and that you are using householderQR and not colPivHouseholderQR or fullPivHouseholderQR, since DGELS does not use pivoting as far as I know.
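
Something like the following would be an apples-to-apples Eigen baseline (a sketch, sizes picked only for illustration): MatrixXd is column major by default, and householderQr() is the unpivoted factorization that matches DGELS:

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
      const int m = 10000, n = 300;
      Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);  // column major by default
      Eigen::VectorXd b = Eigen::VectorXd::Random(m);

      // Unpivoted Householder QR, the closest match to DGELS.
      Eigen::VectorXd x = A.householderQr().solve(b);
      // Pivoted variants, slower but more robust to rank deficiency:
      // A.colPivHouseholderQr().solve(b) or A.fullPivHouseholderQr().solve(b)
      std::cout << x.norm() << "\n";
      return 0;
    }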

Sameer

tr...@cornell.edu

unread,
Aug 7, 2013, 1:13:13 PM8/7/13
to ceres-...@googlegroups.com
Hi Sameer,

Sorry for the delayed response. The problem I'm working on is large and dense, somewhere in the range of 150000 x 1000. Instead of dgels, I'm using a sequence of three calls, dqeqp3, dormqr, dtrsm, which I believe solves the system using a column-pivoted QR, similar to Eigen's colPivHouseholderQR. All matrices are column major. When testing at small sizes, Eigen is faster. When testing at 150000 x 1000, MKL wins by a good margin:

Eigen: 231 seconds
MKL 1 thread: 104 seconds
MKL 4 threads: 62 seconds

Using 10000 x 300:
Eigen: 1 second
MKL 1 thread: 0.6 seconds
MKL 4 threads: 0.25 seconds

I haven't tried other BLAS/LAPACK libraries yet; I can look into that also. Which sizes were you looking at? Does this timing for Eigen seem on par with what you were seeing?
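
For reference, the three-call sequence above looks roughly like this (a sketch only, assuming the LAPACKE/CBLAS C interfaces, a single right-hand side, column-major storage, and m >= n):

    #include <mkl.h>   // or <lapacke.h> and <cblas.h> with another LAPACK
    #include <vector>

    // Solves min ||A x - b||_2 via column-pivoted QR: A * P = Q * R.
    void PivotedQrSolve(int m, int n, double* A, double* b, double* x) {
      std::vector<lapack_int> jpvt(n, 0);   // 0 => every column is a free column
      std::vector<double> tau(n);

      LAPACKE_dgeqp3(LAPACK_COL_MAJOR, m, n, A, m, jpvt.data(), tau.data());
      // b <- Q^T b
      LAPACKE_dormqr(LAPACK_COL_MAJOR, 'L', 'T', m, 1, n, A, m, tau.data(), b, m);
      // b(1:n) <- R^{-1} b(1:n); R is stored in the top n x n triangle of A.
      cblas_dtrsm(CblasColMajor, CblasLeft, CblasUpper, CblasNoTrans, CblasNonUnit,
                  n, 1, 1.0, A, m, b, m);
      // Undo the column permutation: x(jpvt(j)) = b(j); jpvt is 1-based.
      for (int j = 0; j < n; ++j) x[jpvt[j] - 1] = b[j];
    }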

-Tim

tr...@cornell.edu

unread,
Aug 7, 2013, 1:16:09 PM8/7/13
to ceres-...@googlegroups.com
Sorry, there was a typo. The first call is dgeqp3, not dqeqp3.

tr...@cornell.edu

unread,
Aug 7, 2013, 4:25:07 PM8/7/13
to ceres-...@googlegroups.com
I also tested this using the default BLAS/LAPACK libraries on Ubuntu 12.04, and didn't see any performance improvement compared to Eigen. I don't think ATLAS provides the dgeqp3 routine. I've changed the dense_qr_solver locally, but so far I only see performance improvements when using MKL, and then only on large systems, so it may not be worthwhile to add it to the main repo. It's also possible that the speedups MKL gives are due to the fact that I'm running on Intel processors; I'm not sure MKL is as heavily optimized for other architectures.

Sameer Agarwal

unread,
Aug 8, 2013, 1:21:27 AM8/8/13
to ceres-...@googlegroups.com
Thanks Tim.
This is very useful. My experiments were with ATLAS on Linux, and there it was not worth switching to LAPACK. But it looks like MKL is rather fast.

As far as I can tell MKL does contain DGELS. 


So I am a bit surprised that you are going the dgeqp3, dormqr, dtrsm route, since there is no need to store the Q factor; it can be applied to the RHS as the QR factorization is being computed. Why not use DGELS? It's a single call, and should be faster than the three calls you are making. Further, it does not use pivoting, so it should be faster still.

Would you mind checking DGELS numbers for the same problems? I would like the Ceres implementation to use DGELS. The plan would be to have both Eigen- and LAPACK-based DENSE_QR and DENSE_NORMAL_CHOLESKY solvers, and to let the user choose which of the two dense linear algebra backends should be used.
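
The DGELS route collapses the whole solve into one call; a minimal sketch (again assuming the LAPACKE interface, column major, m >= n, one right-hand side):

    #include <mkl.h>   // or <lapacke.h>

    // Unpivoted QR least squares; on exit the first n entries of b hold x.
    void DgelsSolve(int m, int n, double* A, double* b) {
      LAPACKE_dgels(LAPACK_COL_MAJOR, 'N', m, n, /*nrhs=*/1, A, m, b, m);
    }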

Thanks,
Sameer




Sameer Agarwal

unread,
Aug 8, 2013, 4:57:36 AM8/8/13
to ceres-...@googlegroups.com
So I was able to replicate the performance difference on my Mac. I am using the accelerated linear algebra libraries that ship with Mac OS X (vecLib).

For a 15000 x 1000 linear system:

DENSE_QR

Eigen 3.1.3 colPivHouseholderQR - 11.5 secs
Eigen 3.1.3 householderQR - 4.1 secs
vecLib DGELS - 1.2 secs

and now for a surprise :)

DENSE_NORMAL_CHOLESKY, which forms the Gauss-Newton Hessian and solves it using Eigen's LDLT factorization, takes about as long as DGELS, at 1.6 seconds.

I'd be curious to see how your problem does if you just switch from DENSE_QR to DENSE_NORMAL_CHOLESKY. There is a small risk that it may lead to poorer numerical performance because of conditioning, but it is definitely worth exploring.
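
Conceptually, DENSE_NORMAL_CHOLESKY does the following (a sketch; squaring J is what creates the conditioning risk mentioned above):

    #include <Eigen/Dense>

    // Form the Gauss-Newton normal equations J^T J x = J^T b and solve with LDLT.
    Eigen::VectorXd NormalCholeskySolve(const Eigen::MatrixXd& J,
                                        const Eigen::VectorXd& b) {
      const Eigen::MatrixXd H = J.transpose() * J;   // n x n, condition number squared
      const Eigen::VectorXd g = J.transpose() * b;
      return H.ldlt().solve(g);
    }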

I'd be curious to see how DGELS and DENSE_NORMAL_CHOLESKY fare on your system.

Sameer



tr...@cornell.edu

unread,
Aug 8, 2013, 12:29:51 PM8/8/13
to ceres-...@googlegroups.com
Hi Sameer,

Yes, I'd be happy to do some more testing. DENSE_NORMAL_CHOLESKY is much faster than DENSE_QR for my problem; I'll get and post some exact timings. I've been switching back and forth between the two. Hopefully I can just use Cholesky, but I'm not sure my problem will always be well conditioned.
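
(For reference, switching between the two in Ceres is a one-line change to the solver options:

    ceres::Solver::Options options;
    options.linear_solver_type = ceres::DENSE_NORMAL_CHOLESKY;  // or ceres::DENSE_QR

which is what I've been flipping back and forth.)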

I'll test DGELS also. I went the other route so that I could compare against the pivoted QR that Ceres currently uses. Isn't pivoted QR more stable/accurate than basic QR (at the expense of being slower), especially if the Jacobian is rank deficient? Would it be worth having two QR solvers, or an option to turn pivoting on or off? Or is the solve always well conditioned because of the damping?

-Tim

tr...@cornell.edu

unread,
Aug 9, 2013, 4:18:25 PM8/9/13
to ceres-...@googlegroups.com
Hi,

I've done some more testing. Here are the results:

15000 x 1000 system

Eigen colPivHouseholderQR:  35.6s
Eigen householderQR:  6.1s
Eigen LDLT on normal equations:  2.5s

MKL, 1 thread:   dgeqp3 12.3s,   dgels 4.4s
MKL, 2 threads:  dgeqp3 9.1s,    dgels 2.5s
MKL, 4 threads:  dgeqp3 4.2s,    dgels 1.4s
MKL, 8 threads:  dgeqp3 3.8s,    dgels 0.94s


150000 x 1000 system

Eigen colPivHouseholderQR:  358.4s
Eigen householderQR:  74.6s
Eigen LDLT on normal equations:  24s

MKL, 1 thread:   dgeqp3 130.1s,  dgels 46s
MKL, 2 threads:  dgeqp3 103.8s,  dgels 26.3s
MKL, 4 threads:  dgeqp3 93.2s,   dgels 15.6s
MKL, 8 threads:  dgeqp3 96.5s,   dgels 14.1s


dgels is much faster than QR with pivoting, and Cholesky on the normal equations is fast too. I have not tested the LAPACK version of the normal-equations solve; my guess is that with MKL it would be as fast as or faster than Eigen's version. For the 15000 x 1000 system, my times are significantly slower than yours; maybe my hardware is older. It's also worth noting that the testing I've been doing has been with randomly generated systems (i.e., generate A and c, set b = Ac, then use A and b to time the solves with the different methods).
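
Concretely, the test systems are built like this (a sketch of the setup, using the larger size from above):

    #include <Eigen/Dense>
    #include <iostream>

    int main() {
      const int m = 150000, n = 1000;
      Eigen::MatrixXd A = Eigen::MatrixXd::Random(m, n);
      Eigen::VectorXd c = Eigen::VectorXd::Random(n);   // known "true" solution
      Eigen::VectorXd b = A * c;                         // consistent right-hand side

      // Each method being timed solves min ||A x - b||; e.g. Eigen's unpivoted QR:
      Eigen::VectorXd x = A.householderQr().solve(b);
      std::cout << "recovery error: " << (x - c).norm() << "\n";
      return 0;
    }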

-Tim

Sameer Agarwal

unread,
Aug 9, 2013, 4:57:37 PM8/9/13
to ceres-...@googlegroups.com
Thanks Tim. These numbers look pretty good. I am already working on enabling multiple dense linear algebra backends in Ceres. I have a rough version of the code working, including the API changes, but it's going to take me a bit to clean it up, test it, and submit it.

Also, since this is a large change, my plan is to check it in after 1.7.0 is released, which should be within the week or so.

Sameer



