Intel MKL libraries: nearly 2x speedup for logistic model


Kyle Foreman

Jan 11, 2014, 12:30:51 PM
to stan-...@googlegroups.com
I've seen some speculation on the list that using the Intel MKL libraries to compile Stan and Eigen might speed up sampling, but no confirmation. I just got MKL working and thought I'd pass along my results.

To test I ran the stan/src/models/speed/logistic model with following sampling setup:
  sample
    num_samples = 100000
    num_warmup = 100000
    thin = 1000
  random
    seed = 1

Here's the results from a vanilla (g++) run:
  Elapsed Time: 23.43 seconds (Warm-up)
                36.5 seconds (Sampling)
                59.93 seconds (Total)

And here's the results for MKL (using icc to compile and linking in the appropriate libraries):
  Elapsed Time: 14.33 seconds (Warm-up)
                17.67 seconds (Sampling)
                32 seconds (Total)

So MKL cuts sampling time nearly in half!

Same system, same random seed, same stan version (latest commit b63bfa7), both versions single threaded, etc. Compilation takes a bit longer with the Intel compiler (I haven't timed it, but I estimate 2x).

Implementing this is rather easy once you've got MKL installed on your system:
  1. Download and extract a fresh copy of Stan
  2. In your makefile change CC = g++ to CC = icc
  3. Add the MKL path to makefile (e.g. MKLROOT = /apps/intel/2013/mkl)
  4. Add the following to makefile's CFLAGS: -I $(MKLROOT)/include and -DEIGEN_USE_MKL_ALL
  5. Link to your MKL library by adding to your makefile's LDLIBS
    • The exact implementation will depend on your system. Use the MKL link line advisor for help.
    • e.g. in my case: -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm
  6. Compile your model as usual
    • e.g. make src/models/speed/logistic/logistic
    • Note: make sure to do the above changes before compiling for the first time - otherwise Stan will be compiled with g++ and you won't see any performance gains
      • p.s. Devs - is there any way to force Stan to recompile? (short of deleting all the .cpp files, I suppose)
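In shell terms, steps 2-5 boil down to variables like the following - note this is a sketch only, and MKLROOT and the exact MKL library list are examples from my system; use the link line advisor to get the right ones for yours:

```shell
# Sketch only: MKLROOT and the MKL library list below are examples from
# one system; use the MKL link line advisor to get yours.
MKLROOT=/apps/intel/2013/mkl
CC=icc
CFLAGS="-O3 -I $MKLROOT/include -DEIGEN_USE_MKL_ALL"
LDLIBS="-L$MKLROOT/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm"

# The compile line make ends up running for a model looks roughly like:
echo "$CC $CFLAGS logistic.cpp -o logistic $LDLIBS"
```

The echo just shows the shape of the final command; in practice make assembles it from the variables in the makefile.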
I've also attached an example makefile here (just remove .txt extension) in case the above isn't clear. I hope others find this useful!

Best,
Kyle
makefile.txt

Bob Carpenter

Jan 11, 2014, 3:51:48 PM
to stan-...@googlegroups.com
That sounds awesome. We're glad the makefile worked.

Were you using -O3 in the g++ compilation? I think
Ben tested MKL a while ago and didn't see that much difference.

More below.

On 1/11/14, 6:30 PM, Kyle Foreman wrote:
> * Note: make sure to do the above changes before compiling for the first time - otherwise Stan will be compiled
> with g++ and you won't see any performance gains
> * p.s. Devs - is there any way to force Stan to recompile? (short of deleting all the .cpp files, I suppose)

You can touch the .stan file, which should cause everything to be
done again. Or you can touch the cpp file, which will
force recompilation. But it still won't rebuild all the libs.
It's just the way make works --- if nothing's changed, it
won't recompile.
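As a toy sketch of the mechanism (a throwaway makefile standing in for Stan's, since the principle is the same):

```shell
# Minimal demonstration: make rebuilds a target only when a prerequisite
# has a newer timestamp, so touching the source forces a rebuild.
workdir=$(mktemp -d)
cd "$workdir"
printf 'out: in\n\tcp in out\n' > Makefile
echo v1 > in
make -s out
t1=$(stat -c %Y out 2>/dev/null || stat -f %m out)
sleep 1
make -s out      # no-op: out is already newer than in
t2=$(stat -c %Y out 2>/dev/null || stat -f %m out)
touch in         # like touching the .stan or .cpp file
make -s out      # rebuilds, since in is now newer than out
t3=$(stat -c %Y out 2>/dev/null || stat -f %m out)
echo "no-op mtime change: $((t2 - t1)); rebuild mtime change: $((t3 - t2))"
```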

> I've also attached an example makefile here (just remove .txt extension) in case the above isn't clear. I hope others
> find this useful!

Thanks!

Do you mind if we include these instructions in the manual?

It looks like they have some kind of non-commercial license:

http://software.intel.com/en-us/non-commercial-software-development

But it appears to be linux only.

- Bob

Kyle Foreman

Jan 11, 2014, 4:49:17 PM
to stan-...@googlegroups.com
Yes, I used -O3 for both icc and g++. 

In one of Ben's comments on another thread it sounded to me like he had tried using icc for compiling but hadn't linked MKL to Eigen. I found that simply using icc alone gave little to no performance improvements - it was the recompiling of Stan+Eigen with the -DEIGEN_USE_MKL_ALL flag that drastically improved performance.

More info on Eigen+MKL here: http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html


Re: Intel MKL licensing, here's how they handle it in the Eigen docs:
Warning
Be aware that Intel® MKL is a proprietary software. It is the responsibility of the users to buy MKL licenses for their products. Moreover, the license of the user product has to allow linking to proprietary software that excludes any unmodified versions of the GPL.

And yes, please feel free to use those instructions any way you'd like.

Ben Goodrich

Jan 11, 2014, 5:55:26 PM
to stan-...@googlegroups.com
On Saturday, January 11, 2014 4:49:17 PM UTC-5, Kyle Foreman wrote:
> Yes, I used -O3 for both icc and g++.
>
> In one of Ben's comments on another thread it sounded to me like he had tried using icc for compiling but hadn't linked MKL to Eigen. I found that simply using icc alone gave little to no performance improvements - it was the recompiling of Stan+Eigen with the -DEIGEN_USE_MKL_ALL flag that drastically improved performance.

That is a possibility. One problem is that the logistic.stan file is sort of poorly written, so it might be the case that icc is compensating for that better. Can you try it on your system with the attached files?

Ben

logistic.stan
logistic.data.R

Kyle Foreman

Jan 11, 2014, 6:26:36 PM
to stan-...@googlegroups.com
Sure thing. With those new versions of logistic I'm still seeing ~2x speedup.

g++
  Elapsed Time: 19.16 seconds (Warm-up)
                24.49 seconds (Sampling)
                43.65 seconds (Total)

icc 
  Elapsed Time: 10.32 seconds (Warm-up)
                13.38 seconds (Sampling)
                23.7 seconds (Total)

However, I just ran one of my own much larger models, one I've already spent a lot of time optimizing, and it saw only negligible gains.

Ben Goodrich

Jan 11, 2014, 6:54:50 PM
to stan-...@googlegroups.com
On Saturday, January 11, 2014 6:26:36 PM UTC-5, Kyle Foreman wrote:
> Sure thing. With those new versions of logistic I'm still seeing ~2x speedup.

Interesting. That would seem to imply that icc does a much better job on X * beta because that is the only thing in that model where Eigen is involved. But the matrix-vector product benchmark at

http://eigen.tuxfamily.org/index.php?title=Benchmark

suggests that it should be the same. Maybe Stan kills the performance by not using the expression templates and icc unkills it somehow?

Ben
 

Kyle Foreman

Jan 12, 2014, 4:59:53 AM
to stan-...@googlegroups.com
Is Eigen used anywhere in the HMC sampler, or is X * beta the only place it'd be invoked at all?

I suppose there are two different variables to test here - using icc as the compiler, and enabling MKL for Eigen. And then those permutations can be applied when compiling Stan+Eigen and/or the model itself. I'll carve out some time this afternoon to systematically test those.
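For bookkeeping, the permutations enumerate mechanically - the labels below are just run identifiers, not actual build flags:

```shell
# Enumerate the eight permutations to time: compiler used for the
# Stan+Eigen build, compiler used for the model, and the MKL flag.
for libstan_cc in g++ icc; do
  for model_cc in g++ icc; do
    for mkl in off on; do
      echo "libstan=$libstan_cc model=$model_cc mkl=$mkl"
    done
  done
done
```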

Bob Carpenter

Jan 12, 2014, 8:35:02 AM
to stan-...@googlegroups.com


On 1/12/14, 10:59 AM, Kyle Foreman wrote:
> Is Eigen used anywhere in the HMC sampler, or is X * beta the only place it'd be invoked at all?

It'll be used much more in the Riemann manifold HMC.

Eigen's also used to implement all of the matrix operations
in Stan, so it will be invoked there.

Thanks for testing this.

- Bob


Bob Carpenter

Jan 12, 2014, 8:46:30 AM
to stan-...@googlegroups.com
Thanks --- I'll drop the instructions into the next manual
and thank you for them.

- Bob

Kyle Foreman

Jan 12, 2014, 4:32:32 PM
to stan-...@googlegroups.com
So I ran several tests of 400k iterations of Ben's modified logistic model to determine what factors made a difference. 

It turns out that the compiler used to compile Stan and Eigen the first time has no effect, and neither does whether the -DEIGEN_USE_MKL_ALL flag is set. The only differences were due to using icc vs. g++ when compiling the model itself:

Stan+Eigen compiler  Stan+Eigen MKL flag  Model compiler  Model MKL flag  Time (s)
icc                  yes                  icc             yes             47
icc                  yes                  icc             no              47
icc                  no                   icc             yes             47
icc                  no                   icc             no              46
g++                  NA                   g++             NA              86
g++                  NA                   icc             yes             48
g++                  NA                   icc             no              47
icc                  yes                  g++             NA              87
icc                  no                   g++             NA              88

I'm surprised by this. I guess the difference I noticed when setting the flag for Eigen before was attributable to something else (I didn't time it very formally then, and the runs were nearly a half hour apart, so other processes may have been using the CPU without my realizing it).

So using icc for model compilation seems to be the key. That doesn't seem to mesh with Ben's experience, however, so maybe there's something else at play here. Probably worth testing on someone else's machine to see if it holds up.


Michael Betancourt

Jan 12, 2014, 5:34:15 PM
to stan-...@googlegroups.com
This makes sense to me -- Eigen is all templated out so code won't get generated until
the model is compiled.

Ben Goodrich

Jan 12, 2014, 5:57:32 PM
to stan-...@googlegroups.com
On Sunday, January 12, 2014 4:32:32 PM UTC-5, Kyle Foreman wrote:
> So using icc for model compilation seems to be the key. That doesn't seem to mesh with Ben's experience, however, so maybe there's something else at play here. Probably worth testing on someone else's machine to see if it holds up.

What version numbers of g++ and icc are we talking about here?

Kyle Foreman

Jan 13, 2014, 3:56:19 AM
to stan-...@googlegroups.com
On Sunday, January 12, 2014 10:57:32 PM UTC, Ben Goodrich wrote:
> What version numbers of g++ and icc are we talking about here?

Should've thought to include that:
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3)
icc (ICC) 13.0.0 20120731

Kyle Foreman

Jan 13, 2014, 4:02:10 AM
to stan-...@googlegroups.com
On Sunday, January 12, 2014 10:34:15 PM UTC, Michael Betancourt wrote:
> This makes sense to me -- Eigen is all templated out so code won't get generated until
> the model is compiled.

Ah, that explains why icc gives several warnings related to Eigen when compiling the model.

The fact that leaving out the flags/libs for specifying MKL in the icc model compilation step (i.e. steps 4 and 5 from my original post) makes no difference suggests to me that it's icc, not MKL, that's improving performance (though testing a model which utilizes Eigen more might show that MKL also helps).

Ben Goodrich

Jan 13, 2014, 4:27:14 AM
to stan-...@googlegroups.com

I'm guessing the reason why you are seeing different results than I did a few months back is that g++ 4.4.x dates back to 2009 -- 2010. I guess GNU or Red Hat did bugfix releases up until 2012, but I was testing with g++ 4.7.x I think. I'll have to do the comparisons again (also with clang++) soon.

Ben

Kyle Foreman

Jan 13, 2014, 6:46:57 AM
to stan-...@googlegroups.com
On Monday, January 13, 2014 9:27:14 AM UTC, Ben Goodrich wrote:
> I'm guessing the reason why you are seeing different results than I did a few months back is that g++ 4.4.x dates back to 2009 -- 2010. I guess GNU or Red Hat did bugfix releases up until 2012, but I was testing with g++ 4.7.x I think. I'll have to do the comparisons again (also with clang++) soon.

Wow, I hadn't realized 4.4.7 was that old - good to know.

Ben Goodrich

Jan 13, 2014, 1:34:42 PM
to stan-...@googlegroups.com

You generally want to use whatever is the most recent compiler you can get a hold of. For this example, doing

./logistic method=sample num_samples=100000 num_warmup=100000 thin=1000 data file=logistic.data.R random seed=1 output refresh=200001 > time.txt

after utilizing various compilers (libstan was compiled from git SHA 09fbc64 once with g++-4.8) yields

COMPILER      WARMUP  SAMPLING  TOTAL
llvm-g++-4.2  12.02   15.46     27.48
g++-4.4        9.38   12.11     21.49
g++-4.5        9.41   12.19     21.60
g++-4.6        9.03   11.56     20.59
g++-4.7        9.09   11.77     20.86
g++-4.8        9.21   11.81     21.02
g++-4.9 (RC)   9.36   12.09     21.45
clang++-3.4    8.51   10.99     19.50
icpc (13.x)    7.91   10.23     18.14

So, icpc does the best here (Debian, i7), although it was not nearly as stark a difference as what you saw (RHEL). Unfortunately, we see bigger differences across compilers on complicated (typically multivariate) models than in a simple logistic GLM.

Ben
