Intel MKL - threaded math not compiling on CentOS 7

P-E Honnet

May 9, 2019, 5:30:47 AM
to kaldi-help
Hi,

It seems that since MKL became the default math library, I am not able to configure in src/ with both the MKL and threaded math options (on CentOS 7). This is the configure command I was running before:

$ ./configure --shared --use-cuda=yes --mkl-root=/opt/intel/mkl --threaded-math=yes

and it was working fine with an earlier version (e.g. commit f9828e9a2a71f69b72a50369b018b89fc889e1b2); compilation was also fine.
Now, with the latest version (commit 9702cbc3a081501d4df0124d9b1f7fb9b18fb3ee), this is the output:

$ ./configure --shared --use-cuda=yes --mkl-root=/opt/intel/mkl --threaded-math=yes
Configuring KALDI to use MKL.
Checking compiler g++ ...
Checking OpenFst library in /home/${user}/Projects/kaldi_mkl_test/tools/openfst-1.6.7 ...
Checking cub library in /home/${user}/Projects/kaldi_mkl_test/tools/cub-1.8.0 ...
Doing OS specific configurations ...
On Linux: Checking for linear algebra header files ...
Configuring MKL library directory: Found: /opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64
MKL configured with threading: iomp, libs: -L/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -Wl,-rpath=/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -lmkl_intel_lp64  -lmkl_core  -lmkl_intel_thread
MKL include directory configured as: /opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/include
Configuring MKL threading as iomp
./configure: line 335: cd: lib/intel64: No such file or directory
./configure: line 337: cd: lib/em64t: No such file or directory
./configure: line 340: cd: lib/intel64: No such file or directory
./configure: line 342: cd: lib/em64t: No such file or directory
***configure failed: Could not find the iomp5 library, have your tried the --omp-libdir switch? ***
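
A possible workaround, suggested by the error message itself, would be to pass --omp-libdir explicitly, pointed at the directory containing libiomp5.so; the path below is only a guess based on the standard Intel layout (the compiler runtime lives next to the MKL tree, not inside it):

$ find /opt/intel/compilers_and_libraries_2017.4.196/linux -name 'libiomp5.so*'
$ ./configure --shared --use-cuda=yes --mkl-root=/opt/intel/mkl --threaded-math=yes \
    --omp-libdir=/opt/intel/compilers_and_libraries_2017.4.196/linux/compiler/lib/intel64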

The only way I can get it to work is to keep the default "sequential" MKL threading:

$ ./configure --shared --use-cuda=yes --mkl-root=/opt/intel/mkl
Configuring KALDI to use MKL.
Checking compiler g++ ...
Checking OpenFst library in /home/${user}/Projects/kaldi_mkl_test/tools/openfst-1.6.7 ...
Checking cub library in /home/${user}/Projects/kaldi_mkl_test/tools/cub-1.8.0 ...
Doing OS specific configurations ...
On Linux: Checking for linear algebra header files ...
Configuring MKL library directory: Found: /opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64
MKL configured with threading: sequential, libs: -L/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -Wl,-rpath=/opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/lib/intel64 -lmkl_intel_lp64  -lmkl_core  -lmkl_sequential
MKL include directory configured as: /opt/intel/compilers_and_libraries_2017.4.196/linux/mkl/include
Configuring MKL threading as sequential
MKL threading libraries configured as -ldl -lpthread -lm
Using Intel MKL as the linear algebra library.
Intel(R) Math Kernel Library Version 2017.0.3 Product Build 20170413 for Intel(R) 64 architecture applications
Successfully configured for Linux with MKL libs from /opt/intel/compilers_and_libraries_2017.4.196/linux/mkl
Using CUDA toolkit /usr/local/cuda (nvcc compiler and runtime libraries)
INFO: Configuring Kaldi not to link with Speex. Don't worry, it's only needed if
      you intend to use 'compress-uncompress-speex', which is very unlikely.
WARNING: slow expf() detected. expf() is slower than exp() by the factor of 1.20813
*** WARNING: expf() seems to be slower than exp() on your machine. This is a known bug in old versions of glibc. Please consider updating glibc. ***
*** Kaldi will be configured to use exp() instead of expf() in base/kaldi-math.h Exp() routine for single-precision floats. ***
Kaldi has been successfully configured. To compile:

  make -j clean depend; make -j <NCPU>

where <NCPU> is the number of parallel builds you can afford to do. If unsure,
use the smaller of the number of CPUs or the amount of RAM in GB divided by 2,
to stay within safe limits. 'make -j' without the numeric value may not limit
the number of parallel jobs at all, and overwhelm even a powerful workstation,
since Kaldi build is highly parallelized.

but then of course I cannot run some of the things in multithreaded mode. Unfortunately I cannot upgrade the MKL libraries on the machine as I am not root. It looks like I have version 2017.0.3 (INTEL_MKL_VERSION 20170003).
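
For reference, the installed version can be read straight from the MKL headers (the path here assumes the default install layout):

$ grep INTEL_MKL_VERSION /opt/intel/mkl/include/mkl_version.h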

Do you have any suggestion?

Thanks

Daniel Povey

May 9, 2019, 12:28:50 PM
to kaldi-help, Kirill Katsnelson
Kirill may be able to comment,
I think the intention, at least, was to remove the threaded math options, since it's really hard to find situations where using threaded BLAS makes sense or helps.
Dan


Kirill Katsnelson

May 11, 2019, 4:31:09 PM
to kaldi-help
Well, it's possible that multithreaded MKL got broken, but the important point is that we cannot actually find a scenario where multithreaded BLAS would be helpful. The thinking goes: Kaldi as a toolkit is always used in multi-process mode, presumably running as many processes in parallel as a node can handle. In this scenario, multithreaded BLAS tends to harm performance, and it's also really unpredictable by how much.

We are planning to take the MT option out entirely. I'm wondering what your use case for it is?
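
(As an aside: even a binary linked against threaded MKL can be forced to do its math sequentially, per process, with the standard MKL/OpenMP environment variables, e.g.:

$ MKL_NUM_THREADS=1 online2-wav-nnet3-latgen-faster ...

so the multi-process behavior is recoverable at run time without rebuilding.)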

 -kkm

P-E Honnet

May 13, 2019, 10:09:57 AM
to kaldi-help
Hi,

Thanks for the answers. I guess it is not that important for my use case. One example where threaded math was making decoding a bit faster is online2-wav-nnet3-latgen-faster. With the old code, it would use multiple cores and get the decoding done faster, though far from X times faster, where X is the number of CPU cores used. Comparing the time it takes to decode, for example, LibriSpeech test-clean with both builds (the acoustic model is a TDNN trained on LibriSpeech), I get these timings:
# Threaded math
Timing stats: real-time factor for offline decoding was 0.043807 = 852.155 seconds / 19452.5 seconds.
real    14m23.431s
user    600m23.021s
sys     33m0.198s

# MKL - Sequential
Timing stats: real-time factor for offline decoding was 0.102773 = 1999.18 seconds / 19452.5 seconds.
real    33m32.943s
user    33m15.195s
sys     0m14.111s

In the multithreaded case, about 44 cores are used, so in the end it looks much less efficient than the sequential version if you split the work beforehand, which I guess goes in the direction of your comments.
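
To put numbers on that, taking the figures from the time output above:

threaded:   ~14.4 min wall-clock, ~600 min CPU (user)
sequential: ~33.5 min wall-clock,  ~33 min CPU (user)

so the threaded build burns roughly 600/33 ≈ 18x the CPU time of the sequential build, for only a ~2.3x wall-clock speedup.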

What value would you suggest for the --mkl-threading option? Note that with my version of MKL, only sequential (the default) and gomp seem to allow the configure script to work. I don't know what gomp is; sequential is explicit from the name.

Thanks,
PE

Kirill Katsnelson

May 13, 2019, 11:10:53 AM
to kaldi-help
The normal Kaldi experimentation way is to increase the number of jobs. With 44 cores and a large set like LibriSpeech, you can use e.g. --nj 100 for decoding, and all your cores will be loaded 100% with sequential jobs.
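
For example, something along these lines (the model and data paths are only illustrative):

$ steps/nnet3/decode.sh --nj 100 --cmd run.pl \
    exp/chain/tdnn/graph data/test_clean exp/chain/tdnn/decode_test_clean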

An individual decode may benefit from parallel computation.

 -kkm

Kirill Katsnelson

May 13, 2019, 1:13:41 PM
to kaldi-help
Interesting, this works for me:

$ ./configure --threaded-math
Configuring KALDI to use MKL.
Backing up kaldi.mk to kaldi.mk.bak ...
Checking compiler g++ ...
Checking OpenFst library in /home/kkm/work/kaldi2/tools/openfst-1.6.9 ...
Checking cub library in /home/kkm/work/kaldi2/tools/cub-1.8.0 ...

Doing OS specific configurations ...
On Linux: Checking for linear algebra header files ...
Configuring MKL library directory: Found: /opt/intel/mkl/lib/intel64
MKL configured with threading: iomp, libs: -L/opt/intel/mkl/lib/intel64 -Wl,-rpath=/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64  -lmkl_core  -lmkl_intel_thread
MKL include directory configured as: /opt/intel/mkl/include
Configuring MKL threading as iomp
MKL threading libraries configured as -L/opt/intel/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -Wl,-rpath=/opt/intel/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -liomp5  -ldl -lpthread -lm
Using Intel MKL as the linear algebra library.
Intel(R) Math Kernel Library Version 2019.0.2 Product Build 20190118 for Intel(R) 64 architecture applications
Successfully configured for Linux with MKL libs from /opt/intel/mkl

And there was not really any change in this part of the script.

To answer your other question, gomp should work as well as any OpenMP implementation, at least in theory... (gomp is GNU OpenMP, the GCC runtime; iomp is Intel's.)
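
So if iomp is what fails on your setup, configuring with gomp should be a reasonable fallback, e.g.:

$ ./configure --shared --use-cuda=yes --mkl-root=/opt/intel/mkl --mkl-threading=gomp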

 -kkm

P-E Honnet

May 14, 2019, 4:03:14 AM
to kaldi-help
Hi,

Thanks for the answer. Actually, if I try like you did and provide only the --threaded-math option, it works (and still finds MKL where it is). It seems to fail only when both the --threaded-math and --mkl-root options are given.

PE

Kirill Katsnelson

May 14, 2019, 12:11:32 PM
to kaldi-help
An interesting find, thanks. I'll check why that could be.

I am still trying to understand your use case, though. I want to be sure that removing the threaded option would not be a big deal for the users, and whether we should reconsider removing it. In general, we found that there is no performance benefit to it. The thing is, threaded math helps a ton when you are e.g. inverting a 50Kx50K square matrix, but the smaller the task gets, the more overhead the threading brings. In addition, other libraries are not very stable with it (OpenBLAS is not entirely bug-free even in single-threaded use), but let's focus on MKL for a moment. Can you please explain when it really provides a performance benefit in your case? I've seen the timings you provided, but they alone do not help, because I do not know what the load was. You have 44 cores; did you do at least 44 decodes in parallel (or 88 if hyperthreaded)? The second of your timing samples looks like a 1-CPU load (real time ≈ CPU time). If that's the case, then parallelizing Kaldi on the process level would be a more efficient option. But again, I do not have full information here.

So, what do you think could be the reason for a user (meaning both yourself and in general) to use and benefit from the threaded math option?

 -kkm

P-E Honnet

May 14, 2019, 12:36:49 PM
to kaldi-help
Sorry if it wasn't clear enough; the test I did was not as smart as you imagine, I guess. I simply ran online2-wav-nnet3-latgen-faster with mostly default options, on one wav.scp, so not using the decode.sh scripts (the rough shape of the command is sketched below):
- In one case, the process used all the cores it saw (~4400% CPU usage, fluctuating).
- In the second case, I ran exactly the same command, but with Kaldi compiled against sequential MKL; in that case, as expected, only one core was used (~100% CPU usage).
Of course, in a more realistic scenario, as you mentioned in your first comment, I would split the work before feeding it to the decoder. In that case, the sequential build would be much more efficient. This is shown by the ~33 min vs ~600 min of user time from the time command.
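
For reference, the command was roughly of this shape (the model, graph, and data names here are illustrative, not my exact setup):

$ online2-wav-nnet3-latgen-faster --online=true --config=conf/online.conf \
    final.mdl HCLG.fst ark:spk2utt scp:wav.scp 'ark:|gzip -c > lat.1.gz'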

To answer your question, there is no specific reason for me to want to use the threaded math option. I had just observed different behavior and also had the compilation issues. But as we discussed, with job splitting, the non-threaded option seems far more efficient (on the tasks I tested).

On a side note, I am wondering what one would have to do to build a dockerized version of Kaldi with MKL. Using the Dockerfile from Kaldi (misc/docker/ubuntu/Dockerfile), a few modifications are needed (unzip is missing, and extras/install_mkl.sh has to be run before check_dependencies.sh), and then it works, but MKL is quite a heavy thing to put in a Docker image (and maybe not recommended?).
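
For reference, a minimal sketch of those changes (the /opt/kaldi path is illustrative; use whatever the base Dockerfile checks out to):

# 1. Add unzip to the apt-get install list in the Dockerfile.
# 2. Install MKL before the dependency check:
RUN cd /opt/kaldi/tools && \
    extras/install_mkl.sh && \
    extras/check_dependencies.sh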

PE

Kirill Katsnelson

May 15, 2019, 10:49:48 PM
to kaldi-help
Thanks for the feedback! It might make sense to run threaded math for some uses of Kaldi as a library, but certainly not for Kaldi as a toolkit, where we parallelize everything into processes. And when someone uses Kaldi as a library, ./configure is mostly irrelevant anyway. We are using Kaldi decoding code in production, and it's not as simple as just building it and using it; please search this group if interested, as I explained a couple of months ago what is involved. One interesting fact is that sometimes multithreading actually harms performance; probably the shorter the utterance, the bigger the impact. And since our production use is heavily multithreaded, real-time processing, there is no sense in parallel math there. Also, its impact (positive or negative) is inconsistent between machines. So even speaking of incorporating Kaldi into a first-class production server, there is no clear-cut answer: there is clearly no point in online use, when data is processed as it comes.

 -kkm