Understanding linking, mkl, and pykaldi


David van Leeuwen

Oct 22, 2019, 10:12:13 AM
to kaldi-help
Hello, 

For a while now Intel's MKL has been free to use, and Kaldi compiles and runs well with it, and we all save the planet a little by using less energy; that is great!

However, we use the pykaldi wrapper in deployment, and so far I haven't been able to get that going with MKL -- I've spent several days now trying all kinds of configurations and debugging, but I am not making any progress. I basically don't understand the dynamic linking process well enough (or at all).

I am turning to this list because maybe there are people here that understand the intricacies of dynamic linking and python and perhaps even MKL. 

A simple pykaldi test 
```
LD_DEBUG=files python -m unittest tests/matrix/matrix-test.py  2>debug
```
fails with 
```
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
```
I don't understand this, because those exact files _can_ in fact be found: their directories are in ld.so.conf and in standard locations. But the `LD_DEBUG=files` output gives some more information; near the end there are the lines
```
     29453:     file=/opt/intel/mkl/lib/intel64/libmkl_avx2.so [0];  dynamically loaded by /opt/intel/mkl/lib/intel64/libmkl_core.so [0]
     29453:     file=/opt/intel/mkl/lib/intel64/libmkl_avx2.so [0];  generating link map
     29453:       dynamic: 0x00007f8167420aa0  base: 0x00007f81638f4000   size: 0x0000000003b41a88
     29453:         entry: 0x00007f81639e7700  phdr: 0x00007f81638f4040  phnum:                  7

     29453:     /opt/intel/mkl/lib/intel64/libmkl_avx2.so: error: symbol lookup error: undefined symbol: mkl_sparse_optimize_bsr_trsm_i8 (fatal)

     29453:     file=/opt/intel/mkl/lib/intel64/libmkl_avx2.so [0];  destroying link map
```
The symbol `mkl_sparse_optimize_bsr_trsm_i8` is defined in the threading libraries such as `libmkl_intel_thread.so`, which is used when compiling Kaldi and later when linking pykaldi, via the `kaldi.mk` variable `LD_LIBS`, which in our case evaluates to
```
/home/david/src/pykaldi/tools/kaldi/tools/openfst/lib/libfst.so -L/opt/intel/mkl/lib/intel64 -Wl,-rpath=/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -L/opt/intel/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -Wl,-rpath=/opt/intel/compilers_and_libraries_2019.2.187/linux/compiler/lib/intel64_lin -liomp5 -ldl -lpthread -lm -lm -lpthread -ldl
```

I've tried all kinds of LD_PRELOAD and LD_LIBRARY_PATH settings to help the dynamic linker, but then I end up in an endless chain of more undefined symbols. Just more proof that I don't understand linking.

Would anyone know how to proceed?

Thanks, 

--david

Daniel Povey

Oct 22, 2019, 3:13:24 PM
to kaldi-help
A search for `undefined symbol: mkl_sparse_optimize_bsr_trsm_i8` leads here
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/748309
(someone is saying it's working for them on one system but not another)..
the intel people suggest something with LD_PRELOAD that might resolve the problem.


Daniel Povey

Oct 22, 2019, 3:36:38 PM
to kaldi-help
... although I suspect that if you include the MKL BLAS libraries in that, it might mess up numpy by replacing things it's trying to get from elsewhere. You could try doing the LD_PRELOAD on just the library that it was failing to find a symbol from -- the threading one.
And make sure the threading library is actually being loaded. More output would make it clearer.
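Concretely, that suggestion would look something like the sketch below (the library path is assumed from the link line earlier in the thread; the test invocation is the one from the first post):

```shell
# Preload only the threading library, so its symbols are in the global
# scope before libmkl_core.so dlopen()s libmkl_avx2.so.
# MKL_THREAD is an assumed path; point it at your install.
MKL_THREAD=/opt/intel/mkl/lib/intel64/libmkl_intel_thread.so
if [ -e "$MKL_THREAD" ]; then
  LD_PRELOAD=$MKL_THREAD python -m unittest tests/matrix/matrix-test.py
else
  echo "adjust MKL_THREAD for your MKL install"
fi
```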

Kirill Katsnelson

Oct 22, 2019, 9:19:28 PM
to kaldi-help
This issue has a venerable history with MKL. You can find it on Intel forums dating as far back as 2008.

We build our binaries (you'll find a line in kaldi.mk) with '-lmkl_core -lmkl_intel_lp64 -lmkl_sequential'. The libmkl_core.so library is responsible for loading the best-matching libmkl_ARCH.so for the CPU you are running on (ARCH being avx2 in your case), or falling back to libmkl_def.so if no architecture-specific library is found. It should be using dlopen() for this, and I do not understand why and how exactly this fails, but fail it does.

To decode, -lmkl_core is the required main library, -lmkl_intel_lp64 specifies particular ABI conventions, and -lmkl_sequential means sequential threading model (i.e., do not attempt to do multithreaded computations of matrices). In practice, the threading part is finicky. It can perform better on one machine, and take a performance hit on another. I noticed -lmkl_intel_thread in your build library line; this is using an alternative multithreading model (called "Intel threading"). Just make sure that you are selecting the right threading model for the task, and you actually benefit from it.

One known resolution to the dlopen() fiasco is to build against a different library, libmkl_rt.so. This is a run-time dispatcher that selects the ABI and threading model at load time. It exposes a few functions and understands corresponding environment variables. So instead of "-lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread" try using just "-lmkl_rt" alone. The defaults for x64 are the same as you are using: LP64 calling conventions and Intel threading. If you would prefer to experiment with the sequential (non-)threading model, you can export MKL_THREADING_LAYER=SEQUENTIAL to your run-time environment w/o rebuilding anything.

libmkl_rt.so will load libmkl_core.so and the ABI and threading SO libraries at run time. It escapes me how exactly more dynamic loading of libraries solves the problem, but somehow it pulls the trick. I recall that taking the -lmkl_rt route is the recommended way to build numpy and scipy themselves, so I would certainly give it a try.
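Spelled out as a sketch, the swap is a link-flag change plus, optionally, two run-time variables (MKL_THREADING_LAYER is named above; MKL_INTERFACE_LAYER is its documented companion for the ABI choice):

```shell
# Link against the single run-time dispatcher instead of the triplet:
#     -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread   -->   -lmkl_rt
# The interface and threading layers can then be chosen per process,
# without rebuilding anything:
export MKL_INTERFACE_LAYER=LP64
export MKL_THREADING_LAYER=SEQUENTIAL
```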

 -kkm

David van Leeuwen

Oct 23, 2019, 4:24:45 AM
to kaldi-help
Hello,

Thanks for this elaborate answer! I clearly don't know enough of dynamic linking and the MKL woes.  


On Wednesday, October 23, 2019 at 3:19:28 AM UTC+2, Kirill Katsnelson wrote:
This issue has a venerable history with MKL. You can find it on Intel forums dating as far back as 2008.

We build our binaries (you'll find a line in kaldi.mk) with '-lmkl_core -lmkl_intel_lp64 -lmkl_sequential'. The libmkl_core.so library is responsible for loading the best-matching libmkl_ARCH.so for the CPU you are running on (ARCH being avx2 in your case), or falling back to libmkl_def.so if no architecture-specific library is found. It should be using dlopen() for this, and I do not understand why and how exactly this fails, but fail it does.

Yes, so pykaldi takes its linking arguments from `kaldi.mk` in `setup.py`, and that ends up as `-L/opt/intel/mkl/lib/intel64 -Wl,-rpath=/opt/intel/mkl/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread` (or, if I specify `--threaded-math=no` in configure, the last library is `-lmkl_sequential`). That is a slightly different order from yours (I believe order is important in linking), so I might try reversing it.

To decode, -lmkl_core is the required main library, -lmkl_intel_lp64 specifies particular ABI conventions, and -lmkl_sequential means sequential threading model (i.e., do not attempt to do multithreaded computations of matrices). In practice, the threading part is finicky. It can perform better on one machine, and take a performance hit on another. I noticed -lmkl_intel_thread in your build library line; this is using an alternative multithreading model (called "Intel threading"). Just make sure that you are selecting the right threading model for the task, and you actually benefit from it.

I tend not to select threading in the end (I happened to have this configuration at the time of the post, as the last of all the possible options I tried), because we run many decoding instances on the same machine, and threading then kind of kills performance (in my OpenBLAS experience).
 

One known resolution to the dlopen() fiasco is to build against a different library, libmkl_rt.so. This is a run-time dispatcher that selects the ABI and threading model at load time. It exposes a few functions and understands corresponding environment variables. So instead of "-lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread" try using just "-lmkl_rt" alone. The defaults for x64 are the same as you are using: LP64 calling conventions and Intel threading. If you would prefer to experiment with the sequential (non-)threading model, you can export MKL_THREADING_LAYER=SEQUENTIAL to your run-time environment w/o rebuilding anything.

libmkl_rt.so will load libmkl_core.so and the ABI and threading SO libraries at run time. It escapes me how exactly more dynamic loading of libraries solves the problem, but somehow it pulls the trick. I recall that taking the -lmkl_rt route is the recommended way to build numpy and scipy themselves, so I would certainly give it a try.

I will do that, thanks!

--david

David van Leeuwen

Oct 23, 2019, 8:55:18 AM
to kaldi-help
Hello, 


On Wednesday, October 23, 2019 at 3:19:28 AM UTC+2, Kirill Katsnelson wrote:
One known resolution to the dlopen() fiasco is to build against a different library, libmkl_rt.so. This is a run-time dispatcher that selects the ABI and threading model at load time. It exposes a few functions and understands corresponding environment variables. So instead of "-lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread" try using just "-lmkl_rt" alone. The defaults for x64 are the same as you are using: LP64 calling conventions and Intel threading. If you would prefer to experiment with the sequential (non-)threading model, you can export MKL_THREADING_LAYER=SEQUENTIAL to your run-time environment w/o rebuilding anything.

libmkl_rt.so will load libmkl_core.so and the ABI and threading SO libraries at run time. It escapes me how exactly more dynamic loading of libraries solves the problem, but somehow it pulls the trick. I recall that taking the -lmkl_rt route is the recommended way to build numpy and scipy themselves, so I would certainly give it a try.

This did the trick for me.  So now after configuring Kaldi, I can fix `kaldi.mk` using 
```
perl -i~ -pe 's/-lmkl_intel_lp64\s+-lmkl_core\s+-lmkl_sequential/-lmkl_rt/' kaldi.mk
```
and then the pykaldi MKL build _finally_ works for me. Indeed, setting `MKL_THREADING_LAYER=SEQUENTIAL` in the environment speeds things up with Kaldi (e.g., the Kaldi-multithreaded ivector extractor initialization runs faster with a BLAS-singlethreaded library).

Many thanks, again!

Kirill Katsnelson

Oct 25, 2019, 2:56:25 PM
to kaldi-help
Glad the trick worked for ya.

Make variables set on the command line override values set in the Makefile, so you can simply call 'make .... MKL_LIBS="-lmkl_rt ..." ' instead of the perl gymnastics. See https://www.gnu.org/software/make/manual/make.html#Values
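The override mechanism is easy to see with a throwaway makefile (the `MKL_LIBS` name follows the example above; the file here is purely a demo):

```shell
# A variable given on the make command line beats the assignment in the file.
printf 'MKL_LIBS = -lmkl_intel_lp64 -lmkl_core -lmkl_sequential\nshow:\n\t@echo $(MKL_LIBS)\n' > /tmp/override-demo.mk
make -f /tmp/override-demo.mk show                      # prints the triplet
make -f /tmp/override-demo.mk show MKL_LIBS="-lmkl_rt"  # prints -lmkl_rt
```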

 -kkm

David van Leeuwen

Oct 28, 2019, 5:14:16 AM
to kaldi-help
Hi, 


On Friday, October 25, 2019 at 8:56:25 PM UTC+2, Kirill Katsnelson wrote:
Glad the trick worked for ya.

Make variables set on the command line override values set in the Makefile, so you can simply call 'make .... MKL_LIBS="-lmkl_rt ..." ' instead of the perl gymnastics. See https://www.gnu.org/software/make/manual/make.html#Values

Yes, this is indeed a nicer approach. 

In my case, however, the makefile is used by both the kaldi and the pykaldi build processes, and my guess is that it is important that they use the same linking configuration. Another possibility would be to add a configuration option to ./configure, but I am even less of a config-script hacker than a makefile hacker... Although, since configure is just a hand-written shell script and not some autoconf-produced one, it should be quite doable.
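For reference, the kind of option parsing a hand-written configure script could grow might look like this (the option name and variable are hypothetical, not existing Kaldi configure flags):

```shell
# Hypothetical --mkl-libs=... option for a hand-written configure script.
parse_mkl_libs() {
  MKL_LIBS="-lmkl_intel_lp64 -lmkl_core -lmkl_sequential"  # default triplet
  for arg in "$@"; do
    case $arg in
      --mkl-libs=*) MKL_LIBS=${arg#--mkl-libs=} ;;         # user override
    esac
  done
}
parse_mkl_libs --mkl-libs="-lmkl_rt"
echo "MKL_LIBS = $MKL_LIBS"   # prints: MKL_LIBS = -lmkl_rt
```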

--david