Installation problem


sk

Apr 8, 2019, 9:55:03 AM
to kaldi-help

I'm having an installation problem.


system: Ubuntu 16.04.6 LTS


I run `cd tools; extras/check_dependencies.sh` and it reports:

```
extras/check_dependencies.sh: Intel MKL is not installed. Run extras/install_mkl.sh to install it.
 ... You can also use other matrix algebra libraries. For information, see:
 ... http://kaldi-asr.org/doc/matrixwrap.html
```


Then I run `sudo extras/install_mkl.sh` and get the following message:


```
Get:38 http://security.ubuntu.com/ubuntu xenial-security/universe DEP-11 64x64 Icons [173 kB]
Get:39 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 Packages [5,604 B]
Get:40 http://security.ubuntu.com/ubuntu xenial-security/multiverse i386 Packages [5,764 B]
Ign:41 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 DEP-11 Metadata
Ign:32 http://security.ubuntu.com/ubuntu xenial-security/main amd64 DEP-11 Metadata
Ign:33 http://security.ubuntu.com/ubuntu xenial-security/main DEP-11 64x64 Icons
Ign:37 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 DEP-11 Metadata
Ign:41 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 DEP-11 Metadata
Ign:32 http://security.ubuntu.com/ubuntu xenial-security/main amd64 DEP-11 Metadata
Err:33 http://security.ubuntu.com/ubuntu xenial-security/main DEP-11 64x64 Icons
  Could not open file /var/lib/apt/lists/partial/security.ubuntu.com_ubuntu_dists_xenial-security_main_dep11_icons-64x64.tar.gz - open (13: Permission denied) [IP: 91.189.91.23 80]
Ign:37 http://security.ubuntu.com/ubuntu xenial-security/universe amd64 DEP-11 Metadata
Ign:41 http://security.ubuntu.com/ubuntu xenial-security/multiverse amd64 DEP-11 Metadata
Ign:32 http://security.ubuntu.com/ubuntu xenial-security/main amd64 DEP-11 Metadata
Fetched 220 kB in 2s (86.8 kB/s)
Reading package lists... Done
E: Failed to fetch http://security.ubuntu.com/ubuntu/dists/xenial-security/main/dep11/icons-64x64.tar  Could not open file /var/lib/apt/lists/partial/security.ubuntu.com_ubuntu_dists_xenial-security_main_dep11_icons-64x64.tar.gz - open (13: Permission denied) [IP: 91.189.91.23 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.
extras/install_mkl.sh: MKL package intel-mkl-64bit-2019.2-057 installation FAILED.
```


I haven't faced this issue before on the same system. Has something changed?



Daniel Povey

Apr 8, 2019, 12:47:20 PM
to kaldi-help, Kirill Katsnelson
We changed the default to MKL yesterday.
You could actually just ignore the warning; it should still work.

Dan

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/d0a4e985-a59d-47c3-821a-f3b0390e224d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

sk

Apr 8, 2019, 2:55:27 PM
to kaldi-help
It does not let me make src. When I run ./configure, it does not generate kaldi.mk:

```
Configuring KALDI to use MKL
Configuring ...
Checking compiler g++ ...
Checking OpenFst library in /data/sls/temp/sameerk/tools/kaldi/tools/openfst ...
Checking cub library in /data/sls/temp/sameerk/tools/kaldi/tools/cub ...
Doing OS specific configurations ...
On Linux: Checking for linear algebra header files ...
Configuring MKL library directory: ***configure failed: MKL libraries could not be found. Please use the switch --mkl-libdir  ***
```



Daniel Povey

Apr 8, 2019, 3:58:12 PM
to kaldi-help, Kirill Katsnelson
You may have to use the switches of `configure` to make it use ATLAS, if you have that installed and can't install MKL.
Kirill, can you please prioritize getting this to work in his case, or at least making it easy to install without MKL?

Dan


Kirill Katsnelson

Apr 8, 2019, 5:04:06 PM
to kaldi-help
sk, I'm working on a fix to configure to handle ATLAS; sorry, this got inadvertently broken.

Meanwhile, the root cause here is that your apt-get configuration is apparently broken, too. The file that it tries to pull, http://security.ubuntu.com/ubuntu/dists/xenial-security/main/dep11/icons-64x64.tar, does not exist. The actual configuration of the repo specifies http://security.ubuntu.com/ubuntu/dists/xenial-security/main/dep11/icons-64x64.tar.gz, but you seem to have a stale cache.

You should fix it, as described in this answer: https://askubuntu.com/q/917603. Note that the path prefix in the error message, /var/lib/apt/lists/partial/, indicates a problem with the "partial download" cache. The shortest fix is to just clear all files in that directory. The cache is needed only to speed up the apt update phase and is safe to remove.

sudo rm /var/lib/apt/lists/partial/*

Then your next `sudo apt-get update` should complete cleanly.

This broken cache state occurs sometimes when apt crashes.
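The recovery amounts to the two commands above: `sudo rm /var/lib/apt/lists/partial/*` followed by `sudo apt-get update`. A minimal sketch of the idea, run against a throwaway directory instead of the real cache (the directory and file names here are invented for the demo, since the real fix needs root):

```shell
# Demonstrated on a temp directory standing in for /var/lib/apt/lists/partial.
cache=$(mktemp -d)                     # stand-in for the partial cache dir
touch "$cache/stale_icons.tar.gz"      # simulate a stale partial download
rm -f "$cache"/*                       # the actual fix: clear the cache
ls -A "$cache" | wc -l                 # prints 0: the cache is empty again
```

On the real system, apt then re-fetches the indexes on the next update.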

You should be able to install MKL after that, or wait for the ATLAS fix. MKL is significantly faster.

joseph.an...@gmail.com

Apr 8, 2019, 6:32:53 PM
to kaldi-help
MKL is faster only on Intel. Performance is poorer than OpenBLAS on non-Intel architectures (AMD Ryzen/Threadripper/Epyc), and it will likely not work on ARM. It would be better if the script checked the architecture and defaulted to OpenBLAS/ATLAS on non-Intel architectures.

Daniel Povey

Apr 8, 2019, 6:33:54 PM
to kaldi-help
Do you know for sure that that is still true?


joseph.an...@gmail.com

Apr 8, 2019, 7:19:32 PM
to kaldi-help
A 3-year-old MKL notice from Intel regarding optimizations on non-Intel architectures: https://software.intel.com/en-us/articles/optimization-notice#opt-en
The documentation also specifically states that it is for IA-32 and IA-64.
Performance issues on Threadripper with Anaconda NumPy compiled with MKL: https://community.amd.com/thread/228619

On Tuesday, April 9, 2019 at 4:03:54 AM UTC+5:30, Dan Povey wrote:
Do you know for sure that that is still true?


Daniel Povey

Apr 8, 2019, 7:21:10 PM
to kaldi-help, Kirill Katsnelson
OK, I am hoping Kirill can make sure OpenBLAS is easy to install and people are nudged to the right one.
I am on vacation for a week and will respond only minimally.


Kirill Katsnelson

Apr 9, 2019, 12:56:43 PM
to kaldi-help
I would appreciate it if you could share any links to studies that confirm that. I have always wondered about it myself, but I've never had a modern AMD CPU to test, and while there is a lot of hearsay, I could not find any real comparative test results.

Thanks,

 -kkm

Daniel Povey

Apr 9, 2019, 1:37:20 PM
to kaldi-help
We could probably settle this by having someone test MKL and BLAS with Kaldi on some AMD architecture.


Kirill Katsnelson

Apr 9, 2019, 1:50:57 PM
to kaldi-help
On Tuesday, April 9, 2019 at 10:37:20 AM UTC-7, Dan Povey wrote:
We could probably settle this by having someone test MKL and BLAS with Kaldi on some AMD architecture.

That would be the most useful comparison, much better than an abstract test!

 -kkm

Jeff Brower

Apr 9, 2019, 3:50:15 PM
to kaldi-help
Kirill-

Yes, that would be the way to go. One caution, however: MKL appears to rely on OpenMP for high performance. For web-scale and other high-capacity systems (e.g. telecom) that may not work, because typically in those systems individual cores handle a unified data flow. For example, a data flow might be sound (or compressed speech packet) acquisition, pre-processing (noise reduction, AGC, decoding, possibly a jitter buffer, etc.), followed by Kaldi ASR, followed by output. Maintaining that flow on one physical core (i.e. excluding its hyperthread sibling) and minimizing context switching -- or any other interaction with Linux -- is crucial to very high performance.

So I would suggest testing MKL vs. OpenBLAS both with and without OpenMP.

-Jeff

Jeff Brower

Apr 9, 2019, 3:59:20 PM
to kaldi-help

Let me correct that to say "rely on OpenMP or Intel-specific multithreading".  It does seem either or both can be turned off.

Kirill Katsnelson

Apr 9, 2019, 4:57:06 PM
to kaldi-help
We are deprecating multithreaded math anyway for the Kaldi setup as a toolkit. Users who integrate Kaldi into their own projects (me included) have all the choices; I am speaking only about the default installation with training/decoding pipelines. The rationale and the discussion are here: https://github.com/kaldi-asr/kaldi/issues/3078#issuecomment-478305746. If you can think of a reason why this could be a bad idea, please chime in here: https://github.com/kaldi-asr/kaldi/issues/3192 (not in the comments on the former issue; it's a separate ticket now).

In my understanding, the benefits of multithreaded math are greatest when you are churning really huge, 10s-of-GB-sized matrices in a single-threaded application. Kaldi is set up from the start to use many single-threaded units of work (one or more processes per UOW with shell pipelining), and there is no benefit to multithreading in this setup. Some stock recipes do use multithreading (i-vector extractor training?), but I've seen no benefit with local data; this is perhaps significant only in an HPC cluster setup.

The way MKL has always been configured by default is single-threaded; it does not use any OMP at all.
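For users who integrate Kaldi themselves and want to be explicit about this, single-threaded math can also be forced at runtime through the standard environment variables that MKL and OpenMP honor (a sketch; the `echo` is just for illustration):

```shell
# Force single-threaded math at runtime. MKL_NUM_THREADS takes precedence
# over OMP_NUM_THREADS for MKL's internal threading.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
echo "OMP=$OMP_NUM_THREADS MKL=$MKL_NUM_THREADS"   # prints: OMP=1 MKL=1
```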

By the way, the approach you describe is similar to our production pipeline (we do realtime dialogue with human callers). I would not stick to thread affinity too much, as the sound is DSP-processed and chopped by VAD on clients, and decoding is faster than RT under normal load, so there is no continuous process, and the OS thread scheduler is hopefully not switching contexts out of boredom. I do not see a number of context switches that would be high enough to make me even think of starting thinking about improving it, to give you a ballpark figure (or, rather, a non-figure :)). Since we do dialogue, we speak more than listen, most utterances we decode are short. And I've seen only a degradation of overall throughput from multithreading MKL in our setup, and it was surprisingly variable. Of course, everyone's mileage may vary.

Re hyperthreading, I always disable it on the Kaldi experimentation machines, it results in a very slight (less than 5%), but nevertheless statistically significant slowdown in training/experimentation. I've read a few analyses that confirm that effect for stable, continuous computation loads with a low scheduling competition--a Kaldi experiment pipeline is the prime example of such a load. In production HT is on, but the load is much less regular there, and we care about timely response even more than throughput.

 -kkm

Kirill Katsnelson

Apr 9, 2019, 4:59:11 PM
to kaldi-help
Yup, it comes with libiomp5, Intel's implementation of OMP, and TBB multithreading (I dunno if it's OMP-compatible). On Windows, there is an option of using Microsoft's OMP, too.

 -kkm

Kirill Katsnelson

Apr 9, 2019, 5:01:36 PM
to kaldi-help
The issue has been fixed in commit 4ae4bb096 on the master branch.

sk, let me know if you have trouble fixing your apt mishap.

 -kkm


joseph.an...@gmail.com

Apr 10, 2019, 7:30:34 AM
to kaldi-help
AWS has instances with EPYC servers: https://aws.amazon.com/ec2/amd/. Alternatively, I can test, since our clusters consist primarily of AMD HEDTs.

On Tuesday, April 9, 2019 at 11:07:20 PM UTC+5:30, Dan Povey wrote:
We could probably settle this by having someone test MKL and BLAS with Kaldi on some AMD architecture.


Jan Trmal

Apr 10, 2019, 7:38:54 AM
to kaldi...@googlegroups.com
AMD used to have the ACML library that included BLAS, but that seems long dead. It seems they provide an optimized kernel for BLIS, but I'm not sure what the API looks like. OpenBLAS would probably be the safest bet in the AMD case?
Y.


Kirill Katsnelson

Apr 14, 2019, 6:54:43 PM
to kaldi-help
Well, I do not think that MKL would really be such a suboptimal choice for AMD. Intel does not care, but they do not cripple it either. I would compare, but I've never had an AMD. And installation of MKL is a piece of cake: it just installs, on any distro, identically; we need no tricks. For a full Kaldi install as an experimenter's toolkit that performs very well, it should still be the optimal choice, given the least configuration surprise.

OpenBLAS needs to be built on the user's own platform to perform its best, and even that does not always go smoothly (the latest stable release, v0.3.5 I believe, correctly compiled for SkylakeX with AVX512 for me, but the develop branch, supposedly unstable but nevertheless ahead, downgraded me to a Haswell-targeted build by default). For those who really want to optimize the last bit of heck out of it, I think the choice and judgment should be their own. It's probably impossible to make the toolkit ultra-optimized for everyone out of the box without a tremendous maintenance effort, and we have only one each of Dan, me and you, in the end. We can shoot only for the best we can install the easiest, and MKL seems to fit the bill, as long as the platform hardware is supported (which is x64, Linux or Mac, or Windows for the most adventurous). For other platforms, OpenBLAS is probably the way to go. I'm just trying to strike the best balance between being performant and not tying our hands too much in maintaining relatively uncommon configurations. There is, of course, still a range of options, but it's not that wide.

If we could ditch ATLAS altogether, I'd be rather happy. Too much configuration trouble, and no match in the performance department at all. It's just not up to speed with modern CPU architectures any more, in my understanding (but please correct me if I'm out of date). And it's pretty capricious to build. I could not build it on a VM, for one, because it could not figure out the cache line size(!!!). To me, if you cannot figure it out, print a warning and go with a default, but do not fail the build altogether. And the tests it runs during make may take a very long time, too.

And most packaged libs do not have good performance anyway, with the exception of, once again, MKL. OpenBLAS is improving, but package maintainers are often conservative and go with the most reliable (read: the least CPU-aware) forced configuration, likely to work anywhere but not making optimal use of the available hardware.

Finally, I do understand it may be perceived as a bit harsh, but to me, the industry guy, the simplified picture is like this: if you are a student doing coursework, you are stuck with what you have, but you are not churning a 10K-hour set for your final; you'll get to it later. If you are a researcher, or a postdoc, or a grad student, shelling out $5K on a decent computer with all-supported hardware, like a genuine Intel CPU, NVidia GPU etc., is dwarfed by your cost to the company or the university.

It's quite possible I do not understand many nuanced corner cases here, but this is what I did: I just specified the configuration I'm going to have the least trouble and the best performance with. It probably paid for itself in two weeks. I am very open to changing my mind on this point; it is certain that I cannot, and quite possible that I do not, know all the nuanced situations the users end up in. But in my current understanding, the data engineer is in any case the most, and incomparably, expensive part of the whole computing rig, so getting matching hardware to save even a few hours of their work is the break-even cost point. I want to reiterate that this is based on possibly too many assumptions, and I certainly would like to understand what is different in the real world, i.e. why one would want to get the supreme (as opposed to just decent) performance of Kaldi on Ryzen when Intel CPUs are just one click away on Amazon. Please do not hear me as being offensive at all; I am just trying to understand what the real-world demand for that is. AMD does not seem to me a common platform in the HPC world.

I am totally ready to stand corrected on these assumptions, which may be too general or even outright naive, and encourage a discussion on this topic.

 -kkm

On Wednesday, April 10, 2019 at 4:38:54 AM UTC-7, Yenda wrote:
AMD used to have acml lib that included blas but that seems long dead. Seems like they provide optimized core for blis but not sure how the API looks like. Probably openblas would be safest bet in cases of AMD?
Y.

joseph.an...@gmail.com

Apr 15, 2019, 5:31:27 AM
to kaldi-help
We build our own clusters for ML work. A number of startups do. AMD processors are outselling Intel's in Europe by a good margin. It's far cheaper to build a cluster out of AMD Threadrippers than it is with Intel's i9 or server processors. The next version of Ryzen is rumoured to double the number of cores on desktop platforms for the same price. For most students who want to get into deep learning, it's cheaper to invest in a platform with a higher core count and more PCIe lanes, one that is on par with or close to Intel's per-thread performance, and spend the rest of their budget on GPUs. AMD CPUs, too, are one click away to purchase on Amazon, and for far less; the upgrade path is cost-efficient as well (the socket hasn't changed for 3 generations, and older motherboards support newer processors). Many research groups that are not that well funded (many do exist in the developing world) may find it cheaper to use AMD's CPUs over Intel's.

AFAIK MKL is crippled on non-Intel processors. libmkl_core.so, libmkl_vml_avx2.so, libmkl_vml_avx512_mic.so, libmkl_vml_avx512.so and libmkl_vml_avx.so check whether the processor they are running on is "GenuineIntel". Why would there be a check if this isn't a mechanism for an alternate code path on non-Intel processors? I'd just recommend that a simple CPUID (/proc/cpuinfo) check be done and Intel MKL be used only if the vendor_id is "GenuineIntel"; else default to OpenBLAS. Alternatively, just use OpenBLAS by default and recommend Intel MKL if the vendor_id is "GenuineIntel".
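A sketch of that vendor check (the `pick_mathlib` helper name is hypothetical and not part of Kaldi's configure; on Linux the vendor string would come from /proc/cpuinfo as shown in the comment):

```shell
# Hypothetical helper: choose a default math library from the CPU vendor string.
pick_mathlib() {
  if [ "$1" = "GenuineIntel" ]; then
    echo MKL          # Intel CPU: MKL is a fast, safe default
  else
    echo OpenBLAS     # anything else (e.g. AuthenticAMD): fall back to OpenBLAS
  fi
}

# On Linux the vendor string could be read (assuming /proc exists) as:
#   vendor=$(grep -m1 '^vendor_id' /proc/cpuinfo | awk '{print $3}')
pick_mathlib GenuineIntel    # prints MKL
pick_mathlib AuthenticAMD    # prints OpenBLAS
```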

Anand

Kirill Katsnelson

Apr 15, 2019, 8:55:18 AM
to kaldi-help
A cluster is an entirely different story. You build a cluster, you understand the difference between architectures, etc. We are just talking about the default default, for a user who is far from your technical level, someone who runs configure, then make, then starts using Kaldi. This is the kind of user for whom we should make the setup the most unproblematic. Maybe not the best performance, but the most robust. I am sure you would check which configure options work best for you before rolling the thing out on a hundred machines. You'll perhaps optimize it with different options and different compilers and run performance tests. At your level of expertise, it's simple enough in principle (configure --mathlib=openblas is hopefully the only thing you'll need to do). There is a much higher chance that something will not go right, like the openblas build incorrectly detecting your CPU, or bugs, especially with the newer AMD CPUs, which are still not the mainstay of HPC. But considering the type of work you are doing, the investment makes sense. A cluster equipped with a maintenance engineer is an entirely different thing than a lone ML scientist.

But I'll certainly note your point that AMD is gaining some use in this sector, too. Interesting. Perhaps it makes sense to mention the openblas option (provided we finally beat the thing into compiling with Kaldi more or less reliably) if the host CPU is AMD.

Tangentially, I would not enthuse too much about core count; higher-end Intel CPUs perform approximately the same in the 14-18 core range, as they are all limited chiefly by the amount of heat you can evacuate from the chip.

> AFAIK MKL is crippled on non-Intel processors.

If it were, AMD would have sued the soul out of Intel long ago. There is a document called "Optimization Notice" somewhere on Intel's performance libraries site with some information about non-Intel architectures. AFAIK, the latest AMD CPUs have AVX2 support; can you test, maybe? There is a LAPACK performance test popular with overclockers, statically compiled with MKL; you can easily find and download it. It may not perform as well (the instruction set is only part of the optimizations), but you'd probably figure out whether it runs with AVX2 or not.

> simple CPUID  (/proc/cpuinfo)

Not all systems have /proc. Say we could solve this; I'd rather write and compile a simple probe to detect it, but this is still something that would take a non-zero amount of my time. Then, do we check on the machine Kaldi is built on, or on the machine Kaldi is targeted for? You see, it's easy to say... There is just no one-size-fits-all configuration.

> For many research groups that may not be that well funded (many do exist in the developing world) they may find it cheaper to use AMD's CPUs over Intel's.

That's fair, too, agreed.

I'm still leaning towards this line of thinking: assuming you have an x64-type CPU and want the least problematic and well-performing setup, go with MKL as the default. If you have time to play with compiler switches and different libraries, and are technically versed enough, try to optimize it better for your CPU. Besides, the less time we spend supporting the build system, the better the rest of Kaldi is going to be: there are only so many of us.

But probably we should really print advice on a possible performance improvement (assuming there *is* a performance improvement, which so far we only suppose, actually). Maybe you'd volunteer some realistic tests? For example, i-vector extractor training is a matrix-algebra-dominated process, with a good mixture of basic and advanced operations (I can tell when the AVX512 units kick in by the fan sound :)). An M.2 or SSD drive would perhaps take other possible variables out, if you have one.

 -kkm

Kirill Katsnelson

Apr 15, 2019, 10:33:06 AM
to kaldi-help

I assume you are a software engineer. Imagine you were given the task to "cripple" your product in a specific environment, and how you would feel about your job. This AMD vs Intel thing gets as hot as the Linux vs Windows debacle, or 49ers vs Seattle Seahawks if you want, maybe even more dehumanizing (or maybe not; I do not normally mill around handegg fans). But too much uncool stuff hits this CPU cooling fan. When I hear these "crippled" claims, I take them with all the skepticism I am capable of. And I'm a scientist; I am very good at it.

Getting real perf numbers would be very interesting, indeed.

 -kkm

joseph.an...@gmail.com

Apr 15, 2019, 2:37:29 PM
to kaldi-help
Intel specifically mentions that MKL is meant for Intel processors. Intel has no obligation whatsoever to support non-Intel platforms. OpenBLAS has better performance across non-Intel x86_64, ARM and PowerPC.

Making MKL the default is a lot akin to making CUDA the default, which Kaldi currently does not do.

As a software engineer, I have had a quick look at the .so files that are part of MKL. The files I mentioned in my earlier reply specifically have checks for "GenuineIntel". Why would these be there if MKL does not have separate code paths for Intel vs non-Intel? Intel is trying to be nice here. It cannot keep checking what is/isn't supported on non-Intel platforms, so it takes the safe bet of supporting at most SSE2. MKL crashing on non-Intel x86_64 platforms would be a bigger PR nightmare than it performing poorly on them.

Jeff Brower

Apr 15, 2019, 5:40:34 PM
to kaldi-help
Kirill, Anand, Peter, thanks very much for your testing and deep look into this.

I have a question about Atom CPUs -- do you think MKL is the best choice there? We are building edge products that incorporate Kaldi, for example a 2nd-level wake word to provide home-assistant privacy from a separate device (which is very small and low power, no fan).

If that's an unknown, no problem. I don't mean to take up anyone's time; we'll have some test results in a month or so anyway. I just thought I'd ask.

-Jeff

Kirill Katsnelson

Apr 16, 2019, 1:18:48 AM
to kaldi-help
Jeff, I do not think the question is even answerable as stated. There are just too many variables. As it happens with Intel, the marketing names of the chips mean next to nothing. If I am to believe this list, there have been at least a whopping 29 different microarchitectures all going under the name "Atom" (same with i3, i5, Xeon; you see the point). I have an Atom palm-sized box that I use as a telephony server at home, but I have no idea which exact Atom that Atom is.

The second reason is that the definition of the "best" matrix library depends on what exactly you are doing. My feeling is that since no advanced vector units exist in these CPUs (the AV*s are power-hoggers and take a lot of the chips' real estate), there won't be a big difference between MKL and OpenBLAS. Speaking of general matrix multiplication, code generated by a decent compiler from two nested loops performs about the same as sgemm (again, take that with a grain of salt: this was in 2014, IIRC, the matrices were no larger than 1Kx1K, the compiler knew all there was to know about AVX, and so on). As your process is likely decoding, I would not expect much difference between libraries if we are talking about a CPU with no AV* units. Also, depending on the relative sizes of the AM and the decoding HCLG, and parameters such as the beam width, the matrix library is only one of the perf points to keep track of. I've seen ratios of time spent in AM computation vs. lattice decoding ranging from 1:4 to 4:1 in profiling, to give you a range of possible figures.

Since you really care about the performance of the whole system, just experiment, and experiment a lot. I'd focus on the results reported by profiling tools and start by reaching for the lowest-hanging fruit. Then, I think I mentioned this before: the best overall techniques that do not require manual fiddling with the code are link-time codegen and PGO. The compilers that really shine at these are those from Microsoft and Intel (but the former won't make Linux code; the latter is commercial only, but since you're a business, it's likely not out of consideration). Also, icl is slow as a snail. I could not get matching perf improvements from either gcc or clang, but that was not recent (2013-14, IIRC). g++ 8 seems to have had its optimizer updated significantly; I have not had a chance (neither a need nor the time) to try to get the best out of it. The learning curve for these techniques is not too steep, but they take time to make sense of.

You are significantly constrained by hardware, in the sense that e.g. a 95% maximum response time is a hard-ish requirement, consuming less power within that limit is the next, softer one, you can spend only so much time on the whole optimization work, and so on. My feeling is you are in for a few rounds of experimenting with different compilers, getting into advanced building/profiling techniques, tweaking the hot code path, etc.; the math library is only one variable of many. Engineering is chiefly the art of finding the best compromise, and in your setting the task of working out that compromise is going to be a challenge, an education opportunity, and a lot of fun.

TL;DR: do not focus too much on just one moving part.

 -kkm

Jeff Brower

Apr 16, 2019, 12:26:44 PM
to kaldi-help
Kirill-

Yes, I agree, very good advice. From 2016 through now we have run OpenCV on various Atom CPUs and seen a wide variation in performance. Probably it would be something like an x5-E3930 (which is used in AWS' DeepLens). And good point on decoding. We'll have to run tests, as you say.

-Jeff