Ivector extraction during chain nnet training in librispeech

sandeep cb

Jun 25, 2018, 12:32:34 PM

to kaldi-help

Hi,

I am training a chain nnet3 model using the librispeech recipe.

I am getting this error during ivector extraction.

The log is attached below.

# ivector-extract-online2 --config=exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/conf/ivector_extractor.conf ark:exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/train_clean_all_sp_hires_comb_max2/split100/100/spk2utt scp:exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/train_clean_all_sp_hires_comb_max2/split100/100/feats.scp ark:- | copy-feats --compress=true ark:- ark,scp:/media/gnani/816f6106-fd95-4561-9b6f-7e2e5f3b43c8/kaldi/egs/librispeech/s5/exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/ivector_online.100.ark,/media/gnani/816f6106-fd95-4561-9b6f-7e2e5f3b43c8/kaldi/egs/librispeech/s5/exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/ivector_online.100.scp

# Started at Mon Jun 25 21:37:59 IST 2018

#

ivector-extract-online2 --config=exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/conf/ivector_extractor.conf ark:exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/train_clean_all_sp_hires_comb_max2/split100/100/spk2utt scp:exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/train_clean_all_sp_hires_comb_max2/split100/100/feats.scp ark:-

copy-feats --compress=true ark:- ark,scp:/media/gnani/816f6106-fd95-4561-9b6f-7e2e5f3b43c8/kaldi/egs/librispeech/s5/exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/ivector_online.100.ark,/media/gnani/816f6106-fd95-4561-9b6f-7e2e5f3b43c8/kaldi/egs/librispeech/s5/exp/nnet3_cleaned/ivectors_train_clean_all_sp_hires_comb/ivector_online.100.scp

LOG (ivector-extract-online2[5.4.190~1-d16ef]:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor

LOG (ivector-extract-online2[5.4.190~1-d16ef]:ComputeDerivedVars():ivector-extractor.cc:204) Done.

WARNING (ivector-extract-online2[5.4.190~1-d16ef]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 8.06444 > 0.235394. Will do an exact optimization.

LOG (ivector-extract-online2[5.4.190~1-d16ef]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 1 eigenvalues.

WARNING (ivector-extract-online2[5.4.190~1-d16ef]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 12.9589 > 12.5952. Will do an exact optimization.

ASSERTION_FAILED (ivector-extract-online2[5.4.190~1-d16ef]:SymPosSemiDefEig():sp-matrix.cc:62) : '-min <= tolerance * max'

[ Stack-Trace: ]

ivector-extract-online2() [0x92d062]

kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)

kaldi::MessageLogger::~MessageLogger()

kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)

kaldi::SpMatrix<double>::SymPosSemiDefEig(kaldi::VectorBase<double>*, kaldi::MatrixBase<double>*, double) const

double kaldi::SolveQuadraticProblem<double>(kaldi::SpMatrix<double> const&, kaldi::VectorBase<double> const&, kaldi::SolverOptions const&, kaldi::VectorBase<double>*)

int kaldi::LinearCgd<double>(kaldi::LinearCgdOptions const&, kaldi::SpMatrix<double> const&, kaldi::VectorBase<double> const&, kaldi::VectorBase<double>*)

kaldi::OnlineIvectorEstimationStats::GetIvector(int, kaldi::VectorBase<double>*) const

kaldi::OnlineIvectorFeature::UpdateStatsUntilFrame(int)

kaldi::OnlineIvectorFeature::GetFrame(int, kaldi::VectorBase<float>*)

main

__libc_start_main

_start

LOG (copy-feats[5.4.190~1-d16ef]:main():copy-feats.cc:143) Copied 0 feature matrices.

# Accounting: time=1 threads=1

# Ended (code 1) at Mon Jun 25 21:38:00 IST 2018, elapsed time 1 seconds

Thanks,

Sandeep

Daniel Povey

Jun 25, 2018, 1:17:15 PM

to kaldi-help

Most likely it's just an issue due to roundoff that could be fixed by
increasing the default tolerance to 0.01. You can see if that helps.
If it does I'll change the default.

Dan


jolin

Aug 17, 2018, 10:56:08 AM

to kaldi-help

Hi Dan,

I have run into the same problem. Could you please say where the tolerance parameter can be found? Thanks.


Daniel Povey

Aug 17, 2018, 2:00:54 PM

to kaldi-help

It's failing at the line
sp-matrix.cc:62
if you start there you can figure it out. But at this point I don't
even know if that would help. If you get the program in gdb by doing
gdb --args program args..
(gdb) r
and when it fails, do
(gdb) up
until you get to the right stack frame, and can do
(gdb) p min
(gdb) p max
(gdb) p tolerance

that will tell me more.


jolin

Aug 17, 2018, 4:10:54 PM

to kaldi-help

Hi Dan,

Thanks. Here is the log.

#5 0x00002aaaabff840c in kaldi::SpMatrix<double>::SymPosSemiDefEig (this=0x2aaae2a03310, s=0x2aaae2a02d30, P=0x2aaae2a02d40, tolerance=0.001) from /home/XX/.conda/envs/ASR/bin/../lib/libkaldi-matrix.so

(gdb) p min

$1 = -0.1319345335028374

(gdb) p max

$2 = 34.312569842315554

(gdb) p tolerance

$3 = 0.001

(gdb)
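Plugging these numbers into the asserted condition shows how far outside the tolerance they are. The snippet below is only a sketch in Python (not Kaldi code); `psd_check` and `min_tolerance_needed` are hypothetical helper names, and the condition mirrors the `-min <= tolerance * max` expression from the assertion message.

```python
# Values copied from the gdb session above.
MIN_EIG = -0.1319345335028374
MAX_EIG = 34.312569842315554
TOLERANCE = 0.001


def psd_check(min_eig, max_eig, tolerance):
    """Return True when the asserted condition -min <= tolerance * max holds."""
    return -min_eig <= tolerance * max_eig


def min_tolerance_needed(min_eig, max_eig):
    """Smallest tolerance for which the condition would hold (assumes max_eig > 0)."""
    return max(0.0, -min_eig / max_eig)


if __name__ == "__main__":
    print("condition holds:", psd_check(MIN_EIG, MAX_EIG, TOLERANCE))
    print("tolerance needed: %.5f" % min_tolerance_needed(MIN_EIG, MAX_EIG))
```

With these values the check fails at 0.001, and the ratio -min/max is about 0.0038, so this particular matrix would pass with a tolerance of 0.01.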


jolin

Aug 17, 2018, 5:16:39 PM

to kaldi-help

It still does not work, even with the tolerance set to 0.1.


jolin

Aug 17, 2018, 5:17:54 PM

to kaldi-help

#5 0x00000000008a11dc in kaldi::SpMatrix<double>::SymPosSemiDefEig (this=0x7fffffffc9c8, s=0x7fffffffb800, P=0x7fffffffb810, tolerance=0.10000000000000001) at sp-matrix.cc:62

62 KALDI_ASSERT(-min <= tolerance * max);

(gdb) p min

$1 = -170.21687460999905

(gdb) p max

$2 = 399.90949319676514


Daniel Povey

Aug 17, 2018, 5:19:41 PM

to kaldi-help

It may be a compilation issue, maybe you didn't recompile the parts
you needed to.
In any case, it's in double and that difference is too large to be
caused by roundoff. It makes me think that there could be another
problem. Try to create an archive that would enable me to reproduce
this myself (i.e. containing the command line I'd need to run, plus
the files that would allow me to run it-- and test it yourself that it
works, please).
Also if you could send one of the logs from when you trained the
i-vector extractor it would be helpful.

Dan
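To put "too large to be caused by roundoff" in perspective, the sketch below (Python, an illustration rather than anything from Kaldi) compares the relative size of the negative eigenvalue reported with tolerance 0.1 against double-precision machine epsilon. It assumes, as a rule of thumb, that roundoff in a double-precision eigenvalue computation is a small multiple of machine epsilon relative to the largest eigenvalue. The helper name `relative_violation` is hypothetical.

```python
import sys

# Values copied from the gdb session run with tolerance = 0.1.
MIN_EIG = -170.21687460999905
MAX_EIG = 399.90949319676514


def relative_violation(min_eig, max_eig):
    """Size of the most negative eigenvalue relative to the largest one."""
    return -min_eig / max_eig


if __name__ == "__main__":
    ratio = relative_violation(MIN_EIG, MAX_EIG)
    eps = sys.float_info.epsilon  # about 2.2e-16 for IEEE double precision
    print("relative violation: %.3f" % ratio)
    print("multiples of machine epsilon: %.1e" % (ratio / eps))
```

The ratio is roughly 0.43, about fifteen orders of magnitude above machine epsilon, which is why roundoff alone is an unlikely explanation.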


jolin

Aug 17, 2018, 6:01:02 PM

to kaldi-help

Sorry, I just remembered that the first time I compiled Kaldi with DOUBLE_PRECISION = 1, and when I recompiled Kaldi I forgot to change this option. This may be the reason (double).

Here is one log:

WARNING (ivector-extract-online2[5.4.246~3-cd27a]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 4.82691e+09 > 2.50714e+09. Will do an exact optimization.

LOG (ivector-extract-online2[5.4.246~3-cd27a]:SolveQuadraticProblem<double>():sp-matrix.cc:686) Solving quadratic problem for called-from-linearCGD: floored 91 eigenvalues.


Daniel Povey

Aug 17, 2018, 6:02:37 PM

to kaldi-help

You need to do "make clean" if you change that option, before
recompiling, because "make" does not take account of dependencies on
the Makefile itself.


Patrick Lange

Apr 3, 2019, 6:02:48 PM

to kaldi-help

I am running into a similar problem. I added a KALDI_LOG statement before the assertion but I am confused why -1.85664e-08 is thought to be bigger than 0.0859371. This is using OpenBLAS as mathlib in case it matters.

LOG (ivector-extractor-est[5.5.241~1419-3f8b6b]:SymPosSemiDefEig():sp-matrix.cc:62) -min: -2.70188e-08 tolerance * max: 0.0676479 tolerance: 0.001
LOG (ivector-extractor-est[5.5.241~1419-3f8b6b]:SolveQuadraticMatrixProblem():sp-matrix.cc:787) Solving matrix problem for M: floored 69 eigenvalues.
LOG (ivector-extractor-est[5.5.241~1419-3f8b6b]:SymPosSemiDefEig():sp-matrix.cc:62) -min: -1.85664e-08 tolerance * max: 0.0859371 tolerance: 0.001
LOG (ivector-extractor-est[5.5.241~1419-3f8b6b]:SolveQuadraticMatrixProblem():sp-matrix.cc:787) Solving matrix problem for M: floored 75 eigenvalues.
ASSERTION_FAILED (ivector-extractor-est[5.5.241~1419-3f8b6b]:SymPosSemiDefEig():sp-matrix.cc:63) Assertion failed: (-min <= tolerance * max)
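One way to sanity-check the added log lines is to parse them and re-evaluate the condition. The sketch below is Python written for this thread, not part of Kaldi; the regular expression assumes the exact message format seen in the two `sp-matrix.cc:62` log lines above, and `parse_check_line` is a hypothetical helper.

```python
import re

# Matches the custom message: "-min: X tolerance * max: Y tolerance: Z"
PATTERN = re.compile(
    r"-min:\s*(?P<neg_min>\S+)\s+tolerance \* max:\s*(?P<tol_max>\S+)"
    r"\s+tolerance:\s*(?P<tol>\S+)"
)

LOG_LINES = [
    "LOG (ivector-extractor-est[5.5.241~1419-3f8b6b]:SymPosSemiDefEig():sp-matrix.cc:62) "
    "-min: -2.70188e-08 tolerance * max: 0.0676479 tolerance: 0.001",
    "LOG (ivector-extractor-est[5.5.241~1419-3f8b6b]:SymPosSemiDefEig():sp-matrix.cc:62) "
    "-min: -1.85664e-08 tolerance * max: 0.0859371 tolerance: 0.001",
]


def parse_check_line(line):
    """Return (neg_min, tol_times_max, condition_holds), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    neg_min = float(m.group("neg_min"))
    tol_max = float(m.group("tol_max"))
    return neg_min, tol_max, neg_min <= tol_max


if __name__ == "__main__":
    for line in LOG_LINES:
        print(parse_check_line(line))
```

Both logged calls satisfy the condition (the printed `-min` is itself negative, meaning the smallest eigenvalue is slightly positive), so the printed values alone do not explain the assertion that follows.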


Daniel Povey

Apr 3, 2019, 11:03:53 PM

to kaldi-help

I can only assume you printed those values wrongly??

Show the code that results in that output, because you obviously added a line..


Patrick Lange

Apr 4, 2019, 12:52:29 AM

to kaldi-help

I added line 62 in matrix/sp-matrix.cc, shown in the snippet below. Maybe the KALDI_LOG output for the failing call never gets printed, and the one showing up in the log is actually from a previous call.

  56 template<typename Real>
  57 void SpMatrix<Real>::SymPosSemiDefEig(VectorBase<Real> *s,
  58                                       MatrixBase<Real> *P,
  59                                       Real tolerance) const {
  60   Eig(s, P);
  61   Real max = s->Max(), min = s->Min();
  62   KALDI_LOG << "-min: " << -min << " tolerance * max: " << tolerance * max << " tolerance: " << tolerance;
  63   KALDI_ASSERT(-min <= tolerance * max);
  64   s->ApplyFloor(0.0);
  65 }


Daniel Povey

Apr 4, 2019, 7:05:45 PM

to kaldi-help

I don't think it would delay like that, I have never seen such a thing. You can double check by getting it in gdb.


Patrick Lange

Apr 5, 2019, 4:02:52 PM

to kaldi-help

So I have installed a second copy of Kaldi with Atlas instead of OpenBLAS and it works. This is definitely strange.

Previously I was using the following configure settings and ran into the above-mentioned issues.

./configure --mathlib=OPENBLAS --openblas-root=../tools/OpenBLAS/install

If I use just configure without any switches it works.

./configure

Daniel Povey

Apr 5, 2019, 4:05:11 PM

to kaldi-help

Please check it in gdb, I want to find out the reason. That is:

gdb --args (program) (args)

(gdb) r

... wait till it crashes

(gdb) bt

then go "up" till you get to the right stack frame, and do:

(gdb) p min

(gdb) p tolerance

(gdb) p max

OpenBlas does some funky stuff with borrowing registers it's not supposed to use, that might interact badly with some things.


Patrick Lange

Apr 20, 2019, 5:31:02 PM

to kaldi-help

When running some decoding experiments, I ran into this assertion again.

ASSERTION_FAILED (online2-wav-nnet3-latgen-faster[5.5.241~1419-3f8b6b]:SymPosSemiDefEig():sp-matrix.cc:63) Assertion failed: (-min <= tolerance * max)

I observed that this only happens if another decoding process using the same Kaldi install and executable is running at the same time. However, I could verify that it does not happen with the ATLAS configuration. Below is some output from the core dump.

Core was generated by `online2-wav-nnet3-latgen-faster --do-endpointing=false --frames-per-chunk=20 --'.
Program terminated with signal 6, Aborted.
#0  0x00002ac1dc890207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.4.x86_64 libgcc-4.8.5-36.el7_6.1.x86_64 libgfortran-4.8.5-36.el7_6.1.x86_64 libquadmath-4.8.5-36.el7_6.1.x86_64 libstdc++-4.8.5-36.el7_6.1.x86_64
(gdb) bt
#0  0x00002ac1dc890207 in raise () from /lib64/libc.so.6
#1  0x00002ac1dc8918f8 in abort () from /lib64/libc.so.6
#2  0x0000000001157608 in kaldi::KaldiAssertFailure_ (func=0x1317300 <kaldi::SpMatrix<double>::SymPosSemiDefEig(kaldi::VectorBase<double>*, kaldi::MatrixBase<double>*, double) const::__func__> "SymPosSemiDefEig",
    file=0x13161f7 "sp-matrix.cc", line=63, cond_str=0x13166bc "-min <= tolerance * max") at kaldi-error.cc:209
#3  0x000000000111b11b in kaldi::SpMatrix<double>::SymPosSemiDefEig (this=0x5224b70, s=0x7ffd94bb59b0, P=0x7ffd94bb59c0, tolerance=0.001) at sp-matrix.cc:63
#4  0x0000000001118c9c in kaldi::SolveQuadraticProblem<double> (H=..., g=..., opts=..., x=0x5224be0) at sp-matrix.cc:676
#5  0x000000000115608f in kaldi::LinearCgd<double> (opts=..., A=..., b=..., x=0x5224be0) at optimization.cc:555
#6  0x0000000000c5047b in kaldi::OnlineIvectorEstimationStats::GetIvector (this=0x5224b58, num_cg_iters=15, ivector=0x5224be0) at ivector-extractor.cc:747
#7  0x0000000000c3396c in kaldi::OnlineIvectorFeature::UpdateStatsUntilFrame (this=0x5224b10, frame=210) at online-ivector-feature.cc:251
#8  0x0000000000c33f0a in kaldi::OnlineIvectorFeature::GetFrame (this=0x5224b10, frame=210, feat=0x7ffd94bb6b40) at online-ivector-feature.cc:314
#9  0x0000000000cf2cdb in kaldi::nnet3::DecodableNnetLoopedOnlineBase::AdvanceChunk (this=0x7ffd94bb7918) at decodable-online-looped.cc:187
#10 0x0000000000cf32e9 in kaldi::nnet3::DecodableNnetLoopedOnlineBase::EnsureFrameIsComputed (this=0x7ffd94bb7918, subsampled_frame=49) at ../nnet3/decodable-online-looped.h:97
#11 0x0000000000cf31bb in kaldi::nnet3::DecodableAmNnetLoopedOnline::LogLikelihood (this=0x7ffd94bb7918, subsampled_frame=49, index=28584) at decodable-online-looped.cc:244
#12 0x0000000000eb7fd8 in kaldi::LatticeFasterDecoderTpl<fst::ConstFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, unsigned int>, kaldi::decoder::BackpointerToken>::ProcessEmitting (this=0x7ffd94bb7a40,
    decodable=0x7ffd94bb7918) at lattice-faster-decoder.cc:766
#13 0x0000000000eb620e in kaldi::LatticeFasterDecoderTpl<fst::ConstFst<fst::ArcTpl<fst::TropicalWeightTpl<float> >, unsigned int>, kaldi::decoder::BackpointerToken>::AdvanceDecoding (this=0x7ffd94bb7a40,
    decodable=0x7ffd94bb7918, max_num_frames=-1) at lattice-faster-decoder.cc:629
#14 0x0000000000eadb03 in kaldi::LatticeFasterDecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::decoder::BackpointerToken>::AdvanceDecoding (this=0x7ffd94bb7a40, decodable=0x7ffd94bb7918,
    max_num_frames=-1) at lattice-faster-decoder.cc:602
#15 0x0000000000c48590 in kaldi::SingleUtteranceNnet3DecoderTpl<fst::Fst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > >::AdvanceDecoding (this=0x7ffd94bb7900) at online-nnet3-decoding.cc:46
#16 0x0000000000befe1c in main (argc=18, argv=0x7ffd94bb89e8) at online2-wav-nnet3-latgen-faster.cc:259
(gdb) frame 3
#3  0x000000000111b11b in kaldi::SpMatrix<double>::SymPosSemiDefEig (this=0x5224b70, s=0x7ffd94bb59b0, P=0x7ffd94bb59c0, tolerance=0.001) at sp-matrix.cc:63

63 KALDI_ASSERT(-min <= tolerance * max);

(gdb) info locals
__func__ = "SymPosSemiDefEig"
max = 423.14832878891633
min = -1.6464138295088315
(gdb) p tolerance
$1 = 0.001
(gdb)

David van Leeuwen

Jun 4, 2019, 6:21:58 AM

to kaldi-help

Hello,

I'm running into the same problem, but with a different setup, using the pykaldi wrapper, so debugging by gdb is less trivial for me...

But I can quite easily compare ATLAS with OpenBLAS. With OpenBLAS I get frequent warnings, and less frequent assertions (~10% of the warning cases), like the ones reported earlier in the thread:

```

WARNING ([5.5.200-39ae7]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 4897.81 > 4612.45. Will do an exact optimization.

LOG ([5.5.200-39ae7]:SolveQuadraticProblem<double>():sp-matrix.cc:687) Solving quadratic problem for called-from-linearCGD: floored 1 eigenvalues.

WARNING ([5.5.200-39ae7]:LinearCgd():optimization.cc:549) Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse, 394.834 > 343.911. Will do an exact optimization.

ASSERTION_FAILED ([5.5.200-39ae7]:SymPosSemiDefEig():sp-matrix.cc:62) : '-min <= tolerance * max'

```

The assertions seem quite random; they do not reproduce with the same audio files. In roughly 1 in 10 test runs of 820 audio files I even get a BLAS_ERROR (some error about memory; I can't recall the exact message, sorry).

After recompiling Kaldi with ATLAS, these errors all disappear, but execution time roughly doubles :-(

So my quick conclusion is that there may be something fishy going on with (recent?) versions of OpenBLAS...

---david

Daniel Povey

Jun 4, 2019, 10:49:29 AM

to kaldi-help

Thanks for the info. I suspect this issue will be quite hard to debug, but we can keep trying.

If there are any failures on `make test` in matrix/ when using OpenBLAS, that might be an easier way to debug though.


David van Leeuwen

Jun 4, 2019, 11:42:08 AM

to kaldi-help

I am currently bisecting openblas, the errors do not occur in v0.3.0, but they do in v0.3.6.

For v0.3.6, `make test` in matrix/ passes, so I suppose my test setup triggers something rather subtle.
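A bisection like this can be automated with `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and any other code from 1 to 127 as bad. The following Python sketch shows one hypothetical way to turn a test-run log into such an exit code. The function name, the choice of marker strings, and the idea of skipping when the log is empty are assumptions for illustration, not the procedure actually used here.

```python
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`

# Strings taken from the logs quoted in this thread.
FAILURE_MARKERS = ("ASSERTION_FAILED", "squared residual has got worse")


def classify_log(log_text):
    """Map the text of a test-run log to a `git bisect run` exit code."""
    if not log_text.strip():
        return SKIP  # nothing was logged (e.g. the build failed): cannot judge
    if any(marker in log_text for marker in FAILURE_MARKERS):
        return BAD
    return GOOD


if __name__ == "__main__":
    # Hypothetical usage: python classify_log.py path/to/decode.log
    if len(sys.argv) == 2:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
            sys.exit(classify_log(f.read()))
    print("usage: classify_log.py LOGFILE")
```

Counting the linear CGD warning as a failure is a deliberate choice in this sketch; a stricter or looser criterion would only need a different `FAILURE_MARKERS` tuple.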


Daniel Povey

Jun 4, 2019, 1:24:12 PM

to kaldi-help

Great!

Run that program with valgrind too, if that's possible, to look for memory errors (e.g. buffer overruns).

Dan


Patrick Lange

Jun 4, 2019, 2:52:10 PM

to kaldi-help

As mentioned earlier, I noticed the problems occur more frequently when running multiple processes using the same kaldi/openblas. Maybe this can be helpful.


David van Leeuwen

Jun 5, 2019, 8:28:31 AM

to kaldi-help

Hello,

So, the culprit commit in OpenBLAS, according to my analysis, is a399d004 (one year old).

- About 32/2460 utterances crash Kaldi on the `-min <= tolerance * max` assertion in `SymPosSemiDefEig():sp-matrix.cc:62`,

- About 120/2460 utterances raise a warning `Doing linear CGD in dimension 100, after 15 iterations the squared residual has got worse` in `LinearCgd():optimization.cc:549`,

- decoding performance for the uncrashed utterances is not good

- for some utterances the loglikelihoods give overflows upon math.exp (in my python wrapper)

- `make test` in Kaldi passes, though.

Interestingly, the previous commit fails `make test` in Kaldi (a segmentation fault in matrix-lib-test), but it passes my ASR test (with good decoding performance, no warnings, no overflows).

OpenBLAS v0.3.0 is the last release that passes my test and also Kaldi's `make test`.

I might raise an issue with OpenBLAS, but it won't be easy to create an MWE, since, with the many wrappers (my code, pykaldi, kaldi), I don't really know which BLAS call is the problem. The errors occur consistently in the ivector extraction code, but they appear randomly, and the culprit commit seems to be only about memory. Patrick Lange's remark also seems to indicate that the errors do not tend to reproduce and depend on running circumstances. Who knows, there may also be a dependency on the hardware architecture (which I know virtually nothing about).

On Tuesday, June 4, 2019 at 7:24:12 PM UTC+2, Dan Povey wrote:

> Great!
> Run that program with valgrind too, if that's possible, to look for memory errors (e.g. buffer overruns).

It looks like running valgrind through a python wrapper requires recompilation of python itself, I haven't gone there yet.

---david

Daniel Povey

Jun 5, 2019, 12:35:33 PM

to kaldi-help

The commit seems to be about thread-local storage. The programs that load i-vector extractors typically use multiple threads on startup, while computing certain constants. That is usually configurable on the command line though. I wonder if it has to do with that. OpenBLAS seems to have certain limits on this, e.g.:

# define MAX_ALLOCATING_THREADS MAX_CPU_NUMBER * 2 * MAX_PARALLEL_NUMBER


David van Leeuwen

Jun 8, 2019, 8:40:53 AM

to kaldi-help

All right, after a discussion on the OpenBLAS github issue page, I think I have an analysis and a solution.

- The errors occur because a single-thread compiled OpenBLAS is called from multiple concurrent threads in Kaldi, during ivector extraction preparation

- Current Kaldi uses OpenBLAS 0.3.5 (5 months old), this would make it vulnerable, hence this thread I suppose

- There is a fix in OpenBLAS around commit 86dda5c2 (three weeks old), this is not yet in a release version (next version would be 0.3.7 I suppose)

- This fix requires the OpenBLAS compile option `USE_LOCKING=1` in the Kaldi `tools/Makefile` in the line compiling openblas

So if one runs into this problem, check out the latest OpenBLAS on branch develop, and recompile with `USE_THREAD=0 USE_LOCKING=1`.
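The role of a lock in this situation can be illustrated with a toy example. The Python sketch below is not OpenBLAS or Kaldi code: `_scratch` merely stands in for library-internal state that is not safe to share between threads, and the lock serializes callers. It illustrates only the general idea of protecting shared state; how OpenBLAS implements `USE_LOCKING` internally is not shown here, and all names are hypothetical.

```python
import threading

_scratch = [0.0] * 4  # shared work buffer (stand-in for library-internal state)
_scratch_lock = threading.Lock()


def dot_shared_scratch(x, y):
    """Dot product that uses the shared buffer; the lock serializes callers."""
    with _scratch_lock:
        for i, (a, b) in enumerate(zip(x, y)):
            _scratch[i] = a * b
        return sum(_scratch[:len(x)])


def run_concurrently(num_threads=8):
    """Call dot_shared_scratch from several threads and collect the results."""
    results = [None] * num_threads
    y = [1.0, 2.0, 3.0, 4.0]

    def worker(k):
        results[k] = dot_shared_scratch([float(k)] * 4, y)

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_concurrently())
```

Without the lock, two threads could overwrite each other's entries in `_scratch` between the writes and the sum, giving results that vary from run to run, which loosely resembles the non-reproducible failures reported in this thread.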

I was actually using the pykaldi branch of Kaldi from pykaldi, which has a slightly different tools/Makefile and pulls directly from OpenBLAS develop.

For Kaldi, my advice would be to move to a new release of OpenBLAS as soon as it is released, and then use the `USE_LOCKING=1` make flag.

Cheers,

---david


Daniel Povey

Jun 8, 2019, 11:04:47 AM

to kaldi-help

Great. Do you have time to make a PR with the fix, when the new branch is released?


David van Leeuwen

Jun 9, 2019, 6:54:55 AM

to kaldi-help

Hi,

On Saturday, June 8, 2019 at 5:04:47 PM UTC+2, Dan Povey wrote:

> Great. Do you have time to make a PR with the fix, when the new branch is released?

Yes, sure, I've already made a fix available on the pykaldi branch of pykaldi/kaldi, which uses the latest OpenBLAS instead of a tagged version.

But when OpenBLAS releases a new version, I can certainly make a PR for Kaldi.

---david
