Restarting a training step for a few splits

Mael Primet

Jan 22, 2018, 7:00:16 AM
to kaldi-help
Sometimes, with very large amounts of data, I get random errors in some of the GMM training steps such as `align_fmllr.sh`. I suspect they might be due to memory, but most of the splits are computed correctly and only a few have errors.

Is it possible to run the scripts again and limit the computation to the splits that had errors? Or, more generally, if the error was due to a corrupt file, is it possible to remove those items and restart the computation for those splits?

Mael Primet

Jan 22, 2018, 7:37:15 AM
to kaldi-help
I think I found the error in the logs:

```
LOG (gmm-est-fmllr-gpost[5.2.0-915e]:main():gmm-est-fmllr-gpost.cc:141) For speaker 5afa91fce5fe4894afb48f6b94d6caae, auxf-impr from fMLLR is -nan, over 0 frames.
```


Do we know what could cause a `-nan` in the auxf-impr, and why there are 0 frames? I think I'm filtering all audio data so that each file has at least one frame, so I find it curious that it did not find more frames.

Mael Primet

Jan 22, 2018, 8:05:17 AM
to kaldi-help
After looking a bit more, it seems there are many `-nan` when the alignment does not work, but it does not seem to be linked to the error which is this:

```
ERROR (gmm-est-fmllr-gpost[5.2.0-915e]:~RandomAccessTableReader():util/kaldi-table-inl.h:2578) failure detected in destructor.

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
gmm-est-fmllr-gpost() [0x41798a]
```

The error seems to be coming from

```
template<class Holder>
RandomAccessTableReader<Holder>::~RandomAccessTableReader() {
  if (IsOpen() && !Close())  // call Close() yourself to stop this being thrown.
    KALDI_ERR << "failure detected in destructor.";
}
```

which is weird, since it seems to work for most other alignments. Is it something that can happen?

Daniel Povey

Jan 22, 2018, 3:25:20 PM
to kaldi-help

After looking a bit more, it seems there are many `-nan` when the alignment does not work

The NaNs are probably harmless; they come from speakers where none of the data aligned.
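
To see which speakers are affected, the fMLLR logs can be scanned for entries reporting zero frames. The following is a minimal sketch, not part of Kaldi: the helper name `speakers_without_frames` is hypothetical, and the pattern is modelled only on the single log line quoted earlier in this thread, so other Kaldi versions may need a different pattern.

```python
import re

# Pattern modelled on the log line quoted in this thread (an assumption:
# other Kaldi versions may word the message differently).
FMLLR_LINE = re.compile(
    r"For speaker (?P<spk>\S+), auxf-impr from fMLLR is (?P<impr>\S+), "
    r"over (?P<frames>\S+) frames"
)


def speakers_without_frames(log_text):
    """Return the speaker ids whose fMLLR statistics cover zero frames."""
    empty = []
    for line in log_text.splitlines():
        match = FMLLR_LINE.search(line)
        if match and float(match.group("frames")) == 0.0:
            empty.append(match.group("spk"))
    return empty
```

Passing the text of one fMLLR log to `speakers_without_frames` gives the speakers whose utterances could then be checked by hand, for example for very short or empty audio.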
 
, but it does not seem to be linked to the error which is this:

```
ERROR (gmm-est-fmllr-gpost[5.2.0-915e]:~RandomAccessTableReader():util/kaldi-table-inl.h:2578) failure detected in destructor.

[ Stack-Trace: ]
kaldi::MessageLogger::HandleMessage(kaldi::LogMessageEnvelope const&, char const*)
kaldi::MessageLogger::~MessageLogger()
gmm-est-fmllr-gpost() [0x41798a]
```


This most likely isn't the primary error. It probably indicates that either the disk is full or the program is writing to a broken pipe, e.g. because the program it was writing to was killed by the Linux OOM killer after you exhausted memory.
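
Both causes can be checked after the fact. The sketch below is an illustration under stated assumptions: the helper names `oom_victims` and `free_gib` are hypothetical, and the regular expression assumes the usual kernel wording ("Out of memory: Kill process ..." on older kernels, "Killed process" on newer ones), which may differ on your system.

```python
import re
import shutil

# Assumed kernel message wording; matches both "Kill process" and
# "Killed process".
OOM_LINE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def oom_victims(kernel_log_text):
    """Return (pid, process name) pairs for OOM kills found in kernel log text."""
    return [(int(pid), name) for pid, name in OOM_LINE.findall(kernel_log_text)]


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30
```

The text passed to `oom_victims` would typically be the output of `dmesg` or the contents of a kernel log file, and `free_gib` would be called on the experiment directory the failing job writes to.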

There isn't an automatic way to just rerun the parts that failed.
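
Finding out which splits failed can at least be scripted. The following is a minimal sketch, not a Kaldi tool: the function name `failed_jobs` is hypothetical, and it assumes per-job logs named `<stage>.<job number>.log` in one directory, with failures marked by the `ERROR (` prefix seen in the log quoted above.

```python
import re
from pathlib import Path


def failed_jobs(log_dir, marker="ERROR ("):
    """Return the sorted job numbers whose per-job log contains `marker`."""
    jobs = set()
    for log in Path(log_dir).glob("*.log"):
        # Assumed naming scheme: <stage>.<job number>.log
        match = re.fullmatch(r".+\.(\d+)\.log", log.name)
        if match and marker in log.read_text(errors="replace"):
            jobs.add(int(match.group(1)))
    return sorted(jobs)
```

The returned job numbers identify the splits to rerun by hand. Kaldi's job-wrapper logs typically start with the command that was run, which can serve as the starting point for the rerun.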

If you have things dying due to out-of-memory, probably the best fix is to install GridEngine so you don't run too many jobs on one machine.

Dan