perform decoding without LM


lucg...@gmail.com

Feb 21, 2019, 11:49:14 AM
to kaldi-help
Hi,

I trained the Librispeech TDNN chain model and I was curious how important the LM is in the Kaldi architecture, and also whether it's possible to decode without using an LM. I made two experiments (roughly sketched below):

* I removed the HCLG weights using fstmap --map_type=rmweight;
* I created a 1-gram LM with equal weights using the same vocabulary, and then recompiled the HCLG with it.
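
In outline, it was something like this (paths and helper scripts are illustrative, not copied verbatim from what I ran; the mkgraph options are the usual ones for a chain model):

  # experiment 1: strip all the weights out of an existing decoding graph
  fstmap --map_type=rmweight exp/chain/tdnn/graph/HCLG.fst exp/chain/tdnn/graph_noweight/HCLG.fst
  # (the other files in the graph directory, words.txt etc., are copied over unchanged)

  # experiment 2: write a uniform unigram ARPA over the vocabulary (by hand or with a small
  # script), format it into a lang directory, and recompile the graph with it
  utils/format_lm.sh data/lang data/local/lm/uniform_unigram.arpa.gz \
    data/local/dict/lexicon.txt data/lang_test_uni
  utils/mkgraph.sh --self-loop-scale 1.0 data/lang_test_uni exp/chain/tdnn exp/chain/tdnn/graph_uni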

Looking at the results on test-clean, I got a 39.11% WER (for the first experiment) and 41.24% WER (for the second experiment).  Considering that decoding with a 3-gram LM leads to 5.29% WER, the difference is huge.

What are your thoughts on this? Is there any chance of using Kaldi without an LM while keeping the accuracy at an acceptable level? Are there any plans for a fully end-to-end Kaldi recipe?

Thanks,
Lucian

Daniel Povey

Feb 21, 2019, 11:53:49 AM
to kaldi-help

> I trained the Librispeech TDNN chain model and I was curious how important the LM is in the Kaldi architecture, and also whether it's possible to decode without using an LM.

Not if you want good results.
 
> I made two experiments:
>
> * I removed the HCLG weights using fstmap --map_type=rmweight;
> * I created a 1-gram LM with equal weights using the same vocabulary, and then recompiled the HCLG with it.
>
> Looking at the results on test-clean, I got a 39.11% WER (for the first experiment) and 41.24% WER (for the second experiment). Considering that decoding with a 3-gram LM leads to 5.29% WER, the difference is huge.

> What are your thoughts on this? Is there any chance of using Kaldi without an LM while keeping the accuracy at an acceptable level?

Not really.
 
> Are there any plans for a fully end-to-end Kaldi recipe?

I can't answer that question directly, because "end-to-end" is not a well-defined label; people seem to use it to mean almost anything they want, depending on the context. But I can confirm that there are no plans to add any ASR recipes to Kaldi that don't use an LM.


Dan
 

lucg...@gmail.com

Feb 27, 2019, 6:35:16 AM
to kaldi-help
Hello,

Thanks for your answer!
Could you please point out a few reasons why a TDNN without an LM is not able to achieve the performance of an end-to-end network (let's consider an LSTM attention-based encoder-decoder)? From my understanding (I hope I'm not wrong), both of them learn the alignment between phones and acoustic features. The LM seems to me like an additional component that enforces the output to be a meaningful sentence by giving probabilities to word sequences. Is an LSTM encoder-decoder network able to substitute for the LM, learning not only the correspondence between acoustic features and phones but also the way that words compose a sentence?
I just want to understand the concepts and the difference between these approaches.

Thanks,
Lucian

Daniel Povey

Feb 27, 2019, 3:44:03 PM
to kaldi-help
There are certain approaches that are designed to implicitly learn the LM as part of the model itself.  CTC and related approaches do that.  With those models you can decode without an LM (although empirically they work better if you add an external LM).  But the models we train are designed to be used with an LM and cannot meaningfully be run without one.
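
(For reference, "adding an external LM" to CTC or attention models is usually done by shallow fusion, i.e. during beam search the scores are combined as roughly

  total_score(y | x) = log p_am(y | x) + lambda * log p_lm(y)

with the interpolation weight lambda, and often a length bonus, tuned on a dev set; the exact form varies from paper to paper.)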

Dan


Armando

Jun 18, 2019, 6:52:43 AM
to kaldi-help
Is the reason to avoid this kind of system (where a sort of word or character LM is learnt together with the acoustic model) that the Kaldi design cannot easily be modified to accommodate it, or that it's not really worth the effort?
I struggle to understand the practical usefulness of these approaches. Isn't it better to have a modular approach, so acoustic models can be re-used with different language models? And how do you adapt an LM to a different domain and then use it in a system where it's supposed to be estimated together with the acoustic model?
In principle, I find it much more convenient to have these components estimated and optimized separately; but those approaches are apparently used and referred to a lot in the literature nowadays.
Does anyone here have an opinion on that?

Daniel Povey

Jun 18, 2019, 10:48:25 AM
to kaldi-help

Yes, I 100% agree with you: the standard approach of estimating the acoustic model and the language model separately is way more convenient.
I think the reason there has been so much literature on things like CTC lately is that it's part of a craze to do everything inside the neural network, so you can say the system is "end to end". (However, in practice people often do use a language model with those types of systems.)

I haven't included things like that in Kaldi because I don't see a practical advantage over standard approaches. I did actually start experimenting with CTC at one point, but after various improvements, extensions and simplifications I ended up with LF-MMI. You can still find the occasional place in the LF-MMI code or scripts that says "CTC".
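
(If you're curious, a quick way to spot those leftovers in a Kaldi checkout is something like

  grep -ril ctc egs/ src/ | head

though of course some of the hits will be unrelated.)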

Dan



Ognjen Todic

Jun 19, 2019, 3:55:16 AM
to kaldi-help
A few weeks ago I attended a talk about end-to-end ASR R&D; the talk was given by Rohit Prabhavalkar from Google. A couple of interesting/relevant tidbits:

- most experiments Rohit was referring to were done using 12,500h of speech for training; they saw a significant boost in performance when going from 2,000h to 12,500h

- performance on out-of-domain data is significantly degraded; their approach is to just add relevant speech data to training (I guess that's reasonable/feasible at Google's scale)

- some errors the system makes are quite interesting; e.g. "one dollar twenty-two cents" gets recognized as "one dollar seven cents", because of sparse training data for certain scenarios. I might not have gotten the example exactly right, but the bottom line is that the system can make mistakes that are "unreasonable" from an acoustic point of view; they are working on ways to deal with this

- they are doing some interesting work on "biasing" (shifting the model on the fly to work better for a specific set of phrases, etc.; e.g. leveraging user- or context-specific information)

- their 400 MB large-vocabulary dictation model that will run on the Pixel is quite interesting

Just some additional info that might be useful...

/Ogi

Daniel Povey

Jun 19, 2019, 11:00:30 AM
to kaldi-help
That is quite interesting, yes.

Something I do have a slight issue with is the widespread assumption that something that works better only when you have huge amounts of data is automatically the right thing to use.  Most people building most ASR applications *don't* have huge amounts of data, and just because there may be one or two outliers, like Google building ASR models for English, doesn't mean that those scenarios should become the default scenario that everyone should be worrying about.  Of course I'm not suggesting that you think this; just that it does occasionally seem to be assumed, especially by people who are not experts in ASR.

Dan



Rémi Francis

Jun 20, 2019, 6:37:27 AM
to kaldi-help
Do you have a link to the talk?

Armando

Jun 20, 2019, 9:08:57 AM
to kaldi-help
We do about 4-6 epochs of training, maybe even fewer for data sets of 1000h or more, and observe only degradation with more than that.
So hearing about hundreds of epochs sounds to me like numbers from image processing.

Daniel Povey

Jun 20, 2019, 12:07:01 PM
to kaldi-help
Where was hundreds of epochs mentioned?  Surely not for a dataset of thousands of hours?
A few of the reasons we use relatively few epochs in Kaldi are as follows:

  - We actually count epochs *after* augmentation, and with a system that has frame-subsampling-factor of 3 we separately train on the data shifted by -1, 0 and 1 and count that all as one epoch.  So for 3-fold augmentation and frame-subsampling-factor=3, each "epoch" actually ends up seeing the data 9 times.

  - Kaldi uses natural gradient, which has better convergence properties than regular SGD and allows you to train with larger learning rates; this might allow you to reduce the num-epochs by at least a factor of 1.5 or 2 versus what you'd use with normal SGD.

  - We do model averaging at the end-- averaging over the last few iterations of training (an iteration is an interval of usually a couple of minutes' training time).  This allows us to use relatively large learning rates at the end without worrying too much about the added noise, which further decreases the training time.  This wouldn't work without the natural gradient; the natural gradient stops the model from moving too far in the more important directions in parameter space.

  - We start with alignments learned from a GMM system, so the nnet doesn't have to do all the work of figuring out the alignments-- i.e. it's not training from a completely uninformed start.

So supposing we say we are using 5 epochs, we are really seeing the data more like 50 times, and if we didn't have those tricks (NG, model averaging) that might have to be more like 100 or 150 epochs, and without knowing the alignments, maybe 200 or 300 epochs.  Also it's likely that attention-based models take longer to train than the more standard models that we use.
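
(To spell out the arithmetic: 3-fold augmentation times 3 frame shifts is 9 passes per nominal epoch, so 5 nominal epochs is about 5 * 9 = 45, i.e. roughly 50, real passes over the original data.)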

Dan




Ognjen Todic

Jun 20, 2019, 3:55:00 PM
to kaldi-help
Rémi: I don't think the talk was recorded. It was part of the Hearing Seminar series at CCRMA/Stanford, organized by Malcolm Slaney. You could double-check with Malcolm (https://music.stanford.edu/people/malcolm-slaney), though.

Dan: I agree with your comments/views on this topic.

/Ogi




Armando

Jun 20, 2019, 4:03:13 PM
to kaldi-help
Oh, my reply was meant to be in the thread about SpecAugment, referring to Google's setup on Librispeech.