[LONG] Was there any research on reasons why left-biphone models beat triphone models for DNN?


Kirill Katsnelson

Sep 15, 2019, 11:22:47 PM
to kaldi-help
As far as I understand, it was a serendipitous discovery that better results are obtained with (N=2, P=2; left-context biphone) CD phone contexts than with (N=3, P=2; "traditional" triphone) ones in TDNN models. This seems to be a pretty well-established empirical fact by now, and it has bothered me for a while. Let me explain; I'll be focusing exclusively on English here.

Context clustering has been with us for as long as triphones have, as a solution to data scarcity. Initially, hand-marked properties used to be assigned to phones, and the trees were built by hand based on linguistic features (e.g., intervocalic /t/ surfaces as an unvoiced [ɾ] or voiced [ɽ] (a.k.a. [ᴅ], a.k.a. [dx]) flap). This all made sense, but around 1990 it was found that tree-clustering methods provide a better result than a hand-constructed or rule-constructed tree (or a combination of both); I think one of the first papers here is (Randolph 1990, doi:10.1109/ICASSP.1990.116176). It is notable that Randolph still uses linguistic features that cover a longer range than the n-phone width itself; for example, one feature considered linguistically useful is whether an unvoiced stop begins the onset of a stressed syllable, which is a well-known and strong predictor of an aspirated release.
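As a toy illustration of what such tree clustering does (a sketch, not Kaldi's actual implementation; the phone sets and sufficient statistics below are invented), each node greedily picks the set-membership question on the context phone that maximizes the Gaussian log-likelihood gain of the split:

```python
import math

# Toy sketch of likelihood-based tree clustering (NOT Kaldi's actual
# implementation; phone sets and statistics below are invented).
# Each question asks whether the context phone belongs to a set; we
# pick the question with the largest Gaussian log-likelihood gain.

def gauss_loglike(n, s, s2):
    """Max log-likelihood of n points with sum s and sum of squares s2
    under a single 1-D Gaussian fit."""
    if n == 0:
        return 0.0
    var = max(s2 / n - (s / n) ** 2, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

# sufficient stats per left-context phone: (count, sum, sum_of_squares)
stats = {
    "p": (100, 40.0, 30.0), "t": (120, 50.0, 35.0), "k": (80, 30.0, 20.0),
    "a": (200, -90.0, 60.0), "i": (150, -70.0, 45.0),
}
questions = [
    {"p", "t", "k"},   # "is the left phone an unvoiced stop?"
    {"a", "i"},        # "is the left phone a vowel?"
    {"t"},
]

def pooled(keys):
    # pool sufficient statistics over a set of contexts
    return tuple(sum(stats[k][i] for k in keys) for i in range(3))

def best_split(keys):
    base = gauss_loglike(*pooled(keys))
    best = None
    for q in questions:
        yes = [k for k in keys if k in q]
        no = [k for k in keys if k not in q]
        if not yes or not no:
            continue
        gain = gauss_loglike(*pooled(yes)) + gauss_loglike(*pooled(no)) - base
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    return best

gain, q, yes, no = best_split(list(stats))
print(f"best question {sorted(q)}: gain {gain:.1f}, split {yes} / {no}")
```

With these made-up stats the stop/vowel question wins, which is the sense in which a learned tree can rediscover a linguistic feature.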

So far, it all makes sense linguistically; speaking of /t/ alone, reading (Eddington 2007, doi:10.1515/COG.2007.002) with his quantitative approach reveals how daunting the problem of hand-assignment and/or rule induction really is. Augmenting linguistic knowledge with ML (or augmenting ML with linguistic knowledge, if you prefer) makes total sense. Everything makes sense up to this point.

I do not know at what point in time linguistic markers fell out of use in tree clustering; but the later HMM-GMM systems clustered CD contexts based only on neighboring phone identities (and, as Kaldi does, the HMM position). This is linguistically interesting; for one, the information on the phone starting the onset of a stressed syllable is partially preserved (in "tap") and partially lost (in "trap", where the /'a/ is outside of the /t/ context window).

The question that has bothered me is why discarding even more context information improves the performance of a DNN ASR stack. Some information is indeed preserved (e.g., vowels after a nasal are nearly invariably nasalized, and the phone preceding a vowel is in the left context), but the above example with ±aspiration is no longer inferable from the tree (it was somewhere in the right context, which has been discarded).

At the same time, clustering itself loses a certain amount of information. If we cluster contexts XaY and XbY together, we lose some of the distinction between the processes by which a and b surface. The information loss grows with the n-phone context size as its power. Now, this suggests an easy (and likely wrong) explanation: that the DNN's inferencing power is so great that discarding information from its training examples stifles it, and triphone clustering discards more than it retains. A quick counterexample, naturally, is that going one step further and switching from biphones to monophones (which, with the trivial tree, avoids clustering altogether and therefore discards no information) not only does not improve the performance, but makes it worse (mostly hearsay from conversations on this list; not an amazingly publishable result in itself, after all--like "huh, whad'ya expect?"). But yeah, I expected the trend to continue, as DNNs are indeed immensely powerful at both compressing information and generalizing from it--but it did not.
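For a back-of-the-envelope feel for this, the raw context inventory grows as a power of the context order, so a fixed leaf budget must pool ever more distinct contexts per leaf (the phone-set size and leaf budget below are illustrative, not from any particular recipe):

```python
# Illustrative numbers only: a rough English phone-set size and a
# typical-ish clustered-tree leaf budget (neither from a real recipe).
phones = 42
leaves = 6000

pooled_per_leaf = {}
for name, order in [("monophone", 1), ("biphone", 2), ("triphone", 3)]:
    contexts = phones ** order            # raw n-phone context inventory
    pooled_per_leaf[name] = max(contexts / leaves, 1.0)
    print(f"{name:9s}: {contexts:6d} raw contexts, "
          f"~{pooled_per_leaf[name]:.1f} contexts pooled per leaf")
```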

Another possible explanation is that linguistics studies the language in the human brain, while ASR is an entirely different craft--in other words, the expectation that linguistics would work for ASR is grounded neither in theory nor in practice. But here we can jump into the bottomless epistemological rabbit hole that makes this hypothesis "not even wrong," to my taste. We do use the linguistic concept of context in phonology, after all, and splitting linguistic knowledge into a part that works for ASR and a part that does not apply to it may be entirely arbitrary.

The question buzzed somewhere in the back of my mind for a while, and I came back to it with different experiments, but still took the well-established path of biphone clustering. Like many fellow data-witchcraft practitioners here, I'm targeting a practical result first, and anything theoretical, if it happens at all, comes only as a by-product.

My question is: was there any research done (published or not), or maybe are there some hypotheses floating around, as to why left-context biphones seem to be a sweet spot with DNN models? Or maybe I am missing something obvious that explains this phenomenon? Please speak your mind. :)

 -kkm

Nickolay Shmyrev

Sep 16, 2019, 4:36:13 AM
to kaldi-help
> Now, this suggests an easy (and likely wrong) explanation: that the DNN's inferencing power is so great that discarding information from its training examples stifles it, and triphone clustering discards more than it retains.

This is true, and exactly the reason why end-to-end systems are used in big corporations. Since end-to-end systems do not restrict themselves with context trees, they get much better accuracy on big datasets. The amount of training data is very important: to learn properly you need 10k hours of data, probably even more; the advantage is not that great on Switchboard.

A model without a context tree is shown to be superior, for example, in section 5.3 here:

End-to-end speech recognition using lattice-free MMI

Google also studied this a lot in many publications; they went back and forth between graphemes and phonemes for some time:

No need for a lexicon? Evaluating the value of pronunciation lexica in end-to-end models


Daniel Povey

Sep 16, 2019, 6:23:54 AM
to kaldi-help
I have heard from others that they got better results from triphones than from left-biphones when using very large data (even Librispeech size). I was not able to reproduce this in Kaldi, though.

> > Now, this suggests an easy (and likely wrong) explanation: that the DNN's inferencing power is so great that discarding information from its training examples stifles it, and triphone clustering discards more than it retains.
>
> This is true and exactly the reason why end-to-end systems are used in big corporations.

Just because big companies publish a lot of end-to-end papers, don't assume they are actually using those systems in production. I was speaking today with someone from a big company where they use a very standard, discriminatively trained hybrid model for production (at least, server-based).

> Since end-to-end systems do not restrict themselves with context trees, they get much better accuracy on big datasets. The amount of training data is very important: to learn properly you need 10k hours of data, probably even more; the advantage is not that great on Switchboard.

I have also heard of instances where people were able to beat Kaldi results using monophones, when the amount of data was very large (and probably with a recurrent framework). I know this may be slightly at odds with what I said above about triphones being better than biphones.

> A model without a context tree is shown to be superior, for example, in section 5.3 here:
>
> End-to-end speech recognition using lattice-free MMI
> https://www.danielpovey.com/files/2018_interspeech_end2end.pdf

In that approach we had difficulty estimating the tree because we didn't start with alignments, so IIRC Hossein was using a "full biphone tree" (each biphone, i.e. each pair of phones, is its own leaf). That may be why the tree wasn't helping.

> Google also studied this a lot in many publications; they went back and forth between graphemes and phonemes for some time:
>
> No need for a lexicon? Evaluating the value of pronunciation lexica in end-to-end models
> https://arxiv.org/pdf/1712.01864.pdf

Yes, I have heard others outside Google say that as the amount of data gets larger, the improvement from using phonemes vs. graphemes decreases, and eventually graphemes become better. (This is for English; for most non-English languages they are approximately equivalent because the spelling is nearly phonetic.)

>
>
> --
> Go to http://kaldi-asr.org/forums.html find out how to join
> ---
> You received this message because you are subscribed to the Google Groups "kaldi-help" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/349e932e-0fac-421b-b5e7-3f7ce7f3935e%40googlegroups.com.

Rudolf A. Braun

Sep 16, 2019, 11:27:27 AM
to kaldi-help
First of all, thanks for posting this; I've also been thinking about it on and off for a while, and it's nice to discuss.

I think it's important to remember when making these comparisons that the topology also matters, i.e. the number of PDFs you have per CD phone. In my experience, given the same topology (the default one with two distinct PDFs), triphones actually beat biphones slightly. But the issue is that using triphones results in a huge number of output states, and the loss in speed is just not worth the slight decrease in WER.
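A rough sketch of that size argument (all numbers here are illustrative: a hypothetical phone set and hidden dimension, with two distinct PDFs per CD phone as in the default chain topology). The point is how the worst-case unclustered PDF inventory, and with it the softmax output layer, scales with context order:

```python
# Illustrative numbers only (hypothetical phone set, hidden dim, and a
# two-PDFs-per-CD-phone topology); shows worst-case (unclustered)
# output-layer growth with context order.
phones, pdfs_per_cd_phone, hidden_dim = 42, 2, 1536

num_pdfs = {}
for name, order in [("left biphone", 2), ("triphone", 3)]:
    num_pdfs[name] = (phones ** order) * pdfs_per_cd_phone
    out_weights = hidden_dim * num_pdfs[name]
    print(f"{name:12s}: up to {num_pdfs[name]:6d} PDFs, "
          f"{out_weights / 1e6:.1f}M output-layer weights")
```

Clustering pulls these counts way down in practice, but the triphone inventory starts out ~40x bigger, which is where the speed cost comes from.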

Probably, when you have large amounts of data, clustering is not necessary (and degrades results); I still want to investigate that.

I do think that having targets that span more context helps. The reason I think so is the feature windows the current Kaldi models use: the 15-layer TDNN-Fs use a context window of ±35, which I think is insanely huge for predicting biphones, and yet it works! Change the models to reduce the input size by 5-10 frames and the results get significantly worse. I think what unintentionally ends up happening, due to the relatively large number of PDFs, is that the acoustic model ends up learning subword targets that span more audio than the biphone targets identify (the high number of PDFs allowing each biphone target to end up in not that many different words). That's the only way I can explain the very large input sizes helping.

I was thinking it would be interesting to try out variable-size contexts, so that longer phoneme sequences that appear often could be better learned by the model, while still keeping the total number of PDFs low for efficiency. But it's a bit tricky to implement (for me, at least), and it's very possible that it doesn't actually help at all anyway (since we know going biphone -> triphone doesn't help much, if at all).

A note about the paper "No need for a lexicon? Evaluating the value of pronunciation lexica in end-to-end models": while interesting, they used context-INdependent phones, so it does not surprise me that their phoneme model did not do that well.

Kirill Katsnelson

Sep 16, 2019, 7:14:45 PM
to kaldi-help
On Monday, September 16, 2019 at 8:27:27 AM UTC-7, Rudolf A. Braun wrote:

the 15 layer TDNNFs use a context window of +-35, which I think is insanely huge for predicting biphones, and yet it works!

This is not surprising; the model sees a segment 76×3(=frame offset)×10 ms ≈ 2.3 seconds long. With such a large context, what is the point of clustering at all?
 
Note about this paper, "No need for a lexicon? Evaluating the value of pronunciation lexica in end-to-end models" while interesting they used context INdependent phones. So it does not surprise me their phoneme model did not do that well.

But that's exactly what surprises me! The model sees all the phones there are in the data, and in very wide contexts, perhaps wider than a whole syllable, so long dependencies can in principle be inferred. And the DNN seems a much finer instrument than greedy clustering. I do not see a reason why a monophone CI model would fare worse than a biphone model.

 -kkm

Rémi Francis

Sep 17, 2019, 8:11:02 AM
to kaldi-help
My point of view is that the more context you remove from the phone targets, the more you offload the decoding onto the neural net.
So there's going to be a sweet spot between giving the neural net more supervision and just letting it do everything on its own.

On Tuesday, 17 September 2019 00:14:45 UTC+1, Kirill Katsnelson wrote:
On Monday, September 16, 2019 at 8:27:27 AM UTC-7, Rudolf A. Braun wrote:

the 15 layer TDNNFs use a context window of +-35, which I think is insanely huge for predicting biphones, and yet it works!

This is not surprising; the model sees a segment of 76×3(=frame offset)×10ms ≈ 2.3 second long. With such a large context, what is the point of clustering at all?

It's 76 frames of 10 ms each; the frame subsampling occurs on the output frames, not on the input.
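The receptive-field arithmetic being discussed can be sketched like this (the per-layer splicing offsets below are a made-up example in the spirit of Kaldi's TDNN-F configs, not an exact recipe):

```python
# Each TDNN layer splices hidden activations at (left, right) frame
# offsets; stacking layers adds the offsets up into the total receptive
# field. The layer list below is a made-up example, not an exact config.
layer_offsets = [(-1, 1)] * 2 + [(-3, 3)] * 11

left = sum(-lo for lo, _ in layer_offsets)
right = sum(hi for _, hi in layer_offsets)
total_frames = left + right + 1          # context window in input frames
total_ms = total_frames * 10             # at a 10 ms frame shift
print(f"context -{left}/+{right} frames: {total_frames} frames = {total_ms} ms")
```

With these (assumed) offsets the window comes out to ±35 frames, i.e. about 710 ms of audio around the frame being scored.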

Kirill Katsnelson

Sep 19, 2019, 1:00:02 AM
to kaldi-help
On Tuesday, September 17, 2019 at 5:11:02 AM UTC-7, Rémi Francis wrote:
My point of view is that the more context you remove from the phone targets, the more you offload the decoding onto the neural net.
So there's going to be a sweet spot between giving the neural net more supervision and just letting it do everything on its own.

Maybe. It would be interesting to play with. :)
 
It's 76 frames of 10 ms each; the frame subsampling occurs on the output frames, not on the input.

Oh, you are right, of course, thanks!

 -kkm

Rudolf A. Braun

Sep 20, 2019, 9:20:11 AM
to kaldi-help
> This is not surprising

Why not? If one looks at the alignments, one can see that most phones last 50-100 ms. How can having 710 ms of context help?


> With such a large context, what is the point of clustering at all?

Are you saying that with such large inputs there will be little clustering happening (since the input is very high-dimensional), and therefore "what's the point"? Regardless, AFAIK it's actually the features used for GMM training (±3 frames) that are used for clustering.

Nickolay Shmyrev

Sep 21, 2019, 2:14:38 AM
to kaldi-help


On Friday, September 20, 2019 at 4:20:11 PM UTC+3, Rudolf A. Braun wrote:

> Why not? If one looks at the alignments, one can see that most phones last 50-100 ms. How can having 710 ms of context help?

Coarticulation effects are shown to span several hundred milliseconds:

Thus, for a CVVN sequence, it would be predicted that velar opening for the nasal consonant would be initiated at the beginning of the first vowel of the system, a prediction in agreement of this study

Investigation of the timing of velar movements during speech
Daniloff, Moll
http://www.phon.ox.ac.uk/jcoleman/Moll_Daniloff_1970.pdf

Also:

It appears that the short-term memory of the auditory periphery in mammals (exhibited, e.g., by forward masking (see, e.g., [76]), the firing rate adaptation constant (see, e.g., [1]), and the buildup of loudness (see, e.g., [69])) is of the order of about 200 ms.

Should Recognizers Have Ears?
Hynek Hermansky
http://www.edu.upmc.fr/sdi/i3sr/fr/img_auth.php/7/72/Hermansky1997.pdf

Also:

Long context helps the network to estimate the noise and channel accurately. With a proper architecture it is possible to have a network that looks at the surrounding 1 s of audio to estimate a speaker vector, estimate the background noise, and simultaneously score the phone in the middle. Separate i-vectors/CMVN are not needed then.

Rudolf A. Braun

Sep 29, 2019, 1:49:48 PM
to kaldi-help
Thanks, Nickolay, your points make sense.

Kirill Katsnelson

Sep 29, 2019, 3:38:56 PM
to kaldi-help
Thanks for the ballpark figure of 1 s, Nickolay! I-vectors are another of my less-than-favorite pieces of machinery. It is one thing to reason about how they work when you imagine vectors in 3-D space; but this does not translate well to 100-200 dimensions. Rather, the intuition breaks completely, since that space is so vast and so empty that the very notion of "closeness" loses meaning. I know how they work (or, at least, I like to think I kinda do), I know they do work, but I cannot grok why they do. And here, again, simple hypotheses (e.g., that only a few eigenvalues are large, i.e. the spatial distribution is highly anisotropic) do not appear to be true: even without checking, taking for granted the story that i-vectors beat JFA suggests that it's probably not the case (you can go from i-vectors to JFA by factoring the i-vector space).

An interesting direction to pursue, certainly!

 -k "if you don't understand why it works, don't use it" km

Nickolay Shmyrev

Sep 29, 2019, 4:06:09 PM
to kaldi...@googlegroups.com
Hello Kirill

To understand i-vectors, it is better to focus on the eigenvector side than on all the linear algebra. The eigenvoice idea is very simple: you have 200-300 core voices (enough to represent the whole variety of voices) and you represent the rest as a linear combination of those. Dimension does not matter much here, nor does the spatial distribution. Clustering can be done with JFA, with k-means, or with other, more advanced manifold learning.

This eigenvector view enables more efficient computation too, as here:

A Small Footprint i-Vector Extractor by Patrick Kenny

https://www.crim.ca/perso/patrick.kenny/kenny_odyssey2012.pdf
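A minimal numeric sketch of this eigenvoice view: a speaker supervector s is modeled as s ≈ m + T·w, where the columns of T are a few hundred "core voice" directions and the low-dimensional w is the i-vector. Real i-vector extraction weights the estimation by zero- and first-order Baum-Welch statistics; the plain least squares below only demonstrates the representation idea, and all dimensions are illustrative.

```python
# Sketch of the eigenvoice representation: s ~= m + T @ w.
# NOT a real i-vector extractor (no Baum-Welch statistics, no prior);
# plain least squares, illustrative dimensions.
import numpy as np

rng = np.random.default_rng(0)
D, K = 2048, 200             # supervector dim, number of "core voices"
m = rng.normal(size=D)       # UBM mean supervector
T = rng.normal(size=(D, K))  # total-variability matrix (eigenvoices)

w_true = rng.normal(size=K)                      # a speaker's i-vector
s = m + T @ w_true + 0.01 * rng.normal(size=D)   # noisy observed supervector

w_hat, *_ = np.linalg.lstsq(T, s - m, rcond=None)  # recover the i-vector
err = float(np.linalg.norm(w_hat - w_true))
print(f"i-vector recovery error: {err:.4f}")
```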


Kirill Katsnelson

Sep 29, 2019, 6:15:01 PM
to kaldi-help
Many thanks for the reference! I'll try to see if it clears things up for me. I understand the mechanics of the eigenbasis part, and that I can mathematically represent any vector I know in advance in any basis. What bothers me is that we are not "representing" a vector (that's a no-brainer), but only approximating it, as the very root of the word suggests--trying to land in its "proximity". And however close you land to the (unknown) ground truth, you are still very likely far, far away from it in a space of a few hundred dimensions. Proximity in high-D spaces is a damn tricky, counterintuitive concept. Hitting an n-sphere of, say, r=0.05 (1 sigma two-sided, ≈95%) centered on the target in such a space is like shooting at a quarter coin from a mile away. No question, we are still hitting closer than by randomly shooting in any direction within this 1-mile-radius sphere, to continue the coin-shooting analogy, but how much closer--no intuition, though I suspect not really very "close", whatever that means.

The smaller relative improvement from i-vectors in no-HMM DNN vs. HMM-GMM systems, and the lack of improvement beyond an i-vector dimension of ≈100, are also suggestive. Your remark that widening the DNN's piehole could help pick up the remaining slack of channel+speaker differences certainly made me anxious to try it!
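A quick Monte Carlo sketch of this high-D "closeness" intuition (dimensions chosen to bracket typical i-vector sizes; all numbers illustrative):

```python
# Monte Carlo sketch of "closeness" in high dimensions: for standard
# Gaussian points, pairwise distances concentrate around sqrt(2*d), so
# the spread between "near" and "far" collapses as d grows.
import math
import random

def rel_spread(d, n_pairs=1000, seed=0):
    """Mean and relative spread (std/mean) of pairwise distances in d dims."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        x = [rng.gauss(0, 1) for _ in range(d)]
        y = [rng.gauss(0, 1) for _ in range(d)]
        dists.append(math.dist(x, y))
    mean = sum(dists) / len(dists)
    sd = math.sqrt(sum((v - mean) ** 2 for v in dists) / len(dists))
    return mean, sd / mean

spreads = {}
for d in (3, 100, 400):
    mean, spreads[d] = rel_spread(d)
    print(f"d={d:3d}: mean distance {mean:6.2f}, relative spread {spreads[d]:.3f}")
```

In 3-D the distances vary a lot; by a few hundred dimensions nearly all random pairs sit at almost the same distance, which is exactly why the 3-D intuition about "near" and "far" stops working.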

 -kkm
