As far as I understand, it was a serendipitous discovery that better results are obtained with (N=2, P=2; left-context biphone) CD phone contexts than with (N=3, P=2; "traditional" triphone) ones in TDNN models. This seems to be pretty much an established empirical fact by now. It has bothered me for a while. Let me explain; I'll be focusing exclusively on English here.
Context clustering has been with us as long as triphones have, as a solution to data scarcity. Initially, hand-marked properties used to be assigned to phones, and the trees were built by hand based on linguistic features (e.g., intervocalic /t/ surfaces as an unvoiced [ɾ] or voiced [ɽ] (a.k.a. [ᴅ], a.k.a. [dx]) flap). This all made sense, but around 1990 it was found that tree-clustering methods provide a better result than a hand-constructed or rule-constructed tree (or a combination of both); I think one of the first papers here is (Randolph 1990, doi:10.1109/ICASSP.1990.116176). Notably, Randolph still uses linguistic features that cover a longer range than the n-gram width itself; for example, one considered linguistically useful is whether an unvoiced stop begins the onset of a stressed syllable, a well-known and strong predictor of an aspirated release.
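To make the tree-clustering idea concrete, here is a toy sketch (my own invented tokens and a plain entropy-gain criterion, not Randolph's actual method): given tokens of /t/ labeled with their surface realization, pick the yes/no question about the left neighbor that best separates the realizations.

```python
# Toy sketch of one decision-tree split over phone-class questions.
# All data and questions below are invented for illustration only.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

# Toy training tokens of /t/: (left-neighbor phone, surface realization).
tokens = [("s", "t"), ("s", "t"), ("#", "th"), ("#", "th"),
          ("n", "t"), ("aa", "dx"), ("iy", "dx"), ("er", "dx")]

# Candidate questions, each a class of left-neighbor phones.
questions = {"is_vowel": {"aa", "iy", "er"}, "is_sibilant": {"s"}}

def split_gain(phone_class):
    """Entropy reduction from splitting tokens on membership in phone_class."""
    yes = [r for l, r in tokens if l in phone_class]
    no = [r for l, r in tokens if l not in phone_class]
    n = len(tokens)
    return entropy([r for _, r in tokens]) - (
        len(yes) / n * entropy(yes) + len(no) / n * entropy(no))

best = max(questions, key=lambda q: split_gain(questions[q]))
print(best)  # the vowel question wins: it isolates the flapped [dx] tokens
```

A real system would of course use an acoustic likelihood criterion over Gaussian statistics rather than label entropy, but the greedy question-selection loop is the same shape.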
So far, it all makes sense linguistically; speaking of /t/ alone, reading (Eddington 2007, doi:10.1515/COG.2007.002) with his quantitative approach reveals how daunting the problem of hand-assignment and/or rule induction really is. Augmenting linguistic knowledge with ML (or augmenting ML with linguistic knowledge, if you prefer) makes total sense. Everything makes sense up to this point.
I do not know at what point in time linguistic markers fell out of use in tree clustering; but the later HMM-GMM systems clustered CD contexts based only on neighboring phone identities (and, as Kaldi does, the HMM state position). This is linguistically interesting; for one, the information that the phone starts the onset of a stressed syllable is partially preserved (in “tap”) and partially lost (in “trap”, where the /'a/ falls outside of the /t/ context window).
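A quick toy illustration of the “tap”/“trap” asymmetry (hypothetical ARPAbet-like phone strings, not any system's actual code), using the 1-based N/P windowing convention from above:

```python
# Which neighboring phones end up in the modeled context for the initial /t/?
def context(phones, i, n, p):
    """Return the n-phone window around phones[i], with the central phone at
    1-based position p within the window; '<s>'/'</s>' pad the utterance edges."""
    padded = ["<s>"] + phones + ["</s>"]
    j = i + 1  # index of the central phone in the padded list
    start = j - (p - 1)
    return padded[start:start + n]

tap  = ["t", "ae1", "p"]
trap = ["t", "r", "ae1", "p"]

# Triphone (N=3, P=2): the stressed vowel is in /t/'s context in "tap"...
print(context(tap, 0, 3, 2))   # ['<s>', 't', 'ae1']
# ...but not in "trap", where the /r/ intervenes:
print(context(trap, 0, 3, 2))  # ['<s>', 't', 'r']

# Left biphone (N=2, P=2): only the left neighbor survives in both words:
print(context(tap, 0, 2, 2))   # ['<s>', 't']
print(context(trap, 0, 2, 2))  # ['<s>', 't']
```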
The question that has bothered me is why discarding even more context information improves the performance of a DNN ASR stack. Some information is indeed preserved (e.g., vowels after a nasal are nearly invariably nasalized, and the phone preceding a vowel is in the left context), but the above example with ±aspiration is no longer inferable from the tree: the conditioning phone was somewhere in the right context, which has been discarded.
At the same time, clustering itself loses a certain amount of information. If we cluster contexts XaY and XbY together, we lose some distinction between the processes by which a and b surface, and the information loss grows with the N-phone context size as a power of N. Now, this suggests an easy (and likely wrong) explanation: the DNN's inferencing power is so great that discarding information from its training examples stifles it, and triphone clustering discards more than it retains. A quick counterexample, naturally, is that going one step further and switching from biphones to monophones (which, with the trivial tree, avoids clustering altogether and therefore discards no information) not only does not improve the performance, but makes it worse (mostly hearsay from conversations on this list; not an amazingly publishable result in itself, after all--like “huh, whad'ya expect?”). But yeah, I expected the trend to continue, since DNNs are indeed immensely powerful at both compressing information and generalizing from it--but it did not.
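For a back-of-envelope sense of that growth (the inventory size of 42 phones below is a made-up round number; the exact figure does not matter):

```python
# Number of distinct unclustered context types of width n,
# for a hypothetical inventory of V phones: V**n.
V = 42
for n, name in [(1, "monophone"), (2, "biphone"), (3, "triphone")]:
    print(f"{name}: {V**n:,} context types")
# monophone: 42 context types
# biphone: 1,764 context types
# triphone: 74,088 context types
```

Only the triphone row is anywhere near the regime where the tree must collapse tens of thousands of raw contexts into a few thousand clusters; the biphone inventory is almost small enough to leave unclustered.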
Another possible explanation is that linguistics studies the language in the human brain, and ASR is an entirely different craft--in other words, the expectation that linguistics would work for ASR is grounded neither in theory nor in practice. But here we can jump into the bottomless epistemological rabbit hole that makes this hypothesis “not even wrong,” to my taste. We do use the linguistic concept of context in phonology, after all, and splitting linguistic knowledge into the part that works for ASR and the part that does not apply to it may be entirely arbitrary.
The question buzzed somewhere in the back of my mind for a while, and I kept coming back to it with different experiments, but still took the well-established path of biphone clustering. Like many fellow data witchcraft practitioners here, I'm targeting practical results first, and anything theoretical, if it happens at all, comes only as a by-product.
My question is: has there been any research, published or not--or maybe just hypotheses floating around--as to why left-context biphones seem to be a sweet spot with DNN models? Or maybe I am missing something obvious that explains this phenomenon? Just please speak your mind. :)
-kkm