Hi all, just had a few thoughts to add.
A variable rule missing from the dictionary only matters when applying that rule leads to an alignment error. Though (ing) is very frequent, it probably leads to very few confusions, whereas r-vocalization would lead to many (as work on AAVE speech recognition has shown). And an alignment system is capable of sorting out quite a bit of variation not present in the dictionary. The aligner "deletes" by giving a segment zero duration. Coronal/dorsal nasal variation as in (ing) can be handled by appropriately permissive acoustic models. In general, that is an argument for not putting stable variants into the dictionary (or for removing them). There is fancy engineering work that tries to extract rules for avoiding word confusions when the dictionary does not match the dialect, but that's probably outside the scope of this project.
When resources like the CMU dictionary are used "in industry", it is common for someone to edit it to remove weird words, rare words, strange proper names, weird entries for common words, etc. It is certainly not too late to do this, nor would it be inappropriate to use another, more-constrained dictionary resource. But, the acoustic model bin trained (remote past) with the full CMU dictionary. So, if you remove a bunch of stable variants in the dictionary, it might have a negative effect on the acoustic model because it was trained with (ing) variation occasionally present in the dictionary. If you do choose to make major changes to the dictionary, you may want to get a hold of the SCOTUS data (or whatever; much better would be something more domain-appropriate, say the PNC) and retrain.
There is one thing that is (variably) in the dictionary which, IMO, emphatically should not be: wordforms with the Saxon genitive ('s) clitic. As you probably know, that clitic can attach to any part of speech:
(Noun) Fred's opinion about the English genitive is different from mine.
(Verb) Every linguist I know's opinion about the English genitive involves functional categories.
(Preposition) That young hotshot who was recently hired at Princeton that I was just telling you about's opinion about the English genitive is simply wrong.
(Adverb) That linguist who is wrong often's analysis of the English genitive is overly complex.
So adding these to the dictionary is a futile struggle against Zipf's law. A better approach is to treat the clitic as an independent word---insert a space before any instance of "'s" (not already preceded by whitespace) in the label file---and add it to the pronunciation dictionary. I'd be willing to assist with implementing that functionality if changes to the dictionary are being made anyways.
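To make the proposal concrete, here is a rough Python sketch of the label-file step, not a finished implementation. The function name `split_genitive` and the `CLITIC_ENTRIES` list are made up for illustration. The entries assume a CMU-style "WORD  PHONES" layout and simply spell out the three regular allomorphs in ARPABET, so the exact format would need checking against whatever the aligner actually reads. Note that this also splits contracted "is"/"has" (as in "it's"), which is homophonous with the genitive but may or may not be what you want.

```python
import re

# "'s" immediately preceded by a non-whitespace character and ending a token.
# The lookbehind leaves alone any "'s" that already follows whitespace.
CLITIC = re.compile(r"(?<=\S)'s\b", flags=re.IGNORECASE)


def split_genitive(line):
    """Insert a space before each "'s" that is attached to a preceding word."""
    return CLITIC.sub(lambda m: " " + m.group(0), line)


# Hypothetical dictionary entries for the clitic as an independent word
# (assumed CMU-style layout; one line per allomorph).
CLITIC_ENTRIES = [
    "'S  Z",      # after voiced non-sibilants: Fred's
    "'S  S",      # after voiceless non-sibilants: Pat's
    "'S  IH0 Z",  # after sibilants: church's
]

if __name__ == "__main__":
    print(split_genitive("Every linguist I know's opinion"))
    print(split_genitive("Fred 's opinion"))  # already split, left unchanged
```

The aligner would then pick whichever of the three pronunciations fits the audio best, the same way it chooses among multiple entries for any other word.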
Kyle