Need British English Speech Dataset with Mispronunciation-Aware Decoding


Jayenthiran Pukuraj

May 2, 2025, 2:30:50 AM
to kaldi-help
Hi,

I’m working on an ASR-based project for linguistic training targeted at English language learners.

The goal is to capture non-native pronunciations (specifically Indian learners of British English) and retain the mispronunciations in the decoding output, rather than mapping them to the nearest canonical word.

I’m looking for guidance on:

A suitable British English speech dataset, preferably one that includes either native British pronunciation or Indian-accented speech with British targets.

Decoding techniques or modifications in Kaldi that would allow the ASR system to reflect mispronunciations as they are, without auto-correcting to the nearest valid English word. Ideally, this should preserve phoneme-level or subword deviations.

Any known acoustic/language model settings or post-processing strategies to avoid "forced correction" and instead reflect pronunciation variation in the hypothesis.

Please help me with this.

Anantha Krishnan

May 2, 2025, 2:52:11 AM
to kaldi-help
As long as your lexicon file maps words to phone sequences, you will only ever get the closest word decoded. I suggest you first change the lexicon to a phone-to-phone mapping instead. If the mispronunciations should also be reflected, I can only think of removing the language model completely and building an acoustic-only ASR system. You will need to change the weights of G.fst accordingly.
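The phone-to-phone lexicon idea can be sketched as follows: every phone becomes its own "word" that maps to itself, so the decoder emits phone sequences rather than snapping to dictionary words. The phone names and file handling here are illustrative assumptions, not from the thread:

```python
# Sketch: build a phone-level lexicon so the decoder outputs phone
# sequences instead of the nearest dictionary word.
# The phone list below is a stand-in for your own nonsilence phone set
# (e.g. the entries Kaldi keeps in data/lang/phones/).

def make_phone_lexicon(phones):
    """Map each phone to itself, so every lexicon 'word' is one phone."""
    return [f"{p} {p}" for p in phones]

phones = ["AA", "AE", "B", "CH"]  # illustrative subset only
for line in make_phone_lexicon(phones):
    print(line)  # each line goes into lexicon.txt: "<word> <pron>"
```

Each printed line has the usual Kaldi lexicon.txt shape (word followed by its pronunciation), which prepare_lang.sh can then consume as usual.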

If you have already built a bigram language model with some Create_ngram_LM.sh script, then use the script attached to this post to effectively remove the language model's effects.
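The attached remove_lm.sh is not reproduced here, but one common way to neutralise the LM is to replace G.fst with a flat loop grammar in which every (phone-)word carries the same zero weight, so only the acoustic scores drive the decoding. A minimal sketch, emitting OpenFst text format suitable for fstcompile (the symbol-table file names in the comment are assumptions):

```python
# Sketch (hypothetical, not the attached remove_lm.sh): a flat grammar
# in OpenFst text format. Every word is a self-loop on state 0 with
# weight 0.0, so no word sequence is preferred over any other.

def flat_g_fst_text(words):
    lines = [f"0 0 {w} {w} 0.0" for w in words]  # one self-loop arc per word
    lines.append("0 0.0")                        # state 0 is also final
    return "\n".join(lines)

print(flat_g_fst_text(["AA", "B", "CH"]))
# Compile with something like:
#   fstcompile --isymbols=words.txt --osymbols=words.txt flat_g.txt G.fst
```

Because every arc has equal weight, the resulting G.fst contributes nothing to the path score, which is the same effect as removing the language model.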

Best, 
Anantha Krishnan
remove_lm.sh

Anantha Krishnan

May 2, 2025, 3:06:13 AM
to kaldi-help
From a quick search, you may find https://www.kaggle.com/datasets/unidatapro/british-english-speech-recognition-dataset . You will have to do the data preparation accordingly.
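For the data preparation, Kaldi expects a data directory containing wav.scp, text and utt2spk. A minimal sketch of generating these three files is below; the directory layout and the speaker-prefix naming convention are assumptions, not the Kaggle dataset's actual structure:

```python
# Sketch: write the three core files of a Kaldi data dir.
# Assumes wav filenames look like "<speaker>_<id>.wav"; adapt the
# speaker-id extraction to the corpus you actually download.
import os

def prepare_data_dir(wav_paths, transcripts, out_dir):
    """Write wav.scp, text and utt2spk; utt-id is <speaker>-<basename>."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "wav.scp"), "w") as scp, \
         open(os.path.join(out_dir, "text"), "w") as txt, \
         open(os.path.join(out_dir, "utt2spk"), "w") as u2s:
        for path, trans in zip(wav_paths, transcripts):
            base = os.path.splitext(os.path.basename(path))[0]
            spk = base.split("_")[0]      # assumed speaker prefix
            utt = f"{spk}-{base}"         # spk prefix keeps sorting valid
            scp.write(f"{utt} {path}\n")
            txt.write(f"{utt} {trans}\n")
            u2s.write(f"{utt} {spk}\n")
```

After writing these, the usual utils/fix_data_dir.sh and utils/utt2spk_to_spk2utt.pl steps apply.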

Also, since decoding without an LM depends solely on the acoustic model, the system is no longer robust. The GMMs or DNN you train really need proper help from both 1) the human speech itself (a clean recording environment and no slurred speech anywhere; I mean adjacent phones shouldn't be merged into one, and no phoneme should be dropped during continuous speech), and 2) the alignment of the speech features to sub-word units (doing this manually is impossible, so forced alignment is used). Both of these are rarely guaranteed. Hence, for better performance and also for the sake of robustness, a language model is an absolute must.

Anantha Krishnan
