Need British English Speech Dataset with Mispronunciation-Aware Decoding

Jayenthiran Pukuraj

May 2, 2025, 2:03:33 AM
to kaldi-developers
Hi,

I’m working on an ASR-based project for my linguistic training, targeted at English language learners.

The goal is to capture non-native pronunciations (specifically Indian learners of British English) and retain the mispronunciations in the decoding output, rather than mapping them to the nearest canonical word.

I’m looking for guidance on:

A suitable British English speech dataset, preferably one that includes either native British pronunciation or Indian-accented speech with British targets.

Decoding techniques or modifications in Kaldi that would allow the ASR system to reflect mispronunciations as they are, without auto-correcting to the nearest valid English word. Ideally, this should preserve phoneme-level or subword deviations.

Any known acoustic/language model settings or post-processing strategies to avoid "forced correction" and instead reflect pronunciation variation in the hypothesis.
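
To make the kind of output I am after more concrete, here is a minimal sketch in plain Python. It is not Kaldi code and does not use any Kaldi API. It assumes I can already obtain a decoded phone sequence from some phone-level decoder, and it compares that sequence with the canonical pronunciation so that deviations are reported instead of being corrected away. The function name `phone_deviations`, the phone symbols, and the example word are all hypothetical and only for illustration.

```python
from difflib import SequenceMatcher


def phone_deviations(canonical, decoded):
    """Return (operation, canonical_phones, decoded_phones) for every mismatch.

    Both arguments are lists of phone symbols. The operation is one of
    "replace", "delete" or "insert", as reported by difflib.
    """
    matcher = SequenceMatcher(a=canonical, b=decoded, autojunk=False)
    return [
        (op, canonical[i1:i2], decoded[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Hypothetical example: the word "very" with /v/ realised as /w/.
# The phone labels are illustrative, not taken from a real lexicon.
canonical = ["V", "EH", "R", "IY"]
decoded = ["W", "EH", "R", "IY"]
print(phone_deviations(canonical, decoded))
```

For this example I would want the system to report that V was replaced by W, rather than silently outputting the word "very".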

Please help me with this.