Need British English Speech Dataset with Mispronunciation-Aware Decoding

Jayenthiran Pukuraj

May 2, 2025, 2:03:33 AM

to kaldi-developers

Hi,

I’m working on an ASR-based linguistic training project aimed at English language learners.

The goal is to capture non-native pronunciations (specifically Indian learners of British English) and retain the mispronunciations in the decoding output, rather than mapping them to the nearest canonical word.

I’m looking for guidance on:

A suitable British English speech dataset, preferably one that includes either native British pronunciation or Indian-accented speech with British targets.

Decoding techniques or modifications in Kaldi that would allow the ASR system to reflect mispronunciations as they are, without auto-correcting to the nearest valid English word. Ideally, this should preserve phoneme-level or subword deviations.

Any known acoustic/language model settings or post-processing strategies to avoid "forced correction" and instead reflect pronunciation variation in the hypothesis.
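To make the kind of output I'm after concrete, here is a minimal sketch in plain Python (not Kaldi code). It assumes a phone-level hypothesis is already available and simply compares it against the canonical pronunciation, reporting where the two differ. The function name `phone_deviations` and the phone symbols are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def phone_deviations(canonical, recognized):
    """Return (operation, canonical_phones, recognized_phones) for each mismatch.

    Both arguments are lists of phone symbols. The operation is one of
    "replace", "delete", or "insert", as reported by difflib.
    """
    matcher = SequenceMatcher(a=canonical, b=recognized, autojunk=False)
    return [
        (tag, canonical[i1:i2], recognized[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical example: "think", canonical /th ih ng k/, realized as /t ih ng k/.
canonical = ["th", "ih", "ng", "k"]
recognized = ["t", "ih", "ng", "k"]
print(phone_deviations(canonical, recognized))
```

Ideally the decoder itself would expose the deviating phone sequence, so that a comparison like this could be run on its output instead of on a word that has already been corrected.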

Any help on this would be appreciated.
