A quick search turns up, for example,
https://www.kaggle.com/datasets/unidatapro/british-english-speech-recognition-dataset . You will have to do the data preparation for it accordingly.
Also, since decoding without an LM depends solely on the acoustic model, the system is no longer robust. The GMMs or DNN you train really need help from two things: 1) the speech itself (a clean recording environment and no slurred speech anywhere; I mean adjacent phones shouldn't be merged into one, and no phoneme should be dropped during continuous speech), and 2) the alignment of the speech features to sub-word units (doing this by hand is practically impossible, so forced alignment is used). Both of these are rarely guaranteed. Hence, for better performance and for the sake of robustness, a language model is an absolute must.
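To make the point concrete, here is a toy sketch in Python. The two hypotheses, their scores, the `lm_weight` parameter and the `best_hypothesis` helper are all made up for illustration and do not come from any real decoder; the only idea shown is that the final score is roughly the acoustic log-score plus a weighted LM log-score.

```python
# Toy illustration only: every number below is invented, not real decoder output.
def best_hypothesis(hyps, lm_weight):
    """Pick the text with the highest acoustic + lm_weight * LM log-score."""
    return max(hyps, key=lambda h: h["am"] + lm_weight * h["lm"])["text"]

# Two acoustically confusable hypotheses (hypothetical log-scores).
hypotheses = [
    {"text": "recognize speech", "am": -12.4, "lm": -3.1},
    {"text": "wreck a nice beach", "am": -12.1, "lm": -9.8},
]

# Without an LM, the slightly better acoustic score decides on its own.
acoustic_only = best_hypothesis(hypotheses, lm_weight=0.0)

# With an LM, the unlikely word sequence is penalised.
with_lm = best_hypothesis(hypotheses, lm_weight=1.0)

print(acoustic_only)
print(with_lm)
```

With these invented numbers, the acoustic-only choice is the unlikely word sequence, and adding the LM term flips the decision to the sensible one.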
Anantha Krishnan
On Friday, May 2, 2025 at 12:00:50 PM UTC+5:30 Jayenthiran Pukuraj wrote: