How to use kaldi for phoneme error detection on non-native English?


vermazer...@gmail.com

Jul 14, 2018, 10:09:43 AM
to kaldi-help
Hi,

For my bachelor's thesis research I am investigating ASR, and more specifically phoneme error detection. My initial idea was to train a model on my self-collected corpus, which contains around 300 short sentences from 30 different speakers, but it turned out that this was not sufficient to reach an accuracy that would let me provide feedback to the learners. The self-collected corpus is English spoken by native Dutch speakers.

Now, my plan is to compare the performance of different models on my collected corpus and provide feedback accordingly. However, I need some help in finding the right models for such a study. I also need help in finding an additional non-native English corpus, because my own corpus will not be sufficient. Furthermore, I am still unsure how to change the parameters of a model so that I can compare different models better.

I did some research into Kaldi and I am able to train a model now, but I need help with extracting the recognized phoneme sequences from my test set, so that I can provide feedback to the learner.

So in short, is someone able to indicate which non-native English corpora are publicly available and which models are currently state-of-the-art for phoneme error detection?

If anybody has any questions concerning this, please ask them.  

All the best,

Jeroen Vermazeren
Student at Maastricht University

Daniel Povey

Jul 14, 2018, 2:53:11 PM
to kaldi-help
Your project seems a little ambitious for bachelor's thesis work.  Detecting pronunciations is a difficult thing to do, and I'd consider this a specialized and advanced topic within speech recognition (ASR).  ASR itself is not something you'd normally expect to encounter at the undergraduate level, as it's a very complex technology, and understanding it requires a reasonable understanding of statistics and machine learning.

I'm not aware of any good datasets for this kind of research, although there may be others on the list who are aware.

Dan



Daniel Povey

Jul 14, 2018, 2:53:30 PM
to kaldi-help
On Sat, Jul 14, 2018 at 11:53 AM, Daniel Povey <dpo...@gmail.com> wrote:
> Your project seems a little ambitious for bachelor's thesis work.  Detecting pronunciation
I mean pronunciation errors

Xavier Anguera

Jul 14, 2018, 7:26:35 PM
to kaldi-help
Dear Jeroen,

I second Dan's comment that this seems like a very big project for the Bachelor level, but I would like to encourage you to attempt it and see how much you can achieve.

Answering your questions: phoneme error analysis is usually performed by comparing the expected phonetic transcription with the phonemes that were actually spoken. This is done by deriving an expected phonetic transcription of the spoken utterance and comparing the audio against English phoneme acoustic models.
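As a toy illustration of the "expected phonetic transcription" step: the canonical phoneme sequence for an utterance is usually obtained by looking each word up in a pronunciation lexicon. A minimal sketch, assuming a tiny hand-made dictionary (a real system would use a full lexicon such as CMUdict; the entries below are invented):

```python
# Sketch: derive the expected phoneme sequence for an utterance from a
# pronunciation dictionary. The entries below are illustrative only.
LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def expected_phones(words):
    """Concatenate the canonical pronunciation of each word."""
    phones = []
    for w in words:
        if w.lower() not in LEXICON:
            raise KeyError(f"no pronunciation for {w!r}")
        phones.extend(LEXICON[w.lower()])
    return phones

print(expected_phones(["the", "cat", "sat"]))
# -> ['DH', 'AH', 'K', 'AE', 'T', 'S', 'AE', 'T']
```

This sequence is what the learner's audio is then compared against.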

To train your English models you can use any database from the Kaldi recipes (Librispeech, TED-LIUM, etc. are good open-source databases to start with).
To test your system you can look for online data containing recordings of non-native English speakers. There are multiple pages offering samples; for example, you could check this one: https://www.dialectsarchive.com/dialects-accents

I hope it helps,

X. Anguera




Shin XXX

Jul 18, 2018, 1:34:29 PM
to kaldi...@googlegroups.com
Hi, I have been researching this area for the past few months (master's degree), and based on all the experiments I have tried, my advice is to try models that are good at "alignment".

In some experiments, I tried using an ASR model to recognize the non-native speakers' phoneme sequences, then used the toolkit extracted from Festival TTS to get the "correct" phoneme sequence of the English sentence, and finally compared the two sequences directly to find the errors. The problem is that the phoneme recognition result is usually poor, and there is a trade-off between the ASR and error detection tasks: if I did well on the ASR task, my ASR model would learn to "correct" the errors itself, meaning that even when the non-native speakers are not speaking very accurately, the ASR model sometimes still generates the correct phoneme. Such situations make the direct phoneme-sequence comparison unreliable.
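For that direct-comparison step, the two phoneme sequences have to be aligned before errors can be counted, since a position-by-position comparison breaks down as soon as one phoneme is inserted or deleted. A minimal edit-distance alignment sketch (the example sequences are made up):

```python
# Sketch: align a recognized phoneme sequence against the reference with
# dynamic programming (Levenshtein), then read off match/sub/ins/del labels.
def align(ref, hyp):
    n, m = len(ref), len(hyp)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match/substitution
    # Backtrack to label each aligned position
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (ref[i-1] != hyp[j-1]):
            ops.append(("match" if ref[i-1] == hyp[j-1] else "sub",
                        ref[i-1], hyp[j-1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
            ops.append(("del", ref[i-1], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j-1]))
            j -= 1
    return dp[n][m], list(reversed(ops))

dist, ops = align(["K", "AE", "T"], ["K", "EH", "T", "S"])
print(dist, ops)  # 2 errors: one substitution (AE->EH), one insertion (S)
```

This is essentially what scoring tools like sclite do internally; in practice you would also weight substitutions by phonetic similarity.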

Then I tried the traditional "GOP" (goodness of pronunciation) methods; I strongly recommend you read the relevant papers.
Usually this requires a model to do forced alignment first, and then a GOP score is calculated for each phoneme. The simplest way is to set a threshold on the score: phonemes whose GOP falls on the wrong side of the threshold are flagged as mispronounced, and the rest as correct. Some papers use GOP scores as phoneme features and train an error classification model (if you have enough labelled data), but I don't think this is a good idea, because it is hard to label the data (try listening to a non-native speaker's audio and labelling which phonemes are right or wrong, and you'll see how hard it is), and different people have different ideas of what a good or bad pronunciation is.
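A minimal numeric sketch of that thresholding step, assuming you already have per-frame phone posteriors and a forced alignment (both would come from your acoustic model; all the numbers below are invented for illustration):

```python
import math

# Sketch: GOP in the spirit of Witt & Young -- the average log posterior
# of the canonical phone over the frames that forced alignment assigned
# to it. A low GOP suggests a mispronunciation.
def gop(frame_posteriors, phone):
    """frame_posteriors: list of {phone: posterior} dicts, one per frame
    aligned to `phone`."""
    return sum(math.log(p[phone]) for p in frame_posteriors) / len(frame_posteriors)

# Two segments from a hypothetical alignment, both labelled AE:
seg_good = [{"AE": 0.9, "EH": 0.1}, {"AE": 0.8, "EH": 0.2}]
seg_bad  = [{"AE": 0.2, "EH": 0.8}, {"AE": 0.1, "EH": 0.9}]

THRESHOLD = -1.0  # in practice tuned on held-out data
for name, seg in [("good", seg_good), ("bad", seg_bad)]:
    score = gop(seg, "AE")
    verdict = "ok" if score >= THRESHOLD else "mispronounced"
    print(name, round(score, 2), verdict)
```

The hard part in practice is not this arithmetic but getting reliable posteriors and a trustworthy alignment, which is why the choice of acoustic model matters so much.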

As for the dataset, I mixed non-native and native (Librispeech) public English datasets into one big combined corpus to train my model.
If you want to try the GOP-based methods, you should focus on models that are good at forced alignment. Some state-of-the-art models, like LSTM/CTC, are not designed for alignment. I tried a CTC model once; it had a very low WER on the Librispeech test_clean set (around 4%), but I found that the alignment results on the same data set were very bad (now I know this is partly because of the LSTM's time-delay problem). I'm using a TDNN model now, and it works well (there are many TDNN recipes in Kaldi; maybe you can take a look at egs/fisher_english/s5/local/chain/run_tdnn.sh).

It's hard to do pronunciation error detection; I tried my best, but I think the final performance is still far from satisfying... Anyway, I believe you'll learn a lot from your project.
Hope it helps.
Shin
