implementation of GOP with Kaldi


Hunt Rui

Jul 9, 2019, 5:51:36 PM
to kaldi-help
Hi Dan,

    I am trying to implement GOP (Goodness of Pronunciation) with Kaldi. The steps I am taking are listed below:
    1) Force-align and use ali-to-phones to get the phone alignment.
    2) Decode with a 2-gram phone LM, then use lattice-to-post and post-to-phone-post to get the phone posterior probabilities.
    3) Use get-post-on-ali to obtain the posterior of each aligned phone at the frame level, then average it over all the frames belonging to that phone.
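
As a rough illustration of step 3, the averaging can be sketched in Python once the per-frame values have been dumped to text. This is only a sketch under assumptions: `phone_ali` stands for the per-frame phone ids from the forced alignment and `post_on_ali` for the per-frame posterior of the aligned phone (the kind of values get-post-on-ali writes). Both names, and the function `average_posterior_per_segment`, are made up for this example and are not part of Kaldi.

```python
def average_posterior_per_segment(phone_ali, post_on_ali):
    """Average the per-frame posterior over each run of identical phone ids.

    phone_ali   : list of int, aligned phone id for every frame (hypothetical input)
    post_on_ali : list of float, posterior of the aligned phone at every frame

    Returns a list of (phone_id, start_frame, end_frame_exclusive, mean_posterior).
    Caveat: two adjacent instances of the same phone are merged into one run,
    because only the per-frame ids are available here.
    """
    if len(phone_ali) != len(post_on_ali):
        raise ValueError("alignment and posteriors must have the same length")

    segments = []
    start = 0
    for t in range(1, len(phone_ali) + 1):
        # Close the current run at the end of the utterance or on a phone change.
        if t == len(phone_ali) or phone_ali[t] != phone_ali[start]:
            frames = post_on_ali[start:t]
            mean_post = sum(frames) / len(frames)
            segments.append((phone_ali[start], start, t, mean_post))
            start = t
    return segments


# Toy data: three phones (ids 1, 5, 2) over six frames.
toy_ali = [1, 1, 1, 5, 5, 2]
toy_post = [0.9, 0.8, 0.7, 0.2, 0.4, 1.0]
for phone_id, begin, end, score in average_posterior_per_segment(toy_ali, toy_post):
    print(phone_id, begin, end, round(score, 3))
```

A low average for a segment would then flag that phone as a candidate mispronunciation; where to put the threshold is a separate question that this sketch does not address.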

    Does this make sense? I also found that the phones decoded by my 2-gram phone decoder sometimes don't make sense. I used an acoustic model trained on TED, built the HCLG graph from a 2-gram phone LM estimated on the TED speech material, and decoded with nnet3-latgen-faster. Should I use a 3-gram instead?
    

    I also tried another approach: computing decodable.LogLikelihood(frame, tid) for each frame and aggregating the values, but the resulting numbers don't look right. I don't know whether this approach makes sense.

    I have struggled with this for some time, and your time and help are much appreciated.

Thanks,
Andy
    

Daniel Povey

Jul 9, 2019, 5:54:47 PM
to kaldi-help


Sounds reasonable. I think you'll have to use a stronger LM; phones tend to be acoustically quite ambiguous, so the LM needs to be reasonable to get anything accurate.
    


Log-likelihoods are quantities that, in general, won't even sum to one, and their values can be very different depending on what type of model you have, whether you have a decision tree, the prior, etc. Getting something that makes sense is complicated. If I were you I would focus on your first approach.
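
To make the point about the prior concrete, here is a small toy sketch. It assumes a hybrid DNN-HMM style score in which each per-frame "log-likelihood" is a log-posterior minus a log-prior; the numbers and the helper name `pseudo_log_likelihoods` are invented for the illustration and do not come from any real model or from Kaldi's API.

```python
import math


def pseudo_log_likelihoods(posteriors, priors):
    """Return log(posterior) - log(prior) for each class (assumed scoring scheme)."""
    return [math.log(p) - math.log(q) for p, q in zip(posteriors, priors)]


# The same network posteriors for one frame over three classes ...
posteriors = [0.7, 0.2, 0.1]

# ... combined with two different (made-up) class priors.
priors_a = [0.5, 0.3, 0.2]
priors_b = [0.1, 0.1, 0.8]

loglikes_a = pseudo_log_likelihoods(posteriors, priors_a)
loglikes_b = pseudo_log_likelihoods(posteriors, priors_b)

# Exponentiating the scores does not give a distribution: neither sum is 1,
# and the two sums differ from each other even though the posteriors are identical.
sum_a = sum(math.exp(s) for s in loglikes_a)
sum_b = sum(math.exp(s) for s in loglikes_b)

print("scores with priors_a:", [round(s, 3) for s in loglikes_a], "sum of exp:", round(sum_a, 3))
print("scores with priors_b:", [round(s, 3) for s in loglikes_b], "sum of exp:", round(sum_b, 3))
```

Because the scale of these scores shifts with the prior (and, in a real system, with the model type and the tree), summing or averaging them across frames does not directly yield a probability-like GOP score, which is consistent with the advice to stick with the posterior-based approach.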

Dan 

Hunt Rui

Jul 9, 2019, 6:25:55 PM
to kaldi...@googlegroups.com
Hi Dan,

I got it, and thank you so much for your comments.

Andy

achintyaha

Jul 12, 2019, 7:48:37 AM
to kaldi-help
Hi Andy,

I am also working on pronunciation evaluation (I'm a beginner), so I wanted to ask: how accurate is your GOP implementation with Kaldi?

Thanks