I am trying to implement GOP (Goodness of Pronunciation) with Kaldi. The steps I am following are listed below:
1) Force-align the utterance and use ali-to-phones to get the aligned phones.
2) Decode with a 2-gram phone decoder to get the phones, then use lattice-to-post and post-to-phone-post to get the phone posterior probabilities.
3) Use get-post-on-ali to obtain the posterior of each aligned phone at the frame level, then average each phone's posterior over all of its corresponding frames.
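The averaging in step 3 can be sketched as follows. This is only a minimal Python illustration, not Kaldi code: it assumes the per-frame aligned phone ids and the per-frame phone posteriors have already been parsed into plain Python lists, and the function name `average_phone_posteriors` and the input layout are hypothetical.

```python
def average_phone_posteriors(alignment, frame_posts):
    """Average the posterior of each aligned phone over its frames.

    alignment:   list with one phone id per frame (assumed input layout).
    frame_posts: list with one dict {phone_id: posterior} per frame
                 (assumed input layout).
    Returns a list of (phone, start_frame, num_frames, mean_posterior).
    """
    if len(alignment) != len(frame_posts):
        raise ValueError("alignment and posteriors must cover the same frames")

    segments = []
    start = 0
    for t in range(1, len(alignment) + 1):
        # Close the current segment at the end of the utterance or when the
        # aligned phone changes. Note: two adjacent instances of the same
        # phone would be merged by this simple rule.
        if t == len(alignment) or alignment[t] != alignment[start]:
            phone = alignment[start]
            # A phone absent from a frame's posterior counts as 0.
            probs = [frame_posts[i].get(phone, 0.0) for i in range(start, t)]
            segments.append((phone, start, t - start, sum(probs) / len(probs)))
            start = t
    return segments
```

Each returned tuple holds one aligned phone segment together with its mean posterior, which is the per-phone score described in step 3.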
Does this make sense? I also found that the phones decoded by my 2-gram phone decoder sometimes don't make sense, though not always. For context, I used an acoustic model trained on TED, built the HCLG graph from a 2-gram phone language model estimated on the TED speech material, and decoded with nnet3-latgen-faster. Should I use a 3-gram instead?
I also tried a second approach to GOP: computing decodable.LogLikelihood(frame, tid) for each frame and aggregating the values, but the resulting numbers don't look right. I am not sure whether this approach makes sense.
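One way to make such an aggregation concrete is a duration-normalised log-likelihood ratio between the aligned phone and the best competing phone over the same frames. The sketch below is in Python rather than Kaldi C++ and is an assumption about what the aggregate could look like, not a description of the code used in this attempt. It assumes the per-frame log-likelihoods have already been reduced to one value per phone for every frame; the function name `gop_from_loglikes` and the input layout are hypothetical.

```python
def gop_from_loglikes(loglikes, phone, start, end):
    """Duration-normalised log-likelihood ratio for one phone segment.

    loglikes: list with one dict {phone: log-likelihood} per frame; every
              frame is assumed to contain an entry for every phone.
    phone:    the aligned (target) phone for this segment.
    start, end: frame range [start, end) of the segment.
    Returns a value <= 0; it is 0 when the target phone is the best match.
    """
    num_frames = end - start
    if num_frames <= 0:
        raise ValueError("segment must contain at least one frame")

    # Sum the log-likelihood of every phone over the segment.
    totals = {}
    for t in range(start, end):
        for q, ll in loglikes[t].items():
            totals[q] = totals.get(q, 0.0) + ll

    # Compare the target phone with the best-scoring phone, per frame.
    return (totals[phone] - max(totals.values())) / num_frames
```

Dividing by the number of frames keeps long and short segments comparable, and subtracting the best competitor removes the overall scale of the raw log-likelihood sum.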
I have been struggling with this for some time, and your time and help are much appreciated.
Thanks,
Andy