How to generate a more accurate nnet posterior probability distribution?

Kane Williamson

Feb 28, 2023, 10:26:02 PM
to kaldi-help

If I know the transcript of an audio recording beforehand, can I use it to generate a more accurate nnet posterior probability distribution? (For example, using TTS to generate audio from the text, then obtaining the posterior probability distribution in streaming mode.)

Can I generate an FST using only the transcript, and then run a forward-backward computation to obtain more accurate posterior probabilities? (Like nnet3-chain-compute-post, but without audio, so the alignment and numerator FST cannot be generated.)

Or can I use the generated FST for decoding to obtain more accurate phoneme posterior probabilities?

I'm not sure which approach is the most feasible and would appreciate any help. Thank you.

Daniel Povey

Mar 1, 2023, 8:33:13 AM
to kaldi...@googlegroups.com
I don't understand the scenario and what you're trying to do.

Kane Williamson

Mar 3, 2023, 5:51:25 AM
to kaldi-help

I use a Kaldi acoustic model to compute the PPG (phonetic posteriorgram) of an audio recording, which I then use for voice conversion or face generation. If I already have the text, can I make the PPG more accurate? (A text-conditioned or linguistically informed PPG? I'm not sure whether such a term exists.)

I tried the following approach (similar to forced alignment):

  1. Use the text to construct a graph with compile-train-graphs.

  2. Decode on that graph with nnet3-latgen-faster to get a lattice.

  3. Convert lattice to phone post using lattice-to-post and post-to-phone-post.
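
Once step 3 has produced phone posteriors, they usually need to be turned into a dense frame-by-phone matrix before they can be used as a PPG. The following is a minimal sketch of that conversion, assuming the posteriors were written in Kaldi's text archive form (for example with an `ark,t:` wspecifier), where each utterance is one line of the form `utt-id [ id prob id prob ... ] [ ... ]`. The function name `parse_phone_post_line` and the `dim` parameter are hypothetical and chosen only for illustration.

```python
import re


def parse_phone_post_line(line, dim):
    """Turn one text-format posterior line into (utt_id, dense matrix).

    `dim` is assumed to be the highest phone id plus one, so that a phone id
    can be used directly as a column index (column 0 stays unused because
    Kaldi phone ids start at 1).
    """
    utt_id, _, rest = line.strip().partition(" ")
    ppg = []
    # Each "[ ... ]" group holds the sparse posterior of one frame.
    for frame in re.findall(r"\[([^\]]*)\]", rest):
        tokens = frame.split()
        row = [0.0] * dim
        # Tokens alternate between a phone id and its probability.
        for phone_id, prob in zip(tokens[0::2], tokens[1::2]):
            row[int(phone_id)] += float(prob)
        ppg.append(row)
    return utt_id, ppg


# Made-up three-frame example, not real decoder output.
example = "utt1 [ 1 1 ] [ 1 0.25 2 0.75 ] [ 3 1 ]"
utt_id, ppg = parse_phone_post_line(example, dim=4)
```

Each row of `ppg` corresponds to one frame of the lattice, so its length can be checked against the number of frames the acoustic model actually produced.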

But I am currently facing a problem: in the resulting PPG, many erroneous "sil" frames appear on both sides of the normal phoneme segments, and the phoneme segments are shorter than their actual duration.

E.g., a 16-frame audio clip with the content "hello":

truth phone sequence: sil sil sil HH HH AH AH AH L L OW OW OW sil sil sil

obtained ppg sequence: sil sil sil sil sil sil HH AH L OW OW sil sil sil sil sil
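
To make the difference between the two sequences above concrete, here is a small sketch that collapses each frame-level sequence into (phone, duration) runs and counts the non-silence frames. It only restates the example; the helper name `run_lengths` is hypothetical.

```python
from itertools import groupby


def run_lengths(frames):
    """Collapse a frame-level phone sequence into (phone, num_frames) runs."""
    return [(phone, len(list(group))) for phone, group in groupby(frames)]


truth = "sil sil sil HH HH AH AH AH L L OW OW OW sil sil sil".split()
obtained = "sil sil sil sil sil sil HH AH L OW OW sil sil sil sil sil".split()

truth_runs = run_lengths(truth)
obtained_runs = run_lengths(obtained)

# Number of frames labelled as speech (anything other than "sil").
truth_speech = sum(1 for p in truth if p != "sil")
obtained_speech = sum(1 for p in obtained if p != "sil")

print(truth_runs)
print(obtained_runs)
print(truth_speech, obtained_speech)
```

In this example the phone order is identical in both sequences; only the durations differ, with 10 speech frames in the reference against 5 in the obtained sequence.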

I checked that the output of the acoustic model is correct. Could the problem be in the decoding? I tried adjusting the acoustic scale (increasing it) and the LM scale, but the effect was not significant.

How can I solve this problem, and are there any better approaches?

Thank you.

Daniel Povey

Mar 3, 2023, 7:19:39 AM
to kaldi...@googlegroups.com
That looks fine to me. There is not really a "truth" when it comes to alignments.

