Removing <unk> in the HCLG


Rémi Francis

Apr 11, 2016, 9:00:14 AM4/11/16
to kaldi-help
I usually remove <unk> from the language model when preparing G.fst, with remove_oovs.pl
In the end I get:
fstisstochastic exp/mono0a_pi/graph/HCLGa.fst
0.000251199 -0.0521014

If I do it with --remove-oov in mkgraph instead, I get:
fstisstochastic exp/chain/tdnn_6z_sp/graph/HCLGa.fst 
9.22907 -0.687487

There doesn't seem to be a lot of difference at decode time, but I still wonder if there is one approach that is more "right" than the other.
Wouldn't it make sense to have this option part of arpa2fst instead?

Daniel Povey

Apr 11, 2016, 2:02:05 PM4/11/16
to kaldi-help
Probably it's more right to remove it earlier -- it's better w.r.t. weight pushing -- but we have to decide whether we want to disable decoding `<unk>` globally, as opposed to just for the chain models. The chain models were decoding it too much. Probably having it there made little difference for the other models, but we need to check.
Dan



Rémi Francis

Apr 12, 2016, 9:57:45 AM4/12/16
to kaldi-help, dpo...@gmail.com
Is there any reason to be able to decode <unk> at test time?
In the best of worlds, the model would know that the word it's trying to recognise is not in its vocabulary and would output <unk>, but I doubt that it's what happens in practice.
I think that instead <unk> will appear when the audio is not clear enough, so you end up putting that in the same bag as the set of words your LM doesn't know, which doesn't really make sense to me.

Daniel Povey

Apr 12, 2016, 1:26:57 PM4/12/16
to Rémi Francis, kaldi-help
Well, the models (particularly chain models) do have a tendency to decode `<unk>` when they hit an out-of-vocabulary word in the input, so in a sense it's working as desired, but unfortunately they sometimes also decode it inappropriately. I don't think I've ever seen evidence that it was clearly helpful to decode `<unk>`, but most of the time, until the Librispeech setup, it wasn't clearly harmful either.
Dan

Rémi Francis

Apr 13, 2016, 6:44:29 AM4/13/16
to kaldi-help, re...@speechmatics.com, dpo...@gmail.com
Thanks, I see.
I've tried both ways of removing <unk>, and it didn't change anything (±0.01 WER absolute), so it doesn't seem to matter too much.

luss2...@gmail.com

Jun 21, 2016, 6:26:59 AM6/21/16
to kaldi-help
Hi,

I'm new to Kaldi. Can you explain "stochastic"? What does it mean for an FST to be stochastic, and why do we need to keep an FST stochastic?

Any explanation will be appreciated.


Jun

Daniel Povey

Jun 21, 2016, 1:56:32 PM6/21/16
to kaldi-help
It means an FST where, from each state, the total weight sums to one.
We mean this when interpreted in the log semiring, i.e. that over the
weights on arcs out of a state (plus the final-prob), sum(exp(-weight)) = 1.0.
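Dan's definition can be sketched as a small check. This is a hypothetical helper, not a Kaldi or OpenFst API; it just tests one state's outgoing arc weights (stored as negated log-probabilities, Kaldi's log-semiring convention) plus an optional final-prob:

```python
import math

def is_stochastic_state(arc_weights, final_weight=None, tol=1e-6):
    """Check stochasticity at a single state.

    arc_weights: -log probabilities on the arcs leaving the state.
    final_weight: the state's final-prob (also a -log probability), if any.
    The state is stochastic if sum(exp(-weight)) is 1.0 within `tol`.
    """
    total = sum(math.exp(-w) for w in arc_weights)
    if final_weight is not None:
        total += math.exp(-final_weight)
    return abs(total - 1.0) <= tol

# Two arcs, each carrying probability 0.5, i.e. weight -log(0.5):
w = -math.log(0.5)
print(is_stochastic_state([w, w]))  # True: 0.5 + 0.5 = 1.0
print(is_stochastic_state([w]))     # False: probabilities sum to 0.5
```

Kaldi's `fstisstochastic` does the analogous whole-FST check, printing the minimum and maximum deviation from stochasticity over all states, which is why small numbers (like 0.000251199 above) indicate a nearly stochastic graph.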