question about librispeech corpus text

tfpeach

Oct 26, 2015, 11:50:18 AM

to kaldi-help

Hi, dear all,

I am preparing my own corpus following the LibriSpeech recipe, and I have a question about text preparation. The LibriSpeech lexicon contains the filler "<SPOKEN_NOISE>", but no filler is labeled anywhere in the training text. Should I include the filler in my text? Also, the lexicon has no entries for "<s>" and "</s>". Does that mean I don't need to take care of them, and Kaldi will handle them automatically?

Thank you.

Daniel Povey

Oct 26, 2015, 1:24:24 PM

to kaldi-help

Even if "<SPOKEN_NOISE>" doesn't appear in the transcripts, words in
the transcripts that are out of the vocabulary may get mapped to it
automatically if lang/oov.txt is set to "<SPOKEN_NOISE>".
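
The mapping described above can be illustrated with a minimal Python sketch. It is not Kaldi's own code (Kaldi's scripts do this step themselves). The function name `map_oov`, the toy vocabulary, and the toy transcript are all assumptions made up for the example; the only detail taken from the thread is the OOV symbol "<SPOKEN_NOISE>".

```python
def map_oov(transcript_line, vocabulary, oov_symbol="<SPOKEN_NOISE>"):
    """Replace every word that is not in `vocabulary` with `oov_symbol`.

    The first token is treated as the utterance ID, as in a Kaldi `text` file,
    and is left untouched.
    """
    utt_id, *words = transcript_line.split()
    mapped = [w if w in vocabulary else oov_symbol for w in words]
    return " ".join([utt_id, *mapped])


# Hypothetical vocabulary and transcript, invented for this illustration.
vocabulary = {"THE", "CAT", "SAT"}
print(map_oov("utt-0001 THE CAT SAT ON THE MAT", vocabulary))
# prints: utt-0001 THE CAT SAT <SPOKEN_NOISE> THE <SPOKEN_NOISE>
```

So the training text itself does not need to contain the filler; it only shows up after out-of-vocabulary words have been mapped.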

Regarding "<s>" and "</s>", they are the beginning-of-sentence and
end-of-sentence symbols. They should not appear in the lexicon; they
appear in the language model but they get removed by the time you
compose with the lexicon.
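
To see why "<s>" and "</s>" belong to the language model and not to the lexicon, here is a small Python sketch of n-gram counting with sentence-boundary padding. It is a toy illustration, not how any particular LM toolkit is implemented; the function name `count_ngrams` and the two example sentences are assumptions.

```python
from collections import Counter


def count_ngrams(sentences, order=3, bos="<s>", eos="</s>"):
    """Count n-grams of the given order, padding each sentence with bos/eos."""
    counts = Counter()
    for sentence in sentences:
        tokens = [bos] + sentence.split() + [eos]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


# Two made-up training sentences.
counts = count_ngrams(["THE CAT SAT", "THE CAT RAN"])

# The boundary symbols occur inside the LM's n-grams ...
print(counts[("<s>", "THE", "CAT")])  # prints: 2

# ... but only the real words need pronunciations in the lexicon.
lm_vocab = {word for ngram in counts for word in ngram}
words_needing_pronunciations = sorted(lm_vocab - {"<s>", "</s>"})
print(words_needing_pronunciations)  # prints: ['CAT', 'RAN', 'SAT', 'THE']
```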

Dan


tfpeach

Oct 26, 2015, 8:15:15 PM

to kaldi-help, dpo...@gmail.com

Thank you. You make it very clear.

I have a follow-up question. I see there are several test language model folders in the "data/" dir (e.g. lang_nosp_test_tgsmall, lang_nosp_test_tgmed, etc.). What data did you use to create the LMs? Can I just use one LM folder? Or do I need to create an LM for the subset I am using (e.g. 2k, 5k, 10k training data)?

Thank you.

vinc...@yahoo.com

Oct 27, 2015, 4:37:45 AM

to kaldi-help, dpo...@gmail.com

Unless I should open another thread, I have similar questions about the lexicon, the language model, and OOV words.

1) If a word is in the lexicon BUT NOT in the language model, will it be recognized because it's in the lexicon? Will it get a low likelihood from the language model?

2) If a word is NOT in the lexicon but IS in the language model, does it have a chance of being recognized, or none at all?

3) Suppose we want unrecognized words (i.e. sequences of several phonemes) to be tagged as UNKNOWN, while sounds that cannot be words, like "euh uhhh hmm", are skipped entirely and not even tagged as spoken noise. What should we do?

Jan Trmal

Oct 27, 2015, 10:15:12 AM

to kaldi-help, Dan Povey

1) yes, it can get recognized

2) no -- because you don't have the pronunciation. I think the scripts will just replace it by your unk/oov word

3) I don't really understand the question, but for these things it's usually easier just to postprocess the decoded text.
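
Cases 1) and 2) can be checked for a given setup by comparing the two word lists. The Python sketch below is only an illustration: the function name `compare_vocabularies` and the toy word lists are assumptions, and it says nothing about how Kaldi treats each group beyond what is stated in the answers above.

```python
def compare_vocabularies(lexicon_words, lm_words, boundary=("<s>", "</s>")):
    """Return the words found in only one of the lexicon and the LM.

    Sentence-boundary symbols are ignored, since they are expected in the
    LM but not in the lexicon.
    """
    lexicon_words = set(lexicon_words)
    lm_words = set(lm_words) - set(boundary)
    return {
        "lexicon_only": sorted(lexicon_words - lm_words),  # case 1)
        "lm_only": sorted(lm_words - lexicon_words),       # case 2)
    }


# Hypothetical word lists, invented for this example.
lexicon = ["CAT", "DOG", "<SPOKEN_NOISE>"]
lm_vocab = ["<s>", "</s>", "CAT", "BIRD"]
print(compare_vocabularies(lexicon, lm_vocab))
# prints: {'lexicon_only': ['<SPOKEN_NOISE>', 'DOG'], 'lm_only': ['BIRD']}
```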

y,


tfpeach

Oct 27, 2015, 11:40:23 AM

to kaldi-help, dpo...@gmail.com

Hi, Yenda,

Right now, I simply train a 3-gram LM on all my training data. But I see that the LibriSpeech and WSJ recipes have several LM folders, and they do LM rescoring with these LMs after the triphone decoding. Is that necessary? Can I just use one LM? Also, could you tell me how I can get these LMs?

I read the description of these LMs, and it seems they are obtained by different approaches, e.g. with or without pruning. Is that right?

Thank you.

Jan Trmal

Oct 27, 2015, 12:39:09 PM

to kaldi-help, Dan Povey

Yes, you don't have to do rescoring. I'm not extremely familiar with the WSJ recipe, but I think Dan usually uses the convention that _tg is trigram and _bg is bigram. Also, I have the feeling there could be an RNNLM rescoring step that might give better numbers, or some LM obtained by interpolating with Switchboard/Fisher/whatever...

WSJ was/is oftentimes used as a testbed for new algorithms, so there might be things you don't really need to run.

y.

tfpeach

Oct 27, 2015, 12:40:51 PM

to kaldi-help, dpo...@gmail.com

Thank you very much!
