> First of all, my apologies for posting so many questions to this forum
> recently, but I could really use the help of the experts here.
>
> I'm building an application for 8 kHz telephone speech recognition for
> domain-specific English. I tried the nnet2 model trained on the Fisher
> dataset with a custom language model, but the results were
> unsatisfactory. I have therefore decided to train a model myself using
> transcriptions I have for my domain-specific recordings. Initially, I'm
> planning to follow the vystadial_en recipe [1] because I believe preparing
> my training data in the same format as the vystadial_en dataset [2] is
> relatively easy.
>
> I have carefully read the documentation at
> http://kaldi.sourceforge.net/data_prep.html, but I have some specific
> questions about the data preparation.
>
> 1. Must my training recordings be segmented (that is, do I need only one
> utterance per file, or a segmentation file)? Is there a way to automate
> this segmentation process?
Let me rephrase your question: "Is it best if I train on short utterances?"
An utterance is whatever you define it to be, i.e. a single line in
the 'text' file or 'feats.scp'.
Yes, it is generally better to train on short utterances, but it is not a
must. Once you have a trained system you can use split_long_utterance.sh to
segment your data; there is an example of using it in the WSJ example
scripts.
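
If you already have rough segment boundaries (from the transcriber, say), a segmentation file is one way to define short utterances without cutting the audio. The following is only a minimal sketch in Python, assuming the 'segments' line format from the data_prep page (utterance-id, recording-id, start, end, with times in seconds). The helper name make_segments and the utterance-id scheme are illustrative assumptions, not part of Kaldi.

```python
def make_segments(recording_id, spans):
    """Return (segments_lines, text_lines) for one recording.

    spans: list of (start_sec, end_sec, transcript) tuples.
    The utterance-id scheme (recording id plus zero-padded times in
    centiseconds) is an assumption, chosen so ids sort by recording.
    """
    segments, text = [], []
    for start, end, words in spans:
        if end <= start:
            raise ValueError(f"bad span {start}-{end} in {recording_id}")
        utt_id = f"{recording_id}-{round(start * 100):07d}-{round(end * 100):07d}"
        # One line per utterance: <utt-id> <rec-id> <start> <end>
        segments.append(f"{utt_id} {recording_id} {start:.2f} {end:.2f}")
        # Matching line for the 'text' file: <utt-id> <transcript>
        text.append(f"{utt_id} {words}")
    return segments, text


segs, txt = make_segments("call001", [(0.0, 3.25, "hello there"),
                                      (4.1, 7.8, "how can i help")])
print("\n".join(segs))
print("\n".join(txt))
```

Each line of the 'text' file then corresponds to one entry in 'segments', which is what makes it an utterance in the sense above.
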
> 2. Is it acceptable to have long silences (say more than 1 second, varies
> for each recording), at the beginning and the end of each recording? If not,
> what would be the ideal duration of the silences at the beginning and the
> end of each recording?
It's fine to have long silences. The silence duration isn't super critical.
> 3. Finally, some of the transcriptions have the tag [indistinguishable] for
> some words and short sequences of words that the human transcriber could not
> understand. If I replace these with <unk>, would that be the correct way to
> handle these?
There is nothing special about the symbol "<unk>" as far as Kaldi is
concerned. It may be used in the vystadial recipe as the unknown-word
symbol, though (see oov.txt in lang/), and that symbol will usually have
its own phone. Yes, you can map the tag to that symbol, or leave it as-is
and not supply a dictionary entry, in which case it will automatically be
mapped to the unknown-word symbol.
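
If you prefer to do the mapping explicitly before writing the 'text' file, something like the following would work. It is a sketch under two assumptions: that the tag is spelled exactly "[indistinguishable]" in your transcripts, and that the unknown-word symbol in your lang/oov.txt is "<unk>". The function name map_unknown is hypothetical.

```python
import re

# Assumption: the transcriber's tag is spelled "[indistinguishable]"
# (any letter case). Adjust the pattern if your transcripts differ.
TAG = re.compile(r"\[indistinguishable\]", re.IGNORECASE)


def map_unknown(transcript, oov="<unk>"):
    """Replace each tag with the OOV symbol and normalise whitespace."""
    # Pad with spaces so a tag glued to a neighbouring word still
    # becomes a separate token, then collapse repeated whitespace.
    replaced = TAG.sub(f" {oov} ", transcript)
    return " ".join(replaced.split())


print(map_unknown("please hold [indistinguishable] for a moment"))
```
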
Dan