Preparing training data for Kaldi


Nishan Wickrama

Oct 12, 2015, 3:05:33 AM
to kaldi-help
Hello,

First of all, my apologies for posting so many questions to this forum recently, but I could really use help from the experts here.

I'm building an application for 8 kHz telephone speech recognition for domain-specific English. I have tried using the nnet2 model trained on the Fisher dataset with a custom language model, but the results I got were unsatisfactory. Therefore, I decided to train a model myself using some transcriptions I have for my domain-specific recordings. Initially, I'm planning to follow the vystadial_en recipe [1], because I believe preparing my training data in the same format as the vystadial_en dataset [2] is relatively easy.

I have carefully read the documentation at http://kaldi.sourceforge.net/data_prep.html, but I have some specific questions about data preparation.

1. Is it a must that my training recordings are segmented? (Do I have to have only one utterance per file, or provide a segmentation file?) Is there a way to automate this segmentation process?

2. Is it acceptable to have long silences (say, more than 1 second; it varies for each recording) at the beginning and end of each recording? If not, what would be the ideal duration of the silences at the beginning and end of each recording?

3. Finally, some of the transcriptions have the tag [indistinguishable] for some words and short sequences of words that the human transcriber could not understand. If I replace these with <unk>, would that be the correct way to handle these?

Thanks in advance. Once I figure out how to train Kaldi with my own data, I'm more than happy to contribute comprehensive documentation to Kaldi if you think that would be helpful to future users.

[1] https://github.com/kaldi-asr/kaldi/tree/master/egs/vystadial_en
[2] Free English and Czech telephone speech corpus, shared under the CC BY-SA 3.0 license (http://www.lrec-conf.org/proceedings/lrec2014/pdf/535_Paper.pdf)

Regards,
Nishan

Daniel Povey

Oct 12, 2015, 1:33:13 PM
to kaldi-help
> First of all, my apologies about posting too many questions to this forum
> recently. But I could really use the help from the experts in this forum.
>
> I'm building an application for 8 kHz telephonic speech recognition for
> domain-specific English. I have tried using the nnet2 model trained with
> Fisher dataset with a custom language model, but the results I got are
> unsatisfactory. Therefore, I decided to train a model myself using some
> transcriptions I have for my domain specific recordings. Initially, I'm
> planning to follow the vystadial_en recipe [1] because I believe preparing
> my training data in the same format as the vystadial_en dataset [2] is
> relatively easy.
>
> I have carefully read the documentation
> http://kaldi.sourceforge.net/data_prep.html. But I have some specific
> questions about the data preparation.
>
> 1. Is it a must that my train recordings are segmented? (do I have to have
> only one utterance per file, or provide a segmentation file). Is there a way
> to automate this segmentation process?

Let me rephrase your question: "is it best if I train on short utterances?"
An utterance is whatever you define it to be, i.e. a single line in
the 'text' file or 'feats.scp'.
Yes, it's generally better to train on short utterances, but it's not a must.
Once you have a trained system, you can use split_long_utterance.sh to
segment your data; there is an example of its use in the WSJ example scripts.
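For concreteness, here is a minimal sketch of the data-directory files involved; the recording/utterance/speaker IDs and the wav path below are made up, so adapt them to your corpus. Each utterance is one line in 'text', and the optional 'segments' file maps utterance IDs to time ranges within a longer recording:

```shell
mkdir -p data/train

# wav.scp: <recording-id> <path-or-command>  (one line per recording)
cat > data/train/wav.scp <<'EOF'
call_001 /corpus/audio/call_001.wav
EOF

# segments: <utterance-id> <recording-id> <start-sec> <end-sec>
# Omit this file entirely if each wav already holds exactly one utterance.
cat > data/train/segments <<'EOF'
call_001-0001 call_001 0.50 4.20
call_001-0002 call_001 5.10 9.80
EOF

# text: <utterance-id> <transcript>  (one utterance per line)
cat > data/train/text <<'EOF'
call_001-0001 hello thank you for calling
call_001-0002 how can i help you today
EOF

# utt2spk: <utterance-id> <speaker-id>
cat > data/train/utt2spk <<'EOF'
call_001-0001 spk001
call_001-0002 spk001
EOF
```

Kaldi expects these files sorted, and utils/validate_data_dir.sh (with --no-feats before feature extraction) will catch ordering and consistency problems.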

> 2. Is it acceptable to have long silences (say more than 1 second, varies
> for each recording), at the beginning and the end of each recording? If not,
> what would be the ideal duration of the silences at the beginning and the
> end of each recording?

It's fine to have long silences. The silence duration isn't super critical.
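If you want to check how much leading/trailing silence your recordings actually contain, a crude amplitude-threshold pass is enough for a sanity check. This is only a sketch: the threshold of 500 is an arbitrary assumption for 16-bit PCM, not a real voice-activity detector.

```python
import array
import wave

def edge_silence_seconds(path, threshold=500):
    """Return (leading, trailing) silence durations in seconds for a
    mono 16-bit PCM wav, calling a sample 'silent' when its absolute
    amplitude is below `threshold` (an arbitrary cutoff, not real VAD)."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    voiced = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not voiced:  # the whole file is below threshold
        dur = len(samples) / rate
        return dur, dur
    leading = voiced[0] / rate
    trailing = (len(samples) - 1 - voiced[-1]) / rate
    return leading, trailing
```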

> 3. Finally, some of the transcriptions have the tag [indistinguishable] for
> some words and short sequences of words that the human transcriber could not
> understand. If I replace these with <unk>, would that be the correct way to
> handle these?

There is nothing special about the symbol "<unk>" as far as Kaldi is
concerned, though it may be used in the vystadial recipe as the
unknown-word symbol (see oov.txt in lang/), which will usually have its
own phone. Yes, you can map [indistinguishable] to that symbol, or
leave it as-is and not supply a dictionary entry, in which case it will
automatically be mapped to the unknown word.
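A one-line mapping along the lines Dan describes might look as follows; the file contents and the tag spelling here are made up from the question, and you should check oov.txt in your lang directory for the actual unknown-word symbol before assuming it is "<unk>":

```shell
# Hypothetical transcript file in Kaldi 'text' format.
cat > text <<'EOF'
call_001-0001 hello [indistinguishable] thanks for calling
call_001-0002 [indistinguishable] help you today
EOF

# Map the transcriber's tag to the OOV symbol.
sed -i 's/\[indistinguishable\]/<unk>/g' text
```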

Dan


Tony Robinson

Oct 12, 2015, 3:06:00 PM
to kaldi...@googlegroups.com
On 12/10/15 18:33, Daniel Povey wrote:
> Let me rephrase your question: "is it best if I train on short utterances?".
> An utterance is whatever you define it to be, i.e. a single line in the
> 'text' file or 'feats.scp'. Yes, it's generally better to train on short
> utterances, but not a must.

I asked a question about training length at the start of this year and I don't think I ever got back to the list with the results, probably because it was a null result: the difference between training on 1-15 s segments and our old default of 17-45 s was very small and probably not statistically significant.

This was NNET1 using fMLLR. iVectors, LSTMs, CTC, and other developments will all change the numbers. It's better to segment at something like 30 s or less and be part of the technological advances than to worry too much about the exact splitting length.


Tony

--
Speechmatics is a trading name of Cantab Research Limited
We are hiring: www.speechmatics.com/careers
Dr A J Robinson, Founder, Cantab Research Ltd
Phone direct: 01223 794096, office: 01223 794497
Company reg no GB 05697423, VAT reg no 925606030
51 Canterbury Street, Cambridge, CB4 3QG, UK

nishan....@gmail.com

Oct 15, 2015, 1:51:35 AM
to kaldi-help
Thanks very much Dan and Tony!