Data preparation to build a French ASR system

Thomas D

Mar 21, 2019, 10:51:10 AM

to kaldi-help

Hi everyone,
I’m new to the Kaldi toolkit. I have two goals: 1) build a French ASR system from my own database (around 30 hours of speech with transcriptions) using MFCC features, and 2) test some custom features as a replacement for MFCCs and evaluate recognition at the word and phoneme levels.

For now I’m at the very beginning of my project, and I am starting by preparing my database.

My database consists of wav files, orthographic transcriptions, a pronunciation dictionary, and alignments at the sentence, word, and phoneme levels. It is already split into a train set and a test set. From what I understand, I won’t be able to use the phoneme alignments, as Kaldi computes them itself.

I’ve looked at the documentation, but there are a lot of basic questions I can’t find answers to. I would be grateful if you could answer some of them :)

I am preparing the first data files: text, wav.scp, segments and utt2spk:

What encoding should I use for those files? Is UTF-8 OK?
The documentation says that the wav.scp file should follow the format “<recording-id> <extended-filename>”. Can the file paths be relative? If so, which directory are they relative to?
For the segments file, I’m not sure what the best segmentation is. I can map utterances to sentences, to sub-sentences (I can split my sentences at silences), or even to words. Which would you recommend?
Do my transcriptions need to be normalized? Is it OK to use accented characters like à é è ù ç ï ô? What about punctuation?
Can I use a special word in my transcriptions to mark noise or silence?

Obviously, the next thing I will have to do is build an acoustic model:

Is it OK to use non-alphanumeric characters to represent phonemes, like “A/”, “O~” or “@”?
My lexicon uses a lot of pronunciation variants, and some of them end with a silence. Is that a problem? Is the following the right format for the lexicon when there are variants? For example:
purées p y R e z
purées(2) p y R e
purées(3) p y R e sil

I don’t know exactly which recipe to choose for the next steps. Do you have any advice given my goals and database? Is there a comparison somewhere?

One last question: is it normal that I need a VPN to access the Kaldi documentation? (I’m in France.) Can it be downloaded in full somewhere?

I know it’s a lot of questions, but if you can help on even one point you’ll make my day!

Best regards,
Thomas Debeuret

Daniel Povey

Mar 21, 2019, 11:35:46 AM

to kaldi-help

I am preparing the first data files: text, wav.scp, segments and utt2spk:
What encoding should I use for those files? Is UTF-8 OK?

That's fine.
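
As an illustration of the UTF-8 point (not part of the original reply), here is a minimal Python sketch that writes a `text` file in UTF-8. The helper name `write_kaldi_table` and the utterance IDs are hypothetical. Kaldi's data-directory validation expects these files to be sorted in C-locale (byte) order; the sketch assumes that Python's default string sort, which orders by code point, yields the same order for UTF-8 data.

```python
from pathlib import Path
import tempfile

def write_kaldi_table(path, rows):
    """Write '<key> <value>' lines as UTF-8, sorted by key in code-point order."""
    lines = [f"{key} {value}\n" for key, value in sorted(rows.items())]
    # newline="\n" avoids CRLF line endings when run on Windows
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.writelines(lines)

# Hypothetical utterance IDs and transcripts, accents left as they are
transcripts = {
    "spk1-utt2": "les purées sont prêtes",
    "spk1-utt1": "ça va très bien",
}

out_dir = Path(tempfile.mkdtemp())
write_kaldi_table(out_dir / "text", transcripts)
written = (out_dir / "text").read_text(encoding="utf-8")
print(written)
```

The same helper could be reused for `utt2spk`, since that file also has one key and one value per line.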

The documentation says that the wav.scp file should follow the format “<recording-id> <extended-filename>”. Can the file paths be relative? If so, which directory are they relative to?

Absolute is better.
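
One possible way to follow this advice, sketched in Python (not part of the original reply): build the `wav.scp` lines from a directory of wav files and resolve every path to an absolute one. The function name `build_wav_scp`, the use of the file stem as recording ID, and the file names are assumptions made for the example.

```python
from pathlib import Path
import tempfile

def build_wav_scp(wav_dir):
    """Return sorted '<recording-id> <absolute-path>' lines for every .wav file."""
    entries = {}
    for wav in Path(wav_dir).glob("*.wav"):
        # Assumption: the file stem serves as the recording ID
        entries[wav.stem] = wav.resolve()
    return [f"{rec_id} {path}" for rec_id, path in sorted(entries.items())]

# Demo with empty placeholder files in a temporary directory
demo_dir = Path(tempfile.mkdtemp())
for name in ("rec_b.wav", "rec_a.wav"):
    (demo_dir / name).touch()

scp_lines = build_wav_scp(demo_dir)
print("\n".join(scp_lines))
```

With absolute paths, the resulting file does not depend on the directory from which the recipe scripts happen to be launched.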

For the segments file, I’m not sure what the best segmentation is. I can map utterances to sentences, to sub-sentences (I can split my sentences at silences), or even to words. Which would you recommend?

Aim for between 5 and 30 seconds, I would say.
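
A rough Python sketch of how the 5 to 30 second guideline could be checked while writing the segments file (not part of the original reply). Each line of that file has the form `<utterance-id> <recording-id> <segment-begin> <segment-end>`, with times in seconds. The function name `format_segments` and the sample IDs and times are hypothetical.

```python
def format_segments(segments, min_dur=5.0, max_dur=30.0):
    """Return (lines, flagged): segments-file lines plus IDs outside the duration range."""
    lines, flagged = [], []
    for utt_id, rec_id, start, end in sorted(segments):
        if end <= start:
            raise ValueError(f"{utt_id}: end time must be after start time")
        if not (min_dur <= end - start <= max_dur):
            flagged.append(utt_id)
        lines.append(f"{utt_id} {rec_id} {start:.2f} {end:.2f}")
    return lines, flagged

# Hypothetical segments: (utterance-id, recording-id, start, end) in seconds
segments = [
    ("rec_a-0002", "rec_a", 12.40, 14.10),  # about 1.7 s, shorter than suggested
    ("rec_a-0001", "rec_a", 0.00, 12.40),
]

seg_lines, flagged = format_segments(segments)
print("\n".join(seg_lines))
print("outside 5-30 s:", flagged)
```

Flagged utterances are still written out here; merging short neighbouring segments or splitting long ones at silences is left to the data owner.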

Do my transcriptions need to be normalized? Is it OK to use accented characters like à é è ù ç ï ô? What about punctuation?

It does not need to be normalized.

Can I use a special word in my transcriptions to mark noise or silence?

You can, but it generally will not help.

Obviously, the next thing I will have to do is build an acoustic model:
Is it OK to use non-alphanumeric characters to represent phonemes, like “A/”, “O~” or “@”?

Yes, it's OK.

My lexicon uses a lot of pronunciation variants, and some of them end with a silence. Is that a problem?

It's definitely not normal, but it is allowed.

Is the following the right format for the lexicon when there are variants? For example:
purées p y R e z
purées(2) p y R e
purées(3) p y R e sil

I recommend just choosing the canonical form, e.g. the first one, and letting the acoustic model figure out the rest. And you shouldn't have the (2) and (3).
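
The cleanup suggested here could look roughly like the following Python sketch (not part of the original reply). It strips the `(2)`, `(3)` suffixes from the word column and, by default, keeps only the first pronunciation listed for each word, on the assumption that the first entry is the canonical one. The function name `clean_lexicon` is hypothetical, and the regular expression would also alter any real word that happens to end in a parenthesised number.

```python
import re

# Matches a trailing variant marker such as "(2)" on the word column
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")

def clean_lexicon(lines, keep_first_only=True):
    """Strip '(n)' suffixes from words; optionally keep only each word's first entry."""
    seen = set()
    cleaned = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word = VARIANT_SUFFIX.sub("", parts[0])
        if keep_first_only and word in seen:
            continue
        seen.add(word)
        cleaned.append(" ".join([word] + parts[1:]))
    return cleaned

raw = ["purées p y R e z", "purées(2) p y R e", "purées(3) p y R e sil"]
canonical = clean_lexicon(raw)
all_variants = clean_lexicon(raw, keep_first_only=False)
print(canonical)
print(all_variants)
```

Passing `keep_first_only=False` keeps every variant as a repeated line with the same word, which is how a Kaldi lexicon lists multiple pronunciations when they are wanted.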

I don’t know exactly which recipe to choose for the next steps. Do you have any advice given my goals and database? Is there a comparison somewhere?

Start with mini_librispeech.

One last question: is it normal that I need a VPN to access the Kaldi documentation? (I’m in France.) Can it be downloaded in full somewhere?

No, it is not normal. How does it fail from your IP address?

I know it’s a lot of questions, but if you can help on even one point you’ll make my day!

At least the questions were clear.

Best regards,
Thomas Debeuret

Thomas D

Mar 21, 2019, 3:46:11 PM

to kaldi-help

Hi Dan, thank you very much for the clarifications and the quick response.

I think the problem with the docs came from the network I was connected to.

BR,

Thomas D
