Hi everyone,
I’m a new user of the Kaldi toolkit. I have two goals: 1) build a French ASR system on my own database (around 30 h of speech with transcriptions) using MFCC features, and 2) test some custom features as a replacement for the MFCCs and evaluate recognition at the word and phoneme levels.
For now I’m at the very beginning of my project, and I am starting by preparing my database.
My database is composed of wav files, orthographic transcriptions, a pronunciation dictionary, and alignments at the sentence, word, and phoneme levels. It is already split into a train set and a test set. From what I’ve understood, I won’t be able to use the phoneme alignments, since Kaldi computes them itself.
I’ve looked through the documentation, but there are a lot of basic questions I can’t find answers to. I would be grateful if you could answer some of them :)
I am preparing the first data files (text, wav.scp, segments and utt2spk):
- What encoding should I use for those files? Is UTF-8 OK?
- The documentation says that the wav.scp file should follow the format “<recording-id> <extended-filename>”. Can the file paths be relative? If so, relative to which directory are they resolved?
- For the segments file, I’m not sure which segmentation is best. I can map utterances to sentences, to sub-sentences (I can split my sentences at silences), or even to words. Which would you recommend?
- Do my transcriptions need to be normalized? Is it OK to use accented characters like à é è ù ç ï ô? What about punctuation?
- Can I use a special word in my transcriptions to mark noise or silence?
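To make the questions above concrete, here is a minimal Python sketch of how I would generate the four files. The utterance IDs, speaker IDs, paths, timings and sentences are made up for illustration, and the column layouts (including sorting by ID and prefixing utterance IDs with the speaker ID) are only my reading of the data preparation documentation, so please correct me if they are wrong.

```python
import os
import tempfile

# Hypothetical data: (utt_id, recording_id, speaker_id, start_s, end_s, transcript)
UTTERANCES = [
    ("spk02-rec002-0001", "spk02-rec002", "spk02", 0.50, 3.20, "où est la gare"),
    ("spk01-rec001-0001", "spk01-rec001", "spk01", 0.00, 2.35, "les purées sont prêtes"),
    ("spk01-rec001-0002", "spk01-rec001", "spk01", 2.80, 5.10, "ça va très bien"),
]

# Hypothetical absolute paths; whether relative paths work is one of my questions.
RECORDINGS = {
    "spk01-rec001": "/data/wav/spk01/rec001.wav",
    "spk02-rec002": "/data/wav/spk02/rec002.wav",
}


def write_data_dir(out_dir, utterances, recordings):
    """Write text, segments, utt2spk and wav.scp as UTF-8, sorted by ID."""
    os.makedirs(out_dir, exist_ok=True)
    utts = sorted(utterances, key=lambda u: u[0])

    def write(name, lines):
        path = os.path.join(out_dir, name)
        # newline="\n" forces Unix line endings on every platform.
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            for line in lines:
                f.write(line + "\n")

    # text:     <utterance-id> <transcript>
    write("text", [f"{u[0]} {u[5]}" for u in utts])
    # segments: <utterance-id> <recording-id> <start-seconds> <end-seconds>
    write("segments", [f"{u[0]} {u[1]} {u[3]:.2f} {u[4]:.2f}" for u in utts])
    # utt2spk:  <utterance-id> <speaker-id>
    write("utt2spk", [f"{u[0]} {u[2]}" for u in utts])
    # wav.scp:  <recording-id> <extended-filename>
    write("wav.scp", [f"{rec} {path}" for rec, path in sorted(recordings.items())])


out_dir = tempfile.mkdtemp()
write_data_dir(out_dir, UTTERANCES, RECORDINGS)
print(sorted(os.listdir(out_dir)))
```

Here each utterance is one sentence; mapping utterances to sub-sentences or words would only change the rows of UTTERANCES.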
Obviously, the next thing I will have to do is build an acoustic model:
- Is it OK to use non-alphanumeric characters to represent phonemes, like “A/”, “O~” or “@”?
- My lexicon contains a lot of pronunciation variants, and some of them end with a silence. Is that a problem? Is the following the right format for the lexicon when there are variants? For example:
purées p y R e z
purées(2) p y R e
purées(3) p y R e sil
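I have also seen lexicons where variants are simply repeated lines with the same word and no “(2)” or “(3)” suffix. If that is what Kaldi expects (this is an assumption on my side), I could convert my lexicon with something like the sketch below. The function name is just mine, and the silence-final variant is kept as-is since I don’t know yet whether it is a problem.

```python
import re

# Matches a trailing variant marker such as "(2)" or "(13)" on the word.
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def strip_variant_suffixes(lines):
    """Return lexicon lines with "(n)" suffixes removed and exact duplicates dropped."""
    entries = []
    seen = set()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            # Skip empty lines and words without any phoneme.
            continue
        word = VARIANT_SUFFIX.sub("", parts[0])
        phones = parts[1:]
        key = (word, tuple(phones))
        if key not in seen:
            seen.add(key)
            entries.append(" ".join([word] + phones))
    return entries


raw_lexicon = [
    "purées p y R e z",
    "purées(2) p y R e",
    "purées(3) p y R e sil",
]
converted = strip_variant_suffixes(raw_lexicon)
for entry in converted:
    print(entry)
```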
I don’t know exactly which recipe to choose for the next steps. Do you have any advice given my goals and database? Is there a comparison of the recipes somewhere?
And one last question: is it normal that I need a VPN to access the Kaldi documentation? (I’m in France.) Can it be downloaded in full somewhere?
I know it’s a lot of questions, but if you can help on even one point you’ll make my day!