question about the "text" file format in data preparation

euisun...@gmail.com

Mar 17, 2016, 11:48:59 PM

to kaldi-help

Hello, I have some questions regarding the "text" file format in the data preparation step.

Consider the following line in the "text" file.

AlGore_2009-0001304-0002346 last year i showed these two slides so that demonstrated something

This means that "AlGore_2009" is speaking from "0001304" to "0002346" and the transcription is "last year...", right?

What if the audio file I have (either *.sph or *.wav) is not a long speech but just a single sentence? In that case, what should I put in for the start and end times?

Consider the following line in my "text" file.

sp01_train_sn0 the birch canoe slid on the smooth planks

Can this mean that "sp01_train_sn0" is speaking from the start to the end of the audio and the transcription is "the birch canoe..."?

I also noticed while writing this question that I have two underscores as part of the speaker name. I think that could cause problems in differentiating between the speaker and the utterance. Would it? If so, would changing "sp01_train_sn0" to "sp01_trainsn0" solve the problem?

Regardless of whether I messed up the underscore part, can an entry in the "text" file with no time information be treated as covering the audio from start to end?

If not, what would you suggest?

I would appreciate your help.

Daniel Povey

Mar 18, 2016, 12:05:02 AM

to kaldi-help

Those utterance-id identifiers are never interpreted by Kaldi; as far as Kaldi is concerned they are just strings. However, you do want to make sure that when the utterance-ids and their corresponding speaker-ids are sorted, the orderings are compatible, so it can make sense to make the speaker-id a prefix of the utterance-id. Otherwise it's up to whoever formatted the database.
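
To illustrate the point about compatible orderings, here is a minimal sketch in Python (not part of Kaldi). The function name and the example IDs are hypothetical, and it assumes plain ASCII identifiers, for which Python's default string sort matches a byte-wise (LC_ALL=C style) sort.

```python
def orderings_compatible(utt2spk):
    """Check that sorting by utterance-id keeps speaker-ids in sorted order.

    utt2spk: dict mapping utterance-id -> speaker-id (hypothetical input).
    """
    # Python compares str by code point, which is byte order for ASCII IDs.
    utts = sorted(utt2spk)
    spks = [utt2spk[u] for u in utts]
    # Compatible means the speaker column is already sorted.
    return spks == sorted(spks)


# Speaker-id used as a prefix of the utterance-id: orderings agree.
prefixed = {"sp01_sn0": "sp01", "sp01_sn1": "sp01", "sp02_sn0": "sp02"}

# Speaker-id placed at the end: sorting by utterance-id scrambles the speakers.
suffixed = {"sn0_sp02": "sp02", "sn1_sp01": "sp01"}

if __name__ == "__main__":
    print(orderings_compatible(prefixed))
    print(orderings_compatible(suffixed))
```

A check like this can be run on your own utt2spk mapping before handing the data directory to the recipe scripts.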

Dan


euisun...@gmail.com

Mar 18, 2016, 1:06:01 AM

to kaldi-help, dpo...@gmail.com

Thank you for your comment. I'm still confused about this whole "files I need to create myself" part, though.
I would like to generalize my question a little. All I'm trying to do is test recognition performance on noisy speech using already trained models.
I've successfully run the "run.sh" script in TED-LIUM; if it helps you recall, it has "/test", "/dev", and "/train" directories with *.sph files inside. (I'm planning to use "score_kaldi.sh" in decoding, so I'm assuming I don't need stm files.)

So my intuition was to comment out the training parts of the "run.sh" script and put, let's say, "/test_noisy" in place of "/test" and "/dev" in the decoding parts.
The problem is that these noisy *.sph files (as mentioned in my first post) each contain only one sentence rather than a long speech. In "/test/sph" or "/dev/sph" there are about 10 speaker *.sph files, each containing more than one sentence. In "/test_noisy/sph" there are about 6 speakers but 30 *.sph files: 5 *.sph files per speaker, each containing only one sentence. This noisy corpus format confuses me when I try to create the "utt2spk", "text" and "wav.scp" files.

Am I missing something elementary here? I apologize if my lack of understanding of the structure is bothering you. Is this something I just have to put more time into to grasp the bigger picture, or can it be easily answered based on the understanding I have now?

Daniel Povey

Mar 18, 2016, 1:25:14 AM

to EUISUNG KIM, kaldi-help

The 'segments' file is optional; if you want to recognize each .sph file as a single utterance, you won't have it. If you want, the utterance names, the speaker names, and the names of the .sph files can all be identical.
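
As a rough sketch of the one-utterance-per-file case without a 'segments' file, the Python below builds the lines of "text", "utt2spk" and "wav.scp" from a list of recordings. The function name, the second utterance and the paths are hypothetical placeholders, and it assumes plain .wav paths; for .sph audio the wav.scp entry would usually be a conversion command (for example via sph2pipe) rather than a bare path.

```python
def make_data_lines(utterances):
    """Build the lines of text, utt2spk and wav.scp, one utterance per file.

    utterances: list of (utt_id, spk_id, audio_path, transcript) tuples.
    With no 'segments' file, the first column of wav.scp is the utterance-id.
    """
    # Sort by utterance-id so all three files share the same ordering.
    rows = sorted(utterances, key=lambda r: r[0])
    text = [f"{utt} {transcript}" for utt, _, _, transcript in rows]
    utt2spk = [f"{utt} {spk}" for utt, spk, _, _ in rows]
    wav_scp = [f"{utt} {path}" for utt, _, path, _ in rows]
    return text, utt2spk, wav_scp


# Placeholder entries: speaker-id "sp01" is a prefix of each utterance-id.
example = [
    ("sp01_train_sn1", "sp01", "/path/to/sp01_train_sn1.wav",
     "placeholder transcript"),
    ("sp01_train_sn0", "sp01", "/path/to/sp01_train_sn0.wav",
     "the birch canoe slid on the smooth planks"),
]

if __name__ == "__main__":
    text, utt2spk, wav_scp = make_data_lines(example)
    for name, lines in (("text", text), ("utt2spk", utt2spk), ("wav.scp", wav_scp)):
        print(name)
        print("\n".join(lines))
```

Each list can then be written to the file of the same name in the data directory, one entry per line.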

Dan
