Re: Command Line Arguments

Petr Schwarz

Aug 23, 2012, 5:00:00 AM
to phn...@googlegroups.com

Dear Fatima,

The TIMIT recognizer is definitely not the best one for the NIST data. Use one
of the SpeechDat recognizers instead. TIMIT is wideband microphone speech and
NIST SRE is 8 kHz telephone speech, so half of the band is
missing. It will either not work or give very bad results. The
Hungarian recognizer was the best for NIST LRE.

Some examples of command lines are here:
http://speech.fit.vutbr.cz/software/phoneme-recognizer-based-long-temporal-context

Make sure that the speech is sampled at 8 kHz.
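
One way to check the sampling rate before running the recognizer is to read the WAV header directly. The sketch below is only an illustration and assumes a standard RIFF/WAVE container. The function name `read_wav_format` is hypothetical and not part of the recognizer. The header is parsed by hand because mu-law files carry format tag 7, which some WAV readers reject.

```python
import struct

def read_wav_format(data: bytes) -> dict:
    """Parse the 'fmt ' chunk from the first bytes of a RIFF/WAVE file."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    # Walk the chunk list until the 'fmt ' chunk is found.
    while pos + 8 <= len(data):
        chunk_id, size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, pos + 8
            )
            return {
                "format_tag": tag,        # 1 = PCM, 6 = A-law, 7 = mu-law
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        pos += 8 + size + (size & 1)      # chunks are padded to an even length
    raise ValueError("no 'fmt ' chunk found")
```

For a file on disk, pass the first few kilobytes, for example `read_wav_format(open("utt.wav", "rb").read(4096))`, where `utt.wav` is a placeholder name. A `sample_rate` other than 8000 means the audio needs resampling first.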

Petr

On 23.8.2012 6:19, Nakhat Fatima wrote:
> Hello All,
>
> I am kind of new to the recognizer. I was wondering if I could be
> guided as to the command line arguments, like what does -v mean and so
> forth.
> I am using NIST SRE 2008 (short 2) data converted to mono wav (8 bit
> ulaw). If anyone could mention the arguments which would make the
> results better, I'd be greatly thankful! :)
> Also, I was wondering if TIMIT based recognition is a better idea for
> NIST data, or should I use recognizer from the other three languages?
>
> I shall be very grateful for a prompt reply!
> cheers!
>
> - Nakhat Fatima Maissom

Nakhat Fatima

Oct 25, 2012, 2:45:49 AM
to phn...@googlegroups.com
Thank you for your reply. The Hungarian recognizer has worked considerably better on NIST data.

I actually have another query. I don't quite understand the start and end times.
000000 300000 int -8.106743
300000 6300000 int -68.401741
6300000 7000000 E -14.546829

I do not understand what a value such as 300000 corresponds to. I need to convert it to a byte offset in the file. I shall be very thankful for your prompt reply!

Best regards,
Nakhat Fatima Maissom
--
Nakhat Fatima

Pavel Matejka

Oct 25, 2012, 3:30:45 AM
to phn...@googlegroups.com, Nakhat Fatima
Hi Nakhat,
you can have a look at the HTK documentation, but in short the times are in units of 100 nanoseconds, which means that if you divide the number by
100000 you get the position in frames with a 10 ms shift, or by
10000000 you get it in seconds.
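
The conversion can be sketched in Python as follows. This is a minimal illustration, not part of the recognizer: all function names are hypothetical, and the byte-offset helper assumes 8 kHz mono audio with one byte per sample (8-bit mu-law, as in the question). The size of the WAV header varies between files, so it is left as a parameter and has to be determined from the actual file.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # label times are in units of 100 ns

def parse_label_line(line: str):
    """Split one output line into (start, end, label, score)."""
    start, end, label, score = line.split()
    return int(start), int(end), label, float(score)

def htk_to_seconds(t: int) -> float:
    return t / HTK_UNITS_PER_SECOND

def htk_to_frames(t: int, frame_shift_ms: int = 10) -> int:
    # 1 ms = 10,000 units of 100 ns, so a 10 ms shift is 100,000 units.
    return t // (frame_shift_ms * 10_000)

def htk_to_byte_offset(t: int, sample_rate: int = 8000,
                       bytes_per_sample: int = 1, header_bytes: int = 0) -> int:
    # Assumption: mono audio; header_bytes must come from the real file.
    samples = t * sample_rate // HTK_UNITS_PER_SECOND
    return header_bytes + samples * bytes_per_sample

start, end, label, score = parse_label_line("300000 6300000 int -68.401741")
print(htk_to_seconds(start), htk_to_frames(start), htk_to_byte_offset(start))
```

With these assumptions, 300000 corresponds to 0.03 s, frame 3, and sample (byte) 240 of the audio data, before adding the header size.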

If you have your own, different VAD, try to fuse the two systems. Our experience is that two different VADs or two different feature extractions are usually complementary.

Best regards
Pavel

On 25.10.2012 8:45, Nakhat Fatima wrote:
-- 

 Ing. Pavel Matejka, PhD      E-mail: mate...@fit.vutbr.cz
 UPGM FIT VUT Brno, L226      Web:    http://www.fit.vutbr.cz/~matejkap
 Bozetechova 2, 612 66        Phone:  +420 54114-1283
 Brno, Czech Republic         Fax:    +420 54114-1290

Nakhat Fatima

Oct 26, 2012, 5:57:45 AM
to Pavel Matejka, phn...@googlegroups.com
Hi

I have another query.
I am using the phoneme recognizer so that I can use it later for speaker recognition. I am using the NIST Speaker Recognition Evaluation data from 2008 (the short 2 type).
What I wanted to ask is: what is the performance of the system (as a percentage or similar)? Is there any way to determine the performance of the recognizer on NIST data?

I have used the Hungarian recognizer, as you suggested earlier.

Thank you very much for your time!!

Best regards
Nakhat Fatima Maissom
--
Nakhat Fatima

Petr Schwarz

Oct 29, 2012, 4:24:10 PM
to phn...@googlegroups.com
Dear Nakhat,

On 26.10.2012 11:57, Nakhat Fatima wrote:
> Hi
>
> I have another query.
> I am using the phoneme recognizer to use it later for speaker recognition
> purpose. I am using NIST Speaker Recognition Evaluation data of year 2008.
> (that too the short 2 type)
> What I wanted to ask was, what is the performance of the system (as in
> percentage or something).

> Is there any way to determine the performance of
> the recognizer on NIST data?
> I have used Hungarian recognizer, as suggested by you earlier.

Probably not. We use PER (phoneme error rate) for measurement.
It is the same measure as WER, applied to phoneme strings instead of word
strings. For PER you need a phoneme alignment, which is usually obtained
from the word transcription and a pronunciation lexicon using forced
alignment. In the case of NIST, I am not sure whether there is
any transcribed data. The languages also differ. It is easier to
measure the accuracy on the target task (LID, SID). It is not important
to recognize the phonemes correctly, but to get consistent strings.
The phoneme accuracy (100% minus PER) is about 75% for clean read telephone
speech. It can go down to 50% for low-quality conversational speech. Of
course, it is highly dependent on the quality of the reference. It is
possible to reach higher numbers with speaker adaptation.
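
For readers who do have reference phoneme strings, PER can be computed the same way as WER: the edit distance between reference and recognized strings, divided by the reference length. The sketch below is a generic illustration using a standard Levenshtein distance. It is not the scoring tool used by the authors, and the function names and example phoneme labels are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phoneme
                curr[j - 1] + 1,         # insertion of an extra phoneme
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER in percent: (substitutions + deletions + insertions) / len(ref)."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

ref = ["int", "E", "l", "o"]  # made-up reference labels
hyp = ["int", "e", "l"]       # made-up recognizer output
per = phoneme_error_rate(ref, hyp)
print(f"PER = {per:.1f}%, accuracy = {100.0 - per:.1f}%")
```

Here one substitution and one deletion against four reference phonemes give a PER of 50%, which corresponds to a phoneme accuracy of 50%.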

Petr
