best results on TIMIT


Chen,Xin,fAnSKyer

Feb 25, 2010, 11:50:36 PM
to phn...@googlegroups.com
I recently read a paper, "Discriminative training for large vocabulary
speech recognition using Minimum Classification Error", in IEEE ASLP.
The result for the TIMIT task is amazingly good: using discriminatively
trained HMMs with only 8 mixtures, they reached 78% accuracy on the
192-utterance core test set. They didn't say anything about the
1344-utterance complete test set [I really want to know why], but from
my experience the core set is tougher.

Anyone familiar with TIMIT have some opinions on this?

Thanks,
Best,

Chen

Petr Schwarz

Feb 26, 2010, 6:03:32 PM
to phn...@googlegroups.com
Dear Chen,

I have not found enough time to study the article in detail yet, but
they mention that the results are on the phoneme classification task,
so the segmentation is known. This gives a big advantage over phoneme
recognition, where you also need to align the speech to segments. If I
remember well, in our experiments the phoneme classification accuracy
was about 83% for a system with about 78% phoneme recognition accuracy.
Still, I believe that discriminative criteria are beneficial. We are
training the neural networks using frame-based criteria now, but
sentence-based criteria like MPE, MCE, or boosted MMI can bring another
improvement to the system. Brian Kingsbury from IBM built a whole LVCSR
system on such criteria and got nice results.
I also believe it is possible to get similar results from some
GMM-based methods, but the design of such a system is more difficult.
For example, we worked on the SGMM approach last year at JHU:
http://www.clsp.jhu.edu/workshops/ws09/documents/FinalTalksConcatenated.pdf
This is one promising approach for small amounts of training data,
although it was not tested on TIMIT.
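[Editor's note: the classification-vs-recognition distinction above is the crux, so here is a minimal illustrative Python sketch (not from the thread; the toy phone sequences are made up). Classification scores one label per given segment; recognition must align a free-running hypothesis against the reference, so insertions and deletions also count as errors.]

```python
def classification_accuracy(ref, hyp):
    """Segmentation known: compare label-by-label over the same segments."""
    assert len(ref) == len(hyp)
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)

def recognition_accuracy(ref, hyp):
    """Segmentation unknown: Levenshtein alignment, Acc = (N - S - D - I) / N."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return (n - d[n][m]) / n

ref = ["sil", "b", "ih", "t", "sil"]
hyp = ["sil", "b", "ih", "ih", "t", "sil"]   # one spurious insertion
print(recognition_accuracy(ref, hyp))        # 0.8
```

With a known segmentation the spurious "ih" could never appear, which is why classification numbers run several points higher than recognition numbers on the same system.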

Petr

Chen,Xin,fAnSKyer

Feb 27, 2010, 1:50:51 AM
to phn...@googlegroups.com
Dear Petr

Thanks a lot for pointing this out; I had missed that point. It makes
much more sense now. :P

I have one more follow-up question.
Have you compared:
A. Training a 48-phoneme model and then folding the results down to
39 phonemes for scoring, versus
B. Training a 39-phoneme system directly?

Is there much difference between methods A and B?
If you did the comparison, what was the difference in your system?

Thanks a lot,
Best,

-Chen

--
----------------------------------------------------

Chen,Xin,fAnSKyer
Motto: "I am tough, I can handle it"
(BLOG) http://feedproxy.google.com/fanka
(PHOTO) http://picasaweb.google.com/fanskyer

Petr Schwarz

Feb 27, 2010, 6:36:04 AM
to phn...@googlegroups.com
Dear Chen,

I tried to find the results, but without success. I ran the experiment
for one of our earlier systems during development.
There was not much difference between 39 and 48 phonemes for phoneme
recognition; if I remember well, the difference was less than 0.5%.
I did not verify this conclusion with the most recent systems. We saw
an advantage from more phonemes (or context-dependent states) when the
posteriors were used for other tasks, like LVCSR.
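[Editor's note: method A from the question (decode with the larger phone set, fold only for scoring) can be sketched as below. The fold entries shown are a few of the commonly used Lee-and-Hon-style merges; they are illustrative assumptions here, not the full 48-to-39 table.]

```python
# Illustrative partial fold map (assumption, not the complete standard table):
# phones absent from the map score as themselves.
FOLD_48_TO_39 = {
    "ao": "aa", "ax": "ah", "ix": "ih",
    "el": "l",  "en": "n",  "zh": "sh",
    "cl": "sil", "vcl": "sil", "epi": "sil",
}

def fold(phones):
    """Map each decoded phone to its 39-class scoring label."""
    return [FOLD_48_TO_39.get(p, p) for p in phones]

print(fold(["ao", "b", "ix"]))  # ['aa', 'b', 'ih']
```

The model still discriminates the finer classes during training and decoding; only the scoring collapses them, which is why methods A and B can land within a fraction of a percent of each other.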

Petr

(I am out of office next week)
