
What are Best Practices for Collecting Speech for a Free GPL Speech Corpus?


kendm...@gmail.com

Jan 21, 2007, 3:48:35 PM1/21/07
Hi,

I am the admin for the VoxForge project. We are collecting user
submitted speech for incorporation into a GPL Acoustic Model ('AM').
Currently we have a Julius/HTK AM being created daily, incorporating
newly submitted audio on a nightly basis.

I am unsure which approach to take in creating the VoxForge speech
corpus. Up until now, we have been asking users to submit 'clean'
speech - that is, to record their submissions so that all non-speech
noise (echo, hiss, ...) is kept to an absolute minimum. One guy (very
ingeniously, I thought) records his submissions in his closet or in
his car!

But some people, whose opinions I respect, say that I should not be
collecting clean speech; instead I should be collecting speech in its
'natural environment', warts and all, with echo and hiss and all that
(while avoiding other background noise such as people talking, radios
or TVs, ...). On some submissions, the hiss is very noticeable.

What confuses me is that some speech recognition microphones are sold
with built-in echo and noise cancellation, and the marketing claims
that this improves a (commercial) speech recognizer's performance.
This suggests to me that I should be collecting clean speech and then
using a noise reduction and echo cancellation front-end on the speech
recognizer, because that is what commercial speech recognition engines
seem to be doing.

And further, if clean speech is required, should I be using noise
reduction software on the submitted audio (such as the submission with
the very pronounced hiss)? My attempts at noise reduction have not been
successful: the resulting 'musical noise' (the low-level sound that
replaces the removed noise) gives me very poor recognition results.
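(For illustration, here is a minimal Python sketch of the spectral-subtraction
style of noise reduction that tends to produce this 'musical noise'. All of the
parameters and names are purely illustrative and are not the tool I actually
used on the submissions.)

import numpy as np

def spectral_subtract(signal, noise_sample, frame_len=512, hop=256):
    """Subtract an average noise magnitude spectrum from each frame."""
    window = np.hanning(frame_len)
    # Average magnitude spectrum of a noise-only stretch of audio.
    noise_mag = np.mean(
        [np.abs(np.fft.rfft(noise_sample[i:i + frame_len] * window))
         for i in range(0, len(noise_sample) - frame_len, hop)], axis=0)

    out = np.zeros(len(signal))
    for i in range(0, len(signal) - frame_len, hop):
        frame = signal[i:i + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        # Over-subtract, then floor. Bins that randomly poke above the floor
        # differ from frame to frame, and those isolated surviving bins are
        # heard as brief tones -- the 'musical noise' described above.
        clean_mag = np.maximum(mag - 1.5 * noise_mag, 0.05 * mag)
        out[i:i + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase)) * window
    return out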

I was wondering what your thoughts on this might be,

thanks for your time,

Ken MacLean

--
http://www.voxforge.org

for-sp...@mailinator.com

Jan 21, 2007, 9:39:22 PM1/21/07
Hi Ken,

Here are my opinions regarding your question. (For those who haven't
heard of VoxForge (voxforge.org), Ken and his contributors are striving
to make open source speech recognition more practical by collecting
speech data under a free license. They are currently aiming to help
enable desktop command & control and telephone call response (IVR)
speech recognition. In the longer term they hope to help enable
dictation. Ken has heard from me before, so I hope my posting here
won't discourage others from joining in. I think a lively discussion
would be very healthy.)

I feel you should aim to collect 'natural' speech, with a set of mics
that reflects the range of mic technology that your future users will
be using. Using noise reduction in the front end of the ASR system
may be a good idea. If so, you probably should employ it both when
training models and when actually using the recognizer (unless one of
these situations will have much cleaner speech coming in than the
other). If you are having 'musical noise' problems with your noise
reduction, you may do better with one of the noise reduction solutions
at http://isca-students.org/freeware. I have never heard any musical
noise problems with the Qualcomm-ICSI-OGI package, for example. I
don't think the QIO package's license will suit you, but there are
other options there such as CtuCopy (which I haven't tried) which is
under GPL.
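(To make the "employ it both when training models and when actually using the
recognizer" point concrete, here is a minimal sketch. The denoise() function is
just a stub standing in for whatever external tool is chosen - CtuCopy or
otherwise - and is not a real API.)

import shutil

def denoise(in_wav, out_wav):
    # Stand-in: replace this copy with a call to the chosen noise-reduction tool.
    shutil.copy(in_wav, out_wav)
    return out_wav

def prepare_training_set(wav_paths):
    # The same front end applied to every training recording, before
    # feature extraction and HTK training.
    return [denoise(p, p.replace(".wav", ".denoised.wav")) for p in wav_paths]

def preprocess_for_decoding(live_wav):
    # ...and applied again, identically, to audio going into the recognizer,
    # so the acoustic model sees matched conditions.
    return denoise(live_wav, "/tmp/live.denoised.wav")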

I think it's great that you are collecting information on microphone
type from people submitting data. Perhaps you should have some
category codes in addition to the specific model name, so that you can
automatically identify what parts of your data are using particular
categories of microphones. Why? Let's say, for example, that you
are building a model for dictation. Dictation works better with
headset mics than ordinary desktop mics, so presumably your users would
tend to use headset mics. If you choose to include a lot of desktop mic
data in the training set along with headset mic data, this will make
the models better prepared to deal with a mix of headset and desktop
mic users, but if there are only going to be headset mic users you may
lose performance from it. This might happen, in particular, if the
greater variance / reduced sharpness in the models due to the inclusion
of desktop mic data turns out to be a bigger performance factor than
the extra coverage of human voice types and triphones that you will get
from the extra data. (As an aside, I think speaker adaptive training
(SAT) may help during training to preserve model sharpness when mixing
different microphone types. It is designed to preserve sharpness in
the face of speaker variation, and I suppose this would carry over to
many kinds of recording environment variation. Likewise, I know from
experience that employing a speaker adaptation technique such as MLLR
during recognizer use can improve performance in the face of recording
environment variation.)
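
(For readers who haven't met MLLR: it estimates an affine transform of the
Gaussian means from a small amount of adaptation data. Here is a toy sketch of
applying such a transform - purely illustrative, not HTK's or Julius's actual
adaptation code, and with made-up numbers.)

import numpy as np

def apply_mllr_mean_transform(means, W, b):
    """Adapted mean for every Gaussian: mu' = W mu + b."""
    return means @ W.T + b

# Toy example: three Gaussian means in a 2-dimensional feature space.
means = np.array([[1.0, 0.5],
                  [0.2, 1.3],
                  [2.1, 0.8]])
W = 1.1 * np.eye(2)           # in real MLLR, W and b are estimated by
b = np.array([0.05, -0.02])   # maximising the adaptation data's likelihood
print(apply_mllr_mean_transform(means, W, b))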

Regards,
David Gelbart
