Why does kaldi require mono channel audio for training instead of stereo or surround?


Sage Khan

Aug 23, 2022, 11:48:45 AM
to kaldi-help
Hello. Can anyone explain this to me academically?

Why does kaldi require mono channel audio for training instead of stereo or surround?

Mike Murray

Aug 23, 2022, 5:02:16 PM
to kaldi...@googlegroups.com
What device captures speech in more than one channel? A microphone only provides a single audio channel. Capturing speech in stereo literally requires two devices, even though sometimes they are packed in the same housing.

I can’t understand what’s motivating this question. It’s like asking why a guitar amplifier is mono: there’s no such thing as a stereo guitar pickup.


On Aug 23, 2022, at 8:48 AM, Sage Khan <class...@gmail.com> wrote:

Hello. Can anyone explain this to me academically?

Why does kaldi require mono channel audio for training instead of stereo or surround?


Daniel Povey

Aug 23, 2022, 6:09:41 PM
to kaldi...@googlegroups.com
There are algorithms that can usefully use multiple microphones; for example, there are ways to estimate steering vectors so that you can reduce noise. But those are not implemented in Kaldi, or at least they are not at all tightly integrated. I believe the AMI recipe may have some scripts, but I don't think they are very state of the art any more.
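
To give a rough picture of the simplest version of this, here is a delay-and-sum sketch in Python/numpy. It is not Kaldi code, just an illustration; in practice the per-channel delays would be estimated from the array geometry or from the data.

import numpy as np

# Delay-and-sum: time-align each microphone channel to the target
# source, then average. The speech adds coherently while independent
# noise does not, which is what reduces the noise.
def delay_and_sum(channels, delays):
    max_delay = max(delays)
    length = min(len(c) for c in channels) - max_delay
    aligned = [c[d:d + length] for c, d in zip(channels, delays)]
    return np.mean(aligned, axis=0)

# Toy usage: two noisy copies of one signal, the second delayed 3 samples.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 0.01 * np.arange(1000))
ch0 = speech + 0.5 * rng.standard_normal(1000)
ch1 = np.roll(speech, 3) + 0.5 * rng.standard_normal(1000)
enhanced = delay_and_sum([ch0, ch1], delays=[0, 3])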


Sage Khan

Aug 24, 2022, 1:05:23 AM
to kaldi-help
I'm not talking about live audio capture.

I'm talking about recorded audio clips. For instance, I have stereo audio clips available (and even surround sound). I wanted to know, theoretically, why Kaldi requires single-channel audio for training purposes.

I'm a musician myself, and I understand audio engineering, recording setups, and so on. I am not asking "why is a guitar amp mono rather than stereo", because the question is not about live recording.

My question is about recorded audio being used for TRAINING an ASR system, as described in the Kaldi data preparation documentation.

Sage Khan

Aug 24, 2022, 1:06:43 AM
to kaldi-help
Hello Dan.

What I understand is that mono has reduced noise compared to stereo. But what do you mean by "they are not at all tightly integrated"?

Regards

Desh Raj

Aug 24, 2022, 1:58:20 AM
to kaldi...@googlegroups.com
If you look at the AMI recipe, there's an option for using the "mdm" (multiple distant microphone) setup, which is basically an 8-channel array. Inside the recipe, we first use BeamformIt (a simple filter-and-sum beamforming algorithm) to get a single channel from the array, which is then used for the actual training/inference. This is probably what Dan meant by "not tightly integrated". Also, BeamformIt is a very old algorithm.

As to your actual question, there's no particular reason why ASR training should only use a single channel. It's just not supported natively in the Kaldi framework, although I can imagine hacks around it, for instance concatenating features from both microphones and using those for training.
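
Just to illustrate (this is not something Kaldi supports out of the box), such a hack could look roughly like the following sketch, assuming librosa for the MFCC extraction; the filename is a placeholder:

import numpy as np
import soundfile as sf
import librosa

# Hypothetical feature-concatenation hack: extract MFCCs per channel,
# then stack them along the feature dimension so that every frame
# carries information from both microphones.
audio, sr = sf.read("stereo_utterance.wav")   # shape: (samples, 2)
per_channel = [
    librosa.feature.mfcc(y=audio[:, ch], sr=sr, n_mfcc=13)  # (13, frames)
    for ch in range(audio.shape[1])
]
stacked = np.concatenate(per_channel, axis=0).T   # shape: (frames, 26)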

Desh

Muhammad Danyal Khan

Aug 24, 2022, 2:02:54 AM
to kaldi...@googlegroups.com
Hello Desh.

Theoretically, does stereo sound cause any issues during feature extraction? I've read about feature extraction mostly from a mono perspective. Could this be the reason Kaldi uses only mono?



Mike Murray

Aug 24, 2022, 8:59:06 AM
to kaldi...@googlegroups.com
If you were going to build a system that detects speech using multiple channels, it would usually have a property that I'm sure you don't desire: the resulting system may need to preserve the geometry of the microphone array that was used in training. You can see this discussed in various places, e.g. https://www.arxiv-vanity.com/papers/2109.11225/

You will get much better support from this list if you explain what you are trying to achieve. Using multiple channels could be a way to reduce noise, as Daniel described. Or there could be other motivations.

It sounds like you have a dataset that came from multiple microphones, and you want to train for recognition with multiple microphones. Beamforming may be the right tool for you; as Desh mentioned, it is in fact available as a data prep option. But it's important to recognize that developing a working ASR system requires some thought about the input.

Just as an example: if you have stereo files because a stereo-spread effect chain was applied to a mono recording, you are in a very different situation than if you have data from two microphones. No efficient ASR system can be built that blindly accepts multiple channels.

-Michael

On Aug 23, 2022, at 10:05 PM, Sage Khan <class...@gmail.com> wrote:

I'm not talking about live audio capture.

Sage Khan

Aug 24, 2022, 10:24:59 AM
to kaldi-help
Hello Michael.

Please allow me to elaborate on what I am trying to say.

I recently trained an Urdu ASR using open-source and call-center data for my university project, and now I am working to deploy it in a call center. Some of the open-source data were stereo, while the call-center and other audio were mono. I converted all of them to mono for training. The question I was recently asked during my work was "Why did you convert all the files to mono?". I replied, "Because Kaldi requires mono-channel audio." The next question (asked academically) was: "Why does Kaldi require mono-channel audio for training? Why can't it do it in stereo or surround sound? Does it have to do with noise, or with MFCC features or i-vectors? Quite frankly, we read this stuff in textbooks from a mono-channel perspective." That led me to look for good articles, but I could not find the answer. Hence I thought I would ask: why does Kaldi need a mono input?
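
(For reference, the conversion itself was trivial, roughly along these lines; this sketch uses Python's soundfile package, and the filenames are placeholders:)

import soundfile as sf

# Stereo-to-mono downmix by averaging the left and right channels.
audio, sr = sf.read("call_stereo.wav")   # shape: (samples, 2)
sf.write("call_mono.wav", audio.mean(axis=1), sr)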

For my system, input speech will be live via phone, but there will also be recorded audio (possibly including, hypothetically, some stereo clips recorded on a cell phone) that requires transcription, which will then be processed by Kaldi (and Vosk) to produce text output. The speech data is in Urdu.

In that context I needed the answer.

What I have understood so far is:
- The AMI recipe has an option to use the "mdm" (multiple distant microphone) setup, an 8-channel array. The recipe uses an old filter-and-sum beamforming algorithm called BeamformIt to get a single channel from the array, which is then used for the actual training/inference.
- Kaldi natively does not have the option to use stereo audio for training (and this is the part I want to understand: WHY).
- One hack is concatenating features from both microphones and using those for training.
- When Dan said "there are ways to estimate steering vectors so that you can reduce noise", I inferred that stereo may pick up more noise compared to mono, which would lead to bad training data; but it also raised a question about feature extraction for mono vs. stereo audio.


Respect and Regards

KHAN

Mike Murray

Aug 24, 2022, 11:24:20 AM
to kaldi...@googlegroups.com
I think the answer to that basic question, in the context of the query "why mono?", is actually pretty straightforward: a human voice has only one channel. A stream of text has only one channel. This makes ASR a fundamentally single-channel process. The vocal organs of a human being simply cannot generate separate channels of signal data that can be addressed separately. This is what I meant by comparison with a guitar.

As to the more pragmatic question "what do I do with multiple channels when I have them?", it depends directly on how those channels are related to the primary signal: the single human organ producing sound. Since the ASR system you are building will be different depending on those details, addressing them properly falls into the category of preprocessing. This is why the answers people have provided are about preprocessing.

For my system, input speech will be live via phone, but there will also be recorded audio (possibly including, hypothetically, some stereo clips recorded on a cell phone)

This is a fine example. If the cell phone has two microphones, it's likely that one of them alone will produce the best results. My cell phone has two microphones, but one of them is much higher fidelity; I would simply use that one channel instead of both.
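
If you wanted to automate that choice, a crude pick-the-louder-channel heuristic might look like the following (a sketch in Python using the soundfile package; filenames are placeholders, and energy is only a rough proxy for which microphone is actually better):

import numpy as np
import soundfile as sf

# Keep the channel with the most energy. This is only a heuristic; you
# would want to validate the choice against recognition results.
audio, sr = sf.read("phone_recording.wav")   # shape: (samples, 2)
best = int(np.argmax((audio ** 2).sum(axis=0)))
sf.write("phone_mono.wav", audio[:, best], sr)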

You would have to experiment to determine the best way forward with the data you have. There isn't a general answer that covers all the scenarios of the physical world where signal data is collected. Once you develop a good strategy for the cell phone, it may not be equally good for yet another arrangement.

Finally, if the question really is only "why does the Kaldi software accept only audio files with a single channel?", then the best answer is that there is no single method for creating features across channels that works well for every possible use case.

Kaldi is a framework: in the absence of a single universal way to create features from signal data that every user would always want, it can't make that choice for you. You have to make it yourself.

Sage Khan

Aug 24, 2022, 11:33:46 AM
to kaldi-help
That summarizes the whole issue. Thank you so much for your time and effort.

Regards

KHAN

Desh Raj

Aug 24, 2022, 12:02:47 PM
to kaldi...@googlegroups.com
See below.

On Wed, Aug 24, 2022 at 7:25 AM Sage Khan <class...@gmail.com> wrote:
- Kaldi natively does not have the option to use stereo audio for training (and this is the part I want to understand: WHY).
I'm not sure, but it is probably just a design choice. Conventionally, speech processing is divided into front-end (e.g. enhancement) and back-end (e.g. ASR) systems. Kaldi is designed as a back-end system and assumes that you have some front-end method that can take multiple channels and beamform them into a single output. Of course, the recent trend is to train everything end-to-end. You can take a look at ESPnet; there may be recipes that do this.
  
- One hack is concatenating features from both microphones and using those for training.
- When Dan said "there are ways to estimate steering vectors so that you can reduce noise", I inferred that stereo may pick up more noise compared to mono, which would lead to bad training data; but it also raised a question about feature extraction for mono vs. stereo audio.
It does not mean that stereo picks up more noise. It means that if there is noise in the recording environment, it is easier to suppress when you have multiple channels. I suggest you take a look at the multi-channel speech enhancement literature (there is a ton of work in this field).
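
As a toy illustration of that point: even just averaging N channels with independent noise buys about 10*log10(N) dB of SNR, and real beamformers do better by also exploiting spatial information. A quick numpy check:

import numpy as np

def snr_db(clean, noisy):
    noise = noisy - clean
    return 10 * np.log10((clean ** 2).sum() / (noise ** 2).sum())

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 0.01 * np.arange(10000))
channels = [speech + rng.standard_normal(10000) for _ in range(4)]

print("1 channel     :", snr_db(speech, channels[0]))                # about -3 dB
print("4-channel mean:", snr_db(speech, np.mean(channels, axis=0)))  # about +3 dB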

Sage Khan

Aug 25, 2022, 3:21:36 AM
to kaldi-help
Thank you so much :) Now it is clear :)