Hello Michael.
Please allow me to elaborate on what I am trying to say.
I recently trained an Urdu ASR system using open-source and call center data for my university project, and I am now working to deploy it in a call center. Some of the open-source data was stereo, while the call center and other audio was mono. I converted all of it to mono for training. The question I was recently asked at work was "Why did you convert all files to mono?" I replied, "Because Kaldi requires mono channel." The next question (asked academically) was: "Why does Kaldi require mono channel audio for training? Why can't it do it in stereo or surround sound? Does it have to do with noise, or does it have to do with MFCC features or i-vectors? Because, quite frankly, we read this stuff in textbooks from a mono channel perspective." That led me to look around for good articles, but I could not find the answer. Hence I thought I would ask the question: why does Kaldi need mono input?
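To make the conversion step concrete, here is a minimal sketch of a stereo-to-mono downmix by channel averaging. It assumes NumPy and samples already loaded as a (num_samples, num_channels) array; the function name `downmix_to_mono` is only illustrative and is not meant to represent the exact tool used on the training data.

```python
import numpy as np

def downmix_to_mono(samples: np.ndarray) -> np.ndarray:
    """Average all channels of a (num_samples, num_channels) array into one channel.

    A 1-D input is treated as already mono and returned as float32.
    """
    samples = samples.astype(np.float32)
    if samples.ndim == 1:
        return samples
    return samples.mean(axis=1)

# Synthetic stereo signal (assumption: left channel is all 1.0, right is all 3.0).
stereo = np.stack([np.ones(4), 3.0 * np.ones(4)], axis=1)
mono = downmix_to_mono(stereo)
print(mono.shape, mono)
```

Averaging keeps the sample count unchanged and simply discards the spatial difference between the two channels.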
For my system, the input speech will be live over the phone, but there will also be recorded audio files (some of which may, hypothetically speaking, be stereo recordings made on a cell phone) that require transcription. These will be processed by Kaldi (and Vosk) to produce text output. The speech data is in Urdu.
In that context I needed the answer.
What I have understood so far is:
- The AMI recipe has an option to use the "mdm" microphone setup, an 8-channel array. The recipe uses an old filter-and-sum beamforming algorithm called BeamformIt to get a single channel from the array, which is then used for the actual training/inference.
- Kaldi does not natively have an option to use stereo audio for training (and this is the part where I want to know WHY).
- One hack is to concatenate the features from both microphones and use those for training.
- When Dan said "there are ways to estimate steering vectors so that you can reduce noise," I inferred that stereo may pick up more noise than mono, which would lead to bad training data. But it also raised a question about feature extraction for mono vs. stereo audio.
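The feature-concatenation hack mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses NumPy, a toy log band-energy feature as a stand-in for real MFCCs, and synthetic noise as the stereo signal. The function name `frame_log_energies` and its parameters are hypothetical and are not part of Kaldi.

```python
import numpy as np

def frame_log_energies(signal, frame_len=400, hop=160, n_bins=13):
    """Toy per-frame features: log energy in n_bins coarse FFT bands.

    This is a stand-in for MFCCs, used only to show the matrix shapes.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    feats = np.empty((n_frames, n_bins))
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        bands = np.array_split(power, n_bins)
        feats[t] = np.log(np.array([b.sum() for b in bands]) + 1e-10)
    return feats

rng = np.random.default_rng(0)
stereo = rng.standard_normal((16000, 2))  # 1 s of synthetic 16 kHz stereo

# Each channel is featurized on its own, giving one (frames, dims) matrix each.
left = frame_log_energies(stereo[:, 0])
right = frame_log_energies(stereo[:, 1])

# The "hack": stack the two matrices side by side, doubling the feature dimension.
concat = np.hstack([left, right])
print(left.shape, concat.shape)
```

In this sketch each channel yields its own feature matrix, so a stereo file produces either two separate single-channel streams or one stream with twice the feature dimension. A model trained on the concatenated features would then expect two channels at inference time as well.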
Respect and Regards
KHAN