Hello
I have recently started a project to train Automatic Speech Recognition (ASR) models for Air Traffic Control (ATC) communication. I intend to use Vosk for inference, but I believe training with Kaldi will produce an acoustic model better suited to this specialized domain.
Currently, I am considering two primary approaches for training:
Approach 1: Kaldi TDNN Fine-tuning/Augmentation
Base Models: Use existing TDNN models trained on LibriSpeech and/or Mini-LibriSpeech (completing their base training first, if necessary).
Customization: Fine-tune or augment these models with my specific ATC corpora.
My Data: This custom data consists of approximately 3-8 hours of WAV files and corresponding transcripts (.txt), recorded under non-ideal acoustic conditions.
Approach 2: Vosk Pre-trained Model Fine-tuning
Base Models: Use the pre-trained Vosk models (e.g., vosk-model-en-us-0.22 and/or vosk-model-small-en-us-0.15).
Customization: Fine-tune these Vosk models directly using my custom ATC data (3-8 hours of WAV files and transcripts).
In both approaches, I am also considering incorporating additional ATC-related data, such as ATCO2, for further fine-tuning.
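For either approach, the WAV/.txt pairs would first need to be converted into a Kaldi-style data directory (wav.scp, text, utt2spk). Below is a minimal sketch of that step, not a tested recipe. It assumes a flat folder in which every utterance is one foo.wav with a matching foo.txt, file names without spaces, and no speaker labels (so each utterance is treated as its own speaker). The function name build_kaldi_data_dir and the folder layout are hypothetical placeholders.

```python
from pathlib import Path


def build_kaldi_data_dir(corpus_dir, out_dir):
    """Write wav.scp, text and utt2spk for every WAV/TXT pair in corpus_dir.

    Returns the list of utterance IDs that were written, in sorted order.
    """
    corpus = Path(corpus_dir)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    entries = []
    for wav in corpus.glob("*.wav"):
        txt = wav.with_suffix(".txt")
        if not txt.exists():
            continue  # skip audio that has no transcript
        # Collapse all whitespace so the transcript fits on a single line.
        words = txt.read_text(encoding="utf-8").split()
        if not words:
            continue  # skip empty transcripts
        # Assumption: the file stem is usable as the utterance ID.
        entries.append((wav.stem, str(wav.resolve()), " ".join(words)))

    # Kaldi expects these files sorted by utterance ID in C-locale (byte) order.
    entries.sort(key=lambda e: e[0].encode("utf-8"))

    with open(out / "wav.scp", "w", encoding="utf-8") as wav_scp, \
         open(out / "text", "w", encoding="utf-8") as text, \
         open(out / "utt2spk", "w", encoding="utf-8") as utt2spk:
        for utt_id, wav_path, transcript in entries:
            wav_scp.write(f"{utt_id} {wav_path}\n")
            text.write(f"{utt_id} {transcript}\n")
            # Assumption: no speaker labels, so speaker ID == utterance ID.
            utt2spk.write(f"{utt_id} {utt_id}\n")

    return [e[0] for e in entries]
```

A call such as build_kaldi_data_dir("atc_corpus", "data/train_atc") (both paths hypothetical) would produce the three files. Text normalization (casing, digits, callsigns) is deliberately left out here, since it has to match whichever lexicon the base model uses.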
I would greatly appreciate your expert input on the following:
Are these proposed approaches feasible for achieving the desired outcome?
What are the potential drawbacks or challenges associated with each approach?
Do you recommend any alternative or complementary strategies for this task?
Thank you for your time and insights.