Hi
Nickolay-
Thanks for your reply. Yes, within limited form-factor applications (limited processing and memory) the differences between Kaldi, Whisper, etc. are smaller, and the typical environment for small form-factor devices tends to be noisy with multiple talkers. In that case one or more downstream small language models can be key, especially where translation to machine-readable commands (e.g. ROS) is required. A consensus of 2 out of 3 SLMs is desirable. We can't tell the fork-lift to immediately stop unless we are really sure of what was said (well, we can, but we don't wanna do that very often).
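(Just to illustrate, the 2-of-3 check I have in mind is a simple majority vote over the SLM outputs; this is only a sketch and the names are hypothetical:)

```python
from collections import Counter

def consensus_command(outputs, required=2):
    """Return a command only if at least `required` of the SLM outputs agree.

    `outputs` is a list of machine-readable command strings, one per small
    language model (hypothetical example). Returns None when no command
    reaches the required vote count, i.e. don't act on an unsure result.
    """
    counts = Counter(outputs)
    command, votes = counts.most_common(1)[0]
    return command if votes >= required else None

# Two of three SLMs agree on an emergency stop:
print(consensus_command(["STOP", "STOP", "SLOW"]))  # STOP
# No agreement, so no action is taken:
print(consensus_command(["STOP", "SLOW", "TURN"]))  # None
```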
I took a quick look at the k2/sherpa model link. Is there a way to run these on text input only, assuming speech recognition has already occurred? I.e., as a "language model only"?
-Jeff