Hi
Nickolay-
Thanks for your reply. Yes, within limited form-factor applications (limited processing and memory) the differences between Kaldi, Whisper, etc. are smaller, and the typical environment for small form-factor devices tends to be noisy with multiple talkers. In that case one or more downstream small language models can be key, especially where translation to machine-readable commands (e.g. ROS) is required. A consensus of 2 out of 3 SLMs is desirable. We can't tell the fork-lift to immediately stop unless we are really sure of what was said (well, we can, but we don't wanna do that very often).
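(Just to illustrate, the 2-of-3 check I have in mind is a simple majority vote over the SLM outputs; this is only a sketch and the names are hypothetical:)

```python
from collections import Counter

def consensus_command(outputs, required=2):
    """Return a command only if at least `required` of the SLM outputs agree.

    `outputs` is a list of machine-readable command strings, one per small
    language model (hypothetical example). Returns None when no command
    reaches the required vote count, i.e. don't act on an unsure result.
    """
    counts = Counter(outputs)
    command, votes = counts.most_common(1)[0]
    return command if votes >= required else None

# Two of three SLMs agree on an emergency stop:
print(consensus_command(["STOP", "STOP", "SLOW"]))  # STOP
# No agreement, so no action is taken:
print(consensus_command(["STOP", "SLOW", "TURN"]))  # None
```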
I took a quick look at the k2/sherpa model link. Is there a way to run these on text input only, assuming speech recognition has already occurred? I.e., as a "language model only"?
-Jeff