It's great to hear that the spacing options are working for you!
Yes, we were disappointed by Whisper as well. High-end hardware can help, but the real issue is that Whisper does not truly support streaming: it has to wait until the utterance's audio buffer is complete before it can start processing, whereas streaming recognizers begin processing as soon as an utterance starts. That difference matters a lot for real-time speech recognition. Whisper is better suited to processing audio files.
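To make the latency difference concrete, here is a toy sketch (not real recognizer code — the chunk size and utterance length are made-up numbers) of when the earliest result can appear in each model:

```python
# Toy illustration of batch vs. streaming recognition latency.
# Assumptions (not from any real system): audio arrives in 200 ms
# chunks, and an utterance spans 10 chunks (2 seconds).

CHUNK_MS = 200          # duration of one audio chunk
UTTERANCE_CHUNKS = 10   # chunks in a 2-second utterance

def batch_first_result_ms():
    # Whisper-style: the full utterance must be buffered before
    # decoding starts, so the earliest result comes only after
    # all of the audio has arrived.
    return CHUNK_MS * UTTERANCE_CHUNKS

def streaming_first_result_ms():
    # Streaming-style: decoding starts on the first chunk, so a
    # partial result can appear after just one chunk of audio.
    return CHUNK_MS

print(batch_first_result_ms())      # 2000
print(streaming_first_result_ms())  # 200
```

So even before any decoding time is counted, the batch approach adds the full utterance duration to the wait for a first result.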
We have added the OpenAI API and Microsoft Azure to our task list for upcoming versions.
We have attempted to use Google's speech context in the past, but we found that our internal biasing actually performed better. Have you tried that? See the "phrase bias" section at https://utterlyvoice.com/help/bias. The "basic" mode there is an example where certain phrases have a negative bias (0.5). Your mode would look similar, but with positive biases (1.5 is usually sufficient).
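As a rough sketch of the idea (the field names below are illustrative only, not the exact schema — please follow the bias help page linked above for the real mode format), a mode with positive biases would attach a value above 1 to each phrase you want recognized more readily:

```
{
  "phraseBias": [
    { "phrase": "open terminal", "bias": 1.5 },
    { "phrase": "scroll down",   "bias": 1.5 }
  ]
}
```

Values below 1 (like the 0.5 entries in the "basic" mode) suppress a phrase, and values above 1 promote it; 1.5 is usually enough that you rarely need to go higher.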