Whisper compared to VOSK



Nov 15, 2022, 6:27:02 AM
to Opencast Users
To get an idea of how well the different free STT solutions perform relative to each other, we ran them on some of our human-transcribed videos and compared their Word Error Rates (WER). The results might be interesting to anyone planning to use STT in the future, so I am posting them here:

The OSS solutions used are Subtitle2Go (S2G), VOSK, and OpenAI Whisper's Medium, Small and Base models. The spoken language is German.
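
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how a Word Error Rate can be computed. It is not the script used for the numbers in this post. It assumes the simplest possible normalization (lowercasing and splitting on whitespace), and the function name `word_error_rate` is made up for illustration. Real evaluations usually also strip punctuation and normalize numbers, which can shift the results noticeably.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words."""
    # Assumption: lowercase + whitespace split is the only normalization applied.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is missing from the hypothesis.
    print(f"{word_error_rate('das ist ein Test', 'das ist Test'):.1%}")  # 25.0%
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, since the denominator is the length of the reference only.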

Video 1 (topic: lecture recording of tutorial on how to register for courses):
S2G: 35.6%
Vosk: 24.5%
Whisper M: 9.1%
Whisper S: 11.7%
Whisper B: 15.5%

Video 2 (topic: recording lectures using Matterhorn):
S2G: 52.8%
Vosk: 38.5%
Whisper M: 12.1%
Whisper S: 16.2%
Whisper B: 24.1%

Video 3 (topic: biology didactics):
S2G: 33.6%
Vosk: 24.5%
Whisper M: 9.1%
Whisper S: 10.5%
Whisper B: 17.9%

Video 4 (topic: chemistry, scientific language, not public):
S2G: 37.4%
Vosk: 33.1%
Whisper M: 13.8%
Whisper S: 17.0%
Whisper B: 21.8%
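
To make the ranking easier to see at a glance, the per-video numbers can be averaged per system. The sketch below does only that, using the values listed for the four videos. The unweighted mean is an assumption made for simplicity: the videos differ in length, so this is not the same as a WER computed over all words of all four videos together.

```python
from statistics import mean

# WER in percent for videos 1 to 4, copied from the results listed in this post.
wer_percent = {
    "S2G":       [35.6, 52.8, 33.6, 37.4],
    "Vosk":      [24.5, 38.5, 24.5, 33.1],
    "Whisper M": [9.1, 12.1, 9.1, 13.8],
    "Whisper S": [11.7, 16.2, 10.5, 17.0],
    "Whisper B": [15.5, 24.1, 17.9, 21.8],
}

# Unweighted mean per system (assumption: every video counts equally).
mean_wer = {system: mean(values) for system, values in wer_percent.items()}

# Systems ordered from lowest (best) to highest (worst) mean WER.
ranking = sorted(mean_wer, key=mean_wer.get)

if __name__ == "__main__":
    for system in ranking:
        print(f"{system}: {mean_wer[system]:.2f}%")
```

With these four videos the order comes out as Whisper M, Whisper S, Whisper B, Vosk, S2G.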

Kind regards,

Matthias Neugebauer

Nov 15, 2022, 7:54:17 AM
to Opencast Users

We can confirm your results, although in our tests S2G was often better than Vosk. Neither comes close to Whisper, though. In our test cases, Whisper M also beats the commercial offerings. We tested with German (some with dialects) and English videos.

– Matthias

educast.nrw / ZHLdigital (eLectures)
ERCIS - European Research Center for Information Systems

University of Münster
Leonardo-Campus 3 - Room 327
48149 Münster

Tel: +49 251 83-38268
Mail: matthias....@uni-muenster.de
