I thought this topic would interest people, so I am forking it here in case the original thread gets closed: https://github.com/MicrosoftDocs/azure-docs/issues/58642
The docs are unclear: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text#response-parameters
Fluency is a part of accuracy, and accuracy is a part of fluency. I don't understand what the difference is in the calculation/production of these two scores. The explanations are single lines that essentially say "x is x".
Also the pronScore is based on these two scores and "weighted" - weighted how? weighted towards what?
Forgive me if I've posted this issue in the wrong place!
@crevulus
Thanks for the feedback! We are currently investigating and will update you shortly.
@crevulus Thanks again for the feedback.
We will improve the document so that it is more meaningful and helps customers understand the API more easily.
For your questions, let me answer here.
About the difference between accuracy and fluency:
The accuracy score indicates how closely the pronounced phonemes match native pronunciation.
We calculate it at the phoneme level first; the word-level and full-text-level accuracy scores are aggregated from the phoneme-level accuracy scores.
The fluency score indicates how closely the given speech matches the naturalness of native speech in aspects such as breaks and silence duration. It concerns the inter-word part of the speech, which makes it different from the accuracy score.
The completeness score is calculated as the ratio of words that were not mispronounced to the words in the reference text input.
The pronScore is the overall score, aggregated from the accuracy score, fluency score and completeness score. It is calculated as accuracyScore * X% + fluencyScore * Y% + completenessScore * Z%. The weights may be adjusted, so we don't share them here. You can also calculate the overall score with your own customized weights.
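For readers who want to compute an overall score with their own weights, here is a minimal sketch of the weighted sum described above. The function names and the default weights (60/20/20) are assumptions made purely for illustration; the actual weights used by the service are not shared in this thread, and the completeness helper is only one reading of the one-line description given here.

```python
def completeness_score(correct_words, reference_words):
    """Share of reference words that were not mispronounced, on a 0-100 scale.

    This is an interpretation of the description in the thread, not the
    service's actual implementation.
    """
    if reference_words <= 0:
        raise ValueError("reference_words must be positive")
    return 100.0 * correct_words / reference_words


def overall_pron_score(accuracy, fluency, completeness, weights=(0.6, 0.2, 0.2)):
    """Weighted average of three 0-100 sub-scores.

    The default weights are hypothetical placeholders; pick your own.
    They are normalised so they do not have to sum to exactly 1.
    """
    w_acc, w_flu, w_comp = weights
    total = w_acc + w_flu + w_comp
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = accuracy * w_acc + fluency * w_flu + completeness * w_comp
    return weighted / total


# Example: accuracy 80, fluency 90, and 9 of 10 reference words pronounced acceptably.
print(overall_pron_score(80, 90, completeness_score(9, 10)))
```

Normalising by the sum of the weights keeps the result on the same 0-100 scale as the inputs, even if the chosen weights do not add up to 100%.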
In the future we will introduce more dimensions, such as a prosody score, and aggregate them into pronScore.
Please let me know if you have further questions.
Thanks for the feedback. It was very thorough.
Prosody score would be very useful for my purposes! Please keep me updated, and you can close this ticket if you wish.
@crevulus
We will now proceed to close this thread. If there are further questions regarding this matter, please respond here and tag @YutongTie-MSFT, and we will gladly continue the discussion.
Revised public-facing content will appear at this address within 24 hours:
https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text
@yinhew, thanks for the detailed explanation. I wonder if you have time to help me better understand the fluency score:
Like you mentioned:
The fluency score indicates how closely the given speech matches the naturalness of native speech in aspects such as breaks and silence duration. It concerns the inter-word part of the speech, which makes it different from the accuracy score.
Is there a way for you to share more details of exactly how the fluency score is calculated? For example, I have the following alignment result for a recording:
silence 0-0.1s
I 0.1-0.3s
<break> 0.3-0.6s
like 0.7-1.7s # the speaker pronounced the word 'like' longer than most native speakers would
it 1.7-2.0s
silence 2.0-2.3s
What is the fluency score in this case, and how is it calculated?
P.S.: A rough idea of the calculation procedure is fine with me; I don't need specific parameters.
Thanks.
@weiwchu @YutongTie-MSFT I'm also very interested in your reply to @weiwchu 's query. Would be useful to know for our product.
Why have we closed this thread?
It's not been answered yet!
@sourabharsh and @crevulus and others who are interested in this topic, let me share my thoughts with you on this:
(Important things first: I created a group for discussing speech assessment questions and problems. I will try to answer most of them during weekends or evenings, so please feel free to post your questions. I have a PhD in speech, and my thesis is on pitch estimation and speech analysis.
https://groups.google.com/g/speech-assessment
Anyone is welcome to join and discuss.)
Fluency is a very subjective concept compared to recognition accuracy or detection precision. I have been doing research on speech assessment for some years. My feeling (just my thoughts) is that this domain lacks definitions of its metrics, e.g. what type of speech is considered 'native' speech? That is perfectly fine, though, because many aspects of speech assessment are still regarded as open problems by the research community.
So my understanding is that different firms have to form their own understanding of fluency and build their own implementations.
There are a few ways of estimating fluency:
Method 1: have teachers label a lot of recordings (at least 10k+) with fluency scores from 0-100, then build a model that predicts the score from features (you can check linear/logistic regression, XGBoost, ...). The features could be raw MFCCs or state-level posteriors.
Method 2: still have teachers label a few recordings with fluency scores from 0-100 (maybe just 100 samples), but define a function using expert knowledge and manually tune its parameters (this may surprise you, but a lot of firms do use Method 2).
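To make Method 2 more concrete, here is a rough sketch of what a hand-defined fluency function could look like. It is not how any particular vendor computes fluency: the `Segment` type, the 0.2 s "natural pause" threshold and the penalty of 100 points per second are all hypothetical values that an expert would tune against the small labelled set. It only penalises long pauses between words, so word lengthening (such as the long 'like' in the alignment posted earlier in this thread) is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str    # a word, or a non-speech marker such as "silence" or "<break>"
    start: float  # seconds
    end: float    # seconds


# Labels treated as non-speech; an assumption about the aligner's output.
NON_SPEECH = {"silence", "<break>"}


def fluency_score(segments, max_natural_pause=0.2, penalty_per_second=100.0):
    """Hypothetical hand-tuned fluency function in the spirit of Method 2.

    Starts at 100 and subtracts a penalty for every gap between two
    consecutive words that exceeds max_natural_pause. Leading and trailing
    silence is ignored. Both parameters are made-up starting values.
    """
    words = [s for s in segments if s.label not in NON_SPEECH]
    score = 100.0
    for prev, nxt in zip(words, words[1:]):
        pause = nxt.start - prev.end
        excess = max(0.0, pause - max_natural_pause)
        score -= penalty_per_second * excess
    # Clamp to the 0-100 range used by the teacher labels.
    return max(0.0, min(100.0, score))


# The alignment from the question earlier in this thread.
alignment = [
    Segment("silence", 0.0, 0.1),
    Segment("I", 0.1, 0.3),
    Segment("<break>", 0.3, 0.6),
    Segment("like", 0.7, 1.7),
    Segment("it", 1.7, 2.0),
    Segment("silence", 2.0, 2.3),
]
print(round(fluency_score(alignment), 1))
```

With these made-up parameters, the 0.4 s gap between "I" and "like" exceeds the threshold by 0.2 s, so the score comes out at roughly 80. In practice you would adjust the threshold and penalty until the function agrees reasonably well with the teachers' labels.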
Thanks,
Wei