Google Class Tokens


ewrj...@gmail.com

Apr 1, 2022, 12:41:01 PM
to UniMRCP
Hi All,
We're testing Google class tokens to see whether the $FULLPHONENUM class token makes any discernible difference to the recognition results. Example:

MRCP/2.0 553 DEFINE-GRAMMAR 1
Channel-Identifier: fda8893c9fec4dbc@speechrecog
Content-Type: application/srgs+xml
Content-Id: request1@form-level
Content-Length: 380


<?xml version="1.0"?><grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" root="pre" mode="voice"><meta name="single-utterance" content="true"/><meta name="input-timeout" content="10000"/><meta name="speech-complete-timeout" content="2000"/><meta name="scope" content="hint"/><rule id="pre"><one-of><item>$FULLPHONENUM</item></one-of></rule></grammar>


MRCP/2.0 359 RECOGNIZE 2
Channel-Identifier: fda8893c9fec4dbc@speechrecog
Content-Type: text/uri-list
Vendor-Specific-Parameters: single-utterance=true
Cancel-If-Queue: false
Recognition-Timeout: 10000
Speech-Complete-Timeout: 2000
Start-Input-Timers: true
Save-Waveform: true
Speech-Language: en-US
Content-Length: 27


session:request1@form-level


Interestingly, we have also tested $TIME, and that did produce somewhat better results. Is $TIME a special case in UniMRCP, or should we expect all the Google class tokens to work, since they are simply passed on by the server/plugin?

Thanks

Ed James

ewrj...@gmail.com

Apr 4, 2022, 11:18:20 AM
to UniMRCP

I've just done some testing with Google class tokens, using a prompt that reads out a 16-digit number:


Azure (No Grammar and 100% Correct)
<result>
<interpretation grammar="builtin:speech/transcribe" confidence="0.96">
<instance>4931785634782378</instance>
<input mode="speech">four nine three one seven eight five six three four seven eight two three seven eight</input>
</interpretation>
</result>



Google (No Grammar and has the digits but not the format)
<result>
<interpretation grammar="builtin:speech/transcribe" confidence="0.93">
<instance>4931 +78-563-478-2378</instance>
<input mode="speech">4931 +78-563-478-2378</input>
</interpretation>
<interpretation grammar="builtin:speech/transcribe" confidence="0.8">
<instance>for 93178 563-478-2378</instance>
<input mode="speech">for 93178 563-478-2378</input>
</interpretation>
<interpretation grammar="builtin:speech/transcribe" confidence="0.66">
<instance>49 +317-856-347-8237 8</instance>
<input mode="speech">49 +317-856-347-8237 8</input>
</interpretation>
</result>


Google ($FULLPHONENUM and missing digits and incorrect format)
<result>
<interpretation grammar="session:request1@form-level" confidence="0.86">
<instance>01785 6347 8237</instance>
<input mode="speech">01785 6347 8237</input>
</interpretation>
<interpretation grammar="session:request1@form-level" confidence="0.66">
<instance>01785 6347 827</instance>
<input mode="speech">01785 6347 827</input>
</interpretation>
<interpretation grammar="session:request1@form-level" confidence="0.65">
<instance>1785 6347 8237</instance>
<input mode="speech">1785 6347 8237</input>
</interpretation>
</result>


Google ($OOV_CLASS_DIGIT_SEQUENCE and one missing digit and correct format)
<result>
<interpretation grammar="session:request1@form-level" confidence="0.95">
<instance>493178563478237</instance>
<input mode="speech">493178563478237</input>
</interpretation>
<interpretation grammar="session:request1@form-level" confidence="0.95">
<instance>4931788563478237</instance>
<input mode="speech">4931788563478237</input>
</interpretation>
<interpretation grammar="session:request1@form-level" confidence="0.91">
<instance>4931785634788237</instance>
<input mode="speech">4931785634788237</input>
</interpretation>
</result>


So the Google class tokens do seem to affect the results, which suggests they are reaching Google. However, if this pattern can be relied upon, we'd be better off not using a class token at all and simply reformatting the Google result obtained without tokens, since that at least contained all the digits.
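
As a rough illustration of the "reformat the result without tokens" idea, the Python sketch below strips everything except digits from a transcript and checks the length. The function name normalize_digits and the expected length of 16 are assumptions taken from this test prompt; nothing here is provided by UniMRCP or Google.

```python
import re
from typing import Optional


def normalize_digits(transcript: str, expected_len: int = 16) -> Optional[str]:
    """Keep only the digits of a transcript; return None if the count is wrong."""
    digits = re.sub(r"\D", "", transcript)
    return digits if len(digits) == expected_len else None


# Top "no grammar" hypothesis from the Google result above.
print(normalize_digits("4931 +78-563-478-2378"))
# Top $FULLPHONENUM hypothesis: digits are missing, so it is rejected.
print(normalize_digits("01785 6347 8237"))
```

Note that this only repairs formatting. The second no-grammar alternative, where "four" was heard as "for", yields 15 digits and would be rejected rather than fixed.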

Does anyone have experience of Google class tokens producing reliably better results than using no grammar at all?

Thanks

Ed James

Arsen Chaloyan

Apr 12, 2022, 9:43:39 PM
to UniMRCP
Hi Ed,

I can confirm that the plugin transparently passes the specified class token to Google, and that your use of the SRGS XML grammar is correct.

Based on our experience with similar cases, Google recommends using their enhanced model trained for telephony. This can be enabled by extending the grammar with the use-enhanced, model and api parameters shown below (all three are required to have the desired effect).

<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US" version="1.0" root="pre" mode="voice">
<meta name="single-utterance" content="true"/>
<meta name="input-timeout" content="10000"/>
<meta name="speech-complete-timeout" content="2000"/>
<meta name="scope" content="hint"/>
<meta name="use-enhanced" content="true"/>
<meta name="model" content="phone_call"/>
<meta name="api" content="v1p1beta1"/>
<rule id="pre">
<one-of><item>$FULLPHONENUM</item></one-of>
</rule>
</grammar>
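
If the grammar is generated by the application rather than kept as a static file, a sketch along these lines (Python standard library only) could assemble it. The helper name build_hint_grammar is hypothetical; the meta names and values are simply the ones from the grammar above.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_hint_grammar(class_token: str, lang: str = "en-US") -> str:
    """Return an SRGS grammar that passes one class token as a speech hint."""
    # Serialize SRGS elements without a prefix (xmlns="...").
    ET.register_namespace("", SRGS_NS)
    grammar = ET.Element(
        f"{{{SRGS_NS}}}grammar",
        {f"{{{XML_NS}}}lang": lang, "version": "1.0", "root": "pre", "mode": "voice"},
    )
    # Meta parameters copied from the grammar in this thread.
    metas = {
        "single-utterance": "true",
        "input-timeout": "10000",
        "speech-complete-timeout": "2000",
        "scope": "hint",
        "use-enhanced": "true",
        "model": "phone_call",
        "api": "v1p1beta1",
    }
    for name, content in metas.items():
        ET.SubElement(grammar, f"{{{SRGS_NS}}}meta", {"name": name, "content": content})
    rule = ET.SubElement(grammar, f"{{{SRGS_NS}}}rule", {"id": "pre"})
    one_of = ET.SubElement(rule, f"{{{SRGS_NS}}}one-of")
    ET.SubElement(one_of, f"{{{SRGS_NS}}}item").text = class_token
    return '<?xml version="1.0"?>' + ET.tostring(grammar, encoding="unicode")


print(build_hint_grammar("$FULLPHONENUM"))
```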

While the above does help improve the results, it is still common for the user application to post-process the returned results and normalize them according to its own domain-specific context.




--
Arsen Chaloyan
Author of UniMRCP
http://www.unimrcp.org

ewrj...@gmail.com

Apr 14, 2022, 10:57:23 AM
to UniMRCP
Hi Arsen,
Thanks for the suggestion. If I include those headers I do get all the digits returned and formatted as a number but all the confidence values are virtually zero (so our applications will discard them):

2022-04-13 17:28:58:441335 [INFO]   Send MRCPv2 Data 10.1.6.49:63789 <-> 172.18.1.178:1544 [688 bytes]
MRCP/2.0 688 DEFINE-GRAMMAR 1
Channel-Identifier: 10a2d71c2fc1467c@speechrecog
Content-Type: application/srgs+xml
Content-Id: request1@form-level
Content-Length: 515

<?xml version="1.0"?><grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-GB" version="1.0" root="pre" mode="voice">
<meta name="single-utterance" content="false"/>

<meta name="input-timeout" content="10000"/>
<meta name="speech-complete-timeout" content="2000"/>
<meta name="scope" content="hint"/>
<meta name="use-enhanced" content="true"/>
<meta name="model" content="phone_call"/>
<meta name="api" content="v1p1beta1"/>
<rule id="pre">
<one-of>
<item>$OOV_CLASS_DIGIT_SEQUENCE</item>
</one-of>
</rule>
</grammar>
2022-04-13 17:28:58:442335 [DEBUG]  Wait for Messages [MRCPv2-Agent-1]


2022-04-13 17:29:06:235533 [INFO]   Receive MRCPv2 Data 10.1.6.49:63789 <-> 172.18.1.178:1544 [930 bytes]
MRCP/2.0 930 RECOGNITION-COMPLETE 2 COMPLETE
Channel-Identifier: 10a2d71c2fc1467c@speechrecog
Completion-Cause: 000 success
Waveform-Uri: <http://localhost/utterances/umsgsr-10a2d71c2fc1467c-2.wav>;size=124800;duration=7800
Content-Type: application/x-nlsml
Content-Length: 644

<?xml version="1.0"?>
<result>
  <interpretation grammar="session:request1@form-level" confidence="2.3e-09">
    <instance>4 9 3 1 7 8 5 6 3 4 7 8 2 3 7 8</instance>
    <input mode="speech">4 9 3 1 7 8 5 6 3 4 7 8 2 3 7 8</input>
  </interpretation>
  <interpretation grammar="session:request1@form-level" confidence="2.3e-09">
    <instance>4931785634782378</instance>
    <input mode="speech">4931785634782378</input>
  </interpretation>
  <interpretation grammar="session:request1@form-level" confidence="2.3e-09">
    <instance>£4931785634782378</instance>
    <input mode="speech">£4931785634782378</input>
  </interpretation>
</result>



We've opened a ticket with Google directly as I'm sure the issue is with them and they are asking me the following questions. Would it be possible for you to let me know any useful answers I can give them (I understand the plugin to Google connection is your intellectual property):

Questions From Google:

1.  Can you verify that the sampling rate is at 16000 khz or higher?

2.  Can you please verify the codec being used is a lossless codec? We recommend FLAC or LINEAR16

3. Please provide the Entire request code sent to API including the entire RecognitionConfig (see https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig )

I send PCMA 8kHz audio to the uniMRCP server but I notice they're saying 16kHz is optimal and PCMA doesn't seem to be a listed codec so are you transcoding ? 

Thanks

Ed

Arsen Chaloyan

Apr 15, 2022, 2:56:57 PM
to UniMRCP
Hi Ed,

> Thanks for the suggestion. If I include those headers I do get all the digits returned and formatted as a number but all the confidence values are virtually zero (so our applications will discard them):

Glad it helps. It is not uncommon to disregard the confidence score reported by Google: it happens infrequently but persistently that the intended result is returned with a confidence score of 0. If that is acceptable, you can either not set a confidence threshold in your application or set the threshold to 0.
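
One way an application might accept results without a confidence threshold is to validate the structure of the result instead. The Python sketch below parses an NLSML body like the one you posted and returns the first interpretation that contains the expected number of digits. The function name first_valid_number and the 16-digit rule are illustrative assumptions, not part of the plugin.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Two of the interpretations from the RECOGNITION-COMPLETE message in this thread.
NLSML = """<?xml version="1.0"?>
<result>
  <interpretation grammar="session:request1@form-level" confidence="2.3e-09">
    <instance>4 9 3 1 7 8 5 6 3 4 7 8 2 3 7 8</instance>
    <input mode="speech">4 9 3 1 7 8 5 6 3 4 7 8 2 3 7 8</input>
  </interpretation>
  <interpretation grammar="session:request1@form-level" confidence="2.3e-09">
    <instance>4931785634782378</instance>
    <input mode="speech">4931785634782378</input>
  </interpretation>
</result>"""


def first_valid_number(nlsml: str, expected_len: int = 16) -> Optional[str]:
    """Return the digits of the first interpretation with expected_len digits.

    The confidence attribute is deliberately ignored.
    """
    root = ET.fromstring(nlsml)
    for interpretation in root.iter("interpretation"):
        instance = interpretation.findtext("instance", default="")
        digits = re.sub(r"\D", "", instance)
        if len(digits) == expected_len:
            return digits
    return None


print(first_valid_number(NLSML))
```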

> We've opened a ticket with Google directly as I'm sure the issue is with them and they are asking me the following questions. Would it be possible for you to let me know any useful answers I can give them (I understand the plugin to Google connection is your intellectual property):

Sure, please see below.

> 1.  Can you verify that the sampling rate is at 16000 khz or higher?

The plugin supports both 8 kHz and 16 kHz. The sampling rate used in communication with Google is derived from the sampling rate negotiated for RTP via the SDP offer/answer exchange. No resampling is performed.

> 2.  Can you please verify the codec being used is a lossless codec? We recommend FLAC or LINEAR16

LINEAR16.

> 3. Please provide the Entire request code sent to API including the entire RecognitionConfig (see https://cloud.google.com/speech-to-text/docs/reference/rest/v1/RecognitionConfig )

> I send PCMA 8kHz audio to the uniMRCP server but I notice they're saying 16kHz is optimal and PCMA doesn't seem to be a listed codec so are you transcoding ?

PCMA is decoded to raw PCM and sent to Google at 8 kHz. Their phone_call model, which I referred to in my previous post, is trained on 8 kHz data. Naturally, 16 kHz would be preferable, but that is not an option in traditional telephony.
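
For reference, decoding PCMA (G.711 A-law) to 16-bit linear PCM is a simple per-byte expansion. The Python sketch below follows the standard G.711 expansion; it is not the plugin's actual code, and the function names are made up for illustration.

```python
import struct


def alaw_to_linear(code: int) -> int:
    """Decode one 8-bit G.711 A-law code to a signed 16-bit PCM sample."""
    code ^= 0x55                     # undo the even-bit inversion applied by the encoder
    magnitude = (code & 0x0F) << 4   # 4-bit mantissa
    segment = (code & 0x70) >> 4     # 3-bit segment (exponent)
    if segment == 0:
        magnitude += 8
    else:
        magnitude = (magnitude + 0x108) << (segment - 1)
    # In A-law a set sign bit means a positive sample.
    return magnitude if code & 0x80 else -magnitude


def pcma_to_linear16(payload: bytes) -> bytes:
    """Convert an A-law payload to little-endian LINEAR16 at the same sample rate."""
    samples = [alaw_to_linear(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)


# Each input byte becomes two output bytes; the 8 kHz rate is unchanged.
print(len(pcma_to_linear16(bytes([0xD5, 0x55, 0xAA, 0x2A]))))
```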

This would be a typical config sent to Google in a similar case.

{
  "streamingConfig": {
    "config": {
      "encoding": "LINEAR16",
      "sampleRateHertz": 8000,
      "languageCode": "en-US",
      "maxAlternatives": 1,
      "speechContexts": [{"phrases": ["$OOV_CLASS_ALPHANUMERIC_SEQUENCE"]}],
      "model": "phone_call",
      "useEnhanced": true,
      "enableSpokenPunctuation": false,
      "enableSpokenEmojis": false
    },
    "interimResults": true
  }
}



erj

Apr 25, 2022, 10:24:30 AM
to UniMRCP
Hi Arsen,
We got this back from Google. Do you know how to select the model type?

" I have been able to reproduce the issue, and found a setting that fixes it. When setting the model type, have them select “latest long”. I was able to get all digits and a 97% confidence rating at the current sample rate of 8000khz. Please have them try this and let me know. Thank you and have a great weekend."

I've tried various guesses in the grammar (the latest is below) but still get zero confidence levels:

<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-GB" version="1.0" root="pre" mode="voice">
<meta name="single-utterance" content="false"/>
<meta name="input-timeout" content="10000"/>
<meta name="speech-complete-timeout" content="2000"/>
<meta name="scope" content="hint"/>
<meta name="use-enhanced" content="true"/>
<meta name="model" content="phone_call"/>
<meta name="model-type" content="latest long"/>

<meta name="api" content="v1p1beta1"/>
<rule id="pre">
<one-of>
<item>$OOV_CLASS_DIGIT_SEQUENCE</item>
</one-of>
</rule>
</grammar>

Thanks
Ed

erj

Apr 26, 2022, 4:21:29 AM
to UniMRCP
Hi Arsen,
Based on https://cloud.google.com/speech-to-text/docs/basics#select-model I tried latest_long as the model and reported the following to Google:

If I use latest_long as the model (are model and model type the same thing?), I get this response back from Google:

 2022-04-26 09:02:36:359138 [WARN] gRPC Status: Invalid recognition 'config': The requested model is currently not supported for language : en-GB. <ff7f8b955941497d@gsr> 

Is the model tied to a specific language?

This is the SRGS grammar file I'm using:

<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-GB" version="1.0" root="pre" mode="voice">
<meta name="single-utterance" content="false"/>
<meta name="input-timeout" content="10000"/>
<meta name="speech-complete-timeout" content="2000"/>
<meta name="scope" content="hint"/>
<meta name="use-enhanced" content="true"/>
<meta name="model" content="latest_long"/>
<meta name="api" content="v1p1beta1"/>
<rule id="pre">
<one-of>
<item>$OOV_CLASS_DIGIT_SEQUENCE</item>
</one-of>
</rule>
</grammar>

Note that if I change the model to phone_call, it works, albeit with the zero confidence values.

Thanks
Ed

Arsen Chaloyan

May 3, 2022, 6:25:52 PM
to UniMRCP
Hi Ed,

Yes, you specified the model name correctly. In fact, the GSR plugin does not validate the model field at all; it simply passes the value along to Google.

The "latest" models were introduced by Google relatively recently, and they are indeed not available for all language codes. You may check the following page for the current availability matrix.
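
Since availability depends on the language code, an application that builds its grammars dynamically could pick the model from a small lookup table before generating the meta tags. The Python sketch below is only an illustration: the table and the choose_model helper are hypothetical, and the single entry reflects only what was observed in this thread (phone_call accepted for en-GB, latest_long rejected). The table would need to be filled in from Google's availability page.

```python
from typing import Dict, Optional, Sequence, Set

# Hypothetical, hand-maintained table; populate it from Google's availability page.
SUPPORTED_MODELS: Dict[str, Set[str]] = {
    "en-GB": {"phone_call"},
}


def choose_model(
    lang: str,
    preferred: Sequence[str],
    table: Dict[str, Set[str]] = SUPPORTED_MODELS,
) -> Optional[str]:
    """Return the first preferred model listed for lang, or None if none is listed."""
    available = table.get(lang, set())
    for model in preferred:
        if model in available:
            return model
    return None


# Prefer latest_long, fall back to phone_call where latest_long is not listed.
print(choose_model("en-GB", ["latest_long", "phone_call"]))
```

When the function returns None, the application could simply omit the model meta tag from the grammar.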


