Voice activity getting detected prematurely causing TTS to barge


Bharath Shankar

Jul 10, 2019, 2:27:57 AM
to UniMRCP
I am facing a peculiar problem where the TTS is being barged in on prematurely, even when I put my phone on mute. In the logs I see "Detected Voice Activity", which causes the TTS to stop abruptly. Log snippets are pasted below.
====================================================================================
2019-07-10 05:34:39:576990 [INFO]   Receive MRCPv2 Data 172.31.8.9:1544 <-> 172.31.8.9:53450 [351 bytes]
MRCP/2.0 351 RECOGNIZE 2
Channel-Identifier: 0584bd1c19f445d3@speechrecog
Content-Type: text/uri-list
Cancel-If-Queue: false
Start-Input-Timers: false
No-Input-Timeout: 5000
Sensitivity-Level: 0.3
Speech-Complete-Timeout: 500
Recognition-Timeout: 5000
Confidence-Threshold: 0.1
Speech-Language: XXX
Content-Length: 17

..................................
....................................
2019-07-10 05:34:39:594421 [INFO]   Receive MRCPv2 Data 172.31.8.9:1544 <-> 172.31.8.9:53468 [468 bytes]
MRCP/2.0 468 SPEAK 1
Channel-Identifier: 95366fe9c5414f32@speechsynth
Content-Type: text/plain
Speech-Language: XXX
Content-Length: 323
<XXXXX>

2019-07-10 05:34:39:594437 [INFO]   Assign Control Channel <95366fe9c5414f32@speechsynth> to Connection 172.31.8.9:1544 <-> 172.31.8.9:53468 [0] -> [1]
2019-07-10 05:34:39:594453 [INFO]   Process SPEAK Request <95366fe9c5414f32@speechsynth> [1]
2019-07-10 05:34:39:594488 [INFO]   Text for Synthesis [<speak> <prosody rate=\"95%\">XXXXXXXXXXXXXXX</prosody></speak>]
2019-07-10 05:34:39:594496 [INFO]   Request command [aws polly synthesize-speech --text-type ssml --language-code hi-IN --sample-rate 8000 --text " <speak> <prosody rate=\"95%\">XXXXXXXXXXXXXXXX</prosody></speak> " --output-format pcm --voice-/local/unimrcp/data/95366fe9.pcm]
2019-07-10 05:34:39:601061 [INFO]   Process SPEAK-COMPLETE Event <95366fe9c5414f32@speechsynth> [1]
2019-07-10 05:34:39:601080 [INFO]   Unexpected SPEAK-COMPLETE Event <95366fe9c5414f32@speechsynth> [1]
2019-07-10 05:34:40:151709 [INFO]   Set [/usr/local/unimrcp/data/95366fe9.pcm] as Speech Source <95366fe9c5414f32@speechsynth>
2019-07-10 05:34:40:151730 [INFO]   Process SPEAK Response <95366fe9c5414f32@speechsynth> [1]
2019-07-10 05:34:40:151741 [NOTICE] State Transition IDLE -> SPEAKING <95366fe9c5414f32@speechsynth>
2019-07-10 05:34:40:151776 [INFO]   Send MRCPv2 Data 172.31.8.9:1544 <-> 172.31.8.9:53468 [83 bytes]
MRCP/2.0 83 1 200 IN-PROGRESS
Channel-Identifier: 95366fe9c5414f32@speechsynth
2019-07-10 05:34:40:520986 [INFO]   Detected Voice Activity <0584bd1c19f445d3@speechrecog>
2019-07-10 05:34:40:521030 [INFO]   Process START-OF-INPUT Event <0584bd1c19f445d3@speechrecog> [2]
2019-07-10 05:34:40:521067 [INFO]   Send MRCPv2 Data 172.31.8.9:1544 <-> 172.31.8.9:53450 [94 bytes]
MRCP/2.0 94 START-OF-INPUT 2 IN-PROGRESS
Channel-Identifier: 0584bd1c19f445d3@speechrecog

2019-07-10 05:34:40:522806 [INFO]   Receive MRCPv2 Data 172.31.8.9:1544 <-> 172.31.8.9:53468 [72 bytes]
MRCP/2.0 72 STOP 2
Channel-Identifier: 95366fe9c5414f32@speechsynth

2019-07-10 05:34:40:522836 [INFO]   Process STOP Request <95366fe9c5414f32@speechsynth> [2]
2019-07-10 05:34:40:531050 [NOTICE] State Transition SPEAKING -> IDLE <95366fe9c5414f32@speechsynth>
======================================================================================================

It's not clear to me how the recognition plugin received MPF_DETECTOR_EVENT_ACTIVITY when there was no activity (the phone was on mute).

Also, when barge-in occurs, the recognition plugin is expected to receive a START-INPUT-TIMERS request from the client, and only then does the plugin need to process the input. Am I correct?

Any help regarding this is much appreciated.

Doug Rylaarsdam

Jul 10, 2019, 12:01:55 PM
to uni...@googlegroups.com
Hi Bharath,

Based on your description, I might try experimenting with the speech-start-timeout in the GSR plugin configuration, to see if it makes any difference in the frequency of unexpected barge-in events. Also per the usage guide, the sensitivity level of 0.3 in the RECOGNIZE request maps to GSR vad-mode 1, so that might be something to experiment with as well. 

The START-INPUT-TIMERS request would only tell the recognizer to start the timers. The recognizer would have been listening (processing input) since the time it received the RECOGNIZE request.

Doug
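
The distinction Doug describes can be shown with a toy Python model. This is only a sketch, not UniMRCP code: the class and method names are hypothetical, and the per-frame speech/non-speech decision is assumed to come from some detector. It shows a recognizer that analyses audio from the moment RECOGNIZE arrives, while the no-input timer only starts counting once START-INPUT-TIMERS is received.

```python
class ToyRecognizer:
    """Toy model only; class and method names are hypothetical, not UniMRCP APIs."""

    def __init__(self, no_input_timeout_ms):
        self.no_input_timeout_ms = no_input_timeout_ms
        self.timers_running = False   # mirrors Start-Input-Timers: false
        self.silence_ms = 0           # silence counted against the no-input timer
        self.events = []

    def start_input_timers(self):
        # START-INPUT-TIMERS only arms the timer; listening began at RECOGNIZE.
        self.timers_running = True

    def feed(self, frame_ms, is_speech):
        """Process one audio frame that the detector classified as speech or not."""
        if self.events:               # an event was already raised; nothing more to do
            return
        if is_speech:
            self.events.append("START-OF-INPUT")
        elif self.timers_running:
            self.silence_ms += frame_ms
            if self.silence_ms >= self.no_input_timeout_ms:
                self.events.append("NO-INPUT")


rec = ToyRecognizer(no_input_timeout_ms=5000)
for _ in range(1000):                 # 10 s of silence, timers never started
    rec.feed(10, is_speech=False)
rec.feed(10, is_speech=True)          # detected activity still raises the event
print(rec.events)                     # ['START-OF-INPUT']
```

In this model, a false activity detection during the prompt produces START-OF-INPUT whether or not the timers were ever started, which matches the sequence in the log above.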


--
You received this message because you are subscribed to the Google Groups "UniMRCP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unimrcp+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/unimrcp/3ea5a870-a9fa-4843-8d66-f1c3a7773c4b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Arsen Chaloyan

Jul 10, 2019, 4:50:01 PM
to UniMRCP
It is also not very clear which plugin and VAD Bharath uses in particular. It appears to me that he refers to mpf_activity_detector, which is based on energy level detection and is not what is used within commercial plugins.
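
As a rough illustration of what "energy level detection" means, here is a simplified Python sketch. It is not the actual mpf_activity_detector source: the function names, the default threshold and the timeout values are assumptions chosen for the example. The idea is that each frame's mean absolute amplitude is compared with a level threshold, and activity is declared once the level stays above the threshold long enough.

```python
def frame_level(samples):
    """Mean absolute amplitude of one frame of 16-bit PCM samples."""
    return sum(abs(s) for s in samples) // len(samples) if samples else 0


def detect_activity(frames, level_threshold=2, speech_timeout_ms=300, frame_ms=10):
    """Return the time (ms) at which activity is declared, or None.

    Activity is declared once the level stays at or above the threshold
    for speech_timeout_ms without interruption. Defaults are illustrative.
    """
    loud_ms = 0
    for index, frame in enumerate(frames):
        if frame_level(frame) >= level_threshold:
            loud_ms += frame_ms
            if loud_ms >= speech_timeout_ms:
                return (index + 1) * frame_ms
        else:
            loud_ms = 0               # level dropped: start counting again
    return None


silence = [[0] * 80 for _ in range(100)]          # 1 s of all-zero samples
hiss = [[3, -3] * 40 for _ in range(100)]         # 1 s of very quiet noise
print(detect_activity(silence))                   # None
print(detect_activity(hiss))                      # 300
print(detect_activity(hiss, level_threshold=10))  # None
```

If the incoming stream carries any low-level noise rather than all-zero samples, a detector of this kind with a low threshold reports activity even though nobody is speaking; raising the threshold filters it out.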




--
Arsen Chaloyan
Author of UniMRCP
http://www.unimrcp.org

Bharath Shankar

Jul 11, 2019, 2:25:58 AM
to UniMRCP

Thanks for your inputs Doug and Arsen.

Sorry, I should have been clearer about the plugin. We are not using any commercial plugins; we have customized the UniMRCP demo plugin itself. So on the MRCP side, as I understand it, I will have to experiment with the mpf_activity_detector level_threshold. Activity detection happens at the MPF level and not at the plugin level, so shouldn't the same logic apply to commercial plugins as well?


lnm01

Jul 22, 2019, 11:52:10 AM
to UniMRCP
Any update on this? We are using Cisco VVB with the UniMRCP GSR plugin and are having the same issue, with the GSR plugin's VAD set to 0.

Arsen Chaloyan

Jul 24, 2019, 7:17:12 PM
to UniMRCP
The implementation of VAD available in the open source domain differs fundamentally from the one in the commercial plugins, even though you clearly may experience false positives with both.

The following comments concern implementation in GSR. By setting VAD to 0, you make it very sensitive. Set the mode to 2 or 3, and use a longer speech-incomplete-timeout, let's say 15000 msec, and the problem will be resolved.

If you continue to experience problems, collect logs/recorded utterance and submit for review.


lnm01

Aug 8, 2019, 11:15:47 AM
to UniMRCP
Thanks. We also had to increase the sensitivity on the VXML side for the change to take effect; however, it still appears to be very sensitive. Any other thoughts on parameters to change? Is there anything in the streaming settings we could change?
      single-utterance="true"
      interim-results="true"
      start-of-input="service-originated"
or the detector?
      speech-leading-silence="300"
      speech-trailing-silence="300"
      speech-output-period="200"

Bharath Shankar did you get your issue resolved?

Thanks
Mark

Doug Rylaarsdam

Aug 8, 2019, 12:43:30 PM
to UniMRCP
What is the value of speech-start-timeout in your umsgsr.xml file? (I observed a default of 50 in my installation). I don't know any details of the detection algorithms in use, but per the description this sets the length of a transition mode prior to start of speech. If the algorithm allows for cancelling transition mode (due to ongoing analysis of the input signal during transition), then maybe a longer value of this timeout might allow for filtering out more false positives. If my description is accurate, setting the value too large could also delay start of speech, making barge-in response feel slow.

Arsen Chaloyan

Aug 9, 2019, 6:40:09 PM
to UniMRCP
Right, setting speech-start-timeout to a longer value would delay START-OF-INPUT, but too large a value may also result in missing short utterances. For some of the SR plugins, this parameter can already be overridden per recognition request; for the others, including GSR/GDF, the change will be available in upcoming releases.
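
The trade-off can be illustrated with a small Python sketch. It assumes, as Doug speculated, that the detector cancels the transition when the signal drops before the start timeout elapses; the function name and the burst data are made up for the example and do not come from any plugin.

```python
def start_of_input_time(bursts, start_timeout_ms):
    """Return when START-OF-INPUT would fire, or None if it never does.

    bursts: list of (start_ms, duration_ms) spans of continuous
    above-threshold audio. A burst shorter than start_timeout_ms is
    assumed to be discarded (the transition is cancelled).
    """
    for start_ms, duration_ms in bursts:
        if duration_ms >= start_timeout_ms:
            return start_ms + start_timeout_ms
    return None


# Hypothetical input: a 120 ms click, a 250 ms short word, a 900 ms phrase.
bursts = [(1000, 120), (4000, 250), (7000, 900)]
print(start_of_input_time(bursts, 50))    # 1050: the click triggers barge-in
print(start_of_input_time(bursts, 200))   # 4200: click filtered, short word detected
print(start_of_input_time(bursts, 300))   # 7300: short word missed, event 300 ms late
```

Under these assumptions a short timeout fires on noise, while a long one both delays the event and skips utterances shorter than the timeout.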


lnm01

Aug 14, 2019, 11:03:04 AM
to UniMRCP
Right now it's set to 300; I can try setting it back down to 50 again.

lnm01

Aug 14, 2019, 1:24:19 PM
to UniMRCP
Tried 50, no dice: it still chops the initial TTS prompt. I set it back to 300; it happens even with the larger speech-incomplete-timeout.

Arsen Chaloyan

Aug 14, 2019, 10:13:18 PM
to UniMRCP
Collect logs and utterances and submit for review.


Mark Eichten

Aug 15, 2019, 5:39:52 AM
to uni...@googlegroups.com
Hi Arsen, can you please let me know which logs you would like? VVB, VXML, UniMRCP, SIP? And at what levels? Thanks

Arsen Chaloyan

Aug 15, 2019, 10:09:02 PM
to UniMRCP
Hi Mark,

I am referring to the logs and utterances to be obtained from UniMRCP server. Default log levels should be sufficient. Please follow the procedure below.

1. Enable collection of utterances from the configuration of the SR plugin you use.

<utterance-manager
   save-waveforms="true"

2. Restart the service for the change to take effect.

systemctl restart unimrcp

3. Reproduce the problem.

4. Collect logs stored in /opt/unimrcp/log and wave files stored in /opt/unimrcp/var directories respectively.
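
For step 4, the files can be bundled into a single archive before attaching them. The following Python sketch uses the two directories named above; the function name and the archive name are arbitrary choices made for this example.

```python
import tarfile
from pathlib import Path


def bundle_diagnostics(log_dir="/opt/unimrcp/log", var_dir="/opt/unimrcp/var",
                       output="unimrcp-diagnostics.tar.gz"):
    """Pack *.log files and *.wav utterances into one gzip-compressed tarball.

    Returns the list of files that were added.
    """
    files = sorted(Path(log_dir).glob("*.log")) + sorted(Path(var_dir).glob("*.wav"))
    with tarfile.open(output, "w:gz") as archive:
        for path in files:
            # Keep the parent directory name so logs and utterances stay separate.
            archive.add(path, arcname=f"{path.parent.name}/{path.name}")
    return [str(path) for path in files]
```

Calling `bundle_diagnostics()` on the server after reproducing the problem writes the archive to the current working directory.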


lnm01

Aug 22, 2019, 11:02:28 AM
to UniMRCP
log files attached, thanks Mark


umsgsr-fcd875c3c25049e6-102.wav
umsgsr-fcd875c3c25049e6-102.json
unimrcpserver_2019.08.21_14.33.41.193904.log
unimrcpserver_2019.08.20_00.16.21.935350.log

Arsen Chaloyan

Aug 28, 2019, 4:51:49 PM
to UniMRCP
Hi Mark,

Well, let's analyze the supplied logs corresponding to the recorded utterance umsgsr-fcd875c3c25049e6-102.wav.

C -> S: RECOGNIZE

2019-08-22 08:29:42:131617 [INFO]   Receive MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37029 [553 bytes]
MRCP/2.0 551 RECOGNIZE 102
Channel-Identifier: fcd875c3c25049e6@speechrecog

A RECOGNIZE request is received from Cisco VVB. Everything is ordinary so far, not counting the two preceding DEFINE-GRAMMAR requests with the same content (one of them is simply redundant) and other known problems with the MRCP message content length.

The following statements indicate recognition parameters used for this request. I do not see anything that would catch my attention.

2019-08-22 08:29:42:131747 [INFO]   Init Speech Detector: frame-size=160, max-frame-count=180, output-frame-count=20, vad-mode=2, noinput-timeout=8000 ms, input-timeout=20000 ms, start-timeout=500 ms, complete-timeout=700 ms, incomplete-timeout=1000 ms, leading-silence=300 ms, trailing-silence=300 ms, interim-results=1, start-of-input=external <fcd875c3c25049e6>
2019-08-22 08:29:42:131774 [INFO]   Open Waveform File for Writing /opt/unimrcp/var/umsgsr-fcd875c3c25049e6-102.wav, sampling-rate [8000]
2019-08-22 08:29:42:131955 [DEBUG]  Init Streaming Config: encoding=1, sampling-rate=8000, language=en-US, max-alternatives=1, interim-results=1, single-utterance=1 profanity-filter=0, word-time-offsets=0, auto-punctuation=0, use-enhanced=1 <fcd875c3c25049e6@gsr>
2019-08-22 08:29:42:131971 [DEBUG]  Set model [phone_call] <fcd875c3c25049e6@gsr>

S -> C: IN-PROGRESS

2019-08-22 08:29:42:132637 [INFO]   Send MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37029 [85 bytes]
MRCP/2.0 85 102 200 IN-PROGRESS
Channel-Identifier: fcd875c3c25049e6@speechrecog

The request is acknowledged in a timely manner.

C -> S: SPEAK

2019-08-22 08:29:42:198919 [INFO]   Receive MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37030 [459 bytes]
MRCP/2.0 457 SPEAK 100
Channel-Identifier: 73c343fda44e4b03@speechsynth

In the meantime, VVB establishes a new MRCP session and sends a SPEAK request.

S -> C: IN-PROGRESS

2019-08-22 08:29:42:412495 [INFO]   Send MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37030 [85 bytes]
MRCP/2.0 85 100 200 IN-PROGRESS
Channel-Identifier: 73c343fda44e4b03@speechsynth

This request is also properly acknowledged, which indicates that the synthesized speech is ready and started to be streamed to the client (VVB).

S -> C: START-OF-INPUT

2019-08-22 08:29:45:658081 [INFO]   Send MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37029 [117 bytes]
MRCP/2.0 117 START-OF-INPUT 102 IN-PROGRESS
Channel-Identifier: fcd875c3c25049e6@speechrecog
Input-Type: speech

This is the time when a start of input is determined and sent to the client. As you may see, the difference between the last two events, when the streaming of synthesized speech is started and barge-in occurred, is about 3 seconds, which should be enough to get the prompt "My Name is Perry... I'm your Personal Virtual Assistant. How can I help you?" to be played nearly till completion.
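
The "about 3 seconds" figure can be checked directly from the two log timestamps. A minimal Python sketch, assuming the `YYYY-MM-DD HH:MM:SS:microseconds` layout seen in the log lines above (the helper name is made up for this example):

```python
from datetime import datetime

# Timestamp layout used in the unimrcpserver log lines quoted in this thread.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S:%f"


def seconds_between(earlier, later):
    """Return the elapsed time in seconds between two log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


speak_in_progress = "2019-08-22 08:29:42:412495"   # S -> C: IN-PROGRESS (SPEAK)
start_of_input = "2019-08-22 08:29:45:658081"      # S -> C: START-OF-INPUT
print(round(seconds_between(speak_in_progress, start_of_input), 3))  # 3.246
```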

C -> S: STOP

2019-08-22 08:29:46:111686 [INFO]   Receive MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37030 [76 bytes]
MRCP/2.0 74 STOP 101
Channel-Identifier: 73c343fda44e4b03@speechsynth

As a result of the START-OF-INPUT event, VVB sends a STOP request.

S -> C: COMPLETE

2019-08-22 08:29:46:121394 [INFO]   Send MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37030 [112 bytes]
MRCP/2.0 112 101 200 COMPLETE
Channel-Identifier: 73c343fda44e4b03@speechsynth
Active-Request-Id-List: 100

This is a completion response to the STOP request.

S -> C: RECOGNITION-COMPLETE

2019-08-22 08:29:47:709275 [INFO]   Send MRCPv2 Data 10.38.244.162:1544 <-> 10.38.244.153:37029 [465 bytes]
MRCP/2.0 465 RECOGNITION-COMPLETE 102 COMPLETE
Channel-Identifier: fcd875c3c25049e6@speechrecog
Completion-Cause: 000 success
Content-Type: application/x-nlsml
Content-Length: 278

<?xml version="1.0"?>
<result>
  <interpretation grammar="session:fie...@field.grammar" confidence="0.96">
    <instance>what&apos;s the weather in Omaha Nebraska</instance>
    <input mode="speech">what&apos;s the weather in Omaha Nebraska</input>
  </interpretation>
</result>

Finally, the recognition completes with the expected result.

In general, there is nothing suspicious in this message flow or in the timings. The question is what your experience actually was in this particular case. Did you listen to the prompt almost until the end and then start to speak, or something else?

The way I see it, even if there is a problem, it is unlikely related to VAD. In order to correlate all the RTP streams involved in this scenario, you may need to make a network capture of an incoming call to VVB, and two MRCP sessions originated by VVB for speech recognition and synthesis.



lnm01

Sep 6, 2019, 11:34:58 AM
to UniMRCP
Arsen, attached is a recording of the caller experience.

You can hear me speak the utterance and the initial prompt pause:

Hello and welcome to Qantas Frequent Flyer. Please tell me how I can assist you today. You can say things like update my contact information, or I need a replacement card. If you get stuck along the way, say help and I’ll take it from there.

After the result, it continues to play the initial message and then goes straight to the replacement card prompt:

In order to replace your card, I will need to authenticate your account and connect you to an associate.

Please let me know if you have any questions.

thanks
Mark

demo tts playback unimrcp.mp3

Arsen Chaloyan

Sep 10, 2019, 8:40:49 PM
to UniMRCP
Hi Mark,

The provided recording and your description of the problem reminded me of an issue which was supposedly fixed in GSS 1.5.0.

> If a SPEAK request is stopped when synthesized speech is being streamed to the client, then the remaining part of the synthesized speech could be left in media buffer and be streamed on a consecutive SPEAK request received in the scope of the same MRCP session.

However, it turns out the problem was not addressed entirely. The mystery is over at least...
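
The class of problem quoted above can be pictured with a toy Python model. This is not GSS code; the class, its methods and the frame labels are invented purely to illustrate how audio left in a buffer after STOP can leak into the next SPEAK request of the same session.

```python
from collections import deque


class ToyMediaBuffer:
    """Illustrative only: a per-session queue of synthesized audio frames."""

    def __init__(self, flush_on_stop):
        self.flush_on_stop = flush_on_stop
        self.frames = deque()

    def speak(self, frames):
        self.frames.extend(frames)          # synthesized audio is queued

    def stop(self):
        if self.flush_on_stop:
            self.frames.clear()             # discard whatever was not streamed yet

    def stream(self, count):
        """Send up to `count` queued frames to the client."""
        return [self.frames.popleft() for _ in range(min(count, len(self.frames)))]


def second_prompt_audio(flush_on_stop):
    buf = ToyMediaBuffer(flush_on_stop)
    buf.speak(["A1", "A2", "A3"])           # first prompt
    buf.stream(1)                           # caller barges in after one frame
    buf.stop()
    buf.speak(["B1"])                       # second prompt in the same session
    return buf.stream(10)


print(second_prompt_audio(flush_on_stop=False))  # ['A2', 'A3', 'B1']
print(second_prompt_audio(flush_on_stop=True))   # ['B1']
```

Without the flush, the remainder of the first prompt is heard before the second one, which is consistent with what Mark describes in his recording.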

Thanks




lnm01

Sep 11, 2019, 3:08:31 PM
to UniMRCP
OK, that's great news. Where can I look for a status update on the resolution? Thanks, Mark


Arsen Chaloyan

Sep 11, 2019, 8:08:21 PM
to UniMRCP
I'll provide an update here when the patched version becomes available.



Arsen Chaloyan

Sep 12, 2019, 10:25:45 PM
to UniMRCP
Well, the patch is available. Use

yum install unimrcp-gss
or
sudo apt-get install unimrcp-gss

lnm01

Oct 5, 2019, 7:17:08 AM
to UniMRCP
Arsen,

After applying the patch we've done some additional testing, and it appears this issue is still occurring. Please let me know what additional logging we need to capture to troubleshoot further.

Thanks
Mark


lnm01

Oct 9, 2019, 12:12:58 PM
to UniMRCP
Arsen, as a workaround, splitting the TTS and STT devices into two on the VVB side appears to work and provides the desired behavior. Thanks, Mark

Fabrizio Ammollo

Oct 9, 2019, 12:32:27 PM
to UniMRCP

Hello,

it seems to me that I may be having a similar problem with the ASR component (I am using a separate server for TTS, not UniMRCP).
I am using GSR plugin version 1.14.0, first version I have installed, and before writing this message I verified that the umc scenarios work correctly reading the input from the demo files, so the base setup is OK.

What is happening exactly is that I have a somewhat long audio message with barge-in enabled, and I can hear no more than about the first two seconds of it, because a no-match is returned; if I speak immediately, the voice is processed and results are obtained.
I am using the builtin:speech/transcribe grammar with most default values taken from GSR defaults, I tried to use a minimum set of headers into the RECOGNIZE request, using the same values set by the umc client (in this case I tried changing Recognition-Timeout from 10000 to 20000 but it made no difference at all).
This is the log of unimrcpserver from the RECOGNIZE request up to the nomatch (I am using MRCPv1 because I use VoiceXML interpreter internally developed by our company which still uses only that MRCP version):

2019-10-09 17:32:32:962373 [INFO]   Receive RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [568 bytes]
ANNOUNCE rtsp://10.51.54.199:1554/speechrecognizer RTSP/1.0
Content-Length: 284
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Supported: method.announce
User-Agent: VoiceXML-SIP/6.4.4-2176:2177 (Linux 3.10.0-862.3.2.el7.x86_64) VoiceXML/2.1 JRE/1.8.0_171

RECOGNIZE 3 MRCP/1.0
Content-Length: 55
Speech-Language: it-IT
No-Input-Timeout: 10000
Save-Waveform: true
Content-Type: text/uri-list
Confidence-Threshold: 50
Recognition-Timeout: 20000
Recognizer-Start-Timers: false

session:aNAVWcxPeGmMOHQQ...@external.com

2019-10-09 17:32:32:962492 [INFO]   Process RECOGNIZE Request <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:32:962583 [INFO]   Init Speech Detector: frame-size=160, max-frame-count=350, output-frame-count=20, vad-mode=2, noinput-timeout=10000 ms, input-timeout=20000 ms, start-timeout=50 ms, complete-timeout=1000 ms, incomplete-timeout=3000 ms, leading-silence=300 ms, trailing-silence=300 ms, interim-results=1, start-of-input=external <0a2ac1c42a994d7a>
2019-10-09 17:32:32:962643 [INFO]   Open Waveform File for Writing /opt/unimrcp/var/umsgsr-0a2ac1c42a994d7a-3.wav, sampling-rate [8000]
2019-10-09 17:32:32:962801 [INFO]   Create gRPC Stream <0a2ac1c42a994d7a@gsr>
2019-10-09 17:32:32:963736 [INFO]   Process RECOGNIZE Response <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:32:963750 [INFO]   State Transition IDLE -> RECOGNIZING <0a2ac1c42a994d7a@speechrecog>
2019-10-09 17:32:32:963785 [INFO]   Send RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [138 bytes]
RTSP/1.0 200 OK
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Content-Length: 30

MRCP/1.0 3 200 IN-PROGRESS


2019-10-09 17:32:33:092620 [INFO]   Speech Detector State Transition NO-INPUT -> IN-PROGRESS [130 ms] <0a2ac1c42a994d7a>
2019-10-09 17:32:33:092649 [INFO]   Start Input Timer [20000 ms] <0a2ac1c42a994d7a>
2019-10-09 17:32:36:572595 [INFO]   Speech Detector State Transition IN-PROGRESS -> COMPLETE [3480 ms] <0a2ac1c42a994d7a>
2019-10-09 17:32:36:572623 [INFO]   Detector Stats: leading-silence=20 ms, input=530 ms, trailing-silence=3000 ms <0a2ac1c42a994d7a>
2019-10-09 17:32:36:572869 [INFO]   Input Complete [success] size=61280 bytes, dur=3830 ms <0a2ac1c42a994d7a@gsr>
2019-10-09 17:32:36:628037 [INFO]   Received Response: status [0] type [unspecified] result-count [0] <0a2ac1c42a994d7a@gsr>
2019-10-09 17:32:36:628135 [INFO]   Process START-OF-SPEECH Event <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:36:628197 [INFO]   Process RECOGNITION-COMPLETE Event <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:36:628215 [INFO]   State Transition RECOGNIZING -> RECOGNIZED <0a2ac1c42a994d7a@speechrecog>
2019-10-09 17:32:36:628238 [INFO]   Send RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [194 bytes]
ANNOUNCE rtsp://10.51.54.199:1554/speechrecognizer RTSP/1.0
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Content-Length: 42

START-OF-SPEECH 3 IN-PROGRESS MRCP/1.0


2019-10-09 17:32:36:628282 [INFO]   Send RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [329 bytes]
ANNOUNCE rtsp://10.51.54.199:1554/speechrecognizer RTSP/1.0
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Content-Length: 176

RECOGNITION-COMPLETE 3 COMPLETE MRCP/1.0
Completion-Cause: 001 no-match
Waveform-Url: <http://localhost/utterances/umsgsr-0a2ac1c42a994d7a-3.wav>;size=61280;duration=3830


If this can help you understand what is happening and how to prevent it, the indicated duration is 3480 ms but if I (try to) start speaking immediately when the audio message begins there are almost 2 seconds of silence at the beginning of the saved waveform (I am unable to explain now where that comes from).
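
As a sanity check on the saved waveform, the size and duration reported in the Waveform-Url header are consistent with each other if the audio is 16-bit mono linear PCM. The sample width and channel count are assumptions here; only the 8000 Hz sampling rate appears in the log. A short Python sketch of the arithmetic (the function name is made up):

```python
def pcm_duration_ms(size_bytes, sample_rate_hz=8000, bytes_per_sample=2, channels=1):
    """Duration in milliseconds of raw PCM audio of the given size.

    Assumes 16-bit mono linear PCM unless told otherwise.
    """
    return size_bytes * 1000 // (sample_rate_hz * bytes_per_sample * channels)


print(pcm_duration_ms(61280))  # 3830, matching duration=3830 in Waveform-Url
```

So the 3830 ms covers everything streamed into the file from the start of the RECOGNIZE request, including any audio received before speech was detected.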

Best regards,
Fabrizio

Arsen Chaloyan

Oct 10, 2019, 8:36:40 AM
to UniMRCP
Mark,

I would be quite confident that the issue I observed has been properly addressed in the patched version. At the same time, I cannot tell whether or not that was the same problem you were actually encountering, but it appeared so to me.

You can certainly use two different MRCP servers for TTS and STT. However, this normally should not make any difference in the behavior, unless there are interoperability issues with VVB that I am not aware of.



Arsen Chaloyan

Oct 10, 2019, 8:41:34 AM
to UniMRCP
Hello Fabrizio,

To avoid any misunderstanding in the actual and expected behavior, please provide a network capture along with the logs and an utterance saved on UniMRCP server.

If you have any questions or difficulties obtaining this information, let us know. I'll take a further look at this issue when the complete data is available.



Fabrizio Ammollo

Oct 10, 2019, 10:52:48 AM
to UniMRCP
Hello Arsen,

I have taken captures, logs and wave forms of three different calls, two with a prematurely ended recognition and a third with a correctly timed recognition; first of all, to clarify my local setup and scenario, there are some differences from my initial attempts:

1) I am using Google TTS using the GSS plugin, so you will also see those requests inside the captures and logs; the SPEAK requests ask to synthesize the following SSML:

<?xml version="1.0"?><speak>Buongiorno, e benvenuto nel servizio telefonico automatizzato per l'acquisto dei biglietti.
<break time="1s" />

Quale è la stazione di partenza?</speak>

and when the recognition is prematurely ended and nomatch is returned, the following SSML is synthesized and another RECOGNIZE is requested:

<?xml version="1.0"?><speak>Non ho capito, puoi ripetere?</speak>

The ASR and TTS requests are made by two different components so they are on two different RTSP sessions.

2) I changed/added some parameters of the RECOGNIZE request setting them into our VoiceXML application to high/very high values in order to see if something changed, and indeed instead of just about a little more than 2 seconds I am now usually able to hear a much longer part of it, but very rarely its end is reached; the parameters I have changed/added are the following:

No-Input-Timeout: 20000 (before it was 10000)
Recognition-Timeout: 20000 (before I did not specify it at all)
Speech-Incomplete-Timeout: 60000 (before I did not specify it at all)

It was after adding the last one that the real difference started to show up.

Explanation of the contents of the zip file:
You can find the three different calls: for each one I have included the .cap file, the log file of unimrcpserver and the wave forms of the utterances; of these, only the first file is actually relevant, since I always hung up the call during the second RECOGNIZE.

As I initially wrote, the first two calls contain a partial playback of the first message, up to different points, followed by the second message because of the no-match. The third call contains a full playback of the first message, followed by the second message. In all three calls I did not speak at all, since I was only interested in the premature interruption without speaking (when recognition does occur, it seems to me that it works fine).

For each one of the calls, here are the base file names and this is what I was able to hear of the first message:

umsgsr-96920c770af74fe7 "... qual è la stazione" (truncated and then immediately the second message)
umsgsr-b92a33529b7d4d36 "... qual è la stazione di" (truncated and then immediately the second message)
umsgsr-54db72656ac24109 "... qual è la stazione di partenza?" (optimal, full message and there is still time to speak after the end of the message, after that the second message)

I was wondering if it could be possible to have more debug information on the Google side, I did not find anything useful within the Google Cloud Console.

If you need more data/more calls just ask me and I'll try to provide them to you.

Best regards,
Fabrizio

captures-logs.zip

Arsen Chaloyan

Oct 18, 2019, 10:49:37 AM
to UniMRCP
Hello Fabrizio,

Thanks for the coherent details you provided. I apologize for such a late response; I only just got a chance to take a look and follow up on this issue.

As you may know, each side may indicate completion of speech input. The timeouts that you specify apply to internal speech detector only. Google unfortunately does not provide any means to adjust the timeouts on the far end.

Based on what I see in the provided logs, the input completed based on an indication from Google.

[INFO]   Received Response: status [1] type [end-of-utterance]

If you run recognition in continuous mode (single-utterance=false), then it would be solely up to the client (GSR plugin) to indicate the completion, which would help work around the problem.
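
The rule described here can be written down as a small decision function. This Python sketch is only a reading of the explanation above, not the GSR plugin's logic; the function and argument names are hypothetical.

```python
def completion_source(single_utterance, service_signalled_end, detector_complete):
    """Return which side ends the input: 'service', 'detector' or None.

    None means neither side has completed yet, so recognition keeps listening.
    """
    if single_utterance and service_signalled_end:
        return "service"       # the far end decided the utterance was over
    if detector_complete:
        return "detector"      # local speech detector timeouts expired
    return None


print(completion_source(True, True, False))    # service
print(completion_source(False, True, False))   # None
print(completion_source(False, True, True))    # detector
```

In continuous mode the end-of-utterance indication from the service no longer completes the input, so only the timeouts configured on the MRCP side apply.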





Fabrizio Ammollo

Oct 23, 2019, 7:16:53 AM
to uni...@googlegroups.com
Hello Arsen,

thank you for your reply and I'm sorry if mine was late too.

After setting recognition to continuous mode, things have definitely improved. Now, though rarely, after making some test calls, I see somewhat the opposite problem: after I finish speaking, and I can see in the logs on the console that Google has returned at least one result, a long time passes before the recognition result is acknowledged. This never happened during the last phase of the test calls, so I think it may have happened because there was more background noise (I work in an open-space office).

Thank you very much again for your help.

Best regards,
Fabrizio


Arsen Chaloyan

Oct 26, 2019, 4:52:10 PM
to UniMRCP
Hello Fabrizio,

It is always recommended to provide logs and utterances to be able to correlate all the details that may matter. Believe me, things are not as trivial as may seem here.

As for using the continuous mode, I take this as a workaround but not really a straight solution to the problem you experienced initially. Ideally, the plugin should initiate a new recognition request if the original one is terminated by Google with empty results, provided the speech timeouts employed on the MRCP leg still allow/assume further processing. In fact, this routine has been implemented in the Azure (formerly Bing) plugin for a while.
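
The restart routine described above might look roughly like the following Python sketch. It is a simulation under stated assumptions, not plugin code: each stream is represented by a made-up `(duration_ms, transcript)` pair, and the function name and return shape are hypothetical.

```python
def recognize_with_restart(stream_outcomes, input_timeout_ms):
    """Simulate restarting recognition while the MRCP-side timeout allows it.

    stream_outcomes: iterable of (duration_ms, transcript) pairs, one per
    stream ended by the service; an empty transcript means empty results.
    """
    elapsed_ms = 0
    streams = 0
    for duration_ms, transcript in stream_outcomes:
        streams += 1
        elapsed_ms += duration_ms
        if transcript:
            return {"cause": "success", "transcript": transcript, "streams": streams}
        if elapsed_ms >= input_timeout_ms:
            break              # no time left on the MRCP leg: give up
        # Empty result with time remaining: loop on and open a new stream.
    return {"cause": "no-match", "transcript": "", "streams": streams}


# Hypothetical call: the first stream ends early with nothing, the second succeeds.
print(recognize_with_restart([(3500, ""), (4000, "example transcript")], 20000))
```

With restarts, an early end-of-utterance with empty results costs one extra stream instead of a no-match returned to the client.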
