AAEP & UNIMRCP & Azure: Barge-In & Recording


Eduardo Hermida

Dec 13, 2019, 9:22:23 AM
to UniMRCP
Hi all

I have successfully integrated AAEP 7.1 with UniMRCP and the AzureSR plugin.

It definitely works, but I have some doubts about two major aspects.

BargeIn:

I have a VXML page with the properties bargein=true and bargeintype=speech. I have tested two different scenarios:
1) start-of-input = service-originated. 

In this case, the effect produced is that audio collection starts at the same time the IVR starts playing the prompt, so if I don't say anything while the IVR speaks, what I always receive from Microsoft is a
 "transcripts": {"RecognitionStatus":"InitialSilenceTimeout","Offset":20000000,"Duration":0} response.

On the other hand, if I start talking almost at the same time the IVR starts speaking (that is, before reaching that InitialSilenceTimeout), the prompt is not interrupted at that moment; it continues until there is a match. That is, it behaves as if I had selected the "hotword" bargeintype.

2) start-of-input=internal

Audio collection starts at the same time the IVR starts playing the prompt. In this case the prompt is interrupted almost at the start, even if I don't say anything.

Is that the normal behavior? 
Is there any configuration that would allow audio collection to not start until I begin speaking, so that the prompt is interrupted at the very moment I start speaking?
Is there any way to increase Microsoft's InitialSilenceTimeout parameter?


Recording:

I see in the logs that in every call the parameter save-waveform arrives via MRCP set to false, even though I have the VXML property recordutterance set to true. Is there a way I could set save-waveform to true in my VXML app?


Many thanks in advance

Wilmar Pérez

Dec 13, 2019, 11:33:56 AM
to uni...@googlegroups.com
Hi Eduardo,

I don't know about your barge-in questions. Regarding the recordings, you can control it at the plugin level: in your umsazuresr.xml, look for the <utterance-manager> element and set save-waveforms="true". However, that can be overridden per MRCP session by setting the header field Save-Waveform in a SET-PARAMS or RECOGNIZE request. I am not sure whether that is equivalent to recordutterance=true.
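For illustration only (the channel identifier, request ID, and message length below are made-up values), such a per-session override would look roughly like this MRCPv2 request:

   MRCP/2.0 112 SET-PARAMS 543256
   Channel-Identifier: 32AECB23433801@speechrecog
   Save-Waveform: true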

I know this does not solve your problem but perhaps it rings a bell! 

Best,

Wilmar


--
You received this message because you are subscribed to the Google Groups "UniMRCP" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unimrcp+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/unimrcp/3507028d-a69d-464d-bb40-6b7a0d17fd22%40googlegroups.com.


--
--------------------------------------------------------
Wilmar Pérez 

Eduardo Hermida

Dec 16, 2019, 2:41:27 AM
to UniMRCP
Hi Wilmar.

Thanks for your response. The problem is that the AAEP MRCP client seems to always send the header field Save-Waveform set to false, hence the config value is overridden... and I don't know how to solve it.

Is there any way to make config file values prevail?

Regards

Arsen Chaloyan

Dec 24, 2019, 2:26:02 PM
to UniMRCP
Hi Eduardo,

Barge-in

1) start-of-input = service-originated

This is the recommended mode of operation, as it allows for a more robust barge-in experience. There are a number of settings you should consider.

To start off, there are two parameters that control the sensitivity of the internal speech activity detector: speech-start-timeout and vad-mode. For short utterances, where the caller is supposed to say something like "yes/no", the speech start timeout has to be set to around 50 msec in order to trigger (not miss) an activity. Otherwise, a timeout of 200 or 300 msec should be sufficient to detect the start of speech and filter out some false positives. With the latest releases of all the SR plugins, this timeout can be adjusted per recognition request. The same applies to vad-mode, which can also be adjusted per recognition request, even though the default mode should be sufficient for most cases.
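As a sketch only (the element and attribute names follow the usual UniMRCP server plugin configuration style, and the values are examples rather than recommendations), these detector settings live in the plugin config, e.g.:

   <speech-dtmf-input-detector
      vad-mode="2"
      speech-start-timeout="300"
      speech-complete-timeout="1000"
      noinput-timeout="5000"
      input-timeout="10000"/>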

Next, and probably equally or even more important, are the settings speech-complete-timeout and speech-incomplete-timeout. While the former is enforced once an interim speech transcription result becomes available, the latter is applied initially, when streaming to the service is started. The speech-complete timeout should be set to a smaller value, in the range of 500 to 1000 msec, depending on how long the utterance might be. If the caller might take a breath while pronouncing a long sequence of digits or an address, then you may want a longer speech-complete timeout; otherwise, even 300 msec would be enough to reliably trigger end of input for short utterances.

As for the speech-incomplete timeout, you should consider using a longer value in order to address the problem of recognition being terminated prematurely due to InitialSilenceTimeout. If you set the speech-incomplete timeout to, let's say, 15000 msec, and recognition (the first turn) completes with InitialSilenceTimeout, then a new turn will implicitly be initiated, assuming all the speech input timeouts allow for further processing. This mode of operation has been implemented in the Bing/Azure SR plugin for a while.
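To sketch how these timeouts can be adjusted per recognition request (the channel identifier, request ID, and message length are made-up values; the header names are the standard MRCPv2 recognizer headers), a RECOGNIZE request carrying the overrides might look like:

   MRCP/2.0 264 RECOGNIZE 543257
   Channel-Identifier: 32AECB23433801@speechrecog
   Speech-Complete-Timeout: 500
   Speech-Incomplete-Timeout: 15000
   Content-Type: text/uri-list
   Content-Length: 25

   builtin:speech/transcribe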

> On the other hand, if I start talking almost at the same time the IVR starts speaking (that is, before reaching that InitialSilenceTimeout), the prompt is not interrupted at that moment; it continues until there is a match. That is, it behaves as if I had selected the "hotword" bargeintype.

The prompt is interrupted (START-OF-INPUT is sent to the IVR) as soon as the first interim result is received from the service. The issue is that Azure sends the first interim result quite late compared to Google.

2) start-of-input=internal

All the configuration settings apply to this mode too. The major difference is that START-OF-INPUT is sent to the IVR as soon as streaming to the service is started. At that point in time, it is clear there is activity in the audio data, but it is not certain whether it is caused by background noise or speech. The capabilities of the internal detector are naturally limited.

> Is that the normal behavior?

No.

> Is there any configuration that would allow audio collection to not start until I begin speaking, so that the prompt is interrupted at the very moment I start speaking?

Yes, see above.

> Is there any way to increase Microsoft's InitialSilenceTimeout parameter?

No, none of the cloud-based speech APIs allow tuning of these timeouts, even though the vendors are well aware of the use cases. On the other hand, if you use the technique described above with a longer speech-incomplete timeout, then everything should work reliably.

Recording

The recordutterance property set in the VoiceXML app should prevail over whatever is set in the default configuration. You may consider using the following properties.

  <form id="main">
    <property name="recordutterance" value="true"/>
    <property name="timeout" value="10s"/>

    <field>
      <grammar src="builtin:speech/transcribe"/>
    </field>
  </form>

At the same time, make sure the parameter waveform-base-uri is set accordingly in the plugin's configuration file, so that AAEP is able to retrieve the stored utterance by the returned URI.

   <utterance-manager
      save-waveforms="false"
      purge-existing="false"
      max-file-age="60"
      max-file-count="100"
      waveform-base-uri="http://localhost/utterances/"/>




--
Arsen Chaloyan
Author of UniMRCP
http://www.unimrcp.org