AWS Lex Plugin - use of RecognizeUtterance instead of StartConversation?


Daniel Ng

Dec 17, 2021, 1:00:13 PM
to UniMRCP
Hi Arsen,

I've been testing the AWS Lex Plugin against Lex V2 and noticed that the UniMRCP plugin appears to use the StartConversation AWS SDK API call.

We are attempting to use single-intent bots to allow modularised call-flows (e.g. one generic call-routing bot that then hands over to bots dedicated to, say, billing, support, or sales). During the call itself, we submit different bot details to the UniMRCP server as relevant to the call progress/logic.

However, during testing I've noticed that the initial (first) bot's details are the ones submitted to AWS. Although the UniMRCP server correctly accepts the new bot details in the later parts of the call-flow, those caller interactions only show up in the original bot's CloudWatch logging.

I presume it is also only using the initial bot to service the later parts of the call. If I understand correctly, this would imply that we need to run a monolithic bot covering all potential intents and enable/disable the appropriate intent during the call?

In your expert opinion, could I accomplish the modular bot design via RecognizeUtterance API calls instead of StartConversation? If so, how can I configure UMS Lex to run in that mode instead of the StartConversation one? Or do I have to resort to a single monolithic Lex bot to accomplish this?

The point of the modular bots/conversation flows is to allow the reuse of "tested" conversational pieces and to allow these "directed" call-flows to be brought in and out of service as determined by the customer themselves.

Thank you in advance for your time and help.
Kind regards,
Daniel

Arsen Chaloyan

Dec 20, 2021, 7:45:23 PM
to UniMRCP
Hi Daniel,

We do want to use the streaming API provided by Lex V2, and the StartConversation method in particular, no matter whether a conversation consists of one or multiple interactions. The RecognizeUtterance method does not serve the purpose well.

Now, moving to your use case. It is true that a conversation is started with the first RECOGNIZE request placed in the scope of an MRCP session, and all the subsequent RECOGNIZE requests placed in the scope of the same MRCP session are performed in the same conversation context. In other words, there is a one-to-one association between the MRCP session and the Lex V2 conversation.

In order to perform just one interaction with the Lex V2 API, the MRCP client may open a session, send a RECOGNIZE request, and close the session. Repeat the same procedure as many times as needed in the scope of a call. Alternatively, it would also be possible to introduce a flag which would allow the client to indicate whether or not a new conversation is supposed to be started with a particular RECOGNIZE request. I'll take this use case into account...
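The session-to-conversation mapping described above can be sketched as follows. This is an illustrative model, not the plugin's actual code; the class and method names are hypothetical.

```python
# Sketch of the one-to-one association between an MRCP session and a
# Lex V2 conversation: the first RECOGNIZE in a session starts a
# conversation, and all subsequent RECOGNIZE requests in the same
# session reuse it. (Hypothetical names, for illustration only.)
import itertools

_conversation_ids = itertools.count(1)

class MrcpSession:
    """Models one MRCP session on the UniMRCP server."""
    def __init__(self):
        self.conversation_id = None

    def recognize(self):
        if self.conversation_id is None:
            # First RECOGNIZE in this session -> StartConversation is issued.
            self.conversation_id = next(_conversation_ids)
        return self.conversation_id

# Two RECOGNIZE requests in the same session share one conversation...
s1 = MrcpSession()
assert s1.recognize() == s1.recognize()

# ...while a fresh session (open, RECOGNIZE, close) starts a new one.
s2 = MrcpSession()
assert s2.recognize() != s1.conversation_id
```

Under this model, the only way to get a new conversation is to tear down the MRCP session (or, as suggested above, introduce a flag on the RECOGNIZE request).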




--
Arsen Chaloyan
Author of UniMRCP
http://www.unimrcp.org

Daniel Ng

Dec 21, 2021, 11:17:52 AM
to UniMRCP
Hi Arsen,
Thank you for this - it's just that it's quite common in the Contact Centre/IVR world to have dialogue modules or even to have dedicated "Bots" to deal with particular topics - for example, you may have a generic "How may I help you?" type routing bot that effectively does natural language routing for you but once the conversation has been established as e.g. sales or support or HR or billing, for the conversation to be directed to the specialised bots/dialogue modules.

In addition, it may be more granular still: perhaps an ID&V (identity and verification) bot, or a payment bot, which may be subject to more stringent programming/legal requirements; segregating that into a specific bot for that single "conversation" with the caller is also useful and important. For myself, I see the need for a level of granularity that allows the best use of the AI training for things like postcode recognition (which may be part of a full address-capture dialogue, but could also be used by itself as part of an identity verification or order/delivery verification step), and, in the UK, things like NHS numbers, NI numbers, and so on.

I guess from your reply that, currently, it is not possible to have an open-session, send-RECOGNIZE, close-session type interaction - i.e. how would I currently do that without having to terminate or transfer the call? I appreciate that it gets complicated when retries are factored in (in which case you don't want to close the session but to reuse it).

Please let me know when you decide, or if you need more info to justify this use case. At the moment, with a single session, the only alternative is to create large monolithic bots encompassing all the intents required for the service and then figure out how to enable the required intent for a particular part of the dialogue with the caller (which feels like it goes against the concept and intentions of AWS's conversation model, where Lex is supposed to manage the conversation itself).

Kind regards,
Daniel

Daniel Ng

Dec 21, 2021, 11:25:57 AM
to UniMRCP
Forgot it was so close to Christmas already -  wishing you all a very Merry Christmas and a Happy New Year.
All the best,
Daniel

Arsen Chaloyan

Dec 24, 2021, 3:19:40 PM
to UniMRCP
Hi Daniel,

Thanks for providing the additional clarifications. The use cases are very clear to me.

> I guess from your reply that, currently, it is not possible to have an open-session, send-RECOGNIZE, close-session type interaction - i.e. how would I currently do that without having to terminate or transfer the call? I appreciate that it gets complicated when retries are factored in (in which case you don't want to close the session but to reuse it).

It depends on the IVR platform you use. The mentioned behavior can easily be achieved using app_unimrcp in Asterisk. VXML-based platforms may not allow for the same level of flexibility.
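As an illustrative sketch of that pattern with app_unimrcp (the profile name and prompt files below are placeholders, and bot selection details would be supplied per the plugin's grammar syntax):

```
; Hypothetical dialplan fragment: each MRCPRecog() invocation opens an
; MRCP session, performs a single RECOGNIZE, and closes the session on
; completion -- i.e. one Lex V2 conversation per invocation.
exten => s,1,Answer()
 same => n,MRCPRecog(builtin:speech/transcribe,p=lex-v2-profile&f=routing-prompt.wav)
 same => n,NoOp(Routing bot result: ${RECOG_RESULT})
 same => n,MRCPRecog(builtin:speech/transcribe,p=lex-v2-profile&f=billing-prompt.wav)
 same => n,NoOp(Billing bot result: ${RECOG_RESULT})
```

Because the session is torn down between the two invocations, each recognition lands in a fresh conversation, which is exactly the behavior a VXML platform holding one session per call cannot reproduce.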

> Please let me know when you decide, or if you need more info to justify this use case. At the moment, with a single session, the only alternative is to create large monolithic bots encompassing all the intents required for the service and then figure out how to enable the required intent for a particular part of the dialogue with the caller (which feels like it goes against the concept and intentions of AWS's conversation model, where Lex is supposed to manage the conversation itself).

If you ask the vendors, you are supposed to let the bot manage an entire conversation, but I understand the challenges you face. In order to achieve the desired behavior, customers sometimes have to use not only different bots from a vendor but also different technologies/vendors in the scope of a single call.

Bottom line, it would not be that hard to support multiple conversations in the scope of the same MRCP session. But I cannot provide any strict timeline for when this functionality will be available; there are too many different tasks currently in the pipeline.

> Forgot it was so close to Christmas already -  wishing you all a very Merry Christmas and a Happy New Year.

Thanks, Daniel. Given the opportunity, I also wish you and everyone Merry Christmas and Happy New Year!

Daniel Ng

Jan 5, 2022, 10:15:53 AM
to UniMRCP
Hi Arsen, 
I hope you had a great Christmas and a good start to the New Year!

Thank you for your reply. The IVR platform we are using is Cisco CVP, which does not allow that level of flexibility (i.e. explicitly opening and closing the session): it follows the old-school mechanism of grabbing the ASR and TTS resources at the start of the call and holding those resources/sessions until the end of the call.

Having used the same Lex bots with Amazon Connect, AWS does allow Connect to swap between bots there (and hence have the bots manage parts of the conversation, or the entire conversation if required).

To get around this, we have had to revert to using Lex V1 with the UMS Lex V1 plugin, which means we lose the grammar-parameter-separator feature available in the V2 plugin. This has allowed us to jump between bots, but has brought up another issue.

One of the other bots entails capturing a long digit sequence of around 11+ digits. From past experience, we know that not many people can rattle off the full digit sequence without pausing at some point, so we were looking to increase platform timers such as speech-complete-timeout and speech-incomplete-timeout. On the CVP platform, setting the incomplete-timeout beyond 1.5 s resulted in the recognition status being returned as STOPPED instead of success, and even when a Lex transcription is returned to the UniMRCP server, the results never make it back to the VXML application.

In addition, I don't know how effective that setting is, because I don't see any related timer settings being sent over to the Lex bot. Should there be a correlation between the settings the UniMRCP server receives (and therefore sets) and what is set on the Lex bot side?

Looking through the AWS documentation (https://docs.aws.amazon.com/connect/latest/adminguide/get-customer-input.html#get-customer-input-tips - see "Configurable time-outs for voice inputs"), it appears the following option is available (for Lex V1; there is also a Lex V2 equivalent):
  • End Silence Threshold

    x-amz-lex:end-silence-threshold-ms:[intentName]:[slotToElicit]

    How long to wait after the customer stops speaking before assuming the utterance has concluded. You can increase the allotted time in situations where periods of silence are expected while providing input.

    Default = 600 milliseconds (0.6 seconds)

It seems to me that if I set, e.g., CVP to a 1-second complete/incomplete-timeout but Lex has a 0.6 s setting, I would effectively only get 0.6 s from a service perspective, rendering the platform's complete/incomplete-timeout settings irrelevant in that case. And vice versa: if Lex has, e.g., a 1.5 s setting but my VXML platform has a shorter setting, the Lex setting would be ineffective?
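That reasoning reduces to a simple rule of thumb: both the IVR platform and Lex run their own end-of-speech timers over the same audio, so whichever fires first governs. A minimal sketch (the parameter names are illustrative, not actual API parameters):

```python
def effective_end_of_speech_ms(platform_incomplete_timeout_ms: int,
                               lex_end_silence_threshold_ms: int) -> int:
    """The caller's pause is cut off by whichever endpointing timer
    expires first, so the shorter of the two settings wins."""
    return min(platform_incomplete_timeout_ms, lex_end_silence_threshold_ms)

# CVP raised to 1000 ms but Lex left at its 600 ms default: 600 ms governs.
assert effective_end_of_speech_ms(1000, 600) == 600

# Lex raised to 1500 ms but the platform set to 1000 ms: 1000 ms governs.
assert effective_end_of_speech_ms(1000, 1500) == 1000
```

In other words, raising either timer in isolation has no effect; both sides need to be set to at least the desired pause length.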

Kind regards,
Daniel

Arsen Chaloyan

Jan 14, 2022, 8:37:41 PM
to UniMRCP
Hi Daniel,

We agree that the ability to switch between bots within a single session might help in certain cases. This feature will be supported in the next version of the plugin, but I cannot provide any strict timelines. You may always ask for a quote to prioritize implementation of one or another feature, if/when needed and possible.

Switching to LexV1 is not a good idea, but if this is something that you have to do, then that is understandable.

With the latest version of the Lex V2 plugin, you can set attributes of the bot from the user application. All the attributes starting with a prefix "x-amz-lex" will be passed through to Lex V2 API. For example,

builtin:speech/transcribe?x-amz-lex:end-silence-threshold-ms=1500

And, yes, you would always need to override speech-complete and speech-incomplete timeouts with CVP, as the defaults are not really suitable in our context.
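Multiple such attributes can be combined in a single grammar URI; the separator between them must match the plugin's grammar-param-separator setting (',' in the configuration shown elsewhere in this thread). A small helper to compose such URIs (the helper itself is illustrative, not part of the plugin):

```python
def lex_grammar_uri(base: str, params: dict, separator: str = ",") -> str:
    """Compose a grammar URI for the UMS Lex V2 plugin. Attributes
    prefixed with 'x-amz-lex' are passed through to the Lex V2 API."""
    if not params:
        return base
    joined = separator.join(f"{k}={v}" for k, v in params.items())
    return f"{base}?{joined}"

uri = lex_grammar_uri(
    "builtin:speech/transcribe",
    {"x-amz-lex:end-silence-threshold-ms": 1500},
)
assert uri == "builtin:speech/transcribe?x-amz-lex:end-silence-threshold-ms=1500"
```

The single-parameter case reproduces the example URI above exactly.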

Daniel Ng

Jan 27, 2022, 10:29:14 AM
to UniMRCP
Thanks Arsen,

We are taking your advice and are looking to make it work with Lex V2, and have been looking at the x-amz-lex parameters. It's not clear, however, whether these settings make it to the speech-to-text element via Lex, or whether they are only effective within the Lex bot itself, post speech-to-text (which would render some of them effectively useless).

In order to help quantify this as well as to validate a production deployment architecture, we attempted to put in place a PoC system to allow us to try out the x-amz-lex settings.

For our deployment against the Cisco CVP 11.6 IVR platform, we have had to route internet access (i.e. access to AWS from UniMRCP) via a proxy server. However, we encountered a strange issue with the UMS Lex V2 plugin, where the code appears to still attempt a direct connection. I didn't think so at first, because the Disconnected message is logged within milliseconds of the create-connection message, but my firewall team saw some HTTPS traffic that wasn't routed via the proxy (hence it appearing in their denied-access logs). I did a pcap trace and noted that the disconnect wasn't caused by any SIP messaging from the VVB side either, and I replicated the test with the same servers but without needing or using a proxy, in which case I got a Connected message instead of a Disconnected one, and recognition etc. continued as expected.

 2022-01-25 13:42:49:578017 [INFO]   Start Conversation botId [F2Z6WALK24] aliasId [IJP9KZXNDD] locale [en-GB] <43fb12e5736644e0@lex>
2022-01-25 13:42:49:578278 [INFO]   Process RECOGNIZE Response <43fb12e5736644e0@speechrecog> [103]
2022-01-25 13:42:49:578286 [INFO]   State Transition IDLE -> RECOGNIZING <43fb12e5736644e0@speechrecog>
2022-01-25 13:42:49:578307 [INFO]   Send MRCPv2 Data 172.21.101.14:1544 <-> 172.21.102.168:59662 [85 bytes]
MRCP/2.0 85 103 200 IN-PROGRESS^M
Channel-Identifier: 43fb12e5736644e0@speechrecog^M
^M

2022-01-25 13:42:49:578371 [INFO]   Create HTTP/2 connection [https://runtime-v2-lex.eu-west-2.amazonaws.com:443] <43fb12e5736644e0>
2022-01-25 13:42:49:589515 [INFO]   Disconnected <43fb12e5736644e0>

2022-01-25 13:42:49:589535 [INFO]   Delete H2 session <43fb12e5736644e0>
2022-01-25 13:42:49:589587 [DEBUG]  Stop Input <43fb12e5736644e0@lex>

Am I reading/interpreting that log entry correctly? I have double-checked the start of that unimrcpserver log file to confirm that we are setting the proxy details (which work fine with umspolly, as that manages to send and receive the Polly TTS audio).
2022-01-25 13:29:19:240863 [DEBUG]  Load Streaming Recognition Attribute: proxy-scheme = http
2022-01-25 13:29:19:240871 [DEBUG]  Load Streaming Recognition Attribute: proxy-port = 8080
2022-01-25 13:29:19:240874 [DEBUG]  Load Streaming Recognition Attribute: proxy-host = 86.54.150.60

2022-01-25 13:29:19:240878 [DEBUG]  Load Streaming Recognition Attribute: alias = dummy
2022-01-25 13:29:19:240881 [DEBUG]  Load Streaming Recognition Attribute: bot-name = dummy
2022-01-25 13:29:19:240884 [DEBUG]  Load Streaming Recognition Attribute: region = eu-west-2
2022-01-25 13:29:19:240887 [DEBUG]  Load Streaming Recognition Attribute: generate-output-audio = false
2022-01-25 13:29:19:240891 [DEBUG]  Load Streaming Recognition Attribute: grammar-param-separator = ,
2022-01-25 13:29:19:240894 [DEBUG]  Load Streaming Recognition Attribute: transcription-grammar = transcribe
2022-01-25 13:29:19:240897 [DEBUG]  Load Streaming Recognition Attribute: skip-empty-results = true
2022-01-25 13:29:19:240900 [DEBUG]  Load Streaming Recognition Attribute: skip-unsupported-grammars = true
2022-01-25 13:29:19:240903 [DEBUG]  Load Streaming Recognition Attribute: language = en_GB
2022-01-25 13:29:19:240906 [DEBUG]  Load Streaming Recognition Attribute: start-of-input = service-originated
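For reference, a configuration fragment that would produce the attribute lines above might look like the following. The attribute names mirror the log output, but the element name and file layout are assumptions; consult the umslex.xml shipped with your installation for the actual schema.

```xml
<!-- Hypothetical sketch; attribute names taken from the
     "Load Streaming Recognition Attribute" log lines above. -->
<streaming-recognition
    region="eu-west-2"
    language="en_GB"
    proxy-scheme="http"
    proxy-host="86.54.150.60"
    proxy-port="8080"/>
```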


Kind regards,
Daniel

Arsen Chaloyan

Feb 4, 2022, 2:58:25 PM
to UniMRCP
Hi Daniel,

The HTTP proxy issue should have been resolved in the latest release of the Lex plugin. Also, per our discussion, a new conversation is now initiated if one of the bot parameters changes in a follow-up request placed in the scope of the same MRCP session.


Please give it a try and let me know if you have any questions.

Daniel Ng

Feb 7, 2022, 5:03:46 AM
to UniMRCP
Hi Arsen,

Thank you - I will install it and have a go, and will let you know the results following the testing.

thanks again.
Kind regards,
Daniel

Daniel Ng

Feb 9, 2022, 5:31:21 AM
to UniMRCP
Hi Arsen,

So I've used yum to update the UMS Lex plugin (it looks like yum also updated the Polly and AWS dependencies too), but I'm not able to fully test the Lex element, because the trigger prompt was done via Polly speech synthesis.

That is now getting the following error:
2022-02-08 14:26:41:563791 [DEBUG]  Handler Called <00795015187d4fd0@polly>
2022-02-08 14:26:41:563820 [WARN]   Failed to Fetch Audio: error [] exception [curlCode: 56, Failure when receiving data from the peer] <00795015187d4fd0@polly>
2022-02-08 14:26:41:563950 [INFO]   Process SPEAK Response <00795015187d4fd0@speechsynth> [100]
2022-02-08 14:26:41:563979 [INFO]   Deactivate Session 0x7f1698003e88 <00795015187d4fd0>
2022-02-08 14:26:41:563987 [INFO]   Terminate Session 0x7f1698003e88 <00795015187d4fd0>
2022-02-08 14:26:41:564033 [INFO]   Close <00795015187d4fd0@polly>
2022-02-08 14:26:41:564072 [NOTICE] Polly Usage: 0/1/2
2022-02-08 14:26:41:564036 [WARN]   Null MRCPv2 Connection <00795015187d4fd0@speechsynth>
2022-02-08 14:26:41:564117 [INFO]   Remove Control Channel <00795015187d4fd0@speechsynth> [0]

Unfortunately, I'm limited to a single VM (i.e. a joint Polly and Lex plugin server), and I can't run the earlier version of Polly with the latest version of Lex due to the AWS SDK dependency shared by both.

Is the above enough for you to figure out what is happening? I can see from the unimrcpserver logs that Polly is able to retrieve the voices (via the DescribeVoices call) through the proxy, but it fails when attempting the speech synthesis via the proxy (the previous release worked fine).

Kind regards,
Daniel

Arsen Chaloyan

Mar 14, 2022, 2:18:15 PM
to UniMRCP
Hi Daniel,

You can use the sample client application to verify the behavior for Polly and Lex separately. There is no need to place real calls to the system in order to validate the proxy connectivity.

cd /opt/unimrcp/bin
./umc

For Polly:
run bss1

For Lex:
run lex1
