Real-time speech recognition

David Cunningham

May 7, 2024, 8:02:25 PM
to UniMRCP
Hello,

We've been looking at the UniMRCP for Asterisk manual, trying to figure out how to do real-time speech recognition. We don't want to run an Asterisk dialplan application and have it block until the speech recognition is finished. We want to start the speech recognition, do lots of other stuff (e.g. Dial a destination), and then end the speech recognition when the call hangs up.

I found a message in this group from 2019 where Arsen said:

"This subject has been brought up several times but no progress has been made so far. The problem is the MRCP framework is primarily made for recognition of short utterances.
In order to support continuous speech transcription from Asterisk, several components in the path would need to be extended."

Does anyone know if a solution has been created since then? Or is real-time speech recognition still not possible?

Thanks in advance for any advice.


David Cunningham

May 8, 2024, 7:39:35 PM
to UniMRCP
Could anyone give me a quick pointer in the right direction?

I have referred to the UniMRCP for Asterisk manual, but can't see a dialplan application which will start speech recognition and then allow the dialplan to continue with recognition still running.


Thanks again.

Michael Levy

May 8, 2024, 9:23:59 PM
to uni...@googlegroups.com
I am not an Asterisk expert.

However, I think you are looking in the wrong place. When MRCP is used to support speech recognition, it is all about doing recognition as part of a conversation: a prompt is played, an utterance gets recognized, and this repeats until the call is complete. MRCP recognition is also single-channel; it captures the caller's voice but not the agent/IVR. Do you want both parts of the call in your transcript?

I think what you want is full call recording, or to use the technology of call recording to fork the SIP call to two endpoints: one goes to your agent or IVR, and the other goes to your recorder or real-time recognition engine.


Or perhaps treat it as a 3-way call in Asterisk (or a bridge transfer?) where one leg goes to your recognizer.
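A rough, untested sketch of that idea against Asterisk's ARI REST interface, in case it helps. The Snoop, originate, and bridge endpoints are standard ARI; the host, credentials, Stasis app name, and the PJSIP endpoint for the recognizer below are all assumptions you would adapt.

# Rough, untested sketch: fork a caller's audio to a second leg via ARI.
# Assumptions: ARI enabled at localhost:8088 with user/pass "asterisk",
# a Stasis app named "forker", and a recognizer reachable at PJSIP/recognizer.
import requests

ARI = "http://localhost:8088/ari"
AUTH = ("asterisk", "asterisk")
APP = "forker"

def fork_to_recognizer(caller_channel_id):
    # Snoop on both directions of the caller's audio.
    snoop = requests.post(f"{ARI}/channels/{caller_channel_id}/snoop",
                          params={"spy": "both", "app": APP},
                          auth=AUTH).json()

    # Originate a plain SIP leg towards the recognizer; no MRCP involved.
    leg = requests.post(f"{ARI}/channels",
                        params={"endpoint": "PJSIP/recognizer", "app": APP},
                        auth=AUTH).json()

    # Bridge the snoop channel with the recognizer leg so it hears the call.
    bridge = requests.post(f"{ARI}/bridges",
                           params={"type": "mixing"}, auth=AUTH).json()
    requests.post(f"{ARI}/bridges/{bridge['id']}/addChannel",
                  params={"channel": f"{snoop['id']},{leg['id']}"},
                  auth=AUTH)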

Your recognizer doesn't need MRCP, it just needs SIP/RTP.
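To make that point concrete, the receiving side can be as small as a UDP socket that strips the fixed 12-byte RTP header and hands the payload to whatever does the recognition. A minimal sketch; the port and codec here are assumptions, since a real SIP endpoint would negotiate them in the SDP.

# Minimal "recognizer side" sketch: receive RTP on a UDP socket, strip the
# fixed 12-byte header, and hand the payload to a speech engine.
# Assumptions: port 4000 and G.711 payloads agreed out of band (or via SDP),
# no CSRC entries or header extensions.
import socket

RTP_PORT = 4000
RTP_HEADER_LEN = 12

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", RTP_PORT))

while True:
    packet, _addr = sock.recvfrom(2048)
    if len(packet) <= RTP_HEADER_LEN:
        continue
    payload = packet[RTP_HEADER_LEN:]  # e.g. 20 ms of G.711 audio per packet
    # ...feed `payload` to whatever recognition engine you use...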

David Cunningham

May 8, 2024, 10:38:33 PM
to UniMRCP
Thank you very much for the reply Michael.

We would probably be fine having the two channels as separate transcriptions, especially if we could use timestamps to show how they align.

Our Asterisk-based system already has call recording, but we had thought that UniMRCP's integration with AWS or Google would allow real-time transcription of those recordings (i.e. while a person is still talking).

The UniSpeech website says "You can process audio in batch or in near real-time. Using a secure connection, you can send a live audio stream to the service, and receive a stream of text in response.", as per:

So what is this near real-time functionality? If we have to start listening, stop listening, send that for transcription, then I wouldn't call that near real-time. The mention of "receive a stream of text" makes it sound like you can receive text while the person is still speaking.
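To be concrete about what I mean by a stream of text: the cloud engines themselves expose a streaming mode where you push audio chunks up and read interim results back while the caller is still talking. Here is a minimal sketch against the google-cloud-speech Python client; the audio source is just a placeholder, and whether UniSpeech drives the engine this way is exactly what I'm asking.

# Sketch of streaming recognition with the google-cloud-speech Python client:
# audio goes up in chunks, interim transcripts come back while audio is still
# being sent. The audio source below is a placeholder.
from google.cloud import speech

client = speech.SpeechClient()

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,
    language_code="en-US",
)
streaming_config = speech.StreamingRecognitionConfig(config=config,
                                                     interim_results=True)

def audio_chunks():
    # Placeholder: yield raw 16-bit PCM chunks captured from the live call.
    yield b"\x00\x00" * 160  # one 20 ms frame of silence at 8 kHz

requests = (speech.StreamingRecognizeRequest(audio_content=chunk)
            for chunk in audio_chunks())

for response in client.streaming_recognize(streaming_config, requests):
    for result in response.results:
        tag = "final" if result.is_final else "interim"
        print(f"[{tag}] {result.alternatives[0].transcript}")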

Have I misunderstood what the website says?

Thanks again for your help.

Mickael Hubert

May 9, 2024, 3:39:50 AM
to uni...@googlegroups.com
Hi David,
We use this kind of setup to transcribe SIP conversations with our own transcription system, but it relies on a specific daemon we developed in-house, so it's private.

Before writing our daemon I looked for an open source solution, but didn't really find a good one. Maybe you can look at this prototype: https://github.com/pc-m/transcript-demo


Have a good day


Michael Levy

May 9, 2024, 8:12:11 AM
to uni...@googlegroups.com
Nice.  https://github.com/pc-m/transcript-demo says:

How it works

  1. When a call enters the Stasis application, it is added to the bridge
  2. The server starts listening on the configured port and creates the external media channel
  3. When RTP is received, the payload is sent to the Google Speech-to-Text API and an HTML file is generated

This makes sense. They are not using MRCP.
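A rough, untested sketch of those three steps against ARI's REST interface directly (the demo wires this up with its own client code; the host, credentials, app name, and RTP listener address below are assumptions):

# Rough, untested sketch of the three steps above using ARI's REST interface.
# Assumptions: ARI at localhost:8088 with user/pass "asterisk", a Stasis app
# named "transcriber", and an RTP listener of your own on 127.0.0.1:9999.
import requests

ARI = "http://localhost:8088/ari"
AUTH = ("asterisk", "asterisk")
APP = "transcriber"

def on_stasis_start(channel_id):
    # 1. The call entered the Stasis app: put it into a mixing bridge.
    bridge = requests.post(f"{ARI}/bridges",
                           params={"type": "mixing"}, auth=AUTH).json()
    requests.post(f"{ARI}/bridges/{bridge['id']}/addChannel",
                  params={"channel": channel_id}, auth=AUTH)

    # 2. Create an external media channel that forwards the bridge audio as
    #    plain RTP to our own listener.
    ext = requests.post(f"{ARI}/channels/externalMedia",
                        params={"app": APP,
                                "external_host": "127.0.0.1:9999",
                                "format": "slin16"},
                        auth=AUTH).json()
    requests.post(f"{ARI}/bridges/{bridge['id']}/addChannel",
                  params={"channel": ext["id"]}, auth=AUTH)

    # 3. Whatever listens on 127.0.0.1:9999 strips the RTP headers and sends
    #    the audio to the speech-to-text API of your choice.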

Ken Walker

May 9, 2024, 12:34:47 PM
to UniMRCP
The Asterisk team has a similar demo as well, using ARI and JS.

ewrj...@gmail.com

May 9, 2024, 12:46:41 PM
to UniMRCP
Hi David,
For real-time transcription during a call you could try Speechmatics (https://www.speechmatics.com/product/real-time). Basically you do some session set-up, then fork the call RTP and send it to them, and they'll send back transcription snippets throughout the life of the audio stream. At the end you've then got a ready-made transcription of the entire audio.
I found that UniMRCP could only effectively transcribe the entire audio stream at the end of the stream, not during it (which, as you say, requires a start/stop loop and isn't real time).
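Roughly the shape of it, as an untested sketch: push the forked call audio over a WebSocket and read snippets as they arrive. The endpoint URL and message fields below are from memory and would need checking against the Speechmatics real-time API reference; auth is omitted.

# Rough, untested sketch of the pattern: push the forked call audio over a
# WebSocket and print transcription snippets as they arrive. The URL and
# message fields are assumptions; check the Speechmatics real-time API
# reference for the exact protocol and authentication.
import asyncio
import json

import websockets

async def stream_call_audio(audio_chunks):
    url = "wss://eu2.rt.speechmatics.com/v2"  # assumption
    async with websockets.connect(url) as ws:
        # Session set-up: declare the audio format we are about to send.
        await ws.send(json.dumps({
            "message": "StartRecognition",
            "audio_format": {"type": "raw",
                             "encoding": "pcm_s16le",
                             "sample_rate": 8000},
            "transcription_config": {"language": "en",
                                     "enable_partials": True},
        }))

        async def send_audio():
            # audio_chunks would be the payloads of the forked RTP stream.
            async for chunk in audio_chunks:
                await ws.send(chunk)  # binary frames carry the audio

        async def read_snippets():
            async for msg in ws:
                event = json.loads(msg)
                if event.get("message") in ("AddPartialTranscript",
                                            "AddTranscript"):
                    print(event["metadata"]["transcript"])

        await asyncio.gather(send_audio(), read_snippets())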

David Cunningham

May 10, 2024, 6:45:26 PM
to UniMRCP
Thank you everyone for the replies. We'll give the https://github.com/asterisk/asterisk-external-media project a try since it seems that whatever solution is used, we need to use the Asterisk ARI to get the call audio.

Either I've missed something or it seems the UniSpeech website is a bit misleading when it says it can do "near real-time" transcription.