ASR for IVR (speech recognition)


matt

Dec 20, 2016, 2:28:51 PM
to 2600hz-dev
Hi,

I have a customer who wants to build an IVR speech recognition system to place orders over the phone.

I know this can be done with Pivot via DTMF, but I haven't found any ASR systems. I am interested in integrating Google's Cloud Speech API (https://cloud.google.com/speech/). Are there any other integrations for ASR that have already been done?

Any recommendations on a direction I should go in to create this? Is it possible with Pivot to send a recording (wav/mp3) binary to a URL and then have my script send the binary to Google and return the text string that was spoken?


matt

Dec 21, 2016, 1:26:18 AM
to 2600hz-dev

I found some ASR code in Kazoo, which appears to have a way to handle ASR via ispeech.org:

"
%%%  1. asr_freeform/* -> takes an arbitrary file and tries to transcribe
%%%     it as best the engine can
%%%  2. asr_commands/* -> For greater accuracy, include a list of expected
%%%     words, and the ASR engine will try to determine which command is
%%%     said in the audio file.
"

How can I invoke the ASR for use in an IVR menu? Is there a way to access the asr_commands or asr_freeform functions from within Pivot?

James Aimonetti

Dec 21, 2016, 12:52:38 PM
to 2600h...@googlegroups.com

The ASR stuff was added but not really vetted for use. I know SIPLabs
added support for VoiceFabric, so I assume it worked well enough for
their needs, but apart from that I'm not sure who else uses it. I think
it's mostly for transcribing recordings and not for IVR control.

To support IVR, it will take a bit more work. Some ASRs want you to
bridge the caller to them via SIP; they accept the voice data, hang up
their leg, and send you the response over something like XMPP or
HTTP. Others, like iSpeech, just want a file to convert, so Kazoo
needs to do the recording, put the recording somewhere local to the
handling Kazoo server, push it up to iSpeech, and get the resultant
text. It's unknown what the delay for the caller will be.

Another option could be ASRs that you can run locally on the Kazoo
servers but I'm not sure what the options look like.

So, in short, this is a non-trivial addition that will likely require
sponsorship or a community contribution.
--
James Aimonetti

Lead Systems Architect
"If Dialyzer don't care, I don't care"
2600HzPDX | http://2600hz.com
sip:ja...@2600hz.com
tel:415.886.7905
irc:mc_ @ freenode

dv

Jan 15, 2017, 1:35:35 PM
to 2600hz-dev

There are a few ways to go about using the Google Speech API (shortened to gsAPI here, but that's not an official acronym). I haven't tried any of these, but I've thought about this a lot because I'd like to eventually use the Google Speech API for ASR.

One thing to consider is that I haven't gotten too far with Kazoo yet. From what I've read about Pivot, you'll need to do this outside of Kazoo; I don't think you can continuously send commands to Kazoo yet.

I also don't believe Kazoo is a good fit for the core of an ASR implementation. It would be best to create a callflow module so that the heavy lifting can be handled outside of Kazoo. This is a media-oriented action, so my recommendations are for FreeSWITCH.

The “best way” for performance and speed - streaming

To get bidirectional data flow for your speech, you have to stream it to gsAPI. This poses a problem: the streaming API speaks gRPC, not plain JSON over HTTP. You will have to implement a controller (probably in golang) that takes the raw audio stream from FreeSWITCH and then streams that data using gRPC to gsAPI.

Streaming audio from FreeSWITCH

Once the call initiates, you will need to use some method to stream the audio from FreeSWITCH to your controller.

Here are the various methods I've thought about to do this from most complex to least complex.

mod_unimrcp

UniMRCP could be used to stream the audio to a controller application that would then use gRPC to stream the audio to gsAPI. This is a complex way to address the problem, but it means that you will have more control over how the media is handled and how the call is controlled.

The problem is that you would need to write a UniMRCP plugin that streams the audio to the controller application. There are example plugins for a generic ASR and for PocketSphinx that will give you an idea of how to do this. Ultimately, UniMRCP is the most optimized platform to handle ASR; it will just require you to write the software in C.

With UniMRCP, there exists the possibility of containing all of the logic (sending the data to gsAPI with gRPC) within the plugin. Another problem! UniMRCP is written in C and gRPC in C++. I'm not sure how easy it is to use a C++ library from C code.

That's one of the reasons I've mentioned writing the controller code in golang. This code from Google explains how to take stdin and stream it to the gsAPI to get speech results: https://github.com/GoogleCloudPlatform/golang-samples/blob/master/speech/livecaption/livecaption.go

mod_vlc

You can use mod_vlc to stream the audio to the controller application. I don't know how you would wait for silence to signal to the application that the caller stopped talking.

mod_verto

Since mod_verto can use webrtc to stream the audio, you could use some kind of RTC library in golang to accept the stream and then send it using grpc to gsAPI. Again, you may run into the same problem of sending the signal that the caller stopped talking. I also don't think mod_verto is really aimed at this use case, so it could take a lot of hacking to get things where you want.

The controller

Now you'll need to write a controller that gives commands back to FreeSWITCH. The Google Speech API will send back text as it detects it in the speech (IF you send the audio asynchronously). You can drill down to the commands you want to execute and communicate them to FreeSWITCH with the event socket.

A naive way

There is a naive way to do it as well. Using Lua, you can record the caller's audio to a file and then send that audio, base64 encoded, to gsAPI. This runs the risk of being really slow (the audio must be sent synchronously, plus base64 encoding and sending over HTTP) and of the caller feeling uncomfortable with the wait, but it's contained entirely within FreeSWITCH.



Resources

https://freeswitch.org/confluence/display/FREESWITCH/mod_unimrcp

https://freeswitch.org/confluence/display/FREESWITCH/mod_vlc

https://freeswitch.org/confluence/display/FREESWITCH/mod_verto

http://www.unimrcp.org/index.php/solutions/server

http://www.grpc.io/

https://github.com/googleapis/googleapis/blob/master/google/cloud/speech/v1beta1/cloud_speech.proto#L46



