There are a few ways to go about using Google Speech API (shortened to gsAPI here, but that's not an official acronym). I haven't tried any of these, but I've thought about this a lot because I'd like to eventually use Google Speech API for ASR.
There are a few ways I have thought about addressing this. One thing to keep in mind is that I haven't gotten very far with Kazoo yet. From what I've read about Pivot, you'll need to do this outside of Kazoo; I don't think you can continuously send commands to Kazoo yet.
I also don't believe Kazoo is a good fit for the core of an ASR implementation. It would be best to create a callflow module so that this can be handled outside of Kazoo. Since this is a media-oriented action, my recommendations are for FreeSWITCH.
To get bidirectional data flow on your speech, you have to stream it to gsAPI. This poses a problem: the streaming API doesn't speak plain JSON over HTTP. You will have to implement a controller (probably in golang) that takes the raw audio stream from FreeSWITCH and then streams that data to gsAPI using gRPC.
Once the call initiates, you will need to use some method to stream the audio from FreeSWITCH to your controller.
Here are the various methods I've thought about to do this from most complex to least complex.
mod_unimrcp
UniMRCP could be used to stream the audio to a controller application that would then use gRPC to stream the audio to gsAPI. This is a complex way to address the problem, but it means that you will have more control over how the media is handled and how the call is controlled.
The problem is that you would need to write a unimrcp plugin that streams the audio to the controller application. There are example plugins for a generic ASR and for pocketsphinx that will give you an idea of how to do this. Ultimately, unimrcp is the platform most optimized for handling ASR; it will just require you to write the software in C.
With unimrcp, there is also the possibility of containing all of the logic (sending the data to gsAPI with gRPC) within the plugin itself. But that raises another problem: UniMRCP is written in C and gRPC in C++, and I'm not sure how easy it is to use a C++ library from C code.
That's one of the reasons I've mentioned writing the controller code in golang. This sample from Google shows how to take stdin and stream it to gsAPI to get speech results: https://github.com/GoogleCloudPlatform/golang-samples/blob/master/speech/livecaption/livecaption.go
mod_vlc
You can use mod_vlc to stream the audio to the controller application. The open question is how you would detect silence in order to signal to the application that the caller has stopped talking.
mod_verto
Since mod_verto uses WebRTC to stream the audio, you could use some kind of WebRTC library in golang to accept the stream and then send it to gsAPI using gRPC. Again, you may run into the same problem of signaling that the caller has stopped talking. I also don't think mod_verto is really aimed at this use case, so it could take a lot of hacking to get things where you want them.
Whichever method you choose, you'll then need the controller to send commands back to FreeSWITCH. Google Speech API sends back text as it detects it in the speech (IF you send the audio asynchronously). You can pick out the commands you want to execute and communicate them to FreeSWITCH over the event socket.
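That last step is simple in shape: match the transcript against the phrases you care about, then write a plain-text command to the event socket (ESL commands are just lines terminated by a blank line). A sketch of that piece, where the phrase-to-extension table and the call UUID are entirely made-up examples:

```go
package main

import (
	"fmt"
	"strings"
)

// routeFor maps a phrase recognized by gsAPI to a dialplan extension.
// The phrases and extensions here are hypothetical examples.
func routeFor(transcript string) (string, bool) {
	routes := map[string]string{
		"sales":   "1001",
		"support": "1002",
	}
	for word, ext := range routes {
		if strings.Contains(strings.ToLower(transcript), word) {
			return ext, true
		}
	}
	return "", false
}

// eslTransfer frames the command you would write to FreeSWITCH's event
// socket connection: plain text terminated by a blank line.
func eslTransfer(uuid, ext string) string {
	return fmt.Sprintf("api uuid_transfer %s %s\n\n", uuid, ext)
}

func main() {
	// Suppose gsAPI just returned this partial transcript.
	if ext, ok := routeFor("I'd like to talk to Sales please"); ok {
		fmt.Print(eslTransfer("some-call-uuid", ext))
	}
}
```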
There is a naive way to do this as well. Using Lua, you can record the caller's audio to a file and then send that audio, base64 encoded, to gsAPI. This risks being really slow (the audio must be sent synchronously: finish recording, base64 encode, send over HTTP) and leaving the caller waiting in awkward silence, but it's contained entirely within FreeSWITCH.
https://freeswitch.org/confluence/display/FREESWITCH/mod_unimrcp
https://freeswitch.org/confluence/display/FREESWITCH/mod_vlc
https://freeswitch.org/confluence/display/FREESWITCH/mod_verto
http://www.unimrcp.org/index.php/solutions/server