Speech API - word timings


Gene Matocha

Aug 8, 2016, 11:32:28 AM
to Google App Engine
Hi,

Just started evaluating the Google Speech API. The accuracy is impressive, but it doesn't provide word timings (when in the recording each word occurred) by default, and I don't see it as a configurable option. Does anyone know if this can be enabled, or whether it may be on the roadmap for a future release?

Thanks,
Gene

Nick (Cloud Platform Support)

Aug 9, 2016, 4:26:29 PM
to Google App Engine
Hey Gene,

If I understand correctly, you'd like to receive a JSON array of objects, each of something like the following form:


{
  word: "asdf",
  time_s: 0.0
}

If so, feel free to let me know, and reply with some more details about your use case. We'd be happy to take this request but would like to clarify it a bit.

Sincerely,

Nick
Cloud Platform Community Support

Gene Matocha

Aug 9, 2016, 4:41:21 PM
to Google App Engine
Yes, something like that would be useful...with each word and the time (and possibly duration) of the word.

Or, working more from the current functionality, even having start and end times for each alternative would be helpful (but not as helpful as timing for each word). Example:

      {
        "alternatives": [
          {
            "transcript": "oh sure yeah sure no problem",
            "confidence": 0.92214179,
            "start_time_s": 15.32,
            "end_time_s": 16.87
          }
        ]
      },

Our use case is analyzing call center recordings for business analytics purposes. Once the STT is done, it's loaded back into our system along with the recording. We use word timings so that call center managers can quickly jump to interesting sections of the recordings, or to display the text as the recording is being played karaoke style.
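
To make the use case concrete, here is a minimal sketch of both lookups (jumping to a word, and finding the word at the current playback position). It assumes a per-word list using the hypothetical "word" and "time_s" fields from the example earlier in this thread; the timing values and the helper names find_word_times and word_at are made up for illustration and are not part of the Speech API.

```python
from bisect import bisect_right

# Hypothetical per-word timings. The field names follow the example earlier
# in this thread; the values are invented and this is not a real API response.
words = [
    {"word": "oh", "time_s": 15.32},
    {"word": "sure", "time_s": 15.61},
    {"word": "yeah", "time_s": 15.90},
    {"word": "sure", "time_s": 16.12},
    {"word": "no", "time_s": 16.40},
    {"word": "problem", "time_s": 16.55},
]


def find_word_times(words, target):
    """Return the start time of every occurrence of target (case-insensitive)."""
    return [w["time_s"] for w in words if w["word"].lower() == target.lower()]


def word_at(words, playback_s):
    """Return the most recently started word at playback_s, or None before the first word."""
    starts = [w["time_s"] for w in words]
    i = bisect_right(starts, playback_s) - 1
    return words[i]["word"] if i >= 0 else None
```

find_word_times would drive the "jump to interesting sections" feature, and word_at the karaoke-style highlight. With per-word durations as well, word_at could also detect silence between words.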


That leads to another small question: we're currently uploading relatively long (minutes to hours) recordings to Google Cloud Storage and submitting those URLs to Google Speech. The STT breaks each recording into "alternatives" of various sizes. Can you elaborate on how those divisions are made?

Nick (Cloud Platform Support)

Aug 11, 2016, 7:05:31 PM
to Google App Engine
Hey Gene,

So to clarify, would the most useful form of response contain both the overall timing of the phrase, as well as the timing of individual words? This seems like the most useful for most applications. The level of detail could be controlled by a request parameter perhaps, or it could be a different API method.

As for the specific details of how the output is generated: as the docs mention, it uses deep learning neural networks, so it's not really possible to give a clean formal analysis of how it transforms input to output.

Regards,


Nick
Cloud Platform Community Support



Nick (Cloud Platform Support)

Aug 16, 2016, 3:52:05 PM
to Google App Engine
Hey Gene,

I'll go ahead and forward this feature request with various possible means of realization, with a clear reference to your use-case.

Cheers,


Nick
Cloud Platform Community Support


Nick (Cloud Platform Support)

Aug 19, 2016, 2:16:28 PM
to Google App Engine
Hey Gene,

Final update for now: the issue has been filed and will be reviewed.

Have a nice day!


Cheers,

Nick
Cloud Platform Community Support


Lance Dolan

Apr 4, 2017, 4:59:56 AM
to Google App Engine
What is the latest news on this feature?

We're evaluating the Google Speech API for a rather large solution. Google is more accurate than its competitors, but the competitors provide word timings. For example, see IBM Watson here, and click the "word timings" tab (https://speech-to-text-demo.mybluemix.net). It functions almost exactly as Gene has requested here.

For us, knowing word timings is a requirement. Proceeding with Google Speech might require some really hacky workarounds on our part to determine timings. We've tested chopping the audio up into 3-second blocks to create our own rough estimate of timings, but this dramatically harms the accuracy, as the Google service loses its context for each audio block.
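
For reference, the fixed-block workaround amounts to something like the sketch below. It only covers the timing arithmetic: the per-block transcripts are assumed to come back from separate recognition requests, the helper names chunk_offsets and rough_word_times are made up for illustration, and spacing words evenly inside each block is an assumption of this sketch, not something the API reports.

```python
def chunk_offsets(duration_s, chunk_s=3.0):
    """Return (start, end) pairs covering the recording in chunk_s-second blocks."""
    offsets = []
    n = 0
    while n * chunk_s < duration_s:
        start = n * chunk_s
        # The last block is clipped to the end of the recording.
        offsets.append((start, min(start + chunk_s, duration_s)))
        n += 1
    return offsets


def rough_word_times(chunk_transcripts, chunk_s=3.0):
    """Estimate word start times from one transcript string per block.

    Assumption: words are spread evenly across their block, so every
    estimate can be off by up to roughly one block length.
    """
    timed = []
    for n, transcript in enumerate(chunk_transcripts):
        chunk_words = transcript.split()
        for k, word in enumerate(chunk_words):
            time_s = n * chunk_s + k * chunk_s / len(chunk_words)
            timed.append({"word": word, "time_s": time_s})
    return timed
```

Smaller blocks tighten the estimate but give the recognizer even less context per request, which is exactly the accuracy trade-off described above.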

Nick (Cloud Platform Support)

Apr 5, 2017, 7:46:41 PM
to Google App Engine
Hey Lance,

We're aware of this request although we can't provide any timelines on our work, across any products. If you'd like, you can "star" this issue by hitting "Me Too" in the Public Issue Tracker.


Cheers,

Nick
Cloud Platform Community Support

Nicholas (Google Cloud Support)

Apr 6, 2017, 10:00:05 AM
to Google App Engine
Apologies for the internal URL.  The public issue can be accessed here: https://issuetracker.google.com/37002743