Google Speech-to-Text

David Akerson

Jul 29, 2019, 10:57:16 AM

to Google App Engine

I am developing an app to transcribe audio recordings on my phone.

I am using the "video" model. The problem I am having is that the transcription splits the text among several speakers, even though it's just me speaking.

I am wondering whether I should be using the Phone Call model instead, which supports the Speaker Diarization function (video apparently does not). And if I use the Phone Call model with a recording that is three hours long, will this cause problems?

Finally, if I am trying to produce a transcript with the most accurate punctuation, does one model (Video, Phone Call, etc.) work better than the others?

Thanks!

Ali T (Cloud Platform Support)

Jul 30, 2019, 12:36:22 PM

to Google App Engine

Hi,

Choose the model according to where the audio being transcribed originates. If the audio does not match one of the specialized models, the default model is appropriate. You can find a breakdown of each model in the request configuration documentation. If you do want to use speaker diarization, it’s only available for the phone_call model.
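
As a rough illustration of the model choice described above, here is a minimal sketch that builds a JSON request body by hand. The helper name build_recognize_request and the bucket path are hypothetical. The field names (model, languageCode, enableAutomaticPunctuation, diarizationConfig) are written as I understand the v1 REST RecognitionConfig, so please check them against the current reference before relying on them. The rule that diarization requires phone_call simply mirrors the statement in this reply.

```python
import json

# Model names as listed in the request configuration documentation
# (assumed here; verify against the current docs).
ALLOWED_MODELS = {"default", "video", "phone_call", "command_and_search"}


def build_recognize_request(gcs_uri, model="default", language_code="en-US",
                            punctuation=True, diarization=False):
    """Build a request body dict (hypothetical helper, not part of any SDK)."""
    if model not in ALLOWED_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    # Mirrors the restriction stated in this thread: diarization is only
    # available for the phone_call model.
    if diarization and model != "phone_call":
        raise ValueError("speaker diarization requires model='phone_call'")

    config = {
        "languageCode": language_code,
        "model": model,
        "enableAutomaticPunctuation": punctuation,
    }
    if diarization:
        config["diarizationConfig"] = {"enableSpeakerDiarization": True}
    return {"config": config, "audio": {"uri": gcs_uri}}


# Single speaker: leave diarization off so the transcript is not split.
body = build_recognize_request("gs://my-bucket/recording.flac", model="default")
print(json.dumps(body, indent=2))
```

For a single-speaker recording, leaving diarization off (the default in this sketch) means the request does not ask for speaker labels at all.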

Regarding audio length: whether you choose the default, video, or phone_call model, you shouldn’t have any problems transcribing a 3-hour recording. The content limits depend on the request type rather than on the model chosen.
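
To make the point about request types concrete, here is a small sketch that picks a request type from the recording length alone. The limits used (about 1 minute for synchronous requests and about 480 minutes for asynchronous ones) are my recollection of the documented content limits and should be treated as assumptions; the function name pick_request_type is hypothetical, and streaming requests are not covered.

```python
# Assumed content limits for pre-recorded audio, in seconds.
# Check the current quotas and limits page before relying on these numbers.
SYNC_LIMIT_S = 60            # synchronous "recognize"
ASYNC_LIMIT_S = 480 * 60     # asynchronous "longrunningrecognize"


def pick_request_type(duration_seconds):
    """Return which request type fits a recording of the given length."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= SYNC_LIMIT_S:
        return "recognize"
    if duration_seconds <= ASYNC_LIMIT_S:
        return "longrunningrecognize"
    # Longer than the assumed asynchronous limit: split the file first.
    return "split"


three_hours = 3 * 60 * 60
print(pick_request_type(three_hours))
```

Under these assumed limits, a 3-hour recording falls into the asynchronous case regardless of which model is selected.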
