I am developing an app to transcribe audio recordings on my phone.
I am using the "video" model. The problem I am having is that the transcription splits the text among several speakers, even though it's just me speaking.
I am wondering whether I should be using the Phone Call model instead, which supports the Speaker Diarization function (video apparently does not). And if I use the Phone Call model on a recording that is three hours long, will that cause problems?
Finally, if I am trying to produce a transcript with the most accurate punctuation, does one model (Video, Phone Call, etc.) work better than the others?
Thanks!