That's very interesting and promising, Weijia. Not that it's directly relevant, but I thought you might be interested in my experience with "A.I." in a different domain.
I've been experimenting with Whisper, OpenAI's audio transcription program, mostly on episodes of my podcast (https://twochairs.website).
The original version of Whisper is remarkable, but it is slow: transcribing one episode of my podcast takes several hours, roughly three times the length of the source audio. The original Whisper is written in Python (on top of PyTorch), something I find quite remarkable as a Python coder myself, but Python is an interpreted language and so a bit slow.
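For anyone curious, the Python version takes only a few lines to drive. Here's a minimal sketch, assuming the openai-whisper package is installed; the model choice and filename are just examples:

    import whisper

    # Load a pretrained model ("tiny" through "large"); bigger models
    # are more accurate but slower.
    model = whisper.load_model("medium")

    # Transcribe a local audio file (the filename is made up).
    result = model.transcribe("episode.mp3")
    print(result["text"])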
Well, a Bulgarian developer (Georgi Gerganov) recently ported the Whisper code to C++ as the whisper.cpp project, compiled it and optimised it to run on Apple Silicon chips (like my iMac has). In that form, it is able to transcribe a ninety-minute podcast in two-and-a-half minutes!!
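To keep the examples in one language, here's a rough sketch of driving the compiled whisper.cpp binary from Python. The binary name, model file and flags are assumptions based on the project's README, so check them against your own build:

    import subprocess

    # Paths are examples only; adjust to wherever whisper.cpp was built
    # and whichever ggml model file you downloaded. The input needs to
    # be a 16 kHz WAV file.
    subprocess.run(
        [
            "./main",                        # the whisper.cpp executable
            "-m", "models/ggml-medium.bin",  # model file
            "-f", "episode.wav",             # input audio
            "-otxt",                         # also write episode.wav.txt
        ],
        check=True,
    )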
The results are not perfect, of course, but they are pretty good, good enough so that after some light editing they can be published.
What I found particularly interesting, though, was that in editing the transcription of a recent discussion, listening along with my headphones as I carefully read through the transcribed text, I discovered that Whisper doesn't generate a straight word-for-word transcription of the audio. Instead, like ChatGPT, it sometimes makes stuff up. Just the odd word or two inserted that is not there in the original, generally making sense of what is being said. It's not directly transcribing sounds: instead, guided by the audio, it's putting in what you would be likely to say at a particular moment. Mostly that is an accurate transcription of what you actually did say, but not always.
All of these LLM tools work by predicting what the next word or phrase is most likely to be, and lo and behold the same principle applies to audio transcription: Whisper's decoder is essentially a language model that predicts text tokens one at a time, conditioned on the audio, so it inherits the same habits.
It will also leave out words even when their audio is perfectly clear. Again, this is usually useful: if a speaker says the same word or phrase more than once (a bad habit I suffer from), it will include only one of those instances in the transcript. Kind of freaky, though, and it does mean you need to check carefully what it's done.
With those reservations in mind, the speed with which I can now generate a transcription has led to a change in how I edit the audio of the podcast. I can now use it to generate timecode markers in my audio editing program (think closed captions), which makes it a breeze to scan through and spot things that need to be fixed.
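As an illustration of the timecode idea: Whisper's Python API returns per-segment start and end times, which map straightforwardly onto a caption format like SRT. A rough sketch (the helper function and filenames are mine, not part of Whisper):

    import whisper

    def srt_time(seconds: float) -> str:
        # Convert seconds to the HH:MM:SS,mmm form SRT expects.
        ms = int(seconds * 1000)
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    model = whisper.load_model("medium")
    result = model.transcribe("episode.mp3")

    # Write one numbered SRT cue per Whisper segment.
    with open("episode.srt", "w") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n")
            f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
            f.write(seg["text"].strip() + "\n\n")

Many editors can import a caption file like this as a marker track, though the exact import step depends on the program.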
None of this is directly relevant to SE, of course, but perhaps it supports the feeling Weijia expresses that these tools, used with discretion, could make life much easier for our producers.
What I'm hoping is that these LLM tools can be used to greatly improve OCR, but given what I say above, it would be hard to be confident how accurate the resulting text would be. But that's what proofreading is for, I guess.