Over the last year or so, I have been listening to the Rāmāyaṇa
off and on, mostly while driving. It has been a richly rewarding experience (despite my limited attention and understanding). As I near the end, it occurs to me that it would be a nice idea to combine the excellent audio with the Sanskrit text (stanza by stanza) and possibly an English translation, to produce essentially 50-odd hours of “video”.
I do not know how to do this. I imagine that, given the audio and the text separately (the audio is already split by sarga), some heuristics, like looking for pauses and matching them to shloka boundaries (sketched below), will mostly work. If that's not good enough, machine learning may help. I looked up how this is done in general; the problem is called “forced alignment”, and there are some resources here: https://github.com/pettarin/forced-alignment-tools
(Most of the existing tools work only with specific languages, Sanskrit not among them.) (Note that YouTube also has a feature where you can upload a video along with a transcript, and it will try to create subtitles out of the transcript.)
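For what it's worth, here is a minimal sketch of that pause heuristic in Python, using the pydub library. It assumes we know the number of shlokas in the sarga, and that the N−1 longest silences in the recording fall at the shloka boundaries; both assumptions (and the silence thresholds, which are placeholders) would need checking against the actual recordings.

```python
from pydub import AudioSegment
from pydub.silence import detect_silence

def guess_shloka_boundaries(audio_path, num_shlokas,
                            min_silence_ms=400, silence_thresh_db=-40):
    """Guess (start_ms, end_ms) spans for each shloka in one sarga's audio.

    Heuristic: the num_shlokas - 1 longest pauses are taken to be the
    boundaries between consecutive shlokas. Threshold values are guesses
    and would need tuning per recording.
    """
    audio = AudioSegment.from_file(audio_path)
    # All silent stretches at least min_silence_ms long, as [start, end] in ms.
    silences = detect_silence(audio, min_silence_len=min_silence_ms,
                              silence_thresh=silence_thresh_db)
    # Keep the longest pauses; use each pause's midpoint as a cut point.
    longest = sorted(silences, key=lambda s: s[1] - s[0],
                     reverse=True)[: num_shlokas - 1]
    cuts = sorted((start + end) // 2 for start, end in longest)
    edges = [0] + cuts + [len(audio)]  # len(audio) is duration in ms
    return list(zip(edges[:-1], edges[1:]))
```

If the recording has long pauses within shlokas (at pāda boundaries, say), this simple version will mis-cut, which is where matching pause positions against expected shloka lengths, or proper forced alignment, would come in.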
I don't know anything about machine learning, so I am probably not going to pursue this further right now; but if someone wants to do it, having the audio annotated with timing information might be a nice project.
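(One natural form for that timing information is an SRT subtitle file, which YouTube and most video players accept. A hypothetical helper, assuming the spans from the sketch above and a parallel list of shloka texts:

```python
def ms_to_srt_time(ms):
    """Format milliseconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(spans, shlokas, out_path):
    """Write one SRT cue per (start_ms, end_ms) span and its shloka text."""
    with open(out_path, "w", encoding="utf-8") as f:
        for i, ((start, end), text) in enumerate(zip(spans, shlokas), 1):
            f.write(f"{i}\n{ms_to_srt_time(start)} --> {ms_to_srt_time(end)}\n")
            f.write(text + "\n\n")
```

That way the alignment, once done, is usable directly as subtitles over the audio.)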
(BTW it's not even fully clear exactly what text was used: it's definitely not the “critical edition” of the Rāmāyaṇa used by e.g. Bibek Debroy for his translation, but it seems to match the popular text, like the one on https://www.valmikiramayan.net/. Anyway, if someone is producing audio from text, it's good to retain the exact text used; it may be useful later.)