Alignment of audio and text - PoC + request for help/collaboration


Avinash L Varna

Mar 29, 2021, 2:20:27 AM
to sanskrit-programmers
TL;DR:
A proof-of-concept (PoC) of the first sarga of the ramayana audio in a "read-along" mode is here - https://avinashvarna.github.io/rAmAyaNa-paThanam/
If you play the audio at the top of the page, the corresponding word being uttered should be highlighted (with some occasional lag/errors). Scrolling to a different point in the audio should cause the text to also advance to the corresponding section. (The reverse does not work yet. See request for help below).

Details:
I was interested in using TTS to perform forced alignment with Dynamic Time Warping and was delighted to find that the aeneas library already implements this functionality. I've been using it to test out alignment on the rAmAyaNa and meghadUta recordings previously shared on this mailing list. For the rAmAyaNa text, I used data from the aandhrapATha shared by Vishvas, and for meghadUta, text from GRETIL (with some minor corrections based on the audio). The output of aeneas is quite good and can be fine-tuned using the awesome finetuneas interface.

I've uploaded alignment at the pAda and pada level (sentence/word alignment) for all the sargas of rAmAyaNa and pUrvamegha, along with the code here - https://github.com/avinashvarna/audio_alignment. I was able to create the https://avinashvarna.github.io/rAmAyaNa-paThanam/ website using the forced alignment output, and a javascript library I found.

Such a website might be useful for those trying to learn how to chant or memorize the shlokas (visual aid in addition to the audio). It would be fairly easy to generate the alignment for different available audio for common texts (gItA, viShNu-sahasra-nAma, etc.).

Request for help:
I am sure there are members of this group who are far more well-versed in web-design/site generators than "yours truly". Would anyone be interested in creating a website with a few bells and whistles such as:
  1. A menu to navigate the text -> kANda -> sarga (or appropriate division)
  2. Transliteration support into various scripts
  3. Ability to scroll the text and go to the corresponding location in the audio.
E.g. vishvAs's site already has support for 1 and 2. If he is open to adding this mode, that could be an option, or a site could be based off of it.

Ideally, the initial site should be created such that adding new audio + text should be fairly trivial. Any takers?

Thanks
Avinash

Anunad Singh

Mar 29, 2021, 4:19:10 AM
to sanskrit-programmers
Excellent!
I have listened to the Valmiki Ramayana for the first time.
Learning that the entire story is given in condensed form in the very first sarga of the Balakanda gave me great joy.
-- अनुनाद

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/CAALtx9Yz1NcAboBfdhhoR-aiQ9QORVYUrMoZZP%2Bh_rrFc13n9Q%40mail.gmail.com.

Lokesh Sharma

Mar 29, 2021, 8:02:42 AM
to sanskrit-programmers

Shreevatsa R

Mar 29, 2021, 6:48:16 PM
to sanskrit-programmers
Thanks Avinash, nice to see this happening! 
I'm glad I proposed this idea back then, instead of trying to work on it myself -- that I would probably never have. :-)

I think (3) shouldn't be hard; I glanced at the source of karaoke.js that's currently being used and it is not very long; we could rip it out and write the 2-directional stuff from scratch. (In fact I've implemented something like that for YouTube videos a while ago, a webpage with a video and a transcript where clicking on a sentence in the transcript will take you to the corresponding location in the video… the repo is not public but I just put the script as a gist and the relevant code is just five lines; will be similarly short for HTML5 audio.)

I'm not very good at the other aspects of web design (aesthetics, layout, CSS, etc), but maybe one of us could start a github repository for the website to be hosted somewhere, and others can add to it and we can build it collaboratively.


विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 29, 2021, 9:53:23 PM
to sanskrit-programmers
On Mon, Mar 29, 2021 at 11:50 AM Avinash L Varna <avinas...@gmail.com> wrote:
> I was interested in using TTS to perform forced alignment with Dynamic Time Warping and was delighted to find that the aeneas library already implements this functionality. I've been using it to test out alignment on the rAmAyaNa and meghadUta recordings previously shared on this mailing list. For the rAmAyaNa text, I used data from the aandhrapATha shared by Vishvas, and for meghadUta, text from GRETIL (with some minor corrections based on the audio). The output of aeneas is quite good and can be fine-tuned using the awesome finetuneas interface.
>
> I've uploaded alignment at the pAda and pada level (sentence/word alignment) for all the sargas of rAmAyaNa and pUrvamegha, along with the code here - https://github.com/avinashvarna/audio_alignment. I was able to create the https://avinashvarna.github.io/rAmAyaNa-paThanam/ website using the forced alignment output, and a javascript library I found.

How does this alignment happen conceptually? Is my guess below correct?
Given a text, tts generates audio B. Audio B is aligned with recitation (Audio A), thence text is aligned with Audio A.

Aeneas does all of the above steps? Which TTS is used?

 

> Such a website might be useful for those trying to learn how to chant or memorize the shlokas (visual aid in addition to the audio). It would be fairly easy to generate the alignment for different available audio for common texts (gItA, viShNu-sahasra-nAma, etc.).

Indeed! Well done!

 

--
Vishvas /विश्वासः

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 29, 2021, 11:00:51 PM
to sanskrit-programmers
On Mon, Mar 29, 2021 at 11:50 AM Avinash L Varna <avinas...@gmail.com> wrote:


> I've uploaded alignment at the pAda and pada level (sentence/word alignment) for all the sargas of rAmAyaNa and pUrvamegha, along with the code here - https://github.com/avinashvarna/audio_alignment.

I've copied some of this code to https://github.com/sanskrit-coders/audio_utils/tree/master/audio_utils/alignment (the idea being to gather all relevant audio code there and at https://github.com/sanskrit-coders/audio_curation, so that it becomes easy to collaboratively maintain and use related code without having to install many packages. Avinash is already a maintainer of that org.)
 

> Request for help:
> I am sure there are members of this group who are far more well-versed in web-design/site generators than "yours truly". Would anyone be interested in creating a website with a few bells and whistles such as:
>   1. A menu to navigate the text -> kANda -> sarga (or appropriate division)
>   2. Transliteration support into various scripts
>   3. Ability to scroll the text and go to the corresponding location in the audio.
> E.g. vishvAs's site already has support for 1 and 2. If he is open to adding this mode, that could be an option, or a site could be based off of it.

Definitely open to adding 3 - very exciting prospect.
The following would need to be done:

- Add a JS function to do the scrolling and highlighting under https://github.com/sanskrit-coders/sanskrit-documentation-theme-hugo/tree/master/webpack_src/js (this theme is included in https://github.com/vvasuki/purANam/tree/master/ as a submodule). So one would just clone https://github.com/vvasuki/purANam/tree/master/ to start off.
- Compile by changing to themes/sanskrit-documentation-theme-hugo/webpack_src/ and running npm install, then npm run watch.
- Try out the website by running hugo server --renderToDisk from wherever https://github.com/vvasuki/purANam/ was cloned to.



 


Shreevatsa R

Mar 29, 2021, 11:12:53 PM
to sanskrit-programmers
On Mon, 29 Mar 2021 at 15:48, Shreevatsa R <shree...@gmail.com> wrote:
> I'm not very good at the other aspects of web design (aesthetics, layout, CSS, etc), but maybe one of us could start a github repository for the website to be hosted somewhere, and others can add to it and we can build it collaboratively.

I just started trying this out in a very bare-bones way from scratch (no frameworks used, except Hugo for templating and that too without any pre-existing theme).  The basic functionality seems to be working (clicking on text will change the position in the audio): https://shreevatsa.net/ramayana/pages/test — deployed from https://github.com/shreevatsa/web-align-audio-text 
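
A minimal sketch of the two-directional lookup being discussed, written in Python purely for illustration (the sites in this thread do this in JavaScript). The fragment list, the timings and the function names are hypothetical; only the idea of mapping an audio position to a text fragment and back comes from the discussion.

```python
from bisect import bisect_right

# Hypothetical word-level fragments: (start_seconds, end_seconds, word).
FRAGMENTS = [
    (0.00, 0.80, "tapaH"),
    (0.80, 1.90, "svAdhyAya"),
    (1.90, 2.60, "nirataM"),
    (2.60, 3.40, "tapasvI"),
]
STARTS = [start for start, _end, _word in FRAGMENTS]


def index_at(time_s: float) -> int:
    """Audio -> text: index of the fragment playing at time_s."""
    # bisect_right counts the fragments that start at or before time_s.
    return max(bisect_right(STARTS, time_s) - 1, 0)


def seek_time(index: int) -> float:
    """Text -> audio: playback position for a clicked fragment."""
    return STARTS[index]
```

On a page, index_at would be called as playback progresses to decide which word to highlight, and seek_time would be called from a click handler to set the audio position.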

Trying to add all the rest of the kandas/sargas now and will email again when done, but then someone else will probably have to help with the CSS etc. :-)

Shreevatsa R

Mar 30, 2021, 3:01:16 AM
to sanskrit-programmers
All the sargas (except 3*) are now here: https://shreevatsa.net/ramayana/sarga — please feel free to make changes to the repo, or fork it and host your own, or whatever.

(* This is a silly thing: had to delete 5.054, 6.094, 6.127 because of some issue with spaces in the filenames, will figure it out eventually…)

A couple of other points (apart from everything else mentioned earlier):
- It's using sentence-level alignment instead of word-level alignment, because from the data here I couldn't figure out what the sentences are.
- I did notice some errors, e.g. sargas 6.096 and 6.098 (I think) have the wrong text. There may be more, in that region (actually everything towards the end of YK seems to be off). This seems a bug in the alignment data itself.


Avinash L Varna

Mar 30, 2021, 11:14:35 AM
to sanskrit-programmers
I'm happy to see all the updates in a day! Thanks for all the great efforts!

> How does this alignment happen conceptually? Is my guess below correct? Given a text, tts generates audio B. Audio B is aligned with recitation (Audio A), thence text is aligned with Audio A.

Correct. This is indeed the essence. More details here

> Aeneas does all of the above steps? Which TTS is used?

Yes. It has wrappers for several TTS engines. I believe that the default is eSpeak. I "cheated" again by using Hindi as the language, but it works pretty well for the alignment. Extending it to other TTS engines should be possible.
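
To make the idea concrete, here is a toy sketch of Dynamic Time Warping in Python. It is not aeneas's implementation: aeneas works on audio features with its own optimized code, whereas this example aligns two short lists of numbers standing in for the synthesized audio B and the recitation A. The function name dtw_path and the sample data are made up for illustration.

```python
def dtw_path(a, b):
    """Return the minimum-cost warping path between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


# Stand-ins for feature frames: the recitation lingers on the first and last sounds.
synthesized = [1, 2, 3]
recitation = [1, 1, 2, 3, 3]
path = dtw_path(synthesized, recitation)
```

Each pair in path links a frame of the synthesized audio to a frame of the recitation. Since the synthesized audio was generated from the text, the text boundaries known in audio B can be carried over to audio A through these pairs.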

> - It's using sentence-level alignment instead of word-level alignment, because from the data here I couldn't figure out what the sentences are.

The "id" field in the JSON contains the paragraph/sentence/word numbers, e.g. "id": "p000001s000001w000001". It needs to be parsed to figure out which words belong to the same pAda/sentence.
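
A minimal sketch of that parsing in Python, assuming every word-level id follows the p/s/w pattern shown above. The helper names parse_id and group_by_sentence are hypothetical, not part of aeneas or of the repository.

```python
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"p(\d+)s(\d+)w(\d+)")


def parse_id(fragment_id: str) -> tuple[int, int, int]:
    """Split an id like 'p000001s000001w000001' into (paragraph, sentence, word)."""
    match = ID_PATTERN.fullmatch(fragment_id)
    if match is None:
        raise ValueError(f"unexpected fragment id: {fragment_id!r}")
    paragraph, sentence, word = (int(group) for group in match.groups())
    return paragraph, sentence, word


def group_by_sentence(fragment_ids):
    """Map (paragraph, sentence) -> list of word ids, in input order."""
    groups = defaultdict(list)
    for fragment_id in fragment_ids:
        paragraph, sentence, _word = parse_id(fragment_id)
        groups[(paragraph, sentence)].append(fragment_id)
    return dict(groups)
```

Words that share the same (paragraph, sentence) key belong to the same pAda, so the grouped lists give the sentence boundaries directly.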

> - I did notice some errors, e.g. sargas 6.096 and 6.098 (I think) have the wrong text. There may be more, in that region (actually everything towards the end of YK seems to be off). This seems a bug in the alignment data itself.

I will look into this. There may be a parsing error in cleaning the text, or the text and audio annotations in vishvAs's repo may be off.




Avinash L Varna

Mar 30, 2021, 11:40:54 AM
to sanskrit-programmers

विश्वासो वासुकिजः (Vishvas Vasuki)

Mar 31, 2021, 11:20:15 AM
to sanskrit-programmers

विश्वासो वासुकिजः (Vishvas Vasuki)

Apr 3, 2021, 12:31:47 PM
to sanskrit-programmers
The corrected version has now been placed here - https://vvasuki.github.io/purANam/rAmAyaNam/AndhrapAThaH/

Shreevatsa R

Apr 3, 2021, 9:23:07 PM
to sanskrit-programmers
Great, thanks. Will take a look after Avinash recomputes the alignment from this.

Meanwhile I switched to word-level alignment as Avinash suggested, and also added the purva-megha (from the same repo of Avinash). Both texts (Ramayana and Meghaduta) are using the same template (now) so I imagine the code is not too specific, and adding more texts will be possible too.

Avinash L Varna

Apr 11, 2021, 9:28:20 PM
to sanskrit-programmers
Thanks Shreevatsa! The word alignment seems to be working nicely! I've rerun the alignment and done some spot checking to ensure that the sargas and audio match. Please let me know if anything is incorrect. I will work on some of your other suggestions later.

Thanks
Avinash



Hrishikesh Terdalkar

Apr 13, 2021, 2:15:09 PM
to sanskrit-p...@googlegroups.com
Copying Shreevatsa's JS logic, I've built a Flask application with Bootstrap5-powered front-end.
I've sent a pull request for the same (meanwhile it can be tried by cloning https://github.com/hrishikeshrt/audio_alignment)

The corpus list is built from a data.json file inside the corpus directory, which should contain a map from a key (used to refer to the particular sarga, adhyaya, etc.) to its name, audio URL and alignment files.
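
As a rough illustration of such a map, here is a Python sketch that builds one entry and reads it back. The field names (name, audio_url, alignment), the key format and the paths are guesses made for this example, not the repository's actual schema.

```python
import json

# Hypothetical entry; keys and paths are placeholders, not the real schema.
corpus = {
    "1.001": {
        "name": "bAlakANDa, sarga 1",
        "audio_url": "https://example.org/audio/1-001.mp3",
        "alignment": {
            "sentence": "alignment/1-001_sentence.json",
            "word": "alignment/1-001_word.json",
        },
    },
}


def list_corpus(data_json: str) -> list[tuple[str, str]]:
    """Return (key, name) pairs, e.g. for building a navigation menu."""
    entries = json.loads(data_json)
    return [(key, entry["name"]) for key, entry in entries.items()]


# Serialized form, as it might be stored in data.json.
data_json = json.dumps(corpus, indent=2, ensure_ascii=False)
```

With a layout like this, adding a new text would only mean adding one more entry that points at its audio and alignment files.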

[Attached screenshot: image.png]

Also, I tried running the alignment code for Ashtadhyayi (text from ashtadhyayi.com/, audio from http://surasa.net/music/samskrta-vani/ashtadhyayi.php), but the alignment obtained is erroneous, in that, the text lags behind the audio in most cases. (I'm using https://pypi.org/project/py3-aeneas/)
I tried with both original text and split text (as found in https://ashtadhyayi.com/chanting/), and surprisingly, got better (still not acceptable) results with the split text than the original text.

Regards,
-
हृषीकेश


Shreevatsa R

Apr 14, 2021, 12:17:33 AM
to sanskrit-programmers
This is great! Nice to see more people working together for this. :-)

I haven't tried it yet but from the screenshot it also looks much better than anything I could have come up with. :-) I'll study it for sure.

BTW, does it need to be a Flask app (is there much happening on the backend) or can it remain a static site? With static sites there are more free options for hosting (GitHub Pages, GitLab Pages, Netlify, etc), though I guess this will be small enough that deploying to App Engine/Heroku/whatever is also fine.

Cool stuff, I look forward to more texts and more usability improvements :-)

Hrishikesh Terdalkar

Apr 14, 2021, 12:26:21 AM
to sanskrit-p...@googlegroups.com
For this purpose, it doesn't have to be a Flask application, and there is a freeze script (using Frozen-Flask) that can build the static site. (I just found it easier to develop using flask).

Hrishikesh Terdalkar

Apr 14, 2021, 1:20:28 PM
to sanskrit-p...@googlegroups.com
I managed to figure out why I was getting very poor alignment: the main culprit was an improper installation of the aeneas library.
There were two installation options,
aeneas (https://pypi.org/project/aeneas/) (last updated May 2017) and
py3-aeneas (https://pypi.org/project/py3-aeneas/), the one I mentioned using earlier.
I was naturally tempted to go for the "updated" version, and while it installs and runs without an error, it is
1. extremely slow,
2. incorrect,
the reason being that it does not install the necessary C extensions.

Thus, it is recommended to go for the first aeneas package. Further, to install aeneas properly on Ubuntu, we need to apt-install ffprobe ffmpeg espeak libespeak-dev
(libespeak-dev is important; otherwise the pip installation of aeneas may fail with an obscure error that demands the win32api, wincon or winreg libraries).

After installation, "python -m aeneas.diagnostics" should say something as follows,

[INFO] ffprobe        OK
[INFO] ffmpeg         OK
[INFO] espeak         OK
[INFO] aeneas.tools   OK
[INFO] aeneas.cdtw    AVAILABLE
[INFO] aeneas.cmfcc   AVAILABLE
[INFO] aeneas.cew     AVAILABLE
[INFO] All required dependencies are met and all available Python C extensions are working

In particular, C extensions should be working (CDTW, CMFCC, CEW)
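
A small Python sketch of how one might check this automatically, assuming the diagnostics text has already been captured as a string (for example with subprocess.run) and that its lines look like the ones above. The function name missing_extensions is hypothetical.

```python
REQUIRED_EXTENSIONS = ("aeneas.cdtw", "aeneas.cmfcc", "aeneas.cew")


def missing_extensions(diagnostics_output: str) -> list[str]:
    """Return the required C extensions not reported as AVAILABLE."""
    available = set()
    for line in diagnostics_output.splitlines():
        parts = line.split()
        # Expected shape: "[INFO] aeneas.cdtw AVAILABLE"
        if len(parts) >= 3 and parts[-1] == "AVAILABLE":
            available.add(parts[1])
    return [name for name in REQUIRED_EXTENSIONS if name not in available]
```

An empty list means all three extensions were reported as available; anything else names the extensions to investigate before running a long alignment job.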

After this, the alignments generated are fairly accurate. Overall alignment earlier took 26000+ seconds, and after the correct aeneas installation, it took ~1600 seconds (which is consistent with the stated requirement of ~1/5th of the audio file length, which for Ashtadhyayi, across 8*4 = 32 files, is close to 8500 seconds).
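
The arithmetic behind that comparison, spelled out in Python using the approximate figures quoted above (the variable names are only for this illustration):

```python
audio_seconds = 8500        # approximate total audio length quoted above
slow_run_seconds = 26000    # run without the C extensions ("26000+")
fast_run_seconds = 1600     # run with the C extensions ("~1600")

# Rule of thumb quoted above: about one fifth of the audio length.
expected_seconds = audio_seconds / 5

# How much faster the correct installation was, and its share of the audio length.
speedup = slow_run_seconds / fast_run_seconds
fast_fraction = fast_run_seconds / audio_seconds
```

So the expected time is about 1700 seconds, the observed ~1600 seconds is a little under one fifth of the audio length, and the correct installation was roughly 16 times faster.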

I've added the ashtadhyayi alignment to the repository.
-
हृषीकेश

Shreevatsa R

Apr 15, 2021, 2:07:30 AM
to sanskrit-programmers
This is great to hear! 

Is the updated version / UI hosted somewhere (GitHub Pages or whatever) so that we can take a look?

As I was mentioning to Avinash, the next step may be to align with the English translation as well, then we'll have a "video" of the sort I was initially imagining...

Avinash L Varna

Apr 15, 2021, 3:10:04 AM
to sanskrit-programmers
Thanks to Hrishikesh for all the updates and Shreevatsa for the initial implementation!
I've deployed the pages here - https://avinashvarna.github.io/audio_alignment/ Feedback is welcome.

@Shreevatsa, your suggestion is wonderful. Let's work towards that.

Thanks
Avinash

