--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
The recent thread about AIR news sparked the thought that it could be used to create a corpus for research on speech-to-text and text-to-speech for Sanskrit, since the audio and (mostly) the corresponding text are available. As Shree Pooja pointed out, these have been collected for the last 5-6 years in https://groups.google.com/forum/#!forum/samskrithanews. I've already downloaded the attachments from this group and organized them by year and month, yielding an initial corpus of ~300 hrs of audio with the corresponding text in PDF format, which I plan to upload to archive.org.
There are several more challenges in the next steps: extracting the text from the PDFs, segmenting the audio into sentences and aligning them with the corresponding text, manual proofreading to ensure that the audio does indeed match the text, and so on.

I am sure that there are members of this group with more experience in creating such datasets, especially for speech. Is anyone aware of any effort to create such a dataset? I believe that datasets are typically created by having participants read prepared text, rather than by this backwards approach. Are the steps outlined above worth pursuing, or are the obstacles too numerous to be surmountable? If it is indeed worthwhile, is anyone interested in collaborating on creating such a dataset?
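For the "segmenting into sentences" step on the text side, here is a minimal sketch, not a tested pipeline. It assumes the extracted text is clean Unicode and that sentences end with a danda (U+0964), a double danda (U+0965), or a question/exclamation mark; the actual AIR scripts may use other punctuation. The function name `split_sentences` is made up for illustration.

```python
import re

# Assumption: sentences end with a danda (U+0964), a double danda (U+0965),
# or a question/exclamation mark. Real scripts may need more cases.
SENTENCE_END = re.compile(r"([\u0964\u0965?!]+)")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    # re.split with a capturing group alternates: body, delimiter, body, ...
    parts = SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts), 2):
        body = parts[i].strip()
        mark = parts[i + 1] if i + 1 < len(parts) else ""
        if body:
            sentences.append(body + mark)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("रामः वनं गच्छति। सीता अपि गच्छति।"):
        print(sentence)
```

Each resulting sentence could then serve as one unit to align against an audio segment.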
--
नमो नमः,

Thanks for all the feedback.

> I wonder who has generated the text corresponding to the audio? Is it provided by AIR, or was it transcribed by volunteers at the samskrithanews group?

The pdf is provided by AIR NSD itself. As far as I can see, this corresponds to the script used by the newsreader for the news broadcast, and it is great that AIR makes this available as well. I've noticed some minor discrepancies in the few samples that I've checked, but most of the text is accurate.

> Not saying it's trivial -- may require months of tinkering until figuring out something good -- but it definitely seems possible.

Indeed, it seems a good problem to throw at a few grad students ;-). Not sure if any universities are looking into this area. We can do what we can in our spare time.

@Ganesh,

OCR of the pdfs is a good idea, but the pdf does not need to be OCR'ed per se, since the text is in Unicode. However, the fonts transform some glyphs, so some work is needed to properly extract them. This is a topic that has been previously discussed in this forum, so I will try to follow the suggestions there to see if they can be used.

If this works, the crowd sourcing portion can perhaps be a bit simpler. Instead of having to transcribe the entire audio clip, participants can be asked to review if the provided transcription is correct, and if not, make corrections. This may be easier than having to type out the entire transcription.
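To make the review-and-correct idea concrete, here is a rough sketch of how a reviewer's corrections could be recorded as word-level edits against the provided transcription, using Python's standard `difflib`. The function names and the 0.9 threshold are illustrative assumptions, not part of any existing tool.

```python
import difflib


def review_changes(provided, corrected):
    """Return the word-level edits a reviewer made to the provided text."""
    a, b = provided.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is one of "replace", "delete", "insert"
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


def needs_second_review(provided, corrected, threshold=0.9):
    """Flag clips whose transcript changed a lot (threshold is an assumption)."""
    matcher = difflib.SequenceMatcher(
        a=provided.split(), b=corrected.split(), autojunk=False
    )
    return matcher.ratio() < threshold


if __name__ == "__main__":
    print(review_changes("a b c", "a x c"))
```

Storing only the edits keeps the reviewer's job small, and clips with many edits can be routed to a second reviewer.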
Web hosting, storage, etc. have become cheap enough that I am sure we can figure something out. The major investment will really be the time to develop such an interface and crowd source the effort.

A similar dataset could be created from the audiobooks created by Samskrita Bharati volunteers, e.g. https://archive.org/details/bAlamodinI-01 through 05 and https://archive.org/details/Sanskrit-Audiobook-Samskrita-Bharati. Maybe that is a good place to start.
To view this discussion on the web visit https://groups.google.com/d/msgid/sanskrit-programmers/75400f9a-ab7e-4856-84b1-31d84c81709dn%40googlegroups.com.
As for the text, IIRC Vishvas (cc-ed) has collected either the exact text, or something fairly close, at some point? It should be one of the standard Ramayana editions (probably some "southern" one; I don't know much about Ramayana recensions).
@shree - Would sambhAShaNa sandesha text dump be useful? I corresponded with the editors over email but in the end it turned out that we need to speak with them.
On Wed, 14 Oct 2020 at 19:59, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

> As for the text, IIRC Vishvas (cc-ed) has collected either the exact text, or something fairly close, at some point? It should be one of the standard Ramayana editions (probably some "southern" one; I don't know much about Ramayana recensions).

Great, thank you. How closely do they match? Have you tested a bit, or do you know?
Does the list of sargas coincide at least?
I wonder if one can perform alignment between sound and text just by looking for a silence after roughly as long as it takes to recite one shloka.

> @shree - Would sambhAShaNa sandesha text dump be useful? I corresponded with the editors over email but in the end it turned out that we need to speak with them.

Off-topic, but this is perfect; I couldn't help laughing… they are *sambhAShaNa* sandesha after all :D
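The silence-based alignment idea above could be sketched roughly as follows. This is only an illustration under strong assumptions: the audio is already decoded into a list of amplitude samples, a fixed amplitude threshold is enough to detect silence (real recordings would more likely need frame-level energy), and `min_segment_s` stands in for "roughly as long as it takes to recite one shloka". All names and parameters are hypothetical.

```python
def find_cut_points(samples, rate, min_segment_s, min_silence_s, threshold):
    """Return sample indices at which to cut the audio, one per accepted pause.

    A cut is placed at the midpoint of a silent run when the run is at least
    min_silence_s long and the midpoint lies at least min_segment_s after the
    previous cut (or after the start of the audio).
    """
    min_segment = int(min_segment_s * rate)
    min_silence = int(min_silence_s * rate)
    cuts = []
    last_cut = 0
    run_start = None  # index where the current silent run began, if any
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            # A silent run just ended at index i; decide whether to cut in it.
            run_len = i - run_start
            mid = run_start + run_len // 2
            if run_len >= min_silence and mid - last_cut >= min_segment:
                cuts.append(mid)
                last_cut = mid
            run_start = None
    return cuts


if __name__ == "__main__":
    loud, quiet = [1.0] * 20, [0.0] * 5
    toy = loud + quiet + loud + [0.0] * 2 + loud + quiet + loud
    print(find_cut_points(toy, rate=10, min_segment_s=1.0,
                          min_silence_s=0.5, threshold=0.1))
```

Short pauses inside a shloka are skipped because they come too soon after the previous cut or are too brief, so each resulting segment would, ideally, line up with one shloka of the text. Manual proofreading would still be needed.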
>> Would sambhAShaNa sandesha text dump be useful?

Definitely. As mentioned previously in this thread, there are several hundred hours' worth of audio recordings of bAlamodinI stories created by volunteers: https://archive.org/details/bAlamodinI-01 through https://archive.org/details/bAlamodinI-05. With the text dump, these could be leveraged as well, if it's OK to do so.

I suppose someone must raise this question, so let me do it: does anyone have experience with obtaining permission from such volunteers to use their recordings to train TTS/ASR systems? E.g. in Shreevatsa's example, would the reciters of the Ramayana recordings be comfortable with someone taking those recordings and training a TTS system that can convert anuShTubh shlokas to audio that sounds like it came from them? Or would they consider it a 'deepfake'?
I think the cleaner approach would be to recruit volunteers who are aware of the intended usage and have them make recordings (similar to Mozilla's Common Voice), which requires a significant time investment ...
Then we will not use it for now. It's something that would be good to have, but not necessary to get started with for this project.