Creating a speech + text corpus from AIR Sanskrit News data


Avinash L Varna

Jul 7, 2018, 9:25:52 PM
to sanskrit-programmers
नमो नमः,

The recent thread regarding AIR news sparked the thought that it could be used to create a corpus for speech-to-text and text-to-speech research for Sanskrit, since the audio and (mostly) corresponding text are available. As Shree Pooja pointed out, these have been collected for the last 5-6 years at https://groups.google.com/forum/#!forum/samskrithanews. I've already downloaded the attachments from this group and organized them by year and month, yielding an initial corpus of ~300 hrs of audio with the corresponding text in pdf format, which I plan to upload to archive.org.

There are several challenges in the next steps - extracting the text from the pdfs, segmenting the audio into sentences and aligning it with the corresponding text, manual proofreading to ensure that the audio does indeed match the text, etc.

I am sure that there are members of this group with more experience in creating such datasets, especially for speech. Is anyone aware of any effort to create such a dataset? I believe that datasets are typically created by having participants read prepared text, rather than this backwards approach. Are the steps outlined above worth pursuing, or are the obstacles too numerous to surmount? If it is indeed worthwhile, is anyone interested in collaborating on creating such a dataset?

Regards,
Avinash


Ganesh S

Jul 8, 2018, 12:44:48 AM
to sanskrit-p...@googlegroups.com
नमोनमः।

I have been thinking about creating a samskritam speech dataset with correct transcriptions, along the lines अविनाश महोदय pointed out, and I see two possible approaches.

1. Collect data from people using a web app: create a web app (with WebAudio) that displays a samskritam sentence and lets people record it by pressing a microphone button and submitting. We store the recorded sentences in a database backend with enough storage space.

2. The audio resource pointed out by अविनाश महोदय is fantastic, but performing OCR on the pdfs and then aligning the text with the audio will be very time consuming. I can automatically generate segments of audio (3-10 seconds long) from the long recordings. We can then crowdsource the transcriptions for these segments with the help of volunteers from our group and the larger samskrita parivarah. For this purpose too, I feel we need a web app that plays a random audio segment from our database and provides a text box for people to write the transcription and submit it. This way it is easy to centrally control the crowdsourcing process.
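The segmentation step in approach 2 can be sketched roughly as below. This is a library-free sketch that works on a plain list of absolute sample amplitudes; a real pipeline would first decode the MP3/WAV files, and all thresholds here are made-up starting points, not tuned values.

```python
SAMPLE_RATE = 16000              # assumed sampling rate (Hz)
SILENCE_THRESHOLD = 500          # amplitudes below this count as silence
MIN_SILENCE = SAMPLE_RATE // 4   # a pause of >= 0.25 s ends a segment
MIN_SEG = 3 * SAMPLE_RATE        # keep segments between 3 and 10 seconds,
MAX_SEG = 10 * SAMPLE_RATE       # as suggested above

def split_on_silence(samples):
    """Return (start, end) sample indices of speech segments."""
    segments, start, silence = [], None, 0
    for i, amp in enumerate(samples):
        speech = amp >= SILENCE_THRESHOLD
        if speech and start is None:
            start = i                      # a new segment begins
        if start is None:
            continue
        silence = 0 if speech else silence + 1
        # Close the segment on a long pause, or force a cut at MAX_SEG.
        if silence >= MIN_SILENCE or i - start + 1 >= MAX_SEG:
            end = i + 1 - silence
            if end - start >= MIN_SEG:     # drop fragments shorter than 3 s
                segments.append((start, end))
            start, silence = None, 0
    if start is not None and len(samples) - silence - start >= MIN_SEG:
        segments.append((start, len(samples) - silence))
    return segments
```

The returned indices can then be used to cut the original file into clips for the crowdsourcing app.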

Unfortunately I don't have any experience creating web apps or phone apps. I have been tinkering with a simple app that records audio and stores the file locally. However, it would be necessary to link the app to a storage server that we own, to which the recordings can be pushed.

For the second approach, we again need storage from which audio samples can be sequentially retrieved and loaded onto the web app's page.

I look forward to working with someone who has experience in these areas to perform data collection.

--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shreevatsa R

Jul 8, 2018, 4:06:38 AM
to sanskrit-programmers
On Sat, Jul 7, 2018 at 6:25 PM Avinash L Varna <avinas...@gmail.com> wrote:
The recent thread regarding AIR news sparked the thought that it could be used to create a corpus for research regarding speech-to-text and text-to-speech for Sanskrit, since audio and corresponding text (mostly) are available. As Shree Pooja pointed out, these have been collected for the last 5-6 years in https://groups.google.com/forum/#!forum/samskrithanews. I've already downloaded the attachments from this group and have organized it by year + month yielding an initial corpus that has ~300 hrs of audio with the corresponding text in pdf format, which I plan to upload to archive.org.

This is great! I had never expected that so much transcribed Sanskrit audio exists. This would be a great resource; thank you.
I wonder who has generated the text corresponding to the audio? Is it provided by AIR, or was it transcribed by volunteers at the samskrithanews group? 
Either way, it's wonderful that it exists.
 

There are several more challenges in the next steps - extracting the text from the pdf, segmenting the audio into sentences and aligning with the corresponding text, manual proofreading to ensure that the audio does indeed match the text, etc., etc.

I am sure that there are members of this group with more experience in creating such datasets, especially for speech. Is anyone aware of any effort to create such a dataset ? I believe that typically datasets are created by having participants read some data instead of this backwards approach. Are the steps outlined above worth pursuing or are the obstacles too numerous to be surmountable? If it is indeed worthwhile, is anyone interested in collaborating on creating such a dataset ? 

I think it is indeed worthwhile collecting this dataset -- the obstacles are numerous but not, AFAICT, insurmountable. ML techniques (for example) are getting better every day, and given that both the audio and the corresponding text exist, and the problem is merely one of segmentation and alignment, I'm sure it's within reach of current techniques -- e.g. pauses in the audio are likely to correspond to paragraph breaks in the text. Not saying it's trivial -- it may require months of tinkering before figuring out something good -- but it definitely seems possible.

Avinash L Varna

Jul 8, 2018, 3:02:06 PM
to sanskrit-programmers
नमो नमः,

Thanks for all the feedback.

> I wonder who has generated the text corresponding to the audio? Is it provided by AIR, or was it transcribed by volunteers at the samskrithanews group?  

The pdf is provided by AIR NSD itself. As far as I can see, this corresponds to the script used by the newsreader for the news broadcast, and it is great that AIR makes this available as well. I've noticed some minor discrepancies in the few samples that I've checked, but most of the text is accurate.

> Not saying it's trivial -- may require months of tinkering until figuring out something good -- but it definitely seems possible.

Indeed, it seems a good problem to throw at a few grad students ;-). Not sure if any universities are looking into this area. We can do what we can in our spare time.

@Ganesh,

OCR of the pdfs is a good idea, but the pdfs do not need to be OCR'ed per se, since the text is in Unicode. However, the fonts transform some glyphs, so some work is needed to extract them properly. This topic has been discussed previously in this forum, so I will try to follow the suggestions there and see if they can be used.
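If direct text extraction (e.g. with pdfminer) yields Unicode with reordered or private-use glyphs, a font-specific cleanup pass along these lines might help. The fixes below are invented illustrations of common legacy-font issues, not the actual mappings for the AIR pdfs' fonts (those would have to be reverse-engineered per font).

```python
import re

def fix_glyphs(text):
    # 1. A dependent i-matra that the font stored *before* its consonant
    #    (visual order) is moved back after it (Unicode logical order).
    text = re.sub('\u093f([\u0915-\u0939])', '\\g<1>\u093f', text)
    # 2. Private-use codepoints left over from font-specific glyphs are mapped
    #    back to real Devanagari (placeholder entry; the real table is per font).
    private_map = {'\ue001': '\u0915\u094d\u0937'}   # hypothetical PUA glyph -> क्ष
    for src, dst in private_map.items():
        text = text.replace(src, dst)
    return text
```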

If this works, the crowd sourcing portion can perhaps be a bit simpler. Instead of having to transcribe the entire audio clip, participants can be asked to review if the provided transcription is correct, and if not, make corrections. This may be easier than having to type out the entire transcription.

Web hosting/storage etc. has become cheap enough to the point that I am sure we can figure something out. The major investment will really be the time to develop such an interface and crowd source the effort.

A similar dataset could be created from the audiobooks created by Samskrita Bharati volunteers - E.g.  https://archive.org/details/bAlamodinI-01 through 05 and https://archive.org/details/Sanskrit-Audiobook-Samskrita-Bharati. Maybe that is a good place to start.

Thanks
Avinash



Ganesh S

Jul 12, 2018, 2:10:02 AM
to sanskrit-p...@googlegroups.com
नमस्ते अविनाश महोदय, 


On Sun, Jul 8, 2018, 3:02 PM Avinash L Varna <avinas...@gmail.com> wrote:

If this works, the crowd sourcing portion can perhaps be a bit simpler. Instead of having to transcribe the entire audio clip, participants can be asked to review if the provided transcription is correct, and if not, make corrections. This may be easier than having to type out the entire transcription.

This idea is perfect. I know an open-source tool with which I can set up the tasks for reviewing the transcripts, so this can definitely be done. I can also take a look at the audio files and the pdfs to see how we can align them. It is preferable to have one or at most two sentences per audio file, with corresponding transcripts, for the training dataset. I will update you about what I can do to work around the issue of larger audio files with corresponding paragraphs of text.

Could you please point me to the pdfs and audio files? 

Web hosting/storage etc. has become cheap enough to the point that I am sure we can figure something out. The major investment will really be the time to develop such an interface and crowd source the effort.

A similar dataset could be created from the audiobooks created by Samskrita Bharati volunteers - E.g.  https://archive.org/details/bAlamodinI-01 through 05 and https://archive.org/details/Sanskrit-Audiobook-Samskrita-Bharati. Maybe that is a good place to start.

I can start a small experiment of data collection with some karyakartas here in the Atlanta kendra. I have an idea which, if it works, can be used to crowdsource data from all volunteers here. I will update you on how this experiment goes.

Avinash L Varna

Jul 14, 2018, 10:20:18 PM
to sanskrit-programmers
namo namaH

I tried to extract the text from the pdfs directly, but ran into various issues. It turned out to be easier to OCR the text as Ganesh had suggested, so I converted each pdf to a png and ran the images through Google OCR. The OCR responses are included in the dataset. This process took a bit of time, so apologies for the delay.

The entire dataset of news broadcasts spanning ~5 years from 2012-06 to 2017-06 is now available on archive.org at the following link:
(~4.1G)
For quick experimentation, I've also created a smaller subset consisting of the available news broadcasts from 2012 (~6-7 months) which is available here:
(~200M)

The README included in the dataset is attached as an FYI.

The next step would be to segment the audio and text and figure out a process to obtain a tentative alignment that can then be checked manually through some form of crowdsourcing. Please send me a note if you are interested in collaborating.

Thanks
Avinash

README.txt

Ganesh S

Jul 19, 2018, 11:52:17 PM
to sanskrit-p...@googlegroups.com
Thank you for uploading the Sanskrit news database and the transcripts obtained using OCR. I have been looking at the transcript json files and listening to a few of the audio files. For the most part the transcripts are correct. There are a few places where the newsreader seems to insert a word, like "iti", at the end of some sentences. These errors won't hurt the performance of speech recognition too much.

I observed that the text often contains words joined together by sandhi, while the reader splits them while reading. In some places the reader reads with the sandhis.

About sandhi and the concept of a word in Sanskrit
For speech recognition, and even for synthesis, we need to think about how to approach Samskritam as a language. In English, we can have a well-defined pronunciation dictionary of words with their corresponding phonetic pronunciations. For samskritam, the aksharAni are our phonemes. However, there cannot be a well-defined set of words: due to sandhi, a potentially infinite number of words can be formed. For speech recognition, we will need some kind of language model to decode words from a series of aksharAni; moreover, decoding the series of aksharAni from speech without a language model is more error-prone. To start with, I can work on creating a phone recognition system (akshara sequence recognition). We can then figure out how to deal with words and factor that into our speech recognition system.



Shree Nadig

Oct 12, 2020, 3:20:33 PM
to sanskrit-programmers
नमो नमः,

We (a small team at Tarunodaya Samskruta Seva Samsthe, Shivamogga, and I) are very much interested in developing an ASR dataset for Sanskrit.
I found this group while searching for publicly available speech data for Sanskrit. We found AIR recordings to be one of the potential sources, although the data available on http://newsonair.com/ covers only a few months.
We are also exploring a small web-app to crowd-source a small seed dataset.

@Avinash and team, great work in collecting this over the years and putting it in one place.
I can help build a Kaldi or ESPnet based ASR system using the recent AIR data that we have and the data that @Avinash has uploaded (if Avinash is OK with it).
This is purely for research purposes, and to further the technology adoption for Sanskrit.

We need a few hours (around 10) of cleaned, annotated data from this corpus without any mismatch between audio and text. Using this clean data, we can bootstrap and remove noise, intro music, speech by guests etc. over the remaining bulk dataset. Both Kaldi and ESPnet come with scripts to do this bootstrapping.
Currently, we have a small team of volunteers with near native/fluent Sanskrit skills who are ready to help us annotate this small seed dataset.

Please let me know if you are OK with us using the data that you have collected.
We plan to open-source the dataset, the code to train the ASR models, and the ASR models.

To answer some points Dr Ganesh raised: yes, this would not work too well without an LM. We can use a Wikipedia dump and other text corpora to train an LM. Besides, if we train an end-to-end ASR model, it will not require a lexicon and hence would take less effort to get started with. This is very much possible with the ~300 h that are available.
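To make the LM idea concrete, here is a toy character-level bigram model with add-one smoothing that could be trained on whatever Devanagari text is available (the Wikipedia dump, the AIR transcripts). A real system would use a proper n-gram toolkit (e.g. KenLM/SRILM) or a neural LM; this only shows the concept of scoring candidate akshara sequences.

```python
import math
from collections import Counter

class BigramLM:
    def __init__(self, corpus):
        chars = list(corpus)
        self.vocab = set(chars)
        self.unigrams = Counter(chars)
        self.bigrams = Counter(zip(chars, chars[1:]))

    def logprob(self, text):
        """Add-one-smoothed log P(text), scored bigram by bigram."""
        total = 0.0
        for a, b in zip(text, text[1:]):
            num = self.bigrams[(a, b)] + 1
            den = self.unigrams[a] + len(self.vocab)
            total += math.log(num / den)
        return total
```

During decoding, the recognizer would prefer the hypothesis with the higher `logprob`, which is exactly where ill-formed akshara sequences get penalized.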
Did you develop a G2P system for Sanskrit? We're also looking to build a simple rule-based G2P to begin with.

Best Regards,
Shreekantha

Avinash L Varna

Oct 13, 2020, 3:09:07 PM
to sanskrit-programmers
Namaste,

I am happy to hear that your team is planning to use this dataset for research into ASR for Sanskrit. The intention of collecting the data was so that it could be used for research purposes as you described, so I personally have no objections. Please make sure that you respect the rights of the original owner (AIR).

While I have played around with TTS/ASR systems on and off since creating this dataset, I have not been able to devote an uninterrupted chunk of time to dig deeper into this research topic. I will be happy to see another team pick it up, and will try to pitch in if I can.

Since you also mentioned using Wikipedia dump to train an LM, perhaps this data repo with the dump of Sanskrit wikipedia extracted into XML might also be of interest to you - https://github.com/avinashvarna/sa_wiki_text



Ganesh S

Oct 13, 2020, 4:24:07 PM
to sanskrit-p...@googlegroups.com
Namaste Shree varya,

Great effort that you are planning to undertake! I briefly worked on creating transcriptions for read-speech Sanskrit stories (by SB volunteers). I cut them into small sentences using a VAD and wanted to put them on a web app for crowdsourcing the transcriptions. I have been quite caught up with work, so I haven't really devoted much time to it. I will share the dataset of a few stories with you. I would be interested in collaborating on the transcription project; any end-to-end system would also need sentence-level transcriptions to train.

About G2P for Sanskrit: the boon of Sanskrit and most Bharatiya bhAShAH is that you don't need G2P. We read Samskritam the same way we write it. Splitting the samyukta aksharas into vyanjanas (with halanta) and svaras gives the phonetic representation.
Example: प्रयत्न = प् र् अ य् अ त् न् अ
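This splitting can be mechanized directly from the Unicode code points. A sketch covering the basic cases (no nukta, no Vedic accents; anusvara, visarga, and independent vowels pass through as-is):

```python
VIRAMA = '\u094d'   # ्  (halanta)
A = '\u0905'        # अ (inherent vowel)

# dependent vowel sign (matra) -> independent vowel
MATRA = {'\u093e': 'आ', '\u093f': 'इ', '\u0940': 'ई', '\u0941': 'उ',
         '\u0942': 'ऊ', '\u0943': 'ऋ', '\u0947': 'ए', '\u0948': 'ऐ',
         '\u094b': 'ओ', '\u094c': 'औ'}

def to_phonemes(word):
    """Split a Devanagari word into vyanjana-with-halanta and svara units."""
    phones, i = [], 0
    while i < len(word):
        ch = word[i]
        if '\u0915' <= ch <= '\u0939':            # a consonant (क..ह)
            phones.append(ch + VIRAMA)            # bare consonant sound
            nxt = word[i + 1] if i + 1 < len(word) else ''
            if nxt == VIRAMA:                     # conjunct: no vowel here
                i += 2
            elif nxt in MATRA:                    # explicit vowel sign
                phones.append(MATRA[nxt])
                i += 2
            else:                                 # inherent 'a'
                phones.append(A)
                i += 1
        else:                                     # independent vowels, anusvara,
            phones.append(ch)                     # visarga, etc. pass through
            i += 1
    return phones
```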

Would love to collaborate on this and push this work forward! Please let me know when we can talk more and work on this project. 

Shree Nadig

Oct 14, 2020, 3:45:02 AM
to sanskrit-programmers
Namaste Avinash and Ganesh,

Avinash, thank you very much for offering your help and for the sa_wiki_text. It'll definitely be helpful, as it saves me the effort of collecting the same data.
We recognize the licensing issues with AIR. The policy on the website says we can't redistribute the data for anything other than personal or non-commercial use. I hope there won't be any issue with this, as we are not using it for commercial purposes and this is purely for advancing technology adoption. There might be an issue if we want to make the data public, though, and I don't currently know how we could solve that. Right now, my thinking is that if we can leverage this data to build an ASR model, then even if we don't make the data public, we can use the model to build another dataset quickly (it will help in cleaning and organizing data much faster).

Ganesh, thank you very much for offering your help. Yes, any annotated data would be great to get started with.
We have similar plans for the G2P model. I'm thinking of building it using Pynini (referring to the Estonian G2P). This would be more efficient, as we can reuse it during inference to recover the letter sequence easily (relative to splitting the letters directly).
It would be great to get your inputs on this.
How many hours is the Sanskrit stories data?

I'm scoping a couple of OSS projects for data annotation (aligning speech and text) to produce the seed data we need to build the seed model.
We have to make it as easy as possible for the annotators to get this done.

We can probably plan a meeting in the last week of October, if that's OK with you.
Please let me know. I work in the IST timezone, but am usually online till 7:30 PM GMT.

Best Regards,
Shree

Shreevatsa R

Oct 14, 2020, 1:53:42 PM
to sanskrit-programmers
For a speech+text corpus, have you considered using the Ramayana recitation (by "वेदभाष्यरत्नम्" ब्रह्मश्री सलक्षणघनपाठी V.श्रीरामः and "स्वाध्यायरत्नम्" ब्रह्मश्री सलक्षणघनपाठी हरिसीताराममूर्तिः)? It is 56 hours of audio with impeccable and very pleasant pronunciation. There are two caveats:

1. By using this you may get something optimized for verse (or even just Anushtubh Shloka verse, though there's a bit in other metres too), rather than for prose. But if you think about it, text-to-speech for verse is also (and possibly more) useful for Sanskrit, given that the bulk of Sanskrit literature is in Anushtubh Shloka verse.

2. I'm not sure exactly which text (which edition/recension of the Ramayana) corresponds to the audio, but someone here may be able to help. In any case, you can pick any edition and most of the text should agree, and your ML pipeline can probably do the alignment too?




Shree Nadig

Oct 14, 2020, 3:18:27 PM
to sanskrit-programmers
Thanks for pointing me to that resource. The Ramayana data sounds excellent, as you said, but I see a couple of challenges with it:
  • It has background music in all the verses. That might be OK if the model will be used to transcribe similar audio, but if we want a generic ASR model that performs fairly well across a wide variety of acoustic conditions and speakers, we need clean multi-speaker speech.
  • If there is a corresponding text available, we can give it a try, though I do see a challenge in cleaning the audio to remove the background music.
  • The ML pipeline can do the alignment if we have a seed model (i.e., we do need a small corpus of around 5 to 10 hours of data to train a small model for alignment). Such a model won't perform very well on decoding unseen audio, but would perform fairly well at aligning a pair of audio and text.
  • This dataset would be a good resource for building a speech synthesis model, as it's all from a single speaker. Most TTS (text-to-speech) models for English are built on around 25 hours of data; this has 56, which is very good.
Our first-phase project is building an ASR system for Sanskrit. We were planning to take up the TTS system at a later point, as it's costly to collect single-speaker data with studio-quality recordings. There are also many decisions to be made about the text prompts needed to collect such a dataset for TTS. (Example here)

Best Regards,
Shree

Shreevatsa R

Oct 14, 2020, 5:21:02 PM
to sanskrit-programmers, vishvas Vasuki
Thanks for the detailed reply; it was informative!

As far as I can tell, the only background sounds are
(1) a bell at the beginning and end of each sarga (along with some introductory/final words, which should be cut out anyway if they aren't in the text), and
(2) a barely perceptible drone (for shruti).
I imagine the first one at least will be easy to cut out (even by just looking for silence); I'm not sure about the second.

As for the text, IIRC Vishvas (cc-ed) has collected either the exact text, or something fairly close, at some point? It should be one of the standard Ramayana editions (probably some "southern" one; I don't know much about Ramayana recensions).


विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 14, 2020, 10:59:02 PM
to Shreevatsa R, sanskrit-programmers
On Thu, Oct 15, 2020 at 2:51 AM Shreevatsa R <shree...@gmail.com> wrote:


As for the text, IIRC Vishvas (cc-ed) has collected either the exact text, or something fairly close, at some point? It should be one of the standard Ramayana editions (probably some "southern" one; I don't know much about Ramayana recensions).


@shree - Would sambhAShaNa sandesha text dump be useful? I corresponded with the editors over email but in the end it turned out that we need to speak with them.


--
Vishvas /विश्वासः

Avinash L Varna

Oct 14, 2020, 11:27:15 PM
to sanskrit-programmers, Shreevatsa R
>> Would sambhAShaNa sandesha text dump be useful?

Definitely. As mentioned previously in this thread, there are several hundred hours' worth of audio recordings of bAlamodinI stories created by volunteers - https://archive.org/details/bAlamodinI-01 through https://archive.org/details/bAlamodinI-05. With the text dump, these could be leveraged as well, if it's OK to do so.

I suppose someone must raise this question, so let me do it - Does anyone have experience with obtaining permission from such volunteers to use their recordings to train TTS/ASR systems? E.g. in Shreevatsa's example, would the reciters of the Ramayana recordings be comfortable with someone taking those recordings and training a TTS system that can convert anuShTubh shlokas to audio that sounds like it came from them? Or would they consider it a 'deepfake'? I think the cleaner approach would be to recruit volunteers who are aware of the intended usage and have them make recordings (similar to Mozilla's Common Voice), which requires a significant time investment ...

Thanks
Avinash

Shreevatsa R

Oct 14, 2020, 11:51:57 PM
to विश्वासो वासुकिजः (Vishvas Vasuki), sanskrit-programmers
On Wed, 14 Oct 2020 at 19:59, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:
On Thu, Oct 15, 2020 at 2:51 AM Shreevatsa R <shree...@gmail.com> wrote:
As for the text, IIRC Vishvas (cc-ed) has collected either the exact text, or something fairly close, at some point? It should be one of the standard Ramayana editions (probably some "southern" one; I don't know much about Ramayana recensions).


Great, thank you. How closely do they match, have you tested a bit / do you know? Does the list of sargas coincide at least?
I wonder if one can perform alignment between sound and text even by just looking for silence after roughly as long as it takes to recite one shloka.
 
@shree - Would sambhAShaNa sandesha text dump be useful? I corresponded with the editors over email but in the end it turned out that we need to speak with them.

Offtopic, but this is perfect; I couldn't help laughing… they are *sambhAShaNa* sandesha after all :D

विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 15, 2020, 2:08:45 AM
to Shreevatsa R, sanskrit-programmers
On Thu, Oct 15, 2020 at 9:21 AM Shreevatsa R <shree...@gmail.com> wrote:
On Wed, 14 Oct 2020 at 19:59, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:
On Thu, Oct 15, 2020 at 2:51 AM Shreevatsa R <shree...@gmail.com> wrote:
As for the text, IIRC Vishvas (cc-ed) has collected either the exact text, or something fairly close, at some point? It should be one of the standard Ramayana editions (probably some "southern" one; I don't know much about Ramayana recensions).


Great, thank you. How closely do they match, have you tested a bit / do you know?

Very. A word here and there will be different.

 
Does the list of sargas coincide at least?
Yes 

विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 15, 2020, 2:12:32 AM
to sanskrit-programmers, Shreevatsa R
On Thu, Oct 15, 2020 at 8:57 AM Avinash L Varna <avinas...@gmail.com> wrote:
>> Would sambhAShaNa sandesha text dump be useful?

Definitely. As mentioned previously in this thread, there are several 100 hours worth of audio recordings of bAlamodinI stories created by volunteers - https://archive.org/details/bAlamodinI-01 through https://archive.org/details/bAlamodinI-05. With the text dump, these could be leveraged as well, if it's OK to do so.

I suppose someone must raise this question, so let me do it - Does anyone have experience with obtaining permission from such volunteers to use their recordings to train TTS/ASR systems? E.g. in Shreevatsa's example, would the reciters of the Ramayana recordings be comfortable with someone taking those recordings and training a TTS system that can convert anuShTubh shlokas to audio that sounds like it came from them? Or would they consider it a 'deepfake'?

The rAmAyaNa reciters definitely would not mind (that's the second hand impression I get from talking to suhAs and sudarshan who interacted with them).

 
I think the cleaner approach would be to recruit volunteers who are aware of the intended usage and have them make recordings (similar to Mozilla's Common Voice), which requires a significant time investment ...
I think the risk is far lower than the opportunity cost.

Shree Nadig

Oct 15, 2020, 11:43:47 AM
to sanskrit-programmers
Thank you Vishvas for the resources.

Yes, sambhAShaNa sandesha text dump would help in building a better Language model. I see on the website they have archives going back many years. This would be a GREAT resource for building LMs, not necessarily just for ASR, but general LMs which can be used in a variety of applications.

I had a look at your collection of rAmAyaNam/AndhrapAThaH, and I think it can be useful to us, though I'm not 100% sure. Could I use it?
As @Shreevatsa mentioned, I'll have to remove the beginning part of each utterance, and I'm hoping the persistent drone for shruti can be removed with some filter. I'll give it a try.
We can't align the audio and text just by looking for silence; that is not reliable, and any mismatch will affect the ASR model.

@Avinash, Mozilla Common Voice is also something we could try. They have a detailed guide on how to add a language to the collection process. The first step is to find an open-source text dataset; their sentence collector uses Wikipedia dumps. But this data can't be used for TTS, as it'll be multi-speaker and the TTS would turn out robotic. If we want a realistic-sounding TTS system, all the data has to be from a single speaker, preferably recorded without any noise in a studio at CD quality. We have plans to pursue such a project after the ASR one :)

Right now, I think we have enough resources to get started with, thanks to the datasets that this group has put together and our collection of AIR recordings from the past few months.
I don't think it will take too much effort to clean a seed dataset of ~5 to 10 hours, which is what we are aiming for in the next month or so with our team of volunteers.
Once we have this seed data ready, we will build an ASR model and use that to bootstrap on the remaining dataset by aligning them using the model automatically.
If we get any more data which has audio and text but does not have alignments, we can repeat the same steps to clean it and include it in our training. We will open-source the scripts to do this from day 1 when we start implementing these.
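The bootstrap loop described above can be sketched as a control flow with the heavy lifting (training, alignment scoring) injected as callables. Function names and the 0.9 keep-threshold are hypothetical; Kaldi and ESPnet ship real recipes for each stage.

```python
def bootstrap(seed_data, unaligned, train, align, rounds=3, keep=0.9):
    """seed_data: (audio, text) pairs already verified by hand.
    unaligned: (audio, text) pairs of unknown quality.
    train(pairs) -> model; align(model, pair) -> confidence in [0, 1]."""
    data = list(seed_data)
    for _ in range(rounds):
        model = train(data)
        # Keep only segments the current model aligns confidently, which is
        # what drops intro music, guest speech, and audio/text mismatches.
        accepted = [p for p in unaligned if align(model, p) >= keep]
        data = list(seed_data) + accepted
    return train(data), data
```

Each round the model improves, so more of the bulk data survives the confidence filter in the next round.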

Thank you everyone who pitched in so far. I will keep this group posted on the progress.

Best Regards,
Shree

Chandrasekharan Raman

Oct 15, 2020, 4:28:03 PM
to sanskrit-programmers, विश्वासो वासुकिजः (Vishvas Vasuki)
Yes, sambhAShaNa sandesha text dump would help in building a better Language model. I see on the website they have archives going back many years. This would be a GREAT resource for building LMs, not necessarily just for ASR, but general LMs which can be used in a variety of applications.

Just a caveat: sambhAShaNa sandesha is copyrighted, and the editors of the magazine were not happy about the audios on archive.org, because the authors of the stories may not like it (or for reasons like this). I know @विश्वासो वासुकिजः (Vishvas Vasuki) mentioned that he is trying to contact them. It is better to keep them informed about this project.

Shree Nadig

Oct 16, 2020, 4:11:20 PM
to sanskrit-p...@googlegroups.com
Then we will not use it for now.
It's something that would be good to have, but not necessary to get started with for this project. 

विश्वासो वासुकिजः (Vishvas Vasuki)

Oct 16, 2020, 10:15:24 PM
to sanskrit-programmers
On Sat, Oct 17, 2020 at 1:41 AM Shree Nadig <shreeka...@gmail.com> wrote:
Then we will not use it for now.
It's something that would be good to have, but not necessary to get started with for this project. 
 
It would be a good idea to call the aksharam office of saMskRta bhAratI at 080 26721052 / 080 26722576 and talk to Janardana Hegde. Whether or not it works immediately, it will at least put the thought in their heads.

 