I would love to contribute on TTS and STT.


hr12345 history

Mar 8, 2019, 5:36:58 AM3/8/19
to indicnlp
Hello,
I am Harsha. I love the work you are doing and am motivated to help solve these challenges. I have good experience and knowledge in speech synthesis (I have worked on the Tacotron 2 model), so I would like to work on speech-to-text (STT) and text-to-speech (TTS) models.

Muru Selvakumar

Mar 8, 2019, 7:14:15 AM3/8/19
to hr12345 history, indicnlp
Hi Harsha,

Thanks for the appreciation. It is exciting to see people like you willing to contribute. We have tried to gain access to datasets for STT and TTS from existing sources like TDIL. We haven't received a positive response, and even if we gain access, their licensing is too restrictive. So we are in the process of building tools to gather datasets for those tasks.

Before that, we need help with a lot of things, like more crawler scripts. If you'd like, you can write crawlers for news websites. Please take a look at crawler-viduthalai2.py under tamil in the following repo for reference.
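The referenced crawler-viduthalai2.py is not shown in this thread, so as a hedged starting point only, a minimal news-site crawler might look like the sketch below. It uses only the Python standard library; the function names and the site structure it assumes are my own, not the actual script's.

```python
# Minimal news-crawler sketch (hypothetical names; the real
# crawler-viduthalai2.py may differ). Stdlib only: urllib + html.parser.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href attributes of <a> tags from an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all absolute link URLs found in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def crawl(url):
    """Fetch one page and return the absolute URLs it links to (needs network)."""
    with urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_links(html, url)


# Offline demonstration of the parsing step on a static snippet.
demo = '<a href="/news/tamil/1.html">Story</a> <a href="archive.html">Archive</a>'
print(extract_links(demo, "https://example.org/"))
```

A real crawler would add per-article text extraction and polite rate limiting on top of this skeleton.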


On the machine learning front, you can start from the git repo below.





Thanks,
vanangamudi.



--
You received this message because you are subscribed to the Google Groups "indicnlp" group.
To unsubscribe from this group and stop receiving emails from it, send an email to indicnlp+u...@googlegroups.com.
To post to this group, send email to indi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/indicnlp/10e00bb3-2e0c-4bad-80e0-a1333112da42%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sowmyan Tirumurti

Mar 26, 2019, 11:14:44 PM3/26/19
to indicnlp
I am new to this forum and do not know of past discussions. IIT Madras has developed TTS systems (for multiple Indian languages, including Tamil) and also makes them available for others to use.

Soham Chatterjee

Mar 27, 2019, 2:11:48 AM3/27/19
to indicnlp
Hey Sowmyan,

We are aware of that dataset and have also gained access to it. Unfortunately, it is under copyright, meaning that we cannot share it and hence cannot open-source it.

Sowmyan Tirumurti

Mar 27, 2019, 3:27:38 AM3/27/19
to Soham Chatterjee, indicnlp
Soham, Thank you. 

In some domains, I feel a crowd can find solutions more effectively than a university or institution.

I have an interest in exploring this. I learnt ML for this, and audio signal processing as well. However, I have not focussed, and have continued to work as an individual. I have many thoughts. Perhaps a group like this can support discussions and a path to finding a solution. With this view, I begin sharing my views.

It has been my opinion that Indian languages are different from English for speech processing. They can be easier and simpler, because Indian languages are more phonetic. While English speech analysis seeks to capture thousands of words to extract syllables, the Indian alphabets themselves appear to be a comprehensive set of syllables. The other thing I find is that in English they seek to extract pairs of syllables, because some approaches to speech synthesis work by concatenating syllables. Since syllable boundaries are difficult to find and join, they cut a syllable in the middle and join it to a complementary cut in the adjoining syllable of the pair. Even in this process there are multiple variations, so they need statistical techniques to match the pairs. Since the pairs of syllables proliferate to a large count, recording and analysis have additional problems: the voice actors get tired and the voice quality changes, so they need to record multiple times over multiple days, and in a studio environment. Considering the cost and effort, these data are perhaps held closely rather than shared.

I am yet to learn the neural-network approach to speech analysis/synthesis/processing, but my hunch is that we may be able to work with characters rather than with syllable pairs. While vowels are relatively steady-state sounds, consonants are fleeting and difficult to capture. For want of an adequate variety of voice data to find common patterns, I have not explored consonants as much as I have explored vowels.

For work on optical character recognition of handwritten numerals, there is an open database in the USA. Likewise, if we have an open database of labelled sounds for the alphabets, we can benefit as a community. There are perhaps two ways to build it. One is to pronounce each letter of the alphabet and record it. Another is to read and record an explicitly spelled-out language text. If the same Unicode text of a corpus is read out by a number of people, we may have a generalized source corpus to extract from. This may require parsing through crowdsourcing.

The challenge of doing without studio recording is another. Studios are acoustically highly controlled environments; if we want to crowdsource voice inputs, we cannot easily provide access to such environments. I think if we let people record in a reasonably good environment that ordinary people can create, we can use acoustic signal-processing methods, such as harmonic signal separation, to clean the recordings and get a quality of sound comparable to a studio recording. I am fairly confident of this approach for harmonic signals, as in vowels; I am uncertain about consonants. I learnt acoustic signal processing primarily to be able to record without a studio environment, and I have bought a pair of better microphones to reject external noise. However, I have not done much work in this direction.
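One common realization of the harmonic-separation idea is median-filtering harmonic/percussive separation (HPSS): steady harmonics form horizontal ridges in a spectrogram, transient noise forms vertical ones, and median filters along each axis tell them apart. The NumPy sketch below illustrates the idea on a toy signal; the window sizes and the test signal are my own choices, not necessarily the method the author has in mind.

```python
# Harmonic/percussive separation sketch via median filtering of a
# magnitude spectrogram (Fitzgerald-style HPSS). Illustrative only.
import numpy as np


def stft(x, win=256, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time FFT."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)


def harmonic_mask(S, k=9):
    """Median-filter along time (harmonic estimate) and along frequency
    (percussive estimate); keep bins where the harmonic estimate wins."""
    pad = k // 2
    H = np.zeros_like(S)
    P = np.zeros_like(S)
    St = np.pad(S, ((0, 0), (pad, pad)), mode="edge")
    Sf = np.pad(S, ((pad, pad), (0, 0)), mode="edge")
    for t in range(S.shape[1]):
        H[:, t] = np.median(St[:, t:t + k], axis=1)
    for f in range(S.shape[0]):
        P[f, :] = np.median(Sf[f:f + k, :], axis=0)
    return H >= P  # True where the bin looks harmonic (vowel-like)


# Toy input: a steady 440 Hz tone (vowel-like) plus a click (transient).
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
x[4000:4010] += 5.0  # short broadband burst

S = stft(x)
mask = harmonic_mask(S)  # apply to S to keep the harmonic part
```

Multiplying `S` by the mask (and resynthesizing with the original phases) would keep the tonal content and suppress the click; consonants, being transient themselves, are exactly where this simple scheme struggles, as the author suspects.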

My approach is to synthesize sound from models and not really depend on any single individual's recording. Most current TTS systems, such as the one from IIT Madras, offer just one voice. The best I have seen is the one from IITM; the efforts from other sources suffer from a warbling effect, as if an old person were speaking.
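Model-based synthesis of the kind described here is classically illustrated by the source-filter model: a glottal impulse train filtered through resonators placed at formant frequencies. The sketch below is a toy illustration; the formant values (roughly 700 Hz and 1200 Hz for an /a/-like vowel) and bandwidths are textbook approximations, not measurements of any speaker.

```python
# Source-filter vowel synthesis sketch: an impulse train (glottal source)
# shaped by two-pole resonators at approximate formant frequencies.
import numpy as np


def resonator(x, freq, bw, sr):
    """Second-order IIR resonator (a digital formant filter)."""
    r = np.exp(-np.pi * bw / sr)
    theta = 2 * np.pi * freq / sr
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        # Negative indices at n < 2 read still-zero tail entries, so the
        # filter effectively starts from rest.
        y[n] = x[n] - a1 * y[n - 1] - a2 * y[n - 2]
    return y


def vowel(f0=120, dur=0.5, sr=8000):
    """Synthesize an /a/-like vowel from a model, with no recording at all."""
    n = int(dur * sr)
    src = np.zeros(n)
    src[::sr // f0] = 1.0  # impulse train at the pitch period
    out = resonator(src, 700, 130, sr) + resonator(src, 1200, 130, sr)
    return out / np.max(np.abs(out))


x = vowel()
```

Changing `f0` over time gives intonation, and changing the formant targets gives different vowels or voices, which is precisely the appeal of not depending on one person's recording.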

I am going abroad for the next three months, so my responses may not be very quick, and I will not have access to my usual resources to check and answer any specific query. As a 70-year-old with enthusiasm for this field, but otherwise not attached to any organization, I do not expect to go far all alone. Hence my objective is to support aspiring research teams with discussions, tangential thoughts, and, where it aligns with my goals, some collaborative work.

I will be happy to continue the dialogue if there are people with similar interests. 

My current hobby interest is TTS, later STT, then OCR, all related to Tamil in the first phase. If there are fields where adequate free resources of good quality are available, I may shuffle the priority; if the collaboration potential is good, I may also shuffle the priority.

If I think there is a decent TTS from IIT Madras, why am I pursuing TTS? So many have climbed Everest; still more people attempt to climb it. So many run the 100 metres too. It is also a thrill to replicate what has been done; that is like a medal. Also, as I said, I am looking at a different path compared to what is done for English, with the premise that Indic languages do not have to follow that same path.

Looking forward to finding individuals to continue this sort of conversation. 

with best regards to the group,
Sowmyan



Soham Chatterjee

Mar 27, 2019, 5:26:12 AM3/27/19
to indicnlp
Hey Sowmyan,

Your ideas and approaches are very interesting. Selva is working on TTS and STT for Tamil, and we also recently received OCR data. If you would like to help us out and try your approaches, do let us know. I am really keen to see the results you get!

vanangamudi

Mar 27, 2019, 5:32:54 AM3/27/19
to indicnlp
We are excited to see someone of your age willing to guide the younger community in trying to solve interesting problems like TTS/STT. A few of us have a good amount of experience with training neural networks and are just getting started with speech corpora. Our team currently has speech data in four languages, including English. I am assuming you live in Chennai; it would be extremely helpful if we could have a brainstorming session in person.

Ravi Annaswamy

Mar 28, 2019, 8:56:57 AM3/28/19
to indicnlp


On Wednesday, March 27, 2019 at 3:27:38 AM UTC-4, Sowmyan Tirumurti wrote:
> Soham, thank you.
>
> In some domains, I feel a crowd can find solutions more effectively than a university or institution.

Well said, sir. In olden days, individuals did not have access to the data, compute, know-how, and free time to start projects on their own. Now, thanks to open source and improvements in each of these, it is thinkable.

All great work begins with a committed individual's relentless efforts. When such individuals team up without ego, this gets magnified immensely.

 

> I have an interest in exploring this. I learnt ML for this, and audio signal processing as well. However, I have not focussed, and have continued to work as an individual. I have many thoughts. Perhaps a group like this can support discussions and a path to finding a solution. With this view, I begin sharing my views.
>
> It has been my opinion that Indian languages are different from English for speech processing. They can be easier and simpler, because Indian languages are more phonetic. While English speech analysis seeks to capture thousands of words to extract syllables, the Indian alphabets themselves appear to be a comprehensive set of syllables. The other thing I find is that in English they seek to extract pairs of syllables, because some approaches to speech synthesis work by concatenating syllables. Since syllable boundaries are difficult to find and join, they cut a syllable in the middle and join it to a complementary cut in the adjoining syllable of the pair. Even in this process there are multiple variations, so they need statistical techniques to match the pairs. Since the pairs of syllables proliferate to a large count, recording and analysis have additional problems: the voice actors get tired and the voice quality changes, so they need to record multiple times over multiple days, and in a studio environment. Considering the cost and effort, these data are perhaps held closely rather than shared.
>
> I am yet to learn the neural-network approach to speech analysis/synthesis/processing, but my hunch is that we may be able to work with characters rather than with syllable pairs. While vowels are relatively steady-state sounds, consonants are fleeting and difficult to capture. For want of an adequate variety of voice data to find common patterns, I have not explored consonants as much as I have explored vowels.


Until 2015, state-of-the-art systems for pattern recognition (image recognition, speech recognition, translation, speech synthesis, OCR) relied on enormous amounts of human-engineered features and human logic programmed into the system. Deep learning systems are instead trained end to end, from raw data to expected output: they learn not only the classification but also the appropriate feature engineering. This may seem counterintuitive in one sense (how can a system learn features?), and discouraging for experts who have spent decades building such systems (has all my work gone to waste?), but it is mind-boggling, and it works.

As an example, in neural OCR you give an entire line of scanned text and its corresponding transcribed Unicode as a pair, and the system learns not only character recognition but also font recognition, character segmentation, and word segmentation. Google's Tesseract OCR is a full, trainable system with these pipelines already coded.

As another example, neural speech synthesis (the state of the art is Tacotron, from Google) can take pairs of typed text and corresponding utterances and learn an end-to-end network that internally acquires all the required modules: parsing the text, identifying characters or letters as needed, syllable mapping, acoustic translation, and adjusting the synthesized speech for smoothness and believability. Human-engineered systems needed each module designed, tied together, and fine-tuned separately.

Traditional translation techniques required hand-coded dictionaries, grammars, translation rules, and exceptions. The state-of-the-art seq2seq with attention, or the Transformer, needs no such parts; it is end to end, and it automatically learns vocabulary, meanings, grammar, rules, and exceptions by adjusting its own performance against a large corpus.
 
> For work on optical character recognition of handwritten numerals, there is an open database in the USA. Likewise, if we have an open database of labelled sounds for the alphabets, we can benefit as a community. There are perhaps two ways to build it. One is to pronounce each letter of the alphabet and record it. Another is to read and record an explicitly spelled-out language text. If the same Unicode text of a corpus is read out by a number of people, we may have a generalized source corpus to extract from. This may require parsing through crowdsourcing.
>
> The challenge of doing without studio recording is another. Studios are acoustically highly controlled environments; if we want to crowdsource voice inputs, we cannot easily provide access to such environments. I think if we let people record in a reasonably good environment that ordinary people can create, we can use acoustic signal-processing methods, such as harmonic signal separation, to clean the recordings and get a quality of sound comparable to a studio recording. I am fairly confident of this approach for harmonic signals, as in vowels; I am uncertain about consonants. I learnt acoustic signal processing primarily to be able to record without a studio environment, and I have bought a pair of better microphones to reject external noise. However, I have not done much work in this direction.
>
> My approach is to synthesize sound from models and not really depend on any single individual's recording. Most current TTS systems, such as the one from IIT Madras, offer just one voice. The best I have seen is the one from IITM; the efforts from other sources suffer from a warbling effect, as if an old person were speaking.


One easy way to get data for training TTS and STT is probably to use YouTube speeches by various types of people. We can extract YouTube lectures (formal from Vairamuthu, casual from, say, Suki Sivam, or colloquial from Tenkacchi) into MP3. Then, as we play them, we can reroute them to Google's free web STT or Google Docs voice typing and get initial transcripts. I was able to use voice typing to transcribe a five-page essay by the writer Sujatha by speaking it into Voice Typing in about an hour.

 
> I am going abroad for the next three months, so my responses may not be very quick, and I will not have access to my usual resources to check and answer any specific query. As a 70-year-old with enthusiasm for this field, but otherwise not attached to any organization, I do not expect to go far all alone. Hence my objective is to support aspiring research teams with discussions, tangential thoughts, and, where it aligns with my goals, some collaborative work.
>
> I will be happy to continue the dialogue if there are people with similar interests.
>
> My current hobby interest is TTS, later STT, then OCR, all related to Tamil in the first phase. If there are fields where adequate free resources of good quality are available, I may shuffle the priority; if the collaboration potential is good, I may also shuffle the priority.
>
> If I think there is a decent TTS from IIT Madras, why am I pursuing TTS? So many have climbed Everest; still more people attempt to climb it. So many run the 100 metres too. It is also a thrill to replicate what has been done; that is like a medal. Also, as I said, I am looking at a different path compared to what is done for English, with the premise that Indic languages do not have to follow that same path.

This is the best highlight; I totally agree. Only by trying to build from scratch do we begin to understand anything. Rebuilding not only improves systems; it improves our own abilities and gives us ideas for new applications. In modern ML it is easy to get discouraged by the complexity and performance of modern systems, but one should always try to start from scratch and code upwards.

However, at some point, AFTER trying oneself, one knows how to incorporate other people's building blocks.

If one starts to build a translation system from scratch, one cannot get very far, but only that experience gives the perspective on, and appreciation of, everything involved, and the courage to use other people's tools in our own pipelines, on our own agenda, without being totally dependent or carried away.

I would like to give a few examples:
1. Vanangamudi's crawlers are simple, straightforward, and self-contained because they are coded from scratch.
2. The word2vec browsing app the team has put together is very creative (click on a word, see its friends and enemies).
3. TShrinivasan has a wikibook OCR module that cleverly takes a PDF, splits it into pages, splits the pages into columns, sends them to Google Docs, gets the Tamil OCR output, and updates that text on the wiki! This is a bottom-up application done well. Of course, one could go on to make it functional, modular, etc., but I was amazed by these innovations.
 

Updates from my side:
1. I was able to write a wikidump processor to extract all Tamil Wikipedia articles.
2. I created a SentencePiece model for automatic Tamil tokenization.
3. I built a ULMFiT Tamil LM with a lowest perplexity of 37.
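Ravi's wikidump processor is not shown in the thread; purely as a hedged sketch of what such a processor involves, the stdlib-only snippet below streams a pages-articles XML dump and yields (title, wikitext) pairs. All names here are invented for illustration; real dumps are bz2-compressed, and the raw wikitext still needs markup stripping.

```python
# Minimal Wikipedia-dump text extractor sketch (stdlib only).
import xml.etree.ElementTree as ET
from io import StringIO


def iter_pages(xml_file):
    """Yield (title, text) for each <page> element, streaming the XML
    so that very large dumps never sit fully in memory."""
    title, text = None, None
    for event, elem in ET.iterparse(xml_file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace if present
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            elem.clear()  # free the finished element

# Tiny synthetic dump to demonstrate the interface.
sample = """<mediawiki>
  <page><title>தமிழ்</title><revision><text>தமிழ் ஒரு மொழி.</text></revision></page>
  <page><title>Chennai</title><revision><text>City article.</text></revision></page>
</mediawiki>"""
pages = list(iter_pages(StringIO(sample)))
```

For a real dump one would wrap the file in `bz2.open(...)` and pipe the extracted text into the tokenizer step.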

I will try to make some time this weekend to create two versions of this on my Git:
1. An end-to-end single notebook that can be used for any language.
2. Three modules: wikita-extractor, tam-sp-tokenizer, tam-lm.

Whoever else wants to do ULMFiT, please continue; we might get alternate approaches, but we can start with one baseline.
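The tokenizer mentioned above, SentencePiece, is its own library (with BPE and unigram-LM modes over raw text). To illustrate the subword idea behind it, here is a toy byte-pair-encoding learner; this is a teaching sketch with invented names, not the actual sentencepiece API.

```python
# Toy byte-pair-encoding (BPE) learner, the idea behind SentencePiece-style
# subword tokenization: repeatedly merge the most frequent adjacent pair.
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn merge rules from a {word: frequency} dict."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    syms = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(syms):
            if i + 1 < len(syms) and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return syms
```

Unseen words are thus split into learned subword units, which is what makes the approach attractive for a morphologically rich language like Tamil.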

I also have a few notebooks on Tamil OCR. I will try to organize them, make them useful and usable, and publish them in the next few weeks.

Thanks
Ravi


Sowmyan Tirumurti

Mar 28, 2019, 12:32:15 PM3/28/19
to Ravi Annaswamy, indicnlp
Thank you, Ravi. You added to my understanding of work already done by others. I wonder how many of these are easy for a layperson, or even a professional, to access and use. That has always left me with the impression that some of these domains remain undeveloped.

One benefit I would like from this group is a state-of-the-art white paper on each application: its current status and trends, tools, and the open groups working on it. This could help reduce the fragmentation of the limited resources available for open-source development.

Thank you. 




Sowmyan Tirumurti

Mar 28, 2019, 12:55:14 PM3/28/19
to indi...@googlegroups.com, selva.d...@gmail.com
Dear Selva,

Yes, I am based in Chennai. I moved here just last year; one motivation was to be able to work on Tamil.

This weekend (Sunday) I am going to Cuddalore and will return on Monday evening. Next weekend (early Saturday morning) I will be travelling abroad; my return will be after three months.

This does not mean we cannot meet face to face. I live in Prestige Bella Vista at Iyyappanthangal. My cell number is <xxx>. As a retired person, I find it easier to flex my time.

Regards,
Sowmyan



vanangamudi

Mar 30, 2019, 2:56:01 AM3/30/19
to indicnlp
Fantastic. I will familiarize myself with the existing literature on TTS/STT. I am in Chennai; if possible, we can meet on any weekday before your trip. I shall contact you on your mobile for further details.

P.S. Sorry if I am out of line, but please do not share your mobile number on public forums.

vanangamudi

Mar 30, 2019, 3:00:17 AM3/30/19
to indicnlp
We maintain a list of resources and repositories here[1]. It is still quite incomplete, and we are trying to gather information on the activities of ML people who work on indigenous languages.



vanangamudi

Mar 30, 2019, 3:03:55 AM3/30/19
to indicnlp
I want to add just one more point: the traditional methods are still very valuable. We need a more principled approach to all ML problems, not just throwing compute power at them.

Thank you, Ravi Annaswamy, for the rich and dense information. It is exciting to see people engaged in ML from such a wide spectrum.

Ananda Seelan

Apr 15, 2019, 1:34:43 AM4/15/19
to indicnlp
Hello,
Currently I'm looking into state-of-the-art STT systems, and I have a bit of a deep learning background. If anyone has a repo out there for Tamil STT, I can try contributing. Any pointers, please?


Ananda Seelan

Apr 15, 2019, 1:51:01 AM4/15/19
to indicnlp
I'm aware of a couple of regional-language speech-to-text datasets, from Microsoft and IIIT Hyderabad. If no one has started working on these yet, I can come up with a plan.


vanangamudi

Apr 15, 2019, 5:28:18 PM4/15/19
to indicnlp
There is also a corpus available from IIT Madras; please raise a request at IndicTTS[1].

Can you please share the dataset from Microsoft, if it is not an issue?

Ananda Seelan

May 8, 2019, 8:16:43 AM5/8/19
to indicnlp
Apologies for the delay; I somehow missed this message.

Here are the links for the Microsoft data:
"Microsoft releases Speech Corpus for three Indian languages to aid researchers" - https://news.microsoft.com/en-in/microsoft-releases-speech-corpus-for-three-indian-languages-to-aid-researchers/
"Microsoft Speech Corpus (Indian languages)" - https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e
I haven't looked into the dataset myself though.

Here's another resource that I just came across in OpenSLR resources.
" Crowdsourced high-quality Tamil multi-speaker speech data set. " - http://www.openslr.org/65/

Cheers