Word usage in Tamil - A usage dictionary and a notebook to view it


Ravi Annaswamy

Apr 12, 2019, 12:01:51 AM
to indicnlp
Friends,

    By analyzing 30,000 magazine articles, I found 900,000 unique words in Tamil (it is a morphologically rich language, so each inflected form counts as a separate word).

    Here is a Dropbox link to the Collection object and a Colab notebook to query it.
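
A minimal sketch of how a frequency collection like this could be built and reloaded. It assumes the Collection object is a `collections.Counter` saved with `pickle` and that words are split on whitespace; the two sample sentences and the file name `word_stream.pkl` are placeholders, not the shared files.

```python
import os
import pickle
import tempfile
from collections import Counter

def count_words(articles):
    """Count whitespace-separated words across an iterable of article texts."""
    counts = Counter()
    for text in articles:
        counts.update(text.split())
    return counts

# Two tiny stand-in "articles"; the real corpus has about 30,000.
articles = ['அவன் நாளை வருவான்', 'அவள் நாளை வருவாள் அவன் வருவான்']
word_stream = count_words(articles)

# Save and reload the counter (placeholder file name in the temp directory).
path = os.path.join(tempfile.gettempdir(), 'word_stream.pkl')
with open(path, 'wb') as f:
    pickle.dump(word_stream, f)
with open(path, 'rb') as f:
    word_stream = pickle.load(f)

print(len(word_stream), word_stream.most_common(2))
```

Once loaded, `word_stream.most_common()` yields (word, count) pairs from most to least frequent, which is all the queries in this message need.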



    Tamil encodes a great deal in its verbs: every finite verb shows gender, number, person, tense and mode, so you can write simple queries and get lists.

     For example, what does a man do? (750 things; a few are shown here, the rest are in the notebook.)

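# word_stream is the word-frequency collection loaded from the shared pkl file
ending = 'ுவான்'  # masculine verb ending
i = 0
for w, c in word_stream.most_common():
    # keep only words that end with the chosen suffix
    if w.endswith(ending):
        i += 1
        print(i, w, c)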

1 வருவான் 350
2 பேசுவான் 95
3 பண்ணுவான் 86
4 போடுவான் 69
5 தருவான் 63
6 சொல்லுவான் 49
7 விடுவான் 47
8 காட்டுவான் 42
9 போய்விடுவான் 33
10 வந்துவிடுவான் 32
11 எழுதுவான் 26
12 பாடுவான் 25
13 ஓடுவான் 22
14 போயிடுவான் 22
15 வாங்குவான் 20
16 போயிருவான் 19
17 சாப்பிடுவான் 19
18 கூப்பிடுவான் 19
19 வந்துடுவான் 16
20 விளையாடுவான் 16


What does a woman do?

ending = 'ுவாள்'  # feminine verb ending
i = 0
for w, c in word_stream.most_common():
    # keep only words that end with the chosen suffix
    if w.endswith(ending):
        i += 1
        print(i, w, c)

1 வருவாள் 110
2 பேசுவாள் 41
3 விடுவாள் 34
4 போய்விடுவாள் 31
5 அழுவாள் 31
6 வந்துவிடுவாள் 28
7 தருவாள் 26
8 பாடுவாள் 23
9 போடுவாள் 17
10 சொல்லுவாள் 13
11 புலம்புவாள் 12
12 அருவாள் 12
13 எழுதுவாள் 12
14 சாப்பிடுவாள் 11
15 தொடங்குவாள் 10
16 காட்டுவாள் 9
17 திட்டுவாள் 9


Here is a query that shows what a man usually does that a woman does not :)


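ஆண்விகுதி = 'ுவான்'   # masculine ending
பெண்விகுதி = 'ுவாள்'  # feminine ending
# Negative slices test the end of the word and strip the ending from the stem,
# so words of any length are compared.
print([w for w, c in word_stream.most_common()
       if w[-len(ஆண்விகுதி):] == ஆண்விகுதி and
       w[:-len(ஆண்விகுதி)] + பெண்விகுதி not in word_stream])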


['போயிருவான்', 'இறங்குவான்', 'வித்துவான்', 'போய்டுவான்', 'தாண்டுவான்', 'மாட்டுவான்', 'துப்புவான்', 'திருடுவான்', 'மயக்குவான்', 'மாறிடுவான்', 'அடங்குவான்', 'ஈடுபடுவான்', 'மாற்றுவான்', 'நீங்குவான்', 'துழாவுவான்', 'தாக்குவான்', 'அலட்டுவான்', 'சிக்குவான்', 'விட்ருவான்', 'சாற்றுவான்', 'தேறிடுவான்', 'சிந்துவான்', 'இயக்குவான்', 'கலக்குவான்', 'மாறிருவான்', 'தீண்டுவான்', 'கலங்குவான்', 'கையாளுவான்', 'தின்றுவான்', 'மீட்டுவான்', 'தொகுறுவான்', 'மாத்துவான்', 'பரப்புவான்', 'அரற்றுவான்', 'தொங்குவான்', 'டுவிடுவான்', 'நோங்குவான்', 'ஆடிவருவான்', 'கூப்புவான்', 'கொல்லுவான்', 'பேசிடுவான்', 'போட்ருவான்', 'மூழ்குவான்', 'தூண்டுவான்', 'வாச்சுவான்', 'தீட்டுவான்', 'நெம்புவான்', 'வணங்குவான்', 'டப்படுவான்', 'அசத்துவான்', 'நோக்குவான்', 'ஜொள்ளுவான்', 'வுட்ருவான்', 'அரசாளுவான்', 'செருமுவான்', 'மயங்குவான்', 'அகற்றுவான்', 'திருகுவான்', 'வாகிருவான்', 'கைவிடுவான்', 'சிச்சுவான்', 'அலறிருவான்', 'உறவாடுவான்']
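
The same comparison can be wrapped in a small helper so that it runs in either direction (for instance, what a woman does that a man does not). This is only a sketch: `only_with_ending` is a hypothetical name, and the toy counter stands in for the real word_stream.

```python
from collections import Counter

def only_with_ending(counts, ending_a, ending_b):
    """Words ending in ending_a whose ending_b counterpart never occurs."""
    return [w for w, _ in counts.most_common()
            if w.endswith(ending_a)
            and w[:-len(ending_a)] + ending_b not in counts]

# Toy data, not real corpus counts.
toy = Counter({'வருவான்': 3, 'வருவாள்': 2, 'ஓடுவான்': 1, 'அழுவாள்': 1})

print(only_with_ending(toy, 'ுவான்', 'ுவாள்'))  # ['ஓடுவான்']
print(only_with_ending(toy, 'ுவாள்', 'ுவான்'))  # ['அழுவாள்']
```

A pure suffix match also picks up nouns that happen to end the same way, so the lists still need a manual look.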


Thanks
Ravi


Muru Selvakumar

Apr 12, 2019, 1:53:13 AM
to Ravi Annaswamy, indicnlp
900k words from 30k articles? We need a principled tokenization method for Tamil.

BERT, the SOTA model for many language tasks across many languages, uses something called WordPiece tokenization, which works like byte pair encoding except that the vocabulary is chosen to maximize likelihood in language modelling.

Standard news articles do follow Tamil grammatical rules for combining words into compound words, so we could use tokenization based on a morphological analyzer. At the same time, dialect variations do not follow those rules, and how to account for such cases is still an open question.
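
To make the byte pair encoding idea concrete, here is a toy sketch of merge learning. It is plain frequency-based BPE, not WordPiece's likelihood criterion and not BERT's actual tokenizer; `learn_bpe` is a hypothetical name, and the four counts are taken from the lists earlier in this thread.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: count} mapping.

    Returns the list of merged pairs and the final segmentation of each word.
    """
    # Start with every word as a tuple of single code points.
    vocab = {tuple(w): c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, c in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        new_vocab = {}
        for symbols, c in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges, vocab

counts = {'வருவான்': 350, 'பேசுவான்': 95, 'வருவாள்': 110, 'பேசுவாள்': 41}
merges, segmented = learn_bpe(counts, 4)
for pieces in segmented:
    print(' '.join(pieces))
```

On data like this the frequent shared endings tend to be merged first, which is why a small subword vocabulary can cover many inflected forms.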

--
You received this message because you are subscribed to the Google Groups "indicnlp" group.
To unsubscribe from this group and stop receiving emails from it, send an email to indicnlp+u...@googlegroups.com.
To post to this group, send email to indi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/indicnlp/ad8858f0-ef84-40ce-ba8a-5e6539d33139%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shrinivasan T

Apr 12, 2019, 3:51:11 AM
to indi...@googlegroups.com
Great effort, Ravi.

Along with the pkl file, please share the word list too.
It will help with further research work.

On Fri, 12 Apr 2019 at 9:31 AM, Ravi Annaswamy <ravi.an...@gmail.com> wrote:


--
Regards,
T.Shrinivasan


My Life with GNU/Linux : http://goinggnu.wordpress.com
Free E-Magazine on Free Open Source Software in Tamil : http://kaniyam.com

Get Free Tamil Ebooks for Android, iOS, Kindle, Computer :     http://FreeTamilEbooks.com

Ravi Annaswamy

Apr 12, 2019, 6:28:22 AM
to indicnlp
Selva, I am amazed by how concisely you nailed down the issues (900k words from 30k articles, and news versus dialect).

The 8K SP model (SP for Google sentencepiece) is built from a Wikipedia dump of 1.2 lakh articles, so it reflects primarily 'regularized' written Tamil.

The raw list of 9 lakh words and its frequency dictionary include dialect and colloquial usage, since they come from magazine stories and interviews as well as proper essays and even poetry.
I will build another language model on this corpus and we can see how it handles dialect distortions of sounds and words.

My guess is that dialect distortions occur in the inflections, so they *are* regular too! So one can build a dialect -> written-form text translator using seq2seq. We will try.

**

Shrini, thank you for your extremely hard work on Tamil Wikipedia. What a resource!


Ravi




Ravi Annaswamy

Apr 12, 2019, 6:31:56 AM
to indicnlp
1. The purpose of this pull was to get “raw” words (verbs in full form, etc.), so I did not use a proper tokenizer, just a whitespace splitter.

2. I have tried a sentencepiece tokenizer with good success in both language modeling and translation tasks. Surprisingly, a sentencepiece model with a vocabulary of only 8,000 to 10,000 pieces works really well. This is because Tamil morphology, like that of many Indian languages, follows extremely regular rules.

In the next two days I will share the tokenizer and also the core word-parts list as CSV. I will also create a parse for the 900k words or a subset.

We can improve it further.

The 8k sentencepiece model, its vocab, and the resulting low-perplexity language model are available in this raw notebook project
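
For readers who want to try something similar, here is a rough sketch using the `sentencepiece` Python package. It is not the notebook's code: the corpus path, the model prefix `ta_sp8k`, the unigram model type and the character coverage of 1.0 are all assumptions made for illustration.

```python
def train_tamil_sp(corpus_path, model_prefix='ta_sp8k', vocab_size=8000):
    """Train a sentencepiece model on a plain-text file, one sentence per line.

    Returns the path of the model file that sentencepiece writes.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece
    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,
        vocab_size=vocab_size,
        model_type='unigram',    # assumption; 'bpe' is the other common choice
        character_coverage=1.0,  # assumption: keep every Tamil character
    )
    return model_prefix + '.model'

def tokenize(model_file, text):
    """Split text into subword pieces with a trained model."""
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor(model_file=model_file)
    return sp.encode(text, out_type=str)

# Example use (needs a real corpus file, so it is left commented out):
# model = train_tamil_sp('tawiki.txt')
# print(tokenize(model, 'பட்டணத்துக்கு வந்தோம்.'))
```

The corpus has to be large enough to yield the requested number of pieces, otherwise training stops with an error.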

Some more notes on the sentencepiece Tamil tokenization with examples:

1. TOKENIZER:

Tokenization splits a word into its parts so that the meaning of the word can be derived from the parts.
New words can then be composed and understood, which adds to the richness of the language.

PHONETIC MORPHOLOGY LEADS TO EASE OF TOKENIZATION AND AUTOMATIC INFERENCE OF GOOD TOKENIZERS.

As concrete evidence, here is an automatically created tokenizer based on information-theoretic inference of word roots. Basically, the system learns on its own to split Tamil words into their parts.

The system reads the text of all 1,20,000 Tamil Wikipedia articles and identifies word roots, prefixes and suffixes that can ‘explain’ all the millions of words in that text using a small core vocabulary.

Because of the phonetic encoding, it is able to recognize the proper roots and extensions without human programming.
With a limit of 10,000 parts, it identifies Tamil roots and extensions like this:

▁வர வ ிருக்கிறது ▁என்ற ▁அச்சத்தை ▁தூண்டிவிட்ட ுள்ளனர் .

▁பட்டணத்த ுக்கு ▁வந்த ோம் .

▁ஒட்டு ண்ண ி த்தனம்

I am amazed by how the grammar (and regularity) of Tamil phonology gave the machine the clue that the first word has a root, a joining letter and a postfix.

▁வர வ ிருக்கிறது

Of course, the postfix can be further broken into verb + tense + person + number markers!

Similarly, the correspondence is easier to learn even for a complex noun, and machine translation systems can learn from just a few examples that "ism" is likely to mean த்தனம்.

▁para s it ism
▁ஒட்டு ண்ண ி த்தனம்


A core vocabulary limited to 10,000 parts splits the following sentence like this:
10000: ▁நடந்த ால் , ▁முதல் ▁ஆர்ப்பாட்ட - எதிர்ப்பு ▁ஞாயிறன்று ▁நடப்ப தாக ▁இருக்கிறது

Ideally, Tamil grammar would split it like this:
▁நட ந்த ால் , ▁முதல் ▁ஆர் ப்பாட்ட - எதிர் ப்பு ▁ ஞாயிற ன்று ▁நட ப்ப தாக ▁இரு க்கிற து

This separates the verb ▁நட, tense marker ந்த, conditional ால், verb எதிர், verbal-noun marker ப்பு, the compound ஞாயிற ன்று, postposition தாக, tense marker க்கிற, and person and number marker து, ideally just through phonetic clues.

We should work on improving sentencepiece algorithms to produce parses like the one above, with Tamil parsing as the gold standard and the most logical test bed.
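
One simple way to measure how close a tokenizer comes to such a gold parse is to compare where the two segmentations cut the text. The sketch below is an assumption about how that could be scored (boundary precision and recall); it is not part of sentencepiece, and the function names are made up for illustration.

```python
def boundaries(pieces):
    """Return the character offsets at which a segmentation cuts the text."""
    cuts, pos = set(), 0
    for piece in pieces:
        pos += len(piece)
        cuts.add(pos)
    cuts.discard(pos)  # the end of the text is not a real cut
    return cuts

def boundary_scores(predicted, gold):
    """Precision and recall of predicted cut points against the gold ones."""
    if ''.join(predicted) != ''.join(gold):
        raise ValueError('both segmentations must spell the same text')
    p, g = boundaries(predicted), boundaries(gold)
    hits = len(p & g)
    precision = hits / len(p) if p else 1.0
    recall = hits / len(g) if g else 1.0
    return precision, recall

# First word of the example sentence: machine split versus ideal split.
predicted = ['▁நடந்த', 'ால்']
gold = ['▁நட', 'ந்த', 'ால்']
print(boundary_scores(predicted, gold))  # (1.0, 0.5)
```

Here the machine's only cut is correct (precision 1.0), but it misses the cut between the verb root and the tense marker (recall 0.5).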

Regards 
Ravi

Ravi Annaswamy

Apr 12, 2019, 6:41:57 AM
to indicnlp
I will share a cleaned-up word list and token list in a day or two, after using 2 or 3 large corpora.

Shrinivasan T

Apr 12, 2019, 6:56:35 AM
to Ravi Annaswamy, indicnlp
Thanks Ravi,

I request you to write an article in Tamil on this, to be published on kaniyam.com

நீச்சல் காரன்

Apr 12, 2019, 7:10:28 AM
to Ravi Annaswamy, indicnlp
Good effort. Extracting the words you need from magazines that contain many errors, or from foreign magazines that do not follow Tamil Nadu usage, is extra work. You could stick to consistently good publications, e.g. தினமணி, தமிழ் இந்து, தினமலர், விகடன்.

Take only words without sandhi, or remove the sandhi at the program level itself.
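
As a rough illustration of removing sandhi at the program level, the sketch below handles only one pattern: a hard consonant (க், ச், த், ப்) added at the end of a word when the next word begins with the same consonant, as in அதிகப் பிழைவிடும். The rule and the name `strip_sandhi` are assumptions made for this example; a real cleaner would need more rules and an exception list.

```python
VIRAMA = '\u0bcd'         # Tamil pulli, the dot that marks a bare consonant
HARD = set('கசதப')        # consonants that are doubled across word boundaries

def strip_sandhi(tokens):
    """Drop a word-final க்/ச்/த்/ப் when the next word starts with that consonant."""
    cleaned = []
    for cur, nxt in zip(tokens, tokens[1:] + ['']):
        if (len(cur) > 2 and cur[-1] == VIRAMA and cur[-2] in HARD
                and nxt[:1] == cur[-2]):
            cur = cur[:-2]
        cleaned.append(cur)
    return cleaned

print(strip_sandhi('அதிகப் பிழைவிடும் இதழ்'.split()))
print(strip_sandhi('எடுத்துக் கொள்ளவும்'.split()))
```

Run over the token stream before counting, this merges forms such as எடுத்துக் and எடுத்து into one entry.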


Regards,
நீச்சல்காரன்.



Ravi Annaswamy

Apr 12, 2019, 8:19:22 AM
to நீச்சல் காரன், indicnlp
Thank you, yes, that makes sense.
And thanks for your excellent service and web apps.

Sent from my iPhone

Ravi Annaswamy

Apr 12, 2019, 8:24:28 AM
to Shrinivasan T, indicnlp
Sure, I will do that. Thanks.

Sent from my iPhone

Ravi Annaswamy

Apr 12, 2019, 8:33:24 AM
to Shrinivasan T, indicnlp
First time I am seeing this page. Truly awesome resource for Tamil Computing 

thanks 

Sent from my iPhone

On Apr 12, 2019, at 6:55 AM, Shrinivasan T <tshrin...@gmail.com> wrote:

vanangamudi

Apr 12, 2019, 8:57:53 AM
to indicnlp
Hi Ravi,

Yes. You can find the vocabulary built from the news corpus here[1]. It doesn't include frequency counts, but a word must occur at least 100 times in the corpus to be in this file.
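
A frequency cutoff like this is a one-line filter over a word counter. The sketch below is only an illustration with a hypothetical helper name; the three counts are borrowed from the lists earlier in the thread, not from the news corpus.

```python
from collections import Counter

def frequent_vocab(counts, min_count=100):
    """Return the sorted words that occur at least min_count times."""
    return sorted(w for w, c in counts.items() if c >= min_count)

counts = Counter({'வருவான்': 350, 'வருவாள்': 110, 'பேசுவான்': 95})
print(frequent_vocab(counts))
```

Raising or lowering `min_count` trades vocabulary size against coverage of rarer inflected forms.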

I did not include the Wikipedia articles, so as to keep the language fully news-based. The language styles of Wikipedia and newspapers are evidently different.

I am also working on building a language model with books from Freetamilebooks[2] since books maintain very long range context.




Ravi Annaswamy

Apr 12, 2019, 9:37:09 AM
to vanangamudi, indicnlp
Awesome 👏 thanks 



Sent from my iPhone

Ravi Annaswamy

Apr 12, 2019, 1:38:23 PM
to vanangamudi, indicnlp
“Very long range context” is the key my friend 
Well said


Sent from my iPhone

On Apr 12, 2019, at 8:57 AM, vanangamudi <selva.d...@gmail.com> wrote:


வேந்தன் அரசு

Apr 12, 2019, 4:50:24 PM
to Ravi Annaswamy, indicnlp
"வித்துவான்"
The one who sows is a வித்துவான் (action), and the scholar is also a வித்துவான் (noun).

On Thu, 11 Apr 2019 at 9:01 PM, Ravi Annaswamy <ravi.an...@gmail.com> wrote:


--
வேந்தன் அரசு
Valluvam is my religion

Ravi Annaswamy

Apr 12, 2019, 5:59:25 PM
to வேந்தன் அரசு, indicnlp
Thank you for the information, வேந்தன் sir.

Sent from my iPhone