Word usage in Tamil - A usage dictionary and a notebook to view it

Ravi Annaswamy

Apr 12, 2019, 12:01:51 AM

to indicnlp

Friends,

By analyzing 30,000 magazine articles, I found 900,000 unique word forms in Tamil (the count is this high because Tamil is a morphologically rich language).

Here is a Dropbox link to the pickled collection object, and a Colab notebook to query it.

https://www.dropbox.com/s/6zstwcv6oq2yqxw/word_stream.pkl?dl=0

https://colab.research.google.com/drive/1i7sURLiOB05PA_p3O6IhoCKaJjY4ktsE
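
For anyone who wants to query the file outside the notebook, here is a minimal loading sketch. It assumes the pickle holds a `collections.Counter` mapping each word to its count, which the `most_common()` calls below suggest but which I have not verified. The helper name `load_word_stream` and the local path are illustrative. Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.

```python
import pickle
from collections import Counter


def load_word_stream(path):
    """Load a pickled word -> count mapping and return it as a Counter."""
    with open(path, "rb") as f:
        obj = pickle.load(f)  # only safe for files from a trusted source
    # Assumption: the pickle holds a Counter, or a plain dict of counts.
    return obj if isinstance(obj, Counter) else Counter(obj)


# Usage (hypothetical local path; download word_stream.pkl first):
# word_stream = load_word_stream("word_stream.pkl")
# print(len(word_stream), word_stream.most_common(5))
```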

Tamil encodes verbs beautifully: every finite verb form shows gender, number, person, tense and mood, so you can write simple suffix queries and get lists.

For example, what does a man do? (750 things; only a few are shown here, the rest are in the notebook.)

# word_stream is the word -> count collection loaded from word_stream.pkl
ending='ுவான்'
i=0
for w,c in word_stream.most_common():
    if w[-len(ending):]==ending:
        i+=1
        print(i,w,c)

1 வருவான் 350
2 பேசுவான் 95
3 பண்ணுவான் 86
4 போடுவான் 69
5 தருவான் 63
6 சொல்லுவான் 49
7 விடுவான் 47
8 காட்டுவான் 42
9 போய்விடுவான் 33
10 வந்துவிடுவான் 32
11 எழுதுவான் 26
12 பாடுவான் 25
13 ஓடுவான் 22
14 போயிடுவான் 22
15 வாங்குவான் 20
16 போயிருவான் 19
17 சாப்பிடுவான் 19
18 கூப்பிடுவான் 19
19 வந்துடுவான் 16
20 விளையாடுவான் 16

What does a woman do?

ending='ுவாள்'
i=0
for w,c in word_stream.most_common():
    if w[-len(ending):]==ending:
        i+=1
        print(i,w,c)

1 வருவாள் 110
2 பேசுவாள் 41
3 விடுவாள் 34
4 போய்விடுவாள் 31
5 அழுவாள் 31
6 வந்துவிடுவாள் 28
7 தருவாள் 26
8 பாடுவாள் 23
9 போடுவாள் 17
10 சொல்லுவாள் 13
11 புலம்புவாள் 12
12 அருவாள் 12
13 எழுதுவாள் 12
14 சாப்பிடுவாள் 11
15 தொடங்குவாள் 10
16 காட்டுவாள் 9
17 திட்டுவாள் 9
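
The two queries above differ only in the ending, so they can be folded into one helper. This is a sketch: the name `words_with_suffix` is mine, and the toy counts are copied from the listings above to stand in for the real `word_stream`.

```python
from collections import Counter


def words_with_suffix(word_counts, suffix, limit=None):
    """Return (word, count) pairs ending in `suffix`, most frequent first."""
    hits = [(w, c) for w, c in word_counts.most_common() if w.endswith(suffix)]
    return hits if limit is None else hits[:limit]


# Toy counts copied from the listings above, standing in for word_stream.
toy = Counter({"வருவான்": 350, "பேசுவான்": 95, "வருவாள்": 110, "அழுவாள்": 31})

for rank, (w, c) in enumerate(words_with_suffix(toy, "ுவாள்"), start=1):
    print(rank, w, c)
```

The match is on the surface string only, so a noun that happens to end in the same letters will be listed along with the verbs.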

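Here is a query that lists what a man does, in this corpus, that a woman does not :)

ஆண்விகுதி = 'ுவான்'   # masculine ending
பெண்விகுதி = 'ுவாள்'  # feminine ending
# Negative slices test the end of the word whatever its length.
# The list printed below is a partial sample: it covers only the 10-code-point matches.
print([w for w,c in word_stream.most_common() 
       if w[-len(ஆண்விகுதி):]==ஆண்விகுதி and 
       w[:-len(ஆண்விகுதி)]+பெண்விகுதி not in word_stream])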

['போயிருவான்', 'இறங்குவான்', 'வித்துவான்', 'போய்டுவான்', 'தாண்டுவான்', 'மாட்டுவான்', 'துப்புவான்', 'திருடுவான்', 'மயக்குவான்', 'மாறிடுவான்', 'அடங்குவான்', 'ஈடுபடுவான்', 'மாற்றுவான்', 'நீங்குவான்', 'துழாவுவான்', 'தாக்குவான்', 'அலட்டுவான்', 'சிக்குவான்', 'விட்ருவான்', 'சாற்றுவான்', 'தேறிடுவான்', 'சிந்துவான்', 'இயக்குவான்', 'கலக்குவான்', 'மாறிருவான்', 'தீண்டுவான்', 'கலங்குவான்', 'கையாளுவான்', 'தின்றுவான்', 'மீட்டுவான்', 'தொகுறுவான்', 'மாத்துவான்', 'பரப்புவான்', 'அரற்றுவான்', 'தொங்குவான்', 'டுவிடுவான்', 'நோங்குவான்', 'ஆடிவருவான்', 'கூப்புவான்', 'கொல்லுவான்', 'பேசிடுவான்', 'போட்ருவான்', 'மூழ்குவான்', 'தூண்டுவான்', 'வாச்சுவான்', 'தீட்டுவான்', 'நெம்புவான்', 'வணங்குவான்', 'டப்படுவான்', 'அசத்துவான்', 'நோக்குவான்', 'ஜொள்ளுவான்', 'வுட்ருவான்', 'அரசாளுவான்', 'செருமுவான்', 'மயங்குவான்', 'அகற்றுவான்', 'திருகுவான்', 'வாகிருவான்', 'கைவிடுவான்', 'சிச்சுவான்', 'அலறிருவான்', 'உறவாடுவான்']
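
The same comparison can be run in either direction with a small helper. A sketch, with a hypothetical function name and toy counts invented for illustration. Keep in mind that a missing counterpart only means the form does not occur in this corpus.

```python
from collections import Counter


def only_with_first_suffix(word_counts, suffix_a, suffix_b):
    """Words ending in suffix_a whose suffix_b counterpart is absent."""
    result = []
    for w, _ in word_counts.most_common():
        if w.endswith(suffix_a):
            stem = w[: -len(suffix_a)]  # strip the suffix from the end
            if stem + suffix_b not in word_counts:
                result.append(w)
    return result


# Toy counts, invented for illustration.
toy = Counter({"வருவான்": 350, "வருவாள்": 110, "திருடுவான்": 5, "அழுவாள்": 31})
print(only_with_first_suffix(toy, "ுவான்", "ுவாள்"))  # masculine form only
print(only_with_first_suffix(toy, "ுவாள்", "ுவான்"))  # feminine form only
```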

Thanks
Ravi

Muru Selvakumar

Apr 12, 2019, 1:53:13 AM

to Ravi Annaswamy, indicnlp

900k words from 30k articles? We need principled tokenization methods for Tamil.

BERT, the SOTA model on many language tasks across many languages, uses something called WordPiece tokenization. It is similar to byte pair encoding, except that the merges are chosen so that the vocabulary maximizes the likelihood of the training data under a language model.

Standard news articles do follow Tamil grammatical rules for combining words into compound words, so we could use tokenization based on a morphological analyzer. At the same time, dialect variations do not follow those rules, and how to account for those cases is still an open question.
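
To make the byte pair encoding idea concrete, here is a toy sketch of plain BPE: it repeatedly merges the most frequent adjacent pair of symbols. This is not WordPiece (which scores merges by likelihood gain instead of raw frequency) and not what any particular library does internally. The function name and the toy counts are illustrative.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a word -> count mapping (toy, unoptimized)."""
    # Start with every word split into single code points.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


toy = {"வருவான்": 350, "பேசுவான்": 95, "வருவாள்": 110}
merges, segmented = learn_bpe(toy, 4)
print(merges)
print(segmented)
```

Because the starting symbols are code points, Tamil vowel signs begin as separate symbols, which is how pieces that start with a vowel sign can arise.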

--
You received this message because you are subscribed to the Google Groups "indicnlp" group.
To unsubscribe from this group and stop receiving emails from it, send an email to indicnlp+u...@googlegroups.com.
To post to this group, send email to indi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/indicnlp/ad8858f0-ef84-40ce-ba8a-5e6539d33139%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Shrinivasan T

Apr 12, 2019, 3:51:11 AM

to indi...@googlegroups.com

Great effort, Ravi.

Along with the pkl file, please share the word list too.

It will help further research work.


--

Regards,
T.Shrinivasan

My Life with GNU/Linux : http://goinggnu.wordpress.com
Free E-Magazine on Free Open Source Software in Tamil : http://kaniyam.com

Get Free Tamil Ebooks for Android, iOS, Kindle, Computer : http://FreeTamilEbooks.com

Ravi Annaswamy

Apr 12, 2019, 6:28:22 AM

to indicnlp

Selva, I am amazed by how concisely you nailed down the issues: (900k words from 30k articles) and (news versus dialect).

The 8K SP model (SP for Google SentencePiece) is built from a dump of 1.2 lakh (120,000) Tamil Wikipedia articles, so it primarily reflects 'regularized' written Tamil.

The 9 lakh (900,000) raw word list and frequency dictionary include dialects and colloquial usage, since they come from magazine stories and interviews as well as proper essays and even poetry.

I will build another language model on this corpus and we can see how it handles dialect distortions of sounds and words.

My guess is that dialect distortions are in the inflections, so they *are* regular too! So one can build a dialect -> written form text translator using seq2seq. We will try.

**

Shrini, thank you for your extremely hard work on Tamil Wikipedia. What a resource!

Ravi


Ravi Annaswamy

Apr 12, 2019, 6:31:56 AM

to indicnlp

1. The purpose of this pull was to get "raw" words (verbs in full form, etc.), so I did not use a proper tokenizer, just a whitespace splitter.

2. I have tried a SentencePiece tokenizer with good success in both language modeling and translation tasks. Surprisingly, a SentencePiece vocabulary of only 8,000 to 10,000 pieces works really well. This is because of the extremely regular rules of Tamil morphology, as in many Indian languages.
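
For point 1, a whitespace-split frequency count can be sketched as below. This is an illustration of the approach, not the exact script used for the 900k list; the function name and the two stand-in sentences are mine. Note that punctuation stays attached to words with this kind of split.

```python
from collections import Counter


def count_raw_words(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())  # str.split() splits on any whitespace
    return counts


articles = ["அவன் நாளை வருவான்", "அவள் நாளை வருவாள்"]  # stand-in sentences
word_stream = count_raw_words(articles)
print(word_stream.most_common(3))
```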

In the next two days I will share the tokenizer and also the core word-parts list as a CSV. I will also create a parse for the 900k words or a subset of them.

We can improve it further.

The 8k-piece SentencePiece model and vocab, and the resulting low-perplexity language model, are available in this raw notebook project:

https://github.com/ravi-annaswamy/tamil_lm_spm_fai
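
For readers who have not used SentencePiece, training a model of this size looks roughly like the sketch below. It uses the `sentencepiece` Python package; the function name, file names and settings are placeholders and not necessarily what the repository above uses.

```python
def train_tamil_sentencepiece(corpus_path, model_prefix="tamil_sp", vocab_size=8000):
    """Train a SentencePiece model on a text file with one sentence per line."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        input=corpus_path,
        model_prefix=model_prefix,   # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        character_coverage=1.0,      # keep every character seen in the corpus
    )
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")


# Usage (needs a real corpus file; the path is a placeholder):
# sp = train_tamil_sentencepiece("tamil_wiki_sentences.txt")
# print(sp.encode("பட்டணத்துக்கு வந்தோம்.", out_type=str))
```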

Some more notes on the sentencepiece Tamil tokenization with examples:

1. TOKENIZER:

Tokenization is splitting a word into its parts so that the meaning of the word can be derived from its parts.

So new words can be composed and grasped, adding to richness of language.

PHONETIC-MORPHOLOGY LEADS TO EASE OF TOKENIZATION AND AUTOMATIC INFERENCE OF GOOD TOKENIZERS.

To give concrete evidence, here is an automatically created tokenizer based on information-theoretic inference of word roots. Basically, the system learns on its own to split Tamil words into their parts.

The system reads the text of all 1,20,000 Tamil Wikipedia articles and identifies word roots, prefixes and suffixes that can 'explain' all the millions of words in that text using a small core vocabulary.

Because of the phonetic encoding, it is able to recognize the proper roots and extensions without human programming. With a limit of 10,000 parts, it identifies roots and extensions like this:

▁வர வ ிருக்கிறது ▁என்ற ▁அச்சத்தை ▁தூண்டிவிட்ட ுள்ளனர் .

▁பட்டணத்த ுக்கு ▁வந்த ோம் .

▁ஒட்டு ண்ண ி த்தனம்

I am amazed by how the grammar (and regularity) of Tamil phonology gave the machine the clue that the first word consists of a root, a joining letter and a postfix.

▁வர வ ிருக்கிறது

Of course, the postfix can be further broken into verb+tense+person+number markers!
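
A note on reading the pieces above: the "▁" character marks the start of a word, so pieces without it attach to the piece before them. A small sketch that rebuilds the text and groups pieces by word (the helper names are mine, and this handles only the plain-text case shown here):

```python
WORD_START = "▁"  # U+2581, the marker SentencePiece puts at the start of a word


def pieces_to_text(pieces):
    """Rebuild the surface text from a list of SentencePiece-style pieces."""
    return "".join(pieces).replace(WORD_START, " ").strip()


def pieces_to_words(pieces):
    """Group pieces into words: a piece starting with the marker opens a new word."""
    words = []
    for piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append([piece])
        else:
            words[-1].append(piece)
    return words


pieces = "▁பட்டணத்த ுக்கு ▁வந்த ோம் .".split()
print(pieces_to_text(pieces))
print(pieces_to_words(pieces))
```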

Similarly, we can see that the correspondence is easier to learn even for a complex noun, and machine translation systems can learn from just a few examples that "ism" is likely to mean "த்தனம்".

▁para s it ism

▁ஒட்டு ண்ண ி த்தனம்

A core vocabulary limited to 10,000 parts segments the following sentence like this:

10000: ▁நடந்த ால் , ▁முதல் ▁ஆர்ப்பாட்ட - எதிர்ப்பு ▁ஞாயிறன்று ▁நடப்ப தாக ▁இருக்கிறது

Ideally, Tamil grammar would split it like this:

▁நட ந்த ால் , ▁முதல் ▁ஆர் ப்பாட்ட - எதிர் ப்பு ▁ ஞாயிற ன்று ▁நட ப்ப தாக ▁இரு க்கிற து

That split would separate the verb ▁நட, the tense marker ந்த, the conditional ால், the verb எதிர், the verbal-noun marker ப்பு, the compound parts ஞாயிற and ன்று, the postposition தாக, the tense marker க்கிற, and the person and number marker து, just through phonetic clues.

We should work on improving SentencePiece algorithms so that they produce parses like the one above, with Tamil grammatical parsing as the gold standard and the most logical test bed.
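
One possible way to use a grammatical parse as the gold standard is to score where a tokenizer cuts the text against where the grammar cuts it. The sketch below computes precision, recall and F1 over cut positions; the metric choice and the function names are my assumption, not an established benchmark for Tamil.

```python
def cut_points(pieces):
    """Code-point offsets at which a segmentation cuts the text."""
    cuts, offset = set(), 0
    for piece in pieces[:-1]:  # there is no cut after the final piece
        offset += len(piece)
        cuts.add(offset)
    return cuts


def boundary_scores(predicted, gold):
    """Precision, recall and F1 of predicted cut points against gold ones."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("segmentations must cover the same text")
    pred, ref = cut_points(predicted), cut_points(gold)
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 1.0
    recall = hits / len(ref) if ref else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = "▁நடந்த ால்".split()  # first word of the model output above
gold = "▁நட ந்த ால்".split()      # first word of the grammar-based split above
print(boundary_scores(predicted, gold))
```

Here the model's single cut is correct (precision 1.0) but it misses the cut between the verb and the tense marker (recall 0.5).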

Regards

Ravi

Ravi Annaswamy

Apr 12, 2019, 6:41:57 AM

to indicnlp

I will share a cleaned-up word list and token list in a day or two, after processing 2 or 3 large corpora.

Shrinivasan T

Apr 12, 2019, 6:56:35 AM

to Ravi Annaswamy, indicnlp

Thanks Ravi,

Could you please write an article in Tamil on this, to be published on kaniyam.com?

நீச்சல் காரன்

Apr 12, 2019, 7:10:28 AM

to Ravi Annaswamy, indicnlp

Good effort. Picking out the words you need from magazines that contain many errors, or from foreign magazines that do not follow Tamil Nadu usage, is extra work. You could stick to uniformly good publications, e.g. Dinamani, Tamil Hindu, Dinamalar, Vikatan.

Take only words without sandhi, or strip the sandhi programmatically.

Regards,

நீச்சல்காரன்.

http://www.neechalkaran.com
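
As a rough illustration of stripping sandhi in code, the sketch below handles only one case: a word-final க், ச், த் or ப் that doubles the first letter of the next word. It is a naive heuristic of my own, not a full sandhi splitter, and it will miss other sandhi types and may misfire on loanwords.

```python
PULLI = "\u0bcd"  # Tamil sign virama (pulli)
SANDHI_CONSONANTS = "கசதப"  # hard consonants doubled at a word boundary


def strip_boundary_sandhi(tokens):
    """Drop a final க்/ச்/த்/ப் when the next token starts with the same letter."""
    cleaned = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if (
            len(tok) > 2
            and tok[-1] == PULLI
            and tok[-2] in SANDHI_CONSONANTS
            and nxt.startswith(tok[-2])
        ):
            tok = tok[:-2]  # remove the consonant and its pulli
        cleaned.append(tok)
    return cleaned


print(strip_boundary_sandhi("எடுத்துக் கொள்ளவும்".split()))
```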


Ravi Annaswamy

Apr 12, 2019, 8:19:22 AM

to நீச்சல் காரன், indicnlp

Thank you, yes that makes sense

And thanks for your excellent service and web apps


Ravi Annaswamy

Apr 12, 2019, 8:24:28 AM

to Shrinivasan T, indicnlp

Sure, I will do that. Thanks.

Ravi Annaswamy

Apr 12, 2019, 8:33:24 AM

to Shrinivasan T, indicnlp

This is the first time I am seeing this page. A truly awesome resource for Tamil computing.

thanks


vanangamudi

Apr 12, 2019, 8:57:53 AM

to indicnlp

Hi Ravi,

Yes. You can find the vocabulary built from a news corpus here[1]. It doesn't include frequency counts, but a word must occur at least 100 times in the corpus to be in this file.

I did not include the Wikipedia articles, so as to keep the language fully news-based. The language styles that Wikipedia and newspapers follow are evidently different.

I am also working on building a language model with books from Freetamilebooks[2], since books maintain very long-range context.

[1] https://drive.google.com/open?id=1nxwCgZ7n0n2d6ZSAg19nsmO7kGIVe13R

[2] http://freetamilebooks.com/
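
The 100-occurrence cutoff described above can be applied to any word -> count collection along these lines. A sketch with an illustrative function name and toy counts taken from earlier in the thread, not the script used to build the file in [1].

```python
from collections import Counter


def frequent_vocabulary(word_counts, min_count=100):
    """Return the words seen at least `min_count` times, most frequent first."""
    return [w for w, c in word_counts.most_common() if c >= min_count]


toy = Counter({"வருவான்": 350, "வருவாள்": 110, "பேசுவான்": 95})
print(frequent_vocabulary(toy))
```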


Ravi Annaswamy

Apr 12, 2019, 9:37:09 AM

to vanangamudi, indicnlp

Awesome 👏 thanks


Ravi Annaswamy

Apr 12, 2019, 1:38:23 PM

to vanangamudi, indicnlp

“Very long range context” is the key my friend

Well said


வேந்தன் அரசு

Apr 12, 2019, 4:50:24 PM

to Ravi Annaswamy, indicnlp

"வித்துவான்"
The one who sows is also வித்துவான் (as an action), and the scholar is also வித்துவான் (as a noun).


--

வேந்தன் அரசு

Valluvam is my religion

Ravi Annaswamy

Apr 12, 2019, 5:59:25 PM

to வேந்தன் அரசு, indicnlp

Thank you for the information, வேந்தன் sir.
