Hi,
This is our first step in open source contribution to TamilNLP. Most of
the current research focuses on the syntactic aspects of the Tamil
language; we wanted to focus on the semantic aspects.
We have released சொற்றிசையன் (சொல் திசையன், "word vector"), or
Vaaku2vec, for Tamil, trained on data we crawled from Tamil websites.
The algorithm works like a fill-in-the-blanks exercise: "I live in
Tambaram and _______ to Adambakkam for work". If we humans were asked to
fill in the blank, we would pick a word such as commute, bike, or drive.
We come up with it by looking at the surrounding words and predicting
the missing one, and the algorithm works the same way.
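For anyone who wants to see the fill-in-the-blanks idea in code, here is a minimal sketch of how training examples could be built from a sentence. It assumes a word2vec-style setup in which the words on either side of a position are used to predict the word at that position; the function name `context_target_pairs` and the window size of 2 are made up for illustration and are not taken from the Vaaku2vec code.

```python
def context_target_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs, one per position.

    The context is up to `window` words on each side of the target,
    i.e. the "surrounding words" used to guess the blank.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs


sentence = "I live in Tambaram and commute to Adambakkam for work".split()
pairs = context_target_pairs(sentence, window=2)

# The pair for the blank in the example sentence.
context, target = pairs[5]
print(context, "->", target)
```

A model trained on many such pairs has to give words that fit the same blanks (commute, bike, drive) similar vectors, which is where the semantic similarity comes from.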
Similar words end up near each other in the vector space. I have posted
some examples to show what the algorithm learned. You can play with it
at http://w2v.kaatchi.cheyyarivu.org/. Enter your own words and see what
it comes up with. If you want to contribute to this free and open source
effort, please reach out to us. There is parallel development happening
for Malayalam too, and we are also looking for people who are interested
in working on their native languages.
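For those curious what "near each other in the vector space" means in practice, here is a small sketch using cosine similarity, a common way to compare word vectors. The three-dimensional vectors below are invented toy values, not output from our model, and the helper names `cosine` and `nearest` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration only.
toy_vectors = {
    "commute": [0.9, 0.1, 0.0],
    "drive": [0.8, 0.2, 0.1],
    "bike": [0.7, 0.3, 0.0],
    "mango": [0.0, 0.1, 0.9],
}


def nearest(word, vectors, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


for other, score in nearest("commute", toy_vectors):
    print(other, round(score, 3))
```

With these toy values, "drive" and "bike" rank closest to "commute", while "mango" scores near zero.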
Thanks,
vanangamudi