NLTK for Vietnamese


brother rain

Jun 2, 2015, 10:57:41 PM
to nltk-...@googlegroups.com
Can anyone point me to some resources for using NLTK with another language (in my case, Vietnamese)?

I really want to contribute to NLTK to make it work with Vietnamese, but I don't know how or where to start.

Something like this:

>>> import nltk
>>> sentence = "Vào tám giờ sáng thứ sáu tôi không được khỏe"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['Vào', 'tám', 'giờ', 'sáng', 'thứ sáu', 'tôi', 'không', 'được', 'khỏe']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:5]
[('Vào', 'IN'), ('tám', 'CD'), ('giờ', 'JJ'), ('sáng', 'NN'), ('thứ sáu', 'NNP')]

Thank you for your help.

Alex Rudnick

Jun 4, 2015, 1:10:41 PM
to nltk-...@googlegroups.com
Hey brother rain,

Like what Alexis said about Greek in the other thread, you could train
models for Vietnamese! When you call nltk.pos_tag, under the hood NLTK
loads up a tagger that has been trained for English, but it would be
pretty straightforward to train a tagger for Vietnamese instead.

A tagger has a trained model that's like "ok, given this word in this
context, what tags do we think it might have, and with what
probability?" To learn a model like that, you need a bunch of data to
train it on!
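
To make that idea concrete, here is a minimal pure-Python sketch of the simplest possible version of such a model: it just counts how often each word was seen with each tag and turns the counts into probabilities. It ignores context entirely, which a real tagger would not do. The training sentences and the tag names (PRON, ADV, ...) are made up for illustration and are not from any real Vietnamese corpus.

```python
from collections import Counter, defaultdict

# Toy, hand-made training data (hypothetical tags, for illustration only).
train_sents = [
    [("tôi", "PRON"), ("không", "ADV"), ("khỏe", "ADJ")],
    [("tôi", "PRON"), ("được", "VERB"), ("khỏe", "ADJ")],
    [("không", "ADV"), ("được", "ADV")],
]

def train_unigram_model(sents):
    """Count how many times each word appears with each tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return counts

def tag_probabilities(model, word):
    """Return {tag: probability} for a word, or {} if the word is unseen."""
    tag_counts = model.get(word)
    if not tag_counts:
        return {}
    total = sum(tag_counts.values())
    return {tag: n / total for tag, n in tag_counts.items()}

model = train_unigram_model(train_sents)
print(tag_probabilities(model, "được"))
```

With only three toy sentences the numbers mean very little, which is exactly why you need a real tagged corpus.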

So now you need to get some tagged Vietnamese data. Try searching for
things like [vietnamese treebank] and [vietnamese tagged corpus] --
let us know if you find any good resources!
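
Once you have tagged sentences, training could look roughly like the sketch below. It uses NLTK's UnigramTagger with a DefaultTagger as backoff for unseen words. The two training sentences and their tags are placeholders I made up; in practice you would load the sentences from whatever corpus you find, in the same list-of-(word, tag)-pairs format.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Placeholder training data (hypothetical tags); replace with sentences
# loaded from a real tagged Vietnamese corpus.
train_sents = [
    [("tôi", "PRON"), ("không", "ADV"), ("khỏe", "ADJ")],
    [("tôi", "PRON"), ("ăn", "VERB"), ("cơm", "NOUN")],
]

# Words never seen in training fall back to a single default tag.
default_tagger = DefaultTagger("NOUN")

# The unigram tagger learns the most frequent tag for each known word.
tagger = UnigramTagger(train_sents, backoff=default_tagger)

# Input must already be split into tokens.
tagged = tagger.tag(["tôi", "ăn", "phở"])
print(tagged)
```

A unigram tagger is only a starting point; NLTK also has taggers that look at surrounding words and tags, and those can be trained on the same data.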



--
-- alexr