Urdu tokenizer

JD

Oct 25, 2009, 8:32:37 PM10/25/09
to nltk-...@googlegroups.com
I am interested in learning more about the process of generating the pickle files used by Punkt so I can generate an Urdu tokenizer. I recently converted a regexp Urdu tokenizer from Perl to Python, but I've decided I'd like to do all of my splitting inside NLTK.

I found the README included with the pickle files, and at the end it suggests the following steps:

# import punkt
import nltk.tokenize.punkt

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain","Ur","iso-8859-2").read()

# Train tokenizer
tokenizer.train(text)

# Dump pickled tokenizer
import pickle
out = open("slovene.pickle","wb")
pickle.dump(tokenizer, out)
out.close()

I can't find slovene.plain anywhere in my svn checkout of nltk or the nltk_data dir created via nltk.download('all'). 

I did find that Punkt is trained unsupervised, though. So, based on that, I am wondering whether the way to train Punkt on some new text is simply to feed it a LOT of text and let it go to work figuring out the sentence boundaries on its own. Without seeing slovene.plain I can't be sure...
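
To make that concrete, here is what I'm imagining for Urdu, adapted from the README steps above. "urdu.plain" is just a placeholder name for whatever large raw-text corpus I end up collecting, and I'm assuming UTF-8 encoding:

# import punkt
import nltk.tokenize.punkt

# Read in an unannotated training corpus -- just raw running text,
# since (as I understand it) Punkt learns sentence boundaries unsupervised
import codecs
text = codecs.open("urdu.plain", "r", "utf-8").read()

# Make a new tokenizer and train it on the raw text
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
tokenizer.train(text)

# Dump the pickled tokenizer so it can be loaded like the shipped models
import pickle
out = open("urdu.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()

Is that all there is to it, or does the training corpus need any preparation beyond being plain text?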

Can someone help?

JD

Oct 25, 2009, 8:34:04 PM10/25/09
to nltk-...@googlegroups.com
My bad... I meant I'd like to keep the splitting inside NLTK's machine-learned tokenizers. I'm not sure I'd gain much by simply transferring my regex splitter to one of NLTK's regexp splitting options.
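
For example, this is roughly what I mean by transferring it -- the pattern below is only a stand-in for illustration, not my actual Urdu rules:

# NLTK's regexp option: wrap a pattern in RegexpTokenizer. With
# gaps=True the pattern marks the separators, so this splits on runs
# of the Urdu full stop (U+06D4), Arabic question mark (U+061F), and '!'
from nltk.tokenize import RegexpTokenizer

splitter = RegexpTokenizer(u'[\u06d4\u061f!]+', gaps=True)

# toy text: two short Urdu sentences ending in the Urdu full stop
sample = u'\u06cc\u06c1 \u0627\u06cc\u06a9 \u062c\u0645\u0644\u06c1 \u06c1\u06d2\u06d4 \u06cc\u06c1 \u062f\u0648\u0633\u0631\u0627 \u062c\u0645\u0644\u06c1 \u06c1\u06d2\u06d4'
sentences = splitter.tokenize(sample)

That's basically my Perl script again, which is why I'd rather train Punkt instead.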