I am interested in learning more about the process of generating the pickle files used by punkt so I can generate an Urdu tokenizer. I recently converted a regexp-based Urdu tokenizer from Perl to Python, but I have decided I would rather keep all of my sentence splitting inside NLTK.
I have found the README included with the pickle files and at the end it suggests the following steps:
# import punkt
import nltk.tokenize.punkt
# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Read in training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain","Ur","iso-8859-2").read()
# Train tokenizer
tokenizer.train(text)
# Dump pickled tokenizer
import pickle
out = open("slovene.pickle","wb")
pickle.dump(tokenizer, out)
out.close()
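The README stops after pickling. I assume the resulting file is then loaded back and used like the bundled models, something like the following (my own sketch, not part of the README):

# Load the pickled tokenizer and split text into sentences
import pickle
with open("slovene.pickle", "rb") as f:
    tokenizer = pickle.load(f)
sentences = tokenizer.tokenize(text)  # returns a list of sentence strings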
I can't find slovene.plain anywhere in my svn checkout of NLTK or in the nltk_data directory created via nltk.download('all').
I have since found that Punkt is trained unsupervised, though. So, based on that, I am wondering whether the way to train Punkt on some new text is simply to feed it a LOT of raw text and let it work out the sentence boundaries on its own. Without seeing slovene.plain I can't be sure...
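If that is indeed how it works, I assume the Urdu version would look roughly like this. To be clear, this is just my sketch: urdu.plain is a hypothetical path to a large collection of raw, unannotated Urdu prose, and I am using PunktTrainer and get_params() (which I can see in the nltk.tokenize.punkt source) rather than the README's tokenizer.train() shortcut:

# Train an Urdu punkt model from raw text (unsupervised -- no annotation needed)
import pickle
import nltk.tokenize.punkt

# urdu.plain is a placeholder: any large plain-text Urdu corpus, read as UTF-8
import codecs
text = codecs.open("urdu.plain", "r", "utf-8").read()

trainer = nltk.tokenize.punkt.PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # optional: consider all word pairs as potential collocations
trainer.train(text)

# Build a tokenizer from the learned parameters and pickle it,
# mirroring the README's final step
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer(trainer.get_params())
out = open("urdu.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()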
Can someone help?