I can't get the Punkt tokenizer to produce a pickle file that actually works.
Here is a sample of the Chinese text, exactly as it appears in my plain-text file:
卷一
曾子仕于莒,得粟三秉,方是之時,曾子重其祿而輕其身;親沒之后
,齊迎以相,楚迎以令尹,晉迎以上卿,方是之時,曾子重其身而輕其祿
。怀其寶而迷其國者,不可與語仁;窘其身而約其親者,不可與語孝;任
重道遠者,不擇地而息;家貧親老者,不擇官而仕。故君子橋褐趨時,當
務為急。傳云:不逢時而仕,任事而敦其慮,為之使而不入其謀,貧焉故
也。詩云:“夙夜在公,實命不同。”
傳曰:夫行露之人許嫁矣,然而未往也,見一物不具,一禮不備,守
節貞理,守死不往,君子以為得婦道之宜,故舉而傳之,揚而歌之,以絕
無道之求,防污道之行乎!詩曰:“雖速我訟,亦不爾從。”
My code for training punkt is the following:
# -*- coding: utf-8 -*-
import codecs
import pickle

import nltk.tokenize.punkt

# Read the training text as Unicode
with codecs.open("zh-plainraw.txt", "r", "utf-8") as f:
    text = f.read()

# Make a new tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
# Train the tokenizer on the raw text
tokenizer.train(text)

# Dump the pickled tokenizer
with open("chinese.pickle", "wb") as out:
    pickle.dump(tokenizer, out)
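One way to check whether the resulting pickle works is to load it back and run it on a short sample. The following is only a minimal sketch of that check: `load_pickle` and `check_tokenizer` are hypothetical helper names introduced here for illustration, and the sketch assumes the pickled object exposes a `tokenize(text)` method, as `PunktSentenceTokenizer` does.

```python
import pickle


def load_pickle(path):
    """Load and return whatever object was pickled at `path`."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def check_tokenizer(path, sample):
    """Load a pickled tokenizer and split `sample` with it.

    Assumption: the pickled object has a `tokenize(text)` method.
    Raises AttributeError if it does not.
    """
    tokenizer = load_pickle(path)
    return tokenizer.tokenize(sample)
```

Calling `check_tokenizer("chinese.pickle", text)` should return a list of sentence strings. If the list contains a single item holding the whole input, the file loads correctly but the tokenizer is not finding any sentence boundaries in the text.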
What am I doing wrong?