Training punkt tokenizer with Chinese texts

Michael Tsikerdekis

Mar 11, 2013, 3:31:18 AM
to nltk-...@googlegroups.com
I am having trouble: I can't seem to make the punkt tokenizer produce a proper pickle file (one that actually works).

I am posting here a sample of the Chinese text exactly as it appears in my plain text file:

                         卷一

    曾子仕于莒,得粟三秉,方是之時,曾子重其祿而輕其身;親沒之后
,齊迎以相,楚迎以令尹,晉迎以上卿,方是之時,曾子重其身而輕其祿
。怀其寶而迷其國者,不可與語仁;窘其身而約其親者,不可與語孝;任
重道遠者,不擇地而息;家貧親老者,不擇官而仕。故君子橋褐趨時,當
務為急。傳云:不逢時而仕,任事而敦其慮,為之使而不入其謀,貧焉故
也。詩云:“夙夜在公,實命不同。”
    傳曰:夫行露之人許嫁矣,然而未往也,見一物不具,一禮不備,守
節貞理,守死不往,君子以為得婦道之宜,故舉而傳之,揚而歌之,以絕
無道之求,防污道之行乎!詩曰:“雖速我訟,亦不爾從。”


My code for training punkt is the following:

# -*- coding: utf-8 -*-
# import punkt
import nltk.tokenize.punkt

import codecs
text = codecs.open("zh-plainraw.txt", "Ur", "utf-8").read()

# Make a new Tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Train tokenizer
tokenizer.train(text)

# Dump pickled tokenizer
import pickle
out = open("chinese.pickle","wb")
pickle.dump(tokenizer, out)
out.close()

What am I doing wrong?

Jacob Perkins

Mar 11, 2013, 6:05:02 PM
to nltk-...@googlegroups.com
Hi Michael,

Are you having errors with the actual pickle file, like it doesn't save/load properly? Or is it that the trained tokenizer doesn't tokenize properly?

Jacob
---

Michael Tsikerdekis

Mar 12, 2013, 3:49:15 AM
to nltk-...@googlegroups.com
Hi Jacob,

Yes, I think the problem is with both the tokenization and the pickle file. I've attached the pickle file produced by this code; it does not contain any information that would tokenize a file properly.

The code that I use to tokenize with this pickle file is the following:

import nltk.data

sent_detector = nltk.data.load('tokenizers/punkt/chinese.pickle')
senttokens = sent_detector.tokenize(raw, realign_boundaries=True)

where raw is the contents of a file loaded using codecs.
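
i.e. something like the following, with "input.txt" standing in for the actual file:

import codecs
raw = codecs.open("input.txt", "Ur", "utf-8").read()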

Michael


Attachment: chinese.pickle

Jacob Perkins

Mar 12, 2013, 12:04:49 PM
to nltk-...@googlegroups.com
Hi Michael,

I tried out the tokenizer, and it did not tokenize your example text. Just to be clear, this has nothing to do with the pickling process, and everything to do with the tokenizer training process. Looking at the NLTK code in punkt.py, and your code above, it looks like you need to either pass finalize=True into the train() method, as in tokenizer.train(text, finalize=True), or if you train on multiple files, call tokenizer.finalize_training() after all the train() calls.
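
If tokenizer.train() doesn't accept a finalize argument, note that punkt.py defines these methods on a separate PunktTrainer class, so a rough (untested) sketch along those lines would be:

from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer

trainer = PunktTrainer()
trainer.train(text, finalize=False)  # repeat for each training file
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())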

Michael Tsikerdekis

Mar 13, 2013, 2:51:26 PM
to nltk-...@googlegroups.com
Hi Jacob,

So I tried both methods that you suggested, tokenizer.finalize_training() and the finalize option in the train() method, but I keep getting errors that the function does not exist:

   tokenizer.finalize_training()
AttributeError: 'PunktSentenceTokenizer' object has no attribute 'finalize_training'

It is really weird, because when I look at the punkt.py file these options seem to be there, but when I dir(tokenizer) they don't appear. I know I am forgetting something but can't quite figure out what.

Michael