Training the Punkt sentence tokeniser


Francis Tyers

Dec 15, 2012, 5:58:27 PM
to nltk-...@googlegroups.com
Hi

I've searched high and low for an answer to this particular riddle, but despite my best efforts I can't for the life of me find any clear instructions for training the Punkt sentence tokeniser for a new language. The languages I am interested in having a sentence tokeniser for are Armenian and Russian.

I found this document: http://nltk.googlecode.com/svn/trunk/doc/howto/portuguese_en.html which suggests taking a corpus with one sentence per line, replacing the newlines with spaces, and then training on that. I've tried that, but without success.
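
For reference, here is roughly what I tried, following the howto (the filename and encoding are just examples):

---- What I tried ----
# Read a one-sentence-per-line corpus, replace the newlines
# with spaces, and train on the result, as the howto suggests.
import codecs
import nltk.tokenize.punkt

# Join the lines with spaces so the input is one running text
lines = codecs.open("corpus.txt", "r", "utf-8").read().splitlines()
text = " ".join(lines)

# Train a fresh tokenizer on the raw text
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
tokenizer.train(text)
---------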

Does the input format need to be one sentence per line? How much training data is needed for decent performance?

From what I can tell this shouldn't be that difficult :) What am I doing wrong?

Oh, another thing: has anyone got existing models for these languages?

Regards,

Fran

Alex Rudnick

Jan 2, 2013, 12:34:09 AM
to nltk-...@googlegroups.com
Hello Fran!

Sorry for taking a long time to respond; hopefully this is still
helpful! I just used the code snippet below to train a Punkt sentence
segmenter; I found it in this thread on nltk-dev:

https://groups.google.com/forum/?fromgroups=#!topic/nltk-dev/y2zYJSOdevQ

---- Training Code ----
# Import the Punkt tokenizer module
import nltk.tokenize.punkt

# Make a new (untrained) tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in the training corpus (one example: Slovene)
import codecs
text = codecs.open("slovene.plain", "r", "iso-8859-2").read()

# Train the tokenizer on the raw text
tokenizer.train(text)

# Dump the pickled tokenizer for later reuse
import pickle
out = open("slovene.pickle", "wb")
pickle.dump(tokenizer, out)
out.close()
---------

I took the first thousand sentences from the Spanish Europarl corpus
(one sentence per line) and used that as input instead of
"slovene.plain" (not sure where to get that file!); the trained
tokenizer managed to segment the few sample Spanish sentences that I
showed it.
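
In case it's useful, here is a minimal sketch of loading the pickled tokenizer back in and using it (the filename and sample text are made up):

---- Usage Sketch ----
# Load the pickled tokenizer from disk
import pickle

f = open("spanish.pickle", "rb")
tokenizer = pickle.load(f)
f.close()

# tokenize() returns a list of sentence strings
sample = "Hola. Me llamo Alex. Vivo en Atlanta."
for sentence in tokenizer.tokenize(sample):
    print(sentence)
---------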

We should really have better documentation about how to train these
sentence tokenizers. Filing a bug about that now...

Cheers!

--
-- alexr

Francis Tyers

Jan 2, 2013, 7:46:37 AM
to nltk-...@googlegroups.com
Thanks!

One of our GCI students has also had a bash at documenting it; you can find his instructions here:

http://wiki.apertium.org/wiki/Sentence_segmenting#NLTK_Punkt

Fran