Hello Fran!
Sorry for taking so long to respond; hopefully this is still
helpful! I just used the code snippet below to train a Punkt sentence
segmenter; I found it in this thread on nltk-dev:
https://groups.google.com/forum/?fromgroups=#!topic/nltk-dev/y2zYJSOdevQ
---- Training Code ----
# Import the Punkt tokenizer plus the modules we need for I/O
import codecs
import pickle

import nltk.tokenize.punkt

# Make a new tokenizer
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()

# Read in the training corpus (one example: Slovene)
text = codecs.open("slovene.plain", "r", "iso-8859-2").read()

# Train the tokenizer on the raw text
tokenizer.train(text)

# Dump the pickled tokenizer
with open("slovene.pickle", "wb") as out:
    pickle.dump(tokenizer, out)
---------
I took the first thousand sentences from the Spanish Europarl corpus
(one sentence per line) and used that as input instead of
"slovene.plain" (not sure where to get that file!); the resulting
tokenizer managed to segment the few sample Spanish sentences I showed it.
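For completeness, here's roughly how I then used the trained
tokenizer; the pickle file name and the sample sentences are just
placeholders of mine, not anything from the thread above.
---- Usage Sketch ----
# Load the pickled tokenizer and segment some raw text with it.
import pickle

with open("spanish.pickle", "rb") as f:
    tokenizer = pickle.load(f)

# Placeholder Spanish text; any raw prose works here.
text = "El Parlamento se reúne hoy. La sesión comienza a las nueve."
for sentence in tokenizer.tokenize(text):
    print(sentence)
---------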
We should really have better documentation about how to train these
sentence tokenizers. Filing a bug about that now...
Cheers!
--
alexr