Perhaps try segmenting it into sentences first. Really, if you can
find a way to load the file gradually and segment/tokenize it a few
megs at a time, that would be even better.
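A minimal sketch of that gradual-loading idea (the helper name `tokenize_in_chunks` and the chunk size are my own; `str.split` stands in below for the tokenizer, but you would pass `nltk.word_tokenize` in practice):

```python
def tokenize_in_chunks(path, tokenize, chunk_size=4 * 1024 * 1024):
    # Read roughly chunk_size bytes at a time instead of the whole file.
    # Any partial last line is carried over in `leftover` so a token is
    # never split across a read boundary.
    tokens = []
    with open(path, encoding='utf-8') as f:
        leftover = ''
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                if leftover:
                    tokens.extend(tokenize(leftover))
                break
            # Tokenize only up to the last complete line; keep the rest.
            text, _, leftover = (leftover + chunk).rpartition('\n')
            tokens.extend(tokenize(text))
    return tokens

# e.g. tokens = tokenize_in_chunks('big_corpus.txt', nltk.word_tokenize)
```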
2010/6/29 Andrés Monroy-Hernández <andres...@gmail.com>:
Cheers,
--
-- alexr
import nltk

tokens = []
raw = open(filename)
for line in raw:
    # word_tokenize returns a list, so collect into a list, not a string
    tokens.extend(nltk.word_tokenize(line))
raw.close()
And depending on the kind of output you want, you may also want to add
a newline at the end of each pass through the loop.
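For example, if the output you want is one tokenized line per input line, you could write the tokens back out as you go (the helper name `tokenize_to_file` is hypothetical, and `str.split` below is just a stand-in for `nltk.word_tokenize`):

```python
def tokenize_to_file(in_path, out_path, tokenize):
    # Write one space-joined line of tokens per input line,
    # adding the newline inside the loop.
    with open(in_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(' '.join(tokenize(line)) + '\n')
```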
Good luck!