MemoryError when tokenizing large file


Andrés Monroy-Hernández

Jun 29, 2010, 9:31:47 AM
to nltk-users
Hello,

I am new to NLTK and Python in general. I am trying to load a large
file to analyze, and I am running out of memory when tokenizing. Any
suggestions on how to analyze large files?

Thanks!

Here is the traceback

>>> raw = open("text_content.txt").read()
>>> type(raw)
<type 'str'>
>>> tokens=nltk.word_tokenize(raw)

Traceback (most recent call last):
  File "<pyshell#10>", line 1, in <module>
    tokens=nltk.word_tokenize(raw)
  File "C:\Python26\lib\site-packages\nltk\tokenize\__init__.py", line 55, in word_tokenize
    return _word_tokenize(text)
  File "C:\Python26\lib\site-packages\nltk\tokenize\treebank.py", line 53, in tokenize
    text = regexp.sub(r'\1 \2', text)
MemoryError
>>> len(raw)
255962962
>>>

Alex Rudnick

Jun 29, 2010, 9:45:08 AM
to nltk-...@googlegroups.com
That's pretty big!

Perhaps try segmenting it into sentences first. Really, if you can
find a way to load the file gradually and segment/tokenize it a few
megs at a time, that would be even better.
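
Something along these lines, untested and just a sketch (the 4 MB chunk
size is arbitrary, and a naive chunk boundary can still split a sentence
or a word in half, so for exact results you'd want to carry the tail of
each chunk over into the next read):

import nltk

tokens = []
f = open("text_content.txt")
while True:
    chunk = f.read(4 * 1024 * 1024)   # read roughly 4 MB at a time
    if not chunk:
        break
    # segment the chunk into sentences, then tokenize each sentence
    for sent in nltk.sent_tokenize(chunk):
        tokens.extend(nltk.word_tokenize(sent))
f.close()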

2010/6/29 Andrés Monroy-Hernández <andres...@gmail.com>:

Cheers,

--
-- alexr

Matthew Gardner

Jun 29, 2010, 11:16:11 AM
to nltk-...@googlegroups.com
The easy way is to tokenize the file line by line in Python:

import nltk

tokenized = []
raw = open(filename)
for line in raw:
    tokenized.extend(nltk.word_tokenize(line))

And maybe you want to append a newline marker inside the loop,
depending on the kind of output you want.
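
Once tokenized is a list of tokens, the usual NLTK tools work on it
without ever holding the whole 255 MB string in memory. Just as a
sketch (the word 'memory' here is only an example):

text = nltk.Text(tokenized)
text.concordance('memory')        # or whatever word you care about
fdist = nltk.FreqDist(tokenized)
print fdist['memory']             # how often a given token occurs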
Good luck!

