maxent classifier training MemoryError


Dave Orr

Jan 31, 2012, 12:38:46 AM
to nltk-users
Hi smart friendly people,

I'm working on spam detection for reviews on review sites (like
reviews on Hotels.com, for instance). I am testing on a small dataset
of 400 spam reviews and 400 ham reviews just to get a feel for how
this is going to work, and I can't seem to get any classifier other
than naive bayes to be happy. I'm probably just misunderstanding
something basic.

Here's what I'm doing: I am calculating features from the 1000 most
informative words, and the 200 most informative bigrams, in my
training set, plus a couple of other small features. When I train
something in naive bayes, it just works:

classifier = nltk.NaiveBayesClassifier.train(train_set)
C:\Python27\lib\site-packages\nltk\app\__init__.py:46: UserWarning:
nltk.app.wordfreq not loaded (requires the pylab library).
warnings.warn("nltk.app.wordfreq not loaded "
nltk.classify.accuracy(classifier, test_set)
0.81875

(Aside: should I worry about that warning?)
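
For readers following along, the feature setup described above (a fixed list of informative words and bigrams turned into boolean features) might look roughly like the sketch below. This is not the original poster's code: the function name, the whitespace tokenizer, and the tiny vocabulary are assumptions made only for illustration.

```python
def review_features(text, top_words, top_bigrams):
    """Map a review to boolean features over a fixed word and bigram vocabulary."""
    tokens = text.lower().split()  # naive whitespace tokenizer (an assumption)
    token_set = set(tokens)
    bigram_set = set(zip(tokens, tokens[1:]))
    features = {}
    for word in top_words:
        features["contains(%s)" % word] = word in token_set
    for first, second in top_bigrams:
        features["bigram(%s %s)" % (first, second)] = (first, second) in bigram_set
    return features


# Hypothetical vocabulary; the post uses 1000 words and 200 bigrams.
top_words = ["great", "hotel", "luxury"]
top_bigrams = [("my", "husband"), ("great", "location")]

features = review_features("Great location and a great hotel", top_words, top_bigrams)
# NLTK classifiers are trained on a list of (featureset, label) pairs.
train_set = [(features, "ham")]
```

Every review then contributes one dictionary with the same 1200 or so keys, which is the shape both NaiveBayesClassifier and MaxentClassifier expect.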

It takes a few seconds to train. But when I try it with maxent:

classifier = nltk.MaxentClassifier.train(train_set, algorithm='CG')
Grad eval #0
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line
323, in train
gaussian_prior_sigma, **cutoffs)
...
[skipping many many lines]
...
File "C:\Python27\lib\copy.py", line 163, in deepcopy
y = copier(x, memo)
File "C:\Python27\lib\copy.py", line 228, in _deepcopy_list
memo[id(x)] = y
File "C:\Python27\lib\copy.py", line 163, in deepcopy
y = copier(x, memo)
File "C:\Python27\lib\copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "C:\Python27\lib\copy.py", line 192, in deepcopy
memo[d] = y
MemoryError

Now it runs for what seems like a long time, more than 5 minutes,
before it dies that sad death. Using LBFGSB instead means it hits the
memory error in about 3 seconds, which I guess is progress. GIS does
actually work although it takes forever (where forever == 7 minutes).
Accuracy is better though, which is nice.

So, what am I doing wrong with the scipy algorithms? I set Eclipse to
run with 1024 megs of RAM (for some reason it doesn't seem to want to
use more than that), but can it really be the case that 640 training
examples take up a gig? Are there parameters I should be passing in?
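
A back-of-the-envelope check on that question, assuming a dense matrix holding one 64-bit float per (example, feature, label) combination. That layout is an assumption for the estimate, not necessarily how NLTK or SciPy store the data, and the "couple of other small features" are ignored.

```python
n_examples = 640          # training examples mentioned in the post
n_features = 1000 + 200   # word features plus bigram features
n_labels = 2              # spam and ham
bytes_per_float = 8       # assuming 64-bit floats

# Maxent uses joint (feature, label) features, so count one column per pair.
joint_features = n_features * n_labels
dense_bytes = n_examples * joint_features * bytes_per_float
dense_mib = dense_bytes / (1024.0 * 1024.0)
print("dense encoding: %.1f MiB" % dense_mib)
```

Under these assumptions the data itself comes to roughly 12 MB, so a gigabyte would have to be going somewhere other than the raw training matrix.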

Also, I tried the decision tree classifier, which took 9 minutes to
train and had lousy accuracy, but at least it worked.

Any help would be appreciated.

Thanks,
Dave

PS: In case anyone is interested, I'm trying to duplicate the results
found in this paper, using their data: http://aclweb.org/anthology/P/P11/P11-1032.pdf

Correa Denzil

Jan 31, 2012, 4:15:54 AM
to nltk-...@googlegroups.com
Dave,
Can you try NLTK-trainer and check whether the issue still occurs? It's a quick command-line tool, so you shouldn't run into the same problem.




Richard Marsden

Jan 31, 2012, 7:56:48 AM
to nltk-...@googlegroups.com
My understanding was that the SciPy routines are broken and were removed / being removed from the latest / next version.

I've had some success with the Megam option, although configuring Megam never seems smooth (perhaps that is just my own problem!).

MaxEnt is going to be slower than Naive Bayes to train: you are trading speed for a more sophisticated model that does not assume the features are independent.
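
To make the independence point concrete: Naive Bayes multiplies per-feature likelihoods as if the features were unrelated, so two strongly correlated features (say a word and a bigram containing it) count as two separate pieces of evidence. A minimal sketch with made-up likelihoods; the helper name and the numbers are hypothetical.

```python
def naive_bayes_posterior(prior_spam, likelihoods):
    """Return P(spam | features), treating every feature as independent.

    likelihoods is a list of (P(f | spam), P(f | ham)) pairs,
    one pair per observed feature.
    """
    spam_score = prior_spam
    ham_score = 1.0 - prior_spam
    for p_given_spam, p_given_ham in likelihoods:
        spam_score *= p_given_spam
        ham_score *= p_given_ham
    return spam_score / (spam_score + ham_score)


feature = (0.8, 0.4)  # hypothetical: twice as likely in spam as in ham

once = naive_bayes_posterior(0.5, [feature])
# The same evidence seen through two perfectly correlated features.
twice = naive_bayes_posterior(0.5, [feature, feature])
print(once, twice)
```

The duplicated feature pushes the posterior from about 0.67 to 0.8 even though no new information was added. A maxent model fits its weights jointly, so it can share the credit between overlapping features, which is part of what the extra training time buys.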

Richard (M)

Dave Orr

Jan 31, 2012, 9:43:35 AM
to nltk-...@googlegroups.com
Thanks Richard. I think this isn't just an issue with SciPy, because
when I run with a more real dataset, which includes 3462 training
cases instead of 720, GIS runs out of memory immediately. And in the
real world, that's not really a huge amount of training data.

- Dave

Dave Orr

Jan 31, 2012, 10:10:15 AM
to nltk-...@googlegroups.com
On further investigation, I think Correa had it right. If I run
outside of Eclipse, I don't run out of memory even on the larger
dataset.

So I think we can close this as an Eclipse problem, not an NLTK problem.

It's actually still a bit puzzling. When I run via the command line,
Python takes up about 126 MB on the larger dataset using GIS.
Eclipse's footprint, in the meantime, is 438 MB, and it is supposedly
allocating a gig of space, so there should be plenty of headroom.
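
One way to compare the two environments without relying on the task manager is to measure peak allocation from inside Python. The sketch below uses the standard-library tracemalloc module, which requires Python 3.4 or later (so not the Python 2.7 used in this thread) and only counts allocations made through Python's allocator, not the whole process footprint. The helper name and the stand-in workload are hypothetical.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak MiB allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024.0 * 1024.0)


def build_list(n):
    # Stand-in workload; replace with the classifier training call.
    return [float(i) for i in range(n)]


result, peak = peak_memory_mib(build_list, 100000)
print("peak: %.1f MiB" % peak)
```

Running the same script from the IDE and from the command line would show whether the training code itself allocates differently in the two cases.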

Lesson learned: Eclipse is not the right environment for actually
running code. I'll have to find a better interactive environment than
the Windows shell, though.

Cheers,
Dave

Correa Denzil

Jan 31, 2012, 2:55:28 PM
to nltk-...@googlegroups.com
Glad to know this solved it, but I wouldn't blame Eclipse just yet. This needs further investigation.

--Regards,
Denzil

Correa Denzil

Jan 31, 2012, 2:56:50 PM
to nltk-...@googlegroups.com


For a better interactive environment, have a look at IPython: http://ipython.org/

Diego Molla

Oct 23, 2012, 3:33:28 AM
to nltk-...@googlegroups.com, dm...@stanfordalumni.org
Can we revisit this? I tried Maxent with NLTK's names corpus, running from the command line, and I get the same memory error. The odd thing is that everything worked well a few weeks ago. I think at that time I had an old version of Ubuntu, and since then I have upgraded to 12.04 LTS.

Here is the code:

    import random

    import nltk
    from nltk.corpus import names

    def gender_features(word):
        # Binary features for the last and second-to-last letters of the name.
        return {'last-' + word[-1]: 1, 'secondlast-' + word[-2]: 1}

    # Use a separate variable so the imported `names` corpus is not shadowed.
    labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                     [(name, 'female') for name in names.words('female.txt')])
    random.shuffle(labeled_names)

    featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
    train_set, test_set = featuresets[500:], featuresets[:500]

    print("Training")
    classifier = nltk.MaxentClassifier.train(train_set)

And after 9 minutes on an Intel Core i7 C...@2.80Gz x 8 machine, the last lines of a very long error message are:

  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 228, in _deepcopy_list
    memo[id(x)] = y
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 228, in _deepcopy_list
    memo[id(x)] = y
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 192, in deepcopy
    memo[d] = y
MemoryError

Another error message somewhere up in the long list is:

  File "/usr/lib/python2.7/dist-packages/apt/__init__.py", line 21, in <module>
    import apt_pkg
  ImportError: libapt-pkg.so.4.12: failed to map segment from shared object: Cannot allocate memory

Jacob Perkins

Oct 23, 2012, 11:52:22 AM
to nltk-...@googlegroups.com, dm...@stanfordalumni.org
The default MaxentClassifier training algorithm is very memory intensive, and quite slow. I'd try IIS or CG, or even better MEGAM (which requires installing http://www.cs.utah.edu/~hal/megam/). You can specify the algorithm as a keyword argument to the train() method, and you can also restrict the training time (and overall memory usage) with the max_iter argument, as in MaxentClassifier.train(train_set, algorithm='MEGAM', max_iter=10). Usually by 10 iterations, it'll have gotten pretty close to a plateau where further iterations don't make much difference.
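
As a sketch of the keyword arguments described above, assuming a reasonably recent NLTK with NumPy installed: the toy training set is made up, and GIS is chosen here only because it needs no external binary. Which algorithm names are accepted depends on the installed NLTK version, so treat the value of algorithm as an assumption to check against your install.

```python
import nltk

# Toy data in NLTK's (featureset, label) format; purely illustrative.
train_set = [
    ({"last-a": 1, "secondlast-n": 1}, "female"),
    ({"last-e": 1, "secondlast-i": 1}, "female"),
    ({"last-k": 1, "secondlast-c": 1}, "male"),
    ({"last-n": 1, "secondlast-o": 1}, "male"),
]

# max_iter caps the number of training iterations;
# trace=0 silences the per-iteration progress output.
classifier = nltk.MaxentClassifier.train(
    train_set, algorithm="GIS", trace=0, max_iter=10
)

prediction = classifier.classify({"last-a": 1, "secondlast-n": 1})
print(prediction)
```

Once megam is installed and NLTK has been pointed at the binary, the same call with algorithm='MEGAM' should be the only change needed.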

Hope that helps,

Jacob