How to use integer features in the naive Bayes classifier?


dmtr

Apr 6, 2010, 10:31:26 PM
to nltk-users
Hi,

I have a very simple training corpus with one integer feature 'a' and
two labels 'x' and 'y':
a=0 -> y; a=20 -> y; a=35 -> y; a=75 -> x; a=80 -> x; a=95 -> x;
a=100 -> x;

Now I want to classify a=5, a=20, a=97. By default the naive Bayes
classifier doesn't treat the integer feature as a continuous value,
returning P=0.5 for the 'unknown' a=5 and a=97 tests. How can I make it
interpret this feature as an integer rather than as a set of distinct
values?

-- D

P.S. here is the code example:

import nltk
train = [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'),
         (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'),
         (dict(a=100), 'x')]
print train
test = [dict(a=5), dict(a=20), dict(a=97), dict(a=99)]

classifier = nltk.NaiveBayesClassifier.train(train)
for t in test:
    pdist = classifier.prob_classify(t)
    print '%s: p(x) = %.4f  p(y) = %.4f' % (t, pdist.prob('x'), pdist.prob('y'))
classifier.show_most_informative_features()

Richard Careaga

Apr 6, 2010, 10:56:47 PM
to nltk-...@googlegroups.com
Do you really want integers rather than strings?

>>> type(20)
<type 'int'>
>>> type('20')
<type 'str'>

dmtr

Apr 7, 2010, 2:29:38 AM
to nltk-users
Yes. Sure. I do want integers. As I've said, I want to classify a=5,
a=20, a=97 and get meaningful answers for 5 and 97, based on a
training set that does not contain those specific values. I guess I
could try a workaround (use int(log(N)) instead of the raw number, or
something like that); it would probably work in my specific case, but
I'd rather not.
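
For concreteness, a minimal sketch of the kind of workaround I mean. The squash helper is hypothetical; log(0) is undefined, so log(a + 1) is used, and note that a=5 still lands in a bin the training data never saw:

import math
import nltk

def squash(fs):
    # Collapse the raw integer onto a coarse log scale so that nearby
    # values share the same (discrete) feature value.
    return dict(a=int(math.log(fs['a'] + 1)))

train = [(squash(fs), label) for (fs, label) in
         [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'),
          (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'),
          (dict(a=100), 'x')]]
classifier = nltk.NaiveBayesClassifier.train(train)
print classifier.classify(squash(dict(a=97)))  # log-bin 4, same as 75..100
print classifier.classify(squash(dict(a=5)))   # log-bin 1 was never seen, so only the priors decide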

Or are you trying to say that my training set contains strings instead
of integers? I don't think this is true. From my example:

>>> train = [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'), (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'), (dict(a=100), 'x')]

>>> print train[1][0]['a']
20
>>> print type(train[1][0]['a'])
<type 'int'>

-- D

Richard Careaga

Apr 7, 2010, 8:49:29 AM
to nltk-...@googlegroups.com
The reason I asked was that, just looking at the snippet, it seemed odd to me that this classifier would be used on integers. I couldn't figure out why it wouldn't be more straightforward to just do:

>>> a = 5
>>> if a <= 35:
...     a = 'y'
... else:
...     a = 'x'
...
>>> print a
y
>>> a = 97
>>> if a <= 35:
...     a = 'y'
... else:
...     a = 'x'
...
>>> print a
x
>>>

But, now, looking at the code:

>>> help(classifier)

 |  If the classifier encounters an input with a feature that has
 |  never been seen with any label, then rather than assigning a
 |  probability of 0 to all labels, it will ignore that feature.

I see that it's not designed to interpolate. Since a=5 and a=97 are feature values that the classifier has never seen with any label, they get silently ignored.

dmtr

Apr 7, 2010, 6:00:46 PM
to nltk-users
> I see that it's not designed to interpolate. Since a=5 and a=97 are feature
> values that the classifier has never seen with any label, they get silently ignored.
>

Well, I'm glad we are on the same page. That doesn't answer the
original question, though.
If the naive Bayes classifier doesn't support integer features, is
there any way to add such support?
Is it possible at all for the Bayes classifier? How about other
classifiers?

-- D

Richard Careaga

Apr 7, 2010, 8:18:00 PM
to nltk-...@googlegroups.com
Where I'd go next, if you want to do this within NLTK, is to consider the outcome space: the integers N from 0 to 100, each of which carries the label x or the label y. To classify an arbitrary N, you must either have every N in your training set, which is pointless, or base the training set not on examples of N as features but on features of N that distinguish the set belonging to y from the set belonging to x, such as:

has_single_digit -> y
has_dual_digit:
  and first_digit 2 | 3 | 4 -> y
  and first_digit 5 | 6 | 7 | 8 | 9 -> x
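
A rough sketch of that idea; the function and feature names here are made up for illustration, not from NLTK:

def digit_features(n):
    # Describe the number by its properties rather than by its exact value.
    s = str(n)
    return {'num_digits': len(s), 'first_digit': s[0]}

print digit_features(5)    # -> num_digits=1, first_digit='5'
print digit_features(97)   # -> num_digits=2, first_digit='9'

Feature dictionaries like these can then be passed to nltk.NaiveBayesClassifier.train in place of the raw integers.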

dmtr

Apr 7, 2010, 9:02:48 PM
to nltk-users
I've already pointed out that a workaround like that is not what I was
looking for.
I'd rather use some native support for integer features.

-- D

Steven Bird

Apr 8, 2010, 3:53:33 AM
to nltk-...@googlegroups.com

Richard described an approach known as binning; it's a popular way of dealing with scalar features.
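
For concreteness, a minimal sketch of binning applied to the example from this thread. The bin width of 25 and the helper name are arbitrary choices for illustration:

import nltk

def binned(fs, width=25):
    # Replace the raw integer with the index of the fixed-width bin it
    # falls in (0-24 -> 0, 25-49 -> 1, and so on).
    return {'a_bin': fs['a'] // width}

train = [(binned(fs), label) for (fs, label) in
         [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'),
          (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'),
          (dict(a=100), 'x')]]
classifier = nltk.NaiveBayesClassifier.train(train)
print classifier.classify(binned(dict(a=5)))   # bin 0, same bin as a=0 and a=20, so 'y'
print classifier.classify(binned(dict(a=97)))  # bin 3, same bin as 75..95, so 'x'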

-Steven Bird

dmtr

Apr 12, 2010, 4:59:54 PM
to nltk-users
Is that the only approach available in NLTK? It seems to me that
binning and similar approaches have inherent performance problems (you
are converting a single feature into multiple features), lose data,
and lose smoothness (bins at the extremes of the spectrum may contain
only a few elements in the whole corpus). As far as I remember, some
models (maxent, ... ?) are supposed to support scalar (real-valued)
features. Is there any way to use them in NLTK?

-- D


Raymond

Apr 13, 2010, 7:13:17 AM
to nltk-users
What do you want from this classification of integers?

dmtr

Apr 13, 2010, 5:54:52 PM
to nltk-users
The problem is that in my model (text classification) I have:
* several integer (and boolean) features characterizing the author
(author reputation - think citation count, author age, etc.);
* several integer (and boolean) features characterizing the text
(number of reviews, number of formulas, number of named entities,
length, etc.);
* many boolean features (the text content itself).

I was trying to assign weights to the features manually, and it sort of
works, but it's a rather tedious task. And some discriminative
model could probably give me more precise results.

I've looked at the NLTK code, though, and as far as I could see it was
specifically using binary classifiers (or generic classifiers in
binary mode). So I guess my best bet would be to use some maxent
optimization package directly (e.g. megam or
scipy.maxentropy.bigmodel) and discard all that NLTK stuff
altogether.

Richard Careaga

Apr 13, 2010, 7:42:24 PM
to nltk-...@googlegroups.com
I think you might be able to use NLTK to extract categorical variables from your text content and then use heavier-duty numeric packages to do the rest of the work; that seems more promising.

dmtr

Apr 13, 2010, 10:59:39 PM
to nltk-users
I'm gonna try to hack my own feature encoder (MaxentFeatureEncodingI)
and use it with megam.
It supports pretty much arbitrary parameters... I'll post my progress
and maybe even a patch.
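
A rough sketch of the kind of encoder I have in mind. The class and names are hypothetical, not the actual patch; it assumes the MaxentFeatureEncodingI interface of encode/length/labels/describe and that all feature values are numeric or boolean:

from nltk.classify.maxent import MaxentFeatureEncodingI

class NumericEncoding(MaxentFeatureEncodingI):
    # One joint-feature per (feature name, label) pair, whose value is the
    # numeric feature value itself rather than a 0/1 indicator.
    def __init__(self, labels, feature_names):
        self._labels = list(labels)
        self._mapping = {}
        for label in self._labels:
            for name in feature_names:
                self._mapping[(name, label)] = len(self._mapping)

    def encode(self, featureset, label):
        # Return a sparse list of (joint-feature id, value) pairs; the values
        # may be arbitrary floats, which is the whole point of the exercise.
        encoding = []
        for name, value in featureset.items():
            fid = self._mapping.get((name, label))
            if fid is not None:
                encoding.append((fid, float(value)))
        return encoding

    def labels(self):
        return self._labels

    def length(self):
        return len(self._mapping)

    def describe(self, fid):
        for (name, label), i in self._mapping.items():
            if i == fid:
                return '%s (with label %r)' % (name, label)
        return 'unknown joint-feature %d' % fid

    # Usage would then be something like (assuming MaxentClassifier.train
    # accepts an explicit encoding argument, as the NLTK maxent module
    # appears to):
    #   nltk.MaxentClassifier.train(train, algorithm='megam',
    #                               encoding=NumericEncoding(['x', 'y'], ['a']))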

dmtr

Apr 14, 2010, 6:43:38 PM
to nltk-users
The patch is at http://code.google.com/p/nltk/issues/detail?id=535
It seems to work OK for floats, but not so well for integers.