I have a very simple training corpus with one integer feature 'a' and
two labels 'x' and 'y':
a=0 -> y; a=20 -> y; a=35 -> y; a=75 -> x; a=80 -> x; a=95 -> x;
a=100 -> x;
Now I want to classify a=5, a=20 and a=97. By default the naive Bayes
classifier doesn't treat the integer feature as a continuous value,
returning P=0.5 for the previously unseen values a=5 and a=97. How can
I make it interpret this feature as an integer and not as a set of
distinct values?
-- D
P.S. here is the code example:
import nltk

# Each training instance is a (feature dict, label) pair.
train = [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'),
         (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'),
         (dict(a=100), 'x')]
print train

test = [dict(a=5), dict(a=20), dict(a=97), dict(a=99)]

classifier = nltk.NaiveBayesClassifier.train(train)
for t in test:
    pdist = classifier.prob_classify(t)
    print '%s: p(x) = %.4f p(y) = %.4f' % (t, pdist.prob('x'),
                                           pdist.prob('y'))
classifier.show_most_informative_features()
>>> type(20)
<type 'int'>
>>> type('20')
<type 'str'>
Or are you trying to say that my training set contains strings instead
of integers? I don't think this is true. From my example:
>>> train = [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'), (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'), (dict(a=100), 'x')]
>>> print train[1][0]['a']
20
>>> print type(train[1][0]['a'])
<type 'int'>
-- D
Well, I'm glad we are on the same page. That doesn't answer the
original question, though.
If the naive Bayes classifier doesn't support integer features, is
there any way to add such support?
Is it possible at all for the Bayes classifier? How about other
classifiers?
-- D
Richard described an approach known as binning; it's a popular way of dealing with scalar features.
-Steven Bird
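P.S. Here is a minimal sketch of the idea (untested; the bin width of
25 and the feature name 'a_bin' are arbitrary choices, not anything
NLTK provides):

import nltk

def binned(value, width=25):
    # Collapse the raw integer into a coarse bin label, so that
    # nearby values share the same nominal feature value.
    return dict(a_bin='bin%d' % (value // width))

train = [(binned(a), label) for (a, label) in
         [(0, 'y'), (20, 'y'), (35, 'y'),
          (75, 'x'), (80, 'x'), (95, 'x'), (100, 'x')]]
classifier = nltk.NaiveBayesClassifier.train(train)
for a in (5, 20, 97):
    pdist = classifier.prob_classify(binned(a))
    print 'a=%d: p(x) = %.4f  p(y) = %.4f' % (a, pdist.prob('x'),
                                              pdist.prob('y'))

Now a=5 falls into the same bin as the a=0 and a=20 training examples,
so it is no longer "unknown" to the classifier.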
On 8 Apr 2010 11:02, "dmtr" <dchi...@gmail.com> wrote:
I've already pointed out that a workaround like that is not what I was
looking for.
I'd rather use some native support for integer features.
-- D
On Apr 7, 5:18 pm, Richard Careaga <leuc...@gmail.com> wrote:
> Where I'd go next if you want to d...
On Apr 8, 12:53 am, Steven Bird <stevenbi...@gmail.com> wrote:
> Richard described an approach known as binning; it's a popular way
> of dealing with scalar features.
>
> -Steven Bird
>
I was trying to assign weights to the features manually, and it sort
of works, but it's rather tedious. A discriminative model could
probably give me more precise results anyway.
I've looked at the NLTK code, though, and as far as I can see it
specifically uses binary classifiers (or generic classifiers in
binary mode). So I guess my best bet would be to use some maxent
optimization package directly (e.g. megam or
scipy.maxentropy.bigmodel) and discard all that NLTK stuff
altogether.
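P.S. For the record, here is roughly what I mean by native support: a
hand-rolled Gaussian naive Bayes that models 'a' as a normally
distributed value per label. This is a quick sketch only, and it
assumes the feature really is roughly Gaussian within each class:

import math
from collections import defaultdict

train = [(dict(a=0), 'y'), (dict(a=20), 'y'), (dict(a=35), 'y'),
         (dict(a=75), 'x'), (dict(a=80), 'x'), (dict(a=95), 'x'),
         (dict(a=100), 'x')]

def train_gaussian_nb(data):
    # Estimate the label priors plus a per-label mean/variance
    # for the continuous feature 'a'.
    values = defaultdict(list)
    for feats, label in data:
        values[label].append(feats['a'])
    total = sum(len(xs) for xs in values.values())
    model = {}
    for label, xs in values.items():
        mean = sum(xs) / float(len(xs))
        var = sum((x - mean) ** 2 for x in xs) / float(len(xs)) or 1e-9
        model[label] = (float(len(xs)) / total, mean, var)
    return model

def classify(model, a):
    # Score each label by log P(label) + log N(a; mean, var).
    scores = {}
    for label, (prior, mean, var) in model.items():
        scores[label] = (math.log(prior)
                         - 0.5 * math.log(2 * math.pi * var)
                         - (a - mean) ** 2 / (2.0 * var))
    return max(scores, key=scores.get)

model = train_gaussian_nb(train)
for a in (5, 20, 97):
    print 'a=%d -> %s' % (a, classify(model, a))

With that, a=5 comes out as 'y' and a=97 as 'x', since the feature is
finally treated as a number rather than as an opaque value.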