HW1 Q3.2 apply binning


Shuchang Liu

Feb 3, 2013, 11:52:37 AM
to 10-701-spri...@googlegroups.com

Hi,

I have a question about Q3.2, 'applying binning for discrete components'.
Does this mean we should assign each x value to a bin and estimate the probabilities from the counts, OR should we use a kernel method to get a smooth pdf?
If the former, how do we decide the number and size of the bins? Does every discrete feature need the same number of bins, or does it depend on each feature's data?
If the latter, what kind of kernel should we use (Gaussian, Laplace, ..., or our own preference)?

Thanks!

Leila Wehbe

Feb 3, 2013, 12:49:22 PM
to Shuchang Liu, 10-701-spri...@googlegroups.com
For discrete components, each bin should correspond to one distinct x value (so the number of bins depends on how many values each feature takes).
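
As a quick illustration of what that estimation looks like (a minimal Python sketch, not the assignment's required MATLAB interface; the names and the Laplace smoothing are my own additions):

from collections import Counter

def discrete_bin_probs(feature_values, labels, alpha=1.0):
    # Estimate p(x_i = v | y) for ONE discrete feature: one bin per
    # distinct value v. alpha adds optional Laplace smoothing so that
    # values unseen in a class do not get probability zero.
    values = set(feature_values)
    probs = {}
    for y in set(labels):
        in_class = [v for v, l in zip(feature_values, labels) if l == y]
        counts = Counter(in_class)
        total = len(in_class) + alpha * len(values)
        probs[y] = {v: (counts[v] + alpha) / total for v in values}
    return probs

Repeating this for each feature gives one small table per (feature, class) pair.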

Leila



Shuchang Liu

Feb 3, 2013, 1:50:34 PM
to 10-701-spri...@googlegroups.com, Shuchang Liu
That helps me a lot.

Here are a few more questions about the kNN method.
1. Do the discrete and continuous cases need different algorithms, i.e., different distance functions?
2. After sorting by distance, if the k-th and (k+1)-th instances are at the same distance from the test point but have different y labels, which one should we use?
3. Do we need to consider per-feature weights in this question? And should neighbor weights (based on instance) be considered?

Thanks!



Yitong Zhou

Feb 3, 2013, 4:20:25 PM
to 10-701-spri...@googlegroups.com, Shuchang Liu
Leila,
I am also confused about using binning. Say for each X we have 5 features and each feature takes 3 different values. Should the total number of bins be 3 x 5 = 15? Or 3^5 (if we do not assume P(X1, X2, ..., Xn | Y) = Π_i P(Xi | Y))?

Thanks,
Yitong


Leila Wehbe

Feb 3, 2013, 4:34:03 PM
to Shuchang Liu, 10-701-spri...@googlegroups.com
Shuchang,
1- The discrete and continuous cases use the same distance function (in the preprocessing step, each feature should be normalized).
2- Regarding breaking ties in this specific case, you have multiple options: you can consider both and break ties randomly, consider each with half weight and then break ties randomly, or disregard them (see the sketch below for one option). Just indicate what you did in the comments.
3- This question is not about weighted k-NN. You can think of all the features as having weight 1.
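
For concreteness, a minimal Python sketch of points 1 and 2 (my own illustration, not the required implementation; it normalizes each feature, keeps every point tied at the k-th distance, and breaks any remaining vote tie randomly):

import numpy as np

def knn_predict(X_train, y_train, x_test, k):
    # Normalize each feature (assumes no feature is constant).
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    Xn = (X_train - mu) / sigma
    xn = (x_test - mu) / sigma

    # One distance function for all (normalized) features.
    dist = np.linalg.norm(Xn - xn, axis=1)
    order = np.argsort(dist)

    # Keep every point tied with the k-th smallest distance, vote,
    # and break remaining ties between labels randomly.
    kth = dist[order[k - 1]]
    neighbors = dist <= kth
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return np.random.choice(labels[counts == counts.max()])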

Leila



Leila Wehbe

Feb 3, 2013, 4:37:14 PM
to Yitong Zhou, 10-701-spri...@googlegroups.com, Shuchang Liu
Yitong,

The main assumption in Naive Bayes is that P(X1, X2, ..., Xn | Y) = Π_i P(Xi | Y), and that allows you to estimate the conditional distribution of each feature independently. So yes, you would have 3 x 5 = 15 bins.
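
To spell out the counting (my own elaboration): with that factorization, the classifier computes

P(Y | X1, ..., X5) ∝ P(Y) * Π_{i=1..5} P(Xi | Y),

and each factor P(Xi | Y) is a table with one entry per value of Xi, i.e. 3 entries. Five such tables give 5 x 3 = 15 bins, whereas modeling the joint P(X1, ..., X5 | Y) directly would need 3^5 = 243 bins per class.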

Leila



Shuchang Liu

Feb 3, 2013, 4:52:48 PM
to 10-701-spri...@googlegroups.com, Shuchang Liu
Hi Leila,

If there is no difference between the distance functions for discrete and continuous features in kNN, what is the input 'attributes' used for?

Thanks for your patience! :) 

Leila Wehbe

Feb 3, 2013, 5:03:31 PM
to Shuchang Liu, 10-701-spri...@googlegroups.com
It doesn't need to be used here.

Leila



milad memarzadeh

Feb 3, 2013, 5:11:24 PM
to Leila Wehbe, Shuchang Liu, 10-701-spri...@googlegroups.com
I have written the code for the discrete variables and the distance function. If the attributes are strings, you cannot normalize them or use the same distance function as for continuous variables, so I developed a distance function for discrete variables based on my own idea. Is that acceptable? The results look reasonable.
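
For readers wondering what such a function might look like, one common choice is a 0/1 mismatch term for string attributes (a generic sketch, not necessarily Milad's version):

def mixed_distance(a, b, is_discrete):
    # Squared difference for (normalized) continuous features,
    # 0/1 mismatch for discrete/string features.
    d = 0.0
    for x, y, disc in zip(a, b, is_discrete):
        if disc:
            d += 0.0 if x == y else 1.0
        else:
            d += (x - y) ** 2
    return d ** 0.5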

The other question: should we print the code and submit it with the other materials, or do you need the MATLAB file? If you need the file, can we submit numbers 1 and 2 as hard copy and number 3 online?

Thanks,
Milad
--
Milad Memarzadeh, M.Sc.
Doctoral Candidate, Advanced Infrastructure Systems
Department of Civil and Environmental Engineering
Carnegie Mellon University

Leila Wehbe

Feb 3, 2013, 5:14:02 PM
to milad memarzadeh, Shuchang Liu, 10-701-spri...@googlegroups.com
This is acceptable.

You should not print the code but send it via email. You can send all of number 3 by email if you wish.

Leila

Alex Smola

Feb 3, 2013, 9:19:07 PM
to Leila Wehbe, Shuchang Liu, 10-701-spri...@googlegroups.com
hi guys,

there seems to be plenty of confusion about Naive Bayes. remember, the algorithm is such that you first need to estimate

p(x_i|y)

that is, you need to estimate the distribution of EACH COORDINATE SEPARATELY given y. this means that if you have three dimensions with 5 values each, you do NOT NEED to create 5^3 bins but rather just 15 bins.

also, discrete variables are not all the same. suppose the discrete variable is the number of speeding tickets. then some people might have a sufficiently large count that binning by the exact number won't work (i.e. there will be only very few people with, say, 199 tickets, and it won't matter very much to distinguish between 199 and 200 tickets). in this case you have a nice ordinal relationship and it is probably worthwhile exploiting it. e.g. you could define bins for {0}, {1}, {2}, ... {10}, {11..12}, {13..15}, {16..20} and so on.
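
as a concrete sketch (python; the bin edges past {16..20} are made up for illustration):

import numpy as np

# singleton bins {0},{1},...,{10}, then {11..12}, {13..15}, {16..20},
# and invented wider ranges after that; values >= 51 share one bin.
edges = list(range(0, 12)) + [13, 16, 21, 31, 51]

tickets = np.array([0, 2, 12, 14, 199])
bins = np.digitize(tickets, edges)  # bin index per count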

i hope that helps,

alex
--
                            ''~``
                           ( o o )
+----------------------.oooO--(_)--Oooo.---------------------------+
| Prof. Alexander J. Smola             http://alex.smola.org       |
| 5000 Forbes Ave                      phone: (+1) 408-759-1044    |
| Gates Hillman Center 8002            Carnegie Mellon University  |
| Pittsburgh 15213 PA    (   )   Oooo. Machine Learning Department |
+-------------------------\ (----(   )-----------------------------+                          
                          \_)    ) /
                                (_/

Mark

Feb 5, 2013, 12:02:49 PM
to 10-701-spri...@googlegroups.com, Leila Wehbe, Shuchang Liu, al...@smola.org
In storing the contents of the model in the output from nb_train, I wonder whether we should have (A) 5*3 elements of p(h|D), or (B) 5*3 elements of p(h=1|D) plus 5*3 elements of p(h=0|D). Or will p(h=1|D) + p(h=0|D) = 1 always hold for discrete attributes? Also, for the continuous attributes it seems to me that the distributions p(D|h=0) and p(D|h=1), which we model as Gaussians, cannot be collapsed into a single p(D|h). But I might be misunderstanding this; I just wanted to clear it up.

Barnabas Poczos

Feb 5, 2013, 1:10:08 PM
to Mark, 10-701-spri...@googlegroups.com, Leila Wehbe, Shuchang Liu, al...@smola.org
(I) If h is a discrete random variable that can take two values, 0 or 1, then p(h=1|D) + p(h=0|D) = 1. If you know p(h=1|D), then you know p(h=0|D) as well.

(II) p(D|h) is an abbreviation for p(D|h=0) and p(D|h=1). In the case of continuous (e.g. Gaussian) attributes, p(D|h=0) and p(D|h=1) are two different distributions with different mean and variance parameters. You have to learn the parameters for both p(D|h=0) and p(D|h=1).
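
A minimal Python sketch of point (II), with illustrative names only (this is not the assignment's nb_train interface):

import numpy as np

def fit_class_gaussians(x, h):
    # Separate Gaussian parameters for p(x|h=0) and p(x|h=1):
    # one (mean, variance) pair per class, for a single continuous
    # feature x with binary labels h (numpy arrays).
    return {c: (x[h == c].mean(), x[h == c].var()) for c in (0, 1)}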