Discretizer review Q


Andrei Perhinschi

Oct 6, 2013, 4:58:40 PM
to cs...@googlegroups.com
Hello all,

I am having trouble understanding the discretizer described in the week of sep 17 review questions. Is it supposed to be an equal width discretizer with N=3? I figured so due to (max-min)/3. Furthermore what constitutes a "dull bin" and how does entropy factor in? My confusion stems from not understanding what to do with the output of the EWD algorithm (menzies.us/cs573/?nums2syms) in relation to the entropy calculation.

Thank you,
Andrei

Tim Menzies

Oct 7, 2013, 10:50:01 AM
to cs...@googlegroups.com

hey andrei,

thanks for your question

one clarification (and apology): there is some summary cr*p in those
examples. so please ignore lines 2 and 3 of each example. which
means, please ignore

73.57, no,
6.572, 0.6429,

and

69.07, yes
3.605, 0.5


> Is it supposed to be an equal width discretizer with N=3? I figured so due to (max-min)/3.

think of it as a 2 pass algorithm:

pass1: divide into 3 (equal width)

pass2: look at the entropy in each bin (entropy measured from the
class symbols found in each bin).

--- if N consecutive bins have similar entropy, then those N are dull
(since the decision variable does not change across their width) and
can be replaced by 1 bin.
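
the two passes above could be sketched roughly as follows. this is only an illustration, not the course's reference code: the function names, the `tol` threshold used to decide when two entropies count as "similar", and the 9 rows of sample data are all assumptions made up for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class symbols."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def equal_width_bins(rows, n_bins=3):
    """Pass 1: split (value, label) rows into n_bins equal-width bins."""
    values = [v for v, _ in rows]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0   # guard against all values being equal
    bins = [[] for _ in range(n_bins)]
    for v, label in rows:
        i = min(int((v - lo) / width), n_bins - 1)  # max value goes in last bin
        bins[i].append((v, label))
    return [b for b in bins if b]       # drop empty bins

def merge_dull_bins(bins, tol=0.05):
    """Pass 2: merge consecutive bins whose class entropies differ by <= tol.

    tol is a hypothetical threshold; the thread only says "similar".
    """
    merged = [list(bins[0])]
    prev_e = entropy([label for _, label in bins[0]])
    for b in bins[1:]:
        e = entropy([label for _, label in b])
        if abs(e - prev_e) <= tol:
            merged[-1].extend(b)        # dull: same entropy as its neighbour
        else:
            merged.append(list(b))
        prev_e = e
    return merged

# made-up sample data: (numeric feature, class symbol)
rows = [(1, "a"), (2, "a"), (3, "b"),
        (4, "a"), (5, "b"), (6, "a"),
        (7, "b"), (8, "b"), (9, "b")]

bins = equal_width_bins(rows, n_bins=3)
merged = merge_dull_bins(bins)
print([len(b) for b in bins])    # sizes after pass 1
print([len(b) for b in merged])  # sizes after pass 2
```

here the first two bins each hold two a's and one b, so their entropies match and they collapse into one bin, while the last (all b) bin stays separate.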

as to how to apply the entropy calc..... suppose we had 8 rows of data
with two class values (a and b).

f1 class
== ==
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b

a four-way split (equal width) on f1 would generate

1 a
2 a

3 a
4 a

5 b
6 b

7 b
8 b

if we split between 4 and 5, we would have n1=4 and n2=4 rows below
and above the split, and each side would have entropy e1=e2=0, in which
case the score for this split would be:

score1 = n1/(n1+n2)e1 + n2/(n1+n2)e2 = 4/8*0 + 4/8*0 = 0

but if we split between 6 and 7, we would have n1=6 and n2=2 and e1=
entropy(4/6,2/6) and e2 = 0. in which case, the score for that split
would be

score2= 6/8*ent(4/6,2/6) + 2/8*0

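note that score2 (about 0.69) > score1 (0), and lower scores are better
since they mean less class mixing on each side of the split. i.e.
whatever we did to generate score1 was BETTER than whatever we did to
generate score2.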

so we would prefer the first split
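
for anyone who wants to check the arithmetic, here is a small sketch of the two scores on the 8-row example. the helper names `entropy` and `split_score` are just labels chosen for this illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class symbols."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def split_score(left, right):
    """Size-weighted entropy of the two sides of a split (lower is better)."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return n1 / n * entropy(left) + n2 / n * entropy(right)

# class column of the 8 rows, already sorted by f1 = 1..8
classes = list("aaaabbbb")

score1 = split_score(classes[:4], classes[4:])  # split between 4 and 5
score2 = split_score(classes[:6], classes[6:])  # split between 6 and 7

print(score1)            # both sides pure, so no entropy at all
print(round(score2, 3))  # 6/8 * ent(4/6, 2/6) + 2/8 * 0
```

score1 comes out as 0 and score2 as roughly 0.69, which matches the conclusion that the split between 4 and 5 is the one to keep.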

t




--
:: there are some who call me... tim.m...@gmail.com
:: prof @ cs.ai.se.csee.wvu.usa.sol.virgo.all.nil
:: +1-304-376-2859
:: http://menzies.us (skype = menzies.tim)

<hubris>
vita= http:// goo.gl/8eNhY
pubs= http:// goo.gl/8KPKA
stats= http:// goo.gl/vggy1
wow = http:// goo.gl/2Wg9A
</hubris>

Andrei Perhinschi

Oct 7, 2013, 12:17:51 PM
to cs...@googlegroups.com
Dr. Menzies,

Thank you for the explanation, especially for pointing out that bin entropy is computed relative to the class values; that was one of the parts I was getting wrong. Just to make sure I understand this properly: in both review question examples, after combining the dull bins I end up with the first 10 rows in one bin and the last 4 rows in a second bin (since in both examples, after chopping into 3 equal width bins with 5, 5, and 4 rows, the first two bins have equal entropy). If I'm correct so far, what is the need for the scoring calculation at this point, since all I have are two bins anyway? Also, on a somewhat related note, would you say this is a supervised discretizer since it uses class value entropy?

Thank you,
Andrei