Choosing between hypotheses

Ross Clement

Aug 19, 2004, 4:15:43 PM

Hi. I have a problem like this.

I am generating hypotheses in the form of classification rules, with
the general form:

if <some conditions>
then predict a particular class.

When I'm generating these rules, I need to choose between different
rules, i.e. I need a simple method of deciding which of two rules is
the better one. I have a large set of data items, each of which
includes both the attributes used in the rules' conditions and the
true class. For any particular rule I can find which data items are
'covered' by the rule (i.e. their attributes match the conditions of
the rule) and, of these data items, the number for which the rule
predicts the correct (or incorrect) class.

Imagine that I have two rules, both of which cover 1000 data items.
One rule predicts the correct class for 980 of these, and the other
predicts the correct class for 990 of them. In this case it is clear
that the second rule is better. However, if I have one rule that
covers 10 data items and classifies 9 of them correctly, and another
rule that covers 1000 data items and classifies 880 of them correctly,
then things are less clear. The first rule has the higher predicted
accuracy (0.9 versus 0.88). However, the sample size of the second
rule is so much larger that we'd have far more confidence in the rule
actually achieving its predicted accuracy. And, as a human, if I had
my choice I'd plump for the second rule.

What I'm curious about is whether methods similar to statistical
hypothesis testing could be applied here (inspired by Anna Hart's use
of chi-squared significance tests when deciding how to build a
Decision Tree)? If we assume (slightly dodgy assumption) that the data
was sampled from a binomial distribution, then we could look at the
highest p where we can reject the hypothesis that the results were
sampled from a binomial distribution with that p. E.g. if we have 9
hits out of 10 trials, then we can find the highest p, where we can
reject the hypothesis that the data came from that distribution with
say 95% confidence. This p is likely to be (I believe, without doing
the calculations) less than the p for the 880 out of 1000 data. So, it
seems that I could use this p to choose between the two hypotheses.
Or, conversely, I could estimate the probability of getting >=
observed hits out of number of data items covered by the rule given
p=0.5 say. Perhaps the smallest probability would indicate the
"better" rule.

The bottom line is that I have data which I can use to test if one or
both of these approaches out-perform the simpler approach of choosing
the rule with the highest raw accuracy. So, I'm not stuck. But, I'm
wondering if there is an area of statistical theory that I don't know
of or where I haven't realised that it's applicable. It could be that
information theory applies here, and the normal methods of building
Decision Trees are based on Information Theory, but I'm not quite sure
how it could be applied to the case where the number of data items
covered by rules being compared is not the same, and the sets of data
items covered by rules are potentially overlapping, but not identical
sets.

Any hints?

Thanks in anticipation,

Ross-c

Ray Koopman

Aug 19, 2004, 5:07:29 PM

The Clopper-Pearson confidence limits for a binomial proportion
are the smallest and largest hypothetical values of p
that would not be rejected by the observed data.
For 9/10 the 95% limits are (.554984, .997471);
for 880/1000 they are (.858233, .899499).
A comparison of Bayesian posterior densities
would leave you in much the same predicament.
You need some sort of scalar utility function,
which will necessarily be subjective.
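
For anyone who wants to recompute limits like these, here is a minimal sketch in plain Python. It assumes the usual Clopper-Pearson definition (each limit puts alpha/2 in one tail) and finds the limits by bisection on the binomial tail sums. The function names are made up for illustration and are not from any library.

```python
import math

def log_pmf(i, n, p):
    """Log of P(X = i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def upper_tail(k, n, p):
    """P(X >= k); increases as p increases."""
    return sum(math.exp(log_pmf(i, n, p)) for i in range(k, n + 1))

def lower_tail(k, n, p):
    """P(X <= k); decreases as p increases."""
    return sum(math.exp(log_pmf(i, n, p)) for i in range(0, k + 1))

def solve(f, target, increasing, iters=50):
    """Bisection on (0, 1) for the p at which the monotone f(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid   # the solution lies at a larger p
        else:
            hi = mid   # the solution lies at a smaller p
    return (lo + hi) / 2

def clopper_pearson(k, n, conf=0.95):
    """Two-sided limits for k hits out of n trials."""
    alpha = 1.0 - conf
    lower = 0.0 if k == 0 else solve(lambda p: upper_tail(k, n, p), alpha / 2, True)
    upper = 1.0 if k == n else solve(lambda p: lower_tail(k, n, p), alpha / 2, False)
    return lower, upper

for hits, covered in [(9, 10), (880, 1000)]:
    low, high = clopper_pearson(hits, covered)
    print(f"{hits}/{covered}: ({low:.6f}, {high:.6f})")
```

The tail sums are done in log space so that the 1000-item case does not run into floating-point underflow in the individual terms.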

Ross Clement

Aug 20, 2004, 7:12:33 AM

"Ray Koopman" <koo...@sfu.ca> wrote in message news:<cg34qh$i...@odak26.prod.google.com>...

>
> The Clopper-Pearson confidence limits for a binomial proportion
> are the smallest and largest hypothetical values of p
> that would not be rejected by the observed data.
> For 9/10 the 95% limits are (.554984, .997471);
> for 880/1000 they are (.858233, .899499).
> A comparison of Bayesian posterior densities
> would leave you in much the same predicament.
> You need some sort of scalar utility function,
> which will necessarily be subjective.

Thanks. This sounds like what I need. As I have data which I can use
to evaluate the performance of different rule selection strategies, I
can probably apply optimisation techniques to come up with the scalar
utility function.

Cheers,

Ross-c

ctc...@hotmail.com

Aug 20, 2004, 8:18:40 PM

cle...@wmin.ac.uk (Ross Clement) wrote:
...

>
> Imagine that I have two rules, both of which cover 1000 data items.
> One rule predicts the correct class for 980 of these, and the other
> predicts the correct class for 990 of them. In this case it is clear
> that the second rule is better. However, if I have one rule that
> covers 10 data items, 9 of which correctly, and another rule that
> covers 1000 data items, 880 of which correctly, then things are less
> clear. The first rule has the highest predicted accuracy (0.9 versus
> 0.88). However, the sample size of the second rule is so much larger,
> that we'd have far more confidence in the rule actually achieving its
> predicted accuracy.
> And, as a human, if I had my choice I'd plump for the second rule.
>
> What I'm curious about is whether methods similar to statistical
> hypothesis testing could be applied here (inspired by Anna Hart's use
> of chi-squared significance tests when deciding how to build a
> Decision Tree)? If we assume (slightly dodgy assumption) that the data
> was sampled from a binomial distribution, then we could look at the
> highest p where we can reject the hypothesis that the results were
> sampled from a binomial distribution with that p.

I think you want the highest p that would be rejected *for being too
small*; not the highest p that would be rejected in general (which would
pretty much always be 1). The one you want is just one quantum lower than
the lowest p that would *not* be rejected.

I call this the pessimal value, the most pessimistic value that is still
"probable" (i.e. consistent with the confidence interval.) I don't know if
it has a more formal name or not.

Of course, you still have to set alpha for the CI. I generally
use 1/N/5, i.e. 1/(5N), where N is the number of alternatives inspected,
but I can't really justify that. (But then again I can't justify any of
this stuff, I am not a real statistician.)
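
Written out, the value described above might look like the following. This is one reading of the description, not a standard definition: k is the number of correct predictions, n the number of items the rule covers, and alpha is assumed here to be a one-sided error rate (the post does not say whether it is one-sided or two-sided).

```latex
% Smallest p that the data do not reject for being too small:
% every p below this value makes k or more hits out of n less likely than alpha.
p_{\mathrm{low}}(k, n, \alpha)
  = \min\Bigl\{\, p \in [0, 1] :
      \sum_{i=k}^{n} \binom{n}{i}\, p^{i} (1 - p)^{n - i} \ge \alpha \Bigr\},
\qquad
\alpha = \frac{1}{5N}
```

Under this reading the value is the one-sided Clopper-Pearson lower limit, and shrinking alpha as the number of alternatives N grows pushes the limit down more for rules with small coverage than for rules with large coverage.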

> Or, conversely, I could estimate the probability of getting >=
> observed hits out of number of data items covered by the rule given
> p=0.5 say. Perhaps the smallest probability would indicate the
> "better" rule.

No, don't use that one. You want the rule that is probably better than the
other rule, not the one that is most definitely higher than average.

Xho


Aleks Jakulin

Aug 26, 2004, 11:20:47 AM

Ross Clement wrote:

> Ray Koopman wrote:
> >
> > The Clopper-Pearson confidence limits for a binomial proportion
> > are the smallest and largest hypothetical values of p
> > that would not be rejected by the observed data.
> > For 9/10 the 95% limits are (.554984, .997471);
> > for 880/1000 they are (.858233, .899499).
>
> Thanks. This sounds like what I need. As I have data which I can use
> to evaluate the performance of different rule selection strategies, I
> can probably apply optimisation techniques to come up with the scalar
> utility function.

You can take advantage of the econometric concept of "value at risk":
pick the model that guarantees better performance, neglecting the
worst X% of situations.

If X=5%, then you would pick the 880/1000 proportion, as the
guaranteed performance is 0.86, while the guaranteed performance for
9/10 is 0.55 (much worse).
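
As a rough sketch, the "guaranteed performance" can be computed as a one-sided Clopper-Pearson lower bound, which is an assumption about how to formalise "neglecting the worst X%". The function names below are hypothetical. Note that the 0.86 and 0.55 figures match the lower ends of the two-sided 95% limits quoted earlier in the thread, which leave 2.5% in each tail; a one-sided bound at X=5% comes out a little higher (roughly 0.61 for 9/10) but ranks the two rules the same way.

```python
import math

def upper_tail(hits, covered, p):
    """P(X >= hits) for X ~ Binomial(covered, p), with 0 < p < 1."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(hits, covered + 1):
        log_term = (math.lgamma(covered + 1) - math.lgamma(i + 1)
                    - math.lgamma(covered - i + 1)
                    + i * log_p + (covered - i) * log_q)
        total += math.exp(log_term)
    return total

def guaranteed_accuracy(hits, covered, worst_fraction):
    """The p at which `hits` or more successes has probability `worst_fraction`,
    found by bisection (the tail probability increases with p)."""
    if hits == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail(hits, covered, mid) < worst_fraction:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for x in (0.05, 0.025):
    small = guaranteed_accuracy(9, 10, x)
    large = guaranteed_accuracy(880, 1000, x)
    print(f"X={x:.3f}: 9/10 -> {small:.3f}, 880/1000 -> {large:.3f}")
```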

This same "value at risk" logic underlies significance testing in
general: prefer the null model unless the alternative will
guarantee you lower error in 95 or 99 or 99.5 percent of samples of
the same size.

--
mag. Aleks Jakulin
http://www.ailab.si/aleks/
Artificial Intelligence Laboratory,
Faculty of Computer and Information Science,
University of Ljubljana,
Slovenia.