
Cosine, Aug 11, 2023, 9:28:53 PM:

Hi:

Suppose we have 5 algorithms: A, B, C, D, and E, and we perform the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.

Kind-1:

M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E

Then we claim that A performs better than the other 4 algorithms.

Kind-2:

M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,

M_B > M_C, M_B > M_D, M_B > M_E,

M_C > M_D, M_C > M_E, and

M_D > M_E

Then we claim that A performs best among all 5 algorithms.
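
A quick sketch of the difference (Python; the algorithm names are placeholders): Kind-1 involves n - 1 comparisons against A, while Kind-2 involves all C(n,2) pairwise comparisons.

    from itertools import combinations

    algorithms = ["A", "B", "C", "D", "E"]

    # Kind-1: A against each of the others -- n - 1 = 4 comparisons.
    kind1 = [("A", other) for other in algorithms[1:]]

    # Kind-2: every unordered pair -- C(5, 2) = 10 comparisons.
    kind2 = list(combinations(algorithms, 2))

    print(len(kind1), kind1)  # 4 pairs, all involving A
    print(len(kind2), kind2)  # 10 pairs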


Rich Ulrich, Aug 12, 2023, 12:10:06 AM:

On Fri, 11 Aug 2023 18:28:50 -0700 (PDT), Cosine <ase...@gmail.com> wrote:

>Hi:

>

> Suppose we have 5 algorithms: A, B, C, D, and E, and we perform the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.

>

>Kind-1:

>

> M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E

>

> Then we claim that A performs better than the other 4 algorithms.

It seems that you are describing the RESULT of a set

of comparisons. The two 'kinds' would be: A versus each of the

others, and "all comparisons among them."

You should say, "on these test data" and "better on M than ..."

and "performed" (past tense).

>

>Kind-2:

>

> M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,

> M_B > M_C, M_B > M_D, M_B > M_E,

> M_C > M_D, M_C > M_E, and

> M_D > M_E

>

> Then we claim that A performs best among all 5 algorithms.

>

I would state that A performed better (on M) than the rest, and also

that the rest were strictly ordered in how well they performed.

--

Rich Ulrich


Cosine, Aug 12, 2023, 2:31:44 AM:

Rich Ulrich wrote on Saturday, August 12, 2023 at 12:10:06 PM [UTC+8]:

In other words, if the purpose is only to demonstrate that A performed better on M than the other 4 algorithms, we only need to do the first kind of comparison. We do the second kind only if we want to demonstrate the ordering.

By the way, it seems that to reach the desired conclusion, both kinds of comparison require multiple tests. The first kind requires 4 (= 5 - 1) and the second requires C(5,2) = 10.

Therefore, if we use the Bonferroni correction, the significance level will be corrected to alpha/(n-1) and alpha/C(n,2), respectively.

If we use more than one metric, e.g., M_1 to M_m, then we need to further divide the previous alphas by m, right?

But wouldn't the corrected alpha value be too small, especially for larger values of n and m?
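
As a rough sketch of that arithmetic (Python; alpha = 0.05 and m = 3 are purely illustrative choices):

    from math import comb

    alpha = 0.05  # nominal significance level (illustrative)
    n = 5         # number of algorithms
    m = 3         # number of metrics (illustrative)

    alpha_kind1 = alpha / (n - 1)     # 0.05 / 4  = 0.0125
    alpha_kind2 = alpha / comb(n, 2)  # 0.05 / 10 = 0.005

    # Dividing further by the number of metrics:
    print(alpha_kind1 / m)  # ~0.00417
    print(alpha_kind2 / m)  # ~0.00167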


Rich Ulrich, Aug 12, 2023, 3:24:22 PM:

On Fri, 11 Aug 2023 23:31:41 -0700 (PDT), Cosine <ase...@gmail.com> wrote:

>Rich Ulrich wrote on Saturday, August 12, 2023 at 12:10:06 PM [UTC+8]:

Before you take on 'multiple comparisons' and p-levels, you ought

to have a Decision to be made, or a question: What do you have

here? Making a statement about what happens to fit the sample

best does not require assumptions; drawing inferences to elsewhere

does require assumptions.

Who or what does your sample /represent/? Where do the algorithms

come from? (and how do they differ?). What are you hoping to

generalize to?

I can imagine that your second set of results could be a summary

of step-wise regression, where Metric is the R-squared and A is

the result after multiple steps. Each step shows an increase in

R-squared, by definition. Ta-da!
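
A toy sketch of that point (Python with numpy and scikit-learn; the data are pure noise, not from any real study): in-sample R-squared can only rise as each step enters another predictor.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))  # 10 candidate predictors, all noise
    y = rng.normal(size=50)        # outcome unrelated to any predictor

    # In-sample R-squared is non-decreasing as each "step" enters
    # one more variable, even though every predictor is pure chance.
    for k in range(1, 11):
        r2 = LinearRegression().fit(X[:, :k], y).score(X[:, :k], y)
        print(f"{k} predictors: in-sample R^2 = {r2:.3f}")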

The hazards of step-wise regression are well-advertised by now.

I repeated Frank Harrell's commentary multiple times in the stats

Usenet groups, and others picked it up. I can add: When there

are dozens of candidate variables to Enter, each step is apt to

provide a WORSE algorithm when applied to a separate sample for

validation. Sensible algorithms usually require the application of

good sense by the developers -- instead of over-capitalizing on

chance in a model built on limited data.

If you have huge data, then you should also pay attention to

robustness and generalizability across sub-populations, rather

than focus on p-levels for the whole shebang.

>

>Therefore, if we use the Bonferroni correction, the significance level will be corrected to alpha/(n-1) and alpha/C(n,2), respectively.

In my experience, I talked people out of corrections many times

by cleaning up their questions. Bonferroni fits best when you

have /independent/ questions of equal priority. And when you

have a reason to pay heed to family-wise error.
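
For k independent tests at nominal alpha, the family-wise error rate is 1 - (1 - alpha)^k; a few lines of Python (alpha = 0.05 illustrative) make the numbers concrete:

    alpha = 0.05
    for k in (4, 10):  # the two comparison counts discussed earlier
        fwer_raw = 1 - (1 - alpha) ** k        # no correction
        fwer_bonf = 1 - (1 - alpha / k) ** k   # Bonferroni-corrected
        print(f"k={k}: uncorrected FWER={fwer_raw:.3f}, "
              f"Bonferroni FWER={fwer_bonf:.3f}")
    # k=4:  uncorrected 0.185, corrected ~0.049
    # k=10: uncorrected 0.401, corrected ~0.049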

>

>If we use more than one metric, e.g., M_1 to M_m, then we need to further divide the previous alphas by m, right?

>

>But wouldn't the corrected alpha value be too small, especially for larger values of n and m?

If you don't have any idea what you are looking for, one common

procedure is to proclaim the effort 'exploratory' and report

the nominal levels.

--

Rich Ulrich


Cosine, Aug 12, 2023, 7:03:55 PM:

Hmm, let's start by asking or clarifying the research questions then.

Many machine learning papers I have read use a set of metrics to show that the developed algorithm performs best compared to a set of benchmarks.

Typically, the authors list metrics such as accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), recall, F1-score, Dice score, etc.

Next, the authors list 4-6 published algorithms as benchmarks. These algorithms have similar designs and serve the same purpose as the developed one, e.g., segmentation, classification, or detection/diagnosis.

Then the authors run the developed algorithm and the benchmarks on the same dataset to get values for each of the metrics listed.

Next, the authors conduct the statistical analysis by comparing the values of the metrics to demonstrate that the developed algorithm is the best, and sometimes the rank of the algorithms (the developed one and all the benchmarks).

Finally, the authors pick out the results showing favorable comparisons and claim these as the contribution(s) of the developed algorithm.

It looks to me as if the authors are doing statistical tests comparing multiple algorithms on multiple metrics to conclude the final (single or multiple) contribution(s) of the developed algorithm.


Rich Ulrich, Aug 14, 2023, 7:03:37 PM:

On Sat, 12 Aug 2023 16:03:52 -0700 (PDT), Cosine <ase...@gmail.com> wrote:

>Hmm, let's start by asking or clarifying the research questions then.

>

>Many machine learning papers I have read use a set of metrics to show that the developed algorithm performs best compared to a set of benchmarks.

>

>Typically, the authors list metrics such as accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), recall, F1-score, Dice score, etc.

>

>Next, the authors list 4-6 published algorithms as benchmarks. These algorithms have similar designs and serve the same purpose as the developed one, e.g., segmentation, classification, or detection/diagnosis.

Okay. You are outside the scope of what I have read.

Whatever I read about machine learning, decades ago, was

far more primitive or preliminary than this. I can offer a note

or two on 'reading' such papers.

>

>Then the authors run the developed algorithm and the benchmarks using the same dataset to get the values of each of the metrics listed.

>

>Next, the authors conduct the statistical analysis by comparing the values of the metrics to demonstrate that the developed algorithm is the best, and sometimes the rank of the algorithms (the developed one and all the benchmarks).

Did the statistics include p-values?

The comparison I can think of is the demonstrations I have

seen about 'statistical tests' offered for consideration. That is,

authors are comparing (say, too simplistically) Student's t-test to

a t-test for unequal variances, or to a t on rank-orders.

Here, everyone can inspect the tests and imagine when they

will differ; randomized samples are created which feature various

aspects of non-normality, for various matches of Ns. What is

known is that the tests will differ -- a 5% test does not 'reject'

2.5% at each end, when computed on 10,000 generated samples,

when its assumptions are intentionally violated.

What is interesting is how MUCH they differ, and how much more

they differ for smaller N or for smaller alpha.
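
A minimal simulation sketch in that spirit (Python with scipy; the exponential population and N = 10 per group are arbitrary assumptions): with the null true, a well-calibrated 5% test should reject about 5% of the time.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n, alpha = 10_000, 10, 0.05
    rejections = 0
    for _ in range(n_sims):
        # Both samples come from the same skewed population,
        # so the null hypothesis of equal means is true.
        x = rng.exponential(size=n)
        y = rng.exponential(size=n)
        if stats.ttest_ind(x, y).pvalue < alpha:
            rejections += 1
    print(f"empirical rejection rate: {rejections / n_sims:.3f}")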

>

>Finally, the authors pick out the results showing favorable comparisons and claim these as the contribution(s) of the developed algorithm.

>

> It looks to me as if the authors are doing statistical tests comparing multiple algorithms on multiple metrics to conclude the final (single or multiple) contribution(s) of the developed algorithm.

So, what I know (above) won't apply if you have to treat the

algorithms as 'black-box' operations -- you can't predict when

an algorithm will perform its best.

I think I would be concerned about the generality of the test

bank, and the legitimacy/credibility of the authors.

I can readily imagine a situation like with the 'meta-analyses'

that I read in the 1990s: You need a good statistician and a

good subject-area scientist to create a good meta-analysis, and

most of the ones I read had neither.

--

Rich Ulrich


Cosine, Aug 15, 2023, 11:10:18 AM:

Well, let's consider a more classical problem.

Regarding English teaching methods for high school students, we develop a new method (A1) and want to demonstrate whether it performs better than other methods (A2, A3, and A4) by comparing the average scores of experimental classes taught with the different methods. Each comparison uses a paired t-test. Since each comparison is independent of the others, the correct significance level using the Bonferroni correction is alpha_original/(4-1).

Suppose we want to investigate whether the developed method (A1) is better than the other methods (A2, A3, and A4) for English, Spanish, and German; then the corrected alpha = alpha_original/(4-1)/3.
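
A minimal sketch of that design (Python with scipy; the score arrays are hypothetical placeholders, one value per paired unit such as a teacher), with each comparison against A1 judged at the Bonferroni-corrected level:

    import numpy as np
    from scipy import stats

    alpha = 0.05            # nominal level (illustrative)
    alpha_corr = alpha / 3  # A1 vs A2, A1 vs A3, A1 vs A4

    rng = np.random.default_rng(0)
    # Hypothetical paired scores, e.g., one pair of classes per teacher:
    scores = {m: rng.normal(70, 10, size=20)
              for m in ("A1", "A2", "A3", "A4")}

    for other in ("A2", "A3", "A4"):
        t, p = stats.ttest_rel(scores["A1"], scores[other])
        print(f"A1 vs {other}: p = {p:.3f}, reject = {p < alpha_corr}")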


Rich Ulrich, Aug 19, 2023, 12:26:03 AM:

On Tue, 15 Aug 2023 08:10:14 -0700 (PDT), Cosine <ase...@gmail.com> wrote:

>Well, let's consider a more classical problem.

>

> Regarding English teaching methods for high school students, we

> develop a new method (A1) and want to demonstrate whether it performs

> better than other methods (A2, A3, and A4) by comparing the average

> scores of experimental classes taught with the different methods. Each

> comparison uses a paired t-test. Since each comparison is independent

> of the others, the correct significance level using the Bonferroni

> correction is alpha_original/(4-1).

It took me a bit to figure out how this was a classical problem,

especially with paired t-tests -- I've never read that literature

in particular. 'Paired' on individuals does not work because you

can't teach the same material to the same student in two ways

from the same starting point.

Maybe I got it. 'Teachers' account for so much variance in

learning that the same teacher needs to teach two methods

to two different classes. 'Teachers' are the units of analysis,

comparing success for pairs of methods.

Doing this would be similar to what I've read a little more about,

testing two methods of clinical intervention. What also seems

similar for both is that the PI wants to know that the teacher/

clinician can and will properly administer the Method without too

much contamination.

>

> Suppose we want to investigate whether the developed method (A1) is

> better than the other methods (A2, A3, and A4) for English, Spanish,

> and German; then the corrected alpha = alpha_original/(4-1)/3.

From my own consulting world, 'power of analysis' was always

a major concern. So I must mention that there is a very good

reason that studies usually compare only TWO methods if they

want a firm answer: More than two comparisons will require

larger Ns for the same power, and funding agencies (US, now)

typically care about power-of-analysis matters. So if cost/size

is a problem, there won't be four Methods or four Languages.

For the combined experiment, I bring up what I said before:

Are you sure you are asking the question you want? (or that

you need?)

One way to compose a simple design would be to look at the

two-way analysis of Method x Language. The main effect for

Method would matter, and the interaction of Method x Language

would say that they don't work the same. A main effect for

Language would mainly be confusing.
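
A sketch of that two-way layout (Python with pandas and statsmodels; the data and the column names method, language, and score are hypothetical):

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    rng = np.random.default_rng(0)
    # Hypothetical balanced design: 10 scores per Method x Language cell.
    rows = [(m, lang, rng.normal(70, 10))
            for m in ("A1", "A2", "A3", "A4")
            for lang in ("English", "Spanish", "German")
            for _ in range(10)]
    df = pd.DataFrame(rows, columns=["method", "language", "score"])

    # Main effects for Method and Language, plus their interaction.
    model = ols("score ~ C(method) * C(language)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))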

Beyond that, there is what I mentioned before: Are you sure

that family-wise alpha error deserves to be protected?

For educational methods -- or clinical ones -- being 'just as good'

may be fine if the teachers and students like it better. In fact, for

drug treatments (which I never dealt with on this level), NIH

had some (maybe confusing) prescriptions for how to 'show

equivalence'.

I say '(confusing)' because I do remember reading some criticism

and contradictory advice -- when I read about it, 20 years ago.

(I hope they've figured it out by now.)

--

Rich Ulrich

