
Q: differences between these two ways of comparison


Cosine
Aug 11, 2023, 9:28:53 PM
Hi:

Suppose we have 5 algorithms: A, B, C, D, and E, and we did the following two kinds of performance comparison. Each comparison compares two algorithms' values of a given performance metric, M.

Kind-1:

M_A > M_B, M_A > M_C, M_A > M_D, and M_A > M_E

Then we claim that A performs better than the other four algorithms.

Kind-2:

M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,
M_B > M_C, M_B > M_D, M_B > M_E,
M_C > M_D, M_C > M_E, and
M_D > M_E

Then, we claim that A performs best among all five algorithms.





Rich Ulrich
Aug 12, 2023, 12:10:06 AM
On Fri, 11 Aug 2023 18:28:50 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>Hi:
>
> Suppose we have 5 algorithms: A, B, C, D, and E, and we did the following two kinds of performance comparison. The performance comparison is to compare the two algorithms' values of a given performance metric, M.
>
>Kind-1:
>
> M_A > M_B, M_A > M_C, M_A >M_D, and M_A >M_E
>
> Then we claim that A performs better than all the rest 4 algorithms.

It seems that you are describing the RESULT of a set
of comparisons. The two 'kinds' would be: A versus each of the
others, and "all comparisons among them."

You should say, "on these test data" and "better on M than ..."
and "performed" (past tense).

>
>Kind-2:
>
> M_A > M_B, M_A > M_C, M_A > M_D, M_A > M_E,
> M_B > M_C, M_B > M_D, M_B > M_E,
> M_C > M_D, M_C > M_E, and
> M_D > M_E
>
> Then, we claim that A performs best among all the 5 algorithms.
>

I would state that A performed better (on M) than the rest, and also
the rest were strictly ordered in how well they performed.

--
Rich Ulrich

Cosine
Aug 12, 2023, 2:31:44 AM
Rich Ulrich wrote on Saturday, August 12, 2023 at 12:10:06 PM [UTC+8]:
> On Fri, 11 Aug 2023 18:28:50 -0700 (PDT), Cosine
In other words, if the purpose is only to demonstrate that A performed better on M than the other four algorithms,
we only need to do the first kind of comparison. We do the second kind only if we want to demonstrate the ordering.

By the way, it seems that to reach the desired conclusion, both kinds of comparison require doing multiple comparisons.

The first kind requires 4 (= 5-1) comparisons and the second requires C(5,2) = 10.

Therefore, if we use the Bonferroni correction, the significance level will be corrected to alpha/(n-1) and alpha/C(n,2), respectively.

If we use more than one metric, e.g., M_1 to M_m, then we need to further divide the previous alphas by m, right?

But wouldn't the corrected alpha value become too small, especially for large values of n and m?
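
For concreteness, here is a minimal sketch in Python of how those corrected thresholds shrink (the 0.05 alpha and the number of metrics are assumed values, not anything fixed above):

from math import comb

alpha = 0.05          # assumed nominal family-wise significance level
n_algorithms = 5      # A, B, C, D, E
n_metrics = 3         # hypothetical number of metrics M_1 ... M_m

kind1_tests = n_algorithms - 1        # Kind-1: A versus each of the others = 4 tests
kind2_tests = comb(n_algorithms, 2)   # Kind-2: all pairwise comparisons = C(5,2) = 10 tests

print(alpha / kind1_tests)                  # 0.0125
print(alpha / kind2_tests)                  # 0.005
print(alpha / (kind2_tests * n_metrics))    # ~0.00167, already quite small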

Rich Ulrich
Aug 12, 2023, 3:24:22 PM
On Fri, 11 Aug 2023 23:31:41 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>Rich Ulrich wrote on Saturday, August 12, 2023 at 12:10:06 PM [UTC+8]:
Before you take on 'multiple comparisons' and p-levels, you ought
to have a Decision to be made, or a question: What do you have
here? Making a statement about what happens to fit the sample
best does not require assumptions; drawing inferences to elsewhere
does require assumptions.

Who or what does your sample /represent/? Where do the algorithms
come from? (and how do they differ?). What are you hoping to
generalize to?

I can imagine that your second set of results could be a summary
of step-wise regression, where Metric is the R-squared and A is
the result after multiple steps. Each step shows an increase in
R-squared, by definition. Ta-da!

The hazards of step-wise regression are well-advertised by now.
I repeated Frank Harrell's commentary multiple times in the stats
Usenet groups, and others picked it up. I can add: When there
are dozens of candidate variables to Enter, each step is apt to
provide a WORSE algorithm when applied to a separate sample for
validation. Sensible algorithms usually require the application of
good sense by the developers -- instead of over-capitalizing on
chance in a model built on limited data.
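
As a toy illustration of that over-capitalization on chance, here is a sketch assuming Python with NumPy and scikit-learn (the data are pure noise and the sizes are arbitrary): forward selection keeps pushing the training R-squared up while the held-out R-squared goes nowhere.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_train, n_valid, n_candidates = 60, 60, 40   # small data, many candidate variables
X = rng.normal(size=(n_train + n_valid, n_candidates))
y = rng.normal(size=n_train + n_valid)        # y is pure noise: there is nothing to find

X_tr, y_tr = X[:n_train], y[:n_train]
X_va, y_va = X[n_train:], y[n_train:]

selected = []
for _ in range(10):   # ten forward steps, always adding the variable that fits best
    best = max((j for j in range(n_candidates) if j not in selected),
               key=lambda j: LinearRegression()
               .fit(X_tr[:, selected + [j]], y_tr)
               .score(X_tr[:, selected + [j]], y_tr))
    selected.append(best)
    model = LinearRegression().fit(X_tr[:, selected], y_tr)
    # Training R^2 rises by construction; validation R^2 hovers around or below zero.
    print(len(selected),
          round(model.score(X_tr[:, selected], y_tr), 3),
          round(model.score(X_va[:, selected], y_va), 3))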

If you have huge data, then you should also pay attention to
robustness and generalizability across sub-populations, rather
than focus on p-levels for the whole shebang.

>
>Therefore, if we use Bonferroni correction, the significant level will be corrected to alpha/(n-1) and alpha/C(n,2), respectively.

In my experience, I talked people out of corrections many times
by cleaning up their questions. Bonferroni fits best when you
have /independent/ questions of equal priority. And when you
have a reason to pay heed to family-wise error.

>
>If we use more than one metric, e.g., M_1, to M_m, then we need to further divide the previous alphas by m, right?
>
>But wouldn't the corrected alpha value be too small, especially when we have certain numbers of n and m?

If you don't have any idea what you are looking for, one common
procedure is to proclaim the effort 'exploratory' and report
the nominal levels.



--
Rich Ulrich

Cosine
Aug 12, 2023, 7:03:55 PM
Hmm, let's start by asking or clarifying the research questions then.

Many machine learning papers I have read use a set of metrics to show that the developed algorithm performs best, compared to a set of benchmarks.

Typically, the authors list metrics like accuracy, sensitivity, specificity, the area under the receiver operating characteristic curve (AUC), recall, F1-score, Dice score, etc.

Next, the authors list 4-6 published algorithms as benchmarks. These algorithms have similar designs and are designed for the same purpose as the developed one, e.g., segmentation, classification, and detection/diagnosis.

Then the authors run the developed algorithm and the benchmarks using the same dataset to get the values of each of the metrics listed.

Next, the authors conduct the statistical analysis by comparing the values of the metrics to demonstrate that the developed algorithm is the best and, sometimes, to establish the rank of the algorithms (the developed one and all the benchmarks).

Finally, the authors pick out the results showing favorable comparisons and claim these as the contribution(s) of the developed algorithm.

It looks to me as though the authors are doing statistical tests that compare multiple algorithms on multiple metrics to arrive at the final (single or multiple) contribution(s) of the developed algorithm.




Rich Ulrich
Aug 14, 2023, 7:03:37 PM
On Sat, 12 Aug 2023 16:03:52 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>Hmm, let's start by asking or clarifying the research questions then.
>
>Many machine learning papers I read often used a set fo metrics to show that the developed algorithm runs the best, compared to a set of benchmarks.
>
>Typically, the authors list the metrics like accuracy, sensitivity, specificity, the area under the receiver operating characteristic (AUC) curve, recall, F1-score, and Dice score, etc.
>
>Next, the authors list 4-6 published algorithms as benchmarks. These algorithms have similar designs and are designed for the same purpose as the developed one, e.g., segmentation, classification, and detection/diagnosis.

Okay. You are outside the scope of what I have read.
Whatever I read about machine learning, decades ago, was
far more primitive or preliminary than this. I can offer a note
or two on 'reading' such papers.

>
>Then the authors run the developed algorithm and the benchmarks using the same dataset to get the values of each of the metrics listed.
>
>Next, the authors conduct the statistical analysis y comparing the values of the metrics to demonstrate that the developed algorithm is the best, and sometimes, the rank of the algorithms (the developed one and all the benchmarks.)

Did the statistics include p-values?

The comparison I can think of is the demonstrations I have
seen about 'statistical tests' offered for consideration. That is,
authors are comparing (say, too simplistically) Student's t-test to
a t-test for unequal variances, or to a t on rank-orders.

Here, everyone can inspect the tests and imagine when they
will differ; randomized samples are created which feature various
aspects of non-normality, for various matches of Ns. What is
known is that the tests will differ -- a 5% test does not 'reject'
2.5% at each end, when computed on 10,000 generated samples,
when its assumptions are intentionally violated.

What is interesting is how MUCH they differ, and how much more
they differ for smaller N or for smaller alpha.
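
A minimal simulation sketch of that kind of check, assuming Python with NumPy and SciPy (the sample sizes and the variance ratio are arbitrary choices):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, alpha = 10_000, 0.05
n1, n2 = 10, 40        # unequal group sizes (arbitrary choice)
sd1, sd2 = 3.0, 1.0    # the smaller group gets the larger variance

reject_student = reject_welch = 0
for _ in range(n_sims):
    x = rng.normal(0.0, sd1, n1)   # both groups share the same true mean,
    y = rng.normal(0.0, sd2, n2)   # so every rejection is a false positive
    if stats.ttest_ind(x, y, equal_var=True).pvalue < alpha:
        reject_student += 1
    if stats.ttest_ind(x, y, equal_var=False).pvalue < alpha:
        reject_welch += 1

# Student's t rejects well above the nominal 5% in this setup;
# the unequal-variance (Welch) test stays close to 5%.
print(reject_student / n_sims, reject_welch / n_sims)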


>
>Finally, the authors pick up those results showing favorable comparisons and claim these as the contribution(s) of the developed algorithm.
>
> This looks to me that the authors are doing the statistical tests by comparing multiple algorithms with multiple metrics to conclude the final (single or multiple) contribution(s) of the developed algorithm.

So, what I know (above) won't apply if you have to treat the
algorithms as 'black-box' operations -- you can't predict when
an algorithm will perform its best.

I think I would be concerned about the generality of the test
bank, and the legitimacy/credibility of the authors.

I can readily imagine a situation like with the 'meta-analyses'
that I read in the 1990s: You need a good statistician and a
good subject-area scientist to create a good meta-analysis, and
most of the ones I read had neither.

--
Rich Ulrich

Cosine
Aug 15, 2023, 11:10:18 AM
Well, let's consider a more classical problem.

Regarding English teaching methods for high school students, suppose we develop a new method (A1) and want to demonstrate whether it performs better than other methods (A2, A3, and A4) by comparing the average scores of experimental classes taught with the different methods. Each comparison uses a paired t-test. Since each comparison is independent of the others, the corrected significance level using the Bonferroni correction is alpha_original/(4-1).

Suppose we want to investigate whether the developed method (A1) is better than the other methods (A2, A3, and A4) for English, Spanish, and German; then the corrected alpha = alpha_original/(4-1)/3.
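
A rough sketch of that analysis plan in Python (the scores below are fabricated just so the snippet runs, and the 0.05 baseline alpha is an assumption, not part of the design above):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha_original = 0.05                      # assumed baseline significance level
methods = ["A2", "A3", "A4"]               # comparators to the new method A1
languages = ["English", "Spanish", "German"]
alpha_corrected = alpha_original / ((4 - 1) * len(languages))   # 0.05/9 per the scheme above

for lang in languages:
    scores_A1 = rng.normal(75, 5, size=20)      # fabricated paired class scores for A1
    for m in methods:
        scores_other = rng.normal(73, 5, size=20)
        t_stat, p = stats.ttest_rel(scores_A1, scores_other)
        print(lang, m, round(p, 4), "significant" if p < alpha_corrected else "ns")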

Rich Ulrich
Aug 19, 2023, 12:26:03 AM
On Tue, 15 Aug 2023 08:10:14 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>Well, let's consider a more classical problem.
>
> Regarding the English teaching method for high school students, we
> develop a new method (A1) and want to demonstrate if it performs
> better than other methods (A2, A3, and A4) by comparing the average
> scores of the experimental class using different methods. Each
> comparison uses paired t-test. Since each comparison is independent of
> the other, the correct significance level using the Bonferroni test is
> alpha_original/( 4-1 ).

It took me a bit to figure out how this was a classical problem,
especially with paired t-tests -- I've never read that literature
in particular. 'Paired' on individuals does not work because you
can't teach the same material to the same student in two ways
from the same starting point.

Maybe I got it. 'Teachers' account for so much variance in
learning that the same teacher needs to teach two methods
to two different classes. 'Teachers' are the units of analysis,
comparing success for pairs of methods.

Doing this would be similar to what I've read a little more about,
testing two methods of clinical intervention. What also seems
similar for both is that the PI wants to know that the teacher/
clinician can and will properly administer the Method without too
much contamination.

>
> Suppose we want to investigate if the developed method (A1) is
> better than other methods (A2. A3. and A4) for English, Spanish, and
> German, then the correct alpha = alpha_original/( 4-1 )/3.

From my own consulting world, 'power of analysis' was always
a major concern. So I must mention that there is a very good
reason that studies usually compare only TWO methods if they
want a firm answer: More than two comparisons will require
larger Ns for the same power, and funding agencies (US, now)
typically care about power-of-analysis matters. So if cost/size
is a problem, there won't be four Methods or four Languages.

For the combined experiment, I bring up what I said before:
Are you sure you are asking the question you want? (or that
you need?)

One way to compose a simple design would be to look at the
two-way analysis of Method x Language. The main effect for
Method would matter, and the interaction of Method x Language
would say that they don't work the same. A main effect for
Language would mainly be confusing.
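
A minimal sketch of that Method x Language layout, assuming Python with pandas and statsmodels and entirely fabricated class-level scores (none of this comes from an actual dataset):

import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
methods = ["A1", "A2", "A3", "A4"]
languages = ["English", "Spanish", "German"]

# Fabricated class-level scores, several classes per Method x Language cell
rows = [{"method": m, "language": lang, "score": rng.normal(75, 5)}
        for m in methods for lang in languages for _ in range(10)]
df = pd.DataFrame(rows)

# Two-way ANOVA: main effects for Method and Language plus their interaction
model = ols("score ~ C(method) * C(language)", data=df).fit()
print(anova_lm(model, typ=2))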

Beyond that, there is what I mentioned before, Are you sure
that family-wise alpha error deserves to be protected?

For educational methods -- or clinical ones -- being 'just as good'
may be fine if the teachers and students like it better. In fact, for
drug treatments (which I never dealt with on this level), NIH
had some (maybe confusing) prescriptions for how to 'show
equivalence'.

I say '(confusing)' because I do remember reading some criticism
and contradictory advice -- when I read about it, 20 years ago.
(I hope they've figured it out by now.)

--
Rich Ulrich
