On Fri, 2 Jul 2021 02:45:09 -0700 (PDT), Cosine <ase...@gmail.com> wrote:
>Rich Ulrich wrote on Friday, July 2, 2021 at 2:18:29 PM [UTC+8]:
>> On Fri, 2 Jul 2021 02:46:44 +0000 (UTC), David Duffy
>> Ranking of results can raise the question of whether 1>2
>> and 2>3 always implies 1>3; but you might have skipped that
>Well, that is also an issue.
No, using CIs was not what I was thinking of. I don't remember the
details, but there are /some/ complicated comparisons that are not
transitive when you do the brute comparisons by pairs.
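A toy illustration (Python; these dice are my own example, not
necessarily the cases I was thinking of): the pairwise "which tends
to come out larger" comparisons that rank-based tests estimate can
be intransitive.

# Illustration: pairwise "stochastic" comparisons can be intransitive.
# These three dice are the classic example.
from itertools import product

A = [2, 4, 9]   # each face equally likely
B = [1, 6, 8]
C = [3, 5, 7]

def p_beats(x, y):
    """P(a draw from x exceeds a draw from y), by brute enumeration."""
    wins = sum(a > b for a, b in product(x, y))
    return wins / (len(x) * len(y))

print(p_beats(A, B))  # 5/9 ~ 0.56 -> A "beats" B
print(p_beats(B, C))  # 5/9 ~ 0.56 -> B "beats" C
print(p_beats(C, A))  # 5/9 ~ 0.56 -> yet C "beats" A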
One awkward scoring that I do recall something about is the
scoring for women's Olympic Ice Skating. Skaters are ranked
in each of several events. Those rank-scores are later combined
(in some fashion... weighting?) to get a final ranking to determine
a winner. I watched a competition where, at the time that the
final skater did her final event, it was possible that (IIRC) any of
the three skaters at the top could end up as #1, #2, or #3.
>We actually ran tests on samples and got a result showing that 1>2 w/ 95% confidence.
Stating such-and-so "with 95% confidence" is phrasing that will
grate on a large number of good statisticians. The parameter
(or difference) is not the proper object of "95%"; that describes
the CI. You can find some classic quotes on this in the Wiki
article at https://en.wikipedia.org/wiki/Confidence_interval
under "Misunderstandings". By the way, the article (all in all) could
benefit from expert re-writing, as mentioned in the head-notes
by Wiki overseers.
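A quick simulation sketch (Python; the normal data, sample size, and
number of repetitions are just illustrative) of what the 95% does
describe: the long-run coverage of the interval-making procedure, not
a probability attached to any one parameter value.

# Sketch: "95%" describes the CI procedure's long-run coverage,
# not any single interval or parameter.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mu, n, reps = 10.0, 25, 10_000
covered = 0
for _ in range(reps):
    x = rng.normal(true_mu, 3.0, size=n)
    se = x.std(ddof=1) / np.sqrt(n)
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=x.mean(), scale=se)
    covered += (lo <= true_mu <= hi)
print(covered / reps)   # ~0.95: how often the procedure captures mu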
>The same for 2>3 w/ 95%.
>But we did NOT do any test to get an actual result showing that 1>3 w/ some confidence.
>Does it mean that we still need to test whether 1>3 and get the statistical confidence?
>Or are there some ways to show that 1>3 w/ some confidence based on the results of
>1>2 w/ 95% and 2>3 w/ 90%?
I will mention another ranking complication. When you use SNK
(Student-Newman-Keuls) for "post-hoc" range testing, the formal
derivation requires that you test from the outside, heading in.
If the extremes do not differ, you never test the middle value.
Of course, the SNK tests here use different cutoff values when
comparing low to next / low to high.
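Here is a minimal sketch of that step-down logic for three equal-size
groups (Python; the data, alpha = .05, and the three-group setup are
my own illustration, not a general-purpose implementation).

# Minimal SNK sketch for k = 3 equal-size groups (illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.normal(m, 2.0, size=12) for m in (10.0, 11.0, 13.0)]
n = len(groups[0])
means = sorted(g.mean() for g in groups)              # low, mid, high
df_err = sum(len(g) - 1 for g in groups)              # error df, one-way ANOVA
mse = sum((len(g) - 1) * g.var(ddof=1) for g in groups) / df_err
se = np.sqrt(mse / n)

def q_crit(r):
    # studentized-range cutoff for a span of r ordered means
    return stats.studentized_range.ppf(0.95, r, df_err)

# Step 1: extremes first, span r = 3
if (means[2] - means[0]) / se > q_crit(3):
    # Step 2: only then test the adjacent pairs, with the smaller r = 2 cutoff
    print("low vs high differ")
    print("low vs mid :", (means[1] - means[0]) / se > q_crit(2))
    print("mid vs high:", (means[2] - means[1]) / se > q_crit(2))
else:
    # extremes don't differ -> the middle comparisons are never tested
    print("no differences declared")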
If you are "merely" using several two-group tests, then here is a
place where paradoxes might seem to arise: two-group tests, with
extreme differences in variance, and groups of vastly different size.
Oh, and when there are "paired" measurements, your correlations
may differ and that can have consequences.
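A sketch of the sort of situation I mean (Python; the group sizes and
variances are invented for the example): with a small, noisy group
against a large, quiet one, the pooled-variance and separate-variance
(Welch) t-tests can give markedly different answers, and with extreme
enough settings the conclusions can flip.

# Sketch: with very unequal variances and group sizes, the pooled
# and separate-variance (Welch) t-tests can disagree sharply.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
small_noisy = rng.normal(1.0, 10.0, size=8)     # small n, big variance
big_quiet   = rng.normal(0.0,  1.0, size=200)   # big n, small variance

print(stats.ttest_ind(small_noisy, big_quiet, equal_var=True))   # pooled
print(stats.ttest_ind(small_noisy, big_quiet, equal_var=False))  # Welch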
If your question gets reduced to a question of how /this/
test behaves, comparing A to B, B to C, and inferring A vs C:
you probably can set limits showing, for your question above,
that A has to differ from C (for that test), even when using
"p-level" as an effect-size indicator. The demonstration may
be different for "pooled variance" tests and "separate variance" tests.
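One way such a demonstration can go for the pooled case (a sketch,
under my simplifying assumptions of equal n and one MSE pooled across
all three groups): the three pairwise t statistics then share one
standard error, so t(A,C) = t(A,B) + t(B,C) exactly, and if both
pieces clear the cutoff, the A-vs-C statistic is at least twice the
cutoff.

# Sketch (my assumptions: equal n, one MSE pooled over all three groups).
# With a common SE the t statistics add exactly, so if A-vs-B and
# B-vs-C both exceed the cutoff, A-vs-C must exceed twice the cutoff.
import numpy as np

rng = np.random.default_rng(3)
n = 20
A, B, C = (rng.normal(m, 1.0, size=n) for m in (2.0, 1.0, 0.0))

df_err = 3 * (n - 1)
mse = sum((n - 1) * g.var(ddof=1) for g in (A, B, C)) / df_err
se = np.sqrt(2 * mse / n)                       # same SE for every pair

t_ab = (A.mean() - B.mean()) / se
t_bc = (B.mean() - C.mean()) / se
t_ac = (A.mean() - C.mean()) / se
print(t_ab + t_bc, t_ac)                        # identical, up to rounding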