On Tue, 2 Aug 2022 15:00:35 -0700 (PDT), Cosine <ase...@gmail.com> wrote:
>Hi:
>
> How do we determine the minimal number of samples required for a
> statistical experiment?
This is called "power analysis." The statistical procedure uses the
distribution of the non-central F or whatever. I suggest Jacob
Cohen's book for an introduction that goes beyond simply presenting
the tables that can be used for lookup.
It was in the 1980s when NIMH started requiring power analyses
as part of our research grants (psychiatric medicine).
Which statistical test (F, t, etc.)? What alpha-error? What
beta-error (chance of missing an effect), given what assumed
effect size? ... for what N? Power is equal to (1 - beta).
Thus, a power analysis might include a table that shows the
power obtained by using specific Ns with specific underlying
effects for the test we are using.
For a two-tailed t-test, at 5% (fixed-format table; view in a
monospaced font):

                  N needed, for assumed effect size (d)
                      0.5        0.6        0.8
   power   60%      < n's >    < n's >    < n's >
           80%      < n's >    < n's >    < n's >
           90%      < n's >    < n's >    < n's >
           95%      < n's >    < n's >    < n's >
Greater power implies larger N; larger effect
size implies smaller N.
In our area, 80% was the minimum for most studies.
If there are multiple hypotheses, the same table shows how
likely the study will "detect" the various effect sizes with
a nominal test of that size.
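To make that arithmetic concrete, here is a short sketch in Python
(using statsmodels -- my choice of tool, not anything from Cohen's book
or the grant work) that fills in the cells of a table like the one
above, reading it as a two-sided, two-sample t-test at alpha = 5%:

  # Sketch: per-group N for a two-sided, two-sample t-test at alpha = 0.05,
  # for the powers and assumed effect sizes (Cohen's d) in the table above.
  from statsmodels.stats.power import TTestIndPower

  analysis = TTestIndPower()
  for power in (0.60, 0.80, 0.90, 0.95):
      for d in (0.5, 0.6, 0.8):
          n = analysis.solve_power(effect_size=d, alpha=0.05,
                                   power=power, alternative='two-sided')
          print(f"power={power:.0%}  d={d}  n per group ~ {n:.0f}")

A one-sample or paired design would use statsmodels' TTestPower instead
of TTestIndPower, but the logic is the same.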
> For example, we found on the internet that "N < 30" is considered a set
> with a small number of samples. But how do we decide if the number of
> samples is too small? For example, are 2, 3, ..., 10 samples too small?
> Why is that?
I had a friend who did lab work on cells. He told me that his
typical N was 3: The only effect sizes he was interested in were
the HUGE ones. If he used just one or two, then a weird result
might be lab error; two similar weird results showed that he had
something.
> Any theory to support the decision? Likewise, what is the theory
> behind deciding that "N < 30" is a set with a small number of samples?
>
> Next, let's consider the number of metrics (e.g., accuracy and
> specificity) analyzed in the experiment. If we use too many metrics,
> it would be considered that we are fishing the dataset. But again, how
> do we determine the proper number of metrics analyzed in the
> experiment?
I think you are confusing two other discussions here. Metrics
like "specificity and sensitivity" are not assessed by statistical
tests like the t-test; they are estimated from a sample large enough
to give a small-enough standard error.
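To illustrate that kind of precision calculation (my own rough sketch,
using the usual normal approximation for a proportion; the numbers are
illustrative, not from any study):

  # Rough sketch: N needed so that an estimated proportion (e.g.,
  # sensitivity) has a 95% confidence interval of a chosen half-width.
  import math

  p = 0.85            # assumed true sensitivity (a planning guess)
  half_width = 0.05   # desired 95% CI half-width
  z = 1.96            # two-sided 95% critical value

  n_needed = math.ceil((z / half_width) ** 2 * p * (1 - p))
  se = math.sqrt(p * (1 - p) / n_needed)   # standard error of the estimate
  print("n needed:", n_needed, " standard error at that n:", round(se, 4))

The required N grows with the square of the desired precision: halve the
half-width and you need roughly four times the sample.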
"Multiple variables" opens a discussion that starts with setting
up your experiment: Have FEW "main" hypotheses. There can be
sub-hypotheses; there can be other, "frankly exploratory" results.
One approach for several variables is to use a Bonferroni correction;
the Power Table then might have to refer to a "nominal" alpha of 2.5%
or what-not, to correct for the multiple tests.
Another approach is to do a 'multivariate analysis' that tests
several hypotheses at once; that gets into other discussions of
how to properly consider multiple tests, since the OVERALL test
does not tell you about the relative import of different variables.
I've always recommended creating "composite scores" that combine
the main criteria -- if you can't just pick a single score.
I took part in a multi-million dollar study, several hundred patients
followed for two years, a dozen rating scales collected at multiple
time points ... where the main criterion for treatment success was
whether a patient had to be withdrawn from the trial because
re-hospitalization was imminent.
--
Rich Ulrich