
Aug 2, 2022, 6:00:37 PM

Hi:

How do we determine the minimal number of samples required for a statistical experiment? For example, we found on the internet that "N < 30" is considered a small sample. But how do we decide whether the number of samples is too small? For example, are 2, 3, ..., 10 samples too small? Why is that? Is there any theory to support the decision? Likewise, what is the theory behind deciding that "N < 30" is a small sample?

Next, let's consider the number of metrics (e.g., accuracy and specificity) analyzed in the experiment. If we use too many metrics, it would be considered that we are fishing in the dataset. But again, how do we determine the proper number of metrics to analyze in the experiment?


Aug 2, 2022, 11:11:15 PM

On Tue, 2 Aug 2022 15:00:35 -0700 (PDT), Cosine <ase...@gmail.com>

wrote:

>Hi:

>

> How do we determine the minimal number of samples required for a

> statistical experiment?

This is called "power analysis." The statistical procedure uses the

distribution of the non-central F or whatever. I suggest Jacob

Cohen's book for an introduction that goes beyond simply presenting

the tables that can be used for lookup.

It was in the 1980s when NIMH started requiring power analyses

as part of our research grants (psychiatric medicine).

Which statistical test (F, t, etc.)? What alpha-error? What beta-error (chance of missing an effect), given what assumed effect size? ... for what N? Power is equal to (1 - beta).

Thus, a power analysis might include a table that shows the

power obtained by using specific Ns with specific underlying

effects for the test we are using.

For a two-tailed t-test, at 5% (fixed-format table):

                    assumed effect size (d)
    N needed         0.5     0.6     0.8
    power   60%      <------- n's ------->
            80%
            90%
            95%

Greater power implies larger N; larger effect

size implies smaller N.
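Those required n's can be filled in approximately. The sketch below (plain Python, standard library only; not from the original post) uses the normal approximation rather than the noncentral t distribution mentioned above, so it slightly understates n at small sample sizes:

```python
import math
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided two-sample t-test,
    via the normal approximation:
        n ~ 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2
    The exact noncentral-t answer is slightly larger at small n."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2)

# One row of such a table: 80% power, 5% two-tailed alpha
row = {d: n_per_group(d) for d in (0.5, 0.6, 0.8)}
```

This reproduces the stated relationships directly: raising the power target increases n, and a larger assumed effect size d decreases it.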

In our area, 80% was the minimum for most studies.

If there are multiple hypotheses, the same table shows how

likely the study will "detect" the various effect sizes with

a nominal test of that size.

> For example, we found on the internet that "N

> < 30" is considered a set with a small number of samples. But how do

> we decide if the number of samples is too small? For example, are 2,

> 3, ..., 10 samples too small? Why is that?

I had a friend who did lab work on cells. He told me that his typical N was 3: the only effect sizes he was interested in were the HUGE ones. If he used just one or two, then a weird result might be lab error; two similar weird results showed that he had something.

> Any theory to support the

> decision? Likewise, what is the theory behind that decides "N < 30" is

> a set with a small number of samples?

>

> Next, let's consider the number of metrics (e.g., accuracy and

> specificity) analyzed in the experiment. If we use too many metrics,

> it would be considered that we are fishing the dataset. But again, how

> do we determine the proper number of metrics analyzed in the

> experiment?

I think you are confusing two other discussions here. Metrics

like "specificity and sensitivity" are not assessed by statistical

tests like the t-test; they are found with a sample large enough to

give a small-enough standard deviation.
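That sizing question is a margin-of-error calculation rather than a power calculation. As an illustrative sketch (plain Python; the function name and numbers are mine, not from the thread), the usual normal-approximation formula for estimating a proportion:

```python
import math
from statistics import NormalDist

def n_for_proportion(p_guess, margin, conf=0.95):
    """Sample size so the conf-level CI half-width for an estimated
    proportion (e.g. specificity) is at most `margin`, using the
    normal approximation: n = z^2 * p(1-p) / margin^2."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return math.ceil(z * z * p_guess * (1 - p_guess) / margin ** 2)

# Pinning down a specificity near 0.90 to within +/- 0.05:
n_spec = n_for_proportion(0.90, 0.05)
```

The answer depends on the guessed proportion and the precision you want, not on any hypothesis test.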

"Multiple variables" opens a discussion that starts with setting

up your experiment: Have FEW "main" hypotheses. There can be

sub-hypotheses; there can be other, "frankly exploratory" results.

One approach for several variables is to use a Bonferroni correction; the power table then might have to refer to a "nominal" alpha of 2.5% or what-not, to account for the multiple tests.
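The arithmetic of that correction is simple; a minimal sketch (plain Python, names illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Test each of m hypotheses at alpha/m, which bounds the
    family-wise error rate (any false positive) at alpha."""
    m = len(p_values)
    return [(p, p <= alpha / m) for p in p_values]

# Two primary outcomes: each must clear 0.025, not 0.05.
results = bonferroni([0.012, 0.030], alpha=0.05)
```

Note that 0.030 would pass an uncorrected 0.05 test but fails the corrected 0.025 threshold.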

Another approach is to do a 'multivariate analysis' that tests

several hypotheses at once; that gets into other discussions of

how to properly consider multiple tests, since the OVERALL test

does not tell you about the relative import of different variables.

I've always recommended creating "composite scores" that combine the main criteria -- if you can't just pick a single score.

I took part in a multi-million dollar study, several hundred patients

followed for two years, a dozen rating scales collected at multiple

time points ... where the main criterion for treatment success was

whether a patient had to be withdrawn from the trial because

re-hospitalization was imminent.

--

Rich Ulrich


Sep 6, 2022, 3:56:53 PM

I'm about a month late to this party, but I have a couple of thoughts. See below.

On Tuesday, August 2, 2022 at 6:00:37 PM UTC-4, Cosine wrote:

> Hi:

>

> How do we determine the minimal number of samples required for a statistical experiment? For example, we found on the internet that "N < 30" is considered a set with a small number of samples. But how do we decide if the number of samples is too small? For example, are 2, 3, ..., 10 samples too small? Why is that? Any theory to support the decision? Likewise, what is the theory behind that decides "N < 30" is a set with a small number of samples?

Are you talking about the central limit theorem (CLT) and the so-called "rule of 30"? If so, remember that the shape of the sampling distribution of the mean depends on both the shape of the raw score (population) distribution and the sample size. If the population of raw scores is normal, the sampling distribution of the mean will be normal for any sample size (even n=1, in which case, it will be an exact copy of the normal population distribution). How large n must be to ensure that the sampling distribution of the mean is approximately normal depends on the shape of the population distribution. For many variables that are not too asymmetrical, n=30 may be enough. But for some other variables, it will not be enough.

If this is what you were asking about, you may find some of the following discussion interesting:

https://stats.stackexchange.com/questions/2541/what-references-should-be-cited-to-support-using-30-as-a-large-enough-sample-siz
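The n=30 caveat above is easy to check by simulation. The following is an illustrative sketch (plain Python, standard library only; not from the thread), using an exponential population, whose skewness is 2:

```python
import random
import statistics

def mean_sampling_distribution(draw, n, reps, seed=12345):
    """Simulate the sampling distribution of the mean: `reps`
    samples of size `n`, returning the list of sample means."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n))
            for _ in range(reps)]

def skewness(xs):
    """Standardized third central moment (population form)."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

# Exponential population: strongly right-skewed (skewness = 2),
# so the mean of n draws has skewness about 2/sqrt(n).
draw = lambda rng: rng.expovariate(1.0)
means_n30 = mean_sampling_distribution(draw, n=30, reps=5000)
means_n300 = mean_sampling_distribution(draw, n=300, reps=5000)
# At n=30 the mean's distribution is still visibly right-skewed;
# at n=300 it is much closer to normal.
```

For this population, n=30 still leaves clearly non-normal sample means, illustrating that the "rule of 30" depends on the shape of the raw-score distribution.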

>

> Next, let's consider the number of metrics (e.g., accuracy and specificity) analyzed in the experiment. If we use too many metrics, it would be considered that we are fishing the dataset. But again, how do we determine the proper number of metrics analyzed in the experiment?

You talk about accuracy and specificity. But I wonder if you are really just talking about having multiple dependent (or outcome) variables--i.e., the so-called multiplicity problem. If you are, I recommend two 2005 Lancet articles by Schulz and Grimes (links below). For me, they are two of the most thoughtful articles I have read on the multiplicity problem. HTH.

https://pubmed.ncbi.nlm.nih.gov/15866314/

https://pubmed.ncbi.nlm.nih.gov/15885299/


Sep 7, 2022, 1:26:30 PM

On Tue, 6 Sep 2022 12:56:51 -0700 (PDT), Bruce Weaver

<bwe...@lakeheadu.ca> wrote:

>I'm about a month late to this party, but I have a couple of thoughts. See below.

Bruce - This does not indicate that you saw the long reply from me.

I talked about power analysis; also, the multiplicity problem. What you add about normality is good. And the references.

>

>On Tuesday, August 2, 2022 at 6:00:37 PM UTC-4, Cosine wrote:

>> Hi:

>>

>> How do we determine the minimal number of samples required for a statistical experiment? For example, we found on the internet that "N < 30" is considered a set with a small number of samples. But how do we decide if the number of samples is too small? For example, are 2, 3, ..., 10 samples too small? Why is that? Any theory to support the decision? Likewise, what is the theory behind that decides "N < 30" is a set with a small number of samples?

>

>Are you talking about the central limit theorem (CLT) and the so-called "rule of 30"? If so, remember that the shape of the sampling distribution of the mean depends on both the shape of the raw score (population) distribution and the sample size. If the population of raw scores is normal, the sampling distribution of the mean will be normal for any sample size (even n=1, in which case, it will be an exact copy of the normal population distribution). How large n must be to ensure that the sampling distribution of the mean is approximately normal depends on the shape of the population distribution. For many variables that are not too asymmetrical, n=30 may be enough. But for some other variables, it will not be enough.

>

>If this is what you were asking about, you may find some of the following discussion interesting:

>

>https://stats.stackexchange.com/questions/2541/what-references-should-be-cited-to-support-using-30-as-a-large-enough-sample-siz

>

>>

>> Next, let's consider the number of metrics (e.g., accuracy and specificity) analyzed in the experiment. If we use too many metrics, it would be considered that we are fishing the dataset. But again, how do we determine the proper number of metrics analyzed in the experiment?

>

>You talk about accuracy and specificity. But I wonder if you are really just talking about having multiple dependent (or outcome) variables--i.e., the so-called multiplicity problem. If you are, I recommend two 2005 Lancet articles by Schulz and Grimes (links below). For me, they are two of the most thoughtful articles I have read on the multiplicity problem. HTH.

>

>https://pubmed.ncbi.nlm.nih.gov/15866314/

A trial generally has a purpose, an aim, an intention or goal that

should be reduced to one hypothesis (or two, not more than

three). In the NIMH grants that I worked on, NIMH review

insisted on planning the central testing in advance -- I hope.

Beyond the main test, there were 'confirmatory' and descriptive

analyses; and exploratory results.

However, the PIs of those grants had some freedom in what

they reported, so they might need this advice, "Respect your

a-priori planning." This Abstract does not mention that.

A friend who worked at VA, which funded its own studies, once

complained (1990s, maybe -- could be different now) that the VA

research structure was overly iron-clad in that respect; that is,

it was really tough for a PI to write up any test that had not been

described in the proposal.

>https://pubmed.ncbi.nlm.nih.gov/15885299/

That Abstract reads nicely enough.

--

Rich Ulrich

Sep 7, 2022, 4:25:35 PM

On Wednesday, September 7, 2022 at 1:26:30 PM UTC-4, Rich Ulrich wrote:

> On Tue, 6 Sep 2022 12:56:51 -0700 (PDT), Bruce Weaver

> <bwe...@lakeheadu.ca> wrote:

>

> >I'm about a month late to this party, but I have a couple of thoughts. See below.

> Bruce - This does not indicate that you saw the long reply from me.

>

> I talked about power analysis; also, multiplicity problem. What you

> add about normality is good. And the references.

--- snip ---

Hi Rich. I had seen your post, but clearly skimmed through it too quickly, because I missed that you had talked about multiplicity. Sorry about that.

Your comment in your later reply about writing up tests that were not in the proposal reminded me of this recent article, which I think is very good.

Hollenbeck, J. R., & Wright, P. M. (2017). Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. Journal of Management, 43(1), 5-18. https://journals.sagepub.com/doi/full/10.1177/0149206316679487

I don't know if that link will work for everyone, but it might.

Cheers,

Bruce

Sep 7, 2022, 6:49:26 PM

On Wed, 7 Sep 2022 13:25:34 -0700 (PDT), Bruce Weaver


<bwe...@lakeheadu.ca> wrote:

>On Wednesday, September 7, 2022 at 1:26:30 PM UTC-4, Rich Ulrich wrote:

>> On Tue, 6 Sep 2022 12:56:51 -0700 (PDT), Bruce Weaver

>> <bwe...@lakeheadu.ca> wrote:

>>

>> >I'm about a month late to this party, but I have a couple of thoughts. See below.

>> Bruce - This does not indicate that you saw the long reply from me.

>>

>> I talked about power analysis; also, multiplicity problem. What you

>> add about normality is good. And the references.

>--- snip ---

>

>Hi Rich. I had seen your post, but clearly skimmed through it too quickly, because I missed that you had talked about multiplicity. Sorry about that.

>

>Your comment in your later reply about writing up tests that were not in the proposal reminded me of this recent article, which I think is very good.

>

>Hollenbeck, J. R., & Wright, P. M. (2017). Harking, sharking, and tharking: Making the case for post hoc analysis of scientific data. Journal of Management, 43(1), 5-18. https://journals.sagepub.com/doi/full/10.1177/0149206316679487

>

>I don't know if that link will work for everyone, but it might.

>

The link works for me.

Very good article, for offsetting some bad practices.

The ongoing difficulty is providing a high quality of t-harking (transparent harking). Breakthrough? Bad guess?

What already happens for health findings is that the slight hint of a Big Deal gets grabbed and publicized ... and, a few years later, when the hypothesis proves to be a bust, it becomes One More Example of 'Scientists misled us again, and they don't really know anything.'

--

Rich Ulrich
