Q: sources of correlation


Cosine

May 3, 2021, 8:56:43 PM
Hi:

What are the sources causing statistical correlation or dependence?

What characteristics or factors do these sources have in common?

More directly, given a particular situation, how do we identify these sources?

Let's use the human trial as an example.

A well-known example of eliminating potential sources of correlation when testing the efficacy of a new skin drug is to use the two hands of the same person as the treatment and control groups. We then recruit enough people to form the sample groups.

This example implies that the sources of correlation exist even in the same person.

Strangely, when we test a drug for another purpose, say, for treating headaches, we form the treatment and control groups by recruiting different people into each of the two groups. Why can we be sure that there are no sources of correlation in the same person in this case?

Thank you,

David Jones

May 4, 2021, 3:33:09 PM
In principle, there are three ways of dealing with this ...

(a) construct the experiment to take account of dependence. One such
approach is to organise potential candidate samples into matched
pairs (for example by weight), and do a paired-sample analysis.

(b) construct the experiment to properly ignore the dependence, by
incorporating random assignment of treatments to candidate samples. A
fully encompassing randomisation by definition eliminates the problem
of dependence, but at the expense of removing information.

(c) construct the experiment to incorporate the dependence by
quantifying potentially important dependence effects and including
these measurements in the analysis as explanatory variables or factors.
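Approaches (a) and (b) can be sketched in a few lines of Python. This is an illustration with made-up numbers (scipy assumed available), not a real trial analysis:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# (b) randomisation: assign 20 hypothetical candidates to two arms at random
order = rng.permutation(20)
treatment_ids, control_ids = order[:10], order[10:]

# (a) matched pairs: each pair shares a large pair-level effect (e.g. weight),
# plus a small within-pair response; the treatment adds roughly 2 units
pair_level = np.array([40., 45., 50., 55., 60., 42., 48., 52., 58., 63.])
treated   = pair_level + 2.0 + np.array([ .1, -.2, .3, .0, -.1, .2, -.3, .1, .0, -.1])
untreated = pair_level +       np.array([-.1, .2, .0, .1, -.2, .0, .3, -.1, .2, .0])

# a paired analysis works on within-pair differences, so the large
# pair-level variation cancels out ...
_, p_paired = stats.ttest_rel(treated, untreated)

# ... whereas an unpaired test leaves that variation in the error term
_, p_unpaired = stats.ttest_ind(treated, untreated)
print(p_paired, p_unpaired)
```

With these numbers the paired test detects the 2-unit effect easily, while the unpaired test drowns it in the between-pair variation.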

However, any real experiment is likely to be a hybrid to some extent,
with at least some randomisation involved in assigning which treatments
are given to the candidate samples.

A statistically-based book on "design of experiments" would cover this
better.

Rich Ulrich

May 5, 2021, 1:06:43 PM
On Mon, 3 May 2021 17:56:41 -0700 (PDT), Cosine <ase...@gmail.com>
wrote:

>Hi:

I'm more than a little baffled at what Cosine is really looking
for in an answer. David Jones has provided one sort of answer -
Does that one satisfy?

Here is a more philosophical approach.

>
> What are the sources causing statistical correlation or dependence?
>
>What are the characteristics/factors of these sources in common?

This is what the sciences are about, finding correlations and
dependence and trying to describe "causation".

>
>More directly, given a particular situation, how do we identify these sources?

There are a whole lot of sciences, which each have their
own tools. Astrophysicists work rather differently from
biologists.

>
> Let's use the human trial as an example.
>
>A well-known example for eliminating the potential sources of correlation when testing the efficacy of a new drug for skin is to use the two hands of the same person as testing and control groups. Then we recruit enough persons to form the sample groups.
>
> This example implies that the sources of correlation exist even in the same person.

It is KNOWN that age and sex are important in many human
responses, in addition to whatever else might matter as between-
person differences. Using each person as their own control
effectively eliminates those sources of separate causation from
the inference, when the analysis looks at the quantitative
differences within each person.
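A small simulation (made-up numbers) of the two-hands design: a shared person-level effect makes the two hands strongly correlated, and taking within-person differences removes it entirely:

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical skin measurements: each person contributes a large
# person-level effect shared by both hands, plus small hand-level noise
person = rng.normal(0.0, 10.0, size=1000)
left  = person + rng.normal(0.0, 1.0, size=1000)
right = person + rng.normal(0.0, 1.0, size=1000)

# the shared person-level effect induces strong correlation between hands
r = np.corrcoef(left, right)[0, 1]

# differencing within a person removes that shared effect: the variance
# of the difference reflects only the hand-level noise
diff = left - right
print(r, np.var(diff))
```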
>
> Strangely, when we test a drug for another purpose, say, for treating headache, we form the testing and control groups by recruiting persons to each of the two groups. Why could we be sure that there are no sources of correlation in the same person for this case?


"...no sources of correlation in the same person" is a phrase that
eludes my understanding.

"Crossover designs" do make use of the same person for control
when looking at the headache remedies you imagine.

A trial might go a step beyond "randomizing" to use a "stratified-
random" assignment to groups, if the PIs expect that (say) age and
sex might matter for outcome. That "matches" the characteristics
of groups, to eliminate that source of variation in an ANOVA.
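A stratified-random assignment along those lines might be sketched like this (hypothetical candidate list, with strata on age band and sex):

```python
import random
from collections import defaultdict

random.seed(1)

# hypothetical candidates: (id, age_band, sex)
candidates = [(i, "young" if i % 4 < 2 else "old", "F" if i % 2 else "M")
              for i in range(24)]

# group candidates by stratum, shuffle within each stratum, then
# alternate arms so every stratum is split evenly between the groups
strata = defaultdict(list)
for c in candidates:
    strata[(c[1], c[2])].append(c)

treatment, control = [], []
for members in strata.values():
    random.shuffle(members)
    for k, c in enumerate(members):
        (treatment if k % 2 == 0 else control).append(c)
```

Each of the four strata here contributes equally to both arms, so age band and sex cannot differ between the groups.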

Lesser factors that are suspected to have a relation to outcome
might be "controlled for" by including covariates in the analysis.

Including covariates is often (far) preferable to the use of "matched
cases" when the matching is not as precise as "same person".
- I was alarmed by a study that analysed paired cases when
the matching was "within four years of age". That might seem close
enough as a logical proposition in a classroom, except that the
disease was "childhood leukemia", with an age range of maybe 12 years.

--
Rich Ulrich

David Jones

May 6, 2021, 5:24:15 AM
Perhaps it would be useful to think about what happens for small
experiments, involving extremely small numbers of samples
(unrealistically small). Even if some form of randomisation is used
somewhere in the design, there is a chance that the actual outcome of
the randomisation produces some unfortunate matching that leads to
misleading results. If you are only doing the one experiment you have
to be working conditional on the outcome of the randomisation. Thinking
of marginalising across all possible outcomes of the randomisation may
only be relevant if you are dealing with a whole set of separate
experiments where you might find disparities in the results. So perhaps
one needs to think about how many samples you need for the
randomisation to have the desired effect which, in this context, is to
ensure that there is a good balance of treatment and controls across
the range of any possible hidden determinands.

Similarly, where you have measurable qualities for use as regression
variables or factors, the question of sample size arises not only for
the purposes of estimating regression coefficients and looking for
interactions, but also to alleviate the possibility that the random
allocation of treatments or controls might hit upon some unfortunate
coincidence with any measured or hidden determinands.
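The small-sample worry can be made concrete by exhaustive enumeration. A toy calculation (hypothetical set-up, with sex as the hidden determinand):

```python
from itertools import combinations

def prob_one_sex_arm(n_pairs):
    # n_pairs men + n_pairs women; choose n_pairs subjects for treatment
    # at random; return the probability the treatment arm is single-sex
    people = ["M"] * n_pairs + ["F"] * n_pairs
    total = bad = 0
    for arm in combinations(range(2 * n_pairs), n_pairs):
        total += 1
        if len({people[i] for i in arm}) == 1:
            bad += 1
    return bad / total

print(prob_one_sex_arm(2))   # tiny trial of 4: complete confounding 1/3 of the time
print(prob_one_sex_arm(5))   # trial of 10: under 1% of the time
```

So with very few samples, plain randomisation has a real chance of producing exactly the unfortunate coincidence described above; the chance shrinks rapidly as the sample grows.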

Rich Ulrich

May 6, 2021, 7:17:48 PM
An acquaintance who experimented on single cells told me that
his usual /largest/ sample N was 3, while he was looking for huge
effects. The reason for "3" was replication: a success with just one
might be pretty convincing, but he had to be sure the cell did
not have unique features, and he would confirm that he followed
exactly the same procedure each time, exactly as documented.


>(unrealistically small). Even if some form of randomisation is used
>somewhere in the design, there is a chance that the actual outcome of
>the randomisation produces some unfortunate matching that leads to
>misleading results. If you are only doing the one experiment you have
>to be working conditional on the outcome of the randomisation. Thinking
>of marginalising across all possible outcomes of the randomisation may
>only be relevant if you are dealing with a whole set of separate
>experiments where you might find disparities in the results. So perhaps
>one needs to think about how many samples you need for the
>randomisation to have the desired effect which, in this context, is to
>ensure that there is a good balance of treatment and controls across
>the range of any possible hidden determinands.
>
>Similarly, where you have measurable qualities for use as regression
>variables or factors, the question of sample size arises not only for
>the purposes of estimating regression coefficients and looking for
>interactions, but also to alleviate the possibility that the random
>allocation of treatments or controls might hit upon some unfortunate
>coincidence with any measured or hidden determinands.

The prospect of random effects is elevated when the number of
/hypotheses/ to be tested becomes large (or, /extremely/ large).

Astronomers use a really tiny p-value for certain things they look
for when scanning millions of stars.
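The arithmetic behind that practice, as a sketch (illustrative numbers only):

```python
# scanning a million stars, each an independent hypothesis test
n_tests = 1_000_000
alpha = 0.05

# at the conventional 5% level, the null hypothesis alone produces
# on the order of 50,000 "discoveries" that are pure noise
expected_false_positives = n_tests * alpha

# a Bonferroni-style family-wise correction shrinks the per-test
# threshold to keep the overall false-positive rate near 5%
per_test_alpha = alpha / n_tests
print(expected_false_positives, per_test_alpha)
```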

On the other hand, I once read a news report about correlations
of high/low insurance claims, which proposed some link to "living
within two blocks of a church." That seemed so specific that I
imagined that the dataset being mined must contain hundreds of
hypotheses of equal a-priori value. (And the report did not mention
anything related to that.) I also suspected that the report was using
the conventional 5% cutoff of the social sciences; and, with a huge N
(millions of policies), the effect was probably too small to matter.
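A back-of-the-envelope check (hypothetical numbers) of how a huge N makes a negligible effect "significant":

```python
from math import sqrt
from scipy import stats

# with millions of observations, even a practically meaningless
# correlation clears the conventional 5% threshold
n = 2_000_000
r = 0.002                                  # explains r^2 = 0.0004% of variance
t = r * sqrt((n - 2) / (1 - r * r))        # t-statistic for a correlation
p = 2 * stats.t.sf(abs(t), df=n - 2)       # two-sided p-value
print(t, p)
```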

--
Rich Ulrich