
Correct way to normalize an rmsd-based distance metric used in repeated trials of pairs


djh

unread,
Mar 20, 2012, 8:43:14 AM3/20/12
to
1. Suppose you select 500 pairs of necklaces with 50 beads in each
one.

2. After throwing one of these pairs up in the air and letting it land
on a desktop:

a) you find that there are just 30 pairs of beads for which: i) one
bead of the pair is from one necklace and one bead is from the other;
ii) the two beads in the pair are within 1/4 inch of each other.

b) you compute the rmsd of the distance between the two beads for each
of these 30 pairs.

3. Then you do the same thing for the remaining 499 pairs of necklaces
and compute the rmsd for each trial in which there are also just 30
pairs of beads within 1/4 inch of each other ... say there are 300 of
these cases, including your first one.

4. Then you get the average rmsd for these 300 cases.

5. Having done all the above, you repeat the same procedure on a
new set of 1000 pairs of necklaces with 50 beads in each one, and
since this results in 525 cases out of the 1000 in which just 30 beads
were within 1/4 inch of each other, you again compute the rmsd for
each of these cases and then the average rmsd over all 600.

6. Finally, you get the average rmsd for:

a) 930 cases that result from another 1500 new trials;

b) 1140 cases that result from another 2000 new trials

etc.

Is it the case that if you want to compare the average rmsd's you
get for the successive sets of 300, 525, 930, 1140, ... cases, you
somehow have to normalize these average rmsd's relative to 300, 525,
930, 1140 ... ?

And if so, do you also have to somehow take into account the
different starting sizes of 500, 1000, 1500, 2000 ... ?
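
For concreteness, here is a minimal Python sketch of the bookkeeping in
steps 1-6: per retained throw, the rmsd of exactly 30 inter-bead
distances, then the average of those rmsd's over the retained throws.
The uniform random "throw" model, the desktop size, and the function
names are assumptions made only so the sketch runs; they are not meant
to reproduce the real setup.

<begin Python sketch>
import numpy as np

rng = np.random.default_rng(0)

def throw_pair(n_beads=50, cutoff=0.25, desk=4.0):
    """One simulated 'throw' of a pair of 50-bead necklaces onto a desktop.

    Returns the distances for cross-necklace bead pairs that landed within
    `cutoff` inches of each other (each bead of necklace 2 used at most once).
    """
    a = rng.uniform(0, desk, size=(n_beads, 2))   # bead positions, necklace 1
    b = rng.uniform(0, desk, size=(n_beads, 2))   # bead positions, necklace 2
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    close, used = [], set()
    for i in range(n_beads):
        j = int(np.argmin(d[i]))                  # nearest bead of necklace 2
        if d[i, j] <= cutoff and j not in used:
            close.append(d[i, j])
            used.add(j)
    return close

def average_rmsd(n_throws, keep_exactly=30):
    """rmsd of the inter-bead distances for each retained throw, then their mean."""
    rmsds = []
    for _ in range(n_throws):
        dists = throw_pair()
        if len(dists) == keep_exactly:            # keep only throws yielding exactly 30 close pairs
            rmsds.append(np.sqrt(np.mean(np.square(dists))))
    return len(rmsds), (float(np.mean(rmsds)) if rmsds else float("nan"))

for n_throws in (500, 1000, 1500, 2000):
    n_kept, avg = average_rmsd(n_throws)
    print(n_throws, n_kept, avg)
<end Python sketch>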

Note to Ray Koopman:

This question arises from the matrix we were discussing here:

http://groups.google.com/group/sci.stat.math/browse_thread/thread/4cde6c1d44a77fd9#

After you pointed out that one of the dv's increases/decreases as you
move right/left from a reference cell, I realized that in fact, the
size of the population in each cell decreases as you move right in the
same row ....

djh

unread,
Mar 20, 2012, 8:49:59 AM3/20/12
to
Correction: the "600" in here:

Having done all the above, you repeat the same procedure on a a
new set of 1000 pairs of necklaces with 50 beads in each one, and
since this results in 525 cases out of the 1000 in which just 30
beads
were within 1/4 inch of each other, you again compute the rmsd for
each of these cases and then the average rmsd over all 600.

should have been 525.

Sorry!

djh

unread,
Mar 20, 2012, 9:51:36 AM3/20/12
to
I have started testing the relationship between sample size and
average rmsd in repeated different "real-life" scenarios that are
similar to the artificial scenario used above.

The correlation appears to hover about -0.50, from -0.57 down to -0.40
"in real life", and is never positive. That is, the more trials you
do, the smaller the rmsd.

Assuming these consistently negative but moderate correlations can be
considered "semi-significant", this tells me that the more trials one
does, the more chance there is for each pair of necklaces to fall to
the desktop in a similar way.

But I have no idea if I'm reasoning correctly here - that's why I
asked the question in general.

Art Kendall

unread,
Mar 20, 2012, 11:00:14 AM3/20/12
to
Ray Koopman has been very generous in trying to reply to your posts.

It is quite possible that you would have more advice if you were to
explain substantively what you are trying to find out.

It _very frequently_ happens that translation to analogous questions
omits relevant information. The artificial scenario often loses
important aspects of the real life scenario.

Art Kendall
Social Research Consultants

djh

unread,
Mar 20, 2012, 11:30:07 AM3/20/12
to
Thanks for the advice, Art. Since you posted here as well as
emailing, I'll post my reply here as well:

********
Thanks very much for taking the time to offer that advice, Art. It is
much appreciated.

Just so you know - I offered a collaborative co-authorship to Ray if
he would serve as the statistical expert on the team and he declined
on the grounds that we need someone "local". So his willingness to
function "pro bono" over the web to the extent that he has is even
more praiseworthy.

Maybe it's best that I take his advice and try to find a collaborating
local expert. But as I'm sure you're aware, such experts are in high
demand and very very busy, i.e. they generally can't afford the time
to come up to speed on "context". (If they weren't so busy, my
colleagues at Princeton and Penn State would probably have already
been able to find one. )

In any event, if you think I'm taking advantage of sci.stat.math (i.e.
"page-pushing", to use the on-line idiom), just let me know and I'll
cease and desist.

Best regards
djh

Art Kendall

unread,
Mar 20, 2012, 12:39:06 PM3/20/12
to
perhaps I am missing something.
> the more trials you
>> do, the smaller the rmsd.


Everything else being equal, a root mean square (aka population
standard deviation) is expected to be smaller as n increases.

This is an SPSS demo that I use in explaining standard deviations. It
should be somewhat clear even if you use other software.
<begin SPSS syntax>
* demo of size of standard deviation as n goes up.
* Given the same sum of squared distances (called deviations when one of
  the points is a mean),
* as n increases the mean square (aka population variance) goes down and
* the root mean square (aka population standard deviation) goes down.
* The sample variance and sample standard deviation use n-1 rather than
  the n that is used in the pop statistics.
input program.
loop n = 10 to 250 by 10.
end case.
end loop.
end file.
end input program.
compute sum_of_squares =10000.
compute mean_squares = sum_of_squares/n.
compute root_mean_square = sqrt(mean_squares).
var labels
sum_of_squares 'sum of squared distances (of points to mean for
deviations)'/
mean_squares 'population variance when divided by n sample variance
when divided by n-1'/
root_mean_square 'population standard deviation when divided by n
sample SD when divided by n-1'.
list.
<end syntax>

this is the output.
<begin output>




n sum_of_squares mean_squares root_mean_square

10 10000 1000 32
20 10000 500 22
30 10000 333 18
40 10000 250 16
50 10000 200 14
60 10000 167 13
70 10000 143 12
80 10000 125 11
90 10000 111 11
100 10000 100 10
110 10000 91 10
120 10000 83 9
130 10000 77 9
140 10000 71 8
150 10000 67 8
160 10000 63 8
170 10000 59 8
180 10000 56 7
190 10000 53 7
200 10000 50 7
210 10000 48 7
220 10000 45 7
230 10000 43 7
240 10000 42 6
250 10000 40 6


Number of cases read: 25 Number of cases listed: 25
<end output>


Art Kendall
Social Research Consultants

On 3/20/2012 9:51 AM, djh wrote:

Art Kendall

unread,
Mar 20, 2012, 12:48:50 PM3/20/12
to
Best of luck.

wrt "page-pushing" it is not unusual for group or list members to skip
new posts that are part of a thread.

Art Kendall
Social Research Consultants

djh

unread,
Mar 20, 2012, 2:16:52 PM3/20/12
to
I posted a detailed response but it doesn't seem to want to appear, so
I'll do a shorthand version and maybe the detailed version will
eventually appear.

Thanks for taking the time, but our scenarios are unrelated (I
apologize for not making myself clear.)

What's "going down" in my case is an average rmsd that's always
computed from a set of rmsd's that are always each computed from just
30 inter-bead distances for just 30 pairs of beads. So the population
on which the actual rmsd's are being computed is not increasing - it's
always 30.

As this "average rmsd" is "going down", what's "going up" is:

a) the number T of throws
b) the yield from the T throws, i.e. the number of throws which
yield just 30 pairs of beads lying within 1/4 inch of each other.

djh

unread,
Mar 20, 2012, 1:23:06 PM3/20/12
to
Thanks very much for taking the time - I'm sorry I didn't make myself
clear.

After any given "throw", the "rmsd" is the rmsd of the inter-bead
distances for the 30 pairs of beads that lie within 1/4 inch of each
other in that trial. (If the "throw" doesn't yield just 30 such
pairs, I ignore the "throw".)

So, when I do 500 "throws", I get 300 in which there are just 30 of
such pairs, and my "average rmsd" is the average of the rmsd's
computed for each of these 300, i.e. an average of 300 individual
rmsds. (Note that each rmsd that figures into this average rmsd was
computed for just 30 pairs.)

And when I do another 1000 "throws", I get 525 in which there are just
30 of such pairs, and my "average rmsd" is the average of the rmsd's
computed for each of these 525, i.e. an average of 525 individual
rmsds. (Again, note that each rmsd that figures into this average
rmsd was computed for just 30 pairs.)

And when I do another 1500 "throws", I get 930 in which there are just
30 of such pairs, and my "average rmsd" is the average of the rmsd's
computed for each of these 930, i.e. an average of 930 individual
rmsds. (Again, note that each rmsd that figures into this average
rmsd was computed for just 30 pairs.)

And when I do another 2000 "throws", I get 1140 in which there are
just 30 of such pairs, and my "average rmsd" is the average of the
rmsd's computed for each of these 1140, i.e. an average of 1140
individual rmsds. (Again, note that each rmsd that figures into this
average rmsd was computed for just 30 pairs.)

And what I was trying to say is that as my "yields" go up from 525 to
930 to 1140, my "average rmsd's" go down, even though each of these
"average rmsd" was computed for a set of rmsd's in which each rmsd was
obtained from measurements on just 30 pairs. (In other words, the
"30" stays constant, even though the throws and the yields increase.)

This has nothing to do with your example, so far as I can tell,
because in every case, the rmsd is always computed for just 30
elements, not an increasing or decreasing number of elements.

What I believe is happening is that as the number of throws goes up,
the "yields" go up, and as the "yields" go up, there are simply more
ways in which 30 pairs can lie closer, so the "average rmsd's" go
down.

But assuming I'm correct about this, I'm still stuck with my original
question: how would you normalize the average rmsd's relative to the
number of throws (or number of returns in the yields from the throws),
assuming you need to for reasons I won't go into here ?

Again - thanks very much for taking the time ...

djh

unread,
Mar 20, 2012, 4:05:43 PM3/20/12
to A...@drkendall.org
Art -

Below are 55 triples (SampSize, M, (u1+u2)/2) which I just posted to
Ray (in the original "data bifurcation" thread), in order to show him
that his guess about (u1+u2)/2 was correct: i) SampSize against
(u1+u2)/2 correlates with a coefficient of -0.65, an F of 6.20E-08,
and an RSquare of 0.43; ii) M correlates against (u1+u2)/2 with a
coefficient of 0.65, an F of 7.22E-08, and an RSquare of 0.42.

I'm also posting these triples in this thread to illustrate what I
meant by "average RMSD". M is an "average" RMSD - it's computed from
188 individual RMSD's for the 188 cases in the first sample, then from
242 individual RMSD's for the 242 cases in the second sample, etc.
But all of the 188 individual RMSD's were computed for roughly the
same number of elements, and furthermore, this is roughly the same
number of elements as for the 242 individual RMSD's in the second
sample, etc..

SampSize    M            (u1+u2)/2


188 1.572862766 8
242 1.438345207 9
305 1.375536984 7
442 1.438461176 8.5
502 1.438387371 7.5
554 1.354741697 8
632 1.454691013 6
728 1.463273819 7
821 1.40514715 7.5
922 1.393855456 6.5
1201 1.423061391 4
1298 1.445902011 6.5
1363 1.380789912 4.5
1477 1.383602681 7
1549 1.326376675 3.5
1653 1.372242892 6
1697 1.462275203 6
1769 1.323715088 0
1936 1.447994484 6.5
2099 1.416260024 4.5
2148 1.406957826 5
2166 1.365713206 5.5
2269 1.427719401 3
2382 1.324469761 5
2414 1.434091798 5.5
2669 1.304284065 4
2695 1.430254267 5.5
2795 1.460848118 5
3063 1.379745439 6
3086 1.428632712 5
3187 1.391900113 5.5
3472 1.319731285 5
3573 1.344443792 4.5
3617 1.405385986 4
3909 1.397944456 3.5
4067 1.377588158 2.5
4997 1.378889606 4.5
5204 1.422537681 4
5233 1.361786878 2
5418 1.266152154 1
5677 1.39116412 4.5
6347 1.310987263 0.5
6990 1.341029097 3
8321 1.312367819 1.5
8563 1.343656653 1
9030 1.378298448 4
9056 1.334157641 2.5
9167 1.333149317 3
9385 1.393285612 3.5
9873 1.375232508 2
11688 1.362740158 3.5
12122 1.379700086 3
14349 1.297632463 2
14760 1.324303886 1.5
19218 1.351081067 2.5
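
For reference, the kind of figures quoted above can be reproduced with
scipy directly from the triples. The sketch below types in only the
first ten rows of the table, so its r, p, and RSquare values will
differ from the -0.65 / 0.65 figures, which use all 55 rows.

<begin Python sketch>
import numpy as np
from scipy import stats

# (SampSize, M, (u1+u2)/2) -- the first ten triples from the table above
samp_size = np.array([188, 242, 305, 442, 502, 554, 632, 728, 821, 922])
avg_rmsd  = np.array([1.572862766, 1.438345207, 1.375536984, 1.438461176,
                      1.438387371, 1.354741697, 1.454691013, 1.463273819,
                      1.40514715, 1.393855456])
u_mid     = np.array([8, 9, 7, 8.5, 7.5, 8, 6, 7, 7.5, 6.5])

r1, p1 = stats.pearsonr(samp_size, u_mid)   # SampSize vs (u1+u2)/2
r2, p2 = stats.pearsonr(avg_rmsd, u_mid)    # M vs (u1+u2)/2
print(f"SampSize vs (u1+u2)/2: r = {r1:.2f}, p = {p1:.2g}, R^2 = {r1*r1:.2f}")
print(f"M        vs (u1+u2)/2: r = {r2:.2f}, p = {p2:.2g}, R^2 = {r2*r2:.2f}")
<end Python sketch>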





Ray Koopman

unread,
Mar 21, 2012, 3:06:40 AM3/21/12
to
On Mar 20, 6:51 am, djh <halitsk...@att.net> wrote:
> I have started testing the relationship between sample size and
> average rmsd in repeated different "real-life" scenario that are
> similar to the artificial scenario used above.
>
> The correlation appears to hover about -0.50, from -0.57 down to
> -0.40 "in real life", and is never positive. That is, the more
> trials you do, the smaller the rmsd.
>
> Assuming these consistently negative but moderate correlations
> can be considered "semi-significant", this tells me that the more
> trials one does, the more chance there is for each pair of necklaces
> to fall to the desktop in a similar way.
>
> But I have no idea if I'm reasoning correctly here - that's why I
> asked the question in general.

The only way that could happen with the necklaces is if the
composition of the set of necklaces changes, and/or the way the
pairs are thrown changes, as a function of the size of the set.
If all the samples were drawn randomly and independently from the
same population, and all the throws were done identically, then
traditional probability theory says there should be no systematic
relation between the number of sample pairs drawn and the rmsd for
different sized sets. Sounds to me like the necklace analogy is
missing something that's going on in your real data.

djh

unread,
Mar 21, 2012, 11:12:49 AM3/21/12
to koo...@sfu.edu, a...@drkendall.org, halit...@att.net

Ray -

Thank you for confirming that no statistics-based or probability-based
null hypothesis is operative in the "necklace" example. When I posed
the question, I stated a possible null hypothesis of this type to see
if any professional at sci.stat.math WOULD agree that some such null
hypothesis is operative. I did this simply to avoid embarrassment,
because usually in cases such as this, some kind of null hypothesis IS
operative - any other hypotheses are usually "too good to be true".
And I didn’t want to appear really stupid by showing my ignorance of a
standard null hypothesis that would be relevant in this case.

But now that we agree that no null hypothesis is operative, I can tell
you your intuitions were again 100% dead-on when you just wrote:

"The only way that could happen with the necklaces is if the
composition of the set of necklaces changes . . . Sounds to me like
the necklace analogy is missing something that's going on in your real
data."

To see why I say this, look at the matrix at the end of this post,
which shows the row-level correlations of sample size and average RMSD
with the (u1+u2)/2 variable that you suggested in the other thread.
(The previous correlation shown in the post above (i.e. the
correlation on 55 total observations of samp size and average RMSD
relative to (u1+u2)/2) was derived from the values in this matrix in
the obvious way.)

The reason why these correlations CAN exist at all is because you were
absolutely right – different values of u1 and u2 DO change certain
properties of the data that are equivalent to changing the
“composition of the necklaces”. In effect:

a) As (u1+u2)/2 increases, it is more difficult to find pairs of
necklaces in the overall population such that one necklace has u = u1
and the other has u = u2; so, samp size goes down as (u1+u2)/2 goes up

b) As (u1+u2)/2 increases, the RMSD for a pair of necklaces with u =
u1 and u = u2 is going to increase because such pairs are the “odd
balls” in the system and therefore, there is less chance of them being
similar. (I realize this statement is a bit vague, but it’s the best
I can do without going into way too much biomolecular detail.)

So, as I said previously, what remains now is to construct the same
matrix as the one shown below for 239 more cases, to produce a total
of 240 cases in four sets of 60. Since three of the four sets of 60
are three different kinds of controls, we will know for sure whether
we “have something or not” by comparing the 60 matrices for the first
set to the 60 matrices for each of the other three sets. To take you
“into the kitchen” just a little bit, the reason why each set has 60
cases is because our data are from 6 different types of proteins
(globins, cytochrome-C’s, immunoglobulin-like proteins, trypsin-like
proteases, “TIM-barrel” proteins, and “NADP-binding” proteins), and for
each of these six classes we select data in a way that is equivalent
to selecting necklaces of ten different lengths: 13-22 beads, 23-32
beads, 33-42 beads, . . . 103-112 beads, for a total of ten “length
intervals” in all. (The matrix above was drawn just from the globin
proteins for length interval 13-22.)

Also, we honestly don’t know “up front” in which cases the
correlations will or will not manifest themselves. As we move from
protein type to protein type, the structural type of the proteins
changes from “mainly helical” (globins and cytochromes) to “mainly
sheet” (IG-like and trypsin-like) to “mainly helical AND sheet” (TIM-
barrel and NADP-binding). And as one progresses through the ten length
intervals specified above (13-22, . . . 103-112) , the nature of the
necklaces drawn from “mainly helical”, “mainly sheet”, and “mainly
helical+sheet” proteins will change considerably.

Finally, I don’t want to tell you what the variable u is for a simple
reason. If you told any molecular biologist that my team is trying to
relate this variable to aspects of protein structure, they would tell
you that my team is crazy. This is because on the one hand, all
molecular biologists practicing today believe that the type of
spontaneous mutation which drives natural selection is essentially
omnipresent and omnipotent to such a great extent that over a time
span of several billion years, it MUST have destroyed any traces of a
relationship that might have existed between the variable “u” and
aspects of protein structure “at the beginning”. While on the other
hand, correlations such as that shown in the matrix below indicate
that this may not be the case, i.e. that we can see traces of the
situation “at the beginning” in much the same way as physicists can
see traces of background radiation from the “big-bang”.

But again, any molecular biologist would tell you that I’m crazy for
saying so, and that my team is crazy for working with me. And in
fact, one of my teammates is very very reluctant to drag in the
evolutionary aspects of situation, inasmuch as he’s a real purist
whose attitude is: “since we weren’t there to observe, we have no
right to say.” But frankly, I don’t see any other way to make sense
of the data.

In any event, thanks so much again for working thru all this with me,
and by extension, my team. You have made an invaluable contribution
(re (u1+u2)/2) which will certainly be acknowledged in whatever paper
is finally produced. And one more thing . . . I’ve told you that the
unit of measure of RMSD is angstroms, since we’re taking the RMSD of
inter-atomic distances. But I also told you I’d check on the unit of
measure of u, and when I did so, my most senior colleague reminded me
that it's kcal/mol.

Here’s the matrix mentioned above:

Row Corr
u2 with
0 1 2 3 4 5 6 7 8 9 (u1+u2)/2
u1
0 1769 6347 8563 8321 5233 4067 2269 1549 1201 1363 -0.60
1.32 1.31 1.34 1.31 1.36 1.38 1.43 1.33 1.42 1.38 0.67

1 6347 5418 14760 14349 9056 6990 3909 2669 2099 2382 -0.60
1.31 1.27 1.32 1.30 1.33 1.34 1.40 1.30 1.42 1.32 0.56

2 8563 14760 9873 19218 12122 9385 5204 3573 2795 3187 -0.71
1.34 1.32 1.38 1.35 1.38 1.39 1.42 1.34 1.46 1.39 0.65

3 8321 14349 19218 9167 11688 9030 4997 3472 2695 3063 -0.76
1.31 1.30 1.35 1.33 1.36 1.38 1.38 1.32 1.43 1.38 0.70

4 5233 9056 12122 11688 3617 5677 3086 2166 1697 1936 -0.71
1.41 1.33 1.38 1.36 1.41 1.39 1.43 1.37 1.46 1.45 0.64

5 4067 6990 9385 9030 5677 2148 2414 1653 1298 1477 -0.73
1.38 1.34 1.39 1.38 1.39 1.41 1.43 1.37 1.45 1.38 0.52

6 2269 3909 5204 4997 3086 2414 632 922 728 821 -0.73
1.43 1.40 1.42 1.38 1.43 1.43 1.45 1.39 1.46 1.41 0.23

7 1549 2669 3573 3472 2166 1653 922 305 502 554 -0.74
1.33 1.30 1.34 1.32 1.37 1.37 1.39 1.38 1.44 1.35 0.74

8 1201 2099 2795 2695 1697 1298 728 502 188 442 -0.74
1.42 1.42 1.46 1.43 1.46 1.45 1.46 1.44 1.57 1.44 0.50

9 1363 2382 3187 3063 1936 1477 821 554 442 242 -0.75
1.38 1.32 1.39 1.38 1.39 1.38 1.41 1.35 1.44 1.44 0.64

Rich Ulrich

unread,
Mar 21, 2012, 1:33:08 PM3/21/12
to
On Tue, 20 Mar 2012 06:51:36 -0700 (PDT), djh <halit...@att.net>
wrote:

>I have started testing the relationship between sample size and
>average rmsd in repeated different "real-life" scenario that ae
>similar to the artificial scenario used above.
>
>The correlation appears to hover about -0.50, from -0.57 down to -0.40
>"in real life", and is never positive. That is, the more trials you
>do, the smaller the rmsd.
>
>Assuming these consistently negative but moderate correlations can be
>considered "semi-significant", this tells me that the more trials one
>does, the more chance there is for each pair of necklaces to fall to
>the desktop in a similar way.

Maybe I don't understand the problem, but I think you are
not properly describing the outcomes.

I grew up in farm country, and yield was always "bushels
per acre". I'm stuck with it as a rate. And that seems
appropriate here.

(300/500; 525/1000; 930/1500; 1140/2000).

As I read it, the resulting yields of 30 close bead pairs are
(60%; 52%; 62%; 57%). Apparently random variation?

It would be mysterious and weird, needing a new variable
to explain the outcome, if the necklaces and tosses are all
similar, and these *were* different yields. Is there a reason
for referring to the absolute counts?

If the yields actually did differ - like picking out all people
under a certain height/income/IQ in various populations - the
process of selecting at a cutoff is known to introduce biases
in the resulting "matched samples", both in mean and variance.
That is one reason to avoid a strategy of matching samples.

However, you don't seem to have that problem of different
yields, using the term as I would use it.

If you have already moved beyond this aspect... then pardon
my interruption. I haven't worked out what it is that you
and Ray have previously agreed on.

>
>But I have no idea if I'm reasoning correctly here - that's why I
>asked the question in general.

--
Rich Ulrich

djh

unread,
Mar 21, 2012, 2:48:47 PM3/21/12
to rich-...@live.com, rich....@comcast.net
Rich -

Thanks very much for taking a moment to focus on the throw:yield
aspect of this matter. It is different than the aspect Ray and I
reached agreement on (which has to do with the average RMSD of the
individual RMSDs obtained from a given yield set).

But in fact, the throw:yield aspect is equally important to our
hypothesis, and I'm afraid I muddied the waters by picking a faulty
set of example throw:yield numbers in my necklace example. (I was
careless here because, again, I was focussing on the "average RMSD"
side of the matter, not the "throw:yield" aspect.)

In actuality, throws and yields seem to be highly correlated, almost
perfectly so, as you can see from the following two examples:

Trial Throws Yields Throw:Yield Correlation
1 2080 1769 9.99E-01
7150 6347
9685 8563
9425 8321
6370 5233
4615 4067
2535 2269
1755 1549
1300 1201
1495 1363

2 7150 6347 9.99E-01
5995 5418
16390 14760
15950 14349
10780 9056
7810 6990
4290 3909
2970 2669
2200 2099
2530 2382
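
A quick numpy check of the throw:yield correlations quoted above, with
the columns copied from the table (each 9.99E-01 is just the Pearson
correlation of a Throws column with its Yields column):

<begin Python sketch>
import numpy as np

throws_1 = [2080, 7150, 9685, 9425, 6370, 4615, 2535, 1755, 1300, 1495]
yields_1 = [1769, 6347, 8563, 8321, 5233, 4067, 2269, 1549, 1201, 1363]
throws_2 = [7150, 5995, 16390, 15950, 10780, 7810, 4290, 2970, 2200, 2530]
yields_2 = [6347, 5418, 14760, 14349, 9056, 6990, 3909, 2669, 2099, 2382]

print(np.corrcoef(throws_1, yields_1)[0, 1])   # trial 1, ~0.999
print(np.corrcoef(throws_2, yields_2)[0, 1])   # trial 2, ~0.999
<end Python sketch>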

These two cases were drawn from our first ten trials in which we were
examining very short stretches (13-22 beads) in the pairs of necklaces
returned by each throw, and in which the necklaces were drawn from a
very homogeneous set. So even though all ten of these initial trials
showed the same high throw:yield correlation as the two above, we
won't know for another several weeks whether the correlation holds up
as: a) we examine stretches of necklace with bead lengths from 23-32,
33-42, ...., 103-112; b) we draw the pairs of necklaces from less
homogeneous sets.

But if the correlation does hold up, its persistence will be extremely
strong evidence for the hypothesis we'd like to advance, and so I'm
again grateful to you for giving me a chance to briefly explain this
aspect of the matter.

djh

unread,
Mar 21, 2012, 6:14:57 PM3/21/12
to rich-...@live.com, rich....@comcast.net, koo...@sfu.ca
Rich -

It may help simplify and clarify discussion if I change analogies and
ask you to think of a filter which:

1) accepts pairs of elements as inputs,
2) decides whether any given pair is acceptable,
3) and if so, outputs the pair.

(If the pair is unacceptable, the filter drops it into the bit-
bucket.)

Now suppose that in any given input pair (i1,i2), i1 has a value from
0 to 9 for a certain parameter u, and i2 has a value from 0 to 9 for
the same parameter u. (Let us agree to refer to the value of u in i1
as u1 and the value of u in i2 as u2.)

Furthermore, suppose that across the entire set of inputs to the
filter, the number of distinct inputs with u1 = i and u2 = j is
roughly directly proportional to (u1+u2)/2, as is shown by the
correlation of # of inputs with (u1+u2)/2 at the end of each row of
this matrix:

Row Corr
u2 with
0 1 2 3 4 5 6 7 8 9 (u1+u2)/2
u1
0 output 1769 6347 8563 8321 5233 4067 2269 1549 1201 1363 -0.60
input 2080 7150 9685 9425 6370 4615 2535 1755 1300 1495 -0.60

1 output 6347 5418 14760 14349 9056 6990 3909 2669 2099 2382 -0.60
input 7150 5995 16390 15950 10780 7810 4290 2970 2200 2530 -0.61

2 output 8563 14760 9873 19218 12122 9385 5204 3573 2795 3187 -0.71
input 9685 16390 11026 21605 14602 10579 5811 4023 2980 3427 -0.70

3 output 8321 14349 19218 9167 11688 9030 4997 3472 2695 3063 -0.76
input 9425 15950 21605 10440 14210 10295 5655 3915 2900 3335 -0.76

4 output 5233 9056 12122 11688 3617 5677 3086 2166 1697 1936 -0.71
input 6370 10780 14602 14210 4753 6958 3822 2646 1960 2254 -0.72

5 output 4067 6990 9385 9030 5677 2148 2414 1653 1298 1477 -0.73
input 4615 7810 10579 10295 6958 2485 2769 1917 1420 1633 -0.73

6 output 2269 3909 5204 4997 3086 2414 632 922 728 821 -0.73
input 2535 4290 5811 5655 3822 2769 741 1053 780 897 -0.73

7 output 1549 2669 3573 3472 2166 1653 922 305 502 554 -0.74
input 1765 2970 4023 3915 2646 1917 1053 351 540 621 -0.74

8 output 1201 2099 2795 2695 1697 1298 728 502 188 442 -0.74
input 1300 2200 2980 2900 1960 1420 780 540 190 460 -0.74

9 output 1363 2382 3187 3063 1936 1477 821 554 442 242 -0.75
input 1495 2530 3427 3335 2254 1633 897 621 460 253 -0.75


Then we can say that our filter is “faithful” to the values of u1 and
u2 in its inputs, in the sense that for each i,j (0<=i<=9 and
0<=j<=9), the number of outputs with u1 = i and u2 = j is ALSO roughly
directly proportional to (u1+u2)/2, with correlation coefficients
virtually identical to those for the correlations of # of inputs with
(u1+u2)/2. (Again, at the end of each row of the above matrix, see
correlations of # of outputs with (u1+u2)/2.)

What’s important to our hypothesis is the “faithfulness” of the filter
here, so I hope you agree that it’s valid to consider the above filter
as “faithful” in the sense I’ve just described.

djh

unread,
Mar 21, 2012, 9:59:04 PM3/21/12
to
Whoops! Sorry! I meant "inversely proportional", not "directly
proportional", in my last post.
I was thinking of our RMSD vs (u1+u2)/2 correlations, which are
directly proportional.

Ray Koopman

unread,
Mar 22, 2012, 1:54:44 AM3/22/12
to
I am able to reproduce your within-row correlations for the sample
sizes but not for the average RMSDs.

0 1 2 3 4 5 6 7 8 9
djh .67 .56 .65 .70 .64 .52 .23 .74 .50 .64
rfk .689 .538 .635 .715 .643 .494 .217 .718 .513 .629

It's possible that rounding the RMSDs has influenced the correlations
that much, but there may be something else going on, too.

A comment on the evolution of the notation: When I suggested u1+u2
as a predictor -- the division by 2 is purely cosmetic, to keep the
numbers on the same scale as they started -- I was referring to the
continuous variable u that was quantized to get the categorical v1 &
v2, which are now being called u1 & u2. If there is any chance that
the continuous variable will ever be used as such then it might not
be too soon to start thinking about the notation, to avoid future
confusion.

Ray Koopman

unread,
Mar 22, 2012, 1:57:59 AM3/22/12
to
Both the input and output decrease with increasing u1+u2,
but the overall correlations with u1+u2 (after replacing the
(0,7) & (7,0) elements of the input matrix by their average,
and using only the nonredundant elements of the matrices),
are -.65 for both output and input, which suggests that
looking at the within-row correlations may be misleading.

The following table gives the yield (output/input), which is
increasing with u1+u2, with the within-row correlation dropping
as the yield increases. The overall correlation is .343 .

0 1 2 3 4 5 6 7 8 9 corr
0 .850 .888 .884 .883 .822 .881 .895 .880 .924 .912 .560
1 .888 .904 .901 .900 .840 .895 .911 .899 .954 .942 .549
2 .884 .901 .895 .890 .830 .887 .896 .888 .938 .930 .452
3 .883 .900 .890 .878 .823 .877 .884 .887 .929 .918 .377
4 .822 .840 .830 .823 .761 .816 .807 .819 .866 .859 .296
5 .881 .895 .887 .877 .816 .864 .872 .862 .914 .904 .168
6 .895 .911 .896 .884 .807 .872 .853 .876 .933 .915 .108
7 .880 .899 .888 .887 .819 .862 .876 .869 .930 .892 .154
8 .924 .954 .938 .929 .866 .914 .933 .930 .989 .961 .340
9 .912 .942 .930 .918 .859 .904 .915 .892 .961 .957 .232

djh

unread,
Mar 22, 2012, 3:43:32 AM3/22/12
to
1. Within-row vs overall.

You wrote:

> Both the input and output decrease with increasing u1+u2,
> but the overall correlations with u1+u2 (after replacing the
> (0,7) & (7,0) elements of the input matrix by their average,
> and using only the nonredundant elements of the matrices),
> are -.65 for both output and input, which suggests that
> looking at the within-row correlations may be misleading.

Since the "overall" correlations are -0.65 for both input and output,
I am more than happy to use the "overall" rather than the "within-row"
if you think it wiser. The fact that the correlations are the same,
i.e. -0.65 for both input and output, expresses the desired
"faithfulness" property of the filter as well as the within-row
correlations, so far as I can see.

2. Correlation of yield with increasing u1+u2.

You wrote:

>
> The following table gives the yield (output/input), which is
> increasing with u1+u2, with the within-row correlation dropping
> as the yield increases. The overall correlation is .343 .
>
>      0    1    2    3    4    5    6    7    8    9    corr
> 0  .850 .888 .884 .883 .822 .881 .895 .880 .924 .912  .560
> 1  .888 .904 .901 .900 .840 .895 .911 .899 .954 .942  .549
> 2  .884 .901 .895 .890 .830 .887 .896 .888 .938 .930  .452
> 3  .883 .900 .890 .878 .823 .877 .884 .887 .929 .918  .377
> 4  .822 .840 .830 .823 .761 .816 .807 .819 .866 .859  .296
> 5  .881 .895 .887 .877 .816 .864 .872 .862 .914 .904  .168
> 6  .895 .911 .896 .884 .807 .872 .853 .876 .933 .915  .108
> 7  .880 .899 .888 .887 .819 .862 .876 .869 .930 .892  .154
> 8  .924 .954 .938 .929 .866 .914 .933 .930 .989 .961  .340
> 9  .912 .942 .930 .918 .859 .904 .915 .892 .961 .957  .232

By overall of .343, I assume you mean the overall correlation of
increasing yield with increasing u1+u2.

If so, is the .343 high enough to allow any conclusions to be based on
it?

If so, then this is absolutely critical for the hypothesis we're
pursuing. According to this hypothesis, yield SHOULD increase with
increasing u1+u2 - it means that the filter is not just "faithful" but
"hyper-faithful" in exactly the way the hypothesis says it should be.
(In effect, the filter can be thought of as "tuning" the pattern in
the input data ...)

So please let me know at your earliest convenience whether the .343 is
"worth anything", and for my part, I will compute this number as I do
each of the remaining 239 matrices. I've already done several more
and the input/output "faithfulness" data are patterning virtually the
same as in this initial case, so I'm anxious to see what "yield vs
u1+u2" correlations we get - better than .343, worse, or about the
same.

Also, the "rmsd vs u1+u2" numbers for these several new cases are
interesting, but I want to do many more of the 239 before I show you
anything. As expected, these numbers "bounce" because they're
sensitive to the "necklace-length" variable and the "necklace-type"
variable, and I have to wait and see what happens when the numbers are
in for enough different lengths over enough different types.

Ray Koopman

unread,
Mar 22, 2012, 4:47:20 AM3/22/12
to
Yes. However, I think that an ordinary correlation is not quite the
right statistic to capture what's going on. I'd be inclined to try
logistic regression, because the dependent variable is a proportion.
In general, the closer the proportions get to 1, the smaller you
expect the correlations to be, and the overall relation can't be
linear. The only problem is on the interpretation side, because there
is no summary statistic in logistic regression that is analogous to
an ordinary correlation. You have to be willing to talk about the
regression of the log-odds on your independent variable.
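
To make the suggestion concrete: a grouped-data logistic regression can
be fit by handing a binomial GLM the per-cell (output, input - output)
counts with (v1+v2)/2 as the predictor. The sketch below uses
statsmodels (a tool choice of mine, not something Ray specified) and
only the first three cells of the yield matrix, so the fitted slope is
purely illustrative.

<begin Python sketch>
import numpy as np
import statsmodels.api as sm

output = np.array([1769, 6347, 5418])        # pairs the filter passed ("yes")
inputs = np.array([2080, 7150, 5995])        # total pairs presented to the filter
v_mid  = np.array([0.0, 0.5, 1.0])           # (v1+v2)/2 for cells (0,0), (1,0), (1,1)

endog = np.column_stack([output, inputs - output])   # (successes, failures) per cell
exog  = sm.add_constant(v_mid)
fit   = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.params)   # [intercept, slope]: slope = change in log-odds per unit of (v1+v2)/2
<end Python sketch>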

>
> If so, is the .343 high enough to allow any conclusions to be based
> on it?

Well, it's definitely not zero, but beyond that it would depend on
what you were trying to say ;)

djh

unread,
Mar 22, 2012, 9:13:43 AM3/22/12
to koo...@sfu.ca
You wrote:

"I'd be inclined to try logistic regression"

So you're saying:

take the binary dependent variable of the logistic regression to be:

yes = input pair was output by filter
no = input pair was not output by filter

and try to predict it using v1+v2?

Is this correct, i.e. you CAN do a logistic regression with just one
independent variable?

If more than one is better, I'll probably try v1+v2 plus the "length"
variable, i.e. the "centers" of the length intervals (17.5 for 13-22,
etc.)

(Note that I've reverted back to v1 and v2 for the quantized
variables, as per your suggestion in your note above, to distinguish
them from the non-quantized continuous variables u1 and u2.)

Rich Ulrich

unread,
Mar 22, 2012, 2:25:25 PM3/22/12
to
On Thu, 22 Mar 2012 06:13:43 -0700 (PDT), djh <halit...@att.net>
wrote:

>You wrote:
>
>"I'd be inclined to try logistic regression"
>
>So you're saying:
>
>take the binary dependent variable of the logistic regression to be:
>
>yes = input pair was output by filter
>no = input pair was not output by filter
>
>and try to predict it using v1+v2?
>
>Is this correct, i.e. you CAN do a logistics regression with just one
>independent variable?

If you are looking only at the cells and grouped data, you
can return to the roots of logistic regression, and take as
criterion the computed logit for each yield -- log(p/(1-p) ).
For data taken that way, you can compute the simple
correlation with the scored logits, or you can do ordinary
least squares regression with v1 and v2 or their sum.


>
>If more than one is better, I'll probably try v1+v2 plus the "length"
>variable, i.e. the "centers" of the length intervals (17.5 for 13-22,
>etc.)

If that has to be identified for each trial/necklace, then
you do need to use the 0/1 binary criterion in maximum
likelihood logistic regression.

>
>(Note that I've reverted back to v1 and v2 for the quantized
>variables, as per your suggestion in your note above (to distinguish
>them from the non-quantized continuous variables u1 and u2.)

--
Rich Ulrich

djh

unread,
Mar 22, 2012, 3:25:27 PM3/22/12
to rich-...@live.com, rich....@comcast.net
As I construct each of the 240 "yield" matrices required by our
theoretical framework, I'm checking to see how Ray's simple "overall"
approach to the "yield vs (v1+v2)/2" correlation works: does it hover
around the .343 that he got (see his post above), go higher, go lower,
etc. At the moment it's doing some interesting things so I may not
have to go to any "deeper" method just to try and find something I
haven't found before.

But nonetheless, I would very much like to learn exactly what you mean
here:

"If you are looking only at the cells and grouped data, you
can return to the roots of logistic regression, and take as
criterion the computed logit for each yield -- log(p/(1-p) ).
For data taken that way, you can compute the simple
correlation with the scored logits, or you can do ordinary
least squares regression with v1 and v2 or their sum."

so I can try it in comparison to the simple approach just to see what
happens.

Are you saying that I should take Ray's matrix of yields:

v2
0 1 2 3 4 5 6 7 8 9
v1
0 .850 .888 .884 .883 .822 .881 .895 .880 .924 .912
1 .888 .904 .901 .900 .840 .895 .911 .899 .954 .942
2 .884 .901 .895 .890 .830 .887 .896 .888 .938 .930
3 .883 .900 .890 .878 .823 .877 .884 .887 .929 .918
4 .822 .840 .830 .823 .761 .816 .807 .819 .866 .859
5 .881 .895 .887 .877 .816 .864 .872 .862 .914 .904
6 .895 .911 .896 .884 .807 .872 .853 .876 .933 .915
7 .880 .899 .888 .887 .819 .862 .876 .869 .930 .892
8 .924 .954 .938 .929 .866 .914 .933 .930 .989 .961
9 .912 .942 .930 .918 .859 .904 .915 .892 .961 .957

and then:

a) compute log(p/(1-p)) for each of the 55 non-redundant (upper-
triangular) cells
b) simply run the correlation of these 55 values with (v1+v2)/2?

If this is what you're saying, then I have to confess that I don't
know how to take log(p/(1-p)) for each of the 55 non-redundant cells of
Ray's matrix. If you think it's something that I can and should learn
about on my own, then that's what I'll do. (I realize that it isn't
your responsibility to give "pro bono" tutorials here ...)

But if you have the time to explain in a little more detail what I
need to do, I would be very grateful, Rich.

Thanks so much again ...





Ray Koopman

unread,
Mar 22, 2012, 4:13:57 PM3/22/12
to
Look at a plot of the logits against (v1+v2)/2. The ordinary linear
correlation is .414, and a maximum likelihood logistic regression
gives a slope of .0682. But with weighting that simulates analyzing
the individual 0/1 data, maximum likelihood logistic regression gives
a slope of -.00140, and the correspondingly weighted correlation is
-.0804. To see why, re-do the plot, making the area of each dot
proportional to the reciprocal of the variance of the corresponding
logit; i.e., logit = log[#yes/#no], var[logit] = 1/#yes + 1/#no.
The problem is that the points that show a positive relation are
based on little data, whereas the points that show a negative
relation are based on lots of data.
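
Written out, this weighting can be coded directly: per cell, logit =
log(output/(input - output)), var[logit] = 1/output + 1/(input - output),
and each cell's weight is the reciprocal of that variance. The
weighted-correlation helper below is generic and its name is mine; only
three cells are typed in, so the printed numbers are illustrative
rather than a reproduction of the .414 and -.0804.

<begin Python sketch>
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation of x and y using nonnegative weights w."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.asarray(w, float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

# first three nonredundant cells of the yield matrix, for illustration
output = np.array([1769.0, 6347.0, 5418.0])
inputs = np.array([2080.0, 7150.0, 5995.0])
v_mid  = np.array([0.0, 0.5, 1.0])                       # (v1+v2)/2

logit = np.log(output / (inputs - output))               # log[#yes/#no]
w     = 1.0 / (1.0 / output + 1.0 / (inputs - output))   # 1 / var[logit]

print(np.corrcoef(v_mid, logit)[0, 1])                   # unweighted correlation
print(weighted_corr(v_mid, logit, w))                    # weighted toward well-determined cells
<end Python sketch>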

djh

unread,
Mar 22, 2012, 4:41:22 PM3/22/12
to
PS to Rich - when I said I didn't know how to take log(p/(1-p)), I
didn't mean I don't know how to take the log. I meant that I don't
know what p is in this situation.

Note to Ray - I'm assuming your last post was to Rich - I have no idea
what you're talking about, except very vaguely ...


Rich Ulrich

unread,
Mar 22, 2012, 5:43:30 PM3/22/12
to
On Thu, 22 Mar 2012 13:41:22 -0700 (PDT), djh <halit...@att.net>
wrote:
Ray was answering very completely, for me and to you.

Each cell is a proportion/probability. For .850 (1st cell),
replace that number with log (.850/.150). That is the logit.

Further, Ray was showing the results. He shows that the simple
correlation is increased, using 55 values versus that average term.

However, the cells represent different Ns. When he applies weights
that represent the Ns, the r becomes very small, and in the opposite
direction.

--
Rich Ulrich

djh

unread,
Mar 22, 2012, 8:56:38 PM3/22/12
to rich-...@live.com, rich....@comcast.net, koo...@sfu.ca
Thanks for the clarification regarding treating the proportions as
probabilities, Rich - much appreciated. Sorry for my relatively
abysmal ignorance of these matters ...

I have completed two more "yield" matrices like the one you and Ray
have been working with. The data are still from the globin proteins,
but the data reflect "necklaces lengths" of 23-32 and 33-42, rather
than the "necklace length" of 13-22 reflected in the data of the
matrix you've been working with.

As soon as I've computed the logits and done the simple linear
correlations, I will present the results in three ways. For each of
the two new matrices, I will provide:

a) the simple linear correlation on the proportions themselves
(overall for all 55 non-redundant cells)
b) the simple linear correlation on the logits of the cells (again
overall for all 55 non-redundant cells)
c) the 10 "within-row" correlations

Even though Ray noted above that one must proceed with caution
regarding the "within-row" correlations, I think it's important for
you and him to see some remarkable "highs" that manifest themselves in
these "within row" correlations.

Also, I hope you can make some statistical sense of the wide swing in
correlation (a) between the two new matrices (from ~ +0.40 to ~
-0.46.) I can readily make scientific sense of this swing, given the
fact that one reflects data with length variable 23-32 and the other
reflects data with length variable 33-42, but I'm hoping that you and
Ray will be able to say with some confidence that the swing does not
merely represent the erratic behavior of an essentially "random"
variable. That would put an end to this aspect of my team's
investigation quite promptly.

Finally, I have one further question at this point. Does the fact
that Ray's weighting made the .41 "disappear" mean that we must
discount the increase from .34 to .41 which he obtained by taking
logits? Or is this increase still meaningful in some sense? (I ask
this question, of course, relative the assumption that the "yield"
variable is not, in fact, simply "random".)

Anyway, thanks so much again for continuing to take the time ...

djh

unread,
Mar 23, 2012, 2:01:37 AM3/23/12
to rich-...@live.com, rich....@comcast.net, koo...@sfu.ca
Ray and Rich -

Below are the "yield" analyses for the globin protein length 23-32
data and globin protein length 33-42 data. I am hoping that these
analyses, together with the analysis you've already done of the globin
length 13-22 data, will enable you to say with some confidence that:

a) the "overall" (55-cell) p vs (v1+v2)/2 correlations are
statistically meaningful and can be used in scientific argument

and/or

b) the "overall" (55-cell) logit vs (v1+v2)/2 correlatiosn are
statistically meaningful and can be used in scientific argument

If you can't tell yet from these three analyses, I have 7 more
matrices to do for the globins for lengths 43-52,...,103-112, then the
same 10 again for each of our other five protein classes. And of
course, the same for the data resulting from each of our three control
protocols. So if you can't form a principled judgment yet, there will
be plenty enough additional data coming in.

Also, from a purely scientific point of view, I would like to make use
of the relatively high "within-row" correlations which I've reported
below in addition to the "overall" correlations on the 55 non-
redundant cells. So could you take a look at these and tell me if
it's legitimate or not to do so.

Thank you so very much again.


Globin proteins
Yield Analysis for Length 23-32 Data

Non-Redundant Overall Data (55 cells)

Correlation of p against (v1+v2)/2 = 0.4688
Correlation of logit against (v1+v2)/2 = 0.5704

(v1+v2)/2   output   input   p   1-p   log(p/(1-p))

0 845 1176 0.718537415 0.281462585 0.407028715
0.5 3738 4949 0.755304102 0.244695898 0.489495154
1 6074 7252 0.837562052 0.162437948 0.712329497
1.5 4589 5586 0.821518081 0.178481919 0.663022899
2 2801 3577 0.783058429 0.216941571 0.557451388
2.5 1941 2401 0.808413161 0.191586839 0.625267704
3 1077 1274 0.845368917 0.154631083 0.737749477
3.5 1368 1568 0.87244898 0.12755102 0.835056102
4 503 637 0.789638932 0.210361068 0.574463187
4.5 428 490 0.873469388 0.126530612 0.83905208
1 3865 5050 0.765346535 0.234653465 0.513431148
1.5 12535 14948 0.838573722 0.161426278 0.715567016
2 9478 11514 0.823171791 0.176828209 0.667938931
2.5 5767 7373 0.782178218 0.217821782 0.55520441
3 4031 4949 0.814507981 0.185492019 0.642570117
3.5 2214 2626 0.843107388 0.156892612 0.730280401
4 2825 3232 0.874071782 0.125928218 0.841424043
4.5 1041 1313 0.792840823 0.207159177 0.582881825
5 868 1010 0.859405941 0.140594059 0.786231381
2 9675 10878 0.889409818 0.110590182 0.905385346
2.5 14726 16872 0.872807018 0.127192982 0.836455079
3 9009 10804 0.83385783 0.16614217 0.700612134
3.5 5677 7252 0.782818533 0.217181467 0.556838336
4 3086 3848 0.801975052 0.198024948 0.60744095
4.5 2166 4736 0.457347973 0.542652027 -0.074274671
5 1697 1924 0.882016632 0.117983368 0.873655985
5.5 1384 1480 0.935135135 0.064864865 1.158864857
3 5502 6441 0.854215184 0.145784816 0.767854994
3.5 6808 8322 0.818072579 0.181927421 0.652893672
4 4797 5586 0.858754028 0.141245972 0.783892715
4.5 2616 2964 0.882591093 0.117408907 0.876058496
5 3361 3648 0.921326754 0.078673246 1.068586616
5.5 1254 1482 0.846153846 0.153846154 0.740362689
6 1037 1140 0.909649123 0.090350877 1.002941532
4 2027 2628 0.77130898 0.22869102 0.527979277
4.5 2917 3577 0.815487839 0.184512161 0.645392494
5 1579 1898 0.831928346 0.168071654 0.694591447
5.5 2039 2336 0.872859589 0.127140411 0.836660776
6 755 949 0.795574289 0.204425711 0.590145222
6.5 627 730 0.85890411 0.14109589 0.784430316
5 1004 1176 0.853741497 0.146258503 0.766205266
5.5 1119 1274 0.87833595 0.12166405 0.858498388
6 1426 1568 0.909438776 0.090561224 1.001831181
6.5 528 637 0.8288854 0.1711146 0.685207425
7 443 490 0.904081633 0.095918367 0.974305868
6 292 325 0.898461538 0.101538462 0.946868912
6.5 784 832 0.942307692 0.057692308 1.213074825
7 291 338 0.860946746 0.139053254 0.791795131
7.5 237 260 0.911538462 0.088461538 1.01302051
7 475 496 0.95766129 0.04233871 1.354474315
7.5 366 416 0.879807692 0.120192308 0.864511081
8 312 320 0.975 0.025 1.591064607
8 63 78 0.807692308 0.192307692 0.62324929
8.5 116 130 0.892307692 0.107692308 0.918329954
9 43 45 0.955555556 0.044444444 1.33243846


Globin proteins
Yield Analysis for Length 33-42 Data

Non-Redundant Overall Data (55 cells)

Correlation of p against (v1+v2)/2 = -0.4611
Correlation of logit against (v1+v2)/2 = -0.4443

Note the zero output for (v1+v2)/2 = 8 from v1 = 8 and v2 = 8

(v1+v2)/2   output   input   p   1-p   log(p/(1-p))

0 202 595 0.339495798 0.660504202 -0.289041181
0.5 1448 3325 0.435488722 0.564511278 -0.112695711
1 2098 5355 0.39178338 0.60821662 -0.191012275
1.5 1121 3045 0.368144499 0.631855501 -0.234599455
2 878 2065 0.425181598 0.574818402 -0.130956203
2.5 485 1050 0.461904762 0.538095238 -0.066306709
3 322 630 0.511111111 0.488888889 0.019305155
3.5 133 315 0.422222222 0.577777778 -0.136219747
4 26 175 0.148571429 0.851428571 -0.75821292
4.5 53 175 0.302857143 0.697142857 -0.362083961
1 2410 4465 0.539753639 0.460246361 0.069205216
1.5 7012 14535 0.482421741 0.517578259 -0.030549154
2 3770 8265 0.456140351 0.543859649 -0.076388346
2.5 2877 5605 0.513291704 0.486708296 0.023095496
3 1609 2850 0.564561404 0.435438596 0.112784263
3.5 1099 1710 0.642690058 0.357309942 0.254956482
4 437 865 0.505202312 0.494797688 0.009037668
4.5 73 474 0.154008439 0.845991561 -0.739821512
5 160 475 0.336842105 0.663157895 -0.294190571
2 4959 11628 0.426470588 0.573529412 -0.128666609
2.5 5340 13311 0.401171963 0.598828037 -0.173971552
3 4077 9027 0.451645065 0.548354935 -0.084264487
3.5 2279 4590 0.496514161 0.503485839 -0.00605562
4 1569 2754 0.569716776 0.430283224 0.121904593
4.5 639 1377 0.464052288 0.535947712 -0.062555504
5 102 765 0.133333333 0.866666667 -0.812913357
5.5 227 765 0.296732026 0.703267974 -0.374756418
3 1384 3741 0.369954558 0.630045442 -0.231223492
3.5 2175 5133 0.423728814 0.576271186 -0.133538908
4 1213 2610 0.464750958 0.535249042 -0.061335605
4.5 821 1566 0.524265645 0.475734355 0.042186884
5 337 783 0.430395913 0.569604087 -0.121704958
5.5 61 435 0.140229885 0.859770115 -0.787541767
6 116 435 0.266666667 0.733333333 -0.439332694
4 812 1711 0.474576271 0.525423729 -0.044203662
4.5 908 1770 0.51299435 0.48700565 0.022578583
5 623 1062 0.586629002 0.413370998 0.152023526
5.5 251 531 0.472693032 0.527306968 -0.04748431
6 42 295 0.142372881 0.857627119 -0.779871231
6.5 84 295 0.284745763 0.715254237 -0.400003169
5 231 435 0.531034483 0.468965517 0.053981812
5.5 334 540 0.618518519 0.381481481 0.209879246
6 135 270 0.5 0.5 0
6.5 22 150 0.146666667 0.853333333 -0.764787289
7 50 150 0.333333333 0.666666667 -0.301029996
6 90 163 0.552147239 0.447852761 0.090919649
6.5 92 162 0.567901235 0.432098765 0.118689787
7 22 90 0.244444444 0.755555556 -0.490086232
7.5 33 90 0.366666667 0.633333333 -0.237360916
7 14 36 0.388888889 0.611111111 -0.196294645
7.5 5 45 0.111111111 0.888888889 -0.903089987
8 14 45 0.311111111 0.688888889 -0.345233658
8 0 45 0
8.5 2 45 0.044444444 0.955555556 -1.33243846
9 2 10 0.2 0.8 -0.602059991


Globin proteins
Yield Analysis for Length 23-32 Data

"Within-Row" Correlations of output against (v1+v2)/2 and input
against (v1+v2)/2

Row Corr
v2 with
0 1 2 3 4 5 6 7 8 9 (v1+v2)/2
v1
0 output 845 3738 6074 4589 2801 1941 1077 1368 503 428 -0.59
input 1176 4949 7252 5586 3577 2401 1274 1568 637 490 -0.62

1 output 3738 3865 12535 9478 5767 4031 2214 2825 1041 868 -0.58
input 4949 5050 14948 11514 7373 4949 2626 3232 1313 1010 -0.61

2 output 6074 12535 9675 14726 9009 5677 3086 2166 1697 1384 -0.74
input 7252 14948 10878 16872 10804 7252 3848 4736 1924 1480 -0.75

3 output 4589 9478 14726 5502 6808 4797 2616 3361 1254 1037 -0.69
input 5586 11514 16872 6441 8322 5586 2964 3648 1482 1140 -0.72

4 output 2801 5767 9009 6808 2027 2917 1579 2039 755 627 -0.68
input 3577 7973 10804 8322 2628 3577 1898 2336 949 730 -0.71

5 output 1941 4031 574 4797 2917 1004 1119 1426 528 443 -0.56
input 2401 4949 7252 5586 3577 1176 1274 1568 637 490 -0.71

6 output 1077 2214 3454 2616 1579 1119 292 784 291 237 -0.70
input 1274 2626 3848 2964 1898 1274 325 832 338 260 -0.72

7 output 1368 2825 4423 3361 2039 1426 784 475 366 312 -0.72
input 1568 3232 4736 3648 2336 1568 832 496 416 320 -0.74

8 output 503 1041 1664 1254 755 528 291 366 63 116 -0.70
input 637 1313 1924 1482 949 637 338 416 78 130 -0.73

9 output 428 868 1384 1037 627 443 237 312 116 43 -0.71
input 490 1010 1480 1140 730 490 260 320 130 45 -0.73


Globin proteins
Yield Analysis for Length 33-42 Data

"Within-Row" Correlations of output against (v1+v2)/2 and input
against (v1+v2)/2
Row Corr
v2 with
0 1 2 3 4 5 6 7 8 9 (v1+v2)/2
v1
0 output 202 1448 2098 1121 878 485 322 133 26 53 -0.63
input 595 3325 5355 3045 2065 1050 630 315 175 175 -0.62

1 output 1448 2410 7012 3770 2877 1609 1099 437 73 160 -0.61
input 3325 4465 14535 8265 5605 2850 1710 865 474 475 -0.60

2 output 2098 7012 4959 5340 4077 2279 1569 639 102 227 -0.77
input 5355 14535 11628 13311 9027 4590 2754 1377 765 765 -0.78

3 output 1121 3770 5340 1384 2175 1213 821 337 61 116 -0.67
input 3045 8265 13311 3741 5133 2610 1566 783 435 435 -0.67

4 output 878 2877 4077 2175 812 908 623 251 42 84 -0.69
input 2065 5605 9027 5133 1711 1770 1062 531 295 295 -0.68

5 output 485 1609 2279 1213 908 231 334 135 22 50 -0.70
input 1050 2850 4590 2610 1770 435 540 270 150 150 -0.69

6 output 322 1099 1569 821 623 334 90 92 22 33 -0.70
input 630 1710 2754 1566 1062 540 163 162 90 90 -0.70

7 output 133 437 639 337 251 135 92 14 5 14 -0.70
input 315 855 1377 783 531 270 162 36 45 45 -0.70

8 output 26 73 102 61 42 22 15 5 0 2 -0.74
input 315 855 1377 783 531 270 162 36 45 45 -0.70

9 output 53 160 227 116 84 50 33 14 2 2 -0.72
input 175 475 765 435 295 150 90 45 25 10 -0.70




Ray Koopman

unread,
Mar 23, 2012, 2:44:45 AM3/23/12
to
The usual regression formulas assume that error is homoscedastic;
that is, that the variance of the distribution from which the error
component in the equation observed_i = model_i + error_i
is drawn is the same for all data points i. When the error variances
are known to differ substantially, least-squares fitting,
which usually solves for the model parameters that minimize
sum (observed_i - model_i)^2, should instead minimize
sum (observed_i - model_i)^2/variance[error_i],
in which each data point is weighted proportional to
the reciprocal of its error variance.
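
For a straight line, that weighting reduces to a simple closed form: a
weighted slope with weights equal to the reciprocals of the error
variances (the w[p] and w[logit] columns in the table below). Here is a
generic sketch (the function name is mine, not the code used for the
table):

<begin Python sketch>
import numpy as np

def wls_slope(x, y, err_var):
    """Slope of y on x minimizing sum (y_i - a - b*x_i)^2 / err_var_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 1.0 / np.asarray(err_var, float)   # weight = reciprocal of error variance
    xw = np.sum(w * x) / np.sum(w)         # weighted means
    yw = np.sum(w * y) / np.sum(w)
    return np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
<end Python sketch>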

The following table uses data posted on Mar 21 at 3:14 pm PDT.
I have scaled the weights so that the maximum weight is 1.

There are four correlations with v1+v2 to consider, in a 2 x 2 table:

unweighted weighted

p .343 .188

logit .414 -.080

The weighted correlations are so much lower than the unweighted
correlations because the points that tend to increase the correlation
are poorly determined relative to the points that tend to decrease
the correlation. This is especially true for the logit.

v1,v2
| output
| | input
| | | p = out/in
| | | |
| | | | w[p] = in/(p(1-p))
| | | | |
| | | | | logit = log[p/(1-p)]
| | | | | |
| | | | | | w[logit] = in*p(1-p)
| | | | | | |
0,0 1769 2080 .850 .074 1.738 .125
1,0 6347 7150 .888 .326 2.067 .336
1,1 5418 5995 .904 .314 2.240 .246
2,0 8563 9685 .884 .430 2.032 .467
2,1 14760 16390 .901 .832 2.203 .691
2,2 9873 11026 .895 .536 2.147 .486
3,0 8321 9425 .883 .415 2.020 .459
3,1 14349 15950 .900 .803 2.193 .678
3,2 19218 21605 .890 1.000 2.086 1.000
3,3 9167 10440 .878 .444 1.974 .526
4,0 5233 6370 .822 .198 1.527 .440
4,1 9056 10780 .840 .365 1.659 .682
4,2 12122 14602 .830 .471 1.587 .970
4,3 11688 14210 .823 .443 1.534 .977
4,4 3617 4753 .761 .119 1.158 .407
5,0 4067 4615 .881 .201 2.004 .227
5,1 6990 7810 .895 .378 2.143 .346
5,2 9385 10579 .887 .481 2.062 .499
5,3 9030 10295 .877 .435 1.965 .523
5,4 5677 6958 .816 .211 1.489 .492
5,5 2148 2485 .864 .096 1.852 .137
6,0 2269 2535 .895 .123 2.144 .112
6,1 3909 4290 .911 .241 2.328 .164
6,2 5204 5811 .896 .283 2.149 .256
6,3 4997 5655 .884 .250 2.027 .274
6,4 3086 3822 .807 .112 1.433 .280
6,5 2414 2769 .872 .113 1.917 .146
6,6 632 741 .853 .027 1.758 .044
7,0 1549 1760 .880 .076 1.994 .087
7,1 2669 2970 .899 .148 2.182 .127
7,2 3573 4023 .888 .184 2.072 .188
7,3 3472 3915 .887 .177 2.059 .185
7,4 2166 2646 .819 .081 1.507 .185
7,5 1653 1917 .862 .073 1.834 .107
7,6 922 1053 .876 .044 1.951 .054
7,7 305 351 .869 .014 1.892 .019
8,0 1201 1300 .924 .084 2.496 .043
8,1 2099 2200 .954 .228 3.034 .045
8,2 2795 2980 .938 .233 2.715 .082
8,3 2695 2900 .929 .201 2.576 .090
8,4 1697 1960 .866 .077 1.864 .107
8,5 1298 1420 .914 .082 2.365 .053
8,6 728 780 .933 .057 2.639 .023
8,7 502 540 .930 .038 2.581 .017
8,8 188 190 .989 .083 4.543 .001
9,0 1363 1495 .912 .084 2.335 .057
9,1 2382 2530 .942 .209 2.778 .066
9,2 3187 3427 .930 .239 2.586 .105
9,3 3063 3335 .918 .203 2.421 .118
9,4 1936 2254 .859 .085 1.806 .129
9,5 1477 1633 .904 .086 2.248 .066
9,6 821 897 .915 .053 2.380 .033
9,7 554 621 .892 .029 2.112 .028
9,8 442 460 .961 .056 3.201 .008
9,9 242 253 .957 .028 3.091 .005

djh

unread,
Mar 23, 2012, 8:49:49 AM3/23/12
to koo...@sfu.ca
You were kind enough to take the time to explain the following:

> The usual regression formulas assume that error is homoscedastic;
> that is, that the variance of the distribution from which the error
> component in the equation observed_i = model_i + error_i
> is drawn is the same for all data points i. When the error variances
> are known to differ substantially, least-squares fitting,
> which usually solves for the model parameters that minimize
> sum (observed_i - model_i)^2, should instead minimize
> sum (observed_i - model_i)^2/variance[error_i],
> in which each data point is weighted proportional to
> the reciprocal of its error variance.

for which I am indebted. It's perfectly clear and actually makes
perfect sense once explained.

I'm assuming it's your judgment that the low and oppositely signed
values for the two weighted correlations imply that the underlying
correlations are really meaningless for this first case of length =
13 - 22. (Rich seems to be saying the same thing.)

So based on this fact, is it still worthwhile to do the same weighting
on the last two matrices I sent, to see what happens? Or can you
already tell that this is a fruitless exercise which will just confirm
the results on the first matrix?

Finally - one last question. From a purely statistical point of view,
is it possible for certain "within-row" correlations to still be
meaningful, even though the "overall" correlations don't seem to be,
at least in this first case?

Thanks so much again, Ray.

djh

unread,
Mar 23, 2012, 9:08:48 AM3/23/12
to koo...@sfu.ca
One other question I forgot to ask above:

Would it be "legit" to go back to the original "output" population for
each individual cell, extract the number within 1 or 2 sd of the mean
for the population, and then recompute all the matrices/correlations/
weighted correlations using these "central tendency" output counts?

Ray Koopman

unread,
Mar 24, 2012, 3:23:15 AM3/24/12
to
I've been thinking about weighted vs unweighted estimation for your
correlations, and I've come to a tentative conclusion. First, let me
say that I think things will go better if you look at and argue from
the slopes of the regression lines instead of the correlations. The
two statistics always have the same sign, but slopes have at least
two advantages:

1. Slopes are not sensitive to irrelevant features of the data
such as the predictor distributions. For this reason, slopes are
considered "structural parameters" of the relation, but correlations
are not.

2. We can easily get good estimates of the standard errors of the
slopes, but we can't for correlations. (The sample size for the
overall correlations is not 55, and for the within-row correlations
it is not 10; the usual inferential machinery for correlations does
not apply.)

We want to estimate the slope of the linear regression of p on v
even tho we know that the true regression is not linear. That is,
we know that the conditional expectation of p given v is not a linear
function of v, but nevertheless we want to estimate what the slope
of the best-fitting straight line would be if each proportion were
equal to its conditional expectation.

The problem is that the methods we've been using assume that the true
relation is linear (or, more generally, that it has the same form as
the model we're fitting). When this assumption is true, the estimated
slope is an unbiased estimate of the true slope, regardless of what
weights are used; and when, as in the present case, large-sample
model-free estimates of the error variance of each data point are
available, weights can then be chosen to minimize the variance of
the estimated slope.

But that works only if the model is veridical or if the variances
are unrelated to the model, and neither condition holds here. AFAIK
-- and I hope someone will jump in here if I'm wrong -- if we want
an unbiased estimate of the slope then we must weight all the points
equally, which will give us a slope whose standard error may be very
large.

Here's how to get the standard error of the estimated slope.
Using general notation, for each data point we have a triple
{x, y, v}, where x is the predictor, y is the response,
and v is the error variance of y.

Let w = x - mean[x]. Then the slope = sum w*y / sum w^2 ,

and its standard error = sqrt[sum v*w^2] / sum w^2 .

That standard error can be used to test hypotheses
and get confidence intervals.

For the data at hand, the predictor "x" has been either v2 (within-
row), or (v1+v2)/2 (overall). (For correlations, the division by 2
could be omitted at will, but for slopes it needs to be there to
keep the overall slopes comparable to the within-row slopes.)

The response "y" has been either
p = output/input or logit[p] = log[p/(1-p)].

For p, the variance "v" is p(1-p)/input;
for logit[p], the variance is 1/output + 1/(input-output).
Both of those assume that both input and output are "large",
so some adjustments will be needed for some of the new data.
A simple one is to add 1 to the input and .5 to the output.
It's far from perfect, but will at least take care of the zeros.
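
For concreteness, here is a minimal NumPy sketch of the recipe above, applied to the (0,*) row of counts that appears later in the thread (with 1755 for the (0,7) cell); it reproduces the slope of about .0108 and the corrected standard error of about .0015 that come up further on:

import numpy as np

out = np.array([1769, 6347, 8563, 8321, 5233, 4067, 2269, 1549, 1201, 1363], float)
inp = np.array([2080, 7150, 9685, 9425, 6370, 4615, 2535, 1755, 1300, 1495], float)
x = np.arange(10) / 2.0                      # predictor (v1+v2)/2 for this row

def slope_and_se(x, y, v):
    # slope = sum w*y / sum w^2,  s.e. = sqrt[sum v*w^2] / sum w^2
    w = x - x.mean()
    return np.sum(w * y) / np.sum(w**2), np.sqrt(np.sum(v * w**2)) / np.sum(w**2)

p = out / inp
v_p = p * (1 - p) / inp                      # error variance of p
print(slope_and_se(x, p, v_p))               # about (0.0108, 0.0015)

logit = np.log(p / (1 - p))
v_logit = 1 / out + 1 / (inp - out)          # error variance of logit[p]
print(slope_and_se(x, logit, v_logit))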

djh

unread,
Mar 24, 2012, 11:04:25 AM3/24/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com, rich-...@live.com, rich....@comcast.net
Ray -

Thank you very much for the advice and "spec" that you provided in
your last post. I will program the "spec" as you gave it and report
back the results for our first set of four Yield-related matrices.
(In this regard, would it be easier for me to work in Minitab rather
than the amalgam of PERL and Excel I'm currently using? If so, I can
obtain a copy thru Vanderbilt; if not, please suggest an alternative;
I do run Linux as well as MS, so if you think something like SPSS
would be better than Minitab, please let me know.)

I assume that you would suggest taking the same "slope" approach to
our RMSD-related data as you suggested above for our yield-related
data. I have been holding back the RMSD-related data so that we could
concentrate on one thing at a time, but now that you've decided on an
approach to the yield-related data, it would be good to know whether
we should use the same approach for the RMSD-related data.

Here are the over-all RMSD vs (v1+v2)/2 correlations for our first
four length intervals for the globin proteins. Note that they swing
down and grow increasingly negative as the length variable increases
(which is very much "sauce for our goose" from a purely scientific
point of view.)

Globin proteins

Length 13-22: over-all RMSD vs (v1+v2)/2 = 0.641048664
Length 23-32: over-all RMSD vs (v1+v2)/2 = 0.020929748
Length 33-42: over-all RMSD vs (v1+v2)/2 = -0.201262146
Length 43-52: over-all RMSD vs (v1+v2)/2 = -0.213178417


Although these correlations may not seem much to write home about, you
can see from the matrices below that we do get some very high within-
row correlations, most consistently at length 33-42. Again, it makes
perfect sense from a scientific point of view that RMSD-related
correlations should be most highly negative at certain lengths, less
so at other lengths, and should disappear entirely at other lengths.
This is because the length-related "sweet-spots" can be related to
structural features of the type of protein under examination (in this
case the "all-helical" globin proteins.)

So, assuming that you think it worthwhile (based on these first four
cases) to continue analyzing RMSD as well as Yield, I am assuming that
you would want me to do so using the same spec as you provided above
for the yield analyses.

Please reply back re this question when you have a chance, and again,
thank you so very much again for your time.

Here are the "within-row" correlations for our first four length
intervals for the globin proteins.

Globin proteins
Length 13-22
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 1.32 1.31 1.34 1.31 1.36 1.38 1.43 1.33 1.42 1.38 0.67
1 1.31 1.27 1.32 1.30 1.33 1.34 1.40 1.30 1.42 1.32 0.56
2 1.34 1.32 1.38 1.35 1.38 1.39 1.42 1.34 1.46 1.39 0.65
3 1.31 1.30 1.35 1.33 1.36 1.38 1.38 1.32 1.43 1.38 0.70
4 1.36 1.33 1.38 1.36 1.41 1.39 1.43 1.37 1.46 1.39 0.63
5 1.36 1.34 1.39 1.38 1.39 1.41 1.43 1.37 1.45 1.38 0.52
6 1.43 1.40 1.42 1.38 1.43 1.43 1.45 1.39 1.46 1.41 0.23
7 1.33 1.30 1.34 1.32 1.37 1.37 1.39 1.38 1.44 1.35 0.74
8 1.42 1.42 1.46 1.43 1.46 1.45 1.46 1.44 1.57 1.44 0.50
9 1.38 1.32 1.39 1.38 1.39 1.38 1.41 1.35 1.44 1.44 0.64


Globin proteins
Length 23-32
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 1.61 1.66 1.67 1.68 1.70 1.77 1.72 1.79 1.85 1.66 0.66
1 1.66 1.64 1.59 1.61 1.62 1.65 1.62 1.70 1.69 1.61 0.25
2 1.67 1.59 1.56 1.56 1.60 1.39 1.43 1.37 1.46 1.39 -0.86
3 1.68 1.61 1.56 1.55 1.59 1.60 1.56 1.63 1.74 1.49 -0.12
4 1.70 1.62 1.60 1.59 1.62 1.66 1.66 1.69 1.77 1.57 0.17
5 1.77 1.65 1.58 1.60 1.66 1.66 1.63 1.70 1.77 1.58 -0.04
6 1.72 1.62 1.55 1.56 1.66 1.63 1.61 1.67 1.75 1.53 -0.02
7 1.79 1.70 1.65 1.63 1.69 1.70 1.67 1.69 1.75 1.64 -0.22
8 1.85 1.69 1.74 1.74 1.77 1.77 1.75 1.75 1.85 1.60 -0.24
9 1.66 1.61 1.56 1.49 1.57 1.58 1.53 1.64 1.60 1.45 -0.40


Globin proteins
Length 33-42
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 0.92 0.90 0.92 0.90 0.87 0.84 0.89 0.86 1.01 0.89 0.08
1 0.90 0.90 0.92 0.92 0.93 0.86 0.92 0.88 0.84 0.88 -0.55
2 0.92 0.92 0.92 0.91 0.88 0.88 0.90 0.88 0.83 0.85 -0.90
3 0.90 0.92 0.91 0.89 0.87 0.86 0.86 0.84 0.84 0.81 -0.95
4 0.87 0.89 0.88 0.87 0.84 0.81 0.85 0.83 0.84 0.71 -0.75
5 0.84 0.86 0.88 0.87 0.81 0.80 0.80 0.85 0.75 0.82 -0.60
6 0.89 0.92 0.90 0.86 0.85 0.80 0.82 0.81 0.69 0.74 -0.91
7 0.86 0.88 0.88 0.84 0.83 0.85 0.81 0.89 1.05 0.82 0.21
8 1.01 0.84 0.83 0.84 0.84 0.75 0.69 1.05 0.57 -0.51
9 0.89 0.88 0.85 0.81 0.71 0.82 0.74 0.82 1.34 1.34 0.56

Globin proteins
Length 43-52
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 1.05 1.03 1.08 1.05 1.01 1.01 1.06 0.98 0.86 1.17 -0.14
1 1.03 1.06 1.08 1.06 1.06 1.03 1.08 0.96 1.02 1.08 -0.22
2 1.08 1.08 1.03 1.06 1.04 1.00 1.01 1.01 0.84 1.21 -0.16
3 1.05 1.06 1.06 1.03 1.00 1.03 1.06 0.97 0.86 1.24 0.01
4 1.01 1.06 1.04 1.00 1.01 0.94 1.03 0.95 0.82 -0.73
5 1.01 1.03 1.00 1.03 0.94 0.87 0.99 1.07 0.79 0.99 -0.48
6 1.06 1.08 1.01 1.06 1.07 1.00 0.91 1.04 0.81 1.23 -0.70
7 0.98 0.96 1.01 0.97 0.95 1.07 1.04 1.03 0.98 0.44
8 0.86 1.02 0.84 0.86 0.82 0.79 0.83 0.83 -0.54
9 1.17 1.08 1.21 1.24 0.99 1.23 -0.04




djh

unread,
Mar 24, 2012, 12:28:31 PM3/24/12
to

Ray -

This is a simple question just to make sure I understand w = x -
mean[x] correctly.

For overall correlations in which we've plugged "zeroes" according to
your "input+1" and "output+0.5" suggestion, there are always 55 values
of x:

0, 0.5, 1, 1.5, . . . , 8, 8.5, 9

So if I use all 55 of these values to compute the mean, we get mean[x]
= 4.5

But of course, values duplicate in 0, 0.5, 1, 1.5, . . . , 8, 8.5, 9,
e.g. 2 from (1,3) and 2 from (2,2), and if I take the mean of the non-
redundant values, then mean[x] is again 4.5

So, I assume that I use mean[x] = 4.5 for the overall data, correct?

And by analogy:

mean[x] = 2.25 for the row (0,0),...,(0,9)
mean[x] = 2.75 for the row (1,0),...,(1,9)
mean[x] = 3.25 for the row (2,0),...,(2,9)

Correct?

If I'm incorrect regarding any of the above, please take a moment to
clarify.

Thanks very much again.
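
A quick numerical check of these means, assuming the 55 distinct cells are the v2 <= v1 half of the matrix and each row pairs a fixed v1 with v2 = 0,...,9:

import numpy as np

x_all = np.array([(v1 + v2) / 2 for v1 in range(10) for v2 in range(v1 + 1)])
print(len(x_all), x_all.mean())                              # 55 cells, mean[x] = 4.5

for v1 in range(3):
    print(v1, np.mean([(v1 + v2) / 2 for v2 in range(10)]))  # 2.25, 2.75, 3.25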





djh

unread,
Mar 24, 2012, 4:47:11 PM3/24/12
to koo...@sfu.ca
Here is what I get for the within-row slope and standard error for row
(00...09) for length 13-22, using your definitions of x, y, w, and v.

The standard error is slightly larger than the slope ... I assume this
is a very bad thing that we would hope to see improve for other rows?

              0      1      2      3      4      5      6      7      8      9

out        1769   6347   8563   8321   5233   4067   2269   1549   1201   1363
input      2080   7150   9685   9425   6370   4615   2535   1755   1300   1495
y=p       0.850  0.888  0.884  0.883  0.822  0.881  0.895  0.883  0.924  0.912
x             0    0.5      1    1.5      2    2.5      3    3.5      4    4.5
w         -2.25  -1.75  -1.25  -0.75  -0.25   0.25   0.75   1.25   1.75   2.25
w*y      -1.914 -1.553 -1.105 -0.662 -0.205  0.220  0.671  1.103  1.617  2.051
w^2       5.063  3.063  1.563  0.563  0.063  0.063  0.563  1.563  3.063  5.063
1-p       0.150  0.112  0.116  0.117  0.178  0.119  0.105  0.117  0.076  0.088
p/(1-p)   5.688  7.904  7.632  7.537  4.602  7.422  8.530  7.519 12.131 10.326
v         0.003  0.001  0.001  0.001  0.001  0.002  0.003  0.004  0.009  0.007
v*w^2     0.014  0.003  0.001  0.000  0.000  0.000  0.002  0.007  0.029  0.035

orig corr           0.57
sum(w*y)            0.22
sum(w^2)           20.63
slope            0.01082
sum v*w^2        0.09119
sqrt(sum v*w^2)  0.30197
standard error   0.01464

djh

unread,
Mar 24, 2012, 4:51:30 PM3/24/12
to koo...@sfu.ca
I'm sorry the spacing didn't hold in the above post - it looked OK in
the panel before I hit send.

Here is a redone version:

0 1 2 3 4 5 6 7 8 9

out 1769 6347 8563 8321 5233 4067 2269 1549 1201 1363
input 2080 7150 9685 9425 6370 4615 2535 1755 1300 1495
y=p 0.850 0.888 0.884 0.883 0.822 0.881 0.895 0.883 0.924 0.912
x 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
w -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25
w*y -1.914 -1.553 -1.105 -0.662 -0.205 0.220 0.671 1.103 1.617 2.051
w^2 5.063 3.063 1.563 0.563 0.063 0.063 0.563 1.563 3.063 5.063
1-p 0.150 0.112 0.116 0.117 0.178 0.119 0.105 0.117 0.076 0.088
p/(1-p) 5.688 7.904 7.632 7.537 4.602 7.422 8.530 7.519 12.131 10.326
v 0.003 0.001 0.001 0.001 0.001 0.002 0.003 0.004 0.009 0.007
v*w^2 0.014 0.003 0.001 0.000 0.000 0.000 0.002 0.007 0.029 0.035


orig corr 0.57

sum(w*p) 0.22
sum(w^2) 20.63
slope 0.01082

djh

unread,
Mar 26, 2012, 12:35:24 AM3/26/12
to koo...@sfu.ca, rich-...@live.com, rich....@comcast.net
Ray

As per your suggestion in your last post, here are some slope and
standard error data for globin proteins, lengths 13-22 thru 43-52,
overall and within row, for y = p and y = logit[p].

In general, taking logit[p] strengthens all correlations and increases
all slopes (pos and neg) (with some exceptions.)

From a purely scientific perspective, the switch from consistently
positive to consistently negative correlations between lengths 23-32
and 33-42 is fascinating (assuming of course, that you find the data
meaningful enough for us to continue thinking about at all.) There are
several possible explanations for this switch in sign, and my team
will have to see the data for the remaining 236 controls before
deciding which of these explanations seems most likely.

One specific question for you re the "y = p" slopes and correlations
in lines 3-4 (overall for lengths 33-42 and 43-52.) I don't
understand why the correlation gets less negative (-0.46108 to
-0.45630) while the slope gets more negative (-0.03342 to -0.04813).
I thought that if the correlation grows more negative, the slope
should also.  (Note that this anomaly only occurs in lines 3-4 for the
y=p data, not the y=logit[p] data.)

Finally, regarding the "meaningfulness" of the data, are the standard
error values so large as to render all the data virtually
meaningless?

As always, thanks again for whatever time you can afford to spend
considering these matters. Until you tell me it would be the better
part of wisdom to cease and desist collecting data (because they won't
show anything meaningful), I will continue the long march thru the
remaining 236 cases.

                            y = p                            y = logit[p]

                   Corr with                        Corr with
Length   Row       (v1+v2)/2    Slope    Std Err    (v1+v2)/2    Slope    Std Err

13-22 Overall 0.34216 0.00676 0.03927 0.41419 0.04446 0.01298
23-32 Overall 0.46882 0.01668 0.05772 0.57036 0.06840 0.01629
33-42 Overall -0.46108 -0.03342 0.00407 -0.48260 -0.09042 0.02906
43-52 Overall -0.45630 -0.04813 0.01296 -0.45427 -0.04278 0.04243

13-22 00-09 0.56767 0.01082 0.01464 0.61801 0.04811 0.01625
10-19 0.49916 0.01111 0.01361 0.61682 0.06710 0.01455
20-29 0.45163 0.00864 0.01042 0.52565 0.04661 0.01130
30-39 0.37734 0.00710 0.00990 0.44269 0.03554 0.01083
40-49 0.29573 0.00568 0.00880 0.33342 0.01889 0.01035
50-59 0.16848 0.00304 0.01301 0.21235 0.01487 0.01446
60-69 0.10761 0.00254 0.01930 0.16132 0.01573 0.02112
70-79 0.16825 0.00317 0.02221 0.21276 0.01672 0.02468
80-89 0.34039 0.00727 0.06720 0.43720 0.08911 0.06850
90-99 0.23166 0.00467 0.04040 0.32872 0.04060 0.04245

23-32 00-09 0.69844 0.02310 0.01632 0.69710 0.06532 0.01951
10-19 0.64019 0.01696 0.01085 0.63434 0.04922 0.01300
20-29 -0.17178 -0.01505 0.01266 -0.03445 -0.00732 0.01392
30-39 0.67259 0.01603 0.01275 0.66731 0.06282 0.01433
40-49 0.61259 0.01781 0.01266 0.61914 0.05061 0.01512
50-59 0.38671 0.06315 0.01856 0.43353 0.17118 0.02121
60-69 0.55066 0.01258 0.02811 0.54119 0.05619 0.03130
70-79 0.56859 0.01432 0.04306 0.61648 0.10407 0.04482
80-89 0.56671 0.01431 0.03667 0.57530 0.04750 0.04283
90-99 0.57290 0.01478 0.08311 0.55298 0.09389 0.08761

33-42 00-09 -0.31184 -0.02085 0.00776 -0.34511 -0.04940 0.02841
10-19 -0.38758 -0.03471 0.00513 -0.72625 -0.17629 0.01650
20-29 -0.38662 -0.03104 0.00363 -0.41197 -0.07071 0.01349
30-39 -0.41249 -0.03000 0.00454 -0.43928 -0.06869 0.01797
40-49 -0.46424 -0.03913 0.00586 -0.47859 -0.08406 0.02152
50-59 -0.49191 -0.04346 0.00896 -0.49272 -0.08946 0.02946
60-69 -0.59756 -0.04826 0.01315 -0.59548 -0.08894 0.03510
70-79 -0.53172 -0.04528 0.01600 -0.52474 -0.09985 0.05938
80-89 -0.27691 -0.00599 0.00599 -0.76102 -0.40210 0.14971
90-99 -0.50953 -0.02792 0.02022 -0.51233 -0.08088 0.11059

43-52 00-09 -0.42940 -0.03691 0.01312 -0.45629 -0.08528 0.04846
10-19 -0.28369 -0.03134 0.01703 -0.25820 -0.05653 0.03021
20-29 -0.39202 -0.04191 0.00843 -0.43592 -0.11382 0.03337
30-39 -0.40530 -0.04741 0.01076 -0.45496 -0.14777 0.05275
40-49 0.03602 0.00147 0.10980 0.02733 0.00202 0.21912
50-59 -0.03241 -0.03241 0.01788 -0.36495 -0.09201 0.06355
60-69 -0.51894 -0.06090 0.01922 -0.52999 -0.16038 0.08559
70-79 -0.46292 -0.06449 0.04089 -0.49250 -0.18743 0.16799
80-89 -0.17190 -0.00766 0.00675 -0.50444 -0.16336 0.19952
90-99 0.29485 0.02859 0.11007 0.13464 0.05397 0.27347

Ray Koopman

unread,
Mar 26, 2012, 12:59:53 AM3/26/12
to
I see two problems. First, as I noted in a previous post, the input
matrix is not symmetric: (0,7) = 1755, (7,0) = 1765. I changed both
of those to 1760 in my previous calculations with that matrix.

Second, your detailed calculations of the slope and its standard
error are OK up to the point where you get p/(1-p), which should be
p*(1-p). That doesn't affect the slope, but it does affect the s.e.
Using 1755, I get .010822 for the slope and .001478 for the s.e.
Using 1760, I get .010670 for the slope and .001479 for the s.e.

Ray Koopman

unread,
Mar 26, 2012, 1:59:49 AM3/26/12
to
Yes, but it's more complicated than that. Let me talk about the
situation more generally. Suppose you have a vector of independent
binomial proportions, say p = (p_1,...,p_k), and you want to
say that the corresponding vector of true probabilities, say
P = (P_1,...,P_k), satisfies some theoretical predictions. Even
if the sample p satisfies the predictions, you can't say you have
shown that the theory is correct. All you can say is that the data
are consistent with the theory.

You need to make the stronger statement that every P that is
consistent with the data also satisfies the theory. To do this,
you need to get a "confidence region" for the true probabilities,
the region within which you are certain, at a specified level of
confidence, that the true P lies; any P outside this region would
be said to be inconsistent with the data.

Let (n_1,...,n_k) be the sample sizes of the observed p. Then the
100C% confidence region for P is the set of all vectors P for which
sum n_i(p_i - P_i)^2/(P_i(1-P_i)) < the 100C'th percentile of the
chi-square distribution with k degrees of freedom.

So if, for instance, you wanted to say that the P_i are monotone
increasing (i.e., that P_1 < ... < P_k), you would have to show
that every P in the confidence region was monotone increasing.
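
A minimal sketch of that membership test (it assumes SciPy is available; the vectors below are hypothetical, not data from the thread):

import numpy as np
from scipy.stats import chi2

def in_confidence_region(p_obs, n, P, conf=0.95):
    # True if candidate P satisfies sum n_i(p_i - P_i)^2/(P_i(1-P_i)) < chi-square percentile
    p_obs, n, P = map(np.asarray, (p_obs, n, P))
    stat = np.sum(n * (p_obs - P)**2 / (P * (1 - P)))
    return stat < chi2.ppf(conf, df=len(p_obs))

p_obs = np.array([0.85, 0.88, 0.90])      # observed proportions (hypothetical)
n     = np.array([2000, 7000, 9000])      # their sample sizes
P     = np.array([0.86, 0.88, 0.89])      # one monotone candidate for the true probabilities
print(in_confidence_region(p_obs, n, P))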

Ray Koopman

unread,
Mar 26, 2012, 3:52:17 AM3/26/12
to
(5,0) disagrees with (0,5), which agrees with both (5,0) & (0,5)
in the version of this matrix that you posted previously.

Also, as I posted previously, I get different within-row correlations
for the version of this matrix that you posted previously.

>
> Row Corr
> v2 with
> 0 1 2 3 4 5 6 7 8 9 (v1+v2/2)
> v1
> 0 1.32 1.31 1.34 1.31 1.36 1.38 1.43 1.33 1.42 1.38 0.67
> 1 1.31 1.27 1.32 1.30 1.33 1.34 1.40 1.30 1.42 1.32 0.56
> 2 1.34 1.32 1.38 1.35 1.38 1.39 1.42 1.34 1.46 1.39 0.65
> 3 1.31 1.30 1.35 1.33 1.36 1.38 1.38 1.32 1.43 1.38 0.70
> 4 1.36 1.33 1.38 1.36 1.41 1.39 1.43 1.37 1.46 1.39 0.63
> 5 1.36 1.34 1.39 1.38 1.39 1.41 1.43 1.37 1.45 1.38 0.52
> 6 1.43 1.40 1.42 1.38 1.43 1.43 1.45 1.39 1.46 1.41 0.23
> 7 1.33 1.30 1.34 1.32 1.37 1.37 1.39 1.38 1.44 1.35 0.74
> 8 1.42 1.42 1.46 1.43 1.46 1.45 1.46 1.44 1.57 1.44 0.50
> 9 1.38 1.32 1.39 1.38 1.39 1.38 1.41 1.35 1.44 1.44 0.64
>
> Globin proteins
> Length 23-32
> Within-Row Correlations of RMSD with (v1+v2)/2

(2,5), (2,6), (2,7), (2,8), (2,9) disagree with
(5,2), (6,2), (7,2), (8,2), (9,2).

>
> Row Corr
> u2 with
> 0 1 2 3 4 5 6 7 8 9 (v1+v2/2)
> v1
> 0 1.61 1.66 1.67 1.68 1.70 1.77 1.72 1.79 1.85 1.66 0.66
> 1 1.66 1.64 1.59 1.61 1.62 1.65 1.62 1.70 1.69 1.61 0.25
> 2 1.67 1.59 1.56 1.56 1.60 1.39 1.43 1.37 1.46 1.39 -0.86
> 3 1.68 1.61 1.56 1.55 1.59 1.60 1.56 1.63 1.74 1.49 -0.12
> 4 1.70 1.62 1.60 1.59 1.62 1.66 1.66 1.69 1.77 1.57 0.17
> 5 1.77 1.65 1.58 1.60 1.66 1.66 1.63 1.70 1.77 1.58 -0.04
> 6 1.72 1.62 1.55 1.56 1.66 1.63 1.61 1.67 1.75 1.53 -0.02
> 7 1.79 1.70 1.65 1.63 1.69 1.70 1.67 1.69 1.75 1.64 -0.22
> 8 1.85 1.69 1.74 1.74 1.77 1.77 1.75 1.75 1.85 1.60 -0.24
> 9 1.66 1.61 1.56 1.49 1.57 1.58 1.53 1.64 1.60 1.45 -0.40
>
> Globin proteins
> Length 33-42
> Within-Row Correlations of RMSD with (v1+v2)/2

(1,4), (3,5), (8,9) disagree with
(4,1), (5,3), (9,8).

>
> Row Corr
> u2 with
> 0 1 2 3 4 5 6 7 8 9 (v1+v2/2)
> v1
> 0 0.92 0.90 0.92 0.90 0.87 0.84 0.89 0.86 1.01 0.89 0.08
> 1 0.90 0.90 0.92 0.92 0.93 0.86 0.92 0.88 0.84 0.88 -0.55
> 2 0.92 0.92 0.92 0.91 0.88 0.88 0.90 0.88 0.83 0.85 -0.90
> 3 0.90 0.92 0.91 0.89 0.87 0.86 0.86 0.84 0.84 0.81 -0.95
> 4 0.87 0.89 0.88 0.87 0.84 0.81 0.85 0.83 0.84 0.71 -0.75
> 5 0.84 0.86 0.88 0.87 0.81 0.80 0.80 0.85 0.75 0.82 -0.60
> 6 0.89 0.92 0.90 0.86 0.85 0.80 0.82 0.81 0.69 0.74 -0.91
> 7 0.86 0.88 0.88 0.84 0.83 0.85 0.81 0.89 1.05 0.82 0.21
> 8 1.01 0.84 0.83 0.84 0.84 0.75 0.69 1.05 0.57 -0.51
> 9 0.89 0.88 0.85 0.81 0.71 0.82 0.74 0.82 1.34 1.34 0.56
>
> Globin proteins
> Length 43-52
> Within-Row Correlations of RMSD with (v1+v2)/2

(4,6), (5,6), (6,8), (7,8) disagree with
(6,4), (6,5), (8,6), (8,7).

Ray Koopman

unread,
Mar 26, 2012, 4:18:15 AM3/26/12
to
On Mar 25, 9:35 pm, djh <halitsk...@att.net> wrote:
> Ray
>
> As per your suggestion in your last post, here are some slope and
> standard error data for globin proteins, lengths 13-22 thru 43-52,
> overall and within row, for y = p and y = logit[p].
>
> In general, taking logit[p] strengthens all correlations and increases
> all slopes (pos and neg) (with some exceptions.)
>
> From a purely scientific perspective, the switch from consistently
> positive to consistently negative correlations between lengths 23-32
> and 33-42 is fascinating (assuming of course, that you find the data
> meaningful enough for us to continue thinking about at all.) There are
> several possible explanations for this switch in sign, and my team
> will have to see the data for the remaining 236 controls before
> deciding which of these explanations seems most likely.
>
> One specific question for you re the "y = p" slopes and correlations
> in lines 3-4 (overall for lengths 33-42 and 43-52.) I don't
> understand why the correlation gets less negative (-0.46108 to
> -0.45630) while the slope gets more negative (-0.03342 to -0.04813).
> I thought that if the correlation grows more negative, the slope
> should also. (Note that this anomaly only occurs in lines 3-4 for the
> y=p data, not the y=logit[p] data.

correlation = slope / sqrt[ slope^2 + var(u)/var(x) ],

where var(u) = the variance of the residuals, the differences
between y and the regression line. The correlation is just an
artifact of the three logically independent and more fundamental
quantities on the right hand side.
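
A quick numerical check of that identity on simulated data (note that var(u) and var(x) must be the divide-by-n, descriptive variances for the identity to hold exactly):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 - 0.3 * x + rng.normal(scale=0.5, size=50)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

r_from_slope = slope / np.sqrt(slope**2 + np.var(resid) / np.var(x))
r_direct = np.corrcoef(x, y)[0, 1]
print(r_from_slope, r_direct)      # the two agree to rounding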

djh

unread,
Mar 26, 2012, 5:49:23 AM3/26/12
to
You wrote:

> I see two problems. First, as I noted in a previous post, the input
> matrix is not symmetric: (0,7) = 1755, (7,0) = 1765. I changed both
> of those to 1760 in my previous calculations with that matrix.

I will correct this error - I thought I had.

You also wrote:

> Second, your detailed calculations of the slope and its standard
> error are OK up to the point where you get p/(1-p), which should be
> p*(1-p). That doesn't affect the slope, but it does affect the s.e.
> Using 1755, I get .010822 for the slope and .001478 for the s.e.
> Using 1760, I get .010670 for the slope and .001479 for the s.e.-

Thank you for catching this error! I will have to recompute all data
posted below as well, but it won't be difficult. I apologize for the
careless misread - I really hate to waste your time like that.

djh

unread,
Mar 26, 2012, 5:52:04 AM3/26/12
to
You wrote:

Let (n_1,...,n_k) be the sample sizes of the observed p. Then the
100C% confidence region for P is the set of all vectors P for which
sum n_i(p_i - P_i)^2/(P_i(1-P_i)) < the 100C'th percentile of the
chi-square distribution with k degrees of freedom.


So if, for instance, you wanted to say that the P_i are monotone
increasing (i.e., that P_1 < ... < P_k), you would have to show
that every P in the confidence region was monotone increasing.


I will save this and study on it very hard. I assume this is where
"the rubber meets the road" with respect to whether all of the data
are really saying anything or not ...

djh

unread,
Mar 26, 2012, 6:22:10 AM3/26/12
to koo...@sfu.ca
I will correct the matrices as per your notes and also recompute all
of them using p*(1-p) when computing v, not p/(1-p) (the error you noted
above in my first example.)

Thank you very much again, and I apologize again for wasting your time
with clerically erroneous data.

djh

unread,
Mar 26, 2012, 12:04:15 PM3/26/12
to koo...@sfu.ca
I have corrected the "symmetry-violation" errors in all matrices where
you found them, and also corrected all tables derived from these base
matrices. Fortunately, all errors were of the minor clerical variety
that did not substantively change the picture of the data we have at
present (with only 4/240 = 1/60 of the data "known".)

I am now going to correct the p*(1-p) vs p/(1-p) mistake in the
computation of v (for the slope/standard error "yield" data with y =
p).  This correction is easier to make than the "symmetry-violation"
corrections because essentially it's just a change in one place.  I am
hoping this correction in the computation of v will lower standard
error values throughout, because they seemed awfully high to me.

Also, I will not bother re-posting the corrected RMSD-related and
"yield-related" data unless you'd like to have it for your own
purposes. If so, please let me know via reply to this post and I will
repost it.

Finally, I hope I'm not mistaken when I say that the p/(1-p) vs p*(1-
p) error should not have affected the standard error when y = logit[p]
rather than p. So far as I can tell, v is defined in two different
ways in the two cases, since you wrote:

"For p, the variance "v" is p(1-p)/input;
for logit[p], the variance is 1/output + 1/(input-output). "

But if I'm not seeing something I should here, please let me know.

Thank you so much again for your keen eye.  I am truly embarrassed that
the expert caught what the novice should have caught before presenting
anything to the expert.

djh

unread,
Mar 26, 2012, 2:14:42 PM3/26/12
to koo...@sfu.ca
Below are the standard error values when y = p and v is computed
correctly. If I correctly understand what these signify, I have to
say you have great intuitions for being able to have realized at
least a week ago that the points are less well determined as the
predictor x = (v1+v2)/2 increases.  That certainly seems to be what the
"in-row" standard error values show.

Also, I'm assuming that these values are now satisfactorily smaller
than the actual slope values themselves (see second table below). But
you will have to be the judge of this ....

I. Standard Error (y = p)

Row 13-22 23-32 33-42 43-52

00-09 0.00148 0.00270 0.00532 0.00775
10-19 0.00092 0.00180 0.00320 0.00527
20-29 0.00083 0.00120 0.00243 0.00396
30-39 0.00087 0.00140 0.00316 0.00456
40-49 0.00132 0.00210 0.00586 0.05475
50-59 0.00131 0.00230 0.00561 0.01788
60-69 0.00170 0.00290 0.00771 0.00879
70-79 0.00227 0.00210 0.01055 0.01398
80-89 0.00168 0.00540 0.00545 0.00540
90-99 0.00203 0.00460 0.01588 0.05609

Overall 0.00007 0.00030 0.00219 0.00250

II. Slope (y = p)

Row 13-22 23-32 33-42 43-52
00-09 0.01082 0.02310 -0.02085 -0.03691
10-19 0.01111 0.01696 -0.03471 -0.03134
20-29 0.00864 -0.01505 -0.03104 -0.04191
30-39 0.00710 0.01603 -0.03000 -0.04741
40-49 0.00568 0.01781 -0.03913 0.00147
50-59 0.00304 0.06315 -0.04346 -0.03241
60-69 0.00254 0.01258 -0.04826 -0.06090
70-79 0.00317 0.01432 -0.04528 -0.06449
80-89 0.00727 0.01431 -0.00599 -0.00766
90-99 0.00467 0.01478 -0.02792 0.02859

Overall 0.00676 0.01930 -0.03342 -0.04813


djh

unread,
Mar 26, 2012, 8:26:41 PM3/26/12
to koo...@sfu.ca
Ray -

If you have a moment, this is a simple dumb question about computing
standard error when y = rmsd, not p or logit[p].

Here is an example of how I compute slope for rmsd data, according to
the definitions you gave:

y=rmsd           x     w = x-mean[x]      w^2        w*y          slope = sum[w*y]/sum[w^2]

1.323715088 0 -2.25 5.0625 -2.978358947 0.019074527
1.310987263 0.5 -1.75 3.0625 -2.294227711
1.343656653 1 -1.25 1.5625 -1.679570816
1.312367819 1.5 -0.75 0.5625 -0.984275864
1.361786878 2 -0.25 0.0625 -0.340446719
1.377588158 2.5 0.25 0.0625 0.344397039
1.427719401 3 0.75 0.5625 1.07078955
1.326376675 3.5 1.25 1.5625 1.657970844
1.423061391 4 1.75 3.0625 2.490357433
1.380789912 4.5 2.25 5.0625 3.106777302

2.25 20.625 0.393412112
mean[x] sum[w^2] sum[w*y]

The computed slope of 0.019074527 agrees well with the slope that
Excel gives me when I do a simple regression of y against x for the
same data.

So does this mean that it's safe for me to use the standard error
which Excel provides
for the slope, i.e. 0.007456655 in this case?

Reason I'm asking is that I want to table the slope and standard error
data for rmsd alongside the slope and standard error data for p and
logit[p], because there seem to be some relationships.

But I need to be sure about what to display for standard error before
I do so.

If not, can you tell me how to compute standard error in a case like
this when y is not a probability
or the logit of a probability?


Ray Koopman

unread,
Mar 27, 2012, 12:30:25 AM3/27/12
to
On Mar 26, 5:26 pm, djh <halitsk...@att.net> wrote:
> Ray -
>
> If you have a moment, this is a simple dumb question about computing
> standard error when y = rmsd, not p or logit[p].
>
> Here is an example of how I compute slope for rmsd data, according
> to the definitions you gave:
>
> w = slope =
> y = rmsd x x-mean[x] w^2 w*y sum[w*y]/
You don't want the standard error that Excel gives, which is
sqrt[var(u)/sum(w^2)], where var(u) is the adjusted variance of
the regression residuals. You want sqrt[sum(v_i*w_i^2)]/sum(w_i^2),
where v_i is the variance of y_i. (Excel is using the standard
formula, which assumes that all the v_i are equal to one another,
and that var(u) is an estimate of their common value. You don't
want to make either of those assumptions.)

If y_i were simply the average of n_i independent distances then
v_i would be V_i/n_i, where V_i is the variance of those distances.

But y_i is an rms distance; the square root complicates things,
and we may have to make some assumptions. What can you tell us
about the n_i distances that go into y_i? Are they independent?
Do you have any idea of the general form of their distribution?
We're talking about inter-atomic distances, right? So are the
distances in 3D, or are the atoms constrained to lie in the same
plane, which would mean the distances are in 2D? Are there any
other constraints, of any kind?

djh

unread,
Mar 27, 2012, 10:03:35 AM3/27/12
to koo...@sfu.ca
Thanks as always for the prompt and informative response.

I've cc'd you on an off-line email to my protein structure colleague
to see if I can't easily get the actual distances that go into each of
our RMSDs (so that actual distributions will be available for
inspection/analysis.)

But to give you a general idea of the situation:

i) the inter-atomic distances are 3D, not 2D.
ii) I suspect that the ones we're interested in range from 0.5 to 2.0
Angstroms;
iii) each RMSD subsumes, say, 50-200 individual distances.

Also, the RMSD data is where I really want to get the "central
tendency" within each cell of our usual 10x10 matrix. In other words,
suppose that the average RMSD for one cell of this matrix subsumes
1000 individual RMSDs. Some of these individual RMSD's are very very
low (because they're each computed on a pair of protein structures
which SHOULD BE very very similar), and some of these individual
RMSD's are very very high (because they're each computed on a pair of
protein structures that are just barely similar enough to make them
worth reporting out.) And so we're really interested in the
individual RMSDs "in the middle" when we compute the average RMSD for
the cell of the matrix.

So what I would like to do for the 1000 individual RMSD's in one cell
of a matrix is to get their mean and sd, and then use just the RMSDs
within one sd of the mean to compute the average RMSD for the cell.  (You said in
an email above that this would be legitimate to do, unless I
misunderstood you.) This "winnowing" process will, I believe, make
our correlations involving average RMSDs much more meaningful
empirically (and also, perhaps, "stronger".)

But even if you give me the green light to do this "winnowing" within
sd, we would still need the "typical" distribution of distances within
a single RMSD so that you can advise on how to compute standard error
for a correlation involving average RMSDs, and that's why I emailed my
protein structure colleague on this matter.

Ray Koopman

unread,
Mar 27, 2012, 1:29:37 PM3/27/12
to
It's my turn to apologize for a dumb mistake. I forgot that y_i is
an *average* rmsd, not just a single rmsd, so there is no need for
the individual distances that went into each rmsd. To paraphrase my
previous post, "y_i is simply the average of n_i independent rmsd's,
so v_i = V_i/n_i, where V_i is the variance of those rmsd's."

Re using only the rmsd's "in the middle": Yes, that sort of thing is
legit, altho I don't know what I said that would lead you to that
conclusion. However, it won't change y_i much unless the distribution
of rmsd's is pretty skewed, and it will complicate getting v_i,
because the selection rule (which should probably not be based on
the sd of the whole set) needs to be factored into the process.
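
A small sketch of that computation, with hypothetical per-cell rmsd lists standing in for the real ones; it returns the y_i and v_i that feed the slope and standard-error formulas given earlier:

import numpy as np

def cell_mean_and_variance(rmsds):
    # y_i = mean rmsd for the cell; v_i = V_i/n_i, with V_i the (n-1)-denominator
    # variance of the individual rmsd's (see the variance discussion further on)
    rmsds = np.asarray(rmsds, float)
    return rmsds.mean(), rmsds.var(ddof=1) / len(rmsds)

cells = [
    [1.2, 1.4, 1.3, 1.5, 1.1],          # hypothetical rmsd's for one cell
    [1.6, 1.5, 1.7, 1.4],
    [0.9, 1.0, 1.1, 1.0, 0.8, 1.2],
]
y, v = zip(*[cell_mean_and_variance(c) for c in cells])
print(y, v)    # pair these with the x's and plug into the slope/s.e. formulas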

djh

unread,
Mar 27, 2012, 3:26:43 PM3/27/12
to koo...@sfu.ca
OK - please be patient and let me make sure I understand.

Suppose that in one of the correlations involving average RMSD as y, I
have a y_i whose value is 1.56.

Further, suppose that this y_i is the average of the following 11
individual RMSD's:

1.51
1.52
1.53
1.54
1.55
1.56
1.57
1.58
1.59
1.6
1.61

Then by definition, the mean of these individual RMSDs is y_i = 1.56,
and since their differences from 1.56 are:

-0.05
-0.04
-0.03
-0.02
-0.01
0
0.01
0.02
0.03
0.04
0.05

and the squares of these differences are:

0.0025
0.0016
0.0009
0.0004
0.0001
0
0.0001
0.0004
0.0009
0.0016
0.0025

the variance V_i for this y_i is the sum of these squares divided by
11, i.e.

0.011 / 11 = .001

Correct ?

Thanks for taking the time to verify ... I just want to be sure ....

Art Kendall

unread,
Mar 27, 2012, 4:27:00 PM3/27/12
to
I am not sure I completely follow what you have but maybe this will help.

You have several square matrices with the same entities on the margins.
The cells of the matrices are some kind of pairwise
distances/similarities. Each matrix comes from some set of circumstances.

Individual differences multidimensional scaling (INDSCAL) is
designed to handle that kind of data.

INDSCAL and other techniques are available in, e.g., SPSS.
I believe Doug Carroll has a monograph in the green Sage series.

If this is at all what you have, I can point you to a couple of lists on
these kinds of topics.

HTH


Art Kendall
Social Research Consultants

djh

unread,
Mar 27, 2012, 5:47:37 PM3/27/12
to a...@dr.kendall.org, koo...@sfu.ca
Hi Art -

Thanks very much for the suggestion, which sounds like it might
conceivably be relevant to our situation. This situation can be
described from 30K feet as follows:

1) our variable space includes one dependent variable x and three
independent variables y,z, w, where x is a representation level that
can be over, under, or par, y is a pairwise distance metric, z is a
"yield" variable (output/input) and w is a straightforward energy
level variable;

2) what you've called our "circumstance space" includes six type of
proteins, 10 length intervals (lengths of protein amino acid chains),
plus 4 selections of certain biomolecular elements for which the
behavior of x changes in different ways (1 selection "of interest" and
3 "control" selections);

3) all three of our correlations w:x, y:x, and z:x can be computed for
each row of a 10x10 matrix or for an entire 10x10 matrix, and there
are a minimum of 6*10*4 such matrices corresponding to different
combinations of our "circumstances"

Right now, with Ray's invaluable, kind, and generous guidance, we're
in the process of establishing that our three correlations are in fact
statistically worth talking about.

But assuming that we are successful in this effort, it might well be
that a technique such as INDSCAL might be able to draw our attention
to subtle ways in which our matrices are critically different and
critically similar.

So as soon as I'm out from under some really brutal data reduction/
analysis to generate our 240 matrices along with the associated slopes
and standard errors for Ray, I will most definitely try to get
knowledgeable about INDSCAL ...

Thanks again ...

djh

unread,
Mar 27, 2012, 5:57:35 PM3/27/12
to
Oops! Sorry - above post should of course have said one INdependent
variable x and three dependent variables y,z,w.

Ray Koopman

unread,
Mar 27, 2012, 6:05:34 PM3/27/12
to
I should have been specific when I said "variance". There are two
versions of the variance in common use. They use the same numerator
but different denominators. The descriptive variance (also called
the population variance), divides by n. The inferential variance
(also called the sample variance) divides by n-1.

The descriptive variance is the second central moment of the n
data values, without regard to where they came from or what they
represent; it simply describes those numbers.

The inferential variance assumes the data values are independent
samples from some population, and is an unbiased estimate of the
variance of the population. Although many people call it the sample
variance, it is not the variance of the sample, but an inference
about the variance of the population.

You're trying to estimate the population variance, so you should
divide by n-1.

For the record, the "adjusted variance" that I referred to in a
recent post is a more general form of inferential variance: the
denominator is n-k, where k is the number of linear constraints
on the values whose squares are summed in the numerator. For an
ordinary variance, k = 1: the deviations from the mean sum to 0.
In simple linear regression, k = 2: the residuals are orthogonal
to both the unit vector (1,...,1) and the predictor (x_1,...,x_n).
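
For the 11 rmsd's in the example above, the two versions differ only in the denominator; a quick check:

import numpy as np

rmsds = np.array([1.51, 1.52, 1.53, 1.54, 1.55, 1.56,
                  1.57, 1.58, 1.59, 1.60, 1.61])

print(rmsds.var(ddof=0))   # descriptive:  0.011/11 = 0.0010
print(rmsds.var(ddof=1))   # inferential:  0.011/10 = 0.0011

v_i = rmsds.var(ddof=1) / len(rmsds)   # the v_i to use for this cell's average rmsd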

Ray Koopman

unread,
Mar 27, 2012, 7:17:03 PM3/27/12
to
I should add that what I described above is a conservative criterion
that is difficult to meet, especially as k gets large, and that many
researchers will often settle for something less rigorous (at least
in their own work). There are many ways to liberalize the criterion,
depending on the details of the situation; I just wanted to mention
that what I described is sufficient but in practice is not always
necessary.

djh

unread,
Mar 27, 2012, 7:43:57 PM3/27/12
to koo...@sfu.ca
Thanks for the clarification/education re the different flavors of
variance, and for the specific instruction to use n-1 in the
denominator. As someone completely naive about statistics, I find it
fascinating (for reasons I won't go into here) that the residuals are
orthogonal to the unit vector in simple linear regression.
> to both the unit vector (1,...,1) and the predictor (x_1,...,x_n).

djh

unread,
Mar 27, 2012, 7:45:35 PM3/27/12
to koo...@sfu.ca
I see from your parenthetical here:

"and that many researchers will often settle for something less
rigorous (at least
in their own work)."

that you have a very dry sense of humor. I hope I get a chance to
meet you in person some time in the future ...

djh

unread,
Mar 27, 2012, 10:21:18 PM3/27/12
to koo...@sfu.ca
In the post to which I'm replying here, you defined logit[p] as

log[p/(1-p)]

but also wrote the following:

> For p, the variance "v" is  p(1-p)/input;
> for logit[p], the variance is  1/output + 1/(input-output).
> Both of those assume that both input and output are "large",
> so some adjustments will be needed for some of the new data.
> A simple one is to add 1 to the input and .5 to the output.
> It's far from perfect, but will at least take care of the zeros.

I've been adding .5 to output and 1 to input when I have a zero
output, but now that I'm hitting sparser data, I'm hitting cases when
input and output are both 0, so both additions give me .5 as output
and 1 as input.

And then when I take log(p/(1-p)), I get log(.5/(1-.5)) = log(1) = 0.

And so, when I try to do a correlation with y = logit[p], Excel
complains with a divide by 0 error because of the logit[p] of 0.

Am I misintepreting what you said to do?

If not, what do you suggest for such cases? (The addition of .5 and 1
works fine when I'm doing a correlation with y = p, so no fix is
needed for this type of correlation.)

Thanks again in advance for the continued guidance.

djh

unread,
Mar 27, 2012, 10:30:36 PM3/27/12
to koo...@sfu.ca
Also I'm hitting another "divide by 0" problem - this one involves
the formula

logit[p] = log[p/(1-p)]

for logit[p] itself.

When I hit the rare case where input = output, p is 1 and so 1-p = 0.

Am I missing something?

If not, what to do ?

Ray Koopman

unread,
Mar 28, 2012, 5:26:07 AM3/28/12
to
On Mar 27, 7:21 pm, djh <halitsk...@att.net> wrote:
> In the post to which I'm replying here, you defined logit[p] as
>
> log[p/(1-p)]
>
> but also wrote the following:
>
>> For p, the variance "v" is p(1-p)/input;
>>
>> for logit[p], the variance is 1/output + 1/(input-output).
>> Both of those assume that both input and output are "large",
>> so some adjustments will be needed for some of the new data.
>> A simple one is to add 1 to the input and .5 to the output.
>> It's far from perfect, but will at least take care of the zeros.
>
> I've been adding .5 to output and 1 to input when I have a zero
> output, but now that I'm hitting sparser data, I'm hitting cases
> when input and output are both 0, so both additions give me .5 as
> output and 1 as input.

When the input is zero then you have a hole in your data, but you
can still do the regression -- it just means the programming is a
little messier.

>
> And then when I take log(p/1-p), I get log(.5/(1-.5) = log(1) = 0.
>
> And so, when I try to do a correlation with y = logit[p], Excel
> complains with a divide by 0 error because of the logit[p] of 0.

Having one or more zeros in the data should not upset anything. The
only divisions involved in a correlation are by n (to get the means)
and at the very end, where a quantity proportional to the covariance
is divided by the square root of the product of quantities that are
proportional to the two variances. Is one of the variances zero?

>
> Am I misinterpreting what you said to do?
>
> If not, what do you suggest for such cases? (The addition of .5 and
> 1 works fine when I'm doing a correlation with y = p, so no fix is
> needed for this type of correlation.)
>
> Thanks again in advance for the continued guidance.

On Mar 27, 7:30 pm, djh <halitsk...@att.net> wrote:
> Also I'm hitting another "divide by 0" problem - this one involves
> the formula
>
> logit[p] = log[p/(1-p)]
>
> for logit[p] itself.
>
> When I hit the rare case where input = output, p is 1 and so 1-p = 0.

Adding 1 to the input and .5 to the output works at both ends of the
scale, to avoid both p = 0 and p = 1. But it's still not a great
solution.

>
> Am I missing something?
>
> If not, what to do ?

I don't expect much to come from linear regressions with y = p.
I would still do them, mostly because they're easy, and some sort
of (noisy) pattern may become apparent when you consider the results
from all 240 matrices, but I wouldn't hold my breath.

y = logit[p] seems like a better bet, but given the small inputs
and extreme p's in some matrices, only if the analyses are done as
logistic regressions of p on x, not as linear regressions of logit[p]
on x; and perhaps only if a reduced-bias estimation method is used;
see, for instance,
http://cemsiis.meduniwien.ac.at/kb/wf/software/statistische-software/fllogistf/

Since both p and the input are dependent, you might look for a way
to model them simultaneously. Such a model would have more than
two parameters that would need to be estimated, which would make
doing within-row regressions, where there are only ten data points,
very iffy.

With regard to the within-row regressions, where you seem impressed
by regular changes from row to row, it would be good if you could
expand the model to include parameters that would account for those
changes. Then separate within-row regressions would no longer be
necessary.

Ray Koopman

unread,
Mar 28, 2012, 6:17:26 AM3/28/12
to
On Mar 27, 4:43 pm, djh <halitsk...@att.net> wrote:
> [...] As someone completely naive about statistics, I find it
> fascinating (for reasons I won't go into here) that the residuals
> are orthogonal to the unit vector in simple linear regression.

Instead of "simple linear regression", I should have said "least
squares linear regression with an intercept". "least squares linear"
implies that the residuals must be orthogonal to the space of the
predictors, and "with an intercept" implies that the predictor space
includes the unit vector. QED.
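
A small NumPy illustration of that orthogonality (simulated data, least-squares fit with an intercept):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.0 + 0.5 * x + rng.normal(size=30)

X = np.column_stack([np.ones_like(x), x])      # unit vector plus predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

print(resid @ np.ones_like(x))   # ~ 0: orthogonal to the unit vector
print(resid @ x)                 # ~ 0: orthogonal to the predictor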

djh

unread,
Mar 28, 2012, 9:20:17 AM3/28/12
to koo...@sfu.ca
OK - based on what you just said, here's how I'm going to proceed. No
need to reply unless you think I'm planning to proceed incorrectly, in
which case I would of course value your continued guidance.

First, I forgot that you told me to forget correlations and only do
slopes and standard errors. Doing this will remove the "zero" problem
in correlations that I mentioned above.

Second, I will just do the "overalls" and drop the within-row effort
entirely, since even over the web I can see your raised eyebrow about
"just 10" within-row values. Not doing the "within-rows" will solve
the second "zero" problem with logit[p] itself, because I can just
drop the few problem cases from the "overall" data and do the slopes
and standard errors for logit[p] on 52 or 53 values instead of the
full "overall" complement of 55.

Third, for the moment, I will not pursue the "logistic regression"
avenue for the following reason.

What's most important right now from a scientific point of view is the
sign reversal of slope for the "overall" values at different lengths:

"Overall" Slopes

Length y=p y=logit[p]

13-22 0.00676 0.04446
23-32 0.01668 0.06840
33-42 -0.03342 -0.09042
43-52 -0.04813 -0.04278

and the way in which the slopes for the y=RMSD "overalls" behave
relative to these reversals in sign.

So I will continue to get the "overall" slopes and standard errors for
y=p, y=logit[p] and y=RMSD in order to see:

i) whether the complete pattern of sign reversals for slopes of p
and logit[p] makes scientific sense;
ii) whether the complete pattern of slopes for y=RMSD does or does
not correspond to the pattern of sign reversals for slopes of p and
logit[p] in any scientifically meaningful way, and if so how;

If these two complete patterns both make scientific sense, and if they
do not manifest themselves under any of the three control protocols, then
I can come back to you and ask: "OK - what do we have to do further to
show that the observed patterns not only make scientific sense, but
are also statistically meaningful as well?"

Again, no need to respond unless you think there's something wrong
with this approach.

And thank you very much again, Ray.


djh

unread,
Mar 28, 2012, 11:12:34 AM3/28/12
to koo...@sfu.ca
I've found a free online reduced-bias logistic regression calculator
here:

http://www.wessa.net/rwasp_logisticregression.wasp

Could you take a quick look and tell me if you think it's OK to use
it.

If so, could you tell me what my values should be for the "endogenous
series" binary variable in column 1 of the input panel? (I'm assuming
my second column would be my x values and the third would be my y=p
values.)

If you green-light this, I'll go ahead and do the logistic
regressions on p while I'm doing the linear regressions on p and
logit[p].

Ray Koopman

unread,
Mar 28, 2012, 3:40:29 PM3/28/12
to
For your data there would be only two columns. The first is the
dichotomous dependent variable, either 0 or 1; the second is the
predictor, (v1+v2)/2. For each of the 19 possible values of the
predictor***, you would have to enter 'input' rows, all of which
would have the same predictor value in column 2. 'output' of those
rows would have a 1 in column 1; the remaining 'input-output' rows
would have a 0 in column 1.

Obviously, the program is not designed for replicated data,
but the coefficients and their standard errors should still be OK.

You might try asking (in the "I would like to suggest..." box at
the bottom of the page) if there is an alternate input option for
replicated data. (There should also be a corresponding reduction
in the output, in the "Fit of Logistic Regression" table.)
_____

*** It just struck me that altho there are 55 cells in your matrix,
if the predictor is (v1+v2)/2 then there are only 19 different
predictor values. Cells with the same predictor value can be merged,
to give pooled input and output counts. Could have been doing this
all along!
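
For reference, a sketch of the same maximum-likelihood fit done directly from pooled (x, input, output) counts, which is mathematically equivalent to expanding each cell into 'output' ones and 'input - output' zeros as described above. This is ordinary ML, not the reduced-bias method recommended earlier for sparse cells, and the counts are just the (0,0)-(3,0) cells from the table at the top of the thread:

import numpy as np

def grouped_logistic(x, inp, out, iters=25):
    # Newton-Raphson logistic regression of out/inp on x, fitted from pooled counts
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = inp * p * (1.0 - p)
        grad = X.T @ (out - inp * p)
        beta = beta + np.linalg.solve(X.T @ (X * W[:, None]), grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = inp * p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * W[:, None]))
    return beta, np.sqrt(np.diag(cov))    # [intercept, slope] and their standard errors

x   = np.array([0.0, 0.5, 1.0, 1.5])
inp = np.array([2080, 7150, 9685, 9425], float)
out = np.array([1769, 6347, 8563, 8321], float)
print(grouped_logistic(x, inp, out))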

djh

unread,
Mar 28, 2012, 4:12:40 PM3/28/12
to koo...@sfu.ca
I'm really sorry, Ray, but I simply cannot get myself past some kind
of stupid "block" that's preventing me from understanding what you
wrote here:

> For your data there would be only two columns. The first is the
> dichotomous dependent variable, either 0 or 1; the second is the
> predictor, (v1+v2)/2. For each of the 19 possible values of the
> predictor***, you would have to enter 'input' rows, all of which
> would have the same predictor value in column 2. 'output' of those
> rows would have a 1 in column 1; the remaining 'input-output' rows
> would have a 0 in column 1.

First puzzlement: If I am allowed to enter only two columns into the
panel, the first of which has only 0's and 1's, and the second of
which has only the 19 values of x (the 19 unique among the 55 cells of
the matrix), where do I put the actual "y" data from the cells of the
matrix. You've already stipulated the data will only have two columns
and they're both already filled up.

Second puzzlement: what is the "y" data that I'm supposed to enter?
The "p's" (inputs/outputs), or the inputs and outputs themselves ?

I realize I'm "blocking" on something really stupid here, but I just
can't see my way thru it.

Using this mock example for just three predictors:

x input output p
0 1769 2080 0.850480769
0.5 6347 7150 0.887692308
1 8563 9685 0.884150749

could you show me how to fill out the panel provided by the
application?

Final question: once you've gotten me past this block, would it be ok
for me to enter the data unpooled (i.e. for 55 predictors instead of
19), or must the predictors in column 2 be unique?

Again, I apologize for trying your patience here ... really.

PS - judging from your email, I suspect you know some of the folks who
wrote this application ...?

djh

unread,
Mar 28, 2012, 5:03:39 PM3/28/12
to koo...@sfu.ca
Oh wait a second - are you saying that if one cell of my matrix with
predictor 12 has an input of 3 and an output of 2, then the data
should be entered into the panel as:

1 12
1 12
0 12

and if the input and output were 5 and 3, then the data would look
like:

1 12
1 12
1 12
0 12
0 12

djh

unread,
Mar 28, 2012, 5:40:13 PM3/28/12
to koo...@sfu.ca
OK - this one has an option for summary data entry!

http://statpages.org/logistic.html

But I don't know if it implements the "reduced-bias" feature like the
first one claims to do.

If it doesn't, is it still OK to use it?

Ray Koopman

unread,
Mar 28, 2012, 6:14:02 PM3/28/12
to
Yup.

Ray Koopman

unread,
Mar 28, 2012, 6:25:44 PM3/28/12
to

Ray Koopman

unread,
Mar 28, 2012, 6:28:48 PM3/28/12
to
Sorry, I hit the "Send" key by mistake a couple minutes ago.

On Mar 28, 2:40 pm, djh <halitsk...@att.net> wrote:
That's probably just a garden-variety maximum likelihood routine.
It should be OK when both 'out' and 'in-out' are large, say > 30
or so when you have only one predictor, but otherwise you need a
bias-reduced penalized maximum likelihood routine. And no, I
don't know any of the people involved with either routine.

djh

unread,
Mar 28, 2012, 6:59:13 PM3/28/12
to koo...@sfu.ca
OK - here's some output for you to look at from the second site's
calculator:

http://statpages.org/logistic.html

(this is the one that doesn't claim to be reduced-bias.)

You can see a problem I had to overcome at this site if you look at
the "summary" examples which the site provides.

You'll see that they require binary values for predictor as well as
predicted, so I pooled my data even further than 19 points to get
binary 0/1 values corresponding to the predictor ranges 0.5<=x<=4.5
and 5<=x<=9:

0 29269 203483 0.5<=x<=4.5
1 5074 40805 5<=x<=9

The output then was this:

Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 0.1647 0.3709

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 07977.6105
-2 Log Likelihood = 07977.4695
-2 Log Likelihood = 07977.4695
-2 Log Likelihood = 07977.4695 (Converged)

Overall Model Fit...
Chi Square= 83.5084; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.1456 0.0161 0.0000
Intercept 1.9391

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.1568 1.1207 1.1939

X1 n0 n1 Calc Prob
0.0000 29269 203483 0.8742
1.0000 5074 40805 0.8894

Can you tell anything from these results as to whether 0.5<=x<=4.5
performs significantly better or worse than 5<=x<=9?

If so, this would be good enough for the scientific claim we're trying
to make, but I wish there was a site that allowed non-binary predictor
values. Also, of course, if you see anything meaningful in the data,
I would have to do the same kind of binary split for my RMSD dependent
variable (which is easily doable, actually)

One last question: would it help to tell you anything if I were to
pool the data over a larger number of smaller intervals, e.g.

0: 0.5<=x<=2
1: 2.5<=x<=4

and then:

0: 4.5<=x<=6
1: 6.5<=x<=8

Very much looking forward to your response to all of the above, even
if you tell me that it's time to shutter the shop and lock the doors
after us ....

And thanks again for your patience, as always ...



djh

unread,
Mar 28, 2012, 7:34:32 PM3/28/12
to koo...@sfu.ca
OK - below is the output from http://statpages.org/logistic.html for
the length=30 case (the previous case was length = 20.) Also, please
note that in the previous case, I typo'd the lower x range as
0.5<=x<=4.5, when it should have been 0.0<=x<=4.5.

Length 30

Binary-pooled data:

0 32175 131648 0.0<=x<=4.5
1 2915 22067 5<=x<=9

Output from logistic regression:


Descriptives...

35090 cases have Y=0; 153715 cases have Y=1.

Variable Avg SD
1 0.1323 0.3388

Iteration History...
-2 Log Likelihood = 81310.8671 (Null Model)
-2 Log Likelihood = 80328.4727
-2 Log Likelihood = 80307.1466
-2 Log Likelihood = 80307.1237
-2 Log Likelihood = 80307.1237 (Converged)

Overall Model Fit...
Chi Square= 1003.7434; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.6153 0.0207 0.0000
Intercept 1.4089

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.8502 1.7767 1.9266

X1 n0 n1 Calc Prob
0.0000 32175 131648 0.8036
1.0000 2915 22067 0.8833

djh

unread,
Mar 28, 2012, 8:56:33 PM3/28/12
to koo...@sfu.ca
Below are the logistic regression outputs for lengths 40 and 50.

Before you look at them, please consider this table:

"statpages"
Corr Slope Log Reg
Length y=p y=p Coeff

20 0.34216 0.00676 0.1456
30 0.34984 0.01930 0.6153
40 -0.46108 -0.03342 -0.3288
50 -0.45630 -0.04813 -0.2410

As a complete ignoramus, it looks to me like the signs of the
coefficients from the logistic regressions are patterning with, not
against, the signs of the correlations and slopes from the linear
regressions with y = p.

Of course, I'm hoping you'll be able to tell me that this similarity
means that we have more reason to attach statistical significance to
the switch in sign than we would if we only had this switch in the
coefficients and slopes from the linear regressions.

In any event, here are the outputs from the logistic regressions for
lengths 40 and 50.

*********************************************

Length 40 Logistic Regression Data

Descriptives...

67556 cases have Y=0; 55278 cases have Y=1.

Variable Avg SD
1 0.0661 0.2484

Iteration History...
-2 Log Likelihood = 69054.7695 (Null Model)
-2 Log Likelihood = 68857.9864
-2 Log Likelihood = 68857.8934
-2 Log Likelihood = 68857.8934 (Converged)

Overall Model Fit...
Chi Square= 196.8760; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.3288 0.0237 0.0000
Intercept -0.1793

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.7198 0.6872 0.7540

X1 n0 n1 Calc Prob
0.0000 62488 52229 0.4553
1.0000 5068 3049 0.3756

************************************************

Length 50 Logistic Regression Data

Descriptives...

28666 cases have Y=0; 38763 cases have Y=1.

Variable Avg SD
1 0.0608 0.2391

Iteration History...
-2 Log Likelihood = 91958.7888 (Null Model)
-2 Log Likelihood = 91903.2527
-2 Log Likelihood = 91903.2441
-2 Log Likelihood = 91903.2441 (Converged)

Overall Model Fit...
Chi Square= 55.5448; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.2410 0.0323 0.0000
Intercept 0.3166

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.7858 0.7377 0.8371

X1 n0 n1 Calc Prob
-0.0000 26692 36634 0.5785
1.0000 1974 2129 0.5189


djh

unread,
Mar 28, 2012, 9:57:59 PM3/28/12
to koo...@sfu.ca, jcp1...@gmail.com
This is to correct a misimpression I have given regarding John
Pezzullo's logistic regression tool at

http://statpages.org/logistic.html

It will run non-binary (multiple) predictors in logistic regression,
as John has explained in an email which appears in full below. John
was also kind enough to provide some thoughts on the case which I sent
him, for which I again thank him very much.

John's email:


Hi David,


I'm glad you've found my online logistic regression calculator to be
useful.

Logistic regression can take one or more continuous (numeric) and/or
categorical predictors, so your data set can be analyzed by my web
page. For grouped data (such as your dataset), the last two columns
would be the number of No and Yes outcomes, just as you have it.


It's important to separate the numbers on each row of data by commas
or tab characters, and not just have them separated by blank spaces.
I copied the data from your e-mail into Excel, and used the "Text to
Columns" entry in the "Tools" menu to get the data into three distinct
columns, then copy/pasted it into the web page. I set the number of
data points to 18, and the number of predictor variables to 1, and
then clicked the Solve button. The page produced the following output:


Descriptives...


34032 cases have Y=0; 242519 cases have Y=1.


Variable Avg SD
1 3.2100 1.5412


Iteration History...
-2 Log Likelihood = 06293.3370 (Null Model)
-2 Log Likelihood = 06292.2564
-2 Log Likelihood = 06292.2564
-2 Log Likelihood = 06292.2564 (Converged)


Overall Model Fit...
Chi Square= 1.0806; df=1; p= 0.2986


Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0039 0.0038 0.2984
Intercept 1.9763


Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9961 0.9888 1.0035


X1 n0 n1 Calc Prob
0.5000 803 6347 0.8781
1.0000 1699 13981 0.8779
1.5000 2734 23081 0.8777
2.0000 3891 29455 0.8775
2.5000 4659 32341 0.8772
3.0000 4839 30548 0.8770
3.5000 4303 26531 0.8768
4.0000 3408 21721 0.8766
4.5000 2622 17709 0.8764
5.0000 1849 13883 0.8762
5.5000 1280 10462 0.8760
6.0000 908 7045 0.8758
6.5000 571 4156 0.8756
7.0000 254 2510 0.8753
7.5000 114 1323 0.8751
8.0000 69 742 0.8749
8.5000 18 442 0.8747
9.0000 11 242 0.8745


The regression coefficient for the single predictor variable is
-0.0039 with a standard error of +/- 0.0038, which indicates that
larger values of the predictor are associated with a smaller
proportion of the subjects who have the outcome variable. (Actually,
the SE is about the same magnitude as the coefficient itself, so the
regression coefficient is not significantly different from zero
(p=0.3), so it's not really indicating anything about the direction of
the relationship.)


If we plot that proportion-of-subjects-with-outcome vs. predictor
variable, we see that the data appears to show a generally increasing
trend with increasing X values. (I'm attaching an Excel file with the
data and the graph).


But the points don't really seem to look much like the typical
"S-shaped" curve that the logistic model is meant to fit. Usually
logistic data has small proportions at one end of the predictor range,
and large proportions at the other end. The proportions in your data
have a very small range of variability (ranging from 0.86 to 0.96),
implying that your proportions are confined to the upper part of the S-
curve, where it should be asymptotically approaching the maximum
possible value of 1. This would make the regression somewhat ill-
conditioned, but it should still work.


But what's more troubling is that the plot clearly shows that the data
does not seem to be conforming to a logistic model at all -- it starts
out (at the left side of the graph) around 0.9, then dips down to 0.86
near the middle, then climbs up to 0.96 at the right side. The
logistic function does not have this kind of minimum near its upper
range, and a logistic curve cannot twist itself to fit that
kind of dip in the middle. That's probably why the regression
coefficient is negative, even though the general trend of the
curve seems to be increasing -- the data doesn't follow a logistic
shape, so the logistic regression doesn't really produce a realistic
result in this case.


Without knowing what the data represents, I can't offer any other
suggestions. It's possible that there's some other kind of
theoretical model underlying the observations, which would suggest
some other kind of formula to fit to the data instead of a logistic
model.


Best Regards,
John.

John C. Pezzullo, PhD
Kissimmee, FL

401-603-1648 (home/office)
703-593-5815 (cell)
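
(For anyone who would rather reproduce this kind of grouped fit locally
than through the web page, below is a minimal sketch in Python with
statsmodels; the x, n0, n1 layout is the same one John describes, and
the package is just my choice, not what his page runs.)

import numpy as np
import statsmodels.api as sm

# x, n0 (No), n1 (Yes) -- the first three rows of the grouped data above
rows = [(0.5, 803, 6347), (1.0, 1699, 13981), (1.5, 2734, 23081)]
x  = np.array([r[0] for r in rows])
n0 = np.array([r[1] for r in rows], dtype=float)
n1 = np.array([r[2] for r in rows], dtype=float)

X = sm.add_constant(x)                      # intercept plus the single predictor
# a two-column endog of (successes, failures) gives a grouped binomial fit
fit = sm.GLM(np.column_stack([n1, n0]), X, family=sm.families.Binomial()).fit()
print(fit.params)                           # [intercept, slope] on the logit scale
print(fit.bse)                              # their standard errors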



djh

unread,
Mar 28, 2012, 10:15:04 PM3/28/12
to koo...@sfu.ca, jcp1...@gmail.com
Here are the results of re-running the logistic regressions for
lengths 20,30,40,50 using John Pezzullo's calculator at:

http://statpages.org/logistic.html

As you can see, multiple non-binary predictors work just fine (my bad
for claiming they don't!)

****************************************************
Length 20
****************************************************
Input Data:

0,311,1769
0.5,803,6347
1,1699,13981
1.5,2734,23081
2,3891,29455
2.5,4659,32341
3,4839,30548
3.5,4303,26531
4,3408,21721
4.5,2622,17709
5,1849,13883
5.5,1280,10462
6,908,7045
6.5,571,4156
7,254,2510
7.5,114,1323
8,69,742
8.5,18,442
9,11,242

Results:

Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 3.1861 1.5601

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 08060.8373
-2 Log Likelihood = 08060.8373
-2 Log Likelihood = 08060.8373 (Converged)

Overall Model Fit...
Chi Square= 0.1406; df=1; p= 0.7077

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0014 0.0037 0.7076
Intercept 1.9664

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9986 0.9914 1.0059

X1 n0 n1 Calc Prob
0.0000 311 1769 0.8772
0.5000 803 6347 0.8771
1.0000 1699 13981 0.8771
1.5000 2734 23081 0.8770
2.0000 3891 29455 0.8769
2.5000 4659 32341 0.8768
3.0000 4839 30548 0.8768
3.5000 4303 26531 0.8767
4.0000 3408 21721 0.8766
4.5000 2622 17709 0.8765
5.0000 1849 13883 0.8765
5.5000 1280 10462 0.8764
6.0000 908 7045 0.8763
6.5000 571 4156 0.8762
7.0000 254 2510 0.8762
7.5000 114 1323 0.8761
8.0000 69 742 0.8760
8.5000 18 442 0.8759
9.0000 11 242 0.8759

****************************************************
Length 30
****************************************************
Input Data:

0,331,845
0.5,1211,3738
1,2363,9939
1.5,3410,17124
2,4015,21954
2.5,4212,22434
3,3849,19619
3.5,8804,10964
4,2325,13606
4.5,1655,11425
5,1180,8476
5.5,776,5796
6,472,3510
6.5,260,1939
7,115,1209
7.5,73,603
8,23,375
8.5,14,116
9,2,43

Results:

Descriptives...

35090 cases have Y=0; 153715 cases have Y=1.

Variable Avg SD
1 3.0081 1.4785

Iteration History...
-2 Log Likelihood = 81310.8671 (Null Model)
-2 Log Likelihood = 81119.5394
-2 Log Likelihood = 81119.3672
-2 Log Likelihood = 81119.3672
-2 Log Likelihood = 81119.3672 (Converged)

Overall Model Fit...
Chi Square= 191.4999; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0560 0.0041 0.0000
Intercept 1.3109

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0576 1.0492 1.0661

X1 n0 n1 Calc Prob
0.0000 331 845 0.7877
0.5000 1211 3738 0.7923
1.0000 2363 9939 0.7969
1.5000 3410 17124 0.8014
2.0000 4015 21954 0.8058
2.5000 4212 22434 0.8101
3.0000 3849 19619 0.8144
3.5000 8804 10964 0.8186
4.0000 2325 13606 0.8227
4.5000 1655 11425 0.8268
5.0000 1180 8476 0.8307
5.5000 776 5796 0.8346
6.0000 472 3510 0.8385
6.5000 260 1939 0.8422
7.0000 115 1209 0.8459
7.5000 73 603 0.8495
8.0000 23 375 0.8531
8.5000 14 116 0.8565
9.0000 2 43 0.8599

****************************************************
Length 40
****************************************************
Input Data:

0,393,202
0.5,1877,1448
1,5312,4508
1.5,9447,8133
2,12351,9607
2.5,11264,8702
3,8856,7392
3.5,6062,5686
4,4058,4057
4.5,2868,2494
5,2067,1453
5.5,1398,873
6,780,383
6.5,409,198
7,190,86
7.5,97,38
8,76,14
8.5,43,2
9,8,2

Results:

Descriptives...

67556 cases have Y=0; 55278 cases have Y=1.

Variable Avg SD
1 2.6319 1.2622

Iteration History...
-2 Log Likelihood = 69054.7695 (Null Model)
-2 Log Likelihood = 69039.0747
-2 Log Likelihood = 69039.0746
-2 Log Likelihood = 69039.0746 (Converged)

Overall Model Fit...
Chi Square= 15.6949; df=1; p= 0.0001

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0180 0.0045 0.0001
Intercept -0.1532

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9821 0.9734 0.9909

X1 n0 n1 Calc Prob
0.0000 393 202 0.4618
0.5000 1877 1448 0.4595
1.0000 5312 4508 0.4573
1.5000 9447 8133 0.4551
2.0000 12351 9607 0.4528
2.5000 11264 8702 0.4506
3.0000 8856 7392 0.4484
3.5000 6062 5686 0.4461
4.0000 4058 4057 0.4439
4.5000 2868 2494 0.4417
5.0000 2067 1453 0.4395
5.5000 1398 873 0.4373
6.0000 780 383 0.4350
6.5000 409 198 0.4328
7.0000 190 86 0.4306
7.5000 97 38 0.4284
8.0000 76 14 0.4262
8.5000 43 2 0.4240
9.0000 8 2 0.4218


****************************************************
Length 50
****************************************************
Input Data:

0,294,202
0.5,853,1323
1,2115,3331
1.5,3401,5411
2,4604,6235
2.5,4505,6062
3,4142,5097
3.5,3076,4100
4,2236,2986
4.5,1466,1887
5,915,1105
5.5,568,578
6,289,285
6.5,121,112
7,57,40
7.5,19,9
8,5,0
8.5,0,0
9,0,0

Results:

Descriptives...

28666 cases have Y=0; 38763 cases have Y=1.

Variable Avg SD
1 2.6539 1.2487

Iteration History...
-2 Log Likelihood = 91958.7888 (Null Model)
-2 Log Likelihood = 91862.7620
-2 Log Likelihood = 91862.7590
-2 Log Likelihood = 91862.7590 (Converged)

Overall Model Fit...
Chi Square= 96.0299; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0611 0.0062 0.0000
Intercept 0.4643

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9407 0.9293 0.9523

X1 n0 n1 Calc Prob
0.0000 294 202 0.6140
0.5000 853 1323 0.6068
1.0000 2115 3331 0.5995
1.5000 3401 5411 0.5921
2.0000 4604 6235 0.5847
2.5000 4505 6062 0.5773
3.0000 4142 5097 0.5698
3.5000 3076 4100 0.5623
4.0000 2236 2986 0.5548
4.5000 1466 1887 0.5472
5.0000 915 1105 0.5396
5.5000 568 578 0.5320
6.0000 289 285 0.5244
6.5000 121 112 0.5168
7.0000 57 40 0.5092
7.5000 19 9 0.5015
8.0000 5 0 0.4939
8.5000 0 0 0.4863
9.0000 0 0 0.4786
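
(For what it's worth, the Calc Prob column above is just the fitted
curve 1/(1+exp(-(intercept + coeff*x))), which can be put next to the
raw proportions n1/(n0+n1). A minimal sketch using the Length 50
coefficients and first few rows above:)

import math

intercept, coeff = 0.4643, -0.0611                              # Length 50 fit above
rows = [(0.0, 294, 202), (0.5, 853, 1323), (1.0, 2115, 3331)]   # x, n0, n1

for x, n0, n1 in rows:
    p_raw = n1 / (n0 + n1)                                    # observed proportion
    p_fit = 1.0 / (1.0 + math.exp(-(intercept + coeff * x)))  # logistic prediction
    print(x, round(p_raw, 4), round(p_fit, 4))                # p_fit reproduces Calc Prob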

Ray Koopman

unread,
Mar 29, 2012, 3:42:54 PM3/29/12
to
I've looked at plots of p against x for the data you analyzed with
John's program, and I agree with his comment that the relations do
not look logistic. I also have some suggestions about what to do.

First, make sure the data are correct. I went back to the Length-30
data at
https://groups.google.com/group/sci.stat.math/msg/1a9f4ccc8451a240
and found that the square versions of both the input and output
matrices are not symmetric. Only one pair disagrees in the output,
but in the input there is one pair plus the entire row/col 8 (except
for the diagonal). Since columns 8 & 9 are the same except for the
bottom row, but row 8 & 9 are never equal, it looks as if column 8 is
bad. However, the upper-triangle summary in the same post appears to
use column 8, not row 8. This sort of thing has happened often enough
now to make me wonder about all the data.
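
(A quick mechanical check along these lines is to read each 10x10 back
in and list every off-diagonal cell that disagrees with its transpose.
A minimal sketch in Python/numpy; the filename is only a placeholder:)

import numpy as np

A = np.loadtxt("len30_input.txt")        # hypothetical file holding the 10x10 count matrix
bad = [(i, j, A[i, j], A[j, i])
       for i in range(A.shape[0]) for j in range(i)
       if A[i, j] != A[j, i]]            # off-diagonal cells that break symmetry
for i, j, a, b in bad:
    print(f"({i},{j}) = {a}   ({j},{i}) = {b}")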

Second, it may be time to complicate the model. A plot of all 55 p's
against x shows considerable variability within levels of x. Why? One
explanation might be that the lengths differ, that the range 33-42 is
large enough to matter. Try putting the actual length into the model.
Another might be that v1-v2 matters. Put it in, too. Another might
be that we should be using the continuous u1 and u2 instead of v1
and v2. And it may be better to model both the input and either the
output or the proportion. I don't know which of those might be the
most important, or what the model should look like -- those are
substantive calls, not statistical ones.

djh

unread,
Mar 29, 2012, 8:16:35 PM3/29/12
to koo...@sfu.ca
As always, Ray, thanks very much for taking the time and my apologies
for any mistakes still in the data - I thought I had gotten rid of
them all. I am ashamed - there's no reason for me to raise that kind
of dust, particularly when you're not billing.

The reason I don't want to go right now to unquantized u rather than
quantized v is because the other half of our central hypothesis (the
protein message side as opposed to the protein structure side) is
remarkably well supported by the following correlations of our 10
quantized v values 0,...9 with 10 levels of a quantized energetic
variable e. (Note how the signs of the coefficients flip perfectly in
our control results, and note also that all these results are entirely
length-independent.)

results for
v (quantized)
against "e"

Corr F RSquare

-0.850 0.0015 0.72 globins
-0.930 0.0001 0.86 cytochrome C's

-0.960 0.00001 0.92 immunoglobulins
-0.730 0.0152 0.53 trypsins

-0.970 0.000003 0.94 TIM-barrels
-0.800 0.0050 0.64 NADP-binders


control results for
v (quantized)
against "e"

Corr F Rsquare

0.870 0.0010 0.76 globins
0.920 0.0002 0.85 cytochrome C's

0.670 0.0332 0.45 immunoglobulins
0.730 0.0150 0.53 trypsins

0.760 0.0110 0.58 TIM-barrels
0.930 0.0001 0.86 NADP-binders

So if we go with unquantized u or re-quantize u differently on this
side of the hypothesis, we would have to unquantize or requantize it
correspondingly on the other side of the hypothesis, and although I
strongly suspect that we could do so without losing the remarkable
correlations shown above, I am loath to do so if we're not sure we
have to. Also, from a scientific point of view, our critics could
argue that if quantized v works against e on the message side of the
hypothesis, then quantized v should also work against some variable(s)
on the structure side - otherwise, something is amiss.


What I am going to do first is add u1-u2 to the model, and rather than
add length to the model, I am going to pool ACROSS lengths.

The reason I'm going to add u1-u2 is because it makes perfect
scientific sense that our "filter" should not behave the same way with
respect to a (v1+v2)/2 value of 4.5 obtained from 4 and 5, and the same
value of 4.5 obtained from 0 and 9.

The reason I'm going to pool across lengths is three-fold:

a) it was really only a guess on my part that length might be a
category that has to be kept relatively constant when analyzing v
against y = p (yield proportion.)

b) adding u1-u2 to the model is going to dangerously decrease
population per cell unless I pool across lengths, in which case, no
such decrease will generally occur.

c) if the results shown above for "the other side of the street" are
length-independent, then our critics could argue that our results on
this side of the street should also be length-independent.

Finally, regarding both length and v, there is a matter of sampling
that I need to at least make you aware of.

Right now, our length intervals are 13-22,23-32,...,93-102,103-112,
and we never feed our "filter" (Dr. Lesk's structural alignment
program) an input pair that is drawn from two different length
intervals. This keeps us from asking the filter to do silly things
with silly input pairs of greatly disparate lengths, but it does raise
an obvious sampling issue.

In particular, neither nature nor Dr. Lesk's program considers the
lengths 21 and 22 to be somehow closer than 22 and 23, or similarly,
the lengths 23 and 24 to be closer than 21 and 22. Yet of these three
possible classes of input pairs defined by length, we only feed our
filter pairs from one. So I know we're implicitly "sampling" here,
but I don't know if we're doing so legitimately or illegitimately -
perhaps you can advise here.

And of course, the same implicit "sampling" question arises in the
case of our quantized v. The levels 0,...,9 of v correspond, in the
case of the globin proteins, to 10 ranges
0.01-0.50,0.51-1.00,...,4.01-4.50,4.51-5.00. So our protocol
distinguishes between two values such as 4.50 and 4.51 in a way that
it does not distinguish between the equally close values 4.49 and 4.50
or 4.51 and 4.52. Again, perhaps you can advise how bad a problem
this actually is.
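
(To make the boundary issue concrete, here is a minimal sketch of the
binning itself, assuming -- as in the globin case just described -- ten
half-unit ranges 0.01-0.50 through 4.51-5.00:)

import math

def quantize_v(u):
    # map a raw u in (0, 5] to a level 0..9: 0.01-0.50 -> 0, ..., 4.51-5.00 -> 9
    return min(math.ceil(u / 0.5) - 1, 9)

print(quantize_v(4.50), quantize_v(4.51))   # 8 9 : adjacent values split across bins
print(quantize_v(4.49), quantize_v(4.50))   # 8 8 : equally close values are not split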

In any event, when I add v1-v2 and pool across lengths, please be
assured I will make sure all data are correct, and when I have the
results from John's program, I will post back. Shouldn't take that
much effort.



djh

unread,
Mar 30, 2012, 5:43:52 AM3/30/12
to koo...@sfu.ca
Please note that u1-u2 should have been v1-v2 in these paragraphs of my
last post:

1)
"What I am going to do first is add u1-u2 to the model, and rather
than
add length to the model, I am going to pool ACROSS lengths. "

2)
The reason I'm going to add u1-u2 is because it makes perfect
scientific sense that our "filter" should not behave the same way with
respect to a (v1+v2)/2 value of 4.5 obtained from 4 and 5, and the same
value of 4.5 obtained from 0 and 9.

3)
b) adding u1-u2 to the model is going to dangerously decrease
population per cell unless I pool across lengths, in which case, no
such decrease will generally occur.

Sorry for the confusion - I was composing rapidly right before
boarding a flight ...


djh

unread,
Mar 30, 2012, 7:59:06 AM3/30/12
to koo...@sfu.ca
Ray - I forgot to ask you one basic question last night.

Suppose that sufficient and insightful complication of the model
produces a satisfactory logistic regression involving the "yield-
related" properties of the "filter".

Is the fact that the "yield-related" properties of the filter are best
modellable by a logistic regression incompatible with the possibility
that the "RMSD-related" property of the filter might be modelled by a
linear regression? Or could these models both exist without conflict
from a purely statistical point of view?

(The "RMSD-related" property has to do with the degree of the
similarity between the two members of each output pair that the filter
"yields" from its input pairs, not how many output pairs are yielded
from input pairs.)

djh

unread,
Mar 30, 2012, 12:32:29 PM3/30/12
to koo...@sfu.ca
Before going to pooled length data, I decided just to add v1-v2 to the
model while still keeping the lengths distinct.

Below are the results for lengths 20,30,40,50,60 with predictors
(v1+v2)/2 and v1-v2. (All input data is error-free - this AM I
rechecked all five 10x10's for symmetry before doing anything else.)

Please take a look if/when you have a chance and let me know what you
think.

Length 20

Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 3.1861 1.5601
2 2.4325 1.9607

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 07579.6536
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873 (Converged)

Overall Model Fit...
Chi Square= 483.0906; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0341 0.0039 0.0000
2 0.0715 0.0033 0.0000
Intercept 1.9027

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9665 0.9591 0.9740

2 1.0742 1.0673 1.0811

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 311 1769 0.8702
0.5000 1.0000 803 6347 0.8762
1.0000 0.0000 577 5418 0.8663
1.0000 2.0000 1122 8563 0.8820
1.5000 1.0000 1630 14760 0.8725
1.5000 3.0000 1104 8321 0.8876
2.0000 0.0000 1153 9873 0.8623
2.0000 2.0000 1601 14349 0.8784
2.0000 4.0000 1137 5233 0.8929
2.5000 1.0000 2387 19218 0.8686
2.5000 3.0000 1724 9056 0.8841
2.5000 5.0000 548 4067 0.8980
3.0000 0.0000 1273 9167 0.8582
3.0000 2.0000 2480 12122 0.8747
3.0000 4.0000 820 6990 0.8896
3.0000 6.0000 266 2269 0.9029
3.5000 1.0000 2522 11688 0.8647
3.5000 3.0000 1194 9385 0.8806
3.5000 5.0000 381 3909 0.8948
3.5000 7.0000 206 1549 0.9076
4.0000 0.0000 1136 3617 0.8540
4.0000 2.0000 1265 9030 0.8710
4.0000 4.0000 607 5204 0.8862
4.0000 6.0000 301 2669 0.8999
4.0000 8.0000 99 1201 0.9120
4.5000 1.0000 1281 5677 0.8607
4.5000 3.0000 658 4997 0.8770
4.5000 5.0000 450 3573 0.8916
4.5000 7.0000 101 2099 0.9047
4.5000 9.0000 132 1363 0.9163
5.0000 0.0000 337 2148 0.8497
5.0000 2.0000 736 3086 0.8671
5.0000 4.0000 443 3472 0.8827
5.0000 6.0000 185 2795 0.8967
5.0000 8.0000 148 2382 0.9093
5.5000 1.0000 355 2414 0.8565
5.5000 3.0000 480 2166 0.8732
5.5000 5.0000 205 2695 0.8883
5.5000 7.0000 240 3187 0.9017
6.0000 0.0000 109 632 0.8453
6.0000 2.0000 264 1653 0.8631
6.0000 4.0000 263 1697 0.8791
6.0000 6.0000 272 3063 0.8935
6.5000 1.0000 131 922 0.8523
6.5000 3.0000 122 1298 0.8694
6.5000 5.0000 318 1936 0.8848
7.0000 0.0000 46 305 0.8408
7.0000 2.0000 52 728 0.8590
7.0000 4.0000 156 1477 0.8755
7.5000 1.0000 38 502 0.8480
7.5000 3.0000 76 821 0.8655
8.0000 0.0000 2 188 0.8362
8.0000 2.0000 67 554 0.8548
8.5000 1.0000 18 442 0.8435
9.0000 0.0000 11 242 0.8315

*********************************************************
Length 30

Descriptives...

35090 cases have Y=0; 153715 cases have Y=1.

Variable Avg SD
1 3.0081 1.4785
2 2.2900 1.8783

Iteration History...
-2 Log Likelihood = 81310.8671 (Null Model)
-2 Log Likelihood = 80999.3103
-2 Log Likelihood = 80998.6696
-2 Log Likelihood = 80998.6696
-2 Log Likelihood = 80998.6696 (Converged)

Overall Model Fit...
Chi Square= 312.1976; df=2; p= -0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0778 0.0046 0.0000
2 -0.0386 0.0035 0.0000
Intercept 1.3349

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0809 1.0713 1.0907

2 0.9622 0.9556 0.9688

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 331 845 0.7917
0.5000 1.0000 1211 3738 0.7917
1.0000 0.0000 1185 3865 0.8042
1.0000 2.0000 1178 6074 0.7918
1.5000 1.0000 2413 12535 0.8043
1.5000 3.0000 997 4589 0.7918
2.0000 0.0000 1203 9675 0.8162
2.0000 2.0000 2036 9478 0.8043
2.0000 4.0000 776 2801 0.7919
2.5000 1.0000 2146 14726 0.8162
2.5000 3.0000 1606 5767 0.8044
2.5000 5.0000 460 1941 0.7919
3.0000 0.0000 939 5502 0.8276
3.0000 2.0000 1795 9009 0.8163
3.0000 4.0000 918 4031 0.8044
3.0000 6.0000 197 1077 0.7920
3.5000 1.0000 1514 6808 0.8276
3.5000 3.0000 6678 574 0.8163
3.5000 5.0000 412 2214 0.8045
3.5000 7.0000 200 1368 0.7921
4.0000 0.0000 601 2027 0.8384
4.0000 2.0000 789 4797 0.8277
4.0000 4.0000 394 3454 0.8164
4.0000 6.0000 407 2825 0.8045
4.0000 8.0000 134 503 0.7921
4.5000 1.0000 660 2917 0.8384
4.5000 3.0000 348 2616 0.8277
4.5000 5.0000 313 4423 0.8164
4.5000 7.0000 272 1041 0.8046
4.5000 9.0000 62 428 0.7922
5.0000 0.0000 172 1004 0.8487
5.0000 2.0000 319 1579 0.8385
5.0000 4.0000 287 3361 0.8278
5.0000 6.0000 260 1664 0.8165
5.0000 8.0000 142 868 0.8046
5.5000 1.0000 155 1119 0.8487
5.5000 3.0000 297 2039 0.8385
5.5000 5.0000 228 1254 0.8278
5.5000 7.0000 96 1384 0.8165
6.0000 0.0000 33 292 0.8584
6.0000 2.0000 142 1426 0.8487
6.0000 4.0000 194 755 0.8386
6.0000 6.0000 103 1037 0.8279
6.5000 1.0000 48 784 0.8584
6.5000 3.0000 109 528 0.8488
6.5000 5.0000 103 627 0.8386
7.0000 0.0000 21 475 0.8676
7.0000 2.0000 47 291 0.8585
7.0000 4.0000 47 443 0.8488
7.5000 1.0000 50 366 0.8676
7.5000 3.0000 23 237 0.8585
8.0000 0.0000 15 63 0.8763
8.0000 2.0000 8 312 0.8677
8.5000 1.0000 14 116 0.8763
9.0000 0.0000 2 43 0.8845

*********************************************************
Length 40

Descriptives...

67543 cases have Y=0; 55271 cases have Y=1.

Variable Avg SD
1 2.6309 1.2600
2 1.8954 1.6584

Iteration History...
-2 Log Likelihood = 69028.0461 (Null Model)
-2 Log Likelihood = 69012.6513
-2 Log Likelihood = 69012.6513
-2 Log Likelihood = 69012.6513 (Converged)

Overall Model Fit...
Chi Square= 15.3948; df=2; p= 0.0005

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0159 0.0052 0.0021
2 -0.0028 0.0039 0.4797
Intercept -0.1535

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9842 0.9744 0.9942

2 0.9972 0.9896 1.0049

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 393 202 0.4617
0.5000 1.0000 1877 1448 0.4590
1.0000 0.0000 2055 2410 0.4578
1.0000 2.0000 3257 2098 0.4564
1.5000 1.0000 7523 7012 0.4551
1.5000 3.0000 1924 1121 0.4537
2.0000 0.0000 6669 4959 0.4538
2.0000 2.0000 4495 3770 0.4524
2.0000 4.0000 1187 878 0.4511
2.5000 1.0000 7971 5340 0.4512
2.5000 3.0000 2728 2877 0.4498
2.5000 5.0000 565 485 0.4484
3.0000 0.0000 2357 1384 0.4499
3.0000 2.0000 4950 4077 0.4485
3.0000 4.0000 1241 1609 0.4471
3.0000 6.0000 308 322 0.4458
3.5000 1.0000 2958 2175 0.4472
3.5000 3.0000 2311 2279 0.4459
3.5000 5.0000 611 1099 0.4445
3.5000 7.0000 182 133 0.4431
4.0000 0.0000 899 812 0.4460
4.0000 2.0000 1397 1213 0.4446
4.0000 4.0000 1185 1569 0.4432
4.0000 6.0000 428 437 0.4419
4.0000 8.0000 149 26 0.4405
4.5000 1.0000 862 908 0.4433
4.5000 3.0000 745 821 0.4419
4.5000 5.0000 738 639 0.4406
4.5000 7.0000 402 73 0.4392
4.5000 9.0000 122 53 0.4379
5.0000 0.0000 204 231 0.4420
5.0000 2.0000 439 623 0.4407
5.0000 4.0000 446 337 0.4393
5.0000 6.0000 663 102 0.4379
5.0000 8.0000 315 160 0.4366
5.5000 1.0000 206 334 0.4394
5.5000 3.0000 280 251 0.4380
5.5000 5.0000 374 61 0.4367
5.5000 7.0000 538 227 0.4353
6.0000 0.0000 73 90 0.4381
6.0000 2.0000 135 135 0.4368
6.0000 4.0000 253 42 0.4354
6.0000 6.0000 319 116 0.4340
6.5000 1.0000 70 92 0.4355
6.5000 3.0000 128 22 0.4341
6.5000 5.0000 211 84 0.4328
7.0000 0.0000 22 14 0.4342
7.0000 2.0000 75 15 0.4329
7.0000 4.0000 100 50 0.4315
7.5000 1.0000 40 5 0.4316
7.5000 3.0000 57 33 0.4302
8.0000 0.0000 45 0 0.4303
8.0000 2.0000 31 14 0.4290
8.5000 1.0000 22 2 0.4277
9.0000 0.0000 8 2 0.4264

*********************************************************
Length 50

Descriptives...

28764 cases have Y=0; 38763 cases have Y=1.

Variable Avg SD
1 2.6602 1.2592
2 1.9564 1.5974

Iteration History...
-2 Log Likelihood = 92126.2478 (Null Model)
-2 Log Likelihood = 92000.3801
-2 Log Likelihood = 92000.3751
-2 Log Likelihood = 92000.3751 (Converged)

Overall Model Fit...
Chi Square= 125.8727; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0684 0.0066 0.0000
2 -0.0018 0.0052 0.7256
Intercept 0.4844

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9339 0.9218 0.9461

2 0.9982 0.9880 1.0085

X1 X2 n0 n1 Calc Prob
-0.0000 0.0000 294 202 0.6188
0.5000 1.0000 853 1323 0.6103
1.0000 0.0000 431 1847 0.6025
1.0000 2.0000 1684 1484 0.6016
1.5000 1.0000 2342 4390 0.5939
1.5000 3.0000 1059 1021 0.5930
2.0000 0.0000 2368 2483 0.5860
2.0000 2.0000 1450 2970 0.5851
2.0000 4.0000 786 782 0.5843
2.5000 1.0000 3021 3414 0.5773
2.5000 3.0000 1041 2291 0.5764
2.5000 5.0000 443 357 0.5755
3.0000 0.0000 959 1121 0.5694
3.0000 2.0000 2265 2586 0.5685
3.0000 4.0000 628 1072 0.5676
3.0000 6.0000 290 318 0.5667
3.5000 1.0000 1420 1765 0.5605
3.5000 3.0000 1205 1270 0.5596
3.5000 5.0000 351 941 0.5587
3.5000 7.0000 100 124 0.5578
4.0000 0.0000 513 663 0.5525
4.0000 2.0000 771 854 0.5516
4.0000 4.0000 826 1055 0.5507
4.0000 6.0000 92 384 0.5498
4.0000 8.0000 34 30 0.5489
4.5000 1.0000 580 645 0.5436
4.5000 3.0000 540 695 0.5427
4.5000 5.0000 262 431 0.5418
4.5000 7.0000 28 108 0.5409
4.5000 9.0000 56 8 0.5399
5.0000 0.0000 161 139 0.5356
5.0000 2.0000 405 526 0.5346
5.0000 4.0000 162 293 0.5337
5.0000 6.0000 83 115 0.5328
5.0000 8.0000 104 32 0.5319
5.5000 1.0000 210 265 0.5266
5.5000 3.0000 123 220 0.5257
5.5000 5.0000 51 79 0.5247
5.5000 7.0000 184 14 0.5238
6.0000 0.0000 65 106 0.5185
6.0000 2.0000 62 113 0.5176
6.0000 4.0000 37 61 0.5167
6.0000 6.0000 125 5 0.5158
6.5000 1.0000 56 77 0.5095
6.5000 3.0000 19 31 0.5086
6.5000 5.0000 98 0 0.5077
7.0000 0.0000 5 16 0.5014
7.0000 2.0000 16 22 0.5005
7.0000 4.0000 46 4 0.4996
7.5000 1.0000 5 9 0.4924
7.5000 3.0000 36 2 0.4915
8.0000 0.0000 1 0 0.4843
8.0000 2.0000 14 0 0.4834
8.5000 1.0000 4 0 0.4753
9.0000 0.0000 0 0 0.4673

*********************************************************
Length 60

Descriptives...

21244.5 cases have Y=0; 27593.5 cases have Y=1.

Variable Avg SD
1 2.4001 1.1066
2 1.6640 1.4668

Iteration History...
-2 Log Likelihood = 66876.1255 (Null Model)
-2 Log Likelihood = 66486.7918
-2 Log Likelihood = 66486.5709
-2 Log Likelihood = 66486.5709
-2 Log Likelihood = 66486.5709 (Converged)

Overall Model Fit...
Chi Square= 389.5546; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.1649 0.0093 0.0000
2 0.0003 0.0070 0.9647
Intercept -0.1324

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.1793 1.1581 1.2010

2 1.0003 0.9867 1.0141

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 142 68 0.4669
0.5000 1.0000 931 707 0.4876
1.0000 0.0000 1250 1753 0.5081
1.0000 2.0000 937 701 0.5083
1.5000 1.0000 2543 3541 0.5288
1.5000 3.0000 953 685 0.5289
2.0000 0.0000 1306 1697 0.5492
2.0000 2.0000 2738 3346 0.5494
2.0000 4.0000 365 307 0.5495
2.5000 1.0000 2775 3309 0.5696
2.5000 3.0000 963 1533 0.5698
2.5000 5.0000 106 146 0.5699
3.0000 0.0000 1453 1550 0.5896
3.0000 2.0000 985 1511 0.5898
3.0000 4.0000 215 721 0.5899
3.0000 6.0000 117 72 0.5901
3.5000 1.0000 1071 1425 0.6095
3.5000 3.0000 228 708 0.6096
3.5000 5.0000 339 363 0.6098
3.5000 7.0000 18 24 0.6099
4.0000 0.0000 190 306 0.6289
4.0000 2.0000 260 676 0.6290
4.0000 4.0000 339 363 0.6292
4.0000 6.0000 40 116 0.6293
4.0000 8.0000 0.5 0.5 0.6294
4.5000 1.0000 95 289 0.6480
4.5000 3.0000 352 350 0.6481
4.5000 5.0000 36 120 0.6483
4.5000 7.0000 0.5 0.5 0.6484
4.5000 9.0000 27 36 0.6485
5.0000 0.0000 5 61 0.6665
5.0000 2.0000 126 162 0.6666
5.0000 4.0000 46 110 0.6668
5.0000 6.0000 0.5 0.5 0.6669
5.0000 8.0000 55 179 0.6670
5.5000 1.0000 38 70 0.6846
5.5000 3.0000 14 50 0.6848
5.5000 5.0000 0.5 0.5 0.6849
5.5000 7.0000 53 181 0.6850
6.0000 0.0000 20 16 0.7021
6.0000 2.0000 4 20 0.7022
6.0000 4.0000 0.5 0.5 0.7023
6.0000 6.0000 69 165 0.7025
6.5000 1.0000 8 10 0.7191
6.5000 3.0000 0.5 0.5 0.7192
6.5000 5.0000 17 79 0.7194
7.0000 0.0000 0.5 0.5 0.7354
7.0000 2.0000 0.5 0.5 0.7355
7.0000 4.0000 0 36 0.7356
7.5000 1.0000 0.5 0.5 0.7512
7.5000 3.0000 9 18 0.7513
8.0000 0.0000 0.5 0.5 0.7662
8.0000 2.0000 0 6 0.7663
8.5000 1.0000 0.5 0.5 0.7807
9.0000 0.0000 1 2 0.7945

djh

unread,
Mar 30, 2012, 4:27:44 PM3/30/12
to koo...@sfu.ca
From the summary data below for the length 20-60 runs with two
predictors (v1+v2)/2 and (v1-v2), it looks to me like (v1+v2)/2 always
yields significant results (p < 0.05) when we control for (v1-v2). So
I will continue to run with data from different length intervals,
instead of pooling across lengths. This is because much more
information of potential scientific relevance is provided by runs that
don't pool across length, and so long as the information comes with a
satisfactory p-value, I might as well use it.

If you agree about the significance of (v1+v2)/2 in the cases below,
permit me to ask you a question.

Since (v1+v2)/2 is significant in all these runs once we control for
(v1-v2), is it safe to trust the linear regression of logit[p] as also
saying something meaningful about the data in these cases?

Length 20

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0341 0.0039 0.0000
2 0.0715 0.0033 0.0000
Intercept 1.9027

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9665 0.9591 0.9740

2 1.0742 1.0673 1.0811

Length 30

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0778 0.0046 0.0000
2 -0.0386 0.0035 0.0000
Intercept 1.3349

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0809 1.0713 1.0907

2 0.9622 0.9556 0.9688

Length 40

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0159 0.0052 0.0021
2 -0.0028 0.0039 0.4797
Intercept -0.1535

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9842 0.9744 0.9942

2 0.9972 0.9896 1.0049

Length 50

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0684 0.0066 0.0000
2 -0.0018 0.0052 0.7256
Intercept 0.4844

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9339 0.9218 0.9461

2 0.9982 0.9880 1.0085

Length 60

djh

unread,
Mar 30, 2012, 5:13:25 PM3/30/12
to koo...@sfu.ca
PS - I double-checked with John and he confirmed that

p = 0.0000

means

p < 0.0001

Ray Koopman

unread,
Mar 30, 2012, 7:59:54 PM3/30/12
to
On Mar 29, 5:16 pm, djh <halitsk...@att.net> wrote:
> [...]
> Finally, regarding both length and v, there is a matter of sampling
> that I need to at least make you aware of.
>
> Right now, our length intervals are 13-22,23-32,...,93-102,103-112,
> and we never feed our "filter" (Dr. Lesk's structural alignment
> program) an input pair that is drawn from two different length
> intervals. This keeps us from asking the filter to do silly things
> with silly input pairs of greatly disparate lengths, but it does
> raise an obvious sampling issue.
>
> In particular, neither nature nor Dr. Lesk's program considers
> the lengths 21 and 22 to be somehow closer than 22 and 23, or
> similarly, the lengths 23 and 24 to be closer than 21 and 22.
> Yet of these three possible classes of input pairs defined by
> length, we only feed our filter pairs from one. So I know we're
> implicitly "sampling" here, but I don't know if we're doing so
> legitimately or illegitimately - perhaps you can advise here.
>
> And of course, the same implicit "sampling" question arises in
> the case of our quantized v. The levels 0,...,9 of v correspond,
> in the case of the globin proteins, to 10 ranges 0.01-0.50,
> 0.51-1.00, ..., 4.01-4.50, 4.51-5.00. So our protocol
> distinguishes between two values such as 4.50 and 4.51 in a way
> that it does not distinguish between the equally close values
> 4.49 and 4.50 or 4.51 and 4.52. Again, perhaps you can advise
> how bad a problem this actually is.

This is one of those situations where you need to know the answer to
the research question before you can design the optimal experiment
to answer the research question. One way to approach such problems
is simulation. Otherwise, the best I can do is to quote Point 3 in
Frank Harrell's Philosophy of Biostatistics: "Avoid categorizing
continuous variables and predicted values at all costs"
<http://biostat.mc.vanderbilt.edu/wiki/Main/FrankHarrell>.
You might try poking around that department to see if you can find
a student who would be interested in getting involved.

Ray Koopman

unread,
Mar 30, 2012, 8:01:24 PM3/30/12
to
On Mar 30, 4:59 am, djh <halitsk...@att.net> wrote:
> Ray - I forgot to ask you one basic question last night.
>
> Suppose that sufficient and insightful complication of the model
> produces a satisfactory logistic regression involving the "yield-
> related" properties of the "filter".
>
> Is the fact that the "yield-related" properties of the filter are
> best modellable by a logistic regression incompatible with the
> possibility that the "RMSD-related" property of the filter might be
> modelled by a linear regression? Or could these models both exist
> without conflict from a purely statistical point of view?

No problem. Just remember that if the yield proportion is a logistic
function then the logit of that proportion is a linear function.
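
(A quick numerical illustration of that point, with an arbitrary
intercept and slope:)

import math

a, b = 1.3, 0.08                                  # arbitrary values, just for the sketch
for x in (0.0, 1.0, 2.0, 3.0):
    p = 1.0 / (1.0 + math.exp(-(a + b * x)))      # logistic in p ...
    logit_p = math.log(p / (1.0 - p))             # ... is exactly linear in logit[p]
    print(x, round(p, 4), round(logit_p, 4))      # logit_p equals a + b*x up to rounding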

djh

unread,
Mar 30, 2012, 9:22:09 PM3/30/12
to Koo...@sfu.ca
I'm glad you added:

"Just remember that if the yield proportion is a logistic
function then the logit of that proportion is a linear function."

I think this provides an affirmative answer to a question I asked
above that you haven't answered yet. This question, slightly
rephrased, was:

"if x predicts p with sufficient significance in a logistic
regression, can we assume that reasonable significance must also
attach to the linear regression of x with logit[p]?"

Assuming I'm correct here, i.e. that x against logit[p] is
"meaningful" linearly if x against logit[p] is "meaningful"
logistically, then one question I don't understand is how one does a
linear regression with two predictors in such a way that one variable
is controlled for. For example, how does one do a linear regression
equivalent to the logistic regressions I just ran in John's program,
where his program analyzed (v1+v2)/2 while controlling for v1-v2.

Is this where tools like MANOVA come in, to deal with multi-predictor
linear regressions?

djh

unread,
Mar 31, 2012, 11:32:45 PM3/31/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com, rle...@fordham.edu
Ray -

Here are the coefficients, standard error values, and p values for the
logistic regressions on lengths 20-80 globin data using quantized
(v1+v2)/2 and (v1-v2) as predictors. [Note to JRF/AML/MS: v1 and v2
are the dicodon representation levels in the pairs of subsequences
associated with the substructures submitted for structural alignment.]

The table also shows the slopes and standard error values for the
correspondng linear regressions on logit[p], computed as per your
instructions in past posts here.

|--Logistic Regression-----|-Linear Regression on logit[p]-|
Length Coeff. StdErr p Slope Std Err
20 -0.0341 0.0039 0.0000 0.044459623 0.012984522
30 0.0778 0.0046 0.0000 0.074139435 0.016295094
40 -0.0159 0.0052 0.0021 -0.063256909 0.022263999
50 -0.0684 0.0066 0.0000 -0.096360478 0.046296441
60 0.1649 0.0093 0.0000 0.036298857 0.034564206
70 -0.1698 0.0108 0.0000 -0.079189612 0.060538538
80 -0.1816 0.0120 0.0000 -0.014339985 0.033767676

One question and one comment, if/when you have a chance.

Question:

Since I'm computing the slopes and standard error of the linear
regressions on logit[p] the way you showed me (rather than taking the
Excel slopes and standard errors for these regressions) I don't want
to take the Excel y-intercepts. So can you explain to me how to
compute the y-intercept for the linear regression on logit[p]? (We
need these intercepts for the work we're doing with Bob Lewis at
Fordham.)

Comment:

Although all the p-values indicate that ALL the logistic regressions
are significant (since a p of 0.0000 in John P's read-outs actually
means p <0.0001), you'll note that the directions are only positive in
the case of lengths 30 and 60.

The fact that 60 is twice 30 is scientifically extremely suggestive.
It tells us that the property of the data responsible for the positive
logistic correlation of alignment "yield" with (v1+v2)/2 at length 30
MAY duplicate at length 60. And we can therefore go into the
underlying detail data and see if we can in fact detect any similarity
at these lengths which involves the energetic variable e that linearly
correlates with quantized v and unquantized u. (I provided an
illustration of e:v correlations in an earlier post here.)

But of course, such an investigation would be premature until we have
ascertained by running all the data that the same pattern of positive
and negative logistic correlations does NOT occur: i) in protein
classes other than the globins; ii) in our control runs against the
globins.

So, I am merely giving you a promissory note here, because I think
that it will help you to appreciate the possible extraordinary
potential of the statistical analysis for detecting patterns in the
scientific data that well might otherwise go unnoticed.

Ray Koopman

unread,
Apr 1, 2012, 4:08:57 AM4/1/12
to
On Mar 31, 8:32 pm, djh <halitsk...@att.net> wrote:
> Ray -
>
> Here are the coefficients, standard error values, and p values for
> the logistic regressions on lengths 20-80 globin data using quantized
> (v1+v2)/2 and (v1-v2) as predictors. [Note to JRF/AML/MS: v1 and v2
> are the dicodon representation levels in the pairs of subsequences
> associated with the substructures submitted for structural alignment.]
>
> The table also shows the slopes and standard error values for the
> correspondng linear regressions on logit[p], computed as per your
> instructions in past posts here.
>
> |--Logistic Regression------|-Linear Regression on logit[p]-|
> Length Coeff. StdErr p Slope Std Err
> 20 -0.0341 0.0039 0.0000 0.044459623 0.012984522
> 30 0.0778 0.0046 0.0000 0.074139435 0.016295094
> 40 -0.0159 0.0052 0.0021 -0.063256909 0.022263999
> 50 -0.0684 0.0066 0.0000 -0.096360478 0.046296441
> 60 0.1649 0.0093 0.0000 0.036298857 0.034564206
> 70 -0.1698 0.0108 0.0000 -0.079189612 0.060538538
> 80 -0.1816 0.0120 0.0000 -0.014339985 0.033767676
>
> One question and one comment, if/when you have a chance.
>
> Question:
>
> Since I'm computing the slopes and standard error of the linear
> regressions on logit[p] the way you showed me (rather than taking
> the Excel slopes and standard errors for these regressions) I don't
> want to take the Excel y-intercepts. So can you explain to me how
> to compute the y-intercept for the linear regression on logit[p]?
> (We need these intercepts for the work we're doing with Bob Lewis
> at Fordham.)

intercept = mean[y] - slope*mean[x].

>
> Comment:
>
> Although all the p-values indicate that ALL the logistic regressions
> are significant (since a p of 0.0000 in John P's read-outs actually
> means p <0.0001), you'll note that the directions are only positive
> in the case of lengths 30 and 60.
>
> The fact that 60 is twice 30 is scientifically extremely suggestive.
> It tells us that the property of the data responsible for the positive
> logistic correlation of alignment "yield" with (v1+v2)/2 at length 30
> MAY duplicate at length 60. And we can therefore go into the
> underlying detail data and see if we can in fact detect any similarity
> at these lengths which involves the energetic variable e that linearly
> correlates with quantized v and unquantized u. (I provided an
> illustration of e:v correlations in an earlier post here.)
>
> But of course, such an investigation would be premature until we have
> ascertained by running all the data that the same pattern of positive
> and negative logistic correlations does NOT occur: i) in protein
> classes other than the globins; ii) in our control runs against the
> globins.
>
> So, I am merely giving you a promissory note here, because I think
> that it will help you to appreciate the possible extraordinary
> potential of the statistical analysis for detecting patterns in the
> scientific data that well might otherwise go unnoticed.

I too have some comments. (This thread is getting tangled enough
that I've lost track of which post prompted each of these.)

1. Cells with n0 = n1 = 0 should be omitted from all regressions.
(For the maximum likelihood logistic regressions, it doesn't change
anything if you leave them in, except possibly if the program tries
to compute the corresponding p = 0/(0 + 0). But don't plug in .5 for
both n0 and n1, for any analysis.)

2. Check the (2,5) cell in the Length 30 data: you posted
x1 = 3.5, x2 = 3; n0 = 6678, n1 = 574; those imply p = .079,
which seems too small by a factor of 10 or so.

3. Logistic regression of p vs linear regression of logit[p]. The
two models are the same; the difference is in how the coefficients
are estimated. Doing a least-squares linear regression with log[n1/
n0] as the d.v. (bad) is asymptotically the same doing a maximum
likelihood logistic regression with (n0,n1) as the d.v. (good). But
that's only asymptotically, which in this context means that if the
sample size is big enough then the results of the two methods will
usually be close, without saying anything about how big "big enough"
might be. In practice, no one would expect the two sets of results
to be close if the expected value of any n0 or n1 is small. (I've
already mentioned 30 as a reference value for the observed n0 and
n1, but that's only a conservative rule of thumb.)

4. A small p-value means only that an estimated coefficient as big
as you got would be very unusual if the true value were zero. It
does not mean that the model is correct or is accounting well for
the data. In fact, the 3-D plots of p against (x1,x2) do not look
any more logistic than the 2-D plots of p against x1 did. There is
definitely a whole lot going on that is not being modeled.

Why does this matter? Because when the model omits important
predictors then the predictors that are in the model may be acting
as surrogates for the omitted predictors, rather than as predictors
in their own right, and even the signs of their coefficients may be
wrong. The most famous case of this is probably the WWII regression
analysis where the d.v. was the accuracy of Allied bomb drops, and
the number of Nazi interceptors got a positive coefficient. Why?
Because number of interceptors was being a surrogate for the weather,
which had been left out of the equation. When it was included, number
of interceptors got the negative weight that it should have.

Ray Koopman

unread,
Apr 1, 2012, 5:02:05 AM4/1/12
to
Look up multiple linear regression: y = X b + e, where
y is an n-vector of dependent scores;
X is an n by k matrix of predictor scores;
b is an unknown k-vector of coefficients;
e is a random n-vector, usually assumed to be i.i.d. with zero means.

The ordinary least squares (aka OLS) solution is to choose b to
minimize (y-Xb)'(y-Xb): b = (X'X)^-1 X'y.
This automatically adjusts each predictor for all the others.
All predictors are treated the same; no predictor or subset of
predictors has special status.
If an intercept is wanted then one column of X should be the unit
vector (1,...,1).
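
(A minimal numerical sketch of the above, with an intercept column and
two predictors; the data values are arbitrary, and numpy's least-squares
routine is used rather than forming (X'X)^-1 explicitly:)

import numpy as np

x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])     # e.g. (v1+v2)/2
x2 = np.array([1.0, 0.0, 1.0, 2.0, 1.0, 0.0])     # e.g. v1-v2
y  = np.array([1.9, 2.0, 2.1, 2.0, 2.2, 2.3])     # e.g. logit[p]  (made-up numbers)

X = np.column_stack([np.ones_like(x1), x1, x2])   # unit vector gives the intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # OLS solution to y = X b + e
print(b)   # [intercept, x1 coefficient adjusted for x2, x2 coefficient adjusted for x1]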

djh

unread,
Apr 1, 2012, 11:39:00 AM4/1/12
to koo...@sfu.ca, am...@psu.edu, jrfr...@princeton.edu, marvin.s...@gmail.com
Thanks very much for providing the definition of the y-intercept you
want me to use, and also for the advice re the values of n0 and n1 to
exclude.

Also, I've checked the length 30 data for x1=3.5, x2 = 3, and it's
correct: it comes from underlying input/output pairs of 7252 and 574
for (2,5) and (5,2) in the "square" matrix for len 30. I can explain
why such an anomaly can arise, but don't want to take the time right
now.


Rather, I want to concentrate on what you wrote here:

"4. A small p-value means only that an estimated coefficient as big
as you got would be very unusual if the true value were zero. It
does not mean that the model is correct or is accounting well for
the data. In fact, the 3-D plots of p against (x1,x2) do not look
any more logistic than the 2-D plots of p against x1 did. There is
definitely a whole lot going on that is not being modeled."

As I mentioned earlier, the project's fundamental goal is NOT to
correctly model the entire range of behavior of Dr. Lesk's structural
alignment program, but RATHER to show that one small but "significant"
factor in this behavior may in fact be due to an evolutionary
"residue" which has persisted "from the beginning" (again, like the
radiation left over from the big-bang.)

Keeping this goal in mind, consider the RSquare scores which Excel
reports for the linear regressions for lengths 20-80 reported in my
last email for logit[p]:

|--Logistic Regr------|Lin Reg on logit[p]|
Length Coeff. StdErr p RSquare
20 -0.0341 0.0039 0.0000 0.171554638
30 0.0778 0.0046 0.0000 0.215528661
40 -0.0159 0.0052 0.0021 0.195911439
50 -0.0684 0.0066 0.0000 0.166950785
60 0.1649 0.0093 0.0000 0.286276399
70 -0.1698 0.0108 0.0000 0.091041918
80 -0.1816 0.0120 0.0000 0.033825749

Suppose for the sake of argument that these RSquare's are reflective
of the degree of "meaningful" data coverage we're getting in both the
logistic regressions and the corresponding linear regressions on
logit[p].

Then:

1) Would it not be legitimate for us to claim, based on these
RSquares, that the proposed "evolutionary" residue can be seen in 29%
of the data at length 60 and 22% of the data at length 30?

2) If so, wouldn't the actual detail data "covered" within this 29% or
22% be those underlying detail data pairs for which logit[p] is within
1 or 2 sd of the value predicted by the equation of the linear
regression on logit[p], i.e. the equation with the slope and y-
intercept constructed as you've shown me?


If your answers to (1) and (2) are "yes" (in principle), could you
tell me whether Excel's RSquares are "OK", or if you want me to
compute, and if so, how ?

djh

unread,
Apr 1, 2012, 2:23:35 PM4/1/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com
Ray -
I didn't want to leave you with the impression that we can't elaborate
the quantized v parameter which we use in (v1+v2)/2 and v1-v2. In
fact, we can elaborate this parameter v considerably, for reasons I
won't detail right now, and the result would be several additional
parameters as follows:

i) a "count" parameter indicating how many total elements figure into
a given value of "v", where these elements are drawn from a fixed set
of 63

ii) a second "count" parameter indicating how many of these total
elements are "hydrophobic" in chemical terms, i.e. give rise to amino
acids that don't like water;

iii) a third "count" parameter indicating how many of these total
elements are "hydrophilic" in chemical terms, i.e. give rise to amino
acids that do like water.

(Note that (ii) and (iii) could conceivably be collapsed into an input
yes/no parameter rather than two count parameters. Also, note that i-
iii would be computed at the detail level, and then averaged up at the
"bin" level, just like v1 and v2 are now.)

So our logistic model could "readily" be elaborated into one with
either four or five total predictor variables.

But if this elaboration were done, however, it would again be with the
goal of seeing how our "partial" model gets better, not with the goal
of obtaining a complete characterization of the behavior of Dr. Lesk's
program. (By "partial" model, I mean a model that seems to indicate a
"residual evolutionary factor" at work in the behavior of Dr.Lesk's
program, as per my post.)

And furthermore, if you can in "good statistical conscience" green-
light a paper without this elaboration, based just on the data
coverage we get using just (v1+v2)/2 and v1-v2, then I would prefer to
do the elaboration on grant money that we have been assured we can get
if we can get just one initial paper accepted for publication.

I'm being frank here because everyone on the project has been working
pro bono since 1999, when we had our last grant of $400K from the
Army.

djh

unread,
Apr 1, 2012, 5:45:59 PM4/1/12
to koo...@sfu.ca, am...@psu.edu, jrf...@princeton.edu, marvin.s...@gmail.com
I also wanted to mention that there is a way we can directly tune the
count of output pairs that we use in computing y=p="yield" within the
"partial" ((v1+v2)/2,v1-v2) logistic model.

When Dr. Lesk's program is given an input pair of roughly the same
lengths, say 33-42, it will try to find something similar in this pair
even if what's similar only spans a length of 8 (or 11 or 25 or
whatever) within each member of the input pair.

I capture these "span" numbers for each output pair, and therefore can
compute the average "span" per output pair. Hence, I can get the mean
span and sd for a given run, and therefore, I can exclude from the
output count any output pair whose span is > 1 sd from the mean. (You
may wonder why the span numbers can differ between the two members of
a given output pair ... this has to do with something called "gapping"
which I won't go into now, except to say that it arises from the fact
that evolution can insert and delete subsequences from a given
subsequence ...)
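
(Mechanically, the exclusion is just a one-pass filter on the recorded
spans -- a minimal sketch, with made-up span values:)

import numpy as np

spans = np.array([8, 11, 25, 14, 16, 15, 13, 30, 9, 17], dtype=float)  # made-up per-pair average spans
m, s = spans.mean(), spans.std(ddof=1)        # mean and sd of average span for the run
keep = (spans >= m - s) & (spans <= m + s)    # count only output pairs within 1 sd of the mean
print(m, s, spans[keep])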

From a scientific point of view, Dr. Lesk has already OK'd the
restriction of "counted" output pairs to those lying within an sd of
the mean, because we know exactly what the outliers are on either
side: a) "to the left" of the mean, the outliers are those where Dr.
Lesk's program has just so happened to find something similar, quite
possibly by chance; b) "to the right" of the mean, the outliers are
those where Dr. Lesk's program has output a pair whose members are
evolutionary very close and therefore very similar.

So - once I complete lengths 90, 100, and 110 for the globins, I will
go back and rerun lengths 20-110 using this mean+1sd window for
"span". The results should be very instructive, and I hope will tell
you a lot.

Ray Koopman

unread,
Apr 3, 2012, 4:26:27 AM4/3/12
to
On Apr 1, 8:39 am, djh <halitsk...@att.net> wrote:
> Thanks very much for the providing the definition of y-intercept
> you want me to use and also for the advice re the values of n0 n1
> to exclude.
>
> Also, I've checked the length 30 data for x1=3.5, x2 = 3, and it's
> correct: it comes from underlying input/output pairs of 7252 and 574
> for (2,5) and (5,2) in the "square" matrix for len 30. I can explain
> why such an anomaly can arise, but don't want to take the time right
> now.
>
> Rather, I want to concentrate on what you wrote here:
>
> "4. A small p-value means only that an estimated coefficient as big
> as you got would be very unusual if the true value were zero. It
> does not mean that the model is correct or is accounting well for
> the data. In fact, the 3-D plots of p against (x1,x2) do not look
> any more logistic than the 2-D plots of p against x1 did. There is
> definitely a whole lot going on that is not being modeled."
>
> As I mentioned earlier, the project's fundamental goal is NOT to
> correctly model the entire range of behavior of Dr. Lesk's structural
> alignment program, but RATHER to show that one small but "significant"
> factor in this behavior may in fact be due to an evolutionary
> "residue" which has persisted "from the beginning" (again, like the
> radiation left over from the big-bang.)

Reviewers for the (psychology) journals with which I am familiar
would want some empirical assurance that the small but consistently
significant coefficients were not due to predictors that are known or
suspected to matter but were omitted, such as those mentioned in the
two posts immediately following the one to which I am replying. The
general rule is that the more radical the conclusion you are trying
to draw, the more evidence you need to present that yours is the only
reasonable explanation.

>
> Keeping this goal in mind, consider the RSquare scores which Excel
> reports for the linear regressions for lengths 20-80 reported in my
> last email for logit[p]:
>
> |--Logistic Regr------|Lin Reg on logit[p]|
> Length Coeff. StdErr p RSquare
> 20 -0.0341 0.0039 0.0000 0.171554638
> 30 0.0778 0.0046 0.0000 0.215528661
> 40 -0.0159 0.0052 0.0021 0.195911439
> 50 -0.0684 0.0066 0.0000 0.166950785
> 60 0.1649 0.0093 0.0000 0.286276399
> 70 -0.1698 0.0108 0.0000 0.091041918
> 80 -0.1816 0.0120 0.0000 0.033825749
>
> Suppose for the sake of argument that these RSquare's are reflective
> of the degree of "meaningful" data coverage we're getting in both the
> logistic regressions and the corresponding linear regressions on
> logit[p].
>
> Then:
>
> 1) Would it not be legitimate for us to claim, based on these
> RSquares, that the proposed "evolutionary" residue can be seen in
> 29% of the data at length 60 and 22% of the data at length 30?
>
> 2) If so, wouldn't the actual detail data "covered" within this 29%
> or 22% be those underlying detail data pairs for which logit[p]
> is within 1 or 2 sd of the value predicted by the equation of the
> linear regression on logit[p], i.e. the equation with the slope and
> y-intercept constructed as you've shown me?

R^2 is a summary measure that does not apply to any particular value
or subset of values. It is a "proportion of variance". The observed
data vector is partitioned into two orthogonal components: a vector
of model-produced values, and a vector of residuals. Then

      variance of the model-produced values
R^2 = ------------------------------------- .
      variance of the observed data values

In geometric terms, you're projecting the centered data vector into
the space spanned by the centered predictors, and calculating the
square of the ratio of the length of the projected vector divided by
the length of the data vector. ("Centering" here means subtracting
the means, so that everything is orthogonal to the unit vector.)
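
(Numerically, for a fit that includes an intercept, that ratio can be
computed directly from the model-produced values -- a minimal sketch
with made-up numbers:)

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # made-up predictor
y = np.array([2.1, 1.9, 2.4, 2.0, 2.6])       # made-up observed logits

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b                                  # model-produced values

r2 = y_hat.var() / y.var()                     # proportion of variance, as defined above
print(r2, np.corrcoef(x, y)[0, 1] ** 2)        # same number as the squared correlation here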

>
> If your answers to (1) and (2) are "yes" (in principle), could you
> tell me whether Excel's RSquares are "OK", or if you want me to
> compute, and if so, how ?

The R^2 values you give for the logits appear to be the squares
of the ordinary correlations between x1 and the sample logits. A
weighted correlation may be more appropriate, with the weights being
the point of argument. To be consistent with the maximum likelihood
logistic regression, the weights should be 1/(1/n0+1/n1) = n*p(1-p),
as in my post of Mar 22, 11:44 pm. Whatever the weights may be,
here is how you do weighted correlation and regression:

sum of weights:        W = sum w

weighted means:        mx = sum w*x / W
                       my = sum w*y / W

weighted variances:    vx = sum w(x - mx)^2 / W
                       vy = sum w(y - my)^2 / W

weighted covariance:   cxy = sum w(x - mx)(y - my) / W

weighted correlation:  rxy = cxy / sqrt[vx*vy]

weighted regression of y on x:  slope = cxy / vx
                                intercept = my - slope*mx
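
Those formulas transcribe directly into code; here is a sketch in
Python (the weights shown are the n*p(1-p) weights suggested above,
and the data arrays are placeholders):

import numpy as np

def weighted_corr_regression(x, y, w):
    # weighted correlation of x and y, and weighted regression of y on x
    W = w.sum()
    mx = (w * x).sum() / W
    my = (w * y).sum() / W
    vx = (w * (x - mx)**2).sum() / W
    vy = (w * (y - my)**2).sum() / W
    cxy = (w * (x - mx) * (y - my)).sum() / W
    rxy = cxy / np.sqrt(vx * vy)
    slope = cxy / vx
    intercept = my - slope * mx
    return rxy, slope, intercept

# placeholder cell counts; in practice n0, n1, x1 come from the data
n0 = np.array([153., 120., 95., 80., 60.])
n1 = np.array([10438., 9800., 9100., 8500., 7900.])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p = n1 / (n0 + n1)
w = 1.0 / (1.0/n0 + 1.0/n1)              # = n*p(1-p)
r, b, a = weighted_corr_regression(x1, np.log(p/(1-p)), w)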

For weighted multiple regression, follow my post of Apr 1, 2:02 am,
but first premultiply both X and y by a diagonal matrix containing
the square roots of the weights.

Caveat: Those equations are algebraically correct but not suitable
for numerical computation with ill-conditioned data. Failure to
recognize this is partly responsible for Excel's dismal reputation.
Your data seem reasonably well-conditioned so far, but that may
change as you add predictors. Ask a numerical analyst how best
to deal with the problem. If you don't know one, try asking in
sci.math.num-analysis .
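
In Python, the premultiplication step for the multiple-predictor case
can be written as below; numpy's lstsq uses an orthogonal (SVD-based)
solver rather than the normal equations, which is one reasonable way
to sidestep the conditioning problem (this is a sketch, not a
prescription):

import numpy as np

def weighted_lstsq(X, y, w):
    # premultiply X (2-D design matrix) and y by diag(sqrt(w)),
    # then solve by least squares
    sw = np.sqrt(w)
    Xw = X * sw[:, None]      # same as diag(sqrt(w)) @ X
    yw = y * sw
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

# X should carry a column of ones if an intercept is wanted, e.g.
# X = np.column_stack([np.ones_like(x1), x1, x2])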

With very large sample sizes and the weights mentioned above, the
weighted slope & intercept should approximate the slope & intercept
given by logistic regression.

djh

unread,
Apr 3, 2012, 6:04:51 AM4/3/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com
My cup runneth over, Ray, really. Again, your intellectual "caritas"
is inspiring - I use the Latin term here to signify something perhaps
even more than "generosity" and "kindness".

Your post above is really the one of which I should send framed copies
to Jacques and Arthur, inasmuch as it not only holds out the
possibility that we MIGHT be able to make a reasonable statistical
argument for our claims, but also, once again, lays out exactly what
our remaining plan of attack should be.

In this regard, here is what I will do regarding what you wrote here:

"Reviewers for the (psychology) journals with which I am familiar
would want some empirical assurance that the small but consistently
significant coefficients were not due to predictors that are known or
suspected to matter but were omitted, such as those mentioned in the
two posts immediately following the one to which I am replying. "

1.
First, I will definitely go back and do what I mentioned in my second
"follow-up" post, namely - rerun the analysis counting only output
pairs whose average "spans" fall within 1 sd of the mean average
"span" for the run (where "run" is defined by a triple of values for
(length, v1, v2)).

If this reanalysis yields same or better logistic correlations than
those you've seen already, then this result might well help with
potential referees, because it would lessen the possible effect of
"chance" involving low outliers and "expected" results involving high
outliers.

On the other hand, if this reanalysis kills the current logistic
correlations outright, well, that's the way the cookie crumbles. At
least we'll know what's what and that we did an honest day's work.

2.
Assuming the reanalysis in (1) does not kill our current logistic
correlations, I will run the weighted LINEAR correlations on logit[p]
according to the instructions you've provided.

In this regard, I am interpreting your last post as saying that if we
run the weighted linear correlations on logit[p] rather than the
ordinary linear correlations, then you would be able to make a more
informed decision as to whether we could use the resulting R^2's as a
crude indicator of the extent to which there might be an "evolutionary
factor" influencing the behavior of Dr. Lesk's program. (But if I'm
misinterpreting you incorrectly here, please clarify.)

3.
Regarding what you wrote here:

"The general rule is that the more radical the conclusion you are
trying to draw, the more evidence you need to present that yours is
the only reasonable explanation. "

I am pleased to be able to tell you that if the analysis is still
viable after (1) and (2) are completed, then your point here should be
at least partially addressed by our main control protocol (which I
haven't yet discussed at all). Dr. Lesk is known throughout the
community as being extraordinarily punctilious and severe regarding
controls, but even he has agreed that we have a very elegant control
whose results will be very difficult to disparage (whether good for us
or bad for us.)

4.
Finally, please note that before reporting here the results of the
reanalysis in (1) above, I have to re-calculate the current logistic
correlations for lengths 20-80 as well as the new ones for 90-110.
This is because I forgot what you told me to do when p = 1 and 1-p =
0, so I was deleting cases when output=input, instead of adding the .5
and 1 in these cases (which is what you suggested in your post of
7:21pm Mar 27 as a lousy but workable solution for the "1" cases as
well as the "0" cases.) This shouldn't make a whole lot of
difference, but I want to make sure the current data are in the best
possible shape before proceeding with (1) above, particularly because
it won't take me but a few minutes to re-calculate the data for
lengths 20-110 and re-report the results here.

5. You can expect a note from Jacques toward the end of the week (he
has been travelling and then catching up with the regular affairs of
his lab.)

Ray Koopman

unread,
Apr 6, 2012, 3:37:22 AM4/6/12
to
On Apr 3, 3:04 am, djh <halitsk...@att.net> wrote:
> [...] here is what I will do regarding what you wrote here:
>
> "Reviewers for the (psychology) journals with which I am familiar
> would want some empirical assurance that the small but consistently
> significant coefficients were not due to predictors that are known or
> suspected to matter but were omitted, such as those mentioned in the
> two posts immediately following the one to which I am replying. "
>
> 1.
> First, I will definitely go back and do what I mentioned in my second
> "follow-up" post, namely - rerun the analysis counting only output
> pairs whose average "spans" fall within 1 sd of mean average "span"
> for the run (where "run" is defined by a triple of values for
> (length, v1, v2).

I'm not sure what you're proposing, but it sounds like you intend to
ignore data points that don't fit the model. If so, then don't waste
your time. What you really need to do is expand the model to fit the
data better. Coefficients, standard errors, and p-values from models
that fit poorly should not be taken seriously.

One thing that I have not mentioned is that you can compare your
model to a saturated model -- one that has as many parameters as
there are cells, and therefore fits the cell proportions perfectly.
John's program gives a quantity labeled "-2 Log Likelihood". You
want its "Converged" value. (But beware -- the printed values have
been truncated ON THE LEFT. For instance, the Length 20 value is
08060.8373, which should read 208060.8373. You should write John
about this.) The corresponding value for the saturated model is
-2 sum {n0*Log[n0/(n0+n1)] + n1*Log[n1/(n0+n1)]} = 207583.0602.
The difference between the two has a chi-square distribution with
df = the difference between the number of parameters in the two
models. A big chi-square, with a small p-value, indicates a
significant lack of fit of the smaller model. All your models
show hugely significant lack of fit.
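
In code, the comparison looks like this (a sketch; n0 and n1 are the
per-cell counts, and the model's converged -2 log likelihood and
parameter count are supplied by the user):

import numpy as np
from scipy.stats import chi2

def lack_of_fit_test(n0, n1, minus2LL_model, n_params_model):
    # chi-square comparison of a fitted logistic model to the saturated model
    n0 = np.asarray(n0, float)
    n1 = np.asarray(n1, float)
    n = n0 + n1
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = (np.where(n0 > 0, n0 * np.log(n0 / n), 0.0)
                 + np.where(n1 > 0, n1 * np.log(n1 / n), 0.0))
    minus2LL_sat = -2.0 * terms.sum()      # saturated model: fits the cells perfectly
    chisq = minus2LL_model - minus2LL_sat
    df = len(n0) - n_params_model          # number of cells minus fitted parameters
    return chisq, df, chi2.sf(chisq, df)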

It would be nice if those comparisons with the saturated model were
nonsignificant, but that may be asking too much, because your sample
sizes are so large that significance is too easy to get. What you
need is a tolerable level of misfit to all the sample proportions.
Throwing out the cells that the model doesn't fit is not the answer.
How much misfit you can tolerate is a subjective call. You're
dealing with what is sometimes referred to as "practical
significance", as opposed to "statistical significance".

In an earlier post (8:39 am Apr 1), you said

> ... the project's fundamental goal is NOT to correctly model
> the entire range of behavior of Dr. Lesk's structural alignment
> program, but RATHER to show that one small but "significant"
> factor in this behavior may in fact be due to an evolutionary
> "residue" which has persisted "from the beginning" (again, like
> the radiation left over from the big-bang.)

The problem is that in order to claim significance, the model
must account for the data, and there must be no omitted variables
that might change the picture if they were included.

>
> If this reanalysis yields same or better logistic correlations than
> those you've seen already, then this result might well help with
> potential referees, because it would lessen the possible effect of
> "chance" involving low outliers and "expected" results involving
> high outliers.
>
> On the other hand, if this reanalysis kills the current logistic
> correlations outright, well, that's the way the cookie crumbles. At
> least we'll know what's what and that we did an honest day's work.
>
> 2.
> Assuming the reanalysis in (1) does not kill our current logistic
> correlations, I will run the weighted LINEAR correlations on
> logit[p] according to the instructions you've provided.
>
> In this regard, I am interpreting your last post as saying that if
> we run the weighted linear correlations on logit[p] rather than the
> ordinary linear correlations, then you would be able to make a more
> informed decision as to whether we could use the resulting R^2's as a
> crude indicator of the extent to which there might be an "evolutionary
> factor" influencing the behavior of Dr. Lesk's program. (But if I'm
> misinterpreting you incorrectly here, please clarify.)

I see no serious role for linear regression analyses of the logits as
a substitute for logistic regression, especially for expanded models
that will necessarily have smaller cell sizes. You need to do actual
logistic regression, and if the cell sizes start getting small or
you get quasi-separation then you will need to do the reduced-bias
version.
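
For what it's worth, ordinary grouped-count logistic regression can
also be run outside John's calculator; here is a sketch using the
statsmodels package in Python (the cell counts and predictors below
are made up, and the reduced-bias (Firth) variant would need a
separate package):

import numpy as np
import statsmodels.api as sm

# made-up per-cell data: two predictors and counts of Y=1 (n1), Y=0 (n0)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
n1 = np.array([90., 85., 80., 75., 70., 65.])
n0 = np.array([10., 15., 20., 25., 30., 35.])

X = sm.add_constant(np.column_stack([x1, x2]))    # intercept + two predictors
fit = sm.GLM(np.column_stack([n1, n0]), X,
             family=sm.families.Binomial()).fit()

print(fit.summary())      # coefficients, standard errors, z, p
print(-2 * fit.llf)       # -2 log likelihood of the fitted model
print(fit.deviance)       # chi-square of this model against the saturated model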

djh

unread,
Apr 6, 2012, 10:38:08 AM4/6/12
to koo...@sfu.ca
Thanks for the advice - I am of course very grateful to get it even
though it indicates I have a lot more work lying ahead of me (which
may well also come to naught.)

As you have advised, I will proceed to expand the logistic model by
adding one possibly important predictor which I have so far
suppressed. (For details of this expansion, see the section
"Expanding the Model" at the end of this post.)

But before I do this, I have to ask you one important question
regarding when it is or isn't legitimate to take only data within one
sd of a mean, as opposed to "all" the data in a given cell. The
reason I'm asking this question is because in an earlier post, you
seemed to indicate that this would be OK to do, whereas in your last
post here, you seem to be indicating that it's not.

To see exactly what I mean when I say "taking data within one sd of
the mean", consider two sets of output counts O1 and O2 for the case
when Length = 20 and v1 = 0.


 v2   Input     O1     O2
  0    2080   1769   1340
  1    7150   6347   4802
  2    9685   8563   6247
  3    9425   8321   6237
  4    6370   5233   4014
  5    4615   4067   3153
  6    2535   2269   1706
  7    1755   1549   1156
  8    1300   1201    910
  9    1495   1363   1000


In the first row, the O1 count of 1769 reflects all the output pairs
for (Length = 20, v1= 0, v2 = 0). For each of these 1769 output
pairs, I can compute the value of a "span" variable "S", and get the
mean and sd of this variable S for the total population of 1769. And
if I then exclude all output pairs within the 1769 that fall outside
one sd of the mean of S on either side, I get the reduced "O2" output
count of 1340.
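
In code, the reduction is simply the following (a Python sketch; the
span values below are random placeholders standing in for the 1769
pairs in that cell):

import numpy as np

S = np.random.normal(10.0, 2.0, size=1769)   # placeholder span values for one cell

m, sd = S.mean(), S.std()
keep = (S >= m - sd) & (S <= m + sd)         # pairs within 1 sd of the mean span
O2 = keep.sum()                              # the reduced output count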

I thought that in an earlier post, you said it would be OK to use the
1340 instead of the 1769, particularly because it makes good
scientific sense to do so.

Furthermore, you can see that by choosing the reduced count of 1340
instead of the total count of 1769, I am not "ignoring" data that
doesn't fit the model, because I haven't yet run any model, so I have
no idea whether taking 1340 instead of 1769 will improve fit or
not.

So, when you have a chance, please clarify whether taking the 1340
instead of the 1769 is legitimate in this kind of situation. If so, I
would prefer to do this when trying to expand the model.

"Expansion of the model"

Presently, the model uses a length variable L which can assume 10
values that I denote by 20 thru 110 for the sake of convenience (20 =
13-22, 30 = 23-32, etc.).

But within each value Li of length, a second variable C can assume
certain values from 3 to 28, and in fact, the value of C plays a
significant role in determining the value of the unquantized variable
u (and therefore the quantized variable v.)

So I can expand the model quite readily by adding the two predictors
Li and Cj to the current (v1+v2)/2 and (v1-v2) predictors.

And inasmuch as different values of Cj can quite possibly affect the
behavior of Dr. Lesk's program for a given value of Li, I will start
re-analyzing the data using a four-predictor model
(Li, Cj, (v1+v2)/2, v1-v2).
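
In code terms the expansion amounts to two more columns in the design
matrix (a sketch with placeholder values):

import numpy as np

# placeholder per-case values
L  = np.array([20., 20., 30., 30., 40., 40.])   # length class Li
C  = np.array([ 3.,  4.,  3.,  5.,  4.,  6.])   # count variable Cj
v1 = np.array([ 0.,  1.,  2.,  3.,  4.,  5.])
v2 = np.array([ 1.,  2.,  3.,  4.,  5.,  6.])

# four-predictor design matrix: Li, Cj, (v1+v2)/2, v1-v2
X = np.column_stack([L, C, (v1 + v2) / 2.0, v1 - v2])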

Thanks so much again for your patience, Ray.

djh

unread,
Apr 6, 2012, 11:09:43 PM4/6/12
to koo...@sfu.ca
Below are two sets of results from John's calculator.

The first set you've seen before - they're the "length 20" results
using the two predictors (v1+v2)/2 and v1-v2.

The second set of results is also based on a matrix for length 20
using the same two predictors.

The difference is that in this case, I used only data exhibiting a
value of 3 for the new "C" variable which I mentioned in my last post,
rather than allowing C to be any of the values 2 thru 6 which it
actually assumes in the length 20 data. (In other words, I have not
yet incorporated "C" into the model as a third predictor - right now I
am merely using the original two predictors but on data in which C is
constant - just to see what happens.)

As you can see from the two sets of results:

a) the chi-square drops from 483.0906 in the first set of results to
12.6410 in the second set, roughly a 40-fold decrease;
b) the p-value associated with the chi-square goes from < 0.0001 to
0.0018.

Does this drop in chi-square signify a step in the right direction, or
is 12.6410 still a relatively "big" chi-square? We would of course
like you to say that this drop in chi-square is a step in the right
direction, because we can think of "C" as a scalar on our unquantized
variable u that underlies v.

Thanks for any time you can afford to spend considering this question,
and again, please forgive my abysmal ignorance about these matters.

I will proceed to do the same analysis with C held constant at 4, at
5, and at 6, in the hope that this will provide you more information
from which to draw conclusions.

**************************************************************************
Original results for length 20
*************************************************************************
Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 3.1861 1.5601
2 2.4325 1.9607

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 07579.6536
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873 (Converged)

Overall Model Fit...
Chi Square= 483.0906; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0341 0.0039 0.0000
2 0.0715 0.0033 0.0000
Intercept 1.9027

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9665 0.9591 0.9740

2 1.0742 1.0673 1.0811

Results for Length 20 with C = 3.
*********************************************************
Descriptives...

153 cases have Y=0; 10438 cases have Y=1.

Variable Avg SD
1 3.7039 1.7354
2 2.7674 2.1437

Iteration History...
-2 Log Likelihood = 1600.3995 (Null Model)
-2 Log Likelihood = 1588.0377
-2 Log Likelihood = 1587.7588
-2 Log Likelihood = 1587.7586
-2 Log Likelihood = 1587.7586 (Converged)

Overall Model Fit...
Chi Square= 12.6410; df=2; p= 0.0018

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0907 0.0522 0.0821
2 0.0967 0.0456 0.0341

Intercept 3.6619

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0950 0.9885 1.2129

2 1.1015 1.0073 1.2046

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 0 26 0.9750
0.5000 1.0000 0 163 0.9782
1.0000 0.0000 1 195 0.9771
1.0000 2.0000 3 198 0.9810
1.5000 1.0000 12 505 0.9801
1.5000 3.0000 1 216 0.9835
2.0000 0.0000 15 299 0.9790
2.0000 2.0000 9 561 0.9827
2.0000 4.0000 1 147 0.9857
2.5000 1.0000 26 690 0.9818
2.5000 3.0000 2 386 0.9849
2.5000 5.0000 0 104 0.9875
3.0000 0.0000 12 360 0.9808
3.0000 2.0000 9 462 0.9841
3.0000 4.0000 2 264 0.9869
3.0000 6.0000 0 62 0.9892
3.5000 1.0000 6 521 0.9833
3.5000 3.0000 5 329 0.9862
3.5000 5.0000 1 167 0.9886
3.5000 7.0000 0 47 0.9906
4.0000 0.0000 1 159 0.9824
4.0000 2.0000 4 354 0.9855
4.0000 4.0000 1 207 0.9880
4.0000 6.0000 0 119 0.9901
4.0000 8.0000 1 89 0.9918
4.5000 1.0000 0 245 0.9847
4.5000 3.0000 1 222 0.9874
4.5000 5.0000 6 142 0.9896
4.5000 7.0000 2 248 0.9914
4.5000 9.0000 0 53 0.9929
5.0000 0.0000 2 74 0.9839
5.0000 2.0000 1 149 0.9867
5.0000 4.0000 4 160 0.9890
5.0000 6.0000 2 305 0.9909
5.0000 8.0000 2 138 0.9925
5.5000 1.0000 0 103 0.9860
5.5000 3.0000 1 106 0.9885
5.5000 5.0000 3 328 0.9905
5.5000 7.0000 6 170 0.9921
6.0000 0.0000 0 27 0.9853
6.0000 2.0000 0 76 0.9879
6.0000 4.0000 2 222 0.9900
6.0000 6.0000 4 187 0.9917
6.5000 1.0000 0 48 0.9872
6.5000 3.0000 0 152 0.9895
6.5000 5.0000 0 128 0.9913
7.0000 0.0000 0 15 0.9866
7.0000 2.0000 0 93 0.9889
7.0000 4.0000 2 87 0.9908
7.5000 1.0000 0 71 0.9883
7.5000 3.0000 0 55 0.9904
8.0000 0.0000 0 65 0.9877
8.0000 2.0000 2 38 0.9899
8.5000 1.0000 1 80 0.9893
9.0000 0.0000 0 21 0.9888


Ray Koopman

unread,
Apr 7, 2012, 2:07:33 AM4/7/12
to
This is just a quick note to sort out the chi-squares.
I'll have more to say later.

The chi-square I described in my previous post compares the logistic
model to the saturated model, that fits the data perfectly.
The bigger the chi-square (and the smaller the p), the more certain
you are that the logistic model is worse than the saturated model.

The chi-square in John's output compares the logistic model to the
"null" model, that gives all the cells the same probability.
The bigger the chi-square (and the smaller the p), the more certain
you are that the logistic model is better than the null model.

So you want my chi-square to be small and nonsignificant,
and John's to be big and significant.

For another take on how well the model fits, plot the "Calc Prob"
in John's output as a function of the sample proportion, n1/(n0+n1).
Ideally, the two should be equal.
How far is the plot from a 45 degree line?
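
Something like this would do (a Python sketch; the arrays here are
just the first few rows of the table above):

import numpy as np
import matplotlib.pyplot as plt

# n0, n1, Calc Prob taken from the first few rows of John's output above
n0 = np.array([0., 1., 12., 15., 9., 26.])
n1 = np.array([26., 195., 505., 299., 561., 690.])
calc_prob = np.array([0.9750, 0.9771, 0.9801, 0.9790, 0.9827, 0.9818])

obs = n1 / (n0 + n1)                       # sample proportion in each cell
plt.scatter(obs, calc_prob)
lo, hi = obs.min(), obs.max()
plt.plot([lo, hi], [lo, hi])               # 45-degree reference line
plt.xlabel('observed proportion n1/(n0+n1)')
plt.ylabel('model Calc Prob')
plt.show()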

djh

unread,
Apr 7, 2012, 9:19:55 PM4/7/12
to koo...@sfu.ca
I think I see what has to be done mathematically, but don't know how
to express it statistically. I think it boils down to a
four-predictor logistic model, but I'm not sure.

Each input pair can not only be characterized by two values (vi,vj) of
our current quantized variable v, but also by two values (ci,cj),
where c is the new "count" variable I mentioned in my last couple of
posts. And similarly for each output pair.

So mathematically, our upper-triangular 10x10 matrix V = (vi,vj) is
actually a matrix of matrices, i.e. each cell (vi,vj) of V is itself
a matrix Cij = (cijp, cijq), where p and q are values of C.

I think this boils down to constructing a four-predictor model with
predictors vi,vj,cp,cq, but again, am not sure.

If/when you have a chance, please advise whether I'm looking at this
correctly. If so, I'll have to write a little more PERL to generate
the numbers, but I think it may be worth it. I'm in the process of
hand-tallying a couple of Cij matrices to see what they look like, and
they really seem to be quite orderly ...

One other question - would it be preferable to run some two-predictor
models using just cp,cq before trying to combine these predictors with
the vi,vj predictors? Or do you always have to run all relevant
variables "together"?

Ray Koopman

unread,
Apr 8, 2012, 2:43:43 AM4/8/12
to
On Apr 6, 7:38 am, djh <halitsk...@att.net> wrote:
> Thanks for the advice - I am of course very grateful to get it
> even though it indicates I have a lot more work lying ahead of
> me (which may well also come to naught.)
>
> As you have advised, I will proceed to expand the logistic model
> by adding one possibly important predictor which I have so far
> suppressed. (For details of this expansion, see the section
> "Expanding the Model" at the end of this post.)
>
> But before I do this, I have to ask you one important question
> regarding when it is or isn't legitimate to take only data within
> one sd of a mean, as opposed to "all" the data in a given cell.
> The reason I'm asking this question is because in an earlier post,
> you seemed to indicate that this would be OK to do, whereas in
> your last post here, you seem to be indicating that it's not.

I presume you're referring to your proposal (7:03 am Mar 27) to use
only RMSDs within 1 sd of the mean, to which I replied (10:29 am Mar
27), "Yes, that sort of thing is legit, altho I don't know what I
said that would lead you to that conclusion." In general, if you want
to analyze only the middle values in a data set, ignoring extreme
values that somehow do not represent what you want to talk about,
then it's usually better to define "middle" in terms of something
other than the mean (which is sensitive to outliers) and the sd
(which is even more sensitive to outliers). The most common solution
is a "trimmed mean", which ignores a prespecified proportion of the
most extreme values. But there are many other possibilities.

I misunderstood. I thought you wanted to use only values that were
within 1 sd of the predicted value. However, my comments above about
not basing the selection on the mean and sd still apply. It would
be better to select pairs whose S-values were not in the lowest L%
or the highest H% of the S-distribution. Values of L or H up to 25%
would raise no eyebrows among data analysts.
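
A sketch of that kind of selection (here a symmetric 10% trim on each
side, purely for illustration; the span values are placeholders):

import numpy as np

S = np.random.normal(10.0, 2.0, size=1769)   # placeholder span values for one cell

lo, hi = np.percentile(S, [10, 90])          # drop the lowest 10% and highest 10%
keep = (S >= lo) & (S <= hi)
trimmed_mean = S[keep].mean()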