
Correct way to normalize an rmsd-based distance metric used in repeated trials of pairs


djh

unread,
Mar 20, 2012, 8:43:14 AM3/20/12
to
1. Suppose you select 500 pairs of necklaces with 50 beads in each
one.

2. After throwing one of these pairs up in the air and letting it land
on a desktop:

a) you find that there are just 30 pairs of beads for which: i) one
bead of the pair is from one necklace and one bead is from the other;
ii) the two beads in the pair are within 1/4 inch of each other.

b) you compute the rmsd of the distance between the two beads for each
of these 30 pairs.

3. Then you do the same thing for the remaining 499 pairs of necklaces
and compute the rmsd for each trial in which there are also just 30
pairs of beads within 1/4 inch of each other ... say there are 300 of
these cases, including your first one.

4. Then you get the average rmsd for these 300 cases.

5. Having done all the above, you repeat the same procedure on a
new set of 1000 pairs of necklaces with 50 beads in each one, and
since this results in 525 cases out of the 1000 in which just 30 beads
were within 1/4 inch of each other, you again compute the rmsd for
each of these cases and then the average rmsd over all 600.

6. Finally, you get the average rmsd for:

a) 930 cases that result from another 1500 new trials;

b) 1140 cases that result from another 2000 new trials

etc.

Is it the case that if you want to compare the average rmsd's you
get for the successive sets of 300, 525, 930, 1140, ... cases, you
somehow have to normalize these average rmsd's relative to 300, 525,
930, 1140 ... ?

And if so, do you also have to somehow take into account the
different starting sizes of 500, 1000, 1500, 2000 ... ?
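
For concreteness, here is a minimal Python sketch of the bookkeeping in
steps 1-6: per retained throw, the rmsd of exactly 30 inter-bead
distances, then the average of those rmsd's over the retained throws.
The uniform random "throw" model, the desktop size, and the function
names are assumptions made only so the sketch runs; they are not meant
to reproduce the real setup.

<begin Python sketch>
import numpy as np

rng = np.random.default_rng(0)

def throw_pair(n_beads=50, cutoff=0.25, desk=4.0):
    """One simulated 'throw' of a pair of 50-bead necklaces onto a desktop.

    Returns the distances for cross-necklace bead pairs that landed within
    `cutoff` inches of each other (each bead of necklace 2 used at most once).
    """
    a = rng.uniform(0, desk, size=(n_beads, 2))   # bead positions, necklace 1
    b = rng.uniform(0, desk, size=(n_beads, 2))   # bead positions, necklace 2
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    close, used = [], set()
    for i in range(n_beads):
        j = int(np.argmin(d[i]))                  # nearest bead of necklace 2
        if d[i, j] <= cutoff and j not in used:
            close.append(d[i, j])
            used.add(j)
    return close

def average_rmsd(n_throws, keep_exactly=30):
    """rmsd of the inter-bead distances for each retained throw, then their mean."""
    rmsds = []
    for _ in range(n_throws):
        dists = throw_pair()
        if len(dists) == keep_exactly:            # keep only throws yielding exactly 30 close pairs
            rmsds.append(np.sqrt(np.mean(np.square(dists))))
    return len(rmsds), (float(np.mean(rmsds)) if rmsds else float("nan"))

for n_throws in (500, 1000, 1500, 2000):
    n_kept, avg = average_rmsd(n_throws)
    print(n_throws, n_kept, avg)
<end Python sketch>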

Note to Ray Koopman:

This question arises from the matrix we were discussing here:

http://groups.google.com/group/sci.stat.math/browse_thread/thread/4cde6c1d44a77fd9#

After you pointed out that one of the dv's increases/decreases as you
move right/left from a reference cell, I realized that in fact, the
size of the population in each cell decreases as you move right in the
same row ....

djh

unread,
Mar 20, 2012, 8:49:59 AM3/20/12
to
Correction: the "600" in here:

Having done all the above, you repeat the same procedure on a a
new set of 1000 pairs of necklaces with 50 beads in each one, and
since this results in 525 cases out of the 1000 in which just 30
beads
were within 1/4 inch of each other, you again compute the rmsd for
each of these cases and then the average rmsd over all 600.

should have been 525.

Sorry!

djh

unread,
Mar 20, 2012, 9:51:36 AM3/20/12
to
I have started testing the relationship between sample size and
average rmsd in repeated different "real-life" scenarios that are
similar to the artificial scenario used above.

The correlation appears to hover about -0.50, from -0.57 down to -0.40
"in real life", and is never positive. That is, the more trials you
do, the smaller the rmsd.

Assuming these consistently negative but moderate correlations can be
considered "semi-significant", this tells me that the more trials one
does, the more chance there is for each pair of necklaces to fall to
the desktop in a similar way.

But I have no idea if I'm reasoning correctly here - that's why I
asked the question in general.

Art Kendall

unread,
Mar 20, 2012, 11:00:14 AM3/20/12
to
Ray Koopman has been very generous in trying to reply to your posts.

It is quite possible that you would have more advice if you were to
explain substantively what you are trying to find out.

It _very frequently_ happens that translation to analogous questions
omits relevant information. The artificial scenario often loses
important aspects of the real life scenario.

Art Kendall
Social Research Consultants

djh

unread,
Mar 20, 2012, 11:30:07 AM3/20/12
to
Thanks for the advice, Art. Since you posted here as well as
emailing, I'll post my reply here as well:

********
Thanks very much for taking the time to offer that advice, Art. It is
much appreciated.

Just so you know - I offered a collaborative co-authorship to Ray if
he would serve as the statistical expert on the team and he declined
on the grounds that we need someone "local". So his willingness to
function "pro bono" over the web to the extent that he has is even
more praiseworthy.

Maybe it's best that I take his advice and try to find a collaborating
local expert. But as I'm sure you're aware, such experts are in high
demand and very very busy, i.e. they generally can't afford the time
to come up to speed on "context". (If they weren't so busy, my
colleagues at Princeton and Penn State would probably have already
been able to find one. )

In any event, if you think I'm taking advantage of sci.stat.math (i.e.
"page-pushing", to use the on-line idiom), just let me know and I'll
cease and desist.

Best regards
djh

Art Kendall

unread,
Mar 20, 2012, 12:39:06 PM3/20/12
to
perhaps I am missing something.
> the more trials you
>> do, the smaller the rmsd.


Everything else being equal, a root mean square (aka population
standard deviation) is expected to be smaller as n increases.

This is an SPSS demo that I use in explaining standard deviations. It
should be somewhat clear even if you use other software.
<begin SPSS syntax>
* demo of size of standard deviation as n goes up.
* Given the same sum of squared distances (called deviations when one of
  the points is a mean),
* as n increases the mean square (aka population variance) goes down and
* the root mean square (aka population standard deviation) goes down.
* The sample variance and sample standard deviation use n-1 rather than
  the n that is used in the pop statistics.
input program.
loop n = 10 to 250 by 10.
end case.
end loop.
end file.
end input program.
compute sum_of_squares =10000.
compute mean_squares = sum_of_squares/n.
compute root_mean_square = sqrt(mean_squares).
var labels
sum_of_squares 'sum of squared distances (of points to mean for
deviations)'/
mean_squares 'population variance when divided by n sample variance
when divided by n-1'/
root_mean_square 'population standard deviation when divided by n
sample SD when divided by n-1'.
list.
<end syntax>

this is the output.
<begin output>




n sum_of_squares mean_squares root_mean_square

10 10000 1000 32
20 10000 500 22
30 10000 333 18
40 10000 250 16
50 10000 200 14
60 10000 167 13
70 10000 143 12
80 10000 125 11
90 10000 111 11
100 10000 100 10
110 10000 91 10
120 10000 83 9
130 10000 77 9
140 10000 71 8
150 10000 67 8
160 10000 63 8
170 10000 59 8
180 10000 56 7
190 10000 53 7
200 10000 50 7
210 10000 48 7
220 10000 45 7
230 10000 43 7
240 10000 42 6
250 10000 40 6


Number of cases read: 25 Number of cases listed: 25
<end output>


Art Kendall
Social Research Consultants

On 3/20/2012 9:51 AM, djh wrote:

Art Kendall

unread,
Mar 20, 2012, 12:48:50 PM3/20/12
to
Best of luck.

wrt "page-pushing" it is not unusual for group or list members to skip
new posts that are part of a thread.

Art Kendall
Social Research Consultants

djh

unread,
Mar 20, 2012, 2:16:52 PM3/20/12
to
I posted a detailed response but it doesn't seem to want to appear, so
I'll do a shorthand version and maybe the detailed version will
eventually appear.

Thanks for taking the time, but our scenarios are unrelated (I
apologize for not making myself clear.)

What's "going down" in my case is an average rmsd that's always
computed from a set of rmsd's that are always each computed from just
30 inter-bead distances for just 30 pairs of beads. So the population
on which the actual rmsd's are being computed is not increasing - it's
always 30.

As this "average rmsd" is "going down", what's "going up" is:

a) the number T of throws
b) the yield from the T throws, i.e. the number of throws which
yield just 30 pairs of beads lying within 1/4 inch of each other.

djh

unread,
Mar 20, 2012, 1:23:06 PM3/20/12
to
Thanks very much for taking the time - I'm sorry I didn't make myself
clear.

After any given "throw", the "rmsd" is the rmsd of the inter-bead
distances for the 30 pairs of beads that lie within 1/4 inch of each
other in that trial. (If the "throw" doesn't yield just 30 such
pairs, I ignore the "throw".)

So, when I do 500 "throws", I get 300 in which there are just 30 of
such pairs, and my "average rmsd" is the average of the rmsd's
computed for each of these 300, i.e. an average of 300 individual
rmsds. (Note that each rmsd that figures into this average rmsd was
computed for just 30 pairs.)

And when I do another 1000 "throws", I get 525 in which there are just
30 of such pairs, and my "average rmsd" is the average of the rmsd's
computed for each of these 525, i.e. an average of 525 individual
rmsds. (Again, note that each rmsd that figures into this average
rmsd was computed for just 30 pairs.)

And when I do another 1500 "throws", I get 930 in which there are just
30 of such pairs, and my "average rmsd" is the average of the rmsd's
computed for each of these 930, i.e. an average of 930 individual
rmsds. (Again, note that each rmsd that figures into this average
rmsd was computed for just 30 pairs.)

And when I do another 2000 "throws", I get 1140 in which there are
just 30 of such pairs, and my "average rmsd" is the average of the
rmsd's computed for each of these 1140, i.e. an average of 1140
individual rmsds. (Again, note that each rmsd that figures into this
average rmsd was computed for just 30 pairs.)

And what I was trying to say is that as my "yields" go up from 525 to
930 to 1140, my "average rmsd's" go down, even though each of these
"average rmsd" was computed for a set of rmsd's in which each rmsd was
obtained from measurements on just 30 pairs. (In other words, the
"30" stays constant, even though the throws and the yields increase.)

This has nothing to do with your example, so far as I can tell,
because in every case, the rmsd is always computed for just 30
elements, not an increasing or decreasing number of elements.

What I believe is happening is that as the number of throws goes up,
the "yields" go up, and as the "yields" go up, there are simply more
ways in which 30 pairs can lie closer, so the "average rmsd's" go
down.

But assuming I'm correct about this, I'm still stuck with my original
question: how would you normalize the average rmsd's relative to the
number of throws (or number of returns in the yields from the throws),
assuming you need to for reasons I won't go into here ?

Again - thanks very much for taking the time ...

djh

unread,
Mar 20, 2012, 4:05:43 PM3/20/12
to A...@drkendall.org
Art -

Below are 55 triples (SampSize, M, (u1+u2)/2) which I just posted to
Ray (in the original "data bifurcation" thread), in order to show him
that his guess about (u1+u2)/2 was correct: i) SampSize against
(u1+u2)/2 correlates with a coefficient of -0.65, an F of 6.20E-08,
and an RSquare of 0.43; ii) M correlates against (u1+u2)/2 with a
coefficient of 0.65, an F of 7.22E-08, and an RSquare of 0.42.

I'm also posting these triples in this thread to illustrate what I
meant by "average RMSD". M is an "average" RMSD - it's computed from
188 individual RMSD's for the 188 cases in the first sample, then from
242 individual RMSD's for the 242 cases in the second sample, etc.
But all of the 188 individual RMSD's were computed for roughly the
same number of elements, and furthermore, this is roughly the same
number of elements as for the 242 individual RMSD's in the second
sample, etc..

SampSize    M            (u1+u2)/2


188 1.572862766 8
242 1.438345207 9
305 1.375536984 7
442 1.438461176 8.5
502 1.438387371 7.5
554 1.354741697 8
632 1.454691013 6
728 1.463273819 7
821 1.40514715 7.5
922 1.393855456 6.5
1201 1.423061391 4
1298 1.445902011 6.5
1363 1.380789912 4.5
1477 1.383602681 7
1549 1.326376675 3.5
1653 1.372242892 6
1697 1.462275203 6
1769 1.323715088 0
1936 1.447994484 6.5
2099 1.416260024 4.5
2148 1.406957826 5
2166 1.365713206 5.5
2269 1.427719401 3
2382 1.324469761 5
2414 1.434091798 5.5
2669 1.304284065 4
2695 1.430254267 5.5
2795 1.460848118 5
3063 1.379745439 6
3086 1.428632712 5
3187 1.391900113 5.5
3472 1.319731285 5
3573 1.344443792 4.5
3617 1.405385986 4
3909 1.397944456 3.5
4067 1.377588158 2.5
4997 1.378889606 4.5
5204 1.422537681 4
5233 1.361786878 2
5418 1.266152154 1
5677 1.39116412 4.5
6347 1.310987263 0.5
6990 1.341029097 3
8321 1.312367819 1.5
8563 1.343656653 1
9030 1.378298448 4
9056 1.334157641 2.5
9167 1.333149317 3
9385 1.393285612 3.5
9873 1.375232508 2
11688 1.362740158 3.5
12122 1.379700086 3
14349 1.297632463 2
14760 1.324303886 1.5
19218 1.351081067 2.5
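
For reference, the kind of figures quoted above can be reproduced with
scipy directly from the triples. The sketch below types in only the
first ten rows of the table, so its r, p, and RSquare values will
differ from the -0.65 / 0.65 figures, which use all 55 rows.

<begin Python sketch>
import numpy as np
from scipy import stats

# (SampSize, M, (u1+u2)/2) -- the first ten triples from the table above
samp_size = np.array([188, 242, 305, 442, 502, 554, 632, 728, 821, 922])
avg_rmsd  = np.array([1.572862766, 1.438345207, 1.375536984, 1.438461176,
                      1.438387371, 1.354741697, 1.454691013, 1.463273819,
                      1.40514715, 1.393855456])
u_mid     = np.array([8, 9, 7, 8.5, 7.5, 8, 6, 7, 7.5, 6.5])

r1, p1 = stats.pearsonr(samp_size, u_mid)   # SampSize vs (u1+u2)/2
r2, p2 = stats.pearsonr(avg_rmsd, u_mid)    # M vs (u1+u2)/2
print(f"SampSize vs (u1+u2)/2: r = {r1:.2f}, p = {p1:.2g}, R^2 = {r1*r1:.2f}")
print(f"M        vs (u1+u2)/2: r = {r2:.2f}, p = {p2:.2g}, R^2 = {r2*r2:.2f}")
<end Python sketch>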





Ray Koopman

unread,
Mar 21, 2012, 3:06:40 AM3/21/12
to
On Mar 20, 6:51 am, djh <halitsk...@att.net> wrote:
> I have started testing the relationship between sample size and
> average rmsd in repeated different "real-life" scenario that are
> similar to the artificial scenario used above.
>
> The correlation appears to hover about -0.50, from -0.57 down to
> -0.40 "in real life", and is never positive. That is, the more
> trials you do, the smaller the rmsd.
>
> Assuming these consistently negative but moderate correlations
> can be considered "semi-significant", this tells me that the more
> trials one does, the more chance there is for each pair of necklaces
> to fall to the desktop in a similar way.
>
> But I have no idea if I'm reasoning correctly here - that's why I
> asked the question in general.

The only way that could happen with the necklaces is if the
composition of the set of necklaces changes, and/or the way the
pairs are thrown changes, as a function of the size of the set.
If all the samples were drawn randomly and independently from the
same population, and all the throws were done identically, then
traditional probability theory says there should be no systematic
relation between the number of sample pairs drawn and the rmsd for
different sized sets. Sounds to me like the necklace analogy is
missing something that's going on in your real data.

djh

unread,
Mar 21, 2012, 11:12:49 AM3/21/12
to koo...@sfu.edu, a...@drkendall.org, halit...@att.net

Ray -

Thank you for confirming that no statistics-based or probability-based
null hypothesis is operative in the "necklace" example. When I posed
the question, I stated a possible null hypothesis of this type to see
if any professional at sci.stat.math WOULD agree that some such null
hypothesis is operative. I did this simply to avoid embarrassment,
because usually in cases such as this, some kind of null hypothesis IS
operative - any other hypotheses are usually "too good to be true".
And I didn’t want to appear really stupid by showing my ignorance of a
standard null hypothesis that would be relevant in this case.

But now that we agree that no null hypothesis is operative, I can tell
you your intuitions were again 100% dead-on when you just wrote:

"The only way that could happen with the necklaces is if the
composition of the set of necklaces changes . . . Sounds to me like
the necklace analogy is missing something that's going on in your real
data."

To see why I say this, look at the matrix at the end of this post,
which shows the row-level correlations of sample size and average RMSD
with the (u1+u2)/2 variable that you suggested in the other thread.
(The previous correlation shown in the post above (i.e. the
correlation on 55 total observations of samp size and average RMSD
relative to (u1+u2)/2) was derived from the values in this matrix in
the obvious way.)

The reason why these correlations CAN exist at all is because you were
absolutely right – different values of u1 and u2 DO change certain
properties of the data that are equivalent to changing the
“composition of the necklaces”. In effect:

a) As (u1+u2)/2 increases, it is more difficult to find pairs of
necklaces in the overall population such that one necklace has u = u1
and the other has u = u2; so, samp size goes down as (u1+u2)/2 goes up

b) As (u1+u2)/2 increases, the RMSD for a pair of necklaces with u =
u1 and u = u2 is going to increase because such pairs are the “odd
balls” in the system and therefore, there is less chance of them being
similar. (I realize this statement is a bit vague, but it’s the best
I can do without going into way too much biomolecular detail.)

So, as I said previously, what remains now is to construct the same
matrix as the one shown below for 239 more cases, to produce a total
of 240 cases in four sets of 60. Since three of the four sets of 60
are three different kinds of controls, we will know for sure whether
we “have something or not” by comparing the 60 matrices for the first
set to the 60 matrices for each of the other three sets. To take you
“into the kitchen” just a little bit, the reason why each set has 60
cases is because our data are from 6 different types of proteins
(globins, cytochrome-C’s, immunoglobulin-like proteins, trypsin-like
proteases, “TIM-barrel” proteins, and “NADP-binding” proteins), and for
each of these six classes we select data in a way that is equivalent
to selecting necklaces of ten different lengths: 13-22 beads, 23-32
beads, 33-42 beads, . . . 103-112 beads, for a total of ten “length
intervals” in all. (The matrix above was drawn just from the globin
proteins for length interval 13-22.)

Also, we honestly don’t know “up front” in which cases the
correlations will or will not manifest themselves. As we move from
protein type to protein type, the structural type of the proteins
changes from “mainly helical” (globins and cytochromes) to “mainly
sheet” (IG-like and trypsin-like) to “mainly helical AND sheet” (TIM-
barrel and NADP-binding). And as one progresses through the ten length
intervals specified above (13-22, . . . 103-112) , the nature of the
necklaces drawn from “mainly helical”, “mainly sheet”, and “mainly
helical+sheet” proteins will change considerably.

Finally, I don’t want to tell you what the variable u is for a simple
reason. If you told any molecular biologist that my team is trying to
relate this variable to aspects of protein structure, they would tell
you that my team is crazy. This is because on the one hand, all
molecular biologists practicing today believe that the type of
spontaneous mutation which drives natural selection is essentially
omnipresent and omnipotent to such a great extent that over a time
span of several billion years, it MUST have destroyed any traces of a
relationship that might have existed between the variable “u” and
aspects of protein structure “at the beginning”. While on the other
hand, correlations such as that shown in the matrix below indicate
that this may not be the case, i.e. that we can see traces of the
situation “at the beginning” in much the same way as physicists can
see traces of background radiation from the “big-bang”.

But again, any molecular biologist would tell you that I’m crazy for
saying so, and that my team is crazy for working with me. And in
fact, one of my teammates is very very reluctant to drag in the
evolutionary aspects of situation, inasmuch as he’s a real purist
whose attitude is: “since we weren’t there to observe, we have no
right to say.” But frankly, I don’t see any other way to make sense
of the data.

In any event, thanks so much again for working thru all this with me,
and by extension, my team. You have made an invaluable contribution
(re (u1+u2)/2) which will certainly be acknowledged in whatever paper
is finally produced. And one more thing . . . I’ve told you that the
unit of measure of RMSD is angstroms, since we’re taking the RMSD of
inter-atomic distances. But I also told you I’d check on the unit of
measure of u, and when I did so, my most senior colleague reminded me
that it's kcal/mol.

Here’s the matrix mentioned above:

Row Corr
u2 with
0 1 2 3 4 5 6 7 8 9 (u1+u2)/2
u1
0 1769 6347 8563 8321 5233 4067 2269 1549 1201 1363 -0.60
1.32 1.31 1.34 1.31 1.36 1.38 1.43 1.33 1.42 1.38 0.67

1 6347 5418 14760 14349 9056 6990 3909 2669 2099 2382 -0.60
1.31 1.27 1.32 1.30 1.33 1.34 1.40 1.30 1.42 1.32 0.56

2 8563 14760 9873 19218 12122 9385 5204 3573 2795 3187 -0.71
1.34 1.32 1.38 1.35 1.38 1.39 1.42 1.34 1.46 1.39 0.65

3 8321 14349 19218 9167 11688 9030 4997 3472 2695 3063 -0.76
1.31 1.30 1.35 1.33 1.36 1.38 1.38 1.32 1.43 1.38 0.70

4 5233 9056 12122 11688 3617 5677 3086 2166 1697 1936 -0.71
1.41 1.33 1.38 1.36 1.41 1.39 1.43 1.37 1.46 1.45 0.64

5 4067 6990 9385 9030 5677 2148 2414 1653 1298 1477 -0.73
1.38 1.34 1.39 1.38 1.39 1.41 1.43 1.37 1.45 1.38 0.52

6 2269 3909 5204 4997 3086 2414 632 922 728 821 -0.73
1.43 1.40 1.42 1.38 1.43 1.43 1.45 1.39 1.46 1.41 0.23

7 1549 2669 3573 3472 2166 1653 922 305 502 554 -0.74
1.33 1.30 1.34 1.32 1.37 1.37 1.39 1.38 1.44 1.35 0.74

8 1201 2099 2795 2695 1697 1298 728 502 188 442 -0.74
1.42 1.42 1.46 1.43 1.46 1.45 1.46 1.44 1.57 1.44 0.50

9 1363 2382 3187 3063 1936 1477 821 554 442 242 -0.75
1.38 1.32 1.39 1.38 1.39 1.38 1.41 1.35 1.44 1.44 0.64

Rich Ulrich

unread,
Mar 21, 2012, 1:33:08 PM3/21/12
to
On Tue, 20 Mar 2012 06:51:36 -0700 (PDT), djh <halit...@att.net>
wrote:

>I have started testing the relationship between sample size and
>average rmsd in repeated different "real-life" scenario that ae
>similar to the artificial scenario used above.
>
>The correlation appears to hover about -0.50, from -0.57 down to -0.40
>"in real life", and is never positive. That is, the more trials you
>do, the smaller the rmsd.
>
>Assuming these consistently negative but moderate correlations can be
>considered "semi-significant", this tells me that the more trials one
>does, the more chance there is for each pair of necklaces to fall to
>the desktop in a similar way.

Maybe I don't understand the problem, but I think you are
not properly describing the outcomes.

I grew up in farm country, and yield was always "bushels
per acre". I'm stuck with it as a rate. And that seems
appropriate here.

(300/500; 525/1000; 930/1500; 1140/2000).

As I read it, the resulting yields of 30 close bead pairs are
(60%; 52%; 62%; 57%). Apparently random variation?

It would be mysterious and weird, needing a new variable
to explain the outcome, if the necklaces and tosses are all
similar, and these *were* different yields. Is there a reason
for referring to the absolute counts?

If the yields actually did differ - like picking out all people
under a certain height/income/IQ in various populations - the
process of selecting at a cutoff is known to introduce biases
in the resulting "matched samples", both in mean and variance.
That is one reason to avoid a strategy of matching samples.

However, you don't seem to have that problem of different
yields, using the term as I would use it.

If you have already moved beyond this aspect... then pardon
my interruption. I haven't worked out what it is that you
and Ray have previously agreed on.

>
>But I have no idea if I'm reasoning correctly here - that's why I
>asked the question in general.

--
Rich Ulrich

djh

unread,
Mar 21, 2012, 2:48:47 PM3/21/12
to rich-...@live.com, rich....@comcast.net
Rich -

Thanks very much for taking a moment to focus on the throw:yield
aspect of this matter. It is different than the aspect Ray and I
reached agreement on (which has to do with the average RMSD of the
individual RMSDs obtained from a given yield set).

But in fact, the throw:yield aspect is equally important to our
hypothesis, and I'm afraid I muddied the waters by picking a faulty
set of example throw:yield numbers in my necklace example. (I was
careless here because, again, I was focussing on the "average RMSD"
side of the matter, not the "throw:yield" aspect.)

In actuality, throws and yields seem to be highly correlated, almost
perfectly so, as you can see from the following two examples:

Trial Throws Yields Throw:Yield Correlation
1 2080 1769 9.99E-01
7150 6347
9685 8563
9425 8321
6370 5233
4615 4067
2535 2269
1755 1549
1300 1201
1495 1363

2 7150 6347 9.99E-01
5995 5418
16390 14760
15950 14349
10780 9056
7810 6990
4290 3909
2970 2669
2200 2099
2530 2382
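
A quick numpy check of the throw:yield correlations quoted above, with
the columns copied from the table (each 9.99E-01 is just the Pearson
correlation of a Throws column with its Yields column):

<begin Python sketch>
import numpy as np

throws_1 = [2080, 7150, 9685, 9425, 6370, 4615, 2535, 1755, 1300, 1495]
yields_1 = [1769, 6347, 8563, 8321, 5233, 4067, 2269, 1549, 1201, 1363]
throws_2 = [7150, 5995, 16390, 15950, 10780, 7810, 4290, 2970, 2200, 2530]
yields_2 = [6347, 5418, 14760, 14349, 9056, 6990, 3909, 2669, 2099, 2382]

print(np.corrcoef(throws_1, yields_1)[0, 1])   # trial 1, ~0.999
print(np.corrcoef(throws_2, yields_2)[0, 1])   # trial 2, ~0.999
<end Python sketch>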

These two cases were drawn from our first ten trials in which we were
examining very short stretches (13-22 beads) in the pairs of necklaces
returned by each throw, and in which the necklaces were drawn from a
very homogeneous set. So even though all ten of these initial trials
showed the same high throw:yield correlation as the two above, we
won't know for another several weeks whether the correlation holds up
as: a) we examine stretches of necklace with bead lengths from 23-32,
33-42, ...., 103-112; b) we draw the pairs of necklaces from less
homogeneous sets.

But if the correlation does hold up, its persistence will be extremely
strong evidence for the hypothesis we'd like to advance, and so I'm
again grateful to you for giving me a chance to briefly explain this
aspect of the matter.

djh

unread,
Mar 21, 2012, 6:14:57 PM3/21/12
to rich-...@live.com, rich....@comcast.net, koo...@sfu.ca
Rich -

It may help simplify and clarify discussion if I change analogies and
ask you to think of a filter which:

1) accepts pairs of elements as inputs,
2) decides whether any given pair is acceptable,
3) and if so, outputs the pair.

(If the pair is unacceptable, the filter drops it into the bit-
bucket.)

Now suppose that in any given input pair (i1,i2), i1 has a value from
0 to 9 for a certain parameter u, and i2 has a value from 0 to 9 for
the same parameter u. (Let us agree to refer to the value of u in i1
as u1 and the value of u in i2 as u2.)

Furthermore, suppose that across the entire set of inputs to the
filter, the number of distinct inputs with u1 = i and u2 = j is
roughly directly proportional to (u1+u2)/2, as is shown by the
correlation of # of inputs with (u1+u2)/2 at the end of each row of
this matrix:

Row Corr
u2 with
0 1 2 3 4 5 6 7 8 9 (u1+u2)/2
u1
0 output 1769 6347 8563 8321 5233 4067 2269 1549 1201 1363 -0.60
input 2080 7150 9685 9425 6370 4615 2535 1755 1300 1495 -0.60

1 output 6347 5418 14760 14349 9056 6990 3909 2669 2099 2382 -0.60
input 7150 5995 16390 15950 10780 7810 4290 2970 2200 2530 -0.61

2 output 8563 14760 9873 19218 12122 9385 5204 3573 2795 3187 -0.71
input 9685 16390 11026 21605 14602 10579 5811 4023 2980 3427 -0.70

3 output 8321 14349 19218 9167 11688 9030 4997 3472 2695 3063 -0.76
input 9425 15950 21605 10440 14210 10295 5655 3915 2900 3335 -0.76

4 output 5233 9056 12122 11688 3617 5677 3086 2166 1697 1936 -0.71
input 6370 10780 14602 14210 4753 6958 3822 2646 1960 2254 -0.72

5 output 4067 6990 9385 9030 5677 2148 2414 1653 1298 1477 -0.73
input 4615 7810 10579 10295 6958 2485 2769 1917 1420 1633 -0.73

6 output 2269 3909 5204 4997 3086 2414 632 922 728 821 -0.73
input 2535 4290 5811 5655 3822 2769 741 1053 780 897 -0.73

7 output 1549 2669 3573 3472 2166 1653 922 305 502 554 -0.74
input 1765 2970 4023 3915 2646 1917 1053 351 540 621 -0.74

8 output 1201 2099 2795 2695 1697 1298 728 502 188 442 -0.74
input 1300 2200 2980 2900 1960 1420 780 540 190 460 -0.74

9 output 1363 2382 3187 3063 1936 1477 821 554 442 242 -0.75
input 1495 2530 3427 3335 2254 1633 897 621 460 253 -0.75


Then we can say that our filter is “faithful” to the values of u1 and
u2 in its inputs, in the sense that for each i,j (0<=i<=9 and
0<=j<=9), the number of outputs with u1 = i and u2 = j is ALSO roughly
directly proportional to (u1+u2)/2, with correlation coefficients
virtually identical to those for the correlations of # of inputs with
(u1+u2)/2. (Again, at the end of each row of the above matrix, see
correlations of # of outputs with (u1+u2)/2.)

What’s important to our hypothesis is the “faithfulness” of the filter
here, so I hope you agree that it’s valid to consider the above filter
as “faithful” in the sense I’ve just described.

djh

unread,
Mar 21, 2012, 9:59:04 PM3/21/12
to
Whoops! Sorry! I meant "inversely proportional", not "directly
proportional", in my last post.
I was thinking of our RMSD vs (u1+u2)/2 correlations, which are
directly proportional.

Ray Koopman

unread,
Mar 22, 2012, 1:54:44 AM3/22/12
to
I am able to reproduce your within-row correlations for the sample
sizes but not for the average RMSDs.

0 1 2 3 4 5 6 7 8 9
djh .67 .56 .65 .70 .64 .52 .23 .74 .50 .64
rfk .689 .538 .635 .715 .643 .494 .217 .718 .513 .629

It's possible that rounding the RMSDs has influenced the correlations
that much, but there may be something else going on, too.

A comment on the evolution of the notation: When I suggested u1+u2
as a predictor -- the division by 2 is purely cosmetic, to keep the
numbers on the same scale as they started -- I was referring to the
continuous variable u that was quantized to get the categorical v1 &
v2, which are now being called u1 & u2. If there is any chance that
the continuous variable will ever be used as such then it might not
be too soon to start thinking about the notation, to avoid future
confusion.

Ray Koopman

unread,
Mar 22, 2012, 1:57:59 AM3/22/12
to
Both the input and output decrease with increasing u1+u2,
but the overall correlations with u1+u2 (after replacing the
(0,7) & (7,0) elements of the input matrix by their average,
and using only the nonredundant elements of the matrices),
are -.65 for both output and input, which suggests that
looking at the within-row correlations may be misleading.

The following table gives the yield (output/input), which is
increasing with u1+u2, with the within-row correlation dropping
as the yield increases. The overall correlation is .343 .

0 1 2 3 4 5 6 7 8 9 corr
0 .850 .888 .884 .883 .822 .881 .895 .880 .924 .912 .560
1 .888 .904 .901 .900 .840 .895 .911 .899 .954 .942 .549
2 .884 .901 .895 .890 .830 .887 .896 .888 .938 .930 .452
3 .883 .900 .890 .878 .823 .877 .884 .887 .929 .918 .377
4 .822 .840 .830 .823 .761 .816 .807 .819 .866 .859 .296
5 .881 .895 .887 .877 .816 .864 .872 .862 .914 .904 .168
6 .895 .911 .896 .884 .807 .872 .853 .876 .933 .915 .108
7 .880 .899 .888 .887 .819 .862 .876 .869 .930 .892 .154
8 .924 .954 .938 .929 .866 .914 .933 .930 .989 .961 .340
9 .912 .942 .930 .918 .859 .904 .915 .892 .961 .957 .232

djh

unread,
Mar 22, 2012, 3:43:32 AM3/22/12
to
1. Within-row vs overall.

You wrote:

> Both the input and output decrease with increasing u1+u2,
> but the overall correlations with u1+u2 (after replacing the
> (0,7) & (7,0) elements of the input matrix by their average,
> and using only the nonredundant elements of the matrices),
> are -.65 for both output and input, which suggests that
> looking at the within-row correlations may be misleading.

Since the "overall" correlations are -0.65 for both input and output,
I am more than happy to use the "overall" rather than the "within-row"
if you think it wiser. The fact that the correlations are the same,
i.e. -0.65 for both input and output, expresses the desired
"faithfulness" property of the filter as well as the within-row
correlations, so far as I can see.

2. Correlation of yield with increasing u1+u2.

You wrote:

>
> The following table gives the yield (output/input), which is
> increasing with u1+u2, with the within-row correlation dropping
> as the yield increases. The overall correlation is .343 .
>
>      0    1    2    3    4    5    6    7    8    9    corr
> 0  .850 .888 .884 .883 .822 .881 .895 .880 .924 .912  .560
> 1  .888 .904 .901 .900 .840 .895 .911 .899 .954 .942  .549
> 2  .884 .901 .895 .890 .830 .887 .896 .888 .938 .930  .452
> 3  .883 .900 .890 .878 .823 .877 .884 .887 .929 .918  .377
> 4  .822 .840 .830 .823 .761 .816 .807 .819 .866 .859  .296
> 5  .881 .895 .887 .877 .816 .864 .872 .862 .914 .904  .168
> 6  .895 .911 .896 .884 .807 .872 .853 .876 .933 .915  .108
> 7  .880 .899 .888 .887 .819 .862 .876 .869 .930 .892  .154
> 8  .924 .954 .938 .929 .866 .914 .933 .930 .989 .961  .340
> 9  .912 .942 .930 .918 .859 .904 .915 .892 .961 .957  .232

By overall of .343, I assume you mean the overall correlation of
increasing yield with increasing u1+u2.

If so, is the .343 high enough to allow any conclusions to be based on
it?

If so, then this is absolutely critical for the hypothesis we're
pursuing. According to this hypothesis, yield SHOULD increase with
increasing u1+u2 - it means that the filter is not just "faithful" but
"hyper-faithful" in exactly the way the hypothesis says it should be.
(In effect, the filter can be thought of as "tuning" the pattern in
the input data ...)

So please let me know at your earliest convenience whether the .343 is
"worth anything", and for my part, I will compute this number as I do
each of the remaining 239 matrices. I've already done several more
and the input/output "faithfulness" data are patterning virtually the
same as in this initial case, so I'm anxious to see what "yield vs
u1+u2" correlations we get - better than .343, worse, or about the
same.

Also, the "rmsd vs u1+u2" numbers for these several new cases are
interesting, but I want to do many more of the 239 before I show you
anything. As expected, these numbers "bounce" because they're
sensitive to the "necklace-length" variable and the "necklace-type"
variable, and I have to wait and see what happens when the numbers are
in for enough different lengths over enough different types.

Ray Koopman

unread,
Mar 22, 2012, 4:47:20 AM3/22/12
to
Yes. However, I think that an ordinary correlation is not quite the
right statistic to capture what's going on. I'd be inclined to try
logistic regression, because the dependent variable is a proportion.
In general, the closer the proportions get to 1, the smaller you
expect the correlations to be, and the overall relation can't be
linear. The only problem is on the interpretation side, because there
is no summary statistic in logistic regression that is analogous to
an ordinary correlation. You have to be willing to talk about the
regression of the log-odds on your independent variable.
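
To make the suggestion concrete: a grouped-data logistic regression can
be fit by handing a binomial GLM the per-cell (output, input - output)
counts with (v1+v2)/2 as the predictor. The sketch below uses
statsmodels (a tool choice of mine, not something Ray specified) and
only the first three cells of the yield matrix, so the fitted slope is
purely illustrative.

<begin Python sketch>
import numpy as np
import statsmodels.api as sm

output = np.array([1769, 6347, 5418])        # pairs the filter passed ("yes")
inputs = np.array([2080, 7150, 5995])        # total pairs presented to the filter
v_mid  = np.array([0.0, 0.5, 1.0])           # (v1+v2)/2 for cells (0,0), (1,0), (1,1)

endog = np.column_stack([output, inputs - output])   # (successes, failures) per cell
exog  = sm.add_constant(v_mid)
fit   = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.params)   # [intercept, slope]: slope = change in log-odds per unit of (v1+v2)/2
<end Python sketch>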

>
> If so, is the .343 high enough to allow any conclusions to be based
> on it?

Well, it's definitely not zero, but beyond that it would depend on
what you were trying to say ;)

djh

unread,
Mar 22, 2012, 9:13:43 AM3/22/12
to koo...@sfu.ca
You wrote:

"I'd be inclined to try logistic regression"

So you're saying:

take the binary dependent variable of the logistic regression to be:

yes = input pair was output by filter
no = input pair was not output by filter

and try to predict it using v1+v2?

Is this correct, i.e. you CAN do a logistic regression with just one
independent variable?

If more than one is better, I'll probably try v1+v2 plus the "length"
variable, i.e. the "centers" of the length intervals (17.5 for 13-22,
etc.)

(Note that I've reverted back to v1 and v2 for the quantized
variables, as per your suggestion in your note above, to distinguish
them from the non-quantized continuous variables u1 and u2.)

Rich Ulrich

unread,
Mar 22, 2012, 2:25:25 PM3/22/12
to
On Thu, 22 Mar 2012 06:13:43 -0700 (PDT), djh <halit...@att.net>
wrote:

>You wrote:
>
>"I'd be inclined to try logistic regression"
>
>So you're saying:
>
>take the binary dependent variable of the logistic regression to be:
>
>yes = input pair was output by filter
>no = input pair was not output by filter
>
>and try to predict it using v1+v2?
>
>Is this correct, i.e. you CAN do a logistics regression with just one
>independent variable?

If you are looking only at the cells and grouped data, you
can return to the roots of logistic regression, and take as
criterion the computed logit for each yield -- log(p/(1-p) ).
For data taken that way, you can compute the simple
correlation with the scored logits, or you can do ordinary
least squares regression with v1 and v2 or their sum.


>
>If more than one is better, I'll probably try v1+v2 plus the "length"
>variable, i.e. the "centers" of the length intervals (17.5 for 13-22,
>etc.)

If that has to be identified for each trial/necklace, then
you do need to use the 0/1 binary criterion in maximum
likelihood logistic regression.

>
>(Note that I've reverted back to v1 and v2 for the quantized
>variables, as per your suggestion in your note above (to distinguish
>them from the non-quantized continuous variables u1 and u2.)

--
Rich Ulrich

djh

unread,
Mar 22, 2012, 3:25:27 PM3/22/12
to rich-...@live.com, rich....@comcast.net
As I construct each of the 240 "yield" matrices required by our
theoretical framework, I'm checking to see how Ray's simple "overall"
approach to the "yield vs (v1+v2)/2" correlation works: does it hover
around the .343 that he got (see his post above), go higher, go lower,
etc. At the moment it's doing some interesting things so I may not
have to go to any "deeper" method just to try and find something I
haven't found before.

But nonetheless, I would very much like to learn exactly what you mean
here:

"If you are looking only at the cells and grouped data, you
can return to the roots of logistic regression, and take as
criterion the computed logit for each yield -- log(p/(1-p) ).
For data taken that way, you can compute the simple
correlation with the scored logits, or you can do ordinary
least squares regression with v1 and v2 or their sum."

so I can try it in comparison to the simple approach just to see what
happens.

Are you saying that I should take Ray's matrix of yields:

v2
0 1 2 3 4 5 6 7 8 9
v1
0 .850 .888 .884 .883 .822 .881 .895 .880 .924 .912
1 .888 .904 .901 .900 .840 .895 .911 .899 .954 .942
2 .884 .901 .895 .890 .830 .887 .896 .888 .938 .930
3 .883 .900 .890 .878 .823 .877 .884 .887 .929 .918
4 .822 .840 .830 .823 .761 .816 .807 .819 .866 .859
5 .881 .895 .887 .877 .816 .864 .872 .862 .914 .904
6 .895 .911 .896 .884 .807 .872 .853 .876 .933 .915
7 .880 .899 .888 .887 .819 .862 .876 .869 .930 .892
8 .924 .954 .938 .929 .866 .914 .933 .930 .989 .961
9 .912 .942 .930 .918 .859 .904 .915 .892 .961 .957

and then:

a) compute log(p/(1-p)) for each of the 55 non-redundant (upper-
triangular) cells
b) simply run the correlation of these 55 values with (v1+v2)/2?

If this is what you're saying, then I have to confess that I don't
know how to take log(p/(1-p)) for each of the 55 non-redundant cells of
Ray's matrix. If you think it's something that I can and should learn
about on my own, then that's what I'll do. (I realize that it isn't
your responsibility to give "pro bono" tutorials here ...)

But if you have the time to explain in a little more detail what I
need to do, I would be very grateful, Rich.

Thanks so much again ...





Ray Koopman

unread,
Mar 22, 2012, 4:13:57 PM3/22/12
to
Look at a plot of the logits against (v1+v2)/2. The ordinary linear
correlation is .414, and a maximum likelihood logistic regression
gives a slope of .0682. But with weighting that simulates analyzing
the individual 0/1 data, maximum likelihood logistic regression gives
a slope of -.00140, and the correspondingly weighted correlation is
-.0804. To see why, re-do the plot, making the area of each dot
proportional to the reciprocal of the variance of the corresponding
logit; i.e., logit = log[#yes/#no], var[logit] = 1/#yes + 1/#no.
The problem is that the points that show a positive relation are
based on little data, whereas the points that show a negative
relation are based on lots of data.
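
Written out, this weighting can be coded directly: per cell, logit =
log(output/(input - output)), var[logit] = 1/output + 1/(input - output),
and each cell's weight is the reciprocal of that variance. The
weighted-correlation helper below is generic and its name is mine; only
three cells are typed in, so the printed numbers are illustrative
rather than a reproduction of the .414 and -.0804.

<begin Python sketch>
import numpy as np

def weighted_corr(x, y, w):
    """Pearson correlation of x and y using nonnegative weights w."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.asarray(w, float) / np.sum(w)
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    return cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2))

# first three nonredundant cells of the yield matrix, for illustration
output = np.array([1769.0, 6347.0, 5418.0])
inputs = np.array([2080.0, 7150.0, 5995.0])
v_mid  = np.array([0.0, 0.5, 1.0])                       # (v1+v2)/2

logit = np.log(output / (inputs - output))               # log[#yes/#no]
w     = 1.0 / (1.0 / output + 1.0 / (inputs - output))   # 1 / var[logit]

print(np.corrcoef(v_mid, logit)[0, 1])                   # unweighted correlation
print(weighted_corr(v_mid, logit, w))                    # weighted toward well-determined cells
<end Python sketch>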

djh

unread,
Mar 22, 2012, 4:41:22 PM3/22/12
to
PS to Rich - when I said I didn't know how to take log(p/(1-p)), I
didn't mean I don't know how to take the log. I meant that I don't
know what p is in this situation.

Note to Ray - I'm assuming your last post was to Rich - I have no idea
what you're talking about, except very vaguely ...


Rich Ulrich

unread,
Mar 22, 2012, 5:43:30 PM3/22/12
to
On Thu, 22 Mar 2012 13:41:22 -0700 (PDT), djh <halit...@att.net>
wrote:
Ray was answering very completely, for me and to you.

Each cell is a proportion/probability. For .850 (1st cell),
replace that number with log (.850/.150). That is the logit.

Further, Ray was showing the results. He shows that the simple
correlation is increased, using 55 values versus that average term.

However, the cells represent different Ns. When he applies weights
that represent the Ns, the r becomes very small, and in the opposite
direction.

--
Rich Ulrich

djh

unread,
Mar 22, 2012, 8:56:38 PM3/22/12
to rich-...@live.com, rich....@comcast.net, koo...@sfu.ca
Thanks for the clarification regarding treating the proportions as
probabilities, Rich - much appreciated. Sorry for my relatively
abysmal ignorance of these matters ...

I have completed two more "yield" matrices like the one you and Ray
have been working with. The data are still from the globin proteins,
but the data reflect "necklaces lengths" of 23-32 and 33-42, rather
than the "necklace length" of 13-22 reflected in the data of the
matrix you've been working with.

As soon as I've computed the logits and done the simple linear
correlations, I will present the results in three ways. For each of
the two new matrices, I will provide:

a) the simple linear correlation on the proportions themselves
(overall for all 55 non-redundant cells)
b) the simple linear correlation on the logits of the cells (again
overall for all 55 non-redundant cells)
c) the 10 "within-row" correlations

Even though Ray noted above that one must proceed with caution
regarding the "within-row" correlations, I think it's important for
you and him to see some remarkable "highs" that manifest themselves in
these "within row" correlations.

Also, I hope you can make some statistical sense of the wide swing in
correlation (a) between the two new matrices (from ~ +0.40 to ~
-0.46.) I can readily make scientific sense of this swing, given the
fact that one reflects data with length variable 23-32 and the other
reflects data with length variable 33-42, but I'm hoping that you and
Ray will be able to say with some confidence that the swing does not
merely represent the erratic behavior of an essentially "random"
variable. That would put an end to this aspect of my team's
investigation quite promptly.

Finally, I have one further question at this point. Does the fact
that Ray's weighting made the .41 "disappear" mean that we must
discount the increase from .34 to .41 which he obtained by taking
logits? Or is this increase still meaningful in some sense? (I ask
this question, of course, relative the assumption that the "yield"
variable is not, in fact, simply "random".)

Anyway, thanks so much again for continuing to take the time ...

djh

unread,
Mar 23, 2012, 2:01:37 AM3/23/12
to rich-...@live.com, rich....@comcast.net, koo...@sfu.ca
Ray and Rich -

Below are the "yield" analyses for the globin protein length 23-32
data and globin protein length 33-42 data. I am hoping that these
analyses, together with the analysis you've already done of the globin
length 13-22 data, will enable you to say with some confidence that:

a) the "overall" (55-cell) p vs (v1+v2)/2 correlations are
statistically meaningful and can be used in scientific argument

and/or

b) the "overall" (55-cell) logit vs (v1+v2)/2 correlatiosn are
statistically meaningful and can be used in scientific argument

If you can't tell yet from these three analyses, I have 7 more
matrices to do for the globins for lengths 43-52,...,103-112, then the
same 10 again for each of our other five protein classes. And of
course, the same for the data resulting from each of our three control
protocols. So if you can't form a principled judgment yet, there will
be plenty enough additional data coming in.

Also, from a purely scientific point of view, I would like to make use
of the relatively high "within-row" correlations which I've reported
below in addition to the "overall" correlations on the 55 non-
redundant cells. So could you take a look at these and tell me if
it's legitimate or not to do so.

Thank you so very much again.


Globin proteins
Yield Analysis for Length 23-32 Data

Non-Redundant Overall Data (55 cells)

Correlation of p against (v1+v2)/2 = 0.4688
Correlation of logit against (v1+v2)/2 = 0.5704

(v1+v2)/2   output   input   p   1-p   log(p/(1-p))

0 845 1176 0.718537415 0.281462585 0.407028715
0.5 3738 4949 0.755304102 0.244695898 0.489495154
1 6074 7252 0.837562052 0.162437948 0.712329497
1.5 4589 5586 0.821518081 0.178481919 0.663022899
2 2801 3577 0.783058429 0.216941571 0.557451388
2.5 1941 2401 0.808413161 0.191586839 0.625267704
3 1077 1274 0.845368917 0.154631083 0.737749477
3.5 1368 1568 0.87244898 0.12755102 0.835056102
4 503 637 0.789638932 0.210361068 0.574463187
4.5 428 490 0.873469388 0.126530612 0.83905208
1 3865 5050 0.765346535 0.234653465 0.513431148
1.5 12535 14948 0.838573722 0.161426278 0.715567016
2 9478 11514 0.823171791 0.176828209 0.667938931
2.5 5767 7373 0.782178218 0.217821782 0.55520441
3 4031 4949 0.814507981 0.185492019 0.642570117
3.5 2214 2626 0.843107388 0.156892612 0.730280401
4 2825 3232 0.874071782 0.125928218 0.841424043
4.5 1041 1313 0.792840823 0.207159177 0.582881825
5 868 1010 0.859405941 0.140594059 0.786231381
2 9675 10878 0.889409818 0.110590182 0.905385346
2.5 14726 16872 0.872807018 0.127192982 0.836455079
3 9009 10804 0.83385783 0.16614217 0.700612134
3.5 5677 7252 0.782818533 0.217181467 0.556838336
4 3086 3848 0.801975052 0.198024948 0.60744095
4.5 2166 4736 0.457347973 0.542652027 -0.074274671
5 1697 1924 0.882016632 0.117983368 0.873655985
5.5 1384 1480 0.935135135 0.064864865 1.158864857
3 5502 6441 0.854215184 0.145784816 0.767854994
3.5 6808 8322 0.818072579 0.181927421 0.652893672
4 4797 5586 0.858754028 0.141245972 0.783892715
4.5 2616 2964 0.882591093 0.117408907 0.876058496
5 3361 3648 0.921326754 0.078673246 1.068586616
5.5 1254 1482 0.846153846 0.153846154 0.740362689
6 1037 1140 0.909649123 0.090350877 1.002941532
4 2027 2628 0.77130898 0.22869102 0.527979277
4.5 2917 3577 0.815487839 0.184512161 0.645392494
5 1579 1898 0.831928346 0.168071654 0.694591447
5.5 2039 2336 0.872859589 0.127140411 0.836660776
6 755 949 0.795574289 0.204425711 0.590145222
6.5 627 730 0.85890411 0.14109589 0.784430316
5 1004 1176 0.853741497 0.146258503 0.766205266
5.5 1119 1274 0.87833595 0.12166405 0.858498388
6 1426 1568 0.909438776 0.090561224 1.001831181
6.5 528 637 0.8288854 0.1711146 0.685207425
7 443 490 0.904081633 0.095918367 0.974305868
6 292 325 0.898461538 0.101538462 0.946868912
6.5 784 832 0.942307692 0.057692308 1.213074825
7 291 338 0.860946746 0.139053254 0.791795131
7.5 237 260 0.911538462 0.088461538 1.01302051
7 475 496 0.95766129 0.04233871 1.354474315
7.5 366 416 0.879807692 0.120192308 0.864511081
8 312 320 0.975 0.025 1.591064607
8 63 78 0.807692308 0.192307692 0.62324929
8.5 116 130 0.892307692 0.107692308 0.918329954
9 43 45 0.955555556 0.044444444 1.33243846


Globin proteins
Yield Analysis for Length 33-42 Data

Non-Redundant Overall Data (55 cells)

Correlation of p against (v1+v2)/2 = -0.4611
Correlation of logit against (v1+v2)/2 = -0.4443

Note the zero output for (v1+v2)/2 = 8 from v1 = 8 and v2 = 8

(v1+v2)/2   output   input   p   1-p   log(p/(1-p))

0 202 595 0.339495798 0.660504202 -0.289041181
0.5 1448 3325 0.435488722 0.564511278 -0.112695711
1 2098 5355 0.39178338 0.60821662 -0.191012275
1.5 1121 3045 0.368144499 0.631855501 -0.234599455
2 878 2065 0.425181598 0.574818402 -0.130956203
2.5 485 1050 0.461904762 0.538095238 -0.066306709
3 322 630 0.511111111 0.488888889 0.019305155
3.5 133 315 0.422222222 0.577777778 -0.136219747
4 26 175 0.148571429 0.851428571 -0.75821292
4.5 53 175 0.302857143 0.697142857 -0.362083961
1 2410 4465 0.539753639 0.460246361 0.069205216
1.5 7012 14535 0.482421741 0.517578259 -0.030549154
2 3770 8265 0.456140351 0.543859649 -0.076388346
2.5 2877 5605 0.513291704 0.486708296 0.023095496
3 1609 2850 0.564561404 0.435438596 0.112784263
3.5 1099 1710 0.642690058 0.357309942 0.254956482
4 437 865 0.505202312 0.494797688 0.009037668
4.5 73 474 0.154008439 0.845991561 -0.739821512
5 160 475 0.336842105 0.663157895 -0.294190571
2 4959 11628 0.426470588 0.573529412 -0.128666609
2.5 5340 13311 0.401171963 0.598828037 -0.173971552
3 4077 9027 0.451645065 0.548354935 -0.084264487
3.5 2279 4590 0.496514161 0.503485839 -0.00605562
4 1569 2754 0.569716776 0.430283224 0.121904593
4.5 639 1377 0.464052288 0.535947712 -0.062555504
5 102 765 0.133333333 0.866666667 -0.812913357
5.5 227 765 0.296732026 0.703267974 -0.374756418
3 1384 3741 0.369954558 0.630045442 -0.231223492
3.5 2175 5133 0.423728814 0.576271186 -0.133538908
4 1213 2610 0.464750958 0.535249042 -0.061335605
4.5 821 1566 0.524265645 0.475734355 0.042186884
5 337 783 0.430395913 0.569604087 -0.121704958
5.5 61 435 0.140229885 0.859770115 -0.787541767
6 116 435 0.266666667 0.733333333 -0.439332694
4 812 1711 0.474576271 0.525423729 -0.044203662
4.5 908 1770 0.51299435 0.48700565 0.022578583
5 623 1062 0.586629002 0.413370998 0.152023526
5.5 251 531 0.472693032 0.527306968 -0.04748431
6 42 295 0.142372881 0.857627119 -0.779871231
6.5 84 295 0.284745763 0.715254237 -0.400003169
5 231 435 0.531034483 0.468965517 0.053981812
5.5 334 540 0.618518519 0.381481481 0.209879246
6 135 270 0.5 0.5 0
6.5 22 150 0.146666667 0.853333333 -0.764787289
7 50 150 0.333333333 0.666666667 -0.301029996
6 90 163 0.552147239 0.447852761 0.090919649
6.5 92 162 0.567901235 0.432098765 0.118689787
7 22 90 0.244444444 0.755555556 -0.490086232
7.5 33 90 0.366666667 0.633333333 -0.237360916
7 14 36 0.388888889 0.611111111 -0.196294645
7.5 5 45 0.111111111 0.888888889 -0.903089987
8 14 45 0.311111111 0.688888889 -0.345233658
8 0 45 0
8.5 2 45 0.044444444 0.955555556 -1.33243846
9 2 10 0.2 0.8 -0.602059991


Globin proteins
Yield Analysis for Length 23-32 Data

"Within-Row" Correlations of output against (v1+v2)/2 and input
against (v1+v2)/2

Row Corr
v2 with
0 1 2 3 4 5 6 7 8 9 (v1+v2)/2
v1
0 output 845 3738 6074 4589 2801 1941 1077 1368 503 428 -0.59
input 1176 4949 7252 5586 3577 2401 1274 1568 637 490 -0.62

1 output 3738 3865 12535 9478 5767 4031 2214 2825 1041 868 -0.58
input 4949 5050 14948 11514 7373 4949 2626 3232 1313 1010 -0.61

2 output 6074 12535 9675 14726 9009 5677 3086 2166 1697 1384 -0.74
input 7252 14948 10878 16872 10804 7252 3848 4736 1924 1480 -0.75

3 output 4589 9478 14726 5502 6808 4797 2616 3361 1254 1037 -0.69
input 5586 11514 16872 6441 8322 5586 2964 3648 1482 1140 -0.72

4 output 2801 5767 9009 6808 2027 2917 1579 2039 755 627 -0.68
input 3577 7973 10804 8322 2628 3577 1898 2336 949 730 -0.71

5 output 1941 4031 574 4797 2917 1004 1119 1426 528 443 -0.56
input 2401 4949 7252 5586 3577 1176 1274 1568 637 490 -0.71

6 output 1077 2214 3454 2616 1579 1119 292 784 291 237 -0.70
input 1274 2626 3848 2964 1898 1274 325 832 338 260 -0.72

7 output 1368 2825 4423 3361 2039 1426 784 475 366 312 -0.72
input 1568 3232 4736 3648 2336 1568 832 496 416 320 -0.74

8 output 503 1041 1664 1254 755 528 291 366 63 116 -0.70
input 637 1313 1924 1482 949 637 338 416 78 130 -0.73

9 output 428 868 1384 1037 627 443 237 312 116 43 -0.71
input 490 1010 1480 1140 730 490 260 320 130 45 -0.73


Globin proteins
Yield Analysis for Length 33-42 Data

"Within-Row" Correlations of output against (v1+v2)/2 and input
against (v1+v2)/2
Row Corr
v2 with
0 1 2 3 4 5 6 7 8 9 (v1+v2)/2
v1
0 output 202 1448 2098 1121 878 485 322 133 26 53 -0.63
input 595 3325 5355 3045 2065 1050 630 315 175 175 -0.62

1 output 1448 2410 7012 3770 2877 1609 1099 437 73 160 -0.61
input 3325 4465 14535 8265 5605 2850 1710 865 474 475 -0.60

2 output 2098 7012 4959 5340 4077 2279 1569 639 102 227 -0.77
input 5355 14535 11628 13311 9027 4590 2754 1377 765 765 -0.78

3 output 1121 3770 5340 1384 2175 1213 821 337 61 116 -0.67
input 3045 8265 13311 3741 5133 2610 1566 783 435 435 -0.67

4 output 878 2877 4077 2175 812 908 623 251 42 84 -0.69
input 2065 5605 9027 5133 1711 1770 1062 531 295 295 -0.68

5 output 485 1609 2279 1213 908 231 334 135 22 50 -0.70
input 1050 2850 4590 2610 1770 435 540 270 150 150 -0.69

6 output 322 1099 1569 821 623 334 90 92 22 33 -0.70
input 630 1710 2754 1566 1062 540 163 162 90 90 -0.70

7 output 133 437 639 337 251 135 92 14 5 14 -0.70
input 315 855 1377 783 531 270 162 36 45 45 -0.70

8 output 26 73 102 61 42 22 15 5 0 2 -0.74
input 315 855 1377 783 531 270 162 36 45 45 -0.70

9 output 53 160 227 116 84 50 33 14 2 2 -0.72
input 175 475 765 435 295 150 90 45 25 10 -0.70




Ray Koopman

unread,
Mar 23, 2012, 2:44:45 AM3/23/12
to
The usual regression formulas assume that error is homoscedastic;
that is, that the variance of the distribution from which the error
component in the equation observed_i = model_i + error_i
is drawn is the same for all data points i. When the error variances
are known to differ substantially, least-squares fitting,
which usually solves for the model parameters that minimize
sum (observed_i - model_i)^2, should instead minimize
sum (observed_i - model_i)^2/variance[error_i],
in which each data point is weighted proportional to
the reciprocal of its error variance.
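
For a straight line, that weighting reduces to a simple closed form: a
weighted slope with weights equal to the reciprocals of the error
variances (the w[p] and w[logit] columns in the table below). Here is a
generic sketch (the function name is mine, not the code used for the
table):

<begin Python sketch>
import numpy as np

def wls_slope(x, y, err_var):
    """Slope of y on x minimizing sum (y_i - a - b*x_i)^2 / err_var_i."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = 1.0 / np.asarray(err_var, float)   # weight = reciprocal of error variance
    xw = np.sum(w * x) / np.sum(w)         # weighted means
    yw = np.sum(w * y) / np.sum(w)
    return np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
<end Python sketch>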

The following table uses data posted on Mar 21 at 3:14 pm PDT.
I have scaled the weights so that the maximum weight is 1.

There are four correlations with v1+v2 to consider, in a 2 x 2 table:

unweighted weighted

p .343 .188

logit .414 -.080

The weighted correlations are so much lower than the unweighted
correlations because the points that tend to increase the correlation
are poorly determined relative to the points that tend to decrease
the correlation. This is especially true for the logit.

v1,v2
| output
| | input
| | | p = out/in
| | | |
| | | | w[p] = in/(p(1-p))
| | | | |
| | | | | logit = log[p/(1-p)]
| | | | | |
| | | | | | w[logit] = in*p(1-p)
| | | | | | |
0,0 1769 2080 .850 .074 1.738 .125
1,0 6347 7150 .888 .326 2.067 .336
1,1 5418 5995 .904 .314 2.240 .246
2,0 8563 9685 .884 .430 2.032 .467
2,1 14760 16390 .901 .832 2.203 .691
2,2 9873 11026 .895 .536 2.147 .486
3,0 8321 9425 .883 .415 2.020 .459
3,1 14349 15950 .900 .803 2.193 .678
3,2 19218 21605 .890 1.000 2.086 1.000
3,3 9167 10440 .878 .444 1.974 .526
4,0 5233 6370 .822 .198 1.527 .440
4,1 9056 10780 .840 .365 1.659 .682
4,2 12122 14602 .830 .471 1.587 .970
4,3 11688 14210 .823 .443 1.534 .977
4,4 3617 4753 .761 .119 1.158 .407
5,0 4067 4615 .881 .201 2.004 .227
5,1 6990 7810 .895 .378 2.143 .346
5,2 9385 10579 .887 .481 2.062 .499
5,3 9030 10295 .877 .435 1.965 .523
5,4 5677 6958 .816 .211 1.489 .492
5,5 2148 2485 .864 .096 1.852 .137
6,0 2269 2535 .895 .123 2.144 .112
6,1 3909 4290 .911 .241 2.328 .164
6,2 5204 5811 .896 .283 2.149 .256
6,3 4997 5655 .884 .250 2.027 .274
6,4 3086 3822 .807 .112 1.433 .280
6,5 2414 2769 .872 .113 1.917 .146
6,6 632 741 .853 .027 1.758 .044
7,0 1549 1760 .880 .076 1.994 .087
7,1 2669 2970 .899 .148 2.182 .127
7,2 3573 4023 .888 .184 2.072 .188
7,3 3472 3915 .887 .177 2.059 .185
7,4 2166 2646 .819 .081 1.507 .185
7,5 1653 1917 .862 .073 1.834 .107
7,6 922 1053 .876 .044 1.951 .054
7,7 305 351 .869 .014 1.892 .019
8,0 1201 1300 .924 .084 2.496 .043
8,1 2099 2200 .954 .228 3.034 .045
8,2 2795 2980 .938 .233 2.715 .082
8,3 2695 2900 .929 .201 2.576 .090
8,4 1697 1960 .866 .077 1.864 .107
8,5 1298 1420 .914 .082 2.365 .053
8,6 728 780 .933 .057 2.639 .023
8,7 502 540 .930 .038 2.581 .017
8,8 188 190 .989 .083 4.543 .001
9,0 1363 1495 .912 .084 2.335 .057
9,1 2382 2530 .942 .209 2.778 .066
9,2 3187 3427 .930 .239 2.586 .105
9,3 3063 3335 .918 .203 2.421 .118
9,4 1936 2254 .859 .085 1.806 .129
9,5 1477 1633 .904 .086 2.248 .066
9,6 821 897 .915 .053 2.380 .033
9,7 554 621 .892 .029 2.112 .028
9,8 442 460 .961 .056 3.201 .008
9,9 242 253 .957 .028 3.091 .005

djh

unread,
Mar 23, 2012, 8:49:49 AM3/23/12
to koo...@sfu.ca
You were kind enough to take the time to explain the following:

> The usual regression formulas assume that error is homoscedastic;
> that is, that the variance of the distribution from which the error
> component in the equation observed_i = model_i + error_i
> is drawn is the same for all data points i. When the error variances
> are known to differ substantially, least-squares fitting,
> which usually solves for the model parameters that minimize
> sum (observed_i - model_i)^2, should instead minimize
> sum (observed_i - model_i)^2/variance[error_i],
> in which each data point is weighted proportional to
> the reciprocal of its error variance.

for which I am indebted. It's perfectly clear and actually makes
perfect sense once explained.

I'm assuming it's your judgment that the low and oppositely signed
values for the two weighted correlations imply that the underlying
correlations are really meaningless for this first case of length =
13 - 22. (Rich seems to be saying the same thing.)

So based on this fact, is it still worthwhile to do the same weighting
on the last two matrices I sent, to see what happens? Or can you
already tell that this is a fruitless exercise which will just confirm
the results on the first matrix?

Finally - one last question. From a purely statistical point of view,
is it possible for certain "within-row" correlations to still be
meaningful, even though the "overall" correlations don't seem to be,
at least in this first case?

Thanks so much again, Ray.

djh

unread,
Mar 23, 2012, 9:08:48 AM3/23/12
to koo...@sfu.ca
One other question I forgot to ask above:

Would it be "legit" to go back to the original "output" population for
each individual cell, extract the number within 1 or 2 sd of the mean
for the population, and then recompute all the matrices/correlations/
weighted correlations using these "central tendency" output counts?

Ray Koopman

unread,
Mar 24, 2012, 3:23:15 AM3/24/12
to
I've been thinking about weighted vs unweighted estimation for your
correlations, and I've come to a tentative conclusion. First, let me
say that I think things will go better if you look at and argue from
the slopes of the regression lines instead of the correlations. The
two statistics always have the same sign, but slopes have at least
two advantages:

1. Slopes are not sensitive to irrelevant features of the data
such as the predictor distributions. For this reason, slopes are
considered "structural parameters" of the relation, but correlations
are not.

2. We can easily get good estimates of the standard errors of the
slopes, but we can't for correlations. (The sample size for the
overall correlations is not 55, and for the within-row correlations
it is not 10; the usual inferential machinery for correlations does
not apply.)

We want to estimate the slope of the linear regression of p on v
even tho we know that the true regression is not linear. That is,
we know that the conditional expectation of p given v is not a linear
function of v, but nevertheless we want to estimate what the slope
of the best-fitting straight line would be if each proportion were
equal to its conditional expectation.

The problem is that the methods we've been using assume that the true
relation is linear (or, more generally, that it has the same form as
the model we're fitting). When this assumption is true, the estimated
slope is an unbiased estimate of the true slope, regardless of what
weights are used; and when, as in the present case, large-sample
model-free estimates of the error variance of each data point are
available, weights can then be chosen to minimize the variance of
the estimated slope.

But that works only if the model is veridical or if the variances
are unrelated to the model, and neither condition holds here. AFAIK
-- and I hope someone will jump in here if I'm wrong -- if we want
an unbiased estimate of the slope then we must weight all the points
equally, which will give us a slope whose standard error may be very
large.

Here's how to get the standard error of the estimated slope.
Using general notation, for each data point we have a triple
{x, y, v}, where x is the predictor, y is the response,
and v is the error variance of y.

Let w = x - mean[x]. Then the slope = sum w*y / sum w^2 ,

and its standard error = sqrt[sum v*w^2] / sum w^2 .

That standard error can be used to test hypotheses
and get confidence intervals.

For the data at hand, the predictor "x" has been either v2 (within-
row), or (v1+v2)/2 (overall). (For correlations, the division by 2
could be omitted at will, but for slopes it needs to be there to
keep the overall slopes comparable to the within-row slopes.)

The response "y" has been either
p = output/input or logit[p] = log[p/(1-p)].

For p, the variance "v" is p(1-p)/input;
for logit[p], the variance is 1/output + 1/(input-output).
Both of those assume that both input and output are "large",
so some adjustments will be needed for some of the new data.
A simple one is to add 1 to the input and .5 to the output.
It's far from perfect, but will at least take care of the zeros.
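
For concreteness, here is a minimal NumPy sketch of the recipe above, applied to the (0,*) row of counts that appears later in the thread (with 1755 for the (0,7) cell); it reproduces the slope of about .0108 and the corrected standard error of about .0015 that come up further on:

import numpy as np

out = np.array([1769, 6347, 8563, 8321, 5233, 4067, 2269, 1549, 1201, 1363], float)
inp = np.array([2080, 7150, 9685, 9425, 6370, 4615, 2535, 1755, 1300, 1495], float)
x = np.arange(10) / 2.0                      # predictor (v1+v2)/2 for this row

def slope_and_se(x, y, v):
    # slope = sum w*y / sum w^2,  s.e. = sqrt[sum v*w^2] / sum w^2
    w = x - x.mean()
    return np.sum(w * y) / np.sum(w**2), np.sqrt(np.sum(v * w**2)) / np.sum(w**2)

p = out / inp
v_p = p * (1 - p) / inp                      # error variance of p
print(slope_and_se(x, p, v_p))               # about (0.0108, 0.0015)

logit = np.log(p / (1 - p))
v_logit = 1 / out + 1 / (inp - out)          # error variance of logit[p]
print(slope_and_se(x, logit, v_logit))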

djh

unread,
Mar 24, 2012, 11:04:25 AM3/24/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com, rich-...@live.com, rich....@comcast.net
Ray -

Thank you very much for the advice and "spec" that you provided in
your last post. I will program the "spec" as you gave it and report
back the results for our first set of four Yield-related matrices.
(In this regard, would it be easier for me to work in Minitab rather
than the amalgam of PERL and Excel I'm currently using? If so, I can
obtain a copy thru Vanderbilt; if not, please suggest an alternative;
I do run Linux as well as MS, so if you think something like SPSS
would be better than Minitab, please let me know.)

I assume that you would suggest taking the same "slope" approach to
our RMSD-related data as you suggested above for our yield-related
data. I have been holding back the RMSD-related data so that we could
concentrate on one thing at a time, but now that you've decided on an
approach to the yield-related data, it would be good to know whether
we should use the same approach for the RMSD-related data.

Here are the over-all RMSD vs (v1+v2)/2 correlations for our first
four length intervals for the globin proteins. Note that they swing
down and grow increasingly negative as the length variable increases
(which is very much "sauce for our goose" from a purely scientific
point of view.)

Globin proteins

Length 13-22: over-all RMSD vs (v1+v2)/2 = 0.641048664
Length 23-32: over-all RMSD vs (v1+v2)/2 = 0.020929748
Length 33-42: over-all RMSD vs (v1+v2)/2 = -0.201262146
Length 43-52: over-all RMSD vs (v1+v2)/2 = -0.213178417


Although these correlations may not seem much to write home about, you
can see from the matrices below that we do get some very high within-
row correlations, most consistently at length 33-42. Again, it makes
perfect sense from a scientific point of view that RMSD-related
correlations should be most highly negative at certain lengths, less
so at other lengths, and should disappear entirely at other lengths.
This is because the length-related "sweet-spots" can be related to
structural features of the type of protein under examination (in this
case the "all-helical" globin proteins.)

So, assuming that you think it worthwhile (based on these first four
cases) to continue analyzing RMSD as well as Yield, I am assuming that
you would want me to do so using the same spec as you provided above
for the yield analyses.

Please reply back re this question when you have a chance, and again,
thank you so very much again for your time.

Here are the "within-row" correlations for our first four length
intervals for the globin proteins.

Globin proteins
Length 13-22
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 1.32 1.31 1.34 1.31 1.36 1.38 1.43 1.33 1.42 1.38 0.67
1 1.31 1.27 1.32 1.30 1.33 1.34 1.40 1.30 1.42 1.32 0.56
2 1.34 1.32 1.38 1.35 1.38 1.39 1.42 1.34 1.46 1.39 0.65
3 1.31 1.30 1.35 1.33 1.36 1.38 1.38 1.32 1.43 1.38 0.70
4 1.36 1.33 1.38 1.36 1.41 1.39 1.43 1.37 1.46 1.39 0.63
5 1.36 1.34 1.39 1.38 1.39 1.41 1.43 1.37 1.45 1.38 0.52
6 1.43 1.40 1.42 1.38 1.43 1.43 1.45 1.39 1.46 1.41 0.23
7 1.33 1.30 1.34 1.32 1.37 1.37 1.39 1.38 1.44 1.35 0.74
8 1.42 1.42 1.46 1.43 1.46 1.45 1.46 1.44 1.57 1.44 0.50
9 1.38 1.32 1.39 1.38 1.39 1.38 1.41 1.35 1.44 1.44 0.64


Globin proteins
Length 23-32
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 1.61 1.66 1.67 1.68 1.70 1.77 1.72 1.79 1.85 1.66 0.66
1 1.66 1.64 1.59 1.61 1.62 1.65 1.62 1.70 1.69 1.61 0.25
2 1.67 1.59 1.56 1.56 1.60 1.39 1.43 1.37 1.46 1.39 -0.86
3 1.68 1.61 1.56 1.55 1.59 1.60 1.56 1.63 1.74 1.49 -0.12
4 1.70 1.62 1.60 1.59 1.62 1.66 1.66 1.69 1.77 1.57 0.17
5 1.77 1.65 1.58 1.60 1.66 1.66 1.63 1.70 1.77 1.58 -0.04
6 1.72 1.62 1.55 1.56 1.66 1.63 1.61 1.67 1.75 1.53 -0.02
7 1.79 1.70 1.65 1.63 1.69 1.70 1.67 1.69 1.75 1.64 -0.22
8 1.85 1.69 1.74 1.74 1.77 1.77 1.75 1.75 1.85 1.60 -0.24
9 1.66 1.61 1.56 1.49 1.57 1.58 1.53 1.64 1.60 1.45 -0.40


Globin proteins
Length 33-42
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 0.92 0.90 0.92 0.90 0.87 0.84 0.89 0.86 1.01 0.89 0.08
1 0.90 0.90 0.92 0.92 0.93 0.86 0.92 0.88 0.84 0.88 -0.55
2 0.92 0.92 0.92 0.91 0.88 0.88 0.90 0.88 0.83 0.85 -0.90
3 0.90 0.92 0.91 0.89 0.87 0.86 0.86 0.84 0.84 0.81 -0.95
4 0.87 0.89 0.88 0.87 0.84 0.81 0.85 0.83 0.84 0.71 -0.75
5 0.84 0.86 0.88 0.87 0.81 0.80 0.80 0.85 0.75 0.82 -0.60
6 0.89 0.92 0.90 0.86 0.85 0.80 0.82 0.81 0.69 0.74 -0.91
7 0.86 0.88 0.88 0.84 0.83 0.85 0.81 0.89 1.05 0.82 0.21
8 1.01 0.84 0.83 0.84 0.84 0.75 0.69 1.05 0.57 -0.51
9 0.89 0.88 0.85 0.81 0.71 0.82 0.74 0.82 1.34 1.34 0.56

Globin proteins
Length 43-52
Within-Row Correlations of RMSD with (v1+v2)/2

      v2:   0     1     2     3     4     5     6     7     8     9    Row corr with (v1+v2)/2
v1
0 1.05 1.03 1.08 1.05 1.01 1.01 1.06 0.98 0.86 1.17 -0.14
1 1.03 1.06 1.08 1.06 1.06 1.03 1.08 0.96 1.02 1.08 -0.22
2 1.08 1.08 1.03 1.06 1.04 1.00 1.01 1.01 0.84 1.21 -0.16
3 1.05 1.06 1.06 1.03 1.00 1.03 1.06 0.97 0.86 1.24 0.01
4 1.01 1.06 1.04 1.00 1.01 0.94 1.03 0.95 0.82 -0.73
5 1.01 1.03 1.00 1.03 0.94 0.87 0.99 1.07 0.79 0.99 -0.48
6 1.06 1.08 1.01 1.06 1.07 1.00 0.91 1.04 0.81 1.23 -0.70
7 0.98 0.96 1.01 0.97 0.95 1.07 1.04 1.03 0.98 0.44
8 0.86 1.02 0.84 0.86 0.82 0.79 0.83 0.83 -0.54
9 1.17 1.08 1.21 1.24 0.99 1.23 -0.04




djh

unread,
Mar 24, 2012, 12:28:31 PM3/24/12
to

Ray -

This is a simple question just to make sure I understand w = x -
mean[x] correctly.

For overall correlations in which we've plugged "zeroes" according to
your "input+1" and "output+0.5" suggestion, there are always 55 values
of x:

0, 0.5, 1, 1.5, . . . , 8, 8.5, 9

So if I use all 55 of these values to compute the mean, we get mean[x]
= 4.5

But of course, values duplicate in 0, 0.5, 1, 1.5, . . . , 8, 8.5, 9,
e.g. 2 from (1,3) and 2 from (2,2), and if I take the mean of the non-
redundant values, then mean[x] is again 4.5

So, I assume that I use mean[x] = 4.5 for the overall data, correct?

And by analogy:

mean[x] = 2.25 for the row (0,0),...,(0,9)
mean[x] = 2.75 for the row (1,0),...,(1,9)
mean[x] = 3.25 for the row (2,0),...,(2,9)

Correct?

If I'm incorrect regarding any of the above, please take a moment to
clarify.

Thanks very much again.
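
A quick numerical check of these means, assuming the 55 distinct cells are the v2 <= v1 half of the matrix and each row pairs a fixed v1 with v2 = 0,...,9:

import numpy as np

x_all = np.array([(v1 + v2) / 2 for v1 in range(10) for v2 in range(v1 + 1)])
print(len(x_all), x_all.mean())                              # 55 cells, mean[x] = 4.5

for v1 in range(3):
    print(v1, np.mean([(v1 + v2) / 2 for v2 in range(10)]))  # 2.25, 2.75, 3.25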





djh

unread,
Mar 24, 2012, 4:47:11 PM3/24/12
to koo...@sfu.ca
Here is what I get for the within-row slope and standard error for row
(00...09) for length 13-22, using your definitions of x, y, w, and v.

The standard error is slightly larger than the slope ... I assume this
is a very bad thing that we would hope to see improve for other rows?

              0      1      2      3      4      5      6      7      8      9

out        1769   6347   8563   8321   5233   4067   2269   1549   1201   1363
input      2080   7150   9685   9425   6370   4615   2535   1755   1300   1495
y=p       0.850  0.888  0.884  0.883  0.822  0.881  0.895  0.883  0.924  0.912
x             0    0.5      1    1.5      2    2.5      3    3.5      4    4.5
w         -2.25  -1.75  -1.25  -0.75  -0.25   0.25   0.75   1.25   1.75   2.25
w*y      -1.914 -1.553 -1.105 -0.662 -0.205  0.220  0.671  1.103  1.617  2.051
w^2       5.063  3.063  1.563  0.563  0.063  0.063  0.563  1.563  3.063  5.063
1-p       0.150  0.112  0.116  0.117  0.178  0.119  0.105  0.117  0.076  0.088
p/(1-p)   5.688  7.904  7.632  7.537  4.602  7.422  8.530  7.519 12.131 10.326
v         0.003  0.001  0.001  0.001  0.001  0.002  0.003  0.004  0.009  0.007
v*w^2     0.014  0.003  0.001  0.000  0.000  0.000  0.002  0.007  0.029  0.035

orig corr           0.57
sum(w*y)            0.22
sum(w^2)           20.63
slope            0.01082
sum v*w^2        0.09119
sqrt(sum v*w^2)  0.30197
standard error   0.01464

djh

unread,
Mar 24, 2012, 4:51:30 PM3/24/12
to koo...@sfu.ca
I'm sorry the spacing didn't hold in the above post - it looked OK in
the panel before I hit send.

Here is a redone version:

0 1 2 3 4 5 6 7 8 9

out 1769 6347 8563 8321 5233 4067 2269 1549 1201 1363
input 2080 7150 9685 9425 6370 4615 2535 1755 1300 1495
y=p 0.850 0.888 0.884 0.883 0.822 0.881 0.895 0.883 0.924 0.912
x 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5
w -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25
w*y -1.914 -1.553 -1.105 -0.662 -0.205 0.220 0.671 1.103 1.617 2.051
w^2 5.063 3.063 1.563 0.563 0.063 0.063 0.563 1.563 3.063 5.063
1-p 0.150 0.112 0.116 0.117 0.178 0.119 0.105 0.117 0.076 0.088
p/(1-p) 5.688 7.904 7.632 7.537 4.602 7.422 8.530 7.519 12.131 10.326
v 0.003 0.001 0.001 0.001 0.001 0.002 0.003 0.004 0.009 0.007
v*w^2 0.014 0.003 0.001 0.000 0.000 0.000 0.002 0.007 0.029 0.035


orig corr 0.57

sum(w*p) 0.22
sum(w^2) 20.63
slope 0.01082

djh

unread,
Mar 26, 2012, 12:35:24 AM3/26/12
to koo...@sfu.ca, rich-...@live.com, rich....@comcast.net
Ray

As per your suggestion in your last post, here are some slope and
standard error data for globin proteins, lengths 13-22 thru 43-52,
overall and within row, for y = p and y = logit[p].

In general, taking logit[p] strengthens all correlations and increases
all slopes (pos and neg) (with some exceptions.)

From a purely scientific perspective, the switch from consistently
positive to consistently negative correlations between lengths 23-32
and 33-42 is fascinating (assuming of course, that you find the data
meaningful enough for us to continue thinking about at all.) There are
several possible explanations for this switch in sign, and my team
will have to see the data for the remaining 236 controls before
deciding which of these explanations seems most likely.

One specific question for you re the "y = p" slopes and correlations
in lines 3-4 (overall for lengths 33-42 and 43-52.) I don't
understand why the correlation gets less negative (-0.46108 to
-0.45630) while the slope gets more negative (-0.03342 to -0.04813).
I thought that if the correlation grows more negative, the slope
should also.  (Note that this anomaly only occurs in lines 3-4 for the
y=p data, not the y=logit[p] data.)

Finally, regarding the "meaningfulness" of the data, are the standard
error values so large as to render all the data virtually
meaningless?

As always, thanks again for whatever time you can afford to spend
considering these matters. Until you tell me it would be the better
part of wisdom to cease and desist collecting data (because they won't
show anything meaningful), I will continue the long march thru the
remaining 236 cases.

                            y = p                            y = logit[p]

                   Corr with                        Corr with
Length   Row       (v1+v2)/2    Slope    Std Err    (v1+v2)/2    Slope    Std Err

13-22 Overall 0.34216 0.00676 0.03927 0.41419 0.04446 0.01298
23-32 Overall 0.46882 0.01668 0.05772 0.57036 0.06840 0.01629
33-42 Overall -0.46108 -0.03342 0.00407 -0.48260 -0.09042 0.02906
43-52 Overall -0.45630 -0.04813 0.01296 -0.45427 -0.04278 0.04243

13-22 00-09 0.56767 0.01082 0.01464 0.61801 0.04811 0.01625
10-19 0.49916 0.01111 0.01361 0.61682 0.06710 0.01455
20-29 0.45163 0.00864 0.01042 0.52565 0.04661 0.01130
30-39 0.37734 0.00710 0.00990 0.44269 0.03554 0.01083
40-49 0.29573 0.00568 0.00880 0.33342 0.01889 0.01035
50-59 0.16848 0.00304 0.01301 0.21235 0.01487 0.01446
60-69 0.10761 0.00254 0.01930 0.16132 0.01573 0.02112
70-79 0.16825 0.00317 0.02221 0.21276 0.01672 0.02468
80-89 0.34039 0.00727 0.06720 0.43720 0.08911 0.06850
90-99 0.23166 0.00467 0.04040 0.32872 0.04060 0.04245

23-32 00-09 0.69844 0.02310 0.01632 0.69710 0.06532 0.01951
10-19 0.64019 0.01696 0.01085 0.63434 0.04922 0.01300
20-29 -0.17178 -0.01505 0.01266 -0.03445 -0.00732 0.01392
30-39 0.67259 0.01603 0.01275 0.66731 0.06282 0.01433
40-49 0.61259 0.01781 0.01266 0.61914 0.05061 0.01512
50-59 0.38671 0.06315 0.01856 0.43353 0.17118 0.02121
60-69 0.55066 0.01258 0.02811 0.54119 0.05619 0.03130
70-79 0.56859 0.01432 0.04306 0.61648 0.10407 0.04482
80-89 0.56671 0.01431 0.03667 0.57530 0.04750 0.04283
90-99 0.57290 0.01478 0.08311 0.55298 0.09389 0.08761

33-42 00-09 -0.31184 -0.02085 0.00776 -0.34511 -0.04940 0.02841
10-19 -0.38758 -0.03471 0.00513 -0.72625 -0.17629 0.01650
20-29 -0.38662 -0.03104 0.00363 -0.41197 -0.07071 0.01349
30-39 -0.41249 -0.03000 0.00454 -0.43928 -0.06869 0.01797
40-49 -0.46424 -0.03913 0.00586 -0.47859 -0.08406 0.02152
50-59 -0.49191 -0.04346 0.00896 -0.49272 -0.08946 0.02946
60-69 -0.59756 -0.04826 0.01315 -0.59548 -0.08894 0.03510
70-79 -0.53172 -0.04528 0.01600 -0.52474 -0.09985 0.05938
80-89 -0.27691 -0.00599 0.00599 -0.76102 -0.40210 0.14971
90-99 -0.50953 -0.02792 0.02022 -0.51233 -0.08088 0.11059

43-52 00-09 -0.42940 -0.03691 0.01312 -0.45629 -0.08528 0.04846
10-19 -0.28369 -0.03134 0.01703 -0.25820 -0.05653 0.03021
20-29 -0.39202 -0.04191 0.00843 -0.43592 -0.11382 0.03337
30-39 -0.40530 -0.04741 0.01076 -0.45496 -0.14777 0.05275
40-49 0.03602 0.00147 0.10980 0.02733 0.00202 0.21912
50-59 -0.03241 -0.03241 0.01788 -0.36495 -0.09201 0.06355
60-69 -0.51894 -0.06090 0.01922 -0.52999 -0.16038 0.08559
70-79 -0.46292 -0.06449 0.04089 -0.49250 -0.18743 0.16799
80-89 -0.17190 -0.00766 0.00675 -0.50444 -0.16336 0.19952
90-99 0.29485 0.02859 0.11007 0.13464 0.05397 0.27347

Ray Koopman

unread,
Mar 26, 2012, 12:59:53 AM3/26/12
to
I see two problems. First, as I noted in a previous post, the input
matrix is not symmetric: (0,7) = 1755, (7,0) = 1765. I changed both
of those to 1760 in my previous calculations with that matrix.

Second, your detailed calculations of the slope and its standard
error are OK up to the point where you get p/(1-p), which should be
p*(1-p). That doesn't affect the slope, but it does affect the s.e.
Using 1755, I get .010822 for the slope and .001478 for the s.e.
Using 1760, I get .010670 for the slope and .001479 for the s.e.

Ray Koopman

unread,
Mar 26, 2012, 1:59:49 AM3/26/12
to
Yes, but it's more complicated than that. Let me talk about the
situation more generally. Suppose you have a vector of independent
binomial proportions, say p = (p_1,...,p_k), and you want to
say that the corresponding vector of true probabilities, say
P = (P_1,...,P_k), satisfies some theoretical predictions. Even
if the sample p satisfies the predictions, you can't say you have
shown that the theory is correct. All you can say is that the data
are consistent with the theory.

You need to make the stronger statement that every P that is
consistent with the data also satisfies the theory. To do this,
you need to get a "confidence region" for the true probabilities,
the region within which you are certain, at a specified level of
confidence, that the true P lies; any P outside this region would
be said to be inconsistent with the data.

Let (n_1,...,n_k) be the sample sizes of the observed p. Then the
100C% confidence region for P is the set of all vectors P for which
sum n_i(p_i - P_i)^2/(P_i(1-P_i)) < the 100C'th percentile of the
chi-square distribution with k degrees of freedom.

So if, for instance, you wanted to say that the P_i are monotone
increasing (i.e., that P_1 < ... < P_k), you would have to show
that every P in the confidence region was monotone increasing.
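
A minimal sketch of that membership test (it assumes SciPy is available; the vectors below are hypothetical, not data from the thread):

import numpy as np
from scipy.stats import chi2

def in_confidence_region(p_obs, n, P, conf=0.95):
    # True if candidate P satisfies sum n_i(p_i - P_i)^2/(P_i(1-P_i)) < chi-square percentile
    p_obs, n, P = map(np.asarray, (p_obs, n, P))
    stat = np.sum(n * (p_obs - P)**2 / (P * (1 - P)))
    return stat < chi2.ppf(conf, df=len(p_obs))

p_obs = np.array([0.85, 0.88, 0.90])      # observed proportions (hypothetical)
n     = np.array([2000, 7000, 9000])      # their sample sizes
P     = np.array([0.86, 0.88, 0.89])      # one monotone candidate for the true probabilities
print(in_confidence_region(p_obs, n, P))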

Ray Koopman

unread,
Mar 26, 2012, 3:52:17 AM3/26/12
to
(5,0) disagrees with (0,5), which agrees with both (5,0) & (0,5)
in the version of this matrix that you posted previously.

Also, as I posted previously, I get different within-row correlations
for the version of this matrix that you posted previously.

>
> Row Corr
> v2 with
> 0 1 2 3 4 5 6 7 8 9 (v1+v2/2)
> v1
> 0 1.32 1.31 1.34 1.31 1.36 1.38 1.43 1.33 1.42 1.38 0.67
> 1 1.31 1.27 1.32 1.30 1.33 1.34 1.40 1.30 1.42 1.32 0.56
> 2 1.34 1.32 1.38 1.35 1.38 1.39 1.42 1.34 1.46 1.39 0.65
> 3 1.31 1.30 1.35 1.33 1.36 1.38 1.38 1.32 1.43 1.38 0.70
> 4 1.36 1.33 1.38 1.36 1.41 1.39 1.43 1.37 1.46 1.39 0.63
> 5 1.36 1.34 1.39 1.38 1.39 1.41 1.43 1.37 1.45 1.38 0.52
> 6 1.43 1.40 1.42 1.38 1.43 1.43 1.45 1.39 1.46 1.41 0.23
> 7 1.33 1.30 1.34 1.32 1.37 1.37 1.39 1.38 1.44 1.35 0.74
> 8 1.42 1.42 1.46 1.43 1.46 1.45 1.46 1.44 1.57 1.44 0.50
> 9 1.38 1.32 1.39 1.38 1.39 1.38 1.41 1.35 1.44 1.44 0.64
>
> Globin proteins
> Length 23-32
> Within-Row Correlations of RMSD with (v1+v2)/2

(2,5), (2,6), (2,7), (2,8), (2,9) disagree with
(5,2), (6,2), (7,2), (8,2), (9,2).

>
> Row Corr
> u2 with
> 0 1 2 3 4 5 6 7 8 9 (v1+v2/2)
> v1
> 0 1.61 1.66 1.67 1.68 1.70 1.77 1.72 1.79 1.85 1.66 0.66
> 1 1.66 1.64 1.59 1.61 1.62 1.65 1.62 1.70 1.69 1.61 0.25
> 2 1.67 1.59 1.56 1.56 1.60 1.39 1.43 1.37 1.46 1.39 -0.86
> 3 1.68 1.61 1.56 1.55 1.59 1.60 1.56 1.63 1.74 1.49 -0.12
> 4 1.70 1.62 1.60 1.59 1.62 1.66 1.66 1.69 1.77 1.57 0.17
> 5 1.77 1.65 1.58 1.60 1.66 1.66 1.63 1.70 1.77 1.58 -0.04
> 6 1.72 1.62 1.55 1.56 1.66 1.63 1.61 1.67 1.75 1.53 -0.02
> 7 1.79 1.70 1.65 1.63 1.69 1.70 1.67 1.69 1.75 1.64 -0.22
> 8 1.85 1.69 1.74 1.74 1.77 1.77 1.75 1.75 1.85 1.60 -0.24
> 9 1.66 1.61 1.56 1.49 1.57 1.58 1.53 1.64 1.60 1.45 -0.40
>
> Globin proteins
> Length 33-42
> Within-Row Correlations of RMSD with (v1+v2)/2

(1,4), (3,5), (8,9) disagree with
(4,1), (5,3), (9,8).

>
> Row Corr
> u2 with
> 0 1 2 3 4 5 6 7 8 9 (v1+v2/2)
> v1
> 0 0.92 0.90 0.92 0.90 0.87 0.84 0.89 0.86 1.01 0.89 0.08
> 1 0.90 0.90 0.92 0.92 0.93 0.86 0.92 0.88 0.84 0.88 -0.55
> 2 0.92 0.92 0.92 0.91 0.88 0.88 0.90 0.88 0.83 0.85 -0.90
> 3 0.90 0.92 0.91 0.89 0.87 0.86 0.86 0.84 0.84 0.81 -0.95
> 4 0.87 0.89 0.88 0.87 0.84 0.81 0.85 0.83 0.84 0.71 -0.75
> 5 0.84 0.86 0.88 0.87 0.81 0.80 0.80 0.85 0.75 0.82 -0.60
> 6 0.89 0.92 0.90 0.86 0.85 0.80 0.82 0.81 0.69 0.74 -0.91
> 7 0.86 0.88 0.88 0.84 0.83 0.85 0.81 0.89 1.05 0.82 0.21
> 8 1.01 0.84 0.83 0.84 0.84 0.75 0.69 1.05 0.57 -0.51
> 9 0.89 0.88 0.85 0.81 0.71 0.82 0.74 0.82 1.34 1.34 0.56
>
> Globin proteins
> Length 43-52
> Within-Row Correlations of RMSD with (v1+v2)/2

(4,6), (5,6), (6,8), (7,8) disagree with
(6,4), (6,5), (8,6), (8,7).

Ray Koopman

unread,
Mar 26, 2012, 4:18:15 AM3/26/12
to
On Mar 25, 9:35 pm, djh <halitsk...@att.net> wrote:
> Ray
>
> As per your suggestion in your last post, here are some slope and
> standard error data for globin proteins, lengths 13-22 thru 43-52,
> overall and within row, for y = p and y = logit[p].
>
> In general, taking logit[p] strengthens all correlations and increases
> all slopes (pos and neg) (with some exceptions.)
>
> From a purely scientific perspective, the switch from consistently
> positive to consistently negative correlations between lengths 23-32
> and 33-42 is fascinating (assuming of course, that you find the data
> meaningful enough for us to continue thinking about at all.) There are
> several possible explanations for this switch in sign, and my team
> will have to see the data for the remaining 236 controls before
> deciding which of these explanations seems most likely.
>
> One specific question for you re the "y = p" slopes and correlations
> in lines 3-4 (overall for lengths 33-42 and 43-52.) I don't
> understand why the correlation gets less negative (-0.46108 to
> -0.45630) while the slope gets more negative (-0.03342 to -0.04813).
> I thought that if the correlation grows more negative, the slope
> should also. (Note that this anomaly only occurs in lines 3-4 for the
> y=p data, not the y=logit[p] data.

correlation = slope / sqrt[ slope^2 + var(u)/var(x) ],

where var(u) = the variance of the residuals, the differences
between y and the regression line. The correlation is just an
artifact of the three logically independent and more fundamental
quantities on the right hand side.
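
A quick numerical check of that identity on simulated data (note that var(u) and var(x) must be the divide-by-n, descriptive variances for the identity to hold exactly):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 - 0.3 * x + rng.normal(scale=0.5, size=50)

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)

r_from_slope = slope / np.sqrt(slope**2 + np.var(resid) / np.var(x))
r_direct = np.corrcoef(x, y)[0, 1]
print(r_from_slope, r_direct)      # the two agree to rounding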

djh

unread,
Mar 26, 2012, 5:49:23 AM3/26/12
to
You wrote:

> I see two problems. First, as I noted in a previous post, the input
> matrix is not symmetric: (0,7) = 1755, (7,0) = 1765. I changed both
> of those to 1760 in my previous calculations with that matrix.

I will correct this error - I thought I had.

You also wrote:

> Second, your detailed calculations of the slope and its standard
> error are OK up to the point where you get p/(1-p), which should be
> p*(1-p). That doesn't affect the slope, but it does affect the s.e.
> Using 1755, I get .010822 for the slope and .001478 for the s.e.
> Using 1760, I get .010670 for the slope and .001479 for the s.e.-

Thank you for catching this error! I will have to recompute all data
posted below as well, but it won't be difficult. I apologize for the
careless misread - I really hate to waste your time like that.

djh

unread,
Mar 26, 2012, 5:52:04 AM3/26/12
to
You wrote:

Let (n_1,...,n_k) be the sample sizes of the observed p. Then the
100C% confidence region for P is the set of all vectors P for which
sum n_i(p_i - P_i)^2/(P_i(1-P_i)) < the 100C'th percentile of the
chi-square distribution with k degrees of freedom.


So if, for instance, you wanted to say that the P_i are monotone
increasing (i.e., that P_1 < ... < P_k), you would have to show
that every P in the confidence region was monotone increasing.


I will save this and study on it very hard. I assume this is where
"the rubber meets the road" with respect to whether all of the data
are really saying anything or not ...

djh

unread,
Mar 26, 2012, 6:22:10 AM3/26/12
to koo...@sfu.ca
I will correct the matrices as per your notes and also recompute all
of them using p*(1-p) when computing v, not p/(1-p) (the error you noted
above in my first example.)

Thank you very much again, and I apologize again for wasting your time
with clerically erroneous data.

djh

unread,
Mar 26, 2012, 12:04:15 PM3/26/12
to koo...@sfu.ca
I have corrected the "symmetry-violation" errors in all matrices where
you found them, and also corrected all tables derived from these base
matrices. Fortunately, all errors were of the minor clerical variety
that did not substantively change the picture of the data we have at
present (with only 4/240 = 1/60 of the data "known".)

I am now going to correct the p*(1-p) vs p/(1-p) mistake in the
computation of v (for the slope/standard error "yield" data with y =
p).  This correction is easier to make than the "symmetry-violation"
corrections because essentially it's just a change in one place.  I am
hoping this correction in the computation of v will lower standard
error values throughout, because they seemed awfully high to me.

Also, I will not bother re-posting the corrected RMSD-related and
"yield-related" data unless you'd like to have it for your own
purposes. If so, please let me know via reply to this post and I will
repost it.

Finally, I hope I'm not mistaken when I say that the p/(1-p) vs p*(1-
p) error should not have affected the standard error when y = logit[p]
rather than p. So far as I can tell, v is defined in two different
ways in the two cases, since you wrote:

"For p, the variance "v" is p(1-p)/input;
for logit[p], the variance is 1/output + 1/(input-output). "

But if I'm not seeing something I should here, please let me know.

Thank you so much again for your keen eye.  I am truly embarrassed that
the expert caught what the novice should have caught before presenting
anything to the expert.

djh

unread,
Mar 26, 2012, 2:14:42 PM3/26/12
to koo...@sfu.ca
Below are the standard error values when y = p and v is computed
correctly. If I correctly understand what these signify, I have to
say you have great intuitions for being able to have realized at
least a week ago that the points are less well determined as the
predictor x = (v1+v2)/2 increases.  That certainly seems to be what the
"in-row" standard error values show.

Also, I'm assuming that these values are now satisfactorily smaller
than the actual slope values themselves (see second table below). But
you will have to be the judge of this ....

I. Standard Error (y = p)

Row 13-22 23-32 33-42 43-52

00-09 0.00148 0.00270 0.00532 0.00775
10-19 0.00092 0.00180 0.00320 0.00527
20-29 0.00083 0.00120 0.00243 0.00396
30-39 0.00087 0.00140 0.00316 0.00456
40-49 0.00132 0.00210 0.00586 0.05475
50-59 0.00131 0.00230 0.00561 0.01788
60-69 0.00170 0.00290 0.00771 0.00879
70-79 0.00227 0.00210 0.01055 0.01398
80-89 0.00168 0.00540 0.00545 0.00540
90-99 0.00203 0.00460 0.01588 0.05609

Overall 0.00007 0.00030 0.00219 0.00250

II. Slope (y = p)

Row 13-22 23-32 33-42 43-52
00-09 0.01082 0.02310 -0.02085 -0.03691
10-19 0.01111 0.01696 -0.03471 -0.03134
20-29 0.00864 -0.01505 -0.03104 -0.04191
30-39 0.00710 0.01603 -0.03000 -0.04741
40-49 0.00568 0.01781 -0.03913 0.00147
50-59 0.00304 0.06315 -0.04346 -0.03241
60-69 0.00254 0.01258 -0.04826 -0.06090
70-79 0.00317 0.01432 -0.04528 -0.06449
80-89 0.00727 0.01431 -0.00599 -0.00766
90-99 0.00467 0.01478 -0.02792 0.02859

Overall 0.00676 0.01930 -0.03342 -0.04813


djh

unread,
Mar 26, 2012, 8:26:41 PM3/26/12
to koo...@sfu.ca
Ray -

If you have a moment, this is a simple dumb question about computing
standard error when y = rmsd, not p or logit[p].

Here is an example of how I compute slope for rmsd data, according to
the definitions you gave:

y=rmsd           x     w = x-mean[x]      w^2        w*y          slope = sum[w*y]/sum[w^2]

1.323715088 0 -2.25 5.0625 -2.978358947 0.019074527
1.310987263 0.5 -1.75 3.0625 -2.294227711
1.343656653 1 -1.25 1.5625 -1.679570816
1.312367819 1.5 -0.75 0.5625 -0.984275864
1.361786878 2 -0.25 0.0625 -0.340446719
1.377588158 2.5 0.25 0.0625 0.344397039
1.427719401 3 0.75 0.5625 1.07078955
1.326376675 3.5 1.25 1.5625 1.657970844
1.423061391 4 1.75 3.0625 2.490357433
1.380789912 4.5 2.25 5.0625 3.106777302

2.25 20.625 0.393412112
mean[x] sum[w^2] sum[w*y]

The computed slope of 0.019074527 agrees well with the slope that
Excel gives me when I do a simple regression of y against x for the
same data.

So does this mean that it's safe for me to use the standard error
which Excel provides
for the slope, i.e. 0.007456655 in this case?

Reason I'm asking is that I want to table the slope and standard error
data for rmsd alongside the slope and standard error data for p and
logit[p], because there seem to be some relationships.

But I need to be sure about what to display for standard error before
I do so.

If not, can you tell me how to compute standard error in a case like
this when y is not a probability
or the logit of a probability?


Ray Koopman

unread,
Mar 27, 2012, 12:30:25 AM3/27/12
to
On Mar 26, 5:26 pm, djh <halitsk...@att.net> wrote:
> Ray -
>
> If you have a moment, this is a simple dumb question about computing
> standard error when y = rmsd, not p or logit[p].
>
> Here is an example of how I compute slope for rmsd data, according
> to the definitions you gave:
>
> w = slope =
> y = rmsd x x-mean[x] w^2 w*y sum[w*y]/
You don't want the standard error that Excel gives, which is
sqrt[var(u)/sum(w^2)], where var(u) is the adjusted variance of
the regression residuals. You want sqrt[sum(v_i*w_i^2)]/sum(w_i^2),
where v_i is the variance of y_i. (Excel is using the standard
formula, which assumes that all the v_i are equal to one another,
and that var(u) is an estimate of their common value. You don't
want to make either of those assumptions.)

If y_i were simply the average of n_i independent distances then
v_i would be V_i/n_i, where V_i is the variance of those distances.

But y_i is an rms distance; the square root complicates things,
and we may have to make some assumptions. What can you tell us
about the n_i distances that go into y_i? Are they independent?
Do you have any idea of the general form of their distribution?
We're talking about inter-atomic distances, right? So are the
distances in 3D, or are the atoms constrained to lie in the same
plane, which would mean the distances are in 2D? Are there any
other constraints, of any kind?

djh

unread,
Mar 27, 2012, 10:03:35 AM3/27/12
to koo...@sfu.ca
Thanks as always for the prompt and informative response.

I've cc'd you on an off-line email to my protein structure colleague
to see if I can't easily get the actual distances that go into each of
our RMSDs (so that actual distributions will be available for
inspection/analysis.)

But to give you a general idea of the situation:

i) the inter-atomic distances are 3D, not 2D.
ii) I suspect that the ones we're interested in range from 0.5 to 2.0
Angstroms;
iii) each RMSD subsumes, say, 50-200 individual distances.

Also, the RMSD data is where I really want to get the "central
tendency" within each cell of our usual 10x10 matrix. In other words,
suppose that the average RMSD for one cell of this matrix subsumes
1000 individual RMSDs. Some of these individual RMSD's are very very
low (because they're each computed on a pair of protein structures
which SHOULD BE very very similar), and some of these individual
RMSD's are very very high (because they're each computed on a pair of
protein structures that are just barely similar enough to make them
worth reporting out.) And so we're really interested in the
individual RMSDs "in the middle" when we compute the average RMSD for
the cell of the matrix.

So what I would like to do for the 1000 individual RMSD's in one cell
of a matrix is to get their mean and sd, and then use just the RMSDs
within one sd of the mean to compute the average RMSD for the cell.  (You said in
an email above that this would be legitimate to do, unless I
misunderstood you.) This "winnowing" process will, I believe, make
our correlations involving average RMSDs much more meaningful
empirically (and also, perhaps, "stronger".)

But even if you give me the green light to do this "winnowing" within
sd, we would still need the "typical" distribution of distances within
a single RMSD so that you can advise on how to compute standard error
for a correlation involving average RMSDs, and that's why I emailed my
protein structure colleague on this matter.

Ray Koopman

unread,
Mar 27, 2012, 1:29:37 PM3/27/12
to
It's my turn to apologize for a dumb mistake. I forgot that y_i is
an *average* rmsd, not just a single rmsd, so there is no need for
the individual distances that went into each rmsd. To paraphrase my
previous post, "y_i is simply the average of n_i independent rmsd's,
so v_i = V_i/n_i, where V_i is the variance of those rmsd's."

Re using only the rmsd's "in the middle": Yes, that sort of thing is
legit, altho I don't know what I said that would lead you to that
conclusion. However, it won't change y_i much unless the distribution
of rmsd's is pretty skewed, and it will complicate getting v_i,
because the selection rule (which should probably not be based on
the sd of the whole set) needs to be factored into the process.
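
A small sketch of that computation, with hypothetical per-cell rmsd lists standing in for the real ones; it returns the y_i and v_i that feed the slope and standard-error formulas given earlier:

import numpy as np

def cell_mean_and_variance(rmsds):
    # y_i = mean rmsd for the cell; v_i = V_i/n_i, with V_i the (n-1)-denominator
    # variance of the individual rmsd's (see the variance discussion further on)
    rmsds = np.asarray(rmsds, float)
    return rmsds.mean(), rmsds.var(ddof=1) / len(rmsds)

cells = [
    [1.2, 1.4, 1.3, 1.5, 1.1],          # hypothetical rmsd's for one cell
    [1.6, 1.5, 1.7, 1.4],
    [0.9, 1.0, 1.1, 1.0, 0.8, 1.2],
]
y, v = zip(*[cell_mean_and_variance(c) for c in cells])
print(y, v)    # pair these with the x's and plug into the slope/s.e. formulas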

djh

unread,
Mar 27, 2012, 3:26:43 PM3/27/12
to koo...@sfu.ca
OK - please be patient and let me make sure I understand.

Suppose that in one of the correlations involving average RMSD as y, I
have a y_i whose value is 1.56.

Further, suppose that this y_i is the average of the following 11
individual RMSD's:

1.51
1.52
1.53
1.54
1.55
1.56
1.57
1.58
1.59
1.6
1.61

Then by definition, the mean of these individual RMSDs is y_i = 1.56,
and since their differences from 1.56 are:

-0.05
-0.04
-0.03
-0.02
-0.01
0
0.01
0.02
0.03
0.04
0.05

and the squares of these differences are:

0.0025
0.0016
0.0009
0.0004
0.0001
0
0.0001
0.0004
0.0009
0.0016
0.0025

the variance V_i for this y_i is the sum of these squares divided by
11, i.e.

0.011 / 11 = .001

Correct ?

Thanks for taking the time to verify ... I just want to be sure ....

Art Kendall

unread,
Mar 27, 2012, 4:27:00 PM3/27/12
to
I am not sure I completely follow what you have but maybe this will help.

You have several square matrices with the same entities on the margins.
The cells of the matrices are some kind of pairwise
distances/similarities. Each matrix comes from some set of circumstances.

Individual differences multidimensional scaling (INDSCAL) is
designed to handle that kind of data.

INDSCAL and other techniques are available in, e.g., SPSS.
I believe Doug Carroll has a monograph in the green Sage series.

If this is at all what you have, I can point you to a couple of lists on
these kinds of topics.

HTH


Art Kendall
Social Research Consultants

djh

unread,
Mar 27, 2012, 5:47:37 PM3/27/12
to a...@dr.kendall.org, koo...@sfu.ca
Hi Art -

Thanks very much for the suggestion, which sounds like it might
conceivably be relevant to our situation. This situation can be
described from 30K feet as follows:

1) our variable space includes one dependent variable x and three
independent variables y,z, w, where x is a representation level that
can be over, under, or par, y is a pairwise distance metric, z is a
"yield" variable (output/input) and w is a straightforward energy
level variable;

2) what you've called our "circumstance space" includes six type of
proteins, 10 length intervals (lengths of protein amino acid chains),
plus 4 selections of certain biomolecular elements for which the
behavior of x changes in different ways (1 selection "of interest" and
3 "control" selections);

3) all three of our correlations w:x, y:x, and z:x can be computed for
each row of a 10x10 matrix or for an entire 10x10 matrix, and there
are a minimum of 6*10*4 such matrices corresponding to different
combinations of our "circumstances"

Right now, with Ray's invaluable, kind, and generous guidance, we're
in the process of establishing that our three correlations are in fact
statistically worth talking about.

But assuming that we are successful in this effort, it might well be
that a technique such as INDSCAL might be able to draw our attention
to subtle ways in which our matrices are critically different and
critically similar.

So as soon as I'm out from under some really brutal data reduction/
analysis to generate our 240 matrices along with the associated slopes
and standard errors for Ray, I will most definitely try to get
knowledgeable about INDSCAL ...

Thanks again ...

djh

unread,
Mar 27, 2012, 5:57:35 PM3/27/12
to
Oops! Sorry - above post should of course have said one INdependent
variable x and three dependent variables y,z,w.

Ray Koopman

unread,
Mar 27, 2012, 6:05:34 PM3/27/12
to
I should have been specific when I said "variance". There are two
versions of the variance in common use. They use the same numerator
but different denominators. The descriptive variance (also called
the population variance), divides by n. The inferential variance
(also called the sample variance) divides by n-1.

The descriptive variance is the second central moment of the n
data values, without regard to where they came from or what they
represent; it simply describes those numbers.

The inferential variance assumes the data values are independent
samples from some population, and is an unbiased estimate of the
variance of the population. Although many people call it the sample
variance, it is not the variance of the sample, but an inference
about the variance of the population.

You're trying to estimate the population variance, so you should
divide by n-1.

For the record, the "adjusted variance" that I referred to in a
recent post is a more general form of inferential variance: the
denominator is n-k, where k is the number of linear constraints
on the values whose squares are summed in the numerator. For an
ordinary variance, k = 1: the deviations from the mean sum to 0.
In simple linear regression, k = 2: the residuals are orthogonal
to both the unit vector (1,...,1) and the predictor (x_1,...,x_n).
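
For the 11 rmsd's in the example above, the two versions differ only in the denominator; a quick check:

import numpy as np

rmsds = np.array([1.51, 1.52, 1.53, 1.54, 1.55, 1.56,
                  1.57, 1.58, 1.59, 1.60, 1.61])

print(rmsds.var(ddof=0))   # descriptive:  0.011/11 = 0.0010
print(rmsds.var(ddof=1))   # inferential:  0.011/10 = 0.0011

v_i = rmsds.var(ddof=1) / len(rmsds)   # the v_i to use for this cell's average rmsd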

Ray Koopman

unread,
Mar 27, 2012, 7:17:03 PM3/27/12
to
I should add that what I described above is a conservative criterion
that is difficult to meet, especially as k gets large, and that many
researchers will often settle for something less rigorous (at least
in their own work). There are many ways to liberalize the criterion,
depending on the details of the situation; I just wanted to mention
that what I described is sufficient but in practice is not always
necessary.

djh

unread,
Mar 27, 2012, 7:43:57 PM3/27/12
to koo...@sfu.ca
Thanks for the clarification/education re the different flavors of
variance, and for the specific instruction to use n-1 in the
denominator. As someone completely naive about statistics, I find it
fascinating (for reasons I won't go into here) that the residuals are
orthogonal to the unit vector in simple linear regression.
> to both the unit vector (1,...,1) and the predictor (x_1,...,x_n).

djh

unread,
Mar 27, 2012, 7:45:35 PM3/27/12
to koo...@sfu.ca
I see from your parenthetical here:

"and that many researchers will often settle for something less
rigorous (at least
in their own work)."

that you have a very dry sense of humor. I hope I get a chance to
meet you in person some time in the future ...

djh

unread,
Mar 27, 2012, 10:21:18 PM3/27/12
to koo...@sfu.ca
In the post to which I'm replying here, you defined logit[p] as

log[p/(1-p)]

but also wrote the following:

> For p, the variance "v" is  p(1-p)/input;
> for logit[p], the variance is  1/output + 1/(input-output).
> Both of those assume that both input and output are "large",
> so some adjustments will be needed for some of the new data.
> A simple one is to add 1 to the input and .5 to the output.
> It's far from perfect, but will at least take care of the zeros.

I've been adding .5 to output and 1 to input when I have a zero
output, but now that I'm hitting sparser data, I'm hitting cases when
input and output are both 0, so both additions give me .5 as output
and 1 as input.

And then when I take log(p/(1-p)), I get log(.5/(1-.5)) = log(1) = 0.

And so, when I try to do a correlation with y = logit[p], Excel
complains with a divide by 0 error because of the logit[p] of 0.

Am I misintepreting what you said to do?

If not, what do you suggest for such cases? (The addition of .5 and 1
works fine when I'm doing a correlation with y = p, so no fix is
needed for this type of correlation.)

Thanks again in advance for the continued guidance.

djh

unread,
Mar 27, 2012, 10:30:36 PM3/27/12
to koo...@sfu.ca
Also I'm hitting another "divide by 0" problem - this one involves
the formula

logit[p] = log[p/(1-p)]

for logit[p] itself.

When I hit the rare case where input = output, p is 1 and so 1-p = 0.

Am I missing something?

If not, what to do ?

Ray Koopman

unread,
Mar 28, 2012, 5:26:07 AM3/28/12
to
On Mar 27, 7:21 pm, djh <halitsk...@att.net> wrote:
> In the post to which I'm replying here, you defined logit[p] as
>
> log[p/(1-p)]
>
> but also wrote the following:
>
>> For p, the variance "v" is p(1-p)/input;
>>
>> for logit[p], the variance is 1/output + 1/(input-output).
>> Both of those assume that both input and output are "large",
>> so some adjustments will be needed for some of the new data.
>> A simple one is to add 1 to the input and .5 to the output.
>> It's far from perfect, but will at least take care of the zeros.
>
> I've been adding .5 to output and 1 to input when I have a zero
> output, but now that I'm hitting sparser data, I'm hitting cases
> when input and output are both 0, so both additions give me .5 as
> output and 1 as input.

When the input is zero then you have a hole in your data, but you
can still do the regression -- it just means the programming is a
little messier.

>
> And then when I take log(p/1-p), I get log(.5/(1-.5) = log(1) = 0.
>
> And so, when I try to do a correlation with y = logit[p], Excel
> complains with a divide by 0 error because of the logit[p] of 0.

Having one or more zeros in the data should not upset anything. The
only divisions involved in a correlation are by n (to get the means)
and at the very end, where a quantity proportional to the covariance
is divided by the square root of the product of quantities that are
proportional to the two variances. Is one of the variances zero?

>
> Am I misinterpreting what you said to do?
>
> If not, what do you suggest for such cases? (The addition of .5 and
> 1 works fine when I'm doing a correlation with y = p, so no fix is
> needed for this type of correlation.)
>
> Thanks again in advance for the continued guidance.

On Mar 27, 7:30 pm, djh <halitsk...@att.net> wrote:
> Also I'm hitting another "divide by 0" problem - this one involves
> the formula
>
> logit[p] = log[p/(1-p)]
>
> for logit[p] itself.
>
> When I hit the rare case where input = output, p is 1 and so 1-p = 0.

Adding 1 to the input and .5 to the output works at both ends of the
scale, to avoid both p = 0 and p = 1. But it's still not a great
solution.

>
> Am I missing something?
>
> If not, what to do ?

I don't expect much to come from linear regressions with y = p.
I would still do them, mostly because they're easy, and some sort
of (noisy) pattern may become apparent when you consider the results
from all 240 matrices, but I wouldn't hold my breath.

y = logit[p] seems like a better bet, but given the small inputs
and extreme p's in some matrices, only if the analyses are done as
logistic regressions of p on x, not as linear regressions of logit[p]
on x; and perhaps only if a reduced-bias estimation method is used;
see, for instance,
http://cemsiis.meduniwien.ac.at/kb/wf/software/statistische-software/fllogistf/

Since both p and the input are dependent, you might look for a way
to model them simultaneously. Such a model would have more than
two parameters that would need to be estimated, which would make
doing within-row regressions, where there are only ten data points,
very iffy.

With regard to the within-row regressions, where you seem impressed
by regular changes from row to row, it would be good if you could
expand the model to include parameters that would account for those
changes. Then separate within-row regressions would no longer be
necessary.

Ray Koopman

unread,
Mar 28, 2012, 6:17:26 AM3/28/12
to
On Mar 27, 4:43 pm, djh <halitsk...@att.net> wrote:
> [...] As someone completely naive about statistics, I find it
> fascinating (for reasons I won't go into here) that the residuals
> are orthogonal to the unit vector in simple linear regression.

Instead of "simple linear regression", I should have said "least
squares linear regression with an intercept". "least squares linear"
implies that the residuals must be orthogonal to the space of the
predictors, and "with an intercept" implies that the predictor space
includes the unit vector. QED.
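
A small NumPy illustration of that orthogonality (simulated data, least-squares fit with an intercept):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 1.0 + 0.5 * x + rng.normal(size=30)

X = np.column_stack([np.ones_like(x), x])      # unit vector plus predictor
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

print(resid @ np.ones_like(x))   # ~ 0: orthogonal to the unit vector
print(resid @ x)                 # ~ 0: orthogonal to the predictor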

djh

unread,
Mar 28, 2012, 9:20:17 AM3/28/12
to koo...@sfu.ca
OK - based on what you just said, here's how I'm going to proceed. No
need to reply unless you think I'm planning to proceed incorrectly, in
which case I would of course value your continued guidance.

First, I forgot that you told me to forget correlations and only do
slopes and standard errors. Doing this will remove the "zero" problem
in correlations that I mentioned above.

Second, I will just do the "overalls" and drop the within-row effort
entirely, since even over the web I can see your raised eyebrow about
"just 10" within-row values. Not doing the "within-rows" will solve
the second "zero" problem with logit[p] itself, because I can just
drop the few problem cases from the "overall" data and do the slopes
and standard errors for logit[p] on 52 or 53 values instead of the
full "overall" complement of 55.

Third, for the moment, I will not pursue the "logistic regression"
avenue for the following reason.

What's most important right now from a scientific point of view is the
sign reversal of slope for the "overall" values at different lengths:

"Overall" Slopes

Length y=p y=logit[p]

13-22 0.00676 0.04446
23-32 0.01668 0.06840
33-42 -0.03342 -0.09042
43-52 -0.04813 -0.04278

and the way in which the slopes for the y=RMSD "overalls" behave
relative to these reversals in sign.

So I will continue to get the "overall" slopes and standard errors for
y=p, y=logit[p] and y=RMSD in order to see:

i) whether the complete pattern of sign reversals for slopes of p
and logit[p] makes scientific sense;
ii) whether the complete pattern of slopes for y=RMSD does or does
not correspond to the pattern of sign reversals for slopes of p and
logit[p] in any scientifically meaningful way, and if so how;

If these two complete patterns both make scientific sense, and if they
do not manifest themselves under any of the three control protocols, then
I can come back to you and ask: "OK - what do we have to do further to
show that the observed patterns not only make scientific sense, but
are also statistically meaningful as well?"

Again, no need to respond unless you think there's something wrong
with this approach.

And thank you very much again, Ray.


djh

unread,
Mar 28, 2012, 11:12:34 AM3/28/12
to koo...@sfu.ca
I've found a free online reduced-bias logistic regression calculator
here:

http://www.wessa.net/rwasp_logisticregression.wasp

Could you take a quick look and tell me if you think it's OK to use
it.

If so, could you tell me what my values should be for the "endogenous
series" binary variable in column 1 of the input panel? (I'm assuming
my second column would be my x values and the third would be my y=p
values.)

If you green-light this, I'll go ahead and do the logistic
regressions on p while I'm doing the linear regressions on p and
logit[p].

Ray Koopman

unread,
Mar 28, 2012, 3:40:29 PM3/28/12
to
For your data there would be only two columns. The first is the
dichotomous dependent variable, either 0 or 1; the second is the
predictor, (v1+v2)/2. For each of the 19 possible values of the
predictor***, you would have to enter 'input' rows, all of which
would have the same predictor value in column 2. 'output' of those
rows would have a 1 in column 1; the remaining 'input-output' rows
would have a 0 in column 1.

Obviously, the program is not designed for replicated data,
but the coefficients and their standard errors should still be OK.

You might try asking (in the "I would like to suggest..." box at
the bottom of the page) if there is an alternate input option for
replicated data. (There should also be a corresponding reduction
in the output, in the "Fit of Logistic Regression" table.)
_____

*** It just struck me that altho there are 55 cells in your matrix,
if the predictor is (v1+v2)/2 then there are only 19 different
predictor values. Cells with the same predictor value can be merged,
to give pooled input and output counts. Could have been doing this
all along!
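
For reference, a sketch of the same maximum-likelihood fit done directly from pooled (x, input, output) counts, which is mathematically equivalent to expanding each cell into 'output' ones and 'input - output' zeros as described above. This is ordinary ML, not the reduced-bias method recommended earlier for sparse cells, and the counts are just the (0,0)-(3,0) cells from the table at the top of the thread:

import numpy as np

def grouped_logistic(x, inp, out, iters=25):
    # Newton-Raphson logistic regression of out/inp on x, fitted from pooled counts
    X = np.column_stack([np.ones_like(x, dtype=float), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = inp * p * (1.0 - p)
        grad = X.T @ (out - inp * p)
        beta = beta + np.linalg.solve(X.T @ (X * W[:, None]), grad)
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    W = inp * p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * W[:, None]))
    return beta, np.sqrt(np.diag(cov))    # [intercept, slope] and their standard errors

x   = np.array([0.0, 0.5, 1.0, 1.5])
inp = np.array([2080, 7150, 9685, 9425], float)
out = np.array([1769, 6347, 8563, 8321], float)
print(grouped_logistic(x, inp, out))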

djh

unread,
Mar 28, 2012, 4:12:40 PM3/28/12
to koo...@sfu.ca
I'm really sorry, Ray, but I simply cannot get myself past some kind
of stupid "block" that's preventing me from understanding what you
wrote here:

> For your data there would be only two columns. The first is the
> dichotomous dependent variable, either 0 or 1; the second is the
> predictor, (v1+v2)/2. For each of the 19 possible values of the
> predictor***, you would have to enter 'input' rows, all of which
> would have the same predictor value in column 2. 'output' of those
> rows would have a 1 in column 1; the remaining 'input-output' rows
> would have a 0 in column 1.

First puzzlement: If I am allowed to enter only two columns into the
panel, the first of which has only 0's and 1's, and the second of
which has only the 19 values of x (the 19 unique among the 55 cells of
the matrix), where do I put the actual "y" data from the cells of the
matrix. You've already stipulated the data will only have two columns
and they're both already filled up.

Second puzzlement: what is the "y" data that I'm supposed to enter?
The "p's" (inputs/outputs), or the inputs and outputs themselves ?

I realize I'm "blocking" on something really stupid here, but I just
can't see my way thru it.

Using this mock example for just three predictors:

x input output p
0 1769 2080 0.850480769
0.5 6347 7150 0.887692308
1 8563 9685 0.884150749

could you show me how to fill out the panel provided by the
application?

Final question: once you've gotten me past this block, would it be ok
for me to enter the data unpooled (i.e. for 55 predictors instead of
19), or must the predictors in column 2 be unique?

Again, I apologize for trying your patience here ... really.

PS - judging from your email, I suspect you know some of the folks who
wrote this application ...?

djh

unread,
Mar 28, 2012, 5:03:39 PM3/28/12
to koo...@sfu.ca
Oh wait a second - are you saying that if one cell of my matrix with
predictor 12 has an input of 3 and an output of 2, then the data
should be entered into the panel as:

1 12
1 12
0 12

and if the input and output were 5 and 3, then the data would look
like:

1 12
1 12
1 12
0 12
0 12

djh

unread,
Mar 28, 2012, 5:40:13 PM3/28/12
to koo...@sfu.ca
OK - this one has an option for summary data entry!

http://statpages.org/logistic.html

But I don't know if it implements the "reduced-bias" feature like the
first one claims to do.

If it doesn't, is it still OK to use it?

Ray Koopman

unread,
Mar 28, 2012, 6:14:02 PM3/28/12
to
Yup.

Ray Koopman

unread,
Mar 28, 2012, 6:25:44 PM3/28/12
to

Ray Koopman

unread,
Mar 28, 2012, 6:28:48 PM3/28/12
to
Sorry, I hit the "Send" key by mistake a couple minutes ago.

On Mar 28, 2:40 pm, djh <halitsk...@att.net> wrote:
That's probably just a garden-variety maximum likelihood routine.
It should be OK when both 'out' and 'in-out' are large, say > 30
or so when you have only one predictor, but otherwise you need a
bias-reduced penalized maximum likelihood routine. And no, I
don't know any of the people involved with either routine.

djh

unread,
Mar 28, 2012, 6:59:13 PM3/28/12
to koo...@sfu.ca
OK - here's some output for you to look at from the second site's
calculator:

http://statpages.org/logistic.html

(this is the one that doesn't claim to be reduced-bias.)

You can see a problem I had to overcome at this site if you look at
the "summary" examples which the site provides.

You'll see that they require binary values for predictor as well as
predicted, so I pooled my data even further than 19 points to get
binary 0/1 values corresponding to the predictor ranges 0.5<=x<=4.5
and 5<=x<=9:

0 29269 203483 0.5<=x<=4.5
1 5074 40805 5<=x<=9

The output then was this:

Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 0.1647 0.3709

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 07977.6105
-2 Log Likelihood = 07977.4695
-2 Log Likelihood = 07977.4695
-2 Log Likelihood = 07977.4695 (Converged)

Overall Model Fit...
Chi Square= 83.5084; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.1456 0.0161 0.0000
Intercept 1.9391

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.1568 1.1207 1.1939

X1 n0 n1 Calc Prob
0.0000 29269 203483 0.8742
1.0000 5074 40805 0.8894

Can you tell anything from these results as to whether 0.5<=x<=4.5
performs significantly better or worse than 5<=x<=9?

If so, this would be good enough for the scientific claim we're trying
to make, but I wish there was a site that allowed non-binary predictor
values. Also, of course, if you see anything meaningful in the data,
I would have to do the same kind of binary split for my RMSD dependent
variable (which is easily doable, actually)

One last question: would it help to tell you anything if I were to
pool the data over a larger number of smaller intervals, e.g.

0: 0.5<=x<=2
1: 2.5<=x<=4

and then:

0: 4.5<=x<=6
1: 6.5<=x<=8

Very much looking forward to your response to all of the above, even
if you tell me that it's time to shutter the shop and lock the doors
after us ....

And thanks again for your patience, as always ...



djh

unread,
Mar 28, 2012, 7:34:32 PM3/28/12
to koo...@sfu.ca
OK - below is the output from http://statpages.org/logistic.html for
the length=30 case (the previous case was length = 20.) Also, please
note that in the previous case, I typo'd the lower x range as
0.5<=x<=4.5, when it should have been 0.0<=x<=4.5.

Length 30

Binary-pooled data:

0 32175 131648 0.0<=x<=4.5
1 2915 22067 5<=x<=9

Output from logistic regression:


Descriptives...

35090 cases have Y=0; 153715 cases have Y=1.

Variable Avg SD
1 0.1323 0.3388

Iteration History...
-2 Log Likelihood = 81310.8671 (Null Model)
-2 Log Likelihood = 80328.4727
-2 Log Likelihood = 80307.1466
-2 Log Likelihood = 80307.1237
-2 Log Likelihood = 80307.1237 (Converged)

Overall Model Fit...
Chi Square= 1003.7434; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.6153 0.0207 0.0000
Intercept 1.4089

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.8502 1.7767 1.9266

X1 n0 n1 Calc Prob
0.0000 32175 131648 0.8036
1.0000 2915 22067 0.8833

djh

unread,
Mar 28, 2012, 8:56:33 PM3/28/12
to koo...@sfu.ca
Below are the logistic regression outputs for lengths 40 and 50.

Before you look at them, please consider this table:

"statpages"
Corr Slope Log Reg
Length y=p y=p Coeff

20 0.34216 0.00676 0.1456
30 0.34984 0.01930 0.6153
40 -0.46108 -0.03342 -0.3288
50 -0.45630 -0.04813 -0.2410

As a complete ignoramus, it looks to me like the signs of the
coefficients from the logistic regressions are patterning with, not
against, the signs of the correlations and slopes from the linear
regressions with y = p.

Of course, I'm hoping you'll be able to tell me that this similarity
means that we have more reason to attach statistical significance to
the switch in sign than we would if we only had this switch in the
coefficients and slopes from the linear regressions.

In any event, here are the outputs from the logistic regressions for
lengths 40 and 50.

*********************************************

Length 40 Logistic Regression Data

Descriptives...

67556 cases have Y=0; 55278 cases have Y=1.

Variable Avg SD
1 0.0661 0.2484

Iteration History...
-2 Log Likelihood = 69054.7695 (Null Model)
-2 Log Likelihood = 68857.9864
-2 Log Likelihood = 68857.8934
-2 Log Likelihood = 68857.8934 (Converged)

Overall Model Fit...
Chi Square= 196.8760; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.3288 0.0237 0.0000
Intercept -0.1793

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.7198 0.6872 0.7540

X1 n0 n1 Calc Prob
0.0000 62488 52229 0.4553
1.0000 5068 3049 0.3756

************************************************

Length 50 Logistic Regression Data

Descriptives...

28666 cases have Y=0; 38763 cases have Y=1.

Variable Avg SD
1 0.0608 0.2391

Iteration History...
-2 Log Likelihood = 91958.7888 (Null Model)
-2 Log Likelihood = 91903.2527
-2 Log Likelihood = 91903.2441
-2 Log Likelihood = 91903.2441 (Converged)

Overall Model Fit...
Chi Square= 55.5448; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.2410 0.0323 0.0000
Intercept 0.3166

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.7858 0.7377 0.8371

X1 n0 n1 Calc Prob
-0.0000 26692 36634 0.5785
1.0000 1974 2129 0.5189


djh

unread,
Mar 28, 2012, 9:57:59 PM3/28/12
to koo...@sfu.ca, jcp1...@gmail.com
This is to correct a misimpression I have given regarding John
Pezzullo's logistic regression tool at

http://statpages.org/logistic.html

It will run non-binary (multiple) predictors in logistic regression,
as John has explained in an email which appears in full below. John
was also kind enough to provide some thoughts on the case which I sent
him, for which I again thank him very much.

John's email:


Hi David,


I'm glad you've found my online logistic regression calculator to be
useful.

Logistic regression can take one or more continuous (numeric) and/or
categorical predictors, so your data set can be analyzed by my web
page. For grouped data (such as your dataset), the last two columns
would be the number of No and Yes outcomes, just as you have it.


It's important to separate the numbers on each row of data by commas
or tab characters, and not just have them separated by blank spaces.
I copied the data from your e-mail into Excel, and used the "Text to
Columns" entry in the "Tools" menu to get the data into three distinct
columns, then copy/pasted it into the web page. I set the number of
data points to 18, and the number of predictor variables to 1, and
then clicked the Solve button. The page produced the following output:


Descriptives...


34032 cases have Y=0; 242519 cases have Y=1.


Variable Avg SD
1 3.2100 1.5412


Iteration History...
-2 Log Likelihood = 06293.3370 (Null Model)
-2 Log Likelihood = 06292.2564
-2 Log Likelihood = 06292.2564
-2 Log Likelihood = 06292.2564 (Converged)


Overall Model Fit...
Chi Square= 1.0806; df=1; p= 0.2986


Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0039 0.0038 0.2984
Intercept 1.9763


Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9961 0.9888 1.0035


X1 n0 n1 Calc Prob
0.5000 803 6347 0.8781
1.0000 1699 13981 0.8779
1.5000 2734 23081 0.8777
2.0000 3891 29455 0.8775
2.5000 4659 32341 0.8772
3.0000 4839 30548 0.8770
3.5000 4303 26531 0.8768
4.0000 3408 21721 0.8766
4.5000 2622 17709 0.8764
5.0000 1849 13883 0.8762
5.5000 1280 10462 0.8760
6.0000 908 7045 0.8758
6.5000 571 4156 0.8756
7.0000 254 2510 0.8753
7.5000 114 1323 0.8751
8.0000 69 742 0.8749
8.5000 18 442 0.8747
9.0000 11 242 0.8745


The regression coefficient for the single predictor variable is
-0.0039 with a standard error of +/- 0.0038, which indicates that
larger values of the predictor are associated with a smaller
proportion of the subjects who have the outcome variable. (Actually,
the SE is about the same magnitude as the coefficient itself, so the
regression coefficient is not significantly different from zero
(p=0.3), so it's not really indicating anything about the direction of
the relationship.)


If we plot that proportion-of-subjects-with-outcome vs. predictor
variable, we see that the data appears to show a generally increasing
trend with increasing X values. (I'm attaching an Excel file with the
data and the graph).


But the points don't really seem to look much like the typical
"S-shaped" curve that the logistic model is meant to fit. Usually
logistic data has small proportions at one end of the predictor range,
and large proportions at the other end. The proportions in your data
have a very small range of variability (ranging from 0.86 to 0.96),
implying that your proportions are confined to the upper part of the S-
curve, where it should be asymptotically approaching the maximum
possible value of 1. This would make the regression somewhat ill-
conditioned, but it should still work.


But what's more troubling is that the plot clearly shows that the data
does not seem to be conforming to a logistic model at all -- it starts
out (at the left side of the graph) around 0.9, then dips down to 0.86
near the middle, then climbs up to 0.96 at the right side. The
logistic function does not have this kind of minimum near its upper
range, and a logistic curve cannot twist itself to fit that
kind of dip in the middle. That's probably why the regression
coefficient is negative, even though the general trend of the
curve seems to be increasing -- the data doesn't follow a logistic
shape, so the logistic regression doesn't really produce a realistic
result in this case.


Without knowing what the data represents, I can't offer any other
suggestions. It's possible that there's some other kind of
theoretical model underlying the observations, which would suggest
some other kind of formula to fit to the data instead of a logistic
model.


Best Regards,
John.

John C. Pezzullo, PhD
Kissimmee, FL

401-603-1648 (home/office)
703-593-5815 (cell)
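
(For anyone who would rather reproduce this kind of grouped fit locally
than through the web page, below is a minimal sketch in Python with
statsmodels; the x, n0, n1 layout is the same one John describes, and
the package is just my choice, not what his page runs.)

import numpy as np
import statsmodels.api as sm

# x, n0 (No), n1 (Yes) -- the first three rows of the grouped data above
rows = [(0.5, 803, 6347), (1.0, 1699, 13981), (1.5, 2734, 23081)]
x  = np.array([r[0] for r in rows])
n0 = np.array([r[1] for r in rows], dtype=float)
n1 = np.array([r[2] for r in rows], dtype=float)

X = sm.add_constant(x)                      # intercept plus the single predictor
# a two-column endog of (successes, failures) gives a grouped binomial fit
fit = sm.GLM(np.column_stack([n1, n0]), X, family=sm.families.Binomial()).fit()
print(fit.params)                           # [intercept, slope] on the logit scale
print(fit.bse)                              # their standard errors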



djh

unread,
Mar 28, 2012, 10:15:04 PM3/28/12
to koo...@sfu.ca, jcp1...@gmail.com
Here are the results of re-running the logistic regressions for
lengths 20,30,40,50 using John Pezzullo's calculator at:

http://statpages.org/logistic.html

As you can see, multiple non-binary predictors work just fine (my bad
for claiming they don't!)

****************************************************
Length 20
****************************************************
Input Data:

0,311,1769
0.5,803,6347
1,1699,13981
1.5,2734,23081
2,3891,29455
2.5,4659,32341
3,4839,30548
3.5,4303,26531
4,3408,21721
4.5,2622,17709
5,1849,13883
5.5,1280,10462
6,908,7045
6.5,571,4156
7,254,2510
7.5,114,1323
8,69,742
8.5,18,442
9,11,242

Results:

Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 3.1861 1.5601

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 08060.8373
-2 Log Likelihood = 08060.8373
-2 Log Likelihood = 08060.8373 (Converged)

Overall Model Fit...
Chi Square= 0.1406; df=1; p= 0.7077

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0014 0.0037 0.7076
Intercept 1.9664

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9986 0.9914 1.0059

X1 n0 n1 Calc Prob
0.0000 311 1769 0.8772
0.5000 803 6347 0.8771
1.0000 1699 13981 0.8771
1.5000 2734 23081 0.8770
2.0000 3891 29455 0.8769
2.5000 4659 32341 0.8768
3.0000 4839 30548 0.8768
3.5000 4303 26531 0.8767
4.0000 3408 21721 0.8766
4.5000 2622 17709 0.8765
5.0000 1849 13883 0.8765
5.5000 1280 10462 0.8764
6.0000 908 7045 0.8763
6.5000 571 4156 0.8762
7.0000 254 2510 0.8762
7.5000 114 1323 0.8761
8.0000 69 742 0.8760
8.5000 18 442 0.8759
9.0000 11 242 0.8759

****************************************************
Length 30
****************************************************
Input Data:

0,331,845
0.5,1211,3738
1,2363,9939
1.5,3410,17124
2,4015,21954
2.5,4212,22434
3,3849,19619
3.5,8804,10964
4,2325,13606
4.5,1655,11425
5,1180,8476
5.5,776,5796
6,472,3510
6.5,260,1939
7,115,1209
7.5,73,603
8,23,375
8.5,14,116
9,2,43

Results:

Descriptives...

35090 cases have Y=0; 153715 cases have Y=1.

Variable Avg SD
1 3.0081 1.4785

Iteration History...
-2 Log Likelihood = 81310.8671 (Null Model)
-2 Log Likelihood = 81119.5394
-2 Log Likelihood = 81119.3672
-2 Log Likelihood = 81119.3672
-2 Log Likelihood = 81119.3672 (Converged)

Overall Model Fit...
Chi Square= 191.4999; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0560 0.0041 0.0000
Intercept 1.3109

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0576 1.0492 1.0661

X1 n0 n1 Calc Prob
0.0000 331 845 0.7877
0.5000 1211 3738 0.7923
1.0000 2363 9939 0.7969
1.5000 3410 17124 0.8014
2.0000 4015 21954 0.8058
2.5000 4212 22434 0.8101
3.0000 3849 19619 0.8144
3.5000 8804 10964 0.8186
4.0000 2325 13606 0.8227
4.5000 1655 11425 0.8268
5.0000 1180 8476 0.8307
5.5000 776 5796 0.8346
6.0000 472 3510 0.8385
6.5000 260 1939 0.8422
7.0000 115 1209 0.8459
7.5000 73 603 0.8495
8.0000 23 375 0.8531
8.5000 14 116 0.8565
9.0000 2 43 0.8599

****************************************************
Length 40
****************************************************
Input Data:

0,393,202
0.5,1877,1448
1,5312,4508
1.5,9447,8133
2,12351,9607
2.5,11264,8702
3,8856,7392
3.5,6062,5686
4,4058,4057
4.5,2868,2494
5,2067,1453
5.5,1398,873
6,780,383
6.5,409,198
7,190,86
7.5,97,38
8,76,14
8.5,43,2
9,8,2

Results:

Descriptives...

67556 cases have Y=0; 55278 cases have Y=1.

Variable Avg SD
1 2.6319 1.2622

Iteration History...
-2 Log Likelihood = 69054.7695 (Null Model)
-2 Log Likelihood = 69039.0747
-2 Log Likelihood = 69039.0746
-2 Log Likelihood = 69039.0746 (Converged)

Overall Model Fit...
Chi Square= 15.6949; df=1; p= 0.0001

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0180 0.0045 0.0001
Intercept -0.1532

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9821 0.9734 0.9909

X1 n0 n1 Calc Prob
0.0000 393 202 0.4618
0.5000 1877 1448 0.4595
1.0000 5312 4508 0.4573
1.5000 9447 8133 0.4551
2.0000 12351 9607 0.4528
2.5000 11264 8702 0.4506
3.0000 8856 7392 0.4484
3.5000 6062 5686 0.4461
4.0000 4058 4057 0.4439
4.5000 2868 2494 0.4417
5.0000 2067 1453 0.4395
5.5000 1398 873 0.4373
6.0000 780 383 0.4350
6.5000 409 198 0.4328
7.0000 190 86 0.4306
7.5000 97 38 0.4284
8.0000 76 14 0.4262
8.5000 43 2 0.4240
9.0000 8 2 0.4218


****************************************************
Length 50
****************************************************
Input Data:

0,294,202
0.5,853,1323
1,2115,3331
1.5,3401,5411
2,4604,6235
2.5,4505,6062
3,4142,5097
3.5,3076,4100
4,2236,2986
4.5,1466,1887
5,915,1105
5.5,568,578
6,289,285
6.5,121,112
7,57,40
7.5,19,9
8,5,0
8.5,0,0
9,0,0

Results:

Descriptives...

28666 cases have Y=0; 38763 cases have Y=1.

Variable Avg SD
1 2.6539 1.2487

Iteration History...
-2 Log Likelihood = 91958.7888 (Null Model)
-2 Log Likelihood = 91862.7620
-2 Log Likelihood = 91862.7590
-2 Log Likelihood = 91862.7590 (Converged)

Overall Model Fit...
Chi Square= 96.0299; df=1; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0611 0.0062 0.0000
Intercept 0.4643

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9407 0.9293 0.9523

X1 n0 n1 Calc Prob
0.0000 294 202 0.6140
0.5000 853 1323 0.6068
1.0000 2115 3331 0.5995
1.5000 3401 5411 0.5921
2.0000 4604 6235 0.5847
2.5000 4505 6062 0.5773
3.0000 4142 5097 0.5698
3.5000 3076 4100 0.5623
4.0000 2236 2986 0.5548
4.5000 1466 1887 0.5472
5.0000 915 1105 0.5396
5.5000 568 578 0.5320
6.0000 289 285 0.5244
6.5000 121 112 0.5168
7.0000 57 40 0.5092
7.5000 19 9 0.5015
8.0000 5 0 0.4939
8.5000 0 0 0.4863
9.0000 0 0 0.4786
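
(For what it's worth, the Calc Prob column above is just the fitted
curve 1/(1+exp(-(intercept + coeff*x))), which can be put next to the
raw proportions n1/(n0+n1). A minimal sketch using the Length 50
coefficients and first few rows above:)

import math

intercept, coeff = 0.4643, -0.0611                              # Length 50 fit above
rows = [(0.0, 294, 202), (0.5, 853, 1323), (1.0, 2115, 3331)]   # x, n0, n1

for x, n0, n1 in rows:
    p_raw = n1 / (n0 + n1)                                    # observed proportion
    p_fit = 1.0 / (1.0 + math.exp(-(intercept + coeff * x)))  # logistic prediction
    print(x, round(p_raw, 4), round(p_fit, 4))                # p_fit reproduces Calc Prob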

Ray Koopman

unread,
Mar 29, 2012, 3:42:54 PM3/29/12
to
I've looked at plots of p against x for the data you analyzed with
John's program, and I agree with his comment that the relations do
not look logistic. I also have some suggestions about what to do.

First, make sure the data are correct. I went back to the Length-30
data at
https://groups.google.com/group/sci.stat.math/msg/1a9f4ccc8451a240
and found that the square versions of both the input and output
matrices are not symmetric. Only one pair disagrees in the output,
but in the input there is one pair plus the entire row/col 8 (except
for the diagonal). Since columns 8 & 9 are the same except for the
bottom row, but row 8 & 9 are never equal, it looks as if column 8 is
bad. However, the upper-triangle summary in the same post appears to
use column 8, not row 8. This sort of thing has happened often enough
now to make me wonder about all the data.
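
(A quick mechanical check along these lines is to read each 10x10 back
in and list every off-diagonal cell that disagrees with its transpose.
A minimal sketch in Python/numpy; the filename is only a placeholder:)

import numpy as np

A = np.loadtxt("len30_input.txt")        # hypothetical file holding the 10x10 count matrix
bad = [(i, j, A[i, j], A[j, i])
       for i in range(A.shape[0]) for j in range(i)
       if A[i, j] != A[j, i]]            # off-diagonal cells that break symmetry
for i, j, a, b in bad:
    print(f"({i},{j}) = {a}   ({j},{i}) = {b}")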

Second, it may be time to complicate the model. A plot of all 55 p's
against x shows considerable variability within levels of x. Why? One
explanation might be that the lengths differ, that the range 33-42 is
large enough to matter. Try putting the actual length into the model.
Another might be that v1-v2 matters. Put it in, too. Another might
be that we should be using the continuous u1 and u2 instead of v1
and v2. And it may be better to model both the input and either the
output or the proportion. I don't know which of those might be the
most important, or what the model should look like -- those are
substantive calls, not statistical ones.

djh

unread,
Mar 29, 2012, 8:16:35 PM3/29/12
to koo...@sfu.ca
As always, Ray, thanks very much for taking the time and my apologies
for any mistakes still in the data - I thought I had gotten rid of
them all. I am ashamed - there's no reason for me to raise that kind
of dust, particularly when you're not billing.

The reason I don't want to go right now to unquantized u rather than
quantized v is because the other half of our central hypothesis (the
protein message side as opposed to the protein structure side) is
remarkably well supported by the following correlations of our 10
quantized v values 0,...9 with 10 levels of a quantized energetic
variable e. (Note how the signs of the coefficients flip perfectly in
our control results, and note also that all these results are entirely
length-independent.)

results for
v (quantized)
against "e"

Corr F RSquare

-0.850 0.0015 0.72 globins
-0.930 0.0001 0.86 cytochrome C's

-0.960 0.00001 0.92 immunoglobulins
-0.730 0.0152 0.53 trypsins

-0.970 0.000003 0.94 TIM-barrels
-0.800 0.0050 0.64 NADP-binders


control results for
v (quantized)
against "e"

Corr F Rsquare

0.870 0.0010 0.76 globins
0.920 0.0002 0.85 cytochrome C's

0.670 0.0332 0.45 immunoglobulins
0.730 0.0150 0.53 trypsins

0.760 0.0110 0.58 TIM-barrels
0.930 0.0001 0.86 NADP-binders

So if we go with unquantized u or re-quantize u differently on this
side of the hypothesis, we would have to unquantize or requantize it
correspondingly on the other side of the hypothesis, and although I
strongly suspect that we could do so without losing the remarkable
correlations shown above, I am loath to do so if we're not sure we
have to. Also, from a scientific point of view, our critics could
argue that if quantized v works against e on the message side of the
hypothesis, then quantized v should also work against some variable(s)
on the structure side - otherwise, something is amiss.


What I am going to do first is add u1-u2 to the model, and rather than
add length to the model, I am going to pool ACROSS lengths.

The reason I'm going to add u1-u2 is because it makes perfect
scientific sense that our "filter" should not behave the same way with
respect to a (v1+v2)/2 value of 4.5 obtained from 4 and 5, and the same
value of 4.5 obtained from 0 and 9.

The reason I'm going to pool across lengths is three-fold:

a) it was really only a guess on my part that length might be a
category that has to be kept relatively constant when analyzing v
against y = p (yield proportion.)

b) adding u1-u2 to the model is going to dangerously decrease
population per cell unless I pool across lengths, in which case, no
such decrease will generally occur.

c) if the results shown above for "the other side of the street" are
length-independent, then our critics could argue that our results on
this side of the street should also be length-independent.

Finally, regarding both length and v, there is a matter of sampling
that I need to at least make you aware of.

Right now, our length intervals are 13-22,23-32,...,93-102,103-112,
and we never feed our "filter" (Dr. Lesk's structural alignment
program) an input pair that is drawn from two different length
intervals. This keeps us from asking the filter to do silly things
with silly input pairs of greatly disparate lengths, but it does raise
an obvious sampling issue.

In particular, neither nature nor Dr. Lesk's program considers the
lengths 21 and 22 to be somehow closer than 22 and 23, or similarly,
the lengths 23 and 24 to be closer than 21 and 22. Yet of these three
possible classes of input pairs defined by length, we only feed our
filter pairs from one. So I know we're implicitly "sampling" here,
but I don't know if we're doing so legitimately or illegitimately -
perhaps you can advise here.

And of course, the same implicit "sampling" question arises in the
case of our quantized v. The levels 0,...,9 of v correspond, in the
case of the globin proteins, to 10 ranges
0.01-0.50,0.51-1.00,...,4.01-4.50,4.51-5.00. So our protocol
distinguishes between two values such as 4.50 and 4.51 in a way that
it does not distinguish between the equally close values 4.49 and 4.50
or 4.51 and 4.52. Again, perhaps you can advise how bad a problem
this actually is.
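
(To make the boundary issue concrete, here is a minimal sketch of the
binning itself, assuming -- as in the globin case just described -- ten
half-unit ranges 0.01-0.50 through 4.51-5.00:)

import math

def quantize_v(u):
    # map a raw u in (0, 5] to a level 0..9: 0.01-0.50 -> 0, ..., 4.51-5.00 -> 9
    return min(math.ceil(u / 0.5) - 1, 9)

print(quantize_v(4.50), quantize_v(4.51))   # 8 9 : adjacent values split across bins
print(quantize_v(4.49), quantize_v(4.50))   # 8 8 : equally close values are not split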

In any event, when I add v1-v2 and pool across lengths, please be
assured I will make sure all data are correct, and when I have the
results from John's program, I will post back. Shouldn't take that
much effort.



djh

unread,
Mar 30, 2012, 5:43:52 AM3/30/12
to koo...@sfu.ca
Please note that u1-u2 should have been v1-v2 in these paragraphs of my
last post:

1)
"What I am going to do first is add u1-u2 to the model, and rather
than
add length to the model, I am going to pool ACROSS lengths. "

2)
The reason I'm going to add u1-u2 is because it makes perfect
scientific sense that our "filter" should not behave the same way with
respect to a (v1+v2)/2 value of 4.5 obtained from 4 and 5, and the same
value of 4.5 obtained from 0 and 9.

3)
b) adding u1-u2 to the model is going to dangerously decrease
population per cell unless I pool across lengths, in which case, no
such decrease will generally occur.

Sorry for the confusion - I was composing rapidly right before
boarding a flight ...


djh

unread,
Mar 30, 2012, 7:59:06 AM3/30/12
to koo...@sfu.ca
Ray - I forgot to ask you one basic question last night.

Suppose that sufficient and insightful complication of the model
produces a satisfactory logistic regression involving the "yield-
related" properties of the "filter".

Is the fact that the "yield-related" properties of the filter are best
modellable by a logistic regression incompatible with the possibility
that the "RMSD-related" property of the filter might be modelled by a
linear regression? Or could these models both exist without conflict
from a purely statistical point of view?

(The "RMSD-related" property has to do with the degree of the
similarity between the two members of each output pair that the filter
"yields" from its input pairs, not how many output pairs are yielded
from input pairs.)

djh

unread,
Mar 30, 2012, 12:32:29 PM3/30/12
to koo...@sfu.ca
Before going to pooled length data, I decided just to add v1-v2 to the
model while still keeping the lengths distinct.

Below are the results for lengths 20,30,40,50,60 with predictors
(v1+v2)/2 and v1-v2. (All input data is error-free - this AM I
rechecked all five 10x10's for symmetry before doing anything else.)

Please take a look if/when you have a chance and let me know what you
think.

Length 20

Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 3.1861 1.5601
2 2.4325 1.9607

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 07579.6536
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873 (Converged)

Overall Model Fit...
Chi Square= 483.0906; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0341 0.0039 0.0000
2 0.0715 0.0033 0.0000
Intercept 1.9027

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9665 0.9591 0.9740

2 1.0742 1.0673 1.0811

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 311 1769 0.8702
0.5000 1.0000 803 6347 0.8762
1.0000 0.0000 577 5418 0.8663
1.0000 2.0000 1122 8563 0.8820
1.5000 1.0000 1630 14760 0.8725
1.5000 3.0000 1104 8321 0.8876
2.0000 0.0000 1153 9873 0.8623
2.0000 2.0000 1601 14349 0.8784
2.0000 4.0000 1137 5233 0.8929
2.5000 1.0000 2387 19218 0.8686
2.5000 3.0000 1724 9056 0.8841
2.5000 5.0000 548 4067 0.8980
3.0000 0.0000 1273 9167 0.8582
3.0000 2.0000 2480 12122 0.8747
3.0000 4.0000 820 6990 0.8896
3.0000 6.0000 266 2269 0.9029
3.5000 1.0000 2522 11688 0.8647
3.5000 3.0000 1194 9385 0.8806
3.5000 5.0000 381 3909 0.8948
3.5000 7.0000 206 1549 0.9076
4.0000 0.0000 1136 3617 0.8540
4.0000 2.0000 1265 9030 0.8710
4.0000 4.0000 607 5204 0.8862
4.0000 6.0000 301 2669 0.8999
4.0000 8.0000 99 1201 0.9120
4.5000 1.0000 1281 5677 0.8607
4.5000 3.0000 658 4997 0.8770
4.5000 5.0000 450 3573 0.8916
4.5000 7.0000 101 2099 0.9047
4.5000 9.0000 132 1363 0.9163
5.0000 0.0000 337 2148 0.8497
5.0000 2.0000 736 3086 0.8671
5.0000 4.0000 443 3472 0.8827
5.0000 6.0000 185 2795 0.8967
5.0000 8.0000 148 2382 0.9093
5.5000 1.0000 355 2414 0.8565
5.5000 3.0000 480 2166 0.8732
5.5000 5.0000 205 2695 0.8883
5.5000 7.0000 240 3187 0.9017
6.0000 0.0000 109 632 0.8453
6.0000 2.0000 264 1653 0.8631
6.0000 4.0000 263 1697 0.8791
6.0000 6.0000 272 3063 0.8935
6.5000 1.0000 131 922 0.8523
6.5000 3.0000 122 1298 0.8694
6.5000 5.0000 318 1936 0.8848
7.0000 0.0000 46 305 0.8408
7.0000 2.0000 52 728 0.8590
7.0000 4.0000 156 1477 0.8755
7.5000 1.0000 38 502 0.8480
7.5000 3.0000 76 821 0.8655
8.0000 0.0000 2 188 0.8362
8.0000 2.0000 67 554 0.8548
8.5000 1.0000 18 442 0.8435
9.0000 0.0000 11 242 0.8315

*********************************************************
Length 30

Descriptives...

35090 cases have Y=0; 153715 cases have Y=1.

Variable Avg SD
1 3.0081 1.4785
2 2.2900 1.8783

Iteration History...
-2 Log Likelihood = 81310.8671 (Null Model)
-2 Log Likelihood = 80999.3103
-2 Log Likelihood = 80998.6696
-2 Log Likelihood = 80998.6696
-2 Log Likelihood = 80998.6696 (Converged)

Overall Model Fit...
Chi Square= 312.1976; df=2; p= -0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0778 0.0046 0.0000
2 -0.0386 0.0035 0.0000
Intercept 1.3349

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0809 1.0713 1.0907

2 0.9622 0.9556 0.9688

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 331 845 0.7917
0.5000 1.0000 1211 3738 0.7917
1.0000 0.0000 1185 3865 0.8042
1.0000 2.0000 1178 6074 0.7918
1.5000 1.0000 2413 12535 0.8043
1.5000 3.0000 997 4589 0.7918
2.0000 0.0000 1203 9675 0.8162
2.0000 2.0000 2036 9478 0.8043
2.0000 4.0000 776 2801 0.7919
2.5000 1.0000 2146 14726 0.8162
2.5000 3.0000 1606 5767 0.8044
2.5000 5.0000 460 1941 0.7919
3.0000 0.0000 939 5502 0.8276
3.0000 2.0000 1795 9009 0.8163
3.0000 4.0000 918 4031 0.8044
3.0000 6.0000 197 1077 0.7920
3.5000 1.0000 1514 6808 0.8276
3.5000 3.0000 6678 574 0.8163
3.5000 5.0000 412 2214 0.8045
3.5000 7.0000 200 1368 0.7921
4.0000 0.0000 601 2027 0.8384
4.0000 2.0000 789 4797 0.8277
4.0000 4.0000 394 3454 0.8164
4.0000 6.0000 407 2825 0.8045
4.0000 8.0000 134 503 0.7921
4.5000 1.0000 660 2917 0.8384
4.5000 3.0000 348 2616 0.8277
4.5000 5.0000 313 4423 0.8164
4.5000 7.0000 272 1041 0.8046
4.5000 9.0000 62 428 0.7922
5.0000 0.0000 172 1004 0.8487
5.0000 2.0000 319 1579 0.8385
5.0000 4.0000 287 3361 0.8278
5.0000 6.0000 260 1664 0.8165
5.0000 8.0000 142 868 0.8046
5.5000 1.0000 155 1119 0.8487
5.5000 3.0000 297 2039 0.8385
5.5000 5.0000 228 1254 0.8278
5.5000 7.0000 96 1384 0.8165
6.0000 0.0000 33 292 0.8584
6.0000 2.0000 142 1426 0.8487
6.0000 4.0000 194 755 0.8386
6.0000 6.0000 103 1037 0.8279
6.5000 1.0000 48 784 0.8584
6.5000 3.0000 109 528 0.8488
6.5000 5.0000 103 627 0.8386
7.0000 0.0000 21 475 0.8676
7.0000 2.0000 47 291 0.8585
7.0000 4.0000 47 443 0.8488
7.5000 1.0000 50 366 0.8676
7.5000 3.0000 23 237 0.8585
8.0000 0.0000 15 63 0.8763
8.0000 2.0000 8 312 0.8677
8.5000 1.0000 14 116 0.8763
9.0000 0.0000 2 43 0.8845

*********************************************************
Length 40

Descriptives...

67543 cases have Y=0; 55271 cases have Y=1.

Variable Avg SD
1 2.6309 1.2600
2 1.8954 1.6584

Iteration History...
-2 Log Likelihood = 69028.0461 (Null Model)
-2 Log Likelihood = 69012.6513
-2 Log Likelihood = 69012.6513
-2 Log Likelihood = 69012.6513 (Converged)

Overall Model Fit...
Chi Square= 15.3948; df=2; p= 0.0005

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0159 0.0052 0.0021
2 -0.0028 0.0039 0.4797
Intercept -0.1535

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9842 0.9744 0.9942

2 0.9972 0.9896 1.0049

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 393 202 0.4617
0.5000 1.0000 1877 1448 0.4590
1.0000 0.0000 2055 2410 0.4578
1.0000 2.0000 3257 2098 0.4564
1.5000 1.0000 7523 7012 0.4551
1.5000 3.0000 1924 1121 0.4537
2.0000 0.0000 6669 4959 0.4538
2.0000 2.0000 4495 3770 0.4524
2.0000 4.0000 1187 878 0.4511
2.5000 1.0000 7971 5340 0.4512
2.5000 3.0000 2728 2877 0.4498
2.5000 5.0000 565 485 0.4484
3.0000 0.0000 2357 1384 0.4499
3.0000 2.0000 4950 4077 0.4485
3.0000 4.0000 1241 1609 0.4471
3.0000 6.0000 308 322 0.4458
3.5000 1.0000 2958 2175 0.4472
3.5000 3.0000 2311 2279 0.4459
3.5000 5.0000 611 1099 0.4445
3.5000 7.0000 182 133 0.4431
4.0000 0.0000 899 812 0.4460
4.0000 2.0000 1397 1213 0.4446
4.0000 4.0000 1185 1569 0.4432
4.0000 6.0000 428 437 0.4419
4.0000 8.0000 149 26 0.4405
4.5000 1.0000 862 908 0.4433
4.5000 3.0000 745 821 0.4419
4.5000 5.0000 738 639 0.4406
4.5000 7.0000 402 73 0.4392
4.5000 9.0000 122 53 0.4379
5.0000 0.0000 204 231 0.4420
5.0000 2.0000 439 623 0.4407
5.0000 4.0000 446 337 0.4393
5.0000 6.0000 663 102 0.4379
5.0000 8.0000 315 160 0.4366
5.5000 1.0000 206 334 0.4394
5.5000 3.0000 280 251 0.4380
5.5000 5.0000 374 61 0.4367
5.5000 7.0000 538 227 0.4353
6.0000 0.0000 73 90 0.4381
6.0000 2.0000 135 135 0.4368
6.0000 4.0000 253 42 0.4354
6.0000 6.0000 319 116 0.4340
6.5000 1.0000 70 92 0.4355
6.5000 3.0000 128 22 0.4341
6.5000 5.0000 211 84 0.4328
7.0000 0.0000 22 14 0.4342
7.0000 2.0000 75 15 0.4329
7.0000 4.0000 100 50 0.4315
7.5000 1.0000 40 5 0.4316
7.5000 3.0000 57 33 0.4302
8.0000 0.0000 45 0 0.4303
8.0000 2.0000 31 14 0.4290
8.5000 1.0000 22 2 0.4277
9.0000 0.0000 8 2 0.4264

*********************************************************
Length 50

Descriptives...

28764 cases have Y=0; 38763 cases have Y=1.

Variable Avg SD
1 2.6602 1.2592
2 1.9564 1.5974

Iteration History...
-2 Log Likelihood = 92126.2478 (Null Model)
-2 Log Likelihood = 92000.3801
-2 Log Likelihood = 92000.3751
-2 Log Likelihood = 92000.3751 (Converged)

Overall Model Fit...
Chi Square= 125.8727; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0684 0.0066 0.0000
2 -0.0018 0.0052 0.7256
Intercept 0.4844

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9339 0.9218 0.9461

2 0.9982 0.9880 1.0085

X1 X2 n0 n1 Calc Prob
-0.0000 0.0000 294 202 0.6188
0.5000 1.0000 853 1323 0.6103
1.0000 0.0000 431 1847 0.6025
1.0000 2.0000 1684 1484 0.6016
1.5000 1.0000 2342 4390 0.5939
1.5000 3.0000 1059 1021 0.5930
2.0000 0.0000 2368 2483 0.5860
2.0000 2.0000 1450 2970 0.5851
2.0000 4.0000 786 782 0.5843
2.5000 1.0000 3021 3414 0.5773
2.5000 3.0000 1041 2291 0.5764
2.5000 5.0000 443 357 0.5755
3.0000 0.0000 959 1121 0.5694
3.0000 2.0000 2265 2586 0.5685
3.0000 4.0000 628 1072 0.5676
3.0000 6.0000 290 318 0.5667
3.5000 1.0000 1420 1765 0.5605
3.5000 3.0000 1205 1270 0.5596
3.5000 5.0000 351 941 0.5587
3.5000 7.0000 100 124 0.5578
4.0000 0.0000 513 663 0.5525
4.0000 2.0000 771 854 0.5516
4.0000 4.0000 826 1055 0.5507
4.0000 6.0000 92 384 0.5498
4.0000 8.0000 34 30 0.5489
4.5000 1.0000 580 645 0.5436
4.5000 3.0000 540 695 0.5427
4.5000 5.0000 262 431 0.5418
4.5000 7.0000 28 108 0.5409
4.5000 9.0000 56 8 0.5399
5.0000 0.0000 161 139 0.5356
5.0000 2.0000 405 526 0.5346
5.0000 4.0000 162 293 0.5337
5.0000 6.0000 83 115 0.5328
5.0000 8.0000 104 32 0.5319
5.5000 1.0000 210 265 0.5266
5.5000 3.0000 123 220 0.5257
5.5000 5.0000 51 79 0.5247
5.5000 7.0000 184 14 0.5238
6.0000 0.0000 65 106 0.5185
6.0000 2.0000 62 113 0.5176
6.0000 4.0000 37 61 0.5167
6.0000 6.0000 125 5 0.5158
6.5000 1.0000 56 77 0.5095
6.5000 3.0000 19 31 0.5086
6.5000 5.0000 98 0 0.5077
7.0000 0.0000 5 16 0.5014
7.0000 2.0000 16 22 0.5005
7.0000 4.0000 46 4 0.4996
7.5000 1.0000 5 9 0.4924
7.5000 3.0000 36 2 0.4915
8.0000 0.0000 1 0 0.4843
8.0000 2.0000 14 0 0.4834
8.5000 1.0000 4 0 0.4753
9.0000 0.0000 0 0 0.4673

*********************************************************
Length 60

Descriptives...

21244.5 cases have Y=0; 27593.5 cases have Y=1.

Variable Avg SD
1 2.4001 1.1066
2 1.6640 1.4668

Iteration History...
-2 Log Likelihood = 66876.1255 (Null Model)
-2 Log Likelihood = 66486.7918
-2 Log Likelihood = 66486.5709
-2 Log Likelihood = 66486.5709
-2 Log Likelihood = 66486.5709 (Converged)

Overall Model Fit...
Chi Square= 389.5546; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.1649 0.0093 0.0000
2 0.0003 0.0070 0.9647
Intercept -0.1324

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.1793 1.1581 1.2010

2 1.0003 0.9867 1.0141

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 142 68 0.4669
0.5000 1.0000 931 707 0.4876
1.0000 0.0000 1250 1753 0.5081
1.0000 2.0000 937 701 0.5083
1.5000 1.0000 2543 3541 0.5288
1.5000 3.0000 953 685 0.5289
2.0000 0.0000 1306 1697 0.5492
2.0000 2.0000 2738 3346 0.5494
2.0000 4.0000 365 307 0.5495
2.5000 1.0000 2775 3309 0.5696
2.5000 3.0000 963 1533 0.5698
2.5000 5.0000 106 146 0.5699
3.0000 0.0000 1453 1550 0.5896
3.0000 2.0000 985 1511 0.5898
3.0000 4.0000 215 721 0.5899
3.0000 6.0000 117 72 0.5901
3.5000 1.0000 1071 1425 0.6095
3.5000 3.0000 228 708 0.6096
3.5000 5.0000 339 363 0.6098
3.5000 7.0000 18 24 0.6099
4.0000 0.0000 190 306 0.6289
4.0000 2.0000 260 676 0.6290
4.0000 4.0000 339 363 0.6292
4.0000 6.0000 40 116 0.6293
4.0000 8.0000 0.5 0.5 0.6294
4.5000 1.0000 95 289 0.6480
4.5000 3.0000 352 350 0.6481
4.5000 5.0000 36 120 0.6483
4.5000 7.0000 0.5 0.5 0.6484
4.5000 9.0000 27 36 0.6485
5.0000 0.0000 5 61 0.6665
5.0000 2.0000 126 162 0.6666
5.0000 4.0000 46 110 0.6668
5.0000 6.0000 0.5 0.5 0.6669
5.0000 8.0000 55 179 0.6670
5.5000 1.0000 38 70 0.6846
5.5000 3.0000 14 50 0.6848
5.5000 5.0000 0.5 0.5 0.6849
5.5000 7.0000 53 181 0.6850
6.0000 0.0000 20 16 0.7021
6.0000 2.0000 4 20 0.7022
6.0000 4.0000 0.5 0.5 0.7023
6.0000 6.0000 69 165 0.7025
6.5000 1.0000 8 10 0.7191
6.5000 3.0000 0.5 0.5 0.7192
6.5000 5.0000 17 79 0.7194
7.0000 0.0000 0.5 0.5 0.7354
7.0000 2.0000 0.5 0.5 0.7355
7.0000 4.0000 0 36 0.7356
7.5000 1.0000 0.5 0.5 0.7512
7.5000 3.0000 9 18 0.7513
8.0000 0.0000 0.5 0.5 0.7662
8.0000 2.0000 0 6 0.7663
8.5000 1.0000 0.5 0.5 0.7807
9.0000 0.0000 1 2 0.7945

djh

unread,
Mar 30, 2012, 4:27:44 PM3/30/12
to koo...@sfu.ca
From the summary data below for the length 20-60 runs with two
predictors (v1+v2)/2 and (v1-v2), it looks to me like (v1+v2)/2 always
yields significant results (p < 0.05) when we control for (v1-v2). So
I will continue to run with data from different length intervals,
instead of pooling across lengths. This is because much more
information of potential scientific relevance is provided by runs that
don't pool across length, and so long as the information comes with a
satisfactory p-value, I might as well use it.

If you agree about the significance of (v1+v2)/2 in the cases below,
permit me to ask you a question.

Since (v1+v2)/2 is significant in all these runs once we control for
(v1-v2), is it safe to trust the linear regression of logit[p] as also
saying something meaningful about the data in these cases?

Length 20

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0341 0.0039 0.0000
2 0.0715 0.0033 0.0000
Intercept 1.9027

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9665 0.9591 0.9740

2 1.0742 1.0673 1.0811

Length 30

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0778 0.0046 0.0000
2 -0.0386 0.0035 0.0000
Intercept 1.3349

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0809 1.0713 1.0907

2 0.9622 0.9556 0.9688

Length 40

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0159 0.0052 0.0021
2 -0.0028 0.0039 0.4797
Intercept -0.1535

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9842 0.9744 0.9942

2 0.9972 0.9896 1.0049

Length 50

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0684 0.0066 0.0000
2 -0.0018 0.0052 0.7256
Intercept 0.4844

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9339 0.9218 0.9461

2 0.9982 0.9880 1.0085

Length 60

djh

unread,
Mar 30, 2012, 5:13:25 PM3/30/12
to koo...@sfu.ca
PS - I double-checked with John and he confirmed that

p = 0.0000

means

p < 0.0001

Ray Koopman

unread,
Mar 30, 2012, 7:59:54 PM3/30/12
to
On Mar 29, 5:16 pm, djh <halitsk...@att.net> wrote:
> [...]
> Finally, regarding both length and v, there is a matter of sampling
> that I need to at least make you aware of.
>
> Right now, our length intervals are 13-22,23-32,...,93-102,103-112,
> and we never feed our "filter" (Dr. Lesk's structural alignment
> program) an input pair that is drawn from two different length
> intervals. This keeps us from asking the filter to do silly things
> with silly input pairs of greatly disparate lengths, but it does
> raise an obvious sampling issue.
>
> In particular, neither nature nor Dr. Lesk's program considers
> the lengths 21 and 22 to be somehow closer than 22 and 23, or
> similarly, the lengths 23 and 24 to be closer than 21 and 22.
> Yet of these three possible classes of input pairs defined by
> length, we only feed our filter pairs from one. So I know we're
> implicitly "sampling" here, but I don't know if we're doing so
> legitimately or illegitimately - perhaps you can advise here.
>
> And of course, the same implicit "sampling" question arises in
> the case of our quantized v. The levels 0,...,9 of v correspond,
> in the case of the globin proteins, to 10 ranges 0.01-0.50,
> 0.51-1.00, ..., 4.01-4.50, 4.51-5.00. So our protocol
> distinguishes between two values such as 4.50 and 4.51 in a way
> that it does not distinguish between the equally close values
> 4.49 and 4.50 or 4.51 and 4.52. Again, perhaps you can advise
> how bad a problem this actually is.

This is one of those situations where you need to know the answer to
the research question before you can design the optimal experiment
to answer the research question. One way to approach such problems
is simulation. Otherwise, the best I can do is to quote Point 3 in
Frank Harrell's Philosophy of Biostatistics: "Avoid categorizing
continuous variables and predicted values at all costs"
<http://biostat.mc.vanderbilt.edu/wiki/Main/FrankHarrell>.
You might try poking around that department to see if you can find
a student who would be interested in getting involved.

Ray Koopman

unread,
Mar 30, 2012, 8:01:24 PM3/30/12
to
On Mar 30, 4:59 am, djh <halitsk...@att.net> wrote:
> Ray - I forgot to ask you one basic question last night.
>
> Suppose that sufficient and insightful complication of the model
> produces a satisfactory logistic regression involving the "yield-
> related" properties of the "filter".
>
> Is the fact that the "yield-related" properties of the filter are
> best modellable by a logistic regression incompatible with the
> possibility that the "RMSD-related" property of the filter might be
> modelled by a linear regression? Or could these models both exist
> without conflict from a purely statistical point of view?

No problem. Just remember that if the yield proportion is a logistic
function then the logit of that proportion is a linear function.
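
(A quick numerical illustration of that point, with an arbitrary
intercept and slope:)

import math

a, b = 1.3, 0.08                                  # arbitrary values, just for the sketch
for x in (0.0, 1.0, 2.0, 3.0):
    p = 1.0 / (1.0 + math.exp(-(a + b * x)))      # logistic in p ...
    logit_p = math.log(p / (1.0 - p))             # ... is exactly linear in logit[p]
    print(x, round(p, 4), round(logit_p, 4))      # logit_p equals a + b*x up to rounding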

djh

unread,
Mar 30, 2012, 9:22:09 PM3/30/12
to Koo...@sfu.ca
I'm glad you added:

"Just remember that if the yield proportion is a logistic
function then the logit of that proportion is a linear function."

I think this provides an affirmative answer to a question I asked
above that you haven't answered yet. This question, slightly
rephrased, was:

"if x predicts p with sufficient significance in a logistic
regression, can we assume that reasonable significance must also
attach to the linear regression of x with logit[p]?"

Assuming I'm correct here, i.e. that x against logit[p] is
"meaningful" linearly if x against logit[p] is "meaningful"
logistically, then one question I don't understand is how one does a
linear regression with two predictors in such a way that one variable
is controlled for. For example, how does one do a linear regression
equivalent to the logistic regressions I just ran in John's program,
where his program analyzed (v1+v2)/2 while controlling for v1-v2.

Is this where tools like MANOVA come in, to deal with multi-predictor
linear regressions?

djh

unread,
Mar 31, 2012, 11:32:45 PM3/31/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com, rle...@fordham.edu
Ray -

Here are the coefficients, standard error values, and p values for the
logistic regressions on lengths 20-80 globin data using quantized
(v1+v2)/2 and (v1-v2) as predictors. [Note to JRF/AML/MS: v1 and v2
are the dicodon representation levels in the pairs of subsequences
associated with the substructures submitted for structural alignment.]

The table also shows the slopes and standard error values for the
correspondng linear regressions on logit[p], computed as per your
instructions in past posts here.

|--Logistic Regression-----|-Linear Regression on logit[p]-|
Length Coeff. StdErr p Slope Std Err
20 -0.0341 0.0039 0.0000 0.044459623 0.012984522
30 0.0778 0.0046 0.0000 0.074139435 0.016295094
40 -0.0159 0.0052 0.0021 -0.063256909 0.022263999
50 -0.0684 0.0066 0.0000 -0.096360478 0.046296441
60 0.1649 0.0093 0.0000 0.036298857 0.034564206
70 -0.1698 0.0108 0.0000 -0.079189612 0.060538538
80 -0.1816 0.0120 0.0000 -0.014339985 0.033767676

One question and one comment, if/when you have a chance.

Question:

Since I'm computing the slopes and standard error of the linear
regressions on logit[p] the way you showed me (rather than taking the
Excel slopes and standard errors for these regressions) I don't want
to take the Excel y-intercepts. So can you explain to me how to
compute the y-intercept for the linear regression on logit[p]? (We
need these intercepts for the work we're doing with Bob Lewis at
Fordham.)

Comment:

Although all the p-values indicate that ALL the logistic regressions
are significant (since a p of 0.0000 in John P's read-outs actually
means p <0.0001), you'll note that the directions are only positive in
the case of lengths 30 and 60.

The fact that 60 is twice 30 is scientifically extremely suggestive.
It tells us that the property of the data responsible for the positive
logistic correlation of alignment "yield" with (v1+v2)/2 at length 30
MAY duplicate at length 60. And we can therefore go into the
underlying detail data and see if we can in fact detect any similarity
at these lengths which involves the energetic variable e that linearly
correlates with quantized v and unquantized u. (I provided an
illustration of e:v correlations in an earlier post here.)

But of course, such an investigation would be premature until we have
ascertained by running all the data that the same pattern of positive
and negative logistic correlations does NOT occur: i) in protein
classes other than the globins; ii) in our control runs against the
globins.

So, I am merely giving you a promissory note here, because I think
that it will help you to appreciate the possible extraordinary
potential of the statistical analysis for detecting patterns in the
scientific data that well might otherwise go unnoticed.

Ray Koopman

unread,
Apr 1, 2012, 4:08:57 AM4/1/12
to
On Mar 31, 8:32 pm, djh <halitsk...@att.net> wrote:
> Ray -
>
> Here are the coefficients, standard error values, and p values for
> the logistic regressions on lengths 20-80 globin data using quantized
> (v1+v2)/2 and (v1-v2) as predictors. [Note to JRF/AML/MS: v1 and v2
> are the dicodon representation levels in the pairs of subsequences
> associated with the substructures submitted for structural alignment.]
>
> The table also shows the slopes and standard error values for the
> correspondng linear regressions on logit[p], computed as per your
> instructions in past posts here.
>
> |--Logistic Regression------|-Linear Regression on logit[p]-|
> Length Coeff. StdErr p Slope Std Err
> 20 -0.0341 0.0039 0.0000 0.044459623 0.012984522
> 30 0.0778 0.0046 0.0000 0.074139435 0.016295094
> 40 -0.0159 0.0052 0.0021 -0.063256909 0.022263999
> 50 -0.0684 0.0066 0.0000 -0.096360478 0.046296441
> 60 0.1649 0.0093 0.0000 0.036298857 0.034564206
> 70 -0.1698 0.0108 0.0000 -0.079189612 0.060538538
> 80 -0.1816 0.0120 0.0000 -0.014339985 0.033767676
>
> One question and one comment, if/when you have a chance.
>
> Question:
>
> Since I'm computing the slopes and standard error of the linear
> regressions on logit[p] the way you showed me (rather than taking
> the Excel slopes and standard errors for these regressions) I don't
> want to take the Excel y-intercepts. So can you explain to me how
> to compute the y-intercept for the linear regression on logit[p]?
> (We need these intercepts for the work we're doing with Bob Lewis
> at Fordham.)

intercept = mean[y] - slope*mean[x].

>
> Comment:
>
> Although all the p-values indicate that ALL the logistic regressions
> are significant (since a p of 0.0000 in John P's read-outs actually
> means p <0.0001), you'll note that the directions are only positive
> in the case of lengths 30 and 60.
>
> The fact that 60 is twice 30 is scientifically extremely suggestive.
> It tells us that the property of the data responsible for the positive
> logistic correlation of alignment "yield" with (v1+v2)/2 at length 30
> MAY duplicate at length 60. And we can therefore go into the
> underlying detail data and see if we can in fact detect any similarity
> at these lengths which involves the energetic variable e that linearly
> correlates with quantized v and unquantized u. (I provided an
> illustration of e:v correlations in an earlier post here.)
>
> But of course, such an investigation would be premature until we have
> ascertained by running all the data that the same pattern of positive
> and negative logistic correlations does NOT occur: i) in protein
> classes other than the globins; ii) in our control runs against the
> globins.
>
> So, I am merely giving you a promissory note here, because I think
> that it will help you to appreciate the possible extraordinary
> potential of the statistical analysis for detecting patterns in the
> scientific data that well might otherwise go unnoticed.

I too have some comments. (This thread is getting tangled enough
that I've lost track of which post prompted each of these.)

1. Cells with n0 = n1 = 0 should be omitted from all regressions.
(For the maximum likelihood logistic regressions, it doesn't change
anything if you leave them in, except possibly if the program tries
to compute the corresponding p = 0/(0 + 0). But don't plug in .5 for
both n0 and n1, for any analysis.)

2. Check the (2,5) cell in the Length 30 data: you posted
x1 = 3.5, x2 = 3; n0 = 6678, n1 = 574; those imply p = .079,
which seems too small by a factor of 10 or so.

3. Logistic regression of p vs linear regression of logit[p]. The
two models are the same; the difference is in how the coefficients
are estimated. Doing a least-squares linear regression with log[n1/
n0] as the d.v. (bad) is asymptotically the same doing a maximum
likelihood logistic regression with (n0,n1) as the d.v. (good). But
that's only asymptotically, which in this context means that if the
sample size is big enough then the results of the two methods will
usually be close, without saying anything about how big "big enough"
might be. In practice, no one would expect the two sets of results
to be close if the expected value of any n0 or n1 is small. (I've
already mentioned 30 as a reference value for the observed n0 and
n1, but that's only a conservative rule of thumb.)

4. A small p-value means only that an estimated coefficient as big
as you got would be very unusual if the true value were zero. It
does not mean that the model is correct or is accounting well for
the data. In fact, the 3-D plots of p against (x1,x2) do not look
any more logistic than the 2-D plots of p against x1 did. There is
definitely a whole lot going on that is not being modeled.

Why does this matter? Because when the model omits important
predictors then the predictors that are in the model may be acting
as surrogates for the omitted predictors, rather than as predictors
in their own right, and even the signs of their coefficients may be
wrong. The most famous case of this is probably the WWII regression
analysis where the d.v. was the accuracy of Allied bomb drops, and
the number of Nazi interceptors got a positive coefficient. Why?
Because number of interceptors was being a surrogate for the weather,
which had been left out of the equation. When it was included, number
of interceptors got the negative weight that it should have.

Ray Koopman

unread,
Apr 1, 2012, 5:02:05 AM4/1/12
to
Look up multiple linear regression: y = X b + e, where
y is an n-vector of dependent scores;
X is an n by k matrix of predictor scores;
b is an unknown k-vector of coefficients;
e is a random n-vector, usually assumed to be i.i.d. with zero means.

The ordinary least squares (aka OLS) solution is to choose b to
minimize (y-Xb)'(y-Xb): b = (X'X)^-1 X'y.
This automatically adjusts each predictor for all the others.
All predictors are treated the same; no predictor or subset of
predictors has special status.
If an intercept is wanted then one column of X should be the unit
vector (1,...,1).
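
(A minimal numerical sketch of the above, with an intercept column and
two predictors; the data values are arbitrary, and numpy's least-squares
routine is used rather than forming (X'X)^-1 explicitly:)

import numpy as np

x1 = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])     # e.g. (v1+v2)/2
x2 = np.array([1.0, 0.0, 1.0, 2.0, 1.0, 0.0])     # e.g. v1-v2
y  = np.array([1.9, 2.0, 2.1, 2.0, 2.2, 2.3])     # e.g. logit[p]  (made-up numbers)

X = np.column_stack([np.ones_like(x1), x1, x2])   # unit vector gives the intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # OLS solution to y = X b + e
print(b)   # [intercept, x1 coefficient adjusted for x2, x2 coefficient adjusted for x1]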

djh

unread,
Apr 1, 2012, 11:39:00 AM4/1/12
to koo...@sfu.ca, am...@psu.edu, jrfr...@princeton.edu, marvin.s...@gmail.com
Thanks very much for providing the definition of the y-intercept you
want me to use, and also for the advice re the values of n0 and n1 to
exclude.

Also, I've checked the length 30 data for x1=3.5, x2 = 3, and it's
correct: it comes from underlying input/output pairs of 7252 and 574
for (2,5) and (5,2) in the "square" matrix for len 30. I can explain
why such an anomaly can arise, but don't want to take the time right
now.


Rather, I want to concentrate on what you wrote here:

"4. A small p-value means only that an estimated coefficient as big
as you got would be very unusual if the true value were zero. It
does not mean that the model is correct or is accounting well for
the data. In fact, the 3-D plots of p against (x1,x2) do not look
any more logistic than the 2-D plots of p against x1 did. There is
definitely a whole lot going on that is not being modeled."

As I mentioned earlier, the project's fundamental goal is NOT to
correctly model the entire range of behavior of Dr. Lesk's structural
alignment program, but RATHER to show that one small but "significant"
factor in this behavior may in fact be due to an evolutionary
"residue" which has persisted "from the beginning" (again, like the
radiation left over from the big-bang.)

Keeping this goal in mind, consider the RSquare scores which Excel
reports for the linear regressions for lengths 20-80 reported in my
last email for logit[p]:

|--Logistic Regr------|Lin Reg on logit[p]|
Length Coeff. StdErr p RSquare
20 -0.0341 0.0039 0.0000 0.171554638
30 0.0778 0.0046 0.0000 0.215528661
40 -0.0159 0.0052 0.0021 0.195911439
50 -0.0684 0.0066 0.0000 0.166950785
60 0.1649 0.0093 0.0000 0.286276399
70 -0.1698 0.0108 0.0000 0.091041918
80 -0.1816 0.0120 0.0000 0.033825749

Suppose for the sake of argument that these RSquare's are reflective
of the degree of "meaningful" data coverage we're getting in both the
logistic regressions and the corresponding linear regressions on
logit[p].

Then:

1) Would it not be legitimate for us to claim, based on these
RSquares, that the proposed "evolutionary" residue can be seen in 29%
of the data at length 60 and 22% of the data at length 30?

2) If so, wouldn't the actual detail data "covered" within this 29% or
22% be those underlying detail data pairs for which logit[p] is within
1 or 2 sd of the value predicted by the equation of the linear
regression on logit[p], i.e. the equation with the slope and y-
intercept constructed as you've shown me?


If your answers to (1) and (2) are "yes" (in principle), could you
tell me whether Excel's RSquares are "OK", or if you want me to
compute, and if so, how ?

djh

unread,
Apr 1, 2012, 2:23:35 PM4/1/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com
Ray -
I didn't want to leave you with the impression that we can't elaborate
the quantized v parameter which we use in (v1+v2)/2 and v1-v2. In
fact, we can elaborate this parameter v considerably, for reasons I
won't detail right now, and the result would be several additional
parameters as follows:

i) a "count" parameter indicating how many total elements figure into
a given value of "v", where these elements are drawn from a fixed set
of 63

ii) a second "count" parameter indicating how many of these total
elements are "hydrophobic" in chemical terms, i.e. give rise to amino
acids that don't like water;

iii) a third "count" parameter indicating how many of these total
elements are "hydrophilic" in chemical terms, i.e. give rise to amino
acids that do like water.

(Note that (ii) and (iii) could conceivably be collapsed into an input
yes/no parameter rather than two count parameters. Also, note that i-
iii would be computed at the detail level, and then averaged up at the
"bin" level, just like v1 and v2 are now.)

So our logistic model could "readily" be elaborated into one with
either four or five total predictor variables.

But if this elaboration were done, however, it would again be with the
goal of seeing how our "partial" model gets better, not with the goal
of obtaining a complete characterization of the behavior of Dr. Lesk's
program. (By "partial" model, I mean a model that seems to indicate a
"residual evolutionary factor" at work in the behavior of Dr.Lesk's
program, as per my post.)

And furthermore, if you can in "good statistical conscience" green-
light a paper without this elaboration, based just on the data
coverage we get using just (v1+v2)/2 and v1-v2, then I would prefer to
do the elaboration on grant money that we have been assured we can get
if we can get just one initial paper accepted for publication.

I'm being frank here because everyone on the project has been working
pro bono since 1999, when we had our last grant of $400K from the
Army.

djh

unread,
Apr 1, 2012, 5:45:59 PM4/1/12
to koo...@sfu.ca, am...@psu.edu, jrf...@princeton.edu, marvin.s...@gmail.com
I also wanted to mention that there is a way we can directly tune the
count of output pairs that we use in computing y=p="yield" within the
"partial" ((v1+v2)/2,v1-v2) logistic model.

When Dr. Lesk's program is given an input pair of roughly the same
lengths, say 33-42, it will try to find something similar in this pair
even if what's similar only spans a length of 8 (or 11 or 25 or
whatever) within each member of the input pair.

I capture these "span" numbers for each output pair, and therefore can
compute the average "span" per output pair. Hence, I can get the mean
span and sd for a given run, and therefore, I can exclude from the
output count any output pair whose span is > 1 sd from the mean. (You
may wonder why the span numbers can differ between the two members of
a given output pair ... this has to do with something called "gapping"
which I won't go into now, except to say that it arises from the fact
that evolution can insert and delete subsequences from a given
subsequence ...)
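
(Mechanically, the exclusion is just a one-pass filter on the recorded
spans -- a minimal sketch, with made-up span values:)

import numpy as np

spans = np.array([8, 11, 25, 14, 16, 15, 13, 30, 9, 17], dtype=float)  # made-up per-pair average spans
m, s = spans.mean(), spans.std(ddof=1)        # mean and sd of average span for the run
keep = (spans >= m - s) & (spans <= m + s)    # count only output pairs within 1 sd of the mean
print(m, s, spans[keep])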

From a scientific point of view, Dr. Lesk has already OK'd the
restriction of "counted" output pairs to those lying within an sd of
the mean, because we know exactly what the outliers are on either
side: a) "to the left" of the mean, the outliers are those where Dr.
Lesk's program has just so happened to find something similar, quite
possibly by chance; b) "to the right" of the mean, the outliers are
those where Dr. Lesk's program has output a pair whose members are
evolutionary very close and therefore very similar.

So - once I complete lengths 90, 100, and 110 for the globins, I will
go back and rerun lengths 20-110 using this mean+1sd window for
"span". The results should be very instructive, and I hope will tell
you a lot.

Ray Koopman

unread,
Apr 3, 2012, 4:26:27 AM4/3/12
to
On Apr 1, 8:39 am, djh <halitsk...@att.net> wrote:
> Thanks very much for the providing the definition of y-intercept
> you want me to use and also for the advice re the values of n0 n1
> to exclude.
>
> Also, I've checked the length 30 data for x1=3.5, x2 = 3, and it's
> correct: it comes from underlying input/output pairs of 7252 and 574
> for (2,5) and (5,2) in the "square" matrix for len 30. I can explain
> why such an anomaly can arise, but don't want to take the time right
> now.
>
> Rather, I want to concentrate on what you wrote here:
>
> "4. A small p-value means only that an estimated coefficient as big
> as you got would be very unusual if the true value were zero. It
> does not mean that the model is correct or is accounting well for
> the data. In fact, the 3-D plots of p against (x1,x2) do not look
> any more logistic than the 2-D plots of p against x1 did. There is
> definitely a whole lot going on that is not being modeled."
>
> As I mentioned earlier, the project's fundamental goal is NOT to
> correctly model the entire range of behavior of Dr. Lesk's structural
> alignment program, but RATHER to show that one small but "significant"
> factor in this behavior may in fact be due to an evolutionary
> "residue" which has persisted "from the beginning" (again, like the
> radiation left over from the big-bang.)

Reviewers for the (psychology) journals with which I am familiar
would want some empirical assurance that the small but consistently
significant coefficients were not due to predictors that are known or
suspected to matter but were omitted, such as those mentioned in the
two posts immediately following the one to which I am replying. The
general rule is that the more radical the conclusion you are trying
to draw, the more evidence you need to present that yours is the only
reasonable explanation.

>
> Keeping this goal in mind, consider the RSquare scores which Excel
> reports for the linear regressions for lengths 20-80 reported in my
> last email for logit[p]:
>
> |--Logistic Regr------|Lin Reg on logit[p]|
> Length Coeff. StdErr p RSquare
> 20 -0.0341 0.0039 0.0000 0.171554638
> 30 0.0778 0.0046 0.0000 0.215528661
> 40 -0.0159 0.0052 0.0021 0.195911439
> 50 -0.0684 0.0066 0.0000 0.166950785
> 60 0.1649 0.0093 0.0000 0.286276399
> 70 -0.1698 0.0108 0.0000 0.091041918
> 80 -0.1816 0.0120 0.0000 0.033825749
>
> Suppose for the sake of argument that these RSquare's are reflective
> of the degree of "meaningful" data coverage we're getting in both the
> logistic regressions and the corresponding linear regressions on
> logit[p].
>
> Then:
>
> 1) Would it not be legitimate for us to claim, based on these
> RSquares, that the proposed "evolutionary" residue can be seen in
> 29% of the data at length 60 and 22% of the data at length 30?
>
> 2) If so, wouldn't the actual detail data "covered" within this 29%
> or 22% be those underlying detail data pairs for which logit[p]
> is within 1 or 2 sd of the value predicted by the equation of the
> linear regression on logit[p], i.e. the equation with the slope and
> y-intercept constructed as you've shown me?

R^2 is a summary measure that does not apply to any particular value
or subset of values. It is a "proportion of variance". The observed
data vector is partitioned into two orthogonal components: a vector
of model-produced values, and a vector of residuals. Then

      variance of the model-produced values
R^2 = ------------------------------------- .
      variance of the observed data values

In geometric terms, you're projecting the centered data vector into
the space spanned by the centered predictors, and calculating the
square of the ratio of the length of the projected vector divided by
the length of the data vector. ("Centering" here means subtracting
the means, so that everything is orthogonal to the unit vector.)
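
(Numerically, for a fit that includes an intercept, that ratio can be
computed directly from the model-produced values -- a minimal sketch
with made-up numbers:)

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # made-up predictor
y = np.array([2.1, 1.9, 2.4, 2.0, 2.6])       # made-up observed logits

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b                                  # model-produced values

r2 = y_hat.var() / y.var()                     # proportion of variance, as defined above
print(r2, np.corrcoef(x, y)[0, 1] ** 2)        # same number as the squared correlation here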

>
> If your answers to (1) and (2) are "yes" (in principle), could you
> tell me whether Excel's RSquares are "OK", or if you want me to
> compute, and if so, how ?

The R^2 values you give for the logits appear to be the squares
of the ordinary correlations between x1 and the sample logits. A
weighted correlation may be more appropriate, with the weights being
the point of argument. To be consistent with the maximum likelihood
logistic regression, the weights should be 1/(1/n0+1/n1) = n*p(1-p),
as in my post of Mar 22, 11:44 pm. Whatever the weights may be,
here is how you do weighted correlation and regression:

sum of weights:        W = sum w

weighted means:        mx = sum w*x / W
                       my = sum w*y / W

weighted variances:    vx = sum w(x - mx)^2 / W
                       vy = sum w(y - my)^2 / W

weighted covariance:   cxy = sum w(x - mx)(y - my) / W

weighted correlation:  rxy = cxy / sqrt[vx*vy]

weighted regression of y on x:  slope = cxy / vx
                                intercept = my - slope*mx
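
Those formulas transcribe directly into code; here is a sketch in
Python (the weights shown are the n*p(1-p) weights suggested above,
and the data arrays are placeholders):

import numpy as np

def weighted_corr_regression(x, y, w):
    # weighted correlation of x and y, and weighted regression of y on x
    W = w.sum()
    mx = (w * x).sum() / W
    my = (w * y).sum() / W
    vx = (w * (x - mx)**2).sum() / W
    vy = (w * (y - my)**2).sum() / W
    cxy = (w * (x - mx) * (y - my)).sum() / W
    rxy = cxy / np.sqrt(vx * vy)
    slope = cxy / vx
    intercept = my - slope * mx
    return rxy, slope, intercept

# placeholder cell counts; in practice n0, n1, x1 come from the data
n0 = np.array([153., 120., 95., 80., 60.])
n1 = np.array([10438., 9800., 9100., 8500., 7900.])
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
p = n1 / (n0 + n1)
w = 1.0 / (1.0/n0 + 1.0/n1)              # = n*p(1-p)
r, b, a = weighted_corr_regression(x1, np.log(p/(1-p)), w)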

For weighted multiple regression, follow my post of Apr 1, 2:02 am,
but first premultiply both X and y by a diagonal matrix containing
the square roots of the weights.

Caveat: Those equations are algebraically correct but not suitable
for numerical computation with ill-conditioned data. Failure to
recognize this is partly responsible for Excel's dismal reputation.
Your data seem reasonably well-conditioned so far, but that may
change as you add predictors. Ask a numerical analyst how best
to deal with the problem. If you don't know one, try asking in
sci.math.num-analysis .
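
In Python, the premultiplication step for the multiple-predictor case
can be written as below; numpy's lstsq uses an orthogonal (SVD-based)
solver rather than the normal equations, which is one reasonable way
to sidestep the conditioning problem (this is a sketch, not a
prescription):

import numpy as np

def weighted_lstsq(X, y, w):
    # premultiply X (2-D design matrix) and y by diag(sqrt(w)),
    # then solve by least squares
    sw = np.sqrt(w)
    Xw = X * sw[:, None]      # same as diag(sqrt(w)) @ X
    yw = y * sw
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

# X should carry a column of ones if an intercept is wanted, e.g.
# X = np.column_stack([np.ones_like(x1), x1, x2])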

With very large sample sizes and the weights mentioned above, the
weighted slope & intercept should approximate the slope & intercept
given by logistic regression.

djh

unread,
Apr 3, 2012, 6:04:51 AM4/3/12
to koo...@sfu.ca, jrfr...@princeton.edu, am...@psu.edu, marvin.s...@gmail.com
My cup runneth over, Ray, really. Again, your intellectual "caritas"
is inspiring - I use the Latin term here to signify something perhaps
even more than "generosity" and "kindness".

Your post above is really the one of which I should send framed copies
to Jacques and Arthur, inasmuch as it not only holds out the
possibility that we MIGHT be able to make a reasonable statistical
argument for our claims, but also, once again, lays out exactly what
our remaining plan of attack should be.

In this regard, here is what I will do regarding what you wrote here:

"Reviewers for the (psychology) journals with which I am familiar
would want some empirical assurance that the small but consistently
significant coefficients were not due to predictors that are known or
suspected to matter but were omitted, such as those mentioned in the
two posts immediately following the one to which I am replying. "

1.
First, I will definitely go back and do what I mentioned in my second
"follow-up" post, namely - rerun the analysis counting only output
pairs whose average "spans" fall within 1 sd of the mean average
"span" for the run (where "run" is defined by a triple of values for
(length, v1, v2)).

If this reanalysis yields same or better logistic correlations than
those you've seen already, then this result might well help with
potential referees, because it would lessen the possible effect of
"chance" involving low outliers and "expected" results involving high
outliers.

On the other hand, if this reanalysis kills the current logistic
correlations outright, well, that's the way the cookie crumbles. At
least we'll know what's what and that we did an honest day's work.

2.
Assuming the reanalysis in (1) does not kill our current logistic
correlations, I will run the weighted LINEAR correlations on logit[p]
according to the instructions you've provided.

In this regard, I am interpreting your last post as saying that if we
run the weighted linear correlations on logit[p] rather than the
ordinary linear correlations, then you would be able to make a more
informed decision as to whether we could use the resulting R^2's as a
crude indicator of the extent to which there might be an "evolutionary
factor" influencing the behavior of Dr. Lesk's program. (But if I'm
misinterpreting you incorrectly here, please clarify.)

3.
Regarding what you wrote here:

"The general rule is that the more radical the conclusion you are
trying to draw, the more evidence you need to present that yours is
the only reasonable explanation. "

I am pleased to be able to tell you that if the analysis is still
viable after (1) and (2) are completed, then your point here should be
at least partially addressed by our main control protocol (which I
haven't yet discussed at all). Dr. Lesk is known throughout the
community as being extraordinarily punctilious and severe regarding
controls, but even he has agreed that we have a very elegant control
whose results will be very difficult to disparage (whether good for us
or bad for us.)

4.
Finally, please note that before reporting here the results of the
reanalysis in (1) above, I have to re-calculate the current logistic
correlations for lengths 20-80 as well as the new ones for 90-110.
This is because I forgot what you told me to do when p = 1 and 1-p =
0, so I was deleting cases when output=input, instead of adding the .5
and 1 in these cases (which is what you suggested in your post of
7:21pm Mar 27 as a lousy but workable solution for the "1" cases as
well as the "0" cases.) This shouldn't make a whole lot of
difference, but I want to make sure the current data are in the best
possible shape before proceeding with (1) above, particularly because
it won't take me but a few minutes to re-calculate the data for
lengths 20-110 and re-report the results here.

5. You can expect a note from Jacques toward the end of the week (he
has been travelling and then catching up with the regular affairs of
his lab.)

Ray Koopman

unread,
Apr 6, 2012, 3:37:22 AM4/6/12
to
On Apr 3, 3:04 am, djh <halitsk...@att.net> wrote:
> [...] here is what I will do regarding what you wrote here:
>
> "Reviewers for the (psychology) journals with which I am familiar
> would want some empirical assurance that the small but consistently
> significant coefficients were not due to predictors that are known or
> suspected to matter but were omitted, such as those mentioned in the
> two posts immediately following the one to which I am replying. "
>
> 1.
> First, I will definitely go back and do what I mentioned in my second
> "follow-up" post, namely - rerun the analysis counting only output
> pairs whose average "spans" fall within 1 sd of mean average "span"
> for the run (where "run" is defined by a triple of values for
> (length, v1, v2).

I'm not sure what you're proposing, but it sounds like you intend to
ignore data points that don't fit the model. If so, then don't waste
your time. What you really need to do is expand the model to fit the
data better. Coefficients, standard errors, and p-values from models
that fit poorly should not be taken seriously.

One thing that I have not mentioned is that you can compare your
model to a saturated model -- one that has as many parameters as
there are cells, and therefore fits the cell proportions perfectly.
John's program gives a quantity labeled "-2 Log Likelihood". You
want its "Converged" value. (But beware -- the printed values have
been truncated ON THE LEFT. For instance, the Length 20 value is
08060.8373, which should read 208060.8373. You should write John
about this.) The corresponding value for the saturated model is
-2 sum {n0*Log[n0/(n0+n1)] + n1*Log[n1/(n0+n1)]} = 207583.0602.
The difference between the two has a chi-square distribution with
df = the difference between the number of parameters in the two
models. A big chi-square, with a small p-value, indicates a
significant lack of fit of the smaller model. All your models
show hugely significant lack of fit.
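
In code, the comparison looks like this (a sketch; n0 and n1 are the
per-cell counts, and the model's converged -2 log likelihood and
parameter count are supplied by the user):

import numpy as np
from scipy.stats import chi2

def lack_of_fit_test(n0, n1, minus2LL_model, n_params_model):
    # chi-square comparison of a fitted logistic model to the saturated model
    n0 = np.asarray(n0, float)
    n1 = np.asarray(n1, float)
    n = n0 + n1
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = (np.where(n0 > 0, n0 * np.log(n0 / n), 0.0)
                 + np.where(n1 > 0, n1 * np.log(n1 / n), 0.0))
    minus2LL_sat = -2.0 * terms.sum()      # saturated model: fits the cells perfectly
    chisq = minus2LL_model - minus2LL_sat
    df = len(n0) - n_params_model          # number of cells minus fitted parameters
    return chisq, df, chi2.sf(chisq, df)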

It would be nice if those comparisons with the saturated model were
nonsignificant, but that may be asking too much, because your sample
sizes are so large that significance is too easy to get. What you
need is a tolerable level of misfit to all the sample proportions.
Throwing out the cells that the model doesn't fit is not the answer.
How much misfit you can tolerate is a subjective call. You're
dealing with what is sometimes referred to as "practical
significance", as opposed to "statistical significance".

In an earlier post (8:39 am Apr 1), you said

> ... the project's fundamental goal is NOT to correctly model
> the entire range of behavior of Dr. Lesk's structural alignment
> program, but RATHER to show that one small but "significant"
> factor in this behavior may in fact be due to an evolutionary
> "residue" which has persisted "from the beginning" (again, like
> the radiation left over from the big-bang.)

The problem is that in order to claim significance, the model
must account for the data, and there must be no omitted variables
that might change the picture if they were included.

>
> If this reanalysis yields same or better logistic correlations than
> those you've seen already, then this result might well help with
> potential referees, because it would lessen the possible effect of
> "chance" involving low outliers and "expected" results involving
> high outliers.
>
> On the other hand, if this reanalysis kills the current logistic
> correlations outright, well, that's the way the cookie crumbles. At
> least we'll know what's what and that we did an honest day's work.
>
> 2.
> Assuming the reanalysis in (1) does not kill our current logistic
> correlations, I will run the weighted LINEAR correlations on
> logit[p] according to the instructions you've provided.
>
> In this regard, I am interpreting your last post as saying that if
> we run the weighted linear correlations on logit[p] rather than the
> ordinary linear correlations, then you would be able to make a more
> informed decision as to whether we could use the resulting R^2's as a
> crude indicator of the extent to which there might be an "evolutionary
> factor" influencing the behavior of Dr. Lesk's program. (But if I'm
> misinterpreting you incorrectly here, please clarify.)

I see no serious role for linear regression analyses of the logits as
a substitute for logistic regression, especially for expanded models
that will necessarily have smaller cell sizes. You need to do actual
logistic regression, and if the cell sizes start getting small or
you get quasi-separation then you will need to do the reduced-bias
version.
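
For what it's worth, ordinary grouped-count logistic regression can
also be run outside John's calculator; here is a sketch using the
statsmodels package in Python (the cell counts and predictors below
are made up, and the reduced-bias (Firth) variant would need a
separate package):

import numpy as np
import statsmodels.api as sm

# made-up per-cell data: two predictors and counts of Y=1 (n1), Y=0 (n0)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
n1 = np.array([90., 85., 80., 75., 70., 65.])
n0 = np.array([10., 15., 20., 25., 30., 35.])

X = sm.add_constant(np.column_stack([x1, x2]))    # intercept + two predictors
fit = sm.GLM(np.column_stack([n1, n0]), X,
             family=sm.families.Binomial()).fit()

print(fit.summary())      # coefficients, standard errors, z, p
print(-2 * fit.llf)       # -2 log likelihood of the fitted model
print(fit.deviance)       # chi-square of this model against the saturated model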

djh

unread,
Apr 6, 2012, 10:38:08 AM4/6/12
to koo...@sfu.ca
Thanks for the advice - I am of course very grateful to get it even
though it indicates I have a lot more work lying ahead of me (which
may well also come to naught.)

As you have advised, I will proceed to expand the logistic model by
adding one possibly important predictor which I have so far
suppressed. (For details of this expansion, see the section
"Expanding the Model" at the end of this post.)

But before I do this, I have to ask you one important question
regarding when it is or isn't legitimate to take only data within one
sd of a mean, as opposed to "all" the data in a given cell. The
reason I'm asking this question is because in an earlier post, you
seemed to indicate that this would be OK to do, whereas in your last
post here, you seem to be indicating that it's not.

To see exactly what I mean when I say "taking data within one sd of
the mean", consider two sets of output counts O1 and O2 for the case
when Length = 20 and v1 = 0.


 v2   Input     O1     O2
  0    2080   1769   1340
  1    7150   6347   4802
  2    9685   8563   6247
  3    9425   8321   6237
  4    6370   5233   4014
  5    4615   4067   3153
  6    2535   2269   1706
  7    1755   1549   1156
  8    1300   1201    910
  9    1495   1363   1000


In the first row, the O1 count of 1769 reflects all the output pairs
for (Length = 20, v1= 0, v2 = 0). For each of these 1769 output
pairs, I can compute the value of a "span" variable "S", and get the
mean and sd of this variable S for the total population of 1769. And
if I then exclude all output pairs within the 1769 that fall outside
one sd of the mean of S on either side, I get the reduced "O2" output
count of 1340.
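
In code, the reduction is simply the following (a Python sketch; the
span values below are random placeholders standing in for the 1769
pairs in that cell):

import numpy as np

S = np.random.normal(10.0, 2.0, size=1769)   # placeholder span values for one cell

m, sd = S.mean(), S.std()
keep = (S >= m - sd) & (S <= m + sd)         # pairs within 1 sd of the mean span
O2 = keep.sum()                              # the reduced output count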

I thought that in an earlier post, you said it would be OK to use the
1340 instead of the 1769, particularly because it makes good
scientific sense to do so.

Furthermore, you can see that by choosing the reduced count of 1340
instead of the total count of 1769, I am not "ignoring" data that
doesn't fit the model, because I haven't yet run any model, so I have
no idea whether taking 1340 instead of 1769 will improve fit or
not.

So, when you have a chance, please clarify whether taking the 1340
instead of the 1769 is legitimate in this kind of situation. If so, I
would prefer to do this when trying to expand the model.

"Expansion of the model"

Presently, the model uses a length variable L which can assume 10
values that I denote by 20 thru 110 for the sake of convenience (20 =
13-22, 30 = 23-32, etc.).

But within each value Li of length, a second variable C can assume
certain values from 3 to 28, and in fact, the value of C plays a
significant role in determining the value of the unquantized variable
u (and therefore the quantized variable v.)

So I can expand the model quite readily by adding the two predictors
Li and Cj to the current (v1+v2)/2 and (v1-v2) predictors.

And inasmuch as different values of Cj can quite possibly affect the
behavior of Dr. Lesk's program for a given value of Li, I will start
re-analyzing the data using a four-predictor model
(Li, Cj, (v1+v2)/2, v1-v2).
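
In code terms the expansion amounts to two more columns in the design
matrix (a sketch with placeholder values):

import numpy as np

# placeholder per-case values
L  = np.array([20., 20., 30., 30., 40., 40.])   # length class Li
C  = np.array([ 3.,  4.,  3.,  5.,  4.,  6.])   # count variable Cj
v1 = np.array([ 0.,  1.,  2.,  3.,  4.,  5.])
v2 = np.array([ 1.,  2.,  3.,  4.,  5.,  6.])

# four-predictor design matrix: Li, Cj, (v1+v2)/2, v1-v2
X = np.column_stack([L, C, (v1 + v2) / 2.0, v1 - v2])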

Thanks so much again for your patience, Ray.

djh

unread,
Apr 6, 2012, 11:09:43 PM4/6/12
to koo...@sfu.ca
Below are two sets of results from John's calculator.

The first set you've seen before - they're the "length 20" results
using the two predictors (v1+v2)/2 and v1-v2.

The second set of results is also based on a matrix for length 20
using the same two predictors.

The difference is that in this case, I used only data exhibiting a
value of 3 for the new "C" variable which I mentioned in my last post,
rather than allowing C to be any of the values 2 thru 6 which it
actually assumes in the length 20 data. (In other words, I have not
yet incorporated "C" into the model as a third predictor - right now I
am merely using the original two predictors but on data in which C is
constant - just to see what happens.)

As you can see from the two sets of results:

a) the chi-square drops from 483.0906 in the first set of results to
12.6410 in the second set, roughly a 40-fold decrease;
b) the p-value associated with the chi-square goes from < 0.0001 to
0.0018.

Does this drop in chi-square signify a step in the right direction, or
is 12.6410 still a relatively "big" chi-square? We would of course
like you to say that this drop in chi-square is a step in the right
direction, because we can think of "C" as a scalar on our unquantized
variable u that underlies v.

Thanks for any time you can afford to spend considering this question,
and again, please forgive my abysmal ignorance about these matters.

I will proceed to do the same analysis with C held constant at 4, at
5, and at 6, in the hope that this will provide you more information
from which to draw conclusions.

**************************************************************************
Original results for length 20
*************************************************************************
Descriptives...

34343 cases have Y=0; 244288 cases have Y=1.

Variable Avg SD
1 3.1861 1.5601
2 2.4325 1.9607

Iteration History...
-2 Log Likelihood = 08060.9779 (Null Model)
-2 Log Likelihood = 07579.6536
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873
-2 Log Likelihood = 07577.8873 (Converged)

Overall Model Fit...
Chi Square= 483.0906; df=2; p= 0.0000

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 -0.0341 0.0039 0.0000
2 0.0715 0.0033 0.0000
Intercept 1.9027

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 0.9665 0.9591 0.9740

2 1.0742 1.0673 1.0811

Results for Length 20 with C = 3.
*********************************************************
Descriptives...

153 cases have Y=0; 10438 cases have Y=1.

Variable Avg SD
1 3.7039 1.7354
2 2.7674 2.1437

Iteration History...
-2 Log Likelihood = 1600.3995 (Null Model)
-2 Log Likelihood = 1588.0377
-2 Log Likelihood = 1587.7588
-2 Log Likelihood = 1587.7586
-2 Log Likelihood = 1587.7586 (Converged)

Overall Model Fit...
Chi Square= 12.6410; df=2; p= 0.0018

Coefficients and Standard Errors...
Variable Coeff. StdErr p
1 0.0907 0.0522 0.0821
2 0.0967 0.0456 0.0341

Intercept 3.6619

Odds Ratios and 95% Confidence Intervals...
Variable O.R. Low -- High
1 1.0950 0.9885 1.2129

2 1.1015 1.0073 1.2046

X1 X2 n0 n1 Calc Prob
0.0000 0.0000 0 26 0.9750
0.5000 1.0000 0 163 0.9782
1.0000 0.0000 1 195 0.9771
1.0000 2.0000 3 198 0.9810
1.5000 1.0000 12 505 0.9801
1.5000 3.0000 1 216 0.9835
2.0000 0.0000 15 299 0.9790
2.0000 2.0000 9 561 0.9827
2.0000 4.0000 1 147 0.9857
2.5000 1.0000 26 690 0.9818
2.5000 3.0000 2 386 0.9849
2.5000 5.0000 0 104 0.9875
3.0000 0.0000 12 360 0.9808
3.0000 2.0000 9 462 0.9841
3.0000 4.0000 2 264 0.9869
3.0000 6.0000 0 62 0.9892
3.5000 1.0000 6 521 0.9833
3.5000 3.0000 5 329 0.9862
3.5000 5.0000 1 167 0.9886
3.5000 7.0000 0 47 0.9906
4.0000 0.0000 1 159 0.9824
4.0000 2.0000 4 354 0.9855
4.0000 4.0000 1 207 0.9880
4.0000 6.0000 0 119 0.9901
4.0000 8.0000 1 89 0.9918
4.5000 1.0000 0 245 0.9847
4.5000 3.0000 1 222 0.9874
4.5000 5.0000 6 142 0.9896
4.5000 7.0000 2 248 0.9914
4.5000 9.0000 0 53 0.9929
5.0000 0.0000 2 74 0.9839
5.0000 2.0000 1 149 0.9867
5.0000 4.0000 4 160 0.9890
5.0000 6.0000 2 305 0.9909
5.0000 8.0000 2 138 0.9925
5.5000 1.0000 0 103 0.9860
5.5000 3.0000 1 106 0.9885
5.5000 5.0000 3 328 0.9905
5.5000 7.0000 6 170 0.9921
6.0000 0.0000 0 27 0.9853
6.0000 2.0000 0 76 0.9879
6.0000 4.0000 2 222 0.9900
6.0000 6.0000 4 187 0.9917
6.5000 1.0000 0 48 0.9872
6.5000 3.0000 0 152 0.9895
6.5000 5.0000 0 128 0.9913
7.0000 0.0000 0 15 0.9866
7.0000 2.0000 0 93 0.9889
7.0000 4.0000 2 87 0.9908
7.5000 1.0000 0 71 0.9883
7.5000 3.0000 0 55 0.9904
8.0000 0.0000 0 65 0.9877
8.0000 2.0000 2 38 0.9899
8.5000 1.0000 1 80 0.9893
9.0000 0.0000 0 21 0.9888


Ray Koopman

unread,
Apr 7, 2012, 2:07:33 AM4/7/12
to
This is just a quick note to sort out the chi-squares.
I'll have more to say later.

The chi-square I described in my previous post compares the logistic
model to the saturated model, that fits the data perfectly.
The bigger the chi-square (and the smaller the p), the more certain
you are that the logistic model is worse than the saturated model.

The chi-square in John's output compares the logistic model to the
"null" model, that gives all the cells the same probability.
The bigger the chi-square (and the smaller the p), the more certain
you are that the logistic model is better than the null model.

So you want my chi-square to be small and nonsignificant,
and John's to be big and significant.

For another take on how well the model fits, plot the "Calc Prob"
in John's output as a function of the sample proportion, n1/(n0+n1).
Ideally, the two should be equal.
How far is the plot from a 45 degree line?
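
Something like this would do (a Python sketch; the arrays here are
just the first few rows of the table above):

import numpy as np
import matplotlib.pyplot as plt

# n0, n1, Calc Prob taken from the first few rows of John's output above
n0 = np.array([0., 1., 12., 15., 9., 26.])
n1 = np.array([26., 195., 505., 299., 561., 690.])
calc_prob = np.array([0.9750, 0.9771, 0.9801, 0.9790, 0.9827, 0.9818])

obs = n1 / (n0 + n1)                       # sample proportion in each cell
plt.scatter(obs, calc_prob)
lo, hi = obs.min(), obs.max()
plt.plot([lo, hi], [lo, hi])               # 45-degree reference line
plt.xlabel('observed proportion n1/(n0+n1)')
plt.ylabel('model Calc Prob')
plt.show()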

djh

unread,
Apr 7, 2012, 9:19:55 PM4/7/12
to koo...@sfu.ca
I think I see what has to be done mathematically, but don't know how
to express it statistically. I think it boils down to a
four-predictor logistic model, but I'm not sure.

Each input pair can not only be characterized by two values (vi,vj) of
our current quantized variable v, but also by two values (ci,cj),
where c is the new "count" variable I mentioned in my last couple of
posts. And similarly for each output pair.

So mathematically, our upper-triangular 10x10 matrix V = (vi,vj) is
actually a matrix of matrices, i.e. each cell (vi,vj) of V is itself
a matrix Cij = (cijp, cijq), where p and q are values of C.

I think this boils down to constructing a four-predictor model with
predictors vi,vj,cp,cq, but again, am not sure.

If/when you have a chance, please advise whether I'm looking at this
correctly. If so, I'll have to write a little more PERL to generate
the numbers, but I think it may be worth it. I'm in the process of
hand-tallying a couple of Cij matrices to see what they look like, and
they really seem to be quite orderly ...

One other question - would it be preferable to run some two-predictor
models using just cp,cq before trying to combine these predictors with
the vi,vj predictors? Or do you always have to run all relevant
variables "together"?

Ray Koopman

unread,
Apr 8, 2012, 2:43:43 AM4/8/12
to
On Apr 6, 7:38 am, djh <halitsk...@att.net> wrote:
> Thanks for the advice - I am of course very grateful to get it
> even though it indicates I have a lot more work lying ahead of
> me (which may well also come to naught.)
>
> As you have advised, I will proceed to expand the logistic model
> by adding one possibly important predictor which I have so far
> suppressed. (For details of this expansion, see the section
> "Expanding the Model" at the end of this post.)
>
> But before I do this, I have to ask you one important question
> regarding when it is or isn't legitimate to take only data within
> one sd of a mean, as opposed to "all" the data in a given cell.
> The reason I'm asking this question is because in an earlier post,
> you seemed to indicate that this would be OK to do, whereas in
> your last post here, you seem to be indicating that it's not.

I presume you're referring to your proposal (7:03 am Mar 27) to use
only RMSDs within 1 sd of the mean, to which I replied (10:29 am Mar
27), "Yes, that sort of thing is legit, altho I don't know what I
said that would lead you to that conclusion." In general, if you want
to analyze only the middle values in a data set, ignoring extreme
values that somehow do not represent what you want to talk about,
then it's usually better to define "middle" in terms of something
other than the mean (which is sensitive to outliers) and the sd
(which is even more sensitive to outliers). The most common solution
is a "trimmed mean", which ignores a prespecified proportion of the
most extreme values. But there are many other possibilities.

I misunderstood. I thought you wanted to use only values that were
within 1 sd of the predicted value. However, my comments above about
not basing the selection on the mean and sd still apply. It would
be better to select pairs whose S-values were not in the lowest L%
or the highest H% of the S-distribution. Values of L or H up to 25%
would raise no eyebrows among data analysts.
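
A sketch of that kind of selection (here a symmetric 10% trim on each
side, purely for illustration; the span values are placeholders):

import numpy as np

S = np.random.normal(10.0, 2.0, size=1769)   # placeholder span values for one cell

lo, hi = np.percentile(S, [10, 90])          # drop the lowest 10% and highest 10%
keep = (S >= lo) & (S <= hi)
trimmed_mean = S[keep].mean()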