Oct 5, 2010, 5:47:17 AM

to meds...@googlegroups.com

**Bendix Carstensen <b...@steno.dk>**
Oct 03 linked a question about viewing output to this issue, arriving at:

“It's immaterial how you arrive at the script. The
point is that you run it in batch mode at the end to get a coherent
documentation of what you did from reading data to the results you present.
Otherwise your results do not qualify as reproducible.”

That’s true but to my mind does not go far
enough. Increasingly, labs have exacting standards to determine that
chemicals are pure and fit for use, instruments calibrated, and staff trained
and certified for procedures (other research areas have equivalents). It’s
so depressing that the data then get put through software about which
ISO standard 17025, clause 5.4.7.2, says: “Commercial off-the-shelf software (eg
wordprocessing, database and statistical programmes [sic]) in general use
within their designated application range may be considered to be sufficiently
validated, … the laboratory shall ensure that computer software developed
by the user is documented in sufficient detail and is suitably validated as
being adequate for use, [and] procedures are established and implemented for
protecting the data.”

What that means in practice is the data often goes into
Excel, ad hoc “corrections” are then applied and the file
overwritten, and formulas and data get mixed up on one sheet. The data may
be extracted either by saving a text file or cut’n’paste to create an
R dataset. R is used “because it’s free”. Without
intending disparagement, many people are now using R with no concepts of
programming (a computer does what you tell it, not what you want), numerical
analysis (does 1.5 = 3/2 in a test?), or data structures (eg treatment of missing
values).
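
A minimal sketch of those last two pitfalls, written in Python rather than R and using made-up readings (the helper name `mean_ignoring_missing` is illustrative only). The same behaviour applies in any language that uses binary floating point:

```python
import math

# 1.5 and 3/2 are both exactly representable in binary floating point,
# so this particular test happens to pass ...
print(1.5 == 3 / 2)

# ... but many decimal fractions are not, so exact equality can fail
# and a comparison within a tolerance is the safer test.
print(0.1 + 0.2 == 0.3)
print(math.isclose(0.1 + 0.2, 0.3))

# A missing value (represented here as NaN) is not equal to anything,
# including itself, and it propagates silently through arithmetic.
missing = float("nan")
print(missing == missing)
readings = [4.2, missing, 3.9]          # hypothetical lab readings
print(sum(readings) / len(readings))    # the mean becomes NaN


def mean_ignoring_missing(values):
    """Mean of the non-missing values; raises if none remain."""
    present = [v for v in values if not math.isnan(v)]
    if not present:
        raise ValueError("no non-missing values")
    return sum(present) / len(present)


print(mean_ignoring_missing(readings))
```

Whether missing values should be dropped, as here, or handled some other way is an analysis decision; the point is only that it has to be made explicitly.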

Running analyses from a batch file would be a considerable
step forward, but still means only that the final program is retained with
little record of how it was developed, what changes were made to the data en
route, and what test data had been used to validate the user-written R program.
How can you link an output value to a lab reading when the data may have units
changed, observations aggregated or merged, outliers rejected or values imputed,
and transformations applied? If you change no ***value*** in Excel, it
will still timestamp the file as “last modified” if you change a
column width or font or add a graph.
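
One possible way to tell a change of value from a change of formatting is to fingerprint the cell values themselves instead of trusting the file timestamp. This is only a sketch in Python, assuming the sheet has already been read into rows of cell values; the function name and the sample rows are hypothetical:

```python
import csv
import hashlib
import io


def data_fingerprint(rows):
    """Return a SHA-256 digest computed from the cell values only.

    `rows` is an iterable of rows, each a sequence of cell values.
    Fonts, column widths and graphs never enter the digest.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return hashlib.sha256(buffer.getvalue().encode("utf-8")).hexdigest()


original = [["id", "reading"], [1, 4.2], [2, 3.9]]
reformatted = [["id", "reading"], [1, 4.2], [2, 3.9]]   # same values
corrected = [["id", "reading"], [1, 4.2], [2, 3.8]]     # one value changed

print(data_fingerprint(original) == data_fingerprint(reformatted))
print(data_fingerprint(original) == data_fingerprint(corrected))
```

Recording the digest alongside each saved dataset would show whether two files hold the same values, though it says nothing about which value changed or why.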

My own solution is to use Stata or SPSS, each of which I’ve
configured to keep a record of commands run, not just when I’ve switched
on the log or saved an output file. Stata is particularly helpful in that
any data change made in the interactive editor generates a command in the log
and a session will not exit if the data have been changed without the explicit
override. Nor will it by default overwrite a data file. So I get
the data into a system file as early as possible (from the supplied Excel sheet
E&OE), end the datafile name with yymmdd which should agree with the system
“data modified” and keeps files in name and date order, and
subsequently always (almost always, I’m human) save as a new file with
the date of change. The command log files are text files whose names also
end yymmdd, a new file created every day. This allows simple text
searches to find when I worked on any dataset or variable and enables the
extraction of all changes to the dataset and analyses to be rerun. Not
often enough, I write comments while working, which also go into the log
file. Such logs, of course, document all the analyses and also all the
blunders and blind alleys that are written out of the sanitized official report.
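
The naming scheme described above does not depend on Stata or SPSS. A rough Python sketch of the same idea follows; the file stems, the `cmdlog_` prefix and the example command are assumptions made for illustration, not the configuration actually used:

```python
import datetime
import pathlib
import tempfile


def dated_name(stem, suffix, day=None):
    """Build a name such as 'trial_101005.dta' (stem + yymmdd + suffix)."""
    day = day or datetime.date.today()
    return f"{stem}_{day:%y%m%d}{suffix}"


def log_command(log_dir, command, comment="", now=None):
    """Append one command (and an optional comment) to that day's log file."""
    now = now or datetime.datetime.now()
    log_path = pathlib.Path(log_dir) / dated_name("cmdlog", ".txt", now.date())
    with log_path.open("a", encoding="utf-8") as log:
        if comment:
            log.write(f"* {comment}\n")
        log.write(f"{now:%H:%M:%S} {command}\n")
    return log_path


def find_in_logs(log_dir, text):
    """Return (file name, line) pairs for every log line mentioning `text`."""
    hits = []
    for path in sorted(pathlib.Path(log_dir).glob("cmdlog_*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if text in line:
                hits.append((path.name, line))
    return hits


with tempfile.TemporaryDirectory() as log_dir:
    when = datetime.datetime(2010, 10, 5, 9, 30, 0)
    log_command(log_dir, "replace sbp = . if sbp > 400", "implausible reading", when)
    print(find_in_logs(log_dir, "sbp"))
```

Because the names sort by date, a plain text search across the log directory answers "when did I last touch this variable?".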

Allan

Oct 5, 2010, 8:06:37 AM

to meds...@googlegroups.com

At this juncture it seems pertinent to mention the Reproducible

Research resource...

http://www.reproducibleresearch.org/

By no means scripture, but it does have useful resources on software

and articles on reproducible research.

One that I read a number of years ago that made a huge change in my

approach to work was http://www.bepress.com/bioconductor/paper2/

I too am continually amazed at the lengths people go to to generate

their data in a documented and reproducible manner only to then

completely disregard the same standards for managing that data.

Neil

--

"Our civilization would be pitifully immature without the intellectual

revolution led by Darwin" - Motoo Kimura, The Neutral Theory of

Molecular Evolution

Email - nshe...@gmail.com

Website - http://slack.ser.man.ac.uk/

Photos - http://www.flickr.com/photos/slackline/

Oct 5, 2010, 10:46:28 AM

to MedStats

These thoughts are very helpful. I'll just add that if we got to the

point of disallowing the use of Excel in science, and in scripting all

steps used to arrive at the answers, results would be reproducible and

we'd all be better off. I would rather have code to review (from

almost any language except for the ugly SAS macro language) than the

authors' all-too-brief synopsis of the analyses.

Frank

Oct 5, 2010, 2:09:23 PM

to meds...@googlegroups.com

I would go further than Frank and disallow the use of proprietary

$oftware because you can't examine the source code. Who knows what's

going on under the hood?

Bearing in mind FDA 21 CFR 11 ELECTRONIC RECORDS; ELECTRONIC SIGNATURES

one solution is to use GNU make

(http://www.gnu.org/software/make/manual/make.html).

GNU make allows you to put all your programs/macros/data into a

hierarchy of dependencies. You run the makefile, and make automatically

reruns only those programs which have changed or whose dependencies

have changed. All output can be saved to a log file.

If you use Emacs as your text editor, with version control, then you can

also get a record of all the changes that have taken place which

provides a complete audit trail.
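
The rule make applies can be sketched in a few lines. This is not GNU make itself, just a Python illustration of the timestamp comparison, with hypothetical function names:

```python
import os
import tempfile


def is_stale(target, dependencies):
    """True if `target` is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)


def rebuild_if_stale(target, dependencies, build, log):
    """Run `build()` only when needed and record the decision in `log`."""
    if is_stale(target, dependencies):
        build()
        log.append(f"rebuilt {target}")
        return True
    log.append(f"up to date: {target}")
    return False


with tempfile.TemporaryDirectory() as work_dir:
    data = os.path.join(work_dir, "data.csv")        # hypothetical input
    results = os.path.join(work_dir, "results.txt")  # hypothetical output
    open(data, "w").close()

    def build():
        with open(results, "w") as out:
            out.write("analysis output\n")

    log = []
    rebuild_if_stale(results, [data], build, log)    # results missing: build
    os.utime(data, (100, 100))                       # pretend data is old
    rebuild_if_stale(results, [data], build, log)    # nothing to do
    print(log)
```

A real makefile chains such rules, so that changing one data file reruns only the analyses downstream of it.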

Hope this helps.

John

Oct 5, 2010, 2:11:15 PM

to MedStats

On Oct 5, 10:46 am, Frank Harrell <f.harr...@vanderbilt.edu> wrote:

> These thoughts are very helpful. I'll just add that if we got to the

> point of disallowing the use of Excel in science, and in scripting all

> steps used to arrive at the answers, results would be reproducible and

> we'd all be better off. I would rather have code to review (from

> almost any language except for the ugly SAS macro language) than the

> authors' all-too-brief synopsis of the analyses.

>

> Frank

That seems very doable these days, too. I know that some journals

have "additional material" that is available through the website.

There's no reason why the code used to perform the analysis could not

be part of it.

--

Bruce Weaver

bwe...@lakeheadu.ca

http://sites.google.com/a/lakeheadu.ca/bweaver/Home

"When all else fails, RTFM."

Oct 5, 2010, 2:19:58 PM

to meds...@googlegroups.com

John Hughes wrote

<<<

I would go further than Frank and disallow the use of proprietary

$oftware because you can't examine the source code. Who knows what's

going on under the hood?

>>>

Well, really though, even with open source code, such as R, how many

actually examine the code to a sufficient extent to know it is correct? I

have even seen R packages that explicitly include a warning that this is

beta code and not to be fully trusted. I wonder how many packages OUGHT to

have such warnings? Has a package been fully tested? How fully? OK, the

core R packages are surely fine. And to some extent, you can rely on some

people to do good code.

But how many of us would fully understand, say, plyr() ?

Or would take the time to fully understand any of the cutting edge packages?

(And here I mean, look at each line of code and figure out what it is

doing). If we did this, would we actually have time to do any analysis?

Would anyone at NIH or wherever our grant is submitted take the time to even

MINIMALLY understand the code?

We have to rely on the coders.

Further, we can modify the code in R, and, in modifying it, mess it up. Do

we then have to send all modifications of code with each article?

That doesn't mean proprietary code (SAS, SPSS or whatever) is necessarily

correct. But they have a vested interest in not producing code that is

wrong.

Peter

Oct 6, 2010, 2:50:46 AM

to meds...@googlegroups.com

On Tue, Oct 5, 2010 at 6:19 PM, Peter Flom

<peterflom...@mindspring.com> wrote:

> John Hughes wrote

>

> But how many of us would fully understand, say, plyr() ?

> Or would take the time to fully understand any of the cutting edge packages?

> (And here I mean, look at each line of code and figure out what it is

> doing). If we did this, would we actually have time to do any analysis?

It's not necessarily the need for _everyone_ to be able to do this, but

peer groups/users as a whole can check open-source code for errors.

Not everyone who uses plyr() will have the time to check it, but some

will and one would hope they would feed back any errors found. No

matter how many people use proprietary software none of them are able

to check the code, only that the results are accurate (by

cross-validation in a second piece of software).
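
As a toy illustration of checking a result against a second piece of software, the sketch below computes one statistic three ways and compares the answers. It uses Python and made-up readings; the function names and the tolerance are assumptions, and a real cross-check would compare two separate packages:

```python
import math
import statistics


def variance_two_pass(values):
    """Sample variance: compute the mean first, then squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def variance_welford(values):
    """Sample variance by Welford's one-pass updating algorithm."""
    count, mean, m2 = 0, 0.0, 0.0
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2 / (count - 1)


def cross_check(values, rel_tol=1e-9):
    """Return each implementation's result and whether they all agree."""
    results = {
        "two_pass": variance_two_pass(values),
        "welford": variance_welford(values),
        "statistics": statistics.variance(values),
    }
    reference = results["statistics"]
    agree = all(math.isclose(r, reference, rel_tol=rel_tol)
                for r in results.values())
    return results, agree


readings = [4.2, 3.9, 4.4, 4.0, 4.1]   # hypothetical data
print(cross_check(readings))
```

Agreement between implementations does not prove any of them correct, but disagreement is a cheap and reliable alarm.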

A useful paper in this area is..

@article{Kee07,
  author = {Keeling},
  title = {A comparative study of the reliability of nine statistical software packages},
  journal = {Computational Statistics \& Data Analysis},
  year = {2007},
  month = {May},
  volume = {51},
  pages = {3811--3831},
}

> That doesn't mean proprietary code (SAS, SPSS or whatever) is necessarily

> correct. But they have a vested interest in not producing code that is

> wrong.

>

Those who write open-source code also have a vested interest in

writing correct code! Most of the time the packages/software will

have been developed because of their need to solve a problem (and to

solve it accurately).

Neil

--

"Our civilization would be pitifully immature without the intellectual

revolution led by Darwin" - Motoo Kimura, The Neutral Theory of

Molecular Evolution

Email - nshe...@gmail.com

Oct 6, 2010, 6:38:32 AM

to meds...@googlegroups.com

Neil Shephard wrote (in part)

<<<

It's not necessarily the need for _everyone_ to be able to do this, but

peer groups/users as a whole can check open-source code for errors.

Not everyone who uses plyr() will have the time to check it, but some

will and one would hope they would feed back any errors found. No

matter how many people use proprietary software none of them are able

to check the code, only that the results are accurate (by

cross-validation in a second piece of software).

>>>

I will check out the paper in your message, thanks

I wonder how often cross validation is actually done? Or even possible? One

of the often touted advantages of R is that it has "cutting edge"

statistics. This is correct; many statistical programmers and statisticians

write new methods in R. But these methods, by their nature, can't be

cross-validated. Of course, we can check that the answers make sense ....

if they are off by huge amounts or in obvious ways, we can detect it. But

what of more subtle errors?

I am certainly not against R or other open software. I use R, myself (in

addition to SAS). But open source is not a panacea.

Peter

Oct 7, 2010, 8:19:50 AM

to meds...@googlegroups.com

Hello,

I have a set of data that include measurements made by two different

machines. One is a basic machine that measures six different variables

relating to eye shape and the other is a far more sophisticated machine that

measures 20 different variables (plus those that the first machine can do).

I would like to see how much better variables collected on the sophisticated

machine are at predicting contact lens fit than the basic machine.

I have used stepwise regression analysis to get two regression models, one

with the variables collected on the basic machine and one with all variables.

My two final models each have one factor; one has an adjusted R^2 of 0.07

and the other 0.14. Is it possible to say the latter is statistically better

than the former, and if so how do I do it?

Many thanks,

Chris

Oct 7, 2010, 8:31:57 AM

to meds...@googlegroups.com

Chris Hunt wrote

If you use stepwise, you can't really say anything, except that your results

are all wrong.

See Harrell, Regression Modeling Strategies (excellent book) or Flom and

Cassell, Stopping Stepwise: Why stepwise variable selection methods are

wrong, and what you should use (a paper presented various places), or other

books and online resources. Stepwise methods are just wrong.

Now, how to actually answer your question. What is your sample size? Are

the two machines used on the same people, or different people?

If it is large enough, you could run a regression with ALL the variables

from each machine, and then compare R^2. It is virtually certain that the

model with more variables will have a higher R^2; if the two samples are the

same, then delete "virtually". Is that what you want to know? Or do you

want to know HOW much better? Or (for some reason) do you want to know if

the difference is statistically significant?

(Of course, you'll want to check each regression model for the assumptions

and for collinearity).

One thing is to compare various fit indices, like AIC, AICC, BIC, SBC .... I

don't have a strong preference among them, and in my experience they tend to

agree.
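
For least-squares models fitted to the same outcome, AIC and BIC (also called SBC) can be computed from the residual sum of squares. The sketch below is in Python with invented data and one predictor per model; the variable names are hypothetical and the formulas drop a constant that is the same for both models:

```python
import math


def simple_ols_rss(x, y):
    """Residual sum of squares from regressing y on one predictor x."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def aic_bic(rss, n, n_params):
    """Gaussian AIC and BIC, up to a constant shared by all models of y.

    n_params counts the regression coefficients, intercept included.
    """
    aic = n * math.log(rss / n) + 2 * n_params
    bic = n * math.log(rss / n) + math.log(n) * n_params
    return aic, bic


# Invented example: one outcome, one predictor from each machine.
fit = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]
x_basic = [2, 1, 4, 3, 6, 5]
x_new = [1, 2, 3, 4, 5, 6]

for label, x in (("basic", x_basic), ("new", x_new)):
    rss = simple_ols_rss(x, fit)
    print(label, aic_bic(rss, len(fit), n_params=2))
```

Lower values indicate the preferred model. These indices rank models; they are not a significance test, and they do not undo the effects of having selected the predictors from the same data.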

Another idea is to use graphs of the "degree of fit" of each model.

If the samples are NOT very large, I suggest using LASSO or LAR, which are

available in both SAS and R, and maybe other software as well.

HTH

Peter

Oct 7, 2010, 8:45:00 AM

to meds...@googlegroups.com

In addition to Peter's comments above, I would supplement his question:

"Are the two machines used on the same people, or different people?"

If they were used on the same people, how do the variables which

were measured by both machines compare? (I.e. when the same variables

are measured on the same person by the two different machines,

how closely do they agree?).

The answer to this question could be very important in determining

a good approach to your original question!

Ted.

--------------------------------------------------------------------

E-Mail: (Ted Harding) <ted.h...@wlandres.net>

Fax-to-email: +44 (0)870 094 0861

Date: 07-Oct-10 Time: 13:44:56

------------------------------ XFMail ------------------------------

Oct 7, 2010, 9:10:48 AM

to meds...@googlegroups.com

Thanks for your swift replies.

The machines were used on the same people and the variables that were

collected on both were the same. There were 50 people.

Obviously if I run a regression analysis with all predictors from each

machine, the new one will produce a better model, although many of these

variables are related, so the model would be flawed! I think my real

question is, if I pick the best predictor variable from the basic machine

and the best one or two independent predictors from the new machine, which

will give a better model?

Regards

Chris

--

To post a new thread to MedStats, send email to MedS...@googlegroups.com .

MedStats' home page is http://groups.google.com/group/MedStats .

Rules: http://groups.google.com/group/MedStats/web/medstats-rules

Oct 7, 2010, 10:13:27 AM

to MedStats

On Oct 7, 9:10 am, Chris Hunt <cr...@hotmail.com> wrote:

> Thanks for your swift replies.

>

> The machines were used on the same people and the variables that were

> collected on both were the same. There were 50 people.

>

> Obviously if I run a regression analysis with all predictors from each

> machine, the new one will produce a better model. Although many of these

> variables are related so the model would be flawed! I think my real

> question is, if I pick the best predictor variable from the basic machine

> and the best one or two independent predictors from the new machine, which

> will give a better model?

>

> Regards

> Chris

Hi Chris. Your sample size is not anywhere near large enough to run a

model with all 20 explanatory variables. Here's a short note you

might find useful.

www.angelfire.com/wv/bwhomedir/notes/linreg_rule_of_thumb.txt

Cheers,

Bruce

Oct 7, 2010, 10:30:21 AM

to meds...@googlegroups.com

That is why I have selected the best predictor for the model and not used

them all.

Thanks

Oct 7, 2010, 10:39:14 AM

to meds...@googlegroups.com

The trouble would be in how you got the 'best predictor' model. STEPWISE

does NOT give you this. To really get the best predictor model, you'd need

to look at EVERY combination of variables. But both that method and

stepwise give wrong results - p values are too low, parameter estimates are

biased away from 0, and so on.
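
A small simulation can show why picking the "best" predictor misleads. In this Python sketch the outcome and all 20 candidate predictors are pure noise on 50 subjects (numbers chosen to echo the thread; the function names, seed and number of runs are arbitrary assumptions), yet the best-looking predictor typically shows a much larger R^2 than a typical one:

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor regression."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def simulate(n_subjects=50, n_predictors=20, n_runs=200, seed=1):
    """Average R^2 of a typical predictor and of the selected 'best' one,
    when no predictor has any real relationship with the outcome."""
    rng = random.Random(seed)
    typical, best = [], []
    for _ in range(n_runs):
        y = [rng.gauss(0, 1) for _ in range(n_subjects)]
        scores = [
            r_squared([rng.gauss(0, 1) for _ in range(n_subjects)], y)
            for _ in range(n_predictors)
        ]
        typical.append(sum(scores) / len(scores))
        best.append(max(scores))
    return sum(typical) / n_runs, sum(best) / n_runs


typical_r2, best_r2 = simulate()
print(f"typical R^2: {typical_r2:.3f}, best of 20: {best_r2:.3f}")
```

The gap between the two averages is produced entirely by the selection step, which is the sense in which estimates from a selected model are biased away from 0.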

If you must use an automated method, I recommend (again) using a penalized

method such as LASSO or LAR. In SAS, these can be implemented in PROC

GLMSELECT. In R, you can use the lars package. But I recommend even more

strongly using a model based on substantive knowledge.

LAR or LASSO are MUCH better than STEPWISE, but they are still ways of

letting the computer do your thinking for you. Sometimes, there's no

alternative. But it's not ideal.

Peter

Oct 7, 2010, 11:12:10 AM

to meds...@googlegroups.com

I used the correlation coefficient to pick out the few "significant" variables,

then looked at each combination of these to find the best model, taking into

account variables that are correlated with each other.

I know this is not the best method but I think it is better than the

automatic stepwise method. I don't think SPSS does LASSO or LAR.

I know you can compare two regression models if one is nested in the other.

But I cannot find anything to compare two models where the dependent

variable is the same but the independent variables are different. In fact,

the independent variables in each model may be related to each other but not

be the same.

Chris

Oct 7, 2010, 11:54:54 AM

to meds...@googlegroups.com

On Thu, Oct 7, 2010 at 3:12 PM, Chris Hunt <cr...@hotmail.com> wrote:

> I know you can compare two regression models if one is nested in the other.

> But I cannot find anything to compare two models where the dependant

> variable is the same but the independent variables are different. In fact,

> the independent variables in each model may be related to each other but not

> be the same.

Seemingly unrelated regression may be of some utility.

A brief overview is on Wikipedia (but I'd recommend reading more in

formal textbooks):

http://en.wikipedia.org/wiki/Seemingly_unrelated_regressions

Oct 7, 2010, 12:12:19 PM

to meds...@googlegroups.com

Chris,

It seems that you really have 2 questions here.

1. are the new and old machines equivalent (on the variables that they both measure)?

2. do the additional variables measured by the new machine give additional information?

Question 1 can be approached using Bland-Altman plots rather than regression, that whole process will probably be more informative and enlightening than doing canned regressions.
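
The numbers behind a Bland-Altman plot are simple to compute. A sketch in Python, with invented paired readings for one variable measured by both machines (the function and variable names are hypothetical, and 1.96 assumes roughly normal differences):

```python
import statistics


def bland_altman(method_a, method_b):
    """Bias and 95% limits of agreement for paired measurements."""
    differences = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)
    return {
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
        "points": list(zip(means, differences)),  # what the plot would show
    }


# Invented readings of the same variable on the same five eyes.
basic_machine = [7.8, 7.6, 8.0, 7.7, 7.9]
new_machine = [7.7, 7.7, 7.9, 7.7, 8.0]

summary = bland_altman(basic_machine, new_machine)
print(summary["bias"], summary["lower"], summary["upper"])
```

The plot itself is the differences against the pair means, with horizontal lines at the bias and the two limits; whether the limits are narrow enough is a clinical judgement, not a statistical one.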

Question 2 can be approached by comparing nested regression models. The ideal (but you don't have enough data for meaningful results) is to have the full model include all variables and the reduced model include just the variables from the old machine. The comparison then tests if the new machine's additional variables contribute significantly beyond the old machine. You can do something similar with subsets of the variables.
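
The nested comparison is usually done with a partial F test. A minimal sketch in Python, assuming ordinary least squares and that the residual sums of squares have already been obtained from the two fits (the function name and the example numbers are invented):

```python
def partial_f(rss_reduced, rss_full, n, k_reduced, k_full):
    """F statistic and degrees of freedom for comparing nested OLS models.

    k_reduced and k_full count predictors, excluding the intercept.
    """
    if k_full <= k_reduced:
        raise ValueError("the full model must contain more predictors")
    df_num = k_full - k_reduced
    df_den = n - k_full - 1
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den


# Invented figures: 50 subjects, 1 old-machine predictor, 2 added predictors.
print(partial_f(rss_reduced=120.0, rss_full=100.0, n=50, k_reduced=1, k_full=3))
```

The statistic is referred to an F distribution with the returned degrees of freedom. The test is only valid if the predictors in both models were specified in advance, not chosen by looking at the same data.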

Note also that with R^2 values of 0.14 you are likely to find that the opinion of the doctor's assistant gives better predictions than either machine.

--

Gregory (Greg) L. Snow Ph.D.

Statistical Data Center

Intermountain Healthcare

greg...@imail.org

801.408.8111

Oct 7, 2010, 5:54:41 PM

to MedStats

On Oct 7, 8:12 am, Chris Hunt <cr...@hotmail.com> wrote:

> [...]

> I know you can compare two regression models if one is nested in

> the other. But I cannot find anything to compare two models where

> the dependant variable is the same but the independent variables

> are different. In fact, the independent variables in each model

> may be related to each other but not be the same.

1. If the two predictor sets share some variables then partial the

common predictors out of the other predictors and the d.v., drop

the common predictors from both sets, and decrement the sample size

by the number of common predictors.

2. Regress the d.v. on the two predictor sets separately, and get

the residuals from each regression.

3. If the two predictor sets have the same number of predictors

then test the hypothesis that the two residuals have equal variances

(i.e., test the hypothesis that Correlation(u1+u2,u1-u2) = 0, where

u1 & u2 are the residuals.)

4. If the two predictor sets have different numbers of predictors

then ???. The analysis should probably be similar to that in step 3,

but should include some adjustment for the numbers of predictors.

Oct 7, 2010, 6:24:19 PM

to MedStats

On Oct 7, 2:54 pm, Ray Koopman <koop...@sfu.ca> wrote:

> On Oct 7, 8:12 am, Chris Hunt <cr...@hotmail.com> wrote:

>

>> [...]

>> I know you can compare two regression models if one is nested in

>> the other. But I cannot find anything to compare two models where

>> the dependant variable is the same but the independent variables

>> are different. In fact, the independent variables in each model

>> may be related to each other but not be the same.

>

> 1. If the two predictor sets share some variables then partial the

> common predictors out of the other predictors and the d.v., drop

> the common predictors from both sets, and decrement the sample size

> by the number of common predictors.

>

> 2. Regress the d.v. on the two predictor sets separately, and get

> the residuals from each regression.

>

> 3. If the two predictor sets have the same number of of predictors

> then test the hypothesis that the two residuals have equal variances

> (i.e., test the hypothesis that Correlation(u1+u2,u1-u2) = 0, where

> u1 & u2 are the residuals.)

The df for the test is not the usual n-2, but n-p-1, where p = the

number of predictors in each set.

Oct 7, 2010, 6:30:38 PM

to MedStats

On Oct 7, 3:24 pm, Ray Koopman <koop...@sfu.ca> wrote:

> On Oct 7, 2:54 pm, Ray Koopman <koop...@sfu.ca> wrote:

>> On Oct 7, 8:12 am, Chris Hunt <cr...@hotmail.com> wrote:

>>> [...]

>>> I know you can compare two regression models if one is nested in

>>> the other. But I cannot find anything to compare two models where

>>> the dependant variable is the same but the independent variables

>>> are different. In fact, the independent variables in each model

>>> may be related to each other but not be the same.

>>

>> 1. If the two predictor sets share some variables then partial the

>> common predictors out of the other predictors and the d.v., drop

>> the common predictors from both sets, and decrement the sample size

>> by the number of common predictors.

>>

>> 2. Regress the d.v. on the two predictor sets separately, and get

>> the residuals from each regression.

>>

>> 3. If the two predictor sets have the same number of of predictors

>> then test the hypothesis that the two residuals have equal variances

>> (i.e., test the hypothesis that Correlation(u1+u2,u1-u2) = 0, where

>> u1 & u2 are the residuals.)

>

> The df for the test is not the usual n-2, but n-p-1, where p = the

> number of predictors in each set.

It's been a bad day -- make that n-p-2, not n-p-1.

Oct 8, 2010, 12:53:55 AM10/8/10

to MedStats

Let p1 & p2 be the numbers of predictors in the two sets.

Then df1 = n-p1-2, and df2 = n-p2-2. Instead of just u1 & u2,

use u1/sqrt(df1) and u2/sqrt(df2), whose variances will be

equal under the null, and compute t with df = 2/(1/df1 + 1/df2),

the harmonic mean of df1 & df2.

And yes, I realize that step 1 is not necessary. That was

just me thinking out loud, trying to simplify the problem.
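
A minimal sketch of the unequal-p version described above, assuming the residuals u1 and u2 have already been obtained from the two regressions. The function names `pearson` and `compare_residual_variances` are hypothetical, and turning the correlation into t via r*sqrt(df)/sqrt(1-r^2) is the usual conversion rather than something stated in the post. No p-value is computed; that would need a t distribution with the (possibly fractional) df returned here.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def compare_residual_variances(u1, u2, p1, p2):
    """Return (r, t, df) for the sum/difference correlation test.

    u1, u2 : residuals from regressing the d.v. on each predictor set
    p1, p2 : number of predictors in each set
    """
    n = len(u1)
    if len(u2) != n:
        raise ValueError("residual vectors must have the same length")
    df1 = n - p1 - 2
    df2 = n - p2 - 2
    if df1 <= 0 or df2 <= 0:
        raise ValueError("too few observations for these predictor counts")

    # Scale each residual so the variances are equal under the null.
    s1 = [u / math.sqrt(df1) for u in u1]
    s2 = [u / math.sqrt(df2) for u in u2]

    total = [a + b for a, b in zip(s1, s2)]
    diff = [a - b for a, b in zip(s1, s2)]
    r = pearson(total, diff)

    # Harmonic mean of df1 and df2, as suggested in the post.
    df = 2.0 / (1.0 / df1 + 1.0 / df2)

    # Assumed conversion of a correlation to a t statistic.
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return r, t, df
```

A positive r (and t) points to the first model leaving the larger scaled residual variance, a negative one to the second.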

Oct 8, 2010, 2:00:54 AM10/8/10

to meds...@googlegroups.com

Hm, this one is a runner isn't it? Tried to send this yesterday but

it doesn't seem to have got through. Health warning: I'm not a statistician.

1) You seem to be confirming that the convergent validity of the six

measurements from the two different machines is perfect or near perfect,

is that right? At least that gets rid of one issue if so and it

probably suggests that there is also very low unreliability.

2) Since I'm not a statistician, my first question is whether there is

some a priori model. So:

2a) Is "contact lens fit" a single variable? It sounds like an issue

about the congruence of two two-dimensional surfaces to me. They are

approximately spheroidal but presumably have curvature,

diameter, i.e. arc subtended and some "conical" distortion factor(s)

(astigmatism). My guess is that in reality the surfaces are

actually more complex than that, but any physics-based reason to

predict a particular relationship between your measurements

and "fit" is going to help you greatly.

2b) Is fit continuous or categorical? We seem to be assuming that it's

continuous and linear but I think it could be ordinal or even a survival

time measure couldn't it? What is it?

2c) We are assuming that there are NO physical aspects of the six

variables or the additional 20 that you would expect to predict

interactions between them and the fit. That's vital.

2d) IF you have reason to believe that linear regression is the sole and

best fit to any predictive relationships and that there aren't reasons

to expect interactions, what about the correlations within the

predictors? Are there a priori reasons to expect correlations and/or

are there strong and significant observed correlations:

2di) between the six common predictor variables?

2dii) between the additional 20?

2diii) between any/all of the six and any/all of the 20?

3) If there are strong observed correlations within the six and within

the 20 then you will have multicollinearity and will have weakness in

the estimation of the regression of them onto the dependent: beware!

4) I don't know about lasso etc but if you are confident that purely

linear regression between correlated predictors and a single linear

dependent is a good model for both the six and the 26 predictors then

wouldn't this work:

4a) enter all six variables into a linear prediction against the

dependent and save the residuals,

4b) now run a canonical correlation analysis of the correlation of the

20 additional variables onto the residuals,

4c) if it's significant, you've got evidence there is some systematic

variance unexplained by the six and

4d) the loadings of the 20 onto the resulting canonical correlation

variable tell you which of the 20 variables contribute most to that

regression equation.
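
Steps 4a and 4b could be sketched roughly as below. With only one variable (the residual) on one side, the canonical correlation is the same thing as the multiple correlation R from regressing that residual on the 20, so the sketch uses ordinary least squares throughout. The function names are hypothetical, the solver is a bare-bones normal-equations routine for illustration, and no significance test is included. Note that, as in the steps as written, the six are not partialled out of the 20.

```python
import math


def ols_residuals(X, y):
    """Residuals from least-squares regression of y on the columns of X
    plus an intercept. X is a list of rows, y a list of values."""
    n = len(y)
    A = [[1.0] + list(row) for row in X]  # prepend intercept column
    k = len(A[0])
    # Augmented normal equations: (A'A | A'y)
    M = [
        [sum(A[i][r] * A[i][c] for i in range(n)) for c in range(k)]
        + [sum(A[i][r] * y[i] for i in range(n))]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(k):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[r][k] / M[r][r] for r in range(k)]
    return [y[i] - sum(b * a for b, a in zip(beta, A[i])) for i in range(n)]


def multiple_r(X, y):
    """Multiple correlation R of y with the columns of X."""
    res = ols_residuals(X, y)
    my = sum(y) / len(y)
    ss_tot = sum((v - my) ** 2 for v in y)
    ss_res = sum(e * e for e in res)
    return math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))


def extra_signal(base, extra, y):
    """Step 4a: residualise y on the base predictors.
    Step 4b: correlate those residuals with the extra predictors."""
    resid = ols_residuals(base, y)
    return multiple_r(extra, resid)
```

With 20 extra predictors and n=50 the returned R will look sizeable by chance alone, so it needs a proper test (or cross-validation) before being read as evidence.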

However, on n=50 you have very little power as others are saying and any

a priori model clarity should be used to help preserve any power it

might give and any plausible non-linearity in the model will pretty much

ensure that you have no real test as yet.

Ultimately, as Greg Snow said, with R^2 values of .07 and .14, you know

you have very little predictive regression so I doubt if this is going

to give you significant results on n=50.
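
To put rough numbers on that, here is a small sketch using the standard overall F statistic and adjusted R^2 formulas. It assumes, and this is only an assumption since the thread does not say so explicitly here, that the R^2 of .07 belongs to the six-predictor model and the .14 to the 26-predictor model, both fitted with an intercept on n=50. The function names are hypothetical.

```python
def overall_f(r2, n, p):
    """Overall regression F statistic, with p and n-p-1 df."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors and an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Assumed pairing of R^2 values with model sizes (not confirmed in the thread).
for label, r2, p in [("six predictors", 0.07, 6), ("26 predictors", 0.14, 26)]:
    print(label, round(overall_f(r2, 50, p), 2), round(adjusted_r2(r2, 50, p), 2))
```

Under those assumptions both F statistics fall below 1 and both adjusted R^2 values are negative, which is consistent with the point that neither model shows detectable predictive value at this sample size.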

Cheers all,

Chris

Chris Hunt sent the following at 07/10/2010 14:10:

--

Chris Evans <ch...@psyctc.org> Skype: chris-psyctc

Consultant Psychiatrist in Psychotherapy, Notts. PDD network;

Clinical Director, Psychological Therapies, Nottinghamshire NHS Trust;

Professor, Psychotherapy, Nottingham University

*If I am writing from one of those roles, it will be clear. Otherwise*

*my views are my own and not representative of those institutions *

If you have difficulty Emailing me on this address or getting a reply,

send again but cc to: chris dot evans at nottshc dot nhs dot uk

and to: c dot evans at nottingham dot ac dot uk
