Bendix Carstensen <b...@steno.dk> (Oct 03) linked a question about viewing output to this issue, concluding:
“It's immaterial how you arrive at the script. The
point is that you run it in batch mode at the end to get a coherent
documentation of what you did, from reading the data to the results you present.
Otherwise your results do not qualify as reproducible.”
That’s true but to my mind does not go far enough. Increasingly, labs have exacting standards to determine that chemicals are pure and fit for use, instruments calibrated, and staff trained and certified for procedures (other research areas have equivalents). It’s depressing that the data are then put through software covered only by ISO standard 17025, clause 5.4.7.2: “Commercial off-the-shelf software (eg wordprocessing, database and statistical programmes [sic]) in general use within their designated application range may be considered to be sufficiently validated, … the laboratory shall ensure that computer software developed by the user is documented in sufficient detail and is suitably validated as being adequate for use, [and] procedures are established and implemented for protecting the data.”
What that means in practice is that the data often go into Excel, ad hoc “corrections” are then applied and the file overwritten, and formulas and data get mixed up on one sheet. The data may then be extracted, either by saving a text file or by cut’n’paste, to create an R dataset. R is used “because it’s free”. Without intending disparagement, many people are now using R with no grounding in programming (a computer does what you tell it, not what you want), numerical analysis (does 1.5 = 3/2 in a test?), or data structures (eg treatment of missing values).
Running analyses from a batch file would be a considerable step forward, but still means only that the final program is retained with little record of how it was developed, what changes were made to the data en route, and what test data had been used to validate the user-written R program. How can you link an output value to a lab reading when the data may have units changed, observations aggregated or merged, outliers rejected or values imputed, and transformations applied? If you change no *value* in Excel, it will still timestamp the file as “last modified” if you change a column width or font or add a graph.
My own solution is to use Stata or SPSS, each of which I’ve configured to keep a record of commands run at all times, not just when I’ve switched on the log or saved an output file. Stata is particularly helpful in that any data change made in the interactive editor generates a command in the log, and a session will not exit if the data have been changed without an explicit override. Nor will it by default overwrite a data file. So I get the data into a system file as early as possible (from the supplied Excel sheet, E&OE), end the datafile name with yymmdd, which should agree with the system “date modified” and keeps files in name and date order, and subsequently always (almost always, I’m human) save as a new file with the date of change. The command log files are text files whose names also end in yymmdd, a new file being created every day. This allows simple text searches to find when I worked on any dataset or variable, and enables all changes to the dataset to be extracted and analyses to be rerun. Not often enough, I write comments while working, which also go into the log file. Such logs, of course, document all the analyses, and also all the blunders and blind alleys that are written out of the sanitized official report.
Allan
http://www.reproducibleresearch.org/
By no means scripture, but it does have useful resources on software
and articles on reproducible research.
One that I read a number of years ago that made a huge change in my
approach to work was http://www.bepress.com/bioconductor/paper2/
I too am continually amazed at the lengths people go to to generate
their data in a documented and reproducible manner, only to then
completely disregard the same standards for managing that data.
Neil
--
"Our civilization would be pitifully immature without the intellectual
revolution led by Darwin" - Motoo Kimura, The Neutral Theory of
Molecular Evolution
Email - nshe...@gmail.com
Website - http://slack.ser.man.ac.uk/
Photos - http://www.flickr.com/photos/slackline/
Bearing in mind FDA 21 CFR Part 11 (Electronic Records; Electronic
Signatures), one solution is to use GNU make
(http://www.gnu.org/software/make/manual/make.html).
GNU make allows you to put all your programs/macros/data into a
hierarchy of dependencies. When you run the make batch file, make
will automatically run only those programs which have changed or whose
dependencies have changed. All output can be saved to a log file.
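As a minimal sketch, a Makefile driving two R scripts might look like
the following (all file names here are hypothetical): editing only
analysis.R reruns just the final step, while editing clean.R reruns
both steps.

    results.txt: analysis.R clean_data.csv
    	R CMD BATCH analysis.R    # also writes analysis.Rout as a log

    clean_data.csv: clean.R raw_data.csv
    	R CMD BATCH clean.R       # also writes clean.Rout as a log

(Each recipe line must start with a tab.)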
If you use Emacs with version control as your text editor, then you can
also get a record of all the changes that have taken place, which
provides a complete audit trail.
Hope this helps.
John
Well, really though, even with open-source code such as R, how many
people actually examine the code to a sufficient extent to know it is
correct? I have even seen R packages that explicitly include a warning
that this is beta code and not to be fully trusted. I wonder how many
packages OUGHT to have such warnings? Has a package been fully tested?
How fully? OK, the core R packages are surely fine. And to some extent,
you can rely on some people to write good code.
But how many of us would fully understand, say, the plyr package?
Or would take the time to fully understand any of the cutting edge packages?
(And here I mean, look at each line of code and figure out what it is
doing). If we did this, would we actually have time to do any analysis?
Would anyone at NIH or wherever our grant is submitted take the time to even
MINIMALLY understand the code?
We have to rely on the coders.
Further, we can modify the code in R, and, in modifying it, mess it up. Do
we then have to send all modifications of code with each article?
That doesn't mean proprietary code (SAS, SPSS or whatever) is necessarily
correct. But they have a vested interest in not producing code that is
wrong.
Peter
It's not necessary for _everyone_ to be able to do this, but peer
groups/users as a whole can check open-source code for errors.
Not everyone who uses plyr will have the time to check it, but some
will, and one would hope they would feed back any errors found. No
matter how many people use proprietary software, none of them are able
to check the code, only whether the results are accurate (by
cross-validation against a second piece of software).
A useful paper in this area is:
@article{Kee07,
  author  = {Keeling},
  title   = {A comparative study of the reliability of nine
             statistical software packages},
  journal = {Computational Statistics \& Data Analysis},
  year    = {2007},
  month   = {May},
  volume  = {51},
  pages   = {3811--3831},
}
> That doesn't mean proprietary code (SAS, SPSS or whatever) is necessarily
> correct. But they have a vested interest in not producing code that is
> wrong.
>
Those who write open-source code have a vested interest in
writing correct code too! Most of the time the packages/software will
have been developed because of the author's need to solve a problem
(and to solve it accurately).
Neil
<<<
It's not necessary for _everyone_ to be able to do this, but peer
groups/users as a whole can check open-source code for errors.
Not everyone who uses plyr will have the time to check it, but some
will, and one would hope they would feed back any errors found. No
matter how many people use proprietary software, none of them are able
to check the code, only whether the results are accurate (by
cross-validation against a second piece of software).
>>>
I will check out the paper in your message, thanks.
I wonder how often cross-validation is actually done? Or even possible?
One of the often-touted advantages of R is that it has "cutting edge"
statistics. This is correct; many statistical programmers and statisticians
write new methods in R. But these methods, by their nature, can't be
cross-validated. Of course, we can check that the answers make sense ....
if they are off by huge amounts or in obvious ways, we can detect it. But
what of more subtle errors?
I am certainly not against R or other open software. I use R myself (in
addition to SAS). But open source is not a panacea.
Peter
I have a set of data that include measurements made by two different
machines. One is a basic machine that measures six different variables
relating to eye shape and the other is a far more sophisticated machine that
measures 20 different variables (plus those that the first machine can do).
I would like to see how much better variables collected on the sophisticated
machine are at predicting contact lens fit than the basic machine.
I have used stepwise regression analysis to get two regression models, one
with only the variables collected on the basic machine and one with all variables.
My two final models each have one factor; one has an adjusted R^2 of 0.07
and the other 0.14. Is it possible to say the latter is statistically better
than the former, and if so how do I do it?
Many thanks,
Chris
If you use stepwise, you can't really say anything, except that your results
are all wrong.
See Harrell, Regression Modeling Strategies (excellent book) or Flom and
Cassell, Stopping Stepwise: Why stepwise variable selection methods are
wrong, and what you should use (a paper presented at various venues), or other
books and online resources. Stepwise methods are just wrong.
Now, how to actually answer your question. What is your sample size? Are
the two machines used on the same people, or different people?
If it is large enough, you could run a regression with ALL the variables
from each machine, and then compare R^2. It is virtually certain that the
model with more variables will have a higher R^2; if the two samples are the
same, then delete "virtually". Is that what you want to know? Or do you
want to know HOW much better? Or (for some reason) do you want to know if
the difference is statistically significant?
(Of course, you'll want to check each regression model for the assumptions
and for collinearity).
One thing is to compare various fit indices, like AIC, AICc, and BIC (also
called SBC). I don't have a strong preference among them, and in my
experience they tend to agree.
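As a minimal R sketch (data and variable names hypothetical), assuming
both models are fitted to the same people:

    fit_basic <- lm(fit_score ~ basic_var, data = eyes)  # best basic-machine predictor
    fit_new   <- lm(fit_score ~ new_var,   data = eyes)  # best new-machine predictor
    AIC(fit_basic, fit_new)  # lower is better
    BIC(fit_basic, fit_new)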
Another idea is to use graphs of the "degree of fit" of each model.
If the samples are NOT very large, I suggest using LASSO or LAR, which are
available in both SAS and R, and maybe other software as well.
HTH
Peter
In addition to Peter's comments above, I would supplement his question:
"Are the two machines used on the same people, or different people?"
If they were used on the same people, how do the variables which
were measured by both machines compare? (I.e. when the same variables
are measured on the same person by the two different machines,
how closely do they agree?).
The answer to this question could be very important in determining
a good approach to your original question!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.h...@wlandres.net>
--------------------------------------------------------------------
The machines were used on the same people and the variables that were
collected on both were the same. There were 50 people.
Obviously if I run a regression analysis with all predictors from each
machine, the new one will produce a better model, although many of these
variables are related, so the model would be flawed! I think my real
question is, if I pick the best predictor variable from the basic machine
and the best one or two independent predictors from the new machine, which
will give a better model?
Regards
Chris
www.angelfire.com/wv/bwhomedir/notes/linreg_rule_of_thumb.txt
Cheers,
Bruce
If you must use an automated method, I recommend (again) using a penalized
method such as LASSO or LAR. In SAS, these can be implemented in PROC
GLMSELECT. In R, you can use the lars package. But I even more strongly
recommend using a model based on substantive knowledge.
LAR or LASSO are MUCH better than STEPWISE, but they still are ways of
letting the computer do your thinking for you. Sometimes, there's no
alternative. But it's not ideal.
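As a minimal R sketch using the lars package (variable names
hypothetical; the glmnet package is an alternative):

    library(lars)
    X <- as.matrix(eyes[, c("curvature", "diameter", "elevation")])  # predictors (hypothetical)
    y <- eyes$fit_score                                              # response (hypothetical)
    fit <- lars(X, y, type = "lasso")     # the whole lasso path
    cv  <- cv.lars(X, y, type = "lasso")  # cross-validation to choose the penalty
    plot(fit)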
Peter
I know this is not the best method but I think it is better than the
automatic stepwise method. I don't think SPSS does LASSO or LAR.
I know you can compare two regression models if one is nested in the other.
But I cannot find anything to compare two models where the dependent
variable is the same but the independent variables are different. In fact,
the independent variables in each model may be related to each other but not
be the same.
Chris
Seemingly unrelated regression may be of some utility.
A brief overview is on Wikipedia (but I'd recommend reading more in
formal textbooks):
http://en.wikipedia.org/wiki/Seemingly_unrelated_regressions
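As a minimal R sketch with the systemfit package (formulas and variable
names hypothetical):

    library(systemfit)
    # Same response, different predictors in each equation
    eq_basic <- fit_score ~ basic_var1 + basic_var2
    eq_new   <- fit_score ~ new_var1 + new_var2
    sur <- systemfit(list(basic = eq_basic, new = eq_new), method = "SUR", data = eyes)
    summary(sur)  # estimates allowing for correlated errors across the two equations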
It seems that you really have 2 questions here.
1. Are the new and old machines equivalent (on the variables that they both measure)?
2. Do the additional variables measured by the new machine give additional information?
Question 1 can be approached using Bland-Altman plots rather than regression; that whole process will probably be more informative and enlightening than doing canned regressions.
Question 2 can be approached by comparing nested regression models (see the sketch below). The ideal (though you don't have enough data for meaningful results) is to have the full model include all variables and the reduced model include just the variables from the old machine. The comparison then tests whether the new machine's additional variables contribute significantly beyond the old machine. You can do something similar with subsets of the variables.
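A minimal R sketch of both steps (variable names hypothetical):

    # Question 1: Bland-Altman plot for one variable measured by both machines
    d <- eyes$curv_old - eyes$curv_new
    m <- (eyes$curv_old + eyes$curv_new) / 2
    plot(m, d, xlab = "Mean of the two machines", ylab = "Difference (old - new)")
    abline(h = mean(d) + c(-1.96, 0, 1.96) * sd(d), lty = c(2, 1, 2))  # limits of agreement

    # Question 2: partial F-test comparing nested models
    reduced <- lm(fit_score ~ curv_old + diam_old, data = eyes)  # old-machine variables only
    full    <- update(reduced, . ~ . + new_var1 + new_var2)      # add the new machine's variables
    anova(reduced, full)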
Note also that with R^2 values of 0.14 you are likely to find that the opinion of the doctor's assistant gives better predictions than either machine.
--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg...@imail.org
801.408.8111
1) You seem to be confirming that the convergent validity of the six
measurements from the two different machines is perfect or near perfect,
is that right? If so, at least that gets rid of one issue, and it
probably suggests that there is also very low unreliability.
2) Since I'm not a statistician, my first question is whether there is
some a priori model. So:
2a) Is "contact lens fit" a single variable? It sounds like an issue
about the congruence of two two-dimensional surfaces to me. They are
approximately spheroidal but presumably have curvature,
diameter (i.e. arc subtended) and some "conical" distortion factor(s)
(astigmatism). My guess is that the surfaces are
actually more complex than that, but any reason based on physics to
predict a particular relationship between your measurements
and "fit" is going to help you greatly.
2b) Is fit continuous or categorical? We seem to be assuming that it's
continuous and linear but I think it could be ordinal or even a survival
time measure couldn't it? What is it?
2c) We are assuming that there are NO physical aspects of the six
variables or the additional 20 that you would expect to predict
interactions between them and the fit. That's vital.
2d) IF you have reason to believe that linear regression is the sole and
best fit to any predictive relationships and that there aren't reasons
to expect interactions, what about the correlations within the
predictors? Are there a priori reasons to expect correlations and/or
are there strong and significant observed correlations:
2di) between the six common predictor variables?
2dii) between the additional 20?
2diii) between any/all of the six and any/all of the 20?
3) If there are strong observed correlations within the six and within
the 20 then you will have multicollinearity and will have weakness in
the estimation of the regression of them onto the dependent: beware!
4) I don't know about lasso etc. but if you are confident that purely
linear regression between correlated predictors and a single linear
dependent is a good model for both the six and the 26 predictors, then
wouldn't this work (a rough R sketch follows the list):
4a) enter all six variables into a linear prediction against the
dependent and save the residuals,
4b) now run a canonical correlation analysis of the 20 additional
variables against the residuals,
4c) if it's significant, you've got evidence there is some systematic
variance unexplained by the six and
4d) the loadings of the 20 onto the resulting canonical correlation
variable tell you which of the 20 variables contribute most to that
regression equation.
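A minimal R sketch of steps 4a)-4d) (variable names hypothetical):

    # 4a) regress the dependent on the six common variables; keep the residuals
    base <- lm(fit ~ v1 + v2 + v3 + v4 + v5 + v6, data = eyes)
    res  <- residuals(base)
    # 4b) canonical correlation of the 20 additional variables with the residuals
    extra <- as.matrix(eyes[, paste0("w", 1:20)])
    cc    <- cancor(extra, res)
    cc$cor         # the canonical correlation (test it for 4c)
    cc$xcoef[, 1]  # 4d) coefficients showing which of the 20 contribute most

Note that with a single residual variable, step 4b) reduces to a
multiple regression of the residuals on the 20 extra variables, so the
overall F-test from that regression gives the significance check in 4c).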
However, on n=50 you have very little power, as others are saying; any
a priori model clarity should be used to help preserve what power you
have, and any plausible non-linearity in the model will pretty much
ensure that you have no real test as yet.
Ultimately, as Greg Snow said, with R^2 values of .07 and .14, you know
you have very little predictive power, so I doubt this is going
to give you significant results on n=50.
Cheers all,
Chris
--
Chris Evans <ch...@psyctc.org> Skype: chris-psyctc
Consultant Psychiatrist in Psychotherapy, Notts. PDD network;
Clinical Director, Psychological Therapies, Nottinghamshire NHS Trust;
Professor, Psychotherapy, Nottingham University
*If I am writing from one of those roles, it will be clear. Otherwise*
*my views are my own and not representative of those institutions *
If you have difficulty Emailing me on this address or getting a reply,
send again but cc to: chris dot evans at nottshc dot nhs dot uk
and to: c dot evans at nottingham dot ac dot uk