Root Mean Squared Difference (RMSD) vs R-squared in a cross-validation

yoo...@gmail.com

Oct 1, 2008, 2:18:06 PM

Hi folks,

I'm conducting a cross-validation study in which I need to select one
model among a few based on selection criteria.

The selection criteria that I can think of are

1) Root Mean Squared Difference (RMSD) between prediction and
observation and

2) R-squared value between prediction and observation.

Could any of you give me some guidance on which one to choose as the
selection criterion, and on what the consequences would be of choosing
one over the other?

Thank you in advance.

Yoon

RichUlrich

Oct 2, 2008, 7:07:32 PM

On Wed, 1 Oct 2008 11:18:06 -0700 (PDT), "yoo...@gmail.com"
<yoo...@gmail.com> wrote:

>Hi folks,
>
>I'm conducting a cross-validation study in which I need to select one
>model among a few based on selection criteria.
>
>The selection criteria that I can think of is
>
>1) Root Mean Squared Error of Diffrence (RMSD) between predictor and
>observation and
>
>2) R-squared value between predictor and observation.
>
>Would any of you give me a guidance on what I can choose as selection
>criterion and

Assuming that you are fitting to sample B an
equation that is derived from sample A, are
those ever going to be different?

>
>what would be the consequence for choosing one over the other.
>
>Thank you in advance.
>
>Yoon

--
Rich Ulrich

Ray Koopman

Oct 3, 2008, 6:14:53 PM

Certainly *not* r^2, ever, because it treats positive and negative
correlations as equally good. Think r, not r^2.

In general, you should always use RMSD unless you are willing to
ignore bias and scale errors in the predicted values, which is what
r does.
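
A minimal sketch of the point above, using made-up data (the arrays `good`, `bad` and `flipped` are hypothetical and not from the thread). It assumes NumPy and shows that r is unchanged when predictions are rescaled and shifted, that r^2 cannot tell a sign-flipped prediction from a good one, and that RMSD penalizes both.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=10.0, scale=2.0, size=200)    # observations
good = y + rng.normal(scale=0.5, size=200)       # roughly unbiased predictions
bad = 0.5 * good + 20.0                          # same predictions with scale and bias errors
flipped = -good                                  # predictions with the wrong sign

def rmsd(pred, obs):
    """Root mean squared difference between predictions and observations."""
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def pearson_r(pred, obs):
    """Pearson correlation between predictions and observations."""
    return float(np.corrcoef(pred, obs)[0, 1])

r_good, r_bad, r_flipped = (pearson_r(p, y) for p in (good, bad, flipped))
rmsd_good, rmsd_bad, rmsd_flipped = (rmsd(p, y) for p in (good, bad, flipped))

print(f"r:    good={r_good:.3f}  bad={r_bad:.3f}  flipped={r_flipped:.3f}")
print(f"r^2:  good={r_good**2:.3f}  flipped={r_flipped**2:.3f}")
print(f"RMSD: good={rmsd_good:.3f}  bad={rmsd_bad:.3f}  flipped={rmsd_flipped:.3f}")
```

Here r treats `good` and `bad` as equally good and r^2 treats `good` and `flipped` as equally good, while RMSD separates all three.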

Greg Heath

Oct 5, 2008, 4:20:49 PM

On Oct 1, 2:18 pm, "yoon...@gmail.com" <yoon...@gmail.com> wrote:
> Hi folks,
>
> I'm conducting a cross-validation study in which I need to select one
> model among a few based on selection criteria.
>
> The selection criteria that I can think of is
>
> 1) Root Mean Squared Error of Diffrence (RMSD) between predictor and
> observation and

Since error is the difference, this is RMSE = sqrt(MSE).

> 2) R-squared value between predictor and observation.

R^2 = 1 - (SSE/TSS) = 1 - (MSE/MSE0)

where MSE0 = (N-1)*var(y)/N is MSE for the model yhat = mean(y).

> Would any of you give me a guidance on what I can choose as selection
> criterion and
>
> what would be the consequence for choosing one over the other.

Take your pick. Given var(y), the transformation is one-to-one.
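
A small sketch of why the choice does not matter on a fixed test sample, assuming NumPy and three hypothetical candidate models "A", "B" and "C" (invented for illustration). Because R^2 = 1 - MSE/MSE0 and MSE0 depends only on y, ranking by smallest RMSE and ranking by largest R^2 give the same order.

```python
import numpy as np

def mse(pred, obs):
    """Mean squared error of the predictions."""
    return float(np.mean((pred - obs) ** 2))

def r_squared(pred, obs):
    """R^2 = 1 - MSE/MSE0, where MSE0 is the MSE of the model yhat = mean(y)."""
    mse0 = float(np.mean((obs - obs.mean()) ** 2))
    return 1.0 - mse(pred, obs) / mse0

rng = np.random.default_rng(1)
N = 100
y = rng.normal(size=N)

# MSE0 written both ways: directly, and as (N-1)*var(y)/N with the sample variance.
mse0_direct = float(np.mean((y - y.mean()) ** 2))
mse0_from_var = (N - 1) * float(np.var(y, ddof=1)) / N

# Hypothetical candidate models: the truth plus increasing amounts of noise.
preds = {name: y + rng.normal(scale=s, size=N)
         for name, s in [("A", 0.3), ("B", 0.6), ("C", 1.2)]}

rmse = {name: mse(p, y) ** 0.5 for name, p in preds.items()}
r2 = {name: r_squared(p, y) for name, p in preds.items()}

rank_by_rmse = sorted(preds, key=rmse.get)              # smallest RMSE first
rank_by_r2 = sorted(preds, key=r2.get, reverse=True)    # largest R^2 first
print(rank_by_rmse, rank_by_r2)
```

The equivalence relies on both statistics being computed against the same y. If R^2 is computed separately on each fold and then averaged, var(y) changes from fold to fold and the two rankings need not agree.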

Hope this helps.

Greg

RichUlrich

Oct 5, 2008, 4:52:48 PM

I think that the language of the question is seductively misleading.
"RMSD" is how you would refer to the product of a regression fit, and
yet "cross-validation" is better used as a term for checking the
fit of one equation in an independent sample.

As Ray has posted, the differences are better. For one thing, you
can use them in either case. For another, they account for bias (for a
continuous prediction). Also, for fitting, you don't have to worry
about the d.f. loss in fitting -- which can cause the "fitted" RMSE
to worsen with an extra predictor, even though the R-squared increases
with every predictor.
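
A sketch of the d.f. effect, assuming that the "fitted" RMSE means sqrt(SSE / (n - p)) with p fitted coefficients, and assuming NumPy. The extra predictor `z` is deliberately constructed to be useless (orthogonal to the intercept, x and y), which is an artificial worst case rather than a typical one: R^2 does not drop, yet the d.f.-adjusted RMSE gets worse.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = np.linspace(0.0, 1.0, n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# Build a useless predictor z: remove from a random vector everything
# that lies in the span of the intercept, x and y.
basis, _ = np.linalg.qr(np.column_stack([np.ones(n), x, y]))
v = rng.normal(size=n)
z = v - basis @ (basis.T @ v)

def fit_stats(X, y):
    """Return (R^2, d.f.-adjusted RMSE) for a least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = float(np.sum((y - A @ coef) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    dof = len(y) - A.shape[1]            # residual degrees of freedom
    return 1.0 - sse / tss, float(np.sqrt(sse / dof))

r2_small, rmse_small = fit_stats(x, y)
r2_big, rmse_big = fit_stats(np.column_stack([x, z]), y)

print(f"x only:  R^2={r2_small:.4f}  adjusted RMSE={rmse_small:.4f}")
print(f"x and z: R^2={r2_big:.4f}  adjusted RMSE={rmse_big:.4f}")
```

With a real noise predictor, R^2 would typically rise slightly rather than stay flat, and the adjusted RMSE may move in either direction depending on the sample.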

--
Rich Ulrich

Greg Heath

Oct 6, 2008, 6:29:03 AM

On Oct 1, 2:18 pm, "yoon...@gmail.com" <yoon...@gmail.com> wrote:

> Hi folks,
>
> I'm conducting a cross-validation study in which I need to select one
> model among a few based on selection criteria.

Cross-validation implies averaging over repeated sample splits into
training and testing subsets. Is this what you mean? Or are you
just emphasizing that the test set is an independent sample?
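
For the first reading (averaging over repeated train/test splits), a minimal k-fold sketch follows. It assumes NumPy, and the helper `kfold_rmse`, the polynomial candidate models and the simulated data are all hypothetical choices made for illustration, not something taken from the original question.

```python
import numpy as np

def kfold_rmse(x, y, degree, k=5, seed=0):
    """Average held-out RMSE of a polynomial fit over k train/test splits."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = np.polyfit(x[train], y[train], degree)    # fit on the training part only
        resid = y[test] - np.polyval(coef, x[test])      # errors on the held-out part
        scores.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(scores))

rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=60)

# Candidate models: polynomials of degree 0 (mean only), 1 and 5.
cv_scores = {d: kfold_rmse(x, y, d) for d in (0, 1, 5)}
best_degree = min(cv_scores, key=cv_scores.get)
print(cv_scores, "selected degree:", best_degree)
```

Because every score comes from data the model did not see during fitting, no d.f. adjustment is applied to the held-out RMSE.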

Either way, consideration of adjusted R^2 is unnecessary.

Hope this helps.

Greg

Greg Heath

Oct 6, 2008, 7:21:48 AM

On Oct 3, 6:14 pm, Ray Koopman <koop...@sfu.ca> wrote:
> On Oct 1, 11:18 am, "yoon...@gmail.com" <yoon...@gmail.com> wrote:
> > I'm conducting a cross-validation study in which I need to select one
> > model among a few based on selection criteria.
>
> > The selection criteria that I can think of is
>
> > 1) Root Mean Squared Error of Diffrence (RMSD) between predictor and
> > observation and

Also known as just RMSE.

> > 2) R-squared value between predictor and observation.
>
> > Would any of you give me a guidance on what I can choose as selection
> > criterion and
>
> > what would be the consequence for choosing one over the other.
>

> Certainly *not* r^2, ever, because it treats positive and negative
> correlations as equally good. Think r, not r^2.

Usually R^2 is interpreted in terms of explained variance and
the sign of R is ignored. Clearly, the sign doesn't help in the OP's
task of model selection.

> In general, you should always use RMSD unless you are willing to
> ignore bias and scale errors in the predicted values, which is what
> r does

I don't see the big deal. Given the variance of y,

R^2 = 1- SSE/TSS = 1-(N*MSE)/((N-1)*var(y))

Therefore,

RMSE = sqrt( (1-R^2)*(N-1)*var(y)/N ).

My regressions are generally nonlinear. My choices
of summary statistics are the normalized mean-square-error
NMSE = MSE/MSE0, where MSE0 = (N-1)*var(y)/N, the
coefficient of determination R^2 = 1-NMSE, and the
correlation coefficient r, which for nonlinear models
is not the same as R.

As far as selecting variables, either NMSE or R^2
can be used.
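
A short sketch of these three summary statistics for a nonlinear model, assuming NumPy. The data and the model `yhat = exp(0.6 * x)` are invented for illustration, and its parameter is intentionally not a least-squares estimate. It shows that R^2 = 1 - NMSE can fall well below r^2.

```python
import numpy as np

x = np.arange(5, dtype=float)
y = 2.0 ** x                   # observations: 1, 2, 4, 8, 16
yhat = np.exp(0.6 * x)         # hypothetical nonlinear model, not least-squares optimal

mse = float(np.mean((y - yhat) ** 2))
mse0 = float(np.mean((y - y.mean()) ** 2))    # MSE of the model yhat = mean(y)
nmse = mse / mse0                             # normalized mean-square-error
R2 = 1.0 - nmse                               # coefficient of determination
r = float(np.corrcoef(y, yhat)[0, 1])         # correlation between y and yhat

print(f"NMSE={nmse:.4f}  R^2={R2:.4f}  r={r:.4f}  r^2={r**2:.4f}")
```

The predictions track the shape of y closely, so r is high, but they undershoot the larger values, and only NMSE and R^2 register that error.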

Hope this helps.

Greg