
regressions with non-significant variables


John Kane

Sep 28, 2006, 11:49:25 AM
I recently ran into this statement


"Some roadway and traffic variables have a clear effect on the lateral
position of the motorist
during both passing and non-passing events. However, several variables
are only statistically
significant in one of these cases. One regression model may include a
variable that is statistically
insignificant if it is significant in the other in order to maintain
the same independent variables
among the two events and to ultimately generate a measure for the
change in lateral position of
the motorist."

I am not a statistician, but this struck me as a bit strange. Can anyone
comment on the idea of keeping a non-significant variable in a model in
order to match another model?

Richard Ulrich

Sep 28, 2006, 6:15:38 PM
On 28 Sep 2006 08:49:25 -0700, "John Kane" <jrkr...@gmail.com>
wrote:

What makes sense is to keep variables in models because you
expect them to be meaningful.

The all-too-common mistake is to drop variables from an
equation merely because they fail to be "statistically significant"
in a particular case.

The question of stepwise regression has comments from years
ago collected in my stats-FAQ.

--
Rich Ulrich, wpi...@pitt.edu
http://www.pitt.edu/~wpilib/index.html

Reef Fish

Sep 28, 2006, 9:04:36 PM

Richard Ulrich wrote:
> On 28 Sep 2006 08:49:25 -0700, "John Kane" <jrkr...@gmail.com>
> wrote:
>
> > I recently ran into this statement

< not very informative nor necessary statement snipped to answer
the stated question>


> >
> > I am not a statistician but this struck me as a bit strange. Can anyone
> > comment on the idea of keeping a non-significant variable in a model in
> > order to match another model?

Variables that are not "statistically significant" are, in most
common usage, NOT kept for the purpose of matching anything.

Statistically REDUNDANT (superfluous, unnecessary) variables
are dropped because they not only add nothing to the model
but they may in fact make the model worse, much worse, in
terms of precision and stability.

Once you get away from those redundant-variable cases, the
simplest answer to WHY you keep statistically non-significant
variables is that for many problems, while they are not
statistically significant, they are much better than nothing. :-)

If you drop variables because they are statistically NOT
significant, then you may find, especially for sociological
data, that you often end up with NO VARIABLE in a
regression equation because you have dropped everything. :)

>
> What makes sense is to keep variables in models because you
> expect them to be meaningful.

That may be true sometimes, but often NOT true for seeking
only FITTING or PREDICTION models.

The "meaningful" idea is one of the common abuses by
social scientists in their misapplication of regression methods.

Variables do not have their unique meanings. In a multiple
regression, the meaning of a variable is its effect IN THE
PRESENCE OF ALL OTHER VARIABLES in the equation.

Therefore, the same variable may have thousands of
different meanings, all depending on which are the OTHER
variables in the equation.
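[As an illustration of this point, here is a minimal pure-Python sketch with made-up numbers, constructed so that y is an exact function of x2 while x1 is highly correlated with x2. The data and variable names are hypothetical, not from any study discussed in the thread.]

```python
# Illustration: a predictor's coefficient depends on which OTHER
# variables share the equation.  Made-up data: y = 2*x2 exactly,
# and x1 tracks x2 closely.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 2.0, 3.0, 4.0, 6.0]
y = [2.0 * v for v in x2]

def centered(v):
    """Subtract the mean from each element."""
    m = sum(v) / len(v)
    return [vi - m for vi in v]

d1, d2, dy = centered(x1), centered(x2), centered(y)
S11 = sum(a * a for a in d1)
S22 = sum(a * a for a in d2)
S12 = sum(a * b for a, b in zip(d1, d2))
S1y = sum(a * b for a, b in zip(d1, dy))
S2y = sum(a * b for a, b in zip(d2, dy))

# Simple regression of y on x1 alone: x1 looks strongly related to y.
b1_alone = S1y / S11                        # slope is about 2.4

# Two-predictor OLS via the closed-form normal equations:
den = S11 * S22 - S12 ** 2
b1_joint = (S1y * S22 - S2y * S12) / den    # about 0: x1 adds nothing given x2
b2_joint = (S2y * S11 - S1y * S12) / den    # about 2

print(b1_alone, b1_joint, b2_joint)
```

Alone, x1 carries a sizeable slope; in the presence of x2, its coefficient collapses to (numerically) zero, which is exactly the "statistically redundant" situation described above.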


> The all-too-common mistake is to drop variables from an
> equation merely because they fail to be "statistically significant"
> in a particular case.

That statement is clearly UNTRUE.

Variables are dropped when they are "statistically
redundant" (or unnecessary) in the presence of the other variables
already in the regression model.


> The question of stepwise regression has comments from years
> ago collected in my stats-FAQ.

Most comments I've seen are irrelevant, impertinent, or
technically flawed.

The problem in question, and the approach to the solution
of the problem, have very little, if anything, to do with
stepwise regression.

-- Reef Fish Bob.

John Kane

Sep 29, 2006, 11:07:36 AM

Reef Fish wrote:
> Richard Ulrich wrote:
> > On 28 Sep 2006 08:49:25 -0700, "John Kane" <jrkr...@gmail.com>
> > wrote:
> >
> > > I recently ran into this statement
>
> < not very informative nor necessary statement snipped to answer
> the stated question>
Nonsense. Context is important :)

> > >
> > > I am not a statistician but this struck me as a bit strange. Can anyone
> > > comment on the idea of keeping a non-significant variable in a model in
> > > order to match another model?
>
> Variables that are not "statistically significant" are kept NOT for
> the reason of matching anything in most of the common usage.
>
> Statistically REDUNDANT (superfluous, unnecessary) variables
> are dropped because they not only add nothing to the model
> but they may in fact make the model worse, much worse, in
> terms of precision and stability.
>
> Once you get away from those redundant-variable cases, the
> simplest answer to WHY you keep statistically non-significant
> variables is that for many problems, while they are not
> statistically significant, they are much better than nothing. :-)
>
> If you drop variables because they are statistically NOT
> significant, then you may find, especially for sociological
> data, that you often end up with NO VARIABLE in a
> regression equation because you have dropped everything. :)
>
> >
> > What makes sense is to keep variables in models because you
> > expect them to be meaningful.

But in the context (that Bob clipped) the intent seems to be to make
the model somehow comparable to another model. This does not seem to
make sense. I can see keeping the variables if you expect them to be
useful when examining another data set, particularly if there is a
theoretical reason.


>
> That may be true sometimes, but often NOT true for seeking
> only FITTING or PREDICTION models.

That was my thought and this was clearly an engineering study intended
for this purpose.

>
> The "meaningful" idea is one of the common abuses by
> social scientists in their misapplication of regression methods.

And those pesky traffic engineers it appears :)


>
> Variables do not have their unique meanings. In a multiple
> regression, the meaning of a variable is its effect IN THE
> PRESENCE OF ALL OTHER VARIABLES in the equation.
>
> Therefore, the same variable may have thousands of
> different meanings, all depending on which are the OTHER
> variables in the equation.
>
>
> > The all-too-common mistake is to drop variables from an
> > equation merely because they fail to be "statistically significant"
> > in a particular case.
>
> That statement is clearly UNTRUE.
>
> Variables are dropped when they are "statistically
> redundant" (or unnecessary) in the presence of the other variables
> already in the regression model.

My problem is that I cannot see what gain there is to retaining the
variables just to make a comparison against another model. Somehow I
seem to see it as soaking up a bit of variance that might be better
explained by the other variables.

If nothing else, leaving a redundant variable in the regression seems to
me to be irresponsible, given that the target audience is not likely to
be researchers but either practicing traffic/civil engineers or policy
makers who may not understand the "significance" of an insignificant
variable in a model.


>
>
> > The question of stepwise regression has comments from years
> > ago collected in my stats-FAQ.
>
> Most comments I've seen are irrelevant, impertinent, or
> technically flawed.
>
> The problem in question, and the approach to the solution
> of the problem, have very little, if anything, to do with
> stepwise regression.
>
> -- Reef Fish Bob.

Thanks to both of you for the comments. They have been helpful.
John Kane, Kingston ON Canada

da...@autobox.com

Sep 29, 2006, 12:08:56 PM


Hello John ...

Oftentimes one is interested in testing the hypothesis that the
coefficients (collectively) are homogeneous across groups, leading to
the Chow test (Gregory Chow, Princeton University).
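[A rough sketch of that test, assuming a one-predictor regression (k = 2 parameters per group: intercept and slope) and entirely made-up data. The statistic is the standard Chow F = [(SSR_pooled - SSR_1 - SSR_2)/k] / [(SSR_1 + SSR_2)/(n1 + n2 - 2k)].]

```python
# Hedged sketch of the Chow test for coefficient homogeneity across
# two groups, using simple (one-predictor) OLS.  Data are hypothetical.

def ssr(x, y):
    """Residual sum of squares from a simple OLS fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def chow_F(x1, y1, x2, y2, k=2):
    """F statistic for H0: both groups share the same coefficients."""
    ssr_pooled = ssr(x1 + x2, y1 + y2)        # one fit to all the data
    ssr_sep = ssr(x1, y1) + ssr(x2, y2)       # separate fits per group
    n = len(x1) + len(x2)
    return ((ssr_pooled - ssr_sep) / k) / (ssr_sep / (n - 2 * k))

# Hypothetical groups with clearly different slopes (the noise keeps
# the per-group fits from being perfect, so the denominator is nonzero):
xa, ya = [1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8]
xb, yb = [1, 2, 3, 4], [2.2, 3.9, 6.1, 7.8]
print(chow_F(xa, ya, xb, yb))   # large F here: reject homogeneity
```

The resulting F is compared against an F(k, n1 + n2 - 2k) critical value; two groups with identical coefficients give an F near zero.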


A similar problem in time series is to test for break points in
the parameters, i.e., is there a point in time at which the coefficients
of an ARIMA process change significantly?

We have implemented that test in order to test the idea of
non-transient structure, which leads directly to segmenting the time
series at the identified break point(s).

Regards

Dave Reilly
http://www.autobox.com

John Kane

Sep 29, 2006, 12:25:42 PM

Thanks Dave.

I see what you mean there, and that makes sense. However, the
researchers seem to have some idea of comparing two models, developed
on the same data set but, if my cursory reading is correct, predicting
different driver behaviour, and they apparently left the redundant
variables in to 'facilitate' comparisons.

The study was a very applied one, apparently intended to provide input
to government policy on road design.

Maybe I am suspicious of the faux-3D spreadsheet barplots they used :)
They also seemed to be using stepwise regression to establish the
models, which struck me as a bit dubious.
