In a discussion on julia-users about the formula language for the MixedModels package I found myself again explaining the implicit intercept in the R formula language and its ramifications. This led me to consider not having an implicit intercept and requiring the user to write y ~ 1 + x when they want an intercept and slope.
It is two extra characters to type which is negligible. However, it would throw off any R users who expect the implicit intercept. Which is the lesser of the two evils? I personally think that requiring an explicit intercept term would make the connection between the formula and the fitted coefficients much clearer but that doesn't mean that there wouldn't be howls of protest.
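To make the convention concrete, here is a minimal sketch in Python (a hypothetical helper, not any package's actual API) of how R-style formula expansion treats the intercept implicitly:

```python
def rhs_columns(rhs):
    """Return design-matrix column labels for a formula right-hand side,
    following the R convention: an intercept is implicit unless the
    formula contains an explicit '0' term (R also accepts '- 1',
    which this toy sketch does not handle)."""
    terms = [t.strip() for t in rhs.split("+")]
    intercept = "0" not in terms            # implicit unless suppressed
    cols = [t for t in terms if t not in ("0", "1")]
    return (["(Intercept)"] if intercept else []) + cols

# Under this convention 'y ~ x' and 'y ~ 1 + x' are the same model:
print(rhs_columns("x"))       # ['(Intercept)', 'x']
print(rhs_columns("1 + x"))   # ['(Intercept)', 'x']
print(rhs_columns("0 + x"))   # ['x']
```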
--
You received this message because you are subscribed to the Google Groups "julia-stats" group.
To unsubscribe from this group and stop receiving emails from it, send an email to julia-stats...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Stefan, if you've never learned the formula syntax (especially the lme4 variant), I'd encourage you to try it out. I'd argue that it's one of the best features of R.
This led me to consider not having an implicit intercept and requiring the user to write y ~ 1 + x when they want an intercept and slope.
It is generally a good principle to make the default the right thing. But in this case it makes specifying that there is no intercept bizarre: writing `y ~ x + 0` and having that be different from `y ~ x`, while `y ~ x + 1` means the same thing as `y ~ x`, is just strange.
I believe not specifying anything would count as no intercept.
If plot(y ~ x + 1, exampledata) is required so that it's consistent with the glm use, what does plot(y ~ x, exampledata) generate?
To be honest, I’m not at all concerned about that issue, since I am steadfastly opposed to the idea of using formulas for anything other than the specification of design matrices. That tradition in R violates what is arguably the most central concept in Julia style: don’t write puns. We should never reuse operators to mean things that have no relationship with their core semantics.

Also, just to resolve one potential source of disagreement: everyone here understands that Julia is eagerly evaluated, right? I ask because the absence of delayed evaluation means that almost all R idioms involving the ~ tilde operator don’t make any sense in Julia and will never work in Julia.
For anyone interested in the background, apparently the R formula notation is (a variant of) what is known as Wilkinson-Rogers notation, which was originally proposed in this article:
http://www.jstor.org/stable/2346786
Unfortunately they don't explain why the intercepts are implicit.
In my mind, the argument for the implicit intercept ultimately comes down to what you consider the "root" model: is it
(a) Y = error, or
(b) Y = constant + error ?
In the ANOVA context (for which it was originally proposed), it is (b): you're trying to explain the "variance" (i.e. everything that isn't constant) by adding factors into the model, and so I think the implicit intercept makes complete sense here.
Of course, statistics has changed a lot in the 40 years since: we now have glms, mixed and hierarchical models, lasso methods, etc. Rather than dividing up sum-of-squares by factor, we now typically think in a more "generative" sense, by constructing models that represent the process which we're modelling. When linear models and glms are taught, they're usually framed in terms of a linear predictor
eta = beta_0 + beta_1 X_1 + ... + beta_k X_k
While this makes sense from a linear algebra and sampling theory perspective (the intercept is just another column), there is still something special about the intercept. It's not just that models with an intercept are more common, it's that models without an intercept are almost always nonsensical.
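As a concrete instance of the linear predictor above, with the intercept written as a column of ones (toy numbers, pure Python for illustration):

```python
# Toy illustration of eta = beta_0 + beta_1*X_1 + ... + beta_k*X_k.
# The intercept is "just another column" of ones in the design matrix.
beta = [2.0, 0.5, -1.0]            # beta_0 (intercept), beta_1, beta_2
rows = [[1.0, 3.0, 2.0],           # each row: [1, X_1, X_2]
        [1.0, 0.0, 4.0]]
eta = [sum(b * x for b, x in zip(beta, row)) for row in rows]
print(eta)  # [1.5, -2.0]
```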
Personally, I always found that linear models and glms made the most sense when thinking in terms of an affine space, not a linear space:
* for categorical covariates, you don't care about the absolute coefficients, you care about the relative coefficients (the contrasts): while we're on it, one of the things that really annoys me about R is that it doesn't make the contrasts explicit in the output.
* for numerical covariates, the coefficient describes the change in response per unit change in the covariate.
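The point about relative coefficients can be sketched with treatment (dummy) coding, the default contrast scheme in R; this is a minimal pure-Python version, not any package's implementation:

```python
# Treatment (dummy) coding sketch: with an intercept present, a
# categorical covariate with k levels contributes k-1 dummy columns,
# and each coefficient is a contrast against the reference level.
def treatment_code(x, levels):
    # one column per non-reference level; the reference is levels[0]
    return [[1.0 if xi == lev else 0.0 for lev in levels[1:]] for xi in x]

obs = ["a", "b", "b", "c"]
print(treatment_code(obs, ["a", "b", "c"]))
# [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```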
Moreover, the intercept has special behaviour of its own. For instance,
* it's not even possible to exclude the intercept if you have categorical covariates, as it will be absorbed into the contrasts of the categorical covariates. Although the coefficients will have different numerical values, you still get exactly the same model.
julia> gm1 = fit(GeneralizedLinearModel, Counts ~ Outcome + Treatment, dobson, Poisson())
DataFrameRegressionModel{GeneralizedLinearModel,Float64}:
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 3.04452 0.170899 17.8148 <1e-70
Outcome - 2 -0.454255 0.202171 -2.24689 0.0246
Outcome - 3 -0.292987 0.192742 -1.5201 0.1285
Treatment - 2 5.66414e-16 0.2 2.83207e-15 1.0000
Treatment - 3 1.31354e-18 0.2 6.56771e-18 1.0000
# Note that only two dummies are included for each categorical variable,
# as in model 1
julia> gm2 = fit(GeneralizedLinearModel, Counts ~ -1 + Outcome + Treatment, dobson, Poisson())
DataFrameRegressionModel{GeneralizedLinearModel,Float64}:
Coefficients:
Estimate Std.Error z value Pr(>|z|)
Outcome - 2 0.762395 0.256078 2.9772 0.0029
Outcome - 3 0.923663 0.248704 3.71391 0.0002
Treatment - 2 2.17826 0.235164 9.26276 <1e-19
Treatment - 3 2.17826 0.235163 9.26278 <1e-19
julia> deviance(gm1)
5.129141077001146
julia> deviance(gm2)
182.45992882517174
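The reason the two parameterizations are the same model in R can be checked directly: with full dummy coding, the dummies for one factor sum to a column of ones, so the intercept column is already in the design's column space. A minimal pure-Python check:

```python
# With one dummy column per level (full coding, no level dropped),
# each observation activates exactly one dummy, so the row-wise sum
# over that factor's dummies is 1 for every row: the intercept column
# lies in the span of the dummies, and adding it changes nothing.
obs = ["a", "b", "b", "c"]
levels = ["a", "b", "c"]
full_dummies = [[1.0 if xi == lev else 0.0 for lev in levels] for xi in obs]
ones = [sum(row) for row in full_dummies]
print(ones)  # [1.0, 1.0, 1.0, 1.0]
```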
In R, both specifications give the same deviance; the first categorical variable gets three dummies instead of two.

* in lasso models, it is the one beta that you don't penalise.
As a result, I would lean toward keeping an implicit intercept, though don't have particularly strong feelings about it. I don't have much experience with mixed models, so can't really comment on the utility of intercepts there. Also, I've never had to teach it to new users, so can't really say what is easier to learn.
As to how to state "no intercept": in my mind, there's no real need for this in the standard formula interface. What you're really doing is imposing a constraint on the model space, and so should be handled in the same manner you would impose other such constraints (however I don't have any good suggestions for that either...)
-Simon
* it's not even possible to exclude the intercept if you have categorical covariates, as it will be absorbed into the contrasts of the categorical covariates. Although the coefficients will have different numerical values, you still get exactly the same model.

Note this isn't the case currently with GLM:
As a result, I would lean toward keeping an implicit intercept, though don't have particularly strong feelings about it.
y ~ 1 + x + z ## Intercept, no warning
y ~ 0 + x + z ## No intercept, no warning
y ~ x + z ## No intercept, glm, etc. may choose to give a warning
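The three cases above can be sketched as a small rule, shown here in Python with a hypothetical helper name (this is the proposal from the thread, not any existing package's behaviour):

```python
import warnings

def has_intercept(rhs):
    """Proposed explicit-intercept rule: 'y ~ 1 + x' has an intercept,
    'y ~ 0 + x' does not, and a bare 'y ~ x' counts as no intercept
    but may trigger a warning."""
    terms = [t.strip() for t in rhs.split("+")]
    if "1" in terms:
        return True
    if "0" in terms:
        return False
    warnings.warn("formula has neither '1' nor '0' term: assuming no intercept")
    return False

print(has_intercept("1 + x + z"))  # True
print(has_intercept("0 + x + z"))  # False
print(has_intercept("x + z"))      # False, with a warning
```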