In statistics, the term linear model refers to any model which assumes linearity in the system. The most common occurrence is in connection with regression models and the term is often taken as synonymous with linear regression model. However, the term is also used in time series analysis with a different meaning. In each case, the designation "linear" is used to identify a subclass of models for which substantial reduction in the complexity of the related statistical theory is possible.
Given that estimation is undertaken on the basis of a least squares analysis, estimates of the unknown parameters β j \displaystyle \beta _j are determined by minimising a sum of squares function
where again the quantities ε i \displaystyle \varepsilon _i are random variables representing innovations which are new random effects that appear at a certain time but also affect values of X \displaystyle X at later times. In this instance the use of the term "linear model" refers to the structure of the above relationship in representing X t \displaystyle X_t as a linear function of past values of the same time series and of current and past values of the innovations.[1] This particular aspect of the structure means that it is relatively simple to derive relations for the mean and covariance properties of the time series. Note that here the "linear" part of the term "linear model" is not referring to the coefficients ϕ i \displaystyle \phi _i and θ i \displaystyle \theta _i , as it would be in the case of a regression model, which looks structurally similar.
There are some other instances where "nonlinear model" is used to contrast with a linearly structured model, although the term "linear model" is not usually applied. One example of this is nonlinear dimensionality reduction.
Linear models describe a continuous response variable as a function of one or more predictor variables. They can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data.
Linear regression is a statistical method used to create a linear model. The model describes the relationship between a dependent variable \(y\) (also called the response) as a function of one or more independent variables \(X_i\) (called the predictors). The general equation for a linear model is:
Simple linear regression is commonly done in MATLAB. For multiple and multivariate linear regression, see Statistics and Machine Learning Toolbox. It enables stepwise, robust, and multivariate regression to:
To create a linear model that fits curves and surfaces to your data, see Curve Fitting Toolbox. To create linear models of dynamic systems from measured input-output data, see System Identification Toolbox. To create a linear model for control system design from a nonlinear Simulink model, see Simulink Control Design.
See also:Statistics and Machine Learning Toolbox, Curve Fitting Toolbox, machine learning, linearization, data fitting, data analysis, mathematical modeling, time series regression, linear model videos, Machine Learning Models, MANOVA
Most of the common statistical models (t-test, correlation, ANOVA; chi-square, etc.) are special cases of linear models or a very close approximation. This beautiful simplicity means that there is less to learn. In particular, it all comes down to \(y = a \cdot x + b\) which most students know from highschool. Unfortunately, stats intro courses are usually taught as if each test is an independent tool, needlessly making life more complicated for students and teachers alike.
This is often simply called a regression model which can be extended to multiple regression where there are several \(\beta\)s and on the right-hand side multiplied with the predictors. Everything below, from one-sample t-test to two-way ANOVA are just special cases of this system. Nothing more, nothing less.
Note that we can interpret the slope which is the number of ranks \(y\) change for each rank on \(x\). I think that this is a pretty interesting number. However, the intercept is less interpretable since it lies at \(rank(x) = 0\) which is impossible since x starts at 1.
This means that there is just one \(y = y_2 - y_1\) to predict and it becomes a one-sample t-test on the pairwise differences. The visualization is therefore also the same as for the one-sample t-test. At the risk of overcomplicating a simple substraction, you can think of these pairwise differences as slopes (see left panel of the figure), which we can represent as y-offsets (see right panel of the figure):
The default output of lm returns parameter estimates as well (bonus!), which you can see if you unfold the R output above. However, because this IS the ANOVA model, you can also get parameter estimates out into the open by calling coefficients(aov(...)).
Below, I present approximate main effect models, though exact calculation of ANOVA main effects is more involved if it is to be accurate and furthermore depend on whether type-II or type-III sum of squares are used for inference.
This is simply ANOVA with a continuous regressor added so that it now contains continuous and (dummy-coded) categorical predictors. For example, if we continue with the one-way ANOVA example, we can add age and it is now called a one-way ANCOVA:
Recall that when you take the logarithm, you can easily make statements about proportions, i.e., that for every increase in \(x\), \(y\) increases a certain percentage. This turns out to be one of the simplest (and therefore best!) ways to make count data and contingency tables intelligible. See this nice introduction to Chi-Square tests as linear models.
For a two-way contingency table, the model of the count variable \(y\) is a modeled using the marginal proportions of a contingency table. Why this makes sense, is too involved to go into here, but see the relevant slides by Christoph Scheepers here for an excellent exposition. The model is composed of a lot of counts and the regression coefficients \(A_i\) and \(B_i\):
Snuggly! Now we can get a better grasp on how the regression coefficients (which are proportions) independently contribute to \(y\). This is why logarithms are so nice for proportions. Note that this is just the two-way ANOVA model with some logarithms added, so we are back to our good old linear models - only the interpretation of the regression coefficients have changed! And we cannot use lm anymore in R.
I would spend 50% of the time on linear modeling of data since this contains 70% of what students need to know (bullet 1 below). The rest of the course is fleshing out what happens when you have one group, two groups, etc.
Extend to a few multiple regression as models. Make sure to include plenty of real-life examples and exercises at this point to make all of this really intuitive. Marvel at how briefly these models allow us to represent large datasets.
Dummy coding of categories: How one regression coefficient for each level of a factor models an intercept for each level when multiplied by a binary indicator. This is just extending what we did in 2.1. to make this data accessible to linear modeling.
Logarithmic transformation: Making multiplicative models linear using logarithms, thus modeling proportions. See this excellent introduction to the equivalence of log-linear models and Chi-Square tests as models of proportions. Also needs to introduce (log-)odds ratios. When the multiplicative model is made summative using logarithms, we just add the dummy-coding trick from 3.1, and see that the models are identical to the ANOVA models in 3.2 and 3.3, only the interpretation of the coefficients have changed.
Hypothesis testing as model comparisons: Hypothesis testing is the act of choosing between a full model and one where a parameter is fixed to a particular value (often zero, i.e., effectively excluded from the model) instead of being estimated. For example, when fixing one of the two means to zero in the t-test, we study how well a single mean (a one-sample t-test) explains all the data from both groups. If it does a good job, we prefer this model over the two-mean model because it is simpler. So hypothesis testing is just comparing linear models to make more qualitative statements than the truly quantitative statements which were covered in bullets 1-4 above. As tests of single parameters, hypothesis testing is therefore less informative However, when testing multiple parameters at the same time (e.g., a factor in ANOVA), model comparison becomes invaluable.
I have not covered assumptions in the examples. This will be another post! But all assumptions of all tests come down to the usual three: a) independence of data points, b) normally distributed residuals, and c) homoscedasticity.
A linear model is a function, like that in equation (8.1), that is fit to a set of data, often to model a process that generated the data or something like the data. The line in Figure 8.1A is just that, a line, but the line in Figure 8.1B is a linear model fit to the data in Figure 8.1B.
Here is a short script that generates data by implementing both the error-draw and conditional-draw specifications. See if you can follow the logic of the code and match it to the meaning of these two ways of specifying a linear model.
The error-draw specification is not very useful for thinking about data generation for data analyzed by generalized linear models, which are models that allow one to specify distribution families other than Normal (such as the binomial, Poisson, and Gamma families). In fact, thinking about a model as a predictor plus error can lead to the misconception that, in a generalized linear model, the error (or residuals from the fit) has a distribution from the non-Normal distribution modeled. This cannot be true because the distributions modeled using generalized linear models (other than the Normal) do not have negative values (some residuals must have negative values since the mean of the residuals is zero). Introductory biostatistics textbooks typically only introduce the error-draw specification because introductory textbooks recommend data transformation or non-parametric tests if the data are not approximately normal. This is unfortunate because generalized linear models are extremely useful for real biological data.
93ddb68554