# linear regression with errors in both variables

363 views

### Joerg

Feb 10, 2009, 5:56:10 AM2/10/09
to
Hi,

I want to test the hypothesis that my data
follows a known simple linear relationship,
y = a + bx. However, I have (known) measurements
errors in both the y and the x values.

I suppose just a simple linear regression
does not do here.

Any suggestions how do test this correctly?

Thanks,

joerg

### DrMajorBob

Feb 11, 2009, 5:19:01 AM2/11/09
to
Here are three solvers, applied to the following data.

data = Flatten[
Table[{x + 0.5 RandomReal[],
y + 0.7 RandomReal[], .2 x + .3 y + .4 RandomReal[]}, {x, 1,
20}, {y, 1, 30}], 1];

xy = data[[All, {1, 2}]];
z = data[[All, -1]];
LeastSquares[xy, z]

{0.200423, 0.302076}

FindFit[data, a x + b y, {a, b}, {x, y}]

{a -> 0.200423, b -> 0.302076}

lmf = LinearModelFit[data, {x, y}, {x, y}];
Normal[lmf]

0.038069 + 0.198848 x + 0.30105 y

They all do the same thing, depending on whether you ask for a constant
term or not.

What if we explicitly KNOW, however, that x, y, and z had specific error
standard deviations, in this case 0.2, 0.3, and 0.4? Surely unequal
variances make a difference?

In that case we could scale each variable to make all error variances the
same, then solve the transformed problem:

normData = #/{.2, .3, .4} & /@ data;
normFit = FindFit[normData, a x + b y, {a, b}, {x, y}]

{a -> 0.100211, b -> 0.226557}

And then we'd correct a and b to UNDO the transformation we made:

.4/{.2, .3} {a, b} /. normFit

{0.200423, 0.302076}

But as you see, this changes nothing.

Bobby

On Tue, 10 Feb 2009 04:57:01 -0600, Joerg <sch...@biologie.hu-berlin.de>
wrote:

### dh

Feb 11, 2009, 5:16:30 AM2/11/09
to

Hi Joerg,

a least square procedure minimizes (yregi-yi)^2, where yi is a measured

value and yregi the value of the regression line. In your case you want

to minimize the squares sum of the distance perpendicular to the line.

Let's denote the line by a+b x and assume that the data is in d, the we

get the squares sum by:

res[a_, b_] =

1/(1 + b^2) Plus @@ (((b #[] + a - #[])^2) & /@ d )

we then minimize this expression over a and b:

sol = Minimize[res[a, b], {a, b}]

hope this helps, Daniel

### Virgil Stokes

Feb 11, 2009, 5:16:41 AM2/11/09
to
Joerg wrote:
> Hi,
>
> I want to test the hypothesis that my data
> follows a known simple linear relationship,
> y = a + bx. However, I have (known) measurements
> errors in both the y and the x values.
>
> I suppose just a simple linear regression
> does not do here.
>
> Any suggestions how do test this correctly?
>
> Thanks,
>
> joerg
>
You might take a look at one of the Wolfram Demonstrations Project:

--V. Stokes

### dh

Feb 11, 2009, 5:23:30 AM2/11/09
to

Hi Joerg,

this is an addendum to my former message for the case where the x and y

axis have different scales. The following example assumes that the error

in the y axis is 100 times that of the x axis. sx and sy denote the

standard deviations. a anb be define again the line: y==a+b x. res

denotes the weighted sum of squares of distances. The metric is taken

as: {{1/sx^2,0},{0,1/sy^2}}

d = Table[{RandomReal[], 100 RandomReal[]}, {4}];

sx = 1; sy = 100;

res[a_, b_] =

Module[{perp, met = {{1/sx^2, 0}, {0, 1/sy^2}}},

perp = {b sx^2, -sy^2};

perp = perp/Sqrt[perp.met.perp];

Plus @@ (perp.met.(# + {a/b, 0}) & /@ d)^2];

sol = Minimize[res[a, b], {a, b}]

fun[x_] = a + b x /. sol[]

Plot[fun[x], {x, 0, 1}, Epilog -> {Red, Point[d]},

PlotRange -> ({0.95 Min @@ #, 1.05 Max @@ #} & /@ Transpose[d])]

hope this helps, Daniel

### Ray Koopman

Feb 12, 2009, 6:42:26 AM2/12/09
to

You don't say whether the errors are the same for all data points.
If they are, and sx & sy are the (known) standard deviations of the
measurement errors in x & y, then

ClearAll[a,b]; Minimize[Evaluate[Expand[
#.#&[a + b*x - y]]/(b^2 sx^2 + sy^2)], {a,b}]

will give what you want. You can save a little time if you eliminate
'a' by centering the data. And if Mathematica honors the parentheses,
centering will also reduce potential roundoff error problems.

xbar = Mean@x; ybar = Mean@y; ClearAll[a,b];
{#[], {a -> ybar - #[[2,1,2]]*xbar, #[[2,1]]}}& @
Minimize[Evaluate[Expand[
#.#&[b(x-xbar)-(y-ybar)]]/(b^2 sx^2 + sy^2)], b]

If the errors differ from point to point then sx & sy will be lists.

ClearAll[a,b];
Minimize[Tr[ (a + b*x - y)^2 / (b^2 sx^2 + sy^2) ], {a,b}]

will work, but it's slow. Pre-evaluating doesn't help.
The following is faster:

worals[x_,y_,sx_,sy_] := Block[
{a,b,f,z, u = 1/sx, v = 1/sy, w = (sy/sx)^2},
{a,b} = (y*v).PseudoInverse@{v,x*v}; f = #.#&[(a+b*x-y)v];
While[f > (z = (x*w + (y-a)b)/(b^2 + w);
{a,b} = (y*v).PseudoInverse@{v,z*v};
f = #.#&@Join[(z-x)u,(a+b*z-y)v])]; {f,{a,b}}]

('worals' is an acronym for Weighted Orthogonal Regression by
Alternating Least Squares.)

### Bill Rowe

Feb 12, 2009, 6:41:10 AM2/12/09
to
Joerg wrote:

>I want to test the hypothesis that my data follows a known simple
>linear relationship, y = a + bx. However, I have (known)
>measurements errors in both the y and the x values.

>I suppose just a simple linear regression does not do here.

>Any suggestions how do test this correctly?

You've not made it clear what your end purpose is. The standard
model assumed for simple linear regression is

y = a x + b + error

where the error term is assumed to be from a normal
distribution. Possibly, doing a simple regression will give you
a good enough result even if it will no longer be the maximum
likelihood estimate when there is error in x.

You say the error is known. If you mean by this you have
definite values, then the obvious thing to do is subtract the
error from the measurements before doing a regression.

If the error is random without specific values and you truly
need maximum likelihood estimates for the regression parameters,
then you will need to create your own functions. Mathematical
details of dealing with errors in variables can be found at <http://en.wiki=
pedia.org/wiki/Total_least_squares>

### Daniel Lichtblau

Feb 13, 2009, 3:43:20 AM2/13/09
to

Assuming errors in both coordinates are comparable, you can do a total
least squares fit with a singular values decomposition (there also
exist weighted variants; not sure what will be suitable for your
needs). I'll illustrate a step-by-step example of the basic case. It
is set up so that the slope is around -2.7, and y intercept around
3.3. We will recover approximations to these values.

m = 20;
xvals = Range[m] + RandomReal[{-1, 1}, {m}];
yvals = 3.3 - 2.7*xvals + RandomReal[{-1, 1}, {m}];

We compute the means, which we will subtract so as to get data
centered at the origin.

means = Mean /@ {xvals, yvals};
shiftedvals = Transpose[{xvals, yvals} - means];

Now compute the SVD of the transposed, shifted data matrix.

{u, w, v} = SingularValueDecomposition[shiftedvals];

The vector perpendicular to the best fitting line is given as the
product of the right-side orthogonal matrix in the SVD, with a unit
vector in the last (y, in this case) direction.

e[j_, n_] := Module[{u = Table[0, {n}]}, u[[j]] = 1; u]
perp = v.e[2, 2];

Now form the linear polynomial that defines our approximated line.

In:= linearpoly = ({x, y} - means).perp;
Expand[linearpoly/Coefficient[linearpoly, y]]

Out= -3.18513 + 2.69773 x + 1. y

Daniel Lichtblau
Wolfram Research

### Ray Koopman

Feb 13, 2009, 3:46:02 AM2/13/09
to

If the true relation is linear, and if the errors are independent
normal with zero means and standard deviations as given in sx & sy,
then the minimized function value returned by each of the four
code fragments that I gave will have a Chi-Square distribution
with n-2 degrees of freedom. The corresponding p-value is
GammaRegularized[(n-2)/2,f/2].

### Mark Fisher

Feb 16, 2009, 6:54:47 AM2/16/09
to

Hi all,

The classic "errors-in-variables" problem in econometrics has the
following structure:

(1) yt = a + b*zt + et

The paramters of interest are (a, b). The problem is that we don't
observe zt. Instead we observe xt which equals zt plus a measurement
error:

(2) xt = zt + ut

In order to proceed, we need to model the unobserved zt. Let

(3) zt = m + vt

Assume (et, ut, vt) are iid N(0, V) where V = diag(se^2, su^2,
sv^2).

Integrating zt out of (1-3) using

Integrate[
PDF[NormalDistribution[a + b*zt, se], yt] *
PDF[NormalDistribution[zt, su], xt] *
PDF[NormalDistribution[m, sv], zt],
{zt, -Infinity, Infinity},
GenerateConditions -> False]

produces (after rearrangement)

(4) PDF[NormalDistribution[a1 + b1*xt, se1], yt] * PDF
[NormalDistribution[m, su1], xt]

which can be expressed as

(5) yt = a1 + b1*xt + e1t
(6) xt = m + u1t

where

(7) a1 = a + b*m*su^2/(su^2+sv^2)
(8) b1 = b*sv^2/(su^2+sv^2)
(9) se1^2 = se^2 + b^2*su^2*sv^2/(su^2+sv^2)
(10) su1^2 = su^2+sv^2

The likelihood for the parameters, theta = (a,b,m,se,su,sv), is given
by

likelihood = Product[fun[data[[t]]], {t, T}]

where data is a list of the observed {xt, yt} pairs and fun[{xt, yt}]
is given by (4) using the substitutions (7-10).

A Bayesian approach specifies a prior distribution for theta, p
(theta), and then uses an MCMC algorithm (such as the random-walk
Metropolis algorithm) to draw from the posterior distribution, p(theta|
data). [The posterior distribution is proportional to likelihood * p
(theta).]

Note that (xt, yt) has a bivariate normal distribution, which is
characterized by 5 parameters. However, there are 6 parameters in
theta; this suggests a lack of identification. Fortunately, this
presents no particular problem for the Bayesian approach, even with a
flat prior for theta.

--Mark