358 views

Skip to first unread message

Feb 10, 2009, 5:56:10 AM2/10/09

to

Hi,

I want to test the hypothesis that my data

follows a known simple linear relationship,

y = a + bx. However, I have (known) measurements

errors in both the y and the x values.

I suppose just a simple linear regression

does not do here.

Any suggestions how do test this correctly?

Thanks,

joerg

Feb 11, 2009, 5:19:01 AM2/11/09

to

Here are three solvers, applied to the following data.

data = Flatten[

Table[{x + 0.5 RandomReal[],

y + 0.7 RandomReal[], .2 x + .3 y + .4 RandomReal[]}, {x, 1,

20}, {y, 1, 30}], 1];

xy = data[[All, {1, 2}]];

z = data[[All, -1]];

LeastSquares[xy, z]

{0.200423, 0.302076}

FindFit[data, a x + b y, {a, b}, {x, y}]

{a -> 0.200423, b -> 0.302076}

lmf = LinearModelFit[data, {x, y}, {x, y}];

Normal[lmf]

0.038069 + 0.198848 x + 0.30105 y

They all do the same thing, depending on whether you ask for a constant

term or not.

What if we explicitly KNOW, however, that x, y, and z had specific error

standard deviations, in this case 0.2, 0.3, and 0.4? Surely unequal

variances make a difference?

In that case we could scale each variable to make all error variances the

same, then solve the transformed problem:

normData = #/{.2, .3, .4} & /@ data;

normFit = FindFit[normData, a x + b y, {a, b}, {x, y}]

{a -> 0.100211, b -> 0.226557}

And then we'd correct a and b to UNDO the transformation we made:

.4/{.2, .3} {a, b} /. normFit

{0.200423, 0.302076}

But as you see, this changes nothing.

Bobby

On Tue, 10 Feb 2009 04:57:01 -0600, Joerg <sch...@biologie.hu-berlin.de>

wrote:

Feb 11, 2009, 5:16:30 AM2/11/09

to

Hi Joerg,

a least square procedure minimizes (yregi-yi)^2, where yi is a measured

value and yregi the value of the regression line. In your case you want

to minimize the squares sum of the distance perpendicular to the line.

Let's denote the line by a+b x and assume that the data is in d, the we

get the squares sum by:

res[a_, b_] =

1/(1 + b^2) Plus @@ (((b #[[1]] + a - #[[2]])^2) & /@ d )

we then minimize this expression over a and b:

sol = Minimize[res[a, b], {a, b}]

hope this helps, Daniel

Feb 11, 2009, 5:16:41 AM2/11/09

to

Joerg wrote:

> Hi,

>

> I want to test the hypothesis that my data

> follows a known simple linear relationship,

> y = a + bx. However, I have (known) measurements

> errors in both the y and the x values.

>

> I suppose just a simple linear regression

> does not do here.

>

> Any suggestions how do test this correctly?

>

> Thanks,

>

> joerg

>

You might take a look at one of the Wolfram Demonstrations Project:> Hi,

>

> I want to test the hypothesis that my data

> follows a known simple linear relationship,

> y = a + bx. However, I have (known) measurements

> errors in both the y and the x values.

>

> I suppose just a simple linear regression

> does not do here.

>

> Any suggestions how do test this correctly?

>

> Thanks,

>

> joerg

>

http://demonstrations.wolfram.com/OrdinaryRegressionAndOrthogonalRegressionInThePlane/

--V. Stokes

Feb 11, 2009, 5:23:30 AM2/11/09

to

Hi Joerg,

this is an addendum to my former message for the case where the x and y

axis have different scales. The following example assumes that the error

in the y axis is 100 times that of the x axis. sx and sy denote the

standard deviations. a anb be define again the line: y==a+b x. res

denotes the weighted sum of squares of distances. The metric is taken

as: {{1/sx^2,0},{0,1/sy^2}}

d = Table[{RandomReal[], 100 RandomReal[]}, {4}];

sx = 1; sy = 100;

res[a_, b_] =

Module[{perp, met = {{1/sx^2, 0}, {0, 1/sy^2}}},

perp = {b sx^2, -sy^2};

perp = perp/Sqrt[perp.met.perp];

Plus @@ (perp.met.(# + {a/b, 0}) & /@ d)^2];

sol = Minimize[res[a, b], {a, b}]

fun[x_] = a + b x /. sol[[2]]

Plot[fun[x], {x, 0, 1}, Epilog -> {Red, Point[d]},

PlotRange -> ({0.95 Min @@ #, 1.05 Max @@ #} & /@ Transpose[d])]

hope this helps, Daniel

Feb 12, 2009, 6:42:26 AM2/12/09

to

You don't say whether the errors are the same for all data points.

If they are, and sx & sy are the (known) standard deviations of the

measurement errors in x & y, then

ClearAll[a,b]; Minimize[Evaluate[Expand[

#.#&[a + b*x - y]]/(b^2 sx^2 + sy^2)], {a,b}]

will give what you want. You can save a little time if you eliminate

'a' by centering the data. And if Mathematica honors the parentheses,

centering will also reduce potential roundoff error problems.

xbar = Mean@x; ybar = Mean@y; ClearAll[a,b];

{#[[1]], {a -> ybar - #[[2,1,2]]*xbar, #[[2,1]]}}& @

Minimize[Evaluate[Expand[

#.#&[b(x-xbar)-(y-ybar)]]/(b^2 sx^2 + sy^2)], b]

If the errors differ from point to point then sx & sy will be lists.

ClearAll[a,b];

Minimize[Tr[ (a + b*x - y)^2 / (b^2 sx^2 + sy^2) ], {a,b}]

will work, but it's slow. Pre-evaluating doesn't help.

The following is faster:

worals[x_,y_,sx_,sy_] := Block[

{a,b,f,z, u = 1/sx, v = 1/sy, w = (sy/sx)^2},

{a,b} = (y*v).PseudoInverse@{v,x*v}; f = #.#&[(a+b*x-y)v];

While[f > (z = (x*w + (y-a)b)/(b^2 + w);

{a,b} = (y*v).PseudoInverse@{v,z*v};

f = #.#&@Join[(z-x)u,(a+b*z-y)v])]; {f,{a,b}}]

('worals' is an acronym for Weighted Orthogonal Regression by

Alternating Least Squares.)

Feb 12, 2009, 6:41:10 AM2/12/09

to

Joerg wrote:

>I want to test the hypothesis that my data follows a known simple

>linear relationship, y = a + bx. However, I have (known)

>measurements errors in both the y and the x values.

>I suppose just a simple linear regression does not do here.

>Any suggestions how do test this correctly?

You've not made it clear what your end purpose is. The standard

model assumed for simple linear regression is

y = a x + b + error

where the error term is assumed to be from a normal

distribution. Possibly, doing a simple regression will give you

a good enough result even if it will no longer be the maximum

likelihood estimate when there is error in x.

You say the error is known. If you mean by this you have

definite values, then the obvious thing to do is subtract the

error from the measurements before doing a regression.

If the error is random without specific values and you truly

need maximum likelihood estimates for the regression parameters,

then you will need to create your own functions. Mathematical

details of dealing with errors in variables can be found at <http://en.wiki=

pedia.org/wiki/Total_least_squares>

Feb 13, 2009, 3:43:20 AM2/13/09

to

Assuming errors in both coordinates are comparable, you can do a total

least squares fit with a singular values decomposition (there also

exist weighted variants; not sure what will be suitable for your

needs). I'll illustrate a step-by-step example of the basic case. It

is set up so that the slope is around -2.7, and y intercept around

3.3. We will recover approximations to these values.

m = 20;

xvals = Range[m] + RandomReal[{-1, 1}, {m}];

yvals = 3.3 - 2.7*xvals + RandomReal[{-1, 1}, {m}];

We compute the means, which we will subtract so as to get data

centered at the origin.

means = Mean /@ {xvals, yvals};

shiftedvals = Transpose[{xvals, yvals} - means];

Now compute the SVD of the transposed, shifted data matrix.

{u, w, v} = SingularValueDecomposition[shiftedvals];

The vector perpendicular to the best fitting line is given as the

product of the right-side orthogonal matrix in the SVD, with a unit

vector in the last (y, in this case) direction.

e[j_, n_] := Module[{u = Table[0, {n}]}, u[[j]] = 1; u]

perp = v.e[2, 2];

Now form the linear polynomial that defines our approximated line.

In[104]:= linearpoly = ({x, y} - means).perp;

Expand[linearpoly/Coefficient[linearpoly, y]]

Out[105]= -3.18513 + 2.69773 x + 1. y

Daniel Lichtblau

Wolfram Research

Feb 13, 2009, 3:46:02 AM2/13/09

to

If the true relation is linear, and if the errors are independent

normal with zero means and standard deviations as given in sx & sy,

then the minimized function value returned by each of the four

code fragments that I gave will have a Chi-Square distribution

with n-2 degrees of freedom. The corresponding p-value is

GammaRegularized[(n-2)/2,f/2].

Feb 16, 2009, 6:54:47 AM2/16/09

to

Hi all,

The classic "errors-in-variables" problem in econometrics has the

following structure:

(1) yt = a + b*zt + et

The paramters of interest are (a, b). The problem is that we don't

observe zt. Instead we observe xt which equals zt plus a measurement

error:

(2) xt = zt + ut

In order to proceed, we need to model the unobserved zt. Let

(3) zt = m + vt

Assume (et, ut, vt) are iid N(0, V) where V = diag(se^2, su^2,

sv^2).

Integrating zt out of (1-3) using

Integrate[

PDF[NormalDistribution[a + b*zt, se], yt] *

PDF[NormalDistribution[zt, su], xt] *

PDF[NormalDistribution[m, sv], zt],

{zt, -Infinity, Infinity},

GenerateConditions -> False]

produces (after rearrangement)

(4) PDF[NormalDistribution[a1 + b1*xt, se1], yt] * PDF

[NormalDistribution[m, su1], xt]

which can be expressed as

(5) yt = a1 + b1*xt + e1t

(6) xt = m + u1t

where

(7) a1 = a + b*m*su^2/(su^2+sv^2)

(8) b1 = b*sv^2/(su^2+sv^2)

(9) se1^2 = se^2 + b^2*su^2*sv^2/(su^2+sv^2)

(10) su1^2 = su^2+sv^2

The likelihood for the parameters, theta = (a,b,m,se,su,sv), is given

by

likelihood = Product[fun[data[[t]]], {t, T}]

where data is a list of the observed {xt, yt} pairs and fun[{xt, yt}]

is given by (4) using the substitutions (7-10).

A Bayesian approach specifies a prior distribution for theta, p

(theta), and then uses an MCMC algorithm (such as the random-walk

Metropolis algorithm) to draw from the posterior distribution, p(theta|

data). [The posterior distribution is proportional to likelihood * p

(theta).]

Note that (xt, yt) has a bivariate normal distribution, which is

characterized by 5 parameters. However, there are 6 parameters in

theta; this suggests a lack of identification. Fortunately, this

presents no particular problem for the Bayesian approach, even with a

flat prior for theta.

--Mark

Reply all

Reply to author

Forward

0 new messages

Search

Clear search

Close search

Google apps

Main menu