Linear model with errors


rsuhada

Sep 6, 2009, 11:12:14 AM
to PyMC
Dear pymc developers and users,

first off, thank you for this great application. I'm new to pymc (and
to python, for that matter), but I'm really excited by the possibilities
it offers for my projects. I also appreciate the available
documentation (and the video tutorial! Please make the second part with
the fitting; the first one was really useful!) and examples.
Unfortunately, I soon ran into a problem where I can't figure out the
proper syntax. The toy example is the following:

We have i = 1,...,N measurements of y[i] as a function of x[i], with
measurement errors sigmay[i]. The generative model is a linear
function with gaussian noise:

y[i] ~ Normal(m*x[i] + b, 1/sigmay[i]**2)

The aim is to estimate m and b. My guess is to write a container for
y, but the problem is that I have to pass m, b and x[i] as parents,
and I just can't figure out how. Here is my (pseudo)code (the data
are in the arrays x, ydat, sigmay):

# priors
m = Uniform('m', 0.0, 100.0)
b = Uniform('b', 0.0, 100.0)

# model
@deterministic
def line(xval=x, slope=m, intercept=b):
    val = intercept + slope*xval
    return val

y_0 = Normal('y_0', mu=line(x[0]), tau=1.0/sigmay[0]**2, value=ydat[0])
y = [y_0]

for i in range(1, len(x)):
    yi = Normal('y_%i' % i, mu=line(x[i]), tau=1.0/sigmay[i]**2,
                value=ydat[i])
    y.append(yi)

@observed
@deterministic
def yout(yarray=y):
    return y

This doesn't work due to the invalid mu=line(x[i]) call. Could you
please explain to me how to do this? I also suspect I have to write a
function for the likelihood (the product of the probabilities of each
y[i]); again, I don't understand how to do it properly so that the
MCMC call knows what to do...

The Straight-Line fit thread
http://groups.google.com/group/pymc/browse_thread/thread/fef4582f1ff73077

seems to be related, but at a glance it looks like the measurement
errors there are constant values, so the problem of passing array
elements doesn't occur.

Thank you very much!

John Salvatier

Sep 6, 2009, 1:44:10 PM
to py...@googlegroups.com
Hello,

I think things are a bit simpler than you imagine, but it takes some getting used to (I don't find it that intuitive). Here is a model for linear regression that I wrote:

from numpy import *
import pymc
from scipy import stats

N = 27

# values of the super secret parameters for the data
dataSlope = .4
dataB = 100
dataS = .7

# generate some fake data
dataX = stats.distributions.norm(loc=50, scale=40).rvs(N)
dataY = dataX*dataSlope + dataB + stats.distributions.norm(loc=0, scale=dataS).rvs(N)

# create a prior for the standard deviation
stdev = pymc.TruncatedNormal('stdev', mu=10, tau=1.0/(30.0**2), a=0, b=Inf)

# broad priors for the slope/intercept
slope = pymc.Normal('slope', mu=0, tau=1.0/(600.0**2))
intercept = pymc.Normal('intercept', mu=0, tau=1.0/(600.0**2))

# translate that standard deviation into a precision
@pymc.deterministic
def precision(stdev=stdev):
    return 1.0/stdev**2

# equation for the line
@pymc.deterministic
def line(x=dataX, slope=slope, intercept=intercept):
    return x*slope + intercept

# observed data
y = pymc.Normal('y', mu=line, tau=precision, value=dataY, observed=True)
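
A minimal sketch of how this model could then be sampled, assuming PyMC 2.x (the sampler call and iteration counts are not from the original post and are arbitrary):

# wrap all the nodes in an MCMC sampler and draw posterior samples
M = pymc.MCMC([stdev, precision, slope, intercept, line, y])
M.sample(iter=20000, burn=5000)

# posterior draws for the slope
slope_samples = M.trace('slope')[:]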

rsuhada

Sep 7, 2009, 5:18:36 AM
to PyMC
Dear John and all,

thank you for the quick reply. I had a look at your program and I see
what's going on. However, my example is different. My dependent variable is
scattered around the linear model, and the cases that work are:

a) the standard deviation is known and the same for all y's
b) the standard deviation is a stochastic and you fit for it (this is
what your example does)

My question is slightly different: each of my y values has its own
stdev, and their values are known. I do not fit for them, only for the
slope and intercept. The model, in other words, is:

y[i] = slope*x[i] + intercept + noise[i]
where
noise[i] ~ Normal(mu=0, tau=1/sigmay[i]**2)
x[i], y[i], sigmay[i] are all observed.

For example, in your case it would mean creating a fake stdev array
like

dataErr = 0.2 * rand(N) * dataY

and I assume that my y[i] are drawn from gaussians centered on the
line, with this stdev. The scatter in the data would come only from the
measurements, no intrinsic scatter. The practical question is how to pass
the individual sigmay[i] values and how to define the likelihood. I hope
this clarifies my post.



John Salvatier

Sep 7, 2009, 3:02:04 PM
to py...@googlegroups.com
Oh! I did not read you carefully enough the first time around. This too is simple. Simpler than the previous model, in fact. You probably want something like the following:

from numpy import *
import pymc
from scipy import stats

N = 27

# values of the super secret parameters for the data
dataSlope = .4
dataB = 100

# these standard deviations are known
dataS = ones(N) * .7

# generate some fake data
dataX = stats.distributions.norm(loc=50, scale=40).rvs(N)
dataY = dataX*dataSlope + dataB + stats.distributions.norm(loc=0, scale=dataS).rvs(N)

# broad priors for the slope/intercept
slope = pymc.Normal('slope', mu=0, tau=1.0/(600.0**2))
intercept = pymc.Normal('intercept', mu=0, tau=1.0/(600.0**2))

# equation for the line
@pymc.deterministic
def line(x=dataX, slope=slope, intercept=intercept):
    return x*slope + intercept

# observed data, using the known standard deviations as fixed precisions
y = pymc.Normal('y', mu=line, tau=dataS**-2, value=dataY, observed=True)

Here you would replace dataS with your array of known standard deviations. Is this roughly what you want to do?
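
For reference, a sketch of running this version, again assuming PyMC 2.x (the iteration counts are arbitrary, and the plotting step needs matplotlib):

from pymc.Matplot import plot

M = pymc.MCMC([slope, intercept, line, y])
M.sample(iter=20000, burn=5000)
plot(M)  # trace and histogram plots for slope and intercept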

rsuhada

Sep 14, 2009, 2:54:07 AM
to PyMC
Dear John,

sorry for the delay, I was travelling. Thank you, the last model is
exactly what I meant (this was the first thing I tried, but it didn't
work for me - now I realize that was due to bad mixing and not to
incorrect syntax, as I had assumed). I also think I made the mistake of
picking too simple an example compared to what I actually need to do.
My actual problem is that I'm trying to model my errors with an
asymmetric gaussian: for each y[i] I have sigma1[i] and sigma2[i], which
give the stdev in one direction and the other. My naive guess would be:

from numpy import *
import pymc
from scipy import stats

N = 27

# values of the super secret parameters for the data
dataSlope = .4
dataB = 100

# generate some fake data
dataX = stats.distributions.norm(loc=50, scale=40).rvs(N)
dataY = dataX*dataSlope + dataB + stats.distributions.norm(loc=0, scale=dataS).rvs(N)

# these standard deviations are known
dataS1 = ones(N) * .7*sqrt(dataY)
dataS2 = ones(N) * .3*sqrt(dataY)

switch = pymc.Uniform('switch', 0, 1, size=len(dataX))

# broad priors for the slope/intercept
slope = pymc.Normal('slope', mu=0, tau=1.0/(600.0**2))
intercept = pymc.Normal('intercept', mu=0, tau=1.0/(600.0**2))

# equation for the line
@pymc.deterministic
def line(x=dataX, slope=slope, intercept=intercept):
    return x*slope + intercept

@pymc.stochastic
def noise(switch=switch, dataSn=dataSn, dataSp=dataSp):
    temp = zeros(len(dataSn))
    for i in range(len(dataSn)):
        if switch[i] < 0.5:
            varian = dataS1[i]
            temp[i] = -1.0*abs(pymc.Normal(mu=0.0, tau=varian**-2))
        else:
            varian = dataS2[i]
            temp[i] = 1.0*abs(pymc.Normal(mu=0.0, tau=varian**-2))
    return temp

# data
y = Lambda('y', lambda line=line, noise=noise: line + noise,
           observed=True, value=dataY)

The switch stochastic 'switches' between the two gaussians. This
doesn't work, but I hope it is clear what I want to achieve. Do I need
a container in this case? If yes, what is the proper syntax, and is
MCMC going to be able to get the likelihoods itself, or do I need to
define them?

Thank you very much! I'm really glad about how helpful and friendly
the pymc community is (and I'm not saying that only because I need
help! :) )

Cheers,
Robert

David Huard

Sep 14, 2009, 9:04:17 AM
to py...@googlegroups.com
Robert,

pymc.Normal creates a Stochastic, so you are creating Stochastics inside the likelihood function of the noise node. You should use pymc.normal_like instead.

David
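
To illustrate the distinction: pymc.Normal(...) constructs a named Stochastic node for the model graph, while pymc.normal_like(...) just returns a float log-probability, which is what belongs inside a custom likelihood. A minimal sketch (the names and numbers here are only illustrative):

noise_node = pymc.Normal('noise_node', mu=0.0, tau=1.0)  # a model node
lp = pymc.normal_like(0.5, mu=0.0, tau=1.0)              # a plain float log-probability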

John Salvatier

Sep 14, 2009, 1:04:46 PM
to py...@googlegroups.com
I am not familiar with asymmetric gaussians, but I suspect the easiest way to do this is to write the likelihood for your distribution. Something like the following:

from numpy import *
import pymc
from scipy import stats

N = 27

# values of the super secret parameters for the data
dataSlope = .4
dataB = 100

# these standard deviations are known
dataS = ones(N) * .7

# generate some fake data
dataX = stats.distributions.norm(loc=50, scale=40).rvs(N)
dataY = dataX*dataSlope + dataB + stats.distributions.norm(loc=0, scale=dataS).rvs(N)

# broad priors for the slope/intercept
slope = pymc.Normal('slope', mu=0, tau=1.0/(600.0**2))
intercept = pymc.Normal('intercept', mu=0, tau=1.0/(600.0**2))

# equation for the line
@pymc.deterministic
def line(x=dataX, slope=slope, intercept=intercept):
    return x*slope + intercept

# asymmetric gaussian distribution for the y's
@pymc.observed    # tells pymc that this was actually observed
@pymc.stochastic  # lets you define your own log likelihood; replace sd1 and sd2 with your arrays
def D(value=dataY, means=line, sd1=dataS, sd2=dataS):
    """Asymmetric gaussian"""
    # calculate the log likelihood for the asymmetric gaussian
    switch = (value > means)
    # not sure about this likelihood function
    logp = log(2.0/((sd1 + sd2)*sqrt(2*pi))) \
           + switch * -.5*((value - means)/sd1)**2 \
           + (1 - switch) * -.5*((value - means)/sd2)**2
    return logp

I haven't run this code, but something like this should work, if I understand your goal correctly.
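
One caveat worth flagging: PyMC expects a stochastic's log-probability to be a single float, so the per-point logp array above would need to be summed (as in the final version below). With that fixed, the observed node plugs into a sampler like any other; a minimal sketch with arbitrary iteration counts:

M = pymc.MCMC([slope, intercept, line, D])
M.sample(iter=20000, burn=5000)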

rsuhada

Sep 15, 2009, 11:49:56 PM
to PyMC
Dear John,

thank you! I think I understood what is going on. I modified the log
likelihood slightly and tested it against a standard Chi^2 code,
and it seems to work well. The posteriors can get quite asymmetric,
but the MAP values still agree well. Thanks again - this lesson will
be very helpful for me! For completeness I'm attaching my likelihood,
in case anybody is interested.

Best,
Robert


# asymmetric gaussian distribution for the y's
@pymc.observed    # tells pymc that this was actually observed
@pymc.stochastic  # lets you define your own log likelihood; replace sdp and sdn with your arrays
def D(value=dataY, means=line, sdp=dataSp, sdn=dataSn):
    """Asymmetric gaussian"""
    # calculate the log likelihood for the asymmetric gaussian
    switch = (value > means)
    logp = switch * log(0.5*sqrt(1.0/(2*pi*sdn**2))) \
           - switch * ((value - means)/sdn)**2 \
           + (1 - switch) * log(0.5*sqrt(1.0/(2*pi*sdp**2))) \
           - (1 - switch) * ((value - means)/sdp)**2
    logp = sum(logp)
    return logp
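
As a usage note, the MAP estimates mentioned above can be obtained directly in PyMC 2 with pymc.MAP; a minimal sketch, assuming the slope, intercept, line and D nodes from this thread:

M = pymc.MAP([slope, intercept, line, D])
M.fit()  # optimize to the posterior mode

slope_map = slope.value          # MAP estimate of the slope
intercept_map = intercept.value  # MAP estimate of the intercept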