Bayesian Linear Regression - assessing model fit?


Carol Lin

Mar 3, 2015, 6:54:02 AM
to py...@googlegroups.com
Hi everyone!

I am learning Bayesian statistics and am trying to fit a multiple linear regression model using PyMC. I've successfully run the regression, but I'm having trouble interpreting the output - specifically: how well does the model fit the data, and which variables matter and which don't?

I'm looking for the equivalent of an R-squared value from the 'frequentist' OLS approach.

I know there are some built-in model-evaluation functions (Geweke, Gelman-Rubin), but the Gelman-Rubin diagnostic is throwing this error: ValueError: Gelman-Rubin diagnostic requires multiple chains of the same length. I'm also not sure how Gelman-Rubin would help with model fit, since as far as I understand it, it only tells you whether the chains have converged. I'm not an expert on G-R either, so I may be interpreting it wrong.
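From the docs I gather that each call to sample() starts a new chain in the trace database, so I was planning to try something like this to satisfy the multiple-chains requirement (untested sketch, using the M and MCMC setup from the code further down):

MC = MCMC(M)
MC.sample(50000, 20000)   # chain 0
MC.sample(50000, 20000)   # chain 1, same length
# if I read the docs right, this returns a dict of R-hat values per variable
print(pymc.gelman_rubin(MC))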

Also, M.summary_plot() is not showing the R-hat values (probably because Gelman-Rubin is failing on my code).

Code below sets up the multiple linear regression. 'percent' is the performance rating; agreeableness, assertiveness, conscientiousness, and extraversion are scores, and I want to determine whether any/all of them predict performance.
import numpy as np
import pymc
from pymc import Normal, Uniform, MCMC, Model, Container
from sklearn import preprocessing

# standardize the outcome and all four predictors
df['percent'] = preprocessing.scale(df['percent'])
df['agreeableness'] = preprocessing.scale(df['agreeableness'])
df['assertiveness'] = preprocessing.scale(df['assertiveness'])
df['conscientiousness'] = preprocessing.scale(df['conscientiousness'])
df['extraversion'] = preprocessing.scale(df['extraversion'])

def linear_setup(df, ind_cols, dep_col):
    # diffuse prior on the intercept (PyMC2's Normal takes mu and tau=precision)
    b0 = Normal('b0', 0, 0.0001)
    # err is the precision (tau) of the likelihood below; flat prior
    err = Uniform('err', 0, 500)
    
    # initialize a NumPy array to hold our betas 
    # and our observed x values
    b = np.empty(len(ind_cols), dtype=object)
    x = np.empty(len(ind_cols), dtype=object)
    
    # loop through b and give each slope coefficient
    # its own diffuse normal prior
    for i in range(len(b)):
        b[i] = Normal('b' + str(i + 1), 0, 0.0001)
        
    # loop through x, and inform our model about the observed
    # x values that correspond to the ith position
    for i, col in enumerate(ind_cols):
        x[i] = Normal('x' + str(i + 1), 0, 1, value=np.array(df[col]), observed=True)
    
    # linear predictor: b0 + sum_i b[i] * x[i]
    # (.dot() over the object arrays sums the elementwise products)
    @pymc.deterministic
    def y_pred(b0=b0, b=b, x=x):
        return b0 + b.dot(x)
    
    # likelihood: the observed y values are normal around y_pred with precision err
    y = Normal('y', y_pred, err, value=np.array(df[dep_col]), observed=True)
    
    return Model([b0, Container(b), err, Container(x), y, y_pred])

M = linear_setup(df, ['agreeableness', 'assertiveness', 'conscientiousness', 'extraversion'], 'percent')

MC = MCMC(M)
MC.sample(50000, 20000)
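For the R-squared question, the closest thing I've come up with is computing a classical R-squared for every posterior draw of y_pred, which gives a whole distribution of fit values instead of a single number. Untested sketch, reusing the names above:

y_obs = np.array(df['percent'])
y_pred_trace = MC.trace('y_pred')[:]   # shape (n_samples, n_data)

ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
ss_res = np.sum((y_obs - y_pred_trace) ** 2, axis=1)   # one SSE per draw
r2_samples = 1.0 - ss_res / ss_tot                     # one R^2 per draw

print('R^2: mean %.3f, 95%% interval (%.3f, %.3f)' % (
    r2_samples.mean(),
    np.percentile(r2_samples, 2.5),
    np.percentile(r2_samples, 97.5)))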

A few hacky ways I am trying to interpret model fit for regression:
- calculate the % of data points where the observed y falls within the 95% HPD interval of y_pred (the trace gives me N samples of arrays of len(data)) - first sketch below
- check whether 0 is within the 95% HPD of each beta (if it is, there is a chance there is no effect) - and maybe drop those variables? - second sketch below
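Here is roughly what I mean by the first check, using a central 95% interval as a stand-in for the HPD (untested sketch, reusing y_obs and y_pred_trace from the R^2 snippet above):

lower = np.percentile(y_pred_trace, 2.5, axis=0)
upper = np.percentile(y_pred_trace, 97.5, axis=0)
# note y_pred is the mean function only (no err noise), so this
# coverage will understate the full posterior predictive coverage
coverage = np.mean((y_obs >= lower) & (y_obs <= upper))
print('%.1f%% of observations fall inside their 95%% interval' % (100 * coverage))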
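And for the second check - pymc.utils.hpd seems to compute an HPD interval from a trace (untested sketch; b1..b4 follow the naming inside linear_setup):

predictors = ['agreeableness', 'assertiveness', 'conscientiousness', 'extraversion']
for i, col in enumerate(predictors):
    trace = MC.trace('b%d' % (i + 1))[:]
    lower, upper = pymc.utils.hpd(trace, 0.05)   # 95% HPD
    print('%s: (%.3f, %.3f), contains 0: %s' % (col, lower, upper, lower <= 0 <= upper))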

Has anyone else used PyMC for linear regression?

- Carol

