Building a best fit model to predict 'y' values and the level of confidence in the prediction


Ahmed Dassouki

Jun 29, 2016, 11:55:29 PM
to pystatsmodels

Hi folks,


  • I have a million or so data points that appear to follow an exponential trend
  • Is there a way for statsmodels to decide what is the best fit/correct estimation?
  • The data is a mix between integers, real numbers, and categorical data (20 or so rows).
  • The purpose of the analysis is to answer the following question:
    • "What is the 'y' value of a given vector 'x' inputted by the user and what is the level of confidence or correctness of the answer?"
  • I am not 100% sure how to approach this problem
Thanks,
A

josef...@gmail.com

Jun 30, 2016, 3:58:09 AM
to pystatsmodels
On Wed, Jun 29, 2016 at 10:01 PM, Ahmed Dassouki <dass...@gmail.com> wrote:

Hi folks,


  • I have a million or so data points that appear to follow an exponential trend

If the variance looks stable, then you could just use a trend variable.
In the more common case, where the variance also increases in proportion to y, an exponential model or a log-transformed y is usually used:
either ols('np.log(y) ~ ...')
or, often more appropriate, Poisson regression (GLM with family Poisson) with a heteroscedasticity-robust cov_type such as 'HC0' to correct the standard errors for continuous data.


  • Is there a way for statsmodels to decide what is the best fit/correct estimation?
That is difficult when the models are not nested and/or use differently transformed endog y. When prediction is the target, the best way to compare across models is to use a hold-out data set, similar to what is usually done in scikit-learn/machine learning. But statsmodels doesn't have automatic support for this.
 
  • The data is a mix between integers, real numbers, and categorical data (20 or so rows).
Does this refer to explanatory variables x, and is the dependent variable y continuous?
 
  • The purpose of the analysis is to answer the following question:
    • "What is the 'y' value of a given vector 'x' inputted by the user and what is the level of confidence or correctness of the answer?"

There are two types of uncertainty: for a new observation, or for the expected value (mean) of a new observation given x.
The former is only available for linear models. The latter is available for linear models and GLM.
Both OLS and GLM results have a new `get_prediction` method. (Not fully advertised because the API was not clear yet.)
 
  • I am not 100% sure how to approach this problem

For more specific help you need to describe the data in more detail.
Examples for most parts should be available somewhere: on Stack Overflow, in the notebooks, ...

Josef

 
Thanks,
A


Ahmed Dassouki

Jun 30, 2016, 6:24:41 AM
to pystatsmodels
Thanks Josef,

Thanks for the great clarification; here are some answers to your questions:


On Thursday, June 30, 2016 at 4:58:09 AM UTC-3, josefpktd wrote:
On Wed, Jun 29, 2016 at 10:01 PM, Ahmed Dassouki <dass...@gmail.com> wrote:

Hi folks,


  • I have a million or so data points that appear to follow an exponential trend

If the variance looks stable, then you could just use a trend variable.
In the more common case, where the variance also increases in proportion to y, an exponential model or a log-transformed y is usually used:
either ols('np.log(y) ~ ...')
or, often more appropriate, Poisson regression (GLM with family Poisson) with a heteroscedasticity-robust cov_type such as 'HC0' to correct the standard errors for continuous data.

I would say that the variance is not stable. `Item_Age` is one of the x variables and it seems that the `y` (item condition) is more accurate for newer items than it is for older items. 
  • Is there a way for statsmodels to decide what is the best fit/correct estimation?
That is difficult when the models are not nested and/or use differently transformed endog y. When prediction is the target, the best way to compare across models is to use a hold-out data set, similar to what is usually done in scikit-learn/machine learning. But statsmodels doesn't have automatic support for this.
 

I am using 75% of the data to build the model and 25% to test it. Also, users must be able to input a value and get a result. Do you think I should switch to scikit-learn?
  • The data is a mix between integers, real numbers, and categorical data (20 or so rows).
Does this refer to explanatory variables x, and is the dependent variable y continuous?
 
y is continuous (real numbers), whereas the x vector is [`age`, `length`, `weight`, `county`, `state`, ...]. `county` and `state` are categorical, and I use C(County) when I build the models. Age is an integer and the rest are real numbers.
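
For reference, a setup like the one described might be sketched as follows. The data are simulated, the column set is cut down to `age`, `weight`, and `county`, and the log-OLS model is only one of the candidates discussed in this thread, so all of these are assumptions rather than the actual model. A single user-supplied row is passed as a one-row DataFrame, and the interval for a new observation is mapped back from the log scale with `np.exp`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the real data (assumption).
rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "age": rng.integers(0, 30, n),
    "weight": rng.uniform(1.0, 5.0, n),
    "county": rng.choice(["A", "B", "C"], n),
})
county_effect = df["county"].map({"A": 0.0, "B": 0.2, "C": -0.1})
df["y"] = np.exp(2.0 - 0.05 * df["age"] + 0.1 * df["weight"]
                 + county_effect + rng.normal(0.0, 0.2, n))

# C(county) tells the formula interface to dummy-encode the column.
res = smf.ols("np.log(y) ~ age + weight + C(county)", data=df).fit()

# One hypothetical user input; the formula handles the categorical encoding.
user_input = pd.DataFrame({"age": [5], "weight": [2.5], "county": ["B"]})
frame = res.get_prediction(user_input).summary_frame(alpha=0.05)

# Back-transform the point prediction and the 95% interval for a new observation.
point = float(np.exp(frame["mean"].iloc[0]))
lower = float(np.exp(frame["obs_ci_lower"].iloc[0]))
upper = float(np.exp(frame["obs_ci_upper"].iloc[0]))
print(f"predicted y: {point:.2f}, 95% interval: [{lower:.2f}, {upper:.2f}]")
```

The back-transformed point value is the exponential of the predicted mean of log(y), which is not the same as the expected value of y itself.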