Newbie question about discrete distributions and MLE


Sarah Mount

Apr 21, 2013, 4:34:08 PM
to pystat...@googlegroups.com
Hi there,

This is a very elementary question about using statsmodels; I'm just getting to grips with the API and am a bit confused about what is where.

I have some discrete data and want to figure out which distribution it best fits. I understand the way to do this is to use MLE to fit various distributions and perform a Chi-squared test to see which has the best fit. scipy has the Chi-squared test covered, so it's the model fitting I'm interested in here.

AIUI the fitting process goes like this using SM:

>>> import scikits.statsmodels.api as sm
>>> import numpy
>>> data = numpy.random.poisson(0.35, 1000)
>>> exog = [i for i in range(1000)]
>>> poisson_model = sm.GLM(data, exog, family=sm.families.Poisson())
>>> results = poisson_model.fit()
>>> print 'Params:', results.params, 'p-value:', results.pvalues
Params: [-0.00169184] p-value: [ 0.]
>>> 

So my questions are:

1) is the above correct, have I understood exog and endog correctly and so on? 

2) which discrete distributions are available? In sm.families.family I see Binomial, Gamma, Gaussian, InverseGaussian, NegativeBinomial, Poisson. Is there a way to use other distributions like Pareto or Yule?

Thanks for your patience,

Sarah

--
Sarah Mount
website:  http://www.snim2.org/
twitter: @snim2

josef...@gmail.com

Apr 21, 2013, 5:25:21 PM
to pystatsmodels
On Sun, Apr 21, 2013 at 4:34 PM, Sarah Mount <mount...@gmail.com> wrote:
> Hi there,
>
> this is a very elementary question about using statsmodels; I'm just getting
> to grips with the API and a bit confused about what is where.
>
> I have some discrete data and want to figure out which distribution it best
> fits. I understand the way to do this is to use MLE to fit various
> distributions and perform a Chi-squared test to see which has the best fit.
> scipy has the Chi-squared test covered, so it's the model fitting I'm
> interested in here.
>
> AIUI the fitting process goes like this using SM:
>
>>>> import scikits.statsmodels.api as sm
>>>> import numpy
>>>> data = numpy.random.poisson(0.35, 1000)
>>>> exog = [i for i in range(1000)]
>>>> poisson_model = sm.GLM(data, exog, family=sm.families.Poisson())
>>>> results = poisson_model.fit()
>>>> print 'Params:', results.params, 'p-value:', results.pvalues
> Params: [-0.00169184] p-value: [ 0.]
>>>>
>
> So my questions are:
>
> 1) is the above correct, have I understood exog and endog correctly and so
> on?

Kind of; in your case exog is a trend.
If you want to fit just a constant, then exog = np.ones(1000);
if you want to fit a constant and a trend, then you could use
exog = sm.add_constant(np.arange(1000))

np.arange(1000) is the same as your list comprehension above, but faster.
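
In code, a minimal sketch of those two choices (using the newer
"import statsmodels.api as sm" path rather than the old scikits
namespace; adjust the import for your install):

import numpy as np
import statsmodels.api as sm

data = np.random.poisson(0.35, 1000)

# intercept only: for a Poisson GLM with log link, exp(coef) is the mean
exog_const = np.ones(1000)
res_const = sm.GLM(data, exog_const, family=sm.families.Poisson()).fit()
print(np.exp(res_const.params))   # should be close to 0.35

# intercept plus linear trend
exog_trend = sm.add_constant(np.arange(1000))
res_trend = sm.GLM(data, exog_trend, family=sm.families.Poisson()).fit()
print(res_trend.params)           # [intercept, trend coefficient]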

>
> 2) which discrete distributions are available? In sm.families.family I see
> Binomial, Gamma, Gaussian, InverseGaussian, NegativeBinomial, Poisson. Is
> there a way to use other distributions like Pareto or Yule?

statsmodels.discrete also has Poisson, Logit, ... as pure MLE models
instead of generalized linear models.

Both discrete_models and GLM are focused on regression, i.e. getting the
distribution parameters as a function of some explanatory variables.
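
For example, a rough sketch with the Poisson model from
statsmodels.discrete (exposed as sm.Poisson in recent versions; the
exact location may differ in older releases):

import numpy as np
import statsmodels.api as sm

data = np.random.poisson(0.35, 1000)
exog = np.ones(1000)              # intercept only

res = sm.Poisson(data, exog).fit()
print(res.params)                 # log of the Poisson mean
print(np.exp(res.params))         # estimated rate, close to 0.35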

If you just want to estimate the parameters of a distribution (instead
of fitting it to some explanatory variables), then you can also use
the fit method of the scipy.stats.distributions. I don't know how good
they are for discrete distributions.
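
For the continuous case, a small sketch of what that fit call looks like:

import numpy as np
from scipy import stats

sample = stats.gamma.rvs(2.0, scale=1.5, size=1000)
# fix the location at 0 and estimate shape and scale by MLE
shape, loc, scale = stats.gamma.fit(sample, floc=0)
print(shape, scale)               # roughly 2.0 and 1.5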

MLE for Pareto works if you know the lower bound (location), but it's
trickier if you want to estimate the location.
I never looked at estimation of Yule.
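
For the known-lower-bound case, a rough sketch (in the scipy
parametrization the lower bound goes in as the scale, and the shape
MLE has a closed form once loc and scale are fixed):

import numpy as np
from scipy import stats

x_min = 1.0
sample = stats.pareto.rvs(2.5, scale=x_min, size=1000)

# closed-form MLE for the shape when x_min is known
b_hat = len(sample) / np.log(sample / x_min).sum()

# same thing via scipy, fixing loc=0 and scale=x_min
b_scipy, loc, scale = stats.pareto.fit(sample, floc=0, fscale=x_min)
print(b_hat, b_scipy)             # both close to 2.5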

Do you have explanatory variables, or just constant parameters for the
distributions?

Josef

Sarah Mount

Apr 21, 2013, 5:42:16 PM
to pystatsmodels
Great, thanks.

>> 2) which discrete distributions are available? In sm.families.family I see
>> Binomial, Gamma, Gaussian, InverseGaussian, NegativeBinomial, Poisson. Is
>> there a way to use other distributions like Pareto or Yule?
>
> statsmodels.discrete also has Poisson, Logit, ... as pure MLE models
> instead of generalized linear models.
>
> Both discrete_models and GLM are focused on regression, i.e. getting the
> distribution parameters as a function of some explanatory variables.

OK, thanks.
 
> If you just want to estimate the parameters of a distribution (instead
> of fitting it to some explanatory variables), then you can also use
> the fit method of the scipy.stats.distributions. I don't know how good
> they are for discrete distributions.

I'm pretty sure they only have a fit() method for continuous distributions; that's why I started looking at sm.
 
> MLE for Pareto works if you know the lower bound (location), but it's
> trickier if you want to estimate the location.
> I never looked at estimation of Yule.
>
> Do you have explanatory variables, or just constant parameters for the
> distributions?

Explanatory variables, in most cases. 

Thanks,

josef...@gmail.com

Apr 21, 2013, 6:02:57 PM
to pystatsmodels
If you have statsmodels close to current master, then it is relatively easy
to add new models based on MLE using GenericLikelihoodModel.
The older versions of statsmodels didn't have good numerical derivatives
by default, and those needed to be overridden. The default numerical
derivatives in master are pretty good.

Vincent wrote a nice doc example for it:
http://statsmodels.sourceforge.net/devel/examples/generated/example_gmle.html

and here is his pull request for the negative binomial:
https://github.com/statsmodels/statsmodels/pull/584
It's not merged because I haven't yet checked some parts for the extra
parameters; otherwise it looks good.
The current discrete.NBin is not working properly.

If you have specific distributions, or if you run into problems with
specific distributions, then I can look into it.
The basic idea is to prototype with GenericLikelihoodModel, and then
refine with distribution-specific information.
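
A rough sketch of that pattern, with the Poisson log-likelihood standing
in for whatever distribution you actually need (the class and names here
are only for illustration):

import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel

class MyPoisson(GenericLikelihoodModel):
    def nloglikeobs(self, params):
        # log link: one coefficient per column of exog
        mu = np.exp(np.dot(self.exog, params))
        return -stats.poisson.logpmf(self.endog, mu)

data = np.random.poisson(0.35, 1000)
exog = sm.add_constant(np.arange(1000))
res = MyPoisson(data, exog).fit(start_params=np.zeros(exog.shape[1]))
print(res.params)   # [intercept, trend], intercept near log(0.35)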

I hope that helps.

For Pareto tails or power-tail distributions there is another Python
package that uses some distance measures to estimate the lower bound
of the power tail, but it doesn't take explanatory variables, IIRC.

Josef

Sarah Mount

Apr 21, 2013, 6:10:17 PM
to pystat...@googlegroups.com
Thanks, I've forked SM. It's not massively likely that I'll get as far as adding new models, but if I do I'll issue a pull request.
 
Cheers,

Sarah

josef...@gmail.com

Apr 21, 2013, 6:34:15 PM
to pystatsmodels
Even if you don't get as far as a pull request, it would be useful to
have some feedback on this, so we have more information about what works
and what doesn't.

You could just open an issue and request a new distribution, or add
some comments on which distributions work.

Cheers,

Josef
