[SciPy-User] log pdf, cdf, etc

56 views
Skip to first unread message

Chris Strickland

unread,
May 28, 2010, 7:29:26 AM5/28/10
to scipy...@scipy.org
Hi,

When using any of the distributions of scipy.stats there does not seem to be
the ability (or at least I cannot figure out how) to have the function return
the log of the pdf, cdf, sf, etc. For statistical analysis this is essential.
For instance suppose we are interested in an exponential distribution for a
random variable x with a hyperparameter lambda there needs to be an option
that returns -log(lambda)-x/lambda. It is not sufficient (numerically) to
calculate log(scipy.stats.expon.pdf(x,lambda)).

Is there a way to do this using the distributions in scipy.stats?

If there is not is it possible for me to suggest that this feature is added.
There is such an excellent range of distributions, each with such an
impressive range of options, it seems ashame to have to mostly manually code
up the log of pdfs and often call the log of CDFs from R. 

Thanks,
Chris.

josef...@gmail.com

unread,
May 28, 2010, 10:15:55 AM5/28/10
to SciPy Users List
On Fri, May 28, 2010 at 7:29 AM, Chris Strickland
<christopherm...@gmail.com> wrote:
> Hi,
>
> When using any of the distributions of scipy.stats there does not seem to be
> the ability (or at least I cannot figure out how) to have the function
> return
> the log of the pdf, cdf, sf, etc. For statistical analysis this is
> essential.
> For instance suppose we are interested in an exponential distribution for a
> random variable x with a hyperparameter lambda there needs to be an option
> that returns -log(lambda)-x/lambda. It is not sufficient (numerically) to
> calculate log(scipy.stats.expon.pdf(x,lambda)).
>
> Is there a way to do this using the distributions in scipy.stats?

It would need a new method for each distribution, e.g. _loglike, _logpdf
So, this is work, and for some distributions the log wouldn't simplify much.

I proposed this once together with other improvements (but without response).

The second useful method for estimation would be _fitstart, which
provides distribution specific starting values for fit, e.g. a moment
estimator, or a simple rules of thumb
http://projects.scipy.org/scipy/ticket/808


Here are some of my currently planned enhancements to the distributions:

http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head:/scikits/statsmodels/sandbox/stats/distributions_patch.py

but I just checked, it looks like I forgot to copy the _loglike method
that I started from my experimental scripts.

For a few distributions, where this is possible, it would also be
useful to add the gradient with respect to the parameters, (or even
the Hessian). But this is currently mostly just an idea, since we need
some analytical gradients in the estimation of stats models.


>
> If there is not is it possible for me to suggest that this feature is added.
> There is such an excellent range of distributions, each with such an
> impressive range of options, it seems ashame to have to mostly manually code
> up the log of pdfs and often call the log of CDFs from R.

So far I only thought about log pdf, because I wanted it for Maximum
Likelihood estimation.

Do you have a rough idea for which distributions log cdf would work?
that is, for which distribution is an analytical or efficient
numerical expression possible.

I also think that scipy.stats.distributions could be one of the best
(broadest, consistent) collection of univariate distributions that I
have seen so far, once we fill in some missing pieces.

As a way forward, I think we could make the distributions into a
numerical encyclopedia by adding private methods to those
distributions where it makes sense, like log pdf, log cdf and I also
started to add characteristic functions to some distributions in my
experimental scripts.
If you have a collection of logpdf, logcdf, we could add a trac ticket for this.

However, this would miss the generic broadcasting part of the public
functions, pdf, cdf,... but for estimation I wouldn't necessarily call
those because of the overhead.


I'm working on and off on this, so it's moving only slowly (and my
wishlist is big).
(for example, I was reading up on extreme value distributions in
actuarial science and hydrology to get a better overview over the
estimators.)


So, I really love to hear any ideas, feedback, and see contributions
to improving the distributions.

Josef


>
> Thanks,
> Chris.
>
> _______________________________________________
> SciPy-User mailing list
> SciPy...@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

Chris Strickland

unread,
May 28, 2010, 9:03:40 PM5/28/10
to SciPy Users List
On Sat, May 29, 2010 at 12:15 AM, <josef...@gmail.com> wrote:

It would need a new method for each distribution, e.g. _loglike, _logpdf
So, this is work, and for some distributions the log wouldn't simplify much.

I am not sure what you mean the log wouldn't simply much.
 
I proposed this once together with other improvements (but without response).

This is a little disappointing, it significantly reduces how useful the library is. In actual fact I have not been able to use a single function for anything other than testing (although, I have been using numpy.random for random numbers, this scipy.stats collection seems far more complete). This would dramatically change if a log version of the distribution were available. I think in most cases this would be a straightforward addition at least for the pdf.

 
The second useful method for estimation would be _fitstart, which
provides distribution specific starting values for fit, e.g. a moment
estimator, or a simple rules of thumb
http://projects.scipy.org/scipy/ticket/808


Here are some of my currently planned enhancements to the distributions:

http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head:/scikits/statsmodels/sandbox/stats/distributions_patch.py

but I just checked, it looks like I forgot to copy the _loglike method
that I started from my experimental scripts.

For a few distributions, where this is possible, it would also be
useful to add the gradient with respect to the parameters, (or even
the Hessian). But this is currently mostly just an idea, since we need
some analytical gradients in the estimation of stats models.

This certainly would be nice as well.

>
> If there is not is it possible for me to suggest that this feature is added.
> There is such an excellent range of distributions, each with such an
> impressive range of options, it seems ashame to have to mostly manually code
> up the log of pdfs and often call the log of CDFs from R.

So far I only thought about log pdf, because I wanted it for Maximum
Likelihood estimation.

It is also necessary for MCMC.
 
Do you have a rough idea for which distributions log cdf would work?
that is, for which distribution is an analytical or efficient
numerical expression possible.

Not sure off the top of my head as I mainly require the only the pdf. I was, however, doing a little survival analysis the other day though and it was required. The log of the survival and hazard functions would be nice also. So far I have only required the exponential (analytical), weibull (analytical), normal (numerical) and powernormal (analytical function of the log of the normal cdf). I just had a peak at the R source code for pnorm (R's code for the normal cdf). The function is not big and also licensed under the GNU public licence. I assume it could be fairly easily ported to scipy.

I also think that scipy.stats.distributions could be one of the best
(broadest, consistent) collection of univariate distributions that I
have seen so far, once we fill in some missing pieces.

As a way forward, I think we could make the distributions into a
numerical encyclopedia by adding private methods to those
distributions where it makes sense, like log pdf, log cdf and I also
started to add characteristic functions to some distributions in my
experimental scripts.
If you have a collection of logpdf, logcdf, we could add a trac ticket for this.

I could fairly easy whip up a collection of functions to compute the logpdf for a large number of distributions. Not sure about the CDFs but I can look into it as well. The pdf's are definitely far more urgent for my own work. I am a bit busy at work though for the next three weeks so it would have to be after that.

josef...@gmail.com

unread,
May 28, 2010, 10:53:37 PM5/28/10
to SciPy Users List
On Fri, May 28, 2010 at 9:03 PM, Chris Strickland
<christopherm...@gmail.com> wrote:
>
>
> On Sat, May 29, 2010 at 12:15 AM, <josef...@gmail.com> wrote:
>>
>> It would need a new method for each distribution, e.g. _loglike, _logpdf
>> So, this is work, and for some distributions the log wouldn't simplify
>> much.
>>
> I am not sure what you mean the log wouldn't simply much.
>
>>
>> I proposed this once together with other improvements (but without
>> response).
>>
> This is a little disappointing, it significantly reduces how useful the
> library is. In actual fact I have not been able to use a single function for
> anything other than testing (although, I have been using numpy.random for
> random numbers, this scipy.stats collection seems far more complete). This
> would dramatically change if a log version of the distribution were
> available. I think in most cases this would be a straightforward addition at
> least for the pdf.

I don't think for many use cases log(stats.t.pdf) or many other
distributions the performance and accuracy hit would be large enough
to make it useless. At least, I haven't seen any other comments in
this direction.

On of the main use cases for me of stats.distributions are all the
statistical test distributions, t, F, chi2 and so on. Howver, in
statsmodels we have a mixture of calls to the pdf/cdf of
stats.distributions and reimplementations of loglikelhood functions,
where the scipy version is also just used for testing.

>
>
>>
>> The second useful method for estimation would be _fitstart, which
>> provides distribution specific starting values for fit, e.g. a moment
>> estimator, or a simple rules of thumb
>> http://projects.scipy.org/scipy/ticket/808
>>
>>
>> Here are some of my currently planned enhancements to the distributions:
>>
>>
>> http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head:/scikits/statsmodels/sandbox/stats/distributions_patch.py
>>
>> but I just checked, it looks like I forgot to copy the _loglike method
>> that I started from my experimental scripts.
>>
>> For a few distributions, where this is possible, it would also be
>> useful to add the gradient with respect to the parameters, (or even
>> the Hessian). But this is currently mostly just an idea, since we need
>> some analytical gradients in the estimation of stats models.
>>
> This certainly would be nice as well.
>>
>> >
>> > If there is not is it possible for me to suggest that this feature is
>> > added.
>> > There is such an excellent range of distributions, each with such an
>> > impressive range of options, it seems ashame to have to mostly manually
>> > code
>> > up the log of pdfs and often call the log of CDFs from R.
>>
>> So far I only thought about log pdf, because I wanted it for Maximum
>> Likelihood estimation.
>>
> It is also necessary for MCMC.

pymc has many distributions with loglike in fortran for speed, but for
most distributions only loglike and rvs are defined, if I remember
correctly.

>
>>
>> Do you have a rough idea for which distributions log cdf would work?
>> that is, for which distribution is an analytical or efficient
>> numerical expression possible.
>
> Not sure off the top of my head as I mainly require the only the pdf. I was,
> however, doing a little survival analysis the other day though and it was
> required. The log of the survival and hazard functions would be nice also.
> So far I have only required the exponential (analytical), weibull
> (analytical), normal (numerical) and powernormal (analytical function of the
> log of the normal cdf). I just had a peak at the R source code for pnorm
> (R's code for the normal cdf). The function is not big and also licensed
> under the GNU public licence. I assume it could be fairly easily ported to
> scipy.

R's license, GPL, is incompatible with the license of scipy, BSD.
While they are allowed to look at our code, code that goes into scipy
cannot be based on GPL licensed code.

If never seen it mentioned before that there is a direct function for
log(norm.cdf). Which functions and packages in R implement the
logarithm of the cdf of these distributions?

The cdf for several distributions (including normal) is implement in
Fortran or C in scipy.special, and I've never seen a log version for
them.

>>
>> I also think that scipy.stats.distributions could be one of the best
>> (broadest, consistent) collection of univariate distributions that I
>> have seen so far, once we fill in some missing pieces.
>>
>> As a way forward, I think we could make the distributions into a
>> numerical encyclopedia by adding private methods to those
>> distributions where it makes sense, like log pdf, log cdf and I also
>> started to add characteristic functions to some distributions in my
>> experimental scripts.
>> If you have a collection of logpdf, logcdf, we could add a trac ticket for
>> this.
>
> I could fairly easy whip up a collection of functions to compute the logpdf
> for a large number of distributions. Not sure about the CDFs but I can look
> into it as well. The pdf's are definitely far more urgent for my own work. I
> am a bit busy at work though for the next three weeks so it would have to be
> after that.

I looked at some of the distributions, and logpdf could be more
efficiently calculated in many of them and very often also logcdf

I opened a ticket for this
http://projects.scipy.org/scipy/ticket/1184

I also saw that there are still smaller, numerical improvements
possible in several distributions.

Thanks,

Josef


>>
>> However, this would miss the generic broadcasting part of the public
>> functions, pdf, cdf,... but for estimation I wouldn't necessarily call
>> those because of the overhead.
>>
>>
>> I'm working on and off on this, so it's moving only slowly (and my
>> wishlist is big).
>> (for example, I was reading up on extreme value distributions in
>> actuarial science and hydrology to get a better overview over the
>> estimators.)
>>
>>
>> So, I really love to hear any ideas, feedback, and see contributions
>> to improving the distributions.
>>
>> Josef
>>
>>
>

Nathaniel Smith

unread,
May 28, 2010, 11:24:21 PM5/28/10
to SciPy Users List
On Fri, May 28, 2010 at 7:53 PM, <josef...@gmail.com> wrote:
> I don't think for many use cases log(stats.t.pdf) or many other
> distributions the performance and accuracy hit would be large enough
> to make it useless. At least, I haven't seen any other comments in
> this direction.

"Useless" is a value judgement, of course, but it doesn't seem *too*
far off to me either. I myself basically always find myself wanting
log-space values, and even if you're just doing statistical tests,
numerical precision in the tails can become very practically relevant
when doing multiple hypothesis correction.

> R's license, GPL, is incompatible with the license of scipy, BSD.
> While they are allowed to look at our code, code that goes into scipy
> cannot be based on GPL licensed code.

You mean, they're allowed to copy our code, and we're allowed to look
at their code for reference but can't use it directly :-).

> If never seen it mentioned before that there is a direct function for
> log(norm.cdf). Which functions and packages in R implement the
> logarithm of the cdf of these distributions?
>
> The cdf for several distributions (including normal) is implement in
> Fortran or C in scipy.special, and I've never seen a log version for
> them.

Yet R does in fact use specialized code for computing the log-cdf for
the normal distribution... at least over some parts of its range. I'm
not sure how much difference it makes or anything, I'm just reporting
on the existence of 'if' statements in the source :-). See the base R
distribution, src/nmath/pnorm.c (which also contains references).

-- Nathaniel

Chris Strickland

unread,
May 28, 2010, 11:34:50 PM5/28/10
to SciPy Users List
On Sat, May 29, 2010 at 12:53 PM, <josef...@gmail.com> wrote:

I don't think for many use cases log(stats.t.pdf) or many other
distributions the performance and accuracy hit would be large enough
to make it useless. At least, I haven't seen any other comments in
this direction.

On of the main use cases for me of stats.distributions are all the
statistical test distributions, t, F, chi2 and so on. Howver, in
statsmodels we have a mixture of calls to the pdf/cdf of
stats.distributions and reimplementations of loglikelhood functions,
where the scipy version is also just used for testing.

The main use for me is in specifying (log) prior distributions, (log) posterior distributions and log-likelihood functions. There is simply no way around using the log pdf in the vast majority of cases in MCMC analysis. Whilst it is trivial for me to simply write functions when I need them it would obviously benefit the statistical community as a whole if the option was available in the excellent set of distributions that are available as a part of Scipy.
 

R's license, GPL, is incompatible with the license of scipy, BSD.
While they are allowed to look at our code, code that goes into scipy
cannot be based on GPL licensed code.

Fair enough. Still at least for the normal cdf we could simply use the references in the R code to write a Scipy version.
 
If never seen it mentioned before that there is a direct function for
log(norm.cdf). Which functions and packages in R implement the
logarithm of the cdf of these distributions?

pnorm it is in the stats package for the log of the normal CDF. Kind of essential for distributions like the powernormal as well that use the normal cdf as a part of their pdf.

The cdf for several distributions (including normal) is implement in
Fortran or C in scipy.special, and I've never seen a log version for
them.

I looked at some of the distributions, and logpdf could be more
efficiently calculated in many of them and very often also logcdf

I opened a ticket for this
http://projects.scipy.org/scipy/ticket/1184

I also saw that there are still smaller, numerical improvements
possible in several distributions.

Thanks,

Josef

josef...@gmail.com

unread,
May 29, 2010, 12:00:07 AM5/29/10
to SciPy Users List
On Fri, May 28, 2010 at 11:24 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Fri, May 28, 2010 at 7:53 PM,  <josef...@gmail.com> wrote:
>> I don't think for many use cases log(stats.t.pdf) or many other
>> distributions the performance and accuracy hit would be large enough
>> to make it useless. At least, I haven't seen any other comments in
>> this direction.
>
> "Useless" is a value judgement, of course, but it doesn't seem *too*
> far off to me either. I myself basically always find myself wanting
> log-space values, and even if you're just doing statistical tests,
> numerical precision in the tails can become very practically relevant
> when doing multiple hypothesis correction.
>
>> R's license, GPL, is incompatible with the license of scipy, BSD.
>> While they are allowed to look at our code, code that goes into scipy
>> cannot be based on GPL licensed code.
>
> You mean, they're allowed to copy our code, and we're allowed to look
> at their code for reference but can't use it directly :-).

We are allowed to look at their manuals but not their code.
(Life ain't fair.)

>
>> If never seen it mentioned before that there is a direct function for
>> log(norm.cdf). Which functions and packages in R implement the
>> logarithm of the cdf of these distributions?
>>
>> The cdf for several distributions (including normal) is implement in
>> Fortran or C in scipy.special, and I've never seen a log version for
>> them.
>
> Yet R does in fact use specialized code for computing the log-cdf for
> the normal distribution... at least over some parts of its range. I'm
> not sure how much difference it makes or anything, I'm just reporting
> on the existence of 'if' statements in the source :-). See the base R
> distribution, src/nmath/pnorm.c (which also contains references).

pnorm is the cdf not the log of the cdf, that's what I thought,
but I just saw that they have a "log.p" option.

from the R manual:
"""
For pnorm, based on
Cody, W. D. (1993) Algorithm 715: SPECFUN – A portable FORTRAN package
of special function routines and test drivers
"""

this sounds similar to the fortran or c code that scipy.special has. I
never tried to read that one, except for the doc comments.

accuracy doesn't seem to be a problem

np.log(stats.norm.cdf(np.linspace(-20,20,21))) - [r.pnorm(x,
log_p=True) for x in np.linspace(-20,20,21)]
array([ -2.84217094e-14, -2.84217094e-14, -2.84217094e-14,
0.00000000e+00, -1.42108547e-14, -7.10542736e-15,
-7.10542736e-15, -3.55271368e-15, -1.77635684e-15,
-4.44089210e-16, 0.00000000e+00, 0.00000000e+00,
-5.42101086e-20, -5.53867815e-17, -4.40377573e-17,
7.61985302e-24, 1.77648211e-33, 7.79353682e-45,
6.38875440e-58, 9.74094892e-73, 2.75362412e-89])

except the small numbers in the tail look much better in R

>>> np.log(stats.norm.cdf(np.linspace(-20,20,21)))
array([ -2.03917155e+02, -1.65812373e+02, -1.31695396e+02,
-1.01563034e+02, -7.54106730e+01, -5.32312852e+01,
-3.50134372e+01, -2.07367689e+01, -1.03601015e+01,
-3.78318433e+00, -6.93147181e-01, -2.30129093e-02,
-3.16717434e-05, -9.86587701e-10, -6.66133815e-16,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00])

>>> np.array([r.pnorm(x, log_p=True) for x in np.linspace(-20,20,21)])
array([ -2.03917155e+02, -1.65812373e+02, -1.31695396e+02,
-1.01563034e+02, -7.54106730e+01, -5.32312852e+01,
-3.50134372e+01, -2.07367689e+01, -1.03601015e+01,
-3.78318433e+00, -6.93147181e-01, -2.30129093e-02,
-3.16717434e-05, -9.86587646e-10, -6.22096057e-16,
-7.61985302e-24, -1.77648211e-33, -7.79353682e-45,
-6.38875440e-58, -9.74094892e-73, -2.75362412e-89])

except if we use a branch cut

>>> np.log1p(-stats.norm.sf(np.linspace(-20,20,21)))
array([ -Inf, -Inf, -Inf,
-Inf, -Inf, -Inf,
-3.49450411e+01, -2.07367689e+01, -1.03601015e+01,
-3.78318433e+00, -6.93147181e-01, -2.30129093e-02,
-3.16717434e-05, -9.86587646e-10, -6.22096057e-16,
-7.61985302e-24, -1.77648211e-33, -7.79353682e-45,
-6.38875440e-58, -9.74094892e-73, -2.75362412e-89])

I have no idea about speed.

Josef

josef...@gmail.com

unread,
May 29, 2010, 12:15:49 AM5/29/10
to SciPy Users List
On Fri, May 28, 2010 at 11:34 PM, Chris Strickland
<christopherm...@gmail.com> wrote:
>
>
> On Sat, May 29, 2010 at 12:53 PM, <josef...@gmail.com> wrote:
>>
>> I don't think for many use cases log(stats.t.pdf) or many other
>> distributions the performance and accuracy hit would be large enough
>> to make it useless. At least, I haven't seen any other comments in
>> this direction.
>>
>> On of the main use cases for me of stats.distributions are all the
>> statistical test distributions, t, F, chi2 and so on. Howver, in
>> statsmodels we have a mixture of calls to the pdf/cdf of
>> stats.distributions and reimplementations of loglikelhood functions,
>> where the scipy version is also just used for testing.
>>
> The main use for me is in specifying (log) prior distributions, (log)
> posterior distributions and log-likelihood functions. There is simply no way
> around using the log pdf in the vast majority of cases in MCMC analysis.
> Whilst it is trivial for me to simply write functions when I need them it
> would obviously benefit the statistical community as a whole if the option
> was available in the excellent set of distributions that are available as a
> part of Scipy.

I agree that it would be very good to have this generally available,
and I will appreciate it for maximum likelihood.
For MCMC (where I know only little about the details), it might,
however, always be faster to work with dedicated code as in pymc.

>
>>
>> R's license, GPL, is incompatible with the license of scipy, BSD.
>> While they are allowed to look at our code, code that goes into scipy
>> cannot be based on GPL licensed code.
>>
> Fair enough. Still at least for the normal cdf we could simply use the
> references in the R code to write a Scipy version.

If it's the C or Fortran implementation, then it is out of my
competence, I'm a pure scripting language person.

Another idea for this would be to see if any of the pymc code for this
would fit into scipy. Since I leave Fortran to others, I never looked
at it.

I think if we get the easier cases, logpdf and logcdf that don't
require compiled versions, we would be able to cover already a
considerable range of the distributions.

However, I also agree now, having norm.logcdf would also be useful for
many other distributions.

>
>>
>> If never seen it mentioned before that there is a direct function for
>> log(norm.cdf). Which functions and packages in R implement the
>> logarithm of the cdf of these distributions?
>
> pnorm it is in the stats package for the log of the normal CDF. Kind of
> essential for distributions like the powernormal as well that use the normal
> cdf as a part of their pdf.

see previous message, I never paid enough attention to see the log.p option

Josef

>>
>> The cdf for several distributions (including normal) is implement in
>> Fortran or C in scipy.special, and I've never seen a log version for
>> them.
>>
>> I looked at some of the distributions, and logpdf could be more
>> efficiently calculated in many of them and very often also logcdf
>>
>> I opened a ticket for this
>> http://projects.scipy.org/scipy/ticket/1184
>>
>> I also saw that there are still smaller, numerical improvements
>> possible in several distributions.
>>
>> Thanks,
>>
>> Josef
>>
>> ______________________________________________
>> SciPy-User mailing list
>> SciPy...@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>

> _______________________________________________

_______________________________________________

josef...@gmail.com

unread,
May 29, 2010, 12:20:51 AM5/29/10
to SciPy Users List

I'm contradicting and confusing myself, I don't think pymc has any cdf
code, only pdf.

Josef

Chris Strickland

unread,
May 29, 2010, 12:32:48 AM5/29/10
to SciPy Users List
On Sat, May 29, 2010 at 2:20 PM, <josef...@gmail.com> wrote:

Josef

>
> I think if we get the easier cases, logpdf and logcdf that don't
> require compiled versions, we would be able to cover already a
> considerable range of the distributions.
>
> However, I also agree now, having norm.logcdf would also be useful for
> many other distributions.
>
I  can write in C and Fortran (I prefer Fortran with Python) so I could easily write code ( in around three weeks when my workload reduces) for cases where compiled languages are required.

josef...@gmail.com

unread,
May 29, 2010, 12:43:04 AM5/29/10
to SciPy Users List

I'm busy for another month.

Just a warning in case you don't know: scipy is still stuck at fortran
77 because some platforms (e.g. Windows with mingw - which I use)
support only g77. I don't know when the upgrade will happen.

cython would be an alternative that's easier to maintain.

Josef

Chris Strickland

unread,
May 29, 2010, 12:55:26 AM5/29/10
to SciPy Users List
On Sat, May 29, 2010 at 2:43 PM, <josef...@gmail.com> wrote:
On Sat, May 29, 2010 at 12:32 AM, Chris Strickland
<christopherm...@gmail.com> wrote:
>
>
> On Sat, May 29, 2010 at 2:20 PM, <josef...@gmail.com> wrote:
>>
>> Josef
>>
>> >
>> > I think if we get the easier cases, logpdf and logcdf that don't
>> > require compiled versions, we would be able to cover already a
>> > considerable range of the distributions.
>> >
>> > However, I also agree now, having norm.logcdf would also be useful for
>> > many other distributions.
>> >
>
> I  can write in C and Fortran (I prefer Fortran with Python) so I could
> easily write code ( in around three weeks when my workload reduces) for
> cases where compiled languages are required.

I'm busy for another month.

Just a warning in case you don't know: scipy is still stuck at fortran
77 because some platforms (e.g. Windows with mingw - which I use)
support only g77. I don't know when the upgrade will happen.

cython would be an alternative that's easier to maintain.

Josef

Fortran77 isn't a problem assuming we can just link our Fortran using f2py. I don't really have any experience linking code manually.  Hmm, the g77 compiler has been superseded by  the gfortran complier. Does this not work under Windows?

John Salvatier

unread,
May 29, 2010, 1:15:21 AM5/29/10
to scipy...@scipy.org
The package PyMC(http://code.google.com/p/pymc/) contains fortran log likelihood functions for a lot of distributions, but you would have to look at the source code to figure out how to use them since they are meant mostly for internal use. They are not ufuncs but can handle arrays or single values for each parameter. A recent PyMC branch also contains similar log likelihood gradient functions for the same distributions (http://github.com/pymc-devs/pymc/tree/gradientBranch).

Hope that is useful.

josef...@gmail.com

unread,
May 29, 2010, 10:46:14 AM5/29/10
to SciPy Users List

Thanks, I will have a look at the gradient branch


To get started, I added a test script to the ticked that makes it
easier to add and test new methods for lnpdf and lncdf. It's adapted
from the scipy tests and tests all distributions that have a _lnpdf or
_lncdf method.

The new methods can just be added to the script to monkey patch
scipy.stats.distributions.

I monkey patched 13 easy cases so far, mainly to check that the script
works. (Still far too go for full coverage of cases where this makes
sense.)

The tests use nosetests and test for (almost) equality of the new
methods with the log of the old methods, and check a simple
broadcasting case.

Everyone is invited to add new cases, and to report any problems with
the script.

I hope that helps to get the ball rolling.

Josef

>
> Hope that is useful.

John Reid

unread,
May 29, 2010, 12:18:07 PM5/29/10
to scipy...@scipy.org
josef...@gmail.com wrote:
> On Fri, May 28, 2010 at 7:29 AM, Chris Strickland
> <christopherm...@gmail.com> wrote:
>> Hi,
>>
>> When using any of the distributions of scipy.stats there does not seem to be
>> the ability (or at least I cannot figure out how) to have the function
>> return
>> the log of the pdf, cdf, sf, etc. For statistical analysis this is
>> essential.
>> For instance suppose we are interested in an exponential distribution for a
>> random variable x with a hyperparameter lambda there needs to be an option
>> that returns -log(lambda)-x/lambda. It is not sufficient (numerically) to
>> calculate log(scipy.stats.expon.pdf(x,lambda)).
>>
>> Is there a way to do this using the distributions in scipy.stats?
>
> It would need a new method for each distribution, e.g. _loglike, _logpdf
> So, this is work, and for some distributions the log wouldn't simplify much.

Presumably it would be easy to add a method to all distributions that
called the pdf and took its log. This could be over-riden for those
distributions for which a specialised log_pdf implementation was
available. This would make the entry cost of providing the functionality
lower.

josef...@gmail.com

unread,
May 29, 2010, 12:49:06 PM5/29/10
to SciPy Users List
On Sat, May 29, 2010 at 12:18 PM, John Reid <j.r...@mail.cryst.bbk.ac.uk> wrote:
> josef...@gmail.com wrote:
>> On Fri, May 28, 2010 at 7:29 AM, Chris Strickland
>> <christopherm...@gmail.com> wrote:
>>> Hi,
>>>
>>> When using any of the distributions of scipy.stats there does not seem to be
>>> the ability (or at least I cannot figure out how) to have the function
>>> return
>>> the log of the pdf, cdf, sf, etc. For statistical analysis this is
>>> essential.
>>> For instance suppose we are interested in an exponential distribution for a
>>> random variable x with a hyperparameter lambda there needs to be an option
>>> that returns -log(lambda)-x/lambda. It is not sufficient (numerically) to
>>> calculate log(scipy.stats.expon.pdf(x,lambda)).
>>>
>>> Is there a way to do this using the distributions in scipy.stats?
>>
>> It would need a new method for each distribution, e.g. _loglike, _logpdf
>> So, this is work, and for some distributions the log wouldn't simplify much.
>
> Presumably it would be easy to add a method to all distributions that
> called the pdf and took its log. This could be over-riden for those
> distributions for which a specialised log_pdf implementation was
> available. This would make the entry cost of providing the functionality
> lower.

Yes, I haven't thought about it yet for this case, but that's how the
system for the current methods works, only _pdf or _cdf is required,
all other methods have generic substitutes (which are sometimes very
slow.)

For testing, not having the generic version is easier, I have to
figure out again how to check whether a method was defined in the
super or the sub class (instead of using hasattr).

a naming question
_lnpdf or _logpdf ? _lncdf or _logcdf ?

Josef

Skipper Seabold

unread,
May 29, 2010, 1:21:53 PM5/29/10
to SciPy Users List
On Sat, May 29, 2010 at 12:49 PM, <josef...@gmail.com> wrote:
> a naming question
> _lnpdf or _logpdf ? _lncdf or _logcdf ?
>

My vote would be for log over ln since np.log(np.e) == 1.

Skipper

josef...@gmail.com

unread,
May 29, 2010, 2:20:19 PM5/29/10
to SciPy Users List
On Sat, May 29, 2010 at 1:21 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Sat, May 29, 2010 at 12:49 PM,  <josef...@gmail.com> wrote:
>> a naming question
>> _lnpdf or _logpdf ? _lncdf or _logcdf ?
>>
>
> My vote would be for log over ln since np.log(np.e) == 1.

Yes, I don't know what I was thinking early in the morning, nobody
seems to use ln anymore

I edited the script
rename ln ->log
make print optional
add generic method
replace hasattr by (not sure this is the best way)

def isnotgeneric(distfn, methodname):
sub = getattr(distfn, methodname)
gen = getattr(stats.distributions.rv_continuous, methodname)
return not sub.im_func is gen.im_func

(generic methods don't pass the tests for all distributions, there are
some problems with broadcasting in the current _pdf, _cdf
implementations for some distributions)

Josef

Nathaniel Smith

unread,
May 29, 2010, 3:44:20 PM5/29/10
to SciPy Users List
On Fri, May 28, 2010 at 9:00 PM, <josef...@gmail.com> wrote:
> On Fri, May 28, 2010 at 11:24 PM, Nathaniel Smith <n...@pobox.com> wrote:
>> On Fri, May 28, 2010 at 7:53 PM,  <josef...@gmail.com> wrote:
>>> R's license, GPL, is incompatible with the license of scipy, BSD.
>>> While they are allowed to look at our code, code that goes into scipy
>>> cannot be based on GPL licensed code.
>>
>> You mean, they're allowed to copy our code, and we're allowed to look
>> at their code for reference but can't use it directly :-).
>
> We are allowed to look at their manuals but not their code.
> (Life ain't fair.)

It sounds like you guys have this well in hand, but just a point here
-- you certainly are allowed to look at their code, just not copy the
"expressive aspects" of it. (Saying you can't *look* at it because of
the license is like saying writers can't read other people's novels!)
"Expressive" is a tricky term, of course -- IIUC it's basically
anything that could be changed while preserving functionality (because
the functionality, the algorithm itself, is not covered by copyright).
So, say, variable names certainly count as expressive, decisions about
which way to lay out the code, etc. If one wants to be really safe,
one can write down a textual description of the algorithm and then ask
someone else to translate back to code (the "clean room" method).

So you do have to be a bit careful, but when you have code that
contains valuable information that isn't really written down anywhere
else then I'd say it's worth it.

Travis Oliphant

unread,
May 29, 2010, 4:51:38 PM5/29/10
to SciPy Users List

On May 28, 2010, at 9:15 AM, josef...@gmail.com wrote:

> On Fri, May 28, 2010 at 7:29 AM, Chris Strickland
> <christopherm...@gmail.com> wrote:
>> Hi,
>>
>> When using any of the distributions of scipy.stats there does not seem to be
>> the ability (or at least I cannot figure out how) to have the function
>> return
>> the log of the pdf, cdf, sf, etc. For statistical analysis this is
>> essential.
>> For instance suppose we are interested in an exponential distribution for a
>> random variable x with a hyperparameter lambda there needs to be an option
>> that returns -log(lambda)-x/lambda. It is not sufficient (numerically) to
>> calculate log(scipy.stats.expon.pdf(x,lambda)).
>>
>> Is there a way to do this using the distributions in scipy.stats?
>
> It would need a new method for each distribution, e.g. _loglike, _logpdf
> So, this is work, and for some distributions the log wouldn't simplify much.
>
> I proposed this once together with other improvements (but without response).
>
> The second useful method for estimation would be _fitstart, which
> provides distribution specific starting values for fit, e.g. a moment
> estimator, or a simple rules of thumb
> http://projects.scipy.org/scipy/ticket/808
>
>
> Here are some of my currently planned enhancements to the distributions:
>
> http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head:/scikits/statsmodels/sandbox/stats/distributions_patch.py

Hey Josef,

I've been playing with distributions.py today and added logpdf, logcdf, logsf methods (based on _logpdf, _logcdf, _logsf methods in each distribution).

I also added your _fitstart suggestion. I would like to do something like your nnlf_fit method that allows you to fix some parameters and only solve for others, but I haven't thought through all the issues yet.

Do you have updated code I could look at. These are relatively easy adds that I would like to put in today. Do you have check-in rights to SciPy?

Thanks,

-Travis

---
Travis Oliphant
Enthought, Inc.
olip...@enthought.com
1-512-536-1057
http://www.enthought.com

josef...@gmail.com

unread,
May 29, 2010, 5:20:25 PM5/29/10
to SciPy Users List

I just committed the changes for the _logpdf, ..., I didn't see any
changes of yours in the timeline, nor in svn changes, plus a fix to
internal wrapcauchy_cdf

generic _logpdf, logcdf and the 13 cases of my test script are in svn

Josef

Josef

josef...@gmail.com

unread,
May 29, 2010, 5:38:46 PM5/29/10
to SciPy Users List
On Sat, May 29, 2010 at 4:51 PM, Travis Oliphant <olip...@enthought.com> wrote:
>

I would like to get the private _logpdf in a useful (vectorized or
broadcastable) version because for estimation and optimization, I want
to avoid the logpdf overhead. So, my testing will be on the underline
versions.

>
> I also added your _fitstart suggestion.   I would like to do something like your nnlf_fit method that allows you to fix some parameters and only solve for others, but I haven't thought through all the issues yet.

I have written a semi-frozen fit function and posted to the mailing
list a long time ago, but since I'm not sure about the API and I'm
expanding to several new estimators, I kept this under
work-in-progress.

Similar _fitstart might need extra options, for estimation when some
parameters are fixed, e.g. there are good moment estimators that work
when some of the parameters (e.g. loc or scale) are fixed. Also
_fitstart is currently used only by my fit_frozen.

I was hoping to get this done this year, maybe together with the
enhancements that Per Brodtkorb proposed two years ago, e.g. Method of
Maximum Spacings.

I also have a Generalized Method of Moments estimator based on
matching quantiles and moments in the works.

So, I don't want yet to be pinned down with any API for the estimation
enhancements.

Josef

josef...@gmail.com

unread,
May 29, 2010, 5:58:31 PM5/29/10
to SciPy Users List

Skipper Seabold

unread,
May 30, 2010, 12:20:45 AM5/30/10
to SciPy Users List

Thanks, this is useful to know. I've always erred on the side of
caution and just compared the results of functions/algorithms that
*should* be the same vs, say, R, but if I could do this and then look
at implementation details this could relieve substantial headaches.
It still seems like such a fine line though.

Skipper

Anne Archibald

unread,
May 30, 2010, 12:36:08 AM5/30/10
to SciPy Users List

This is exactly the problem. I don't think the R community is
particularly litigious, but as a rule of thumb, doing something that
is technically legal but for which the legality is subtle opens one up
to lawsuits. The problem is that even when you are right, a lawsuit is
tremendously destructive. So things that are legal but subtle should
probably be avoided by a group as penniless as the community as scipy
developers. So it's probably better to just not read their source
code.

Anne

Ralf Gommers

unread,
May 31, 2010, 7:39:41 AM5/31/10
to SciPy Users List
On Sun, May 30, 2010 at 5:38 AM, <josef...@gmail.com> wrote:
On Sat, May 29, 2010 at 4:51 PM, Travis Oliphant <olip...@enthought.com> wrote:
>
> Hey Josef,
>
> I've been playing with distributions.py today and added logpdf, logcdf, logsf methods (based on _logpdf, _logcdf, _logsf methods in each distribution).

I would like to get the private _logpdf in a useful (vectorized or
broadcastable) version because for estimation and optimization, I want
to avoid the logpdf overhead. So, my testing will be on the underline
versions.

>
> I also added your _fitstart suggestion.   I would like to do something like your nnlf_fit method that allows you to fix some parameters and only solve for others, but I haven't thought through all the issues yet.

I have written a semi-frozen fit function and posted to the mailing
list a long time ago, but since I'm not sure about the API and I'm
expanding to several new estimators, I kept this under
work-in-progress.

Similar _fitstart might need extra options, for estimation when some
parameters are fixed, e.g. there are good moment estimators that work
when some of the parameters (e.g. loc or scale) are fixed. Also
_fitstart is currently used only by my fit_frozen.

I was hoping to get this done this year, maybe together with the
enhancements that Per Brodtkorb proposed two years ago, e.g. Method of
Maximum Spacings.

I also have a Generalized Method of Moments estimator based on
matching quantiles and moments in the works.

So, I don't want yet to be pinned down with any API for the estimation
enhancements.

 These recent changes are a bit problematic for several reasons:
- there are many new methods for distributions without tests.
- there are no docs for many new private and public methods
- invalid syntax: http://projects.scipy.org/scipy/ticket/1186
- the old rv_continuous doc template was put back in

This, plus Josef saying that he doesn't want to fix the API for some methods yet, makes me want to take it out of the 0.8.x branch. Any objections to that Travis or Josef?

Cheers,
Ralf


Charles R Harris

unread,
May 31, 2010, 11:59:39 AM5/31/10
to SciPy Users List

I'm thinking it should be taken out of the trunk as well as the 0.8.x branch.

Chuck

Chris Strickland

unread,
May 27, 2010, 7:22:54 AM5/27/10
to scipy...@scipy.org
Hi,

When using any of the distributions of scipy.stats there does not seem to be
the ability (or at least I cannot figure out how) to have the function return
the log of the pdf, cdf, sf, etc. For statistical analysis this is essential.
For instance suppose we are interested in an exponential distribution for a
random variable x with a hyperparameter lambda there needs to be an option
that returns -log(lambda)-x/lambda. It is not sufficient (numerically) to
calculate log(scipy.stats.expon.pdf(x,lambda)).

Is there a way to do this using the distributions in scipy.stats?

If there is not is it possible for me to suggest that this feature is added.

There is such an excellent range of distributions, each with such an
impressive range of options, it seems ashame to have to mostly manually code
up the log of pdfs and often call the log of CDFs from R.

Thanks,
Chris.

Chris Strickland

unread,
Jun 8, 2010, 6:52:30 PM6/8/10
to SciPy Users List
I posted this over a week ago and we have a running thread with discussion on it. So ignore this somewhat mysterious re-appearance of an old post. If you are interested in the discussion post in the other thread. 
Reply all
Reply to author
Forward
0 new messages