distributions


Stefan Karpinski

May 23, 2012, 3:31:13 AM
to Julia Dev
I've committed a very rough initial cut of some probability distribution types:

https://github.com/JuliaLang/julia/blob/master/base/distributions.jl

This just defines how these distributions are parameterized, and doesn't provide any functionality. The list is also very incomplete at this point (I got tired of typing them in). However, it's a start. We can start filling in method definitions for the pdf, cdf, and sampling functions, etc.

Douglas Bates

May 23, 2012, 8:52:15 AM
to juli...@googlegroups.com
It happens that I was working on some of the functionality last evening.  I will take what I did and put it in your framework.

You are using capitalized names for distributions, which I think is a good idea because it distinguishes names such as Logistic for the distribution from logistic for the function.  I will stay with that convention unless someone shouts (I think Jeff prefers lower-case names).

Douglas Bates

May 23, 2012, 9:46:34 AM
to juli...@googlegroups.com
It happens that I started with discrete distributions which, it turns out, are a bit more complicated than continuous distributions.  I think that the DiscreteDistribution type needs to be something other than

abstract DiscreteDistribution{T<:Integer}  <: Distribution{T}

because the type T<:Integer just tells you the positions of the point masses in the probability mass function (pmf).  Often the parameters of the distribution are T1<:Real, and many of the functions associated with the distribution (cdf, quantile, pmf) are defined over the entire real line or the interval [0,1].  I have also defined methods for the mean, variance, stddev, ... functions for individual distributions, and they should return a type <:Real.  In terms of types of arguments and returned values, it is only the random variate generator, which I have defined as methods for the rand() generic, that uses T<:Integer for discrete distributions, and even that is not a hard and fast rule (often it is convenient to think of the value of a binomial as the proportion of successes).

I will cobble something up with non-parameterized abstract types until this can be discussed in more detail.
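The non-parameterized abstract types he mentions could be sketched as follows. This is a hypothetical sketch, not the actual file; it uses current `abstract type` syntax so it runs today, whereas the 2012 code in this thread spells it `abstract Distribution`:

```julia
# Hypothetical sketch of the non-parameterized hierarchy under discussion.
abstract type Distribution end
abstract type DiscreteDistribution   <: Distribution end
abstract type ContinuousDistribution <: Distribution end

# Concrete distributions subtype one of the two, and generic functions
# (pdf, cdf, mean, rand, ...) add methods that dispatch on those subtypes.
```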

Douglas Bates

May 23, 2012, 10:31:28 AM
to juli...@googlegroups.com
So this is where I am now for a discrete distribution.  Shall we take this discussion to an Issue and cut down the noise on the list? 

type Binomial <: DiscreteDistribution
    size::Integer
    prob::Real

    Binomial(n, p) = n <= 0 ?  error("size must be positive") : (0. <= p <= 1. ? new(n, p) : error("prob must be in [0,1]"))
end

variance(d::Binomial) = d.size * d.prob * (1. - d.prob)
mean(d::Binomial)     = d.size * d.prob
function rand(d::Binomial)
    ccall(dlsym(_jl_libRmath, :rbinom), Float64, (Float64, Float64), d.size, d.prob)
end
function pmf(d::Binomial, x::Real)
    ccall(dlsym(_jl_libRmath, :dbinom), Float64, (Float64,Float64,Float64,Int32),
          x, d.size, d.prob, 0)
end
function lpmf(d::Binomial, x::Real)  # log of pmf calculated directly (more precise)
    ccall(dlsym(_jl_libRmath, :dbinom), Float64, (Float64,Float64,Float64,Int32),
          x, d.size, d.prob, 1)
end
function cdf(d::Binomial, x::Real)
    ccall(dlsym(_jl_libRmath, :pbinom), Float64, (Float64,Float64,Float64,Int32,Int32),
          x, d.size, d.prob, 1, 0)
end

Jeffrey Sarnoff

May 23, 2012, 12:15:16 PM
to juli...@googlegroups.com
Consider a type that collects characterizations of a distribution (mean, median, mode, variance, stddev, skewness, kurtosis, pdf, cdf).
Not all characteristics are available for every distribution; still, the information that some such item is undefined for a given distribution is important.
Each of these items is defined as a function of the distribution parameters, plus, for pdf and cdf, an (often constrained, e.g. to [0,1]) independent variable.
Perhaps this suggests, say, five versions of the characterization type (one for 0-param dists, one for 1-param dists, ... up to 4-param dists) or, less carefully, varargs.

At this point I am lost as to how to do the following:
  (please show me) each function needs to find the distribution type's parameters and use them as the call arguments

Douglas Bates

May 23, 2012, 12:47:52 PM
to juli...@googlegroups.com
On Wednesday, May 23, 2012 11:15:16 AM UTC-5, Jeffrey Sarnoff wrote:
> Consider a type that collects characterizations of a distribution (mean, median, mode, variance, stddev, skewness, kurtosis, pdf, cdf).

I'm not sure what you mean by a "type" there.  I imagine these names as functions, which in Julia means generic functions for which methods with different signatures can be defined.  The trick is to declare Julia types for the distributions and include an instance of a specific distribution type in the signature of the method.
 
> Not all characteristics are available for every distribution; still, the information that some such item is undefined for a given distribution is important.
> Each of these items is defined as a function of the distribution parameters, plus, for pdf and cdf, an (often constrained, e.g. to [0,1]) independent variable.
> Perhaps this suggests, say, five versions of the characterization type (one for 0-param dists, one for 1-param dists, ... up to 4-param dists) or, less carefully, varargs.

I don't think that is necessary.  The number and type of parameters in a distribution is defined in that distribution's type.
 
> At this point I am lost as to how to do the following:
>   (please show me) each function needs to find the distribution type's parameters and use them as the call arguments

The way we had it sketched out, a specific distribution is a type containing the parameter values.  A function such as pdf or cdf has a method whose signature includes an instance of the distribution type and another argument, which is the value(s) at which to evaluate the pdf or cdf. That's what a signature like

function pmf(d::Binomial, x::Real)
    ccall(dlsym(_jl_libRmath, :dbinom), Float64, (Float64,Float64,Float64,Int32),
          x, d.size, d.prob, 0)
end

indicates.  Perhaps another example would be more meaningful, because that method uses ccall, which may be confusing the issue.

type Normal <: ContinuousDistribution
    mean::Float64
    stddev::Float64
    Normal(mu, sd) = sd >= 0 ? new(float64(mu), float64(sd)) : error("stddev must be non-negative")
end
Normal(mu) = Normal(mu, 1)
Normal() = Normal(0,1)
const Gaussian = Normal

mean(d::Normal) = d.mean
variance(d::Normal) = d.stddev^2
rand(d::Normal) = d.mean + d.stddev * randn()

You would call the rand method like

julia> nn = Normal(3, 1)
Normal(3.0,1.0)

julia> rand(nn)
2.122420898574372

Jeffrey Sarnoff

May 24, 2012, 7:52:55 AM
to juli...@googlegroups.com
That is clarifying, and good to know.
A few of the generics are mathematically undefined for some distributions. Please allow them, returning an error or NaN.

Stefan Karpinski

May 24, 2012, 1:13:34 PM
to juli...@googlegroups.com
This can be found out by using the applicable function: applicable(kurtosis, d) for a Normal instance d. If we don't know how to compute the kurtosis of a Normal distribution, that will be false. Of course, generic fallbacks can sabotage that test, but I'm not sure why we'd need a type for that. You can just try it and, if it fails with a no-method error, we didn't know how to compute it.

I'm unclear on what you mean about finding the distribution type's params...

Stefan Karpinski

May 24, 2012, 1:18:58 PM
to juli...@googlegroups.com
I'm not sure that's a good idea. NaN is generally a hardware performance crutch and should be introduced as little as possible. If we don't know how to compute something it should be a no-method error, not a NaN. If you want to get a NaN you can easily do this: `k = try kurtosis(d) catch; NaN end`. Can you give an example of what you mean? In cases where the expected value is infinite, the correct answer is Inf, but I don't think that's what you mean.
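The try/catch fallback mentioned above can be made concrete. One subtlety: in Julia, a bare name directly after `catch` is bound as the exception variable, so the fallback value needs a semicolon before it. A self-contained sketch, where `Cauchy` and `kurtosis` are local stand-ins rather than any real package API:

```julia
# Stand-in distribution whose kurtosis deliberately errors.
struct Cauchy end
kurtosis(d::Cauchy) = error("kurtosis is undefined for the Cauchy distribution")

d = Cauchy()

# `catch; NaN end` (note the semicolon) evaluates to NaN on failure;
# `catch NaN end` would instead bind the exception to a variable named NaN.
k = try
    kurtosis(d)
catch
    NaN
end
```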

Jeffrey Sarnoff

May 24, 2012, 1:43:57 PM
to juli...@googlegroups.com
This thread's approach sidesteps needing the distribution type's params.

My thought was about running through many of the available distributions: fitting some data, then evaluating the variance, kurtosis, etc. for each, and choosing one of them for limited use as a rough approximation of that data.
Some distributions have more than one set of parameters in common use, and some have both 2-param and 4-param forms. I thought it would simplify things to access the distributions' type parameters and so know their number and their [standard] names (location, scale/rate, shape).

Stefan Karpinski

May 24, 2012, 1:55:05 PM
to juli...@googlegroups.com
I think that fitting should be available generically. Something like

d = fit(Normal, data)

which returns a fitted Normal distribution instance. To use fit, you don't need to know anything about the distribution parameters. The fit method that implements the fitting, of course, does need to know about the parameters, but also knows the type of its argument, so that's just fine.
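A minimal sketch of how such a generic fit could dispatch on the distribution type. All definitions here are hypothetical stand-ins, not the eventual package API; the criterion is maximum likelihood for a Normal, i.e. the sample mean and the uncorrected sample standard deviation:

```julia
using Statistics

# Stand-in Normal type, mirroring the one sketched earlier in the thread.
struct Normal
    mean::Float64
    stddev::Float64
end

# Dispatch on the *type* Normal: the caller needs to know nothing about
# the parameters; the method body does.
fit(::Type{Normal}, data) = Normal(mean(data), std(data; corrected=false))

d = fit(Normal, [1.0, 2.0, 3.0, 4.0])
# d.mean == 2.5, d.stddev == sqrt(1.25)
```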

Jeffrey Sarnoff

May 24, 2012, 2:34:32 PM
to juli...@googlegroups.com
that works for me

Douglas Bates

May 24, 2012, 2:58:00 PM
to juli...@googlegroups.com
On Thursday, May 24, 2012 12:18:58 PM UTC-5, Stefan Karpinski wrote:
> I'm not sure that's a good idea. NaN is generally a hardware performance crutch and should be introduced as little as possible. If we don't know how to compute something it should be a no-method error, not a NaN. If you want to get a NaN you can easily do this: `k = try kurtosis(d) catch; NaN end`. Can you give an example of what you mean? In cases where the expected value is infinite, the correct answer is Inf, but I don't think that's what you mean.

I did, for the Cauchy distribution, define both the mean and the variance as NaN as they are - literally - undefined.  For the t distribution (TDist) the method definitions are

mean(d::TDist) = d.df > 1 ? 0. : NaN
median(d::TDist) = 0.
variance(d::TDist) = d.df > 2 ? d.df/(d.df-2) : d.df > 1 ? Inf : NaN

For the variance of the t distribution there is a distinction between the cases for which it is infinite and the cases for which it is undefined. 

By the way, I didn't define methods for skewness, kurtosis, etc., but those could definitely be added.  In your example in this thread mentioning fit(Normal, data), I think it would be better to indicate the criterion by which the parameters are to be fit, such as MLfit for maximum likelihood.

Stefan Karpinski

May 24, 2012, 3:22:30 PM
to juli...@googlegroups.com
The "fit" concept should probably be quite generic: one argument is a model, the other arguments data for the model to be fitted against.

This is the kind of thing I was talking about where multiple dispatch is superior to the single-dispatch OO "bag of named methods" model: having generic functions that live in a global-ish context forces you to think about making consistent, generic interfaces. You end up having a "fit" function that reifies the concept of fitting a model to some data, rather than having a bajillion different specific fitting functions — or methods, methods aren't any better! — each with different and incompatible ways of being called.

John Myles White

May 24, 2012, 3:27:00 PM
to juli...@googlegroups.com
That is basically the R approach, but I think Doug is right that one needs to separate out a fit function based on the model, the data and the fitting procedure. The fit of a Normal via maximum likelihood is quite different from the fit of a Normal via MAP.

One could, for example, have a type for estimation approaches in the same way there's a type for distributions.

 -- John
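John's suggestion of reifying the estimation approach as a type could look like the following self-contained toy. Every name here (Estimator, MLE, MAP, the local Normal) is a hypothetical stand-in, not a real API:

```julia
using Statistics

abstract type Estimator end
struct MLE <: Estimator end   # maximum likelihood
struct MAP <: Estimator end   # maximum a posteriori (would carry prior parameters)

# Minimal stand-in distribution type so the sketch is self-contained.
struct Normal
    mean::Float64
    stddev::Float64
end

# Multiple dispatch selects a method on both the model type and the estimator:
fit(::Type{Normal}, ::MLE, data) = Normal(mean(data), std(data; corrected=false))

d = fit(Normal, MLE(), [0.0, 2.0])
```

A `fit(Normal, MAP(prior), data)` method could then live alongside it without any naming collision, which is the point of making the estimator a type.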

Stefan Karpinski

May 24, 2012, 3:30:44 PM
to juli...@googlegroups.com
On Thu, May 24, 2012 at 2:58 PM, Douglas Bates <dmb...@gmail.com> wrote:

> I did, for the Cauchy distribution, define both the mean and the variance as NaN as they are - literally - undefined.  For the t distribution (TDist) the method definitions are
>
> mean(d::TDist) = d.df > 1 ? 0. : NaN
> median(d::TDist) = 0.
> variance(d::TDist) = d.df > 2 ? d.df/(d.df-2) : d.df > 1 ? Inf : NaN
>
> For the variance of the t distribution there is a distinction between the cases for which it is infinite and the cases for which it is undefined.

Excellent. This looks like the right thing to do — and what Jeffrey Sarnoff was asking for.
 
> By the way, I didn't define methods for skewness, kurtosis, etc. but those could definitely be added.  In your example in this thread mentioning fit(Normal, data) I think it would be better to indicate the criterion by which the parameters are to be fit, such as MLfit for maximum likelihood.

I'm sure skewness and kurtosis will get added in the fullness of time. For now, it's great to just have more common statistics available.

The fit thing obviously needs to be thought out in much greater detail since it's so incredibly important — and there are so many possible use cases. My main point was just that fitting models to data should be a generic thing with models, methods, and data all reified as types so that you can do fitting completely generically. That will make it possible and even trivial to do things like programmatically try all combinations of a set of models and methods on some data and show the results and see how each one did.