statistics package


Gerard

Jul 3, 2013, 4:19:20 AM
to gonu...@googlegroups.com
This thread is for general ideas and remarks regarding the statistics package, github.com/gonum/stat.

Statistics is mostly plain textbook material, so the package is pretty straightforward. The devil is in the details: how to keep it simple (free from dependencies), and how to write the tests.

The idea is to combine the package of John Asmuth / Peter A. Cejchan (see link) with the package I translated from the GNU Scientific Library (see link). The latter is derived work and is therefore licensed under the GPL. However, it's straight textbook material, so the GPL isn't applicable. But the tests ARE.

Things to do:
- Define a vector interface that's comparable with that of the Matrix package
- Define which functions we want.
- Update the tests

Any idea or remark?


Gerard van de Schoot

Kun Li

Jul 3, 2013, 1:39:05 PM
to Gerard, gonu...@googlegroups.com
In terms of the functions, what about those functionalities mentioned in this article?


I see your package already includes some of them. 

----------------------------------------------------------------------------------
Yours Sincerely
Kun

gplus.to/kunli



--
You received this message because you are subscribed to the Google Groups "gonum-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gonum-dev+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Brendan Tracey

Jul 3, 2013, 6:06:32 PM
to gonu...@googlegroups.com, Gerard
This is more on the probability side than the stats side, but I've been thinking about how to deal with sampling algorithms, such as importance sampling, rejection sampling, Metropolis-Hastings, etc. It seems to me that Go's interfaces could be a significant strength in this area, but strong typing is a detriment. I'm probably in the minority, but many of the tools I am developing (in other languages at the moment) take an arbitrary probability distribution as an input. I haven't yet found a solution for Go that feels right. This isn't directly related to a stats package, but the stats and the prob packages should work well together.

Gerard

Jul 3, 2013, 10:56:24 PM
to gonu...@googlegroups.com, Gerard

On Thursday, July 4, 2013 12:06:32 AM UTC+2, Brendan Tracey wrote:
This is more on the probability side than the stats side, but I've been thinking about how to deal with sampling algorithms, such as importance sampling, rejection sampling, Metropolis-Hastings, etc. It seems to me that Go's interfaces could be a significant strength in this area, but strong typing is a detriment. I'm probably in the minority, but many of the tools I am developing (in other languages at the moment) take an arbitrary probability distribution as an input. I haven't yet found a solution for Go that feels right. This isn't directly related to a stats package, but the stats and the prob packages should work well together.

Brendan, can you show an example of your code where you use the probability distribution as input? I am curious. And I agree with the problems of strong typing. The problem I face is the float64 rounding issue and that's IEEE 754 related.
 

Brendan Tracey

Jul 3, 2013, 11:56:24 PM
to Gerard, gonu...@googlegroups.com
Metropolis-Hastings is a common example. There are two inputs, the probability distribution to be sampled from, and the proposal distribution.

A Bayesian Network is a directed graph where each node is a probability distribution conditioned on the parent nodes.

As an example of my work, I have been developing Stacked Monte Carlo, which reduces the estimation error from a set of samples through Machine Learning ( http://arxiv.org/pdf/1108.4879.pdf ). 

Does this answer your question? As I mentioned above, I don't have Go code available to show.

I'm not really sure what to do about it. I do think it would be a good idea to have probability distributions defined as types with methods, to allow them to be used with interfaces. This is possibly in addition to normal function calls:

func RandN(mean, std float64) float64

type Normal struct {
    mean float64
    std  float64
}

func (n Normal) Rand() float64

This allows Normal to be used as a "Rander" interface where one needs to generate random samples from a univariate probability distribution. Similar interfaces could exist, for example:

type Distribution interface {
    Rand() float64
    Pdf(float64) float64
}

or whatever it is that the current function needs (CDF, mean, mode, whatever), which can be defined by the implementer. Unfortunately, there isn't a good way to make it type agnostic. I don't know how much of a penalty reflection incurs.

Dan Kortschak

Jul 4, 2013, 12:16:00 AM
to Brendan Tracey, Gerard, gonu...@googlegroups.com
What additional complexity do you see that would enforce a need for
reflection?

Brendan Tracey

Jul 4, 2013, 12:24:27 AM
to Dan Kortschak, Gerard, gonu...@googlegroups.com
I wouldn't go so far as to say enforce a need, but it is giving me pause.

Let's assume that we need the "distribution" interface as I mentioned:

type Distribution interface {
    Pdf(float64) float64
    Rand() float64
}

The problem is that this is really a one-dimensional continuous distribution. We would need different interfaces for multivariate distributions, and for discrete vs. continuous distributions, not to mention the potential for probability distributions over other things (graphs, actions, etc.). I think this is a case where having a generic type would be really useful, so we could write

type Distribution interface {
    Rand() T
    Pdf(T) float64
}

I wouldn't say that we need reflection, but we do need to think about what we want to support and how.

Dan Kortschak

Jul 4, 2013, 12:36:10 AM
to Brendan Tracey, Gerard, gonu...@googlegroups.com
On Thu, 2013-07-04 at 00:24 -0400, Brendan Tracey wrote:
> Let's assume that we need the "distribution" interface as I mentioned:
>
> type Distribution interface{
> Pdf(float64) float64
> Rand() float64
> }
>
> The problem is that this is really a one-dimensional continuous
> distribution. We would need different interfaces for multivariate
> distributions, and for discrete vs. continuous distributions, not to
> mention the potential for probability distributions over other things
> (graphs, actions, etc.). I think this is a case where having a generic
> type would be really useful, so we could write
>
> type Distribution interface{
> Rand() T
> Pdf(T) float64
> }
>
> I wouldn't say that we need reflection, but we do need to think about
> what we want to support and how.

You can get away with one interface, two if you want to gain performance
benefits with linear distributions. I don't really think you need a
separate interface for continuous/discrete distributions.

type Multivariate interface {
    Rand() []float64
    Density([]float64) float64
}

type Linear interface {
    Rand() float64
    Density(float64) float64
}

(I don't like Pdf).

Dan



Gerard

Jul 4, 2013, 12:46:36 AM
to gonu...@googlegroups.com, Dan Kortschak
Some time ago I also translated the histogram module of the GNU Scientific Library (see godoc). I didn't promote it because of the type issues.

But the Pdf type in the package might be a starting point.

Gerard

Jul 4, 2013, 4:28:04 AM
to gonu...@googlegroups.com, Gerard

On Wednesday, July 3, 2013 7:39:05 PM UTC+2, Kun Li wrote:
In terms of the functions, what about those functionalities mentioned in this article?



Which function(s) (in what paragraph) are you interested in?

Brendan Tracey

Jul 4, 2013, 9:26:56 AM
to gonu...@googlegroups.com, Brendan Tracey, Gerard
Okay, if you think it's fine then it simplifies matters quite a bit. It's probably necessary to have both Univariate and Multivariate (my preferred choice over Linear and Multivariate); I would be annoyed if I had to recast as a float every time I called a normal univariate distribution.

I don't think I like Density. Prob(float64) float64 perhaps? The nice thing about Pdf is that it has a Cdf complement, and that it has a history as the moniker (lots of other languages have a normpdf as a function). That said, if there are better ways to do it we should do the better ways. I guess you could do Cum() or Cumul() for the cumulative distribution.

For the discrete distributions, should the code panic if it's called with an unsupported value, or should it return a valid cdf, 0 for the probability, etc.?


 

Dan Kortschak

Jul 4, 2013, 4:38:23 PM
to Brendan Tracey, gonu...@googlegroups.com, Brendan Tracey, Gerard
PDF and CDF are fine if they're widely used (but all caps, since that's the Go convention for initialisms); statisticians and mathematicians have a lot to answer for in destroying code legibility with their function naming conventions :| - it's like the non-availability of ten alphabets limited their creativity. I'd like to think we could do better.

I agree, Univariate over Linear.

Re the discrete, the actual probabilities for non-existing values are 0, so that makes sense. People can wrap with something if they need strictness.
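A minimal sketch of that behavior, using a hypothetical Bernoulli type (illustrative only, not package code): out-of-support values simply get probability 0, and the CDF stays defined for any real argument, so nothing panics:

```go
package main

import "fmt"

// Bernoulli is a hypothetical discrete distribution used to illustrate
// returning 0 probability (rather than panicking) for unsupported values.
type Bernoulli struct {
	P float64 // probability of drawing 1
}

// Prob returns the probability mass at x; values outside {0, 1}
// have probability 0, as Dan suggests.
func (b Bernoulli) Prob(x float64) float64 {
	switch x {
	case 0:
		return 1 - b.P
	case 1:
		return b.P
	}
	return 0
}

// CDF is well defined for any real argument, including non-support points.
func (b Bernoulli) CDF(x float64) float64 {
	switch {
	case x < 0:
		return 0
	case x < 1:
		return 1 - b.P
	}
	return 1
}

func main() {
	b := Bernoulli{P: 0.3}
	fmt.Println(b.Prob(0.5)) // 0: unsupported value, no panic
	fmt.Println(b.CDF(0.5))  // 0.7
}
```

Callers who want strictness can wrap this with their own range checks.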

Dan

Gerard

Jul 5, 2013, 4:51:01 AM
to gonu...@googlegroups.com, Brendan Tracey, Gerard
Maybe noise, but what if you have an Atter interface like this?

type Atter interface{

Gerard

Jul 5, 2013, 4:57:44 AM
to gonu...@googlegroups.com, Brendan Tracey, Gerard
Sorry, the tab key does not do what I expect in the browser ;-)

I meant:

type Atter interface {
  At(interface{}) At
}

Then it can also be used with PDF functions.

type PDF struct { // or whatever
}

func (p *PDF) At(pos interface{}) *PDF {
  // treat At as indexer of float64
}

And with the statistic functions:

func (p FloatSlice) At(pos interface{}) Floater {
  // treat At as an index (int)
}
Dan Kortschak

Jul 5, 2013, 5:00:04 AM
to Gerard, gonu...@googlegroups.com, Brendan Tracey, Gerard
That doesn't do what you think it does. Interface matching is textual, so none of those types implement your Atter.

Gerard

Jul 5, 2013, 5:13:14 AM
to gonu...@googlegroups.com, Gerard, Brendan Tracey
And with 

type Atter interface {
  At(interface{}) interface{}
}

And let the functions that use this interface do the type recognition, with switch p := value.(type)

Gerard

Jul 5, 2013, 5:21:32 AM
to gonu...@googlegroups.com, Gerard, Brendan Tracey
Hmm, just forget it. It's ugly. I was just thinking out loud.

Peter A. Cejchan

Jul 8, 2013, 6:44:26 AM
to gonu...@googlegroups.com, Gerard
Be careful,
Chapter 1 (Formulas For Reporting Averages) assumes a Normal distribution!
I would prefer Bayesian methods where possible, as they are more powerful (they use all the information at hand at the moment).
I would abandon CIs altogether, and use Bayesian credible intervals, for the reasons explained elsewhere.

++pac

Peter A. Cejchan

Jul 8, 2013, 6:45:24 AM
to gonu...@googlegroups.com, Gerard
sorry, I forgot the reference:
http://en.wikipedia.org/wiki/Credible_interval

Ethan Burns

Jul 8, 2013, 8:13:14 AM
to gonu...@googlegroups.com
Hi Everyone,

This was just linked on G+: https://github.com/GaryBoone/GoStats.  Has anyone looked at it yet?  If it hasn't been done already, maybe whoever is putting the gonum/stats package together should contact Gary Boone to see if he wants to contribute his regression code to gonum.  I would like to have regression for best-fit lines in Plotinum, and if Plotinum is moving to gonum, then it seems best to use a gonum package for it.


Best,
Ethan

Peter A. Cejchan

Jul 8, 2013, 8:53:31 AM
to gonu...@googlegroups.com
Looks fine to me. The person who will manage the stat pkg should invite the author for sure.
++pac

Peter A. Cejchan

Jul 15, 2013, 2:18:26 AM
to gonu...@googlegroups.com
Folks,

feel free to fork off my probab pkg and curate it, somebody.
My style is horrible, but most of the functionality has been tested.

Many thanks,
kindest regards,

Peter.

Gerard

Jul 15, 2013, 5:15:25 AM
to gonu...@googlegroups.com
Do we have a volunteer who would like to formally manage the stat pkg? I am a terrible manager, and only an autistic technician. ;)

Brendan Tracey

Jul 15, 2013, 5:19:52 PM
to gonu...@googlegroups.com
Here's a link to Peter's probability package


or the godoc link:

Peter, you mentioned on go-nuts that you had a simulated annealing code. Is that something you could link / playground?

Peter A. Cejchan

Jul 17, 2013, 3:02:36 AM
to gonu...@googlegroups.com
Of course, it is a part of this:
/code.google.com/p/go-eco/eco/ser
under rob_sa.go. There is also an ACO code under rob_fant.go (it is a 1998 version, not very sophisticated [sorry, Eric]).
As usual, the code is ugly, but it works.
Feel free to grab it and play, anybody.
Best regards, Peter.

Brendan Tracey

Aug 21, 2013, 11:11:01 PM
to gonu...@googlegroups.com
I need to write some code which can take in an arbitrary probability distribution so I started reworking Peter's code to be based on probability distribution structs with methods rather than just a list of functions.


// Generate a random variable
val := Normal{-2, 3}.Rand()

// Calculate some probabilities
dst := Normal{0,1}
prob1 := dst.Prob(0.123)
cumProb := dst.CumProb(0.123)


It's not that many more characters than Peter's code, but it's a lot more flexible because now all of the distributions implement interfaces.

In my mind, this will be one piece of the stat package. We will at least need some functions for generating random variables (Metropolis-Hastings, Latin hypercube, etc.), and for doing stats (many mentioned above, plus Sobol indices).

It's a pretty bare-bones start, but is this along the lines of what others were thinking as well?

Dan Kortschak

Aug 21, 2013, 11:18:57 PM
to Brendan Tracey, gonu...@googlegroups.com
The only suggestion I'd have at this stage is s/π/Pi/g.

Brendan Tracey

Aug 22, 2013, 4:34:19 PM
to gonu...@googlegroups.com, Brendan Tracey
I'm not sure what should be done about circular imports. Some of the probability distributions will need a weighted mean / variance (or whatever) for fitting the parameters of that distribution to data. These would ideally just be the ones in stat/stat. However, some statistical functions (Sobol indices) need to generate random values, and so stat may want to import dist. I'm not sure if the right answer is to have an "advanced stat" package for more complicated routines (or specific packages for them, an ANOVA package, etc.), or if dist should just implement its own versions of weighted mean.

Should I start moving the distributions to gonum when I think they're correct, or should I leave them on my github for the time being? Should it be distribution or dist?

Dan Kortschak

Aug 22, 2013, 11:15:49 PM
to Brendan Tracey, gonu...@googlegroups.com
On Thu, 2013-08-22 at 13:34 -0700, Brendan Tracey wrote:
> or if dist should just implement its own
> versions of weighted mean.

I'd do this.

Brendan Tracey

Aug 23, 2013, 2:04:59 AM
to Dan Kortschak, gonu...@googlegroups.com
Should I push the version to stats, or should I keep working in my own box at the moment?

Brendan Tracey

Jul 28, 2014, 11:52:56 PM
to gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
I cleaned up my univariate probability distribution package.

http://godoc.org/github.com/btracey/dist
github.com/btracey/dist

There are some common distributions missing (Bernoulli, beta, etc.) but it's good enough to go up for comments.

The "mv" subpackage does not yet work, though note that it will need to import matrix.

My comments:
1) There is the outstanding issue on the correct behavior of the Parameters functions if the data were to be reused.
2) I personally have found the DFooDBar functions useful in my work. They come up in places like machine learning, where you want an analytical derivative and you're dealing with randomly generated samples. If anyone has serious objections, they don't have to stay (embedding can be used).

Additional comments are appreciated (on a high level, detailed comments can happen during a PR).

I think at some point this package should add some interfaces, but it's probably better to wait to see what the correct method combinations are.

Request for discussion:
I'm not quite sure the best organization for the package. Scipy has a "stats" package that basically has everything else at the same namespace. I don't think this is a good fit for us. However, I don't think we want to have a bunch of statistics-related packages at the highest level of gonum. It is nice to have them all in one place. My proposal is to have a "statistics" repository in Gonum, which possibly only contains documentation. Then we would have

gonum/statistics/dist   (univariate probability distributions)
gonum/statistics/dist/mv     (multivariate probability distributions)
gonum/statistics/stat    (basic "stat" functions (mean, variance, etc.), like those in GaryBoone's GoStats)
gonum/statistics/sample      (for advanced sampling algorithms like Importance sampling, Latin Hypercube, Metropolis Hastings, etc.)

I like that this proposal puts stat and dist on the same level, they seem equally fundamental to me. We can also have a
gonum/statistics/internal
which could house the overlap code between stat and dist (this way there's only one implementation of WeightedMedian, etc.)

Brendan Tracey

Aug 4, 2014, 7:17:10 PM
to gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
I have written up an initial set of Statistics functions and will create a pull request momentarily.

I think it's a better idea to have population statistics at the core location (github.com/gonum/stat). This is where people can find mean/median/mode. Statistics functions that require the distribution package should go in their own location (student-t, etc.). I don't use these statistical tests very often, so it would be nice if someone could compile a list of what's needed.

If this PR is accepted, I'll update my dist package where appropriate (haven't thought about it yet), and then I'll make another request with the basics of the univariate distribution package.

Brendan Tracey

Aug 4, 2014, 7:19:27 PM
to gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
A copy of the PR is at github.com/btracey/stat, with the respective godoc at http://godoc.org/github.com/btracey/stat if anyone is interested in seeing it more in the flesh. Thanks for taking a look.

Cory Slep

Aug 4, 2014, 8:07:09 PM
to gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
Hey Brendan,

I think I am in a position to help out with generating random variates from distributions (analytically or numerically) and correlated multivariates. You mentioned the distribution package, but I don't see it in the list at https://github.com/gonum . Is that something already set up?

Brendan Tracey

Aug 5, 2014, 10:47:29 AM
to Gerard, gonu...@googlegroups.com, dan.ko...@adelaide.edu.au

On Aug 5, 2014, at 3:57 AM, Gerard <gvds...@gmail.com> wrote:

Not exactly related to statistics, but what about a filtering mechanism?

Suppose there is a slice of a struct person that has fields sex and age. Now you want to filter on male and age >= 18. With a filter you can generate a new slice of float64 from any slice, given a closure which returns a bool. It could be a separate package.

I don’t think this would fit in a stats package.

Perhaps I don’t understand your idea, but I don’t really see the utility. To write the “Unzip” function oneself, it would be

s := make([]float64, len(people))
for i, v := range people {
    s[i] = v.Age
}

To actually write the unzip function, you’d have to use the reflect package. This would probably be 5 times slower than writing it yourself, and the above code is compile-time safe while the reflect package implementation would not be. 
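For comparison, a reflect-based version might look like the sketch below (FieldFloats is a hypothetical helper, not a proposed API). It works for any slice of structs, but as noted above it gives up compile-time safety: a misspelled field name only fails at run time.

```go
package main

import (
	"fmt"
	"reflect"
)

// FieldFloats is a hypothetical reflect-based extractor: it pulls the
// named (exported) float64 field out of every element of a slice of
// structs. It trades compile-time safety and speed for generality.
func FieldFloats(slice interface{}, field string) []float64 {
	v := reflect.ValueOf(slice)
	out := make([]float64, v.Len())
	for i := 0; i < v.Len(); i++ {
		// FieldByName panics at run time if the field doesn't exist
		// or isn't a float64 -- the compiler can't catch it.
		out[i] = v.Index(i).FieldByName(field).Float()
	}
	return out
}

type person struct {
	Sex string
	Age float64
}

func main() {
	people := []person{{"m", 20}, {"f", 17}}
	fmt.Println(FieldFloats(people, "Age")) // [20 17]
}
```

The hand-written loop above is both faster and checked by the compiler, which is the point being made.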


Just an idea.


On Tuesday, August 5, 2014 1:19:27 AM UTC+2, Brendan Tracey wrote:
A copy of the PR is at github.com/btracey/stat, with the respective godoc at http://godoc.org/github.com/btracey/stat if anyone is interested in seeing it more in the flesh. Thanks for taking a look.

Brendan Tracey

Aug 5, 2014, 2:21:59 PM
to gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
It is not yet set up. I believe that the distribution package should be in the same git repository as the stats package.

For distributions, I have in mind:
github.com/gonum/stat/dist    // For univariate distributions  (Bernoulli, uniform, normal, etc.)
github.com/gonum/stat/dist/mv     // For multivariate distributions  (multi-variate gaussian, etc.)
github.com/gonum/stat/dist/empirical   // For distributions coming from empirical values (Kernel density, bootstrap, Chinese restaurant (?), etc.). There is probably a better name for empirical

I'd also like to see a
github.com/gonum/sample
which would contain sampling algorithms like Metropolis-Hastings, Importance Sampling, Gibbs sampling, etc.

Right now I have a PR to do the basic github.com/gonum/stat. After that, I'd like to create a PR for the beginnings of the distribution package. The code exists in github.com/btracey/dist, though some of it will need to be modified based on discussions in the stats PR.

I think the best way to contribute at the moment is to examine and comment on the stat PR, and once that's in, the beginnings of the dist package. Once we are happy with the signatures for those, the base will be established for building the next set of packages.

Cory Slep

Aug 5, 2014, 5:49:35 PM
to gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
OK. I looked over your dist package; there's quite a lot attached to that normal distribution. math/rand.NormFloat64 is using what appears to be the 128 ziggurat algorithm. I'm not sure if people will want to use an alternative such as Box-Muller, Marsaglia's, or a 256/512 ziggurat when generating normally distributed variates.

Fundamentally, I think it would be good to allow the use of different sources of random numbers when generating variates for a distribution and not marrying math/rand, since its implementation is not well documented and others have written alternative random number generators (such as Mersenne twisters). It's also needed if we're doing NORTA to approximate correlation between multiple variates in any distribution other than the normal/lognormal/uniform.

I can envision gonum/stat/mv pretty much entirely using gonum/stat for generating its multiple variates, unifying all the distribution-generating functions behind an interface is the win there (plus, since the distributions will use a source instead of marrying math.rand, NORTA-approximated correlated mv will be much simpler to implement).

Distributions I'm thinking off the top of my head:

Normal
Lognormal
Half-Normal
Uniform
Deterministic (OK, not a distribution, but included for completeness)
Shifted Weibull
Shifted Beta
Gamma
Binormal
Truncated Normal
Truncated Lognormal
Gumbel Type I
Linear
Trapezoidal
Arbitrary Discrete
Arbitrary Bounded Continuous

Brendan Tracey

Aug 5, 2014, 6:37:42 PM
to Cory Slep, gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
On Aug 5, 2014, at 2:49 PM, Cory Slep <cjs...@gmail.com> wrote:

OK. I looked over your dist package, there's quite a lot attached to that normal distribution.

Do you see this as a problem? I started with functions implemented by a previous package and added a couple I’ve needed in the past.

math/rand.NormFloat64 is using what appears to be the 128 ziggurat algorithm. I'm not sure if people will want to use an alternative such as Box-Muller, Marsaglia's, or a 256/512 ziggurat when generating normally distributed variates.

Fundamentally, I think it would be good to allow the use of different sources of random numbers when generating variates for a distribution and not marrying math/rand, since its implementation is not well documented and others have written alternative random number generators (such as Mersenne twisters).

I can understand the desire, but I also want to keep the package usable. What would you suggest as an approach? Have every distribution take in a rand.Rand (and if it’s nil call the normal rand function)?

It's also needed if we're doing NORTA to approximate correlation between multiple variates in any distribution other than the normal/lognormal/uniform.

Why? I don’t know what NORTA is, but I’ve never come across an algorithm that cared about the specific source of pseudo-random numbers. 


I can envision gonum/stat/mv pretty much entirely using gonum/stat for generating its multiple variates, unifying all the distribution-generating functions behind an interface is the win there (plus, since the distributions will use a source instead of marrying math.rand, NORTA-approximated correlated mv will be much simpler to implement).

Distributions I'm thinking off the top of my head:

Scipy has a lot of distributions http://docs.scipy.org/doc/scipy/reference/stats.html . I don’t see any particular need to limit the number we choose to have, but it’s obviously a lot of work to implement them all (including documentation, tests, etc.). I would imagine that we’ll just implement them as we see a need. Of course, if you want to implement them all I won’t stop you :).  

Cory Slep

Aug 5, 2014, 7:58:20 PM
to gonu...@googlegroups.com, cjs...@gmail.com, dan.ko...@adelaide.edu.au
I can see it being an inhibition to learning the statistics package if there's an excessive number of methods attached to a distribution. However, I don't want to lose the functionality you've already put together. For example, the function for calculating the derivative of the log of the probability probably makes more sense in the context of a lognormal distribution, and can be framed as the derivative of the lognormal distribution.

I think accepting an interface that rand.Rand satisfies would be the ticket, and, as you suggested, using math/rand if it is nil. For multivariates, though, there's some design work needed.

The reason we would want to have our distributions accept a source is because of the correlated variates problem. NORTA is the "normal to anything" algorithm where you easily generate correlated normal variates (correlation is exact), convert to uniform distribution (correlation is exact, if using Spearman/rank correlation<->Pearson correlation conversion), and then use inverse transform sampling with these exactly correlated uniform variates in order to convert them to the targeted correlated distributions of choice. This last transformation does not preserve the Pearson correlation unfortunately, but they tend to stay within 0.1.

So if our distributions use rand.Float64 anyway to do inverse transform sampling to produce results, then using a source that is spitting out correlated uniformly distributed variates will cause the distribution to spit out correlated variates.

Also, most people don't tend to worry about the quality of PRNGs anymore, but there are still people who remember getting burned by RANDU and IBM (http://en.wikipedia.org/wiki/RANDU). Maybe we can even provide Knuth's subtractive PRNG for such people. ;)

I hadn't been aware of the SciPy list of distributions, and my list was not meant to be exhaustive nor set in stone. The SciPy list is a great one to look to when implementing.

Brendan Tracey

Aug 5, 2014, 10:36:58 PM
to gonu...@googlegroups.com, cjs...@gmail.com, dan.ko...@adelaide.edu.au
I don't understand your comments about "Although for multivariates, there's some design work there". My plan is for the distributions in dist to provide random numbers according to their distribution (with Rand() float64). In dist/mv, it will likely be Rand([]float64) []float64. I don't think we want to provide an explicit multi-variate PRNG. As far as I know, none of those exist; they all generate numbers in [0,1).

I think your NORTA comments are about the PRNG samples being correlated due to a poor PRNG. Is that correct? Several people have run studies on Go's generator and found that it passes statistical tests. Given that, I think it's fine for the default behavior to be to use math/rand. As you say, some will want to provide their own source, but in my mind the bigger reason to allow a custom source is that math/rand is protected by a mutex, so if someone needs to generate a lot of random variables (relative to clock cycles) they will want different random number streams.

I had assumed the same with your list, just thought I'd point out the scipy list for comparison/inspiration.

Cory Slep

Aug 6, 2014, 1:23:25 AM
to gonu...@googlegroups.com, cjs...@gmail.com, dan.ko...@adelaide.edu.au
Hey Brendan,

Sorry for confusing you. I hope this will clear up the ideas.

PRNGs and NORTA are completely independent. PRNGs just generate that random stream of bytes, as I'm sure you know. You are correct that most wind up providing those bytes as a uniform distribution over [0,1). Right there is where people could want to specify their own PRNG (I assume as a rand.Source?), adding a degree of freedom we could design the API for. Or we could marry math/rand directly and not allow people to use their own PRNG. Note that if you look at the rand.Float64 documentation, it is not theoretically perfectly distributed in [0,1) (due to a bug in the initial shipment of Go 1, see http://golang.org/src/pkg/math/rand/rand.go?s=3018:3050#L94), so some people may want to be able to use their own PRNG. This is opportunity 1 for designing the distribution API.

So with these [0,1) numbers coming from a PRNG, the next step is to convert them from a uniform distribution to something more useful. The most useful is the normal distribution for a variety of reasons, one of which I'll come back to. Luckily, the math/rand package already has a NormFloat64 function. It uses the [0,1) distribution to generate normally distributed variates using a ziggurat-128 algorithm. There are other algorithms that are less efficient, but people may want to use those instead. Additionally, they may want to use custom distributions with our library, so that suggests using a Go interface for distributions. For arbitrary distributions, the inverse transform method is an option for turning [0,1) draws into samples from the PDF. Again, that is a detail that could be hidden behind an interface. This is opportunity 2 for designing the distribution API.

Finally, these single variate solution pathways are fine. NORTA really comes into play with correlated multivariate distributions, which I will get back to in a second. I see you are suggesting the function:
      Rand([]float64) []float64
I am not sure if this is for every distribution struct or a free-floating function. I'll assume (uh oh) you mean to have it be
      func (s SomeDistribution) Rand([]float64) []float64
I am also not sure what the []float64 parameter is for. The way I am (probably mistakenly) interpreting it is "give SomeDistribution a list of [0,1) for each variate, get back the PDF for each variate". In this case, it would be more difficult to use when the multivariates are sampled from different distributions, such as:
      multiVars := []float64{0.2, 0.5, 0.44}
      pdf1 := distX.Rand([]float64{multiVars[0]})[0]
      pdf2 := distY.Rand([]float64{multiVars[1]})[0]
      pdf3 := distZ.Rand([]float64{multiVars[2]})[0]
I'm pretty sure I am misinterpreting something here, so I'll leave that for you to help set me straight. Back to NORTA. When generating correlated multivariates, statistically independent variates are generated from the normal distribution (typically), and then a Cholesky decomposition is applied to correlate them. This provides correlated, normally distributed variates. However, the masses may demand correlated variates between, say, a gamma and a linear distribution. NORTA is the process of taking these correlated normally distributed variates (with an extra step before the Cholesky decomposition), running them through the normal CDF to get back [0,1) values, which are correlated this time, and then using these [0,1) values to inverse transform sample back to whatever distributions the users may want. This is opportunity 3 in designing the API.

So putting these three opportunities together, what would an example look like?

     type RandInterface interface {
          Float64() float64 // Opportunity 1: Allow use of custom PRNGs
     }

     type DistX struct {
          r RandInterface
          // other specifics for dist X
     }

     func NewDistXSource(r RandInterface) *DistX {
          return &DistX{r, /* etc */} // Opportunity 1: Use custom PRNG
     }

     func NewDistX(seed int64) *DistX {
          return &DistX{rand.New(rand.NewSource(seed)), /* etc */} // Opportunity 1: Use std lib
     }

     // Repeat above declarations for DistY, DistZ

     type Distribution interface {
          Rand() float64 // Opportunity 2: Let client distributions figure out how to generate variate
     }

     func (x *DistX) Rand() float64 {
          // Use x.r to get [0,1) and generate variate, Opportunity 2 says "I don't care how" here.
     }

     // Repeat above func declaration for DistY, DistZ

     // Above code can be in dists. Below code in mv.

     // Opportunity 3: Could put this function type under the Distribution interface mentioned above, hence "room for design"
     type InverseTransformFn func(float64) float64

     // Can use our own matrix package as arg below.
     // Opportunity 3: Maybe simpler correlated multivariate function?
     func CorrelatedMultivariates(corrMatrix [][]float64, dists []InverseTransformFn) (results []float64) {
          // Step 1: Adjust Pearson -> Spearman correlations using a Hotelling paper, create var normalDist as normal distribution
          // Step 2: Generate len(dists) std normally distributed variates
          temp := make([]float64, len(dists))
          for i := range dists {
               temp[i] = normalDist.Rand()
          }
          // Step 3: Cholesky decomposition to correlate the temp[i] variates
          // Step 4: Convert to targeted correlated distributions. Correlations no longer exact, only approximate
          for i, dist := range dists {
               results = append(results, dist(temp[i]))
          }
          return results
     }

     // Opportunity 3: Leverage Go-idiom of reusing code for stat. independent multivariates
     func Multivariates(dists []InverseTransformFn) []float64 {
          // Step 1: Create corrMatrix as the identity matrix (zero off-diagonal, ones on the diagonal)
          return CorrelatedMultivariates(corrMatrix, dists)
     }

In the code snippet above, the first opportunity (PRNGs) is independent of the other two. Opportunities 2 and 3, if designed properly, can leverage each other's interfaces if so desired, or stay completely separate. Really, the multivariate code at this point is just the two functions Multivariates and CorrelatedMultivariates, because they leverage the single-variate algorithms, so less code is needed.

Let me know if I am talking rubbish at this point.

Cory Slep

unread,
Aug 6, 2014, 1:32:34 AM8/6/14
to gonu...@googlegroups.com, cjs...@gmail.com, dan.ko...@adelaide.edu.au
Apologies for the addendum. I want to correct a couple of things I typed incorrectly:

Corrected code snippet for my assumption about the multivariate Rand signature:
      multiVars := []float64{0.2, 0.5, 0.44}
      pdf1 := distX.Rand([]float64{multiVars[0]})[0]
      pdf2 := distY.Rand([]float64{multiVars[1]})[0]
      pdf3 := distZ.Rand([]float64{multiVars[2]})[0]

"The most useful is the normal distribution for a variety of reasons, one of which I'll come back to." - I never did: The normal distribution is useful for correlated variates because it is relatively well-known how to convert a Pearson correlation to Spearman, which is necessary when transforming the correlated normal variates to be correlated unifom variates...

...which is a step I omitted from my final code sample! Corrected is below:
     func CorrelatedMultivariates(corrMatrix [][]float64, dists []InverseTransformFn) (results []float64) {
          // Step 1: Adjust Pearson -> Spearman correlations using a Hotelling paper, create var normalDist as normal distribution
          // Step 2: Generate len(dists) std normally distributed variates
          temp := make([]float64, len(dists))
          for i := range dists {
               temp[i] = normalDist.Rand()
          }
          // Step 3: Cholesky decomposition to correlate the temp[i] variates
          // Step 3.5: Use CDF of normal to convert correlated temp[i] normally distributed variates to correlated uniformly distributed variates.
          // Step 4: Convert temp[i] to targeted correlated distributions using inverse transform. Correlations no longer exact, only approximate
          return results
     }

Brendan Tracey

unread,
Aug 6, 2014, 2:01:36 AM8/6/14
to Cory Slep, gonu...@googlegroups.com, dan.ko...@adelaide.edu.au
I think the proper way to include custom PRNGs is as follows

type Normal struct {
    Mu float64
    Sigma float64
    Source rand.Source
}

func (n Normal) Rand() float64 {
    if n.Source == nil {
        return rand.NormFloat64()*n.Sigma + n.Mu
    }
    // rand.Source only exposes Int63 and Seed, so wrap it to get NormFloat64.
    return rand.New(n.Source).NormFloat64()*n.Sigma + n.Mu
}

Other distributions can also have Source as a field. The actual source implementation can come from wherever. It should not be in this package. It also needs to be in each distribution as opposed to the package level because of concurrency issues.

We would also have interfaces like

type Rander interface {
    Rand() float64
}

that can be used, for example, in the sampling functions.

I had written the MV Rander interface as 

type Rander interface {
    Rand([]float64) []float64
}

so that users can re-use memory if they would like 

Rand(nil) would allocate new memory; Rand(s) would fill s in place.

To sample multivariates from different distributions, one would define

type IndSampler []dist.Rander

func (i IndSampler) Rand(s []float64) []float64 {
    if s == nil {
        s = make([]float64, len(i))
    }
    if len(s) != len(i) {
        panic("mv: slice length mismatch")
    }
    for j, dist := range i {
        s[j] = dist.Rand()
    }
    return s
}

The user could construct this with
myMv := mv.IndSampler{Normal{Mu: 0, Sigma: 1}, Uniform{Min: -2, Max: 3}, Exponential{Rate: 6}}

Does that make sense?
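A self-contained runnable version of that sketch, with a stand-in Rander and Uniform in place of the real dist package types (these local definitions are my own illustration):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Rander is the single-variate interface from the discussion.
type Rander interface {
	Rand() float64
}

// Uniform on [Min,Max); a stand-in for a real dist-package type.
type Uniform struct{ Min, Max float64 }

func (u Uniform) Rand() float64 { return u.Min + rand.Float64()*(u.Max-u.Min) }

// IndSampler draws each coordinate from its own distribution.
type IndSampler []Rander

func (ind IndSampler) Rand(s []float64) []float64 {
	if s == nil {
		s = make([]float64, len(ind))
	}
	if len(s) != len(ind) {
		panic("mv: slice length mismatch")
	}
	for j, d := range ind {
		s[j] = d.Rand()
	}
	return s
}

func main() {
	mv := IndSampler{Uniform{-2, 3}, Uniform{0, 1}, Uniform{5, 6}}
	x := mv.Rand(nil) // nil allocates
	buf := make([]float64, 3)
	mv.Rand(buf) // non-nil fills buf in place, no new allocation
	fmt.Println(len(x), x[0] >= -2 && x[0] < 3) // prints "3 true"
}
```

The nil/non-nil convention is what lets a tight sampling loop reuse one buffer for millions of draws.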

Dan Kortschak

unread,
Aug 6, 2014, 2:11:50 AM8/6/14
to Brendan Tracey, Cory Slep, gonu...@googlegroups.com

Sebastien Binet

unread,
Aug 6, 2014, 3:56:58 AM8/6/14
to Brendan Tracey, Cory Slep, gonu...@googlegroups.com, Dan Kortschak
On Wed, Aug 6, 2014 at 8:01 AM, Brendan Tracey <tracey....@gmail.com> wrote:
> I think the proper way to include custom PRNGs is as follows
>
> type Normal struct {
> Mu float64
> Sigma float64
> Source rand.Source
> }
>
> func (n Normal) Rand() float64 {
> if n.Source == nil {
> return rand.NormFloat64() * n.Sigma + n.Mu
> }
> return n.Source.NormFloat64() * n.Sigma + n.Mu
> }

perhaps just a matter of cosmetics, but I would err on this kind of
implementation instead:
https://github.com/go-hep/random/blob/master/rand.go#L18

// Dist is a distribution function
type Dist func() float64

// Gauss returns a normally distributed float64 in the range
// [-math.MaxFloat64, +math.MaxFloat64] with standard normal distribution of
// mean=mean and stddev=stddev.
func Gauss(mean, stddev float64, src *rand.Source) Dist {
    fct := func() float64 {
        return rand.NormFloat64()*stddev + mean
    }
    if src != nil {
        r := rand.New(*src)
        fct = func() float64 {
            return r.NormFloat64()*stddev + mean
        }
    }
    return Dist(fct)
}

you could even have "my" random.Dist implement "your" Rander:

func (fct Dist) Rand() float64 {
    return fct()
}

what do you think?
the advantage of such a design is less branch-y code.
the cons is that you can't easily inspect a Dist value for its input
parameters (mean and standard deviation).
(as a middle-ground solution, you could have "my" random.Dist be a
field of "your" Normal struct and loose the 'Source' field)

-s

Brendan Tracey

unread,
Aug 6, 2014, 10:13:50 AM8/6/14
to Sebastien Binet, Cory Slep, gonu...@googlegroups.com, Dan Kortschak
It seems like the advantage is less branchy code / more functional, but the downside is that there are many conceptual things attached to a distribution (CDF, Prob, Quantile, Entropy, etc.), which really feel like they should be methods of a type. Additionally, your function only satisfies the Rander interface. By having a struct with a bunch of methods, the distributions satisfy a lot of interfaces. There are a lot of algorithms for which it's nice to be able to swap distributions.

One example:

type FitLogProber interface {
    Fit(s []float64)
    LogProb(x float64) float64
}


// Find the most accurate distribution for the data
data := myloader.Load(reader)
dists := []FitLogProber{&Normal{}, &Laplace{}, &Weibull{}}
quality := make([]float64, len(dists))
for j, dist := range dists {
    // fit on the first n samples
    dist.Fit(data[:n])
    // predict on the remaining samples
    for i := n; i < len(data); i++ {
        quality[j] += dist.LogProb(data[i])
    }
}
idx := floats.MaxIdx(quality)
fmt.Printf("The most accurate distribution is %d\n", idx)

Cory Slep

unread,
Aug 6, 2014, 10:43:23 AM8/6/14
to Brendan Tracey, Sebastien Binet, gonu...@googlegroups.com, Dan Kortschak
I am partial towards having a struct type for distributions. Satisfying many interfaces is a boon, but the danger is bloating them with so many functions that discovery and ease of use are damaged. As long as each addition is examined, hopefully we can prevent that.

Using the paradigm I listed before removes the branchiness of rand/Source code by putting the source determination in the constructor, and still uses structs.

I think we are mostly on the same page, except possibly when generating multivariates (especially multiple samples quickly, don't want to Cholesky decompose over and over again). Perhaps a correlated uniform distribution type will save the extra computational time there.

Brendan, I think what you posted before is very similar to the code snippet I had. I am just not used to the "reusing memory" part and the signature looks ugly because of it. I was thinking of splitting the signature into:
Rand() []float64 // New slice, calls below fn to not duplicate effort
Rand([]float64) // Reuse slice

What are your thoughts?



Brendan Tracey

unread,
Aug 6, 2014, 11:30:12 AM8/6/14
to Cory Slep, Sebastien Binet, gonu...@googlegroups.com, Dan Kortschak
On Aug 6, 2014, at 7:43 AM, Cory Slep <cjs...@gmail.com> wrote:

> I am partial towards having a struct type for distributions. Satisfying many interfaces is a boon, but the danger is bloating them with so many functions that discovery and ease of use are damaged. As long as each addition is examined, hopefully we can prevent that.

> Using the paradigm I listed before removes the branchiness of rand/Source code by putting the source determination in the constructor, and still uses structs.

Idiomatic Go only has constructors when work needs to be done at creation time. See, for example, os/exec's Cmd type. In our case, the default behavior is reasonable if dist.Source == nil. It doesn't need to be provided to have reasonable behavior. There doesn't need to be a constructor for most of the univariate distributions.


> I think we are mostly on the same page, except possibly when generating multivariates (especially multiple samples quickly, don't want to Cholesky decompose over and over again). Perhaps a correlated uniform distribution type will save the extra computational time there.

Multivariate distributions are another matter. If you look at github.com/btracey/dist/mv, I have a constructor for the MV gaussian. This is so the Cholesky decomposition can be computed once at creation, and then cached for further use.


Brendan, i think what you posted before is very similar to the code snippet I had. I am just not used to the "reusing memory" part and the signature looks ugly because of it. I was thinking of splitting the signature into:
Rand() []float64 // New slice, calls below fn to not duplicate effort
Rand([]float64) // Reuse slice

This is not possible because Go does not allow method overloading. There can be at most one signature for Rand. One of the goals of the gonum project is to support efficient numerical computation. A necessary action to meet this goal is to provide building blocks that do not inhibit efficiency. It is very frequent for numerical libraries to generate millions of random samples, which can put a lot of pressure on the GC. Enabling memory reuse allows this pressure to be avoided. I understand that at first blush the signature is confusing, but we are consistent about it throughout gonum. mat64, floats, and stat all have this style of function where appropriate.

Cory Slep

unread,
Aug 6, 2014, 11:37:15 AM8/6/14
to Brendan Tracey, Sebastien Binet, gonu...@googlegroups.com, Dan Kortschak
Thanks for setting me straight on the practices. I didn't mean to overload the Rand function, but keeping the memory usage consistent across packages is nice, plus I can see how using one signature in this case is nicer than two.

Since you already have mv and dist up, is there anything you need help with in either?

Brendan Tracey

unread,
Aug 6, 2014, 12:24:05 PM8/6/14
to Cory Slep, Sebastien Binet, gonu...@googlegroups.com, Dan Kortschak
On Aug 6, 2014, at 8:37 AM, Cory Slep <cjs...@gmail.com> wrote:

> Thanks for setting me straight on the practices. I didn't mean to overload the Rand function, but keeping the memory usage consistent across packages is nice, plus I can see how using one signature in this case is nicer than two.

> Since you already have mv and dist up, is there anything you need help with in either?

I think at the moment it's on me to incorporate the changes that happened in Stat, plus incorporating your suggestion about adding a custom Source. Once that's done, it'll need a code review, and after that there is a lot more to put in the package (my current list of distributions is much shorter than your list, let alone the scipy list).

Thanks for the comments and discussion. It’s healthy.

Brendan

Brendan Tracey

unread,
Aug 12, 2014, 6:25:21 PM8/12/14
to gonu...@googlegroups.com, cjs...@gmail.com, seb....@gmail.com, dan.ko...@adelaide.edu.au
For those that haven't starred the github repository, I submitted a pull request for an initial commit for the distribution package. It contains Exponential, Uniform, Laplace and Normal. It needs a lot more to be fully featured, but it will set the stage for how the rest of the distributions will look. Comments appreciated.