Sampling in Distributions


Simon Byrne

Feb 13, 2015, 3:11:40 PM
to juli...@googlegroups.com
One of the sticking points to getting rid of Rmath from Base is its interaction with the random number generator. The ideal solution to this is implementing the sampling algorithms in Distributions.jl. Between Dahua and myself, most of the algorithms have now been implemented: the stumbling block is actually tying it all together in an efficient manner.

As we seem to have hit a roadblock with this, I thought I would outline the main issues, in the hope that others might have some useful suggestions.

1. When sampling multiple times from the same distribution, some quantities need only be computed once. Rmath uses static variables for this (checking at each call whether the distribution parameters have changed), but this presents possible thread-safety issues. The current approach we are exploring uses immutable types to hold all the pre-computed constants, for example:
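A minimal sketch of the idea (the names, fields, and algorithm here are illustrative, in the style of a Marsaglia-Tsang sampler for shape >= 1, not the actual Distributions.jl implementation):

```julia
# Illustrative only: an immutable sampler caching the constants that a
# Marsaglia-Tsang style gamma sampler would otherwise recompute per draw.
immutable GammaMTSampler
    d::Float64      # shape - 1/3, computed once
    c::Float64      # 1/sqrt(9d), computed once
    scale::Float64
end

function GammaMTSampler(shape::Float64, scale::Float64)
    d = shape - 1.0/3.0
    GammaMTSampler(d, 1.0/sqrt(9.0*d), scale)
end

function Base.rand(s::GammaMTSampler)
    while true
        x = randn()
        v = (1.0 + s.c*x)^3
        v > 0.0 || continue
        u = rand()
        # log-acceptance test; the cached s.d and s.c are reused every draw
        if log(u) < 0.5*x*x + s.d*(1.0 - v + log(v))
            return s.d * v * s.scale
        end
    end
end
```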


The idea here being that you could use

s = GammaGDSampler(d)
X = [rand(s) for i = 1:n]

for efficiently sampling an array of values from the same distribution. Unfortunately, this turns out to be inefficient for sampling single values, as you may end up computing values on branches that aren't used. We could get around this by implementing everything twice, but hopefully there is a neater solution.

2. Typically there isn't one optimal algorithm for all parameter values, instead you need a meta-algorithm to select the appropriate algorithm for a particular value. One idea we tried was defining a function to choose the appropriate algorithm, e.g.

function sampler(d::Gamma)
    if d.shape < 1.0
        GammaIPSampler(d)
    elseif d.shape == 1.0
        Exponential(d.scale)
    else
        GammaGDSampler(d)
    end
end

and then define:

rand(d::Gamma, n...) = rand(sampler(d), n...)

The problem with this is that the return type instability of sampler imposes a significant cost on the resulting rand.

3. These problems are compounded by the fact that rand methods call other rand methods, e.g. a Beta distribution can be sampled by taking

function rand(d::Beta)
    a = rand(Gamma(d.α))
    b = rand(Gamma(d.β))
    return a/(a+b)
end

In order to do this efficiently for an array, you would want to choose the appropriate algorithm, and pre-compute the appropriate values, for each Gamma distribution.
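One way to phrase this, sketched with a hypothetical composite type: once the two gamma samplers are chosen and their constants computed, a wrapper can hold them both, with type parameters keeping the fields concrete so the inner calls dispatch statically.

```julia
# Hypothetical composite sampler: holds the two pre-built gamma samplers.
# The type parameters A and B keep the fields concretely typed.
immutable BetaSampler{A,B}
    sα::A
    sβ::B
end

# Each draw reuses the pre-computed constants inside sα and sβ.
Base.rand(s::BetaSampler) = (a = rand(s.sα); b = rand(s.sβ); a / (a + b))
```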


Hopefully that explains the main obstacles we're facing. Please let me know if you have any suggestions, or would like further explanation.

Simon

Elliot Saba

Feb 13, 2015, 3:35:57 PM
to Julia Dev
It sounds to me like the way to deal with your polyalgorithmic problem is to try to take advantage of multiple dispatch somehow.  Is it possible to, for instance, have Gamma be a parent type, and have child types for each sub-algorithm you need in your polyalgorithm?  This way, when you construct a Gamma distribution, it would return something of type GammaIP, or GammaGD, etc... Then you can have the rand() function directly call the relevant sampler function.
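Something like the following (all names are hypothetical, just to show the shape of the idea):

```julia
# Hypothetical layout: an abstract parent with one concrete subtype per
# algorithm; the choice of algorithm is made once, at construction time.
abstract AbstractGamma

immutable GammaSmall <: AbstractGamma   # shape < 1 algorithm
    shape::Float64
    scale::Float64
end

immutable GammaLarge <: AbstractGamma   # shape >= 1 algorithm
    shape::Float64
    scale::Float64
end

# a plain function (not a constructor) so it can return either subtype
makegamma(shape, scale) =
    shape < 1.0 ? GammaSmall(shape, scale) : GammaLarge(shape, scale)

# rand then dispatches directly to the relevant algorithm
Base.rand(g::GammaSmall) = 0.0   # stand-in for the shape < 1 sampler
Base.rand(g::GammaLarge) = 1.0   # stand-in for the shape >= 1 sampler
```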

Of course, if the sub-algorithm is dependent upon parameters that can change after construction of the distribution object, multiple dispatch isn't going to give us any speedup here, I don't think.
-E

Tim Holy

Feb 14, 2015, 6:56:37 AM
to juli...@googlegroups.com
These are tough problems. In another context I've started wondering whether
there might be a need to develop a package that provides a convenient hack
around type-instability. I have no idea whether this will actually be useful
in practice, but the idea might be something like the following: suppose you
have 3 different types, A, B, and C. We can create a type-stable object
container like this:

immutable ObjectContainer
    objtype::DataType
    objdata::Vector{UInt8}
end

where objdata is a wrapper around the memory associated with the object. Then
one creates a @dispatch_for macro,

@dispatch_for A B C myfunction(obj::ObjectContainer, args...)

which generates code that looks like this:

function myfunction(obj::ObjectContainer, args...)
    if obj.objtype == A
        myfunction(unsafe_pointer_to_objref(convert(Ptr{A},
            pointer(obj.objdata))), args...)
    elseif obj.objtype == B
        ...
    end
end
Basically the idea here is that you bypass generic runtime method lookup and
handle dispatch manually. Naturally, you have to write myfunction methods for
A, B, and C as usual. There's a remote chance you'd have to give these
different names (myfunction_A, myfunction_B, myfunction_C) and have the
@dispatch_for macro use these names, but I hope that wouldn't be necessary.
You'd also want to declare all of them @noinline.

There are quite a few issues to investigate: will the array-wrapper allocation
destroy any benefit? Can one realistically handle multiple dispatch, not just
single dispatch on the first argument?

--Tim

Simon Byrne

Feb 14, 2015, 7:02:41 AM
to juli...@googlegroups.com
On Friday, 13 February 2015 20:35:57 UTC, Elliot Saba wrote:
It sounds to me like the way to deal with your polyalgorithmic problem is to try to take advantage of multiple dispatch somehow.  Is it possible to, for instance, have Gamma be a parent type, and have child types for each sub-algorithm you need in your polyalgorithm?  This way, when you construct a Gamma distribution, it would return something of type GammaIP, or GammaGD, etc... Then you can have the rand() function directly call the relevant sampler function.
 
Yeah, that is kind of the pattern we're going for, but we were also using the types to hold the pre-computed constants: maybe that is the wrong idea though.

Simon Byrne

Feb 14, 2015, 7:15:21 AM
to juli...@googlegroups.com


On Saturday, 14 February 2015 11:56:37 UTC, Tim wrote:
Then one creates a @dispatch_for macro,

@dispatch_for A B C myfunction(obj::ObjectContainer, args...)

which generates code that looks like this:

function myfunction(obj::ObjectContainer, args...)
    if obj.objtype == A
        myfunction(unsafe_pointer_to_objref(convert(Ptr{A},
            pointer(obj.objdata))), args...)
    elseif obj.objtype == B
        ...
    end
end

...
 
That is an interesting approach, and one that hadn't occurred to me. It does seem like you're almost reimplementing your own dispatch system though!

In my code above, once sampler is inlined, the rand method effectively becomes

function rand(d::Gamma,n...)
    if d.shape < 1.0
        s = GammaIPSampler(d)
    elseif d.shape == 1.0
        s = Exponential(d.scale)
    else
        s = GammaGDSampler(d)
    end
    rand(s,n...)
end
 
We would avoid the type instability altogether if there were a way to lift the rand call into each branch, i.e. you want to generate the following code:

function rand(d::Gamma,n...)
    if d.shape < 1.0
        s = GammaIPSampler(d)
        rand(s,n...)
    elseif d.shape == 1.0
        s = Exponential(d.scale)
        rand(s,n...)
    else
        s = GammaGDSampler(d)
        rand(s,n...)
    end
end

This would be considerably more difficult for problem 3, where we would ideally want two nested if blocks, i.e.

function rand(d::Beta,n...)
    if d.α < 1.0
        sα = GammaIPSampler(Gamma(d.α))
        if d.β < 1.0
            sβ = GammaIPSampler(Gamma(d.β))
            rand(BetaSampler(sα, sβ),n...)
        elseif d.β == 1.0
            sβ = Exponential(1.0)
            rand(BetaSampler(sα, sβ),n...)
        else
            sβ = GammaGDSampler(Gamma(d.β))
            rand(BetaSampler(sα, sβ),n...)
        end
    elseif d.α == 1.0
        sα = Exponential(1.0)
        if d.β < 1.0
            sβ = GammaIPSampler(Gamma(d.β))
            rand(BetaSampler(sα, sβ),n...)
        elseif d.β == 1.0
            sβ = Exponential(1.0)
            rand(BetaSampler(sα, sβ),n...)
        else
            sβ = GammaGDSampler(Gamma(d.β))
            rand(BetaSampler(sα, sβ),n...)
        end
    else
        sα = GammaGDSampler(Gamma(d.α))
        if d.β < 1.0
            sβ = GammaIPSampler(Gamma(d.β))
            rand(BetaSampler(sα, sβ),n...)
        elseif d.β == 1.0
            sβ = Exponential(1.0)
            rand(BetaSampler(sα, sβ),n...)
        else
            sβ = GammaGDSampler(Gamma(d.β))
            rand(BetaSampler(sα, sβ),n...)
        end
    end
end

Now, of course I could write all this out myself, but given all the neat metaprogramming tools in Julia, we should be able to write this once.

Simon Byrne

Feb 18, 2015, 5:32:40 PM
to juli...@googlegroups.com
So after more thought, I realised one solution to problem 2 (and probably 3 as well) would be to instead make sampler use a callback, e.g.

function sampler(f,d::Gamma)
    if shape(d) < 1.0
        s = GammaIPSampler(d)
        f(s)
    elseif shape(d) == 1.0
        s = Exponential(scale(d))
        f(s)
    else
        s = GammaGDSampler(d)
        f(s)
    end
end

rand(d::Gamma,n...) = sampler(s -> rand(s,n...), d)

The only problem is that Julia loses the ability to infer the return type, so code that makes repeated calls becomes woefully inefficient, e.g.

function bar(X)
    s = 0.0
    for x in X
        s += rand(Gamma(x))
    end
    s
end

(this problem persists even if I @inline sampler and use a generic function instead of an anonymous function as the callback)

Any suggestions as to how to get around this?

-simon

Michael Francis

May 19, 2015, 11:53:58 AM
to juli...@googlegroups.com
Simon, you can make the code type stable by doing the following 

function sampler(d::Gamma)
    local ret::Sampleable
    if shape(d) < 1.0
        ret = Distributions.GammaIPSampler(d)
    elseif shape(d) == 1.0
        ret = Distributions.Exponential(d.scale)
    else
        ret = Distributions.GammaGDSampler(d)
    end
    return convert(Sampleable, ret)
end

The above forces the inferred return type to always be the abstract Sampleable (note that you need both the type declaration and the convert). The subsequent calls are also type stable. Unfortunately I don't see much in the way of performance improvement for the example given: around 0.015 (±0.005) seconds for 100000 random Gamma values, which appears to be consistent with the rcall version of rand(Gamma). Do you have an example of the more complex use case?

John Myles White

May 19, 2015, 11:55:21 AM
to juli...@googlegroups.com
I suspect type stability isn't sufficient: you also need concrete types to be inferred.

 -- John

Michael Francis

May 19, 2015, 12:14:21 PM
to juli...@googlegroups.com
This performance does seem to be consistent with the rcall version. It would be good to see a fuller example of how this is being used, as currently the time is spent constructing the immutable samplers.

Tim Holy

May 19, 2015, 2:37:36 PM
to juli...@googlegroups.com
Sorry, I just missed the rest of this thread. I think what you're asking is
doable with

if d.shape < 1.0
    s = GammaIPSampler(d)
    rand(s::GammaIPSampler{typeparams},n...)
elseif d.shape == 1.0
    s = Exponential(d.scale)
    rand(s::ExponentialScale{othertypeparams},n...)

etc.

--Tim