The structure of the ProbabilityDistribution class

Rolf

Jun 23, 2011, 8:50:21 AM
to sage-devel
This is a more general issue to discuss: how to structure the
ProbabilityDistribution class.
Currently the ProbabilityDistribution class
(http://www.sagemath.org/doc/reference/sage/gsl/probability_distribution.html)
has the following subclasses.

RealDistribution: various real-valued probability distributions.
SphericalDistribution: uniformly distributed points on the surface
of an $n-1$ sphere in $n$ dimensional euclidean space.
GeneralDiscreteDistribution: user-defined discrete distributions.

I feel this is a bit confusing. RealDistribution at the moment does
not include discrete distributions, while GeneralDiscreteDistribution
only handles user-defined discrete distributions.

The term RealDistribution is confusing by itself. There is no "unreal"
distribution or anything similar. Sage users would probably use
Wikipedia for reference, and there is no such term on Wikipedia either.

Next, the term GeneralDiscreteDistribution is confusing too, as there
is no specific discrete distribution implemented in Sage, and as the
documentation says, this class only holds user-defined distributions.
Specific discrete distributions, in my understanding, would be
binomial, Pascal and the like, which are not implemented in Sage yet
and which on my system I have included in RealDistribution.

I propose the following ways of structuring distribution classes in
Sage:

1. The base class ProbabilityDistribution, as it exists now, will have
two subclasses: (1) ExplicitDistribution holds all the currently
available distributions, continuous or discrete. (2) The second
subclass, UserDefinedDistribution, holds all user-defined
distributions, which can be continuous (a function P(x), x a continuous
variable) or discrete (a set of P(x), x taking discrete values). The
user-defined continuous probability function is not yet implemented in
Sage.

2. Alternatively, the base class ProbabilityDistribution, as it
exists now, will have two subclasses: (1) ContinuousDistribution and
(2) DiscreteDistribution, and both of them will also implement
user-defined probability functions.

3. Alternatively, each possibility receives its own class:
ContinuousExplicit, ContinuousUserdefined, DiscreteExplicit,
DiscreteUserdefined. In this case ContinuousExplicit is what
RealDistribution stands for now, and DiscreteUserdefined is currently
GeneralDiscreteDistribution.

4. Finally, there will be only one class. The base class will be
extended to hold all types of probability distributions, and user
defined probability distributions are initialized by the keyword
'user'. For instance T(user defined and discrete) =
ProbabilityDistribution('user', [list]) or T(user defined and
continuous) = ProbabilityDistribution('user', [function])
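
As a rough illustration of option 2 above, here is a minimal sketch in
plain Python. None of this is Sage's actual code: the class bodies and
the helper functions gaussian() and binomial() are hypothetical, and the
point is only that "explicit" distributions and user-defined ones can
share the same two subclasses.

```python
import math
from abc import ABC, abstractmethod


class ProbabilityDistribution(ABC):
    """Hypothetical base class (not Sage's actual implementation)."""

    @abstractmethod
    def probability(self, x):
        """Return the density (continuous) or the mass (discrete) at x."""


class ContinuousDistribution(ProbabilityDistribution):
    """Continuous distribution given by any density function."""

    def __init__(self, density):
        self._density = density

    def probability(self, x):
        return self._density(x)


class DiscreteDistribution(ProbabilityDistribution):
    """Discrete distribution given by a {value: weight} mapping."""

    def __init__(self, weights):
        total = sum(weights.values())
        if total <= 0:
            raise ValueError("weights must have a positive sum")
        # Normalize so that the masses sum to 1.
        self._pmf = {value: w / total for value, w in weights.items()}

    def probability(self, x):
        return self._pmf.get(x, 0.0)


def gaussian(mean=0.0, sigma=1.0):
    """Hypothetical 'explicit' continuous distribution."""
    def density(x):
        z = (x - mean) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return ContinuousDistribution(density)


def binomial(n, p):
    """Hypothetical 'explicit' discrete distribution."""
    return DiscreteDistribution(
        {k: math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}
    )
```

A user-defined distribution then needs no class of its own, e.g.
ContinuousDistribution(lambda x: 1.0 if 0 <= x <= 1 else 0.0) or
DiscreteDistribution({"a": 1, "b": 3}).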

I hope my considerations will trigger some discussion about
restructuring the ProbabilityDistribution class. Alternatively, one
could think of dropping this class altogether, as the functionality is
implemented in NumPy and R, both more or less available from Sage.

kcrisman

Jun 23, 2011, 10:14:31 AM
to sage-devel


> This is a more general issue to be discussed how to structure
> ProbabilityDistribution class.
> Currently ProbabilityDistribution class (http://www.sagemath.org/doc/
> reference/sage/gsl/probability_distribution.html) includes the
> following sub classes.
>
>     RealDistribution: various real-valued probability distributions.
>     SphericalDistribution: uniformly distributed points on the surface
> of an $n-1$ sphere in $n$ dimensional euclidean space.
>     GeneralDiscreteDistribution: user-defined discrete distributions.
>
> I feel this is a bit confusing. RealDistribution to the moment does
> not include discrete distributions while GeneralDiscreteDistribution
> only handles user defined discrete distributions.
>
> The term  RealDistribution is confusing by itself. There is no unreal
> distribution or something similar to that. Sage users probably would
> use Wikipedia for reference and there is no such term in Wikipedia too

Perhaps 'continuous' would have been better, but that's how it is.

+- Josh Kantor (2007-02): first version
+
+- William Stein (2007-02): rewrite of docs, conventions, etc.
+
+- Carlo Hamalainen (2008-08): full doctest coverage, more documentation,
+  GeneralDiscreteDistribution, misc fixes.
+

So as you can see, this dates to a very early time in Sage - just when
the notebook was coming out, in fact. The refactoring in 2008 was
very helpful.

> I propose the following ways of structuring distribution classes in
> Sage:
>
> 2. Alternatively,  the base class ProbabilityDistribution, as it
> exists now, will have two sub classes: (1) ContinuousDistribution and
> (2) DiscreteDistribution, and both of them will implement user defined
> probability functions too.

I like this best - keep the mathematically distinct things separate,
but subclass rather than make many separate classes. Unless you are
going to implement general measures as well...

However, for backward compatibility, we will need to keep
RealDistribution etc. And I do not think this is all that
confusing.

More annoying is that all of this is in the gsl/ directory. The state
of probability/ is a little confusing (needs major doctests, but it
was hard to figure out exactly what was intended sometimes). Really,
these functions should live there, or at least have wrappers there.
Note that discrete random variables are already there! And that
probability_space is one of the methods on these random variables...
A job to untangle, to be sure, but a very worthy one.

> I hope my considerations will trigger some discussion about
> restructuring the ProbabilityDistribution class. Alternatively one may
> think of omitting this class at all as the functionality is
> implemented in NumPy and R more or less available from Sage.

No, this is bad (assuming someone like you is interested in working on
it). One of the main goals of Sage is to be a one-stop shop for
mathematics (in the same way that Mma, Maple, etc. are). So we want
unified syntax, and an easy way to get things. What good is it if
something is in R if it's easier just to use straight R for it?

So I encourage you to keep it up!

- kcrisman

Rob Beezer

Jun 23, 2011, 2:20:06 PM
to sage-devel
On Jun 23, 5:50 am, Rolf <kamha...@googlemail.com> wrote:
> This is a more general issue to be discussed how to structure
> ProbabilityDistribution class.

+1 to some serious cleanup in this area. This would be very welcome.

I like option (2), though once you get into it, maybe (3) will have
greater appeal as you actually write some code.

As KDC has pointed out, it would be nice to have a consistent,
Pythonic, easy-to-understand interface to these distributions in Sage,
whether or not we get the actual computations from GSL, NumPy or R, so
I would think hard about a design that would make it easy to use one
of these (or a combination of all three) in the background.

One nit: a student of mine used a distribution backed by GSL to create
random matrices. It was very hard to doctest, since we could not
specify a seed in a predictable way (that we could see). So if you see
a way to have the mechanisms for "predictable randomness" in the
doctest structure when using these distributions, that would be a big
bonus.
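
To make the "predictable randomness" wish concrete, here is a minimal
sketch in plain Python rather than Sage or GSL. The class SeededGaussian
and the helper random_matrix() are hypothetical, and the method names
are only illustrative; the idea is that each distribution owns its own
generator, so a doctest can fix the seed and get the same matrix every
time.

```python
import random


class SeededGaussian:
    """Hypothetical Gaussian sampler that owns its random generator."""

    def __init__(self, mean=0.0, sigma=1.0, seed=None):
        self.mean = mean
        self.sigma = sigma
        # A private generator: seeding it does not touch global state.
        self._rng = random.Random(seed)

    def set_seed(self, seed):
        """Restart the sample stream from a known seed."""
        self._rng.seed(seed)

    def get_random_element(self):
        return self._rng.gauss(self.mean, self.sigma)


def random_matrix(dist, nrows, ncols):
    """Fill an nrows x ncols nested list with samples from dist."""
    return [
        [dist.get_random_element() for _ in range(ncols)]
        for _ in range(nrows)
    ]
```

With this design, random_matrix(SeededGaussian(seed=42), 2, 3) gives the
same entries on every run, which is what a doctest needs.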

Thanks,
Rob


Robert Dodier

Jun 24, 2011, 2:39:56 AM
to sage-devel
For what it's worth, bear in mind that the distinction of
continuous vs discrete distributions isn't mathematically
fundamental; it's easy to come up with distributions which
are neither continuous nor discrete. E.g. mixtures of
discrete and continuous distributions, and distributions on
"exotic" sets such as a Cantor set or other fractal.

When I put together a class hierarchy (in Java) for
probability distributions some years ago, the fundamental
type was for a conditional distribution, of which
unconditional distribution was a subclass, and familiar
types such as Gaussian, uniform, etc. were subclasses
of the unconditional distribution type. There was a
useful-but-not-exactly-fundamental category, expressed
as an interface implemented by some distributions, namely
the location & scale category (e.g. Gaussian).

Whether a distribution is continuous or discrete is a
property of its support, and it mostly mattered (in that
Java project) because that determined the method of
integrating with respect to the distribution: discrete =>
summation, continuous => ordinary integral. (It seems that
sufficient cleverness could merge the two; you just need the
machinery for a more general integral.)
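
A minimal Python sketch of that point (this is not RISO's or Sage's
code; the class names FiniteDistribution and DensityDistribution are
made up for illustration): both kinds of distribution expose the same
expectation() method, and only the integration method behind it differs.
The continuous case uses a simple midpoint rule on a bounded interval,
which is an assumption made here to keep the example short.

```python
class FiniteDistribution:
    """Hypothetical discrete distribution given as {value: probability}."""

    def __init__(self, pmf):
        self.pmf = dict(pmf)

    def expectation(self, f):
        # Discrete support: integrate by summing f(x) * P(x).
        return sum(f(x) * p for x, p in self.pmf.items())


class DensityDistribution:
    """Hypothetical continuous distribution with a density on [lower, upper]."""

    def __init__(self, density, lower, upper):
        self.density = density
        self.lower = lower
        self.upper = upper

    def expectation(self, f, steps=10000):
        # Continuous support: approximate the ordinary integral of
        # f(x) * density(x) with the midpoint rule.
        h = (self.upper - self.lower) / steps
        total = 0.0
        for i in range(steps):
            x = self.lower + (i + 0.5) * h
            total += f(x) * self.density(x)
        return total * h
```

For a fair die, FiniteDistribution({k: 1/6 for k in range(1, 7)})
.expectation(lambda x: x) sums to about 3.5, while the uniform
distribution on [0, 1], DensityDistribution(lambda x: 1.0, 0.0, 1.0),
gives about 0.5 for the same call.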

Aside from that, user defined vs built-in seems incidental;
I guess I wouldn't recommend enshrining it in the type
hierarchy. If the user wants to define a new distribution,
they just create a new class, right?

I suggest that you try to roll in the symbolic stuff
for each distribution as you go; e.g. the function to
compute the density returns a symbolic expression if it
doesn't evaluate to a number.

FTR the stuff I worked on in Java is called RISO.
http://riso.sourceforge.net
It is essentially a purpose-built quasi-symbolic system
to compute integrals for Bayesian inference.

FWIW & all the best.

Robert Dodier

kcrisman

Jun 24, 2011, 7:30:53 AM
to sage-devel


On Jun 24, 2:39 am, Robert Dodier <robert.dod...@gmail.com> wrote:
> For what it's worth, bear in mind that the distinction of
> continuous vs discrete distributions isn't mathematically
> fundamental; it's easy to come up with distributions which
> are neither continuous nor discrete. E.g. mixtures of
> discrete and continuous distributions, and distributions on
> "exotic" sets such as a Cantor set or other fractal.

Well, this is what I meant by measures. I guess it would be easy to
have an "other" category at a later date. But the Haar measure on the
real line is a nice object, and discrete has a good definition too.
Having this distinction would at least be helpful in starting out with
such classes.

> I suggest that as you try to roll in the symbolic stuff
> for each distribution; e.g. the function to compute the
> density returns a symbolic expression if it doesn't
> evaluate to a number.

That is a very good idea. We even have (or should have soon) a
symbolic erf for the normal distribution, to start things off...

- kcrisman

Rolf

Jul 5, 2011, 2:55:25 AM
to sage-devel
I just want to announce that I have added several new probability
distributions.
The patch is available at #11572.
I also made small changes to the error reporting system, as suggested
in ticket #11514,
and changed the graphical presentation of discrete probability
distributions.

No, I didn't change the class structure yet. All the built-in
distributions are implemented in the RealDistribution class,
which makes the help system (command RealDistribution?) difficult to
read.

There are a lot of things GSL still has to offer :-)