probability density function identification

Fern

May 21, 2013, 12:43:18 AM

Hi,

I have a question on trying to reverse engineer the probability density function from which a set of numbers were generated. My setup is the following:

1) I have two probability density functions, both with support contained in [0,1]:
a) Beta (4,2) distribution
b) Uniform (0.358060,0.975273) distribution

2) Note that the parameters of the Uniform distribution have been carefully selected so that it has the same mean and variance as the Beta distribution.

3) From each distribution we generate 50 numbers.

4) We then sum these random numbers separately (for the Beta and the Uniform), and the sums are placed as elements in two vectors (RandBeta and RandUnif).

5) We repeat steps 3-4 until the vectors RandBeta and RandUnif have 20,000 elements each.
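
The setup in steps 1-5 can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not the poster's actual code: the helper names `matched_uniform_bounds` and `simulate_sums`, the seed, and the reduced number of sums are all hypothetical choices made for the example.

```python
import math
import random

def matched_uniform_bounds(a, b):
    """Endpoints of the Uniform(lo, hi) with the same mean and variance as Beta(a, b)."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    half_width = math.sqrt(3 * var)  # Uniform(lo, hi) has variance (hi - lo)**2 / 12
    return mean - half_width, mean + half_width

def simulate_sums(draw, n_terms=50, n_sums=20_000):
    """Repeat steps 3-4: sum n_terms draws, n_sums times."""
    return [sum(draw() for _ in range(n_terms)) for _ in range(n_sums)]

rng = random.Random(12345)  # arbitrary seed (assumption), only for reproducibility
lo, hi = matched_uniform_bounds(4, 2)  # approximately (0.358060, 0.975273)

# A small n_sums keeps the demo fast; the post uses n_sums=20_000.
rand_beta = simulate_sums(lambda: rng.betavariate(4, 2), n_sums=200)
rand_unif = simulate_sums(lambda: rng.uniform(lo, hi), n_sums=200)
```

Deriving the bounds from the Beta mean and variance, rather than hard-coding them, makes the matching in step 2 explicit.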

In light of the Central Limit Theorem (which holds for sums of variates drawn from the two distributions above), my question is whether it is possible to examine the vectors RandBeta and RandUnif (without knowing which is which) and determine which was generated from the Beta pdf and which from the Uniform pdf?

Thanks!

Rich Ulrich

May 21, 2013, 1:41:23 AM

Selecting between two choices is not much reverse engineering.

The question is whether a sample of 20,000 is large enough to
detect the differences in distributions based on sampling
the averages of 50 uniforms vs. 50 beta(4,2). Testing would depend on
moments of higher order than the first and second.

The beta has some skewness, which will (slightly) be reflected in
the sampling. The variances of the variances will differ, probably
by less. I don't know which would be more, but theory can say.

Fairly straightforward theory could be used to find the *power*
of comparing samples of 20,000 -- That is, you can compute the
expected differences of the 3rd and 4th central moments, which
will not have a mean of zero.

The beta will probably have more outliers in the observed
variances. That would take a little trickier application of theory,
but I wonder if it might have enough power to detect a
difference at Ns of 20,000 where the 3rd and 4th powers do not.

--
Rich Ulrich

David Jones

May 21, 2013, 3:46:55 AM

"Rich Ulrich" wrote in message
news:i51mp8trgggnsum94...@4ax.com...

<<snip>>

It is not clear that moments would be useful. In this context the ranges of
possible values for the two averages are (0,1) and (0.358060,0.975273) ...
so that as soon as a value of the average outside the range
(0.358060,0.975273) occurs you know that the original distribution must have
been the Beta (4,2). Of course the probability of such an outcome from the
Beta (4,2) average distribution might be too small for this to have much
chance of happening within 20000 samples, but it perhaps indicates the way
to go, which to me seems to be to look at the tail behaviour.
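
As a rough check on how rare such an out-of-range average would be, the distance from the common mean to either uniform bound can be expressed in standard deviations of the 50-term average. This is only a sketch: the variable names are illustrative, and a normal-style reading of the result is unreliable this far into the tail, so it indicates an order of magnitude at best.

```python
import math

a, b, n_terms = 4, 2, 50
mean = a / (a + b)
var = a * b / ((a + b) ** 2 * (a + b + 1))
half_width = math.sqrt(3 * var)         # uniform bounds are mean +/- half_width
sd_of_average = math.sqrt(var / n_terms)

# Distance from the mean to either uniform bound, in SDs of the 50-term average.
z_bound = half_width / sd_of_average    # algebraically sqrt(3 * n_terms) = sqrt(150)
print(f"bounds lie {z_bound:.2f} SDs of the average from the mean")
```

A distance of about 12 standard deviations supports the caution that a decisive out-of-range value may be too rare to see in 20000 samples.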

For the uniform case, there are certainly analytical expressions for the
distribution function of the average. There may not be a corresponding
analytical expression for the beta distribution, but there are possibilities
of finding the distribution function numerically. If neither of these
appeals, there are still possibilities of proceeding if the OP is prepared to
generate samples known to be from one or other of the two sources. A
suggestion would be to construct appropriate log-survivor plots for the two
tails and to see how the sample version of these compares to either the
known distributions (if possible), or to repeated samplings from the two
sources. The repeated-sampling approach would at least give an idea of how
much separation of the cases there can be in a sample of 20000 values.
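
One way to build the empirical log-survivor points suggested above is sketched here. The function name `log_survivor_points` is hypothetical, the data are assumed to be continuous (no ties), and the lower tail is handled by negating the sample so that the same code applies.

```python
import math

def log_survivor_points(sample, upper=True):
    """Return (x, log S_n(x)) pairs for the empirical survivor function of one tail.

    For the upper tail, S_n(x) is the fraction of observations greater than x.
    For the lower tail the sample is negated, so the returned x values are
    the negated observations.
    """
    xs = sorted(sample) if upper else sorted(-x for x in sample)
    n = len(xs)
    # Skip the largest point, where the empirical survivor function is 0.
    return [(x, math.log((n - i) / n)) for i, x in enumerate(xs[:-1], start=1)]
```

Plotting these pairs for the observed vector alongside the same pairs from repeated simulations of each known source would show how much the tails separate at this sample size.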

David Jones

Ray Koopman

May 23, 2013, 6:59:12 PM

On May 20, 10:41 pm, Rich Ulrich <rich.ulr...@comcast.net> wrote:

The skewness of a Beta[4,2] variable is -sqrt[7/32]. Since the
skewness of the mean of n iid variables is the parent skewness /
sqrt[n], the skewness of the mean of 50 Beta[4,2]'s is
-sqrt[(7/32)/50].

For large samples from a normal population, the variance of the
sample skewness is approximately 6/n. Since both the mean of 50 betas
and the mean of 50 uniforms are approximately normal, the variance
of the difference between the two sample skewnesses with n = 2*10^4
is approximately 2*6/n = 6*10^-4, and the standardized difference between the
two skewnesses is approximately -sqrt[(7/32)/50]/sqrt[6*10^-4] = -2.7.
That ought to be a detectable difference.
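
The arithmetic above can be reproduced in a few lines of Python. This is a sketch of the calculation only: `beta_skewness` is a helper written for this example using the standard formula for the skewness of a Beta(a, b) variable, and the uniform side contributes zero skewness because it is symmetric.

```python
import math

def beta_skewness(a, b):
    """Skewness of a Beta(a, b) variable."""
    return 2 * (b - a) * math.sqrt(a + b + 1) / ((a + b + 2) * math.sqrt(a * b))

n_terms = 50       # variates per mean
n_means = 20_000   # means per sample

skew_parent = beta_skewness(4, 2)              # equals -sqrt(7/32)
skew_mean = skew_parent / math.sqrt(n_terms)   # skewness of the mean of 50
var_diff = 2 * 6 / n_means                     # two independent sample skewnesses
z = skew_mean / math.sqrt(var_diff)
print(round(z, 1))  # -2.7
```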

A quick empirical check agrees. I generated 100 pairs of samples
with n = 2*10^4. For 98 of those, the skewness of the Beta sample
was algebraically less than the skewness of the Uniform sample.
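
A check of this kind could look like the sketch below. It is not the code used for the figures above: the function names, the seed, and the moment-based skewness estimator are assumptions, and the demo call uses much smaller sizes than 100 trials of n = 2*10^4 so that it runs quickly in pure Python.

```python
import math
import random

def sample_skewness(xs):
    """Moment-based sample skewness m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def count_beta_lower(trials, n_means, n_terms=50, seed=0):
    """Count trials in which the Beta sample has the lower skewness."""
    rng = random.Random(seed)
    half = math.sqrt(3 * 8 / 252)  # matched uniform half-width; Var Beta(4,2) = 8/252
    lo, hi = 2 / 3 - half, 2 / 3 + half
    wins = 0
    for _ in range(trials):
        beta_means = [sum(rng.betavariate(4, 2) for _ in range(n_terms)) / n_terms
                      for _ in range(n_means)]
        unif_means = [sum(rng.uniform(lo, hi) for _ in range(n_terms)) / n_terms
                      for _ in range(n_means)]
        wins += sample_skewness(beta_means) < sample_skewness(unif_means)
    return wins

# Reduced sizes for a quick demo; trials=100, n_means=20_000 takes far longer.
print(count_beta_lower(trials=5, n_means=500))
```

With the reduced sizes the separation is much weaker than at n = 2*10^4, so the demo count should not be compared with the 98 out of 100 reported above.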

Rich Ulrich

May 24, 2013, 1:28:58 PM

Thanks. More comments inserted below.

On Thu, 23 May 2013 15:59:12 -0700 (PDT), Ray Koopman <koo...@sfu.ca>
wrote:

A z-test of 2.7 rejects at the 5% level, so the specified sampling
strategy has "moderate" power for a 5% test. Doubling the N would
give "good" power for a 5% test, and moderate power for a
smaller alpha.
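
To put rough numbers on "moderate" and "good", the power of a z-test can be approximated directly. The sketch assumes a two-sided 5% test and a test statistic that is approximately normal with unit variance and mean 2.7; `z_test_power` is a name chosen for this example.

```python
from statistics import NormalDist

def z_test_power(effect_z, alpha=0.05):
    """Approximate power of a two-sided z-test whose statistic has mean effect_z."""
    std = NormalDist()
    crit = std.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value.
    return std.cdf(abs(effect_z) - crit) + std.cdf(-abs(effect_z) - crit)

power_now = z_test_power(2.7)
power_doubled = z_test_power(2.7 * 2 ** 0.5)  # expected z scales with sqrt(N)
print(f"{power_now:.2f} {power_doubled:.2f}")
```

Under these assumptions the power comes out at about 0.77 for the stated sample size and about 0.97 after doubling N.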

>
>A quick empirical check agrees. I generated 100 pairs of samples
>with n = 2*10^4. For 98 of those, the skewness of the Beta sample
>was algebraically less than the skewness of the Uniform sample.

Apparently, you didn't need nearly 100 to be sure of the difference.

This indicates, I think, that this variant sampling strategy of
basing the means (100, as you used) on N=20,000 has
better power than looking at 20,000 means based on N=50.

Given "98 out of 100," I imagine that every one of the samples
based on beta had skew in the same direction, and that you have
at least moderate power for testing "non-uniform" when testing
any one of them.

--
Rich Ulrich

Ray Koopman

May 24, 2013, 4:55:05 PM

On May 24, 10:28 am, Rich Ulrich <rich.ulr...@comcast.net> wrote:
> Thanks. More comments inserted below.
>

I should have been more explicit. On each of 100 trials, I compared
the skewness of 20,000 means of 50 Betas to the skewness of 20,000
means of 50 Uniforms. On 98 of those trials, the Beta skewness was
algebraically less than the Uniform skewness. Later I repeated the
experiment (with a different random number generator) using 1000
trials. On 997 of those trials, Beta skew < Uniform skew.

>
> Given "98 out of 100," I imagine that every one of the samples
> based on beta had skew in the same direction, and that you have
> at least moderate power for testing "non-uniform" when testing
> any one of them.

I took the OP's question to be: using only two sets of 20,000
means of 50, one set from each of the two distributions, could we say
which set used which distribution? My conclusion is that we could be
reasonably confident that the set with the algebraically lower skew
used Beta.

>
> --
> Rich Ulrich

David Jones

May 24, 2013, 5:50:43 PM

"Ray Koopman" wrote in message
news:de591757-e9da-4a24...@li6g2000pbb.googlegroups.com...

I took the OP's question to be: using only two sets of 20,000
means of 50, one set from each of the two distributions, could we say
which set used which distribution? My conclusion is that we could be
reasonably confident that the set with the algebraically lower skew
used Beta.

===============================================================

I think the OP's question was:
using only one set of 20,000 means of 50, decide which of the two
distributions was the original.

David Jones

Rich Ulrich

May 24, 2013, 7:16:09 PM

On Fri, 24 May 2013 13:55:05 -0700 (PDT), Ray Koopman <koo...@sfu.ca>
wrote:

oops! I see that I wasn't being very clear-sighted here.

There is no problem in telling apart these uniform and beta
data, based on raw data points, with even a few dozen points.
And, if the problem is one of restricting the test to a test
on kurtosis -- using the raw points -- it still would take an
N of only a few hundred for very high confidence.

>
>I should have been more explicit. On each of 100 trials, I compared
>the skewness of 20,000 means of 50 Betas to the skewness of 20,000
>means of 50 Uniforms. On 98 of those trials, the Beta skewness was
>algebraically less than the Uniform skewness. Later I repeated the
>experiment (with a different random number generator) using 1000
>trials. On 997 of those trials, Beta skew < Uniform skew.

Okay. This is an illustration of the power of testing using
a "50% test": Which is larger? Also, it is confirmation of the
validity of the slight approximation that achieved a point
estimate of the expected value of 2.7 for the overall test.

The p-value of z=2.7 is 0.0069, which suggests the 95% CI
results of (0 to 3) missed tests out of 100, and about (2 to 13)
missed out of 1000.

The result of 3-in-1000, inverting, implies a point estimate for z
of 2.97; the result of 2-in-100 similarly suggests z= 2.05.

>
>>
>> Given "98 out of 100," I imagine that every one of the samples
>> based on beta had skew in the same direction, and that you have
>> at least moderate power for testing "non-uniform" when testing
>> any one of them.
>
>I took the OP's question to be: using only two sets of 20,000
>means of 50, one set from each of the two distributions, could we say
>which set used which distribution? My conclusion is that we could be
>reasonably confident that the set with the algebraically lower skew
>used Beta.

Thanks for spelling out in detail.

--
Rich Ulrich

Ray Koopman

May 24, 2013, 7:43:38 PM

OP: "... my question is whether it is possible to examine the vectors

David Jones

May 25, 2013, 5:27:01 AM

"Ray Koopman" wrote in message
news:687ffa75-cc0e-486e...@vy4g2000pbc.googlegroups.com...

====================================

I was going by the initial statement of the problem by the OP: "I have a
question on trying to reverse engineer the probability density function from
which a set of numbers were generated", which seemed to imply that there was
a single set of results (from a computer program?) and an unknown mechanism
by which they had been generated. That seems to me to be the actual dataset
to be analysed. Then, separately, there are sets of simulated data based on
known distributions ... and these can be used to help the analysis. So the
OP may have confused the question to be answered by describing an attempt to
answer it.

David Jones