Arbitrary.arbLong doesn't have good random distribution

67 views
Skip to first unread message

Ryan King

unread,
Sep 28, 2011, 3:30:29 PM9/28/11
to scalacheck
I keep seeing my checks fail due to to many tests being discarded. I
use arbitrary[Long] with a uniqueness filter heavily for my
checks[1], so I looked into whether it was generating random numbers.
It isn't doing that well.

I generated 100,000 random longs from this [2] and here's the top 10
results (output from uniq -c):

10139 0
10111 -4611686018427387904
10006 1
9982 -1
9935 4611686018427387903
1 999408985813006357
1 999404514076553559
1 999243136190674913
1 999141327322374406
1 998717611447317269

Is there something I can tune to get better random numbers?

-ryan

1. class UniqueFilter[T] extends Function1[T, Boolean] {
val values = mutable.Set[T]()
def apply(t: T) = {
val rv = values.contains(t)
values += t
!rv
}
}

arbitrary[Long].filter(new UniqueLong())

2. for (i <- 0.until(100000)) {println(arbLong.arbitrary.sample.get)}

Rickard Nilsson

unread,
Sep 28, 2011, 6:02:11 PM9/28/11
to scala...@googlegroups.com
Hi Ryan,

Den 2011-09-28 21:30:29 skrev Ryan King <ryan...@gmail.com>:

> I keep seeing my checks fail due to to many tests being discarded. I
> use arbitrary[Long] with a uniqueness filter heavily for my
> checks[1], so I looked into whether it was generating random numbers.
> It isn't doing that well.
>
> I generated 100,000 random longs from this [2] and here's the top 10
> results (output from uniq -c):
>
> 10139 0
> 10111 -4611686018427387904
> 10006 1
> 9982 -1
> 9935 4611686018427387903
> 1 999408985813006357
> 1 999404514076553559
> 1 999243136190674913
> 1 999141327322374406
> 1 998717611447317269
>
> Is there something I can tune to get better random numbers?

Well, arbitrary[Long] does give higher weight to the edge
cases (-1, 0, 1, MIN_LONG/2, MAX_LONG/2). The sum of the
weights for the edge cases is equal to the weight of the
rest of the numbers, and this is completely in line with
your findings. There is no way to tweak the arbitrary generator.
Maybe the weights could be adjusted a bit, so we didn't get
quite as many edge cases generated.

However, if you want implement a generator that produces
unique Longs, it is probably better to start out with the
Gen.choose generator instead of Arbitrary.arbitrary:

Gen.choose(Long.MinValue/2, Long.MaxValue/2).filter(new UniqueLong())

(the reason you have do divide max and min by 2 is that
Gen.choose can't handle an interval that is larger than
Long.MaxValue.)

Would that help you?


Best regards,
/ Rickard

Ben Hutchison

unread,
Oct 4, 2011, 11:40:18 PM10/4/11
to scala...@googlegroups.com
I havent check my Scalacheck email folder for while, but looking today
I noticed this topic which Ive spent a bit of time thinking on. Some
comments:

* There are lots of useful distributions for a given type. For
example, for number types like Long we might want:

- A uniform distribution across all long values, possibly with edge
cases boosted as Scalacheck does by default.

- What I call a "log uniform" distribution (not sure if this has
another name) - where a number between 1 - 10 is as likely as one
between 10-100, and as likely as one between 100-1000. Ie, the
distribution becomes sparser at larger values. This is what I use as
my default.

- A bounded distribution, such as positive numbers only, or 0-1 for
fractional numbers.

So I wondered whether I could write properties dependent upon values
(of type T) sourced from Arbritrary[T], and still be free to vary the
underlying generator/distribution. This can be done, with a little
care, by
- Keep definitions of generators separate from the definition of
Arbitrary implicit instances for them.
- Import the generator you wish to use into scope, overriding the
default (global) Scalacheck Arbitrary instance.


* I think quite alot about the idea of generator /composition/, ie,
composing complex objects automatically. For example, a random
triangle can be constructed from 3 random points. And a random point
can be constructed (in 2D) from 2 random numbers.

- When we do this kind of composition, its likely that the underlying
component generators get filtered, transformed and merged in
unpredictable ways. So its not safe to assume that a all of a
generator's values get used, or that they're used in the same order as
they are generated. For example, a integer generator that always emits
(0, 1, -1, +Max/2, -Max/2, <random>, <random> ...) is going to
generate a very poor and predictable first triangle if its leading 6
values are used to build 3 triangle coordinates.

My rule of thumb: generators should always be stochastic and be
capable of revisiting every possible emitted value.

* I think there is a fairly generic parameter to generators: "the
amount that boundary and special values frequency is boosted". For
example, Rickard said below that boundary values are boosted to 50% of
all Long values. When composing generators, it may not make sense to
use such a rich mix of boundary values in component generators. For
example, building triangles out of points: we probably dont want half
of all triangle coordinates to lie on boundaries. Triangles have
different special cases anyway (eg very thin and pointed, right
angle). It would be nice if Scalacheck allows the passing some some
optional parameter to generators that could specify how big the "boost
factor" for special case data items was.

-Ben

> --
> You received this message because you are subscribed to the Google Groups
> "scalacheck" group.
> To post to this group, send email to scala...@googlegroups.com.
> To unsubscribe from this group, send email to
> scalacheck+...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/scalacheck?hl=en.
>
>

Jason Zaugg

unread,
Oct 5, 2011, 2:18:02 AM10/5/11
to scala...@googlegroups.com
Yep, these are the challenges that face the author of a Gen composed from other Gens.

A simple example that I found interesting was generating 2d points in the square (-1, -1) -> (1, 1).  If you naively compose a generator for X and Y, it's unlikely you will hit the (quite literal) corner cases. 

Instead, you can start with:

val edges: Gen[Double]    // -1, 0, 1
val uniform: Gen[Double]  // [-1, 1]

Then make a generator that uses a frequency generator to pick between these:

x: uniform, y: uniform
x: edges, y: uniform
x: uniform, y = x
x: edges, y = x

Finally, we can make this symmetric by wrapping this in a generator that either uses that point directly, or reflects it:

x: x1, y: y2
x: y1, y: x1

-jason

Ben Hutchison

unread,
Oct 6, 2011, 2:41:47 AM10/6/11
to scala...@googlegroups.com
On Wed, Oct 5, 2011 at 5:18 PM, Jason Zaugg <jza...@gmail.com> wrote:
> A simple example that I found interesting was generating 2d points in the
> square (-1, -1) -> (1, 1).  If you naively compose a generator for X and Y,
> it's unlikely you will hit the (quite literal) corner cases.
> Instead, you can start with:
> val edges: Gen[Double]    // -1, 0, 1
> val uniform: Gen[Double]  // [-1, 1]
> Then make a generator that uses a frequency generator to pick between these:
> x: uniform, y: uniform
> x: edges, y: uniform
> x: uniform, y = x
> x: edges, y = x
> Finally, we can make this symmetric by wrapping this in a generator that
> either uses that point directly, or reflects it:
> x: x1, y: y2
> x: y1, y: x1

When I first read this, I thought to myself "Surely a Gen[Double] for
the range [-1, 1], which boosted edge cases to eg 10% frequency, would
be simpler, and yield corner cases acceptably often?" But on further
reflection, I realised that the chance of this generating a square
with all 4 corners at the extremes is then 10E-8!

With edge cases 50% likely its P(1/256), which is acceptably often IMO.

Thats another use-case for giving the user a dial where they can turn
edge-case-frequency up and down.

-Ben

Jason Zaugg

unread,
Oct 6, 2011, 3:52:36 AM10/6/11
to scala...@googlegroups.com
I think that it is clearest to be explicit about which non-uniform generators are used, otherwise it's too easy to miss them, especially the product expands (e.g. to a Point3d)

type Point = (Double, Double)
Gen.frequency[Point](
   7 -> uniformGen,
   1 -> edgesGen,
   1 -> diagonalGen,
   1 -> corners
)

-jason
Reply all
Reply to author
Forward
0 new messages