Another Scala problem: Weighted Random Selection


Ben Hutchison

Mar 8, 2012, 9:02:02 AM
to scala...@googlegroups.com
case class WeightedItem[T](item: T, weight: Double)

def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections:
Int, r: Random): Seq[T] = ???


Your task is to implement weightedSelection, such that it makes
'numSelections' random selections (with replacement) from 'items',
where the probability of selecting each item is proportional to its
weighting.

It is a random process. I'd expect the correctness of your algorithm
can only be verified stochastically, eg via repeated testing using
ScalaCheck. For a generous serving of bonus marks, implement a unit
test that provides some degree of confidence that your function works
correctly.

Example, contrived such that the selections occur in exactly the
expected proportions:

val items = Seq(WeightedItem("Red", 1d/6), WeightedItem("Blue", 2d/6),
  WeightedItem("Green", 3d/6))

//e.g. Seq(Green, Blue, Green, Green, Red, Blue)
val selections = weightedSelection(items, 6, new Random())


I was reminded of the problem in Unit 3 of Sebastian Thrun's online
course "Programming a Robotic Car" (1), implementing Particle Filters.
But the problem comes up in a lot of places, and I realize that I've
encountered the same basic problem multiple times before:

- Genetic algorithms: selecting next generation in proportion to fitness
- ScalaCheck test data generation: generating a weighted distribution
of different values
- Game AI: Selecting different actions weighted by a suitability metric

Finally, I'll mention that there are methods of solving this problem
considerably more efficient than my first intuitive solution sketch -
if interested, google "roulette selection" (but be prepared to pick
through some noise), or see the course videos.

Discuss, but hold off posting any solution code until at least next Wednesday..?

-Ben

1: http://www.udacity.com/overview/Course/cs373/CourseRev/feb2012

John Mitchell

Mar 13, 2012, 6:40:56 PM
to Melbourne Scala User Group
Ok, here's my simple attempt.
It has some flaws, but I thought it better to get something up and
receive feedback.
Thanks to Ben for the problem. :)
Cheers,
John

package melbscalausergroup

import scala.util.Random

object Selector {
  def main(args: Array[String]) {
    val green = new WeightedItem("green", 1d/6)
    val blue = new WeightedItem("blue", 2d/6)
    val red = new WeightedItem("red", 3d/6)

    val items = green :: blue :: red :: Nil

    val result = weightedSelection(items, 5, new Random())

    println(result.toString())
  }

  val range = 1000

  def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections: Int, r: Random): Seq[T] = {
    val selectionPool = items.flatMap(createSeq(_)).toSeq
    var selections = List[T]()

    for (i <- 0 until numSelections) {
      val randomPosition = r.nextInt(range)
      selections ::= selectionPool(randomPosition)
    }

    selections.toSeq
  }

  def createSeq[T](item: WeightedItem[T]): Seq[T] = {
    val howMany = (item.weight * range).toInt
    Seq.fill(howMany)(item.item)
  }
}


package melbscalausergroup

case class WeightedItem[T](item: T, weight: Double)

Jem

Mar 13, 2012, 6:46:40 PM
to scala...@googlegroups.com
Ah yes, it is time.

I'm not 100% convinced I understood the requisite algorithm. My function creates a Seq of the items ordered by cumulative weight, such that each item occupies a sub-range between 0 and sum(weights). Then it gets a random number twixt 0 and sum(weights) and drops items from the ordered list until it finds the one it needs.

It is straightforward, but not elegant or efficient.
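For concreteness, a minimal sketch of the cumulative-weight idea described above (illustrative code, not the actual implementation - names are my own):

```scala
import scala.util.Random

case class WeightedItem[T](item: T, weight: Double)

def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections: Int, r: Random): Seq[T] = {
  val total = items.map(_.weight).sum
  // Pair each item with its cumulative weight, so item i occupies
  // the sub-range (cum(i-1), cum(i)] of [0, total].
  val cumulative = items.scanLeft(0.0)(_ + _.weight).tail.zip(items.map(_.item))
  Seq.fill(numSelections) {
    val d = r.nextDouble() * total  // drawn once per selection
    cumulative.find(_._1 >= d).get._2
  }
}
```

Each selection draws one random number and does a linear scan over the items, so this is the simple-but-O(NL) version.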







Toby Corkindale

Mar 13, 2012, 10:59:05 PM
to scala...@googlegroups.com
On 9 March 2012 01:02, Ben Hutchison <brhut...@gmail.com> wrote:
> case class WeightedItem[T](item: T, weight: Double)
>
> def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections:
> Int, r: Random): Seq[T] = ???
>
> Your task is to implement weightedSelection, such that it makes
> 'numSelections' random selections (with replacement) from 'items',
> where the probability of selecting each item is proportional to its
> weighting.


I wrote this solution before looking at any others:
https://gist.github.com/2033639
Was a fairly fast implementation, but didn't have time to refine it.

Looks like I went for the same concept as Jem, spreading the
options out to occupy space in a range proportional to their weight,
then normalising the random number to that range to discover which
bucket it lands in.

The ugliness in mine comes from the pick_one() method; I'm sure it
should be using one of the collections methods with an accumulator to
achieve the same result, somehow.
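For what it's worth, one way to phrase that accumulator idea with collection methods (a sketch, not the code in the gist; pickOne is a hypothetical name standing in for pick_one):

```scala
case class WeightedItem[T](item: T, weight: Double)

// scanLeft builds the running cumulative weights; find picks the first
// bucket whose cumulative weight reaches the random threshold d.
def pickOne[T](items: Seq[WeightedItem[T]], d: Double): T = {
  val cumulative = items.scanLeft(0.0)(_ + _.weight).tail
  cumulative.zip(items).find(_._1 >= d).get._2.item
}
```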

-Toby

Andrew Conway

Mar 13, 2012, 11:21:21 PM
to Melbourne Scala User Group

My idea is https://gist.github.com/2033568. The randomness check is
not very serious, but my algorithm was unintuitive enough that I
didn't trust it without some sanity check.

It is fascinating to see what other people come up with.

Notation: assume there are L elements of the list, and N items being
selected.

Jem and Toby's approach was my first idea too - it is simple,
accurate, and O(NL). One speed-up I thought of would be to sort the
items with the highest-weighted ones first - each iteration would then
be faster, on average. But still O(NL).

BTW: Toby - I don't think your pick_one method is as ugly as you claim.
It is fast and easy to understand. Functional programming has many
advantages, but is not always better. There is a minuscule chance that
rounding errors could cause it to drop through to the return
statement.

I thought this morning of the approach I actually implemented, which
is accurate and O(N+L). But it has a variety of disadvantages:
1) More complex than other solutions
2) Produces an ordered list (which you can always randomly permute in
O(N) time if you want).
3) Requires some maths to think about why it works
4) It is fragile - some innocent looking changes would cause it to
fail (about 1 in 10^17 times) due to floating point rounding.
5) It may well have subtle bugs if I made a mistake in (3) or (4).
6) It uses N+L random numbers instead of the N the other approaches use.
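The O(N) permutation mentioned in point 2 is the classic Fisher-Yates shuffle; a quick sketch for reference (scala.util.Random.shuffle does the same job):

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Fisher-Yates: walk from the end, swapping each position with a
// uniformly chosen position at or before it. O(N) swaps total.
def permute[T](xs: Seq[T], r: Random): Seq[T] = {
  val buf = ArrayBuffer(xs: _*)
  for (i <- buf.length - 1 to 1 by -1) {
    val j = r.nextInt(i + 1)
    val t = buf(i); buf(i) = buf(j); buf(j) = t
  }
  buf.toSeq
}
```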

I hadn't thought of John's idea, which is simpler than mine, and
similar in speed, at the (often acceptable) cost of being an
approximation. I believe it is O(N+L+1000) [ at least if you convert
selectionPool to an array or IndexedSeq, and check that the weights
add up to 1 ].

Other suggestions for John:

for (i <- 0 until numSelections) yield selectionPool(r.nextInt(range))

can replace all of:

var selections = List[T]()

for (i <- 0 until numSelections) {
  val randomPosition = r.nextInt(range)
  selections ::= selectionPool(randomPosition)
}

selections.toSeq

Also, it would be good to use selectionPool.length instead of range as
the multiplying number - otherwise you could easily get
index-out-of-range errors due to rounding, and it also lets you use
weights that don't add up to 1. And howMany should use better
rounding, because 0.99.toInt is 0.
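Putting those suggestions together, the pool approach might look something like this (my sketch, not John's updated code; the rounding and the length-based index are the two fixes):

```scala
import scala.util.Random

case class WeightedItem[T](item: T, weight: Double)

val range = 1000

def createSeq[T](item: WeightedItem[T]): Seq[T] =
  Seq.fill((item.weight * range).round.toInt)(item.item)  // round, don't truncate

def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections: Int, r: Random): Seq[T] = {
  val pool = items.flatMap(createSeq(_)).toIndexedSeq
  // Index by the actual pool length, so rounding errors and weights
  // that don't sum to 1 can't push the index out of range.
  for (_ <- 0 until numSelections) yield pool(r.nextInt(pool.length))
}
```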

I've spent far too long thinking about this. Back to some real work.

Regards,

Andrew.

Ben Hutchison

Mar 13, 2012, 11:26:56 PM
to scala...@googlegroups.com
Thanks to everybody for posting their work. As for me, well... the dog
ate my homework!

Actually, I've been in the ThoughtWorks office in Xi'an, China on a
work trip since Sunday, and it's been absolutely full on: getting used
to life (or just "survival") in a big city here, and meeting the local
teams. I haven't had a chance to complete my roulette selection
method.

I'll post it later next week when I get back.

-Ben

John Mitchell

Mar 14, 2012, 6:00:09 AM
to Melbourne Scala User Group
Thanks Andrew,

I'll update my code with your suggestion to use yield.
The code is fragile around pool length and weighting.
Cheers,
John

King Lung Chiu

Mar 16, 2012, 9:37:40 AM
to scala...@googlegroups.com
Hi all,

Here's my attempt without looking at the others'. I'll include my
thinking process.

First, my initial naive attempt:

// 1st cut: simple but wrong
object weightedSelection {

  def apply[T](
      items: Seq[WeightedItem[T]],
      numSelections: Int,
      r: Random): Seq[T] = {
    val sortedItems = items.sortBy(_.weight)
    for (i <- 1 to numSelections) yield
      sortedItems.find(_.weight >= r.nextDouble).get.item
  }

  def count[T](items: Seq[T]) {
    val grouped = items.groupBy(_.toString)
    for ((key, list) <- grouped)
      println(key + ": " + (list.length / (items.size * 1.0)))
  }

  def test(numToSelect: Int) {
    val weights = Seq(0.50, 0.30, 0.15, 0.05)
    val items = for (w <- weights) yield WeightedItem(w.toString, w)
    count(apply(items, numToSelect, new scala.util.Random))
  }

}

Note test(...) just executes a test run and prints the results to
screen. It's not a unit test.

I saw a couple of niggling issues straight after writing this:
1) The probability of selection is just plain wrong ... the items
aren't spaced out correctly
2) It doesn't account for duplicate probabilities, eg 2 items both at 50%
3) It doesn't account for the gap between the max probability supplied
and 1 (100%)

Issue #2 actually helped me figure out how to fix it, so my 2nd attempt:

// 2nd cut
object weightedSelection {

  def apply[T](
      items: Seq[WeightedItem[T]],
      numSelections: Int,
      r: Random): Seq[T] = {
    var max = 0.0
    val spacedItems = for (i <- items.sortBy(_.weight)) yield {
      max += i.weight
      WeightedItem(i.item, max)
    }
    for (i <- 1 to numSelections) yield
      spacedItems.find(_.weight >= r.nextDouble).get.item
  }

  // ... the rest are the same
}

This works assuming 0 <= weight <= 1 for each item, and that the sum
of all item weights = 1. If not, then we'll also need to divide the
weight of each spacedItem by the sum of all weights (max at the end of
the 1st for-loop).

The only niggling issue remaining - or at least annoying to me - is
the linear find(...); I test with 1000+ selections, and that's a lot
of linear searches when the items list is large. Will need to dig
through the API for a better data structure.

Incidentally, I noticed the default PRNG seems a little biased: the
probability of getting 0.2 < n < 0.5 and the probability of getting
0.5 < n < 1 are almost the same, as seen below. This also means it's a
little hard to write unit tests for this, as the tolerance to use will
depend on the PRNG.

Comments welcome!

cheers,

King

scala> weightedSelection.test(1000000)
0.3: 0.379311
0.15: 0.189952
0.5: 0.380485
0.05: 0.050252

scala> weightedSelection.test(1000000)
0.3: 0.380466
0.15: 0.189491
0.5: 0.380332
0.05: 0.049711

scala> weightedSelection.test(1000000)
0.3: 0.378862
0.15: 0.190337
0.5: 0.380949
0.05: 0.049852

scala> weightedSelection.test(1000000)
0.3: 0.379963
0.15: 0.189682
0.5: 0.380479
0.05: 0.049876

scala> weightedSelection.test(1000000)
0.15: 0.189071
0.3: 0.381133
0.5: 0.379775
0.05: 0.050021

scala> weightedSelection.test(1000000)
0.15: 0.189668
0.3: 0.379144
0.5: 0.380937
0.05: 0.050251

scala> weightedSelection.test(10000000)
0.3: 0.3800097
0.15: 0.189868
0.5: 0.3801614
0.05: 0.0499609

scala> weightedSelection.test(10000000)
0.3: 0.3798768
0.15: 0.1900881
0.5: 0.3799624
0.05: 0.0500727

scala> weightedSelection.test(10000000)
0.3: 0.3801548
0.15: 0.1899825
0.5: 0.3798037
0.05: 0.050059


Andrew Conway

Mar 16, 2012, 10:46:32 PM
to Melbourne Scala User Group
King wrote:
> Incidentally, I noticed the default PRNG is a little biased: the
> probability of getting 0.2 < n < 0.5 and 0.5 < n < 1 is almost the
> same, as seen below.

I don't think the PRNG is biased that badly; I think there is a bug
in the line:
> spacedItems.find(_.weight >= r.nextDouble).get.item
It should be
{ val d = r.nextDouble(); spacedItems.find(_.weight >= d).get.item }
Otherwise the PRNG will run again for each comparison, leading to
biases in favour of mid-range probabilities.
Better yet (in case the weights do not sum to exactly 1):
{ val d = r.nextDouble()*max; spacedItems.find(_.weight >= d).get.item }


> The only niggling issue remaining - or at least annoying to me - is
> the linear find(...); I test with 1000+ selections, and that's a lot
> of linear searches when the items list is large. Will need to dig
> through the API for a better data structure.

You could use an array/IndexedSeq and a binary search for O(log L) and
have exact results, or use John's approximate but O(1) solution.
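A sketch of that binary-search variant (illustrative code; the cumulative weights are precomputed once, after which each of the N selections costs O(log L)):

```scala
import scala.util.Random

case class WeightedItem[T](item: T, weight: Double)

def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections: Int, r: Random): Seq[T] = {
  val cum = items.scanLeft(0.0)(_ + _.weight).tail.toArray
  val vals = items.map(_.item).toIndexedSeq
  val total = cum.last

  // Binary search for the first index whose cumulative weight exceeds d.
  def search(d: Double): Int = {
    var lo = 0
    var hi = cum.length - 1
    while (lo < hi) {
      val mid = (lo + hi) / 2
      if (cum(mid) > d) hi = mid else lo = mid + 1
    }
    lo
  }

  Seq.fill(numSelections)(vals(search(r.nextDouble() * total)))
}
```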

Regards,

Andrew.

King Lung Chiu

Mar 18, 2012, 4:32:42 AM
to scala...@googlegroups.com
> I don't think the PRNG is biased that badly; I think there is a bug
> in the line:
>> spacedItems.find(_.weight >= r.nextDouble).get.item
> It should be
> { val d = r.nextDouble(); spacedItems.find(_.weight >= d).get.item }

Thanks Andrew, you're right: r.nextDouble gets called more than once
per find(...), although the bug is a little unintuitive to me ...


> Otherwise the PRNG will run again for each comparison, leading to

I can see it doing that after putting some println() calls at strategic
places ... but why does it do that? And it runs more than just one
extra time per comparison: it varies, but generally 3 times in my code.

I thought "_.weight >= r.nextDouble" would just get wrapped into a
function value that gets called once per find(...) ...


> biases in the favour of mid range probabilities.
> Better yet (in case weights do not sum to exactly 1):
> { val d = r.nextDouble()*max; spacedItems.find(_.weight >=
> d).get.item }

Yes, this fixes the bias I saw, though I don't get it: why do the
superfluous calls in the "buggy" code lead to the bias?

cheers,

King


Andrew Conway

Mar 18, 2012, 7:29:53 AM
to Melbourne Scala User Group
Dear King,

> I can see it doing that after putting in some println() at strategic
> places ... but why does it do that?

l.find(_.weight >= r.nextDouble)

basically compiles to

def f[T](x:WeightedItem[T]) = x.weight >= r.nextDouble()
l.find(f _)

So the function f gets repeatedly called until it returns true. Each
time it gets called it calls r.nextDouble().

Scala differs from most (non-functional) languages in that functions
are easily defined inline, so the common rule "the arguments to a
function get evaluated before the function is called" is no longer
true and can be misleading. An even less obvious example is
Array.fill(5)(r.nextDouble()) which will produce an array with 5
different values.
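A deterministic toy version of this, with a counter standing in for the PRNG, makes the per-element evaluation visible:

```scala
// draw() stands in for r.nextDouble(); the counter records how often
// the predicate body actually runs during a single find(...).
var calls = 0
def draw(): Double = { calls += 1; 0.9 }

val cum = Seq(0.1, 0.5, 0.95)  // cumulative weights
val found = cum.find(_ >= draw())
println(s"predicate evaluated $calls times")  // 3 times: once per element tested
```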

Pure functional languages have referential transparency, which means
that the number of times it gets executed is irrelevant, so they don't
have this source of confusion either.

Scala is amazingly flexible and concise. The cost is a lot of details
you need to understand well.


> Yes, this fixes the biase I saw, though I don't get it: why do the
> superfluous calls in the "buggy" code lead to the biase?

This is just an artifact of the mathematics. Consider a simple case of
three elements A B C, with probabilities a, b, and c respectively.
a+b+c=1.

When we make the cumulative sum list, we get a, a+b, and 1.

Now the first element, A, gets accepted if the PRNG output is < a.
This is probability a. All good so far.

The second one has a (1-a) probability of being even considered, after
which it has to pass a PRNG test with prob. a+b. So the probability
that B is selected is (1-a)(a+b)=b+a(1-a-b)=b+ac>b. So B will be
selected with probability greater than b. [ In the particular example
you used, a=0.05, b=0.15 and b+a(1-a-b)=0.19, consistent with the
observed distribution from your tests. ]

In general, the middle choices will be selected more frequently than
they should be. As you sorted the choices by probability, the middle
choices are the mid range probabilities.
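A quick Monte Carlo check of this arithmetic, with my numbers a = 0.05, b = 0.15, c = 0.80 (matching the example above): under the buggy per-comparison draw, B should come up with frequency about b + ac = 0.19 rather than 0.15.

```scala
import scala.util.Random

val cum = Seq(0.05, 0.20, 1.0)  // cumulative weights for a, b, c
val r = new Random(1)
val trials = 200000
// indexWhere re-draws a fresh random number for every comparison,
// reproducing the bug under discussion.
val hitsB = (1 to trials).count(_ => cum.indexWhere(_ >= r.nextDouble()) == 1)
val freqB = hitsB.toDouble / trials
println(freqB)  // close to 0.19, not the nominal 0.15
```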

Regards,

Andrew.

King Lung Chiu

Mar 18, 2012, 10:12:05 AM
to scala...@googlegroups.com
>> I can see it doing that after putting in some println() at strategic
>> places ... but why does it do that?
>
> l.find(_.weight >= r.nextDouble)
>
> basically compiles to
>
> def f[T](x:WeightedItem[T]) = x.weight >= r.nextDouble()
> l.find(f _)
>
> So the function f gets repeatedly called until it returns true. Each
> time it gets called it calls r.nextDouble().

Thanks Andrew, I get it now! (though I didn't get it on the first
several readings until I tried it on a simpler example)*

I had thought f only gets invoked _once_ each time find(...) is
called; I now realise that since f is the predicate, it gets called as
_many_ times as required per find(...) call, until a matching item is
found!

ie each find(...) results in 1+ calls to f(...)


* var count = 0

  def findP(lim: Int)(num: Int): Boolean = {
    count += 1; println(count); count > lim
  }

  val a = (1 to 10).toArray
  for (i <- 1 to 5) { count = 0; println(a.find(findP(3))) }


> Scala differs from most (non-functional) languages in that functions
> are easily defined inline, so the common rule "the arguments to a
> function get evaluated before the function is called" is no longer
> true and can be misleading. An even less obvious example is
> Array.fill(5)(r.nextDouble()) which will produce an array with 5
> different values.

Yes, I get all these, which is why I was confused ... until I realised
I forgot about the 1+ predicate calls in each find call (thinking it
was 1-to-1).

> Pure functional languages have referential transparency, which means
> that the number of times it gets executed is irrelevant, so they don't
> have this source of confusion either.
>
> Scala is amazingly flexible and concise. The cost is a lot of details
> you need to understand well.

Yes!

And my maths is rusty ... I'll read the rest when my mind is clearer tomorrow!

thanks a lot, cheers,

King


King Lung Chiu

Mar 21, 2012, 6:50:55 AM
to scala...@googlegroups.com
>> Yes, this fixes the bias I saw, though I don't get it: why do the
>> superfluous calls in the "buggy" code lead to the bias?
>
> This is just an artifact of the mathematics. Consider a simple case of
...

> Now the first element, A, gets accepted if a PRNG is <a. This is
> probability a. All good so far.
>
> The second one has a (1-a) probability of being even considered, after
> which it has to pass a PRNG test with prob. a+b. So the probability
> that B is selected is (1-a)(a+b)=b+a(1-a-b)=b+ac>b. So B will be
> selected with probability greater than b. [ In the particular example
> you used, a=0.05, b=0.15 and b+a(1-a-b)=0.19, consistent with the
> observed distribution from your tests. ]

Thanks Andrew, I get it now!

> In general, the middle choices will be selected more frequently than


> they should be. As you sorted the choices by probability, the middle
> choices are the mid range probabilities.

Yes, after understanding where the bug is, this now makes sense.

thanks again, cheers,

King

Ben Hutchison

Mar 25, 2012, 12:34:52 AM
to scala...@googlegroups.com
Here's a solution, at last, using Roulette Selection, and an
(imperfect) statistical test based on the Chi-Squared "Goodness-of-Fit"
test.

Maybe we can go through the various approaches taken at the meeting
tomorrow? And anything we learned from doing it?

There are things I'd change about both the implementation and the test
that I'll discuss tomorrow. For now I'm out of time and tired of being
inside - Sunday arvo gardening awaits...

-Ben

package test

import scala.util.Random
import scala.collection.IterableLike
import scala.annotation.tailrec

//First, some generic infrastructure
object RichTraversable {
  implicit def enrich[A](t: Iterable[A]) = new RichIterable(t)

  def unfoldLeftN[A, B](seed: B, n: Int)(f: B => (B, A)) = {
    var b = seed
    var result = Vector[A]()
    for (i <- 0 to n) {
      val pair = f(b)
      b = pair._1
      result = result :+ pair._2
    }
    result
  }
}

class RichIterable[A](coll: Iterable[A]) {
  @tailrec
  final def foldLeftSelect[B](z: B)(op: (B, A) => B, p: B => Boolean): (Option[A], Iterable[A]) = {
    if (coll.isEmpty) return (None, Iterable.empty)

    val a = coll.head
    val t = coll.tail
    val newZ = op(z, a)
    if (p(newZ)) return (Some(a), t)
    new RichIterable(t).foldLeftSelect(newZ)(op, p)
  }

  def cyclic: Iterable[A] = new Iterable[A] {
    class CyclicIterator(traversable: Traversable[A]) extends Iterator[A] {
      var currentIterator = traversable.toIterator
      def next() = {
        if (!currentIterator.hasNext) currentIterator = traversable.toIterator
        currentIterator.next()
      }
      def hasNext = !traversable.isEmpty
    }

    def iterator = new CyclicIterator(coll)
  }.view
}

import RichTraversable._

object RouletteSampling extends App {

  //the correctness test
  val n = 100
  val colors = Vector("Red", "Green", "Blue", "Yellow")
  val weights = Vector(0.05, 0.15, 0.6, 0.2)
  val items = (colors, weights).zipped.map(new WeightedItem(_, _))

  val selections = weightedSelection(items, n, new Random())

  val expectedFreq = (colors, weights.map(_ * n)).zipped.toMap
  val observedFreq = selections.groupBy((s) => s).mapValues(_.size.toDouble)
  val pairwise = colors.map((color) => (expectedFreq(color), observedFreq(color)))
  val chiSquare = pairwise.map({ case (e, o) => ((o - e) * (o - e)) / e }).sum

  val isLikely = chiSquare < 6.25
  val isPlausible = chiSquare < 12.84
  println("Likely: " + isLikely + " Plausible: " + isPlausible + " Chi Square: " + chiSquare)

  //the implementation itself

  case class WeightedItem[T](item: T, weight: Double)

  def weightedSelection[T](weightedItems: IndexedSeq[WeightedItem[T]], n: Int, r: Random): Seq[T] = {

    val weights = weightedItems.map(_.weight)
    val maxIncr = weights.max * 2.0
    val randomIndex = r.nextInt(weightedItems.length)
    val (head, tail) = weightedItems.splitAt(randomIndex)
    val initialWheel = (tail ++ head).cyclic

    def randomIncr = r.nextDouble() * maxIncr

    unfoldLeftN(initialWheel, n)((wheel) => {
      val incr = randomIncr

      val (selection, nextWheel) = wheel.foldLeftSelect(0.0)(
        (cumWeight, a) => a.weight + cumWeight, (weight) => weight > incr)

      (nextWheel, selection.get.item)
    })
  }
}

Andrew Conway

Mar 25, 2012, 9:42:58 AM
to Melbourne Scala User Group
Ooh, Ben's is an interesting approach, and very different to any of
the other ideas. Like my idea, it doesn't just choose one (repeated n
times).

It seems to be an approximate algorithm, with different behaviour to
John's approximation. Ben's is asymptotically correct for large n, but
can be significantly inaccurate for small n - try with n=1 and the
list given in test4 below:

val test2 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".toCharArray.toSeq.map{ c => WeightedItem(c.toString, 1.0) }
val test3 = WeightedItem("Big", 1e6) :: test2.toList
val test4 = WeightedItem("Second Big", 1e6) :: WeightedItem("Third Big", 1e6) :: test3.toList

[ in test4, with n=1, the first item should be chosen about 1/3 of the
time, but in my test it was chosen 0.48 of the time over 20,000
runs. ]

The run time seems to be intrinsically O(kN+L) where
k ~ max(weight)/average(weight), which will be small in most cases.
But I think there is some implementation issue (in the Scala internals
- not in Ben's code) that causes it to run very slowly for n above
1000 or so. I think it is a problem with the way tail is computed for
the cyclic list implementation, but I don't know the Iterable
internals well enough to track it down. Certainly my REPL takes a very
long time to compute:

(1 until 1000).foldLeft(l2.cyclic)( (l, _) => l.tail ).head

where l2 is a RichIterable. Anyone have any ideas?

One minor typo: "to" should be "until" in unfoldLeftN otherwise it
produces n+1 results. A mistake I have made many times.

Incidentally, I discovered a serious bug in my algorithm - bad
numerical underflow errors crop up for n ~ 10,000 in some cases. I'll
try to fix it.


Regards,

Andrew.

Andrew Conway

Mar 25, 2012, 9:58:16 AM
to Melbourne Scala User Group
I discovered a bug in my implementation. The numEntries function needs
to be changed as follows to avoid numeric underflow problems for large
n:

def numEntries(p: Double, N: Int, r: Random): Int =
  if (p > 0.5) N - numEntries(1.0 - p, N, r)
  else if (p < 0.0) 0
  else {
    val q = 1.0 - p
    if (N * p > 100.0) {
      // numeric underflow can be a problem with exact computation,
      // but the gaussian approximation is good
      val continuousApprox = r.nextGaussian() * Math.sqrt(N * p * q) + N * p
      continuousApprox.round.toInt.max(0).min(N)
    } else {
      // exact computation will not underflow
      var n = 0
      var prstop = Math.pow(q, N)
      var cumulative = 0.0
      while (n < N && (r.nextDouble() * (1 - cumulative)) >= prstop) {
        cumulative += prstop
        prstop *= p * (N - n) / (q * (n + 1))
        n += 1
      }
      n
    }
  }

Andrew Conway

Mar 25, 2012, 9:24:26 PM
to Melbourne Scala User Group


On 26 Mar, 00:42, Andrew Conway <agoo...@greatcactus.org> wrote:
> The run time seems to be intrinsically O(kN+L) where k ~ max(weight)/
> average(weight) which will be small in most cases. But I think there
> is some implementation issue (in the Scala internals - not in Ben's
> code) that causes it to run very slowly for n above 1000 or so. I
> think it is a problem with the way tail is computed for the cyclic
> list implementation, but I don't know the Iterable internals well
> enough to track it down. Certainly my REPL takes a very long time to
> compute:
>     (1 until 1000).foldLeft(l2.cyclic)( (l,_) => l.tail ).head
> where l2 is a RichIterable. Anyone have any ideas?

I worked out the problem - the tail method call effectively makes a
lazy list of values. So if you say
l.tail.tail.tail.tail.tail.tail.head,
the last call to head needs to propagate the evaluation back through
all the tails, meaning that tail becomes an amortized at-least-linear
operation. As it is called a number of times linear in n, the whole
algorithm becomes quadratic in n. One solution is to override tail,
as in the following modification, which runs orders of magnitude
faster for n=1000:

//First, some generic infrastructure
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

object RichTraversable {
  implicit def enrich[A](t: Traversable[A]) = new RichIterable(t)

  def unfoldLeftN[A, B](seed: B, n: Int)(f: B => (B, A)) = {
    var b = seed
    val result = new ListBuffer[A]
    //var result = Vector[A]()
    for (i <- 0 until n) {
      val pair = f(b)
      b = pair._1
      //result = result :+ pair._2
      result += pair._2
    }
    result.toList
  }
}

class RichIterable[A](coll: Traversable[A]) {
  @tailrec
  final def foldLeftSelect[B](z: B)(op: (B, A) => B, p: B => Boolean): (Option[A], Traversable[A]) = {
    if (coll.isEmpty) return (None, Iterable.empty)

    val a = coll.head
    val t = coll.tail
    val newZ = op(z, a)
    if (p(newZ)) return (Some(a), t)
    new RichIterable(t).foldLeftSelect(newZ)(op, p)
  }

  class CyclicWork {
    val elems = coll.toIndexedSeq
    val tailss = (0 until elems.length) map { new SimpleList(_) }
    class SimpleList(ind: Int) extends Traversable[A] {
      override def head = elems(ind)
      val nextInd = (ind + 1) % elems.length
      override def tail: Traversable[A] = tailss(nextInd)
      override def foreach[U](f: A => U) = { f(head); tail.foreach(f) }
    }
  }

  def cyclic: Traversable[A] = (new CyclicWork).tailss(0)
}

Ben Hutchison

Mar 25, 2012, 11:18:49 PM
to scala...@googlegroups.com
Thanks Andrew for all your perceptive analyses.

Yes, Roulette Selection is a (supposedly) fast-but-approximate method
more suitable for a larger number of selections.

Yes, I'd noticed there was some kind of horrible flaw in there causing
very slow performance for high N, but hadn't the patience to locate
it. Thanks for pinpointing it.


Two other points I want to raise tonight re: the weighted selection

1. I'm coming to the opinion that the original API signature I
specified was non-optimal, ie

def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections: Int, r: Random): Seq[T]

Rather, it should be represented as an Iterator or Stream of
selections, which you can pull on as many times as you need.


2. According to Wikipedia, the (Pearson) Chi Square test I used to
measure goodness-of-fit assumes that your selection categories have
near-equal weighting, which isn't necessarily true.

The basic problem is we know the population distribution (ie the
weightings), and we want to test how likely the observed sample (the
selections) is to have come from that population.

Can anyone suggest a more appropriate method, when the selection
categories might wildly differ in frequency?

-Ben
