# ISO help in probability

12 views

### Helmut Giese

Dec 9, 2005, 6:39:57 AM12/9/05
to
Hello out there,
I have a set of variables each of which has a certain weight or
probability assigned. Say
a -> 50
b -> 30
c -> 10
d -> 10

I want an algorithm which (in the above example) selects 'a' in every
other run and 'c' in 10% of the runs (approximately of course).

My attempt below fails miserably. Using the weights above, 'a' scores
way too high, mostly at the expense of 'c' and 'd', which are far away
from their expected 10% shares.
Evidently my logic is flawed: Foreach variable I get the current
probability by multiplying a random value with the associated weight,
and then select the one with the highest result.

Any advice on the 101 of probability applicable here will be greatly
appreciated.
Best regards
Helmut Giese
---
#
# \$lst is a list of 'name / weight' pairs, like <a 20 b 10 c 5>
#
proc selectProb {lst} {
# turn list into array
array set weight \$lst

# get the sum of the weights
set sum 0
foreach n [array names weight] {
incr sum \$weight(\$n)
}

# make a pair 'weighted random value / name' for each
foreach n [array names weight] {
set prob(\$n) [list [expr {(rand() * \$weight(\$n)) / \$sum}] \$n]
}

# sort by probability
foreach n [array names prob] {
lappend pLst \$prob(\$n)
}
set pLst [lsort -decreasing \$pLst]

# but return the names
foreach p \$pLst {
lappend res [lindex \$p 1]
}
return \$res
}

# test
foreach n {a b c d} {
set cnt(\$n) 0
}

for {set i 0} {\$i < 100} {incr i} {
set res [selectProb {a 50 b 30 c 10 d 10}]
# count the "winner"
incr cnt([lindex \$res 0])
}
parray cnt
---

### Michael Schlenker

Dec 9, 2005, 7:50:22 AM12/9/05
to
Helmut Giese wrote:
> Hello out there,
> I have a set of variables each of which has a certain weight or
> probability assigned. Say
> a -> 50
> b -> 30
> c -> 10
> d -> 10
>
> I want an algorithm which (in the above example) selects 'a' in every
> other run and 'c' in 10% of the runs (approximately of course).
>
> My attempt below fails miserably. Using the weights above, 'a' scores
> way too high, mostly at the expense of 'c' and 'd', which are far away
> from their expected 10% shares.
> Evidently my logic is flawed: Foreach variable I get the current
> probability by multiplying a random value with the associated weight,
> and then select the one with the highest result.
>
> Any advice on the 101 of probability applicable here will be greatly
> appreciated.
> Best regards
> Helmut Giese

Basically it works like this:

You divide your probability space [1-100] (if you use percentage) into
intervals based on the weights you have. So for your example you map the
letters to this intervals:

a -> [1 50]
b -> [51 80]
c -> [81 90]
d -> [91 100]

Now you generate a random number between 1 and 100 and lookup in which
interval it is, thats the winner.

If you only allow integer weights, you can get lazy like this:

proc create_wtable {var weights} {
upvar 1 \$var table
set table [list {}]
foreach {key weight} \$weights {
for {set i 0} {\$i < \$weight} {incr i} {
lappend table \$key
}
}
# return the value for probability 1.0
return [expr {[llength \$table]-1}]
}

proc pick_random {var max} {
upvar 1 \$var table
lindex \$table [math::random 1 \$max]
}

# test
set max [create_wtable wtab {a 50 b 30 c 10 d 10} ]
puts [pick_random wtab \$max]

Michael

### Uwe Klein

Dec 9, 2005, 8:31:56 AM12/9/05
to
Helmut Giese wrote:
> Hello out there,
> I have a set of variables each of which has a certain weight or
> probability assigned. Say
> a -> 50
> b -> 30
> c -> 10
> d -> 10
Hi Helmut,
This any better?:

#!/usr/bin/tclsh

set probabilities {a 50 b 30 c 10 d 10 }

set curr 0
foreach {tok prob} \$probabilities {
lappend ::stat(toks) \$tok
set ::stat(\$tok,min) \$curr
incr curr \$prob
set ::stat(\$tok,max) \$curr

set ::res(\$tok) 0
}
set ::stat(summ) \$curr

# rand() gives an even distribution from 0.0 to 0.999999999
proc randtok {} {
set rand [ expr { rand() * \$::stat(summ) }]
foreach tok \$::stat(toks) {
if { \$rand >= \$::stat(\$tok,min) } {
if { \$rand < \$::stat(\$tok,max) } {
return \$tok
}
}
}
return this_is_an_error
}

for {set i 0} {\$i < 10000 } {incr i} {
incr ::res([randtok])
}
parray ::res

#end

G!
uwe

### Joe English

Dec 9, 2005, 2:31:11 PM12/9/05
to
Helmut Giese wrote:

>I have a set of variables each of which has a certain weight or
>probability assigned. Say
>a -> 50
>b -> 30
>c -> 10
>d -> 10
>
>I want an algorithm which (in the above example) selects 'a' in every
>other run and 'c' in 10% of the runs (approximately of course).
>
>My attempt below fails miserably. Using the weights above, 'a' scores
>way too high, mostly at the expense of 'c' and 'd', which are far away
>from their expected 10% shares.
>Evidently my logic is flawed: Foreach variable I get the current
>probability by multiplying a random value with the associated weight,
>and then select the one with the highest result.
>
>Any advice on the 101 of probability applicable here will be greatly
>appreciated.

First, compute the cumulative sum (or "running total")
of the variables and their weights, and construct a new
list that pairs each variable with the

set csum 0
set clst [list]
foreach {k w} \$lst {
set csum [expr {\$csum + \$w}]
lappend clst \$k \$csum
}

Next, generate a random number between 0 and the total weight:

set rn [expr {rand() * \$sum}]

Then select the first element in the transformed list with a
cumulative sum greater than that number:

foreach {k c} \$clst {
if {\$c > \$rn} { break }
}
# ==> \$k is the answer.

With the sample data above, \$clst is:

a 50 b 80 c 90 d 100

So a random number between 0 and \$csum == 100 will
be <50 half the time, selecting 'a'; will lie between
50 and 80 30% of the time, yielding 'b'; will lie between
80 and 90 10% of the time, et cetera.

Tweak the algorithm above as needed depending on the required
performance characteristics. (If you need to do a lot of
selections from a fixed data set, a binary search through
the cumulative sum list will be profitable.)

--Joe English

### Helmut Giese

Dec 10, 2005, 5:01:19 AM12/10/05
to
Many thanks to all of you.
Sorry for the delay - we have a family reunion and priorities have to
shift sometimes.
On first sight it looks like each of the solutions will solve my
problem but I'll have to wait till after the weekend to actually do
anything with them.
Thanks again and best regards.
Yes, Mom, I'm coming.
Helmut - hosting the whole family for the weekend - Giese