Randomly selecting k integers from the range [min, max)

413 views
Skip to first unread message

Fumin Wang

unread,
Apr 17, 2014, 1:27:38 PM4/17/14
to golan...@googlegroups.com
Assuming that max-min can be a very large number, can anyone help me verify that the below algorithm is correct?

// Sample randomly selects k integers from the range [min, max).
// Given that Sample's algorithm is to run Fisher-Yates shuffle k times, it's complexity is O(k).
func Sample(k, min, max int) (sampled []int) {
swapped := make(map[int]int, k)
for i := 0; i < k; i++ {
// generate a random number r, where i <= r < max-min
r := rand.Intn(max-min-i) + i
// swapped[i], swapped[r] = swapped[r], swapped[i]
vr, ok := swapped[r]
if ok {
sampled = append(sampled, vr+min)
} else {
sampled = append(sampled, r+min)
}
vi, ok := swapped[i]
if ok {
swapped[r] = vi
} else {
swapped[r] = i
}
}
return
}


About the idea of having this function in the package "math/rand", do you think it's going towards the kitchen sink, or is there some value in it?

Joshua Liebow-Feeser

unread,
Apr 17, 2014, 2:05:58 PM4/17/14
to golan...@googlegroups.com
Two points:
1. Is this supposed to be sampled randomly with replacement? That is, can I pick the same number twice, or does it sample first with probability 1/(min-max) and then among the remaining numbers with probability 1/(min-max-1) and so on?
2. If you're considering having this in math/rand (and even if you're not, actually), I'd consider changing the signature to be:
func Sample(sampled []int, k, min, max int)
and fill the given slice.

Fumin Wang

unread,
Apr 17, 2014, 2:38:52 PM4/17/14
to golan...@googlegroups.com
Hi Joshua

* Sample samples *without* replacement. To sample with replacement from [min, max), I assume one could trivially call rand.Intn() k times (please correct me if I am wrong?).
* About requiring the caller to allocate the result in advance, is the main intention here to save calls to `malloc` when k is large? I'm curious are there examples of this API design pattern in any of the standard packages?

Kevin Gillette

unread,
Apr 17, 2014, 3:00:28 PM4/17/14
to golan...@googlegroups.com
I'd just do a simple transformation on the output of math/rand's Perm. http://play.golang.org/p/4hzU66PjJ2

Kevin Gillette

unread,
Apr 17, 2014, 3:01:45 PM4/17/14
to golan...@googlegroups.com
Of course, this approach decreases in efficiency as the difference between k and max-min grows, but for effectively 4 lines of code, it's a pretty simple solution.

Fumin Wang

unread,
Apr 17, 2014, 3:07:16 PM4/17/14
to golan...@googlegroups.com
True, as far as I can tell from http://golang.org/src/pkg/math/rand/rand.go?s=3157:3189#L94 , the complexity of this solution is max-min, which may or may not be acceptable depending on the scenario.

In our case, the min and max are geohashes which might be very far away from each other, hence this discussion.

Ignacio Grande

unread,
Apr 18, 2014, 3:53:35 AM4/18/14
to golan...@googlegroups.com
I would go for something like this:

http://play.golang.org/p/IbaFEArlS4

Depending on the ratio between k and max-min, I pick between two methods: if k << max-min, I fill a map with random numbers until its length is k, then I copy them into a slice.

Otherwise, I just fill a slice with all the max-min numbers and shuffle k elements.

Probably the code would need some more checking but it seems to work fine. The threshold may need some adjust for your particular case, too.

Regards.

Dan Kortschak

unread,
Apr 18, 2014, 5:09:33 AM4/18/14
to Ignacio Grande, golan...@googlegroups.com
You shouldn't rely on map ordering to satisfy a requirement for randomness.

Nacho

unread,
Apr 18, 2014, 5:14:35 AM4/18/14
to Dan Kortschak, golan...@googlegroups.com
You are right, but there is no mention that the sample ordering should be random too. If that is a requirement, then the slice should be shuffled after copying it from the map.

Jan Mercl

unread,
Apr 18, 2014, 6:14:44 AM4/18/14
to Fumin Wang, golang-nuts
On Thu, Apr 17, 2014 at 8:38 PM, Fumin Wang <awaw...@gmail.com> wrote:
> * Sample samples *without* replacement.

I guess a FCPRNG[0,1] might be used in that case.

[0]: http://godoc.org/github.com/cznic/mathutil#FC32
[1]: http://godoc.org/github.com/cznic/mathutil#FCBig

-j

Fumin Wang

unread,
Apr 18, 2014, 8:48:30 AM4/18/14
to Jan Mercl, golang-nuts
Hi Jan

Awesome, the performance of FC32 is very impressive! Benchmarking the below two functions Sample1 (based on FC32) and Sample2 (my approach) gives:

BenchmarkFC   200000     13250 ns/op

BenchmarkSample   500000       7369 ns/op

ok  tmptest 6.563s


This is almost a 2X difference in terms of speed. I haven't fully understood mathutil's implementation, but there obviously are some maths going on. Are you free to share some of your thoughts about this large difference in performance? Thanks!

func Sample1(k int, cycle *mathutil.FC32) (reply []int) {

  for i := 0; i < k; i++ {

    reply = append(reply, cycle.Next())

  }

  return

}


func Sample2(k, max int) (sampled []int) {

  if k >= max {

    for i := 0; i <= max; i++ {

      sampled = append(sampled, i)

    }

    return

  }


  swapped := make(map[int]int, k)

  for i := 0; i < k; i++ {

    // generate a random number from [i, max)

    r := rand.Intn(max-i) + i


    // swapped[i], swapped[r] = swapped[r], swapped[i]

    vr, ok := swapped[r]

    if ok {

      sampled = append(sampled, vr)

    } else {

      sampled = append(sampled, r)

    }

    vi, ok := swapped[i]

    if ok {

      swapped[r] = vi

    } else {

      swapped[r] = i

    }

  }

  return

}

Jan Mercl

unread,
Apr 18, 2014, 9:10:31 AM4/18/14
to Fumin Wang, golang-nuts
On Fri, Apr 18, 2014 at 2:48 PM, Fumin Wang <awaw...@gmail.com> wrote:
> Awesome, the performance of FC32 is very impressive! Benchmarking the below
> two functions Sample1 (based on FC32) and Sample2 (my approach) gives:
>
> BenchmarkFC 200000 13250 ns/op
>
> BenchmarkSample 500000 7369 ns/op
>
> ok tmptest 6.563s
>
>
> This is almost a 2X difference in terms of speed.

That's expected:

""""
For a billion numbers cycle the Next/Prev PRN can be produced in cca 100-150ns.
That's like 5-10 times slower compared to PRNs generated using the
(non FC) rand package.
""""

The lower performance is the additional cost of computing the pseudo
random number with the FC properties.

> I haven't fully understood
> mathutil's implementation, but there obviously are some maths going on.

It's related to CRT and positional notation with varying, all prime
bases. If the source code doesn't help then I apologize - it would
require at least few hours for me to write a proper documentation
about the algorithm used - as I'm not aware of any ready made proper
paper about it. (Chances are it might be new.)

> Are
> you free to share some of your thoughts about this large difference in
> performance?

As discussed above. The additional time is buying constant computing
space. Imagine your approach for a billion numbers - the memory of the
map keeping track of the "used" ones grows to big numbers (GBs). FC32
uses few tens of bytes for the same.

-j

Fumin Wang

unread,
Apr 18, 2014, 9:59:26 AM4/18/14
to Jan Mercl, golang-nuts
Hi Jan

By FCPRNG, do you mean Filter-Chaos based PRNGs, or something else?
But overall, thanks to your explanation, I'm able to spot the loop that finds all prime numbers. I also need to think about the memory usage implications of my approach you pointed out.

There're indeed gems of great ideas in "mathutil", thanks for sharing.

Jan Mercl

unread,
Apr 18, 2014, 10:08:52 AM4/18/14
to Fumin Wang, golang-nuts

FC in this case means Full Cycle: http://en.wikipedia.org/wiki/Full_cycle

(From a phone)

-j

Kevin Gillette

unread,
Apr 18, 2014, 1:22:59 PM4/18/14
to golan...@googlegroups.com, Jan Mercl
In this case the full cycle aspect provides the with-non-replacement guarantee.

Kevin Gillette

unread,
Apr 18, 2014, 1:35:56 PM4/18/14
to golan...@googlegroups.com, Fumin Wang
On Friday, April 18, 2014 7:10:31 AM UTC-6, Jan Mercl wrote:
If the source code doesn't help then I apologize - it would
require at least few hours for me to write a proper documentation
about the algorithm used

It'd be helpful if some of the implementation properties were documented. For example, does the Pos method return value wrap back to zero when the cycle resets (such that Pos and Next are effectively bijective)? The documentation of Seek seems to imply this. Also, how does NewFC32's hq parameter affect the maximum cycle range, if at all? Do we get an effective range of ~2 (or is it 4?) billion values no matter what?

Jan Mercl

unread,
Apr 19, 2014, 7:41:22 AM4/19/14
to Kevin Gillette, golang-nuts, Fumin Wang
On Fri, Apr 18, 2014 at 7:35 PM, Kevin Gillette
<extempor...@gmail.com> wrote:
> On Friday, April 18, 2014 7:10:31 AM UTC-6, Jan Mercl wrote:
>>
>> If the source code doesn't help then I apologize - it would
>> require at least few hours for me to write a proper documentation
>> about the algorithm used
>
>
> It'd be helpful if some of the implementation properties were documented.
> For example, does the Pos method return value wrap back to zero when the
> cycle resets (such that Pos and Next are effectively bijective)?

.Pos values have .Cycle modulus. .Cycle >= hi-lo+1. I don't know what
you mean by ".Pos and .Next effectively bijective".

> The
> documentation of Seek seems to imply this. Also, how does NewFC32's hq
> parameter affect the maximum cycle range, if at all? Do we get an effective
> range of ~2 (or is it 4?) billion values no matter what?

First please note that the algorithm is not bound to any specific
limits as seen if FCBig.

Now, the particular FC32 PRNG limits are [math.MinInt32,
math.MaxInt32], which is math.MaxUint32+1 numbers. However, the limits
are configurable in NewFC32. Let's have range r = hi-lo+1. Let's have
the smallest primorial[0] p such that p >= r. Let's have another, but
now greatest primorial such that p/q >= r. Cycle() reports p/q
(int64). The factors of p/q are the positional notation bases the
produced PRNs are computed from. Seeking the next/prev PRN may have to
skip "holes" in Cycle. From the previous it follows there are [0, r)
holes in a Cycle, zero for r being itself a primorial, r-1 if r is a
primoral+1.

The HQ parameter tweaks p, q in such a way (q' = q/2, q > 1) the
number of holes is [0, 2r) (or [r, 2r), not sure now]). This makes the
FCPRNG slower but it improves in the statistical properties of the
produced PRNs*. IIRC, the NIST results[0] for the specific range
used[2] are a bit better than what the math/rand PRNG provides in the
same scenario (or provide 3 years ago).

In the first approximation, PRN-r is like: ... (Pos%B_2)*B_1*B_0 +
(Pos%B_1)*B_0+Pos%B_0 where B_n are factors of p/q. If > r we have hit
a hole a must try again: Pos++.

Proving why it works is the laborious part, this was just a few quick
thoughts, hope it helps...

(*) mod 256. FCPRNGs per se are, by their very nature, progressively
less random towards the end of the cycle.


[0]: http://en.wikipedia.org/wiki/Primorial
[1]: https://github.com/cznic/mathutil/blob/master/nist-sts-2-1-1-report
[2]: https://github.com/cznic/mathutil/blob/master/example/example.go#L27

-j
Reply all
Reply to author
Forward
0 new messages