Compare two methods of random permutations

Mok-Kong Shen

unread,

May 8, 2013, 5:29:27 AM5/8/13

to

I want to compare in a practical sense two methods of random
permutations -- one theoretically perfect, namely that of Fisher and
Yates, and another ad hoc, let's call it X. A way of comparison I could
think of is the following:

One starts from the standard configuration of n objects [0, 1, 2,
...., n-1] and apply each method a fairly large number of times
successively to permute them and each time one computes the Hamming
distance of the result from the standard configuaration. One obtains
thus the frequency distribution of the Hamming distances for each
method. If the frequency distributions are fairly comparable to each
other, then X could be practically employed in place of the
theoretically perfect one.

Is this line of thought of mine correct? Does anyone have an idea of
a better method of comparison? Thanks in advance.

M. K. Shen

Ray Koopman

unread,

May 10, 2013, 1:44:50 AM5/10/13

to

I would count matches rather than mismatches. The distribution of the
number of matches ought to be proportional to row n in OEIS A008290.
However, passing that test would hardly be sufficient to convince a
skeptic to use your ad hoc method.

Rich Ulrich

unread,

May 10, 2013, 2:12:36 AM5/10/13

to

I start with guess-work on what you are doing.

I looked up Hamming distance, and it reminded me only
of some previous questions about matching DNA sequences.

I looked up Fisher and Yates, with Hamming, and came up
with the algorithm also known as Knuth Shuffle. That does
make a bit of sense if that is a way of finding how distant
two DNA strings are from each other.

Since F&Y is "theoretically perfect", I assume that the
problem is that it is too time-consuming, and you want to
do a partial enumeration instead of this (apparently
efficient) full enumeration. This seems logical, whatever
the ultimate goal.

I will suggest that instead of matching the whole range
of the distribution of the Hamming distances, you want
to match the relevant portion -- the matches that are close
enough to be interesting. One thing that comes to mind is
that Monte Carlo simulations to get p-levels require large Ns
if you want "exact p-values" that are very close for p=0.001
or smaller, compared to getting good estimates for p=0.5 or 0.05.
- IIRC, I read years ago about computational "swindles" that
reduced the time by ordering the comparisons so that the tails
were counted first, and counting could stop when a criterion
was reached.

And (I think) you want to look at the particular error distribution:
Matching the shape is no good if you don't select the same cases.

I won't be offended if you point out that your problem is very
different, and why this is all irrelevant.

--
Rich Ulrich

Mok-Kong Shen

unread,

May 10, 2013, 12:38:34 PM5/10/13

to

Thanks for the comments. Indeed what I need for my applications is in
the direction that you indicated. In particular I don't need to
generate the whole range of possible permutations but on the other hand
a not too limited small range. Consider the case n=52, one knows that
in-shuffle and out-shuffle have periods of 52 and 8 respectively, which
are negligibly small compared to the whole range (of size (52!)). On the
other hand, the method of Fisher and Yates needs n-1 (pseudo-)random
numbers. In one of my applications a couple of such random numbers are
already available for other reasons, so it would certainly be
conveninent and advantageous, if I could just use them for doing random
permutation without having to invoke a PRNG to acquire the n-1 random
numbers needed for F&Y. This was the motivation of my OP and, as it
turns out, my scheme runs also much faster than F&Y as reported further
below.

For n=52 I took the list [0, 1, 2, 3, ....., 51] (the standard
configuration) and applied diverse variants of my scheme (which are
based on in-shuffle and thus are actually to be considered as its
generalizations) as well as F&Y 100000000 times repeatedly. At each
step the Hamming distance (H-D) of the result with respect to the
standard configuration is computed, with its frequency distribution
listed below, where Scheme0 is just the normal shuffling employed in
card games, Scheme1 to Scheme4 are different variants of my scheme and
F&Y is Fisher and Yates and 'Expected' is the expected values
(rounded) based on combinatorics. I did 3 runs, the results are almost
the same. Below is one typical series, with runtime in seconds on my PC.

H-D Scheme0 Scheme1 Scheme2 Scheme3 Scheme4 F&Y Expected

0 1923076 0 0 0 0 0 0
1-40 0 0 0 0 0 0 0
41 0 0 0 0 0 1 1
42 0 13 11 10 14 9 10
43 0 111 97 111 124 115 101
44 0 860 872 858 867 910 912
45 0 7259 7346 7239 7468 7315 7299
46 0 51096 50790 51085 50983 51183 51094
47 0 306209 306935 307452 307651 306641 306566
48 0 1532985 1529885 1529360 1532516 1530699 1532831
49 0 6132833 6127981 6131083 6130604 6132153 6131324
50 0 18395994 18399139 18395668 18397303 18385756 18393972
51 0 36791857 36780612 36789226 36780703 36799043 36787944
52 98076924 36780783 36796332 36787908 36791767 36786175 36787944

Time 2878 4706 4417 3512 3208 17937

Scheme0 is the same as in normal card games, where one does a cut
(depending on the PRN ur in [0,n-1]) and then performs an (exact)
in-shuffle. Scheme1 does a cut and then divides the pack of cards into
two (depending on the PRN us in [0,n-1]) and performs an in-shuffle to
each and finally does an in-shuffle of the entire pack. Scheme2 is
the same as Scheme1, excepting with ur = us, i.e. one uses only one PRN.
Scheme3 and Scheme4 are the same as Scheme1 and Scheme2, excepting that
the final in-shuffle is absent. As can be seen, Scheme1 to Scheme4 are
all fairly comparable to F&Y and the theoretically values and run much
faster than F&Y.

The computation was done in Python with codes given further below,
using the built-in PRNG of Python. Scheme3 has been used in an
application of mine (see http://s13.zetaboards.com/Crypto/topic/6925232/1/)

It may be noted that e.g. Scheme3 could be employed also in real card
games with a presumably acceptable complication of the common manual
shuffling procedure.

M. K. Shen

---------------------------------------------------------

import random,time

def hamming(a,b,n):
h=0
for i in range(n):
if a[i] != b[i]:
h+=1
return(h)

# in-shuffle
def riffle(cards,n):
newcards=[]
nh=n//2
for i in range(nh):
newcards.append(cards[nh+i])
newcards.append(cards[i])
if nh*2<n:
newcards.append(cards[n-1])
return(newcards)

# This is taken over from what is done in normal card games.
def scheme0(cards,ur,n):
newcards=cards[0:ur]+cards[ur:n]
newcards=riffle(newcards,n)
return(newcards)

# For Scheme1 use 2 PRNs ur and us, for Scheme2 use 1 PRN, i.e. ur = us.
def scheme12(cards,ur,us,n):
newcards=cards[0:ur]+cards[ur:n]
newcards=riffle(newcards[0:us],us)+riffle(newcards[us:n],n-us)
newcards=riffle(newcards,n)
return(newcards)

# For Scheme3 use 2 PRNs ur and us, for Scheme4 use 1 PRN, i.e. ur = us.
def scheme34(cards,ur,us,n):
newcards=cards[0:ur]+cards[ur:n]
newcards=riffle(newcards[0:us],us)+riffle(newcards[us:n],n-us)
return(newcards)

# The theoretically perfect pseudo-random Permuation of Fisher and Yates.
def fisheryates(cards,n):
newcards=cards[:]
nn1=n-1
for i in range(nn1):
k=random.randint(i,nn1)
t=newcards[i]
newcards[i]=newcards[k]
newcards[k]=t
return(newcards)

# Example computation with Scheme3

def comp(n,times):
bb=[i for i in range(n)]
cc=[0 for i in range(n+1)]
bbc=bb[:]
hmin=n+1
imin=-1
start=time.clock()
for i in range(times):
sh=random.randint(0,n)
kn=random.randint(0,n)
bbc=scheme34(bbc,sh,kn,n)
h=hamming(bb,bbc,n)
cc[h]+=1
print("time",time.clock()-start)
print(cc)

# Run the example

comp(52,100000000)

Gordon Sande

unread,

May 10, 2013, 2:30:42 PM5/10/13

to

Usually the time spent in the random number generator is a very small
portion of a real task. It is only when the real task is empty, like when
testing randon numbers, that the speed of the randomization is of much
concern. The difference between small and a bit bigger matters when
compared to nothing but is of no concern when compared to something real
that is much larger. Usually much faster is reserved for things like
1000*1000 compared to 1000*log(1000)=1000*10. I believe that Python is
an interpreted langauge so is noticably slower that a compiled language
which one would use for a real task where timeing was important.

Mok-Kong Shen

unread,

May 11, 2013, 3:30:08 AM5/11/13

to

Am 10.05.2013 20:30, schrieb Gordon Sande:

> Usually the time spent in the random number generator is a very small
> portion of a real task. It is only when the real task is empty, like when
> testing randon numbers, that the speed of the randomization is of much
> concern. The difference between small and a bit bigger matters when
> compared to nothing but is of no concern when compared to something real
> that is much larger. Usually much faster is reserved for things like
> 1000*1000 compared to 1000*log(1000)=1000*10. I believe that Python is
> an interpreted langauge so is noticably slower that a compiled language
> which one would use for a real task where timeing was important.

Yes, Python is normally interpreted, but IMHO an interpreted run could
in some measure reflect the results in optimized runs in other
languages (though this of course must remain speculative without hard
facts). You are certainly right that my use of "much faster" is highly
problematic. (In fact for common programmers a 50% speed increase of
his codes may be something, but for persons in theoretical computer
science generally only exponential time improvements of algorithms
would be of significance.)

M. K. Shen

David Jones

unread,

May 12, 2013, 1:41:37 PM5/12/13

to

"Mok-Kong Shen" wrote in message news:kmks1l$52h$1...@news.albasani.net...

===============================================================

It is not clear what the real purpose is here, but there seems to be an
emphasis on computing time, and there needs to be a balance depending on the
actual context. There seems scope for introducing a pre-computed list of
permutations held in a file, so that multiple random selections (in
real-time) can be avoided. Different strategies can be thought up depending
on whether the file contents were themselves generated systematically or
randomly. In some applications it might be enough to do a single "full"
(expensive) randomisation to initialize, and then to read permutations of
this result from a file (or array) ... re-initialising every so often would
mean that the file need not be too large ( a 1000 permutations would/could
reduce computation time by a factor of 1000). I guess such ideas must be
fairly standard, but perhaps modern computing scenarios of file-space
limitations may have changed relevant balances.

David Jones

Mok-Kong Shen

unread,

May 13, 2013, 1:54:55 AM5/13/13

to

Am 12.05.2013 19:41, schrieb David Jones:

> It is not clear what the real purpose is here, but there seems to be an
> emphasis on computing time, and there needs to be a balance depending on
> the actual context. There seems scope for introducing a pre-computed
> list of permutations held in a file, so that multiple random selections
> (in real-time) can be avoided. Different strategies can be thought up
> depending on whether the file contents were themselves generated
> systematically or randomly. In some applications it might be enough to
> do a single "full" (expensive) randomisation to initialize, and then to
> read permutations of this result from a file (or array) ...
> re-initialising every so often would mean that the file need not be too
> large ( a 1000 permutations would/could reduce computation time by a
> factor of 1000). I guess such ideas must be fairly standard, but perhaps
> modern computing scenarios of file-space limitations may have changed
> relevant balances.

Thanks for the comment. My personal desire, i.e. for purposes
I myself have in mind, is to employ as few PRNs as possible (see
my 2nd post in this thread). The random permutations are for my
applications to be dependent on the context of the actual runs and
hence cannot be pre-computed -- just like in the cases of real card
games. Some higher efficiency (assuming that the interpreted Python
results in that aspect is analogously true for optimized versions in
other compiled languages) is only a secondary beneficial effect.
Currently I am repeating some more computations of the kind reported
earlier and trying thereby also a couple of code variations.

M. K. Shen

Message has been deleted