Shuffling Random Numbers - lottsa them

the algorithm for shuffling a small number of random numbers is simple

suppose we want to shuffle N numbers from 0 to N-1

1. initialize a vector<int> v with the N values 0 to N-1
2. let n = v.size()
3. if n > 0, randomly pick an index i between 0 and n-1, set x = v[i], and remove that element from v - x is your next random number
4. go to 2 if v is not empty

this works great if N is not a large number

but suppose N = 4294967295, the maximum 32-bit unsigned int - at 4 bytes per int, holding all ~2^32 values takes about 16GB, so with 1GB of RAM (for reference, 1GB ≈ 10^9 bytes) you will run out of memory

if N is not the full 32 bits - for instance, if we only care about 100,000 unique numbers - we can go in the other direction: start with an empty map, fill it with the values we have already picked, and re-roll whenever we hit one of them... it all depends on how much of the space we are filling

if it's the entire space (initial example), we'd like to allocate a vector/array of values left, but we may run out of memory

if it's a small part of the entire space (second example), we can use a map of the values we picked and suffer the possibility of picking ones we already picked

so here's the question: what is a good algorithm for the initial case where we want to randomly shuffle the entire 32-bit integer space?

any ideas?

edit: no cheating and saying that you have enough RAM for the entire 32-bit space!
Use a file.


You may not like this answer as it requires a bit of knowledge of group theory.

Under multiplication modulo p, where p is prime, the natural numbers 1 to p-1 form a group. If we consider the iteration x_{n+1} = m*x_n mod p, then there are some numbers m for which the resulting sequence visits every number in the group before repeating. Such numbers are called primitive roots of p. For some primitive roots the sequence appears quite random. This is the basis of how random number generators worked 50 years ago.

2^32 - 1 turns out not to be prime (it factors as 3*5*17*257*65537), but 2^31 - 1 is a Mersenne prime, so for every primitive root of 2^31 - 1 you have a seemingly random ordering of the integers from 1 to 2^31 - 2. Not the whole 32-bit space, but the same trick works with any prime just above your range if you skip the few generated values that fall outside it.

How to find those primitive roots is another question entirely.

fascinating!

in essence, what you are saying is that if I knew a primitive root of the prime p, then I should be able to traverse the entire group?

but the sequence would always remain the same, wouldn't it? (I suppose that's also a "feature" of all pseudo-random number generators, so no foul...)
Any primitive root will traverse the group, that is in fact one definition of a primitive root. The sequence is unique for each primitive root, but there are lots and lots of primitive roots, and each will generate a different sequence.

As I understand it, isn't that what built-in pseudo-random number generator functions use?
Not for many years.

The original random number generators used

x_{i+1} = m*x_i + c modulo p

This allows 0 to be included. There is a theorem (Hull-Dobell, if I remember the name right) giving the conditions on m, c, and p to get a full cycle, but I can't remember the details of it.

These random number generators actually have terrible properties in terms of randomness. However, they are still used in multithreaded environments, because the recurrence can be solved in closed form, so each thread can generate its stream of random numbers repeatably no matter the order of execution.

Most of the generators nowadays are based on feedback shift registers. I believe, although my memory may be lying, that the recurrence is then

x_n = (c_1*x_{n-1} + c_2*x_{n-2} + ... + c_k*x_{n-k}) mod 2

for some k and some set of constants c_i.

The returned number is made from the binary sequence x_n, x_{n-1}, x_{n-2}, ...

I believe the Mersenne Twister used in Boost is a generalised linear feedback shift register.

If you are interested in the actual properties of the random numbers that come from these generators, the DIEHARD tests are quite old, but worth looking into.
Wait. I'm confused. The OP says "shuffle", but the rest of the thread seems to talk about PRNG.
well, the ultimate goal is to shuffle a large number of items

if we were using a true random number generator (TRNG), we couldn't game the system and would have to find a true algorithm to solve the problem

however, since most of us only have access to a PRNG (myself included), discussing ways to peek inside the black box so we can get the job done is ok by me

@kev82 - I always wondered about the metrics used to determine a good PRNG - do you use statistics and sampling theory? I mean, if you had two RNGs - one a decent PRNG and the other a TRNG (say, built off sampling radioactive decay) - could you determine which one is which, after taking a million samples of each, within a decent degree of confidence?

edit: interesting stuff - http://en.wikipedia.org/wiki/Diehard_tests

edit: this one's also good: http://www.robertnz.net/true_rng.html
Then there's something I'm missing.
If you have an array of numbers you want to shuffle, at most you need enough extra memory to hold one more element. For i in [0,n-1) do swap(v[i],v[rand(i,n)]) where i<=rand(i,n)<n (i itself must be a possible pick, otherwise only cyclic permutations can come out).
How does knowing how rand() is implemented help you optimize the algorithm in space? Unless there's some other dimension to the problem involving caching part of a file to memory, or something like that.
EDIT: Or unless you're avoiding the in-place version for a particular reason.
well, ultimately, it would be nice to use the standard rand() and treat it as a black box

however, if you can choose your box or look inside it, rnd_whitebox() can be thought of as a compression of its entire output

in other words, mathematically, there is no difference between the set of all random numbers generated by rnd_whitebox() and the algorithm rnd_whitebox() itself (with its starting parameters and current position) - provided rnd_whitebox() traverses the entire space, visits each element exactly once, and does so in a satisfyingly pseudo-random fashion

well, there is no difference as long as we don't need the entire solution in memory at once (which we can't have anyway, given my assumption of insufficient memory)

for practical purposes, though, there is a huge difference between the two - the former can take a ton of memory if N is big, while the latter takes very little if we grab the solution a part at a time
question for @kev82

suppose we have a PRNG which traverses the entire 32-bit space one element at a time and never repeats

wouldn't this function be bad for a general RNG because it produces too perfect a result?

as an analogy, if I had a six-sided die which, over N = 6000 rolls, produced exactly 1000 1's, 1000 2's, ..., 1000 6's every time, that would be a suspicious-looking RNG, I would think
Ah, I see, now.

http://en.wikipedia.org/wiki/Kolmogorov_complexity
If you were able to store an arbitrary sequence of 2^32 unique numbers, or a generator for that sequence, in much less memory than the raw sequence takes (it depends on the sequence - 0, 1, 2, ... takes very little room), you'd have the equivalent of a universal lossless compression algorithm.

suppose we have a PRNG which traverses the entire 32-bit space one element at a time and never repeats

wouldn't this function be bad for a general RNG because it produces too perfect a result?
It would be the exact opposite of an RNG, since as time goes on, its results become increasingly predictable.
I haven't worked in this area for a while, so don't take this as 100% accurate.

@kfmfe04

You would be surprised how many people writing simulations just call rand() without a second thought. Generally, for any application, you can write down the requirements for your random numbers and then design statistical tests to make sure those requirements are satisfied.

As for trying to identify a PRNG from a TRNG, it can be very hard, impossible perhaps. By definition of a PRNG there will be some feature of its output that is not true of the TRNG. If you knew what that property was, you could design a test to identify it. The difficulty is finding that property. A good PRNG is one where these properties are not noticeable.

The first example I gave is very easy to identify. The simplest property I can think of is whether you can get the same number twice in a row.

@helios

Your solution

For i in [0,n-1) do swap(v[i],v[rand(i,n)]) where i<=rand(i,n)<n

Requires storage of the array to start with. I believe this was disallowed in the problem.

@kfmfe04

Yes, it would be rubbish generally, but as I said above it depends on your application. For the aim of shuffling your set of numbers it is perfect.

The way to approach this problem is the following.

Let us define the set of PRNG states S and the set of outputs O. Let us also define two functions P : S -> S and Q : S -> O, where P is the random number iteration and Q converts a state to an output.

When we generate a random number we basically do

s_{i+1} = P(s_i)
o_i = Q(s_i)

I've forgotten the right words, but as long as Q covers all members of O, and many members of S map to the same member of O (onto but not one-to-one), then your problem goes away. So far you have only been using the identity function for Q. What if you had a generator with period 2^128 - 1, but the number you returned was taken modulo 2^32?

The other example of this is what I was explaining to exiledAussie: with a feedback shift register of, say, 500 bits, you can take 32 bits from somewhere in the middle.
kev82 wrote:
You would be surprised how many people who are writing simulations, just call rand() without a second thought.


I've been guilty of that one, unfortunately!

@kev82 - I think I need to get a better foundation before I can get a better grasp of what you mean.

Any suggestions for good books on group theory (more applied than theoretical)?
The last bit isn't group theory, it's analysis, I guess. What it's trying to say is that instead of taking the output of the generator directly, you apply a function to it. By making sure the function does not give a 1-1 correspondence between state and output, there will be repeats and the output will not have the exact perfection you are talking about.

Non theoretical books on group theory? Ha.

Seminumerical Algorithms by Knuth has a massive section on this. He also goes on to talk about trinomials and shift registers, which are the basis of the more advanced PRNG techniques.
Topic archived. No new replies allowed.