suppose we predefine 3! to have the following permutations
0 0 1 2
1 0 2 1
2 1 0 2
3 1 2 0
4 2 0 1
5 2 1 0 |
rather than shuffle the set { 0, 1, 2 } twice using KFY, we can
1. draw one random number from [0..5]
2. get the result from the row drawn in 1.
Let's say we drew from [0..5] and got 4 - I'd bet we can say that the result is 2 0 1
without having to generate these entries in one process or thread
Of course, right now, we don't care because we only have 3 numbers, but in the original problem, there are 2
32 or about 4 billion numbers per row. I posit that once we know which row from 1., we can generate those 4 billion numbers, perhaps the 1st billion on machine A, 2nd billion on machine B and so on to machine D, independently.
So I think you can parallelize 2. (if needed - maybe the run for 2. wouldn't take so long anyways)
However, I am not sure about 1. having to run in one process/machine, because 32! is one big-#!* number and you need a proper PRNG on top of that!
edit: btw, if you care about the quality of the randomness, you would want an uniform PRNG - for our 3-bit example, suppose that we used a really trashy PRNG that would generate in the following biased manner (I'm being extreme here):
P( 3 ) = 50%
P( 5 ) = 50%
P( elsewhere ) = 0%
such a biased PRNG would always have 0 as the 3rd number! Of course if you don't know it's biased and you just drew, you may not even notice or care - for most problems maybe you can get away with it, but I think that loosening the restrictions that much would make this problem much more trivial than my original intention
so let's stick to the requirement that the PRNG must produce uniform randomness