Challenge: Multithreaded KFY/Durstenfeld Shuffle?

Damn.
Following your algorithm, for n=4, the possible outputs where v[0]=0 are
0123
0132
0312
0321

0213 and 0231 are unreachable because of the order of operations. Getting to 0231 from 0123 would involve inverting column 1 then row 1, which is not possible because we're shuffling rows first.
It can be solved by shuffling columns, then rows, then columns again.
I'm wondering if there's some way to cut down the now absurd number of operations.
well, the really hard part is coming up with operations which are independent of each other (hamsterman had the right intent)

the original linear shuffling didn't lend itself well to multithreading due to its nature:

helios wrote:
Consider how you access the array using a linear shuffle. There's a pointer that's always advancing in one direction and the elements on one side of it are never accessed again, while there's another pointer moving at random over the entire space.


unfortunately, that second pointer affects subsequent operations (bad, very bad for parallelization)
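For reference, here is a minimal C++ sketch of the linear KFY/Durstenfeld loop being discussed (my own code, not from the thread) - the random swap target j is exactly helios's second pointer, and each swap depends on the array state left by every previous swap:

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Classic KFY/Durstenfeld shuffle. The index i only ever moves backward
// and never revisits elements behind it, but the swap target j jumps at
// random over [0, i] - that data dependency is what blocks parallelism.
void kfyShuffle(std::vector<int>& v, std::mt19937& rng) {
    if (v.size() < 2) return;
    for (std::size_t i = v.size() - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> dist(0, i);
        std::swap(v[i], v[dist(rng)]);  // depends on all earlier swaps
    }
}
```

Each of the n-1 iterations reads and writes a position chosen by the previous PRNG state, so no two iterations can safely run concurrently as written.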

Damn.
helios wrote:
It can be solved by shuffling columns, then rows, then columns again.
Now we're really getting close to stooge sort..
But that's irrelevant. It seems to me that the distribution would still be non uniform. For 4 elements there would be 6 potential swaps and therefore 26 sets of random booleans. There is no way to uniformly map 64 inputs to 24 outputs, is there?
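That counting argument can be checked by brute force. A sketch (my own code and names, not from the thread) that enumerates all 2^6 = 64 boolean vectors for the columns-rows-columns scheme on a 2x2 grid: every one of the 24 permutations is reached, but since 64 is not divisible by 24, the histogram cannot be flat:

```cpp
#include <array>
#include <map>

using Perm = std::array<int, 4>;  // row-major 2x2: v[0] v[1] / v[2] v[3]

static void colSwap(Perm& v, bool c0, bool c1) {
    if (c0) { int t = v[0]; v[0] = v[2]; v[2] = t; }  // within column 0
    if (c1) { int t = v[1]; v[1] = v[3]; v[3] = t; }  // within column 1
}
static void rowSwap(Perm& v, bool r0, bool r1) {
    if (r0) { int t = v[0]; v[0] = v[1]; v[1] = t; }  // within row 0
    if (r1) { int t = v[2]; v[2] = v[3]; v[3] = t; }  // within row 1
}

// Apply columns, then rows, then columns for every 6-bit input and
// count how often each output permutation appears.
std::map<Perm, int> shuffleHistogram() {
    std::map<Perm, int> hits;
    for (int bits = 0; bits < 64; ++bits) {
        Perm v{0, 1, 2, 3};
        colSwap(v, bits & 1, bits & 2);
        rowSwap(v, bits & 4, bits & 8);
        colSwap(v, bits & 16, bits & 32);
        ++hits[v];
    }
    return hits;
}
```

Running this shows all 24 permutations occur, with counts of 2 or 4 - confirming both helios's reachability claim and the non-uniformity.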
How about this?
With the array loaded in memory, pick the computer up and shake it really hard for one or two minutes; this should be long enough for the bits to dislodge from their slots and move about (make sure the computer is sealed airtight to avoid memory leaks). Now put the computer down and wait five minutes to let the bits settle.
Presto! Array shuffled in O(1).
I like it! 8')

all that's left to do is paint the box black and inscribe a large 8 on one side and a glass window on the other!
I'm still thinking about this problem every once in a while. I didn't come up with anything smart though. Unless you're willing to use twice as much space, of course.
I'm slowly considering a thought that if you shuffle two halves first, you only need C(2n,n) ways to combine them. You'd need (2n)! ways if you didn't shuffle them, so I suppose that's a gain. Though I'm not sure how to get anything from that.
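One concrete way to do that combining step: after each half is shuffled independently (e.g. on its own thread), one of the C(2n,n) interleavings can be drawn uniformly by repeatedly taking the next element from the first half with probability a/(a+b), where a and b are the counts still remaining in each half (the Gilbert-Shannon-Reeds riffle rule). A sketch with my own names, not from the thread:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Uniformly interleave two already-shuffled halves A and B, preserving
// the relative order within each half. Each of the C(|A|+|B|, |A|)
// interleavings has equal probability under the a/(a+b) rule.
std::vector<int> interleave(const std::vector<int>& A,
                            const std::vector<int>& B,
                            std::mt19937& rng) {
    std::vector<int> out;
    std::size_t a = A.size(), b = B.size();
    while (a + b > 0) {
        std::uniform_int_distribution<std::size_t> d(1, a + b);
        if (d(rng) <= a) { out.push_back(A[A.size() - a]); --a; }
        else             { out.push_back(B[B.size() - b]); --b; }
    }
    return out;
}
```

Note this only gives a uniform shuffle of the whole array if the two halves were themselves shuffled uniformly first.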

But that's not why I'm making this post...

Here's an observation. An array of n values has n! permutations. What KFY does is take a number between 0 and n!-1 (in the form of its digits in factorial base) and map it to a specific permutation. The problem is that PRNGs compute the random number r_n from r_(n-1), so basically you only map the seed to the n! permutations. Thus, unless you are willing to devote log2( (2^32)! ) bits to a random number, I don't see what kind of uniform randomness we are trying to achieve. I feel that bearing in mind that true randomness is not an option might lead to a different approach to pseudo-random permutation generation.
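This index-to-permutation mapping can be sketched directly (my own code and function name, not from the thread): the factorial-base digits of k act as a Lehmer code, each digit selecting the next element from the pool of elements not yet placed, giving the k-th permutation in lexicographic order.

```cpp
#include <cstdint>
#include <vector>

// Decode an index k in [0, n!-1] into the k-th permutation of
// {0, .., n-1} via its factorial-base digits (Lehmer code).
// Assumes n <= 20 so the factorials fit in 64 bits.
std::vector<int> nthPermutation(std::uint64_t k, int n) {
    std::vector<std::uint64_t> f(n, 1);  // f[d] = d!
    for (int d = 1; d < n; ++d) f[d] = f[d - 1] * d;

    std::vector<int> pool;               // elements not yet placed
    for (int i = 0; i < n; ++i) pool.push_back(i);

    std::vector<int> perm;
    for (int d = n - 1; d >= 0; --d) {   // digits, most significant first
        std::uint64_t digit = k / f[d];  // which remaining element to take
        k %= f[d];
        perm.push_back(pool[digit]);
        pool.erase(pool.begin() + digit);
    }
    return perm;
}
```

For n = 3 this reproduces the six permutations in order: index 0 gives 0 1 2, index 4 gives 2 0 1, index 5 gives 2 1 0.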

Did I get anything wrong?
hamsterman wrote:
What KFY does is take a number between 0 and n! (in a form of its digits in factorial base) and map it to a specific permutation.

that's a very good observation - so, in theory, you can take one random number in the range [0..32!-1], a rather big number, and be done in one shot, if you have the mapping well defined beforehand (it could just be some pre-specified ordering) - and if the algorithm is well defined, this dumping could also be done in parallel!

...but to have a PRNG that can generate that big of a range uniformly might not be easy, and even when you're done, you still have to dump that 32! list of numbers - the advantage is, of course, that you can take just one very large random number (well, the result may be small, but your sample space is humongous)

now, the interesting thing is, if you could generate those bits you want independently somehow (like it doesn't have to be on the same machine or you don't have to have the 2nd bit before you get the 3rd bit), then you definitely have an approach that you could run in parallel or out of sequence

the uniform requirement is just asking for no biases in your output - an extreme example might be a PRNG that just generates odd numbers or forms stripes or pretty patterns when you graph it against itself

so the question transforms to, how do you generate one random number that big without bias?

gj, hamsterman!
kfmfe wrote:
but if the algorithm is well defined - this dumping could also be done in parallel!
I'm not sure. Wikipedia doesn't mention any such algorithm except for one which works the same way as KFY.

My point was that if you're only going to choose one of 2^32 or 2^64 permutations anyway, you needn't worry about how random the algorithm is. You may as well use the one I suggested when the thread started. Or maybe devise one which doesn't require two passes over the whole array.
@hamsterman

An interesting point. The argument I suppose, is that your algorithm, even in the presence of a perfect rng will not produce all permutations, whereas the original one will. Whether you have a perfect rng or not is beside the point. That is the difference.

Practically though, I agree with you, but then again practically, I don't think there is any need for a multithreaded solution.
suppose we predefine the 3! permutations of { 0, 1, 2 } in the following order
0 0 1 2
1 0 2 1
2 1 0 2
3 1 2 0
4 2 0 1 
5 2 1 0

rather than shuffle the set { 0, 1, 2 } twice using KFY, we can
1. draw one random number from [0..5]
2. get the result from the row drawn in 1.

Let's say we drew from [0..5] and got 4 - I'd bet we can say that the result is 2 0 1 without having to generate these entries in one process or thread

Of course, right now, we don't care because we only have 3 numbers, but in the original problem, there are 2^32 or about 4 billion numbers per row. I posit that once we know which row from 1., we can generate those 4 billion numbers, perhaps the 1st billion on machine A, 2nd billion on machine B and so on to machine D, independently.

So I think you can parallelize 2. (if needed - maybe the run for 2. wouldn't take so long anyways)

However, I am not sure about 1. having to run in one process/machine, because 32! is one big-#!* number and you need a proper PRNG on top of that!

edit: btw, if you care about the quality of the randomness, you would want a uniform PRNG - for our 3-bit example, suppose that we used a really trashy PRNG that would generate in the following biased manner (I'm being extreme here):
P( 3 ) = 50%
P( 5 ) = 50%
P( elsewhere ) = 0%
such a biased PRNG would always have 0 as the 3rd number! Of course if you don't know it's biased and you just drew, you may not even notice or care - for most problems maybe you can get away with it, but I think that loosening the restrictions that much would make this problem much more trivial than my original intention

so let's stick to the requirement that the PRNG must produce uniform randomness
as a side-note, if we loosen the PRNG requirement for uniform randomness, I could just ask Joe down the hall to pick a number between 0 and 32!-1 (I wouldn't even need a computer for 1. - just one to generate 2.)
32! is not that big. It fits nicely in 118 bits. Generating four 32-bit random numbers should be enough.
good point! I think we've come up with an alternative to KFY that's very practical and splittable into multiple machines/processes/threads, if necessary!

Part 1. only requires 4 calls to a 32-bit PRNG
Part 2. needs some working out for the permutation algorithm, but it's definitely doable and doesn't seem to need to run on one machine since the algorithm is deterministic

perhaps this algorithm should be even faster than KFY, since we don't have to call the PRNG so often and we don't have to do all those memory swaps

recall that KFY for this problem requires 16GB of memory - I think our algorithm will barely need any memory and doesn't even have to run on one machine

gj, guys!
this is nuts - we could potentially expand this to 64-bit space or 128-bit space without requiring more memory - all we need is enough CPU cycles to generate the 64! or 128! numbers once we have our row!

edit: have we missed anything obvious? it definitely seems doable and the result should be equivalent to KFY, as far as I can tell...
If you use only 2 threads, I actually have a ridiculous suggestion (completely orthogonal to your original idea of splitting memory into two):

You can split the Fisher–Yates shuffle into two procedures:
The 1st thread generates the random numbers parametrizing the swaps and writes them to a carefully designed buffer.
The 2nd thread actually carries out the shuffle.

You have to be extra careful how you coordinate the two threads / design the buffer. My initial thought is: use a vector of buffers, and coordinate them so that
1) the 1st thread always writes to a different buffer from the one being read by the 2nd thread, and
2) the 2nd thread receives a buffer only after it has been filled up properly.
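A minimal sketch of this two-thread pipeline (my own code and names, not from the thread), using two alternating buffers and a std::condition_variable handshake: the producer thread fills one buffer with swap indices while the consumer drains the other and performs the swaps in order.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <random>
#include <thread>
#include <utility>
#include <vector>

struct SwapBuffer {
    std::vector<std::size_t> js;  // swap targets for consecutive KFY steps
    bool ready = false;           // true: filled, waiting for the consumer
};

void pipelinedShuffle(std::vector<int>& v, std::mt19937& rng) {
    if (v.size() < 2) return;
    const std::size_t chunk = 1024;
    SwapBuffer buf[2];
    std::mutex m;
    std::condition_variable cv;

    // Producer: generate j for KFY steps i = n-1 .. 1, a chunk at a time.
    std::thread producer([&] {
        int which = 0;
        for (std::size_t i = v.size() - 1; i > 0; ) {
            SwapBuffer& b = buf[which];
            { std::unique_lock<std::mutex> lk(m);
              cv.wait(lk, [&] { return !b.ready; }); }  // consumer freed it
            b.js.clear();
            for (std::size_t c = 0; c < chunk && i > 0; ++c, --i)
                b.js.push_back(
                    std::uniform_int_distribution<std::size_t>(0, i)(rng));
            { std::lock_guard<std::mutex> lk(m); b.ready = true; }
            cv.notify_all();
            which ^= 1;
        }
    });

    // Consumer (this thread): carry out the swaps in the original order.
    int which = 0;
    for (std::size_t i = v.size() - 1; i > 0; ) {
        SwapBuffer& b = buf[which];
        { std::unique_lock<std::mutex> lk(m);
          cv.wait(lk, [&] { return b.ready; }); }      // producer filled it
        for (std::size_t j : b.js) { std::swap(v[i], v[j]); --i; }
        { std::lock_guard<std::mutex> lk(m); b.ready = false; }
        cv.notify_all();
        which ^= 1;
    }
    producer.join();
}
```

The output is identical to a sequential KFY run with the same seed, since the PRNG is consumed in the same order; the two threads merely overlap number generation with swapping.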
@kfmfe

helios wrote:
32! is not that big
I don't know whether he was joking, only read your last post, or made a mistake. Anyway, while 32! fits in 118 bits, (2^32)! does not. Wolfram Alpha chokes on this number, but my guess would be that you need between 3 and 4 GB.

kfmfe wrote:
needs some working out for the permutation algorithm, but it's definitely doable and doesn't seem to need to run on one machine since the algorithm is deterministic
As I said, KFY is that algorithm. Good luck with finding another one. It seems like it should be doable.
By the way, why does a deterministic algorithm imply a possibility of parallelization?
now, given
0 0 1 2
1 0 2 1
2 1 0 2
3 1 2 0
4 2 0 1 
5 2 1 0

if we can come up with an algorithm for:
size_t getEntryAt( size_t rowIdx, size_t colIdx, size_t nItemsInSpace )

such that
getEntryAt( 4, 0, 3 )==2
getEntryAt( 4, 1, 3 )==0
getEntryAt( 4, 2, 3 )==1

etc...

then we don't need to even store the results - once we have the rowIdx randomly generated, we can get our entries at will (the permutation is effectively compressed into the algorithm inside getEntryAt() - no need for RAM or disk space!)
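Such an algorithm can be sketched with the same factorial-base (Lehmer code) decoding discussed earlier (my own implementation of the signature given above, not from the thread): digit k of rowIdx selects the k-th smallest item still unused at column k, so once rowIdx is fixed, any entry can be computed independently of the others.

```cpp
#include <cstddef>
#include <vector>

// Entry at (rowIdx, colIdx) of the permutation table, without storing
// the table. Assumes rowIdx < nItemsInSpace!, colIdx < nItemsInSpace,
// and nItemsInSpace small enough that the factorials fit in size_t.
std::size_t getEntryAt(std::size_t rowIdx, std::size_t colIdx,
                       std::size_t nItemsInSpace) {
    std::vector<std::size_t> f(nItemsInSpace, 1);   // f[d] = d!
    for (std::size_t d = 1; d < nItemsInSpace; ++d) f[d] = f[d - 1] * d;

    std::vector<std::size_t> pool;                  // items not yet placed
    for (std::size_t i = 0; i < nItemsInSpace; ++i) pool.push_back(i);

    for (std::size_t col = 0; ; ++col) {
        std::size_t w = nItemsInSpace - 1 - col;    // weight of this digit
        std::size_t digit = rowIdx / f[w];
        rowIdx %= f[w];
        if (col == colIdx) return pool[digit];
        pool.erase(pool.begin() + digit);
    }
}
```

This reproduces the table above: getEntryAt(4, 0, 3) == 2, getEntryAt(4, 1, 3) == 0, getEntryAt(4, 2, 3) == 1. (Note each call still walks the columns up to colIdx, so it is O(colIdx) work per entry rather than O(1).)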

Well, you still have to uniformly compress [0;2^118) to [0;32!). If you don't care about uniformity, there's no problem. If you do, one possibility is to generate numbers until it is within the range. There's a chance higher than .5 that a random 118-bit number will also be lower than 32!.
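That retry scheme can be sketched as rejection sampling (my own code and names, not from the thread; assumes a compiler with the GCC/Clang unsigned __int128 extension): build a 118-bit number from four 32-bit PRNG outputs and redraw until it falls below 32!, which keeps the accepted values uniform over [0; 32!).

```cpp
#include <cstdint>
#include <random>

using u128 = unsigned __int128;

u128 factorial32() {
    u128 f = 1;
    for (int i = 2; i <= 32; ++i) f *= i;
    return f;  // about 2.63e35, needs 118 bits
}

// Uniform draw from [0, 32!) by rejection: 128 random bits trimmed to
// 118, redrawn while the result is out of range. Acceptance probability
// is 32!/2^118, roughly 0.79, so the expected number of draws is small.
u128 uniformBelow32Fact(std::mt19937& rng) {
    const u128 limit = factorial32();
    for (;;) {
        u128 r = 0;
        for (int k = 0; k < 4; ++k)
            r = (r << 32) | rng();  // four 32-bit outputs -> 128 bits
        r >>= 10;                   // keep 118 bits
        if (r < limit) return r;
    }
}
```

The rejection step is what preserves uniformity; simply taking the 118-bit value modulo 32! would bias the low end of the range.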
hamsterman wrote:
By the way, why does a deterministic algorithm imply a possibility of parallelization?

look at the example I just posted - once we know rowIdx (by random draw), the algorithm becomes deterministic - the entries also don't depend on each other if we just base the algorithm on strict ordering

if we can generate the number arbitrarily by using the getEntryAt() call, it doesn't have to be on one machine any more - we don't even need memory or disk space to store the result!

edit: hamsterman, due to your fantastic insight, I think we have a kick-ass algorithm!
according to google,

32! = 2.63130837 × 10^35
2^118 = 3.32306999 × 10^35

so I think helios isn't joking

edit:
helios wrote:
Well, you still have to uniformly compress [0;2^118) to [0;32!). If you don't care about uniformity, there's no problem. If you do, one possibility is to generate numbers until it is within the range.

he's right - if you just keep throwing out results until you hit the range, you still satisfy the uniformity requirement, I think...