Challenge: Multithreaded KFY/Durstenfeld Shuffle?

in a single-threaded app, the algorithm is simple:

http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Modern_method
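for reference, a minimal untested sketch of the serial version in C++ (names are mine):

#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Durstenfeld: for i = n-1 down to 1, swap a[i] with a[j], j uniform in [0, i].
void fisher_yates(std::vector<std::uint32_t>& a, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    for (std::size_t i = a.size(); i-- > 1; )
    {
        std::size_t j = std::uniform_int_distribution<std::size_t>(0, i)(rng);
        std::swap(a[i], a[j]);
    }
}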

making this algorithm multi-threaded seems to be non-trivial since each iteration depends on the previous iteration

as a practical example, suppose I have 2 threads and 2 chunks of 2GB of memory (representing 4GB of contiguous memory that has been initialized from 0 to 2^32-1, but split into two chunks)

is there a way to turn the KFY shuffle into a multithreaded variant without severe lock contention?
representing 4GB of contiguous memory that has been initialized from 0 to 2^32-1
Well, that's not possible, since it takes at least 16 GiB to store that many numbers.

It can be easily parallelized if you can afford to allocate 2^32 mutexes. Each thread should lock the randomly chosen element before swapping. It will behave very well near the beginning, but the number of waits will increase as the array becomes nearly fully shuffled.
If you can't allocate that many mutexes but the CPU has an atomic compare-and-swap instruction, you can sort of simulate the behavior of mutexes; rather than having the thread wait, you go back up and pick a different index if the index is already being used by a different thread.
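A rough sketch of the compare-and-swap variant, assuming C++11 atomics (untested; splitting the index range among threads is my reading of the scheme, and the redraw means the behavior only approximates that of per-element mutexes):

#include <atomic>
#include <cstdint>
#include <random>
#include <utility>

// One busy flag per element; true means some thread has claimed it.
// Each worker owns the slice [lo, hi) and walks it top-down.
void cas_shuffle_range(std::uint32_t* data, std::atomic<bool>* busy,
                       std::size_t lo, std::size_t hi, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    for (std::size_t i = hi; i-- > lo; )
    {
        bool expected = false;
        while (!busy[i].compare_exchange_weak(expected, true))
            expected = false;                       // claim our own slot first
        for (;;)
        {
            std::size_t j = std::uniform_int_distribution<std::size_t>(0, i)(rng);
            if (j == i)
                break;                              // self-swap: nothing else to lock
            bool free = false;
            if (!busy[j].compare_exchange_strong(free, true))
                continue;                           // j is taken: redraw instead of waiting
            std::swap(data[i], data[j]);
            busy[j].store(false);
            break;
        }
        busy[i].store(false);
    }
}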
Otherwise, all I can think of is either so much more complex than the original algorithm that you wouldn't be gaining much or anything at all from the parallelization, or produces bad output. It'd be a different story if you were shuffling bigger things, but pretty much anything you can do is more expensive than moving integers around.
I don't really know anything about multi-threading, but let's ignore that...

First thread does the linear shuffle in the even positions of the array.
Second thread does the same in the odd positions.
Third thread (sorry about ignoring the example condition) goes through the results of the first two and has a random chance of swapping each pair.

This should take half the time of the linear version, even though it performs an additional n/4 swaps on average.
This method wouldn't be any good at all if you wanted more than 2(+1) threads, I think.
The third thread could probably be replaced by either some synchronization (is synchronization something to avoid?) or a shared constant array of random booleans (though building one might take almost as much time as doing the linear shuffle).

Are you asking because you need that algorithm or you just want to see what people will come up with?
First thread does the linear shuffle in the even positions of the array.
Second thread does the same in the odd positions.
Third thread (sorry about ignoring the example condition) goes through the results of the first two and has a random chance of swapping each pair.
This has race conditions. You can't guarantee that thread 3 will stay behind threads 1 and 2.

A better solution would be
1. Threads 0 and 1 shuffle even and odd positions.
2. Threads 0 and 1 terminate.
3. Threads 2 and 3 invert even and odd pairs at random.

The problem with this approach is that
1. It doesn't scale.
2. If it takes 100 seconds to perform the serial version, this takes on average 75 seconds: 50 seconds on step 1 (because each thread shuffles half the array) and 25 seconds on average on step 3 (because each thread will select half the array and on average swap half the selection). Any algorithm that takes more than one pass will have a similar problem.
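A sketch of this plan with std::thread (untested; all names are mine, and whether the output is uniform is the open question below):

#include <cstdint>
#include <functional>
#include <random>
#include <thread>
#include <utility>
#include <vector>

// Step 1 worker: Fisher-Yates over the subsequence at indexes
// parity, parity+2, parity+4, ... (parity = 0 for evens, 1 for odds).
void shuffle_stride2(std::vector<std::uint32_t>& a, std::size_t parity, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    std::size_t m = (a.size() - parity + 1) / 2;     // how many elements have this parity
    for (std::size_t k = m; k-- > 1; )
    {
        std::size_t r = std::uniform_int_distribution<std::size_t>(0, k)(rng);
        std::swap(a[parity + 2 * k], a[parity + 2 * r]);
    }
}

// Step 3 worker: swap each (even, odd) pair in [lo, hi) with probability 1/2.
// lo must be even so the range starts on a pair boundary.
void flip_pairs(std::vector<std::uint32_t>& a, std::size_t lo, std::size_t hi, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    for (std::size_t i = lo; i + 1 < hi; i += 2)
        if (rng() & 1)
            std::swap(a[i], a[i + 1]);
}

void two_phase_shuffle(std::vector<std::uint32_t>& a)
{
    std::random_device rd;
    std::thread t0(shuffle_stride2, std::ref(a), std::size_t(0), std::uint64_t(rd()));
    std::thread t1(shuffle_stride2, std::ref(a), std::size_t(1), std::uint64_t(rd()));
    t0.join(); t1.join();                               // step 2: the join is the barrier
    std::size_t mid = (a.size() / 2) & ~std::size_t(1); // split on a pair boundary
    std::thread t2(flip_pairs, std::ref(a), std::size_t(0), mid, std::uint64_t(rd()));
    std::thread t3(flip_pairs, std::ref(a), mid, a.size(), std::uint64_t(rd()));
    t2.join(); t3.join();
}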
shuffling evens and odds independently is an interesting idea, but I wonder if it skews the uniform distribution - I'm not 100% convinced that we would get as much mixing as if we did it normally in a single thread, even if we were to do a 2nd level of mixing between evens and odds

initially, I brought up the question because I thought I would be CPU bound, but due to my mis-calc (as helios pointed out), the bigger issue seems to be insufficient RAM causing virtual memory thrashing (so the total opposite of CPU-bound: the CPU is actually under-utilized)
You can use Intel's TBB threading library.

Using a continuation task, you could create two independent tasks to shuffle the odd and even numbers as hamsterman suggested, and then let the continuation task shuffle the positions randomly.
Also, I think a pipeline would work.

However, I don't know if this approach would prove slower than the others, since creating the tasks incurs some overhead.
Windows provides the interlocked functions. Specifically, the InterlockedCompareExchange() function could prove useful. Instead of having an equal number of semaphores to serialize access to items, the aforementioned function could be used on a control array of zeroes and ones, where 0 means the item is in use and 1 means the item is available. Using this approach, you double your memory usage, but you reuse synchronization objects (I think).

If the above is feasible, then you can have multiple threads do the same shuffling over the same data range.
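For example, an untested sketch (helper names are made up):

#include <windows.h>

// flags[i] == 1: element i is available; flags[i] == 0: element i is in use.
// Returns TRUE if this thread claimed element i (atomically flipped 1 to 0).
inline BOOL try_claim(volatile LONG* flags, SIZE_T i)
{
    return InterlockedCompareExchange(&flags[i], 0, 1) == 1;
}

// Make element i available again.
inline void release_item(volatile LONG* flags, SIZE_T i)
{
    InterlockedExchange(&flags[i], 1);
}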
I'm not 100% convinced that we would get as much mixing as if we did it normally in a single thread, even if we were to do a 2nd level of mixing between evens and odds
I agree, which is why I didn't mention it in my first post. However, although I'm not certain about it, I think a second pass over a partially shuffled array is as good as a single pass. The strategy is similar to the one used in stooge sort.

I'd recommend memory-mapping a file rather than using straight memory allocations. The system should have an easier time coordinating writes to that single file than to swap space, although it depends on how much of the array you can hold in RAM at any given time.
helios wrote:
You can't guarantee that thread 3 will stay behind threads 1 and 2.
I have a feeling you can. Keep the indexes used by the first two threads accessible to the third, and then read them and wait if needed. Reading an integer doesn't require locking, right?

kfmfe wrote:
I'm not 100% convinced that we would get as much mixing as if we did it normally in a single thread, even if we were to do a 2nd level of mixing between evens and odds
I'm not either. I'm pretty sure all permutations can be produced that way, though... Isn't there a way to make sure?

As for my algorithm, I just realized what it does. It is actually perfectly scalable. What I did there was looking at the array as a 2 x (n/2) grid, first shuffling the rows and then the columns. This could be less than O( 2*sqrt(n) ) if n is a square ( less because I still think that you can start shuffling columns before you're done with rows; doing that might be inefficient though... ).
Also, I have a feeling that the same method can be used for higher dimensions, which should be a lot more efficient when n is large.
You just have to analyze n before shuffling...
Reading an integer doesn't require locking, right?
Yes.

Keep the indexes used by the first two threads accessible to the third, and then read them and wait if needed.
This requires either a) the array of 2^32 mutexes, or b) a single mutex plus a spinlock.
Guaranteeing that a thread will stay behind another thread is not as easy as you seem to think.

This could be less than O( 2*sqrt(n) ) if n is a square ( less because I still think that you can start shuffling columns before you're done with rows; doing that might be inefficient though... ).
More like O(2n) (although this is poor notation). You're passing through each cell in the matrix exactly twice. The shape of the matrix doesn't change this, and you can't shuffle any row-column pair concurrently without locking, but any column or row can be shuffled independently.
All in all, it takes 2/thread_count times as long as the serial version, with a lower bound of 2/sqrt(n). With two threads, it takes just as long; with four, it takes half as long; etc.

EDIT: Here's a proof that this method can produce all permutations of the matrix:
A shuffler S can produce all permutations of v <=> S can move v[a] to v[b] for every pair 0 <= a < n, 0 <= b < n.
Element e begins at position m[a][b]. If we shuffle rows first, then after shuffling row a, e can be at m[a][c] for any 0 <= c < 2^16. After shuffling column c, e can be at m[d][c] for any 0 <= d < 2^16.
So S(m[a][b]) = m[d][c] for any d, c in [0; 2^16), for all a, b in [0; 2^16). QED.
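To make the grid picture concrete, an untested sketch of the two passes (row-major layout; names are mine). Every row can be shuffled independently of every other row, and likewise for columns:

#include <cstdint>
#include <random>
#include <utility>

// View the flat array m as a rows x cols matrix in row-major order.
void shuffle_row(std::uint32_t* m, std::size_t cols, std::size_t row, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    std::uint32_t* r = m + row * cols;
    for (std::size_t k = cols; k-- > 1; )
        std::swap(r[k], r[std::uniform_int_distribution<std::size_t>(0, k)(rng)]);
}

void shuffle_col(std::uint32_t* m, std::size_t rows, std::size_t cols,
                 std::size_t col, std::uint64_t seed)
{
    std::mt19937_64 rng(seed);
    for (std::size_t k = rows; k-- > 1; )
    {
        std::size_t r = std::uniform_int_distribution<std::size_t>(0, k)(rng);
        std::swap(m[k * cols + col], m[r * cols + col]);
    }
}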
ouch - PSU on my compute server went down this morning

you have to be careful with how a shuffle is done - not all of them will produce a uniform distribution of outcomes - with the evens/odds done separately and then shuffled again later, I don't know what the resultant distribution is, but I don't think it's uniform

@helios - I've come across memory-mapped files, but I've never tried using them before. Suppose I need 16GB of RAM to run the entire shuffle, but I only have 4GB of real RAM. How will using memory-mapped files help me? Would using custom-coded memory-mapped files be more efficient than the swap that is built into Linux?

I've also thought a little about shuffling over a network - if I had 4 machines with 4GB RAM each, this should be doable - perhaps there is a library that facilitates this kind of coding?
helios wrote:
More like O(2n) <...>
Indeed, I hadn't thought about how many threads would be needed to get the 2*sqrt(n) time, unless n was very small - even more so for higher dimensions. Still, I think that 2n/k time for k threads isn't bad at all...

kfmfe wrote:
<...> but I don't think it's uniform
Doesn't helios's proof prove that? It's the same as the linear shuffle, just instead of swapping a[i] with a[rand1], you swap a[x+w*y] with a[rand1+w*rand2]. If w is the correct width, this is as unbiased as any pair of random numbers, isn't it?
I suppose as an intellectual exercise this may be interesting, but why would you ever want to make this multithreaded? Both runtime and space are linear, so you are always going to be struggling with space. If you're not struggling with space, then the 'reasonable' average I've always used is about 10^8 operations per second, so you're not going to be struggling with time.
you have to be careful with how a shuffle is done - not all of them will produce a uniform distribution of outcomes - with the evens/odds done separately and then shuffled again later, I don't know what the resultant distribution is, but I don't think it's uniform
In reality, I only proved that shuffling a matrix over its individual dimensions produces no fewer permutations than shuffling a vector; I said nothing about the probability of reaching each permutation.
However, it's not hard to see that the uniformities of S and S2 are linked. If S is uniform, so is S2. If S is biased, S2 is quadratically so.
As long as you've designed the shuffler well, there should be no problem.

Suppose I need 16GB of RAM to run the entire shuffle, but I only have 4GB of real RAM. How will using memory-mapped files help me?
Consider how you access the array using a linear shuffle. There's a pointer that's always advancing in one direction and the elements on one side of it are never accessed again, while there's another pointer moving at random over the entire space.
If you try to put the array in memory, the OS will constantly attempt to have as much of it as possible in physical memory. You'll end up with thousands of pages in memory that were accessed only once and are then swapped out to make room for a new random portion of the array. You'll only see a speedup when the array is over 75% shuffled, because that portion fits completely in memory.
Conversely, the system is more conservative with its caching when using files. It will only cache if it can afford to do so without swapping out much of the memory of the other running processes. The system as a whole will behave better.
I don't think a memory-mapped file will be faster; in fact, it'll probably be a bit slower because the other method will see a huge speedup once it can fit the whole array in memory. But it will a) share resources better with the rest of the system and b) save you the trouble of writing the array to disk once you're done (I'm assuming you do want to do that).
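Setting one up is straightforward; a minimal POSIX sketch (64-bit Linux assumed; the file name and error handling are illustrative):

#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    const std::size_t n = std::size_t(1) << 32;           // 2^32 elements
    const std::size_t bytes = n * sizeof(std::uint32_t);  // 16 GiB backing file
    int fd = open("shuffle.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return 1;
    if (ftruncate(fd, static_cast<off_t>(bytes)) != 0)    // size the file up front
        return 1;
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;
    std::uint32_t* a = static_cast<std::uint32_t*>(p);
    // ... initialize a[0..n-1] and shuffle; the kernel pages cold parts
    // of the array out to shuffle.dat rather than to swap ...
    munmap(p, bytes);
    close(fd);
    return 0;
}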

Both runtime and space are linear. So you are always going to be struggling with space. If not struggling with space, then the `reasonable' average I've always used is about 10^8 operations per second, so you're not going to be struggling with time.
What do you mean by "reasonable average" and "operation"? There'll be a lot of I/O involved, so it's not exactly clear how fast it'll be.
Also, what does space have to do with this? Of course space is an issue. Concurrency doesn't attempt to solve that.
My point is that in any case you would want to run, space is going to stop you before time becomes an issue.

An operation is a cache read/write or an ALU operation. "Reasonable" is what I've generally found through experimentation to be an order-of-magnitude estimate in the right ballpark.

At 2^32 items, you require 16GB of memory. Each iteration generates a random number and then does two reads and two writes. For the sake of argument, let's say each PRNG call requires 10 operations (there are generators that will do fewer than this). There are n-1 iterations required, so that's then

2(n-1) + 2(n-1) + 10(n-1) = 14(n-1) operations.

let n = 2^32 and assume 10^8 operations per second; then 14(n-1) is roughly 6*10^10 operations, which works out to about 600 seconds - just over ten minutes.

Which is my question: What's the point of multithreading it?
10^8 operations
I don't see where this number comes from, but whatever.

What's the point of multithreading it?
In this case, none, because most of the time is pissed away waiting for I/O.
If we assume we have enough memory, threading would be useful if we had some kind of deadline. For example, if we had some other process capable of consuming the entire array in 7 minutes and we wanted to minimize the time it spends waiting.
10^8 comes from my experimentation getting SPOJ and Topcoder problems to run within the time limits. It may be nonsense, but I've found, especially on Topcoder which has a 2s time limit, that once you get much past 10^8 operations, you will time out.


In this case, none,


This is my point. I can't visualize any situation where there would be any advantage, certainly not with the development cost taken into account, in multithreading this algorithm.
It may be nonsense
In this context, it's an entirely meaningless number, actually. Doubly so because the run time is not CPU bound.

I can't visualize any situation where there would be any advantage, certainly not with the development cost taken into account, in multithreading this algorithm.
I just gave one such situation.
Also, the method proposed by hamsterman is embarrassingly parallel. It can be parallelized with nearly no effort and no synchronization at all. It would literally take me ten minutes to write the code.
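Something like this untested sketch, reusing the shuffle_row()/shuffle_col() helpers from my grid sketch above (one thread per row/column only for brevity; in practice each thread would take a batch):

#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Defined in the earlier sketch.
void shuffle_row(std::uint32_t* m, std::size_t cols, std::size_t row, std::uint64_t seed);
void shuffle_col(std::uint32_t* m, std::size_t rows, std::size_t cols,
                 std::size_t col, std::uint64_t seed);

// Rows are mutually independent, as are columns, so neither pass needs
// any locking; the only synchronization is the join between the passes.
void parallel_grid_shuffle(std::uint32_t* m, std::size_t rows, std::size_t cols)
{
    std::random_device rd;
    std::vector<std::thread> pool;
    for (std::size_t r = 0; r < rows; ++r)
        pool.emplace_back(shuffle_row, m, cols, r, std::uint64_t(rd()));
    for (auto& t : pool) t.join();
    pool.clear();
    for (std::size_t c = 0; c < cols; ++c)
        pool.emplace_back(shuffle_col, m, rows, cols, c, std::uint64_t(rd()));
    for (auto& t : pool) t.join();
}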
I assume we do have enough memory, and are not worrying about I/O. 16GB is perfectly reasonable on NUMA.

We are going to have to disagree on that being a reasonable example. Maybe it's my lack of imagination, but I can't envisage any real situation where the kind of time improvement you're talking about (if there actually was any) would be worth the development time - which would be far more than 10 minutes for me. If you can implement such an algorithm and have it fully tested in ten minutes you are a far better programmer than I am.

I hadn't read the whole thread, just the first post. However, I have now, and I would also question whether all permutations are reachable. But maybe I misunderstand the algorithm. I would argue:

1) If we consider 4 things, then that would be two rows of two, and two columns of two.
2) There are 2 permutations of 2 things.
3) Each column has two permutations and there are two columns, so 4 different possibilities after the first set of swaps
4) Each row has two permutations and there are two rows, so 4 possibilities for each state after stage 3. 4*4=16
5) There are 24 permutations of 4 things
6) 16 != 24

I think I may have misunderstood the second set of swaps in stage 4 though.
@helios - thank you for that detailed explanation of memory-mapped files!

following on kev82's practical example - suppose rows are A B and columns are 1 2, I believe he means something like this:

2) ( AB, BA ) or ( 12, 21 )
3) col A -> (12, 21 ), col B -> (12, 21 )
4) row 1 -> (AB, BA), row 2 -> (AB, BA )
5) 4! = 4 * 3 * 2 * 1 = 24 (uniform shuffle) for the set ( a b c d )
{ ( a, b, c, d ) ( a, c, b, d ) ( a, d, b, c ) ... }
6) 16 (two-tier shuffling) != 24 (linear-uniform shuffle)

I agree 100% with helios that hamsterman's algorithm is extremely well suited for parallelization; however, as I suspected, and as kev82 just pointed out, you will get a shuffle, but not a uniform shuffle

if our interpretation of stage 4 is incorrect, please correct us with a specific example with 4 elements - ty
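to check ourselves, here's a tiny brute-force enumeration of the 2x2 case under our interpretation - all 16 combinations of row/column swaps, counting distinct outcomes (it should print 16, matching kev82's count):

#include <array>
#include <iostream>
#include <set>
#include <utility>

int main()
{
    std::set<std::array<char, 4>> reached;
    // Grid layout: {m[0], m[1]} is row 1, {m[2], m[3]} is row 2;
    // the columns are {m[0], m[2]} and {m[1], m[3]}.
    for (int r1 = 0; r1 < 2; ++r1)        // swap row 1?
    for (int r2 = 0; r2 < 2; ++r2)        // swap row 2?
    for (int c1 = 0; c1 < 2; ++c1)        // swap column 1?
    for (int c2 = 0; c2 < 2; ++c2)        // swap column 2?
    {
        std::array<char, 4> m = { 'a', 'b', 'c', 'd' };
        if (r1) std::swap(m[0], m[1]);
        if (r2) std::swap(m[2], m[3]);
        if (c1) std::swap(m[0], m[2]);
        if (c2) std::swap(m[1], m[3]);
        reached.insert(m);
    }
    std::cout << reached.size() << " of 24 permutations reachable\n";
    return 0;
}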