I'm not sure how you managed to implement a multi-threaded introsort without knowing the answer to your question.
Multi-threaded only makes a difference if each thread is actually running on a different core. Further, it's a red-herring. The
efficiency of the algorithm does not change -- you are simply (assuming you are using all four cores) performing some sorting in parallel, which reduces the overall time necessary to sort, but not the overall amount of work.
Musser's paper is really very easy read, and he explains quite simply the purpose of the switch:
- a quicksort typically executes faster than a heapsort
- except when it has bad data, when its execution becomes really inefficient
- heapsort cannot be made inefficient by bad data
Musser noticed that the cutoff is easily calculated by the number of subproblems generated by a quicksort -- if it branches down to more than 2*floor(log2(n)) recursions then the quicksort becomes less efficient than a heapsort, so rather than continue it just switches to heapsort.
In other words, if the quicksort has to switch to a heapsort, it is bottoming out at its worst performance.
What I suspect is happening, (assuming you are successfully partitioning the data to all four cores) is that you are computing the problem depth based upon N, instead of N/4. In other words, you've broken the efficiency switch by whatever you've done to make the algorithm work multi-core.
Remember, 2*log2(n/4) = 2*(log2(n) - log2(4)) = 2*(log2(n) - 2) = 4*(1/2*log2(n))
Take out that 4 (for the number of cores) you should have a worst-case efficiency switch at log2(n)/2.
Your introsort is performing sub-optimally.
Keep in mind as well:
- quicksort has nice locality of reference
- heapsort never does
Don't feel too bad. Introsort is tricky to get right. Fix your errors and you should have a very efficient algorithm.
As a matter of (rhetorical) interest:
Your multi-threaded algorithm works by
(1) determining if your data is actually large enough to need multiple cores, and how many it would like
(2) identifying the actual number of cores it can use
(3) performs the first 1 or 2 iterations of quicksort to split the data to the requisite number of partitions (== number of cores to be used)
(4) initiates a thread for each core that runs a
normal introsort on its data partition
Correct?
Oh, also, I hope you aren't actually using
std::log2() to calculate the maximum recursion depth. And that you are using the deferred insertion sort at the end.
Hope this helps.
[edit] Sorry, forgot this:
http://www.cplusplus.com/faq/sequences/sequencing/sort-algorithms/introsort/