Optimizations are turned on (MSVC++ Release mode with a few manual changes).
Very Sleepy works by sampling. I found this description of the original "Sleepy" (which was later improved by someone else and became Very Sleepy):
The Sleepy profiler uses a technique where the profiler runs in a different thread from the target program. Every 1ms or so, the profiler thread suspends the target thread, and pulls out the current instruction pointer register value from the thread context. These mem addresses are resolved into procedure names and line numbers using debug information. This allows line-level resolution, without making any changes to the target program. The only requirement is that the target program is compiled with (MS) debug information.
The updateNode function is a very simple function:
-C and V are function pointers; the first checks two or three things based on i and j, then returns a boolean.
-If C passes, V is calculated; it's a simple add-and-subtract calculation.
-If the result is > 0, an object is created and added to a Heap.
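To make the three steps above concrete, here is a rough sketch of the shape of updateNode. It is not the real code: only updateNode, C, V, i and j are names from the description, while Candidate, ByValue, Heap, exampleC and exampleV are made-up stand-ins, and the heap is assumed to be a plain std::priority_queue.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical signatures: the post only says C and V are function pointers.
using CheckFn = bool (*)(int i, int j);
using ValueFn = int (*)(int i, int j);

// Hypothetical object that gets created when V is positive.
struct Candidate {
    int i;
    int j;
    int value;
};

// Orders the heap so that the largest value is on top (an assumption).
struct ByValue {
    bool operator()(const Candidate& a, const Candidate& b) const {
        return a.value < b.value;
    }
};

using Heap = std::priority_queue<Candidate, std::vector<Candidate>, ByValue>;

// Returns true if a candidate was created and pushed onto the heap.
bool updateNode(int i, int j, CheckFn C, ValueFn V, Heap& heap) {
    assert(C != nullptr && V != nullptr);
    if (!C(i, j)) {
        return false;               // step 1: the cheap boolean check failed
    }
    const int v = V(i, j);          // step 2: simple add-and-subtract
    if (v <= 0) {
        return false;               // nothing to add
    }
    heap.push(Candidate{i, j, v});  // step 3: create the object, add to heap
    return true;
}

// Stand-in callbacks, deliberately symmetric in i and j.
bool exampleC(int i, int j) { return i != j && i >= 0 && j >= 0; }
int exampleV(int i, int j) { return i + j - 5; }
```

With these symmetric stand-ins, updateNode(4, 3, ...) and updateNode(3, 4, ...) do exactly the same amount of work, which is the situation described for several phases of the algorithm.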
There is no reason to assume that (i, j) calls will be more likely to pass C, have a positive V and require more heap work than (j, i) calls. In fact, in several phases of the algorithm, (i, j) and (j, i) are actually symmetrical calls.
It could be a caching problem, but the difference seems ridiculously large. Also, since the calls are together in a loop, wouldn't the second, "cache unfriendly" call ruin the caching of the first, "cache friendly" call?
(If not: how would I get around it?)