Profiling.

Hello,

I've made some critical changes to my current project that I felt should improve the runtime of the algorithm significantly. It turns out they don't; some steps still seem a bit sluggish and I can't figure out why.

I'd like to use a profiler to check where the algorithm is wasting its time, but so far it hasn't been a success. I use "Very Sleepy", one of the few easy-to-use free tools I could find. The problem is that the output is cryptic at best, and possibly even completely bogus.

First of all, the overhead of profiling seems to be influencing the results: basic inlined accessor functions have significant overhead on just the function call (~20% of total time spent on such a function is pure function call overhead).

Secondly, some results are just incredibly odd, such as the following:
0.13s	for (NODE j = 1; j < Instance::n; ++j) {
3.84s		updateNode(i, j, C, V);
58.35s		updateNode(j, i, C, V);
	}

The same function is called and there is no reason that (i, j) would be more costly than (j, i). The only thing I can think of is caching behaviour, as part of the function updates a value in a 2D array, meaning the first call updates row-wise and the second call column-wise. The difference seems a bit extreme, though.


So, what am I doing wrong? Is it the tool? Is it my interpretation? How can I get trustworthy, useful results and how do I use them?
Compile the program with optimizations turned on if you have not already done that.

Check what your tool is actually measuring. Is it just counting clock cycles? Is it taking the memory cache into account (this will probably just be an estimate)? Does it include the time for system calls?

In this case: is the updateNode function using a loop or recursion? Could it be that the second call makes more iterations for some reason? It could be caused by the memory cache, as you said.
Optimizations are turned on (MSVC++ Release mode with a few manual changes).

Very Sleepy works by sampling. I found this description of the original version of "Sleepy" (later improved by someone else and became Very Sleepy):
The Sleepy profiler uses a technique where the profiler runs in a different thread from the target program. Every 1ms or so, the profiler thread suspends the target thread, and pulls out the current instruction pointer register value from the thread context. These mem addresses are resolved into procedure names and line numbers using debug information. This allows line-level resolution, without making any changes to the target program. The only requirement is that the target program is compiled with (MS) debug information.


The updateNode function is a very simple function:
- C and V are function pointers; the first checks two or three things based on i and j, then returns a boolean.
- If C passes, V is calculated; it's a simple add-and-subtract calculation.
- If the result is > 0, an object is created and added to a Heap.

There is no reason to assume that (i, j) calls will be more likely to pass C, have a positive V and require more heap work than (j, i) calls. In fact, in several phases of the algorithm, (i, j) and (j, i) are actually symmetrical calls.

It could be a caching problem, but the difference seems ridiculously large. Also, since the calls are together in a loop, wouldn't the second, "cache unfriendly" call ruin the caching of the first, "cache friendly" call?
(If not: how would I get around it?)
Any input on this?

I'm pretty convinced it has to be a caching issue, but 4sec vs 60sec seems like a ridiculously large difference.

The function itself is very simple, but it's executed billions of times. The cache would still have to be a pretty steep bottleneck to cause this result, though...
Doesn't the profiler tell you what it is?
The profiler is just a sampler. It tells me how much time is spent on function calls, but doesn't go deeper than that.
Have you considered that it is swapping virtual memory to disk for some reason? Perhaps try running some other tools to investigate. A profiler alone may not be telling the whole story.
Any suggestions on which tools I could use? Or what such a tool is called, so I can look around?
Since you are in the process of optimizing your algorithm, maybe you shouldn't try to understand why the second call is slower than the first, but instead try to merge the actions of the two calls into one single function?
VTune (Intel): not free, but one of my favorites (there is a free trial, though).
http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-2011-for-windows-evaluation/

AMD Code Analyst (I believe free). CodeAnalyst works fine on an Intel CPU, as long as you use Time-Based Sampling (TBS).
http://developer.amd.com/tools/CodeAnalyst/Pages/default.aspx
AMD CodeAnalyst doesn't seem to work (based on unsolved error reports on their forums for the problem I'm having), but I'll give VTune a try once I'm back in Visual Studio. Thanks for the tips!
Is the application cpu bound? Try watching top (in Linux). Other tools such as ltrace and strace may be helpful to see library and system calls. This may or may not help your performance investigation.

Some other tools that may be of use: vmstat, iostat, valgrind's kcachegrind tool, oprofile
Sadly I'm on Windows.

To provide a bit of an update: After some searching, I've decided to go with TAU, which apparently has a good reputation while being free. Currently upgrading to VS11Pro for compatibility.

I'll report back once I've had a chance to try TAU out.
Topic archived. No new replies allowed.