measure execution time - always zero

Greetings!

I experience some weird behavior when measuring the execution time of two loops.
On my laptop, I get certain finite results, but on a cluster node (after compiling on the cluster as well) one of the measured times is always zero.

Specifically, I want to measure the computational cost of a force function compared to a potential energy function (related to physics). I evaluate the two functions thousands of times in a loop and measure the time each loop takes to execute:


double costfactor_pot_vs_grad(const size_t N_trials, const double sig1, const double sig2, const double a1, const double a2, const double sig0, vector<double>& Xdata){

    vector<int> idx_arr(Xdata.size());    // needed to pass to force functions, but bears no meaning here
    params theta{1,1,1,1};                // dummy param

    // measure time for force computation - this works well on laptop and on cluster
    auto TG1 = chrono::high_resolution_clock::now();
    forces F;
    for (size_t k = 0; k < N_trials; ++k){
        F = get_noisy_force_GM(theta, sig1, sig2, a1, a2, Xdata, Xdata.size(), idx_arr, sig0);
    }
    auto TG2 = chrono::high_resolution_clock::now();
    auto ms1 = chrono::duration_cast<chrono::milliseconds>(TG2 - TG1);
    double TG = ms1.count();

    // measure time for potential energy computation - this gives always 0 on cluster
    auto TU1 = chrono::high_resolution_clock::now();
    double U;
    for (size_t k = 0; k < N_trials; ++k){
        U = U_pot_GM(theta, sig1, sig2, a1, a2, Xdata, sig0);
    }
    auto TU2 = chrono::high_resolution_clock::now();
    auto ms2 = chrono::duration_cast<chrono::milliseconds>(TU2 - TU1);
    double TU = ms2.count();

    return TU / TG;
}


On my laptop, I get the following values for the times:
TG = 27,000 ms
TU = 14,000 ms

On the cluster, however, I get
TG = 9,000 ms
TU = 0

While the cluster seems to be faster than my laptop, the numbers for TG imply a factor of 3. Why does TU evaluate to 0?

I should also note that both functions are linear in the vector Xdata. If I increase its length by a factor of ten, then the times I get will be larger by that factor as well, but TU will still be zero.

Any idea how that can happen?
My guess is that the compiler optimizes the loop away completely.
Hi Peter,

I thought that might be a thing... I do want compiler optimizations for the rest of my simulation, but in this function they keep me from doing what I want to do. Also, it's weird that the first loop is not optimized away then.
> Also, it's weird that the first loop is not optimized away then.
Well that depends on a few things.
What is the type 'forces'? Is it a complex struct/class?

Are the functions pure? See https://en.wikipedia.org/wiki/Pure_function

If U_pot_GM is a pure function and the compiler sees you never use U after the assignment, it's a pretty easy optimisation.

You could declare U as volatile:
volatile double U;
I was also thinking about volatile, which essentially makes the compiler think the result is used for something, but it doesn't really prevent the compiler from optimizing the function call. Even if it doesn't optimize everything away it might be able to realize some of the calculations are the same every time and hoist them out of the loop or calculate them at compile time.

It's quite hard to make benchmarks that measure the time spent in a meaningful way that represent real usage.

When using https://quick-bench.com (which I think uses Google Benchmark?) I would have passed the return value and all the arguments to benchmark::DoNotOptimize on each iteration to prevent the compiler from realizing that the return value is never used and that the arguments are always the same (using varying/random arguments might be a better idea).

for (size_t k = 0; k < N_trials; ++k) {
	benchmark::DoNotOptimize(theta);
	benchmark::DoNotOptimize(sig1);
	benchmark::DoNotOptimize(sig2);
	benchmark::DoNotOptimize(a1);
	benchmark::DoNotOptimize(a2);
	benchmark::DoNotOptimize(Xdata);
	benchmark::DoNotOptimize(sig0);
	benchmark::DoNotOptimize(U_pot_GM(theta, sig1, sig2, a1, a2, Xdata, sig0));
}

But compiler optimization is only part of the problem. If you repeat the same code again and again, especially with the same input, things tend to run faster because the branch predictor learns how to predict the branches, things that you used recently are more likely to be in the faster memory caches already, etc. This might not represent real usage where the function might be called only once in a while.

The performance can also be affected by small things like code alignment and how the code happened to get laid out in memory, which is beyond your control. That is probably more of a problem when the amount of code you're measuring is small.

It's perfectly possible that a change that makes your benchmark run faster actually makes the real world usage slower.
While the cluster seems to be faster than my laptop, the numbers for TG imply a factor of 3. Why does TU evaluate to 0?
If it's not faster by a factor but by a flat 18 ms, then you will get 0 where you had 14 ms.
One more thing to consider is the "resolution" (tick period) of the clock:

Just because the clock function returns the time in units of milliseconds (or microseconds, or nanoseconds) does not mean that the resolution of the clock actually is 1 millisecond (or 1 microsecond, or 1 nanosecond). Consequently, it's quite possible that the value returned by the clock remains the same for a certain period of time and then "suddenly" increments by n, where n can be way bigger than 1!

So, if the computation that you are trying to measure completes relatively fast, then it's very possible that the clock value simply hasn't changed in that short period of time (i.e. the function completes in less than a tick period), so you are actually computing the difference between the same value and thus you get zero.

high_resolution_clock is supposed to be the clock with the highest resolution (shortest tick period), but I don't think there is a guarantee on what the resolution (tick period) actually is on a specific system:

high_resolution_clock is the clock with the shortest tick period. It may be a synonym for system_clock or steady_clock.
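
As a sanity check, you can print the clock's nominal tick period and probe the smallest increment it actually reports. A minimal sketch:

#include <chrono>
#include <iostream>

int main()
{
    using clk = std::chrono::high_resolution_clock;

    // Nominal tick period as declared by the library (may be finer than
    // what the hardware/OS actually delivers).
    std::cout << "nominal period: "
              << double(clk::period::num) / clk::period::den << " s\n";

    // Probe the smallest step the clock actually reports: spin until now() changes.
    auto t0 = clk::now();
    auto t1 = t0;
    while (t1 == t0) t1 = clk::now();
    std::cout << "smallest observed step: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
              << " ns\n";
}

If the observed step turns out to be large (say, tens of milliseconds), then any interval shorter than that will round down to zero.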
It might be a better idea to simply use steady_clock because system_clock might be affected when the clock on your computer is adjusted.

Quote from cppreference.com:
The high_resolution_clock is not implemented consistently across different standard library implementations, and its use should be avoided. It is often just an alias for std::chrono::steady_clock or std::chrono::system_clock, but which one it is depends on the library or configuration. When it is a system_clock, it is not monotonic (e.g., the time can go backwards). For example, for gcc's libstdc++ it is system_clock, for MSVC it is steady_clock, and for clang's libc++ it depends on configuration.

Generally one should just use std::chrono::steady_clock or std::chrono::system_clock directly instead of std::chrono::high_resolution_clock: use steady_clock for duration measurements, and system_clock for wall-clock time.
https://en.cppreference.com/w/cpp/chrono/high_resolution_clock#Notes

This could potentially lead to some inaccuracies from time to time but it cannot explain the consistent 0 ms result you're getting even after increasing the amount of work. It doesn't sound like a resolution problem to me but it's something to keep in mind when measuring short time intervals. Not sure whether "9,000 ms" means "9 ms" or "9000 ms". The latter is not short so a low resolution clock would not make much of a difference there.
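
For illustration, here is the potential-energy timing from the first post with steady_clock swapped in, using a floating-point duration so a short interval isn't truncated to 0 the way duration_cast<milliseconds> truncates it (a sketch reusing the names from the original snippet):

auto t1 = std::chrono::steady_clock::now();
for (size_t k = 0; k < N_trials; ++k){
    U = U_pot_GM(theta, sig1, sig2, a1, a2, Xdata, sig0);
}
auto t2 = std::chrono::steady_clock::now();

// duration<double, milli> keeps fractional milliseconds instead of
// rounding down to whole milliseconds.
std::chrono::duration<double, std::milli> ms = t2 - t1;
double TU = ms.count();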
From your code ... how do you know what TU and TG are?

Without knowing what is in get_noisy_force_GM() and U_pot_GM(), your question is probably unanswerable.

If you send U_pot_GM() the same arguments each trial, and if the arguments are passed by value and there is no randomness within U_pot_GM(), then any reasonable optimiser might well say "why run this trial more than once - I'll get the same answer every time".
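
One way to defeat that (not from the original post; a sketch reusing its names) is to vary the input slightly each iteration and use the accumulated result afterwards, so the compiler can prove neither that the calls are identical nor that the result is unused:

double checksum = 0.0;
for (size_t k = 0; k < N_trials; ++k){
    Xdata[k % Xdata.size()] += 1e-12;    // tiny perturbation: the input is never the same twice
    checksum += U_pot_GM(theta, sig1, sig2, a1, a2, Xdata, sig0);
}
cout << "checksum: " << checksum << endl;    // the result is observably used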


TG = 27,000 ms
TU = 14,000 ms

Whole numbers of seconds?
A millisec is a long time in the environments I work in, and I don't even use it for stuff on my laptops anymore. Have you tried showing your time in microsecs?

Having said that, how do the actual result values look when you compare the outputs from the two environments? Are they the same, similar or very different?
Hi guys, thanks for the many responses.

I meant TG = 27,000 ms = 27s. In other words, if every loop iteration was actually executed I get macroscopic times. Both functions are very expensive, and I wanted to gauge their relative expense.

The return type forces of get_noisy_force_GM() is a simple structure consisting of two doubles.

If it's not faster by a factor but by a flat 18 ms, then you will get 0 where you had 14 ms.

In that case, I would have seen different behavior after increasing Xdata.

I tried using microseconds instead, but no change.

I am dumbfounded by the fact that there is no easy way to do this. In Python, this would be trivial...
In Python, this would be trivial...

Because Python code is not optimized as aggressively?
python makes everything easier. It typically runs between 3 and 10(!!) times slower than half decent c++ depending on what you are doing.

this should be pretty easy.
first, get rid of the casts and just print the duration. That should be in flat clock ticks, and it should NOT be zero; even if you just do nothing between the calls it should have a small value.
#include <chrono>
#include <iostream>
using namespace std;

int main()
{
    auto s = chrono::steady_clock::now();
    auto e = chrono::steady_clock::now();
    cout << (e - s).count() << endl;
}

that prints 100 for me. If you get zero, your steady_clock stinks.
Because Python code is not optimized as aggressively?
python makes everything easier. It typically runs between 3 and 10(!!) times slower than half decent c++ depending on what you are doing.

I am aware, which is why I typically use C++ for the more involved numerical computations.


@jonnin,
The code snippet you provided gives around 400-500 on my laptop and 50-80 on a cluster node.
I am going to test it in my setting now, but if the compiler optimizes away my loops, then simply using steady_clock (plus removing casting) is not going to solve my problem?
it won't solve what the optimizer did, it just should give a non-zero response.
if you want the loop to run, do something slightly different,
eg
U += instead of U =
and print U at the end so it isn't discarded.
the timing will be good enough to get a sense of the work done.
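
Applied to the loop from the first post, that could look like this (a sketch):

double U = 0.0;
for (size_t k = 0; k < N_trials; ++k){
    // accumulate instead of overwrite, so the calls cannot be dropped
    U += U_pot_GM(theta, sig1, sig2, a1, a2, Xdata, sig0);
}
cout << "checksum: " << U << endl;    // print U so the sum isn't discarded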

timing code can get weird too. The first time you run it could cache things so the next time you get a wildly different result (much faster). Running it with other stuff also running can slow it down (you can mess with priority to help that one).
The best thing you can do in my opinion is simply to time it IN PLACE, in the actual code you care about. Then it will optimize the way it does for that code, run it the number of times it needs to run with caching or not, and so on. Then you get a useful answer. Trying to do it standalone is begging for a useless result ... about all you can do with that is validate that one algorithm is better than another, or see if some internal tight-loop adjustment has a big enough effect to really matter. It's ok for that, but you can't trust the wall clock time result to be anything like it will be 'in action'.
That is unfortunate...

I need the fraction of cost in order to figure out how long my simulation is supposed to run, so I cannot just measure it "in place" while the simulation is already running.


My setting is as follows:
I want to compare the performance of methods A and B when both run the same amount of wall clock time.
In each iteration, method A uses the get_noisy_force_GM() function from above, which dominates the workload in a single iteration.
Method B, however, uses both get_noisy_force_GM() and U_pot_GM() in each iteration, and both of them dominate the workload per iteration.

Given the time it took method A to execute for N iterations, I want to gauge how many iterations method B should run to yield a similar execution time. For this, I need to gauge the relative cost of the two methods at the beginning of method B (i.e. before doing the actual simulation iterations). For example, if U_pot_GM() were twice as costly as get_noisy_force_GM(), each iteration of method B would cost three times as much as one of method A (one force call plus a potential call worth two force calls), so the number of iterations in method B should be three times lower than in method A.

Since my approach presented in the opening post does not seem to work, what about the following idea?
Instead of measuring the costs of the force/potential functions, I simply enter the execution time of method A as an input argument for method B. While the method B goes through its iterations, I use the chrono clocks to measure the elapsed time in each iteration, and stop the iterations when the total elapsed time reaches the given admissible execution time. That would tie in with what @jonnin recommended, focussing on measuring the times used in an "in practice" run.
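
A minimal sketch of that idea, where budget_ms (the measured runtime of method A) and do_iteration_B() are hypothetical names for illustration:

size_t run_B_for_budget(double budget_ms)
{
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    size_t iters = 0;
    // keep iterating until the elapsed time reaches the admissible budget
    while (std::chrono::duration<double, std::milli>(clock::now() - start).count() < budget_ms){
        do_iteration_B();
        ++iters;
    }
    return iters;
}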
sort of. If you do that, the act of looking at the time costs time, so B is a bit slower than it really would be in the results. It may be good enough for what you want.

I would not do it this way. What do you really have here? Do you want to run for N seconds and get as many iterations of both as possible on one CPU? In that case, can you just alternate them?
eg
for(some amount of time)
{
a(stuff);
b(stuff);
}

Or, if you want to know how many of each one you can do in X time
for(some time)
a(..);
for(some time)
b(..);
count # of iterations with the loop counters or whatever.
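
Concretely, that might look like the following sketch (call_a and call_b are hypothetical wrappers around the two functions):

// Count how many calls to f fit in a fixed wall-clock window.
// f should do observable work (e.g. accumulate into a printed variable)
// so the optimizer cannot drop the calls, as discussed above.
template <class F>
size_t count_in_window(F f, std::chrono::seconds window)
{
    auto start = std::chrono::steady_clock::now();
    size_t n = 0;
    while (std::chrono::steady_clock::now() - start < window){
        f();
        ++n;
    }
    return n;
}

// size_t na = count_in_window(call_a, std::chrono::seconds(5));
// size_t nb = count_in_window(call_b, std::chrono::seconds(5));
// double cost_b_over_a = double(na) / nb;  // per-call cost of b relative to a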

the longer it runs, the better the answer. But, as I said, real code and testing are better if similar. If you really run 10 loops and then do 50 other things and then come back to run 10 loops again, it's different from running 10 million loops in a row with nothing between.

you can't rely on timing data to schedule code against a wall clock. It's a guideline, but if it's a tight schedule, you have to run for the scheduled window and stop, not hope that because all your tests said 10000 loops was the right value you can put in a hard-coded 10k loops. It will eventually miss its deadline that way, or stop short and miss a few potential iterations. And hardware can change, or computer workload etc. can all affect it. Similarly, the ratio of A to B is going to vary some depending on what is going on; you can get a guideline but not an ensured result.
I think the second option is most closely related to what I want, i.e.:

Or, if you want to know how many of each one you can do in X time
for(some time)
a(..);
for(some time)
b(..);
count # of iterations with the loop counters or whatever.


The point is that A and B are two different methods that ultimately try to do the same thing. Imagine now that, for the same number of iterations, B leads to better results than A. However, B is more expensive, so it takes more time than A. Moreover, A could reach the same performance as B if one just let it run longer.
I want to test: If both run (roughly!) the same amount of wall clock time, B still reaches better results.


I am aware that hardware plays a role. Fortunately, I have access to cluster nodes where I can occupy entire nodes of K identical cores (without anybody else using the cores), so forcing my code to run a certain wall clock time should lead to results that are somewhat consistent, no?
yes, a dedicated machine bypasses the aggravations of timing something on, say, a home PC with 200 background services and whatever else going on.

You should get consistent results that way. As noted above, though, looping and checking the time every iteration adds a slight overhead cost, but it would answer your question (especially since, unlike the scenario I complained about, here both A & B have the same extra overhead).
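
If that per-iteration overhead worries you, one variant (again a sketch; budget and do_iteration_B() are hypothetical names, as before) is to check the clock only once per batch of iterations:

auto start = std::chrono::steady_clock::now();
size_t iters = 0;
for (;;){
    for (int i = 0; i < 64; ++i){    // check the clock only every 64 iterations
        do_iteration_B();
        ++iters;
    }
    if (std::chrono::steady_clock::now() - start >= budget) break;
}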

be sure to test many problems. I don't know what you are doing exactly (sorta, but not exactly) and whether changing the problem changes the speed of convergence to a solution, in which case, you could be up against a tough problem to crack.
Thank you jonnin,

this helped a lot!
