measure execution time - always zero

My guess is that the compiler optimizes the loop away completely.

May 18, 2022 at 8:48pm

Hi Peter,

I thought that might be a thing... I do want compiler optimizations for the rest of my simulation, but in this function it keeps me from doing what I want to do. Also, it's weird that the first loop is not optimized away then.

May 19, 2022 at 4:54am

salem c (3715)

> Also, it's weird that the first loop is not optimized away then.
Well that depends on a few things.
What is the type 'forces'? Is it a complex struct/class ?

Are the functions pure? See https://en.wikipedia.org/wiki/Pure_function

If U_pot_GM is a pure function and the compiler sees you never use U after the assignment, it's a pretty easy optimisation.

You could declare line 30 as volatile.
volatile double U;

May 19, 2022 at 7:54am

I was also thinking about volatile, which essentially makes the compiler think the result is used for something, but it doesn't really prevent the compiler from optimizing the function call. Even if it doesn't optimize everything away it might be able to realize some of the calculations are the same every time and hoist them out of the loop or calculate them at compile time.

It's quite hard to make benchmarks that measure the time spent in a meaningful way that represent real usage.

When using https://quick-bench.com (which I think uses Google Benchmark?) I would have passed the return value and all the arguments to benchmark::DoNotOptimize on each iteration to prevent the compiler from realizing that the return value is never used and that the arguments are always the same (using varying/random arguments might be a better idea).

for (size_t k = 0; k < N_trials; ++k) {
	benchmark::DoNotOptimize(theta);
	benchmark::DoNotOptimize(sig1);
	benchmark::DoNotOptimize(sig2);
	benchmark::DoNotOptimize(a1);
	benchmark::DoNotOptimize(a2);
	benchmark::DoNotOptimize(Xdata);
	benchmark::DoNotOptimize(sig0);
	benchmark::DoNotOptimize(U_pot_GM(theta, sig1, sig2, a1, a2, Xdata, sig0));
}

But compiler optimization is only part of the problem. If you repeat the same code again and again, especially if you use the same input, things tend to run faster because the branch predictor learns how to predict the branches, things that you used recently is more likely to be in the faster memory caches already, etc. This might not represent real usage where the function might be called only once in a while.

The performance can also be affected by small things like code alignment, how the code happened to get laid out in memory, which is beyond your control but this is probably more of a problem if the amount of code that you're measuring is small.

It's perfectly possible that a change that makes your benchmark run faster actually makes the real world usage slower.

Last edited on May 19, 2022 at 8:09am

May 19, 2022 at 8:00am

coder777 (8449)

While the cluster seems to be faster than my laptop, the numbers for TG imply a factor of 3. Why does TU evaluate to 0?

When it's not a factor but 18ms faster then you will have 0 in case of 14ms.

May 19, 2022 at 8:26am

kigar64551 (837)

One more thing to consider is the "resolution" (tick period) of the clock:

Just because the clock function returns the time in units of milliseconds (or microseconds, or nanoseconds), there is no guarantee that the resolution of the clock actually is 1 millisecond (or 1 microsecond, or 1 nanosecond). Consequently, it's quite possible that the value returned by the clock remains the same for a certain period of time and then "suddenly" increments by n, where n can be way bigger than 1!

So, if the computation that you are trying to measure completes relatively fast, then it's very possible that the clock value simply hasn't changed in that short period of time (i.e. the function completes in less than a tick period), so you are actually computing the difference between the same value and thus you get zero.

high_resolution_clock is supposed to be the clock with the highest resolution (shortest tick period), but I don't think there is a guarantee on what the resolution (tick period) actually is on a specific system:

high_resolution_clock is the clock with the shortest tick period. It may be a synonym for system_clock or steady_clock.

Last edited on May 19, 2022 at 8:34am

May 19, 2022 at 8:39am

It might be a better idea to simply use steady_clock because system_clock might be affected when the clock on your computer is adjusted.

Quote from cppreference.com:

The high_resolution_clock is not implemented consistently across different standard library implementations, and its use should be avoided. It is often just an alias for std::chrono::steady_clock or std::chrono::system_clock, but which one it is depends on the library or configuration. When it is a system_clock, it is not monotonic (e.g., the time can go backwards). For example, for gcc's libstdc++ it is system_clock, for MSVC it is steady_clock, and for clang's libc++ it depends on configuration.

Generally one should just use std::chrono::steady_clock or std::chrono::system_clock directly instead of std::chrono::high_resolution_clock: use steady_clock for duration measurements, and system_clock for wall-clock time.

https://en.cppreference.com/w/cpp/chrono/high_resolution_clock#Notes

This could potentially lead to some inaccuracies from time to time but it cannot explain the consistent 0 ms result you're getting even after increasing the amount of work. It doesn't sound like a resolution problem to me but it's something to keep in mind when measuring short time intervals. Not sure whether "9,000 ms" means "9 ms" or "9000 ms". The latter is not short so a low resolution clock would not make much of a difference there.

Last edited on May 19, 2022 at 9:25am

May 19, 2022 at 10:14am

lastchance (6980)

From your code ... how do you know what TU and TG are?

Without knowing what is in get_Noisy_Force_GM() and U_pot_GM() your question is probably unanswerable.

If you send U_pot_GM() the same arguments each trial, and if the arguments are passed by value and there is no randomness within U_pot_GM(), then any reasonable optimiser might well say "why run this trial more than once - I'll get the same answer every time".

TG = 27,000 ms
TU = 14,000 ms

Whole numbers of seconds?

Last edited on May 19, 2022 at 10:40am

May 19, 2022 at 11:01am

kbw (9488)

A millisec is a long time in the environments I work in, and I don't even use it for stuff on my laptops anymore. Have tried showing your time in microsecs?

Having said that, how do the actual result values look when you compare the outputs from the two environments? Are they the same, similar or very different?

May 20, 2022 at 8:50pm

Hi guys, thanks for the many responses.

I meant TG = 27,000 ms = 27s. In other words, if every loop iteration was actually executed I get macroscopic times. Both functions are very expensive, and I wanted to gauge their relative expense.

The return typ forces of get_noisy_force_GM( ) is a simple structure consisting of two double.

When it's not a factor but 18ms faster then you will have 0 in case of 14ms.

In that case, I would have seen different behavior after increasing Xdata.

I tried using microseconds instead, but no change.

I am dumbfounded by the fact that there is no easy way to do this. In Python, this would be trivial..

Last edited on May 20, 2022 at 8:58pm

May 20, 2022 at 9:27pm

In Python, this would be trivial..

Because Python code is not optimized as aggressively?

Last edited on May 20, 2022 at 9:27pm

May 20, 2022 at 10:21pm

python makes everything easier. It typically runs between 3 and 10(!!) times slower than half decent c++ depending on what you are doing.

this should be pretty easy.
first, get rid of the casts and just print the duration. That should be in flat clock ticks, and it should NOT be zero; even if you just do nothing between the calls it should have a small value.

int main()
{
auto s = chrono::steady_clock::now();
auto e = chrono::steady_clock::now();
cout << (e-s).count() << endl;
}

that prints 100 for me. If you get zero, your stead_clock stinks.

Last edited on May 20, 2022 at 10:25pm

May 23, 2022 at 11:27am

Because Python code is not optimized as aggressively?

python makes everything easier. It typically runs between 3 and 10(!!) times slower than half decent c++ depending on what you are doing.

I am aware, which is why I typically use C++ for the more involved numerical computations.

@jonnin,
The code snippet you provided gives around 400-500 on my laptop and 50-80 on a cluster node.
I am going to test it in my setting now, but if the compiler optimizes away my loops, then simply using steady_clock (plus removing casting) is not going to solve my problem?

Last edited on May 23, 2022 at 11:27am

May 23, 2022 at 12:20pm

it won't solve what the optimizer did, it just should give a non-zero response.
if you want the loop to run, do something slightly different.
eg
U+= instead of =
and print U at the end so it isn't discarded.
the timing will be good enough to get a sense of the work done.

timing code can get weird too. The first time you run it could cache things so the next time you get a wildly different result (much faster). Running it with other stuff also running can slow it down (you can mess with priority to help that one).
The best thing you can do in my opinion is simply to time it IN PLACE, in the actual code you care about. Then it will optimize the way it does for that code, run it the number of times it needs to run with caching or not, and so on. Then you get a useful answer. Trying to do it stand alone is begging for a useless result ... about all you can do with that is validate that one algorithm is better than another or see if some internal tight loop adjustment has a big enough result to really matter. Its ok for that, but you can't trust the wall clock time result to be anything like it will be 'in action'.

Last edited on May 23, 2022 at 12:26pm

May 23, 2022 at 1:25pm

That is unfortunate..

I need the fraction of cost in order to figure out how long my simulation is supposed to run, so I cannot just measure it "in place" while the simulation is already running.

My setting is as follows:
I want to compare the performance of methods A and B when both run the same amount of wall clock time.
In each iteration, method A uses the get_noisy_force_GM() function from above, which dominates the workload in a single iteration.
Method B, however, uses both get_noisy_force_GM() and U_pot_GM() in each iteration, and both of them dominate the workload per iteration.

Given the time it took method A to execute for N iterations, I want to gauge how many iterations method B should run to yield similar execution time. For this, I need to gauge the relative cost of the two methods at the beginning of method B (i.e. before doing the actual simulation iterations). For example, if U_pot_GM() was twice as costly as get_noisy_force_GM(), the number of iterations in method B should be three times lower than in method A.

Since my approach presented in the opening post does not seem to work, what about the following idea?
Instead of measuring the costs of the force/potential functions, I simply enter the execution time of method A as an input argument for method B. While the method B goes through its iterations, I use the chrono clocks to measure the elapsed time in each iteration, and stop the iterations when the total elapsed time reaches the given admissible execution time. That would tie in with what @jonnin recommended, focussing on measuring the times used in an "in practice" run.

Last edited on May 23, 2022 at 1:27pm

May 23, 2022 at 3:16pm

sort of. If you do that, the act of looking at the time costs time, so B is a bit slower than it really would be in the results. It may be good enough for what you want.

I would not do it this way. What do you really have here? Do you want to run for N seconds and get as many iterations of both as possible on one CPU? in that case, can you just alternate them?
eg
for(some amount of time)
{
a(stuff);
b(stuff);
}

Or, if you want to know how many of each one you can do in X time
for(some time)
a(..);
for(some time)
b(..);
count # of iterations with the loop counters or whatever.

the longer it runs, the better the answer. But, as I said, real code and testing are better if similar. If you really run 10 loops and then do 50 other things and then back to run 10 loops again, its different from running 10 million loops in a row with nothing between.

you can't rely on timing data to schedule code to a wall clock. Its a guideline, but if its a tight schedule, you have to run for the scheduled window and stop, not hope that because all your tests said 10000 loops was the right value so put in a hard coded 10k loops. It will eventually miss its deadline that way, or stop short and miss a few potential iterations. And hardware can change, or computer workload etc can all affect it. Similarly, the ratio of A to B is going to vary some depending on what is going on; you can get a guideline but not an ensured result.

Last edited on May 23, 2022 at 3:26pm

May 24, 2022 at 10:58am

I think the second option is most closely related to what I want, i.e. :

Or, if you want to know how many of each one you can do in X time
for(some time)
a(..);
for(some time)
b(..);
count # of iterations with the loop counters or whatever.

The point is that A and B are two different methods that ultimately try to do the same. Imagine now that, for the same number of iterations, B leads to better results than A. However, B is more expensive so it takes more time than A. Moreover, A could reach the same performance as B if one just let it run longer.
I want to test: If both run (roughly!) the same amount of wall clock time, B still reaches better results.

I am aware that hardware plays a role. Fortunately, with access to cluster nodes where I can occupy entire nodes of K identical cores (without anybody else using the cores), forcing my code to run a certain wall clock time would lead to results that are somewhat consistent, no?

Last edited on May 24, 2022 at 11:00am

May 24, 2022 at 12:49pm

yes, a dedicated machine bypasses the aggravations of timing something on like a home PC with 200 background services and whatever going on.

You should get consistent results that way. As with above, though, looping and checking the time every iteration is a slight overhead cost, but it would answer your question (esp since unlike the scenario I complained about, here, both A &B have the same extra overhead now)

be sure to test many problems. I don't know what you are doing exactly (sorta, but not exactly) and whether changing the problem changes the speed of convergence to a solution, in which case, you could be up against a tough problem to crack.

Last edited on May 24, 2022 at 12:51pm

May 24, 2022 at 5:40pm