Estimate relative costs of two routines

Greetings!

For a physics-inspired numerical method, I want to estimate the relative computational cost of two routines: one used for energy evaluation, one for force evaluation.

My first idea was to use the <chrono> library to simply measure the runtimes of these routines, but it quickly turned out that the results were horribly compiler- and hardware-dependent (see my other thread http://www.cplusplus.com/forum/beginner/283620/ ).

I want a simpler measure, maybe in terms of required floating-point operations (FLOPs).
It's about the cost of an algorithm, rather than the real-world cost of a highly optimized C++ code (or code in any other language). I am a PhD student in applied maths, and my fellow mathematical researchers will not care about compiler- or hardware-dependent details anyway...


Now to the two routines:

Both of them require an array of N data points (simple doubles in this example).
The energy routine always needs to iterate through the whole array, whereas the force routine only needs to consider B random array entries (thus we need to draw B array elements without replacement).

I want to know: Given the size of the array N and batch size B, how much more expensive is one routine than the other (rough estimate)?

// This is the energy routine, it needs to process the whole vector Xdata

//params is a simple struct containing 4 doubles.
double U_pot_GM(const params& theta, const double sig1, const double sig2, const double a1, const double a2, 
					  const vector <double>& Xdata, const double sig0){
	
	double U = 0;
	
	for(int i=0; i<Xdata.size(); ++i){

		U += log( likelihood_GM(theta, sig1, sig2, a1, a2, Xdata[i]) );    // sum over log-likelihoods

	}

	U -= (theta.mu1*theta.mu1 + theta.mu2*theta.mu2)/(2*sig0*sig0) + log(2*PI*sig0*sig0);  // log-prior part.

	return -1*U;					  
}



// This is the force routine, it needs to process B random elements of the vector Xdata

//forces and params are structs that contain 2 and 4 doubles, resp. .
forces get_noisy_force_GM(const params& theta, const double sig1, const double sig2, const double a1, const double a2, 
				   vector <double>& Xdata, const size_t B, vector <int>& idx_arr, const double sig0){
	forces F{0,0};
	double P, e1, e2, x, scale;
	scale = Xdata.size()/double(B);
	int help_int, idx;
	int size_minus_B = Xdata.size()-B;

	if(Xdata.size() != B){	// in case of subsampling...
		
		// idx_arr stores the possible indices of the vector Xdata
		// this loop randomly chooses B of them and stores them
		// at the end of idx_arr.
		for(int i=Xdata.size()-1; i>=size_minus_B; --i){ 

			uniform_int_distribution<> distrib(0, i);	// recreates this in every iter... is there a better way?
		
			idx = distrib(twister);
			help_int = idx_arr[i];
			idx_arr[i] = idx_arr[idx];
			idx_arr[idx] = help_int; 

		}

	}

       // actual force evaluation. 
       // the B data points to be considered are given 
       // by the B last indices stored in idx_arr.
	for(int i=idx_arr.size()-1; i>=size_minus_B; --i){	
	
			x = Xdata[ idx_arr[i] ];
			P = likelihood_GM(theta, sig1, sig2, a1, a2, x);	// likelihood of a single data point
			e1 = exp( -1*(x-theta.mu1)*(x-theta.mu1)/(2*sig1*sig1) );
			e2 = exp( -1*(x-theta.mu2)*(x-theta.mu2)/(2*sig2*sig2) );
			F.fmu1 += 1/P * e1 * (x-theta.mu1);
			F.fmu2 += 1/P * e2 * (x-theta.mu2);
				
	}


	F.fmu1 *= a1/(sqrt(2*PI)*sig1*sig1*sig1) * scale;
	F.fmu2 *= a2/(sqrt(2*PI)*sig2*sig2*sig2) * scale;
	
	F.fmu1 -= theta.mu1/(sig0*sig0);   // prior part
	F.fmu2 -= theta.mu2/(sig0*sig0);

	return F;
}



// This is the likelihood evaluation of a given data point x. It is used by both routines above.

double likelihood_GM(const params& theta, const double sig1, const double sig2, const double a1, const double a2, const double x){
	double e1 = exp( -1*(x-theta.mu1)*(x-theta.mu1)/(2*sig1*sig1) );
	double e2 = exp( -1*(x-theta.mu2)*(x-theta.mu2)/(2*sig2*sig2) );
	double p = a1/(sqrt(2*PI)*sig1) * e1 + a2/(sqrt(2*PI)*sig2) * e2;	
	return p;
}


With respect to the subsampling without replacement in the force routine: it is an implementation of the Fisher-Yates algorithm. Feel free to let me know if you believe it can be improved. We might ditch the "without replacement" condition later and switch to sampling with replacement, which would make the implementation much easier.

Back to topic: The routines use many library arithmetic functions like logs and exponentials, so simply counting FLOPs is probably not a good idea. However, both routines make use of the third routine, the likelihood, which consists of two "big" exponential evaluations.
Does it make sense to use this third routine as a unit of cost?

The energy routine mainly consists of N calls of the likelihood routine. So cost of energy routine: N likelihoods.
The force routine calls it B times, but it does much more additional work as well.
In the force routine, right after the likelihood call, there are two other "big" exponentials, so the force evaluation is like 2B likelihood calls.

Does this sound reasonable?

Jellal
there is a lot of unnecessary work there.
almost *all* the variables in likelihood are just passed through from the callers unchanged. While the optimizer MAY be smart enough to do this FOR you, factoring it out would ensure it.
stuff like 2*sig*sig can be computed one time in the caller and passed into likelihood, which prevents any compiler derps where it might try to do that each time it is called.

Factoring that stuff out may help, but it may also be premature and unnecessary. You could try it.

minor cleanup: -x is the same as -1*x ... just use the unary minus operator there.

the random sampling may be better done with a random-shuffle, or, if you are not able to do that due to order of the data being important (?) you can use iota to make a 0-N filled vector and shuffle that then just iterate the first B of those as your random indices. You can make the random generator static to do it once, but I feel you would be better served to make a random generator class wrapper for the tool that suits your needs and moves all that junk away. But again, you don't need random numbers if you just shuffle 0-N indices, much simpler.
you can keep the random N array around and reshuffle it as needed, or whatever. The only comment I really have here is that you do not want to be doing some slow sampler every iteration, but to do it as little as absolutely necessary to get your samples. There are any number of ways to do that.
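Something like this, as a rough sketch (sample_indices is just an illustrative name, and it assumes the global twister engine you already have; note it does O(N) work per call because it shuffles the whole index vector):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

extern std::mt19937 twister;   // the engine you already have elsewhere

// returns B distinct indices into a container of size N
std::vector<int> sample_indices(std::size_t N, std::size_t B)
{
	std::vector<int> idx(N);
	std::iota(idx.begin(), idx.end(), 0);           // fill with 0, 1, ..., N-1
	std::shuffle(idx.begin(), idx.end(), twister);  // full O(N) shuffle
	idx.resize(B);                                  // keep the first B as the sample
	return idx;
}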

as far as time taken, just set B == N and time them both for a variety of sizes of N. That gives you a rough coefficient, and the comparison is just coeff*B. I mean, if N iterations take 1000 seconds and 1100 seconds, for example, then a smaller B of 100 is going to be 110 seconds, assuming it's more or less linear (looks like it would be).

side note... logs can be a little sluggish. They are done in the FPU hardware, but even so it's one of the slowest math ops on the chip (or at least on typical home user chips, dunno about your system). If you can somehow avoid them, by doing the math in another space or reducing the # of calls to it or something, it may be even better. If you can't, you can't. I am ashamed, but my logarithm math skills are not what they were... I can't tell without really digging into this whether that is possible.
Hi Jonnin,

let me first respond to my main issue in this thread:

as far as time taken, just set B == N and time them both for a variety of sizes of N. That gives you a rough coefficient, and the comparison is just coeff*B. I mean, if N iterations take 1000 seconds and 1100 seconds, for example, then a smaller B of 100 is going to be 110 seconds, assuming it's more or less linear (looks like it would be).
But measuring doesn't really work as it seems to be heavily hardware / compiler dependent (as seen in the other thread)...
That's why I try to look at a crude theoretical estimate like:
The energy routine mainly consists of N calls of the likelihood routine. So cost of energy routine: N likelihoods.
The force routine calls it B times, but it does much more additional work as well.
In the force routine, right after the likelihood call, there are two other "big" exponentials, so the force evaluation is like 2B likelihood calls.



Thank you for the hints on increasing efficiency.
stuff like 2*sig*sig can be computed one time in the caller and passed into likelihood, which prevents any compiler derps where it might try to do that each time it is called.
Makes sense!

minor cleanup -x is the same as -1*x ... just use the unary minus operator there.
I have to admit that I did not even know that was possible. It's one of those things that I had memorized as being "different from Python or Matlab, where one can very well write -x instead of -1*x".


the random sampling may be better done with a random-shuffle, or, if you are not able to do that due to order of the data being important (?) you can use iota to make a 0-N filled vector and shuffle that then just iterate the first B of those as your random indices. You can make the random generator static to do it once, but I feel you would be better served to make a random generator class wrapper for the tool that suits your needs and moves all that junk away. But again, you don't need random numbers if you just shuffle 0-N indices, much simpler.
you can keep the random N array around and reshuffle it as needed, or whatever. The only comment I really have here is that you do not want to be doing some slow sampler every iteration, but to do it as little as absolutely necessary to get your samples. There are any number of ways to do that.
The order is not important. I used shuffling first, but if B is small, I thought this was inefficient: shuffling is of order N, whereas my algorithm is of order B. B, however, is not always small, and I myself was unsure about the cost of recreating the distribution again and again. Do you think shuffling would still be better?

side note... logs can be a little sluggish. They are done in the FPU hardware, but even so it's one of the slowest math ops on the chip (or at least on typical home user chips, dunno about your system). If you can somehow avoid them, by doing the math in another space or reducing the # of calls to it or something, it may be even better. If you can't, you can't. I am ashamed, but my logarithm math skills are not what they were... I can't tell without really digging into this whether that is possible.
I'll give it a thought, cheers!
It's about the cost of an algorithm, rather than the real-world cost of a highly optimized C++ code (or code in any other language).

The cost of U_pot_GM is Θ(N) and the cost of get_noisy_force_GM is Θ(B). Conceptually, that means the costs are approximately k1*N and k2*B for some constants k1 and k2 when N and B are large.

More specifically (and I must admit I'm drawing on 30-year-old memory here), if an algorithm A running on input of size X runs in Θ(f(X)), it means that for all sufficiently large X:
k1*f(X) <= A(X) <= k2*f(X) for some positive constants k1 and k2.

Conceptually, it means that the algorithm's runtime is proportional to f(X).

Which of your algorithms is actually faster? That depends on how N relates to B. For example, if B is always 25% of N, then either one might be faster, because ultimately the run times are both proportional to N. If B is some constant, then get_noisy_force_GM will be faster for some large-enough value of N.

@PhysicsIsFun,
You use two very "expensive" functions in your code: log() and exp(). You should seek to reduce the use of these.

In the potential-energy function you repeatedly call log() in a loop and add things up. But you don't need to: just call it once and use the laws of logs:
log(A)+log(B)+log(C)+... = log(A.B.C....)
Multiplication will be a lot less expensive than that log function.

In your force calculation you call every single exp() function twice for precisely the same arguments: once in likelihood_GM() and again when setting your force components. You should refactor your code to avoid that. In fact, if you think about the maths in fractions you can reduce the number of times by a factor of 4.


If you have a fully-conservative system (all forces with an associated potential energy) then, almost invariably, working with potential energy will be easier and more efficient (single scalar rather than multiple vectors, in case you need convincing.) If you ever actually need the forces then you can always find them as minus the derivative of potential energy.


As @Jonnin has already remarked, multiplying lots of quantities by -1 instead of just negating them with a minus sign is unnecessary. It isn't "wrong", and an optimiser might well eliminate it ... but it looks naff.


There are ways of reducing your number of multiplies slightly, but it pales into insignificance compared with reducing your dependency on exp and log.
@lastchance
In the potential-energy function you repeatedly call log() in a loop and add things up. But you don't need to: just call it once and use the laws of logs:
log(A)+log(B)+log(C)+... = log(A.B.C....)
Multiplication will be a lot less expensive than that log function.

I'm aware of this. The reason why the log is used is to prevent underflows. The likelihoods are probabilities (i.e. between 0 and 1), so multiplying too many of them leads to underflows. What might make sense is to use log( ) only every 10th iteration or so...
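Roughly this kind of batching, just as a sketch (the batch size of 10 is arbitrary and has to stay small enough that the running product cannot underflow):

	double U = 0.0;
	double prod = 1.0;
	int count = 0;
	const int batch = 10;    // take the log every 10 data points (tunable)

	for(int i=0; i<Xdata.size(); ++i){
		prod *= likelihood_GM(theta, sig1, sig2, a1, a2, Xdata[i]);
		if(++count == batch){
			U += log(prod);   // one log instead of ten
			prod = 1.0;
			count = 0;
		}
	}
	U += log(prod);   // remainder, if N is not a multiple of the batch size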

In your force calculation you call every single exp() function twice for precisely the same arguments: once in likelihood_GM() and again when setting your force components. You should refactor your code to avoid that. In fact, if you think about the maths in fractions you can reduce the number of times by a factor of 4.
Good catch! I shall get rid of those...

If you have a fully-conservative system (all forces with an associated potential energy) then, almost invariably, working with potential energy will be easier and more efficient (single scalar rather than multiple vectors, in case you need convincing.) If you ever actually need the forces then you can always find them as minus the derivative of potential energy.

I do actually need both forces and potentials. It's not completely conservative.


Anyway, while all of your hints help with making the code faster, I am still in need of a rough theoretical estimate.
Would it make sense to count the effort of the two routines in terms of calls of likelihood_GM?
While in the energy routine there is additional cost due to the many logs which are not present in the force routine, the latter has lots of other minor computations.
So saying something like energy routine = N likelihoods, and force routine = B likelihoods could be a good start.


edit: This is my new code after optimization. I got rid of the likelihood function altogether.
// This is the energy routine, it needs to process the whole vector Xdata
double U_pot_GM(const params& theta, const vector <double>& Xdata, const double two_sigsig1, const double two_sigsig2, 
                              const double pref_exp1, const double pref_exp2, 
			      const double two_sigsig0, const double log_PI_two_sigsig0){
	
	double e1, e2, likelihood, x, x_minus_mu1, x_minus_mu2, U = 0;
	
	for(int i=0; i<Xdata.size(); ++i){

		x = Xdata[i];
		x_minus_mu1 = x-theta.mu1;
		x_minus_mu2 = x-theta.mu2;		
		e1 = exp( -(x_minus_mu1)*(x_minus_mu1)/(two_sigsig1) );
		e2 = exp( -(x_minus_mu2)*(x_minus_mu2)/(two_sigsig2) );

		likelihood = pref_exp1 * e1  +  pref_exp2 * e2;		// likelihood of a single data point
		
		U += log( likelihood );	// sum over log-likelihoods

	}

	U -= (theta.mu1*theta.mu1 + theta.mu2*theta.mu2)/two_sigsig0 + log_PI_two_sigsig0;  // log-prior part.

	return -U;					  
}



// This is the force routine, it needs to process B random elements of the vector Xdata
forces get_noisy_force_GM(const params& theta, vector <double>& Xdata, const size_t B, vector <int>& idx_arr, const double two_sigsig1, 
                                             const double two_sigsig2, const double pref_exp1, const double pref_exp2, 
					     const double F_scale_1, const double F_scale_2, const double sigsig0){
	forces F{0,0};
	double e1, e2, likelihood, likeli_inv, x, x_minus_mu1, x_minus_mu2, scale;
	int help_int, idx;
	int size_minus_B = Xdata.size()-B;
	scale = Xdata.size() / double(B);
	
	if(Xdata.size() != B){	// in case of subsampling...
		
		// idx_arr stores the possible indices of the vector Xdata
		// this loop randomly chooses B of them and stores them
		// at the end of idx_arr.
		for(int i=Xdata.size()-1; i>=size_minus_B; --i){ 

			uniform_int_distribution<> distrib(0, i);	// recreates this in every iter... is there a better way?
		
			idx = distrib(twister);
			help_int = idx_arr[i];
			idx_arr[i] = idx_arr[idx];
			idx_arr[idx] = help_int; 

		}

	}

	// actual force evaluation.
	// the B data points to be considered are given
	// by the B last indices stored in idx_arr.
	for(int i=idx_arr.size()-1; i>=size_minus_B; --i){
	
			x = Xdata[ idx_arr[i] ];
			x_minus_mu1 = x-theta.mu1;
			x_minus_mu2 = x-theta.mu2;

			e1 = exp( -(x_minus_mu1)*(x_minus_mu1)/(two_sigsig1) );
			e2 = exp( -(x_minus_mu2)*(x_minus_mu2)/(two_sigsig2) );
			
			likelihood = pref_exp1 * e1  +  pref_exp2 * e2;	// likelihood of a single data point
			likeli_inv = 1/likelihood;

			F.fmu1 += likeli_inv * e1 * (x_minus_mu1);
			F.fmu2 += likeli_inv * e2 * (x_minus_mu2);
				
	}


	F.fmu1 *= F_scale_1 * scale;
	F.fmu2 *= F_scale_2 * scale;
	
	F.fmu1 -= theta.mu1/(sigsig0);   // prior part
	F.fmu2 -= theta.mu2/(sigsig0);

	return F;
}
ok, well flops is based on hardware. You can look this up; my new machine appears to run around 1.3 TF (https://www.intel.com/content/www/us/en/developer/articles/news/raw-compute-power-of-new-intel-core-i9-processor-based-systems-enables-extreme-megatasking.html#:~:text=The%20new%20Intel%C2%AE%20Core,cores%2C%20performing%20at%20124.5%20petaflops.) though that may include running all the possible threads all out (I only read the number off, and ignore the blather).

so to estimate the FLOPs on your hardware you would also have to look up the # of clocks each operation takes by looking at your 2 functions exported as assembly language.
That is kinda overkill, though.

It's probably fine to just count the operations in the C++ code:

the guts of your first for loop:

	x = Xdata[i]; // let's call assignment 1 operation. it's really an integer multiply to access the array and some copying to and from the registers, but meh.
		x_minus_mu1 = x-theta.mu1; //two operations: subtraction and assignment. 3 total
		x_minus_mu2 = x-theta.mu2;  //two more, 5 total		
		e1 = exp( -(x_minus_mu1)*(x_minus_mu1)/(two_sigsig1) ); //+6 ops* 11 total
		e2 = exp( -(x_minus_mu2)*(x_minus_mu2)/(two_sigsig2) );//+6 17 total

		likelihood = pref_exp1 * e1  +  pref_exp2 * e2;  //+4  21 total
		
		U += log( likelihood );	// +4 log, add, assign.  25 total. 
//*  I counted log and exp as 2 operations due to their relative speed; maybe that weight should be more like 5 instead of 2? Yes, it's that bad.


anyway, so that is 25 operations * the number of loop iterations = that many floating-point operations. The seconds it takes is tied to your hardware, and the number of iterations is something you should track along with the data so the numbers make sense.
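Doing the same crude count for the force loop body gives roughly 29 operations per sampled point, plus whatever you charge for the index shuffle, so a back-of-the-envelope comparison could look like the sketch below (the constants are hand counts under the same convention, not measurements, and the shuffle cost is a guess):

#include <cstddef>

// very rough cost model in the spirit of the counts above (exp/log weighted as 2 ops).
// the constants are ballpark hand counts from the two loop bodies, not measurements.
double estimated_cost_ratio(std::size_t N, std::size_t B)
{
	const double ops_per_energy_point = 25.0;   // energy loop body, counted above
	const double ops_per_force_sample = 29.0;   // force loop body, counted the same way
	const double ops_per_index_draw   = 10.0;   // guess for one Fisher-Yates swap + random draw

	const double energy_ops = ops_per_energy_point * N;
	const double force_ops  = (ops_per_force_sample + ops_per_index_draw) * B;

	return force_ops / energy_ops;   // < 1 means the force routine is cheaper per call
}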

log, exp, pow, and similar functions are troublesome (trig as well). They are usually faster than what YOU can do in code (even with crude approximations you can't beat them), yet compared to addition or multiply they are rather slow. (For integer powers you can beat the C++ pow function easily; for floating-point powers it's harder.)

I don't think flops is the way to go, but this is probably the difference between a computer programmer and an engineer.

even with all that, you still have the create/destroy issue in the sampling loop (the distribution is rebuilt every iteration), and you are using the MT (twister), which is one of the slower standard random generators (from memory? anyone else got a thought there?).
Thanks Jonnin,

this is what I had in mind. It should be sufficient for now...


I am indeed still interested in the subsampling efficiency thing (even though this might be another thread). I have read somewhere that creating a distribution is cheap, a random engine is not. So maybe it is not so bad?
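For what it's worth, one way to sidestep the question entirely would be to construct the distribution once and only pass new bounds per draw via its param_type; a rough sketch (the helper name is just for illustration):

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

extern std::mt19937 twister;   // the global engine from the original code

// partial Fisher-Yates: moves B randomly chosen indices to the end of idx_arr
void choose_B_indices(std::vector<int>& idx_arr, std::size_t B)
{
	std::uniform_int_distribution<int> distrib;   // constructed once
	const int size_minus_B = static_cast<int>(idx_arr.size() - B);

	for(int i = static_cast<int>(idx_arr.size()) - 1; i >= size_minus_B; --i){
		// draw uniformly from [0, i] without constructing a new distribution
		const int idx = distrib(twister, std::uniform_int_distribution<int>::param_type(0, i));
		std::swap(idx_arr[i], idx_arr[idx]);
	}
}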
good. just note that it's crude, as we said. The biggest thing is that this sort of technique used to work reasonably well, but new hardware that does a lot of floating-point operations at once (e.g. 4 or 8 or whatever multiplies all at once) throws a wrench into trying to guess at it this way.