I am trying to play a bit with template metaprogramming, and I have run into a situation that puzzles me. I have a very simple template-based function dispatcher that looks like this (to make my point I am keeping only the things that matter in this example):
#include <iostream>
#include <ctime>

// base template definition
template< int N >
struct ConstArithmetics {
    template <typename T> inline static T mult(T t) { return t*N; }
};

// specialization: multiplication by 1 is identity
template<>
struct ConstArithmetics<1> {
    template<class T> inline static T mult(T t) { return t; }
};

// specialization: multiplication by 2 is 1-bit left-shift
template<>
struct ConstArithmetics<2> {
    template<class T> inline static T mult(T t) { return t << 1; }
};

int main(int argc, char** argv) {
    int SIZE = 1000000;
    int * z = new int[SIZE];
    int * y = new int[SIZE];
    for ( int i = 0 ; i < SIZE; i++ ) z[i] = y[i] = i;

    time_t start_time = time(NULL);
    // external loop is just to make it run longer, we need time stats...
    for ( int j = 0 ; j < 1000; j++ ) {
        for ( int i = 0 ; i < SIZE; i++ ) z[i] = y[i] * 3 ;
    }
    std::cout << (time(NULL)-start_time) << " s for direct multiplication"
              << std::endl;

    start_time = time(NULL);
    for ( int j = 0 ; j < 1000; j++ ) {
        for ( int i = 0 ; i < SIZE; i++ )
            z[i] = ConstArithmetics<3>::mult(y[i]);
    }
    std::cout << (time(NULL)-start_time) << " s for indirect multiplication"
              << std::endl;

    delete [] y;
    delete [] z;
}
Now, what happens is that the templatized version runs 1.5-2x *slower* than the direct multiplication (first loop) above. It happens with the (non-specialized) N=3 version as shown above, as well as with the specialized N=1 and N=2 versions. It almost looks like the compiler is not inlining my functions and/or it can figure out some optimization in the first loop that is not available in the second (although if the inlining does work, that would be unlikely, no?). I'd greatly appreciate any ideas on what happens here and on how to interrogate a compiler about what has really been done.
Like Disch said, you can't possibly gain anything from using specializations here. Replacing multiplications and divisions involving powers of two by bit shifts is one of the most elementary optimizations a compiler can do. It is so elementary that it will be done even when compiling without optimizations.
My results are:
1.44 s for direct multiplication
1.43 s for indirect multiplication
So yeah, you apparently compiled in Debug mode.
DrYurich wrote:
and on how to interrogate a compiler about what has really been done.
You do that by looking at the assembler output it generates.
For g++, you can do that by passing --save-temps as a parameter. Then g++ will additionally save the assembler output while compiling.
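For example (the file name here is just for illustration), either of these will leave the generated .s file next to your sources, which you can then inspect:
[code]
# keep the intermediate files, including the assembler output (main.s)
g++ -O3 --save-temps main.cpp

# alternatively, stop after compilation and emit only the assembler file
g++ -O3 -S main.cpp
[/code]
Compare the output with and without -O3 if you want to see what the optimizer actually changed.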
I hoped there was a way one level higher than reading the asm ;) Anyway, I did use g++ with default settings, so apparently I did not get things inlined; thanks for reminding me of that, my bad :( With -O3 both versions run at the same speed.
And I am not hoping to run faster than the direct code after all compiler optimizations anyway. This was just a performance test (to make sure I did not actually make things *worse*); the real-life use case is something like the following:
[code]
// densely packed vector of arbitrary-length bitfields;
// a "classical" bitvector would be BitFieldVector<1>, etc
template <int BitFieldWidth>
class BitFieldVector {
    ...
    int operator [] (int i) {
        // here we can use different arithmetics to compute byte and within-byte
        // offsets depending on the actual value of BitFieldWidth
    }
};
[/code]
If the compiler can optimize on its own (as in replacing *2 by << 1) all the way through the template instances, down into the operator[]() method, then I will stop worrying right away and let the compiler do its job!!!
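To make it concrete, here is roughly the kind of arithmetic I have in mind. This is not my actual class, just a minimal sketch that assumes BitFieldWidth <= 8 and LSB-first packing, so that a field spans at most two adjacent bytes:
[code]
#include <vector>

// sketch only: fields are packed LSB-first; an extra trailing byte is allocated
// so that reading two adjacent bytes is always in range
template <int BitFieldWidth>
class BitFieldVector {
    std::vector<unsigned char> data;
public:
    explicit BitFieldVector(int n)
        : data((n * BitFieldWidth + 7) / 8 + 1) {}

    int operator[](int i) const {
        int first_bit   = i * BitFieldWidth;   // absolute bit offset of the field
        int byte_offset = first_bit / 8;       // byte holding the field's first bit
        int bit_offset  = first_bit % 8;       // position of the field within that byte
        // read two adjacent bytes so a field crossing a byte boundary is still covered
        unsigned int chunk = data[byte_offset] | (data[byte_offset + 1] << 8);
        return (chunk >> bit_offset) & ((1u << BitFieldWidth) - 1);
    }

    // value is assumed to fit in BitFieldWidth bits
    void set(int i, int value) {
        int first_bit   = i * BitFieldWidth;
        int byte_offset = first_bit / 8;
        int bit_offset  = first_bit % 8;
        unsigned int mask  = ((1u << BitFieldWidth) - 1) << bit_offset;
        unsigned int chunk = data[byte_offset] | (data[byte_offset + 1] << 8);
        chunk = (chunk & ~mask) | ((unsigned int)value << bit_offset);
        data[byte_offset]     = chunk & 0xFF;
        data[byte_offset + 1] = (chunk >> 8) & 0xFF;
    }
};
[/code]
The part I care about is exactly the i * BitFieldWidth, / 8 and % 8 arithmetic: for, say, BitFieldWidth == 4 I would hope each instantiation collapses all of that into shifts and masks on its own.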