Dispatching inline function by template specialization

Oct 24, 2010 at 3:33pm
Hi everyone,

I am trying to play a bit with template metaprogramming, and I have run into a situation that puzzles me. I have a very simple template-based function dispatcher that looks like this (to make my point I am keeping only the things that matter in this example):

[code]
#include <iostream>
#include <ctime>

// base template definition
template< int N >
struct ConstArithmetics {
    template < typename T > inline static T mult(T t) { return t*N; }
};

// specialization: multiplication by 1 is identity
template<>
struct ConstArithmetics<1> {
    template<class T> inline static T mult(T t) { return t; }
};

// specialization: multiplication by 2 is 1-bit left-shift
template<>
struct ConstArithmetics<2> {
    template<class T> inline static T mult(T t) { return t << 1; }
};

int main(int argc, char** argv) {

    int SIZE = 1000000;

    int * z = new int[SIZE];
    int * y = new int[SIZE];

    for ( int i = 0 ; i < SIZE; i++ ) z[i] = y[i] = i;
    time_t start_time = time(NULL);

    // external loop is just to make it run longer, we need time stats...
    for ( int j = 0 ; j < 1000; j++ ) { 
        for ( int i = 0 ; i < SIZE; i++ ) z[i] = y[i] * 3 ;
    }

    std::cout << (time(NULL)-start_time) << " s for direct multiplication" 
              << std::endl;

    start_time = time(NULL);

    for ( int j = 0 ; j < 1000; j++ ) {
        for ( int i = 0 ; i < SIZE; i++ ) 
                 z[i] = ConstArithmetics<3>::mult(y[i]);
    }
    std::cout << (time(NULL)-start_time) << " s for indirect multiplication"
              << std::endl;

    delete [] y;
    delete [] z;
}
[/code]



Now, what happens is that the templatized version runs 1.5-2x *slower* than the direct multiplication (first loop) above. This happens with the non-specialized N=3 version shown above, as well as with the specialized N=1 and N=2 versions. It almost looks like the compiler is not inlining my functions and/or it can figure out some optimization in the first loop that is not available in the second (although if the inlining does work, this would be unlikely, no?). I'd greatly appreciate any ideas on what is happening here, and on how to interrogate a compiler about what has really been done.
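
A side point on measurement: time() only has one-second resolution, so a variant along the following lines (using std::clock(); this is a sketch, not part of the original post) gives finer-grained numbers for the same comparison. The numbers only mean something when compiled with optimizations enabled (e.g. -O2 or -O3).

[code]
// Finer-grained timing sketch (assumed alternative, not the poster's code).
// The structure mirrors the benchmark above, but uses std::clock() instead
// of time() so runs shorter than a second still produce usable figures.
#include <cstdio>
#include <ctime>

template< int N >
struct ConstArithmetics {
    template< typename T > static T mult(T t) { return t * N; }
};

int main() {
    const int SIZE = 1000000;
    int *y = new int[SIZE];
    int *z = new int[SIZE];
    for (int i = 0; i < SIZE; i++) y[i] = i;

    std::clock_t start = std::clock();
    for (int j = 0; j < 1000; j++)
        for (int i = 0; i < SIZE; i++)
            z[i] = ConstArithmetics<3>::mult(y[i]);
    double secs = double(std::clock() - start) / CLOCKS_PER_SEC;

    // print z[0] so the result of the loop is actually used
    std::printf("%.3f s for templated multiplication (z[0] = %d)\n", secs, z[0]);

    delete [] y;
    delete [] z;
}
[/code]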
Oct 24, 2010 at 4:46pm
I would be very surprised if your ConstArithmetics class ever produced faster code than direct multiplication in any situation ever.

But anyway, just to cover the obvious.... are you running the debug build? Debug generally does not inline functions.
Oct 24, 2010 at 5:10pm
Like Disch said, you can't possibly gain anything from using specializations here. Replacing multiplications and divisions involving powers of two by bit shifts is one of the most elementary optimizations a compiler can do. It is so elementary that it will be done even when compiling without optimizations.
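
To see this concretely, here is a tiny sketch (my own example, not from the thread) you can compile and inspect: with optimizations on, x86-64 g++ typically emits the same single lea/shift instruction for both functions, so writing the shift by hand gains nothing.

[code]
// Sketch illustrating the point above (assumption: g++ at -O1 or higher);
// both functions typically compile to identical machine code.
int twice_mul(int x)   { return x * 2; }   // compiler strength-reduces this anyway
int twice_shift(int x) { return x << 1; }  // same generated code

int main() { return twice_mul(21) == twice_shift(21) ? 0 : 1; }
[/code]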

My results are:
1.44 s for direct multiplication
1.43 s for indirect multiplication

So yeah, you apparently compiled in Debug mode.

DrYurich wrote:
and on how to interrogate a compiler about what has really been done.

You do that by looking at the assembler output it generates.
For g++, you can do that by passing --save-temps as a parameter. Then g++ will additionally save the assembler output while compiling.
Oct 24, 2010 at 5:36pm
I had hoped there was a way one level higher than reading the asm ;) Anyway, I did use g++ with the defaults, so apparently I did not get things inlined. Thanks for reminding me of that, my bad :( With -O3 everything runs with the same performance.

And I am not hoping to run faster than direct code after all compiler optimizations. This is just a performance test (to make sure I did not really make things *worse*); the real-life use case is something like the following:

[code]
// densely packed vector of arbitrary-length bitfields;
// a "classical" bitvector would be BitFieldVector<1>, etc.

template <int BitFieldWidth>
class BitFieldVector {
    ...
    int operator [] (int i) {
        // here we can use different arithmetic to compute the byte and
        // within-byte offsets depending on the actual value of BitFieldWidth
    }
};
[/code]

If the compiler can optimize on its own (as in replacing *2 by << 1) all the way through the template instances, down to the operator[]() method, then I will stop worrying right away and let the compiler do its job!!!
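
For what it's worth, here is a minimal sketch of what such a class could look like (the member names, the two-byte load, and the <=8-bit width limit are assumptions made for illustration, not the poster's actual design). Because BitFieldWidth is a compile-time constant, an optimizing compiler will typically fold the multiply and turn the /8 and %8 into shifts and masks on its own, which is exactly the behaviour asked about above.

[code]
// Hedged sketch only: handles field widths 1..8 bits in this form.
// Fields are packed back-to-back, so a field may straddle a byte boundary;
// reads and writes therefore go through two adjacent bytes.
#include <iostream>
#include <vector>

template <int BitFieldWidth>
class BitFieldVector {
    std::vector<unsigned char> bytes_;
public:
    explicit BitFieldVector(int n)
        : bytes_((n * BitFieldWidth + 7) / 8 + 1, 0) {}

    int operator[](int i) const {
        // BitFieldWidth is known at compile time, so the compiler can
        // strength-reduce these operations (shifts/masks instead of /, %).
        int bit    = i * BitFieldWidth;
        int byte   = bit / 8;
        int offset = bit % 8;
        unsigned v = bytes_[byte] | (unsigned(bytes_[byte + 1]) << 8);
        return (v >> offset) & ((1u << BitFieldWidth) - 1);
    }

    void set(int i, int value) {
        int bit    = i * BitFieldWidth;
        int byte   = bit / 8;
        int offset = bit % 8;
        unsigned mask = ((1u << BitFieldWidth) - 1) << offset;
        unsigned v = bytes_[byte] | (unsigned(bytes_[byte + 1]) << 8);
        v = (v & ~mask) | ((unsigned(value) << offset) & mask);
        bytes_[byte]     = (unsigned char)(v & 0xff);
        bytes_[byte + 1] = (unsigned char)(v >> 8);
    }
};

int main() {
    BitFieldVector<3> v(10);          // ten 3-bit fields
    v.set(4, 5);
    std::cout << v[4] << std::endl;   // prints 5
}
[/code]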

Thank you for your replies!
Topic archived. No new replies allowed.