Improving speed of the math.h libraries

Jul 11, 2011 at 12:33pm
Hello,

I have written a neural network training algorithm that leverages multicore processors and runs more or less as efficiently as I know how to make a program run in C++. Performance on my 8-core computer at work is about 5,000 times that of the MATLAB equivalent, which I'm very happy about.

However, I think I can still make improvements; I use the math.h libraries quite heavily, with many calls to higher-level functions such as exp() and tanh(), as well as the more common +, -, and * operators on floats.

I don't really need the exp and tanh functions to be very accurate, so I was wondering if there was a faster, maybe slightly less accurate version of the math.h libraries that I could use?
Jul 11, 2011 at 12:37pm
Here is one of the functions that is called quite heavily; I was wondering if there are any obvious performance mistakes that might jump out at the more experienced programmers. Note that the actfun_index values are uniformly distributed. typ_act is a float and typ_small is a short.

typ_act activation_function(typ_small actfun_index, typ_act input) {
    switch (actfun_index) {
        case 0 : //Threshold
            if (input >= 0)                       return 1;
            else                                  return 0;
        case 1 : //Piecewise linear
            if (input >= 0.5)                     return 0.5;
            else if (input < 0.5 && input > -0.5) return input;
            else                                  return -0.5;
        case 2 : //Sigmoid (logistic)
                                                  return 1 / (1+exp(-input));
        case 3 : //Tanh
                                                  return tanh(input/2);
        case 4 : //Signum
            if (input > 0)                        return 1;
            else if (input < 0)                   return -1;
            else                                  return 0;
        default:
            cout << "ERROR: activation function index '" << actfun_index << "' out of bounds in activation_function"; getchar();
            return 0;
    }
}


The role of this function is to take an input float and pass it through a function defined by actfun_index.
Jul 11, 2011 at 12:41pm
Well, there are fixed-point math libraries; they use integer types as their base representation, so the basic operations (addition, multiplication, ...) take fewer cycles than the floating-point ones.

They also have the property that the precision between two representable numbers is constant, as opposed to floating-point numbers, where the precision between two representable numbers is reduced as the integer part increases.
Jul 11, 2011 at 12:52pm
And for your function, as your indices are consecutive, I would have made a table containing function pointers, then just called the function at the corresponding index in the table. That would remove all the conditional code (except the default case, if you handle it with an if).
Jul 11, 2011 at 6:03pm
Can you not make that a template function over typ_small?

Have you profiled your code to see that improving the math functions will make a significant difference? If so, which math functions in particular are the problem?

To cut down the time spent in the math functions, the options off the top of my head are Chebyshev approximation and a lookup table with linear interpolation. However, those functions, tanh and exp, are so trivial that I doubt you will improve the speed with either technique.
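
For illustration, though, a lookup table with linear interpolation for tanh could look something like the sketch below; the range, table size, and names here are arbitrary choices, not anything from your code. You would call init_tanh_tab() once at start-up and then use fast_tanh() in place of tanh().

// Sketch only: tanh approximated by a table plus linear interpolation.
// The range [-8, 8] and the table size are arbitrary choices.
#include <cmath>

const int   TANH_N     = 1024;      // number of intervals
const float TANH_RANGE = 8.0f;      // |x| beyond this is clamped to +/-1
float tanh_tab[TANH_N + 1];         // one entry per interval endpoint

void init_tanh_tab()
{
    for (int i = 0; i <= TANH_N; ++i)
    {
        float x = -TANH_RANGE + 2.0f * TANH_RANGE * i / TANH_N;
        tanh_tab[i] = tanh(x);
    }
}

inline float fast_tanh(float x)
{
    if (x <= -TANH_RANGE) return -1.0f;
    if (x >=  TANH_RANGE) return  1.0f;
    float pos = (x + TANH_RANGE) * (TANH_N / (2.0f * TANH_RANGE)); // position in the table
    int   idx = int(pos);
    float t   = pos - idx;                                          // fractional part
    return tanh_tab[idx] + t * (tanh_tab[idx + 1] - tanh_tab[idx]); // interpolate
}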
Jul 18, 2011 at 1:44am
Ok, sorry for the delay but I had to go home and learn function pointers, figuratively speaking. So I've taken bartoli's advice and simply created a function for every case, and then stored the pointers to those functions in an array. Much better!

Bartoli: I'm not familiar with fixed-point libraries, or what that means. Are you saying that 5*2 runs quicker than 5/2? What about 5*0.5?

Kev: my reason for using typ_small was simply to leave open the possibility of changing it later. I've since realised that this probably isn't necessary, so I'm simply using integers and floats. Here is the new version of the function:

// define the activation functions
float actfun_threshold(float input) {
            if (input >= 0) return 1;
            else            return 0;
}

float actfun_piecewise_linear(float input) {
    if (input >= 0.5)                     return 0.5;
    else if (input < 0.5 && input > -0.5) return input;
    else                                  return -0.5;
}

float actfun_sigmoid(float input) {
    return 1 / (1+exp(-input));
}

float actfun_tanh(float input) {
    return tanh(input/2);
}

float actfun_signum(float input) {
    if (input > 0)        return 1;
    else if (input < 0)   return -1;
    else                  return 0;
}

float (*actfun_pointers[5])(float) = {
    &actfun_threshold,
    &actfun_piecewise_linear,
    &actfun_sigmoid,
    &actfun_tanh,
    &actfun_signum
};
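
For reference, a call through the table then looks something like this (apply_activation is just an illustrative name, and the index is assumed to have been range-checked already):

// illustrative wrapper around the table above; index assumed to be 0..4
float apply_activation(int actfun_index, float input) {
    return actfun_pointers[actfun_index](input);
}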


I haven't been able to test it because there are a hundred other things I've toyed with in the code since, so I've got a bit of debugging ahead of me...

Thanks for the productive feedback.
Jul 18, 2011 at 1:55am
You could also consider building pre-calculated look-up tables (arrays) for some functions.
Jul 18, 2011 at 2:44am
Galik: I was thinking about that, but I don't know how to do that efficiently. Would I need to learn assembler to create a function that can do, for example, 1 / (1+exp(-input)) quicker than what I already have?

(BTW: While "unlikely", the range of input is the entire range of possible values of a float type)
Jul 18, 2011 at 2:57am
Now that I think of it, here are two minor improvements:

I have replaced this:

float actfun_piecewise_linear(float input) {
    if (input >= 0.5)                     return 0.5;
    else if (input < 0.5 && input > -0.5) return input;
    else                                  return -0.5;
}


with this:

float actfun_piecewise_linear(float input) {
    if      (input >  0.5) return  0.5;
    else if (input < -0.5) return -0.5;
    else                   return input;
}


These are logically equivalent, because the "equal to" case from the first one will just yield "input" now, which is 0.5 in that case anyway...

I also took this:

// define the activation functions
float actfun_threshold(float input) {
            if (input >= 0)  return 1;
            else            return 0;
}


(which contained a >= conditional) and replaced it with this:

// define the activation functions
float actfun_threshold(float input) {
            if (input < 0)  return 0;
            else            return 1;
}


Not sure that will make much difference though, in light of the more complex floating point operations I've got in there...
Last edited on Jul 18, 2011 at 3:00am
Jul 18, 2011 at 3:25am
The trouble with function pointers is that they kill compiler optimization.

All of the logic inside every one of those functions is so simple that I'd want my compiler to optimize out the function call altogether. In other words, I'd want those functions inlined. The amount of work actfun_threshold() does is actually far, far less than the amount of work it takes to make a call to that function through a pointer.
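
For what it's worth, one way to keep the bodies inlinable is to select the activation at compile time through a template parameter. This is only a rough sketch (just two of the five activations shown, and it obviously only helps where the activation index is a compile-time constant):

#include <cmath>

// Rough sketch: compile-time dispatch lets the compiler inline the body
// at every call site, unlike a call through a function pointer.
template <int ActFun> inline float activation(float input);

template <> inline float activation<0>(float input)    // threshold
{
    return input >= 0.0f ? 1.0f : 0.0f;
}

template <> inline float activation<2>(float input)    // logistic sigmoid
{
    return 1.0f / (1.0f + exp(-input));
}

// usage: float y = activation<2>(x);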

Jul 18, 2011 at 5:23am
I became interested in seeing how much benefit the table method could bring. Looking at the output of the function, I noticed that it is essentially constant outside the range -15.0 to +15.0, so I built a table covering the values between those limits. You could select a broader range if you need more accuracy.

This short program I wrote generates the lookup table. It then outputs a CSV file showing the difference between the cmath library values and the table lookup values. Lastly, it performs some timing loops to compare the cmath version against the table lookup. On my system, the table lookup was very much faster.
#include <cmath>
#include <ctime>
#include <fstream>
#include <iostream>

/**
 * Using <cmath>
 */
inline float doexp1(float input)
{
	return 1 / (1 + exp(-input));
}

/**
 * Lookup table
 */
float exp2tab[300];

/**
 * Generate table file
 */
void save_exp2tab()
{
	std::ofstream ofs("exp2tab.dat");
	for(float f = -15.0f; f < 15.0f; f += 0.1f)
	{
		ofs << (1 / (1+exp(-f))) << '\n';
	}
}

/**
 * Load in table from file
 */
void load_exp2tab()
{
	std::ifstream ifs("exp2tab.dat");
	int i = 0;
	float f;
	while(ifs >> f)
	{
		exp2tab[i++] = f;
	}
}

/**
 * Using table
 */
inline float doexp2(float input)
{
	if(input >= 15.0f) return 1.0f;
	if(input <= -15.0f) return 0.0f;
	int i = int(input * 10) + 149;
	return exp2tab[i];
}

int main()
{
	save_exp2tab(); // create lookup data
	load_exp2tab(); // read in table

	// output a file to examine how close the two methods are

	std::ofstream ofs("output.csv");

	ofs << "Input, Math, Table\n";
	for(float f = -1000.0f; f < 1000.0f; f += 0.1f)
	{
		ofs << f << ", " << doexp1(f) << ", " << doexp2(f) << '\n';
	}

	// Time difference between <cmath> and table

	clock_t begin, end;

	begin = clock();
	for(size_t i = 0; i < 10; ++i)
		for(float f = -1000000.0f; f < 1000000.0f; f += 0.1f)
		{
			doexp1(f);
		}
	end = clock();

	std::cout << "cmath: " << (double(end - begin)/CLOCKS_PER_SEC) << '\n';

	begin = clock();
	for(size_t i = 0; i < 10; ++i)
		for(float f = -1000000.0f; f < 1000000.0f; f += 0.1f)
		{
			doexp2(f);
		}
	end = clock();

	std::cout << "table: " << (double(end - begin)/CLOCKS_PER_SEC) << '\n';

	return 0;
}
cmath: 35.04
table: 0.57

I saw an improvement in speed of about 40 times, though at the cost of some accuracy.
Jul 18, 2011 at 5:43am
Thanks for sharing. The look-up table technique is not new; since the dawn of computing it has been used in OSes, compiler implementations, etc. Sometimes certain trade-offs have to be considered, in your case perhaps speed vs. accuracy. If your application is highly dependent on accuracy, you may not want to use this approach at all.

For very demanding requirements, you may even need to resort to a 'hardware solution': say, hardware with a dedicated floating-point processor that does very precise processing and nothing else.
Jul 18, 2011 at 5:46am
Another method you could try is to create a simpler model of your function. It all depends on how accurate you need the calculation to be: a more sophisticated model of your function will produce more accurate results at the cost of speed.

I added a very primitive model method to my example which treats the critical zone between -15.0 and +15.0 as a linear function. For some purposes this may be adequate. Timing-wise it is about as fast as the table lookup. While less accurate than the table lookup, it has the benefit of being smaller and having no initialization cost:
#include <cmath>
#include <ctime>
#include <fstream>
#include <iostream>

/**
 * Using <cmath>
 */
inline float doexp1(float input)
{
	return 1 / (1 + exp(-input));
}

/**
 * Lookup table
 */
float exp2tab[300];

/**
 * Generate table file
 */
void save_exp2tab()
{
	std::ofstream ofs("exp2tab.dat");
	for(float f = -15.0f; f < 15.0f; f += 0.1f)
	{
		ofs << (1 / (1+exp(-f))) << '\n';
	}
}

/**
 * Load in table from file
 */
void load_exp2tab()
{
	std::ifstream ifs("exp2tab.dat");
	int i = 0;
	float f;
	while(ifs >> f)
	{
		exp2tab[i++] = f;
	}
}

/**
 * Using table
 */
inline float doexp2(float input)
{
	if(input >= 15.0f) return 1.0f;
	if(input <= -15.0f) return 0.0f;
	int i = int(input * 10) + 149;
	return exp2tab[i];
}

/**
 * Using model
 */
inline float doexp3(float input)
{
	if(input > 15.0f) return 1.0f;
	if(input < -15.0f) return 0.0f;
	return (input + 15.0f) / 30.0f;
}

int main()
{
	save_exp2tab(); // create lookup data
	load_exp2tab(); // read in table

	// output a file to examine how close the three methods are

	std::ofstream ofs("output.csv");

	ofs << "Input, Math, Table, Model\n";
	for(float f = -1000.0f; f < 1000.0f; f += 0.1f)
	{
		ofs << f
			<< ", " << doexp1(f)
			<< ", " << doexp2(f)
			<< ", " << doexp3(f)
			<< '\n';
	}

	// Time difference between <cmath>, table and model

	clock_t begin, end;

	begin = clock();
	for(size_t i = 0; i < 10; ++i)
		for(float f = -1000000.0f; f < 1000000.0f; f += 0.1f)
		{
			doexp1(f);
		}
	end = clock();

	std::cout << "cmath: " << (double(end - begin)/CLOCKS_PER_SEC) << '\n';

	begin = clock();
	for(size_t i = 0; i < 10; ++i)
		for(float f = -1000000.0f; f < 1000000.0f; f += 0.1f)
		{
			doexp2(f);
		}
	end = clock();

	std::cout << "table: " << (double(end - begin)/CLOCKS_PER_SEC) << '\n';

	begin = clock();
	for(size_t i = 0; i < 10; ++i)
		for(float f = -1000000.0f; f < 1000000.0f; f += 0.1f)
		{
			doexp3(f);
		}
	end = clock();

	std::cout << "model: " << (double(end - begin)/CLOCKS_PER_SEC) << '\n';

	return 0;
}
cmath: 35.14
table: 0.57
model: 0.56
Last edited on Jul 18, 2011 at 6:13pm
Jul 18, 2011 at 5:50pm
About fixed-point numbers: they represent real numbers with a fractional part that is always the same size. As a result, they can be implemented on top of integer types.

A simple example would be to use an int variable and decide by convention that the last digit represents a digit after the decimal point;
39 would represent 3.9, for example.

Actually, I was about to illustrate the speed advantage with instruction timings, but it seems that floating-point operations are not that slow compared to integer ones on modern processors, so the extra work around the integer operations would make the code slower than floating-point code.
This means that fixed-point numbers are only of interest if you want constant precision over the whole range of the number, or on processors that don't handle float types well.
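
To make that concrete, here is a minimal sketch of the idea using a hypothetical 16.16 format (none of these names come from an actual library):

// Minimal fixed-point sketch: a 16.16 format stored in a 32-bit integer,
// i.e. 16 bits of integer part and 16 bits of fractional part.
#include <iostream>

typedef int fixed16;                 // 16.16 fixed-point value
const int FRAC_BITS = 16;

inline fixed16 to_fixed(float x)   { return (fixed16)(x * (1 << FRAC_BITS)); }
inline float   to_float(fixed16 x) { return (float)x / (1 << FRAC_BITS); }

// addition and subtraction are plain integer operations; multiplication
// needs a 64-bit intermediate and a shift to drop the extra fractional bits
inline fixed16 fx_mul(fixed16 a, fixed16 b)
{
    return (fixed16)(((long long)a * b) >> FRAC_BITS);
}

int main()
{
    fixed16 a = to_fixed(3.9f);
    fixed16 b = to_fixed(0.5f);
    std::cout << to_float(a + b)        << '\n';   // ~4.4
    std::cout << to_float(fx_mul(a, b)) << '\n';   // ~1.95
    return 0;
}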

Topic archived. No new replies allowed.