I have written a neural network training algorithm that leverages multicore processors and runs about as efficiently as I know how to make a program run in C++. Performance on my 8-core computer at work is about 5,000 times that of the MATLAB equivalent, which I'm very happy about.
However, I think I can still make improvements; I use the math.h library quite heavily, with many calls to higher-level functions such as exp() and tanh(), as well as the more common +, - and * operators on floats.
I don't really need the exp and tanh functions to be very accurate, so I was wondering if there is a faster, maybe slightly less accurate, version of the math.h library that I could use?
Here is one of the functions that is called quite heavily; I was wondering if there are any obvious performance mistakes that might jump out at the more experienced programmers. Note that the actfun_index values are uniformly distributed. typ_act is a float and typ_small is a short.
Well, there are fixed-point math libraries; they use integer types as their base, so the basic operations (addition, multiplication, ...) take fewer cycles than the floating-point ones.
They also have the property that the precision between two representable numbers is constant, as opposed to floating-point numbers, where the precision between two representable numbers drops as the integer part increases.
And for your function, since your indexes are consecutive, I would have made a table containing function pointers, then just called the function at the corresponding index in the table. That would remove all the conditional code (except the default case, if you handle it with an if).
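To sketch what that looks like (the activation function names here are illustrative placeholders, not the OP's actual code):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical activation functions -- names are made up for illustration.
float actfun_linear(float x)  { return x; }
float actfun_sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }
float actfun_tanh(float x)    { return std::tanh(x); }

// Table indexed directly by actfun_index: one array lookup replaces the
// whole chain of if/else or switch cases.
typedef float (*actfun_t)(float);
const actfun_t actfun_table[] = { actfun_linear, actfun_sigmoid, actfun_tanh };

float activate(int actfun_index, float input)
{
    // The "default case": guard the index before the table lookup.
    assert(actfun_index >= 0 && actfun_index < 3);
    return actfun_table[actfun_index](input);
}
```

So e.g. `activate(1, 0.0f)` dispatches to the sigmoid without any branching on the index.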
Can you not make that a template function over typ_small?
Have you profiled your code to see that improving the math functions will make a significant difference? If so which math functions in particular are the problem?
To cut down the math function time, the options off the top of my head are Chebyshev approximation and a lookup table with linear interpolation. However, those functions, tanh and exp, are so trivial that I doubt you will improve the speed with either technique.
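As one concrete instance of the approximation idea, here is a classic Padé-style rational approximation of tanh. It is not a true Chebyshev fit, but it makes the same trade: a few multiplies and one divide instead of a library call, at the cost of a couple of percent of error near |x| = 1:

```cpp
#include <cassert>
#include <cmath>

// Rational (Pade-style) approximation: tanh(x) ~ x*(27 + x*x) / (27 + 9*x*x).
// Error is roughly 2% at |x| = 2 and grows outside about [-3, 3], where
// clamping the output to [-1, 1] would be needed.
float fast_tanh(float x)
{
    const float x2 = x * x;
    return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}
```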
Ok, sorry for the delay but I had to go home and learn function pointers, figuratively speaking. So I've taken bartoli's advice and simply created a function for every case, and then stored the pointers to those functions in an array. Much better!
Bartoli: I'm not familiar with fixed-point libraries, or what that means. Are you saying that 5*2 runs quicker than 5/2? What about 5*0.5?
Kev; my reason for using typ_small was simply to leave open the possibility of changing it later. I've since realised that this probably isn't necessary so I'm simply using integers and floats. Here is the new version of the function:
I haven't been able to test it because there are a hundred other things I've toyed with in the code since, so I've got a bit of debugging ahead of me...
Galik: I was thinking about that, but I don't know how to do that efficiently. Would I need to learn assembler to create a function that can do, for example, 1 / (1+exp(-input)) quicker than what I already have?
(BTW: While "unlikely", the range of input is the entire range of possible values of a float type)
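For illustration, no assembler is needed for this kind of thing: if exact logistic values aren't required, a sigmoid-shaped substitute built from an absolute value and one divide is a well-known trick. This is not numerically equal to 1 / (1+exp(-input)); whether its shape is close enough is an assumption about the application:

```cpp
#include <cassert>
#include <cmath>

// "Fast sigmoid" substitute: 0.5 * x / (1 + |x|) + 0.5.
// Like the logistic function it maps every float into (0, 1), is monotone,
// and passes through 0.5 at x = 0, but the curve's shape differs slightly.
float fast_sigmoid(float x)
{
    return 0.5f * x / (1.0f + std::fabs(x)) + 0.5f;
}
```

It is well behaved over the entire float range, since the denominator only grows with |x|.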
The trouble with function pointers is they kill compiler optimization.
All of the logic inside every one of those functions is so simple that I'd want my compiler to optimize out the function call altogether. In other words, I'd want those functions inlined. The amount of work actfun_threshold() does is actually far, far less than the amount of work it takes to make a call to that function through a pointer.
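To make the inlining point concrete: with a switch, the compiler sees every callee at the call site and can fold tiny bodies like these straight in, whereas a call through a function pointer generally cannot be inlined. The function bodies below are placeholders, not the OP's actual cases:

```cpp
#include <cassert>

// With a switch the compiler can inline these one-liners into activate();
// dispatching through a pointer table usually defeats that optimization.
inline float actfun_threshold(float x) { return x > 0.0f ? 1.0f : 0.0f; }
inline float actfun_identity(float x)  { return x; }

float activate(int actfun_index, float input)
{
    switch (actfun_index) {
        case 0:  return actfun_threshold(input);
        case 1:  return actfun_identity(input);
        default: return 0.0f;
    }
}
```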
I became interested in seeing how much benefit the table method could bring. Looking at the output of the function I noticed that it was essentially constant outside the range -15.0 to +15.0. So I built a table for between those values. You could select broader values if you need more accuracy.
This short program I wrote generates a lookup table. It then outputs a CSV file showing the difference between the cmath library values and the table lookup values. Lastly it performs some timing loops to compare the cmath library version against the table lookup. On my system, the table lookup was very much faster.
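The program itself isn't reproduced here; a minimal sketch of the same idea for the logistic function, assuming the ±15.0 clamp described above and an arbitrarily chosen table size, might look like:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Table of 1/(1+exp(-x)) sampled uniformly on [-15, 15]; inputs outside
// that range are clamped to the end entries, which are ~0 and ~1 anyway.
struct SigmoidTable
{
    static const int N = 4096;   // table size is an arbitrary choice here
    std::vector<float> table;

    SigmoidTable() : table(N)
    {
        for (int i = 0; i < N; ++i) {
            float x = -15.0f + 30.0f * i / (N - 1);
            table[i] = 1.0f / (1.0f + std::exp(-x));
        }
    }

    float operator()(float x) const
    {
        if (x <= -15.0f) return table[0];
        if (x >=  15.0f) return table[N - 1];
        int i = static_cast<int>((x + 15.0f) * (N - 1) / 30.0f);
        return table[i];  // nearest-sample lookup; interpolating would be finer
    }
};
```

Building the table pays a one-off initialization cost; after that each call is a compare, a multiply and an array read.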
Thanks for sharing. The look-up table technique is not new; since the dawn of computing it has been used in OSes, compiler implementations, etc. Sometimes certain trade-offs have to be considered, in your case speed vs accuracy. If your application depends heavily on accuracy, you may not want to use this approach at all.
For very drastic business requirements, you may even need to resort to a 'hardware solution': say, hardware with a dedicated floating-point processor that does very precise processing and nothing else.
Also another method you could try is to create a simpler model of your function. It all depends on how accurate you need the calculation to be. A more sophisticated model of your function will produce more accurate results at the cost of speed.
I added a very primitive model method to my example which treats the critical zone between -15.0 and +15.0 as a linear function. For some purposes this may be adequate. Timing wise it is about as fast as the table lookup. While less accurate than the table lookup it does have the benefit of being smaller and without initialization cost:
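The model method isn't reproduced here either; under the same ±15.0 clamp, a straight line through (-15, 0) and (+15, 1) is one way to realise it (the slope below is an assumption for illustration, not the actual constants used):

```cpp
#include <cassert>

// Crude linear model of the logistic function: 0 below -15, 1 above +15,
// and the straight line 0.5 + x/30 in between. No table, no init cost.
float sigmoid_linear(float x)
{
    if (x <= -15.0f) return 0.0f;
    if (x >=  15.0f) return 1.0f;
    return 0.5f + x / 30.0f;
}
```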
About fixed-point numbers: they are real numbers represented with a decimal part that is always the same size. As a result, they can be implemented on integer types.
A simple example would be to use an int variable, and decide by convention that the last digit will represent a digit after the decimal point.
39 would represent 3.9 for example.
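Following that convention (one decimal digit of fraction, so 39 stands for 3.9), a toy version looks like this; real fixed-point libraries use a power-of-two scale so the rescaling becomes a shift:

```cpp
#include <cassert>

// Toy fixed-point with scale 10: the stored int is 10x the real value,
// so 39 represents 3.9. Real libraries use a power-of-two scale instead.
const int SCALE = 10;

int fx_from_double(double v) { return static_cast<int>(v * SCALE); }
double fx_to_double(int f)   { return static_cast<double>(f) / SCALE; }

int fx_add(int a, int b) { return a + b; }          // same scale: plain add
int fx_mul(int a, int b) { return a * b / SCALE; }  // rescale after multiply
```

So 3.9 * 2.0 becomes 39 * 20 / 10 = 78, i.e. 7.8, using only integer operations.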
Actually, I was about to illustrate the speed benefit with instruction timings, but it seems that floating-point operations are not that slow compared to integer ones on desktop processors, so the extra bookkeeping around the integer operations would make the code slower than the floating-point version.
This means that fixed-point numbers are only of interest if you want constant precision across the whole range of the number, or on processors that don't handle float types well.