double sqrt(double n){
    __asm__(
        "fsqrt\n"
        : "+t" (n)
    );
    return n + 1; // I'll explain this shortly
}
I also included <cmath>, and called std::sqrt() in main():
int main(){
    double d = 2;
    double n = sqrt(d);
    double m = std::sqrt(d);
    printf("n: %f, m: %f\n", n, m);
    return 0;
}
The output, interestingly enough, is:
n: 2.414214, m: 2.414214
My own function's output is used in both cases, as you can see. sqrt(2) should be 1.4142..., but my function adds 1 to it. When I ditch the variable d, and just use a constant 2, like this:
double n = sqrt(2);
double m = std::sqrt(2);
Now, my output is:
n: 2.414214, m: 1.414214
As I'd expect it to be.
Is this a weird optimization g++ tries to do or what's going on?
Even putting a literal 2 in the function calls wouldn't make it call the correct functions for me. In release mode, it won't even compile without changing the argument from double n to int n.
This works:
#include <cmath>
#include <cstdio>

double sqrt(int n)
{
    return n + 1;
}

int main()
{
    int d = 2;
    double n = sqrt(d);
    double m = std::sqrt(d);
    printf("n: %f, m: %f\n", n, m);
    return 0;
}
However, I don't know why it feels the need to call the defined function when using a double.
Interesting. I just don't get why it compiles to using my function when calling std::sqrt(). All I can think of is that gcc/g++ somehow recognizes that they're both square root functions and for some reason decides to use my function?
It seems that when the outputs are the same, the disassembly shows two calls to <sqrt>, meaning std::sqrt() has been resolved to my sqrt(). When the outputs differ, on the other hand, there is only one call. Given a constant argument, std::sqrt() is apparently evaluated at compile time, so there is no call to be made, and hence nothing to resolve to my function. That's probably the reason std::sqrt(2) worked for me.
I found this, and I'm assuming this has something to do with the pre-calculated square root:
The declaration/definition of sqrt in cmath looks like this. If I remove the dumping of sqrt into the global namespace I get ambiguous overload errors.
...
#include <math.h> //ME: this declares sqrt,sqrtf,..., and a quick peek
// looks like its declaring them globally
...
#undef sqrt //ME: get rid of sqrt macro defined in math.h
...
//ME: in the std namespace
namespace std _GLIBCXX_VISIBILITY(default)
{
_GLIBCXX_BEGIN_NAMESPACE_VERSION
...
using ::sqrt; //ME: pull the C library's global sqrt into namespace std
//ME: apparently these aren't always defined, but note they are
// inline and use a gcc builtin (presumably calling the assembly
// instruction like your code).
#ifndef __CORRECT_ISO_CPP_MATH_H_PROTO
inline _GLIBCXX_CONSTEXPR float
sqrt(float __x)
{ return __builtin_sqrtf(__x); }
inline _GLIBCXX_CONSTEXPR long double
sqrt(long double __x)
{ return __builtin_sqrtl(__x); }
#endif
//ME: an overload for when the input is an integer
template<typename _Tp>
inline _GLIBCXX_CONSTEXPR
typename __gnu_cxx::__enable_if<__is_integer<_Tp>::__value,
double>::__type
sqrt(_Tp __x)
{ return __builtin_sqrt(__x); }
...
Dang, why is it allowed to do that? I mean, this is a fairly niche problem and easily avoidable by the programmer with a class or namespace, but wouldn't it make more sense to contain all the standard libraries under std::? Is it just for C-compatibility reasons?
Varies a bit, but here goes. One billion calls to sqrt, with cmath, my fsqrt, and as a third, sqrtsd:
Without optimization:
cmath: 4.658 s
fsqrt: 9.838 s
sqrtsd: 8.314 s
With -O3:
cmath: 3.037 s
fsqrt: 6.520 s
sqrtsd: 3.402 s
So all in all, completely useless :D
I read https://www.codeproject.com/Articles/69941/Best-Square-Root-Method-Algorithm-Function-Precisi
and thought I'd check it out myself. The author claimed that math.h's square root was too slow for him, and tried out 14 different sqrt implementations. He found that some implementations were up to five times as fast as math.h's. Someone also recommended using sqrtsd instead of fsqrt, which is why I added it to my tests. In any case, the article doesn't seem to hold up anymore.
The whole test code:
int main(){
    double n = 2.0;
    for(int i = 1000000000; i >= 0; i--){
        n = sqrt(n);
    }
    printf("n: %f\n", n);
    return 0;
}
I can't see that being useful since a long time ago (about 1994, when all major PC CPUs started having an FPU built in with sqrt on chip). Before that, with soft floating point (e.g., the 286 era), it may have been possible to beat it by rolling your own. Or on some integer-based embedded system, it could be true.
Currently, all you are doing is adding overhead to the FPU's function.
If you have ENOUGH of them to do back to back, you could do them in multiple threads and get a flat multiplier speedup; say, a billion on each of 4 CPUs is roughly 4x faster than 4 billion on one CPU.
Something interesting: you may be able to approximate the root to low precision faster, or even to integer precision, if that is good enough. Not every program NEEDS to be terribly accurate, just as sometimes you can say "it's 3 miles thataway" and sometimes you need the info in nanometers.
But there are things in the standard tools that ARE very slow. You can write an integer power function that is 2-3 times faster than the pow() method, because pow() does a lot of extra work to handle floating-point powers, which slows it down if you are not using that feature. Or number-to-text conversion, which is not well implemented on all systems: I just did a DIY version of that and it more than doubled the speed of my code. I also repaired an MD5 implementation I got off the web that used sprintf to make hex strings, and replacing that with my own was likewise a multiplier faster...
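The integer power function described is presumably exponentiation by squaring; a minimal sketch (the function name ipow is my own):

```cpp
// Exponentiation by squaring: O(log exp) multiplications instead of
// pow()'s general floating-point machinery.
double ipow(double base, unsigned exp) {
    double result = 1.0;
    while (exp) {
        if (exp & 1) result *= base;   // fold in the current exponent bit
        base *= base;                  // square for the next bit
        exp >>= 1;
    }
    return result;
}
```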
It's a case-by-case thing, or a tribal-knowledge thing... most of the tools are pretty efficient, and when they are not, it's usually a special-case vs. general-purpose problem (your special case is faster than their general-purpose solution because you discard some work you don't need to do). You will NOT beat anything that is on chip or in the FPU directly (hand-rolled trig, sqrt, log, etc. are going to lose).