Double, Float or other issue?

I have some code with the following parameters required...

u = 1e-08
dr = 0.00673332

mo[0] = 1 - u
mo[1] = u
mo[2] = -u
mo[3] = 1 - u
p[0] = -((1 - u) + dr)

When I run the code I get the following values...

u = 1e-08
dr = 0.00673332
mo[0] = 1
mo[1] = 1e-08
mo[2] = -1e-08
mo[3] = 1
p[0] = -1.00673


All are defined as double yet the results for mo[0], mo[3] and p[0] are not accurate. I know that double is twice as precise as float but should I be using float to maintain accuracy? Or am I missing something in my code?

The relevant code segment is below...

int k = -8;
double x = 1.3; 
double u = pow(10, k);
double dr = x * pow(u, 2.0/7);

valarray<double> mo(4);

mo[0] = 1.0 - u;
mo[1] = u;
mo[2] = -u;
mo[3] = 1.0 - u;

valarray<double> p(4), o(4);
        
p[0] = -1.0 * ((mo[3]) + dr);
p[1] = 0.0;
p[2] = 0.0;
p[3] = sqrt(1.0 / abs(p[0])) - p[0];

What do you mean by "are not accurate"?

You do realize that both float and double are approximations, right?

How are you printing the values, please show the code?
Try the "other issue" possibility.

Write 0.99999999 to six sig figs (the default).
Compute your constants once. Lines 3 and 4 of your snippet (the pow calls that compute u and dr) may introduce roundoff that isn't necessary if you hard-code the true result once. Use a high-precision external tool to get the correct value.
Change it to simply:
dr = x * .005... blah blah digits; it may help a little, or it may be the same as before.

@jib; yes I know about approximations. :)
I teach computer science to degree level and focus on binary structures and machine code; along with programming (just not c++). Dependent upon the way the data is stored you can increase range but decrease precision or vice versa. I was trying to identify which one is best to use. I keep returning to double but others with more experience of c++ might have added detail.

My main point is that, clearly, 1 - 10^-8 is not 1; though it is very nearly one. I was intrigued by the fact that sometimes I get a very long decimal result but here a very definite 1 - and hoping for a way to preserve the integrity of the values.

@lastchance, thank you. The problem (and the reason for separating out k and x from the dr equation) is that it is precisely these two values that I need to explore by changing them. So, fixing the values is not an option.

@jonnin, I had this originally and feel it is the best way forward. I separated out k and x for the reason defined above but since I am the only person changing the code, with anyone else only reading my summary of said code, I think reducing the variable count will help in this case and reduce memory usage. Thank you.
In general, double, which is the default floating-point type in C and C++, is somewhat less error-prone: using it everywhere makes accidental implicit conversions and type mismatches less likely. That is to say, double makes consistency easy. This makes it the best choice unless there are overriding concerns about performance. For instance, a programmer might choose float when there are constraints on memory or (more frequently) memory bandwidth.

When it comes to precision, double's obviously more precise. This is typically a minor advantage because most users are far less concerned about precision than about accuracy. Accurate results are obtained by picking a better algorithm, not by using a more precise representation.

Floating point quantization error is measured in units in the last place, or ulps. The ulp is a relative unit: its magnitude is proportional to the magnitude of the particular floating point number. Because quantization produces relative error, absolute error can be minimized by avoiding operations on values of significantly different magnitude. Consider, e.g., William Kahan's compensated summation algorithm as an example of this principle.

Indeed, when it comes to getting accurate results in a lengthy floating point calculation, precision is rarely a major factor. Any electronics hobbyist can tell you that expensive, precise components won't get you far: circuit design is far more important. In software speak, this means no floating point computation will produce good results if the algorithm is bad. Focus on the algorithm, after which additional precision will do nothing except help reduce the (small) relative error in the results.
The relationship between accuracy and precision is explained with both accuracy and precision with this:
https://en.wikipedia.org/wiki/Accuracy_and_precision

If there is any doubt left it is in the deeply philosophical arguments around science and truth which is even more enriching.
To put the 'error' of 1e-8 into perspective, it represents a miss of about 45 mm when travelling from a point A on the east coast of the US to a point B on the west coast (1e-8 of roughly 4,500 km).
> I teach computer science to degree level and focus on binary structures and machine code;
Then this should be required reading for you.
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
> Then this should be required reading for you.

I second this.
Let's see:
#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
   float u = 1.0e-8;
   float m = 1 - u;
   cout << "As float:\n";
   for ( int i = 0; i < 20; i++ ) cout << fixed << setprecision( i ) << m << '\n';

   double U = 1.0e-8;
   double M = 1 - U;
   cout << "\nAs double:\n";
   for ( int i = 0; i < 20; i++ ) cout << fixed << setprecision( i ) << M << '\n';
}


As float:
1
1.0
1.00
1.000
1.0000
1.00000
1.000000
1.0000000
1.00000000
1.000000000
1.0000000000
1.00000000000
1.000000000000
1.0000000000000
1.00000000000000
1.000000000000000
1.0000000000000000
1.00000000000000000
1.000000000000000000
1.0000000000000000000

As double:
1
1.0
1.00
1.000
1.0000
1.00000
1.000000
1.0000000
0.99999999
0.999999990
0.9999999900
0.99999999000
0.999999990000
0.9999999900000
0.99999999000000
0.999999990000000
0.9999999899999999
0.99999998999999995
0.999999989999999950
0.9999999899999999498


It depends on:
(a) the precision of your type: float (about 7 sig figs); double (about 16 sig figs)
(b) how you choose to print it out.



You should:
(1) use double;
(2) try to rearrange your code if you can so that you avoid subtracting (or even, adding) two things of very different magnitude.

I would also suggest:
(3) writing
u=1.0e-8;
rather than
int k = -8;
double u = pow(10, k);


(4) Not using o as the name of a variable.