### Floats subtraction and comparisons

Hi, I know about the problems with floating-point subtraction and comparison, but I don't know how to resolve them yet. I also know that you can use Boost or other 3rd party software for floats, but in this case I don't want to; I want to learn what the possible solutions are.

It would be ideal to just add a few functions to this and to just make it work and to use a method that is the FASTEST way possible as the values will be constantly changing. Is this even possible? The functions should have a parameter for the precision, for now 6 after the decimal (.123456). All those float numbers will really be returned by functions but I simplified it here to show the main point.

```cpp
#include <iostream>
using namespace std;

int main()
{
    float PosX1 = 1234.123456f;
    float PosX2 = 2345.123456f;
    float WidthX1 = 150.123456f;

    if ((PosX1 - PosX2) >= (.1 * WidthX1))
    {}
}
```

I thought I might be able to multiply and convert the floats to integers like this and then do the check. It works a little better but still glitches from time to time. Do you also lose precision when you multiply numbers? I would imagine so.

```cpp
#include <iostream>
using namespace std;

int main()
{
    float PosX1 = 1234.123456f;
    float PosX2 = 2345.123456f;
    float x1Width = 150.123456f;

    long long int IntegerPosDifference = static_cast<long long int>(PosX1 * 1'000'000) - static_cast<long long int>(PosX2 * 1'000'000);
    long long int IntegerPercentCap = static_cast<long long int>(.1 * x1Width * 1'000'000);

    if (IntegerPosDifference >= IntegerPercentCap)
    {}
}
```

I also found this dude's video on float comparison and I have to look at it more closely. For now I don't know what those formulas really do. What is the best method to use for floats here?

I'm not sure exactly what you're asking. Is there a problem with your first code?

Note that float usually only has about 7 digits of precision (counting from the most significant digit).
```cpp
#include <iostream>
#include <iomanip>

int main()
{
    float PosX1 = 1234.123456f;
    float PosX2 = 2345.123456f;

    std::cout << std::setprecision(100);
    std::cout << "PosX1 = " << PosX1 << "\n";
    std::cout << "PosX2 = " << PosX2 << "\n";
}
```

```
PosX1 = 1234.1234130859375
PosX2 = 2345.12353515625
```

If you instead use double you'll normally get close to 16 digits of precision.
```cpp
#include <iostream>
#include <iomanip>

int main()
{
    double PosX1 = 1234.123456;
    double PosX2 = 2345.123456;

    std::cout << std::setprecision(100);
    std::cout << "PosX1 = " << PosX1 << "\n";
    std::cout << "PosX2 = " << PosX2 << "\n";
}
```

```
PosX1 = 1234.123456000000032872776500880718231201171875
PosX2 = 2345.12345599999980549910105764865875244140625
```

double is the default floating-point type for a reason. It's often good enough for many situations.
Note that a decimal digit precision only applies to the display/string value - not to the underlying binary representation.

The binary format of a float (32 bit - single precision) is 32 bits where:
bits 0 - 22 are the fraction (mantissa)
bits 23 - 30 are the exponent
bits 31 - sign

See https://en.wikipedia.org/wiki/Single-precision_floating-point_format
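
To make the bit layout above concrete, here is a minimal sketch that splits a float into those three fields. `FloatBits` and `decompose` are names made up for this illustration, and the code assumes `float` is a 32-bit IEEE 754 value, which is typical but not guaranteed by the C++ standard.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three IEEE 754 single-precision fields.
struct FloatBits {
    std::uint32_t sign;      // bit 31
    std::uint32_t exponent;  // bits 23 - 30, stored with a bias of 127
    std::uint32_t fraction;  // bits 0 - 22
};

// Assumption: float is a 32-bit IEEE 754 value on this platform.
FloatBits decompose(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t raw = 0;
    std::memcpy(&raw, &value, sizeof raw);  // well-defined way to read the bits
    return { raw >> 31, (raw >> 23) & 0xFFu, raw & 0x7FFFFFu };
}
```

For example, `decompose(1.0f)` should give sign 0, exponent 127 and fraction 0, because 1.0 is stored as 1.0 x 2^0 with the exponent bias of 127 added.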

There are only 23 bits for the fraction. If you have say a value of 2.345678E+30, then this doesn't have 6 digits of decimal precision. The actual decimal precision of the decimal fraction depends upon the size of the stored number. Precision of a stored floating point number refers to the significant precision - not decimal fraction precision.

What's the maximum/minimum value for the float? If it's going to be less than 9223372036854 and greater than –9223372036854 then consider working solely in long long int with an implied precision of 6 (or unsigned long long int if only positive numbers).
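
A rough sketch of what working in long long int with an implied precision of 6 could look like. The names `kScale`, `toFixed` and `exceedsTenthOfWidth` are invented for this example, and the condition is the one from the first post. Note that a value that started life as a float still carries only the float's roughly 7 significant digits, so the conversion cannot add precision that was never there.

```cpp
#include <cmath>

// Hypothetical scale: 6 implied decimal places.
constexpr long long kScale = 1'000'000;

// Convert once at the boundary, rounding to the nearest millionth
// instead of truncating.
long long toFixed(double value) {
    return std::llround(value * kScale);
}

// The thread's condition, (pos1 - pos2) >= 0.1 * width, in integers only.
// Multiplying the left side by 10 avoids the inexact constant 0.1.
bool exceedsTenthOfWidth(long long pos1, long long pos2, long long width) {
    return (pos1 - pos2) * 10 >= width;
}
```

Once the values are integers, subtraction and comparison are exact, so the only place rounding can happen is inside `toFixed`.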

Also note that some floating point numbers can't be represented accurately in binary (the same way that 1 / 3 can't be accurately represented in decimals).
seeplus wrote:
Note that a decimal digit precision only applies to the display/string value - not to the underlying binary representation.

I think the word "precision" is used to mean two things.

std::setprecision is used to specify how many significant digits will be displayed at most. If you use it in combination with std::fixed it instead means the number of digits after the decimal mark.

When I say float has about 7 digits of precision and double has about 16 digits of precision what I mean is that if you round a number to that many significant digits you'll get the expected number. You probably want to display slightly fewer digits because this is just an approximation: the "fraction" is stored in binary (base 2), which doesn't translate exactly to decimal (base 10) digits. Displaying fewer digits also helps hide the effects of rounding errors when doing calculations.

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    // floats:
    std::cout << std::setprecision(6);
    std::cout << 1234.123456f << "\n";
    std::cout << 2345.123456f << "\n";

    // doubles:
    std::cout << std::setprecision(15);
    std::cout << 1234.123456 << "\n";
    std::cout << 2345.123456 << "\n";
}
```
Output:
```
1234.12
2345.12
1234.123456
2345.123456
```

As you can see, this gives you the output you would expect if you round the numbers to 6 and 15 digits respectively.
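
The "safe" digit counts don't have to be memorised; the standard library exposes them through `std::numeric_limits`. The constant names below are made up for this sketch. On platforms that use IEEE 754 formats, `digits10` is 6 for float and 15 for double, and `max_digits10` is 9 and 17.

```cpp
#include <limits>

// Digits that are guaranteed to survive a decimal -> binary -> decimal round trip.
constexpr int kFloatDigits  = std::numeric_limits<float>::digits10;
constexpr int kDoubleDigits = std::numeric_limits<double>::digits10;

// Digits needed when printing so that the text reads back as the identical value.
constexpr int kFloatRoundTrip  = std::numeric_limits<float>::max_digits10;
constexpr int kDoubleRoundTrip = std::numeric_limits<double>::max_digits10;
```

Passing `kFloatDigits` or `kDoubleDigits` to `std::setprecision` is one way to get the "slightly fewer digits" behaviour without hard-coding 6 and 15.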
Thanks for the link I will have to look at it slowly.

It is basically a 3rd party graphics package that lets you manipulate images on the window. I am comparing the positions of 2 images and setting one of them in position in relation to the one being moved. The function returns a float for the image position and it has .123456 that many digits of precision after the decimal.

The movement of my code works fine, but it glitches and pops the 2nd image from time to time depending on the maths and comparison. Peter, you think if I cast the float to a double and then do the maths that it will be more precise and not glitch? I will have to try that.

```
PosX1 = 1234.123456000000032872776500880718231201171875
PosX2 = 2345.12345599999980549910105764865875244140625
```

With the output here I really did not expect that for PosX2. I thought it would be something like this.

2345.123456000000xxxxxxxxxxxxxxxxxxx //x is garbage numbers

To me that is 16 digits of precision and not the one below

2345.12345......599999980549910105764865875244140625

To me that is only 9 digits of precision and the rest is off and then garbage. I don't understand? Are we saying that the display/string value is off but the actual binary representation is REALLY 16 digits of precision? More precise than what is being displayed?

Further consider:

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    // floats:
    std::cout << std::setprecision(6);
    std::cout << 1234.123456f << "\n";
    std::cout << 2345.123456f << "\n";
    std::cout << std::fixed;
    std::cout << 1234.123456f << "\n";
    std::cout << 2345.123456f << "\n";

    // doubles:
    std::cout << std::defaultfloat << std::setprecision(15);
    std::cout << 1234.123456 << "\n";
    std::cout << 2345.123456 << "\n";
    std::cout << std::setprecision(6) << std::fixed;
    std::cout << 1234.123456 << "\n";
    std::cout << 2345.123456 << "\n";
}
```

which displays (for Windows with MS VS):

```
1234.12
2345.12
1234.123413
2345.123535
1234.123456
2345.123456
1234.123456
2345.123456
```

Note the third and fourth lines of the output (the float values printed with std::fixed)!

And some explanations here for setprecision :
https://cplusplus.com/reference/iomanip/setprecision/
SubZeroWins wrote:
The function returns a float for the image position and it has .123456 that many digits of precision after the decimal.

So the integer part is zero? E.g. 0.123456 ?

Otherwise if the integer portion is something like 123 then I don't think float would be precise enough to have that much precision after the decimal point.

Unless it returns a double rather than float.

SubZeroWins wrote:
Peter, you think if I cast the float to a double and then do the maths that it will be more precise and not glitch?

It's possible. It's hard to say for sure without seeing the math but it certainly sounds like it might fix the issue.

SubZeroWins wrote:
With the output here I really did not expect that for PosX2. I thought it would be something like this.
2345.123456000000xxxxxxxxxxxxxxxxxxx //x is garbage numbers

Rounding errors can go both ways. Getting a slightly larger number will result in something like what you wrote but a slightly smaller number will result in something like 2345.123455999999xxxxxxx...............

SubZeroWins wrote:
2345.123456000000xxxxxxxxxxxxxxxxxxx //x is garbage numbers
To me that is 16 digits of precision and not the one below
2345.12345......599999980549910105764865875244140625

I mean if you round it to 16 (or 15 to be safe) significant digits you'll get the expected value. I explained more about this in my reply to seeplus above.
For display, consider scientific format:

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    // floats:
    std::cout << std::scientific;
    std::cout << 1234.123456f << "\n";
    std::cout << 2345.123456f << "\n";

    // doubles:
    std::cout << std::scientific;
    std::cout << 1234.123456 << "\n";
    std::cout << 2345.123456 << "\n";
}
```

```
1.234123e+03
2.345124e+03
1.234123e+03
2.345123e+03
```

which I believe is more representative of how the number is actually stored.

Note that when doing calculations etc it's using the internal binary representation - NOT the way it is displayed. Specifying precision, fixed etc etc doesn't change how the number is stored internally - just how it's displayed.

Irrespective of display, a float has 23 bits for the fraction (mantissa) and a double has 52 bits.
Playing with precision :
```cpp
#include <iostream>
#include <iomanip>

int main()
{
    // float
    float a = 1.23456789f;
    float aa = 325e+2f; // exponential float type
    // double (no f suffix, otherwise the literal would be a float)
    double b = 1.234567890123456789;
    double bb = 325e+2; // exponential double type
    // output with fixed
    std::cout << "Displaying Output With Fixed:" << std::endl;
    std::cout << "Float Type Number 1 = " << std::fixed << a << std::endl;
    std::cout << "Float Type Number 2 = " << std::fixed << aa << std::endl;
    std::cout << "Double Type Number 1 = " << std::fixed << b << std::endl;
    std::cout << "Double Type Number 2 = " << std::fixed << bb << std::endl;
    // output with scientific
    std::cout << "\nDisplaying Output With Scientific:" << std::endl;
    std::cout << "Float Type Number 1 = " << std::scientific << a << std::endl;
    std::cout << "Float Type Number 2 = " << std::scientific << aa << std::endl;
    std::cout << "Double Type Number 1 = " << std::scientific << b << std::endl;
    std::cout << "Double Type Number 2 = " << std::scientific << bb << std::endl;
    return 0;
}
```
 So the integer part is zero? E.g. 0.123456 ?

It can be, if the image is close to the left edge of the window. The number will represent the x coordinate of the image position on the screen and will change based on the users movement of the image from L to R across the screen.

Interesting then, so when they say 8-9 digits of precision or 15-16 digits of precision it does not mean you will get precisely that number. It all depends on the number but if you round it you should get at least 15 digits to be safe.

Maybe I should cast to double and then round, I will have to try it.
The setprecision() is not what I expected in all instances either but good to keep it in mind.
As per previous in this thread, when you say 'precision' you should specify as to what you mean. This word by itself can have 2 meanings - significant precision or decimal (fraction) precision (digits to the right of the decimal point mark). These are different.
SubZeroWins wrote:
The number will represent the x coordinate of the image position on the screen and will change based on the users movement of the image from L to R across the screen.

This means coordinates closer to the left will be more precise than coordinates further to the right. This might not be a problem if the range is not huge but I have heard some "horror stories" of games that used floating point coordinates for very large worlds, and eventually if you get far away from the centre of the world the movement starts acting weirdly, only able to move in large steps at a time, and if your speed is too low you might not be able to move at all.
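
The effect described above can be measured directly with `std::nextafter`. This is only a sketch; `spacingAbove` and `stepIsVisible` are names invented here, and the numbers quoted afterwards assume IEEE 754 single precision.

```cpp
#include <cmath>
#include <limits>

// Gap between `value` and the next representable float above it.
float spacingAbove(float value) {
    return std::nextafter(value, std::numeric_limits<float>::infinity()) - value;
}

// True when adding `step` to `position` changes the stored value at all.
bool stepIsVisible(float position, float step) {
    const float moved = position + step;  // rounded back to float here
    return moved != position;
}
```

Around 2345 the gap between neighbouring floats is 2^-12 (about 0.000244), so a step of 0.0001 is lost entirely, while the same step near 1.0 is still visible. That is also why six digits after the decimal point are not really available for a float coordinate in the thousands.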
I mean basically I just want this mathematics to work like a calculator.

if ((PosX1 - PosX2) >= (.1 * WidthX1))

I will not be displaying the float numbers anywhere, I just want that mathematics to work for xxxx.123456 at the very least this.

Because as it is somewhere in that math there is some loss or gain and the image pops out of place slightly (fractional difference) & then aligns. It is very small and looks like a shift or stagger & then it falls into place but then quickly aligns again. Depends on the numbers.
SubZeroWins wrote:
I mean basically I just want this mathematics to work like a calculator.

Calculators also have rounding errors.
The reason you often don't see the problem is because
1) they use something more precise than float,
2) they do not display too many significant digits, and
3) there is no if-logic that ends up with two completely different results depending on whether a value is slightly above or below a number.
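
To illustrate the third point: with IEEE 754 doubles, `0.3 - 0.1` comes out one step below `0.2`, so a plain `>=` against `0.2` is false even though the "calculator" answer would be true. One common workaround, not something suggested earlier in the thread, is to compare with a tolerance. `greaterOrNearlyEqual` is a hypothetical helper, and the tolerance has to be picked by the caller to suit the range of the values involved.

```cpp
// Hypothetical helper: treats lhs as ">= rhs" when it is at most
// `tolerance` below rhs. The tolerance is an assumption the caller
// must choose for the magnitude of the values being compared.
bool greaterOrNearlyEqual(double lhs, double rhs, double tolerance) {
    return lhs >= rhs - tolerance;
}
```

With `double a = 0.3, b = 0.1, c = 0.2;`, the test `(a - b) >= c` fails but `greaterOrNearlyEqual(a - b, c, 1e-9)` passes. A tolerance only moves the threshold, though; values that land right at the new boundary can still flip from one frame to the next.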

SubZeroWins wrote:
`if ((PosX1 - PosX2) >= (.1 * WidthX1))`

Could you tell me what this code is doing?
What does PosX1, PosX2 and WidthX1 represent?
Why are we multiplying with 0.1?
Etc.
SubZeroWins wrote:
I just want that mathematics to work for xxxx.123456 at the very least this.

It depends on what you mean by "work". Some rounding errors are to be expected when working with floating-point numbers. Don't expect things to be exact when you write the "math".

If you need 10 digits of precision like you say then you need to use double. Make sure you're using double everywhere. Don't add an `f` at the end of floating-point literals because that will make it a float. `1234.123456f` is a float and is less precise than `1234.123456` even if you assign both numbers to variables of type double. Don't use functions like fabsf that only work with floats, use fabs or abs instead. Etc.
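
A small sketch of the `f` suffix pitfall, using the value from the first post. The variable and function names are made up for the example.

```cpp
// Both variables are double, but the first is initialised from a float
// literal, so it only carries float precision.
const double fromFloatLiteral  = 1234.123456f;
const double fromDoubleLiteral = 1234.123456;

// Difference between the two stored values.
double literalGap() {
    return fromDoubleLiteral - fromFloatLiteral;
}
```

On an IEEE 754 platform `fromFloatLiteral` holds 1234.1234130859375 (the float value printed earlier in the thread), so the gap is roughly 0.00004 even though both variables are doubles.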

SubZeroWins wrote:
Because as it is somewhere in that math there is some loss or gain and the image pops out of place slightly (fractional difference) & then aligns. It is very small and looks like a shift or stagger & then it falls into place but then quickly aligns again. Depends on the numbers.

Are you perhaps converting floating-point coordinates to integer pixel coordinates somewhere? In that case you probably want to round, rather than just truncating the values (i.e. throwing away the decimal parts) which is what happens when you convert a floating-point value to an integer.

Something that has tripped me up in the past when working with floating-point pixel coordinates and graphics is that the middle of a pixel is not at integer coordinates. For example, the middle of pixel (100, 150) is actually at (100.5, 150.5). I don't know if this matters for what you're doing and you might of course be handling it differently but it could be worth thinking about if you haven't. It might matter for how you do the rounding.

This sort of problem is likely with both float and double because the size of the error is not the main issue; it is the fact that any error in the wrong direction will end up being categorized incorrectly.

For example, say you intended to get 5 but instead you got 4.9 or 4.999999999999. If you tried to convert these numbers to integers by just truncating them then you would get the wrong integer 4 in both cases. If you instead rounded the numbers you would get the correct integer 5 in both cases.
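
The same example as a sketch in code. `toPixelTruncated` and `toPixelRounded` are hypothetical names; `std::lround` is the standard function that rounds to the nearest integer, with halfway cases going away from zero.

```cpp
#include <cmath>

// Truncation: the conversion simply drops the fractional part.
int toPixelTruncated(double coordinate) {
    return static_cast<int>(coordinate);
}

// Rounding: pick the nearest integer instead.
int toPixelRounded(double coordinate) {
    return static_cast<int>(std::lround(coordinate));
}
```

Both 4.9 and 4.999999999999 give 4 with `toPixelTruncated` and 5 with `toPixelRounded`.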
Thanks for your help it is most appreciated.

I tried your suggestion of the casting to double (same for long double for my VS) and it is the same jumpy top image. Basically it stutters or jitters the top image from time to time by a very small amount due to the maths in the if() statement.

Basically envision 2 images, a top image and bottom image. The user is moving the ImageBottom and the ImageTop will follow always at an offset to the top image's left position.

If the ImageBottom is moving to the left, the ImageTop is not allowed to move with the ImageBottom until the ImageBottom reaches ImageTop.left() - (.1 * ImageTop.Width), i.e. 1/10th of the width further left than the current ImageTop.left() position.

So to recap, when moving left the ImageTop is not allowed to move with the ImageBottom until the ImageBottom is <= (ImageTop.left() - (.1 * ImageTop.Width)).

```cpp
// ORIGINAL
//if (ImageBottom.getGlobalBounds().left <= (ImageTop.getGlobalBounds().left - (0.1f * ImageTop.getGlobalBounds().width)))

// TRY 1: Integer conversion
//intDiff = (static_cast<long long int>(ImageBottom.getGlobalBounds().left * 1'000'000)) -
//          (static_cast<long long int>(ImageTop.getGlobalBounds().left * 1'000'000));
//intWidthCap = static_cast<long long int>((0.1f * ImageTop.getGlobalBounds().width) *
//              1'000'000);

// TRY 1: Integer conversion
//if ((intDiff) <= (-intWidthCap))

// TRY 2: Cast to double
if (static_cast<double>(ImageBottom.getGlobalBounds().left) <=
    (static_cast<double>(ImageTop.getGlobalBounds().left) -
     static_cast<double>(0.1f * ImageTop.getGlobalBounds().width)))
{
    ImageTop.setPosition(ImageTop.getGlobalBounds().left - m_SpeedScalesMovingRow * dt.asSeconds(),
                         ImageTop.getGlobalBounds().top);
}
```