C++ Review Questions

Yes, the implementation of long double varies quite a lot between different platforms/compilers. On Windows with Visual C++ it's identical to double.
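A quick way to see what any given compiler does is to ask numeric_limits. A minimal sketch (the exact numbers printed will of course depend on your compiler and platform):

#include <iostream>
#include <limits>

int main()
{
    std::cout << "double:      " << sizeof(double) << " bytes, "
              << std::numeric_limits<double>::digits << " mantissa bits\n"
              << "long double: " << sizeof(long double) << " bytes, "
              << std::numeric_limits<long double>::digits << " mantissa bits\n" ;
    // Visual C++ typically reports 8 bytes / 53 bits for both;
    // MinGW typically reports 16 bytes / 64 bits for long double (the x87 80-bit format, padded).
}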
Last edited on
Older VS (talking VS 6.0 era maybe, it's been a WHILE) had a way to access the then-current hardware FPU's 80-bit registers (I don't know whether the FPU still works this way?) so you could get a little more precision at the risk of errors (the extra bits were used internally in the FPU to help avoid rounding accumulation problems).

All that is gone now, though.
You can research if your FPU has some bigger storage than 64 bit and if so, use assembly language to craft a hardware efficient long double of sorts. But catch 22, I think 64 bit VS also disables using assembly.
Last edited on
Yeah, a compiler's implementation of a C++ feature is a crucial data point that is usually glossed over, presumed to be the same no matter what the OS and/or compiler.

One reason, I'd opine, why the C++ standard doesn't lock down the basic type storage requirements, saying only they will be at least this or that.

I do know with MinGW (Code::Blocks) a double and long double are different sizes.

Personal opinion, if the C++ standard doesn't have something I need Boost usually does. And I trust Boost to be nearly as stable as the C++ stdlib.

I have yet to need super-duper ultra precision with floating point numbers, but if I ever do need it I'll Boost it.
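For what it's worth, a minimal sketch of the sort of thing Boost offers, assuming Boost.Multiprecision is available (cpp_bin_float_50 carries roughly 50 decimal digits, with the arithmetic done in software):

#include <boost/multiprecision/cpp_bin_float.hpp>
#include <iomanip>
#include <iostream>
#include <limits>

int main()
{
    using big = boost::multiprecision::cpp_bin_float_50 ;  // ~50 decimal digits

    const big third = big(1) / 3 ;
    std::cout << std::setprecision( std::numeric_limits<big>::max_digits10 )
              << third << '\n' ;   // prints about 50 correct digits of 1/3
}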
But catch 22, I think 64 bit VS also disables using assembly.


It doesn't allow in-line assembly for 64-bit compiles.

You can have a 64-bit assembly source file added as part of the project which is then assembled and linked at link time into the .exe.

https://docs.microsoft.com/en-us/cpp/assembler/masm/masm-for-x64-ml64-exe?view=msvc-170

With VS as 64 bit, both double and long double are 64 bit quantities.
Last edited on
Thanks guys, I have recovered from this realization & shock of not knowing this from my 1st book, and I have to be wary of floating point values. They are great for representing really small or large values, but the precision can be limiting & tricky, and that is just part of the territory. Fractional numbers are infinite & the float types can only represent an extremely small portion of that reality, with junk values sometimes being inserted & causing values to go astray.

Be vigilant when using arithmetic operators on:
A) The combination of very small AND very large numbers.
B) When subtracting numbers that are nearly equal to one another (Catastrophic Cancellation)
C) Careful of floats in general that pass a certain precision threshold.

Beginning C++20 p45

Table 2-5. Floating-Point Type Ranges (FOR Intel processors)
Type          Precision (decimal digits)    Range (+ or -)
float         7                             ±1.18 × 10^-38   to ±3.4  × 10^38
double        15 (nearly 16)                ±2.22 × 10^-308  to ±1.8  × 10^308
long double   18-19                         ±3.65 × 10^-4932 to ±1.18 × 10^4932


The numbers of digits of precision in Table 2-5 are approximate. Zero can be represented exactly with each type, but values between zero and the lower limit in the positive or negative range can’t be represented, so the lower limits are the smallest possible nonzero values.


I think my 2nd book that I am reading now was trying to tell me that here too, but not as clearly in this section. I think it means what you guys have already echoed: that the lower fraction limits (pos/neg range) just cannot be represented exactly & with a high degree of precision. I guess you have to take this on a case-by-case basis & underline the table as APPROXIMATE.
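To make point B) above concrete, here is a tiny sketch of catastrophic cancellation: 1.0 + 1e-15 and 1.0 agree in their first 15 significant digits, so subtracting them gives a result with a large relative error.

#include <iomanip>
#include <iostream>

int main()
{
    const double x = 1.0e-15 ;
    const double y = (1.0 + x) - 1.0 ;   // mathematically equal to x

    std::cout << std::setprecision(17)
              << "x = " << x << '\n'     // 1.0000000000000001e-15
              << "y = " << y << '\n' ;   // ~1.1102230246251565e-15: roughly 11% relative error
}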

For most of the code you write, do you get to test your code before deploying and rarely have to code in the live & critical present?
Last edited on
MrZ wrote:
Be vigilant when using arithmetic operators on:
A) The combination of very small AND very large numbers.
B) When subtracting numbers that are nearly equal to one another (Catastrophic Cancellation)
C) Careful of floats in general that pass a certain precision threshold.


For most applications the 15 sf of a double is plenty; small or large numbers are covered by the exponent, and usually appropriate units are used. For example, astrophysicists tend to use megaparsecs as a unit, sometimes light years, but never km, metres, or millimetres !! With a lot of measurement systems, one can hardly ever get near 15 sf of accuracy.

For equality, always compare the absolute difference with operator <, against some value sufficiently near to zero for your application. For example, if working with mm precision and units of metres, anything less than 1e-3 is zero. Although one does need more precision if squaring numbers, for example. Same idea for subtraction: write an IsEqual function.

So your A,B,C options are not really a problem if one takes care in comparing.
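For instance, a minimal IsEqual sketch along those lines (the default tolerance here is only illustrative; choose one that matches your units and precision):

#include <cmath>
#include <iostream>

bool IsEqual( double a, double b, double tolerance = 1e-9 )
{
    return std::abs( a - b ) < tolerance ;
}

int main()
{
    const double x = 0.1 + 0.2 ;
    std::cout << std::boolalpha
              << ( x == 0.3 )      << '\n'   // false: representation error
              << IsEqual( x, 0.3 ) << '\n' ; // true
}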
Thanks.

@JLBorges
From your link, do you mean this kind of "mean": (10+20+30)/3 = 60/3 = 20?

Shouldn't the mean be 10'000'000'000'000'000 and NOT 5000000000000000.0, since you're populating each element of the array with 1.e+16?


At some point the "sum += a[i];" just gives up & the next time you try to add 1.e+16, it just keeps the sum at 4999999999971101245440.0...which drags the average closer & closer to that wrong value of the mean.

Also, this number 4999999999971101245440.0 or 4.9999999999E21 is nowhere near the double 1.8 × 10^308 max limit? So, what happened: too many significant digits were used & it stopped reporting the addition properly? Why didn't it just add with scientific notation? This is not a fractional number, it is a whole number...the repeated addition of 1.e+16.

You can't add another million, because you cannot get any more sig fig than 4999999999971101245440.0

I understand though that the intention was to show that the compensated algorithm can bring the mean closer to the actual value.
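For reference, "compensated" summation usually means something along the lines of Kahan's algorithm. A minimal sketch of the idea (my own illustration, not necessarily the exact code behind that link):

#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    std::vector<double> a( 1'000'000, 0.0 ) ;
    for( std::size_t i = 0 ; i < a.size()/2 ; ++i ) a[i] = 1.0e+16 ;  // first half only

    double naive = 0.0 ;
    for( double x : a ) naive += x ;                  // plain summation

    double sum = 0.0, c = 0.0 ;                       // c carries the lost low-order bits
    for( double x : a )
    {
        const double y = x - c ;
        const double t = sum + y ;
        c = ( t - sum ) - y ;                         // what was rounded away in sum + y
        sum = t ;
    }

    std::cout << std::fixed
              << "naive mean:       " << naive / a.size() << '\n'    // drifts away from the exact value
              << "compensated mean: " << sum   / a.size() << '\n' ;  // much closer to the exact 5000000000000000
}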
Last edited on
> since your populating each element of the array with 1.e+16 ?

Each element of the first half of the array; the elements of the second half remain as zero.

> Why didn't it just add with scientific notation? This is not a fraction number, it is a whole number.

Whole numbers too can lose precision when they are huge; in this case
1.e+16 * (1'000'000/2) is just too big.

Try this program:
#include <iostream>
#include <iomanip>
#include <cmath>

int main()
{
    std::cout << std::fixed ;
    for( double d = 1.0e+12 ; d < 1.0e+22 ; d *= 10.0 )
    {
        const double next = std::nextafter( d, 1.0e+30 ) ;
        std::cout << "after " << d << ", the next higher representable number is "
                  << next << "   (+" << next-d << ")\n" ;
    }
}

http://coliru.stacked-crooked.com/a/3b28544f1b275c44
Oh, OK.......N/2

Also, that number 4999999999971101245440.0 has more than 15-16 digits of precision, and even though it is whole, not a fraction, the digits somewhere after the last 9 are junk values.

How the heck can that be...too much precision, just compounds junk....dangerous & requires caution indeed!


after 1000000000000.000000, the next higher representable number is 1000000000000.000122   (+0.000122)
after 10000000000000.000000, the next higher representable number is 10000000000000.001953   (+0.001953)
after 100000000000000.000000, the next higher representable number is 100000000000000.015625   (+0.015625)
after 1000000000000000.000000, the next higher representable number is 1000000000000000.125000   (+0.125000)
after 10000000000000000.000000, the next higher representable number is 10000000000000002.000000   (+2.000000)
after 100000000000000000.000000, the next higher representable number is 100000000000000016.000000   (+16.000000)
after 1000000000000000000.000000, the next higher representable number is 1000000000000000128.000000   (+128.000000)
after 10000000000000000000.000000, the next higher representable number is 10000000000000002048.000000   (+2048.000000)
after 100000000000000000000.000000, the next higher representable number is 100000000000000016384.000000   (+16384.000000)
after 1000000000000000000000.000000, the next higher representable number is 1000000000000000131072.000000   (+131072.000000)

Last edited on
Mr Z wrote:
this number 4999999999971101245440.0 or 4.9999999999E21 is nowhere near the double 1.8 × 10^308 max limit?

Yeah, but there are plenty of numbers in the range from 0 up to the max that cannot be represented exactly.

Mr Z wrote:
what happened too many significant digits were used & it stopped reporting the addition properly?

Yeah. The sum 5000000000000000000000 has way more significant digits than 15-16.

Look at iteration i=59033 where it tries to add 590320000000000000000 and 10000000000000000.

590320000000000000000 = 1.0000000000000010101011111000001011001110000001001100 × 2^69
    10000000000000000 = 1.0001110000110111100100110111111000001000000000000000 × 2^53

The correct answer would be 590330000000000000000.

590330000000000000000 = 1.00000000000000111100101110111010011000011000001011001 × 2^69

But since the fractional part only uses 52 bits the last (rightmost) bit does not fit. That's why it ends up with 590329999999999934464 instead.

590329999999999934464 = 1.0000000000000011110010111011101001100001100000101100 × 2^69

Note that if we round this number to 15 significant digits we still get the expected answer, but as it continues to add more and more such small errors you end up with a total error that is much larger.
Last edited on
Thanks Peter87...

On my computer & VS, long double is the same as double, 8 bytes. I tried to follow your exact numbers with an online binary calculator, but the numbers differ. But I get the gist of what you're saying: that the right-most bits are truncated & we lose precision, and that it can get compounded with further arithmetic....only I might not know EXACTLY where the bits are lined up to be added internally. I tried to show the alignment below, but it just does not look right, because once you remove the right-most bits, it changes the actual value...which might be preserved in the exponent.

Do you think you can show me the alignment & with the lost bits in [] brackets that are not part of the binary addition?

decimal:
590320000000000000000 (5.9032E20) =

binary ACTUAL:
1000000000000001010101111100000101100111000000100110000000000000000000 (2^70)


On your computer long double is 64 bits (1 sign bit + 11 exponent bits + 52 mantissa bits). C++ can't store that 70-bit number exactly, so only 52 bits are kept for mantissa precision.
1000000000000001010101111100000101100111000000100110 [00000000000000000] (2^52)

The brackets are the least sig bits & represented in the scientific notation part of memory storage & you lose those right-hand bits (17 bits) along with precision (if those bits had values).

decimal:
10000000000000000 (1.0E16) =

binary ACTUAL:
100011100001101111001001101111110000010000000000000000 (2^54)

Even the full precision of this cannot be stored, since it's 2 bits over the 52-bit limit, so it is stored with the 2 least sig bits of precision lost, shown in brackets:
1000111000011011110010011011111100000100000000000000 [00] (2^52)

ACTUAL MANUAL BINARY ADDITION (590320000000000000000 + 10000000000000000):
1000000000000001010101111100000101100111000000100110000000000000000000(2^70)
+ _____________ 100011100001101111001001101111110000010000000000000000(2^54)
1000000000000001111001011101110100110000110000010110010000000000000000(2^70)


= 590330000000000000000


MACHINE TRUNCATION OF RIGHT-MOST LEAST SIG BITS TO FIT 52 BITS: ???????
_ 1000000000000001010101111100000101100111000000100110  [00000000000000000] (2^52)
+_______________ 1000111000011011110010011011111100000  [10000000000000000] (2^37)



In [] brackets are the lost bits....I think?????? Is the alignment right for the addition????



Last edited on
Forget about long double. The program that we're talking about uses only double.
https://cplusplus.com/forum/general/182508/#msg894208

My assumption is that double uses the "double-precision floating-point" format (binary64) as specified by IEEE 754.
https://en.wikipedia.org/wiki/Double-precision_floating-point_format
I'm pretty sure that is true also on your computer.

A total of 64 bits is used.
1 bit is used for the sign.
11 bits are used for the exponent.
52 bits are used for the fraction.

In my previous answer I used this tool http://weitz.de/ieee/ to get the binary representation of the mantissa. The binary representation of the sign and exponent is irrelevant for this particular discussion because we are not stretching the limits of those. That's why I only wrote the mantissa as binary and left the base and exponent as decimal (base 10).

Note that the 1 bit before the dot is implicit and is not part of the 52 bits. That's why I wrote 52 bits are used for the fraction (i.e. the fractional part; what's to the right of the dot) and not for the whole mantissa.

                     implicit
                        ↓
590320000000000000000 = 1.0000000000000010101011111000001011001110000001001100 × 2^69
    10000000000000000 = 1.0001110000110111100100110111111000001000000000000000 × 2^53
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          52 bits

Note that the exponents of these two numbers are not the same so before we can add them by hand we would have to rewrite them with the same exponent.

590320000000000000000 = 1.00000000000000101010111110000010110011100000010011000 × 2^69
    10000000000000000 = 0.00000000000000010001110000110111100100110111111000001 × 2^69
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          53 bits

Note that to not discard any information I had to use one extra binary digit. I also chose to write them with the same number of digits to simplify the calculation but that isn't strictly necessary.

Now you can just do normal addition of the mantissas by hand if you want.
https://en.wikipedia.org/wiki/Carry_(arithmetic) <--- pay attention to the fact that we're using binary here!

The result is:

590330000000000000000 = 1.00000000000000111100101110111010011000011000001011001 × 2^69
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          53 bits

And after removing the last bit to get back to 52 bits for the fraction we get:

590329999999999934464 = 1.0000000000000011110010111011101001100001100000101100 × 2^69
                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                          52 bits

I have to admit that I only did the addition of the mantissas by hand to verify, but I didn't actually calculate the whole expression on the right to get the number on the left. Instead I just entered the value that I got from the program into the tool that I linked earlier to see that it matched.

I don't know how it's actually being done in the hardware, but the values I get in the program are consistent with the calculations here.
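If you want to check these bit patterns without the online tool, here is a small sketch, assuming IEEE 754 binary64 and a C++20 compiler (for std::bit_cast), that prints the stored sign, exponent and 52-bit fraction of a double:

#include <bit>
#include <bitset>
#include <cstdint>
#include <iostream>

void dump( double d )
{
    const std::uint64_t bits = std::bit_cast<std::uint64_t>(d) ;
    std::cout << d << '\n'
              << "  sign:     " << ( bits >> 63 ) << '\n'
              << "  exponent: " << std::bitset<11>( bits >> 52 ) << "  (biased)\n"
              << "  fraction: " << std::bitset<52>( bits ) << "\n\n" ;
}

int main()
{
    dump( 590320000000000000000.0 ) ;
    dump( 1.0e16 ) ;
    dump( 590320000000000000000.0 + 1.0e16 ) ;   // compare with the hand calculation above
}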
Last edited on
The cppreference page on fundamental data types should be a "must-read":
https://en.cppreference.com/w/cpp/language/types

There is one very instructive note about floating point types and MSVC (Visual Studio):
cppreference wrote:
The most well known IEEE-754 binary64-extended format is 80-bit x87 extended precision format. It is used by many x86 and x86-64 implementations (a notable exception is MSVC, which implements long double in the same format as double, i.e. binary64).

Use another Windows-based compiler, say MinGW, and a long double is indeed a larger size than double.
Got it. You use the implicit conversion to store 1. and do the addition using 53 bits, but then you drop the last bit and it seemingly gives you the representation of the loss of data.

Where is that left-most significant digit 1. stored, in a temp memory location?


Thanks George P, at one point I have to try another compiler as well.
I'll make a couple of suggestions for an alternate compiler, one an IDE and one commandline.

Code::Blocks, and/or MSYS2.

https://www.codeblocks.org/

https://www.msys2.org/

Code::Blocks can run on a 32-bit system, MSYS2 is strictly 64-bit just like Visual Studio 2022.

I'd suggest getting both, combined they don't gobble up HD space like VS does.
> but then you drop the last bit and it seemingly gives you the representation of the loss of data.

How the rounding is done (other than for constant expressions) depends on the current floating point rounding direction.
https://en.cppreference.com/w/cpp/numeric/fenv

#include <iostream>
#include <iomanip>
#include <cfenv>
#pragma warning(disable:4068) // unknown pragma (microsoft)
#pragma STDC FENV_ACCESS ON 

int main()
{
    std::cout << std::fixed << std::setprecision(20) << std::showpos ;
    double n = 1.0 ;
    double d = 10.0 ;
    double negd = -d ;

    std::cout << "default rounding direction is: " ;
    const auto rounding_dir = std::fegetround() ;
    switch( rounding_dir )
    {
        case FE_DOWNWARD: std::cout << "FE_DOWNWARD\n"; break;
        case FE_TONEAREST: std::cout << "FE_TONEAREST\n"; break;
        case FE_TOWARDZERO: std::cout << "FE_TOWARDZERO\n"; break;
        case FE_UPWARD: std::cout << "FE_UPWARD\n"; break;
        default: std::cout << "implementation defined\n\n" ;
    }

    std::cout << "default:\n"
        << n << " / " << d << " == " << n / d << '\n'
        << n << " / " << negd << " == " << n / negd << "\n\n" ;

    std::fesetround( FE_DOWNWARD );
    std::cout << "FE_DOWNWARD:\n"
        << n << " / " << d << " == " << n / d << '\n'
        << n << " / " << negd << " == " << n / negd << "\n\n" ;

    std::fesetround( FE_UPWARD );
    std::cout << "FE_UPWARD:\n"
        << n << " / " << d << " == " << n / d << '\n'
        << n << " / " << negd << " == " << n / negd << "\n\n" ;

    std::fesetround( FE_TOWARDZERO );
    std::cout << "FE_TOWARDZERO:\n"
        << n << " / " << d << " == " << n / d << '\n'
        << n << " / " << negd << " == " << n / negd << "\n\n" ;

    std::fesetround( FE_TONEAREST );
    std::cout << "FE_TONEAREST:\n"
        << n << " / " << d << " == " << n / d << '\n'
        << n << " / " << negd << " == " << n / negd << "\n\n" ;
}
Mr Z wrote:
Where is that left-most significant digit 1. stored, in a temp memory location?

It's not stored anywhere. That's why I said it was implicit (It has nothing to do with "implicit conversion"). It can be assumed to always be 1, except for some special values (Certain values of the exponent have special meaning).

https://en.wikipedia.org/wiki/Double-precision_floating-point_format#Exponent_encoding
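To see that the leading 1 really isn't stored anywhere, here is a small sketch, again assuming IEEE 754 binary64 and C++20 (std::bit_cast), that rebuilds a positive, normal double from its stored exponent and fraction bits and adds the implicit 1 back in by hand:

#include <bit>
#include <cmath>
#include <cstdint>
#include <iostream>

int main()
{
    const double d = 590320000000000000000.0 ;
    const std::uint64_t bits = std::bit_cast<std::uint64_t>(d) ;

    const int    exponent = int( ( bits >> 52 ) & 0x7FF ) - 1023 ;           // remove the bias
    const double fraction = double( bits & 0xFFFFFFFFFFFFFull ) / 0x1p52 ;   // the 52 stored bits

    const double rebuilt = std::ldexp( 1.0 + fraction, exponent ) ;          // implicit 1 added here
    std::cout << std::boolalpha << ( rebuilt == d ) << '\n' ;                // true
}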
Last edited on
JLBorges, rounding noted.

default rounding direction is: FE_TONEAREST
default:
+1.00000000000000000000 / +10.00000000000000000000 == +0.10000000000000000555
+1.00000000000000000000 / -10.00000000000000000000 == -0.10000000000000000555

FE_DOWNWARD:
+1.00000000000000000000 / +10.00000000000000000000 == +0.09999999999999999167
+1.00000000000000000000 / -10.00000000000000000001 == -0.10000000000000000556

FE_UPWARD:
+1.00000000000000000001 / +10.00000000000000000001 == +0.10000000000000000556
+1.00000000000000000001 / -10.00000000000000000000 == -0.09999999999999999167

FE_TOWARDZERO:
+1.00000000000000000000 / +10.00000000000000000000 == +0.09999999999999999167
+1.00000000000000000000 / -10.00000000000000000000 == -0.09999999999999999167

FE_TONEAREST:
+1.00000000000000000000 / +10.00000000000000000000 == +0.10000000000000000555
+1.00000000000000000000 / -10.00000000000000000000 == -0.10000000000000000555


Peter87,
I thought it was the exponent at first, but "The binary representation of the sign and exponent is irrelevant for this particular discussion" threw me off, and I was puzzled as to how it could not be; then I thought it might be C++ magic again. I understand, thank you!
Last edited on
For info, C++23 allows some new floating point types:

std::float16_t
std::float32_t
std::float64_t
std::float128_t
std::bfloat16_t

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1467r9.html
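A minimal sketch of using them, assuming a C++23 implementation that ships <stdfloat> (each alias is optional, hence the feature-test macros):

#include <iostream>
#if __has_include(<stdfloat>)
#include <stdfloat>
#endif

int main()
{
#if defined(__STDCPP_FLOAT64_T__)
    std::float64_t d = 1.0f64 ;          // guaranteed IEEE 754 binary64
    std::cout << "float64_t:  " << sizeof d << " bytes\n" ;
#endif
#if defined(__STDCPP_FLOAT128_T__)
    std::float128_t q = 1.0f128 ;        // guaranteed IEEE 754 binary128
    std::cout << "float128_t: " << sizeof q << " bytes\n" ;
#endif
}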
