Is it possible in C++ to handle a number that would be greater than 8 bytes? I have a situation where I need to do that. I thought the solution would be to use a long double, but apparently on my system, using Visual Studio 2019 (even with Processor Architecture set to x64), both doubles and long doubles are 8 bytes.
Is it possible in C++ to handle a number that would be greater than 8 bytes?
Yes. Doing so may require compiler or library support. However, "8 bytes" is not a number, so it's not clear what you mean.
The largest non-infinite IEC559 double is
0x1.fffffffffffffp+1023 or about 1.8 * 10^308, far larger than 2^64.
However, double is 8 bytes wide, so a double may contain no more than 2^64 distinct values. To get reasonable behavior, a (normal) double's value is a sum of no more than 53 sequential powers of two. Therefore quantization error is relative to the absolute magnitude of the value.
Some compilers and CPUs allow you to use the FULL FPU directly, which frequently supports 10+ bytes. The danger of using the max resolution is that roundoff errors propagate faster (the hidden bits in the FPU reduce this, because internally it works in a higher resolution and feeds you good digits as best it can), but you should be able to access them in assembly or with specialized commands on supporting compilers (not sure which do; MSVC used to have it, but it has been removed).
The compilers that support this will have sizeof(long double) != sizeof(double).
For Microsoft's compiler, the representation of double and long double is the same. Other compilers may implement different representations, such as GCC, which I believe uses 80-bit long doubles on x86 (which take up 12 bytes on 32-bit targets and 16 bytes on x86-64 because of padding).
You'd have to use some third-party library, because I cannot find any documentation of Visual Studio having more-precise built-ins. (I could be wrong.)
From the names of your variables (billAmount, billPrice, billAdder), it sounds like you're dealing with a bill. Unless it's the total national debt of the galaxy, a double is likely to suffice.
I suspect that your real problem is that you're having roundoff issues. Doubles can't represent most decimal fractions exactly. To compensate, you can:
- Store the money values as pennies using an integer rather than as dollars using floating point
- Store as dollars and round off the output to 2 decimal places.
- Use a decimal floating point library. Hmm. I believe this was proposed for the standard library. Maybe someone knows the status of that proposal.
- Something else.
Since this is the beginners forum, I'd suggest just rounding off the output to 2 decimal places.
The story behind this is that I'm doing some tutoring and my student has this Activity to complete:
2. Worth Every Penny: 40 points
You are running a business of selling stainless-steel wedding rings. Bill, the
software engineer you hired to program your business software is an experienced
programmer, so he decided to use double-precision floating-point numbers
(double) for all amounts of money in the software, fearing that one day a puny
float would run out of precision to represent your billion-dollar fortune. Being a
brilliant businessman yourself, you have come up with a unique pricing strategy:
the first ring is sold for 30 cents ($0.30), the second one for $10.30, the third
one $20.30 and so on. In other words, the price for each ring is $10 more than
the last. This plan has proven to work quite well: over the years you have sold
30’000’000 (thirty million) rings, and every penny earned was recorded in the
software. The profit from selling all these rings is $4499999859000000.00. This
result can be verified using, for example, Wolfram Alpha.
Yet the software written by Bill surprisingly reports a different profit value.
Obviously you are not happy and decide to take the matters into your own hands.
In this problem you will write a program to compute the exact total profit after
the n − 1th ring is sold. n will be given to your program as input, and it will be
an integer between 1 and 30000000, inclusively. For example, if n = 3, the profit
should be 0.30+10.30+20.30 = 30.90. In addition to the exact total profit, your
program should also output the number computed by Bill’s program (remember
that he used double-precision floating-point numbers for all his calculations).
Use spaces to align Bill’s and yours profits in the output. Print your output
in dollars, in fixed point notation, to up to 2 digits after the decimal point.
You can use std::fixed and std::setprecision in the <iomanip> header file.
For example, to output the variable num in fixed point notation, to 2 digits of
accuracy after the decimal point, do as follows: std::cout << std::fixed <<
std::setprecision(2) << num << "\n";
Please note that each ring is sold independently, so the program needs to add
the rings’ prices to the sum (profit) one by one. In other words, you cannot use
a closed-form formula such as the one shown in the Wolfram Alpha picture.
You need to write a loop to compute the total profit.
30000000 rings were sold
Bill's program outputs 4499999860284070.00
The exact profit is 4499999859000000.00
Perhaps I'm just approaching this with the wrong mindset, but my thought was that if long double were implemented as 16 bytes (as I thought it was), then I could use double for "Bill's" numbers and long double for mine. Obviously, it doesn't work that way.
Do it with unsigned 64-bit integers (uint64_t) and in pennies (30 means 0.30). It will work. Then you can format the output back into 'decimal' with string tricks instead of math tricks to get the exact output.
It will work, and you don't need anything more like extended libraries etc.
Read that paper. The problem is not just the # of bits (that affects how many numbers you can represent to how many significant digits) but also the format. The format of floating point will, no matter how many bits you use, eventually fail to represent every number precisely. You will always get some sort of (x-1).99999999999999999999999999999999999... instead of x.0 for a few values...
Now for extra credit, don't do this with a loop. Can you get the exact sum with a couple of lines of code, following the ideas of "how to get the sum of 0-N", which is just a one-line equation?
The whole thing becomes, then:
v = 500*r*(r-1) + 30ull*r;
where r is your input, and r & v are both uint64_t.