Floating point number greater than 8 bytes

Is it possible in C++ to handle a number that would be greater than 8 bytes? I have a situation where I need to do that. I thought the solution would be to use a long double, but apparently on my system, using Visual Studio 2019 (even with Processor Architecture set to x64), both doubles and long doubles are 8 bytes.

cout << "double billAmount, billPrice = .30, billAdder = 10.00;" << endl;
cout << "long double myAmount, myPrice = .30, myAdder = 10.00;" << endl;
cout << "billAmount is " << sizeof(double) << " bytes" << endl;
cout << "myAmount is  " << sizeof(long double) << " bytes" << endl;

You can either download an external library:

https://gmplib.org/

Or you can create your own implementation to be able to handle bigger numbers.
Is it possible in C++ to handle a number that would be greater than 8 bytes?

Yes. Doing so may require compiler or library support. However, "8 bytes" is not a number, so it's not clear what you mean.

The largest finite IEC 559 (IEEE 754) double is 0x1.fffffffffffffp+1023, or about 1.8 * 10^308, far larger than 2^64.
However, a double is 8 bytes wide, so it can hold no more than 2^64 distinct values. To get reasonable behavior, a (normal) double's value is a sum of no more than 53 consecutive powers of two. Therefore the quantization error grows in proportion to the magnitude of the value.

See Goldberg's paper:
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Some compilers and CPUs let you use the full FPU directly, which frequently supports 10 or more bytes (the x87 extended format is 80 bits). The danger of using the maximum resolution is that roundoff errors propagate faster: normally the FPU works internally at a higher resolution than the type you store, and those hidden bits let it hand you good digits as best it can. You should be able to reach the wider format in assembly or with specialized commands on supporting compilers (not sure which ones do; MSVC used to have it, but it has since been removed).

The compilers that support this will report sizeof(long double) != sizeof(double).
For Microsoft's compiler, the representation of double and long double is the same. Other compilers may implement different representations, such as GCC, which I believe uses 80-bit long doubles on x86 (stored in 12 bytes on 32-bit targets and 16 bytes on 64-bit targets because of padding).

You'd have to use some third-party library, because I cannot find any documentation of Visual Studio having more-precise built-ins. (I could be wrong.)

If you're on Visual Studio, apparently, people suggest MPIR, which is a port of GMP (zapshe's link).
https://www.exploringbinary.com/how-to-install-and-run-gmp-on-windows-using-mpir/
I myself have not used it.

However, "8 bytes" is not a number,
I think it was implied to mean "greater than 8 bytes [worth of floating-point precision]."
@RdelPorto,
Why exactly do you want such precision? 8 bytes already gives you about 15 to 16 sig figs.
Boost.Multiprecision does everything you want in ever-so-convenient C++. It'll even link with a backend of your choice, including GNU's multiprecision library if you wish.

That said, I must reiterate lastchance's concern: what rare and unlikely reason could you have for more precision than an IEEE double gives you?

Unless you have a very specific, in-writing requirement for better precision than a double will give you, then you are absolutely wasting your time.
++ lastchance's comment.

From the names of your variables (billAmount, billPrice, billAdder), it sounds like you're dealing with a bill. Unless it's the total national debt of the galaxy, a double is likely to suffice.

I suspect that your real problem is that you're having roundoff issues. Doubles can't represent most decimal fractions exactly. To compensate, you can:
- Store the money values as pennies using an integer rather than as dollars using floating point
- Store as dollars and round off the output to 2 decimal places.
- Use a decimal floating point library. Hmm. I believe this was proposed for the standard library. Maybe someone knows the status of that proposal.
- Something else.

Since this is the beginners forum, I'd suggest just rounding off the output to 2 decimal places.
When dealing with money, don't use floating-point. Use fixed-point.

Simply put, use integers representing the monetary amount in dollars * 10000. Only when you display them do you divide by 10000 (and possibly round to two decimal places).

Double is capable of more precision than is necessary to focus telescopes on objects outside the known universe.
LastChance,
The story behind this is that I'm doing some tutoring and my student has this Activity to complete:
2. Worth Every Penny: 40 points
File: worth_every_penny.cpp
You are running a business of selling stainless-steel wedding rings. Bill, the
software engineer you hired to program your business software is an experienced
programmer, so he decided to use double-precision floating-point numbers
(double) for all amounts of money in the software, fearing that one day a puny
float would run out of precision to represent your billion-dollar fortune. Being a
brilliant businessman yourself, you have come up with a unique pricing strategy:
the first ring is sold for 30 cents ($0.30), the second one for $10.30, the third
one $20.30 and so on. In other words, the price for each ring is $10 more than
the last. This plan has proven to work quite well: over the years you have sold
30’000’000 (thirty million) rings, and every penny earned was recorded in the
software. The profit from selling all these rings is $4499999859000000.00. This
result can be verified using, for example, Wolfram Alpha.

Yet the software written by Bill surprisingly reports a different profit value.
Obviously you are not happy and decide to take the matters into your own
hands.
In this problem you will write a program to compute the exact total profit after
the n − 1th ring is sold. n will be given to your program as input, and it will be
an integer between 1 and 30000000, inclusively. For example, if n = 3, the profit
should be 0.30+10.30+20.30 = 30.90. In addition to the exact total profit, your
program should also output the number computed by Bill’s program (remember
that he used double-precision floating-point numbers for all his calculations).
Use spaces to align Bill’s and yours profits in the output. Print your output
in dollars, in fixed point notation, to up to 2 digits after the decimal point.
You can use std::fixed and std::setprecision in the <iomanip> header file.
For example, to output the variable num in fixed point notation, to 2 digits of
accuracy after the decimal point, do as follows: std::cout << std::fixed <<
std::setprecision(2) << num << "\n";
Please note that each ring is sold independently, so the program needs to add
the rings’ prices to the sum (profit) one by one. In other words, you cannot use
a closed-form formula such as the one shown in the Wolfram Alpha picture.
You need to write a loop to compute the total profit.
Input
1
30000000
Output
Case 0
30000000 rings were sold
Bill's program outputs 4499999860284070.00
The exact profit is 4499999859000000.00

Perhaps I'm just approaching this with the wrong mindset, but my thought was that if long double were implemented as 16 bytes (as I thought it was), then I could use double for "bill's" numbers and long double for mine. Obviously, it doesn't work that way.
Do it with unsigned 64-bit integers (uint64_t) and in pennies (30 means 0.30). It will work. Then you can format the output back into 'decimal' with string tricks instead of math tricks to get the exact output.
It will work, and you don't need anything more like extended libraries etc.

Read that paper. The problem is not just the # of bits (that affects how many numbers you can represent to how many significant digits) but also the format. The format of floating point will, no matter how many bits you use, eventually fail to represent every number precisely. You will always get some sort of (x-1).99999999999999999999999999999999999... instead of x.0 for a few values...

now for extra credit, don't do this with a loop. can you get the exact sum with a couple of lines of code, following the ideas of "how to get the sum of 0-N" which is just a one line equation..?

the whole thing becomes, then:
where r is your input, r & v are both uint64_t .
v = 500*r*(r-1) + 30ull*r;
#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
   const int YEARS = 30000000;
   unsigned long long icents = 30, icentsMarkup = 1000, isum = 0;
   double dollars = 0.30, dollarsMarkup = 10.0, sum = 0.0;
   for ( int i = 1; i <= YEARS; i++ )
   {
      isum += icents;
      sum  += dollars;

      icents += icentsMarkup;
      dollars += dollarsMarkup;
   }
   cout << YEARS << " rings were sold\n";
   cout << "Bill's program outputs " << fixed << setprecision( 2 ) << sum << '\n';
   cout << "The exact answer is " << isum / 100 << '.' << setw(2) << setfill('0') << isum % 100 << '\n';
}


30000000 rings were sold
Bill's program outputs 4499999860284069.50
The exact answer is 4499999859000000.00



As @Jonnin points out, a more intelligent way to do the problem is to sum an arithmetic series and avoid the loop - however, your post says that this is banned.

Note that an unsigned long long would overflow eventually - at that point you will either have to live with the imprecision of doubles or use a big integer library.

I don't think a higher-precision floating-point type is of much use to you here.
Doh, and I fail the class for not reading the instructions!
still, the loop is easy to write. The reason they capped you at 3 gazillion on the input was to ensure it worked in 64 bit math.