double - and how to use it

Score: 3.1/5 (8 votes)

This article shows some properties of "double" atithmetic, and what you can and can't do with it.
NB: it requires UTF-8 support for mathematic symbols.

0. Motivation
=============
When doing floatingpoint arithmetic, some strange effects can occur, like

for(double i=0.0; i!=5.5; i+=0.5)
{
        ... 
}

might never terminate due to rounding errors. Especially testing for (in)equality often does not yield in the desired result. A common workaround is to use an epsilon:

bool is_equal(double d1, double d2) 
{
        if(abs(d1-d2)<epsilon)
                return true;
        return false;
}

However, this "solution" is only in few cases applicable - equality, defined this way, is not an equivalence relation anymore (lost transitivity: a-b<ε Λ b-c<ε doesn't imply a-c<ε - oops). Another question is: how large should epsilon be?

The computer should calculate the correct results or state it's uncertainty. This cannot be achieved in general using an epsilon. This is why you should be interested in the rest of this article ;-)

1. The number type
==================
A IEEE 754 double value is composed of:
1 bit sign (+ or -)
11 bit exponent (range: 0 to 1047, 0 and 1047 "reserved")
52 bit significant (aka "mantissa")

the represented number is then:

[sign] 2^{exponent-bias[=1023]}*1[significant]

where the 1 is an implied bit set to 1 before the bits of the significant start.

For normalized numbers the first bit of the significant is 1 and not saved, so the significant grows to 53 bit (obviously, leading zeros can be left out, so a number starts with a non-zero, which happens to be always one in a binary system).

A value is normalized iff 0 < exponent < 2047. If the exponent is zero, IEEE uses "denormalized" numbers, the value then is [sign]2^-1022*[significant] without the leading one in the significant. Note that calculations in the range of denormalized numbers is usually extremely slow (2047 in the exponent is reserved for "NaN (not a number), ±inf and alike)

If you, for some reason, require bit access, you can use on a *BIG* endian machine:

union ieee_double
{
   double   a;
   struct { unsigned s : 1;
            unsigned e :11;
            unsigned h :20;
            unsigned l :32;
          } b;
};

Little endian has to be handled seperately, but is done in a similar way.

2. Operations
=============
Assume that rounding is set to "round to nearest" and "round to even tie-breaking" (default)

Let ∘ denote an operation of {+,-,*,/} and ⊚ the corresponding machine operation, i.e. one of {⊕,⊖,⊛,⊘}

Let x, y and z be IEEE 754 double values.
Let a be a real number.
Let fl(a) be the IEEE 754 double value nearest to a

IEEE 754 now guarantees:
(a) fl(x ∘ y) = x ⊚ y
Corollary:
x ⊖ y = 0 ⇔ x = y
x ⊕ y = y ⊕ x
x ⊛ y = y ⊛ x
(b) fl(sqrt(x)) = msqrt(x), where msqrt is the "machine sqrt" operation.
Furthermore, some "no-number" results are defined, like NaN (not a number) and +/- infinity.

However in the general case, due to rounding errors:
a ⊚ ( b ⊚ c ) ≠ ( a ⊚ b ) ⊚ c

3. Sterbenz Lemma
=================
One of your most important friends when doing double arithmetic is Sterbenz Lemma, which can be stated as:
If y/2 ≤ x ≤ y then
x ⊖ y = x - y
if x ⊖ y doesn't underflow

Or, formulated in "plain enlish": if y is between x/2 and x, then the machine substraction x ⊖ y gives you the exact result.

4. Error bounds
===============
One can, given the properties in (1) and (2), and the resulting lemma (3), compute an error bound for any formula using +,-,*,/ and 'sqrt'. However, this calculation is tedious and error-prone itself, so an approximation is usually used.

Using Sterbenz' lemma, one can prove some easier error bounds for the evaluation of polynomials. I won't show them here, but you can google for e.g.
- Fortune-van-Wyk Bound
- Mehlhorn-Näher Bound
- Mehlhorn-Neumaier Bound
to find them. (Note: IIRC, all these papers prove thir correctness only for integer operands (represented in doubles). However, they all are correct for any double input). A noteworthy beauty of all these formulas is that the actual error bound can be computed using double arithmetic.

5. Floating point filters
=========================
Remember the motivation example: equality tests. Given two polynomials P and Q, it now can be decided if P(x) = Q(x) in some cases:

// equality test:
if(abs(P(x1) - Q(x2)) > ErrorBound(max(x1,x2)))
        return false; /* we know that for sure */
else
        // ???

(note that if P or Q is multivariate, max had better used the supremum norm of x1 and x2...)

If you know the operand size at compile time (or at program startup) and can pre-compute it, all you have to pay for the exactness is then a simple test in most cases.
Question: what is the '???'? There are several options. You could use exact arethmetic using an arbitrary precision number type, you can use floating-point expansions, or you can perform lazy adaptive evaluation (note that the speed increases, but so does the difficulty of implementation (and coming up with the right formulas), from left to right).
These methods are out of the scope of this article (if time allows it, I might write a sequel, though ;-) )
In some areas, another possibility would be to use controlled pertubation, meaning that the input is transformed in such a way that equality doesn't occur and inequality can be shown here (not exact, but might be robust and sufficient, since we guarantee some property (i.e., we don't just say "not equal", we also make sure it isn't).

6. Conclusion
=============
While calculating the error bound imposes extra work, runtime efficiency is a minor issue (one comparison to prove the correctness of the result in most cases). At the same time, many degeneracencies caused by the incorrect assumption of equality can be avoided, often resulting in a substantial performance gain.

________________________________________________________________________________
If you have any comments, suggestions or improvements, or spotted a spelling error (or several), please let me know.

C++

Articles

double - and how to use it