Long double format
==================

Each long double is made up of two IEEE doubles. The value of the
long double is the sum of the values of the two parts (except for
-0.0). The most significant part is required to be the value of the
long double rounded to the nearest double, as specified by IEEE. For
Inf values, the least significant part is required to be one of +0.0
or -0.0. No other requirements are made; so, for example, 1.0 may be
represented as (1.0, +0.0) or (1.0, -0.0), and the low part of a NaN
is don't-care.
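
As a minimal sketch of the rules above, a long double can be modelled
in Python as a (high, low) pair of floats, since Python floats are
IEEE doubles. The helper names 'value' and 'high_part_ok' are
illustrative assumptions and not part of any library; the check
relies on CPython converting a Fraction to a float with correct
rounding.

```python
from fractions import Fraction

def value(hi, lo):
    """Exact value of a finite (hi, lo) pair as a rational number."""
    return Fraction(hi) + Fraction(lo)

def high_part_ok(hi, lo):
    """True if hi is the exact sum rounded to the nearest double.

    Only meaningful for finite parts whose sum does not overflow.
    """
    return float(value(hi, lo)) == hi

print(high_part_ok(1.0, 0.0))        # True
print(high_part_ok(1.0, -0.0))       # True
print(high_part_ok(1.0, 2.0**-60))   # True
print(high_part_ok(1.0, 2.0**-52))   # False: the sum rounds to the next double
```

The last pair fails because 1 + 2^-52 is itself a double, so the high
part would have to be that double rather than 1.0.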

Classification
--------------

A long double can represent any value of the form
  s * 2^e * sum(k=0...105: f_k * 2^(-k))
where 's' is +1 or -1, 'e' is between -968 and 1022 inclusive, f_0 is
1, and f_k for k>0 is 0 or 1. These are the 'normal' long doubles.
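
To illustrate how such a value maps onto the two parts, the following
sketch splits an exact rational into a (high, low) pair. The function
name 'to_pair' is hypothetical, and the sketch assumes a finite value
that does not overflow a double.

```python
from fractions import Fraction

def to_pair(x):
    """Split an exact rational x into (high, low) doubles."""
    hi = float(x)                  # x rounded to the nearest double
    lo = float(x - Fraction(hi))   # what is left over, as a double
    return hi, lo

# s = +1, e = 0, f_0 = 1, f_105 = 1: the value 1 + 2^-105.
x = 1 + Fraction(1, 2**105)
print(to_pair(x) == (1.0, 2.0**-105))   # True
```

The lowest term of the sum, f_105, ends up entirely in the low part,
far below the 53 bits that the high part can hold.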

A long double can also represent any value of the form
  s * 2^-968 * sum(k=0...105: f_k * 2^(-k))
where 's' is +1 or -1, f_0 is 0, and f_k for k>0 is 0 or 1. These are
the 'subnormal' long doubles.

There are four long doubles that represent zero, two that represent
+0.0 and two that represent -0.0. The sign of the high part is the
sign of the long double, and the sign of the low part is ignored.
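
A short sketch of the sign rule, again modelling a long double as a
(high, low) pair of Python floats. The name 'sign_of' is an
illustrative assumption.

```python
import math

def sign_of(hi, lo):
    """Sign of the long double: taken from the high part only."""
    return math.copysign(1.0, hi)

# The four representations of zero.
zeros = [(0.0, 0.0), (0.0, -0.0), (-0.0, 0.0), (-0.0, -0.0)]
print([sign_of(hi, lo) for hi, lo in zeros])   # [1.0, 1.0, -1.0, -1.0]

# Summing the parts would lose the sign of (-0.0, +0.0), which is why
# -0.0 is an exception to the "sum of the two parts" rule.
print(math.copysign(1.0, -0.0 + 0.0))          # 1.0
```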

Likewise, there are four long doubles that represent infinities, two
for +Inf and two for -Inf.

Each NaN, quiet or signalling, that can be represented as a 'double'
can be represented as a 'long double'. In fact, there are 2^64
equivalent representations for each one.

There are certain other valid long doubles where both parts are
nonzero but the low part represents a value which has a bit set below
2^(e-105). These, together with the subnormal long doubles, make up
the denormal long doubles.
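
The condition on the low part can be sketched as a check that the low
part is not a multiple of 2^(e-105). The function name is
hypothetical, and the sketch makes a simplifying assumption: 'e' is
read from the high part, which can differ by one from the exponent of
the full value when the high part is a power of two and the low part
has the opposite sign.

```python
import math
from fractions import Fraction

def has_bit_below_limit(hi, lo):
    """True if lo has a bit set below 2^(e-105); hi must be finite, nonzero."""
    e = math.frexp(hi)[1] - 1          # |hi| = m * 2^e with 1 <= m < 2
    scaled = Fraction(lo) * Fraction(2) ** (105 - e)
    return scaled.denominator != 1     # not a whole multiple of 2^(e-105)

print(has_bit_below_limit(1.0, 2.0**-105))   # False: fits the normal form
print(has_bit_below_limit(1.0, 2.0**-200))   # True: a denormal long double
```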

Many possible long double bit patterns are not valid long doubles.
These do not represent any value.

Limits
------

The maximum representable long double is 2^1024-2^918. The smallest
*normal* positive long double is 2^-968. The smallest denormalised
positive long double is 2^-1074 (this is the same as for 'double').

Conversions
-----------

A double can be converted to a long double by adding a zero low part.

A long double can be converted to a double by removing the low part.
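
Both conversions are small enough to sketch directly on (high, low)
pairs of Python floats. The function names are illustrative
assumptions. Dropping the low part gives the nearest double because
the format already requires the high part to be the value rounded to
the nearest double.

```python
def from_double(x):
    """double -> long double: attach a zero low part."""
    return (x, 0.0)

def to_double(pair):
    """long double -> double: keep only the high part."""
    hi, _lo = pair
    return hi

print(from_double(1.5))                 # (1.5, 0.0)
print(to_double((1.0, 2.0**-60)))       # 1.0
```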

Comparisons
-----------

Two long doubles can be compared by comparing the high parts, and if
those compare equal, comparing the low parts.
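
The comparison rule amounts to a lexicographic comparison of the two
parts. The sketch below assumes valid, non-NaN inputs (NaNs are
unordered and are not handled here); the name 'compare' and its
-1/0/+1 return convention are illustrative choices.

```python
def compare(a, b):
    """Return -1, 0 or 1 as a is less than, equal to or greater than b."""
    (ahi, alo), (bhi, blo) = a, b
    if ahi != bhi:                     # high parts decide when they differ
        return -1 if ahi < bhi else 1
    if alo != blo:                     # otherwise the low parts decide
        return -1 if alo < blo else 1
    return 0

print(compare((1.0, 2.0**-60), (1.0, 0.0)))    # 1
print(compare((1.0, -2.0**-60), (1.0, 0.0)))   # -1
print(compare((1.0, 0.0), (1.0, -0.0)))        # 0
```

The last case shows that the two representations of 1.0 compare
equal, since +0.0 and -0.0 compare equal as doubles.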

Arithmetic
----------

The unary negate operation negates both the low and high parts.

An absolute or absolute-negate operation must be done by comparing
against zero and negating if necessary.
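
A sketch of both operations on (high, low) pairs, with hypothetical
function names. It also shows why the absolute value cannot be taken
part by part: the low part may have the opposite sign to the high
part.

```python
def negate(x):
    """Negate both parts."""
    hi, lo = x
    return (-hi, -lo)

def absolute(x):
    """Compare against zero and negate if necessary.

    -0.0 compares equal to zero, so this sketch leaves it unchanged.
    """
    hi, _lo = x
    return negate(x) if hi < 0.0 else x

x = (-1.0, 2.0**-60)                       # the value -(1 - 2^-60)
print(absolute(x) == (1.0, -2.0**-60))     # True
print((abs(x[0]), abs(x[1])) == absolute(x))   # False: part-wise abs is wrong
```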

Addition and subtraction are performed using library routines. They
are not at present perfectly accurate: the result produced will be
within 1ulp of the range generated by adding or subtracting 1ulp
from the input values, where a 'ulp' is 2^(e-106) given the exponent
'e'. In the presence of cancellation, the result may be arbitrarily
inaccurate. Subtraction is done by negation and addition.
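
For orientation only, the following is a common textbook way to add
two (high, low) pairs, built on an error-free sum of two doubles. It
is not the library routine referred to above, and its accuracy is not
claimed to match the bound stated there; all names are illustrative
assumptions.

```python
def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and a + b = s + err exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def dd_add(x, y):
    """Simple pair addition: exact sum of high parts, plus low parts."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]          # low parts are folded into the error term
    hi = s + e                # renormalise so hi carries the leading bits
    lo = e - (hi - s)
    return hi, lo

def dd_sub(x, y):
    """Subtraction by negation and addition."""
    return dd_add(x, (-y[0], -y[1]))

print(dd_add((1.0, 0.0), (2.0**-80, 0.0)) == (1.0, 2.0**-80))   # True
```

A plain double addition of 1.0 and 2^-80 would return 1.0; here the
small addend survives in the low part.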

Multiplication is also performed using a library routine. Its result
will be within 2ulp of the correct result.
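
As a hedged illustration, one textbook approach to multiplying two
(high, low) pairs uses Dekker's splitting to recover the rounding
error of a double product. This is not the library routine mentioned
above and no claim is made about its error bound; the names are
hypothetical, and the sketch assumes no overflow or underflow.

```python
from fractions import Fraction

def split(a):
    """Split a double into two halves whose product terms are exact."""
    t = 134217729.0 * a       # 2^27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    """Return (p, err) with p = fl(a * b) and a * b = p + err exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

def dd_mul(x, y):
    """Pair product: exact hi*hi, plus the cross terms."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    hi = p + e
    lo = e - (hi - p)
    return hi, lo

a = 1.0 + 2.0**-30
p, err = two_prod(a, a)
print(Fraction(p) + Fraction(err) == Fraction(a) * Fraction(a))   # True
print(dd_mul((a, 0.0), (a, 0.0)) == (1.0 + 2.0**-29, 2.0**-60))   # True
```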

Division is also performed using a library routine. Its result will
be within 3ulp of the correct result.
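
The ulp bounds quoted in this section can be checked against exact
rational arithmetic. The sketch below measures the error of a
(high, low) pair in units of 2^(e-106), taking 'e' from the exact
result; the helper names are illustrative assumptions.

```python
from fractions import Fraction

def exponent(x):
    """Return e such that 2^e <= |x| < 2^(e+1), for a nonzero Fraction."""
    x = abs(x)
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if x < Fraction(2) ** e:
        e -= 1
    return e

def ulp_error(pair, exact):
    """Error of a (hi, lo) pair against exact, in units of 2^(e-106)."""
    got = Fraction(pair[0]) + Fraction(pair[1])
    ulp = Fraction(2) ** (exponent(exact) - 106)
    return abs(got - exact) / ulp

# The quotient 1/3, rounded into a pair part by part.
exact = Fraction(1, 3)
hi = float(exact)
lo = float(exact - Fraction(hi))
print(ulp_error((hi, lo), exact) < 1)   # True
```

A division routine meeting the stated bound would keep this measure
at or below 3 for every valid pair of inputs.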