SoftFloat Release 2b General Documentation

John R. Hauser
2002 May 27


----------------------------------------------------------------------------
Introduction

SoftFloat is a software implementation of floating-point that conforms to
the IEC/IEEE Standard for Binary Floating-Point Arithmetic. As many as four
formats are supported: single precision, double precision, extended double
precision, and quadruple precision. All operations required by the standard
are implemented, except for conversions to and from decimal.

This document gives information about the types defined and the routines
implemented by SoftFloat. It does not attempt to define or explain the
IEC/IEEE Floating-Point Standard. Details about the standard are available
elsewhere.


----------------------------------------------------------------------------
Limitations

SoftFloat is written in C and is designed to work with other C code. The
SoftFloat header files assume an ISO/ANSI-style C compiler. No attempt
has been made to accommodate compilers that are not ISO-conformant. In
particular, the distributed header files will not be acceptable to any
compiler that does not recognize function prototypes.

Support for the extended double-precision and quadruple-precision formats
depends on a C compiler that implements 64-bit integer arithmetic. If the
largest integer format supported by the C compiler is 32 bits, SoftFloat
is limited to only single and double precisions. When that is the case,
all references in this document to extended double precision, quadruple
precision, and 64-bit integers should be ignored.


----------------------------------------------------------------------------
Contents

    Introduction
    Limitations
    Contents
    Legal Notice
    Types and Functions
    Rounding Modes
    Extended Double-Precision Rounding Precision
    Exceptions and Exception Flags
    Function Details
        Conversion Functions
        Standard Arithmetic Functions
        Remainder Functions
        Round-to-Integer Functions
        Comparison Functions
        Signaling NaN Test Functions
        Raise-Exception Function
    Contact Information



----------------------------------------------------------------------------
Legal Notice

SoftFloat was written by John R. Hauser. This work was made possible in
part by the International Computer Science Institute, located at Suite 600,
1947 Center Street, Berkeley, California 94704. Funding was partially
provided by the National Science Foundation under grant MIP-9311980. The
original version of this code was written as part of a project to build
a fixed-point vector processor in collaboration with the University of
California at Berkeley, overseen by Profs. Nelson Morgan and John Wawrzynek.

THIS SOFTWARE IS DISTRIBUTED AS IS, FOR FREE. Although reasonable effort
has been made to avoid it, THIS SOFTWARE MAY CONTAIN FAULTS THAT WILL AT
TIMES RESULT IN INCORRECT BEHAVIOR. USE OF THIS SOFTWARE IS RESTRICTED TO
PERSONS AND ORGANIZATIONS WHO CAN AND WILL TAKE FULL RESPONSIBILITY FOR ALL
LOSSES, COSTS, OR OTHER PROBLEMS THEY INCUR DUE TO THE SOFTWARE, AND WHO
FURTHERMORE EFFECTIVELY INDEMNIFY JOHN HAUSER AND THE INTERNATIONAL COMPUTER
SCIENCE INSTITUTE (possibly via similar legal warning) AGAINST ALL LOSSES,
COSTS, OR OTHER PROBLEMS INCURRED BY THEIR CUSTOMERS AND CLIENTS DUE TO THE
SOFTWARE.


----------------------------------------------------------------------------
Types and Functions

When 64-bit integers are supported by the compiler, the `softfloat.h'
header file defines four types: `float32' (single precision), `float64'
(double precision), `floatx80' (extended double precision), and `float128'
(quadruple precision). The `float32' and `float64' types are defined in
terms of 32-bit and 64-bit integer types, respectively, while the `float128'
type is defined as a structure of two 64-bit integers, taking into account
the byte order of the particular machine being used. The `floatx80' type
is defined as a structure containing one 16-bit and one 64-bit integer, with
the machine's byte order again determining the order within the structure.

When 64-bit integers are _not_ supported by the compiler, the `softfloat.h'
header file defines only two types: `float32' and `float64'. Because
ISO/ANSI C guarantees at least one built-in integer type of 32 bits,
the `float32' type is identified with an appropriate integer type. The
`float64' type is defined as a structure of two 32-bit integers, with the
machine's byte order determining the order of the fields.

In either case, the types in `softfloat.h' are defined such that if a system
implements the usual C `float' and `double' types according to the IEC/IEEE
Standard, then the `float32' and `float64' types should be indistinguishable
in memory from the native `float' and `double' types. (On the other hand,
when `float32' or `float64' values are placed in processor registers by
the compiler, the type of registers used may differ from those used for the
native `float' and `double' types.)

SoftFloat implements the following arithmetic operations:

-- Conversions among all the floating-point formats, and also between
   integers (32-bit and 64-bit) and any of the floating-point formats.

-- The usual add, subtract, multiply, divide, and square root operations
   for all floating-point formats.

-- For each format, the floating-point remainder operation defined by the
   IEC/IEEE Standard.

-- For each floating-point format, a ``round to integer'' operation that
   rounds to the nearest integer value in the same format. (The floating-
   point formats can hold integer values, of course.)

-- Comparisons between two values in the same floating-point format.

The only functions required by the IEC/IEEE Standard that are not provided
are conversions to and from decimal.
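
As an illustration of the memory-layout remark earlier in this section, the
following C sketch copies a native `float' into a 32-bit integer and back.
The type `demo_float32' and both helper functions are hypothetical names
invented for this example and are not part of SoftFloat. The sketch also
assumes that the host stores `float' in the IEC/IEEE single-precision
format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for SoftFloat's `float32': a plain 32-bit integer.
   The real definition lives in `softfloat.h' and may differ by target. */
typedef uint32_t demo_float32;

/* Copy the bytes of a native float into the integer-based type. */
demo_float32 demo_from_native(float f)
{
    demo_float32 bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Copy the bytes back to recover the native float. */
float demo_to_native(demo_float32 bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under the stated assumption, `demo_from_native(1.0f)' yields the bit
pattern 0x3F800000, and a round trip through both helpers returns the
original value unchanged.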


----------------------------------------------------------------------------
Rounding Modes

All four rounding modes prescribed by the IEC/IEEE Standard are implemented
for all operations that require rounding. The rounding mode is selected
by the global variable `float_rounding_mode'. This variable may be set
to one of the values `float_round_nearest_even', `float_round_to_zero',
`float_round_down', or `float_round_up'. The rounding mode is initialized
to nearest/even.
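
To make the difference between the four modes concrete, the following C
sketch rounds the magnitude mag / 2^shift to an integer under each mode.
It is not SoftFloat code: the enumeration and the function
`demo_round_magnitude' are hypothetical names chosen for this example, and
the sign is passed separately because the two directed modes depend on it.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the four SoftFloat rounding-mode values. */
enum demo_rounding_mode {
    demo_round_nearest_even,
    demo_round_to_zero,
    demo_round_down,    /* toward minus infinity */
    demo_round_up       /* toward plus infinity */
};

/* Round the magnitude mag / 2^shift to an integer magnitude.
   `negative' is nonzero when the value being rounded is negative.
   Assumes 1 <= shift <= 63. */
uint64_t demo_round_magnitude(uint64_t mag, int negative, unsigned shift,
                              enum demo_rounding_mode mode)
{
    uint64_t quotient = mag >> shift;
    uint64_t frac     = mag & ((UINT64_C(1) << shift) - 1);
    uint64_t half     = UINT64_C(1) << (shift - 1);

    if (frac == 0) return quotient;            /* already an integer */
    switch (mode) {
    case demo_round_nearest_even:
        /* Round up above the halfway point, or on a tie if odd. */
        if (frac > half || (frac == half && (quotient & 1)))
            ++quotient;
        break;
    case demo_round_to_zero:
        break;                                 /* discard the fraction */
    case demo_round_down:
        if (negative) ++quotient;              /* result is more negative */
        break;
    case demo_round_up:
        if (!negative) ++quotient;             /* result is more positive */
        break;
    }
    return quotient;
}
```

For example, with mag = 5 and shift = 1 (the value 2.5), nearest/even and
round-to-zero both give 2, while round-up gives 3 for a positive value and
round-down gives a magnitude of 3 for a negative one.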


----------------------------------------------------------------------------
Extended Double-Precision Rounding Precision

For extended double precision (`floatx80') only, the rounding precision
of the standard arithmetic operations is controlled by the global variable
`floatx80_rounding_precision'. The operations affected are:

    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt

When `floatx80_rounding_precision' is set to its default value of 80, these
operations are rounded (as usual) to the full precision of the extended
double-precision format. Setting `floatx80_rounding_precision' to 32
or to 64 causes the operations listed to be rounded to reduced precision
equivalent to single precision (`float32') or to double precision
(`float64'), respectively. When rounding to reduced precision, additional
bits in the result significand beyond the rounding point are set to zero.
The consequences of setting `floatx80_rounding_precision' to a value other
than 32, 64, or 80 are not specified. Operations other than the ones listed
above are not affected by `floatx80_rounding_precision'.
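
The effect of reduced rounding precision on a significand can be sketched
as follows. This is an illustration only, not SoftFloat's implementation:
the function name `demo_round_significand' is hypothetical, only the
nearest/even rounding mode is modeled, and the input is assumed to be a
normalized 64-bit significand with its top bit set. Single precision keeps
24 significand bits and double precision keeps 53, so 40 or 11 low bits
are rounded away and left as zero.

```c
#include <stdint.h>

/* Round a normalized 64-bit significand (top bit set) to the width selected
   by a rounding precision of 32, 64, or 80, using nearest/even only.
   *exp_adjust is set to 1 when rounding carries out of the top bit, in
   which case the caller would increment the exponent.  Any precision other
   than 64 or 80 is treated as 32 here; the document leaves such values
   unspecified. */
uint64_t demo_round_significand(uint64_t sig, int precision, int *exp_adjust)
{
    unsigned drop;
    uint64_t unit, mask, half, rem, kept;

    *exp_adjust = 0;
    if (precision == 80) return sig;        /* full 64-bit significand */
    drop = (precision == 64) ? 11u : 40u;   /* 64 - 53 or 64 - 24 bits */
    unit = UINT64_C(1) << drop;             /* weight of the last kept bit */
    mask = unit - 1;
    half = unit >> 1;
    rem  = sig & mask;                      /* bits beyond the rounding point */
    kept = sig & ~mask;                     /* low bits are now zero */
    if (rem > half || (rem == half && (kept & unit))) {
        kept += unit;
        if (kept == 0) {                    /* carry out of bit 63 */
            kept = UINT64_C(1) << 63;
            *exp_adjust = 1;
        }
    }
    return kept;
}
```

In every case the bits below the rounding point of the returned value are
zero, which is the behavior the paragraph above describes for reduced
precision.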


----------------------------------------------------------------------------
Exceptions and Exception Flags

All five exception flags required by the IEC/IEEE Standard are
implemented. Each flag is stored as a unique bit in the global variable
`float_exception_flags'. The positions of the exception flag bits within
this variable are determined by the bit masks `float_flag_inexact',
`float_flag_underflow', `float_flag_overflow', `float_flag_divbyzero', and
`float_flag_invalid'. The exception flags variable is initialized to all 0,
meaning no exceptions.

An individual exception flag can be cleared with the statement

    float_exception_flags &= ~ float_flag_<exception>;

where `<exception>' is the appropriate name. To raise a floating-point
exception, the SoftFloat function `float_raise' should be used (see below).

In the terminology of the IEC/IEEE Standard, SoftFloat can detect tininess
for underflow either before or after rounding. The choice is made by
the global variable `float_detect_tininess', which can be set to either
`float_tininess_before_rounding' or `float_tininess_after_rounding'.
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals. The other option is provided for compatibility
with some systems. Like most systems, SoftFloat always detects loss of
accuracy for underflow as an inexact result.
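
The flag handling described in this section follows the usual sticky
bit-mask pattern, which the following self-contained C sketch imitates.
Every `demo_' name is hypothetical, and the bit values are placeholders:
the real masks are defined in `softfloat.h' and may differ between
targets. The stand-in for `float_raise' only sets flags, whereas the real
function may also trap.

```c
#include <stdint.h>

/* Hypothetical bit assignments; the real masks are defined in `softfloat.h'
   and may vary between target configurations. */
enum {
    demo_flag_inexact   = 0x01,
    demo_flag_underflow = 0x02,
    demo_flag_overflow  = 0x04,
    demo_flag_divbyzero = 0x08,
    demo_flag_invalid   = 0x10
};

/* Stand-in for the global `float_exception_flags'; all 0 means no
   exceptions have been raised. */
uint8_t demo_exception_flags = 0;

/* Minimal stand-in for `float_raise': set the requested flags, no trap. */
void demo_raise(uint8_t mask)
{
    demo_exception_flags |= mask;
}

/* Report whether one flag is set, then clear only that flag using the
   same `&= ~' idiom shown in the text. */
int demo_test_and_clear(uint8_t flag)
{
    int was_set = (demo_exception_flags & flag) != 0;
    demo_exception_flags &= (uint8_t)~flag;
    return was_set;
}
```

Because flags accumulate until explicitly cleared, a caller would
typically clear the flags of interest before a computation and inspect
them afterward.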


----------------------------------------------------------------------------
Function Details

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Conversion Functions

All conversions among the floating-point formats are supported, as are all
conversions between a floating-point format and 32-bit and 64-bit signed
integers. The complete set of conversion functions is:

    int32_to_float32      int64_to_float32
    int32_to_float64      int64_to_float64
    int32_to_floatx80     int64_to_floatx80
    int32_to_float128     int64_to_float128

    float32_to_int32      float32_to_int64
    float64_to_int32      float64_to_int64
    floatx80_to_int32     floatx80_to_int64
    float128_to_int32     float128_to_int64

    float32_to_float64    float32_to_floatx80   float32_to_float128
    float64_to_float32    float64_to_floatx80   float64_to_float128
    floatx80_to_float32   floatx80_to_float64   floatx80_to_float128
    float128_to_float32   float128_to_float64   float128_to_floatx80

Each conversion function takes one operand of the appropriate type and
returns one result. Conversions from a smaller to a larger floating-point
format are always exact and so require no rounding. Conversions from 32-bit
integers to double precision and larger formats are also exact, and likewise
for conversions from 64-bit integers to extended double and quadruple
precisions.

Conversions from floating-point to integer raise the invalid exception if
the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits). If the floating-point operand is a NaN, the largest
positive integer is returned. Otherwise, if the conversion overflows, the
largest integer with the same sign as the operand is returned.

On conversions to integer, if the floating-point operand is not already
an integer value, the operand is rounded according to the current rounding
mode as specified by `float_rounding_mode'. Because C (and perhaps other
languages) requires that conversions to integers be rounded toward zero, the
following functions are provided for improved speed and convenience:

    float32_to_int32_round_to_zero    float32_to_int64_round_to_zero
    float64_to_int32_round_to_zero    float64_to_int64_round_to_zero
    floatx80_to_int32_round_to_zero   floatx80_to_int64_round_to_zero
    float128_to_int32_round_to_zero   float128_to_int64_round_to_zero

These variant functions ignore `float_rounding_mode' and always round toward
zero.
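
The NaN and overflow rules above can be seen in a small model of a
single-precision to 32-bit integer conversion that rounds toward zero. The
sketch below is written for this document and is not the SoftFloat source:
the function and variable names are hypothetical, the operand is passed as
a raw 32-bit pattern, and a plain `int' stands in for the invalid bit of
`float_exception_flags'. The inexact exception is not modeled.

```c
#include <stdint.h>

/* Stand-in for the invalid bit of `float_exception_flags'. */
int demo_invalid_raised = 0;

/* Convert a single-precision bit pattern to int32, rounding toward zero.
   The inexact exception is not tracked in this sketch. */
int32_t demo_float32_to_int32_round_to_zero(uint32_t a)
{
    int      negative = (int)(a >> 31);
    int      exp      = (int)((a >> 23) & 0xFF);    /* biased exponent */
    uint32_t frac     = a & UINT32_C(0x007FFFFF);
    uint32_t sig, mag;
    int      shift;

    if (exp == 0xFF && frac != 0) {         /* NaN: largest positive integer */
        demo_invalid_raised = 1;
        return INT32_MAX;
    }
    if (exp < 127) return 0;                /* magnitude below 1 */
    if (exp >= 158) {                       /* magnitude >= 2^31, or infinity */
        if (negative && exp == 158 && frac == 0)
            return INT32_MIN;               /* exactly -2^31 is representable */
        demo_invalid_raised = 1;
        return negative ? INT32_MIN : INT32_MAX;
    }
    sig   = frac | UINT32_C(0x00800000);    /* restore the implicit leading 1 */
    shift = exp - 150;                      /* value = sig * 2^shift, -23..7 */
    mag   = (shift >= 0) ? (sig << shift) : (sig >> -shift);
    return negative ? -(int32_t)mag : (int32_t)mag;
}
```

For instance, the pattern 0xC0200000 (the value -2.5) converts to -2, and
a NaN pattern such as 0x7FC00000 returns the largest positive integer with
the invalid indicator set.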

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Standard Arithmetic Functions

The following standard arithmetic functions are provided:

    float32_add    float32_sub    float32_mul    float32_div    float32_sqrt
    float64_add    float64_sub    float64_mul    float64_div    float64_sqrt
    floatx80_add   floatx80_sub   floatx80_mul   floatx80_div   floatx80_sqrt
    float128_add   float128_sub   float128_mul   float128_div   float128_sqrt

Each function takes two operands, except for `sqrt' which takes only one.
The operands and result are all of the same type.

Rounding of the extended double-precision (`floatx80') functions is affected
by the `floatx80_rounding_precision' variable, as explained above in the
section _Extended Double-Precision Rounding Precision_.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Remainder Functions

For each format, SoftFloat implements the remainder function according to
the IEC/IEEE Standard. The remainder functions are:

    float32_rem
    float64_rem
    floatx80_rem
    float128_rem

Each remainder function takes two operands. The operands and result are all
of the same type. Given operands x and y, the remainder functions return
the value x - n*y, where n is the integer closest to x/y. If x/y is exactly
halfway between two integers, n is the even integer closest to x/y. The
remainder functions are always exact and so require no rounding.

Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions. This is inherent in the remainder operation itself and is not a
flaw in the SoftFloat implementation.
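
The definition x - n*y, with n chosen as the nearest integer and ties
going to the even integer, differs from the C `%' operator. The following
sketch applies the same definition to integer operands so that the choice
of n is easy to follow. The function `demo_ieee_remainder' is a
hypothetical name used only here. It assumes y is nonzero and that no
intermediate value overflows.

```c
#include <stdint.h>

/* Return x - n*y, where n is the integer nearest to x/y and ties go to
   the even n.  Assumes y != 0 and operands far from the int64 limits. */
int64_t demo_ieee_remainder(int64_t x, int64_t y)
{
    int64_t q     = x / y;                  /* quotient truncated toward zero */
    int64_t r     = x % y;                  /* has the sign of x, |r| < |y| */
    int64_t abs_r = (r < 0) ? -r : r;
    int64_t abs_y = (y < 0) ? -y : y;
    int64_t rest  = abs_y - abs_r;          /* distance to the next multiple */

    /* Step n one further from zero when that multiple of y is closer,
       or on an exact tie when the truncated quotient is odd. */
    if (abs_r > rest || (abs_r == rest && (q % 2 != 0))) {
        if ((x < 0) == (y < 0))
            r -= y;                         /* n = q + 1 */
        else
            r += y;                         /* n = q - 1 */
    }
    return r;
}
```

With x = 7 and y = 2, x/y is 3.5, so n is the even integer 4 and the
result is -1, whereas `7 % 2' in C is 1. With x = 5 and y = 2, the tie
resolves to n = 2 and the result is 1.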

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Round-to-Integer Functions

For each format, SoftFloat implements the round-to-integer function
specified by the IEC/IEEE Standard. The functions are:

    float32_round_to_int
    float64_round_to_int
    floatx80_round_to_int
    float128_round_to_int

Each function takes a single floating-point operand and returns a result of
the same type. (Note that the result is not an integer type.) The operand
is rounded to an exact integer according to the current rounding mode, and
the resulting integer value is returned in the same floating-point format.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Comparison Functions

The following floating-point comparison functions are provided:

    float32_eq      float32_le      float32_lt
    float64_eq      float64_le      float64_lt
    floatx80_eq     floatx80_le     floatx80_lt
    float128_eq     float128_le     float128_lt

Each function takes two operands of the same type and returns a 1 or 0
representing either _true_ or _false_. The abbreviation `eq' stands for
``equal'' (=); `le' stands for ``less than or equal'' (<=); and `lt' stands
for ``less than'' (<).

The standard greater-than (>), greater-than-or-equal (>=), and not-equal
(!=) functions are easily obtained using the functions provided. The
not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the operands reversed, and the greater-than function is
identical to the less-than function with the operands reversed.

The IEC/IEEE Standard specifies that the less-than-or-equal and less-than
functions raise the invalid exception if either input is any kind of NaN.
The equal functions, on the other hand, are defined not to raise the invalid
exception on quiet NaNs. For completeness, SoftFloat provides the following
additional functions:

    float32_eq_signaling    float32_le_quiet    float32_lt_quiet
    float64_eq_signaling    float64_le_quiet    float64_lt_quiet
    floatx80_eq_signaling   floatx80_le_quiet   floatx80_lt_quiet
    float128_eq_signaling   float128_le_quiet   float128_lt_quiet

The `signaling' equal functions are identical to the standard functions
except that the invalid exception is raised for any NaN input. Likewise,
the `quiet' comparison functions are identical to their counterparts except
that the invalid exception is not raised for quiet NaNs.
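
The derivation of the missing comparisons from the three provided ones can
be written out as below. To keep the sketch self-contained, the functions
`demo_eq', `demo_le', and `demo_lt' are hypothetical stand-ins built on
the native `float' operators; with SoftFloat, the same wrappers would call
`float32_eq', `float32_le', and `float32_lt' on `float32' operands
instead. Exception behavior is not modeled.

```c
/* Hypothetical stand-ins for the three provided comparisons.  Each
   returns 1 for true and 0 for false. */
int demo_eq(float a, float b) { return a == b; }
int demo_le(float a, float b) { return a <= b; }
int demo_lt(float a, float b) { return a <  b; }

/* Not-equal is the logical complement of equal. */
int demo_ne(float a, float b) { return !demo_eq(a, b); }

/* Greater-than-or-equal is less-than-or-equal with operands reversed. */
int demo_ge(float a, float b) { return demo_le(b, a); }

/* Greater-than is less-than with operands reversed. */
int demo_gt(float a, float b) { return demo_lt(b, a); }
```

Reversing the operands, rather than complementing the result, matters when
an operand is a NaN: both `demo_ge' and `demo_lt' then return 0, while
`demo_ne' returns 1.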

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Signaling NaN Test Functions

The following functions test whether a floating-point value is a signaling
NaN:

    float32_is_signaling_nan
    float64_is_signaling_nan
    floatx80_is_signaling_nan
    float128_is_signaling_nan

The functions take one operand and return 1 if the operand is a signaling
NaN and 0 otherwise.
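
As a rough picture of what such a test involves for single precision, the
sketch below inspects a raw 32-bit pattern. It assumes the common
convention in which a NaN whose most significant fraction bit is 0 is
signaling. That convention is an assumption of this example and is not
universal, so the real functions may differ by target. The function name
`demo_float32_is_signaling_nan' is hypothetical.

```c
#include <stdint.h>

/* Return 1 if the single-precision bit pattern `a' is a signaling NaN
   under the assumed convention: exponent all ones, most significant
   fraction bit clear, and at least one other fraction bit set. */
int demo_float32_is_signaling_nan(uint32_t a)
{
    /* Bits 30..22 equal to 0x1FE means the exponent field is all ones
       and the top fraction bit is 0. */
    int exp_ones_top_clear = ((a >> 22) & 0x1FF) == 0x1FE;
    /* A zero remainder of the fraction would be an infinity, not a NaN. */
    int low_fraction_set   = (a & UINT32_C(0x003FFFFF)) != 0;

    return exp_ones_top_clear && low_fraction_set;
}
```

Under this convention the pattern 0x7FA00000 is reported as signaling,
while the quiet NaN 0x7FC00000 and the infinity 0x7F800000 are not.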

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Raise-Exception Function

SoftFloat provides a function for raising floating-point exceptions:

    float_raise

The function takes a mask indicating the set of exceptions to raise. No
result is returned. In addition to setting the specified exception flags,
this function may cause a trap or abort appropriate for the current system.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -


----------------------------------------------------------------------------
Contact Information

At the time of this writing, the most up-to-date information about
SoftFloat and the latest release can be found at the Web page
`http://www.cs.berkeley.edu/~jhauser/arithmetic/SoftFloat.html'.