The following describes the IEEE-Standard-754 compliant, double-precision floating point unit,
written in VHDL. The module consists of the following files:

1. fpu_double.vhd (top level)
2. fpu_add.vhd
3. fpu_sub.vhd
4. fpu_mul.vhd
5. fpu_div.vhd
6. fpu_round.vhd
7. fpu_exceptions.vhd
8. fpupack.vhd
9. comppack.vhd

A testbench file is also included, containing 50 test-case operations:

1. fpu_double_TB.vhd

This unit has been extensively simulated, covering all 4 operations, the rounding modes, exceptions
such as underflow and overflow, and even obscure corner cases, such as results that cross from
denormalized to normalized, and vice versa.

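When adding test cases, the expected result for the round-to-nearest mode can be cross-checked in software. The sketch below is a hypothetical helper and not part of the distributed files: the names `bits_to_double`, `double_to_bits`, and `reference_result` are made up for illustration. It assumes Python floats are IEEE 754 doubles (true on common platforms), models only rmode 00, and does not model the exception flags or the exact NaN bit patterns the hardware produces.

```python
import struct

def bits_to_double(bits: int) -> float:
    """Interpret a 64-bit pattern (bit 63 = sign) as a double."""
    return struct.unpack(">d", bits.to_bytes(8, "big"))[0]

def double_to_bits(value: float) -> int:
    """Return the 64-bit pattern of a double as an integer."""
    return int.from_bytes(struct.pack(">d", value), "big")

def reference_result(fpu_op: str, opa: int, opb: int) -> int:
    """Software reference for rmode 00 (nearest); fpu_op is the 3-bit code as a string."""
    a, b = bits_to_double(opa), bits_to_double(opb)
    if fpu_op == "000":
        result = a + b
    elif fpu_op == "001":
        result = a - b
    elif fpu_op == "010":
        result = a * b
    elif fpu_op == "011":
        result = a / b  # Python raises ZeroDivisionError when b is zero
    else:
        raise ValueError("operation code not used by the unit")
    return double_to_bits(result)

# 1.5 + 2.25 = 3.75
print(hex(reference_result("000", 0x3FF8000000000000, 0x4002000000000000)))
```

The printed pattern can be compared against out_fp in simulation for the same operands.
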
The floating point unit supports denormalized numbers,
4 operations (add, subtract, multiply, divide), and 4 rounding
modes (nearest, zero, +inf, -inf). The unit was synthesized with an
estimated frequency of 185 MHz for a Virtex5 target device. The synthesis results
are below. fpu_double.vhd is the top-level module, and it contains the input
and output signals of the unit.

The input and output signals of the unit are the following:

1. clk (global)
2. rst (global)
3. enable (set high, then low, to start operation)
4. rmode (rounding mode, 2 bits, 00 = nearest, 01 = zero,
   10 = pos inf, 11 = neg inf)
5. fpu_op (operation code, 3 bits, 000 = add, 001 = subtract,
   010 = multiply, 011 = divide, others are not used)
6. opa, opb (input operands, 64 bits, Big-endian order,
   bit 63 = sign, bits 62-52 exponent, bits 51-0 mantissa)
7. out_fp (output from operation, 64 bits, Big-endian order,
   same ordering as inputs)
8. ready (goes high when output is available)
9. underflow
10. overflow
11. inexact
12. exception - see IEEE 754 definition
13. invalid - see IEEE 754 definition

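The operand layout listed above (bit 63 = sign, bits 62-52 exponent, bits 51-0 mantissa) can be explored in software. The following is a minimal sketch only; the names `split_fields` and `classify` are hypothetical and do not correspond to anything in the VHDL sources.

```python
import struct

def split_fields(bits: int):
    """Split a 64-bit operand into (sign, exponent, mantissa) fields."""
    sign = (bits >> 63) & 0x1
    exponent = (bits >> 52) & 0x7FF        # bits 62-52, 11 bits
    mantissa = bits & ((1 << 52) - 1)      # bits 51-0, 52 bits
    return sign, exponent, mantissa

def classify(bits: int) -> str:
    """Name the IEEE 754 class of a 64-bit operand."""
    _, exponent, mantissa = split_fields(bits)
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormalized"
    if exponent == 0x7FF:
        return "infinity" if mantissa == 0 else "NaN"
    return "normalized"

# Bit pattern of -2.5, most significant byte first
opa = int.from_bytes(struct.pack(">d", -2.5), "big")
print(hex(opa), split_fields(opa), classify(opa))
```

A helper like this is handy for preparing opa and opb values that hit the denormalized, infinity, and NaN cases.
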
The unit was designed to be synchronous with one global clock, and all of the
registers can be reset with a synchronous global reset.
When the input signals (a and b operands, fpu operation code, rounding mode code) are
available, set the enable input high, then set it low after 2 clock cycles. When the
operation is complete and the output is available, the ready signal will go high. To start
the next operation, set the enable input high again.

Each operation takes the following number of clock cycles to complete:

1. addition: 20 clock cycles
2. subtraction: 21 clock cycles
3. multiplication: 24 clock cycles
4. division: 71 clock cycles

These latencies are longer than those of other floating point units, but supporting denormalized numbers
requires more signals and logic levels to accommodate gradual underflow. The estimated
clock speed of 185 MHz makes up for the large number of clock cycles required for each
operation to complete. If you have a lower clock speed, the code can be changed to
reduce the number of registers and the latency of each operation. I purposely increased the
number of register stages to get the code to synthesize to a faster clock frequency, but of course,
this led to longer latency. I guess which is more important depends on your application.

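To put the cycle counts in terms of time, they can be converted at the estimated 185 MHz clock. The snippet below is just a worked calculation; the name `latency_ns` is made up for illustration, and the result assumes the design actually runs at 185 MHz on your device.

```python
# Cycle counts quoted in this document
LATENCY_CYCLES = {
    "addition": 20,
    "subtraction": 21,
    "multiplication": 24,
    "division": 71,
}

def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Latency in nanoseconds for a given cycle count and clock frequency."""
    return cycles * 1000.0 / clock_mhz

for operation, cycles in LATENCY_CYCLES.items():
    print(f"{operation}: {cycles} cycles = {latency_ns(cycles, 185.0):.1f} ns")
```

Rerunning the calculation with your own clock frequency shows whether trading registers for lower latency is worthwhile.
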
The following output signals are also available: underflow, overflow, inexact, exception,
and invalid. They are compliant with the IEEE-754 definition of each signal. The unit
will handle QNaN and SNaN inputs per the standard.

I'm planning on adding more operations, like square root, sin, cos, tan, etc.,
so check back for updates.

Multiply:
The multiply module is written specifically for a Virtex5 target device. The DSP48E slices
can perform a 25-bit by 18-bit two's-complement multiply (24 by 17 unsigned multiply). I broke up the multiply to
fit these DSP48E slices. The breakdown is similar to the design in Figure 4-15 of the
Xilinx User Guide "Virtex-5 FPGA XtremeDSP Design Considerations", also known as UG193.
You can find this document at xilinx.com by searching for "UG193".
Depending on your device, the multiply can be changed to match the bit-widths of the available
multipliers. A total of 9 DSP48E slices are used to do the 53-bit by 53-bit mantissa multiply of 2
floating point numbers.

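The general idea of breaking a wide multiply into multiplier-sized pieces can be sketched in software. This is only an illustration of the technique, not the partition used in fpu_mul.vhd. The names `split_limbs` and `multiply_by_parts` are hypothetical, and this naive 24-bit by 17-bit split produces 12 partial products, whereas the actual design fits in 9 DSP48E slices and so must divide the operands differently.

```python
def split_limbs(value: int, limb_bits: int, total_bits: int):
    """Split value into limbs of limb_bits each, least significant limb first."""
    mask = (1 << limb_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, limb_bits)]

def multiply_by_parts(a: int, b: int, a_limb_bits: int = 24,
                      b_limb_bits: int = 17, width: int = 53):
    """Multiply two unsigned width-bit values from limb-sized partial products."""
    product = 0
    partial_products = 0
    for i, a_limb in enumerate(split_limbs(a, a_limb_bits, width)):
        for j, b_limb in enumerate(split_limbs(b, b_limb_bits, width)):
            # Each partial product fits a 24 x 17 unsigned multiplier
            product += (a_limb * b_limb) << (i * a_limb_bits + j * b_limb_bits)
            partial_products += 1
    return product, partial_products

a = (1 << 53) - 1            # largest 53-bit value
b = (1 << 52) | 12345        # hidden bit set plus some mantissa bits
product, count = multiply_by_parts(a, b)
print(product == a * b, count)
```

Changing `a_limb_bits` and `b_limb_bits` shows how the number of partial products varies with the multiplier widths available on another device.
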
If you have any questions, please email me at: davidklun@gmail.com

Thanks,
David Lundgren

-----

Synthesis Results:

Performance Summary
*******************

Worst slack in design: -2.049

                  Requested     Estimated     Requested     Estimated                Clock        Clock
Starting Clock    Frequency     Frequency     Period        Period        Slack      Type         Group
----------------------------------------------------------------------------------------------------------------------
fpu_double|clk    300.0 MHz     185.8 MHz     3.333         5.382         -2.049     inferred     Inferred_clkgroup_0
======================================================================================================================

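The figures in the performance summary are related by simple arithmetic: each period is the reciprocal of its frequency, and the slack is the requested period minus the estimated period. The negative slack means the 300 MHz request was not met. A short check follows, with function names chosen only for illustration and periods taken to be in nanoseconds.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds, rounded to 3 decimals as in the report."""
    return round(1000.0 / freq_mhz, 3)

def slack_ns(requested_mhz: float, estimated_mhz: float) -> float:
    """Slack = requested period - estimated period."""
    return round(period_ns(requested_mhz) - period_ns(estimated_mhz), 3)

print(period_ns(300.0), period_ns(185.8), slack_ns(300.0, 185.8))
```
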
---------------------------------------
Resource Usage Report for fpu_double

Mapping to part: xc5vsx95tff1136-2
Cell usage:
DSP48E          9 uses
FD              3 uses
FDE             21 uses
FDR             587 uses
FDRE            3767 uses
FDRS            8 uses
FDRSE           51 uses
GND             6 uses
MUXCY           20 uses
MUXCY_L         598 uses
MUXF7           2 uses
VCC             6 uses
XORCY           497 uses
XORCY_L         5 uses
LUT1            187 uses
LUT2            742 uses
LUT3            1591 uses
LUT4            847 uses
LUT5            589 uses
LUT6            2613 uses

I/O ports: 206
I/O primitives: 205
IBUF           135 uses
OBUF           70 uses

BUFGP          1 use

I/O Register bits:                  0
Register bits not including I/Os:   4437 (7%)

Global Clock Buffers: 1 of 32 (3%)

Total load per clock:
   fpu_double|clk: 4446

Mapping Summary:
Total  LUTs: 6569 (11%)

Mapper successful!
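
The totals in the resource report can be cross-checked against the individual cell counts and against the signal list. The snippet below is a sanity check only; the port widths for the status flags (1 bit each) are an assumption, although they agree with the reported I/O counts.

```python
# Flip-flop and LUT cell counts copied from the report above
REGISTER_CELLS = {"FD": 3, "FDE": 21, "FDR": 587, "FDRE": 3767, "FDRS": 8, "FDRSE": 51}
LUT_CELLS = {"LUT1": 187, "LUT2": 742, "LUT3": 1591, "LUT4": 847, "LUT5": 589, "LUT6": 2613}

# Port widths from the signal list; 1-bit status flags are assumed
INPUT_BITS = {"clk": 1, "rst": 1, "enable": 1, "rmode": 2, "fpu_op": 3, "opa": 64, "opb": 64}
OUTPUT_BITS = {"out_fp": 64, "ready": 1, "underflow": 1, "overflow": 1,
               "inexact": 1, "exception": 1, "invalid": 1}

total_registers = sum(REGISTER_CELLS.values())   # report: 4437 register bits
total_luts = sum(LUT_CELLS.values())             # report: 6569 LUTs
total_inputs = sum(INPUT_BITS.values())          # report: 135 IBUF + 1 BUFGP (clk)
total_outputs = sum(OUTPUT_BITS.values())        # report: 70 OBUF

print(total_registers, total_luts, total_inputs, total_outputs)
```
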