The following describes an IEEE 754-compliant, double-precision floating point unit,
written in Verilog. The module consists of the following files:

1. fpu_double.v (top level)
2. fpu_add.v
3. fpu_sub.v
4. fpu_mul.v
5. fpu_div.v
6. fpu_round.v
7. fpu_exceptions.v

A testbench file containing 50 test-case operations is also included:
1. fpu_tb.v

This unit has been extensively simulated, covering all operations, rounding modes, exceptions
such as underflow and overflow, and even the obscure corner cases, such as a result crossing from
denormalized to normalized, and vice versa.

The floating point unit supports denormalized numbers,
4 operations (add, subtract, multiply, divide), and 4 rounding
modes (nearest, zero, +inf, -inf). The unit was synthesized with an
estimated frequency of 230 MHz, for a Virtex-5 target device. The synthesis results
are below. fpu_double.v is the top-level module, and it contains the input
and output signals of the unit. The unit was designed to be synchronous with
one global clock, and all of the registers can be reset with a synchronous global reset.
When the input signals (a and b operands, fpu operation code, rounding mode code) are
available, set the enable input high, then set it low after 2 clock cycles. When the
operation is complete and the output is available, the ready signal will go high. To start
the next operation, set the enable input high.

Each operation takes the following number of clock cycles to complete:
1. addition: 20 clock cycles
2. subtraction: 21 clock cycles
3. multiplication: 24 clock cycles
4. division: 71 clock cycles

These latencies are longer than those of other floating point units, but supporting denormalized numbers
requires more signals and logic levels to accommodate gradual underflow. The supported
clock speed of 230 MHz makes up for the large number of clock cycles required for each
operation to complete. If you have a lower clock speed, the code can be changed to
reduce the number of registers and latency of each operation. I purposely increased the
number of pipeline stages to get the code to synthesize to a faster clock frequency, but of course,
this led to longer latency. Which is more important depends on your application.

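To put the cycle counts in wall-clock terms, here is a small sketch (not part of the
original source files) that converts them to nanoseconds. It assumes the 230 MHz
estimated frequency quoted in this README; the function name latency_ns is only
illustrative.

```python
# Cycle counts are taken from this README; the clock rate is the estimated
# synthesis frequency quoted in the text (an assumption for your own device).
CLOCK_MHZ = 230.0

CYCLES = {
    "addition": 20,
    "subtraction": 21,
    "multiplication": 24,
    "division": 71,
}


def latency_ns(cycles, clock_mhz=CLOCK_MHZ):
    """Return the latency of one operation in nanoseconds."""
    # One clock period is 1000 / f(MHz) nanoseconds.
    return cycles * 1000.0 / clock_mhz


for name, cycles in CYCLES.items():
    print(f"{name:15s}{cycles:3d} cycles  {latency_ns(cycles):6.1f} ns")
```

Passing a different clock_mhz shows how the same cycle counts translate to latency
at a lower clock speed.
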
The following output signals are also available: underflow, overflow, inexact, exception,
and invalid. They are compliant with the IEEE-754 definition of each signal. The unit
will handle QNaN and SNaN inputs per the standard.

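For readers building test vectors, the following sketch shows how a 64-bit pattern
can be classified as normal, denormalized, infinity, QNaN, or SNaN. It is a software
illustration only, not taken from the Verilog source. It assumes the common convention
that the most significant fraction bit is set for a quiet NaN; the helper names
double_to_bits and classify are hypothetical.

```python
import struct

EXP_MASK = 0x7FF                    # 11-bit exponent field
FRAC_BITS = 52                      # 52-bit fraction field
FRAC_MASK = (1 << FRAC_BITS) - 1
QUIET_BIT = 1 << (FRAC_BITS - 1)    # assumed quiet-NaN marker (MSB of fraction)


def double_to_bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def classify(bits):
    """Classify a 64-bit double-precision pattern (sign bit is ignored)."""
    exponent = (bits >> FRAC_BITS) & EXP_MASK
    fraction = bits & FRAC_MASK
    if exponent == 0:
        return "zero" if fraction == 0 else "denormal"
    if exponent == EXP_MASK:
        if fraction == 0:
            return "infinity"
        return "qnan" if fraction & QUIET_BIT else "snan"
    return "normal"


print(classify(double_to_bits(1.0)))
print(classify(double_to_bits(5e-324)))   # smallest positive denormal
print(classify(0x7FF8000000000000))       # quiet NaN pattern
print(classify(0x7FF0000000000001))       # signaling NaN pattern
```

The NaN examples are given as raw bit patterns because converting a signaling NaN
through a host float can silently turn it into a quiet NaN.
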
I'm planning on adding more operations, like square root, sin, cos, tan, etc.,
so check back for updates.

Multiply:
The multiply module is written specifically for a Virtex-5 target device. The DSP48E slices
can perform a 25-bit by 18-bit two's-complement multiply (24 by 17 unsigned multiply). I broke up the multiply to
fit these DSP48E slices. The breakdown is similar to the design in Figure 4-15 of the
Xilinx User Guide Document, "Virtex-5 FPGA XtremeDSP Design Considerations", also known as UG193.
You can find this document at xilinx.com by searching for "UG193".
Depending on your device, the multiply can be changed to match the bit-widths of the available
multipliers. A total of 9 DSP48E slices are used to do the 53-bit by 53-bit multiply of the
mantissas of 2 floating point numbers.

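The general idea of breaking a wide multiply into narrower pieces can be sketched in
software as follows. This is only an illustration of the technique under the 24-bit
and 17-bit unsigned widths mentioned above; it does not reproduce the actual
partitioning in fpu_mul.v or in UG193, and the names split and multiply_by_parts are
hypothetical.

```python
A_CHUNK = 24   # unsigned width of one multiplier input, per the text
B_CHUNK = 17   # unsigned width of the other multiplier input, per the text
WIDTH = 53     # mantissa width including the implicit leading bit


def split(value, chunk_bits, total_bits=WIDTH):
    """Split value into (shift, chunk) pairs, least-significant chunk first."""
    pieces = []
    for shift in range(0, total_bits, chunk_bits):
        width = min(chunk_bits, total_bits - shift)
        pieces.append((shift, (value >> shift) & ((1 << width) - 1)))
    return pieces


def multiply_by_parts(a, b):
    """Multiply two 53-bit unsigned values by summing shifted partial products."""
    total = 0
    for a_shift, a_part in split(a, A_CHUNK):
        for b_shift, b_part in split(b, B_CHUNK):
            # Each partial product fits a 24-bit by 17-bit unsigned multiplier.
            total += (a_part * b_part) << (a_shift + b_shift)
    return total


a = (1 << 52) | 0x123456789ABCD
b = (1 << 52) | 0xFEDCBA9876543
print(multiply_by_parts(a, b) == a * b)
```

This naive split produces 3 by 4 = 12 partial products, several of them very narrow,
so it should not be read as the mapping onto the 9 DSP48E slices reported here.
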
If you have any questions, please email me at: davidklun@gmail.com

Thanks,
David Lundgren

-----

Synthesis Results:

Performance Summary
*******************

Worst slack in design: -0.971

                   Requested     Estimated     Requested     Estimated                Clock        Clock
Starting Clock     Frequency     Frequency     Period        Period        Slack      Type         Group
-----------------------------------------------------------------------------------------------------------
fpu|clk            300.0 MHz     232.3 MHz     3.333         4.304         -0.971     inferred
==========================================================================

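The figures in the performance summary are related by simple arithmetic, which the
following sketch reproduces. It assumes the period and slack columns are in
nanoseconds, which is consistent with 300 MHz corresponding to 3.333; the variable
names are illustrative.

```python
# Periods copied from the performance summary (assumed to be in nanoseconds).
requested_period_ns = 3.333
estimated_period_ns = 4.304


def freq_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


# Slack is the requested period minus the estimated (achievable) period.
slack_ns = requested_period_ns - estimated_period_ns

print(f"requested frequency: {freq_mhz(requested_period_ns):.1f} MHz")
print(f"estimated frequency: {freq_mhz(estimated_period_ns):.1f} MHz")
print(f"slack:               {slack_ns:.3f} ns")
```

The negative slack only means the 300 MHz request was not met; the estimated
frequency is the achievable figure.
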
---------------------------------------
Resource Usage Report for fpu

Mapping to part: xc5vsx95tff1136-2
Cell usage:
DSP48E          9 uses
FD              5 uses
FDR             519 uses
FDRE            3920 uses
FDRSE           1 use
GND             6 uses
LD              6 uses
MUXCY           35 uses
MUXCY_L         704 uses
MUXF7           1 use
VCC             5 uses
XORCY           491 uses
XORCY_L         12 uses
LUT1            185 uses
LUT2            725 uses
LUT3            1523 uses
LUT4            738 uses
LUT5            604 uses
LUT6            2506 uses

I/O ports: 206
I/O primitives: 205
IBUF            135 uses
OBUF            70 uses

BUFGP           1 use

I/O Register bits: 0
Register bits not including I/Os: 4445 (7%)
Latch bits not including I/Os: 6 (0%)

Global Clock Buffers: 1 of 32 (3%)

Total load per clock:
fpu|clk: 4454

Mapping Summary:
Total LUTs: 6281 (10%)