https://opencores.org/ocsvn/fpu_double/fpu_double/trunk
The following describes the IEEE-Standard-754 compliant, double-precision floating point unit,
written in VHDL. The module consists of the following files:

1. fpu_double.vhd (top level)
2. fpu_add.vhd
3. fpu_sub.vhd
4. fpu_mul.vhd
5. fpu_div.vhd
6. fpu_round.vhd
7. fpu_exceptions.vhd
8. fpupack.vhd
9. comppack.vhd

A testbench file is also included, containing 50 test-case operations:

1. fpu_double_TB.vhd

This unit has been extensively simulated, covering all 4 operations, rounding modes, exceptions
like underflow and overflow, and even the obscure corner cases, like when overflowing from
denormalized to normalized, and vice-versa.

The floating point unit supports denormalized numbers, 4 operations (add, subtract, multiply,
divide), and 4 rounding modes (nearest, zero, + inf, - inf). The unit was synthesized with an
estimated frequency of 185 MHz, for a Virtex5 target device. The synthesis results
are below. fpu_double.vhd is the top-level module, and it contains the input
and output signals from the unit.

The input and output signals to the unit are the following:

1. clk (global)
2. rst (global)
3. enable (set high, then low, to start operation)
4. rmode (rounding mode, 2 bits, 00 = nearest, 01 = zero,
   10 = pos inf, 11 = neg inf)
5. fpu_op (operation code, 3 bits, 000 = add, 001 = subtract,
   010 = multiply, 011 = divide, others are not used)
6. opa, opb (input operands, 64 bits, Big-endian order,
   bit 63 = sign, bits 62-52 exponent, bits 51-0 mantissa)
7. out_fp (output from operation, 64 bits, Big-endian order,
   same ordering as inputs)
8. ready (goes high when output is available)
9. underflow
10. overflow
11. inexact
12. exception - see IEEE 754 definition
13. invalid - see IEEE 754 definition

The unit was designed to be synchronous with one global clock, and all of the
registers can be reset with a synchronous global reset.

When the input signals (a and b operands, fpu operation code, rounding mode code) are
available, set the enable input high, then set it low after 2 clock cycles.
When the operation is complete and the output is available, the ready signal will go high.
To start the next operation, set the enable input high again.

Each operation takes the following number of clock cycles to complete:

1. addition: 20 clock cycles
2. subtraction: 21 clock cycles
3. multiplication: 24 clock cycles
4. division: 71 clock cycles

This is longer than other floating point units, but supporting denormalized numbers
requires more signals and logic levels to accommodate gradual underflow. The supported
clock speed of 185 MHz makes up for the large number of clock cycles required for each
operation to complete. If you have a lower clock speed, the code can be changed to
reduce the number of registers and the latency of each operation. I purposely added
pipeline registers, reducing the logic levels between them, to get the code to synthesize
to a faster clock frequency, but of course this led to longer latency. Which of the two
matters more depends on your application.

The following output signals are also available: underflow, overflow, inexact, exception,
and invalid. They are compliant with the IEEE-754 definition of each signal. The unit
will handle QNaN and SNaN inputs per the standard.

I'm planning on adding more operations, like square root, sin, cos, tan, etc.,
so check back for updates.

Multiply:

The multiply module is written specifically for a Virtex5 target device. The DSP48E slices
can perform a 25-bit by 18-bit two's-complement multiply (24 by 17 unsigned multiply).
I broke up the multiply to fit these DSP48E slices. The breakdown is similar to the design
in Figure 4-15 of the Xilinx User Guide Document, "Virtex-5 FPGA XtremeDSP Design
Considerations", also known as UG193. You can find this document at xilinx.com by
searching for "UG193". Depending on your device, the multiply can be changed to match
the bit-widths of the available multipliers.
A total of 9 DSP48E slices are used to do the 53-bit by 53-bit multiply of 2
floating point numbers.

If you have any questions, please email me at: davidklun@gmail.com

Thanks,
David Lundgren

-----

Synthesis Results:

Performance Summary
*******************

Worst slack in design: -2.049

                   Requested     Estimated     Requested     Estimated                Clock        Clock
Starting Clock     Frequency     Frequency     Period        Period        Slack      Type         Group
----------------------------------------------------------------------------------------------------------------------
fpu_double|clk     300.0 MHz     185.8 MHz     3.333         5.382         -2.049     inferred     Inferred_clkgroup_0
======================================================================================================================

---------------------------------------
Resource Usage Report for fpu_double

Mapping to part: xc5vsx95tff1136-2

Cell usage:
DSP48E          9 uses
FD              3 uses
FDE             21 uses
FDR             587 uses
FDRE            3767 uses
FDRS            8 uses
FDRSE           51 uses
GND             6 uses
MUXCY           20 uses
MUXCY_L         598 uses
MUXF7           2 uses
VCC             6 uses
XORCY           497 uses
XORCY_L         5 uses
LUT1            187 uses
LUT2            742 uses
LUT3            1591 uses
LUT4            847 uses
LUT5            589 uses
LUT6            2613 uses

I/O ports: 206
I/O primitives: 205
IBUF            135 uses
OBUF            70 uses
BUFGP           1 use

I/O Register bits:                  0
Register bits not including I/Os:   4437 (7%)

Global Clock Buffers: 1 of 32 (3%)

Total load per clock:
   fpu_double|clk: 4446

Mapping Summary:
Total LUTs: 6569 (11%)

Mapper successful!
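
-----

Preparing test stimulus (sketch):

When writing extra test cases for the testbench, it can help to compute the 64-bit operand
words and the control codes on the software side first. The Python sketch below is only an
illustration and is not part of the VHDL sources. It uses the operand layout from the signal
list above (bit 63 = sign, bits 62-52 exponent, bits 51-0 mantissa), the fpu_op and rmode
codes, and the cycle counts quoted above. The names FPU_OP, RMODE, CYCLES, double_to_bits,
split_fields and latency_ns are hypothetical helpers introduced here.

```python
import struct

# Control encodings copied from the signal list in this Readme.
FPU_OP = {"add": 0b000, "subtract": 0b001, "multiply": 0b010, "divide": 0b011}
RMODE = {"nearest": 0b00, "zero": 0b01, "pos_inf": 0b10, "neg_inf": 0b11}

# Latency per operation in clock cycles, as quoted in this Readme.
CYCLES = {"add": 20, "subtract": 21, "multiply": 24, "divide": 71}


def double_to_bits(value):
    """Return the IEEE 754 double-precision pattern of a float as a 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", value))[0]


def split_fields(bits):
    """Split a 64-bit operand word into (sign, exponent, mantissa)."""
    sign = (bits >> 63) & 0x1            # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits 62-52 (biased exponent)
    mantissa = bits & ((1 << 52) - 1)    # bits 51-0
    return sign, exponent, mantissa


def latency_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds for a given clock frequency in MHz."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    opa = double_to_bits(1.0)
    print(f"opa = x\"{opa:016X}\"")       # hex literal usable in a VHDL testbench
    print(split_fields(double_to_bits(-2.5)))
    for name, cycles in CYCLES.items():
        # 185.8 MHz is the estimated frequency from the synthesis report above.
        print(name, FPU_OP[name], round(latency_ns(cycles, 185.8), 1), "ns")
```

A denormalized operand shows up as an exponent field of 0 with a non-zero mantissa, which
is a quick way to check that a stimulus value really exercises the gradual-underflow path.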
