URL https://opencores.org/ocsvn/dblclockfft/dblclockfft/trunk

\documentclass{gqtekspec}
\project{Double Clocked FFT}
\title{Specification}
\author{Dan Gisselquist, Ph.D.}
\email{dgisselq (at) opencores.org}
\revision{Rev.~0.2}
\begin{document}
\pagestyle{gqtekspecplain}
\titlepage
\begin{license}
Copyright (C) \theyear\today, Gisselquist Technology, LLC
 
This project is free software (firmware): you can redistribute it and/or
modify it under the terms of  the GNU General Public License as published
by the Free Software Foundation, either version 3 of the License, or (at
your option) any later version.
 
This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.
 
You should have received a copy of the GNU General Public License along
with this program.  If not, see \hbox{<http://www.gnu.org/licenses/>} for a copy.
\end{license}
\begin{revisionhistory}
0.2 & 6/2/2015 & Gisselquist & Superficial formatting changes\\\hline
0.1 & 3/3/2015 & Gisselquist & First Draft \\\hline
\end{revisionhistory}
% Revision History
% Table of Contents, named Contents
\tableofcontents
\listoffigures
\listoftables
\begin{preface}
This FFT comes from my attempts to design and implement a signal processing
algorithm inside a generic FPGA, but only on a limited budget.  As such,
I don't yet have the FPGA board I wish to place this algorithm onto, nor
do I have any expensive modeling or simulation capabilities.  I'm using
Verilator for my modeling and simulation needs.  This makes
a vendor supplied IP core, such as an FFT, difficult if not impossible
to use.
 
My problem was made worse when I learned that the published maximum clock
speed for a device wasn't necessarily the maximum clock speed that I could
achieve.  My design needed to process the incoming signal at 500~MHz to be
commercially viable.  500~MHz is not necessarily a clock speed
that can be easily achieved.  250~MHz, on the other hand, is much more within
the realm of possibility.  Achieving a 500~MHz performance with a 250~MHz
clock, however, requires an FFT that accepts two samples per clock.
 
This, then, was and is the genesis of this project.
\end{preface}
 
\chapter{Introduction}
\pagenumbering{arabic}
\setcounter{page}{1}
 
The Double Clocked FFT project contains all of the software necessary to
generate the IP for an FFT of any power of two size.  The generated core
clocks two samples in on each clock cycle and, after some pipeline delay,
clocks two samples out on every clock cycle.
 
The FFT generated by this approach is very configurable.  By simple adjustment
of a command line parameter, the FFT may be made to be a forward FFT or an
inverse FFT.  The number of bits processed, kept, and maintained by this
FFT are also configurable.  Even the number of bits used for the twiddle
factors, or whether or not to bit reverse the outputs, are all configurable
parts to this FFT core.
 
These features set the Double Clocked FFT apart from the
other cores available on opencores.org.
 
For those who wish to get started right away, please download the package,
change into the {\tt sw} directory and run {\tt make}.  There is no need to
run a configure script, since {\tt fftgen} is completely portable C++.  Then, once
built, go ahead and run {\tt fftgen} without any arguments.  This will cause
{\tt fftgen} to print a usage statement to the screen.  Review the usage
statement, and run {\tt fftgen} a second time with the arguments you need.
 
 
\chapter{Generation}
 
Creating a double clocked FFT core is as simple as running the program
{\tt fftgen}.  The program will then create a series of Verilog files, as
well as {\tt .hex} files suitable for use with Verilog's {\tt \$readmemh}, and
place them into a {\tt ./fft-core/} directory that {\tt fftgen} will create.
Creating the core you want takes a touch of configuring.
Therefore, the following lists the arguments that can be given to
{\tt fftgen} to adjust the core that it builds:
\begin{itemize}
\item[\hbox{-f size}]
        This specifies the size of the FFT core that {\tt fftgen} will build.
        The size must be a power of two.  The transform is given, to within a
        scale factor, by
        \begin{eqnarray*}
        X\left[k\right] &=& \sum_{n=0}^{N-1} x\left[n\right]
                e^{-j2\pi \frac{k}{N}n}
        \end{eqnarray*}
 
\item[\hbox{-1}]
        This specifies that the FFT will be an inverse FFT.  Specifically,
        it will calculate,
        \begin{eqnarray*}
        x\left[n\right] &=& \sum_{k=0}^{N-1} X\left[k\right] e^{j2\pi \frac{k}{N}n}
        \end{eqnarray*}
\item[\hbox{-0}]
        This specifies building a forward FFT.  However, since this is the
        default, this option is never necessary.
\item[\hbox{-s}]
        This causes the core to skip the final bit reversal stage.  The
        outputs of the FFT will then come out in bit reversed order.
 
        This option is useful in those cases where someone wishes to
        multiply the coefficients coming out of an FFT by some set of factors,
        and then to inverse FFT the results.  If those factors are also
        applied in bit reversed order, then both the FFT and IFFT may
        skip their bit reversals.

        Be aware, however, that doing this requires the bit reversed forward
        transform to be followed by a bit reversed decimation in time approach
        to the inverse transform.  This software does not (yet) provide that
        capability, so this option is of limited use for now.
\item[\hbox{-S}]
        Include the final bit reversal stage.  As this is also the default,
        specifying the option should not be necessary.
\item[\hbox{-d DIR}]
        Specifies the DIRectory in which to place the produced Verilog files.  By
        default, this will be in the `./fft-core/' directory, but it can
        be moved to any other directory as necessary.
\item[\hbox{-n bits}] Sets the number of input bits per sample.  Given this
        setting, each of the two samples clocked in at every clock cycle
        will have this many bits for their real portion, and again this many
        bits for their imaginary portion.  Thus, the data input to the
        FFT will be four times this many bits per clock.
\item[\hbox{-m bits}] This sets the maximum bit width of the output.
        By default, the FFT will gain bits as they accumulate within
        the FFT.  Bits are accumulated at roughly one bit for every two stages.
        However, if this value is set, bits are only accumulated up to this
        maximum width.  After this width, further accumulations are truncated.
\item[\hbox{-c bits}] The number of bits in each twiddle coefficient is given
        by the number of bits input to that stage plus this extra number of
        bits per coefficient.  By increasing the number of bits per coefficient
        above that of the input samples, truncation error is kept to the
        original error found within the original samples.
\item[\hbox{-x bits}] Internally accumulated roundoff error can be a difficult
        problem to solve.  By using this option, you guarantee that the FFT
        runs with an additional {\tt bits} bits, and only truncates down to
        the necessary width at the end in order to minimize rounding
        errors along the way.
\item[\hbox{-p nmpy}] This sets the number of hardware multiplies that the FFT
        will consume.  By default, the FFT does not use any hardware multiplies.
        However, this can be expensive on the rest of the logic used by the
        device.  You can avoid this problem by allowing the FFT to use
        hardware multiplies using this option.  By default, the multiplies will
        be used in the latter stages, so that they will be applied where
        the bit width is the greatest.
\end{itemize}
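 
As an illustration only, a 1024 point forward FFT with 12 bit inputs and
outputs limited to 16 bits might be requested as sketched below.  The
particular values are assumptions chosen for the example, the command is
assumed to be run from the directory in which {\tt fftgen} was built, and
the usage statement printed by {\tt fftgen} remains the authority on the
exact syntax.
\begin{verbatim}
./fftgen -f 1024 -n 12 -m 16 -d ./fft-core/
\end{verbatim}
With these values each input sample carries $12$ real and $12$ imaginary
bits, so that the core accepts
\begin{eqnarray*}
2\mbox{ samples}\times 2\times 12\mbox{ bits} &=& 48\mbox{ bits per clock.}
\end{eqnarray*}
Taking the rule of thumb of one bit of growth for every two stages, and
assuming $\log_2 1024 = 10$ stages, an unconstrained output would be roughly
$12 + 10/2 = 17$ bits wide, so {\tt -m 16} would trim about one bit from
each output value.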
 
\chapter{Architecture}
 
As a component of another system the structure of this system is a simple
black box such as the one shown in Fig.~\ref{fig:black-box}.
\begin{figure}\begin{center}
\begin{pspicture}(-2.1in,0)(2.1in,2in)
% \rput(0,0){\psframe(-2.1in,0)(2.1in,2in)}
\rput(0,0){\rput(0,0){\psframe[linewidth=2\pslinewidth](-0.75in,0)(0.75in,2in)}
        \rput(0,1in){(I)FFT Core}
        \rput[r](-1.6in,1.8in){\tt i\_clk}
                \rput(-1.5in,1.8in){\psline{->}(0,0)(0.7in,0)}
        \rput[r](-1.6in,1.5in){\tt i\_rst}
                \rput(-1.5in,1.5in){\psline{->}(0,0)(0.7in,0)}
        \rput[r](-1.6in,1.2in){\tt i\_ce}
                \rput(-1.5in,1.2in){\psline{->}(0,0)(0.7in,0)}
        \rput[r](-1.6in,0.6in){\tt i\_left}
                \rput(-1.5in,0.6in){\psline{->}(0,0)(0.7in,0)}
                \rput(-1.15in,0.6in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}
                \rput[br](-1.2in,0.6in){\scalebox{0.75}{$2N_i$}}
        \rput[r](-1.6in,0.3in){\tt i\_right}
                \rput(-1.5in,0.3in){\psline{->}(0,0)(0.7in,0)}
                \rput(-1.15in,0.3in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}
                \rput[br](-1.2in,0.3in){\scalebox{0.75}{$2N_i$}}
        %
        \rput[l](1.6in,1.2in){\tt o\_sync}
                \rput(0.8in,1.2in){\psline{->}(0,0)(0.7in,0)}
        \rput[l](1.6in,0.6in){\tt o\_left}
                \rput(0.8in,0.6in){\psline{->}(0,0)(0.7in,0)}
                \rput(1.15in,0.6in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}
                \rput[br](1.1in,0.6in){\scalebox{0.75}{$2N_o$}}
        \rput[l](1.6in,0.3in){\tt o\_right}
                \rput(0.8in,0.3in){\psline{->}(0,0)(0.7in,0)}
                \rput(1.15in,0.3in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}
                \rput[br](1.1in,0.3in){\scalebox{0.75}{$2N_o$}}
        }
\end{pspicture}
\caption{(I)FFT Black Box Diagram}\label{fig:black-box}
\end{center}\end{figure}
The interface
is simple: strobe the reset line, and every clock thereafter set the clock
enable line when data is valid on the left and right input ports.  Likewise
for the outputs, when the {\tt o\_sync} line goes high the first pair of data
samples is available.  Ever after that, one pair of data samples will be
available every clock cycle that the {\tt i\_ce} line is high.
 
Internal to the FFT, things are a touch more complex.  Fig.~\ref{fig:white-box}
\begin{figure}\begin{center}
\begin{pspicture}(1.3in,-0.5in)(4.7in,5in)
        % \rput(0,0){\psframe(0,-0.5in)(\textwidth,5.25in)}
        \rput(0,0){\psframe[linewidth=2\pslinewidth](1.3in,-0.25in)(4.7in,5in)}
        \rput(0,5in){%
                \rput[r](1.95in,0.125in){\tiny\tt i\_left}
                \rput[l](4.05in,0.125in){\tiny\tt i\_right}
                \rput(2.0in,0){\psline{->}(0,0.25in)(0,0.0in)}
                \rput(4.0in,0){\psline{->}(0,0.25in)(0,0.0in)}
        }
        \rput(2in,0){%
                \rput(0,4.25in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput[r](-0.05in,0.675in){\tiny Left}
                        \rput(0.0in,0){\psline{->}(0,0.75in)(0,0.5in)}
                        \rput(0,0.25in){Evens, $N$}
                        \rput[r](-0.35in,-0.125in){\tiny Sync}
                        \rput[l](0.35in,-0.125in){\tiny Data}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
                \rput(0,3.5in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput(0,0.25in){Evens, $N/2$}
                        \rput[r](-0.35in,-0.125in){\tiny Sync}
                        \rput[l](0.35in,-0.125in){\tiny Data}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                        \rput( 0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
                % \rput(0,3in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        % \rput(0,0.25in){Evens, $N$}}
                \rput(0,2.25in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput(0,0.25in){Evens, $8$}
                        \rput[r](-0.35in,-0.125in){\tiny Sync}
                        \rput[l](0.35in,-0.125in){\tiny Data}
                        \rput[r](-0.35in,0.675in){\tiny Sync}
                        \rput[l](0.35in,0.675in){\tiny Data}
                        \rput(-0.3in,0.9in){$\vdots$}
                        \rput( 0.3in,0.9in){$\vdots$}
                        \rput(-0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}
                        \rput( 0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
                \rput(0,1.5in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput(0,0.25in){Qtrstage (Even)}
                        \rput[r](-0.35in,-0.125in){\tiny Sync}
                        \rput[lb](0.6in,-0.10in){\tiny Data}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.5in)(0.8in,-0.5in)}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.125in)(0.4in,-0.125in)(0.4in,-0.25in)}}
                % \rput(0,0.75in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        % \rput(0,0.25in){dblstage}}
                % \rput(0,0in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        % \rput(0,0.25in){Bit Reversal}}
        }
        \rput(4in,0){%
                \rput(0,4.25in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput[l](0.05in,0.675in){\tiny Right}
                        \rput(0.0in,0){\psline{->}(0,0.75in)(0,0.5in)}
                        \rput(0,0.25in){Odds, $N$}
                        \rput[l](0.35in,-0.125in){\tiny Sync}
                        \rput[r](-0.35in,-0.125in){\tiny Data}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
                \rput(0,3.5in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput(0,0.25in){Odds, $N/2$}
                        \rput[l](0.35in,-0.125in){\tiny Sync}
                        \rput[r](-0.35in,-0.125in){\tiny Data}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
                % \rput(0,3in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        % \rput(0,0.25in){Evens, $N$}}
                \rput(0,2.25in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput(0,0.25in){Odds, $8$}
                        \rput[l](0.35in,0.675in){\tiny Sync}
                        \rput[r](-0.35in,0.675in){\tiny Data}
                        \rput(-0.3in,0.9in){$\vdots$}
                        \rput( 0.3in,0.9in){$\vdots$}
                        \rput[l](0.35in,-0.125in){\tiny Sync}
                        \rput[r](-0.35in,-0.125in){\tiny Data}
                        \rput(-0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}
                        \rput(0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
                \rput(0,1.5in){\psframe(-0.5in,0)(0.5in,0.5in)%
                        \rput(0,0.25in){Qtrstage (Odd)}
                        \rput[rb](-0.6in,-0.10in){\tiny Data}
                        \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                                \rput[t](0.3in,-0.3in){\tiny NC}
                        \rput(-0.3in,0){\psline{->}(0,0)(0,-0.125in)(-0.4in,-0.125in)(-0.4in,-0.25in)}
                        }
        }
        \rput(3in,0.75in){\psframe(-0.5in,0)(0.5in,0.5in)%
                \rput(0,0.25in){Double Stage}
                        \rput[r](-0.35in,-0.125in){\tiny Sync}
                        \rput[l](0.35in,-0.125in){\tiny Right}
                        \rput[r](0.15in,-0.125in){\tiny Left}
                \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                \rput(0.2in,0){\psline{->}(0,0)(0,-0.25in)}
                \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
        \rput(3in,0in){\psframe(-0.5in,0)(0.5in,0.5in)%
                \rput(0,0.25in){Bit Reversal}
                        \rput[r](-0.35in,-0.125in){\tiny Sync}
                        \rput[l](0.35in,-0.125in){\tiny Right}
                        \rput[r](0.15in,-0.125in){\tiny Left}
                \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                \rput(0.2in,0){\psline{->}(0,0)(0,-0.25in)}
                \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
        \rput(3in,-0.25in){\rput[r](-0.35in,-0.125in){\tiny\tt o\_sync}
                        \rput[l](0.35in,-0.125in){\tiny\tt o\_right}
                        \rput[r](0.15in,-0.125in){\tiny\tt o\_left}
                \rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}
                \rput(0.2in,0){\psline{->}(0,0)(0,-0.25in)}
                \rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}
\end{pspicture}
\caption{Internal FFT Structure}\label{fig:white-box}
\end{center}\end{figure}
attempts to show some of this structure.  As you can see from the figure, the
FFT itself is composed of a series of stages.  These stages are split from the
beginning into an even stage and an odd stage.  Further, they are numbered
according to the size of the FFT they represent.  Therefore the first stage
is numbered $N$ and represents the first stage of an $N$ point FFT.  The
second stage is labeled $N/2$, then $N/4$, and so on down to $8$.  The
four sample stage and the two sample stage are different, however.  These
two stages, representing three blocks on Fig.~\ref{fig:white-box}, can be
accomplished without any multiplies.  Therefore they have been implemented
separately.  Likewise all of the stages, save the double stage at the bottom,
operate on one data sample per clock.  Only the last stage, prior to the
bit reversal stage, takes two data samples per clock as input, and outputs two
data samples per clock.  Finally, the bit reversal stage acts as the last
piece of the structure.
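 
To make the numbering concrete, consider $N=1024$ as an example size.  The
stages with multiplies on each of the even and odd paths would then be
\[
1024,\ 512,\ 256,\ 128,\ 64,\ 32,\ 16,\ 8,
\]
or $\log_2 N - 2 = 8$ stages per path, followed by the four sample stage and
the two sample stage.  The last two need no multiplies because, for the usual
radix two twiddle factors $W_M^n = e^{-j2\pi n/M}$ of a forward transform
(notation introduced here for illustration, not used elsewhere in this
document), the only values they require are
\[
W_4^0 = 1,\quad W_4^1 = -j,\quad W_2^0 = 1,
\]
and a multiplication by $\pm 1$ or $\pm j$ is only a negation or a swap of the
real and imaginary parts.  The inverse transform differs only in using $+j$ in
place of $-j$.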
 
Internal to each of the FFT stages is a butterfly and a complex multiply,
as shown in Fig.~\ref{fig:fftstage}.
\begin{figure}\begin{center}
\begin{pspicture}(-0.25in,-1.8in)(3.25in,4.25in)
        % \rput(0,0){\psframe(0in,-2in)(3in,4.25in)}
        \rput(0,0){\psframe[linewidth=2\pslinewidth](-0.25in,-1.55in)(3.25in,4.0in)}
        \rput[r](1.625in,4.125in){\tt i\_data}
        \rput(1.675in,3.75in){\psline{->}(0,0.5in)(0,0in)%
                        \psline{->}(0,0)(-0.2in,-0.25in)%
                        \psarc{->}{0.15in}{200}{340}}
        \rput(0,2.75in){\rput(0,0){\psframe(0,0)(1.3in,0.25in)}
                        \rput(0,0){\psframe(0.1in,0)(0.2in,0.25in)}
                        \rput(0,0){\psframe(0.3in,0)(0.4in,0.25in)}
                        \rput(0,0){\psframe(0.5in,0)(0.6in,0.25in)}
                        \rput(0,0){\psframe(0.7in,0)(0.8in,0.25in)}
                        \rput(0,0){\psframe(0.9in,0)(1.0in,0.25in)}
                        \rput(0,0){\psframe(1.1in,0)(1.2in,0.25in)}
                        \rput(0,0){\psline{-}(0.7in,-0.05in)(1.1in,-0.25in)}
                        \rput(0,0){\psline{<-}(0.7in,0.3in)(1.5in,0.5in)(1.5in,0.75in)}}
        \rput(1.85in,2.75in){\psline(0,0.75in)(0,-0.25in)}
        \rput(0.6in,0.25in){\rput(0,0){\psframe[linewidth=2\pslinewidth](0,0)(2in,2.0in)}
                \rput(0.50in,2in){\psline{->}(0,0.25in)(0,0in)}
                \rput(1.25in,2in){\psline{->}(0,0.25in)(0,0in)}
                \rput(1.75in,2in){\psline{->}(0,0.25in)(0,0in)}
                \rput(0.5in,0){%
                        \rput(0in,0){\psline{->}(0,2.0in)(0,1.1in)}
                        \rput(0in,0){\psline{->}(0,1.75in)(0.65in,1.1in)}
                        \rput(-0.1in,1.1in){$+$}
                        \rput(0in,1.0in){$\bigoplus$}
                        \rput(0in,0){\psline{->}(0,0.9in)(0,0.75in)}
                        \rput(0in,0.5in){\psframe(-0.45in,-0.25in)(0.45in,0.25in)}
                        \rput(0in,0.5in){\parbox{0.8in}{Delay, and\\shift by $C-2$}}
                        \rput(0in,0){\psline{->}(0,0.25in)(0,0.0in)}}
                \rput(1.25in,0){%
                        \rput(0in,0){\psline{->}(0,2.0in)(0,1.1in)}
                        \rput(0in,0){\psline{->}(0,1.75in)(-0.65in,1.1in)}
                        \rput(0.1in,1.1in){$-$}
                        \rput(0in,1in){$\bigoplus$}
                        \rput(0in,0){\psline{->}(0,0.9in)(0,0.6in)}
                        \rput(0in,0.5in){$\bigotimes$}
                        \rput(0in,0){\psline{->}(0,0.4in)(0,0.0in)}}
                \rput(1.75in,0){%
                        \rput(0,0){\psline{->}(0,2.0in)(0,0.5in)(-0.4in,0.5in)}}
                \rput(0.50in,-0.25in){\psline{->}(0,0.25in)(0,-1.05in)}
                \rput(1.25in,-0.25in){\psline{-}(0,0.25in)(0,0in)}}
        \rput*[l](2.0in,0.5in){DIF Butterfly}
        \rput*[lb](1.95in,2.5in){Coefficient memory}
        % \rput(0,0){\psframe(1.3in,-0.25in)(4.7in,5in)}
        \rput(1.7in,-0.5in){\rput(0,0){\psframe(0,0)(1.3in,0.25in)}
                        \rput(0,0){\psframe(0.1in,0)(0.2in,0.25in)}
                        \rput(0,0){\psframe(0.3in,0)(0.4in,0.25in)}
                        \rput(0,0){\psframe(0.5in,0)(0.6in,0.25in)}
                        \rput(0,0){\psframe(0.7in,0)(0.8in,0.25in)}
                        \rput(0,0){\psframe(0.9in,0)(1.0in,0.25in)}
                        \rput(0,0){\psframe(1.1in,0)(1.2in,0.25in)}
                        \rput(0,0){\psline{<-}(0.7in,0.30in)(0.15in,0.5in)}
                        \rput(0,0){\psline{->}(0.7in,-0.05in)(-0.2in,-0.3in)(-0.2in,-0.55in)}}
        \rput(1.3in,-1.3in){\psline{->}(-0.2in,0.25in)(0,0)}
        \rput(1.3in,-1.3in){\psarcn{->}{0.15in}{150}{30}}
        \rput(1.3in,-1.3in){\psline{->}(0,0)(0,-0.5in)}
        \rput[l](1.35in,-1.675in){\tt o\_data}
\end{pspicture}
\caption{A Single FFT Stage, with Butterfly}\label{fig:fftstage}
\end{center}\end{figure}
These FFT stages are really no different than any other decimation in
frequency FFT, save only that the coefficients are alternated between the
two stages.  That is, the even stages get all the even coefficients, and
the odd stages get all of the odd coefficients.
Internally, each stage spends the first $N/4$ clocks storing its inputs
into memory, and then the next $N/4$ clocks pairing a stored input with
a single external input, so that both values become inputs to the butterfly.
Likewise, the butterfly coefficient is read from a small ROM table.
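 
As a sketch of what Fig.~\ref{fig:fftstage} computes, write $a$ for the
stored input, $b$ for the external input paired with it, and $z_n$ for the
coefficient read from the ROM, with $z_n$ and $C$ as defined later in this
chapter.  The symbols $a$ and $b$, and the exact scaling below, are one
reading of the figure and are not taken from the generated Verilog.  A
decimation in frequency butterfly then produces
\begin{eqnarray*}
y_{\mbox{\tiny left}} &=& \left(a+b\right) 2^{C-2}, \\
y_{\mbox{\tiny right}} &=& \left(a-b\right) z_n,
\end{eqnarray*}
where the shift by $C-2$ bits gives the sum the same gain as the product,
since $\left|z_n\right|\approx 2^{C-2}$.  For the first stage of an $N$ point
transform the butterfly pairs $x\left[n\right]$ with
$x\left[n+N/2\right]$.  The even path holds the samples
$x\left[2m\right]$, so it pairs the sample at position $m$ of its stream with
the one at position $m+N/4$ and needs only the even numbered coefficients,
while the odd path pairs $x\left[2m+1\right]$ with
$x\left[2m+1+N/2\right]$ and needs only the odd numbered ones.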
 
One trick to making the FFT stage work successfully is synchronization.  Since
the shift and add multiplies create a delay of (roughly) one clock cycle per
bit of input, there is a significant pipeline delay from the input to the
output of the butterfly routine.  To match this delay, the FFT stage places a
synchronization pulse into the butterfly.  When this synchronization pulse
comes out of the butterfly, the values of the butterfly then match the
first sample out of the stage.  The next synchronization problem comes from
the fact that the butterflies operate on two samples at a time, whereas the
FFT stage operates on a single sample at a time.  This means that half the
time the butterfly output will be invalid.  To keep things aligned, and to
avoid the invalid data half, a counter is started by the synchronization pulse
coming out of the butterfly in order to keep track.  Once the butterfly
produces the first sync pulse, the next $N/4$ clock cycles, as tracked by this
counter, will produce valid butterfly outputs.  For these clock cycles, the left or
first output is sent immediately to the next FFT stage, whereas the right
or second output is saved into memory.  Once these cycles are complete, the
butterfly outputs will be invalid for the next $N/4$ clock cycles.  During
these invalid clock cycles, the FFT stage outputs data that had been stored
in memory.  In this fashion, data is always valid coming out of each FFT
stage once the initial synchronization pulse goes high.
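 
A small example may help.  Assume a stage with $N=16$, so that $N/4=4$, and
count enabled clocks from the sync pulse coming out of the butterfly.  The
clock numbering is an assumption made for illustration only.
\begin{center}
\begin{tabular}{|c|c|l|}\hline
Clocks & Butterfly output & Stage output \\\hline\hline
0 to 3 & valid & left outputs, passed straight through \\\hline
4 to 7 & invalid & right outputs from clocks 0 to 3, read from memory \\\hline
8 to 11 & valid & left outputs, passed straight through \\\hline
\end{tabular}
\end{center}
The pattern then repeats every $N/2 = 8$ clocks, which is how the stage can
present one valid sample on every enabled clock.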
 
The complex multiply itself, internal to the butterfly routine, is
built from three very simple shift and add multiplies, whose outputs are
then combined into a single complex output, although there is a command
line option to use hardware multiplies instead.  To avoid overflow, the
complex coefficients, $z_n$, for these multiplies are given by,
\begin{eqnarray}
z_n &=& c_n + js_n,\mbox{ where} \\
c_n &=& \left\lfloor 2^{C-2}\cos\left(2\pi \frac{n}{N}\right)+\frac{1}{2}\right\rfloor,\\
s_n &=& \left\lfloor 2^{C-2}\sin\left(2\pi \frac{n}{N}\right)+\frac{1}{2}\right\rfloor\mbox{, and}
\end{eqnarray}
$C$ is the number of bits allocated to the coefficient.
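 
For example, assuming $N=8$ and $C=8$ (values chosen only for illustration),
the scale factor is $2^{C-2}=64$ and the first few coefficients are
\begin{eqnarray*}
z_0 &=& 64 + j0, \\
z_1 &=& \left\lfloor 45.25+0.5\right\rfloor
        + j\left\lfloor 45.25+0.5\right\rfloor = 45 + j45, \\
z_2 &=& \left\lfloor 0.5\right\rfloor
        + j\left\lfloor 64.5\right\rfloor = 0 + j64, \\
z_3 &=& \left\lfloor -45.25+0.5\right\rfloor
        + j\left\lfloor 45.25+0.5\right\rfloor = -45 + j45.
\end{eqnarray*}
Every component satisfies $\left|c_n\right|,\left|s_n\right|\leq 2^{C-2}$,
which fits comfortably within the $C$ bit twos complement range of
$-2^{C-1}$ to $2^{C-1}-1$.

One standard way of forming a complex product from three real multiplies,
offered here as an assumption about the arrangement and not as a description
of the generated Verilog, takes an input $u+jv$ and computes
\begin{eqnarray*}
p_1 &=& c_n u, \\
p_2 &=& s_n v, \\
p_3 &=& \left(c_n+s_n\right)\left(u+v\right), \\
\left(u+jv\right)z_n &=& \left(p_1-p_2\right) + j\left(p_3-p_1-p_2\right).
\end{eqnarray*}
In such an arrangement the sum $c_n+s_n$ is bounded by roughly
$\sqrt{2}\cdot 2^{C-2} < 2^{C-1}$, so it too fits within $C$ bits.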
 
For those wishing to understand this operation in more depth, I
would refer them to the literature on how a decimation in frequency FFT is
constructed.
 
\chapter{Operation}
 
The core is easy to use:
\begin{enumerate}
        \item Provide a system clock to the core on {\tt i\_clk}.
        \item Set the {\tt i\_rst} line high for at least one clock cycle
                before you intend to use the core.
        \item From the time of reset until the first sample pair is available
                on the IO ports, {\tt i\_rst} may be kept low, but the clock
                enable line {\tt i\_ce} must also be kept low.
        \item On the clock containing the first sample pair, {\tt i\_left}
                and {\tt i\_right}, set {\tt i\_ce} high.
        \item Ever after, any time a valid pair of samples is available to
                the input of the FFT, place the first sample of the pair
                on the {\tt i\_left} line, the second on the {\tt i\_right}
                line, and set {\tt i\_ce} high.
        \item At the first valid output, the FFT core will set the {\tt o\_sync}
                line high in addition to the output values {\tt o\_left}
                (the first of two), and {\tt o\_right} (the second of the two).
        \item Ever after, whenever {\tt i\_ce} is high, the FFT core will clock
                two samples in and two samples out.  On any valid first
                pair of samples coming out of the transform,
                {\tt o\_sync} will be high.  Otherwise {\tt o\_sync} will
                remain low.
\end{enumerate}
 
There are no special modes or states associated with this core.  If you wish
it to stop or pause, just turn off {\tt i\_ce}.  If you wish to flush the
core, just send zeros into the core.
 
\chapter{Registers}
 
Once built, the FFT routine has no capability for runtime configuration
or reconfiguration.  Therefore, this implementation maintains no user
configurable or readable registers.
 
This is a great advantage in many ways, simply because it greatly simplifies
the interface over other cores that are available out there.
 
\chapter{Clocks}
 
The FFT routines built by this core use one clock only.  The speed of this
clock will depend upon the speed your hardware is capable of.  If your data
rate is slower than your clock speed, just hold off on the {\tt i\_ce}
line as necessary so that every clock with the {\tt i\_ce} line high carries a
valid pair of samples.
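 
Since two samples are accepted on every enabled clock, the sample rate
$f_s$ follows from the clock rate $f_{\mbox{\tiny clk}}$ and the fraction
$\eta$ of clocks on which {\tt i\_ce} is high.  These symbols are introduced
here only for this illustration.
\begin{eqnarray*}
f_s &=& 2\,\eta\,f_{\mbox{\tiny clk}}
\end{eqnarray*}
With the 250~MHz clock mentioned in the preface and {\tt i\_ce} held high
($\eta=1$) this gives 500~million samples per second, while raising
{\tt i\_ce} on every other clock ($\eta=1/2$) gives 250~million.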
 
\chapter{IO Ports}
 
The FFT core presents a small set of IO ports to its external interface.
These ports are listed in Table~\ref{tbl:ioports}.
\begin{table}[htbp]
\begin{center}
\begin{portlist}
i\_clk & 1 & Input & The global clock driving the FFT. \\\hline
i\_rst & 1 & Input & An active high synchronous reset.\\\hline
i\_ce & 1 & Input & Clock Enable.  Set this high to clock data in and
                out.\\\hline
i\_left & $2N_i$ & Input & The first of two complex input samples.  Bits
                [$\left(2N_i-1\right)$:$N_i$] of this value are the real
                portion, whereas bits [$\left(N_i-1\right)$:0] represent the
                imaginary portion.  Both portions are in signed twos complement
                integer format.  The number of bits, $N_i$, is configurable.
                \\\hline
i\_right & $2N_i$ & Input & The second of two complex input samples.
                The format is the same as {\tt i\_left} above.\\\hline
o\_left & $2N_o$ & Output & The first of two complex output samples.
                The format is the same, save only that $N_o$ bits are
                used for each twos complement portion instead of $N_i$.\\\hline
o\_right & $2N_o$ & Output & The second of two complex output samples.
                The format is the same as for {\tt o\_left} above.\\\hline
o\_sync & 1 & Output & Signals the first output sample pair of any transform,
                zero otherwise.
                \\\hline
\end{portlist}
\caption{List of IO ports}\label{tbl:ioports}
\end{center}\end{table}
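 
As a worked example of the sample format, assume $N_i=8$ and an input sample
of $-3+j5$ (values chosen only for illustration).  Writing $r$ and $q$ for
the real and imaginary parts, symbols used only here, the word presented on
{\tt i\_left} is
\begin{eqnarray*}
w &=& \left(r \bmod 2^{N_i}\right) 2^{N_i} + \left(q \bmod 2^{N_i}\right) \\
  &=& 253\cdot 256 + 5 \\
  &=& 64773,
\end{eqnarray*}
or {\tt FD05} in hexadecimal, with {\tt FD} the twos complement form of $-3$
in bits [15:8] and {\tt 05} the imaginary part in bits [7:0].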
% Appendices
% Index
\end{document}
 
 

Line No.	Rev	Author	Line
1	10	dgisselq	`\documentclass{gqtekspec}`
2			`\project{Double Clocked FFT}`
3			`\title{Specification}`
4			`\author{Dan Gisselquist, Ph.D.}`
5	27	dgisselq	`\email{dgisselq (at) opencores.org}`
6	26	dgisselq	`\revision{Rev.~0.2}`
7	10	dgisselq	`\begin{document}`
8			`\pagestyle{gqtekspecplain}`
9			`\titlepage`
10			`\begin{license}`
11			`Copyright (C) \theyear\today, Gisselquist Technology, LLC`
12
13			`This project is free software (firmware): you can redistribute it and/or`
14			`modify it under the terms of the GNU General Public License as published`
15			`by the Free Software Foundation, either version 3 of the License, or (at`
16			`your option) any later version.`
17
18			`This program is distributed in the hope that it will be useful, but WITHOUT`
19			`ANY WARRANTY; without even the implied warranty of MERCHANTIBILITY or`
20			`FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License`
21			`for more details.`
22
23			`You should have received a copy of the GNU General Public License along`
24			`with this program. If not, see \hbox{<http://www.gnu.org/licenses/>} for a copy.`
25			`\end{license}`
26			`\begin{revisionhistory}`
27	26	dgisselq	`0.2 & 6/2/2015 & Gisselquist & Superficial formatting changes\\\hline`
28	11	dgisselq	`0.1 & 3/3/2015 & Gisselquist & First Draft \\\hline`
29	10	dgisselq	`\end{revisionhistory}`
30			`% Revision History`
31			`% Table of Contents, named Contents`
32			`\tableofcontents`
33			`\listoffigures`
34			`\listoftables`
35			`\begin{preface}`
36			`This FFT comes from my attempts to design and implement a signal processing`
37			`algorithm inside a generic FPGA, but only on a limited budget. As such,`
38			`I don't yet have the FPGA board I wish to place this algorithm onto, neither`
39			`do I have any expensive modeling or simulation capabilities. I'm using`
40			`Verilator for my modeling and simulation needs. This makes`
41			`using a vendor supplied IP core, such as an FFT, difficult if not impossible`
42			`to use.`
43
44			`My problem was made worse when I learned that the published maximum clock`
45			`speed for a device wasn't necessarily the maximum clock speed that I could`
46			`achieve. My design needed to process the incoming signal at 500~MHz to be`
47			`commercially viable. 500~MHz is not necessarily a clock speed`
48			`that can be easily achieved. 250~MHz, on the other hand, is much more within`
49			`the realm of possibility. Achieving a 500~MHz performance with a 250~MHz`
50			`clock, however, requires an FFT that accepts two samples per clock.`
51
52			`This, then, was and is the genesis of this project.`
53			`\end{preface}`
54
55			`\chapter{Introduction}`
56			`\pagenumbering{arabic}`
57			`\setcounter{page}{1}`
58
59			`The Double Clocked FFT project contains all of the software necessary to`
60			`create the IP to generate an arbitrary sized FFT that will clock two samples`
61			`in at each clock cycle, and after some pipeline delay it will clock two`
62			`samples out at every clock cycle.`
63
64			`The FFT generated by this approach is very configurable. By simple adjustment`
65			`of a command line parameter, the FFT may be made to be a forward FFT or an`
66			`inverse FFT. The number of bits processed, kept, and maintained by this`
67			`FFT are also configurable. Even the number of bits used for the twiddle`
68			`factors, or whether or not to bit reverse the outputs, are all configurable`
69			`parts to this FFT core.`
70
71			`These features set the Double Clocked FFT apart from the`
72			`other cores available on opencores.org.`
73
74			`For those who wish to get started right away, please download the package,`
75			`change into the {\tt sw} directory and run {\tt make}. There is no need to`
76			`run a configure script: {\tt fftgen} is completely portable C++. Then, once`
77			`built, go ahead and run {\tt fftgen} without any arguments. This will cause`
78			`{\tt fftgen} to print a usage statement to the screen. Review the usage`
79			`statement, and run {\tt fftgen} a second time with the arguments you need.`
80
81
82			`\chapter{Generation}`
83
84			`Creating a double clocked FFT core is as simple as running the program`
85			`{\tt fftgen}. The program will then create a series of Verilog files, as`
86			`well as {\tt .hex} files suitable for use with a \textdollar readmemh, and`
87			`place them into an {\tt ./fft-core/} directory that {\tt fftgen} will create.`
88			`Creating the core you want takes a touch of configuring.`
89			`Therefore, the following lists the arguments that can be given to`
90			`{\tt fftgen} to adjust the core that it builds:`
91			`\begin{itemize}`
92			`\item[\hbox{-f size}]`
93			`This specifies the size of the FFT core that {\tt fftgen} will build.`
94			`The size must be a power of two. The transform computed is, to within a`
95			`scale factor,`
96			`\begin{eqnarray*}`
97			`X\left[k\right] &=& \sum_{n=0}^{N-1} x\left[n\right]`
98			`e^{-j2\pi \frac{k}{N}n}`
99			`\end{eqnarray*}`
100
101			`\item[\hbox{-1}]`
102			`This specifies that the FFT will be an inverse FFT. Specifically,`
103			`it will calculate,`
104			`\begin{eqnarray*}`
105			`x\left[n\right] &=& \sum_{k=0}^{N-1} X\left[k\right] e^{j2\pi \frac{k}{N}n}`
106			`\end{eqnarray*}`
107			`\item[\hbox{-0}]`
108			`This specifies building a forward FFT. However, since this is the`
109			`default, this option is never necessary.`
110			`\item[\hbox{-s}]`
111			`This causes the core to skip the final bit reversal stage. The`
112			`outputs of the FFT will then come out in bit reversed order.`
113
114			`This option is useful in those cases where someone wishes to`
115			`multiply the coefficients coming out of an FFT by some product,`
116			`and then to inverse FFT the results. If the coefficients are also`
117			`applied in bit--reversed order, then both the FFT and IFFT may`
118			`skip their bit reversals.`
119	22	dgisselq
120			`Be aware, however, that doing this requires the bit reversed forward`
121			`transform to be followed by a bit reversed decimation in time approach`
122			`to the inverse transform. This software does not (yet) provide that`
123			`capability, so this option is of limited use for now.`
124	10	dgisselq	`\item[\hbox{-S}]`
125			`Include the final bit reversal stage. As this is also the default,`
126			`specifying the option should not be necessary.`
127			`\item[\hbox{-d DIR}]`
128			`Specifies the DIRectory in which to place the produced Verilog files. By`
129			`default, this will be the {\tt ./fft-core/} directory, but it can`
130			`be changed to any other directory as necessary.`
131			`\item[\hbox{-n bits}] Sets the number of input bits per sample. Given this`
132			`setting, each of the two samples clocked in at every clock cycle`
133			`will have this many bits for their real portion, and again this many`
134			`bits for their imaginary portion. Thus, the data input to the`
135			`FFT will be four times this many bits per clock.`
136			`\item[\hbox{-m bits}] This sets the maximum bit width of the output.`
137			`By default, the FFT will gain bits as they accumulate within`
138			`the FFT. Bits are accumulated at roughly one bit for every two stages.`
139			`However, if this value is set, bits are only accumulated up to this`
140			`maximum width. After this width, further accumulations are truncated.`
141			`\item[\hbox{-c bits}] The number of bits in each twiddle coefficient is given`
142			`by the number of bits input to that stage plus this extra number of`
143			`bits per coefficient. By increasing the number of bits per coefficient`
144			`above that of the input samples, truncation error is kept to the`
145			`original error found within the original samples.`
146	22	dgisselq	`\item[\hbox{-x bits}] Internally accumulated roundoff error can be a difficult`
147			`problem to solve. By using this option, you guarantee that the FFT`
148			`runs with an additional {\tt bits} bits, and only truncates down to`
149			`the necessary width at the end in order to minimize rounding`
150			`errors along the way.`
151			`\item[\hbox{-p nmpy}] This sets the number of hardware multiplies that the FFT`
152			`will consume. By default, the FFT does not use any hardware multiplies.`
153			`However, this can be expensive on the rest of the logic used by the`
154			`device. You can avoid this problem by allowing the FFT to use`
155			`hardware multiplies using this option. By default, the multiplies will`
156			`be used in the latter stages, so that they will be applied where`
157			`the bit width is the greatest.`
158	10	dgisselq	`\end{itemize}`
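
The options above can be collected into a single command line. The following is a minimal Python sketch, not part of the project, that only assembles the argument list from the flags listed above; the helper name {\tt fftgen\_args}, its keyword names, and the {\tt ./fftgen} path are assumptions made for illustration.

```python
def fftgen_args(size, inverse=False, skip_bit_reversal=False, out_dir=None,
                in_bits=None, max_bits=None, coef_extra_bits=None,
                extra_bits=None, n_multiplies=None):
    """Build an fftgen argument list (hypothetical helper, for illustration).

    Only the flags described in the list above are used: -f, -1, -s, -d,
    -n, -m, -c, -x and -p.  The defaults (-0 and -S) are simply omitted.
    """
    if size < 2 or size & (size - 1):
        raise ValueError("FFT size must be a power of two")
    args = ["./fftgen", "-f", str(size)]   # "./fftgen" path is an assumption
    if inverse:
        args.append("-1")                  # inverse FFT
    if skip_bit_reversal:
        args.append("-s")                  # outputs stay in bit reversed order
    for flag, value in (("-d", out_dir), ("-n", in_bits), ("-m", max_bits),
                        ("-c", coef_extra_bits), ("-x", extra_bits),
                        ("-p", n_multiplies)):
        if value is not None:
            args += [flag, str(value)]
    return args


# Example: a 2048 point forward FFT with 12 bit inputs, outputs capped at
# 18 bits, and 4 extra bits per twiddle coefficient.
print(" ".join(fftgen_args(2048, in_bits=12, max_bits=18, coef_extra_bits=4)))
```

The resulting list could be handed to Python's {\tt subprocess.run} once {\tt fftgen} has been built; the particular sizes and widths shown are example values only.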
159
160			`\chapter{Architecture}`
161
162			`When used as a component of a larger system, this core appears as a simple`
163			`black box such as the one shown in Fig.~\ref{fig:black-box}.`
164			`\begin{figure}\begin{center}`
165			`\begin{pspicture}(-2.1in,0)(2.1in,2in)`
166			`% \rput(0,0){\psframe(-2.1in,0)(2.1in,2in)}`
167			`\rput(0,0){\rput(0,0){\psframe[linewidth=2\pslinewidth](-0.75in,0)(0.75in,2in)}`
168			`\rput(0,1in){(I)FFT Core}`
169			`\rput[r](-1.6in,1.8in){\tt i\_clk}`
170			`\rput(-1.5in,1.8in){\psline{->}(0,0)(0.7in,0)}`
171			`\rput[r](-1.6in,1.5in){\tt i\_rst}`
172			`\rput(-1.5in,1.5in){\psline{->}(0,0)(0.7in,0)}`
173			`\rput[r](-1.6in,1.2in){\tt i\_ce}`
174			`\rput(-1.5in,1.2in){\psline{->}(0,0)(0.7in,0)}`
175			`\rput[r](-1.6in,0.6in){\tt i\_left}`
176			`\rput(-1.5in,0.6in){\psline{->}(0,0)(0.7in,0)}`
177			`\rput(-1.15in,0.6in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}`
178			`\rput[br](-1.2in,0.6in){\scalebox{0.75}{$2N_i$}}`
179			`\rput[r](-1.6in,0.3in){\tt i\_right}`
180			`\rput(-1.5in,0.3in){\psline{->}(0,0)(0.7in,0)}`
181			`\rput(-1.15in,0.3in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}`
182			`\rput[br](-1.2in,0.3in){\scalebox{0.75}{$2N_i$}}`
183			`%`
184			`\rput[l](1.6in,1.2in){\tt o\_sync}`
185			`\rput(0.8in,1.2in){\psline{->}(0,0)(0.7in,0)}`
186			`\rput[l](1.6in,0.6in){\tt o\_left}`
187			`\rput(0.8in,0.6in){\psline{->}(0,0)(0.7in,0)}`
188			`\rput(1.15in,0.6in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}`
189			`\rput[br](1.1in,0.6in){\scalebox{0.75}{$2N_o$}}`
190			`\rput[l](1.6in,0.3in){\tt o\_right}`
191			`\rput(0.8in,0.3in){\psline{->}(0,0)(0.7in,0)}`
192			`\rput(1.15in,0.3in){\psline(-0.05in,-0.05in)(0.05in,0.05in)}`
193			`\rput[br](1.1in,0.3in){\scalebox{0.75}{$2N_o$}}`
194			`}`
195			`\end{pspicture}`
196			`\caption{(I)FFT Black Box Diagram}\label{fig:black-box}`
197			`\end{center}\end{figure}`
198			`The interface`
199			`is simple: strobe the reset line, and every clock thereafter set the clock`
200			`enable line when data is valid on the left and right input ports. Likewise`
201			`for the outputs, when the {\tt o\_sync} line goes high the first pair of output`
202			`samples is available. Thereafter, one pair of samples will be available every clock`
203			`cycle that the {\tt i\_ce} line is high.`
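
As a worked example of the port widths in Fig.~\ref{fig:black-box}, assume that $N_i$ is the per-component input width set by {\tt -n} (the figure does not say so explicitly) and take a hypothetical 12-bit input. Each port then carries one real and one imaginary component:

```latex
\begin{eqnarray*}
\mbox{width of {\tt i\_left}} &=& 2N_i = 2\cdot 12 = 24\mbox{ bits},\\
\mbox{width of {\tt i\_right}} &=& 2N_i = 24\mbox{ bits},\\
\mbox{input bits per clock} &=& 4N_i = 48\mbox{ bits}.
\end{eqnarray*}
```

The total agrees with the description of the {\tt -n} option, which states that the data input is four times the sample width per clock.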
204
205			`Internal to the FFT, things are a touch more complex. Fig.~\ref{fig:white-box}`
206			`\begin{figure}\begin{center}`
207			`\begin{pspicture}(1.3in,-0.5in)(4.7in,5in)`
208			`% \rput(0,0){\psframe(0,-0.5in)(\textwidth,5.25in)}`
209	11	dgisselq	`\rput(0,0){\psframe[linewidth=2\pslinewidth](1.3in,-0.25in)(4.7in,5in)}`
210	10	dgisselq	`\rput(0,5in){%`
211			`\rput[r](1.95in,0.125in){\tiny\tt i\_left}`
212			`\rput[l](4.05in,0.125in){\tiny\tt i\_right}`
213			`\rput(2.0in,0){\psline{->}(0,0.25in)(0,0.0in)}`
214			`\rput(4.0in,0){\psline{->}(0,0.25in)(0,0.0in)}`
215			`}`
216			`\rput(2in,0){%`
217			`\rput(0,4.25in){\psframe(-0.5in,0)(0.5in,0.5in)%`
218			`\rput[r](-0.05in,0.675in){\tiny Left}`
219			`\rput(0.0in,0){\psline{->}(0,0.75in)(0,0.5in)}`
220			`\rput(0,0.25in){Evens, $N$}`
221			`\rput[r](-0.35in,-0.125in){\tiny Sync}`
222			`\rput[l](0.35in,-0.125in){\tiny Data}`
223			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
224			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
225			`\rput(0,3.5in){\psframe(-0.5in,0)(0.5in,0.5in)%`
226			`\rput(0,0.25in){Evens, $N/2$}`
227			`\rput[r](-0.35in,-0.125in){\tiny Sync}`
228			`\rput[l](0.35in,-0.125in){\tiny Data}`
229			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
230			`\rput( 0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
231			`% \rput(0,3in){\psframe(-0.5in,0)(0.5in,0.5in)%`
232			`% \rput(0,0.25in){Evens, $N$}}`
233			`\rput(0,2.25in){\psframe(-0.5in,0)(0.5in,0.5in)%`
234			`\rput(0,0.25in){Evens, $8$}`
235			`\rput[r](-0.35in,-0.125in){\tiny Sync}`
236			`\rput[l](0.35in,-0.125in){\tiny Data}`
237			`\rput[r](-0.35in,0.675in){\tiny Sync}`
238			`\rput[l](0.35in,0.675in){\tiny Data}`
239			`\rput(-0.3in,0.9in){$\vdots$}`
240			`\rput( 0.3in,0.9in){$\vdots$}`
241			`\rput(-0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}`
242			`\rput( 0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}`
243			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
244			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
245			`\rput(0,1.5in){\psframe(-0.5in,0)(0.5in,0.5in)%`
246			`\rput(0,0.25in){Qtrstage (Even)}`
247			`\rput[r](-0.35in,-0.125in){\tiny Sync}`
248			`\rput[lb](0.6in,-0.10in){\tiny Data}`
249			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.5in)(0.8in,-0.5in)}`
250			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.125in)(0.4in,-0.125in)(0.4in,-0.25in)}}`
251			`% \rput(0,0.75in){\psframe(-0.5in,0)(0.5in,0.5in)%`
252			`% \rput(0,0.25in){dblstage}}`
253			`% \rput(0,0in){\psframe(-0.5in,0)(0.5in,0.5in)%`
254			`% \rput(0,0.25in){Bit Reversal}}`
255			`}`
256			`\rput(4in,0){%`
257			`\rput(0,4.25in){\psframe(-0.5in,0)(0.5in,0.5in)%`
258			`\rput[l](0.05in,0.675in){\tiny Right}`
259			`\rput(0.0in,0){\psline{->}(0,0.75in)(0,0.5in)}`
260			`\rput(0,0.25in){Odds, $N$}`
261			`\rput[l](0.35in,-0.125in){\tiny Sync}`
262			`\rput[r](-0.35in,-0.125in){\tiny Data}`
263			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
264			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
265			`\rput(0,3.5in){\psframe(-0.5in,0)(0.5in,0.5in)%`
266			`\rput(0,0.25in){Odds, $N/2$}`
267			`\rput[l](0.35in,-0.125in){\tiny Sync}`
268			`\rput[r](-0.35in,-0.125in){\tiny Data}`
269			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
270			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
271			`% \rput(0,3in){\psframe(-0.5in,0)(0.5in,0.5in)%`
272			`% \rput(0,0.25in){Evens, $N$}}`
273			`\rput(0,2.25in){\psframe(-0.5in,0)(0.5in,0.5in)%`
274			`\rput(0,0.25in){Odds, $8$}`
275			`\rput[l](0.35in,0.675in){\tiny Sync}`
276			`\rput[r](-0.35in,0.675in){\tiny Data}`
277			`\rput(-0.3in,0.9in){$\vdots$}`
278			`\rput( 0.3in,0.9in){$\vdots$}`
279			`\rput[l](0.35in,-0.125in){\tiny Sync}`
280			`\rput[r](-0.35in,-0.125in){\tiny Data}`
281			`\rput(-0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}`
282			`\rput(0.3in,0.75in){\psline{->}(0,0)(0,-0.25in)}`
283			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
284			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
285			`\rput(0,1.5in){\psframe(-0.5in,0)(0.5in,0.5in)%`
286			`\rput(0,0.25in){Qtrstage (Odd)}`
287			`\rput[rb](-0.6in,-0.10in){\tiny Data}`
288			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
289			`\rput[t](0.3in,-0.3in){\tiny NC}`
290			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.125in)(-0.4in,-0.125in)(-0.4in,-0.25in)}`
291			`}`
292			`}`
293			`\rput(3in,0.75in){\psframe(-0.5in,0)(0.5in,0.5in)%`
294			`\rput(0,0.25in){Double Stage}`
295			`\rput[r](-0.35in,-0.125in){\tiny Sync}`
296			`\rput[l](0.35in,-0.125in){\tiny Right}`
297			`\rput[r](0.15in,-0.125in){\tiny Left}`
298			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
299			`\rput(0.2in,0){\psline{->}(0,0)(0,-0.25in)}`
300			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
301			`\rput(3in,0in){\psframe(-0.5in,0)(0.5in,0.5in)%`
302			`\rput(0,0.25in){Bit Reversal}`
303			`\rput[r](-0.35in,-0.125in){\tiny Sync}`
304			`\rput[l](0.35in,-0.125in){\tiny Right}`
305			`\rput[r](0.15in,-0.125in){\tiny Left}`
306			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
307			`\rput(0.2in,0){\psline{->}(0,0)(0,-0.25in)}`
308			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
309			`\rput(3in,-0.25in){\rput[r](-0.35in,-0.125in){\tiny\tt o\_sync}`
310			`\rput[l](0.35in,-0.125in){\tiny\tt o\_right}`
311			`\rput[r](0.15in,-0.125in){\tiny\tt o\_left}`
312			`\rput(-0.3in,0){\psline{->}(0,0)(0,-0.25in)}`
313			`\rput(0.2in,0){\psline{->}(0,0)(0,-0.25in)}`
314			`\rput(0.3in,0){\psline{->}(0,0)(0,-0.25in)}}`
315			`\end{pspicture}`
316			`\caption{Internal FFT Structure}\label{fig:white-box}`
317			`\end{center}\end{figure}`
318			`attempts to show some of this structure. As you can see from the figure, the`
319			`FFT itself is composed of a series of stages. These stages are split from the`
320			`beginning into an even stage and an odd stage. Further, they are numbered`
321			`according to the size of the FFT they represent. Therefore the first stage`
322			`is numbered $N$ and represents the first stage of an $N$ point FFT. The`
323			`second stage is labeled $N/2$, then $N/4$, and so on down to $8$. The`
324			`four sample stage and the two sample stage are different, however. These`
325			`two stages, representing three blocks on Fig.~\ref{fig:white-box}, can be`
326			`accomplished without any multiplies. Therefore they have been implemented`
327			`separately. Likewise all of the stages, save the double stage at the bottom,`
328			`operate on one data sample per clock. Only the last stage, prior to the`
329			`bit reversal stage, takes two data samples per clock as input, and outputs two`
330			`data samples per clock. Finally, the bit reversal stage acts as the last`
331			`piece of the structure.`
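
The chain of blocks described above can be listed mechanically for a given FFT size. The following Python sketch is an illustration only: the function name {\tt stage\_plan} is hypothetical, the labels are taken from Fig.~\ref{fig:white-box}, and a minimum size of 8 is assumed because the figure shows no smaller multiply stage.

```python
def stage_plan(n, bit_reversal=True):
    """List the blocks of an n point core as (label, size) pairs.

    The even and odd lanes each get one stage per size from n down to 8,
    followed by the multiplier-free four sample and two sample stages and,
    unless it is skipped, the final bit reversal.
    """
    if n < 8 or n & (n - 1):
        raise ValueError("this sketch assumes a power of two size of at least 8")
    plan = []
    size = n
    while size >= 8:
        plan.append(("Evens/Odds", size))    # one block in each lane
        size //= 2
    plan.append(("Qtrstage (Even/Odd)", 4))  # four sample stage, no multiplies
    plan.append(("Double Stage", 2))         # two samples in and out per clock
    if bit_reversal:
        plan.append(("Bit Reversal", n))
    return plan


for label, size in stage_plan(32):
    print(label, size)
```

Passing {\tt bit\_reversal=False} corresponds to a core generated with the {\tt -s} option.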
332
333			`Internal to each of the FFT stages is a butterfly and a complex multiply,`
334			`as shown in Fig.~\ref{fig:fftstage}.`
335			`\begin{figure}\begin{center}`
336	11	dgisselq	`\begin{pspicture}(-0.25in,-1.8in)(3.25in,4.25in)`
337			`% \rput(0,0){\psframe(0in,-2in)(3in,4.25in)}`
338			`\rput(0,0){\psframe[linewidth=2\pslinewidth](-0.25in,-1.55in)(3.25in,4.0in)}`
339			`\rput[r](1.625in,4.125in){\tt i\_data}`
340			`\rput(1.675in,3.75in){\psline{->}(0,0.5in)(0,0in)%`
341			`\psline{->}(0,0)(-0.2in,-0.25in)%`
342			`\psarc{->}{0.15in}{200}{340}}`
343	10	dgisselq	`\rput(0,2.75in){\rput(0,0){\psframe(0,0)(1.3in,0.25in)}`
344			`\rput(0,0){\psframe(0.1in,0)(0.2in,0.25in)}`
345			`\rput(0,0){\psframe(0.3in,0)(0.4in,0.25in)}`
346			`\rput(0,0){\psframe(0.5in,0)(0.6in,0.25in)}`
347			`\rput(0,0){\psframe(0.7in,0)(0.8in,0.25in)}`
348			`\rput(0,0){\psframe(0.9in,0)(1.0in,0.25in)}`
349			`\rput(0,0){\psframe(1.1in,0)(1.2in,0.25in)}`
350			`\rput(0,0){\psline{-}(0.7in,-0.05in)(1.1in,-0.25in)}`
351	11	dgisselq	`\rput(0,0){\psline{<-}(0.7in,0.3in)(1.5in,0.5in)(1.5in,0.75in)}}`
352	10	dgisselq	`\rput(1.85in,2.75in){\psline(0,0.75in)(0,-0.25in)}`
353	11	dgisselq	`\rput(0.6in,0.25in){\rput(0,0){\psframe[linewidth=2\pslinewidth](0,0)(2in,2.0in)}`
354	10	dgisselq	`\rput(0.50in,2in){\psline{->}(0,0.25in)(0,0in)}`
355			`\rput(1.25in,2in){\psline{->}(0,0.25in)(0,0in)}`
356			`\rput(1.75in,2in){\psline{->}(0,0.25in)(0,0in)}`
357			`\rput(0.5in,0){%`
358			`\rput(0in,0){\psline{->}(0,2.0in)(0,1.1in)}`
359			`\rput(0in,0){\psline{->}(0,1.75in)(0.65in,1.1in)}`
360			`\rput(-0.1in,1.1in){$+$}`
361			`\rput(0in,1.0in){$\bigoplus$}`
362			`\rput(0in,0){\psline{->}(0,0.9in)(0,0.75in)}`
363			`\rput(0in,0.5in){\psframe(-0.45in,-0.25in)(0.45in,0.25in)}`
364			`\rput(0in,0.5in){\parbox{0.8in}{Delay, and\\shift by $C-2$}}`
365			`\rput(0in,0){\psline{->}(0,0.25in)(0,0.0in)}}`
366			`\rput(1.25in,0){%`
367			`\rput(0in,0){\psline{->}(0,2.0in)(0,1.1in)}`
368			`\rput(0in,0){\psline{->}(0,1.75in)(-0.65in,1.1in)}`
369			`\rput(0.1in,1.1in){$-$}`
370			`\rput(0in,1in){$\bigoplus$}`
371			`\rput(0in,0){\psline{->}(0,0.9in)(0,0.6in)}`
372			`\rput(0in,0.5in){$\bigotimes$}`
373			`\rput(0in,0){\psline{->}(0,0.4in)(0,0.0in)}}`
374			`\rput(1.75in,0){%`
375			`\rput(0,0){\psline{->}(0,2.0in)(0,0.5in)(-0.4in,0.5in)}}`
376	11	dgisselq	`\rput(0.50in,-0.25in){\psline{->}(0,0.25in)(0,-1.05in)}`
377			`\rput(1.25in,-0.25in){\psline{-}(0,0.25in)(0,0in)}}`
378			`\rput*[l](2.0in,0.5in){DIF Butterfly}`
379			`\rput*[lb](1.95in,2.5in){Coefficient memory}`
380	10	dgisselq	`% \rput(0,0){\psframe(1.3in,-0.25in)(4.7in,5in)}`
381	11	dgisselq	`\rput(1.7in,-0.5in){\rput(0,0){\psframe(0,0)(1.3in,0.25in)}`
382	10	dgisselq	`\rput(0,0){\psframe(0.1in,0)(0.2in,0.25in)}`
383			`\rput(0,0){\psframe(0.3in,0)(0.4in,0.25in)}`
384			`\rput(0,0){\psframe(0.5in,0)(0.6in,0.25in)}`
385			`\rput(0,0){\psframe(0.7in,0)(0.8in,0.25in)}`
386			`\rput(0,0){\psframe(0.9in,0)(1.0in,0.25in)}`
387			`\rput(0,0){\psframe(1.1in,0)(1.2in,0.25in)}`
388	11	dgisselq	`\rput(0,0){\psline{<-}(0.7in,0.30in)(0.15in,0.5in)}`
389			`\rput(0,0){\psline{->}(0.7in,-0.05in)(-0.2in,-0.3in)(-0.2in,-0.55in)}}`
390			`\rput(1.3in,-1.3in){\psline{->}(-0.2in,0.25in)(0,0)}`
391			`\rput(1.3in,-1.3in){\psarcn{->}{0.15in}{150}{30}}`
392			`\rput(1.3in,-1.3in){\psline{->}(0,0)(0,-0.5in)}`
393			`\rput[l](1.35in,-1.675in){\tt o\_data}`
394	10	dgisselq	`\end{pspicture}`
395	11	dgisselq	`\caption{A Single FFT Stage, with Butterfly}\label{fig:fftstage}`
396	10	dgisselq	`\end{center}\end{figure}`
397			`These FFT stages are really no different than any other decimation in`
398			`frequency FFT, save only that the coefficients are alternated between the`
399			`two stages. That is, the even stages get all the even coefficients, and`
400			`the odd stages get all of the odd coefficients.`
401			`Internally, each stage spends the first $N/4$ clocks storing its inputs`
402			`into memory, and then the next $N/4$ clocks pairing a stored input with`
403			`a single external input, so that both values become inputs to the butterfly.`
404			`Likewise, the butterfly coefficient is read from a small ROM table.`
405
406			`One trick to making the FFT stage work successfully is synchronization. Since`
407	22	dgisselq	`the shift and add multiplies create a delay of (roughly) one clock cycle per`
408			`bit of input, there is a significant pipeline delay from the input to the`
409			`output of the butterfly routine. To match this delay, the FFT stage places a`
410	10	dgisselq	`synchronization pulse into the butterfly. When this synchronization pulse`
411			`comes out of the butterfly, the values of the butterfly then match the`
412			`first sample out of the stage. The next synchronization problem comes from`
413			`the fact that the butterflies operate on two samples at a time, whereas the`
414			`FFT stage operates on a single sample at a time. This means that half the`
415			`time the butterfly output will be invalid. To keep things aligned, and to`
416			`avoid the invalid data half, a counter is started by the synchronization pulse`
417			`coming out of the butterfly in order to keep track. Using this counter and`
418			`once the butterfly produces the first sync pulse, the next $N/4$ clock cycles`
419			`will produce valid butterfly outputs. For these clock cycles, the left or`
420			`first output is sent immediately to the next FFT stage, whereas the right`
421			`or second output is saved into memory. Once these cycles are complete, the`
422			`butterfly outputs will be invalid for the next $N/4$ clock cycles. During`
423			`these invalid clock cycles, the FFT stage outputs data that had been stored`
424			`in memory. In this fashion, data is always valid coming out of each FFT`
425			`stage once the initial synchronization pulse goes high.`
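
The store, pair, and replay schedule described above can be modeled in a few lines of Python. This is a behavioral sketch under simplifying assumptions, not the Verilog: it ignores the butterfly's pipeline delay and the synchronization pulse, uses floating point rather than fixed point, and takes the twiddle coefficients from the caller. The name {\tt dif\_stage} is hypothetical.

```python
def dif_stage(samples, block, twiddles):
    """Model one single-sample-per-clock DIF stage, ignoring pipeline delay.

    samples  : input stream, one value per enabled clock
    block    : samples per block on this lane (N/2 for a stage labeled N)
    twiddles : block // 2 complex coefficients, one per butterfly
    """
    half = block // 2
    in_mem = [0] * half       # inputs stored during the first half of a block
    out_mem = [None] * half   # "right" butterfly outputs waiting to be sent
    out = []
    for clock, x in enumerate(samples):
        i = clock % block
        if i < half:
            # Butterfly output is invalid here: replay what was saved during
            # the previous block (nothing on the very first block), and store
            # the incoming sample for later pairing.
            if out_mem[i] is not None:
                out.append(out_mem[i])
            in_mem[i] = x
        else:
            # Pair a stored input with the live input.  The "left" output is
            # sent on at once; the "right" output is saved into memory.
            k = i - half
            a, b = in_mem[k], x
            out.append(a + b)
            out_mem[k] = (a - b) * twiddles[k]
    return out


# One block of four samples followed by a block of zeros to flush the stage.
print(dif_stage([1, 2, 3, 4, 0, 0, 0, 0], 4, [1, -1j]))
```

In this model the sums appear first, the twiddled differences follow during the next half block, and a trailing block of zeros is needed to push the last stored values out.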
426
427			`The complex multiply itself, internal to the butterfly routine, is`
428			`formed from three very simple shift and add multiplies, whose output is`
429	22	dgisselq	`then transformed into a single complex output, although there is a command`
430			`line option to use hardware multiplies instead. To avoid overflow, the`
431	10	dgisselq	`complex coefficients, $z_n$, for these multiplies are given by,`
432			`\begin{eqnarray}`
433			`z_n &=& c_n + js_n,\mbox{ where} \\`
434			`c_n &=& \left\lfloor 2^{C-2}\cos\left(2\pi \frac{n}{N}\right)+\frac{1}{2}\right\rfloor,\\`
435			`s_n &=& \left\lfloor 2^{C-2}\sin\left(2\pi \frac{n}{N}\right)+\frac{1}{2}\right\rfloor\mbox{, and}`
436			`\end{eqnarray}`
437			`$C$ is the number of bits allocated to the coefficient.`
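
The coefficient formulas above translate directly into code. The Python sketch below computes $c_n$ and $s_n$ as given, and also shows one well-known way to form a complex product from three real multiplies. The text above does not say which three-multiply arrangement the core uses, so {\tt cmul3} is an assumption for illustration, as are both function names.

```python
import math


def twiddle(n, size, cbits):
    """Return (c_n, s_n) for a size point FFT with cbits-bit coefficients."""
    scale = 2 ** (cbits - 2)
    angle = 2 * math.pi * n / size
    c = math.floor(scale * math.cos(angle) + 0.5)
    s = math.floor(scale * math.sin(angle) + 0.5)
    return c, s


def cmul3(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using three real multiplies.

    This is a textbook arrangement, not necessarily the one in the core.
    """
    p1 = ar * br
    p2 = ai * bi
    p3 = (ar + ai) * (br + bi)
    return p1 - p2, p3 - p1 - p2


# With C = 8 bits the largest coefficient magnitude is 2**(C-2) = 64, which
# leaves headroom inside a signed 8-bit word.
for n in range(4):
    print(n, twiddle(n, 8, 8))
```

Scaling by $2^{C-2}$ rather than by the full range of a $C$-bit word is what leaves this headroom.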
438
439			`For those wishing to understand this operation in more depth, I`
440			`would refer them to the literature on how a decimation in frequency FFT is`
441			`constructed.`
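
As a starting point, here is a minimal floating point sketch in Python of a radix-2 decimation in frequency FFT. It is a textbook formulation rather than a model of this core's pipeline; the function names are chosen for illustration. It does show why a bit reversal stage is needed: the results come out in bit reversed order.

```python
import cmath


def bit_reverse(i, nbits):
    """Reverse the lowest nbits bits of i."""
    r = 0
    for _ in range(nbits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def dif_fft(x):
    """Radix-2 DIF FFT; returns the spectrum in bit reversed order."""
    x = list(x)
    n = len(x)
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for k in range(span):
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                a, b = x[start + k], x[start + k + span]
                x[start + k] = a + b               # sum passes straight through
                x[start + k + span] = (a - b) * w  # difference gets the twiddle
        span //= 2
    return x


def dft(x):
    """Direct evaluation of the forward transform, for comparison."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


x = [1, 2, 3, 4, 0, -1, -2, -3]
y = dif_fft(x)
natural = [y[bit_reverse(k, 3)] for k in range(8)]   # undo the bit reversal
print(max(abs(a - b) for a, b in zip(natural, dft(x))))
```

The printed value is the largest difference between the reordered DIF result and the direct transform, and should be at the level of floating point rounding.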
442
443			`\chapter{Operation}`
444
445			`The core is straightforward to use:`
446			`\begin{enumerate}`
447			`\item Provide a system clock to the core on {\tt i\_clk}.`
448			`\item Set the {\tt i\_rst} line high for at least one clock cycle`
449			`before you intend to use the core.`
450			`\item From the time of reset until the first sample pair is available`
451			`on the IO ports, {\tt i\_rst} may be kept low, but the clock`
452			`enable line {\tt i\_ce} must also be kept low.`
453			`\item On the clock containing the first sample pair, {\tt i\_left}`
454			`and {\tt i\_right}, set {\tt i\_ce} high.`
455			`\item Ever after, any time a valid pair of samples is available to`
456			`the input of the FFT, place the first sample of the pair`
457			`on the {\tt i\_left} line, the second on the {\tt i\_right}`
458			`line, and set {\tt i\_ce} high.`
459			`\item At the first valid output, the FFT core will set {\tt o\_sync}`
460			`line high in addition to the output values {\tt o\_left}`
461			`(the first of two), and {\tt o\_right} (the second of the two).`
462			`\item Ever after, whenever {\tt i\_ce} is high, the FFT core will clock`
463	11	dgisselq	`two samples in and two samples out. On any valid first`
464	10	dgisselq	`pair of samples coming out of the transform,`
465			`{\tt o\_sync} will be high. Otherwise {\tt o\_sync} will`
466			`remain low.`
467			`\end{enumerate}`
468
469			`There are no special modes or states associated with this core. If you wish`
470			`it to stop or pause, just turn off {\tt i\_ce}. If you wish to flush the`
471			`core, just send zeros into the core.`
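
The pairing of samples onto the two ports, and the position of the {\tt o\_sync} pulse, can be illustrated with a short Python sketch. It models only the ordering described in the steps above, not the core's latency or arithmetic, and both function names are hypothetical.

```python
def input_pairs(samples):
    """Yield (i_left, i_right) pairs: first sample of each pair on i_left."""
    if len(samples) % 2:
        raise ValueError("samples must come in pairs")
    for i in range(0, len(samples), 2):
        yield samples[i], samples[i + 1]


def mark_sync(outputs, size):
    """Yield (o_sync, o_left, o_right) for a stream of output samples.

    A size point transform leaves the core as size // 2 pairs, and o_sync
    is high only on the first pair of each transform.
    """
    for clock, (left, right) in enumerate(input_pairs(outputs)):
        yield clock % (size // 2) == 0, left, right


# Two back-to-back 4 point frames, using sample indices as stand-in data.
for sync, left, right in mark_sync(list(range(8)), 4):
    print(int(sync), left, right)
```

Clocks on which {\tt i\_ce} is low are simply absent from this model, since the core neither accepts nor produces data on them.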
472
473			`\chapter{Registers}`
474
475			`Once built, the FFT routine has no capability for runtime configuration`
476			`or reconfiguration. Therefore, this implementation maintains no user`
477			`configurable or readable registers.`
478
479			`This is an advantage, because it simplifies`
480			`the interface relative to other available cores.`
481
482			`\chapter{Clocks}`
483
484			`The FFT routines built by this core use one clock only. The speed of this`
485			`clock will depend upon the speed your hardware is capable of. If your data`
486			`rate is slower than your clock speed, just hold off on the {\tt i\_ce}`
487			`line as necessary so that every clock with the {\tt i\_ce} line high is a`
488			`valid sample.`
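
As a worked relation, let $d$ be the fraction of clock cycles on which {\tt i\_ce} is high; $d$ and $f_s$ are symbols introduced here for illustration. Since two samples are accepted on every enabled clock, the sample rate $f_s$ is:

```latex
\begin{eqnarray*}
f_s &=& 2\, d\, f_{\mbox{\scriptsize clk}}
\end{eqnarray*}
```

With $d=1$ and a 250~MHz clock this gives the 500~MHz sample rate mentioned in the preface; with {\tt i\_ce} high on every other clock ($d=1/2$), the same clock would carry 250~MHz of samples.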
489
490			`\chapter{IO Ports}`
491
492			`The FFT core presents a small set of IO ports to its external interface.`
493			`These ports are listed in Table~\ref{tbl:ioports}.`
494			`\begin{table}[htbp]`
495			`\begin{center}`
496			`\begin{portlist}`
497			`i\_clk & 1 & Input & The global clock driving the FFT. \\\hline`
498			`i\_rst & 1 & Input & An active high synchronous reset.\\\hline`
499			`i\_ce & 1 & Input & Clock Enable. Set this high to clock data in and`
500			`out.\\\hline`
