\chapter{Performance and resource usage}
This Modular Simultaneous Exponentiation IP core is designed to speed up modular simultaneous exponentiations on embedded systems.
On embedded processors, software implementations, even with specialized libraries such as GMP\footnote{GNU Multiple Precision Arithmetic Library -- Project website: \url{http://gmplib.org/}}, demand considerable CPU time when large operands are used.
Practical tests of this core have shown a significant speed-up compared to software computations. For $n=1536$ and $t=1024$, the hardware is about 70 times faster than a GMP-based implementation running under embedded Linux on a 100 MHz, 32-bit MicroBlaze processor.\\

For the multiplier, execution time is given by~(\ref{eq:Tmult}), where $\tau_c$ is determined by the core operating
frequency. Since the maximum frequency is strongly influenced by the latency of the critical path, we can expect to
achieve higher frequencies for smaller stage widths. This trend is seen in Figure~\ref{fig:Virtex6exctime} for
different operand lengths; the results are taken from the static timing analysis performed during synthesis. The minimum
execution time in this graph is found where the maximum operating frequency of the core first reaches the maximum
frequency of the FPGA in use. Beyond that point, using a smaller stage width has no further benefit, because the
frequency cannot rise any further while the number of clock cycles needed to complete a multiplication keeps increasing.
It can also be seen that splitting the pipeline has no considerable effect on the performance of the core.
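As a rough illustration of the trade-off between clock frequency and cycle count described above, assume that the execution time is simply the number of clock cycles $N_c$ (a symbol introduced here only for this example) multiplied by $\tau_c = 1/f_{max}$. Taking the Virtex 6 synthesis results for $n=512$ tabulated later in this chapter, stage widths of 64 and 4 bits would then give
\begin{align*}
T_m &= N_c\cdot\tau_c = \frac{1030}{64{,}91\ \text{MHz}} \approx 15{,}87\ \mu\text{s} && \text{(64-bit stages)}\\
T_m &= N_c\cdot\tau_c = \frac{1150}{395{,}57\ \text{MHz}} \approx 2{,}91\ \mu\text{s} && \text{(4-bit stages)}
\end{align*}
so the roughly 12\% increase in cycle count is far outweighed by the approximately sixfold higher clock frequency.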

\begin{figure}[H]
\centering
\includegraphics[trim=2cm 2cm 2cm 2cm, width=16cm]{pictures/Virtex6_stagewidth.pdf}
\caption{Example of multiplication execution time as a function of the stage width for a Virtex 6 FPGA.}
\label{fig:Virtex6exctime}
\end{figure}

In general, smaller stage widths result in shorter execution times. However, using more stages means that more
flip-flops are needed, and thus more resources are used. A balance must be found between execution time and resource
usage. Currently, the core's operating frequency is the same as the bus frequency of the embedded processor. For optimal
operation of the core, the stage width must be chosen so that the maximum frequency reported by synthesis is equal to or
just above the bus frequency.
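As a sketch of this selection rule, suppose the bus runs at 100 MHz (an assumed value, chosen to match the MicroBlaze clock quoted at the start of this chapter; it is not a requirement of the core). Among the Virtex 6 results for $n=512$ tabulated below, a 64-bit stage width gives $f_{max}=64{,}91$ MHz, which is too slow, whereas a 16-bit stage width gives $f_{max}=199{,}96$ MHz and is the widest of the listed options that meets the bus frequency. Running at the bus clock rather than at $f_{max}$, one multiplication with 16-bit stages would then take about
\begin{align*}
T_m \approx \frac{1054}{100\ \text{MHz}} = 10{,}54\ \mu\text{s}.
\end{align*}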

In the tables below, resource usage and timing results are shown for different operand lengths and FPGAs. As a rule of thumb, the number of flip-flops is given by~(\ref{eq:ff}),
\begin{align}\label{eq:ff}
5+2\cdot n+6\cdot\frac{n}{s}+\lceil\log_{2}(n)\rceil+\left\lceil\log_{2}\left(\frac{n}{s}\right)\right\rceil
\end{align}
where $s$ is the stage width.
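For example, evaluating~(\ref{eq:ff}) for $n=512$ and $s=64$ gives
\begin{align*}
&5 + 2\cdot 512 + 6\cdot\frac{512}{64} + \lceil\log_{2}(512)\rceil + \left\lceil\log_{2}\left(\frac{512}{64}\right)\right\rceil \\
&\quad = 5 + 1024 + 48 + 9 + 3 = 1089,
\end{align*}
which agrees with the flip-flop count reported for this configuration in the tables below. Reducing the stage width to $s=4$ raises the third and fifth terms to $768$ and $7$, for a total of $1813$ flip-flops.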

The number of LUTs is almost completely determined by $n$ and the number of LUT inputs. A pre-synthesis estimate can be made with~(\ref{eq:lut4}) and~(\ref{eq:lut6}).
\begin{align}
8\cdot n \hspace{1cm}\text{for 4-input LUTs} \label{eq:lut4}\\
6\cdot n \hspace{1cm}\text{for 6-input LUTs} \label{eq:lut6}
\end{align}
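As a quick check for $n=512$ with 64-bit stages, the synthesis results tabulated below report 4124 LUTs on the Spartan 3 and Virtex 4 devices and 3094 LUTs on the Virtex 6. Assuming that the first two map to 4-input LUTs and the Virtex 6 to 6-input LUTs, the estimates lie within about 1\% of these figures:
\begin{align*}
8\cdot 512 &= 4096 && \text{(4-input LUTs)}\\
6\cdot 512 &= 3072 && \text{(6-input LUTs)}
\end{align*}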
\newline

Results for a Virtex 6 device xc6vlx240t-1ff1156, speed grade -1\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Virtex 6'
\begin{tabular}{rcccccccccccc}
\hline
\hline
$n$ & \multicolumn{3}{c}{512} & & \multicolumn{3}{c}{1024} & & \multicolumn{3}{c}{2048} & [bit] \bigstrut[t]\\
Stage width $s$ & 64 & 16 & 4 & & 64 & 16 & 4 & & 64 & 16 & 4 & [bit] \bigstrut[b]\\
\hline
$f_{max}$ & 64,91 & 199,96 & 395,57 & & 64,91 & 199,66 & 358,62 & & 94,91 & 199,96 & 358,62 & [MHz] \bigstrut[t]\\
$T_m$ @ $f_{max}$ & 15,87 & 5,27 & 2,91 & & 31,77 & 10,55 & 9,63 & & 63,57 & 21,11 & 12,84 & [$\mu$s] \\
Cycles & 1030 & 1054 & 1150 & & 2062 & 2110 & 3454 & & 4126 & 4222 & 4606 & [cycles] \\
\textbf{Resources} & & & & & & & & & & & & \\
Flip-flops & 1089 & 1235 & 1813 & & 2163 & 2453 & 5401 & & 4309 & 4887 & 7193 & \\
LUTs & 3094 & 3096 & 3102 & & 6169 & 6171 & 9252 & & 12315 & 12318 & 12324 & \bigstrut[b]\\
\hline
\hline
\end{tabular}%
\vspace{1cm}
\newline
Results for a Spartan 3 device xc3s1000-5fg320, speed grade -5\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Spartan 3'
\begin{tabular}{rccccccccc}
\hline
\hline
$n$ & \multicolumn{3}{c}{256} & & \multicolumn{4}{c}{512} & [bit] \bigstrut[t]\\
Stage width $s$ & 32 & 8 & 2 & & 64 & 32 & 8 & 2 & [bit] \bigstrut[b]\\
\hline
$f_{max}$ & 21,49 & 69,30 & 127,32 & & 11,36 & 21,49 & 69,30 & 127,32 & [MHz] \bigstrut[t]\\
$T_m$ @ $f_{max}$ & 24,1 & 7,82 & 5,01 & & 90,7 & 48,29 & 15,67 & 10,04 & [$\mu$s] \\
Cycles & 518 & 542 & 638 & & 1030 & 1038 & 1086 & 1278 & [cycles] \\
\textbf{Resources} & & & & & & & & & \\
Flip-flops & 576 & 722 & 1300 & & 1089 & 1138 & 1428 & 2582 & \\
LUTs & 2072 & 2074 & 2079 & & 4124 & 4126 & 4128 & 4135 & \bigstrut[b]\\
\hline
\hline
\end{tabular}%
\vspace{1cm}
\newline
Results for a Virtex 4 device xc4vlx200-11ff1513, speed grade -11\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Virtex 4'
\begin{tabular}{rcccccccccc}
\hline
\hline
$n$ & \multicolumn{4}{c}{512} & & \multicolumn{4}{c}{1024} & [bit] \bigstrut[t]\\
Stage width $s$ & 64 & 32 & 8 & 2 & & 128 & 32 & 8 & 2 & [bit] \bigstrut[b]\\
\hline
$f_{max}$ & 22,83 & 43,05 & 138,31 & 246,98 & & 11,77 & 43,05 & 138,31 & 246,98 & [MHz] \bigstrut[t]\\
$T_m$ @ $f_{max}$ & 45,12 & 24,11 & 7,85 & 5,17 & & 87,5 & 24,5 & 8,31 & 6,21 & [$\mu$s] \\
Cycles & 1030 & 1038 & 1086 & 1278 & & 1030 & 1054 & 1150 & 1534 & [cycles] \\
\textbf{Resources} & & & & & & & & & & \\
Flip-flops & 1089 & 1138 & 1428 & 2582 & & 2114 & 2260 & 2838 & 5144 & \\
LUTs & 4124 & 4126 & 4128 & 4135 & & 8225 & 8230 & 8234 & 8238 & \bigstrut[b]\\
\hline
\hline
\end{tabular}%