\chapter{Performance and resource usage}
This Modular Simultaneous Exponentiation IP core is designed to speed up modular simultaneous exponentiations on embedded systems.
On embedded processors, software implementations (even with specialized libraries like GMP\footnote{GNU Multiple Precision Arithmetic Library -- Project website: \url{http://gmplib.org/}}) demand much CPU time when large operands are used.
Practical tests of this core have shown a significant speed-up compared to software computations. For $n=1536$ and $t=1024$, the hardware is about 70 times faster than a GMP-based implementation (running under embedded Linux) on a 100 MHz, 32-bit MicroBlaze processor.

For the multiplier, execution time is given by~(\ref{eq:Tmult}), where $\tau_c$ is defined by the core operating
frequency. Since the maximum frequency is highly influenced by the latency in the critical path, we can expect to
achieve higher frequencies for shorter stage lengths. This trend is seen in Figure~\ref{fig:Virtex6exctime} for
different operand lengths, using results taken from the static timing analysis during synthesis. A minimum execution
time in this graph is found when the maximum operating frequency of the core first reaches the maximum frequency of the
FPGA in use. Beyond that point, using a smaller stage width no longer helps, because the frequency cannot
rise any further while the number of clock cycles needed to complete a multiplication increases. Another remark is
that splitting the pipeline has no considerable effect on the performance of the core.
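As a rough consistency check, and assuming that the execution time is simply the number of clock cycles multiplied by the clock period, the Virtex~6 synthesis results tabulated later in this chapter for $n=512$ and a stage width of 16 (1054 cycles at $f_{max}=199{,}96$~MHz) give
\begin{align*}
T_m = \frac{1054}{199{,}96~\text{MHz}} \approx 5{,}27~\mu\text{s},
\end{align*}
which agrees with the tabulated value of $T_m@f_{max}$.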
\begin{figure}[H]
\centering
\includegraphics[trim=2cm 2cm 2cm 2cm, width=16cm]{pictures/Virtex6_stagewidth.pdf}
\caption{Example of multiplication execution time as a function of the stage width for a Virtex 6 FPGA.}
\label{fig:Virtex6exctime}
\end{figure}

In general, shorter stage lengths result in smaller execution times. However, using more stages implies that more
flip-flops will be needed, thus more resources are used. A balance must be found between execution time and resources.
Currently, the core's operating frequency is the same as the bus frequency of the embedded processor. For optimal
operation of the core, the stage width must be chosen so that the maximum frequency given in synthesis is just above or
equal to the bus frequency.
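For illustration, assume a 100~MHz bus; this figure is borrowed from the MicroBlaze clock quoted earlier and is only an assumption, since the actual bus frequency depends on the system. In the Virtex~6 results for $n=512$, a stage width of 64 gives $f_{max}=64{,}91$~MHz, which is too slow, whereas a stage width of 16 gives $f_{max}=199{,}96$~MHz and therefore meets the bus frequency. Of the tabulated stage widths, 16 is thus the widest that qualifies (intermediate widths are not tabulated and might also qualify). Assuming the execution time is the cycle count times the clock period, a multiplication at the bus clock would then take approximately
\begin{align*}
T_m \approx 1054 \cdot 10~\text{ns} = 10{,}54~\mu\text{s}.
\end{align*}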

In the tables below, resource usage and timing results are shown for different operand lengths and FPGAs. As a rule of thumb, the number of flip-flops is given by~(\ref{eq:ff}),
\begin{align}\label{eq:ff}
5+2\cdot n+6\cdot\frac{n}{s}+\left\lceil\log_{2}(n)\right\rceil+\left\lceil\log_{2}\left(\frac{n}{s}\right)\right\rceil
\end{align}
where $s$ is the stage width.
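As a worked example of~(\ref{eq:ff}), taking $n=512$ and $s=16$ gives
\begin{align*}
5+2\cdot 512+6\cdot\frac{512}{16}+\lceil\log_{2}(512)\rceil+\lceil\log_{2}(32)\rceil = 5+1024+192+9+5 = 1235,
\end{align*}
which matches the flip-flop count reported for this configuration in the Virtex~6 results.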

The number of LUTs is almost completely determined by $n$ and the number of LUT inputs. A pre-synthesis estimate can be made with~(\ref{eq:lut4}) and~(\ref{eq:lut6}).
\begin{align}
8\cdot n \hspace{1cm}\text{for 4-input LUTs} \label{eq:lut4}\\
6\cdot n \hspace{1cm}\text{for 6-input LUTs} \label{eq:lut6}
\end{align}
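For example, with $n=512$ these estimates give
\begin{align*}
8\cdot 512 &= 4096 && \text{(4-input LUTs)}\\
6\cdot 512 &= 3072 && \text{(6-input LUTs)}
\end{align*}
which is within about 1\% of the 4124 LUTs reported below for the Virtex~4 and Spartan~3 devices (4-input LUTs) and of the 3094 LUTs reported for the Virtex~6 device (6-input LUTs), each at a stage width of 64.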
\newline

Results for a Virtex 6 device xc6vlx240t-1ff1156, speed grade -1\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Virtex 6'
\begin{tabular}{rcccccccccccc}
\hline
\hline
$n$ & \multicolumn{3}{c}{512} & & \multicolumn{3}{c}{1024} & & \multicolumn{3}{c}{2048} & [bit] \bigstrut[t]\\
Stage width & 64 & 16 & 4 & & 64 & 16 & 4 & & 64 & 16 & 4 & [bit] \bigstrut[b]\\
\hline
$f_{max}$ & 64,91 & 199,96 & 395,57 & & 64,91 & 199,96 & 358,62 & & 64,91 & 199,96 & 358,62 & [MHz] \bigstrut[t]\\
$T_m@f_{max}$ & 15,87 & 5,27 & 2,91 & & 31,77 & 10,55 & 9,63 & & 63,57 & 21,11 & 12,84 & [$\mu$s] \\
Cycles & 1030 & 1054 & 1150 & & 2062 & 2110 & 3454 & & 4126 & 4222 & 4606 & [cycles] \\
\textbf{Resources} & & & & & & & & & & & & \\
Flip-flops & 1089 & 1235 & 1813 & & 2163 & 2453 & 5401 & & 4309 & 4887 & 7193 & \\
LUTs & 3094 & 3096 & 3102 & & 6169 & 6171 & 9252 & & 12315 & 12318 & 12324 & \bigstrut[b]\\
\hline
\hline
\end{tabular}%
\vspace{1cm}
\newline
Results for a Spartan 3 device xc3s1000-5fg320, speed grade -5\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Spartan 3'
\begin{tabular}{rccccccccc}
\hline
\hline
$n$ & \multicolumn{3}{c}{256} & & \multicolumn{4}{c}{512} & [bit] \bigstrut[t]\\
Stage width & 32 & 8 & 2 & & 64 & 32 & 8 & 2 & [bit] \bigstrut[b]\\
\hline
$f_{max}$ & 21,49 & 69,30 & 127,32 & & 11,36 & 21,49 & 69,30 & 127,32 & [MHz] \bigstrut[t]\\
$T_m@f_{max}$ & 24,1 & 7,82 & 5,01 & & 90,7 & 48,29 & 15,67 & 10,04 & [$\mu$s] \\
Cycles & 518 & 542 & 638 & & 1030 & 1038 & 1086 & 1278 & [cycles] \\
\textbf{Resources} & & & & & & & & & \\
Flip-flops & 576 & 722 & 1300 & & 1089 & 1138 & 1428 & 2582 & \\
LUTs & 2072 & 2074 & 2079 & & 4124 & 4126 & 4128 & 4135 & \bigstrut[b]\\
\hline
\hline
\end{tabular}%
\vspace{1cm}
\newline
Results for a Virtex 4 device xc4vlx200-11ff1513, speed grade -11\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Virtex 4'
\begin{tabular}{rcccccccccc}
\hline
\hline
$n$ & \multicolumn{4}{c}{512} & & \multicolumn{4}{c}{1024} & [bit] \bigstrut[t]\\
Stage width & 64 & 32 & 8 & 2 & & 128 & 32 & 8 & 2 & [bit] \bigstrut[b]\\
\hline
$f_{max}$ & 22,83 & 43,05 & 138,31 & 246,98 & & 11,77 & 43,05 & 138,31 & 246,98 & [MHz] \bigstrut[t]\\
$T_m@f_{max}$ & 45,12 & 24,11 & 7,85 & 5,17 & & 87,5 & 24,5 & 8,31 & 6,21 & [$\mu$s] \\
Cycles & 1030 & 1038 & 1086 & 1278 & & 1030 & 1054 & 1150 & 1534 & [cycles] \\
\textbf{Resources} & & & & & & & & & & \\
Flip-flops & 1089 & 1138 & 1428 & 2582 & & 2114 & 2260 & 2838 & 5144 & \\
LUTs & 4124 & 4126 & 4128 & 4135 & & 8225 & 8230 & 8234 & 8238 & \bigstrut[b]\\
\hline
\hline
\end{tabular}%