\chapter{Performance and resource usage}
 
This Modular Simultaneous Exponentiation IP core is designed to speed up modular simultaneous exponentiations on embedded systems.
On embedded processors, software implementations, even with specialized libraries like GMP\footnote{GNU Multiple Precision Arithmetic Library -- Project website: \url{http://gmplib.org/}}, demand much CPU time when large operands are used.
Practical tests of this core have shown a significant speed-up compared to software computations. For $n=1536$ and $t=1024$, the hardware is about 70 times faster than a GMP-based implementation (running under embedded Linux) on a 32-bit MicroBlaze processor clocked at 100~MHz.\\
 
For the multiplier, execution time is given by~(\ref{eq:Tmult}), where $\tau_c$ is defined by the core operating
frequency. Since the maximum frequency is strongly influenced by the latency of the critical path, we can expect to
achieve higher frequencies for shorter stage lengths. This trend is seen in Figure~\ref{fig:Virtex6exctime} for
different operand lengths; the results are taken from the static timing analysis during synthesis. The minimum execution
time in this graph is found where the maximum operating frequency of the core first reaches the maximum frequency of the
FPGA in use. Beyond that point, a smaller stage width brings no further benefit, because the frequency cannot
rise any further while the number of clock cycles needed to complete a multiplication increases. Note also
that splitting the pipeline has no considerable effect on the performance of the core.
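
To illustrate this trade-off with the Virtex 6 synthesis results listed later in this chapter, assume that the core runs at $f_{max}$, so that the multiplication time is the cycle count divided by $f_{max}$. For $n=512$:
\begin{align*}
s=64:&\quad 1030 / 64{,}91\,\text{MHz} \approx 15{,}87\,\mu\text{s}\\
s=4:&\quad 1150 / 395{,}57\,\text{MHz} \approx 2{,}91\,\mu\text{s}
\end{align*}
The narrower stage needs about 12\% more clock cycles, but the roughly six times higher clock frequency more than compensates.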
 
\begin{figure}[H]
\centering
\includegraphics[trim=2cm 2cm 2cm 2cm, width=16cm]{pictures/Virtex6_stagewidth.pdf}
\caption{Example of multiplication execution time as a function of the stage width for a Virtex 6 FPGA.}
\label{fig:Virtex6exctime}
\end{figure}
 
In general, shorter stage lengths result in shorter execution times. However, using more stages implies that more
flip-flops are needed, so more resources are used. A balance must be found between execution time and resource usage.
Currently, the core's operating frequency is the same as the bus frequency of the embedded processor. For optimal
operation of the core, the stage width must be chosen so that the maximum frequency reported by synthesis is equal to or
just above the bus frequency.
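
As an example, consider a 100~MHz bus, as in the MicroBlaze system mentioned above, together with the Virtex 6 results listed below for $n=512$. A stage width of 64 bits reaches only 64,91~MHz and is therefore too slow, while a stage width of 16 bits reaches 199,96~MHz and meets the bus frequency. Clocked at 100~MHz, the 16-bit configuration would then take
\begin{align*}
\frac{1054\ \text{cycles}}{100\ \text{MHz}} = 10{,}54\ \mu\text{s}
\end{align*}
per multiplication. Intermediate stage widths such as 32 bits are not tabulated for this device, so these results do not show whether they would also reach 100~MHz.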
 
The tables below show resource usage and timing results for different operand lengths and FPGAs. As a rule of thumb, the number of flip-flops is given by~(\ref{eq:ff}).
\begin{align}\label{eq:ff}
5+2\cdot n+6\cdot\frac{n}{s}+\lceil\log_{2}(n)\rceil+\left\lceil\log_{2}\left(\frac{n}{s}\right)\right\rceil
\end{align}
where $s$ is the stage width.
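
As a quick check of~(\ref{eq:ff}), take $n=512$ and $s=64$, so that $n/s=8$:
\begin{align*}
5+2\cdot 512+6\cdot 8+\lceil\log_{2}(512)\rceil+\lceil\log_{2}(8)\rceil = 5+1024+48+9+3 = 1089
\end{align*}
which matches the flip-flop count reported below for this configuration on the Virtex 6, Spartan 3 and Virtex 4 devices.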
 
The number of LUTs is almost completely determined by $n$ and the number of LUT inputs. A pre-synthesis estimate can be made with~(\ref{eq:lut4}) and~(\ref{eq:lut6}).
\begin{align}
8\cdot n \hspace{1cm}\text{for 4-input LUTs} \label{eq:lut4}\\
6\cdot n \hspace{1cm}\text{for 6-input LUTs} \label{eq:lut6}
\end{align}
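As an illustration, for $n=512$ these estimates give
\begin{align*}
8\cdot 512 &= 4096 \hspace{1cm}\text{for 4-input LUTs}\\
6\cdot 512 &= 3072 \hspace{1cm}\text{for 6-input LUTs}
\end{align*}
which is close to the 4124 to 4135 LUTs reported below for the Spartan 3 and Virtex 4 devices (4-input LUTs) and to the 3094 to 3102 LUTs reported for the Virtex 6 device (6-input LUTs).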
\newline
 
Results for a Virtex 6 device xc6vlx240t-1ff1156, speed grade -1\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Virtex 6'
\begin{tabular}{rcccccccccccc}
\hline
\hline
$n$     & \multicolumn{3}{c}{512} &       & \multicolumn{3}{c}{1024} &       & \multicolumn{3}{c}{2048} & [bit] \bigstrut[t]\\
Stage width & 64    & 16    & 4     &       & 64    & 16    & 4     &       & 64    & 16    & 4     & [bit] \bigstrut[b]\\
\hline
$f_{max}$  & 64,91 & 199,96 & 395,57 &       & 64,91 & 199,96 & 358,62 &       & 64,91 & 199,96 & 358,62 & [MHz] \bigstrut[t]\\
$T_m$ at $f_{max}$ & 15,87 & 5,27  & 2,91  &       & 31,77 & 10,55 & 9,63  &       & 63,57 & 21,11 & 12,84 & [$\mu$s] \\
Cycles & 1030  & 1054  & 1150  &       & 2062  & 2110  & 3454  &       & 4126  & 4222  & 4606  & [cycles] \\
\textbf{Resources} &       &       &       &       &       &       &       &       &       &       &       &  \\
Flip-flops & 1089  & 1235  & 1813  &       & 2163  & 2453  & 5401  &       & 4309  & 4887  & 7193  &  \\
LUTs & 3094  & 3096  & 3102  &       & 6169  & 6171  & 9252  &       & 12315 & 12318 & 12324 &  \bigstrut[b]\\
\hline
\hline
\end{tabular}%
\vspace{1cm}
\newline
Results for a Spartan 3 device xc3s1000-5fg320, speed grade -5\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Spartan 3'
\begin{tabular}{rccccccccc}
\hline
\hline
$n$     & \multicolumn{3}{c}{256} & & \multicolumn{4}{c}{512}       & [bit] \bigstrut[t]\\
Stage width & 32    & 8     & 2    & & 64    & 32    & 8     & 2     & [bit] \bigstrut[b]\\
\hline
$f_{max}$  & 21,49 & 69,30 & 127,32 & & 11,36 & 21,49 & 69,30 & 127,32 & [MHz] \bigstrut[t]\\
$T_m$ at $f_{max}$ & 24,1  & 7,82  & 5,01  & & 90,7  & 48,29 & 15,67 & 10,04 & [$\mu$s] \\
Cycles & 518   & 542   & 638 &  & 1030  & 1038  & 1086  & 1278  & [cycles] \\
\textbf{Resources} &       &    &   &       &       &       &       &       &  \\
Flip-flops & 576   & 722   & 1300 & & 1089  & 1138  & 1428  & 2582  &  \\
LUTs & 2072  & 2074  & 2079 & & 4124  & 4126  & 4128  & 4135  &  \bigstrut[b]\\
\hline
\hline
\end{tabular}%
\vspace{1cm}
\newline
Results for a Virtex 4 device xc4vlx200-11ff1513, speed grade -11\\
Synthesis settings: Optimization: area, Effort: high\\
% Table generated by Excel2LaTeX from sheet 'Virtex 4'
\begin{tabular}{rcccccccccc}
\hline
\hline
$n$     & \multicolumn{4}{c}{512}    &   & \multicolumn{4}{c}{1024}      & [bit] \bigstrut[t]\\
Stage width & 64    & 32    & 8     & 2    & & 128   & 32    & 8     & 2     & [bit] \bigstrut[b]\\
\hline
$f_{max}$  & 22,83 & 43,05 & 138,31 & 246,98 & & 11,77 & 43,05 & 138,31 & 246,98 & [MHz] \bigstrut[t]\\
$T_m$ at $f_{max}$ & 45,12 & 24,11 & 7,85  & 5,17 & & 87,5  & 24,5  & 8,31  & 6,21  & [$\mu$s] \\
Cycles & 1030  & 1038  & 1086  & 1278 & & 1030  & 1054  & 1150  & 1534  & [cycles] \\
\textbf{Resources} &  &       &       &     &  &       &       &       &       &  \\
Flip-flops & 1089  & 1138  & 1428  & 2582 & & 2114  & 2260  & 2838  & 5144  &  \\
LUTs & 4124  & 4126  & 4128  & 4135 & & 8225  & 8230  & 8234  & 8238  &  \bigstrut[b]\\
\hline
\hline
\end{tabular}%