\chapter{Architecture}

\section{Block diagram}
The architecture of the full IP core is shown in Figure~\ref{blockdiagram}. It consists of two major parts: the actual
exponentiation core (\verb|mod_sim_exp_core| entity) and a bus interface wrapped around it. The following sections describe
these blocks in detail. The bus interface and the exponentiation core can run at different clock
frequencies, so their clocks are independent of each other.\\
\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=10cm]{pictures/block_diagram.pdf}
\caption{Block diagram of the Modular Simultaneous Exponentiation IP core}
\label{blockdiagram}
\end{figure}
\newpage

\section{Exponentiation core}
The exponentiation core (\verb|mod_sim_exp_core| entity) is the top level of the modular simultaneous exponentiation
core. It is made up of four main blocks (Figure~\ref{msec_structure}):

\begin{itemize}
        \item a pipelined Montgomery multiplier as the main processing unit
        \item RAM to store the operands and the modulus
        \item a FIFO to store the exponents
        \item a control unit which controls the multiplier for the exponentiation and multiplication operations
\end{itemize}
 
\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=10cm]{pictures/mod_sim_exp_core.pdf}
\cprotect\caption{\verb|mod_sim_exp_core| structure}
\label{msec_structure}
\end{figure}
 
The multiplier and control unit operate on the \verb|core_clk| clock, while the interface to the operand RAM and
exponent FIFO operates on the \verb|bus_clk| clock. The transition between the two clock domains is mainly
implemented by the RAM and FIFO. The remaining control signals that cross the domains are synchronised to the
\verb|bus_clk|. When using the \verb|mod_sim_exp_core|, one can thus assume that all ports operate on the
\verb|bus_clk| clock signal.
 
\subsection{Multiplier}
The kernel of this design is a pipelined Montgomery multiplier. A Montgomery multiplication~\cite{MontModMul} allows an efficient implementation of a
modular multiplication without explicitly carrying out the classical modular reduction step. Right-shift operations ensure that the length of the (intermediate) results does not exceed $n+1$ bits. The result of a Montgomery multiplication is given by~(\ref{eq:mont}):
\begin{align}\label{eq:mont}
r = x \cdot y \cdot R^{-1} \bmod m \hspace{1.5cm}\text{with } R = 2^{n}
\end{align}
The structure of the multiplier is based on the work of \textit{Nedjah and Mourelle}~\cite{NedMour}. They show that for large operands ($>$512 bits) the time $\times$ area product is minimal when a systolic implementation is used. This construction is composed of cells that each compute one bit of the (intermediate) result.

Because a fully unrolled two-dimensional systolic implementation would require too many resources, a one-dimensional systolic array implementation is chosen. This implies that the intermediate results are fed back to the same array of cells through a register. A shift register shifts in one bit of the $x$ operand for every step of the calculation (Figure~\ref{mult_structure}). When the multiplication is completed, a final check is made to ensure that the result is smaller than the modulus. If it is not, a final reduction with $m$ is necessary.

\textbf{Note:} For this implementation the modulus $m$ has to be odd to obtain a correct result. However, we can assume that this is the case for cryptographic applications.
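
The bit-serial behaviour described above can be modelled in software. The following Python sketch is only an illustrative reference model and not the VHDL implementation: the function name \verb|montgomery_multiply| and its argument checks are assumptions made for this example. It processes one bit of $x$ per step, starting with the least significant bit, performs a right shift in every step and ends with the conditional final reduction. The cross-check uses the three-argument \verb|pow| with a negative exponent, which needs Python 3.8 or later.

```python
def montgomery_multiply(x, y, m, n):
    """Illustrative model of r = x * y * 2^(-n) mod m, for odd m < 2^n."""
    if m % 2 == 0:
        raise ValueError("the modulus m must be odd")
    if not (0 <= x < m and 0 <= y < m and m < (1 << n)):
        raise ValueError("expected 0 <= x, y < m < 2^n")
    a = 0  # intermediate result, stays below 2*m (at most n+1 bits)
    for i in range(n):
        x_i = (x >> i) & 1      # bit of x shifted in for this step, LSB first
        t = a + x_i * y         # conditionally add y
        q = t & 1               # q = 1 when m must be added to make the sum even
        a = (t + q * m) >> 1    # the right shift replaces the explicit reduction
    if a >= m:                  # final check against the modulus
        a -= m
    return a


if __name__ == "__main__":
    m, n = 239, 8
    r = montgomery_multiply(100, 200, m, n)
    # Cross-check against the definition r = x * y * R^(-1) mod m with R = 2^n
    assert r == (100 * 200 * pow(1 << n, -1, m)) % m
    print(r)
```

In each step one of $0$, $y$, $m$ or $m+y$ is added to the intermediate result, which is consistent with the systolic cells taking $m$, $y$ and $m+y$ as inputs.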

\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=15cm]{pictures/mult_structure.pdf}
\caption{Multiplier structure. For clarity, the $my$ adder and reduction logic are depicted separately, whereas in practice they are internal parts of the stages (see Figure~\ref{stage_structure}).}
\label{mult_structure}
\end{figure}
 
\subsubsection{Stage and pipeline structure}
The Montgomery algorithm uses a series of additions and right shifts to obtain the desired result. Its main disadvantage
is the carry propagation in the adder, and therefore a pipelined version is used. The length of the operands ($n$) and
the number of pipeline stages can be chosen before synthesis. The user has the option to split the pipeline into two
smaller parts, so that three operand lengths are available at runtime\footnote{E.g.\ a total pipeline length of 1536 bits
split into a part of 512 bits and a part of 1024 bits.}.

The design of the stages and of the first and last cell logic is presented in Figure~\ref{stage_structure}. Each stage takes in a
part of the modulus $m$ and of the $y$ operand, and for each step of the multiplication a bit of the $x$ operand is fed to the
pipeline (together with the generated $q$ signal), starting with the least significant bit. The systolic array cells
need the modulus $m$, the operand $y$ and the sum $m+y$ as inputs. The result from the cells is latched into a
register and then passed back to the systolic cells for the next bit of $x$. During this pass the right-shift operation
is implemented. Each stage thus needs the least significant bit from the next stage to calculate the next step. Final
reduction logic is also present in the stages for when the multiplication is complete.

An example of the standard pipeline structure is presented in Figure~\ref{pipeline_structure}. It is constructed using
stages with a predefined width. The first cell logic processes the first bit of the $m$ and $y$ operands and generates
the $q$ signal. The last cell logic finishes the reduction and selects the correct result. Because each stage relies on
the result of the next stage, a stage can only compute a step every 2 clock cycles.

Figure~\ref{pipeline_structure_split} shows an example design of a split pipeline. All multiplexers in
this figure are controlled by the pipeline select signal (\verb|p_sel|). At runtime the user can choose which part
of the pipeline is used: the lower part, the higher part or the full pipeline.
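
How \verb|p_sel| relates to the available operand lengths can be illustrated with a few lines of Python. This is a sketch under the assumption that all stages have the same width, namely \verb|C_NR_BITS_TOTAL| divided by \verb|C_NR_STAGES_TOTAL|. The function name is hypothetical, the default arguments are the default values of the core parameters, and the \verb|p_sel| encodings are those listed for the \verb|p_sel| port.

```python
def pipeline_part_widths(nr_bits_total=1536, nr_stages_total=96, nr_stages_low=32):
    """Return the operand width in bits for each p_sel encoding.

    Assumption: every stage is nr_bits_total / nr_stages_total bits wide.
    """
    if nr_bits_total % nr_stages_total != 0:
        raise ValueError("total width must be a multiple of the number of stages")
    if not 0 < nr_stages_low < nr_stages_total:
        raise ValueError("the lower part must be a proper part of the pipeline")
    stage_width = nr_bits_total // nr_stages_total
    low_width = nr_stages_low * stage_width
    return {
        "01": low_width,                   # lower pipeline part
        "10": nr_bits_total - low_width,   # higher pipeline part
        "11": nr_bits_total,               # full pipeline
    }


if __name__ == "__main__":
    for p_sel, width in pipeline_part_widths().items():
        print(p_sel, width)
```

With the default parameters this reproduces the split of a 1536-bit pipeline into parts of 512 and 1024 bits.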

\newpage
\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=25cm, angle=90]{pictures/sys_stage.pdf}
\caption{Pipeline stage and first and last cell logic}
\label{stage_structure}
\end{figure}
\newpage
\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=25cm, angle=90]{pictures/sys_pipeline_notsplit.pdf}
\caption{Example of the pipeline structure (3 stages)}
\label{pipeline_structure}
\end{figure}
\newpage
\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=22cm, angle=90]{pictures/sys_pipeline.pdf}
\caption{Example of a split pipeline (1+2 stages)}
\label{pipeline_structure_split}
\end{figure}
\newpage

\subsection{Operand RAM and exponent FIFO} \label{subsec:RAM_and_FIFO}
The core's RAM is designed to store 4 operands and a modulus.\footnote{This is the default configuration. The number of operands can be increased, but the control logic is only designed to work with the default configuration.} Three options are available for the implementation of the RAM; the parameter \verb|C_MEM_STYLE| selects the implementation style. All styles try to use the RAM resources available on the FPGA.

If the FPGA supports asymmetric RAMs, i.e.\ RAMs with different read and write widths, we suggest selecting the option \verb|"asym"|. Since the (device-specific) RAM blocks are inferred through code, it is imperative to select the right manufacturer (\verb|C_FPGA_MAN|), as this inference differs between manufacturers. Currently, only Altera and Xilinx are supported.

If there is no asymmetric RAM support, the option \verb|"generic"| should be selected. This option works for most FPGAs, but it uses more resources than the \verb|"asym"| option, because a significant number of LUTs are needed to construct an asymmetric RAM.

For both options the size of the RAM adapts to the chosen pipeline width (\verb|C_NR_BITS_TOTAL|).

Finally, the option \verb|"xil_prim"| specifically targets Xilinx devices. It uses blocks of RAM generated with CoreGen. These blocks have a fixed width, which results in a fixed RAM of $4\times1536$ bits for the operands and 1536 bits for the modulus. This option is deprecated in favor of \verb|"asym"|.

Reading and writing (from the bus side) of the operands and modulus is done one 32-bit word at a time. When using a split pipeline, it is important that operands for the higher part of the pipeline are loaded into the RAM with zeros in the bit positions that belong to the lower part of the pipeline. As a rule of thumb, the number of FPGA RAM blocks that will be used is given by (\ref{eq:ramblocks}):
\begin{align}
        2 \cdot \mathtt{C\_NR\_BITS\_TOTAL} / 32\label{eq:ramblocks}
\end{align}
\newline
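
The zero padding for the higher pipeline part can be made concrete with a small Python sketch. It is only an illustration: the function name \verb|operand_to_words| and the part names are hypothetical, and it assumes that the words are ordered from least significant to most significant, which is not stated above. The default widths correspond to a 1536-bit pipeline with a 512-bit lower part.

```python
def operand_to_words(value, total_bits=1536, low_bits=512, part="full"):
    """Split an operand into 32-bit words, least significant word first.

    Assumption: word 0 holds bits [31:0] of the full-width operand.
    An operand for the higher pipeline part is shifted up by low_bits,
    so the bits of the lower part are written as zeros.
    """
    widths = {"low": low_bits, "high": total_bits - low_bits, "full": total_bits}
    if part not in widths:
        raise ValueError("part must be 'low', 'high' or 'full'")
    if not 0 <= value < (1 << widths[part]):
        raise ValueError("operand does not fit in the selected pipeline part")
    if part == "high":
        value <<= low_bits  # leave the lower pipeline bits at zero
    return [(value >> (32 * i)) & 0xFFFFFFFF for i in range(total_bits // 32)]


if __name__ == "__main__":
    words = operand_to_words(0x1234, part="high")
    print(len(words), [hex(w) for w in words[15:18]])
```

For an operand of the lower part or of the full pipeline no shift is needed; the unused upper words are simply zero.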

To store the exponents, there is a 32-bit wide FIFO. Every 32-bit entry has to be formatted as 16 bits of $e_0$ in the
lower part [15:0] and 16 bits of $e_1$ in the higher part [31:16]. Entries have to be pushed into the FIFO starting with the least significant word and ending with the most significant word of the exponents.
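
The entry format can be sketched in Python as follows. The helper name \verb|pack_exponents| and the requirement that the exponent length is a multiple of 16 bits are assumptions of this example; the bit positions and the push order follow the description above.

```python
def pack_exponents(e0, e1, exponent_bits):
    """Return the 32-bit FIFO entries for exponents e0 and e1.

    Each entry holds 16 bits of e0 in [15:0] and 16 bits of e1 in [31:16].
    The list is ordered as the entries are pushed: least significant word first.
    """
    if exponent_bits <= 0 or exponent_bits % 16 != 0:
        raise ValueError("exponent_bits must be a positive multiple of 16")
    limit = 1 << exponent_bits
    if not (0 <= e0 < limit and 0 <= e1 < limit):
        raise ValueError("exponents must fit in exponent_bits")
    entries = []
    for i in range(exponent_bits // 16):
        low_half = (e0 >> (16 * i)) & 0xFFFF    # 16 bits of e0
        high_half = (e1 >> (16 * i)) & 0xFFFF   # 16 bits of e1
        entries.append((high_half << 16) | low_half)
    return entries


if __name__ == "__main__":
    for entry in pack_exponents(0x12345678, 0x9ABCDEF0, 32):
        print(f"{entry:08X}")
```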

For the FIFO there are 2 styles available. The implementation style depends on the style of the operand memory and cannot be set directly. When the RAM option \verb|"xil_prim"| is chosen, the resulting FIFO uses the FIFO18E1 primitive. It is able to store 512 entries, i.e.\ 2 exponents of 8192 bits each.

When the RAM option \verb|"generic"| or \verb|"asym"| is chosen, a generic FIFO\footnote{This FIFO is a slightly
modified version of the generic FIFOs project at OpenCores.org (http://opencores.org/project,generic\_fifos).} is
implemented.
It consists of a dual-port symmetric RAM with the control logic for a FIFO. The depth of this generic FIFO is adjustable with the parameter \verb|C_FIFO_AW|. The number of RAM blocks for the FIFO is given by (\ref{eq:fifoblocks}), where
\verb|RAMBLOCK_SIZE| is the size (in bits) of the FPGA's RAM primitive.
\begin{align}
        \left[\left(\mathtt{2^{C\_FIFO\_AW}}+1\right) \cdot 32 \right]/ \mathtt{RAMBLOCK\_SIZE} \label{eq:fifoblocks}
\end{align}
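
As a worked example, take the default $\mathtt{C\_FIFO\_AW} = 7$ and assume, purely for illustration, a RAM primitive of 18~Kbit ($\mathtt{RAMBLOCK\_SIZE} = 18432$; this value is an assumption and depends on the target device):
\begin{align*}
        \left[\left(2^{7}+1\right) \cdot 32 \right]/ 18432 = 4128 / 18432 \approx 0.22
\end{align*}
Under this assumption the default FIFO occupies only a fraction of one RAM primitive, so in practice a single block would be expected. With a depth of $2^{7} = 128$ entries of 16 bits per exponent, the default FIFO holds 2 exponents of $128 \cdot 16 = 2048$ bits each.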

\subsection{Control unit}
The control unit loads the operands and has full control over the multiplier. For single multiplications, it latches in
the $x$ operand, then places the $y$ operand on the bus and starts the multiplier. In case of an exponentiation, the FIFO is
emptied while the necessary single multiplications are performed. When the computation is done, the \verb|ready| signal is
asserted to notify the system.

\newpage
\subsection{IO ports and memory map}
The \verb|mod_sim_exp_core| IO ports\\
\newline
% Table generated by Excel2LaTeX
\begin{tabular}{|l|c|c|p{8cm}|}
\hline
\rowcolor{Gray}
\textbf{Port} & \textbf{Width} & \textbf{Direction} & \textbf{Description} \bigstrut\\
\hline
\verb|core_clk|   & 1     & in    & core clock input, clock signal for the multiplier and control unit \bigstrut\\
\hline
\verb|bus_clk|   & 1     & in    & bus clock input, clock signal for all core IO \bigstrut\\
\hline
\verb|reset| & 1     & in    & reset signal (active high), resets the pipeline, FIFO and control logic \bigstrut\\
\hline
\multicolumn{4}{|l|}{\textbf{\textit{operand memory interface}}} \bigstrut\\
\hline
\verb|rw_address| & 9     & in    & operand memory read/write address (structure described below) \bigstrut\\
\hline
\verb|data_out| & 32    & out   & operand data out (0 is lsb) \bigstrut\\
\hline
\verb|data_in| & 32    & in    & operand data in (0 is lsb) \bigstrut\\
\hline
\verb|write_enable| & 1     & in    & write enable signal, latches \verb|data_in| to operand RAM \bigstrut\\
\hline
\verb|collision| & 1     & out   & collision output, asserted on a write error \bigstrut\\
\hline
\multicolumn{4}{|l|}{\textbf{\textit{exponent FIFO interface}}} \bigstrut\\
\hline
\verb|fifo_din| & 32    & in    & FIFO data in, bits [31:16] for $e_1$ operand and bits [15:0] for $e_0$ operand \bigstrut\\
\hline
\verb|fifo_push| & 1     & in    & push \verb|fifo_din| into the FIFO \bigstrut\\
\hline
\verb|fifo_nopush| & 1     & out   & flag to indicate that there was an error pushing the word to the FIFO \bigstrut\\
\hline
\verb|fifo_full| & 1     & out   & flag to indicate that the FIFO is full \bigstrut\\
\hline
\multicolumn{4}{|l|}{\textbf{\textit{control signals}}} \bigstrut\\
\hline
\verb|x_sel_single| & 2     & in    & selection for x operand source during single multiplication \bigstrut\\
\hline
\verb|y_sel_single| & 2     & in    & selection for y operand source during single multiplication \bigstrut\\
\hline
\verb|dest_op_single| & 2     & in    & selection for the result destination operand for single multiplication \bigstrut\\
\hline
\verb|p_sel| & 2     & in    & specifies which pipeline part to use for exponentiation / multiplication. \bigstrut[t]\\
      &       &       & ``01'' : use lower pipeline part \\
      &       &       & ``10'' : use higher pipeline part \\
      &       &       & ``11'' : use full pipeline \bigstrut[b]\\
\hline
\verb|modulus_sel| & 1     & in    & selects which modulus to use for the calculations (only available if \verb|C_MEM_STYLE| = \verb|"generic"| or \verb|"asym"|); otherwise set to 0 \bigstrut\\
\hline
\verb|exp_m| & 1     & in    & core operation mode. ``0'' for single multiplications and ``1'' for exponentiations \bigstrut\\
\hline
\verb|start| & 1     & in    & start the calculation for current mode \bigstrut\\
\hline
\verb|ready| & 1     & out   & indicates the multiplication/exponentiation is done \bigstrut\\
\hline
\verb|calc_time| & 1     & out   & high during a multiplication, indicates the calculation time used \bigstrut\\
\hline
\end{tabular}%
\newpage
The \verb|mod_sim_exp_core| parameters\\
\begin{center}
        \begin{tabular}{|l|p{6.5cm}|c|l|}
                \hline
                \rowcolor{Gray}
                \textbf{Name} & \textbf{Description} & \textbf{VHDL Type} & \textbf{Default Value} \bigstrut\\
                \hline
                \verb|C_NR_BITS_TOTAL| & total width of the multiplier in bits & integer & 1536 \bigstrut\\
                \hline
                \verb|C_NR_STAGES_TOTAL| & total number of stages in the pipeline & integer & 96 \bigstrut\\
                \hline
                \verb|C_NR_STAGES_LOW| & number of lower stages in the pipeline, defines the bit-width of the lower pipeline part & integer & 32 \bigstrut\\
                \hline
                \verb|C_SPLIT_PIPELINE| & option to split the pipeline in 2 parts & boolean & true \bigstrut\\
                \hline
                \verb|C_FIFO_AW| & address width of the generic FIFO pointers, FIFO size is equal to $2^{C\_FIFO\_AW}$. & integer & 7 \bigstrut\\
                                 & only applicable if \verb|C_MEM_STYLE| = \verb|"generic"| or \verb|"asym"| & & \\
                \hline
                \verb|C_MEM_STYLE| & select the RAM memory style (3 options): & string & \verb|"generic"| \bigstrut\\
                                 & \verb|"generic"| : use general 32-bit RAMs & & \\
                                 & \verb|"asym"| : use asymmetric RAMs & & \\
                                 & \verb|"xil_prim"| : use Xilinx primitives (deprecated) & & \\
                                 & (for more information see Section~\ref{subsec:RAM_and_FIFO}) & & \bigstrut[b] \\
                \hline
                \verb|C_FPGA_MAN| & device manufacturer: & & \\
                                 & \verb|"xilinx"| or \verb|"altera"| & string & \verb|"xilinx"| \bigstrut\\
                \hline
        \end{tabular}%
\end{center}

Figure~\ref{Address_structure} shows the structure of the operand RAM. For every operand a space of 2048 bits
is reserved, so operand widths up to 2048 bits are supported.

\begin{figure}[H]
\centering
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=15cm]{pictures/msec_memory.pdf}
\caption{Address structure of the exponentiation core}
\label{Address_structure}
\end{figure}
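
As a quick consistency check of these numbers (plain arithmetic, not an additional specification): a reserved space of 2048 bits that is accessed in 32-bit words corresponds to
\begin{align*}
        2048 / 32 = 64 = 2^{6}
\end{align*}
words per operand, so 6 address bits are needed to select a word within one operand. The 9-bit \verb|rw_address| port can address $2^{9} = 512$ words in total, i.e.\ $512 / 64 = 8$ spaces of 2048 bits, which is enough for the 4 operands and the modulus of the default configuration. The exact assignment of the address bits is the one shown in Figure~\ref{Address_structure}.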

\section{Bus interface}
The bus interface implements the registers necessary for the control unit inputs of the \verb|mod_sim_exp_core| entity.
It also maps the memory to the required bus and connects the interrupt signals. The embedded processor then has full control
over the core.
