1 |
47 |
JonasDC |
\chapter{Architecture}
|
2 |
|
|
|
3 |
|
|
\section{Block diagram}
|
4 |
|
|
The architecture for the full IP core is shown in the Figure~\ref{blockdiagram}. It consists of 2 major parts, the actual
|
5 |
|
|
exponentiation core (\verb|mod_sim_exp_core| entity) with a bus interface wrapped around it. In the following sections these
|
6 |
|
|
different blocks are described in detail.\\
|
7 |
|
|
\begin{figure}[H]
|
8 |
|
|
\centering
|
9 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=10cm]{pictures/block_diagram.pdf}
|
10 |
|
|
\caption{Block diagram of the Modular Simultaneous Exponentiation IP core}
|
11 |
|
|
\label{blockdiagram}
|
12 |
|
|
\end{figure}
|
13 |
|
|
\newpage
|
14 |
|
|
|
15 |
|
|
\section{Exponentiation core}
|
16 |
|
|
The exponentiation core (\verb|mod_sim_exp_core| entity) is the top level of the modular simultaneous exponentiation
|
17 |
|
|
core. It is made up by 4 main blocks (Figure~\ref{msec_structure}):\\
|
18 |
|
|
|
19 |
|
|
\begin{itemize}
|
20 |
|
|
\item a pipelined Montgomery multiplier as the main processing unit
|
21 |
|
|
\item RAM to store the operands and the modulus
|
22 |
|
|
\item a FIFO to store the exponents
|
23 |
|
|
\item a control unit which controls the multiplier for the exponentiation and multiplication operations
|
24 |
|
|
\end{itemize}
|
25 |
|
|
|
26 |
|
|
\begin{figure}[H]
|
27 |
|
|
\centering
|
28 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=10cm]{pictures/mod_sim_exp_core.pdf}
|
29 |
|
|
\cprotect\caption{\verb|mod_sim_exp_core| structure}
|
30 |
|
|
\label{msec_structure}
|
31 |
|
|
\end{figure}
|
32 |
|
|
|
33 |
|
|
\subsection{Multiplier}
|
34 |
|
|
The kernel of this design is a pipelined Montgomery multiplier. A Montgomery multiplication\cite{MontModMul} allows efficient implementation of a
|
35 |
|
|
modular multiplication without explicitly carrying out the classical modular reduction step. Right-shift operations ensure that the length of the (intermediate) results does not exceed $n+1$ bits. The result of a Montgomery multiplication is given by~(\ref{eq:mont}):
|
36 |
|
|
\begin{align}\label{eq:mont}
|
37 |
|
|
r = x \cdot y \cdot R^{-1} \bmod m \hspace{1.5cm}\text{with } R = 2^{n}
|
38 |
|
|
\end{align}
|
39 |
|
|
For the structure of the multiplier, the work of \textit{Nedjah and Mourelle}\cite{NedMour} is used as a basis. They show that for large operands ($>$512 bits) the $time\times area$ product is minimal when a systolic implementation is used. This construction is composed of cells that each compute a bit of the (intermediate) result.
|
40 |
|
|
|
41 |
|
|
Because a fully unrolled two-dimensional systolic implementation would require too many resources, a systolic array (one-dimensional) implementation is chosen. This implies that the intermediate results are fed back to the same same array of cells through a register. A shift register will shift-in a bit of the $x$ operand for every step in the calculation (figure~\ref{mult_structure}). When multiplication is completed, a final check is made to ensure the result is smaller than the modulus. If not, a final reduction with $m$ is necessary.
|
42 |
|
|
|
43 |
|
|
\textbf{Note:} For this implementation the modulus $m$ has to be uneven to obtain a correct result. However, we can assume that for cryptographic applications, this is the case.
|
44 |
|
|
|
45 |
|
|
|
46 |
|
|
\begin{figure}[H]
|
47 |
|
|
\centering
|
48 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=15cm]{pictures/mult_structure.pdf}
|
49 |
|
|
\caption{Multiplier structure. For clarification the $my$ adder and reduction logic are depicted separately, whereas in practice they are internal parts of the stages. (See Figure~\ref{stage_structure})}
|
50 |
|
|
\label{mult_structure}
|
51 |
|
|
\end{figure}
|
52 |
|
|
|
53 |
|
|
\subsubsection{Stage and pipeline structure}
|
54 |
78 |
JonasDC |
The Montgomery algorithm uses a series of additions and right shifts to obtain the desired result. The main disadvantage
|
55 |
|
|
is the carry propagation in the adder, and therefore a pipelined version is used. The length of the operands ($n$) and
|
56 |
|
|
the number of pipeline stages can be chosen before synthesis. The user has the option to split the pipeline into 2
|
57 |
|
|
smaller parts so there are 3 operand lengths available during runtime\footnote{e.g. a total pipeline length of 1536 bit
|
58 |
|
|
split into a part of 512 bit and a part of 1024 bit}.
|
59 |
47 |
JonasDC |
|
60 |
78 |
JonasDC |
The stages and first and last cell logic design are presented in Figure~\ref{stage_structure}. Each stage takes in a
|
61 |
|
|
part of the modulus $m$ and $y$ operand and for each step of the multiplication, a bit of the $x$ operand is fed to the
|
62 |
|
|
pipeline (together with the generated $q$ signal), starting with the Least Significant Bit. The systolic array cells
|
63 |
|
|
need the modulus $m$, the operand $y$ and the sum $m+y$ as an input. The result from the cells is latched into a
|
64 |
|
|
register, and then passed back to the systolic cells for the next bit of $x$. During this pass the right shift operation
|
65 |
|
|
is implemented. Each stage thus needs the least significant bit from the next stage to calculate the next step. Final
|
66 |
|
|
reduction logic is also present in the stages for when the multiplication is complete.
|
67 |
47 |
JonasDC |
|
68 |
78 |
JonasDC |
An example of the standard pipeline structure is presented in Figure~\ref{pipeline_structure}. It is constructed using
|
69 |
|
|
stages with a predefined width. The first cell logic processes the first bit of the $m$ and $y$ operand and generates
|
70 |
|
|
the $q$ signal. The last cell logic finishes the reduction and selects the correct result. For operation of this
|
71 |
|
|
pipeline, it is clear that each stage can only compute a step every 2 clock cycles. This is because the stages rely on
|
72 |
|
|
the result of the next stage.
|
73 |
47 |
JonasDC |
|
74 |
|
|
In Figure~\ref{pipeline_structure_split} an example pipeline design is drawn for a split pipeline. All multiplexers on
|
75 |
|
|
this figure are controlled by the pipeline select signal (\verb|p_sel|). During runtime the user can choose which part
|
76 |
|
|
of the pipeline is used, the lower or higher part or the full pipeline.
|
77 |
|
|
|
78 |
|
|
\newpage
|
79 |
|
|
\begin{figure}[H]
|
80 |
|
|
\centering
|
81 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=25cm, angle=90]{pictures/sys_stage.pdf}
|
82 |
|
|
\caption{Pipeline stage and first and last cell logic}
|
83 |
|
|
\label{stage_structure}
|
84 |
|
|
\end{figure}
|
85 |
|
|
\newpage
|
86 |
|
|
|
87 |
|
|
\newpage
|
88 |
|
|
\begin{figure}[H]
|
89 |
|
|
\centering
|
90 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=25cm, angle=90]{pictures/sys_pipeline_notsplit.pdf}
|
91 |
|
|
\caption{Example of the pipeline structure (3 stages)}
|
92 |
|
|
\label{pipeline_structure}
|
93 |
|
|
\end{figure}
|
94 |
|
|
\newpage
|
95 |
|
|
|
96 |
|
|
\newpage
|
97 |
|
|
\begin{figure}[H]
|
98 |
|
|
\centering
|
99 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=22cm, angle=90]{pictures/sys_pipeline.pdf}
|
100 |
|
|
\caption{Example of a split pipeline (1+2 stages)}
|
101 |
|
|
\label{pipeline_structure_split}
|
102 |
|
|
\end{figure}
|
103 |
|
|
\newpage
|
104 |
|
|
|
105 |
|
|
|
106 |
78 |
JonasDC |
\subsection{Operand RAM and exponent FIFO} \label{subsec:RAM_and_FIFO}
|
107 |
|
|
The core's RAM is designed to store 4 operands and a modulus. \footnote{This is the default configuration. The number of operands can be increased, but the control logic is only designed to work with the default configuration.} Three (3) options are available for the implementation of the RAM. Setting the parameter \verb|C_MEM_STYLE|, will change the implementation style. All styles try to use the RAM resources available on the FPGA.
|
108 |
47 |
JonasDC |
|
109 |
87 |
JonasDC |
If the FPGA supports asymmetric RAMs, i.e. with a different read and write width, we suggest that the option \verb|"asym"| is selected. Since the (device specific) RAM blocks are inferred through code, it is imperative to select the right device (\verb|C_FPGA_MAN|), as this inference is different between manufacturers. Currently, only Altera and Xilinx are supported.
|
110 |
78 |
JonasDC |
|
111 |
|
|
If there's no asymmetric RAM support, the option \verb|"generic"| should be selected. This option will work for most FPGAs, but the disadvantage is that it will use more resources than the \verb|"asym"| option. This is because a significant number of LUTs will be used to construct an asymmetric RAM.
|
112 |
|
|
|
113 |
|
|
For both options the size of the RAM adapts dynamically to the chosen pipeline width (\verb|C_NR_BITS_TOTAL|).
|
114 |
|
|
|
115 |
|
|
Finally, the option \verb|"xil_prim"| is targeted specifically to Xilinx devices. It uses blocks of RAM generated with CoreGen. These blocks are of a fixed width and this results in a fixed RAM of 4x1536 bit for the operands and 1536 bit for the modulus. This option is deprecated in favor of \verb|"asym"|.
|
116 |
|
|
|
117 |
|
|
Reading and writing (from the bus side) to the operands and modulus is done one 32-bit word at a time. If using a split pipeline, it is important that operands for the higher part of the pipeline are loaded into the RAM with preceding zero's for the lower bits of the pipeline. As a rule of thumb, the number of FPGA RAM blocks that will be used is given by (\ref{eq:ramblocks}):
|
118 |
|
|
\begin{align}
|
119 |
|
|
2 \cdot \mathtt{C\_NR\_BITS\_TOTAL} / 32\label{eq:ramblocks}
|
120 |
|
|
\end{align}
|
121 |
|
|
\newline
|
122 |
|
|
|
123 |
|
|
To store the exponents, there is a FIFO of 32 bit wide. Every 32 bit entry has to be formatted as 16 bit of $e_0$ for the
|
124 |
|
|
lower part [15:0] and 16 bit of $e_1$ for the higher part [31:16]. Entries have to be pushed in the FIFO starting with the least significant word and ending with the most significant word of the exponents.
|
125 |
|
|
|
126 |
|
|
For the FIFO there are 2 styles available. The implementation style depends on the style of the operand memory and it can not be set directly. When the RAM option \verb|"xil_prim"| is chosen, the resulting FIFO will use the FIFO18E1 primitive. It is able to store 512 entries, meaning 2 exponents of each 8192 bit long.
|
127 |
|
|
|
128 |
|
|
When the RAM options \verb|"generic"| or \verb|"asym"| are chosen, a generic FIFO will be implemented. This consist of a symmetric RAM with the control logic for a FIFO. The depth of this generic FIFO is adjustable with the parameter \verb|C_FIFO_DEPTH|.
|
129 |
|
|
The number of RAM blocks for the FIFO is given by (\ref{eq:fifoblocks}), where \verb|RAMBLOCK_SIZE| is the size [bits] of the FPGA's RAM primitive.
|
130 |
|
|
\begin{align}
|
131 |
|
|
\left[\left(\mathtt{C\_FIFO\_DEPTH}+1\right) \cdot 32 \right]/ \mathtt{RAMBLOCK\_SIZE} \label{eq:fifoblocks}
|
132 |
|
|
\end{align}
|
133 |
|
|
|
134 |
47 |
JonasDC |
\subsection{Control unit}
|
135 |
|
|
The control unit loads in the operands and has full control over the multiplier. For single multiplications, it latches in
|
136 |
|
|
the $x$ operand, then places the $y$ operand on the bus and starts the multiplier. In case of an exponentiation, the FIFO is
|
137 |
|
|
emptied while the necessary single multiplications are performed. When the computation is done, the ready signal is
|
138 |
|
|
asserted to notify the system.
|
139 |
|
|
|
140 |
|
|
\newpage
|
141 |
|
|
\subsection{IO ports and memory map}
|
142 |
|
|
The \verb|mod_sim_exp_core| IO ports\\
|
143 |
|
|
\newline
|
144 |
|
|
% Table generated by Excel2LaTeX
|
145 |
|
|
\begin{tabular}{|l|c|c|p{8cm}|}
|
146 |
|
|
\hline
|
147 |
|
|
\rowcolor{Gray}
|
148 |
|
|
\textbf{Port} & \textbf{Width} & \textbf{Direction} & \textbf{Description} \bigstrut\\
|
149 |
|
|
\hline
|
150 |
|
|
\verb|clk| & 1 & in & core clock input \bigstrut\\
|
151 |
|
|
\hline
|
152 |
|
|
\verb|reset| & 1 & in & reset signal (active high) resets the pipeline, fifo and control logic \bigstrut\\
|
153 |
|
|
\hline
|
154 |
|
|
\multicolumn{4}{|l|}{\textbf{\textit{operand memory interface}}} \bigstrut\\
|
155 |
|
|
\hline
|
156 |
|
|
\verb|rw_address| & 9 & in & operand memory read/write address (structure descibed below) \bigstrut\\
|
157 |
|
|
\hline
|
158 |
|
|
\verb|data_out| & 32 & out & operand data out (0 is lsb) \bigstrut\\
|
159 |
|
|
\hline
|
160 |
|
|
\verb|data_in| & 32 & in & operand data in (0 is lsb) \bigstrut\\
|
161 |
|
|
\hline
|
162 |
|
|
\verb|write_enable| & 1 & in & write enable signal, latches \verb|data_in| to operand RAM \bigstrut\\
|
163 |
|
|
\hline
|
164 |
|
|
\verb|collision| & 1 & out & collision output, asserts on a write error \bigstrut\\
|
165 |
|
|
\hline
|
166 |
|
|
\multicolumn{4}{|l|}{\textbf{\textit{exponent FIFO interface}}} \bigstrut\\
|
167 |
|
|
\hline
|
168 |
|
|
\verb|fifo_din| & 32 & in & FIFO data in, bits [31:16] for $e_1$ operand and bits [15:0] for $e_0$ operand \bigstrut\\
|
169 |
|
|
\hline
|
170 |
|
|
\verb|fifo_push| & 1 & in & push \verb|fifo_din| into the FIFO \bigstrut\\
|
171 |
|
|
\hline
|
172 |
|
|
\verb|fifo_nopush| & 1 & out & flag to indicate if there was an error pushing the word to the FIFO \bigstrut\\
|
173 |
|
|
\hline
|
174 |
|
|
\verb|fifo_full| & 1 & out & flag to indicate the FIFO is full \bigstrut\\
|
175 |
|
|
\hline
|
176 |
|
|
\multicolumn{4}{|l|}{\textbf{\textit{control signals}}} \bigstrut\\
|
177 |
|
|
\hline
|
178 |
|
|
\verb|x_sel_single| & 2 & in & selection for x operand source during single multiplication \bigstrut\\
|
179 |
|
|
\hline
|
180 |
|
|
\verb|y_sel_single| & 2 & in & selection for y operand source during single multiplication \bigstrut\\
|
181 |
|
|
\hline
|
182 |
|
|
\verb|dest_op_single| & 2 & in & selection for the result destination operand for single multiplication \bigstrut\\
|
183 |
|
|
\hline
|
184 |
|
|
\verb|p_sel| & 2 & in & specifies which pipeline part to use for exponentiation / multiplication. \bigstrut[t]\\
|
185 |
|
|
& & & ``01'' : use lower pipeline part \\
|
186 |
|
|
& & & ``10'' : use higher pipeline part \\
|
187 |
|
|
& & & ``11'' : use full pipeline \bigstrut[b]\\
|
188 |
|
|
\hline
|
189 |
78 |
JonasDC |
\verb|modulus_sel| & 1 & in & selection for which modulus to use for the calculations (only available if \verb|C_MEM_STYLE| = \verb|"generic"| or \verb|"asym"|). Otherwise set to 0 \bigstrut\\
|
190 |
|
|
\hline
|
191 |
47 |
JonasDC |
\verb|exp_m| & 1 & in & core operation mode. ``0'' for single multiplications and ``1'' for exponentiations \bigstrut\\
|
192 |
|
|
\hline
|
193 |
|
|
\verb|start| & 1 & in & start the calculation for current mode \bigstrut\\
|
194 |
|
|
\hline
|
195 |
|
|
\verb|ready| & 1 & out & indicates the multiplication/exponentiation is done \bigstrut\\
|
196 |
|
|
\hline
|
197 |
|
|
\verb|calc_time| & 1 & out & is high during a multiplication, indicator for used calculation time \bigstrut\\
|
198 |
|
|
\hline
|
199 |
|
|
\end{tabular}%
|
200 |
|
|
\newpage
|
201 |
78 |
JonasDC |
The \verb|mod_sim_exp_core| parameters\\
|
202 |
|
|
\begin{center}
|
203 |
|
|
\begin{tabular}{|l|p{6.5cm}|c|l|}
|
204 |
|
|
\hline
|
205 |
|
|
\rowcolor{Gray}
|
206 |
|
|
\textbf{Name} & \textbf{Description} & \textbf{VHDL Type} &\textbf{Default Value} \bigstrut\\
|
207 |
|
|
\hline
|
208 |
|
|
\verb|C_NR_BITS_TOTAL| & total width of the multiplier in bits & integer & 1536\bigstrut\\
|
209 |
|
|
\hline
|
210 |
|
|
\verb|C_NR_STAGES_TOTAL| & total number of stages in the pipeline & integer & 96\bigstrut\\
|
211 |
|
|
\hline
|
212 |
|
|
\verb|C_NR_STAGES_LOW| & number of lower stages in the pipeline, defines the bit-width of the lower pipeline part & integer & 32 \bigstrut\\
|
213 |
|
|
\hline
|
214 |
|
|
\verb|C_SPLIT_PIPELINE| & option to split the pipeline in 2 parts & boolean & true \bigstrut\\
|
215 |
|
|
\hline
|
216 |
|
|
\verb|C_FIFO_DEPTH| & depth of the generic FIFO, only applicable if \verb|C_MEM_STYLE| = \verb|"generic"| or \verb|"asym"| & integer & 32 \bigstrut\\
|
217 |
|
|
\hline
|
218 |
|
|
\verb|C_MEM_STYLE| & select the RAM memory style (3 options): & string & \verb|"generic"| \bigstrut\\
|
219 |
|
|
& \verb|"generic"| : use general 32-bit RAMs & & \\
|
220 |
|
|
& \verb|"asym"| : use asymmetric RAMs & & \\
|
221 |
|
|
& (For more information see \ref{subsec:RAM_and_FIFO}) & & \\
|
222 |
|
|
& \verb|"xil_prim"| : use xilinx primitives & &\\
|
223 |
|
|
& (deprecated) & & \bigstrut[b] \\
|
224 |
|
|
\hline
|
225 |
87 |
JonasDC |
\verb|C_FPGA_MAN| & device manufacturer: & & \\
|
226 |
78 |
JonasDC |
& \verb|"xilinx"| or \verb|"altera"| & string & \verb|"xilinx"| \bigstrut\\
|
227 |
|
|
\hline
|
228 |
|
|
\end{tabular}%
|
229 |
|
|
\end{center}
|
230 |
|
|
|
231 |
47 |
JonasDC |
The following figure describes the structure of the Operand RAM memory, for every operand there is a space of 2048 bits
|
232 |
78 |
JonasDC |
reserved. So operand widths up to 2048 bits are supported.\\
|
233 |
|
|
\newline \\
|
234 |
47 |
JonasDC |
\begin{figure}[H]
|
235 |
|
|
\centering
|
236 |
|
|
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=15cm]{pictures/msec_memory.pdf}
|
237 |
78 |
JonasDC |
\caption{Address structure of the exponentiation core}
|
238 |
|
|
\label{Address_structure}
|
239 |
47 |
JonasDC |
\end{figure}
|
240 |
|
|
|
241 |
|
|
\section{Bus interface}
|
242 |
|
|
The bus interface implements the register necessary for the control unit inputs to the \verb|mod_sim_exp_core| entity.
|
243 |
|
|
It also maps the memory to the required bus and connects the interrupt signals. The embedded processor then has full control
|
244 |
|
|
over the core.
|