URL https://opencores.org/ocsvn/mod_sim_exp/mod_sim_exp/trunk

# Subversion Repositoriesmod_sim_exp

## [/] [mod_sim_exp/] [trunk/] [doc/] [src/] [architecture.tex] - Blame information for rev 47

Line No. Rev Author Line
1 47 JonasDC
\chapter{Architecture}
2
 
3
\section{Block diagram}
4
The architecture for the full IP core is shown in the Figure~\ref{blockdiagram}. It consists of 2 major parts, the actual
5
exponentiation core (\verb|mod_sim_exp_core| entity) with a bus interface wrapped around it. In the following sections these
6
different blocks are described in detail.\\
7
\begin{figure}[H]
8
\centering
9
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=10cm]{pictures/block_diagram.pdf}
10
\caption{Block diagram of the Modular Simultaneous Exponentiation IP core}
11
\label{blockdiagram}
12
\end{figure}
13
\newpage
14
 
15
\section{Exponentiation core}
16
The exponentiation core (\verb|mod_sim_exp_core| entity) is the top level of the modular simultaneous exponentiation
17
core. It is made up by 4 main blocks (Figure~\ref{msec_structure}):\\
18
 
19
\begin{itemize}
20
        \item a pipelined Montgomery multiplier as the main processing unit
21
        \item RAM to store the operands and the modulus
22
        \item a FIFO to store the exponents
23
        \item a control unit which controls the multiplier for the exponentiation and multiplication operations
24
\end{itemize}
25
 
26
\begin{figure}[H]
27
\centering
28
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=10cm]{pictures/mod_sim_exp_core.pdf}
29
\cprotect\caption{\verb|mod_sim_exp_core| structure}
30
\label{msec_structure}
31
\end{figure}
32
 
33
\subsection{Multiplier}
34
The kernel of this design is a pipelined Montgomery multiplier. A Montgomery multiplication\cite{MontModMul} allows efficient implementation of a
35
modular multiplication without explicitly carrying out the classical modular reduction step. Right-shift operations ensure that the length of the (intermediate) results does not exceed $n+1$ bits. The result of a Montgomery multiplication is given by~(\ref{eq:mont}):
36
\begin{align}\label{eq:mont}
37
r = x \cdot y \cdot R^{-1} \bmod m \hspace{1.5cm}\text{with } R = 2^{n}
38
\end{align}
39
For the structure of the multiplier, the work of \textit{Nedjah and Mourelle}\cite{NedMour} is used as a basis. They show that for large operands ($>$512 bits) the $time\times area$ product is minimal when a systolic implementation is used. This construction is composed of cells that each compute a bit of the (intermediate) result.
40
 
41
Because a fully unrolled two-dimensional systolic implementation would require too many resources, a systolic array (one-dimensional) implementation is chosen. This implies that the intermediate results are fed back to the same same array of cells through a register. A shift register will shift-in a bit of the $x$ operand for every step in the calculation (figure~\ref{mult_structure}). When multiplication is completed, a final check is made to ensure the result is smaller than the modulus. If not, a final reduction with $m$ is necessary.
42
 
43
\textbf{Note:} For this implementation the modulus $m$ has to be uneven to obtain a correct result. However, we can assume that for cryptographic applications, this is the case.
44
 
45
 
46
\begin{figure}[H]
47
\centering
48
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=15cm]{pictures/mult_structure.pdf}
49
\caption{Multiplier structure. For clarification the $my$ adder and reduction logic are depicted separately, whereas in practice they are internal parts of the stages. (See Figure~\ref{stage_structure})}
50
\label{mult_structure}
51
\end{figure}
52
 
53
\subsubsection{Stage and pipeline structure}
54
The Montgomery algorithm uses a series of additions and right shifts to obtain the desired result. The main disadvantage is the carry propagation in the adder, and therefore a pipelined version is used. The length of the operands ($n$) and the number of pipeline stages can be chosen before synthesis. The user has the option to split the pipeline into 2 smaller parts so there are 3 operand lengths available during runtime\footnote{e.g. a total pipeline length of 1536 bit split into a part of 512 bit and a part of 1024 bit}.
55
 
56
The stages and first and last cell logic design are presented in Figure~\ref{stage_structure}. Each stage takes in a part of the modulus $m$ and $y$ operand and for each step of the multiplication, a bit of the $x$ operand is fed to the pipeline (together with the generated $q$ signal), starting with the Least Significant Bit. The systolic array cells need the modulus $m$, the operand $y$ and the sum $m+y$ as an input. The result from the cells is latched into a register, and then passed back to the systolic cells for the next bit of $x$. During this pass the right shift operation is
57
implemented. Each stage thus needs the least significant bit from the next stage to calculate the next step. Final reduction logic is also present in the stages for when the multiplication is complete.
58
 
59
An example of the standard pipeline structure is presented in Figure~\ref{pipeline_structure}. It is constructed using stages with a predefined width. The first cell logic processes the first bit of the $m$ and $y$ operand and generates the $q$ signal. The last cell logic finishes the reduction and selects the correct result. For operation of this pipeline, it is clear that each stage can only compute a step every 2 clock cycles. This is because the stages rely on the result of the next stage.
60
 
61
In Figure~\ref{pipeline_structure_split} an example pipeline design is drawn for a split pipeline. All multiplexers on
62
this figure are controlled by the pipeline select signal (\verb|p_sel|). During runtime the user can choose which part
63
of the pipeline is used, the lower or higher part or the full pipeline.
64
 
65
\newpage
66
\begin{figure}[H]
67
\centering
68
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=25cm, angle=90]{pictures/sys_stage.pdf}
69
\caption{Pipeline stage and first and last cell logic}
70
\label{stage_structure}
71
\end{figure}
72
\newpage
73
 
74
\newpage
75
\begin{figure}[H]
76
\centering
77
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=25cm, angle=90]{pictures/sys_pipeline_notsplit.pdf}
78
\caption{Example of the pipeline structure (3 stages)}
79
\label{pipeline_structure}
80
\end{figure}
81
\newpage
82
 
83
\newpage
84
\begin{figure}[H]
85
\centering
86
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=22cm, angle=90]{pictures/sys_pipeline.pdf}
87
\caption{Example of a split pipeline (1+2 stages)}
88
\label{pipeline_structure_split}
89
\end{figure}
90
\newpage
91
 
92
 
93
\subsection{Operand RAM and exponent FIFO}
94
In the core's RAM there is space for 4 operands and 1 modulus. Currently this is instantiated using Xilinx BRAM
95
primitives, so there is a fixed RAM of 4x1536 bit for the operands + 1536 bit for the modulus available. If using a split
96
pipeline, it is important that operands for the higher part of the pipeline are loaded into the RAM with preceding
97
zero's for the lower bits of the pipeline. To store the exponents there is a FIFO of 32 bit wide and 512 deep (also a
98
Xilinx primitive, FIFO18E1), so it is able to store 2 exponents of each 8192 bit long. Every 32 bit entry has to be
99
pushed in as 16 bit of $e_0$ for the lower part [15:0] and 16 bit of $e_1$ for the higher part [31:16]. Starting
100
with the least significant word and ending with the most significant word of the exponents.
101
 
102
\subsection{Control unit}
103
The control unit loads in the operands and has full control over the multiplier. For single multiplications, it latches in
104
the $x$ operand, then places the $y$ operand on the bus and starts the multiplier. In case of an exponentiation, the FIFO is
105
emptied while the necessary single multiplications are performed. When the computation is done, the ready signal is
106
asserted to notify the system.
107
 
108
\newpage
109
\subsection{IO ports and memory map}
110
The \verb|mod_sim_exp_core| IO ports\\
111
\newline
112
% Table generated by Excel2LaTeX
113
\begin{tabular}{|l|c|c|p{8cm}|}
114
\hline
115
\rowcolor{Gray}
116
\textbf{Port} & \textbf{Width} & \textbf{Direction} & \textbf{Description} \bigstrut\\
117
\hline
118
\verb|clk|   & 1     & in    & core clock input \bigstrut\\
119
\hline
120
\verb|reset| & 1     & in    & reset signal (active high) resets the pipeline, fifo and control logic \bigstrut\\
121
\hline
122
\multicolumn{4}{|l|}{\textbf{\textit{operand memory interface}}} \bigstrut\\
123
\hline
124
\verb|rw_address| & 9     & in    & operand memory read/write address (structure descibed below) \bigstrut\\
125
\hline
126
\verb|data_out| & 32    & out   & operand data out (0 is lsb) \bigstrut\\
127
\hline
128
\verb|data_in| & 32    & in    & operand data in (0 is lsb) \bigstrut\\
129
\hline
130
\verb|write_enable| & 1     & in    & write enable signal, latches \verb|data_in| to operand RAM \bigstrut\\
131
\hline
132
\verb|collision| & 1     & out   & collision output, asserts on a write error \bigstrut\\
133
\hline
134
\multicolumn{4}{|l|}{\textbf{\textit{exponent FIFO interface}}} \bigstrut\\
135
\hline
136
\verb|fifo_din| & 32    & in    & FIFO data in, bits [31:16] for $e_1$ operand and bits [15:0] for $e_0$ operand \bigstrut\\
137
\hline
138
\verb|fifo_push| & 1     & in    & push \verb|fifo_din| into the FIFO \bigstrut\\
139
\hline
140
\verb|fifo_nopush| & 1     & out   & flag to indicate if there was an error pushing the word to the FIFO \bigstrut\\
141
\hline
142
\verb|fifo_full| & 1     & out   & flag to indicate the FIFO is full \bigstrut\\
143
\hline
144
\multicolumn{4}{|l|}{\textbf{\textit{control signals}}} \bigstrut\\
145
\hline
146
\verb|x_sel_single| & 2     & in    & selection for x operand source during single multiplication \bigstrut\\
147
\hline
148
\verb|y_sel_single| & 2     & in    & selection for y operand source during single multiplication \bigstrut\\
149
\hline
150
\verb|dest_op_single| & 2     & in    & selection for the result destination operand for single multiplication \bigstrut\\
151
\hline
152
\verb|p_sel| & 2     & in    & specifies which pipeline part to use for exponentiation / multiplication. \bigstrut[t]\\
153
      &       &       & 01'' : use lower pipeline part \\
154
      &       &       & 10'' : use higher pipeline part \\
155
      &       &       & 11'' : use full pipeline \bigstrut[b]\\
156
\hline
157
\verb|exp_m| & 1     & in    & core operation mode. 0'' for single multiplications and 1'' for exponentiations \bigstrut\\
158
\hline
159
\verb|start| & 1     & in    & start the calculation for current mode \bigstrut\\
160
\hline
161
\verb|ready| & 1     & out   & indicates the multiplication/exponentiation is done \bigstrut\\
162
\hline
163
\verb|calc_time| & 1     & out   & is high during a multiplication, indicator for used calculation time \bigstrut\\
164
\hline
165
\end{tabular}%
166
\newpage
167
The following figure describes the structure of the Operand RAM memory, for every operand there is a space of 2048 bits
168
reserved.\\
169
\begin{figure}[H]
170
\centering
171
\includegraphics[trim=1.2cm 1.2cm 1.2cm 1.2cm, width=15cm]{pictures/msec_memory.pdf}
172
\caption{RAM structure of the exponentiation core}
173
\label{RAM_structure}
174
\end{figure}
175
 
176
\section{Bus interface}
177
The bus interface implements the register necessary for the control unit inputs to the \verb|mod_sim_exp_core| entity.
178
It also maps the memory to the required bus and connects the interrupt signals. The embedded processor then has full control
179
over the core.