\documentclass[10pt, twoside, a4paper]{article}
\usepackage{graphicx}
\usepackage{listings}

\title{marca - McAdam's RISC Computer Architecture\\Implementation Details}
\author{Wolfgang Puffitsch}

\begin{document}

\maketitle

\section{General}

\begin{itemize}
\item 16 registers, each 16 bits wide
\item 16\,KB instruction ROM (8192 instructions)
\item 8\,KB data RAM
\item 256-byte data ROM
\item 75 instructions
\item 16 interrupt vectors
\end{itemize}
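
As a quick consistency check on the instruction ROM size (a sketch
that assumes all instructions have the same length, which is not
stated explicitly here):
\[
  \frac{16\,\mathrm{KB}}{8192\ \mathrm{instructions}}
  = \frac{16384\ \mathrm{bytes}}{8192}
  = 2\ \mathrm{bytes\ per\ instruction},
\]
which would correspond to one 16-bit word per instruction.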

\section{Internals}

The processor features a 4-stage pipeline:
\begin{itemize}
\item instruction fetch
\item instruction decode
\item execution/memory access
\item write back
\end{itemize}
This scheme is similar to the one used in the MIPS architecture,
except that the execution and memory access stages are merged.
Because our architecture does not support indexed addressing, the
memory unit does not need the ALU's result and can work in parallel
with it, which reduces the number of possible hazards.

Figure~\ref{fig:marca} shows a rough schematic of the processor's
internals.
\begin{figure}[ht!]
\centering
\includegraphics[width=.95\textwidth]{marca}
\caption{Internal scheme}
\label{fig:marca}
\end{figure}

\subsection{Branches}
Branches are not predicted. If executed, they stall the pipeline,
leading to a total execution time of 4 cycles. The fetch stage is not
stalled; the decode stage, however, is stalled for two cycles to
compensate for that.
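
To illustrate the cost, consider a rough estimate. It is a sketch
under simplifying assumptions, not a measurement: a fraction $b$ of
the executed instructions are branches that take 4 cycles, every
other instruction takes one cycle, and no other stalls occur. The
average number of cycles per instruction is then
\[
  \mathrm{CPI} = (1 - b) \cdot 1 + b \cdot 4 = 1 + 3b .
\]
For example, $b = 0.1$ gives $\mathrm{CPI} = 1.3$.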

\subsection{Instruction fetch}
This stage is not spectacular: it simply reads an instruction from
the instruction ROM and extracts the bits for the source and
destination registers.

\subsection{Instruction decode}
This stage translates the bit-patterns of the opcodes to the signals
used internally for the operations. It also holds the register file
and handles access to it. Immediate values are also constructed here.

\subsection{Execution / Memory access}
The execution stage is the heart and soul of the processor: it holds
the ALU, the memory/IO unit and a unit for interrupt handling.

\subsubsection{ALU}
The ALU performs all arithmetic and logic computations and maintains
the processor's flags, which are organized as shown in
Table~\ref{tab:flags}.

\begin{table}[ht!]
\centering
\begin{tabular}{|p{.75em}|p{.75em}|p{.75em}|p{.75em}
                |p{.75em}|p{.75em}|p{.75em}|p{.75em}
                |p{.75em}|p{.75em}|p{.75em}|p{.75em}
                |p{.75em}|p{.75em}|p{.75em}|p{.75em}|}
\multicolumn{16}{c}{Bit 15 \hfill Bit 0} \\
\hline
 & & & & & & & & & & P & I & N & V & C & Z \\
\hline
\end{tabular}
\caption{The flag register}
\label{tab:flags}
\end{table}
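
Reading the bit positions off Table~\ref{tab:flags}, with bit~0 as
the least significant bit, the flag at bit~$i$ can be tested with
the mask $2^i$. The following summary is only a sketch derived from
the table; the mask values are not taken from the processor's
sources.
\begin{center}
\begin{tabular}{l|cccccc}
Flag & P & I & N & V & C & Z \\
\hline
Bit  & 5 & 4 & 3 & 2 & 1 & 0 \\
Mask & \texttt{0x0020} & \texttt{0x0010} & \texttt{0x0008}
     & \texttt{0x0004} & \texttt{0x0002} & \texttt{0x0001} \\
\end{tabular}
\end{center}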

Operations that need more than one cycle to execute (multiplication,
division and modulo) block the rest of the processor until they are
finished.

\subsubsection{Memory/IO unit}
The memory/IO unit takes care of the ordinary data memory, the data
ROM (which is mapped to the addresses right above the RAM) and the
communication with peripheral modules. Peripheral modules are located
within the memory/IO unit and mapped to the highest addresses.
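
The resulting address map might look as follows. This is only a
sketch: it assumes that the data RAM starts at address
\texttt{0x0000}, which is not stated in this document, and it places
the data ROM directly above the RAM as described. The sizes are those
listed in the General section.
\begin{center}
\begin{tabular}{l|l|l}
Region & Size & Address range (assumed) \\
\hline
Data RAM & 8\,KB & \texttt{0x0000}--\texttt{0x1FFF} \\
Data ROM & 256 bytes & \texttt{0x2000}--\texttt{0x20FF} \\
Peripheral modules & & highest addresses, up to \texttt{0xFFFF} \\
\end{tabular}
\end{center}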

The memories (including the instruction ROM) are Altera-specific; we
decided not to use generic memories because \textsl{Quartus} can
update the contents of its proprietary ROMs without synthesizing the
whole design. Because all memories are single-ported (and thus fairly
simple), it should be easy to replace them with memories specific to
other vendors.

We also decided against the use of external memories; larger FPGAs
can accommodate all addressable memory on-chip, so the implementation
overhead would not have paid off.

Accesses that take more than one cycle (stores to peripheral
modules and all load operations) block the rest of the processor
until they are finished.

\paragraph{Peripheral modules}
The peripheral modules use a slightly modified version of the SimpCon
interface. The SimpCon-specific signals are grouped into records, and
the words which can be read/written are limited to 16 bits. To access
such a module, one may only use \texttt{load} and \texttt{store}
instructions which point to aligned addresses.

\paragraph{UART}
The built-in UART is derived from the sc\_uart by Martin
Sch\"oberl. Apart from adapting the SimpCon interface, an interrupt
line and two bits for enabling/masking receive (bit 3 in the status
register) and transmit (bit 2) interrupts were added. In the current
version, address 0xFFF8 ($-8$) corresponds to the UART's status
register and address 0xFFFA ($-6$) to the wr\_data/rd\_data register.
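
The negative values in parentheses are the same addresses read as
16-bit two's complement numbers:
\[
  2^{16} - 8 = 65528 = \mathtt{0xFFF8}, \qquad
  2^{16} - 6 = 65530 = \mathtt{0xFFFA}.
\]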

\subsubsection{Interrupt unit}
The interrupt unit takes care of the interrupt vectors and, of
course, the triggering of interrupts. Interrupts are executed only
if the global interrupt flag is set, none of the other units is busy
and the instruction in the execution stage is valid (it takes 3
cycles after jumps, branches etc. until a new valid instruction is
in that stage).
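
These conditions can be summarized as a Boolean expression. The
signal names are purely illustrative and are not taken from the
implementation:
\[
  \mathit{trigger} = \mathit{pending} \land \mathit{gie}
    \land \lnot\mathit{busy} \land \mathit{valid},
\]
where $\mathit{pending}$ stands for a requested interrupt,
$\mathit{gie}$ for the global interrupt flag, $\mathit{busy}$ for
``some other unit is busy'' and $\mathit{valid}$ for ``the
instruction in the execution stage is valid''.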

Instructions which cannot be decoded, as well as the ``error''
instruction, trigger interrupt 0; the ALU can trigger interrupt 1
(division by zero), and the memory unit can trigger interrupt 2
(invalid memory access). In contrast to all other interrupts, these
three interrupts do not repeat the instruction which is executed when
they occur.

\subsection{Write back}
The write back stage passes on the result of the execution stage to
all other stages.

\section{Assembler}
The assembler \textsl{spar} (SPear Assembler Recycled) uses a syntax
much like that of the usual Unix-style assemblers. It accepts the
pseudo-ops \texttt{.file}, \texttt{.text}, \texttt{.data},
\texttt{.bss}, \texttt{.align}, \texttt{.comm}, \texttt{.lcomm},
\texttt{.org} and \texttt{.skip} with the usual meanings. The
mnemonic \texttt{data} initializes a byte to some constant value. In
contrast to the instruction set architecture specification,
\texttt{mod} and \texttt{umod} accept three operands (if a move is
needed, it is silently inserted).

The assembler produces three files: one file for the instruction
ROM, one file for the even bytes of the data ROM and one file for
the odd bytes of the data ROM. The data must be split because the
data memories are internally split into two 8-bit memories in order
to support unaligned memory accesses without delays.
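
The benefit of the split can be seen with a little index arithmetic.
This is a sketch of the idea only; the symbols $D$, $E$ and $O$ are
introduced here for illustration and do not appear in the
implementation. Let $D_0, D_1, D_2, \ldots$ be the data bytes in
address order, and let the two 8-bit memories hold
\[
  E_k = D_{2k}, \qquad O_k = D_{2k+1} .
\]
A 16-bit access at address $a$ needs the bytes $D_a$ and $D_{a+1}$:
\[
  a = 2k:\ (E_k,\ O_k), \qquad
  a = 2k+1:\ (O_k,\ E_{k+1}) .
\]
In both cases exactly one byte comes from each memory, so the two
memories can be accessed at the same time even when the address is
odd.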

Three output formats are supported: .mif (Memory Initialization
Format), .hex (Intel Hex Format) and a binary format designed for
download via UART.

\section{Resource usage and speed}

The processor was synthesized with \textsl{Quartus II} for the
\textsl{Cyclone EP1C12Q240C8} FPGA, which provides 12060 logic cells
and 29952 bytes of on-chip memory.

The processor needs $\sim$3550 logic cells (29\%) when compiled for
maximum clock frequency, which is $\sim$60 MHz. When optimized for
area, it needs $\sim$2600 logic cells (22\%) at $\sim$25 MHz.

The processor uses 24832 bytes (83\%) of on-chip memory.
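
These percentages can be reproduced from the figures given in this
section. The memory total is also consistent with the sum of the
three memories listed in the General section (16\,KB instruction
ROM, 8\,KB data RAM and 256 bytes of data ROM). This is a
plausibility check, not an excerpt from a synthesis report:
\[
  \frac{3550}{12060} \approx 29.4\%, \qquad
  \frac{2600}{12060} \approx 21.6\%,
\]
\[
  16384 + 8192 + 256 = 24832, \qquad
  \frac{24832}{29952} \approx 82.9\% .
\]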

\section{Example}

\subsection{Reversing a line}

Listing~\ref{lst:uart} shows how to interface with the UART via
interrupts. The program reads a line from the UART and then writes
it back reversed. Lines 1 to 4 show how to instantiate memory
(the two bytes defined form the DOS-style end-of-line). Lines 7 to
25 initialize the registers and register the interrupt vectors;
line 28 forms a barrier against the rest of the code.

Lines 32 to 76 form the interrupt service routine. It first checks
whether it is operating in read or in write mode. When reading, it
reads from the UART and stores the result. A mode switch occurs when
a newline character is encountered. In write mode, the contents of
the buffer are written to the UART, and the routine switches back to
read mode when finished.

Figure~\ref{fig:sim} presents the results of the simulation.

\lstset{basicstyle=\footnotesize,numbers=left,numberstyle=\tiny}
\lstset{caption=Example for the UART and interrupts}
\lstset{label=lst:uart}
\lstinputlisting{uart_reverse.s}

\begin{figure}[ht!]
\centering
\includegraphics[width=.95\textwidth]{uart_sim}
\caption{Simulation results}
\label{fig:sim}
\end{figure}

\subsection{Computing factorials}

The example in listing~\ref{lst:fact} computes the factorials of
1 \ldots 9 and writes the results to the PC via UART. Note that the
last result transmitted will be wrong, because it is truncated to
16 bits.
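
The overflow can be checked by hand. Assuming that the truncation
keeps the low 16 bits and that the result is read as an unsigned
value (both are assumptions, not taken from the listing):
\[
  8! = 40320 < 2^{16} = 65536, \qquad
  9! = 362880 = 5 \cdot 65536 + 35200,
\]
so the value transmitted for $9!$ would be $35200$ instead of
$362880$.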

\lstset{basicstyle=\footnotesize,numbers=left,numberstyle=\tiny}
\lstset{caption=Computing factorials}
\lstset{label=lst:fact}
\lstinputlisting{factorial.s}


\section{Versions Of This Document}

2006-12-14: Draft version \textbf{0.1}

\noindent
2006-12-29: Draft version \textbf{0.2}
\begin{itemize}
\item A few refinements.
\end{itemize}

\noindent
2007-01-22: Draft version \textbf{0.3}
\begin{itemize}
\item Added another example.
\end{itemize}

\noindent
2007-02-02: Draft version \textbf{0.4}
\begin{itemize}
\item Updated resource usage and speed section.
\end{itemize}

\end{document}