\chapter{CPU Description}
\label{cpu_description}

This chapter describes the 'core' CPU (mips\_cpu.vhdl), excluding the cache.

\begin{figure}[h]
\makebox[\textwidth]{\framebox[9cm]{\rule{0pt}{9cm}
\includegraphics[width=8cm]{img/cpu_symbol.png}}}
\caption{CPU module interface\label{cpu_symbol}}
\end{figure}

The CPU module is not meant to be used directly. Instead, the MCU module
described in section~\ref{mcu_module} should be used.\\

The following sections describe the CPU structure and interfaces.\\

\section{Bus Architecture}
\label{bus_architecture}
The CPU uses a Harvard architecture: separate paths for code and data. It
has three separate, independent buses: code read, data read and data write.
(Data read and write share the same address bus).\\

The CPU will perform opcode fetches at the same time it does data reads
or writes. This is not only faster than a Von Neumann architecture, where
all memory cycles are bottlenecked through a single bus; it is much
simpler too. Most importantly, it matches the way the CPU will work
when connected to code and data caches.\\

(Actually, in the current state of the core, due to the inefficient way in
which load interlocking has been implemented, the core is less efficient
than that -- see section~\ref{interlocking_and_data_hazards}).\\

The core can't read and write data at the same time; this is a fundamental
limitation of the core structure: doing both at the same time would take
one more read port in the register bank -- too expensive. Besides, it's not
necessary given the 3-stage pipeline structure we use.\\

In the most basic use scenario (no external memory and no caches), code and
data buses have a common address space but different storages and the
architecture is strictly Harvard. When cache memory is connected
the architecture becomes a Modified Harvard -- because data and
code ultimately come from the same storage, on the same external
bus(es).\\

Note that the basic cpu module (mips\_cpu) is meant to be connected to
internal, synchronous BRAMs only (i.e. the cache BRAMs). Some of its
outputs are not registered because they needn't be. The parent module
(called 'mips\_mcu') has its outputs registered to limit $t_{co}$ to
acceptable values.\\


\subsection{Code and data read bus interface}
\label{code_data_buses}
Both buses have the same interface:\\

\begin{tabular}{ l l }
$\star$\_rd\_addr & Address bus\\
$\star$\_rd & Data/opcode bus\\
$\star$\_rd\_vma & Valid Memory Address (VMA)\\
\end{tabular}\\

The CPU assumes SYNCHRONOUS external memory (most or all FPGA architectures
have only synchronous RAM blocks):

When $\star$\_rd\_vma is active ('1'), $\star$\_rd\_addr is a valid read address and the
memory should provide read data at the next clock cycle.
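
The read protocol above can be sketched as a small behavioral model in Python.
This is only an illustration under stated assumptions, not code taken from the
core: the class name SyncReadPort and its method are hypothetical, and driving
an unknown value (None) when VMA is inactive is a modelling choice that mirrors
the X's in the chronograms (a real BRAM would typically hold its last output).

```python
class SyncReadPort:
    """Behavioral model of a synchronous (registered) memory read port."""

    def __init__(self, words):
        self.words = dict(words)   # word-aligned byte address -> 32-bit value
        self.rd = None             # models the *_rd bus; None stands for 'X'

    def clock(self, rd_vma, rd_addr):
        """Advance one rising clock edge, sampling the CPU outputs.

        The returned value is what the *_rd bus carries during the cycle
        that follows this edge.
        """
        if rd_vma:
            # The two low address bits are ignored: reads are whole words.
            self.rd = self.words.get(rd_addr & ~0x3)
        else:
            self.rd = None         # assumption: bus treated as unknown
        return self.rd


port = SyncReadPort({0x0700: 0x11111111, 0x0800: 0x22222222})
# VMA asserted on alternate cycles, as in a no-stall data read sequence.
stimulus = [(1, 0x0700), (0, 0), (1, 0x0800), (0, 0)]
trace = [port.clock(vma, addr) for vma, addr in stimulus]
```

Each entry of trace is the bus value in the cycle after the corresponding
address was presented, which is the one-cycle latency the CPU expects.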

The following ascii-art waveforms depict simple data and code read cycles
where there is no interlock -- interlock is discussed in
section~\ref{interlocking_and_data_hazards}.\\

\needspace{26\baselineskip}
\begin{verbatim}
==== Chronogram 3.1.A: data read cycle, no stall ======================
                     ____      ____      ____      ____      ____
clk             ____/    \____/    \____/    \____/    \____/    \____/

                     _________           _________
data_rd_vma     ____/         \_________/         \____________________

data_rd_addr    XXXX| 0x0700  |XXXXXXXXX| 0x0800  |XXXXXXXXX|XXXXXXXXX|

data_rd         XXXX|XXXXXXXXX| [0x700] |XXXXXXXXX| [0x800] |XXXXXXXXX|

(target reg)    ????|?????????|?????????| [0x700] |?????????| [0x800] |

          (data is registered here...)--^    (...and here)--^

==== Chronogram 3.1.B: code read cycle, no stall =====================
                     ____      ____      ____      ____      ____
clk             ____/    \____/    \____/    \____/    \____/    \____/

                     __________________________________________________
code_rd_vma     ____/         |         |         |         |         |

code_rd_addr    ????| 0x0100  | 0x0104  | 0x0108  | 0x010c  | 0x0200  |

code_rd         XXXX|XXXXXXXXX| [0x100] | [0x104] | [0x108] | [0x10c] |

p1_ir_reg       ????|?????????|?????????| [0x100] | [0x104] | [0x108] |

  (first code word is registered here)--^

========================================================================
\end{verbatim}

The data address bus is 32 bits wide; the lowest 2 bits are redundant in
read cycles since the CPU always reads full words, but they may be useful
for debugging.\\

\subsection{Data write interface}
\label{data_write_bus}

The write bus does not have a vma output because byte\_we fulfills the same
role:\\

\begin{tabular}{ l l }
data\_wr\_addr & Address bus\\
data\_wr & Data bus\\
byte\_we & Write enable (WE) for each of the four bytes
\end{tabular}\\

Write cycles are synchronous too. The four bytes in the word should be
handled separately -- the CPU will assert a combination of byte\_we bits
according to the size and alignment of the store.\\

When byte\_we(i) is active, the matching byte at data\_wr should be stored
at address data\_wr\_addr. byte\_we(0) is for the LSB, byte\_we(3) for the MSB.
Note that since the CPU is big endian, the MSB has the lowest address and
the LSB the highest. The memory system does not need to care about that.\\
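
The byte lane rules above can be illustrated with a short Python sketch. It is
a model written for this explanation, not code from the core: the function
names are hypothetical, and rejecting misaligned halfword or word stores is an
assumption of the sketch (the text does not say what the CPU does with them).

```python
def byte_we_for_store(addr, size):
    """Return the 4-bit byte_we mask for a store of `size` bytes at `addr`.

    Bit 3 is the MSB lane and bit 0 the LSB lane; because the CPU is big
    endian, the lowest byte address within a word maps to bit 3.
    """
    offset = addr & 0x3
    if size not in (1, 2, 4) or offset % size != 0:
        raise ValueError("unsupported or misaligned store")  # sketch-only policy
    lanes = (1 << size) - 1            # 0b0001, 0b0011 or 0b1111
    return lanes << (4 - size - offset)


def apply_write(old_word, data_wr, byte_we):
    """Merge the enabled byte lanes of data_wr into a stored 32-bit word."""
    mask = 0
    for lane in range(4):
        if byte_we & (1 << lane):
            mask |= 0xFF << (8 * lane)
    return (old_word & ~mask & 0xFFFFFFFF) | (data_wr & mask)


sb_mask = byte_we_for_store(0x0801, 1)   # byte store to the second byte
sh_mask = byte_we_for_store(0x0900, 2)   # halfword store to the upper half
merged = apply_write(0xAABBCCDD, 0x12345678, sb_mask)
```

Note that the memory side (apply\_write) never looks at the low address bits or
at endianness; it only obeys the lane enables, which is why the memory system
does not need to care about byte ordering.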

Write cycles never cause data-hazard stalls. They would take a single
clock cycle except for the inefficient cache implementation, which
stalls the CPU until the write-through is done.\\

This is the waveform for a basic write cycle (remember, this is without
cache; the cache would just stretch this diagram by a few cycles):

\needspace{20\baselineskip}
\begin{verbatim}

==== Chronogram 3.2: data write cycle =================================
                     ____      ____      ____      ____      ____
clk             ____/    \____/    \____/    \____/    \____/    \____/

byte_we         XXXX|  1111   |  0000   |  0100   |  1100   |  0000   |

data_wr_addr    XXXX| 0x0700  |XXXXXXXXX| 0x0800  | 0x0900  |XXXXXXXXX|

data_wr         XXXX|12345678h|XXXXXXXXX|12345678h|12345678h|XXXXXXXXX|

[0x0700]        ????|????????h|12345678h|12345678h|12345678h|12345678h|

[0x0800]        ????|????????h|????????h|????????h|??34????h|??34????h|

[0x0900]        ????|????????h|????????h|????????h|????????h|1234????h|

========================================================================
\end{verbatim}

Note the two back-to-back stores to addresses 0x0800 and 0x0900. They are
produced by two consecutive S$\star$ instructions (SB and SH in the example),
and can only be done this fast because of the Harvard architecture (and,
again, because the diagram does not account for the cache interaction).


\subsection{CPU stalls caused by memory accesses}
\label{memory_cpu_stalls}

The 'mem\_wait' input will unconditionally stall all pipeline
stages as long as it is active. It is meant to be used by the cache during
cache refills.\\

In short, the cache/memory controller stops the CPU on every data or code
miss for as long as it takes to do a line refill. The current cache
implementation does refills in order (i.e. not 'target address first').
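
The difference between the two refill orders can be made concrete with a small
Python sketch. It is an illustration only: the function name is hypothetical
and the line size of 4 words is an assumed value chosen for the example, not a
figure taken from the cache implementation.

```python
def refill_order(miss_addr, line_words=4, target_first=False):
    """List the word addresses fetched to refill the line holding miss_addr.

    line_words is an assumed line size and must be a power of two.
    """
    line_bytes = 4 * line_words
    base = miss_addr & ~(line_bytes - 1)
    addrs = [base + 4 * i for i in range(line_words)]
    if target_first:
        # 'Target address first': start at the missed word and wrap around.
        start = (miss_addr - base) // 4
        addrs = addrs[start:] + addrs[:start]
    return addrs


in_order = refill_order(0x0108)
wrapped = refill_order(0x0108, target_first=True)
```

Since mem\_wait is described as staying active for the whole line refill, the
order by itself would not shorten the stall; a target-first scheme only pays
off if the CPU is released as soon as the missed word arrives.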

Note that external memory wait states are a separate issue. They too are
handled in the cache/memory controller. See section~\ref{cache} on the memory
controller.

\section{Pipeline}
\label{pipeline}

Here is where I would explain the structure of the cpu in detail; these
brief comments will have to suffice until I write some real documentation.\\

This section could really use a diagram; since it can take me days to draw
one, that will have to wait for a further revision.\\

This core has a 3-stage pipeline quite different from the original
architecture spec. Instead of trying to use the original names for the
stages, I'll define my own.\\

A computational instruction of the I- or R- type goes through the following
stages during execution:\\

\begin{tabular}{ l l }
FETCH-0 & Instruction address is on the code\_rd\_addr bus\\
FETCH-1 & Instruction opcode is on the code\_rd bus\\
ALU/MEM & ALU operation or memory read/write cycle is done OR\\
        & Memory read/write address is on the data\_rd/wr\_addr bus OR\\
        & Memory write data is on the data\_wr bus\\
LOAD    & Memory read data is on the data\_rd bus
\end{tabular}\\

In the core source (mips\_cpu.vhdl) the stages have been numbered:\\

\begin{tabular}{ l l }
FETCH-1 & = stage 0\\
ALU/MEM & = stage 1\\
LOAD & = stage 2
\end{tabular}\\
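
The stage numbering can be summarised with a tiny Python helper that maps each
stage to the clock cycle in which it is active. This is a sketch for the
reader, assuming an instruction that is never stalled; the function name is
hypothetical and nothing here is taken from the VHDL source.

```python
def stage_timeline(fetch0_cycle, is_load=False):
    """Map each stage name to the clock cycle in which it is active.

    Assumes no stalls or interlocks between fetch and completion.
    """
    timeline = {
        "FETCH-0": fetch0_cycle,
        "FETCH-1 (stage 0)": fetch0_cycle + 1,
        "ALU/MEM (stage 1)": fetch0_cycle + 2,
    }
    if is_load:
        # Only memory loads need the extra LOAD stage.
        timeline["LOAD (stage 2)"] = fetch0_cycle + 3
    return timeline


alu_op = stage_timeline(0)
load_op = stage_timeline(0, is_load=True)
```
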

Here are a few examples:\\

\needspace{9\baselineskip}
\begin{verbatim}
==== Chronogram 3.3.A: stages for instruction "lui gp,0x1" ============
                     ____      ____      ____      ____      ____
clk             ____/    \____/    \____/    \____/    \____/    \____/

code_rd_addr        | 0x099c  |

code_rd_data                  |3c1c0001h|

rbank[$gp]                                        | 0x10000 |

                    |< fetch0>|<   0   >|<   1   >|
=======================================================================
\end{verbatim}


\needspace{17\baselineskip}
\begin{verbatim}
==== Chronogram 3.3.B: stages for instruction "lw a0,16(v0)" ==========
                     ____      ____      ____      ____      ____
clk             ____/    \____/    \____/    \____/    \____/    \____/

code_rd_addr        | 0x099c  |

code_rd_data                  |8c440010h|

data_rd_addr                            | $v0+16  |
                                         _________
data_rd_vma                        _____/         \______

data_rd                                           | <data>  |

rbank[$a0]                                                  | <data>  |

                    |< fetch0>|<   0   >|<   1   >|<   2   >|
========================================================================
\end{verbatim}

In the source code, all registers and signals in stage
\textless i\textgreater are prefixed by
"p\textless i\textgreater\_", as in p0\_*, p1\_* and p2\_*.
A stage includes a set of registers and
all the logic that feeds from those registers (actually, all the logic
that is between registers p0\_* and p1\_* belongs in stage 0, and so on).
Since there are signals that feed from more than one pipeline stage (for
example p2\_wback\_mux\_sel, which controls the register bank write port
data multiplexor and feeds from p1 and p2), the naming convention has to be
a little flexible.\\

FETCH-0 would only include the logic between p0\_pc\_reg and the code ram
address port, so it has been omitted from the naming convention.\\

All read and write ports of the register bank are synchronous. The read
ports belong logically to stage 1 and the write port to stage 2.\\

IMPORTANT: though the register bank read port is synchronous, its data can
be used in stage 1 because it is read early (the read port is loaded at the
same time as the instruction opcode). That is, a small part of the
instruction decoding is done in stage FETCH-1. Bearing in mind that the code
ram is meant to be the exact same type of block as the register bank (or
faster if the register bank is implemented with distributed RAM), and that we
will cram the whole ALU delay plus the reg bank delay in stage 1, it does
not hurt to move a tiny part of the decoding to the previous cycle.\\

All registers but a few exceptions belong squarely to one of the pipeline
stages:\\

\begin{tabular}{ l l }
Stage 0: & \\
p0\_pc\_reg & PC \\
(external code ram read port register) & Loads the same as PC \\
 & \\
Stage 1: & \\
p1\_ir\_reg & Instruction register \\
(register bank read port register) & \\
p1\_rbank\_forward & Feed-forward data (hazards) \\
p1\_rbank\_rs\_hazard & Rs hazard detected \\
p1\_rbank\_rt\_hazard & Rt hazard detected \\
 & \\
Stage 2: & \\
p2\_exception & Exception control \\
p2\_do\_load & Load from data\_rd \\
p2\_ld\_* & Load control\\
(register bank write port register) & \\
\end{tabular}\\

Note how the register bank ports belong in different stages even if it's
the same physical device. No conflict here, hazards are handled properly
(logic defined with explicit vendor-independent vhdl code, not with
synthesis pragmas, etc.).\\

There is a small number of global registers that don't belong to any
pipeline stage:\\

\begin{tabular}{ p{4cm} l }
pipeline\_stalled & Together, these two signals...\\
pipeline\_interlocked & ...control pipeline stalls
\end{tabular}\\


And of course there are the special registers that are part of the CPU
programmer's model:\\

\begin{tabular}{ p{4cm} l }
mdiv\_hi\_reg & register HI from multiplier block\\
mdiv\_lo\_reg & register LO from multiplier block\\
cp0\_status & register CP0[status]\\
cp0\_epc & register CP0[epc]\\
cp0\_cause & register CP0[cause]
\end{tabular}\\

These belong logically to pipeline stage 1 (can be considered an extension
of the register bank) but have been spared the prefix for clarity.\\

Note that the CP0 status and cause registers are only partially implemented.\\

Again, this needs a better explanation and a diagram.\\


\section{Interlocking and Data Hazards}
\label{interlocking_and_data_hazards}

There are two data hazards we need to care about:\\

a) If an instruction needs to access a register which was modified by the
previous instruction, we have a data hazard -- because the register bank is
synchronous, a memory location can't be read in the same cycle it is updated
-- we will get the pre-update value.

b) A memory load into a register Rd produces its result a cycle late, so if
the instruction after the load needs to access Rd there is a conflict.\\


Conflict (a) is solved with some data forwarding logic: if we detect the
data hazard, the register bank uses a 'feed-forward' value instead of the
value read from the memory file. \\
In file mips\_cpu.vhdl, see process 'data\_forward\_register' and the following
few lines, where the hazard detection logic and data register and
multiplexors are implemented. Note that hazard is detected separately for
both read ports of the reg bank (p0\_rbank\_rs\_hazard and p0\_rbank\_rt\_hazard).
Note too that this logic is strictly regular vhdl code -- no need to rely here
on the synthesis tool to add the bypass logic for us. This gets us some
measure of vendor independence.\\
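
The forwarding idea for conflict (a) can be sketched with a behavioral Python
model. It is not a transcription of mips\_cpu.vhdl: the class and attribute
names are hypothetical (they only echo the role of p1\_rbank\_forward and the
hazard flags), and ignoring writes to register 0 is an assumption of the
sketch.

```python
class ForwardingRegBank:
    """Synchronous register bank whose read ports are bypassed on a hazard."""

    def __init__(self):
        self.regs = [0] * 32
        self.rs_raw = self.rt_raw = 0          # read port output registers
        self.forward = 0                       # registered copy of written data
        self.rs_hazard = self.rt_hazard = False

    def clock(self, rs_num, rt_num, we, wr_addr, wr_data):
        """One rising edge: both reads and the write happen simultaneously."""
        we = bool(we) and wr_addr != 0         # assumption: $zero never written
        # Synchronous reads return the contents from before this edge.
        self.rs_raw = self.regs[rs_num]
        self.rt_raw = self.regs[rt_num]
        # Hazard: a port reads the register being written on this same edge.
        self.rs_hazard = we and wr_addr == rs_num
        self.rt_hazard = we and wr_addr == rt_num
        if we:
            self.forward = wr_data
            self.regs[wr_addr] = wr_data

    @property
    def rs(self):
        """Rs operand after the bypass multiplexer."""
        return self.forward if self.rs_hazard else self.rs_raw

    @property
    def rt(self):
        """Rt operand after the bypass multiplexer."""
        return self.forward if self.rt_hazard else self.rt_raw


bank = ForwardingRegBank()
# Write register 4 on the same edge on which the next instruction reads it.
bank.clock(rs_num=4, rt_num=5, we=True, wr_addr=4, wr_data=0x1234)
stale, bypassed = bank.rs_raw, bank.rs
```

The raw read port still returns the pre-update value (stale), while the
multiplexed output (bypassed) carries the value just written; all of it is
ordinary clocked logic, with no reliance on tool-inferred bypass paths.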

As for conflict (b), in the original MIPS-I architecture it was the job
of the programmer to make sure that a loaded value was not used before it
was available -- by inserting NOPs after the load instruction, if necessary.
This is the so-called 'load delay slot', as discussed in \cite[p.~13-1]{r3k_ref_man}.\\

The C toolchain needs to be set up for MIPS-I compliance in order to build
object code compatible with this scheme.\\
But all succeeding versions of the MIPS architecture implement a
different scheme instead, 'load interlock' (\cite[p.~28]{see_mips_run}). You often see
code that relies on this behavior generated by gcc, even when using the -mips1 flag (this
is probably due to the precompiled support code or libc, I have to check).\\
In short, it pays to implement load interlocks so this core does, but the
feature should be optional through a generic.\\


Load interlock is triggered in stage 1 (ALU/MEM) of the load instruction;
when triggered, the pipeline stages 0 and 1 stall, but the pipeline stage
2 is allowed to proceed. That is, PC and IR are frozen but the value loaded
from memory is written in the register bank.\\

In the current implementation, the instruction following the load is
UNCONDITIONALLY stalled, even if it does not use the target register of the
load. This prevents, for example, interleaving read and write memory cycles
back to back, which the CPU otherwise could do.\\
So the interlock should only be triggered when necessary; this has to be
fixed.\\
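
What 'only when necessary' could mean is sketched below in Python: the
interlock is needed only if the instruction after the load reads the load's
target register. This is a simplified illustration, not the core's logic: the
function names are hypothetical, only MIPS-I integer encodings are considered,
and coprocessor instructions are treated conservatively (assumed to read rt).

```python
STORES = {0x28, 0x29, 0x2A, 0x2B, 0x2E}              # SB, SH, SWL, SW, SWR
LOADS = {0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26}   # LB, LH, LWL, LW, LBU, LHU, LWR


def source_regs(word):
    """Return the set of general registers a 32-bit instruction word reads."""
    op = word >> 26
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    if op in (0x02, 0x03):              # J, JAL: no register operands
        return set()
    if 0x10 <= op <= 0x13:              # COPz: conservative, assume rt is read
        return {rt}
    if op == 0x00 or op in (0x04, 0x05) or op in STORES or op in (0x22, 0x26):
        return {rs, rt}                 # SPECIAL, BEQ/BNE, stores, LWL/LWR
    return {rs}                         # I-type ALU, other loads, REGIMM, ...


def needs_interlock(load_word, next_word):
    """True if next_word reads the register that load_word is loading."""
    if (load_word >> 26) not in LOADS:
        return False
    target = (load_word >> 16) & 0x1F
    return target != 0 and target in source_regs(next_word)


# Opcodes taken from a short listing: lb a0,16(v0) ; sb a0,0(s4) ; lb a0,17(v0) ; nop
real_hazard = needs_interlock(0x80440010, 0xA2840000)
no_hazard = needs_interlock(0x80440011, 0x00000000)
```

Erring on the conservative side only costs an unnecessary stall, whereas a
missed dependency would make the following instruction read a stale register.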


\needspace{17\baselineskip}
\begin{verbatim}
==== Chronogram 3.4.A: data read cycle, showing interlock =============
                     ____      ____      ____      ____      ____
clk             ____/    \____/    \____/    \____/    \____/    \____/

code_rd_addr    XXXX| 0x099c  | 0x09a0  | 0x09a4  | 0x09a8  |

byte_we         XXXX|  0000   |  1111   |  0000   |  0000   |  1100   |

                                        |<       >|
                ________________________           ____________________
code_rd_vma                             \_________/
                                         _________
data_rd_vma     ________________________/         \____________________

data_rd_addr    XXXX|XXXXXXXXX|XXXXXXXXX| 0x0700  |XXXXXXXXX|

data_rd         XXXX|XXXXXXXXX|XXXXXXXXX|XXXXXXXXX| [0x700] |XXXXXXXXX|

(target reg)    ????|?????????|?????????|?????????| [0x700] |?????????|

                       (data is registered here)--^

========================================================================
\end{verbatim}

Note how a fetch cycle is delayed.

This waveform was produced by this code:

\begin{verbatim}
    ...
    998:    ac430010    sw      v1,16(v0)
    99c:    80440010    lb      a0,16(v0)
    9a0:    a2840000    sb      a0,0(s4)
    9a4:    80440011    lb      a0,17(v0)
    9a8:    00000000    nop
    ...
\end{verbatim}

Note how read and write cycles are spaced instead of being interleaved, as
they would be if interlocking were implemented efficiently (in this example,
there was a real hazard, register \$a0, but that's a coincidence -- I need to
find a better example in the listing files...).


\section{Exceptions}
\label{exceptions}

The only exceptions supported so far are software exceptions, and of those
only the instructions BREAK and SYSCALL, the 'unimplemented opcode' trap and
the 'user-mode access to CP0' trap.\\
Memory privileges are not and will not be implemented. Hardware/software
interrupts are still unimplemented too.\\

Exceptions are meant to work as in the R3000 CPUs except for the vector
address.\\
They save their own address to EPC, update the SR, abort the following
instruction, and jump to the exception vector 0x0180. All as per the specs
except the vector address (we only use one).\\

The following instruction is aborted even if it is a load or a jump, and
traps work as specified even from a delay slot -- in that case, the address
saved to EPC is not the victim instruction's but the preceding jump
instruction's, as explained in \cite[p.~64]{see_mips_run}.\\
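
The EPC rule described above can be written down as a few lines of Python.
This is a sketch of the documented behaviour only: the function name is
hypothetical, and the SR update and the abort of the following instruction are
deliberately left out of the model.

```python
EXCEPTION_VECTOR = 0x0180     # the single vector address used by this core


def take_exception(victim_pc, in_delay_slot):
    """Return (epc, next_pc) for a trap raised by the instruction at victim_pc."""
    # From a delay slot, EPC points at the preceding jump/branch so that the
    # handler can resume by re-executing the jump together with its slot.
    epc = victim_pc - 4 if in_delay_slot else victim_pc
    return epc & 0xFFFFFFFF, EXCEPTION_VECTOR


plain = take_exception(0x09A0, in_delay_slot=False)
in_slot = take_exception(0x09A4, in_delay_slot=True)
```
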

Plasma used to save in epc the address of the instruction after break or
syscall, and used a nonstandard vector address (0x03c). This core will go
the standard R3000 way instead.\\

Note that the epc register is not used by any instruction other than mfc0.
RTE is implemented and works as per R3000 specs.\\


\section{Multiplier}
\label{multiplier}

The core includes a multiplier/divider module which is a slightly modified
version of Plasma's multiplier unit. Changes have been commented in the
source code.\\

The main difference is that Plasma does not stall the pipeline while a
multiplication/division is going on. It only does so when you attempt to read
registers HI or LO while the multiplier is still running. Only then will
the pipeline stall until the operation completes.\\
This core instead always stalls for as long as it takes to do the
operation. Not only is it simpler this way, it will also be easier to
abort mult/div instructions.\\
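
The cost of the two stall policies can be compared with a rough Python model.
Everything in it is an assumption made for illustration: the function name,
the three instruction categories, and in particular the latency of 32 cycles
used in the example, which is a placeholder and not a figure from the
multiplier source. A second mult/div issued while the unit is busy is not
modelled.

```python
def stall_cycles(program, latency, stall_always):
    """Count stall cycles for a list of 'muldiv' / 'hilo' / 'other' entries.

    stall_always=True models this core (the pipeline waits out every
    operation); False models the Plasma policy (wait only when HI or LO is
    read while the unit is still busy).
    """
    stalls = 0
    busy = 0                    # cycles left until the result is ready
    for kind in program:
        if kind == "hilo" and busy > 0:
            stalls += busy      # wait for the operation to finish
            busy = 0
        if kind == "muldiv":
            if stall_always:
                stalls += latency
            else:
                busy = latency
        elif busy > 0:
            busy -= 1           # the unit keeps working behind this instruction
    return stalls


program = ["muldiv", "other", "other", "hilo"]
this_core = stall_cycles(program, latency=32, stall_always=True)
plasma_like = stall_cycles(program, latency=32, stall_always=False)
```

With only two independent instructions between the operation and the read of
HI/LO, the Plasma-style policy hides just two of the assumed 32 cycles, which
gives an idea of how little is lost by always stalling in code of this shape.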

The logic dealing with mul/div stalls is a bit convoluted and could use some
explaining and some ascii chronogram. Again, TBD.\\