\chapter{CPU Description}
\label{cpu_description}

    This chapter describes the 'core' CPU (mips\_cpu.vhdl), excluding the cache.

    \begin{figure}[h]
    \makebox[\textwidth]{\framebox[9cm]{\rule{0pt}{9cm}
    \includegraphics[width=8cm]{img/cpu_symbol.png}}}
    \caption{CPU module interface\label{cpu_symbol}}
    \end{figure}

    The CPU module is not meant to be used directly. Instead, the SoC module
    described in chapter~\ref{soc_module} should be used.\\

    The following sections describe the CPU structure and interfaces.\\

\section{Bus Architecture}
\label{bus_architecture}
    The CPU uses a Harvard architecture: separate paths for code and data. It
    has three separate, independent buses: code read, data read and data write.
    (Data read and write share the same address bus.)\\

    The CPU performs opcode fetches at the same time it does data reads or
    writes. This is not only faster than a Von Neumann architecture, where all
    memory cycles have to be bottlenecked through a single bus; it is much
    simpler too. And most importantly, it matches the way the CPU works
    when connected to code and data caches.\\

    (Actually, in the current state of the core, due to the inefficient way in
    which load interlocking has been implemented, the core is less efficient
    than that -- more on this later.)\\

    The core can't read and write data at the same time; this is a fundamental
    limitation of the core structure: doing both at the same time would require
    one more read port in the register bank -- too expensive. Besides, it's not
    necessary given the 3-stage pipeline structure we use.\\

    In the most basic use scenario (no external memory and no caches), the code
    and data buses have a common address space but different storage, and the
    architecture is strictly Harvard. When cache memory is connected
    the architecture becomes a Modified Harvard -- because data and
    code ultimately come from the same storage, on the same external
    bus(es).\\

    Note that the basic CPU module (mips\_cpu) is meant to be connected to
    internal, synchronous BRAMs only (i.e. the cache BRAMs). Some of its
    outputs are not registered because they need not be. The parent module
    (called 'mips\_soc') has its outputs registered to limit $t_{co}$ to
    acceptable values.\\


\subsection{Code and data read bus interface}
\label{code_data_buses}
    Both buses have the same interface:\\

\begin{tabular}{ l l }
    $\star$\_rd\_addr  & Address bus\\
    $\star$\_rd        & Data/opcode bus\\
    $\star$\_rd\_vma   & Valid Memory Address (VMA)\\
\end{tabular}\\

    The CPU assumes SYNCHRONOUS external memory (most or all FPGA architectures
    have only synchronous RAM blocks):

    When $\star$\_rd\_vma is active ('1'), $\star$\_rd\_addr is a valid read
    address and the memory should provide the read data in the next clock
    cycle.

    The following ascii-art waveforms depict simple data and code read cycles
    where there is no interlock -- interlock is discussed in
    section~\ref{interlocking_and_data_hazards}.\\

\needspace{26\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.1.A: data read cycle, no stall ======================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

                         _________           _________
     data_rd_vma    ____/         \_________/         \____________________

     data_rd_addr   XXXX| 0x0700  |XXXXXXXXX| 0x0800  |XXXXXXXXX|XXXXXXXXX|

     data_rd        XXXX|XXXXXXXXX| [0x700] |XXXXXXXXX| [0x800] |XXXXXXXXX|

     (target reg)   ????|?????????|?????????| [0x700] |?????????| [0x800] |

              (data is registered here...)--^    (...and here)--^

     ==== Chronogram 3.1.B: code read cycle, no stall =====================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

                         __________________________________________________
    code_rd_vma     ____/         |         |         |         |         |

    code_rd_addr    ????| 0x0100  | 0x0104  | 0x0108  | 0x010c  | 0x0200  |

    code_rd         XXXX|XXXXXXXXX| [0x100] | [0x104] | [0x108] | [0x10c] |

    p1_ir_reg       ????|?????????|?????????| [0x100] | [0x104] | [0x108] |

      (first code word is registered here)--^

    ========================================================================
\end{verbatim}\\
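    The handshake above boils down to a single rule: an address presented while
    $\star$\_rd\_vma is high is answered with the memory contents in the next
    cycle. The following Python sketch is purely illustrative (it is not part of
    the VHDL sources, and all names in it are made up); it reproduces
    chronogram 3.1.A:

```python
# Behavioral model of the synchronous read buses: when vma is high in a
# cycle, the memory drives mem[addr] on the rd bus during the NEXT cycle.
def simulate_reads(mem, cycles):
    """cycles is a list of (vma, addr) pairs, one per clock cycle.
    Returns the value seen on the rd bus in each cycle (None = don't care)."""
    rd = []
    pending = None
    for vma, addr in cycles:
        rd.append(mem[pending] if pending is not None else None)
        pending = addr if vma else None   # capture the address on this edge
    return rd

# The two reads of chronogram 3.1.A: vma pulses in cycles 0 and 2,
# so the data appears on the bus in cycles 1 and 3.
mem = {0x700: 0xAAAA, 0x800: 0xBBBB}
bus = simulate_reads(mem, [(1, 0x700), (0, 0), (1, 0x800), (0, 0), (0, 0)])
```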

    The data address bus is 32 bits wide; the lowest 2 bits are redundant in
    read cycles since the CPU always reads full words, but they may be useful
    for debugging.\\

\subsection{Data write interface}
\label{data_write_bus}

    The write bus does not have a vma output because byte\_we fulfills the same
    role:\\

\begin{tabular}{ l l }
    $\star$\_wr\_addr   & Address bus\\
    $\star$\_wr         & Data/opcode bus\\
    byte\_we            & WE for each of the four bytes
\end{tabular}\\

    Write cycles are synchronous too. The four bytes in the word should be
    handled separately -- the CPU will assert a combination of byte\_we bits
    according to the size and alignment of the store.\\

    When byte\_we(i) is active, the matching byte of data\_wr should be stored
    at address data\_wr\_addr; byte\_we(0) is for the LSB and byte\_we(3) for
    the MSB. Note that since the CPU is big endian, the MSB has the lowest
    address and the LSB the highest. The memory system does not need to care
    about that.\\
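    The lane-selection rule can be made concrete with a small model of the
    memory side of the bus. This Python sketch is a hypothetical helper (not
    taken from the sources): it applies one write cycle to a word-addressed
    memory and reproduces the end state shown in chronogram 3.2:

```python
# Memory side of the write bus: for each asserted byte_we bit, store the
# matching lane of data_wr. Lane 3 (bits 31:24) is the MSB, which holds
# the byte with the lowest address on this big-endian CPU.
def write_word(mem, addr, byte_we, data_wr):
    word = mem.get(addr & ~3, 0)
    for lane in range(4):                    # lane i covers bits 8i+7..8i
        if byte_we & (1 << lane):
            mask = 0xFF << (8 * lane)
            word = (word & ~mask) | (data_wr & mask)
    mem[addr & ~3] = word

# The three stores of chronogram 3.2: an SW, an SB and an SH.
mem = {}
write_word(mem, 0x0700, 0b1111, 0x12345678)  # SW: all four lanes
write_word(mem, 0x0800, 0b0100, 0x12345678)  # SB: lane 2 only -> ??34????
write_word(mem, 0x0900, 0b1100, 0x12345678)  # SH: top lanes   -> 1234????
```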

    Write cycles never cause data-hazard stalls. They would take a single
    clock cycle were it not for the inefficient cache implementation, which
    stalls the CPU until the write-through is done.\\

    This is the waveform for a basic write cycle (remember, this is without
    the cache; the cache would just stretch this diagram by a few cycles):

\needspace{20\baselineskip}
\begin{verbatim}

    ==== Chronogram 3.2: data write cycle =================================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     byte_we        XXXX|  1111   |  0000   |  0100   |  1100   |  0000   |

     data_wr_addr   XXXX| 0x0700  |XXXXXXXXX| 0x0800  | 0x0900  |XXXXXXXXX|

     data_wr        XXXX|12345678h|XXXXXXXXX|12345678h|12345678h|XXXXXXXXX|

     [0x0700]       ????|????????h|12345678h|12345678h|12345678h|12345678h|

     [0x0800]       ????|????????h|????????h|????????h|??34????h|??34????h|

     [0x0900]       ????|????????h|????????h|????????h|????????h|1234????h|

    ========================================================================
\end{verbatim}\\

    Note the two back-to-back stores to addresses 0x0800 and 0x0900. They are
    produced by two consecutive S$\star$ instructions (SB and SH in the example),
    and can only be done this fast because of the Harvard architecture (and,
    again, because the diagram does not account for the cache interaction).


\subsection{CPU stalls caused by memory accesses}
\label{memory_cpu_stalls}

    In short, the 'mem\_wait' input unconditionally stalls all pipeline
    stages for as long as it is active. It is meant to be used by the cache
    during cache refills.\\

    The cache/memory controller stops the CPU on all data/code
    misses for as long as it takes to do a line refill. The current cache
    implementation does refills in reverse order (i.e. not 'target address first').

    Note that external memory wait states are a separate issue. They too are
    handled in the cache/memory controller. See section~\ref{cache} on the memory
    controller.


\section{Pipeline}
\label{pipeline}

    Here is where I would explain the structure of the CPU in detail; these
    brief comments will have to suffice until I write some real documentation.\\

    This section could really use a diagram; since it can take me days to draw
    one, that will have to wait for a further revision.\\

    This core has a 3-stage pipeline quite different from the original
    architecture spec. Instead of trying to use the original names for the
    stages, I'll define my own.\\

    A computational instruction of the I- or R-type goes through the following
    stages during execution:\\

    \begin{tabular}{ l l }
        FETCH-0   & Instruction address is on the code\_rd\_addr bus\\
        FETCH-1   & Instruction opcode is on the code\_rd bus\\
        ALU/MEM   & ALU operation or memory read/write cycle is done, OR\\
                  &   memory read/write address is on the data\_rd/wr\_addr bus, OR\\
                  &   memory write data is on the data\_wr bus\\
        LOAD      & Memory read data is on the data\_rd bus
    \end{tabular}\\

    In the core source (mips\_cpu.vhdl) the stages have been numbered:\\

    \begin{tabular}{ l l }
        FETCH-1 & = stage 0\\
        ALU/MEM & = stage 1\\
        LOAD    & = stage 2
    \end{tabular}\\

    Here are a few examples:\\

\needspace{9\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.3.A: stages for instruction "lui gp,0x1" ============
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr       | 0x099c  |

     code_rd_data                 |3c1c0001h|

     rbank[$gp]                             | 0x0001  |

                        |< fetch0>|<   0   >|<   1   >|
    =======================================================================
\end{verbatim}\\


\needspace{17\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.3.B: stages for instruction "lw a0,16(v0)" ==========
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr       | 0x099c  |

     code_rd_data                 |8c440010h|

     data_rd_addr                           | $v0+16  |
                                             _________
     data_rd_vma                       _____/         \______

     data_rd                                          | <data>  |

     rbank[$a0]                                                 | <data>  |

                        |< fetch0>|<   0   >|<   1   >|<   2   >|
    ========================================================================
\end{verbatim}\\

    In the source code, all registers and signals in stage
    \textless i\textgreater{} are prefixed by
    "p\textless i\textgreater\_", as in p0\_*, p1\_* and p2\_*.
    A stage includes a set of registers and
    all the logic that feeds from those registers (actually, all the logic
    that is between registers p0\_* and p1\_* belongs in stage 0, and so on).
    Since there are signals that feed from more than one pipeline stage (for
    example p2\_wback\_mux\_sel, which controls the register bank write port
    data multiplexor and feeds from p1 and p2), the naming convention has to be
    a little flexible.\\

    FETCH-0 would only include the logic between p0\_pc\_reg and the code ram
    address port, so it has been omitted from the naming convention.\\

    All read and write ports of the register bank are synchronous. The read
    ports belong logically to stage 1 and the write port to stage 2.\\

    IMPORTANT: though the register bank read port is synchronous, its data can
    be used in stage 1 because it is read early (the read address port is loaded
    at the same time as the instruction opcode). That is, a small part of the
    instruction decoding is done in stage FETCH-1, by feeding the source
    register index field straight from the code bus to the register bank BRAM.

    Bearing in mind that the code cache
    RAM is meant to be the exact same type of block as the register bank (or
    faster if the register bank is implemented with distributed RAM), and that
    we will cram the whole ALU delay plus the reg bank delay into stage 1, it
    does not hurt to move a tiny part of the decoding to the previous cycle.\\
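    As a concrete illustration of the early read: in the MIPS-I encoding the
    source register indices sit at fixed bit positions of the instruction word,
    so they can be sliced straight off the code bus before any other decoding
    happens. A sketch in Python (the helper name is made up; the field
    positions are the standard MIPS-I ones):

```python
# Slice the source register index fields out of an opcode, as the
# code-bus-to-register-bank path does: rs is bits 25:21, rt is bits 20:16.
def source_indices(opcode):
    rs = (opcode >> 21) & 0x1F
    rt = (opcode >> 16) & 0x1F
    return rs, rt

# "lw a0,16(v0)" from chronogram 3.3.B: rs = $v0 (2), rt = $a0 (4).
rs, rt = source_indices(0x8C440010)
```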

    All registers, with a few exceptions, belong squarely to one of the pipeline
    stages:\\

    \begin{tabular}{ l l }
    Stage 0:                                & \\
    p0\_pc\_reg                             & PC \\
    (external code ram read port register)  & Loads the same value as PC \\
                                            & \\
    Stage 1:                                & \\
    p1\_ir\_reg                             & Instruction register \\
    (register bank read port register)      &  \\
    p1\_rbank\_forward                      & Feed-forward data (hazards) \\
    p1\_rbank\_rs\_hazard                   & Rs hazard detected \\
    p1\_rbank\_rt\_hazard                   & Rt hazard detected \\
                                            & \\
    Stage 2:                                & \\
    p2\_exception                           & Exception control \\
    p2\_do\_load                            & Load from data\_rd \\
    p2\_ld\_*                               & Load control\\
    (register bank write port register)     & \\
    \end{tabular}\\

    Note how the register bank ports belong in different stages even though they
    are part of the same physical device. There is no conflict here; hazards are
    handled properly (with logic defined in explicit, vendor-independent VHDL
    code, not with synthesis pragmas, etc.).\\

    There is a small number of global registers that don't belong to any
    pipeline stage:\\

    \begin{tabular}{ p{4cm} l }
    pipeline\_stalled                        & Together, these two signals...\\
    pipeline\_interlocked                    & ...control pipeline stalls
    \end{tabular}\\


    And of course there are special registers visible in the CPU programmer's
    model:\\

    \begin{tabular}{ p{4cm} l }
        mdiv\_hi\_reg     & register HI from multiplier block\\
        mdiv\_lo\_reg     & register LO from multiplier block\\
        cp0\_status      & register CP0[status]\\
        cp0\_epc         & register CP0[epc]\\
        cp0\_cause       & register CP0[cause]
    \end{tabular}\\

    These belong logically to pipeline stage 1 (they can be considered an
    extension of the register bank) but have been spared the prefix for
    clarity.\\

    Note that the CP0 status and cause registers are only partially implemented.\\

    Again, this needs a better explanation and a diagram.\\


\section{Interlocking and Data Hazards}
\label{interlocking_and_data_hazards}

    There are two data hazards we need to care about:\\

    a) If an instruction needs to access a register which was modified by the
    previous instruction, we have a data hazard -- because the register bank is
    synchronous, a memory location can't be read in the same cycle it is
    updated; we would get the pre-update value.

    b) A memory load into a register Rd produces its result a cycle late, so if
    the instruction after the load needs to access Rd there is a conflict.\\

    Conflict (a) is solved with some data forwarding logic: if we detect the
    data hazard, the register bank uses a 'feed-forward' value instead of the
    value read from the register file.\\
    In file mips\_cpu.vhdl, see process 'data\_forward\_register' and the
    following few lines, where the hazard detection logic, data register and
    multiplexors are implemented. Note that the hazard is detected separately for
    both read ports of the reg bank (p0\_rbank\_rs\_hazard and p0\_rbank\_rt\_hazard).
    Note too that this logic is strictly regular VHDL code -- no need to rely
    here on the synthesis tool to add the bypass logic for us. This gets us
    some measure of vendor independence.\\
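    The detection-plus-bypass scheme can be paraphrased in a few lines. This is
    a behavioral paraphrase of the idea in Python, not the actual VHDL; the
    function names are illustrative:

```python
# Conflict (a): the previous instruction writes a register in the same
# cycle this instruction reads it, so the synchronous read port returns
# stale data and the forwarded value must be used instead.
def read_hazard(prev_we, prev_rd_index, read_index):
    # writes to $zero are discarded, so they never create a hazard
    return prev_we and prev_rd_index == read_index and prev_rd_index != 0

def read_port(rbank_data, forward_data, hazard):
    # the bypass multiplexor in front of the stage-1 operand
    return forward_data if hazard else rbank_data

h = read_hazard(True, 5, 5)            # previous instr writes $5, we read $5
value = read_port(0xDEAD, 0xBEEF, h)   # stale 0xDEAD is bypassed by 0xBEEF
```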

    As for conflict (b), in the original MIPS-I architecture it was the job
    of the programmer to make sure that a loaded value was not used before it
    was available -- by inserting NOPs after the load instruction, if necessary.
    This is what I call the 'load delay slot', as discussed in
    \cite[p.~13-1]{r3k_ref_man}.\\

    The C toolchain needs to be set up for MIPS-I compliance in order to build
    object code compatible with this scheme.\\
    But all succeeding versions of the MIPS architecture implement a
    different scheme instead, the 'load interlock' (\cite[p.~28]{see_mips_run}).
    You often see this behavior in code generated by gcc, even when using the
    -mips1 flag (this is probably due to the precompiled support code or libc;
    I have to check).\\
    In short, it pays to implement load interlocks, so this core does; but the
    feature should be made optional through a generic.\\


    The load interlock is triggered in stage 1 (ALU/MEM) of the load
    instruction; when triggered, pipeline stages 0 and 1 stall, but pipeline
    stage 2 is allowed to proceed. That is, PC and IR are frozen but the value
    loaded from memory is written into the register bank.\\

    In the current implementation, the instruction following the load is
    UNCONDITIONALLY stalled, even if it does not use the target register of the
    load. This prevents, for example, interleaving read and write memory cycles
    back to back, which the CPU could otherwise do.\\
    So the interlock should only be triggered when necessary; this has to be
    fixed.\\
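    The fix suggested above amounts to a one-line condition: the interlock is
    only needed when the next instruction actually reads the load's target
    register. Sketched in Python (names are illustrative; the current core
    behaves as if this check always returned true):

```python
# Conflict (b): a load's result arrives a cycle late, so a stall is only
# needed when the very next instruction reads the load target register.
def interlock_needed(load_target, next_rs, next_rt):
    return load_target != 0 and load_target in (next_rs, next_rt)

# "lb a0,16(v0)" followed by "sb a0,0(s4)": $a0 (reg 4) is read by the
# store (its rt field), so a stall really is needed in this case.
stall = interlock_needed(4, 20, 4)     # s4 = $20, a0 = $4
```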


\needspace{17\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.4.A: data read cycle, showing interlock =============
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr   XXXX| 0x099c  | 0x09a0            | 0x09a4  | 0x09a8  |

     byte_we        XXXX|  0000   |  1111   |  0000   |  0000   |  1100   |

                                            |<                 >|
                    ________________________           ____________________
     code_rd_vma                            \_________/
                                             _________
     data_rd_vma    ________________________/         \____________________

     data_rd_addr   XXXX|XXXXXXXXX|XXXXXXXXX| 0x0700            |XXXXXXXXX|

     data_rd        XXXX|XXXXXXXXX|XXXXXXXXX|XXXXXXXXX| [0x700] |XXXXXXXXX|

     (target reg)   ????|?????????|?????????|?????????| [0x700] |?????????|

                           (data is registered here)--^

    ========================================================================
\end{verbatim}\\

    Note how a fetch cycle is delayed.

    This waveform was produced by this code:

\begin{verbatim}
                ...
                998:    ac430010    sw  v1,16(v0)
                99c:    80440010    lb  a0,16(v0)
                9a0:    a2840000    sb  a0,0(s4)
                9a4:    80440011    lb  a0,17(v0)
                9a8:    00000000    nop
                ...
\end{verbatim}\\

    Note how the read and write cycles are spaced out instead of being
    interleaved, as they would be if interlocking were implemented efficiently
    (in this example there is a real hazard on register \$a0, but that is
    coincidental -- I need to find a better example in the listing files...).


\section{Exceptions}
\label{exceptions}

    The only exceptions supported so far are software exceptions, and of those
    only the instructions BREAK and SYSCALL, the 'unimplemented opcode' trap and
    the 'user-mode access to CP0' trap.\\
    Memory privileges are not and will not be implemented. Hardware/software
    interrupts are still unimplemented too.\\

    Exceptions are meant to work as in the R3000 CPUs except for the vector
    address.\\
    They save their own address to EPC, update the SR, abort the following
    instruction, and jump to the exception vector 0x0180. All as per the specs
    except for the vector address (we only use one).\\

    The following instruction is aborted even if it is a load or a jump, and
    traps work as specified even from a delay slot -- in that case, the address
    saved to EPC is not the victim instruction's but the preceding jump
    instruction's, as explained in \cite[p.~64]{see_mips_run}.\\
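    The EPC rule, including the delay-slot case, can be summarized in one small
    function. This is a behavioral sketch rather than CPU source code; 'pc'
    here is the victim instruction's address:

```python
# Exception entry as described: save the victim address (or, from a
# delay slot, the address of the preceding jump) to EPC, then jump to
# the single exception vector this core uses.
VECTOR = 0x0180

def take_exception(pc, in_delay_slot):
    epc = pc - 4 if in_delay_slot else pc
    return epc, VECTOR

# A trap in a delay slot at 0x104 reports the jump at 0x100 in EPC.
epc, next_pc = take_exception(0x104, True)
```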

    The Plasma CPU used to save in EPC the address of the instruction after
    BREAK or SYSCALL, and used a nonstandard vector address (0x03c). This core
    goes the standard R3000 way instead.\\

    Note that the EPC register is not used by any instruction other than MFC0.
    RFE is implemented and works as per the R3000 specs.\\


\section{Multiplier}
\label{multiplier}

    The core includes a multiplier/divider module which is a slightly modified
    version of Plasma's multiplier unit. The changes have been commented in the
    source code.\\

    The main difference is that Plasma does not stall the pipeline while a
    multiplication/division is going on. It only stalls when you attempt to
    read registers HI or LO while the multiplier is still running; only then
    will the pipeline stall until the operation completes.\\
    This core instead always stalls for as long as the operation takes. Not
    only is it simpler this way; it will also make it easier to abort mult/div
    instructions.\\
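    The difference between the two policies is easy to state in terms of stall
    cycles. A sketch in Python (the latency figure used below is hypothetical;
    the real one depends on the mult/div unit):

```python
# This core always stalls for the whole operation; Plasma stalls only
# for whatever is left of the operation when HI/LO is actually read.
def stalls_this_core(op_latency):
    return op_latency

def stalls_plasma(op_latency, cycles_until_hilo_read):
    return max(0, op_latency - cycles_until_hilo_read)

# With a hypothetical 34-cycle divide and an mfhi issued 10 cycles later:
a = stalls_this_core(34)      # stalls for all 34 cycles
b = stalls_plasma(34, 10)     # stalls only for the remaining 24
```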

    The logic dealing with mul/div stalls is a bit convoluted and could use
    some explanation and an ascii chronogram. Again, TBD.\\
