 
\chapter{CPU Description}
\label{cpu_description}

    This section describes the 'core' CPU (mips\_cpu.vhdl), excluding the cache.

    \begin{figure}[h]
    \makebox[\textwidth]{\framebox[9cm]{\rule{0pt}{9cm}
    \includegraphics[width=8cm]{img/cpu_symbol.png}}}
    \caption{CPU module interface\label{cpu_symbol}}
    \end{figure}

    The CPU module is not meant to be used directly. Instead, the MCU module
    described in section~\ref{mcu_module} should be used.\\

    The following sections describe the CPU structure and interfaces.\\

\section{Bus Architecture}
\label{bus_architecture}
    The CPU uses a Harvard architecture: separate paths for code and data. It
    has three separate, independent buses: code read, data read and data write.
    (Data read and write share the same address bus.)\\

    The CPU performs opcode fetches at the same time it does data reads
    or writes. This is not only faster than a Von Neumann architecture, where
    all memory cycles are bottlenecked through a single bus; it is much
    simpler too. Most importantly, it matches the way the CPU works
    when connected to code and data caches.\\

    (Actually, in the current state of the core, the load interlock is
    implemented inefficiently and the core falls short of that ideal -- more
    on this later.)\\

    The core can't read and write data at the same time; this is a fundamental
    limitation of the core structure: doing both at once would require one
    more read port in the register bank -- too expensive. Besides, it is not
    necessary given the 3-stage pipeline structure we use.\\

    In the most basic use scenario (no external memory and no caches), code
    and data buses have a common address space but separate storage, and the
    architecture is strictly Harvard. When cache memory is connected,
    the architecture becomes a Modified Harvard -- because data and
    code ultimately come from the same storage, on the same external
    bus(es).\\

    Note that the basic CPU module (mips\_cpu) is meant to be connected to
    internal, synchronous BRAMs only (i.e. the cache BRAMs). Some of its
    outputs are not registered because they need not be. The parent module
    (called 'mips\_mcu') has its outputs registered to limit $t_{co}$ to
    acceptable values.\\


\subsection{Code and data read bus interface}
\label{code_data_buses}
    Both buses have the same interface:\\

\begin{tabular}{ l l }
    $\star$\_rd\_addr  & Address bus\\
    $\star$\_rd        & Data/opcode bus\\
    $\star$\_rd\_vma   & Valid Memory Address (VMA)\\
\end{tabular}\\

    The CPU assumes SYNCHRONOUS external memory (most or all FPGA
    architectures have only synchronous RAM blocks):

    When $\star$\_rd\_vma is active ('1'), $\star$\_rd\_addr is a valid read
    address and the memory should provide the read data in the next clock
    cycle.
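The one-cycle read protocol above can be modeled in a few lines. The class below is an illustrative sketch in Python (not part of the core's sources), assuming a simple word-addressed memory: an address presented with vma asserted is sampled on the clock edge, and the data appears on the read bus in the following cycle.

```python
# Cycle-accurate sketch of the synchronous read protocol described above.
# Names (SyncReadPort, pending) are illustrative, not signals from the core.
class SyncReadPort:
    def __init__(self, memory):
        self.memory = memory      # word-addressed dict
        self.pending = None       # address captured on the last clock edge
        self.rd = None            # read data bus, valid one cycle after vma

    def clock(self, vma, addr):
        """Simulate one rising clock edge."""
        # Data for the previously captured address becomes visible now.
        self.rd = self.memory[self.pending] if self.pending is not None else None
        # Capture a new address only when vma is active.
        self.pending = addr if vma else None

mem = {0x0700: 0xAABBCCDD, 0x0800: 0x11223344}
port = SyncReadPort(mem)
port.clock(vma=True, addr=0x0700)   # address cycle (vma high)
port.clock(vma=False, addr=None)    # data cycle: rd now holds mem[0x0700]
print(hex(port.rd))                 # 0xaabbccdd
```

This matches chronogram 3.1.A below: the address is on the bus while vma is high, and the data is registered one cycle later.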

    The following ascii-art waveforms depict simple data and code read cycles
    where there is no interlock -- the interlock is discussed in
    section~\ref{interlocking_and_data_hazards}.\\

\needspace{26\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.1.A: data read cycle, no stall ======================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

                         _________           _________
     data_rd_vma    ____/         \_________/         \____________________

     data_rd_addr   XXXX| 0x0700  |XXXXXXXXX| 0x0800  |XXXXXXXXX|XXXXXXXXX|

     data_rd        XXXX|XXXXXXXXX| [0x700] |XXXXXXXXX| [0x800] |XXXXXXXXX|

     (target reg)   ????|?????????|?????????| [0x700] |?????????| [0x800] |

              (data is registered here...)--^    (...and here)--^

     ==== Chronogram 3.1.B: code read cycle, no stall =====================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

                         __________________________________________________
    code_rd_vma     ____/         |         |         |         |         |

    code_rd_addr    ????| 0x0100  | 0x0104  | 0x0108  | 0x010c  | 0x0200  |

    code_rd         XXXX|XXXXXXXXX| [0x100] | [0x104] | [0x108] | [0x10c] |

    p1_ir_reg       ????|?????????|?????????| [0x100] | [0x104] | [0x108] |

      (first code word is registered here)--^

    ========================================================================
\end{verbatim}

    The data address bus is 32 bits wide; the lowest 2 bits are redundant in
    read cycles since the CPU always reads full words, but they may be useful
    for debugging.\\

\subsection{Data write interface}
\label{data_write_bus}

    The write bus does not have a vma output because byte\_we fulfills the
    same role:\\

\begin{tabular}{ l l }
    $\star$\_wr\_addr   & Address bus\\
    $\star$\_wr         & Data/opcode bus\\
    byte\_we            & WE for each of the four bytes
\end{tabular}\\

    Write cycles are synchronous too. The four bytes in the word are handled
    separately -- the CPU asserts a combination of byte\_we bits according to
    the size and alignment of the store.\\

    When byte\_we(i) is active, the matching byte of data\_wr should be
    stored at address data\_wr\_addr. byte\_we(0) is for the LSB, byte\_we(3)
    for the MSB. Note that since the CPU is big endian, the MSB has the
    lowest address and the LSB the highest. The memory system does not need
    to care about that.\\
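The mapping from store size and alignment to byte\_we bits can be sketched as follows. This is an illustrative Python model, not code from the core; the function name and the list representation are assumptions. Lanes are listed MSB-first, matching the way byte\_we is drawn in chronogram 3.2 below, with byte\_we(3) selecting the MSB, which on this big-endian CPU sits at the lowest byte address of the word.

```python
# Hypothetical model of byte_we lane selection for a big-endian store.
def byte_we_lanes(size, addr):
    """Return byte_we as a list [we3, we2, we1, we0] for a store of
    'size' bytes (1 = SB, 2 = SH, 4 = SW) at byte address 'addr'."""
    offset = addr & 3                 # byte offset inside the 32-bit word
    lanes = [0, 0, 0, 0]              # index 0 is byte_we(3), the MSB lane
    for i in range(size):
        lanes[offset + i] = 1         # big endian: offset 0 -> MSB lane
    return lanes

print(byte_we_lanes(4, 0x0700))   # SW           -> [1, 1, 1, 1]
print(byte_we_lanes(1, 0x0801))   # SB at +1     -> [0, 1, 0, 0]
print(byte_we_lanes(2, 0x0900))   # SH at +0     -> [1, 1, 0, 0]
```

The three results correspond to the 1111, 0100 and 1100 patterns shown in chronogram 3.2.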

    Write cycles never cause data-hazard stalls. They would take a single
    clock cycle were it not for the inefficient cache implementation, which
    stalls the CPU until the writethrough is done.\\

    This is the waveform for a basic write cycle (remember, this is without
    cache; the cache would just stretch this diagram by a few cycles):

\needspace{20\baselineskip}
\begin{verbatim}

    ==== Chronogram 3.2: data write cycle =================================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     byte_we        XXXX|  1111   |  0000   |  0100   |  1100   |  0000   |

     data_wr_addr   XXXX| 0x0700  |XXXXXXXXX| 0x0800  | 0x0900  |XXXXXXXXX|

     data_wr        XXXX|12345678h|XXXXXXXXX|12345678h|12345678h|XXXXXXXXX|

     [0x0700]       ????|????????h|12345678h|12345678h|12345678h|12345678h|

     [0x0800]       ????|????????h|????????h|????????h|??34????h|??34????h|

     [0x0900]       ????|????????h|????????h|????????h|????????h|1234????h|

    ========================================================================
\end{verbatim}

    Note the two back-to-back stores to addresses 0x0800 and 0x0900. They
    are produced by two consecutive S$\star$ instructions (SB and SH in the
    example), and can only be done this fast because of the Harvard
    architecture (and, again, because the diagram does not account for the
    cache interaction).



\subsection{CPU stalls caused by memory accesses}
\label{memory_cpu_stalls}

    In short, the 'mem\_wait' input unconditionally stalls all pipeline
    stages for as long as it is active. It is meant to be used by the cache
    during cache refills.\\

    That is, the cache/memory controller stops the CPU on all data/code
    misses for as long as it takes to do a line refill. The current cache
    implementation does refills in order (i.e. not 'target address first').

    Note that external memory wait states are a separate issue. They too are
    handled in the cache/memory controller. See section~\ref{cache} on the
    memory controller.

\section{Pipeline}
\label{pipeline}

    Here is where I would explain the structure of the CPU in detail; these
    brief comments will have to suffice until I write some real
    documentation.\\

    This section could really use a diagram; since it can take me days to
    draw one, that will have to wait for a further revision.\\

    This core has a 3-stage pipeline quite different from the original
    architecture spec. Instead of trying to use the original names for the
    stages, I'll define my own.\\

    A computational instruction of the I- or R-type goes through the
    following stages during execution:\\

    \begin{tabular}{ l l }
        FETCH-0   & Instruction address is on the code\_rd\_addr bus\\
        FETCH-1   & Instruction opcode is on the code\_rd bus\\
        ALU/MEM   & ALU operation or memory read/write cycle is done OR\\
                  &   Memory read/write address is on the data\_rd/wr\_addr bus OR\\
                  &   Memory write data is on the data\_wr bus\\
        LOAD      & Memory read data is on the data\_rd bus
    \end{tabular}\\

    In the core source (mips\_cpu.vhdl) the stages have been numbered:\\

    \begin{tabular}{ l l }
        FETCH-1 & = stage 0\\
        ALU/MEM & = stage 1\\
        LOAD    & = stage 2
    \end{tabular}\\

    Here are a few examples:\\

\needspace{9\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.3.A: stages for instruction "lui gp,0x1" ============
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr       | 0x099c  |

     code_rd_data                 |3c1c0001h|

     rbank[$gp]                             | 0x0001  |

                        |< fetch0>|<   0   >|<   1   >|
    =======================================================================
\end{verbatim}


\needspace{17\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.3.B: stages for instruction "lw a0,16(v0)" ==========
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr       | 0x099c  |

     code_rd_data                 |8c420010h|

     data_rd_addr                           | $v0+16  |
                                             _________
     data_rd_vma                       _____/         \______

     data_rd                                          | <data>  |

     rbank[$a0]                                                 | <data>  |

                        |< fetch0>|<   0   >|<   1   >|<   2   >|
    ========================================================================
\end{verbatim}

    In the source code, all registers and signals in stage
    \textless i\textgreater\ are prefixed with
    "p\textless i\textgreater\_", as in p0\_*, p1\_* and p2\_*.
    A stage includes a set of registers and all the logic that feeds from
    those registers (more precisely, all the logic between registers p0\_*
    and p1\_* belongs in stage 0, and so on).
    Since some signals feed from more than one pipeline stage (for example
    p2\_wback\_mux\_sel, which controls the register bank write port data
    multiplexor and feeds from p1 and p2), the naming convention has to be a
    little flexible.\\

    FETCH-0 would only include the logic between p0\_pc\_reg and the code RAM
    address port, so it has been omitted from the naming convention.\\

    All read and write ports of the register bank are synchronous. The read
    ports belong logically to stage 1 and the write port to stage 2.\\

    IMPORTANT: though the register bank read port is synchronous, its data
    can be used in stage 1 because it is read early (the read port is loaded
    at the same time as the instruction opcode). That is, a small part of the
    instruction decoding is done in stage FETCH-1. Bearing in mind that the
    code RAM is meant to be the exact same type of block as the register bank
    (or faster, if the register bank is implemented with distributed RAM),
    and that we will cram the whole ALU delay plus the reg bank delay into
    stage 1, it does not hurt to move a tiny part of the decoding to the
    previous cycle.\\

    All registers, with a few exceptions, belong squarely to one of the
    pipeline stages:\\

    \begin{tabular}{ l l }
    Stage 0:                                & \\
    p0\_pc\_reg                             & PC \\
    (external code ram read port register)  & Loads the same as PC \\
                                            & \\
    Stage 1:                                & \\
    p1\_ir\_reg                             & Instruction register \\
    (register bank read port register)      &  \\
    p1\_rbank\_forward                      & Feed-forward data (hazards) \\
    p1\_rbank\_rs\_hazard                   & Rs hazard detected \\
    p1\_rbank\_rt\_hazard                   & Rt hazard detected \\
                                            & \\
    Stage 2:                                & \\
    p2\_exception                           & Exception control \\
    p2\_do\_load                            & Load from data\_rd \\
    p2\_ld\_*                               & Load control\\
    (register bank write port register)     & \\
    \end{tabular}\\

    Note how the register bank ports belong to different stages even though
    they are part of the same physical device. There is no conflict here;
    hazards are handled properly (with logic defined in explicit,
    vendor-independent VHDL code, not with synthesis pragmas, etc.).\\

    There is a small number of global registers that don't belong to any
    pipeline stage:\\

    \begin{tabular}{ p{4cm} l }
    pipeline\_stalled                        & Together, these two signals...\\
    pipeline\_interlocked                    & ...control pipeline stalls
    \end{tabular}\\


    And of course there are special registers accessible through the CPU
    programmer's model:\\

    \begin{tabular}{ p{4cm} l }
        mdiv\_hi\_reg     & register HI from multiplier block\\
        mdiv\_lo\_reg     & register LO from multiplier block\\
        cp0\_status      & register CP0[status]\\
        cp0\_epc         & register CP0[epc]\\
        cp0\_cause       & register CP0[cause]
    \end{tabular}\\

    These belong logically to pipeline stage 1 (they can be considered an
    extension of the register bank) but have been spared the prefix for
    clarity.\\

    Note that the CP0 status and cause registers are only partially
    implemented.\\

    Again, this needs a better explanation and a diagram.\\



\section{Interlocking and Data Hazards}
\label{interlocking_and_data_hazards}

    There are two data hazards we need to care about:\\

    a) If an instruction needs to access a register that was modified by the
    previous instruction, we have a data hazard: because the register bank is
    synchronous, a memory location can't be read in the same cycle it is
    updated -- we would get the pre-update value.

    b) A memory load into a register Rd produces its result a cycle late, so
    if the instruction after the load needs to access Rd there is a
    conflict.\\


    Conflict (a) is solved with some data-forwarding logic: if we detect the
    hazard, the register bank uses a 'feed-forward' value instead of the
    value read from the register file.\\
    In file mips\_cpu.vhdl, see process 'data\_forward\_register' and the
    following few lines, where the hazard detection logic and the data
    register and multiplexors are implemented. Note that the hazard is
    detected separately for both read ports of the reg bank
    (p0\_rbank\_rs\_hazard and p0\_rbank\_rt\_hazard).
    Note too that this logic is strictly regular VHDL code -- there is no
    need to rely on the synthesis tool to add the bypass logic for us. This
    gets us some measure of vendor independence.\\
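The forwarding idea for conflict (a) can be sketched in a few lines. This is an illustrative Python model, not a transcription of the VHDL; the function and parameter names are assumptions. If the register being read was written by the immediately preceding instruction, the in-flight write-back value is substituted for the stale value read from the synchronous bank.

```python
# Hypothetical model of per-read-port hazard detection and forwarding.
def read_with_forwarding(rbank, rs_index, wb_index, wb_data, wb_enable):
    """Return the operand value for register rs_index.

    rbank     -- register bank contents as read this cycle (stale by 1 write)
    wb_index  -- destination register of the previous instruction
    wb_data   -- value being written back this very cycle
    wb_enable -- whether the previous instruction writes a register
    """
    # Hazard: previous instruction writes the register we are reading
    # ($zero is never a hazard since it is hardwired to 0).
    hazard = wb_enable and wb_index == rs_index and rs_index != 0
    return wb_data if hazard else rbank[rs_index]

rbank = [0] * 32
rbank[5] = 111                      # stale value of $5 in the bank
# Previous instruction is writing 222 into $5 this very cycle:
print(read_with_forwarding(rbank, 5, 5, 222, True))   # 222 (forwarded)
print(read_with_forwarding(rbank, 5, 7, 333, True))   # 111 (no hazard)
```

In the core this check is replicated for both read ports (Rs and Rt), which is why the hazard flags come in pairs.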

    As for conflict (b), in the original MIPS-I architecture it was the job
    of the programmer to make sure that a loaded value was not used before it
    was available -- by inserting NOPs after the load instruction, if
    necessary. This is what I call the 'load delay slot', as discussed in
    \cite[p.~13-1]{r3k_ref_man}.\\

    The C toolchain needs to be set up for MIPS-I compliance in order to
    build object code compatible with this scheme.\\
    But all succeeding versions of the MIPS architecture implement a
    different scheme instead, the 'load interlock'
    (\cite[p.~28]{see_mips_run}). You often see this behavior in code
    generated by gcc, even when using the -mips1 flag (this is probably due
    to the precompiled support code or libc; I have to check).\\
    In short, it pays to implement load interlocks, so this core does; but
    the feature should be optional through a generic.\\


    The load interlock is triggered in stage 1 (ALU/MEM) of the load
    instruction; when triggered, pipeline stages 0 and 1 stall, but pipeline
    stage 2 is allowed to proceed. That is, PC and IR are frozen but the
    value loaded from memory is written into the register bank.\\

    In the current implementation, the instruction following the load is
    UNCONDITIONALLY stalled, even if it does not use the target register of
    the load. This prevents, for example, interleaving read and write memory
    cycles back to back, which the CPU could otherwise do.\\
    The interlock should only be triggered when necessary; this has to be
    fixed.\\
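The difference between the current and the intended behavior can be stated as two predicates. This is an illustrative sketch (names and signature are assumptions, not the core's signals): the current implementation stalls after every load, while a proper interlock would stall only when the next instruction actually reads the load's target register.

```python
# Current behavior: any load stalls the following instruction.
def interlock_current(prev_is_load, load_rd, next_rs, next_rt):
    return prev_is_load

# Intended behavior: stall only on a true use of the loaded register
# ($zero is never a real dependency).
def interlock_intended(prev_is_load, load_rd, next_rs, next_rt):
    return prev_is_load and load_rd != 0 and load_rd in (next_rs, next_rt)

# "lb a0,16(v0)" followed by "sb a0,0(s4)": a real $a0 ($4) hazard,
# so both schemes stall.
print(interlock_current(True, 4, 4, 20), interlock_intended(True, 4, 4, 20))
# A load followed by an instruction that does not touch the target register:
# only the current implementation stalls.
print(interlock_current(True, 4, 8, 9), interlock_intended(True, 4, 8, 9))
```

The second case is exactly the lost opportunity described above: with the conditional interlock, the read and write cycles could be interleaved back to back.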


\needspace{17\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.4.A: data read cycle, showing interlock =============
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr   XXXX| 0x099c  | 0x09a0            | 0x09a4  | 0x09a8  |

     byte_we        XXXX|  0000   |  1111   |  0000   |  0000   |  1100   |

                                            |<                 >|
                    ________________________           ____________________
     code_rd_vma                            \_________/
                                             _________
     data_rd_vma    ________________________/         \____________________

     data_rd_addr   XXXX|XXXXXXXXX|XXXXXXXXX| 0x0700            |XXXXXXXXX|

     data_rd        XXXX|XXXXXXXXX|XXXXXXXXX|XXXXXXXXX| [0x700] |XXXXXXXXX|

     (target reg)   ????|?????????|?????????|?????????| [0x700] |?????????|

                           (data is registered here)--^

    ========================================================================
\end{verbatim}

    Note how a fetch cycle is delayed.

    This waveform was produced by this code:

\begin{verbatim}
                ...
                998:    ac430010    sw  v1,16(v0)
                99c:    80440010    lb  a0,16(v0)
                9a0:    a2840000    sb  a0,0(s4)
                9a4:    80440011    lb  a0,17(v0)
                9a8:    00000000    nop
                ...
\end{verbatim}

    Note how the read and write cycles are spaced instead of being
    interleaved, as they would be if interlocking were implemented
    efficiently (in this example there was a real hazard, register \$a0, but
    that is coincidence -- I need to find a better example in the listing
    files...).


\section{Exceptions}
\label{exceptions}

    The only exceptions supported so far are software exceptions, and of
    those only the BREAK and SYSCALL instructions, the 'unimplemented opcode'
    trap and the 'user-mode access to CP0' trap.\\
    Memory privileges are not and will not be implemented. Hardware/software
    interrupts are still unimplemented too.\\

    Exceptions are meant to work as in the R3000 CPUs except for the vector
    address.\\
    They save their own address to EPC, update SR, abort the following
    instruction, and jump to the exception vector 0x0180. All as per the
    specs except the vector address (we only use one).\\

    The following instruction is aborted even if it is a load or a jump, and
    traps work as specified even from a delay slot -- in that case, the
    address saved to EPC is not the victim instruction's but the preceding
    jump instruction's, as explained in \cite[p.~64]{see_mips_run}.\\
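The EPC selection rule just described can be summarized in a few lines. This is a minimal illustrative sketch (the function and constant names are assumptions, not the core's VHDL): when the victim instruction sits in a branch delay slot, EPC points at the preceding jump so that re-execution restarts correctly.

```python
# Sketch of EPC selection on an exception, per the delay-slot rule above.
VECTOR = 0x0180   # the single exception vector used by this core

def take_exception(pc, in_delay_slot):
    """Return (epc, next_pc) for an exception at instruction address pc."""
    # In a delay slot, EPC must point at the jump/branch one word earlier.
    epc = pc - 4 if in_delay_slot else pc
    return epc, VECTOR

print([hex(v) for v in take_exception(0x09A0, False)])  # ['0x9a0', '0x180']
print([hex(v) for v in take_exception(0x09A0, True)])   # ['0x99c', '0x180']
```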

    Plasma used to save in EPC the address of the instruction after BREAK or
    SYSCALL, and used a nonstandard vector address (0x03c). This core goes
    the standard R3000 way instead.\\

    Note that the EPC register is not used by any instruction other than
    MFC0. RFE is implemented and works as per the R3000 specs.\\


\section{Multiplier}
\label{multiplier}

    The core includes a multiplier/divider module which is a slightly
    modified version of Plasma's multiplier unit. The changes have been
    commented in the source code.\\

    The main difference is that Plasma does not stall the pipeline while a
    multiplication/division is going on. It only does so when you attempt to
    read registers HI or LO while the multiplier is still running; only then
    will the pipeline stall until the operation completes.\\
    This core instead always stalls for as long as the operation takes. Not
    only is it simpler this way, it will also make it easier to abort
    mult/div instructions.\\

    The logic dealing with mul/div stalls is a bit convoluted and could use
    some explanation and an ASCII chronogram. Again, TBD.\\
