 
\chapter{CPU Description}
\label{cpu_description}

    This section describes the 'core' CPU (mips\_cpu.vhdl), excluding the cache.

    \begin{figure}[h]
    \makebox[\textwidth]{\framebox[9cm]{\rule{0pt}{9cm}
    \includegraphics[width=8cm]{img/cpu_symbol.png}}}
    \caption{CPU module interface\label{cpu_symbol}}
    \end{figure}

    The CPU module is not meant to be used directly. Instead, the MCU module
    described in section~\ref{mcu_module} should be used.\\

    The following sections describe the CPU structure and interfaces.\\

\section{Bus Architecture}
\label{bus_architecture}
    The CPU uses a Harvard architecture: separate paths for code and data. It
    has three separate, independent buses: code read, data read and data write.
    (Data read and write share the same address bus.)\\

    The CPU performs opcode fetches at the same time it does data reads
    or writes. This is not only faster than a Von Neumann architecture, where
    all memory cycles are bottlenecked through a single bus; it is much
    simpler too. Most importantly, it matches the way the CPU works
    when connected to code and data caches.\\

    (Actually, in the current state of the core, the load interlock is
    implemented inefficiently and the core falls short of that ideal -- more
    on this later.)\\

    The core can't read and write data at the same time; this is a fundamental
    limitation of the core structure: doing both at once would require one
    more read port in the register bank -- too expensive. Besides, it is not
    necessary given the 3-stage pipeline structure we use.\\

    In the most basic use scenario (no external memory and no caches), code
    and data buses have a common address space but separate storage, and the
    architecture is strictly Harvard. When cache memory is connected,
    the architecture becomes a Modified Harvard -- because data and
    code ultimately come from the same storage, on the same external
    bus(es).\\

    Note that the basic CPU module (mips\_cpu) is meant to be connected to
    internal, synchronous BRAMs only (i.e. the cache BRAMs). Some of its
    outputs are not registered because they need not be. The parent module
    (called 'mips\_mcu') has its outputs registered to limit $t_{co}$ to
    acceptable values.\\


\subsection{Code and data read bus interface}
\label{code_data_buses}
    Both buses have the same interface:\\

\begin{tabular}{ l l }
    $\star$\_rd\_addr  & Address bus\\
    $\star$\_rd        & Data/opcode bus\\
    $\star$\_rd\_vma   & Valid Memory Address (VMA)\\
\end{tabular}\\

    The CPU assumes SYNCHRONOUS external memory (most or all FPGA
    architectures have only synchronous RAM blocks):

    When $\star$\_rd\_vma is active ('1'), $\star$\_rd\_addr is a valid read
    address and the memory should provide the read data in the next clock
    cycle.
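The one-cycle read protocol above can be modeled in a few lines. The class below is an illustrative sketch in Python (not part of the core's sources), assuming a simple word-addressed memory: an address presented with vma asserted is sampled on the clock edge, and the data appears on the read bus in the following cycle.

```python
# Cycle-accurate sketch of the synchronous read protocol described above.
# Names (SyncReadPort, pending) are illustrative, not signals from the core.
class SyncReadPort:
    def __init__(self, memory):
        self.memory = memory      # word-addressed dict
        self.pending = None       # address captured on the last clock edge
        self.rd = None            # read data bus, valid one cycle after vma

    def clock(self, vma, addr):
        """Simulate one rising clock edge."""
        # Data for the previously captured address becomes visible now.
        self.rd = self.memory[self.pending] if self.pending is not None else None
        # Capture a new address only when vma is active.
        self.pending = addr if vma else None

mem = {0x0700: 0xAABBCCDD, 0x0800: 0x11223344}
port = SyncReadPort(mem)
port.clock(vma=True, addr=0x0700)   # address cycle (vma high)
port.clock(vma=False, addr=None)    # data cycle: rd now holds mem[0x0700]
print(hex(port.rd))                 # 0xaabbccdd
```

This matches chronogram 3.1.A below: the address is on the bus while vma is high, and the data is registered one cycle later.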

    The following ascii-art waveforms depict simple data and code read cycles
    where there is no interlock -- the interlock is discussed in
    section~\ref{interlocking_and_data_hazards}.\\

\needspace{26\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.1.A: data read cycle, no stall ======================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

                         _________           _________
     data_rd_vma    ____/         \_________/         \____________________

     data_rd_addr   XXXX| 0x0700  |XXXXXXXXX| 0x0800  |XXXXXXXXX|XXXXXXXXX|

     data_rd        XXXX|XXXXXXXXX| [0x700] |XXXXXXXXX| [0x800] |XXXXXXXXX|

     (target reg)   ????|?????????|?????????| [0x700] |?????????| [0x800] |

              (data is registered here...)--^    (...and here)--^

     ==== Chronogram 3.1.B: code read cycle, no stall =====================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

                         __________________________________________________
    code_rd_vma     ____/         |         |         |         |         |

    code_rd_addr    ????| 0x0100  | 0x0104  | 0x0108  | 0x010c  | 0x0200  |

    code_rd         XXXX|XXXXXXXXX| [0x100] | [0x104] | [0x108] | [0x10c] |

    p1_ir_reg       ????|?????????|?????????| [0x100] | [0x104] | [0x108] |

      (first code word is registered here)--^

    ========================================================================
\end{verbatim}

    The data address bus is 32 bits wide; the lowest 2 bits are redundant in
    read cycles since the CPU always reads full words, but they may be useful
    for debugging.\\

\subsection{Data write interface}
\label{data_write_bus}

    The write bus does not have a vma output because byte\_we fulfills the
    same role:\\

\begin{tabular}{ l l }
    $\star$\_wr\_addr   & Address bus\\
    $\star$\_wr         & Data/opcode bus\\
    byte\_we            & WE for each of the four bytes
\end{tabular}\\

    Write cycles are synchronous too. The four bytes in the word are handled
    separately -- the CPU asserts a combination of byte\_we bits according to
    the size and alignment of the store.\\

    When byte\_we(i) is active, the matching byte of data\_wr should be
    stored at address data\_wr\_addr. byte\_we(0) is for the LSB, byte\_we(3)
    for the MSB. Note that since the CPU is big endian, the MSB has the
    lowest address and the LSB the highest. The memory system does not need
    to care about that.\\
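The mapping from store size and alignment to byte\_we bits can be sketched as follows. This is an illustrative Python model, not code from the core; the function name and the list representation are assumptions. Lanes are listed MSB-first, matching the way byte\_we is drawn in chronogram 3.2 below, with byte\_we(3) selecting the MSB, which on this big-endian CPU sits at the lowest byte address of the word.

```python
# Hypothetical model of byte_we lane selection for a big-endian store.
def byte_we_lanes(size, addr):
    """Return byte_we as a list [we3, we2, we1, we0] for a store of
    'size' bytes (1 = SB, 2 = SH, 4 = SW) at byte address 'addr'."""
    offset = addr & 3                 # byte offset inside the 32-bit word
    lanes = [0, 0, 0, 0]              # index 0 is byte_we(3), the MSB lane
    for i in range(size):
        lanes[offset + i] = 1         # big endian: offset 0 -> MSB lane
    return lanes

print(byte_we_lanes(4, 0x0700))   # SW           -> [1, 1, 1, 1]
print(byte_we_lanes(1, 0x0801))   # SB at +1     -> [0, 1, 0, 0]
print(byte_we_lanes(2, 0x0900))   # SH at +0     -> [1, 1, 0, 0]
```

The three results correspond to the 1111, 0100 and 1100 patterns shown in chronogram 3.2.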

    Write cycles never cause data-hazard stalls. They would take a single
    clock cycle were it not for the inefficient cache implementation, which
    stalls the CPU until the writethrough is done.\\

    This is the waveform for a basic write cycle (remember, this is without
    cache; the cache would just stretch this diagram by a few cycles):

\needspace{20\baselineskip}
\begin{verbatim}

    ==== Chronogram 3.2: data write cycle =================================
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     byte_we        XXXX|  1111   |  0000   |  0100   |  1100   |  0000   |

     data_wr_addr   XXXX| 0x0700  |XXXXXXXXX| 0x0800  | 0x0900  |XXXXXXXXX|

     data_wr        XXXX|12345678h|XXXXXXXXX|12345678h|12345678h|XXXXXXXXX|

     [0x0700]       ????|????????h|12345678h|12345678h|12345678h|12345678h|

     [0x0800]       ????|????????h|????????h|????????h|??34????h|??34????h|

     [0x0900]       ????|????????h|????????h|????????h|????????h|1234????h|

    ========================================================================
\end{verbatim}

    Note the two back-to-back stores to addresses 0x0800 and 0x0900. They
    are produced by two consecutive S$\star$ instructions (SB and SH in the
    example), and can only be done this fast because of the Harvard
    architecture (and, again, because the diagram does not account for the
    cache interaction).



\subsection{CPU stalls caused by memory accesses}
\label{memory_cpu_stalls}

    In short, the 'mem\_wait' input unconditionally stalls all pipeline
    stages for as long as it is active. It is meant to be used by the cache
    during cache refills.\\

    That is, the cache/memory controller stops the CPU on all data/code
    misses for as long as it takes to do a line refill. The current cache
    implementation does refills in order (i.e. not 'target address first').

    Note that external memory wait states are a separate issue. They too are
    handled in the cache/memory controller. See section~\ref{cache} on the
    memory controller.

\section{Pipeline}
\label{pipeline}

    Here is where I would explain the structure of the CPU in detail; these
    brief comments will have to suffice until I write some real
    documentation.\\

    This section could really use a diagram; since it can take me days to
    draw one, that will have to wait for a further revision.\\

    This core has a 3-stage pipeline quite different from the original
    architecture spec. Instead of trying to use the original names for the
    stages, I'll define my own.\\

    A computational instruction of the I- or R-type goes through the
    following stages during execution:\\

    \begin{tabular}{ l l }
        FETCH-0   & Instruction address is on the code\_rd\_addr bus\\
        FETCH-1   & Instruction opcode is on the code\_rd bus\\
        ALU/MEM   & ALU operation or memory read/write cycle is done OR\\
                  &   Memory read/write address is on the data\_rd/wr\_addr bus OR\\
                  &   Memory write data is on the data\_wr bus\\
        LOAD      & Memory read data is on the data\_rd bus
    \end{tabular}\\

    In the core source (mips\_cpu.vhdl) the stages have been numbered:\\

    \begin{tabular}{ l l }
        FETCH-1 & = stage 0\\
        ALU/MEM & = stage 1\\
        LOAD    & = stage 2
    \end{tabular}\\

    Here are a few examples:\\

\needspace{9\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.3.A: stages for instruction "lui gp,0x1" ============
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr       | 0x099c  |

     code_rd_data                 |3c1c0001h|

     rbank[$gp]                             | 0x0001  |

                        |< fetch0>|<   0   >|<   1   >|
    =======================================================================
\end{verbatim}


\needspace{17\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.3.B: stages for instruction "lw a0,16(v0)" ==========
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr       | 0x099c  |

     code_rd_data                 |8c420010h|

     data_rd_addr                           | $v0+16  |
                                             _________
     data_rd_vma                       _____/         \______

     data_rd                                          | <data>  |

     rbank[$a0]                                                 | <data>  |

                        |< fetch0>|<   0   >|<   1   >|<   2   >|
    ========================================================================
\end{verbatim}

    In the source code, all registers and signals in stage
    \textless i\textgreater\ are prefixed with
    "p\textless i\textgreater\_", as in p0\_*, p1\_* and p2\_*.
    A stage includes a set of registers and all the logic that feeds from
    those registers (more precisely, all the logic between registers p0\_*
    and p1\_* belongs in stage 0, and so on).
    Since some signals feed from more than one pipeline stage (for example
    p2\_wback\_mux\_sel, which controls the register bank write port data
    multiplexor and feeds from p1 and p2), the naming convention has to be a
    little flexible.\\

    FETCH-0 would only include the logic between p0\_pc\_reg and the code RAM
    address port, so it has been omitted from the naming convention.\\

    All read and write ports of the register bank are synchronous. The read
    ports belong logically to stage 1 and the write port to stage 2.\\

    IMPORTANT: though the register bank read port is synchronous, its data
    can be used in stage 1 because it is read early (the read port is loaded
    at the same time as the instruction opcode). That is, a small part of the
    instruction decoding is done in stage FETCH-1. Bearing in mind that the
    code RAM is meant to be the exact same type of block as the register bank
    (or faster, if the register bank is implemented with distributed RAM),
    and that we will cram the whole ALU delay plus the reg bank delay into
    stage 1, it does not hurt to move a tiny part of the decoding to the
    previous cycle.\\

    All registers, with a few exceptions, belong squarely to one of the
    pipeline stages:\\

    \begin{tabular}{ l l }
    Stage 0:                                & \\
    p0\_pc\_reg                             & PC \\
    (external code ram read port register)  & Loads the same as PC \\
                                            & \\
    Stage 1:                                & \\
    p1\_ir\_reg                             & Instruction register \\
    (register bank read port register)      &  \\
    p1\_rbank\_forward                      & Feed-forward data (hazards) \\
    p1\_rbank\_rs\_hazard                   & Rs hazard detected \\
    p1\_rbank\_rt\_hazard                   & Rt hazard detected \\
                                            & \\
    Stage 2:                                & \\
    p2\_exception                           & Exception control \\
    p2\_do\_load                            & Load from data\_rd \\
    p2\_ld\_*                               & Load control\\
    (register bank write port register)     & \\
    \end{tabular}\\

    Note how the register bank ports belong to different stages even though
    they are part of the same physical device. There is no conflict here;
    hazards are handled properly (with logic defined in explicit,
    vendor-independent VHDL code, not with synthesis pragmas, etc.).\\

    There is a small number of global registers that don't belong to any
    pipeline stage:\\

    \begin{tabular}{ p{4cm} l }
    pipeline\_stalled                        & Together, these two signals...\\
    pipeline\_interlocked                    & ...control pipeline stalls
    \end{tabular}\\


    And of course there are special registers accessible through the CPU
    programmer's model:\\

    \begin{tabular}{ p{4cm} l }
        mdiv\_hi\_reg     & register HI from multiplier block\\
        mdiv\_lo\_reg     & register LO from multiplier block\\
        cp0\_status      & register CP0[status]\\
        cp0\_epc         & register CP0[epc]\\
        cp0\_cause       & register CP0[cause]
    \end{tabular}\\

    These belong logically to pipeline stage 1 (they can be considered an
    extension of the register bank) but have been spared the prefix for
    clarity.\\

    Note that the CP0 status and cause registers are only partially
    implemented.\\

    Again, this needs a better explanation and a diagram.\\



\section{Interlocking and Data Hazards}
\label{interlocking_and_data_hazards}

    There are two data hazards we need to care about:\\

    a) If an instruction needs to access a register that was modified by the
    previous instruction, we have a data hazard: because the register bank is
    synchronous, a memory location can't be read in the same cycle it is
    updated -- we would get the pre-update value.

    b) A memory load into a register Rd produces its result a cycle late, so
    if the instruction after the load needs to access Rd there is a
    conflict.\\


    Conflict (a) is solved with some data-forwarding logic: if we detect the
    hazard, the register bank uses a 'feed-forward' value instead of the
    value read from the register file.\\
    In file mips\_cpu.vhdl, see process 'data\_forward\_register' and the
    following few lines, where the hazard detection logic and the data
    register and multiplexors are implemented. Note that the hazard is
    detected separately for both read ports of the reg bank
    (p0\_rbank\_rs\_hazard and p0\_rbank\_rt\_hazard).
    Note too that this logic is strictly regular VHDL code -- there is no
    need to rely on the synthesis tool to add the bypass logic for us. This
    gets us some measure of vendor independence.\\
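The forwarding idea for conflict (a) can be sketched in a few lines. This is an illustrative Python model, not a transcription of the VHDL; the function and parameter names are assumptions. If the register being read was written by the immediately preceding instruction, the in-flight write-back value is substituted for the stale value read from the synchronous bank.

```python
# Hypothetical model of per-read-port hazard detection and forwarding.
def read_with_forwarding(rbank, rs_index, wb_index, wb_data, wb_enable):
    """Return the operand value for register rs_index.

    rbank     -- register bank contents as read this cycle (stale by 1 write)
    wb_index  -- destination register of the previous instruction
    wb_data   -- value being written back this very cycle
    wb_enable -- whether the previous instruction writes a register
    """
    # Hazard: previous instruction writes the register we are reading
    # ($zero is never a hazard since it is hardwired to 0).
    hazard = wb_enable and wb_index == rs_index and rs_index != 0
    return wb_data if hazard else rbank[rs_index]

rbank = [0] * 32
rbank[5] = 111                      # stale value of $5 in the bank
# Previous instruction is writing 222 into $5 this very cycle:
print(read_with_forwarding(rbank, 5, 5, 222, True))   # 222 (forwarded)
print(read_with_forwarding(rbank, 5, 7, 333, True))   # 111 (no hazard)
```

In the core this check is replicated for both read ports (Rs and Rt), which is why the hazard flags come in pairs.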

    As for conflict (b), in the original MIPS-I architecture it was the job
    of the programmer to make sure that a loaded value was not used before it
    was available -- by inserting NOPs after the load instruction, if
    necessary. This is what I call the 'load delay slot', as discussed in
    \cite[p.~13-1]{r3k_ref_man}.\\

    The C toolchain needs to be set up for MIPS-I compliance in order to
    build object code compatible with this scheme.\\
    But all succeeding versions of the MIPS architecture implement a
    different scheme instead, the 'load interlock'
    (\cite[p.~28]{see_mips_run}). You often see this behavior in code
    generated by gcc, even when using the -mips1 flag (this is probably due
    to the precompiled support code or libc; I have to check).\\
    In short, it pays to implement load interlocks, so this core does; but
    the feature should be optional through a generic.\\


    The load interlock is triggered in stage 1 (ALU/MEM) of the load
    instruction; when triggered, pipeline stages 0 and 1 stall, but pipeline
    stage 2 is allowed to proceed. That is, PC and IR are frozen but the
    value loaded from memory is written into the register bank.\\

    In the current implementation, the instruction following the load is
    UNCONDITIONALLY stalled, even if it does not use the target register of
    the load. This prevents, for example, interleaving read and write memory
    cycles back to back, which the CPU could otherwise do.\\
    The interlock should only be triggered when necessary; this has to be
    fixed.\\
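The difference between the current and the intended behavior can be stated as two predicates. This is an illustrative sketch (names and signature are assumptions, not the core's signals): the current implementation stalls after every load, while a proper interlock would stall only when the next instruction actually reads the load's target register.

```python
# Current behavior: any load stalls the following instruction.
def interlock_current(prev_is_load, load_rd, next_rs, next_rt):
    return prev_is_load

# Intended behavior: stall only on a true use of the loaded register
# ($zero is never a real dependency).
def interlock_intended(prev_is_load, load_rd, next_rs, next_rt):
    return prev_is_load and load_rd != 0 and load_rd in (next_rs, next_rt)

# "lb a0,16(v0)" followed by "sb a0,0(s4)": a real $a0 ($4) hazard,
# so both schemes stall.
print(interlock_current(True, 4, 4, 20), interlock_intended(True, 4, 4, 20))
# A load followed by an instruction that does not touch the target register:
# only the current implementation stalls.
print(interlock_current(True, 4, 8, 9), interlock_intended(True, 4, 8, 9))
```

The second case is exactly the lost opportunity described above: with the conditional interlock, the read and write cycles could be interleaved back to back.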


\needspace{17\baselineskip}
\begin{verbatim}
    ==== Chronogram 3.4.A: data read cycle, showing interlock =============
                         ____      ____      ____      ____      ____
     clk            ____/    \____/    \____/    \____/    \____/    \____/

     code_rd_addr   XXXX| 0x099c  | 0x09a0            | 0x09a4  | 0x09a8  |

     byte_we        XXXX|  0000   |  1111   |  0000   |  0000   |  1100   |

                                            |<                 >|
                    ________________________           ____________________
     code_rd_vma                            \_________/
                                             _________
     data_rd_vma    ________________________/         \____________________

     data_rd_addr   XXXX|XXXXXXXXX|XXXXXXXXX| 0x0700            |XXXXXXXXX|

     data_rd        XXXX|XXXXXXXXX|XXXXXXXXX|XXXXXXXXX| [0x700] |XXXXXXXXX|

     (target reg)   ????|?????????|?????????|?????????| [0x700] |?????????|

                           (data is registered here)--^

    ========================================================================
\end{verbatim}

    Note how a fetch cycle is delayed.

    This waveform was produced by this code:

\begin{verbatim}
                ...
                998:    ac430010    sw  v1,16(v0)
                99c:    80440010    lb  a0,16(v0)
                9a0:    a2840000    sb  a0,0(s4)
                9a4:    80440011    lb  a0,17(v0)
                9a8:    00000000    nop
                ...
\end{verbatim}

    Note how the read and write cycles are spaced instead of being
    interleaved, as they would be if interlocking were implemented
    efficiently (in this example there was a real hazard, register \$a0, but
    that is coincidence -- I need to find a better example in the listing
    files...).


\section{Exceptions}
\label{exceptions}

    The only exceptions supported so far are software exceptions, and of
    those only the BREAK and SYSCALL instructions, the 'unimplemented opcode'
    trap and the 'user-mode access to CP0' trap.\\
    Memory privileges are not and will not be implemented. Hardware/software
    interrupts are still unimplemented too.\\

    Exceptions are meant to work as in the R3000 CPUs except for the vector
    address.\\
    They save their own address to EPC, update SR, abort the following
    instruction, and jump to the exception vector 0x0180. All as per the
    specs except the vector address (we only use one).\\

    The following instruction is aborted even if it is a load or a jump, and
    traps work as specified even from a delay slot -- in that case, the
    address saved to EPC is not the victim instruction's but the preceding
    jump instruction's, as explained in \cite[p.~64]{see_mips_run}.\\
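The EPC selection rule just described can be summarized in a few lines. This is a minimal illustrative sketch (the function and constant names are assumptions, not the core's VHDL): when the victim instruction sits in a branch delay slot, EPC points at the preceding jump so that re-execution restarts correctly.

```python
# Sketch of EPC selection on an exception, per the delay-slot rule above.
VECTOR = 0x0180   # the single exception vector used by this core

def take_exception(pc, in_delay_slot):
    """Return (epc, next_pc) for an exception at instruction address pc."""
    # In a delay slot, EPC must point at the jump/branch one word earlier.
    epc = pc - 4 if in_delay_slot else pc
    return epc, VECTOR

print([hex(v) for v in take_exception(0x09A0, False)])  # ['0x9a0', '0x180']
print([hex(v) for v in take_exception(0x09A0, True)])   # ['0x99c', '0x180']
```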

    Plasma used to save in EPC the address of the instruction after BREAK or
    SYSCALL, and used a nonstandard vector address (0x03c). This core goes
    the standard R3000 way instead.\\

    Note that the EPC register is not used by any instruction other than
    MFC0. RFE is implemented and works as per the R3000 specs.\\


\section{Multiplier}
\label{multiplier}

    The core includes a multiplier/divider module which is a slightly
    modified version of Plasma's multiplier unit. The changes have been
    commented in the source code.\\

    The main difference is that Plasma does not stall the pipeline while a
    multiplication/division is going on. It only does so when you attempt to
    read registers HI or LO while the multiplier is still running; only then
    will the pipeline stall until the operation completes.\\
    This core instead always stalls for as long as the operation takes. Not
    only is it simpler this way, it will also make it easier to abort
    mult/div instructions.\\

    The logic dealing with mul/div stalls is a bit convoluted and could use
    some explanation and an ASCII chronogram. Again, TBD.\\
