\chapter{CPU Description} \label{cpu_description}

This section is about the 'core' CPU (mips\_cpu.vhdl), excluding the cache.

\begin{figure}[h]
\makebox[\textwidth]{\framebox[9cm]{\rule{0pt}{9cm}
\includegraphics[width=8cm]{img/cpu_symbol.png}}}
\caption{CPU module interface\label{cpu_symbol}}
\end{figure}

The CPU module is not meant to be used directly. Instead, the SoC module
described in chapter~\ref{soc_module} should be used.\\

The following sections describe the CPU structure and interfaces.\\

\section{Bus Architecture} \label{bus_architecture}

The CPU uses a Harvard architecture: separate paths for code and data. It has
three separate, independent buses: code read, data read and data write (data
read and write share the same address bus).\\

The CPU performs opcode fetches at the same time it does data reads or
writes. This is not only faster than a von Neumann architecture, where all
memory cycles are bottlenecked through a single bus; it is much simpler too.
Most importantly, it matches the way the CPU works when connected to code and
data caches.\\

(Actually, in the current state of the core, due to the inefficient way in
which load interlocking has been implemented, the core is less efficient than
that -- more on this later.)\\

The core can't read and write data at the same time; this is a fundamental
limitation of the core structure: doing both at once would take one more read
port in the register bank -- too expensive. Besides, it's not necessary given
the 3-stage pipeline structure we use.\\

In the most basic use scenario (no external memory and no caches), code and
data buses have a common address space but separate storage, and the
architecture is strictly Harvard. When cache memory is connected, the
architecture becomes a modified Harvard -- because data and code ultimately
come from the same storage, on the same external bus(es).\\

Note that the basic CPU module (mips\_cpu) is meant to be connected to
internal, synchronous BRAMs only (i.e. the cache BRAMs). Some of its outputs
are not registered because they needn't be. The parent module (called
'mips\_soc') has its outputs registered to limit $t_{co}$ to acceptable
values.\\

\subsection{Code and data read bus interface} \label{code_data_buses}

Both buses have the same interface:\\

\begin{tabular}{ l l }
$\star$\_rd\_addr & Address bus\\
$\star$\_rd & Data/opcode bus\\
$\star$\_rd\_vma & Valid Memory Address (VMA)\\
\end{tabular}\\

The CPU assumes SYNCHRONOUS external memory (most or all FPGA architectures
have only synchronous RAM blocks): when $\star$\_rd\_vma is active ('1'),
$\star$\_rd\_addr is a valid read address and the memory should provide the
read data in the next clock cycle.
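By way of illustration, here is a minimal VHDL sketch of a memory that obeys
this read protocol. The entity and signal names are hypothetical -- this is
not an excerpt from the core sources, just one way to model the timing:

\begin{verbatim}
-- Hedged sketch: a BRAM-style memory obeying the synchronous read
-- protocol. Entity/signal names are illustrative only.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sync_code_ram is
    port(
        clk          : in  std_logic;
        code_rd_vma  : in  std_logic;
        code_rd_addr : in  std_logic_vector(31 downto 0);
        code_rd      : out std_logic_vector(31 downto 0)
    );
end sync_code_ram;

architecture rtl of sync_code_ram is
    -- 1K words for the sketch; word index taken from addr bits 11..2
    type ram_t is array(0 to 1023) of std_logic_vector(31 downto 0);
    signal ram : ram_t;
begin
    synchronous_read:
    process(clk)
    begin
        if rising_edge(clk) then
            -- When vma is active, the address is sampled here and the
            -- data shows up on the bus in the NEXT cycle, as in the
            -- chronograms below.
            if code_rd_vma = '1' then
                code_rd <= ram(to_integer(unsigned(code_rd_addr(11 downto 2))));
            end if;
        end if;
    end process synchronous_read;
end rtl;
\end{verbatim}\\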
The following ASCII-art waveforms depict simple data and code read cycles
where there is no interlock -- interlock is discussed in
section~\ref{interlocking_and_data_hazards}.\\

\needspace{26\baselineskip}
\begin{verbatim}
==== Chronogram 3.1.A: data read cycle, no stall ======================

                  ____      ____      ____      ____      ____
clk          ____/    \____/    \____/    \____/    \____/    \____/
                  _________           _________
data_rd_vma  ____/         \_________/         \____________________

data_rd_addr XXXX| 0x0700  |XXXXXXXXX| 0x0800  |XXXXXXXXX|XXXXXXXXX|

data_rd      XXXX|XXXXXXXXX| [0x700] |XXXXXXXXX| [0x800] |XXXXXXXXX|

(target reg) ????|?????????|?????????| [0x700] |?????????| [0x800] |

       (data is registered here...)--^
                                          (...and here)--^

==== Chronogram 3.1.B: code read cycle, no stall ======================

                  ____      ____      ____      ____      ____
clk          ____/    \____/    \____/    \____/    \____/    \____/
                  __________________________________________________
code_rd_vma  ____/         |         |         |         |         |

code_rd_addr ????| 0x0100  | 0x0104  | 0x0108  | 0x010c  | 0x0200  |

code_rd      XXXX|XXXXXXXXX| [0x100] | [0x104] | [0x108] | [0x10c] |

p1_ir_reg    ????|?????????|?????????| [0x100] | [0x104] | [0x108] |

(first code word is registered here)-^
========================================================================
\end{verbatim}\\

The data address bus is 32 bits wide; the lowest 2 bits are redundant in read
cycles, since the CPU always reads full words, but they may be useful for
debugging.\\

\subsection{Data write interface} \label{data_write_bus}

The write bus does not have a vma output because byte\_we fulfills the same
role:\\

\begin{tabular}{ l l }
$\star$\_wr\_addr & Address bus\\
$\star$\_wr & Write data bus\\
byte\_we & WE for each of the four bytes\\
\end{tabular}\\

Write cycles are synchronous too. The four bytes in the word are handled
separately -- the CPU asserts a combination of byte\_we bits according to the
size and alignment of the store.\\

When byte\_we(i) is active, the matching byte of data\_wr should be stored at
address data\_wr\_addr. byte\_we(0) is for the LSB, byte\_we(3) for the MSB.
Note that since the CPU is big endian, the MSB has the lowest address and the
LSB the highest. The memory system does not need to care about that.\\

Write cycles never cause data-hazard stalls. They would take a single clock
cycle were it not for the inefficient cache implementation, which stalls the
CPU until the write-through is done.\\

This is the waveform for a basic write cycle (remember, this is without
cache; the cache would just stretch this diagram by a few cycles):

\needspace{20\baselineskip}
\begin{verbatim}
==== Chronogram 3.2: data write cycle =================================

                  ____      ____      ____      ____      ____
clk          ____/    \____/    \____/    \____/    \____/    \____/

byte_we      XXXX|  1111   |  0000   |  0100   |  1100   |  0000   |

data_wr_addr XXXX| 0x0700  |XXXXXXXXX| 0x0800  | 0x0900  |XXXXXXXXX|

data_wr      XXXX|12345678h|XXXXXXXXX|12345678h|12345678h|XXXXXXXXX|

[0x0700]     ????|????????h|12345678h|12345678h|12345678h|12345678h|

[0x0800]     ????|????????h|????????h|????????h|??34????h|??34????h|

[0x0900]     ????|????????h|????????h|????????h|????????h|1234????h|

========================================================================
\end{verbatim}\\

Note the two back-to-back stores to addresses 0x0800 and 0x0900. They are
produced by two consecutive S$\star$ instructions (SB and SH in the example),
and can only be done this fast because of the Harvard architecture (and,
again, because the diagram does not account for the cache interaction).
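Again as an illustration only, here is a minimal VHDL sketch of a memory
write port that decodes byte\_we as described above. Entity and signal names
are hypothetical, not taken from the core sources:

\begin{verbatim}
-- Hedged sketch: byte-lane decoding of a synchronous write cycle.
-- The read port is omitted; see the read sketch above.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity sync_data_ram is
    port(
        clk          : in std_logic;
        byte_we      : in std_logic_vector(3 downto 0);
        data_wr_addr : in std_logic_vector(31 downto 0);
        data_wr      : in std_logic_vector(31 downto 0)
    );
end sync_data_ram;

architecture rtl of sync_data_ram is
    type ram_t is array(0 to 1023) of std_logic_vector(31 downto 0);
    signal ram : ram_t;
begin
    byte_lane_write:
    process(clk)
        variable word : integer;
    begin
        if rising_edge(clk) then
            word := to_integer(unsigned(data_wr_addr(11 downto 2)));
            for i in 0 to 3 loop
                -- byte_we(0) enables the LSB lane, byte_we(3) the MSB
                -- lane; the CPU's endianness is invisible at this level.
                if byte_we(i) = '1' then
                    ram(word)(8*i+7 downto 8*i) <= data_wr(8*i+7 downto 8*i);
                end if;
            end loop;
        end if;
    end process byte_lane_write;
end rtl;
\end{verbatim}\\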
\subsection{CPU stalls caused by memory accesses} \label{memory_cpu_stalls}

In short, the 'mem\_wait' input will unconditionally stall all pipeline
stages for as long as it is active. It is meant to be used by the cache
during cache refills.\\

The cache/memory controller stops the CPU on all data/code misses for as long
as it takes to do a line refill. The current cache implementation does
refills in reverse order (i.e. not 'target address first'). Note that
external memory wait states are a separate issue; they too are handled in the
cache/memory controller. See section~\ref{cache} on the memory controller.

\section{Pipeline} \label{pipeline}

Here is where I would explain the structure of the CPU in detail; these brief
comments will have to suffice until I write some real documentation.\\

This section could really use a diagram; since it can take me days to draw
one, that will have to wait for a further revision.\\

This core has a 3-stage pipeline quite different from the original
architecture spec. Instead of trying to use the original names for the
stages, I'll define my own.\\

A computational instruction of the I- or R-type goes through the following
stages during execution:\\

\begin{tabular}{ l l }
FETCH-0 & Instruction address is on code\_rd\_addr bus\\
FETCH-1 & Instruction opcode is on code\_rd bus\\
ALU/MEM & ALU operation or memory read/write cycle is done OR\\
        & Memory read/write address is on data\_rd/wr\_addr bus OR\\
        & Memory write data is on data\_wr bus\\
LOAD    & Memory read data is on data\_rd bus\\
\end{tabular}\\

In the core source (mips\_cpu.vhdl) the stages have been numbered:\\

\begin{tabular}{ l l }
FETCH-1 & = stage 0\\
ALU/MEM & = stage 1\\
LOAD    & = stage 2\\
\end{tabular}\\

Here are a few examples:\\

\needspace{9\baselineskip}
\begin{verbatim}
==== Chronogram 3.3.A: stages for instruction "lui gp,0x1" ============

                  ____      ____      ____      ____      ____
clk          ____/    \____/    \____/    \____/    \____/    \____/

code_rd_addr     | 0x099c  |

code_rd_data               |3c1c0001h|

rbank[$gp]                           | 0x0001  |

                 |< fetch0>|<   0   >|<   1   >|
=======================================================================
\end{verbatim}\\

\needspace{17\baselineskip}
\begin{verbatim}
==== Chronogram 3.3.B: stages for instruction "lw a0,16(v0)" ==========

                  ____      ____      ____      ____      ____
clk          ____/    \____/    \____/    \____/    \____/    \____/

code_rd_addr     | 0x099c  |

code_rd_data               |8c420010h|

data_rd_addr                         | $v0+16  |
                                      _________
data_rd_vma  ________________________/         \____________________

data_rd                                        | <data>  |

rbank[$a0]                                               | <data>  |

                 |< fetch0>|<   0   >|<   1   >|<   2   >|
========================================================================
\end{verbatim}\\

In the source code, all registers and signals in stage \textless
i\textgreater\ are prefixed with "p\textless i\textgreater\_", as in p0\_*,
p1\_* and p2\_*. A stage includes a set of registers and all the logic that
feeds from those registers (more precisely, all the logic that sits between
registers p0\_* and p1\_* belongs in stage 0, and so on). Since there are
signals that feed from more than one pipeline stage (for example
p2\_wback\_mux\_sel, which controls the register bank write port data
multiplexor and feeds from p1 and p2), the naming convention has to be a
little flexible.\\

FETCH-0 would only include the logic between p0\_pc\_reg and the code ram
address port, so it has been omitted from the naming convention.\\

All read and write ports of the register bank are synchronous.
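In hedged VHDL, such a register bank might look like the sketch below.
Entity and signal names are illustrative, not taken from mips\_cpu.vhdl; the
'early read' trick it relies on is explained right after the sketch:

\begin{verbatim}
-- Hedged sketch: register bank with synchronous read and write ports.
-- The read indices come straight from the rs/rt opcode fields on the
-- code bus. Note the read-before-write behavior of the process: a
-- location written at this edge is read back stale, which is exactly
-- the data hazard discussed in the next section.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity reg_bank_sketch is
    port(
        clk           : in  std_logic;
        code_rd       : in  std_logic_vector(31 downto 0);
        rbank_we      : in  std_logic;
        p2_wr_index   : in  std_logic_vector(4 downto 0);
        p2_wback_data : in  std_logic_vector(31 downto 0);
        rs_rbank_data : out std_logic_vector(31 downto 0);
        rt_rbank_data : out std_logic_vector(31 downto 0)
    );
end reg_bank_sketch;

architecture rtl of reg_bank_sketch is
    type rbank_t is array(0 to 31) of std_logic_vector(31 downto 0);
    signal rbank : rbank_t := (others => (others => '0'));
begin
    process(clk)
    begin
        if rising_edge(clk) then
            -- Write port: logically stage 2.
            if rbank_we = '1' then
                rbank(to_integer(unsigned(p2_wr_index))) <= p2_wback_data;
            end if;
            -- Read ports: indices come straight from the rs/rt opcode
            -- fields on the code bus, so the registered data is already
            -- usable in stage 1 (the 'early read' described below).
            rs_rbank_data <= rbank(to_integer(unsigned(code_rd(25 downto 21))));
            rt_rbank_data <= rbank(to_integer(unsigned(code_rd(20 downto 16))));
        end if;
    end process;
end rtl;
\end{verbatim}\\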
The read ports belong logically to stage 1 and the write port to stage 2.\\

IMPORTANT: though the register bank read port is synchronous, its data can be
used in stage 1 because it is read early (the read address port is loaded at
the same time as the instruction opcode). That is, a small part of the
instruction decoding is done in stage FETCH-1, by feeding the source register
index fields straight from the code bus to the register bank BRAM. Bearing in
mind that the code cache RAM is meant to be the exact same type of block as
the register bank (or faster, if the register bank is implemented with
distributed RAM), and that we will cram the whole ALU delay plus the reg bank
delay into stage 1, it does not hurt to move a tiny part of the decoding to
the previous cycle.\\

All registers, with a few exceptions, belong squarely to one of the pipeline
stages:\\

\begin{tabular}{ l l }
Stage 0: & \\
p0\_pc\_reg & PC\\
(external code ram read port register) & Loads the same as PC\\
 & \\
Stage 1: & \\
p1\_ir\_reg & Instruction register\\
(register bank read port register) & \\
p1\_rbank\_forward & Feed-forward data (hazards)\\
p1\_rbank\_rs\_hazard & Rs hazard detected\\
p1\_rbank\_rt\_hazard & Rt hazard detected\\
 & \\
Stage 2: & \\
p2\_exception & Exception control\\
p2\_do\_load & Load from data\_rd\\
p2\_ld\_* & Load control\\
(register bank write port register) & \\
\end{tabular}\\

Note how the register bank ports belong in different stages even though they
are part of the same physical device. There is no conflict here; hazards are
handled properly (with logic defined in explicit, vendor-independent VHDL
code, not with synthesis pragmas, etc.).\\

There is a small number of global registers that don't belong to any pipeline
stage:\\

\begin{tabular}{ p{4cm} l }
pipeline\_stalled & Together, these two signals...\\
pipeline\_interlocked & ...control pipeline stalls\\
\end{tabular}\\

And of course there are the special registers accessible from the CPU
programmer's model:\\

\begin{tabular}{ p{4cm} l }
mdiv\_hi\_reg & register HI from multiplier block\\
mdiv\_lo\_reg & register LO from multiplier block\\
cp0\_status & register CP0[status]\\
cp0\_epc & register CP0[epc]\\
cp0\_cause & register CP0[cause]\\
\end{tabular}\\

These belong logically to pipeline stage 1 (they can be considered an
extension of the register bank) but have been spared the prefix for
clarity.\\

Note that the CP0 status and cause registers are only partially
implemented.\\

Again, this needs a better explanation and a diagram.\\

\section{Interlocking and Data Hazards} \label{interlocking_and_data_hazards}

There are two data hazards we need to care about:\\

a) If an instruction needs to access a register which was modified by the
previous instruction, we have a data hazard -- because the register bank is
synchronous, a memory location can't be read in the same cycle it is updated;
we would get the pre-update value.\\
b) A memory load into a register Rd produces its result a cycle late, so if
the instruction after the load needs to access Rd there is a conflict.\\

Conflict (a) is solved with some data forwarding logic: if we detect the data
hazard, the register bank uses a 'feed-forward' value instead of the value
read from the register bank memory.\\

In file mips\_cpu.vhdl, see process 'data\_forward\_register' and the
following few lines, where the hazard detection logic and the forwarding
register and multiplexors are implemented. Note that the hazard is detected
separately for both read ports of the reg bank (p1\_rbank\_rs\_hazard and
p1\_rbank\_rt\_hazard).
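As a hedged illustration, the logic for the Rs port is of this general shape.
This is not a verbatim excerpt from mips\_cpu.vhdl; it reuses the
illustrative signals of the register bank sketch above, plus p1\_rs as the Rs
operand actually consumed by the ALU:

\begin{verbatim}
-- Hedged sketch: hazard detection plus forwarding for the Rs port.
data_forward_register:
process(clk)
begin
    if rising_edge(clk) then
        -- The opcode being fetched reads the very register that the
        -- instruction ahead of it writes at this same clock edge; the
        -- synchronous read port would return the stale value.
        if rbank_we = '1' and p2_wr_index = code_rd(25 downto 21) then
            p1_rbank_rs_hazard <= '1';
            p1_rbank_forward   <= p2_wback_data;  -- capture fresh value
        else
            p1_rbank_rs_hazard <= '0';
        end if;
    end if;
end process data_forward_register;

-- Bypass multiplexor: use the forwarded value when a hazard was flagged.
p1_rs <= p1_rbank_forward when p1_rbank_rs_hazard = '1' else rs_rbank_data;
\end{verbatim}\\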
Note too that this logic is strictly regular VHDL code -- there is no need to
rely on the synthesis tool to add the bypass logic for us. This gets us some
measure of vendor independence.\\

As for conflict (b), in the original MIPS-I architecture it was the job of
the programmer to make sure that a loaded value was not used before it was
available -- by inserting NOPs after the load instruction if necessary. This
is what I call the 'load delay slot', as discussed in
\cite[p.~13-1]{r3k_ref_man}.\\

The C toolchain needs to be set up for MIPS-I compliance in order to build
object code compatible with this scheme.\\

But all subsequent versions of the MIPS architecture implement a different
scheme instead, the 'load interlock' (\cite[p.~28]{see_mips_run}). You often
see this behavior in code generated by gcc, even when using the -mips1 flag
(this is probably due to the precompiled support code or libc; I have to
check).\\

In short, it pays to implement load interlocks, so this core does; but the
feature should be made optional through a generic.\\

The load interlock is triggered in stage 1 (ALU/MEM) of the load instruction;
when triggered, pipeline stages 0 and 1 stall, but pipeline stage 2 is
allowed to proceed. That is, PC and IR are frozen but the value loaded from
memory is written into the register bank.\\

In the current implementation, the instruction following the load is
UNCONDITIONALLY stalled, even if it does not use the target register of the
load. This prevents, for example, interleaving read and write memory cycles
back to back, which the CPU otherwise could do. The interlock should only be
triggered when necessary; this has to be fixed.\\

\needspace{17\baselineskip}
\begin{verbatim}
==== Chronogram 3.4.A: data read cycle, showing interlock =============

                  ____      ____      ____      ____      ____
clk          ____/    \____/    \____/    \____/    \____/    \____/

code_rd_addr XXXX| 0x099c  | 0x09a0  | 0x09a4            | 0x09a8  |

byte_we      XXXX|  0000   |  1111   |  0000   |  0000   |  1100   |

                                     |< stall >|
             ________________________           ____________________
code_rd_vma                          \_________/
                                      _________
data_rd_vma  ________________________/         \____________________

data_rd_addr XXXX|XXXXXXXXX|XXXXXXXXX| 0x0700  |XXXXXXXXX|

data_rd      XXXX|XXXXXXXXX|XXXXXXXXX|XXXXXXXXX| [0x700] |XXXXXXXXX|

(target reg) ????|?????????|?????????|?????????|?????????| [0x700] |

                              (data is registered here)--^
========================================================================
\end{verbatim}\\

Note how a fetch cycle is delayed. This waveform was produced by this code:

\begin{verbatim}
    ...
    998:   ac430010    sw      v1,16(v0)
    99c:   80440010    lb      a0,16(v0)
    9a0:   a2840000    sb      a0,0(s4)
    9a4:   80440011    lb      a0,17(v0)
    9a8:   00000000    nop
    ...
\end{verbatim}\\

Note how read and write cycles are spaced instead of being interleaved, as
they would be if interlocking were implemented efficiently (in this example
there was a real hazard, register \$a0, but that's a coincidence -- I need to
find a better example in the listing files...).
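To make the stall mechanism concrete, here is a hedged VHDL sketch of an
unconditional load interlock of this kind. pipeline\_interlocked is the real
register name; p1\_is\_load, reset and stall\_stages\_0\_1 are illustrative,
and the actual implementation differs:

\begin{verbatim}
-- Hedged sketch: unconditional one-cycle load interlock.
load_interlock:
process(clk)
begin
    if rising_edge(clk) then
        if reset = '1' then
            pipeline_interlocked <= '0';
        else
            -- Trigger in stage 1 of EVERY load, whether or not the next
            -- instruction uses the load target (the inefficiency noted
            -- above; a better version would compare register indices).
            -- The 'not' term limits the stall to a single cycle even
            -- though the frozen stage 1 keeps holding the load.
            pipeline_interlocked <= p1_is_load and not pipeline_interlocked;
        end if;
    end if;
end process load_interlock;

-- Stages 0 and 1 freeze (PC and IR hold); stage 2 proceeds, so the
-- loaded value still gets written into the register bank.
stall_stages_0_1 <= pipeline_interlocked or pipeline_stalled;
\end{verbatim}\\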
\section{Exceptions} \label{exceptions}

The only exceptions supported so far are software exceptions, and of those
only the instructions BREAK and SYSCALL, the 'unimplemented opcode' trap and
the 'user-mode access to CP0' trap.\\

Memory privileges are not and will not be implemented. Hardware/software
interrupts are still unimplemented too.\\

Exceptions are meant to work as in the R3000 CPUs, except for the vector
address. They save their own address to EPC, update SR, abort the following
instruction, and jump to the exception vector 0x0180. All as per the specs
except the vector address (we only use one).\\

The following instruction is aborted even if it is a load or a jump, and
traps work as specified even from a delay slot -- in that case, the address
saved to EPC is not the victim instruction's but the preceding jump
instruction's, as explained in \cite[p.~64]{see_mips_run}.\\

The Plasma CPU used to save in EPC the address of the instruction after BREAK
or SYSCALL, and used a nonstandard vector address (0x03c). This core goes the
standard R3000 way instead.\\

Note that the EPC register is not used by any instruction other than mfc0.
RFE is implemented and works as per the R3000 specs.\\

\section{Multiplier} \label{multiplier}

The core includes a multiplier/divider module which is a slightly modified
version of Plasma's multiplier unit. The changes have been commented in the
source code.\\

The main difference is that Plasma does not stall the pipeline while a
multiplication/division is going on; it only does so when you attempt to read
registers HI or LO while the multiplier is still running. Only then will the
pipeline stall until the operation completes.\\

This core instead always stalls for the whole duration of the operation. Not
only is it simpler this way, it will also make it easier to abort mult/div
instructions.\\

The logic dealing with mul/div stalls is a bit convoluted and could use some
explanation and an ASCII chronogram. Again, TBD.\\
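Pending that, the following hedged sketch shows the intended behavior only.
The signal names (p1\_is\_mdiv, mdiv\_busy, stall\_for\_mdiv, reset) are
illustrative; the real stall logic in the source is more involved:

\begin{verbatim}
-- Hedged sketch: stall for the whole duration of a mult/div.
mdiv_stall:
process(clk)
begin
    if rising_edge(clk) then
        if reset = '1' then
            stall_for_mdiv <= '0';
        else
            if p1_is_mdiv = '1' and mdiv_busy = '1' then
                stall_for_mdiv <= '1';   -- freeze while HI/LO compute
            elsif mdiv_busy = '0' then
                stall_for_mdiv <= '0';   -- result ready, resume
            end if;
        end if;
    end if;
end process mdiv_stall;
\end{verbatim}\\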