URL https://opencores.org/ocsvn/marca/marca/trunk
Subversion Repositories marca

[/] [marca/] [trunk/] [doc/] [implementation.tex] - Rev 8

\documentclass[10pt, twoside, a4paper]{article}
\usepackage{graphicx}
\usepackage{listings}
 
\title{marca - McAdam's RISC Computer Architecture\\Implementation Details}
\author{Wolfgang Puffitsch}
 
\begin{document}
 
  \maketitle
 
  \section{General}
 
  \begin{itemize}
  \item 16 16-bit registers
  \item 16KB instruction ROM (8192 instructions)
  \item 8KB data RAM
  \item 256 byte data ROM
  \item 75 instructions
  \item 16 interrupt vectors
  \end{itemize}
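
  As a quick consistency check on these figures (a sketch, assuming
  that KB denotes 1024 bytes), the size of the instruction ROM and the
  number of instructions it holds imply a fixed instruction width of
  16 bits:
  \[
    \frac{16 \cdot 1024~\mathrm{bytes}}{8192~\mathrm{instructions}}
    = 2~\mathrm{bytes\ per\ instruction}
  \]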
 
  \section{Internals}
 
  The processor features a 4-stage pipeline:
  \begin{itemize}
  \item instruction fetch
  \item instruction decode
  \item execution/memory access
  \item write back
  \end{itemize}
  This scheme is similar to the one used in the MIPS architecture,
  except that the execution and memory access stages are merged into
  one. Because our architecture does not support indexed addressing,
  the memory unit does not need the ALU's result and can work in
  parallel with the ALU, which reduces the number of possible hazards.
 
  Figure \ref{fig:marca} shows a rough scheme of the internals of the
  processor.
  \begin{figure}[ht!]
    \centering
    \includegraphics[width=.95\textwidth]{marca}
    \caption{Internal scheme}
    \label{fig:marca}
  \end{figure}
 
  \subsection{Branches}
  Branches are not predicted; if executed, they stall the pipeline,
  leading to a total execution time of 4 cycles. The fetch stage is
  not stalled; the decode stage, however, is stalled for two cycles
  to compensate for that.
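
  The cost of branches can be estimated with a simplified model, which
  is an assumption made here for illustration and not a measurement:
  if a fraction $b$ of all instructions are executed branches (4
  cycles each) and every other instruction completes in a single
  cycle, the average number of cycles per instruction is
  \[
    \mathrm{CPI} = (1 - b) \cdot 1 + b \cdot 4 = 1 + 3b .
  \]
  For a hypothetical $b = 0.2$ this gives $\mathrm{CPI} = 1.6$. The
  model ignores multi-cycle operations such as loads, multiplication
  and division.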
 
  \subsection{Instruction fetch}
  This stage is not spectacular: it simply reads an instruction from
  the instruction ROM, and extracts the bits for the source and
  destination registers.
 
  \subsection{Instruction decode}
  This stage translates the bit-patterns of the opcodes to the signals
  used internally for the operations. It also holds the register file
  and handles access to it. Immediate values are also constructed here.
 
  \subsection{Execution / Memory access}
  The execution stage is the heart and soul of the processor: it holds
  the ALU, the memory/IO unit and a unit for interrupt handling.
 
  \subsubsection{ALU}
  The ALU performs all arithmetic and logic computations and maintains
  the processor's flags (which are organized as shown in table
  \ref{tab:flags}).
 
  \begin{table}[ht!]
    \centering
    \begin{tabular}{|p{.75em}|p{.75em}|p{.75em}|p{.75em}
                    |p{.75em}|p{.75em}|p{.75em}|p{.75em}
                    |p{.75em}|p{.75em}|p{.75em}|p{.75em}
                    |p{.75em}|p{.75em}|p{.75em}|p{.75em}|}
      \multicolumn{16}{c}{Bit 15 \hfill Bit 0} \\
      \hline
      & & & & & & & & & & P & I & N & V & C & Z \\
      \hline
    \end{tabular}
    \caption{The flag register}
    \label{tab:flags}
  \end{table}
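
  Read off table \ref{tab:flags}, each flag occupies a single bit $k$
  and can be selected with the mask $2^k$. The hexadecimal values
  below are derived from the bit positions only; they are given as an
  illustration and are not taken from the processor's sources:
  \[
    \begin{array}{c|cccccc}
      \mathrm{flag} & \mathrm{Z} & \mathrm{C} & \mathrm{V}
                    & \mathrm{N} & \mathrm{I} & \mathrm{P} \\
      \hline
      k   & 0 & 1 & 2 & 3 & 4 & 5 \\
      2^k & \mathtt{0x01} & \mathtt{0x02} & \mathtt{0x04}
          & \mathtt{0x08} & \mathtt{0x10} & \mathtt{0x20}
    \end{array}
  \]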
 
  Operations which need more than one cycle to execute (multiplication,
  division and modulo) block the rest of the processor until they are
  finished.
 
  \subsubsection{Memory/IO unit}
  The memory/IO unit takes care of the ordinary data memory, the data
  ROM (which is mapped to the addresses right above the RAM) and the
  communication to peripheral modules. Peripheral modules are located
  within the memory/IO unit and mapped to the highest addresses.
 
  The memories (including the instruction ROM) are Altera-specific; we
  decided not to use generic memories, because \textsl{Quartus} can
  update the contents of its proprietary ROMs without synthesizing the
  whole design. Because all memories are single-ported (and thus
  fairly simple), it should be easy to replace them with memories
  specific to other vendors.
 
  We also decided against the use of external memories; larger FPGAs
  can accommodate all addressable memory on-chip, so the implementation
  overhead would not have paid off.
 
  Accesses which take more than one cycle (stores to peripheral
  modules and all load operations) block the rest of the processor
  until they are finished.
 
  \paragraph{Peripheral modules}
  The peripheral modules use a slightly modified version of the SimpCon
  interface. The SimpCon-specific signals are grouped into records,
  and the words which can be read or written are limited to 16 bits.
  To access such a module, one may only use \texttt{load} and
  \texttt{store} instructions which point to aligned addresses.
 
  \paragraph{UART}
  The built-in UART is derived from the sc\_uart from Martin
  Sch\"oberl.  Apart from adapting the SimpCon interface, an interrupt
  line and two bits for enabling/masking receive (bit 3 in the status
  register) and transmit (bit 2) interrupts. In the current version
  address 0xFFF8 (-8) correspond to the UART's status register and
  address 0xFFFA (-6) to the wr\_data/rd\_data register.
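
  The negative numbers in parentheses are the same addresses
  interpreted as 16-bit two's complement values:
  \[
    \mathtt{0xFFF8} = 65528 = 2^{16} - 8, \qquad
    \mathtt{0xFFFA} = 65530 = 2^{16} - 6 .
  \]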
 
  \subsubsection{Interrupt unit}
  The interrupt unit takes care of the interrupt vectors and, of
  course, the triggering of interrupts. Interrupts are executed only
  if the global interrupt flag is set, none of the other units is busy
  and the instruction in the execution stage is valid (it takes 3
  cycles after jumps, branches etc. until a new valid instruction is
  in that stage).
 
  Instructions which cannot be decoded as well as the ``error''
  instruction trigger interrupt 0; the ALU can trigger interrupt 1
  (division by zero), the memory unit can trigger interrupt 2 (invalid
  memory access). In contrast to all other interrupts, these three
  interrupts do not repeat the instruction which is executed when they
  occur.
 
  \subsection{Write back}
  The write back stage passes on the result of the execution stage to
  all other stages.
 
  \section{Assembler}
  The assembler \textsl{spar} (SPear Assembler Recycled) uses a syntax
  much like that of usual Unix-style assemblers. It accepts the
  pseudo-ops \texttt{.file}, \texttt{.text}, \texttt{.data},
  \texttt{.bss}, \texttt{.align}, \texttt{.comm}, \texttt{.lcomm},
  \texttt{.org} and \texttt{.skip} with the usual meanings. The
  mnemonic \texttt{data} initializes a byte to some constant value. In
  contrast to the instruction set architecture specification,
  \texttt{mod} and \texttt{umod} accept three operands (if a move is
  needed, it is silently inserted).
 
  The assembler produces three files: one file for the instruction
  ROM, one file for the even bytes of the data ROM and one file for
  the odd bytes of the data ROM. Splitting the data is necessary
  because the data memories are internally split into two 8-bit
  memories in order to support unaligned memory accesses without
  delays.
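
  One way to picture this split (a sketch; the exact mapping used by
  the hardware is an assumption here) is that the byte at address $a$
  is stored in bank $a \bmod 2$ at index $\lfloor a/2 \rfloor$. A
  16-bit access at address $a$ touches the bytes $a$ and $a+1$, and
  \[
    (a + 1) \bmod 2 \neq a \bmod 2 ,
  \]
  so the two bytes always lie in different 8-bit memories and can be
  accessed in the same cycle, whether $a$ is even or odd. For example,
  an access at $a = 3$ uses index 1 of the odd memory and index 2 of
  the even memory.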
 
  Three output formats are supported: .mif (Memory Initialization
  Format), .hex (Intel Hex Format) and a binary format designed for
  download via UART.
 
  \section{Resource usage and speed}
 
  The processor was synthesized with \textsl{Quartus II} for the
  \textsl{Cyclone EP1C12Q240C8} FPGA with 12060 logic cells and 29952
  bytes of on-chip memory available.
 
  The processor needs $\sim$3550 logic cells (29\%) when compiled for
  maximum clock frequency, which is $\sim$60 MHz. When optimized for
  area, it needs $\sim$2600 logic cells (22\%) and runs at
  $\sim$25 MHz.
 
  The processor uses 24832 bytes or 83\% of on-chip memory.
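
  These percentages follow from the figures above; the decomposition
  of the memory usage assumes that KB denotes 1024 bytes and that the
  three memories listed in the first section account for all of it:
  \[
    \frac{3550}{12060} \approx 0.29, \qquad
    \frac{2600}{12060} \approx 0.22, \qquad
    \frac{24832}{29952} \approx 0.83,
  \]
  \[
    \underbrace{16384}_{\mathrm{instruction\ ROM}}
    + \underbrace{8192}_{\mathrm{data\ RAM}}
    + \underbrace{256}_{\mathrm{data\ ROM}} = 24832~\mathrm{bytes}.
  \]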
 
  \section{Example}
 
  \subsection{Reversing a line}
 
  Listing \ref{lst:uart} shows how to interface with the UART via
  interrupts. The program reads in a line from the UART and then
  writes it back reversed. Lines 1 to 4 show how to instantiate memory
  (the two bytes defined form the DOS-style end-of-line). Lines 7 to
  25 initialize the registers and register the interrupt vectors;
  line 28 builds a barrier against the rest of the code.
 
  Lines 32 to 76 form the interrupt service routine. It first checks
  whether it is operating in read or in write mode. When reading, it
  reads from the UART and stores the result. A mode switch occurs when
  a newline character is encountered. In write mode the contents of
  the buffer are written to the UART, and the routine switches back to
  read mode when finished.
 
  In figure \ref{fig:sim} the results of the simulation are presented.
 
  \lstset{basicstyle=\footnotesize,numbers=left,numberstyle=\tiny}
  \lstset{caption=Example for the UART and interrupts}
  \lstset{label=lst:uart}
  \lstinputlisting{uart_reverse.s}
 
  \begin{figure}[ht!]
    \centering
    \includegraphics[width=.95\textwidth]{uart_sim}
    \caption{Simulation results}
    \label{fig:sim}    
  \end{figure}
 
  \subsection{Computing factorials}
 
  The example in listing \ref{lst:fact} computes the factorials of
  1 \ldots 9 and writes the results to the PC via UART. Note that the
  last result transmitted will be wrong, because it is truncated to 16
  bits.
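
  The truncation can be checked by hand (assuming the result is kept
  as an unsigned 16-bit value, so that only the low 16 bits survive):
  \[
    8! = 40320 \le 2^{16} - 1 = 65535, \qquad
    9! = 362880 > 65535,
  \]
  \[
    362880 \bmod 2^{16} = 362880 - 5 \cdot 65536 = 35200 .
  \]
  Under that assumption the value transmitted for $9!$ would be 35200.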
 
  \lstset{basicstyle=\footnotesize,numbers=left,numberstyle=\tiny}
  \lstset{caption=Computing factorials}
  \lstset{label=lst:fact}
  \lstinputlisting{factorial.s}
 
 
  \section{Versions Of This Document}
 
  2006-12-14: Draft version \textbf{0.1}
 
  \noindent
  2006-12-29: Draft version \textbf{0.2}
  \begin{itemize}
    \item A few refinements.
  \end{itemize}
 
  \noindent
  2007-01-22: Draft version \textbf{0.3}
  \begin{itemize}
    \item Added another example.
  \end{itemize}
 
  \noindent
  2007-02-02: Draft version \textbf{0.4}
  \begin{itemize}
    \item Updated resource usage and speed section.
  \end{itemize}
 
\end{document}