URL https://opencores.org/ocsvn/zipcpu/zipcpu/trunk

# Subversion Repositories: zipcpu

## Compare Revisions

• This comparison shows the changes necessary to convert path
/zipcpu/trunk/doc/src
from Rev 199 to Rev 202

## Rev 199 → Rev 202

/spec.tex
86,8 → 86,8
 \project{ZipCPU} \title{Specification} \author{Dan Gisselquist, Ph.D.} \email{dgisselq (at) opencores.org} \revision{Rev.~1.0} \email{dgisselq (at) ieee.org} \revision{Rev.~1.1} \definecolor{webred}{rgb}{0.5,0,0} \definecolor{webgreen}{rgb}{0,0.4,0} \hypersetup{
122,6 → 122,8
 a copy. \end{license} \begin{revisionhistory} 2.0 & 1/18/2017 & Gisselquist & Switched from 32--bit to 8--bit bytes.\\\hline 1.1 & 11/28/2016 & Gisselquist & Moved the ZipSystem address to {\tt 0xff000000} base.\\\hline 1.0 & 11/4/2016 & Gisselquist & Major rewrite,  includes compiler info\\\hline 0.91& 7/16/2016 & Gisselquist & Described three more CC bits\\\hline
205,9 → 207,7
 For those who like buzz words, the ZipCPU is: \begin{itemize} \item A 32-bit CPU: All registers are 32-bits, addresses are 32-bits,  instructions are 32-bits wide, etc. Indeed, the ``byte size''  for this processor, as per the C--language definition of a  ``byte'' being the smallest addressable unit, is 32--bits.  instructions are 32-bits wide, etc. \item A RISC CPU. There is no microcode for executing instructions. All  instructions are designed to be completed in one clock cycle. \item A Load/Store architecture. (Only load and store instructions
456,13 → 456,13
 Given the performance benefits achieved by early branching, setting this flag is highly recommended.   {\tt OPT\_PIPELINED\_BUS\_ACCESS} controls whether or not {\tt LOD}/{\tt STO} {\tt OPT\_PIPELINED\_BUS\_ACCESS} controls whether or not memory instructions can take advantage of the pipelined wishbone bus. To be eligible, the operations to be pipelined must be adjacent, must be all {\tt LOD}s or all {\tt STO}s, and the addresses must all use the same base loads or all stores, and the addresses must all use the same base address register and either have identical immediate offsets, or immediate offsets that increase by one for each instruction. Further, the {\tt LOD}/{\tt STO} string of instructions must all have the same conditional string of load (or store) instructions must all have the same conditional (if any). Currently, this approach and benefit is most effectively used when saving registers to or restoring registers from the stack at the beginning/end of a procedure, when using assembly optimized programs, or
472,7 → 472,7
 wishbone bus implementation can handle pipelined bus accesses. The logic impact of this setting is minimal, the performance impact can be significant.   {\tt OPT\_VLIW} includes within the instruction set the Very Long Instruction {\tt OPT\_CIS} includes within the instruction set the Very Long Instruction Word packing, which packs up to two instructions within each instruction word. Non--packed instructions will still execute as normal, this just enables the decoding and running of packed instructions.
484,20 → 484,20
 include the respective peripherals, comment them out not to. These only  affect the ZipSystem implementation, and not any ZipBones implementations.   Finally, if you find yourself needing to debug the core and specifically needing to get a trace from the core to find out why something specifically failed, you may find it useful to define {\tt DEBUG\_SCOPE}. This will add a 32--bit debug output from the core, as the last argument to the core, to the ZipSystem, or even to ZipBones. The actual definition and composition of this debugging bit--field changes from one implementation to the next, depending upon needs and necessities, so please look at the code at the bottom of {\tt zipcpu.v} for more details. Finally, if you find yourself needing to debug the core and specifically needing to get a trace from the core to find out why something specifically failed, you may find it useful to define {\tt DEBUG\_SCOPE}. This will add a 32--bit debug output from the core, as the last argument to the core, to the ZipSystem, or even to ZipBones. The actual definition and composition of this debugging bit--field changes from one implementation to the next, depending upon needs and necessities, so please look at the code at the bottom of {\tt zipcpu.v} for more details.   That ends our discussion of CPU options, but there remain several implementation parameters that can be defined with the CPU as well. Some of these, such as {\tt IMPLEMENT\_MPY}, {\tt IMPLEMENT\_DIVIDE}, {\tt IMPLEMENT\_FPU}, and {\tt EARLY\_BRANCHING} have already been discussed. The remainder shall be discussed quickly here. That ends our discussion of CPU options, but there remain several implementation parameters that can be defined with the CPU as well. Some of these, such as {\tt IMPLEMENT\_MPY}, {\tt IMPLEMENT\_DIVIDE}, {\tt IMPLEMENT\_FPU}, and {\tt EARLY\_BRANCHING} have already been discussed. The remainder shall be discussed quickly here.   
The {\tt RESET\_ADDRESS} parameter controls what address the CPU attempts to fetch its first instruction from upon any CPU reset. The default value is
506,10 → 506,11
   The {\tt ADDRESS\_WIDTH} parameter can be used to trim down the width of addresses used by the CPU. For example, although the Wishbone Bus definition used by the CPU has 32--address lines, particular implementations may have used by the CPU has 30--address lines, particular implementations may have fewer. By setting this value to the actual number of wires in the address bus, some logic can be spared within the CPU. The default is a 32--bit wide bus. bus, some logic can be spared within the CPU. The default is also the maximum, a 30--bit address width. Two additional bits are used internally by the CPU to create the appearance of an 8--bit bus, by using the wishbone select lines.   The {\tt LGICACHE} parameter specifies the log base two of the instruction cache size. If no instruction cache is used, this option has no effect.
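The relationship between the 30--bit word address and the two extra internal bits can be sketched in Python. This is only an illustration of the idea described above, not the CPU's actual logic: the function name `split_byte_address` is hypothetical, and both the big-endian lane ordering (offset 0 mapping to the most significant select bit) and the natural-alignment requirement are assumptions that this section does not state.

```python
def split_byte_address(byte_addr, size):
    """Map a 32-bit byte address and an access size (1, 2 or 4 bytes)
    to a (word_address, select_mask) pair for a 32-bit wide bus."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2, or 4 bytes")
    if byte_addr % size != 0:
        # ASSUMPTION: accesses are naturally aligned
        raise ValueError("access must be naturally aligned")
    word_addr = (byte_addr & 0xFFFFFFFF) >> 2   # the 30 address lines
    offset = byte_addr & 0x3                    # the two internal bits
    lanes = (1 << size) - 1                     # 0b1, 0b11 or 0b1111
    # ASSUMPTION: big-endian lanes, so offset 0 selects the top byte(s)
    select = lanes << (4 - size - offset)
    return word_addr, select
```

For instance, a 16-bit access at byte address `0x1002` would map to word address `0x400` with the two low select lines active under these assumptions.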
529,7 → 530,7
 CPU to be halted upon startup. This is useful for debugging, since it prevents the CPU from doing anything without supervision. Of course, once all pieces of your design are in place and proven, you'll probably want to set this to zero. zero, so that the CPU will then start up immediately upon power up.   The {\tt EXTERNAL\_INTERRUPTS} parameter controls the number of interrupt wires coming into the CPU. This number must be between one and sixteen,
546,7 → 547,7
 Fundamental to the understanding of the ZipCPU is its register set, and the performance model associated with it. The ZipCPU register set contains two sets of sixteen 32-bit registers, a supervisor and a user set as shown in Fig.~\ref{fig:regset}.  supervisor and a user set as shown in Fig.~\ref{fig:regset}. \begin{figure}\begin{center} \begin{tabular}{|c|c|c|c|c|} \multicolumn{2}{c}{Supervisor Register Set} &
586,11 → 587,12
 other registers are identical in their hardware functionality.\footnote{Jumps to {\tt R0}, an instruction used to implement a return from a subroutine, may be optimized in the future within the early branch logic.} By convention, the stack pointer is register 13 and noted as (SP)--although there is nothing special about this register other than this convention. Also by convention, if the compiler needs a frame pointer it will be placed into register~12, and may be abbreviated by FP. Finally, by convention, R0 will hold a subroutine's return address, sometimes called the link register (LR). stack pointer is register 13 and noted as (SP). Beyond this convention,  word accesses to offsets of the stack pointer are compressed when using the CIS instruction set. Also by convention, if the compiler needs a frame pointer it will be placed into register~12, and may be abbreviated by FP.  Finally, by convention, R0 will hold a subroutine's return address, sometimes called the link register (LR).   When the CPU is in supervisor mode, instructions can access both register sets via the {\tt MOV} instruction, whereas when the CPU is in user mode, {\tt MOV}
606,7 → 608,7
 22\ldots 16 & R/W & Reserved for future uses\\\hline 15 & R & Reserved for MMU exceptions\\\hline 14 & W & Clear I-Cache command, always reads zero\\\hline 13 & R & VLIW instruction phase (1 for first half)\\\hline 13 & R & CIS instruction phase (1 for first half)\\\hline 12 & R & (Reserved for) Floating Point Exception\\\hline 11 & R & Division by Zero Exception\\\hline 10 & R & Bus-Error Flag\\\hline
739,9 → 741,10
  error and division by zero flags, only it will be set upon a (yet to  be determined) floating point error.   \item In the case of VLIW instructions, if an exception occurs after the first  instruction but before the second, the fourteenth bit of the CC register  will be set to indicate this fact. \item In the case of CIS instructions, if an exception occurs after the first  instruction but before the second, the fourteenth bit of the CC  register will be set to indicate this fact. This can be combined with  the user PC to determine the address of the half-word where the fault occurred.   \item The fifteenth bit references a clear cache bit. The supervisor may  write a one to this bit in order to clear the CPU instruction cache.
762,26 → 765,26
 \begin{figure}\begin{center} \begin{bytefield}[endianness=big]{32} \bitheader{0-31}\\ \begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox{4}{DR} \begin{leftwordgroup}{Standard}\bitbox{1}{0}\bitbox[tlr]{4}{}  \bitbox[lrt]{5}{OpCode}  \bitbox[lrt]{3}{Cnd}  \bitbox[lrt]{3}{}  \bitbox{1}{0}  \bitbox{18}{18-bit Signed Immediate} \\ \bitbox{1}{0}\bitbox{4}{DR} \bitbox{1}{0}\bitbox[lr]{4}{DR}  \bitbox[lrb]{5}{}  \bitbox[lrb]{3}{}  \bitbox[lr]{3}{Cnd}  \bitbox{1}{1}  \bitbox{4}{BR}  \bitbox{14}{14-bit Signed Immediate}\end{leftwordgroup} \\ \begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox{4}{DR} \begin{leftwordgroup}{MOV}\bitbox{1}{0}\bitbox[lr]{4}{}  \bitbox[lrt]{5}{5'hf}  \bitbox[lrt]{3}{Cnd}  \bitbox[lrb]{3}{}  \bitbox{1}{A}  \bitbox{4}{BR}  \bitbox{1}{B}  \bitbox{13}{13-bit Signed Immediate}\end{leftwordgroup} \\ \begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox{4}{DR}  \bitbox{4}{4'hb} \begin{leftwordgroup}{LDI}\bitbox{1}{0}\bitbox[lrb]{4}{}  \bitbox{4}{4'hc}  \bitbox{23}{23-bit Signed Immediate}\end{leftwordgroup} \\ \begin{leftwordgroup}{NOOP}\bitbox{1}{0}\bitbox{3}{3'h7}  \bitbox{1}{}
789,129 → 792,72
  \bitbox{3}{xxx}  \bitbox{22}{Ignored}  \end{leftwordgroup} \\ \begin{leftwordgroup}{VLIW}\bitbox{1}{1}\bitbox[lrt]{4}{DR}  \bitbox[lrt]{5}{OpCode}  \bitbox[lrt]{3}{Cnd}  \bitbox{1}{0}  \bitbox{4}{Imm.}  \bitbox{14}{---} \\ \bitbox{1}{1}\bitbox[lr]{4}{}  \bitbox[lrb]{5}{}  \bitbox[lr]{3}{}  \bitbox{1}{1}  \bitbox{4}{BR}  \bitbox{14}{---} \\ \bitbox{1}{1}\bitbox[lrb]{4}{}  \bitbox{4}{4'hb}  \bitbox{1}{}  \bitbox[lrb]{3}{}  \bitbox{5}{5'b Imm}  \bitbox{14}{---} \\ \bitbox{1}{1}\bitbox{9}{---}  \bitbox[lrt]{3}{Cnd}  \bitbox{5}{---}  \bitbox[lrt]{4}{DR}  \bitbox[lrt]{5}{OpCode}  \bitbox{1}{0}  \bitbox{4}{Imm}  \\ \bitbox{1}{1}\bitbox{9}{---}  \bitbox[lr]{3}{}  \bitbox{5}{---}  \bitbox[lr]{4}{}  \bitbox[lrb]{5}{}  \bitbox{1}{1}  \bitbox{4}{Reg} \\ \bitbox{1}{1}\bitbox{9}{---}  \bitbox[lrb]{3}{}  \bitbox{5}{---}  \bitbox[lrb]{4}{}  \bitbox{4}{4'hb}  \bitbox{1}{}  \bitbox{5}{5'b Imm}  \end{leftwordgroup} \\ \end{bytefield} \caption{Zip Instruction Set Format}\label{fig:iset-format} \end{center}\end{figure} The basic format is that some operation, defined by the OpCode, is applied if a condition, Cnd, is true in order to produce a result which is placed in the destination register (DR). There are three basic exceptions to this model. The first is the {\tt MOV} instruction, which steals bits~13 and~18 to allow supervisor access to user registers. The second is the load 23--bit the destination register (DR).    There are three basic exceptions to this general instruction model. The first is the {\tt MOV} instruction, which steals bits~13 and~18 to allow supervisor access to user registers. In supervisor mode, these are set to one to reference user registers, zero otherwise. They are ignored in user mode. The second exception is the load 23--bit signed immediate instruction ({\tt LDI}), in that it accepts no conditions and uses only a 4-bit opcode. The last exception is the {\tt NOOP} instruction group, containing the {\tt NOOP}, {\tt BREAK}, and {\tt LOCK} opcodes. 
These instructions ignore their register and immediate settings.\footnote{A future version of the CPU may repurpose the immediate bits within the {\tt NOOP} instruction to be simulator commands, while the immediate/register bits within the {\tt BREAK} instruction may be used by the debugger for whatever purpose it chooses to use them for--such as a breakpoint table index.} group, containing the {\tt BREAK}, {\tt LOCK}, {\tt SIM}, and {\tt NOOP} opcodes. These instructions ignore their register and immediate settings. Further, the immediate bits used by these opcodes are available for simulation or debug facilities, but otherwise ignored by the CPU.   The ZipCPU also supports a very long instruction word (VLIW) set of instructions. These aren't truly VLIW instructions in the sense that the CPU still only issues one instruction at a time, but they do pack two instructions into a single instruction word. The number of bits used by the immediate field is adjusted to make space for these instruction words. Other than instruction format, the only basic difference between VLIW and normal instructions is that the CPU will not switch to interrupt mode in between the two instructions, unless an exception is generated by the first instruction. Likewise a new job given to the assembler is that of automatically packing as many instructions as possible into the VLIW format.   The disassembler will represent VLIW instructions by placing a vertical bar between the two components, but still leaving them on the same line.   \subsection{Instruction OpCodes}\label{sec:isa-opcodes} With a 5--bit opcode field, there are 32~possible instructions as shown in  Tbl.~\ref{tbl:iset-opcodes}. 
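As a rough illustration of the standard (non-compressed) layout in Fig.~\ref{fig:iset-format} as of Rev 202, the Python sketch below splits a 32-bit instruction word into its fields. The names `sign_extend` and `decode_standard` are hypothetical. The bit numbering (bit 31 leftmost) is an assumption read off the big-endian figure, although it agrees with the statement that {\tt MOV} borrows bits 13 and 18. The {\tt MOV} and {\tt NOOP}-group special cases are deliberately left out.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value

def decode_standard(word):
    """Split a standard instruction word into fields (sketch only)."""
    if (word >> 31) & 1:
        raise ValueError("bit 31 set: compressed (CIS) word, not handled here")
    dr = (word >> 27) & 0xF            # destination register
    opcode = (word >> 22) & 0x1F       # 5-bit opcode
    if (opcode >> 1) == 0xC:           # 4'hc prefix: LDI, no condition field
        return {"opcode": 0x18, "dr": dr, "cnd": None,
                "br": None, "imm": sign_extend(word, 23)}
    fields = {"opcode": opcode, "dr": dr, "cnd": (word >> 19) & 0x7}
    if (word >> 18) & 1:               # register plus 14-bit immediate
        fields["br"] = (word >> 14) & 0xF
        fields["imm"] = sign_extend(word, 14)
    else:                              # 18-bit immediate only
        fields["br"] = None
        fields["imm"] = sign_extend(word, 18)
    return fields
```

A real decoder would also need to treat {\tt MOV} (whose bits 18 and 13 select the register set) and the {\tt BREAK}/{\tt LOCK}/{\tt SIM}/{\tt NOOP} group separately.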
\begin{table}\begin{center} \begin{tabular}{|l|l|l|c|} \hline \rowcolor[gray]{0.85} OpCode & & Instruction &Sets CC \\\hline\hline 5'h00 & {\tt SUB} & Subtract & \\\cline{1-3} 5'h01 & {\tt AND} & Bitwise And & \\\cline{1-3} 5'h02 & {\tt ADD} & Add two numbers & \\\cline{1-3} 5'h03 & {\tt OR} & Bitwise Or & Y \\\cline{1-3} 5'h04 & {\tt XOR} & Bitwise Exclusive Or & \\\cline{1-3} 5'h05 & {\tt LSR} & Logical Shift Right & \\\cline{1-3} 5'h06 & {\tt LSL} & Logical Shift Left & \\\cline{1-3} 5'h07 & {\tt ASR} & Arithmetic Shift Right & \\\hline 5'h08 & {\tt MPY} & 32x32 bit multiply & Y \\\hline 5'h09 & {\tt LDILO} & Load Immediate Low & N\\\hline 5'h0a & {\tt MPYUHI} & Upper 32 of 64 bits from an unsigned 32x32 multiply & \\\cline{1-3} 5'h0b & {\tt MPYSHI} & Upper 32 of 64 bits from a signed 32x32 multiply & Y \\\cline{1-3} 5'h0c & {\tt BREV} & Bit Reverse B operand into result& \\\cline{1-3} 5'h0d & {\tt POPC}& Population Count & \\\cline{1-3} 5'h0e & {\tt ROL} & Rotate Ra left by OpB bits& \\\hline 5'h0f & {\tt MOV} & Move OpB into Ra & N \\\hline 5'h10 & {\tt CMP} & Compare (Ra-OpB) to zero & Y \\\cline{1-3} 5'h11 & {\tt TST} & Test (AND w/o setting result) & \\\hline 5'h12 & {\tt LOD} & Load Ra from memory (OpB) & N \\\cline{1-3} 5'h13 & {\tt STO} & Store Ra into memory at (OpB) & \\\hline\hline 5'h14 & {\tt DIVU} & Divide, unsigned & Y \\\cline{1-3} 5'h15 & {\tt DIVS} & Divide, signed & \\\hline\hline 5'h16/7 & {\tt LDI} & Load 23--bit signed immediate & N \\\hline\hline 5'h18 & {\tt FPADD} & Floating point add & \\\cline{1-3} 5'h19 & {\tt FPSUB} & Floating point subtract & \\\cline{1-3} 5'h1a & {\tt FPMPY} & Floating point multiply & Y \\\cline{1-3} 5'h1b & {\tt FPDIV} & Floating point divide & \\\cline{1-3} 5'h1c & {\tt FPI2F} & Convert integer to floating point & \\\cline{1-3} 5'h1d & {\tt FPF2I} & Convert floating point to integer & \\\hline 5'h1e & & {\em Reserved for future use} &\\\hline 5'h1f & & {\em Reserved for future use} &\\\hline 5'h18 & & \hbox to 
0.5in{\tt NOOP} (A-register = PC)&\\\cline{1-3} 5'h19 & & \hbox to 0.5in{\tt BREAK} (A-register = PC)& N\\\cline{1-3} 5'h1a & & \hbox to 0.5in{\tt LOCK} (A-register = PC)&\\\hline \begin{tabular}{|l|l|l|l|c|} \hline \rowcolor[gray]{0.85} OpCode & & A-Reg & Instruction &Sets CC \\\hline\hline 5'h00 & {\tt SUB} & \multicolumn{2}{l|}{Subtract} & \\\cline{1-4} 5'h01 & {\tt AND} & \multicolumn{2}{l|}{Bitwise And} & \\\cline{1-4} 5'h02 & {\tt ADD} & \multicolumn{2}{l|}{Add two numbers} & \\\cline{1-4} 5'h03 & {\tt OR} & \multicolumn{2}{l|}{Bitwise Or} & Y \\\cline{1-4} 5'h04 & {\tt XOR} & \multicolumn{2}{l|}{Bitwise Exclusive Or} & \\\cline{1-4} 5'h05 & {\tt LSR} & \multicolumn{2}{l|}{Logical Shift Right} & \\\cline{1-4} 5'h06 & {\tt LSL} & \multicolumn{2}{l|}{Logical Shift Left} & \\\cline{1-4} 5'h07 & {\tt ASR} & \multicolumn{2}{l|}{Arithmetic Shift Right} & \\\hline   5'h08 & {\tt BREV} & \multicolumn{2}{l|}{Bit Reverse B operand into result}& \\\cline{1-4} 5'h09 & {\tt LDILO} & \multicolumn{2}{l|}{Load Immediate Low} & N\\\hline 5'h0a & {\tt MPYUHI} & \multicolumn{2}{l|}{Upper 32 of 64 bits from an unsigned 32x32 multiply} & \\\cline{1-4} 5'h0b & {\tt MPYSHI} & \multicolumn{2}{l|}{Upper 32 of 64 bits from a signed 32x32 multiply} & Y \\\cline{1-4} 5'h0c & {\tt MPY} & \multicolumn{2}{l|}{32x32 bit multiply} & \\\hline 5'h0d & {\tt MOV} & \multicolumn{2}{l|}{Move OpB into Ra} & N \\\hline 5'h0e & {\tt DIVU} & R0-R13 & Divide, unsigned & Y \\\cline{1-4} 5'h0f & {\tt DIVS} & R0-R13 & Divide, signed & \\\hline\hline % 5'h10 & {\tt CMP} & \multicolumn{2}{l|}{Compare (Ra-OpB) to zero} & Y \\\cline{1-4} 5'h11 & {\tt TST} & \multicolumn{2}{l|}{Test (AND w/o setting result)} & \\\hline 5'h12 & {\tt LW} & \multicolumn{2}{l|}{Load a 32-bit word from memory (OpB) into Ra} & \\\cline{1-4} 5'h13 & {\tt SW} & \multicolumn{2}{l|}{Store a 32-bit word from Ra into memory at (OpB)} & \\\cline{1-4} 5'h14 & {\tt LH} & \multicolumn{2}{l|}{Load 16-bits from memory (opB) into Ra, clear upper 
16 bits} & N \\\cline{1-4} 5'h15 & {\tt SH} & \multicolumn{2}{l|}{Store the lower 16-bits of Ra into memory at (OpB)} & \\\cline{1-4} 5'h16 & {\tt LB} & \multicolumn{2}{l|}{Load 8-bits from memory (OpB) into Ra, clear upper 24 bits} & \\\cline{1-4} 5'h17 & {\tt SB} & \multicolumn{2}{l|}{Store the lower 8-bits of Ra into memory at (OpB)} & \\\hline\hline 5'h18/9 & {\tt LDI} & \multicolumn{2}{l|}{Load 23--bit signed immediate} & N \\\hline\hline 5'h1a & {\tt FPADD} & R0-R13 & Floating point add & \\\cline{1-4} 5'h1b & {\tt FPSUB} & R0-R13 & Floating point subtract & \\\cline{1-4} 5'h1c & {\tt FPMPY} & R0-R13 & Floating point multiply & Y \\\cline{1-4} 5'h1d & {\tt FPDIV} & R0-R13 & Floating point divide & \\\cline{1-4} 5'h1e & {\tt FPI2F} & R0-R13 & Convert integer to floating point & \\\cline{1-4} 5'h1f & {\tt FPF2I} & R0-R13 & Convert floating point to integer & \\\hline\hline 5'h1c & {\tt BREAK} &None(15)&& \\\cline{1-4} 5'h1d & {\tt LOCK} &None(15)&& N\\\cline{1-4} 5'h1e & {\tt SIM} &None(15)&&\\\cline{1-4} 5'h1f & {\tt NOOP} &None(15)&&\\\hline \end{tabular} \caption{ZipCPU OpCodes}\label{tbl:iset-opcodes} \end{center}\end{table} % Of these opcodes, {\tt ROL} and {\tt POPC} are experimental and may be replaced in future revisions. (If you have a reason to like or wish to keep these opcodes, please contact me. If you know of alternatives that might be better, please let me know as well.) There is also room for six more register-less instructions in the {\tt NOOP} instruction space, and two floating point instruction opcodes have been reserved for future use.   \subsection{Conditional Instructions}\label{sec:isa-cond} Most, although not quite all, instructions may be conditionally executed.  The 23--bit load immediate instruction, together with the {\tt NOOP},
918,33 → 864,35
 {\tt BREAK}, and {\tt LOCK} instructions are the exceptions to this rule. All other instructions may be conditionally executed.   From the four condition code flags, eight conditions are defined for standard instructions. These are shown in Tbl.~\ref{tbl:conditions}. From the four condition code flags, eight conditions are defined, as shown in Tbl.~\ref{tbl:conditions}. \begin{table}\begin{center} \begin{tabular}{l|l|l} Code & Mnemonic & Condition \\\hline 3'h0 & None & Always execute the instruction \\ 3'h1 & {\tt .LT}& Less than ('N' set) \\ 3'h2 & {\tt .Z} & Only execute when 'Z' is set \\ 3'h3 & {\tt .NZ}& Only execute when 'Z' is not set \\ 3'h4 & {\tt .GT}& Greater than ('N' not set, 'Z' not set) \\ 3'h5 & {\tt .GE}& Greater than or equal ('N' not set, 'Z' irrelevant) \\ 3'h6 & {\tt .C} & Carry set (Also known as less-than unsigned) \\ 3'h7 & {\tt .V} & Overflow set\\ 3'h1 & {\tt .Z} & Only execute when `Z' is set \\ 3'h2 & {\tt .LT}& Less than (`N' set) \\ 3'h3 & {\tt .C} & Carry set (Also known as less-than unsigned) \\ 3'h4 & {\tt .V} & Overflow set\\ 3'h5 & {\tt .NZ}& Only execute when `Z' is not set \\ 3'h6 & {\tt .GE}& Greater than or equal (`N' not set) \\ 3'h7 & {\tt .NC}& Not carry (also known as greater-than or equal, unsigned) \\ \end{tabular} \caption{Conditions for conditional operand execution}\label{tbl:conditions} \end{center}\end{table} There is no condition code for less than or equal, not C or not V---there just wasn't enough space in 3--bits. Ways of handling non--supported  conditions are discussed in Sec.~\ref{sec:in-mcond}. There are no condition codes for either less than or equal or greater than, whether signed or unsigned. In a similar fashion, there is no condition code for not V---there just wasn't enough space in 3--bits. Ways of handling non--supported conditions are discussed in Sec.~\ref{sec:in-mcond}.   
With the exception of \hbox{\tt CMP} and \hbox{\tt TST} instructions, conditionally executed instructions will not further adjust the condition codes. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions will adjust conditions whenever they are executed. In this way, multiple conditions may be evaluated without branches, creating a sort of logical and--but only if all the conditions are the same. For example, to do something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try code such as Tbl.~\ref{tbl:dbl-condition}. conditionally executed instructions will not further adjust the condition codes. Conditional \hbox{\tt CMP} or \hbox{\tt TST} instructions will adjust conditions whenever they are executed. In this way, multiple conditions may be evaluated without branches, creating a sort of logical and--but only if all the conditions are the same. For example, to do something if \hbox{\tt R0} is one and \hbox{\tt R1} is two, one might try code such as Tbl.~\ref{tbl:dbl-condition}. \begin{table}\begin{center} \begin{tabular}{l}  {\tt CMP 1,R0} \\
954,9 → 902,9
  {\em ; If R0 $=$ 1, conditions are now set based upon R1-2.} \\  {\em ; Now some instruction could be done based upon the conjunction} \\  {\em ; of both conditions.} \\  {\em ; While we use the example of a {\tt STO}, it could easily be any  {\em ; While we use the example of a {\tt SW}, it could easily be any  instruction.} \\  {\tt STO.Z R0,(R2)} \\  {\tt SW.Z R0,(R2)} \\ \end{tabular} \caption{An example of a double conditional}\label{tbl:dbl-condition} \end{center}\end{table}
965,24 → 913,6
 conditional branches, conditionally executed instructions will not stall the bus if they are not executed.   In the case of VLIW instructions, only four conditions are defined as shown  in Tbl.~\ref{tbl:vliw-conditions}. \begin{table}\begin{center} \begin{tabular}{l|l|l} Code & Mnemonic & Condition \\\hline 2'h0 & None & Always execute the instruction \\ 2'h1 & {\tt .LT} & Less than ('N' set) \\ 2'h2 & {\tt .Z} & Only execute when 'Z' is set \\ 2'h3 & {\tt .NZ} & Only execute when 'Z' is not set \\ \end{tabular} \caption{VLIW Conditions}\label{tbl:vliw-conditions} \end{center}\end{table} Further, the first bit of the three is given a special meaning: If the first bit is set, the conditions apply to the second half of the instruction, otherwise the conditions will only apply to the first half of a conditional instruction. Of course, the other conditions are still available by mingling the non--VLIW instructions with VLIW instructions.   \subsection{Modifying Conditions}\label{sec:in-mcond} A quick look at the list of conditions supported by the ZipCPU and listed in Tbl.~\ref{tbl:conditions} reveals that the ZipCPU does not have a full set
991,24 → 921,30
 \begin{table}\begin{center} \begin{tabular}{|l|l|l|}\hline Original & Modified & Name \\\hline\hline \parbox[t]{1.5in}{\tt CMP Imm,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1  & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BLT label}  & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLE label} % If Ry <= Rx -> Ry < Rx+1  & \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BLT label}  & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline  & \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLT label\\BZ label}  & Less-than or equal (signed, {\tt Z} or {\tt N} set)\\[4mm]\hline\hline \parbox[t]{1.5in}{\tt CMP Imm,Ry\\BGT label} % if (Ry > Rx) -> Rx < Ry  & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BGE label}  & Greater-than (immediate) \\[4mm]\hline \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGT label} % if (Ry > Rx) -> Rx < Ry  & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BLT label}  & Greater-than (register) \\[4mm]\hline\hline \parbox[t]{1.5in}{\tt CMP Imm,Ry\\BLEU label}  & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BC label}  & Less-than or equal unsigned immediate \\[4mm]\hline \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BLEU label}  & \parbox[t]{1.5in}{\tt CMP 1+Rx,Ry\\BC label}  & Less-than or equal unsigned \\[4mm]\hline  & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BNC label}  & Less-than or equal unsigned register\\[4mm]\hline\hline \parbox[t]{1.5in}{\tt CMP Imm,Ry\\BGTU label} % if (Ry > Rx) -> Rx < Ry  & \parbox[t]{1.5in}{\tt CMP 1+Imm,Ry\\BNC label}  & Greater-than unsigned (immediate)\\[4mm]\hline \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGTU label} % if (Ry > Rx) -> Rx < Ry  & \parbox[t]{1.5in}{\tt CMP Ry,Rx\\BC label}  & Greater-than unsigned \\[4mm]\hline \parbox[t]{1.5in}{\tt CMP Rx,Ry\\BGEU label} % if (Ry >= Rx) -> Rx <= Ry -> Rx < Ry+1  & \parbox[t]{1.5in}{\tt CMP 1+Ry,Rx\\BC label}  & Greater-than equal unsigned \\[4mm]\hline \parbox[t]{1.5in}{\tt CMP A+Rx,Ry\\BGEU label} % if (Ry >= A+Rx)-> A+Rx <= Ry -> Rx < Ry+1-A  & \parbox[t]{1.5in}{\tt CMP (1-A)+Ry,Rx\\BC label}  & Greater-than 
equal unsigned (with offset)\\[4mm]\hline \parbox[t]{1.5in}{\tt CMP A,Ry\\BGEU label} % if (Ry >= A+Rx)-> A+Rx <= Ry -> Rx < Ry+1-A  & \parbox[t]{1.5in}{\tt LDI (A-1),Rx\\CMP Ry,Rx\\BC label}  & Greater-than equal comparison with a constant\\[4mm]\hline \end{tabular} \caption{Modifying conditions}\label{tbl:creating-conditions} \end{center}\end{table}
1050,22 → 986,11
 simply ignored. (This rule does not apply to the shift instructions, {\tt ASR}, {\tt LSR}, and {\tt LSL}--which all use all of their immediate bits.)   VLIW instructions still use the same operand B as regular instructions, only there was no room for any instruction plus immediate addressing. Therefore, VLIW instructions have either a register or a 4--bit signed immediate as their operand B. The only exception is the load immediate instruction, which permits a 5--bit signed operand B.\footnote{Although the space exists to extend this VLIW load immediate instruction to six bits, the 5--bit limit was chosen to simplify the disassembler. This may change in the future.}   \subsection{Address Modes}\label{sec:isa-addr} The ZipCPU supports two addressing modes: register plus immediate, and immediate addressing. Addresses are encoded in the same fashion as Operand B's, discussed above.    The VLIW instruction set only offers register addressing.   \subsection{Move Operands}\label{sec:isa-mov} The previous set of operands would be perfect and complete, save only that the CPU needs access to non--supervisory registers while in supervisory mode. The
1129,67 → 1054,147
 mode command. The supervisor bit may be cleared either by a reboot or by the external debugger.   \subsection{NOOP, BREAK, and Bus LOCK Instruction} Three instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes}, are somewhat special. These are the {\tt NOOP}, {\tt BREAK}, and bus {\tt LOCK} instructions. These are encoded according to \section{CIS Instructions}   The ZipCPU also supports a compressed instruction set (CIS), outlined in Fig.~\ref{fig:iset-cis}, \begin{figure}\begin{center} \begin{bytefield}[endianness=big]{16} \bitheader{0-15}\\ \bitbox[lrt]{1}{}\bitbox[lrt]{4}{}  \bitbox[lrt]{3}{COp}  \bitbox{1}{0}  \bitbox{7}{Imm.} \\ \bitbox[lr]{1}{1}\bitbox[lr]{4}{DR}  \bitbox[lrb]{3}{}  \bitbox{1}{1}  \bitbox{4}{BR}  \bitbox{3}{Imm} \\ \bitbox[lr]{1}{}\bitbox[lr]{4}{}  \bitbox{3}{\tt LDI}  \bitbox{8}{8'b Imm} \\ \bitbox[lrb]{1}{}\bitbox[lrb]{4}{}  \bitbox{3}{\tt MOV}  \bitbox{1}{1}  \bitbox{4}{BR}  \bitbox{3}{Imm} \\ \end{bytefield} \caption{Zip Compressed Instruction Set (CIS) Format}\label{fig:iset-cis} \end{center}\end{figure} when enabled via {\tt OPT\_CIS}. This compressed instruction set packs two instructions per word. Words must still be aligned, and jumping into the middle of a compressed instruction is not allowed. Further, the CIS only permits the encoding of 8~of the 32~opcodes available in the ISA, as listed in Tbl.~\ref{tbl:iset-cisops}. 
\begin{table}\begin{center}
\begin{tabular}{|l|l|l|} \hline \rowcolor[gray]{0.85}
COp & & Instruction \\\hline\hline
3'h00 & {\tt SUB} & Subtract \\\hline
3'h01 & {\tt AND} & Bitwise And \\\hline
3'h02 & {\tt ADD} & Add two numbers \\\hline
3'h03 & {\tt CMP} & Compare (Ra-OpB) to zero \\\hline
3'h04 & {\tt LW} & Load a 32-bit word from memory (OpB) into Ra \\\hline
3'h05 & {\tt SW} & Store a 32-bit word from Ra into memory at (OpB) \\\hline
3'h06 & {\tt LDI} & Load immediate \\\hline
3'h07 & {\tt MOV} & Move OpB into Ra \\\hline
\end{tabular}
\caption{CIS OpCodes}\label{tbl:iset-cisops}
\end{center}\end{table}
A final feature of the compressed instruction set has to do with {\tt LW} and {\tt SW} instructions. An {\tt LW} or {\tt SW} instruction with bit-7 set low references an offset of the Stack Pointer (SP). Hence the compressed instruction set allows loads and stores to offsets of the Stack Pointer of -128~octets on up to~127 octets. In practice, this gives the compressed load and store instructions, when referencing the stack, thirty--two words that they can reference.

This compressed instruction set is somewhat similar to other architectures that have a thumb instruction set, with the difference that the ZipCPU can intermix regular and compressed instructions at will. When using the CIS, instructions are still issued one at a time; however, interrupts are disabled between instruction halves in order to prevent the CPU from stopping mid instruction. Further, it is the silent job of the assembler to compress CIS instructions in an opportunistic fashion.

The disassembler represents CIS instructions by placing a vertical bar between the two components, while still leaving them on the same line.

The CIS instruction set does not support conditional execution.

\subsection{BREAK, Bus LOCK, SIM, and NOOP Instructions}
Four instructions within the opcode list in Tbl.~\ref{tbl:iset-opcodes} are somewhat special. These are the {\tt BREAK}, bus {\tt LOCK}, {\tt SIM}, and {\tt NOOP} instructions.
These are encoded according to Fig.~\ref{fig:iset-noop}. \begin{figure}\begin{center} \begin{bytefield}[endianness=big]{32} \bitheader{0-31}\\ \begin{leftwordgroup}{NOOP} \bitbox{1}{0}\bitbox{3}{3'h7}\bitbox{1}{}  \bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{Reserved for Simulator} \\ \bitbox{1}{1}\bitbox{3}{3'h7}\bitbox{1}{}  \bitbox{2}{11}\bitbox{3}{000}\bitbox{22}{---} \\ \bitbox{1}{1}\bitbox{9}{---}\bitbox{3}{---}\bitbox{5}{---}  \bitbox{3}{3'h7}\bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}  \bitbox{5}{Rsrvd}  \end{leftwordgroup} \\ \begin{leftwordgroup}{BREAK} \bitbox{1}{0}\bitbox{3}{3'h7}  \bitbox{1}{}\bitbox{2}{11}\bitbox{3}{001}\bitbox{22}{Reserved for debugger} \bitbox[lrt]{1}{}\bitbox[lrt]{3}{}  \bitbox{1}{}\bitbox[lrt]{3}{}\bitbox{2}{00}\bitbox{22}{Reserved for debugger}  \end{leftwordgroup} \\ \begin{leftwordgroup}{LOCK} \bitbox{1}{0}\bitbox{3}{3'h7}  \bitbox{1}{}\bitbox{2}{11}\bitbox{3}{010}\bitbox{22}{Ignored} \bitbox[lr]{1}{0}\bitbox[lr]{3}{3'h7}  \bitbox{1}{}\bitbox[lr]{3}{111}\bitbox{2}{01}\bitbox{22}{Ignored}  \end{leftwordgroup} \\ \begin{leftwordgroup}{SIM} \bitbox[lr]{1}{}\bitbox[lr]{3}{}\bitbox{1}{}  \bitbox[lr]{3}{}\bitbox{2}{10}\bitbox[lrt]{22}{Reserved for Simulator}   \end{leftwordgroup} \\ \begin{leftwordgroup}{NOOP} \bitbox[lrb]{1}{}\bitbox[lrb]{3}{}\bitbox{1}{}  \bitbox[lrb]{3}{}\bitbox{2}{11}\bitbox[lrb]{22}{}   \end{leftwordgroup} \\ \end{bytefield} \caption{NOOP/Break/LOCK Instruction Format}\label{fig:iset-noop} \end{center}\end{figure}   The {\tt NOOP} instruction is just that: an instruction that does not perform any operation. While many other instructions, such as a move from a register to itself, could also fit this role, only the NOOP instruction guarantees that it will not stall waiting for a register to be available. For this reason, it gets its own place in the instruction set. Bits 21--0 of this instruction are reserved for commands which may be given to a simulator, such as simulator exit, should the code be run from a simulator. 
However, such simulation codes have not yet been defined.   The {\tt BREAK} instruction is useful for creating a debug instruction that will halt the CPU without executing. If in user mode, depending upon the setting of the break enable bit, it will either switch to supervisor mode or halt the CPU--depending upon where the user wishes to do his debugging. The lower 22~bits of this instruction are likewise reserved for the debugger's use. lower 22~bits of this instruction are reserved for the debugger's use.   Finally, the {\tt LOCK} instruction was added in order to provide for atomic operations. The {\tt LOCK} instruction only works when the CPU is configured for pipeline mode. It works by stalling the ALU pipeline stack until all prior stages are filled, and then it guarantees that once a bus cycle is started, the wishbone {\tt CYC} line will remain asserted until the LOCK is deasserted. This allows the execution of three instructions, one memory (ex. {\tt LOD}), one ALU (ex. {\tt ADD}), and another memory instruction (ex. {\tt STO}), to take place in an unbreakable fashion. Example uses of this capability include an atomic increment, such as {\tt LOCK}, {\tt LOD (Rx),Ry}, {\tt ADD \#,Ry}, and then {\tt STO Ry,(Rx)}, or even a two instruction pair such as a test and set sequence: {\tt LDI 1,Rz}, {\tt LOCK}, {\tt LOD (Rx),Ry}, {\tt STO Rz,(Rx)}. The {\tt LOCK} instruction provides the ZipCPU's atomic operation support,  although it only works when the CPU is configured for pipeline mode.\footnote{The reason for not allowing {\tt LOCK} support in non-pipelined mode is that the instruction fetch is not allowed to interrupt a lock cycle. In non-pipelined mode, the instruction fetch must take place between every bus access, negating this utility.} It works by stalling the ALU pipeline stack until all prior stages are filled, and then it guarantees that once a bus cycle is started, the wishbone {\tt CYC} line will remain asserted for up to three instructions. 
This allows the execution of one memory load (ex. {\tt LW}), one ALU operation (ex. {\tt ADD}), and then another memory instruction (ex. {\tt SW}), to take place in an uninterrupted fashion. Example uses of this capability include an atomic increment, such as {\tt LOCK}, {\tt LW (Rx),Ry}, {\tt ADD \#1,Ry}, {\tt SW Ry,(Rx)}, or even a two instruction pair such as a test and set sequence: {\tt LDI 1,Rz}, {\tt LOCK}, {\tt LW (Rx),Ry}, {\tt SW Rz,(Rx)}.   The {\tt SIM} and {\tt NOOP} instructions need a touch more explaining. From the standpoint of the CPU, when running from Verilog within an FPGA, the {\tt SIM} instruction is an illegal instruction--generating an illegal instruction exception. Likewise the {\tt NOOP} instruction is just that: an instruction that consumes a clock, but does not perform any operation. In both cases, the lower 22--bits are ignored.   Both {\tt SIM} and {\tt NOOP} instructions, though, contain 22--bits that can be used by a simulator if present. The encoding of these 22-bits is identical, so that programs that run in a simulator may run on actual hardware as well (using the {\tt NOOP} encoding), or they may complain that they were not intended to run on actual hardware, such as if the {\tt SIM} encoding were used. Particular encodings allow for exiting the simulation with a known exit code, {\tt $x$EXIT}, dumping either one or all registers, {\tt $x$DUMP},  or simply sending a character to the simulator's standard output stream, {\tt $x$OUT}--where $x$ is either {\tt N} for the {\tt NOOP} version of the instruction, or {\tt S} for the {\tt SIM} version of the opcode.   The {\tt SIM} instruction is currently a new facility for the ZipCPU, and so its functionality remains under test.   
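As a rough illustration of the two {\tt LOCK} sequences described above (the atomic increment and the test and set), the following C sketch models what each sequence computes. It is only a functional model: the function names are hypothetical, and plain C gives none of the bus-locking guarantee that {\tt LOCK} provides on the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of: LOCK; LW (Rx),Ry; ADD #1,Ry; SW Ry,(Rx)
 * On the CPU the bus is held across these instructions; this plain
 * C version only shows the data flow, not the atomicity. */
uint32_t atomic_increment_model(uint32_t *rx)
{
    uint32_t ry = *rx;  /* LW  (Rx),Ry */
    ry += 1;            /* ADD #1,Ry   */
    *rx = ry;           /* SW  Ry,(Rx) */
    return ry;          /* value now stored in memory */
}

/* Hypothetical model of: LDI 1,Rz; LOCK; LW (Rx),Ry; SW Rz,(Rx)
 * Returns the previous contents of the flag word. */
uint32_t test_and_set_model(uint32_t *rx)
{
    uint32_t rz = 1;    /* LDI 1,Rz    */
    uint32_t ry = *rx;  /* LW  (Rx),Ry */
    *rx = rz;           /* SW  Rz,(Rx) */
    return ry;          /* 0 means the flag was clear before the set */
}
```

A caller of the test and set model would treat a return value of zero as having acquired the flag, and any other value as the flag already being held.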
\subsection{Floating Point} Although the ZipCPU does not (yet) have a floating point unit, the current instruction set offers eight opcodes for floating point operations, and treats instruction set offers six opcodes for floating point operations, and treats floating point exceptions like divide by zero errors. Once this unit is built and integrated together with the rest of the CPU, the ZipCPU will support 32--bit floating point instructions natively. Any 64--bit floating point
1196,24 → 1201,11
 instructions will either need to be emulated in software, or else they will need an external floating point peripheral.   Until the FPU is built and integrated, or even afterwards if the floating point unit is not installed by option, floating point instructions will trigger an illegal instruction exception, which may be trapped and then implemented in software. Until this FPU is built and integrated, or even afterwards if the floating point unit is not installed by option, floating point instructions will trigger an illegal instruction exception, which may be trapped and then implemented in software.   \subsection{Load/Store byte} One difference between the ZipCPU and many other architectures is that there are no load byte {\tt LB}, store byte {\tt SB}, load halfword {\tt LH} or store halfword {\tt SH} instructions. This lack is by design in an attempt to keep the 32--bit bus simple.   Because the ZipCPU's addresses refer to 32--bit values, i.e. address one will refer to a completely different 32--bit value than address two, simulating these load and store byte instructions is difficult.   This is just the nature of the ZipCPU, as a result of the design choices that were made.   \subsection{Derived Instructions} The ZipCPU supports many other common instructions by construction, although not all of them are single cycle instructions. Tables~\ref{tbl:derived-1}, \ref{tbl:derived-2}, \ref{tbl:derived-3} and~\ref{tbl:derived-4} show how these
1221,7 → 1213,7
 instructions will have assembly equivalents, such as the branch instructions, to facilitate working with the CPU. \begin{table}\begin{center} \begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline \begin{tabular}{p{1.0in}p{1.5in}p{3in}}\\\hline Mapped & Actual & Notes \\\hline {\tt ABS Rx}  & \parbox[t]{1.5in}{\tt TST -1,Rx\\NEG.LT Rx}
1230,7 → 1222,7
 \parbox[t]{1.4in}{\tt ADD Ra,Rx\\ADDC Rb,Ry}  & \parbox[t]{1.5in}{\tt Add Ra,Rx\\ADD.C \$1,Ry\\Add Rb,Ry}  & Add with carry \\\hline {\tt BRA.$x$+/-\$Addr} \hbox{\tt BRA.$x$ +/-\$Addr}  & \hbox{\tt ADD.$x$\$Addr+PC,PC}  & Branch or jump on condition $x$. Works for 18--bit  signed address offsets.\\\hline
1261,7 → 1253,10
  between {\tt LDI} and {\tt BREV} depending upon the existence  of the condition.\\\hline {\tt EXCH.W Rx }  & {\tt ROL \$16,Rx}  & \parbox[t]{1.5in}{\tt MOV Rx,Rh \\  LSL \$16,Rh \\  LSR \$16,Rx \\  OR Rh,Rx }  & Exchanges the top and bottom 16'bit words of Rx \\\hline {\tt HALT }  & {\tt Or \$SLEEP,CC}
1274,19 → 1269,19
 {\tt IRET}  & {\tt OR \$GIE,CC}  & Also known as an RTU instruction (Return to Userspace) \\\hline {\tt JMP R6+\$Offset} \hbox{\tt JMP R6+\$Offset}  & {\tt MOV \$Offset(R6),PC}  & Only works for 13--bit offsets. Other offsets require adding the  offset first to R6 before jumping.\\\hline {\tt LJMP \$Addr}  & \parbox[t]{1.5in}{\tt LOD (PC),PC \\ {\em Address }}  & \parbox[t]{1.5in}{\tt LW (PC),PC \\ {\em Address }}  & Although this only works for an unconditional jump, and it only  works in an architecture with a unified instruction and data address  space, this instruction combination makes for a nice combination that  can be adjusted by a linker at a later time.\\\hline {\tt LJMP.x \$Addr}  & \parbox[t]{1.5in}{\tt LOD.x 2(PC),PC \\ ADD 1,PC \\ {\em Address }}  & Long jump, works for a conditional long jump. \\\hline  & \parbox[t]{1.5in}{\tt LW.x 4(PC),PC \\ ADD 4,PC \\ {\em Address }}  & Long jump, works for a conditional long jump, not necessarily the best way to do this. \\\hline \end{tabular} \caption{Derived Instructions}\label{tbl:derived-1} \end{center}\end{table}
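The {\tt EXCH.W} row of the table above lists two expansions: a single {\tt ROL \$16,Rx}, and a four instruction sequence of {\tt MOV}, {\tt LSL}, {\tt LSR} and {\tt OR}. As a sketch of why the two produce the same result, the C functions below model each expansion on a 32--bit value. The function names are hypothetical, and the temporary {\tt rh} stands in for the scratch register {\tt Rh} from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the single-instruction form: ROL $16,Rx */
uint32_t exch_w_rotate(uint32_t rx)
{
    return (rx << 16) | (rx >> 16);
}

/* Model of the four-instruction form from the table */
uint32_t exch_w_shifts(uint32_t rx)
{
    uint32_t rh = rx;   /* MOV Rx,Rh  */
    rh <<= 16;          /* LSL $16,Rh */
    rx >>= 16;          /* LSR $16,Rx */
    rx |= rh;           /* OR  Rh,Rx  */
    return rx;
}
```

Both functions swap the upper and lower 16--bit halves of their argument, so applying either one twice returns the original value.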
1294,17 → 1289,17
 \begin{tabular}{p{1.1in}p{1.8in}p{3in}}\\\hline Mapped & Actual & Notes \\\hline {\tt LJSR \$Addr }  & \parbox[t]{1.5in}{\tt MOV \$2+PC,R0 \\ LOD (PC),PC \\ {\em Address}}  & \parbox[t]{1.5in}{\tt MOV \$8+PC,R0 \\ LW (PC),PC \\ {\em Address}}  & Similar to LJMP, but it handles the return address properly.  \\\hline {\tt JSR PC+\$Offset }  & \parbox[t]{1.5in}{\tt MOV \$1+PC,R0 \\ ADD \$Offset,PC}  & \parbox[t]{1.5in}{\tt MOV \$4+PC,R0 \\ ADD \$Offset,PC}  & This is similar to the jump and link instructions from other  architectures, save only that it requires a specific link  instruction, seen here as the {\tt MOV} instruction on the  left.\\\hline {\tt LDI \$val,Rx }  & \parbox[t]{1.8in}{\tt BREV REV($val$)\&0x0ffff, Rx \\  & \parbox[t]{1.8in}{\tt BREV REV($val$)\&0x0ffff,Rx \\  LDILO ($val$\&0x0ffff),Rx}  & \parbox[t]{3.0in}{Sadly, there's not enough instruction  space to load a complete immediate value into any register. 1315,27 → 1310,6   This is also the appropriate means for setting a register value  to an arbitrary 32--bit value in a post--assembly link  operation.}\\\hline {\tt LOD.b \$addr,Rx}  & \parbox[t]{1.5in}{\tt %  LDI \$addr,Ra \\  LDI \$addr,Rb \\  LSR \$2,Ra \\  AND \$3,Rb \\  LOD (Ra),Rx \\  LSL \$3,Rb \\  SUB \$32,Rb \\  ROL Rb,Rx \\  AND \$0ffh,Rx}  & \parbox[t]{3in}{This CPU is designed for 32'bit word  length instructions. Byte addressing is not supported by the CPU or  the bus, so it therefore takes more work to do.     Note also that in this example, \$Addr is a byte-wise address, where  all other addresses in this document are 32-bit wordlength addresses.   For this reason,  we needed to drop the bottom two bits. This also limits the address  space of character accesses using this method from 16 MB down to 4MB.}  \\\hline \parbox[t]{1.5in}{\tt LSL \$1,Rx\\ LSLC \$1,Ry}  & \parbox[t]{1.5in}{\tt LSL \$1,Ry \\  LSL \$1,Rx \\
1353,34 → 1327,34
  OR Rz,Rx}  & Logical shift right with carry. Unlike the shift left, this  approach doesn't extend well to numbers larger than two words. \\\hline \end{tabular} \caption{Derived Instructions, continued}\label{tbl:derived-2} \end{center}\end{table} \begin{table}\begin{center} \begin{tabular}{p{1.2in}p{1.5in}p{3.2in}}\\\hline {\tt NEG Rx} & \parbox[t]{1.5in}{\tt XOR \$-1,Rx \\ ADD \$1,Rx} & Negates Rx\\\hline {\tt NEG.C Rx} & \parbox[t]{1.5in}{\tt MOV.C \$-1+Rx,Rx\\XOR.C \$-1,Rx}  & Conditionally negates Rx\\\hline {\tt NOT Rx } & {\tt XOR \$-1,Rx } & One's complement\\\hline {\tt POP Rx }  & \parbox[t]{1.5in}{\tt LOD \$(SP),Rx \\ ADD \$1,SP}  & \parbox[t]{1.5in}{\tt LW \$(SP),Rx \\ ADD \$4,SP}  & The compiler avoids the need for this instruction and the similar  {\tt PUSH} instruction when setting up the stack by coalescing all  the stack address modifications into a single instruction at the  beginning of any stack frame.\\\hline {\tt PUSH Rx}  & \parbox[t]{1.5in}{\hbox{\tt SUB \$1,SP}   \hbox{\tt STO Rx,\$(SP)}}  & \parbox[t]{1.5in}{\hbox{\tt SUB \$4,SP}   \hbox{\tt SW Rx,\$(SP)}}  & Note that for pipelined operation, it helps to coalesce all the  {\tt SUB}'s into one command, and place the {\tt STO}'s right  {\tt SUB}'s into one command, and place the {\tt SW}'s right  after each other. Further, to avoid a pipeline stall, the  immediate value for the first store must be zero.  \\\hline \end{tabular} \caption{Derived Instructions, continued}\label{tbl:derived-2} \end{center}\end{table} \begin{table}\begin{center} \begin{tabular}{p{1.0in}p{1.5in}p{3.2in}}\\\hline {\tt PUSH Rx-Ry}  & \parbox[t]{1.5in}{\tt SUB \$$n,SP \\  STO Rx,\(SP)  & \parbox[t]{1.5in}{\tt SUB \$$4n$,SP \\  SW Rx,\$(SP)  \ldots \\  STO Ry,\$$\left(n-1\right)(SP)}  SW Ry,\$$4\left(n-1\right)$(SP)}  & Multiple pushes at once only need the single subtract from the  stack pointer. This derived instruction is analogous to a similar one  on the Motorola 68k architecture, although the Zip Assembler
1387,11 → 1361,11
  does not support the combined instruction. This instruction  also supports pipelined memory access.\\\hline {\tt RESET}  & \parbox[t]{1in}{\tt STO \$1,\$watchdog(R12)\\BUSY}  & \parbox[t]{1in}{\tt LDI~0xff000000,R2\\LDI 1,R1\\\hbox{SW R1,\$watchdog(R2)}\\BUSY}  & This depends upon the existence of a watchdog peripheral, and the  peripheral base address being preloaded into {\tt R12}. The BUSY  instructions are required because the CPU will continue until the  {\tt STO} has completed.  {\tt SW} has completed.    Another opportunity might be to jump to the reset address from within  supervisor mode.\\\hline 1399,10 → 1373,10   & This depends upon the form of the {\tt JSR} given on the previous  page that stores the return address into R0.  \\\hline {\tt SEX.b Rx } {\tt SEXB Rx }  & \parbox[t]{1.5in}{\tt LSL 24,Rx \\ ASR 24,Rx}  & Signed extend an 8--bit value into a full word.\\\hline {\tt SEX.h Rx }  {\tt SEXH Rx }   & \parbox[t]{1.5in}{\tt LSL 16,Rx \\ ASR 16,Rx}  & Sign extend a 16--bit value into a full word.\\\hline {\tt STEP Rr,Rt} 1412,37 → 1386,9  {\tt STEP}  & \parbox[t]{1.5in}{\tt OR \$Step|\$GIE,CC}  & Steps a user mode process by one instruction\\\hline % % \end{tabular} \caption{Derived Instructions, continued}\label{tbl:derived-3} \end{center}\end{table} \begin{table}\begin{center} \begin{tabular}{p{1.4in}p{1.5in}p{3in}}\\\hline {\tt STO.b Rx,\$addr}  & \parbox[t]{1.5in}{\tt %  LDI \$addr,Ra \\  LDI \$addr,Rb \\  LSR \$2,Ra \\  AND \$3,Rb \\  SUB \$32,Rb \\  LOD (Ra),Ry \\  AND \$0ffh,Rx \\  AND \~\$0ffh,Ry \\  ROL Rb,Rx \\  OR Rx,Ry \\  STO Ry,(Ra) }  & \parbox[t]{3in}{This CPU and its bus are {\em not} optimized  for byte-wise operations.    Note that in this example, \$addr is a  byte-wise address, whereas in all of our other examples it is a   32-bit word address. This also limits the address space  of character accesses from 16 MB down to 4MB.  
Further, this instruction implies a byte ordering,  such as big or little endian.} \\\hline {\tt SUBR Rx,Ry }  & \parbox[t]{1.5in}{\tt SUB 1+Rx,Ry\\ XOR -1,Ry}   % & \parbox[t]{1.5in}{\tt SUB 1+Rx,Ry\\ XOR -1,Ry}   & \parbox[t]{1.5in}{\tt XOR -1,Ry\\ADD 1+Rx,Ry}   & Ry is set to Rx-Ry, rather than the normal subtract which  sets Ry to Ry-Rx. \\\hline \parbox[t]{1.4in}{\tt SUB Ra,Rx\\SUBC Rb,Ry}
1465,13 → 1411,20
 {\tt TS Rx,Ry,(Rz)}  & \hbox{\tt LDI 1,Rx}  \hbox{\tt LOCK}  \hbox{\tt LOD (Rz),Ry}  \hbox{\tt STO Rx,(Rz)}  \hbox{\tt LW (Rz),Ry}  \hbox{\tt SW Rx,(Rz)}  & A test and set instruction. The {\tt LOCK} instruction insures  that the next two instructions lock the bus between the instructions,  so no one else can use it. Thus guarantees that the operation is  atomic.  \\\hline % % \end{tabular} \caption{Derived Instructions, continued}\label{tbl:derived-3} \end{center}\end{table} \begin{table}\begin{center} \begin{tabular}{p{1.0in}p{1.5in}p{3in}}\\\hline {\tt TST Rx}  & {\tt TST \$-1,Rx}  & Set the condition codes based upon Rx without changing Rx. 1487,12 → 1440,15  \subsection{Interrupt Handling} The ZipCPU does not maintain any interrupt vector tables. If an interrupt takes place, the CPU simply switches to from user to supervisor (interrupt) mode. The supervisor code then continues from where it left off after  executing a return to userspace {\tt RTU} instruction. mode. Since getting to user mode in the first place required a return to userspace instruction, {\tt RTU}, once the interrupt takes place the  supervisor just simply starts executing code immediately after that {\tt RTU} instruction.   Since the CPU may return from userspace after either an interrupt, a trap, or an exception, it is up to the supervisor code that handles the transition to determine which of the three has taken place. Since the CPU may return from userspace after either an interrupt (hardware generated), a trap (software generated), or an exception (a fault of some type), it is up to the supervisor code that handles the transition to determine which of the three has taken place.   \subsection{Pipeline Stages} As mentioned in the introduction, and highlighted in Fig.~\ref{fig:cpu}, 1532,10 → 1488,14   read, condition code, and immediate offset. 
This stage also  determines whether the flags will be read or set, whether registers  will be read (and hence the pipeline may need to stall), or whether the  result will be written back. In many ways, simplifying the CPU also  meant simplifying this pipeline stage and hence the instruction set  architecture.  result will be written back. In many ways, simplifying the CPU has  meant simplifying this particular pipeline stage and hence the  instruction set architecture that it implements.    This stage is also responsible for both normal and CIS decoding.  Hence, following this stage, little information remains regarding  whether or not the CPU was executing a CIS instruction.   \item {\bf Read Operands}: Read from the register file and applies any  immediate values to the result. There is no means of detecting or  flagging arithmetic overflow or carry when adding the immediate to the 1544,13 → 1504,14    \item At this point, the processing flow splits into one of four tracks: An  {\bf ALU} track which will accomplish a simple instruction, the  {\bf MemOps} stage which handles {\tt LOD} (load) and {\tt STO}  {\bf MemOps} stage which handles {\tt LW} (load) and {\tt SW}  (store) instructions, the {\bf divide} unit, and the  {\bf floating point} unit.  \begin{itemize}  \item Loads will stall instructions in the read operands stage until the  entire memory is complete, lest a register be read only to be  updated unseen by the Load.  \item Loads will stall instructions in the read operands stage until  the entire memory operation is complete, lest a register be  read from the register file only to be updated unseen by the  Load.  \item Condition codes are set upon completion of the ALU, divide,  or FPU stage. (Memory operations do not set conditions.)  \item Issuing a non--pipelined memory instruction to the memory unit 1612,10 → 1573,10  be another stall internal to the {\tt pipefetch} cache.)   
The decode stage can handle the {\tt ADD \$X,PC}, {\tt LDI \X,PC}, and {\tt LOD (PC),PC} instructions specially, however, when {\tt EARLY\_BRANCHING} {\tt LW (PC),PC} instructions specially, however, when {\tt EARLY\_BRANCHING} is enabled. These instructions, when not conditioned on the flags, can execute with only a single stall cycle (two for the {\tt LOD(PC),PC} instruction), for the {\tt LW(PC),PC} instruction), such as is shown in Fig.~\ref{fig:branch}. \begin{figure}\begin{center} \includegraphics[width=4in]{../gfx/bra.eps} %0.4in per clock 1646,9 → 1607,9  applies if there is no local memory to allocate on the stack as well.} Hence, to save registers at the top of a procedure, one would write: \begin{enumerate} \item\ {\tt SUB 2,SP} \item\ {\tt STO R1,(SP)} \item\ {\tt STO R2,1(SP)} \item\ {\tt SUB 16,SP} \item\ {\tt SW R1,(SP)} \item\ {\tt SW R2,4(SP)} \end{enumerate} Had {\tt R1} instead been stored at {\tt 1(SP)} as the top of the stack, there would've been an extra stall in setting up the stack frame. 1679,7 → 1640,7    \item When waiting for a memory read operation to complete \begin{enumerate} \item\ {\tt LOD address,RA} \item\ {\tt LW address,RA} \item\ {\em (multiple stalls, bus dependent, 4 clocks best)} \item\ {\tt OPCODE I+RA,RB} \end{enumerate} 1718,13 → 1679,13    \item Memory operation followed by a memory operation \begin{enumerate} \item\ {\tt STO address,RA} \item\ {\tt SW address,RA} \item\ {\em (multiple stalls, bus dependent, 4 clocks best)} \item\ {\tt LOD address,RB} \item\ {\tt LW address,RB} \item\ {\em (multiple stalls, bus dependent, 4 clocks best)} \end{enumerate}   In this case, the LOD instruction cannot start until the STO is finished, In this case, the LW instruction cannot start until the SW is finished, as illustrated by Fig.~\ref{fig:mstld}. 
\begin{figure}\begin{center} \includegraphics[width=5.5in]{../gfx/mstld.eps} 1731,7 → 1692,7  \caption{Pipeline handling of a store followed by a load instruction}\label{fig:mstld} \end{center}\end{figure} With proper scheduling, it is possible to do something in the ALU while the memory unit is busy with the STO instruction, but otherwise this pipeline will memory unit is busy with the SW instruction, but otherwise this pipeline will stall while waiting for it to complete before the load instruction can start.   1770,12 → 1731,8  bus built according to the B4 specification. Several changes have been made to simplify this bus. First, all unnecessary ancillary information has been removed. This includes the retry, tag, lock, cycle type indicator, and burst indicator signals. It also includes the select lines which would enable the CPU to act on less than 32--bit words. As a result all operations on the bus are 32--bit operations. The bus is neither little endian nor big endian. For this reason, all words are 32--bits. All instructions are also 32--bits wide.  Everything has been built around the 32--bit word. Even the byte size (the size of the minimum addressable unit) is 32--bits. Second, we insist that all indicator signals. The bus supports big endian operation where the high order octet occupies the low order address. Second, we insist that all accesses be pipelined, and simplify that further by insisting that pipelined accesses not cross peripherals---although we leave it to the user to keep that from happening in practice. Finally, we insist that the wishbone strobe line 1792,8 → 1749,8  within the CPU if a particular piece of memory can be cached or not, save that the CPU assumes any and all instruction words can be cached.   The one exception to this rule revolves around addresses beginning with {\tt 2'b11} in their high order word. 
These addresses are used to access a The one exception to this rule revolves around addresses where the top 8-bits of their high order word are all ones. These addresses are used to access a variety of optional peripherals that will be discussed more in Sec.~\ref{sec:zipsys}, but that are only present within the {\tt ZipSystem}. When used with the bare {\tt ZipBones}, these addresses will cause a bus error. 1804,8 → 1761,7  self--modifying code.   Should the memory management unit (MMU) be integrated into the ZipCPU, the MMU will be able to be configured to instruct the ZipCPU as to which addresses may be cached and which not. configuration will tell the ZipCPU which addresses may be cached and which not.   This topic is discussed further in the linker section, Sec.~\ref{sec:ld-mem} of the ABI chapter, Chap.~\ref{chap:abi}. 1968,6 → 1924,9  bus access is granted to the DMA peripheral, it will not be revoked mid--read or mid--write.)   The DMA controller supports only aligned word accesses. It does not support byte or half-word accesses.   When copying memory from one location to another, the DMA controller will copy in units of a given transfer length--up to 1024 words at a time. It will read that transfer length into its internal buffer, and then write to the 2025,7 → 1984,7  % R13 is the stack register. % The stack grows downward. % Memory at the current stack pointer is allocated. % Hence, a PUSH is : SUB 1,SP; STO Rx,(SP) % Hence, a PUSH is : SUB 1,SP; SW Rx,(SP) % Heap: % In general, not yet implemented.  % A less than adequate Heap has been implemented as a pointer, from which 2035,18 → 1994,18  \section{Executable File Format}\label{sec:abi-elf} ZipCPU executable files are stored in the Executable and Linkable Format (ELF), prior to being placed in flash, or whatever memory they will be executed from. All addresses within this format are ZipCPU addresses, referencing 32--bit quantities, whereas all offsets internal to the ELF file represent 8--bit quantities. 
Thus, when running the {\tt zip-objdump} utility on a ZipCPU ELF file, the addresses are properly set. executed from.    The ZipCPU described by this specification uses the 16-bits {\tt 16'hdad1} to identify itself against other CPUs. This is not an officially registered number, and may change in the future.   The ZipCPU does not (yet) have a dynamic linker/loader. All linking is currently static, and done prior to run time.   \section{Stack}\label{sec:abi-stack} Although nothing in the hardware requires this, the compiler back end implementation uses {\tt R13} (also known as the {\tt SP} register) as a stack register, and grows the stack from Register {\tt R13} (also known as the {\tt SP} register) is the stack register. The compiler generates code that grows the stack from high addresses to lower addresses. That means that the stack will usually start out set to a very large value, such as one past the last RAM address, and it will grow to lower and lower values--hopefully never mixing with the 2056,37 → 2015,38  the size of the stack frame from the stack register. It will then store any registers used by the function, from {\tt R5} to {\tt R12} (including the link register {\tt R0}) onto offsets given by the stack pointer plus a  constant. If a frame pointer is used, the compiler uses {\tt R12} (or {\tt FP}) for this purpose. The frame pointer is set by moving the stack pointer plus an offset into {\tt FP}. This {\tt MOV} instruction effectively limits the size of any individual stack frame to $2^{12}-1$ words. constant. If a frame pointer is used, the compiler uses {\tt R12} (or {\tt FP}) for this purpose. The frame pointer is set by moving the stack pointer plus an offset into {\tt FP}. This {\tt MOV} instruction effectively limits the size of any individual stack frame to $2^{12}-1$ octets.   Once a subroutine is complete, the frame is unwound. If the frame pointer, {\tt FP} was used, then {\tt FP} is copied directly to the stack pointer, {\tt SP}. 
Registers are restored, starting with {\tt R0} all the way to {\tt R12} ({\tt FP}). This also restores, and obliterates, the subroutine frame pointer. Once complete, a value is added to the stack pointer to return it to its original value, and a jump is made to the value located within {\tt R0}. frame pointer. Once complete, a value is added to the stack pointer to return it to its original value, and a jump is made to the value located within {\tt R0}.   \section{Relocations}\label{sec:abi-reloc}   The ZipCPU binutils back end supports two several relocations, although the two most common are the 32--bit relocations for register load and long jump. The ZipCPU binutils back end supports several types of relocations, although the two most common are the 32--bit relocations for register load and long jump.   The first of these is for loading an arbitrary 32--bit value into a register.  Such instructions are broken into a pair of {\tt BREV} and {\tt LDILO} instructions, and once the value of the parameter is known their immediates are filled in. instructions, and once the value of the parameter is known their immediate values can be filled in.   The second type of 32--bit relocation is for jumps to arbitrary addresses. These jumps are supported by the \hbox{\tt LOD (PC),PC} instruction, followed These jumps are supported by the \hbox{\tt LW (PC),PC} instruction, followed by the 32--bit address to be filled in later by the linker. If the jump is conditional, then a conditional \hbox{\tt LOD.$x$1(PC),PC} instruction is used, followed by a {\tt BRA 1(PC),PC} and then the 32--bit relocation value. conditional, then a conditional \hbox{\tt LW.$x4(PC),PC} instruction is used, followed by a {\tt ADD 4,PC} and then the 32--bit relocation value.   If the branch distance is known and within reach, branches will be implemented with {\tt ADD \#,PC} instructions, possibly conditional, as necessary. 
If a branch distance is known and within reach, then it will be implemented with an {\tt ADD \#,PC} instruction, possibly conditional, as necessary.   While other relocations are supported, they tend not to be used nearly as much as these two. 2093,18 → 2053,17    \section{Call format}\label{sec:abi-jsr}   One feature of the ZipCPU is that it has no JSR instruction. Jumps to subroutines therefore take three assembly instructions:  The first is a {\tt MOV .Lcall\#\#(PC),R0}, which places the return address into R0. {\tt .Lcall\#\#} in this case is a label, where \#\# is a unique number filled in by the compiler. This instruction is followed by a {\tt BRA subroutine} instruction. Finally, the third ``assembly instruction'' of any call sequence is the label {\tt .Lcall\#\#}. One unique feature of the ZipCPU is that it has no JSR instruction. The assembler attempts to minimize this problem by replacing a {\tt JSR}~{\em address} instruction with a {\tt MOV \#(PC),R0} followed by a jump to the requested address. In this case, the offset to the PC for the {\tt MOV} instruction is determined by whether or not the jump can be accomplished with a local branch or a long jump.   While this works well in practice, GCC's implementation prevents such things While this works well in practice, this implementation prevents such things as {\tt JSR}'s followed by {\tt BRA}'s from being combined together.   Finally, the first five operands passed to the subroutine will be placed into Finally, GCC will place the first five operands passed to the subroutine into registers R1--R5. Any additional operands are placed upon the stack.   \section{Built-ins}\label{sec:abi-builtin} 2116,7 → 2075,7  \item {\tt zip\_bitrev(int)} reverses the bits in the given integer, returning  the result. This utilizes the internal {\tt BREV} instruction, and is  designed to be used with FFT's as necessary. 
\item {\tt zip\_busy()} executes an {\tt ADD -1,PC} function, essentially \item {\tt zip\_busy()} executes an {\tt ADD -4,PC} function, essentially  forcing the CPU into a very tight infinite loop. \item {\tt zip\_cc()} returns the value of the current CC register. This may  be used within both user and supervisor code to determine in which 2198,7 → 2157,7  \end{eqnarray*} specifies that there is a region of memory, called blkram, that can be read and written, and that programs can execute from. This section starts at address {\tt 0x8000} and extends for another {\tt 0x8000} words. The other memories {\tt 0x8000} and extends for another {\tt 0x8000} bytes. The other memories are defined in a similar manner, with names {\tt flash} and {\tt sdram}.   Following the memory section, three specific symbols are defined: 2303,6 → 2262,8   subsystem. \end{enumerate}   All of these symbols need to reference word aligned addresses.   \section{Loading ZipCPU Programs} There are two basic ways to load a ZipCPU program, depending upon whether or not the ZipCPU is active within the current configuration. If the ZipCPU 2509,11 → 2470,11   \> {\tt zip->pic = DISABLEALL|SYSINT\_TMA;}\\  \> {\tt if (nclocks > 10) \{}\\  \> \hbox to 0.25in{}\= {\em // Set our timer to count down the given number of counts}\\  \> \> {\tt zip->tma = counts} \\  \> \> {\tt zip->tma = nclocks;} \\  \> \> {\tt zip->pic = EINT(SYSINT\_TMA);} \\  \> \> {\tt zip\_wait();} \\  \> \> {\tt zip->pic = CLEARPIC;} \\  \> {\tt \} }{\em // else anything less has likely already passed}  \> {\tt \} }{\em // else anything less has likely already passed} \\ {\tt \}}\\ \end{tabbing} \caption{Waiting on a timer}\label{tbl:shi-timer} 2681,7 → 2642,7  copy, starting with the C code shown in Tbl.~\ref{tbl:memcp-c}. 
\begin{table}\begin{center} \parbox{4in}{\begin{tabbing} {\tt void} \= {\tt memcpy(void *dest, void *src, int len) \{} \\ {\tt void} \= {\tt memcpy(char *dest, char *src, int len) \{} \\  \> {\tt for(int i=0; i \hspace{0.2in} {\tt *dest++ = *src++;} \\ \} 2698,16 → 2659,16  \hbox to 0.35in{}\={\em ; R0 = return address, R1 = *dest, R2 = *src, R3 = LEN} \\ \> {\em ; The following will operate in 6 (N=0$), or$2+12N$clocks ($N\neq 0$).} \\ \> {\tt CMP 0,R3} \\ % 8 clocks per setup \> {\tt JMP.Z R0} \hbox to 0.3in{}\= {\em ; A conditional return }\\ \> {\tt RETN.Z} \hbox to 0.3in{}\= {\em ; A conditional return }\\ \> {\em ; No stack frame needs to be set up to use {\tt R4}, since the compiler}\\ \> {\em ; assumes {\tt R1}-{\tt R4} may be used and changed by any subroutine} \\ memcpy\_loop: \\ % 12 clocks per loop \> {\tt LOD (R2),R4} \\ \> {\tt LB (R2),R4} \\ \> {\em ; (4 stalls, cannot be scheduled away)} \\ \> {\tt STO R4,(R1)} \> {\em ; (4 schedulable stalls, has no impact now)} \\ \> {\tt SB R4,(R1)} \> {\em ; (4 schedulable stalls, has no impact now)} \\ \> {\em ; Update our count of the number of remaining values to copy}\\ \> {\tt SUB 1,R3} \> {\em ; This will be zero when we have copied our last}\\ \> {\tt JMP.Z R0} \> {\em ; + 4 stalls, if taken}\\ \> {\tt RETN.Z} \> {\em ; + 4 stalls, if taken}\\ \> {\tt ADD 1,R1} \> {\em ; Implement the destination pointer }\\ \> {\tt ADD 1,R2} \> {\em ; Implement the source pointer }\\ \> {\tt BRA memcpy\_loop} \\ 2718,12 → 2679,9  This example points out several things associated with the ZipCPU. First, a straightforward implementation of a for loop is not the fastest loop structure. For this reason, we have placed the test to continue at the end. Second, all pointers are {\tt void} pointers to arbitrary 32--bit data types. The ZipCPU does not have explicit support for smaller or larger data types, and so this memory copy cannot be applied at an 8--bit level. 
Third, notice that we can use {\tt R4} without storing it, since the C~ABI allows for subroutines to use {\tt R1}--{\tt R4} without saving them. This means that we can return from this subroutine using conditional jumps to end. Second, notice that we can use {\tt R4} without storing it, since the C~ABI allows for subroutines to use {\tt R1}--{\tt R4} without saving them.  This means that we can return from this subroutine using conditional jumps to {\tt R0}.   Still, there's more that could be done. Suppose we wished to use the pipeline 2738,51 → 2696,51   after the initial pipeline delay}\\ memcpy\_opt: \\ \hbox to 0.35in{}\=\hbox to 1.4in{\tt CMP 4,R3}\= {\em ; Check for small short lengths, len$<$4}\\ \> {\tt BC memcpy\_finish} \> {\em ; Jump to the end if so}\\ \hbox to 0.35in{}\=\hbox to 1.4in{\tt SUB 3,SP}\= {\em ; Otherwise, create a stack frame, storing the registers}\\ \> {\tt STO R5,(SP)} \> {\em ; we will be using. Note that this is a pipelined store, so}\\ \> {\tt STO R6,1(SP)} \> {\em ; subsequent stores only cost 1 clock.}\\ \> {\tt STO R7,2(SP)}\\ \> {\tt BC \_memcpy\_finish} \> {\em ; Jump to the end if so}\\ \hbox to 0.35in{}\=\hbox to 1.4in{\tt SUB 12,SP}\= {\em ; Otherwise, create a stack frame, storing the registers}\\ \> {\tt SW R5,(SP)} \> {\em ; we will be using. Note that this is a pipelined store, so}\\ \> {\tt SW R6,4(SP)} \> {\em ; subsequent stores only cost 1 clock.}\\ \> {\tt SW R7,8(SP)}\\ \> {\tt ADD 4,R2} \> {\em ; Pre-Increment our pointers, for a 4-stage pipeline. This}\\ \> {\tt ADD 4,R1} \> {\em ; also fills up the 3 of the 4 stall states following the}\\ \> {\tt SUB 5,R3} \> {\em ; stores. 
Also, leave {\tt R3} as the number left minus one.}\\ \> {\tt LOD -4(R2),R4} \> {\em ; Load the first four values into }\\ \> {\tt LOD -3(R2),R5} \> {\em ; registers, using a pipelined load.}\\ \> {\tt LOD -2(R2),R6}\\ \> {\tt LOD -1(R2),R7}\\ {\tt mcopy\_next\_four\_chars:} \>\> {\em ; Here's the top of our copy loop}\\ \> {\tt STO R4,-4(R1)} \> {\em ; Store four values, using a burst memory operation.}\\ \> {\tt STO R5,-3(R1)} \> {\em ; One clock for subsequent stores.}\\ \> {\tt STO R6,-2(R1)} \> {\em ; None of these effect the flags, that were set when}\\ \> {\tt STO R7,-1(R1)} \> {\em ; we last adjusted {\tt R3}}\\ \> {\tt BC preend\_memcpy} \> {\em ; +4 stall cycles, but only when taken}\\ \> {\tt LB -4(R2),R4} \> {\em ; Load the first four values into }\\ \> {\tt LB -3(R2),R5} \> {\em ; registers, using pipelined loads.}\\ \> {\tt LB -2(R2),R6}\\ \> {\tt LB -1(R2),R7}\\ {\tt \_mcopy\_next\_four\_chars:} \>\> {\em ; Here's the top of our copy loop}\\ \> {\tt SB R4,-4(R1)} \> {\em ; Store four values, using a burst memory operation.}\\ \> {\tt SB R5,-3(R1)} \> {\em ; One clock for subsequent stores.}\\ \> {\tt SB R6,-2(R1)} \> {\em ; None of these effect the flags, that were set when}\\ \> {\tt SB R7,-1(R1)} \> {\em ; we last adjusted {\tt R3}}\\ \> {\tt BC \_preend\_memcpy} \> {\em ; +4 stall cycles, but only when taken}\\ \> {\tt ADD 4,R1} \> {\em ; ALU ops don't stall during stores, so}\\ \> {\tt ADD 4,R2} \> {\em ; increment our pointers here.} \\ \> {\tt SUB 4,R3} \> {\em ; Calculate whether or not we have a next round}\\ \> {\tt LOD -4(R2),R4} \> {\em ; Preload the values for the next round}\\ \> {\tt LOD -3(R2),R5}\> {\em ; Notice that these are also pipelined}\\ \> {\tt LOD -2(R2),R6}\> {\em ; loads, as before.}\\ \> {\tt LOD -1(R2),R7}\> {\em ; The four stall cycles, though, are concurrent w/ the branch.}\\ \> {\tt BRA mcopy\_next\_char} \> {\em ; Early branching avoids the full memory pipeline stall} \\ {\tt preend\_memcpy:}\\ \> {\tt LB -4(R2),R4} 
\> {\em ; Preload the values for the next round}\\ \> {\tt LB -3(R2),R5}\> {\em ; Notice that these are also pipelined}\\ \> {\tt LB -2(R2),R6}\> {\em ; loads, as before.}\\ \> {\tt LB -1(R2),R7}\> {\em ; The four stall cycles, though, are concurrent w/ the branch.}\\ \> {\tt BRA \_mcopy\_next\_four\_chars}\hspace{0.25in} {\em ; Early branching avoids the full memory pipeline stall} \\ {\tt \_preend\_memcpy:}\\ \> {\tt ADD 1,R3} \>{\em ; R3 is now the remaining length, rather than one less than it}\\ \> {\tt LOD (SP),R5} \> {\em ; Restore our saved registers, since the remainder of the routine}\\ \> {\tt LOD 1(SP),R6} \> {\em ; doesn't use these registers}\\ \> {\tt LOD 2(SP),R7} \> {\em ;}\\ \> {\tt ADD 3,SP} \>{\em ; Adjust the stack pointer back to what it was}\\ {\tt memcpy\_finish:}\>\>{\em ; At this point, there are$0\leq${\tt R3}$<4$words left}\\ \> {\tt LW (SP),R5} \> {\em ; Restore our saved registers, since the remainder of the routine}\\ \> {\tt LW 4(SP),R6} \> {\em ; doesn't use these registers}\\ \> {\tt LW 8(SP),R7} \> {\em ;}\\ \> {\tt ADD 12,SP} \>{\em ; Adjust the stack pointer back to what it was}\\ {\tt \_memcpy\_finish:}\>\>{\em ; At this point, there are$0\leq${\tt R3}$<4words left}\\ \> {\tt CMP 1,R3} \> {\em ; Check if any ops are remaining }\\ \> {\tt JMP.LT R0} \> {\em ; Return now if nothing is left}\\ \> {\tt LOD (R1),R4} \> {\em ; Load and store the first item}\\ \> {\tt STO R4,(R1)} \> {\em ;}\\ \> {\tt JMP.Z R0} \> {\em ; Return if that was our only value}\\ \> {\tt LOD 1(R1),R4}\>{\em; Load and store the second item (if necessary)} \\ \> {\tt STO R4,1(R1)}\\ \> {\tt RETN.LT} \> {\em ; Return now if nothing is left}\\ \> {\tt LB (R1),R4} \> {\em ; Load and store the first item}\\ \> {\tt SB R4,(R1)} \> {\em ;}\\ \> {\tt RETN.Z} \> {\em ; Return if that was our only value}\\ \> {\tt LB 1(R1),R4}\>{\em; Load and store the second item (if necessary)} \\ \> {\tt SB R4,1(R1)}\\ \> {\tt CMP 2, R3}\\ \> {\tt JMP.LT R0}\\ \> {\tt LOD 
2(R1),R4}\>{\em; Load and store the second item (if necessary)} \\ \> {\tt STO R4,2(R1)}\>{\em; {\tt LOD}, {\tt STO}, {\tt JMP R0} will cost 10 cycles}\\ \> {\tt JMP R0} \> {\em ; Finally, we return}\\ \> {\tt RETN.LT}\\ \> {\tt LB 2(R1),R4}\>{\em; Load and store the second item (if necessary)} \\ \> {\tt SB R4,2(R1)}\>{\em; {\tt LW}, {\tt SW}, {\tt RETN} will cost 10 cycles}\\ \> {\tt RETN} \> {\em ; Finally, we return}\\ \end{tabbing}}} \caption{Example Memory Copy code in Zip Assembly, Hand Optimized}\label{tbl:memcp-opt} \end{center}\end{table} 2820,8 → 2778,8    However, this discussion wouldn't be complete without an example of how this memory operation would be made even simpler using the direct memory access controller. In that case, we can return to C with the code in Tbl.~\ref{tbl:memcp-dmac}. access controller. In that case, we can return to the C language with the code in Tbl.~\ref{tbl:memcp-dmac}. \begin{table}\begin{center} \begin{tabbing} {\tt \#define DMACOPY 0x0fed0000} {\em // Copy memory, largest chunk at a time possible} \\ 2847,8 → 2805,9  \end{tabbing} \caption{Example Memory Copy code using the DMA}\label{tbl:memcp-dmac} \end{center}\end{table} For large memory amounts, the cost of this approach will scale at roughly 2~clocks per word transferred. The DMA, however, will only work with an integer number of 32--bit aligned words. Still, for large memory amounts, the cost of this approach will scale at roughly 2~clocks per word transferred.   Notice how much simpler this memory copy has become to write by using the DMA.  But also consider, the system has only one direct memory access controller.  2863,7 → 2822,7  Tbl.~\ref{tbl:memset-c}. 
\begin{table}\begin{center} \begin{tabbing} \hbox to 0.4in{\tt void} \= {\tt *memset(void *s, int c, size\_t n) \{} \\ \hbox to 0.4in{\tt void} \= {\tt *memset(char *s, int c, size\_t n) \{} \\  \> {\tt for(size\_t i=0; i \hspace{0.4in} {\tt *s++ = c;} \\  \> {\tt return s;}\\ 2879,12 → 2838,12  {\em ; Cost: Roughly4+6Nclocks}\\ {\tt memset:}\\ \hbox to 0.25in{}\=\hbox to 1in{\tt TST R3}\={\em ; Return immediately if len (R3) is zero}\\ \> {\tt JMP.Z R0}\\ \> {\tt RETN.Z}\\ \> {\tt MOV R1,R4} \> {\em ; Keep our return value in R1, use R4 as a local}\\ {\tt memset\_loop:}\>\> {\em ; Here, we know we have at least one more to go}\\ \> {\tt STO R2,(R4)} \> {\em ; Store one value (no pipelining)} \\ \> {\tt SB R2,(R4)} \> {\em ; Store one value (no pipelining)} \\ \> {\tt SUB 1,R3} \> {\em; Subtract during the store}\\ \> {\tt JMP.Z R0} \> {\em; Return (during store) if all done}\\ \> {\tt RETN.Z} \> {\em; Return (during store) if all done}\\ \> {\tt ADD 1,R4} \> {\em; Otherwise increment our pointer}\\ \> {\tt BRA memset\_loop} {\em ; and repeat}\\ \end{tabbing} 2910,35 → 2869,33  \> {\tt JMP.C}\>{\tt memset\_pipe\_tail}\\ \> {\tt SUB}\>{\tt 1,R3}\> {\em ; R3 is now one less than the number to finish}\\ {\tt memset\_pipe\_unrolled:}\>\>\> {\em ; Here, we know we have at least four more to go}\\ \> {\tt STO}\>{\tt R2,(R4)} \> {\em ; Store our four values, pipelining our}\\ \> {\tt STO}\>{\tt R2,1(R4)} \> {\em ; access across the bus }\\ \> {\tt STO}\>{\tt R2,2(R4)} \\ \> {\tt STO}\>{\tt R2,3(R4)} \\ \> {\tt SB}\>{\tt R2,(R4)} \> {\em ; Store our four values, pipelining our}\\ \> {\tt SB}\>{\tt R2,1(R4)} \> {\em ; access across the bus }\\ \> {\tt SB}\>{\tt R2,2(R4)} \\ \> {\tt SB}\>{\tt R2,3(R4)} \\ \> {\tt SUB}\>{\tt 4,R3} \> {\em; If there are zero left, this will be a -1 result}\\ \> {\tt JMP.C}\>{\tt prememset\_pipe\_tail}\> \hbox to 0.5in{}\= {\em; So we can use our LT condition}\\ \> {\tt BC}\>{\tt prememset\_pipe\_tail}\> \hbox to 0.5in{}\= {\em; So we can use 
our LT condition}\\ \> {\tt ADD}\>{\tt 4,R4} \> {\em ; Otherwise increment our pointer}\\ \> {\tt BRA}\>{\tt memset\_pipe\_loop} {\em ; and repeat using an early branchable instruction}\\ \> {\tt BRA}\>{\tt memset\_pipe\_unrolled} {\em ; and repeat using an early branchable instruction}\\ {\tt prememset\_pipe\_tail:} \\ \> {\tt ADD}\>{\tt 1,R3}\>{\em ; Return our counts left to the run number}\\ {\tt memset\_pipe\_tail:}\>\>\>{\em ; At this point, we have R3=0-3 remaining}\\ \> {\tt CMP}\>{\tt 1,R3} \> {\em ; If there's less than one left}\\ \> {\tt JMP.C}\>{\tt R0} \> {\em ; then return early.}\\ \> {\tt STO}\>{\tt R2,(R4)} \> {\em ; If we've got one left, store it}\\ \> {\tt STO.GT}\>{\tt R2,1(R4)} \> {\em ; if two, do a burst store}\\ \> {\tt RETN.C}\> \> {\em ; then return early.}\\ \> {\tt SB}\>{\tt R2,(R4)} \> {\em ; If we've got one left, store it}\\ \> {\tt SB.GT}\>{\tt R2,1(R4)} \> {\em ; if two, do a burst store}\\ \> {\tt CMP}\>{\tt 3,R3} \> {\em ; Check if we have another left}\\ \> {\tt STO.Z}\>{\tt R2,2(R4)} \> {\em ; and store it if so.}\\ \> {\tt JMP}\>{\tt R0} \> {\em ; Return now that we are complete.} \> {\tt SB.Z}\>{\tt R2,2(R4)} \> {\em ; and store it if so.}\\ \> {\tt RETN}\> \> {\em ; Return now that we are complete.} \end{tabbing} \caption{Example Memset after loop unrolling, using pipelined memory ops}\label{tbl:memset-pipe} \end{center}\end{table} Note that, in this example as with the {\tt memcpy} example, our loop variable is one less than the number of operations remaining. This is because the ZipCPU has no less than or equal comparison, but only a less than comparison. Further, because the length is given as an unsigned quantity, we {\em only} have a  less than comparison. By subtracting one from the loop variable, that's all our comparison needs to be--at least, until the end of the loop. For that, we jump to a section one instruction earlier and return our counts value to the true remaining length.  
is one less than the number of operations remaining. This is because the ZipCPU has no less than or equal comparison, but only a less than comparison.  By subtracting one from the loop variable, that's all our comparison needs to be--at least, until the end of the loop. For that, we jump to a section one instruction earlier and return our counts value to the true remaining length.    You may also notice that, despite the four possibilities in the end game, we can carefully rearrange the logic to only use two compares. The first compare 2946,13 → 2903,20  the same compare, though, we can also know if we have one or more stores left. Hence, we can create a burst memory operation with one or two stores.    As one final example, we might also use the DMA for this operation, as with The three examples given so far discuss and demonstrate solutions appropriate for memory accesses that are not necessarily aligned. Were the accesses aligned, the operation could be done about four times faster. To do this, the {\tt LB} and {\tt SB} instructions would need to be replaced by {\tt LW} and {\tt SW} instructions.   Still, if all accesses were able to be aligned, then we might also use the DMA for this operation. Hence, the DMA makes our final example in Tbl.~\ref{tbl:memset-dma}. 
\begin{table}\begin{center} \begin{tabbing}
{\tt \#define DMA\_CONSTSRC 0x20000000} {\em // Don't increment the source address} \\
{\tt void *} \= {\tt memset\_dma(void *s, int c, size\_t len) \{} \\
{\tt int *} \= {\tt memset\_dma(int *s, int c, size\_t len) \{} \\
 \> {\em // As before, this assumes we have access to the DMA, and that}\\
 \> {\em // we are running in system high mode ...}\\
 \> {\tt zip->dma.len = len;} \= {\em // Set up the DMA }\\
2970,6 → 2934,9
 \> {\tt zip->pic = EINT(SYSINT\_DMA);}\\
 \> {\em // And wait for the DMA to complete.} \\
 \> {\tt zip\_wait();}\\
 \> {\em // Return the original source pointer, so as to} \\
 \> {\em // match the library definition.} \\
 \> {\tt return s;}\\
{\tt \}}
\end{tabbing}
\caption{Example Memset code, only this time with the DMA}\label{tbl:memset-dma}
2980,123 → 2947,7
The DMA will still do {\tt len} reads, so the asymptotic performance will never be less than $2N$ clocks per transfer.

\section{String Operations}
Perhaps one of the immediate questions most folks will have is, how does one handle string operations on a CPU that only handles 32--bit numbers? Here we offer a couple of possibilities.

The first possibility is the easy and natural choice: just define characters to be 32--bit numbers and ignore the upper 24 bits. This is the choice made by the compiler. 
Hence, if you compile a simple string compare function, such as Tbl.~\ref{tbl:str-cmp}, \begin{table}\begin{center} \begin{tabbing} \hbox to 0.25in{\tt int} \= {\tt strcmp(const char *s1, const char *s2) \{} \\  \> {\tt while(*s1 == *s2)} \\  \> \hbox to 0.25in{} {\tt s1++, s2++;} \\  \> {\tt return *s2 - *s1;} \\ {\tt \}} \end{tabbing} \caption{Example string compare function}\label{tbl:str-cmp} \end{center}\end{table} string length function, such as Tbl.~\ref{tbl:str-len}, \begin{table}\begin{center} \begin{tabbing} {\tt unsigned} \= {\tt strlen(const char *s) \{} \\  \> {\tt int ln = 0;} \\  \> {\tt while(*s++ != 0)} \\  \> \hbox to 0.25in{} {\tt ln++;} \\  \> {\tt return ln;} \\ {\tt \}} \end{tabbing} \caption{Example string compare function}\label{tbl:str-len} \end{center}\end{table} or string copy function, such as Tbl.~\ref{tbl:str-cpy}, \begin{table}\begin{center} \begin{tabbing} {\tt char *} \= {\tt strcpy(char *dest, const char *src) \{} \\  \> {\tt char *d = dest;} {\em // Make a working copy of the dest ptr}\\  \> {\tt do \{} \\  \> \hbox to 0.25in{} {\tt *d++ = *src;} \\  \> {\tt \} while(*src++);} \\  \> {\tt return dest;} \\ {\tt \}} \end{tabbing} \caption{Example string copy function}\label{tbl:str-cpy} \end{center}\end{table} this is what you will get.   A little work with these functions, and you should be able to optimize them in a fashion similar to that with memcpy. This doesn't solve the fundamental problem, though, of why am I wasting 32--bits for 8--bit quantities?   An alternative would be to use a packed string structure. To pack a string, one might do something like Tbl.~\ref{tbl:pstr}. 
\begin{table}\begin{center} \begin{tabbing} {\tt void} \= {\tt packstr(char *s) \{} \\  \> {\tt char *d = s;} \= {\em // Pack our string in place} \\  \> {\tt int w;}\>{\em // A holding word to pack things into} \\  \> {\tt int k=0;}\>{\em // A count to know when to move to the next word} \\  \> {\tt while(*s) \{} \\  \> \hbox to 0.25in{}\={\tt w = (w<<8)|(*s \& 0x0ff);} \\  \>\> {\em // After four of these octets, write the result out} \\  \> \> {\tt if (((++k)\&3)==0) *d++ = w;} \\  \> {\tt \}} \\  \> {\em // But what happens if we never got to the fourth octet}\\  \> {\em // in our last word? We need to clean that up here.}\\ \\  \> {\em // First, shift the partial value all the way up}\\  \> {\tt w = (w<<(32-((k\&3)<<3));} {\em // Shift up the last word}\\  \> {\tt *d++ = w;} {\em // Store any remaining partial value }\\  \> {\em // If we want to make sure our strings end in zero, we need}\\  \> {\em // one more step:}\\  \> {\tt *d = 0;} {\em // Make sure string ends in a zero.}\\ {\tt \}} \end{tabbing} \caption{String packing function}\label{tbl:pstr} \end{center}\end{table} Notice that our packed string places its first byte in the high order octet of our first word, that any excess octets in the last word are zeros, and that there remains a zero word following our string. With this packed string approach, compares and copies can proceed four times faster. As an example, Tbl.~\ref{tbl:pstr-cmp} \begin{table}\begin{center} \begin{tabbing} \hbox to 0.25in{\tt int} \= {\tt pstrcmp(const char *s1, const char *s2) \{} \\  \> {\tt while(*s1 == *s2)} \\  \> \hbox to 0.25in{} {\tt s1++, s2++;} \\  \> {\tt return *s2 - *s1;} \\ {\tt \}} \end{tabbing} \caption{Packed string compare function}\label{tbl:pstr-cmp} \end{center}\end{table} presents a string compare function for a packed string. You'll notice that it doesn't look all that different from a string compare for a non-packed string. This is on purpose. 
Another example might be a string copy, which again, wouldn't look all that different. Getting the number of used 8--bit octets within a string is a touch more difficult. In that case, one might try something like Tbl.~\ref{tbl:pstr-len}. \begin{table}\begin{center} \begin{tabbing} {\tt unsigned} \= {\tt pstrlen(const char *s) \{} \\  \> {\tt int ln = 0;} \\  \> {\tt while(*s++ != 0)} \\  \> \hbox to 0.25in{}\={\tt ln+=4;} \\  \> {\tt if (ln) \{}\\  \>\> {\em // Touch up the length in case of an incomplete last word} \\  \>\> {\tt int lastval = s[-1];}\\ \\  \>\> {\tt if ((lastval \& 0x0ff)==0) ln--;}\\  \>\> {\tt if ((lastval \& 0x0ffff)==0) ln--;}\\  \>\> {\tt if ((lastval \& 0x0ffffff)==0) ln--;}\\  \> {\tt \}} \\  \> {\tt return ln;} \\ {\tt \}} \end{tabbing} \caption{Packed string subcharacter length function}\label{tbl:pstr-len} \end{center}\end{table}   \section{Context Switch}   Fundamental to any multiprocessing system is the ability to switch from one
3158,40 → 3009,28
 \begin{table}\begin{center} \begin{tabbing} {\tt save\_context:} \\ \hbox to 0.25in{}\={\tt SUB 1,SP}\hbox to 0.5in{}\= {\em ; Function prologue: create a stack}\\ \> {\tt STO R5,(SP)} \> {\em ; frame and save R5. (R1-R4 are assumed}\\ \hbox to 0.25in{}\={\tt SUB 4,SP}\hbox to 0.5in{}\= {\em ; Function prologue: create a stack}\\ \> {\tt SW R5,(SP)} \> {\em ; frame and save R5. (R1-R4 are assumed}\\ \> {\tt MOV uR0,R2} \> {\em ; to be used and in need of saving. Then}\\ \> {\tt MOV uR1,R3} \> {\em ; copy the user registers, four at a time to }\\ \> {\tt MOV uR2,R4} \> {\em ; supervisor registers, where they can be}\\ \> {\tt MOV uR3,R5} \> {\em ; stored, while exploiting memory pipelining}\\ \> {\tt STO R2,(R1)} \>{\em ; Exploit memory pipelining: }\\ \> {\tt STO R3,1(R1)} \>{\em ; All instructions write to same base memory}\\ \> {\tt STO R4,2(R1)} \>{\em ; All offsets increment by one }\\ \> {\tt STO R5,3(R1)} \\ \> {\tt SW R2,(R1)} \>{\em ; Exploit memory pipelining: }\\ \> {\tt SW R3,4(R1)} \>{\em ; All instructions write to same base memory}\\ \> {\tt SW R4,8(R1)} \>{\em ; All offsets increment by one }\\ \> {\tt SW R5,12(R1)} \\ \> \ldots {\em ; Need to repeat for all user registers} \\ \iffalse & {\tt MOV uR5,R0} \\ & {\tt MOV uR6,R1} \\ & {\tt MOV uR7,R2} \\ & {\tt MOV uR8,R3} \\ & {\tt MOV uR9,R4} \\ & {\tt STO R0,5(R5) }\\ & {\tt STO R1,6(R5) }\\ & {\tt STO R2,7(R5) }\\ & {\tt STO R3,8(R5) }\\ & {\tt STO R4,9(R5)} \\ \fi \> {\tt MOV uR12,R2} \> {\em ; Finish copying ... } \\ \> {\tt MOV uSP,R3} \\ \> {\tt MOV uCC,R4} \\ \> {\tt MOV uPC,R5} \\ \> {\tt STO R2,12(R1)} \> {\em ; and saving the last registers.}\\ \> {\tt STO R3,13(R1)} \> {\em ; Note that even the special user registers }\\ \> {\tt STO R4,14(R1)} \> {\em ; are saved just like any others. 
}\\ \> {\tt STO R5,15(R1)} \\ \> {\tt LOD (SP),R5} \> {\em ; Restore our one saved register}\\ \> {\tt ADD 1,SP} \> {\em ; our stack frame,} \\ \> {\tt JMP R0} \> {\em ; and return }\\ \> {\tt SW R2,48(R1)} \> {\em ; and saving the last registers.}\\ \> {\tt SW R3,52(R1)} \> {\em ; Note that even the special user registers }\\ \> {\tt SW R4,56(R1)} \> {\em ; are saved just like any others. }\\ \> {\tt SW R5,60(R1)} \\ \> {\tt LW (SP),R5} \> {\em ; Restore our one saved register}\\ \> {\tt ADD 4,SP} \> {\em ; our stack frame,} \\ \> {\tt RETN} \> {\em ; and return }\\ \end{tabbing} \caption{Example Storing User Task Context}\label{tbl:context-out} \end{center}\end{table}
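The byte offsets used by the {\tt SW} instructions in {\tt save\_context} can be summarized as a C structure. The following is only a sketch: the names {\tt struct zip\_context} and {\tt zip\_context\_offset()} are hypothetical and do not appear in the ZipCPU sources. It assumes that {\tt R1} points to a 16--word save area and that user register $n$ is stored at byte offset $4n$, as the {\tt 48}, {\tt 52}, {\tt 56} and {\tt 60} offsets above suggest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical save area for one user task (not from the ZipCPU sources).
 * Each register occupies one 32-bit word, so offsets step by 4 bytes. */
struct zip_context {
    uint32_t r[13]; /* uR0..uR12, byte offsets 0..48 */
    uint32_t sp;    /* uSP, byte offset 52 */
    uint32_t cc;    /* uCC, byte offset 56 */
    uint32_t pc;    /* uPC, byte offset 60 */
};

/* The whole context is 16 words, or 64 bytes. */
static_assert(sizeof(struct zip_context) == 64, "context must be 16 words");

/* Byte offset of user register number regnum (0..15) within the save area. */
size_t zip_context_offset(unsigned regnum)
{
    return (size_t)regnum * sizeof(uint32_t);
}
```

With this layout, {\tt zip\_context\_offset(13)} yields 52, matching the {\tt SW R3,52(R1)} that saves {\tt uSP}.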
3238,30 → 3077,30
 \begin{table}\begin{center} \begin{tabbing}
{\tt restore\_context:} \\
\hbox to 0.25in{}\= {\tt SUB 1,SP}\hbox to 0.4in{}\={\em ; Set up a stack frame} \\
\> {\tt STO R5,(SP)} \> {\em ; and store a local register onto it.}\\
\hbox to 0.25in{}\= {\tt SUB 4,SP}\hbox to 0.4in{}\={\em ; Set up a stack frame} \\
\> {\tt SW R5,(SP)} \> {\em ; and store a local register onto it.}\\
\\
\> {\tt LOD (R1),R2} \> {\em ; By doing four loads at a time, we are }\\
\> {\tt LOD 1(R1),R3} \> {\em ; making sure we are using our pipelined}\\
\> {\tt LOD 2(R1),R4} \> {\em ; memory capability. }\\
\> {\tt LOD 3(R1),R5} \\
\> {\tt LW (R1),R2} \> {\em ; By doing four loads at a time, we are }\\
\> {\tt LW 4(R1),R3} \> {\em ; making sure we are using our pipelined}\\
\> {\tt LW 8(R1),R4} \> {\em ; memory capability. }\\
\> {\tt LW 12(R1),R5} \\
\> {\tt MOV R2,uR0} \> {\em ; Once the registers are loaded, copy them }\\
\> {\tt MOV R3,uR1} \> {\em ; into the user registers that they need to}\\
\> {\tt MOV R4,uR2} \> {\em ; be placed within.} \\
\> {\tt MOV R5,uR3} \\
 \> \ldots {\em ; Need to repeat for all user registers} \\
\> {\tt LOD 12(R1),R2} \> {\em ; Now for our last four registers ...}\\
\> {\tt LOD 13(R5),R3} \\
\> {\tt LOD 14(R5),R4} \\
\> {\tt LOD 15(R5),R5} \\
\> {\tt LW 48(R1),R2} \> {\em ; Now for our last four registers ...}\\
\> {\tt LW 52(R1),R3} \\
\> {\tt LW 56(R1),R4} \\
\> {\tt LW 60(R1),R5} \\
\> {\tt MOV R2,uR12} \> {\em ; These are the special purpose ones, restored }\\
\> {\tt MOV R3,uSP} \> {\em ; just like any others.}\\
\> {\tt MOV R4,uCC} \\
\> {\tt MOV R5,uPC} \\
\> {\tt LOD (SP),R5} \> {\em ; Restore our saved register, } \\
\> {\tt ADD 1,SP} \> {\em ; and the stack frame, }\\
\> {\tt JMP R0} \> {\em ; and return to where we were called from.}\\
\> {\tt LW (SP),R5} \> {\em ; Restore our saved register, } \\
\> {\tt ADD 4,SP} \> {\em ; and the stack frame, }\\
\> {\tt RETN} \> {\em ; and return to where we were called from.}\\
\end{tabbing} \caption{Example 
Restoring User Task Context}\label{tbl:context-in} \end{center}\end{table}
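The save and restore routines must agree on which slot of the save area holds which user register. A small C model can make that pairing explicit. This is only an illustrative sketch: the function names {\tt model\_save\_context()} and {\tt model\_restore\_context()} are hypothetical, and the user register set is modeled as a plain array of sixteen words, with {\tt uSP}, {\tt uCC} and {\tt uPC} assumed to occupy indices 13 through 15.

```c
#include <stdint.h>

#define ZIP_NUSER_REGS 16 /* uR0..uR12, uSP, uCC, uPC */

/* Model of save_context: copy the user registers, four at a time,
 * into consecutive words of the save area (slot n holds register n). */
void model_save_context(uint32_t *ctx, const uint32_t *user_regs)
{
    for (int i = 0; i < ZIP_NUSER_REGS; i += 4) {
        ctx[i]     = user_regs[i];
        ctx[i + 1] = user_regs[i + 1];
        ctx[i + 2] = user_regs[i + 2];
        ctx[i + 3] = user_regs[i + 3];
    }
}

/* Model of restore_context: the exact inverse, reading slot n
 * back into user register n. */
void model_restore_context(uint32_t *user_regs, const uint32_t *ctx)
{
    for (int i = 0; i < ZIP_NUSER_REGS; i += 4) {
        user_regs[i]     = ctx[i];
        user_regs[i + 1] = ctx[i + 1];
        user_regs[i + 2] = ctx[i + 2];
        user_regs[i + 3] = ctx[i + 3];
    }
}
```

The groups of four mirror the pipelined bursts in the assembly. A save followed by a restore should leave every modeled register unchanged.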
3295,35 → 3134,33
 in Tbl.~\ref{tbl:zpregs}. \begin{table}[htbp] \begin{center}\begin{reglist} PIC & \scalebox{0.8}{\tt 0xc0000000} & 32 & R/W & Primary Interrupt Controller \\\hline WDT & \scalebox{0.8}{\tt 0xc0000001} & 32 & R/W & Watchdog Timer \\\hline WBU&\scalebox{0.8}{\tt 0xc0000002} & 32 & R & Address of last bus timeout error\\\hline CTRIC & \scalebox{0.8}{\tt 0xc0000003} & 32 & R/W & Secondary Interrupt Controller \\\hline TMRA & \scalebox{0.8}{\tt 0xc0000004} & 32 & R/W & Timer A\\\hline TMRB & \scalebox{0.8}{\tt 0xc0000005} & 32 & R/W & Timer B\\\hline TMRC & \scalebox{0.8}{\tt 0xc0000006} & 32 & R/W & Timer C\\\hline JIFF & \scalebox{0.8}{\tt 0xc0000007} & 32 & R/W & Jiffies \\\hline MTASK & \scalebox{0.8}{\tt 0xc0000008} & 32 & R/W & Master Task Clock Counter \\\hline MMSTL & \scalebox{0.8}{\tt 0xc0000009} & 32 & R/W & Master Stall Counter \\\hline MPSTL & \scalebox{0.8}{\tt 0xc000000a} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline MICNT & \scalebox{0.8}{\tt 0xc000000b} & 32 & R/W & Master Instruction Counter\\\hline UTASK & \scalebox{0.8}{\tt 0xc000000c} & 32 & R/W & User Task Clock Counter \\\hline UMSTL & \scalebox{0.8}{\tt 0xc000000d} & 32 & R/W & User Stall Counter \\\hline UPSTL & \scalebox{0.8}{\tt 0xc000000e} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline UICNT & \scalebox{0.8}{\tt 0xc000000f} & 32 & R/W & User Instruction Counter\\\hline DMACTRL & \scalebox{0.8}{\tt 0xc0000010} & 32 & R/W & DMA Control Register\\\hline DMALEN & \scalebox{0.8}{\tt 0xc0000011} & 32 & R/W & DMA total transfer length\\\hline DMASRC & \scalebox{0.8}{\tt 0xc0000012} & 32 & R/W & DMA source address\\\hline DMADST & \scalebox{0.8}{\tt 0xc0000013} & 32 & R/W & DMA destination address\\\hline % Cache & \scalebox{0.8}{\tt 0xc0100000} & & & Base address of the Cache memory\\\hline PIC & \scalebox{0.8}{\tt 0xff000000} & 32 & R/W & Primary Interrupt Controller \\\hline WDT & \scalebox{0.8}{\tt 0xff000004} & 32 & R/W & Watchdog Timer \\\hline WBU &\scalebox{0.8}{\tt 0xff000008} 
& 32 & R & Address of last bus timeout error\\\hline CTRIC & \scalebox{0.8}{\tt 0xff00000c} & 32 & R/W & Secondary Interrupt Controller \\\hline TMRA & \scalebox{0.8}{\tt 0xff000010} & 32 & R/W & Timer A\\\hline TMRB & \scalebox{0.8}{\tt 0xff000014} & 32 & R/W & Timer B\\\hline TMRC & \scalebox{0.8}{\tt 0xff000018} & 32 & R/W & Timer C\\\hline JIFF & \scalebox{0.8}{\tt 0xff00001c} & 32 & R/W & Jiffies \\\hline MTASK & \scalebox{0.8}{\tt 0xff000020} & 32 & R/W & Master Task Clock Counter \\\hline MMSTL & \scalebox{0.8}{\tt 0xff000024} & 32 & R/W & Master Stall Counter \\\hline MPSTL & \scalebox{0.8}{\tt 0xff000028} & 32 & R/W & Master Pre--Fetch Stall Counter \\\hline MICNT & \scalebox{0.8}{\tt 0xff00002c} & 32 & R/W & Master Instruction Counter\\\hline UTASK & \scalebox{0.8}{\tt 0xff000030} & 32 & R/W & User Task Clock Counter \\\hline UMSTL & \scalebox{0.8}{\tt 0xff000034} & 32 & R/W & User Stall Counter \\\hline UPSTL & \scalebox{0.8}{\tt 0xff000038} & 32 & R/W & User Pre--Fetch Stall Counter \\\hline UICNT & \scalebox{0.8}{\tt 0xff00003c} & 32 & R/W & User Instruction Counter\\\hline DMACTRL& \scalebox{0.8}{\tt 0xff000040} & 32 & R/W & DMA Control Register\\\hline DMALEN & \scalebox{0.8}{\tt 0xff000044} & 32 & R/W & DMA total transfer length\\\hline DMASRC & \scalebox{0.8}{\tt 0xff000048} & 32 & R/W & DMA source address\\\hline DMADST & \scalebox{0.8}{\tt 0xff00004c} & 32 & R/W & DMA destination address\\\hline \end{reglist} \caption{ZipSystem Internal/Peripheral Registers}\label{tbl:zpregs} \end{center}\end{table} These registers are located in the CPU's address space, although in a special area of that space. Indeed, the area is so special, that the CPU decodes the address space location before placing the request onto the bus. For this reason, other containers for the CPU, such as the ZipBones which doesn't have these registers, will still create errors when they are referenced. These registers are all 32-bit registers. 
Writes of less than 32--bits may have unexpected results. Further, they are located in a reserved location within the CPU's address space. As a result, references to these locations by a ZipBones based system will generate a bus error.   Here in this section, we'll walk through the definition of each of these registers in turn, together with any bit fields that may be associated with
3528,7 → 3365,7
 \begin{table}[htbp] \begin{center}\begin{reglist} ZIPCTRL & 0 & 32 & R/W & Debug Control Register \\\hline ZIPDATA & 1 & 32 & R/W & Debug Data Register \\\hline ZIPDATA & 4 & 32 & R/W & Debug Data Register \\\hline \end{reglist} \caption{ZipSystem Debug Registers}\label{tbl:dbgregs} \end{center}\end{table}
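The change of {\tt ZIPDATA} from address 1 to address 4 follows the same rule as the peripheral addresses in Tbl.~\ref{tbl:zpregs}: with 8--bit bytes, consecutive 32--bit registers are four addresses apart. The sketch below shows that conversion in C. The macro and function names are hypothetical, chosen for illustration only. The base address {\tt 0xff000000} is taken from Tbl.~\ref{tbl:zpregs}.

```c
#include <stdint.h>

/* Base of the ZipSystem peripheral registers, per Tbl. zpregs. */
#define ZIPSYS_BASE 0xff000000u

/* Hypothetical names for the two debug register byte offsets. */
enum { ZIPCTRL_OFFSET = 0, ZIPDATA_OFFSET = 4 };

/* Convert a register's word index (its position in the table, counting
 * from zero) into a byte offset: each 32-bit register spans four bytes. */
static inline uint32_t zip_word_to_byte(uint32_t word_index)
{
    return 4u * word_index;
}

/* Byte address of the ZipSystem peripheral register at word_index. */
static inline uint32_t zipsys_reg_addr(uint32_t word_index)
{
    return ZIPSYS_BASE + zip_word_to_byte(word_index);
}
```

For example, Timer A is the fifth register (word index 4), so {\tt zipsys\_reg\_addr(4)} gives {\tt 0xff000010}.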
3654,12 → 3491,12
 \begin{center} \begin{wishboneds} Revision level of wishbone & WB B4 spec \\\hline Type of interface & Master, Read/Write, single cycle or pipelined\\\hline Address Width & (ZipSystem parameter, can be up to 32--bit bits) \\\hline Type of interface & Master, Read/Write, pipelined\\\hline Address Width & (ZipSystem parameter, up to 30~bits) \\\hline Port size & 32--bit \\\hline Port granularity & 32--bit \\\hline Port granularity & 8--bit \\\hline Maximum Operand Size & 32--bit \\\hline Data transfer ordering & (Irrelevant) \\\hline Data transfer ordering & Big--Endian \\\hline Clock constraints & Works at 100~MHz on a Basys--3 board, and 80~MHz on a  XuLA2--LX25\\\hline Signal Names & \begin{tabular}{ll}
3670,6 → 3507,7
  {\tt o\_wb\_we} & {\tt WE\_O} \\  {\tt o\_wb\_addr} & {\tt ADR\_O} \\  {\tt o\_wb\_data} & {\tt DAT\_O} \\  {\tt o\_wb\_sel} & {\tt SEL\_O} \\  {\tt i\_wb\_ack} & {\tt ACK\_I} \\  {\tt i\_wb\_stall} & {\tt STALL\_I} \\  {\tt i\_wb\_data} & {\tt DAT\_I} \\
3740,8 → 3578,9
 {\tt o\_wb\_cyc} & 1 & Output & Indicates an active Wishbone cycle\\\hline {\tt o\_wb\_stb} & 1 & Output & WB Strobe signal\\\hline {\tt o\_wb\_we} & 1 & Output & Write enable\\\hline {\tt o\_wb\_addr} & 32 & Output & Bus address \\\hline {\tt o\_wb\_addr} & 30 & Output & Bus address \\\hline {\tt o\_wb\_data} & 32 & Output & Data on WB write\\\hline {\tt o\_wb\_sel} & 4 & Output & Select lines\\\hline {\tt i\_wb\_ack} & 1 & Input & Slave has completed a R/W cycle\\\hline {\tt i\_wb\_stall} & 1 & Input & WB bus slave not ready\\\hline {\tt i\_wb\_data} & 32 & Input & Incoming bus data\\\hline
3818,13 → 3657,17
  just barely.   \item The ZipCPU was designed to be an implementable soft core that could be  placed within an FPGA, controlling actions internal to the FPGA. It  fits this role rather nicely. It does not fit the role of a general  purpose CPU replacement very well: it has no octet level access,  no double--precision floating point capability, neither does it have  vector registers and operations. However, it was never designed to be  such a general purpose CPU but rather a system within a chip.   placed within an FPGA, controlling actions internal to the FPGA. This  version of the CPU in particular has been updated so that it would  support a more general purpose CPU, since as of version~2.0 the ZipCPU  now supports octet level access across the bus.     Still, it fits this role rather nicely. Other capabilities common  to more general purpose CPUs, such as   double--precision floating point capability, vector registers and  vector operations have been left out. However, it was never designed  to be such a general purpose CPU but rather a system within a chip.    \item The extremely simplified instruction set of the ZipCPU was a good  choice. Although it does not have many of the commonly used  instructions, PUSH, POP, JSR, and RET among them, the simplified
3833,9 → 3676,6
  offers a full and complete capability for whatever a user might wish  to do with two exceptions: bytewise character access and accelerated  floating-point support. \item This simplified instruction set is easy to decode. \item The simplified bus transactions (32-bit words only) were also very easy  to implement. \item The burst load/store approach using the wishbone pipelining mode is  novel, and can be used to greatly increase the speed of the processor. \item The novel approach to interrupts greatly facilitates the development of
3861,24 → 3701,12
  cycle per access again.

\item Both GCC and binutils back ends exist for the ZipCPU.
\item As of this version of the CPU, a newlib version of the C--library  now exists.
\end{itemize}

\section{The Not so Good}
\begin{itemize}
\item The CPU has no octet (character) support. This is both good and bad.  Realistically, the CPU works just fine without it. Characters can be  supported as subsets of 32-bit words without any problem. Practically,  though, this creates two problems. The first is that it makes porting  code from non-ZipCPU platforms to the ZipCPU difficult--especially  anything that depends upon the existence of {\tt *int8\_t},   {\tt *int16\_t}, the size difference between  {\tt sizeof(int)=4*sizeof(char)}, or that tries to  create unions with characters and integers and then attempts to  reference the address of the characters within that union.

  The second problem is that peripherals that depend upon character  support on the bus may need to be rewritten to work on a 32--bit bus.

\item The ZipCPU does not (yet) support a data cache. One is currently under  development. 
3912,15 → 3740,15
  context swap.   \item The ZipCPU is by no means generic: it will never handle addresses  larger than 32-bits (4GW or 16GB) without a complete and total redesign.  larger than 32-bits (4GB) without a complete and total redesign.  This may limit its utility as a generic CPU in the future, although  as an embedded CPU within an FPGA this isn't really much of a  restriction.   \item While a toolchain does exist for the ZipCPU, it isn't yet fully featured.  The ZipCPU has no support for soft floating point arithmetic,  nor does it have support for several standard library functions.  Indeed, full C library support and gdb support are still lacking.  The ZipCPU does not yet have any support for soft floating point  arithmetic, nor does it have gdb support. These may be provided  in future versions. \end{itemize}   \section{The Next Generation}
3934,13 → 3762,6
  porting math software to the ZipCPU difficult. Simply building a   soft floating point library will solve this problem.   \item A C library.    The lack of octet support has so far prevented the porting of   newlib to the ZipCPU platform. In the end, it may mean that any  C library implementation for the ZipCPU may be subtly different  from any you are familiar with.   \item A data cache    A preliminary data cache implemented as a write through cache has
3953,9 → 3774,9
  The first version of such an MMU has already been written. It is   available for examination in the ZipCPU repository. This MMU exists  as a peripheral of the ZipCPU. Integrating this MMU into the ZipCPU  will involve slowing down memory stores so that they can be accomplished  synchronously, as well as determining how and when particular cache  lines need to be invalidated.  will involve slowing down memory stores so that they can be  accomplished synchronously, as well as determining how and when  particular cache lines need to be invalidated.   \item An integrated floating point unit (FPU) 
3989,14 → 3810,14
 % - ADD.C x,PC // Any PC relative conditional jump (20 bits) % % - LDIHI Addr,Rx // Load from any 32-bit address, clobbers Rx, % LOD Addr(Rx),Rx // unconditional, requires second instruction % LW Addr(Rx),Rx // unconditional, requires second instruction % % - LOD.C Addr(Ry),Rx // Any 16-bit relative address load, poss. cond % - LW.C Addr(Ry),Rx // Any 16-bit relative address load, poss. cond % % - STO.C Rx,Addr(Ry) // Any 16-bit rel addr, Rx and Ry must be valid % - SW.C Rx,Addr(Ry) // Any 16-bit rel addr, Rx and Ry must be valid % % - FARJMP #Addr: // Arbitrary 32-bit jumps require a jump table % BRA +1 // memory address. The BRA +1 can be skipped, % .WORD Addr // but only if the address is placed at the end % LOD -2(PC),PC // of an executable section % LW -2(PC),PC // of an executable section %